Sponsored Links


Resources

.NET Research Library
Get .NET related white papers, case studies and webcasts

Five Considerations for Large Scale SystemsFive Considerations for Large Scale SystemsFive Considerations for Large Scale Systems Discuss DiscussDiscuss Printer friendly Printer friendlyPrinter friendly
Five Considerations for Large Scale Systems

January 13, 2004

Introduction

With the growth of the Internet, and of connected networks in general, the development and deployment of large scale systems has become increasingly common. A large scale system is one that supports multiple, simultaneous users who access the core functionality through some kind of network. In this sense, they are very different from the historically typical application, generally deployed on CD, where the entire application runs on the target computer.

The benefits of using a three-tier architecture for large scale system design are generally well known. A three-tier system is separated, both logically and physically, into:

  • Client tier, responsible for interacting with the user via a Graphical User Interface (GUI) and submitting requests via the network to the mid tier.
  • Mid tier, responsible for gathering requests from clients and executing transactions against the data tier.
  • Data tier, responsible for physical storage and manipulation of the information represented by application queries and the responses to those queries.

A diagram of such architecture is shown in Figure 1. There are many popular choices for implementation at each tier. One particularly common choice for the client tier is the use of a browser, such as Microsoft Internet Explorer or Netscape Navigator. In this case, the mid tier is responsible for producing HTML. If an alternate client- tier technology is chosen, some other format may be used. The Extensible Markup Language (XML) is often employed, since support tools are widely available on many platforms. The data tier is almost always some sort of relational database engine, such as the Oracle RDBMS or Microsoft SQL Server.

Figure 1 - Typical Three Tier Architecture

The benefits of a three tier architecture include:

  • Centralization of business logic for ease of maintenance.
  • Separation of user interface logic from data access logic.
  • The ability to spread work over several machines (load balancing).
  • When the client tier is a browser, an independence from the platform is used to execute user interface logic, allowing a broader reach for the application.

Given that more and more projects are developing large scale systems, and given that this architectural framework is so widely employed, why do many large scale development efforts fail? The answer, of course, is that an architecture is not enough– understanding requirements, employing appropriate technology, hiring skilled developers, and many other factors also contribute to the success or failure of a particular effort.

No single document could hope to address all pertinent factors in such a way to ensure success for those who follow its advice. However, there are still a broad set of guidelines that, when kept in mind by system designers throughout the product cycle, can greatly increase the chances of a project meeting its design goals.

This whitepaper will present just such a set of guidelines. We will try to avoid focusing on any particular technology, instead presenting a broad outline that should apply to the technologies of any vendor.

Aspects of Successful Large Scale System Design

While nothing can guarantee the success of any project, there are five factors that, when kept in mind throughout design and implementation, can help system architects ensure that they haven’t overlooked something important. These five are:

  • Scalability
  • Availability
  • Manageability
  • Security
  • Development practices

Let's examine each of these in detail.

Scalability

A system is said to be scalable if it can handle an increased load without redesign. The lion’s share of the cost of system development is usually labor, so being able to adjust to increasing load without having to rewrite every time ten new users are added is a crucial feature.

Scalable systems grow to adjust to increasing loads by adding hardware. In the mid tier, for example, we make use of load balancing technologies, such as Cisco Local Director or Big IP Controller. A load balancer provides scalability by distributing requests across several machines, so each handles only a percentage of the load. This is shown in Figure 2.

Figure 2 - Load Balancing

Load balancing must be taken into consideration during design in order to achieve scalability. For example, mid-tier logic that stores information in memory on a particular machine across requests cannot be efficiently load balanced, since subsequent requests must return to the same machine to access that information. This situation is shown in Figure 1, and while the ability to direct a particular client to the same machine request after request is generally supported by commercially available load balancing projects, designs that require this behavior should be avoided.

Figure 3 - Sticky Sessions

To understand why, consider a trip to your neighborhood grocery store. At checkout time, you generally choose the lane with the line you feel will take the shortest amount of time. In the same manner, a good load balancer will direct an incoming request to the least busy available machine.

If, however, your grocery store remembers what lane you checked out in last time, and requires that you again use that lane every time you visited the store, you would quickly take your business elsewhere. After all, what if the line in your assigned lane is the longest in the store but other lanes are completely empty? And what if it is closed entirely? This is clearly less efficient, yet this is exactly analogous to what happens in designs that require all requests from a particular client be directed to the same physical machine.

Scalability in the data tier is achieved in an entirely different manner. Because the whole point of the data tier is to manage information, splitting all requests between more than one machine doesn’t help us at all. If we did that, we’d have to maintain all of the data at each machine, which–while possible–leads to enormous complications in trying to keep each copy in sync with the other.

Large scale designs should attempt to make use of partitioning to achieve scalability in the data tier. Partitioning is putting a subset of the data on each of several machines, according to some set of rules, and then using those rules to route the mid-tier machines to the appropriate data store. This is shown in Figure 1.

Figure 4 - Partioning

Although sometimes partitioning is straightforward, for example with sales data that can easily be divided by region, there are cases where effective partitioning can be very difficult. Although some database vendors (Oracle and Microsoft, for example) are beginning to provide support for automatic partitioning, this can still be a challenging area to address in system design.

Availability

Large scale systems often need to be highly available. Availability is the ability of a system to be operational a large percentage of the time, the extreme being so-called “24/7/365” systems. The largest challenge to availability is surviving system instabilities, whether from hardware or software failures.

In the mid-tier, building highly available systems starts with using high-quality hardware with features such as RAID drive arrays and redundant power supplies in a data center with adequate protection from power outages. However, even with such hardware, systems can still fail. Fortunately, if the system has been designed to be load balanced, and more than one machine is handling the load, it can survive the failure of any individual machine. This situation is shown in Figure 2. Of course, this assumes that the load balancing product can detect the failure of an individual machine and reroute appropriately. Many commercial load balancers have this capability.

Figure 5 - High Mid-tier Availability

Availability in the data tier is achieved through a different technology: clustering. Clustering is the use of more than one machine to provide access to the same data store. Often this means that two machines are connected to the same physical drive array, with one of the machines on “standby.” If something should happen to the active machine, the second machine comes online and begins serving the mid-tier requests in place of the first machine. These two machines are collectively referred to as a cluster, although the term could apply to groups containing more than two machines if higher degrees of availability are required.

Clustering can be combined with partitioning to provide a data tier that is both highly available and highly scalable. Note, however, that each individual partition must be separately clustered. This is shown in Figure 1.

Figure 6 - Partitioned and Clustered Data Tier

Manageability

The key to high availability is redundancy. If you have more than one of everything, you’re not susceptible to any single points of failure. However, although redundancy means that the system as a whole is up when you need it to be up, individual components are still bound to fail; that fact is the whole reason we use technologies like clustering in the first place.

It is important to know when this happens. After all, in a two node cluster, when the first node fails, we’ve lost our backup, and there is now a single point of failure, jeopardizing the high availability characteristics around which we so carefully crafted our system. For this reason, manageability is an important aspect of successful large scale system design.

Manageability is twofold. It includes both the ability to detect faults (or the absence of faults) and the ability to automatically react to them. These abilities come from instrumentation and enterprise monitoring, respectively.

Instrumentation is the feature of system components that lets us determine their health at any given point in time. The simplest form of software instrumentation is code that simply logs a message when a fault is detected. While this capability is necessary as an absolute minimum, system designers should strive for a more complete solution. If a system is unresponsive, for example, but no faults have been recorded, is it merely idle? Is it hung? Is the logging mechanism itself faulty?

To be able to answer these important questions, complete instrumentation is necessary. This should include not only fault notification, but notification of proper behavior and possibly even “heartbeat” capability to determine that a component is functioning but idle. Without instrumentation, systems operators are flying blind in the face of alleged or actual systems failures.

Instrumentation, even complete instrumentation, is only half the manageability story, however. Without enterprise monitoring tools, instrumentation only helps if someone happens to be looking at the instrumentation output when a fault occurs. If the output is not constantly monitored, corrective action can only be taken after something more serious occurs, like the failure of a second node in a clustered system. Obviously, we want to know as soon as possible when a failure occurs.

Enterprise monitoring tools, then, are a necessity, unless system designers wish to include in their project plans a budget for an army of low-paid, caffeinated interns dedicated to the task of watching instrument output for failure conditions. Enterprise monitoring tools,, such as Tivoli Management Solutions, automate this task, watching systems for health and fault tolerance. Most can be configured to automatically notify administrators in response to specific events.

In the best of all worlds, the enterprise monitoring tool should directly integrate with the system’s instrumentation product. This allows for maximum flexibility and speed in responding to faulty events. And given that it can take even well-trained and responsive system support personnel fifteen or more minutes to identify and correct a problem, once noticed, automation is an important facet of maintaining a highly available, scalable system.

Security

Security is an important aspect of system design, and all the more so for distributed systems, since they are often open to attack from agents at any of millions of worldwide locations. Therefore, system designers need to carefully consider what mechanisms they use for authentication and authorization.

Authentication is the process of having users of the system identify themselves to the system. This typically takes the form of the user entering a username and password into a form or dialog box. Once it has been determined that an appropriate password has been provided for the given username, the system must then use that information to determine whether or not that user should be allowed to perform the set of operations that they are requesting. This is what is meant by authorization, sometimes also called access control.

There is a common myth among systems designers that the most secure authentication and authorization mechanisms are those whose algorithms are secret, perhaps even created specifically for this project. This is known as “security through obscurity,” and it should be avoided. The best way to defeat a determined attacker is to use one of the many publicly available algorithms, like Kerberos, TripleDES, MD5, and SHA-1, that can withstand a determined attack, even when the details of the algorithm are fully known.

However, simply choosing publicly available algorithms does not make your application secure. Aside from the many considerations of actually using these algorithms in a secure manner, itself no small task, there are the human factors to consider. Bruce Schneier, in his excellent book Secrets and Lies, states the three necessary ingredients for a secure system: prevention, detection, and reaction.

Prevention is the one where people tend to get caught up. Prevention technologies are about denying an attacker any access to the system at all. This is where authentication and access control come in. But it's a losing game, given enough time and resources, a determined attacker can always defeat your prevention mechanism. Thinking otherwise is setting you up to get taken advantage of in a big way. But it's not hopeless if you have meaningful detection mechanisms in place.

Detection means building your application in such a way that you can tell when it is being attacked. It goes with manageability, as instrumentation can go a long way towards assessing whether anomalous behavior is the result of a glitch in the system or an intruder. Detection is a crucial addition to prevention, having one does not enable you to omit the other, but still it isn't the end of the story. Reaction is still required.

Reaction is the ability of your system, and here I mean the whole system, to include the administrators and support staff and to take appropriate measures when an attack is detected. In the context of writing code, this might mean putting switches in the program to enable you to quickly disable it while returning a meaningful message to the users. Or it might mean isolating the attacker so they can be observed without their knowledge, enabling you to gain clues to their identity or locate the way they were able to circumvent your protection mechanisms.

Think of these three components in the context of a safe: you wouldn't build one without a lock (prevention), but a thief could still cut through it given enough time. That's why you have burglar alarms (detection). But burglar alarms only do you any good if they result in the police arriving (reaction). All three pieces are necessary.

Development Practices

Properly speaking, this last ingredient is not an aspect of the system itself, but rather an element of the process used to develop the system. Still, employing proper development practices is a fundamental that must be kept in mind during planning, as much as scalability, availability, manageability, and security.

The exact details of what constitutes proper development practice have long been debated, and so much has been written on the topic that there is no need (nor space) to revisit the complete subject. Here is a (very) incomplete list of considerations:

  • Use of source control is essential.
  • Adequate documentation must be generated at every phase of development.
  • System developers must be sufficiently trained on the technologies that they are employing.
  • Specific requirements for availability, scalability, and performance must be captured.

Suffice it to say that systems developers are unlikely to succeed unless they take a structured, methodical approach to system construction, using personnel experienced in the technologies that they are employing.

Conclusion

Successfully designing a large scale system is difficult, despite the promises to the contrary by development tool vendors. Building software that will execute correctly and consistently in a distributed environment where hundreds or even millions of requests need to be serviced on a daily basis is no small task. While there is nothing that will guarantee success, designers that remember to design for scalability, availability, manageability, security, and to use proper development practices will greatly decrease the chance of project failure.

Authors

Craig Andera has a Masters degrees in Electrical Engineering from MIT. After a brief stint in the electronic entertainment industry, Craig built his COM knowledge through a combination of DevelopMentor classes (then as a student) and his troubleshooting efforts in the finance industry. Craig is currently focused on the .NET platform, which he has been exploring full-time since February 2001. He has a particular interest in the design and implementation of secure, scalable systems. Craig is currently a consultant at Wangdera Corporation.

News | Blogs | Discussions | Tech talks | White Papers | Downloads | Articles | Media kit | About
All Content Copyright ©2007 TheServerSide Privacy Policy
Site Map