Thursday, December 9, 2010

The Coming Data Center Singularity: How Fabric Computing Must Evolve


Summary:
The next generation in data center structure will be fabric-based computing, but the fabric will be two full steps beyond today’s primitive versions. First, the fabric will include network switching and protection capabilities embedded within. Second, the fabric will incorporate full energy management capabilities: electric power in, and heat out.
Hypothesis:
Ray Kurzweil describes the Singularity as the moment when the ongoing growth of information and related technologies produces change of such magnitude that it overwhelms traditional human mental and physical capacity. Moore’s law, together with its analogues in storage and networking, predicts an ongoing doubling of available computing power, data storage, and network bandwidth at constant cost. At some point the sheer volume of information will outstrip our capacity to comprehend it. In Kurzweil’s utopian vision, humanity then transcends biology and enters a new mode of being (one that resonates with Pierre Teilhard de Chardin’s Noosphere).
Data centers will face a similar disruption, but rather sooner than Kurzweil’s 2029 prediction: within the next ten years, data centers will be overwhelmed. Current design principles rely on distinct cabling systems for power and for information. As processors, storage, and networks all increase capacity exponentially at constant cost, the demands for power and connectivity will create a rat’s nest of cabling, compounded by ever-growing requirements for heat dissipation.
There will be occasional reductions in power consumption and physical cable density, but these will not avoid the ultimate catastrophe, only defer it for a year or two. Intel’s Nehalem chip technology, for instance, is both denser and less power-hungry than its predecessor, but such improvements are infrequent. The overall trend is toward more connections, more electricity, more heat, and less space. These trends proceed exponentially, not linearly, and seemingly all at once our data center capacity will run out.
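To make the compounding concrete, here is a minimal Python sketch of what steady doubling does over a ten-year horizon; the 24-month doubling period is an assumption for illustration, not a measurement of any particular technology.

    # A minimal sketch of the compounding described above. The 24-month
    # doubling period and the ten-year horizon are illustrative assumptions.
    DOUBLING_PERIOD_MONTHS = 24

    def growth_factor(months, period=DOUBLING_PERIOD_MONTHS):
        """Multiplier after `months` of steady doubling every `period` months."""
        return 2 ** (months / period)

    for years in (2, 4, 6, 8, 10):
        print(f"after {years:2d} years: ~{growth_factor(years * 12):5.1f}x "
              "the compute, storage, and bandwidth per unit cost")

Five doublings in ten years means roughly thirty-two times the capacity, and with it the connections, power, and heat, in the same floor space.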
Steady investment in incremental improvements to data center design will be overrun by this deluge of information, connectivity, and power density. Organizations will freeze in place as escalating volumes of data overwhelm traditional configurations of storage, processors, and network connections.
The only apparent solution to this singularity is a radical rethink of data center design. Since power and network cabling are symptoms of the problem, a physical layout that eliminates these complexities would defer, if not bypass, the problem entirely. By embedding connectivity, power delivery, and heat removal (collectively, energy management) in the framework itself, vendors will deliver increasingly massive compute capability in horizontally extensible units, whether blades, racks, or containers.
Conclusion:
The next generation in data center structure will be fabric-based computing, but the fabric will be two full steps beyond today’s primitive versions. First, the fabric will include network switching and protection capabilities embedded within. Second, the fabric will incorporate full energy management capabilities: electric power in, and heat out. 

Wednesday, December 8, 2010

The Software Product Lifecycle

Traditional software development methodologies end when the product is turned over to operations. Once over that wall, the product is treated as a ‘black box’: Deployed according to a release schedule, instrumented and measured as an undifferentiated lump of code.
This radical transition from the development team’s tight focus on functional internals to the production team’s attention to operational externals can disrupt the delivery of the service the product is intended to provide. The separation between development and production establishes a clear, secure, auditable boundary around the organization’s set of IT services, but it also discourages the flow of information between those organizations.
The ITIL Release Management process can improve the handoff between development and production. To understand how, let’s examine the two processes and their interfaces in more detail.
Software development proceeds from a concept to a specification, then by a variety of routes to a body of code. While there are significant differences between Waterfall, RAD, RUP, Agile and its variants (Scrum, Extreme Programming, and so on), the end result is a set of modules that need to supplement or replace existing modules in production. Testing of various kinds assesses the fitness for duty of those modules. During the 1980s I ran the development process for the MVS operating system (predecessor of z/OS) at IBM in Poughkeepsie and participated in the enhancement of that process to drive quality higher and defect rates lower. The approach to code quality improvement echoes other well-defined quality improvement processes, especially Philip Crosby’s “Quality Is Free.” Each type of test assesses code quality along a different dimension.
Unit test verifies the correctness of code sequences; it involves running individual code segments in isolation with a limited set of inputs, and is usually done by the developer who wrote the code.
Function/component test verifies the correctness of a bounded set of modules against a comprehensive set of test cases, designed jointly with the code, that exercise both “edge conditions” and expected normal processing sequences. This step validates the correctness of the algorithms encoded in the modules: for instance, the calculated withholding tax for various wage levels should be arithmetically correct and conform to the relevant tax laws and regulations.
Function/component test relies on a set of test cases, which are programs that invoke the functions or components with pre-defined sets of input to validate expected module behavior, side effects, and outputs. As a general rule, for each program step delivered the development organization should define three or four equivalent program steps of test case code. These test cases should be part of the eventual release package, for use in subsequent test phases – including tests by the release, production, and maintenance teams.
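To illustrate the kind of test case code this implies, here is a minimal sketch in Python’s unittest style; the withholding function, its brackets, and its rates are hypothetical, invented purely for illustration rather than drawn from any tax table.

    import unittest

    def withholding(wage):
        """Hypothetical withholding calculation: 10% up to 1,000, 20% above.
        The brackets and rates are invented for illustration only."""
        if wage < 0:
            raise ValueError("wage must be non-negative")
        if wage <= 1000:
            return round(wage * 0.10, 2)
        return round(100 + (wage - 1000) * 0.20, 2)

    class WithholdingTest(unittest.TestCase):
        # Expected normal processing sequences.
        def test_lower_bracket(self):
            self.assertEqual(withholding(500), 50.00)

        def test_upper_bracket(self):
            self.assertEqual(withholding(1500), 200.00)

        # "Edge conditions": bracket boundary, zero wage, invalid input.
        def test_bracket_boundary(self):
            self.assertEqual(withholding(1000), 100.00)

        def test_zero_wage(self):
            self.assertEqual(withholding(0), 0.00)

        def test_negative_wage_rejected(self):
            with self.assertRaises(ValueError):
                withholding(-1)

    if __name__ == "__main__":
        unittest.main()

Note how quickly the test case code grows relative to the dozen or so lines of function code, which is exactly the three-or-four-to-one ratio described above.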
(Note that we avoid the notion of complete “code coverage,” as it is illusory: it is not possible to cover all paths in any reasonably complex group of modules. For example, consider a simple program with five conditional branches and no loops. Complete path coverage would require 2^5 = 32 sets of input, while minimal branch coverage would require exercising all ten branch outcomes. Moderately complex programs have a conditional branch roughly every seven program steps, and a typical module may have hundreds of such steps: path coverage of one module with about 200 lines of executable code would require approximately 2^28 variations, or over 250 million separate test cases.
For a full discussion of the complexity of code verification, see “The Art of Software Testing” by Glenford J. Myers. This book, originally published in 1979 and now available in a second edition with additional commentary on Internet testing, is the strongest work on the subject.)
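The arithmetic behind that note can be checked in a few lines of Python, using the same assumed figures (five branches, one branch per seven statements, a 200-statement module).

    # Path counts grow exponentially with the number of independent branches.
    branches_small = 5
    print("paths through 5 independent branches:", 2 ** branches_small)   # 32
    print("branch outcomes to exercise at minimum:", 2 * branches_small)  # 10

    # A ~200-statement module with a branch roughly every 7 statements:
    statements, statements_per_branch = 200, 7
    branches = statements // statements_per_branch                        # 28
    print("independent paths:", 2 ** branches)                            # 268,435,456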
System test validates operational characteristics of the entire environment, but not the correctness of the algorithms. System test consists of a range of specific tests of the whole system, with all new modules incorporated on a stable base. These are:
Performance test: is the system able to attain the level of performance and response time the service requires? This test involves creating a production-like environment and running simulated levels of load to verify response time, throughput, and resource consumption. For instance, a particular web application may be specified to support 100,000 concurrent users submitting transactions at the rate of 200 per second over a period of one hour.
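Reduced to its essentials, a performance test harness looks something like the following sketch; the simulated transaction, the thread count, and the shortened duration are stand-ins for the real workload and service-level targets, which the post’s example specifies as 200 transactions per second sustained for an hour.

    import time
    import random
    import statistics
    from concurrent.futures import ThreadPoolExecutor

    # Assumed targets standing in for the real service-level specification.
    TARGET_TPS = 200          # transactions per second
    DURATION_SECONDS = 10     # shortened run; a real test would run for an hour

    def submit_transaction():
        """Stand-in for one call to the system under test (simulated here)."""
        started = time.perf_counter()
        time.sleep(random.uniform(0.001, 0.005))   # simulated service time
        return time.perf_counter() - started

    def run_performance_test():
        total = TARGET_TPS * DURATION_SECONDS
        started = time.perf_counter()
        with ThreadPoolExecutor(max_workers=50) as pool:
            latencies = list(pool.map(lambda _: submit_transaction(), range(total)))
        elapsed = time.perf_counter() - started
        p95 = statistics.quantiles(latencies, n=20)[18]   # 95th percentile
        print(f"achieved {total / elapsed:.0f} tps "
              f"(target {TARGET_TPS}), p95 latency {p95 * 1000:.1f} ms")

    if __name__ == "__main__":
        run_performance_test()

A real harness would drive the actual service endpoint, hold the load steady for the full specified period, and record resource consumption alongside throughput and response time.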
Load and stress test: How does the system behave when pressed past its design limits? This test also requires a production-like environment running a simulated workload, but rather than validating a target performance threshold, it validates the expected behavior beyond those thresholds. Does the system consume storage, processing power, or network bandwidth to the degree that other processes cannot run? What indicators does the system provide to alert operations that a failure is imminent, ideally so that automation tools can avert a disaster (by throttling back the load, for instance)?
Installation test: Can the product be installed in a defined variety of production environments? This test requires a set of production environments that represent possible real-world target systems. The goal is to verify that the new system installs on these existing systems without error. For instance, what if the customer is two releases behind? Will the product install, or does the customer first have to install the intermediate release? What if the customer has modified the configuration or added typical extensions? Does the product still install cleanly? And if the product is intended to support continuous operations, can it be installed non-disruptively, without forcing an outage?
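The release-behind question can be expressed as a small upgrade-path check; the release identifiers and the “one release back” rule in this sketch are hypothetical.

    # Hypothetical upgrade-path check: direct installation is supported only
    # from the immediately preceding release; older systems must step through
    # the intermediate release first. Release names are invented for illustration.
    SUPPORTED_DIRECT_UPGRADES = {
        "3.2": {"3.1"},          # 3.2 installs directly only on 3.1
        "3.1": {"3.0"},
    }

    def install_plan(installed, target):
        """Return the ordered list of releases to install, oldest first."""
        plan = []
        while installed != target:
            # Find a release reachable directly from the installed level.
            step = next((new for new, bases in SUPPORTED_DIRECT_UPGRADES.items()
                         if installed in bases), None)
            if step is None:
                raise RuntimeError(f"no supported upgrade path from {installed}")
            plan.append(step)
            installed = step
        return plan

    print(install_plan("3.1", "3.2"))   # ['3.2']          one step
    print(install_plan("3.0", "3.2"))   # ['3.1', '3.2']   two releases behind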
Diagnostic test: When the system fails, does it provide sufficient diagnostic information to correctly identify the failing component? This requires a set of test cases that intentionally inject erroneous data into the system, causing various components to fail. These tests may be run in a constrained environment rather than in a full production-like one.
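In sketch form, a diagnostic test injects bad input and asserts that the resulting error names the failing component; the parser component and the shape of its diagnostic record here are assumptions.

    # Minimal fault-injection sketch. The order-parser component and the
    # format of its diagnostic message are hypothetical.
    def parse_order(record):
        """Hypothetical component: raises a diagnosable error on bad input."""
        try:
            customer, amount = record.split(",")
            return {"customer": customer, "amount": float(amount)}
        except (ValueError, AttributeError) as exc:
            # The diagnostic must identify the failing component and the input.
            raise RuntimeError(f"component=order-parser input={record!r}: {exc}") from exc

    def test_diagnostic_identifies_component():
        for bad_input in (None, "", "ACME", "ACME,not-a-number"):
            try:
                parse_order(bad_input)
            except RuntimeError as err:
                assert "component=order-parser" in str(err)
            else:
                raise AssertionError(f"no diagnostic raised for {bad_input!r}")

    test_diagnostic_identifies_component()
    print("all injected faults produced identifiable diagnostics")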
The QA function may be part of the development organization (traditional) or a separate function, reporting to the CIO (desirable). After the successful completion of QA, the product package moves from development into production. The release management function verifies that the development and QA teams have successfully exited their various validation stages, and performs an integration test, sometimes called a pre-production test, to ensure that the new set of modules is compatible with the entire production environment. Release management schedules these upgrades and tests, and verifies that no changes overlap. This specific function, called “collision detection,” can be incorporated into a Configuration Management System (CMS) or Configuration Management Database (CMDB), as described in ITIL versions 3 and 2 respectively.
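Collision detection itself reduces to checking whether two scheduled changes touch the same configuration item in overlapping windows; this sketch assumes a simple in-memory change list rather than a real CMS or CMDB.

    from datetime import datetime

    # Each scheduled change: (configuration item, window start, window end).
    # The items and windows below are invented for illustration.
    changes = [
        ("payroll-app",  datetime(2010, 12, 11, 22), datetime(2010, 12, 12, 2)),
        ("web-frontend", datetime(2010, 12, 11, 23), datetime(2010, 12, 12, 1)),
        ("payroll-app",  datetime(2010, 12, 12, 0),  datetime(2010, 12, 12, 3)),
    ]

    def collisions(scheduled):
        """Return pairs of changes that touch the same CI in overlapping windows."""
        found = []
        for i, (ci_a, start_a, end_a) in enumerate(scheduled):
            for ci_b, start_b, end_b in scheduled[i + 1:]:
                if ci_a == ci_b and start_a < end_b and start_b < end_a:
                    found.append((ci_a, (start_a, end_a), (start_b, end_b)))
        return found

    for ci, first, second in collisions(changes):
        print(f"collision on {ci}: {first} overlaps {second}")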
Ideally this new pre-production environment is a replica of the current production one. When it is time to upgrade, operations transfers all workloads from the existing to the new configuration. Operations preserves the prior configuration should problems force the team to fall back to that earlier environment. So at any point in time, operations controls three levels of the production environment: the current running one, called “n”, the previous one, called “n-1”, and the next one undergoing pre-production testing, called “n+1”.
Additional requirements for availability and disaster recovery may force the operations team to maintain a second copy of these environments: an “n-prime” copy of the “n” level system and possibly an “n-1-prime” copy of the previous good system, for fail-over and fall-back respectively.
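The n / n-1 / n+1 bookkeeping can be pictured as a simple rotation at cut-over time; the release names below are placeholders, and a real operations team would track far more state per level.

    # A toy model of the environment levels operations maintains.
    environments = {"n+1": "release-2011-01 (pre-production test)",
                    "n":   "release-2010-12 (current production)",
                    "n-1": "release-2010-11 (fall-back)"}

    def promote(envs, next_release):
        """Cut over: n+1 becomes n, n becomes n-1, and a new n+1 begins testing."""
        return {"n+1": next_release,
                "n":   envs["n+1"],
                "n-1": envs["n"]}

    environments = promote(environments, "release-2011-02 (pre-production test)")
    for level, release in environments.items():
        print(level, "->", release)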
When operations (or the users) detect an error in the production environment, the operations team packages the diagnostic information and a copy of the failing components into a maintenance image. This becomes the responsibility of a maintenance function, which may be a separate team within operations, within development, or simply selected members of the development team itself. Typically this maintenance team is called “level 3” support.
Using the package, the maintenance function may provide a work-around, provide a temporary fix, develop a replacement or upgrade, or specify a future enhancement to correct the defect. How quickly the maintenance team responds follows from the severity level and the relevant service level agreements governing the availability and functionality of the failing component. High severity problems require more rapid response, while lower severity issues may be deferred if a work-around can suffice for a time.
Note that the number of severity levels should be small: three or four at most. Also, the severity levels should be defined by the business owner of the function. Users should not report problems as having high severity simply to get their issues fixed more rapidly; the number of high severity problems should be a small percentage of the total volume of problems over a reasonable time period.
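A severity scheme of the kind described might be recorded as a small table of business-defined response targets; the levels and times below are examples, not prescriptions from ITIL or from any actual service level agreement.

    # Example severity scheme: few levels, business-defined response targets.
    # The target times are illustrative, not drawn from any real SLA.
    SEVERITY_RESPONSE_HOURS = {
        1: 2,      # service down or major business function unavailable
        2: 8,      # degraded service, work-around exists
        3: 72,     # minor defect, fix can ride the next scheduled release
    }

    def response_deadline_hours(severity):
        if severity not in SEVERITY_RESPONSE_HOURS:
            raise ValueError(f"unknown severity {severity}; keep the scheme small")
        return SEVERITY_RESPONSE_HOURS[severity]

    print("severity 1 response due within", response_deadline_hours(1), "hours")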
The successful resolution of problems, and the smooth integration of new functions into the production environment, requires a high degree of communication and coordination between development, QA, and production. Rather than being passed over a wall, modules should be transferred across a bridge: the operations team needs access to test cases and test results along with the new code itself, for diagnostic purposes, and the development team benefits from diagnostic information, reports on failure modes and problem history, and operational characteristics such as resource consumption and capacity planning profiles. The ITIL release management function, properly deployed, provides this bridge.