A Delicate Balancing Act

High-availability storage success depends on the ability to juggle emergency management within a realistic budget.

There's no such thing as a data recession, nor is there any trend in sight indicating that data and associated IT resources will need to be less available in the future. Quite the contrary: As more data are generated and stored for longer periods of time, they will need to be more available and accessible, not less.

The design of a high-availability data storage infrastructure can be as varied as the environments and applications that it supports.

Several techniques, technologies and best practices can be aligned to diverse needs and budgets to counter threats and risks and to meet a business's high-availability (HA) requirements. Six essential steps will help deliver HA for storage environments:

  • Develop strategies to address issues, threats and mission-objective requirements.
  • Establish a plan that includes applicable technologies, techniques and ongoing activities.
  • Implement a plan that includes technology deployment, configuration and day-to-day support.
  • Document and integrate with change control and business-continuity processes.
  • Measure and rely on recurring testing to validate the plan and the technologies.
  • Use problem determination, isolation, resolution and post-mortems to avoid future issues.

A basic tenet of high availability for storage (which applies as well to networks and servers) is fault isolation and fault containment -- that is, eliminate single points of failure (SPOFs) and, where a SPOF cannot be eliminated, configure systems so that any resulting fault or error condition is contained to prevent a rolling disaster. For example, you could configure a pair of networking or storage adapters to have separate paths to a shared storage system; in the event of a failure, you would still have access to the storage through the surviving adapter.
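One way to make the SPOF idea concrete is to list every path from a server to its storage and look for components that appear in all of them. The following is a minimal sketch under stated assumptions: the component names (hba0, switch_a, ctrl0 and so on) are hypothetical, and the listed paths are assumed to be the complete set of paths available.

```python
from typing import Iterable, List, Set


def single_points_of_failure(paths: Iterable[List[str]]) -> Set[str]:
    """Return the components that appear in every listed path.

    If one of these components fails, no path survives, so each one
    is a single point of failure for the paths supplied.
    """
    path_sets = [set(path) for path in paths]
    if not path_sets:
        return set()
    return set.intersection(*path_sets)


# Hypothetical layout: two adapters, each with its own switch and controller.
separate_paths = [
    ["hba0", "switch_a", "ctrl0"],
    ["hba1", "switch_b", "ctrl1"],
]

# Hypothetical layout: two adapters that both run through the same switch.
shared_switch_paths = [
    ["hba0", "switch_a", "ctrl0"],
    ["hba1", "switch_a", "ctrl1"],
]

if __name__ == "__main__":
    print(single_points_of_failure(separate_paths))
    print(single_points_of_failure(shared_switch_paths))
```

In the first layout no component is common to both paths, so losing any one adapter, switch or controller leaves a surviving path. In the second, the shared switch is reported because both paths depend on it.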

Keep in mind that HA is a balancing act between the availability needed to protect against the most likely scenarios (or scenarios that would have the most dire impact) and your budget.

Availability    Annual Downtime
99%             3.65 days
99.9%           8.77 hours
99.99%          52.6 minutes
99.999%         5.26 minutes
99.9999%        31.56 seconds
99.99999%       3.16 seconds
99.999999%      0.32 seconds
Availability expressed in nines

The perception is that components that have more "nines of availability" will enable HA. More nines of availability are good if you can afford them, but more important is how well the components work together. Overall availability is the result of all of the pieces working together.

Measuring Availability

Availability is often discussed in terms of five nines, six nines or higher. Annual downtime is calculated from the availability percentage: (100 – A)/100 is the fraction of the year that a system is unavailable, in which A is the availability percentage (99.999 for five nines, for example); multiplying that fraction by the length of a year gives the expected downtime. It is important to understand that overall availability depends on all components and their configuration, combined with design for fault isolation and containment. How much availability you need and can afford will be a function of your environment, applications, requirements and objectives.
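The downtime arithmetic above can be sketched in a few lines of Python. This is an illustration rather than a sizing tool: the function names are made up for this example, and a year is assumed to be 365.25 days long.

```python
# Assumption: an average year of 365.25 days.
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60


def availability_for_nines(nines: int) -> float:
    """Return the availability percentage for a count of nines (2 -> 99.0)."""
    return 100.0 * (1.0 - 10.0 ** -nines)


def annual_downtime_seconds(availability_percent: float) -> float:
    """Return the expected downtime per year, in seconds."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100 percent")
    unavailable_fraction = (100.0 - availability_percent) / 100.0
    return unavailable_fraction * SECONDS_PER_YEAR


if __name__ == "__main__":
    for nines in range(2, 9):
        percent = availability_for_nines(nines)
        seconds = annual_downtime_seconds(percent)
        print(f"{nines} nines: {seconds:.2f} seconds of downtime per year")
```

Dividing the result by 60, 3,600 or 86,400 converts it to minutes, hours or days for comparison with a service-level target.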

The reality is that applications can be looked at from the standpoint of a specific layer or resource, or from end to end, which is what a user of IT services sees.

Anticipating and Preparing for Failure

Availability is only as good as the weakest link. In the case of a data center, that weakest link could be the applications, software, servers, storage, network, facilities, processes or best practices. Virtual data centers rely on physical resources to function; a good design can help eliminate unplanned outages by compensating for individual component failures. A good design removes complexity while providing scalability, stability, ease of management and maintenance, as well as fault containment and isolation.
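The weakest-link point can be illustrated with a little arithmetic. When every layer in a chain must be up for the service to work, and failures are assumed to be independent, the end-to-end availability is the product of the individual availabilities, so it can never be better than the weakest layer. The layer names and figures below are hypothetical and chosen only for illustration.

```python
from math import prod
from typing import Iterable


def series_availability(availabilities: Iterable[float]) -> float:
    """End-to-end availability of components that must all be up.

    Assumes failures are independent; each value is a fraction
    between 0 and 1 (0.999 means three nines).
    """
    return prod(availabilities)


# Hypothetical per-layer availabilities for one application stack.
layers = {
    "application": 0.999,
    "server": 0.9999,
    "network": 0.9999,
    "storage": 0.99999,
}

end_to_end = series_availability(layers.values())
weakest = min(layers.values())

if __name__ == "__main__":
    print(f"weakest layer: {weakest:.5f}")
    print(f"end to end:    {end_to_end:.5f}")
```

Under these assumptions, adding nines to the storage layer does little for the user while the application layer stays at three nines.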

As part of the configuration, costs could be saved by using a single switch, but even with five or six nines of availability, that switch and its firmware or software still present a single point of failure. You should therefore configure a pair of switches, each on its own network, to guard against device failure, software or configuration errors, and network disruptions.

There is a tendency to try to reduce costs by replacing multiple smaller devices with a single, larger higher-availability device -- for instance, using a large switch in place of two separate switches. In that scenario, even with a manufacturer that boasts support for more nines than the competition, the physical frame itself might have a common SPOF -- for example, a backplane that creates the potential for multiple component failures.
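The trade-off between a pair of separate switches and one large chassis can be sketched numerically. The model below is deliberately simple: it assumes that failures are independent, that either unit of a redundant pair can carry the full load, and that a shared part such as a backplane sits in series with everything that depends on it. All figures are hypothetical.

```python
def parallel_availability(unit: float, copies: int = 2) -> float:
    """Availability of redundant units when any single one is enough.

    Assumes independent failures: the group is down only when
    every copy is down at the same time.
    """
    return 1.0 - (1.0 - unit) ** copies


def with_shared_component(group: float, shared: float) -> float:
    """Availability of a group that depends on one shared part."""
    return group * shared


# Hypothetical figures: five nines for each switch (or switch module)
# and five nines for the chassis backplane.
switch = 0.99999
backplane = 0.99999

# Two separate switches on separate networks, nothing in common.
two_switches = parallel_availability(switch)

# Two redundant modules inside one chassis that share a backplane.
one_chassis = with_shared_component(parallel_availability(switch), backplane)

if __name__ == "__main__":
    print(f"two separate switches: {two_switches:.12f}")
    print(f"one shared chassis:    {one_chassis:.12f}")
```

In this model the shared backplane caps the chassis at roughly the backplane's own availability, however many redundant modules sit behind it. Real deployments also face correlated failures, such as a bad firmware update applied to both switches, which this sketch does not capture.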

The bottom line: If something can fail, it will; it's just a matter of time. HA is about mitigating risk while balancing the PACE -- performance, availability, capacity and economics -- of your business requirements.

Any technology (hardware, software, network or service) can fail at some point, because of the technology itself, its configuration or other error, or from acts of nature or man. Most manufacturers will claim that their products have no single points of failure, and thus will not fail. But they also typically describe how to implement fault isolation and other capabilities so that if and when their products fail, they do so gracefully and predictably.

Look for a storage system that is resilient, yet scales with stability and flexibility -- meaning that as performance increases, availability does not suffer, or as availability increases, performance and capacity do not suffer. Likewise, combine individual component availability with sound configuration best practices, keeping in mind that even highly available components can break down because of technical or human error. In the end, it's how you configure components that reduces the impact of a failure and maintains HA.

The Basic Staples of High Availability

  • Fault containment and fault isolation designs
  • Availability tools and technologies, including RAID, failover and clustering
  • Reliable technologies, including individual components and entire solutions
  • Point-in-time snapshots, copies and disk-to-disk backups
  • Monitoring and diagnostics for both reactive and proactive analysis
  • Server adapters, cabling, switches, routers and input/output data path network items
  • Business continuity and disaster recovery techniques and processes
  • Performance, availability and capacity-planning management
  • Distance for survivability on a local or long-distance basis
  • Divergent network paths that do not share common infrastructure items
  • Configuration management databases (CMDB) and technology tracking
  • Cross-domain infrastructure resource management (IRM) tools and technologies
  • Testing, change control and configuration document management
Dec 15 2009