Data Center Redundancy Classifications
Redundancy in critical data center systems and services is like an insurance policy. How much risk is your organization comfortable with, and how much insurance is it willing to purchase so that the budget for mitigating unacceptable risks aligns with the acceptable level of risk? The answers to these questions are unique to each organization, even within the same market or industry.
The classifications of the data center services within the ANSI/BICSI 002 Data Center Design and Implementation Best Practices standard define a set of performance requirements to meet varying needs for reliability and availability. See my previous post "Reliability Decision Tree for Critical Infrastructure" to read more about how to identify which classification would be appropriate for your organization.
The ANSI/BICSI 002 standard defines the Data Center services performance characteristics within five "Classes", Class-0 through Class-4, with Class-4 providing the highest level of reliability. Class-0 and Class-1 are both "single path" solutions, with Class-0 being any data center service that does not meet the minimum requirements of Class-1, but meets all the basic recommendations within the standard. In this post I will simply focus on the performance characteristics of Class-1 through Class-4.
The performance characteristics defined within each of these "Classes" can be applied to any of the data center ICT services, or the supporting critical infrastructure. The performance characteristics can also be applied to:
In-house strategies, such as Enterprise and Private Cloud
Outsourced strategies, such as Public Cloud, Hybrid Cloud and Colocation
When implementing outsourced strategies, it is important to understand what level of performance, redundancy and reliability is required to support the business objectives. Outsourced Cloud strategies do not inherently provide the highest level of reliability. They do so only if they have been designed to provide it and the Cloud service vendor has implemented the lower layers of the critical infrastructure to align with that high level of reliability throughout the entire data center services stack.
The performance characteristics can help define the requirements for:
Application Architecture: Level of performance reliability required will impact whether applications are designed such that they are hardware independent, have the ability to seamlessly fail over, and provide abstraction between the various application layers to enable scale-out capabilities.
ICT Systems Architecture: Level of performance reliability required will impact whether servers and storage systems are designed such that there is no direct dependency between specific servers and storage devices, compute processing is provided by location-transparent processing or high-availability mirrored systems, and data is mirrored on redundant storage systems.
Network Architecture and Topology: Level of performance reliability required will impact component redundancy within discrete network chassis (PSU, supervisors, NIC failover) and system redundancy with redundant chassis at each tier within the network topology.
Network Cabling Infrastructure: Level of performance reliability required will impact whether network services simply provide logical redundancy or whether all network services, links and channels also provide diverse physical redundancy.
Power Distribution and Backup Energy Sources: Level of performance reliability required will impact the level of redundancy at all layers of the power distribution (switchgear, backup power source, UPS) so that they provide "N", "N+1", "2N", "2(N+1)" or various other configurations to ensure performance requirements are met.
Cooling Solutions: Level of performance reliability required will impact the level of redundancy at all systems within the cooling solution (heat exchanger, piping, valves, cooling units, etc.) so that they provide "N", "N+1", "2N", "2(N+1)" or various other configurations to ensure performance requirements are met. Note that it is common for sub-systems within the cooling solution to have differing levels of redundancy ("N+1" vs "2N") to provide an efficient solution while still meeting the specific performance requirement.
Security Systems: Level of performance reliability required will impact the level of redundancy of all security system head-end processing, discrete end-point devices, power sources and network connections.
Structural Robustness to withstand external forces: Level of performance reliability required will impact the level of robustness implemented within the structural hardening of the data center. A minimum level of hardening may result in a data center that can withstand external forces but is not necessarily operational after an event. The highest levels of hardening should ensure the data center can not only withstand external forces, but also remain operational during and after an event.
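To make the "N", "N+1", "2N" and "2(N+1)" notation used for the power and cooling items concrete, the following is a minimal sketch that counts how many units would be installed under each configuration. The function name `units_installed` and the three-module load are hypothetical, chosen only for illustration, and real designs often mix configurations across sub-systems.

```python
def units_installed(n: int, scheme: str) -> int:
    """Return the total units installed when n units are needed to carry the load.

    n is the capacity requirement ("N"); scheme is one of the four
    configurations named in the text. Hypothetical helper for illustration.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    schemes = {
        "N": n,                  # exactly what the load needs, no spare
        "N+1": n + 1,            # one spare unit on a single system
        "2N": 2 * n,             # two full systems, no spare in either
        "2(N+1)": 2 * (n + 1),   # two full systems, each with one spare
    }
    if scheme not in schemes:
        raise ValueError(f"unknown scheme: {scheme}")
    return schemes[scheme]


# Assumed example: the load needs 3 UPS modules (N = 3).
for scheme in ("N", "N+1", "2N", "2(N+1)"):
    print(scheme, units_installed(3, scheme))
```

For N = 3 this gives 3, 4, 6 and 8 units respectively, which shows how quickly the "insurance premium" grows as the configuration moves toward "2(N+1)".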
The requirements and design criteria for each of the data center elements listed above can be defined by the following performance characteristics.
Class-1 Characteristics
The objective of Class-1 is to support the basic requirements of the Data Center ICT services. There is a high risk of downtime due to planned and unplanned events. However, in Class-1 data center strategies, remedial maintenance can be performed during nonscheduled hours, and the impact of downtime is relatively low.
Performance Definition: Single Path
Component or ICT discrete service redundancy: Not required
System or ICT solution redundancy: Not required
Quality control: Standard non-critical quality
Survivability: No additional robustness incorporated to withstand external forces
Class-2 Characteristics
The objective of Class-2 is to provide a higher level of reliability than Class-1 in order to reduce the risk of unplanned downtime due to component or discrete ICT service failure. Components or discrete ICT services that have high failure rates, or ICT services that are outside the direct control of the data center Ops team, should have redundancy incorporated into the designed solution to reduce unplanned outages. In a Class-2 data center strategy there is a moderate risk of downtime due to unplanned events, and downtime may be required to support planned maintenance activities.
Performance Definition: Basic Redundancy
Component or ICT discrete service redundancy: Redundancy is provided for critical components or discrete services with high failure rates
System or ICT solution redundancy: Not required
Quality control: Standard non-critical quality
Survivability: Moderate hardening for security and robustness to withstand external forces
Class-3 Characteristics
The objective of Class-3 is to provide sufficient redundancy at the component, ICT discrete service, system or ICT solution level to ensure maintenance on any of these elements can be achieved without impacting the data center services. This requires that any of the elements can have a planned shutdown for maintenance activities without impacting the redundant path or service.
Performance Definition: Concurrently Maintainable
Component or ICT discrete service redundancy: Redundancy is required for critical and non-critical components or ICT discrete services, except when the components or ICT discrete services are part of a redundant system or ICT solution.
System or ICT solution redundancy: System or ICT solution redundancy is required where component or ICT discrete service redundancy does not exist or cannot achieve concurrent maintainability.
Quality control: Premium quality throughout
Survivability: Significant hardening for security and robustness to withstand external forces
Class-4 Characteristics
The objective of Class-4 is to eliminate downtime due to either planned or unplanned activities. A Class-4 strategy shall provide sufficient redundancy such that if a component or ICT discrete service with high failure rates fails (unplanned event) while a system or ICT solution is off-line due to planned maintenance, the data center services shall not be disrupted.
Performance Definition: Fault Tolerant
Component or ICT discrete service redundancy: Redundancy is required for critical and non-critical components or ICT discrete services with high failure rates.
System or ICT solution redundancy: System or ICT solution redundancy is required.
Quality control: Premium quality throughout
Survivability: Highest level of hardening for security and robustness to withstand external forces
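The four sets of characteristics above can be collected into a single lookup table, which may be convenient when comparing design options side by side. This is only a sketch: the names `ClassProfile`, `PROFILES` and `class_for` are hypothetical, the entries are abbreviated from the lists above, and the standard itself remains the authoritative source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClassProfile:
    """Abbreviated summary of one Class, taken from the lists in this post."""
    performance: str
    component_redundancy: str
    system_redundancy: str
    quality_control: str
    survivability: str


PROFILES = {
    1: ClassProfile("Single Path", "Not required", "Not required",
                    "Standard non-critical quality",
                    "No additional robustness"),
    2: ClassProfile("Basic Redundancy",
                    "Critical components with high failure rates",
                    "Not required",
                    "Standard non-critical quality",
                    "Moderate hardening"),
    3: ClassProfile("Concurrently Maintainable",
                    "Required unless part of a redundant system",
                    "Required where component redundancy is insufficient",
                    "Premium quality throughout",
                    "Significant hardening"),
    4: ClassProfile("Fault Tolerant",
                    "Required for components with high failure rates",
                    "Required",
                    "Premium quality throughout",
                    "Highest level of hardening"),
}


def class_for(performance: str) -> int:
    """Return the Class number whose performance definition matches."""
    for number, profile in PROFILES.items():
        if profile.performance.lower() == performance.lower():
            return number
    raise KeyError(performance)


print(class_for("Concurrently Maintainable"))  # prints 3
```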
Note that other standards and guidelines also use the term "Fault Tolerant" to describe their highest level of redundancy and robustness. However, they define the performance characteristics as requiring two independent systems supporting the data center systems, so that services are not disrupted if one path fails or is taken off-line for maintenance. This definition of Fault Tolerant does not meet the same level of redundancy and reliability as Class-4. I highlight this only to illustrate the need for owners and designers to clearly define what they mean by fault tolerant. Either approach is appropriate and suitable if it meets the expectations of the owner.
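The difference between the two meanings of fault tolerant can be sketched with a small scenario check: take one path off-line for planned maintenance, then lose one unit on the remaining path. The model below is a deliberate simplification and rests on two assumptions of mine that the standard does not state in these terms: the "two independent systems" definition is modeled as "2N", and Class-4 is modeled as "2(N+1)". The function name `survives_failure_during_maintenance` is hypothetical.

```python
def survives_failure_during_maintenance(units_per_path: int, need: int) -> bool:
    """Check a two-path design against the Class-4 scenario described above.

    units_per_path: units installed on each of the two independent paths.
    need: units one path must have working to carry the full load (N).
    Simplified model for illustration only.
    """
    working = [units_per_path, units_per_path]
    working[0] = 0    # planned event: first path is off-line for maintenance
    working[1] -= 1   # unplanned event: one unit fails on the second path
    # Service stays up if any single path can still carry the full load.
    return any(w >= need for w in working)


N = 2  # assumed load requirement of two units

# "2N": two paths with N units each (assumed model of the other definition)
print(survives_failure_during_maintenance(units_per_path=N, need=N))      # False

# "2(N+1)": two paths with N+1 units each (assumed model of Class-4)
print(survives_failure_during_maintenance(units_per_path=N + 1, need=N))  # True
```

Under these assumptions the "2N" design rides through either event on its own but not both at once, which is the gap the note above describes.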