How Many Failures Can Occur At Once?
Have you ever wondered if your Standard Operating Procedures (SOP), Maintenance Operating Procedures (MOP) or Emergency Operating Procedures (EOP) are sufficient to support continuous operations and proper responses to unplanned events?
When I facilitate our Data Center Operations training, I often use the following airline event to illustrate how a lack of adequate procedures, or a failure to follow the procedures that do exist, can quickly lead to undesirable results. See if you can count how many SOP, MOP or EOP deficiencies there are in this story.
This example dates back to July 23, 1983, when Air Canada Flight 143, now known as the “Gimli Glider”, was traveling from Montreal to Edmonton, Canada. I remember this event because it occurred relatively close to my home town.
About halfway through the 3,000 km (1,900 mile) trip, the cockpit warning system indicated a fuel pressure problem on the left engine. Assuming the pump had failed, the pilots turned off the pump since gravity would feed fuel to the engines if all else was operating normally. However, that assumption was not based on all the information that would normally be available since the fuel gauges were not working. The pilots knew the fuel gauges were inoperative due to an electronic fault which was indicated on the instrument panel and documented in the maintenance logs.
A few minutes after the first fuel pressure alarm activated, a second alarm sounded, indicating a fuel pressure problem on the right engine. The pilots diverted to Winnipeg, the nearest airport suitable for landing the jet. Within seconds the left engine failed (shut down) and the pilots began to prepare for a single-engine landing, something procedures have been developed for and pilots are trained for.
As they were communicating with the Winnipeg Control Tower, the cockpit warning system initiated a loud, long bong sound that the pilots had never heard before and that was not covered in any flight simulator training at that time. This system alarm was supposed to alert the pilots to an “All Engines Out” condition. The pilots were not familiar with this alarm because flying without engines was never expected to occur and therefore was never included in any training exercises.
Even though the fuel gauges were inoperative, the airplane’s management system did in fact indicate sufficient fuel for the flight, but only because the initial fuel load was incorrectly entered by the maintenance crew. This error was the result of confusion between Imperial and Metric units, since Canada was in the process of converting from Imperial to Metric. The fuel data had been entered using pounds as the unit when it should have been entered in kilograms. So as an example, the fuel was registering as 1,000 kg of fuel instead of the actual 1,000 lbs of fuel, which equals 454 kg, or just under half of what was expected (I don’t know the actual weight of the fuel).
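To make the size of that unit mix-up concrete, here is a minimal Python sketch of the arithmetic. It uses the same illustrative 1,000 figure as above, not the actual fuel load of Flight 143, and the function names are my own for illustration only.

```python
# Illustrative only: the 1,000 figure is the example used in the text,
# not the real fuel load of Flight 143.
LB_TO_KG = 0.45359237  # one pound in kilograms (exact by definition)


def pounds_to_kilograms(pounds: float) -> float:
    """Convert a weight in pounds to kilograms."""
    return pounds * LB_TO_KG


def shortfall_fraction(entered_value: float) -> float:
    """Fraction of the expected fuel that is missing when a figure
    measured in pounds is entered as if it were kilograms."""
    actual_kg = pounds_to_kilograms(entered_value)
    return 1 - actual_kg / entered_value


if __name__ == "__main__":
    entered = 1000  # value typed into the system, assumed to be kg
    actual = pounds_to_kilograms(entered)
    print(f"System believes: {entered} kg")
    print(f"Actually on board: {actual:.0f} kg")
    print(f"Shortfall: {shortfall_fraction(entered):.1%}")
```

Whatever number is entered, the shortfall is the same proportion: a little more than half of the fuel the system believes it has is simply not there.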
So at an altitude of approximately 11,000 m (35,000 ft) and halfway through the intended flight, both engines had failed, and the plane lost all power and most of the instrument panels in the cockpit. The 767 was one of the first planes to incorporate an electronic flight instrumentation system, which at the time required power from the plane’s engines. This left only a few basic battery-powered emergency instruments operating to enable landing the plane.
The vertical speed indicator was not one of the battery-powered instruments and therefore was not functioning. It would have indicated the rate of descent and provided information on how long the plane could glide. With the limited on-board information available to them, the pilots could not determine how far the plane could glide.
The pilots searched their emergency procedures manual for instructions on flying the aircraft with both engines failed. There were no procedures. This goes back to the fact that flying without engines was never expected to occur.
To determine how far they could glide, the pilots had to use a backup altitude instrument and coordinate with Air Traffic Control, starting and stopping their measurements at exactly the same time. The timing had to be precise because the pilots could measure their vertical drop over a defined time period, but they could not measure how far they had traveled horizontally. Only Air Traffic Control could measure that distance.
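The estimate described above comes down to a simple ratio: horizontal distance covered divided by altitude lost over the same timed interval. Here is a minimal Python sketch of that calculation. The numbers are hypothetical values chosen for easy arithmetic, not the readings taken on Flight 143, and the function names are my own.

```python
def glide_ratio(altitude_lost_m: float, ground_distance_m: float) -> float:
    """Horizontal distance traveled per unit of altitude lost.

    altitude_lost_m:   drop read by the pilots from the backup altitude instrument
    ground_distance_m: distance measured by Air Traffic Control over the
                       same timed interval
    """
    if altitude_lost_m <= 0:
        raise ValueError("altitude lost must be positive")
    return ground_distance_m / altitude_lost_m


def remaining_range_km(current_altitude_m: float, ratio: float) -> float:
    """Rough distance the aircraft can still glide, assuming the ratio holds."""
    return current_altitude_m * ratio / 1000


if __name__ == "__main__":
    # Hypothetical measurement: 1,500 m lost while covering 18 km over the ground
    ratio = glide_ratio(altitude_lost_m=1500, ground_distance_m=18000)
    print(f"Glide ratio: {ratio:.0f}:1")
    print(f"Range from 8,000 m: {remaining_range_km(8000, ratio):.0f} km")
```

The sketch ignores wind, turns and changes in airspeed, all of which would affect a real estimate. It also shows why the timing had to match: if the two measurements cover different intervals, the ratio is wrong.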
To avoid trying to land the plane at an airport in a populated area (there’s only one chance to land successfully when you have no power), the pilot decided to land at an abandoned Air Force Base in Gimli, north of Winnipeg (one of the pilots had previously served as an RCAF pilot at the base). However, neither the pilot nor Air Traffic Control knew that a portion of the runway at the Air Force Base had been converted to a motor race track. Since the engines had all failed, the airplane made virtually no noise as it came in for a landing. People on the ground had no advance warning of the pending landing.
As the plane came in for its one and only landing opportunity, the pilots were able to avoid hitting two young boys riding their bicycles along the abandoned runway, manage the collapse of the front nose wheel upon first impact, and keep the plane from careening off the runway into the crowd of spectators alongside the runway who were there for the motor race. The pilots had successfully landed the powerless 767 airplane. A minor fire in the plane’s nose area was extinguished by the racers and course workers with handheld fire extinguishers. There were no serious injuries.
The total time that elapsed from the point at which the engines failed until the plane landed was 17 minutes. Setting aside the combination of corporate and human errors that contributed to this event, lives were saved because of the skill and expertise of the pilots.
This is an example of how one oversight (or in this case a few) can cascade, resulting in a critical, and potentially catastrophic, unplanned event. The takeaways that can be applied to your critical IT service procedures, and that reflect the methodology we use when guiding data center operation teams, are:
Do your SOPs:
Address all normal operating procedures?
Are your personnel trained and tested in those procedures?
Are your personnel following those procedures?
Do your MOPs:
Address the maintenance cycles of all components and systems?
Are your personnel trained and tested in carrying out the maintenance activities?
Are your personnel following the procedures to mitigate the impact of a failed component or system?
Do your EOPs:
Address unexpected events?
Are your personnel trained and tested in responding to various unexpected events so that they are able to make appropriate decisions when required?
If your response to a possible unexpected event is “that will never happen” or “that has a very low probability of actually occurring”, your personnel will be in uncharted territory if they ever need to respond to one. I’m not suggesting that detailed step-by-step processes need to be developed for every unexpected event, but your EOPs should at least identify who is responsible for making decisions and how information is to be communicated throughout the operations team. It would be even better if the EOPs provided general guidance so that those responsible could make decisions that align with the business expectations. The risks include personnel making decisions that “bail out” too soon without implementing safe procedures that could avoid an unplanned outage, or even more important, making decisions that “stay engaged” too long, putting people’s safety at risk.
So how many deficiencies did you find?