
> Power outage on Monday, July 9, 2018, description, analysis, and fixes
andy
Posted: Jul 12 2018, 04:10 PM

On Monday, July 9, 2018 at approximately 8 AM Pacific Daylight Time (GMT -7:00), there was a service-affecting utility power outage. We've been very busy working through the issues this raised and haven't had a chance to post a full analysis until now.

When the utility power went out, our battery backup system (UPS) failed to supply power to the equipment long enough for the generator to start and take the load. The generator came on approximately 25 seconds after the power went out, but the UPS only supplied power to the equipment for approximately 20 seconds before failing, leaving a 5-second gap that caused all equipment to power down ungracefully and then power back on 5 seconds later. Given that the UPS is fully redundant and supposed to be capable of detecting and correcting hardware faults, we were very frustrated that it did not live up to the manufacturer's design and marketing claims.
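
To make the sequence concrete, here is the timing arithmetic as a short Python sketch. The figures are the approximate ones from this incident, not precise measurements:

[code]
# Timing of the incident: the UPS carried the load for roughly 20 seconds,
# but the generator did not pick up the load until roughly 25 seconds in.
ups_holdup_s = 20       # approximate seconds the UPS actually supplied power
generator_start_s = 25  # approximate seconds until the generator took the load

gap_s = generator_start_s - ups_holdup_s
if gap_s > 0:
    print(f"~{gap_s} second unpowered gap -> hard power-down of all equipment")
else:
    print("UPS bridged the outage until the generator took the load")
[/code]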

Customer downtime ranged from a few minutes to a few hours, depending on the service involved, how the related equipment was configured, and many other independent factors that caused some services to come up later than others. With a couple of exceptions, we had completed recovery by 11 AM.

The utility outage lasted 3 hours 40 minutes, and we ran on generator power for the duration.

Please consider this event in context: the UPS and generator have successfully protected all the equipment through 4 prior power outages in the last 3 years, totaling 6 hours 17 minutes of utility downtime that never caused a service outage. There have also been numerous successful full-system load tests (30 minutes each) in that same period that involved turning off our main service breaker, with no impact to services. While this event may have been painful for some, it was a rare and unlikely event that doesn't represent a pattern of unreliability.

A full analysis is available below for those interested in the details.



Utility input power was lost due to an animal incident at a substation serving 10,000+ electric customers, including us. The battery backup failed to supply power due to a fault in 1 of the 9 battery modules and a second fault in the redundant "intelligence" modules that run the battery backup system. Note that these are different subsystems in the UPS than what failed in any prior incident where the UPS failed to function properly. The generator automatically started as intended, took the load within 30 seconds of the utility power outage beginning, and continued running for the entire 3 hour 40 minute duration of the utility outage. When the utility outage ended, the facility power automatically transitioned from the generator back to utility power, as intended.


Through this experience and prior experience with this UPS, I have come to believe that a different brand of battery backup system would be better able to detect faults, so that we can identify and replace failing components during non-critical times without causing an outage.


What we observed in this most recent incident is that the battery backup system (Uninterruptible Power Supply, or UPS) simply turned off the power output at the time of the power outage. It did not even log that a power failure had occurred, only that it had turned off the output. It also didn't log a reason why the output was turned off, which was extremely odd. Once the generator started a few seconds later, the UPS logged that it had turned the output power back on.

After recovering all the various systems that required manual intervention when the power returned, I began troubleshooting the UPS. In the 10 hours of troubleshooting the UPS that I did that day (over and above the 4 hours spent recovering various systems affected by the power outage), I discovered the following:

1. Self-tests on the UPS were being refused because of a supposed "overload" condition, but the system showed it was at only 47% capacity. This contradiction leads me to believe that there is something wrong with the intelligence modules running the system. They are hot-swappable and redundant with automatic fail-over when one fails, but the system didn't detect any faults in the intelligence modules. That said, clearly, they weren't doing their job correctly.

2. Self-test attempts were only occasionally being logged, with no apparent reason why some were logged and some were not. This was different from its behavior two weeks prior, when it last logged a successful self-test.

3. Failed self-tests were not sending notifications, in spite of being configured to do so. This is further reason for me to believe the intelligence modules are not working properly.

4. Voltages detected on the batteries fell very quickly when running on battery, but no battery modules were marked as failed. Again, another indication that there is probably something wrong with the intelligence modules.

5. Further troubleshooting on the batteries identified one battery module that was drawing down the voltage on the other 8 modules. Removal of that failed (but not marked as failed by the system) battery module allowed voltages on the other battery modules to return to normal.

6. Even after removal of the failed battery module, the remaining 8 battery modules were somewhat weak. A battery run-time calibration test showed they could hold up the load for only 20 minutes, versus 56 minutes when new (see the rough arithmetic after this list). This was expected given the age of the battery modules. Still, we only needed 30 seconds of run time from the batteries, so the weak batteries weren't entirely to blame, though they may have compounded the issue somehow.

7. At times during the testing and troubleshooting, the battery voltages recorded by the UPS were erratic and occasionally registered unlikely levels. I was unable to discern whether this was a problem with the battery modules or the intelligence modules misreading the information.

8. During the testing and troubleshooting, the system correctly reported 47% capacity usage in one area and incorrectly reported 999% capacity usage in another area of the system, both reporting on the same figure but with two very different numbers. Based on some simple math and what I know about how much electricity our equipment is using, the 47% figure was correct (see the sanity-check sketch after this list). Again, this inconsistency points to a problem in the intelligence modules.

9. During a retest that I did on Tuesday, July 10, the UPS marked one of the power inverter modules as "failed". I'm unsure if this is a true failure or something being misread and misreported by the intelligence modules, but we removed the power inverter module to clear the fault alarm.

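For those who like numbers, here is the rough arithmetic behind item 6 as a short Python sketch, using the run-time figures quoted above:

[code]
# Item 6: remaining battery capacity versus what we actually needed.
runtime_new_min = 56.0       # rated run time at our load when the modules were new
runtime_measured_min = 20.0  # run-time calibration result with the 8 remaining modules
required_s = 30.0            # run time needed to bridge the generator start

remaining_fraction = runtime_measured_min / runtime_new_min
margin = (runtime_measured_min * 60.0) / required_s
print(f"Batteries at ~{remaining_fraction:.0%} of original run time")  # ~36%
print(f"Still ~{margin:.0f}x the run time we actually needed")         # ~40x
[/code]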

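And the sanity check behind item 8, again as a sketch only. The kW figures below are hypothetical placeholders for illustration, not our actual measurements; the point is simply that the capacity reading is a straightforward ratio:

[code]
# Item 8: which of the two capacity readings (47% vs. 999%) is plausible.
# NOTE: these kW values are hypothetical, for illustration only.
measured_load_kw = 7.5   # hypothetical load measured at the equipment
ups_capacity_kw = 16.0   # hypothetical rated UPS output capacity

load_pct = 100.0 * measured_load_kw / ups_capacity_kw
print(f"Expected capacity reading: ~{load_pct:.0f}%")  # a figure near 47% is plausible
# A 999% reading on the same load is physically impossible here,
# which points to a reporting fault rather than a real overload.
[/code]
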
In the end, my analysis was that there is a high probability that the intelligence modules, though not marked as failed, were the root cause of the problem with the UPS. One failed battery module (out of 9 in the system) contributed to quickly falling voltages, which has a moderate probability of having contributed to the overall issue. There is a low to moderate probability that the less than optimal capacity of the remaining 8 battery modules contributed to the problem. There is also a low to moderate probability that the power inverter module that was detected as "failed" during the retest the following day may have played a role.

In sum, there was at least one internal fault in the UPS that went undetected by the self-tests. This was not something we could have foreseen. If operating as designed, the system would have detected the fault, turned off the failed component, and automatically brought the redundant component into service. Given the timing and sequence of events, there is no way this could have been prevented short of massive duplication of infrastructure assets, and no customer would be willing to pay the price we'd have to charge for that level of redundancy and fault tolerance.


Our short term fix in the next few days is to purchase and install replacement intelligence modules, replacement battery modules, and a replacement power inverter module. This is almost as expensive as an entirely new UPS, but is the quickest solution we can employ at this time. This is already in progress and replacement parts are on the way.

Our long term fix in the coming months is to switch to a different brand of UPS. Our current system is made by APC, which was long considered an industry leader in battery backup, but it has not met my expectations (or their design/marketing claims). A system as expensive and supposedly fault tolerant as ours should be much better at detecting and reporting marginal components. We're evaluating several options, but our top pick right now is made by Eaton, a well-respected name in electrical power products and a long-time leader in battery backup systems. Our pick is subject to change as my research progresses, but our experience with APC has taught me several key things to look for in the new system that should address the shortcomings of our current one.

Once in our new datacenter (hopefully in the first half of next year), we'll also be using a completely different electric utility provider with two feeding substations (instead of our current single-substation feed) and all underground facilities (versus pole-mounted). The new electric utility should be much more reliable. That facility is already spec'd with a high-end Eaton UPS.


--------------------
Sincerely,
Andrew Kinney
CTO, Advantagecom Networks

Please do not private message me. My regular management duties preclude responding to every customer that sends me a support issue. Instead, post on the forum or contact tech support.
andy
Posted: Jul 20 2018, 02:34 PM

Additional testing with fresh batteries installed revealed the following:
  • The power module "failure" was a false alert due to the battery module voltage being low. It now tests good with the fresh battery modules installed.
  • The odd behavior of the intelligence modules was due to low battery module voltage combined with an obscure configuration directive that operates exactly opposite of what is indicated in the manufacturer's documentation. We've changed the setting to reflect the desired operation, in spite of what the documentation indicates.

That said, we have also installed (and tested) another power module, making the system N+2 redundant instead of N+1 redundant for power modules. This is a precautionary measure in case the module detected as "failed" (which now tests good) is flaky under load.
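
For clarity, the redundancy math works out as in the short sketch below. The module count and capacities are illustrative only, not our exact configuration:

[code]
import math

# N+X redundancy: N modules are enough to carry the load, X are spares.
# NOTE: hypothetical figures for illustration, not our exact configuration.
load_kw = 7.5             # hypothetical protected load
module_capacity_kw = 4.0  # hypothetical capacity of one power module
modules_installed = 4     # hypothetical count after adding the extra module

modules_needed = math.ceil(load_kw / module_capacity_kw)  # N
spares = modules_installed - modules_needed               # X
print(f"N = {modules_needed} modules carry the load; {spares} spares -> N+{spares}")
[/code]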

In light of the discovery of the configuration directive that was not operating as documented by the manufacturer, we've decided not to replace the intelligence modules since they were operating as designed, though opposite the documentation for that configuration directive. We deemed this to be a firmware or documentation flaw that would not be resolved by replacing the hardware.


--------------------
Sincerely,
Andrew Kinney
CTO, Advantagecom Networks

Please do not private message me. My regular management duties preclude responding to every customer that sends me a support issue. Instead, post on the forum or contact tech support.