> Site wide power loss, recovered / explained
andy
Posted: Dec 25 2018, 01:52 AM
Group: Advantagecom Staff
Posts: 4,339
Member No.: 9
Joined: 12-July 02



We're currently bringing all systems back up after a site-wide power loss.

We'll post a full report once everything is running again. Our first priority is restoring service.


--------------------
Sincerely,
Andrew Kinney
CTO, Advantagecom Networks

Please do not private message me. My regular management duties preclude responding to every customer that sends me a support issue. Instead, post on the forum or contact tech support.
andy
Posted: Dec 25 2018, 03:10 AM

At this time, one VPS node is still down: vds8.schmolie.com.

We'll be hooking up a physical console to find out why it did not boot when power was restored, and we hope to have it running again within the hour.


andy
Posted: Dec 25 2018, 03:51 AM

The VPSes on vds8.schmolie.com are coming back up. Each will go down again briefly while its disk quotas are automatically recalculated.


andy
Posted: Dec 25 2018, 05:00 AM

Our phone system went offline while we were correcting a time issue on it. We'll need to get on site to investigate and correct this issue.


andy
Posted: Dec 25 2018, 05:18 AM

The phone system became accessible again before we left to go on site for troubleshooting. Its logs showed that setting the time caused the system to believe that the kernel had locked up, which knocked numerous processes offline for a little over 5 minutes.

For good measure, we rebooted the phone system and checked its time once it came back up. All is good now.
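
As a generic illustration only (this is not the phone system's code, and its actual watchdog mechanism is not described in this thread), the sketch below shows how stepping a clock can look like a hang to any check that measures elapsed time with the wall clock. The function name, timeout, and all readings are hypothetical.

```python
def heartbeat_is_stale(last_beat: float, now: float, timeout_s: float = 60.0) -> bool:
    """Return True when the last heartbeat is older than the timeout."""
    return (now - last_beat) > timeout_s


# Hypothetical readings, in seconds. Only 5 seconds really pass between the
# heartbeat and the check, but the wall clock is set forward by one hour in
# between. A monotonic clock is not affected by that correction.
wall_last_beat, mono_last_beat = 1_000.0, 50.0
real_elapsed = 5.0
clock_correction = 3_600.0

wall_now = wall_last_beat + real_elapsed + clock_correction
mono_now = mono_last_beat + real_elapsed

wall_verdict = heartbeat_is_stale(wall_last_beat, wall_now)  # looks like a hang
mono_verdict = heartbeat_is_stale(mono_last_beat, mono_now)  # still healthy
print(wall_verdict, mono_verdict)
```

In Python, `time.monotonic()` provides a clock of the second kind, which is why elapsed-time checks are usually built on it rather than on `time.time()`.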


andy
Posted: Dec 25 2018, 05:20 AM

At this time, all systems we monitor are up and running.

If your service is experiencing any trouble at this time, please open a trouble ticket to make us aware of the issue and we'll address it.

You can open a trouble ticket here: https://cbp.speedingbits.com/billing/clientarea.php


andy
Posted: Dec 25 2018, 05:49 AM

The sequence of events that led to this outage was as follows:

1. An unknown event caused our standby generator to start at approximately 11:50PM Pacific Standard Time (GMT -8:00) on December 23. Because of how this happened, we received no notice that the generator had started and begun supplying power to our datacenter in place of utility power. We suspect a power quality issue caused a failure in either the generator control board or the utility sensing circuit of the automatic transfer switch, but we have not yet diagnosed the issue to that level of precision.

2. At 8:14PM Pacific Standard Time on December 24, we were notified by the Port of Walla Walla that a neighbor had complained about the generator running continuously. We immediately knew something was wrong and got on site within about 30 minutes of the call.

3. After much troubleshooting, we could not get the transfer switch to switch back to utility power. Further reading of the manual revealed that, before we could manually switch back to utility power, the generator, the input power, and all load had to be switched so that they were not served by the automatic transfer switch.

4. While we were manually switching back to utility power, one of the battery modules in our battery backup system failed. It was one of the new, 6-month-old modules that we'd installed the last time we had trouble with the battery backup system. This was an early failure and is within warranty.

5. Unfortunately, we were unsuccessful at manually switching to utility power and had to turn the generator back on to restore power before the battery backup system dropped the load.

6. We removed and replaced the failed battery module with a functional older spare we had on hand and waited for the battery backup system to charge before continuing.

7. When we attempted to manually switch back to utility power again, the battery backup system supplied power for only about 2 minutes (instead of the 21 minutes expected) and dropped the load before we could restore utility or generator power.

8. At this point we became concerned that we had either a fault in our automatic transfer switch or there was something wrong with the utility power that was preventing us from manually switching back.

9. We called in the electric utility provider (yes, on Christmas Eve near midnight) and had them test their side. They found it to be good, but could not rule out a problem that was intermittent or related to load, rain, or some other issue that couldn't be immediately replicated.

10. With everything already down from the battery backup system dropping the load, we decided to run some more invasive tests (as quickly as possible) while we had the opportunity. This is when we discovered that the generator was getting an erroneous "utility power fail" signal either from its own control board or from the automatic transfer switch. Unfortunately, this meant we had to take the generator off "automatic" to return to utilizing utility power. We didn't want to cause mechanical failure or unnecessary wear and tear on the generator or excessive fuel usage from several days of continuous run time before a generator technician could come out and look at it.
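
For reference, the figures in the steps above can be checked with a short Python sketch. It only does arithmetic on the times and runtimes quoted in steps 1, 2, and 7. The start time is the approximate one given in step 1, and the variable names are just for illustration.

```python
from datetime import datetime, timedelta, timezone

# Pacific Standard Time as a fixed offset, matching "GMT -8:00" in step 1.
PST = timezone(timedelta(hours=-8))

generator_start = datetime(2018, 12, 23, 23, 50, tzinfo=PST)  # step 1, approximate
notified = datetime(2018, 12, 24, 20, 14, tzinfo=PST)         # step 2

unnoticed = notified - generator_start
hours, remainder = divmod(int(unnoticed.total_seconds()), 3600)
minutes = remainder // 60
print(f"Generator ran unnoticed for about {hours} h {minutes} min")

# Step 7: battery runtime observed versus expected, in minutes.
expected_min, observed_min = 21, 2
runtime_fraction = observed_min / expected_min
print(f"Battery runtime was about {runtime_fraction:.0%} of expected")
```

In other words, the generator had been carrying the load for roughly 20 hours before anyone was notified, and the battery backup system delivered about a tenth of its expected runtime on the second attempt.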

At present and until we can have a generator technician do a final diagnosis and fix, the generator is set to "off" mode, which means it will not run automatically if utility power fails. If utility power fails, we will need to manually start the generator.

As for preventing this type of issue in the future, we're going to begin procuring separate battery backup systems for critical systems to prevent another full load drop by our site-wide battery backup system from having such a large impact. If the site-wide battery backup system drops the load, the auxiliary battery backup systems will take the load. We will implement this as soon as funds permit.
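
The intended effect of the auxiliary units can be pictured with a small Python sketch. It is a simplified model, not a description of the actual equipment: the system names are hypothetical, and which systems will count as critical is not stated in this thread.

```python
def systems_still_powered(systems, site_ups_holding, aux_protected):
    """Return the systems that keep power.

    systems: all system names (hypothetical).
    site_ups_holding: True while the site-wide battery backup carries the load.
    aux_protected: names of systems that also have a charged auxiliary unit.
    """
    if site_ups_holding:
        return list(systems)
    return [name for name in systems if name in aux_protected]


# Hypothetical inventory, for illustration only.
all_systems = ["core-router", "storage", "vps-node", "phone-system"]
critical = {"core-router", "storage"}

after_drop = systems_still_powered(
    all_systems, site_ups_holding=False, aux_protected=critical
)
print(after_drop)
```

With no auxiliary units at all (an empty `aux_protected` set), a drop by the site-wide system takes everything down, which matches what happened in step 7.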


andy
Posted: Dec 28 2018, 03:24 PM

We've isolated the cause of the generator problem and it has been repaired. Full auto-start generator backup power has been restored. Details below.

After identifying the root cause of the issue, we've been able to piece together what happened in more detail.

On December 23, 2018 at approximately 11:50PM Pacific time (GMT -8:00), the outdoor temperature dropped below 40F and the battery warmer pad thermostat turned on the battery warmer. By design, it draws about 60W of 240V power through the 240V utility sense lines, which are protected by 5A fuses (one per 120V leg) in the automatic transfer switch (ATS). The battery warmer pad shorted out and blew both fuses, cutting utility sense power to the generator. The generator controller sensed the "loss" of utility power and started, taking the load away from the utility source. This is why we were unable to switch from generator to utility while the generator was set to "auto".
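
The numbers and the failure logic in the paragraph above can be sketched in a few lines of Python. This is a simplified model under stated assumptions: the warmer is treated as a purely resistive load (so current is power divided by voltage), and `generator_should_run` is a hypothetical stand-in for the real controller, whose actual logic is not documented here.

```python
WARMER_POWER_W = 60.0      # "about 60W", from the post
SENSE_VOLTAGE_V = 240.0    # utility sense lines
SENSE_FUSE_RATING_A = 5.0  # one fuse per 120V leg

# Normal warmer current, assuming a resistive load: I = P / V.
normal_current_a = WARMER_POWER_W / SENSE_VOLTAGE_V
fuse_headroom = SENSE_FUSE_RATING_A / normal_current_a
print(f"Normal draw: {normal_current_a} A, {fuse_headroom:.0f}x below the fuse rating")


def generator_should_run(utility_live: bool, sense_fuses_intact: bool,
                         mode: str = "auto") -> bool:
    """Hypothetical controller: in auto mode, run when sense voltage is absent.

    The controller only sees the sense voltage, so blown sense fuses look
    exactly like a utility outage.
    """
    sense_voltage_present = utility_live and sense_fuses_intact
    return mode == "auto" and not sense_voltage_present


print(generator_should_run(utility_live=True, sense_fuses_intact=True))   # normal
print(generator_should_run(utility_live=True, sense_fuses_intact=False))  # this incident
```

At about 0.25 A the warmer normally sits far below the 5A fuse rating, so only a fault such as a short would blow the fuses. Once they blow, the model keeps asking for the generator for as long as the mode is "auto", even though utility power is present.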

Today, we traced this issue from the generator control panel back to the ATS, which led us to the fuses. Before replacing the fuses we traced all loads on those fuses and discovered the shorted out battery warmer pad. We removed the battery warmer pad, replaced the fuses, verified utility sense voltage was normal again, and set the generator back to "auto". Everything is operating normally at this time, except the battery warmer. The battery warmer is not essential, but we'll be replacing it soon.

We're also going to wire in some inline fuses on the wires feeding the battery warmer to avoid blowing the utility sense fuses the next time the battery warmer shorts out. Why the manufacturer did not have this as part of their design for the battery warmer is unfathomable, but we're going to fix their design so that this type of problem will not happen again.
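
A rough way to think about sizing such inline fuses is sketched below in Python. It is a simplification, not an electrical design: it only compares nominal ratings, while real coordination between two fuses also depends on their time-current characteristics. The candidate ratings and the 1.25 margin are assumptions made up for the example. Only the 60W, 240V, and 5A figures come from the post above.

```python
def candidate_inline_ratings(load_a, upstream_a, ratings, margin=1.25):
    """Return ratings above the normal load (with margin) and below the upstream fuse.

    load_a: normal load current in amps.
    upstream_a: rating of the upstream (utility sense) fuse in amps.
    ratings: available fuse ratings in amps (hypothetical list).
    margin: hypothetical safety factor to avoid nuisance blowing.
    """
    return [r for r in ratings if load_a * margin <= r < upstream_a]


warmer_current_a = 60.0 / 240.0  # about 0.25 A, assuming a resistive load
sense_fuse_a = 5.0

# Hypothetical set of available ratings, in amps.
available = [0.25, 0.5, 1.0, 2.0, 5.0, 10.0]
choices = candidate_inline_ratings(warmer_current_a, sense_fuse_a, available)
print(choices)
```

Any rating in the resulting list is meant to open on a warmer fault before the 5A utility sense fuses do, leaving the generator's view of utility power intact.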

