> fs2.schmolie.com down (failed storage backplane), restoration completed
andy
Posted: Feb 4 2018, 04:23 PM
Group: Advantagecom Staff
Posts: 4,302
Member No.: 9
Joined: 12-July 02



fs2.schmolie.com is currently under heavy load which is slowing down all services running on that machine. The following systems are affected: mdh0.speedingbits.com, multiple resolver and authoritative DNS servers, and 16 VPS accounts from various clients.

We're investigating the cause and hope to resolve it soon.

New information will be posted to this thread as it becomes known.


--------------------
Sincerely,
Andrew Kinney
CTO, Advantagecom Networks

Please do not private message me. My regular management duties preclude responding to every customer that sends me a support issue. Instead, post on the forum or contact tech support.
andy
Posted: Feb 4 2018, 04:32 PM

Remote access to the system does not appear to be functioning, so a technician has been dispatched.


andy
Posted: Feb 4 2018, 05:09 PM

The system appears to be in the middle of replacing a failed hard drive with a hot spare. The system should not be going unresponsive during this process, though, so we're investigating further.


andy
Posted: Feb 4 2018, 05:16 PM

It looks like the replacement hard drive failed in the middle of the replacement and a bug in the RAID controller firmware took the storage offline rather than continuing in degraded mode.

We're power cycling the system, which should bring it back online.


andy
Posted: Feb 4 2018, 05:33 PM

Unfortunately, it gets worse.

Due to another firmware bug in the RAID controller, one of the member disks was marked as a hot spare and the RAID controller began automatically rebuilding onto that disk. It sounds like a fine thing until you realize that it just catastrophically ate a full disk's worth of data, sending it to oblivion, when it presumed the RAID 5 array to be 6 disks instead of 7.
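To illustrate why a rebuild onto a live member disk is destructive, here is a toy sketch of RAID 5 parity for a single stripe. It is not the controller's actual logic; the disk indices, block contents, and the exact way the firmware miscounted members are assumptions made only for the example.

```python
from functools import reduce
from operator import xor


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(xor, column) for column in zip(*blocks))


# One stripe of a 7-disk RAID 5 array: six data blocks plus one parity block.
# The block contents are made up for the example.
data = [bytes([i + 1] * 4) for i in range(6)]
stripe = data + [xor_blocks(data)]

failed = 6   # assumed: the disk that actually died
victim = 0   # assumed: the healthy member wrongly marked as a hot spare

# Correct degraded-mode recovery: XOR the six surviving members.
survivors = [blk for i, blk in enumerate(stripe) if i != failed]
correct = xor_blocks(survivors)

# Faulty rebuild: the array is treated as 6 disks, so the "spare" is
# overwritten with the XOR of only the five other healthy members.
others = [blk for i, blk in enumerate(stripe) if i not in (failed, victim)]
wrong = xor_blocks(others)

print("failed disk recoverable before rebuild:", correct == stripe[failed])
print("victim still holds its own data after rebuild:", wrong == stripe[victim])
```

Under these assumptions the stripe ends up with two members missing their original contents (the failed disk and the overwritten one), which is more than single-parity RAID 5 can reconstruct.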

We'll be beginning the restore process soon.

If your account is one of the affected web sites or VPS on this node, please bear with us while we work to bring this system back online from backups.

ETA is 12 to 16 hours to have everything restored.

If you cannot wait that long, please order a new account and you can either restore from your personal backups or instruct us to restore, which we will do as soon as possible.


andy
Posted: Feb 4 2018, 06:22 PM

On closer examination and additional thought, the root cause is likely a failed backplane for the hard drives. We don't have a spare backplane on hand for this type of machine, so we're going to be restoring to other machines.

Please avoid contacting off-hours support about this issue. We're obviously aware of it and working on it as quickly as possible. Screaming at us to work faster won't help.

If you have a system that is otherwise unrelated but experiencing difficulties, it may be because of the DNS resolvers that have gone down with this node. If you're experienced and have a VPS, you can change your resolvers in /etc/resolv.conf. Ping each one listed there, remove the ones that are down, and save the file.

If you're inexperienced, do not attempt to edit the /etc/resolv.conf on your own and, instead, open a ticket at https://cbp.speedingbits.com/billing/clientarea.php to get assistance.
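For the experienced case, the "ping each one, remove the ones that are down" step could be sketched roughly as follows. This is a minimal sketch, not an official tool: the function names are made up, the `ping -c 1 -W 2` flags assume a Linux `ping`, and a resolver that ignores ICMP would be misjudged as down.

```python
import subprocess


def responds_to_ping(ip):
    """True if one ICMP echo is answered within 2 seconds (Linux ping flags)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", ip],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def drop_dead_resolvers(resolv_text, is_up=responds_to_ping):
    """Return resolv.conf text without the nameserver lines that fail is_up()."""
    kept = []
    live = 0
    for line in resolv_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            if not is_up(fields[1]):
                continue  # skip resolvers that do not answer
            live += 1
        kept.append(line)
    if live == 0:
        # Safety net: never hand back a file with no resolvers at all.
        return resolv_text
    return "\n".join(kept) + "\n"


if __name__ == "__main__":
    with open("/etc/resolv.conf") as handle:
        print(drop_dead_resolvers(handle.read()), end="")
```

The script only prints the proposed contents. Keep a copy of the original file before writing anything back.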


andy
Posted: Feb 4 2018, 06:30 PM

I've noted that due to an oversight on our part, both dns5.speedingbits.com and dns6.speedingbits.com were on the same node. They should not have been because they are the primary and secondary nameserver pair for many services.
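A placement check along these lines could catch that kind of oversight ahead of time. It is only a sketch: the inventory dictionary and the function name are hypothetical, and a real check would read placements from whatever inventory system is in use.

```python
def colocated_pairs(placement, redundant_pairs):
    """Return (service_a, service_b, node) for pairs that share a node."""
    clashes = []
    for first, second in redundant_pairs:
        node_first = placement.get(first)
        node_second = placement.get(second)
        if node_first is not None and node_first == node_second:
            clashes.append((first, second, node_first))
    return clashes


# Hypothetical inventory reflecting the situation described in this thread.
placement = {
    "dns5.speedingbits.com": "fs2.schmolie.com",
    "dns6.speedingbits.com": "fs2.schmolie.com",
}
redundant_pairs = [("dns5.speedingbits.com", "dns6.speedingbits.com")]

for first, second, node in colocated_pairs(placement, redundant_pairs):
    print(f"{first} and {second} are both on {node}")
```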

Our priority right now is restoring dns5.speedingbits.com, which we should have complete in the next hour or two. This will restore service for many otherwise unrelated services that are currently showing as "down" for clients.


andy
Posted: Feb 4 2018, 07:31 PM

QUOTE (andy @ Feb 4 2018, 06:30 PM)
Our priority right now is restoring dns5.speedingbits.com, which we should have complete in the next hour or two. This will restore service for many otherwise unrelated services that are currently showing as "down" for clients.

This is still in progress.

We're creating a new system for dns5.speedingbits.com and will have the files restored within the next 60 minutes or so.

To put this system on a new high-availability platform, it needed to change IP addresses. We've already changed the IP with the registrar so that the old information starts expiring while we work to restore the files.

More information will be posted as it is known.

Thank you for your continued patience. We're working as quickly as possible to restore service for all those affected.


andy
Posted: Feb 4 2018, 08:30 PM

QUOTE (andy @ Feb 4 2018, 06:30 PM)
Our priority right now is restoring dns5.speedingbits.com, which we should have complete in the next hour or two. This will restore service for many otherwise unrelated services that are currently showing as "down" for clients.

The new dns5.speedingbits.com has modern versions of all system software (Apache, PHP, MySQL), so bear with us while we adapt the 10+ year-old code to work on the new platform.

Our first priority will be to restore all authoritative zones (domains) serviced by this name server to get all of them functional again. Once we have done that, we'll begin work on making the web interface accessible and functional again.

We're currently working on transferring the files from our backup system to the new dns5.speedingbits.com.


andy
Posted: Feb 4 2018, 09:54 PM

Please stop texting and asking me to review tickets. They're likely all related to this issue at the moment. The few that aren't will be addressed by ticket support. Any tickets escalated to me will be delayed until further notice.

I won't be reviewing tickets or responding to texts until the resolution of all critical issues is complete. This will likely resolve your issue anyway.

At present, I'm the only one working this issue since nobody else has the technical qualifications to be of any use to the process. Unfortunately, I'm also off-hours support, so off-hours support is being ignored until the critical issues are resolved. The constant interruptions are substantially slowing the resolution, so I'll be handing off-hours support to a non-technical individual who will be responsible for telling people to check the forum for updates.

If you have an issue that you believe is not related to this set of issues, please open a ticket at https://cbp.speedingbits.com/billing/clientarea.php .

Please don't take random actions on your own in an attempt to "fix" this if you don't know what you're doing. You'll likely make a bigger mess that will have to be resolved at a later time. You're far better off just sitting tight or opening a ticket as indicated above.

No individual inquiries by text, phone, or email will be answered until further notice. It may appear that nothing is happening, but, rest assured, I'm working as quickly as possible to resolve ****ALL**** issues in the manner best suited to help as many people as possible as soon as possible.

If it appears nothing is happening because you aren't seeing constant updates on this thread, set that thought aside. I'm working on it as quickly as possible and will not leave anyone hanging any longer than is absolutely necessary.


andy
Posted: Feb 4 2018, 10:37 PM

The dns5.speedingbits.com files are restored and adapted to the new system. We're currently reviewing logs to vet the configs before making it reachable by the public. It will take 10 to 15 minutes to review the logs and, if no problems are found, we'll unfirewall it to allow access by the public.


andy
Posted: Feb 4 2018, 10:51 PM

Aside from a few minor issues that can be resolved later (they don't impact any active clients), the logs for dns5.speedingbits.com looked good and the configuration is working as expected.

We're now in the process of unfirewalling the new dns5.speedingbits.com.


andy
Posted: Feb 4 2018, 11:03 PM

dns5.speedingbits.com has been unfirewalled for several minutes now and it appears to be processing DNS queries normally. This should resolve a huge chunk of issues related to this outage of fs2.schmolie.com.

Access to sites using dns5.speedingbits.com and dns6.speedingbits.com may still be slower than usual since dns6.speedingbits.com isn't restored yet, but that is coming soon.

We're going to examine the remaining issues, triage them, and begin working them in order of priority.


andy
Posted: Feb 4 2018, 11:45 PM

The web interface for dns5.speedingbits.com has not yet been restored. We're saving that task for later since adding/changing zones is less critical than serving the existing zones.

The next thing in our list of priorities is to create a couple of new DNS resolvers to replace the two that went down. Most systems and VPS on our network reference three DNS resolvers and two of them are currently down, resulting in slow access to anything that does DNS lookups, such as SSH logins, FTP logins, and email logins. The services that rely on DNS resolution appear to be very slow to respond when only one of three DNS resolvers is up.
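As a rough worked example of where the slowness comes from: a typical stub resolver tries the nameserver lines in order and waits out a timeout on each dead one before moving on. The sketch below assumes that behaviour and the glibc default timeout of 5 seconds. Which two of the three resolvers are down on any given system, and in what order they are listed, is not stated here, so the inputs are illustrative.

```python
def delay_before_answer(resolvers_up, timeout_s=5):
    """Seconds spent waiting on dead resolvers before the first live one is asked.

    resolvers_up lists, in resolv.conf order, whether each resolver is up.
    Returns None if no resolver is up at all.
    """
    for position, is_up in enumerate(resolvers_up):
        if is_up:
            return position * timeout_s
    return None


# Assumed worst case: the two dead resolvers are listed first.
print(delay_before_answer([False, False, True]))
# Best case: the surviving resolver happens to be listed first.
print(delay_before_answer([True, False, False]))
```

Under those assumptions, every lookup on a system whose first two listed resolvers are down stalls for about 10 seconds before it gets an answer, which is enough to make logins feel hung.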

For some customer VPS that were already near to crashing for lack of CPU or RAM resources (you know who you are), this may have tipped them over the edge and caused their VPS to crash.

In total, the DNS resolvers being down affects far more customers than just the ones on this node. They should be relatively quick to set up and put in service, and doing so will have a large impact on our overall customer base.


andy
Posted: Feb 5 2018, 02:14 AM

The resolver dns4.schmolie.com has been replaced by a new system with a different IP.

The old IP was 66.29.143.240. The new IP is 66.29.136.240.

If you have a VPS or dedicated server and are technically inclined, you may edit your /etc/resolv.conf to reflect the new IP.
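For those making the change themselves, the edit amounts to swapping one address on the matching `nameserver` line. A minimal sketch follows; the function name is made up, and the script only prints the result rather than writing the file.

```python
OLD_IP = "66.29.143.240"
NEW_IP = "66.29.136.240"


def replace_nameserver(resolv_text, old_ip, new_ip):
    """Return resolv.conf text with 'nameserver old_ip' changed to new_ip."""
    lines = []
    for line in resolv_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "nameserver" and fields[1] == old_ip:
            line = f"nameserver {new_ip}"
        lines.append(line)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    with open("/etc/resolv.conf") as handle:
        print(replace_nameserver(handle.read(), OLD_IP, NEW_IP), end="")
```

Matching on the whole field rather than a substring avoids touching an unrelated address that merely contains the old one. On systems where /etc/resolv.conf is generated by another tool, the change may need to be made in that tool's configuration instead.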

We'll be taking care of this for all the shared hosting systems and managed VPS immediately. For other systems, you can make the change on your own, open a ticket at https://cbp.speedingbits.com/billing/clientarea.php , or wait for us to make the change later.

We still have one more resolver to replace, which we'll be working on shortly.

