> Emergency maintenance to several systems, resolved
andy
Posted: Jul 12 2016, 07:56 PM
Group: Advantagecom Staff
Posts: 4,340
Member No.: 9
Joined: 12-July 02



The systems responsible for cbp.speedingbits.com (billing and provisioning system), webpro2.speedingbits.com (WebPro2 accounts), www.simplywebhosting.com, and several DNS servers went offline unexpectedly.

We've restarted the systems and everything came back up.

We're now examining why the systems went offline.


--------------------
Sincerely,
Andrew Kinney
CTO, Advantagecom Networks

Please do not private message me. My regular management duties preclude responding to every customer that sends me a support issue. Instead, post on the forum or contact tech support.
andy
Posted: Jul 12 2016, 08:11 PM

We've determined that one of the storage devices common to each of the systems failed to respond to requests in a timely manner, blocking storage IO long enough to knock the systems offline. The device was reset automatically by the storage system and seems to be functioning fine at this time.

We're going to be watching this issue closely and researching whether there is a way to prevent it from recurring.


andy
Posted: Jul 25 2016, 03:38 PM

We've traced the problem to a faulty SSD and have initiated an advance replacement with the vendor. Details below.

For those who like the technical details, the SSD had a failed "erase" block. While the SSD was attempting to work with the faulty block, it did not respond to any requests. That caused the entire storage system to stop responding to requests, and the stall was passed all the way back up the stack to the nodes and the VMs on those nodes. Linux falls over when storage is inaccessible for 30 to 60 seconds. It appears that this issue persisted for 3 to 6 minutes, long enough for active VMs and nodes to give up trying to access storage and crash.

This behavior from a "data center" class SSD (high end SSD designed to take heavy loads with no hiccups) is far from acceptable. It's the very thing they're *not* supposed to do compared to their consumer counterparts in the SSD world. Normally, this type of thing can be attributed to a bug in the firmware, but there is no newer firmware available for this model of SSD. It's rather ironic that a similar bug was discovered and fixed in the consumer versions of this same SSD and, yet, no fix is forthcoming for the "data center" device that cost nearly 4 times as much.

We looked at other possible ways to prevent this scenario from recurring, but came up empty-handed. A single device failure (a temporary one at that) should not bring down all systems relying on that storage system, and we'll continue to look into ways we can make the system more robust and resilient.

Since the problematic SSD is still very early in its expected useful life (and under warranty), we're getting a replacement at no charge. We should know soon whether we're getting the same exact model or a newer model. Sometimes manufacturers opt to supply newer models as warranty replacements if they don't have any of the old model on hand.


andy
Posted: Jul 26 2016, 04:22 PM

Short non-tech version of this post: we think we fixed it permanently.

More technical details for those who like such things:

We'd been searching for, and finally located, some obscure settings related to storage device command timeouts and the number of retries. They defaulted to a 60-second timeout and 5 retries. That's 5 minutes a device could be unavailable before the storage system gives up and reports a problem with it. While there are some cases out there where this might be reasonable, for our application, if a device doesn't respond after 5 seconds and 3 retries, we consider it faulty and in need of remedial action.

Now, with the new settings, if a storage device doesn't respond within 15 seconds (5 seconds × 3 retries), it is marked as faulted and storage IO continues. This prevents higher layers from freaking out because storage IO has been blocked for 30 seconds or more (60 seconds for some) and keeps everything running. The redundancy does its job and everybody is happy.
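For anyone who wants to check the arithmetic, here is a minimal Python sketch of the comparison described in this post. It follows the post's own calculation (timeout multiplied by retries) and compares the result with the 30-second and 60-second tolerances mentioned for the higher layers. The class and function names are made up for illustration and are not part of any storage system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimeoutPolicy:
    """Per-command timeout and retry count (illustrative names only)."""
    timeout_s: int
    retries: int

    def worst_case_stall_s(self) -> int:
        # Same arithmetic as the post: timeout multiplied by retries.
        return self.timeout_s * self.retries


def faulted_in_time(policy: TimeoutPolicy, tolerance_s: int) -> bool:
    """True if the device is given up on before the higher layer's tolerance runs out."""
    return policy.worst_case_stall_s() < tolerance_s


OLD = TimeoutPolicy(timeout_s=60, retries=5)  # defaults quoted in the post
NEW = TimeoutPolicy(timeout_s=5, retries=3)   # new settings quoted in the post

if __name__ == "__main__":
    for name, policy in (("old", OLD), ("new", NEW)):
        print(
            f"{name}: worst case {policy.worst_case_stall_s()} s, "
            f"within 30 s tolerance: {faulted_in_time(policy, 30)}, "
            f"within 60 s tolerance: {faulted_in_time(policy, 60)}"
        )
```

With the old values the worst case is 300 seconds, well past either tolerance. With the new values it is 15 seconds, which fits inside both.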

It is somewhat counterintuitive that forcing a device to be considered faulty in a shorter time span allows everything to keep working, but that's how this works, given all the interactions at the higher layers that rely on the storage system.


andy
Posted: Oct 7 2016, 04:09 PM

The SSD was sent back to the manufacturer under warranty. They wiped it, reinstalled the firmware, did a short self-test, and sent it back to us last week. We're not thrilled with their handling of this issue (wasted everyone's time and shipping money), but, if the drive doesn't act up, we're happy to continue using it. Time will tell.

After reading through a bunch of code, we determined that the "fix" we applied previously (shorter command timeouts and fewer retries) would not actually solve the problem, so we continued researching.

We did eventually find a solution (no thanks to the open source community, which ignored our polite questions completely - not that a closed solution would have been better). Technical details follow for those who want them.

The SSDs themselves have a programmable value that tells the SSD firmware how long it is allowed to attempt to complete an IO operation before returning a failure code to the operating system. Usually, an enterprise/datacenter storage device will have this value set to something like 7 seconds from the factory. These SSDs, in spite of being "datacenter" level SSDs, had that setting set to "disabled", which effectively says it can take however much time it wants. Hence the trouble we had when one of the flash cells went wonky and that flash cell had to be replaced from one of the spare flash cells.

We set the value to 7 seconds and created a way for the value to persist across reboots, which should alleviate the trouble. We're taking it on faith that the firmware will honor the value and behave properly in response to that value since there is no possible way to test this. Again, time will tell, but we have a reasonably high level of certainty that this fix will actually work and prevent further trouble.
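The post does not name the setting or the tool used to change it. On ATA drives, a setting that matches this description is SCT Error Recovery Control, which smartctl (from smartmontools) can set with `-l scterc,READTIME,WRITETIME`, where both times are given in tenths of a second. Assuming that is the mechanism, and purely as a sketch, the command for a 7-second limit could be built like this in Python. The function names and device paths are hypothetical.

```python
import subprocess
from typing import Iterable, List


def scterc_command(device: str, seconds: float) -> List[str]:
    """Build a smartctl command that sets read and write recovery limits.

    Assumption: the drive supports SCT Error Recovery Control.
    smartctl expects the limits in tenths of a second.
    """
    tenths = round(seconds * 10)
    if tenths <= 0:
        raise ValueError("limit must be positive; 0 would disable the limit")
    return ["smartctl", "-l", f"scterc,{tenths},{tenths}", device]


def apply_scterc(devices: Iterable[str], seconds: float = 7.0,
                 dry_run: bool = True) -> List[List[str]]:
    """Return the commands for each device and run them unless dry_run is set."""
    commands = [scterc_command(dev, seconds) for dev in devices]
    if not dry_run:
        for cmd in commands:
            # Needs root privileges and smartmontools installed.
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    # Hypothetical device paths; dry run only prints the commands.
    for cmd in apply_scterc(["/dev/sda", "/dev/sdb"]):
        print(" ".join(cmd))
```

On many drives this setting is lost at power-off, so a script like this would need to run at every boot (for example from a startup script) to get the persistence mentioned above. That is only one possible approach and may not be what was actually done here.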

