Civo.com inaccessible
Incident Report for Civo
Resolved
Networking hardware is definitely working OK, the storage cluster has fully rebalanced and is back alive, and we've re-enabled instance and cluster launches. If any of your instances are still in a problematic state, please reach out through the usual support channels. Thanks everyone for your patience.
Posted Nov 08, 2020 - 16:51 GMT
Update
Most of the team is calling it a night now, as it's 4am. We've got network access restored, and Civo.com and the API are back up and running - but we've had to disable instance and cluster creation temporarily while we wait for our storage cluster to finish rebalancing and have all drives accessible again. We'll all try to get some sleep and will update again when we re-enable creations. Thanks for everyone's patience.
Posted Nov 08, 2020 - 03:50 GMT
Update
We are continuing to monitor for any further issues.
Posted Nov 08, 2020 - 02:44 GMT
Update
In typical fashion, just as we were about to call it a night, the core switch died again. We replaced it with the ready one in the rack, had some configuration issues to resolve (as it's a different brand of switch) and it's back working again. Civo.com and the API are still down (our internal Galera cluster is having a major wobble), but we're working on it, as well as ensuring our storage cluster is correctly responding to writes again.
Posted Nov 08, 2020 - 02:43 GMT
Monitoring
A switch in our core switching infrastructure failed tonight, so we prepared a new switch, which is now in the rack and ready to go. However, the old switch has suddenly started behaving itself again and most instances are back working normally (as are Civo.com and the API). So rather than cause another outage by swapping over, we'll leave the new switch in the rack as a warm spare, ready to jump into action if needed. If you have a k3s node that's not working, a recycle should fix it. If you have an instance with a read-only filesystem (our storage cluster does this to protect data), a soft reboot should fix it (and if not, SSH in and run fsck /dev/vda1 and that should get it working). If those aren't working, feel free to reach out to support via the bubble in the bottom right and we'll get to it as soon as we can. We'll be continuing to investigate why this switch failure took down the whole network.
Posted Nov 08, 2020 - 00:07 GMT
Update
We've got a replacement switch being configured at the moment. We're hoping to have it in operation in the next hour or so, and will then work on any resulting issues. More updates to follow...
Posted Nov 07, 2020 - 22:21 GMT
Identified
We have identified a hardware fault and are working on replacing the faulty equipment.
Posted Nov 07, 2020 - 20:02 GMT
Update
We believe the issue is with the inbound network to the website, API and customer instances/clusters. Our networks team is investigating further and the upstream network provider has been informed.
Posted Nov 07, 2020 - 18:12 GMT
Update
We are continuing to investigate this issue.
Posted Nov 07, 2020 - 18:02 GMT
Investigating
We are aware of an issue preventing users from accessing civo.com and are investigating the cause. The user API and website are both currently inaccessible.
Posted Nov 07, 2020 - 18:02 GMT
This incident affected: API, Civo.com, and Inbound network.