Postmortem: Paris DC hosting outage

On Tuesday, 7 October, we experienced a series of serious incidents affecting some of the storage units in our Parisian datacenter. These incidents caused two interruptions in service for some of our customers, affecting both Simple Hosting instances and IaaS servers.

The combined effect of these interruptions represents the most serious hosting outage we've had in three years.

First and foremost, we want to apologize. We understand how disruptive this was for many of you, and we want to make it right.

In accordance with our Service Level Agreement, we will be issuing compensation to those whose services were unavailable.

Here's what happened:

On Tuesday, October 7, shortly before 8:00 p.m. Paris time (11:00 a.m. PDT), a storage unit in our Parisian datacenter housing a part of the disks of our IaaS servers and Simple Hosting instances became unresponsive.

At 8:00 p.m., after ruling out the most likely causes, we made the decision to switch to the backup equipment.

At 9:00 p.m., after one hour of importing data, the operation was interrupted. A lengthy investigation followed, and we eventually fell back to the original storage unit. Our team, having determined the culprit to be the caching equipment, proceeded to replace the write journal disk.

At 2:00 a.m., the storage unit whose disk had been replaced was rebooted.

Between 3:00 and 5:30 a.m., the recovery from a 6-hour outage caused a heavy overload, both on the network level and on the storage unit itself. The storage unit became unresponsive, and we were forced to restart the VMs in waves.
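To illustrate what restarting VMs "in waves" can look like, here is a minimal sketch. It is not our actual tooling: the `restart` callback, the wave size, and the pause between waves are all hypothetical names and values chosen for the example. The idea is simply to bound how many VMs hit the storage unit at the same moment.

```python
import time
from typing import Callable, Iterable, Iterator, List


def waves(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield successive batches of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    batch: List[str] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        # Last, possibly smaller, wave.
        yield batch


def restart_in_waves(
    vm_ids: Iterable[str],
    restart: Callable[[str], None],  # hypothetical callback, one call per VM
    wave_size: int = 50,             # illustrative value only
    pause_seconds: float = 0.0,      # illustrative value only
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Restart VMs one wave at a time, pausing between waves.

    Returns the number of waves that were run.
    """
    wave_count = 0
    for wave in waves(vm_ids, wave_size):
        if wave_count > 0:
            # Let storage and network load settle before the next wave.
            sleep(pause_seconds)
        for vm_id in wave:
            restart(vm_id)
        wave_count += 1
    return wave_count


if __name__ == "__main__":
    restarted: List[str] = []
    ids = [f"vm-{i}" for i in range(7)]
    print(restart_in_waves(ids, restarted.append, wave_size=3))
```

In practice the pause would be tuned against the load observed on the storage unit, which this sketch does not measure.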

At 8:30 a.m., all the VMs and instances were once again functional, with a few exceptions which were handled manually.

We inspected our other storage units that were using the same model of disk, replacing one of them as a precaution.

At 12:30 p.m., we began investigating some slight misbehavior exhibited by the storage unit whose drive we had replaced as a precaution.

At 3:50 p.m., three virtual disks and a dozen VMs became unresponsive. We investigated and identified the cause, and proceeded to update the storage unit while our engineers worked on the fix.

Unfortunately, this update caused an unexpected automatic reboot, causing another interruption for the other Simple Hosting instances and IaaS servers on that storage unit.

By 4:15 p.m., all Simple Hosting instances were functional again, but there were problems remounting IaaS disks. By 5:30 p.m., 80% of the disks were accessible again, with the rest following by 5:45 p.m.

This latter incident lasted about two hours (4:00 to 6:00 p.m.). During this time, all hosting operations (creating, starting, or stopping servers) were queued.

Due to the large number of queued operations, it took until 7:30 p.m. for all of them to complete.

These incidents have seriously impacted the quality of our service, and for this we are truly sorry. We have already begun taking steps to minimize the consequences of such incidents in the future, and are working on tools to more accurately predict the risk of such hardware failures.

We are also working on a customer-facing tool for incident tracking which will be announced in the coming days.

Thank you for using Gandi, and please accept our sincere apologies. If you have any questions, please do not hesitate to contact us.

The Gandi team

[Resolved] Storage unit incident underway

Following an incident on a storage unit, we had to reboot it in order to complete an update required to fix the problem.

All operations will be paused until the unit is running normally again.

In the meantime, please do NOT launch any operation on your server(s). The situation will return to normal shortly.

20:00 CEST, 11:00 Pacific: Incident officially resolved, all operations back to normal.

[RESOLVED] Incident: storage unit [Paris datacenter]

An incident has occurred on one of our storage units in the Parisian datacenter. Our technical team is working to resolve the issue as quickly as possible.

Please do not perform any operations on your virtual machines in the meantime. Services should be restored automatically once the issue has been corrected.

We will update this post as new information arises.

Update Tue Oct 7 19:28:19 UTC: Some faulty hardware has been identified; we're in the process of swapping it out.

Update Tue Oct 7 22:33:14 UTC: Our technical team is still trying to fix the issue.

Update Tue Oct 7 23:35:44 UTC: A ZIL disk has failed, and its failover also failed. We're currently performing a manual switchover, and are proceeding very carefully to minimize the risk of data loss.

Most importantly: we understand how disruptive this is for you and we're working as hard as we can to fix it. We will do our best to make it right.

Update Wed Oct 8 00:39:21 UTC: Our technical team is bringing the storage unit back up. The incident is nearly resolved and services are already beginning to come back online.

Update 02:31:10 UTC: We're now seeing high loads on the problematic filer. The investigation continues!

Update 04:05:54 UTC: After working all night, our technical team in Paris has resolved the problem. Services should now be back to normal.

A postmortem and compensation details, as described in our IaaS Hosting Contract (section 2.2), will be provided in the days to come.

Update Thu Oct 9 17:31:34 UTC: A postmortem about this incident is available here.

[COMPLETE] Emergency maintenance: storage unit

We will reboot a storage unit on the Paris/FR datacenter tonight.

The maintenance window will start 3 October at midnight and end at 1am CEST (3-4pm PDT, 22:00-23:00 UTC).

Update: the maintenance window has been extended by 30 minutes and is expected to end at 1:30am CEST (4:30pm PDT, 23:30 UTC).

You will not need to reboot your server (IaaS) or instance (PaaS) during this maintenance.

Sorry for the inconvenience.

Update: the maintenance ended at 2am CEST. Sorry for the delay.

[COMPLETE] Emergency maintenance on IaaS/PaaS hosting on Paris/FR datacenter

We need to perform an emergency reboot of a storage unit.

A bug is the reason for this emergency maintenance.

This maintenance will not affect the data hosted on this storage unit.

Disk I/O will resume where it stopped.

Please do not perform any operations on your VMs in the meantime.

Hosting operations will be suspended during the maintenance.

We apologize for the inconvenience this emergency maintenance may cause.

[COMPLETE] Emergency maintenance: hosting console

We will be migrating the hosting emergency console.

This maintenance will take place on Monday, 23 June 2014 (CEST, Paris time).

It should take a few minutes. During that time, connections to the IaaS (VM) emergency console and the PaaS/Simple Hosting instance console, as well as the IaaS statistics in the web control panel, will be affected.

We apologize for the inconvenience this maintenance may cause.

[RESOLVED] Emergency maintenance: internal storage

Our operations team will perform an emergency maintenance on storage units today, Monday 9 June 2014.

This maintenance may impact the following services:

www.gandi.net: access to the control panel, orders
mail.gandi.net: IMAP, POP, SMTP, webmail
Web redirections
GandiSite (Basekit) / Sitemaker
DNS zone provisioning
Gandi Groups / Gandi wiki
Emergency console / DNS cache for hosting datacenter in Paris
SimpleHosting (PaaS) virtualhost provisioning

The maintenance window will be four hours, and will occur between 2 and 6pm PDT (11pm to 3am CEST/Paris).

We apologize for the inconvenience this maintenance may cause.

Updates:

2:00pm PDT [11:00 PM CEST, 21:00 UTC]: start of the maintenance

2:47pm PDT: end of the maintenance
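For readers who want to cross-check the times above, the following sketch converts the start of the maintenance between zones and computes how long it actually lasted. It assumes fixed offsets (PDT = UTC-7, CEST = UTC+2), which hold for June 2014 because daylight saving time was in effect in both regions; the helper name `fmt` is ours.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets: an assumption valid for 9 June 2014 (summer time in both zones).
UTC = timezone.utc
CEST = timezone(timedelta(hours=2), "CEST")
PDT = timezone(timedelta(hours=-7), "PDT")

# Times taken from the update log above.
start = datetime(2014, 6, 9, 21, 0, tzinfo=UTC)        # 21:00 UTC
actual_end = datetime(2014, 6, 9, 14, 47, tzinfo=PDT)  # 2:47pm PDT

# The announced window was four hours long.
scheduled_end = start + timedelta(hours=4)


def fmt(moment: datetime, zone: timezone) -> str:
    """Render an aware datetime in the given zone."""
    return moment.astimezone(zone).strftime("%Y-%m-%d %H:%M %Z")


print(fmt(start, PDT))           # 2014-06-09 14:00 PDT
print(fmt(start, CEST))          # 2014-06-09 23:00 CEST
print(fmt(scheduled_end, CEST))  # 2014-06-10 03:00 CEST
print(actual_end - start)        # 0:47:00
```

The maintenance therefore finished 47 minutes after it started, well inside the announced window.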

[RESOLVED] Network incident in progress

We are currently experiencing a routing issue causing a part of our network to be unreachable from certain locations.

Our technical team is analyzing the problem in order to correct it as quickly as possible.

We will provide updates as the situation evolves.

UPDATE 05:15: The rollout of a new network configuration disrupted our network. We apologize for the inconvenience this issue may have caused.

Maintenance on Gandi Cloud VPS

Following the discovery of an intermittent but serious network issue, Gandi teams have determined that a rolling maintenance will be necessary.

We regret that, while most of the affected systems will simply require migration and no interruption in service, some will almost certainly require restarting. We will endeavor to make these interruptions as short as possible, and only perform them when absolutely necessary.

We are starting with dc0 (Paris) this week. We will proceed on to dc2 (Luxembourg) on Monday, June 9.

The issue has not been detected on dc1 (Baltimore) so far but, if necessary, we will fix it there as well. We apologize for any inconvenience this may cause.

[Resolved] Service issue with Gandi Cloud VPS

Here's the incident history:

* 08:25 UTC 12 hosting nodes are made inaccessible due to a switch failure. ~200 Virtual machines (VMs) are made unreachable.

* 08:40 UTC Switches are recovered and VMs are once again accessible. Investigation does not reveal cause of incident.

* 12:01 UTC A second incident occurs, affecting 8 nodes and ~180 VMs.

* 12:09 UTC Switches are recovered, VMs are made available again. Additional data collection measures are put in place to help determine cause.

* 14:56 UTC A third incident occurs, affecting 10 nodes and 321 VMs.

* 15:10 UTC Nodes and VMs are available again. This time, extensive forensic data has been collected, and we expect to find the root cause and implement a permanent fix as soon as possible.
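As a quick summary of the history above, this sketch computes how long each of the three interruptions lasted, using only the timestamps and counts listed. The VM counts for the first two incidents are approximate in the original log, and the tuple layout and the helper name `minutes_between` are ours.

```python
from datetime import datetime

# (start UTC, end UTC, nodes affected, VMs affected) from the incident history.
# VM counts for the first two incidents are approximate ("~200", "~180").
incidents = [
    ("08:25", "08:40", 12, 200),
    ("12:01", "12:09", 8, 180),
    ("14:56", "15:10", 10, 321),
]


def minutes_between(start: str, end: str) -> int:
    """Minutes between two same-day HH:MM timestamps."""
    clock = "%H:%M"
    delta = datetime.strptime(end, clock) - datetime.strptime(start, clock)
    return int(delta.total_seconds() // 60)


durations = [minutes_between(start, end) for start, end, _, _ in incidents]
total = sum(durations)

for (start, end, nodes, vms), minutes in zip(incidents, durations):
    print(f"{start}-{end} UTC: {minutes} min, {nodes} nodes, about {vms} VMs")
print(f"Total downtime across the three incidents: {total} min")
```

The VM counts are deliberately not summed, since the same VMs may have been affected by more than one incident.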

We do apologise for the inconvenience this issue may have caused.

Gandi
