Our infrastructure has experienced several noticeable slowdowns today, 7 February 2014, due to ongoing DDoS attacks. Our teams are working to mitigate them.
Updates will be provided here as the situation evolves.
Only Simple Hosting instances located in our Baltimore data center may currently be experiencing issues. Our technical staff is investigating. Please do not perform any operations on your instance in the meantime.
This post will be updated as the situation evolves.
Update 00:51:20 CET:
A member of our technical staff is currently onsite in Baltimore to address the problem.
Update 01:35:13 CET:
The issue has been resolved. Services should now be operating normally.
The incident of November 11th is part of a series of incidents over the past few weeks caused by the gateway units, which provide Internet access for the Simple Hosting instances.
The Simple Hosting platform has experienced a number of different issues, principally with the gateway equipment, which appears to be the weakest link in the architecture. It is subject to:
HSRP instability causing short interruptions in connectivity,
Saturation of NAT translation tables as a result of a number of factors, including DDoS attacks and customer activity (a simple saturation check is sketched after this list),
High CPU usage under certain conditions.
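To illustrate the NAT-saturation point above: the sketch below (Python, purely illustrative and not our actual tooling) alerts when a gateway's translation-table utilization crosses a threshold. The gateway name, the capacity figure and the thresholds are hypothetical, and how the counts are collected (SNMP, CLI scraping, vendor API) is deliberately left abstract.

    # Illustrative sketch only: alert when NAT translation table utilization
    # on a gateway crosses a threshold. How the counts are obtained is left
    # abstract on purpose.

    WARN_THRESHOLD = 0.80   # warn at 80% of table capacity (assumed value)
    CRIT_THRESHOLD = 0.95   # critical at 95% (assumed value)

    def nat_utilization(active_translations: int, table_capacity: int) -> float:
        """Return the fraction of the NAT translation table currently in use."""
        if table_capacity <= 0:
            raise ValueError("table capacity must be positive")
        return active_translations / table_capacity

    def check_gateway(name: str, active_translations: int, table_capacity: int) -> str:
        """Classify a gateway's NAT table usage as OK / WARN / CRIT."""
        usage = nat_utilization(active_translations, table_capacity)
        if usage >= CRIT_THRESHOLD:
            return f"CRIT {name}: {usage:.0%} of NAT table in use"
        if usage >= WARN_THRESHOLD:
            return f"WARN {name}: {usage:.0%} of NAT table in use"
        return f"OK   {name}: {usage:.0%} of NAT table in use"

    if __name__ == "__main__":
        # Hypothetical figures; during a DDoS the active count spikes.
        print(check_gateway("gw-baltimore-1", active_translations=230_000, table_capacity=256_000))

Run against a count snapshot taken during a DDoS, a check of this kind gives an early signal before translations start being dropped.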
What will Gandi do to fix the situation, replace this gateway, and improve the Simple Hosting product?
Replace the network equipment that provides the Internet gateway for the Simple Hosting product with more powerful appliances and a greater number of units (scaling). The new units will better handle the current load and will support the growth of Simple Hosting instances in the near future,
Set up a deeper level of monitoring to better detect technical problems,
Implement advanced monitoring to detect abuse from specific instances, enabling our technical team to react more quickly and handle such abuse before it impacts the quality of service for all other customers (a minimal example of this kind of check is sketched below).
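As a rough illustration of that last point, and not a description of our actual monitoring: the sketch below flags instances whose active connection count on a gateway exceeds a fixed ceiling. The instance IDs, the ceiling and the input format are all assumptions.

    # Illustrative sketch only: flag Simple Hosting instances that hold a
    # disproportionate share of gateway connections, so they can be looked at
    # before they degrade service for everyone else.

    from collections import Counter
    from typing import Dict, List

    MAX_CONNECTIONS_PER_INSTANCE = 5_000  # assumed per-instance ceiling

    def flag_abusive_instances(connections: Dict[str, int]) -> List[str]:
        """Return instance IDs whose active connection count exceeds the ceiling."""
        return [instance for instance, count in connections.items()
                if count > MAX_CONNECTIONS_PER_INSTANCE]

    if __name__ == "__main__":
        # Hypothetical snapshot of per-instance connection counts on one gateway.
        snapshot = Counter({"sh-12345": 120, "sh-67890": 18_400, "sh-24680": 950})
        for instance in flag_abusive_instances(snapshot):
            print(f"ALERT: instance {instance} has {snapshot[instance]} active connections")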
We apologise for the inconvenience; please be assured that our teams are endeavouring to correct these issues as quickly as possible.
We experienced a hardware fault on routing equipment on the Simple Hosting platform. Below is a chronology of the events:
- 20:06 UTC: CPU load on the equipment shows a significant increase.
- 20:06 UTC: The equipment is running at 100% CPU for no apparent reason and has stopped responding to commands.
- 20:08 UTC: We made the decision to migrate to the secondary equipment.
- 20:08 UTC: The secondary equipment exhibits the same symptoms as the primary, so traffic was not transferred.
- 20:09 UTC: Debugging underway to ascertain the cause of the problem.
- 20:26 UTC: Migration to the now-stabilised secondary equipment.
- 20:27 UTC: Service returned to nominal operation.
- 22:42 UTC: Following this incident, there was a secondary effect on DNS resolution: Simple Hosting instances had been failing to resolve DNS since 20:06 UTC. The problem is now resolved.
The network equipment used for the gateways of this service is clearly showing signs of weakness. An in-depth analysis of the anomaly and the behaviour of the primary unit (likely due to a memory fault) is underway. We are running on the secondary gateway for the moment.
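For completeness, here is a minimal sketch (Python, standard library only, and not our actual probe) of the kind of in-instance check that would surface the DNS-resolution side effect noted at 22:42 UTC; the probed host names and the exit-code convention are assumptions.

    # Illustrative sketch only: verify from inside an instance that DNS
    # resolution works for a handful of well-known names.

    import socket
    import sys

    PROBE_HOSTS = ["gandi.net", "example.com"]  # arbitrary, well-known names

    def dns_ok(hostname: str) -> bool:
        """Return True if the hostname resolves to at least one address."""
        try:
            return len(socket.getaddrinfo(hostname, None)) > 0
        except socket.gaierror:
            return False

    if __name__ == "__main__":
        failures = [host for host in PROBE_HOSTS if not dns_ok(host)]
        if failures:
            print(f"DNS resolution failing for: {', '.join(failures)}")
            sys.exit(1)   # non-zero exit so a monitoring agent can alert
        print("DNS resolution OK")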