On May 30th, we attempted to deploy some new features to our Kits service. This required a multi-step deployment procedure that we've been using for over 3 years.
Our Kits service runs on Amazon EC2 instances. We have the service distributed globally in 7 regions.
Those instances function as origin servers for our CDN service. We are using Cloudflare's Load Balancer product in order to serve traffic at the edge.
When we upgrade the software, we systematically take regions offline, upgrade them, and then bring them back online. For our customers this normally results in zero downtime, and the process is seamless and unnoticeable.
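A minimal sketch of this region-by-region pattern follows. It is illustrative only and is not the actual deployment tooling: the `rolling_deploy` function, the `upgrade` callback, and the region names are all hypothetical.

```python
from typing import Callable, List, Tuple


def rolling_deploy(
    regions: List[str],
    upgrade: Callable[[str], None],
) -> List[Tuple[str, str, int]]:
    """Upgrade one region at a time, keeping every other region in service.

    Returns a log of (step, region, regions_in_service_after_step).
    """
    in_service = set(regions)
    log = []
    for region in regions:
        # Take the region out of rotation so it stops receiving traffic.
        in_service.discard(region)
        log.append(("drain", region, len(in_service)))

        # Upgrade the drained region (hypothetical callback).
        upgrade(region)

        # Bring the region back into rotation.
        in_service.add(region)
        log.append(("restore", region, len(in_service)))
    return log


if __name__ == "__main__":
    # Hypothetical region names, for illustration only.
    steps = rolling_deploy(["region-a", "region-b", "region-c"], lambda r: None)
    for step in steps:
        print(step)
```

At most one region is out of service at any moment in this sketch, which is why the procedure normally goes unnoticed by customers.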
Up until recently our deployment procedure has been stable and we haven't had any major downtime related to the deployment process itself. However, we saw different behavior today that led to a significant degradation of the service.
During the last phases of the deployment, which usually takes about an hour, we noticed that a region became overloaded during the transition from out-of-service to in-service.
The load on the now in-service region jumped to unexpected levels well above the normal traffic patterns seen. This load caused the individual servers to become unresponsive which then led to load shedding to other regions. Unfortunately, this only compounded the issue as the pattern repeated in the fallback region.
With increased and surging loads in various regions, a cycle of in-service, surge load, and instance failure continued until the entire Kits service was unstable and no viable origin servers were available to service requests.
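The failure cycle described above can be illustrated with a toy simulation. This is a simplified model under stated assumptions, not a reconstruction of the real traffic: the capacities, loads, and region names are made up, and the model assumes that all of a failed region's load is shed to a single fallback region.

```python
from typing import Dict, List, Tuple


def simulate_cascade(
    capacity: Dict[str, float],
    loads: Dict[str, float],
    order: List[str],
) -> Tuple[List[str], List[str]]:
    """Return (failed_regions_in_order, surviving_regions).

    Assumption: when a region's load exceeds its capacity it fails, and all
    of its load is shed to the first remaining healthy region in `order`.
    """
    loads = dict(loads)  # copy so the caller's data is not mutated
    healthy = list(order)
    failed: List[str] = []

    changed = True
    while changed and healthy:
        changed = False
        for region in list(healthy):
            if loads[region] > capacity[region]:
                healthy.remove(region)
                failed.append(region)
                if healthy:
                    # Shed the failed region's load to the fallback region.
                    loads[healthy[0]] += loads[region]
                loads[region] = 0.0
                changed = True
                break  # re-evaluate from the top with the new loads
    return failed, healthy


if __name__ == "__main__":
    # Hypothetical numbers: 7 regions, each able to handle 100 units.
    regions = [f"region-{i}" for i in range(1, 8)]
    capacity = {r: 100.0 for r in regions}
    loads = {r: 70.0 for r in regions}
    loads["region-1"] = 140.0  # surge on the region returning to service

    failed, surviving = simulate_cascade(capacity, loads, regions)
    print("failed:", failed)
    print("surviving:", surviving)
```

In this model a single overloaded region is enough to take down every region in turn, because each fallback inherits more load than the one before it. Without the initial surge, no region exceeds its capacity and nothing fails.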
During the Kits service failure we also began to see load increase on one of the database servers used by fontawesome.com. The reason for this is unknown right now, but we suspect there is some indirect tie that needs to be found and corrected.
The additional load on the database caused the fontawesome.com website to stop responding to most requests that require database connectivity.
At this point our mitigation strategy was two-fold:
Our team began performing the steps necessary to scale up to handle the surge load. After these steps were complete, both the Kits service and the fontawesome.com site began functioning as normal.
We are still investigating the link between Kits service load and the fontawesome.com database server.
We will also be working with Cloudflare to understand what changes might have contributed to this issue.
Over the next few days and weeks (however long it takes) we will look at this issue as a team and determine what steps we need to take to prevent this type of failure in the future. We understand that our customers rely on us and on high availability, especially for the Kits service. When we are down we lose trust and fail to provide the level of service that we've pledged to you. That’s not acceptable to us and we know it’s not acceptable to our customers either.
If you have any questions please feel free to email us at hello@fontawesome.com.