Downtime for kit.fontawesome.com

Incident Report for Fort Awesome

Postmortem

At approximately 1:05 a.m. Central, the primary Redis database, which is hosted by Scalegrid.io, reported a failover event. This event indicated that the primary Redis instance was no longer serviceable and that the secondary instance was being promoted.

The Kits service is deployed in 7 geographical regions around the globe. Each region has multiple application servers, and each of those has a replica of the primary Redis database. While the service was designed to handle intermittent disconnection from the primary Redis database, it appears that the promotion following the failover event caused the replicas to go offline.

Once the Redis replica went offline for a particular region, our monitoring and disaster recovery tools began trying to work around the situation. We use Nomad for scheduling jobs, and after health checks started failing, Nomad restarted the job.
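The restart behavior described above can be illustrated with a minimal sketch. This is not Nomad's implementation or our job configuration; the `RestartPolicy` class and the limit of 3 consecutive failures are hypothetical and only model the general idea of restarting a job after repeated failed health checks.

```python
class RestartPolicy:
    """Hypothetical model: request a restart after `limit` consecutive failed checks."""

    def __init__(self, limit: int) -> None:
        if limit < 1:
            raise ValueError("limit must be at least 1")
        self.limit = limit
        self.consecutive_failures = 0

    def observe(self, healthy: bool) -> bool:
        """Record one health check result; return True when a restart is due."""
        if healthy:
            # A passing check clears the failure streak.
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.limit:
            # Start counting again after the restart is requested.
            self.consecutive_failures = 0
            return True
        return False


policy = RestartPolicy(limit=3)  # assumed threshold, for illustration only
checks = [True, False, False, True, False, False, False]
restarts = [policy.observe(result) for result in checks]
```

In this sketch a single passing check resets the streak, so only an unbroken run of failures leads to a restart.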

Without any intervention from our operations team, the service came back online 6 minutes after it went offline. During the downtime, some requests still succeeded because the Cloudflare cache held valid records for some resources.

Our team has identified that Redis failover events are not handled as well as they could be. Ideally, the distributed Redis replicas would continue operating until the new primary Redis database is elected and takes over from the old one.
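One way to picture the goal of continuing to operate during a failover is a reader that falls back to the last value it successfully read. The sketch below is an assumption-laden illustration, not the Kits service code: `StaleTolerantReader`, the `fetch` callable, the 300-second staleness window, and the use of Python's built-in `ConnectionError` are all hypothetical.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class StaleTolerantReader:
    """Hypothetical reader: serve the last good value while the replica is unreachable."""

    def __init__(
        self,
        fetch: Callable[[str], str],
        max_stale_seconds: float = 300.0,  # assumed window, for illustration
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._max_stale = max_stale_seconds
        self._clock = clock
        self._last_good: Dict[str, Tuple[str, float]] = {}

    def get(self, key: str) -> Optional[str]:
        try:
            value = self._fetch(key)
        except ConnectionError:
            # Replica unreachable: fall back to a remembered value if it is fresh enough.
            cached = self._last_good.get(key)
            if cached is None:
                return None
            stale_value, stored_at = cached
            if self._clock() - stored_at > self._max_stale:
                return None
            return stale_value
        # Successful read: remember the value and when it was read.
        self._last_good[key] = (value, self._clock())
        return value
```

The staleness window bounds how long old data is served, which trades a short period of possibly outdated responses for availability while a new primary is promoted.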

If you have any questions, please feel free to email us at hello@fontawesome.com.

Posted Mar 03, 2021 - 17:27 UTC

Resolved

This incident has been resolved.

Posted Feb 28, 2021 - 08:04 UTC

Investigating

At approximately 1:05 a.m. Central Time, we began seeing alerts from our monitoring systems indicating downtime for kit.fontawesome.com. Upon investigating, we saw that our Redis database cluster was experiencing a failover event, which led to about 6 minutes of downtime for the service while a replica was being promoted to primary. Systems self-healed and came back online without operator intervention. We'll be investigating the root cause over the next few days.

Posted Feb 28, 2021 - 07:41 UTC

This incident affected: kit.fontawesome.com.