At approximately 1:05 a.m. Central, the primary Redis database, which is hosted by Sendgrid.io, reported a failover event. This event indicated that the primary Redis instance was no longer serviceable and that the secondary instance was being promoted.
The Kits service is deployed in seven geographical regions around the globe. Each region has multiple application servers, and each of those servers has a replica of the primary Redis database. While the service was designed to handle intermittent disconnection from the primary Redis database, it appears that the promotion following the failover event caused the replicas to go offline.
Once the Redis replica went offline in a particular region, our monitoring and disaster recovery tools began trying to work around the situation. We use Nomad to schedule jobs, and after health checks started failing, Nomad restarted the job.
Without any intervention from our operations team, the service came back online six minutes after it went offline. During the downtime, some requests still succeeded because the Cloudflare cache held valid records for some resources.
Our team has identified that Redis failover events are not handled as well as they should be. Ideally, the distributed Redis replicas would continue operating until the new primary Redis database is elected and takes over from the old one.
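To illustrate the behavior we are aiming for, here is a minimal Python sketch of a read path that keeps answering from its local copy while the primary is unavailable. This is not the Kits service's actual code. The names ReplicaReader, PrimaryUnavailable, and fetch_from_primary are hypothetical, and the 30-second freshness window is an arbitrary assumption.

```python
import time


class PrimaryUnavailable(Exception):
    """Raised by the (hypothetical) primary client, e.g. during a failover."""


class ReplicaReader:
    """Serves reads from a local copy and tolerates primary outages.

    `fetch_from_primary` is a hypothetical callable standing in for
    whatever refreshes local data from the primary Redis database.
    """

    def __init__(self, fetch_from_primary, max_age_seconds=30.0,
                 clock=time.monotonic):
        self._fetch = fetch_from_primary
        self._max_age = max_age_seconds
        self._clock = clock
        self._cache = {}  # key -> (value, time it was fetched)

    def get(self, key):
        """Return (value, status), where status is "fresh" or "stale"."""
        cached = self._cache.get(key)
        now = self._clock()

        # Recent enough: answer locally without touching the primary.
        if cached is not None and now - cached[1] < self._max_age:
            return cached[0], "fresh"

        try:
            value = self._fetch(key)
        except PrimaryUnavailable:
            # Primary is down: keep serving the last known value
            # instead of going offline.
            if cached is not None:
                return cached[0], "stale"
            raise  # Nothing cached for this key, so we cannot answer.

        self._cache[key] = (value, now)
        return value, "fresh"
```

With this approach, a failover only affects keys that have never been read in that region. Everything else is served slightly out of date until the new primary is reachable, at which point the next read refreshes it.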
If you have any questions, please feel free to email us at hello@fontawesome.com.