In short, spammers were abusing GitLab's systems, and when an admin went to delete a directory, he ran the command on the wrong server, wiping the production database and taking the whole site down while the team scrambled to recover from backups.
I bet this happens more often than is publicized.
Big OUCH on the S3 backups not working!
https://about.gitlab.com/2017/02/01/gitlab-dot-com-database-incident/
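A dumb periodic check would have caught that empty bucket long before anyone needed it. Something along these lines (just a sketch in Python/boto3; the bucket name, prefix, and thresholds are made up, not GitLab's actual setup):

```python
# Rough sketch: alert if the most recent backup in S3 is missing, stale, or suspiciously small.
# Bucket name, prefix, and thresholds are placeholders for illustration.
from datetime import datetime, timedelta, timezone

import boto3

BUCKET = "example-db-backups"         # hypothetical bucket
PREFIX = "postgres/"                  # hypothetical key prefix
MIN_SIZE_BYTES = 100 * 1024 * 1024    # a real pg dump should be far bigger than a few bytes
MAX_AGE = timedelta(hours=26)         # one daily backup plus some slack


def check_latest_backup():
    s3 = boto3.client("s3")
    resp = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX)
    objects = resp.get("Contents", [])
    if not objects:
        raise RuntimeError("No backups found at all -- the bucket/prefix is empty")

    latest = max(objects, key=lambda o: o["LastModified"])
    age = datetime.now(timezone.utc) - latest["LastModified"]
    if latest["Size"] < MIN_SIZE_BYTES:
        raise RuntimeError(f"Latest backup {latest['Key']} is only {latest['Size']} bytes")
    if age > MAX_AGE:
        raise RuntimeError(f"Latest backup {latest['Key']} is {age} old")
    print(f"OK: {latest['Key']} ({latest['Size']} bytes, {age} old)")


if __name__ == "__main__":
    check_latest_backup()
```

Wire that into cron or the monitoring system and an empty bucket pages somebody instead of being discovered mid-outage. Here are the relevant bits from their write-up: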
First Incident
At 2017/01/31 18:00 UTC, we detected that spammers were hammering the database by creating snippets, making it unstable. We then started troubleshooting to understand the problem and how to fight it.
...
Second Incident
At 2017/01/31 22:00 UTC - We got paged because DB Replication lagged too far behind, effectively stopping. This happened because there was a spike in writes that were not processed on time by the secondary database.
...
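For context, the kind of lag they got paged about can be measured directly on the secondary. A rough sketch (connection string and threshold are made up):

```python
# Rough sketch: measure replication lag on a PostgreSQL 9.x secondary.
# Connection details and threshold are placeholders, not GitLab's actual setup.
# Known caveat: on an idle primary the replay timestamp stops advancing,
# so this number can look worse than it really is.
import psycopg2


def replication_lag_seconds(dsn: str) -> float:
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute(
                """
                SELECT CASE
                         WHEN pg_is_in_recovery()
                         THEN COALESCE(
                                EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp()),
                                0)
                         ELSE 0
                       END
                """
            )
            return float(cur.fetchone()[0])


if __name__ == "__main__":
    lag = replication_lag_seconds("host=db2.example.com dbname=postgres user=monitor")
    print(f"replication lag: {lag:.1f}s")
    if lag > 300:  # arbitrary threshold for the sketch
        raise SystemExit("lag exceeds threshold; page someone")
```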
Third Incident
At 2017/01/31 23:00-ish team-member-1 thinks that perhaps pg_basebackup is refusing to work due to the PostgreSQL data directory being present (despite being empty), and decides to remove the directory. After a second or two he notices he ran it on db1.cluster.gitlab.com instead of db2.cluster.gitlab.com.
At 2017/01/31 23:27 team-member-1 terminates the removal, but it's too late. Of around 300 GB only about 4.5 GB is left.
We had to bring GitLab.com down and shared this information on Twitter.
...
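Side note from me: that wrong-server rm is exactly the kind of mistake a tiny pre-flight guard prevents. A rough sketch of the idea (the db2 hostname is from their write-up; the path and wrapper script are hypothetical):

```python
# Sketch of a guard around a destructive cleanup: refuse to touch the data directory
# unless we are on the host we expect, and make the operator type the hostname back.
# The data directory path is illustrative.
import shutil
import socket
import sys
from pathlib import Path

EXPECTED_HOST = "db2.cluster.gitlab.com"            # the secondary we *intend* to wipe
DATA_DIR = Path("/var/opt/gitlab/postgresql/data")  # illustrative path


def remove_data_dir():
    host = socket.getfqdn()
    if host != EXPECTED_HOST:
        sys.exit(f"Refusing to run: this is {host}, expected {EXPECTED_HOST}")

    answer = input(f"About to delete {DATA_DIR} on {host}. Type the hostname to confirm: ")
    if answer.strip() != host:
        sys.exit("Confirmation did not match hostname; aborting.")

    shutil.rmtree(DATA_DIR)
    print(f"Removed {DATA_DIR} on {host}")


if __name__ == "__main__":
    remove_data_dir()
```

Nothing fancy, just enough friction that "right command, wrong terminal" fails loudly instead of eating the primary.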
Problems Encountered
- LVM snapshots are by default only taken once every 24 hours. Team-member-1 happened to run one manually about 6 hours prior to the outage because he was working on load balancing for the database.
- Regular backups also seem to be taken only once per 24 hours, though team-member-1 has not yet been able to figure out where they are stored. According to team-member-2 these don’t appear to be working, producing files only a few bytes in size.
- Team-member-3: It looks like pg_dump may be failing because PostgreSQL 9.2 binaries are being run instead of 9.6 binaries. This happens because omnibus only uses Pg 9.6 if data/PG_VERSION is set to 9.6, but on workers this file does not exist, so it defaults to 9.2 and fails silently. No SQL dumps were made as a result. The Fog gem may have cleaned out older backups. (A version check like the sketch after this list would make this fail loudly.)
- Disk snapshots in Azure are enabled for the NFS server, but not for the DB servers.
- The synchronisation process removes webhooks once it has synchronised data to staging. Unless we can pull these from a regular backup from the past 24 hours, they will be lost.
- The replication procedure is super fragile, prone to error, relies on a handful of random shell scripts, and is badly documented.
- Our backups to S3 apparently don’t work either: the bucket is empty.
- So in other words, out of 5 backup/replication techniques deployed, none were working reliably or even set up in the first place. We ended up restoring a backup that was 6 hours old.
- pg_basebackup will silently wait for a master to initiate the replication process; according to another production engineer this can take up to 10 minutes, which can lead one to think the process is somehow stuck. Running the process under “strace” provided no useful information about what might be going on.
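On the pg_dump item above: the root problem is a version mismatch that nothing verifies before the backup job runs. A pre-flight check along these lines would turn the silent failure into a loud one (paths are placeholders; PG_VERSION is the file PostgreSQL keeps in its data directory, and the parsing assumes the 9.x-era major.minor scheme in use at the time):

```python
# Sketch: refuse to run a backup if the pg_dump binary's major version does not match
# the cluster's data/PG_VERSION file. Paths below are placeholders.
import re
import subprocess
import sys
from pathlib import Path

PG_DUMP = "/usr/bin/pg_dump"                                   # whichever binary the backup job uses
PG_VERSION_FILE = Path("/var/opt/gitlab/postgresql/data/PG_VERSION")


def binary_major_version() -> str:
    out = subprocess.run([PG_DUMP, "--version"], capture_output=True, text=True, check=True)
    # Output looks like "pg_dump (PostgreSQL) 9.6.1"; grab the 9.x-style "major.minor" part.
    match = re.search(r"(\d+\.\d+)", out.stdout)
    if not match:
        sys.exit(f"Could not parse version from: {out.stdout!r}")
    return match.group(1)


def cluster_major_version() -> str:
    if not PG_VERSION_FILE.exists():
        sys.exit(f"{PG_VERSION_FILE} is missing -- cannot tell what version the cluster is")
    return PG_VERSION_FILE.read_text().strip()


if __name__ == "__main__":
    binary, cluster = binary_major_version(), cluster_major_version()
    if binary != cluster:
        sys.exit(f"pg_dump is {binary} but the cluster is {cluster}; aborting backup")
    print(f"Versions match ({binary}); safe to dump")
```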

