In short, spammers were abusing GitLab's systems, and when an admin went to delete a directory, he ran the command on the wrong server, wiping the production database and taking the whole site down while the team scrambled to recover from backups.
I bet this happens more often than is publicized.
Big OUCH on the S3 backups not working!
https://about.gitlab.com/2017/02/01/gitlab-dot-com-database-incident/
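A dumb periodic check would have caught that empty bucket long before anyone needed it. Something along these lines (just a sketch in Python/boto3; the bucket name, prefix, and thresholds are made up, not GitLab's actual setup):

```python
# Rough sketch: alert if the most recent backup in S3 is missing, stale, or suspiciously small.
# Bucket name, prefix, and thresholds are placeholders for illustration.
from datetime import datetime, timedelta, timezone

import boto3

BUCKET = "example-db-backups"         # hypothetical bucket
PREFIX = "postgres/"                  # hypothetical key prefix
MIN_SIZE_BYTES = 100 * 1024 * 1024    # a real pg dump should be far bigger than a few bytes
MAX_AGE = timedelta(hours=26)         # one daily backup plus some slack


def check_latest_backup():
    s3 = boto3.client("s3")
    resp = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX)
    objects = resp.get("Contents", [])
    if not objects:
        raise RuntimeError("No backups found at all -- the bucket/prefix is empty")

    latest = max(objects, key=lambda o: o["LastModified"])
    age = datetime.now(timezone.utc) - latest["LastModified"]
    if latest["Size"] < MIN_SIZE_BYTES:
        raise RuntimeError(f"Latest backup {latest['Key']} is only {latest['Size']} bytes")
    if age > MAX_AGE:
        raise RuntimeError(f"Latest backup {latest['Key']} is {age} old")
    print(f"OK: {latest['Key']} ({latest['Size']} bytes, {age} old)")


if __name__ == "__main__":
    check_latest_backup()
```

Wire that into cron or the monitoring system and an empty bucket pages somebody instead of being discovered mid-outage. Here are the relevant bits from their write-up: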
First Incident
At 2017/01/31 18:00 UTC, we detected that spammers were hammering the database by creating snippets, making it unstable. We then started troubleshooting to understand the problem and how to fight it.
...
Second Incident
At 2017/01/31 22:00 UTC - We got paged because DB Replication lagged too far behind, effectively stopping. This happened because there was a spike in writes that were not processed on time by the secondary database.
...
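For context, the kind of lag they got paged about can be measured directly on the secondary. A rough sketch (connection string and threshold are made up):

```python
# Rough sketch: measure replication lag on a PostgreSQL 9.x secondary.
# Connection details and threshold are placeholders, not GitLab's actual setup.
# Known caveat: on an idle primary the replay timestamp stops advancing,
# so this number can look worse than it really is.
import psycopg2


def replication_lag_seconds(dsn: str) -> float:
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute(
                """
                SELECT CASE
                         WHEN pg_is_in_recovery()
                         THEN COALESCE(
                                EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp()),
                                0)
                         ELSE 0
                       END
                """
            )
            return float(cur.fetchone()[0])


if __name__ == "__main__":
    lag = replication_lag_seconds("host=db2.example.com dbname=postgres user=monitor")
    print(f"replication lag: {lag:.1f}s")
    if lag > 300:  # arbitrary threshold for the sketch
        raise SystemExit("lag exceeds threshold; page someone")
```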
Third Incident
At 2017/01/31 23:00-ish team-member-1 thinks that perhaps pg_basebackup is refusing to work due to the PostgreSQL data directory being present (despite being empty), and decides to remove the directory. After a second or two he notices he ran it on db1.cluster.gitlab.com instead of db2.cluster.gitlab.com.
At 2017/01/31 23:27 team-member-1 terminates the removal, but it's too late. Of around 300 GB only about 4.5 GB is left.
We had to bring GitLab.com down and shared this information on Twitter.
...
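Side note from me: that wrong-server rm is exactly the kind of mistake a tiny pre-flight guard prevents. A rough sketch of the idea (the db2 hostname is from their write-up; the path and wrapper script are hypothetical):

```python
# Sketch of a guard around a destructive cleanup: refuse to touch the data directory
# unless we are on the host we expect, and make the operator type the hostname back.
# The data directory path is illustrative.
import shutil
import socket
import sys
from pathlib import Path

EXPECTED_HOST = "db2.cluster.gitlab.com"            # the secondary we *intend* to wipe
DATA_DIR = Path("/var/opt/gitlab/postgresql/data")  # illustrative path


def remove_data_dir():
    host = socket.getfqdn()
    if host != EXPECTED_HOST:
        sys.exit(f"Refusing to run: this is {host}, expected {EXPECTED_HOST}")

    answer = input(f"About to delete {DATA_DIR} on {host}. Type the hostname to confirm: ")
    if answer.strip() != host:
        sys.exit("Confirmation did not match hostname; aborting.")

    shutil.rmtree(DATA_DIR)
    print(f"Removed {DATA_DIR} on {host}")


if __name__ == "__main__":
    remove_data_dir()
```

Nothing fancy, just enough friction that "right command, wrong terminal" fails loudly instead of eating the primary.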
Problems Encountered
- LVM snapshots are by default only taken once every 24 hours. Team-member-1 happened to run one manually about 6 hours prior to the outage because he was working on load balancing for the database.
- Regular backups also seem to be taken only once per 24 hours, though team-member-1 has not yet been able to figure out where they are stored. According to team-member-2 these don’t appear to be working, producing files only a few bytes in size.
- Team-member-3: It looks like pg_dump may be failing because PostgreSQL 9.2 binaries are being run instead of 9.6 binaries. This happens because omnibus only uses Pg 9.6 if data/PG_VERSION is set to 9.6, but on workers this file does not exist, so it defaults to 9.2 and fails silently. No SQL dumps were made as a result. The Fog gem may have cleaned out older backups. (A version check like the sketch after this list would make this fail loudly.)
- Disk snapshots in Azure are enabled for the NFS server, but not for the DB servers.
- The synchronisation process removes webhooks once it has synchronised data to staging. Unless we can pull these from a regular backup from the past 24 hours, they will be lost.
- The replication procedure is super fragile, prone to error, relies on a handful of random shell scripts, and is badly documented.
- Our backups to S3 apparently don’t work either: the bucket is empty.
- So in other words, out of 5 backup/replication techniques deployed, none were working reliably or even set up in the first place. We ended up restoring a backup that was 6 hours old.
- pg_basebackup will silently wait for a master to initiate the replication process; according to another production engineer this can take up to 10 minutes, which can lead one to think the process is somehow stuck. Running the process under “strace” provided no useful information about what might be going on.
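On the pg_dump item above: the root problem is a version mismatch that nothing verifies before the backup job runs. A pre-flight check along these lines would turn the silent failure into a loud one (paths are placeholders; PG_VERSION is the file PostgreSQL keeps in its data directory, and the parsing assumes the 9.x-era major.minor scheme in use at the time):

```python
# Sketch: refuse to run a backup if the pg_dump binary's major version does not match
# the cluster's data/PG_VERSION file. Paths below are placeholders.
import re
import subprocess
import sys
from pathlib import Path

PG_DUMP = "/usr/bin/pg_dump"                                   # whichever binary the backup job uses
PG_VERSION_FILE = Path("/var/opt/gitlab/postgresql/data/PG_VERSION")


def binary_major_version() -> str:
    out = subprocess.run([PG_DUMP, "--version"], capture_output=True, text=True, check=True)
    # Output looks like "pg_dump (PostgreSQL) 9.6.1"; grab the 9.x-style "major.minor" part.
    match = re.search(r"(\d+\.\d+)", out.stdout)
    if not match:
        sys.exit(f"Could not parse version from: {out.stdout!r}")
    return match.group(1)


def cluster_major_version() -> str:
    if not PG_VERSION_FILE.exists():
        sys.exit(f"{PG_VERSION_FILE} is missing -- cannot tell what version the cluster is")
    return PG_VERSION_FILE.read_text().strip()


if __name__ == "__main__":
    binary, cluster = binary_major_version(), cluster_major_version()
    if binary != cluster:
        sys.exit(f"pg_dump is {binary} but the cluster is {cluster}; aborting backup")
    print(f"Versions match ({binary}); safe to dump")
```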

