Are you sure you want to ruin your entire day? (Y/N)
Preface: This happened a couple of years back and has been completely resolved since. It’s also one of those events that fellow sysadmins might relate to, the kind that makes you go back and rethink (and rewrite) your backup scripts, your policies for production machines, and how not to be stupid in general.
Today is the day after Easter.
This past Good Friday was not… well… good.
I could blame a cavalier attitude toward my work on a Friday before a holiday weekend, or I could blame the bar for keeping me there until midnight the night before, or I could blame the webdev people for breaking the DEV web page and requiring me to try to restore it from backup, but in the end I only blame myself.
I have made the joke plenty of times that “Linux doesn’t ask ‘are you sure?’ You tell Linux to delete a file and BAM, that file is GONE! There’s no sissy recycle bin. There’s no ‘are you sure?’ prompt. If you want to shoot yourself in the foot - Linux will shoot you in the most efficient way possible.” I usually follow up that joke with: “That’s also why I have this really fun ulcer and am not supposed to take aspirin anymore.”
Well, Friday I wiped out two production databases from the production server.
I typed
rm -rf /var
when I meant to type
rm -rf var
Instead of deleting the var directory tree in the current directory – that preceding slash (which was pure muscle memory) indicated my desire to delete var from the file system root.
The same /var that contains core OS files including ALL of the database instances, their backups, mail and a variety of other important ‘variable data’.
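For anyone who wants a guard against this exact typo, here is a minimal sketch of a wrapper that refuses to delete anything that does not resolve to a path inside the current directory. The function name safe_rm_rf is hypothetical and not something from the original incident, and the sketch assumes a realpath command is available (as in GNU coreutils).

```shell
# Hypothetical helper (not part of the original story): delete a directory
# tree only if it resolves to a path strictly inside the current directory.
# Assumes a `realpath` command is installed (for example, GNU coreutils).
safe_rm_rf() {
    local target here resolved
    target=$1
    if [ -z "$target" ]; then
        echo "safe_rm_rf: no target given" >&2
        return 2
    fi
    # Physical path of the current directory, with symlinks resolved.
    here=$(pwd -P) || return 2
    # Absolute path of the target, with symlinks resolved, so a link that
    # points outside the current directory is refused as well.
    resolved=$(realpath -- "$target") || return 2
    case "$resolved" in
        "$here"/*)
            rm -rf -- "$resolved"
            ;;
        *)
            echo "safe_rm_rf: refusing to delete $resolved (not inside $here)" >&2
            return 1
            ;;
    esac
}
```

With this in place, safe_rm_rf var removes ./var, while safe_rm_rf /var is refused unless the current directory happens to be the file system root's parent of that path, which it never is. It is a seat belt, not a substitute for backups.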
Years of being hyperaware of how long it takes for the cursor to come back after a command helped me catch and kill the command after about 5 seconds. But it was too late to save /var/spool. (How Linux decides the order in which to delete things still baffles me - it sure isn’t alphabetical.)
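On the ordering question, the most likely explanation is that rm -rf does not sort anything: it generally removes entries in whatever order the file system hands them back, and on file systems with hashed directory indexes (ext4, for example) that order looks random. The sketch below uses find, which also reports entries in directory order, to show the difference. The directory names are made up for the demo, and the actual order will vary by file system.

```shell
# List the immediate children of a directory in the order the file system
# returns them; find does not sort its output.
list_in_directory_order() {
    find "$1" -mindepth 1 -maxdepth 1
}

# Demo in a throwaway directory (the names are illustrative only).
demo=$(mktemp -d)
mkdir "$demo/cache" "$demo/empty" "$demo/lib" "$demo/log" "$demo/spool" "$demo/tmp"

echo "directory order:"
list_in_directory_order "$demo"

echo "alphabetical order:"
list_in_directory_order "$demo" | sort

rm -rf -- "$demo"
```

If the two listings differ, that is the same effect that decides which subdirectory an interrupted rm -rf reaches first.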
Immediately afterwards my world turned black. That familiar feeling of adrenaline and bile rushing to my head while my heartbeat almost stops.
I did my best to assess the damage. I scrolled back through my history and wrote down what directories were affected. I briefly thought about how I could cover this up long enough to fix it without anyone knowing. I decided I could not.
I quickly told my boss what happened. Not in detail, but I told him it was my fault and what was affected.
He told his boss it was a ‘hard crash’ on the file system and that I was ‘working on restoring things’. (God, I love my boss)
I set about planning my recovery. I was shaking so badly I kept making typing errors. I double- and triple-checked the status line of my screen to ensure I was on the correct server when I typed commands. I started copying things, then I canceled them. I checked the backups to see which ones I could recover from and started copying them to the DR server in case I broke the production box beyond repair.
In short - I was a hot mess. And it was only 11:00 am.
By 5:45 pm I had MySQL restored (4 and a half hours of copying and an hour and a half to restore) and sent the email.
In the interim I compared the /var tree to the other Red Hat box and started fixing things like my inability to SSH into the server (missing file in /var/empty) and the missing cron jobs and scripts (also in
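The post does not say how that comparison was done. One way to do it, sketched here under that caveat, is to list every path under a known-good copy of the tree and under the damaged tree, then print what exists only in the good one. The function name missing_paths is hypothetical, and the sketch assumes both trees are reachable as local directories (for example, the healthy box's /var mounted or copied somewhere) and that no file names contain newlines.

```shell
# Hypothetical helper: print paths that exist under the reference tree ($1)
# but are missing under the damaged tree ($2).
missing_paths() {
    local ref_list dmg_list
    ref_list=$(mktemp) || return 1
    dmg_list=$(mktemp) || return 1
    # Paths are listed relative to each root so the two lists are comparable.
    # comm needs both inputs sorted the same way, hence LC_ALL=C throughout.
    (cd "$1" && find . | LC_ALL=C sort) > "$ref_list"
    (cd "$2" && find . | LC_ALL=C sort) > "$dmg_list"
    # -23 hides lines unique to the damaged tree and lines common to both,
    # leaving only what the reference tree has and the damaged tree lacks.
    LC_ALL=C comm -23 "$ref_list" "$dmg_list"
    rm -f "$ref_list" "$dmg_list"
}
```

Called as missing_paths /path/to/reference/var /var (the first path is a placeholder), it prints one missing path per line. It compares names only, so files that exist on both sides with different contents or permissions will not show up.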
I started copying the DEV backup and went home on time for dinner. By 9:00 pm I had DEV restored as well.
Hours had gone by and I had heard no new complaints. I started googling ‘deleted /var’ to see what happened to other people when it happened. Most had no recourse but to reinstall. But most had also deleted the entire directory. I only lost a few sub-directories.
I spent this weekend rethinking my backups and re-writing cron scripts. I also took steps to bring the DR server up to date in case I have to roll everything over to it the next time I have to reboot the production box.
This weekend I slept little and thought about how I am the only one who knows about and can fix these kinds of problems. (My boss asked if it would be worth getting Red Hat involved. I told him no, mostly because I KNEW what happened and it was my fuck up.)
As of today, all is quiet. There have been no new complaints. Is everything OK?
I won’t know until the next reboot.
But I have a DR plan working and my backups are running. If I have to move this data off this box, I can do it. I also learned where my existing backups are lacking and have addressed those areas.
I hate learning lessons in this manner.
But Linux doesn’t ask ‘are you sure?’
I’ve seen message boards full of threads where people own up to their worst mistakes. I thought this might be a good contribution. I hope it’s seen as a cautionary tale to noobs and a “Thickheaded Thursday” type of story where I explain how I:
- Screwed up
- Owned up to my mistake
- Took ownership of what I did
- Figured out how to fix it
- Shared it with others