Fort Impostor

I have named my office “Fort Impostor” and this is my… manifesto.

“The Fraud Police are the imaginary, terrifying force of ‘real’ grown-ups who you believe – at some subconscious level – are going to come knocking on your door in the middle of the night, saying: We’ve been watching you, and we have evidence that you have NO IDEA WHAT YOU’RE DOING. You stand accused of the crime of completely winging it, you are guilty of making shit up as you go along, you do not actually deserve your job, we are taking everything away and we are TELLING EVERYBODY.”

Amanda Palmer, The Art of Asking; or, How I Learned to Stop Worrying and Let People Help

Linux doesn’t ask “Are You Sure?”

Are you sure you want to ruin your entire day? (Y/N)

Preface: This happened a couple years back and has completely resolved since. It’s also one of those events fellow sysadmins might relate to, the kind that makes you go back and rethink (and rewrite) your backup scripts, your policies for production machines and, in general, how to not be stupid.

Today is the day after Easter.

This past Good Friday was not… well… good.

I can blame a cavalier attitude toward my work on a Friday before a holiday weekend or I could blame the bar for keeping me there until midnight the night before or I could blame the webdev people for breaking the DEV web page and requiring me to try to restore it from backup, but in the end I only blame myself.

I have made the joke plenty of times that “Linux doesn’t ask ‘are you sure?’ You tell Linux to delete a file and BAM, that file is GONE! There’s no sissy recycle bin. There’s no ‘are you sure?’ prompt. If you want to shoot yourself in the foot – Linux will shoot you in the most efficient way possible.” I usually follow up that joke with: “That’s also why I have this really fun ulcer and am not supposed to take aspirin anymore.”

Well, Friday I wiped out two production databases from the production server.

Intending to delete var/www/html/web-sandbox from /localdepot I typed

rm -rf /var

instead of

rm -rf var

Instead of deleting the var directory tree in the current directory, that leading slash (which was pure muscle memory) told rm to delete /var from the file system root.
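For anyone who wants a seat belt for that leading slash, here is a minimal sketch of a wrapper that refuses to recursively delete anything sitting directly under the file system root. The name safe_rm_rf is made up for illustration and is not a standard tool. It assumes a POSIX-style shell with dirname and pwd -P available.

```shell
# Hypothetical helper (not a standard command): a guarded "rm -rf".
# It refuses any target whose parent directory resolves to "/",
# which covers "/" itself, "/var", and "var" typed while sitting in "/".
safe_rm_rf() {
    for target in "$@"; do
        # Resolve the parent directory to a physical absolute path.
        # "CDPATH=" keeps cd from wandering off via the CDPATH variable.
        parent=$(CDPATH= cd -- "$(dirname -- "$target")" 2>/dev/null && pwd -P) || {
            echo "safe_rm_rf: cannot resolve parent of '$target'" >&2
            return 1
        }
        if [ "$parent" = "/" ]; then
            echo "safe_rm_rf: refusing to remove top-level path '$target'" >&2
            return 1
        fi
    done
    # Every target passed the check; "--" stops option parsing.
    rm -rf -- "$@"
}
```

Run from /localdepot, safe_rm_rf var would go ahead, while safe_rm_rf /var would stop with an error before touching anything. GNU rm also has an -I option that prompts once before a recursive removal. Its built-in --preserve-root protection only covers / itself, not /var.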

The same /var that contains core OS files including ALL of the database instances, their backups, mail and a variety of other important ‘variable data’.

Years of being hyperaware of how long it takes for the cursor to come back after a command helped me catch and kill the command after about 5 seconds. But it was too late to save /var/lib/mysql, /var/empty and /var/spool. (How Linux decides the order in which to delete things still baffles me – it sure isn’t alphabetical)
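On the deletion order: as far as I understand it, rm does not sort anything by name. It walks each directory in whatever order the file system hands back the entries, which on most Linux file systems has little to do with the alphabet. A quick way to see the difference is to compare a directory's raw order with its sorted listing. This is only an illustration. The function name dir_order is made up, and it assumes a find that supports -mindepth and -maxdepth (GNU and BSD find both do).

```shell
# Hypothetical helper: print a directory's entries in the order the
# file system returns them (find does not sort its output).
dir_order() {
    find "${1:-.}" -mindepth 1 -maxdepth 1
}

# Compare the raw order against the alphabetical view, e.g.:
#   dir_order /var
#   ls /var
```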

Immediately afterwards my world turned black. That familiar feeling of adrenaline and bile rushing to my head while my heartbeat almost stops.

I did my best to assess the damage. I scrolled back through my history and wrote down what directories were affected. I briefly thought about how I could cover this up long enough to fix it without anyone knowing. I decided I could not.

I quickly told my boss what happened. Not in detail – but I told him it was my fault and what was affected.

He told his boss it was a ‘hard crash’ on the file system and that I was ‘working on restoring things’. (God, I love my boss)

I set about planning my recovery. I was shaking so bad I kept making typing errors. I double and triple-checked the status line of my screen to ensure I was on the correct server when I typed commands. I started copying things then I canceled them. I checked the backups to see which ones I could recover from and started them copying to the DR server in case I broke the production box beyond repair.

In short – I was a hot mess. And it was only 11:00 am.

By 5:45 pm I had MySQL restored (4 and a half hours of copying and an hour and a half to restore) and sent the email.

In the interim I compared the /var tree to the other Red Hat box and started fixing things like my inability to SSH into the server (missing file in /var/empty) and the missing cron jobs and scripts (in /var/spool).

I started copying the DEV backup and went home on time for dinner. By 9:00 pm I had the DEV restored as well.

Hours had gone by and I had heard no new complaints. I started googling ‘deleted /var’ to see how it went for other people who had done the same thing. Most had no recourse but to reinstall. But most had also deleted the entire directory. I only lost a few sub-directories.

I spent this weekend rethinking my backups and re-writing cron scripts. I also took steps to bring the DR server up to date in case I have to roll everything over to it the next time I have to reboot the production box.

This weekend I slept little and thought about how I am the only one that knows and can fix these kinds of problems. (My boss asked if it would be worth getting Red Hat involved. I told him no, mostly because I KNEW what happened and it was my fuck up.)

As of today – all is quiet. There have been no new complaints. Is everything OK?

I won’t know until the next reboot.

But I have a DR plan working and my backups are running. If I have to move this data off this box – I can do it. I also learned where my existing backups are lacking and have addressed those areas.

I hate learning lessons in this manner.

But Linux doesn’t ask ‘are you sure?’

I’ve seen message boards full of threads where people own up to their worst mistakes. I thought this might be a good contribution. I hope it’s seen as a cautionary tale to noobs and a “Thickheaded Thursday” type of story where I explain how I:

  • Screwed up
  • Owned up to my mistake
  • Took ownership of what I did
  • Figured out how to fix it
  • Shared it with others

Do a post about #2

I saw this article linked from a Facebook post this morning and was inspired to elaborate on #2 in the article about how Working from Home is not all that it’s cracked up to be. I even sent myself an email so I’d remember to do it. So, relax, past-Chuck. I’m doing it now.

I used to work in an office with only a few others. The majority of the staff was remote (including my boss and his boss).

My work/life balance was non-existent. I was constantly at the whim of people working from home in remote locations. This was the fictional, never-sent email that I wrote trying to explain why I left that job.

Dear Boss,

You were so disappointed when I told you I was leaving my job. You mentioned numerous times how I gave no indication that I was unhappy. You kept saying that the fact that I “never mentioned that anything was wrong” really bothered you.

The truth is – while I knew what it was that bothered me, I just didn’t know how to put it. Until today.

I found an article online that elaborated in a tongue-in-cheek way all the downsides of working from home. #2 struck me in a way that I don’t think the author intended. It reads:

  2. YOU NEVER REALLY STOP WORKING. In an office you have a start time and a stop time. Even if you work late, you still have that structure. When you work from home you never really stop working because there’s no separation between work and home. You’re just this slimy amoeba slithering from your home office with your work laptop, to your living room with your work laptop, and finally to your bed… with your work laptop.

While the author’s intention was to poke fun at what is otherwise thought to be a ‘perk’ in an office job, what I saw was an attitude that my remote teammates exemplified.

You never left work.

I can’t comment on how committed you are to sitting at your desks for nine hours straight. Perhaps you really do. But I strongly suspect that if your wives or children wandered in with a question – you answered them. If you were hungry – perhaps you ate. Doorbell rang? You answered.

Were those distractions? Maybe.

But I’ll bet you also walked away from your desks and lay on the couch to relax for a few minutes or work through a problem. You went to a private bathroom when needed. Maybe you took that conference call while lying on the floor?

You see, I never had that luxury. While the office was comfortable and well appointed, it was still “the office”. My day started two hours before the office opened every day. If I was lucky it ended an hour and a half after the office closed. If I was hungry I was stuck with what I brought with me or what I could afford to buy. I wandered from floor to floor sometimes to find an empty restroom.

As the end of the day approached, I would get ready to leave “the office” only to be told on different occasions that there was an issue that required my attention “right now”.

I received several calls between 5:45pm and 6:00pm to inform me a potential customer wanted after-hour service.

I was to perform server updates or configuration changes just before closing.

I had last minute meetings to go over this weekend’s maintenance tasks.

While you were all “working from home”, you were, in fact, home.

Me? My daily routine was non-existent.

Dinner time at my house was randomly between 7pm and 10pm. (Cold McDonald’s at 9:30pm on more occasions than I’d like to admit.)

Social plans? Nope, not until I know if I am coming home on time tonight.

While there were indeed other reasons I left, today I found the words to describe the worst part of my time there.

And now you know.

Nerves of Jello

Ever have a really bad day? Or a bunch of bad days in a row? Me too. I’ve masked the names to protect the guilty.

For a very long time I have been aware that stress can take a toll on your health. But in my experience, it’s never been immediate like it has lately.

It started this past weekend with an on-call nightmare: I got two calls about Asterisk (not my specialty) on Saturday. One guy just needed it started. He had rebooted his machine and now it didn’t work. Easy fix – but the call ruined my lunch. By the time I was done my frozen pizza was almost frozen again.

Later, just before a dinner I was really looking forward to, the shit hit the fan:
A major client experienced a catastrophe.

Here’s the timeline of events over the next hour:

  • 6:27p Call from Answering Service – Client Engineer #1 from India
  • 6:29p Call from Answering Service – Client Engineer #2 from India
  • 6:43p Call from Answering Service – Client Engineer #1 from India
  • 6:40p I called my backup – no answer – left message
  • 6:43p I text my boss’ boss
  • 6:45p I text my boss
  • 6:51p I call my boss’ boss – left message
  • 6:53p I call my boss – left message
  • 7:02p Call from Answering Service – Client Engineer #1 from India
  • 7:10p Call from Answering Service – Client Engineer #2 from India
  • 7:12p I call our company president – told him I was working on it and can’t get a hold of anyone to help

It seems Client Engineer #3 (from New Jersey) pulled a cable at the Major Client data center and severed connectivity to over a dozen machines. I was being called by three different Major Client employees (and they were giving the answering service grief since they were all from India) and I was unable to contact them because one of the servers affected was the XMPP server.

I tried to call my backup to see if he could help me clear calls until I could figure out how to reach someone via Skype, but there was no answer. Then I escalated to my boss’ boss with a text that went unanswered. Then I tried my boss – since he knows the most about Major Client – but he wasn’t on-call – so he didn’t answer or call back either.

Finally, I called our company president in a panic. He wanted to know what I wanted him to do. I told him – nothing, I just wanted him to know there was a shitstorm brewing, and I was alone, in case somebody complained.

Long story short – my boss’ boss and I stayed up until 2am straightening out the mess.

And then Client Lead Engineer – who was on a plane coming home during all of this – called the answering service at 5:59am (AM!) to wake me and find out what happened. Then he kept me awake for an hour to watch him as he fixed the problem the way he knew how.

An hour later the pager went off to remind me I was on-call again. Sleep was a memory. Sunday was going to be shitty.

Monday wasn’t horrible and I got what I considered a proper amount of thanks for holding it together on Saturday. I was still a bit shaky and sleep-deprived, but I was back in the office.

Tuesday I had to be at our data center downtown to move our phone server from one rack to another. It should have been an easy task. I had planned all my cabling and showed up at the data center at 6:30am to make the switch at 7am.

Unfortunately, my stomach had other ideas. I barely made it to the bathroom at the data center when I arrived. I was nervous, running on fumes and had just chugged a Monster energy drink. The day wasn’t starting well at all.

So I get into the data center and start the move. Things are OK. In about 30 minutes I had the server moved and ready to power on. I pushed the power button and IM’d our office assistant. “All done. Please test” She messaged back that she couldn’t get into the server. I double-checked my cables. I had link lights. Everything looked ok.

We waited. Still no access. Oh, shit. This is the phone server. It has to be up by 8am or the office will have no phones.

I triple checked my cables against my documentation. I was at a loss. It was 8:15am. It was time to call my boss’ boss.

Now, I’m pretty sure I woke him up. But his first sentiment to me was “Chuck, this isn’t that hard.”

“Double-check your cables. Did you use new cables? Did you try to change cables?”

I felt insulted and stupid. Of course, I had tried different cables. Did he really think I tried nothing before calling him? After I admitted that I’d been working on it for an hour, he decided to stay on the line with me to help troubleshoot.

We went over everything. From start to finish.

“Did you drop the server?” No.

“Did you do anything that you suspect might have caused an issue?” No.

Again, I felt stupid and embarrassed. I told him that I was aware of how this situation makes me look, but I am certain that I did nothing that I didn’t plan to do. There were no obvious errors or detours in the plan I had made.

Eventually – it was discovered that the port I had plugged into for the public network (which is how our office assistant would have reached the server) was not configured for the public network. Not my fault – his fault. After he fixed it and everything started working, he apologized. But the damage was done. I felt like jello and was ready to just leave.

Wednesday was another trip to the data center to move another server. This one went OK, but wasn’t without its moments of feeling belittled by my boss’ boss via phone. At least there were no problems, and we were able to get things back up quickly.

Finally, this morning, I had to run updates on the firewall at our data center at 7am, from home. Ok, no problem. 7am hits and I press the button. Shortly thereafter I realize that I didn’t set enough downtime in our monitoring software. As the monitor lost communication with everything behind the firewall (because it was rebooting) I received 56 “down alerts”. 56 text message notification sounds. One after the other. The fucking phone wouldn’t stop.
I wasn’t sure if things had just gone to shit on me or not.

As I’m scrambling to try to not wake the house and shut up the phone at 7am and acknowledge the alerts, the box comes back up and I start to get “up alerts”. The connectivity was being restored. Another 56 text messages in a row.

At this point I am literally shaking like I’ve got fucking Parkinson’s. My hands are cold and I can’t keep my head still.

As things start to come back – I verify that things are OK now and send my email explaining why there were so many alerts.

Then I sit down in my chair.

I couldn’t decide if I wanted to shit, throw up, faint or die.

This was too much. Too many things in a row. Too many days in a row.

I can’t do this anymore.