I really don’t like starting a Saturday working on work stuff at home. Especially when I am not supposed to be working on a Saturday. I don’t mind working when I am supposed to be working, but when I am supposed to be making breakfast for my family or taking care of stuff at home I get a little testy that I have to interrupt that so that I can tend to things at work.
So it went this morning, when by 8:45 I was getting called on my home phone and cell phone that there were things breaking left and right and why was this happening and what is going on? Huh?
It started with a frequent background job failing but not notifying. It then led to a series of other issues including an intranet that went down, an account management app that went down and our core site that went down. Coupled with that were all of the background jobs and middleware apps that were not transferring data as expected. Things were not working, so a few co-workers and I were.
The cause of the situation ended up being a few other nightly jobs running, killing something and never bringing those somethings back up. Then another job decided to reboot a server but not bring it back up properly. This in effect made all of the other servers that talk to the one that was rebooted stop talking to it. It happens. Anyone who has ever worked in an IT environment knows things like this happen.
But when handled properly it only happens the same way once. A second time means you didn’t learn from the first. Screwing up is inevitable and should be handled with grace and understanding the first time it happens. Any subsequent (and by that I mean the second time only… after that more than one person is to blame for these issues) instances of failure like this indicate an inability to learn from your own mistakes and should not, in my opinion, be handled with the same amount of grace and understanding as the first time. I know that may sound harsh, but that is the reality of living in a real-time environment in which money transacts and businesses serve.
The incident this morning was a case in point. Nothing that took place this morning should have taken place this morning. In fact, a remedy for this situation was put in place, for a different but identical server, just a few days ago. So it raises the question, if an identical machine bought at the same time experienced something like this and was fixed, why would its twin not be fixed? You get my point.
What ended up happening is that three people, one of whom was me, ended up spending non-work time working because a teammate of ours didn’t learn from the last time something like this happened. Yes, I am complaining about a teammate. We all are, or at least should be, held to the same level of accountability in everything we do as a unit. And this does not just hold true for work. It can easily be applied to sports, families, friendships and businesses.
I love my job. If I come across in a way that indicates anything otherwise please don’t hesitate to call me out on that. I actually look forward to coming into the office and taking care of business every day with the team I work with. I appreciate their knowledge, their expertise and their experience. This is very much like my family, and more specifically my wife, whose experience and knowledge provide a wealth of protection to me and my family. My team at work provides a similar level of protection to our team at work.
That is the nature of a team environment. We protect us. We advance us. We are us. Every person on the team should take responsibility for the success of the team and do whatever it takes to maintain a high level of accountability to the team to see to it that the team succeeds. If that had been done, this Saturday could have been spent as a normal Saturday and my blog post for today could have been about something entirely more pleasant than this.