Sunday 21 April 2019

Crisis, what crisis? Don’t panic when you suffer a system meltdown


Richard Rodger, Voxgig founder

The pitch deck can finally, thankfully, be put to one side. This week I’d like to return to more operational matters. Let’s talk about how to handle your website going down, in public. We can go one step further: there are always going to be disasters in business. How do we deal with them?

The first step is to accept that they will happen. If you run an online service, then your service will fail. It happens to the best companies in the world. Google, Amazon, Facebook, Apple – they have all had services crash and burn. In 2011 a lightning strike near CityWest business park in Dublin took out Microsoft and Amazon services. No amount of high-end software engineering is going to help you when that happens. Backup generators don’t work when they’re burnt out.

In our own little startup, we suffered a system outage in August that took down a live trial client. I caused it by manipulating our database directly. I knew it was risky at the time, but these are the trade-offs you make in a startup to get stuff done. It’s important not to overreact and hamper your agility – just because you cut yourself with scissors when you were eight doesn’t mean you shouldn’t wrap Christmas presents ever again.

If you manage crises the right way, they can be a net gain. You should be inspired by the way airline safety is improved. Each accident is thoroughly investigated, new understanding of engineering problems is gained, and processes are introduced or modified. As a result, airline safety gets better every year.

In a startup you won’t have your own independent accident investigation team. You’ll barely have time to breathe after you fix the issue, and the pressure will be on to move on and keep building. If you lost your temper, and were harsh with staff, everybody will want to sweep it under the carpet and carry on.

To come out ahead, prepare for disaster by adopting a new mental model. Failures will happen. They are either the result of deliberate risks you choose to accept, or they are the result of systemic failures. It is perfectly OK to take risks. Do so deliberately, by choice. You have limited resources, and you cannot make everything safe. When things inevitably fail, do not lose your head and blame your staff, or anybody else.

If you got the short end of the stick on a deliberate risk, you need to make a judgement call. Do the same factors still apply? It may well make sense to continue taking the same risk. Or should you reassess and consider the incident to be a symptom of an underlying systemic issue? This is not without cost – systemic risks need to be analysed and entail changes in your processes.

Babe Ruth, one of the most flamboyant baseball players of the 20th century, revolutionised the game by going all out for home runs. Before him, a cautious style that avoided strikeouts was considered the best strategy. But by deciding to take a repeated deliberate risk, he was able to outperform his contemporaries. He ended his career with a record 714 home runs, but more importantly, 1,330 strikeouts. Startups require deliberate risk-taking, and they require level-headedness and perseverance when the odds go against you.

That was the easy part. The hard part is dealing with disasters that you did not choose. As a software developer, and business owner, I’ve dealt with many of these over the last 20 years. I’ve been on the end of the phone to very irate clients, so I know how much it hurts when you fail to deliver. Here’s my approach.

Step Zero: When things go wrong, do not lose your temper. This is always a failing move. Do not get passive aggressive with your staff either. Reality is reality and you need to focus on fixing the problem right here, right now. Everybody needs to work together quickly and efficiently. People can only do that when they feel safe. You need to make sure everybody does.

Step One: ‘Work the problem’. This is the mantra Nasa astronauts use, and so should you. Airline pilots do the same thing. If an engine cuts out, they follow a checklist all the way down.

You can’t be distracted by outcomes when you are in the thick of problem diagnosis and resolution. It’s easy to recommend keeping a clear head, and some of us have better psychological reserves for a crisis, so this advice is not given flippantly. At the very least, make a conscious decision now that this is how you want to handle a crisis.

Step Two: When the dust settles, do a post-mortem, and produce an incident report. This is where all your efforts to create a healthy company culture really pay off. You must use a no-blame approach. Human error may be an immediate cause, but your systems and processes enabled human error to occur. Blame the system and the process, not the person.

The incident report should contain a timeline of the events: who did what and when. Someone senior needs to lead the investigation. The report should then analyse and comment on the problem. Then, there should be a set of recommendations for changes. Finally, a little bit later, you update the report with the actual actions that you took.
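For teams that prefer to keep these reports in a structured form, the four parts described above could be captured in a small record like the one sketched here. This is only an illustration: the class and field names (IncidentReport, TimelineEntry, actions_taken and so on) are my own assumptions, not a standard, and a shared document with the same four headings would do just as well.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass
class TimelineEntry:
    """One line of the timeline: who did what, and when."""
    when: datetime
    who: str
    what: str


@dataclass
class IncidentReport:
    """Hypothetical structure: timeline, analysis, recommendations, actions."""
    title: str
    lead: str  # the senior person leading the investigation
    timeline: List[TimelineEntry] = field(default_factory=list)
    analysis: str = ""
    recommendations: List[str] = field(default_factory=list)
    actions_taken: List[str] = field(default_factory=list)  # added later

    def log(self, when: datetime, who: str, what: str) -> None:
        """Add an event and keep the timeline in chronological order."""
        self.timeline.append(TimelineEntry(when, who, what))
        self.timeline.sort(key=lambda entry: entry.when)

    def is_complete(self) -> bool:
        """True once the report has both recommendations and follow-up actions."""
        return bool(self.recommendations) and bool(self.actions_taken)


# Illustrative usage with made-up events (not a real incident).
report = IncidentReport(title="Example outage", lead="Senior engineer")
report.log(datetime(2019, 1, 1, 10, 30), "Engineer B", "Service restored")
report.log(datetime(2019, 1, 1, 10, 0), "Engineer A", "Outage noticed")
report.recommendations.append("Add a checklist for direct database changes")
```

The actions_taken list starts empty on purpose. It is filled in a little later, once you know which recommendations you actually carried out, and until then is_complete() reports the write-up as unfinished.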

Step Three: Implement some improvements. You will not be able to fix everything and maintain a functioning business. You may very well know that a problem exists. You may have known about it before the incident. And you and your management team may still have to decide to just accept it as an ongoing risk. That’s OK – now it’s a deliberate choice.

The best approach here is to delegate improvements to your front-line staff – they usually have better knowledge than you. As a manager you will need to help with company-wide process changes, but often a team can handle things themselves. You should trust your teams to fix things themselves – show trust on this and you’ll get some wonderful results. When our newsletter team suffered a glitch in the subscription process, it led to a wide range of minor improvements after the incident analysis – led by the team themselves.

In fact, as a result of that incident, our weekly subscription rate is materially higher due to fundamental structural changes – a blessing in disguise.

Newsletter update: 4447 subscribers, and an open rate of 16pc. We’re conducting monthly experiments to optimise the newsletter. Each month we change something and see what effect it has. This month we’ve been trying an earlier mail-out time on Fridays which seems to be leading to a higher open rate. This is a little counter-intuitive as our readership is mostly North American. Perhaps we were underperforming on the European side?

Richard Rodger is the founder of Voxgig and former co-founder of Nearform, a Waterford-based tech consultancy
