Lessons on complexity, failure modes, and effective use of time from my thermostat
The Prelude
A little over a year ago I purchased and installed a Wi-Fi-enabled thermostat (a cheap one, not a fancy Nest). Being able to turn up the temperature from a cell phone on the nightstand in a cold bedroom without leaving the warm covers, or to do the same from 40 minutes away and be greeted by a warm home, is one of my small pleasures of the modern world.
Then I left on a cruise vacation for two and a half weeks. I didn’t want to tempt fate by trying out “Away” mode for the first time, so I left my thermostat schedule as is.
Halfway through the trip, at a cafe with Wi-Fi, I decided to check on my thermostat. It was 40°F in the house and 8°F outdoors, and the heat was OFF. The thermostat is upstairs, so it was unclear how cold it was in the basement with all the water pipes. Over a foot of snow had also been dumped on NJ, so the nearby friends who were checking on the house were snowed in.
I used my thermostat app to turn up the heat and found that it wouldn’t stay on more than 30 minutes or so before turning off again. I spent the rest of the vacation making sure I could get on Wi-Fi at least every 8-12 hours to bump up the heat a few times a day, researching what could be wrong with the thermostat, and worrying about what I might find when I got home.
The Completely Uninspiring Climax
Luckily, when I arrived home the pipes were fine, and the thermostat showed a clear indicator that the backup batteries were weak. Replacing the batteries resolved the issue with the thermostat turning off the heat after 30 minutes.
The Retrospective
All my well-intentioned and proactive research on what might have been the problem was a waste of time, as I had started it before I had been able to spend even a minute looking at the issue in person.
That was my mistake. Now what about the thermostat designers?
The thermostat was hardwired to the house electricity and had NO cause to require its backup batteries, yet it was intermittently malfunctioning in what should have been a normal state because something was amiss with its fallback plan. Ironically, it even remained operational while I replaced the backup batteries.
Does this remind you of any troubleshooting or high availability improvements that have themselves led to unavailability?
- A misconfigured or problematic heartbeat ping mistakenly causing a failover
- Misbehaving clustering solutions that in their early days caused more downtime than your potentially risky but surprisingly well-mannered single server
- Issues with logging or monitoring additions disrupting the main application they were intended to help
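As an illustration of the first item, here is a minimal sketch of one common way to keep a single dropped ping from triggering a failover: require several consecutive misses first. The class name `HeartbeatMonitor` and the threshold of 3 are hypothetical choices for this example, not taken from any particular clustering product.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatMonitor:
    """Signals failover only after several consecutive missed heartbeats."""

    miss_threshold: int = 3  # hypothetical value; tune it for your system
    consecutive_misses: int = 0

    def record(self, heartbeat_received: bool) -> bool:
        """Record one heartbeat interval; return True if failover should start."""
        if heartbeat_received:
            # Any successful heartbeat resets the miss counter.
            self.consecutive_misses = 0
        else:
            self.consecutive_misses += 1
        return self.consecutive_misses >= self.miss_threshold


monitor = HeartbeatMonitor(miss_threshold=3)
# A single dropped ping followed by a good one does not trigger failover;
# three misses in a row do.
decisions = [monitor.record(ok) for ok in (True, False, True, False, False, False)]
print(decisions)
```

The trade-off is detection speed: a higher threshold tolerates more network noise but takes longer to react to a real outage.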
We are living in an “Always On” world nowadays and not planning for high availability is an unacceptable risk to business continuity. However, we must be cognizant of the additional complexity involved in designing systems more resilient to failure.
HA systems must have their partial and full failure modes well exercised and understood. When you first test an HA solution in a non-production environment, have dev, operations, QA, product, and support representatives around to participate in controlled and uncontrolled failure testing (turn off a service, reboot a machine, unplug a network cable, shut off a managed network switch port, etc.). What is the end user experience at various phases? How long does it take to recover? Did it recover completely on its own or did it need some intervention? What sort of errors are presented both to the user and in internal logs? Do they make sense?
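To put a number on "How long does it take to recover?", a small polling helper can time the gap between injecting a failure and the first healthy check. This is only a sketch: `measure_recovery` and its parameters are hypothetical names, and the `check` callable stands in for whatever health probe your system actually offers.

```python
import time
from typing import Callable, Optional


def measure_recovery(
    check: Callable[[], bool],
    timeout_s: float = 60.0,
    interval_s: float = 0.5,
) -> Optional[float]:
    """Poll `check` until it returns True; return elapsed seconds, or None on timeout."""
    start = time.monotonic()
    while time.monotonic() - start < timeout_s:
        if check():
            return time.monotonic() - start
        time.sleep(interval_s)
    return None


# Stand-in for a real health probe: unhealthy twice, then healthy.
fake_results = iter([False, False, True])
elapsed = measure_recovery(lambda: next(fake_results), timeout_s=5.0, interval_s=0.01)
print(f"recovered after {elapsed:.3f}s" if elapsed is not None else "did not recover")
```

Start the timer right after you pull the cable or stop the service, and record the result next to whether anyone had to intervene.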
HA systems should expose monitoring specific to their internals. You should never consider your HA solution to be a magic black box without visibility or understanding of its own internal state, even if a vendor tries to sell you on the idea that it’s all handled and you don’t need to worry. Where are its logs? How can you verify its internal state? Your app may be up but not in an optimal state should some other trigger condition occur. Consider how much easier it would have been to arrive at a root cause if the thermostat vendor had provided a low-battery indicator in their mobile app, as they had on the physical device.
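One way to make "up but not optimal" visible is to report a degraded state alongside up and down, and to name the unhealthy parts. The sketch below is a hypothetical model, not any vendor's API: `Component`, `Status`, and `report` are names invented for this example, with the thermostat as sample data.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class Status(Enum):
    OK = "ok"
    DEGRADED = "degraded"
    DOWN = "down"


@dataclass
class Component:
    name: str
    healthy: bool
    is_fallback: bool = False  # True for standby parts such as a backup battery


def overall_status(components: List[Component]) -> Status:
    """A failed primary part means DOWN; a failed fallback only means DEGRADED."""
    if any(not c.healthy and not c.is_fallback for c in components):
        return Status.DOWN
    if any(not c.healthy for c in components):
        return Status.DEGRADED
    return Status.OK


def report(components: List[Component]) -> Dict[str, object]:
    """Build a status payload that names every unhealthy component."""
    return {
        "status": overall_status(components).value,
        "unhealthy": [c.name for c in components if not c.healthy],
    }


thermostat = [
    Component("mains_power", healthy=True),
    Component("backup_battery", healthy=False, is_fallback=True),
]
print(report(thermostat))
```

With a payload like this, a weak backup battery shows up remotely as a degraded status rather than as unexplained behavior in the primary function.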
Further Reading
I had the privilege of attending the talk below in person at Velocity, and it was one of the most thought-provoking and humorous ones at the event. It discusses systems in life-and-death situations, and a prevailing theme is not just the unusual circumstances in which complex systems fail but the realization that, given how they are designed, it is surprising they aren’t failing far more often.
I would highly recommend both viewing the video and reading the paper.
Video: Velocity 2012: Richard Cook, “How Complex Systems Fail”
Paper: How Complex Systems Fail by Richard Cook, MD