DevOps Bard – Pragmatic DevOps

That sounds like a great blog post…

Alec Lazarescu — Mon, 02 Jun 2014 01:13:47 +0000

I was at a meeting with a leading Big Data vendor recently where a group presented their current data analysis pipeline featuring Storm, Kafka, Elastic MapReduce, Spark, and a number of other plumbing pieces that ferried the data through.

One of the vendor’s lead technologists remarked that the pipeline was pretty involved and sounded worthy of a good blog post. The same remark came up again when another piece of the architecture was discussed.

On the surface it seemed like a good thing that the company was working at the edge of this space. But as the day went on and the themes of architectural simplification and focus were discussed, it became apparent that they had spent far too much energy building a complex data and analytics pipeline in a space that was still nascent in the industry. Much of that effort would have been better spent on making it easier to generate new data insights and on productizing them.

While it may seem somewhat meta and ironic to discuss the merits of blog posts in a blog post, this seemed like a good experience to share. Blogging about bleeding-edge technical pursuits at your company can have other benefits, such as strengthening your brand in the technical community, which can be a powerful recruiting boost. Just don’t conflate the research and POCs that you learn from and blog about with what you actually put together for your production data architecture, and don’t lose sight of the actual analysis goal.

Photo Credit: Link by Rob Davies

Perception

Alec Lazarescu — Wed, 02 Apr 2014 14:00:14 +0000

This is a fun parable from famous web performance guru Steve Souders (formerly working at Yahoo and currently at Google). It’s part of his lengthy but excellent video on web front-end latency (~31 minutes in).

An office building owner receives escalating complaints from his tenants about how long they have to wait for the elevators.

The owner calls a civil engineer to ask what he could do. The engineer suggests the building can structurally support another two elevators. It will cost about 5 million dollars and the building will need to be closed for 6 months.

Shocked, he calls in a computer science engineer instead. The CS guy mentions he’s been working on AI lately and can write a learning algorithm that adapts to the tenants’ schedules and positions the elevators more effectively for shorter wait times. It would take about 6 months and cost about $300,000.

The owner finally calls a systems engineer. The systems engineer quickly suggests putting a TV in every elevator lobby and no one will complain again. The tenants will be distracted even if watching a terrible show and not perceive the slowness of the elevators as much. The simplest and cheapest solution avoided complaints.

The tie-in with web front-end performance is that even if you do not improve the full load time at all, improving users’ perception that things are happening, or giving them something to view and read in the meantime, demonstrably increases the number of people who stay on your site rather than click away.

Lessons on complexity, failure modes, and effective use of time from my thermostat

Alec Lazarescu — Sun, 30 Mar 2014 17:02:17 +0000

The Prelude

A little over a year ago I purchased and installed a WIFI-enabled thermostat (a cheap one, not a fancy Nest). Being able to turn up the temperature from a cell phone on the nightstand in a cold bedroom without leaving the warm covers, or to do the same from 40 minutes away and be greeted by a warm home, is one of my small pleasures of the modern world.

Then I left on vacation for two and a half weeks on a cruise. I didn’t want to tempt fate by trying out “Away” mode for the first time, so I left my thermostat schedule as is.

Halfway through the trip at a cafe with WIFI I decided to check on my thermostat. It’s 40°F in the house and 8°F outdoors and the heat is OFF. The thermostat is upstairs so it’s unclear how cold it is in the basement with all the water pipes. There’s also been over a foot of snow dumped on NJ so the nearby friends that were checking on the house were snowed in.

I used my thermostat app to turn up the heat and found that it wouldn’t stay on more than 30 minutes or so before turning off again. I spent the rest of the vacation making sure I could get on WIFI at least every 8-12 hours to bump up the heat a few times a day, researching what could be wrong with the thermostat, and worrying what I might find when I got home.

The Completely Uninspiring Climax

When I arrived home, the pipes were luckily fine and the thermostat had a clear indicator that the backup batteries were weak. Replacing the batteries resolved the issue of the thermostat turning off the heat after 30 minutes.

The Retrospective

All my well-intentioned and proactive research on what might have been the problem was a waste of time, as I had started it before I had been able to spend even a minute looking at the issue in person.

That was my mistake. Now what about the thermostat designers?

The thermostat was sitting hardwired to the house electricity and had NO cause to require its backup batteries yet was intermittently malfunctioning in what should have been a normal state because something was amiss with its fallback plan. It ironically even continued being operational while I replaced the backup batteries.

Does this remind you of any troubleshooting or high availability improvements that have themselves led to unavailability?

A misconfigured or problematic heartbeat ping mistakenly causing a failover
Misbehaving clustering solutions that in their early days were the cause of more downtime than your potentially risky but surprisingly well-mannered single server
Issues with logging or monitoring additions disrupting the main application they were intended to help
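
As a rough illustration of the first item, here is a minimal Python sketch of a failover detector that only triggers after several consecutive missed heartbeats, so that one dropped ping cannot cause a failover on its own. The class name, the `misses_required` parameter and its default of 3 are all assumptions made up for this example, not taken from any particular clustering product.

```python
from dataclasses import dataclass


@dataclass
class FailoverDetector:
    """Declare the primary dead only after several consecutive missed heartbeats."""

    misses_required: int = 3  # hypothetical threshold; tune for your system
    consecutive_misses: int = 0

    def record(self, heartbeat_ok: bool) -> bool:
        """Record one heartbeat result; return True if failover should trigger."""
        if heartbeat_ok:
            # Any successful heartbeat resets the count.
            self.consecutive_misses = 0
        else:
            self.consecutive_misses += 1
        return self.consecutive_misses >= self.misses_required


detector = FailoverDetector(misses_required=3)
# One dropped ping surrounded by healthy ones must not cause a failover.
flaky = [detector.record(ok) for ok in [True, False, True, True]]
# Three misses in a row should.
outage = [detector.record(ok) for ok in [False, False, False]]
print(flaky, outage)
```

The trade-off is detection time: a higher threshold means fewer false failovers but a longer wait before a real outage is acted on.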

We are living in an “Always On” world nowadays and not planning for high availability is an unacceptable risk to business continuity. However, we must be cognizant of the additional complexity involved in designing systems more resilient to failure.

HA systems must have their partial and complete failure modes well exercised and understood. When you first test an HA solution in a non-production environment, have dev, operations, QA, product, and support representatives around to participate in controlled and uncontrolled failure testing (turn off a service, reboot a machine, unplug a network cable, shut off a managed network switch port, etc.). What is the end user experience at various phases? How long does it take to recover? Did it recover completely on its own or did it need some intervention? What sort of errors are presented both to the user and in internal logs? Do they make sense?
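
To make "How long does it take to recover?" something you can record rather than guess, a drill can be scripted. The following is only a sketch under stated assumptions: `run_failure_drill`, `FakeService` and `FakeClock` are hypothetical names invented here, and in a real drill `inject_failure` and `is_healthy` would wrap whatever your environment actually uses to stop a service and check its health.

```python
import time
from typing import Callable


def run_failure_drill(
    inject_failure: Callable[[], None],
    is_healthy: Callable[[], bool],
    timeout_s: float = 60.0,
    poll_interval_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Inject one failure, then poll until healthy again or until the timeout."""
    inject_failure()
    started = clock()
    while clock() - started < timeout_s:
        if is_healthy():
            return {"recovered": True, "recovery_seconds": clock() - started}
        sleep(poll_interval_s)
    # Did not recover on its own: someone would have had to intervene.
    return {"recovered": False, "recovery_seconds": None}


class FakeService:
    """Stand-in for a real system: unhealthy for a few checks after a failure."""

    def __init__(self, checks_until_recovered: int) -> None:
        self.checks_until_recovered = checks_until_recovered
        self.remaining = 0

    def fail(self) -> None:
        self.remaining = self.checks_until_recovered

    def healthy(self) -> bool:
        if self.remaining > 0:
            self.remaining -= 1
            return False
        return True


class FakeClock:
    """Simulated clock so the demo runs instantly and deterministically."""

    def __init__(self) -> None:
        self.now = 0.0

    def time(self) -> float:
        return self.now

    def sleep(self, seconds: float) -> None:
        self.now += seconds


service, fake = FakeService(checks_until_recovered=3), FakeClock()
result = run_failure_drill(service.fail, service.healthy, timeout_s=10.0,
                           clock=fake.time, sleep=fake.sleep)
print(result)
```

The script only answers the timing question. What the user saw and whether the logs made sense still need the people in the room.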

HA systems should expose monitoring specific to their internals. You should never consider your HA solution to be a magic black box without visibility or understanding of its own internal state, even if a vendor tries to sell you on the idea that it’s all handled and you don’t need to worry. Where are its logs? How can you verify its internal state? Your app may be up but not in an optimal state should some other trigger condition occur. Consider how much easier it would have been to arrive at a root cause if the thermostat vendor had provided a low-battery indicator in their mobile app, as they had on the physical device.
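
One way to picture "up but not in an optimal state" is a health report that distinguishes degraded from healthy instead of collapsing everything into up or down. This is a minimal sketch. The `HaHealth` class, its field names and the three status strings are assumptions for illustration, not any vendor's API.

```python
from dataclasses import dataclass


@dataclass
class HaHealth:
    """Snapshot of an HA setup's internals (field names are hypothetical)."""

    primary_ok: bool
    fallback_ok: bool

    def status(self) -> str:
        if self.primary_ok and self.fallback_ok:
            return "ok"
        if self.primary_ok or self.fallback_ok:
            # Still serving, but the redundancy you are counting on is gone.
            return "degraded"
        return "down"

    def report(self) -> dict:
        """Payload a monitoring endpoint or mobile app could display."""
        return {
            "status": self.status(),
            "primary_ok": self.primary_ok,
            "fallback_ok": self.fallback_ok,
        }


# The thermostat from the story: mains power fine, backup batteries weak.
thermostat = HaHealth(primary_ok=True, fallback_ok=False)
print(thermostat.report())
```

Reporting the per-component flags alongside the summary status is the point: "degraded" alone says something is wrong, while the flags say where to look.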

Why Stories Still Matter

Alec Lazarescu — Sun, 30 Mar 2014 17:01:11 +0000

Attention spans are shortening and there’s an avalanche of information. Why do we need stories when we can just get sound bites or tweets? Just tell me what I need to know and skip the setup, right?

The truth is that, despite the presumed time savings, that’s not nearly as effective. Following the journey is what gives you the context to truly grasp the situations and solutions and why they are relevant. Most importantly, it gives the reader time to think about the problem space and start considering solutions before the author presents them.

By the time the solution is presented the reader may even think:

“That’s what I would have done”
“That’s obvious. Why isn’t X department doing this? Let me talk to them.”
“I knew it!”

This adds a dose of humility and minimizes the author’s role but ultimately achieves the goal of assimilation and propagation of the ideas nonetheless.

Consider instead the likely responses to receiving just the end result in bite-sized form, without the context:

“That would never happen”
“That wouldn’t work in my situation”
“This is too theoretical”
“This is too ivory tower and not practical”

The impact of Goldratt’s The Goal and its more recent IT-focused spiritual successor The Phoenix Project makes a strong case for the power of stories. Despite its age, The Goal is still worth a read to get a less rushed path through the theory of constraints than the later book presents. The Phoenix Project is a must-read.

In this venue I’ll try to include not just end state advice but also some stories and anecdotes from myself and others.