Pragmatic DevOps https://www.pragmaticdevops.com DevOps Bard Stories / Analysis / Hacking Management Wed, 27 Aug 2014 12:32:59 +0000 en-US hourly 1 You can’t automate relationships https://www.pragmaticdevops.com/2014/08/devops/cant-automate-relationships-2/ Sun, 10 Aug 2014 12:32:27 +0000 http://www.pragmaticdevops.com/?p=181

Categories: DevOps, Management

Originally appearing on devops.com.


 

As an organization sold on the idea of DevOps what’s your first step in the journey?

Despite a constant deluge of well-reasoned guides for implementing DevOps, the most prevalent step companies have taken is simply contributing to a proliferation of DevOps job titles.

Most likely it’s in no small part because asking the HR department to add “DevOps” to the title of the latest sysadmin job posting and “Chef” or “Puppet” to the requirements is easy. This is a woefully low bar to set for DevOps implementation. Compounding this, when the CIO later gets a survey asking about DevOps adoption, they gleefully indicate the company is well on its way, joining many others in artificially inflating industry statistics for what is a very worthwhile but nonetheless major undertaking for large companies.

Surveys can contribute to confused starts. Any survey that does not ask about specific practices, and instead draws its conclusions from simply asking whether a company is doing something that has positive buzz in the industry, is worthless. A welcome contrast is the Puppet Labs 2014 State of DevOps report, a useful collection of supporting information covering various practices, recommendations, and nuances of organizational structures.

Even real progress on an individual automation project, whether by newly titled DevOps members or from a grassroots effort in an existing engineering or operations team, does not amount to a complete implementation strategy, nor is it the best way to start. Automation is certainly vital to managing environments reliably and repeatably, but groups shouldn’t rush to tackle automation first only because it’s comfortable to start hacking away at a problem they think they can deal with on their own.

These isolated and well-intentioned practitioners may find that without buy-in, solid trust, and relationships across groups, they are not given the access they need to automate and orchestrate certain key areas.

You can’t automate relationships!

There needs to be a holistic understanding of the value streams driving engineering work and how best to improve them. Instead of diving headfirst into automation, work on identifying common goals, shared pain, and potential wins across organizational groups. Starting either with direct collaboration between group members or with management-level alignment can work, provided the other quickly follows. Buy-in is required from both engineering and management or progress will stutter.

Taking people through a story can bring support more readily than just listing bullet points. People need to feel emotionally engaged and connected to the pains of the culture and process prior to DevOps and how they can be realistically mitigated. For a large company audience, seeing a story from another large company with timeframes and phases that shepherd the transformation in bold, yet manageable steps can be helpful. The story of SAP Global IT is a candid and useful case study.

Once the stage is set, tackle automation together in a cross-functional group with shared measures of success guiding your focus. Depending on the complexity of your process, it may be beneficial to collaborate first on a smaller application or a simple, time-boxed proof of concept to vet new ideas and working relationships.

Despite the potential for rocky or inefficient starts, a meaningful DevOps implementation has the transformative power to deliver incredible innovation speed and satisfaction. What has worked to keep your organization focused on the long view regarding DevOps?

Many companies are misappropriating the term DevOps to get attention, headcount, or perceived leverage in internal turf wars instead of building a cross-team continuous improvement engine. Is this a sign of the early peak of inflated expectations in the hype cycle?

Are we heading for a serious backlash? Is the tireless promotion of proper practices by prolific thought leaders enough to make it clear that failures are misguided implementations and not fundamental flaws in the movement?

I’m eager to hear your thoughts in the comments.

Branching out https://www.pragmaticdevops.com/2014/07/devops/branching/ Thu, 10 Jul 2014 12:04:04 +0000 http://www.pragmaticdevops.com/?p=166

Categories: DevOps

Starting today I’m very excited to announce that I will also be blogging on http://devops.com/author/alecl/ in addition to here at http://www.pragmaticdevops.com/.

While I’m certain I will keep both running, I haven’t quite settled on which topics I’ll post on each, so I’ll play it by ear for a while.

My first piece is out: http://devops.com/blogs/cant-automate-relationships/

I’m very excited to be part of a larger group and audience with many amazing authors there.

That sounds like a great blog post… https://www.pragmaticdevops.com/2014/06/devops-bard/sounds-like-great-blog-post/ Mon, 02 Jun 2014 01:13:47 +0000 http://www.pragmaticdevops.com/?p=73

Categories: Big Data, DevOps Bard, Management

but don't lose sight of your real goals

I was at a meeting with a leading Big Data vendor recently where a group presented their current data analysis pipeline featuring Storm, Kafka, Elastic MapReduce, Spark, and a number of other plumbing pieces that ferried the data through.

One of the vendor’s lead technologists mentioned that it was pretty involved and sounded worthy of a good blog post. The same remark came up again when another piece of architecture was discussed.

On the surface it seemed like a good thing that the company was working at the edge of this space. But as the day went on and the themes of architectural simplification and focus were discussed, it became apparent that they had spent entirely too much energy building a complex data and analytics pipeline in a space that was still nascent in the industry. Much of this effort would have been better spent on making it simpler to generate new data insights and productizing them.

While it may seem somewhat meta and ironic to discuss the merits of blog posts in a blog post, this seemed like a good experience to share. Blogging about bleeding-edge technical pursuits at your company can have other benefits, such as building the strength of your brand in the technical community, which can be a powerful recruiting boost. Just don’t conflate the useful research and POCs that you learn from and blog about with what you actually put together for your production data architecture, and don’t lose sight of the actual analysis goal.

Photo Credit: Link by Rob Davies
Feature flags and canary, dark, and A/B releases https://www.pragmaticdevops.com/2014/05/continuous-delivery/feature-flags-and-canary-dark-and-ab-releases/ Mon, 19 May 2014 00:17:28 +0000 http://www.pragmaticdevops.com/?p=63

Categories: Continuous Delivery

They're not just for startups!

What are feature flags?

Feature flags are toggles in the code base that allow UI areas and/or backend functionality to be enabled or disabled via a configuration file or other configuration system. Users that have the feature disabled see no trace of it.
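As a rough illustration, here is a minimal Python sketch of such a toggle, assuming the flags live in a JSON configuration document. The flag names, the `FLAGS_JSON` string, and the helpers `load_flags`, `is_enabled`, and `render_cart_page` are all hypothetical names made up for this example, not taken from any particular library.

```python
import json

# Hypothetical flag configuration; in practice this might come from a file
# or a configuration management system.
FLAGS_JSON = '{"new_checkout": false, "recommendations": true}'


def load_flags(raw: str) -> dict:
    """Parse the flag configuration into a name -> bool mapping."""
    return {name: bool(value) for name, value in json.loads(raw).items()}


def is_enabled(flags: dict, name: str) -> bool:
    """Unknown flags default to off so unfinished features stay hidden."""
    return flags.get(name, False)


def render_cart_page(flags: dict) -> list:
    """Return the UI sections a user would see for the given flags."""
    sections = ["cart_items", "classic_checkout_button"]
    if is_enabled(flags, "new_checkout"):
        # Only users with the flag enabled ever see the new feature.
        sections[1] = "new_checkout_button"
    return sections


flags = load_flags(FLAGS_JSON)
print(render_cart_page(flags))
```

Defaulting unknown flags to off is a design choice in this sketch: a flag that is missing from the configuration then behaves like a disabled one.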

What can feature flags do for you?

Fast Feedback Loops

Fast feedback loops are one of the cornerstones of realizing the benefits of Agile development. Developers who work independently on isolated versions or feature branches of the code base for too long run the risk of diverging substantially and being surprised by a large and risky merge effort at the end.

Using feature flags, developers can all work in the same branch and merge more routinely, keeping their feature disabled in the configuration at check-in until it is ready for testing. Merging once a day or even more often is not uncommon. This brings to the fore any files that multiple developers have worked on and makes the mechanics of the merge simpler, since changes that are small and still fresh in memory are easier to reconcile. It also keeps the developers aware of each other and may prompt useful conversations on code design in an area of joint interest. This is one of many practices that can take some getting used to, but doing it more often will rapidly build team alignment and understanding after the initial growing pains.

Experiments and A/B Testing

These have become popular especially in e-commerce, marketing, advertising, and some UX circles. Effective experiments require a good target metric and a historical record of its measurements. Your experiments will be attempting to influence this metric.  A few examples:

  • sales $/time
  • ratio of shopping cart checkouts to abandonments
  • advertising click rates/time
  • ratio of registrations to visitors

In keeping with the scientific roots of experiments, a group of users is selected to take part in the test, whether it’s of a new code build, UI, or product offering. These users have the feature flag for the experiment enabled, and non-participating users form the control group. Keep in mind strategies to minimize selection bias. Some types of changes may rate particularly well or poorly with power users, so it’s best to be aware of the characteristics of your selected user samples. You may even want to consider separate experiments by user archetype so as not to conflate too many variables.
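One common way to pick the participating users is to hash each user ID into a stable bucket. The Python sketch below shows that idea under stated assumptions: the function name `in_experiment`, the experiment name, and the user ID format are hypothetical, and hashing is only one of several reasonable selection strategies.

```python
import hashlib


def in_experiment(user_id: str, experiment: str, percent: int) -> bool:
    """Deterministically place roughly `percent`% of users in the test group."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in the range 0..99
    return bucket < percent


# Hypothetical user IDs; everyone not selected falls into the control group.
users = [f"user-{i}" for i in range(10000)]
selected = [u for u in users if in_experiment(u, "new_checkout", 10)]
print(len(selected))  # roughly 1000; the exact count depends on the hash
```

Including the experiment name in the hash means a given user does not land in the test group of every experiment, which helps a little with the selection bias concern above. It does not balance user archetypes on its own.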

During the course of the experiment, the target metric for the participating users can be compared with that of the control group to verify whether any statistically significant change has occurred.
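For a ratio-style metric such as registrations to visitors, one standard way to make that comparison is a two-proportion z-test. The sketch below is an illustration only: the function name and the sample counts are invented, and it assumes samples large enough for the normal approximation and a pooled rate that is neither 0 nor 1.

```python
import math


def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Return (z, two-sided p-value) for a difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Made-up numbers: control converts 200 of 5000, experiment 260 of 5000.
z, p = two_proportion_z_test(200, 5000, 260, 5000)
print(f"z={z:.2f}, p={p:.4f}")
```

A small p-value suggests the difference is unlikely to be noise. Whether the change is worth rolling out is still a product decision.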

If the change is promising it can be rolled out to the population of all users. If not it can be scrapped.

Canary Releases

Despite best intentions and even with a serious automated testing suite there’s always a chance a particular change may have corner case bugs, performance issues, or unintended consequences.

Given a strongly instrumented system, it is beneficial to release initially to a very small subset of servers or users by enabling the feature flag for them and monitoring for any issues. For the full value it is imperative to monitor not only error rates, CPU, and the like, but also metrics that track normal usage levels, such as posts/minute, checkouts/minute, and registrations/minute. If for some reason your application is unusable yet errors out silently, the drop in traffic compared to typical activity will be what gives this away.
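The usage-level check can be sketched in a few lines of Python. Everything here is an assumption for illustration: the metric names, the numbers, the 50% threshold, and the premise that the baseline rates have already been scaled to the canary's share of traffic.

```python
def activity_alerts(baseline: dict, canary: dict, min_ratio: float = 0.5) -> list:
    """Return the usage metrics where the canary fell below `min_ratio`
    of the typical per-minute rate."""
    alerts = []
    for metric, typical in baseline.items():
        observed = canary.get(metric, 0.0)
        if typical > 0 and observed / typical < min_ratio:
            alerts.append(metric)
    return alerts


# Hypothetical per-minute rates for a comparable slice of traffic.
baseline = {"posts_per_min": 120.0, "checkouts_per_min": 30.0, "registrations_per_min": 8.0}
canary = {"posts_per_min": 115.0, "checkouts_per_min": 4.0, "registrations_per_min": 7.5}
print(activity_alerts(baseline, canary))  # ['checkouts_per_min']
```

In this made-up data the canary reports no errors at all, yet checkouts have collapsed, which is exactly the silent failure that error-rate monitoring alone would miss.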

Dark Launch

Though it had been around for a while, this style of release was publicized by Facebook. With this technique, enabling the feature flag has no UI impact, but behind the scenes additional work is done, generally to provide real-world test data to a system. This could be querying a new data store or sending data through a new data channel, for example.

Load tests and staged data are very important first steps, but for crucial changes having an additional intermediate step of a dark launch can help root out further scalability bottlenecks or unexpected behaviors in production.
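A dark launch of a new data store might look something like the following Python sketch. The flag name `dark_new_profile_store` and the two store callables are hypothetical, and this is a simplified synchronous version of the idea rather than how any particular company implements it.

```python
import logging

logger = logging.getLogger("dark_launch")


def fetch_profile(user_id, flags, old_store, new_store):
    """Serve from the old store; optionally exercise the new one in the dark."""
    result = old_store(user_id)
    if flags.get("dark_new_profile_store", False):
        try:
            shadow = new_store(user_id)
            if shadow != result:
                logger.warning("dark launch mismatch for user %s", user_id)
        except Exception:
            # A failure in the dark path must never affect the user.
            logger.exception("dark launch call failed for user %s", user_id)
    return result  # the user only ever sees the old store's answer


def old_store(user_id):
    return {"id": user_id, "name": "example"}


def broken_new_store(user_id):
    raise RuntimeError("new store not ready")


print(fetch_profile(7, {"dark_new_profile_store": True}, old_store, broken_new_store))
```

In a real system the shadow call would more likely run asynchronously so that it adds no latency to the user's request. It is kept inline here for brevity.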

What to test?

Testing every permutation of feature flags is unnecessary.  It’s important to test at a minimum:

  • every flag enabled
  • every flag expected to be on for the release deployment enabled

I would add a further case: if you are running experiments where different combinations of flags are enabled for production users, you can consider each of those a release deployment configuration of sorts and should give thought to testing each one to ensure there’s no odd interaction of flag states.
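The set of configurations described above can be generated mechanically. This Python sketch assumes each experiment is the release configuration plus a few extra flags. The flag names, experiment names, and the helper `configurations_to_test` are hypothetical.

```python
def configurations_to_test(all_flags, release_flags, experiments):
    """Build the named flag configurations worth running the test suite against."""
    configs = {
        "all_enabled": {flag: True for flag in all_flags},
        "release": {flag: flag in release_flags for flag in all_flags},
    }
    for name, extra_flags in experiments.items():
        # Assumption: an experiment is the release config plus its own flags.
        enabled = set(release_flags) | set(extra_flags)
        configs[f"experiment:{name}"] = {flag: flag in enabled for flag in all_flags}
    return configs


all_flags = ["new_checkout", "recommendations", "dark_profile_store"]
configs = configurations_to_test(
    all_flags,
    release_flags={"recommendations"},
    experiments={"checkout_ab": ["new_checkout"]},
)
for name, flags in configs.items():
    print(name, flags)
```

With three flags there are eight possible permutations, but this approach yields three configurations to test, plus one more for each additional experiment.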

Closing

Feature flags and having a rock solid configuration management system to apply them can enable all of these techniques that help get value to your users at a much faster rate.

These techniques have been around a few years at some of the internet technology leaders and startups so they are not new. Those are exciting and often very vocal places and that can skew the perception of how entrenched the techniques actually are. Scott Hanselman’s notion of the Dark Matter Developers comes to mind.

Nonetheless, with even SAP arriving at continuous delivery it’s time for all companies to take notice.


Further Reading

Feature flags at Flickr
Feature flags library at Etsy (similar in style to Flickr’s)
Feature Bits at Lyris – nice explanation including business involvement and design patterns

Photo Credit: Link by Toshiyuki IMAI

 

 

It’s easier to spend money than to change habits https://www.pragmaticdevops.com/2014/05/management/its-easier-to-spend-money-than-to-change-habits/ Mon, 12 May 2014 01:36:26 +0000 http://www.pragmaticdevops.com/?p=79

Categories: Management

whether in personal life or DevOps.

When you want to make any change in your life, whether it’s diet, exercise, or learning a new skill, it’s often easy to start with some sort of purchase to commemorate your bold decision. It’s easy to think that some magic purchase is what’s going to kickstart your transformation and that you’ll just wait until you receive it to get started. Whether it’s a treadmill, a juicer, a diet book, or that new miter saw, nine times out of ten all you get out of the experience is the short-lived jolt of satisfaction that you’ve put your plan for a new you into “action”. In reality you haven’t put anything at all into action. You’ve just traded some cash to temporarily make yourself feel better and added more unused clutter to your home. Even worse, you’ll likely feel disappointed and discouraged that you have failed to make a change when in reality you hadn’t even started.

You’re better off taking real steps to change your habits first and after you’ve adjusted your routine for the better even if in a small but measurable and consistent way, then make your purchase for whatever will save you time or help you get to the next plateau. So start walking more, eating better, or doing small home projects first before making your bold proclamation purchase.

Here’s the kicker. This is not the end of a post just talking about your personal life…

Businesses fall into the same traps!

It can manifest as hiring an outside consultant or speaker, or holding a series of large offsite meetings intended to effect change, in the absence of management and cultural support for the initiative. The response within the events themselves can run the spectrum from excitement to disinterest, but without proper pre- and post-event alignment and follow-up the end result is often the same: no meaningful change.

In the DevOps space we’re at the point on the hype curve where organizations with a shallow understanding want to jump on for touted benefits, but aren’t really interested in transformation. That’s not going to happen no matter what consultant you hire.

There’s a great clip from the consultant Andrew Clay Shafer on the Food Fight podcast “The Future of DevOps”.  The whole show is a fantastically blunt, but very real assessment and discussion of learning organizations, DevOps, and a host of worthwhile topics and worth a listen, but you can hear the clip here:

https://www.youtube.com/watch?v=jT6JdGHVbj0&feature=share&t=13m37s

We want “The DevOps” but we don’t want to change anything. Is there some way that we can DevOps without changing anything. Give us that thing. Can we just have the DevOps certification and then we’re done with you?

Hiring a consultant or even getting a handful of people training/certification is easy. Changing your organization, changing multiple teams’ responsibilities and relationships to each other, building trust and tearing down walls is hard.


Another inspiration for this post was the video Don’t Reward Yourself Before You Earn It by John Sonmez

Photo Credit: Link from http://taxcredits.net/
DevOps as a team or a responsibility? https://www.pragmaticdevops.com/2014/04/management/hacking-management/devops-as-a-team-or-a-responsibility/ Sun, 27 Apr 2014 17:28:31 +0000 http://www.pragmaticdevops.com/?p=49

Categories: Hacking Management

what worked for us

John E. Vincent writes in “DevOps – the Title Match”:

I worked at a company several years ago. We created a dedicated devops team. The rationale was solid – the company had a monolithic idea of roles and titles. We also had a large group on both sides that were only interested in doing their little bit and going home. By creating this title/team, it was easier at a company level to justify them working on non-standard projects.

So a “devops” team was created. This was a small team of what essentially boiled down to “super sysadmins”. We wrote puppet manifests, worked with the developers to automate build processes…shit like that.

What ended up happening was that the devops team was seen as elitist by the operations team, nosy and invasive by the developers and everyone just passed the blame on to them – “Devops did that. Not us”

In my department we set up a DevOps responsibility rather than a team or specific job title. The responsibility lies with a subset of the sysadmins and developers. It doesn’t take them out of their main teams and it’s not their only responsibility. They do their main day-to-day work with the rest of the sysadmins and developers respectively, but additionally have a role being on call, exposing monitoring capabilities in code as well as hooking them up to tools, setting up automation, doing deployments, etc. Lessons learned are shared with the rest of their teams. Not coincidentally, the members with this on-call responsibility are also some of the most respected troubleshooters and thought leaders in their groups, leading the whole team to improve.

User stories for themselves or other dev/sysadmin team members are created in JIRA under a DevOps epic along with a DevOps label. Though it’s not a perfect world where these stories are always done right away in the next sprint or sit at the top of the Kanban TODO list, the team members do have an incentive to push for the improvements, generally in the areas of monitoring or reliability.

As with many places, there isn’t a special budget to just spin up a new DevOps team, but by building the roles from within existing teams we captured the experience of the teams.  Augmented occasionally with some help from automation experts from other parts of the organization, this has worked well. It should go without saying that members with DevOps responsibility should have capacity budgeted in their sprint to deal with production issues and continuous reliability improvements, but sadly common sense is not necessarily common.

Photo Credit: Link by Doc Searls
Perception https://www.pragmaticdevops.com/2014/04/devops-bard/perception/ Wed, 02 Apr 2014 14:00:14 +0000 http://blog.alecl.com/?p=34

Categories: DevOps Bard

This is a fun parable from famous web performance guru Steve Souders (formerly working at Yahoo and currently at Google).  It’s part of his lengthy but excellent video on web front-end latency (~31 minutes in).

An office building owner receives escalating complaints from his tenants about how long they have to wait for the elevators.

The owner calls a civil engineer to ask what he could do.  The engineer suggests the building can structurally support another two elevators.  It will cost about 5 million dollars and the building will need to be closed for 6 months.

Shocked, he calls in a computer science engineer instead.  The CS guy mentions he’s been working on AI lately and can write a learning algorithm that adapts to the schedules of the tenants and positions the elevators more effectively for shorter wait times.  It would take about 6 months and cost about $300,000.

The owner finally calls a systems engineer.  The systems engineer quickly suggests putting a TV in every elevator lobby and no one will complain again.  The tenants will be distracted even if watching a terrible show and not perceive the slowness of the elevators as much.  The simplest and cheapest solution avoided complaints.


The tie-in with web front-end performance is that even if you do not actually improve the full load time at all, improving users’ perception that things are happening, or giving them something to view and read in the meantime, demonstrably increases the number of people who stay on your site rather than click away.

Lessons on complexity, failure modes, and effective use of time from my thermostat https://www.pragmaticdevops.com/2014/03/devops-bard/lessons-on-complexity-failure-modes-and-effective-use-of-time-from-my-thermostat/ https://www.pragmaticdevops.com/2014/03/devops-bard/lessons-on-complexity-failure-modes-and-effective-use-of-time-from-my-thermostat/#comments Sun, 30 Mar 2014 17:02:17 +0000 http://blog.alecl.com/?p=9

Categories: DevOps Bard

make sure HA complexity doesn't REDUCE your stability

The Prelude

A little over a year ago I had purchased and installed a WIFI enabled thermostat (a cheap one, not a fancy Nest). The ability to turn up the temperature from a cell phone on the nightstand in the cold bedroom without leaving the warm covers or doing the same from 40 minutes away to be greeted by a warm home is one of my small pleasures of the modern world.

Then I left on vacation for two and a half weeks on a cruise. I didn’t want to tempt fate by trying out “Away” mode for the first time so I left my thermostat schedule as is.

Halfway through the trip at a cafe with WIFI I decided to check on my thermostat. It was 40°F in the house and 8°F outdoors, and the heat was OFF.  The thermostat is upstairs, so it was unclear how cold it was in the basement with all the water pipes. Over a foot of snow had also been dumped on NJ, so the nearby friends who were checking on the house were snowed in.

I used my thermostat app to turn up the heat and found that it wouldn’t stay on more than 30 minutes or so before turning off again. I spent the rest of the vacation making sure I could get on WIFI at least every 8-12 hours to bump up the heat a few times a day, researching what could be wrong with the thermostat, and worrying what I might find when I got home.

The Completely Uninspiring Climax

When I arrived home, luckily the pipes were fine and the thermostat had a clear indicator that the backup batteries were weak. Replacing the batteries resolved the issue with the thermostat turning off the heat after 30 minutes.

The Retrospective

All my well-intentioned and proactive research on what might have been the problem was a waste of time, as I had started it before I had been able to spend even a minute looking at the issue in person.

That was my mistake. Now what about the thermostat designers?

The thermostat was sitting hardwired to the house electricity and had NO cause to require its backup batteries yet was intermittently malfunctioning in what should have been a normal state because something was amiss with its fallback plan. It ironically even continued being operational while I replaced the backup batteries.

Does this remind you of any troubleshooting or high availability improvements that have themselves led to unavailability?

  • A misconfigured or problematic heartbeat ping mistakenly causing a failover
  • Misbehaving clustering solutions that in their early days were the cause of more downtime than your potentially risky but surprisingly well mannered single server
  • Issues with logging or monitoring additions disrupting the main application they were intended to help

We are living in an “Always On” world nowadays and not planning for high availability is an unacceptable risk to business continuity. However, we must be cognizant of the additional complexity involved in designing systems more resilient to failure.

HA systems must have their partial and full failure modes well exercised and understood. When you are first testing an HA solution in a non-production environment, have dev, operations, QA, product, and support representatives around to participate in controlled and uncontrolled failure testing (turn off a service, reboot a machine, unplug a network cable, shut off a managed network switch port, etc.). What is the end user experience at various phases? How long does it take to recover?  Did it recover completely on its own or did it need some intervention? What sort of errors are presented both to the user and in internal logs? Do they make sense?
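The "how long does it take to recover?" question is easy to answer with a small timing helper during such a test. This is a minimal Python sketch: `measure_recovery` and the simulated health check are hypothetical, and a real check would probe the actual service, for example with an HTTP request.

```python
import time


def measure_recovery(check, timeout_s: float = 60.0, interval_s: float = 1.0):
    """Poll `check` until it returns True; return seconds elapsed, or None on timeout."""
    start = time.monotonic()
    while time.monotonic() - start < timeout_s:
        if check():
            return time.monotonic() - start
        time.sleep(interval_s)
    return None


# Simulated service that becomes healthy on the third health check.
attempts = {"count": 0}


def simulated_health_check() -> bool:
    attempts["count"] += 1
    return attempts["count"] >= 3


elapsed = measure_recovery(simulated_health_check, timeout_s=5.0, interval_s=0.01)
print(f"recovered after {elapsed:.3f}s" if elapsed is not None else "did not recover")
```

Start the timer at the moment the failure is injected, and record the result alongside notes on whether anyone had to intervene.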

HA systems should expose monitoring specific to their internals. You should never treat your HA solution as a magic black box with no visibility into its internal state, even if a vendor tries to sell you on the idea that it’s all handled and you don’t need to worry. Where are its logs?  How can you verify its internal state? Your app may be up but not in an optimal state should some other trigger condition occur. Consider how much easier it would have been to arrive at a root cause if the thermostat vendor had provided a low battery indicator in their mobile app as they had on the physical device.
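To make the idea concrete, here is a Python sketch of a status report that surfaces the health of fallback components even while the primary is fine. The `ComponentStatus` type, the state names, and the thermostat example values are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class ComponentStatus:
    name: str
    healthy: bool
    detail: str = ""


def overall_status(primary: ComponentStatus, fallbacks: list) -> dict:
    """Summarise state so a degraded fallback is visible while the app is up."""
    degraded = [c for c in fallbacks if not c.healthy]
    if not primary.healthy:
        state = "down" if len(degraded) == len(fallbacks) else "failed_over"
    else:
        state = "degraded" if degraded else "ok"
    return {
        "state": state,
        "warnings": [f"{c.name}: {c.detail}" for c in degraded],
    }


# The thermostat from the story: mains power fine, backup batteries weak.
mains = ComponentStatus("mains_power", healthy=True)
battery = ComponentStatus("backup_battery", healthy=False, detail="low charge")
print(overall_status(mains, [battery]))
```

Reporting "degraded" rather than "ok" in this situation is the point: the system is working, but its safety net is not.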

Further Reading

I had the privilege of attending this talk in person at Velocity and it was one of the most thought provoking and humorous ones at the event. Systems in life and death situations are talked about and a prevailing theme is not just about some unusual circumstance where complex systems fail but the realization that given how they are designed why they aren’t failing far more often.

I would highly recommend both viewing the video and reading the paper.

Video: Velocity 2012: Richard Cook, “How Complex Systems Fail”
Paper: How Complex Systems Fail by Richard Cook, MD

Photo Credit: Paul VanDerWerf – Link
Why Stories Still Matter https://www.pragmaticdevops.com/2014/03/devops-bard/why-stories-still-matter/ https://www.pragmaticdevops.com/2014/03/devops-bard/why-stories-still-matter/#comments Sun, 30 Mar 2014 17:01:11 +0000 http://blog.alecl.com/?p=23

Categories: DevOps Bard

bullet points don't promote empathy. Cat pics only slightly better.

Attention spans are shortening and there’s an avalanche of information. Why do we need stories when we can just get sound bites or tweets? Just tell me what I need to know and skip the setup, right?

The truth is that despite the presumed time savings that’s not nearly as effective. Following the journey is what gives you context to truly grasp the situations and solutions and why they are relevant. Most importantly it gives the reader time to think about the problem space and start considering the solutions before they are presented by the author.

By the time the solution is presented the reader may even think

“That’s what I would have done”
“That’s obvious. Why isn’t X department doing this? Let me talk to them.”
“I knew it!”

This adds a dose of humility and minimizes the author’s role but ultimately achieves the goal of assimilation and propagation of the ideas nonetheless.

Consider instead the potential responses to receiving just the end result in bite-sized form without the context:

“That would never happen”
“That wouldn’t work in my situation”
“This is too theoretical”
“This is too ivory tower and not practical”

The impact of Goldratt’s The Goal and its more recent IT-focused spiritual successor The Phoenix Project presents a strong case for the power of stories. Despite its age, The Goal is still worth a read for a less rushed path through the theory of constraints than the later book presents. The Phoenix Project is a must-read.

In this venue I’ll try to include not just end state advice but also some stories and anecdotes from myself and others.

Photo Credit: Dmitry Dzhus – Link