Operational Capabilities of a DevOps Environment

So what are some of the broad capabilities you need to implement in a DevOps environment?

Automated Environment Creation

First, and possibly foremost, you need the ability to automatically and consistently spin up environments. That's a huge deal, and it isn't easy.

  • Automatically: This means enabling a variety of authorized roles within your organization to start environments on-demand, without involving any human beings. This might be a developer spinning up a development or test environment, which might be something they need to do several times a day. It might also be an automated process spinning up an environment in which to run acceptance tests.

  • Consistently: The environments that are spun up must accurately reflect the final production environment. There are two ways to do that:

    • Come up with a method of creating environments, and use that to also create the production environment as well as whatever other environments are needed. That way, you know they all match.

    • Come up with a method of modeling the production environment, and then applying that model to whatever other environments you spin up.

Emerging configuration management technologies - such as Microsoft's Desired State Configuration, or products like Chef, Salt, Puppet, and Ansible - are examples of tools that help implement some of these capabilities. When you can write some kind of configuration document that describes the environment, and then have a tool that can implement that document wherever and whenever you want, then you're getting close to the necessary capability. Containerization is another enabling technology that can help in this space, since it helps abstract a number of environmental variables, reducing variation and complexity.

It's easy to understand why this is such an important capability, though. If you can guarantee that everyplace an application might run - development, test, or production - is exactly the same, all the time, then you're much less likely to run into problems moving the code from environment to environment. And, by giving other roles - like developers - the ability to spin up these accurate environments on demand, you help facilitate more real-world testing, and eliminate more problems during the development phase.

I don't want to downplay the difficulty involved in actually creating this capability, nor do I want to dismiss the management concerns. Environments take resources to run, and so organizations can be justifiably concerned about having developers spin up virtual machines willy-nilly. But we're not talking about unmanaged capability. That's something that kills me every time I get into a discussion about DevOps with certain kinds of organizations. "Well, once we give developers permission to spin up whatever VMs they want, that's the end of the world!" and they throw up their hands in defeat. But that's not what we're talking about.

The reason DevOps has "Ops" at the end of it is because Operations doesn't go away. Developers don't "take over." Our job is to provide developers with a managed set of capabilities. So yes, a developer working on a project should be able to spin up a virtual environment without anyone else's intervention, and they should be able to recycle - that is, delete and re-create - that environment anytime they want. That doesn't mean they get to change the environment's specification on their own, nor does it mean they get free reign of the virtualization infrastructure.

Let me offer you a really simplistic, yet incredibly real-world, example of this. Amazon's Elastic Beanstalk service is designed to spin up new environments - that is, virtual machines - more or less on-demand in response to customer load. Each new virtual machine starts as an identical copy of a base operating system image, and each new virtual machine can load content - like a web site - from a GitHUb repository. So right there, you've created some of the automation and consistency you need. With a button push, or in reaction to user load, you can automate the creation of new environments, and because they all come from known, standard sources, they'll be consistent.

It's extremely likely that developers will need environmental changes beyond what's in the OS base image, and so developers can specify additional items. They can set environment variables, specify packages to be downloaded and installed, and so on. In the past, a developer would have tinkered with their development environment until everything worked, and then hopefully communicated the results of that tinkering to someone in Operations. Ops would then, hopefully, faithfully re-create what the developer did. But did you get the right versions of the packages? Did you set all the environment variables?

In Elastic Beanstalk, though, developers don't just "tweak" the environment. That's because every time a virtual machine shuts down, it vanishes. Any tinkering that was done is gone. On next startup, it reverts back to that base OS image. So, as part of the project's source in GitHub, developers can specify a configuration file that explicitly lists all the extra packages, environment settings, or whatever. Because that configuration information is part of the GitHub source, every new VM created by Elastic Beanstalk will be created with those exact same settings, every time.

This is a very DevOps approach, and in this case, Amazon has taken on the role of "Ops." If a developer wants to make an environmental change, they modify the project's source, and then tell Amazon to recycle the environment. Everything shuts down, and a whole new, fresh environment spins up. It's completely documented, so if it works the way the dev wants, then it'll be perfect when it's used for test, production, or anything else. And, in a typical cloud-centric way, Ops - that is, Amazon - doesn't have to be manually involved in any way. They've created automation interfaces that let any authorized user spin up whatever they want.

As a sidebar, this DevOps idea is a kind of follow-on to the concept of "private cloud." Private cloud simply means running your private IT resources in a way similar to public cloud providers - meaning automation on the Operations side. You come up with a way of specifying who can do what, and then you let them do it on their own. With a public cloud provider, permissions more or less consist of "whatever you want to pay for," but in a private cloud situation, permissions can be much more granular or even completely different. Nobody's suggesting that you build your own AWS or Azure; that's not what private cloud means. But you'll find that the private cloud capabilities are the very ones that you need to provide, as an Operations person, to enable a DevOps approach within your organization.

Development and Test Infrastructure

As I described in the previous chapter, traditional IT management places some pretty firm "gates" between development, test, and especially operations - with "operations" being more or less synonymous with "production." In DevOps, we break that relationship and eliminate the gates. Operations is responsible for infrastructure, whether that infrastructure supports developers, testing efforts, or production users. And those different phases of the application lifecycle get much more tightly integrated. Some of the high-level things you'll need include:

  • Source code repositories. Git is a common example these days, as is Microsoft's Team Foundation Server and others. What's important is that your developers' tools be tightly integrated with whatever you've chosen. Ideally, these repositories should have, or be capable of integrating with, some pretty deep coding of their own. For example, the repository should be able to run pre-defined tests on code before it allows check-ins, and might perform an automated build-and-test routine each time code is checked in.

  • Dashboards. Developers and testers need access to the operational capabilities you've provided them, such as the ability to recycle a virtual development environment. Ideally, you can integrate this as part of their main tool surface, such as an integrated development environment. Being able to click one button to "compile that, spin up the dev environment, load the compiled code, and run the app" is pretty powerful. In cases where that level of integration isn't possible, then you'll need to provide some other interface for making some of those activities easy to perform.

  • Testing tools. A certain amount of testing needs to be automated, so that developers can get immediate feedback, and so that tests can be run as consistently as possible.

That last capability is perhaps one of the most complex. In one ideal approach (although certainly not the only one, and even this will be a simplified example), the workflow might be something like this:

  1. Developer writes code.

  2. Developer runs code in a "private" development environment, performing unit tests.

  3. Developer repeats steps 1-2 until they're satisfied with the code, and then checks it into a repository.

  4. Repository runs certain quality checks - which might simply enforce things like coding conventions - before allowing check-in.

  5. If check-in succeeds, repository kicks off an automated build of the code. This is deployed to a newly-created test environment.

  6. Automated testing tools run a number of acceptance tests on the code. This might involve providing specific inputs to the application and then looking for specific outputs, "hacking" data into a database to test application response, or so on. Creating these tests is really a coding effort in and of itself, and it might be completed by the developer working on the code, or by a dedicated test coder.

  7. Test results are stored - often in a part of the source code repository.

  8. If tests were successful, then the build is staged for deployment. Deployment might happen during a scheduled window following that build.

You can see that the human labor here is almost all on developers, which is one reason people refer to DevOps as a "software development methodology." But the Ops piece provides all the infrastructure and automation from step 4 on, enabling a successful build to move directly to production.

Obviously, different organizations will have different takes on this. Some might mandate user acceptance testing as an additional manual step, although Ops could help automate that. For example, after step 7 above, you might automate the creation of a user acceptance testing environment, deploy the code to that environment, and then notify someone that it's ready for testing. Their acceptance might trigger the stage-for-production step, or their rejection might feed back to the developer to begin again at step 1.

The point is that Operations needs to provide the automation so that this sequence runs with as little unnecessary manual intervention as possible. Certainly, Ops should never be acting as a gatekeeper. We're not code testers. If the code passed whatever quality checkpoints have been defined, then the code's ready to deploy, and we should handle as much of that automatically as possible. Even the deployment - once approved, and on whatever schedule we've defined - should happen automatically.

You can see that DevOps, as an abstract philosophy, actually requires a lot of concrete tooling. And you can perhaps see that, because organizations will all have different particulars about how they want to manage the process, it would be difficult for commercial vendors to produce that tooling. There's not really a "one size fits all" approach for DevOps, which means Operations will end up creating a lot of it's own tooling. That's where platform technologies come into play. They can provide a set of building blocks that make it easier to create those custom DevOps tools you'll need.

End-User Experience Monitoring

This is perhaps the most important part of a DevOps organization, and it's the easiest to overlook.

As an IT Ops person, you're probably already pretty familiar with monitoring, and make no mistake: it's just as important under DevOps as it was before DevOps. Monitoring not only to notify someone when something goes wrong, but also monitoring to help profile applications (and their supporting services and infrastructure), so you can proactively address problems before they become severe.

But IT Ops' definition of "monitoring" often isn't as inclusive as it should be. We tend to only monitoring the things that are directly under our control. We monitor network usage, processor load, and disk space. We monitor network latency, service response times, and server health. We monitor these things because we can affect these things.

One of the biggest collaborations a DevOps organization can have, however, is monitoring the end user experience. It's something we, as IT people, can't directly touch, but if the whole point of IT is to deliver apps and services to users (and yes, that is the whole point), then the end-user experience of those apps and services is quite literally the only metric that matters. Why do we measure network latency? Because it contributes to the user experience. Why do we measure service response time? User experience. We attempt to indirectly measure the end-user experience, because we've often no way of directly measuring it.

DevOps' philosophy of developers and operations collaborating comes to a pinnacle with end-user experience monitoring. Developers should build applications with the ability to track the end-user experience. For example, when some common operation is about to begin, the application should track the start time, and then track the end time. Any major steps in between should receive a timestamp, too, and that information should be logged someplace. In Operations, we need to provide a place for that log - that performance artifact - to live, and we need to provide a way for developers to access it. We need to baseline what "normal" performance looks like, and monitor to track for declines in that baseline. Operations may be responsible for the monitoring itself, but developers, in their code, can give us the instrumentation to monitor what matters most.

If end-user experience numbers begin to decline - say, the time it takes to perform a common query and display the results starts to get longer and longer - then we can dig into more detailed instrumentation and see if we can find the cause. Is it network latency? Server response time? Any other correlations that might point to a cause? But by directly measuring what our users experience, we have an unassailable top-level metric that represents the most real-world thing we can possibly have on the radar.

I'm making a big deal of end-user experience monitoring not only because it's important and useful, but also because it's one of the easiest-to-grasp examples of what DevOps is all about. Developers have traditionally cared about users' experience (in theory), but they're extremely disconnected from it. Operations is very connected to what users experience (we get the Help Desk calls, after all), but we're relatively powerless to put our fingers directly on it. Through the collaboration that drives DevOps philosophy, though, developers and operations personnel can come together to do their collective job better.