Conference Report: 2014 LISA Advanced Topics Workshop

Invited Talk: Making "Push on Green" a Reality: Issues & Actions Involved in Maintaining a Production Service

Dan Klein spoke on Making "Push on Green" a Reality: Issues & Actions Involved in Maintaining a Production Service. Despite encouragement from the audience, he didn't give the talk as an interpretive dance. The talk basically asked (and answered) the question, "What is 'push on green,' and what do you need to do before you can get to it?"

He laid out his assumptions for this talk: You have at least one server, at least one environment, some number of dependents or dependencies, the need to make updates, and a limited tolerance for failure.

What's "push on green"? When something (such as a human, build system, or test suite) says it's okay to release something, do it. It's unfortunately complicated by reality: How do you avoid rollout pain with a new or modified service (or API or library or...)? In summary:

Developing: Peer-reviewed code changes; nobody does a checkin-and-push (with possible exceptions when Production is broken, but the code review still needs to happen after the fact). Is the code readable? Well-documented? Test your code with both expected and unexpected conditions; does it fail gracefully? Use new libraries, modules, and APIs; don't do a "first upgrade in 5 years" thing.

Testing: Unit tests, module tests, end-to-end tests, smoke tests and probers, and regression tests. Find a bug? Write a test to reproduce it, patch it, and rerun the test. (Example: OpenSSL has only 5 simple tests at a high level and hundreds of modules that aren't directly tested at all.)

Monitoring: Volume (how many hits), latency, throughput (mean, minimum, maximum, standard deviation, rate of change, and so on); historical data and graphing; alerting and Service Level Agreements (SLA). (As a side note, SLAs require Service Level Objectives (SLO), which require Service Level Indicators (SLI).)

Updating (and rolling back): This should be an automated, mechanical, and idempotent process. It requires static builds, ideally with human-readable version numbers (like yyyymmdd_rcn), and it needs to be correlated with monitoring. You can mark a version as "live"; a push then just changes the pointer to that live version, and a rollback re-marks the "old" version as live and updates the pointer ("rolling forward to a previous version," assuming no database schema changes anyhow). You should also have canary jobs, which apply when you have more than one machine or process: some amount of traffic, steered through the configs or load balancing or some other mechanism, hits the canary job running the new version. You need to check the canary for "did it crash?" first. If you monitor the canaries and let them have some fraction of the traffic, you can look at those graphs, check for anomalies and trends, and see whether the canary looks as expected. If it looks good, you can push things live. If it doesn't, only a small fraction of users are affected for a short period of time.
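As a rough illustration of the "live pointer" idea, here is a minimal Python sketch. The Releases class and its method names are hypothetical (nothing like them was shown in the talk); the point is only that push and rollback are the same idempotent pointer change.

```python
class Releases:
    """Tracks immutable builds and a single pointer to the live one (hypothetical)."""

    def __init__(self):
        self.builds = []      # version strings, in the order they were built
        self.live = None      # the version currently marked "live"
        self.previous = None  # the version that was live before the last push

    def add_build(self, version):
        if version not in self.builds:
            self.builds.append(version)

    def push(self, version):
        """Mark `version` live. Pushing the already-live version is a no-op."""
        if version not in self.builds:
            raise ValueError(f"unknown build: {version}")
        if version == self.live:
            return
        self.previous, self.live = self.live, version

    def rollback(self):
        """Roll forward to the previous version by re-marking it live."""
        if self.previous is None:
            raise RuntimeError("no previous version to roll back to")
        self.push(self.previous)


releases = Releases()
releases.add_build("20141109_rc1")
releases.add_build("20141110_rc2")
releases.push("20141109_rc1")
releases.push("20141110_rc2")
releases.rollback()
print(releases.live)
```

Because a rollback is itself a push, rolling back twice returns to the newer build, and repeating a push of the live version changes nothing.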

Your organization needs a cross-cultural mindset across all of these.

So how do you do a safe rollout? In general:

Silence the relevant alerts in your monitoring system.

Update the canary jobs.

Run your smoke tests.

Let canaries "soak," or run for a while: the code or test might require some number of iterations such as loading a disk cache.

Push the remaining jobs.

Run the smoke tests again.

Unsilence alerts.
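The steps above could be scripted roughly as follows. This is only a sketch in Python under stated assumptions: the Alerts class, safe_rollout, and the update, smoke_test, and soak callables are all hypothetical placeholders for whatever your monitoring and deployment tooling provides.

```python
class Alerts:
    """Stand-in for a monitoring system's silence/unsilence controls (hypothetical)."""

    def __init__(self):
        self.silenced = False

    def silence(self):
        self.silenced = True

    def unsilence(self):
        self.silenced = False


def safe_rollout(jobs, canary_count, update, smoke_test, soak, alerts):
    """Run the rollout steps in order; return True only if every step passed."""
    canaries, rest = jobs[:canary_count], jobs[canary_count:]
    alerts.silence()
    try:
        for job in canaries:
            update(job)
        if not smoke_test(canaries):
            return False      # stop here: only the canaries run the new version
        if not soak(canaries):
            return False
        for job in rest:
            update(job)
        return smoke_test(jobs)
    finally:
        alerts.unsilence()    # always restore alerting, even on failure


pushed = []
alerts = Alerts()
ok = safe_rollout(
    jobs=["job-0", "job-1", "job-2", "job-3"],
    canary_count=1,
    update=pushed.append,
    smoke_test=lambda jobs: True,  # pretend every smoke test passes
    soak=lambda jobs: True,        # pretend the canaries soaked cleanly
    alerts=alerts,
)
print(ok, pushed, alerts.silenced)
```

If a smoke test or the soak fails, the function returns before touching the remaining jobs, which is what keeps the damage confined to the canaries.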

What about making the configuration changes? You can have job restarts with runtime flags, or HUPping the job to reread the config file. The latter is faster but riskier.
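For the second option, a job can reread its config file when it receives SIGHUP. The following is a minimal Python sketch, assuming a JSON config file and a Unix-like system; the function names are made up for illustration. Keeping the old config when the new file fails to parse is one way (my assumption, not something from the talk) to reduce the risk of this approach.

```python
import json
import signal

config = {}  # the job's current settings, shared with the rest of the process


def load_config(path):
    """Parse the file first, then swap the settings in, so a bad file changes nothing."""
    with open(path) as f:
        new_settings = json.load(f)
    config.clear()
    config.update(new_settings)


def make_hup_handler(path):
    """Build a signal handler that rereads `path` and keeps the old config on error."""
    def on_hup(signum, frame):
        try:
            load_config(path)
        except (OSError, ValueError):
            pass  # unreadable or invalid file: keep running with the previous config
    return on_hup


def install(path):
    """Load the config once and reread it on every SIGHUP (Unix only)."""
    load_config(path)
    signal.signal(signal.SIGHUP, make_hup_handler(path))
```

A job would call install("/path/to/its/config.json") at startup; after that, sending the process a HUP (for example with kill -HUP and its process ID) triggers the reread without a restart.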

At Google, in about 84 weeks with this process, they went from about 5 rollouts per week towards 60 (with a peak of 75) and freed up an FTE engineer. Having more frequent and smaller rollouts helps developers, who don't have to wait for the "weekly build" to release their code.

Process improvements they've made include automated recurring rollouts (using a 4.5-day week, excluding weekends and Friday afternoons), inter-rollout locking (to prevent stomping on each other), schedules of rollouts (to prevent things from happening when people are sleeping), and one-button rollbacks.
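The 4.5-day week is easy to express as a check that an automated rollout scheduler could run first. This Python sketch assumes "Friday afternoon" starts at noon local time, which is my assumption rather than a stated rule, and it does not model the separate "people are sleeping" schedule.

```python
from datetime import datetime

FRIDAY = 4               # datetime.weekday(): Monday is 0, Sunday is 6
FRIDAY_CUTOFF_HOUR = 12  # assumption: "Friday afternoon" begins at noon


def rollout_allowed(when):
    """Return True if `when` falls inside the 4.5-day rollout week."""
    weekday = when.weekday()
    if weekday > FRIDAY:
        return False  # Saturday or Sunday
    if weekday == FRIDAY and when.hour >= FRIDAY_CUTOFF_HOUR:
        return False  # Friday afternoon
    return True


print(rollout_allowed(datetime(2014, 11, 14, 9, 0)))   # a Friday morning
print(rollout_allowed(datetime(2014, 11, 14, 13, 0)))  # that Friday afternoon
```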

Future enhancements to the process include rollback feasibility (how easy is it to roll back the release, e.g. if there are schema changes), continuous delivery (just release it automatically if there's a change in the binary or config file checked in), rollout quotas (prevent someone from taking all the slots for a person, team, or tool), and green on green (if there's continuous delivery and something breaks, should that halt additional deployments?).

This is an evolutionary process. Things will go wrong; things break. Adjust attitudes: find reasons, don't assign blame, and fix the process. Let humans be smart and machines be repeatable. Have a Big Red Button so a human can stop things if needed.

You don't need to be at Google-scale to do this. There are no silver bullets; sorry. It's a laborious process with lots of baby steps. You have to be careful and not take shortcuts, but keep going.

While waiting for the Q&A to start, Dan did perform an interpretive dance after all.

For more information, please see the "Push on Green" article in the October 2014 (vol.39 num.5) issue of ;login:.

Invited Talk: Distributing Software in a Massively Parallel Environment

Dinah McNutt has been coming to LISA since LISA IV (1990) and chaired LISA VIII (1994), and while she used to be a sysadmin she's now a release engineer. One of her passions is packaging; she's fascinated by different package managers, and this talk covered Google's. She spoke on Distributing Software in a Massively Parallel Environment.

The problem is that with very large networks, it may take a long time to distribute things: there are bottlenecks (such as network, disk, CPU, and memory), a machine may be offline, networks might be partitioned ("you can't get there from here"), and there may even be concurrent writers.

Their package management system is called Midas Package Manager (MPM). Package metadata (more below) is stored in their Bigtable database; package data is stored in their Colossus File System and replicated. The transport mechanism is a custom P2P mechanism based on torrent.

Note: This talk is not applicable to Android. It's just for the Linux systems.

An MPM package and metadata contain the contents of the package (the files), a secure hash of the unique version ID, signatures for verification and auditing, labels (such as "canary," "live," "rc" with date, and "production" with date; for more on the "canary" and "live" labels see the preceding talk), pre-packaging commands, and optionally any pre- and post-installation commands.

She gave a quick case study: A config file needs to go to thousands of machines, so the relevant pieces are packaged into an MPM file, and a job (that is, a process running in a container) on each remote machine fetches and installs a new version of that MPM every 10 minutes, so the config changes can go out quickly. A post-fetch script is in the MPM to install the new config file. Easy, right? Alas it's not quite that simple: Machines may be offline, bottlenecks must be minimized, jobs have to specify the version of a package, jobs on the same machine may use different versions of the same package, the system must be able to guarantee files aren't tampered with in flight, and the system must be able to roll back to a previous version.

At package creation, the build system creates a package definition file, which includes the file list; ownership and permissions; and pre- and post-install and remove commands. All of it is generated automatically. Then it runs the build command. It can apply labels and signatures at any point during the process.

If files going into the package aren't changed, a new package isn't created; the label or signature is just applied to the existing (unchanged) package.
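One way to get this behavior is to derive the version ID from the package contents, so that identical inputs always map to the same package. The Python sketch below assumes that design; the talk did not describe how MPM actually computes its IDs, and version_id and PackageStore are hypothetical names.

```python
import hashlib


def version_id(files):
    """Hash paths and contents in sorted order; `files` maps path -> bytes."""
    digest = hashlib.sha256()
    for path in sorted(files):
        digest.update(path.encode("utf-8") + b"\0")
        digest.update(hashlib.sha256(files[path]).digest())
    return digest.hexdigest()


class PackageStore:
    """Keeps one entry per distinct set of contents (hypothetical)."""

    def __init__(self):
        self.labels = {}  # version ID -> set of labels on that package

    def build(self, files, label=None):
        """Return (version ID, created); unchanged contents reuse the old package."""
        vid = version_id(files)
        created = vid not in self.labels
        labels = self.labels.setdefault(vid, set())
        if label is not None:
            labels.add(label)
        return vid, created


store = PackageStore()
contents = {"etc/app.conf": b"workers=4\n"}
first_id, first_created = store.build(contents)
second_id, second_created = store.build(contents, label="canary")
print(first_id == second_id, first_created, second_created)
```

The second build creates nothing new; it only attaches the "canary" label to the package that already exists.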

The metadata can be both immutable (who, when, and how was it built; list of files, attributes, and checksums; some labels (especially those with equals signs); and version ID) and mutable (labels (those without equals signs) and cleanup policy).

The durability of an MPM package depends on its use case: Test packages are kept for only 3 days, ephemeral packages are kept for a week (usually for frequently-pushed configuration files), and durable packages are kept for 3 months after their last use (and stored only on tape thereafter).
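These retention rules could be written as a small policy table. In this Python sketch the names are hypothetical, "3 months" is approximated as 90 days, and test and ephemeral packages are assumed to age from their creation time (the talk only specified "after their last use" for durable packages).

```python
from datetime import datetime, timedelta

RETENTION = {
    "test": timedelta(days=3),
    "ephemeral": timedelta(days=7),
    "durable": timedelta(days=90),  # assumption: "3 months" treated as 90 days
}


def is_expired(kind, created, last_used, now):
    """Durable packages age from last use; the others from creation (assumed)."""
    reference = last_used if kind == "durable" else created
    return now - reference > RETENTION[kind]


now = datetime(2014, 11, 14)
four_days_ago = now - timedelta(days=4)
print(is_expired("test", four_days_ago, now, now))
print(is_expired("ephemeral", four_days_ago, now, now))
```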

Distribution is via pull (by the client-side job). The advantages are that it avoids network congestion (things are only fetched when needed) and lets the job owners decide when to accept new versions (e.g., "wait until idle"). The drawbacks include that job owners can decide when to accept new versions, there has to be extra logic in the job to check for new versions or the ability to restart jobs easily, and it can be difficult to tell who's going to be using a specific version.

The package metadata is pushed to Bigtable (which is replicated) immediately. Root servers read and cache data from their local Bigtable replica. MPM queries the local root server; failover logic is in the client, so if requests fail they're automatically redirected to another Bigtable replica.
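Client-side failover of this kind can be sketched in a few lines of Python. The server names and the fetch callable are hypothetical stand-ins; the real MPM client and its RPC layer were not shown.

```python
def query_with_failover(servers, package, fetch):
    """Try each server in order (local first) and return the first good answer."""
    errors = []
    for server in servers:
        try:
            return fetch(server, package)
        except ConnectionError as exc:
            errors.append((server, str(exc)))
    raise ConnectionError(f"all servers failed for {package}: {errors}")


def fake_fetch(server, package):
    """Pretend the local root server is down and the remote one answers."""
    if server == "root-local":
        raise ConnectionError("unreachable")
    return {"server": server, "package": package}


result = query_with_failover(["root-local", "root-remote"], "storage/client", fake_fetch)
print(result["server"])
```

Putting the retry loop in the client means no load balancer or coordinator has to notice the failure first.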

Package data is in the Colossus file system, scattered geographically. It's a 2-tiered architecture; frequently used packages are cached "nearby" (closer to the job). The fetch is via a torrent-like protocol and the data is stored locally, so as long as it's in use you don't need to talk to either Bigtable or Colossus. There's only one copy on the machine no matter how many jobs on the machine use it. They have millions of fetches and petabytes of data moving daily.

Security is controlled via ACLs. Package name space is hierarchical, like storage/client, storage/client/config, and storage/server. ACLs are inherited (or not). There are three levels of access:

Owner can create and delete packages, modify labels, and manage ACLs

Builder can create packages and add/modify labels

Label can control who can add/modify specific labels: production.*, canary, my_label=blah, and so on
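To make the "inherited (or not)" part concrete, here is a Python sketch of a lookup that walks up the package name space. The data layout, the inherit flag, and the user names are all assumptions for illustration, and only the owner and builder roles are modeled.

```python
ACLS = {
    "storage": {"inherit": True, "owner": {"alice"}, "builder": {"buildbot"}},
    "storage/client/config": {"inherit": False, "owner": {"carol"}, "builder": set()},
}


def has_role(acls, package, user, role):
    """Check `package` and then its parents, stopping where inheritance is off."""
    parts = package.split("/")
    while parts:
        entry = acls.get("/".join(parts))
        if entry is not None:
            if user in entry.get(role, set()):
                return True
            if not entry.get("inherit", True):
                return False  # this node does not inherit from its parents
        parts.pop()
    return False


print(has_role(ACLS, "storage/client", "alice", "owner"))
print(has_role(ACLS, "storage/client/config", "alice", "owner"))
```

Here alice owns storage/client through inheritance from storage, but not storage/client/config, because that node turns inheritance off and names only carol.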

Individual files can be encrypted within a package, and ACLs define who can decrypt the files (MPM can't). Encryption and decryption are performed locally and automatically, which allows there to be passwords that aren't ever stored unencrypted.

Packages can be signed at build time or later. Secure key escrow uses the package name and metadata so a package can be verified using the name and signer.

So all of that said, we can go back to the case study and now see a bit more under the hood.

Why love MPM? There's an mpmdiff that can compare any two packages regardless of name (comparing things like the file owner, file mode, file size, file checksums, and the pre and post scripts).

Labels are great. You can fetch packages using labels. You can use them to indicate where the package is in the release process (dev, canary, or production). You can promote a package by moving labels from one package to another, though some labels (those with equals signs) are immutable and can't be moved. Some labels are special (such as "latest," which shouldn't be used because it bypasses the canary). They can assist in rollbacks (like "last_known_good" or "rollback" to label the current MPM while promoting the new one).
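Promotion by moving labels might look like the following Python sketch. The promote function and the dictionary layout are hypothetical; the only rules taken from the talk are that labels containing an equals sign can't be moved and that a label such as "last_known_good" can mark the outgoing package.

```python
def promote(labels, new, label="live", fallback="last_known_good"):
    """Move `label` to version `new` and tag the outgoing version with `fallback`.

    `labels` maps a package version to its set of labels.
    """
    for name in (label, fallback):
        if "=" in name:
            raise ValueError(f"label {name!r} is immutable and can't be moved")
    outgoing = [v for v, tags in labels.items() if label in tags and v != new]
    if outgoing:
        for tags in labels.values():
            tags.discard(fallback)  # the fallback marker moves too
        for version in outgoing:
            labels[version].discard(label)
            labels[version].add(fallback)
    labels[new].add(label)


labels = {
    "v1": {"live", "build=123"},
    "v2": {"canary"},
}
promote(labels, "v2")
print(sorted(labels["v1"]), sorted(labels["v2"]))
```

A rollback is then just another promotion, this time of the package carrying the fallback label.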

There's a concept of file groups: It's a grouping of binaries within an MPM. Binaries can belong to more than one group. Common practice is to store both stripped and unstripped binaries in the same MPM but in different file groups, to ensure the unstripped and stripped binaries match when troubleshooting problems.

There's a web interface to browse all MPMs and show the metadata. It also shows graphs by size (so you can see how file groups change over time).

In the Q&A, Dinah did not perform an interpretive dance. However, she did address the questions.

Because job owners have control over when they accept new versions, the MPM team can't guarantee that every machine in production runs the "correct" version; you may have to nag people to death to upgrade. The release processes can therefore vary wildly. The SREs are good and well-respected; they're gatekeepers to keep the release processes sane. The automated build system (which is optional) enforces workflows. There is a continuous testing system where every CL submitted triggers a test. They insist that formal releases also run tests since the flags are different.

One thing possibly missing is dependency management, but that's because packages are self-contained. Performing a fetch pulls in the dependent packages, and the code explicitly lists the dependencies. In MPM, the goal was to avoid dependency management since anything can be in a package.

Last update Feb01/20 by Josh Simon (<jss@clock.org>).