Conference Report: 2021 SREcon Americas

The following document is my general trip report for SREcon21 Americas, held virtually October 12–14, 2021. It is going to a variety of audiences, so feel free to skip the parts that don't concern or interest you.


General Comments

The Good, the Bad, and the Ugly

SREcon21 Americas continued USENIX's long-term tradition of not starting on time. Between broadcast issues and getting the staff to talk to each other and play the correct videos, we started about 10 minutes late.

As with the other pandemic-era conferences where the videos were pre-recorded, the quality varied with the microphone and camera of the people recording and with whatever editing they did (or, more often, didn't do).

Unlike the also-virtual LISA 2021, the speakers were not available on video (the Zoom feed) for interactive Q&A, though they were available in real-time in the ephemeral conference chat.

There were effectively two conferences:

Conference name          UTC        EDT        PDT        Audience geography
SREcon21 Americas East   1400–1830  1000–1430  0700–1130  Americas and EMEA
SREcon21 Americas West   0100–0530  2100–0130  1800–2230  Americas and APAC

While the one registration fee covered all events in both conferences, this report covers only the former.

All of the conference videos were online on YouTube shortly after the relevant track ended.

Conference Survey

My remarks included the following.

I am extremely happy that the talk-specific videos were posted to YouTube and included in the conference program as quickly as they were. I don't know (because I haven't played any yet) how much post-production work went into these (e.g., teaser and trailer, closed captioning, watermarks, etc.).

I am disappointed but not surprised to learn that USENIX still retains the tradition of not being able to start the first day's opening remarks on time and as scheduled, regardless of whether the event is in person or virtual. I cannot think of a USENIX conference I've attended in the past decade or two that did start on time.

I am also disappointed (but in this case also surprised) that you didn't have the foresight to allow for overflow. You have the prerecorded speaker videos and thus know their length, and you can estimate how much time the session chairs will take for introductions before and thanks after each. You should therefore be able to calculate whether the session is likely to run long — or even assume each will by, say, a couple of minutes — and schedule the stream to last that much longer and not cut out sharply on the hour. If the content ends early you have the music loop and sponsor slideshow to stream.

I am also disappointed that the conference backchannel and social channels are all purely ephemeral. By choosing Swapcard for the chat feature you (a) combine all the information for all the talks in a given track in one place and (b) don't have the ability to thread (or even reliably track) conversations. Compare that to Slack, where you can have session- or track-specific (and not only conference-wide) channels, all with single-level threading of conversations therein. I concede that both allow for @tagging @people. I also concede that the hallway track and Q&A sessions at in-person conferences also tend to be ephemeral (though the in-person formal Q&A can be, and at least used to be, recorded on video), whereas the Slack backchannel remains.

Insert usual comment about speakers' video recording setups (lighting, backdrop, sound) here. There's no good solution, unfortunately; USENIX can't mandate high-end recording studios, professional lighting and sound crews, or multiple takes to minimize any vocal tics (e.g., um er uh like). It was obvious which speakers were single-take and which were multiple-take (especially Niall Murphy's plenary; the lighting shifts were very distracting).

I recognize the exhibitors bring in money that helps pay for the conference without you having to charge us [tens of] thousands of dollars per attendee. In person I'll at least wander the vendor floor, though there's rarely a product or service I'm interested in these days (between what I already know and what my job limits me in doing). At a virtual event I don't even bother.


Technical Sessions

Tuesday, October 12

Opening Remarks

Once the broadcast issues were resolved, the conference proper began with the usual opening remarks: Thanks and notes (sponsors, program committee, liaisons, staff, and board). They stressed the code of conduct so the conference could be a safe environment for everyone.

Plenary: "Don't Follow Leaders" or "All Models Are Wrong (and So Am I)"

[Direct video link]

Five years after the publication of the SRE book, it's a good time to reflect on what it did — the good, the bad, the ugly, and the beautiful — and relate it to what is going on in production engineering in general and SRE in particular, and to the problems in the field we've left unaddressed and/or created for ourselves.

He talked about how the book came to be and its impact and legacy. Doing the book required a new way of looking at work. There were over 70 contributors to the book. Making edits (especially "cut 20%") at short notice was difficult but the right thing to do.

Impact: It sold like hotcakes. Publishers are sensitive about numbers but this book sold multiples of the "best seller" number of copies.

Criticisms: There were questions of tone and perceived arrogance (in part because Google was, especially in 2016, a behemoth), but even with that people still felt the book to have value. The second criticism was that it showcased one set of accomplishments while ignoring others', as DevOps practitioners were quick to point out. Another is that the book isn't a complete model, and that SRE relies on things that aren't written yet (and not just because they had to cut 20%) and thus provides an incomplete view of lifecycles.

Legacy: The value of different mental models for how operations can work in organizations is high. Having software-founded concepts and activities is foundational. The book has, unfortunately, led to a lack of innovation; people aren't thinking about what they have to do from first principles but are instead using copy/paste. We need to work on improving the models and answering questions for which we have no or fewer answers today.

Don't just rebrand Operations as SRE. Don't just return to using no models or less-useful models.

Reliability matters; it's in the name "SRE." However there's no good (numeric or mechanistic) model to justify to management why you should do one thing as opposed to another. We can't really describe the tradeoffs between reliability and anything else (security, machine learning, etc.) because there's no model for it. Thus how we do and value reliability is all socially constructed. We assert its importance but have no foundation for that assertion.

If you have users who won't care if your site (application, etc.) is down — for example, "We know they'll stick around because we're the only game in town" — then reliability is really less important than you might think.

We have some hints about what that space may look like. For example, the relationship between latency and customer experience has been reproduced often enough to recognize that as latency gets worse, customers order less or go away entirely. It's definitely an area for exploration. Other areas include the tradeoff between simplicity (required in the long term) and spending it down in the short term (for revenue acquisition); sufficiently mature systems can also provide some coarse-grained values (if we're down for this long we lose this much money).

SLOs have value, especially in their social constructions, but also have their own weaknesses. There are questionable assumptions:

How important is on-call for SRE? Beyond the viewpoint that SRE is "competent people who happen to be on-call," the difficulty is ex post facto value: Reacting to an incident has value, but that focus ignores the value of prevention. Preventative measures (in the design and build phases) can be much more valuable, reducing the need for on-call.

Finally, we need to integrate safety cultures. Post-incident response needs to be blameless. Metrics are necessary when speaking to executives but shouldn't be the end-all and be-all.

What is our role, ethically, in managing machines that manage society? We've looked at DEI in the past. As we move into stronger corporate control, SRE needs to stand for something other than making things faster and more efficient. Do we need a controlled profession with entry criteria and professional standards, a whistleblowers' charter, a code of ethics? SRE itself, as a profession (as wonderful as it is), is a very pragmatic and conservative one. As a result it's hard for its practitioners to push forward theoretical things or things with no perceived short-term business value.

He ended with a call to action: We need to address these issues, and move on from the book, as soon as possible.

10 Lessons Learned in 10 Years of SRE

[Direct video link]

In this talk they discussed some key principles and lessons learned that they've developed and refined in more than 10 years of experience as a Site Reliability Engineer across several teams within Google and Microsoft. These are topics that often come up as they discuss Site Reliability Engineering with Microsoft customers that are at different stages of their own SRE journey, and that they — hopefully! — find insightful. They broadly belong to the areas of "Starting SRE" and "Steady-state SRE."

They discussed fundamental principles of adopting SRE, were honest about their mistakes (so we can avoid making them!), and want to compare notes on different ways of doing SRE.

Once you have an established SRE team, how can they be successful?

Let the Chaos Begin: SRE Chaos Engineering Meets Cybersecurity

[Direct video link]

Security Chaos Engineering is built around observability and cyber resiliency practices, aiming to uncover the "unknown unknowns" and build confidence in the system. Engineering teams progressively work to close gaps in their understanding of security concerns within complex infrastructure and distributed systems. The talk was a live demo of formulating a simple hypothesis and then showing how their tools can be used to confirm or deny it based on security chaos experimentation.

Both speakers had heavy accents and spoke quickly, making the talk difficult to understand.

What to Do When SRE Is Just a New Job Title

[Direct video link]

When the SRE book was published in 2016 the job title of SRE was not widely used outside Google. Fast-forward five years and it seems like every company is hiring SREs. The talk asked if the System Administrator and Operations jobs disappeared or have their job titles simply changed. The speaker talked about transforming a disjoint team of engineers into a high-performing SRE team, based largely on an overland trip he took with his wife and dog through North and South America (2017–2020).

Capacity Management for Fun & Profit

[Direct video link]

The speaker talked about her journey greenfielding all things infrastructure capacity for Elastic's growing multi-cloud-based SaaS, and showed how capacity management and planning — moving past the "throw unoptimized infrastructure at a problem and worry about waste later" stage — can lead to increased profit margins.

This speaker stumbled a lot as if she was giving a live talk, but didn't go back to edit or re-record anything.

A Political Scientist's View on Site Reliability

[Direct video link]

Political science can provide novel and fresh insights into software engineering problems.

This talk aimed to give a different perspective on everyday questions focusing on international relations and international law. The first third of his talk was about political science historically and "not what [he] wants to talk about today," begging the question "Then why are you talking about it?"

He spent a lot of time discussing Luhmann's theories as applied to systems design and debugging. Complexity begets complexity even when hidden behind abstractions.

Panel: Engineering Onboarding

[Direct video link]

In this panel on Engineering Onboarding a few industry experts discussed their thoughts on what the big questions and challenges in this field are, what the significant changes in the past few years have been, and finally, what's next.

The panelists started with a round-robin set of introductions, then began discussing some of the challenges. For example, Azure onboards 300–800 new engineers per week, and having them be both performant and comfortable early is their big challenge.

In general, there are challenges both for those being onboarded and for those doing the onboarding. This expands into hiring practices in general (finding the right people, hiring them, and then bringing them in), as well as understanding that onboarding is more than just the employee's first day drinking from the metaphorical firehose.

Most onboarding focuses on introduction and orientation (day one), but there's preboarding before then, then the assignments (first project, starting small and growing bigger), and finally ongoing work and development (feedback and performance evaluations). Research says it takes a year to go through all four stages.

A good onboarding experience also leads to higher retention, so factor retention into the cost/benefit analysis of onboarding.

They also discussed how the pandemic changed things as in-person became impossible, and what changes they plan to keep (such as multiple learning modalities), and how younger people tend to want to learn in different ways now than previously.

Q: Since there's no college course, how do you learn to be an SRE?

  • Study aspects of large system design; sre.google.com/classroom
  • Apprentice programs.
  • Finding good mentors.
  • Scaffold the experience: Start small and build your way up.
  • Learn how and where things can break.
This is probably a talk worth watching as we continue to onboard people, especially remotely.

    Wednesday, October 13

    Plenary: DevOps Ten Years After: Review of a Failure

    [Direct video link]

    Ten years after the Velocity talk that started the DevOps movement ("10+ Deploys Per Day: Dev and Ops Cooperation at Flickr"), John Allspaw (more operations) and Paul Hammond (more development) had a discussion, moderated by Thomas Depierre, with old men yelling at clouds.

    In retrospect, they knew what they wanted to talk about at Velocity: They wanted to tell a story about how they were doing things at Flickr, and it might be different than how others are (it was different than at their previous employers), and it might be useful to others. It resonated more than they expected: Delivering services was different than shipping software on an installation medium (like CD-ROM, floppy disk, or magnetic tape). In a pre-web world there was no way to operate without literally and physically shipping things.

    DevOps was a bottom-up set of perspectives different from what came before and that generated new types of discussion. Over time, people have taken every slide from the original talk and turned that into a talk of its own. A huge amount of work has been done in learning methodologies, monitoring, and observability, and in just finding and sharing solutions that work. At the time, "developer commit" didn't automatically mean "push to production."

    They referenced Sarah Sheard's "Life Cycle of a Silver Bullet" paper: Someone solves a problem in their own context; as other people apply the solution a movement forms, then it's applied to areas where it's not applicable, then it's watered down, and then the backlash starts. And a few years later someone looks at it, realizes it's not working, builds their own new solution... and the cycle repeats itself. Given that, does "We're practicing DevOps" really mean anything any more?

    "The business requires change" is perhaps the most important slide from the initial talk that hasn't really turned into subsequent talks. Nobody has a talk about talking to other teams, and internal and external stakeholders, and what the end users actually want to accomplish. Issues have different costs based on what you're doing (such as banking and finance as opposed to governments and nonprofits), and on your size (startups focus on growth and larger companies have other driving forces). It could be argued that a lot of DevOps practices, like feature flags, automated testing, and CI/CD pipelines, can be considered hedges against uncertainty. DevOps is being clear and understanding what the business' change is and mitigating risks, regardless of what the business is and what its restrictions and limitations may be (including costs, headcount, regulatory bodies, and so on).

    As an industry we're terrible at the buy-versus-build decision: Buying something that'll meet 90% of your requirements, even with a large up-front cost, may make more business sense than building something perfect that won't be available for 18–36 months which may have a larger yet more-distributed cost. Many places have physical space, salaries, and the AWS bill (in that order) as their three largest costs. (CapEx versus OpEx rears its head again.)

    What changes would they like to see next? Paul thinks there'll be more work on pager rotations and out-of-hours service management; "you wrote it, you run it" doesn't scale and doesn't work for people with schedule conflicts (new parents, volunteers, education, multiple jobs). Running an online service without burning people out or requiring individual heroics is still an area for improvement. John thinks two things warrant attention: (a) how difficult it is for engineers to understand what is actually happening, especially via their mental models, and (b) helping engineers tell stories about what's difficult. Software engineering is only going to get more complex and more difficult over time. Engineers will face difficulties in understanding what's happening and what's needed, but we don't have stories about what makes things difficult.

    Grand National 2021: Managing Extreme Online Demand at William Hill

    [Direct video link]

    The Grand National is the biggest horse race in the world, watched by over 500 million people worldwide, with 1 in 4 people in the UK placing a bet — it makes a Black Friday look like a wet Tuesday — and it is William Hill's biggest betting day of the year. 2021 was a year like no other; with retail closed, online demand was huge. The challenge was coping with the demand of once-a-year customers whilst maintaining service to their long-standing customers. This talk looked at how SRE prepared for and ran the day, and how they implemented lambda@edge logic and a queueing system to help them achieve this.

    They changed their planning model from a monolithic "A typical weekend day has a load of N so the big race will be 1.5N" to one that looks at customers by group (one may be N, another 1.5N, and yet another might be 3N). They did performance tuning and enabled prescaling and autoscaling, queuing, and load shedding.
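    To make the segmented model concrete, here is a toy calculation with invented numbers (the talk gave no actual figures): estimate race-day load per customer group rather than applying a single multiplier to a typical weekend day.

```python
# Toy version of the segmented capacity-planning model described above.
# All group names and numbers are invented, not William Hill's.
typical_weekend_load = {"once-a-year": 400, "regular": 900, "vip": 200}  # requests/sec, assumed
race_day_multiplier = {"once-a-year": 3.0, "regular": 1.5, "vip": 1.0}   # assumed per group

expected = {group: typical_weekend_load[group] * race_day_multiplier[group]
            for group in typical_weekend_load}
print(expected)
print("total race-day estimate:", sum(expected.values()), "requests/sec")
```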

    They used a cookie (existing_customer=1) to identify existing customers who had logged in during the two weeks prior to the Grand National; based on that cookie value they used lambda@edge to queue the National-only customers while sending existing customers directly to the site. The queue size is configurable, and when it is exceeded the queue can return a 503/Unavailable instead of a 200/OK.
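    The edge logic might look something like the sketch below, assuming a CloudFront Lambda@Edge viewer-request trigger. The cookie name comes from the talk; the handler structure, waiting-room URL, and redirect behavior are my own illustrative assumptions, not William Hill's implementation.

```python
# Hypothetical sketch of the edge routing described above: existing customers
# (identified by a cookie set at login) go straight to the site; everyone else
# is redirected to a waiting-room queue, which itself can return a 503 once
# its configurable size is exceeded.

QUEUE_URL = "https://queue.example.com/wait"  # hypothetical waiting-room endpoint


def _has_existing_customer_cookie(request):
    """Return True if the viewer sent existing_customer=1 in any Cookie header."""
    for header in request.get("headers", {}).get("cookie", []):
        if "existing_customer=1" in header.get("value", ""):
            return True
    return False


def handler(event, context):
    request = event["Records"][0]["cf"]["request"]

    # Existing customers bypass the queue entirely.
    if _has_existing_customer_cookie(request):
        return request

    # Everyone else is sent to the waiting room.
    return {
        "status": "302",
        "statusDescription": "Found",
        "headers": {
            "location": [{"key": "Location", "value": QUEUE_URL}],
            "cache-control": [{"key": "Cache-Control", "value": "no-store"}],
        },
    }
```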

    On the day it mostly worked as expected. They increased the zone count based on measured performance, and before the race nobody got queued. Pre-race everything was smooth; post-race they had planned to shrink the zone count, but they saw 70% of the users return, which came close to their limits. Within an hour of the race ending they were able to disable the zone counts and queues.

    What did they learn to change next April?

    • Load predictions were within 5% of actual.
    • Treating new versus existing customers differently worked.
    • Testing and tuning repeatedly worked.
    • Practice the runbooks.
    • Load shedding via a queue worked.

    The queuing system is available in Git.

    Microservices Above the Cloud: Designing the ISS for Reliability

    [Direct video link]

    The International Space Station has been orbiting the Earth for over 20 years. It was not launched fully formed, as a monolith in space. It is built out of dozens of individual modules, each with a dedicated role — life support, engineering, science, commercial applications, and more. Each module (or container) functions as a microservice, adding additional capabilities to the whole. While the modules independently deliver both functional and non-functional capabilities, they were designed, developed, and built by different countries on Earth at different times and once launched into space (deployed in multiple different ways) somehow manage to work together — perfectly.

    Despite the many minor reliability issues which have occurred over the decades, the ISS remains a highly reliable platform for cutting-edge scientific and engineering research.

    Space travel is getting from point to point with a spacecraft, like development, and a space station is the place where we do the work. Spacecraft are temporary and have distinct beginning, middle, and end; space stations are permanent, multiple crews and missions, and both continuous and stateless.

    We had a brief history lesson about the monolithic, launched all-at-once stations like Salyut 1-5 and Skylab in the 1970s; Salyut 6–7 adding sidecars in the late 1970s and 1980s; and then Mir, the ISS, and Tiangong in the 1980s et seq. which are much more modular, constructed in stages, where modules can be moved and replaced. The ISS is much more efficient in terms of use of space.

    Some of the resiliency use cases include:

    • Oxygen generation — Multiple redundant and complementary solutions. In 1988 Elektron was deployed; it converts water to oxygen, but potassium hydroxide is its byproduct (technical debt: it can cause clogs and breakdowns). In 2006 they added the Oxygen Generation System (OGS) that created different and easier-to-manage byproducts. In 2018 they added an advanced closed-loop system (ACLS) to convert CO2 to oxygen, which doesn't require water; it even creates water for the Elektron and OGS environments.

      For emergencies they have chemical generation of oxygen ("candles," like in airplanes) and bottled oxygen. There have been no emergencies requiring either. Most of the problems were in the oldest Elektron technology, with the most technical debt and the hardest to replace components.

    • Spacesuits — Spacesuits are 2-piece suits with each piece having 2 different sizes. In 2019 they didn't have enough spacesuit pieces to fit both astronauts for a spacewalk at the same time.

    • CIMON — AI-powered autonomous assistant that follows the astronauts around and can assist them in whatever task they're doing. It uses IBM Watson and takes advantage of low network latency and increased computing power.

    Some dos and don'ts:

    • Interfaces and standards — There's a standard payload rack for all experiments, but Elektron doesn't conform to it, so it can't be replaced but must be repaired in situ. There were multiple standard connections — America-America, Russia-Russia, America-Russia, and something else — so now there's a fifth, the ISS Docking Standard.

    • Freedom — The US' Space Station Freedom failed; it was announced in 1984 but canceled in 1993 without anything having been built. The ROI wasn't there. Building a coalition of more governments to pay for the ISS is what helped it be successful.

    • Politics — In 2018 a leak was found on the ISS caused by tiny holes that had been drilled into the Soyuz transport spacecraft. We don't know whether it was done in space or on the ground, or why, and no public RCA has been announced.

    Supplemental reading is available.

    Horizontal Data Freshness Monitoring in Complex Pipelines

    [Direct video link]

    Growing complexity of data pipelines and organizations poses more challenges to data reliability. The risk of data incidents multiplies with each hop downstream. Teams see decreasing operational readiness to deal with specific classes of outages: Data staleness, corruption, and loss, while the costs of incident resolution and the revenue impact of outages can grow non-linearly. Without understanding the full data dependency graph, it is hard to measure completeness of monitoring, leading to gaps.

    The speaker talked about:

    • Understanding critical business data flows, upstream and downstream dependencies — For any sizable organization, figuring out data dependencies manually is a non-starter. How do we map them automatically and convert the spaghetti of pipelines into a sane graph? Recording data access events — levels (job and process), types (file and database), metadata (mtime, atime, writer, reader, and size), provisioning, and so on — helps you build your dependency graphs, which can then be grouped into areas of interest. They instrumented the data access recording with central logging, distributed tracing, and file-system-level events (open-source examples: Logstash, Jaeger, and inotify, respectively). Data can be automatically annotated (criticality, ownership, recovery objectives, and retention plans). Data integration plans can be built on top of that. (A toy version of this approach is sketched after this list.)

    • Holistic data monitoring at scale to eliminate unnoticed data staleness, which potentially can lead to accumulated negative business impact — Leverage the dependency graph to detect slow bleeds which can hide for a while until the problem is exposed. How do you monitor freshness at scale? Based on the curated data dependency graph and harvested update behavior, you can analyze what changes are expected, then alert, provide historical health insights, or provide health reports as needed.

      What kinds of anomalies can be detected? Both data staleness (mtime is too old) and data corruption (direct, via consistency checks; inferred, via unexpected spikes or dips in the size or rate of change, or rogue concurrency). Note that as new data sets are added they'll show up in the dependency graph and thus be monitored by default.
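      A minimal sketch of that approach, under assumptions of my own (hypothetical event records, field names, and thresholds; not the speaker's tooling): derive the dependency graph from recorded read/write events, then flag datasets whose last write is older than a multiple of their usual cadence.

```python
# Build a data dependency graph from access events and flag stale datasets.
# Event fields, example data, and the staleness threshold are assumptions.
from collections import defaultdict

# Each access event records which job touched which dataset, how, and when.
events = [
    {"job": "ingest", "dataset": "raw/clicks", "mode": "write", "ts": 1_697_000_000},
    {"job": "rollup", "dataset": "raw/clicks", "mode": "read",  "ts": 1_697_000_600},
    {"job": "rollup", "dataset": "agg/daily",  "mode": "write", "ts": 1_697_000_900},
    {"job": "report", "dataset": "agg/daily",  "mode": "read",  "ts": 1_697_001_200},
]

def build_dependency_graph(events):
    """Map each dataset to the datasets it is derived from (via shared jobs)."""
    reads, writes = defaultdict(set), defaultdict(set)
    for e in events:
        (reads if e["mode"] == "read" else writes)[e["job"]].add(e["dataset"])
    graph = defaultdict(set)
    for job, outputs in writes.items():
        for out in outputs:
            graph[out] |= reads[job]  # an output depends on everything its job read
    return graph

def stale_datasets(events, now, slack=2.0, default_cadence=86_400):
    """Flag datasets whose last write is older than slack x their usual cadence."""
    last_write, cadence = {}, {}
    for e in sorted(events, key=lambda e: e["ts"]):
        if e["mode"] == "write":
            ds = e["dataset"]
            if ds in last_write:
                cadence[ds] = e["ts"] - last_write[ds]
            last_write[ds] = e["ts"]
    return [ds for ds, ts in last_write.items()
            if now - ts > slack * cadence.get(ds, default_cadence)]

if __name__ == "__main__":
    print(dict(build_dependency_graph(events)))
    print(stale_datasets(events, now=1_697_300_000))  # both datasets are overdue here
```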

    How We Built Out Our SRE Department to Support over 100 Million Users for the World's 3rd Biggest Mobile Marketplace

    [Direct video link]

    March 2020 was a strange month for everyone — our work and employee interactions changed fundamentally, and perhaps permanently, as the entire office-bound workforce shifted to working from home. The shift to from-home work had short-term challenges (such as bandwidth, furniture, and additional electronic equipment), longer-term challenges (such as ergonomics), and will have further challenges when a return to the office is required.

    How to communicate was an interesting set of challenges on its own. Email and IM are the easiest media, but you lose a lot of tone and all of the body language. A voice or video call is the next medium up, providing more tone and some body language but still limited eye contact. Finally, in-person communication provides everything but is the most complex and hardest (especially during the pandemic). Where to start communicating, and when to shift from one medium to another, was challenging.

    COVID-19 wasn't the only challenge that came their way. They combined an increased role in Huawei service management with the retirement of their managed-services SRE team. This meant that over the course of the year they needed to hire aggressively, both to replace that team (deciding how many people with what skill sets and levels of expertise, and who would search for candidates, interview them, and approve offers) and to support their new growth. Working through their onboarding (with a clear plan for at least the first four weeks identifying goals and milestones, possible mentors, and subject matter experts) over the course of the year caused some hiccups along the way, but ultimately it forced them to change into a leaner and more professional SRE department.

    They also talked briefly about exit processes: Engineers move on, so services all need at least a primary and secondary owner and need to be well-documented. You need to plan for new hires to backfill and come up to speed for those leaving.

    You've Lost That Process Feeling: Some Lessons from Resilience Engineering

    [Direct video link]

    Software systems are brittle in various ways, and prone to failures. We can sometimes improve the robustness of our software systems, but true resilience always requires human involvement: people are the only agents that can detect, analyze, and fix novel problems. But this is not easy in practice. Woods' Theorem states that as the complexity of a system increases, the accuracy of any single agent's own model of that system - their "process feel" - decreases rapidly. This matters, because we work in teams, and a sustainable on-call rotation requires multiple people.

    This talk brought a researcher and a practitioner together to discuss some Resilience Engineering concepts as they apply to SRE, with a particular focus on how teams can systematically approach sharing experiences about anomalies in their systems and create ongoing learning from "weak signals" as well as major incidents.

    Process feel is real, is essential to monitoring networks of highly-autonomous units, requires investment, and can't be measured directly... but does make your humans more effective.

    On a meta level, this was a well-rehearsed talk with good handoffs, although one speaker had a noticeably better microphone than the other.

    Scaling for a Pandemic: How We Keep Ahead of Demand for Google Meet During COVID-19

    [Direct video link]

    Many teams will have practiced and refined their Incident Management skills and practices over time, but no one had a playbook ready to go to manage the dramatic Coronavirus-driven usage growth of Google Meet without a user-facing incident. The response resembled a temporary reorganization of more than 100 people more than it did your typical page — the fact that there was no user-facing outage (yet) notwithstanding.

    This talk covered the Feb–Mar 2020 growth spike. Early in Feb an overload caused a failure in Asia (which failed over successfully, so users saw no outage), and their mitigations weren't keeping pace with growing demand, so they declared an incident (with the speaker as the incident commander) even though there wasn't an outage. Their objectives were to:

    • Avoid outages (and minimize any if they happen).
    • Have enough serving capacity.
    • Identify where demand was coming from.

    They modeled scaling up by 2x, 10x, 50x, and 100x their Jan 2020 load. They also worked with product and development leadership to align priorities, and had an ordered list of priorities that was something like:

    • Scale the system
    • Customer-visible P0/P1s
    • Reducing tech debt
    • Everything else

    Basically everyone stopped what they would normally have been working on to make this happen, across the whole cross-functional organization.

    By Apr 2020 they had surpassed 100M users and were still growing at 3M/day.

    How did they know the incident was done? When the changes could be identified as happening on a weekly basis instead of daily basis. They continued to invest in automation to scale up and down afterwards. Without restarting an all-hands-on-deck incident management team, by Oct 2020 they had 235M users and 7.5B video calls... all without any customer-visible outages!

    Ceci N'est Pas Un CPU Load

    [Direct video link]

    As the speaker had an electrical engineering background, he wondered why we could not combine and troubleshoot code like we do for analog electronics. In his talk he discussed why digital forces us to use limited mental models. Any signal can, with some logic, be transformed into something else.

    For analog, a transformation usually has lots of well-defined modules inside it, with well-defined inputs and outputs. That means that you can probe the signal directly which makes any troubleshooting easier.

    Analog, however, has problems of its own:

    • There's noise: Every component in the path adds some noise, which accumulates, which makes the signal harder to identify.

    • There's complexity: Lots of specific components are needed, with complex supply chains and corresponding costs.

    • There's a really slow design cycle: Add component, test, debug, lather, rinse, repeat... which can take weeks or months when talking about boards needing soldering.

    What if the transformation was itself a reusable component that could be adapted on the fly? That's digital programmable electronics (DPA, like an FPGA, a microcontroller, or an Arduino). The ADC and DAC transformations are the only pieces that add noise, and you know what it is and can deal with it. The transformation can be infinitely complex and changed quickly. (We still use digital components designed 30 years ago!) But that means we can't just have known inputs and outputs, or probe the signal (aka memory) directly, or have easy troubleshooting.

    What If the Promise of AIOps Was True?

    [Direct video link]

    Many SREs treat the idea of AIOps as a joke, and the community has good reasons for this. But what if it were actually true? What if our jobs were in danger? What if AI — which can play chess and Go better than we can — could take over operations? "There is no future of IT operations that doesn't include AIOps" (Gartner). These and others are nice claims... but how true are they?

    Opposing positions are emerging:

    • Underwood's and Ross' stance is that AIOps doesn't make much sense because:

      • The math for effective ML doesn't work out.
      • The ROI for generating the model doesn't work out.
      • It's often faster and cheaper to just do statistics.

      Techniques will improve so perhaps some tasks will be better but it's not a revolution.

    • The strong AIOps stance is that it'll dominate in 5–10 years.

    • The weak AIOps stance is that:

      • ML can be really useful in Operations contexts, specifically in narrow circumstances with context-free applications like anomaly detection.
      • Silos will often accede to sharing data with machines but not with people.
      • While statistics may be cheaper, it requires expertise.

    Areas where AIOps may be useful include:

    • Monitoring, alerting, and anomaly detection — Monitoring is dynamic (and frequently changing), often needing new metrics and aggregations as well as monitoring of new features. You may often be monitoring too much; dashboards accrete over time and have more noise than signal. Alerting noise is a problem solvable by tuning and pruning. Anomaly-detection definitions are highly context sensitive. For AIOps to be useful here, we only need to find some kind of relationship between signals, anomalies, and outages. (A purely statistical baseline of the kind the skeptics advocate is sketched after this list.)

    • Bad change detection — Changes are often both high risk and necessary (not making them can itself lead to higher risk). Validating changes sits on a spectrum from completely manual to completely autonomous. Rollback is a powerful tool that should be used if there's any suspicion of failure, but rollback is also tricky. For AIOps to be useful we don't need to solve rollback, only to be useful at detection. Testing in production allows user behavior and SLOs to be part of the decision making (good for AIOps).

    • Incident response and root cause analysis — You can't automate incident response because it's too fundamentally chaotic and creative. RCA work is extremely complicated and often primarily a social activity. Ideally root causes don't recur because you put in the work to make sure they don't. For AIOps to be useful: root causes do recur all the time, so with enough repetition you can train on them; incident response can be divided into "known knowns" and everything else, and we only need a structure to make useful suggestions. Big Red Button-style approaches make it easy for AIOps to contribute value.

    • Scaling, toil, and general operations — We hope scaling is automatic... except when we don't. Scaling is usually easy except when non-linear effects happen. Toil is supremely automatable but incident response isn't. For AIOps to be useful: scaling is one of the clearest use cases for matching inputs to actions, and even a non-linear relationship is okay as long as the back-off is too. Toil is automatable, but it's hard to see how AIOps can meaningfully exceed human performance, and we can already trigger actions on metrics today without ML.
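    As a concrete illustration of the skeptics' "just do statistics" position above, here is a minimal rolling z-score detector; the window size, threshold, and sample data are assumptions for illustration, not anything recommended in the talk.

```python
# Flag a metric sample as anomalous when it falls more than k standard
# deviations from the mean of a trailing window of recent samples.
from collections import deque
from statistics import mean, stdev

class RollingAnomalyDetector:
    def __init__(self, window=60, k=3.0):
        self.samples = deque(maxlen=window)
        self.k = k

    def observe(self, value):
        """Return True if value is anomalous relative to the recent window."""
        anomalous = False
        if len(self.samples) >= 10:  # need some history before judging
            mu, sigma = mean(self.samples), stdev(self.samples)
            anomalous = sigma > 0 and abs(value - mu) > self.k * sigma
        self.samples.append(value)
        return anomalous

detector = RollingAnomalyDetector()
for i, latency_ms in enumerate([20, 22, 19, 21, 20, 23, 21, 20, 22, 21, 20, 250]):
    if detector.observe(latency_ms):
        print(f"sample {i}: {latency_ms} ms looks anomalous")
```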

    Thursday, October 14

    Lessons Learned Using the Operator Pattern to Build a Kubernetes Platform

    [Direct video link]

    In Kubernetes, the Operator pattern helps capture good production practices and the expert knowledge of practitioners for managing a service or a set of services.

    An Operator acts as an SRE for an application. Instead of manually following playbooks or scripts to deploy databases or mitigate issues in Kubernetes, SRE teams can use off-the-shelf Operators or develop their own to automate these processes and reduce toil work.

    This session explored the Operator pattern with some examples of how Red Hat has used it to build OpenShift. They discussed some lessons learned, common pitfalls in running Operators, and when it makes sense to write one (don't reinvent the wheel).

    None of this is immediately useful to us but Ben B. or Jack might be interested in it.

    Nine Questions to Build Great Infrastructure Automation Pipelines

    [Direct slides link] [Direct video link]

    Sure we love Infrastructure as Code (IaC), but it's not the end of the story. This talk (chapter 3 of his IaC automation series) stepped back to see how different automation types and components can be connected together to create Infrastructure Pipelines. They reviewed the nine essential questions that turn great automation into modular, portable, and continuous infrastructure delivery pipelines.

    The questions themselves are:

    1. Why doesn't my CI/CD pipeline understand infrastructure?
    2. Why is a "Pipeline Flow" different than other Orchestration?
    3. Why focus on Intents instead of the chained actions?
    4. Why are provisioning and configuration so different?
    5. Why can't I share state between tools?
    6. Why can't I ignore what is between pipeline flows?
    7. Why is IaC central to Infrastructure Pipelines?
    8. Why is it so hard to reuse and share automation parts?
    9. What is holding us back from doing this work?

    Answers, including five things to help build Infrastructure Pipelines, are in the talk and the slides.

    Hard Problems We Handle in Incidents but Aren't Often Recognized

    [Direct video link]

    If we know how and where to look closely, we can find a number of dynamics, dilemmas, and sacrifices that people make in handling incidents. This talk highlighted a few of these often-missed aspects of incidents, what makes them sometimes difficult to notice, and gave some descriptive vocabulary that we can use when we do notice them in the future.

    Incidents tend to be like a TARDIS — bigger on the inside. Coordination has hidden costs, and no matter what processes you use in managing incidents there are always trade offs involved in how much information you share with how many people how often. "We can know more than we can tell."

    A sacrifice decision is when, during a disturbance, achieving important ("high level") goals may require abandoning less important ("low level") ones. A sacrifice may require incurring damage (even severe damage) in order to prevent an even greater catastrophe. One example the speaker gave was the July 2015 NYSE outage where they shut down all trading for four hours during the trading day, minimizing customer disruption but getting a lot of bad press for it. It's often a lose/lose situation.

    There's also another set of coordination trade offs if there are simultaneous incidents. If the incidents are related then combining the efforts and observations from each incident response team could make a lot of sense... but it can be tricky to see if it's worth the time to see if they're related.

    Experiments for SRE

    [Direct video link]

    Incident management for complex services can be overwhelming. SREs can use experiments to attribute and mitigate production changes that contribute to an outage. With experiments to guard production changes, SREs can also reduce a (potential) outage's impact by preventing further experiment ramp up if the production change is associated with unhealthy metrics. Beyond incident management, SREs can use experiments to ensure that reliable changes are introduced to production.

    • Ideation — Developers ask if the feature will increase usage. SREs ask if it's reliable.

    • Trial — Roll it out to (e.g.) 10% of the population; what happens?

      • Do you start at 1% then 5% then 10%, or start at 10%?
      • Do you stop at 10% or jump to 20% or 50%?
      • How does the population feel about or use the new feature?
      • Does the reliability of the application change with the introduction of the feature?

    • Launch — Roll it out to the rest of the user population.

    Developers want features to launch if they're popular but SREs want them to launch only if it doesn't make reliability worse.

    SRE best practices include gradual rollouts (ramp experimental features), change attribution (was the change associated with an experiment), and controlled mitigation (rollback experiments guarding new code paths). Rollback can be difficult in a complex system, but if changes are tagged with a unique experiment ID and controlled by feature flags that can help — consider changing the trial percentage back down to 0% instead of trying to roll back to an older binary.

    They automate this by engineering reliability into the lifecycle itself, automating checks at each ramp stage, and encoding best practices into policies. Maybe the first ramp stage is the development team, then a small subset of users, then more. A server can be in each subset and check whether it does or doesn't see the feature and, when it does, whether the feature works and the metrics are useful.
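    A minimal sketch of that staged ramp, with assumed stage percentages and soak time and with hypothetical set_percent / is_healthy hooks; this is a generic illustration, not Google's experiment framework.

```python
# Ramp a feature-flagged experiment stage by stage; if metrics look unhealthy
# at any stage, mitigate by setting the trial percentage back to 0% rather
# than rolling back to an older binary.
import time

RAMP_STAGES = [1, 5, 10, 20, 50, 100]  # percent of users at each stage (assumed)
SOAK_SECONDS = 3600                    # how long to watch metrics per stage (assumed)

def ramp_experiment(set_percent, is_healthy, stages=RAMP_STAGES, soak=SOAK_SECONDS):
    """Return True if the experiment reached 100% with healthy metrics."""
    for pct in stages:
        set_percent(pct)     # the feature flag guards the new code path
        time.sleep(soak)     # let user-facing metrics accumulate at this stage
        if not is_healthy():
            set_percent(0)   # controlled mitigation: ramp back to 0%
            return False
    return True

if __name__ == "__main__":
    # Hypothetical wiring; a real system would call a flag service and a
    # monitoring backend here.
    ramp_experiment(set_percent=lambda p: print(f"ramp to {p}%"),
                    is_healthy=lambda: True,
                    soak=0)
```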

    Monitoring can be a challenge. Some metrics need to be signaled in real time but other metrics (especially those needing confidence intervals) need statistical analysis over time to be valuable. A lot of this factors down to the semantics of the experimentation framework itself — for example, one way is to view an "experiment" as a program and think about ways to modularize it. Another way to solve this issue is having better tools and monitoring around experimentation such that even if you have multiple features being rolled out as experiments, finding a bad experiment is quick (i.e. given all this data, what is the best way to organize the data such that the information is actually useful).

    Reliable Data Processing with Minimal Toil

    [Direct video link]

    Learn about the risks involved with data processing pipelines and how Google and Slack mitigate them. This talk shared insights into making batch jobs (which may not be SRE-supported) safer and less manual. They have researched and implemented ways to do canarying, automated global rollouts on increasingly larger target populations, and different kinds of validations. All these are necessary to remove the manual work involved in updating a batch job globally across millions of users.

    One example of the batch jobs he's working with is the job that empties the files in Google Drive trash that's older than 30 days.

    Nothing in this talk was particularly new or exciting. The speaker from Google went back to basics such as what the differences are between development, QA/staging, and production environments or stages; what batch jobs and automated testing are; and what terms like canarying, data freshness, data validation, dry-run, and false positive mean.

    The speaker from Slack talked about some of the tooling they used. The audience was amused by their naming — the 1% group is internal and called "Dogfood" and the 19% group is the next step up and called "Kindergarten."

    SRE "Power Words:" The Lexicon of SRE as an Industry

    [Direct video link]

    As the SRE industry develops, we've come to rely on certain words, phrases, and mnemonics as part of our conversations with ourselves and our stakeholders. Words and naming have power, and the collective definition and use of words like "toil" as a shorthand can help with any SRE practice. This talk set out the premise and some examples (including various dialects and argots) and included a call to action around thinking how naming and words can strengthen SRE's position as the function continues to develop.

    One difference between SRE and other spaces is that SREs tend to work more across organizational boundaries. Another is that SRE is still new enough that our nomenclature isn't as developed as other fields. Two examples include toil (repetitive work that, while necessary, doesn't necessarily provide direct benefit to the software development or engineering audiences) and selling SLOs (as "KPIs for Production" for business-speakers and "A 'Single pane of glass' onto reliability" to CxOs).

    This talk was only fifteen minutes long and the subject probably deserved more attention.

    How Our SREs Safeguard Nanosecond Performance — at Scale — in an Environment Built to Fail

    [Direct video link]

    The core principles of SRE — automation, error budgets, risk tolerance — are well described, but how can we apply these to a tightly regulated high-frequency trading environment in an increasingly competitive market? How do you maintain sufficient control of your environment while not blocking the innovation cycle? How do you balance efficiency with an environment where misconfigured components can result in huge losses, monetary or otherwise?

    They discussed the production environment at Optiver (a market maker), how they deal with these challenges on their trading floor, and how they have applied (some of) the SRE principles to different areas of their systems.

    Their production environment is on local machines due to latency; nanoseconds matter to pricing for the markets. Their design is to fail — stop processing, hard, as quickly as possible — in the event of unexpected events. In 2012 Knight Capital had a bad upgrade of their trading system that generated millions of orders, totaling roughly $7.5B, in a matter of 45 minutes before they could stop it. Quoting wholesale from Wikipedia:

    On August 1, 2012, Knight Capital caused a major stock market disruption leading to a large trading loss for the company. The incident happened after a technician forgot to copy the new Retail Liquidity Program (RLP) code to one of the eight SMARS computer servers, which was Knight's automated routing system for equity orders. RLP code repurposed a flag that was formerly used to activate an old function known as 'Power Peg'. Power Peg was designed to move stock prices higher and lower in order to verify the behavior of trading algorithms in a controlled environment.[12] Therefore, orders sent with the repurposed flag to the eighth server triggered the defective Power Peg code still present on that server.[13]

    When released into production, Knight's trading activities caused a major disruption in the prices of 148 companies listed at the New York Stock Exchange. For example, shares of Wizzard Software Corporation went from $3.50 to $14.76. For the 212 incoming parent orders that were processed by the defective Power Peg code, Knight Capital sent millions of child orders, resulting in 4 million executions in 154 stocks for more than 397 million shares in approximately 45 minutes.[13]

    Knight Capital took a pre-tax loss of $440 million. This caused Knight Capital's stock price to collapse, sending shares lower by over 70% from before the announcement. The nature of the Knight Capital's unusual trading activity was described as a "technology breakdown".[14][15]

    Optiver is different from most of us in that they don't care about web metrics and uptime. They mitigate risk by retaining control: making their changes explicit, writing in-house applications with the principle of least complexity, and running a simplified trading stack on all-physical (no cloud) hardware.

    This talk was interesting to me because my previous-to-UM employer was itself a market maker.

    Panel: Unsolved Problems in SRE

    [Direct video link]

    Every field of endeavor has its leading edge where the answers are unclear and active exploration is warranted. Although the phrase "here be dragons" might be an appropriate warning, this panel of intrepid adventurers ventured into that unknown territory. (The panelists know, and often disagree with, each other.)

    Their first discussion was about models of SRE work. Google's model is one SRE [team] per service, focused on both engineering and operations work with sustained ownership. Slack's model is more a consulting model for getting a service production-ready without that sustained ownership. Scaling is another issue; hiring people good at both engineering and operations, willing to work an on-call shift, and with enough work to stay busy implies a certain level of scale.

    That segued into a discussion of skills, like maintaining data reliability and freshness. Another unsolved problem is how to stop teams from starting over from first principles every time.

    Some services have predictable and well-understood outcomes, but others do not. Some complex services are unpredictable and less well-understood. How do we tell that the latter services are truly working? We need to be more organized about thinking about such things. The emerging "Safety-II" movement typically observes that every system is sociotechnical: Some degree of human behavior and some degree of technical behavior is required.

    We haven't done a lot of work on determining how similar versus how different our systems are. Abstractions may help... but may also not capture everything necessary. How leaky can an abstraction be before it becomes useless or (worse) harmful?

    Their next discussion was on understanding the system. Too much is still artisanal or custom. "It depends" is problematic unless you say what it depends on. There's complexity in our systems, in how we think about them, and in how we manage them. Decisions are often dependent on multiple variables. One black box describing another black box can lead to predictability if not understanding. Even completely accurate predictions don't necessarily mean we understand the system: Do we know what happens if we adjust one input or add another? We don't have models for building understanding of more complex systems.

    We also don't always have concise terminology. How exactly do we define reliability or outage? "It depends" must say what it depends on to be useful. We don't have a scientific approach, just an empirical one. Empiricism is different based on different conditions: Turning the knob one way yesterday under one set of environmental conditions may give different results than turning the knob the same way tomorrow under a different set of environmental conditions.

    Summary: We know nothing, can't describe it, and don't have terms to describe it. How do we proceed?

    We have a lot more data than we did in the past. We need to figure out how to interpret the data. It's easier to understand the technical bits of the system, but understanding the business need or impact, and understanding the social bits around the system, is harder. Describing reliability on the customer's (business) terms is often better than on our own (technical) terms. We can use the data we have to inform decisions, but we have to interpret it. We can explore the space and make conceptual progress. But we need to deliver business value, and understanding cross-business stuff is how we scale, and businesses are impatient. We need to spend less time keeping the lights on and more time progressing with the data analysis and doing deep conceptual work. (But is the team in a position to do that work and has it enough experience with the complexities of the business to do so successfully?)

    The next area of discussion was learning the qualitative aspects of the sociotechnical systems. We can instrument the systems, but it's much more difficult to capture qualitative aspects of people. Social definitions of success are relatively consistent, but the organizational mandate is mostly incremental (keep it running and improve a little over time) and not revolutionary. Can a compelling business case be made for the pure science of SRE to be revolutionary? Maybe, but it's a struggle to get people to care about reliability to begin with, until it becomes a problem for the business. The social challenge is related to the economic challenge: Spending $10 on a feature and getting the money back in revenue, or spending the same $10 on a feature for a chance of improving revenue or profitability in the future, is an economic (business) decision, and often they'll choose the short-term gain over the longer-term possibility.

    Our services are getting too big, too complex, and too important to allow reliability to be an afterthought.

    Closing Remarks

    The conference ended with closing remarks from our conference chairs. Due to a meeting conflict I was unable to view this session, which was not recorded.

    SREcon22 Americas will be in San Francisco CA (March 14-16, 2022). Mohit Suley (Microsoft) and Heidi Waterhouse (LaunchDarkly) are our co-chairs.


    Recordings

    Recordings of the keynotes and talks were online by the end of each conference day.




    Last update Mar24/22 by Josh Simon (<jss@clock.org>).