Conference Report: 2024 SREcon Americas

The following document is intended as the general trip report for me at the 2024 SREcon Conference held in person in San Francisco CA from March 19–21, 2023. It is going to a variety of audiences, so feel free to skip the parts that don't concern or interest you.

Saturday, March 16

Travel day! As usual I managed to wake up before the alarm. I'd showered the nightbefore so I packed up the toiletries, CPAP, laptop, and phone; set the thermostat to heat to only 60F; loaded the car; and drove off to Detroit Metropolitan Airport.

Traffic was moving at or above posted speeds — well, it was either side of 5:00 a.m. — and other than the Michigan State Police vehicle tailgating me on Michigan Avenue from State Street allt he way to US 23, the drive was uneventful. I parked at my usual spot in the Big Blue Deck (level 3 zone E row 5, and the cost has gone up to $22/day); went up, over, and down to the terminal shuttle; had a brief wait for the driver to hit the head; and got to the Delta terminal. Stood in line to drop my bags (and got to watch a dog-carrying Karen get told off by the staff people), had no wait to get past the TSA agent and had a brief wait to send my stuff through the X-ray machine.

(Since last time I flew, about a year ago, they've got all new GE-branded X-ray systems, Same nudie scanner, though.)

I got to my gate in plenty of time, but if I have to sit around waiting I'd rather do it at the destination to reduce the stress. I read through my backlog of magazines and finished those up about 10 minutes before boarding began. I took advantage of preboarding (with the cane nobody says anything) and got to my seat (1B) without incident. This was my first trip on a Boeing 757-200, and it was configured with the Delta One lie-flat seats and a slight angle towards the windows. (The product is showing its age a bit. There's a tiny privacy partition between the seets in a 2x2 configuration, but no privacy so I could watch my neighbor's screen.

We got a tour of the airport. We departed from our gate at one end of the terminal, drove around the airport to the deicing station, then had to drive all the way back to get to our assigned taxiway and runway.

I had the option of ordering breakfast in advance. The default options didn't appeal (a ricotta-avocado omelet, Nutella french toast, and oatmeal, if I remember right), so I ordered the smoked salmon "special." This was a very reasonable portion of smoked salmon, three slices of crustless pumpernickel bread, a couple of pieces of smoked trout, a grape tomato, some sliced cucumber, and a horseradish sour cream sauce. Considering it was served at altitude it was very tasty. It was served alongside a fruit salad of canteloupe, honeydew, an orange slice, pineapple, and red grapes, and a croissant with butter and strawberry jam.

The flight itself was mostly uneventful, We had the occasional turbulence, which was at its worst when crossing over the Rocky Mountains. We actually landed early of all things.

I took Lyft to the hotel. This was mildly complicated by the driver expecting me to be at tower C but the signage from Delta's bag claim directing me to tower D. A longer-than-expected ride later, due to the St. Patrick's Day parade-related street closures, and he dropped me off at the wrong place. At least the weather was nice (64F and sunny with a light breeze) for the hike from 2 Embarcadero Center to 5 Embarcadero Center. Checked in without major incident; my reservation was for a king bed, but they only had a two-queen bed room ready. They were willing to discount the $70 upcharge to $40 for a "high floor, one king, bay view, with balcony," but I declined. (I didn't plan on being in the room much during daylight hours, and my employer is already verklempt about the room rates in downtown San Francisco).

I unpacked everything, threw the laptop and phone on to charge, wrote up the trip report thus far, and switched into more comfortable clothing before heading out to lunch. There was a burger place just across the street so I grabbed a (tasty and expensive) burger and fries.

I napped in the afternoon. I tried to find someone else in early to go to dinner with, but was ultimately unsuccessful. I had a half slab of ribs (with slaw and fries) at Perry's. After dinner I returned to my room and, having been awake for most of the preceding 18 hours, went to bed.

Sunday, March 17

Today was mostly my free day to acclimate to Saturday being a 27-hour day thanks to the time zone shift. I logged into work to deal with a time-critical issue that cropped up late Friday. I prepped more of the trip report, adding skeletal session information and preparing for the eventual talk-specific slides and video links. I also fired up the conference Slack workspace and joined all of the relevant tracks' channels. Then after my morning ablutions I headed out to grab a bacon, egg, and cheese biscuit for breakfast (good, but the biscuit was a bit too flaky to be stable in the sandwich).

Badge printing and pickup opened at 5:00 p.m. I went down early to see who was around. Ran into a newbie who asked about my work so I shared (probably overshared). Ran into Carolyn, Casey, Cory, Matt, and Pat at the opening. After some catching up, first at registration then in the lobby after, Brent joined us and he, Matt, and I went to Gott's at the Ferry Building for dinner. I had a bacon cheeseburger and fries. Brent's dulce de leche shake, this month's special, looked amazing; I had to skip it due to lactose intolerance (I should probably check out Lactaid or the equivalent some time). (The organizers, including Carolyn and Pat, had their own reception to attend. I'm not an organizer, despite what my badge holder says.)

We were back at the hotel after dinner, but there was nothing on the evening's agenda since the conference didn't start until Monday morning.

Monday, March 18

The conference proper began on Monday morning with the continental breakfast in the Grand Foyer. Today's selections were butter croissants, ginger-peach scones (which didn't taste of either ginger or peach), and fruit and granola parfaits.

The program chairs kept us out of the ballroom until the session start time of 8:45 a.m., which was a very poor decision if they wanted the plenary session to start at 9:00 a.m. as scheuled. Luckily, however, they only had 10 minutes of material for their opening remarks so we still started on time.

Opening remarks

[No slides | No video]

The conference began with the remarks from the chairs, Sarah Butt and Dan Fainstein. This is the tenth anniversary of SREcon. They advised us to make the conference our own and attend the sessions that make sense for us as individuals.

They reviewed the code of conduct; we want to be a safe environment for everyone. The Slack channels are where changes are announced and the talks' Q&A will take place (#24amer- is the prefix). Thanks to the sponsors (the showcase is open all three days; please visit). The showcase will be open all three days, including a happy-hour this evening. Birds-of-a-Feather sessions (BOFs) are tonight and tomorrow.

Like last year, for Q&A we'll be using the day-and-track channel with :question: as the question prefix, and moderators will ask them.

Later in Slack it was reported that we had over 600 people in attendance, well over the 525 we had last year. A subsequent discussion with the executive director said that based on self-reported data alone, the attendees were about 25% female (up from 10–15% in the past). I had noticed that the percentage of white men was lower than in past years.

Plenary: 20 Years of SRE: Highs and Lows

[No slides | Direct video link]

Niall Murphy has occupied various engineering and leadership roles in Microsoft, Google, and Amazon, though not at the same time, and is the instigator of the best-selling & prize-winning Site Reliability Engineering, which he hopes at some stage to live down. His most recent book is Reliable Machine Learning, with Todd Underwood and many others.

SREcon is 10 years old this year. The precise starting point for SRE as a profession is a bit harder to pin down, but it's certainly considerably more than ten years ago. In this session, Niall Murphy told the story of SRE as he has seen it evolve, from its beginnings within Google and other early adopter organizations to its subsequent spread throughout the tech industry and beyond.

Personal view, not corporate and not universal. He told stories, and noted that not all stories are necessarily true from everyone's point of view.

SRE started with startups: Putting the slightly less terrible thing in place just long enough to get to tomorrow's terrible thing. You do things when they're useful and stop when they're not. (Example: cluster.py forked one per FTE then moved away from unneeded forks.)

SRE materials like Niall's books are selling well... even though the content is available for free online. SRE ideas have penetrated general engineering consciousness. We no longer need to define SLA, SLO, OLA, etc. because people know those initialisms now. The ideas have also permeated to general business consciousness like Gartner and Forrester.

SREs are doing a lot of work that benefits others, including ethics (silence breakers, work on healthcare.gov, etc.) — to the point that two Time magazine covers have been about SREs.

SREs try to do things because they're the right things to do, not just because we've already done it that way.

Some lowlights:

The SRE career pipeline needs attention, especially for junior-level people: SRE isn't just for senior-level people ("staff engineer").

Doing SRE is hard; there have been a number of failed SRE implementations in companies whose names you've never heard of

Widespread mishandling, dilution, and problems with what the term and role means (and sometimes too much worrying too much about it).

Neither generating nor applying (mathematical) models often and widely enough. We're still unable to put generic numeric frameworks on how much our work matters.

Zero-interest rate policy (ZIRP)-no-more may undermine several of the abstract economic bases for caring about reliability.

Doctrine overtakes reality-oriented pragmatism, flexibility, ... and engineering?

The "Operations is low status" battle is not yet won.

Summary: The still-radical idea that it is legitimate to apply software techniques and systems thinking to the operations domain.

Plenary: Scam or Savings? A Cloud vs. On-Prem Economic Slapfight

[No slides | Direct video link]

Corey Quinn is the Chief Cloud Economist at The Duckbill Group, where he specializes in helping companies improve their AWS bills by making them smaller and less horrifying. He also hosts the "Screaming in the Cloud" and "AWS Morning Brief" podcasts; and curates "Last Week in AWS," a weekly newsletter summarizing the latest in AWS news, blogs, and tools, sprinkled with snark and thoughtful analysis in roughly equal measure.

Since not giving a single crap about money turned out to be a purely zero-interest-rate phenomenon, the "cloud is overpriced vs. no it is not" argument has reignited — usually championed on either side by somebody with something to sell you. The speaker has nothing to sell you, and many truths to tell. It's time to find out where the truth is hiding.

In his experience some themes and trends emerge:

At scale, nobody's paying retail. Once you get over $1M/yr in spending there are discounts available, either as discounts or as pricing structures.

Cloud repatriation isn't real. Nobody is moving whole hog from (e.g.) AWS back to on-prem. We are seeing some actual repatriation around smaller workloads for reasons other than cost:

POCs fail and they move workloads back to on-prem

Workloads expand to be both cloud and on-prem for reasons

Weird political nonsense

But in all cases, they already had a well-understood on-prem data center environment.

Maslow's hierarchy of needs is true... and "The AWS bill" is at the basis.
People will always cost more than infrastructure. Staff compensation, cloud bill, and real estate are the top three.

Some hard truths (not controversial but people don't like hearing them):

Cloud is probably the default place to build things now.

It's not all-or-nothing at scale. Everyone uses everything.

Cost isn't the primary driver.

Runaway spend is worse in the cloud. On-prem has spending limits even if you're not paying attention (running out of rack space or disks).

Cloud is (almost) always more expensive than on-prem in the long term. (One-off experimentation notwithstanding.)

Companies generally aren't run by fools (usually). The reasons, constraints, and contexts may not be visible externally, but there's usually something that makes sense. Saying "All-in" may be (is almost always) contractual.

There's no one right path because It Depends on your own context, constraints, and so on.

When does the cloud always make sense?

You're building something new.

Your workload scales up and down significantly.

You need something fairly complex beyond "A bunch of VMs."

When does on-prem make sense?

You already have a data center.

Your workload is a bunch of VMs at steady-state.

Your staff's skill set is on-prem centric.

There are economic consequences (like Broadcom pirates building VMware and seeing a 10–12x increase).

So what have we learned today?

It's contextually dependent.

People cost more than computers.

Don't trust the tech press about repatriation trends.

Ms. Frizzle should never be allowed to teach again.

There is no "right answer."

Giraffes aren't real.

Caveat: Today you're just gonna go anywhere that has GPUs.

Plenary: Is It Already Time to Version Observability?

[No slides | Direct video link]

Charity Majors is the cofounder and CTO of honeycomb.io, the O.G. observability company, and the coauthor of O'Reilly books "Database Reliability Engineering" and "Observability Engineering". She writes about tech, leadership and other random stuff at https://charity.wtf. (You may be more familiar with their competitor, NewRelic.)

Modern software development is all about fast feedback loops, with best practices like testing in production, continuous delivery, observability driven development, and feature flags. Yet we often hear people complaining that only startups can get away with doing these things; real, grown-up companies are subject to regulatory oversight, which prevents engineers from deploying their own code due to separation of concerns, requires managers to sign off on changes, etc.

This is categorically false: there is nothing in any regulation or standard to prevent you from using modern development best practices. Let's take a stroll through the regulatory landscape and do some myth busting about what they do and don't say, and what this means for you. Teams that figure out how to follow modern best practices can build circles around teams that don't, which is a huge competitive advantage. Your competition is working on this right now: you should be too.

"Observability" has many meanings — control theory, unknown unknowns, pillars, high cardinality, high dimensionality, and exploitability. We want to make better decisions... so this term may well be overloaded.

2016, applying control theory mean when applied to software.

2018, the "well, actually..." years

2020, three pillars...

It's basically a property of complex systems (like reliability). You can't just buy a tool and be done.

1.0 is more ops; 2.0 is more devops and lifecycle.

1.0 is about operating software (bugs, errors, MTTR, MTTD, reliability, monitoring, and performance). 2.0 is about how you develop software (underpins the entire software development lifecycle, allowing you to hook up tight feedback loops and move swiftly with confidence).

(See mipsytipsy's blog post.)

There are three types of data under the hood:

Metrics

Unstructured logs

Structured logs

Product Reliability for Google Maps

[No slides | Direct video link]

After the morning break I attended Micah Lerner's and Joe Abrams' talk about Google Maps' product reliability. As their organization has gotten very good at protecting server SLOs with reliability best practices like scaling globally distributed at-scale architectures, toil mitigation, and continuous reliability improvements they noticed that a majority of incidents impacting their end-users were not showing up as an SLO miss.

In many cases these outages were not even observable from the server side — for example, the rollout of a new version of the consumer mobile application (that our services powers) to an app store could break one or more critical feature(s) due to bugs in client code. This reality has led to a change in the way they approach reliability — they're shifting their focus from server reliability to product reliability.

They're not yet finished with the transition, but they're starting to see very positive results. Their talk shared challenges they've solved so far, lessons they've learned, and their vision for the future.

Problem: Many significant outages were not detected via automated alerts. Manually-detected incidents were more severe for more users. The server would be up but perhaps not performing as they'd like, or a data source update caused routes to look odd.

Analysis showed two gaps: Production alerting didn't fully capture the user experience, and rollout systems for code and configuration received incomplete information. That led to insights to have a higher team focus on user-centricity and to provide the right feedback to change delivery systems.

They pivoted to user-focused engagement. Their strategy was to:

Partner with product teams to identify what a non-working feature means to users. They defined and prioritized critical user cases and came to an agreement on what to monitor and why: What qualifies as a mut-win, what's the impact of failure, what regressions are severe enough to allocate incident management resources.

Directly measure the user experience to be able to tell when a feature isn't working from the user's perspective. They translated user expectations into concrete signals, harnessing existing signals, and measured availability, latency, and data quality.

Availability — Are users seeing and interacting with the expected UI? A lot of "if this then that" are fed into the alerting system to catch anomalies quickly.

Latency — Can users accomplish their goals quickly?

Data quality — They look at the app front end, backend, and dependencies. They use graceful degradation to return a subset instead of no results. This is more nuanced.

Implement defense-in-depth to automatically prevent and detect outages. Using the availability, latency, and data quality measurements, they can better detect production alerts, and they can try to prevent server rollout (~55%), application binary releases (~30%), and feature ramps (~15%).

They shared a case study where the query returned zero search results.

What lessons did they learn and what's coming up next?

Integrate into the delivery feedback look. Client instrumentation pilots uncovered many more issues now that they could catch the smaller ones sooner. Integrating with the change systems is an Aha! moment. Outage impacts and debugging costs plummeted.

Get buy-in from product owners: Be clear about the costs for the initial investment in learning and setup and possibly-arduous debugging, but that leads to fewer outages and higher change velocity. The key point here is to introduce a pilot around a well-scoped feature set.

Future work includes debugging and maintainability. One problem is the delay in getting changes pushed to other vendors' app stores.

There were a lot of engaged questions in the chat... too many to get them all answered during the talk itself.

Using Generative AI Patterns for Better Observability

[No slides | No video]

The second of the two talks in the session was John Feminella's "Using Generative AI Patterns for Better Observability."Observability has always been a cornerstone for understanding and maintaining complex digital systems. Our familiar friends of metrics, logs, and traces have been enriched over the years with more powerful tools for greater understanding. Now, a new emerging superpower for practitioners has been added to that toolbox: generative AI.

In this talk, he covered several new generative AI patterns of interest that he's found to be of particular relevance to observability practitioners in production settings. They showed how even relatively basic generative AI scaffolding can be extraordinarily helpful for practitioners, how to leverage it in our day-to-day work, and how to get started right now. He left us with a mix of practical advice and solid theoretical grounding.

Be skeptical of vendor claims.

The two big patterns are the "helpful assistant pattern" to turn natural language questions into structured discovery queries and the "artifact inspector" to answer questions about a large set of documents that you haven't read.

The hierarchy is:

Artificial intelligence (AI)
Machine Learning (ML)
Deep Learning (DL)
Large Language Model (LLM)

where:

Artificial Intelligence (AI) is a machine approximating cognition of various kinds.

Machine Learning (ML) learns patterns from data even when we haven't seen it before. This requires training to get better over time. Structured data is easier to work with and can provide a level of confidence especially as the data point approaches a boundary. Observability data here is tempting... but real world systems are too complicated and can fail in surprisng ways. Don't reach for models at this level first.

Words are laden with context and meaning so representation is more complex. For example, "The old man the boat" is "The [elderly people] [operate] the boat." Those are five small words.

Deep Learning (DL) learns patterns from unstructured data. Computer science has the neural network example: 4 layers and 18 neurons have 4 input and 2 output. This might be observability-adjacent... but it's still too complex to be useful.

A Large Language Model (LLM) is fancy text prediction. If a sentence begins with "Should we go to" could be followed by "the park [0.0148]," "the grocery store [0.0161]," or "the waste incinerator [0.0001]" and only those with the highest confidence (shown in square brackets) are being shown.

Just because you've never seen the words doesn't mean you can't operate on them. The model learns properties about the text. You can use it to solve problems; some use cases include:

Given suggestions for what to try next when stuck.

Summarize something complicated.

Quickly gain footing in unfamiliar territory.

Takeaway: Fancy text prediction is powerful.

Gen AI can solve problems but there's a lot of hype:

Companies that want to sell an "it does everything" language model

"It's the singularity! It's the end times!"

Models are "reasoning" or "thinking"

"It's AIOps"

It's "cheating" (not hype but wrong)

[At this point the speaker doused his laptop and the slides went away for the duration of the talk. – ed.]

Looking at the semantic content instead of the structural content.

Given an arbitrary colelction of documents and you want it to be interesting, the artifact inspector is useful.

Warranted skepticism:

Can the machine be trusted? Does it return the right results? Does it return the same results repeatedly?

Plausibility isn't necessarily correctness. Models predict and can be plausible, not necessarily correct.

Embrace using GenAI... modulo the caveats above.

Final summary:

They're real.

They exist now (COTS).

You should try them out if they meet your needs.

The Ticking Time Bomb of Observability Expectations

[Direct slides link | Direct video link]

After the conference lunch, which was moderately awful food¹ served in the hotel atrium, I attended David Caudill's observability talk which explored the fundamental problems with the popular "monitor everything" maxim which allows vendors to control our discourse about monitoring. He shared some fundamental principles to guide your approach to observability in a cost-conscious manner.

The problems with observability are:

The loudest voices are vendors. They want to sell you something).

Unrealistic expectic expectations. "Monitor everything and sort it out later." We want to retrieve them instantly and error free, using the data to solve real world problems despite no training or docs. We want it even if the cloud provider has an outage, and all for less than it actually costs.

Lousy data and too much. It's highly contextual and we have limited or even no utilization data. Some of the data is important and useful (but most isn;'t). It's very hard to browse in this space.

Complex architecture. Moving from service into microservice and serverless have introduced IPC overhead, so what used to be a method call could now be to a separate service or function on any one of a dozen platforms, and everything is potentially instrumented to forensic detail and (in some cases) even instrument startup/teardown actions.

High price/cost. Vendors need to make money so they can't give stuff away for free now that the VC investments don't subsidize things anymore. Storing the data isn't cost effective for the vendor (who then bills it back to you), so even with discounts it's unfeasible. FInance is going to come back to you and ask how the tools can possibly be worth what you're paying for them.

High cognitive load. An individual can only hold so many things in their head at a time (generally, 72). What that thing (item or chunk) may be varies. With more stressors and noise, like late-night incidents, executives on incident calls, a pile of poorly-named dashboards, 200 people on an incident bridge, and millions of metrics, their (your) capacity shrinks.

Dead ends and well-worn paths to nowhere. You already have more data than you know what to do with. The same vendors and tools that got you into this mess are going to sell you "solutions" for getting out of it. Vendors will appeal to your ego and convince you that you can process more information than humanly possible. AI/ML solutions seem designed to expend maximum carbon for generating spurious correlations. "AI" is not going to save you from this, and it's not going to stay cheap, either.

Nobody and nothing can understand your system for you.

How do we construct meaning? Here, meaning is that the team understands the consequences of a given metric or alert, changes their behavior based upon what they see, can put the data into context, and uses the data to identify problems and to verify that they've been solved. And here constructed shows how we need to interact with the data and with someone who understands it well. Playing in the space will let them make sense of it to understand how it fits together with what they already know. You can't just tell them, they also have to actively learn it.

We should work backward from a vision. Think like a product manager, identify a few scenarios the tools should cover, and think about the personas involved in each (like "new developer with a migraine" and not "yourself on a good day"):

Incidence response

Performance degradation

High-level status communication

Some DOs and DON'Ts:

DO design dashboards for a new developer with a migraine.

DO take advantage of note widgets in dashboards.

DO ensure similar items are grouped logically and use a visual container if your tools provide it.

DO decide on a few very important dashboards you will share with your team.

DO put someone in charge of making the data meaningful.

DON'T pack status dashboards with more info than you need.

DON'T build a new dashboard every single time you want to know something.

DON'T build dashboards in response to every single incident.

DON'T break Conway's Law.

Separate status ("check engine" light, or red/yellow/green) from diagnostics (the OBD2 code) on your dashboard. You want to look at the status first (and it can link to more details). Think about the all-Google-products status dashboard.

Start with what's important, not what's easy. Use plain language, like "When a customer clicks the Checkout button, what is the error rate and how long does it take?"

Regarding costs, billing schemes are cryptic and complex (and often adversarial). Watch out for introductory rates especially once they end. Understand where the value versus costs doesn't make sense and reduce that spending. Review the bill and understand where the money is going; it'll take engineering hours but it's worth it. Also, if they're not giving you a detailed bill, consider another vendor. To reduce costs you need to send less, retain less, or both.

Consider an Otel collector to strip out high-cardinality stuff you don't care about and to fork the output to different targets.

Question your assumptions: "Do I really have this problem, or is it just really interesting?" Are you blind to assumptions because they serve your ego? Are people using the tool(s) you're paying a lot of money for?

How will you know if things are getting better? "You'll know it when you see it." Ask questions like:

"Hey I noticed we have an alert set on the CPU. What does it mean for customers when that alert fires?"

"These two things are next to each other on your team's important dashboard. How do they relate to each other?"

"What's the most important thing for me to monitor about this application?"

And ask several people and compare their answers. Evaluate behaviors; review AARs and look for cases where data was used to make or validate a decision, and look for anti-patterns as red flags.

Synthesizing Sanity with, and in Spite or, Synthetic Monitoring

[Direct slides link | Direct video link]

Next up this session was Daniel O'Dea about monitoring. Synthetic monitoring, particularly browser-based monitoring, is hard to do well. When tests pass, synthetic monitoring provides a uniquely intuitive kind of psychological safety - human-like, verified confidence, compared to other forms of monitoring. When tests fail, synthetic monitoring is often blamed as flaky, misconfigured, or unreliable. If not properly implemented, it can not only be financially, mentally and organisationally draining, but damaging to real customer experience.

This talk was a conceptual and technical story of four years working with Atlassian's in-house synthetic monitoring solution, being the owning developer for a tool actively used by 30–40 internal teams to build and manage synthetic monitoring for Jira. How can we make synthetic monitoring better serve its purpose of providing useful signals?

Synthetic monitoring are checks that act as if they were a real person using the system end-to-end. Selenium for browser-based automated testing is common (about 70% of us knew about it) but problematic (almost nobody had good experience with it). You can poll an endpoint to make sure it's up. You can run at different frequencies (every N time-units) or as a one-off. Other common or popular, non-proprietary frameworks are Playwright, Puppeteer, and Cypress. Proprietary include DataDog (and he mentioned a few others I didn't catch).

Testing pyramid: Manual over E2E (where synthetic monitoring tends to live) over Integration over Unit (base). Most places do more manual tasks than they'd like. It's good in that it represents the ideal quantity and it's simple, but it's bad in that it implies that higher (manual) is better) and that you have to build it from the ground up. Flip it upside down (triangle point down) and you get the testing funnel. Bugs should be caught by unit tests or integration tests, not E2E and almost never by manual tests.

Synthetic testing may not be the best solution. It's expensive to run. You may not get the granularity you want. Are the right alerts enabled?

Browser testing is flaky: There's customer noise, code, dependencies, test rigidity, scope, and lack of granularity.

What's the difference between Real User Monitoring (RUM) and synthetic monitoring? RUM is real user metrics using actual users or customers, but you may not get data frequently as they have to go down that path. Synthetic monitoring can go down all the code paths regardless of whether the actual users' or customers' behavior. You can run one or both.

He devolved into Atlassian-specifics about Pollinator, Pollinator Manager 2 (PM2), and various testing frameworks.

Granularity is important: Are you looking at the tenant, shard, or region? How often? How many checks? How often do you audit your checks?

They have had trouble with ownership, specifically for synthetic checks... but things are stable now.

[Slides didn't always align with what he was talking about, and he was occasionally hard to follow since he had a habit of speaking sotto vocce away from the microphone.]

Migrating a Large Scale Search Dataset in Production in a Highly Available Manner

[Direct slides link | Direct video link]

After the afternoon break (which was beverage-only) I went to Leila Vayghan's talk, "Migrating a Large Scale Search Dataset in Production in a Highly Available Manner."Shopify is an ecommerce platform supporting over 3 million global merchants which uses Google Kubernetes Engine to run Elasticsearch on Google Cloud Platform.

The COVID-19 pandemic led to an increase in global clients, causing latency issues and GDPR compliance challenges. To address this, the search infrastructure team at Shopify migrated European merchants' search data to European regions. However, this migration was complex due to the mixed storage of European and non-European merchants' data and the constraints of the indexing pipeline. Moreover, the scale of data that needed to be migrated was large and would lead to outages for merchants' search services which would negatively impact their revenue. In her talk Leila told the story of how this team designed an architecture to migrate a large dataset to European regions without impacting merchants' sales. She also reviewed the technical decisions and the tradeoffs that were made to address the challenges faced.

OIDC and CI/CD: Why Your CI Pipeline Is Your Greatest Security Threat

[Direct slides link | Direct video link]

Next up was Mark Hahn and Ted Hahn from Qualys. They noted most CI/CD processes are chock full of credentials, and almost anyone in your company has access to it. Configuring your CI correctly is vital to supply chain security. They discussed how to reduce that attack surface by enforcing proper branch permissions and using OIDC to reduce long-lived credentials and tie branches to roles.

Continuous Deployment (CD) can also be considered as a Confused Deputy.

Using OIDC with your CI/CD is reasonably straight forward:

Create roles in your cloud.

Set up pipelines to use them (AWS has some transformation functions).

Section off privileges into roles attached to branches. Don't get too granular; Prod and Non-Prod may be sufficient (principle of least privilege).

[They skipped the demo but the commands and links are in the slides.]

Complexity is the enemy of security.

When Your Open Source Turns to the Dark Side

[No slides | Direct video link]

The last talk of the day was from Dotan Horovits. Imagine waking up one morning to find out that your beloved open source tool, which lies at the heart of your system, is being relicensed. What does it mean? Can you still use it as before? Could the new license be infectious and require you to open source your own business logic?

This doomsday nightmare scenario isn't hypothetical. It is, in fact, very real, for databases (Elasticsearch and Kibana), for Infrastructure-as-Code tooling (like Terraform!), and for other OSS, with several examples over the past year alone.

In his talk, he reviewed some of the less known risks of open source, and shared his lessons learned facing such relicensing moves, as well as other case studies from the past few years. If you use OSS, you'll learn how to safeguard yourself. If you're in the process of evaluating a new OSS, you'll learn to look beyond the license and consider additional criteria. If you're debating open-sourcing a project, you'll gain important perspectives to consider.

(Note: SSPL is not an open source license. "Source available" does not mean "open source.")

Is having an OSS license enough to be considered open source? What prevents the project from changing license? Who can change the license? (Who governs it? Most are individuals on their own time, like log4j; some are those owned by vendors like ES and K and Grafana, mongoDB, Terraform, etc.; and some are governed by foundations, like the vendor-neutral Linux Foundation and Apache Foundation).

He ran through some case studies including the colors.js/faker.js "Pay me or fork this."

What can we learn about building, selecting, and using open source wisely?

Building — Open source is not a business model. You need to build a sustainable business model. Maintainers should not expect material compensation, even if the entire Fortune 500 uses it.

Using — Manage your third party licensing exposure just like you do with security risk exposure. Is there license contamination? sbones is a tool to track it. Take care with automation lest you upgrade to a newly-encumbered version. Look for code smells in the OSS space.

Selecting — In addition to what we do already, consider which open source software license applies? (SA != OS) Who is behind the open source software? What are the governance policies? Vendor distros of open source software can shield or mitigate risks.

Summary:

Open source software is more than a license.

Open source software can turn to the darkside (licenses, rogue, offending commits, terms of service changes, even on older projects)

Select open source software wisely.

Use open source software wisely.

Build open source software wisely. It is not a business model.

Always ask who is the "our" in "source"?

Evening

After the sessions there was a happy hour at the vendor showcase in the Pacific Concourse. Somehow I obtained additional drink tickets (I had eight instead of four to use between the Monday happy hour and the Tuesday reception).

I nibbled crudites, grilled asparagus, crackers, and cheese (the blue had a nice spicy funk), plus a little pepperoni and salami from the charcuterie board, along with two glasses of a perfectly reasonable pinot grigio. Had some nice conversations with friends old and new until they kicked us out around 6:40 p.m. Following that I headed up to the IBM hospitality suite and downed a Diet Coke while talking with a bunch of other folks. It was a bit noisy for me so I baled a bit after 7:00 p.m. and headed up to the room to write more of this trip report and to get off my feet.

Tuesday, March 19

After a wonderful eight hours (and change) of sleep I did a little bit of work (skimming email and cleaning up a ticket that came off hold) before heading down to the vendor floor for the continental breakfast (today was a repeat of the parfaits and three varieties of scone: cheddar-chive, lemon-blueberry, and maple oat). After breakfast I snuck into the ballroom to write more of the trip report.

Plenary: Meeting the Challenge of Burnout

[Direct slides link | Direct video link]

The conference day began with Christina Maslach's plenary session on burnout. Burnout is an occupational phenomenon that results from chronic workplace stressors that have not been successfully managed. Research on burnout has identified the value of fixing the job, and not just the person, within six areas of job-person mismatch. Improving the match between people and their jobs is the key to managing the chronic stressors, and can be done on a routine basis as part of regular organizational checkups. Better matches enable people to work smarter, rather than just harder, and to thrive rather than to get beaten down.

Burnout isn't well understood; many people think it's the individual's problem or fault and that it's up to the individual to take whatever steps are appropriate. The problem is we're framing it as who is burning out which leads to who answers; coping focuses on the effects not the cause. We need to reframe it as why the individual is burning out. That leads to more about what's going on in the environment to trigger the response, and focuses more on the causes than the effects... and moves from dealing with chronic stress (that happens all the time) to dealing with acute stress (which happens less often).

In 2019 the World Health Organization (WHO) came out with a statement on burnout (before their COVID statement): "Burnout is a syndrome conceptualized as resulting from chronic workplace stressors that have not been successfully managed." (Note the word chronic and the phrase about managing the stressors.) Who should manage them? Everyone: The individual, the team, and the manager.

Burnout is characterized by three dimensions:

Feelings of energy depletion or exhaustion

Increased mental distance from one's job, or feelings of negativism or cynicism related to one's job

Reduced professional efficacy

When you have all three with high frequency that's burnout. WHO said burnout is an occupational phenomenon or experience. It is not a medical condition but it could lead to medical conditions.

She's developed the Maslach Burnout Inventory (MBI) that measures all three of those dimensions in terms of frequency. That leads to five work profiles:

Burnout (three high-frequency negative scores)

Disengaged (cynicism only)

Overextended (exhaustion only)

Ineffective (inefficacy only)

Engagement (3 low-frequency negative scores)

Burnout begins as a management issue. The (poor) mantra is "You have to do more with less" or even "If you can't take the heat, get out of the kitchen." Those help cause burnout. Not managing the chronic job stressors successfully can lead to negative work outcomes (poor performance, absenteeism, and turnover) and negative health problems (chronic illness, anxiety, and depression). Chronic mismatches are often called "the pebbles in your shoe."

So the message we get is to fix the job, not the person. Helping the employee cope is necessary but not sufficient; you also have to help the workplace modify its sources of stress. This needs shared responsibility for the solutions.

Job-person match in six areas of work life (the Areas of Worklife Scale (AWS)):

Demand workload versus sustainable workload

Lack of control versus choice and control [over your work]

Insufficient reward or feedback versus recognition and reward (and not just money or benefits)

Breakdown of community versus supportive work community

Absence of fairness versus fairness respect and social justice

Value conflicts versus clear values and meaningful work

Fix the ones where there are real issues. There are six paths to a healthier workplace, one for each of these areas.

[Comment in Slack: Don't each of these six areas exactly correspond to what happens when companies do layoffs? overload "do more with less", lack of control, insufficient rewards (since bonuses get cut at same time), breakdown of community (from colleagues being ripped out), feelings of unfairness? See also this article.]

Matching people to the job needs training and education (skill development and practical experience) as well as coping with stressors (resilience, strength, and time away from work). These don't make the job less stressful.

Matching the job to people needs to modify the work conditions that create negative outcomes for the employees. Use environmental psychology and the model of ergonomics, which focus on the relationship between workers and their physical environment. You need to apply the design model to the social and psychological environment as well as the physical environment. A match is achieved by satisfying core social and psychologica; needs:

Autonomy

Belongingness

Competence

Psychological safety

Fairness

Meaning

Positive emotions

Helpful points to keep in mind are the three C's:

Collaborate — Get feedback and buy-in — reach out and listen.
Customize — Adapt to local culture — one size of "best practice" doesn't fit all.
Commit — Sustain effort to achieve positive gains — evaluate and modify until we get it right.

Bottom line: There are many possibilities within all six areas of job-person fit to make a better match between people and their job. These changes can be small, inexpensive, and customizable. The healthy job environment takes care of both the workers and the workplace, so the former will thrive and the latter will succeed. This needs regular checkups asking "How do we make things a little better around here?" (Ask that regularly: Quarterly? Annually?)

More details and examples are in The Burnout Challenge: Managing People's Relationships with Their Jobs by Christina Maslach and Michael P. Peiter.

Plenary: What We Want Is 90% the Same: Using Your Relationship with Security for Fun and Profit

[Direct slides link | Direct video link]

The plenary session continued with Lea Kissner's talk, "What We Want Is 90% the Same: Using Your Relationship with Security for Fun and Profit." Security and SRE are natural allies in the fight against randomness, terrible systems, and how those systems can hurt people and give us yet another reason to hate surprises. This talk went over where our interests overlap, where they don't, why, and how to take advantage of this in a world of infinite possibilities and limited, prioritized time.

Security and SRE generally want 90% of the same things (avoiding incidents, keeping people away from things they don't need to be in, etc.). Any sufficiently large system contains sufficiently random behavior that can appear as malicious behavior. If Security and SRE work together they can plan ahead and avoid duplicate work.

Shared goals:

Access controls (and dependency controls which can be the same)

Network design (incl. DoS attacks and AWS proxies have egress controls)

Observability (detecting incidents; tracking an attacker is different than debugging but there are a lot of similarities)

Releases (patching, break things fast and often; testing; feature flags)

Everything in its place

Incident response (incident coordinator, blameless post-mortem; security tends to have more lawyers involved)

Eliminating toil (we all have a lot to do so automate the stuff you can)

A goal metric is a quantifiable measurement used to figure out whether you're doing well (or not) at something. A perverse metric is a number that doesn't tell you what you think it's telling you. Real life example: A company used "Number of notifications to my phone" for a very large N. It's easy to drive that up. "We're getting a lot of requests" might be good... or might indicate a denial of service attack.

Given the similarities, why are we different teams?

Error budgets. SREs have one that is nonzero. Security does not; one error is one incident.

Measurement. Security aspires to the SRE levels of measurement, because what they're looking for is hiding.

Compliance. SREs see it as inflexible and often useless. (FedRAMP limits cryptographic implementations so OpenSSL bugs can not be patched quickly.) If you talk to your compliance folks you can translate and use their needs to make your life better.

Given all of that, how do we work together? Respect needs to happen first. Respect for each other and respect for the product experience. We need to collaborate. "One of the goals of this project is that nobody goes to jail" goes a long way with executives. We can choose to prioritize each others' projects ("Network redesign? Call it 'Egress protection!'")

Most CISOs aren't engineers of any stripe. You may need to translate between your language and theirs.

SREs make very good security people. Having multiple skill sets works really well. There's a difference between "How does this work" and "How does this break" and both SREs and Security look at the latter.

Logs Told Us It Was Kernel — It Wasn't

[Direct slides link | Direct video link]

After the morning break I went to track two (where I'd planned to spend the rest of the day). Unfortunately they had been unable to close the airwall separating the two halves of the ballroom because it jammed. At 11:00 a.m. they announced a 10-minute program hold to give the hotel staff a little more time to fix it and the conference staff time to prep the backup location with A/V if needed. At 11:05 a.m. they made the call to implement the backup plan and moved track two upstairs to a quickly-implemented set-up.

We started with a packed house about 13 minutes late with Valery Sigalov's "Logs Told Us It Was Kernel — It Wasn't." He demonstrated that the Linux kernel is not always responsible for application performance problems. He reviewed various techniques that can be used to investigate application performance issues. He expected the audience to learn how to write cache-friendly code to optimize application performance and how to use compilation flags to achieve further performance optimizations.

They saw severe disk latency with a new kernel upgrade, so that must be the problem, right? No, it was a misconfigured parameter combined with a large number of cgroups. They:

Profiled the additional "nop" instruction in the code.

Placed the local vars in the registers.

Ran the Intel VTune profiler.

Profiled the code block.

Researched the compilation flags for perf optimization.

Research profile-guided compilation for better optimization.

Topics included CPU architecture, caching, making code cache-friendly, and local memory.

The new kernel (4.18.0 vs 3.10.0) used gcc 8.5.0 (not 4.8.5) which added a nop instruction. It doesn't emit microcode but must be fetched and decoded, and they contributed a lot to increasing the code size. Profiling it showed over 1,000,000,000 added instructions (10,009,141,973 to 11,008,906,212).

The new compiler put the local variables on the stack instead of the register (which would be 4 faster).

They profiled the register keyword; including the register keyword suggest the compiler use it for local variables, and the add instruction can work with two registers directly. Including the register keyword reduced the cycles, instructions, and time. But that required too many production code changes to try to implement.

Next they looked at the Intel VTune profile. In the old environment the front-end took up 14.4% of pipeline slots, but in the new it was 45.2%. A significant portion of pipeline slots is remaining empty due to issues in the front-end. Tips: Make sure the code working size is not too large and the code layout does not require memory.

They saw that the loop test code was enough to exceed the instruction cache entry so it had to read both 64b blocks, not just one.

They added two "nop" instructions to force-align the loop block and the performance significantly improved. But that required too many production code changes to try to implement.

They looked at the -falign-functions compiler option to align by 16b to improve execution speed. Other options included -falign-loops, -funroll-loops, -O2 (recommended for production), -Os, and -O3. Higher levels of optimization can restrict debugging visibility. The performance optimization options increase the time and memory consumption during compilation.

They also looked at profile guided optimization (PGO) for the compiler to make more accurate guesses based on a previously-compiled and -run instrumented executable. Looking at some other applications found 1.13–1.25x improvements (Python 3.12.0).

Long story short: It wasn't the kernel after all. The lesson to be learned is to take a holistic approach to investigating the entire application stack.

(References are in the last slide.)

[This was a highly technical talk by someone whose mother tongue was not English. This was also lower-level than anything we're actually doing.]

Autopsy of a Cascading Outage from a MySQL Crashing Bug

[Direct slides link | No video]

The second session in this block was the Spotify team's "Autopsy of a Cascading Outage from a MySQL Crashing Bug." Once upon a time, an application query triggered a crashing bug. After automated failure recovery, the application resent the query, and MySQL crashed again. This was the beginning of a cascading failure that led to a full datastore unavailability and some partial data loss.

MySQL stability means that we can easily forget to implement operation best practices like cascading failure prevention and testing of unlikely recovery scenarios. It happened to us and this talk is about how we recovered and what we learned from this situation.

The team provided a full post-mortem of the cascading outage caused by a crashing bug. This talk not only shared the incident operational details, but also included what they could have done differently to reduce its impacts (including avoiding data loss), and what they changed in their infrastructure to prevent this from happening again (including cascading failure prevention).

This incident included a data outage, not just a service outage.

Their blameless incident process is governed by a strict timeline to ensure completing the post-mortem and next steps. It's automated and there's an ownership model where the SRE team provides guidance.

Infrastructure-wise they use Vitess in front of (as a sidecar to) MySQL. Their durability policy is to use MySQL Semi-sync so commits from a primary require ack from a replica, and the up-to-date replica is promoted if the primary fails. The scale is over 1,000 distinct primary databases for ~54 TiB data and 64+ billion queries per day. They support 1,500 engineers in major portions of the product, in small autonomous teams performing high-velocity development.

Because semi-sync requires an ack, and because both replicas are unavailable, so all writes are blocking and there are no more available connections. This takes the database down hard. The oncall says "semi-sync is a problem" and they disabled it to unblock writes and so reads and writes continued. However, after that the primary failed and there are three failed nodes. Usually MySQL will restart but in this case they didn't because of an InnoDB assertion failure... that caused the initial outage and also causes MySQL to crash on startup. The stack trace shows a rollback attempt after the connection was killed, but the rollback doesn't complete. (A failure in rollback is very bad and b0rks the instance).

We have the chain of events:

Killing a query or session caused a rollback (unclear what is killing at this point).

The rollback caused a crash and a failover.

This repeated on the new primary (kill, rollback, crash, and failover).

After the second failover, writes block because of no semi-sync replica.

Each blocked writes holds a connection, things pile-up and all reads start failing.

Disabling semi-sync unblocked things (reads and writes).

But this also allowed the problematic query to re-run.

Killing the problematic query crashed the last node.

They can use Vitess' backups to restore the database... and it blocks and the MySQL instance doesn't go live. The node was forced as primary with Vitess commands (which they were unfamiliar with) bringing the database and application back up. Meanwhile another group identified a long transaction timeout followed by a kill in the vttablet. They could disable that timeout, bring the database back up, the redundancy restored, and they could close the incident but they still need to analyze what happened.
Vitess' restore did not complete. For its point-in-time recovery it needs to stream binary logs from the primary (which was unavailable). They could've manually applied the binary logs but it assumes they're available and they brought the database up without them, losing up to eight hours of data until they could pull it from a still-b0rked node),
They thought that the rollback of the long/large transaction was thought to be the cause. They knew one database was affected, but what about others. The command

BEGIN; DELETE FROM TABLE LIMIT 10000; ROLLBACK;

sometimes crashes MySQL. After further analysis a single-row UPDATE or DELETE followed by a ROLLBACK could crash MySQL in an unrecoverable way. It's not big transactions but a corrupted row. SELECT is fine, a COMMIT succeeds, and ALTER TABLE FORCE or OPTIMIZE TABLE clears the corruption.

It's impossible to (and unwise to) prevent rollbacks. The problem happens in InnoDB's Compressed Row-Format feature. (BugIDs are in the slides.)

They created an extension for pt-archiver to detect the corruption, then they could ask the database owner to rebuild each corrupted table with their existing tooling. They run that detection script weekly on each database.

That wasn't the only root cause. Crashing buts will always exist. Fixing that is necessary but not sufficient. Disabling semi-sync caused the last node to crash, which is what led to the data loss. Also failing over to the last node also caused the data loss.

Lessons learned:

Vitess backup/restore lacks point in time recovery when there is no working primary. (This was a blindspot.)

With 99+% availability, recovery depending on live nodes is a trap. It lacks "recovery from scratch." What is the RPO and RTO of your infrastructure without a live standby? With a degraded standby? Are recovery procedures possible?

Rethinking whether the InnoDB Table Compression was the right answer. Using compression everywhere on all tables may have been a premature optimization or a bad tradeoff. Compression spends CPU to save on disk, but what are your actual constraints (for example, is the table read a lot)? It also adds a lot of bugs and configuration could be difficult.

This also applies to other cool features. Sometimes boring is better.

Fire fighting versus fire prevention. They should have waited before asking the whole company to reduce large queries or long transactions. Consider if the action is related to fixing the current outage or to fixing it after the fact. If you need the action done before closing the incident then work it, otherwise it can wait.

Precipitation is sometimes worse than well-thought solutions: "Don't make a bigger mess, don't failover, don't reboot."

When rushed into something you don't like, "I am not sure it is the best course of action and I am still thinking about it. Without anything better in 5 minutes we'll do it."

Consider your priorities during a data incident [pull from slide]

Cascading failure prevention: Human or on-call should not just failover again or just disable semi-sync. Avoid a longer outage by checking a few things first to make things safe.

They went through four parallel paths:

Backport proposed patches from MariaDB (available patches didn't fully resolve the issue)

Limit calling patterns in the application code that were prone to timed out transactions. It bought some time but was labor intensive.

Measure the extent of exposure and remediate with table rebuilds. (Successful path.)

Remove compression across all tables. This was very labor-intensive. They stopped it once the measurement showed promise.

Summary:

Removed guidance to disable semi-sync from all runbooks; disabling semi-sync turned an unavailability incident into a data loss incident.

We will not failover to the last standby anymore; wait for a standby to be up and running.

Exploring multiple paths helped us remediate faster; our incident response plans involve distinct information gathering phases instead of focusing on singular paths

Availability vs Durability; our customers want both, but actually prefer availability.

Lunch

Lunch today was down on the vendor floor. The selection included:

Roasted beet salad

Cucumber salad with tarragon

Coconut carrot soup

Roasted broccolini

Spring pea and asparagus risotto

Chicken with olives and artichokes

Sea bass in a roasted tomato sauce

Hummingbird cake

I had a nice discussion with a lady from New Zealand and a gentleman from Brazil.

What Is Incident Severity but a Lie Agreed Upon?

[No slides | Direct video link]

After lunch I resumed with Em Ruppe's talk about incident severity, where she asked several questions:

Is your incident severity a complex word problem?

Do you arbitrate which severity an incident should be during the incident itself?

Do you know what the difference between your severities are?

Do you even have incident severity in your org?

We talked about the lies we tell ourselves about incident severity, and how to find what kind of severities might actually work for any given organization.

The bad news: There is no correct way to do incident severity.
The good news: There is no correct way to do incident severity.

The hard part isn't that severity is a lie but that agreeing on it. We need a mutual understanding of what purpose severity serves in an incident. In general, we don't want the best way, but a way, any way, that (1) works right now and (2) doesn't get in the way of response. (If you wonder if it's an incident, then it's one. Move forward. Worry about severity later.)

Questions to work on trade-offs in severity:

What is "severity" measuring?

Impact. Does "impact" have an agreed-upon definition (revenue, traffic, customers, services, SLAs, SLOs, uptime calculations, MRR, ...)? Can you measure it (at the start, during, or after the incident)?

Priority. It's subjective. Is there an agreed upon definition (services, customer priority, customer facing, code freeze, calculation of dependency, revenue, ...)? Can you measure it (at the start, during, or after the incident)? It's often how you feel in the moment when it's scary and unknown and later how you feel once it's over with.

Can you combine impact and priority to get to severity? Not really, no. It's not reasonably quantifiable.

What is "severity" a mechanism for? Do you pull in different teams for different severities? Communicate differently internally? Communicate differently externally? What about after: Does the process change after? Does the SLA require an RCA? Does the severity change during or after the incident?

What organization problem are you blaming on "severity" and hoping it will solve?

Consider this as an organizational issue: Part process, part technical, part cultural, and part political.

Do you need to "fix severity" because you're underleveling? (Why are you underleveling? In the example, engineers were afraid to declare a sev-1 because it makes the team look bad or jump through more hoops.)

Do you need to "fix severity" because you're overleveling? (In a past job, if you were Unknown for >2 boxes it auto-defaulted to sev-1, which burned out Support and Em since they were all paged on every sev-1 with a 10-minute response time.)

Fischer's Rule: "If you're questioning whether it's an incident, declare an incident." Don't let severity get in the way.

Do you need to "fix severity" because stakeholders don't know what's going on from the outside? This is easiest to fix: Communicate and set expectations.

Do you need to "fix severity" because no one is sure if this incident means we pull an all-nighter now versus come back to it in the morning? This is setting expectations both internally and with customers.

Severity is hopefully measurable, partly technical (needs to be understood, observable, measurable), actionable (if you're using it as a mechanism to pull people in or mobilize responders), and part cultural (set expectations both internally and externally). For example, high severity may be "We know it's bad" or "We don't know how bad it is yet," while low severity may be "We know it's not bad or can wait but." Severity is not one size fits all.

Severity is a canary; it will sing about organizational problems whether you want it to or not. It's a sociotechnical issue. You need to communicate (and yes, it's not necessarily our strongest suit).

Severity is not forever. There's always an expiration date... which nobody knows until it happens. If something is contentious in the after-action discussions that have to do with expectation-setting and measurement and actions to take (or not), then it may be time to update your agreed-on definitions.

The most important part: Talk to the people at the sharp ends, those declaring incidents and determining severity. (This is how you win friends as an SRE, get people to join you on your journey; caring about their experience makes them feel heard, which is one of the six areas discussed this morning.) Use their input, troubleshoot, and try things out.

Again, all parties need to agree on the definitions.

(With the right trust between customer support and engineering, you can do without severity. That trust lets you pull someone in when needed. The problem is not having everything amount to a sev-1 all the time.)

Have the discussions before incidents, and socialize it across teams.

Hard Choices, Tight Timelines: A Closer Look at Skip-Level Tradeoff Decisions during Incidents

[Direct slides link | Direct video link]

The session continued with Dr. Laura Maguire's and Courtney Nash's "Hard Choices, Tight Timelines: A Closer Look at Skip-Level Tradeoff Decisions During Incidents." Unexpected outages in software service delivery — also known as incidents-often require making rapid tradeoff decisions on the road to recovery. Tradeoffs can be relatively minor — rolling back a recent change or temporarily disabling a certain feature — or they can represent significant threats to reliability or reputation, such as when facing a loss of customer data. While the resolution of incidents is unquestioningly in the hands of engineers, senior management also have an active role in making tradeoff decisions during significant incidents. As researchers interested in software incidents, they recognized a gap in the industry's understanding of how different levels across the organization work together to resolve challenging incidents.

Their objective in this research is to examine the kinds of tradeoff decisions management faces during incidents, the patterns in how and when they become involved, and the strategies used to coordinate effectively with their incident response teams.

During this talk we got a behind the scenes (and between the ears!) look at management tradeoff decisions and how this knowledge can be used to increase an organization's capacity to handle unexpected events.

They started with a "My computer doesn't work" bit to judge how uncomfortable we in the audience felt.

Tradeoffs tend to be advantageous but conflicting properties (such as speed versus accuracy) ubiquitous in cognition. Also, choices between different but interacting or conflicting goals between values or costs placed on different possible outcomes or courses of action, or between the risks of different errors. It's not that you will or won't have errors, but what sorts of errors you will have. Decisions are often made while facing uncertainty, risk, and the pressure of limited resources (like time pressure and opportunity costs).

There's a lot of research on tradeoffs in other domains (aviation, healthcare, power, etc.), but not so much in information technology. They took a multipronged approach to investigate tradeoffs in incident response. They started with the VOID. They want data not "anecdata."

One problem with collecting these is that what's public is often redacted from what's internal. Not a lot of people talk about tradeoffs, but Laura Nolan wrote one about Slack's February 22, 2022 outage. The negative result is that organizations don't tend to discuss tradeoffs decisions in public because there are tradeoffs about talking about tradeoffs. But talking about them in public can help normalize that the discussions are inevitable within complex systems.

They walked us through a vignette of cascading issues from deprecating an API leading to cached data that doesn't expire leading to email with data that should've been but wasn't redacted in breach of the law, and efforts to move to the new API led people to use old services. They asked probing questions in a 40–60 minute survey (n=27, 16% senior leadership, 20% manager or skip-level manager, and 64% individual contributor).

They had seven key findings:

Tradeoff decision making in incidents is complex. Tradeoff decisions are technical, organizational, and social... which may be undocumented and tacit knowledge or understanding.

Tradeoff decisions are considered and managed differently across roles and levels within the organizations. A surprising finding was how many senior leaders wanted to discuss the mechanism of failure because many came up through the technology track. Another was that ICs are thinking about reputation.

Tradeoff decisions cross boundaries in a variety of ways. The ripple effects within and outside the organization. More people involved muddies decision making slowing things down, but having the necessary teams involved was usually helpful.

Knowing more about organizational context increases focus on anticipation and optimization for others. Legal might not understand the technical details so you might not bring them into the conference call or chat room, but side channel them to keep them informed. Also, senior management being present can create a chilling effect and confuse authoriyty in an incident, so keeping them out (and briefing them via side channel) may be helpful.

Costs and benefits of tradeoffs may be unevenly distributed.

Tradeoff decisions evolve over time. Priorities change, possibility of failure can narrow or broaden, affecting possible or probable paths to resolution.

Some goals and priorities are likely to get trashed along the way. As awareneess of the extent of the problem grows, the emphasis on economic loss shifts to company impact and reputation. "Will we be on the front page of Hacker News or CNN?"

Key takeaways:

Making tradeoff decisions can be as complex as the technical debugging. Let's start recognizing this and developing these skills.

Tradeoff decisions are managed differently across the organization. Bringing those perspectives to bear effectively takes practice.

Invest in cross boundary decision-making capabilities.

Encourage decision-making that emphasizes anticipation and optimization across boundaries.

Be transparent about costs and benefits and ask if they align with the values and long-term success of the organization.

Tradeoff decisions evolve over time, so practice effectively reframing the problem and continual model updating to avoid frustrations and oversimplifications.

Recognize when and how conditions are changing in ways that require some goals and priorities to be trashed. Be explicit so others can adapt to this reality.

There are limitations here: Self-selection, self-identification of level, variability in role titles and authority across organizations. And the duration of the vignette may have impeded more senior leadership participation.

What's next? Future research! Collect more data, map the extent of the information needed for tradeoff decisions, better understand role goals and priorities in organizations, and evaluate the effects of introducing tradeoff decision debriefing in incident reviews.

Triage with Mental Models

[No slides | Direct video link]

After the afternoon break I went to Marianne Bellotti's talk on mental models. She's the author of Kill It with Fire. What powers those amazing insights that certain engineers bring to the conversation during triage? How do some people just look at a monitoring dashboard and immediately have a hypothesis of what's really going wrong? Experts carry around their own libraries of mental models of systems built up over years of operational experience, but you can learn how to form them intentionally, how to extract them from others and how to manipulate them to see complex systems clearly.

Here we mean "model" as an understandable mental model storable in a single person's brainspace. It can be an abstraction that communicates a key concept about a system, such as expected behavior, relationships between concepts, or invariants. They're simplified within a given scope or under certain conditions or specific definitions.

We don't mean ML or AI. We don't mean like data models or declarative specifications.

Why use them? You, your colleagues, the actual implementation, and the context and requirements may vary. Models help find gaps in understanding, transfer knowledge, refine requirements by forcing specificity, generate scenarios and hypotheses about the system, and they train your intuition. We already use them in class inheritance, unit tests, data schemas, and triage.

We create them by observation and exploration (and sometimes analogies); reading documentation is not necessarily useful in this context. (It can provide architecture, commands, endpoints, interactions, errors, and schemas. That's all useful... but they don't describe the system behavior. Good documentation can include design intent, why we built it this way, why we decided X instead of Y, and so on.)

How do we build the models proactively?

Process map (step → choice → step → result). This helps find the unknowns and can overlay counterfsactual reasoning. It's a de facto topographic search. Counterfactuals can crop up (X=true or X=false, but what if X=nil? What could cause that?). With the topographic map you can figure out where and how it could happen. Best practices are to not invite the one guy who knows everything. People learn from observation and exploration not information dumps. The goal is a list of questions and unknowns After-action is to find out the answers. Ground truth is sacrosanct: Logs, source code, and contemporaneous notes (AAR, RFC/ADRs, Slack threads, PR comments, etc.).

[Aside: After the talk I mentioned the case where we effectively received the panic message from:
if a > 0 { ... } elseif a = 0 { ... } elseif a < 0 { ... } else panic("This should not happen");
This happened because "a" flipped the sign bit between the first and last tests. See also "But Rob, what if time runs backwards?"]

Mind map. This is free association grouping and mapping concepts, terms, and themes. Invite the old expert. You can flesh out defined and undocumented relationships and compare analogies across multiple levels of experience (novice to expert).

SLOs are a great decision-making tool for product maps. We discussed the boundary of acceptable performance (see the next talk). Functional boundaries emphasize tradeoffs and provide SLO-type thinking when the organization maturity isn't ready for SLOs.

Phases:

Find gaps in thinking.

Form hypotheses about our systems.

Proving our ideas about the system make sense. (Here we need computers.) You can have a computer model to check. Steps → Algorithms and properties/invariants → assertions. There are specialized modeling languages that can explore the state space (like TLA+, Alloy, P, etc.). It's based on first order logic.

The larger and more complex the system is, the larger the gap between the specification and the implementation can be. It's important to build a strong discipline in using models to cultivate critical thinking and conversation rather than jumping straight into "proving."

Defense at the Boundary of Acceptable Performance

[Direct slides link | Direct video link]

The last session of the day was Andrew Hatch's "Defence at the Boundary of Acceptabile Performance." In the 1990s, Jens Rasmussen (Danish system safety, human factors, and cognitive systems engineering researcher) published "Risk Management in a Dynamic Society." The Dynamic Safety Model was a key element of this, used to illustrate how socio-technical organizations cope with pressure from competing economic, workload, and performance forces. He used this model to demonstrate how it can represent forces acting within large technology organizations, continually pushing the point of operations closer to the Boundary of Acceptable Performance and, as we approach or cross it, how our lives as SREs become negatively impacted.

We unpacked the ruthless nature of forces protecting economic boundaries, manifesting as layoffs and budget cuts. How pushing people to exhaustion at the workload boundary decreases system safety and, ultimately, profitability. Lastly, we examined how this model forms the underlying theory behind "chaos engineering" to detect and reinforce risk boundaries through feedback loops to build more resilient systems.

Our goals in this talk:

Broaden our perspective on internal and external forces that create risk.

Getting greater value from AAR and asking better questions.

Identify where the margins for error exist and how to reinforce them.

[See the diagram. The slides themselves were very interactive, making it hard to take notes, even if it weren't the last talk of the day.]

Because we have little power and can't affect the workload or economic boundaries, we have to look at the performance boundary. Pushing back is hard. It's often the last line of defense. Visibility and awareness becomes difficult as complexity increases.

The error margin is a virtual boundary. Once we cross it we assume bad things will start happening. It is never static; it drifts undetectedly ("normalization of deviance"). Major deviations can be powerful forces against other boundaries (for example, we can harness how upset the VPs get).

Collective local knowledge creates broader awareness... but surprises will still happen. Reactive tuning needs prioritizing, but proactive anticipation needs budget and time. Awareness and anticipation of economic and workload forces, and their implications, is critical for leaders. Nothing is static; unanticipated changes can have surprising results.

What can we learn from Rasmussen?

Actions in other parts of the system will always have unintended consequences. Your systems are complex.

Identifying workload and economic affects on localised domains enriches the context of implications.

Clear identification of error margins improves system health through better focus of adaptive actions.

Reducing distractions through effective proactive action removes wasted work and hopefully you'll sleep better.

Avoid the root cause whack-a-mole game on the performance boundary. Look more broadly and consider all contributing factors.

Stop using vanity and rolled up metrics; they can trap you in reactive cycles. Close feedback loops.

Drastic economic actions can be necessary but their effects can be long-lasting or permanent.

Morale and engagement are critical to success. Keep crossing that boundary and performance will regress.

Reception

After the day's sessions ended I headed to the hotel atrium for the conference reception. They went with an Asian-esque theme: The Chinatown dim sum station had potstickers, har gow, siu mai, bao, tempura shrimp, and vegetarian spring rolls; they also had their usual cheese and crackers and crudite and grilled asparagus.

I caught up with friends old and new, talked about how well things were (and weren't) going with the conference with the executive director (and one of my complaints — that they kept attendees from entering the ballroom spaces until the session start time (specificslly, after sound check) — has been resolved already). I also returned seven excess drink tickets (between my four and two others gifting me theirs, I had 12 total tickets; I used two yesterday and three today, all on pinot grigio).

I thought about going to the lightning talks at 7:00 p.m. but between the food and the drink, and the sore hips, knees, ankles, and feet from the hotel floors, I figured I'd err on the side of caution. I went upstairs, popped some muscle relaxants and ibuprofen (with Norco on standby if that wasn't enough), and crashed.

Wednesday, March 20

It's the last day of the conference and depite an early bedtime and at least nine hours of sleep I was dragging. (How did we ever do six days of this?)

The morning's continental breakfast was back in the Grand Foyer on street level. The parfaits made their third appearance, but the carbohydrates today were in the form of toast-your-own bread and English muffins. I was not impressed, but not enough to actually go to the hotel market or out for a breakfast sandwich.

In good news, however, the overnight crew was able to fix and close the air wall in the Grand Ballroom, so we could continue to have two side-by-side tracks.

The Art of SRE: Building People Networks to Amplify Impact

[No slides | Direct video link]

The day kicked off with Daria Barteneva's "The Art of SRE" talk. If SRE were a well-defined science, we wouldn't need too much mentoring/coaching/cross-team groups. We could just do "the thing" and be done with it. But the truth is that our profession needs a significant component of human interaction to unlock a degree of success that would be unattainable without it.

In this talk we looked at some learnings from another field: Choir direction. As a trained opera singer, she always sang in choirs — there is no better way to improve your technique! Going deeper into how choirs are balanced to elevate overall choir performance and help inexperienced singers, we uncovered an opportunity not broadly formalized in the engineering field: Implicit coaching. Common in music and sports, implicit coaching is a powerful way to help engineers build and practice critical skills.

Expanding beyond traditional coaching techniques and understanding their pros and cons allows you to find a solution that works for your individual situation and helps you and your team to learn and effectively improve across different dimensions.

Why is this an art not a science? A science task is to define SLI for your service: Understand an SLI, instrument code, translate metrics to SLI, setup SLI/error budget dashboards, and document what you did. An art task is convincing all your partner teams to define and use SLIs. Lesson 1: Soft/people skills are important.

Lesson 2: There is no one way to get to SRE; we derive value from our diversity.

Her day-to-day job used to be lots of coding and few human interactions; now it's the reverse ("flip the pyramid"). The middle is always documentation; write your ideas or nobody will know about them.

Lesson 3: Systems are built, run, or used by humans. Human relationships are complex. We weave intricate connections, shaping the world (and systems) around us. (Reference: Tanya Reilly's "The Staff Engineer's Path.")

How do we get better? We practice both alone and in groups.

What's a choir? A group of people who sing together, led by a conductor, usually have four or more voices in harmony, dating back to the second century BC. We had a brief discussion of acoustics, volume (amplitude), pitch (frequency), and color/timbre (shape of the wave). In a choir all of those waveforms overlap. This leads to lesson 4: Pairing singers in certain ways changes the shape of the sound wave. Similarly, collaboration in an organization may generate greater (or lesser) impact than the sum of the individual impacts.

Lesson 5: Practicing a skill in a group with varied levels of expertise amplifies your learning. Implicit learning happens by observation first, followed by conditioning, using role models and proximity.

How do we learn? Learning is inevitable (new tech, new people, new ideas, etc.). Information comes in, moves to working memory, some moves to long-term memory, and a lot gets thrown out. This leads to cognitive load theory where you can only hold 7±2 pieces of information at a time, Over time the blocks of information can contain more information. Example: Chess. At the start you know the pieces; later you know the moves, theories, and strategies.

Cognitive load types are:

Intrinsic (complexity of the information)

Extraneous (how information is presented)

Germane (automation and informational schemas)

Lesson 6: Learning is inevitable but how we load is under our control. Managing cognitive load helps.

Some common challenges:

Some skills require practice

Poor onboarding

Tacit knowledge only in people's heads (undocumented)

Hero culture

No formal curriculum to build the skills needed to be successful

Teams operate in silos

Mentoring can help:

One on one, but there can be a chemistry mismatch, is time intensive, provides limited perspective, and there's a limited number of mentors available

Formal group mentoring, either in lecture format, which can be too narrowly scoped or provide less interactivity, or discussion format, which doesn't provide hands-on practice and is not as good for quiet participants.

Implicit (like the choir): Work groups, direct and reverse shadowing, onboarding buddies, code reviews

We can extrapolate the choir mosaic to the engineering world.

Deciding the size of your choir or work group: Up to five expert/senior and up to 15 people total, with diverse skills and experiences.

Lesson 7: Just being present is not enough. You have to practice... and make mistakes! Work groups (and choir rehearsals) are a safe space to make mistakes and learn from them.

To recap:

Our profession depends on human interaction to be successful.

Soft skills are as critical as technical skills for an engineer.

Developing some skills requires practice in a group setting.

Your network is a great source of learning.

Cognitive load theories help you to optimize your learning.

Helping every individual in the team to get better makes the whole team better.

Lack of consensus and disconnected teams are very expensive for the organization.

Teaching Site Reliability Engineering as a Computer Science Elective

[Direct slides link | Direct video link]

Next up was Mikey Dickerson's talk about teaching SRE. After 15 years complaining about how we can never find enough SREs to hire, he designed and taught a class for computer science majors called "Managing Complex Systems." He talked about how this went and what it said about the long-term future of the industry.

Intro slide has a pyramid of reliability hierarchy from monitoring on the bottom (most reliable) up to product (least).

Observation: New grad hires struggle with the jump to workplace expectations (self-sufficiency). They're best prepared to write new code on a new green-field project with minimal dependencies. That's rarely the job that they get, though; they tend to join an existing project to help fix bugs, but there's no documentation and they're told to create the documentation as they learn the project. Meanwhile, many CS departments have more students. College has worked to shift the student body to more lower-income, first-generation, and underrepresented groups. There's an identity crisis in higher education: They don't want to be vocational training and don't want to exclusively serve trust fund babies.

He designed a class with two parallel sequences:

Practicum — In teams of 4 or 5, run through a series of approximately weekly assignments ("milestones") that start with a git repository and end with a reasonable facsimile of a customer-facing production service. Given a survey on experience, the sequence is:

Get an AWS account, set up a VM, and be able to use ssh. Oh, and document everything.

Install Postgres, import the starter data CSV file, and answer some questions with SQL.

Install Tomcat, Python, and the webapp and service process. They have to show it's installed and working (even if it's the 500/Error page).

Make it all actually work together (difficulty spike).

Set up APM-style instrumentation. They can choose on their own. Most pick NewRelic and some choose free Honeycomb.

Set up an on-call rotation with pagerduty etc. (There's a diagram of things at this point in the slides.)

Show improvement over week 05 aviilability under synthetic load of about 100 rpm (confidence improves).

Show 90% availability (+ side activity).

Solve a synthetic outage and write a post-mortem.

Rearchitect and scale up to withstand a 15-minute load test of 10,000 rpm.

Deploy changes with CI/CD, sometimes containers/Docker.

Reading and discussions — Well-studied systems disasters (Challenger, Three Mile Island/Chernobyl, Air France 447, and Therac-25), "Sensemaking" from Karl Weick et al., and "Systems Safety" by Nancy Leveson. Student presents on the paper first so they can have the discussion. Goals include encouraging systems thinking, building resilience, and getting awareness of the effects of technology on the world.

We want to change their mindset. They start at "If the step by step instructions don't work exactly as written, I'm stuck" to "I can solve this problem with the assortment of tools and capabilities that I have, even though they're all imperfect." It includes planned surprises to help build resilience and adaptability.

There are unplanned surprises too:

Tools in everyday use don't have the safety margins students expect.

In year 1, one group ran "rm -rf /" on the server vm in week 4, obliterating their first month of work. They subsequently created backups and version control.

In year 2, one group destroyed their serving VM image and backups with confused application of package upgrades. This would have been an excellent time to introduce containers.

Reactions and results: 17 in 2021, 33 in 2022, and 31 in 2024 (24 women, 19 underrepresented). Three have dropped and two incompletes; grades tightly clustered around B+/A-. 11 of the first 50 said "best class ever" or equivalent. Six (five women) said it increased their problem-solving confidence. Negative comments were about workload (all would prefer less) and balance of time spent on various topics (no clear trend).

Students take this as an elective once all the core courses are done; most are seniors.

The Invisible Door: Reliability Gaps in the Front End

[No slides | Direct video link]

After the morning break I went to Isobel Redelmeier's talk about reliability gaps. It's hard for anyone to experience our services' many nines if our apps keep crashing. Front end reliability is critical to keeping our users happy — so how can we reliability engineers help improve it?

Her talk explored the unique technical and organizational challenges facing the front end, with particular emphasis on observability. Through specific examples from her experience as an SRE working on mobile reliability, she hoped we would learn about hurdles like how difficult it can be to add even basic metrics as well as get inspired by how well-connected front end tooling is to business analytics. Ultimately, we imagined what a more holistic approach to reliability could look like and consider a path to achieving it.

This talk was more geared towards front-end developers than for back-end SREs. As someone who doesn't work on the front end and is not a developer, especially for mobile apps, it was difficult to follow.

SREs care about what is inside our castle: Our services. It should be well fortified to protect everyone. We expand our concerns to gateways (load balancers, service meshes, and the moat). What about the town living around the castle (the front end: API and the web, desktop, and mobile clients)?

Analytics tools like those from BigQuery are high cardinality and high volume, by design. They're designed for non-engineers.

They formed a team of an Android expert, an Apple expert, Isobel, and a couple of managers, to own the hot potatoes and empowered to prioritize the work to resolve the front end problems.

SREs tend to measure indicators like availability, durability, latency, and throughput. On the front end they care about availability and latency, but the definitions can be different (for example, on the front end "availability" can mean app crashing),Front end people also care about system metrics and bugs. You don't want to cause a huge data bill or drain their battery from an infinite loop. For them, "incident" can mean "A bug made it to production."

Measurements are inherently temporal. Anomaly detection can help.

Can be expensive: At $1/month per metric:

2 core vitals

3 release channels (alpha, beta, prod)

1,000 versions

20,000 devices

would be $120M/month.

Automating Disaster Recovery: The Ultimate Reliability Challenge

[Direct slides link | Direct video link]

The morning sessions ended with Ricard Bejarano's talk on automating disaster recovery. He explains his job to non-techies by saying "If a meteor struck our servers, it's on my team to fix it." But what if it did? Realistically, what would happen if a meteor struck your datacenter?

This talk was the story of a vision, one to fully automate disaster recovery away, how he pushed back on it claiming it was impossible, and how they still executed on it to great success. Theirs was also a case study on why looking at these wide surface problems through the sociotechnical lens will set you up for success in places where you could've never anticipated.

So if a metaphorical meteor hit their datacenter, we would just press our metaphorical big red button.

What could we have done differently a decade ago (including architecture and design decisions) that would've prevented this? (Work with colleagues to resist confirmation bias.) Did requirements change? Are there better components now that didn't exist then?

Most incidents are resolved (that is, prevented) at the drawing board? Reliability cannot be an afterthought [if you're building reliable systems]; it has to be considered at the very beginning of the technical design.

Given Conway's law ("the architecture of the software you're building will closely resemble that of the human structure you set out to build"), should SREs influence team structure decisions? Putting SREs, or at elast those who think like SREs, in the room earlier likely could have helped if your end-goal is reliable systems.

This is like Security; after decades of insecure software, Security is being brought into the discussions earlier. Like security, reliability is a team effort.

We need to account for reliability (as well as observability and security) when building our software.

So what does all this mean about automating system recovery?

During an incident, priority 1 is still recovery. We can introduce generic mitigations as panic buttons that incident responders can press to mitigate the impact even if they don't know the contributing factors and causes:

Binary rollback (roll out 1.2, has problems, revert to 1.1)

Data rollback (recover to a previous point-in-time and then replay the "lost" data)

Degrade (handle fewer requests per second, or return partial results)

Upsize (add more resources to the system, horizontally or vertically)

Block or quarantine (limit the user ip address or whatever)

Drain (move the workload from the unhealthy cluster or region to the healthy one)

All of these should be considered while designing and building the product or service software.

They didn't design for reliability or DR a decade ago, so they have to copy the data to another region and then move the workload over. Consider global design.

He reminded us to take specific advice with a grain of salt. This talk was about developing reliable software, but if your goal is something other than reliability it may not be as applicable.

Lunch

Lunch today was Mexican-themed, including an apple-jicama salad, grilled veggies with Mexican oregano, jackfruit tacos, grilled chicken thighs with rice, salmon with quinoa,[2] and churros.

Taming the Linux Distribution Sprawl: A Journey to Standardization and Efficiency

[No slides | Direct video link]

The afternoon sessions began with Raj Shekhar's and James Fotherby's talk on standardizing their Linux distributions. They encountered a significant challenge: The proliferation of numerous Linux distributions in their production environment following their cloud migration, which complicated management. In this session, they revealed how they identified the right distributions, persuaded their team to embrace them, and navigated the crucial trade-offs that facilitated this important transition. This talk is specifically designed for Site Reliability Engineers (SREs) who are striving for system standardization.

They wanted to simplify and went with rpm-based packages to stay with the majority of their toolkit. They came down to Amazon Linux 2, CentOS 7.9, and hardened images for both. Then they looked at moving from their existing suite to those two (four).

How did they do it? It's a bottom-up organization which helped. They had an open dialog FAQ session and tech talk to explain why, minimized differences, and had dedicated support channels. Their discovery process involved identifying candidates for early migration within teams with a strong existing partnership and extensive clusters, but with streamlined and automated deployment processes. They expected failures and expected to learn from them.

After discovery they integrated into goals (quarterly OKRs), had a FAQ and sample pull requests, expected iteration (2–3 cycles of deploy, test, and rollback) with critical services, and having their platform engineers assist teams who didn't have enough engineers.

They expected some bottlenecks, especially in identification and categorization across a large variety of teams, pipelines, and systems; outdated builds and libraries; and some machines that could be retired.

Planned solutions included prioritizing work on large clusters, leveraging dashboards and dynamic tracking to organize inventory, moving infrastructure to Kubernetes, and directly partnering with teams.

At scale they used Jira, ElasticSearch, and host tagging conventions for discovery; version control and commits for collaboration; and Terraform, Packer, Cloud Init, and AWS for deployment.

Unexpected problems included organizational changes at the start of the pandemic, and maintaining focus on more than one quarter was a challenge. An unexpected positive result was identifying opportunities to containerize or retire applications and systems and to clean up the cruft that had accumulated in their environment.
[slide 30 has good outcomes and 31 has takeaways]
Frontend Design in SRE

[Direct slides link | Direct video link]

That was followed by "Frontend Design in SRE" by Andreas Bobak. Historically, SREs built tools that were often only useful for a small set of users, leading companies to put their focus (and money) on training users and not on building a frontend that is easily comprehensible to a large number of different users. As a company grows, the percentage of people that understand the intricacies of the full production stack will shrink compared to the percentage of people who have a desire to understand the state of production, requiring a shift in tradeoffs: away from scrappy tooling to a more deliberate approach to designing user experiences for SRE. This talk summarizes three important aspects to keep in mind when making those tradeoffs.

Many of us have written scripts to make our lives easier... but they may not have good front-ends, or they may be hard to explain to others or for others to use (and we won't even discuss documentation).

Historically, we had user training (server as pets) but are mainly moving more towards user experience (UX; server as cattle). Internal apps are usually written as pets and external apps are usually written as cattle. The challenge is that SRE is at the crossroads; we need to shift left and move from training to UX. Our applications need to work for non-SREs.

Takeaways:

Be careful with the information density. White space is helpful but SREs need all the data they can get. Allow for drill-down, but allow for cross-correlation (example: Google Maps zooming in).

Make sure the data is explainable. Summaries are well and good but SREs need to get to the specifics.

Design for adaptability. Allow for the users' different roles, requirements, and backgrounds.

Measuring Reliability Culture to Optimize Tradeoffs: Perspectives from an Anthropologist

[No slides | Direct video link]

Next up was Casey Bouskill's "Measuring Reliability Culture to Optimize Tradeoffs: Perspectives from an Anthropologist." Understanding how culture influences engineering practices is often a black box. At Meta, we focused on transforming our mantra of "move fast and break things," to "move fast with stable infrastructure" by giving attention to the cultural elements of doing reliability work. This talk describes this process and decodes how to systematically measure the on-the-ground perspectives of engaging with reliability work so that we can have an informed perspective on how to best optimize the right degree of reliability. The audience will take away actionable practices that can be used to understand how to evaluate their underlying reliability culture, take data-driven approaches to measuring reliability sentiment and barriers and facilitators to performing the work, and identify practices that allow for a more holistic data-driven prioritization of reliability efforts that aligns with cultural values, especially when there are competing demands and increasing pressure to optimize efficiency.

The former Facebook had to upgrade from "Move fast and break things" to "Move fast with stable infrastructure." You have to optimize for more than just speed. Making everything airtight stifles innovation. That shift towards a more reliable infrastructure was a deliberate cultural decision.

Anthropologists look at reality through two lenses, the outsider and the insider.

Outsider — Leadership's priorities, industry standards, SOPs

Insider — On-the-ground priorities, perceived value and recognition, friction in workflows

How?

Interviews. Conduct initial research to get the insider and outsider perspectives via interviews of a representative sample of viewpoints. This required a lot of networking, and statistical demographic work. She spoke to 40 people and asked "What comes to mind when you think about reliability at Meta." That alone led to fruitful discussions. After the interviews they applied the codes (tags) to the answers.

Key themes:

Reliability is necessary the greater we scale.

We want reliability to facilitate innovation, not block it.

Our current system depends on reliability champions, relying on a few key people who care about it.

Demonstrating the impact of reliability work and rewarding that impact.

Surveys. Turn the key points from the foundational research into a survey. They basically got four waves of data to get results over time. One of the findings was overall sentiment: 78% say their team values reliability but 70% agree teams should do more.

Actions. Respond to the needs of engineers with actions. They changed engineering performance expectations to explicitly include reliability. They formalized their Reliability Measurement Program to get quality top-line SLOs. They created a maturity model to guide reliability programs and prioritize actions.

Balance. Make tradeoffs in conjunction with culture in mind. You have to modulate how quickly you can move or pivot. They're keeping people in mind as they track the ROI of reliability.

Storytelling as an Incident Management Skill

[Direct slides link | Direct video link]

Last in this block of talks was Laura de Vries' "Storytelling as an Incident Management Skill." Managing incidents well requires a number of skills — debugging and systems understanding, strong communication, and high-speed project management — but we rarely talk about the power of storytelling in our incident management loop. From oncall preparation, to incident handling, to postmortem creation, skill at storytelling can support and even improve our engineering skills. Establishing a setting, building a coherent narrative, and understanding your characters builds memorable and compelling communication that can level up all stages of your incident management.

She's talking about a narrative, a story, an account of a string of events occurring in space and time. One example is a three-service example:

a (producer) → b (deduplication) → c (consumer)

Service a stopped receiving data, because b got overloaded, because a blocks on unsendable messages. They mitigated this by scaling up b, and will follow up later to shard a so it doesn't block when it has unsendable messages.

Consider narrative in the middle of the incident to bring new responders up to speed. Say WHY you did what you did, and WHAT the result was, to build the logical flow.

In on-call prep, discuss what the systems do, what the services do, and how requests flow through it. If it's broken in the past, how? What caused the pages? (Oncall prep included empathy to provide the emotional feelings when you got paged.) You can also use the Wheel of Misfortune exercise.

During an incident, collaborative communication happens. (Why does a lead to be?) Ask leading questions: Why did you think X? Why did you do Y? What could cause Z?

Check your assumptions.

After the incident, use storytelling to make the post-mortem engaging:

Set the scene. What services are involved, what do they do, and how do they function?

Add some drama. You can share the impact (how it affected customers) or discuss an underlying flaw in the service (foreshadowing). The latter tends to be better.

Chain the events together as a logical narrative. Talk through the trigger (X happened causing Y).

Explain the response. What did we do what we did, and what were the results? If you have to choose, talk about the systems not the troubleshooting.

Plan fixes. Every post-mortem should have one or more action items.

Plenary: Real Talk: What We Think We Know — That Just Ain't So

[Direct slides link | Direct video link]

After the break the airwall came down [we hope] and we resumed with our closing plenary talks. The first was John Allspaw's "Real Talk: What We Think We Know — That Just Ain't So." Dr. David Woods once said to me "We cannot call it a scientific field unless we can admit we've gotten things wrong in the past." Do we, in this community, do that? Well-formed critique is critical for any field - including SRE - to progress. He talked about a few ideas, assumptions, and concepts often talked about in this community but whose validity is rarely questioned or explored.

He asked about whether we'd read How Complex Systems Fail. Most of us had.

The title is a riff on the Mark Twain quote, "It ain't what you don't know that gets you into trouble. It's what you know for sure... that just ain't so." Examples:

The Great Wall of China is visible from space. (Untrue.)

A penny dropped from the top of the Empire State building will kill someone if it hits them on the head. (Also untrue.)

David Woods said "We cannot call it a scientific field unless we can admit we've gotten things wrong in the past." We can and we are. Several researchers were here at the conference. We revisit and we have history.

Up until the early 1990s, the standard expected way to understand software engineers' productivity was to count the lines of code they wrote. That was the conventional wisdom until Capers Jones' 1994 comment,"The use of lines of code metrics for productivity and quality studies is to be regarded as professional malpractice starting in 1995."

The idea that change was the leading cause of incidents was considered true... and Gartner said it was true 85% of the time in 1995. But changes are also the leading cause of resolving incidents, and one of the leading causes of incidents that don't happen.

Sometimes we revisit an assumption and it's true (validation); sometimes we do and it's not. We need to have productive skepticism in our inquiries. Courtney Nash has done a lot of work for reliability data.

The simple sequence "Detect → Mobilize → Diagnose → Resolve → Learn" was considered always to be true. And it may be true under some circumstances... but we often need to go back to previous steps. In reality it's not that crisp and clean. We can make some assumptions (like "Assume a frictionless spherical chicken of uniform density"). Also, if you detect something after it's resolved, it has a negative time to resolve (TTR); there's a Honeycomb writeup about that.

See Em Ruppe's SREcon22 India talk on repeats. The criteria used in labeling an incident as a "repeat" matters more than the "repeat" happening. "It happened again" is an invitation to ask for more details, And who gets to label an incident as a "repeat" can matter a great deal.

Assertion: An organization can be the most skilled and efficient at keeping stakeholders up to date about ongoing incidents and still be terrible at learning from them or responding to them.

The best scenario when it comes to responding to an incident is (a) the people responding can recognize immediately what is happening and (b) they know exactly what to do about it. Anything that can bolster people's expertise in support of those two things is paramount. Everything else is secondary. And when this happens it was handled so quickly and fluidly it might not even be classed as an incident!

Plenary: What Can You See from Here?

[Direct slides link | Direct video link]

The second and last plenary, and the last talk of the conference, was Tanya Reilly's "What Can You See from Here?" What you can see depends on where you stand. Your vantage point plays a big part in how you work, what you think is important, and how you interpret what's going on around you. Even your job title or team name can influence-and limit!-how you see the world. It's easy to find yourself stuck: no longer learning, polishing a service nobody else cares about, or wondering why it's so hard to get anything done outside your own team.

A change in perspective can change what you think is important, how you influence the decisions that you care about, and even what you think is possible for yourself. In this talk, we'll look at how to get a broader view.

WRT fights on the internet: People zoom in to the thing they're most interested in. There was a proposal in 1997 in FreeBSD about sleep(1) handling non-integer values.

Up close everything is amazing, like a fractal. Getting up close means we make subjective decisions that are right for us but not necessarily for the broader view or organization. This is interesting in that we can also take a broad view.

Up close, you can see arguments about our titles, including the definitions of DevOps Engineer, SysAdmin, Platform Engineer, or {Service|System}RE. Further away it doesn't matter. Another example is Google culture back in the day. There were a lot of good things, but the "drinking at your desk" might've been problematic but they couldn't see it because they were inside their bubble (or village or cult).

What's obvious to us may not be obvious to others, especially if they're earlier on our shared journey than we are, or in a different context than we are. And just because they don't necessarily know what we know, they almost certainly know things I don't.

What is "normal" shifts. (She didn't use "overton window.")

Computers used to be big.

Carrying a floppy next door was networking.

We all have cellphones and thus cameras now.

We used to have long-term jobs and not hop every 3–5 years.

History is not just the past, and the present, and the future.

Business value is not the only value.

We don't know what will break but we know things will break.

We don't know what will change but we know things will change.

We look for ways to make things a little better.

Closing Remarks

Thank you all for being here.
When Sarah and Dan put the program together they wanted to make a program that pulled in all the facets of SRE and where it intersects with other areas (security, academic research), to help give us broad perspectives. They also asked what we liked and didn't; we're an engaging and supportive environment, and people enjoyed learning together.

Thank you to our sponsors, program committee, lightning talk coordinators, steering committee, Usenix staff, Usenix board, and our speakers. We had over 600 participants, 24 countries, and 229 companies.

Videos of all talks (including the three that got bumped to the Bayview room) will be online in a few weeks.

SREcon24 EMEA is in Dublin (29–31 Oct); CFP is open.

SREcon25 Americas is Mar 25–27 in Santa Clara CA; the CFP will open around August.

Please fill out the survey!

Dinner

I went to dinner with Laura, Brian, three others whose names I didn't catch, and we were later joined by Melissa and someone else whose name I didn't catch, at Gott's. I had the monthly special (cubano) with garlic fries. After dinner they headed off to a bar and I swung into the Ferry Building for some ice cream (dark chocolate smoked sea salt).

I got back to the hotel, packed up most things, scheduled my Lyft pickup for the morning, and headed to bed.

Thursday, March 21

Ugh. I slept poorly; I was awake at least every hour or two. I gave up when I got the one-hour warning from Lyft. Showered, fininshed packing, downed the Coke Zero I had grabbed from Monday's afternoon break, used the emailed express checkout lihk to check out of the hotel, and headed to catch my scheduled Lyft to the airport.

The kiosk I walked to wasn't reading my fingertips so I stood in a short line to drop off my bag. Security moved pretty quickly and was mostly uneventful (the TSA agent said to send the cane through the X-ray instead of swabbing it, but it got stuck partway through). I recombobulated and hiked to my gate, getting there during the boarding process... for the flight to Atlanta. I wasn't going to Atlanta so I took a seat and waited.

Boarding itself was uneventful. The flight attendant didn't seem pleaseed when I asked her to hang my coat and cane. I don't know what she was expecting; I walk with a cane, was in seat 1B, and the closet was aft of row 4. Preboarding or not, that wasn't really something I wanted to fight with. The flight was full but we still pushed back two minutes early.

The in-flight breakfast was a version of French toast with a mango pur�e and toasted coconut topping. We had another croissant (with butter and orange marmalade) but instead of the fruit salad we had something else that got lost when the trip report file got corrupted.

The flight itself was uneventful. Landed, got my bag out of the carousel (eventually), did the hike up over down across and back up to get to my car, which was still where I left it. Traffic was moving at a good clip and I got home without incident.

__________

There was a salad of bitter greens and a potato-green bean dish that my buffet ran out of when I was in line. The mains were a grilled or roasted vegetable dish, a vegan gluten-free pasta, a chicken-mushroom-n-rice dish, and something labeled as grilled tilapia but which was thinly-sliced grilled beef with arugula. I tried the chicken and rice (eating around the mushrooms), which was cold and mostly flavorless, and the beef, which was cold and tough. The tiramisu cake was good, but not enough to save the meal. Later I found out there was also a soup of some kind... but it wasn't in or visible from my buffet line.

Back to my conference reports page

Back to my professional organizations page

Back to my work page

Back to my home page

Last update Mar15/25 by Josh Simon (<jss@clock.org>).