Conference Report: 2017 LISA

The following document is intended as the general trip report for me at the 31st Systems Administration Conference (LISA 2017) in San Francisco, CA, from October 29–November 3, 2017. It is going to a variety of audiences, so feel free to skip the parts that don't concern or interest you.

Saturday, October 28

Travel day! Easy traffic, easy bagcheck, easy security, easy tram, easy wait, easy board, mildly bumpy ride, short bag claim wait, short shuttle wait, last shuttle passenger out, short checkin wait, drop bags at the room, then lunch with David Nolan and Tom Limoncelli at the Ferry Street Farmer's Market. I wound up getting a lox and schmear on sourdough with pickled onions and tomatoes and followed that with a scoop each of chocolate malted and peanut butter fudge ripple ice creams in hot fudge.

After lunch I took a quick power nap until shortly before registration/badge pickup opened at 5pm. Got my badge and bag of stuff, then schmoozed at the Welcome Reception. Wound up going to dinner with Lee, Cory, Johann, and Mark at Gott's Burgers, where I got a cheeseburger and garlic fries. (We ate outside which is why I think it got too cold too quickly.)

After dinner we wandered back to the hotel. Since this was a 27-hour day I was definitely feeling it so I headed upstairs to crash.

Sunday, October 29

Slept badly (first night in the new bed), and was up every hour or two. Gave up around 6, caught up on social media, showered, shaved, and headed to the continental breakfast. Grabbed a croissant (and stuffed it with strawberry jam) and a pear. Schmoozed with folks, then hung out at LISA Labs for a bit.

Since today was a free day for me on the schedule I did some touristy things around San Francisco.

After I got back to the conference I hung out in LISA Labs and planned dinner. Travis Campbell, Lee Damon, Christopher DeMarco, John Kuroda, Branson Matheson, and I went to Espetus Brazillian Steak House for dinner. We started with the salad bar (surprisingly there was no seafood on it). Then the parade of meats: Filet, beef rib, top sirloin, pork sausage, pork belly, leg of lamb, house special sirloin, flap steak, grilled shrimp, and grilled pineapple. Too full for dessert we caught the F streetcar back to the hotel. Swung by Game Night where there were 3 simultaneous games of Exploding Kittens going on, schmoozed with some folks for a while, then headed upstairs to crash a bit before 10pm.

Monday, October 30

Slept better but still up every couple of hours. Continental breakfast (bagel, schmear, strawbery jam). Schmoozing. Got a Clipper Card. Killed time until 11am when the store I was going to opened up, and I bought some fun stuff while I was there.

Got back to the conference in time for the free tutorial lunch.

At his request I swung thru Branson's "Defense Against the Dark Arts" tutorial after lunch. Hooray for social engineering!

I took the rest of the afternoon off.

I went out to Interval bar (for a mai tai) then to Green's for dinner with local friend and former LISA-ite Ben Woodard. We split the spring roll appetizer and I had the squash-n-tomato pesto pizza.

Tuesday, October 31

I slept in until past 7:30am! Of course, odds are it's because I'm getting sick; the cold symptoms I'd been fighting off the last couple of weeks were getting worse. Spent the morning hanging out in the Labs space before heading up to the tutorial lunch. After lunch I napped, trying to fight off a headcold. And did some work for the office, but mostly napped. (The cold is winning.)

Tonight was the 0xdeadbeef dinner: Travis Campbell, Nicole Forsgren, Branson Matheson, and I went to the House of Prime Rib for, well, prime rib. (A couple of others were going to join us but were confused as to what day this had been set for and made other plans... which they kept. But hey, we got Nicole to join us, so we win!)

Salad prepared tableside, then King's Cut (~16 oz.) prime rib (medium rare, on the bone), baked potato (hold the sour cream), creamed corn (I hate spinach which was the other option), and Yorkshire pudding. Apparently if you finish the King's Cut you get offered another thin slice... which we declined. Had a glass or so of malbec with the meal, and then I had English trifle for dessert (it had fruit so it's healthy, right?).

We got back to the hotel in time for the last 10 minutes of the LGBT Issues BOF so I more or less waved Hi, gave Tom L. a hard time, then went upstairs to change out of the suit and tie into normal conference drag. Didn't feel like hanging out in the lobby bar and most of the other spaces were either empty or locked up, so I headed off to bed.

Wednesday, November 1

Contintental breakfast was scones (in my case, lemon-blueberry; I didn't want the maple oat or the pecan pumpkin). Caught up via Hangouts with Mike C from Oz (who attended at least some sessions via telepresence robot).

Opening remarks: Thanks to all the folks who helped put this together (from program to office to sponsors). Please go to LISA Labs for hands-on learning about new technologies.

Theme: Continuous improvement and scaling the future. We wanted a lot of first-time speakers and had about 300 blind submissions; the speaker slate is very diverse, especially since we reached out to underrepresented communities. They want not only technical solutions but social solutions.

LISA'18 is Oct28–Nov02 in Nashvillle; Rikki Endsley and Brendan Gregg are chairs.

Pleanary 1: "Security in Automtion," Jamesha Fisher (SecOps at GitHub), Jamesha was up first, focusing on Security Automation and why we should use it, what it should do, and what it can do (and how). Cross-team collaboration is important.

Automation will help in the "0 day -> deploy certificates" and "reset admin creds" cases. (All teams involved should be on the same tools, and consider a backup for when Slack^Wone of your tools goes away.) Consider 2FA. In her example they have a bot (BoardBoat) for on- and off-boarding staff, and another (Aegis) for generating SSL cetificate. Automation and tools for SAs/SREs to do their jobs makes SecOps' jobs easier.

Plenary 2: "Halfway Between Heaven and Hell: Securing Grassroots groups" by Leigh Honeywell (ACLU; slides online). Started with example of safety gear: Net under the SF bridge saved 29 people's lives. [xref: neveragain.tech]

"'Perimeter security is dead.' They've been saying that for 30 years."

She says, "I explain computers to lawyers so we can sue the government."

Q: How can we be safer?

A: Use a password manager. (applause) Don't reuse passwords. Use common sense when giving away personal information.

Q: When do you say No?

A: When someone had a virus present but not runnning on their laptop for a month I shut off their wifisay No?

Q: How can I help volunteer at my favorite non-profit?

A: Be persistent, be specific as to skills and time scope, and build a relationship with the techiest person you can find. Also, go to an open-to-the-public event so they see a face.

Q: What if your primary tool (e.g., Slack) goes down?

A: Have a backup tool for backchannel (e.g., irc or Twitter DMs).

Q: What threat model are non-profits looking like?

A: Extra-legal actors, financially-motiviated (ransomware), political agendas against you.

Q: What got you into Security and why do you stay?

A: Money; the community.

Q: What can Ops people do to make SecOps' life earier?

A: Sec needs to learn more discipline from Ops. Sec is still very hero culture-based and wild west-ish. Provide resources.

Q: How woukld you improve security culture where each developer has their own production environment (e.g., very decentralized)?

A: A SecEng on every team. Or wait for a breach. Or have a strike-force model.

(USENIX has provided video of the plenaries.).
After the morning break I went to the invited talk "Never Events" by Matt Provost of Yelp (@hypersupermeta). It's about building safe systems, based on lessons learned from England's National Health Service (NHS), established in 1948 to provide free healthcare at point of service to all 64.6 million UK residents.

A Never Event is a serious incident that "arise[s] from [the] failure of strong systemic protective barriers which can be defined as successful, reliable and comprehensive safeguards or remedies". They key criteria for defining Never Events is that they are preventable and have the potential to cause serious patient harm or death. All Never Events are reportable and undergo Root Cause Analysis to determine why the failure occurred, to prevent similar incidents from happening again.

Considering that the NHS is a healthcare service where incidents can obviously have serious, life-threatening or life-changing consequences, together with the scale of services provided (the NHS in England deals with over 1 million patients every 36 hours), their list of Never Events is actually quite short (14 events), including such items as "Wrong site surgery", "Retained foreign object post-procedure", and "Wrong route administration of medication."

In our industry, the requirement for these events to be preventable would exclude things like DDOS attacks or security breaches which are outside of the SRE team's direct control. Of course steps should be taken to minimise or prevent these types of incidents, the same way that doctors work to prevent patients from dying of cancer. But they don't cause cancer, so a patient dying of it is not a Never Event. However, a nurse administering the wrong type of cancer medication, or cancer medication to the wrong patient, or delivering the medication via the wrong route (such as intravenous instead of spinal) can all be Never Events.

If there are insufficient processes in place to prevent such mistakes, then they cannot be Never Events. This system is designed to protect the staff as well as patients, so that they aren't put under pressure to be perfect. There must be procedures in place so that it doesn't come down to an individual to make all of the correct choices on their own.

Events are a fundamental part of the safety culture of the NHS which is a "just culture that rejects blame as a tool." In recent years, modern systems safety concepts such as just culture and blameless postmortems have been introduced to the System Administration/Site Reliability Engineering/Devops community from other fields (such as healthcare). However the concept of defining specific Never Events has not been explored in this context and can bring similar benefits to those reported by the healthcare community with a reduction in the reoccurrence of such events.

Many systems engineering organisations already have their own formal or informal guidelines for reportable events. Publishing postmortems (either internally or public facing) is now becoming standard practise in our industry, but not all of these events are Never Events. These incidents should be studied by each organisation after each postmortem to generate a list of failures that should never occur again because safety systems/protective barriers have been put in place to prevent them. Any occurrence of such an incident after the fact is therefore a Never Event.

The goal of implementing the Never Events system is firstly to reduce the number of these serious events, but also to protect staff and to provide a safe working environment. Repeated Never Events indicate that management has not addressed the underlying causes of these incidents, which shifts responsibility away from the front line staff who are operating in (clearly) unsafe conditions or with inadequate safety systems in place to prevent these events.

While each organisation will come up with its own list of Never Events for their specific environment based on their examination and analysis of previous incidents, some generalisations can be made. For example, looking at "Wrong Site Surgery" from the NHS list, where the wrong part of the body is operated on (left vs right leg etc). This is a process failure, where the staff may do the correct procedure but to the wrong location. Transferring that to the systems administration world, this is analogous to running the correct command on the wrong system.

During their careers, most (if not all) system administrators have made certain classes of similar mistakes such as rebooting the wrong server, removing the wrong directory (including the classic rm -rf /) or executing a SQL DELETE statement without a WHERE clause. We will examine the steps the NHS has taken to prevent this type of "wrong site" incident, along with other Never Events. By learning from other industries we can come up with recommendations for preventing similar mistakes in our field.

That was followed by Cody Wilbourn, "The Hidden Costs of On-Call:" False Alarms." Costs, especially for exempt employees, need to be considered. Pruning alerts to get rid of false alarms may be worthwhile.

Investigation time (wage * time spent)... plus the opportunity cost of lost work, and the cost of the interruption (typically 40% of the opportunity cost). It amounts to a 4x multiplier.

Cognitive load: Having to remember stuff.

Stress: All of that gives stress. The manifestation of thinking your pager went off when it didn't. Expectations (internal or external). "Recovery from stress takes 20 days on average. Could your business survive that?"

Sleep deprivation: Causes problems thinking, processing, and remembering. Accidents are more likely.

Work-Life Balance: 32% of companies saw absenteeism decrease once flex time policies were introduced.

Alarm fatigue.

So what are some possible solutions?

Identify problematic alerts. Are there canned reports you can run in the tool? Parse the monitoring logs?

Prune the alerts. How actionable are the alerts? If you wait for self-resolution, or wait until tomorrow, it's not worth paging out.

Make sure the scope is right: Page-worthy, notifications, and informational. Don't rely on email if you don't have to; if you do, send it to another (shared) email box.

Check the right things. Downgrde node checks to info. Lose host-level alerts in clusters.

Limit reminders and combine related alerts.

Automate... but keep it in debug mode for a few weeks, since automation can be dumb.

Fix any system fragilities. Add capacity, add HA, etc.

The "printer rule": Not everything eneds to be managed and monitored 24x7. Is 9x5 sufficient?

Lunch was held on the vendor floor. Today it was sliders — buttermillk-brined fried chicken, angus burger, black bean burger, crab cakes — miscellaneous cheeses and toppings (bacon jam! blue cheese! tomato and pickle!), onion rings, and cookies. Swung by some of the vendors to pick up some swag for the coworkers then headed back to the sessions.

In the first afternoon block I went to the mini-tutorial "Handling Emergency Changes and Urgent Requests," by Jeanne Schock (jeanneschock@gmail.com (contact if slides aren't on the USB drive), who has ITIL certification... which means knowning when to ignore ITIL). The key here is differenctiating between EMERGENCY and URGENT.

They're similar:

Require production changes

Time sensitive

Consequential ("if we don't do this then ${badthing}")

High risk

Inevitable

(Changes to resolve an incident that don't require a change aren't included. Just include things under our Change Control Policies.)

Things are inevitable:

Human error

Things break

Hardware fails

VIPs have requests

Just because you're getting an emergency or urgent request is NOT the time to push back with "You shoulda...."

EMERGENCY — Have an emergency change type and follow that process.
URGENT — Use the normal change process but expedite it.

Emergencies:

Break fix to restore service

Urgent security patch

Change needed to prevent imminent outage

Issues of serious consequence and critical business urgency

Emergencies therefore are hended off from incident management. As a corrolary, that must be from an incident ticket, probably of critical or high priority, for something that should be monitored.

So what's the emergency change type?

Specific criteria and policies

Tailored change request form

Unique or special procedures

Specific change authority

Accept that it's high-risk

Expect less info than usual

Accelerate submission and authorization; verbal approval can be okay as long as we've documented that it is

Remember security incidents can be different; "server is down" doesn't always mean "reboot."

All critical/P1 tickets should have review after the fact (AAR/Post Mortem).

Can downgrade the priority of an incident after the emergency change fixes the critical exposed bits.

So that's emergencies. What about urgent? Just because it wasn't planned isn't an emergency; just because you missed the CAB review meeting isn't an emergecny. We need those to be assessed, authorized, and expedited but not through the emergency process. Assign an _expediter_ to communicate and to walk the thingy through the process. Use TBD or N/A in mandatory fields to revisit later but get the ticket opened and start the process. Immediately talk (communicate!) with the change manager or release manager, your own manager, and other teams whose help you'll need.

Define in policy that every emergency and urgent change get reviewed after the fact.

After the afternoon break I went to the "The 7 Deadly Sins of Documentation" talk by Chastity Blackwell, an SRE at Yelp. Why do we need good docs?

On-boarding

Instutituional Knowledge

Incident Response

Inclusiveness

Self-Perpetuation (good docs encourage others to write their own with high standards)

So why don't we have good docs if we know we need it?

SIN 1: To make a long story short, and as a lead to the other six sins, deprioritization. Because docs is seen as extra work that anyone can do without special skills. It's necessary, not extra, work; it needs to be a deliverable and in people's work or project plans. We need doc and style guides; references for proper grammar and language is a good idea. Call out common mistakes (length, overuse of adverbs and adjectives, accessiblity, etc.). There needs to be a review process. (Juniors/newbies aren't great to write things since they don't know the details and historical data, but they're great readers in the review process.)

SIN 2: Burying the Lede, or the Master of None. For incidents we need runbooks; when dealing with infradtructure we need to know what works with what other components. Most places have one, or the other, or something trying to do both at once. Runbooks should address specific questions, use inverted pyramid format (most important info up top), simple and straightforward with benign examples. A good TOC can mitigate but it's a crutch. Runbooks should be brief and link out to the "why" docs.

In tech docs — specs, guides, and references — provide context for the WHY and not just HOW. It's critical to understanfind how stuff's supposed to work. Use concrete examples. Know your audience, don't necessarily asume in-depth knowledge in an overview. Don't rely on autogenerated docs. Make sure what people understand is what you meant to say.

SIN 3: Repository Overload. There should be one single start point. Wiki, Google, git, ...? How does someeone know where to find what they need or to put the docs they write?

If you need to duplicate docs, automate it so there's no drift. Any repo needs to track changes and be searchable, Discoverable is better than searchable but Google is bad at this. A portal with broad categories might be good (example: documentation.its.umich.edu).

SIN 4: Documentation Overgrowth. Don't neglect curation duties. Don't keep things around for historical purposes since they can clutter search results. If documentation is a priority and in the project goals, then review (and delete obsolete) is better. Archive old things outside the repo if you really need them. Best practice: Use and update them regularly.

SIN 5: Comment Neglect. Good code doesn't show the why, there's an assumption that code is readable, and not everyone udnerstands the language that well. Context is king: Comments or docs should describe the why. Include concrete examples as to how things should behave. Have people with less familiarity review the code.

SIN 6: Jargon Overuse. It promotes an us-vs-them in/out group dichomoty. Ideally avoid jargon entirely; make the service or component name is logical; expand ALL acronyms on first use. Keep your audience in mind (e.g., new people or non-team members need different information). A glossary might be useful but it must NOT be a crutch. Make sure the environment is inclusive so people can ask "stupid" questions.

SIN 7: Video Addicition. It might be more engaging or easier to present and explain complex topics, but it's not easily-searched, there's no content marks to find specific bits, it's harder to edit when things change, and they are... less than good for accessibility. Video can be an okay supplement for some things (consumed in their entirety and updated infrequently). Graphics should always have alt-text. Don't use too many pictures. Make sure videos have timestamps for content and subtitles (and consider transcripts).

(I skipped out on the second half of the talk, "Persistent SRE Antipatterns: Pitfalls on the Road to Creating a Successful SRE Program Like Netflix and Google" by Blake Bisset and Jonah Horowitz of Stripe.)

I swung by the expo floor to pick up some more swag and grab some munchies at the happy hour before heading out to dinner. I met up with Andrew and Kyle at 6:30pm to go to Fang for dinner. Chinese; you say likes and dislikes and they bring you food. We wound up with chicken buns (kind of like banh mi), shrimp with apple and pear, whitefish with baby spinach, and a ginger chicken dish.

Did some brief socializing in the atrium before heading up to bed.

Thursday, November 2

I slept badly thanks to the cough and was more or less awake between 3 and 5 and gave up around 6. Went down to the continental breakfast for some pain au chocolat and two cups of hot tea with honey (and I almost never drink hot beverages) before heading up to the "Handling the Interruptive Nature of Operatons" mini-tutorial by Avleen Vig (Facebook) and Carolyn Rowland (NIST).

So what's the big deal? We're usually good at interrupts from co-workers, email, tickets, meetings, phone calls, alerts, users, and so on. Interrupts are the nature of our jobs in operations roles... but we tend to like always having something different all the time.

Multitasking — doing two completely separate things at the same time — is virtually impossible... unless you're using different parts of the brain or the skills are very well learned (e.g., walking and talking, or cooking and watching TV). Concentration on one task makes it harder to switch contexts. Being in the flow (focused, in the zone, whatever) is good, but we still switch tasks every 3 minutes or so. And after an interruption it takes 25 minutes to get back on task, nd 60% of the time we don't return to that original task.

What kinds of interruptions are there and how do we manage them?

Questions — Create a question space for non-emergencies, like a Google Doc or Slack channel; check it first and last thing on the day.

Time bandits — Talk to them, talk to your manager, try to reduce frequency. Teaching them how to search for answers, understand man pages, read source code, etc. Give them a test instance to play with so they don't screw up production.

Pages/alert fatigue — See https://github.com/etsy/opsweekly for auto-analysis. Focus first on alerts that wake you up, then the most frequent/common, and so on. Classify alerts: What needs to page 24x7, vs what needs to let us know in Slack during work hours? For HA systems, remove misbehaving systems from production, automatically resolve the problem, and only alert if a high percentage of the cluser shows the problem.

Calendars, managers (lots of meetings) versus makers (block out large chunks of time) — work together to find the best time to meet. Manage upwards, set expectations of short meetings; allow feedback through other simpler means (emails, chat, IM, weekly status updates, etc.). Do you have to attend or can someone (or noone) else go? Can it be canceled? Can you go first then leave? Make agendas mandatory. What about time blocking (e.g., "No meeting Mondays" or "2-hour blocks for no meetings daily")?

(Note: Some people, especially those with ADHD or ASD, can't handle interrupts as well as others. Mental health isn't talked about [enough] in our community. Avleen notes that lists can help... but keep them smaller. And adjust your work/life balance.)

There are some fallacies (see the slides for more) to dispute.

Not all interruptions are bad.

Calendars can be tools... or crutches.

"I get more done when multitasking" is usually untrue. You're constantly task switching instead.

What else can you do:

Work remotely, though there are some drawbacks (communication can be harder especially without kinesics).

Look at _Getting Things Done_ (Gtd): Do, delegate, or defer.

Find a meeting room or quiet space for some block of time. (A team met in a space but the meeting was canceled. They stayed there to get stuff done and were really productive.)

Timeboxes (cf. sprints).

Take turns (e.g., "interruption rotation").

Take vacations. Without your laptop or work phone.

Ask for help.

Write more documentation.

After the morning break there wasn't anything I really wanted to see so I went to a talk I didn't care about or pay attention to until the 11:30am invited talk "Stories from the Trenches of Government Technology" by Matt Cutts (USDS) and Raquel Romano (DSVA). This was telling stories of how they've managed to make bad situations better. I didn't take detailed notes (but look up "Hack the Pentagon").

There was nothing in the third third of the session block that interested me so I headed to the vendor expo to get some more swag for coworkers and kill time until lunch (on the expo floor again). I think yesterday's menu was better; today we had a bean-n-pasta salad, shaved brussels sprouts with pecans, gnocchi, roasted veggies, turkey with couscous, and halibut with spinach.

There was nothing in the program in the post-lunch block that jumped out at me so I hung out at LISA Labs and caught up on a couple of work tickets (including one from someone trying to reschedule a change from Thu Nov 09 to Mon Nov 06).

After the afternoon break I went to the plenary session "Scaling Talent: Attracting and Retaining a Diverse Workforce." Tameika Reed moderated the four panelists: Derek Arnold. Amy Nguyen, Qianna Patterson, Derek Watford, and Wayne Sutton. We were warned to be comfortable with being uncomfortable.

Topics included:

Pipeline from education to industry (and failures in the education system: who does and doesn't it work for and why, and how should we fix that).

Whether internships are the only way into companies.

Outreach: Pulling HS and college students into (paid?) internships into industry, esp. in underrepresented minorities. Corproations engage with some schools but not many and only the prestigious ones.

Skills development (from HS to summer camp to internship to job), including teahcing and mentoring.

Beating impostor syndrome, promotions, leadership, succession plans, and so on.

(A lot of this was stuff I already knew. Someone joked about there not being a white male on the panel. There often seems to be an overwhelmingly white male (if not cis straight and christian) audience.)

After sessions broke I went to the conference reception on the hotel atrium level. They had crudites... but also a Thai-inspired line of skewers (I had the satay chicken and the shrimp) and a dim sum line (potstickers, baozi, siu mai, har gow, and spring rolls). And mini cupcakes for dessert (I had one each red velvet and chocolate; I skipped the carrot cake). With the head cold I skipped the wine and stuck with Diet Coke. Chatted with friends old and new, and bailed around 7:30pm so I could pop the NyQuil Derek was kind enough to provide and crash.

Friday, November 3

The NyQuil helped some, I think. I slept (except for a few bio-breaks) in until 6:30am or so, so call it 10 hours' worth. Showered, shaved, and headed down to the continental breakfast for some honey-apple coffee cake and more hot tea with honey.

My first session of the final day was "An Internet of Governments: How Policymakers Became Interested in 'Cyber'" by Maarten Van Horenbeeck. Gradually, the internet has become a bigger part of how we socialize, do business, and lead our daily lives. Though they typically do not own much of the infrastructure, governments have taken ever-increasing note, often aspirational, and sometimes with suspicion. In this talk, we covered how governments internationally debate and work on topics of cybersecurity, agree on what the challenges are, and get inspiration on solutions. The talk showed how these concerns often originate from domestic concerns, but then enter several processes in which governments meet, debate, agree, and disagree on their solutions. We learned about initiatives such as the ITU, the UNGGE, the Global Conference on Cyberspace, and the Internet Governance Forum, and how we as engineers can contribute!

That was followed by "Clarifying Zero Trust: The Model, the Philosophy, the Ethos" by Evan Gilman and Doug Barth. The world is changing, though our network security models have had difficulty keeping up. In a time where remote work is regular and cloud mobility is paramount, the perimeter security model is showing its age — badly. We deal with VPN tunnel overhead and management. We spend millions on fault-tolerant perimeter firewalls. We carefully manage all entry and exit points on the network, yet still we see ever-worsening breaches year over year. The Zero Trust model aims to solve these problems.

Zero Trust networks are built with security at the forefront. No packet is trusted without cryptographic signatures. Policy is constructed using software and user identity rather than IP addresses. Physical location and network topology no longer matter. The Zero Trust model is very unique, indeed. In this talk, they discussed the philosophy and origin of the Zero Trust model, why it's needed, and what it brings to the table. It's a follow-on to last year's deep dive talk (which I didn't go to), "Zero Trust Networks: Building Systems in Untrusted Networks." It's all anbout the IPSec, and not just in a server-server or environment-environment context but also a client-server context. (There is no such thing as a "zero-trust network vendor," since so much depends on your environment's needs.)

After the morning break I went to "DevOps in Regulatory Spaces: It's Only 25% What You Thought It Was" by Peter Lega, partly because I've worked (briefly) in regulaed environments in the past. We've embraced the DevOps concepts, found our sympaticos, established a solid technical ecosystem and culture, and even delivered some great early results with a first follower portfolio. Now, we have entered the mission-critical regulatory problem space at scale. The traditional DevOps "goodness" and culture have taken you this far. Now it's time to scale, with a whole new set of regulatory and compliance constituents and technical maturity needs. In this talk, we shared the "first contact" experience with the long established regulatory community as we embarked on delivering larger complex solutions and the challenges and compelling opportunities to transform that have unfolded to enable compliance as code from portfolio through production.

In addition to the regular programmer to DevOps (learning testing and automation) flow, and DevOps interacting with the business, regulatory enviornments add a Quality group to enforce compliance. This was a good introduction to regulaed environments for people who haven't worked there, and the case studies were useful in thinking about how we'd implement regulated devops if we had to.

I stayed in the same room for Damon Edwards' "Failure Happens: Improving Incident Response in Large-Scale Organizations." Deployment is a mostly-solved problem. Yes, there is still work to be done, but the operations community has successfully proven that we can both scale deployment automation and distribute the capability to execute deployments. Now, we have to turn our attention to the next critical constraint: What happens after deployment?

We all know that failure is inevitable and is coming our way at any moment. How do respond quickly and effectively to those failures? What works when there is just a small set of teams or an isolated system to manage will quickly break down when the organization grows in size and complexity. But on the other hand, what has been commonly practiced in large-scale enterprises is proving to be too cumbersome, too silo dependent, and simply too slow for today's business needs.

How do we rapidly respond to incidents and recover complex interdependent systems while working within an equally complex and interdependent organization? How does operations embrace the DevOps and Agile inspired demand for speed and self-service while maintaining quality and control?

His talk examined the trial-and-error lessons learned by some forward-thinking enterprises who are currently streamlining how they:

Resolve incidents.

Reduce friction between teams.

Divide up operational responsibilities.

Improve the quality of their ongoing operations.

See how these companies are rethinking how and where operations happens by applying Lean and DevOps principles mixed with modern tooling practices. In the talk, we:

Dissected examples of operational incidents from inside actual large enterprises.

Identified the common organizational and technical anti-patterns that prevent quick and effective incident resolution and interfere with organizational learning.

Discusssed emerging design patterns and techniques that remove the friction and bottlenecks while empowering teams (highlighting publicly referenceable work shared with the DevOps community).

MTtd should be mean time to DIAGNOSE not DETECT.

MTTR should be mean time to REPAIR (fix) not RESTORE.

After lunch — with a middle-eastern theme today — nothing really jumped out at me. We had a usually-good speaker — David Blank-Edelman — speaking in the middle of the 90-minute block (2:30–3:00pm in the 2:00–3:30pm block), so I went to his talk, "Where's the Kaboom? There Was Supposed to Be an Earth-Shattering Kaboom!"

Let's face it. We are great at building things-systems, services, infrastructures-you name it. But we are terrible, absolutely terrible, at decommissioning, demolishing, or destroying these same things in any sort of principled way. We spend so much time focused on how to construct systems that when it comes time to do the dance of destruction we are at a loss. We are even worse at building systems that will later be easy to destroy.

But it doesn't have to be this way. When they take down a bridge, a building, or even your bathroom before a renovation, things just don't get ripped out willy-nilly (hopefully). There are methods, best practices, and lots of lots of careful work being brought to bear in these situations. There are people who demolish stuff for a living, let's see what we can learn from them to take back to our own practice. Come to this talk not just for the explosions (and oh, yes, there will be explosions), but also to explore an important part of your work that never gets talked about: The kaboom.

After the afternoon break, then I attended Jon Kuroda's closing plenary "System Crash, Plane Crash: Lessons from Commercial Aviation and Other Engineering Fields." Commercial aviation, civil and structural engineering, emergency medicine, and the nuclear power industry all have hard-earned lessons gained over their respective histories, histories that stretch back decades or even centuries. Often acquired at a bloody cost, these experiences led to the development of environments typified by stringent regulation, strict test and design protocols, and demanding training and education requirements-all driven by a need to minimize loss of life.

In stark contrast, the computer industry in general and systems administration specifically have developed in a relatively unrestricted environment, largely free, outside of a few niche fields, from the regulation and external control seen in life-safety critical fields.

However, despite these major differences, these far more demanding environments still have many lessons to offer systems administrators and systems designers and engineers to apply to the design, development, and operation of computing systems.

We looked at incidents ranging from Air France 447 to Three Mile Island and what we can learn from the experiences of those involved both in the incidents and the subsequent investigations. We will draw parallels between our field as a whole and these other less forgiving fields in areas such as Education and Training, Monitoring, Design and Testing, Human Computer/Systems Interaction, Human Performance Factors, Organizational Culture, and Team Building.

The speaker's hope was that we would take away not just a list of object lessons but also a new perspective and lens through which to view the work we do and the environment in which we do it.

I went to dinner at the Ferry Building with Strata Chalup, Philip Kizer, Dan Rich, Mark Roth, Adele Shakal, Steve Vandevender, and David Williamson. We wound up at Oyster House and I had a bowl of mussels and side of fries. After dinner most of us hung out in the Atrium bar (where several people ordered a Karl the Fog: Woodford Reserve, bitttermilk honey whiskey sour mix, dash of orange bitters, and hickory smoke) until the Chairs Dinner ended and we could move upstairs to the super-secret dead dog party. I chatted with folks there and sampled some very good bourbon (thanks Branson!) and mulberry mead (thanks David!) before calling it a night a bit after 10pm.

Got back to the room, packed (and everything fits in the suitcase, though I might need to open the expand-o-zipper on it), and prepped for leaving tomorrow morning.

Saturday, November 4

Travel day! Finished packing, did the way overpried breakfast buffet at the hotel (seriously, $30?), and caught the shuttle to the airport with Philip. Got through check-in and security with minimal waits, though the flight was already delayed 7 minutes when I got to the airport. Other than that, boarding was fine and the flight was uneventful except for a little bit of so-called bumpy air (especially on approach and landing). I did finally watch Wonder Woman and the tail end of Iron Man 3, though.

Back to my conference reports page

Back to my professional organizations page

Back to my work page

Back to my home page

Last update Feb01/20 by Josh Simon (<jss@clock.org>).