Conference Report: 2014 LISA

The following document is intended as my general trip report for the 28th Systems Administration Conference (LISA 2014) in Seattle, WA, November 9–14, 2014. It is going to a variety of audiences, so feel free to skip the parts that don't concern you.


Saturday, November 8

Today was my travel day. Despite having the alarm set for a much more rational 6am, my body decided to wake up at 4am. After a while I gave up tossing and turning, got up, showered, packed up the CPAP and remaining toiletries, had a quick snack, checked email (no crises overnight, yay), packed up the laptop, and headed out. Traffic was unsurprisingly light and moving at or above posted speeds. Got to the airport without any difficulties, parked in my usual spot in the Big Blue Deck, hiked down to the interterminal shuttle, got to my terminal, checked my bag, and got through Security without major incident. (Getting to jump the line at Security was very nice, I must admit, though the line itself was reasonably short.) When I got to my gate I checked the WiFi situation. Apparently DTW now has "free" WiFi if you watch a sponsor's video. I figured I could mute the laptop and ignore it while it played so I agreed... and promptly got 24 hours of access without having to watch it. (I'm assuming either AdBlock or ClickToFlash took care of it.) Boarded without incident, but we were delayed out of the gate for about 15 minutes due to a catering issue. (Apparently we had ice but no beverages for the economy cabin.)

Another advantage of flying first class is that they served breakfast... and all things considered it didn't suck. They'd run out of Cheerios by the time they got to me, but I would've chosen the cheese omelette anyhow. I was pleasantly surprised it wasn't too dry, it was hot and the cheese was melty and the breakfast potatoes and turkey sausage that came with it were also tasty. The tray included a fruit salad (honeydew melon, blueberries, orange slice, and strawberry) and a sesame bagel, with butter, cream cheese, and apricot preserves as spreadable choices.

During the flight I read through my five-month backlog of magazines. Unfortunately, that only lasted about half the trip. We were late getting into Seattle — between the catering delay, the 120 MPH headwinds, and the delay in getting the jetbridge moved planeside, we were 39 minutes late in arriving. My bag came out reasonably quickly and apparently undamaged. I managed to find my shuttle service and, after a wait to fill up (I was the third passenger to board), was the second stop downtown. Checked in and got a delightful corner room overlooking 7th and Union. Unpacked and then, when I couldn't find anyone to join me, went to lunch solo; I wound up at Rock Bottom Brewery for a bacon and chicken sandwich with barbecue sauce and cheddar cheese, plus a side of fries. That held me over until a late dinner, but more on that later.

After lunch I hung out in the lobby spaces with several people (including, but not limited to, Cory, Derek, Patrick, Ryo, and Travis) until a bit before Registration (really Badge Pickup) opened at 5pm. They had two lines: one for the organizing committee, speakers, and those willing to spring for the more expensive Passport options, and the other for the regular folk. I was first in line in the latter, since everyone else I was with was either organizing, speaking, or passporting. After getting my registration packet (no big surprises, though the "scavenger hunt" cards aren't there this year), I dumped the stuff back in my room and then went to the welcome reception or "newbie BOF" to be visible to those here for their first LISA conference. That turned out to be a good idea, since one of the people here for his first LISA was a local I've known electronically for ages but who hadn't made it to a conference yet.

Around 7 or so a group of us — Chris, Mark, Mike, Nicole, Tom, and Travis — went out to dinner. The concierge had suggested Relish, the upscale burger joint at the Westin which was a few blocks away. The food was good, the company better, but the service was unfortunately slow. We wound up waiting 15 minutes for a table that had been clean, vacant, and ready for guests when we got there, and once we'd ordered it took longer than expected for the food to arrive (though it did arrive correctly and completely at the proper temperature and at the same time, modulo it having to be on two trays). I suspect they were throttling seating based on the ability of the kitchen to deliver.

Got back to the hotel a bit before 9pm and, since biologically that was midnight for me, I headed up to my room to crash.


Sunday, November 9

Today was my "weekend free" day; I had nothing scheduled for the bulk of the day. Therefore my body insisted on waking up at 2am local, 3:30am local, and 5am local, whereupon I gave up, got up, checked my email, and wrote more of this trip report.

Lazed around in a hot bath before showering, shaving, and otherwise getting ready for the day. Got downstairs to the registration lobby to pick up my one-off ATW t-shirt (it's the conference shirt in a deep purple instead of gray, and with "ATW 1995–2014" on the side). Hallway tracked the bulk of the day; most of the morning was with Derek and Nomad, and most of the early afternoon with Kyrre and Tim. Thanks to the rainy weather I ate lunch in the hotel (with Adele, Brian, Derek, Mark, Mike, Nomad, and Thomas); despite the weather a small group of us (Adam, Mario, Mark, Travis, and I) went out via taxi (due to the rain) to Grill from Ipanema for meat on swords (AKA churrascaria or rodizio). They had 17 skewers (16 meat); in order of arrival:

  • Buffalo
  • Leg of lamb
  • Pork sausage
  • Cinnamon pineapple
  • Garlic steak
  • Pork ribs
  • Tri-tip
  • Bacon wrapped turkey
  • Bacon wrapped beef
  • Passion fruit chicken
  • Ribeye steak
  • Sirloin (picanha)
  • Pepper steak
  • Shrimp
  • Filet mignon
  • Cheese steak (with provolone)
  • Beef rib

We walked back, as the rain had stopped, and got to the hotel just after 8:30pm. Several folks adjourned to the bar; I swung by Board Game Night. Wound up spectating a too-big Cards Against Humanity game (12 players is too many, especially when the room is loud and most of them are drinking). Had a bit of a drinking problem (alcohol abuse, really), as I managed to spill a couple of shots' worth of Maker's Mark bourbon onto myself. I eventually headed up to the room a bit after 10pm and crashed.


Monday, November 10

Today was a mostly free day, so I wound up heading offsite to be a bit of a tourist in the morning. (I didn't do a lot of touristy things when LISA was in Seattle in 1999, though I did when motss.con.xxiv was in 2011.)

I did a very little bit of Hallway Tracking, skipped lunch, and had dinner with Paul and Travis; we intended to go for sushi but both our first and second choice restaurants had 90+ minute wait times so we fell back to a nearby pub. (The mussels steamed in white wine, garlic, and herbs were very tasty.)

Very few BOFs were scheduled for this evening so I set up the ATW notes in advance of the Tuesday workshop, and schmoozed with the folks in the hotel bar (and had a too-expensive dessert), before crashing.


Tuesday, November 11

Tuesday morning I swung by the conference floor to schmooze before the sessions started. Caught up very briefly with AEleen and Frank. Spent time while most were in sessions doing Lobby Track — it's like the Hallway Track, but with electricity in the main hotel lobby.

Tuesday's sessions included the 20th annual and final Advanced Topics Workshop; once again, Adam Moskowitz was our host, moderator, and referee.

[... The rest of the ATW writeup has been redacted; please check my web site for details if you care ...]

For dinner tonight I went with the 0xdeadbeef crowd — Adam, Bill, Branson, Dan, Doug, and Mario — to Sullivan's Steakhouse, just across Union Street from the hotel. I started with a French onion soup, segued into the main course of a bone-in dry-aged 26-oz ribeye with mashed potatoes and asparagus, and finished with bananas Foster bread pudding, with a couple of red wines alongside the starter and steak and a moscato with dessert. My food and the service were both excellent, though one of our party got a steak that was more gristle than edible, which was sad.

After dinner, which ran long enough I missed the GLBTUVWXYZ BOF, I went to a private party for Cards Against Humanity. I was the last player to get a point, but I had 4 when the game ended around 11pm. The host had to get some work done so we all bailed then, and I went to my room to write up the trip report, update the budget tracking sheets, and go to bed.


Wednesday, November 12

Program chair Nicole Forsgren began with the usual announcements. This is the 28th LISA. The program committee accepted 9 papers from 33 submissions (about half from students), and as of shortly before the keynote we had 1,100 attendees.

LISA underwent a major change this year in how it's organized. Special thanks went to the conference organizers (especially the curators and chairs), the USENIX staff and board, sponsors, exhibitors, vendors, BOF hosts, the LISA Build network crew, the authors, reviewers, attendees, employers, and so on. She gave the usual housekeeping information (speakers should meet with their session chairs 15 minutes before they go on stage, BOFs are in the evenings, and the reception is at the museum). LISA Labs returned; it's a hands-on hacking space where you can experiment with new stuff. Go to the expo floor (for snacks and lunch too). Go to the poster session (tonight at 6:30pm).

She asked people to please fill their surveys out after tutorials, mini-tutorials, and the conference itself. These are especially important for the mini-tutorials which are new and an experiment.

LISA 2015 (the 29th) will be November 8–13 in Washington DC and Cory Lueninghoener and Amy Rich will be the program chairs.

This year's awards:

Today's keynote speaker was Ken Patchett. We've all seen the impact that open source has had on innovation in software; open sharing and collaboration have been at the root of some of our greatest achievements as an industry. Similarly, the Open Compute Project, a prominent industry initiative focused on driving greater openness and collaboration in infrastructure technology, has cultivated a community working together to establish common standards for scalable and highly efficient technologies that everyone can adopt and build upon, from the bottom of the hardware stack to the top. He provided a brief history of the project and an overview of current technologies, and discussed how open compute platforms are shaping the future of computing. He shared learnings from his work as Facebook's director of data center operations in the Western region and highlighted how open source is changing the data center.

Scale is a big problem so they founded the Open Compute Project. Their scale is 864 million daily active users (average Sep 2014), 703 million on mobile, 82.2% of them are outside of the US and Canada. Complexity creates waste. Resiliency is better than redundancy. Good enough is not enough. They looked at the datacenter as an ecosystem, in a vertical stack (e.g., software, server, and data center).

No single server should be so important that your application suffers because it's down. They used to have a 10% failure rate; now they're at a 0.6% rate. They've increased energy efficiency 38% and reduced cost by 24%.

How many technicians per server do you have? MSFT back in the day had 1 tech for 500 servers and it ran him ragged. Best now is 1 to 20,000. They do this via monitoring and serious tool design.

Their Power Usage Effectiveness (PUE) was well over 2. They got down to 1.9 PUE with a standard heat/cool cycle. Going to outside air and hydro power they're closer to 1.07 PUE. By not stepping down from 480V to 208V and by avoiding unnecessary UPSes, they went from 21–27% power loss per server down to 7.5%... and they're still getting six 9s of reliability. See OpenCompute.org for more.
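
For reference, PUE is just the ratio of total facility power to the power delivered to the IT equipment. A quick sketch with illustrative numbers (the 1.07 ratio is from the talk; the kilowatt values are made up):

  $itLoadKw   = 1000                  # illustrative IT load: servers, storage, network
  $facilityKw = 1070                  # illustrative total draw, including cooling and power distribution
  $pue = $facilityKw / $itLoadKw
  "PUE = {0:N2}" -f $pue              # 1.07, roughly the outside-air/hydro figure cited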

After the break I went to the invited talks. First up was George Beech, SRE at Stack Exchange. Stack Exchange is nuts about performance, and very proud of what they are able to do with a Microsoft-based stack at scale. This talk went over the architecture that is used at Stack Exchange to serve millions of users on a WISC stack. He went over their general architecture, their tooling, and what they use to make the site run fast.

They have 560M page views a month, transferring 34TB of data, with 1665 rps (2250 peak) across the web farm. Their first priority is performance. They've written tools over time to help identify bottlenecks so they know where to focus. One example is moving a bunch of iptables rules into a sed command, reducing CPU usage from 100% of a core down to 10% (20% at peak).

See also their tag engine (custom; they outgrew MSSQL's full-text search) and Elastic Search (203 GB index). (They want to integrate logstash and kibana....) They moved from Mercurial to git.

http://stackexchange.com/performance is a publicly-viewable page.

Next up was Tom Limoncelli, also at Stack Exchange, wrangling the stuff that George just talked about. He highlighted some of the most radical ideas from his new book, The Practice of Cloud System Administration. Topics included: Most people use load balancers wrong, you should randomly power off machines, cloud computing will eventually be so inexpensive you won't be able to justify running your own hardware, the most highly reliable systems are built on cheap hardware that breaks a lot, and sysadmins should never say no to installing new releases from developers.

"Cloud" is an overloaded markwting term; for sysadmins, it's really just a new buzzword for "distributed computing." Distributed computing can do more work than any one single computer: More storage, computing power, memory, and throughput. However, more computers lead to more problems: Bigger risks, more visible failures, mandatory automation, and cost containment becomes critical.

Make peace with failure: parts, networks, systems, code, and people are all imperfect. We have to build resilient systems; therefore, we need to fail better. Remember that even things that don't fail need to be taken down for certain maintenance. So the recommendations to take away are:

Wait, what?

Use cheaper, less reliable hardware. Rental cars are an example, with the "buy extra insurance" concept. People generally buy high-end servers with RAID and dual power supplies, put them on UPSes, get the gold maintenance package with 4-hour response time, then put 5 of them behind a load balancer, then rewrite the apps to handle being distributed, and then add a second load balancer for redundancy.... We're designing resiliency by spending money at all the different levels. If instead the resiliency is done through software, it's costly to develop but free to deploy, as opposed to hardware, which costs on every purchase. We can then focus on efficiency since it's resilient. That efficiency comes from starting with an SLA and buying just enough hardware to meet it.
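
A back-of-the-envelope sketch of why the cheap-plus-software approach works out, assuming independent failures and a made-up per-server availability (all numbers are purely illustrative):

  $perServer = 0.99                                     # assume each cheap server is up 99% of the time
  foreach ($n in 1..4) {
      $combined = 1 - [math]::Pow(1 - $perServer, $n)   # up if at least one of $n servers is up
      "{0} server(s) behind the balancer: {1:P4}" -f $n, $combined
  }
  # Two such servers already give roughly 99.99%; the resiliency lives in the software
  # (load balancing and distribution), not in any single expensive box.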

If a process or procedure is risky, do it a lot. Risky behavior is different from a risky procedure. Behavior has an inherent risk (e.g., smoking, shooting yourself in the foot, and blindfolded chainsaw juggling). Procedures can be improved through practice: software upgrades, database failovers, and hardware hot-swaps can be risky, but they become less so with practice. They did a DR drill as a test: it took over 10 hours, required hands-on involvement from 3 teams, identified 30+ bug fixes for code behavior and docs, and identified single points of failure. They did it more often and it went from (30 bugs, 10 hours) to (20, 5), (12, 2), and now (5, 1). The drills stress the system, identify areas of improvement, give people experience, and build confidence.

Don't punish people for outages. They'll always happen — everything can fail, from components to people — so getting angry about them is equivalent to expecting them to never happen, which is irrational. Since we embrace failure, aim for uptime goals with a tolerance (e.g., 99.9 ± 0.05%, not 100%). Anticipate outages with resiliency, and drill for practice. This encourages transparency and communication and addresses problems while they're small, before they become big. Incidentally, there's no more "root cause," just "contributing factors." Culturally, people need to analyze the outage in a post-mortem (or after-action), identify what happened, how, and what can be done to prevent similar problems in the future, and then publish it widely: take responsibility (not blame), including responsibility for implementing long-term fixes and for educating other teams so they can learn from it.

As a reminder, we run services, not servers. A server isn't useful until it's a service (which is powered up, running, and useful). Healthy services run themselves.

After the lunch break I continued in the Invited Talks, this time with a pair from Google. The first speaker was Dan Klein (SRE), on Making "Push on Green" a Reality: Issues & Actions Involved in Maintaining a Production Service. Despite encouragement he didn't give the talk as an interpretive dance (though he did do so briefly to kill time while the AV team enabled the audience microphones). His talk basically asked what you need to do before you can get to "push on green." His assumptions for this talk:

What's "push on green?" When a human, build system and test suite, or other test says it's okay, do it." It's complicated by reality. How do you avoid roll out pain with a new or modified service (or API, library, or whatever)? In summary:

It needs a cross-cultural mindset across all of these.

So how do you do a safe rollout?

  1. Silence (some) alerts.
  2. Update canary jobs.
  3. Run smoke tests.
  4. Let canaries "soak" (run for a while: might require some number of iterations such as loading the disk cache).
  5. Push remaining jobs.
  6. Run smoke tests again.
  7. Unsilence alerts.
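
A minimal sketch of that sequence as a driver function; the actual alerting, job-control, and test steps are site-specific, so they're passed in as script blocks here, and every name is hypothetical:

  function Invoke-SafeRollout {
      param(
          [scriptblock]$SilenceAlerts,
          [scriptblock]$UpdateCanaries,
          [scriptblock]$SmokeTest,
          [int]$SoakMinutes,
          [scriptblock]$UpdateRemaining,
          [scriptblock]$UnsilenceAlerts
      )
      & $SilenceAlerts                                                    # 1. silence (some) alerts
      & $UpdateCanaries                                                   # 2. update canary jobs
      if (-not (& $SmokeTest)) { throw 'Canary smoke tests failed.' }     # 3. run smoke tests
      Start-Sleep -Seconds ($SoakMinutes * 60)                            # 4. let the canaries soak
      & $UpdateRemaining                                                  # 5. push remaining jobs
      if (-not (& $SmokeTest)) { throw 'Post-push smoke tests failed.' }  # 6. run smoke tests again
      & $UnsilenceAlerts                                                  # 7. unsilence alerts
  }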

What about making the config changes? You can restart jobs with new runtime flags, or HUP the job to reread its config file. The latter is faster but riskier.

At Google, in about 84 weeks, they went from ~5 rollouts per week towards 60 (peak 75) and freed up an FTE engineer.

What process improvements?

More frequent, smaller rollouts help developers. They don't have to wait for the "weekly build."

What's in the future?

This is an evolutionary process. Things will go wrong — things break. Adjust attitudes: find reasons, not blame, and fix the process. Let humans be smart and machines be repeatable. Have a Big Red Button so a human can stop things if needed.

You don't need to be at Google-scale to do this. There are no silver bullets; sorry. It's a laborious process with lots of baby steps. You have to be careful and not take shortcuts. But keep going.

The second speaker in this block was Dinah McNutt (Release Engineer), on Distributing Software in a Massively Parallel Environment. She's been coming here since LISA IV (1990) and chaired LISA VIII (1994), and while she used to be a sysadmin she's now a release engineer. One of her passions is packaging; she's fascinated by different package managers and she'll be talking about Google's.

The problem is that with very large networks, it may take a long time to distribute things; there are bottlenecks (network, disk, CPU, and memory); a machine may be offline; networks might be partitioned ("you can't get there from here"); and there may even be concurrent writers.

Theirs is called the Midas Package Manager (MPM). Package metadata (more below) is stored in their Bigtable database; package data is stored in their Colossus File System and replicated. The transport mechanism is custom peer-to-peer, based on torrent.

Note: This talk is not applicable to Android. It's just for the Linux systems.

MPM characteristics:

Case study:

Sounds easy, right? Well....

Terminology: Job is a process running in a container. There can be more than 1 container on a single machine.

At package creation, the build system creates a package definition file, which includes the file list, ownership and permissions, and pre- and post-install and remove commands, all generated automatically. Then it runs the build command. Labels and signatures can be applied at any point.

If files going into the package aren't changed, a new package isn't created; the label or signature is just applied to the existing (unchanged) package.

The metadata can be both immutable (who, when, and how it was built; the list of files, attributes, and checksums; some labels, especially those with equals signs; the version ID) and mutable (labels without equals signs, and the cleanup policy).

Durability depends on environment:

Distribution is via pull. Advantages are that it avoids network congestion (only fetched when needed) and lets the job owners decide when to accept new versions (e.g., "wait until idle"). Drawbacks include that job owners can decide when to accept new versions, there has to be extra logic in the job to check for new versions or the ability to restart jobs easily, and it can be difficult to tell who's going to be using a specific version.

Package metadata is pushed to Bigtable (which is replicated) immediately. Root servers read and cache data from their local Bigtable replica. MPM queries the local root server. Failover logic is in the client (and if the requests fail they're redirected to another Bigtable replica).

Package data is in the Colossus file system, scattered geographically. It's a 2-tiered architecture; frequently used packages are cached "nearby" (closer to the job). The fetch is via a torrent-like protocol and the data is stored locally, so as long as it's in use you don't need to talk to Bigtable or Colossus. There's only one copy on the machine no matter how many jobs on that machine use it. They have millions of fetches and petabytes of data moving daily.

Security is controlled via ACLs. Package name space is hierarchical, like storage/client, storage/client/config, and storage/server. ACLs are inherited (or not). There're 3 levels of access:

Individual files can be encrypted within a package, and ACLs define who can decrypt the files (MPM can't). En- and decryption are performed locally and automatically. That allows there to be passwords that aren't ever stored unencrypted.

Packages can be signed at build time or later. Secure key escrow uses the package name and metadata, so a package can be verified using the name and signer.

Back to the case study, we can now see a bit more under the hood.

Why love MPM? There's an mpmdiff that can compare any two packages regardless of name (looking at things like file owner, file mode, file size, file checksums, and the pre and post scripts).

Labels are great. You can fetch packages using labels. You can use them to indicate where the package is in the release process (dev, canary, or production). You can promote a package by moving labels from one package to another, though some labels (those with equals signs) are immutable and can't be moved. Some labels are special ("latest," which shouldn't be used because it bypasses the canary). They can assist in rollbacks (like "last_known_good" or "rollback" to label the current MPM while promoting the new one).
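
A toy, in-memory illustration of that promotion-and-rollback bookkeeping (this is not the real MPM interface, and the package names are made up):

  $labels = @{
      'storage/server-0042' = [System.Collections.Generic.List[string]]@('production')
      'storage/server-0043' = [System.Collections.Generic.List[string]]@('canary')
  }

  function Move-Label([string]$Label, [string]$From, [string]$To) {
      if ($Label -like '*=*') { throw "Labels containing '=' are immutable and can't be moved." }
      [void]$labels[$From].Remove($Label)
      [void]$labels[$From].Add('last_known_good')   # keep a rollback pointer on the old package
      [void]$labels[$To].Add($Label)
  }

  Move-Label -Label 'production' -From 'storage/server-0042' -To 'storage/server-0043'
  $labels   # 'last_known_good' now on -0042, 'production' on -0043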

There's a concept of file groups: a grouping of binaries within an MPM. Binaries can belong to more than one group. Common practice is to store both stripped and unstripped binaries in the same MPM but in different file groups, to ensure the stripped and unstripped binaries match when troubleshooting problems.

There's a web interface to browse all MPMs and show the metadata. It also shows graphs by size (so you can see how file groups change over time).

After the break I stayed in the Invited Talks track. The first talk was "LISA Build: Mind. Blown." It started as a talk about LISA Build: what it was, what they did, and so on. Back in 2000 we could replace the hotel's 2 APs with 30. Nowadays, wifi can be a revenue stream for the hotel, so asking them to shut things down for a week is a harder negotiation.

So what happens? 3–4 months before the conference you figure out the physical plant, access points, and so on in the hotel itself. This hotel has been great; we've had full access to the network closets and MDF and so on. This crew was great; they knew what they were doing. The Xirrus arrays were a big win; use 5GHz or 2.4GHz as appropriate. We negotiated with the hotel for 40Mbps (and only exceeded it when testing traffic shaping). What they did: tc, firewall builder, 2 Linksys switches (with an IE7-on-XP-SP0 box to manage them), 3 Xirrus arrays, 1 Cisco switch, and 3 servers, running cacti, kibana, nagios, ntpd, dhcp, ntop, dns, and rsyslog.

The bulk of the session was a panel of the folks who did it.

What was your favorite part? Meeting new people and working with friends. You should try this some time; it was fun. Another was most interested in deploying a network without money; they scrounged what they could. Meeting people gives you the opportunity to have something to talk about with them. Doing something new (e.g., a systems guy doing networking) was interesting too. Knowing how much goes into the physical plant aspects and building it from the ground up. The positive attitude was great ("You haven't done this? Let's try!" and not "I won't try").

They started with bins of cables, bins of power supplies, and one server with no memory and another with 2x4GB and 2x1GB of memory. The concept is designed to start with nothing. They try different OSes and different configurations. Some machines were so old they can't boot off a USB stick, so how do you set up BOOTP any more?

Another aspect is learning how to deal with stress.

How does someone become part of this next year? Click the LISA Build box on the registration form.

Several touched on this during intros, but how many have done networking before? Not many.

Another design goal is to do the least about what you know so you learn new stuff.

Someone has done a couple of offsite builds (running events for fun) and is tasked with getting computers. What do you keep lying around Just In Case? What will we hang onto? Not the Linksys switches. A conference like this or IETF can get a sponsor to supply equipment or money. Hotels are getting better at it. Smaller conferences need a concept of plant, power supplies, cabling, management infrastructure, power tester, and cable tester. Branson brought a Mac Mini with VMs on it, and a Cisco switch they'll leave behind.

In terms of the setup, with two or more APs, how do you configure the networks? They NAT out the Xirruses via Ethernet to the outside world, but there's no wifi bridge.

Assume the hotel will have cat5/cat6 over 100m.

What was and wasn't planned? On Sunday morning nothing was planned, and Brett came in from Nepal. First they figured out what hardware they had and what OSes they could put on the machines, and then what they could do with what they had. Frequently there's no idea what'll be on-site; they keep a Google Sheet with "what we have" and "what we need," but even assumptions (like "a server has some memory") can be false. They did have an idea of the VLANs and networks. Tomorrow we'll be running a VOIP call (P2P, without Skype) from the White House.

How much testing did you have to do to find the dead spots? None; we let the users do it. They watch IRC and let people share feedback there. In the past they've used AirSnort or a Fluke tool to check for dead spots; for us, we're in a long room with permeable airwalls.

What kind of traffic shaping are we doing? Nothing too fancy; we have a 40 Mbps allotment from the hotel that's not really being enforced (if we break 40 the price goes up exponentially, so we're capped at 38).

How are you preventing cross-talk on a large /16? The easy answer is "We're not." There aren't that many devices; the /16 is mostly empty with 1,200 devices. We're not filtering (but please don't surf porn).

Across the board, what would you do differently next year? Drive with a trunk full of equipment. Not answer questions. Bring an XP install image. Focus on new technologies to work on. Bring some equipment and images. Not going to DC next year (x2).

Most of the audience would be willing to do this next year.

Remember Build is all week long, not just in the morning.

I had to skip out of the second half of the session because I'd misplaced my phone and had to hunt it down. I did manage that successfully, at least.

After the sessions wrapped I ran out to dinner at Blue C Sushi with David, Susan, and Kalita. It's a conveyor sushi place, and I wound up eating enough to be full... for more than I wanted to spend, but it was good. Got back in time to catch the tail end of the poster sessions.

Ran up to the room to drop off the jacket and sweatshirt, then caught up on work email and did some trip report writing before heading back down to the BOFs. I did swing by the (originally quiet) scotch BOF and have a few sips of a couple of the libations. A bit after 10pm I started to head out, but got caught up in a couple of conversations. I did manage to get to bed by 11pm and was hoping I'd sleep in a bit later than 5am.


Thursday, November 13

I made the mistake of not shutting my phone's audio off overnight, so I was awakened by a text from my mother. It wasn't urgent so I silenced the phone and went right back to bed... and slept in until 7am. (Finally.)

Grabbed a banana and a banana nut muffin for breakfast and grabbed a seat up front (ability to see the screen, see the speaker, and plug the laptop into a power strip) for the Thursday keynote: Gene Kim, "Why Everyone Needs DevOps Now: A Fifteen Year Study of High Performing IT Organizations."

Organizations employing DevOps practices such as Google, Amazon, Facebook, Etsy, and Twitter are routinely deploying code into production hundreds, or even thousands, of times per day, while providing world-class availability, reliability and security. In contrast, most organizations struggle to do releases more than every nine months.

The authors of the upcoming DevOps Cookbook have been studying high-performing organizations since 1999, capturing and codifying how they achieve this fast flow of work through Product Management and Development, through QA and Infosec, and into IT Operations. By doing so, other organizations can now replicate the extraordinary culture and outcomes, enabling them to scale and win in the marketplace.

Why is DevOps important?

Act 1 begins in operations. Fragile artifacts are the bane of Ops' existence. Where are the most fragile artifacts (code, infrastructure, and so on)? Either in the most critical business operations (like the revenue stream) or the most critical project.

Act 2 is the developers. They can release things untested or incomplete, which can become or lead to fragile artifacts. Technical debt gets worse over time and there's almost never time to fix it.

Cynicism and despair and hopelessness are difficult at best.

What patterns do they see?

DevOps spans the boundaries: Dev who think like ops and ops who think like dev. Someone wants it to be *Ops (something like .*ops or ^(.+)Ops$).

He had 3 aha! moments:

Many places have adopted DevOps practices, across all industries (com gov edu mil). His project the past 3 years was to benchmark 9600 organizations to identify high performance.

High performers are more agile — 30x more frequent deployments and 8000x faster lead times than their peers — and more reliable — 2x the change success rate and 12x faster MTTR. (This requires smaller, more frequent deployments.) High performers win in the marketplace, too; they're 2x as likely to exceed profitability, market share, and productivity goals, with 50% higher market cap growth over 3 years. Example: SAP went from 9 months to 1 week for code going from dev to full prod.

Consider allocating 20% of cycles to reducing your technical debt.

What's the opportunity cost of wasted IT spending? 2.6 trillion (USD) per year.

After the morning break I went to the mini-tutorial, "Building PowerShell Commands" by Steven Murawski of Chef. Mini-tutorials are new this year; they're 90-minute sessions that are a deep and fast dive into a single topic. I was one of three in the room who hadn't previously used PowerShell. We covered:

This was all for PowerShell v2 and later, when advanced functions were introduced.

Phases of the pipeline:

Pipelines are streams of objects, not text streams between commands as in a Unix shell. All functions' BEGIN blocks run first, then the processing happens, and finally the END blocks run. There's no guaranteed order (e.g., when something is asynchronous); you should consider ordering explicitly if it matters for the output.
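
A quick demonstration of that ordering, using two trivial pipeline functions of my own:

  function Stage1 {
      param([Parameter(ValueFromPipeline = $true)]$InputObject)
      begin   { Write-Host 'Stage1 begin' }
      process { Write-Host "Stage1 process $InputObject"; $InputObject }   # pass the object along
      end     { Write-Host 'Stage1 end' }
  }
  function Stage2 {
      param([Parameter(ValueFromPipeline = $true)]$InputObject)
      begin   { Write-Host 'Stage2 begin' }
      process { Write-Host "Stage2 process $InputObject" }
      end     { Write-Host 'Stage2 end' }
  }

  1..2 | Stage1 | Stage2
  # Both begin blocks print before any process output, and both end blocks print last.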

Note that the output of functions isn't necessarily what they return; they can also generate output — so you can see your results twice... in any phase of the process. In PowerShell 2 through 4, you have to be careful about what you explicitly emit to avoid seeing duplicate results. PowerShell 5 adds classes, and methods therein only output what you explicitly return.
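
In miniature, that "see your results twice" behavior looks like this:

  function Get-Numbers {
      $values = 1..3
      $values          # anything not captured is emitted to the output stream...
      return $values   # ...and 'return' emits the array again before exiting
  }

  (Get-Numbers).Count   # 6, not 3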

So what are parameters? They can be mandatory, positional, allow pipeline binding, have parameter set affinity, hint at output types (v4 and later), and support validation.

Tab completion is your friend, especially at the command line. Adding [parameter()] to any parameter in the basic function turns it into an advanced function and therefore allows it to use common parameters as well.
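
A minimal illustration: the [Parameter()] attribute below is all it takes to make this an advanced function, which is why the common -Verbose parameter works on it (the function itself is made up):

  function Get-Greeting {
      param(
          [Parameter()]
          [string]$Name = 'world'
      )
      Write-Verbose "Building a greeting for $Name"
      "Hello, $Name!"
  }

  Get-Greeting -Name 'LISA' -Verbose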

PowerShell has several streams — output (like stdout), verbose, debug, error (which also populates the automatic $Error variable), warning, and so on. Note that debug and verbose are different streams. Toggling the debug parameter sets the action preference to Inquire, which lets you break, continue, or kill the command.
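
A quick tour of those streams (the function name is made up; the Write-* cmdlets are standard):

  function Test-Streams {
      [CmdletBinding()]
      param()
      'output stream (what callers receive)'
      Write-Verbose 'verbose stream (shown when -Verbose is passed)'
      Write-Debug   'debug stream (-Debug prompts in Windows PowerShell, since the preference becomes Inquire)'
      Write-Warning 'warning stream'
      Write-Error   'error stream (also recorded in the automatic $Error variable)'
  }

  Test-Streams -Verbose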

PowerShell is dynamically typed by default; if we care about typing and want to validate, we can cast a type (e.g., [string] to force the parameter to be a string, [int] for an integer, and so on). This is covered in PowerShell's documentation, and most .NET types work here too.

You can use [parameter( mandatory=$true )] (v2); v3 and later can omit the =$true part. This makes the default value (as defined in the script) no longer used. If the type is [string[]] it's an array, and you can specify multiple values for a given parameter.

How do you validate the input (whether via pipeline, option, or ...)? You can omit mandatory but specify something like [validatenotnullorempty()]. There are a lot of validation attributes.
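
Putting those pieces together in one hypothetical function: a typed, mandatory, validated array parameter plus a range-validated integer with a default:

  function Get-ServerInfo {
      param(
          [Parameter(Mandatory = $true)]          # v3 and later can write just [Parameter(Mandatory)]
          [ValidateNotNullOrEmpty()]
          [string[]]$ComputerName,                # [string[]] accepts multiple values

          [ValidateRange(1, 65535)]
          [int]$Port = 5985
      )
      foreach ($computer in $ComputerName) {
          "Would query $computer on port $Port"
      }
  }

  Get-ServerInfo -ComputerName 'web01','web02' -Port 443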

How parameters are passed (on command line or on pipeline input) affects in what block they're available.

Using $ScriptBlock parameters lets you tweak things as they move down the pipeline. $_ represents the current object.

ValueFromPipeline is cheaper than ValueFromPipelineByPropertyName.
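
A sketch combining a script-block parameter with ValueFromPipeline (Invoke-PerItem is a made-up name; inside the block, $_ is the current object):

  function Invoke-PerItem {
      param(
          [Parameter(Mandatory = $true, Position = 0)]
          [scriptblock]$Action,

          [Parameter(ValueFromPipeline = $true)]
          $InputObject
      )
      process {
          ForEach-Object -InputObject $InputObject -Process $Action   # runs $Action with $_ = the current object
      }
  }

  1..3 | Invoke-PerItem { $_ * 10 }   # 10, 20, 30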

What about positionality? Declare the position with position = 1 inside the [parameter( ... )] attribute in the param() declaration. The number order is what matters; they don't have to be sequential. All it means is that the lowest number gets bound first. (It looks like the value doesn't need to be an integer, either.)
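
In brief, with made-up names — note the Position numbers are 1 and 5, yet binding still happens in numeric order:

  function Copy-Thing {
      param(
          [Parameter(Position = 1)][string]$Source,
          [Parameter(Position = 5)][string]$Destination
      )
      "Would copy '$Source' to '$Destination'"
  }

  Copy-Thing 'notes.txt' 'notes.bak'   # binds Source first, then Destination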

trace-command will show you what the command is doing, step by step, for parameter binding and type forcing.

what-if and confirm aren't often used in get functions but should be in set functions: Put this above the param() declaration:

[cmdletbinding(SupportsShouldProcess = $true)]
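
A small set-style example of how that attribute gets used together with $PSCmdlet.ShouldProcess (the function and its parameters are hypothetical):

  function Set-WidgetState {
      [CmdletBinding(SupportsShouldProcess = $true)]
      param(
          [Parameter(Mandatory = $true)][string]$Name,
          [Parameter(Mandatory = $true)][string]$State
      )
      if ($PSCmdlet.ShouldProcess($Name, "Set state to '$State'")) {
          # ... the actual change would happen here ...
          Write-Verbose "Set $Name to $State"
      }
  }

  Set-WidgetState -Name 'widget01' -State 'enabled' -WhatIf   # reports what it would do, changes nothing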

Use custom objects to test bindings.

Comment-based help allows man-page type sections, and ISE provides templates. SYNOPSIS and DESCRIPTION sections are mandatory; other sections are optional.
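
A skeleton of what that comment-based help looks like in practice (the function is a made-up example):

  function Get-Example {
      <#
      .SYNOPSIS
      One-line summary of what the function does.

      .DESCRIPTION
      Longer description, shown by Get-Help Get-Example -Full.

      .EXAMPLE
      Get-Example -Name 'demo'
      #>
      param([string]$Name)
      "Example for $Name"
  }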

In the session immediately after the lunch break there wasn't a lot I wanted to see. That was just as well, since there was a severe outage at work: Seems that in order to remediate against a major Microsoft bug they patched the Active Directory domain controllers, which broke secure LDAP, so nobody can log into the web content management system (it's failing the TLS handshake). Of course, I didn't find out about it until almost 5pm Eastern, when everyone there goes home, but I did a little poking around, came up with a few ideas, and notified the coworkers.

In the final sit-down session I went to the invited talk, "One Year After the healthcare.gov Meltdown: Now What?" This talk involved a video conference from the White House (specifically OMB Conference Room 1). Mikey was unable to attend in person because the open enrollment period starts up again Saturday and he couldn't be on the airplanes.

A year ago he was with CyberEngineering at Google and was tapped to be part of the "tech surge" to figure out what was wrong with healthcare.gov; it ranged from calling in on the phone to meeting cabinet secretaries. They finally got it turned around and launched.

After the thanks ("I'm writing from the hospital bed where I'm recovering from the lung transplant two weeks ago which I wouldn't have lived without, so thanks for fixing the website"), he found it hard to go back to work at Google, so he joined what became the US Digital Service. They're 24 people today, and government has a well-earned reputation for not being a lot of "fun." Public service really is a thing (and one which our profession needs to start thinking about).

They have high demand for their services, but few people to deliver them so they're forced to prioritize. He regularly attends the executive staff meeting in the Roosevelt Room at 8:30am every morning. They often work 80 hour weeks.

Like us, he's happy about the President's statement on net neutrality, and the Digital Service helped him reach that understanding.

After the session ended I dropped my stuff off in the hotel room, changed into evening- and outdoor-appropriate clothing, and met folks in the lobby at 6 to catch the monorail to the EMP museum for the reception.

The reception was fun. I did a quick run through several of the exhibits (Science Fiction, Fantasy, Horror, and Hendrix), ate too much food from the caterers (quinoa salad with dried fruits, antipasti, ravioli with sage in a cream sauce, burger sliders, chilled cheesesteak on focaccia, and salted caramel pecan brownie tartlets), had some good conversations with lots of people, and continued schmoozing on the monorail back to and at the hotel afterwards.

The only thing on the schedule after that was a couple of BOFs, none of which really interested me so I came upstairs to crash.


Friday, November 14

By the time I woke up they'd managed to install a new Java Unlimited Strength policy file (though I don't know the specifics, since the person who did it needs to be reminded, every single time she changes anything, to put in a change ticket) and resolve the sev-1 outage. Whew; one less thing to panic about.

After breakfast I went to the invited talks track. First up was "Embracing Checklists as a Tool for Improving Human Reliability," where Chris Stankaitis talked about the cultural implications of checklists. (He didn't lose a bet to be in the 9am Friday slot.) Checklists at Pythian opened his eyes; their culture is built around them. If you ask most people, they'll say checklists make things better, but they'll also say they don't need them.

He used to hate them, but now he loves them. Checklists have challenges in many work cultures, so getting them accepted can be hard.

We need them because our environments are becoming too complex to keep in our heads. Some systems are so critical that lives depend on them. Humans aren't reliable and make mistakes, yet assess themselves as stronger and better than they are. Having those people in charge of the complexity without tools to help can cost the company a lot in lost revenue, lost reputation, and lost opportunity. Total cost of ownership (TCO) needs to include the cost of downtime; most people think it's just capital expenditure and operations expenditure, but it should also include incident expenditure, and that is what we can control.

Some stories (replace sh with bash in wrong path on all Solaris boxes, leave modem in .ca dialed in to .vz, deleting the svn repo on the server). All avoidable mistakes due to human error... and all led to draconian heavy-handed measures... like ITIL.

Checklists need to be implemented well. They can make smart people feel dumb and force us to acknowledge that we make mistakes. They're not glamorous. Poorly-implemented checklists can take forever to be accepted (if ever).

To create a culture that accepts checklists you need to create a culture that understands people make mistakes. Using a blameless post-mortem is important.

Building a checklist:

Change is scary. And changing cultures is hard.

FIT-ACER is their human reliability checklist and is open sourced:

F. Focus: Slow down, are you ready?
I. Identify the server/database name, time, and authorization.
T. Type the command but don't hit Enter yet.
A. Assess the command.
C. Check the server/database name again.
E. Execute the command.
R. Review and document the results.

By turning common sense into a consciously-used checklist it forces you to slow down and ensure you're not on autopilot.

Next up was "I Am Sysadmin (And So Can You!)" where Ben Rockwood spoke about change. Some change is good, sometimes it's excessive. He was in despair of Underwood's "The Death of System Administration" in the April 2014 ;login:... And Elizabeth Zwicky and Rik Farrow in that same issue both said they used to be but are no longer sysadmins.

Playing word games doesn't help; no-ops, post-ops, devops, and SRE are all philosophies of operation, but there're still sysadmins doing these things.

For developers:

For ops:

Dilemma: Cloud changes the game, software is more important, and ops have to do more with less and get out of dev's way... but sysadmins have to evolve as we're building the cloud and infrastructure.

Argument: Pilots. Planes can fly themselves, take off and land themselves too... so why have people in the cockpit? They don't NEED to be there. We want them there to monitor, to take over in a crisis, to make sure things work correctly and smoothly. Note there're at least 2 (or more), for redundancy. Similarly, SAs will continue to exist.

It's hard because not only do we need to operate the business (fly the plane) but to also build the infrastructure to do so (build the plane).

Requirements for change:

How?

After the break I stayed with the ITs, going to Caskey L. Dickson's "Gauges, Counters, and Ratios, Oh My!" The only thing worse than no metrics is bad and/or misleading ones. Well-designed metrics enable you to quickly know the state of your service and have confidence that your systems are healthy. Poor metrics distract you from finding the root causes of outages and extend downtime. Unfortunately it isn't always obvious what to count and how to count it. This talk covered the essential attributes needed in quality metrics and walked through the steps needed to capture them in a useful format while avoiding common pitfalls in metric design.

What do we record? A metric has three key parts: an identity or name (specific, with qualifications, such as "sheep in pen A" or "white sheep in pen B"), a value, and a timestamp.
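
As a toy object, one such metric might look like this (names and values are illustrative):

  [pscustomobject]@{
      Name      = 'sheep.pen_a.white.count'        # identity, qualified as needed
      Value     = 12
      Timestamp = (Get-Date).ToUniversalTime()
  }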

How can we monitor? Three variables:

Why do we gather the data?

What kind of metrics are important? Anything that helps you detect, diagnose, and remediate a problem. It's generally the superset of all lower classes of data... and therefore the hardest.

What can go wrong? Most things: Bad { code, hardware, data, configuration }, resource exhaustion or starvation, or any combination thereof, or even other things. Metrics should help enable diagnosis and you should therefore think about why you're creating a metric and how you are going to use it.

Where do we get the data? Kernel, app software, log files, and others. The general flow is (1) read a counter and (2) examine the new state.

Remember that the same numbers don't always mean the same thing — starting a process that takes 2GB of RAM takes you from 10GB free to 8GB, or from 64GB to 62GB, and compare 20GB free on a 100GB disk versus on a 100TB disk. You can combine numbers to get insight: 75% of the disk used yesterday and 80% today implies less than 4 days before it reaches 100%. So consider whether the number is nominal (categorical, distinct or disjoint), ordinal (nominal with relative ordering), interval (ordinal with equality of spacing), or ratio (interval with a correlation factor).
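
The disk example, worked through with the numbers above:

  $usedYesterday = 0.75
  $usedToday     = 0.80
  $growthPerDay  = $usedToday - $usedYesterday
  $daysToFull    = (1 - $usedToday) / $growthPerDay
  "Full in about {0:N0} days at this rate" -f $daysToFull   # about 4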

So what about using gauges versus counters?

Being able to sum metrics across some dimensions can be useful.

So how about metrics we care about? At a host level:

But because naming and attributes are hard, convincing the data storage layer to track metrics that live in arbitrarily sized, multi-dimensional namespaces is problematic.

After the lunch break, I went to the first half of the invited talks to hear Brendan Gregg speak on "Linux Performance Analysis: New Tools and Old Secrets," more because he's a known-good speaker than because I was interested in the material.

At Netflix, performance is crucial, and they use many tools, high-level to low-level, to analyze their stack in different ways. In this talk he introduced new system observability tools they're using at Netflix, which he's ported from the DTraceToolkit and which are intended for Linux 3.2 cloud instances. These show that Linux can do more than you may think, by using creative hacks and workarounds with existing kernel features (ftrace, perf_events). While these solve issues on current versions of Linux, he also briefly summarized the future in this space: eBPF, ktap, SystemTap, sysdig, and so on.

Nothing grabbed my attention for the second half of the session so I wound up in the NOC chatting with the operations team.

The final session of the conference was a plenary endnote. The original speaker for the robots-in-spaaaace talk had a medical emergency and could not come, so we had someone step in. Courtney Kissler from Nordstrom spoke on "Transforming to a Culture of Continuous Improvement."

Nordstrom has 16K employees (1.8K in IT) and does $12.5B in revenue. They have nordstrom{,.com} at the high end, with iPhone and Android apps, and the lower-end Nordstrom Rack, as well as the electronic-only Hautelook and Trunkclub. She's the executive in charge of IT across the entire full-price experience.

The Board has offsites in June and November. In June 2011 the focus was online growth; the team and board looked at what other brick-and-mortar retailers thought, and their competitors all said "digital is going nowhere, don't bother," and those competitors are out of business now. So they wanted to spend more resources on the technology side. They had spent 2+ years doing a major rewrite of their in-store POS applications... and by 2011, when it was done, it was irrelevant. Before the journey, they were cost-optimized: shared services, annual planning, waterfall-style delivery, big-batch releases (2x/yr on the web site), with a 97% success rate (defined as on-time, on-budget, and on-scope). After the offsite, they chose to optimize for speed. They went Agile for methodology and do continuous flow and improvement. Adoption varied by team: some did it because they were told to, some believed in it and went all-out, and some teams said "huh?" and didn't move forward at all.

Leadership pushing Agile doesn't work, but team-led Agile can work. They took the time to look at the value stream. One example is the customer mobile app team. Beforehand there were a lot of layers of management and a lot of handoffs between them. The team went scrum... and it failed because it didn't deliver fast enough. So they put a single director in place to go and see what was what. Without using terminology or jargon, he asked a lot of questions about how they were currently delivering value, so they could document the value stream without using the words "value stream." In May 2013 the lead time was 22 weeks without the API and 28 weeks with it, and cycle times were 11 and 20.5 weeks respectively. The way they were organized was a big challenge, and now everyone could see what was wrong. Once the team saw that, the team (not the leader) figured out what to do differently to improve. They organized into squads, had no hardening phase (building quality in up front), did continuous planning, and merged everything into a single backlog of work.

With everyone on board, they had significant improvement: app store ratings improved, quality and throughput improved, customer satisfaction improved, and the release cycle got much, much shorter. Measuring lead and cycle times showed a lot of improvement... until September 2014, when there was a setback. In the June 2014 offsite the topic was accelerating customer mobile to bridge digital to brick-and-mortar and connect them for the customers. That led to a lot of resource changes and upstream issues. Also, the growth shifted the culture (with 100% growth in 2 months). Finally, the team was too comfortable and rested on its laurels. With all of that, their lead and cycle times went up again. So the team reinstated continuous improvement to align on a target condition to minimize those setbacks. (Again: team led!) Also, they're now organized in squads aligned to a business outcome, such as "customer satisfaction" or "in-store (or online) demand."

Story 2: They have a bistro restaurant application. It's a legacy application they acquired about 8 years ago. The intent was to run it as SaaS, but they had to bring some of it back in-house and the app has morphed a lot. The team doesn't own the code (it's a legacy tech stack). At the end of 2013 they had conducted 11 re-concepts (from cafe to marketplace, which isn't as simple as "change the signs"), they wanted to do 44 in 2014, and the team had 30+ high-impact incidents in 2 months. Restaurant outages never got categorized beyond Medium (even though it meant "can't take the customer's money"). The team came up with problem solving and countermeasures and was empowered to fix things. The business partner said "triple the team (add people)"... which is one way to solve it. Alternatively, they could do value-stream mapping and see about gaining efficiencies in the process.

They took 40% of the time out of the process in one experiment. The initial kickoff asked for 60 data points (very painful for the user); in reality they only needed 10 up front, and the rest could come later. Morale for everyone turned around, they improved MTTR, and they turned things around. (Yay!)

Story 3: Server provisioning — they took 8 days out of the process in the first experiment.

Story 4: Enterprise Service Bus implemented Logging as a Service for a 90%+ productivity improvement.

Story 5: Enterprise Service Bus reduced deploy time from 90 minutes to less than 10, and went from 4 people to 1 to do a deploy. The team leader said, "It's about the mindset and behaviors. I learned so much in VSM training and about lean/CI techniques, and that allowed us to teach others. Everything is about the customer."

Summary: Underlying all of this is people. The method of delivering the work is grounded in the people. Everything they build should be about the customer; ask yourself whether the customer would value it. She has a passionate belief in continuous improvement as a critical component of how they get work done (now and in the future). Create a learning culture. Be persistent: keep going.

Finally, leaders have to evolve. If we had magic wands, every leader would:

What are their current challenges?

David Blank-Edelman closed us out with thanks to everyone who put this together, especially chair Nicole Forsgren. After the session wrapped up I went to dinner at the Tap Room with Andrew, Mark, Peter, Todd, Travis, and Trey. I wound up getting a cup of crab bisque and a decent burger, and chose to skip dessert because there'd be munchies at the Dead Dog later. Back at the hotel I swung by the lobby bar to hang out and schmooze with people. Wound up playing an 8-person game of Cards Against Humanity and came in tied for 4th place. A bit after 10pm I headed up to the Dead Dog in the Presidential Suite. Drank some bourbon, drank some scotch, schmoozed a lot, and left around midnight. Finished packing (except the CPAP, laptop, and toiletries) and went to bed.


Saturday, November 15

Alas, it's time to return home. Slept in until a bit after 7am, checked the work email (no crises, yay), checked the morning comics and TwitBookPlus feeds, packed up the CPAP, jumped in the shower, and blew my nose eleventeen bazillion times until blood stopped coming out. (It was way too dry in this hotel.) I wound up having breakfast with Branson and Xavier, which was nice; over the 1.75 hours we were there I had a heart attack omelette (bacon, ham, sausage, and cheese), with sides of bacon, sausage, and breakfast potatoes, plus some lox and a banana.

Cashed in my housekeeping certificates for points, headed back to the room, packed up the laptop, and headed down to the lobby to wait for the shuttle. Got to the airport without issue, checked my bag, cleared security (no line!), and got to my gate in plenty of time to board.

The flight was uneventful. Lunch/dinner was chicken enchiladas (over black beans and rice — what genius thought "Let's cram a bunch of people in a sealed metal tube then feed them beans" was a good idea?), with a salad (lettuces, jicama, and red bell pepper), and a dessert plate of cherry cheesecake, grapes, cheddar, and brie. Got into Metro a bit before 9pm, and by the time I got off the tram to go to baggage claim the fire alarms were sounding. Ignored them, got the bag, took the shuttle to the parking lot and drove home (entirely at or above posted speeds). Unpacked (mostly), wrote up some trip report while still on the clock, grabbed something to drink (neither alcoholic nor caffeinated), and headed off to bed to try to reset the body to Eastern time.




Last update Feb01/20 by Josh Simon (<jss@clock.org>).