Josh Work Professional Organizations Trip Reports Conference Report: 2013 LISA

The following document is intended as the general trip report for me at the 27th Systems Administration Conference (LISA 2013) in Washington, DC, from November 3-8, 2013. It is going to a variety of audiences, so feel free to skip the parts that don't concern or interest you.


Saturday, November 2

Today was my travel day. Headed off to the airport a bit early to avoid stress. Rainy drive to the airport, but no wait for the terminal shuttle and no line outside security. The line was inside the security point since only one of the porno scanners was in use. Got to the gate with plenty of time, boarded without incident, and wound up sitting (in my maize-and-blue Michigan sweatshirt) in front of a pair of MSU people. The senior flight attendant was also from MSU and gave me some good-natured hassle, so I made sure to give her hassle back... on my way off the plane. I'm no dummy.

Took the subway to the hotel, got my room key, and headed upstairs. Unpacked and realized I'd forgotten my "fancy" badge holder. Wound up skipping lunch and just hanging out around the registration area with the other early arrivals, waiting for registration to open.

Got my stuff when registration opened at 5pm — bag; badge; tickets to the sessions, workshop, and reception; conference directory; and the various adverts — and hung out chatting with folks until the Welcome Reception at 6pm. That ran a bit long, with the presentation (and heckling) letting everyone know what's new and different this year. As the food was basically cheese-n-crackers and veggies-n-dip, and I hadn't had lunch yet, I wound up adjourning to Cafe Paradiso across the street for dinner. Adam, David N., Heather, Marc, and I shared a table; I had the seafood risotto (3 shrimp, 2 scallops, and a nice mound of risotto all on top of a saffron-infused broth) which was excellent. Add in a Caesar salad to start and chocolate mousse at the end and I was happy. Got back to the hotel, hung out in the lobby bar for a bit, before heading upstairs to crash.


Sunday, November 3

Weekend day! I slept in 'til almost 7am (which would be a biological 8am thanks to the time change overnight)... with a couple of bio breaks. The sleep wasn't all that restful; I suffer from "can't sleep the first night in a hotel"-itis. I wound up doing brunch at Scion with a local friend (who I hadn't seen since January 2005) and his new husband (who I just met). I went simple with 3 eggs (scrambled), potatoes (think "thin sliced home fries"), (3 pieces of thick-cut) bacon, and rye toast. After brunch I did some window shopping while wandering around Dupont Circle, spent some time at the Farmer's Market but didn't buy anything, and got back to the hotel around 1pm. Napped before meeting other friends — one local and one now from Chicago — for dinner. We decided to stay local and went to Lebanese Taverna. We split a chef's sample appetizer platter (a "sampling of hommus, baba ghanoush, tabouleh, lebneh, grape leaf, falafel, fatayer spinach, kibbeh, and m'saka") and I had a lamb shank and roasted vegetables over couscous. Excellent meal.

The evening ended with the Board Game Night, which has become in large part a Card Game Night, since we had a couple of different decks of Cards Against Humanity. I was at one of the CAH tables and while I did well, winning 6 questions, I didn't get to 7 before someone else got to 10 (and two others were also at 9). Was still fun.

After some post-game schmoozing I wound up heading to bed around 10:30pm.


Monday, November 4

From a conference standpoint today was another unstructured Hallway Track day. I woke up around 6am anyhow, caught up on email and social media (or as caught up as I'm going to get). After a quick shower I headed downstairs to schmooze (and drop off the leftover Halloween candy at the registration desk for Lee so he could toss it out to students in his classes later in the week). Once the tutorials and workshops started I found a quiet corner with a table, opened up a floor outlet, plugged in my power strip, and got online. Alas, the ISP was having issues, so regardless of which wireless network one was on, the latency was high.

Went to an early lunch with another hallway-tracking friend; we went across the street to the Woodley Cafe. I had one of their specials, and it was a Wreck. (Seriously, they named it after the national chain that does a similar sandwich.) Ham, turkey, roast beef, and melted swiss, with lettuce and tomato, on a panini-pressed roll, with sides of chips and a pickle. Very tasty.

Monday afternoon I hallway-tracked. When the afternoon sessions adjourned I grabbed a quick power nap in advance of dressing up for the deadbeef dinner. Adam, Branson, Dan, Doug, Jennifer, Mario, and I went to The Prime Rib for a high-end steak dinner. We went there back in 2006 and even through 2012 thought it was the best place we'd held a deadbeef dinner, and this year it didn't disappoint. I had a Caesar salad to start, the 24-oz. bone-in roasted prime rib with loaded baked potato, and a slice of pecan pie for dessert. Those of us drinking wine went through 2 bottles of a 2010 Molly Dooker maitre d' cabernet sauvignon that was very nice. (wine.com says, "Dark cherry red, with a deep violet hue, this is an incredibly aromatic varietal Cabernet. Showing great intensity with bright berry fruits and mashed raspberry, together with licorice spice and espresso coffee. The palate is vibrant and fresh, building with layers of fruit, toasty oak and fine Cabernet tannins. It finishes with hints of anise spice, dark cocoa and cassis fruit.") Given that our two cabs were driven dramatically differently (my group had the driver who'd floor it at the green light, almost mowed down a couple of pedestrians and bicyclists, and put the cab in park instead of waiting on the brakes at every single red light; the other group had a very polite and nonaggressive driver who never went above the speed limit and let others in front of him nearly constantly), we decided to walk to the closest Red Line Metro stop and took the train back to the hotel. Between the food and the wine and the exercise, I was pooped so I headed upstairs to pop the evening drug regimen and crash.


Tuesday, November 5

Woke up a little before 6am and couldn't get back to sleep, so I went through my work email.

Tuesday's sessions began with the Advanced Topics Workshop; once again, Adam Moskowitz was our host, moderator, and referee.

[... The rest of the ATW writeup has been redacted; please check my web site for details if you care ...]

After the workshop, I went across the street to Tono Sushi with Adele and Rowan; the food was great, the sake okay (it's not really to my taste but I wasn't drinking much beyond a sip or three), and the service glacial. We waited about 20 minutes between finishing the food and being given one dessert menu, then another 20 minutes between dropping off that menu and asking us if we wanted anything, and finally being ignored once we got the bill and tried to pay. Seriously, we could've been out by 7:15 if not for the slow server; as it is, we left cash (and a smaller tip than we would've on credit cards) just to get out of there around 8:10pm. Got back to the hotel and Hallway Tracked for a while, grabbing an ice cream bar from the Cambridge Computer hospitality suite and catching up with folks like Peg (who's not been to one of these conferences in ages).

At 10pm I went to the LGBT Issues BOF and managed to con David into actually hosting (moderating?) it. Some introductions, some political discussions, and some anecdotes; everyone there was from the US; some were returning to the conference after a few years' absence, while a few were first-timers. We wrapped a bit after 11pm so I went upstairs to go to bed — to find the fire alarm was sounding, but not flashing the strobes. Couldn't get through to the front desk, operator, or guest services, so I went downstairs along with many if not most of the occupants of the 7th, 8th, and 9th floors. It seems that a false alarm sounded on 8 and spread to 7 and 9 before they could shut it off. Other than "all clear, false alarm" they didn't say much this evening. I was directed to ask the dayshift manager in the morning what happened and how they planned to recompense us for the inconvenience and lost sleep.

Given the excitement I didn't get to bed until well after midnight.


Wednesday, November 6

Narayan Desai and Kent Skaar began with the usual announcements. This is the 27th LISA and the theme is System Engineering Depth and Rigor. They thanked the usual suspects (program committee, invited talks coordinators, other coordinators, steering committee, USENIX staff and board, sponsors, authors, speakers, reviewers, and vendors). There were 76 invited talk proposals, 33 accepted, and they doubled the slot count by splitting the invited-talk slots into two 45-minute talks each. The program committee accepted 13 papers from 24 submissions, and (though in direct contravention of both tradition and policy they didn't announce it) as of shortly before the keynote we had nearly 1,000 attendees. (I later spoke to the staff, who gave me a count of 999 on Wednesday.) They gave the usual housekeeping information (speakers to meet with their session chairs 15 minutes before they go on-stage, BOFs in the evenings, and reception at the hotel). Also, there's a new LISA Labs hands-on hacking space where you can experiment with new stuff.

LISA 2014 (the 28th) will be November 9-14 in Seattle, WA, and Nicole Forsgren Velasquez will be the program chair.

They announced that the 9-layer OSI model T-shirts would be for sale at $25 each, and the proceeds would go to the Evi Nemeth Scholarship Fund. Evi was an engineer, author, and teacher known for her expertise in computer system administration and networks. She was the lead author of the "bibles" of system administration: UNIX System Administration Handbook (1989, 1995, 2000), Linux Administration Handbook (2002, 2006), and UNIX and Linux System Administration Handbook (2010). Evi Nemeth was known in technology circles as the matriarch of system administration. She was lost at sea this past summer.

Next the regular awards were presented:

Narayan introduced our keynote speaker, Jason Hoffman, the founder of Joyent, who spoke about "Modern Infrastructure: The Convergence of Network, Compute, and Data." The three pillars of our industry are network, compute, and data. All trends come down to the convergence of these. The convergence of network and compute resulted in "the network is the computer," the convergence of network and data spawned the entire networked storage industry, and now we believe we're in the technology push where we are converging compute and data. At Joyent, they were able to take a fresh look at the idea of building a datacenter as if it were a single appliance; they took a storage-centric, "software-defined everything, but always offload to hardware when you can" approach, and they intend to do everything in the open. In his talk, he covered the philosophical basis, the overall architecture, and the deep details of a holistic datacenter implementation.

Computers should be simple: Given data, do computations on it, and send and receive data over a network. However, we need to stop piling abstractions on top of abstractions; we should do things at the right layer and inherit. Once the systems are comprehensible and inheritance works, you can address the WHYs: Why is it slow, down, and so on. That lets us work on fault analysis and performance analysis.

There are four areas that provide an industry framework (with Joyent as a system example):

Career-wise, we need to be more data-centric and more scientific.

In the Q&A session, Rik Farrow had a quibble with PXE booting a small (200MB) kernel: 200MB isn't small. Jason notes it's smaller than an Android update. Narayan Desai notes that commodity hardware has become boring; we're now starting to see new, cooler technologies. Will we see drives above 2TB, or 128-bit chips, and so on? It should be interesting to see what happens. The consolidation (de-duplication) of the component industry, materials science limits, and huge R&D budgets.... Finally, someone at Google notes they have lots of Linux boxes. SmartOS lets you move away from "ssh into a box" conceptually. What can we do in the SmartOS world to get away from using SSH? Can we get to a place where we never do config management (on the remote boxes), just making atomic images?

After the morning break (which this year is just beverages and possibly any leftover breakfast pastries) I went to the invited talks on autonomous systems ("SysAdmins Unleashed! Building Autonomous Systems Teams at Walt Disney Animation Studios") and designing operations drills ("Becoming a Gamemaster: Designing IT Emergency Operations and Drills").

First in the 2-talk session was Jonathan Geibel and Ronald Johnson of Walt Disney Animation Studios. They talked about culture change on an operations team... through a story, providing ideas for us to implement in our own organizations.

How do you instill the agility and effectiveness of a startup company within the walls of one of the most storied animation studios in the world? This question guided our design of a new systems organization at Walt Disney Animation Studios. Our goals were simple: break down traditional top-down management silos, empower staff to make autonomous decisions, and remove bureaucracy. We used scientific method experimentation with different structures and ideas, discussing the impact of each change with our staff along the way. We'll discuss the methods we used to empower sysadmins, and how we've evolved into an organization that's designed for and by technical staff.

They usually have a technical challenge they don't know how to solve when they start on a movie. In Rapunzel it was making the 50' hair look right; in Frozen it's the snow and ice (everywhere, falling in it, throwing it, walking on it, etc.). Once they figured out how to do it, the rendering was very challenging.

For Frozen, they supported a 30K-core render farm, 60M render hours, two 1.5MW data centers (they maxed out the power on the first), 6PB of raw storage, 1,000 Linux workstations, and 800 Macs. (Wreck-It Ralph took 10K cores and Tangled took 5,300.)

The systems team is a 50-person crew. They used to have a standard siloed org chart, but resource contention and pigeonholing made it inefficient, with poor communications, and difficult to evolve. Career paths were limited and people weren't empowered. Top-down reorgs affected too many things at once. So they got the top tier out of the way, then broke down the silo walls. How?

Each silo was a functional area but not necessarily a team. They developed small autonomous teams of 2-6 people (there's empirical evidence that with more than 6, the introverts ramp down, the extroverts ramp up, and communication flows less freely). The team members can cross disciplines (engineering, R&D, support, etc.). Engineering and support are tightly coupled within the team. Effectively, a team can act as a small startup: Hire smart people, give them autonomy and authority, give them a big problem, and get out of the way. They focus on the results. No dress code, but co-location is strongly encouraged, including a collaborative space or war room.

Within each team there are roles: A lead (the person who's the technical expert in the area, comfortable providing technical leadership, providing day-to-day operational as well as tactical and strategic leadership, but with no HR duties; title is irrelevant; works directly with stakeholders and managers to get resources), primary members (dedicated to a specific team — everyone has one primary role on a team), and secondary members (who provide additional skills — technical or otherwise — on a team that isn't their primary team). Someone can play multiple roles, like primary on the data center team and secondary on the Linux team (to provide wisdom) and on the storage team (to learn more about it). They do this without time slicing, so the team leads for the teams on which someone is secondary know the resource might go away. No talent is left behind, as everyone has different passions.

Great! But what about the managers? They're a team of managers, structured the same way, but their roles are different. Instead of owning functional areas, they provide career and team coaching and decide where to invest resources within the teams. As for coaching, they're to ensure people's passions are addressed, that they're seeing career growth opportunities, that the customers are getting the service they need, providing resources as needed, and so on. People can pick their own manager now based on skill set and personality. You can switch managers as needed (though rotating through all of them may indicate a problem; similarly, having a manager nobody wants to work with is also a problem).

The managers can decide where to invest resources, and build new teams as technology evolves, selecting new team leads based on who has the best ideas and vision, and so on. They also help define the team goals to align with the Studio's direction.

They have 17 teams in a flat hierarchy. Technical people can grow and be promoted without becoming managers. Customers can talk directly with the technical people instead of having to go through a manager. They did analyze this, and it's not just 17 small silos; part of that is because being spread across multiple teams in multiple roles lets everyone look forward. Compensation for senior technical people mirrors compensation for senior managers.

Physically they removed the walls in the engineering pods.

In the Q&A, one attendee asked how they were able to enact such a revolutionary change, and whether they found after the implementation that people drifted back into siloed behavior. First, they merged with Pixar 6 years ago, and the new senior management that came as part of that gave them the autonomy to try this revolutionary experiment; it's a Disney Animation cultural thing. Second, there was some drifting back, but they made sure to do one-on-ones frequently (at least monthly) and do daily walk-arounds to make small tweaks as needed instead of making big adjustments. Promotions are through meritocracy. Another noted that the slides show personal spaces on the perimeter with a collaboration space in the middle; ATC folks seemed to need 25% collaborative and 75% focused space. Does tearing down the walls lead to a too-noisy environment for focusing? It's an experiment in one particular area so far, and the team suggested it. That team all have similar noise levels, work ethos, personalities, and trust, so they're pretty confident this should work.

Second in the 2-talk session was Adele Shakal of Metacloud. What can we as Ops teams do to prepare for the zombie apocalypse?

Bring emergency response and operations, business continuity, disaster recovery, and IT architecture together into practical drill design... and prepare your organization for whatever zombie apocalypse it may face.

Learn key concepts in emergency operations center and incident headquarters design, methods of introducing such concepts to your organization, and a sequence of basic-to-advanced drill designs.

Keeping IT folks engaged in a drill simulation can be very challenging. Become a gamemaster worthy of designing and executing drills on likely emergency scenarios and realistic function failures for your organization.

She started by defining her terminology:

Emergency Planning and Drills brings all this together. Someone near the top of the organization needs to make it a priority and design (and run) a drill. Create a plan, ensure it is current and available. Hope for the best, plan and drill for the most likely, and cope with the worst. California has few hurricanes and Florida has few earthquakes, but power goes out unexpectedly everywhere. If you plan a drill, you need to ally with the emergency operations center (EOC) or incident headquarters (IHQ), and meet with them to design an EOC/IHQ if you don't have one. See other experts, like ICS, NIMS, NEMA, IAEM, Citizen Corps, and CERT; don't reinvent the wheel.

Showcase the EOC/IHQ in a non-emergency setting; have an open house. Provide food and drink, lead short guided tours, and so on. Status stations should be for personnel, facilities, and critical business functions. Publicize the drill schedule too so participants can wear comfortable shoes.

We should be doing life safety drills (fire, tornado, etc.) already. Then move beyond those for basic IT emergency ops drills (ASSESS, REPORT, RECOVER). The goal is to set up the EOC/IHQ, collect and communicate status, and assign resources to recover critical business functions as needed. Note that we're not turning off actual services, unlike Google's actual clobbering of infrastructure pieces. Getting the communications right first is important. Once you've got the people assigned....

There may not be:

Identify the top few functions and what's required to get them working, and drill down on those, not necessarily on the entire infrastructure.

Map only what you need. Drill down only on the things you care about. Remember there may be a manual process or workaround. Don't try to solve all the problems.

Now that we know that, design the theoretical IT emergency. Create secret notes for participants to open at set times during the drill, simulating personnel, facilities, and critical business functions updates. Chart them out ahead of time; compare after the drill whether the EOC accurately summarized the information coming in and communicated the summary out. At the start of the drill, introduce the structure, and at the end include a discussion to capture lessons learned.

Do a couple of basic drills, months apart. Have data captured at the time. Then eventually meet with the resource allocators to see if anything needs to happen.

Then, you can do an advanced drill (RESPOND AND ASSESS, REPORT, RECOVER): Include the emergency response. Run the first aid station concurrently with, and near, the IT EOC, and watch how things are different. Caveat: Simulate unavailability of personnel (vacation, collapsed freeway, other responsibilities, and so on).

Don't overdrill. Once a year, once they're reasonably practiced, might be fine.

For relevant organizations, go big with a full Game Day: Interface with the media and plan in advance. Introduce conflicting updates to the EOC ("Building 3 is fine" / "No, the halon dumped and it's evac'd") so the EOC can figure out how to handle it. Use dice (a D10) to randomize how long it took for the update you just opened to reach you. Simulate lack of personnel and facilities.

Given all that, what about the zombie apocalypse? Prepare for the most likely emergency scenarios, which may not actually be the zombie apocalypse. Burn a pop tart and evacuate the help desk.

In the Q&A, someone asked if she'd recommend running this on smaller teams (e.g., 20-30 people), and how would you get started? Adele's experience has been the larger organization saying Go Forth And Do This. In her smaller company they have to prepare for the office being unavailable, and doing that for a small company is a useful exercise. Showcase a couple of small-team drills and show improvement at a brown bag for other management to get grass-roots buy-in, and do a post mortem after any real event. Metrics (money, service uptime) will sell this to senior(er) management.

In addition to the preceding, bring or have your own tech for the EOC team. For example, the disaster team at CSCO has an end-of-the-world apocalypse van with a satellite link and its own 220/240 power and gas generator etc. Also universities can tap their local radio station or ham radio.

(Her slides are here.)

Went to lunch with a bunch of Googlers at Open City Cafe and had a tolerable burger. I think they swapped my medium rare and Mike's medium burgers, but it was still well-flavored. Got back to the hotel in plenty of time to drop off my coat, grab the laptop, and get to the next session.

In the first afternoon session I started with John Sellens' talk, "Building a Networked Appliance." In his talk he told the tale of designing and building a small networked computing appliance ("thing") into a product, and the decisions, trial(s) and error(s), and false starts that it entailed. It primarily covered the technical challenges, and the infrastructure and support tools that were required. The device is intended to be deployed unattended and in remote locations, which meant that the device and the supporting infrastructure had to be built such that it could be remotely managed and would be unlikely to fail. What could possibly go wrong?

(His slides are here.)

In an effort to avoid marketing, which is bad, he never said what the device or thing did. It's monbox.com, so you can monitor systems and devices even if you can't reach the network they're on from your monitoring server. It just needs to be able to reach the central server over https. The central server doesn't need to talk to it once it's set up.

It's built off a Raspberry Pi model B running Raspbian, using standard OS packaging tools (.rpm and .deb, built via Easy Package Management (EPM)). Most of it is non-interactive command-line tools, a console menu, and a web interface (PHP that invokes the command-line utils), plus base OS tools like lighttpd and ssh. It runs a read-only root file system.
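
To make that architecture concrete, here's a minimal sketch of the same pattern — a thin web handler that just shells out to the same non-interactive command-line tools the console menu uses — written in Python rather than the PHP-behind-lighttpd the actual box uses. The tool names and paths are hypothetical placeholders, not monbox's real utilities.

    # Illustrative only: the real appliance uses PHP behind lighttpd, but the
    # shape is the same. A whitelist maps URL paths to non-interactive CLI
    # tools; the web layer stays a thin wrapper. Tool paths are hypothetical.
    import subprocess
    from http.server import BaseHTTPRequestHandler, HTTPServer

    TOOLS = {
        "/status": ["/usr/local/bin/appliance-status"],            # hypothetical
        "/checks": ["/usr/local/bin/appliance-checks", "--list"],  # hypothetical
    }

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            cmd = TOOLS.get(self.path)
            if cmd is None:
                self.send_error(404)
                return
            # Run the CLI tool and return its output verbatim.
            out = subprocess.run(cmd, capture_output=True, text=True).stdout
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.end_headers()
            self.wfile.write(out.encode())

    if __name__ == "__main__":
        HTTPServer(("", 8080), Handler).serve_forever()

The appeal of the pattern is that the console menu, the web interface, and any future integration all stay thin wrappers around one set of CLI tools, which also plays nicely with a read-only root file system.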

Worries:

What's next: Enhancements and features, plus more integration tools.

I wasn't that interested in the second half of the session, so instead of the talk I went to the vendor floor to swap out my codes for more Google pins. Wound up getting dragged into organizational discussions so I didn't get to any booths other than Google's.

In the late afternoon session I attended Jeff Darcy's talk, "Storage Performance Testing in the Cloud." Based on experience testing distributed storage systems in several public clouds, this talk consisted of two parts. The first part covered approaches for characterizing and measuring storage workloads generally. The second part covered the additional challenges posed by testing in public clouds. Contrary to popular belief, no two cloud servers are ever alike. Even the same server can exhibit wild and unpredictable performance swings over time, so new ways of analyzing performance are critical in these environments. I didn't take detailed notes for this section.

The second talk in this final block was Bruce Schneier, "Surveillance, the NSA, and Everything," via Skype. He apologized for being stuck at the IETF meeting in Vancouver.

He loves the NSA code names, like Muscular (collecting Google and Yahoo information by tapping into the data center communications lines), and Prism (getting data by asking the service providers), and the unnamed one to get buddy lists and contacts by tapping the lines between the users and the ISPs; Fairview, Blarney, Oakstar, Stormview, Little (L3), and Remedy (British Telecom) are all telco-intercept programs. Quantum (the packet-injection system) which runs on Tumult, Turmoil, or Turbulence. Foxacid is Metasploit. Worst ever? Egotistical Giraffe. Then the NSA can stick stuff on your computer; look for his blog post. Bullrun inserts backdoors into the products we actually buy.

There's a lot we know, and a lot we don't know, and there's a lot we won't know forever. The documents omit the cryptography information and the software impacted isn't listed, mainly because it's not written down.

In short, the NSA turned the Internet into a robust — legally, technically, and politically — means of surveillance. Data is a byproduct of an information society and socialization.

This is more than the NSA; it's the CIA, MI5/MI6, etc. Everyone, not just the US, is actually doing this kind of stuff. We know China is doing a lot of the same things (cf. Great Firewall of China).

How'd we get here? We made surveillance too easy and cheap. Snowden notes that [proper] encryption works. The NSA agrees; their efforts to break Tor illustrate that. Encryption isn't enough; endpoint security is very weak and the NSA can find ways around it. The software and networks are vulnerable.

We know the NSA has some cryptanalysis we don't have. They're a black hole of mathematicians. There's a sentence in the black budget, "We are investing in ground breaking cryptographic abilities (and so on)." It sounds like they have something theoretical and are building or manufacturing their custom whatever to make it work.

  1. Elliptic curves — They may have more advanced techniques for breaking them (or some unknown class thereof), giving them an advantage. They're manipulating the public collection of curves.
  2. Breaking RSA — They may have some way to factor RSA mathematically.
  3. RC4 Attack — They may have an attack against RC4; it's plausible that a massively parallel attack could break it.

But we know they defeat encryption by going around it, such as targeted attacks to exfiltrate keys.

The NSA had no plans for exposure; they looked at the risks of the adversaries finding out what they did but not of the public finding out. They need to assume that everything they do will become public in 3-5 years. This changes the cost-benefit analysis for corporations to do business (or not) with the NSA. Cloud computing, hardware, and software companies have all lost (mostly international) sales because they dealt with the NSA instead of standing up against and fighting the NSA.

There'll be internal NSA self-corrections. There'll be external corrections as well, especially from our allies, so it's likely that US administrations will move away from "collect everything": it's too noisy, too full of false positives, and reliance on detection crowds out other solutions. The blowback from the revelations is a liability. There are also limitations on intelligence: in a lot of cases, NSA surveillance isn't (or doesn't seem to be) worth it. Their dual mission of both exploiting and protecting communications needs to be rebalanced.

Externally? We need to make eavesdropping more expensive. How can we change those economics?

We need to leverage economics, physics, and math, and limit bulk collection while enforcing targeted collection.

Largely this is a political problem and the solutions will be too: Transparency, oversight, and accountability. It's hard but we can do it. A problem is that laws have lagged technology. The NSA says "Give me the box to operate in and I'll operate within [and push the boundaries of] that box." Laws, by not keeping up with technology, give the NSA more gray areas to fill.

We probably won't win the "stop doing this" debate. We might win the "tell us what you're doing" debate. And remember, reining in the NSA only affects the USA and US citizens. It won't affect non-US folks, other countries, or non-nation-state actors.

We need to move from an arms race and zero-sum game to a positive-sum game. It's a robust problem — legally, technically, and politically — and we need to solve it for everyone (not just the NSA); the Snowden documents show how anyone with a budget can do this. We also have to prevent the balkanization of the Internet, with countries all going to internal-to-their-country internets, and avoid emboldening other countries to use more invasive surveillance techniques in their own countries.

In the near term, we need to know whom to trust; we've broken the "US benign dictatorship" trust model. The ITU is trying to take over (which would be a disaster). We need better governance models. More generally, this problem is bigger than the NSA or security. Fundamentally it's a debate about data: ownership, sharing, surveillance as a business model, and the tension between the societal benefits of big data and the risks to — and loss of — personal privacy. Solving the problem will take decades, but this is where we've started.

Immediately after the talks, I met up with Lee, Mark, and Philip to head out to Fogo de Chao for meat on swords. In addition to the salad bar (where I stuck to mostly meats and cheeses — asparagus, bacon, grape tomatoes, mozzarella, parmesan, prosciutto, salami, and smoked salmon), I had alcatra (top sirloin), cordeiro (lamb chops), costela de porco (pork ribs), filet mignon (both with and without bacon), fraldinha (bottom sirloin), linguica (sausage), lombo (parmesan-crusted pork tenderloin), and picanha (top sirloin). Add in the sides (cheese rolls, fried bananas, garlic mashed potatoes, and polenta) and a caipirinha, and then the molten chocolate lava cake a la mode for dessert, and I almost had to waddle back to the Metro Center subway stop to head back to the hotel.

I got back from dinner around 8:15pm, so I had time to drop my stuff off in my room before heading over to the Wardman tower's 8th floor to open up the Presidential Suite for the Scotch BOF. We had 8 or 9 different scotches (and an Irish whiskey and a bourbon), various chocolate, and water and sodas. I had a shot or less of 4 or 5 scotches (most were tasty and I don't recall which is which). Wound up playing a game of Cards Against Humanity though we ended it early as we were all turning into pumpkins.


Thursday, November 7

Despite not getting to bed until after midnight, my body insisted on waking up around 6:15am. Managed to get an hour's worth of non-conference work done (email, time-tracking paperwork, writing more of this very trip report, and dealing with a scratch disk issue on a lab server). Shaved, showered, and got down to the conference floor in plenty of time for this morning's plenary session, "Data Engineering for Complex Systems," which was supposed to be by Hilary Mason of bitly but she got food poisoning and wound up in the ER. Brendan Gregg of Joyent gave his afternoon talk, "Blazing Performance with Flame Graphs," early as a plenary instead.

We also awarded him the Domino's award, for delivering a plenary in less than 60 minutes... for free.

"How did we ever analyze performance before Flame Graphs?" This new visualization invented by Brendan can help you quickly understand application and kernel performance, especially CPU usage, where stacks (call graphs) can be sampled and then visualized as an interactive flame graph. Flame Graphs are now used for a growing variety of targets: for applications and kernels on Linux, SmartOS, Mac OS X, and Windows; for languages including C, C++, node.js, ruby, and Lua; and in WebKit Web Inspector. This talk will explain them and provide use cases and new visualizations for other event types, including I/O, memory usage, and latency.

(He came up with the "shouting at disks" latency heat graphs that went viral, BTW. See his LISA'10 talk.)

Talk part 1: CPU flame graphs.

Problem: A production MySQL database with poor performance and heavy CPU usage. He used a CPU profiler and condensed the output to stacks and counts. Even with condensation it became 500K lines of output with 270K unique stacks.

(His slides are here.)

The Y axis is stack depth (the top function is the one that led to the profiling event); the X axis is alphabetical, since time ordering doesn't matter. A box in the graph represents a function, and its width is proportional to how often that function (or its children) appeared in the profile.

These can also be used for more than just CPU — memory (malloc, brk, mmap), I/O (both logical and physical), etc. I/O tends to be standard enough (especially on-CPU) to be able to read from the text, not needing the graphs.

Some useful URLs:

He's working on wakeup latency flame graphs next, and chain graphs after that. Others are doing work on node.js (with a ustack helper), OS X from Instruments, Ruby MiniProfiler, and Windows Xperf (xperf_to_collapsedstacks.py), and we'd hope for it in Google Chrome Developer Tools. He wants to add color controls and zoom in/out functions. Pedro Teixeira has a project for node.js flame graphs as a service: generate one for each github push.

Several examples are in the slides.

Generation (see https://github.com/brendangregg/FlameGraph):

Can mix user and kernel space too. You can also mix both on-CPU (hot) and off-CPU (cold) using a mix of colors — but they don't really mix well.
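
To make the generation step concrete, here's a minimal Python sketch of the "collapse" stage that feeds flamegraph.pl — my own illustration, not the actual stackcollapse scripts from the repository above. The sample stacks and function names are made up; the point is just that identical stacks are merged into counts, and each box's width falls out as that stack's share of the total samples.

    # Sketch of stack collapsing: raw profiler samples (one call stack each)
    # are merged into unique stacks with counts. flamegraph.pl consumes lines
    # of the form "func1;func2;func3 count". Sample data here is invented.
    from collections import Counter

    samples = [
        ("main", "parse_request", "read"),
        ("main", "parse_request", "read"),
        ("main", "handle_query", "sort_rows"),
        ("main", "handle_query", "send_reply"),
    ]

    folded = Counter(";".join(stack) for stack in samples)
    total = sum(folded.values())

    for stack, count in sorted(folded.items()):   # X axis: alphabetical, not time
        width = count / total                     # box width is proportional to samples
        print(f"{stack} {count}   # ~{width:.0%} of the graph's width")

The real scripts do this same aggregation on perf/DTrace/SystemTap output and then render the result as the interactive SVG.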

After the morning break, I attended the Women in Advanced Computing panel. Moderated by Rikki Endsley, with panelists Deanna McNeil (accidental sysadmin), Amy Forinash (programmer who became a sysadmin), Amy Rich (Unix SA and now manager), and Deirdre Straughan (communications), the session is geared towards women in advanced computing and is intended to discuss practical solutions and tips for recruiting and retaining women in advanced computing.

The first topic was career-related recruiting: How do you recruit women into these roles, since lots of teams are predominantly if not entirely male and they seem to be getting primarily male applicants? Where and how do you recruit to get more women? Deirdre has run events and has encouraged Hackpride Academy and an all-female code camp to send people. If you want to recruit women, go to predominantly-female events. Amy Rich notes that the more women you have the more you get, and it influences the culture and affects how you're treated and the respect you're given. We need to remember that these people are technical people first, so "Come join us because you're good," not "...because you're a woman." Write the job descriptions right. Amy Forinash says to encourage grow-your-own and to support summer internship programs (and look to recruit from those programs). Deanna notes that having different perspectives in the hiring process is an excellent idea.

Next they talked about generational, experience, and management differences. Any tips on managing up someone, or managing someone with more experience? Most folks who work for Amy Rich are younger than her but think she's in her late 20s/early 30s (she's not); the people she reports to think she's younger (she's not). She's always managed up regardless of the age of the person above her; it's all about communications and respect; listen to those who have things to say, and share your ideas, both upwards and downwards. In doing reviews, Amy Forinash does collaboration: Here's the goal, how do we get there, what did we do right or not in that process. Deanna's not been a manager but a mentor; she likes that she can help them grow without having to deal with consequences. Deirdre has little management experience and can't tell ages well so doesn't think of it much.

We shifted gears and Rikki asked how, in their careers, they made sure they were recognized without seeming aggressive (as opposed to assertive) or bitchy. Amy Forinash doesn't like being in the spotlight; if she does her job well you won't notice her. She once wrote a list of everything she did for her boss and it was 3 pages; now she writes things like "Now I'm spending 90% of my time on..." to let her manager know. Deanna does a lot of outside activities and tells people about those things as part of the conversation; her boss has been supportive, allowing her to flex some time when deadlines at work aren't fast-approaching. Amy Rich has impostor syndrome and also hates being in the spotlight; she's worked at places where IT (in general) has a reputation issue, so she uses that to improve IT's reputation and build bridges between IT and other departments, so each side talks up the other's successes and strengthens the company by improving the synergy.

What would they advise people, perhaps with impostor syndrome, to do to get out there more? Deanna would go with one-on-one private encouragement. It's important to write down what you're doing, summarize, and make notes — that tracks your successes. Join Toastmasters to learn how to speak in public and to put your thoughts together better, so you can submit paper and talk proposals. Knowing what you're doing and how to put thoughts together is the key. One secret is that nobody's good at it when they start, and practice is necessary; eventually it becomes reflex. Another is to have a co-speaker; Amy Rich once said something intentionally wrong to break her co-presenter out of his brain-panic. Bounce ideas off people who can help and encourage you, and do the same for others. Speak about something you love or feel expert on so you CAN be extemporaneous. Look for (or even organize) local meetups or user groups.

Someone from the floor asked how you make your voice heard when you think people around you don't value your opinion. Look for outside opportunities where people want to hear what you have to say. Strengthen relationships across the company so you have advocates elsewhere who'll speak up on your behalf. If those people are a mix of genders it may help. Also, if you're right all the time they'll eventually remember you were right and they might start listening to you. If you aren't being heard, try reframing it with the audience in mind (at the C-level for the CEO or the tech level for techies). Also, record yourself asking and answering the question, so you can make sure you don't do too much "um, like" or giggle or self-deprecate and so on. Rikki recommends Confessions of a Public Speaker.

Rikki notes that bio statements are a good place to brag, and women may be told not to brag. Deirdre uses Brendan as her go-to example; she's written bios for him. (Help each other by writing each other's bios.) She updates her resume regularly, has her boss review it ("Oh, you forgot..."), and bases her bio on that. Amy Rich likes having the external reality check. Amy F. agrees that updating the resume is important. Deanna likes seeing others' bios and using them as a basis for something similar; ask for a third-party view of your bio.

A GenXer is on his 5th or 7th career; he went off to get what became an English degree and has been reentering the tech field over the past 2 years. People look at this gray-haired person and assume he has more experience or skill than he really has. Rikki notes that career changes aren't uncommon. Amy F. notes that "I don't know how to do that, but..." can go a long way. Be upfront with your limitations to engender trust. Amy R. notes we tend to undervalue ourselves and our knowledge. As far as learning what you don't know, the LOPSA Mentorship Program may be helpful. Just go to them with something in mind ("here's the problem, I think X is the solution, is that right or am I going in the wrong way"). And remember you've been good at things in the past, so not knowing something now isn't a problem. Also, teaching something is a way to ensure you know the material. Don't be afraid to ask. (Women in technology should join the Sisters mailing list.)

Jessica runs a campus LISA event at UCSD, doing technical outreach. She's found that some events (like the weekly beer meetup) don't get a lot of women. Do you have advice on getting more women to social events? Deirdre notes that the overwhelming preponderance of men can intimidate some women, so if you can get women to come as a group of 2-4 that might help. Alternatively, try a women-only one. Possibly state a need: "I need X women to help solve a problem." And remember that not everyone likes beer... try coffee? Incentivize: This might be good on a resume, or find your next career/job, or....

Another questioner notes that how people take feedback varies, and some may hear only the negative or only the positive. Taking feedback requires listening to and processing both sides. In his experience, women tend to be better at processing both.

Morty asked for advice for men, as coworkers or leads or managers. Rikki thanks him for coming. Deirdre notes that regardless of gender, make sure that all the voices get heard (not just the extroverts', but the culturally quiet as well). When you encounter a woman in the course of your work, treat her as a real thinking human being. Pay attention to everyone's communications skills, habits, strengths, and weaknesses (in meetings, in email, and in person). Part of a manager's job is to point out the things their employees are good at; even if it's not related to the project at hand, hearing good things about yourself makes you stretch more and do better.

Chris notes that geekfeminism.org has a great wiki with an outline of issues that can help us learn and contribute. This year, most of the panel aren't managers. It seems that women may be promoted to managers (more?) often; has that played out? Amy F. refused to become a manager, as she has no interest in that and wants to stay a nerd. Deanna doesn't want to be a manager either because it's not her strength or gift. Amy R. used to be like that, but she had a manager who was never in the country and never reachable, and she has the personality to pick up things that are getting dropped and fill the power vacuum, so she became a double-hatter as both a manager and a sysadmin.

Deirdre asks if there's a way to do both management and technical work (be it sysadmin or code), perhaps splitting the HR/management duties away from the team leadership. Amy F. notes they do that at NASA; Deanna notes that promotion to management is the only way to advance. Amy R. had a past life where she wound up doing sysadmin work and technical leadership while the project manager would do mentoring and one-on-ones and so on. Someone on the floor is a manager and a techie; she manages (only) men. She thinks we're very focused on changing ourselves, but a lot of men wouldn't even give this a second thought; many of the men she manages have done technical work and are confident and competent, but the women who are here are different than the women at the pharmacy convention down the hall: The women here tend to be older. This is a tough business for women. Women tend to be better communicators.

Another advises parents to raise their daughters to be interested in more than just Britney Spears or Barbie or whatever. Amy F.'s mom was the touchy-feely sort but her dad had her soldering. Rikki is a single mom who's done plumbing and who's dragged her daughter to technical events. Fathers are bringing their daughters to tech events as well. Amy R. fell into the tech world because her father was technically incompetent. She also suggests that parents raise their boys to be respectful of women as well.

The people in the industry who are successful don't care what others think of them; they care about contributing something worthwhile.

Nicolai asked: if you have similar job offers, what would you look for in terms of culture and ethos to make a company attractive to work for? Are the hours flexible, and can you work from home? Amy R. spends 70%+ of the day in front of the computer, so even without kids she needs to do some stuff in the daylight. Did the job description and initial contact imply that there are women there already (maternity and paternity benefits, etc.), showing that the company is respectful of and considering issues relevant to women? "Wellness" is important — is there any interest in growth, supporting you to be the best human/person you can be? Is there training (for both technical contributors and management)? Google the executive leadership; what's the diversity (not just gender but other aspects)?

Is the culture campus-based (cf. Google, where people effectively spend 100% of their time there)?

As part of the recruitment process, HR at one questioner's organization removes names and genders from applications, but HR still writes the job descriptions, which imply they want particular personality types. He's pointed out that they're doing it, and it's had no effect. Amy R. suggests educating them with specific alternative language to work around those language choices.

Amy F. wanted to share two thoughts: If you want to see more women in tech you need to be more visible. People (regardless of gender), if there's a situation you don't like, take a step back and ask what you want. If you're happier emotionally in general you'll be happier at work and more productive.

Another speaker notes there's a cultural DISRESPECT for management in all fields, not just technology. How do you get past that when women are promoted into management but don't come back to the front lines? Amy F. loves her manager, who remembers what it's like to be a technical person. Amy R. is all about the interpersonal relationships with both the people who report to her and the people she reports to: Pets, kids, spouses, lives, career goals, and so on. Listen and ask questions (even dumb ones).

Do you think the BOFH concept will go away, and if so does that make things better or worse? It may, but there will still be assholes, so it won't really have much effect. You can learn more at home now than in the past, though, which will probably help.

Deirdre notes that Code Club is a UK-based project to get more young girls involved in STEM, so please take a look.

After the panel I went down to the vendor floor for both lunch and a quick tour of the booths. However, a $9 burrito and $3 soft drink didn't sound good, so I joined the group heading across the street through the drizzle to Hot n Juicy Crawfish. I had the crawfish po'boy lunch special, which was very nice, with cajun fries, which were very spicy without much flavor-add. Got back to the hotel with enough time to grab a dozen pens from Google to have swag for the coworkers when I get back.

At 2pm, since Brendan Gregg had given his Flame Graphs talk in the morning plenary slot, he instead provided a quick overview of the state of systems performance in 2013. It's more than the OS and kernel: It goes up through the application level and down to bare metal.

In the 1990s there were just closed-source OS-specific tools like iostat and vmstat with limited insight into the details. In the 2010s, everything is open-source, more dynamic, allows for visualizations not just text, and there are resource controls and methodologies out there. DTrace is a huge win.

In the second half of the first afternoon session, Branson Matheson spoke on social engineering in "Hacking Your Minds and Emotions." (We joked he should've called it "Hacking Your Flame Graphs," but after consideration decided that wasn't the best idea.) He showed us how we're routinely social-engineered every day, and gave us tools to both attack and defend.

Agenda:

Basics and explanations — When social engineering, you have to be the Bad Guy or con artist. Definition of SE: Make someone do something they wouldn't otherwise do. To do it well you have to think on your feet, be confident, know your limits, and have a sense of theatrics.

There are two basic types: large scale (society, politics, government) and small scale (individuals and groups). Small scale is where you get the win.

Basically you need to make the request seem normal, keep the goal in mind ("don't get lost in your own lie"), but be able to abort or revise as you go. Works best with a herd mentality.

Examples:

Walk-through — Identify target, do research (individuals need more work than groups), then plan it before doing it (but KISS!). After, evaluate your impact. Don't overreach. Lather, rinse, repeat; note relevant data and expand.

Tools:

Another example: Given only a surname, you can get a name, address, phone number, and the balance on a credit card.

Defense:

The break this afternoon had pretzels (with optional cheese sauce), though they were a bit too salty even for me. Nothing was in the second afternoon block that seemed of sufficient interest and relevance so, as I'd not slept either well or long enough the past two nights I went upstairs to nap instead. Didn't actually sleep, but lying down for a bit was probably good. Swapped out the shirt I'd been wearing for the conference bowling shirt before heading back downstairs for the reception.

The reception was bowling-themed. The Big Lebowski was playing on the big screen, tables had tiny bowling pins and balls, and there was a "strap into a huge metal contraption and have friends roll you into human-sized foam pins" area.... It was cute. Food included pizza, chili cheese fries, chicken-n-mozzarella sliders, crabcake sliders, and a salad bar, but no desserts.

After the reception I headed upstairs to the organizers' party in the Presidential suite. Unlike the Scotch BOF last night, I didn't have to open up the room and the party was catered. Food here included chips-n-hommus, veggies-n-dip, a cheese tray, mini crabcakes with a remoulade dressing, and small chocolate mousse shots; drinks included several bottles of wine, a variety of beer and soda, plus the scotch leftover from last night. Had some good conversations on a variety of topics. Meant to head to bed around 10pm but didn't get out until 10:30pm.


Friday, November 8

Slept in until 7am, finally. Showered and got downstairs in time for the first invited talk session on security. Dan Kaminsky spoke on "Rethinking Dogma: Musings on the Future of Security." Security has become a first-class engineering requirement, but it is not the only such requirement. In this talk, Dan considered various sacred cows in security and asked whether we'll still believe in them in a few years. Does the user model make sense now in a world of app servers? Are biometrics better or worse than passwords? He talked about the future of actually delivering security to our users.

He's a security researcher and likes speaking to non-security audiences. He's starting up WhiteOps.com: Pop a million systems, click a million ads, make a million dollars. What did he learn? Hacking and engineering differ only in their constraints: Hacking is without limits, ignoring constraints and not caring about side effects; engineering has lots of constraints and limits, and side effects matter. What if security becomes an actual engineering requirement, then? Historically it wasn't a requirement, so security was hacked in afterwards (with side effects like impacts on performance, reliability, debuggability, and usability).

When you're hacked it's always your fault. A recent study says that getting compromised is worse than getting divorced. The strongest predictor of security religion is whether there's been a compromise in the past. Universal comment: If the violator of norms can't be punished, the victim must be a special case, so we can believe we ourselves are safe.

Alas, we really do not know how to deliver security at scale.

The cybersecurity crisis has 3 simultaneous problems:

Engineering security requires no religion (lots of people will admit security has failed, but surprisingly few will question their core assumptions) and new science (data and instrumentation, just like everyone else).

Continuous deployment has won. It's build/test/hack/deploy/fail on a regular basis. We're only surviving because of extensive instrumentation. This means our systems aren't puzzles but games. We're shipping bugs and we know it, so patches are adaptations.

Attackers emit signals that are different from our operations. Hackers need to discover and work around our network; we know it, have a concept of normal operations, and have the advantage in detection (where and how). NetWitness is very successful at this. Once you're in, there are fewer things you can do. So the game is making the attacker pay the cost of making errors. Even when their assumptions are right, they're never sure and might get caught, and there are consequences if they're wrong.

The myth is that exploitation is instantaneous. In reality, this doesn't hold up. Flaws first have to be found, then used to figure out what you could potentially do, and then maybe used to do something. All of that produces signals we can detect.

There are 3 ways to look at a program: Look at the source, look at the compiled binary ('cause source lies), or look at the runtime ('cause binaries lie too, e.g., "the garbage collector will take care of this"). As our game, we need to link log entries to attacks that are beginning to succeed. We don't need to care about those who're merely trying; we only need to care about who's getting in and doing things. We can use errors and logs that don't correlate with changes we're making in our environment ourselves. As a corollary, we'll be finding more interesting things to capture.
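
As a rough illustration of that idea — my own sketch, not anything Dan presented — you can keep a record of your own change windows and flag only the log errors that fall outside them. The event data, time windows, and slack value below are all invented.

    # Sketch: surface log errors that we can't explain by our own changes.
    # Anything that survives the filter is a candidate attacker signal.
    from datetime import datetime, timedelta

    change_windows = [  # times when *we* were deliberately changing the system
        (datetime(2013, 11, 6, 9, 0), datetime(2013, 11, 6, 9, 30)),
    ]

    def explained_by_our_changes(ts, slack=timedelta(minutes=10)):
        return any(start - slack <= ts <= end + slack
                   for start, end in change_windows)

    log_errors = [
        (datetime(2013, 11, 6, 9, 5), "config reload failed"),      # during a change
        (datetime(2013, 11, 6, 23, 41), "unexpected setuid exec"),  # unexplained
    ]

    for ts, msg in log_errors:
        if not explained_by_our_changes(ts):
            print(f"investigate: {ts} {msg}")

In practice the errors would come from your logging pipeline and the change windows from your deployment tooling, but the filtering logic is the whole point.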

Honeypots are actually now more viable. Continuous deployment and virtual machines make it easy to deploy and reconfigure new machines automatically. Remember, we don't have to play fair.

Remember, however, control planes aren't immune to compromise (protocols, credentials, logged-in administrator machines, and so on).

There are 7 ARM chips inside an iPhone 5 — that's a network.

There's no such thing as a hardware bug. Drivers, sure. Kernels are trusted... but Hypervisors don't have trusted kernels, they just pass things through.

The black market's gotten huge. "Crimeware" has specialized, and reputations and payments for criminal activity are a solved problem.

Discussion on #BadBIOS — it's real.

White Hat hacker latency is too slow. Dan found a DNS bug and it took 25 years to get it fixed. Facebook and Microsoft came up with bug bounties; identifying problems to them gets you cash. Finding bugs isn't necessarily hard, but bug reports from external sources usually suck: They have no context for the code. Don't think that external bug finders are bad; think about how to invest in making them richer. Maybe give them cash (a financial incentive), 'cause it's in the vendor's ultimate interest.

asm.js might be a sandbox winner: It's an execution environment with no access to even the parts of JS that get hacked. Prediction: libtiff from 2005 run inside asm.js can get acceptable performance, and if it breaks, none of the breaks are exploitable.

Internet Bug Bounty prediction: We'll see a lot of new useful data at a higher rate.

What's he nervous about? The whole NSA thing, for example. See Bruce Schneier's talk above. He's worried about people's REACTIONS to the revelations; "2 stupids do not make a smart." He thinks we may see competitions on political FUD instead of technical merit. We want to keep net.neutrality, and we can appreciate the concept of keeping traffic within national or political boundaries. But we have to keep the technical aspects working.

Most crypto doesn't need the NSA's help to be a disaster; it can do that fine on its own for free. There's a reason we, the academic community, went to the NSA for help. It'd be nice if there were a government department focused on defense (Department of Defense?) or a conference dedicated to it (DefCon?). NIST's come back with good advice that researchers agree with, and the community says NIST is evil.

Larger issue: Crypto didn't actually work pre-Snowden. The problems aren't in the protocols or side channels, but in key management, user education, and implementation. It's more than math and code implementation; it should include operations — which is the hard and complex part. Dan declined to predict whether including operations in the work calculation will happen in 2014 (I think it won't).

Summary:

The Q&A session is perhaps best as bullet points:

After the morning break, I went to the "Futures and Emergent Technologies" panel moderated by Narayan Desai, intended to get people thinking about new technologies for building and running systems. On it, Rob Sherwood, Mark Cavage, and Nadav Har'El gave an overview of what they're working on now and how it's likely to affect computing and administration over the next few years.

[For the third year of the panel, the fact that there's no description in the conference program is unacceptable and unforgivable. Shame on Narayan, especially as the session chair.]

Nadav developed a new operating system, OSv, designed specifically for VM guest systems. Rob is working on SDN. Mark is at a cloud computing company to bring Unix to big data via OS virtualization. All three gave a talk, either yesterday or today. Narayan notes there are two kinds of approach: Either bring a new style or environment to an existing domain, or bring a new model to an existing space.

Nadav says it'd be nice to throw everything away and start over, but audiences won't use the new thing until they know it's useful, so they started by allowing unmodified Linux binaries to run on guest VMs running OSv. Rob's SDN exposes RESTful APIs and a web GUI to make developers and operations both happy; more and more people are realizing they need to be doing more DevOps-y things on the network stack. Mark's a firm believer that Unix is the right answer; Unix is pretty much the last major OS and killed OS research (Rob Pike?). Their ethos is to innovate at the Unix layer; their goal is that people don't have to retool.
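
For a sense of what "RESTful APIs for the network" looks like in practice, here's a hedged sketch; the controller URL, endpoint, and JSON fields are hypothetical rather than any specific product's API, but the shape — POST a JSON flow rule to a controller — is the general idea.

    # Push a (hypothetical) flow rule to an SDN controller over REST/JSON.
    import json
    import urllib.request

    CONTROLLER = "http://sdn-controller.example.com:8080"  # hypothetical

    flow_rule = {
        "switch": "00:00:00:00:00:00:00:01",
        "name": "allow-web",
        "priority": 100,
        "eth_type": "0x0800",
        "ipv4_dst": "10.0.0.80/32",
        "tcp_dst": 80,
        "actions": "output=normal",
    }

    req = urllib.request.Request(
        CONTROLLER + "/flows",                      # hypothetical endpoint
        data=json.dumps(flow_rule).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        print(resp.status, resp.read().decode())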

Narayan notes that they're all focusing deliberately on scale. Shrinkwrapped software ignores scale, in general; their approaches all assume large scale. In their wildest dreams, if they're successful tackling these problems, what does the world look like in a couple of years? Rob's main limit is the admin's brain trying to kludge it all together; in the future, this technology can let sysadmins sleep through the night. He wants network admins to have help in managing the network in an automated way, instead of being the people who just say No. Mark thinks they're pushing towards a container-centric view of the world: Pull up a rack or a truck with POWER and PING plugs and voila! Data center as a point of presence. This is a shift from where we are now.

One point was that everything has to be at scale. Narayan thinks we need new and better abstractions; one strange thing about SDN is that the model is more complicated than the traditional network model. Rob says the technology of SDN has a lot of potential to simplify things eventually, but right now it's not a mature enough technology. For example, isolating via VLAN or tagging should be an implementation detail and not a technical issue; the abstractions today aren't there yet. Mark says that clouds are scale, and the dirty secret is that you need to rewrite your applications and not just drop things into the cloud.

Moving to the cloud is not as simple as using a single server; there's a cost involved in the change. Some of the changes can be mitigated, and sysadmins shouldn't worry about their jobs going away, since there're still problems to solve.

So how do we change the economics of problem domains, and what are the new kinds of problems we can solve in the next couple of years that we can't solve (or even work on) now? Network monitoring on top of SDN. Converged data and converged compute with high-speed backbone networking open up a lot of things people haven't thought about, like video feeds of retail stores for security. Rob mentioned restructuring the address space, as in ethernet addresses with a netmask, letting you run at lower levels of the stack to precisely manage a network with a protocol that doesn't look like IP (v4 or v6) or anything else. Once the costs are down, we can look more into how we move stuff around.

Nadav says we should create new abstractions that are relevant, and get rid of the ones that should die. Which ones should die? Spanning tree! A loop-free broadcast network would be nice. Mark says XML-RPC and OASIS should go away. More seriously, a lot of the abstractions are still useful; HTTP 2.0 will be good for most things but has some regressions.

We're moving towards a world of interconnected data, where we have to analyze it, move it around, and have atoms that can interact without performance impact. We don't want our lightbulbs to need full configuration management. Economies of scale now make possible the stuff that used to be impossible; a decade ago, an online game (MMORPG) would have been impossible to plan for: What's the initial demand and how do you scale it? The cloud has changed that model; with little capital investment you can spin it up and scale up as necessary within one cloud service, and you can expand to others as needed. Mark notes that complex mobile applications are hard; a lot of things assume either zero or ubiquitous bandwidth.

There's a pendulum in history: Mainframe terminals, up to peer-to-peer, and we're swinging back towards the former. There's no single stable equilibrium point, though.

Doug Hughes loves his Nest — he's got a cloud-controlled thermostat and water heater in the house he's renovating... But what if the provider discontinues the service? How usable are the devices if that happens? Rob calls that a single point of failure problem (economy, buyout, hack, and so on), and it's a huge problem for things in this space. Fortunately, with open source this becomes less of a problem — at least for us, but what about for the consumer market? There's no real answer here.

Another comment from the floor was about their individual products: Network administration is awful, and the network admins have no desire to see it change; the people who want SDN are folks like us who aren't "allowed" to configure the network. With OSv, people are comfortable with Linux, so why take it away and give us a new OS? Rob said that in his mind, some networks are so complex that even good netadmins can't run them (cf. Facebook, Google, Microsoft).

Scaling small systems large is a pain; taking a large-scale design and scaling it down can happen. MikroTik's home routers support SDN. Will we move towards "Just use SDN because it's easy"? Narayan notes some of the mailing lists are working on OpenFlow support on the Linksys firmware. Home hobbyists are helping drive the solutions on the low end of the market _first_. It's almost a race to the middle, with both the high and low ends working on it. OpenFlow was ported to OpenWRT early. People are focusing on the big guys and data centers because that's where the fastest benefit is; the home market is more lucrative but will take longer to adapt.

The days of going without automation or scripting are gone. At least you'll need a for loop if not a full-on language (like sh, pl, py, or rb).
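
In the same spirit, the "at least a for loop" baseline might look like this trivial Python sketch (the hostnames and the command are placeholders; real automation would add keys, timeouts, and eventually a proper tool):

    # Run the same command across a list of hosts — the minimal automation loop.
    import subprocess

    hosts = ["web01", "web02", "db01"]  # hypothetical hosts

    for host in hosts:
        # ssh out and check uptime; once this stops being enough, reach for
        # something like Fabric or a configuration management system
        result = subprocess.run(["ssh", host, "uptime"],
                                capture_output=True, text=True)
        print(host, result.stdout.strip() or result.stderr.strip())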

What technologies should people be paying attention to but aren't (in addition to theirs)? RHAT's ARM servers. The Raft paper. Accessible consensus systems in the next 18 months. The VL2 paper. New hardware technologies (FPGAs, hardware accelerators). Open source general management tools. Narayan is excited about the changes in the hardware space. Rob is looking at the shift between major players; CSCO is in the systems space, and IBM and DELL moved into the network space via acquisition, so the price per { bit, byte, CPU cycle } will hit the floor.

Rob asked what we want to see that doesn't exist yet. The audience responded with "a money tree," at least from a couple of people. More seriously, you used to buy and install an application that had commands, with other interfaces optional. Today, APIs have grown, and now with REST and JSON the command itself is almost a myth. The quality of the abstraction is what matters, for both command line and API. At a deep level we need concurrency. Commands tend to be local and APIs tend to be remote (though there are of course exceptions).
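
To illustrate the "commands are now thin wrappers around remote APIs" point, here's a hedged sketch of a tiny CLI whose only job is to call a hypothetical REST endpoint and print the JSON it returns; the service URL and fields are made up.

    # A local "command" that does all its real work remotely over REST/JSON.
    import argparse
    import json
    import urllib.request

    API = "https://api.example.com/v1"  # hypothetical service

    def main():
        parser = argparse.ArgumentParser(description="status of a managed host")
        parser.add_argument("hostname")
        args = parser.parse_args()

        # The command line is just the interface; the API does the work.
        with urllib.request.urlopen(f"{API}/hosts/{args.hostname}/status") as resp:
            status = json.load(resp)
        print(json.dumps(status, indent=2))

    if __name__ == "__main__":
        main()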

For lunch, Adam, John, Mario, Mark, Travis, and I went to Italian Pizza Kitchen. Adam, John, and I split a large Italian pizza (pepperoni, sausage, pancetta, prosciutto, and red onion).

In the first afternoon session, I went to the security talks. Zane Lackey and Kyle Barry were up first with "Scaling User Security: Lessons Learned from Shipping Security Features at Etsy." Etsy is an e-Commerce company and the talk is about user-facing security features they ship. We should learn from their epic failures.

Etsy wasn't built from the ground up for full-site SSL. Whoops. In 2012, they couldn't use SSL beyond their (inbound) load balancer due to licensing issues. That meant forcibly redirecting down to HTTP for anything that wasn't an admin-specific page, with rules stored in the load balancers to force SSL for the admin-specific pages. They recognized this was not ideal, so they planned to fix the load balancer licensing, go HTTPS everywhere, and get the HTTPS management into git. Roadblocks included content delivery networks (CDNs) that didn't necessarily support SSL (so they had to be replaced or renegotiated), some servers not handling SSL well (a 10x performance hit), and having to use less CPU-intensive algorithms (which may not be as good for security). Their escape hatch was to allow users to turn off whole-site SSL, in part because they weren't sure how mobile (vs. desktop) clients would handle it. They watched their metrics closely (especially conversion, bounce rate, pages per visit, and page performance) to make sure they weren't negatively impacting users; all the metrics got better after the rollout. They also added the HSTS header (enforce HTTPS, and only HTTPS, for the entire connection), turning HSTS off when users log out of the site, and had to move to secure cookies.
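
As a rough illustration of the HSTS-plus-secure-cookies piece (not Etsy's code; a stdlib-only WSGI sketch with a placeholder app and cookie), the header addition can live in a small middleware:

    # Add Strict-Transport-Security to every response and mark cookies Secure.
    from wsgiref.simple_server import make_server

    def app(environ, start_response):
        headers = [("Content-Type", "text/plain"),
                   ("Set-Cookie", "session=abc123; Secure; HttpOnly")]  # placeholder cookie
        start_response("200 OK", headers)
        return [b"hello over https\n"]

    def hsts_middleware(wrapped_app, max_age=31536000):
        """Tell browsers to use HTTPS, and only HTTPS, for the next year."""
        def wrapper(environ, start_response):
            def custom_start_response(status, headers, exc_info=None):
                headers.append(("Strict-Transport-Security",
                                f"max-age={max_age}; includeSubDomains"))
                return start_response(status, headers, exc_info)
            return wrapped_app(environ, custom_start_response)
        return wrapper

    if __name__ == "__main__":
        # In production this would sit behind an SSL-terminating load balancer.
        make_server("", 8000, hsts_middleware(app)).serve_forever()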

Browser trick: Use protocol-relative URLs ('href="//whatever"') so links keep the current page's protocol intact.

After September, everyone is fully SSL'd. There are a couple of minor things left to do: Handle the HTTPS-to-HTTP downgrade, make HTTPS canonical, and ask the haters (the people who turn off the whole-site SSL setting) why, so they can fix those problems before removing the setting. The whole process has taken over a year.

For two-factor authentication (2FA), they had some guidelines in mind (SMS (with multiple providers worldwide), voice, enrollment, international users, backup codes, admin tools). They're monitoring things sanely. Some other lessons included:

Zane took over and talked about some protocol and operational issues next:

Bug Bounty program lessons learned:

In the Q&A session, someone asked in what name spaces they run their CDNs; the speaker wasn't sure. The bounce rate went down after SSL was implemented: Why? They think because people felt safer, but they're not entirely sure. It was rolled out just to signed-in users, so the redirects aren't happening. Takeaway: Graph everything. Finally, someone thanked the speakers for talking about what went wrong! It's a huge benefit for the community.

Jennifer Davis was up second with "Building Large Scale Services." The talk is specific to her experiences and the stuff in her bubble may not be something we agree with. She notes this talk is more anecdotal than technical.

The reality is that people don't all know the same things, and communication is hard, so systems can be fragile. As part of this, we need some core principles, both common things (collaboration across teams, companies, industry, and standards; incident, problem, change, config, and release management) and distinct things specific to an application or service. We have to kill the myths ("stupid user" is one of the worst). Modulate your message to the audience, both text and images.

Within a team, and between teams, treat others as individuals ("Elaine," not "the Devs"). Understand the vision. All of this goes into her job definition: Examine the thing, define the risks, communicate their costs, mitigate them, and then determine how to handle incidents.

Checklists aren't just for automation and don't imply that you're dumb; they enable you to see what's different or changed and can direct you to the anomalies and outliers so you can fix them. You need to know the components, the protocols spoken between them, and the expected and unexpected inputs and outputs. You also need to know the state transitions (is it installed but not ready for production? What if it's being retired?); go through What If? situations and scenarios and document them. Know your choke points (memory, disk, bandwidth, CPU)... now and in 6, 12, ... months.
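
A hedged example of turning the "know your choke points" checklist into something executable — the thresholds and the checks themselves are illustrative choices, not recommendations from the talk:

    # Tiny choke-point check: report anomalies against documented thresholds.
    import os
    import shutil

    CHECKS = {
        "disk_free_pct_min": 15,   # alert if less than 15% free on /
        "load_avg_max": 8.0,       # alert if 1-minute load average exceeds this
    }

    def run_checks():
        problems = []

        usage = shutil.disk_usage("/")
        free_pct = usage.free / usage.total * 100
        if free_pct < CHECKS["disk_free_pct_min"]:
            problems.append(f"low disk: {free_pct:.1f}% free on /")

        load1, _, _ = os.getloadavg()
        if load1 > CHECKS["load_avg_max"]:
            problems.append(f"high load: 1-min average {load1:.2f}")

        return problems

    if __name__ == "__main__":
        for p in run_checks() or ["all checks passed"]:
            print(p)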

Failure will happen. "Give me the brain" documentation so anyone can be the brain helps with repeatable and reliable failure handling. You also should run fire drills — you can't prepare for EVERYTHING but once you've prepared for SOME things you can move beyond that initial reaction of panic.

Post mortems should be recast as retrospectives; the term change helps avoid the blame game.

All of this leads to success at scale: You need to collaborate and cooperate across teams.

You might consider letting things fail earlier, so the thing breaks before it becomes a crisis. Failure is often hidden in success. Documentation doesn't guarantee understanding, and it should happen during the process, not at the last minute at the end. Remember that the operations folks may not read-to-understand the documentation in the moment of crisis.

When you join the team, think about what things will look like when you leave.

(Her slides are here.)

The final technical session of the conference opened with final thanks from Skaar and Narayan. Todd Underwood gave the closing plenary, "PostOps: A Non-Surgical Tale of Software, Fragility, and Reliability." He did ask for questions (at the mic) during the talk itself. He had memorable lines, like "So far we've learned I hate humans and I hate computers. Are you still with me?" He was a dynamic speaker and kept having to unlock his display. As a result of all of this, my notes are flaky at best.

His call to action: Administration is bad and done, and DevOps is a bandaid; we need to move beyond Administration and Operations. We should look at the last 20 years of history in sysadmin with something other than nostalgia and reverence.

Despite being cranky about operations, the culture and practice has many admirable characteristics we must not lose:

(His slides are here.)

Went to dinner with Geoff, Peg, Ruth, and Trey (and briefly MikeC, though he had to run away early) at Bistrot Du Coin in Dupont Circle. Had a crock of French onion soup, then the moules au pistou (mussels with pesto, prosciutto, and French ham, in white wine) with a side of frites, and the tarte aux pommes Normande (with Calvados). OMG delicious, and considering the quantity and quality, perfectly reasonably priced (low enough that I made my per diem today).

When we got back to the hotel I dropped my jacket and sweatshirt back at the hotel room and headed over to the Dead Dog party. Some bourbon, some rum, some water, some schmoozing, and eventually a lot of goodbyes.


Saturday, November 9

Alas, it's time to return home. Woke up around 7am; checked the email, shaved, and did most of the packing; this is the first time in years I hadn't done ANY packing before going to bed Friday night. Once everything except the toiletries and today's clothes was packed, I jumped in the shower so I'd be ready to head out to breakfast. Eventually met up in the lobby with 9 others and headed off to Open City. I wound up at a 4-top with Branson, John, and Nicole (Lee, Peter, Steve, Tim, and 2 others whose names I didn't catch were at the other 3- and 4-tops); I had a bacon-n-cheddar omelette, hash browns (done as a large brick as if it'd been baked), and marble rye toast. At this point, as long as I don't eat lunch or dinner, I've made my per diem for the week. (Yeah, it's not gonna happen.)

Got back to the room, finished packing up (I'm missing two things: The 3-to-1 power adapter from the CPAP bag, which might be hiding in the big suitcase but I'm not about to unpack it to check, and the Delta "wings" pin from being a good sport about the flight attendant giving me a hassle about the sweatshirt on the flight to DC), and typed up and sanitized more of this trip report.

After checkout, I hung out in the lobby bar with a few other attendees, catching up and chatting, until it was time to head out to the Metro for the ride to the airport. Put the right amount on the card so it was at a $0 balance on my last ride. Just missed the Red Line train so had to wait for it, then had a 20-minute wait for a Yellow Line train to the airport; got to the airport without further incident. Despite some others' complaints about lines, there was literally no line at security for DCA (terminal B gates 1–21), so I grabbed a Food Court late lunch around 3ish.

The flight itself was full and uneventful. Landed on time, got to the gate quickly (at the very end of the terminal), had to wait 3 minutes for the tram to the middle of the terminal to get to baggage claim, waited only a short time for my bag, and schlepped it to the terminal shuttle stop. It took an unusually long time (over 20 minutes) for a shuttle to appear, but it eventually got me to the Big Blue Deck where I headed back to my car (and did a minor oops: I have to go to zone E before heading down to level 3; if I go down to level 3 first I can't get to zone E at all). Got home without major remaining incident... and reset the clocks (car, stove, microwave, and alarm) that I wasn't home to reset when DST ended.




Last update Feb01/20 by Josh Simon (<jss@clock.org>).