Josh Work Professional Organizations Trip Reports Conference Report: 2021 LISA

The following document is intended as my general conference report for the 34th Systems Administration Conference (LISA 2021), held virtually from June 1–3, 2021. It is going to a variety of audiences, so feel free to skip the parts that don't concern or interest you.


Tuesday, June 1

Opening Remarks

The conference proper began with the usual opening remarks: Thanks and notes (the sponsors, program committee, liaisons, staff, and board). They briefly mentioned the code of conduct. We were reminded to go to the vendor sessions after the conference sessions proper, though due to the time difference I declined. (The conference was held 11am–6pm in my time zone, and I still worked a few hours before it began, so I didn't have time; instead, I had dinner.)

The conference chairs did share some thoughts about future conferences.

Keynote: Beyond Firefighter vs. Safety Matches: Growing the DevSecOps Pipeline

Our first keynote was from Amelie Erin Koran from Splunk. She asked if we're putting too much emphasis on a given role or title, and if we are putting a bottleneck in place.

DevSecOps lets you spread the dev, ops, and sec roles, skills, and responsibilities across multiple people - there's still some specialization. (Q: Bringing dev, ops, sec, net and other disciplines back into the same fold seems to suggest we need to create more generalists in the future. Is this a fair assumption? How can those folks continue to grow their careers over time? A: Yes... however, don't over-rely on generalists... this is the caution. My original talk noted that for security, scaling works best for generalization for a lot of the blocking and tackling, but still reserving some resources to hire and utilize specialists... carefully.)

The government thinks there'll be a shortfall of security people (skills gap). Having dev and ops take over is possible, but what's the learning curve, and are they even interested in moving over to sec to begin with?

Q: Education is one of the things we've struggled with in this industry for decades. What are some of the effective ways you've found to improve this? Where should the industry be focusing energy?

A: It seems for security, at least, they are trying to strike at college and work down to elementary education to address the pipeline, but for coding skills, it's the other direction, getting coding skills early in life...

What has and hasn't worked?

Gap analysis — identify what's missing and then identify (or design/develop) solutions. Do you need a generalist? Highly-specific skills? Document what you've done and train your successors or the staff remaining when you leave.

Balance work and talent: Where are you, where do you need to be, and how do you get there from here; tactics, techniques, and procedures.

Q: One argument for the "generalist" model is that folks develop multiple skillsets and then experience high degrees of career mobility. What are the long-term career benefits of this "specialist" role model?

A: For some people it fits what they want to do well (especially if you like a deep dive). For most people though it doesn't fit what they want to do; it breaks mobility if the specialty disappears.

Keynote: Lessons Learned from a Ransomware Attack

Our second keynote was by Ski Kacoroski of the Northshore School District. They got hit by a ransomware attack; such attacks are becoming more and more common due to small IT staffs, bug bounties, and security not being a priority for the district. In their case, Emotet was installed in Mar 2019 and the data auctioned off on the dark web, then Trickbot was installed in Jul 2019 and the data auctioned off again, and finally Ryuk was installed in Sep 2019. The attack itself started at 11:37pm Friday, September 20, to minimize reaction time since payroll was going to run Tuesday, September 24.

An additional complicating factor is that their normally-airgapped backups weren't actually airgapped because they were syncing backups — so it was all corrupt. Their NAS was okay and the remote backups were too, but all of the Windows servers and AD domain controllers needed to be rebuilt from scratch.

Lessons learned include:

The root cause analysis identified several causes: Understaffing, security shortcuts, insufficient antivirus software, backups not airgapped, crufty systems, eggshell security using only firewalls, inadequate monitoring, and not enough game-day planning (earthquake yes, but ransomware no).

Some saving graces: 40% of services and 95% of workstations were okay, NAS snapshots were untouched, the databases were backed up to NAS, and they could recover SIDs and CNs.

Some advantages of being attacked: It gives them the ability to make changes (training, 2FA, upgrade OS/patching, get rid of very old files).

Q: What would you do differently?

A: Better AD backups (cloud-hosted and so on). Plan better up front to reduce repeated work. Working with contractors again would be faster as they're familiar with it already.

Q: The way the insurance company was involved (forensics and so on) was interesting. Were you ready for it?

A: No, it was amazing. Their demands were specific and they wanted things done sooner. The impact was surprising. Cyber-insurance — especially now that more organizations are being hit — is now more strict about what's required for coverage and then payment after an incident.

Q: Was there anything else surprising?

A: The surprise systems. Ski had been there for 17 years, and yet nobody knew the food POS systems had local storage, nobody knew they handled $30,000/day in sales, and some systems are very time-sensitive (door locks, school bells) or weather-sensitive.

Q: There's a lot of Men In Black neuralyzer here. Did the incident response folks help you retain the lessons? How involved were they?

A: No. They're responsible to the insurance company, they say what you need to do ASAP, and there's no written report afterwards (since it's discoverable); all you get at the end is an oral report. It may have recommendations — but some of those won't be possible ("hire six more staff"). They only care about plugging the old hole and preventing it from happening again.

Q: Has this changed budgets?

A: Yes. Security is now a top priority. They had two budgeted positions the district wasn't allowed to fill, but this let them fill them.

Q: Can you share requirements of the audit so people can plan?

A: Yes. You have to have anti-malware documented and on all machines. Backups must be airgapped. Ask any cyberinsurance company and get their form.

Q: Have procurement processes changed?

A: Not enough. Instructional or a department will find something they like and IT is only brought in at the end. Now IT can push back, e.g., requiring SAML login. It's still an ongoing negotiation... but IT will lose the battle if the department or Instructional is pushy.

Q: Would a table-top exercise have helped see what the critical points were?

A: Maybe. It's only good for what you know (e.g., the POS system that only Finance knew about). If other departments were involved it'd've been better.

Q: Bringing consultants onboard?

A: They got there the day after they called: Hit early Sat, called Sat afternoon, had people Sun morning.

"Disorganizing" Your SRE Organization

After the morning break I went to the invited talk by Leonid Belkind, CTO of Stack Pulse. What does it mean to disorganize the organization?

How'd they do it?

What did they learn throughout the process? Mostly about the human part of the process; the technology more or less just worked:

  1. Accept the new normal.
  2. Build to the individual. Trying to keep "everyone in the room" with Slack, Zoom, Discord, and so on doesn't work; it leads to increased fatigue, poor responsiveness, and low morale. Balance critical need with personal preference (introvert vs extrovert).
  3. Explicitly build culture. Directly share that you're working on culture as a project, define changes in day-to-day responsibilities, and build opportunities for informal interaction that use different formats.
  4. Terminate loops locally. Redivide responsibilities, empower with playbooks as documentation, 4-eyes verification only for critical issues, and measure performance and share with the team.

Six months in, enrichment and RCA were over 60% automated, MTTR was reduced by 35%, playbooks were used to manage incidents from creation to post-mortem, and over half the team had led incident response.

Over a year in, playbooks as code allow for uniform processes and don't fail to deliver; 85% of the steps are automated.
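
The "playbooks as code" idea can be sketched in a few lines of Python. This is only an illustration of the concept — the step names, alert fields, and structure here are hypothetical, not StackPulse's actual format:

```python
# Each incident-response step is an ordinary function, so the process is
# uniform, versioned, testable, and partially automatable. All names and
# fields below are illustrative assumptions.

def enrich_alert(alert):
    """Attach context (here, a hard-coded owner) to the raw alert."""
    alert["owner"] = "team-payments"   # in practice, from a service catalog
    return alert

def triage(alert):
    """Decide severity from the enriched alert."""
    return "sev1" if alert.get("error_rate", 0) > 0.5 else "sev3"

def run_playbook(alert, steps):
    """Run each step in order, recording results for the post-mortem."""
    log = []
    for step in steps:
        log.append((step.__name__, step(alert)))
    return log

log = run_playbook({"error_rate": 0.7}, [enrich_alert, triage])
assert [name for name, _ in log] == ["enrich_alert", "triage"]
```

Because the playbook is just code, the same steps run identically every time, and the execution log doubles as a timeline for the post-mortem.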

How would someone else get started?

  1. Be open with your teams. Explicitly explain that the organization is embarking on a journey to change its culture.
  2. Identify individuals that are passionate about it and involve them in leading the efforts.
  3. Let the teams drive choices of automation tools. Technologists enjoy solving problems with tools much more than they do with manual processes. Tools do matter.
  4. Don't assume that people will tell you how they feel or how confident they are. Constantly monitor the "soft" metrics.

Everything We Did Wrong to Do Accessibility Right at BuzzFeed

My next talk was by Plum Ertz and Jack Reid of BuzzFeed. When working to make the site fully accessible they learned a lot of lessons in communication, logistics, and bravery. Ideally they'd've built accessibility in from the beginning (2006) and not when they did (2018). They didn't set themselves up for a good partnership with their external auditor. There'd been engineers who tried locally to push it, but nothing holistic. Lawsuits (or threats thereof) meant they could pay for experienced external auditors.

The developers and the auditors had little direct contact with each other.

They got a leadership mandate but not really buy-in. Business-critical priorities prevented staffing to get a head start before the results were ready. So they got the report (around 400 issues, from easy to systemic), but nobody to work on it. They chose to divide and conquer and put things on existing teams as much as they could... except a lot (most) of the work wound up in the catch-all one-person week-at-a-time team and not any of the other teams.

They manually imported everything into Engineering's JIRA rather than the third party's system; since the other teams were involved, it had to be in JIRA like the rest of the work. Keeping the two systems in sync was manual. (They should've required API access in the contract.)

They asked a lot of unprepared people to do a lot of stuff, adding things to their roadmaps.

They didn't build flexibility into the plan. BuzzFeed laid off 15% of its workforce in Jan 2019. The silver lining was that people were looking for things to do, and Accessibility had a big backlog of tickets. But they were too casual about turning it into a formal project and getting it on the roadmap, waiting on the bigger picture from above.

They told the story in terms of dollars instead of sense. Accessibility is a long-term investment and not a quick win. They were perhaps so focused on specific KPIs that accessibility wasn't a priority. Leadership wasn't disabled and thus not that interested. They should've treated accessibility as foundational and not as a feature. (The engineering cost would be the same either way. Project management and the auditors were overhead expenses. They were sending things to the auditors piecemeal, which burned through their hourly budget faster.)

Another mistake was asking for permission not forgiveness, especially for the legacy code (which may or may not have had owners). They wound up flipping the script and making the change (and nobody objected).

They saved the hard stuff until the end because of working piecemeal. (ALT text for images was the big problem.)

On July 10, 2020 they got their letter of compliance. Accessibility isn't one-and-done but a way of thinking.

What did they do right?

Kind Engineering: How to Engineer Kindness

After the lunch break I went to the Kind Engineering talk by Evan Smith of Solvemate. He started by quoting Tanya Reilly, "Kind is about being invested in other people, figuring out how to help them, meeting them where they are."

His talk had four major areas.

Code Reviews

Tone matters. Understand the WHY, not just the WHAT and HOW. Assume positive intent and intelligence; assume instead that you're missing something, and ask open-ended clarifying questions, but not aggressively or challengingly. In the context of code review, consider prefixing nit-picky comments with "nit:" or something similar. But is the nit-picking a sign of a larger problem (perhaps something, e.g. formatting or indentation, that can be fixed automatically)?

Know when to switch from asynchronous communication like a code review to synchronous communication. The latter can be private, and public criticism is hard. Is there something one side or the other isn't seeing? Talking it out in real-time may make it easier to clear up.

Honesty

Be more than professional. Care about people and bring your whole self. Include the positives when you challenge someone (even though people tend to distrust praise). Admit when you're wrong. White lies aren't evil but they don't help.

Note the difference between NICE ("Good job in the meeting") and KIND ("Your answer was rambly and you missed the opportunity to convince the team, but it's a good idea so practice your elevator pitch").

You're building rapport with people, fostering the connections between people to build trust.

Psychological safety

Feedback

Give and receive both positive and negative feedback.

Giving has three steps — emotion, credibility, and logic.

Receiving:

BPF Internals

I switched between IT tracks and went to Brendan Gregg's "BPF Internals" talk next. It was mostly kernel internals and while I followed along I didn't take notes.

Lightning Talks

After the break I went to the Lightning Talks. There were four:


Wednesday, June 2

Plenary: Computing Performance: On the Horizon

Wednesday began with Brendan Gregg's plenary where he gave a performance engineer's views about industry-wide server performance and some predictions in several areas.

Processors

Clock rates increased over the years... but practically we're maxed out at 3.5 GHz clocks. There are exceptions, but we're scaling horizontally (more cores, more threads, more server instances)... which puts pressure on the interconnect rate (over the past decade we've increased the bus rate 3.25x but core count 6x). Side effect: CPU utilization counts "busy" even if stalled/waiting. Lithography is getting tinier: A silicon atom is 0.1nm, and lithography is down to ~2–3nm... except that's a marketing term with no reference to reality. Expect to hit the limit by 2029.

(Chip shortage may last into 2023. Oi.)

Cloud chip race, too — Amazon ARM/Graviton 2 is in production already.

And now we have GPUs, FPGAs, and TPUs to accelerate processing depending on your workloads. Remember to monitor them too.

Predictions: Multisocket is doomed and will become an edge case. We're seeing 80–100 cores on a single socket plus cloud-based horizontal scaling, so why pay NUMA costs?

Simultaneous multithreading (hardware) future is unclear: Performance variation, ARM cores competitive, and after Meltdown/Spectre they're turned off a lot.

Core count limits — more general-purpose cores max out memory, kernel/app lock contention, power consumption, and so on, so there will be a de facto practical limit.

More vendors meaning more choice, but be careful about optimizing for the benchmark.

Cloud CPU will have an advantage; vendors have >100k workloads to analyze directly and can use that to aid processor design, possibly with machine learning to help.

FPGA won't be adopted widely (beyond cryptocurrencies) until there's major app support.

Memory

Many workloads are memory I/O bound. DDR5 (coming out this year) has a faster bus but needs processor support. Samsung has a 512GB DIMM. DDR latency hasn't changed in 20 years because they use the same 200MHz memory clock. Lower-latency DDR exists but it's not seeing widespread server use. Expect high-bandwidth memory (HBM) to grow: It's used by GPUs now, can use 3D stacking, and can be provided on-package (on the CPU itself).

Expect another memory tier below DRAM but above SSD/HDD.

Predictions: DDR bandwidth will double every ten years. Don't expect single-access latency to drop in DDR6. DDR5 can get up to 2x wins based on workload. HBM-only servers may happen, especially in the cloud. The extra memory tier is probably too late to market.

Disks

Perpendicular magnetic recording and multi-actuator technology will help performance. Shingled magnetic recording gets 11–25% more storage but gives worse performance; good for archival workloads.

Flash memory disks have gone through a lot of technologies, which affect the block erase cycle limits. SSDs have their own performance pathologies.

Storage interconnects are getting faster (and the specs include reliability, power management, virtualization support, and so on).

Predictions: Slower rotational disks for archival workloads. 3D XPoint will work as a rotational disk accelerator and as petabyte storage. More flash pathologies: Worse internal lifetime, more wear-leveling and logic, and more latency outliers.

Networking

The latest hardware uses 400 Gb/s now, and 800 Gb/s is coming. Protocols and TCP congestion control algorithms are changing. Things are getting more complex as more tunable performance features are added.

Predictions: BPF in FPGAs, with massive I/O transceiver capabilities. Cheap BPF routers with commodity hardware. More demand for network performance because apps need more and more network.

Runtimes

Predictions: FPGA as a compiler target. io_uring I/O libraries. Adaptive runtime internals. 100-core scalability support.

Kernels

io_uring uses shared ring buffers for faster syscalls (batched I/O). eBPF everywhere (incl. on Windows!). eBPF == BPF. The BPF future will include event-based applications running in kernel space but sandboxed. Emerging BPF uses include observability and security agents.
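
The shared-ring-buffer idea behind io_uring can be sketched in miniature. This toy model (in Python, purely for illustration — the real interface is a C kernel API) shows only the batching shape: many requests are queued, then one "enter" call processes them all, instead of one syscall per operation:

```python
from collections import deque

class ToyRing:
    """Toy model of io_uring's submission/completion queues (not the real API)."""
    def __init__(self):
        self.sq = deque()   # submission queue: requests waiting for the kernel
        self.cq = deque()   # completion queue: finished results

    def submit(self, op):
        self.sq.append(op)              # no "syscall" yet — just queued

    def enter(self):
        """One 'syscall' drains every queued request in a single batch."""
        while self.sq:
            op = self.sq.popleft()
            self.cq.append(op())        # run it; result lands in the CQ

ring = ToyRing()
for i in range(3):
    ring.submit(lambda i=i: i * i)      # queue three operations
ring.enter()                            # one batched round-trip for all three
assert list(ring.cq) == [0, 1, 4]
```

The performance win in the real kernel interface comes from amortizing syscall overhead across the whole batch, which this sketch only gestures at.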

Predictions: File system buffering and readahead, the CPU scheduler, and other policies can be kernel-based. Kernels will become automatically JITted. Kernel emulation will stay slow. OS performance for Linux will be more complex and have worse performance defaults, BSD has high performance for narrow uses, and Windows has community performance improvements. Unikernels will get one compelling use case.

Hypervisors

cgroup v2 rollout and scheduler adoption increasing. VM improvements plus lightweight VMs to boot superfast like a container with the kernel inside the guest.

Predictions: Should see containers everywhere, and longer term more containers than VMs but more lightweight VM cores than container cores.

Evolution: FaaS → container → lightweight VM → metal, for light → heavy workloads, respectively.

Cloud: Microservice IPC cost drives the need for container schedulers colocating _ and cloud-wide runtime schedulers colocating apps

Observability

BPF FTW. OpenTelemetry, Grafana.

Predictions: More front ends (bpftrace and libbpf-tools). Too many BPF tools. Expect GUIs to find/execute the tools you want. FlameScope adoption to show deviations between multiple flame graphs/heat maps.

Plenary: Performance Analysis of XDP Programs

Next I attended the plenary talk by Zachary Jones of Verizon Media on XDP. At a high level, XDP lets them optimize the Linux kernel to improve server efficiencies for their media platform:

Organizational Design for Technical Emergency Response in Distributed Computing Systems

After the break I went to Adrienne Walcer and Alexander Perry's talk about emergency response. Adrienne is on the Google SRE Disaster team. She started by discussing the June 2019 Maya disaster, in which a code rollout caused cascading failures leading to internal tool outages, increased alerts, and mass pages. But the bottleneck was the network on-call person, and there was a lot of confusion. The outage took down Gmail, Snapchat, and YouTube for up to three hours.

When the scope becomes sufficiently large that component incident responders can't see the whole picture, they change to system-of-systems (SoS) responders as a second tier.

They use a common protocol (with clearly defined roles in their incident response process), trust (responders have the authority to handle the incident without needing to seek authority), respect (everyone is comfortable escalating as needed; psychological safety is created and maintained), and transparency (everything is available company-wide).

Back to June 2019. The network component on-caller paged the tech IRT, who could formally assume incident command, assess the current state, organize people to coordinate the moving parts of the response, set priorities and delegate tasks, secure additional resources where needed, and remove administrative and communications burdens from the folks who could implement mitigations — and the network-savvy people were working on operations rather than taking over IC.

Once service was restored and the incident closed, there was a very detailed postmortem. They could spin off engineering resources to address the root cause and trigger conditions and prevent recurrence. They also rewarded the people involved for their efforts.

Q: How are incident responders trained?

A: A couple of different ways. First they have a really robust onboarding program and exercises. "SRE EDU" is a week-long deep dive into Google tools, techniques, and systems, including a couple of practice emergency scenarios in a demo environment. That gives psychological comfort. There's also a more robust incident response training program. And then there's the disaster/resilience component; most people are used to the tools for their systems. The test scenarios are designed to be on something the person cares about, using the incident response protocol. They do it in the small scale as well as larger scale where multiple teams are given an infrastructure failure to try to coordinate and solve.

Specifically to roles, most people know what roles they prefer to have. If you really hate the role you've got you can swap with others. The ICs tend to be those willing to do it and not those paged and forced to do it. They've generally had the whole Incident Management At Google (IMAG) process training. A lot of it is learn-by-doing, though, even given practice incidents. Mentoring and shadowing helps; the IC has command but has a hidden backchannel with people who can give advice without making the IC look bad.

Q: How do you implement this in smaller environments or with a smaller pool of people?

A: A lot of people will lean on the generalist side of things. The smallest way to implement is to have N+1, a second person on-call or available to come help, or some understood escalation path to someone more senior. Keep the architects on their own on-call rotation for escalation. You don't want people to overspecialize to not be able to help out.

It's tempting to be optimal and let the experts (not management) handle production. But if the company is at risk then management wants to get involved... and you need to keep them from interfering and slowing things down. They need to have exposure to the incident process before joining an incident. Rotate the roles through the entire company. (CEO can't be in charge if the IC is.)

Q: What information do you bring together for the post-mortem? Who sits in on that discussion?

A: There's not just one discussion. Those involved discuss in writing in detail all the aspects of the incident. They're large enough and global enough that getting everyone in a room is problematic. They collect all the potential data (log sources), recreate timelines, who did what when, and then investigate the success of mitigations. There's also system analysis and reflection work to find root causes or triggers. They also cover what went well (and what didn't) and where they got lucky. And then open bug reports to prevent things from recurring.

Groove with Ambiguity: The Robust, the Reliable, and the Resilient

Next up for me was Matt Davis' talk about how the words robust, reliable, and resilient can be ambiguous. Why are things considered one thing and not another? If we're building something to be robust is it also reliable?

Complexity is a cycle: Discovery, adaptation, emergence, and ambiguity. Emergence in software is a lot like music: Compare sheet music (staves, notes, dots, lines, but it's silent) to the emergent music as it's played on an instrument.

Core attributes of a complex system: "Diverse, interdependent, networked entities that can adapt" (Scott Page, Diversity and Complexity).

Rather than use a dictionary he'll use prepositional phrases:

[At this point the video player corrupted hard and I missed the rest of the talk.]

Protecting System Integrity with Trusted Platform Module

After the lunch break I went to the TPM talk by Dmitrii Potoskuev and Marco Guerri of Facebook. Sadly, the streaming of this track broke before the session began: audio dropouts in the already 23-minute-delayed livestream forced a restart, delaying the session by about 35 minutes.

Every software and firmware component running on a system can be the vector for delivering an attack to the host itself and the wider infrastructure around it. People often focus on protecting the system from what runs in user space or kernel space, and don't always include in the threat model the integrity of the lower layers of the stack. In this talk, they wanted to show the potential impact of compromising a host through a persistent implant in its system firmware. They focused specifically on UEFI, the industry-wide standard that defines how system firmware should operate. They demonstrated a "hello-world" system firmware malware from its development to its injection on the host. They then introduced the concept of the Trusted Platform Module, a secure cryptoprocessor that has become an industry standard on consumer and enterprise systems, and explained how the TPM can help protect the platform from their demonstrative malware. They assumed that the system requires secrets to be able to interface with the infrastructure around it, and they leveraged the TPM to give the host access to those secrets only if they could guarantee that all layers of the stack had not been compromised.

This was a demo of how they set up a shared secret and then used a malicious driver to compromise that secret. They showed how TPM protects the system. Secure boot could help in some cases but it doesn't scale well. The demo showed how sealing and signing can help.
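
The measurement chain that sealing relies on can be illustrated without real TPM hardware. This sketch (using plain SHA-256, not the TPM 2.0 API; component names are made up) shows the core property: each boot component is hashed into a Platform Configuration Register via extend, PCR_new = SHA256(PCR_old || digest), so tampering with any layer changes the final PCR value and a secret sealed to the expected value is never released:

```python
import hashlib

def extend(pcr: bytes, component: bytes) -> bytes:
    """PCR extend: fold a component's digest into the running PCR value."""
    digest = hashlib.sha256(component).digest()
    return hashlib.sha256(pcr + digest).digest()

def measure_boot(components) -> bytes:
    pcr = b"\x00" * 32                     # PCRs reset to zeros at power-on
    for c in components:
        pcr = extend(pcr, c)
    return pcr

# Hypothetical boot chains: one clean, one with a malicious driver injected.
good = measure_boot([b"uefi-firmware", b"bootloader", b"kernel"])
evil = measure_boot([b"uefi-firmware", b"malicious-driver", b"kernel"])

# A sealing policy would release the secret only for the expected PCR value,
# so the compromised chain never sees it.
assert good != evil
```

Note that the PCR can only be extended, never set directly, which is why a firmware implant can't simply forge the clean value.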

Q: From the Quote we can only check that it comes from a valid TPM but not a specific TPM, leading to Cuckoo attacks. What is your take on that?

A: Quote is usually signed by Attestation Identity Key, which is specific for each TPM. In order to make sure that the quote comes from a particular host we should enroll its EK and AIK first.

It should be noted that TPMs are easily detachable from the system. This means the TPM is not so useful for protecting against presence-based (physical access) attacks. Microsoft did significant work on integrating the TPM with the CPU to mitigate the issue, but it's not applicable to ordinary servers.

The Cornerstone for Cybersecurity-Cryptographic Standards

Next for me was Dr. Lily Chen's talk on cryptographic standards. This presentation introduced NIST Cryptographic Standards and their applications in cybersecurity. The presentation also discussed transitions and validations. It highlighted challenges and solutions for next generation cryptographic standards, including challenges to deal with quantum threats, new cryptography transition, and lightweight cryptography for constrained devices.

History: NIST developed the first encryption standard (DES, 1977) and has been involved since. Nearly every device now uses their standards because of public-key cryptography. They also published key generation and management guidelines.

They've managed transitions — DES → 3DES → AES, and so on — as computing power and math techniques have become more powerful.

Next generation: Standards have to deal with extremes (powerful attacks, quantum computers, constrained environment) — but they need to have some degree of backward compatibility and interoperability (TLS version and cipher suites, for example).

Much is general-purpose now — what about special usage moving forwards? Synchronizing with industry best practice. What about international adoption? (And yet some countries have their own standards, there are some we can't export to, and so on.)

New initiatives:

Summary: NIST cryptography standards have been the cornerstone for cybersecurity, are developed for non-national security applications, and the next generation cryptography standards will deal with quantum threats (PQC) and constrained environments' protection demands (lightweight crypto).

Popcorn Talks

After the break I went to the Popcorn Talks. Popcorn talks are informal, short, silly, and fun talks! Speakers were given a surprise set of slides and had five minutes max to ad lib a short talk based on their contents. There were lots of GIFs, memes, and extremely silly slides, which may or may not have been related to technology.

It was very silly, it showed off the speakers' improvisational talents, and I took no notes.


Thursday, June 3

Why You Should Burn Down Your Datacenter

Despite the attention-grabbing title this talk wasn't actually about pyromania or destruction, at least not directly. Facebook's Mike Elkin is not a controls or mechanical engineer; this isn't about cloud computing but more about the industrial control systems. There are three components he talked about:

Datacenter 101

We care about power, cooling, and space for them. We'll be ignoring space since it's more about planning.

So fault domains aren't just the network, but the power and water. He showed a chart mapping which rows are served by which power and air handlers.

How do the control systems work? Using the Purdue model:

                     / Level 0 process: sensors, actuators, CTs, fans
        Data center <  Level 1 devices: PLC, controller, gateway
                     \ Level 2 control: SCADA, BMS, PMS, HMI

        Enterprise   / Level 3 ops: Workstations, DC, time-series DB
                     \ Level 4 business: warehouse, DCIM, ERP
    

You don't want the power draw to exceed the breaker limit.
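
As a toy version of that concern: sum the measured draw of the devices on one circuit and alarm before it exceeds a derated breaker capacity. The specific numbers and the 80% derating policy here are illustrative assumptions, not from the talk:

```python
# Assumed values: a 30 A breaker, planned to 80% of rating (a common
# conservative practice, assumed here for illustration).
BREAKER_AMPS = 30
DERATING = 0.8

def circuit_headroom(device_amps):
    """Remaining amps before the derated breaker limit is reached."""
    limit = BREAKER_AMPS * DERATING
    draw = sum(device_amps)
    return limit - draw

# Three devices drawing 22 A total against a 24 A derated limit.
assert abs(circuit_headroom([8.5, 7.0, 6.5]) - 2.0) < 1e-9
```

In a real datacenter this check runs continuously against live sensor data, per fault domain, rather than once against static nameplate values.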

Smart Infrastructure

Requires sensors and their data (and storing the data). Given that, do you build or buy software to do it? Not many want to deal with it, so most make the "buy and integrate" decision. The building systems may be collecting what we want, but can the sensors handle the load, and do they really give us what we want?

We need to know what the ICS devices are (both network and facilities information, as well as anything protocol-specific and how the devices all interrelate; keeping this information current and correct, like with any inventory, is a problem), the collection systems (with what endianness, precision, scaling, and so on), and data access (have tiers for different user types; many can view the data in aggregate, but you want control limited and very, very granular).
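
The endianness/precision/scaling point is concrete when you decode a raw sensor register. This sketch assumes a hypothetical Modbus-style sensor that reports temperature as a big-endian signed 16-bit register in tenths of a degree (a common convention, but an assumption here):

```python
import struct

def decode_temperature(register: bytes) -> float:
    """Decode a 2-byte big-endian signed register scaled in tenths of a °C."""
    (raw,) = struct.unpack(">h", register)   # ">h" = big-endian signed 16-bit
    return raw / 10.0                        # apply the sensor's scaling factor

assert decode_temperature(b"\x00\xfa") == 25.0    # 0x00FA = 250 tenths
assert decode_temperature(b"\xff\x9c") == -10.0   # two's-complement -100
```

Get any one of the three choices wrong — byte order, signedness, or scale — and the collection system silently stores plausible-looking garbage, which is exactly the inventory/collection problem the talk describes.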

Burn it down and start over

In summary:

Q: In this process, did you develop a product validation checklist or method to qualify industrial control devices prior to widespread deployment?

A: They have added requirements (especially around network performance, detecting data caches, and so on). Checklist items exist not just for the equipment's core purpose but also its network performance and how data can be collected.

Q: Concerning the lack of standards and most sensor data problems, what do you predict for the future of ICS equipment?

A: The ICS industry seems to be ~25–30 years in the past, compared to the IT, network, and security realms. If we don't bring them up to current standards there's certainly increased risk for failures. We need to convince the vendors that making these changes is actually important. The equipment is very specialized and we need to have a critical subset of them to make these changes to move forward safely.

Q: What other terrible life choices have you met?

A: The iterative development cycle. A significant problem is data modeling: What attributes do you want, and how and how often do you query things?

Selectively Sharing Multipath Routes in BGP

Next up, Trisha Biswas of Fastly talked about BGP.

Overview of BGP

BGP is the external routing protocol between ISPs. Routers running it are called speakers. It is best suited for a network of networks, or a network of autonomous systems (ASes).

An AS runs interior and exterior gateway protocols (IGP, EGP). Routing between ASes is called interdomain routing. BGP neighbors are peers and must be configured statically. Peers in different ASes use external BGP (eBGP) for communication, and those within the same AS form internal (iBGP) sessions.

A route is a list of ASes and other attributes on the path to the destination. BGP is a path-vector protocol: speakers advertise reachability to other networks (prefixes) to their peers. Advertising a prefix replaces previous announcements of that prefix (implicit withdrawal).

Best Path Selection

She stepped through several examples in the slides. BGP uses the number of hops as the first filter and then other attributes as tie-breakers to select the best path.
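The tie-breaking described above can be sketched as a simple ordered comparison. This is a toy illustration, not a full implementation: real BGP best-path selection also weighs local preference, origin type, MED, and more, and the route attributes and AS numbers here are invented:

```python
# Simplified sketch of best-path selection as described in the talk:
# fewest AS hops first, then further attributes as tie-breakers.
# Attribute names and values are illustrative assumptions.

def best_path(routes):
    """Pick the best route: shortest AS path, then lowest origin, then lowest MED."""
    return min(routes, key=lambda r: (len(r["as_path"]), r["origin"], r["med"]))

routes = [
    {"as_path": [64512, 64513, 64514], "origin": 0, "med": 10},
    {"as_path": [64512, 64515], "origin": 0, "med": 50},
]
print(best_path(routes)["as_path"])  # the two-hop path wins
```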

Additional Paths

Routers propagate only their best path, which is scalable, but you lose path diversity. BGP Additional Paths (RFC 7911) allows sharing of multiple paths for the same prefix without the new paths implicitly replacing previous paths. This helps achieve faster reconvergence after a network failure.

Selective Add Paths

You can't share all paths from all ASes, since there are thousands of prefixes, thus thousands of paths and millions of routes. So Selective Add Paths is a feature extension to limit how many, and what kind of, routes do and don't get shared.

Experimental data shows that only a few prefixes serve most of the traffic, so sharing multiple paths for only those prefixes unlocks the potential of multipath BGP without compromising peer performance.
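The selection step implied above (find the small hot set of prefixes worth sharing) can be sketched as a top-N cut by traffic volume. The prefixes and traffic numbers are made up for illustration; the talk did not specify how Fastly ranks them:

```python
# Hedged sketch: pick the small fraction of prefixes that carries most
# of the traffic, as candidates for Add-Paths sharing.

def hot_prefixes(traffic_by_prefix, fraction=0.01):
    """Return the top `fraction` of prefixes by traffic volume (at least one)."""
    ranked = sorted(traffic_by_prefix, key=traffic_by_prefix.get, reverse=True)
    n = max(1, int(len(ranked) * fraction))
    return ranked[:n]

traffic = {"192.0.2.0/24": 900, "198.51.100.0/24": 80, "203.0.113.0/24": 20}
print(hot_prefixes(traffic, fraction=0.34))  # just the busiest prefix
```

As the Q&A notes, this ranking can live outside the routing stack entirely, as a wrapper feeding the router's policy.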

Policy Based Filtering of Add Paths

Routes can be filtered based on any of the BGP route attributes. Best or preferred path should always be advertised.

Demo

She stepped through a demo of the Add Paths functionality.

Conclusions

BGP add paths helps achieve faster reconvergence but can affect the overall performance of the peer due to the large number of routes advertised.

Although ASes have hundreds of thousands of routes, only a few hundred (experimentally 0.1% to 1%) serve most of the traffic and are likely worth sharing.

Selective add paths helps achieve the best of both worlds by leveraging BGP multipath without overloading the peers.

Q&A

Q: How involved is it to determine which paths are most used or popular, so that you can selectively share those paths?

A: It's somewhat involved but doesn't need to be in the control plane. They're working on a tool. It can be a wrapper outside of the routing stack.

Q: Adding millions more routes would be very expensive, but do you think network vendors should look at increasing the resources available so this could become a default feature rather than having everyone figure out their paths selectively?

A: Because it's a tradeoff (heuristics and cleverness versus more resources), it's hard to convince vendors to increase the resources. You'll need heuristics anyhow... especially since Internet routers aren't always cutting edge.

Q: I have a use for BGP within Kubernetes or between Kubernetes clusters, not exposed to the internet. Where would I go to learn about that?

A: Unsure. There are various open source BGP implementations you could use within the k8s cluster. She uses bird; there's also frr and openbgpd. Any of those should work. Start experimentally between two nodes.

Q: Have you considered other routing software like quagga?

A: They have. The issue is that once you build a network with one routing stack and configuration it's very hard to move away from it later.

Q: Are you using Selective Add Path in Production now or is this still being investigated?

A: It's still just in test. They're looking to opensource bird.

Year One: Transitioning From Application Engineer to Infrasec Engineer

After the break I went to Misty Hall's talk about how she transitioned from an application engineering role to an infrastructure security role. It was loosely a case study (with a fish theme) that can let us determine what, if anything, we should do.

The base assumption for the talk is that we want to improve hiring and mentoring, reduce attrition, or avoid hiring only one type of person. Some questions to consider:

What about "slot limits"? What if you hire the wrong person (hook the wrong fish)? You can't "throw back" engineers you hire because they're too small or the wrong fit.

(What if you try moving from app to infra and don't like it? Does your company give you retreat rights?)

(Are you inclined or disinclined towards or away from a tool or product based on its community?)

Once you hire or transition an infrasec engineer and they like the work and are ready and motivated, how do you set them up for success? In her case there were no specific expectations about growth or tooling, but also no organized mentoring structure. You really need some sort of guidelines for where someone should be after a month or three or six. Core competencies in a ROSE chart are useful.

She eventually got put on a project. There are a lot of hard skills she learned as part of it, especially moving from startup-land to government-land. Continued and frequent pairing helped.

Integration opportunities: Culture adaptation, conference talk playlist, mentorship with clear and specific expectations, and moving families.

Q: What did you feel helped you most understand your new world? (What should onboarding look like?)

A: You basically need some kind of guide wires. Their culture is to do what you're most interested in so the company gets the most out of your labor. It's hard at an agency since you need generalists above you to get an idea of where you want to migrate skills-wise. But managers and CTOs have shared resources. She was an early-enough hire that they could treat her as an experiment. She intentionally didn't talk to management or the CTO about this talk because she wanted it unfiltered. A lot is trial and error; look around, see what needs doing, and do it.

Q: Has anyone else there made the transition afterwards? Is there a way you'd guide them in app→infrasec?

A: Not really, but she's definitely willing to mentor. It's a lift to mentor, and she'd like to see mentoring rewarded for seniors.

SkillOps: Real-World Approaches in Skilling and Building World-Class Security & Technology Teams for a Remote-First World

Next up for me was Abhay Bhargav's talk focusing on training and skilling up. His talk had three major sections.

Changing nature of IT and IT Security jobs

Looking at 2020 and 2021:

A lot of this is going to continue as the new normal, even as we move from mostly-remote to more of a hybrid model.

In 2020 there were 3.12M jobs unfilled (ISC2). There may be reasons (like poor training) for this kind of skills shortage. Also, 70% of organizations are impacted by talent shortages (ISSA), and 52% require hands-on cybersecurity skills (ISSA). 32% of IT budgets will be on the cloud (Forbes). 99% of cloud security failures will be attributed to customers [misconfigurations or errors] (Gartner).

A commonly-mentioned statistic (unsourced): For every 100 developers you have 10 devops and 1 infosec people.

Security consulting used to be vulnerability assessment and penetration testing, threat modeling, and vulnerability management, at a single (possibly repeated) point in time. However, things have changed and places are overwhelmed: all of that's still happening, but add DevSecOps and feedback loops, bug bounties, threat hunting, red teaming, cloud security, Kubernetes security, and more. Security folks have a lot more to do and aren't being staffed up (due in part to a talent shortage or skills gap) to be able to do it.

The best security teams tend to decentralize, treat engineering as a customer, and set useful defaults (the default way should be the secure way).

SkillOps

Continuous microtraining (small doses of content) accompanied by hands-on labs to increase capabilities quickly, along with tailoring the security education to the organization:

Recommendation is to apply both offensive (red team) and defensive (blue team) training and doing a de facto "purple team."

[Some resources he's impressed by are in the slides.]

Conclusions

Q: How large is the team he implemented this on?

A: It's a collection of experiences from various places, some startups and some multi-product teams (100–200 developers).

Q: Some of us come from smaller teams if not startups (e.g., a 30-person team in a multi-hundred person engineering org). How do we get started in changing the culture and/or implementing these sort of changes?

A: It's about finding the resources. Engaging activities (capture/defend the flag) to pique curiosity are a good starting point. Then circulate smaller security/technology bits that are specific to your environment, or share relevant research. Once you build interest it can start to be self-sustaining. Tech people tend to be curious, and hitting those buttons helps people increase interest.

Service Mesh Up and Running with Linkerd

After the lunch break I attended Charles Pretzer's talk about linkerd. He started with an overview of service mesh concepts.

A service mesh uses the network of a distributed system to observe, secure, and add reliability. Twitter started this in 2010 to break their monolithic application into a distributed system... which is by definition more complex. A service mesh provides insight into which pieces or parts may be experiencing issues. This decreases both MTTD[etection] and MTTR[esolution].

The data plane of a service mesh consists of many proxies that handle service traffic; the proxy is injected (via YAML) into the service. The data plane lets the services all talk through proxies, opening up powerful options. Anything handling the traffic between services can now provide telemetry to capture latency. linkerd also provides mutual TLS (mTLS) for security. The developers don't need to make this part of their code; the proxies take care of it for them. A service doesn't know how many instances of another service there are, since the proxies can do load balancing.
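The telemetry idea above (the proxy, not the service, records per-call latency) can be illustrated with a small wrapper. This is not linkerd code, just a sketch of the pattern; the handler and metrics store are invented:

```python
# Illustrative sketch (not linkerd's implementation): the kind of latency
# telemetry a sidecar proxy records around each service call, transparently
# to the service's own code.
import time

def with_latency(handler, metrics):
    """Wrap a service handler, recording per-call latency in seconds."""
    def proxy(request):
        start = time.perf_counter()
        try:
            return handler(request)
        finally:
            metrics.append(time.perf_counter() - start)
    return proxy

metrics = []
echo = with_latency(lambda req: req.upper(), metrics)
print(echo("hello"), len(metrics))  # HELLO 1
```

The service (the lambda here) knows nothing about the measurement, which mirrors how the mesh adds observability without code changes.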

The control plane of a service mesh is a finite set of components: Identity service, destination service, proxy injectors, and controller, all of which talk to and configure a new proxy on the data plane.

Some other main concepts:

He ran through a CLI-based demo that, thanks to the lower-quality video, was hard to see. See https://linkerd.io for more information or to download it.

The What and Why of Documenting Your Infrastructure

Next I attended Kevin Metcalf's talk about documenting your infrastructure. He basically had three parts to the talk.

Part 1: What happened?

His employer offered early retirement, so his Linux admins and his supervisor took advantage of it and gave him a bunch of their responsibilities "temporarily." What did he get? A list of duties, responsibilities, and servers, plus any documentation he could find: all of five sentences on one page, with no deployment automation and no configuration management.

So what did he do?

  1. Assume it's a joke (denial).
  2. Swear (anger).
  3. Ask for a raise (bargaining).
  4. Cry (depression).
  5. Get crackin' (acceptance).

Part 2: The plan

He developed a plan:

  1. Get list of servers. They had a finite number of server rooms and physical access, so he could inspect what was there and then narrow that down to what he needs to care about.
  2. Get login credentials. Reset root from single user mode as needed.
  3. Get at least one user contact for each system to identify what services the server is running.

What should we do differently? Think about it:

(Aside: There are two kinds of people: Those who think "I suffered through X so everyone else should too" and those who think "I suffered through X so nobody else should have to.")

Part 3: The work

  1. Document the duties of the position. The initial list was what management thought the position did, but that didn't match the actual ongoing duties. Ask coworkers and customers. Don't sweat the details; iterate. Categorize the duties: Infrastructure (DNS, printing, and so on), security (remove user access, SSL, patching/upgrades, and so on), and licensing and consulting.
  2. Document the device inventory.
  3. Go all Ansible on that $#!+. Use configuration management to automate and manage everything. At least manage non-standard system services, key (access) management, patch automation, gatekeeper tasks for developers, and so on.
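The inventory-then-Ansible steps above can be sketched as turning a collected host list into an Ansible-style INI inventory. The group and host names are invented for illustration, not from the talk:

```python
# Hedged sketch: render a collected device inventory ({group: [hosts]})
# as an Ansible-style INI inventory file. Names are hypothetical.

def to_ini_inventory(hosts_by_group):
    """Return an INI inventory string with sorted groups and hosts."""
    lines = []
    for group, hosts in sorted(hosts_by_group.items()):
        lines.append(f"[{group}]")
        lines.extend(sorted(hosts))
        lines.append("")  # blank line between groups
    return "\n".join(lines)

inv = to_ini_inventory({"web": ["www1", "www2"], "db": ["pg1"]})
print(inv)
```

Generating the inventory from collected data, rather than hand-editing it, keeps the documentation and the automation from drifting apart.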

Takeaways:

Closing Remarks

After our last break Avleen and Carolyn had some closing remarks. Avleen started with a retrospective on LISA: The online archives date back to 1993 (though we started as a workshop in 1987 and became a conference proper in 1990). Some interesting points:

Thanks to the USENIX staff for making things work as smoothly as possible.

Favorite moments this year? Carolyn said all the keynotes and plenaries. Avleen said the variety of talks: All kinds of industries (not just tech), all kinds of talks (subjects), diversity of speakers (cultures and backgrounds), and so on.

USENIX is Open Access — all the talks will be online soon after the conference, without a paywall. Consider becoming a member or donating money.

There is an attendee survey; please fill it out.

Closing Plenary: Practical Kubernetes Security Learning using Kubernetes Goat

We closed down the technical portion of the conference with Madhu Akula's plenary session on Kubernetes. If you're new to Kubernetes, he strongly suggests watching the Illustrated Children's Guide video.

The slides were his walking through the docs in a browser, so I didn't take specific notes. See https://github.com/madhuakula/kubernetes-goat for details.


Thoughts provided via survey

Recorded videos are good. Live captioning of them is less so because of jargon. Can we get captions done in advance? Also, since the talks are prerecorded, a speaker using humor has no way to tell whether the jokes are landing... which means there's no chance to adjust the talk by adding more if they are or removing them if not.

It's not clear how long if at all the session-specific chat text is kept. Is it like Zoom where it's archived to a text file on ending the call, or is it ephemeral and lost when the session ends?

It's not clear whether you want multiple registrations, or none, in the event of a split-track session. For example, if I were attending talk 1 in track I and talk 2 in track II, should I register for one, the other, both, or neither? How will that impact your metrics on the back end to judge success?

That said, should we switch the schedule from 90-on-15-off to 45-on-10-off so people can more-easily jump between talks in a given session?

Were you tracking the number of live viewers (max during session, average between 15-min-after-start and end, ...)?

Timezone issues. Having the schedule in one timezone means meal breaks for other timezones are odd. I recognize this is a problem and there's no good solution. Perhaps more 10- or 15-minute breaks between the 45-minute sessions (because "move the airwalls" and "reset the room" aren't issues now) to add more flexibility?

It's hard to provide applause-as-feedback when the speaker (video) is done or when the Q&A session ends. ":clap:" can only go so far, whether in Swapcard (plaintext) or Slack (emoji).

I recognize that multiple simultaneous tracks can be problematic. We've only got up to two this year but have had four or more in the past (e.g., refereed papers, two invited talks, and Guru-Is-In). Are we considering going back to a more-content conference (especially since prerecorded video doesn't add a lot of overhead)?

Might be useful to recommend speakers don't use the bottom 10–15% of their slides since that's where the captioning appears, making both the captioning and the section of the slide unreadable.

The 9:45–11:15am PT (pre-lunch) talks in track II on Wed Jun 02 were a barely-mitigated disaster. The player kept failing for people at random. Sometimes a refresh would work, sometimes it would fail again quickly. For some, downgrading to 240p (which made slides illegible) helped. Refreshing the video player every 10–15 seconds is untenable.

The 12:00n–1:30pm PT (post-lunch) talks in track II on Wed Jun 02 were a less-mitigated disaster. We lost 22 minutes while USENIX and Swapcard troubleshot whatever was broken with live streaming. I feel sorry for the speakers who had to put up with the delay (and for the possibly-lost audience who switched to track I out of boredom or frustration).

It's often very hard to view demos, especially command line demos. Even in the "use 60% of my browser window" mode it's very difficult to see the CLI commands used in demos... even if the speaker doesn't clear their screen quickly. Can we recommend they increase their typeface size?


Thoughts not provided via survey

They didn't announce a venue or chairs for a next LISA conference, be it in 2022 or beyond. I spoke with Executive Director Casey Henderson and the short answer is that it hasn't been decided yet whether (and if so, when) it will happen.

My thoughts are that I'm hopeful but not optimistic that one will happen. Over the last decade or so, the program has changed substantially: We've axed the guru-is-in track, refereed papers, and full- and half-day tutorials and workshops; we've gone from a five- to six- then three-day event; and the international yet regional SRECons have a substantial audience overlap with LISA. With all that, it may no longer make financial sense for the USENIX Association to host LISA.



Back to my conference reports page
Back to my professional organizations page
Back to my work page
Back to my home page

Last update Jun08/21 by Josh Simon (<jss@clock.org>).