Conference Report: 2004 LISA: Advanced Topics Workshop

Tuesday began with the Advanced Topics Workshop, once again ably hosted by Adam Moskowitz. Unusually, we started with a quick overview of the new moderation software. We followed that with introductions around the room. In representation, businesses (including consultants) outnumbered universities by about 2 to 1, and the room included 3 past LISA program chairs and 6 past members of the USENIX Board or SAGE Executive Committee.

We had our usual interesting discussions on a variety of topics. Our first topic was introducing the concepts of disciplined infrastructure to people (e.g., it's more than just "cfengine" or "isconf" or something), or infrastructure advocacy, or getting rid of the ad hoc aspects. Some environments have solved this problem at varying levels of scale; in others, fear of change paralyzes the systems administration staff. One idea is to offload the "easy" tasks either to automation (but avoid the "one-off" problem and be careful with your naming standards) or to more junior staff, so the more senior staff can spend their time on more interesting things than the grunt work. Management buy-in is essential; exposing all concerned to LISA papers and books in the field has helped in some environments. This is, like many of our problems, a sociological one and not just a technical one. Remember that what works on systems (e.g., Unix and Windows boxes) may not work for networks (e.g., routers and switches), which may be a challenge for some of us. We also noted that understanding infrastructures and scalability is very important, regardless of whether you're in systems, network, or development. Similarly important is remembering two things: First, ego is not relevant; code isn't perfect and a developer's ego does not belong in the code. Second, the perfect is the enemy of the good; sometimes you have to acknowledge there are bugs and release it anyway.

After the morning break, we discussed self-service, where traditional sysadmin tasks are handed off (ideally in a secure manner) to users. Ignoring for the moment special considerations (like HIPAA and SOX), what can we do about self-service? A lot of folks are using some variety of web forms or automated emails, covering the business process (e.g., approvals) and not just the request itself. One concern is to make sure the process is well-defined (all edge cases and contingencies planned for). We've also got people doing user education ("we've done the work, but if you want to do it yourself the command is..."). Constraining possibilities to do only the right thing, not the wrong thing, is a big win here.
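As a rough illustration of "constraining possibilities," here is a minimal Python sketch of how a self-service request form might check a request before anything runs. Nothing here was shown at the workshop: the action names, the `Request` type, and the `check_request` function are all hypothetical, and a real system would also need authentication and logging.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical allow-list: the only actions users may request, and
# whether each one needs sign-off from someone else first.
ALLOWED_ACTIONS = {
    "reset_password": False,
    "restore_file": False,
    "create_mailing_list": True,
}


@dataclass
class Request:
    user: str
    action: str
    approved_by: Optional[str] = None


def check_request(req: Request) -> str:
    """Return 'run', 'needs_approval', or 'rejected' for a request."""
    # Anything not on the allow-list is refused outright, so the form
    # cannot be used to do "the wrong thing."
    if req.action not in ALLOWED_ACTIONS:
        return "rejected"
    needs_approval = ALLOWED_ACTIONS[req.action]
    # Users may not approve their own requests.
    if needs_approval and req.approved_by in (None, req.user):
        return "needs_approval"
    return "run"
```

The point of the allow-list is that the edge cases are decided once, in one place, rather than by whoever happens to pick up the request.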

Next we discussed metrics. Some managers believe you have to measure something before you can control it. What does this mean? Well, there are metrics for services (availability and reliability are the big two), with desired levels to meet, in-person meetings for when levels aren't met, and so on. Do the metrics help the SAs at all, or just management? They can help the SAs identify a flaw in procedures or infrastructure, or show an area for improvement (such as new hardware purchases or upgrades). We want to stress that you can't measure what you can't describe. Do any metrics other than "customer satisfaction" really matter? Measure what people want to know about or are complaining about; don't just measure everything and try to figure out from the (reams of) data what's wrong. Also, measuring how quickly a ticket got closed is meaningless: was the problem resolved, or was the ticket merely closed? Was the ticket reopened? Was it reopened because of a failure in work we did, or because the user had a similar problem and didn't open a new ticket? What's the purpose of the metrics? Are we adding people or laying them off? Quantifying behavior of systems is easy; quantifying behavior of people (which is the real problem here) is very hard. Whatever you measure, present the results in the language of the audience, not just as numbers. Most metrics that are managed and monitored centrally have no meaningful value; metrics cannot stand alone, but need context to be meaningful. Not all problems have technical solutions, and metrics is one that doesn't. What about trending? How often and how long do you have to measure something before it becomes relevant? Not all metrics are immediate.
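To make the ticket example concrete, here is a small Python sketch that reports time-to-close alongside the reopen questions raised above, so the first number is never read without the others. The `Ticket` fields and the `summarize` function are assumptions made for illustration; they do not correspond to any particular ticketing system.

```python
from dataclasses import dataclass


@dataclass
class Ticket:
    hours_to_close: float
    reopened: bool                      # was it reopened after being closed?
    reopen_was_our_fault: bool = False  # reopened because our fix failed?


def summarize(tickets):
    """Report time-to-close together with the context it needs."""
    closed = len(tickets)
    if closed == 0:
        return {"closed": 0, "mean_hours_to_close": 0.0,
                "reopen_rate": 0.0, "our_fault_rate": 0.0}
    reopened = sum(1 for t in tickets if t.reopened)
    our_fault = sum(1 for t in tickets
                    if t.reopened and t.reopen_was_our_fault)
    return {
        "closed": closed,
        "mean_hours_to_close": sum(t.hours_to_close for t in tickets) / closed,
        # Fraction of closed tickets that came back at all.
        "reopen_rate": reopened / closed,
        # Fraction that came back because the original work failed.
        "our_fault_rate": our_fault / closed,
    }
```

A fast mean time-to-close paired with a high `our_fault_rate` suggests tickets are being closed rather than resolved, which is exactly the distinction the raw closure time hides.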

After a little bit of network troubleshooting (someone's Windows XP box was probing port 445 on every IP address in the network from the ATW), we next discussed virtualized commodities such as user-mode Linux. Virtual machines have their uses: for research, for subdividing machines, for providing easily-wiped generic systems for firewalls or DMZ'd servers where you worry about them being hacked, and so on. There are still risks, though, since reliance on a single point of failure (the hardware machine) can theoretically impact multiple services on multiple (virtual) machines.

Next we discussed how to get the most out of Wikis as internal tools. What's out there that's better than TWiki? We want authentication out of LDAP/AD/Kerberos, among other things. The conference used PurpleWiki, which seems to be more usable. There's a lot of push-back until there's familiarity. Wikis are designed for some specific things, but not everything. You need to be able to pause and refactor discussions if you use one as (e.g.) an email-to-Wiki gateway. (There is an email-to-Wiki gateway that Sechrest wrote.) If email is the tool most people use, merging email into a Wiki may be a big win. Leading by example (take notes in the Wiki in real time, format after the fact, organize it after you're done) may help sell it to your coworkers.

Next we listed our favorite tools of the past year and had shorter discussions about Solaris 10, IPv6, laptop vendors, backups, and what's likely to affect us on the technology front next year. We finished off by making our annual predictions and reviewing last year's predictions; we generally did pretty well.

Last update Feb01/20 by Josh Simon (<jss@clock.org>).