August 2001 Column




Zen and the art of System Administration

One of the less-well-understood aspects of Linux (and UNIX in general) is the job of the system administrator. What does a sysadmin do, and why do they do it?

If you started out in life using a single-user computer system, odds are that you were your own system administrator without realising it. If you backed up your files, defragmented your disks, installed your own applications, and set up your own internet dial-up link, you'd have a pretty good idea of how everything worked.

More recently, in graphically-based operating systems such as Windows 95 or MacOS, these configuration tasks are pretty easy: you just go to the control panel and run a program that sets, say, the time zone your machine is in, or its TCP/IP address and dialup networking characteristics.

Life in the world of multi-user systems is different. Once it's not just your own desktop that you're configuring, you need to bear in mind that different users will have a different view of the resources the computer makes available to them. For example: a business may want to centralize backups of user data so that nothing gets lost by people who don't understand how to do a backup. The usual way of doing this is to set up a file server machine, stick all the user directories on the file server, and back up just the file server. However, for this to work each user has to be set up slightly differently -- to use a different folder on the file server. Otherwise they'll be reading each other's files and getting confused.

This illustrates an important lesson: as computers become ubiquitous at work, the level of computer literacy of the average user declines. Thus, it's more important than ever for specialists to be able to set machines up in a usable configuration and lock them down so that the user can work through their daily tasks without messing things up trying to install new printer drivers or something.

Another point is that in moving from a self-administered single user box to a networked environment we have gone from one dimension of variables we're trying to control -- how the computer is set up -- to a two-dimensional map of how the computer is set up for each user: what permissions each user has with respect to, say, networked file servers, printers, access to an internet gateway, and so on.

One way of doing this is exemplified by MacOS 9, which has acquired a concept of user profiles: each of the control panels used to configure a subsystem like the desktop background, or network setup, has an administrator mode in which different configurations can be established for different users. This is, however, cumbersome in the extreme when you're trying to run a network of hundreds of boxes; it's still fundamentally oriented towards a single-user machine that a couple of people might take turns at from time to time, or a laptop that might be plugged into two different networks.

UNIX has been a multi-user system almost from the outset, and consequently it has a different approach to administration. Linux copies the UNIX model pretty slavishly, so it's this model that I'm going to talk about.

At the heart of all UNIX-type systems there is a program called the kernel. The kernel has unlimited privileges; it talks to the disk controllers, console, serial ports, network cards, and other bits of hardware. It manages memory allocation, keeps track of file and process permissions, fires up other programs and schedules periods during which they can execute their code, and provides system services. In fact, without a kernel there is no operating system.

Sitting on top of the kernel are a number of programs that are almost as fundamental. For example, there's init -- the root process. Init is a process (running program) that is spawned by the kernel when it first loads; it keeps running until the system shuts down. Init is used to spawn other processes -- it has a number of states it can be in (called run levels) and a list of tasks to carry out when switching between run levels in response to a command to change level. (The command to do this is, confusingly, also called init -- but it's a user-space program, not the actual init process spawned by the kernel.) When init first fires up it reads its configuration file (/etc/inittab), runs a set of system initialisation scripts, and then switches to a default run level (such as 3 -- general multi-user level -- or 5 -- graphical login level). One of init's chief responsibilities is to permit users to log onto the system by authenticating their identity, at which point they get an interactive shell session started with their user ID -- without init acting as gatekeeper to the system you can't do *anything*. (Actually, init lets people log in by spawning getty and login programs, which do the real work of authentication, but we'll not go into that just yet.)
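
To make this a little more concrete, here is a rough sketch of the sort of thing you'll find in /etc/inittab on a Red Hat-style system (other distributions differ in the details, so treat it as illustrative rather than gospel):

# default run level to enter at boot
id:3:initdefault:
# system initialisation script, run once at startup
si::sysinit:/etc/rc.d/rc.sysinit
# what to do when entering run level 3
l3:3:wait:/etc/rc.d/rc 3
# keep a login prompt alive on the first virtual console
1:2345:respawn:/sbin/mingetty tty1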

When init is switching between run levels, the scripts it runs are used to start and stop system services -- services that aren't provided by the kernel, but by software systems. Examples of these services include the FTP server, or the Apache HTTP server, or the Samba SMB (Windows file and print sharing) server. Init may also run scripts that configure hardware -- for example, by installing device drivers for PCMCIA cards on a laptop, then bringing up network interfaces for a PCMCIA ethernet adapter.
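
On a Red Hat-style system these scripts live under /etc/rc.d (Debian puts them under /etc/init.d), and you can poke at them directly. For example, assuming the Apache subsystem is installed under its usual name of httpd:

ls /etc/rc.d/rc3.d/                # S* scripts are started, K* scripts are stopped, on entering level 3
/etc/rc.d/init.d/httpd status      # ask the Apache init script whether the daemon is running
/etc/rc.d/init.d/httpd restart     # stop and restart it after changing its configuration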

The point to note here is that beyond the core services provided by the kernel, almost everything is started up or shut down by an init script. And most everything is provided by an external *software subsystem* -- a set of faceless programs installed on a UNIX server that run in the background and are started by init.

The core task of the UNIX system administrator is this: to configure each subsystem so that it provides services to the users of the system. This may entail installing upgrades to a subsystem, editing its configuration, editing init's configuration scripts to start the subsystem up correctly, telling init to change run level (starting the system), and monitoring its operations. About the only exception to this is configuring programs that are executed periodically (by the cron daemon), rather than continuously (by init) -- and these are mostly housekeeping tasks associated with daemons executed from init. (For example, if you're running a news server you'll have a periodic batch job, run by cron, that cleans out old news articles; but it still boils down to administering a subsystem.)
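
Cron is itself configured through a file, the crontab: each line gives the minute, hour, day of month, month, and day of week on which to run a command, followed by the command itself. Purely by way of illustration (the path to news.daily depends on how your news server was installed), the nightly expiry job for a news server might look like this in /etc/crontab:

# minute hour day-of-month month day-of-week user command
30 3 * * * news /usr/lib/news/bin/news.daily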

Sooner or later every task boils down to this level. Administering user accounts? That's all about messing around with the password database and login subsystem. Setting up printers? That's about messing around with the lpd -- line printer daemon -- subsystem. Running a web server? You need to configure the apache daemon -- another subsystem. Backups? Slightly different, but if you're doing it automatically this probably entails using a subsystem such as the Amanda network backup system and configuring it to run automatically.
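
Take user accounts as an example: on most Linux systems adding one is just a matter of running useradd, which edits the password database on your behalf (the username here is, of course, made up):

useradd -m -c "Fred Bloggs" fred    # create the account and a home directory for it
passwd fred                         # set an initial password
grep '^fred:' /etc/passwd           # look at the entry useradd just created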

If you adopt the attitude that you're configuring a subsystem, then everything falls into place. Subsystems consist of programs (typically a daemon process that executes continuously, and some supporting utilities), configuration files (that tell the daemon how to behave), and optionally some data files. The configuration files almost always live under the special directory /etc, the data files are often stored under /var, and the programs under /sbin or /usr/sbin. Configuration files typically have a syntax that resembles a mini programming language; UNIX is notorious for spawning languages, from the shell language interpreter you type commands at to the C compiler that the kernel is built using. Learn to program in the mini languages in question (which mostly resemble C or the shell, except for a few -- like Apache -- that resemble XML) and you can pick up the essentials rapidly using the man pages.
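
Samba makes a reasonable illustration of how the pieces of a subsystem are scattered across the filesystem; the exact paths vary a little between distributions, but the pattern is typical:

/etc/smb.conf              # the configuration file (or /etc/samba/smb.conf on newer systems)
/usr/sbin/smbd             # the file and print sharing daemon itself
/usr/sbin/nmbd             # its companion name-service daemon
/var/log/samba/            # log files written while it runs
/etc/rc.d/init.d/smb       # the init script that starts and stops it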

Manual pages are the system documentation on UNIX; originally they were a description and specification of a program or file format that was meant to be complete enough that a good programmer could replicate the original program from the man page. Whether they live up to that goal is questionable; they tend to be rather impenetrable. However, there is almost always a man page named for the core daemon in a subsystem -- and it will have cross-references at the end to other man pages (for example, for configuration files). As long as you remember that man pages describe individual daemons, or configuration files, or utilities, but *not* entire subsystems, things will fall into place: but remember, man pages do not explain why and how some subsystem is configured, they merely describe individual components.
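
The way into the manual is the man command itself, plus apropos (also reachable as man -k) for searching by keyword -- for instance, sticking with the Samba example:

man smbd             # the manual page for the Samba daemon
man smb.conf         # its configuration file, cross-referenced from the smbd page
apropos printer      # list man pages whose one-line descriptions mention "printer"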

So how do you go about becoming a system administrator?

Firstly, learn to think in terms of subsystems that are installed on your Linux kit to provide a service. For example, when you're trying to figure out why a computer won't print anything, you need to think in terms of (a) the print spooler subsystem, and (b) the system that's trying to feed output to a printer. After checking the cables and making sure the printer isn't out of toner, you need to ensure that the print spooler daemon is running, verify that it has been configured to know about your printer (see /etc/printcap), check that the user is printing to a print queue defined in /etc/printcap, and check the status of the daemon (so that it isn't, for example, suspended). You might also need to check that a device driver for your parallel port is loaded into the kernel -- if not, the print spooler daemon won't be able to find a valid parallel port device with a printer attached to it -- but in principle, the job entails hunting from one end of the subsystem through to the other.
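
In practice that hunt translates into a handful of commands -- something like the following, assuming the traditional BSD lpd print spooler that most Linux distributions shipped at the time of writing, and a queue called "lp" on the first parallel port:

ps ax | grep lpd          # is the print spooler daemon actually running?
less /etc/printcap        # is the printer defined, and what is its queue called?
lpq -P lp                 # what does the spooler think is sitting in the "lp" queue?
lpc status lp             # is the queue enabled, or has it been disabled or suspended?
ls -l /dev/lp0            # does the parallel port device exist at all?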

Secondly, don't assume that the computer will do what you want it to do: in general it does what you tell it to do instead, which may not be the same thing. Unlike Windows or MacOS, UNIX systems tend to expose all the guts of a system to your eyes. The level of detail in some of the configuration files may be bewildering or confusing at first: for example, you might have no idea why the printcap file has options for sending email to users when their print jobs have been delivered, or for printing header sheets with their username on them. (These options make sense in a big company or department where people in different offices are using centralized print facilities located elsewhere: odds are you don't work in such an environment.) The trick is to learn to ignore the extraneous options and look for the relevant ones.
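
For what it's worth, a minimal printcap entry for a printer hanging off the local parallel port looks something like this -- the sh flag suppresses those header sheets, and mx#0 lifts the limit on job size:

lp|local printer:\
        :lp=/dev/lp0:\
        :sd=/var/spool/lpd/lp:\
        :sh:mx#0: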

Thirdly, don't assume that you're helpless. Consumer operating systems like MacOS or Windows try to hide the complexity: Linux and UNIX systems, in contrast, expose you to it. This means that when something goes wrong with a consumer OS, there is often nothing to be done except try to reinstall it. In contrast, the exposed complexity of Linux means that you can very often fiddle around with a broken subsystem until it works properly. Of course, it takes willingness to roll your sleeves up and try to understand what's going on -- and to read the system logfile. Logfiles are saved under /var/log, and are written to automatically by the daemons running on the system: they usually contain lots of cryptic error messages when something is going wrong, and the errors in turn usually point straight at whatever is causing the problem. (The lack of accessible logging information is one of the biggest obstacles to problem solving thrown up by consumer operating systems.)
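
Most daemons log via syslog, so the first place to look is usually /var/log/messages (the exact filename depends on how /etc/syslog.conf is set up on your distribution):

tail -f /var/log/messages       # watch new log entries arrive as they happen
grep lpd /var/log/messages      # pick out everything the print spooler has said recently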

Fourthly, learn about the information sources available to you. If you want to run a Linux or UNIX system, you need to learn: to learn about the subsystems already available on your system, to learn about new tools and techniques as they become available, to learn how to troubleshoot hardware and software, to learn about new security holes and how to patch them. The first and best single source of information is undoubtedly the internet: a quick Google search for some relevant keywords usually turns up a plethora of articles from knowledge bases on the web. (In fact, the best trick is learning how to hone your searches down so that you aren't deluged in information of dubious relevance). An alternative starting point is the website of whichever Linux distribution you're using -- Redhat, SuSE, and most of the others maintain extensive archives of technical reports, hardware compatibility data, and documentation. Then there's the Linux Documentation Project which maintains a repository of guides, HOWTOs (essays on specific subjects), FAQs, and man pages.

Fifthly, become paranoid! If your computer is connected to the internet, it is exposed to a worldful of hurt. The availability of free cracking tools on the net means that an unprotected machine is at risk from bad guys with broadband network links who can scan huge ranges of IP addresses looking for vulnerable hosts. Once your machine is identified as suffering from a security hole, it is only a matter of time before someone exploits it. It might just be kids fooling around, or it might be a spammer using your open sendmail service to relay their junk adverts at your expense -- whatever the case, you don't want that to happen. There are numerous information sources for up to date warnings about security hazards; in particular, you should probably subscribe to the Bugtraq mailing list and check out some of the security-related websites. Assuming that nothing bad will happen to you because you aren't a high-profile target is about the best way to have it happen to you -- so get paranoid fast.
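
A good first step towards healthy paranoia is finding out what your machine is actually offering to the world -- you may be surprised. (If your distribution uses xinetd rather than inetd, the file to look at is /etc/xinetd.conf and the individual services live under /etc/xinetd.d.)

netstat -an | grep LISTEN       # list every TCP port something is listening on
less /etc/inetd.conf            # see which services inetd will start on demand
killall -HUP inetd              # make inetd re-read its config after you comment things out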

Sixthly, learn to upgrade your system. If you use the Redhat, SuSE, or Mandrake Linux distributions, then when vulnerabilities are discovered "patches" -- sets of files that address the problem -- are distributed in RPM (Redhat Package Manager) format. Download the relevant package and type:

rpm -Uh packagename-version.number.i386.rpm

And the defective files will be replaced. You can identify your current version of a package by scanning the list of currently installed packages:

rpm -qa | less

(Which prints a long list of all the packages on your system, and lets you view it with "less").
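
If you already know which package you're interested in, you can query it directly instead of wading through the whole list:

rpm -q sendmail            # report the installed version of the sendmail package
rpm -qi sendmail           # fuller information: build date, vendor, description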

Debian-based distributions use different package management tools, as do Solaris and other UNIX-type systems -- in extreme cases, there isn't any concept of dependency checking, and you have to recompile the source from scratch. However, it's important to keep up to date: recently there has been a spate of worms trawling the net for out-of-date Linux systems and copying themselves to them. While no major damage has been done so far, it's only a matter of time before a malicious worm targets Linux systems with a payload that causes serious damage.
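
On a Debian-based system the equivalent housekeeping is done with the apt tools, roughly like so:

apt-get update             # fetch the current list of available packages
apt-get upgrade            # install newer versions of everything already installed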

Seventh, and final, piece of advice: remember that we use computers for tasks, not as an end in themselves! True, some people are paid to do this sort of thing -- dinking with problems, fixing them, filing off rough edges, and keeping the machines running -- but there's a real world out there as well: the sun is shining, the birds are singing, there's beer to be drunk and vacations to be taken. The most important thing you need to learn in system administration is when to leave off and go and do something else instead.

The whole point of having a well-defined system administrator role is so that the system can be set up to largely run itself, subject to occasional fits of maintenance: unlike a single-user system, which needs constant hands-on attention, UNIX and Linux systems are designed for reliability. As I write, one of my Linux boxes has been running steadily for 348 days, serving as a firewall; another has been up for 286 days without a reboot, acting as a webserver, mail and news server, and public telnet host. (My workstation was, in contrast, last rebooted only 20 days ago. That's because I test a lot of strange software on it and sometimes many-legged nasties crawl out of the woodwork.)

The reason I run Linux is so that I know what to expect of my computers: they aren't going to unexpectedly tell me I've added some new hardware, and would I please reboot them. Nor are they going to succumb to a memory leak in a third-party application and crash. Nor do they (like Microsoft Windows 95) have a millisecond timer that overflows after about 49 days and kills the system. They just tick along like the industrial appliances that they are, with minimum maintenance. The biggest cost in running any computer below the level of a mainframe is the time of the human beings who run it; being able to manage a herd of boxes by broadcasting configuration files at them using rsync, or using shell or perl scripts, beats having to walk over to each machine, turn on the monitor, and mouse around a muddle of pretty but meaningless icons. Some surveys suggest that networks of UNIX and Linux systems require only a third as many administrators as Windows networks: the reason for this is the flexibility and configurability of the operating system -- which in turn derives from the very complexity that can be so intimidating to beginners.
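
By way of illustration (the hostnames and the file are invented for the purpose), pushing a corrected configuration file out to a handful of machines is a one-liner:

for host in web1 web2 web3; do
    rsync -av -e ssh /etc/ntp.conf $host:/etc/ntp.conf
done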

