Linux Format 25

Perl Tutorial

TITLE: Writing client/server programs

Linux is a POSIX-compatible, UNIX-like operating system. As such, it has a whole grab-bag of features for letting one program talk to another. In the beginning, Version 7 UNIX provided pipes, signals and temporary files. The BSD versions added TCP/IP networking and sockets, while AT&T's UNIX team added a slew of IPC (inter-process communication) options, including shared memory, semaphores, and STREAMS (not really used on Linux). Then the Bell Labs Plan 9 research team, who got heavily into doing esoteric things with filesystems, invented the /proc filesystem; the Linux kernel developers snaffled it and added their own extras, notably the shared memory filesystem (which makes shared memory easier to use).

The result is that Linux is a rich environment for writing client/server software, although not all the mechanisms you'll find there are portable to other operating systems. A shm-filesystem-based solution won't run on Solaris or other SVR4 UNIXes, and SVR3/SVR4 shared memory won't work on BSD. So I'm going to keep to the limited subset of client/server tools that are relatively portable -- signals, pipes, and sockets. In this tutorial, I'm going to go over some of the basics of interprocess communication -- signals and pipes, with a side-order of sockets. In next month's tutorial we'll see how to build socket-based internet client and server programs (such as a simple web server).

HEADING: Signals

Signals are about the most primitive way for two processes (executing programs) to communicate: they transfer no information at all beyond the fact that a signal with a given number has been raised, because some condition has occurred. For example, if a process divides by zero, SIGFPE (floating point exception) is sent to the process. A lot of other conditions can trigger signals; for example, hitting the interrupt key on a TTY that a process is connected to sends a SIGINT (interrupt) signal, and the kill command can be used to send arbitrary signals. You can see a list of the signals supported by Linux in the signal(7) man page; it's a superset of those specified by POSIX.1 and SUSv2 (the Single UNIX Specification). A process can do one of three things with a signal: ignore it, allow the default action to occur (for SIGFPE, the default is to terminate the process), or provide a function that is called when the signal occurs. Perl processes can set handler functions, and they can also send signals. Here's how it works.

There's only one signal that doesn't affect the target process: signal 0. Sending signal 0 merely checks whether the process is alive and whether you're allowed to signal it; if kill(0, $some_process_id) fails, then $some_process_id isn't accepting signals from you. (Maybe it's a zombie?)

To send a signal in Perl, we use the kill() command. kill() takes two or more parameters -- the signal to send, followed by a list of the PIDs (process ID numbers) of the processes to signal -- and it returns the number of processes successfully signalled. Signals can be given in numeric form (e.g. 9 for SIGKILL) or symbolic form (e.g. 'HUP' for SIGHUP).
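To make that concrete, here's the sort of thing you might write. (A sketch only: the PID is made up, and of course you can only signal processes you have permission to signal.)

    my $pid = 4117;                     # made-up PID of a process we own

    if (kill(0, $pid)) {                # signal 0: probe without disturbing it
        my $count = kill('HUP', $pid);  # kill() returns the number signalled
        print "sent SIGHUP to $count process(es)\n";
    } else {
        warn "can't signal process $pid: $!\n";
    }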
Note that your Perl script can't kill processes belonging to higher-privileged user IDs -- if you want to kill one of root's processes, you have to be root, and if you do this from a CGI script running with a UID of 'nobody' or 'httpd' you can only signal other processes owned by 'nobody' or 'httpd'.

There are two ways of processing a received signal in Perl: the traditional way, and the sigtrap pragma.

The traditional way first: Perl provides a hash called %SIG, in the main namespace. You can create subroutines to handle specific signals, and store them in %SIG; for example, if you write a subroutine called handle_hup() to handle SIGHUP (hang-up) signals, you'd put a reference to it (or just its name) in $SIG{HUP}:

    $SIG{HUP} = \&handle_hup;

Thereafter, whenever the process receives a SIGHUP, the subroutine handle_hup() is triggered. You can also assign the name of a subroutine to %SIG:

    $SIG{HUP} = 'handle_hup';

Or an anonymous subroutine:

    $SIG{HUP} = sub { die "Killed by SIGHUP!\n" };

Now, here's a problem: signals are asynchronous. They can come at you from anywhere and at any time. To handle asynchronous events like this, you need to write re-entrant code -- code that can be invoked from within itself without losing track of where the program counter and registers have got to. This isn't re-entrancy at the Perl level -- it's at the C or assembly language level, the level Perl is written in. Perl is linked to libc, and older versions of libc are not re-entrant -- GNU libc, aka libc6, is, but you can't count on older Linux systems providing it. This means that your Perl signal handler may go bang messily if it receives a second signal while it's still handling the first. So you need to keep signal handler subroutines short and sweet. Ideally all they should do is log an error message somewhere, close any open system resources (such as open filehandles), and call die() gracefully. Even doing that much can be a bit risky.

Note that %SIG is a global hash; if you want to change a signal handler locally, say within a subroutine, you should override the relevant entry using local() -- as in local $SIG{HUP} -- so that when you leave the enclosing scope %SIG regains its normal settings.

The shiny new Perl approach to signals is the sigtrap pragma (a pragma is a built-in feature controlled using use() that changes the way the compiler behaves). Sigtrap lets you specify that one of a number of predefined behaviours is to be triggered by one or more signals. Behaviours include 'stack-trace' (output a Perl stack trace and dump core), 'die' (die with a programmer-specified message), and 'handler', which is followed by a user-written subroutine to use (as in %SIG). Lists of signals to apply a behaviour to include 'normal-signals' (HUP, INT, PIPE, or TERM), 'error-signals' (ABRT, BUS, EMT, FPE, ILL, QUIT, SEGV, SYS and TRAP -- these would usually cause your Perl interpreter to die), 'old-interface-signals', 'untrapped' (all signals for which no handler has yet been installed), and 'any' (all signals, period). You can also list signals by name or number. For example:

    use sigtrap qw(stack-trace normal-signals);

causes Perl to install the stack-trace handler for all the signals listed in the 'normal-signals' set.

    use sigtrap qw(die untrapped);

makes the die() handler the default for any signals we haven't previously set up a handler for. And:

    sub my_handler { warn "SIGHUP received\n" }
    use sigtrap 'handler' => \&my_handler, 'HUP';

installs a handler for SIGHUP that prints 'SIGHUP received' on STDERR. (Remember that use() takes effect at compile time, so the handler you hand to sigtrap must exist by then -- a named subroutine will do, but a my variable assigned at run time won't.)
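Here's the whole %SIG mechanism in one self-contained fragment you can run from the command line. (The choice of SIGUSR1 and the variable names are mine.)

    my $got_signal = "";

    $SIG{USR1} = sub {
        ($got_signal) = @_;    # Perl passes the signal name to the handler
    };                         # short and sweet, as promised

    kill('USR1', $$);          # $$ is our own process ID
    print "caught SIG$got_signal\n" if $got_signal;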
SUBHEADING: 101 Uses for a dead signal

What can you usefully do with kill() and %SIG? One useful trick involves SIGALRM, the alarm signal. When you call the Perl command alarm(), it sets a timer; after the specified number of seconds, the kernel sends a SIGALRM signal to the current process. (You can only have one timer active at once, and each time you call alarm() it deletes the previous timer; you can also manually cancel an alarm by calling alarm(0). alarm() returns the amount of time remaining on the previous timer.)

Suppose you want to read some data from another process that's writing to a pipe -- but which might die of boredom or something. If you just read like this:

    while ($input_data eq "") {
        $input_data = <MY_PIPE>;
    }

... you risk being blocked forever by whatever process is on the other end of the filehandle MY_PIPE. You can avoid this problem by using SIGALRM and a custom signal handler:

    eval {
        my $alarm_timeout = 60;   # wait up to sixty seconds before giving up
        local $SIG{ALRM} = sub { die "Error: MY_PIPE coprocess timed out\n"; };
        alarm $alarm_timeout;
        while ($input_data eq "") {
            $input_data = <MY_PIPE>;
        }
        alarm 0;
    };           # end of eval block scope
    alarm 0;     # belt and braces: make sure the timer really is cancelled
    if ($@) {
        print "Error: $@\n";
    }

If the alarm goes off inside the eval() block, the appropriate action is to call die() -- otherwise Perl might try to restart the syscall to read from MY_PIPE. By calling die() in an eval block we raise an exception and exit the block; the result is that $@ holds the string "Error: MY_PIPE coprocess timed out". (There's a slightly more convoluted example in Chapter 16 of "Programming Perl", 3rd edition -- involving nested eval() scopes to get around a problem with a syscall that may not be implemented on some platforms. By ignoring cross-platform portability and focussing on Linux, we get off lightly!)

Here's another use for a signal (which we'll see an example of later): signalling a process group. On Linux, processes have a family tree. When you want to run a new program (a running instance of a program is called a process), your current program must use the fork() system call. fork() clones the current process, returning the process ID of the child to the parent and 0 to the child; the only distinguishing features of the new child are its new process ID and a couple of bits of metainformation used by the kernel. If you want to run a different program, the child -- which can tell that it's the child, because fork() returned 0 to it -- calls exec() to execute the new code (overwriting itself in the machine's memory, roughly speaking). Ultimately all processes are descended from init, the root process.

We refer to a parent and its child processes as a 'process group'; process groups have a number corresponding to the process ID of the process group leader (or parent). Perl processes can check their own process ID using the special variable $$, and collect the PIDs of children as they call fork(). You can individually signal a whole bunch of children:

    kill('HUP', @my_kids);

But if they've spawned grandchildren or other descendants, you may not have a list of their PIDs. The solution is to become a process group leader, set a signal handler for signal 1, SIGHUP, and then send a HUP signal to a negative number corresponding to the process ID of the process group leader. (Send a signal to, say, 407, and it affects process ID 407 alone. Send it to -407 and it goes to every process in process group 407 -- process 407 and all its descendants. So, if you send a signal to -$$, it goes to all the processes descended from $$.)
For example:

    eval {
        setpgrp(0, $$);   # make $$ the process group ID
    };
    if ($@) {
        die "Could not setpgrp(0, $$): $@\n";
    }

    # do stuff that spawns child processes here,
    # including shell scripts or daemons
    # ...
    # now it's time to shut down all descendants gracefully
    local $SIG{HUP} = 'IGNORE';   # we don't want to kill ourselves
    kill('HUP', -$$);             # signal the entire process group

When processes exit, the parent is expected to tidy up after them. A signal -- SIGCHLD -- gets sent to the parent, which must then use one of the wait() or waitpid() system calls to acknowledge the death in the family. Processes that don't get wait()ed for hang around as entries in the process table, and are known as zombies: when writing a program that spawns lots of children you need to take care of this. SIGCHLD was intended to allow parents to do things like wait for a child to run and terminate before continuing. Perl's system() and backquote mechanisms automatically take care of waiting for their children, but if you call fork() directly to spawn children, you need to deal with the consequences manually.

The simplest approach to dealing with a death in the family is to callously ignore it:

    $SIG{CHLD} = 'IGNORE';

Alternatively, you can subcontract the job out to a grim reaper -- a subroutine that periodically checks for zombies, and calls waitpid() to collect them. This is the recommended tactic for server processes (like the ones we'll eventually get round to writing). For example (by way of the Camel book):

    use POSIX ":sys_wait_h";   # supplies the WNOHANG flag

    our $zombies = 0;          # global counter for zombies
    $SIG{CHLD} = sub { $zombies++ };

    sub reaper {
        my $zombie;
        our %exit_status;
        $zombies = 0;
        while (($zombie = waitpid(-1, WNOHANG)) > 0) {
            $exit_status{$zombie} = $?;
        }
    }

    # now we do a whole lot more stuff until finally ...
    while (1) {   # main execution loop for the server
        if ($zombies > 0) {
            reaper();
        }
        # and so on
    }

The idea of this approach is that whenever a child dies, the global counter $zombies is incremented. Servers tend to run in a continuous loop; if $zombies shows that there are pending children waiting to be reaped, the main loop periodically kicks off the reaper() subroutine. The reaper() simply runs waitpid() repeatedly, recording the exit status of each child PID returned in $zombie, until waitpid() stops returning positive PIDs -- with WNOHANG it returns 0 while children are still running, and -1 when there are no children left -- indicating that there are no dead child processes to collect.

HEADING: Pipes

A pipe is a one-way I/O channel that funnels bytes from one process to another. Unlike signals, which merely convey a single bit of information (essentially: "hello, process 4117! A child of yours has died!"), pipes let us transfer arbitrary data from one process to another. They're not the only way of doing that -- temporary files and shared memory segments can be used for the same job -- but pipes are often very useful because they're easy to use in shell programming, and being able to access them from Perl lets us write Perl scripts that can interact with pipelines.

Important note: pipes connect two processes and they are unidirectional. That is: a given Perl program can read from one end of a pipe, or can write to one end of a pipe, but can't read and write to the same end of the same pipe. If you need bidirectional communication between two processes (call them A and B) you will need a pair of pipes: one which A writes to and B reads from, and one which B writes to and A reads from.

A second characteristic of pipes is that they're buffered. You can pump data into a pipe, but a pipe's capacity is finite: if the process at the other end doesn't suck data out fast enough, your next write will block.
(Clue: this is a good reason to practise using SIGALRM to time out operations that rely on another process.)

Finally, there are two different types of pipe on Linux -- anonymous pipes (the normal kind) and named pipes (which have names in the filesystem, and which you can open and close like a file). You also want to avoid getting pipes confused with sockets (which we go into at some length later on).

In Perl, there are two general strategies for using pipes. You can use a pipe to talk to another program by calling open() with some special arguments; or you can use the low-level pipe() command to create a couple of pipes, and then fork() so that a parent process can talk to its children.

SUBHEADING: Talking to aliens

We're used to seeing Perl programs open files for reading data or writing results, like this:

    open(INPUT, "<some_file") or die "Can't open some_file: $!\n";

But if the last character of the 'filename' is a pipe symbol, open() does something rather different:

    open(MONITOR, "netstat -an 2>/dev/null |") or die "Can't fork: $!\n";

This runs 'netstat -an', discards its standard error, and attaches its standard output to the filehandle MONITOR, so that we can do something like this:

    while ($data = <MONITOR>) {
        # do something with output from netstat
    }

When you use open() this way, Perl implicitly calls fork() and exec() to spawn a sub-process. This is not necessarily the program you expected to run: if the argument you give to open() has more than one token in it, open() spawns a shell and passes your command line to the shell, while connecting the standard output from the shell to your filehandle (if you're trying to read from a pipe), or your filehandle to the standard input of the shell (if you're trying to write to it). So it's legal to do something like this:

    open(WORDS, "find . -type f -name '*.txt' -exec cat {} \\; | wc -w |")
        or die "Can't fork: $!\n";

The output from the pipeline (find searches out the .txt files and runs cat on them; wc then counts the number of words output by cat) is then available for reading on the Perl filehandle WORDS. However, the pipeline command isn't executed directly by Perl, but by /bin/sh. (Note the doubled backslash: Perl's double-quoted string turns \\; into \;, which is what the shell needs to see in order to pass a literal semicolon on to find.)

SUBHEADING: Talking to yourself

Using open() to read from or write to external processes is all very well, but what if you want to read and write to the same process? Or if you want to spawn a child copy of your Perl process to carry out some task, then read some data back in from it? The traditional UNIXy way of doing this is pretty low-level. First, you use the pipe() system call to create a pipe. This returns two filehandles -- a reader and a writer. If you want bidirectional communication with your child, you call it twice, giving a total of four filehandles (one for the parent to write to, with a corresponding one for the child to read from, and one for the child to write to, with a corresponding reader handle for the parent). You then call fork() to spin off a child. The child now has to close the parent's filehandles, while the parent closes the child's (there's a sketch of this dance below).

Then you have to somehow manage I/O without a deadlock occurring. Deadlock occurs when two processes are both waiting for each other to do something, and it's deadly. If you hit a situation where the parent is waiting for the child to say something, but the child expects the parent to say something, both processes will hang -- unless you've set up a SIGALRM handler and set an alarm to go off and break one or other of the processes out of the deadly embrace. You might think you can avoid this situation by carefully planning your parent and child dialogue, but there's a fly in the ointment: UNIX I/O buffering.
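Here's that pipe()-and-fork() dance sketched out for one-way traffic, child to parent. (A minimal sketch: the filehandle names are my own, and a real program would also want the SIGCHLD precautions described earlier.)

    pipe(READER, WRITER) or die "Can't create pipe: $!\n";

    my $pid = fork();
    die "Can't fork: $!\n" unless defined $pid;

    if ($pid == 0) {             # child: we write, so close the read end
        close READER;
        select WRITER; $| = 1;   # unbuffer our end of the pipe (see below)
        print WRITER "hello from child $$\n";
        close WRITER;
        exit 0;
    } else {                     # parent: we read, so close the write end
        close WRITER;
        my $message = <READER>;
        print "parent read: $message";
        close READER;
        waitpid($pid, 0);        # reap the child -- no zombies here
    }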
Normally, input and output on filehandles are buffered -- that is, the standard I/O library saves up bytes of output until it has enough to fill a buffer, which can then be transferred to the reader process in one lump. This is an important optimization for UNIX systems -- otherwise the kernel would have to switch between executing the writer and the reader processes for every byte transferred, and context switches are expensive -- but if you just write a short string and expect a response, you may be disappointed, because the short string won't fill the I/O buffer and therefore the reader won't get to see it.

A partial solution to this is to tell Perl to use unbuffered I/O on a filehandle. You can do this using select:

    select FH;
    $| = 1;   # a true value turns autoflush on

The $| special variable controls I/O buffering; setting it to a true value forces a flush after every write on the currently selected filehandle -- and only that filehandle, so if you have a bunch of filehandles in use you may need to select and unbuffer each of them in turn.

The standard Perl distribution comes with a couple of life-belts for programmers who really want to write processes that do this sort of thing. First, there's the module IPC::Open2. IPC::Open2 supplies a single subroutine, open2(); this does what you'd intuitively expect of the (illegal) command:

    $pid = open(HANDLE, "| some_pipeline |");   # illegal! open() can't do both

That is, open2() takes two filehandles as arguments, and glues the first onto the standard output of a command (for you to read from) and the second onto its standard input (for you to write to). For example:

    use IPC::Open2;
    $pid = open2(\*READER, \*WRITER, $my_external_filter);

You can now read the filter's output from READER, and feed it input by writing to WRITER. If open2() succeeds it returns the PID of the child program; if it fails it raises an exception that you can trap with an eval block:

    eval {
        $pid = open2(\*READER, \*WRITER, $my_external_filter);
    };
    if ($@ =~ /^open2:/) {
        warn "failed to spawn $my_external_filter: $@\n";
    }

Note that open2() doesn't reap children, so you'll need to handle SIGCHLD (as described earlier) in programs that call it.

HEADING: Sockets

Signals and pipes have a major deficiency when we use them for inter-process communication: they're restricted to a single computer. If you want to do network programming, or write general-purpose client/server applications that can run on more than one machine, you need to look into sockets.

The easiest way to think of a socket is as a bidirectional pipe (yes! you can read from and write to the same socket) that, instead of connecting to the standard input and standard output of a process on the same computer, connects to a process bound to a TCP/IP address and port number -- possibly on another computer. The biggest 'gotcha' is to remember that while the socket itself is bidirectional, the work of setting up the connection requires distinctly different tasks to be carried out by the server and the client. To set up a socket-based connection, a server must first create a socket, tell it what IP address and port number to bind to, and call listen() to establish a queue of incoming connections. The server also needs to know what to do when an incoming connection arrives. The client's job is a bit easier: it needs to create a socket, tell it the address of the remote machine, and then call connect() to talk to the server.

There are several ways to write both servers and clients in Perl; suffice to say that the easy way is to use the standard IO::Socket::INET class, which provides a socket object. (Or you can use the lower-level Socket.pm module to provide low-level functions, for example if you want to mess around with the protocol type or set various flags on the socket.)
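To give you a feel for what the object interface saves you, here's roughly the same client-side connection dance spelled out with those low-level Socket.pm functions. (A sketch only; the host name is a stand-in.)

    use Socket;

    my $host = "www.example.com";   # stand-in host name
    my $port = 80;

    socket(SOCK, PF_INET, SOCK_STREAM, getprotobyname('tcp'))
        or die "Can't create socket: $!\n";
    my $packed_ip = inet_aton($host)
        or die "Can't resolve $host\n";
    connect(SOCK, sockaddr_in($port, $packed_ip))
        or die "Can't connect to $host:$port: $!\n";
    # SOCK is now a filehandle open for both reading and writing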
(In addition, if all you want to do is talk to a specific type of internet server -- that is, a program on a remote machine that uses a specific protocol, such as SMTP for mail routing or HTTP for serving up web content -- there are very high-level modules (such as Net::SMTP, or HTTP::Request and friends from the LWP bundle) that will let you issue requests without having to deal with sockets at all. But socket programming is unavoidable if you want to implement your own special services.)

For now, here's a very simple client that talks to a web server and requests the default page:

    use IO::Socket::INET;

    my $remote = "www.linuxformat.co.uk";   # a well-known web server ;-)
    my $http = 80;                          # standard port for HTTP traffic

    my $sock = IO::Socket::INET->new(PeerAddr => $remote,
                                     PeerPort => $http,
                                     Proto    => 'tcp',
                                     Type     => SOCK_STREAM)
        or die "Couldn't connect to $remote:$http: $!\n";

    print $sock "GET /\n\n";    # a minimalist HTTP request
    print STDOUT $sock->getlines();
    close $sock;

Note the line "print STDOUT $sock->getlines()". IO::Socket::INET inherits a bunch of methods from IO::Handle, a very neat abstraction that lets us treat filehandles as objects. Sockets are treated much like filehandles in Perl, and getlines() simply reads a list of lines from the filehandle. All the real socket-handling voodoo is concealed in IO::Socket::INET::new(), which accepts a bunch of parameters but defaults to sensible values when setting up a new socket connection.

Here's a really simple server. It listens for connections to the local machine on port 9999, and when one arrives, prints the current date and time to the client:

    use IO::Socket::INET;

    my $port = 9999;   # some unused TCP port to bind to

    my $server = IO::Socket::INET->new(LocalPort => $port,
                                       Type      => SOCK_STREAM,
                                       Reuse     => 1,
                                       Listen    => 10)
        or die "Couldn't bind to port $port: $!\n";

    while (my $client = $server->accept()) {
        $client->print(scalar(localtime(time)), "\n");
    }

Warning: don't try to turn this into a general-purpose server! For a whole bunch of reasons (which we'll look at in depth next month), this server won't scale up. But next month we're going to see how to do it properly -- how to write a server that will cope with a whole bunch of clients and do useful work.

(END)