Linux Format 25

Perl Tutorial

TITLE: Writing client/server programs

Linux is a POSIX-compatible, UNIX-like operating system. As such, it has a whole grab-bag of features for letting one program talk to another. In the beginning, Version 7 UNIX provided pipes, signals and temporary files. The BSD versions added TCP/IP networking and sockets, while AT&T's UNIX team added a slew of IPC (inter-process communication) options, including shared memory, semaphores, and STREAMS (not really used on Linux). Then the Bell Labs Plan 9 research team, who got heavily into doing esoteric things with filesystems, invented the /proc filesystem; the Linux kernel developers snaffled it and added their own extras, notably the shared memory filesystem (which makes shared memory easier to use).

The result is that Linux is a rich environment for writing client/server software, although not all the mechanisms you'll find there are portable to other operating systems. A shm-filesystem-based solution won't run on Solaris or other SVR4 UNIXes, and SVR3/SVR4 shared memory won't work on BSD. So I'm going to keep to the limited subset of client/server tools that are relatively portable -- signals, pipes, and sockets. In this tutorial, I'm going to go over some of the basics of interprocess communication -- signals and pipes, with a side-order of sockets. In next month's tutorial we'll see how to build socket-based internet client and server programs (such as a simple web server).

HEADING: Signals

Signals are about the most primitive way for two processes (executing programs) to communicate: they transfer no information at all beyond the fact that a signal with a given number has been raised, because some condition has occurred. For example, if a process divides by zero, SIGFPE (floating point exception) is sent to the process. A lot of other conditions can trigger signals; for example, hitting the interrupt key on a TTY that a process is connected to sends a SIGINT (interrupt) signal, and the kill command can be used to send arbitrary signals. You can see a list of the signals supported by Linux in the signal(7) man page; it's a superset of those specified by POSIX.1 and SUSv2 (the Single UNIX Specification). A process can do one of three things with a signal: ignore it, allow the default action to occur (for SIGFPE, the default is to terminate the process), or provide a function that is called when the signal occurs. Perl processes can set handler functions, and they can also send signals. Here's how it works.

There's only one signal that doesn't affect the target process: signal 0. Sending signal 0 merely checks whether the process is alive and whether you're allowed to signal it; if kill(0, $some_process_id) fails, then $some_process_id isn't accepting signals from you. (Maybe it's a zombie?)

To send a signal in Perl, we use the kill() command. kill() takes two or more parameters -- the signal to send, followed by a list of the PIDs (process ID numbers) of the processes to signal -- and it returns the number of processes successfully signalled. Signals can be given in numeric form (e.g. 9 for SIGKILL) or symbolic form (e.g. 'HUP' for SIGHUP).
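To make that concrete, here's the sort of thing you might write. (A sketch only: the PID is made up, and of course you can only signal processes you have permission to signal.)

    my $pid = 4117;                     # made-up PID of a process we own

    if (kill(0, $pid)) {                # signal 0: probe without disturbing it
        my $count = kill('HUP', $pid);  # kill() returns the number signalled
        print "sent SIGHUP to $count process(es)\n";
    } else {
        warn "can't signal process $pid: $!\n";
    }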
Note that your Perl script can't kill processes belonging to higher-privileged user IDs -- if you want to kill one of root's processes, you have to be root, and if you do this from a CGI script running with a UID of 'nobody' or 'httpd' you can only signal other processes owned by 'nobody' or 'httpd'.

There are two ways of processing a received signal in Perl: the traditional way, and the sigtrap pragma.

The traditional way first: Perl provides a hash called %SIG, in the main namespace. You can create subroutines to handle specific signals, and store them in %SIG; for example, if you write a subroutine called handle_hup() to handle SIGHUP (hang-up) signals, you'd put a reference to it (or just its name) in $SIG{HUP}:

    $SIG{HUP} = \&handle_hup;

Thereafter, whenever the process receives a SIGHUP, the subroutine handle_hup() is triggered. You can also assign the name of a subroutine to %SIG:

    $SIG{HUP} = 'handle_hup';

Or an anonymous subroutine:

    $SIG{HUP} = sub { die "Killed by SIGHUP!\n" };

Now, here's a problem: signals are asynchronous. They can come at you from anywhere and at any time. To handle asynchronous events like this, you need to write re-entrant code -- code that can be invoked from within itself without losing track of where the program counter and registers have got to. This isn't re-entrancy at the Perl level -- it's at the C or assembly language level, the level Perl is written in. Perl is linked to libc, and older versions of libc are not re-entrant -- GNU libc, aka libc6, is, but you can't count on older Linux systems providing it. This means that your Perl signal handler may go bang messily if it receives a second signal while it's still handling the first. So you need to keep signal handler subroutines short and sweet. Ideally all they should do is log an error message somewhere, close any open system resources (such as open filehandles), and call die() gracefully. Even doing that much can be a bit risky.

Note that %SIG is a global hash; if you want to change a signal handler locally, say within a subroutine, you should override the relevant entry using local() -- as in local $SIG{HUP} -- so that when you leave the enclosing scope %SIG regains its normal settings.

The shiny new Perl approach to signals is the sigtrap pragma (a pragma is a built-in feature controlled using use() that changes the way the compiler behaves). Sigtrap lets you specify that one of a number of predefined behaviours is to be triggered by one or more signals. Behaviours include 'stack-trace' (output a Perl stack trace and dump core), 'die' (die with a programmer-specified message), and 'handler', which is followed by a user-written subroutine to use (as in %SIG). Lists of signals to apply a behaviour to include 'normal-signals' (HUP, INT, PIPE, or TERM), 'error-signals' (ABRT, BUS, EMT, FPE, ILL, QUIT, SEGV, SYS and TRAP -- these would usually cause your Perl interpreter to die), 'old-interface-signals', 'untrapped' (all signals for which no handler has yet been installed), and 'any' (all signals, period). You can also list signals by name or number. For example:

    use sigtrap qw(stack-trace normal-signals);

causes Perl to install the stack-trace handler for all the signals listed in the 'normal-signals' set.

    use sigtrap qw(die untrapped);

makes the die() handler the default for any signals we haven't previously set up a handler for. And:

    sub my_handler { warn "SIGHUP received\n" }
    use sigtrap 'handler' => \&my_handler, 'HUP';

installs a handler for SIGHUP that prints 'SIGHUP received' on STDERR. (Remember that use() takes effect at compile time, so the handler you hand to sigtrap must exist by then -- a named subroutine will do, but a my variable assigned at run time won't.)
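Here's the whole %SIG mechanism in one self-contained fragment you can run from the command line. (The choice of SIGUSR1 and the variable names are mine.)

    my $got_signal = "";

    $SIG{USR1} = sub {
        ($got_signal) = @_;    # Perl passes the signal name to the handler
    };                         # short and sweet, as promised

    kill('USR1', $$);          # $$ is our own process ID
    print "caught SIG$got_signal\n" if $got_signal;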
SUBHEADING: 101 Uses for a dead signal

What can you usefully do with kill() and %SIG? One useful trick involves SIGALRM, the alarm signal. When you call the Perl command alarm(), it sets a timer; after the specified number of seconds, the kernel sends a SIGALRM signal to the current process. (You can only have one timer active at once, and each time you call alarm() it deletes the previous timer; you can also manually cancel an alarm by calling alarm(0). alarm() returns the amount of time remaining on the previous timer.)

Suppose you want to read some data from another process that's writing to a pipe -- but which might die of boredom or something. If you just read like this:

    while ($input_data eq "") {
        $input_data = <MY_PIPE>;
    }

... you risk being blocked forever by whatever process is on the other end of the filehandle MY_PIPE. You can avoid this problem by using SIGALRM and a custom signal handler:

    eval {
        my $alarm_timeout = 60;   # wait up to sixty seconds before giving up
        local $SIG{ALRM} = sub { die "Error: MY_PIPE coprocess timed out\n"; };
        alarm $alarm_timeout;
        while ($input_data eq "") {
            $input_data = <MY_PIPE>;
        }
        alarm 0;
    };           # end of eval block scope
    alarm 0;     # belt and braces: make sure the timer really is cancelled
    if ($@) {
        print "Error: $@\n";
    }

If the alarm goes off inside the eval() block, the appropriate action is to call die() -- otherwise Perl might try to restart the syscall to read from MY_PIPE. By calling die() in an eval block we raise an exception and exit the block; the result is that $@ holds the string "Error: MY_PIPE coprocess timed out". (There's a slightly more convoluted example in Chapter 16 of "Programming Perl", 3rd edition -- involving nested eval() scopes to get around a problem with a syscall that may not be implemented on some platforms. By ignoring cross-platform portability and focussing on Linux, we get off lightly!)

Here's another use for a signal (which we'll see an example of later): signalling a process group. On Linux, processes have a family tree. When you want to run a new program (a running instance of a program is called a process), your current program must use the fork() system call. fork() clones the current process, returning the process ID of the child to the parent and 0 to the child; the only distinguishing features of the new child are its new process ID and a couple of bits of metainformation used by the kernel. If you want to run a different program, the child -- which can tell that it's the child, because fork() returned 0 to it -- calls exec() to execute the new code (overwriting itself in the machine's memory, roughly speaking). Ultimately all processes are descended from init, the root process.

We refer to a parent and its child processes as a 'process group'; process groups have a number corresponding to the process ID of the process group leader (or parent). Perl processes can check their own process ID using the special variable $$, and collect the PIDs of children as they call fork(). You can individually signal a whole bunch of children:

    kill('HUP', @my_kids);

But if they've spawned grandchildren or other descendants, you may not have a list of their PIDs. The solution is to become a process group leader, set a signal handler for signal 1, SIGHUP, and then send a HUP signal to a negative number corresponding to the process ID of the process group leader. (Send a signal to, say, 407, and it affects process ID 407 alone. Send it to -407 and it goes to every process in process group 407 -- process 407 and all its descendants. So, if you send a signal to -$$, it goes to all the processes descended from $$.)
For example:

    eval {
        setpgrp(0, $$);   # make $$ the process group ID
    };
    if ($@) {
        die "Could not setpgrp(0, $$): $@\n";
    }

    # do stuff that spawns child processes here,
    # including shell scripts or daemons
    # ...
    # now it's time to shut down all descendants gracefully
    local $SIG{HUP} = 'IGNORE';   # we don't want to kill ourselves
    kill('HUP', -$$);             # signal the entire process group

When processes exit, the parent is expected to tidy up after them. A signal -- SIGCHLD -- gets sent to the parent, which must then use one of the wait() or waitpid() system calls to acknowledge the death in the family. Processes that don't get wait()ed for hang around as entries in the process table, and are known as zombies: when writing a program that spawns lots of children you need to take care of this. SIGCHLD was intended to allow parents to do things like wait for a child to run and terminate before continuing. Perl's system() and backquote mechanisms automatically take care of waiting for their children, but if you call fork() directly to spawn children, you need to deal with the consequences manually.

The simplest approach to dealing with a death in the family is to callously ignore it:

    $SIG{CHLD} = 'IGNORE';

Alternatively, you can subcontract the job out to a grim reaper -- a subroutine that periodically checks for zombies, and calls waitpid() to collect them. This is the recommended tactic for server processes (like the ones we'll eventually get round to writing). For example (by way of the Camel book):

    use POSIX ":sys_wait_h";   # supplies the WNOHANG flag

    our $zombies = 0;          # global counter for zombies
    $SIG{CHLD} = sub { $zombies++ };

    sub reaper {
        my $zombie;
        our %exit_status;
        $zombies = 0;
        while (($zombie = waitpid(-1, WNOHANG)) > 0) {
            $exit_status{$zombie} = $?;
        }
    }

    # now we do a whole lot more stuff until finally ...
    while (1) {   # main execution loop for the server
        if ($zombies > 0) {
            reaper();
        }
        # and so on
    }

The idea of this approach is that whenever a child dies, the global counter $zombies is incremented. Servers tend to run in a continuous loop; if $zombies shows that there are pending children waiting to be reaped, the main loop periodically kicks off the reaper() subroutine. The reaper() simply runs waitpid() repeatedly, recording the exit status of each child PID returned in $zombie, until waitpid() stops returning positive PIDs -- with WNOHANG it returns 0 while children are still running, and -1 when there are no children left -- indicating that there are no dead child processes to collect.

HEADING: Pipes

A pipe is a one-way I/O channel that funnels bytes from one process to another. Unlike signals, which merely convey a single bit of information (essentially: "hello, process 4117! A child of yours has died!"), pipes let us transfer arbitrary data from one process to another. They're not the only way of doing that -- temporary files and shared memory segments can be used for the same job -- but pipes are often very useful because they're easy to use in shell programming, and being able to access them from Perl lets us write Perl scripts that can interact with pipelines.

Important note: pipes connect two processes and they are unidirectional. That is: a given Perl program can read from one end of a pipe, or can write to one end of a pipe, but can't read and write to the same end of the same pipe. If you need bidirectional communication between two processes (call them A and B) you will need a pair of pipes: one which A writes to and B reads from, and one which B writes to and A reads from.

A second characteristic of pipes is that they're buffered. You can pump data into a pipe, but a pipe's capacity is finite: if the process at the other end doesn't suck data out fast enough, your next write will block.
(Clue: this is a good reason to practise using SIGALRM to time out operations that rely on another process.)

Finally, there are two different types of pipe on Linux -- anonymous pipes (the normal kind) and named pipes (which have names in the filesystem, and which you can open and close like a file). You also want to avoid getting pipes confused with sockets (which we go into at some length later on).

In Perl, there are two general strategies for using pipes. You can use a pipe to talk to another program by calling open() with some special arguments; or you can use the low-level pipe() command to create a couple of pipes, and then fork() so that a parent process can talk to its children.

SUBHEADING: Talking to aliens

We're used to seeing Perl programs open files for reading data or writing results, like this:

    open(INPUT, "<some_file") or die "Can't open some_file: $!\n";

But if the last character of the 'filename' is a pipe symbol, open() does something rather different:

    open(MONITOR, "netstat -an 2>/dev/null |") or die "Can't fork: $!\n";

This runs 'netstat -an', discards its standard error, and attaches its standard output to the filehandle MONITOR, so that we can do something like this:

    while ($data = <MONITOR>) {
        # do something with output from netstat
    }

When you use open() this way, Perl implicitly calls fork() and exec() to spawn a sub-process. This is not necessarily the program you expected to run: if the argument you give to open() has more than one token in it, open() spawns a shell and passes your command line to the shell, while connecting the standard output from the shell to your filehandle (if you're trying to read from a pipe), or your filehandle to the standard input of the shell (if you're trying to write to it). So it's legal to do something like this:

    open(WORDS, "find . -type f -name '*.txt' -exec cat {} \\; | wc -w |")
        or die "Can't fork: $!\n";

The output from the pipeline (find searches out the .txt files and runs cat on them; wc then counts the number of words output by cat) is then available for reading on the Perl filehandle WORDS. However, the pipeline command isn't executed directly by Perl, but by /bin/sh. (Note the doubled backslash: Perl's double-quoted string turns \\; into \;, which is what the shell needs to see in order to pass a literal semicolon on to find.)

SUBHEADING: Talking to yourself

Using open() to read from or write to external processes is all very well, but what if you want to read and write to the same process? Or if you want to spawn a child copy of your Perl process to carry out some task, then read some data back in from it? The traditional UNIXy way of doing this is pretty low-level. First, you use the pipe() system call to create a pipe. This returns two filehandles -- a reader and a writer. If you want bidirectional communication with your child, you call it twice, giving a total of four filehandles (one for the parent to write to, with a corresponding one for the child to read from, and one for the child to write to, with a corresponding reader handle for the parent). You then call fork() to spin off a child. The child now has to close the parent's filehandles, while the parent closes the child's (there's a sketch of this dance below).

Then you have to somehow manage I/O without a deadlock occurring. Deadlock occurs when two processes are both waiting for each other to do something, and it's deadly. If you hit a situation where the parent is waiting for the child to say something, but the child expects the parent to say something, both processes will hang -- unless you've set up a SIGALRM handler and set an alarm to go off and break one or other of the processes out of the deadly embrace. You might think you can avoid this situation by carefully planning your parent and child dialogue, but there's a fly in the ointment: UNIX I/O buffering.
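Here's that pipe()-and-fork() dance sketched out for one-way traffic, child to parent. (A minimal sketch: the filehandle names are my own, and a real program would also want the SIGCHLD precautions described earlier.)

    pipe(READER, WRITER) or die "Can't create pipe: $!\n";

    my $pid = fork();
    die "Can't fork: $!\n" unless defined $pid;

    if ($pid == 0) {             # child: we write, so close the read end
        close READER;
        select WRITER; $| = 1;   # unbuffer our end of the pipe (see below)
        print WRITER "hello from child $$\n";
        close WRITER;
        exit 0;
    } else {                     # parent: we read, so close the write end
        close WRITER;
        my $message = <READER>;
        print "parent read: $message";
        close READER;
        waitpid($pid, 0);        # reap the child -- no zombies here
    }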
Normally, input and output on filehandles are buffered -- that is, the standard I/O library saves up bytes of output until it has enough to fill a buffer, which can then be transferred to the reader process in one lump. This is an important optimization for UNIX systems -- otherwise the kernel would have to switch between executing the writer and the reader processes for every byte transferred, and context switches are expensive -- but if you just write a short string and expect a response, you may be disappointed, because the short string won't fill the I/O buffer and therefore the reader won't get to see it.

A partial solution to this is to tell Perl to use unbuffered I/O on a filehandle. You can do this using select:

    select FH;
    $| = 1;   # a true value turns autoflush on

The $| special variable controls I/O buffering; setting it to a true value forces a flush after every write on the currently selected filehandle -- and only that filehandle, so if you have a bunch of filehandles in use you may need to select and unbuffer each of them in turn.

The standard Perl distribution comes with a couple of life-belts for programmers who really want to write processes that do this sort of thing. First, there's the module IPC::Open2. IPC::Open2 supplies a single subroutine, open2(); this does what you'd intuitively expect of the (illegal) command:

    $pid = open(HANDLE, "| some_pipeline |");   # illegal! open() can't do both

That is, open2() takes two filehandles as arguments, and glues the first onto the standard output of a command (for you to read from) and the second onto its standard input (for you to write to). For example:

    use IPC::Open2;
    $pid = open2(\*READER, \*WRITER, $my_external_filter);

You can now read the filter's output from READER, and feed it input by writing to WRITER. If open2() succeeds it returns the PID of the child program; if it fails it raises an exception that you can trap with an eval block:

    eval {
        $pid = open2(\*READER, \*WRITER, $my_external_filter);
    };
    if ($@ =~ /^open2:/) {
        warn "failed to spawn $my_external_filter: $@\n";
    }

Note that open2() doesn't reap children, so you'll need to handle SIGCHLD (as described earlier) in programs that call it.

HEADING: Sockets

Signals and pipes have a major deficiency when we use them for inter-process communication: they're restricted to a single computer. If you want to do network programming, or write general-purpose client/server applications that can run on more than one machine, you need to look into sockets.

The easiest way to think of a socket is as a bidirectional pipe (yes! you can read from and write to the same socket) that, instead of connecting to the standard input and standard output of a process on the same computer, connects to a process bound to a TCP/IP address and port number -- possibly on another computer. The biggest 'gotcha' is to remember that while the socket itself is bidirectional, the work of setting up the connection requires distinctly different tasks to be carried out by the server and the client. To set up a socket-based connection, a server must first create a socket, tell it what IP address and port number to bind to, and call listen() to establish a queue of incoming connections. The server also needs to know what to do when an incoming connection arrives. The client's job is a bit easier: it needs to create a socket, tell it the address of the remote machine, and then call connect() to talk to the server.

There are several ways to write both servers and clients in Perl; suffice to say that the easy way is to use the standard IO::Socket::INET class, which provides a socket object. (Or you can use the lower-level Socket.pm module to provide low-level functions, for example if you want to mess around with the protocol type or set various flags on the socket.)
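To give you a feel for what the object interface saves you, here's roughly the same client-side connection dance spelled out with those low-level Socket.pm functions. (A sketch only; the host name is a stand-in.)

    use Socket;

    my $host = "www.example.com";   # stand-in host name
    my $port = 80;

    socket(SOCK, PF_INET, SOCK_STREAM, getprotobyname('tcp'))
        or die "Can't create socket: $!\n";
    my $packed_ip = inet_aton($host)
        or die "Can't resolve $host\n";
    connect(SOCK, sockaddr_in($port, $packed_ip))
        or die "Can't connect to $host:$port: $!\n";
    # SOCK is now a filehandle open for both reading and writing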
(In addition, if all you want to do is talk to a specific type of internet server -- that is, a program on a remote machine that uses a specific protocol, such as SMTP for mail routing or HTTP for serving up web content -- there are very high-level modules (such as Net::SMTP, or HTTP::Request and friends from the LWP bundle) that will let you issue requests without having to deal with sockets at all. But socket programming is unavoidable if you want to implement your own special services.)

For now, here's a very simple client that talks to a web server and requests the default page:

    use IO::Socket::INET;

    my $remote = "www.linuxformat.co.uk";   # a well-known web server ;-)
    my $http = 80;                          # standard port for HTTP traffic

    my $sock = IO::Socket::INET->new(PeerAddr => $remote,
                                     PeerPort => $http,
                                     Proto    => 'tcp',
                                     Type     => SOCK_STREAM)
        or die "Couldn't connect to $remote:$http: $!\n";

    print $sock "GET /\n\n";    # a minimalist HTTP request
    print STDOUT $sock->getlines();
    close $sock;

Note the line "print STDOUT $sock->getlines()". IO::Socket::INET inherits a bunch of methods from IO::Handle, a very neat abstraction that lets us treat filehandles as objects. Sockets are treated much like filehandles in Perl, and getlines() simply reads a list of lines from the filehandle. All the real socket-handling voodoo is concealed in IO::Socket::INET::new(), which accepts a bunch of parameters but defaults to sensible values when setting up a new socket connection.

Here's a really simple server. It listens for connections to the local machine on port 9999, and when one arrives, prints the current date and time to the client:

    use IO::Socket::INET;

    my $port = 9999;   # some unused TCP port to bind to

    my $server = IO::Socket::INET->new(LocalPort => $port,
                                       Type      => SOCK_STREAM,
                                       Reuse     => 1,
                                       Listen    => 10)
        or die "Couldn't bind to port $port: $!\n";

    while (my $client = $server->accept()) {
        $client->print(scalar(localtime(time)), "\n");
    }

Warning: don't try to turn this into a general-purpose server! For a whole bunch of reasons (which we'll look at in depth next month), this server won't scale up. But next month we're going to see how to do it properly -- how to write a server that will cope with a whole bunch of clients and do useful work.

(END)