Linux Format 27

[[ Typographical notes: indented text is program listing
   text surrounded in _underscores_ is italicised/emphasised ]]

Perl Tutorial

TITLE: Roll your own network clients

In last month's Perl tutorial, we took a look at how network servers work, what they do, and how to go about writing a simple one in Perl -- with most of the effort going not into Perl code, but into defining a protocol (language) for the server to talk to its clients. This month, we're going to look at how to go about writing networking protocol clients. We'll start with the client for the simple manpage protocol (introduced last month), then look at some other aspects of client-side programming in Perl -- including things like off-the-shelf libraries for writing web robots, parsing HTML, and doing FTP.

What does a network client look like in Perl? Here's about the simplest web client you can knock up using the modules in the standard Perl distribution:

   use IO::Socket::INET;

   my $remote = "www.linuxformat.co.uk";   # a well-known web server ;-)
   my $http   = 80;                        # standard port for HTTP traffic
   my $sock   = IO::Socket::INET->new(PeerAddr => $remote,
                                      PeerPort => $http,
                                      Proto    => 'tcp',
                                      Type     => SOCK_STREAM)
       or die "Couldn't connect to $remote:$http: $!\n";
   print $sock "GET /\n\n";                # a minimalist HTTP request
   print STDOUT $sock->getlines();
   close $sock;

The IO::Handle module in the standard Perl 5.6 distribution provides a common interface to i/o tasks in Perl. A variety of sub-classes such as IO::File and IO::Socket are derived from it; they understand all the methods of IO::Handle, but add additional methods for dealing with files and sockets respectively. In general the IO modules all use similar semantics; you treat an IO object as if it's a wrapper around a file handle. The usual Perl filehandle i/o commands (such as print(), getc(), eof() and read()) are all present, albeit provided as methods to call on the IO::File or IO::Socket object rather than as built-ins.

Some of these core modules have sub-types in turn; for example, IO::Socket provides a generic interface to the Linux socket mechanism, but you need to use IO::Socket::INET to handle internet domain sockets (ones that send data over TCP/IP), in contrast to IO::Socket::UNIX (which builds UNIX domain sockets -- these run very rapidly but are restricted to the local machine).

Note that there are some non-standard IO::Handle-derived modules that aren't in the standard distribution. IO::Stringy lets you treat a scalar variable as if it's a file handle -- great for splatting volumes of text into a handle that you're going to pass to some other subroutine. And IO::Stty lets you do things to the i/o settings on a serial TTY device. But we're not going to deal with these here.

It's possible to use lower-level Perl commands to set up a socket connection; the Socket.pm module provides a low-level interface to the C socket.h standard library, and the basic socket system calls are provided as Perl built-in commands. But unless you want to deliberately build TCP/IP packets by hand to meet some specific aim (such as generating malformed packets that appear to originate from another machine) there's not much point in doing this. The basic socket connection provided by IO::Socket::INET lets you connect to a server on a remote host, read and write data, check for the end of a data stream, and set buffering.
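Incidentally, the UNIX domain flavour mentioned above looks almost identical in use. Here's a minimal sketch -- the socket path /tmp/echo.sock, and the local echo service notionally listening on it, are invented for illustration:

   use IO::Socket::UNIX;

   my $sock = IO::Socket::UNIX->new(Peer => '/tmp/echo.sock',
                                    Type => SOCK_STREAM)
       or die "Couldn't connect to /tmp/echo.sock: $!\n";
   print $sock "hello\n";        # send a line to the local server
   print STDOUT scalar <$sock>;  # read one line back
   close $sock;

The only real difference is that we name a filesystem path rather than a host and port; everything after the constructor is the same filehandle-style i/o.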
SUBTITLE: Protocols -- the client's eye view

Just being able to connect to a server isn't enough; the client needs to know how to get the server to do something useful. The usual solution, as we saw last month, is to use a protocol -- a well-defined set of rules governing how a client and server should communicate. Protocols dictate the order in which a client and server exchange information, so that the session never enters a state in which the server is waiting to hear something from the client and the client is waiting to hear something from the server -- a deadlock situation that can persist indefinitely. Protocols often provide some level of error recovery.

The example in the previous section is about the lowest-possible level of client for HTTP, the hypertext transport protocol used by the web. After connecting to an HTTP server, our client prints a message like this:

   GET /

(followed by two carriage returns). It then reads whatever the server emits and prints it to its standard output. The GET command in HTTP (or rather, the GET method) tells the server to return the object named after it -- in this case '/', the document root on the web server. If you execute this code, Linux Format's web server will reply:

   200 OK

then emit a stream of HTML (the default web page it's set to serve up as the document root).

HTTP is a simple, stateless protocol; that is, we can connect to a server, issue commands, and each command will be executed without reference to the previous one. Other TCP/IP protocols are not so simple. An example of a stateful protocol is SMTP, the Simple Mail Transport Protocol. The client connects, tells the server to accept mail from one person, destined for another, and starts sending the mail. The 'MAIL FROM' command that introduces the mail tells the server to enter a state in which it is accepting mail from someone. This can be followed by a 'RCPT TO' command, telling the server who the mail is for. But RCPT TO can be followed by either another RCPT TO (introducing another recipient) or by DATA (introducing the message data). Keeping track of which state the client is in is essential. Other common stateful protocols include FTP (File Transfer Protocol), NNTP (NetNews Transfer Protocol), and IRC (Internet Relay Chat).

In Perl, the commonest way to handle the client side of a stateful protocol is to use a Perl module that provides an object-oriented interface to a session using that protocol. When a client program connects to, say, a mail server to send mail out, we can be pretty sure that its connection, and its state within that session, aren't dependent on any outside influences. By wrapping the connection up in an object, and providing methods to poke data into the object and read data out, we can build an object-oriented interface to a protocol that makes it easy to build network clients.

As it happens, there are Perl modules on CPAN that implement the client side of a whole bunch of network protocols. You can install a bunch of them with the following command:

   perl -MCPAN -e 'install Bundle::libnet'

This bundle installs Net::Cmd, a collection of methods that can be inherited by descendants of IO::Handle and that provide tools used by command-based protocols (like SMTP or FTP).
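To make the object-oriented approach concrete, here's a sketch of an SMTP session driven through the bundle's Net::SMTP module. The relay name and both addresses are invented -- you'd substitute a mail server that's willing to talk to you:

   use Net::SMTP;

   my $smtp = Net::SMTP->new('mail.example.com')     # hypothetical relay
       or die "Couldn't connect to mail server\n";
   $smtp->mail('sender@example.com');                # MAIL FROM
   $smtp->to('recipient@example.com');               # RCPT TO
   $smtp->data();                                    # DATA: start the message
   $smtp->datasend("Subject: test\n");
   $smtp->datasend("\n");
   $smtp->datasend("A one-line test message.\n");
   $smtp->dataend();                                 # finish; server queues the mail
   $smtp->quit;                                      # QUIT

Each method maps onto one of the protocol exchanges described above; the object keeps track of where we are in the conversation so we don't have to.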
The Net::Cmd methods are used in turn by Net::FTP (which, with its special-purpose subclasses, provides a client-side implementation of the FTP protocol), Net::NNTP (which provides the NNTP netnews transfer protocol client), Net::POP3 (a client for the popular POP3 remote mailbox delivery protocol), Net::SMTP (which allows client-side sending of outgoing email), Net::Time (a network time protocol client), and utility modules such as Net::Netrc (which provides access to the .netrc file used by the ftp program to store login details for remote FTP servers) and Net::Domain (which attempts to work out the domain of the current host).

SUBTITLE: The simple manserver protocol

Last month we introduced a simple server that provides access to man pages over the network (by invoking man in response to various commands, and returning its output to the client). The manpage protocol is about as simple as it gets. The client opens a connection to the server on port 8888 and sends a request. Requests are a bunch of lines, each with a keyword and optional arguments. The server waits for the request to terminate (with an END command) then sends a reply and closes the connection. The server times out stale connections after 60 seconds.

To request a manpage, the client sends:

   PAGE <name>
   SECTION <section>
   END

The <section> can be a number (1-8) or the keyword ALL, meaning send all the man pages named <name> -- see below for how they're delivered. The client can run an 'apropos' search (equivalent to 'man -k') by sending:

   SEARCH <string>
   END

And it can run a 'whatis' search (equivalent to 'man -f') by sending:

   DESCRIBE <string>
   END

If the server parses the request and knows what to do, it sends:

   200 - OK

followed by the body of the manpage or search result. If the request doesn't make sense, the server sends:

   500 - Bad Request

and closes the connection.
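So a complete session fetching the ls(1) page might run like this (the greeting text is whatever last month's manserver emits on connect, and the page body is elided):

   (on connect, the server sends its one-line greeting)
   PAGE ls          <- client
   SECTION 1        <- client
   END              <- client
   200 - OK         <- server
   (the formatted manpage text follows, then the server closes the connection)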
Now, we could use the tools in Net::Cmd to implement proper protocol-dependent behaviour, but the manserver protocol is stateless; we just issue commands and read whatever comes back. So Net::Cmd is overkill. Instead, we want a command -- called, say, manc (for 'man client') -- that accepts the same basic arguments as man: that is, a man page name, an optional section, and optional -k and -w arguments. Depending on its arguments it should then open a connection to the man server (specified, as we'll see below, by a pair of environment variables -- a simple configuration file in /etc, such as /etc/manc.conf, would be the more usual approach) and send the request. It should then print whatever comes back, or a suitable error message.

SUBTITLE: The manc client program -- how it works

The manc client is designed as a replacement for the man command. Instead of invoking external commands (notably troff) to format a man page stored on the local filesystem, manc parses its arguments, then opens a socket connection to the manserver program (in last month's column) and sends a protocol request. Here are the highlights (line numbers, as ever, not required by Perl):

First, we set up the modules manc uses. In this case, Pod::Text is used to format the help text (actually a man page embedded in the program), Getopt::Long is used for parsing command-line arguments, and IO::Socket::INET is used to talk to the server:

   013: use strict;
   014:
   015: use Getopt::Long;
   016: use IO::Socket::INET;
   017: use Pod::Text;

(use strict isn't an external module but a compiler pragma -- it turns on strict variable declaration checking.)

   019: # ------------------------------ main program variables
   020: my ($man) = "";          # flag: used to print man page
   021: # ---------------------------- set up configuration variables - default
   022:
   023: my ($server, $port) = "";
   024:
   025: if (exists $ENV{MANSERVER}) {
   026:     $server = $ENV{MANSERVER};
   027: } else {
   028:     $server = "localhost";
   029: }
   030:
   031: if (exists $ENV{MANPORT}) {
   032:     $port = $ENV{MANPORT};
   033: } else {
   034:     $port = 8888;
   035: }

The bit above lets us configure the manc program via two environment variables -- MANSERVER and MANPORT. These default to 'localhost' and 8888 respectively and indicate the server the program will try to connect to. (We could have built a configuration file mechanism in, using a module like ConfigReader::DirectiveStyle to do the heavy lifting, but for the purposes of a tutorial program this seemed like overkill. We demonstrated a program that used ConfigReader in Linux Format #24.)

Our manc client implements the basic man features -- look up a manpage by section and name, search the man descriptions for a string (apropos), and retrieve the one-line description of a manpage name (whatis). The next part of the program uses Getopt::Long to work out what sort of command line we're processing. The picture is complicated by the possibility that manc has been fed a mangled command line (say, a combination of the -k (apropos) and -w (whatis) flags).

   044: # ------------------------------ first, get options for -s, -k, -w flags
   045:
   046: my $section = "";    # manpage section
   047: my $page    = "";    # man page name, if given
   048: my $apropos = "";    # apropos string, if given
   049: my $whatis  = "";    # whatis string, if given
   050: my $help    = 0;     # flag; print help if > 0
   051:
   052: GetOptions(
   053:     "section=s" => \$section,
   054:     "s=s"       => \$section,
   055:     "k=s"       => \$apropos,
   056:     "w=s"       => \$whatis,
   057:     "help"      => \$help);
   058:
   059: # ------------------------------ now get manpage and section from ARGV
   060:
   061: if (scalar(@ARGV) == 2) {
   062:     # presumably we have arguments of the form: manc <section> <page>
   063:     ($section, $page) = (@ARGV);
   064: } elsif (scalar(@ARGV) == 1) {
   065:     # presumably we have arguments of the form: manc <page>
   066:     $page = $ARGV[0];
   067: }
   068:
   069: # ----------------------------- check for confused arguments
   070:
   071: my $sanity = 0;
   072: if ( $apropos ) { $sanity++ }
   073: if ( $whatis )  { $sanity++ }
   074: if ( $page )    { $sanity++ }
   075: # print "sanity: $sanity\n";   # debugging aid, disabled
   076: if ( ($sanity != 1) or ($help > 0) ) {
   077:     # the user entered an invalid mixture of arguments or asked for help
   078:     do_help();
   079:     exit;
   080: }

Lines 52-58 are a call to GetOptions. GetOptions (imported from Getopt::Long) scans the command line and if it finds a known option it puts any parameter to that option in the appropriately named variable. Lines 61-67 then check for section/page arguments. man traditionally takes either one argument (a man page name) or two arguments (section number, page name), hence the significance of looking at the number of arguments left in @ARGV after we've finished processing the options. At lines 71-80 we check for valid combinations of query. $sanity is a flag that indicates a sane request; if we've got parameters relating to different types of query, something's gone wrong and it's time to give the user some help.
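Concretely, the command lines manc is intended to swallow look like this (all hypothetical):

   manc ls                                # page lookup, any section
   manc 1 ls                              # page lookup, section 1
   manc -s 1 ls                           # the same, using the -s option
   manc -k socket                         # apropos search for 'socket'
   manc -w printf                         # whatis lookup for 'printf'
   MANSERVER=docs.example.com manc ls     # query a (hypothetical) remote server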
If the sanity check passes, we have a more or less appropriate set of parameters, so it's time to move on:

   082: if ( $apropos ) { print STDOUT @{ do_apropos($server, $port, $apropos) } }
   083: if ( $whatis )  { print STDOUT @{ do_whatis($server, $port, $whatis) } }
   084: if ( $page )    { print STDOUT @{ do_man($server, $port, $page, $section) } }
   085: exit;

That's all the main program! Actually, it isn't: depending on whether we're processing an apropos, whatis, or man query, we need to do different things. To avoid excessively complex if() constructs, we farm the details out to different subroutines, each of which returns an arrayref pointing to an array of lines of text containing the answer sent by the manserver program. This, when printed, provides the output for manc.

Three subroutines (do_apropos, do_whatis and do_man) create a manserver protocol request, as a string. They then feed the request to a single routine, talk_to_server(), and return whatever talk_to_server() gives them.

   116: sub talk_to_server {
   117:     my ($server, $port, $request) = @_;
   118:     my @response = ();
   119:     eval {
   120:         my $alarm_timeout = 60;
   121:         local %SIG;
   122:         $SIG{ALRM} = sub { die "Error: timeout\n"; };
   123:         alarm $alarm_timeout;
   124:         my $socket = new IO::Socket::INET(PeerAddr => $server,
   125:                                           PeerPort => $port,
   126:                                           Proto    => 'tcp',
   127:                                           Type     => SOCK_STREAM)
   128:             or die "Couldn't connect to $server:$port: $!\n";
   129:         print $socket $request;
   130:         @response = $socket->getlines();
   131:         close $socket;
   132:         alarm 0;
   133:     };
   134:     if ($@) { return [ "Error, timeout: $@\n" ] }
   135:     @response = @response[2..$#response];   # discard first two lines
   136:     return \@response;
   137: }

talk_to_server() is the subroutine that actually does the dialogue with the manserver program. It's wrapped in an eval() block, from lines 119 to 133 -- this allows us to set a SIGALRM handler inside the block and do something sensible if the alarm times out and causes an exception. If we remove all the signal-handling stuff (which was explained in Linux Format 26) what we're left with is:

   my $socket = new IO::Socket::INET(PeerAddr => $server,
                                     PeerPort => $port,
                                     Proto    => 'tcp',
                                     Type     => SOCK_STREAM)
       or die "Couldn't connect to $server:$port: $!\n";
   print $socket $request;
   @response = $socket->getlines();
   close $socket;

$socket is an IO::Socket::INET object -- a wrapper around a filehandle that is bound to a socket connected to $server on port $port. We simply print a message to it (to send a request), then read from it to see what the program at the other end sends us back. (getlines() reads all the lines from a filehandle until eof() is true.)

Note line 135:

   135: @response = @response[2..$#response];   # discard first two lines

The manserver returns two things that aren't part of the response -- a line saying "hello, I'm a manserver" when the client first connects, and a protocol response saying "200 - ok" or "500 - not ok" (or words to that effect) in response to the request. Line 135 naively and crudely throws this information away. If we wanted to do the job properly we wouldn't do that -- we'd make some attempt to get the status line, then parse it, and only then call getlines() (if necessary) to read the man page text.
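A sketch of that more careful approach, assuming the greeting is always exactly one line and the status line has the "200 - OK" / "500 - Bad Request" shape described earlier:

   my $greeting = $socket->getline();    # the "hello, I'm a manserver" banner
   my $status   = $socket->getline();    # e.g. "200 - OK" or "500 - Bad Request"
   if (defined $status and $status =~ /^200/) {
       @response = $socket->getlines();  # now it's safe to read the page text
   } else {
       die "Server error: " . (defined $status ? $status : "no status line\n");
   }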
   139: sub do_help {
   140:     my $parser = Pod::Text->new();
   141:     $parser->parse_from_file($0, "-");
   142: }

do_help() is a boilerplate help routine; the program has an __END__ marker at the end of the Perl code, followed by its man page in POD documentation format. do_help() feeds the file $0 (that is, the currently executing Perl script's source file) to Pod::Text and tells it to parse the stream, emitting text on the file handle "-" (a traditional UNIXy shorthand meaning whatever file handle is bound to STDOUT).

SUBTITLE: What next?

This is a pretty basic man page client. For starters, there's no option to return unformatted (troff -man macro source) pages. For seconds, there's no local caching mechanism; ideally manc would maintain a small (on the order of 1Mb) cache of files, and look there before going out over the net to pester a documentation server. And all good clients should set their exit status sensibly when they exit, in case they're invoked within a shell script.

We could get elaborate by adding extra functionality on the server side, too. It's not hard to use one of the many HTML-to-text filters to provide access to the Linux HOWTO and Mini-HOWTO documentation, which lives in a set location (according to the Filesystem Hierarchy Standard).

Monitoring the response codes from the server is a must. For example, as written, this manserver client will not be able to deal intelligently with a server-side error. If the server hangs, the alarm call should enable the client to exit gracefully -- but there's no way for it to deal with the server returning an error code indicating that there's no such manpage, or that some other error has occurred.

Probably the worst issue with this client (and the most subtle) is that it is vulnerable to attack. Perl doesn't succumb to buffer overruns -- it has dynamically resizable strings -- but it is possible for an attacker who has subverted a server to force-feed the client a long string, eventually eating up all available memory on the client machine. A solution for this would be to replace getlines() with something more controllable (such as calling read() to fill a buffer until eof() becomes true or a hard limit is exceeded).
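Here's a sketch of that defence; the 512Kb ceiling is an arbitrary figure chosen for illustration:

   my $limit  = 512 * 1024;    # refuse to accept more than 512Kb
   my $buffer = "";
   my $chunk  = "";
   while ($socket->read($chunk, 4096)) {    # read up to 4Kb at a time
       $buffer .= $chunk;
       die "Error: response too large\n" if length($buffer) > $limit;
   }
   my @response = split /^/m, $buffer;      # carve the buffer back into lines

read() returns 0 at end of file, so the loop terminates naturally when the server has said its piece -- but a malicious server can no longer talk us to death.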
However, this example illustrates the basics of exchanging data between a client and a server. Have fun, and remember: man pages are not the only fruit!

END (BODY COPY)

BOXOUT: Spidering the web

Pulling stuff off a server is all very well, but it raises the most important question you need to ask when writing a network client: what do you do with the data when you've got it?

Processing data retrieved from a network server raises a bunch of questions. In general, there are about four types of data you're likely to be concerned with. Firstly, there's free-form text with no organising structure; objects pulled in from an FTP server often fall into this category. Secondly, there is regular structured data. Email or usenet message headers fall into this category, as do CSV files and a few similar things: you can make assumptions about them having a regular field-based layout with certain items stored in each field. Thirdly, there is nested data. The body of an email message containing MIME attachments may consist of a MIME wrapper around a bunch of included MIME sections. Similarly, an HTML file contains nested tagged text structured in accordance with a DTD. Handling nested data almost always requires a fairly sophisticated parser that recognises the containing structure and builds a tree of elements. And finally there's binary data -- which may or may not be amenable to processing in Perl (pack() and unpack() are your can-opening friends here) and which often has an internal structure that poses problems qualitatively equivalent to one of the other three types of data.

When messing around with network clients, we are mostly concerned with types two and three -- regularly structured messages (such as email or usenet message headers, or HTTP headers) and nested structures (such as email attachments or HTML and XML files). If you want to dig around inside data retrieved from a network server, you first need to know what you're dealing with.

It may be that you're dealing with a mixture of data types. For example, if you establish an HTTP connection to a web server with keep-alive, you may receive multiple files via the same connection, encapsulated within separate MIME-encoded messages. To deal with this sort of bundle, you need a two-stage process: first, to identify the separate message components (by parsing the HTTP response messages and separating out and decoding the attachments), and secondly to deal with individual components (for example, by parsing the HTML in a file into a tree representing a document's structure, which can then be searched for specific keywords, either identified as attributes of a META tag, or as plain text).

The difficulty of this job should not be under-estimated. Any Perl developer working on this sort of project would do well to lay their hands on a copy of "Data Munging with Perl" by David Cross (ISBN: 1-930110-00-6, published by Manning: see www.manning.com/cross). This book is concerned with one of Perl's core tasks -- taking raw data from a source (such as a server), manipulating or parsing it, and processing it into a final form. It includes a lot of valuable insight into processing nested formats like HTML and XML, and an introduction to Parse::RecDescent, the standard Perl module for building recursive-descent parsers (which can cope with nested expressions).
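To give a flavour of it, here's a minimal sketch of a Parse::RecDescent grammar that copes with arbitrarily nested, bracketed comma-lists; the grammar itself is invented for illustration:

   use Parse::RecDescent;

   my $grammar = q{
       list: '(' item(s /,/) ')'    # a parenthesised, comma-separated list
       item: /\w+/ | list           # an item is a word or a nested list
   };
   my $parser = Parse::RecDescent->new($grammar);
   print "parsed OK\n" if defined $parser->list("(a,(b,c),d)");

Because the item rule refers back to list, the parser recurses into nested sub-lists to any depth -- exactly the sort of thing a simple regular expression can't do.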
The world wide web is a special case. Perl has a rich grab-bag of tools for serving up data under Apache and other web servers -- and an equally rich client-side grab-bag of tools. The biggest and most powerful Swiss Army chainsaw in your toolbox is the CPAN Bundle::LWP package -- LWP is short for "Lib-WWW Perl", and you can install it by typing:

   perl -MCPAN -e 'install Bundle::LWP;'

LWP contains two types of tool: gadgets for retrieving web pages from a remote server, and tools for parsing HTML. The first set of classes, grouped under HTTP::Request, treat an HTTP request as an object; you can set the method to use (GET, PUT, POST, or HEAD, as defined in the HTTP RFCs), the URI (uniform resource identifier) denoting the object to retrieve, and additional headers. You execute an HTTP::Request by passing the object to a User Agent (such as LWP::UserAgent), which creates the network connection to the server and handles communication-related aspects of the transaction, such as coping with timeouts and making use of proxy servers. When activated, the User Agent returns an HTTP::Response object, which provides access to a response code, HTTP response headers, and data returned from the server. For example:

   use LWP::UserAgent;

   # create a user agent
   my $ua = LWP::UserAgent->new;
   $ua->agent("Test/0.1 ");

   # create a request
   my $req = HTTP::Request->new(GET => 'http://www.perl.com/index.html');

   # throw the HTTP::Request at LWP::UserAgent, getting back an HTTP::Response
   my $res = $ua->request($req);

   # check the outcome of the response
   if ($res->is_success) {
       print $res->content;
   } else {
       print "Bad luck this time\n";
   }

   # get contents
   my $html = $res->content;

What you do with the contents of the HTTP response once you've retrieved it is somewhat more difficult; a good starting point is Gisle Aas's module HTML::Parser, which can be found on CPAN and which provides a toolkit for building tools that parse HTML files and extract specified data from them -- for example, all the link addresses, or all the text enclosed in a given pair of tags. We'll be looking at HTML::Parser in detail in another tutorial.

END BOXOUT (Spidering the web)