Linux Format 27

[[ Typographical notes: indented text is program listing
   text surrounded in _underscores_ is italicised/emphasised ]]

Perl Tutorial

TITLE: Roll your own network clients

In last month's Perl tutorial, we took a look at how network servers work, what they do, and how to go about writing a simple one in Perl -- with most of the effort going not into Perl code, but into defining a protocol (language) for the server to talk to its clients. This month, we're going to look at how to go about writing networking protocol clients. We'll start with the client for the simple manpage protocol (introduced last month), then look at some other aspects of client-side programming in Perl -- including things like off-the-shelf libraries for writing web robots, parsing HTML, and doing FTP.

What does a network client look like in Perl? Here's about the simplest web client you can knock up using the modules in the standard Perl distribution:

   use IO::Socket::INET;

   my $remote = "www.linuxformat.co.uk";   # a well-known web server ;-)
   my $http   = 80;                        # standard port for HTTP traffic
   my $sock   = IO::Socket::INET->new(PeerAddr => $remote,
                                      PeerPort => $http,
                                      Proto    => 'tcp',
                                      Type     => SOCK_STREAM)
       or die "Couldn't connect to $remote:$http: $!\n";
   print $sock "GET /\n\n";                # a minimalist HTTP request
   print STDOUT $sock->getlines();
   close $sock;

The IO::Handle module in the standard Perl 5.6 distribution provides a common interface to i/o tasks in Perl. A variety of sub-classes such as IO::File and IO::Socket are derived from it; they understand all the methods of IO::Handle, but add additional methods for dealing with files and sockets respectively. In general the IO modules all use similar semantics; you treat an IO object as if it's a wrapper around a file handle. The usual Perl filehandle i/o commands (such as print(), getc(), eof() and read()) are all present, albeit provided as methods to call on the IO::File or IO::Socket object rather than as built-ins.

Some of these core modules have sub-types in turn; for example, IO::Socket provides a generic interface to the Linux socket mechanism, but you need to use IO::Socket::INET to handle internet domain sockets (ones that send data over TCP/IP), in contrast to IO::Socket::UNIX (which builds UNIX domain sockets -- these run very rapidly but are restricted to the local machine).

Note that there are some non-standard IO::Handle-derived modules that aren't in the standard distribution. IO::Stringy lets you treat a scalar variable as if it's a file handle -- great for splatting volumes of text into a handle that you're going to pass to some other subroutine. And IO::Stty lets you do things to the i/o settings on a serial TTY device. But we're not going to deal with these here.

It's possible to use lower-level Perl commands to set up a socket connection; the Socket.pm module provides a low-level interface to the C socket.h standard library, and the basic socket system calls are provided as Perl built-in commands. But unless you want to deliberately build TCP/IP packets by hand to meet some specific aim (such as generating malformed packets that appear to originate from another machine) there's not much point in doing this. The basic socket connection provided by IO::Socket::INET lets you connect to a server on a remote host, read and write data, check for the end of a data stream, and set buffering.
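Incidentally, the UNIX domain flavour mentioned above looks almost identical in use. Here's a minimal sketch -- the socket path /tmp/echo.sock, and the local echo service notionally listening on it, are invented for illustration:

   use IO::Socket::UNIX;

   my $sock = IO::Socket::UNIX->new(Peer => '/tmp/echo.sock',
                                    Type => SOCK_STREAM)
       or die "Couldn't connect to /tmp/echo.sock: $!\n";
   print $sock "hello\n";        # send a line to the local server
   print STDOUT scalar <$sock>;  # read one line back
   close $sock;

The only real difference is that we name a filesystem path rather than a host and port; everything after the constructor is the same filehandle-style i/o.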
SUBTITLE: Protocols -- the client's eye view

Just being able to connect to a server isn't enough; the client needs to know how to get the server to do something useful. The usual solution, as we saw last month, is to use a protocol -- a well-defined set of rules governing how a client and server should communicate. Protocols dictate the order in which a client and server exchange information, so that the session never enters a state in which the server is waiting to hear something from the client and the client is waiting to hear something from the server -- a deadlock situation that can persist indefinitely. Protocols often provide some level of error recovery.

The example in the previous section is about the lowest-possible level of client for HTTP, the hypertext transport protocol used by the web. After connecting to an HTTP server, our client prints a message like this:

   GET /

(followed by two carriage returns). It then reads whatever the server emits and prints it to its standard output. The GET command in HTTP (or rather, the GET method) tells the server to return the object named after it -- in this case '/', the document root on the web server. If you execute this code, Linux Format's web server will reply:

   200 OK

then emit a stream of HTML (the default web page it's set to serve up as the document root).

HTTP is a simple, stateless protocol; that is, we can connect to a server, issue commands, and each command will be executed without reference to the previous one. Other TCP/IP protocols are not so simple. An example of a stateful protocol is SMTP, the Simple Mail Transport Protocol. The client connects, tells the server to accept mail from one person, destined for another, and starts sending the mail. The 'MAIL FROM' command that introduces the mail tells the server to enter a state in which it is accepting mail from someone. This can be followed by a 'RCPT TO' command, telling the server who the mail is for. But RCPT TO can be followed by either another RCPT TO (introducing another recipient) or by DATA (introducing the message data). Keeping track of which state the client is in is essential. Other common stateful protocols include FTP (File Transfer Protocol), NNTP (NetNews Transfer Protocol), and IRC (Internet Relay Chat).

In Perl, the commonest way to handle the client side of a stateful protocol is to use a Perl module that provides an object-oriented interface to a session using that protocol. When a client program connects to, say, a mail server to send mail out, we can be pretty sure that its connection, and its state within that session, aren't dependent on any outside influences. By wrapping the connection up in an object, and providing methods to poke data into the object and read data out, we can build an object-oriented interface to a protocol that makes it easy to build network clients.

As it happens, there are Perl modules on CPAN that implement the client side of a whole bunch of network protocols. You can install a bunch of them with the following command:

   perl -MCPAN -e 'install Bundle::libnet'

This bundle installs Net::Cmd, a collection of methods that can be inherited by descendants of IO::Handle and that provide tools used by command-based protocols (like SMTP or FTP).
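To make the object-oriented approach concrete, here's a sketch of an SMTP session driven through the bundle's Net::SMTP module. The relay name and both addresses are invented -- you'd substitute a mail server that's willing to talk to you:

   use Net::SMTP;

   my $smtp = Net::SMTP->new('mail.example.com')     # hypothetical relay
       or die "Couldn't connect to mail server\n";
   $smtp->mail('sender@example.com');                # MAIL FROM
   $smtp->to('recipient@example.com');               # RCPT TO
   $smtp->data();                                    # DATA: start the message
   $smtp->datasend("Subject: test\n");
   $smtp->datasend("\n");
   $smtp->datasend("A one-line test message.\n");
   $smtp->dataend();                                 # finish; server queues the mail
   $smtp->quit;                                      # QUIT

Each method maps onto one of the protocol exchanges described above; the object keeps track of where we are in the conversation so we don't have to.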
The Net::Cmd methods are used in turn by Net::FTP (which, with its special-purpose subclasses, provides a client-side implementation of the FTP protocol), Net::NNTP (which provides the NNTP netnews transfer protocol client), Net::POP3 (a client for the popular POP3 remote mailbox delivery protocol), Net::SMTP (which allows client-side sending of outgoing email), Net::Time (a network time protocol client), and utility modules such as Net::Netrc (which provides access to the .netrc file used by the ftp program to store login details for remote FTP servers) and Net::Domain (which attempts to work out the domain of the current host).

SUBTITLE: The simple manserver protocol

Last month we introduced a simple server that provides access to man pages over the network (by invoking man in response to various commands, and returning its output to the client). The manpage protocol is about as simple as it gets. The client opens a connection to the server on port 8888 and sends a request. Requests are a bunch of lines, each with a keyword and optional arguments. The server waits for the request to terminate (with an END command) then sends a reply and closes the connection. The server times out stale connections after 60 seconds.

To request a manpage, the client sends:

   PAGE <name>
   SECTION <section>
   END

The <section> can be a number (1-8) or the keyword ALL, meaning send all the man pages named <name> -- see below for how they're delivered. The client can run an 'apropos' search (equivalent to 'man -k') by sending:

   SEARCH <string>
   END

And it can run a 'whatis' search (equivalent to 'man -f') by sending:

   DESCRIBE <string>
   END

If the server parses the request and knows what to do, it sends:

   200 - OK

followed by the body of the manpage or search result. If the request doesn't make sense, the server sends:

   500 - Bad Request

and closes the connection.
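So a complete session fetching the ls(1) page might run like this (the greeting text is whatever last month's manserver emits on connect, and the page body is elided):

   (on connect, the server sends its one-line greeting)
   PAGE ls          <- client
   SECTION 1        <- client
   END              <- client
   200 - OK         <- server
   (the formatted manpage text follows, then the server closes the connection)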
Now, we could use the tools in Net::Cmd to implement proper protocol-dependent behaviour, but the manserver protocol is stateless; we just issue commands and read whatever comes back. So Net::Cmd is overkill. Instead, we want a command -- called, say, manc (for 'man client') -- that accepts the same basic arguments as man: that is, a man page name, an optional section, and optional -k and -w arguments. Depending on its arguments it should then open a connection to the man server (specified, as we'll see below, by a pair of environment variables -- a simple configuration file in /etc, such as /etc/manc.conf, would be the more usual approach) and send the request. It should then print whatever comes back, or a suitable error message.

SUBTITLE: The manc client program -- how it works

The manc client is designed as a replacement for the man command. Instead of invoking external commands (notably troff) to format a man page stored on the local filesystem, manc parses its arguments, then opens a socket connection to the manserver program (in last month's column) and sends a protocol request. Here are the highlights (line numbers, as ever, not required by Perl):

First, we set up the modules manc uses. In this case, Pod::Text is used to format the help text (actually a man page embedded in the program), Getopt::Long is used for parsing command-line arguments, and IO::Socket::INET is used to talk to the server:

   013: use strict;
   014:
   015: use Getopt::Long;
   016: use IO::Socket::INET;
   017: use Pod::Text;

(use strict isn't an external module but a compiler pragma -- it turns on strict variable declaration checking.)

   019: # ------------------------------ main program variables
   020: my ($man) = "";          # flag: used to print man page
   021: # ---------------------------- set up configuration variables - default
   022:
   023: my ($server, $port) = "";
   024:
   025: if (exists $ENV{MANSERVER}) {
   026:     $server = $ENV{MANSERVER};
   027: } else {
   028:     $server = "localhost";
   029: }
   030:
   031: if (exists $ENV{MANPORT}) {
   032:     $port = $ENV{MANPORT};
   033: } else {
   034:     $port = 8888;
   035: }

The bit above lets us configure the manc program via two environment variables -- MANSERVER and MANPORT. These default to 'localhost' and 8888 respectively and indicate the server the program will try to connect to. (We could have built a configuration file mechanism in, using a module like ConfigReader::DirectiveStyle to do the heavy lifting, but for the purposes of a tutorial program this seemed like overkill. We demonstrated a program that used ConfigReader in Linux Format #24.)

Our manc client implements the basic man features -- look up a manpage by section and name, search the man descriptions for a string (apropos), and retrieve the one-line description of a manpage name (whatis). The next part of the program uses Getopt::Long to work out what sort of command line we're processing. The picture is complicated by the possibility that manc has been fed a mangled command line (say, a combination of the -k (apropos) and -w (whatis) flags).

   044: # ------------------------------ first, get options for -s, -k, -w flags
   045:
   046: my $section = "";    # manpage section
   047: my $page    = "";    # man page name, if given
   048: my $apropos = "";    # apropos string, if given
   049: my $whatis  = "";    # whatis string, if given
   050: my $help    = 0;     # flag; print help if > 0
   051:
   052: GetOptions(
   053:     "section=s" => \$section,
   054:     "s=s"       => \$section,
   055:     "k=s"       => \$apropos,
   056:     "w=s"       => \$whatis,
   057:     "help"      => \$help);
   058:
   059: # ------------------------------ now get manpage and section from ARGV
   060:
   061: if (scalar(@ARGV) == 2) {
   062:     # presumably we have arguments of the form: manc <section> <page>
   063:     ($section, $page) = (@ARGV);
   064: } elsif (scalar(@ARGV) == 1) {
   065:     # presumably we have arguments of the form: manc <page>
   066:     $page = $ARGV[0];
   067: }
   068:
   069: # ----------------------------- check for confused arguments
   070:
   071: my $sanity = 0;
   072: if ( $apropos ) { $sanity++ }
   073: if ( $whatis )  { $sanity++ }
   074: if ( $page )    { $sanity++ }
   075: # print "sanity: $sanity\n";   # debugging aid, disabled
   076: if ( ($sanity != 1) or ($help > 0) ) {
   077:     # the user entered an invalid mixture of arguments or asked for help
   078:     do_help();
   079:     exit;
   080: }

Lines 52-58 are a call to GetOptions. GetOptions (imported from Getopt::Long) scans the command line and if it finds a known option it puts any parameter to that option in the appropriately named variable. Lines 61-67 then check for section/page arguments. man traditionally takes either one argument (a man page name) or two arguments (section number, page name), hence the significance of looking at the number of arguments left in @ARGV after we've finished processing the options. At lines 71-80 we check for valid combinations of query. $sanity is a flag that indicates a sane request; if we've got parameters relating to different types of query, something's gone wrong and it's time to give the user some help.
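Concretely, the command lines manc is intended to swallow look like this (all hypothetical):

   manc ls                                # page lookup, any section
   manc 1 ls                              # page lookup, section 1
   manc -s 1 ls                           # the same, using the -s option
   manc -k socket                         # apropos search for 'socket'
   manc -w printf                         # whatis lookup for 'printf'
   MANSERVER=docs.example.com manc ls     # query a (hypothetical) remote server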
If the sanity check passes, we have a more or less appropriate set of parameters, so it's time to move on:

   082: if ( $apropos ) { print STDOUT @{ do_apropos($server, $port, $apropos) } }
   083: if ( $whatis )  { print STDOUT @{ do_whatis($server, $port, $whatis) } }
   084: if ( $page )    { print STDOUT @{ do_man($server, $port, $page, $section) } }
   085: exit;

That's all the main program! Actually, it isn't: depending on whether we're processing an apropos, whatis, or man query, we need to do different things. To avoid excessively complex if() constructs, we farm the details out to different subroutines, each of which returns an arrayref pointing to an array of lines of text containing the answer sent by the manserver program. This, when printed, provides the output for manc.

Three subroutines (do_apropos, do_whatis and do_man) create a manserver protocol request, as a string. They then feed the request to a single routine, talk_to_server(), and return whatever talk_to_server() gives them.

   116: sub talk_to_server {
   117:     my ($server, $port, $request) = @_;
   118:     my @response = ();
   119:     eval {
   120:         my $alarm_timeout = 60;
   121:         local %SIG;
   122:         $SIG{ALRM} = sub { die "Error: timeout\n"; };
   123:         alarm $alarm_timeout;
   124:         my $socket = new IO::Socket::INET(PeerAddr => $server,
   125:                                           PeerPort => $port,
   126:                                           Proto    => 'tcp',
   127:                                           Type     => SOCK_STREAM)
   128:             or die "Couldn't connect to $server:$port: $!\n";
   129:         print $socket $request;
   130:         @response = $socket->getlines();
   131:         close $socket;
   132:         alarm 0;
   133:     };
   134:     if ($@) { return [ "Error, timeout: $@\n" ] }
   135:     @response = @response[2..$#response];   # discard first two lines
   136:     return \@response;
   137: }

talk_to_server() is the subroutine that actually does the dialogue with the manserver program. It's wrapped in an eval() block, from lines 119 to 133 -- this allows us to set a SIGALRM handler inside the block and do something sensible if the alarm times out and causes an exception. If we remove all the signal-handling stuff (which was explained in Linux Format 26) what we're left with is:

   my $socket = new IO::Socket::INET(PeerAddr => $server,
                                     PeerPort => $port,
                                     Proto    => 'tcp',
                                     Type     => SOCK_STREAM)
       or die "Couldn't connect to $server:$port: $!\n";
   print $socket $request;
   @response = $socket->getlines();
   close $socket;

$socket is an IO::Socket::INET object -- a wrapper around a filehandle that is bound to a socket connected to $server on port $port. We simply print a message to it (to send a request), then read from it to see what the program at the other end sends us back. (getlines() reads all the lines from a filehandle until eof() is true.)

Note line 135:

   135: @response = @response[2..$#response];   # discard first two lines

The manserver returns two things that aren't part of the response -- a line saying "hello, I'm a manserver" when the client first connects, and a protocol response saying "200 - ok" or "500 - not ok" (or words to that effect) in response to the request. Line 135 naively and crudely throws this information away. If we wanted to do the job properly we wouldn't do that -- we'd make some attempt to get the status line, then parse it, and only then call getlines() (if necessary) to read the man page text.
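A sketch of that more careful approach, assuming the greeting is always exactly one line and the status line has the "200 - OK" / "500 - Bad Request" shape described earlier:

   my $greeting = $socket->getline();    # the "hello, I'm a manserver" banner
   my $status   = $socket->getline();    # e.g. "200 - OK" or "500 - Bad Request"
   if (defined $status and $status =~ /^200/) {
       @response = $socket->getlines();  # now it's safe to read the page text
   } else {
       die "Server error: " . (defined $status ? $status : "no status line\n");
   }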
   139: sub do_help {
   140:     my $parser = Pod::Text->new();
   141:     $parser->parse_from_file($0, "-");
   142: }

do_help() is a boilerplate help routine; the program has an __END__ marker at the end of the Perl code, followed by its man page in POD documentation format. do_help() feeds the file $0 (that is, the currently executing Perl script's source file) to Pod::Text and tells it to parse the stream, emitting text on the file handle "-" (a traditional UNIXy shorthand meaning whatever file handle is bound to STDOUT).

SUBTITLE: What next?

This is a pretty basic man page client. For starters, there's no option to return unformatted (troff -man macro source) pages. For seconds, there's no local caching mechanism; ideally manc would maintain a small (on the order of 1Mb) cache of files, and look there before going out over the net to pester a documentation server. And all good clients should set their exit status sensibly when they exit, in case they're invoked within a shell script.

We could get elaborate by adding extra functionality on the server side, too. It's not hard to use one of the many HTML-to-text filters to provide access to the Linux HOWTO and Mini-HOWTO documentation, which lives in a set location (according to the Filesystem Hierarchy Standard).

Monitoring the response codes from the server is a must. For example, as written, this manserver client will not be able to deal intelligently with a server-side error. If the server hangs, the alarm call should enable the client to exit gracefully -- but there's no way for it to deal with the server returning an error code indicating that there's no such manpage, or that some other error has occurred.

Probably the worst issue with this client (and the most subtle) is that it is vulnerable to attack. Perl doesn't succumb to buffer overruns -- it has dynamically resizable strings -- but it is possible for an attacker who has subverted a server to force-feed the client a long string, eventually eating up all available memory on the client machine. A solution for this would be to replace getlines() with something more controllable (such as calling read() to fill a buffer until eof() becomes true or a hard limit is exceeded).
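Here's a sketch of that defence; the 512Kb ceiling is an arbitrary figure chosen for illustration:

   my $limit  = 512 * 1024;    # refuse to accept more than 512Kb
   my $buffer = "";
   my $chunk  = "";
   while ($socket->read($chunk, 4096)) {    # read up to 4Kb at a time
       $buffer .= $chunk;
       die "Error: response too large\n" if length($buffer) > $limit;
   }
   my @response = split /^/m, $buffer;      # carve the buffer back into lines

read() returns 0 at end of file, so the loop terminates naturally when the server has said its piece -- but a malicious server can no longer talk us to death.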
However, this example illustrates the basics of exchanging data between a client and a server. Have fun, and remember: man pages are not the only fruit!

END (BODY COPY)

BOXOUT: Spidering the web

Pulling stuff off a server is all very well, but it raises the most important question you need to ask when writing a network client: what do you do with the data when you've got it?

Processing data retrieved from a network server raises a bunch of questions. In general, there are about four types of data you're likely to be concerned with. Firstly, there's free-form text with no organising structure; objects pulled in from an FTP server often fall into this category. Secondly, there is regular structured data. Email or usenet message headers fall into this category, as do CSV files and a few similar things: you can make assumptions about them having a regular field-based layout with certain items stored in each field. Thirdly, there is nested data. The body of an email message containing MIME attachments may consist of a MIME wrapper around a bunch of included MIME sections. Similarly, an HTML file contains nested tagged text structured in accordance with a DTD. Handling nested data almost always requires a fairly sophisticated parser that recognises the containing structure and builds a tree of elements. And finally there's binary data -- which may or may not be amenable to processing in Perl (pack() and unpack() are your can-opening friends here) and which often has an internal structure that poses problems qualitatively equivalent to one of the other three types of data.

When messing around with network clients, we are mostly concerned with types two and three -- regularly structured messages (such as email or usenet message headers, or HTTP headers) and nested structures (such as email attachments or HTML and XML files). If you want to dig around inside data retrieved from a network server, you first need to know what you're dealing with.

It may be that you're dealing with a mixture of data types. For example, if you establish an HTTP connection to a web server with keep-alive, you may receive multiple files via the same connection, encapsulated within separate MIME-encoded messages. To deal with this sort of bundle, you need a two-stage process: first, to identify the separate message components (by parsing the HTTP response messages and separating out and decoding the attachments), and secondly to deal with individual components (for example, by parsing the HTML in a file into a tree representing a document's structure, which can then be searched for specific keywords, either identified as attributes of a META tag, or as plain text).

The difficulty of this job should not be under-estimated. Any Perl developer working on this sort of project would do well to lay their hands on a copy of "Data Munging with Perl" by David Cross (ISBN: 1-930110-00-6, published by Manning: see www.manning.com/cross). This book is concerned with one of Perl's core tasks -- taking raw data from a source (such as a server), manipulating or parsing it, and processing it into a final form. It includes a lot of valuable insight into processing nested formats like HTML and XML, and an introduction to Parse::RecDescent, the standard Perl module for building recursive-descent parsers (which can cope with nested expressions).
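To give a flavour of it, here's a minimal sketch of a Parse::RecDescent grammar that copes with arbitrarily nested, bracketed comma-lists; the grammar itself is invented for illustration:

   use Parse::RecDescent;

   my $grammar = q{
       list: '(' item(s /,/) ')'    # a parenthesised, comma-separated list
       item: /\w+/ | list           # an item is a word or a nested list
   };
   my $parser = Parse::RecDescent->new($grammar);
   print "parsed OK\n" if defined $parser->list("(a,(b,c),d)");

Because the item rule refers back to list, the parser recurses into nested sub-lists to any depth -- exactly the sort of thing a simple regular expression can't do.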
The world wide web is a special case. Perl has a rich grab-bag of tools for serving up data under Apache and other web servers -- and an equally rich client-side grab-bag of tools. The biggest and most powerful Swiss Army chainsaw in your toolbox is the CPAN Bundle::LWP package -- LWP is short for "Lib-WWW Perl", and you can install it by typing:

   perl -MCPAN -e 'install Bundle::LWP;'

LWP contains two types of tool: gadgets for retrieving web pages from a remote server, and tools for parsing HTML. The first set of classes, grouped under HTTP::Request, treat an HTTP request as an object; you can set the method to use (GET, PUT, POST, or HEAD, as defined in the HTTP RFCs), the URI (uniform resource identifier) denoting the object to retrieve, and additional headers. You execute an HTTP::Request by passing the object to a User Agent (such as LWP::UserAgent), which creates the network connection to the server and handles communication-related aspects of the transaction, such as coping with timeouts and making use of proxy servers. When activated, the User Agent returns an HTTP::Response object, which provides access to a response code, HTTP response headers, and data returned from the server. For example:

   use LWP::UserAgent;

   # create a user agent
   my $ua = LWP::UserAgent->new;
   $ua->agent("Test/0.1 ");

   # create a request
   my $req = HTTP::Request->new(GET => 'http://www.perl.com/index.html');

   # throw the HTTP::Request at LWP::UserAgent, getting back an HTTP::Response
   my $res = $ua->request($req);

   # check the outcome of the response
   if ($res->is_success) {
       print $res->content;
   } else {
       print "Bad luck this time\n";
   }

   # get contents
   my $html = $res->content;

What you do with the contents of the HTTP response once you've retrieved it is somewhat more difficult; a good starting point is Gisle Aas's module HTML::Parser, which can be found on CPAN and which provides a toolkit for building tools that parse HTML files and extract specified data from them -- for example, all the link addresses, or all the text enclosed in a given pair of tags. We'll be looking at HTML::Parser in detail in another tutorial.

END BOXOUT (Spidering the web)