Linux Format 17


Perl tutorial


Ever wondered how they print your bank statement? Lots of figures lined
up in columns, page numbers neatly printed at the top of each sheet --
you didn't think they used a series of "print" statements, did you?
If you need to generate long output listings, you need to examine the
"write" command and output formats -- Perl's built-in report generator.
We'll take a look at write(), and (with this month's introduction to
parsing command line arguments) produce a program for analysing
system logfiles and reporting on anomalies. We'll also take a
preliminary look at POD and Perl documentation, discuss how to
cache data from intensive operations such as resolving IP addresses,
and look at how to make our example program do something useful.


SUBHEADING: logscan -- reading logfiles and reporting on their contents


In Linux Format 16 we looked at using Getopt::Mixed to parse command-line
arguments: the example was a tool for scanning Apache server logs.
This article revolves around logscan -- the same program, this time with
real moving parts that scan real live logfiles.


The general structure of a command-line tool is this:


-- BEGIN BULLET LIST --


Declare any necessary global variables


Parse command-line parameters (using Getopt::Mixed or equivalent)


Validate inputs and open files


Process input files


Emit output


Close down cleanly


-- END BULLET LIST --
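

The outline above might be sketched as follows. This is a bare skeleton rather than logscan itself: it uses the core module Getopt::Long in place of Getopt::Mixed, and the option names, the %opt hash and the subroutine names parse_args() and run() are all invented for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Parse options out of a list of arguments (hypothetical option names).
sub parse_args {
    my @args = @_;
    my %opt  = (debug => 0);                 # global defaults
    GetOptionsFromArray(\@args, \%opt, 'debug=i', 'contains=s')
        or die "usage: $0 [--debug N] [--contains PATTERN] FILE\n";
    $opt{file} = shift @args;                # first non-option argument
    return %opt;
}

# Validate inputs, open the file, process it, emit output, close down.
sub run {
    my %opt = parse_args(@_);
    die "cannot read input file\n"
        unless defined $opt{file} && -r $opt{file};
    open(my $in, '<', $opt{file}) or die "open $opt{file}: $!\n";
    my $count = 0;
    while (my $line = <$in>) {
        next if defined $opt{contains} && $line !~ /$opt{contains}/;
        $count++;
    }
    close($in);
    print "$count matching records\n";
    return $count;
}

run(@ARGV) if @ARGV;    # only do any work when given arguments
```

Keeping the argument parsing in its own subroutine means it can be exercised without touching a real logfile.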


logscan does all of these things in pretty much this order (with the
reservation that it doesn't do a whole lot with its input files, being an
unfinished tutorial project at this stage). The sections from lines 17-20
(setting up configuration variables) and lines 50-81 (using Getopt::Mixed
to parse arguments) carry out the first two tasks; the notable
difference between this and last month's example is the inclusion of
a format specification in lines 28-50, which is used for printing
tabular output. (SEE BOXOUT "Writing reports")


From lines 82 to 105, logscan validates its inputs. It's no good trying to
open a log file that doesn't exist, or isn't readable; here, Perl's file test
operators come in handy. File test operators, inherited from the UNIX shell,
are operators that, when applied to a filename, return true or false
depending on whether the file matches the operator. For example, the
-r operator tests whether a file is readable or not:


-- BEGIN MONOSPACED --


  if ( -r "/etc/passwd") {
     print "I can read /etc/passwd!\n";
  }


-- END MONOSPACED --


Like other operators, the file test operators can be negated with ! (logical
not) and connected with && and || (logical-and, logical-or). Because file
test operations are expensive -- Perl has to stat the file each time --
the result is cached: the details of the last file tested are available
through the special filehandle "_", so that we can write something like this:


-- BEGIN MONOSPACED --


  if (( -r "/etc/passwd" ) && ( -w _)) {
      print "permissions on /etc/passwd are a bit too loose, I think!\n";
  }


-- END MONOSPACED --


In this example, "( -w _ )" means "test for writable attribute on cached
record of last file tested". 
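

A validation routine built from these operators might look like the sketch below. The name check_logfile() and the error messages are made up for this example and are not taken from logscan; the point is that only the first test touches the disk, and the rest reuse the cached result through "_".

```perl
use strict;
use warnings;

# Return an error message, or undef if the file looks usable.
# check_logfile() is a hypothetical name, not logscan's own.
sub check_logfile {
    my ($path) = @_;
    return "$path does not exist"      unless -e $path;
    return "$path is not a plain file" unless -f _;   # reuses the stat from -e
    return "$path is not readable"     unless -r _;
    return "$path is empty"            unless -s _;
    return;
}
```

A caller would then write something like: my $err = check_logfile($file); die "$err\n" if $err;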


There's another interesting idiom at large in lines 50-100 of our
program: stuff like this:


-- BEGIN MONOSPACED --


  ($debug_level >= 1) && print STDERR "opened $target_file for reading\n";


-- END MONOSPACED --


What this does is evaluate the expression "$debug_level >= 1"; if the
result is true (i.e. if $debug_level is greater than or equal to 1)
it executes the print statement. $debug_level is a debugging flag;
set it to zero and none of these statements will execute, but set it
above zero and you'll see more and more messages about the internal state
of logscan as it runs. (It's a quick and easy alternative to using Perl's
built in debugger for simple tasks. We'll look at the debugger in a future
tutorial.)


Back to the program. After we run through a series of tests to make sure
that our target file exists, we open it (lines 101-105). Using the module
IO::File gives us an object-oriented syntax for dealing with file handles;
it's not the only way of doing this, but by using IO::Handle objects as 
our underlying way of talking to the file, we make it easier to pass
it to subroutines if we want to. (Which we don't, yet.)
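

To see why the object-oriented handle is convenient, here is a small sketch that opens a file with IO::File and hands the resulting object to a subroutine. Both count_lines() and lines_in_file() are hypothetical helpers written for this example.

```perl
use strict;
use warnings;
use IO::File;

# Count the lines readable from any IO::Handle-style object.
sub count_lines {
    my ($fh) = @_;
    my $n = 0;
    $n++ while defined $fh->getline;
    return $n;
}

# Open $path read-only, count its lines, and close it again.
sub lines_in_file {
    my ($path) = @_;
    my $fh = IO::File->new($path, 'r')
        or die "cannot open $path: $!\n";
    my $n = count_lines($fh);
    $fh->close;
    return $n;
}
```

Because the handle is an ordinary scalar, it can be passed around, stored in a hash, or returned from a subroutine like any other value.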


Apache logfiles consist of a series of one-line records; each of these
consists of a series of space-separated fields, some of which (notably
the timestamp, HTTP request, and referrer) have different (but regular)
formats and may themselves contain spaces. To make things a little harder,
a core of about seven fields is usually present, but additional fields may
be tagged on the end of each record if the website administrator is trying
to obtain verbose output. Logscan is not a sophisticated logfile scanner; in
particular, it has no concept of how to parse an HTTP logfile correctly, being
content to extract space-separated fields in the order they appear in
a file. (In a future tutorial we'll see how to write a proper parser.)
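

The naive approach can be sketched like this. The field positions assume a record in Apache's Common Log Format, and the subroutine name naive_fields() is invented here; the hash keys borrow the variable names used in logscan's output format.

```perl
use strict;
use warnings;

# Naive field extraction: split on whitespace and pick fields by position.
# Note how the timestamp and the quoted request get broken into pieces.
sub naive_fields {
    my ($line) = @_;
    my @f = split ' ', $line;
    return (
        requester   => $f[0],
        datestamp   => "$f[3] $f[4]",   # "[10/Oct/2000:13:55:36" and "-0700]"
        uri         => $f[6],
        protocol    => $f[7],           # still carries the closing quote
        http_result => $f[8],
        bytes_sent  => $f[9],
    );
}

my %rec = naive_fields(
    '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326'
);
print "$rec{http_result} $rec{uri}\n";    # prints: 200 /apache_pb.gif
```

The stray quote on the protocol field is exactly the sort of thing a proper parser has to deal with.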


The main loop of the program runs from line 110 to line 154. We read a line
from our log file. If the $resolv_addr flag is set, we expand IP addresses in
the line to hostnames (if possible). This is carried out in a subroutine,
expand_address(). After expanding IP addresses, the loop applies the two
optional filters. If HTTP response code filtering is in effect, it
skips to the next record unless the specified HTTP response matches the one in
the current record; and if we're looking for contents, it ignores lines
where the HTTP request doesn't match the pattern specified on the command
line by the --contains option.


If the loop gets as far as line 145, the filter criteria have been
matched; logscan therefore (naively) digs some information out of the
line, sticks it in appropriate variables, and calls write() to emit a
formatted record.
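

The filtering logic described above can be sketched as a loop of "next" statements. This is not logscan's actual code: the subroutine name scan(), the filter keys "code" and "contains", and the field positions (Common Log Format split on whitespace) are assumptions made for the example, and it returns totals rather than calling write().

```perl
use strict;
use warnings;

# Count the records (and bytes sent) that pass the optional filters.
sub scan {
    my ($lines, %filter) = @_;
    my ($records, $bytes) = (0, 0);
    for my $line (@$lines) {
        my @f = split ' ', $line;
        next unless @f >= 10;                  # too short to be a record
        my ($request, $code, $size) = ("@f[5..7]", $f[8], $f[9]);
        next if defined $filter{code}     && $code ne $filter{code};
        next if defined $filter{contains} && $request !~ /$filter{contains}/;
        $records++;
        $bytes += $size if $size =~ /^\d+$/;   # "-" means no body was sent
    }
    return ($records, $bytes);
}
```

Anything that survives both "next" tests has matched the filter criteria, which is the point at which logscan fills in its variables and calls write().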


Note that this doesn't do any analysis of the input (other than counting
the bytes sent and number of records); but the block from line 145-150
can be used for other tasks -- for example, to carry out a SQL INSERT
into a relational database, which can subsequently be analysed, or to 
put values into a hash.


Probably the most interesting part of logscan is the subroutine
expand_address(). HTTP log records may contain either the raw internet 
address (as a series of four numbers separated by periods -- a "dotted
quad") of the requester; or Apache can be configured to resolve these
addresses to hostnames. This facility is often switched off to reduce the
workload on a web server; logscan provides a replacement facility.


IP addresses in dotted-quad form match the Perl regular expression
(\d{1,3}\.){3}(\d{1,3}); if a logfile contains this, it's an unresolved
address so expand_address() tries to look up its corresponding hostname.
Because the act of looking up an IP address can be expensive -- it
involves querying a name server -- logscan only does this when necessary;
the actual code is in lines 188-189, using subroutines supplied by the
library Socket.pm.


To minimize the number of lookups, whenever a hostname is checked the
results are stored in the hash %main::address_cache. We use the IP
address as the key; the value stored under it is either the hostname,
or the same IP address (indicating that we were unable to resolve
the hostname). Before expand_address() checks each address, it checks
$main::address_cache{$address} to see if it's already been asked for; if
it has, it uses the cached name instead of checking it twice. Because
most web page impressions serve multiple files to the same client, this
drastically reduces the amount of work logscan has to do in looking up
hostnames.
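

Here is a minimal sketch of the caching idea. It is not the code from logscan: lookup_host() is a made-up helper, the regular expression only looks at the start of the line, and the optional $resolver argument is an addition for this example so that the cache can be tried out without querying a name server.

```perl
use strict;
use warnings;
use Socket qw(inet_aton AF_INET);

our %address_cache;    # IP address => hostname, or the IP itself if unresolved

# Default resolver: a reverse lookup via the name server.
sub lookup_host {
    my ($ip) = @_;
    my $packed = inet_aton($ip) or return;
    return scalar gethostbyaddr($packed, AF_INET);
}

# Replace a leading dotted quad in $line with a hostname where possible.
sub expand_address {
    my ($line, $resolver) = @_;
    $resolver ||= \&lookup_host;
    if ($line =~ /^((?:\d{1,3}\.){3}\d{1,3})\b/) {
        my $ip = $1;
        # Only ask the resolver the first time we meet this address.
        $address_cache{$ip} = $resolver->($ip) || $ip
            unless exists $address_cache{$ip};
        $line =~ s/^\Q$ip\E/$address_cache{$ip}/;
    }
    return $line;
}
```

Storing the bare IP address for a failed lookup matters as much as storing the successes: without it, every record from an unresolvable client would trigger another slow query.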


END (BODY COPY)


NEXT MONTH


Logscan doesn't do much, does it? But it will! Next month we see how to
hook logscan up to MySQL or PostgreSQL to insert Apache logfile contents
into a relational database for analysis. We also take a closer look at
the structure of HTTP logfiles and see how to write a decent parser for
them rather than one that splits on whitespace.


END (NEXT MONTH)


BOXOUT: Writing reports


One of the most boring but useful components of any programming language
is its set of facilities for printing output. Perl -- the Practical Extraction
and Report Language -- has a little-known facility, inherited from FORTRAN,
for making the job easier: output formats and the write() command. An
output format is useful when you want to output oodles of data records
in a fixed format, with extras like page headers and footers (because you're
dumping the records to a line printer).


The job of a Linux system administrator often involves drawing up reports,
some of which involve programming. It might be writing a tool to scan the
system logfiles; for example, scanning /var/log/wtmp (a binary file -- we'll
look at how to extract this data in another tutorial) to report on user 
login activity, or scanning the filesystem and reporting on the largest
or oldest files, or scanning the web server logs to see who's been doing 
what. In many cases, excellent tools exist for doing these jobs: for example,
if you want to mangle HTTP server logs your starting place should be either
Analog (see http://www.analog.cx/) or the Perl module Logfile (from CPAN --
particularly Logfile::Apache). However, digesting logs is only half the
problem: how do you present output meaningfully?


Perl's report generator has two halves: the format command (which
specifies the way to format a bunch of variables for output, and names the
variables in question), and the write() command (which causes the contents
of the variables specified in the current format to be printed to the
appropriate file handle in a layout specified in the format). Each format
has a name, and is associated with a filehandle; for example, the default
format for STDOUT is called "STDOUT", and the default format for filehandle
FOO is called "FOO". There's also an associated format for each filehandle
called something like STDOUT_TOP or FOO_TOP -- this is used to output a 
page header whenever necessary. Unlike print(), write() understands the idea
of printing on a page with a set number of lines; the number of lines in a
page is stored in the special variable $=, the number of lines left on the
page is in $-, and whenever $- counts down to zero a call to write()
triggers a page throw and uses the _TOP format to print a new page header.


You declare a format like this:


-- BEGIN MONOSPACE LISTING --
format STDOUT =
@<<<          @<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<  @<<<<<<<<<<<<<<<<<<<<<<<<<<<    
$http_result, $uri,                             $requester
              @<<<<<<<<                         @>>>>>>>>>   @>>>>>>>>>>>>>>>>>>>>>>>>>>>
              $protocol,                        $bytes_sent, $datestamp
.
-- END MONOSPACE LISTING --


The "format FILEHANDLE =" line gives the name of the filehandle we're defining a
format for; everything until the line with a dot on it is part of a picture of
how the output should look on the page. The picture can contain comment lines
(introduced with a hash sign, like normal Perl comments), or alternating 
picture lines and argument lines. 


A picture line illustrates how variables should be laid out -- @<<< means
"field, left-justified, four characters" (the @ counts towards the width),
@|||||| means "field, centred, seven characters", @###.## means "numeric
field, right-justified, seven characters including the decimal point and two
decimal places", and so on; the minutiae are covered in
Chapter 7 of "Programming Perl" (3rd edition).


Below each picture line there's an argument line; this consists of a list
of variables which are interpolated into the picture line whenever write()
is called. (The whitespace in the example above isn't required by Perl;
it's just there to make it easier to see which field each variable is
associated with.) 
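

If you want to experiment with picture lines without declaring a whole format, the built-in formline() function and the accumulator variable $^A do the same job one line at a time. The helper below follows the swrite() idea from the perlform documentation; the pictures and values are just test data, chosen to show how wide each field really is.

```perl
use strict;
use warnings;

# Format one picture line with the given values and return the result.
sub swrite {
    my ($picture, @values) = @_;
    $^A = '';                       # clear the format accumulator
    formline($picture, @values);
    my $out = $^A;
    $^A = '';
    return $out;
}

# Single quotes stop Perl trying to interpolate the @ signs.
print swrite('[@<<<<] [@||||] [@>>>>] [@##.##]' . "\n",
             'ab', 'abc', 'ab', 3.14159);
# prints: [ab   ] [ abc ] [   ab] [  3.14]

print swrite('[@<<]' . "\n", 'abcdef');
# prints: [abc]
```

The second call shows truncation: a field of @ plus two pad characters is three characters wide, so only "abc" survives.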


When you call write(), the variables named in the format will be output, in
the layout specified by the picture line. If their value is undefined, the
field will be left blank; if they're larger than the field, they'll be
truncated to fit.  If you want to output variable-length data, it's worth
using the special character ^ to introduce a field, rather than @; perl pulls
as much text as it can out of the corresponding variable (breaking at
whitespace where possible), prints it in the field (for example, ^<<<<<<<<<
will print up to ten characters, left justified), then chops the printed
section off the front of the variable -- so that next time it's referenced,
more text can be printed. Thus:


-- START MONOSPACE LISTING --
format FOO =
~^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
$long_text
~^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
$long_text
~^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
$long_text
~^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
$long_text
~^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
$long_text
.
-- END MONOSPACE LISTING --


This prints the contents of $long_text in a block of left-justified lines,
each as wide as its picture line and up to five lines deep, when write() is
called. (The leading tilde "~" in each picture line tells write() to suppress
output of the line if the contents are empty.) By using the continuation
field format (^) and suppressing blank lines we can output variable
length records as well as fixed-length.
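

Repeating the same picture line five times works, but it caps the output at five lines. Perl also accepts a doubled tilde, "~~", which repeats the line until the field is used up. The sketch below tries this out using formline() and the accumulator $^A rather than a full format; the subroutine name fill() and the ten-character width are arbitrary choices for the example.

```perl
use strict;
use warnings;

# Fill $text into lines at most ten characters wide, using a ^ field.
# The ~~ marker repeats the picture line until the field is empty.
sub fill {
    my ($text) = @_;        # work on a copy: ^ fields eat their variable
    $^A = '';
    formline('^<<<<<<<<< ~~' . "\n", $text);
    my $out = $^A;
    $^A = '';
    return $out;
}

print fill("the quick brown fox jumps over the lazy dog");
```

Note that fill() copies its argument first. A ^ field chops text off the variable it prints, so formatting the caller's own variable would leave it empty afterwards.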


In our tutorial example program, logscan, we're using the STDOUT format
and write() to output our results as we read through an Apache web server
access log. We do it this way solely because logscan is notionally providing
us with printed, paginated output. It's an alternative to print(). However,
formats really come into their own when you're preparing tabular data like
a bank statement or form letter! Try working up output like the following,
using nothing but print() commands, and you'll find it's a lot less clear ...


-- BEGIN MONOSPACE LISTING --


format DUNNING_LETTER =
~                                             @>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
                                              $my_co->{officers_name},
~                                             ^>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
                                              $my_co->{office_location},
~                                             ^>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
                                              $my_co->{office_location},
~                                             @>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
                                              $my_co->{postcode},
~                                             @>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
                                              $my_co->{phone},
~                                             @>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
                                              $my_co->{fax},


~                                             @>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
                                              $letter_date
~^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
  $bill->{debtor_name}
~^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
  $bill->{debtor_addr}
~^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
  $bill->{debtor_addr}
~^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
  $bill->{debtor_addr}
~^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
  $bill->{debtor_addr}
~^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
  $bill->{debtor_addr}


Our ref:  @<<<<<<<<<<<<<<<
          $bill->{our_ref}
Your ref: @<<<<<<<<<<<<<<<
          $bill->{your_ref}


Dear ^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<,
     $debtor_name


We note with interest that your account is overdrawn by @######.## .
                                                        $bill->{overdraft_size}
Please remit this money at your earliest convenience or it will be necessary
for us to send the boys round, and you Wouldn't Like That.


Signed, @<<<<<<<<<<<<<<<<<<<<<<<<<<<<,
        $my_co->{officers_name}
        @<<<<<<<<<<<<<<<<<<<<<<<<<<<<
        $my_co->{officers_job}


.


-- END MONOSPACE LISTING --


END BOXOUT (Writing reports)


BOXOUT: POD and documentation


If you write a program for your own use -- and nobody else's -- then you
probably have no reason to document it. But in the real world, most of us
write programs that we hope somebody else will use. Our logscan example
in this issue demonstrates Perl's approach to manual-writing: POD, or
Plain Old Documentation.


POD is included in Perl programs and modules as a sort of multi-line comment
block -- you can either include it in-line with your program, or after an
__END__ delimiter (which marks the end of source code in your file). 
Inline POD documentation is delimited by special marker lines, beginning
with the symbol "=pod" and ending with the symbol "=cut" -- text within
these lines is not treated as code by the Perl interpreter. For example:


-- BEGIN MONOSPACE LISTING --


sub some_complex_subroutine ($$) {
     my ($arg1, $arg2) = @_;
=pod


This sentence is POD text, and will not cause a compile-time error because
perl will not attempt to compile it! Now, back to the source code ...


=cut


    return ($arg1 + $arg2);
}


-- END MONOSPACE LISTING --


However, POD isn't just a multiline comment; it is a simple formatting
language with lists, headings, emphasis, and a variety of other features that
allow you to generate documentation. There's a module called Pod::Parser that
comes with the standard Perl distribution; this is used (in conjunction with
some sub-classes) to provide command-line tools to scan a Perl file for POD
and convert it to HTML, plain text, or man macro source (suitable for
printing with groff). The tools pod2html, pod2text, and pod2man respectively
do this job.


POD formatting commands come in two flavours: commands that begin with an
equals-sign in the first character of a line (and optionally take arguments),
and single capital letters followed by angle-brackets that enclose some text
and apply an attribute to it. For example:


-- BEGIN MONOSPACE LISTING --


=head1 This is the title


=head2 This is a section heading


This is some descriptive text with I<italicised words> and
B<bold words> for emphasis; and you can also include
cross-reference links to other documentation such as L<perl>.


-- END MONOSPACE LISTING --


When you feed a file containing the above to the pod2text filter (which
calls Pod::Text) it strips out the emphasis and reformats the text
after the fashion of a man page. But if you feed it through pod2html or
pod2man, it will replace the simple POD tags with more complex HTML 
elements or troff macros that provide appropriate formatting.


You commonly need to include verbatim chunks of source code in POD; by
default, pod filters reflow text that starts in the leftmost column of a
line (like the paragraph in the example above), but leave verbatim the
formatting of text that starts with one or more whitespace characters.
For example:


-- BEGIN MONOSPACE LISTING --


=head2 EXAMPLES


While this paragraph will be reformatted and flowed when we feed it through
the pod2html parser, the next line will be left alone:


    pod2html < filename.pl >filename.html


-- END MONOSPACE LISTING --


POD has a simple way of producing lists. You issue the =over command, with
an argument that is the number of character positions to indent entries by.
You then issue =item commands; these take one argument and are not indented,
but the text following each item is indented as far as the =over command
specified. You close out a list by using the =back command. For example:


-- BEGIN MONOSPACE LISTING --


Our collection includes:


=over 4


=item 1


A bucket of figs


=item 2


A can of tuna


=item 3


A jar of marmalade


=item 1000000


A left-handed screwdriver


=back


(That's enough.)


-- END MONOSPACE LISTING --


produces output like this:


-- BEGIN MONOSPACE LISTING --


Our collection includes:


1   A bucket of figs


2   A can of tuna


3   A jar of marmalade


1000000
    A left-handed screwdriver


(That's enough.)


-- END MONOSPACE LISTING --


Those are the basics of how you use POD format to write documentation; there
are a couple more wrinkles described in the perlpod documentation (type
"perldoc perlpod" in a terminal window), but you should be able to write
workable POD documentation with the tags described above.


However, if you want to document programs or modules properly you need to do a
bit more than that. The pod2man filter is used to turn POD documentation in a
file into a UNIX man page. It's fairly demanding; you need to provide a series
of =head1 (top level) headings with set names, in a set order, with specific
information under each heading. While some of the headings are optional,
pod2man will complain if they're in the wrong order or if mandatory headings
(such as the NAME and DESCRIPTION headings) are missing. A full description
of the sections you need to write to create a manpage is given in the pod2man
manpage (type "man pod2man" in a terminal window).


logscan includes basic man-compliant documentation.  The POD documentation is
all gathered at the end of the file, after the __END__ marker (which tells the
Perl interpreter to ignore everything beyond this point); alternatively, we could 
have scattered bits of it throughout the source code, or included it at the
beginning between =pod and =cut markers.


If you invoke logscan with the --help or --man options, while parsing command-
line arguments it will create a new Pod::Text or Pod::Man object, and tell it
to parse the file $0 ($0 is a special variable -- it holds the name of the
Perl program that the perl interpreter is currently running). By default, the
POD parsers spit out their digested output on standard output. The --help
option uses Pod::Text so you can read the documentation on screen; the --man
option calls Pod::Man and generates man macros for groff -- so you can
send a PostScript copy of the documentation to your printer like this:


-- BEGIN MONOSPACE LISTING --


  logscan --man | groff -man -Tps | lpr


-- END MONOSPACE LISTING --


If you want to generate an HTML version of the documentation, you need to use
the stand-alone pod2html filter, or edit the argument parser so that it calls
Pod::Html.
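

As a rough illustration of what a --help handler does, the sketch below renders a file's POD as plain text with Pod::Text. It is not logscan's code: the subroutine name pod_as_text() and the 72-column width are choices made for this example, and it collects the text in a string (using the output_string() method that recent versions of Pod::Text inherit from Pod::Simple) instead of printing straight to standard output.

```perl
use strict;
use warnings;
use Pod::Text;

# Render the POD found in $file as plain text and return it as a string.
sub pod_as_text {
    my ($file) = @_;
    my $parser = Pod::Text->new(width => 72);
    my $text   = '';
    $parser->output_string(\$text);   # collect output here, not on STDOUT
    $parser->parse_file($file);
    return $text;
}
```

A --help option could then be as simple as: print pod_as_text($0); exit 0;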


END BOXOUT (POD and documentation)