Linux Format 17 Perl tutorial

Ever wondered how they print your bank statement? Lots of figures lined up in columns, page numbers neatly printed at the top of each sheet -- you didn't think they used a series of "print" statements, did you? If you need to generate long output listings, you need to examine the "write" command and output formats -- Perl's built-in report generator.

We'll take a look at write(), and (building on last month's introduction to parsing command-line arguments) produce a program for analysing system logfiles and reporting on anomalies. We'll also take a preliminary look at POD and Perl documentation, discuss how to cache data from intensive operations such as resolving IP addresses, and look at how to make our example program do something useful.

SUBHEADING: logscan -- reading logfiles and reporting on their contents

In Linux Format 16 we looked at using Getopt::Mixed to parse command-line arguments: the example was a tool for scanning Apache server logs. This article revolves around logscan -- the same program, this time with real moving parts that scan real live logfiles.

The general structure of a command-line tool is this:

-- BEGIN BULLET LIST --
Declare any necessary global variables
Parse command-line parameters (using Getopt::Mixed or equivalent)
Validate inputs and open files
Process input files
Emit output
Close down cleanly
-- END BULLET LIST --

logscan does all of these things in pretty much this order (with the reservation that it doesn't do a whole lot with its input files, being an unfinished tutorial project at this stage). The sections from lines 17-20 (setting up configuration variables) and lines 50-81 (using Getopt::Mixed to parse arguments) carry out the first two tasks; the notable difference between this and last month's example is the inclusion of a format specification in lines 28-50, which is used for printing tabular output. (SEE BOXOUT "Writing reports")

From lines 82 to 105, logscan validates its inputs.
It's no good trying to open a log file that doesn't exist, or isn't readable; here, Perl's file test operators come in handy. File test operators, inherited from the UNIX shell, are operators that, when applied to a filename, return true or false depending on whether the file passes the test. For example, the -r operator tests whether a file is readable or not:

-- BEGIN MONOSPACED --
if ( -r "/etc/passwd") {
    print "I can read /etc/passwd!\n";
}
-- END MONOSPACED --

Like other operators, the file test operators can be negated with ! (logical not) and connected with && and || (logical-and, logical-or). Because file test operations are expensive -- Perl has to ask the operating system for the file's inode details each time -- the results are cached: the details of the last file tested are available through the special filehandle "_", so that we can write something like this:

-- BEGIN MONOSPACED --
if (( -r "/etc/passwd" ) && ( -w _)) {
    print "permissions on /etc/passwd are a bit too loose, I think!\n";
}
-- END MONOSPACED --

In this example, "( -w _ )" means "test for writable attribute on cached record of last file tested".

There's another interesting idiom at large in lines 50-100 of our program, which looks like this:

-- BEGIN MONOSPACED --
($debug_level >= 1) && print STDERR "opened $target_file for reading\n";
-- END MONOSPACED --

What this does is evaluate the expression "$debug_level >= 1"; if the result is true (i.e. if $debug_level is greater than or equal to 1) it executes the print statement. $debug_level is a debugging flag; set it to zero and none of these statements will execute, but set it above zero and you'll see more and more messages about the internal state of logscan as it runs. (It's a quick and easy alternative to using Perl's built-in debugger for simple tasks. We'll look at the debugger in a future tutorial.)

Back to the program. After we run through a series of tests to make sure that our target file exists, we open it (lines 101-105).
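
The file tests and the debug-flag idiom described above can be combined into a small validation routine. What follows is only a sketch under stated assumptions: check_log_file() is a hypothetical helper, the default path is arbitrary, and none of it is taken from the logscan listing itself.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical names: check_log_file(), $debug_level and $target_file
# are illustrative only, not copied from logscan.
my $debug_level = 1;

# Return an empty string if the file looks usable, or a reason if not.
sub check_log_file {
    my ($path) = @_;
    return "$path does not exist"      unless -e $path;  # this test stats the file
    return "$path is not a plain file" unless -f _;      # these reuse the cached stat
    return "$path is not readable"     unless -r _;
    return "$path is empty"            if     -z _;
    return '';
}

my $target_file = @ARGV ? $ARGV[0] : '/etc/passwd';
my $problem     = check_log_file($target_file);

# Debug idiom: the print only runs when the flag is high enough.
($debug_level >= 1) && print STDERR "checked $target_file\n";

if ($problem) {
    print STDERR "cannot use log file: $problem\n";
}
else {
    print "$target_file looks usable\n";
}
```

Only the first test names the file; the later ones use "_", so the operating system is asked about the file just once.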
Using the module IO::File gives us an object-oriented syntax for dealing with file handles; it's not the only way of doing this, but by using IO::Handle objects as our underlying way of talking to the file, we make it easier to pass it to subroutines if we want to. (Which we don't, yet.)

Apache logfiles consist of a series of one-line records; each of these consists of a series of space-separated fields, some of which (notably the timestamp, HTTP request, and referrer) have different (but regular) formats. To make things a little harder, a core of about seven fields are usually present, but additional fields may be tagged on the end of each record if the website administrator is trying to obtain verbose output. Logscan is not a sophisticated logfile scanner; in particular, it has no concept of how to parse an HTTP logfile correctly, being content to extract space-separated fields in the order they appear in a file. (In a future tutorial we'll see how to write a proper parser.)

The main loop of the program runs from lines 110-154. We read a line from our log file. If the $resolv_addr flag is set, we expand IP addresses in the line to hostnames (if possible). This is carried out in a subroutine, expand_address(). After expanding IP addresses, the loop checks the two additional filters. If HTTP response code filtering is in effect, it skips to the next record unless the specified HTTP response matches the one in the current record; and if we're looking for contents, it ignores lines where the HTTP request doesn't match the pattern specified on the command line by the --contains option.

If the loop gets as far as line 145, the filter criteria have been matched; logscan therefore (naively) digs some information out of the line, sticks it in appropriate variables, and calls write() to emit a formatted record.
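
The loop just described might be sketched roughly as follows. This is not the logscan source: scan_lines(), its option names, the sample records and the optional lookup code reference are all assumptions made for illustration. The field positions assume Apache's common log format split naively on whitespace, and this expand_address() remembers each answer in a hash so that no address is looked up twice.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Socket qw(inet_aton AF_INET);

# Every name below is an illustrative assumption, not the logscan listing.
my %address_cache;    # dotted quad => hostname (or the quad again on failure)

# Look up a hostname for a dotted quad, asking the resolver only once
# per address. $lookup is an optional code ref, handy for testing.
sub expand_address {
    my ($address, $lookup) = @_;
    $lookup ||= sub {
        my $packed = inet_aton($_[0]);
        return defined $packed ? scalar gethostbyaddr($packed, AF_INET) : undef;
    };
    unless (exists $address_cache{$address}) {
        my $name = $lookup->($address);
        $address_cache{$address} = defined $name ? $name : $address;
    }
    return $address_cache{$address};
}

# Naive scan: split each record on whitespace and apply the two filters.
sub scan_lines {
    my ($lines, %opt) = @_;
    my ($records, $total_bytes, @matched) = (0, 0);
    for my $line (@$lines) {
        my @field = split ' ', $line;
        next unless @field >= 10;                 # not a recognisable record
        my ($requester, $uri, $status, $bytes) = @field[0, 6, 8, 9];
        if ($opt{resolve} && $requester =~ /^(\d{1,3}\.){3}\d{1,3}$/) {
            $requester = expand_address($requester, $opt{lookup});
        }
        next if defined $opt{code}     && $status ne $opt{code};
        next if defined $opt{contains} && $uri !~ /$opt{contains}/;
        $bytes = 0 if $bytes eq '-';              # "-" means no body was sent
        $records++;
        $total_bytes += $bytes;
        push @matched, [$requester, $uri, $status, $bytes];
    }
    return ($records, $total_bytes, \@matched);
}

# Made-up sample records using documentation-only IP addresses.
my @sample = (
    '192.0.2.7 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326',
    '192.0.2.7 - - [10/Oct/2000:13:55:37 -0700] "GET /logo.png HTTP/1.0" 200 1024',
    '192.0.2.9 - - [10/Oct/2000:13:56:01 -0700] "GET /missing.html HTTP/1.0" 404 -',
);
my ($records, $total_bytes) = scan_lines(\@sample, code => 200);
print "$records records, $total_bytes bytes\n";   # prints "2 records, 3350 bytes"
```

Returning the matched records as a list of array references keeps the sketch independent of how they are eventually reported; a real tool could just as easily call write() at that point.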
Note that logscan doesn't do any analysis of the input (other than counting the bytes sent and number of records); but the block from lines 145-150 can be used for other tasks -- for example, to carry out a SQL INSERT into a relational database, which can subsequently be analysed, or to put values into a hash.

Probably the most interesting part of logscan is the subroutine expand_address(). HTTP log records may contain the raw internet address of the requester (as a series of four numbers separated by periods -- a "dotted quad"); alternatively, Apache can be configured to resolve these addresses to hostnames. This facility is often switched off to reduce the workload on a web server; logscan provides a replacement facility. IP addresses in dotted-quad form match the Perl regular expression (\d{1,3}\.){3}(\d{1,3}); if a logfile contains this, it's an unresolved address, so expand_address() tries to look up its corresponding hostname.

Because the act of looking up an IP address can be expensive -- it involves querying a name server -- logscan only does this when necessary; the actual code is in lines 188-189, using subroutines supplied by the library Socket.pm. To minimize the number of lookups, whenever a hostname is checked the results are stored in the hash %main::address_cache. We use the IP address as the key; the value stored under it is either the hostname, or the same IP address (indicating that we were unable to resolve the hostname). Before expand_address() checks each address, it checks $main::address_cache{$address} to see if it's already been asked for; if it has, it uses the cached name instead of checking it twice. Because most web page impressions serve multiple files to the same client, this drastically reduces the amount of work logscan has to do in looking up hostnames.

END (BODY COPY)

NEXT MONTH

Logscan doesn't do much, does it? But it will!
Next month we see how to hook logscan up to MySQL or PostgreSQL to insert Apache logfile contents into a relational database for analysis. We also take a closer look at the structure of HTTP logfiles and see how to write a decent parser for them rather than one that splits on whitespace.

END (NEXT MONTH)

BOXOUT: Writing reports

One of the most boring but useful components of any programming language is its set of facilities for printing output. Perl -- the Practical Extraction and Report Language -- has a little-known facility, inherited from FORTRAN, for making the job easier: output formats and the write() command. An output format is useful when you want to output oodles of data records in a fixed format, with extras like page headers and footers (because you're dumping the records to a line printer).

The job of a Linux system administrator often involves drawing up reports, some of which involve programming. It might be writing a tool to scan the system logfiles; for example, scanning /var/log/wtmp (a binary file -- we'll look at how to extract this data in another tutorial) to report on user login activity, or scanning the filesystem and reporting on the largest or oldest files, or scanning the web server logs to see who's been doing what. In many cases, excellent tools exist for doing these jobs: for example, if you want to mangle HTTP server logs your starting place should be either Analog (see http://www.analog.cx/) or the Perl module Logfile (from CPAN -- particularly Logfile::Apache). However, digesting logs is only half the problem: how do you present output meaningfully?

Perl's report generator has two halves: the format declaration (which specifies the way to format a bunch of variables for output, and names the variables in question), and the write() command (which causes the contents of the variables specified in the current format to be printed to the appropriate file handle in the layout specified in the format).
Each format has a name, and is associated with a filehandle; for example, the default format for STDOUT is called "STDOUT", and the default format for filehandle FOO is called "FOO". There's also an associated format for each filehandle called something like STDOUT_TOP or FOO_TOP -- this is used to output a page header whenever necessary. Unlike print(), write() understands the idea of printing on a page with a set number of lines; the number of lines in a page is stored in the special variable $=, the number of lines left on the page is in $-, and whenever $- counts down to zero a call to write() triggers a page throw and uses the _TOP format to print a new page header.

You declare a format like this:

-- BEGIN MONOSPACE LISTING --
format STDOUT =
@<<< @<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< @<<<<<<<<<<<<<<<<<<<<<<<<<<<
$http_result, $uri,                   $requester
@<<<<<<<< @>>>>>>>>> @>>>>>>>>>>>>>>>>>>>>>>>>>>>
$protocol, $bytes_sent, $datestamp
.
-- END MONOSPACE LISTING --

The "format FILEHANDLE =" line gives the name of the filehandle we're defining a format for; everything until the line with a dot on it is part of a picture of how the output should look on the page. The picture can contain comment lines (introduced with a hash sign in the first column, like normal Perl comments), or alternating picture lines and argument lines.

A picture line illustrates how variables should be laid out. The @ or ^ that introduces a field counts towards its width, so @<<< means "field, left-justified, four characters" and @|||||| means "field, centred, seven characters". Numeric fields are drawn with hash signs: @###.## means "numeric field, right-justified, seven characters wide, with two decimal places". The minutiae are covered in Chapter 7 of "Programming Perl" (3rd edition).

Below each picture line there's an argument line; this consists of a list of variables which are interpolated into the picture line whenever write() is called. (The whitespace in the example above isn't required by Perl; it's just there to make it easier to see which field each variable is associated with.)
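
To see a format and its page header working together, here is a minimal, self-contained sketch. The variable names echo the listing above, but the records, the column widths and the header text are made up for the demonstration and are not taken from logscan.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Package variables, so the formats can see them whenever write() runs.
# The names echo the listing above; the data is invented.
our ($http_result, $uri, $bytes_sent);

# Page header: printed automatically at the top of each new page.
format STDOUT_TOP =
Code URI                         Bytes
---- ------------------------ --------
.

# One picture line and one argument line per record.
format STDOUT =
@<<< @<<<<<<<<<<<<<<<<<<<<<<< @>>>>>>>
$http_result, $uri,           $bytes_sent
.

my @records = (
    [ 200, '/index.html',   2326 ],
    [ 404, '/missing.html',  209 ],
);

for my $record (@records) {
    ($http_result, $uri, $bytes_sent) = @$record;
    write;    # uses the STDOUT format, preceded by STDOUT_TOP on a new page
}
```

Each call to write() emits one formatted line; the header appears once, before the first record, because the page starts out with no lines left on it.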
When you call write(), the variables named in the format will be output, in the layout specified by the picture line. If their value is undefined, the field will be left blank; if they're larger than the field, they'll be truncated to fit. If you want to output variable-length data, it's worth using the special character ^ to introduce a field, rather than @; perl pulls as much text as it can out of the corresponding variable, prints it in the field (for example, ^<<<<<<<<<<<<<<<<<<<< prints as much text as fits in the field, left justified), then chops the printed section off the front of the variable -- so that next time it's referenced, more text can be printed. Thus:

-- START MONOSPACE LISTING --
format FOO =
~^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
$long_text
~^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
$long_text
~^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
$long_text
~^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
$long_text
~^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
$long_text
.
-- END MONOSPACE LISTING --

This prints the contents of $long_text as a block of left-justified lines, as wide as the picture field and up to five lines deep, when write() is called. (The leading tilde "~" in each picture line tells write() to suppress output of the line if the contents are empty.) By using the continuation field format (^) and suppressing blank lines we can output variable-length records as well as fixed-length ones.

In our tutorial example program, logscan, we're using the STDOUT format and write() to output our results as we read through an Apache web server access log. We do it this way solely because logscan is notionally providing us with printed, paginated output. It's an alternative to print(). However, formats really come into their own when you're preparing tabular data like a bank statement or form letter! Try working up output like the following, using nothing but print() commands, and you'll find it's a lot less clear ...
-- BEGIN MONOSPACE LISTING --
format DUNNING_LETTER =
~ @>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
  $my_co->{officers_name},
~ ^>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
  $my_co->{office_location},
~ ^>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
  $my_co->{office_location},
~ @>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
  $my_co->{postcode},
~ @>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
  $my_co->{phone},
~ @>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
  $my_co->{fax},
~ @>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
  $letter_date
~^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
$bill->{debtor_name}
~^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
$bill->{debtor_addr}
~^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
$bill->{debtor_addr}
~^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
$bill->{debtor_addr}
~^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
$bill->{debtor_addr}
~^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
$bill->{debtor_addr}

Our ref: @<<<<<<<<<<<<<<<
$bill->{our_ref}
Your ref: @<<<<<<<<<<<<<<<
$bill->{your_ref}

Dear ^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<,
$debtor_name

We note with interest that your account is overdrawn by @######.## .
$bill->{overdraft_size}
Please remit this money at your earliest convenience or it will be
necessary for us to send the boys round, and you Wouldn't Like That.

Signed,

@<<<<<<<<<<<<<<<<<<<<<<<<<<<<,
$my_co->{officers_name}
@<<<<<<<<<<<<<<<<<<<<<<<<<<<<
$my_co->{officers_job}
.
-- END MONOSPACE LISTING --

END BOXOUT (Writing reports)

BOXOUT: POD and documentation

If you write a program for your own use -- and nobody else's -- then you probably have no reason to document it. But in the real world, most of us write programs that we hope somebody else will use. Our logscan example in this issue demonstrates Perl's approach to manual-writing: POD, or Plain Old Documentation.

POD is included in Perl programs and modules as a sort of multi-line comment block -- you can either include it in-line with your program, or after an __END__ delimiter (which marks the end of source code in your file).
Inline POD documentation is delimited by special marker lines, beginning with a POD command such as "=pod" and ending with the command "=cut" -- text within these lines is not treated as code by the Perl interpreter. For example:

-- BEGIN MONOSPACE LISTING --
sub some_complex_subroutine ($$) {
    my ($arg1, $arg2) = @_;

=pod

This sentence is POD text, and will not cause a compile-time
error because perl will not attempt to compile it! Now, back
to the source code ...

=cut

    return ($arg1 + $arg2);
}
-- END MONOSPACE LISTING --

However, POD isn't just a multiline comment; it is a simple formatting language with lists, headings, emphasis, and a variety of other features that allow you to generate documentation. There's a module called Pod::Parser that comes with the standard Perl distribution; this is used (in conjunction with some sub-classes) to provide command-line tools to scan a Perl file for POD and convert it to HTML, plain text, or man macro source (suitable for printing with groff). The tools pod2html, pod2text, and pod2man respectively do this job.

POD formatting commands come in two flavours: commands that begin with an equals sign in the first character of a line (and optionally take arguments), and single capital letters followed by angle brackets that enclose some text and apply an attribute to it. For example:

-- BEGIN MONOSPACE LISTING --
=head1 This is the title

=head2 This is a section heading

This is some descriptive text with I<italics> and B<bold> for
emphasis; and you can also include cross-reference links to other
documentation such as L<perlpod>.
-- END MONOSPACE LISTING --

When you feed a file containing the above to the pod2text filter (which uses Pod::Text) it strips out the emphasis and reformats the text after the fashion of a man page. But if you feed it through pod2html or pod2man, it will replace the simple POD tags with more complex HTML elements or troff macros that provide appropriate formatting.
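
What pod2text does can be approximated from inside a Perl program with the Pod::Text module. The sketch below is illustrative rather than a copy of anything in logscan: the POD sample is made up and held in a string, and it assumes a Pod::Text recent enough to be built on Pod::Simple, which is where output_string() and parse_string_document() come from.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Pod::Text;

# A small, made-up POD sample, built as a string so that this file
# contains no POD of its own.
my $pod = join "\n",
    '=head1 This is the title',
    '',
    '=head2 This is a section heading',
    '',
    'This is some descriptive text with I<italics> and B<bold> for',
    'emphasis, plus a cross-reference to L<perlpod>.',
    '',
    '=cut',
    '';

# Render the sample as plain text, roughly what pod2text would print.
my $parser = Pod::Text->new(width => 60);
my $text   = '';
$parser->output_string(\$text);          # collect output in $text
$parser->parse_string_document($pod);    # parse POD from a string
print $text;
```

The formatting codes disappear from the rendered text, and the headings and body come out indented in man-page style.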
You commonly need to include verbatim chunks of source code in POD; by default, pod filters reflow text that starts in the leftmost column of a line (like the paragraph in the example above), but leave verbatim the formatting of text that starts with one or more whitespace characters. For example:

-- BEGIN MONOSPACE LISTING --
=head2 EXAMPLES

While this paragraph will be reformatted and flowed when we feed it
through the pod2html parser, the next line will be left alone:

    pod2html < filename.pl >filename.html
-- END MONOSPACE LISTING --

POD has a simple way of producing lists. You issue the =over command, with an argument that is the number of character positions to indent entries by. You then issue =item commands; these take one argument and are not indented, but the text following each item is indented as far as the =over command specified. You close out a list by using the =back command. For example:

-- BEGIN MONOSPACE LISTING --
Our collection includes:

=over 4

=item 1

A bucket of figs

=item 2

A can of tuna

=item 3

A jar of marmalade

=item 1000000

A left-handed screwdriver

=back

(That's enough.)
-- END MONOSPACE LISTING --

produces output like this:

-- BEGIN MONOSPACE LISTING --
Our collection includes:

1   A bucket of figs
2   A can of tuna
3   A jar of marmalade
1000000
    A left-handed screwdriver

(That's enough.)
-- END MONOSPACE LISTING --

That's the basics of how you use POD format to write documentation; there are a couple more wrinkles described in the perlpod documentation (type "perldoc perlpod" in a terminal window), but you should be able to write workable POD documentation with the tags described above.

However, if you want to document programs or modules properly you need to do a bit more than that. The pod2man filter is used to turn POD documentation in a file into a UNIX man page. It's fairly demanding; you need to provide a series of =head1 (top level) headings with set names, in a set order, with specific information under each heading.
While some of the headings are optional, pod2man will complain if they're in the wrong order or if mandatory headings (such as the NAME and DESCRIPTION headings) are missing. A full description of the sections you need to write to create a manpage is given in the pod2man manpage (type "man pod2man" in a terminal window).

logscan includes basic man-compliant documentation. The POD documentation is all gathered at the end of the file, after the __END__ marker (which tells the Perl interpreter to ignore everything beyond this point); alternatively, we could have scattered bits of it throughout the source code, or included it at the beginning between =pod and =cut markers.

If you invoke logscan with the --help or --man options, while parsing command-line arguments it will create a new Pod::Text or Pod::Man object, and tell it to parse the file $0 ($0 is a special variable -- it's the name of the current Perl source file that the perl interpreter is munching on). By default, the POD parsers spit out their digested output on standard output. The --help option uses Pod::Text so you can read the documentation on screen; the --man option calls Pod::Man and generates man macros for groff -- so you can send a PostScript copy of the documentation to your printer like this:

-- BEGIN MONOSPACE LISTING --
logscan --man | groff -man -Tps | lpr
-- END MONOSPACE LISTING --

If you want to generate an HTML version of the documentation, you need to use the stand-alone pod2html filter, or edit the argument parser so that it calls Pod::Html.

END BOXOUT (POD and documentation)