Linux Format 16 Perl tutorial

SECTION 1: Speeding up CGI scripts with Apache::Registry and mod_perl

Last week we looked at the basics of CGI programming in Perl on Linux. CGI, the Common Gateway Interface, is slow: every time Apache runs a CGI script it has to fork (launch) a child process, wait for it to load and run, then return its results. This has several implications.

Firstly, Linux (and other UNIX-type systems) limits you to a maximum number of processes (executing programs) that can run at the same time. Someone can effectively mount a denial of service attack on your server by causing it to spawn lots of CGI processes, filling up the process table (which in turn prevents other necessary system programs from running).

Secondly, there's a fair amount of overhead associated with launching, executing, and wrapping up a child process; in the case of Perl, much of the overhead comes from the fact that Apache must execute Perl itself, compile the script, then run it. A script that pulls in several modules may run to many thousands of lines of code; this imposes additional strain on the server every time someone clicks on a button that invokes some dynamic action.

There is a way around this, of course: instead of CGI programming, you use mod_perl. mod_perl is an Apache module that consists, essentially, of the Perl interpreter compiled with glue to allow Apache to call it directly. CGI scripts need to be loaded, compiled, executed, and terminated whenever anyone calls them; mod_perl scripts are loaded and compiled just once -- when Apache starts up. Thereafter they sit in the server's memory, waiting for input. When a user sends some input for the mod_perl script, the script digests it, produces some output, then goes back to waiting for more input (instead of exiting). Thus, the startup overheads are eliminated; in extreme cases, using mod_perl instead of CGI.pm with Apache can result in a twentyfold performance increase.

Which raises the question: how do you write scripts for mod_perl? Probably the simplest way to get started is to use Apache::Registry. This module allows you to run (nearly) unaltered Perl CGI scripts under mod_perl. The main differences are that (a) you need to use Apache::Registry as well as CGI.pm, and (b) the CGI script, instead of running as a stand-alone program called from the server, runs as a subroutine called from the Apache Perl module.

Before you can use mod_perl, you need to install it. If you're running SuSE 7.1 or Red Hat 7.1, mod_perl is already present; you can switch it on in various ways. (On SuSE 7.1, edit /etc/rc.config.d/apache.rc.config, set HTTPD_SEC_MOD_PERL to "yes", then force a change of system run level. On Red Hat, you'll probably need to mess with the contents of /etc/httpd/conf.) If you need to install mod_perl from scratch, your starting place is http://perl.apache.org/; there's extensive documentation, but you will probably also need a complete Apache source tree, and a complete Perl source tree, in order to do the job properly (because it entails linking a copy of Perl as a shared library into a module that can be loaded by Apache).

You will then need to edit the httpd.conf file to indicate that Perl scripts in a designated directory (or specified by name) are to be run under mod_perl. For example:

  Alias /perl/ /perl/apache/scripts/   # optional
  PerlModule Apache::Registry
  <Location /perl>
    SetHandler perl-script
    PerlHandler Apache::Registry
    Options ExecCGI
  </Location>

The first clause aliases URLs on your server with "/perl/" in them to "/perl/apache/scripts". The <Location> container then ensures that anything in /perl/apache/scripts is handled by Apache::Registry with the ExecCGI option enabled.
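Here's a minimal sketch of the sort of script you might drop into that directory. The filename and the greeting are invented for illustration, but the point is that it's an ordinary CGI.pm script -- Apache::Registry does the work of wrapping it up as a subroutine:

  #!/usr/bin/perl -w
  # /perl/apache/scripts/hello.pl -- a hypothetical example script
  use strict;
  use CGI;

  my $cgi = CGI->new();
  print $cgi->header();     # emit the Content-type header ourselves
  print $cgi->start_html("Hello from mod_perl"),
        $cgi->h1("Hello, world"),
        $cgi->p("This script was compiled once and now lives inside Apache."),
        $cgi->end_html();

Requesting /perl/hello.pl from the server runs this as a subroutine inside whichever Apache child answers the request, rather than as a freshly forked process.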
Once the Apache httpd server has restarted, each time a server process receives a request for a file in /perl/apache/scripts it compiles that file as a subroutine and stores it in memory. If the file on disk changes, the Apache server will recompile it. In earlier versions, if you left a call to exit() in your script, it would terminate the program after a run, forcing Apache to repeatedly re-compile it; current versions of Apache::Registry override Perl's built-in exit() command.

When converting a script from plain CGI to Apache::Registry you should really set the $Apache::Registry::Debug bitmask; this saves debugging information in the server's error_log file. For example, a value of "3" causes mod_perl to log recompilation events and dump possibly useful information in the event that $@ is set on exit from the script. $@ is the error variable set whenever a call to eval() fails; eval() compiles and executes a string containing some Perl code (or a block of code) in a separate context from the main Perl interpreter, so that runtime errors don't kill the program.

It's possible to write CGI scripts that can detect whether they're running under mod_perl or bog-standard CGI: check for the MOD_PERL environment variable. In general, CGI scripts written using CGI.pm and Perl 5.004 or later will run fine under mod_perl. You may want to set:

  PerlSendHeader On

in the httpd.conf file, to make mod_perl send HTTP headers (normally it doesn't); or you can do it yourself with something like:

  print $cgi->header();

or:

  print "Content-type: text/html\n\n";
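Putting those two points together, here's a hedged sketch of a script fragment that sends its own header and then checks where it's running ($cgi is assumed to be the CGI object from the earlier example, and the warn() messages are just an illustration -- they end up in the server's error_log):

  print $cgi->header();    # safe under both plain CGI and mod_perl

  if ($ENV{MOD_PERL}) {
      # we're a subroutine living inside Apache, courtesy of Apache::Registry
      warn "hello.pl: running under mod_perl\n";
  } else {
      # we're an old-fashioned forked CGI process
      warn "hello.pl: running as plain CGI\n";
  }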
One important pitfall is that mod_perl scripts don't terminate; they run in a loop until the mod_perl module decides to clean up and re-compile them. Nor do they run in package main; each runs in a namespace of its own, based on the URL of the script, so as to keep its variables apart. Which is a good thing -- but it means variables will retain their values from one request to the next! Herein lies a trap for the unwary. Suppose we have a script that contains something like this:

  if ($cgi->param("user_logging_in") == 1) {
      # the hidden field "user_logging_in" indicates we're processing a
      # "login" dialogue and acquiring username/password information
      $username = $cgi->param("username");
      $password = $cgi->param("password");
  }
  # more code happens here ...
  # then we consult a username/password database before letting you do something

This looks harmless enough. However: if the author of the program forgot to predeclare $username and $password as lexically scoped variables with my(), then the contents of whatever you entered in the "username" and "password" fields will hang around until someone else enters new data. A cracker can prepare a dummy form where the hidden field "user_logging_in" -- which is used by the program to confirm that it's processing data supplied by a login form -- doesn't exist or isn't equal to 1; then the bomb goes off, because when they hit the "submit" button the old, stale data for "username" and "password" left behind by whoever last used the script will be interpolated when the username/password lookup happens.

The moral of this cautionary tale is that it is ***VITAL*** that you ***ALWAYS*** predeclare (and preinitialise!) variables when writing code for mod_perl -- and do it at the highest level of enclosing scope in which the variables will be set or referenced. For example:

  my ($username, $password) = ("", "");
  if ($cgi->param("user_logging_in") == 1) {
      # the hidden field "user_logging_in" indicates we're processing a
      # "login" dialogue and acquiring username/password information
      $username = $cgi->param("username");
      $password = $cgi->param("password");
  }
  # more code happens here ...
  # then we consult a username/password database before letting you do something

More importantly, always set the "strict" compiler pragma and run your mod_perl scripts with the -w flag; this will switch on pedantic error-checking and prevent an unsafe script (at least, one that's unsafe in the manner described above) from compiling:

  #!/usr/bin/perl -w
  use strict;
  use Apache::Registry;
  use CGI qw(-compile :all);   # force precompilation of all autoloaded methods

One particular benefit of mod_perl is that you can write scripts that make use of persistent database connections. Normally, the process of setting up a connection to a relational database engine such as Oracle or MySQL is slow and expensive; moreover, in a CGI script, a connection is created and then shut down every time someone invokes the script. One of the mod_perl modules worth looking at is Apache::DBI; it allows you to initialise a database connection (via the standard Perl DBI database interface) and maintain it between calls to the program. Simply add the line:

  PerlModule Apache::DBI

in httpd.conf (before all other modules using DBI) and then use normal calls to the DBI database interface in your CGI script. Any DBI calls will be forwarded to the persistent Apache::DBI module, which maintains a cache of database handles on a per-process basis; if the current process already has an open database connection, it will return the existing database handle rather than re-opening the link.

(One drawback is that Apache::DBI maintains its cache on a per-process basis; if you have a bundle of Apache child servers running, each child will make its own database connection. Another drawback is that if each user connects to the database with a unique userid on the database server, then a heavily used server will end up with an Apache::DBI cache containing one cached connection per user per Apache child: this can be a server-killer if you're running, say, twenty httpd children and have twenty different users, for a total of four hundred cached database connections! So to make effective use of Apache::DBI, avoid designing a system that uses lots of different user IDs on the database server.)
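To see how little the script itself has to change, here's a hedged sketch of a lookup against a hypothetical MySQL database -- the database name, table, user and password are invented, and $username is the value we captured in the login example earlier. With PerlModule Apache::DBI loaded, the DBI->connect() call quietly returns the cached handle on second and subsequent requests within the same child:

  use strict;
  use DBI;

  # Apache::DBI intercepts this connect(); if this Apache child already has
  # a live connection with the same parameters, the cached handle is reused.
  my $dbh = DBI->connect("DBI:mysql:database=webusers;host=localhost",
                         "webapp", "s3cret", { RaiseError => 1 });

  my $sth = $dbh->prepare("SELECT password FROM users WHERE username = ?");
  $sth->execute($username);
  my ($stored_password) = $sth->fetchrow_array();
  $sth->finish();
  # no $dbh->disconnect() needed -- Apache::DBI keeps the handle open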
END SECTION 1

SECTION 2: Parsing UNIX command lines

How often have you run command-line programs with complex sets of parameters? Perl gives you access to the command line used to run the script by stuffing each whitespace-separated chunk of it into an array called @ARGV. However, if you've written a script called ls.pl, trying to work out what to do if it's called as:

  ls.pl -a -l
or
  ls.pl -al
or
  ls.pl -la

is rather annoying.

To get around this, UNIX systems traditionally used a C library called getopt; this was later replaced by the more versatile getopts library, and then by the GNU getopt_long library, which allows both short options (of the form -arg, where arg is a single character) and POSIX-style extended options, such as:

  ls.pl --long --all

which is a bit easier to understand when you run across it in a shell script, and gets around the problem of there only being 52 upper and lower case letters to use for single-letter command-line options (a bit of a headache for some really complex programs).

In Perl, the job is best taken care of using a CPAN module called Getopt::Mixed. If you don't already have it, install it with the command:

  perl -MCPAN -e "install Getopt::Mixed;"

Getopt::Mixed allows you to use (and mix) traditional short command-line options and POSIX-style long options. To use it, the first step is to work out what the options to your program are going to be.

For example, suppose we're writing a report generator that scans logfiles created by Apache. These files reside (usually) in /var/log/httpd, and contain information (not necessarily very easy to read) about what's been going on with the web server. We're going to pick some vaguely realistic options for our hypothetical logfile scanner. Its usage looks like this:

-- BEGIN MONOSPACED LISTING --
logscan -- scan HTTP logs for miscellaneous interesting information
options:
   -h, --help           print usage message
   -d, --debug=         set debugging level (0..9, default: 1)
   -a, --resolve-addr   perform address resolution
   -r, --response=      filter by HTTP response code
   -c, --contains=      filter by contents
usage:   logscan [ options ] logfile_name
example: logscan -d2 -a --response=404 /var/log/httpd/access_log
         (Set debug level to 2, resolve IP addresses, filter HTTP
         response 404 file not found)
-- END MONOSPACED LISTING --

Firstly, each of these command-line options needs to be logged in a variable which we can consult, inside our program, where appropriate. (They're liable to show up in @ARGV in any random order and combination, so we can't just leave them there.) We first use Getopt::Mixed; then we create an option description -- a string that specifies the type of options our program expects, and what they relate to. The option description is a bit like a printf() format string, but instead of specifying the format to print variables in, it specifies the format to which command-line parameters must conform. It specifies the names of possible command-line options, whether they're mandatory or optional, whether they take an argument (for example, our debug flag takes an argument, the debug level), and so on. The format string for logscan looks like this:

  $optstring = "h a d=:i r=i c=s help>h resolve-addr>a debug>d response>r contains>c";

This is a space-separated list of option specifiers. The standalone "h" and "a" specifiers mean that these options are boolean -- there are no parameters to "-h" or "-a". The "r=i" specifier means that -r expects an integer argument; "d=:i" means that the -d option may or may not have an integer argument. "c=s" has -c looking for a mandatory string argument. The long forms are specified using the ">" notation: "resolve-addr>a" means that --resolve-addr is a synonym for -a.
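To get a feel for what that specification is meant to accept, here are a few invocations it covers -- these particular command lines are made up for illustration, but each combines the options in ways the spec above allows:

  logscan -h                                            # boolean flag: print the usage message
  logscan -a /var/log/httpd/access_log                  # boolean flag: resolve addresses
  logscan -d /var/log/httpd/access_log                  # -d without a level (defaults to 1)
  logscan -d2 --response=404 /var/log/httpd/access_log  # -d with a level, plus the long form of -r
  logscan --resolve-addr --contains=foo /var/log/httpd/access_log   # long synonyms throughout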
Now we've specified the option format, we have to process each option on the command line, parse it, and take appropriate action. We do that like this:

-- BEGIN MONOSPACED LISTING --
#!/usr/bin/perl

use Getopt::Mixed;

##### set up the option string for this program
my $optstring = "h a d=:i r=i c=s help>h " .
                "resolve-addr>a debug>d response>r contains>c";

##### set up the online usage message for this program
my $usage = <<"%%";
logscan -- scan HTTP logs for miscellaneous interesting information
options:
   -h, --help           print usage message
   -d, --debug=         set debugging level (0..9, default: 1)
   -a, --resolve-addr   perform address resolution
   -r, --response=      filter by HTTP response code
   -c, --contains=      filter by contents
usage:   logscan [ options ] logfile_name
example: logscan -d2 -a --response=404 /var/log/httpd/access_log
         (Set debug level to 2, resolve IP addresses, filter HTTP
         response 404 file not found)
%%

##### set up default values for logscan's configuration variables
my $debug_level     = 0;
my $resolv_addr     = 0;
my $response_filter = "";
my $content_filter  = "";

##### parse command line arguments using Getopt::Mixed
Getopt::Mixed::init($optstring);
while (my ($option, $value) = Getopt::Mixed::nextOption()) {
    if ($option =~ /^d(ebug)*/) {
        if ($value == 0) {
            $debug_level = 1;        # using the -d flag sets debug=1, unless a
        } else {                     # debug level is specified, in which case
            $debug_level = $value;   # we set debug level to the specified value
        }
    } elsif ($option =~ /^a|resolve-addr/) {
        $resolv_addr = 1;            # resolve_addr can be true or false only
    } elsif ($option =~ /^r(esponse)*/) {
        $response_filter = $value;   # response code to search for
    } elsif ($option =~ /^c(ontains)*/) {
        $content_filter = $value;    # contents to search for in logfile
    } elsif ($option =~ /^h(elp)*/) {
        print $usage;
        exit 0;
    }
}
Getopt::Mixed::cleanup();

##### Whatever is left over in @ARGV, it isn't an option. So it must be
##### the name of a logfile to search! Pluck it off @ARGV, open it, and
##### begin to scan ...

my $target_file = shift @ARGV;

##### rest of program goes here!
-- END MONOSPACED LISTING --

The important part of this leading section of code is the line immediately following the call to Getopt::Mixed::init(), which tells Getopt::Mixed to start parsing. We loop continuously on Getopt::Mixed::nextOption(), which returns successive option:value pairs until it runs out of command-line options to return. Using each option, we then execute a set of if/elsif statements which in effect form a case statement. Each case is triggered by an option we named in our option specification; when we see its name in $option we take appropriate action, be it to print the message stored in $usage and exit, or to stash an associated parameter in a variable we pre-declared for that purpose.

If all this looks long-winded, just try writing your own command-line argument parser by hand! Because arguments can be specified in any order (or bunched, like "logscan -ad --contains=foo"), it's a non-trivial task. Getopt::Mixed makes it easy to write programs with complex command-line arguments; the only point to remember when using it is that you should think carefully about the program's functions and options before you start coding. (But isn't that always the case?)
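The listing stops where the real work starts. Purely as a taster -- the actual log analyser is next month's project, so treat this as a hedged sketch rather than the finished article -- here is one way the "rest of program goes here!" section might use the variables we've just filled in:

-- BEGIN MONOSPACED LISTING --
##### a minimal, illustrative scanning loop -- not the finished analyser
open(LOG, "< $target_file") or die "can't open $target_file: $!\n";
while (my $line = <LOG>) {
    # skip lines that don't match the filters (if any were given)
    next if ($content_filter  ne "" && $line !~ /\Q$content_filter\E/);
    next if ($response_filter ne "" && $line !~ / $response_filter /);
    print $line;
    print STDERR "matched: $line" if ($debug_level > 1);
    # (address resolution, driven by $resolv_addr, is left for next month)
}
close(LOG);
-- END MONOSPACED LISTING --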
BOOK TITLE: "Programming the Perl DBI" by Tim Bunce and Alligator Descartes DBI, the Database Interface, is one of the most powerful (and useful) bits of glue in Perl's armoury of tools for tying external programs together. DBI is an attempt to define a common API for all relational databases. You load the DBI module and create a DBI object, then tell it to connect to a database server; the DBI object then picks the appropriate DBD module (database driver) and uses it to channel your commands to the right place. The beauty of this is that you can modify your Perl program to switch between databases (such as Oracle, MySQL, PostgreSQL, Ingres, Informix, and DB2) without major rewrites: in PHP, for example, each relational database has a different set of commands for opening a connection, sending a SQL query, and so on. Because DBI is so flexible and powerful, it has become indispensible to anyone trying to work with multiple databases on UNIX -- but the power comes at a price, and DBI isn't a simple module. Tim Bunce and Alligator Descartes wrote the DBI core, its API, and the most commonly used drivers; this book is their roadmap of how to talk to databases in Perl. Be warned that before you dip into this book it will help considerably to have at least a passing familiarity with SQL (structured query language) and relational databases. You should also be able to use other people's packaged Perl modules: this isn't a beginner's text. SQL database servers have distinctive quirks and features, and while DBI attempts to gather them all behind one interface you also need to be familiar with the features of your chosen RDBMS. However, if you need to do serious data mangling -- for example, writing database-driven web front ends, or porting tables from one server to another -- this book is a potential life-saver. END BOXOUT NEXT MONTH: Ever wondered how they print your bank statement? Lots of figures lined up in columns, page numbers neatly printed at the top of each sheet -- you didn't think they used a series of "print" statements, did you? If you need to generate long output listings, you need to examine the "write" command and output formats -- Perl's built-in report generator. We'll take a look at write(), and (with this month's introduction to parsing command line arguments) produce a program for analysing system logfiles and reporting on anomalies. We'll also take a preliminary look at POD, MakeMaker, and other Perl utilities for packaging up your programs in a way that lets other people use them. END NEXT MONTH THE GURU IS "IN": Want coverage of a particular perl topic? Send email to charlie.stross@linux_format.co.uk << OR WHEREVER >>!