Linux Format 16
Perl tutorial
SECTION 1: Speeding up CGI scripts with Apache::Registry and mod_perl
Last month we looked at the basics of CGI programming in Perl on Linux. CGI,
the Common Gateway Interface, is slow; every time a CGI script is invoked,
Apache has to fork (launch) a child process, wait for it to load and run,
then collect its results. This has several implications.
Firstly, Linux (and other UNIX-type systems) limit you to a maximum number
of processes (executing programs) that can run at the same time. Someone
can effectively mount a denial of service attack on your server by causing
it to spawn lots of CGI processes, filling up the process table (which in
turn prevents other necessary system programs from running). Secondly,
there's a fair amount of overhead associated with launching, executing,
and wrapping up a child process; in the case of Perl, much of the overhead
comes from the fact that Apache must execute Perl itself, compile the script,
then run it. A script that calls several modules may run to many thousands
of lines of code; this imposes additional strain on the server every time
someone clicks on a button that invokes some dynamic action.
There is a way around this, of course: instead of CGI programming, you
use mod_perl. mod_perl is an Apache module that consists, essentially,
of the Perl interpreter compiled with glue to allow Apache to call it
directly. CGI scripts need to be loaded, compiled, executed, and
terminated whenever anyone calls them; mod_perl scripts are loaded
and compiled just once -- when Apache starts up. Thereafter they sit in
the server's memory, waiting for input. When a user sends some input for
the mod_perl script, the script digests it, produces some output, then
goes back to waiting for more input (instead of exiting). Thus, the
startup overheads are eliminated; in extreme cases, using mod_perl instead
of CGI.pm with Apache can result in a twentyfold performance increase.
This raises the question: how do you write scripts for mod_perl?
Probably the simplest way to get started is to use Apache::Registry. This
module allows you to run (nearly) unaltered Perl CGI scripts
under mod_perl. The main differences are that (a) you need to use
Apache::Registry as well as CGI.pm, and (b) the CGI script, instead
of running as a stand-alone program called from the server, runs as a
subroutine called from the Apache Perl module.
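To make this concrete, here is a minimal, module-free CGI script (the name hello.pl is hypothetical) that runs unchanged in either environment -- as a fresh process per request under plain CGI, or as a cached subroutine under Apache::Registry:

```perl
#!/usr/bin/perl -w
# hello.pl -- a minimal CGI script; under Apache::Registry this same file
# is compiled once into a subroutine and simply called on each request.
use strict;

# QUERY_STRING holds everything after the "?" in the URL, if anything.
my $name = $ENV{QUERY_STRING} || "world";

# Build the response: an HTTP header, a blank line, then the body.
my $page = "Content-type: text/html\n\n"
         . "<html><body><h1>Hello, $name!</h1></body></html>\n";
print $page;
```

The only thing mod_perl changes here is *how* the code is run, not what it says; that's the whole point of Apache::Registry.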
Before you can use mod_perl, you need to install it. If you're running
SuSE 7.1 or Red Hat 7.1, mod_perl is already present; you can switch it
on in various ways. (On SuSE 7.1, edit /etc/rc.config.d/apache.rc.config
and set HTTPD_SEC_MOD_PERL to "yes"; then force a change of system run level.
On Red Hat, you'll probably need to mess with the contents of /etc/httpd/conf.)
If you need to install mod_perl from scratch, your starting place is
http://perl.apache.org/; there's extensive documentation, but you will
probably also need a complete Apache source tree, and a complete Perl source
tree, in order to do the job properly (because it entails linking a copy of
Perl as a shared library into a module that can be loaded by Apache).
You will then need to edit the httpd.conf file to indicate that Perl scripts
in a designated directory (or specified by name) are to be run under
mod_perl. For example:
Alias /perl/ /perl/apache/scripts/ #optional
PerlModule Apache::Registry
<Location /perl/>
SetHandler perl-script
PerlHandler Apache::Registry
Options ExecCGI
</Location>
The first clause aliases URLs on your server with "/perl/" in them to
"/perl/apache/scripts". The <Location> block then ensures that
anything in /perl/apache/scripts is handled by Apache::Registry with
the ExecCGI option enabled.
Once the Apache HTTPD server restarts, each time a server process receives a
request for a file in /perl/apache/scripts, it will compile it as a subroutine
and store it in memory. If the file on disk changes, the Apache server will
recompile it. In earlier versions, if you left a call to exit() in your script,
it would terminate the program after a run, forcing Apache to repeatedly
recompile it; current versions of Apache::Registry override Perl's built-in
exit() command.
When trying to convert a script from CGI to Apache::Registry you should
really set the $Apache::Registry::Debug bitmask; this saves debugging
information in the server's error_log file. For example, a value of "3"
causes mod_perl to log recompilation events and to dump possibly useful
information in the event that $@ is set on exit from the script. $@ is
the error variable set whenever a call to eval() fails; eval() compiles
and executes a string containing some Perl code (or a block of code) in
a different context from the main Perl interpreter, so that runtime
errors don't kill the program.
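A quick illustration of eval() and $@, runnable outside mod_perl:

```perl
#!/usr/bin/perl -w
use strict;

# eval() runs the block in a protective wrapper: a runtime error that
# would normally kill the program instead aborts the block and sets $@.
my $result = eval { my ($x, $y) = (10, 0); $x / $y };  # dies: division by zero
my $error  = $@;    # save $@ immediately -- later code may clobber it

if ($error) {
    print "caught: $error";
} else {
    print "result: $result\n";
}
```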
It's possible to write CGI scripts that can detect whether they're running
under mod_perl or bog-standard CGI; check for the MOD_PERL environment
variable.
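For example (the exact version string shown in the comment varies between installations):

```perl
#!/usr/bin/perl -w
use strict;

# mod_perl sets the MOD_PERL environment variable (to something like
# "mod_perl/1.25") before invoking a script; a plain CGI environment
# doesn't set it at all.
my $under_mod_perl = exists $ENV{MOD_PERL} ? 1 : 0;

if ($under_mod_perl) {
    print "running under $ENV{MOD_PERL}\n";   # persistence applies
} else {
    print "running as an ordinary CGI (or standalone) script\n";
}
```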
In general, CGI scripts written using CGI.pm and Perl 5.004 or later will
run fine under mod_perl. You may want to set:
PerlSendHeader ON
in the httpd.conf file, to make mod_perl send HTTP headers (normally it
doesn't); or you can do it yourself by doing something like:
print $cgi->header();
or:
print "Content-type: text/html\n\n";
One important pitfall is that mod_perl scripts don't terminate; they run in a
loop until the mod_perl module decides to clean up and re-compile them.
Scripts don't run in package main; each runs in a namespace of its own,
based on the URL of the script, so as to keep different scripts'
variables apart. That's just as well, because variables retain their
values between invocations; herein lies a trap for the unwary.
Suppose we have a script that contains something like this:
if ($cgi->param("user_logging_in") == 1) {
    # the hidden field "user_logging_in" indicates we're processing a
    # "login" dialogue and acquiring username/password information
    $username = $cgi->param("username");
    $password = $cgi->param("password");
}
# more code happens here ...
# then we consult a username/password database before letting you do something
This looks harmless enough. However: if the author of the program forgot
to predeclare $username and $password as lexically scoped variables with
my(), then the contents of whatever you entered in the "username" and
"password" fields will hang around until someone else enters new data. A
cracker can prepare a dummy form where the hidden field "user_logging_in"
-- which is used by the program to confirm that it's processing data
supplied by a logging-in form -- doesn't exist or isn't equal to 1; then
the bomb goes off, because when they hit the "submit" button the old,
stale data for "username" and "password" left behind for whoever last used
the script will be interpolated when the username/password lookup happens.
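You can see the trap without a live server by faking Apache::Registry's behaviour in plain Perl: treat the script body as a subroutine called once per "request". (handle_request() and its parameters are stand-ins invented for this sketch; %param plays the role of $cgi->param().)

```perl
#!/usr/bin/perl
use strict;

# Faking Apache::Registry: handle_request() stands in for one invocation
# of the cached script; package globals are NOT reset between calls.
our ($username, $password);

sub handle_request {
    my (%param) = @_;
    if (($param{user_logging_in} || 0) == 1) {
        $username = $param{username};
        $password = $param{password};
    }
    # ... more code happens here ...
    return "looking up user '$username'";
}

# Request 1: a genuine login; the globals get set.
my $first  = handle_request(user_logging_in => 1,
                            username => "alice", password => "secret");

# Request 2: a crafted form with no "user_logging_in" field; the globals
# still hold the previous visitor's credentials.
my $second = handle_request();

print "$first\n$second\n";
```

Both calls report the same user: the second "request" never supplied a username at all, yet the lookup would happily proceed with alice's stale credentials.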
The moral of this cautionary tale is that it is ***VITAL*** that
you ***ALWAYS*** predeclare (and preinitialise!) variables when writing code
for mod_perl -- and do it at the highest level of enclosing scope in which the
variables will be set or referenced. For example:
my ($username, $password) = "";
if ($cgi->param("user_logging_in") == 1) {
    # the hidden field "user_logging_in" indicates we're processing a
    # "login" dialogue and acquiring username/password information
    $username = $cgi->param("username");
    $password = $cgi->param("password");
}
# more code happens here ...
# then we consult a username/password database before letting you do something
More importantly, always set the "strict" compiler pragma and run your
mod_perl scripts with the -w flag. "use strict" refuses to compile a
script that uses undeclared variables (at least, one that's unsafe in
the manner described above), while -w switches on pedantic warnings at
runtime:
#!/usr/bin/perl -w
use strict;
use Apache::Registry;
use CGI qw(-compile :all); # force precompilation of all autoloaded methods
One particular benefit of mod_perl is that you can write scripts that make use
of persistent database connections. Normally, the process of setting up a
connection to a relational database engine such as Oracle or MySQL is slow
and expensive: moreover, in a CGI script, a connection is created and then
shut down every time someone invokes the script. One of the mod_perl modules
worth looking at is Apache::DBI; this module allows you to initialise a
database connection (via the standard Perl DBI database interface) and
maintain it between calls to the program. Simply add the line:
PerlModule Apache::DBI
to httpd.conf (before any other modules that use DBI) and then use normal calls
to the DBI database interface in your CGI script. Any DBI calls will be
forwarded to the persistent Apache::DBI module, which maintains a cache of
database handles on a per-process basis; if the current process already
has an open database connection, it will return the database handle rather
than re-opening the link.
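Here's a sketch of what the script side looks like, assuming the PerlModule Apache::DBI line is already in httpd.conf and that a MySQL database called "weblog" with user "www" exists -- all of these names are hypothetical, and this fragment needs a live database server to actually run:

```perl
#!/usr/bin/perl -w
use strict;
use DBI;

# An ordinary DBI connect; under Apache::DBI this call is intercepted
# and, if this httpd child already holds a matching handle, the cached
# handle is returned instead of a fresh connection being opened.
my $dbh = DBI->connect("DBI:mysql:database=weblog;host=localhost",
                       "www", "secret", { RaiseError => 1 });

my $rows = $dbh->selectall_arrayref(
    "SELECT url, hits FROM pages ORDER BY hits DESC LIMIT 10");
print "$_->[0]: $_->[1] hits\n" for @$rows;

# Note: don't bother calling $dbh->disconnect() -- Apache::DBI turns it
# into a no-op so the connection survives for the next request.
```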
(One drawback is that Apache::DBI maintains its cache on a per-process
basis; if you have a bundle of Apache child servers running, each
server will make its own database connection. Another drawback is that if
each user connects to the database with a unique userid on the database
server, then a heavily used server will end up with an Apache::DBI cache
containing one cached connection per user and per Apache child: this
can be a server-killer if you're running, say, twenty httpd children
and have twenty different users, for a total of four hundred cached
database connections! So to make effective use of Apache::DBI, avoid
designing a system that uses lots of different user ID's on the database
server.)
END SECTION 1
SECTION 2: Parsing UNIX command lines
How often have you run command-line programs with complex sets of
parameters?
Perl gives you access to the command line used to run the script by
stuffing each whitespace-separated chunk of it into an array called @ARGV.
However, if you've written a script called ls.pl, trying to work out
what to do if it's called as:
ls.pl -a -l
or
ls.pl -al
or
ls.pl -la
is rather annoying.
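Here's what actually lands in @ARGV for the bunched form (simulated by assigning to @ARGV directly, as the shell would have done):

```perl
#!/usr/bin/perl
use strict;

# Simulate "ls.pl -al" by filling @ARGV ourselves, as the shell would:
@ARGV = ('-al');

my $count = scalar @ARGV;   # one chunk -- not two separate options!
my ($chunk) = @ARGV;
print "got $count argument: $chunk\n";
# Your code now has to work out that "-al" means "-a" plus "-l", and
# that "-la" means exactly the same thing: hence option-parsing libraries.
```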
To get around this, UNIX systems traditionally used a C library called
getopt; this was later replaced by the more versatile getopts library,
and then the GNU getopt_long library. The latter allows both short
options (of the form -a, where "a" is a single character) and
POSIX-style extended options, such as:
ls.pl --long --all
which is a bit easier to understand when you run across it in a shell
script, and gets around the problem of there only being 52 upper and
lower case letters to use for single letter command-line options (a
bit of a headache for some really complex programs).
In Perl, the job is best taken care of using a CPAN module called
Getopt::Mixed. If you don't already have it, install it with the
command:
perl -MCPAN -e "install Getopt::Mixed;"
Getopt::Mixed allows you to use (and mix) traditional short command-
line options and POSIX-style long options. To use it, the first step
is to work out what the options to your program are going to be. For
example, suppose we're writing a report generator that scans logfiles
created by Apache. These files reside (usually) in /var/log/httpd, and
contain information (not necessarily very easy to read) about what's
been going on with the web server.
We're going to pick some vaguely realistic options for our hypothetical
logfile scanner. Its usage looks like this:
-- BEGIN MONOSPACED LISTING --
logscan -- scan HTTP logs for miscellaneous interesting information
options:
-h, --help print usage message
-d, --debug= set debugging level (0..9, default: 1)
-a, --resolve-addr perform address resolution
-r, --response= filter by HTTP response code
-c, --contains= filter by contents
usage:
logscan [ options ] logfile_name
example:
logscan -d2 -a --response=404 /var/log/httpd/access_log
(Set debug level to 2, resolve IP addresses, filter HTTP response 404
file not found)
-- END MONOSPACED LISTING --
Firstly, each of these command line options needs to be stored in a
variable which we can consult, inside our program, where appropriate.
(They're liable to show up in @ARGV in any random order and combination,
so we can't just leave them there.)
We first use Getopt::Mixed; then we create an option description -- a
string that specifies the type of options our program expects, and what they
relate to. The option description is a bit like a printf() format string, but
instead of specifying the format to print variables in, it specifies the
format to which command-line parameters must conform. It specifies the
names of possible command line options, whether they're mandatory or
optional, whether they take an argument (for example: our debug level
flag takes an argument, the debug level), and so on.
The format string for logscan looks like this:
$optstring = "h a d=:i r=i c=s help>h resolve-addr>a debug>d response>r contains>c";
This is a space-separated list of option specifiers.
The standalone "h" and "a" options mean that these options are boolean --
there are no parameters to "-h" or "-a". The "r=i" option means that -r
expects an integer argument; the "d=:i" means that the -d option may or
may not have an integer argument. "c=s" has -c looking for a mandatory
string argument.
The long forms are specified using the ">" notation: "resolve-addr>a"
means that --resolve-addr is a synonym for -a.
Now we've specified the option format, we have to process each option on
the command line, parse it, and take appropriate action. We do that like
this:
-- BEGIN MONOSPACED LISTING --
#!/usr/bin/perl
use Getopt::Mixed;
##### setup the option string for this program
my $optstring = "h a d=:i r=i c=s help>h " .
"resolve-addr>a debug>d response>r contains>c";
##### setup the online usage message for this program
my $usage = <<"%%";
logscan -- scan HTTP logs for miscellaneous interesting information
options:
-h, --help print usage message
-d, --debug= set debugging level (0..9, default: 1)
-a, --resolve-addr perform address resolution
-r, --response= filter by HTTP response code
-c, --contains= filter by contents
usage:
logscan [ options ] logfile_name
example:
logscan -d2 -a --response=404 /var/log/httpd/access_log
(Set debug level to 2, resolve IP addresses, filter HTTP response 404
file not found)
%%
##### set up default values for logscan's configuration variables
my $debug_level = 0;
my $resolv_addr = 0;
my $response_filter = "";
my $content_filter = "";
##### parse command line arguments using Getopt::Mixed
Getopt::Mixed::init($optstring);
while (my ($option, $value) = Getopt::Mixed::nextOption()) {
    if ($option =~ /^d(ebug)*/) {
        if (!defined($value) || $value == 0) {
            $debug_level = 1;        # a bare -d flag sets debug=1, unless a
        } else {                     # debug level is specified, in which case
            $debug_level = $value;   # we set debug level to the specified value
        }
    } elsif ($option =~ /^(a|resolve-addr)/) {
        $resolv_addr = 1;            # resolve_addr can be true or false only
    } elsif ($option =~ /^r(esponse)*/) {
        $response_filter = $value;   # response code to search for
    } elsif ($option =~ /^c(ontains)*/) {
        $content_filter = $value;    # contents to search for in logfile
    } elsif ($option =~ /^h(elp)*/) {
        print $usage;
        exit 0;
    }
}
Getopt::Mixed::cleanup();
##### Whatever is left over in @ARGV, it isn't an option. So it must be
##### the name of a logfile to search! Pluck it off @ARGV, open it, and
##### begin to scan ...
my $target_file = shift @ARGV;
##### rest of program goes here!
-- END MONOSPACED LISTING --
The important part of this leading section of code is the loop immediately
following the call to Getopt::Mixed::init() (which hands our option
description to Getopt::Mixed and tells it to start parsing).
We loop continuously on Getopt::Mixed::nextOption, which returns
successive option:value pairs until it runs out of command-line options
to return. Using each option, we then execute a set of if/elsif statements
which in effect form a case statement. Each case is triggered by an
option we named in our option specification; when we see its name in
$option we take appropriate action, be it to print the message stored
in $usage and exit, or to stash an associated parameter in a variable
we pre-declared for that purpose.
If all this looks long-winded, just try writing your own command-line
argument parser by hand! Because arguments can be specified in any
order (or bunched, like "logscan -ad --contains=foo"), it's a non-
trivial task. Getopt::Mixed makes it easy to write programs with
complex command-line arguments; the only point to remember when using
it is that you should think carefully about the program's functions
and options before you start coding. (But isn't that always the case?)
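As an aside: Getopt::Mixed isn't bundled with Perl, but the standard Getopt::Long module can handle a similar mix of bundled short options and POSIX long options, with a different interface. A sketch, parsing the example command line from a local array rather than the real @ARGV:

```perl
#!/usr/bin/perl -w
use strict;
use Getopt::Long qw(GetOptionsFromArray);

# "bundling" enables single-dash clusters like -d2 alongside --long options.
Getopt::Long::Configure("bundling");

my ($debug, $resolve, $response, $contains) = (0, 0, "", "");

# Parse a copy of the example command line from the usage message above.
my @args = ('-d2', '-a', '--response=404', '/var/log/httpd/access_log');
GetOptionsFromArray(\@args,
    "debug|d:i"      => \$debug,     # optional integer argument
    "resolve-addr|a" => \$resolve,   # boolean flag
    "response|r=i"   => \$response,  # mandatory integer argument
    "contains|c=s"   => \$contains,  # mandatory string argument
) or die "bad options\n";

# Anything left in @args is a non-option: the logfile name.
print "debug=$debug resolve=$resolve response=$response file=$args[0]\n";
```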
END SECTION 2
BOXOUT: Good Books
BOOK TITLE: "Advanced PErl Programming" by Joseph N. Hall
<< DETAILS -- 300 words >>
If I was cast away on a desert island with a computer for company, and
my salvation depended on programming my way to rescue, I'd want a copy
of this book with me.
BOOK TITLE: "Programming the Perl DBI" by Tim Bunce and Alligator Descartes
DBI, the Database Interface, is one of the most powerful (and useful)
bits of glue in Perl's armoury of tools for tying external programs
together. DBI is an attempt to define a common API for all relational
databases. You load the DBI module and create a DBI object, then tell
it to connect to a database server; the DBI object then picks the
appropriate DBD module (database driver) and uses it to channel your
commands to the right place. The beauty of this is that you can modify
your Perl program to switch between databases (such as Oracle, MySQL,
PostgreSQL, Ingres, Informix, and DB2) without major rewrites: in PHP,
for example, each relational database has a different set of commands
for opening a connection, sending a SQL query, and so on.
Because DBI is so flexible and powerful, it has become indispensable
to anyone trying to work with multiple databases on UNIX -- but the
power comes at a price, and DBI isn't a simple module. Tim Bunce and
Alligator Descartes wrote the DBI core, its API, and the most commonly
used drivers; this book is their roadmap of how to talk to databases
in Perl.
Be warned that before you dip into this book it will help considerably
to have at least a passing familiarity with SQL (structured query
language) and relational databases. You should also be able to use
other people's packaged Perl modules: this isn't a beginner's text.
SQL database servers have distinctive quirks and features, and while
DBI attempts to gather them all behind one interface you also need to
be familiar with the features of your chosen RDBMS. However, if you
need to do serious data mangling -- for example, writing
database-driven web front ends, or porting tables from one server to
another -- this book is a potential life-saver.
END BOXOUT
NEXT MONTH:
Ever wondered how they print your bank statement? Lots of figures lined
up in columns, page numbers neatly printed at the top of each sheet --
you didn't think they used a series of "print" statements, did you?
If you need to generate long output listings, you need to examine the
"write" command and output formats -- Perl's built-in report generator.
We'll take a look at write(), and (with this month's introduction to
parsing command line arguments) produce a program for analysing
system logfiles and reporting on anomalies. We'll also take a
preliminary look at POD, MakeMaker, and other Perl utilities for
packaging up your programs in a way that lets other people use them.
END NEXT MONTH
THE GURU IS "IN": Want coverage of a particular Perl topic? Send
email to charlie.stross@linux_format.co.uk << OR WHEREVER >>!