LINUX FORMAT PERL TUTORIAL
Issue 19


Last month, we saw how to combine a report generator, file scanner,
and command line parser to write a simple tool for scanning web
server logfiles.  But it didn't do much with the data! If we want to
really analyse what's been going on, it helps to store the details in
a relational database, which we can then query in various ways. Perl
is at its strongest when it comes to throwing large quantities of
data around, and being able to interact with databases is essential,
so this month we're going to take a first look at the perl DBI -- the
database interface.


In addition, we'll take a closer look at how to pull useful information
out of strings and see how to use the Perl debugger to make finding errors 
easier.


HEADING: Database access using DBI


One of the most useful Perl modules on CPAN is the DBI suite. DBI is Perl's
database interface; using it, you can give Perl direct access to relational
database servers running on the local (or a remote) machine. 


DBI provides us with a common API (application programming interface) for
talking to relational databases, but the sad fact is that most databases
do everything in their own inimitable way; even the dialect of SQL
(structured query language, the tool used for expressing queries) that
they provide will be different. Thus, DBI sits on top of a lower-level
DBD module -- DBD is short for Database Driver -- which you (as the
application programmer) don't need to see. There are different DBD modules
for most of the databases that run on Linux, but as long as you go through
DBI and stick to a common pidgin dialect of SQL, you don't need to be
aware of their particular foibles.  Supported databases include Oracle,
Sybase, Informix, Adabas, MySQL, PostgreSQL, Ingres, DB2, Interbase,
and a whole slew of others; there's also an ODBC driver that lets you
talk to any ODBC-compliant database, and an LDAP driver that lets you
talk to LDAP databases. You can even use DVD as a front end to query
XBase files (as used by dBase and Foxbase).


Note that there is a book on DBI that is an absolutely vital addition to your
shelf if you have anything to do with database programming and Perl;
"Programming the Perl DVI" by Alligator Descartes and Tim Bunce (pub. O'Reilly
& Associates; ISBN 1-56592-699-4). As Alligator and Tim more or less invented
the DBI specification between them and wrote most of the DBD drivers, they
know what they're talking about. Note that this book will not teach you about
the theory of relational databases -- for that you need something like
"Introduction to Database Systems" by C. J. Date (pub. Addison-Wesley,
ISBN 0-201-82458-2). If you aren't familiar with the relational model, you
will need to do some reading; relational databases do not resemble other
information storage tools, and you can get into some nasty messes if you 
don't have a basic grasp of the key ideas. Finally, you'll also need some 
documentation for the database server you're working on. MySQL comes with a
copious manual; Oracle comes with about 250Mb of PDFs of manuals, and in
both cases O'Reilly and Associates have books that provide some coverage
of those database servers.


To get started using DBI, it helps to have a local database server up and
running. In a fit of shameless cowardice, I am going to assume that your
Linux system is already running MySQL -- specifically MySQL version 3.23.39
-- and you've followed the instructions under "Post-installation setup
and testing" in the INSTALL-SOURCE file in the MySQL distribution,
specifically to start up the server and create a username and password.
Warning: using a SQL database server without assigning access privileges
and sensible passwords is a major security risk -- they're designed to be
accessible via TCP/IP, so strangers may be able to connect to your database 
and examine or corrupt it. (Second warning: if you think the megabytes of
documentation for MySQL look complicated and intimidating, you obviously 
haven't seen Oracle yet; there is a reason people are paid a decent living
to do database administration for the rest of us!)


To install DBI and the MySQL DBD drivers, your best bet is to use the CPAN
module to install Bundle::Mysql from CPAN:


   perl -MCPAN -e "install Bundle::Mysql;"


DBI has its own rather unusual way of working. The first step in talking
to a database is to obtain a "database handle" -- a reference to an
object that DBI uses as a container for various information about a
connection to the database engine. Once you've got a database handle,
you can issue commands to it (for example, to execute a SQL statement);
these in turn return a variety of things -- including arrays of results,
or statement handles (which are in turn references to a special type
of DBI object that allows you to scan through the results of a query
without loading the whole lot into an array in working memory).


You can get a handle on a MySQL database like this:


-- BEGIN MONOSPACED LISTING --


  use DBI;
  
  my $user = "mysql";
  my $password = "some_password";


  my $dbh =  DBI->connect("dbi:mysql:database=test;host=localhost;port=3306",
                          $user, $password);


-- END MONOSPACED LISTING --


(Note that $user and $password must be a valid username/password pair that
your MySQL database has been configured to recognize. The connect string that
precedes them is specific to MySQL; syntax may vary for other databases.)


The value you get back -- $dbh -- is a database handle if everything is
successful, or an error code (if not). There's no law saying that you 
have to connect to only one MySQL database at a time; in fact, you can use
DBI->connect() with different connect strings to open handles on different
types of database engine (or on databases on remote servers) at the same
time in order to exchange data between them.


MySQL (and other RDBMSs) store collections of data in databases; each database
may contain one or more tables. A table is a collection of records with the
same structure; conceptually, we may talk about rows (individual records) and
columns (fields in the record structure). We define the structure of a
database, and insert, update, create, and select data from the database, using
SQL (structured query language), which in turn is comprised of a data
definition language and a data manipulation language. What follows won't make
much sense to you unless you have a smattering of SQL ...


Having used DBI to connect to a database called "test", we can create a table
in "test" (let's call it test1) like this:


-- BEGIN MONOSPACED LISTING --


    my $new_table_name = "test1";


    if (grep(/$new_table_name/, $dbh->tables())) {
        print "a table called $new_table_name already exists ",
              "in the current database\n";
    } else {
        my $stmt = "CREATE TABLE test1 (
                       id    INTEGER NOT NULL,
                       name  VARCHAR (32),
                       addr  VARCHAR (128)
                   )";
        my $rows = $dbh->do($stmt);
        print "Succeeded: \n$stmt\n" unless ($rows == 0);
    }


-- END MONOSPACED LISTING --


This snippet of code demonstrates a couple of features of DBI. Firstly,
the database handle object supplies us with some useful utility methods,
such as tables() -- this returns an array of the table names in the
current database (which we check, using grep(), to ensure that we haven't
already created a table called $new_table_name). Secondly, we embed a
SQL data definition language statement (CREATE TABLE ...) in a Perl scalar
and feed it to the database handle via the do() method. do() nominally
returns the number of rows affected by a statement; in this case, it will
probably return -1 (no rows affected) or the string "0E0" (which also
means "no rows affected but value is non-false"). A return value of 0
means the statement failed.


If we execute the above chunk of code twice, it will only create 'test1'
once -- thereafter it will whine that this table already exists. (And a
good thing too.)


Inserting new records into a table is relatively easy; the do() method
is fine for smallish batches of data:


-- BEGIN MONOSPACED LISTING --


    foreach $record (@array_of_records) {
        my $insert_stmt = "INSERT INTO test1 (id, name, addr) " .
                          "VALUES ( " . 
                          join(", ", ( $record->{index_number} ,
                                       $record->{name},
                                       $record->{address} )) .
                          ")";
        if ( ! $dbh->do($insert_stmt) ) {
            print "error inserting:\n", $insert_stmt, "\n";
    }


-- END MONOSPACED LISTING --


Here we're adding an array of Perl records to a table in a database. Each
record is a hashref (a reference to a hash with fields called index_number,
name, and address). For each record, we glom together a SQL INSERT statement,
then we call do() on it and test for a false return value.


This is readable, but not the fastest way of doing things if we have a lot
of data to inject into a database. For that, we use the prepare/execute
mechanism: we prepare a SQL statement containing placeholders for our
data values, then repeatedly execute it:


-- BEGIN MONOSPACED LISTING --


    my $insert_template  = "INSERT INTO test1 (id, name, addr) VALUES (?, ?, ?)";
                          
    my $statement_handle = $dbh->prepare($insert_template);


    foreach $record (@big_array_of_records) {
        $sth->execute($record->{index_number}, $record->{name}, $record->{address})
            or die $dbh->errstr();
    }


-- END MONOSPACED LISTING --


We pass a SQL template to the prepare() method, with question marks in place
of actual values that we're going to insert; prepare() returns an object called
a statement handle, which is a subclass of a database handle that is ready and
waiting to execute that SQL statement as soon as it's supplied with some data.
Every time we call execute(), its parameters are slotted in in place of the
question marks in th template, and the INSERT statement is executed. If we
get a result of FALSE we die, emitting the results of errstr() (the error
string returned from the database via DBI).


We can use statement handles to speed up most non-SELECT statements, such
as UPDATEs (which modify an existing record) or DELETEs (which delete them). 
SELECT is a special case; it's used by SQL to search the database for data, 
and it can return a huge amount of information.


When we call SELECT, a SQL database returns a view of those records that match
the SQL query in the WHERE clause of the SELECT statement. This is potentially
enormous! So rather than return an array of records, the Perl DBI returns 
a statement handle; we then use the statement handle to return each record
in the results, one at a time, a bit like reading lines from a file. For
example:


-- BEGIN MONOSPACED LISTING --


    my $query = "SELECT name, addr FROM test1 WHERE (id > 1)";
    my $sth   = $dbh->execute($query);


    while (my @record = $sth->fetchrow_array()) {
        printf(" %12s  %64s\n", $record[0], $record[1]);
    } 


-- END MONOSPACED LISTING --


The execute() statement returns a statement handle. Once we've got it,
we can examine the results of our query one record at a time. We do this
using fetchrow_array(), a method applicable to statement handles; this
returns the next record returned by a query, in the form of a Perl array
containing the fields in the order specified in the SELECT statement. In
this example, every time we call fetchrow_array() we will receive an array
containing the values of the next record in the results; the first element
of the array will be the "name" field, and the second will be the "addr"
field. When we run out of results, fetchrow_array() will return undef,
terminating the while() loop.


This isn't all we can do. We can use placeholders in query: for example:


-- BEGIN MONOSPACED LISTING --


    my $query = "SELECT name, addr FROM test1 WHERE (id = ?)";


    foreach ( 5 .. 9) {
        my $sth   = $dbh->execute($query);
        my @record = $sth->fetchrow_array();
        printf(" %12s  %64s\n", $record[0], $record[1]);
    }


-- END MONOSPACED LISTING --


(This is an [unusual] way of printing record with an id field in the range 5 ... 9.)


We can use the fetchrow_hashref() method to return records as a hash of
name/value tuples:


-- BEGIN MONOSPACED LISTING --


    my $query = "SELECT name, addr FROM test1 WHERE (id > 1)";
    my $sth   = $dbh->execute($query);


    print "We found ", $sth->rows(), " matching records for this query:\n";


    while (my $record = $sth->fetchrow_hashref()) {
        foreach $key (keys %$record) {
            print "$key => ", $record->{$key}, "\n";
        }
        print "\n", "-" x 20, "\n";
    } 


-- END MONOSPACED LISTING --
   
This prints out "fieldname => value" lines, one field per line, with records
separated by dashed lines. (Note that fetchrow_hashref() is considerably
slower than fetchrow_array.)


BOXOUT: Using split() and join() to extract data from strings


Perl has three core data types -- scalars, arrays, and hashes. (There are
several others, but these three are the ones commonly used for storing data
that user programs work on.) But sometimes it's useful to be able to convert
between them.


For example: suppose we're reading a file of data formatted with one record
to a line, and sub-fields within the records separated by commas. We can read
each record into a variable like this:


-- BEGIN MONOSPACED LISTING --


   while ($line = <>) {
      # $line contains the next line read from STDIN
   }


-- END MONOSPACED LISTING --


But how do we fillet out the individual fields within the record?


Enter split(). split() looks at a string stored in a scalar variable and 
returns a list of sub-elements, split on some regular expression. split()
without arguments splits the variable $_ on whitespace (/\s+/, as a Perl
regular expression); you can specify alternative regular expressions to
look for (such as /,\s*/ -- to split on a comma, optionally followed by
whitespace), and a different variable to look at. So we can read a record
in comma-separated format from STDIN and assign each field to a different
variable like this:


-- BEGIN MONOSPACED LISTING --


  while ($line = <>) {
      chomp $line;
      my ($name, $address1, $address2, $phone, $postcode) = split(/,\s*/, $line);
  }


-- END MONOSPACED LISTING --


(NB: chomp() scans a string and, if it ends with a newline, removes the
trailing newline. It's just a necessary convenience when dealing with data
read from a file.)


or: 


-- BEGIN MONOSPACED LISTING --


  while ($line = <>) {
      @fields = split(/,\s*/, $line);
  }


-- END MONOSPACED LISTING --


or (more obscurely):


-- BEGIN MONOSPACED LISTING --


  while (<>) {
     split(/,\s*/);
  }


-- END MONOSPACED LISTING --


(The last example implicitly reads from STDIN into $_, then calls split --
the results of which show up in @_. Don't do this in a program that someone
else might have to work with! It's confusing!)


Having separated a bunch of fields out, you can work with their contents. But
how do you turn them back into a single scalar? You can say something like:


-- BEGIN MONOSPACED LISTING --


   foreach $item (@fields) {
      $line .= $item . ", ";
   }


-- END MONOSPACED LISTING --


But this is inelegant and verbose. Perl supplies a command, join(), which does
the opposite of split: you feed join a separator-string and a list, and it
returns a scalar (string) consisting of all the elements in the list, strung
together and separated by the supplied separator string. For example:


-- BEGIN MONOSPACED LISTING --


   my @list   = qw(1 2 3 4 5 6);
   my $string = join("_", @list);
   print $string;


-- END MONOSPACED LISTING --


Which prints:


-- BEGIN MONOSPACED LISTING --


   1_2_3_4_5_6


-- END MONOSPACED LISTING --


If we're sure our comma-separated fields have nothing between them but commas,
we can therefore say "join(',', @fields)" and be reasonably certain that the
result will be a string consisting of the line we first read.


END BOXOUT (split() and join())


BOXOUT: Using the Perl source-level debugger


Perl has a source-level debugger built in, but it doesn't work the way you
might expect if you're used to debugging C or Pascal or other compiled
languages.


Recall that Perl first compiles your program to an internal parse tree,
then executes it. If you run your program with the -d (debug) flag,
Perl does two things. Firstly, it loads the module specified by the
environment variable PERL5DB (or perl's default debugger module --
typically perl5db.pl). Secondly, under the control of this module
it compiles your program and effectively evals it, one statement at a
time. Because of this two-step process, it's necessary to make sure that
your program is syntactically correct before you try debugging it --
run Perl over it first with the -cw flags for a syntax check.


The debugger itself is written in perl, and keeps its internal state
information in the namespace DB; you can get help on the available
debugger commands by typing h (or h h) at the debug prompt. For example:


-- BEGIN MONOSPACED LISTING --


bash-2.04$ perl -d test.pl
Default die handler restored.
 
Loading DB routines from perl5db.pl version 1.07
Editor support available.
 
Enter h or `h h' for help, or `man perldebug' for more help.
 
main::(test.pl:5):      my $user = "root";
  DB<1>


-- END MONOSPACED LISTING --


Note that the debugger is big! You can find documentation for it in the
perldebug manpage ("perldoc perldebug") and on the debugger internals (hooks
to allow you to modify it or write your own) in the perldebguts manpage.


When you run code under the perl debugger, perl first compiles your program
then halts before the first statement and displays the debug prompt. Anything
you type at the DB<n> prompt is executed as a perl statement -- but the 
debugger has a lot of built-in commands that can be used to do things to
your currently loaded program. The debugger has a built-in history
mechanism: the number in angle-brackets after the DB prompt is the number
of a command, and you can re-execute an earlier command by typing an
exclaimation mark followed by its number -- for example, !7 repeats 
command 7.


Probably the most important commands are "s" (step to the next instruction,
entering subroutines as the flow of execution reaches them), "n" (next 
line, steps over subroutines if there's one in the way), and "l" (which,
with various options, lists lines). You can also print the contents of a
variable currently in scope with "p <varname>", set breakpoints on a function
with the "b <function_name>" command, and add watch expressions (events to 
watch for and halt on). Type "v <pkg_name>" to display all variables in
the package called pkg_name; you can print the variables that match (or do
not match) a regular expression and are in pkg_name by specifying the pattern
after pkg_name -- for example "v IO::Socket ~my_obj" to print all variables
that resemble "my_obj" in the namespace IO::Socket.


When you tell your program to execute until it hits a breakpoint on a named
function, you can work out where the function was called from by asking for a
stack backtrace (the "T" command); this will tell you whether it was called
in a scalar, list or void context, and tell you where it was called from. 


The debugger can produce verbose output that scrolls off your screen or window
too fast to read. You can deal with this by prefixing your commands with a pipe
symbol -- for example, |h to ask for help -- which causes output to be piped
through your pager (usually "less").


The built-in perl debugger is not the only way of debugging perl source.
Because you can write your own hooks, there is an emacs perl mode that
provides support for the Perl debugger; if you use the emacs editor you
can debug perl using a menu-based environment.  In addition, there's the
Perl/Tk debugger -- from http://world.std.com/~aep/ptkdb/ -- which is
implemented in perl with the Tk graphics toolkit; run it on a machine with
Perl/Tk installed and running X11, and type "perl -d:ptkdb progname.pl"
and you'll get a much nicer interface to the core perl debugger functions.
Finally, the DDD data display debugger now supports Perl; this is probably
the nicest graphical high-level debugging environment available for
Linux, although its perl support is a little patchy. (You can get it
from http://www.gnu.org/software/ddd/.)


INCLUDE SCREENSHOT: perltkdb.gif


END BOXOUT (the Perl debugger)


BEGIN BOXOUT: More damn Perl books than you can wave a shitty stick at


The radio show "Desert Island Disks" has spent the past fifty years or
so interviewing celebrities and asking them to list a couple of books
and records that they'd want to take with them if they were cast away
on a desert island for some serious R&R time.


Well, along with the obligatory copies of the Complete Works of William
Shakespeare and the Bible, I'd have to take along the Reader's Digest
book of DIY Boat-building and Emergency Rescue Signalling; but in the
absence of rescue, and the presence of a solar-powered laptop, I think
the choice of recreational reading matter would be a no-brainer. Because
those nice people at O'Reilly and Associates have just published a new
edition of the one item that should be on every Perl programmers'
desk -- the Perl CD Bookshelf, Version 2.0 (pub. O'Reilly and Associates;
ISBN 0-596-00164-9, RRP �49.95).


The Perl CD Bookshelf consists of a CD-ROM and a paper copy of Perl in a 
Nutshell -- itself a fine publication that digests into rapidly accessible
form all the Perl commands, most of the commonly-used modules, and a brief
description of the language. But the CD rapidly becomes indispensible.
It contains (in boringly standard, accessible, cleanly laid-out HTML)
the complete contents of five of O'Reilly's most useful books: "Perl in a
Nutshell" (of course), "Programming Perl" (third edition, the latest), 
Tom Christiansen's excellent "Perl Cookbook", "Advanced Perl Programming", 
and (new with this edition) "Perl for System Administration".


Gone from the 2nd edition are the tutorial books, including Learning Perl for
Win32. This is a wise choice; the Perl CD Bookshelf was always aimed at 
working perl programmers, not newbies. "Perl for System Administration"
is an excellent addition for Linux users -- for example, if you want to
learn how to unpack a wtmp logfile, or build an automated user account
management system, this is the place you'd turn to. Perl is the best
tool there is for the sort of administrative task that used to be carried
out with shell scripts, and this book alone is vital to any budding Linux
administrator with a smattering of Perl. 


Then there's the "Perl Cookbook".  This doesn't focus on tasks (such as
the administrative tasks described in "Perl for System Administrators");
it focusses on techniques, providing lots of short recipes for
generally-applicable methods of mangling data.  Want to know how to use
random-access file I/O, or read lines from a file with continuation
characters? Want to know how to find all the elements in an array
that mach some set of criteria? This is the book to turn to.  Finally,
"Advanced Perl Programming" gives a detailed overview of some common
areas of, well, advanced Perl technique: writing modules, implementing
object persistence, writing networking applications, template- driven code
generation, and so on. 


These three books form a beautifully complementary set of interlocking
commentaries on aspects of Perl programming, while the inclusion of
"Programming Perl" and "Perl in a Nutshell" gives a detailed reference
to the language and its common libraries. You really can't afford to be
without a copy of the Perl CD Bookshelf if you do Perl for a living, and
Version 2.0 is worth the money to upgrade from Version 1.0. No kidding;
this is a real desert island disk for Perl programming.


END BOXOUT (More damn Perl books ...)