LINUX FORMAT PERL TUTORIAL Issue 19 Last month, we saw how to combine a report generator, file scanner, and command line parser to write a simple tool for scanning web server logfiles. But it didn't do much with the data! If we want to really analyse what's been going on, it helps to store the details in a relational database, which we can then query in various ways. Perl is at its strongest when it comes to throwing large quantities of data around, and being able to interact with databases is essential, so this month we're going to take a first look at the perl DBI -- the database interface. In addition, we'll take a closer look at how to pull useful information out of strings and see how to use the Perl debugger to make finding errors easier. HEADING: Database access using DBI One of the most useful Perl modules on CPAN is the DBI suite. DBI is Perl's database interface; using it, you can give Perl direct access to relational database servers running on the local (or a remote) machine. DBI provides us with a common API (application programming interface) for talking to relational databases, but the sad fact is that most databases do everything in their own inimitable way; even the dialect of SQL (structured query language, the tool used for expressing queries) that they provide will be different. Thus, DBI sits on top of a lower-level DBD module -- DBD is short for Database Driver -- which you (as the application programmer) don't need to see. There are different DBD modules for most of the databases that run on Linux, but as long as you go through DBI and stick to a common pidgin dialect of SQL, you don't need to be aware of their particular foibles. Supported databases include Oracle, Sybase, Informix, Adabas, MySQL, PostgreSQL, Ingres, DB2, Interbase, and a whole slew of others; there's also an ODBC driver that lets you talk to any ODBC-compliant database, and an LDAP driver that lets you talk to LDAP databases. You can even use DVD as a front end to query XBase files (as used by dBase and Foxbase). Note that there is a book on DBI that is an absolutely vital addition to your shelf if you have anything to do with database programming and Perl; "Programming the Perl DVI" by Alligator Descartes and Tim Bunce (pub. O'Reilly & Associates; ISBN 1-56592-699-4). As Alligator and Tim more or less invented the DBI specification between them and wrote most of the DBD drivers, they know what they're talking about. Note that this book will not teach you about the theory of relational databases -- for that you need something like "Introduction to Database Systems" by C. J. Date (pub. Addison-Wesley, ISBN 0-201-82458-2). If you aren't familiar with the relational model, you will need to do some reading; relational databases do not resemble other information storage tools, and you can get into some nasty messes if you don't have a basic grasp of the key ideas. Finally, you'll also need some documentation for the database server you're working on. MySQL comes with a copious manual; Oracle comes with about 250Mb of PDFs of manuals, and in both cases O'Reilly and Associates have books that provide some coverage of those database servers. To get started using DBI, it helps to have a local database server up and running. In a fit of shameless cowardice, I am going to assume that your Linux system is already running MySQL -- specifically MySQL version 3.23.39 -- and you've followed the instructions under "Post-installation setup and testing" in the INSTALL-SOURCE file in the MySQL distribution, specifically to start up the server and create a username and password. Warning: using a SQL database server without assigning access privileges and sensible passwords is a major security risk -- they're designed to be accessible via TCP/IP, so strangers may be able to connect to your database and examine or corrupt it. (Second warning: if you think the megabytes of documentation for MySQL look complicated and intimidating, you obviously haven't seen Oracle yet; there is a reason people are paid a decent living to do database administration for the rest of us!) To install DBI and the MySQL DBD drivers, your best bet is to use the CPAN module to install Bundle::Mysql from CPAN: perl -MCPAN -e "install Bundle::Mysql;" DBI has its own rather unusual way of working. The first step in talking to a database is to obtain a "database handle" -- a reference to an object that DBI uses as a container for various information about a connection to the database engine. Once you've got a database handle, you can issue commands to it (for example, to execute a SQL statement); these in turn return a variety of things -- including arrays of results, or statement handles (which are in turn references to a special type of DBI object that allows you to scan through the results of a query without loading the whole lot into an array in working memory). You can get a handle on a MySQL database like this: -- BEGIN MONOSPACED LISTING -- use DBI; my $user = "mysql"; my $password = "some_password"; my $dbh = DBI->connect("dbi:mysql:database=test;host=localhost;port=3306", $user, $password); -- END MONOSPACED LISTING -- (Note that $user and $password must be a valid username/password pair that your MySQL database has been configured to recognize. The connect string that precedes them is specific to MySQL; syntax may vary for other databases.) The value you get back -- $dbh -- is a database handle if everything is successful, or an error code (if not). There's no law saying that you have to connect to only one MySQL database at a time; in fact, you can use DBI->connect() with different connect strings to open handles on different types of database engine (or on databases on remote servers) at the same time in order to exchange data between them. MySQL (and other RDBMSs) store collections of data in databases; each database may contain one or more tables. A table is a collection of records with the same structure; conceptually, we may talk about rows (individual records) and columns (fields in the record structure). We define the structure of a database, and insert, update, create, and select data from the database, using SQL (structured query language), which in turn is comprised of a data definition language and a data manipulation language. What follows won't make much sense to you unless you have a smattering of SQL ... Having used DBI to connect to a database called "test", we can create a table in "test" (let's call it test1) like this: -- BEGIN MONOSPACED LISTING -- my $new_table_name = "test1"; if (grep(/$new_table_name/, $dbh->tables())) { print "a table called $new_table_name already exists ", "in the current database\n"; } else { my $stmt = "CREATE TABLE test1 ( id INTEGER NOT NULL, name VARCHAR (32), addr VARCHAR (128) )"; my $rows = $dbh->do($stmt); print "Succeeded: \n$stmt\n" unless ($rows == 0); } -- END MONOSPACED LISTING -- This snippet of code demonstrates a couple of features of DBI. Firstly, the database handle object supplies us with some useful utility methods, such as tables() -- this returns an array of the table names in the current database (which we check, using grep(), to ensure that we haven't already created a table called $new_table_name). Secondly, we embed a SQL data definition language statement (CREATE TABLE ...) in a Perl scalar and feed it to the database handle via the do() method. do() nominally returns the number of rows affected by a statement; in this case, it will probably return -1 (no rows affected) or the string "0E0" (which also means "no rows affected but value is non-false"). A return value of 0 means the statement failed. If we execute the above chunk of code twice, it will only create 'test1' once -- thereafter it will whine that this table already exists. (And a good thing too.) Inserting new records into a table is relatively easy; the do() method is fine for smallish batches of data: -- BEGIN MONOSPACED LISTING -- foreach $record (@array_of_records) { my $insert_stmt = "INSERT INTO test1 (id, name, addr) " . "VALUES ( " . join(", ", ( $record->{index_number} , $record->{name}, $record->{address} )) . ")"; if ( ! $dbh->do($insert_stmt) ) { print "error inserting:\n", $insert_stmt, "\n"; } -- END MONOSPACED LISTING -- Here we're adding an array of Perl records to a table in a database. Each record is a hashref (a reference to a hash with fields called index_number, name, and address). For each record, we glom together a SQL INSERT statement, then we call do() on it and test for a false return value. This is readable, but not the fastest way of doing things if we have a lot of data to inject into a database. For that, we use the prepare/execute mechanism: we prepare a SQL statement containing placeholders for our data values, then repeatedly execute it: -- BEGIN MONOSPACED LISTING -- my $insert_template = "INSERT INTO test1 (id, name, addr) VALUES (?, ?, ?)"; my $statement_handle = $dbh->prepare($insert_template); foreach $record (@big_array_of_records) { $sth->execute($record->{index_number}, $record->{name}, $record->{address}) or die $dbh->errstr(); } -- END MONOSPACED LISTING -- We pass a SQL template to the prepare() method, with question marks in place of actual values that we're going to insert; prepare() returns an object called a statement handle, which is a subclass of a database handle that is ready and waiting to execute that SQL statement as soon as it's supplied with some data. Every time we call execute(), its parameters are slotted in in place of the question marks in th template, and the INSERT statement is executed. If we get a result of FALSE we die, emitting the results of errstr() (the error string returned from the database via DBI). We can use statement handles to speed up most non-SELECT statements, such as UPDATEs (which modify an existing record) or DELETEs (which delete them). SELECT is a special case; it's used by SQL to search the database for data, and it can return a huge amount of information. When we call SELECT, a SQL database returns a view of those records that match the SQL query in the WHERE clause of the SELECT statement. This is potentially enormous! So rather than return an array of records, the Perl DBI returns a statement handle; we then use the statement handle to return each record in the results, one at a time, a bit like reading lines from a file. For example: -- BEGIN MONOSPACED LISTING -- my $query = "SELECT name, addr FROM test1 WHERE (id > 1)"; my $sth = $dbh->execute($query); while (my @record = $sth->fetchrow_array()) { printf(" %12s %64s\n", $record[0], $record[1]); } -- END MONOSPACED LISTING -- The execute() statement returns a statement handle. Once we've got it, we can examine the results of our query one record at a time. We do this using fetchrow_array(), a method applicable to statement handles; this returns the next record returned by a query, in the form of a Perl array containing the fields in the order specified in the SELECT statement. In this example, every time we call fetchrow_array() we will receive an array containing the values of the next record in the results; the first element of the array will be the "name" field, and the second will be the "addr" field. When we run out of results, fetchrow_array() will return undef, terminating the while() loop. This isn't all we can do. We can use placeholders in query: for example: -- BEGIN MONOSPACED LISTING -- my $query = "SELECT name, addr FROM test1 WHERE (id = ?)"; foreach ( 5 .. 9) { my $sth = $dbh->execute($query); my @record = $sth->fetchrow_array(); printf(" %12s %64s\n", $record[0], $record[1]); } -- END MONOSPACED LISTING -- (This is an [unusual] way of printing record with an id field in the range 5 ... 9.) We can use the fetchrow_hashref() method to return records as a hash of name/value tuples: -- BEGIN MONOSPACED LISTING -- my $query = "SELECT name, addr FROM test1 WHERE (id > 1)"; my $sth = $dbh->execute($query); print "We found ", $sth->rows(), " matching records for this query:\n"; while (my $record = $sth->fetchrow_hashref()) { foreach $key (keys %$record) { print "$key => ", $record->{$key}, "\n"; } print "\n", "-" x 20, "\n"; } -- END MONOSPACED LISTING -- This prints out "fieldname => value" lines, one field per line, with records separated by dashed lines. (Note that fetchrow_hashref() is considerably slower than fetchrow_array.) BOXOUT: Using split() and join() to extract data from strings Perl has three core data types -- scalars, arrays, and hashes. (There are several others, but these three are the ones commonly used for storing data that user programs work on.) But sometimes it's useful to be able to convert between them. For example: suppose we're reading a file of data formatted with one record to a line, and sub-fields within the records separated by commas. We can read each record into a variable like this: -- BEGIN MONOSPACED LISTING -- while ($line = <>) { # $line contains the next line read from STDIN } -- END MONOSPACED LISTING -- But how do we fillet out the individual fields within the record? Enter split(). split() looks at a string stored in a scalar variable and returns a list of sub-elements, split on some regular expression. split() without arguments splits the variable $_ on whitespace (/\s+/, as a Perl regular expression); you can specify alternative regular expressions to look for (such as /,\s*/ -- to split on a comma, optionally followed by whitespace), and a different variable to look at. So we can read a record in comma-separated format from STDIN and assign each field to a different variable like this: -- BEGIN MONOSPACED LISTING -- while ($line = <>) { chomp $line; my ($name, $address1, $address2, $phone, $postcode) = split(/,\s*/, $line); } -- END MONOSPACED LISTING -- (NB: chomp() scans a string and, if it ends with a newline, removes the trailing newline. It's just a necessary convenience when dealing with data read from a file.) or: -- BEGIN MONOSPACED LISTING -- while ($line = <>) { @fields = split(/,\s*/, $line); } -- END MONOSPACED LISTING -- or (more obscurely): -- BEGIN MONOSPACED LISTING -- while (<>) { split(/,\s*/); } -- END MONOSPACED LISTING -- (The last example implicitly reads from STDIN into $_, then calls split -- the results of which show up in @_. Don't do this in a program that someone else might have to work with! It's confusing!) Having separated a bunch of fields out, you can work with their contents. But how do you turn them back into a single scalar? You can say something like: -- BEGIN MONOSPACED LISTING -- foreach $item (@fields) { $line .= $item . ", "; } -- END MONOSPACED LISTING -- But this is inelegant and verbose. Perl supplies a command, join(), which does the opposite of split: you feed join a separator-string and a list, and it returns a scalar (string) consisting of all the elements in the list, strung together and separated by the supplied separator string. For example: -- BEGIN MONOSPACED LISTING -- my @list = qw(1 2 3 4 5 6); my $string = join("_", @list); print $string; -- END MONOSPACED LISTING -- Which prints: -- BEGIN MONOSPACED LISTING -- 1_2_3_4_5_6 -- END MONOSPACED LISTING -- If we're sure our comma-separated fields have nothing between them but commas, we can therefore say "join(',', @fields)" and be reasonably certain that the result will be a string consisting of the line we first read. END BOXOUT (split() and join()) BOXOUT: Using the Perl source-level debugger Perl has a source-level debugger built in, but it doesn't work the way you might expect if you're used to debugging C or Pascal or other compiled languages. Recall that Perl first compiles your program to an internal parse tree, then executes it. If you run your program with the -d (debug) flag, Perl does two things. Firstly, it loads the module specified by the environment variable PERL5DB (or perl's default debugger module -- typically perl5db.pl). Secondly, under the control of this module it compiles your program and effectively evals it, one statement at a time. Because of this two-step process, it's necessary to make sure that your program is syntactically correct before you try debugging it -- run Perl over it first with the -cw flags for a syntax check. The debugger itself is written in perl, and keeps its internal state information in the namespace DB; you can get help on the available debugger commands by typing h (or h h) at the debug prompt. For example: -- BEGIN MONOSPACED LISTING -- bash-2.04$ perl -d test.pl Default die handler restored. Loading DB routines from perl5db.pl version 1.07 Editor support available. Enter h or `h h' for help, or `man perldebug' for more help. main::(test.pl:5): my $user = "root"; DB<1> -- END MONOSPACED LISTING -- Note that the debugger is big! You can find documentation for it in the perldebug manpage ("perldoc perldebug") and on the debugger internals (hooks to allow you to modify it or write your own) in the perldebguts manpage. When you run code under the perl debugger, perl first compiles your program then halts before the first statement and displays the debug prompt. Anything you type at the DB prompt is executed as a perl statement -- but the debugger has a lot of built-in commands that can be used to do things to your currently loaded program. The debugger has a built-in history mechanism: the number in angle-brackets after the DB prompt is the number of a command, and you can re-execute an earlier command by typing an exclaimation mark followed by its number -- for example, !7 repeats command 7. Probably the most important commands are "s" (step to the next instruction, entering subroutines as the flow of execution reaches them), "n" (next line, steps over subroutines if there's one in the way), and "l" (which, with various options, lists lines). You can also print the contents of a variable currently in scope with "p ", set breakpoints on a function with the "b " command, and add watch expressions (events to watch for and halt on). Type "v " to display all variables in the package called pkg_name; you can print the variables that match (or do not match) a regular expression and are in pkg_name by specifying the pattern after pkg_name -- for example "v IO::Socket ~my_obj" to print all variables that resemble "my_obj" in the namespace IO::Socket. When you tell your program to execute until it hits a breakpoint on a named function, you can work out where the function was called from by asking for a stack backtrace (the "T" command); this will tell you whether it was called in a scalar, list or void context, and tell you where it was called from. The debugger can produce verbose output that scrolls off your screen or window too fast to read. You can deal with this by prefixing your commands with a pipe symbol -- for example, |h to ask for help -- which causes output to be piped through your pager (usually "less"). The built-in perl debugger is not the only way of debugging perl source. Because you can write your own hooks, there is an emacs perl mode that provides support for the Perl debugger; if you use the emacs editor you can debug perl using a menu-based environment. In addition, there's the Perl/Tk debugger -- from http://world.std.com/~aep/ptkdb/ -- which is implemented in perl with the Tk graphics toolkit; run it on a machine with Perl/Tk installed and running X11, and type "perl -d:ptkdb progname.pl" and you'll get a much nicer interface to the core perl debugger functions. Finally, the DDD data display debugger now supports Perl; this is probably the nicest graphical high-level debugging environment available for Linux, although its perl support is a little patchy. (You can get it from http://www.gnu.org/software/ddd/.) INCLUDE SCREENSHOT: perltkdb.gif END BOXOUT (the Perl debugger) BEGIN BOXOUT: More damn Perl books than you can wave a shitty stick at The radio show "Desert Island Disks" has spent the past fifty years or so interviewing celebrities and asking them to list a couple of books and records that they'd want to take with them if they were cast away on a desert island for some serious R&R time. Well, along with the obligatory copies of the Complete Works of William Shakespeare and the Bible, I'd have to take along the Reader's Digest book of DIY Boat-building and Emergency Rescue Signalling; but in the absence of rescue, and the presence of a solar-powered laptop, I think the choice of recreational reading matter would be a no-brainer. Because those nice people at O'Reilly and Associates have just published a new edition of the one item that should be on every Perl programmers' desk -- the Perl CD Bookshelf, Version 2.0 (pub. O'Reilly and Associates; ISBN 0-596-00164-9, RRP £49.95). The Perl CD Bookshelf consists of a CD-ROM and a paper copy of Perl in a Nutshell -- itself a fine publication that digests into rapidly accessible form all the Perl commands, most of the commonly-used modules, and a brief description of the language. But the CD rapidly becomes indispensible. It contains (in boringly standard, accessible, cleanly laid-out HTML) the complete contents of five of O'Reilly's most useful books: "Perl in a Nutshell" (of course), "Programming Perl" (third edition, the latest), Tom Christiansen's excellent "Perl Cookbook", "Advanced Perl Programming", and (new with this edition) "Perl for System Administration". Gone from the 2nd edition are the tutorial books, including Learning Perl for Win32. This is a wise choice; the Perl CD Bookshelf was always aimed at working perl programmers, not newbies. "Perl for System Administration" is an excellent addition for Linux users -- for example, if you want to learn how to unpack a wtmp logfile, or build an automated user account management system, this is the place you'd turn to. Perl is the best tool there is for the sort of administrative task that used to be carried out with shell scripts, and this book alone is vital to any budding Linux administrator with a smattering of Perl. Then there's the "Perl Cookbook". This doesn't focus on tasks (such as the administrative tasks described in "Perl for System Administrators"); it focusses on techniques, providing lots of short recipes for generally-applicable methods of mangling data. Want to know how to use random-access file I/O, or read lines from a file with continuation characters? Want to know how to find all the elements in an array that mach some set of criteria? This is the book to turn to. Finally, "Advanced Perl Programming" gives a detailed overview of some common areas of, well, advanced Perl technique: writing modules, implementing object persistence, writing networking applications, template- driven code generation, and so on. These three books form a beautifully complementary set of interlocking commentaries on aspects of Perl programming, while the inclusion of "Programming Perl" and "Perl in a Nutshell" gives a detailed reference to the language and its common libraries. You really can't afford to be without a copy of the Perl CD Bookshelf if you do Perl for a living, and Version 2.0 is worth the money to upgrade from Version 1.0. No kidding; this is a real desert island disk for Perl programming. END BOXOUT (More damn Perl books ...)