Linux Format 22 Perl tutorial [[ TYPOGRAPHY/LAYOUT -- text surrounded by _underscore_ characters like so should be italicized or emphasized. Text indented from the margin by two or more characters is a program listing: needs monospaced typeface, with indentation and word wrap preserved. Contact me if it needs changing to fit the page! If you want a boxout, the section "tie() and objects" can be turned into one quite easily, but I'll need to tweak the first sentence of the following section, "tied databases and Berkeley DB". ]]

HEADING: Making Data persist: DBM files, tie(), and Data::Dumper

What do you do when you want to store some data -- say, a hash of values -- from one execution of a perl script to the next? This isn't an unusual requirement; it's particularly common in the world of web development, as CGI scripts need to give an appearance of continuity while actually being run to completion every time a user clicks on a link. It's also useful when dealing with a variety of discontinuous jobs that occur regularly -- for example, logfile analysis.

Although the DBI (database interface) lets you do this, it's overkill for most purposes; you don't want to have to fire up a relational database engine such as PostgreSQL or Oracle just to find some data you checkpointed last time you ran a CGI script. Perl gives you several ways of saving data, some built-in and others in the form of external modules.

Perhaps the most obvious, and simplest -- at first glance -- is to use a straight text file, calling open() to create it and return a filehandle, print() to save data in it, and close() to close it -- then, in a later visit, opening it and reading its contents. For example, to save and restore a simple hash:

  sub save_hash {
      # parameters: a reference to a simple hash
      # returns: name of file that hash is saved in, or undef if it failed
      my $hashref = shift;
      if (!defined $hashref) {
          return undef;   # no hash to save!
      }
      # generate per-process unique filename
      my $filename = "/tmp/" . $$ . "." . time();
      if (!open(OUT, ">$filename")) {
          print "Could not open $filename for write: $!\n";
          return undef;
      }
      while (my ($k, $v) = each %$hashref) {
          print OUT $k, "%%", $v, "\n";
      }
      close OUT;
      return $filename;
  }

And to reload the hash:

  sub reload_hash {
      # parameters: a temporary filename created by save_hash()
      # returns: a reference to a simple hash
      my $filename = shift;
      if (! -r $filename) {
          return undef;   # no file to load!
      }
      if (!open(IN, "<$filename")) {
          print "Could not open $filename for read: $!\n";
          return undef;
      }
      my %my_hash = ();
      while (my $line = <IN>) {
          chomp $line;
          my ($k, $v) = split(/%%/, $line);
          $my_hash{$k} = $v;
      }
      close IN;
      return \%my_hash;
  }

This mechanism actually works -- it lets you save the contents of a hash into a temporary file, and reload it into your perl program at a later time. So what's wrong with it? The answers are numerous. (To put it bluntly, if you're managing a Perl programmer and they deliver a gem like this, you need to evaluate the quality of the code they're producing and consider sending them on some remedial training courses.) Let's take it from the top ...

Firstly, these routines will fail horribly if we try to feed them data that isn't a simple hash with scalar values -- if, for example, one of the values is a reference to another object. They'll also fail if one or more of the values contains the string "%%" (which we use as a field separator) or a newline character "\n" (which we use as a record separator). So the first failing of this mechanism is that it is inflexible -- it can only work effectively on certain very limited types of data -- and it doesn't fail gracefully.

A second failing is that the temporary filename is part of the metainformation that needs to be saved. Here, we've created a file in the directory /tmp by combining the process ID of the perl script (special variable $$) with the current timestamp.
On a given computer, this is a good guarantee of a unique filename -- we could even make it network-unique by including, for example, the host's IP address. But how do we get this temporary filename from one process (the one that creates the temporary file) into another? Yes, we need to save the filename somewhere, too. But where? And how do we ensure that somebody doesn't fiddle the date stamp, or otherwise overwrite the file?

A third failing is that this code makes no arrangements for what to do if the filesystem fills up or some other problem arises while writing data. And a fourth failing is that this doesn't help us in any other way, by saving memory or system resources -- all it does is dump some data into a temporary file for future retrieval.

There are probably even more things wrong with this approach than I've outlined, but I'm not going to go there: the point I'm trying to make is that there's more to making data persist between two runs of a program than simply splatting it into a text file and reading it back in. Perl comes with two general mechanisms for saving data or state information in the filesystem in a way that makes it accessible to other processes, and both are better than the one above (which is the one novices usually start by exploring). These mechanisms are, respectively, using tie() to stash data in a database file, and using a module like Data::Dumper or FreezeThaw to digest complex data into a form that can safely be saved in a text file (and subsequently reconstituted without the record/field delimiter problems above).

SUBTITLE: All tied up in DBM

tie() appeared in Perl 5 as a replacement for dbmopen(), so I'll explain dbmopen() first. However, tie() is far more than just a front-end to data storage; we'll explore its uses below.

DBM, the database manager, is not a relational database: it's a simple hash-table database library that C programs can call to save data in.
DBM files consist of records stored in a hash table (not a Perl hash, but the classic data structure of the same name). By calling dbmopen(%myhash, "database", 0600), your Perl program can open a DBM file called "database" with file permissions 0600 (owner has read/write permissions, nobody else can access it), and associate it with a magical hash called %myhash. Thereafter, if you assign some value to %myhash, it is not actually soaking up memory in the perl interpreter's stack space; it's being written into the "database" DBM file. Your program can call dbmclose() when it's done: "database" will be closed, all data is flushed to disk, and %myhash magically vanishes.

If we want to use dbmopen and dbmclose, we can: they're a front-end for tie() (see below), but still work as advertised. We can use them to re-write our simple hash dumping routines above:

  sub save_hash {
      # parameters: a reference to a simple hash
      # returns: name of file that hash is saved in, or undef if it failed
      my $hashref = shift;
      # generate per-process unique filename for DBM database
      my $filename = "/tmp/" . $$ . "." . time();
      dbmopen(%my_dbm, $filename, 0600) or do {
          print "dbmopen(\%my_dbm, $filename, 0600) failed: $!\n";
          return undef;
      };
      %my_dbm = %$hashref;
      dbmclose(%my_dbm);
      return $filename;
  }

  sub reload_hash {
      # parameters: a temporary filename created by save_hash()
      # returns: a reference to a simple hash
      my $filename = shift;
      if (!dbmopen(%my_dbm, $filename, undef)) {
          print "dbmopen(\%my_dbm, $filename, undef) failed, does $filename ",
                "exist?\nReported error code was: $!\n";
          return undef;
      }
      my %my_hash = %my_dbm;
      dbmclose(%my_dbm);
      return \%my_hash;
  }

This is a fair bit better than the simple flat-file save of our hash; while it won't cope with complex data structures, it will save and reload a hash correctly, without worries about record or field separators.
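Condensed to its essentials, the same DBM round trip can be tried out in a few lines. This is a sketch: the /tmp filename and the hash contents are invented for illustration.

```perl
# A minimal, self-contained sketch of the DBM round trip.
# The filename and data are invented for illustration.
my $file = "/tmp/dbmdemo." . $$;

my %db;
dbmopen(%db, $file, 0600) or die "dbmopen failed: $!";
%db = (colour => 'blue', size => 42);    # written straight to disk
dbmclose(%db);

# Later -- possibly from a different process -- reopen and read it back:
my %copy;
dbmopen(%copy, $file, 0600) or die "dbmopen failed: $!";
print "$copy{colour}\n";                 # prints "blue"
dbmclose(%copy);

unlink glob "$file*";                    # tidy up the DBM file(s)
```

Note that dbmopen() takes the hash first, then the database name, then the file mode; the mode is only used if the file has to be created.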
More to the point, we don't need these routines at all if we know in advance that we want our hash to persist; we can just call dbmopen() at the start of our program, put a call to dbmclose() in an END{} block (executed whenever we call exit() or otherwise leave the program normally), and rely on it working. A big advantage of this approach -- using DBM files to store hashes -- is that the data is kept in the filesystem; we can work with gigantic hashes, containing gigabytes of data, if they're bound to an underlying DBM file. By using each() to iterate over the elements of such a magical hash, we can avoid using lots of memory while applying operations to every value in the hash. And we get the benefits of keyed access to data.

There are still some problems with this approach, though. The original DBM library and its relatives have some limitations; for example, a record length limit of around 4096 bytes (using the original DBM or NDBM) or around 1024 bytes (using SDBM). And dbmopen() and dbmclose() are crude and inelegant language operators. Which is why Perl 5 brought us the tie() mechanism.

SUBTITLE: tie() and objects

Where dbmopen() effectively associates a variable (which can only be a hash) with an underlying DBM database file, tie() provides a more general mechanism. tie() associates a variable (which can be a scalar, array, or hash) with an underlying Perl class, which supplies methods for saving or retrieving data. These methods (which have fixed names, such as FETCH, STORE, DESTROY, EXISTS, or TIEHASH/TIEARRAY/TIESCALAR) are triggered whenever you try to assign to, or read from, the tied variable. For example, if you tie a variable to a class called Foo, then assigning the string "fred" to the variable triggers a call to Foo::STORE with the parameter "fred". (The gory details can be found in Chapter 14 of "Programming Perl", 3rd edition.) You can use tie() yourself, to tie weird and wonderful behaviour to scalars, arrays, and hashes.
Here's a minimal use of tie() to create a variable that always contains the current date and time in a human-readable form:

  package ltime;
  # ltime is a tie()'able class that provides a human-readable interface
  # to time()
  sub TIESCALAR {
      return bless {};
  }
  sub FETCH {
      return scalar(localtime(time()));
  }
  sub STORE {
      return undef;   # this is a no-op
  }
  sub DESTROY {
      return undef;   # again, this is a no-op
  }
  #--- end of class ltime and beginning of main program
  package main;

  tie($x, 'ltime');
  for ($i = 1; $i < 10; $i++) {
      print "ltime is $x\n";
      sleep 1;
  }

What we're doing here is defining a class, called ltime. (The class definition is implicitly ended when we declare package main, which brings us back into the namespace of the main program.) Within ltime we create four methods -- TIESCALAR, FETCH, STORE, and DESTROY. In this example, STORE and DESTROY are no-ops, and TIESCALAR does only the bare minimum. The idea of STORE is that if we're tieing to something that can be updated, we can create a STORE method that updates the underlying object (we could in principle put code here that would update the system clock when we assign to a tied ltime object). DESTROY is called to destroy the tied object whenever we untie it or otherwise exit from the program. TIESCALAR is called as an instance-creator method when we first call tie(); the reference it returns is the underlying object, which we can fidget with without going through FETCH or STORE. None of which is relevant to this simple example; the only interesting method here is FETCH, which returns a neatly-formatted time stamp whenever we read the value of a scalar which has been tied to class ltime. The code in package main is a driver for ltime: it repeatedly prints the value of $x, which is tied to an ltime object which returns the formatted current time.

One important note that isn't obvious from context: tie() associates a class of object with a variable, but it doesn't automatically call 'require' or 'use' to load the class's package.
In this example, package ltime is defined in the same file as the test program, so it's already present. If you save ltime in a separate module, though, you'll need to call 'use ltime;' before calling tie() with ltime as a class.

If you're wondering how useful the ltime package is, consider this: you can't interpolate a subroutine call into a string, but you can interpolate a variable:

  print "scalar(localtime(time()))\n";

results in:

  scalar(localtime(time()))

as output. But if we tie $x to the class ltime:

  print "$x\n";

produces something like:

  Wed Sep 26 13:20:41 2001

So we can use tie() where we want to be able to interpolate the output from a subroutine into the middle of strings (or other things we can interpolate into). And this isn't limited to scalar ties; we can do the same with tied arrays or hashes. (We can also tie filehandles, but filehandles don't interpolate. Anyone who asks about tieing globs will be politely ignored, as globs are going away for good in Perl 6 ...)

When building a tied scalar class, you can omit some of the methods defined above by using the Tie::Scalar or Tie::StdScalar classes that come with the standard perl distribution. Tie::Scalar provides standard default methods for FETCH, STORE, and so on; you can declare your new tieable class to be a child of Tie::Scalar and rely on the parent for the basic method calls. Tie::StdScalar goes further, with default methods that make an object of the class behave just like a standard scalar. We can use it to re-write ltime like this:

  package ltime;
  require Tie::Scalar;   # Tie::StdScalar lives in the Tie::Scalar module
  @ISA = ('Tie::StdScalar');
  sub FETCH {
      return scalar(localtime(time()));
  }
  1;   # end of class ltime

This will work in place of the previous version of ltime -- it just falls back to Tie::StdScalar for all the undefined methods. In addition to tying scalars, you can tie arrays.
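As a taste of what's coming, here's a sketch of a tied array that upper-cases everything assigned to it, leaning on the standard Tie::StdArray helper class (which lives in the Tie::Array module) for everything except STORE. The class name is invented for illustration:

```perl
package UpcaseArray;
use Tie::Array;                  # supplies the Tie::StdArray base class
our @ISA = ('Tie::StdArray');

# Override STORE alone; TIEARRAY, FETCH, FETCHSIZE and friends are inherited.
sub STORE {
    my ($self, $index, $value) = @_;
    $self->SUPER::STORE($index, uc $value);
}

package main;

tie(my @shout, 'UpcaseArray');
$shout[0] = 'hello';
$shout[1] = 'world';
print "$shout[0] $shout[1]\n";   # prints "HELLO WORLD"
```

Each plain assignment to an element goes through our STORE; everything else behaves like an ordinary array.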
Arrays require more methods for access than scalars; you need to be able to specify offsets into the array, use STORESIZE and FETCHSIZE to reset or return the number of elements in the array, and implement the POP, PUSH, SHIFT, UNSHIFT, SPLICE, DELETE, and EXISTS methods if you want the Perl functions of the same names to work. However, as with Tie::StdScalar, there's a Tie::StdArray class that defines a whole bunch of default methods. A plain instance of the Tie::StdArray class behaves like, well, an array: you can override it selectively by defining only those methods you need to treat in a non-standard manner.

Tied hashes go one step further. To define a tied hash, you need to provide eight methods: the TIEHASH constructor, FETCH and STORE (to access key/value pairs), EXISTS (to see if a key is present in the hash), DELETE (to remove a key and its associated value), CLEAR (to empty the hash by deleting all its key/value pairs), and FIRSTKEY/NEXTKEY (which are used to iterate over the key/value pairs when you call keys, values, or each). A full example of a tied hash is a bit large to fit in this column; however, there's an excellent worked example in Chapter 14 of "Programming Perl" that demonstrates how to use a tied hash to view and edit the contents of a user's dot files (from their home directory). The keys in the hash are the names of dotfiles (without the leading dot); when you supply a dotfile name (for example "bashrc") you get back its contents.

Similarly, the CPAN module Tie::EncryptedHash provides a neat mechanism for encrypting data: tie a hash to Tie::EncryptedHash, and anything you store in it will be encrypted. The special field with key "__password" contains a password (set when the tie() operation was carried out); delete it, and thereafter you will only be able to see the encrypted values in the hash, until you write the password back into the __password field.
You can tie Tie::EncryptedHash on top of an underlying DBM file if you want to save your data in an encrypted format; writes to the underlying DBM file go through the EncryptedHash level, so that data is encrypted on its way out, and you can decrypt the contents when you read them back (with the appropriate password).

Filehandles can also be tied: you can attach a filehandle to a variable (for example, so that writes to STDOUT are buffered up in a scalar), or to a non-standard way of storing data (for example, so that writes to and reads from a filehandle actually go to a shared memory segment accessible to another program, or to a socket or pair of pipes). While it's non-trivial to set up a class that ties a filehandle onto an IPC mechanism like sockets, the end result can, in principle, make tasks like socket programming easy for relative novices.

SUBHEADING: Tied databases and Berkeley DB

Which brings us back round to dbmopen(), which we can now see is simply a tied hash where the underlying class calls an external database library (using the XS extension mechanism to provide a Perl interface to a C library such as DBM). The standard Perl distribution provides tie-able classes that know how to talk to a number of different DBM libraries, including the original DBM, GDBM (GNU dbm), SDBM, and the now-common Berkeley DB library. Berkeley DB is probably the one you want to use for most purposes: it's fast, efficient, can store huge amounts of data and provide instant access to it, and doesn't impose arbitrary limits on the size of records.

It needs to be stressed again that DBM files are _not_ relational database engines; they don't understand SQL, can't store complex tables or perform relational operations, and in some cases don't even supply record locking.
These are _simple_ database systems, designed for those situations where you want to stash some configuration information in a binary file format and retrieve it quickly later on, and where the structure of the data isn't going to change from one session to another. As the manual page (db_intro) says:

"The DB library is a family of groups of functions that provides a modular programming interface to transactions and record-oriented file access. The library includes support for transactions, locking, logging and file page caching, as well as various indexed access methods. The DB library does not provide user interfaces, data entry GUI's, SQL support or any of the other standard user-level database interfaces. What it does provide are the programmatic building blocks that allow you to easily embed database-style functionality and support into other objects or interfaces."

(Full documentation and the latest version of DB is available from Sleepycat Software, www.sleepycat.com.)

The DB_File class provides a front-end to Berkeley DB that lets you create tied hashes or arrays; the hash class can sit on top of an underlying B-tree database, where the keys are stored in sorted lexical order, or on top of a keyed hash database. The array class (selected by passing $DB_RECNO when you call tie()) is accessed via an array, and can be used for manipulating fixed or variable-length text records; it works like the hash mechanism, but the key for any given record is its array subscript. In use, after calling tie() you can use a hash tied onto DB_File just like any other hash -- but it won't suck up memory as you write to it; instead, its contents are stored in the DB database on disk.

One really nice feature of DB is that you can customize it heavily from within Perl; for example, you can specify your own sort routines to change the B-tree sort order (when storing data in a tied B-tree database).
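Here's a sketch of that trick: a B-tree database whose keys compare case-insensitively. This assumes DB_File (and the Berkeley DB library it wraps) is installed; the filename and data are invented for illustration.

```perl
use strict;
use DB_File;
use Fcntl;    # for O_CREAT and O_RDWR

# Install a custom comparison routine before tying: the B-tree will then
# keep its keys in case-insensitive sorted order.
$DB_BTREE->{'compare'} = sub { lc $_[0] cmp lc $_[1] };

my $file = "/tmp/btreedemo." . $$;
tie(my %h, 'DB_File', $file, O_CREAT | O_RDWR, 0666, $DB_BTREE)
    or die "tie failed: $!";

%h = (Wall => 1, angerstein => 2, Schwartz => 3);
print join(" ", keys %h), "\n";    # prints "angerstein Schwartz Wall"

untie %h;
unlink $file;
```

Because keys() on a tied DB_File B-tree walks the tree in its comparison order, the keys come back sorted without any extra work on our part.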
You can also tell DB to handle multiple records with the same key by setting the R_DUP flag on a B-tree database; this can give rise to hashes which appear to have multiple identical keys with different values. We get round this in practice because B-tree databases in DB have a special seq() function that accesses records in absolute sequence; this breaks the tied hash model slightly, but lets us do really cool things. For example, the get_dup() method relies on it to return an array (or hash) of those records with duplicate keys, so that we can apply del_dup() to fix them (or do something else appropriate).

DB_File and Berkeley DB are too large to treat in a single tutorial; while not quite as complex as DBI (with its ability to talk to multiple SQL databases), DB supplies enough features to write complex data management applications. For more details of what you can do with DB_File, try the perl documentation ("perldoc DB_File"). In the meantime, here's our original let's-dump-a-hash-to-disk-and-restore-it, re-written to use DB:

  use DB_File;
  use Fcntl;   # for the O_CREAT and O_RDWR flags

  sub save_hash {
      # parameters: a reference to a simple hash
      # returns: name of file that hash is saved in, or undef if it failed
      my $hashref = shift;
      # generate per-process unique filename for DBM database
      my $filename = "/tmp/" . $$ . "." . time();
      tie(%my_dbm, "DB_File", $filename, O_CREAT|O_RDWR, 0666, $DB_HASH)
          or do {
              print "tie(\%my_dbm, DB_File, $filename) failed: $!\n";
              return undef;
          };
      %my_dbm = %$hashref;
      untie(%my_dbm);
      return $filename;
  }

  sub reload_hash {
      # parameters: a temporary filename created by save_hash()
      # returns: a reference to a simple hash
      my $filename = shift;
      if (!
tie(%my_dbm, "DB_File", $filename, O_RDWR, 0666, $DB_HASH)) {
          print "tie(\%my_dbm, DB_File, $filename) failed, does $filename ",
                "exist?\nReported error code was: $!\n";
          return undef;
      }
      my %my_hash = %my_dbm;
      untie(%my_dbm);
      return \%my_hash;
  }

SUBHEADING: Data::Dumper and FreezeThaw -- for complex data structures

Back before we got side-tracked into dbmopen() and tie(), we noted that the original save/restore hash code was brittle: it doesn't cope well with any type of data other than a straightforward, flat hash. Perl programmers often want to save hairy data structures; for this reason, we turn to a couple of modules that people have written for exactly this job -- Data::Dumper and FreezeThaw.

Data::Dumper is part of the core Perl distribution. When you create a Data::Dumper object, you supply it with a reference to an array of the scalars (or references to more complex data types) that you want to dump. Call the Dump() method, and the Data::Dumper object will return a textual representation of those values, in a form such that if you save it in a string, you can call eval() on it to reconstitute the variables. In effect, it lets you freeze-dry complex data structures and unfreeze them later! If we re-visit our save-a-hash subroutines and twiddle them to use Data::Dumper, we can make them robust enough that they'll work on an arbitrary Perl data structure and reload it later. We use it like this:

  use Data::Dumper;

  sub save_stuff {
      # parameters: a reference to an object
      # returns: name of file that object is saved in, or undef if it failed
      my $ref = shift;
      if (!defined $ref) {
          return undef;   # nothing to do
      }
      # generate per-process unique filename
      my $filename = "/tmp/" . $$ . "." . time();
      if (!
open(OUT, ">$filename")) {
          print "Could not open $filename for write: $!\n";
          return undef;
      }
      my $d = Data::Dumper->new([$ref]);
      $d->Purity(1);   # cope with nested references if necessary
      print OUT $d->Dump();
      close OUT;
      return $filename;
  }

And to reload the hash:

  sub load_stuff {
      # parameters: a temporary filename created by save_stuff()
      # returns: a reference to an object
      my $filename = shift;
      if (! -r $filename) {
          return undef;   # no file to load!
      }
      if (!open(IN, "<$filename")) {
          print "Could not open $filename for read: $!\n";
          return undef;
      }
      my $stuff = join("", <IN>);   # read entire file into a scalar
      close IN;
      my $ref = eval $stuff;        # eval() the dumped data structure
      return $ref;
  }

This is by no means perfect -- dumped data tends to expand quite a bit, it can be fiddled with by other people, and we've made no provision to store more than one piece of data on disk -- but it's a huge step forward from our starting point. There are some drawbacks. You can't dump an array or hash directly: only an arrayref or hashref. Moreover, Data::Dumper can't actually dump coderefs (references to anonymous subroutines such as closures); it's a data marshalling tool, not a compiler!

As an alternative to Data::Dumper, you may want to investigate the CPAN module FreezeThaw (note: this isn't supported by its author any more, but works fine under Perl 5.6.1). FreezeThaw doesn't save out data structures in a human-readable format; instead it dumps a compact, non-human-readable representation of the internal data (using freeze()) and then allows this to be reconstituted (using thaw()). (In use it's similar to the newer Data::Dumper class, so there's some question over whether you should be using it.)
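To give a flavour of the FreezeThaw interface, here's a sketch. FreezeThaw must be installed from CPAN, and the data here is invented for illustration:

```perl
use FreezeThaw qw(freeze thaw);

# freeze() flattens one or more data structures into a single string;
# thaw() returns the corresponding list of reconstituted values.
my $data = { user => 'fred', groups => ['audio', 'wheel'] };

my $frozen = freeze($data);    # a string safe to write to a file
my ($copy) = thaw($frozen);    # note the list context

print "$copy->{user}\n";       # prints "fred"
print "$copy->{groups}[1]\n";  # prints "wheel"
```

Because thaw() returns a list (one element per value originally frozen), remember the parentheses around the variable you assign it to.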