Linux Format 36 Perl Tutorial:

/// TITLE: Directory searching with Perl

/// STRAP: Charlie Stross gets into filesystems and explains how to do fun things with directories and files

[[ TYPOGRAPHICAL NOTE: text enclosed in /// BEGIN CODE and /// END CODE is a code listing. Text in the body copy surrounded by _underscores_ (thus) is italicised or emphasized -- but _not_ in the code listings ]]

/// BEGIN BODY COPY

/// SUBHEADING: Managing files with Perl

Perl's reputation as the UNIX Swiss Army Chainsaw is pretty well-known, but when we work on a Linux system we tend to use a shell -- either a pretty graphical file manipulation front-end such as GNOME's Nautilus or KDE's Konqueror, or a command-line shell such as bash (which is also an interpreter for a fairly powerful string-substitution language). One particular reason we use a shell is that shells are good at manipulating files, and files are the lowest-level unit of data storage that we normally deal with directly. But Perl can do (albeit not very interactively) all the jobs we expect of a shell -- and in many cases it can do them faster and more elegantly.

First of all, let's look at some basics. The open() command lets us associate a file with a data structure called a _file_ _handle_; a mode prefix on the filename tells open() whether to open the file for reading, writing (truncating it to zero length), appending (writing additional data without truncating it), or reading and writing. If open() fails, it returns false and leaves the error condition in the special variable $!. The file handle itself is a special type of variable that keeps track of an open file, and can be read from or written to using a variety of commands (read(), print(), sysread(), syswrite(), and so on). When a file needs to be flushed to disk and read/write operations are done, we call close() on the file handle. For example:

/// BEGIN CODE
my $tmpfile = "/tmp/tmpfile.$$";   # name of temporary file
my $line    = "";                  # scratch variable

open (TEMPFILE, ">$tmpfile") || die "Error opening $tmpfile for writing: $!\n";
print TEMPFILE "This data is written to $tmpfile\n";
close TEMPFILE;

open (TEMPFILE, "<$tmpfile") || die "Error opening $tmpfile for reading: $!\n";
print "-- Start of file\n";
while ($line = <TEMPFILE>) {
    print $line;
}
print "-- End of file\n";
close TEMPFILE;
/// END CODE

/// SUBHEADING: Fun with directories

Directories are a bit different from files. Originally, in the UNIX world a directory was a special file containing a list of records. Each record was of a fixed length; the first field specified an inode number, and the rest of each record (up to the first null byte) was a filename associated with that inode. (An inode is a special data structure stored in a UNIX filesystem that points to all the data blocks associated with a file and keeps track of the file's access and modification times, among other things.)

Linux considers directories to be special files -- you can't simply open() them or write to them. (This will be a relief to those of you who share your columnist's memory of discovering the hard way that on Solaris 2.4 and other UNIXes it was possible to open() a directory and scribble all over it in blue crayon, thus wreaking havoc on a filesystem.) Part of the reason for this is that Linux supports a variety of filesystems in which a directory is _not_ a simple list of inodes and filenames -- ReiserFS, for example, or NTFS and HFS -- and so a higher level abstraction is needed.
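You can see this protection in action from Perl itself. Here's a minimal sketch (the directory name is an arbitrary choice) that tries to open a directory for writing; on Linux the kernel refuses, and the resulting 'Is a directory' error turns up in $!:

/// BEGIN CODE
my $dir = "/tmp";    # any directory will do for this demonstration

# The kernel won't open a directory for writing (EISDIR), so open()
# returns false and the error message lands in $!
if (open(DIRFILE, ">$dir")) {
    print "Opened $dir for writing -- this shouldn't happen on Linux!\n";
    close DIRFILE;
} else {
    print "Could not open $dir for writing: $!\n";
}
/// END CODE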
But Perl still lets you get at the contents of a directory, by using the opendir(), readdir(), telldir(), seekdir(), and closedir() commands. Here's how we slurp a list of all the filenames in the current directory:

/// BEGIN CODE
$dir   = ".";   # present working directory
@flist = ();    # list of files

opendir(THISDIR, $dir) || die "Unable to opendir($dir): $!\n";
@flist = readdir(THISDIR);
closedir(THISDIR);

print "There are ", scalar(@flist), " items in $dir\n";
/// END CODE

This example is a bit crude -- in particular, beware the consequences of doing it on a truly huge directory! (@flist will bloat up.) You may prefer to do something smart using telldir() and seekdir(). When you call readdir() in scalar context, it returns the next filename from the current directory. telldir() returns the current position of the readdir() location, and seekdir() seeks to that position in the current directory. (Unfortunately the value returned by telldir() and used by seekdir() isn't a straightforward integer -- it's a magic internal pointer and you can't do useful things like increment or decrement it manually.)

/// SUBHEADING: Testing file attributes

A filename is just a human-friendly name associated with an inode; inodes keep track of file contents. To examine an inode's attributes, Perl gives us two different mechanisms: the stat() command, and the file test operators.

stat() first: inodes contain a bunch of attributes, and stat() lets us examine these for a given filename or filehandle. stat() returns a thirteen-element array, containing: filesystem device number, inode number, file mode (permissions), number of hard links to the file, numeric user ID of the file's owner, numeric group ID of the file's owner, device identifier (for special files such as devices), file size in bytes, last access time, last modification time, last change time, preferred blocksize for file system I/O, and the actual number of blocks allocated to the file. All times are reported as seconds since the UNIX epoch began (in 1970). For example, to check the name of a file's owner:

/// BEGIN CODE
$uid        = (stat $filename)[4];
@user_pwent = getpwuid($uid);   # get /etc/passwd entry for user ID $uid
print "File $filename belongs to user ", $user_pwent[0], "\n";
/// END CODE

The file test operators are somewhat different; they're unary operators that apply to a filename or file handle, and let us test various attributes of the file, returning their value: for example, the -r, -w, and -x operators test whether a file is readable, writable, or executable by the current program's effective user ID and group ID, while the -e test examines whether a file exists, and -f, -d, -l and -p test whether a file is a plain file, a directory, a symbolic link or a named pipe, and so on. These operators are boolean; you can chain them together using the logical-OR and logical-AND operators ('||' and '&&'). To save the overhead of a low-level stat() system call, Perl caches the previously-examined inode in a special filehandle called simply '_', so we can do tests like this:

/// BEGIN CODE
if ( -e $file && -r _ && -x _ ) {
    # $file exists and is both readable and executable
    ...
}
/// END CODE

/// SUBHEADING: File manipulation modules

The built-in file test operators, directory and file read/write commands, and extras like stat() give Perl a firm base on which to mess around with files.
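To see how these pieces fit together, here's a minimal sketch (the directory and the output format are arbitrary choices) that reads a directory one entry at a time -- avoiding the @flist bloat mentioned above -- and uses the file test operators and stat() to report the size and owner of each plain file:

/// BEGIN CODE
my $dir = ".";    # directory to examine

opendir(DIR, $dir) || die "Unable to opendir($dir): $!\n";
while (defined(my $name = readdir(DIR))) {   # scalar context: one entry at a time
    my $path = "$dir/$name";
    next unless -f $path;                    # skip anything that isn't a plain file
    my ($size, $uid) = (stat(_))[7, 4];      # reuse the stat data cached by -f
    my $owner = getpwuid($uid) || $uid;      # fall back to the numeric UID
    printf "%-20s %10d bytes, owned by %s\n", $name, $size, $owner;
}
closedir(DIR);
/// END CODE

Note the '_' filehandle again: because -f has already stat()ed the file, stat(_) reuses the cached result instead of making a second system call for every entry.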
But there are a number of recurrent tasks that frequently crop up when itemizing files in directories, and many of these are handled by the modules in the File:: hierarchy -- some of which ship with the standard Perl distribution (as of 5.8.0). Core File modules you can count on in a recent Perl include File::Find, File::Compare, File::Basename, File::CheckTree, File::Copy, File::Path, and File::Temp. Let's go through them and see what they do.

File::Find is actually the back end of the find2perl program, which translates invocations of the standard UNIX find utility into Perl. find is a tool that lets you recursively search filesystems for files matching some criteria; it works by letting you specify a sequence of criteria against which each file it encounters is matched, and when files fail to match they're weeded out. For example:

/// BEGIN CODE
find / -depth -type f -name 'ez*' -print
/// END CODE

causes find to start searching from the root of the filesystem '/', conducting a depth-first traversal. find then rejects any files that are not of type 'f' (ordinary file), rejects anything that doesn't match the pattern 'ez*', and prints the name of anything that's left over. find2perl does much the same -- but instead of _searching_ for a file, it emits a block of Perl code that does the same job that find would. For example:

/// BEGIN CODE
find2perl / -depth -type f -name 'ez*' -print
/// END CODE

emits:

/// BEGIN CODE
#! /usr/bin/perl -w
    eval 'exec /usr/bin/perl -S $0 ${1+"$@"}'
        if 0; #$running_under_some_shell

use strict;
use File::Find ();

# Set the variable $File::Find::dont_use_nlink if you're using AFS,
# since AFS cheats.

# for the convenience of &wanted calls, including -eval statements:
use vars qw/*name *dir *prune/;
*name   = *File::Find::name;
*dir    = *File::Find::dir;
*prune  = *File::Find::prune;

sub wanted;

# Traverse desired filesystems
File::Find::finddepth({wanted => \&wanted}, '/');
exit;

sub wanted {
    my ($dev,$ino,$mode,$nlink,$uid,$gid);

    (($dev,$ino,$mode,$nlink,$uid,$gid) = lstat($_)) &&
    -f _ &&
    /^ez.*\z/s &&
    print("$name\n");
}
/// END CODE

If you want to do something elaborate with the files you find using this routine, just replace 'print("$name\n");' with a call to your own subroutine.

The other common File utilities are less obscure. File::Basename does much the same job as the traditional library routines basename() and dirname(); it lets you parse a filename, extracting the directory path component, file suffix, and file name separately. File::Copy provides equivalents to the cp and mv commands, for copying and moving files. File::Path allows you to create and remove directory trees, including multiple nested directories in one go. File::Temp provides routines for generating the name of, and a handle on, a temporary file -- one that is guaranteed to be unique to the current process. File::Compare allows you to compare the contents of two files or file handles, to check whether they are identical -- it's equivalent to the UNIX cmp command, not diff (which is supported by the non-core module Text::Diff, available from CPAN, if you need to do detailed differential analysis). And File::CheckTree allows you to run many file test operators in parallel on a tree of files; you'd use this if you want to check that a bundle of files you're distributing is consistent with the state that a program requires before it will run safely.

All these modules come bundled with the standard Perl distribution; there are a lot more File:: modules on CPAN, but they aren't guaranteed to be present on all Perl installations.
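As a taste of how these modules work together, here's a minimal sketch -- the source tree, destination directory, and '.conf' suffix are all arbitrary choices for illustration -- that uses File::Find to walk a directory tree, File::Basename to strip the directory part off each match, File::Path to create a destination directory, and File::Copy to copy the matching files into it:

/// BEGIN CODE
use strict;
use File::Find;
use File::Basename;
use File::Copy;
use File::Path;

my $source = "/etc";              # tree to search (illustrative)
my $dest   = "/tmp/conf-backup";  # where to put the copies (illustrative)

mkpath($dest) unless -d $dest;    # File::Path: create the directory (and any parents)

find({
    no_chdir => 1,                # leave $_ set to the full pathname
    wanted   => sub {
        return unless -f $_;                  # plain files only
        my $base = basename($_);              # File::Basename: strip the directory part
        return unless $base =~ /\.conf$/;     # only interested in *.conf files
        copy($_, "$dest/$base")               # File::Copy: equivalent of cp
            || warn "Couldn't copy $_: $!\n";
    },
}, $source);
/// END CODE

The no_chdir option tells File::Find to leave $_ holding the full pathname rather than chdir()ing into each directory as it goes, which keeps the copy() call simple.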
/// END BODY COPY

/// BEGIN BOXOUT: Gratuitous book review

/// TITLE: The Perl CD Bookshelf, Version 3.0
/// PUBLISHER: O'Reilly and Associates
/// ISBN: 0-596-00389-7

If there's one Perl book that you really need to keep handy -- if you use Perl for a living -- it's the Perl CD Bookshelf. O'Reilly have made a speciality out of Perl documentation (Larry Wall is a research fellow there), and this is a compendium of seven of the key texts on the language.

In addition to a paperback copy of Perl in a Nutshell (the second, improved edition -- itself an indispensable desk reference), the CD-ROM that comes in the binder includes "Perl in a Nutshell" and the third edition versions of "Programming Perl" and "Learning Perl", as you'd expect. There's also a copy of Tom Christiansen's "Perl Cookbook", a handy tome full of useful procedures and algorithms for accomplishing day to day tasks in Perl, including many file maintenance operations.

In addition to these core books, this version of the CD bookshelf drops the Windows content to focus on some more general Perl development books. "Perl and LWP" covers programming the web using LibWWW-Perl or LWP, a vital toolkit that lets you download information from the web, parse HTML to extract information, and even build small web servers into your own applications. Then there's "Perl and XML", extending the utility of Perl as an internet programming language to the next level with coverage of processing XML data in Perl, including PerlSAX, XSLT, and the Document Object Model. Finally, "Mastering Perl/Tk" shows up; the only decent book about the only decent cross-platform GUI programming kit for Perl is a welcome addition to the collection.

About the only criticism that can be made is that O'Reilly are only putting seven books on the CD -- "Programming the Perl DBI", "Advanced Perl Programming", and "Perl for System Administration" would all be useful additions, even at the cost of a higher cover price. But it's hard to overstate the value of this collection; "Perl in a Nutshell" is itself vital to a jobbing Perl programmer, and the combination with six other core books provides a level of coverage that can't easily be equalled.

/// END BOXOUT: Gratuitous book review