Linux Format 28 [[ Typographical notes: indented text is program listing; text surrounded in _underscores_ is italicised/emphasised ]] Perl Tutorial
TITLE: The art of the one-liner
STRAP: How much power can you get in a single line of code? Charlie Stross explores the art of the one-liner -- and shows you some handy Linux power-tools
SUBTITLE: Fast food and convenience programming
In this tutorial series, we've mostly focussed on writing programs in Perl. Common sense suggests that real programs are big -- but common sense is wrong: size has nothing to do with usefulness, and Perl is so full of flexible commands that you can pack a lot of power into a single line of code. By way of an example, let's look at how we'd go about counting the number of words in a line of text. In a conventional procedural language like C, we'd iterate over each character in turn, checking to see whether it is a whitespace character; if it is, and the preceding character is non-whitespace, then we've reached a word boundary, so we add 1 to our count of words. In contrast, in Perl we use an array operator like split(), primed with a definition of a word boundary, and evaluate it in a scalar context:

    print "there are ", scalar(split(/\s+/, $line)), " words in [$line]\n";

split() scans a scalar variable (in this case $line) looking for an expression (in this case the regular expression /\s+/ -- one or more whitespace characters). In a list context, it chops up the scalar and returns a list consisting of the pieces, using the expression to define the boundary between fragments. In a scalar context, it returns a count of how many pieces it found. (Note the explicit scalar() above: print supplies list context to its arguments, so without it split would hand print the words themselves rather than a count of them.) You usually use split for tasks such as scanning files of records with fields separated by some regular delimiter character, but splitting a line of text into words is just fine. Want to count the words in a file?
Instead of calling split() on a single variable, we want to have it run in a loop, reading lines from standard input, and we want to add up the number returned each time we call split(). Like this:

    while (<STDIN>) { $count += split(/\s+/, $_); } print $count;

Perl being Perl, we can leave out some of the details here and let Perl assume the default values. For example, if you don't give split() a scalar to scan, it assumes you want to scan $_; if you don't give it a pattern to look for, it assumes you want to split on whitespace. And the <> construct (read a record from the named file handle) reads from the files named on the command line -- or from STDIN if there are none -- and, as the sole condition of a while loop, sticks each record in $_ if you don't tell it anything else. All in all, this means we can reduce our count-the-words-in-the-standard-input program to this:

    while (<>) { $count += split }; print $count, "\n";

And then run it from the shell prompt like this:

    perl -e 'while (<>) { $count += split }; print $count, "\n";' < some_file.txt

The -e (evaluate) option to Perl indicates that the following argument is a program to be executed -- Perl can run commands specified this way, rather than stored in a named file. Can we shorten this further? Sure: the perlrun manpage shows us all the other juicy options Perl understands. For example:

    perl -ane '$c += @F; print $c, "\n";'

The '-n' option makes Perl assume that your code (supplied following the '-e' flag) is to be enclosed in a loop, reading from STDIN:

    while (<>) { # your program goes here }

And the -a (autosplit) flag tells Perl to automatically call split(' ', $_) on each line, stashing the results in a special variable called @F. (In numeric context @F yields the number of fields -- one more than $#F, the highest subscript -- so it's the right thing to add to a running word count.) You can modify the pattern used to split fields by using the -F flag, thus:

    perl -F'/\W+/' -ane '$c += @F; print $c, "\n";'

This tells Perl to assume a loop around the program, automatically split each input line read from STDIN on the pattern /\W+/ (one or more non-word characters), and stick the results in @F.
Our program then uses @F in numeric context (the number of fields) to increment $c, a cumulative count of words, and print the result for each line. A little messier than our first attempt, but a lot shorter than GNU wc (word-count), which weighs in at a couple of hundred lines of code in C!
SUBTITLE: Concorde in one line of perl
Counting words isn't the most immediately obvious use for Perl -- after all, all GNU/Linux systems already have 'wc', right? But what if you want, for example, to count the frequency with which specific words occur in a file? In a non-one-liner we'd do it like this:

    while (<>) {
        @line = split(/\W+/);
        foreach (@line) {
            $conc{$_}++;
        }
    }
    while (($key, $value) = each %conc) {
        print "$key: $value\n";
    }

%conc (concordance) is a hash; every time we meet a word in our input we use it as a key in %conc and increment the value associated with it (which therefore contains a running count of how many times we've met it). At the end we iterate over %conc and print out each word and the number of times we've met it. This is a wee bit verbose. We can cut it down a bit:

    perl -F'/\W+/' -ane 'map { $c{$_}++ } @F; END { while (($k, $v) = each %c) { print "$k: $v\n"; } }'

It's still a one-liner (sort-of). We use an explicit END {} block to call out the instructions to be executed right before the Perl script terminates -- they won't be executed in the implicit loop block. Here's a problem: the output from this program is a workable word-frequency list, but the words come out in no particular order. We might want them sorted in dictionary order (by their first field), or on the basis of their frequency (the second field). We can sort them by frequency at the shell prompt by daisy-chaining it in-line with a program that will read its standard input, and sort it on the basis of the second field.
If we stash our concordance program into a UNIX shell variable called $PROGGY, then using a standard UNIX pipeline we can pipe its output into sort:

    $PROGGY < my_input_file.txt | sort -t : -k 2 -n

The -t flag tells 'sort' to split fields on the following character ':', the -k flag tells it to use the second field onwards as the sort key, and -n tells it to compare numerically rather than as text. The point to take away from this is that if you know the basics of Perl, you can roll one-liners to take the place of any random shell utility you happen to need but can't remember the name of.
SUBHEADING: Capital!
Want to capitalise all the filenames in a directory? Or convert them all to lowercase?

    ls -1 | perl -ne 'chomp; print "mv $_ ", uc($_), "\n";' | /bin/bash

This basically gets a list of filenames, one per line, using ls. Perl then bends these into commands of the form "mv filename.txt FILENAME.TXT" using the uc() (uppercase) command. Finally, we feed them to bash. For some reason there's no "uppercase rename" command in the UNIX repertoire, but Perl one-liners let us add one easily enough. A really cool thing to do with one-liners like this is to add them to your UNIX shell environment. If you're using Bash (or Ksh, or Zsh, or most contemporary shells) you can make use of aliases. A shell alias is set like this:

    alias ll='ls -al | less'

Thereafter, whenever you type the token "ll" as a command, separated by whitespace, the shell will expand the alias to "ls -al | less". Note that bash aliases don't take parameters: any arguments you supply are simply appended after the expansion, so "ll foo" becomes "ls -al | less foo" -- which is rarely what you want. To place an argument in the middle of a command you need a shell function instead. Aliases and functions in bash can be loaded from the file ~/.bashrc. So you can add a batch of Perl one-liners here.
For example, add the following bits of bash/perl to your .bashrc file:

    function ren_lower() { perl -ne 'chomp; print "mv \"$_\" \"", lc($_), "\"\n";' }
    function ren_upper() { perl -ne 'chomp; print "mv \"$_\" \"", uc($_), "\"\n";' }
    function ren_capitalise() { perl -ne 'chomp; print "mv \"$_\" \"", ucfirst($_), "\"\n";' }

(The escaped quotes around the filenames keep mv working if a name happens to contain spaces.) You can then (once you've executed your .bashrc or started a new subshell) use ren_lower in your shell scripts:

    ls -1 | ren_lower | bash

or

    find . -type f -print | ren_capitalise | bash # make Windows users feel at home

SUBTITLE: Who sed what?
Perl is a working superset of the awk and sed text manipulation tools -- we saw that in the capitalisation example above. But by using sed-style regular expressions we can achieve lots more. For example, take ASCII files. On UNIX and Linux, an ASCII file uses the \n (decimal 10, line feed) character to signify the end of a line. The classic Mac OS uses \r (decimal 13, carriage return) to signify the end of a line, and DOS/Windows uses the sequence "\r\n", just to be different. If you work with Mac or Windows users, you may want to add the following bash functions to your .bashrc to make swapping files containing ASCII text easier:

    function to_mac () { perl -pi.bak -e 's/\n/\r/g;' "$@" }
    function from_mac () { perl -pi.bak -e 's/\r/\n/g;' "$@" }
    function to_dos () { perl -pi.bak -e 's/\n/\r\n/g;' "$@" }
    function from_dos () { perl -pi.bak -e 's/\r//g;' "$@" }

In each case, the short perl script does a global search/replace on newline and carriage return characters, going in the appropriate direction (from Linux to Mac or DOS, or from DOS or Mac to Linux). We use the -i.bak flag to tell Perl to do in-place editing -- if you tell it to work on a file called fred, Perl will put its output into a file called fred and rename the original file to fred.bak.
The "$@" is a shell parameter that expands to a list of all the filenames specified as parameters to the shell function, so that if we say:

    to_mac *.txt

The Perl script will be executed on every *.txt file in the current directory, re-creating them as Mac-compatible ASCII files and leaving the original versions with a .txt.bak suffix. Perl's regular expressions can do a hell of a lot more than simply change line ending characters. For example, you can embed arbitrary Perl code in them using the /e modifier:

    $_ =~ s/(foo)/myfunc($1)/eg

Searches the variable $_ for the expression "foo", and wherever it finds it, executes myfunc() (passing the captured match, $1, as a parameter) and substitutes whatever myfunc() returns. This can come in handy if you've got a big website full of HTML files and want to mechanically update a link target. For example, if a lot of pages refer to an HTML page called "feedback.html" and you've just replaced it with the much-more-whizzy "guestbook.cgi" you can use Perl in combination with find, the UNIX file search command, and xargs (which turns its standard input into another command's argument list):

    find . -type f -name "*.html" -print | \
        xargs perl -pi.bak -e 's/feedback\.html/guestbook.cgi/g;'

(Note that we hand the filenames to Perl as arguments via xargs; if we simply piped them into perl -pi.bak, Perl would try to edit its standard input rather than the named files.) Using an external find command is a pain in the neck. Why can't Perl do this itself? The answer is, Perl can -- using File::Find. File::Find is a utility module for directory traversal and does pretty much what find(1) does -- in fact, you can give the find2perl command a find command line and it will spit out Perl code using File::Find that does exactly the same work. It's not strictly suitable for one-liners, but it's nice to know that if you turn the above example into:

    find2perl . -type f -name "*.html" -print | perl | \
        xargs perl -pi.bak -e 's/feedback\.html/guestbook.cgi/g;'

(find2perl writes a Perl program to its standard output; the first perl in the pipeline runs it, printing the matching filenames) then find itself is out of the loop. (File::Find has one obvious deficiency -- although you can use it to run other programs, there's no facility to add a callback subroutine in Perl that is executed on the _contents_ of whatever it finds.
So although we can execute the s/foo/bar/g command on the _name_ of each file it locates, we need a second command (or a complex mess of file open/read/write/close code) to do it to the contents.)
SUBTITLE: Modules
In an earlier tutorial we already met one of the most powerful Perl one-liners:

    perl -MCPAN -e shell

The -M flag instructs Perl to import the named module (in this case CPAN.pm) and the -e command can then make use of anything imported from that module -- in this case, the shell() interactive subroutine (which lets you search for and install modules interactively). Some modules -- notably CPAN.pm -- export lots of subroutines that you can use in one-liners. For example, if you want to take a snapshot of your machine's Perl installation that you can replicate onto another computer, you'd do this:

    perl -MCPAN -e autobundle

This writes a bundle file into the directory defined in $CPAN::Config->{cpan_home}/Bundle on your system -- typically in ~/.cpan/Bundle (in your home directory). A bundle file is a list of modules and version numbers. The bundle file might look a bit like this:
[ BEGIN CODE LISTING ]

    package Bundle::Snapshot_2002_04_03_00;
    $VERSION = '0.01';
    1;
    __END__
    =head1 NAME
    Bundle::Snapshot_2002_04_03_00 - Snapshot of installation on blueberry on Wed Ap
    =head1 SYNOPSIS
    perl -MCPAN -e 'install Bundle::Snapshot_2002_04_03_00'
    =head1 CONTENTS
    AnyDBM_File undef
    Apache 1.27
    Apache::Connection 1.00
    Apache::Constants 1.09

[ END CODE LISTING ]
You can take this bundle snapshot and copy it to another computer -- then type:

    perl -MCPAN -e 'install Bundle::Snapshot_2002_04_03_00'

And the CPAN module will do its second-best to install everything in the snapshot on your new machine. (If you're feeling reckless, try "force install" to force CPAN to continue with installation if any of your modules fail to test out.) This is the recommended way of copying a Perl setup from one machine to another.
Another handy module that gives us a bunch of command-line tools is Config -- the Perl configuration module. Config.pm is basically an archive of system-specific information that describes everything the Configure program discovered when it was preparing to compile your local perl installation; this is a detailed description of many aspects of your operating system. When you use Config, you implicitly import a hash called %Config, which contains a huge bundle of key/value pairs. Want to see a textual summary of the main configuration options on your system?

    perl -MConfig=myconfig -e "print myconfig();"

Note the flag: -MConfig=myconfig. This is equivalent to the line:

    use Config qw(myconfig);

which explicitly imports "myconfig" from the module Config. We can add other items to import explicitly:

    use Config qw(myconfig config_vars config_sh);

is equivalent to:

    perl -MConfig=myconfig,config_vars,config_sh

To get the entire configuration, try this:

    perl -MConfig=config_sh -e "print config_sh();"

To get the Perl version:

    perl -MConfig -e 'print $Config{version}, "\n";'

Or the system's identification (including operating system type, version, platform, CPU type, and when it was created):

    perl -MConfig -e 'print $Config{myuname}, "\n";'

Want to know what the byte order (endian-ness) is on your system, or what name to use to invoke your C compiler?

    perl -MConfig -e 'print $Config{byteorder}, "\n";'
    perl -MConfig -e 'print $Config{cc}, "\n";'

SUBTITLE: Webbing around
LWP -- libwww-perl -- gives us a potload of utility modules, of which the easiest to use is LWP::Simple, a procedural front-end to HTTP requests. For example:

    perl -MLWP::Simple -e 'getstore("http://www.antipope.org/charlie/", "charlie.html");'

Fetches the URL "http://www.antipope.org/charlie/" and stores it in "charlie.html". And:

    perl -MLWP::Simple -e '$x = "http://www.antipope.org/charlie/";
        print "Document $x is ", (head($x))[1], " bytes long\n";'

Issues a HEAD request.
This returns a list consisting of the object's MIME content-type, length, modification time, expiration time, and server. We can have fun with the web and our .bashrc file -- or other configuration files. For example, in .bashrc, add the following:

    function getprint () {
        perl -MLWP::Simple -e 'getprint($_) for @ARGV;' "$@"
    }

getprint() (in LWP::Simple) gets a URL and prints it on STDOUT (or, if it can't be retrieved, prints the status code and message on STDERR). This shell function expects a bunch of URLs as parameters, and fetches then prints them. So you can use a command like this:

    getprint http://www.linuxformat.co.uk/ > mypage-copy.html

On the bash command line to copy Linux Format's home page. A bit more usefully, here's a nearly-one-liner that may be useful if you like to keep an eye on a bunch of web pages (such as slashdot, or the BBC news website):

    function has_changed () {
        perl -MLWP::Simple -ne 'chomp;
            $prefix = $ENV{HOME} . "/.urls";
            ($mod = $_) =~ s/\//+/g;
            $time = (stat "$prefix/$mod")[9];
            $mt = (head($_))[2];
            if ($mt > $time) {
                print "CHANGED: $_\n";
                utime($mt, $mt, "$prefix/$mod");
            }'
    }

The idea here is that in your home directory ($ENV{HOME}) you have a subdirectory called ".urls". In this directory you keep a bunch of files with names like "http:++slashdot.org+". (We change the forward-slashes to plus signs which are a bit easier for the Linux command line tools to handle). When has_changed is passed a series of URLs on standard input, it works out which file in .urls to look at (by putting $_ into $mod and turning the slashes to plus signs), and checks its modification time (element 9 of the list returned by Perl's stat()). It then calls the head() function from LWP::Simple and checks the modification time on the remote file. If the remote version is more recent, it prints "CHANGED" (followed by the URL) on standard output, then uses Perl's built-in utime() to update the modification time on the file in ~/.urls to match the version on the web.
In use, you just write a shell script in bash that uses has_changed() to detect whether a web page has been updated since the last time you ran it. (NOTE: the date comparison method used here is primitive in the extreme. Doing this job properly isn't really a one-liner; if you want a fun exercise, try turning has_changed() into a real standalone Perl program with error checking, the ability to take account of different time zones and handle failed connections gracefully, and use a DBM database file instead of a subdirectory.) LWP isn't the only module you can do interesting web-related one-liners with. Have a look at this:

    perl -MNet::Ping -e '$p = Net::Ping->new("tcp", 60);
        $p->{port_num} = 80;
        $p->ping("www.antipope.org") &&
            print "www.antipope.org has a live http port\n";'

Net::Ping lets us create ICMP, UDP or TCP packets and bounce them off remote hosts. In this example, we create a new Net::Ping object with a 60 second timeout, tell it to send TCP packets to port 80 (the standard well-known port for HTTP), and then ping www.antipope.org to see if it's alive and responding on that port. (You can actually use Net::Ping to write a dumb but easy-to-understand port scanner; just loop through all the available ports, remembering to reset port_num on each pass:

    perl -MNet::Ping -e '$p = Net::Ping->new("tcp", 5);
        for ($i = 0; $i <= 65535; $i++) {
            $p->{port_num} = $i;
            $name = (getservbyport($i, "tcp"))[0] || $i;
            $p->ping("www.antipope.org") &&
                print "www.antipope.org has a live $name port\n";
        }'

getservbyport() is a built-in Perl command that, for a given port number and protocol, returns (in list context) the corresponding service name, aliases, port number, and protocol name -- element 0 is the name we want, and we fall back to the raw port number for ports with no entry in /etc/services.)
END (BODY COPY - One Liners)
BOXOUT: Sorting
One topic that comes up repeatedly in Perl is: how do you sort a list of items? Let's take an array of simple scalars like this:

    my @list = (1, 2, 3, 4, 5, 6, 7, 8);

Each of these scalars is a simple integer number, and it's already sorted into ascending order.
We can reverse it quite easily:

    print reverse @list;
    87654321

To insert some delimiters, we use "join" to turn @list into a scalar, with each element delimited by "][" -- like this:

    print "[", join("][", reverse @list), "]";
    [8][7][6][5][4][3][2][1]

Now, let's randomize our list and run sort on it:

    my @list = (7, 2, 3, 4, 1, 8, 5, 6);
    print "[", join("][", sort @list), "]";
    [1][2][3][4][5][6][7][8]

This demonstrates something important about Perl's sort() command. You run it on a list, and it returns another list -- this one sorted. By default, sort() compares elements as strings, which coincides with ascending numerical order here only because every element is a single digit (add a 10 to the list and string order would put "10" before "2"). Want to sort numerically, and in descending order at that? Try this:

    print "[", join("][", sort { $b <=> $a } @list), "]";
    [8][7][6][5][4][3][2][1]

The expression in curly braces is a sort USERSUB -- a user-defined sorting test. sort works by applying the sorting test to pairs of elements from the array. The elements being compared are visible within the sorting test as $a and $b; it's the job of the user sub to return -1, 0, or 1 depending on how the elements in the list are to be ordered. The <=> operator, and its string equivalent cmp, are used to compare two values and return -1, 0, or 1 depending on whether the left item is less than, equal to, or greater than the item on the right. The default behaviour of sort() corresponds to the string-comparison test:

    { $a cmp $b }

while for numbers you want:

    { $a <=> $b }

If $a is less than $b, <=> returns -1; if it's greater, it returns 1; and if they're the same it returns 0. By swapping the order of $a and $b in the usersub (as in our sorting-it-the-other-way example above) we reverse the sorting order.
Let's try another array:

    my @list = (qw(red blue green mauve pink violet yellow));
    print "=>", join(" ", sort { $a cmp $b } @list);
    => blue green mauve pink red violet yellow

And the other way:

    print "=>", join(" ", sort { $b cmp $a } @list);
    => yellow violet red pink mauve green blue

When sorting strings, it's important to remember that we don't sort in dictionary order -- we sort in string order, depending on the local character set in use. (Undefined values sort before defined null strings, which sort before all characters in character-set order.) One side-effect of this is that sort is case-sensitive -- and in ASCII/Latin-1 all the capital letters sort before all the lowercase letters:

    my @list = (qw(red blue green MAUVE pink violet yellow));
    print "=>", join(" ", sort { $a cmp $b } @list);
    =>MAUVE blue green pink red violet yellow

To avoid this, we need to canonicalize the strings to lowercase inside our comparison routine:

    my @list = (qw(red blue green MAUVE pink violet yellow));
    print "=>", join(" ", sort { lc($a) cmp lc($b) } @list);
    =>blue green MAUVE pink red violet yellow

Arrays of simple scalars are easy enough to sort, but as we saw in the concordance example, sorting hashes so that they show up in the right order is a bit harder. However, we can sort the keys to a hash on the basis of their associated value:

    sub sort_by_sales {
        $department_sales_volume{$b} <=> $department_sales_volume{$a};
    }
    for $dept (sort sort_by_sales keys %department_sales_volume) {
        print "$dept: ", $department_sales_volume{$dept}, "\n";
    }

As you can see, here we're sorting an array (the keys to the hash) by comparing the value in the hash that's associated with each key. Once we've got the keys sorted into order on the basis of their values, we can print out a ranking of each department in our store.
Looking at our word-frequency program:

    perl -F'/\W+/' -ane 'map { $c{$_}++ } @F; END { while (($k, $v) = each %c) { print "$k: $v\n"; } }'

Let's re-write this and use sort() to ensure that our concordance is printed in proper decreasing order of frequency:

    #!/usr/bin/perl
    while (<>) {
        chomp;
        @F = split(/\W+/);
        for $word (@F) {
            $frequency{$word}++;
        }
    } # end of standard input
    sub by_frequency {
        $frequency{$b} <=> $frequency{$a};
    }
    for $word (sort by_frequency keys %frequency) {
        printf ("%12s => %6d\n", $word, $frequency{$word});
    }

There are some much hairier uses of sort(), but we'll deal with them in another tutorial; hopefully this demonstrates that sort() is rather more powerful and flexible than most people realise.
END BOXOUT