LINUX FORMAT PERL COLUMN

STRAP: Part 2: Regular expressions and Packages: power tools for large programs

SUBHEADING: Regular expressions

One of the most interesting features of Perl -- one we don't find built into traditional languages like C or Lisp -- is the regular expression. We use regular expressions for scanning text strings and carrying out search/replace operations on them. It's possible to use Perl for text mangling without regexps (as they're known to their friends) -- Perl also provides substr() and length() functions in the style of ancient BASIC -- but regular expressions are at the heart of much of Perl's power.

If you've used a word processor's search and replace facility, you're used to entering a word and having the word processor search for the first occurrence of that word in the text. You may also have used a word processor that understands patterns: a series of characters that represent items that can be found in the text (for example, "?" for "any character" and "*" for "any sequence of characters").

Regular expressions in Perl are basically a sophisticated symbolic language for specifying patterns in text. You specify a regexp, and Perl's regular expression engine takes the pattern and determines whether (and if so, how) it matches against your data.

(Note that this is a really abbreviated intro to regular expressions. For the full story, you need to read Chapter 3 of "Programming Perl, 3rd Edition" by Larry Wall, Tom Christiansen and Jon Orwant, ISBN: 0-596-00027-8, or "Mastering Regular Expressions" by Jeffrey E. F. Friedl, ISBN 1-56592-257-3.)

About the simplest kind of regular expression is one that looks like this:

    /red/

This isn't just the word "red" enclosed between slashes: it's a pattern that matches the character "r" followed by the character "e" followed by the character "d". In a regular expression, any character (with some exceptions we'll see in a moment) is a pattern.
Simple letters or digits usually match themselves, but most punctuation characters have special meanings. We use this regular expression like so:

    $text = "the quick brown fox jumped over the red hen";
    if ($text =~ /red/) {
        print "caught red handed!\n";
    }

The operator =~ "binds" a regular expression to a variable (in this case, to $text) and returns true if a match exists between the pattern and the contents of the target variable. It also sets some special variables:

BEGIN TABLE
Variable    Contents
$`          all the text in the target string to the left of the match
$&          the text that matches the regular expression
$'          all the text to the right of the matched section in the target
END TABLE

So, if we want to delete everything up to and including the matched text, we can do something like this:

    $text = "the quick brown fox jumped over the red hen";
    if ($text =~ /red/) {
        $text = $' ;
        print "$text\n";
    }

(which will print " hen", including the space that followed the match.)

Metacharacters are special characters that don't represent themselves (in a regexp); they form a simple language.

BEGIN TABLE
Character   Meaning
\           The following character is not a metacharacter (e.g. \\ is treated as a literal "\")
|           Logical OR; a|b means "match either a OR b at this position"
()          Group some characters together to form a sub-expression: (red)|(blue) means "match either 'red' OR 'blue'"; the brackets themselves are not matched as literal "(" or ")"
[]          Character set: [A-M] means "match any one character in the series A, B, C ... M"
^           Outside a character set, this means "match the beginning of a string"
^           As the first element in a character set, means "match anything that is not a member of this set": [^A-M] means "match any character that is not in [A-M]"
$           Matches the end of a string
*           Match zero or more of the preceding expression
+           Match one or more of the preceding expression
?           Match zero or one of the preceding expression (this is equivalent to {0,1})
? (other)   If it occurs after a quantifier (like "*"), match the shortest possible string that the preceding expression can match
.           Match any single character once (except, by default, a newline)
{a,b}       Match from a to b occurrences of the preceding expression; if "b" is missing ({a,}), match at least "a" occurrences. To match up to "b" occurrences, write {0,b}; only recent Perl releases accept {,b}. E.g., (red){2,3} means "match two to three occurrences of the expression (red) in a row"
END TABLE

So, to match primary colours we can say something like:

    $text =~ /(red)|(blue)|(green)/i

Note the trailing "i" after the last slash. This is known as an expression modifier; modifiers are used to change the way the Perl regexp engine interprets an expression. This one means that the regexp is to be evaluated as case-insensitive, i.e. it matches upper case, lower case, or mixed case. It saves us from having to write something like:

    $text =~ /([Rr][Ee][Dd])|([Bb][Ll][Uu][Ee])|([Gg][Rr][Ee][Ee][Nn])/

There are a number of other modifiers; however, some of them are specific to the operator we're using.

SUBTITLE: Replacing strings

There are, broadly speaking, three ways you can use a regular expression. You can deploy one (as above) to match (search for) a pattern in a string. You can use one to match a pattern and replace what it matched with something else. And you can use a related operator to match (and replace) a set of characters. The form we've been using ($variable =~ /expression/) is a simplified version of the m// (match) operator. To match and replace we use the s/// (substitution) operator, and to operate on a set of characters we use the tr/// (translation) operator.

Simple replacement looks like this:

    $text =~ s/red/blue/ ;

This replaces the first occurrence of the string "red" with the string "blue"; add the /g modifier (s/red/blue/g) to replace every occurrence. Each bracketed sub-expression that matches is in turn assigned to special variables called $1, $2, and so on:

    $text =~ s/(red)/blue-$1/ ;
    print $text;

will turn "red" into "blue-red" if successful. Note that it might not be what you want to achieve -- instances of "redundancy" will become "blueundancy".
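The behaviour of s/// described above can be checked with a short sketch. The sample strings and variable names here are assumptions made up for illustration; only the operators themselves come from the column.

```perl
use strict;
use warnings;

# Hypothetical sample text, not taken from the column.
my $first = "red sky, red sea";
(my $every = $first) =~ s/red/blue/g;   # /g: replace every match
$first =~ s/red/blue/;                  # no /g: replace the first match only

my $tagged = "red";
$tagged =~ s/(red)/blue-$1/;            # the brackets capture "red" into $1

my $oops = "redundancy";
$oops =~ s/red/blue/;                   # the pattern knows nothing about words

print "$first\n";    # blue sky, red sea
print "$every\n";    # blue sky, blue sea
print "$tagged\n";   # blue-red
print "$oops\n";     # blueundancy
```

The last line is the problem case: a bare /red/ matches the letters wherever they occur, whether or not they form a whole word.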
In fact, if we want to replace the word "red" with the word "blue", we have a couple of implicit problems to solve: how to identify word boundaries, and how to apply a change to a subexpression. Perl provides some special metasymbols that help here. We've seen ^ used to match the beginning of a string: for example,

    $text =~ /^red/

only matches if the string stashed in $text begins with "red"; /red$/ does the same for a string ending in "red". Metasymbols that may be of use include:

BEGIN TABLE
Metasymbol  Meaning
\d          match any digit character
\D          match any nondigit character
\w          match any word character ([A-Za-z0-9_])
\W          match any nonword character
\s          match any whitespace character (space, tab, return, etc)
\S          match any non-whitespace character
\3          match the same text that the third bracketed subexpression matched (likewise \1, \2, and so on)
END TABLE

(There are other metasymbols; they're listed in table 5-7 of Chapter 3 of "Programming Perl".)

Our problem of replacing "red" only where it's a word can now be tackled. First, we bracket "red" with metasymbols that match only nonword characters, so that a word like "redundancy" won't match -- the "u" is a word character. This gives us an expression like:

    $text =~ /\W+red\W+/

which matches "red" only when it has nonword characters on both sides, not words containing the substring "red". Then we do the substitution like this:

    $text =~ s/^(.*\W*)red(\W+.*)*$/$1blue$2/

Let's break this down. The search pattern (between the first two slashes) has a subexpression (grouped with brackets) on each side of the word "red", which we're trying to replace. The first subexpression is anchored to the start of the string (with the caret). It looks for the start of the string, followed by a sequence of zero or more of any character, followed by zero or more nonword characters. (Both are allowed to match nothing because we might have the word "red" right at the start of the string.) We then have the word "red". Be warned that, because .* and \W* can each match nothing, this first group cannot by itself rule out a word character directly in front of "red": "inferred" still matches, a weakness that the positional metasymbols introduced later in this column cure.
The second subexpression states that "red" must be followed by the end of the string -- optionally preceded by a group containing one or more nonword characters, then some more arbitrary characters.

On the right hand side of the replace expression, we see "$1blue$2". What this means is, the match (which starts with ^ and ends with $, and therefore amounts to the entire string, if and only if it matches) is to be replaced with the first bracketed expression, the string "blue", then the second bracketed expression. And both of these expressions include the nonword stuff that precedes (or follows) our target word.

Incidentally, the s/// (replace) operator takes a modifier that makes the pattern a lot easier to read: if there's a trailing /x, we can include whitespace and comments in our regular expression. (The replacement half is ordinary text, so /x doesn't apply there; any spaces or comments we put in it would end up in the result.) So we can re-write it this way:

    $text =~ s/
        ^(.*\W*)     # group match leading chars and nonword chars, if any: $1
        red          # match literal text "red"
        (\W+.*)*$    # group match trailing nonword chars and text, if any: $2
    /$1blue$2/x ;

If you find the slashes confusing, you can replace them with braces like this:

    $text =~ s{
        ^(           # open group $1, anchored to start of string
          .*\W*      # match chars and nonword chars, if any
        )
        red          # match literal text "red"
        (            # open group $2, immediately after "red"
          \W+.*      # match nonword chars and optional chars
        )*           # group two is optional (zero or more occurrences)
        $            # anchor to end of string
    }{$1blue$2}x;    # what we replace this with

SUBHEADING: Repeatability!

There is one fundamental problem with the regular expression we built up in the last section: it doesn't work properly if there are two or more occurrences of "red" in a line. For example, "the red house has a red door" will be replaced by "the red house has a blue door". The reason it doesn't work is that regular expression quantifiers are greedy -- they will match as many characters in the string as possible. The initial grouped expression, $1, can match either "the " or "the red house has a ".
Being greedy, it will take the longer string; thus we'll end up only replacing the second "red".

The way we get around this is to remember that the s/// operator behaves like a function; it returns a false value (the empty string) if no substitution was made, or the number of substitutions it made. So we can put the whole thing in a little loop:

    1 while $text =~ s/^(.*\W*)red(\W+.*)*$/$1blue$2/ ;

The statement "1" doesn't do anything, but it acts as a body for a "while" loop, the condition of which is the search/replace expression. While the search/replace returns a non-zero number, we have stuff to do, so Perl evaluates the non-functional statement "1" and executes the loop condition again; when there are no more replacements, execution leaves the loop.

Of course, this is the hard way to replace words in Perl: as the language motto says, "there's more than one way to do it". In the table of metasymbols above, we met symbols that represent word characters or nonword characters. But there's a different set of metasymbols that represent positions:

BEGIN TABLE
Symbol  Meaning
\b      Word boundary (the position between a \w and a \W, in any order)
\B      Matches any position that is not a word boundary
\A      Beginning of string (even if multi-line matching is switched on)
\z      Matches at end of string
\Z      Matches right before newline at end of string (or EOS if no newline)
END TABLE

So, with these positional symbols we can re-write our substitution as:

    1 while $text =~ s/^(.*)\bred\b(.*)$/$1blue$2/ ;

or, more readably:

    1 while $text =~ s/^(.*)\b red \b(.*)$ /$1blue$2/x;

(Since \b makes the surrounding groups unnecessary, s/\bred\b/blue/g does the same job in a single pass; the /g modifier repeats the substitution for every match in the string.)

SUBHEADING: Translating sets

In addition to searching and replacing strings, Perl can translate characters. The operator we use for this is tr/// (transliterate), and instead of working on regular expressions, it works on characters; tr/a/b/ replaces all occurrences of the character "a" with the character "b", for example.
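As a quick check of the single-character case, here is a minimal sketch; the sample string and variable names are hypothetical. It also shows that tr///, like s///, returns a count, in this case the number of characters it translated.

```perl
use strict;
use warnings;

my $text  = "banana";               # hypothetical sample string
my $count = ($text =~ tr/a/b/);     # every "a" becomes "b"; no /g needed

print "$text ($count changed)\n";   # bbnbnb (3 changed)
```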
More usefully, tr/// works on sets of characters: tr/a-g/h-n/ replaces a's with h's, b's with i's, c's with j's, and so on (but doesn't do anything to characters later than g). Here's a simple one:

    $text =~ tr/a-z/A-Z/;

This translates all lowercase letters in $text into uppercase. And here's another:

    $text =~ tr/A-Za-z/N-ZA-Mn-za-m/;

This is the classic Caesar substitution cypher, also known as ROT-13 -- every letter is replaced by one shifted halfway through the alphabet. (Note that the square brackets used to identify a set in a regexp aren't needed in a transliterate expression; here they would be treated as ordinary characters.)

A useful mechanism to note is that you can represent any character in an 8-bit character set as a backslash followed by an octal number (three digits, padded with leading zeroes). For example, tr/\003/C/ will translate all occurrences of Control-C into a capital C.

It's also worth knowing that as of Perl 5.6, Perl regular expressions understand POSIX character classes -- a set of specially defined classes with names, such as [:digit:] (equivalent to \d) or [:space:] (roughly equivalent to \s), which are written inside a character set, as in [[:digit:]]. Moreover, Perl has the basics of Unicode support -- but that's a long, hairy topic that is best left to Chapter 3 of "Programming Perl".

There are some handy modifiers to tr///:

BEGIN TABLE
Modifier  Meaning
/c        complement the search list
/d        delete matched characters that don't have a defined replacement
/s        squash runs of duplicate replaced characters into one
END TABLE

For example,

    $text =~ tr/a-z/a-m/d ;

will delete all letters from n-z inclusive from $text.

END (Main text)

BOXOUT: Modular programs

When we use a variable like $text or define a subroutine called &mysub(), Perl stores it in a chunk of memory called a "namespace". By default, everything lives in a namespace called "main", and if you refer to "$text" without specifying the namespace, Perl assumes you're talking about $main::text. We can define variables in other namespaces; they spring into existence as soon as we mention them: for example $myprog::text = 1.
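A minimal sketch of two namespaces side by side; the string values are made up for illustration, and "our" is used here only to give an unqualified name for the variable in the current (main) namespace.

```perl
use strict;
use warnings;

$main::text   = "kept in main";     # the default namespace
$myprog::text = "kept in myprog";   # springs into existence on first mention

our $text;                          # unqualified name for $main::text

print "$text\n";                    # kept in main
print "$myprog::text\n";            # kept in myprog
```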
$myprog::text and $main::text are two completely different variables. They don't know about each other or share contents. Likewise, &main::mysub() may be a very different subroutine from &myprog::mysub().

Writing down the namespace every time we want to define a variable or subroutine in a foreign one can get tedious, so there's a simple keyword that tells Perl we're in a new namespace: package. If your Perl program contains a line like this:

    package MyPrivate;

every variable and subroutine defined after it will be prefixed with MyPrivate:: until you exit the package. (You exit a package by exiting the enclosing scope -- end of file, end of eval() block, or end of a bracketed chunk of code.)

Note that it's important to end a package file with some expression that evaluates to "true" (we'll see why in a moment). Thus, it's common to end a package with a line like this:

    1;

(one being non-zero, therefore true all the time).

The useful thing about packages is that they allow us to keep a bunch of code and data somewhere private -- this is important when writing a large, modular program. The annoying thing about packages is keeping track of them, and of the namespaces they use. A common, conventional solution is this: each package goes into a separate file, which begins with a "package" line naming the package and ends with "1;". The main program then has a series of lines at the top:

    require "package1";
    require "package2";
    # and so on

The keyword require tells Perl to search the directories in its module search path (a list of directories in the internal variable @INC) for a named package. You can either use the name you defined in the package file, or specify a relative pathname to the file. (Given a bare package name, as in "require MyPrivate;", Perl looks for a file called MyPrivate.pm; given a quoted string, it treats the string as the file name.) When Perl finds the package file, it loads, compiles, and executes it -- which is why we want it to return "1" (otherwise we'll end up with a runtime error).
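To see the package keyword at work without juggling several files, here is a single-file sketch that uses a bracketed block as the enclosing scope. MyPrivate and mysub are the names used above; the $greeting variable and the subroutine bodies are hypothetical.

```perl
use strict;
use warnings;

{
    package MyPrivate;      # everything up to the closing brace is in MyPrivate::
    our $greeting = "hello from MyPrivate";
    sub mysub { return "MyPrivate::mysub called" }
}                           # leaving the block puts us back in package main

sub mysub { return "main::mysub called" }

print mysub(), "\n";                # main::mysub called
print MyPrivate::mysub(), "\n";     # MyPrivate::mysub called
print "$MyPrivate::greeting\n";     # hello from MyPrivate
```

Splitting the block out into its own file, ending in "1;", and loading it with require is the conventional layout described above.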
Usually we just put subroutines and variables in packages, so nothing obvious happens -- but the extra namespace and code is sitting around waiting for our main program to use it.

There's a related keyword, "use", which is similar to require except that it is executed while your program is being compiled, instead of at run time; this means that errors will be detected sooner, and it allows you to set up your variables and package code before running the program.

In next month's tutorial we'll see how packages can be used to create objects -- bits of data that have associated private code -- and where "use" fits into the scheme of things.

END BOXOUT