Linux Format 35 Perl Tutorial:

///TITLE: Perl 6 Regular Expressions -- a Match made in Heaven?

///STRAP: Charlie Stross goes dumpster-diving in Larry Wall's in-tray and comes back with an explanation of how Perl 6 is going to improve on Perl 5's pattern matching

[[ TYPOGRAPHICAL NOTE: text enclosed in ///BEGIN CODE and ///END CODE is a code listing. Text in the body copy surrounded by _underscores_ (thus) is italicised or emphasized -- but _not_ in the code listings ]]

///BEGIN BODY COPY

///SUBHEADING: A nest of chaos

Perl wasn't the first programming language to make a big deal out of using patterns to manipulate strings, but Perl's approach is one of the most powerful, and goes a long way towards explaining the language's enduring popularity. Older UNIX tools, starting with the ed text editor and the Bourne shell, used an abstract language called "regular expressions" to define the shape of character strings rather than their actual content. The sed streaming editor and the later awk pattern-matching language generalised the concept further, and Perl started out with a superset of awk's pattern-matching system.

Back in the days of sed and awk, most text that you came across on a UNIX system consisted of plain old-fashioned unstructured ASCII, with line breaks. Perl's regular expression system was developed to cope with this sort of material. For example, it's dead easy to write a basic expression that matches an English sentence:

///BEGIN CODE
/
  [A-Z]     # first letter capitalized
  [a-z]     # second letter lowercased -- avoids catching initials
  (.+?)     # one or more subsequent characters, minimal match
  [\.\!\?]  # end of sentence
/xs
///END CODE

This uses Perl 5.6 vernacular -- the 'x' modifier after the end of the pattern indicates that whitespace and comments are allowed in the pattern, and the 's' modifier indicates that we want the '.' character to match newlines as well, so that a sentence may run across several lines. It also uses minimal matching -- (.+?)
matches the shortest sequence of one or more characters preceding the following (end of sentence) pattern. Note that this approach isn't perfect -- we might run into problems with quotation marks around periods -- but it's a fairly robust expression for dealing with blocks of ordinary ASCII text.

Where it breaks down is in two places -- dealing with non-ASCII text, and dealing with structured text.

Non-ASCII text came up first, in the form of Unicode and related multi-byte codesets. Some support for Unicode was bolted onto Perl 5.6, in particular the ability to set the number of bits to match for a character and to use predefined character sets, but it fundamentally broke the clean simplicity of the earlier pattern matching model by essentially making it modal.

But the biggest challenge is structured text, such as HTML or XML (or any programming language). Structured text is _recursive_ -- that is, it consists of elements which may contain other, nested, elements. An example is the common problem (for Perl regexp wizards) of ripping the plain text contents out of an HTML file:

///BEGIN CODE
Hi there! I am a very simple example of a structured file.
///END CODE

The goal of extracting the textual contents of this file is to produce something like this:

///BEGIN CODE
Example file Hi there! I am a very simple example of a structured file.
///END CODE

A naive approach is to simply rip out anything resembling an HTML tag:

///BEGIN CODE
$file =~ s/<.+?>//g;
///END CODE

This works up to a point. But what if we want to extract only the text that is in italics _and_ boldface, but not the text that is in italics _or_ boldface alone? To do this, our pattern matching system needs to know something about the context in which it is trying to match a string -- which tags it is enclosed by. This rapidly gets hairy, because the current string could be nested within any number of tags.

The Perl 5 way of dealing with this sort of problem is to use a proper parser module, such as Parse::RecDescent or Parse::Yapp. These parsers allow you to define the structure of a file from the bottom up (specifying what elements look like), and associate actions to take (in the form of code to execute) with each element. But this is rather clumsy, and goes against the Perl 6 goals of (a) simplifying things, and (b) turning Perl 6 into a language for writing application domain-specific mini-languages.

///SUBHEADING: The Perl 6 way

The first thing to notice about Perl 6's approach to regular expressions is that the 'x' modifier is turned on by default. Introduced in Perl 5, this modifier means that whitespace is allowed in regular expressions -- if you want to match whitespace you have to use the \s (whitespace) pattern or backslash-escape it. By the same token, comments are allowed in regular expressions to make them more readable -- which makes sense when you consider that the Perl 6 system is geared towards writing full-scale parsers.

Next, there's the addition of the regular expression constructor, rx//; this generates a new expression which can be stored in a scalar variable.
For example:

///BEGIN CODE
my $pattern = rx /
    <$tag>      # match whatever the rule $tag matches
    .*?         # minimum-match of any character string
    <$endtag>   # match whatever the rule $endtag matches
/ ;
///END CODE

(Note that the rx constructor we use to build a regular expression object isn't a quoting mechanism (like the older qq// syntax); we can refer to $tag and $endtag before they're defined and they'll be interpolated whenever we actually apply $pattern to a target string at runtime.)

It's also worth noting that the modifiers that specify how a regular expression should be processed are all changing. You can stack a bundle of modifiers in front of a pattern delimiter, to do things like specify case-insensitivity, repeatedly match as many times as possible (changed to 'e' from Perl 5's 'g' modifier), pretend to be Perl 5 (for backward compatibility), and so on.

It's new terminology time in Perl land: we now have named patterns, called _rules_, and we can embed rules in each other using <> to enclose them. If we omitted the angle-brackets and simply put $tag and $endtag in our expression, they would be interpolated into $pattern when it was executed -- but as strings, not as rules. We can apply the usual repetition quantifiers to rules: for example, <$tag>{2,3} matches when the pattern in $tag matches either two or three times.

Another major twist in Perl 6 is the ability to match arrays or hashes directly:

///BEGIN CODE
@colour = ('red', 'blue', 'green', 'orange');
$text =~ / @colour /;
///END CODE

Within the regular expression, '@colour' is replaced by a pattern matching any of its contents -- in Perl 5 we'd have had to construct a regular expression from it first:

///BEGIN CODE
@colour = ('red', 'blue', 'green', 'orange');
$colour = join '|', map { quotemeta $_ } @colour;
$text =~ /(?:$colour)/;   # no stray spaces: without 'x' they would match literally
///END CODE

Now things begin to get peculiar. In Perl 5, square brackets enclosed character sets. In Perl 6 they don't; they enclose a _noncapturing group_.
Character sets like [A-Z] may work in ASCII, but they're less useful in Unicode, where they'll stop working if you accidentally blunder into the wrong language and find yourself processing Greek or Chinese text by mistake. A noncapturing group is one that matches some text but doesn't capture it for later use (in $1 and friends). You can still do character classes in Perl 6, but you need to put them inside the metasyntactic marker (angle brackets). We do have some special names for frequently used characters, though: '