Linux Format 35 Perl Tutorial:

///TITLE: Perl 6 Regular Expressions -- a Match made in Heaven?

///STRAP: Charlie Stross goes dumpster-diving in Larry Wall's in-tray and comes back with an explanation of how Perl 6 is going to improve on Perl 5's pattern matching

[[ TYPOGRAPHICAL NOTE: text enclosed in ///BEGIN CODE and ///END CODE is a code listing. Text in the body copy surrounded by _underscores_ (thus) is italicised or emphasized -- but _not_ in the code listings ]]

///BEGIN BODY COPY

///SUBHEADING: A nest of chaos

Perl wasn't the first programming language to make a big deal out of using patterns to manipulate strings, but Perl's approach is one of the most powerful, and goes a long way towards explaining the language's enduring popularity. Older UNIX tools, starting with the ed text editor and the Bourne shell, used an abstract language called "regular expressions" to define the shape of character strings rather than their actual content. The sed streaming editor and, later, the awk pattern-matching language generalised the concept further, and Perl started out with a superset of awk's pattern-matching system.

Back in the days of sed and awk, most text that you came across on a UNIX system consisted of plain old-fashioned unstructured ASCII, with line breaks. Perl's regular expression system was developed to cope with this sort of material. For example, it's dead easy to write a basic expression that matches an English sentence:

///BEGIN CODE
/
    [A-Z]       # first letter capitalized
    [a-z]       # second letter lowercased -- avoids catching initials
    (.+?)       # one or more subsequent characters, minimal match
    [\.\!\?]    # end of sentence
/xs
///END CODE

This uses Perl 5.6 vernacular -- the 'x' modifier after the end of the pattern indicates that whitespace and comments are allowed in the pattern, and the 's' modifier indicates that the '.' character should match a newline too, so that a sentence can run across several lines. It also uses minimal matching -- (.+?)
matches the shortest sequence of one or more characters preceding the following (end of sentence) pattern. Note that this approach isn't perfect -- we might run into problems with quotation marks around periods -- but it's a fairly robust expression for dealing with blocks of ordinary ASCII text.

Where it breaks down is in two places -- dealing with non-ASCII text, and dealing with structured text.

Non-ASCII text came up first, in the form of Unicode and related multi-byte codesets. Some support for Unicode was bolted into Perl 5.6, in particular the ability to set the number of bits to match for a character and to use predefined character sets, but it fundamentally broke the clean simplicity of the earlier pattern matching model by essentially making it modal.

But the biggest challenge is structured text, such as HTML or XML (or any programming language). Structured text is _recursive_ -- that is, it consists of elements which may contain other, nested, elements. An example is the common problem (for Perl regexp wizards) of ripping the plain text contents out of an HTML file:

///BEGIN CODE
Example file

A simple file

Hi there! I am a very simple example of a structured file.

///END CODE

The goal of extracting the textual contents of this file is to produce something like this:

///BEGIN CODE
Example file
Hi there! I am a very simple example of a structured file.
///END CODE

A naive approach is to simply rip out anything resembling an HTML tag:

///BEGIN CODE
$file =~ s/<.+?>//g;
///END CODE

This works up to a point. But what if we want to extract only the text that is in italics and boldface, but not in italics _or_ boldface? To do this, our pattern matching system needs to know something about the context in which it is trying to match a string -- which tags it is enclosed by. This rapidly gets hairy, because the current string could be nested within any number of tags.

The Perl 5 way of dealing with this sort of problem is to use a proper parser module, such as Parse::RecDescent or Parse::Yapp. These parsers allow you to define the structure of a file from the bottom up (specifying what elements look like), and associate actions to take (in the form of code to execute) with each element. But this is rather clumsy, and goes against the Perl 6 goals of (a) simplifying things, and (b) turning Perl 6 into a language for writing application domain-specific mini-languages.

///SUBHEADING: The Perl 6 way

The first thing to notice about Perl 6's approach to regular expressions is that the 'x' modifier is turned on by default. Available since Perl 5, this modifier means that whitespace is allowed in regular expressions -- if you want to match whitespace you have to use the \s (whitespace) pattern or backslash-escape it. By the same token, comments are allowed in regular expressions to make them more readable -- which makes sense when you consider that the Perl 6 system is geared towards writing full-scale parsers.

Next, there's the addition of the regular expression constructor, rx//; this generates a new expression which can be stored in a scalar variable.
For example:

///BEGIN CODE
my $pattern = rx /
    <$tag>      # match whatever the rule $tag matches
    .*?         # minimum-match of any character string
    <$endtag>   # match whatever the rule $endtag matches
/ ;
///END CODE

(Note that the rx constructor we use to build a regular expression object isn't a quoting mechanism (like the older qq// syntax); we can refer to $tag and $endtag before they're defined and they'll be interpolated whenever we actually apply $pattern to a target string at runtime.)

It's also worth noting that the modifiers that specify how a regular expression should be processed are all changing. You can stack a bundle of modifiers in front of a pattern delimiter, to do things like specify case-insensitivity, repeatedly match as many times as possible (changed to 'e' from Perl 5's 'g' modifier), pretend to be Perl 5 (for backward compatibility), and so on.

It's new terminology time in Perl land: we now have named patterns, called _rules_, and we can embed rules in each other using <> to enclose them. If we omitted the angle-brackets and simply put $tag and $endtag in our expression, they would be interpolated into $pattern when it was executed -- but as strings, not as rules. We can apply other regular expression modifiers to rules: for example, <$tag>{2,3} matches when the pattern in $tag matches either two or three times.

Another major twist in Perl 6 is the ability to match arrays or hashes directly:

///BEGIN CODE
@colour = ('red', 'blue', 'green', 'orange');
$text =~ / @colour /;
///END CODE

Within the regular expression, '@colour' is replaced by a pattern matching any of its contents -- in Perl 5 we'd have had to construct a regular expression from it first (and remember the 'x' modifier, or the spaces in the pattern would be matched literally):

///BEGIN CODE
@colour = ('red', 'blue', 'green', 'orange');
$colour = join '|', map { quotemeta $_ } @colour;
$text =~ / (?: $colour ) /x;
///END CODE

Now things begin to get peculiar. In Perl 5, square brackets enclosed character sets. In Perl 6 they don't; they enclose a _noncapturing group_.
Character sets like [A-Z] may work in ASCII, but they're less useful in Unicode, where they'll stop working if you accidentally blunder into the wrong language and find yourself processing Greek or Chinese text by mistake. A noncapturing group is one that matches some text but doesn't capture it (for interpolation).

You can still do character classes in Perl 6, but you need to put them inside the metasyntactic marker (angle brackets). We do have some special names for frequently used characters, though: '<sp>' for space, '<ws>' for whitespace, '<lt>' and '<gt>' for angle brackets, '<dot>' for a period, and so on.

Using metasyntactic markers and rules lets us build arbitrarily complex trees of patterns -- essential if we're going to write a recursive-descent parser. But as anyone who's written a parser knows, matching patterns is only half the problem; if you're matching some input against a complex ruleset, and part of a ruleset has failed, you need to have a mechanism to control backtracking. Perl 5 controlled backtracking implicitly, but Perl 6 has a much more fine-grained mechanism.

We can control backtracking in Perl 6 by using the ":", "::" and ":::" operators after a rule, but within a noncapturing group. For example:

///BEGIN CODE
[ <$rule1> : <$rule2> <$rule3> ]
///END CODE

This attempts first to match $rule1, then $rule2 and $rule3. But the colon operator means "if the preceding match fails, don't bother backtracking to the previous element". A double-colon means "don't bother backtracking within the enclosing group", and a triple-colon means "if we have to backtrack here, the entire rule fails (including but not limited to the current group)". And there's a special directive named <commit> -- if we have to backtrack through a commit, the entire match fails immediately. (This is what we'd use in throwing a syntax error from inside the parser of a compiler, for example.)
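
If you want a feel for what "don't bother backtracking" means before Perl 6 arrives, Perl 5 already has a blunter tool in the same spirit: the atomic group, (?>...). The listing below is only a rough sketch of the idea, not Perl 6 syntax and not a claim about how the colon operators will be implemented; the variable names are made up for illustration.

```perl
use strict;
use warnings;

my $text = 'aaab';

# Ordinary greedy quantifier: a+ first grabs "aaa", then hands one "a"
# back so that the literal "ab" which follows can match.
my $with_backtracking = $text =~ /^a+ab$/ ? 1 : 0;

# Atomic group: once (?>a+) has swallowed "aaa" it refuses to give
# anything back, so the literal "ab" can never match.
my $without_backtracking = $text =~ /^(?>a+)ab$/ ? 1 : 0;

print "greedy: $with_backtracking, atomic: $without_backtracking\n";
```

The first match succeeds and the second fails, even though the two patterns differ only in whether the engine is allowed to revisit a decision it has already made. That is the trade the Perl 6 colons let you make deliberately: a faster, more predictable failure in exchange for giving up some matches.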
///SUBHEADING: More fun new features

In Perl 5, it was pretty damn hard to read from a file handle and apply some kind of pattern-match to the process as we did it. There were workarounds (such as IO::Stringy), but in general you couldn't apply pattern matching to file handles directly without using a scalar as a buffer. Perl 6 fixes this by letting you bind an input stream to a scalar and match against it:

///BEGIN CODE
my $text is from($*ARGS);   # Bind scalar to input stream
if $text =~ /<$pattern>/ {
    # do something
}
///END CODE

In addition to letting us do fun things with input streams, Perl 6 makes it easier to manage complex pattern-matching tasks by adding some new declarations -- grammar and rule. A grammar is the pattern-matching equivalent of a Perl module; a collection of rules grouped together in curly brackets and denoted by a name, in their own namespace. Rules within a grammar are named, and may be referred to as grammarname.rulename, just as methods within a module are referred to as Modulename::methodname. The grammar declaration either applies to the block immediately following its name, or to the rest of the file. For example (from Damian Conway):

///BEGIN CODE
grammar HTML {
    rule file :iw { \Q[] \Q[] }
    rule head :iw { \Q[] + \Q[] }
    # etc.
} # Explicit end of HTML grammar
///END CODE

We can match against named rules by putting the name in angle brackets, for example:

///BEGIN CODE
$file =~ //
///END CODE

The analogy between rules and subroutines goes even further; like subroutines, we can pass arguments to rules (including the new Python-style of argument passing with name/value tuples that Perl 6 is assimilating).

A rule-based parser in Perl 6 builds a hierarchical data structure of results as it goes along. Whereas in Perl 5 the result of a pattern match would be available as $1, $2 ... $_n_ for substitution, and $& as the entire matched string, in Perl 6 the entire match lives in $0, and $0 is a hash, too.
If we match a sub-rule, its contents become accessible via $0{rulename}; this works recursively, so the result of running a bunch of recursive rules against a structured file is actually a parse tree of hashes.

At this point, we've only begun to scratch the surface of Perl 6's regular expression language. We can embed arbitrary code blocks in rules, so that if we're writing a web browser (for example), then encountering a tag name can trigger some graphical action (such as rendering its contents in boldface). The possibilities are huge -- basically the parser modules (such as Parse::RecDescent) are now obsolete, with equivalent functionality built into the core language. Perl 5 needed no lex-workalike, but Perl 6 builds in the same functionality as yacc (albeit better designed).

If you want to read more about writing compilers and parsers in Perl 6, Damian Conway's Exegesis 5 (see http://www.perl.com/pub/a/2002/08/22/exegesis5.html) is currently the place to start; in the meantime, rest assured that although your existing regular expressions will still work (with the :perl5 modifier to tell Perl 6 to be backward compatible), things just got a whole lot more powerful.

///END BODY COPY