Linux Format #32


[[ Typographical notes:


   indented text is program listing


   text surrounded in _underscores_ is italicised/emphasised


]]


Perl tutorial


///TITLE: A quick tour of Perl 5.8


///STRAP: Charlie Stross gives a whistle-stop tour of what's new in
the latest version of Perl, released last month


///SUBTITLE: History lesson


Perl was first released to the public around 1987, and evolved
rapidly into Perl 4.0; this was the standard version until the
first release of Perl 5, in 1994. Perl 5 has mushroomed in popularity
and is the standard flavour of Perl; work has been under way since
2000 on developing a radical successor (Perl 6, which we will cover
in great detail in the next Perl tutorial in Linux Format), but
for the time being Perl 5 has progressed slowly, with an emphasis
on bug fixes and stability improvements rather than changes to the
core language.


Since 2000, we've been running Perl 5.6 (actually 5.6.1 for the latest
patched release); this is the stable branch of the Perl development
tree, and unless your Linux system is more than two years old or you
like installing bleeding-edge development releases, it's the version 
on your computer right now. Development of the Perl 5 tree since 5.005
(released in 1998) has followed the naming convention of the Linux
kernel; that is, there's an even-numbered stable version, and an odd-
numbered development tree. Around April 2002, the Perl 5.7 development
branch was considered stable enough to start building release
candidates of Perl 5.8; and Perl 5.8 was officially released in July
2002.


What has this got to do with Perl 6.0?


The answer is: very little. Perl 6 is a complete redesign of the
core language, from the ground up. When it surfaces, it will probably
bear a slightly closer relationship to Perl 5.x than Java does to
C++ -- it'll be recognizably of the same family, and most Perl 5.x
code will actually compile under Perl 6, but it'll fundamentally
be a new language, at least as different as Perl 5 was from Perl
4. (Perl 5 added references, object-orientation, and modules --
not exactly minor changes!) But Perl 6 is still some way off, and
before it arrives there'll be a Perl 5.10 release. For now we
working stiffs are stuck with Perl 5.8. So what's changed?


///SUBTITLE:  Read Me First


Perl 5.8 is a maintenance release, but one with an eye on Perl 6. We
know -- from the list of RFCs and Larry Wall's Perl Apocalypse papers
-- a little bit about what features to expect in 6.0, so it's no
surprise to see funny stuff happening around the I/O side of things.
There's a full list of changes in 
http://dev.perl.org/perl5/news/2002/07/18/580ann/perldelta.pod ; but
here's an overview of the gotchas you'll run up against.


There are three major aspects to Perl 5.8. Firstly, it's not
binary-compatible with existing XS (eXternal Subroutine) modules --
the whole input/output system has been ripped out from under the
hood and replaced. Secondly, Unicode support has been beefed up
considerably, with several side-effects. And finally, the old
multi-threading model has been tossed on the scrapheap and replaced.
Most existing Perl 5.6 code will run happily enough on Perl 5.8,
but there are some constructs that will fail as a result of these
changes -- we'll tackle them in turn.


>>>
Binary incompatibility can be a major gotcha when upgrading Perl
versions. Because some Perl modules include extensions written in C
and compiled to shared libraries (XS modules), you will need to 
reinstall all your existing modules (see boxout, "Installing Perl
5.8"). More importantly, you must ensure that old binary modules
don't exist in the @INC search path of your new Perl, otherwise
you may experience erratic segmentation faults. (This is a particular
problem on Mac OS X, and may affect you if you installed Perl in some
non-standard location, but if your Linux installation uses the default
settings you should be all right.)
<<<
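

One way to check for stray binary modules is to walk the @INC directories and list anything that looks like a compiled extension. The following is only a sketch: the helper name find_binary_modules is made up for illustration, and the list of file suffixes is an assumption (your platform's real suffix is recorded in the Config module as "dlext").

```perl
use strict;
use warnings;
use File::Find;

# Hypothetical helper: list files that look like compiled extensions
# under the given directories. The suffix list is an assumption.
sub find_binary_modules {
    # Skip code references, the current directory and missing directories.
    my @dirs = grep { !ref($_) && $_ ne '.' && -d $_ } @_;
    return () unless @dirs;

    my %seen;
    find(
        {
            no_chdir => 1,    # $_ holds the full path of each file
            wanted   => sub {
                $seen{$_} = 1 if -f $_ && /\.(?:so|bundle|dll)\z/;
            },
        },
        @dirs
    );
    return sort keys %seen;
}

my @binaries = find_binary_modules(@INC);
print scalar(@binaries), " compiled extension file(s) under \@INC\n";
print "  $_\n" for @binaries;
```

Run it with your new Perl: any file it reports from a directory belonging to the old installation is a candidate for the segmentation faults described above.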


Perl traditionally provided file handles as a user-level abstraction
for dealing with input and output. Perl 5.8 still uses them, but
the underlying C library Perl relies on -- stdio -- has been replaced
by the PerlIO framework. PerlIO relies on a lower level library to
handle direct input/output to files or operating system devices. As
a result, it allows layers to be added that do "\n" to CRLF
translation, or some other useful task, or to talk to different types
of file store. Layers can use different buffering schemes, and extra
layers can be inserted under Perl -- for example, to translate between
Perl's native character encoding (Unicode UTF-8) and whatever native
format is used by the operating system.


This is an important move for the future, but has several side-effects.
Firstly, any modules that use XS need to be recompiled when you
switch to Perl 5.8 from 5.6.1. Secondly, XS modules that aren't
PerlIO-aware may be unsupported in future -- this probably won't
affect you immediately, because the PerlIO system is designed to
look identical to the older stdio-based interface, but it may have
effects on modules that try to do odd things to file handles. Globbing
on filehandles is deprecated -- we're supposed to use IO objects
instead, when passing references to data sources around. And there are
changes to the way layers are handled: the ":raw" layer (aka
"discipline") is now formally defined as equivalent to binmode().


There are some other fun effects. For example, the old IO::Stringy
module is now obsolete: it's legal to open a file handle on a
variable:


///code///


my $trap_output;
open(my $fh, ">", \$trap_output) or die "Can't open in-memory file: $!";


///end code///


This directs the output of writes to $fh into the scalar $trap_output.
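

As a slightly fuller sketch of the same idea (the strings written here are purely illustrative), you can print to the handle, close it, and then read the captured text back by opening the scalar for input:

```perl
use strict;
use warnings;

my $trap_output = '';

# Writes to $fh land in $trap_output rather than in a file on disk.
open(my $fh, ">", \$trap_output) or die "Can't open in-memory file: $!";
print $fh "first line\n";
printf $fh "%d items\n", 3;
close($fh) or die "Can't close in-memory file: $!";

print $trap_output;

# The same trick works for reading: open a handle on existing data.
open(my $in, "<", \$trap_output) or die "Can't reopen scalar for reading: $!";
my @lines = <$in>;
close($in);

print scalar(@lines), " lines captured\n";
```

This is handy for capturing the output of code that insists on printing to a file handle.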


And you can create anonymous temporary files:


///code///


open(my $tmpfile, "+>", undef) or die "Can't create temporary file: $!";


///end code///
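

Because the handle is opened in "+>" mode you can both write to it and read from it. A minimal sketch of a round trip (the text written is arbitrary) looks like this:

```perl
use strict;
use warnings;

# An undefined filename asks Perl for an anonymous temporary file.
open(my $tmpfile, "+>", undef) or die "Can't create temporary file: $!";

print $tmpfile "scratch data\n";

# Rewind to the start before reading back what was written.
seek($tmpfile, 0, 0) or die "Can't rewind temporary file: $!";
my $line = <$tmpfile>;
close($tmpfile);

print "Read back: $line";
```

There is no filename to clean up afterwards; the data lives only as long as the handle is open.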


Unicode support was added to Perl in 5.6; in a nutshell, Unicode
is a character set (and encoding scheme) that is intended to supplant
the old ASCII character set by providing support for just about
any writing system, including the largest Chinese, Japanese and
Korean dictionaries. Unicode uses a number of encoding schemes,
including UTF-8, a variable-length scheme built from 8-bit units in
which plain ASCII text is unchanged, but Unicode characters aren't
bound to any integer width. Each Unicode character is identified by
a "code point" (a number with a name, such as "LATIN CAPITAL LETTER
A"), and a base character can be followed by one or more modifiers
(such as "COMBINING ACUTE ACCENT"). Characters also have properties
("uppercase", "lowercase", "punctuation") and collating sequences.
A base character together with its modifiers is called a "combining
character sequence".


Perl 5.8 is the first fully Unicode-compliant release of Perl.
Normally, if all code points in a string are of value 0xFF or less,
Perl treats the string as being of the native 8-bit character set;
otherwise it assumes that the string is UTF-8 encoded. If you
specifically want to output UTF-8, you can use the :utf8 output layer
in PerlIO by explicitly attaching it to a filehandle with binmode():


///code///


binmode(STDOUT, ":utf8");


///end code///


You can use other output layers too:


///code///


open(my $fh, ">:crlf :utf8", "myfile.$$") or die "Can't open myfile.$$: $!";


///end code///


This applies a "\n"-to-CRLF translation layer and the UTF-8
translation layer to "myfile.$$" when it is opened for output.

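
To see what the layers actually do, you can write through them and then read the file back with the ":raw" layer to inspect the bytes on disk. This is a sketch only: it uses a scratch file from the core File::Temp module rather than "myfile.$$", the sample text is made up, and the layers are written without the intervening space.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Scratch file that is deleted when the program exits.
my ($tmp, $path) = tempfile(UNLINK => 1);
close($tmp);

# Write one line through the CRLF and UTF-8 layers.
open(my $out, ">:crlf:utf8", $path) or die "Can't write $path: $!";
print $out "caf\x{e9}\n";
close($out) or die "Can't close $path: $!";

# Read the file back with no translation at all.
open(my $raw, "<:raw", $path) or die "Can't read $path: $!";
my $bytes = do { local $/; <$raw> };
close($raw);

printf "%d bytes on disk: %s\n", length($bytes), unpack("H*", $bytes);
```

The five characters written should come back as seven bytes: the accented letter takes two bytes in UTF-8, and the newline becomes a carriage return followed by a line feed.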

You can create Unicode characters in string literals in Perl by using
the \x{} notation in double-quoted strings or regular expressions, or
chr() to return a Unicode character at runtime:


///code///


my $smiley = "\x{263a}";


# or


print "Smiley detected!\n" if $string =~ /\x{263a}/;


///end code///
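

The difference between characters and bytes is easy to see with a short sketch. It assumes the core Encode module, which ships with Perl 5.8; the sample string is made up.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $smiley = chr(0x263a);                # same character as "\x{263a}"
my $bytes  = encode("UTF-8", $smiley);   # the character as UTF-8 bytes

printf "code point U+%04X\n", ord($smiley);
printf "%d character, %d bytes as UTF-8\n", length($smiley), length($bytes);

# Regular expressions match on characters, not on encoded bytes.
my $string = "hello $smiley";
binmode(STDOUT, ":utf8");
print "Smiley detected in: $string\n" if $string =~ /\x{263a}/;
```

Note that length() counts characters, so the smiley is one character long even though it occupies three bytes once encoded.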


The basics of Unicode handling are explained in the POD document
"perluniintro" -- if you're likely to have to handle Unicode text you
really need to read this, because it explains how to apply Perl's 
text mangling capabilities to these character sets.


A few related things have happened to string handling in the
migration to Unicode. For example, the string relational operators
"ge", "lt", "eq", and so on used to have uppercase aliases ("GE",
"LT", "EQ" ...). These have now been dropped. A couple of
unimplemented POSIX regular expression features that formerly failed
silently now cause fatal errors, and so on.


Threading is a hairy subject; essentially, when you spawn a thread you
tell your program that execution can proceed in parallel instances of
the same program, with some access to shared data. The new ithreads
implementation forces data sharing to be explicit, rather than
implicit -- it's explained in the "perlthrtut" POD file. Ithreads is
now considered stable. (I'm not going to go into it here -- threading
with ithreads will be covered in a future tutorial.)


Maybe as a side-effect of the multithreading work, Perl 5.8 has
considerably beefed up its signal handling capability. Signals are
now handled robustly -- they are deferred until Perl has finished
processing the current opcode, in order to prevent them from
corrupting Perl's internal state. However, use of signals to break out
of potentially blocking operations is still possible.
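

The usual way to break out of a blocking operation is an alarm signal whose handler calls die inside an eval block. Here is a minimal sketch of that idiom; the with_timeout helper is a made-up name, and the four-argument select() call simply stands in for a slow operation.

```perl
use strict;
use warnings;

# Hypothetical helper: run $code, giving up after $seconds.
# Returns 1 if the code finished, 0 if it timed out.
sub with_timeout {
    my ($seconds, $code) = @_;

    my $ok = eval {
        local $SIG{ALRM} = sub { die "timeout\n" };
        alarm($seconds);
        $code->();
        alarm(0);    # finished in time: cancel the pending alarm
        1;
    };
    my $error = $@;
    alarm(0);        # make sure no alarm is left pending

    return 1 if $ok;
    die $error unless $error eq "timeout\n";    # re-throw unrelated errors
    return 0;
}

# Stand-in for a blocking operation: wait for up to five seconds.
my $finished = with_timeout(1, sub { select(undef, undef, undef, 5) });
print $finished ? "finished in time\n" : "timed out\n";
```

The handler itself does as little as possible; all the real work happens after the eval returns.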


On top of these three significant changes (Unicode, PerlIO, and
ithreads), a whole load of new modules have found their way into the
core Perl distribution. For example, there are now switch and case 
constructs in Perl -- just use the Switch module:


///code///


use Switch;


switch($key) {
   case "a" {print "you pressed 'a'\n" }
   case "b" {print "you pressed 'b'\n" }
   case "q" {print "quitting"; last; }
   else {
      # do something here
   }
}


///end code///


There's no substitute for reading the perldelta pod document; a whole
lot has changed in 5.8. However, for the most part it'll be a pleasant
experience (unless you rely on taintperl or on globbing filehandles,
both features that have died or are on the way out). In particular,
most of the changes make life easier -- for example, the new PerlIO
layers make a bunch of IO modules obsolete and unnecessary.


///BOXOUT: Installing Perl 5.8


Installing Perl 5.8 goes pretty much the same as for any previous
version of Perl. If you don't want to use pre-packaged RPMs from your
Linux distributor, you go to a mirror of CPAN -- the Comprehensive Perl
Archive Network -- such as ftp://ftp.demon.co.uk/pub/perl/CPAN. Look
in the "src" subdirectory and grab the file perl-5.8.0.tar.gz. Then
become root and type the following magical incantation (bearing in
mind that it'll take some time to run):


///code///


tar xvzf perl-5.8.0.tar.gz
cd perl-5.8.0
./Configure -des
make
make test
make install
cd /usr/include && h2ph *.h sys/*.h linux/*.h
cd -
installhtml --help


///end code///


This should -- if nothing blows up -- tell the Perl distribution to
autoconfigure itself, compile and test itself, install the results,
then create the Perl header files and show the options for installhtml,
the script that installs the documentation as HTML.
The place where Perl installs itself is usually /usr/local; it's
dictated by the file in the "hints/" subdirectory of the Perl source
tree that corresponds to your operating system. (You can live
dangerously and tell Perl 5.8 to install in /usr by either running
Configure interactively, without the "-des" arguments, or by editing 
hints/linux.sh or the config.sh file that Configure generates.)


This installs a new copy of Perl, but it doesn't convert all your old
modules over. To do that, _before_ you install your new Perl you
should do the following with your old Perl:


///code///


perl -MCPAN -e autobundle


///end code///


"Autobundle" generates a special bundle file -- a listing of all the
modules installed under your current Perl's library tree. The bundles
are written into your .cpan/Bundle subdirectory (with a name beginning
"Snapshot" followed by the current date -- such as
Snapshot_2002_07_22_00.pm).


If you generate a bundle file, you can make your freshly installed
Perl reload all the modules listed in it by first configuring the CPAN
module (type "perl -MCPAN -e shell" and answer the questions), then
telling CPAN to install the bundle:
"perl -MCPAN -e 'install Bundle::Snapshot_2002_07_22_00'". As long as
the bundle is in your @INC search
path, Perl will find it and reinstall each module listed in it.


///END BOXOUT (Installing Perl 5.8)


///END COPY