Linux Format 23
Perl tutorial


[[ TYPOGRAPHY/LAYOUT -- text surrounded by _underscore_ characters like 
   so should be italicized or emphasized. Text indented from the margin by 
   two or more characters is a program listing: needs monospaced typeface,
   with indentation and word wrap preserved. Contact me if it needs
   changing to fit the page. 
]]


HEADING: Using CPAN to install and control your modules


Probably the most powerful feature of the Perl language and its community
isn't part of the language at all -- but it's intimately tied in to Perl's
free software history, and makes the language even more useful as a jack of
all trades toolkit.


I'm talking, in case you hadn't guessed, about CPAN -- the Comprehensive
Perl Archive Network.


Like most programming languages, the Perl distribution includes a standard
library of modules that give you a range of convenient re-usable tools.
You might think that Perl's huge range of built-in keywords -- just
about everything in the standard C library is a command in Perl -- gives
you enough; but because Perl is a very high level language, there are some
very high level libraries out there.


Of late, the core distribution has bloated somewhat, but there are limits
to how many modules can be bundled in the standard kit. So many of the
most useful tools are distributed separately, via CPAN.


CPAN is a large archive of Perl-related stuff, including the Perl source
code, the repository of published modules, oodles of documentation, ports
to strange operating systems, and many other items. Anyone can download
stuff from CPAN, and you can do this either manually (using an FTP client
or web browser), or using the CPAN.pm module that is part of the core
distribution (to automate the process of upgrading your Perl installation).


Stuff finds its way into CPAN via PAUSE, the Perl Authors Upload Server.
Suppose you've got a new module that you've written and that you think is
of interest to other people. You've checked the definitive list of
CPAN modules (at http://www.cpan.org/modules/00modlist.long.html) and
nobody has contributed anything quite like your own, so you contact
the module list maintainers (modules@perl.org). They'll discuss
whether it's appropriate to put your module into CPAN, and where it
should fit in the two-level-deep namespace. If you agree, they'll
assign you a Perl author ID and you can then use the forms at PAUSE
(http://www.cpan.org/modules/04pause.html ) to upload your module into
your CPAN directory. From the central PAUSE server, updates are mirrored
onto the 197 CPAN satellite sites every hour.


If you're a user rather than a producer, you don't need to concern yourself
with PAUSE; instead, you need to know about CPAN and WAIT. The module 
list (see URL above) is an exhaustive catalogue of the Perl modules
publicly available from CPAN and in the core Perl distribution.
Modules in the list are broken down by functional category, and 
identified by development status, support level, language used, 
interface style, and software license type. You really, _really_ need
to explore the module list before embarking on a new piece of 
software that does something that involves serious coding; it's 
quite possible that tools to help with the task you want to accomplish 
have already been contributed to CPAN, and using them will make your
job much easier.


One problem with CPAN which should be extremely familiar to you if you're
a Linux user is the dependency problem: you want to install a module 
(say, Pod::RTF, because you want to take your laboriously-written POD
documentation and turn it into RTF for conversion into a Microsoft help
file or import into StarOffice), and you find that it requires a bundle
of other modules before it will run (Pod::Parser, and some RTF utilities).
This is what the CPAN.pm module is for. 


CPAN.pm is two things: a set of routines for searching, downloading,
compiling, and caching Perl modules from CPAN -- so that you can write
Perl programs that install their own prerequisites if you distribute them
to customers -- and an interactive text-based shell that lets you
do all this from the command line.


You can bring up the interactive CPAN shell on your Linux box like
this:


  perl -MCPAN -e shell


(This tells Perl to load the CPAN.pm module and execute the command
"shell", which is a subroutine that provides the interactive shell).


The first time you use the CPAN shell, it will prompt you for various
bits of information -- notably, the nearest CPAN sites (from a long
list of mirrors on the net), how much cache space you want to use,
whether you're behind a firewall or need to use web or ftp proxies,
what sort of utilities to use for fetching modules, and so on. One useful
item buried in the CPAN manual pages is that it works with file: URLs,
so you can point it at a local CDROM image (if you've got a CDROM of CPAN
kicking around). This information gets stashed in CPAN/Config.pm in your
Perl library directory, but can be overridden by a file ~/.cpan/CPAN/MyConfig.pm,
so if you're trying to distribute a self-installing Perl program you
can pre-configure some sensible defaults, or edit the settings to suit
yourself.  The CPAN configuration consists of a hash; the keys and their
acceptable values are listed in the man page under 'CONFIGURATION'. (Type
'perldoc CPAN' or 'man CPAN' to read this.)
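

As a rough illustration, a fragment of a per-user MyConfig.pm might look
something like the listing below. The keys shown are ones described in
the CPAN man page, but the values (the CD-ROM path, the mirror URL, the
cache size) are placeholders invented for this example, and a real file
will contain many more entries than this.


  $CPAN::Config = {
    # try a local CD-ROM image first, then a mirror (both placeholders)
    'urllist' => [ q[file:///mnt/cdrom/CPAN],
                   q[ftp://ftp.example.org/pub/CPAN/] ],
    'build_cache' => q[10],             # cache size, in megabytes
    'prerequisites_policy' => q[ask],   # ask, follow or ignore
    'pager' => q[less],
  };
  1;
  __END__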


One note: if you want to use CPAN from behind a firewall you may have to
jump through some configuration tricks to do so. These are described in
the manual. The main obstacles are firewalls that rely on IP masquerading
and filter out active FTP connections. You may also have headaches if
Net::FTP and some other modules aren't installed, although CPAN is able
to fall back on NcFTP, Lynx, and other external programs to replace
built-in Perl modules for fetching files.


SUBHEADING: Fancy CPAN tricks


Once you've got the CPAN shell configured properly, the first thing you
want to do is to install Term::ReadKey and Term::ReadLine::Perl -- these
add command history editing and recall to CPAN. At the cpan> prompt, type:


   install Term::ReadKey


And CPAN.pm will go away, fetch the module, compile, and install it (do
the same for Term::ReadLine::Perl). When you've done this, type:


   reload cpan


And CPAN.pm will reload itself (and add the history editing
facilities).


You can use the CPAN shell to search for items in CPAN -- by author ("a"),
bundle ("b"), distribution files ("d") and modules ("m"). (A bundle
is a collection of modules that have been designed to work together
to provide some chunk of essential functionality. For example, the
Bugzilla system -- Mozilla's bug tracking database -- is distributed
as a bundle.)  Searching is done by regular expression. For example,
to search for distribution files with the term "CGI" in their names:


  cpan> d /CGI/
  Distribution    A/AL/ALTITUDE/MsqlCGI-0.8.tar.gz
  Distribution    A/AN/ANDK/Apache-HeavyCGI-0.0133.tar.gz
  Distribution    A/AW/AWOOD/CGI-WML-0.05.tar.gz
  Distribution    B/BE/BEHROOZI/CGI-SecureState-0.26.tar.gz
  Distribution    B/BE/BENL/CGI-Lite-2.0.tar.gz
   ...
  Distribution    Z/ZE/ZENIN/CGI-Validate-2.000.tar.gz
  75 items found


(Yes, there are a _lot_ of CGI scripting tools on CPAN!)


To search for authors with "Stross" in their name: 


  cpan> a /stross/
  Author id = CHSTROSS
      EMAIL        charlie@antipope.org
      FULLNAME     Charlie Stross


(CPAN search terms are either treated as exact text strings, or as
case-insensitive regular expressions. The CPAN shell treats anything
surrounded by slashes as a regular expression: otherwise it assumes it's
a text string.)


You can search for any of the items (author, bundle, distribution or
modules) using the "i" command. For example:


  cpan> i /apache/


Looks for anything with "Apache" in it.
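

The same searches can be driven from a script. The sketch below assumes
that CPAN.pm has already been configured, and that the expand method and
the id and cpan_file accessors behave as described in the CPAN man page
for your version; it lists every module whose name matches /CGI/, along
with the distribution file that contains it.


  #!/usr/bin/perl -w


  use strict;
  use CPAN;


  # expand() takes a type ("Author", "Bundle", "Distribution" or
  # "Module") followed by names or /regexp/ strings, as in the shell
  for my $mod (CPAN::Shell->expand("Module", "/CGI/")) {
    printf "%-35s %s\n", $mod->id, $mod->cpan_file;
  }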


Incidentally, CPAN.pm is smart; if it only finds one or two matches, it
gives you a verbose description of each (as in the Author ID check above),
but if it finds zillions, it collapses them to one-line summaries. You can
then find out more about a one-liner by asking for it by name:


  cpan> i /CGI-Validate-2.000/
  Distribution id = Z/ZE/ZENIN/CGI-Validate-2.000.tar.gz
      CPAN_USERID  ZENIN (Byron Brummer <zenin@bawdycaste.org>)
      CONTAINSMODS CGI::Validate
  

If you've decided to install a module -- say, CGI::Validate -- the next
step is to tell CPAN.pm to download, build, test, and install it.


When you install a module by hand, what you usually do is download
the file CGI-Validate-2.000.tar.gz and then do something like this:


   tar xvzf CGI-Validate-2.000.tar.gz
   cd CGI-Validate-2.000
   perl Makefile.PL
   make
   make test
   make install


(The "perl Makefile.PL" step invokes MakeMaker to generate a Makefile
suitable for use with your current Perl setup using the details in
the Makefile.PL script.)


The CPAN shell understands the commands make, test, and install. Ask for
any one of these steps and it first carries out whatever earlier steps
are needed -- for example, if you type "make CGI::Validate" at the CPAN
prompt, what you'll see is this:


  cpan> make CGI::Validate
  Running make for module CGI::Validate
  Running make for Z/ZE/ZENIN/CGI-Validate-2.000.tar.gz
  Fetching with LWP:
    ftp://ftp.perl.org/pub/CPAN/authors/id/Z/ZE/ZENIN/CGI-Validate-2.000.tar.gz


    CPAN: MD5 security checks disabled because MD5 not installed.
    Please consider installing the MD5 module.


  Scanning cache /root/.cpan/build for sizes
  CGI-Validate-2.000/
  CGI-Validate-2.000/README
  CGI-Validate-2.000/Makefile.PL
  CGI-Validate-2.000/test.pl
  CGI-Validate-2.000/MANIFEST
  CGI-Validate-2.000/Validate.pm
  CGI-Validate-2.000/INSTALL
  CGI-Validate-2.000/Changes


    CPAN.pm: Going to build Z/ZE/ZENIN/CGI-Validate-2.000.tar.gz


  Checking if your kit is complete...
  Looks good
  Writing Makefile for CGI::Validate
  cp Validate.pm blib/lib/CGI/Validate.pm
  Manifying blib/man3/CGI::Validate.3pm
    /usr/bin/make  -- OK


  cpan>


The make and test commands are executed unconditionally, but the CPAN
shell will only obey an 'install' command if all the tests defined in the
module (assuming there were any) were passed successfully, and if the
module is not currently installed at a version number equal to or higher
than the new version. You can pass arguments to 'make' on the command
line; for example, 'make clean' causes make to build the 'clean' target
(usually used to mop up intermediate and object files in a build area).


You can force an install by using the 'force install' command -- in this
case, the CPAN shell will install the module, even if it's a downgrade
or it failed some tests. Watch out with this one! A module may fail tests
for good reasons (say, it wants an internet connection and you're
offline when you run them), or because it's totally broken. If you
force an install in the latter situation, you may break other bits of your
Perl installation.


One really nice thing about CPAN is that if you try to install a module
that depends on some other module as a prerequisite, it recursively
invokes itself to install the prerequisites (if you configured CPAN.pm
to do that). This means you can type something like:


  perl -MCPAN -e 'CPAN::Shell->install("My::Entire::Hierarchy");'


And CPAN will work out all the modules that My::Entire::Hierarchy depends
on and blast them into your system before it installs My::Entire::Hierarchy.pm.


Maybe you prefer to do everything by hand, in case somebody has buried
a time bomb in a CPAN module (system('rm -rf /boot/vmlinuz');) ?
In addition to installing bundles or modules, you can fetch (but not
open) them and also fetch their README files (held in the same CPAN
directory). For example:


  cpan> readme CGI::Validate
  Running readme for module CGI::Validate
  Fetching with LWP:
    ftp://ftp.perl.org/pub/CPAN/authors/id/Z/ZE/ZENIN/CGI-Validate-2.000.readme


  Displaying file
    /root/.cpan/sources/authors/id/Z/ZE/ZENIN/CGI-Validate-2.000.readme
  with pager "less"


(The CPAN shell then displays the README to you using less(1), or whichever
pager tool you configured into it, globally or in ~/.cpan/CPAN/MyConfig.pm.)


The "look" command does something similar -- but instead of running your
pager on the README, it retrieves and unpacks the module, then dumps
you into a sub-shell running in the module's directory. When you've
finished looking around (and maybe editing things) you can hit ^D (control-D)
or type "exit" to return to the cpan> prompt and run a "make" or "install"
command on the modified source code.


What if you want to take your Perl module configuration from one computer
and replicate it on another? CPAN.pm provides several utilities to make this
easier. Firstly, if you're writing a Perl program that needs to install its
own modules when someone runs it for the first time, you might like to
know that all the commands available in the CPAN shell are also method
calls within the class CPAN::Shell -- for example, 'install' is a 
method. You can write code to auto-install CGI::Validate like this:


  #!/usr/bin/perl


  use CPAN;
  CPAN::Shell->install("CGI::Validate") ;


  # and so on
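

A slightly more cautious variation, sketched here on the assumption
that you only want to bother CPAN when the module is genuinely missing,
tries to load the module first and falls back on CPAN::Shell if that
fails. It also assumes CPAN.pm has already been configured on the
machine where the program runs.


  #!/usr/bin/perl -w


  use strict;
  use CPAN;


  # Try to load the module; only go to CPAN if that fails
  unless (eval { require CGI::Validate; 1 }) {
    CPAN::Shell->install("CGI::Validate");
    require CGI::Validate;   # dies here if the install didn't work
  }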


Next, you can use the "autobundle" command to generate a list of all the
modules in your @INC (module include path) that are also available from
CPAN. A bundle snapshot file is written into ~/.cpan/Bundle, with the
date and timestamp appended. You can read it; it's even got built-in
POD documentation. For example, the file Snapshot_2001_11_12_00.pm will
tell you, right at the top, that to rebuild it you should execute: 


  perl -MCPAN -e 'install Bundle::Snapshot_2001_11_12_00'


This is pretty cool, as it ensures that when you move to a new platform
all you need is a copy of the Perl source tarfile and a bundle snapshot
and you can reinstall your preferred set of modules.


One important command is "r", for "recommendations". This checks CPAN for 
new versions of your installed modules and tells you which ones need
updating:


  cpan> r


  Package namespace         installed    latest  in CPAN file
  CGI                           2.752      2.78  L/LD/LDS/CGI.pm-2.78.tar.gz
  DBD::ADO                          2       2.4  T/TL/TLOWERY/DBD-ADO-2.4.tar.gz
  DBD::ExampleP                 10.14     11.02  T/TI/TIMB/DBI-1.20.tar.gz
   ...


  3 installed modules have a version number of 0
  118 installed modules have no parseable version number


  cpan>


The "recommendations" command works neatly with the rest of CPAN.pm -- the
following command brings every installed module up to date, if necessary
building package dependencies in the right order:


  perl -MCPAN -e 'CPAN::Shell->install(CPAN::Shell->r)'


Finally, CPAN has a 'recompile' command. This does a brute-force
recompilation on every dynamically loadable extension (XS module)
installed on your system. This makes it easier for you to build multiple
Perl distributions for different architectures with a common file location
(for example, in an NFS filesystem that's exported to a heterogeneous
network of machines -- maybe a combination of Intel Linux and Sun Solaris
boxes, or MacOS X, or whatever).


Bundles are special Perl modules defined in the namespace Bundle:: -- 
they don't have any functions or methods, and usually contain nothing
but POD documentation. After the initial package declaration and a 
$VERSION variable there's a POD section containing special 
contents that look like this:


  =head1 CONTENTS


  Modulename Version_string parameters
  Modulename Version_string parameters
   ...


The version string and parameters are optional; the module names are
not, because these are the component modules that the bundle consists
of. If you say "install Bundle::Foo", CPAN will install Bundle::Foo,
then install in turn each module listed under the CONTENTS section
of the POD documentation.
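

To make that concrete, here's a sketch of what a small home-made bundle
file might look like. Bundle::MyKit, its version number and its
description are invented for this example, and the three modules under
CONTENTS are simply ones mentioned elsewhere in this article. (The
listing is indented for print; in the real file each line, and in
particular each =head1, starts at the left margin.)


  package Bundle::MyKit;


  $VERSION = '0.01';


  1;


  __END__


  =head1 NAME


  Bundle::MyKit - modules needed by my (hypothetical) web toolkit


  =head1 SYNOPSIS


  perl -MCPAN -e 'install Bundle::MyKit'


  =head1 CONTENTS


  CGI::Validate 2.000


  Net::FTP


  HTML::Parser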


As a final aside: you really want to get cosy with the long format module
list. It contains everything you need to know about CPAN, and a load
more besides, and groups the modules by category under the following
headings:


 1) Perl Core Modules, Perl Language Extensions and Documentation Tools
 2) Development Support
 3) Operating System Interfaces, Hardware Drivers
 4) Networking, Device Control (modems) and InterProcess Communication
 5) Data Types and Data Type Utilities
 6) Database Interfaces
 7) User Interfaces
 8) Interfaces to or Emulations of Other Programming Languages
 9) File Names, File Systems and File Locking (see also File Handles)
 10) String Processing, Language Text Processing, Parsing and Searching
 11) Option, Argument, Parameter and Configuration File Processing
 12) Internationalization and Locale
 13) Authentication, Security and Encryption
 14) World Wide Web, HTML, HTTP, CGI, MIME
 15) Server and Daemon Utilities
 16) Archiving, Compression and Conversion
 17) Images, Pixmap and Bitmap Manipulation, Drawing and Graphing
 18) Mail and Usenet News
 19) Control Flow Utilities (callbacks and exceptions etc)
 20) File Handle, Directory Handle and Input/Output Stream Utilities
 21) Microsoft Windows Modules
 22) Miscellaneous Modules
 23) Interface Modules to Commercial Software
 24) Bundles
 
END (BODY COPY)


BOXOUT: A million modules to do with HTML


A quick CPAN search for the string "HTML" reveals something horrifying --
there were at last count 447 modules with HTML in their name! How on earth
do you start working out which ones you need and which you don't?


First of all, a search for bundles reveals two: HTML::Mason and HTML::EP.
These are integrated toolkits for building web pages -- in the case of
HTML::EP, it allows you to write HTML pages with embedded Perl commands
(like PHP or ASP), while HTML::Mason uses a different approach, allowing
you to build web sites out of Perl and HTML components that can call each
other. Each of these bundles is therefore an attempt to systematize building
web sites, rather than a specific tool.


HTML is a structured file format -- HTML elements contain other elements,
with attributes and extras (such as enclosed text) recursively. You can't
simply use regular expressions if you want to do search/replace operations
on HTML, the way you can on regular text. However, CPAN can help if you
want to examine the structure of an HTML document and do things with
it. The problem is working out what to look for: CPAN is almost too
big! A search using /html.*parse/ produces 14 hits (for items ranging
from HTML::ParseBrowser through HTML::Parser to HTML::TokeParser);
parsing (decomposing the hierarchical structure of HTML files and doing
things depending on what elements you find in them) is quite popular.


If you want to generate HTML files on the fly, things get
harder to find using the generic CPAN tool; in fact, you
really need to read through the long list of all modules at
http://www.cpan.org/modules/00modlist.long.html.  The section headed
"World Wide Web, HTML, HTTP, CGI, MIME" is where you want to be.


What you'll find includes modules to: build HTML pages in an
object-oriented manner from Perl, automatically fix broken (Microsoft
codeset) HTML generated by Word, convert HTML to plain text or
PostScript, build parsers that scan HTML and take actions depending on
what they find inside a given element, extract links, manage standard
HTML bookmark files, generate forms and tables in a variety of ways,
build menus, analyse the capabilities of a web browser so your program
can generate HTML exactly as complex as the browser can handle, and do
basic searching. To say nothing of parsing, more parsing, and building
websites from templates (in several different ways).


Probably the most useful modules are HTML::Parser (parse those HTML
trees!) and HTML::Base (generate HTML in an OO manner). You get HTML::Parser
as part of libwww-perl -- not a bundle, but a distribution containing
multiple modules to help in parsing HTML and managing links. After these,
it depends what you want to do -- but building HTML from templates is
probably on your shopping list somewhere if you're a webmaster, and
CPAN is the place to look.
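

As a taste of what these modules feel like in use, here's a minimal
sketch of link extraction with HTML::LinkExtor, which arrives in the
same distribution as HTML::Parser. The scrap of HTML is made up for the
example; in real life you'd feed the parser a file or a page you had
just fetched.


  #!/usr/bin/perl -w


  use strict;
  use HTML::LinkExtor;


  my $html = '<a href="http://www.cpan.org/">CPAN</a> <img src="logo.png">';


  # The callback is handed the tag name, then attribute/URL pairs
  my $extor = HTML::LinkExtor->new(sub {
    my ($tag, %links) = @_;
    print "$tag: $_ = $links{$_}\n" for sort keys %links;
  });


  $extor->parse($html);
  $extor->eof;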


END BOXOUT (A million modules)


BOXOUT: Networking tools


The most important toolkit of modules you can get for Perl is available
on CPAN: Bundle::libnet.


Libnet is a bunch of modules that talk common TCP/IP protocols and allow
you to construct clients rapidly in Perl. They're all configured 
centrally (at install time) via the module Net::Config (which can be
overridden locally by a ~/.libnetrc file owned by a user); this tells
the Libnet modules whether you're behind a firewall, where your 
nearest servers are, whether to do FTP in passive mode (necessary
behind a masquerading gateway) and so on. The actual client-side
modules are specific to each protocol: Net::FTP, Net::NNTP, Net::SMTP,
Net::POP3, and Net::Time. Some of these (particularly the command-
based protocols, such as SMTP and NNTP) inherit their generic methods 
from Net::Cmd, but add methods and parameters specific to their particular
protocol.


It needs to be emphasized that these are client-side implementations.
You can't easily use them to implement a server! However, they're very
handy for tasks like sending a short email message to the webmaster
for your site:


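  #!/usr/bin/perl -w


  use strict;
  use Net::SMTP;
  my $recipient = "webmaster\@localhost";
  my $SMTPserver = "localhost";


  # new() returns undef if it can't connect, so check before going on
  my $smtp = Net::SMTP->new($SMTPserver)
    or die "Can't connect to SMTP server $SMTPserver\n";


  $smtp->mail($ENV{USER});     # envelope sender
  $smtp->to($recipient);       # envelope recipient


  $smtp->data();
  $smtp->datasend("To: $recipient\n");
  $smtp->datasend("\n");       # blank line separates headers from body
  $smtp->datasend("A simple test message\n");
  $smtp->dataend();
  $smtp->quit;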


(This sort of thing makes more sense if you see it embedded in a CGI
script or some batch process that needs to report in if something goes
wrong -- or goes right, for that matter.)


The other modules are equally useful for writing short, special-purpose
client-side applications. For example, Net::POP3 can be used quite
efficiently with the SpamAssassin anti-spam tool to suck in all the
messages from a moribund mail account (which is no longer being used 
by anybody but spammers) and auto-forward them to the collaborative
spam monitoring (and blocking) server. And Net::NNTP can be used in
conjunction with grep() to efficiently scan usenet groups for mention
of particular topics -- and then with Net::SMTP to forward postings 
in the relevant threads to your mailbox.
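

The POP3 half of that idea might start out something like the sketch
below. The host name, user name and password are placeholders, and the
forwarding itself is left as a comment; all the code does is log in,
walk through the waiting messages and report how long each one is.


  #!/usr/bin/perl -w


  use strict;
  use Net::POP3;


  # Placeholder account details -- substitute your own
  my ($host, $user, $pass) = ("pop.example.org", "deadbox", "secret");


  my $pop = Net::POP3->new($host) or die "Can't connect to $host\n";
  defined $pop->login($user, $pass) or die "Login failed\n";


  my $sizes = $pop->list || {};           # hash ref: message number => size
  for my $num (sort { $a <=> $b } keys %$sizes) {
    my $lines = $pop->get($num) or next;  # array ref of message lines
    print "Message $num: ", scalar(@$lines), " lines\n";
    # forward or analyse @$lines here, then perhaps $pop->delete($num)
  }
  $pop->quit;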


There's an equivalent bunch of centrally-configured protocol handlers
out there for the World Wide Web: libwww-perl, also known as LWP.
LWP contains several distinct categories of module. First, there are
the HTTP modules (HTTP::Request, HTTP::Response, HTTP::Negotiate,
HTTP::Status and HTTP::Cookies) that let you assemble arbitrary HTTP
requests, fire them at a server, negotiate the content-types you
can accept, store and retrieve cookies, and analyse the response.
HTTP::Daemon puts all these together and implements a simple HTTP
server class -- it isn't anything close to a replacement for a real
HTTP server, but it can provide a very useful way of serving up
files to remote client processes if you need to tie two or more
machines together over the network, or write a lightweight special-
purpose proxy server. 


Next, there are the LWP classes. These are used to build a user agent --
a client-side program that talks to servers by HTTP. LWP classes
include LWP::UserAgent, an object that creates HTTP::Request objects,
sends them, and returns an HTTP::Response object. LWP::ConnCache is
used to provide connection caching for LWP clients; LWP::RobotUA
implements a user agent designed for robot operation (i.e. to scan
entire websites without bringing them to their knees or looking in
places from which robots are banned by the robot exclusion protocol --
which is implemented in WWW::RobotRules). 
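

Here's a minimal sketch of how the request, user agent and response
classes fit together. The URL and the user agent name are arbitrary
choices for the example; everything else uses the basic LWP::UserAgent,
HTTP::Request and HTTP::Response methods.


  #!/usr/bin/perl -w


  use strict;
  use LWP::UserAgent;
  use HTTP::Request;


  my $url = "http://www.cpan.org/";       # any URL will do


  my $ua = LWP::UserAgent->new;
  $ua->agent("lxf-example/0.1");          # made-up user agent name


  my $request  = HTTP::Request->new(GET => $url);
  my $response = $ua->request($request);  # an HTTP::Response object


  if ($response->is_success) {
    print $response->content;
  } else {
    die "Couldn't fetch $url: ", $response->status_line, "\n";
  }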


Finally, the whole stack is based on top of Net::HTTP -- the client-
side protocol implementation equivalent to the Net modules. This is 
a very low-level HTTP client implementation, and you should really
use something more sophisticated if you're talking to real-world
web servers; a good place to start is by customizing the lwp-request
or lwp-mirror example programs supplied with the kit.


Sometimes you may find yourself wanting to implement genuine TCP/IP
servers in Perl. This is a big, long topic and will make for a future
Linux Format tutorial in its own right. If you've done this before in
C or C++ and just want to know how to tackle it in Perl, you could
do worse than look into NetServer::Generic (for very simple tasks)
or one of the Server::Inet classes.


END BOXOUT (Networking tools)


BOXOUT: Weird stuff


Weird stuff? _How_ weird?


CPAN is brimming over with weird stuff written by weird people for
weird purposes. It's probably not appropriate to describe the
sound-manipulation toolkits (like Audio::OSS, a front-end to the Linux
Open Sound System, or Audio::SOX, which lets you translate audio file
formats) as weird. But when we get into territory like Games::AIBots
-- an implementation of the old A. I. Wars game in Perl -- we're
getting warm. LEGO::RCX is a module for, well, hacking the LEGO
Mindstorms robotics kit; and by the time we get into the Silly::
hierarchy things are definitely weird, with Silly::Werder (a 
module for generating inane but realistically word-like strings of 
gibberish), and Silly::StringMaths, which lets you do arithmetic 
with text strings: upper case letters are positive, lowercase
letters are negative, and the value corresponds to the number of
letters in a string, so that add("FOO", "BAR") returns "ABFOOR".


Of course, CPAN is mostly sensible. That's because the most barking
mad Perl programmers don't hold with anything as sane as putting
all their modules in one place so that just _anyone_ can get their
hands on them. For that, you have to go to the horse's mouth, or
the Obfuscated Perl Contest: see, for example, the nightmares
at http://www.samag.com/tpj/obfuscated/, such as Garry Taylor's
reimplementation of the classic Spectrum game Frogger in 2048
bytes of Perl. 


END BOXOUT (Weird stuff)