Linux Format 23 Perl tutorial

[[ TYPOGRAPHY/LAYOUT -- text surrounded by _underscore_ characters like so should be italicized or emphasized. Text indented from the margin by two or more characters is a program listing: needs monospaced typeface, with indentation and word wrap preserved. Contact me if it needs changing to fit the page. ]]

HEADING: Using CPAN to install and control your modules

Probably the most powerful feature of the Perl language and its community isn't part of the language at all -- but it's intimately tied in to Perl's free software history, and makes the language even more useful as a jack-of-all-trades toolkit. I'm talking, in case you hadn't guessed, about CPAN -- the Comprehensive Perl Archive Network.

As with most programming languages, the Perl distribution includes a standard library of modules that give you a range of convenient re-usable tools. You might think that Perl's huge range of built-in keywords -- just about everything in the standard C library is a command in Perl -- gives you enough; but because Perl is a very high level language, there are some very high level libraries out there. Of late, the core distribution has bloated somewhat, but there are limits to how many modules can be bundled in the standard kit. So many of the most useful tools are distributed separately, via CPAN.

CPAN is a large archive of Perl-related stuff, including the Perl source code, the repository of published modules, oodles of documentation, ports to strange operating systems, and many other items. Anyone can download stuff from CPAN, and you can do this either manually (using an FTP client or web browser), or using the CPAN.pm module that is part of the core distribution (to automate the process of upgrading your Perl installation).

Stuff finds its way into CPAN via PAUSE, the Perl Authors Upload Server. Suppose you've got a new module that you've written and that you think is of interest to other people.
You've checked the definitive list of CPAN modules (at http://www.cpan.org/modules/00modlist.long.html) and nobody has contributed anything quite like your own, so you contact the module list maintainers (modules@perl.org). They'll discuss whether it's appropriate to put your module into CPAN, and where it should fit in the two-level-deep namespace. If you agree, they'll assign you a Perl author ID and you can then use the forms at PAUSE (http://www.cpan.org/modules/04pause.html) to upload your module into your CPAN directory. From the central PAUSE server, updates are mirrored onto the 197 CPAN satellite sites every hour.

If you're a user rather than a producer, you don't need to concern yourself with PAUSE; instead, you need to know about CPAN and WAIT.

The module list (see URL above) contains an exhaustive list of the publicly available Perl modules from CPAN and the core Perl distribution. Modules in the list are broken down by functional category, and identified by development status, support level, language used, interface style, and software license type. You really, _really_ need to explore the module list before embarking on a new piece of software that involves serious coding; it's quite possible that tools to help with the task you want to accomplish have already been contributed to CPAN, and using them will make your job much easier.

One problem with CPAN, which should be extremely familiar to you if you're a Linux user, is the dependency problem: you want to install a module (say, Pod::RTF, because you want to take your laboriously-written POD documentation and turn it into RTF for conversion into a Microsoft help file or import into StarOffice), and you find that it requires a bundle of other modules before it will run (Pod::Parser, and some RTF utilities). This is what the CPAN.pm module is for.

CPAN.pm is two things: a set of routines for searching, downloading, compiling, and caching Perl modules from CPAN -- so that you can write Perl programs that install their own prerequisites if you distribute them to customers -- and an interactive text-based shell that lets you do all this from the command line.

You can bring up the interactive CPAN shell on your Linux box like this:

  perl -MCPAN -e shell

(This tells Perl to load the CPAN.pm module and execute the command "shell", which is a subroutine that provides the interactive shell.)

The first time you use the CPAN shell, it will prompt you for various bits of information -- notably, the nearest CPAN sites (from a long list of mirrors on the net), how much cache space you want to use, whether you're behind a firewall or need to use web or ftp proxies, what sort of utilities to use for fetching modules, and so on. One useful item buried in the CPAN manual pages is that it works with file: URLs, so you can point it at a local CDROM image (if you've got a CDROM of CPAN kicking around). This information gets stashed in CPAN/Config.pm in your Perl library directory, but can be overridden by a file ~/.cpan/CPAN/MyConfig.pm, so if you're trying to distribute a self-installing Perl program you can pre-configure some sensible defaults, or edit the settings to suit yourself. The CPAN configuration consists of a hash; the keys and their acceptable values are listed in the man page under 'CONFIGURATION'. (Type 'perldoc CPAN' or 'man CPAN' to read this.)

One note: if you want to use CPAN from behind a firewall you may have to jump through some configuration hoops to do so. These are described in the manual. The main obstacles are firewalls that rely on IP masquerading and filter out active FTP connections. You may also have headaches if Net::FTP and some other modules aren't installed, although CPAN is able to fall back on NcFTP, Lynx, and other external programs to replace built-in Perl modules for fetching files.
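
As a rough illustration of the configuration hash described above, a per-user MyConfig.pm might look something like the sketch below. The key names are ones documented in the CPAN manual page, but every value here is an assumption made up for the example (in particular the CD-ROM mount point), and a real file normally carries many more keys than this.

```perl
# Sketch of a per-user MyConfig.pm. Every value below is an assumption
# chosen for illustration; see the CONFIGURATION section of the manual
# for the full list of keys.
$CPAN::Config = {
    # Where CPAN.pm keeps its working files
    'cpan_home'            => "$ENV{HOME}/.cpan",
    'keep_source_where'    => "$ENV{HOME}/.cpan/sources",
    'build_dir'            => "$ENV{HOME}/.cpan/build",
    # Size limit for the build cache, in megabytes
    'build_cache'          => 10,
    # Try a local CD-ROM image first (hypothetical mount point),
    # then fall back to a mirror on the net
    'urllist'              => [
        'file:///mnt/cdrom/CPAN/',
        'ftp://ftp.perl.org/pub/CPAN/',
    ],
    # What to do about missing prerequisites: follow, ask or ignore
    'prerequisites_policy' => 'ask',
    # Pager used by commands such as "readme"
    'pager'                => 'less',
};
1;
```

If CPAN.pm finds keys missing from the file it will generally ask for them the next time you start the shell, so a partial file like this one is best treated as a starting point.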

SUBHEADING: Fancy CPAN tricks

Once you've got the CPAN shell configured properly, the first thing you want to do is to install Term::ReadKey and Term::ReadLine -- these add command history editing and recall to CPAN. At the cpan> prompt, type:

  install Term::ReadKey

And CPAN.pm will go away, fetch the module, compile, and install it. When you've done this, type:

  reload cpan

And CPAN.pm will reload itself (and add the history editing facilities).

You can use the CPAN shell to search for items in CPAN -- by author ("a"), bundle ("b"), distribution files ("d") and modules ("m"). (A bundle is a collection of modules that have been designed to go together to provide some chunk of essential functionality. For example, the Bugzilla system -- Mozilla's bug tracking database -- is distributed as a bundle.)

Searching is done by regular expression. For example, to search for distribution files with the term "CGI" in their names:

  cpan> d /CGI/
  Distribution    A/AL/ALTITUDE/MsqlCGI-0.8.tar.gz
  Distribution    A/AN/ANDK/Apache-HeavyCGI-0.0133.tar.gz
  Distribution    A/AW/AWOOD/CGI-WML-0.05.tar.gz
  Distribution    B/BE/BEHROOZI/CGI-SecureState-0.26.tar.gz
  Distribution    B/BE/BENL/CGI-Lite-2.0.tar.gz
  ...
  Distribution    Z/ZE/ZENIN/CGI-Validate-2.000.tar.gz
  75 items found

(Yes, there are a _lot_ of CGI scripting tools on CPAN!)

To search for authors with "Stross" in their name:

  cpan> a /stross/
  Author id = CHSTROSS
      EMAIL        charlie@antipope.org
      FULLNAME     Charlie Stross

(CPAN search terms are either treated as exact text strings, or as case-insensitive regular expressions. The CPAN shell treats anything surrounded by slashes as a regular expression; otherwise it assumes it's a text string.)

You can search across all of the item types (author, bundle, distribution or module) using the "i" command. For example:

  cpan> i /apache/

looks for anything with "Apache" in it.
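
The search-term rule is easy to imitate in a few lines of Perl. What follows is only a sketch of the rule as stated above, not CPAN.pm's own code: the sub names match_term and search_modules are invented for this example. The second sub wraps CPAN::Shell->expand, the documented way of turning a search term into objects from a program; it assumes CPAN.pm is already configured and can reach a mirror, so it is defined here but never called.

```perl
#!/usr/bin/perl -w
use strict;

# Hypothetical helper: apply the shell's stated rule to a list of names.
# A term wrapped in slashes is a case-insensitive regular expression;
# anything else must match the whole name exactly.
sub match_term {
    my ($term, @names) = @_;
    if ($term =~ m{^/(.*)/$}) {
        my $re = qr/$1/i;
        return grep { $_ =~ $re } @names;
    }
    return grep { $_ eq $term } @names;
}

# Hypothetical wrapper around the real search. It needs a configured
# CPAN.pm and network access, so nothing in this script calls it.
sub search_modules {
    my ($term) = @_;
    require CPAN;
    return map { $_->id } CPAN::Shell->expand("Module", $term);
}

my @names = ("CGI", "CGI::Validate", "Apache::HeavyCGI", "Net::FTP");
print join(", ", match_term("/cgi/", @names)), "\n";   # three matches
print join(", ", match_term("CGI",   @names)), "\n";   # exact match only
```

On a machine with a working CPAN setup, search_modules("/CGI/") should hand back the same module names that "m /CGI/" lists in the shell.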

Incidentally, CPAN.pm is smart; if it only finds one or two matches, it gives you a verbose description of each (as in the Author ID check above), but if it finds zillions, it collapses them to one-line summaries. You can then find out more about a one-liner by asking for it by name:

  cpan> i /CGI-Validate-2.000/
  Distribution id = Z/ZE/ZENIN/CGI-Validate-2.000.tar.gz
      CPAN_USERID  ZENIN (Byron Brummer )
      CONTAINSMODS CGI::Validate

If you've decided to install a module -- say, CGI::Validate -- the next step is to tell CPAN.pm to download, build, test, and install it.

When you install a module such as CGI::Validate by hand, what you usually do is download the file CGI-Validate-2.000.tar.gz and then do something like this:

  tar xvzf CGI-Validate-2.000.tar.gz
  cd CGI-Validate-2.000
  perl Makefile.PL
  make
  make test
  make install

(The "perl Makefile.PL" step invokes MakeMaker to generate a Makefile suitable for use with your current Perl setup, using the details in the Makefile.PL script.)

The CPAN shell understands the commands make, test, and install. If you ask for any of these steps, it carries out the steps that have to come first -- for example, if you type "make CGI::Validate" at the CPAN prompt, what you'll see is this:

  cpan> make CGI::Validate
  Running make for module CGI::Validate
  Running make for Z/ZE/ZENIN/CGI-Validate-2.000.tar.gz
  Fetching with LWP:
    ftp://ftp.perl.org/pub/CPAN/authors/id/Z/ZE/ZENIN/CGI-Validate-2.000.tar.gz
  CPAN: MD5 security checks disabled because MD5 not installed.
    Please consider installing the MD5 module.
  Scanning cache /root/.cpan/build for sizes
  CGI-Validate-2.000/
  CGI-Validate-2.000/README
  CGI-Validate-2.000/Makefile.PL
  CGI-Validate-2.000/test.pl
  CGI-Validate-2.000/MANIFEST
  CGI-Validate-2.000/Validate.pm
  CGI-Validate-2.000/INSTALL
  CGI-Validate-2.000/Changes
  CPAN.pm: Going to build Z/ZE/ZENIN/CGI-Validate-2.000.tar.gz
  Checking if your kit is complete...
  Looks good
  Writing Makefile for CGI::Validate
  cp Validate.pm blib/lib/CGI/Validate.pm
  Manifying blib/man3/CGI::Validate.3pm
    /usr/bin/make -- OK
  cpan>

The make and test commands are executed unconditionally, but the CPAN shell will only obey an 'install' command if all the tests defined in the module (assuming there were any) passed, and if the module is not currently installed at a version number equal to or higher than the new version.

You can pass arguments to 'make' on the command line; for example, 'make clean' causes make to build the 'clean' target (usually used to mop up intermediate and object files in a build area).

You can force an install by using the 'force install' command -- in this case, the CPAN shell will install the module, even if it's a downgrade or it failed some tests. Watch out with this one! A module may fail tests for harmless reasons (for example, it wants an internet connection and you're offline when you run them), or because it's totally broken. If you force an install in the latter situation, you may break other bits of your Perl installation.

One really nice thing about CPAN is that if you try to install a module that depends on some other module as a prerequisite, it recursively invokes itself to install the prerequisites (if you configured CPAN.pm to do that). This means you can type something like:

  perl -MCPAN -e 'CPAN::Shell->install("My::Entire::Hierarchy");'

And CPAN will work out all the modules that My::Entire::Hierarchy depends on and blast them into your system before it installs My::Entire::Hierarchy.pm.

Maybe you prefer to do everything by hand, in case somebody has buried a time bomb in a CPAN module (system('rm -rf /boot/vmlinuz');)?

In addition to installing bundles or modules, you can fetch (but not open) them and also fetch their README files (held in the same CPAN directory).
For example:

  cpan> readme CGI::Validate
  Running readme for module CGI::Validate
  Fetching with LWP:
    ftp://ftp.perl.org/pub/CPAN/authors/id/Z/ZE/ZENIN/CGI-Validate-2.000.readme
  Displaying file
    /root/.cpan/sources/authors/id/Z/ZE/ZENIN/CGI-Validate-2.000.readme
  with pager "less"

(The CPAN shell then displays the README to you using less(1), or whichever pager tool you configured into it, globally or in ~/.cpan/CPAN/MyConfig.pm.)

The "look" command does something similar -- but instead of running your pager on the README, it retrieves and unpacks the module, then dumps you into a sub-shell running in the module's directory. When you've finished looking around (and maybe editing things) you can hit ^D (control-D) or type "exit" to return to the cpan> prompt and run a "make" or "install" command on the modified source code.

What if you want to take your Perl module configuration from one computer and replicate it on another? CPAN.pm provides several utilities to make this easier.

Firstly, if you're writing a Perl program that needs to install its own modules when someone runs it for the first time, you might like to know that all the commands available in the CPAN shell are also method calls within the class CPAN::Shell -- for example, 'install' is a method. You can write code to auto-install CGI::Validate like this:

  #!/usr/bin/perl
  use CPAN;
  CPAN::Shell->install("CGI::Validate");
  # and so on

Next, you can use the "autobundle" command to generate a list of all the modules in your @INC (module include path) that are also available from CPAN. A bundle snapshot file is written into ~/.cpan/Bundle, with the date and a serial number appended to its name. You can read it; it's even got built-in POD documentation.
For example, the file Snapshot_2001_11_12_00.pm will tell you, right at the top, that to rebuild it you should execute:

  perl -MCPAN -e 'install Bundle::Snapshot_2001_11_12_00'

This is pretty cool, as it ensures that when you move to a new platform all you need is a copy of the Perl source tarfile and a bundle snapshot and you can reinstall your preferred set of modules.

One important command is "r", for "recommendations". This checks CPAN for new versions of your installed modules and tells you which ones need updating:

  cpan> r

  Package namespace   installed   latest   in CPAN file
  CGI                     2.752     2.78   L/LD/LDS/CGI.pm-2.78.tar.gz
  DBD::ADO                    2      2.4   T/TL/TLOWERY/DBD-ADO-2.4.tar.gz
  DBD::ExampleP           10.14    11.02   T/TI/TIMB/DBI-1.20.tar.gz
  ...
  3 installed modules have a version number of 0
  118 installed modules have no parseable version number
  cpan>

The "recommendations" command works neatly with the rest of CPAN.pm -- the following command brings every installed module up to date, if necessary building package dependencies in the right order:

  perl -MCPAN -e 'CPAN::Shell->install(CPAN::Shell->r)'

Finally, CPAN has a 'recompile' command. This does a brute-force recompilation of every dynamically loadable extension (XS module) installed on your system. This makes it easier for you to build multiple Perl distributions for different architectures with a common file location (for example, in an NFS filesystem that's exported to a heterogeneous network of machines -- maybe a combination of Intel Linux and Sun Solaris boxes, or MacOS X, or whatever).

Bundles are special Perl modules defined in the namespace Bundle:: -- they don't have any functions or methods, and usually contain nothing but POD documentation. After the initial package declaration and a $VERSION variable there's a POD section containing special contents that look like this:

  =head1 CONTENTS

  Modulename Version_string parameters

  Modulename Version_string parameters

  ...
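
To make the layout concrete, here is a small sketch that holds a made-up bundle, Bundle::MyTools, as a list of lines and pulls the module names out of its CONTENTS section. The bundle name, its contents and the sub bundle_contents are all hypothetical, and the parsing is a simplified guess at what a bundle reader has to do, not CPAN.pm's own code.

```perl
#!/usr/bin/perl -w
use strict;

# A made-up bundle, Bundle::MyTools, held as a list of lines so that this
# script stays self-contained. On disk this would be Bundle/MyTools.pm.
my @bundle_file = (
    'package Bundle::MyTools;',
    '$VERSION = "0.01";',
    '1;',
    '__END__',
    '',
    '=head1 NAME',
    '',
    'Bundle::MyTools - hypothetical example bundle',
    '',
    '=head1 CONTENTS',
    '',
    'Net::FTP',
    '',
    'CGI::Validate 2.000',
    '',
    '=head1 AUTHOR',
    '',
    'Nobody in particular',
);

# Return the module names listed under "=head1 CONTENTS".
sub bundle_contents {
    my @lines = @_;
    my $in_contents = 0;
    my @modules;
    for my $line (@lines) {
        if ($line =~ /^=head1\s+(.*)/) {
            my $heading = $1;
            # Only the CONTENTS section lists component modules
            $in_contents = ($heading =~ /^CONTENTS\s*$/) ? 1 : 0;
            next;
        }
        next unless $in_contents;
        # First word on the line is the module name; the rest is optional
        push @modules, $1 if $line =~ /^([A-Za-z_]\w*(?:::\w+)*)/;
    }
    return @modules;
}

print "$_\n" for bundle_contents(@bundle_file);
```

One entry here carries a version string and the other does not, which is the distinction the next paragraph explains.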

The version string and parameters are optional; the module names are not, because these are the component modules that the bundle consists of. If you say "install Bundle::foo", CPAN will install Bundle::foo, then install in turn each module listed under the CONTENTS section of the POD documentation.

As a final aside: you really want to get cosy with the long format module list. It contains everything you need to know about CPAN, and a load more besides, and groups the modules by category under the following headings:

  1) Perl Core Modules, Perl Language Extensions and Documentation Tools
  2) Development Support
  3) Operating System Interfaces, Hardware Drivers
  4) Networking, Device Control (modems) and InterProcess Communication
  5) Data Types and Data Type Utilities
  6) Database Interfaces
  7) User Interfaces
  8) Interfaces to or Emulations of Other Programming Languages
  9) File Names, File Systems and File Locking (see also File Handles)
  10) String Processing, Language Text Processing, Parsing and Searching
  11) Option, Argument, Parameter and Configuration File Processing
  12) Internationalization and Locale
  13) Authentication, Security and Encryption
  14) World Wide Web, HTML, HTTP, CGI, MIME
  15) Server and Daemon Utilities
  16) Archiving, Compression and Conversion
  17) Images, Pixmap and Bitmap Manipulation, Drawing and Graphing
  18) Mail and Usenet News
  19) Control Flow Utilities (callbacks and exceptions etc)
  20) File Handle, Directory Handle and Input/Output Stream Utilities
  21) Microsoft Windows Modules
  22) Miscellaneous Modules
  23) Interface Modules to Commercial Software
  24) Bundles

END (BODY COPY)

BOXOUT: A million modules to do with HTML

A quick CPAN search for the string "HTML" reveals something horrifying -- there were at last count 447 modules with HTML in their name! How on earth do you start working out which ones you need and which you don't?

First of all, a search for bundles reveals two: HTML::Mason and HTML::EP.
These are integrated toolkits for building web pages -- HTML::EP allows you to write HTML pages with embedded Perl commands (like PHP or ASP), while HTML::Mason uses a different approach, allowing you to build web sites out of Perl and HTML components that can call each other. Each of these bundles is therefore an attempt to systematize building web sites, rather than a specific tool.

HTML is a structured file format -- HTML elements contain other elements, with attributes and extras (such as enclosed text), recursively. You can't simply use regular expressions if you want to do search/replace operations on HTML, the way you can on regular text. However, CPAN can help if you want to examine the structure of an HTML document and do things with it. The problem is working out what to look for: CPAN is almost too big! A search using /html.*parse/ produces 14 hits (for items ranging from HTML::ParseBrowser and HTML::Parser through HTML::TokeParser); parsing (decomposing the hierarchical structure of HTML files and doing things depending on what entities you find in them) is quite popular.

If you want to generate HTML files on the fly, things get harder to find using the generic CPAN tool; in fact, you really need to read through the long list of all modules at http://www.cpan.org/modules/00modlist.long.html. The "World Wide Web, HTML, HTTP, CGI, MIME" section of the list is where you want to be. What you'll find includes modules to: build HTML pages in an object-oriented manner from Perl, automatically fix broken (Microsoft codeset) HTML generated by Word, convert HTML to plain text or PostScript, build parsers that scan HTML and take actions depending on what they find inside a given element, extract links, manage standard HTML bookmark files, generate forms and tables in a variety of ways, build menus, analyse the capabilities of a web browser so your program can generate HTML exactly as complex as the browser can handle, and do basic searching.
To say nothing of parsing, parsing, and building websites from templates (in several different ways).

Probably the most useful modules are HTML::Parser (parse those HTML trees!) and HTML::Base (generate HTML in an OO manner). You get HTML::Parser as part of libwww-perl -- not a bundle, but a distribution containing multiple modules to help in parsing HTML and managing links. After these, it depends what you want to do -- but building HTML from templates is probably on your shopping list somewhere if you're a webmaster, and CPAN is the place to look.

END BOXOUT (A million modules)

BOXOUT: Networking tools

The most important toolkit of modules you can get for Perl is available on CPAN: Bundle::libnet. Libnet is a bunch of modules that talk common TCP/IP protocols and allow you to construct clients rapidly in Perl. They're all configured centrally (at install time) via the module Net::Config (which can be overridden locally by a ~/.libnetrc file owned by a user); this tells the Libnet modules whether you're behind a firewall, where your nearest servers are, whether to do FTP in passive mode (necessary behind a masquerading gateway) and so on.

The actual client-side modules are specific to each protocol: Net::FTP, Net::NNTP, Net::SMTP, Net::POP3, and Net::Time. Some of these (particularly the command-based protocols, such as SMTP and NNTP) inherit their generic methods from Net::Cmd, but add methods and parameters specific to their particular protocol.

It needs to be emphasized that these are client-side implementations. You can't easily use them to implement a server!
However, they're very handy for tasks like sending a short email message to the webmaster for your site:

  #!/usr/bin/perl -w
  use strict;
  use Net::SMTP;

  my $recipient  = "webmaster\@localhost";
  my $SMTPserver = "localhost";

  # Connect to the mail server; new() returns undef on failure
  my $smtp = Net::SMTP->new($SMTPserver)
      or die "Can't connect to $SMTPserver\n";

  $smtp->mail($ENV{USER});     # envelope sender
  $smtp->to($recipient);       # envelope recipient
  $smtp->data();               # start the message
  $smtp->datasend("To: $recipient\n");
  $smtp->datasend("\n");       # blank line ends the headers
  $smtp->datasend("A simple test message\n");
  $smtp->dataend();            # finish the message
  $smtp->quit;

(This sort of thing makes more sense if you see it embedded in a CGI script or some batch process that needs to report in if something goes wrong -- or goes right, for that matter.)

The other modules are equally useful for writing short, special-purpose client-side applications. For example, Net::POP3 can be used quite efficiently with the SpamAssassin anti-spam tool to suck in all the messages from a moribund mail account (which is no longer being used by anybody but spammers) and auto-forward them to the collaborative spam monitoring (and blocking) server. And Net::NNTP can be used in conjunction with grep() to efficiently scan usenet groups for mention of particular topics -- and then with Net::SMTP to forward postings in the relevant threads to your mailbox.

There's an equivalent bunch of centrally-configured protocol handlers out there for the World Wide Web: libwww-perl, also known as LWP.

LWP contains several distinct categories of module. First, there are the HTTP modules (HTTP::Request, HTTP::Response, HTTP::Negotiate, HTTP::Status and HTTP::Cookies) that let you assemble arbitrary HTTP requests, fire them at a server, negotiate the content-types you can accept, store and retrieve cookies, and analyse the response.
HTTP::Daemon puts all these together and implements a simple HTTP server class -- it isn't anything close to a replacement for a real HTTP server, but it can provide a very useful way of serving up files to remote client processes if you need to tie two or more machines together over the network, or write a lightweight special-purpose proxy server.

Next, there are the LWP classes. These are used to build a user agent -- a client-side program that talks to servers by HTTP. LWP classes include LWP::UserAgent, an object that creates HTTP::Request objects, sends them, and returns an HTTP::Response object. LWP::ConnCache is used to provide connection caching for LWP clients; LWP::RobotUA implements a user agent designed for robot operation (i.e. to scan entire websites without bringing them to their knees or looking in places from which robots are banned by the robot exclusion protocol -- which is implemented in WWW::RobotRules).

Finally, the whole stack is based on top of Net::HTTP -- the client-side protocol implementation equivalent to the Net modules. This is a very low-level HTTP client implementation, and you should really use something more sophisticated if you're talking to real-world web servers; a good place to start is by customizing the lwp-request or lwp-mirror example programs supplied with the kit.

Sometimes you may find yourself wanting to implement genuine TCP/IP servers in Perl. This is a big, long topic and will make for a future Linux Format tutorial in its own right. If you've done this before in C or C++ and just want to know how to tackle it in Perl, you could do worse than look into NetServer::Generic (for very simple tasks) or one of the Server::Inet classes.

END BOXOUT (Networking tools)

BOXOUT: Weird stuff

Weird stuff? _How_ weird?

CPAN is brimming over with weird stuff written by weird people for weird purposes.
It's probably not appropriate to define the sound-manipulation toolkits (like Audio::OSS, a front-end to the Linux Open Sound System, or Audio::SOX, which lets you translate audio file formats) as weird. But when we get into territory like Games::AIBots -- an implementation of the old A. I. Wars game in Perl -- we're getting warm. LEGO::RCX is a module for, well, hacking the LEGO Mindstorms robotics kit; and by the time we get into the Silly:: hierarchy things are definitely weird, with Silly::Werder (a module for generating inane but realistically word-like strings of gibberish), and Silly::StringMaths, which lets you do arithmetic with text strings: upper case letters are positive, lowercase letters are negative, and the value corresponds to the number of letters in a string, so that add("FOO", "BAR") returns "ABFOOR".

Of course, CPAN is mostly sensible. That's because the most barking mad Perl programmers don't hold with anything as sane as putting all their modules in one place so that just _anyone_ can get their hands on them. For that, you have to go to the horse's mouth, or the Obfuscated Perl Contest: see, for example, the nightmares at http://www.samag.com/tpj/obfuscated/, such as Garry Taylor's reimplementation of the classic Spectrum game Frogger in 2048 bytes of Perl.

END BOXOUT (Weird stuff)