Linux Format 22 

Perl tutorial


[[ TYPOGRAPHY/LAYOUT -- text surrounded by _underscore_ characters like 

   so should be italicized or emphasized. Text indented from the margin by 

   two or more characters is a program listing: needs monospaced typeface,

   with indentation and word wrap preserved. Contact me if it needs

   changing to fit the page. 

]]


HEADING: Subroutines


For the past umpty-something tutorials, we've been using subroutines in Perl

without examining the topic too deeply. This is a mistake; there's some weird

magic buried in there that can come in extremely handy at times -- or turn

around and bite a programmer used to a more traditional language, such as C

or Pascal. So, let's take a look at how Perl handles blocks of code.


SUBHEADING: In the beginning


Perl's flow of control structures were originally inherited from the UNIX

shells -- in fact, Perl started life as a kind of shell on steroids, with

built-in equivalents of awk and sed. 


A subroutine is a block of code with a name:


   sub my_sub {

       print "hi there!";

   }


We can execute this block by referring to it by name.  

Note that Perl lets us do this in several ways. The &-prefix

before a name indicates that the following name is that of a subroutine

declared somewhere else in our Perl program. Using the prefix tells Perl

that we want to execute it:


  &my_sub;


If we declared it above the point where we call it, we can usually leave 

off the prefix: 


  my_sub;


And if we are passing parameters to it we can specify them in brackets:


  sub my_sub {

     my (@parameters) = @_;

     foreach (@parameters) {

         print "You said $_\n";

     }

  }


  my_sub("hi", "there");


Because subroutine names are symbols, we can stash them in variables

and refer to them indirectly. For example:


    my $subroutine = "my_sub";

    &{$subroutine}("elephant");


Will print:


    You said elephant
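
Calling a subroutine through a name held in a string is a _symbolic_ reference, and it only works while the strict pragma is switched off. As a minimal sketch (shout() and the variable names are invented for illustration), here is the same call made through a hard reference, which works under strict, next to the symbolic call being refused:

```perl
use strict;
use warnings;

# shout() is a hypothetical subroutine, used only for this sketch.
sub shout {
    my ($word) = @_;
    return "You said $word";
}

# A hard reference: \&name takes a reference to the subroutine itself.
my $by_ref = \&shout;
print $by_ref->("elephant"), "\n";   # You said elephant

# A symbolic reference: the variable only holds the subroutine's name.
# Under "use strict" Perl refuses to follow it, so we trap the error.
my $by_name = "shout";
my $result  = eval { &{$by_name}("elephant") };
print "symbolic call refused under strict\n" unless defined $result;
```

If your programs start with "use strict" (and they probably should), the hard reference is the form to reach for.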


Let's take a closer look at what a subroutine is.  A subroutine looks

like a block of code -- lines of commands -- delimited in some way

(by BEGIN and END keywords in Pascal and other Algol-like languages, by

curly braces { ... } in C-family languages), and identified by a name. The

delimiters define the _scope_ of the named subroutine.  However, there's

more to a subroutine than just a block of code: otherwise we could just

use a macro processor to insert copies of the code block wherever it is

referred to in a program.


Subroutines are special because they are executed in a different context

from the surrounding code. The precise details of how this happens

differ between programming languages, but the general principle is that

things that go on inside a subroutine don't have side effects outside

it -- hence the importance of _scope_.


For example, in Pascal, when the flow of execution enters a subroutine

(be it a Function or a Procedure -- Pascal keywords that we'll look at

shortly), the runtime system pushes a marker onto its internal stack (where

variables are allocated). Thereafter, any new variables that are declared

result in memory being allocated on the stack above this marker. When

the flow of execution leaves the subroutine, the runtime system pops

the marker off the stack and reclaims all the memory above it, including

the variables declared within the subroutine. 


The effect of this stack-based activity is that variables declared in the

enclosing scope are visible from within the subroutine, but the contents

of the subroutine are invisible to the rest of the program. Subroutines

in Pascal are thus much like software black holes: light (or variables)

fall into them from outside, but what goes on within the event horizon

(stack frame) is invisible to the rest of the universe.  And much the

same goes for C, except that unlike Pascal, ANSI C doesn't allow you to

define functions within function definitions. The stack naturally gives

rise to _variable_ _scoping_, which allows us to define variables that

are local to a block of code and invisible outside it.


Perl is very different. Primitive shells don't have much idea of

variable scope, and in the beginning, neither did Perl. Perl 3 added the

concept of namespaces; by default all variables are allocated within a

namespace called "main", but you can add them to different namespaces by

referring to the namespace's name. For example, $main::fred is the same

as $fred (which implicitly belongs to 'main'), but $fruitbat::fred and

$vampire::fred are two different entities in different namespaces. (Note:

the :: operator, stolen from C++, is _not_ valid Perl 4 -- this syntax

came in Perl 5. In Perl 4, we used the single-quote symbol, ', instead.)


When writing code you can change the default namespace from 'main'

to something else by using the 'package' command; this affects the

subsequent block of code. Perl has some weird ideas about the scope of

blocks of code -- they can be delimited by curly braces, but also by a

containing file. The only essential is that a file loaded with require or

use must evaluate as true. So a series of subroutine definitions in a package is

often followed by a line containing the following:


  1; 


And is treated as a block, with scope defined by the enclosing file.
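
To see namespaces and the package command working together, here is a small sketch. The package names come from the text above; the values are made up, and strict is left off on purpose because the point is to use plain, unqualified global variables:

```perl
use warnings;
# No "use strict" here: like the listings in this tutorial, the sketch
# relies on unqualified package (global) variables.

$fred = "flintstone";            # really $main::fred

package fruitbat;                # the default namespace is now fruitbat
$fred = "hangs upside down";     # really $fruitbat::fred

package vampire;                 # and now it is vampire
$fred = "avoids garlic";         # really $vampire::fred

package main;                    # back where we started
print "$fred\n";                 # flintstone
print "$fruitbat::fred\n";       # hangs upside down
print "$vampire::fred\n";        # avoids garlic
print "$main::fred\n";           # flintstone
```

Three variables, all called $fred, and none of them treads on the others.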


The namespace mechanism doesn't give us true local variables that are scoped

to the subroutine -- a vital requirement for structured programming of any

kind. Normally Perl variables are globally accessible within their namespace.


The two types of scoping mechanism Perl provides are local variables

and lexical variables -- and they were added at different times. Local

variables came first historically; they're a bit less useful than

lexicals, but they still come in handy from time to time.  Let's look

at a code example:


  $flintstone = "fred";

  print "My favourite flintstone is generally $flintstone\n";


  sub specific {

      local($flintstone);

      $flintstone = "wilma";

      print "but I like $flintstone too\n";

      return;

  }


  &specific;

  print "says $flintstone!\n";


This snippet of program will print the following output:


  My favourite flintstone is generally fred

  but I like wilma too

  says fred!


What happened? First, we set the global variable $main::flintstone (to

"fred").  Then we defined a subroutine. Within this subroutine we call

local($flintstone).  This pushes the pre-existing value of $flintstone

onto an internal stack, and then assigns a new value to $flintstone

("wilma"). When we exit from the subroutine, Perl throws away the local

value of $flintstone, and "fred" makes a come-back.


Local lets us override a global variable locally; if you declare all

the variables you use in a subroutine using local(), you get a semblance

of true variable scoping. (But it's a pale imitation of the real thing,

with various drawbacks.) This was the Perl 4 way of doing things, and it's

obsolete for this purpose; the local() keyword persists because it sometimes

comes in handy to temporarily override a variable's value and then reinstate

it.
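
One of those drawbacks is worth seeing in action: a local() value is visible not only in the subroutine that set it, but in every subroutine called from there until it returns. A minimal sketch, with report() invented for the purpose:

```perl
use warnings;
# No "use strict": $flintstone is deliberately a package (global) variable.

$flintstone = "fred";

sub report {
    # Sees whichever value of the global is current when it is called.
    return "report() sees $flintstone";
}

sub specific {
    local $flintstone = "wilma";   # the old value is restored on exit
    return report();               # the callee sees "wilma" too
}

print specific(), "\n";   # report() sees wilma
print report(), "\n";     # report() sees fred
```

This is handy when you want it and baffling when you don't, which is why local() is best kept for temporarily overriding globals.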


The Perl 5 way of providing lexically scoped variables is to declare

them with my() or our(); this provides a very flexible scoping mechanism

that is in some ways better than that of C or Pascal.


When you declare a variable with my():


  my ($fred) = "flintstone";


You are creating a variable that doesn't exist in any name space. It exists

in another conceptual space, associated with whatever block of code you were

in when you called my() -- the enclosing subroutine, or set of lines delimited

with curly braces, or whatever. 

For example:


  my $flintstone = "wilma"; 


  sub specific {

      my $flintstone = "fred";

      print "In specific, my flintstone is ", ($flintstone || " undefined "), "\n";

      return;

  }


  print "My favourite flintstone is generally ", ($flintstone || " undefined "), "\n";

  &specific;

  print "My favourite flintstone is generally ", ($flintstone || " undefined "), "\n";


Will print:


  My favourite flintstone is generally wilma

  In specific, my flintstone is fred 

  My favourite flintstone is generally wilma


See how resetting the value of $flintstone in specific() doesn't affect its

value in the enclosing scope.


You can share private variables among a bunch of subroutines by using

curly braces to establish an enclosing scope:


  {

     my $traffic_light = "red"; # default value


     sub stop              { $traffic_light = "red" }

     sub get_ready_to_go   { $traffic_light = "red+amber" }

     sub go                { $traffic_light = "green" }

     sub get_ready_to_stop { $traffic_light = "amber" }

  }


This provides a rough equivalent to a C static variable.  If you need

to extend this to cover an entire program, providing a global that

will be accessed from various places, providing an enclosing scope is

a bit cumbersome. Perl 5.6 provides our(), which provides access to a

lexically scoped global variable. It behaves like my(), but the our()

variable has an independent existence; we don't want to declare such

a variable multiple times, as the effects of assigning to it persist

outside the scope of the declaration.
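
Here is a rough sketch of our() at work, assuming Perl 5.6 or later. The TrafficLight package and its subroutines are hypothetical, loosely based on the traffic light listing above:

```perl
use strict;
use warnings;

package TrafficLight;

# our() lets us write $state as shorthand for $TrafficLight::state,
# even with strict switched on.
our $state = "red";

sub go   { $state = "green"; return $state }
sub stop { $state = "red";   return $state }

package main;

TrafficLight::go();

# Unlike a my() variable, this is a real package variable, so code
# anywhere in the program can reach it by its full name.
print "The light is $TrafficLight::state\n";   # The light is green
```

Compare this with the version wrapped in curly braces: there, nothing outside the braces could touch $traffic_light at all.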


If lexically scoped variables aren't visible outside subroutines, how

do we get data in or out of a subroutine safely? That's what we need to

look at next ...


SUBHEADING: Parameter passing 


Pascal knows about two types of subroutine: procedures and functions.

In Pascal, a "procedure" takes zero or more _inputs_ and does something

with them.  In contrast, a "function" takes zero or more _inputs_, does

something with them, and returns an _output_. In C, functions subsume

both these roles -- a Pascal procedure is replaced by a function that

returns void.


Perl takes a different tack. Perl subroutines all accept zero or more

parameters and return zero or more values. The inputs are fed

in as a list after the subroutine's name when you call it, and they

automagically appear to the subroutine code in the local variable @_ --

an array. When you leave a subroutine you can either return a list of

variables by calling return() (with a list or array as a parameter), or

the subroutine will automatically return the result of the last command.


For example:


    sub list_params {

        my $count = 0;

        foreach (@_) {

             print "Parameter ", ++$count, ": ", $_, "\n";

        }

        return $count;

    }


    print list_params("This is ", "the", "third parameter"), "\n";


Will print:


  Parameter 1: This is

  Parameter 2: the

  Parameter 3: third parameter

  3


Each of the items in 


   list_params("This is ", "the", "third parameter");


Has locally assumed a position in @_ as it appears within sub list_params().

Our code then loops over the contents of the array, counts them, and returns

a scalar, $count -- which you may have noticed was declared as a lexical

variable visible only within list_params. Hint: lexicals may be visible 

only within their enclosing scope, but you can use return() to send their

values out into the big wide world. 
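
The usual idiom is to copy @_ into named lexicals on the first line of the subroutine, and to hand results back as a list. A minimal sketch follows; min_max() is a made-up helper, not anything built into Perl:

```perl
use strict;
use warnings;

# min_max() takes a list of numbers in @_ and returns a two-element
# list: (smallest, largest).
sub min_max {
    my @numbers = @_;              # copy the arguments into a lexical
    my ($min, $max) = ($numbers[0], $numbers[0]);
    foreach my $n (@numbers) {
        $min = $n if $n < $min;
        $max = $n if $n > $max;
    }
    return ($min, $max);           # a list of two values
}

my ($low, $high) = min_max(7, 3, 12, 5);
print "low=$low high=$high\n";     # low=3 high=12
```

The caller unpacks the returned list with a my() list assignment, mirroring what the subroutine did with @_.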


Parameter passing in @_ has its weaknesses. For example, either:


   list_params(%my_hash);


or

 
   list_params(@array1, @some_other_array);


are going to result in an unholy mess. In the first case, the hash is

flattened into an array (with key/value pairs alternating); we have to

rebuild it into a hash once we're inside the subroutine, which had better

have been written to expect to do this:


  %my_imported_hash = @_; # and hope @_ has an even number of elements! 


In the second case, we have no such good luck; the two arrays are merged

end-to-end into @_, with no obvious boundary!


For this reason, it is a good idea to always pass hashes or arrays

into subroutines by passing a _reference_ to the item rather than the

item itself:


  list_params(\@array1, \@some_other_array);


Produces an @_ with two elements, each of which is a reference to

a different array.


A second reason for passing references rather than hashes or arrays is

that when we call the subroutine every element of the array or hash is

pushed onto the argument list, and usually copied again when the subroutine

unpacks @_ into its own variables. Shuffling a large data structure around

like this is wasteful and slow; passing a reference means handling

a single scalar instead.
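
On the receiving end, the subroutine has to dereference what it was given. A small sketch, assuming a hypothetical describe_pair() that expects exactly two array references:

```perl
use strict;
use warnings;

# describe_pair() receives two scalars, each a reference to an array,
# so the boundary between the two arrays survives the call.
sub describe_pair {
    my ($first, $second) = @_;
    return scalar(@$first) . " then " . scalar(@$second);
}

my @array1           = ("a", "b", "c");
my @some_other_array = (1, 2);

print describe_pair(\@array1, \@some_other_array), "\n";   # 3 then 2
```

Bear in mind that the subroutine is now working on the caller's own arrays: anything it changes through the references stays changed.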


We have the same problem, in reverse, with return(). The return() command

returns a list of values to the calling context. We can return a hash,

but if we want to return two hashes, or a hash and an array, or any mixture

of such structures, we may be in trouble. Things get even worse when we

read the small print for return(); as subroutines can be called in a scalar 

context or an array context (or a void context, where the returned values

are discarded) we need to use wantarray() to decide just _what_ to return.

For example:


    sub reverse_and_capitalize {

        return unless defined wantarray; # why bother carrying on if the results

                                         # will be thrown away?

        my @result = map { uc } reverse @_;

        return wantarray ? @result : \@result;

    }


(If wantarray() is true, the subroutine has been called in list context. It

returns a defined false value if called in scalar context, and an undefined

false value if called in void context.)
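
The three cases are easier to tell apart when you watch them happen. In this sketch, context() is an invented subroutine that simply reports how it was called:

```perl
use strict;
use warnings;

sub context {
    return "list"   if wantarray;            # true: list context
    return "scalar" if defined wantarray;    # defined but false
    print "called in void context\n";        # undefined: void context
    return;
}

my @in_list   = context();   # list context
my $in_scalar = context();   # scalar context
context();                   # void context: prints a message

print "$in_list[0] $in_scalar\n";   # list scalar
```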


We can exert some degree of control over what goes into a subroutine by

using prototypes.


SUBHEADING: Prototypes


A prototype is a declaration that specifies the parameters that a 

subroutine will accept. 


Languages like Pascal and C are very fussy about what you are allowed to

pass into a subroutine -- when you create it you need to declare the 

parameters, their basic data type, and the order they'll be passed in.

Call the subroutine with the wrong parameters and the compiler will

yell at you. Because Perl is a very free-form language and @_ is an

infinitely extensible array, the traditional Perl approach was to just

stuff every parameter into the array and let the subroutine sort it

out. Unfortunately this can lead to all sorts of headaches. For example:


  sub process_data {

     my $param = shift ;

     # do something with $param

  }


  ...


  process_data(@large_array_of_stuff);


This is an error because only the first item of @_ is used by process_data() --

shift() extracts and returns the first element in @_. What makes this error

particularly gruesome is the fact that it is not a compile-time, or even a

run-time, error: the program will work perfectly, but only the first element

in @large_array_of_stuff will ever be processed!


Of course, we could fix this inside process_data():


  sub process_data {

     foreach my $param (@_) {

     # do stuff


     }

  }


But this is the kind of solution you need to impose on every subroutine you

write.


A different approach is to use prototypes, which we would do like this:


  sub process_data($) {

    ...


The prototype is a pattern enclosed in brackets that follows the subroutine

name. In this instance, the single dollar sign means "this subroutine is

called with a single scalar parameter". 


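If you declare process_data($), instead of process_data(), then any attempt

to call process_data with too many or too few parameters will result in a

compile-time error. Beware, though: the $ imposes scalar context on its

argument, so process_data(@large_array_of_stuff) still compiles, and quietly

passes the number of elements in the array.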


You can tell process_data() to expect two scalar parameters like this:


  sub process_data($$) { ...


Or to expect a list of parameters (an array, say):


  sub process_data(@) { ...


Or a hash:


  sub process_data(%) { ...


(Note that an unbackslashed % or @ eats everything else in the arguments and

forces list context, because of the way everything is passed in @_.)


Or an array which is _actually_ treated as an array reference, rather than

just a list of the elements stashed in it (for example, the way push(@fred, "stuff")

treats @fred):


  sub process_data(\@) {


A backslash in front of a prototype character means that the actual supplied

argument must begin with that character (i.e. \@ means that an actual array

must be specified, not just a list).
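
As a sketch of what the backslash buys us, here is a hypothetical longer_of() that takes two arrays and keeps them apart. The subroutine must be declared before it is called, or the prototype has no effect:

```perl
use strict;
use warnings;

# The (\@\@) prototype makes Perl pass a reference to each array,
# so the two are not merged end-to-end in @_.
sub longer_of (\@\@) {
    my ($first, $second) = @_;        # two array references
    return @$first >= @$second ? $first : $second;
}

my @short = (1, 2);
my @long  = (1, 2, 3, 4);

my $winner = longer_of(@short, @long);   # no backslashes needed here
print scalar(@$winner), " elements\n";   # 4 elements
```

The caller writes plain @short and @long, just as with push(); Perl quietly takes the references on our behalf.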


We can specify that some arguments are optional:


  sub my_complex_sub($$;$) {


(my_complex_sub() expects two scalars, with an optional third parameter)
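
Inside the subroutine, an optional parameter that was left out simply arrives as undef, so it is up to us to supply a default. A minimal sketch, with greet() invented for illustration:

```perl
use strict;
use warnings;

# Two required scalars, then one optional scalar after the semicolon.
sub greet ($$;$) {
    my ($greeting, $name, $punctuation) = @_;
    $punctuation = "!" unless defined $punctuation;   # default value
    return "$greeting, $name$punctuation";
}

print greet("Hello", "Wilma"), "\n";        # Hello, Wilma!
print greet("Hello", "Fred", "?"), "\n";    # Hello, Fred?
```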


Or filehandle globs:


  sub redirect_io(**) {


(redirect_io() expects two globs, typically on filehandles, as parameters).


Or subroutine references:


  sub wrap_around_sub(\&) { ...


(This treats the first and only argument as a reference to a subroutine. Yes,

there are reasons you might want to do this!)


It must be emphasized that Perl prototypes are _not_ like prototypes in C

or Pascal -- strongly typed languages. Perl prototypes might better be

called "templates", because that's what they are -- a template that specifies

the layout of the parameters that must be passed into a subroutine. The

subroutine itself still gets them as a concatenated list of stuff glommed

together in @_. Nor can you make a prototype element relate to a specific

global variable. Perl doesn't let you specify formal named parameters.


SUBHEADING: eval() -- exception handling and dynamic code


Perl is a dynamic language, unlike C or Pascal. That is, because it's

interpreted (or at least only semi-compiled) you can take a string containing

some Perl source code and compile and run it dynamically. 


The classic way of doing this is to use the eval() command. Typically,

like most Perl commands, eval is used for two things: for compiling and

executing snippets of code on the fly, and for trapping runtime errors

(where Java would use a try/catch exception handler, for example).


You can tell the difference quite easily. If you see eval() called on

a string, it's being used to dynamically execute the code in the string.

And if you see eval() called on a subroutine or other block of code (such

as anything enclosed in curly braces) it's probably being used as an 

error trap.


Here's a really, really simple Perl shell that lets you interactively try

Perl commands out:


  #!/usr/bin/perl

  
  print "\nperl>";

  while (my $arg = <STDIN>) {

      chomp $arg;

      last if ($arg =~ /^exit/i);

      eval($arg);

      print $@;

      print "\nperl>";

  }


The special variable $@ holds any errors that were returned by eval(); if

the expression was eval'd successfully, then $@ will be an empty string.

Yes, you can assemble small programs and eval() them by sticking them in

a string. But be warned that this is a very slow way to run Perl. However,

because it's compiling the code while the program runs, it can cope with

syntax errors (badly, but they won't crash the enclosing program; they'll

just set an error message in $@).
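
Here is a small sketch of string eval coping with both good and bad source code. The two snippets are made up; the second has a deliberate syntax error (a missing bracket):

```perl
use strict;
use warnings;

my $good = '6 * 7';
my $bad  = '6 * (7';                # deliberately broken

my $answer = eval $good;            # compiled and run right now
print "good: $answer\n";            # good: 42

my $oops  = eval $bad;              # fails to compile, returns undef
my $error = $@;                     # grab the message before it changes
print "bad snippet was rejected\n" if !defined $oops && $error;
print "still running\n";            # the enclosing program carries on
```

Copying $@ into a variable of our own straight away is a sensible habit, because the next eval will overwrite it.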


We use eval in block context when we're doing something a bit more

delicate -- executing a subroutine which may or may not succeed (typically

because it relies on some system resource external to Perl) -- and we

don't want failure to cause our program to terminate prematurely.


For example:


  eval {

     # this is a block of code that might throw a runtime error

     if (! open(FOO, "</tmp/mysocket.$$") ) {

         die("Eek! open(FOO, </tmp/mysocket.$$) failed: $!\n");

     }

  };

  if ($@) {

     # deal with the exception 


  } else {

     # everything worked okay

  }


The advantage of this particular example is minor -- it just returns

a more meaningful error message than whatever $! would normally

contain -- but you can wrap it around blocks of code, subroutines, or

attempts to run external commands and it will keep your program

alive when all else fails. One other factor to note is that the

block in an 'eval {}' type construct is compiled at compile time,

rather than run-time -- so it's no slower than any other construct.
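
The same pattern works for any subroutine that might call die(). In this sketch risky() and try_it() are hypothetical names; the point is that the failure is caught and the program keeps going:

```perl
use strict;
use warnings;

# risky() fails on demand, standing in for anything that might die.
sub risky {
    my ($should_fail) = @_;
    die "Eek! something went wrong\n" if $should_fail;
    return "all is well";
}

# try_it() wraps the call in eval {} and reports what happened.
sub try_it {
    my ($should_fail) = @_;
    my $result = eval { risky($should_fail) };
    return $@ ? "caught: $@" : "ok: $result\n";
}

print try_it(0);   # ok: all is well
print try_it(1);   # caught: Eek! something went wrong
```

Note that eval {} is an expression, not a control structure, so the statement it appears in needs a semicolon at the end.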


SUBHEADING: Closures -- generating subroutines on the fly


Using eval() to create a chunk of code and run it isn't that useful.

Firstly it's slow; secondly, each time you call the code you have to

recompile it. Wouldn't it be handy if there was a way to write a 

subroutine just once, compile it, and thereafter use it dynamically?


As it happens, there is: since 5.003 Perl provides a feature (originally

seen in Lisp) called the closure. 


A closure is an anonymous subroutine -- that is, a proper subroutine

that has no name. Instead, we take a reference to the subroutine when

we create it, and pass it around. Remember how we can define a subroutine

called "fred", put "fred" into a scalar variable like $foo, and call

&fred by way of &{$foo}? This is a symbolic reference -- we refer to the

symbol by its name. Closures are not symbolic references -- they have no

name, but just a reference. We can create them like this:


  my $sub_with_no_name = sub { print 'I am a closure'; return; };


We can pass the contents of $sub_with_no_name around, and call it

by prefixing it with '&':


   $run_me = $sub_with_no_name;

   &$run_me;


We can pass it as a parameter to a subroutine, or stash it in another

variable:


   my %hash_of_subs;

   
   sub add_to_table ($$$) {

       my ($tab, $sub, $name) = @_;

       $tab->{$name} = $sub;

       return $tab;

   }

   
   $subs = add_to_table(\%hash_of_subs, $sub_with_no_name, 'sub_with_no_name');

   
   &{$subs->{sub_with_no_name}}; # runs the sub with no name


An interesting feature arises from the interaction between closures

and lexically scoped variables.  If you create a closure, it 'sees'

any lexical variables that are in scope at the (run-time)

moment when the closure was created. Closures exist within the lexical

scope they were created in, and this includes preserving the contents of

lexical variables that might otherwise go out of scope. For example:


    { 

         my $colour = "blue";

         $subref = sub { return $colour };

    }

    print &$subref;


Prints:


    blue


Even though the call through $subref takes place outside the scope in which

$colour was defined.  Even if we create a new lexical of the same name,

the old context persists. For example:


    { 

         my $colour = "blue";

         $subref = sub { return $colour };

    }

    my $colour = "red";

    print &$subref;


Prints:


    blue


Here's another example of what persistent lexical scope can do for us:


    my $c = 0;

    $subref = sub { return $c++ };

    
    for ($i = 0; $i < 10; $i++) {

        print &$subref, "\n";

    }


This prints the numbers 0 to 9 inclusive. Yes, $c survives in the

closure, and when we increment its value inside the closure it persists

from one invocation of the subroutine to the next! Essentially $c has

become a static variable associated with the closure. 
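
Each closure gets its own copy of the lexical, too. In this sketch make_counter() is an invented factory; every call to it creates a fresh $c, so the counters it hands back tick independently:

```perl
use strict;
use warnings;

# Every call creates a new lexical $c, and the closure returned
# keeps that particular $c alive.
sub make_counter {
    my $c = shift;
    return sub { return $c++ };
}

my $from_zero = make_counter(0);
my $from_ten  = make_counter(10);

my @seen = ($from_zero->(), $from_zero->(), $from_ten->(), $from_zero->());
print "@seen\n";   # 0 1 10 2
```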


Using closures we can generate subroutines on the fly, associate

them with some initial starting values, and invoke them elsewhere in

our programs. This technique is most commonly used in event-driven

GUI programming, where such subroutines are known as _callbacks_ --

fragments of code that are triggered by events. (We set up a whole

bunch of callbacks by creating closures, put them into some kind of data

structure, and then loop repeatedly -- events are checked against the data

structure and used to trigger an appropriate callback to handle them.)
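
A rough sketch of that arrangement, with no GUI toolkit involved: the event names, the @log array and the canned list of events are all invented here, standing in for whatever a real toolkit would supply.

```perl
use strict;
use warnings;

# A table mapping event names to callbacks.
my @log;
my %callback_for = (
    click => sub { push @log, "clicked at $_[0],$_[1]" },
    key   => sub { push @log, "key '$_[0]' pressed" },
);

# A stand-in for the event loop: each event is a name plus arguments.
my @events = ( [ click => 10, 20 ], [ key => "q" ], [ scroll => 3 ] );

foreach my $event (@events) {
    my ($name, @args) = @$event;
    my $handler = $callback_for{$name} or next;   # ignore unknown events
    $handler->(@args);
}

print "$_\n" for @log;
```

The "scroll" event has no entry in the table, so it is silently skipped; a real program might want to log that instead.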


Here's a really simple subroutine factory:


  sub sub_factory ($) {

      my $name = shift;

      my $x = sub { 

          $name;

          if (@_) { 

            $name = shift;

          }

          return $name;

      };

      return  $x;

  }

  $setting = sub_factory("traffic light");

  $state   = sub_factory("widget");


  # and so on, until we do something funky like ...


  print &$setting, "\n";

  print $setting->("red"), "\n";

  print $state->($setting->()), "\n";


Whenever sub_factory() creates a closure it returns a reference to

it in $x; the closures all have the general structure:


      sub [NONAME] { 

          $name;

          if (@_) { 

            $name = shift;

          }

          return $name;

      };


These closures store the lexical variable that was passed to

sub_factory() when it was called: if called with a parameter, the

closure updates the lexical, and then it returns the value.

This might seem slightly pointless until you think in terms of 

$name being something like "menu1.button3", the value as

being something like "checked" or "unchecked", and some additional

code being called from within the closure before it returns --

to do whatever the GUI interface wants to happen when the button

is checked or unchecked. 


SUBHEADING: Odds and ends -- lvalue subroutines and exotica


Normally when we call a subroutine, we pass it parameters and expect

it to return a value:


   $result = some_subroutine_call(param1, param2 ...);


But there's a weird facility in Perl 5.6 that lets us create subroutines

that we can _assign_ to:


   weird_subroutine(param1, param2 ...) = $new_value;


And which updates the internal lexical state of the subroutine! 


This is almost exactly the opposite way we normally use subroutines,

and such wild entities are known as 'lvalue subroutines' (because

they can appear on the left-hand side of an assignment expression).


This is actually one special case of a subroutine that has an

attribute set. Perl subroutine attributes are special flags that

indicate that a subroutine has some specified feature. For example,

to set up a simple lvalue sub:


  my $fred;

  sub lvalsub : lvalue {

     $fred;

  }


  lvalsub() = 5;
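
To check that the assignment really lands in $fred, here is the same idea as a complete sketch that prints the result (assuming Perl 5.6 or later, where the lvalue attribute is available):

```perl
use strict;
use warnings;

my $fred = 1;

# Because of the lvalue attribute, the call itself can be assigned to.
# The last expression in the sub ($fred) is what receives the value.
sub lvalsub : lvalue {
    $fred;
}

lvalsub() = 5;            # assigns to $fred through the subroutine
print "fred is $fred\n";  # fred is 5

lvalsub() += 1;           # other assignment operators work too
print "fred is $fred\n";  # fred is 6
```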


There are other attributes you can apply to subroutines; for example,

the locked attribute, which specifies that only one thread at a time

may call it, and the method attribute (which ensures that when called 

its first parameter -- the object it is called on -- is locked before

execution). Attributes are specified after the subroutine name:


  sub contention_sensitive_op : locked {


or:


  sub initialize : locked method {


The attribute system is extensible -- you can create your own attribute

names, too. However this is still somewhat experimental -- if you're

going to start down that road you ought to be reading "Programming Perl"

(3rd edition) and the Perl source code, because this is where I stop!