Linux Format 22 Perl tutorial

[[ TYPOGRAPHY/LAYOUT -- text surrounded by _underscore_ characters like so should be italicized or emphasized. Text indented from the margin by two or more characters is a program listing: needs monospaced typeface, with indentation and word wrap preserved. Contact me if it needs changing to fit the page. ]]

HEADING: Subroutines

For the past umpty-something tutorials, we've been using subroutines in Perl without examining the topic too deeply. This is a mistake; there's some weird magic buried in there that can come in extremely handy at times -- or turn around and bite a programmer used to a more traditional language, such as C or Pascal. So, let's take a look at how Perl handles blocks of code.

SUBHEADING: In the beginning

Perl's flow of control structures were originally inherited from the UNIX shells -- in fact, Perl started life as a kind of shell on steroids, with built-in equivalents of awk and sed. A subroutine is a block of code with a name:

  sub my_sub {
    print "hi there!";
  }

We can execute this block by referring to it by name. Note that Perl lets us do this in several ways. The &-prefix before a name indicates that the following name is that of a subroutine declared somewhere else in our Perl program. Using the prefix tells Perl that we want to execute it:

  &my_sub;

If we declared it above the point where we call it, we can usually leave off the prefix:

  my_sub;

And if we are passing parameters to it we can specify them in brackets:

  sub my_sub {
    my (@parameters) = @_;
    foreach (@parameters) {
      print "You said $_\n";
    }
  }

  my_sub("hi", "there");

Because subroutine names are symbols, we can stash them in variables and refer to them indirectly. For example:

  my $subroutine = "my_sub";
  &{$subroutine}("elephant");

Will print:

  You said elephant

Let's take a closer look at what a subroutine is.
A subroutine looks like a block of code -- lines of commands -- delimited in some way (by BEGIN and END keywords in Pascal and other Algol-like languages, by curly braces { ... } in C-family languages), and identified by a name. The delimiters define the _scope_ of the named subroutine. However, there's more to a subroutine than just a block of code: otherwise we could just use a macro processor to insert copies of the code block wherever it is referred to in a program. Subroutines are special because they are executed in a different context from the surrounding code. The precise details of how this happens differ between programming languages, but the general principle is that things that go on inside a subroutine don't have side effects outside it -- hence the importance of _scope_. For example, in Pascal, when the flow of execution enters a subroutine (be it a Function or a Procedure -- Pascal keywords that we'll look at shortly), the compiler pushes a marker onto its internal stack (where variables are allocated). Thereafter, any new variables that are declared result in memory being allocated on the stack above this marker. When the flow of execution leaves the subroutine, the runtime system pops the marker off the stack and reclaims all the memory above it, including the variables declared within the subroutine. The effect of this stack-based activity is that variables declared in the enclosing scope are visible from within the subroutine, but the contents of the subroutine are invisible to the rest of the program. Subroutines in Pascal are thus much like software black holes: light (or variables) fall into them from outside, but what goes on within the event horizon (stack frame) is invisible to the rest of the universe. And much the same goes for C, except that unlike Pascal, ANSI C doesn't allow you to define functions within function definitions. 
The stack naturally gives rise to _variable_ _scoping_, which allows us to define variables that are local to a block of code and invisible outside it.

Perl is very different. Primitive shells don't have much idea of variable scope, and in the beginning, neither did Perl. Perl 3 added the concept of namespaces; by default all variables are allocated within a namespace called "main", but you can add them to different namespaces by referring to the namespace's name. For example, $main::fred is the same as $fred (which implicitly belongs to 'main'), but $fruitbat::fred and $vampire::fred are two different entities in different namespaces. (Note: the :: operator, stolen from C++, is _not_ valid Perl 4 -- this syntax came in Perl 5. In Perl 4, we used the single-quote symbol, ', instead, as in $main'fred.)

When writing code you can change the default namespace from 'main' to something else by using the 'package' command; this affects the subsequent block of code. Perl has some weird ideas about the scope of blocks of code -- they can be delimited by curly braces, but also by a containing file. The only essential is that a file loaded with require or use must evaluate as true. So a series of subroutine definitions in a package file is often followed by a line containing the following:

  1;

The file is then treated as a block, with scope defined by the enclosing file.

The namespace mechanism doesn't give us true local variables that are scoped to the subroutine -- a vital requirement for structured programming of any kind. Normally Perl variables are globally accessible within their namespace. The two types of scoping mechanism Perl provides are local variables and lexical variables -- and they were added at different times. Local variables came first historically; they're a bit less useful than lexicals, but they still come in handy from time to time.
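Before moving on to local(), the namespace mechanism described above can be sketched in a few lines. This is only an illustration: the package names fruitbat and vampire come from the text, but the values assigned to each $fred are invented for the example.

```perl
#!/usr/bin/perl
# Sketch only: the strings assigned below are made up for illustration.

$fred = "plain fred";           # lives in package main, i.e. $main::fred

package fruitbat;               # switch the default namespace
$fred = "fruitbat's fred";      # this one is $fruitbat::fred

package vampire;
$fred = "vampire's fred";       # and this one is $vampire::fred

package main;                   # back to the default namespace

print "$fred\n";                # the unqualified name means $main::fred here
print "$main::fred\n";          # same variable, fully qualified
print "$fruitbat::fred\n";      # a different variable in another namespace
print "$vampire::fred\n";       # and a third
```

Note that all three variables are still global: any code anywhere in the program can reach them by spelling out the package name, which is why namespaces alone don't give us private variables.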
Let's look at a code example of local() in action:

  $flintstone = "fred";
  print "My favourite flintstone is generally $flintstone\n";

  sub specific {
    local($flintstone);
    $flintstone = "wilma";
    print "but I like $flintstone too\n";
    return;
  }

  &specific;
  print "says $flintstone!\n";

This snippet of program will print the following output:

  My favourite flintstone is generally fred
  but I like wilma too
  says fred!

What happened? First, we set the global variable $main::flintstone (to "fred"). Then we defined a subroutine. Within this subroutine we call local($flintstone). This pushes the pre-existing value of $flintstone onto an internal stack, and then we assign a new value to $flintstone ("wilma"). When we exit from the subroutine, Perl throws away the local value of $flintstone, and "fred" makes a come-back.

Local lets us override a global variable locally; if you declare all the variables you use in a subroutine using local(), you get a semblance of true variable scoping. (But it's a pale imitation of the real thing, with various drawbacks.) This was the Perl 4 way of doing things, and it's obsolete for this purpose; the local() keyword persists because it sometimes comes in handy to temporarily override a variable's value and then reinstate it.

The Perl 5 way of providing lexically scoped variables is to declare them with my() or our(); this provides a very flexible scoping mechanism that is in some ways better than that of C or Pascal. (What local() gives you, by contrast, is dynamic scoping.) When you declare a variable with my():

  my ($fred) = "flintstone";

You are creating a variable that doesn't exist in any name space. It exists in another conceptual space, associated with whatever block of code you were in when you called my() -- the enclosing subroutine, or set of lines delimited with curly braces, or whatever.
For example:

  my $flintstone = "wilma";

  sub specific {
    my $flintstone = "fred";   # a new lexical, private to this subroutine
    print "In specific, my flintstone is ", ($flintstone || " undefined "), "\n";
    return;
  }

  print "My favourite flintstone is generally ", ($flintstone || " undefined "), "\n";
  &specific;
  print "My favourite flintstone is generally ", ($flintstone || " undefined "), "\n";

Will print:

  My favourite flintstone is generally wilma
  In specific, my flintstone is fred
  My favourite flintstone is generally wilma

See how setting the value of $flintstone in specific() doesn't affect its value in the enclosing scope.

You can share private variables among a bunch of subroutines by using curly braces to establish an enclosing scope:

  {
    my $traffic_light = "red"; # default value
    sub stop              { $traffic_light = "red" }
    sub get_ready_to_go   { $traffic_light = "red+amber" }
    sub go                { $traffic_light = "green" }
    sub get_ready_to_stop { $traffic_light = "amber" }
  }

This provides a rough equivalent to a C static variable. If you need to extend this to cover an entire program, providing a global that will be accessed from various places, providing an enclosing scope is a bit cumbersome. Perl 5.6 provides our(), which gives lexically scoped access to a global variable. It behaves like my(), but the our() variable has an independent existence; we don't want to declare such a variable multiple times, as the effects of assigning to it persist outside the scope of the declaration.

If lexically scoped variables aren't visible outside subroutines, how do we get data in or out of a subroutine safely? That's what we need to look at next ...

SUBHEADING: Parameter passing

Pascal knows about two types of subroutine: procedures and functions. In Pascal, a "procedure" takes zero or more _inputs_ and does something with them. In contrast, a "function" takes zero or more _inputs_, does something with them, and returns an _output_. In C, functions subsume both these roles -- a Pascal procedure is replaced by a function that returns void.
Perl takes a different tack. Perl subroutines all accept zero or more parameters and return zero or more values. The inputs are fed in as a list after the subroutine's name when you call it, and they automagically appear to the subroutine code in the local variable @_ -- an array. When you leave a subroutine you can either return a list of values by calling return() (with a list or array as a parameter), or the subroutine will automatically return the value of the last expression it evaluated. For example:

  sub list_params {
    my $count = 0;
    foreach (@_) {
      print "Parameter ", ++$count, ": ", $_, "\n";
    }
    return $count;
  }

  print list_params("This is ", "the", "third parameter"), "\n";

Will print:

  Parameter 1: This is 
  Parameter 2: the
  Parameter 3: third parameter
  3

Each of the items in

  list_params("This is ", "the", "third parameter");

Has locally assumed a position in @_ as it appears within sub list_params(). Our code then loops over the contents of the array, counts them, and returns a scalar, $count -- which you may have noticed was declared as a lexical variable visible only within list_params. Hint: lexicals may be visible only within their enclosing scope, but you can use return() to send their values out into the big wide world.

Parameter passing in @_ has its weaknesses. For example, either:

  list_params(%my_hash);

or

  list_params(@array1, @some_other_array);

are going to result in an unholy mess. In the first case, the hash is flattened into an array (with key/value pairs alternating); we have to rebuild it into a hash once we're inside the subroutine, which had better have been written to expect to do this:

  %my_imported_hash = @_; # and hope @_ has an even number of elements!

In the second case, we have no such good luck; the two arrays are merged end-to-end into @_, with no obvious boundary!
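To see the flattening at work, here is a minimal sketch. The subroutine count_params and the sample data are hypothetical, invented purely for this illustration; all it does is report how many elements turned up in @_.

```perl
#!/usr/bin/perl
# Sketch only: count_params and the sample data are hypothetical.

sub count_params {
  return scalar(@_);            # how many elements arrived in @_
}

my @array1           = ("a", "b", "c");
my @some_other_array = ("d", "e");
my %my_hash          = (colour => "red", shape => "round");

# Two arrays arrive as a single five-element list; the boundary is gone.
print "two arrays: ", count_params(@array1, @some_other_array), " elements\n";

# A two-key hash arrives as four elements: key, value, key, value.
print "one hash:   ", count_params(%my_hash), " elements\n";
```

The subroutine has no way of telling that the five elements in the first call started life as a three-element array and a two-element array.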
For this reason, it is a good idea to always pass hashes or arrays into subroutines by passing a _reference_ to the item rather than the item itself:

  list_params(\@array1, \@some_other_array);

Produces an @_ with two elements, each of which is a reference to a different array. A second reason for passing references rather than hashes or arrays is that when we call the subroutine the entire contents of the array or hash are flattened into @_, and are usually copied again when the subroutine unpacks @_ into its own variables. Duplicating a data structure which might be quite large is wasteful and slow; passing a reference entails duplicating a single scalar instead.

We have the same problem, in reverse, with return(). The return() command returns a list of values to the calling context. We can return a hash, but if we want to return two hashes, or a hash and an array, or any mixture of such structures, we may be in trouble. Things get even worse when we read the small print for return(); as subroutines can be called in a scalar context or a list context (or a void context, where the returned values are discarded) we need to use wantarray() to decide just _what_ to return. For example:

  sub reverse_and_capitalize {
    return unless defined wantarray; # why bother carrying on if the results
                                     # will be thrown away?
    @_ = reverse @_;
    @_ = map { uc } @_;              # uc() handles one scalar at a time
    return wantarray ? @_ : [ @_ ];  # a list, or a reference to a copy
  }

(If wantarray() is true, the subroutine has been called in list context. It returns a defined false value if called in scalar context, and an undefined false value if called in void context.)

We can exert some degree of control over what goes into a subroutine by using prototypes.

SUBHEADING: Prototypes

A prototype is a declaration that specifies the parameters that a subroutine will accept. Languages like Pascal and C are very fussy about what you are allowed to pass into a subroutine -- when you create it you need to declare the parameters, their basic data type, and the order they'll be passed in. Call the subroutine with the wrong parameters and the compiler will yell at you.
Because Perl is a very free-form language and @_ is an infinitely extensible array, the traditional Perl approach was to just stuff every parameter into the array and let the subroutine sort it out. Unfortunately this can lead to all sorts of headaches. For example:

  sub process_data {
    my $param = shift;
    # do something with $param
  }
  ...
  process_data(@large_array_of_stuff);

This is an error because only the first item of @_ is used by process_data() -- shift() extracts and returns the first element in @_. What makes this error particularly gruesome is the fact that it is not a compile-time, or even a run-time, error: the program will work perfectly, but only the first element in @large_array_of_stuff will ever be processed!

Of course, we could fix this inside process_data():

  sub process_data {
    # test with defined() so that a false value such as 0 doesn't end the loop early
    while (defined(my $param = shift)) {
      # do stuff
    }
  }

But this is the kind of solution you need to impose on every subroutine you write. A different approach is to use prototypes. Which we would do like this:

  sub process_data($) { ...

The prototype is a pattern enclosed in brackets that follows the subroutine name. In this instance, the single dollar sign means "this subroutine is called with a single scalar parameter". If you declare process_data($), instead of process_data(), then any attempt to call process_data with the wrong number of parameters will result in a compile-time error, and a lone array argument is evaluated in scalar context rather than being flattened into a list. (The prototype only takes effect if Perl has seen the declaration before it compiles the call, and only if you call the subroutine without the & prefix.)

You can tell process_data() to expect two scalar parameters like this:

  sub process_data($$) { ...

Or to expect a single array parameter:

  sub process_data(@) { ...

Or a hash:

  sub process_data(%) { ...

(Note that an unbackslashed % or @ eats everything else in the arguments and forces list context, because of the way everything is passed in @_.)
Or an array which is _actually_ treated as an array reference, rather than just a list of the elements stashed in it (for example, the way push(@fred, "stuff") treats @fred): sub process_data(\@) { A backslash in front of a prototype character means that the actual supplied argument must begin with that character (i.e. \@ means that an actual array must be specified, not just a list). We can specify that some arguments are optional: sub my_complex_sub($$;$) { (my_complex_sub() expects two scalars, with an optional third parameter) Or filehandle globs: sub redirect_io(**) { (redirect_io() expects two globs, typically on filehandles, as parameters). Or subroutine references: sub wrap_around_sub(\&) { ... (This treats the first and only argument as a reference to a subroutine. Yes, there are reasons you might want to do this!) It must be emphasized that Perl prototypes are _not_ like prototypes in C or Pascal -- strongly typed languages. Perl prototypes might better be called "templates", because that's what they are -- a template that specifies the layout of the parameters that must be passed into a subroutine. The subroutine itself still gets them as a concatenated list of stuff glommed together in @_. Nor can you make a prototype element relate to a specific global variable. Perl doesn't let you specify formal named parameters. SUBHEADING: eval() -- exception handling and dynamic code Perl is a dynamic language, unlike C or Pascal. That is, because it's interpreted (or at least only semi-compiled) you can take a string containing some Perl source code and compile and run it dynamically. The classic way of doing this is to use the eval() command. Typically, like most Perl commands, eval is used for two things: for compiling and executing snippets of code on the fly, and for trapping runtime errors (where Java would use a throw/catch exception handler, for example). You can tell the difference quite easily. 
If you see eval() called on a string, it's being used to dynamically execute the code in the string. And if you see eval() called on a block of code (anything enclosed in curly braces) it's probably being used as an error trap.

Here's a really, really simple Perl shell that lets you interactively try Perl commands out:

  #!/usr/bin/perl
  print "\nperl>";
  while (my $arg = <STDIN>) {
    chomp $arg;
    die if ($arg =~ /^exit/i);
    eval($arg);
    print $@;
    print "\nperl>";
  }

The special variable $@ holds any errors that were returned by eval(); if the expression was eval'd alright, then $@ will be empty.

Yes, you can assemble small programs and eval() them by sticking them in a string. But be warned that this is a very slow way to run Perl. However, because it's compiling the code while the program runs, it can cope with syntax errors (badly, but they won't crash the enclosing program; they'll just leave an error message in $@).

We use eval in block context when we're doing something a bit more delicate -- executing a subroutine which may or may not succeed (typically because it relies on some system resource external to Perl) -- and we don't want failure to cause our program to terminate prematurely. For example:

  eval {
    # this is a block of code that might throw a runtime error
    if (! open(FOO, "{$name} = $sub; return $tab; }

  $subs = add_to_table(\%hash_of_subs, $sub_with_no_name, 'sub_with_no_name');
  &{$subs->{sub_with_no_name}}; # runs the sub with no name

An interesting feature arises from the interaction between closures and lexically scoped variables. If you create a closure, it 'sees' any lexical variables with whatever value they have at the (run-time) moment when the closure was created. Closures exist within the lexical scope they were created in, and this includes preserving the contents of lexical variables that might otherwise go out of scope.
For example:

  {
    my $colour = "blue";
    $subref = sub { return $colour };
  }
  print &$subref;

Prints:

  blue

Even though the call through $subref takes place outside the scope in which $colour was defined. Even if we create a new lexical of the same name, the old context persists. For example:

  {
    my $colour = "blue";
    $subref = sub { return $colour };
  }
  my $colour = "red";
  print &$subref;

Prints:

  blue

Here's another example of what persistent lexical scope can do for us:

  my $c = 0;
  $subref = sub { return $c++ };
  for ($i = 0; $i < 10; $i++) {
    print &$subref, "\n";
  }

This prints the numbers 0 to 9 inclusive. Yes, $c survives in the closure, and when we increment its value inside the closure it persists from one invocation of the subroutine to the next! Essentially $c has become a static variable associated with the closure.

Using closures we can generate subroutines on the fly, associate them with some initial starting values, and invoke them elsewhere in our programs. This technique is most commonly used in event-driven GUI programming, where such subroutines are known as _callbacks_ -- fragments of code that are triggered by events. (We set up a whole bunch of callbacks by creating closures, put them into some kind of data structure, and then loop repeatedly -- events are checked against the data structure and used to trigger an appropriate callback to handle them.)

Here's a really simple subroutine factory:

  sub sub_factory ($) {
    my $name = shift;
    my $x = sub {
      $name;
      if (@_) {
        $name = shift;
      }
      return $name;
    };
    return $x;
  }

  $setting = sub_factory("traffic light");
  $state = sub_factory("widget");
  # and so on, until we do something funky like ...
  print &$setting, "\n";
  print $setting->("red"), "\n";
  print $state->($setting->()), "\n";

Whenever sub_factory() creates a closure it returns a reference to it in $x; the closures all have the general structure:

  sub [NONAME] {
    $name;
    if (@_) {
      $name = shift;
    }
    return $name;
  };

These closures store the lexical variable that was passed to sub_factory() when it was called: if called with a parameter, the closure updates the lexical, and then it returns the value. This might seem slightly pointless until you think in terms of $name being something like "menu1.button3", the value as being something like "checked" or "unchecked", and some additional code being called from within the closure before it returns -- to do whatever the GUI interface wants to happen when the button is checked or unchecked.

SUBHEADING: Odds and ends -- lvalue subroutines and exotica

Normally when we call a subroutine, we pass it parameters and expect it to return a value:

  $result = some_subroutine_call(param1, param2 ...);

But there's a weird facility in Perl 5.6 that lets us create subroutines that we can _assign_ to:

  weird_subroutine() = (param1, param2);

And which updates the internal lexical state of the subroutine! This is almost exactly the opposite of the way we normally use subroutines, and such wild entities are known as 'lvalue subroutines' (because they can appear on the left-hand side of an assignment expression).

This is actually one special case of a subroutine that has an attribute set. Perl subroutine attributes are special flags that indicate that a subroutine has some specified feature. For example, to set up a simple lvalue sub:

  my $fred;
  sub lvalsub : lvalue {
    $fred;
  }
  lvalsub() = 5;

There are other attributes you can apply to subroutines; for example, the locked attribute, which specifies that only one thread at a time may call it, and the method attribute (which ensures that when called its first parameter -- the object it is called on -- is locked before execution).
Attributes are specified after the subroutine name:

  sub contention_sensitive_op : locked { ...

or:

  sub initialize : locked method { ...

The attribute system is extensible -- you can create your own attribute names, too. However this is still somewhat experimental -- if you're going to start down that road you ought to be reading "Programming Perl" (3rd edition) and the Perl source code, because this is where I stop!