LINUX FORMAT PERL COLUMN

STRAP: Being objective

SUBHEADING: Object-orientation and Perl

Object-oriented programming is not new; it's been around since the late 1960s, although it only really caught on in the mid to late 1980s, as a response to the increasing complexity of software.

If you've written any program in any language that was more than a hundred lines long, you'll appreciate the need to wrap chunks of code up as separate subroutines. If you've written a program that was more than a thousand lines long, you'll probably have moved a bunch of utility subroutines out to a separate library file, so that they don't confuse the flow of control of the main program. But in really large projects, the proliferation of subroutines and the data types they work on rapidly becomes uncontrollable, which is where object orientation comes in.

Object orientation is essentially a way of looking at software that allows us to fence off chunks of a project into "objects" (packages containing source code and the data structures the source code works on), with well-defined interfaces, so that we can concentrate on the big picture.

In its early days, Perl didn't do object orientation. If you were a masochist you could emulate it using namespaces, just as you can emulate object orientation in C (the Motif APIs require you to do just that!), but that was about the limit. Perl 5 introduced some new keywords and constructs that give Perl a very flexible model for doing object-oriented programming, and that's what we're going to look at this month.

SUBHEADING: What is object-oriented programming?

Most programming work involves messing around with data structures -- collections of variables linked in weird and wonderful ways. In object-oriented design (and programming) we try to keep our data structures parcelled together with the subroutines that create, modify, access, or destroy them.
Access to a data structure is provided via some subroutines which are globally visible, but what happens to the internals of an object is a secret from the rest of the program, as is the internal structure of the object. There may also be some private subroutines that the rest of the software doesn't know about -- these are used by the public routines, for their own purposes. In general, object orientation relies on a handful of properties: information hiding (data is only visible inside the object's own code), inheritance (we can define a new type of object, incorporating an existing one but adding new data and subroutines to access it), modularity (information and subroutines related to a class of object are bundled together). This month, and continuing next month, we're going to take a look at a concrete example: a Perl module for editing the /etc/hosts file on a Linux system. /etc/hosts is a file that matches hostnames to internet addresses for computers on a network. (DNS, the domain name system, replaced the hosts file for computers connected to the global internet, because it's a distributed database: a hosts file for the entire net would be gigantic and require very frequent updates. However, we still use /etc/hosts files for small office and home networks because it's convenient and easy to set up.) We might want to write a Perl script to read, update, or modify an /etc/hosts file if we're planning a system administration framework for a small local network. Within /etc/hosts, we can write comments; they begin with a hash '#' symbol and continue to the end of the current line. We can also include a host record. A host record consists of an IP address, followed by a fully-qualified domain name for the host, then zero or more aliases (such as the hostname with no trailing domain information). Fields are separated by whitespace (spaces or tabs), and each record is terminated by a comment character or a newline. 
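To make the record format concrete, here is a minimal sketch of how a single line of a hosts file might be split into its fields. The subroutine name parse_hosts_line and the layout of the hash it returns are assumptions made for illustration; they are not part of the LF::Hosts design.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of LF::Hosts): split one line of a
# hosts file into its IP address, canonical name, aliases and comment.
sub parse_hosts_line {
    my ($line) = @_;
    chomp $line;

    # Everything from the first '#' to the end of the line is a comment.
    my $comment = '';
    if ($line =~ s/#(.*)$//) {
        $comment = $1;
        $comment =~ s/^\s+|\s+$//g;    # trim surrounding whitespace
    }

    # What remains is a set of whitespace-separated fields (or nothing).
    my ($ip, $name, @aliases) = split ' ', $line;

    return {
        ip      => $ip,        # undef for blank or comment-only lines
        name    => $name,
        aliases => \@aliases,
        comment => $comment,
    };
}

my $record = parse_hosts_line("192.168.1.10\tmike.linuxformat.org mike # file server\n");
print "$record->{ip} is $record->{name} (@{ $record->{aliases} })\n";
```

A blank or comment-only line comes back with an undefined ip field, which gives the caller a simple way to tell comments from host records.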
Within our (hypothetical) system administration tool, we might want to hive off maintenance of /etc/hosts entries from other functions (say, simultaneously updating entries in BIND's database of hosts). Typical tasks include looking to see if a hostname has an IP address in the file, or if an IP address has an associated name; also, deleting a host, adding a new host, adding a new alias to a host, and changing the IP address of a host. We may also want to future-proof ourselves: IPv6 (the next version of the Internet Protocol) introduces a new, longer address syntax.

It's fairly clear that the core entity we're going to work with, the object, corresponds to a file. We could pick different objects to work with -- say, individual entries in an /etc/hosts file -- but we'd still need an object corresponding to the hosts file, and its contents are simple enough that we don't need to modularize it further. On the other hand, we don't want to try to use a single class to update /etc/hosts, BIND configuration, SMB configuration, and so on in one place -- that would be excessively complex.

We want to be able to create a new hosts object by reading in the /etc/hosts file and populating some sort of internal data structure with its contents. We want to be able to tell our object to update the version on disk (saving its contents). We want to be able to look up the names for an IP address, and vice versa. We want to be able to create a record for a given IP address, change its associated aliases, or delete it.

Actually, this lot sounds like we *do* need another class, so we're going to create one: a class of objects that consists of records in a hosts file. Our main program will never see this class, but it'll make life easier inside the main class.

So what we're going to do is this:

* Write a class (let's call it LF::Hosts) that gives us a set of data structures and subroutines for messing around with an /etc/hosts file.
* Write a class (called LF::Hosts::Entry), to be used by LF::Hosts, that gives us data structures and subroutines for creating/querying/editing/deleting a host record.

Our main program will then be able to say something like this:

    my $hosts = LF::Hosts->new() or die "Could not open /etc/hosts file: $!\n";

    # get names associated with an IP address
    my @aliases = $hosts->identify("192.168.1.10");

    # and vice versa
    my $ip_addr = $hosts->identify("mike.linuxformat.org");

    # print comments associated with host $ip_addr
    print $hosts->comments($ip_addr);

    # print comments associated with no particular host
    print $hosts->comments();

    # insert a new host
    $hosts->add("192.168.1.14", "bob.linuxformat.org", "bob");

    # modify (rename) an existing host from "bob" to "patricia"
    $hosts->edit("192.168.1.14", { "bob" => "patricia",
                                   "bob.linuxformat.org" => "patricia.linuxformat.org" });

    # delete a host
    $hosts->delete("192.168.1.100");

    # finally, save the file
    $hosts->commit();

SUBHEADING: Creating Perl objects

In Perl, a class of objects is defined by a package (that is, a set of Perl subroutines that come with their own namespace, usually in a separate file). All data associated with the class is stored in the class's namespace, or in data structures hanging off a reference.

We usually put packages in separate files (with the suffix .pm); when our program needs to use a class, we add the "use Classname" directive to tell Perl to load the appropriate package while it compiles the program. For example:

    use MyPackage;

tells Perl that it should locate the file containing MyPackage (i.e., a file called MyPackage.pm, in one of the directories listed in the special array @INC) and compile it. (Note that "use" is executed at compile time, before the script begins to run; the similar "require" directive, a hangover from Perl 4, is executed whenever the script's flow of control gets around to executing that line.)
Unlike a normal data structure (such as an anonymous hash), an object knows what class it belongs to: instead of being a plain HASH or ARRAY reference, it belongs to LF::Hosts, or LF::Hosts::Entry, or something. We tell an object what class (package) it belongs to using the bless() command. For example:

    bless $variable, "Fred";

This line tells the data structure that $variable refers to that it is a Fred (whatever a Fred is).

A side-effect of blessing a reference, so that it belongs to a specific package like Fred, is that if we then call a subroutine ("method" in object-oriented jargon) called do_something() on the blessed variable, Perl will look first for a subroutine in its own class, called &Fred::do_something(). If no such method exists, Perl searches the classes listed in the package's special array @ISA (literally, "is a"), and finally the built-in UNIVERSAL class that every object inherits from.

We use a special shorthand for calling methods (subroutines associated with an object):

    $thing->do_something()

means "run the subroutine do_something(), from whatever package $thing belongs to, passing it $thing as its first parameter."

In general, a class contains two types of method (subroutine): class methods and instance methods. A class method is one that operates on the class as a whole, and so affects all objects belonging to it -- for example, we might use one to tell our LF::Hosts class that our systems all put the file /etc/hosts somewhere unusual. An instance method is one that operates on a single object: for example, to get or set its internal state.

We almost always need one specific type of method called a constructor. A constructor is called as a class method (i.e., by class name, rather than through an existing object), and it returns a reference to a new object. By convention, Perl classes usually call their constructors "new".
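The following is a minimal sketch of bless(), the arrow shorthand and @ISA working together. The Animal and Dog classes are hypothetical, invented purely for illustration, and have nothing to do with LF::Hosts.

```perl
use strict;
use warnings;

package Animal;                      # hypothetical base class

sub new {
    my ($class, %args) = @_;         # the class name arrives as the first argument
    my $self = { name => $args{name} };
    return bless $self, $class;      # $self now "knows" which class it belongs to
}

sub name  { my $self = shift; return $self->{name}; }
sub speak { my $self = shift; return $self->name . " makes a noise"; }

package Dog;                         # hypothetical subclass
our @ISA = ('Animal');               # a Dog "is an" Animal

sub speak { my $self = shift; return $self->name . " says woof"; }

package main;

my $pet = Dog->new(name => 'Rex');   # Dog has no new(), so Perl finds Animal::new via @ISA
print ref($pet), "\n";               # the class the object was blessed into
print $pet->speak, "\n";             # Dog::speak is found before Animal::speak
```

Because new() blesses into whatever class name it is handed, rather than a hard-coded one, the same constructor serves both Animal and any class that inherits from it.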
In the case of LF::Hosts, calling "new" should return a reference to a data structure that embodies an /etc/hosts file, which has been blessed so that it "knows" it is a member of the LF::Hosts class (and "knows" what subroutines apply to it). In the case of LF::Hosts::Entry, calling "new" should return a reference to the next entry in the parent object's hosts file. Like this:

    package LF::Hosts;

    $LF::Hosts::HOSTFILE = "/etc/hosts";

    sub new {
        my $class = shift;      # the class name, "LF::Hosts"
        my $self  = {};         # create a reference to an empty hash
        bless $self, $class;    # tell $self that it is an LF::Hosts object

        # now we open the hosts file, and generate a bunch of
        # LF::Hosts::Entry objects, each of which is added to $self
        # as a hash key/value pair
        open(FH, "<$LF::Hosts::HOSTFILE") or die "$!\n";
        while (! eof(FH)) {
            # pass the filehandle to the other package by reference
            my @line = LF::Hosts::Entry->new(\*FH);
            if ($line[0] eq "COMMENT") {
                push(@{$self->{COMMENT}}, $line[1]);
            } else {
                $self->{$line[0]} = $line[1];
            }
        }
        close FH;
        return $self;
    }

This is the constructor method for LF::Hosts; it returns a reference (called $self within the subroutine) to a blessed object, which is actually an LF::Hosts object. The object is a hashref containing various key/value pairs; each value is a reference (pointing to either an array of comment lines, or to an LF::Hosts::Entry object).

A point to note: LF::Hosts::Entry's new() method returns a list of two items: a key and an LF::Hosts::Entry object. The key is either the string COMMENT, or an incrementing number (stored in a class variable maintained by LF::Hosts::Entry) that is unique for each instance.

The line:

    push(@{$self->{COMMENT}}, $line[1]);

shows that we can push (append to an array) a variable (in this case, the object referenced by $line[1]) into an anonymous array that hangs off an object. $self->{COMMENT} is an array reference; we use @{ $self->{COMMENT} } to tell Perl to treat it as an array.
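The push idiom can be tried in isolation. This is a small sketch using a plain, unblessed hash and made-up comment strings; it relies on the fact that Perl creates the anonymous array automatically the first time we push onto it.

```perl
use strict;
use warnings;

my $self = {};    # an empty anonymous hash, standing in for the object

# The first push creates the anonymous array behind $self->{COMMENT};
# the second simply appends to it.
push @{ $self->{COMMENT} }, "# local network hosts";
push @{ $self->{COMMENT} }, "# maintained by hand";

my $count = scalar @{ $self->{COMMENT} };
print "kind of reference: ", ref($self->{COMMENT}), "\n";
print "$count comment lines\n";
print "$_\n" for @{ $self->{COMMENT} };
```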
Next month, we'll see how the instance methods are written, write a child class (LF::Hosts::Entry), see how Perl's POD documentation system works, and discuss some of the more interesting applications of OOP in Perl.

BOXOUT: References

A reference is what Perl uses instead of pointers. You haven't met pointers? Don't worry ...

Computers organise data in their memory by putting each byte (or word) into a separate cell. Each cell has a unique numerical address, just like the position of an element in an array. Languages like C or Pascal let us refer to data we've stored in memory either by giving it a variable name, or by specifying the address of the memory cell it is stored at. (Actually, a variable name is just a key in a special table of memory addresses called a symbol table: when you refer to a variable called fred in Pascal or C, the compiler looks fred up in the symbol table to find the address where its data is stored, and generates code to fetch it.)

A pointer is simply a raw memory address. We can grab a pointer to the data associated with a variable, and stash it in another variable (or in part of a variable -- say, inside an element of an array).

Perl references aren't pointers to physical chunks of your computer's memory; they're merely an internal handle that the current Perl process uses to store or retrieve a bit of information. But they act like pointers. We can obtain a reference to a variable by prefixing the variable's name with a backslash, and store references in any scalar:

    my @an_array = ("red ", "blue ", "green ");
    my $reference_to_an_array = \@an_array;

$reference_to_an_array doesn't hold an actual array of data -- but it holds a reference which points to the chunk of memory where the array is stored.
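One consequence is worth seeing in action: because the reference and the array name lead to the same data, a change made through one shows up through the other. The sketch below uses the @$ and -> dereferencing notations that the rest of this box goes on to describe; the extra colour names are made up.

```perl
use strict;
use warnings;

my @an_array = ("red", "blue", "green");
my $reference_to_an_array = \@an_array;

# Both of these changes are made through the reference ...
push @$reference_to_an_array, "yellow";
$reference_to_an_array->[0] = "crimson";

# ... but they are visible through the original array name,
# because no copy of the data was ever made.
print "@an_array\n";
```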
If we print a reference, it doesn't show us anything useful:

    print $reference_to_an_array;
    ARRAY(0x80f8f28)

But we can dereference the contents of $reference_to_an_array (getting back to the original contents) by prefixing the scalar containing the reference with the type symbol of the data it refers to (here, @ for an array):

    print @$reference_to_an_array;
    red blue green

We can also use the ref() command to tell us what type of data is referenced:

    print ref($reference_to_an_array);
    ARRAY

(Valid things that ref() can return include CODE, HASH, SCALAR, ARRAY. It returns an empty string, which counts as false, if the thing you call it on isn't a reference.)

END BOXOUT (References)

BOXOUT: Complex data structures

We can create data structures by storing references in arrays or hashes. For example:

    my @colours = (qw(red blue green));
    my @widgets = (qw(screw nail staple));
    my @colourful_widgets = (\@colours, \@widgets);

This is cumbersome, so we can employ an anonymous array constructor. Instead of using round brackets to create a list, we use square brackets to return a reference to an anonymous (unnamed) array. An anonymous array is just an array without a name -- because we've saved a reference to it somewhere, it continues to exist and we can get at data stored in it. Like this:

    my @colourful_widgets = ( [ qw(red blue green) ],
                              [ qw(screw nail staple) ] );

The array @colourful_widgets is not a two-dimensional array; it's a one-dimensional array containing references to other arrays. But we can use it as a two-dimensional array:

    $color = $colourful_widgets[0]->[1];   # $color contains "blue"
    $thing = $colourful_widgets[1]->[2];   # $thing contains "staple"

The arrows are inherited from C's syntax for dereferencing pointers, and have pretty much the same meaning. Note that unlike C, they're optional between adjacent subscripts, so we can refer to $colourful_widgets[0][2], just as if it were a true multidimensional array.

In addition to constructing anonymous arrays using [ ... ], we can build anonymous hashes using { ... }. For example:

    my $properties = { "food"   => "cheese",
                       "colour" => "blue",
                       "smell"  => "strong" };

Which is equivalent to:

    my %properties = ("food"   => "cheese",
                      "colour" => "blue",
                      "smell"  => "strong" );
    my $properties = \%properties;

A common use of anonymous hashes in Perl is to provide variables with multiple named fields -- like records in Pascal or structs in C. For example, we can get the smell of our object by saying:

    print $properties->{smell};

In general, Perl lets us take a reference to most items -- even subroutines. For example:

    sub fred {
        # do something
    }
    my $subref = \&fred;

(Note that there are no brackets after the name: \&fred() would call fred and take a reference to whatever it returned.)

Perl provides a powerful module for looking at complex data structures consisting of nested arrays and hashes linked by references: Data::Dumper. You use it like this:

    #!/usr/bin/perl
    use Data::Dumper;

    # do something to create a complex data structure pointed to by a
    # scalar called "$fred"

    # now we want to inspect the structure of whatever $fred points to ...
    print Dumper $fred;

This prints something like:

    $VAR1 = {
              'colour' => [ 'red', 'blue', 'green' ],
              'type' => [ 'screw', 'nail', 'staple' ]
            };

Curly braces denote an anonymous hash; square brackets indicate an array. So what we have here is $VAR1 (also known as $fred), pointing to an anonymous hash with two keys, 'colour' and 'type'. Each key has an associated value -- which is a reference to an array.

END BOXOUT (Complex data structures)

BOXOUT: Subroutines and parameter passing

Like all serious programming languages, Perl lets us define subroutines (equivalent to functions in C or Pascal). We do it like this:

    sub my_subroutine {
        # code goes here
        return $some_return_value;
    }

When we invoke my_subroutine(), $some_return_value is returned to the calling context. We can invoke it either by prefixing its name with an ampersand (like &my_subroutine), or following it with brackets (C style).
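The two calling styles can be compared side by side. In this sketch the subroutine names count_args and wrapper are made up for illustration; it also shows one difference that is easy to trip over, namely that an ampersand call with no brackets hands the caller's own argument list to the subroutine.

```perl
use strict;
use warnings;

sub my_subroutine {
    my $some_return_value = "hello";
    return $some_return_value;
}

my $first  = my_subroutine();     # C-style call
my $second = &my_subroutine();    # ampersand style, with brackets

# Hypothetical helpers: count_args reports how many arguments it was given.
sub count_args { return scalar @_; }

# With an ampersand and no brackets, count_args sees wrapper's own @_.
sub wrapper { return &count_args; }

my $passed_along = wrapper("a", "b", "c");
print "$first $second $passed_along\n";
```

For everyday code the C-style form is the less surprising of the two.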
If you don't explicitly return a value from a subroutine, it returns the result of the last expression to be evaluated within its scope. So:

    sub return_true {
        1;
    }

always returns "1" (which counts as true).

We can return more than one scalar value; in this case, whatever receives the returned values must be able to cope with a list, and identify whatever's been returned appropriately. If we try returning a hash, though, it will be "flattened" into a list -- and if we try returning a hash and an array, the results will be a messy collision. So if you want to emit complex structures from a subroutine, the best policy is to return a list of scalars containing references:

    sub returns_complex_stuff {
        # code goes here
        return ( \%my_internal_hash, \@some_array, $an_object);
    }

    # main program, now:
    my ($my_returned_hash, $my_array, $my_object) = returns_complex_stuff();

A similar rule applies to getting parameters into a subroutine. We can pass as many scalars as we like to a subroutine; but from the point of view of the subroutine, they all get squished into a special array, called @_. So if we want to push a mix of different variables into a subroutine it's best to pass them as references:

    sub complex_sub {
        my $incoming_array  = shift @_;
        my $incoming_object = shift @_;
        # do something or other and return
    }

    $result = complex_sub(\@array_to_process, $object_ref);

Note that "shift" grabs the leftmost element of an array and returns it, shortening the array by one element. Along with the corresponding commands "unshift" (shove an item onto the "left" of an array), and push/pop (which operate on the other end of an array) we can implement stacks, queues, and a whole load of other useful structures using ordinary arrays.

END BOXOUT (Subroutines and parameter passing)

BOXOUT: Variable scope

In Perl, there are three mechanisms for defining the scope of a variable. First, there's the namespace.
If you refer to a variable like $thing, it instantaneously springs into existence -- within the current namespace. If you haven't used a 'package' command to specify a different package -- each package comes with its own namespace -- this will be the namespace "main"; so your variable will actually be $main::thing. A variable created in this way is global, which is a nuisance; if you want something called $thing to be local to a subroutine, don't just use it this way.

Tip: you can make Perl throw a compile-time error when you do this by using the "use strict" compiler pragma: put a line like:

    use strict;

at the top of your script, and Perl will refuse to run it unless all variables are explicitly qualified with a package name, or are lexical (see below).

You can locally override a global variable by using the "local" command. For example, if our script has a $thing floating around, we can say:

    sub mysub {
        local $thing;

Thereafter, within the subroutine mysub() (and any subroutines it calls), $thing holds a fresh temporary value; the old value is hidden. But when we leave the scope of mysub(), the old value of $thing reappears, and the temporary one vanishes mysteriously. (This is because the command "local" causes Perl to stash the value of the designated variable on a stack, and restore it upon leaving the enclosing block of code.)

Local scope is sometimes handy, but what you probably want are true local variables, the way a language like C or Pascal provides them. To declare a lexically scoped variable -- one that exists only within the scope of the lexical block of code enclosing the declaration -- use "my":

    sub mysub {
        my $thing = shift @_;
        # and so on
    }

The lexically scoped $thing doesn't exist within a symbol table; it's stored somewhere else entirely. It's invisible outside of mysub(), unless you obtain a reference to it and return the reference (in which case you can do neat things with it).
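The difference between "local" and "my" is easiest to see from inside a subroutine that is called by another. This is a minimal sketch; the subroutine names show_thing, with_local and with_my are made up for illustration.

```perl
use strict;
use warnings;

our $thing = "global";               # package variable, i.e. $main::thing

sub show_thing { return $thing; }    # always reads the package variable

sub with_local {
    local $thing = "localised";      # temporarily replaces the global's value
    return show_thing();             # subroutines we call see the temporary value
}

sub with_my {
    my $thing = "lexical";           # a separate variable, visible only in this block
    return show_thing();             # subroutines we call still see the global
}

my $seen_local = with_local();
my $seen_my    = with_my();

print "via local: $seen_local\n";
print "via my:    $seen_my\n";
print "afterwards: $thing\n";        # the original value has been restored
```

The "local" version leaks its temporary value into show_thing(), while the "my" version keeps its variable entirely to itself.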
This gives us true control over variable scope, like a real grown-up programming language. And in general, unless you want your variables to be global, you should remember to "use strict" and always declare your variables lexically (and initialise them to a sensible value!). END BOXOUT (Variable scope)