Linux Format 21 Perl tutorial [[ TYPOGRAPHY -- text surrounded by _underscore_ characters like so should be italicized or emphasized ]] HEADING: How to use hash, legally I've suddenly realised that in all the preceding Perl columns in this magazine I've neglected to explain one of the most fundamental concepts of Perl properly: the humble hash. This is a data structure that is found in other very high level languages (such as Python, Awk, Tcl, Ruby, and so on); however it's not part of the repertoire of C, Pascal, or the other languages commonly used for teaching programming (like Basic). Consequently, it tends to confuse programmers new to Perl, and that's a Bad Thing. Like cannabis, hashes can make you feel woozy and light-headed; they can also induce feelings of euphoria when used correctly. Unlike cannabis, it's legal. So why not dive in? SUBHEADING: Fundamental data types Almost all computer languages distinguish between types of data being manipulated. ALGOL-descended languages such as Pascal, Ada, C, or C++ (and Java for that matter) deal with data types at a low level -- is it an integer, a character, an array of characters, a floating-point number, and so on. They provide more complex data types by providing a type mechanism that lets you define a collection of more primitive entities and stick a type name on it; Pascal's RECORD type lets you aggregate primitive variables in a pre-declared order, while C's struct and union types give you enough flexibility to shoot yourself in the foot. Perl tries to hide this low level stuff and let the programmer focus on high-level issues. For example, a Perl scalar (a variable prefixed by a dollar symbol) can contain an integer number, a floating point number, or a variable-length string -- whatever you assign to it, as long as you assign only a single entity. Perl scalars can also contain other first- class objects, such as hard references (pointers to Perl data structures or code), file handles, symbolic references (the name of another Perl data structure), and so on -- but we'll ignore those for now. You can identify a scalar by it's leading dollar sign: for example, $foo is usually a scalar. If it isn't an actual scalar, an entity with a dollar sign in front is being referred to in _scalar_ _context_. Perl also supplies arrays: a variable-size (elastic) container for scalars. An array is denoted by the @-sign in front of it: for example, @my_stuff. The @-sign supplies _array_ _context_ to the following identifier. We can put scalars into an array like so: @container = ( $item_one, $thing_two, $something_else ); Note the enclosing brackets. These indicate that the enclosed items are a _list_. A list is _not_ an array; a list is just a collection of arbitrary items that are grouped together. An array is a data type that contains scalars. The critical difference means that you can do some things to an array that simply won't work with a list. You can refer to it by name (lists don't have names), get references that point to it (lists don't have references), and extract variables from it. For example: $third_item = $container[2]; # extract third scalar element from @container The dollar-sign prefixing container (an array) means that we are referring to it in _scalar_ context: we want to extract a scalar from it. Items inside an array are identified by subscript, a number that increments from zero for each position in the array; thus, a subscript of "2" refers to the third item stashed in the array. If we try the following we get a runtime error: $third_item = ( $item_one, $thing_two, $something_else )[2]; That's because the bracketed items above are a list, and the array operation (to return a subscripted item) simply isn't applicable to lists. In addition to being able to assign scalars to arrays, and extract scalars from arrays, we can deal with array _slices_. A slice is a segment of an array, identified by subscript: if we ask for a slice of an array we get a list of the scalar contents of the slice, which can then be assigned to another array: @new_container = @container[1..2]; @new_container is an array. We've asked for a slice of @container, starting at subscript 1 and ending at subscript 2; this produces a list (containing two elements) which is then assigned to @new_container. Note that it's an error to say this: @new_container = $container[1..2]; We're trying to refer to @container in a scalar context, but we're implicitly asking for it to return a list. This is an error because array slices (and arrays in general) exist in _list_ _context_. This makes a point about Perl's basic data type mechanism; it understands singular and plural, in a manner analogous to a human language: * Scalar context means singular -- a subroutine is asked to return a single item, a variable is asked to present a scalar, and so on. * List context means plural -- a subroutine is asked to return a bundle of values, an array is asked to return a list of array elements, and so on. * Many subroutines or build-in operators return different things depending on whether you call them in scalar or list context. For example, in list context the grep() command (which conducts regular expression searches on lists) returns a list of all the items it finds that match its search expression. In scalar context, grep() returns an integer -- the number of matches. If you mix up scalar and list context assignments you will usually produce an error (either at compile time or run time); in any event, Perl is likely to do something you don't expect. As long as you keep track of the context you are referring to an array in, you can do interesting things. For example: @new_array = @old_array[4..44]; # @new_array contains elements 3 to 43 of # @old_array @new_array[4] = @old_array[6]; # get element 6 of old_array and put it in # element 4 of new_array @new_array = reverse @old_array; # reverse() operates on a list and returns # a list consisting of its input in reverse # order When you assign a list to a slice, it is padded (with empty values) or truncated to length as necessary. For example, let's take two arrays: @foo = (1, 2, 3, 4, 5, 6, 7, 8, 9); @bar = (qw(a b c d e f g h i j k)); We can take a 3-element slice from @foo and put it into @bar: @bar[2..4] = @foo[3..5]; print "[", join("][", @bar), "]\n"; which prints: [a][b][4][5][][f][g][h][i][j][k] If we take a 3-element slice of @foo and try to cram it into a 2-element slice of @bar, we lose the last element: @bar[2..3] = @foo[3..5]; print "[", join("][", @bar), "]\n"; prints: [a][b][4][5][e][f][g][h][i][j][k] And if we take a 2-element slice of @foo and put it into 3-elements of @bar, we gain an extra, empty element: @bar[2..4] = @foo[3..4]; print "[", join("][", @bar), "]\n"; prints: [a][b][4][5][][f][g][h][i][j][k] (Spot the empty cell in @bar?) Perl comes with a function that makes this sort of array-mangling operation explicit: splice(). Splice operates on an array; you feed it an offset (from the beginning of the array), a length (actually, the number of array elements to replace), and then a list of scalars to replace those values with. If you omit the replacement list, splice() removes _length_ elements (starting from _offset_) and returns them as a list. For example: @bar[2..3] = @foo[3..5] can also be written as: splice(@bar,2,2, splice(@foo, 3, 2)); which returns: [a][b][4][5][e][f][g][h][i][j][k] Arrays can grow or shrink dynamically, as you add or remove elements from them -- but beware, there's an overhead associated with keeping items in an array. Even empty cells take up about 30 bytes of storage, and you can assign the same scalar to a bunch of cells simultaneously, like this: @foo[1..1000] = "bar"; Be careful how you use this. The innocent-looking expression: @foo[1..10000000] = ""; Actually gobbles up hundreds of megabytes of storage on a typical Linux system! SUBHEADING: cooking with hashes A hash is a special type of array. Instead of storing scalar values in containers that are ordered by numerical position, it stores scalar values in containers that are ordered by _key_. A _key_ is simply a piece of data associated with some value. For example, we can say: $my_hash{2} = "the number two"; print $my_hash{2}; which prints: the number two Syntactic differences from arrays are initially minor; the key (which we use in place of the array subscript) is surrounded by curly braces, and when we refer to a hash by name we prefix it with a %-symbol instead of an @-symbol. As with arrays, referring to a hash with a leading dollar sign forces scalar context. The differences begin to come clear when we consider that it's just as valid to say this: $my_hash{the number two} = "2"; as: $my_hash{2} = "the number two"; Essentially, arrays store single chunks of scalar data. Each chunk can be identified by its position in the array. Hashes store tuples -- coupled pieces of information: one of these is the key, and the other is an associated value. They are not identified by position, but by key. Because hashes store data by key, you need the keys to retrieve the associated values. You can get your hands on them in two ways. First, there's the keys() function. keys(%foo) returns a list of all the keys to data stored in the hash %foo. Warning: this can be a big list! If you're dealing with a list of unknown size, or a tied object (which we'll talk about later), it's best to use method two: the each() function. each() returns a list with two elements -- a key, and a value. Every time you call each() on a given hash, it returns the next entry in the hash, until it runs out of data -- at which point it returns an empty list. Thus, you can iterate over the contents of a hash in two ways: # For small hashes only foreach $key (keys (%my_hash) ) { # do something with the associated value, $my_hash{$key} } or: # for hashes of any size while ( ($key, $value) = each(%my_hash) ) { # do something with $key and $value } An important thing to know about hashes is that you can initialise them from lists. If you have a list with an even number of entries and assign it to a hash, the items in it are assigned as key/value pairs: my %small_hash = ( "colour" , "red", "number", 2, "flavour", "cherry"); (Don't assign a list with an odd number of elements or you'll get a runtime error at best!) The comma, used as a separator for items in lists, has a synonym that makes this notation clearer: the arrow, =>. We can re-write the above list like this: my %small_hash = ( "colour" => "red", "number" => 2, "flavour" => "cherry"); which makes it easier to distinguish keys from values. Some functions that operate on lists do magical things when you apply them to a hash. For example, reverse() reverses the order of items in an array or list. When we apply reverse() to a hash, it swaps each key/value pair! If we reverse() the above small hash and loop over it, we'll see this: my %small_hash = ( "colour" => "red", "number" => 2, "flavour" => "cherry"); %small_hash = reverse(%small_hash); foreach $key (keys (%small_hash) ) { print "$key => $small_hash{$key}\n"; } prints: red => colour cherry => flavour 2 => number Note that the hash entries are unordered -- there's no guarantee that entries will come out of a hash in the same order you entered them. But you can use sort() to sort the keys: foreach $key (sort ( keys (%small_hash) ) ) { print "$key => $small_hash{$key}\n"; } prints: 2 => number cherry => flavour red => colour (We get this curious ordering because sort() doesn't do what you think it does; it's actually one of the most powerful functions in Perl, and lets us build arbitrarily complex constructions. We'll examine it in detail in a future tutorial. For now, sort() sorts in string comparison order.) Just as we can refer to subsections of an array (as a slice), so too we can refer to hash slices. A hash slice is a sub-hash; if you see a construct like @my_hash{red,blue,green}, it's a slice through %my_hash consisting of those tuples with the keys "red", "blue", and "green". We can do useful things with hash slices. For example, we can merge two hashes using them: %old = ( "red" => 1, "blue" => 1, "green" => 1); %new = ( "turquoise" => 2, "violet" => 2, "amber" => 2, "black" => 2, "silver" => 2 , "red" => 2); @old{keys %new} = values %new; foreach $key (keys (%old) ) { print "$key => $old{$key}\n"; } prints: blue => 1 silver => 2 turquoise => 2 green => 1 violet => 2 amber => 2 black => 2 red => 2 (Spot the value of "red".) We can use hashes to derive interesting data from arrays. For example, we can use them to identify the intersection of two sets: @list1 = (qw(the quick brown fox jumped over the lazy dog)); @list2 = (qw(now is the time for all lazy men to dog me)); @key_hash{@list1} = @list1; foreach $word (@list2) { if (defined $key_hash{$word}) { print "$word occurs in both sets\n"; } } What we're doing here: for each word in @list1, create a hash entry in %key_hash. It doesn't matter what the value of the entry is, as long as it's defined. In this example, we use a hash slice to simultaneously set up every word in @list1 as a key, with itself as value. Then we loop through @list2; if a word in @list2 is a key in %key_hash, then it appears in both arrays. We can make this a little bit more readable by replacing: @key_hash{@list1} = @list1; with: foreach $word (@list1) { $key_hash{$word} = $word; } And we can obfuscate it thus: @key_hash{@list1} = @list1; print join("\n", grep {$_ if defined $key_hash{$_} } @list2); This does the same thing! grep() is a special function that iterates over a list or array. The first parameter to grep() is either a search-oriented regular expression, or some other expression. For each element in the list or array, grep() applies it to the element (which is presented to the expression as $_). In list context, grep() then returns a list of all those elements for which the expression returned true. We can use grep() instead of a foreach() loop in situations where we're iterating over an array or list and extracting another list. (Note that it's considered bad form to use grep() to modify an array, and then throw away the resulting list. That's because it's confusing to readers.) There's a corresponding function called map() that does exactly the same thing as grep(), except that it returns every element in the input list; map() is used to apply some sort of transformation to every element in an array or list. For example, given an array of numbers, we can multiply all of them by 5.5 like this: my @multiplied_array = map { $_ = $_ * 5.5 } @array_of_numbers; Or we can convert to uppercase every string in an array like this: my @words = map {uc($_) } @words; Getting back to computing the intersection of two lists, we can also compute the difference between them quite easily using a hash as an aid. @list1 = (qw(the quick brown fox jumped over the lazy dog)); @list2 = (qw(now is the time for all lazy men to dog me)); @key_hash{@list1} = @list1; print join("\n", grep {$_ if !defined $key_hash{$_} } @list2); Spot the one-character change that takes us from the intersection of two sets to the difference between two sets! (A logical-NOT in the right place works wonders in set theory.) SUBHEADING: Deletion We can delete items from arrays by using splice(), or pop(), or shift(), or by shrinking the array. You can see the size of the array by looking at the special variable $#arrayname, which is the index of the largest subscript in use: print $#list1; prints "8". (There are nine elements in @list1, but subscripts count from zero.) We can truncate @list1 to four elements by assigning to $#list1: $#list1 = 3; print join(" ", @list1), "\n"; prints: the quick brown fox One problem many beginners have is understanding the undef() function. undef() does two things: you can use it to test if a variable is undefined (if ($fred == undef) ... ), and you can use it to set a variable to something undefined (undef($fred)). However, if you just use undef() to undefine the value of an array element, this won't destroy the element and shrink the array; you just end up with an array with a hole in it. That's because undef() doesn't implicitly delete an element within a larger data structure; it replaces it with a special "undefined" value. (Think of the empty set in set theory, which isn't the same as a set containing zero, because zero is an actual number). undef(@list1) will delete the whole of @list1, but undef($list1[3]) merely replaces "fox" with an empty string. Hashes are a bit harder to cut down to size. However, there are two ways of doing so: the delete() function, and undef (on the whole hash). delete("key") deletes "key" and its associated value from a hash. my %small_hash = ( "colour" => "red", "number" => 2, "flavour" => "cherry"); delete $small_hash{colour}; Leaves us with two keys -- "number" and "flavour". Whereas: undef($small_hash{colour}); Leaves us with a hash that looks like this: ( "colour" => undef, "number" => 2, "flavour" => "cherry" ); SUBHEADING: Getting referential Hashes are handy; their ability to store disparate data using keys means we can use them for most of the tasks we'd use structures or unions for in C. But how can we create them on the fly? And can we use hashes (or arrays) to store other hashes or arrays? The answer is "sort of". Perl has another fundamental data type: the reference. A reference is a pointer; not quite the same as a pointer in C or Pascal (which points to a location in your computer's memory), but a symbolic pointer -- given a reference, Perl can always locate the data to which the reference points. We obtain a reference to a variable by prefixing the variable with a backslash: $ref_to_fred = \%fred; $ref_to_fred is a scalar. It contains a reference. The reference points to the location of %fred. We can de-reference a reference by referring to it in scalar, array, or hash context: %another_link_to_fred = %$ref_to_fred; Or, for clarity: %another_link_to_fred = %{$ref_to_fred} ; This may be a bit problematic if we can't remember what type of variable a reference points to, so Perl gives us the ref() function: print "\$fred is a: ", ( ref($ref_to_fred) or " non-reference " ), "\n"; prints: $fred is a: HASH ref(OBJECT) returns "HASH" if OBJECT is a reference to a hash, "ARRAY" if OBJECT is a reference to an array, "SCALAR", if it's a reference to a scalar, various other confusing things if it's a reference to a different internal Perl data structure, and undef (zero/the empty string) if it's not a reference. NOTE: To simply things, we call references to hashes "hashrefs" and references to arrays "arrayrefs". We can get at the elements of a hashref or an arrayref by using the indirection operator ->, roughly the same way as we can use a pointer to a struct in C: $ref_to_small_hash = \%small_hash ; print $ref_to_small_hash->{colour}; prints "red". References don't need to be named, unlike hashes or arrays. We can generate them on the fly. We've seen how lists are created by enclosing a bunch of scalars in ordinary brackets, and how they can be assigned to an array or hash. We can construct arrays and hashes on the fly and assign their references to a scalar: my $arrayref = [ "red", "blue", "green" ]; or: my $hashref = { "colour" => "red", "number" => 2, "flavour" => "cherry"}; Square brackets [] are array constructors: a list enclosed by square brackets returns an arrayref when we use it in a scalar context (as above). Curly braces {} are hash constructors: a list enclosed by them returns a hashref in a scalar context. (Note that you need an even number of elements for this trick to work.) References are efficient. If we want to pass a lot of data into a subroutine, it's more efficient to pass it a reference to an array or hash than to give it a long parameter list containing an actual array or hash; this is because anything we pass to a subroutine is copied on an internal stack. The reference only requires a single scalar to be copied, but an array or hash needs to be copied in its entirety. So for subroutines that work on large data sets, references are vital. Another place where references come in handy is in coping with dynamic data sets. A hash can be used, as we've seen, to emulate data structures like unions. These, in turn, let us build the usual complex data structures beloved of computer science courses: linked lists, trees, B-trees, and so on. For example, if we want to build a tree of records in Perl, we can do so by using a record structure like this: my $leaf = { "left" => \$some_other_leaf_with_lower_value; "right" => \$some_other_leaf_with_higher_value; "value" => $value_of_this_record; }; Starting with an empty hashref called $root, we can build a tree like this: my $word = some_function_to_give_us_a_new_word_to_insert_in_the_tree(); my $point = $root; while (ref($point->{left}) and ref($point->{right})) { if ($word le $point->{value}) { $point = $point->{left}; } else { $point = $point->{right}; } } if ($word le $point->{value}) { $point->{left} = { "value" => $word, "left" => undef, "right" => undef }; } else { $point->{right} = { "value" => $word, "left" => undef, "right" => undef }; } And we can traverse the tree (depth-first) like this: sub walk { my $point = shift; my $depth = shift; $depth++; if (ref($point->{left})) { walk($point->{left}, $depth); } if (ref($point->{right})) { walk($point->{right}, $depth); } print "$depth: ", $point->{value}, "\n";; } print "\n\nwalking ...\n\n"; walk($root); print "\n\nend of walk\n\n"; In practice, we seldom use this sort of data structure in Perl. The reason: hashes and arrays (in Perl) are much more powerful than the usual primitive data types we use in C or Pascal. For example, all arrays in Perl are functionally equivalent to two-way linked lists; because arrays are dynamically resizable, we can use them as lists, queues, and stacks by using the push(), pop(), shift(), and unshift() functions. shift() returns the lowest-indexed scalar from an array, removing it in the process and shuffling everything in the array down one subscript. push() in contrast adds a scalar onto the highest-indexed end of an array. By combining shift() (to read data off an array) and push() (to put data onto it), we can maintain a queue. unshift() and pop() do the opposite of shift() and push(). We can maintain a queue in the opposite direction using these two functions. Alternatively, by combining push() and pop() (or shift() and unshift(), although idiomatic Perl programmers don't usually do that) we can maintain a stack -- push() drops another value on top of it, and pop() yanks the most recent value off it and returns it. As any first year computer science student knows, you can take a recursive tree-walk (such as the one in subroutine walk() above) and turn it into a non-recursive one by using a stack. Hashes usually take the strain of storing keyed data -- a task that is usually carried out using trees and lists in C and Pascal. Why go to all the bother of walking a tree recursively to determine an insertion point, when you can just stick the value into a hash? SUBHEADING: Next episode Having rushed through our remiss tour of hashes, arrays, and data structures, it's time to dive into another application: in this case, a weblog application that builds on last month's CGI guestbook to give us a small but functional bulletin board. Which will, of course, use hashes ... (ENDS)