Tekstmanipulatie, week 9

Remark: It is highly recommended that you also check other sources (see the notes on week 8 about further readings) about handling arrays, hash tables, files and regular expressions in Perl.

0. Introduction

Two funny ways to compute N! (N-factorial, i.e. N! = 1 × 2 × 3 × ... × (N-1) × N):

#!/usr/bin/perl -w
print "\n N = ? ";
$N = <STDIN>;
chomp($N);
for ( $k = 1 , $i =1 ; $i<=$N ; $k *= $i , $i++ ) {};
print "N! = $k \n";

#!/usr/bin/perl -w
print "\n N = ? ";
$N = <STDIN>;
chomp($N);
for ( $k = ($i =1) ; $i<=$N ; $k *= ($i++) ) {};
print "N! = $k \n";

The operator $a *= $b is an abbreviation of $a = $a * $b.

The first approach shows you that you can have complex expressions both at the initializing statement and at the incremental statement.

The second approach shows you an even more Perl-like (or C-like) philosophy: the expression $i++ increments the variable $i and also returns its value before the increment (++$i would instead return the value of $i after the increment). The same holds for the expression ($i =1): it assigns the value 1 to the variable $i and also returns that value. You can then use these returned values wherever you wish.
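The difference between the two forms of incrementing can be checked with a short sketch. The variable names $before, $after, $j and $m are chosen here purely for illustration:

```perl
#!/usr/bin/perl -w

$i = 5;
$before = $i++;    # $before gets the old value 5, then $i becomes 6

$j = 5;
$after = ++$j;     # $j becomes 6 first, then $after gets the new value 6

$k = ($m = 1);     # an assignment returns the assigned value, so $k is 1 too

print "i = $i, before = $before \n";
print "j = $j, after = $after \n";
print "k = $k, m = $m \n";
```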

By the way you could have done the incrementation even in the condition...

for ( $k = ($i =1) ; (++$i)<=$N ; $k *= $i ) {};

But this is not too nice, because you mix up the philosophy behind the conditional statement and the incremental statement.

1. Arrays

An array is a complex data structure and can be imagined like a series of boxes:

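@A =

+-------+-------+-------+-----+-----+-------+
| $A[0] | $A[1] | $A[2] | ... | ... | $A[n] |
+-------+-------+-------+-----+-----+-------+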

Each of the boxes can be seen as an independent variable with a complex name, and can have any value that a scalar variable can usually have.

Imagine for instance that there are 5 families living in a building, and you put the name of each family on the post box:

@HOUSE = ("de Vries", "Janssen", "de Boer", "de Wit", "Bakker");

This is how you define an array in Perl. This array can be seen as

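@HOUSE =

+----------+---------+---------+--------+--------+
| de Vries | Janssen | de Boer | de Wit | Bakker |
+----------+---------+---------+--------+--------+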

An easier way of defining an array in Perl is the qw operator, which inserts the quotation marks and the commas for you (but then you cannot have two words in one element):

@HOUSE = qw(de_Vries Janssen de_Boer de_Wit Bakker);

In Perl the prefix @ refers to an array. The 0th element of the array @HOUSE is the string "de Vries", its 1st element is "Janssen", ... and its 4th element is "Bakker". (For some reason, imagine that the flats are numbered not from 1 but from 0 onwards.)

If you want to refer to the family living in flat 3, you can do it this way: $HOUSE[3]. The expression $HOUSE[$k] refers to the $k-th element of the array, whatever the actual value of $k is. In these cases '3' or '$k' is called the "subscript" or the index (reminding us of the subscripts or indices in mathematics, like a₁ or a_k). You can also have expressions like $HOUSE[$k+1].

Never forget the $ symbol at the beginning! (It is $ and not @, because this is not an array any more, but a simple scalar variable with a complex name.) Notice that you need round brackets (...) for defining an array, and square brackets [...] for referring to one of its elements.

How to print out the names of the people living in the different flats?

#!/usr/bin/perl -w

@HOUSE = ("de Vries", "Janssen", "de Boer", "de Wit", "Bakker");
for ($i = 0 ; $i < 5 ; $i++)
{ print "\n The family living in flat no. $i is $HOUSE[$i]";}

The result is:

birot@hagen:~/> house
The family living in flat no. 0 is de Vries
The family living in flat no. 1 is Janssen
The family living in flat no. 2 is de Boer
The family living in flat no. 3 is de Wit
The family living in flat no. 4 is Bakkerbirot@hagen:~/>
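The loop above relies on knowing that the array has exactly 5 elements. As a minimal sketch of an alternative, Perl can report the size itself: scalar(@HOUSE) returns the number of elements and $#HOUSE the index of the last one. Both are standard Perl, but they are not used elsewhere in these notes, and the variable names $size and $last are only illustrative:

```perl
#!/usr/bin/perl -w

@HOUSE = ("de Vries", "Janssen", "de Boer", "de Wit", "Bakker");

$size = scalar(@HOUSE);   # number of elements in the array
$last = $#HOUSE;          # index of the last element (one less than the size)

for ($i = 0 ; $i < $size ; $i++)
   { print "The family living in flat no. $i is $HOUSE[$i]\n"; }

print "There are $size flats, the last one is flat no. $last \n";
```

Written this way, the loop keeps working when a family is added to or removed from the list.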

Do you want to read the data from the keyboard? Look at this:

birot@hagen:~/> cat house2
#!/usr/bin/perl -w

$name = ""; # Just to give it any initial value (empty string)

for ($k = 0; $name ne "stop" ; $k++)
    {
      print "name?   ";
      $name = <STDIN>;
      chomp($name);

      if ($name ne "stop")
         { $HOUSE[$k] = $name ; } # Putting the value into the array

     }

$k--;   # Because it gets incremented after the last cycle

for ($i = 0 ; $i < $k ; $i++)
     { print "\n The family living in flat no. $i is $HOUSE[$i]";}

print "\n";

The result of a run will be:

birot@hagen:~/> house2
name?   Janssen
name?   de Wit
name?   de Vries
name?   Bakker
name?   stop

The family living in flat no. 0 is Janssen
The family living in flat no. 1 is de Wit
The family living in flat no. 2 is de Vries
The family living in flat no. 3 is Bakker
birot@hagen:~/>
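The bookkeeping with $k in house2 can be avoided with the built-in push function, which appends a value to the end of an array. The following is only a sketch: to keep it runnable without typing, the hypothetical array @INPUT stands in for the lines that house2 reads from the keyboard.

```perl
#!/usr/bin/perl -w

# @INPUT is a hypothetical stand-in for the names typed at the keyboard
@INPUT = ("Janssen", "de Wit", "de Vries", "Bakker", "stop");
@HOUSE = ();

for ($k = 0 ; $INPUT[$k] ne "stop" ; $k++)
    { push(@HOUSE, $INPUT[$k]); }    # appends the name to the end of @HOUSE

# $#HOUSE is the index of the last element, so no separate counter is needed
for ($i = 0 ; $i <= $#HOUSE ; $i++)
    { print "The family living in flat no. $i is $HOUSE[$i]\n"; }
```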

You can have more complex data structures, too, like two (or more) dimensional arrays. In this case each element of an array is an array itself. For instance $FAMILY[$i][$j] returns the name of the family living in building $i, apartment no. $j.
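A two-dimensional array could be set up as in the sketch below. The data are hypothetical (two buildings with three apartments each), and the inner square brackets, which create the inner arrays, are a piece of Perl syntax that these notes do not explain further:

```perl
#!/usr/bin/perl -w

# Hypothetical data: each inner [...] holds the families of one building
@FAMILY = (
    [ "de Vries", "Janssen", "de Boer" ],    # building 0
    [ "de Wit",   "Bakker",  "Visser"  ],    # building 1
);

for ($i = 0 ; $i < 2 ; $i++)
   {
     for ($j = 0 ; $j < 3 ; $j++)
        { print "Building $i, apartment $j is home to $FAMILY[$i][$j]\n"; }
   }
```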

The big problem with arrays is that they can easily lead to an explosion in the memory size needed (a problem that was very serious especially in the past, not so long ago...).

Imagine you would like to count the number of occurrences of N-grams. Let N = 4, and let us work with N-grams of letters (and not of words). Imagine you had an array of size 26 × 26 × 26 × 26. The element $NGRAMS[$i][$j][$k][$l] gives you the frequency of the quadgram whose first letter is the $i-th letter of the alphabet, whose second letter is the $j-th letter of the alphabet, etc. (counting from 0). E.g. $NGRAMS[0][0][0][0] is the frequency of 'aaaa', while $NGRAMS[0][1][24][25] is the frequency of 'abyz'. What is the size of this array? 26 × 26 × 26 × 26 = 456976 elements, which would have caused big trouble on a regular PC a couple of years ago (not to mention the time needed to handle it...). And we haven't even spoken about 5- or 6-grams on the word level... But we have to realize that a very significant part of this huge array would simply be zero, and therefore this is not a good way to handle N-grams.

It is much more UNIX-like to manipulate files, and not huge arrays. Or, you should rather use hash tables...

2. Hash tables

A hash is another type of data structure. It can be seen as an array that has "keys" instead of "subscripts". This is like a phonebook:

%PHONEBOOK =

+-------+----------+
| Joe   | 545-7654 |
| James | 323-4567 |
| ...   | ...      |
| Ann   | 424-1234 |
+-------+----------+

Now you can refer to the phone number of James as $PHONEBOOK{James}, and to the phone number of the person whose name is stored in the variable $i as $PHONEBOOK{$i}.

How can you define such a data structure in Perl?

%PHONEBOOK = qw (
     Joe      545-7654
     James    323-4567
     Kathy    888-9876
     Ann      424-1234
);

For the sake of easy reading I have put each entry on a separate line, first the key, and then the value. Notice that the % symbol is used for a hash, replacing the @ symbol for arrays. The qw operator puts in everything needed; spaces are understood as separators, so you should pay attention if you would like to use spaces inside keys or values.

When referring to a value of the hash, put the key into curly brackets {...}, as opposed to the square brackets [...] of the arrays.
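As a small sketch of these points, the lines below add one entry, list all entries, and check whether a name is present. The functions keys, sort and exists are standard Perl but are not introduced in these notes, and the entries "Bob" and "Zoe" are made up for the example:

```perl
#!/usr/bin/perl -w

%PHONEBOOK = qw (
     Joe      545-7654
     James    323-4567
     Kathy    888-9876
     Ann      424-1234
);

$PHONEBOOK{Bob} = "111-2222";     # hypothetical new entry

# keys returns all the keys of the hash; sort puts them in alphabetical order
foreach $name (sort keys %PHONEBOOK)
   { print "$name has number $PHONEBOOK{$name}\n"; }

$who = "Zoe";                     # a name that is not in the phone book
if (exists $PHONEBOOK{$who})
   { print "$who has number $PHONEBOOK{$who}\n"; }
else
   { print "$who is not in the phone book\n"; }
```

Checking with exists before printing avoids the warning that -w gives when a value is looked up under a key that is not in the hash.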

The following program consists of two parts. In the first one you can add new people to your phone book (a couple of people are already entered at the beginning). Then you can ask for the phone number of people, either Joe, James, etc., or those that you have just entered. Look carefully at all the details.

#!/usr/bin/perl -w

%PHONEBOOK = qw (
     Joe       545-7654
     James     323-4567
     Kathy     888-9876
     Ann       424-1234
);
        # These are the names given initially

do
    {
      print "name?   ";
      $name = <STDIN>;
      chomp($name);

      if ($name ne "stop")
         {
            print "phone number?   ";
            $phone = <STDIN>;
            chomp($phone);
            $PHONEBOOK{$name} = $phone ;
                # This is the crucial line,
                # adding new people to the hash
         }
    }
while ($name ne "stop");

print "\n\n Now you can ask for people's phone number. \n";

do
     {
      print "name?   ";
      $name = <STDIN>;
      chomp($name);

      if ($name ne "stop")
         {
            print "The phone number of $name is: $PHONEBOOK{$name}\n";
              # This line shows you how to get information from the hash
         }
     }
while ($name ne "stop");

print "\n";
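Hashes also offer a way around the N-gram problem from section 1: only the quadgrams that actually occur take up memory. The following is a minimal sketch, assuming a short made-up sample text in $text; the built-in functions substr and length are standard Perl but are not covered in these notes.

```perl
#!/usr/bin/perl -w

$text = "abracadabra";     # hypothetical sample text
%NGRAMS = ();

# Take every substring of 4 letters and count it in the hash
for ($i = 0 ; $i + 4 <= length($text) ; $i++)
   {
     $gram = substr($text, $i, 4);    # the 4 letters starting at position $i
     $NGRAMS{$gram}++;                # a key that is not there yet starts from 0
   }

foreach $gram (sort keys %NGRAMS)
   { print "$gram occurs $NGRAMS{$gram} time(s)\n"; }
```

The hash ends up with one entry per distinct quadgram in the text, instead of the 456976 boxes of the four-dimensional array.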

3. More regular expressions

As PERL is very close to UNIX, it has plenty of possibilities to handle regular expressions. We have already met the expressions of the type:

$variable =~ /RegEx/
$variable !~ /RegEx/

The first one is true if the left hand side matches the RegEx, and the second one is true, if it doesn't.

What can contain a RegEx? Some examples:

/[A-Za-z\.]/      # Any uppercase and lowercase character, as well as the period ('.')
/^.[0-9]*/        # Any character at the beginning of the string, followed by 0 or more digits
/[aeiou]+/        # 1 or more vowels
/[aeiou]\/?$/     # A vowel followed by 0 or 1 "/" symbol at the end of the string
/([0-9]+[a-z])*/  # 0 or more times the following sequence: 1 or more digits followed by a lowercase letter
/((A\.B)?C)+/     # 1 or more times the following sequence: a character 'C' preceded by 0 or 1 times the "A.B" string
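To see =~ and !~ at work, the last pattern of the list can be tried on a few strings. This is only a sketch: the test strings in @WORDS are made up, and the pattern is additionally anchored with ^ and $ so that the whole string has to match.

```perl
#!/usr/bin/perl -w

@WORDS = ("C", "A.BC", "A.BCC", "AxBC", "banana");   # hypothetical test strings
$count = 0;

for ($i = 0 ; $i < 5 ; $i++)
   {
     $w = $WORDS[$i];

     if ($w =~ /^((A\.B)?C)+$/)
        { print "$w matches \n"; $count++; }

     if ($w !~ /^((A\.B)?C)+$/)
        { print "$w does not match \n"; }
   }

print "$count of the 5 strings matched \n";
```

Here "AxBC" does not match, because the escaped \. stands for a real period and not for any character.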

Remember:

. - any character (except the newline character)
* - Kleene star (0 or more times)
+ - Kleene plus (1 or more times)
? - 0 or 1 times
^ - beginning of the string
$ - end of the string
(...) - parentheses grouping a substring
\ - escape character for neutralizing metacharacters, like ., /, ^, etc.

For many more possibilities (ignoring case, further quantifiers, substitution, ...), consult any PERL book.
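As a brief taste of these further possibilities, here is a hedged sketch of the i modifier (ignoring case), the s/.../.../ substitution operator and the {n} quantifier. All of these are standard Perl, but the example strings and variable names are made up:

```perl
#!/usr/bin/perl -w

$line = "Perl is close to UNIX, and perl is fun";

if ($line =~ /unix/i)                   # i: ignore case, so "UNIX" matches
   { print "The line mentions unix \n"; }

$first = $line;
$first =~ s/perl/PERL/;                 # replaces the first "perl" only
print "$first \n";

$all = $line;
$all =~ s/perl/PERL/gi;                 # g: every occurrence, i: ignore case
print "$all \n";

$phone = "545-7654";
if ($phone =~ /^[0-9]{3}-[0-9]{4}$/)    # {3}: exactly 3 times, {4}: exactly 4 times
   { print "$phone looks like a phone number \n"; }
```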

4. Handling files

This is usually your ultimate goal: getting information from files (why retype everything at each run?) and putting information into files. For the latter, one possibility is to redirect your standard output when running your program, but this is not always enough.

A file is seen in a programming language as a special type of data structure. On the one hand, a file is something on your hard disc or on your floppy disc, etc. On the other hand, it is a special type of variable: a lot of data that you can refer to within your program using a name, the same way as for scalar variables, arrays or hash tables. An important property of a variable of the FILE type is that it ends with an EOF (end of file) marker. This second meaning of a "file" is sometimes referred to as a "filehandle".

A filehandle can be imagined as a long chain (string) of characters that you are moving along. There are four basic types of operations:

1. Changing your actual position: moving forwards, backwards, to the beginning of the file, to the end of the file, etc.
2. Checking your actual position.
3. Reading the character / byte that is at your actual position.
4. Writing something to your actual position (overwriting the actual position, unless you are at the end of the file).

(In fact, checking your actual position is in most cases checking whether you have reached the end of the file, and this is nothing else but reading the file at your current position and checking whether you get the EOF marker.)

Theoretically handling a file consists of the following four steps:

1. Opening the file: The content of the file in the file system (on your hard disc or floppy disc, etc.) is read into, or at least associated with the variable name whose type is FILE (with the filehandle).
2. Manipulation on the file: See the four basic types of operations above. This is always done on the filehandle, and not on the "real" file itself (on your hard disc / floppy disc / etc.).
3. Flushing: This is the moment when the content of your filehandle gets physically to your hard disc, floppy disc, etc. If your file is read-only, this never happens. If you change the content of your file, you should do it at least once before closing the file. Some programming languages and some systems do it more often for you.
4. Closing the file: This is when you dissociate your FILE variable (your filehandle) from your "real" file. Your FILE variable ceases to exist, and a lot of memory gets free (this is an important factor...). Some systems automatically flush your files, others don't (so if you don't do it yourself, your changes get lost). Some systems automatically close your open (unclosed) files if your program halts, others don't (and this may lead to serious problems).

File handling is where the most errors can occur. For instance, the file to be opened may not exist in your file system, or you may not have permission to read it or to write it, etc. Therefore it is always very important to check whether the opening, the flushing and the closing operations were successful. The goal of this course is not to go into details, but this is a very important point to keep in mind. Luckily, Perl makes all this very simple.

The name of a filehandle in Perl doesn't begin with any prefix (unlike $ for scalar variables, @ for arrays or % for hashes). But it is highly recommended to use ALL UPPERCASE names.

The commands in Perl are:

open (FILENAME, "path");
# The first parameter is the name of the filehandle
# The second parameter is the path and filename in the file system
close (FILENAME); # it contains automatic flushing

When closing a file, as well as when doing any other manipulation, you have to specify which filehandle you want the operation to take place on (you can have several files open at the same time). That is why FILENAME occurs everywhere.

(If you want to write to a file that possibly does not exist, use "> path" for overwriting, and ">> path" for appending.)

Reading from a filehandle:

$a = <FILENAME>;

Writing to a filehandle:

print FILENAME "text to be written to the file, including variables, etc.";

In fact, there are three filehandles that are automatically opened for you: STDIN (standard input), STDOUT (standard output) and STDERR (standard error). So far we have used STDIN for reading from the keyboard (or from a file or another program, if you run your Perl program with the standard input redirected, using < or a | pipeline). Furthermore, if you do not specify the filehandle in the print command, it is understood as referring to STDOUT, and this is the way we have used it so far.

What is standard error? (See also week 3.) It happens very often that when you redirect your standard output, you don't want to redirect everything that goes to the screen. You just want to redirect some results, but not error messages, messages about the program running successfully, etc. In the following I will show you an example of how to use it.

Here is a standard way to deal with errors:

open (FILENAME, "this/is/your/path/and/filename") ||
die "Couldn't open this/is/your/path/and/filename \n";

Without going into details, the || sign tells the computer to execute the second command if the first one ran into an error, while the die command prints the text to STDERR and then stops the program. (If the message passed to die does not end with \n, then you also get the name of the running program and the number of the line in which the program died, information that is very useful for debugging your program.)
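Putting the commands of this section together, a complete round trip (writing a file and reading it back) could look like the sketch below. The file name "example.txt" is hypothetical and the file is created in the current directory. The special variable $!, which holds the error message of the system, is standard Perl but not discussed in these notes.

```perl
#!/usr/bin/perl -w

# "example.txt" is a hypothetical file name, created in the current directory
open (OUT, "> example.txt")
     || die "Couldn't open example.txt for writing: $!";
print OUT "first line\n";
print OUT "second line\n";
close (OUT)
     || die "Couldn't close example.txt: $!";

open (IN, "example.txt")
     || die "Couldn't open example.txt for reading: $!";
$count = 0;
while ($line = <IN>)      # false when there is nothing left to read
   {
     chomp($line);
     $count++;
     print "Line $count says $line\n";
   }
close (IN);

print STDERR "Read $count lines \n";
```

Since none of the die messages ends with \n, a failure would also report the program name and the line number.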

Here is a new version of the previous program for writing to and reading from a phone book. It reads the initial phone book from a file. Furthermore, it combines printing to STDOUT and to STDERR, because there are some operations you always want to do interactively, and there are some things you may want to redirect into a file (for instance when you want to create a copy of some entries of your phone book). Observe when I use STDOUT and when I use STDERR. Copy this file to your directory, run it, and test what happens when you redirect the standard output. Try redirecting the standard input, too.

#!/usr/bin/perl -w

open(PHNBK, "phonebook")
     || die "Not able to open file \" phonebook \" ";

while ($name = <PHNBK>)
       # Reads a line from the file into $name,
       # and behaves like "false" if nothing has been read,
       # because EOF was reached.
   {
     chomp($name);
     $phone = <PHNBK> ;
        # Notice that names and numbers should be in
        # separate lines in your input file
     chomp($phone);

     $PHONEBOOK{$name} = $phone;
   }

close (PHNBK)
     || die "Could not close the file";
print STDERR "\n Adding new names \n";

do
    {
      print STDERR "name?   ";
      $name = <STDIN>;
      chomp($name);

      if ($name ne "stop")
         {
            print STDERR "phone number?   ";
            $phone = <STDIN>;
            chomp($phone);
            $PHONEBOOK{$name} = $phone ;
         }
    }
while ($name ne "stop");

print STDERR "\n\n Now you can ask for people's phone number. \n";

do
     {
      print STDERR "name?   ";
      $name = <STDIN>;
      chomp($name);

      if ($name ne "stop")
         {
        print STDOUT "The phone no. of $name is: $PHONEBOOK{$name}\n";
         }
     }
while ($name ne "stop");

print STDOUT "\n";
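The program above reads the phone book from a file but never saves the new entries. A possible way to write a hash back to a file, in the same format of one name and one number on alternating lines, is sketched below. The output file name "phonebook.new" and the two sample entries are assumptions made for the example.

```perl
#!/usr/bin/perl -w

%PHONEBOOK = qw (
     Joe       545-7654
     James     323-4567
);

# "phonebook.new" is a hypothetical output file name
open (OUT, "> phonebook.new")
     || die "Not able to open file phonebook.new for writing";

foreach $name (sort keys %PHONEBOOK)
   {
     print OUT "$name\n";                 # first the name on a line of its own
     print OUT "$PHONEBOOK{$name}\n";     # then the number on the next line
   }

close (OUT)
     || die "Could not close the file";

print STDERR "Phone book saved \n";
```

Writing to a separate file rather than to "phonebook" itself keeps the original intact if something goes wrong halfway.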

As PERL is a language for UNIX, PERL has a lot of further functions for handling files. You can do (almost) everything in PERL that you can do in UNIX, you have a lot of possibilities to check the properties of files (their permissions, their types, etc.),... For more information look at any manual of PERL.