Tekstmanipulatie, week 10


1. Some examples
 

Here is a program that implements the Mastermind game. It is more sophisticated than the one you were asked to write for the week 8 assignment, and it uses arrays instead of strings.
 

#!/usr/bin/perl -w

$N = 5;   # number of digits
$M = 5;   # to be found out: 0...M-1

# ------------------------------------ #
# Creating the numbers to be found out #

srand;      # initializing the random number generator
            # this should be done in order to get different
            # values when running your program again and again

for ($j = 0 ; $j < $N; $j++)
   {
     $prob[$j]= int(rand($M));
   }

          # rand($M) returns a random floating point (real) number r
          #          with 0 <= r < $M.
          # int($x) returns the integer part of $x

          # the numbers to be found out are stored in the array @prob
 

# ----------------------------------- #
#               Guesses               #

$bingo = 0;     # 0 = false
system ("clear");          # clears the screen

print "@prob \n";          # debugging aid: shows the solution; comment it out to play for real

while (! $bingo)
  {
   print "What is your guess?\nType each digit on a new line, then press ^d\n";
   @guess = <STDIN>;  # reading the guess directly into an array

   @prob1 = @prob;    # I will remove those that have a match in @guess

   $white = ( $black = 0 );
                  # white : same colour at different place
                  # black : same colour, same place

   for ($i = 0; $i < $N ; $i++)   # check for black
       {
         if ($prob1[$i] == $guess[$i])
            {
               $black ++;
               $prob1[$i] = 999;   # mark both entries as used; the marks differ
               $guess[$i] = 998;   # (999 and 998), so they can never match each other
            }

       }
   for ($i = 0; $i < $N ; $i++)  # check for white
       {
        for ($j = 0 ; $j < $N ; $j++)
           {
             if ($prob1[$i] == $guess[$j])
                {
                  $white ++;
                  $prob1[$i] = 999;   # mark both entries as used,
                  $guess[$j] = 998;   # with different marks (999 and 998)
                  $j = $N;   # stop checking for this

                }        # end of if

           }             # end of for (j)
       }                 # end of for (i)

   print "Black: $black   White: $white\n";
   $bingo = ($black == $N);
                # $bingo becomes = 1 = true
                # if and only if $black == $N (maximal)
 

  }                      # end of while (! $bingo)

print "\n\n BINGO !!! \n\n ";
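
The scoring step of the program above can also be isolated in a subroutine, which makes it easier to try out on fixed inputs. The following is only a sketch: the name score and the sample arrays are invented for illustration, and it uses undef instead of 999/998 to mark entries that have already been counted. It assumes that both arrays have the same length.

```perl
#!/usr/bin/perl -w
use strict;

# score(\@secret, \@guess) returns the pair (black, white).
# Hypothetical helper, not part of the original program.
sub score {
    my ($secret, $guess) = @_;
    my @s = @$secret;            # work on copies, so the caller's arrays stay intact
    my @g = @$guess;
    my ($black, $white) = (0, 0);

    # black: same digit at the same place
    for my $i (0 .. $#s) {
        if ($s[$i] == $g[$i]) {
            $black++;
            $s[$i] = $g[$i] = undef;     # mark both entries as used
        }
    }

    # white: same digit at a different place
    for my $i (0 .. $#s) {
        next unless defined $s[$i];      # already counted
        for my $j (0 .. $#g) {
            next unless defined $g[$j];  # already counted
            if ($s[$i] == $g[$j]) {
                $white++;
                $s[$i] = $g[$j] = undef;
                last;                    # stop checking for this digit
            }
        }
    }
    return ($black, $white);
}

my ($black, $white) = score([1, 2, 3, 4, 0], [1, 3, 2, 0, 0]);
print "Black: $black   White: $white\n";
```

With the sample arrays, positions 0 and 4 match exactly, and the digits 2 and 3 are present but swapped, so the sketch prints "Black: 2   White: 2".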
 

You can find this file, as well as the example files from the previous week, on Hagen under ' /users1/birot/Examples '. You can even run these programs there.
 

2. Some more regular expressions

(Maybe useful for somebody for the final assignment...)

Pre-defined classes:

\d = [0-9] (digits)
\D = [^0-9] (non digits)
\w = [a-zA-Z0-9_] (word characters)
\W = [^a-zA-Z0-9_] (non word characters)

A new use of the =~ operator: it modifies the string on the left hand side if the right hand side is a replacing expression (s/.../.../ or tr/.../.../).

For example you can make changes on strings that are similar to what the tr and sed commands do in Unix:

$name =~ tr/A-Z/a-z/;        # replace every upper case letter with the corresponding lower case letter

$name =~ s/[aou]e/German/;   # replace the first occurrence of 'ae', 'oe' or 'ue' with the string 'German'

$name =~ s/[\.,;]//g;        # delete periods, commas and semicolons (replace them with the empty string)
                             # 'g' stands for "global": all occurrences, not only the first one

$name =~ s/\W.*//;           # keep only the first word: delete everything from the first non-word character on
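
To see what these four operations do, here is a small sketch that applies each of them to a copy of the same string. The sample string and the variable names are made up for illustration.

```perl
#!/usr/bin/perl -w
use strict;

my $name = "Mueller, Hans; Dr.";     # invented sample string

# (my $copy = $name) =~ ...  first copies $name, then changes only the copy
(my $lower = $name) =~ tr/A-Z/a-z/;       # mueller, hans; dr.
(my $first = $name) =~ s/[aou]e/German/;  # MGermanller, Hans; Dr.
(my $clean = $name) =~ s/[\.,;]//g;       # Mueller Hans Dr
(my $word  = $name) =~ s/\W.*//;          # Mueller

print "$lower\n$first\n$clean\n$word\n";
```

Working on copies keeps $name unchanged, so the four results can be compared side by side.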
 
 

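More quantifiers (besides +, ? and *):

/x{5,10}/         # 5, 6, ... or 10 occurrences of the character 'x'

/[a-n]{3,}/       # 3 or more characters from [a...n] in a row, e.g. 'bcd' or 'kka'

/\w{0,4}/         # at most 4 word characters; since 0 is allowed, this matches any string unless anchored, e.g. /^\w{0,4}$/

/.{5}/            # exactly five characters (any characters except newline)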
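
A quick way to get a feeling for these quantifiers is to test a few strings against them. The sketch below does that; the subroutine name matching and the test strings are invented, and the \w pattern is anchored with ^ and $ so that it really limits the length of the whole string.

```perl
#!/usr/bin/perl -w
use strict;

# matching($s) returns the list of patterns (as text) that match $s.
# Hypothetical helper, for illustration only.
sub matching {
    my ($s) = @_;
    my @hits;
    push @hits, 'x{5,10}'   if $s =~ /x{5,10}/;
    push @hits, '[a-n]{3,}' if $s =~ /[a-n]{3,}/;
    push @hits, '^\w{0,4}$' if $s =~ /^\w{0,4}$/;
    push @hits, '.{5}'      if $s =~ /.{5}/;
    return @hits;
}

foreach my $s ("xxxx", "xxxxxx", "bcd", "kz", "hello") {
    my @hits = matching($s);
    print "$s: @hits\n";
}
```

For example, "bcd" is matched by [a-n]{3,} and by the anchored \w pattern, while "xxxxxx" is matched by x{5,10} and .{5} only.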
 


A useful trick to build N-grams:

 
#!/usr/bin/perl -w

$N = 2;             ## here you can set up what the value of N is

$string = <STDIN>;
chomp($string);

for ($l = 1; $string =~ /.{$l}/ ; $l++) {};

$l--;    # $l equals the length of the input string

print "$l";   # check it, if it is really true...

for ($k = 0; $k < $l-$N+1 ; $k++)
  {
     $chunk = $string;

     $chunk =~ s/^.{$k}//;   # cut the first $k characters

     $last = $l-$k-$N;
     $chunk =~ s/.{$last}$//; # cut the last characters ($last pieces)

     print "\n $chunk";  # this is the $k-th $N-gram of the input string
  }

print "\n";
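
The same n-grams can also be obtained without regular expressions, using Perl's built-in functions length and substr. This is only an alternative sketch, not the method of the program above; the subroutine name ngrams and the sample string are chosen for illustration.

```perl
#!/usr/bin/perl -w
use strict;

# ngrams($string, $n) returns the list of all $n-grams of $string.
# Hypothetical helper, for illustration only.
sub ngrams {
    my ($string, $n) = @_;
    my @chunks;
    for (my $k = 0; $k <= length($string) - $n; $k++) {
        push @chunks, substr($string, $k, $n);   # $n characters starting at position $k
    }
    return @chunks;
}

my $string = "My dear";          # sample input instead of <STDIN>
print length($string), "\n";     # the length of the string
foreach my $chunk (ngrams($string, 2)) {
    print " $chunk\n";
}
```

If the string is shorter than $n, the loop body is never executed and the list is empty.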


 

3. Perl-programs in context

Why not put a Perl program in a pipeline?

Here is the result of running the previous program (saved as ngrams) in a pipeline:

birot@hagen:~/> echo "My dear, I love you" | ngrams
19
 My
 y
  d
 de
 ea
 ar
 r,
 ,
  I
 I
  l
 lo
 ov
 ve
 e
  y
 yo
 ou


Or what about:

birot@hagen:~/> echo "My dear, I love you" | ngrams > filename

birot@hagen:~/> echo "My dear, I love you" | ngrams | sort > sorted_file


In the same way as you can run a Perl program from the prompt simply by typing its name, you can of course run a Perl program from a shell script, too.
 
 

4. Undefined structures
 

You will get a warning (when Perl runs with the -w switch: "Use of uninitialized value") if you try to use something that is not defined: for instance the value of a hash at a key for which it has not been previously defined. How can you avoid this problem?

Example: you have two hashes, each of them containing a phone book, and you want to check whether they contain the same phone number for each person:
 

foreach $name (sort keys (%phonebook1))
    # this is a loop in which $name takes its values from a list
    # of the keys of the hash %phonebook1, sorted alphabetically

  {
    if ($phonebook1{$name} ne $phonebook2{$name})
          { print "There is a difference at $name" ;}
  }

If there is a name which is stored in %phonebook1, but not in %phonebook2, then you will get such a warning, because $phonebook2{$name} is not defined.

Another example: the ranks of the n-grams of one file are stored in one hash, the ranks for another file in another hash, and you want to sum up the rank differences over the n-grams: $sum += $file1{$ngram} - $file2{$ngram}. But suppose that one of the n-grams does not occur in one of the files. What to do? First of all, you have to decide what the program should do in that case. This is an important rule: define exactly what your program should do before you write it! For instance, in such a case the program could add a pre-defined maximum value to $sum. Informally speaking:
 

if ($file2{$ngram} is defined)   # This is obviously not Perl code!
           {
                 $sum += $file1{$ngram} - $file2{$ngram};
           }
   else
           {
                 $sum += $maximum_value;
            }


How to check if a hash is defined for a given key?

There are two options:

    Either use a nested foreach loop: this is slow (the inner loop makes plenty of unnecessary comparisons), but safe and simple:

foreach $name1 (sort keys (%PHONEBOOK1))
  {
   foreach $name2 (sort keys (%PHONEBOOK2))
    {
      if ($name1 eq $name2)
         {
           if ($PHONEBOOK1{$name1} ne $PHONEBOOK2{$name2})
               {
                 print "Different numbers for $name1 \n";
               }   # end of inner if
         }         # end of outer if
    }              # end of inner foreach
  }                # end of outer foreach

    Or learn about the defined function. When a variable (a scalar variable, or a hash at a certain key) has not been given any value, it is said to contain the undef value. undef is treated as zero in an expression like $var++.

Remark: That is the reason why you can count the frequencies of n-grams by simply writing: $frequency{$ngram}++ . (Have a look at it among the example files shown in class in week 9: /users1/birot/Examples/oct30/p10 on Hagen.) The value of the hash at that key is set to 1 if this is the first time the n-gram is encountered (i.e. the value of the hash was undef before). Otherwise the value of the hash is incremented by one.

In other cases (for example when you compare or print the value of a hash at a key for which it is not defined) you will get warnings. To avoid that, first call the "defined" function, which checks whether the given expression has the undef value or not: it returns 'false' if the value is undef, and 'true' otherwise. For instance, our previous example would be:

foreach $name1 (sort keys (%PHONEBOOK1))
  {
    if (defined($PHONEBOOK2{$name1}))
       {
         if ($PHONEBOOK1{$name1} ne $PHONEBOOK2{$name1})
             {
               print "Different numbers for $name1 \n";
             }   # end of inner if
       }         # end of outer if
  }                # end of foreach
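
The counting idiom from the remark above and the defined test can be tried out together. The following sketch uses an invented list of n-grams; only the hash name %frequency comes from the text.

```perl
#!/usr/bin/perl -w
use strict;

my %frequency;

# count the n-grams; the first ++ on a key turns undef into 1
foreach my $ngram ("th", "he", "th", "er", "th") {
    $frequency{$ngram}++;
}

# look up some n-grams, one of which has never been counted
foreach my $ngram ("th", "he", "xy") {
    if (defined($frequency{$ngram})) {
        print "$ngram: seen $frequency{$ngram} time(s)\n";
    } else {
        print "$ngram: not seen at all\n";
    }
}
```

Without the defined test, printing $frequency{"xy"} would trigger a "Use of uninitialized value" warning under -w.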


For hashes containing the rank of n-grams in given files, it is very similar. (But I don't want to give you the entire solution of the final assignment...)
 
 
 

5. More about UNIX
 

A remark about the command line:

If your command line is too long, type the \ symbol at the end of the first line and press Enter. Then you will get the so-called "secondary prompt" (usually '>'), and you can go on typing your command line. If you happen to type the \ key by chance just before pressing Enter (it happens pretty often to me...), just press Enter again.
 

Processes:

The idea of UNIX is to have several processes running at the same time. For each person logged in, there is a copy of the shell running. If you start running a program, you start a new process, which runs in parallel with the previous ones. Some programs in turn start new "child processes".

If you put a & sign at the end of your command line, the process will run "in the background". That is, the shell that has launched the process is ready to receive new commands (going on with a pipeline or giving you the prompt back) before the process ends.

The ps command prints the running processes (those run by you; if you run into trouble, use the -ef option to list many more running processes).

If you want to stop a program, or "to kill a process" in UNIX slang, then use:

- ^C (ctrl+C) if the process runs in the foreground

- ' kill -9 <pid> ' if you want to kill a process running in the background (where pid is the process identification number that you can learn from ps)

Suppose you want a process to run at a certain time, but not now. (Imagine you want to run a huge calculation that needs a lot of resources, but is not urgent. So you want to run it at night, when other people are not disturbed if your calculation slows down the machine.) Then use the ' at ' command.

The ' nohup ' (= NO HangUP) command (combined with '&') is useful when you want your program to keep running even after you have logged out.
 

What happens when you log in?

A lot of things... The interesting part is that the first file that is run is the script /etc/profile, then the .bash_profile file in your home directory. The latter calls the file .bashrc, which in turn calls /etc/bashrc. These files set your environment variables, etc. It is worth looking at them. If you feel confident enough in Unix, you can change the ones in your home directory by adding some lines (after having made a backup copy, so that you can restore the original state afterwards!). For example, add the line "echo Hi, how are you\?", log out and log in again, and look at the effect of this line.
 
 

File structure above your home directory

I mentioned in the first week that your home directory is a crucial point: below it (within it) you can do whatever you want, but nothing above it. Although different Unix systems are set up in different ways, there are some standards that are used by most system managers.

The home directories of the users are usually within a directory called /home or /users (or 'users1', 'users2', 'users3', etc., if there is some reason for differentiating among them).

Executable files (e.g. the programs that execute the standard Unix commands) are almost always to be found in /bin and /usr/bin. You can run most commands from the prompt simply by name because these paths are set in your $PATH variable. Some people have a ~/bin directory within their home directory where they put their own executable files.

/dev contains files concerning devices, in /etc you will find configuration files, /root contains the files of the root (the system administrator), /var contains variable files, /tmp contains temporary files, etc.

There are some funny files in /etc. For instance, if the file /etc/nologin exists, then you cannot log in, but get a message on the screen instead (this is the case when the system administrator wants to halt the system). The passwords are sometimes stored in /etc/passwd (don't worry, in an encrypted form).
 
 

6. FTP

The abbreviation ftp stands for File Transfer Protocol. Look at the remarks about different protocols in the lecture notes of week 5. This is the simplest way to transfer files from one machine to another, provided that you can log in to the remote machine as well. The ftp program provides an interface that lets you do only the basic steps needed to transfer files. (The interfaces provided by other protocols let you do more: with 'telnet' you can do on the remote machine whatever you could do from a non-graphical terminal, while web browsers, like lynx and others :-)), automatically display the files transferred.)

You will need it if, for instance, a Windows-fan friend of yours sends you some funny attachments that you are not able to open on Linux. Something that happens pretty often...

In the early and middle 90s, before the web became so popular, there used to be a lot of "anonymous ftp servers". If you knew their addresses, you could just log in as a "guest", without any password, and download public files, programs, images, etc. that were made available by others. Nowadays people prefer putting these files on their web pages.

Unless you have a fancy ftp program that shows you your local directory and your remote directory, and lets you transfer files by clicking on a button, you have the following commands:

ftp machines.name.nl - connect to the given machine (you can give the name of the remote machine as an argument already when running ftp).

disconnect - disconnect from the remote machine

bye - exit the ftp program (not 'exit', not 'quit', but 'bye'!)

bin - change the transfer mode to binary (8 bits instead of 7 bits; very important when transferring images, Word files, etc.)

put - put a file from the local system to the remote system

get - get a file from the remote system to the local system

mput, mget - the same, but they allow you to use standard UNIX wildcards

cd, pwd - change directory, print working directory on the remote system

lcd, lpwd - same on the local system

help - for more commands