Tekstmanipulatie, week 8

0. For people interested in n-gram techniques:

Another way of using n-grams for sorting text: M. Damashek: Acquaintance, Gauging Similarity with n-Grams, in: Science, vol. 267 (1995), pp. 843-848

Applying this method to DNA: T. Biro et al.: Application of Vector Space Techniques to DNA, in: Fractals 6, no. 3, pp. 205-210 (1998)

1.  A remark on ' man '

Several commands have more than one man page. It is sometimes useful to look at more than one of them, since each gives additional information. Try, for example:

man man
man 7 man

man printf
man 3 printf

2. Introduction to Perl

Recommended reading on Perl:

Henny Klein's lecture notes: week 7, week 8, week 9.

The slides of Miles Osborne (especially the basic ones)

The Perl page of Rob Koeling (It is highly recommended that you go through it! It is quick, good and fun!)

Chapter 1 ("Introduction") of Schwartz & Christiansen: Learning Perl (2nd edition); the "stroll through Perl" is especially recommended

(Library codes: 1. UB zaal Wisk.Natuurw.Tech. uwnt 174S PERL 005, photocopy only; 2. Bibl. Letteren 10.124 H05, lendable; 3. Bibl. Sociale Wetensch. 089 53 EX.2, lendable)
The book called "Programming Perl" is the advanced version of this book, and will be useful when writing more complex programs in Perl. You can also read through man perl.
Perl = short for "Practical Extraction and Report Language" (for some people: "Pathologically Eclectic Rubbish Lister"...:-)

Historical background:

Its purpose is to solve problems that would be too hard to write as a shell script, and too weird, short-lived, or complicated to code in a traditional programming language (like Pascal or C). It is a simplified programming language well suited to the Unix philosophy and to the common tasks one encounters when using Unix.

A Perl script is a file that

The path depends on the system used (on hagen, as on most systems, it is the one given above). The -w switch makes Perl warn you about potentially dangerous constructs; it is highly recommended while developing a program (you can remove it afterwards).

A comment is a text that is not understood as being part of the program. In Perl, any text appearing after a # sign is understood as being a comment, until the end of the line. (If you wish to write a multi-line comment, each line should start with the # sign.)

For executing programs there are two possibilities:

Perl is an interpreter: it does not produce a compiled, executable file. Before running a program, however, it parses the whole program and compiles it into a compact internal format. This avoids the main disadvantage of interpreters: syntax errors are reported before execution starts, not halfway through a run.

3. Basic input and output commands

First line of any Perl program:

#!/usr/bin/perl -w
...or the path to perl on your system. The -w option tells Perl to produce extra warning messages about potentially dangerous constructs.

Statements end with a semicolon (;). (It can be omitted after the last statement of a block, file, or eval, but this is not recommended.)

Parentheses are never required or forbidden for built-in functions.


print ("Hello world!");   # Text to be printed within double quotation marks

print "Hello world \n";   # \n stands for a new-line character

print "Hello, $name!\n";  # Prints the value of the variable $name at the given place

print "\a";            # ring the bell

Remark: within single quotation marks (single-quoted strings) every character means itself, with two exceptions: \' stands for the ' character, and \\ stands for the \ character (but \n stands for two characters, \ and n). Within double quotation marks many more things get interpreted (variables, \n for the new-line character, \a for the bell, \ddd for the character with octal code ddd, as seen with tr, etc.).

Assigning a value to a variable:
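
The difference between the two kinds of quotes can be checked with a minimal sketch such as the following. The variable names and the value "Tamas" are only illustrative.

```perl
#!/usr/bin/perl -w
use strict;

my $name   = "Tamas";              # illustrative value
my $single = 'Hello, $name!\n';    # single quotes: nothing is interpreted
my $double = "Hello, $name!\n";    # double quotes: $name and \n are interpreted

print length($single), "\n";       # prints 15 ($name and \n are kept literally)
print length($double), "\n";       # prints 14 (Tamas inserted, \n is one character)
```

The single-quoted string keeps the five characters of $name and the two characters \ and n, while the double-quoted one contains the value of the variable and a single new-line character.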

$d = 17;  # the value of the variable $d should be 17

$d += 5   # add 5 to the value of d, and the result goes back to d

$d -= $n  # the value of $d is decremented by the value of $n

$d++      # suffix autoincrement (the value of d is incremented by 1, but the value of the expression is the value of d before that operation)

++$d      # prefix autoincrement (the value of d is incremented by 1, and the value of the expression is the value of d after that operation)

$d = 17; $e = ++$d; # The value of both $d and $e becomes 18: the value of $d is incremented, and THEN this value goes to $e

$d--; --$d    # suffix and prefix autodecrement (by 1)

It is important to remember that all scalar variables in Perl hold either double-precision floating point values or strings. They are in fact automatically converted back and forth as needed.
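
A small sketch, with made-up variable names, of the suffix and prefix forms and of the automatic conversion between numbers and strings:

```perl
#!/usr/bin/perl -w
use strict;

my $d = 17;
my $e = $d++;            # suffix: $e gets the old value 17, then $d becomes 18
my $f = ++$d;            # prefix: $d becomes 19 first, then $f gets 19

my $text = "12";         # a string...
my $sum  = $text + 3;    # ...used as a number: 15
my $cat  = $sum . "0";   # a number used as a string: "150"

print "$d $e $f $sum $cat\n";   # prints: 19 17 19 15 150
```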


$name = <STDIN>;       # The value of the scalar variable $name is read from the STandarD INput, including the closing \n

chomp($name);              # The closing \n is removed from $name (the result goes back to $name)

2 + 3      # addition

5.1 - 2.4  # subtraction

3 * 12     # multiplication

14 / 3     # always floating point divide! so the result is: 4.666666....

2**3       # 2 to the third power

10.5 % 3.2 # modulus or remainder (10.5 gets reduced to its integer value, i.e. to 10, and 3.2 to 3, so the result will be 1).
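
The arithmetic operators above can be tried out with a sketch like this one. The variable names are illustrative, and int() is a built-in function not discussed above; it is used here only to show how to get a whole-number quotient.

```perl
#!/usr/bin/perl -w
use strict;

my $quotient  = 14 / 3;        # floating point division: 4.666...
my $whole     = int(14 / 3);   # int() drops the fractional part: 4
my $remainder = 10.5 % 3.2;    # operands are reduced to 10 and 3: 1
my $power     = 2 ** 3;        # 2 to the third power: 8

print "$quotient $whole $remainder $power\n";
```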

The logical comparison operators, returning a true or false value, are: < <= == >= > !=
Remember the double == sign for equality (the single = sign is the assignment operator)!

Logical operators:

($a && $b)              # Logical AND: Is $a and $b true?
($a || $b)              # Logical OR: Is either $a or $b true?
!($a)                   # Logical NOT: is $a false?

Operators for strings:

"hello" . "world"    # concatenation, the result being "helloworld"

"hello" . "\n"       # the result is "hello\n"

$name eq "Tamas"     # equality of the strings: the result is true if the variable $name contains the string "Tamas"

$name ne "Tamas"     # not equal

Further operators comparing two strings: lt gt le ge (less than, greater than, less than or equal to, greater than or equal to). The strings are compared character by character: a string is lt (less than) another string if, at the first position where they differ, its character has the smaller ASCII value. (A common prefix, i.e. identical leading characters, does not affect the comparison.) Thus both 30 > 7 and 7 gt 30 are true expressions (because the ASCII value of 7 is 55, and the ASCII value of 3 is 51).
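
The difference between numeric and string comparison can be verified with a short sketch. The variable names are illustrative, and ord() is a built-in that returns the character code of the first character of a string.

```perl
#!/usr/bin/perl -w
use strict;

my $numeric = (30 > 7);          # compared as numbers: true
my $string  = ("7" gt "30");     # compared as strings: true, "7" comes after "3"
my $reverse = ("30" gt "7");     # false: the first characters already decide

my $codes = ord("7") . " " . ord("3");   # character codes of "7" and "3"
print "$codes\n";                # prints: 55 51
```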

Regular expressions

Regular expressions are characteristic of Unix, and therefore of Perl too (which follows the Unix philosophy very closely). Here are some simple examples:

$fruit =~ /apple/    # This expression is true if the string "apple" appears as a substring within the string $fruit.

$fruit !~ /apple/    # This expression is true if the string "apple" DOES NOT appear within $fruit

$fruit =~ /$name/    # This expression is true if the string contained in the $name scalar variable appears within $fruit

Don't forget, everything is case sensitive under Unix!
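
A minimal sketch of these matches, with illustrative values. The result of a match is stored in a scalar here: a successful match gives 1, a failed one gives the empty string.

```perl
#!/usr/bin/perl -w
use strict;

my $fruit = "pineapple";    # illustrative values
my $name  = "pine";

my $has_apple   = ($fruit =~ /apple/);    # true: "apple" is a substring
my $no_banana   = ($fruit !~ /banana/);   # true: "banana" does not appear
my $has_name    = ($fruit =~ /$name/);    # true: $name contains "pine"
my $has_capital = ($fruit =~ /Apple/);    # false: matching is case sensitive

print "apple: $has_apple, Apple: '$has_capital'\n";   # prints: apple: 1, Apple: ''
```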

4. Conditions and cycles (loops)


The syntax of the basic conditional statement is:

if ( <expression> )
  { <commands> }
else
  { <commands> }
The else-branch is not compulsory, of course. You can have more than one command between the {...} braces, separated by semicolons.

The expression can be a comparison such as $a == 2 or ($name1 . $name2) eq "Tamas", but also a variable or any other expression (like $a * $b). If the value of the expression is false (the number 0, the string "0", or the empty string) then the else-branch is executed. Otherwise the first branch is executed.
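
Which values count as false can be checked with a sketch like the following. The helper truth() is hypothetical and exists only for this illustration.

```perl
#!/usr/bin/perl -w
use strict;

# truth() is a hypothetical helper: it reports which branch an if would take.
sub truth {
    my ($value) = @_;
    if ($value) { return "true" }
    else        { return "false" }
}

print truth(0),     "\n";   # false
print truth("0"),   "\n";   # false
print truth(""),    "\n";   # false
print truth("00"),  "\n";   # true: only the exact string "0" is false
print truth("0\n"), "\n";   # true: the newline makes it a different string
print truth(17),    "\n";   # true
```

Note the case "0\n": a line read from <STDIN> still ends in a newline, so it is true even when the user typed 0, unless the newline is removed first.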

An example:

#!/usr/bin/perl -w
print ("Please type 0 or 1 \n");
$number = <STDIN>;
chomp($number);                     # Without this, "0\n" would count as true
if ( $number )
  {print "You did not type 0 \n"}   # You typed 1 or 00 or anything else
else
  {print " You typed 0 \n"}         # You typed exactly 0 (or just pressed Enter)
print "Thanks for playing with me \n";

A very Perl-like construction is the following:

if ( <expression> )
  { <commands> }
elsif ( <expression> )
  { <commands> }
elsif ( <expression> )
  { <commands> }

where elsif is the combination of else if.
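
A minimal sketch of such a chain follows. The function describe() and its categories are made up for illustration.

```perl
#!/usr/bin/perl -w
use strict;

# describe() is a hypothetical helper: the first true condition wins.
sub describe {
    my ($n) = @_;
    if    ( $n < 0 )  { return "negative" }
    elsif ( $n == 0 ) { return "zero" }
    elsif ( $n < 10 ) { return "small" }
    else              { return "large" }
}

print describe(-3), "\n";   # negative
print describe(0),  "\n";   # zero
print describe(7),  "\n";   # small
print describe(42), "\n";   # large
```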


How to build loops?

while ( <expression> )
  { <commands> }
where the <commands> are executed as long as the expression is true.
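
As a minimal illustration (the variable names are made up), this loop adds up the numbers 1 to 10:

```perl
#!/usr/bin/perl -w
use strict;

my $i   = 1;
my $sum = 0;
while ( $i <= 10 )    # the test is made before each pass through the block
{
    $sum += $i;       # add the current number
    $i++;             # move on, otherwise the loop would never end
}
print "$sum\n";       # prints 55
```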


(From R. Koeling's page.) Perl has a for structure that mimics that of C. It has the form

for ( <initialise> ; <test> ; <inc> )
  { <actions> }

First of all the statement <initialise> is executed. Then while <test> is true the block of actions is executed. After each time the block is executed <inc> (incrementing) takes place. Here is an example for loop to print out the numbers 0 to 9.

for ($i = 0; $i < 10; ++$i)     # Start with $i = 0
{                               # Do it while $i < 10
                                # Increment $i before repeating
        print "$i\n";
}


- - - - - - - - - - - - -

Here is a program that reads some input from the keyboard and won't continue until it is the correct password  (from Rob Koeling's tutorial):

print "Password? ";             # Ask for input
$a = <STDIN>;                   # Get input
chop $a;                        # Remove the newline at end
while ($a ne "fred")            # While input is wrong...
{
    print "sorry. Again? ";     # Ask again
    $a = <STDIN>;               # Get input again
    chop $a;                    # Chop off newline again
}

The curly-braced block of code is executed while the input does not equal the password. The while structure should be fairly clear, but this is the opportunity to notice
several things. First, we can read from the standard input (the keyboard) without opening the file first. Second, when the password is entered $a is given that
value including the newline character at the end. The chop function removes the last character of a string which in this case is the newline.
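
Since chop removes the last character whatever it is, it behaves differently from chomp when the string has no closing newline. A small sketch with fixed strings standing in for keyboard input (the variable names are illustrative):

```perl
#!/usr/bin/perl -w
use strict;

my $chop_nl     = "fred\n";  chop  $chop_nl;      # "fred": the newline is removed
my $chop_plain  = "fred";    chop  $chop_plain;   # "fre": the last letter is lost
my $chomp_nl    = "fred\n";  chomp $chomp_nl;     # "fred": the newline is removed
my $chomp_plain = "fred";    chomp $chomp_plain;  # "fred": nothing to remove

print "$chop_nl $chop_plain $chomp_nl $chomp_plain\n";   # prints: fred fre fred fred
```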

To test the opposite thing we can use the until statement in just the same way. This executes the block repeatedly until the expression is true, not while it is true.
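
A minimal sketch of until, with made-up variable names: the block keeps doubling a number until it has reached 100.

```perl
#!/usr/bin/perl -w
use strict;

my $n     = 1;
my $steps = 0;
until ( $n >= 100 )     # same as: while ( $n < 100 )
{
    $n = $n * 2;        # double the number
    $steps++;           # count the passes through the block
}
print "$n after $steps doublings\n";   # prints: 128 after 7 doublings
```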

Another useful technique is putting the while or until check at the end of the statement block rather than at the beginning. This will require the presence of the do
operator to mark the beginning of the block and the test at the end. If we forgo the sorry. Again message in the above password program then it could be written like

do
{
        print "Password? ";     # Ask for input
        $a = <STDIN>;           # Get input
        chop $a;                # Chop off newline
}
while ($a ne "fred");           # Redo while wrong input