Tekstmanipulatie, week 14


1. expr and bc


In order to make easy calculations you can use the 'bc' command (bell's calculator). Type bc <RETURN> and you can immediately type in any expressions, like 3+4 or (45/3400)*100. In fact, similarly to the way we were writing short files by using 'cat', we are just using the fact that this command needs an input file, and if nothing else is specified, then it is the standard input. Therefore the program can be ended by ^d (CTRL + D: end-of-file). Or, alternatively, by ^c (CTRL + C: stop the running program).

Therefore why not doing things like:

echo 3+4 | bc
echo 23/46 | bc
Hey! Why is 23 / 46 = 0 ?! Because, if otherwise not specified, bc works with integers. Type ' scale = 4 ' to be able to receive your results with four decimals.
How to do this within one command line? You need an input file of two lines:
(echo scale = 4; echo 5/8) | bc
What does
echo 13 % 3 | bc
mean? The remainder of the division. And what is the problem with this one:
 echo (13/26)*4 | bc
Try rather the following, and remember what you know about the escape characters:
 echo \(13/26\)*4 | bc

What is the difference between echo and cat?

You can find the same dichotomy among the commands dealing with mathematical expressions: Examples for expr:
expr 3 + 4
7

expr 3+4
3+4

expr \( 3 + 4 \) \/ 4
1

expr 2 * 3
expr: syntax error

expr '-2' \* 3
-6

expr 13 \% 3
1

expr 8 = 8
1

expr 15 = 2
0

expr \( 8 = 8 \) \& \( 3 = 3 \)
1

expr '(' 8 = 8 ')' '|' '(' 3 = 4 + 5 ')'
1

Remarks: The numbers, parantheses and arithmetic symbols are different arguments, therefore you should separate them by a space (if you don't: see the second example). Some out of the arithmetic symbols are metacharacters, therefore they should be protected using quotes or the escape character ('\') (what is the reason of the error message in the fourth example?). Division is understood as division of integers, and % refers to the modulo of the division. The last four examples show how logical statements are evaluated: 0 stands for the logical value FALSE, while 1 stands for the logical value TRUE. The '&' symbol means AND, '|' means OR. Check man expr for further possibilities (e.g. what happens if you use these logical operations between numerals, and not between statements?).

The expr command, combined with back quotes (that is replaced by the shell with the output of the command line within the quotes) makes us an easier way to calculate type-token ratio or word-frequencies. How to calculate for instance the frequency of the word "the" in a given a a given file?


2. Variables



 

Soon, we are getting to shell scripts, which are "programs" built up of the elements of the Unix "Lego game": a file that contains a series of commands to be executed by the computer. Programming languages, however, contain a few features that we have not used so far: conditions (if... then...), loops (cycles), and variables. Now, we introduce variables, and next week we introduce conditions and loops.

Variables in most programming languages belong to one of a number of types: integer (..., -2, -1, 0, 1, 2, 3,...), floating point (e.g.: 3.14159265 or 6*10^23), character, string (i.e., a series of characters), Boolean (true of false), and so forth. Nevertheless, in Unix we have but one type: strings of characters, conform to the idea that "in Unix, everything is a text". For instance, if we give a value to the variable PI the following way, then $PI will be a string of 6 characters:

PI=3.1415

The variable PI does not differ qualitatively from another variable that contains the six characters Unix!! . The only difference is that a variable containing the digits of a number may be used by commands that interpret a string of characters as a number: for instance, in the argument list of expr, in the standard input of bc, or in the argument following the -n option of head and tail.

Note in the above example that no space can be added between the name of the variable and the = symbol, as well as between the = symbol and the value of the variable. Further, it is the convention (though, not compulsory) to use exclusively upper case letters in the name of the variables (beside digits and the underscore symbol _).

We need variables not only in shell scripts. When you use Unix, you (unconsciously) use a number of variables that carry pieces of information needed by shell or by other programs. These include your login name, your home directory, your current working directory, the paths of some important files in the file structure, the configuration of your desktop, or the format of the prompt.

You can get the list of this high number of variables (including the variables you have defined yourself) by using the command called ' set'. In fact a useful way of using it is by pipelining it with grep, like:

set | grep a=
set | grep PATH=
The system itself has a high number of variables. They have always upper case names. Here are some of them:
SHELL : gives the path of the running shell
PATH : a set of paths that are checked (in this order) when you give a command (i.e. the name of a program), and the shell looks for it in the file system
HOME : the path of the home directory of the actual user (you)
MAIL: the path where your mails are located
PWD : the actual working directory
OLDPWD : the previous working directory (before the last cd command)
LOGNAME : your login name
HISFILE : the file where your 'history' is (the list of your previous commands, max. HISTSIZE / HISTFILESIZE number of them, and you can read them with the ' history ' command)
PS1, PS2: the settings of your primary and secondary promp
TERM: the type of your terminal
You can check their settings on your account.

The way you can give them a new value is the following:

PWD=Federalist
N.B.:  no space before and after the = symbol. (Try out what happens if you put one.)

Changing the PWD variable results in changing your prompt, but in fact does not change your directory. Change the other system variables only if you are sure of yourself, or there is a system administrator standing just behind you... (Not in practicum time, please...)

You can define new variables yourself, just by giving them values. It is important to remember that all variables in UNIX are strings. (Remember: methacharacters, quotes, escapes,...)

Refering to a variable (let it be a system variable or a variable you have just defined) is done by putting the $ symbol before the name of the variable: in this case the shell replaces the string $<var_name> by the value of the variable, in the shell's pre-processing phase. This happens within the double quotation marks (".."), but not within the simple quotation marks ('...').

Examples:

birot@hagen:~> pear=apple
birot@hagen:~> set | grep pear=
pear=apple
birot@hagen:~> echo $pear
apple
birot@hagen:~> echo "$pear"tree
appletree
birot@hagen:~> echo '$pear'tree
$peartree
birot@hagen:~> echo $TERM
xterm
birot@hagen:~> echo '$TERM'
$TERM
Now it is logical that if you want to give the value of one variable to another variable, the way to do it is:
birot@hagen:~> banana=$pear
birot@hagen:~> echo $banana
apple


Remark: If you want to use a variable in one shell that you have defined in another one (like in a running program), then you have to export it. Consult any Unix book or 'man export' on how to do that.
 

3. Type-token ratio


In a text, you will find several words, some of them occur more than once. For instance, if a text was composed of the previous sentence and this one, then the word "of" would appear three times.

If I ask "how many words are there in this text?", you can give two different answers. If each case when the word "of" occurs is a different word, then you speak about the number of tokens. Each occurence of the same word counts as a different token. But you can also ask what is the number of types, that is, how many different words you have in your text. If a word occurs more times, then these are different tokens of the same type.

Imagine that you have a text, in which word A occurs 5 times, word B occurs 3 times, word C occurs once, and word D occurs only ones. Then you have 10 tokens (5+3+1+1=10), and 4 types ( A, B, C and D).

If you are given a text, then you can calculate different statistics. You can calculate the number of tokens, which is the length of the text. You can calculate the number of types, which gives you how rich the vocabulary of the text actually is. Another useful statistics is the type-token ratio: the ratio of the number of types and the number of tokens (you divide the number of types with the number of tokens). In the above example, it is 4 / 10 = 0.4.

Type-token ratio is used for very different purposes. It can be used to measure somehow the richness of the vocabulary, for instance in child speech development. It has been claimed that the type-token ratio is typical to authors, different authors have different type-token ratios, so some researchers have tried to determine the authors of writings with debated authorship, based on type-token ratios.

Here is are the results of a very primitive way to calculate type-token ratios for the Federalist papers:

Some papers by Alexander Hamilton:

fed11.txt: 0.335
fed12.txt: 0.368
fed13.txt: 0.404
fed15.txt: 0.345
fed17.txt: 0.393
fed21.txt: 0.358
fed29.txt: 0.344

Some papers by James Madison:

mad37.txt: 0.336
mad38.txt: 0.310
mad39.txt: 0.250
mad40.txt: 0.277

Some papers by John Jay:

jay2.txt: 0.377
jay3.txt: 0.349
jay4.txt: 0.358
jay5.txt: 0.392

The type-token ratios of James Madison are much lower than the type-token ratios of the two other others. Unlike Hamilton, John Jay never has a type-token ratio above 0.400.


4. Shell scripts



 

After having solved a number of assignments, you might want to save some of them so that you won't need the reinvent them each time you need them. You can save them in a file, and just check that file each time before retyping the long chain of commands. But why not let the computer itself read this file and execute it? To make the long story short, can we write programs using UNIX?

There are two arguments pointing toward this possibility:


Is Unix a programming language? It has been designed as an operating system, but it has so many possibilities that you can even write simple programs using it. What is a program?

All of these are possible within UNIX. We shall come back to some of these later.

At the moment what we want is to put a sequence of commands into a file, and then just run it.

How to have a sequence of (complex) commands? If you want to simply combine a sequence of commands, pipes, etc., just write them into new lines, or separate them with a semi-collumn (;).

For instance:

cat > a_simple_shell_script
echo Now I will list the subdirectories of the directories whose name contains exactly 4 characters.
ls -l ???? | grep ^d
echo Thank you for your waiting.
echo What about an alphabetical order of these?
ls -l ???? | grep ^d | sort
echo Here you have it.
^d
Now, we have a file named a_simple_shell_script that contains six lines. What can we do with this? We want to run it. Let's type the file name after the promt, type enter, and... we get an error message:
bash: a_simple_shell_script: command not found
What is wrong? Let's type './a_simple_shell_script', in some systems this is the way you can run the programs that are within your own directory. Did it help? No, you get the same error message. Because the machine doesn't know that this file has been written to make it run (and not only a text-file, that can be, e.g. sent to Mariette as the solution of your assignment). What to do? There are two steps:


When you have a file that you want to use pretty often, it might be complicated to give always the entire path. Why not to make it into a "real" command? There is a system variable (we will speak about them later) that give you a set of paths: when you type the name of a program to be run, without determining the exact (absolut or relative) path, the Shell will look for the directories given in this variable. You can add additional paths to this variable by typing:

PATH=$PATH:$HOME/shellscripts
The meaning of this is the following: the new value of the variable PATH should be its actual value, followed by a column (separating the differents paths within the variable), and then you can give the new path to be added. Suppose it is a directory called shellscripts within your own home directory. You can save typing the exact path of your home directory by refering to this other system variable.

You might want to use arguments in your shell scripts, similarly to the arguments of the standard Unix commands: these arguments influence the task performed by the program. The way to do this is by referring to them within your shell script as $1, $2, etc. These will refer respectively to the first, second, etc. argument given aftern the script's name. The arguments will be separated by a space in the command line, unless the space is neutralized by an escape character or a quote.

Furthermore, the variable $0 in the shell script refers to the zeroeth argument of the script, which is the command name used under which the program has been called. Although this seems to be redundant, it is not. Imagine that you have more file names that are hard links of each other. In that case, the same script can be launched under different command names, and the task to be performed by the script may depend upon which file name has been used. For instance, cp and mv may be the same programs, but if mv has been used, the file is also deleted once it has been copied.

An example: a shell script containing

ls -l $1 | grep $2

will look for the second argument as a regular expression within the long list of directory given by the first argument.

$* refers to all arguments (a list including the argument list of the script).

$# means the number of arguments used.
 
 

Additional examples of shell scripts

a. a KWIC-script

On week 13, we have learnt what a KWIC (= Key Word In Context, also called as a concordance) is: a list of the words occurring in some text, including also the word's context (the words before and after). There has been a long tradition for centuries, for instance, of creating concordances to the Bible, to the classical Latin poetries or the works of Shakespeare. Concordances have been very useful tools for philologists, especially when it was not possible to search for a word in an electronic data base. Computers make it very easy to build the concordance of a previously digitalized text.

Last week, we also saw how you can use the Unix Lego game to create quickly a concordance. The step presented there can be put into one shell script (the keyword is 'their / Their', and we look it up in file fed32.txt):

#! /bin/sh

grep -i their fed32.txt | sed -e 's/[Tt]heir/#&#/'> lines
   # collect the lines with the keyword (ignore case differences),
   # then mark the keyword with a # before and after it.

cut -d# -f 1 < lines > before    # first field of a line: until first '#'
cut -d# -f 2 < lines > itself    # second field in each line: between #s
cut -d# -f 3 < lines > after    # third field: what follows the second #
sed -e 's/^/                              /' < before > before2
sed -e 's/$/                              /' < after > after2
cut -c 1-30 < after2 > after3    # 30 characters after keyword
rev before2 | cut -c 1-30 | rev > before3    # 30 characters before keyword
paste before3 itself after3

Note that you can add comments to a file script by using the # symbol. Whatever appearing in a line following the # symbol is not processed, but seen as a comment.

Its is a pitty that we have to rewrite this script each time we want to change the keyword or the file in which we look for the keyword. Why not giving the keyword and the file's name as arguments of the script?

#! /bin/sh

cat $2 > file1     #    the input file into file1
grep -i $1 file1 | sed -e "s/$1/#&#/" > lines

cut -d# -f 1 < lines > before
cut -d# -f 2 < lines > itself
cut -d# -f 3 < lines > after
sed -e 's/^/                              /' < before > before2
sed -e 's/$/                              /' < after > after2
cut -c 1-30 < after2 > after3
rev before2 | cut -c 1-30 | rev > before3
paste before3 itself after3
rm before* after* file1
      # This last line removes all files created by the script

There are a few nice points to notice in this second version of the KWIC-script. First, it uses its arguments similarly to grep. The first argument is compulsory, if no argument is given, you get an error message (try it out!). The first argument is a pattern that you look up in the input. The second argument, however, is optional. You may give the name of an input file as the second argument, or the standard input (what you type, pipeline, <, etc.) is used if no second argument is given.

How is this achieved? The trick lies in the first line (cat $2 > file1). If a file name is specified in the second argument ($2), then it is copied to file1. Otherwise, the variable $2 is an empty string, and thus cat has no argument: it will read its own standard input, that is, the standard input of the script, and save it into file 'file1'. In both cases, the input text (either the content of a file or the standard input) is saved into file 'file1', and can be processed further in the second command line of the script (grep).

The second point to notice is the double quotation mark in the sed command following grep (sed -e "s/$1/#&#/"). Because & is a metacharacter (the end of a command line that has to be executed in the background), if you do not escape it, you will get an error message ("Unterminated `s' command": what preceeds &, namely, s/$1/#, is an unterminated sed rule). How to escape it? So far, the single quotation marks and the double quotation marks did the same. However, this script will work only with double quotation marks, because the single quotation marks escape also the variable names ($1 will mean the $ character followed by the 1 character). Yet, $1 (such as all $variables) within double quotation marks will still refer to the first argument of the script.

The last remark concerns the end of the script. While the script was running, quite a few files were created. At the end of the script, we have to delete these temporary files. It is also very important to give those temporary files such names that will not coincede with other file names. However, while you are working on your script, you may want to keep those files, because they can help you debug your program. Therefore, you should add the line removing the temporary files only when your script works.

c. Giving values to variables using back quotation marks

Let us write a shell script that will read the first line of file, as well as its second line, and then reverse them: print first the second line, and then the first one. The name of the file is given as the argument of the script.

How to solve this problem? Using head and tail, we can get the first and the second line of the file. Using back quotation marks, we will put these lines into variables, before printing them to the screen (by the way, now echo finally becomes a really useful command!):

#! /bin/bash
FIRST_LINE=`head -n 1 $1`
SECOND_LINE=`head -n 2 $1 | tail -n 1`
echo $SECOND_LINE
echo $FIRST_LINE

A more compact, though more complex solution is of course the following:

#! /bin/bash
echo `head -n 2 $1 | tail -n 1`
echo `head -n 1 $1`


b. Mathematical calculations using variables

It has been said that variables under UNIX are always strings of characters. You can, however, use numerical variables by using commands that interpret strings of digits as numerals (these are expr and bc). For instance, the following script calculates the square of the integer given as its argument:

#! /bin/bash
echo `expr $1 \* $1`

You may prefer the following script that is longer but maybe easier to understand:

#! /bin/bash
SQUARE=`expr $1 \* $1`
echo $SQUARE

If you want to calculate the square of floating point numbers, remember what parantheses and semi-colons (;) meant, and use something like the following:

#! /bin/bash
SQUARE=`(echo scale = 8 ; echo $1 \* $1) | bc`
echo $SQUARE

Read an input value into a variable

Most programming languages have a command that allow the user of the program to give some value to a variable (scanf, read, input, etc.). According to the Unix philosophy, we would like to read a string of characters from the standard input of the shell script, and to put this value into some variable. You can use the command read VARIABLE for that purpose (read its man page for more information).

However, with some creativity, we do not need to introduce a new command. Try out this one:

VARIABLE=`cat`

The standard input of cat goes to its standard output, which then becomes the value of VARIABLE, thanks to the back quotation marks. The only bad side of this solution is that you need to type ^D when you have finished typing in your input.




N-gram-based text categorization


See the web site of the previous week about N-grams.

Imagine that you work for a news agency, and that you have many-many documents entering your agency each day. It would be nice to have a program that sorts you those documents, based on language or content. Indeed, in the last 10-15 years there has been intensive research in computational linguistics in order to produce better algorithms classifying documents.

You can, for example, compare the most typical words. If the document contains many tokens of "een", then it must be a Dutch document. If the document contains "ein", then it may be German, and if it contains "une", then it should be French. If it frequently contains the word "computer" then it is about information technology, unlike if the typical word is "inflation" or "recession".

Very often the typical characteristics are not words, but N-gram of words: "stock exchange" is a 2-gram typical for economic texts, while "F. C. Groningen" is a 3-gram typical for sport.

You can also look for N-grams on the character level, especially if you want to sort your documents according to language. For instance, the trigram 'eau' is typical to French, 'sch' to Dutch or German, 'aa' to Dutch, 'sh' to English, etc.

If you are interested in this topic, for more information, please have a look at this web-site from 2002, and to the article mentioned there.

(I can also tell you more about the work that I had done on this field myself.)


Bíró Tamás:
e-mail
English web site
Magyar honlap

Last modified: Thu Jul 3 11:39:17 METDST 2003