Text manipulation, week 6

1. Article

Please read the following article that we will discuss in class:

William B. Cavnar and John M. Trenkle: N-Gram-Based Text Categorization

NEW: in pdf-format

2. Making a concordance (KWIC)

(From H. Klein's web site.)

       Concordance = a collection of occurrences of words in their context. Concordances of the Bible, for example, have been made since long ago (by hand,
       of course): lists of words and names with the places where they can be found. The modern, electronic form is KWIC (key word in context). In this format,
       a table is produced for a requested key word, showing its left and right context up to a certain width (for example 35 characters). Example
       for the word itself in Federalist/fed60.txt:

       ts own elections to the Union   itself . It is not
       e into them, it would display   itself   in a form
                 preference in which   itself   would not be included? Or to
                                       itself   could desire. And thirdly, th

How do we make such a KWIC table with UNIX utilities? Answer:

           1. use grep to select the relevant lines
           2. use cut to extract the contexts
           3. use sed to make the contexts long enough (pad with spaces)
           4. use cut to make the contexts short enough (with rev where needed)
           5. use paste to put the columns next to each other again

Here we go. Suppose that, as above, we want to make a KWIC for the key word itself from the file fed60.txt. Unfortunately, standard sed has no
'ignore case' option (GNU sed accepts an I flag, but that is not portable), so we spell out both cases of every letter:

1. grep -i itself ~vannoord/Federalist/fed60.txt (continued with a pipe:)
| sed -e 's/[Ii][Tt][Ss][Ee][Ll][Ff]/#&#/' > lines

The file lines looks as follows. We have used the special symbol # to mark the key word clearly (for cut below).

               regulating its own elections to the Union #itself#. It is not
               gain admittance into them, it would display #itself# in a form
               preference in which #itself# would not be included? Or to what
               #itself# could desire. And thirdly, that men accustomed to

           2. We make separate files for the contexts and the key word using cut. cut extracts pieces from a line of text. It has various options; the most
              important ones for now:
              cut -c with numbers or ranges of numbers: the characters at those positions, e.g. cut -c 2,5-8,14- extracts from each line the characters at
              position 2, 5 through 8, and from 14 to the end
              cut -f with numbers or ranges of numbers extracts the requested fields from the line. By default, fields are separated by a tab (cf. paste), but
              you can specify your own delimiter via -d, such as a space for word boundaries (-d' '). Below we use the added # as the delimiter.

              cut -d# -f 1 < lines > before
              cut -d# -f 2 < lines > itself
              cut -d# -f 3 < lines > after
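The two cut modes can be tried on single strings before running them on whole files. This is only a small sketch: the alphabet string and the variable names (line, before, keyword, after) are illustrative assumptions, and the sample line is taken from the lines file shown above.

```shell
# A marked line as produced by the grep | sed step (copied from "lines").
line='regulating its own elections to the Union #itself#. It is not'

# cut -c: characters at position 2, positions 5 through 8, and 14 to the end.
chars=$(printf '%s\n' 'abcdefghijklmnopqrst' | cut -c 2,5-8,14-)

# cut -f with '#' as delimiter: field 1 is the left context,
# field 2 the key word, field 3 the right context.
before=$(printf '%s\n' "$line" | cut -d'#' -f 1)
keyword=$(printf '%s\n' "$line" | cut -d'#' -f 2)
after=$(printf '%s\n' "$line" | cut -d'#' -f 3)

echo "$chars"
echo "[$before] [$keyword] [$after]"
```

Note that the left context keeps its trailing space, because cut only removes the delimiter itself.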

3. Make the context long enough by using sed to add e.g. 35 spaces to the beginning / end of the line. Unfortunately, sed does not 'understand' a
repeat count in the replacement, so the spaces have to be typed out literally:

sed -e 's/^/                                   /' < before > before2
sed -e 's/$/                                   /' < after > after2

4. Shorten every line to e.g. 35 characters. Do you see why the before file is reversed for this?

cut -c 1-35 < after2 > after3
rev before2 | cut -c 1-35 | rev > before3
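The effect of padding and truncating can be checked on single strings. The sketch below uses a width of 10 instead of 35 to keep the output readable; the width, the sample strings and the variable names are assumptions made for illustration. Generating the padding with printf is also a shortcut that the commands above do not use.

```shell
# A string of 10 spaces (printf pads the empty string to width 10).
pad=$(printf '%10s' '')

left='to the Union '
right='. It is not'

# Right context: pad at the end, then keep the FIRST 10 characters.
right_fixed=$(printf '%s\n' "$right" | sed -e "s/\$/$pad/" | cut -c 1-10)

# Left context: pad at the start, then keep the LAST 10 characters.
# cut can only count from the start of a line, so we reverse the line,
# cut the first 10 characters, and reverse the result again.
left_fixed=$(printf '%s\n' "$left" | sed -e "s/^/$pad/" | rev | cut -c 1-10 | rev)

echo "[$left_fixed] [$right_fixed]"
```

Both results are exactly 10 characters wide, and the left context stays flush against the key word.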

5. The result is obtained with:

paste before3 itself after3

The whole set of commands together, as you would type them at the terminal:

       grep -i itself ~vannoord/Federalist/fed60.txt | sed -e 's/[Ii][Tt][Ss][Ee][Ll][Ff]/#&#/'>lines
       cut -d# -f 1 < lines > before
       cut -d# -f 2 < lines > itself
       cut -d# -f 3 < lines > after
       sed -e 's/^/                                   /' < before > before2
       sed -e 's/$/                                   /' < after > after2
       cut -c 1-35 < after2 > after3
       rev before2 | cut -c 1-35 | rev > before3
       paste before3 itself after3
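If you have no access to the Federalist files, the same pipeline can be tried on a small input. The following is a sketch under assumptions: sample.txt and its three lines are made up for the occasion, the work is done in a scratch directory created with mktemp, the padding is generated with printf, and the intermediate files before2 and after2 are replaced by pipes.

```shell
# Work in a scratch directory so that the intermediate files stay together.
dir=$(mktemp -d)
cd "$dir" || exit 1

# Hypothetical sample input standing in for fed60.txt.
cat > sample.txt <<'EOF'
regulating its own elections to the Union itself. It is not
a line without the key word
Itself could desire. And thirdly, that men accustomed to
EOF

pad=$(printf '%35s' '')   # a string of 35 spaces

# Step 1: select the matching lines and mark the key word with #.
grep -i itself sample.txt | sed -e 's/[Ii][Tt][Ss][Ee][Ll][Ff]/#&#/' > lines

# Steps 2-4: split on #, pad each context, and trim it to 35 characters.
cut -d'#' -f 1 < lines | sed -e "s/^/$pad/" | rev | cut -c 1-35 | rev > before3
cut -d'#' -f 2 < lines > itself
cut -d'#' -f 3 < lines | sed -e "s/\$/$pad/" | cut -c 1-35 > after3

# Step 5: put the three columns next to each other (separated by tabs).
paste before3 itself after3 > kwic.txt
cat kwic.txt
```

The scratch directory can be removed afterwards with rm -r.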

Limitations

only the first match of the search string on a line is given
line numbers are missing (can easily be added, see homework)
the context is counted in characters instead of words (see homework)
only context from the same line is shown, not whole sentences

3. Zipf's law

It is very interesting to draw the following diagram: after having put our words in decreasing order of frequency, we don't look at the words themselves, just at the frequency f as a function of the rank r. So f(1) is the frequency of the most frequent word, f(2) is the frequency of the second most frequent word, ... f(100) is the frequency of the 100th most frequent word, etc.

This f(r) function has interesting properties. In most cases it has the same shape, independently of the language, style or content of the given text. Apart from the first few values (the smallest ranks), it is fitted very well by a hyperbola, i.e. the function y = c / x, where c is some constant. To be more precise, a good approximation is the following power law:

f = c / r^a

where a is a little bit smaller than 1 (approx. 0.7 - 0.8). (Remark: ^ means here "to the power of...".)
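A rank-frequency list of this kind can be sketched with the utilities used elsewhere in these notes. The sample sentence is an assumption (far too short to show the law itself), and the final awk step, which is not used elsewhere in these notes, only adds the rank in front of each frequency.

```shell
# Hypothetical sample text; use a real file for a meaningful curve.
text='the cat and the dog and the bird'

# One word per line, lower case, count each word type,
# sort by decreasing frequency, then print rank r and frequency f(r).
rankfreq=$(printf '%s\n' "$text" |
  tr -cs 'A-Za-z' '\012' |
  tr 'A-Z' 'a-z' |
  sort | uniq -c |
  sort -rn |
  awk '{ print NR, $1 }')

echo "$rankfreq"
```

Taking logarithms of the power law gives log f = log c - a log r, so when these pairs are plotted on doubly logarithmic axes, a text that follows the law yields points close to a straight line with slope -a.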

The fact that the Zipf function shows such power-law behaviour, independently of the kind of text, is very surprising. It has generated a substantial literature since the 1930s and 1940s, when the law was discovered (without the use of computers!). For more information consult me (i.e. Tamas).

You can find here a nice program in Perl, by Henny Klein, that does the Zipf analysis of a text.

4. A remark: 'echo' and 'cat', 'expr' and 'bc'

What is the difference between echo and cat?

echo sends to the standard output (or redirected standard output) its arguments, separated by one space
cat sends to the standard output (or redirected standard output) the content of the file(s) given as its argument(s), or (if no arguments are given) the standard input (or the redirected standard input).
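The difference is easy to see when both commands get the same file name as argument. A small sketch; the scratch file and the variable names are assumptions made for the demonstration.

```shell
# Hypothetical scratch file, created only for this demonstration.
tmp=$(mktemp)
printf 'first line\nsecond line\n' > "$tmp"

# echo prints its ARGUMENTS: the shell splits them on whitespace,
# and echo joins them again with single spaces.
from_echo=$(echo one     two   three)

# Given a file name, echo prints the name, cat prints the content.
name_only=$(echo "$tmp")
content=$(cat "$tmp")

# Without arguments, cat copies its standard input.
from_stdin=$(echo 'via the pipe' | cat)

echo "$from_echo"
echo "$content"
echo "$from_stdin"
```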

You can find the same dichotomy among the commands dealing with mathematical expressions:

expr outputs the value of the expression given as its arguments.
bc outputs the value of the expression given in a file (mentioned as its argument) or given in its (maybe redirected) standard input.

Examples for bc are given in the lecture notes of week 3. Examples for expr:

expr 3 + 4
7
expr 3+4
3+4
expr \( 3 + 4 \) \/ 4
1
expr 2 * 3
expr: syntax error
expr '-2' \* 3
-6
expr 13 \% 3
1
expr 8 = 8
1
expr 15 = 2
0
expr \( 8 = 8 \) \& \( 3 = 3 \)
1
expr '(' 8 = 8 ')' '|' '(' 3 = 4 + 5 ')'
1

Remarks: The numbers, parentheses and arithmetic symbols are separate arguments, so you must separate them by spaces (if you don't, see the second example). Some of the arithmetic symbols are shell metacharacters, so they must be protected with quotes or the escape character ('\') (what is the reason for the error message in the fourth example?). Division is integer division, and % gives the remainder of the division. The last four examples show how logical statements are evaluated: 0 stands for the logical value FALSE, while 1 stands for the logical value TRUE. The '&' symbol means AND, '|' means OR. Check man expr for further possibilities (e.g. what happens if you use these logical operations between numerals, and not between statements?).

The expr command, combined with back quotes (which the shell replaces with the output of the command line within the quotes), gives us an easier way to calculate the type-token ratio or word frequencies. How, for instance, do we calculate the frequency of the word "the" in a given file?

The number of occurrences of "the" is given as the output of the following command line: tr ' ' '\012' < file | tr -d ".,;:" | grep '^the$' | wc -w

Remark: if you wrote just 'grep the', you would also match words like "therefore". The second tr deletes punctuation characters that may follow our word without being separated from it by a space; without it, tokens such as "the," would not match '^the$', and grep would filter them out.

The number of words occurring in the text is given by: wc -w < file.

Remark: if you wrote just wc -w file, the file name would also appear in the output, and this would lead to a syntax error in the last step. (Try it out! It took me pretty long to find out what the problem was...)

Since expr performs integer division, we multiply the count by 10,000 before dividing, so that the result is expressed in units of 0.01%.
As we need the input twice, we first write the standard input to a temporary file. So the command line will be:

cat > file; expr ` tr ' ' '\012' < file | tr -d ".,;:" | grep '^the$' | wc -w ` \* 10000 \/ ` wc -w < file `
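The same calculation can be written out step by step, which makes it easier to see where each number comes from. This is a sketch under assumptions: the sample sentence, the scratch file and the variable names the_count, total and freq are made up for illustration.

```shell
# Hypothetical sample text standing in for a real input file.
file=$(mktemp)
echo 'The cat saw the dog; the dog saw the cat.' > "$file"

# Tokens of "the" (lower case only, as in the command line above).
the_count=`tr ' ' '\012' < "$file" | tr -d '.,;:' | grep '^the$' | wc -w`

# Total number of words; reading via < keeps the file name out of the output.
total=`wc -w < "$file"`

# Relative frequency in units of 0.01%: multiply first, then divide,
# because expr only does integer division.
freq=`expr $the_count \* 10000 / $total`
echo "$freq"
```

Here "the" occurs 3 times among 10 words, so the result is 3000, i.e. 30%. The capitalised "The" is not counted, because grep matches lower case only.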