Tekstmanipulatie, week 13

1. sed

From the online manual:

 Sed is a stream editor. A stream editor is used to perform basic text transformations on an input stream (a file or input from a pipeline). While in some ways similar to an editor which permits scripted edits (such as ed), sed works by making only one pass over the input(s), and is consequently more efficient. But it is sed's ability to filter text in a pipeline which particularly distinguishes it from other types of editors.

In other words:

The 'sed' command is like a sewing machine that goes through a file (a text) and performs some basic operations on it. You can define the operations to be performed in two ways: directly on the command line, or in a script file given with the -f option.

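When running, 'sed' collects all commands appearing either on the command line or in the specified script files. It then reads the input line by line and applies all of these commands, in the order given, to each line before moving on to the next line. It does not run the first command over the whole file and only then the second one. Therefore, if one operation must be finished on the whole input before the next one starts, and the result is not what you wanted, you do better to pipe several 'sed' commands into each other.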

What are the basic operations you can use?

sed s/regex/newstring/ file1 > file2
This will replace, on each line, the first string that matches the given regular expression (for what can be used in regex, see the grep command) with newstring. The input file is file1, and the output file is file2.
sed /expr1/s/regex/newstring/
This does the same, but only on the lines that contain something matching expr1.
sed /expr1/s/regex/newstring/g
will replace all instances of regex on those lines.
sed /expr1/s/regex/newstring/15
will replace only the 15th instance on each of those lines. (Any number can be given.)  E.g.:
echo aaaaaaaaaaaaaaaaaaaaaaaaa | sed -e s/a/b/15
sed /regex/d
This will delete all lines that contain a match for regex. If you want to delete only the matching string itself, you can do it by replacing it with nothing: s/regex//.
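
As a quick illustration of the operations above, here is a small sketch on made-up input. The helper name sample and the words in it are assumptions chosen only for the example. Each sed command is put in single quotes so that the shell passes it on unchanged.

```shell
# Hypothetical sample input: two lines, each containing "cat" more than once.
sample() { printf 'cat cat cat\ndog cat cat\n'; }

sample | sed -e 's/cat/bird/'        # first match on every line
sample | sed -e 's/cat/bird/g'       # all matches on every line
sample | sed -e 's/cat/bird/2'       # only the 2nd match on every line
sample | sed -e '/dog/s/cat/bird/'   # first match, only on lines containing "dog"
sample | sed -e '/dog/d'             # drop the lines containing "dog"
```

The third command, for instance, prints "cat bird cat" and "dog cat bird": the count is taken per line, not over the whole input.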

If you want to put more than one operation on the command line (or you have a script file and at least one further operation on the command line), use the -e option before each operation on the command line:

E.g. sed -e /Henry/d -e /Sally/s/Smith/White/ people.old > people
This will read the file called people.old, drop all lines containing 'Henry', and change 'Smith' to 'White' in all lines containing 'Sally'. The result goes to the file called 'people'; people.old itself is left unchanged.
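
The following sketch shows when several -e commands in one sed call differ from a pipeline of sed calls. The three input lines are made up for the example. In a single call, every command sees the original line numbers; in a pipeline, the second sed numbers the lines it receives afresh.

```shell
# One sed, two commands: line 1 is deleted, and "1s" still refers to the
# original first line, so "two" is left alone.
printf 'one\ntwo\nthree\n' | sed -e '1d' -e '1s/two/TWO/'

# Two seds in a pipeline: the second sed sees "two" as its own line 1.
printf 'one\ntwo\nthree\n' | sed -e '1d' | sed -e '1s/two/TWO/'
```

The first command prints "two" and "three", the second one "TWO" and "three".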
The & character has a special meaning: it stands for the string that has been matched. E.g.:
echo "ik ben jan" | sed -e 's/j[a-z]*n/piet&piet/'
ik ben pietjanpiet

A script consists of commands, each of them on a new line (or separated by a semicolon). An example:

cat > changes
/Henry/d
/Sally/s/Smith/White/
$a\
   James Walker 112
^D
sed -f changes people.old > people.new
This script deletes all lines containing 'Henry', changes 'Smith' to 'White' in lines containing 'Sally', and adds the line 'James Walker 112' to the end of the file. The a\ command stands for "add" or "append". The $ symbol is an address that stands for the last line of the input. (The ^D means to press CTRL-D to end the file.)
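
The same idea can be tried out non-interactively. The sketch below writes the script file with a "here document" instead of typing it and pressing CTRL-D. The contents of people.old and the use of a temporary directory are assumptions made up for the example.

```shell
dir=$(mktemp -d)

# Made-up input file.
cat > "$dir/people.old" <<'EOF'
Henry Smith 101
Sally Smith 105
Peter Jones 108
EOF

# The sed script; the quoted 'EOF' keeps the shell from touching $ and \.
cat > "$dir/changes" <<'EOF'
/Henry/d
/Sally/s/Smith/White/
$a\
James Walker 112
EOF

sed -f "$dir/changes" "$dir/people.old"
rm -r "$dir"
```

With this input, the output consists of the lines "Sally White 105", "Peter Jones 108" and "James Walker 112".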

If you want to know more about sed (cycles, buffers; suppressing the output with the -n option, so that only the lines selected by a p command are printed, etc.), consult any UNIX manual.
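
A minimal sketch of the -n option together with the p command, on made-up input: with -n, sed prints nothing by itself, so only the lines on which p is executed appear in the output.

```shell
# Print only the lines containing "Henry" (this behaves like grep Henry).
printf 'Henry 1\nSally 2\nHenry 3\n' | sed -n -e '/Henry/p'

# Print only the lines on which a substitution was actually made.
printf 'Henry 1\nSally 2\nHenry 3\n' | sed -n -e 's/Sally/Sarah/p'
```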

2. cut, paste


A large number of the Unix commands are so-called "filters". These are small programs that read the standard input (or the redirected standard input; in some cases the name of the input file can be given as an argument, too), do something with it, and write the result to the standard output (or to the redirected standard output).

Combining filters, i.e. building a series of filters, is possible with pipelines ('|').

The simplest filter is cat: it just copies the input (which can also be the concatenation of several files given as arguments) to the output.

Further filters were (cf. 3rd week): rev, sort, tr, uniq, wc, head, tail, as well as (cf. 4th week) grep. The previously introduced 'sed' command can also be seen as a filter.

Further filters are: colrm (removing columns from a file), crypt (coding and decoding data; theoretically this command shouldn't be available in systems outside the US -- for federal security reasons...), look (displaying lines beginning with a given string), spell (spell checker, not available on all systems). If you are interested, check the online manual to get more information about them.

Now we are going to deal with two further commands: cut and paste.

In Unix, "everything is a file", and "every file is a text". This means that each file is considered to be a series of lines, each ending with an end-of-line character, and each line is a series of characters (printable or non-printable ones). (If our file does not contain any end-of-line character, because it is not meant to be a text, then it is seen as a text that is one line long.) Consequently, "columns" can be defined under Unix: the nth column is composed of the nth character of each line. (We have already encountered this with the +n option of the sort command, where n is a number.)

Suppose we have a file called grades, containing student number, the name of the student and the final grade:

0513678 John   8
0612942 Kathy  7
0418365 Pieter 6
0539482 Judith 9
Suppose you want to post this list on a notice board, but without the names (just student number and grade). Therefore you want to remove columns 9 to 15. (The first character of a line is in column no. 1.) There are several ways to do that (the lpr command sends its input to the printer):
colrm 9 15 < grades | lpr
cut -c1-8,16-18 grades | lpr
Of course, you could also redirect the output to another file (> grades_without_names), and then print it (lpr grades_without_names).

The moral is:

colrm [startcol [endcol]]  :  will remove the columns starting from no. startcol (up to and including no. endcol, if specified): these are two separate arguments of the command. The input file should be given as a redirected standard input. The output lacks the specified columns.

cut -cLIST file    :  will output only the specified columns. The input file is given as an argument, and the list of columns to be kept is given directly after the option -c. The LIST of columns should meet the following criteria: numbers are separated by commas (but no spaces! because a space would mean: 'end of the argument'), and a series of neighbouring columns can be given using a minus sign. E.g. ' 3,5-8,11-20,28 ' would mean column no. 3, as well as columns 5 up to and including 8, columns 11 to 20, and column 28.
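
A small sketch of cut -c on the grades list from above. The helper function grades is an assumption introduced here so that the example does not need a file; it simply prints the four lines shown earlier.

```shell
# Prints the example list; stands in for the file called grades.
grades() {
    printf '%s\n' '0513678 John   8' '0612942 Kathy  7' \
                  '0418365 Pieter 6' '0539482 Judith 9'
}

grades | cut -c1-8,16-18   # student number and grade, names removed
grades | cut -c9-14        # only the names
grades | cut -c1-7         # only the student numbers
```

The first command prints "0513678 8" for the first student. Where colrm is installed, grades | colrm 9 15 should give the same result.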

The paste command can be used for many purposes. Its basic use is merging columns. If you imagine the cat command as linking your files "vertically", then the paste command does the same "horizontally" (thanks to the philosophy of seeing your files as being composed of lines). The syntax of this command is:
paste [-d char] file [file...]
The files to be combined are listed as arguments. The optional -d option (remember: [..] stands for optionality) defines the delimiter placed between the columns. By default it is the TAB character ('\t', which jumps to the next tab stop: column 9, 17, 25, 33, etc.), but you can set it (using the -d option) to be just a space (or '&' for TeX, etc.).
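
A minimal sketch of the -d option, using two small made-up files in a temporary directory (the file contents and the directory are assumptions for the example only):

```shell
dir=$(mktemp -d)
printf 'Jane\nJack\n'   > "$dir/names"
printf '23/11\n05/09\n' > "$dir/birthdays"

paste        "$dir/names" "$dir/birthdays"   # columns separated by a TAB
paste -d ' ' "$dir/names" "$dir/birthdays"   # columns separated by one space
paste -d '&' "$dir/names" "$dir/birthdays"   # columns separated by &, e.g. for TeX tables

rm -r "$dir"
```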

Suppose you have a file containing 5 names (names), another file containing 5 birthdays (birthdays), and a third one containing 5 addresses (addresses). The following command will create a file containing the combined information:

paste names birthdays addresses > info
cat info
Jane    23/11    9722EK Groningen....
Jack    05/09    9718UW Groningen....
Now how can you change the order of the columns of a given file? By combining cut and paste (this assumes that the columns of info start at fixed character positions):
cut -c1-8 info > naam; cut -c9-14 info > jaarig; cut -c15-40 info > woning; paste naam woning jaarig > new_info
(Remark: the semicolon is used as a delimiter between commands. You could put them on separate lines, too. When they are on separate lines, the shell will deal with them separately: pre-processing and executing the first line, then pre-processing and executing the second line, etc. When they are on one line, the shell will pre-process them together, and then execute them one by one.)
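
Since paste separates its columns with a single TAB character, a variant that does not depend on character positions is to cut by field (-f) instead of by character (-c). The sketch below is only an illustration: the function name reorder_info and the temporary directory are assumptions, and the input is expected to have exactly three TAB-separated fields.

```shell
# reorder_info FILE: print the fields of FILE in the order 1, 3, 2.
reorder_info() (
    dir=$(mktemp -d) || exit 1
    cut -f 1 "$1" > "$dir/naam"      # name
    cut -f 2 "$1" > "$dir/jarig"     # birthday
    cut -f 3 "$1" > "$dir/woning"    # address
    paste "$dir/naam" "$dir/woning" "$dir/jarig"
    rm -r "$dir"
)

info=$(mktemp)
printf 'Jane\t23/11\t9722EK Groningen\nJack\t05/09\t9718UW Groningen\n' > "$info"
reorder_info "$info"
rm -f "$info"
```

Note that cut alone cannot do this: cut -f 1,3,2 prints the selected fields in the order in which they occur in the input, not in the order in which they are listed.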

3. N-grams


The topic of this course gives us an excellent opportunity to mention a few basic ideas of statistical linguistics. What are the motivations for statistical approaches to linguistics?

First: why deal only with qualitative, and not also with quantitative properties of human languages? For instance: the lyric style of a romantic poem could be explained in many cases by the statistically significant overuse of some 'soft' sounds (e.g. [l], [n]), as opposed to the 'sharp' sounds ([t], [k],...) in revolutionary poems. Furthermore, many modern linguists emphasize that linguistic statements should be checked with quantitative methods. For instance, the paradigm of Noam Chomsky (a sentence is either grammatical or ungrammatical) has slowly been replaced by many with a 'corpus-based paradigm': some types are very frequent (therefore 'more correct'), others are pretty rare ('less or marginally correct'), or absolutely absent ('incorrect'?) in a given corpus (set of texts).

A second motivation for these statistical games is real-life applications. For instance, when writing a document in several languages, you do not want to keep changing the language of your spell checker; you would like your spell checker to recognize the language of each sentence itself. Here is a demo for guessing the language a text has been written in.

A very important application of statistical linguistics is text classification (cf. 6th week). Imagine you are in a big news agency or in an intelligence agency and there are thousands of articles coming in day after day. You want to sort them by language and by topic in order to be able to send them to the appropriate person. Instead of using human beings just to read all these materials and to categorize them, you can automate this task if you are clever enough. This has been a huge topic of research in the last couple of decades.

The basic information you need in statistical linguistics is the frequency of some items. You should distinguish between absolute frequency (number of occurrences) and relative frequency (percentage: absolute frequency / total number of items). Items can be for instance characters (e.g. the percentage of the 'w' character will be much higher in English than in French), words (e.g. the word 'computer' is much more frequent in articles related to information science than in articles related to pre-school education), or sentence types (e.g. embedded sentences are much more likely in written texts produced by university students than in an oral corpus produced by uneducated people).
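
Absolute and relative character frequencies can be computed with the filters we already know. The following is a sketch, not a definitive tool: the function name char_freq is made up, only letters are counted, and plain ASCII text is assumed. Each output line gives a character, its absolute frequency and its relative frequency.

```shell
# char_freq: read text on standard input, print "char count relative" lines.
char_freq() {
    tr -cd '[:alpha:]' |      # keep letters only
    fold -w 1 |               # one character per line
    sort | uniq -c |          # absolute frequency of each character
    awk '{ count[$2] = $1; total += $1 }
         END { for (c in count)
                   printf "%s %d %.2f\n", c, count[c], count[c] / total }' |
    sort
}

printf 'banana\n' | char_freq
```

For the word banana this prints "a 3 0.50", "b 1 0.17" and "n 2 0.33": three of the six characters are an 'a'.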

A further basic idea is the distinction between type and token. You usually have several tokens of the same type: several occurrences of the same character, word, expression, sentence-class, etc. For instance in my previous sentence the type 'several' is represented by two tokens. The type-token ratio is the number of types divided by the number of tokens in a given text. A small child (or somebody learning a new language) uses few words, resulting in a low type-token ratio (e.g. 10 different words in a text of 100 words, i.e. a lot of repetition), while an educated speaker with a rich vocabulary might use a lot of different words (especially in written texts), resulting in a much higher type-token ratio.
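
The type-token ratio of a text can be sketched with tr and awk. The function name type_token_ratio is an assumption; a "word" is taken here to be any run of letters, and upper and lower case are not distinguished. The output is the number of types, the number of tokens, and their ratio.

```shell
# type_token_ratio: read text on standard input, print "types tokens ratio".
type_token_ratio() {
    tr '[:upper:]' '[:lower:]' |     # ignore case
    tr -cs '[:alpha:]' '\n' |        # one word per line
    grep -v '^$' |                   # drop empty lines
    awk '{ tokens++; if (!seen[$0]++) types++ }
         END { if (tokens) printf "%d %d %.2f\n", types, tokens, types / tokens }'
}

printf 'The cat saw the other cat.\n' | type_token_ratio
```

The example sentence has 6 tokens but only 4 types (the, cat, saw, other), so the command prints "4 6 0.67".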

On the other hand, some words (articles, prepositions, pronouns,...) will occur very frequently independently of the richness of one's vocabulary. The frequent use of the words (nouns, verbs,...) characteristic of the topic in question is also inevitable. The distribution of frequent and less frequent words follows a Zipf-function (cf. next week). It is remarkable that similar Zipf-functions can be found in many different fields (DNA, distribution of town size in a given country, etc.; see also topics related to fractals, chaotic behavior, critical phenomena, etc.).

You will often find that you would rather use the frequency of a combination of two or more items than the frequency of a single item. French texts can be very easily recognized by the relatively high frequency of the character pairs (bigrams) 'ou' and 'oi'. The bigram 'sh' is characteristic of English, 'aa' of Dutch, the trigram 'sch' of German and Dutch, etc. You can also look at bigrams, trigrams, etc. (n-grams, in general) of words: e.g. some English speakers use the bigram 'you know' much more often than others.
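
Character bigrams can be counted along the same lines. This is only one possible sketch, and not necessarily the method of the lecture notes mentioned below: the function name char_bigrams is made up, bigrams are counted within words only, and unaccented ASCII letters are assumed. The output lists "count bigram", most frequent first.

```shell
# char_bigrams: read text on standard input, print bigram counts.
char_bigrams() {
    tr '[:upper:]' '[:lower:]' |
    tr -cs '[:alpha:]' '\n' |                      # one word per line
    awk '{ for (i = 1; i < length($0); i++)        # every pair of neighbours
               print substr($0, i, 2) }' |
    sort | uniq -c |
    awk '{ print $1, $2 }' |                       # normalise the spacing
    sort -k1,1nr -k2,2                             # by count, then alphabetically
}

printf 'Ou est la souris? Sous la route.\n' | char_bigrams | head -n 3
```

On this tiny sample the bigram 'ou' comes first with 4 occurrences (ou, souris, sous, route), followed by 'la' and 'so' with 2 each.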

About creating n-gram frequency counts, please do have a look at Henny Klein's lecture-notes from last year.

4. Making a concordance (KWIC)

(From H. Klein's web site.)

       Concordance = a collection of occurrences of words in their context. From early on, concordances of the Bible, for example, were compiled (by hand, of
       course): lists of words and names with the places where they could be found. The modern, electronic version is KWIC (key word in context). In this format,
       a table is produced for a requested key word, showing its left and right context up to a certain depth (for example 35 characters). An example
       for the word itself in Federalist/fed60.txt:

       ts own elections to the Union   itself  . It is not
       e into them, it would display   itself   in a form
                 preference in which   itself   would not be included? Or to
                                       itself   could desire. And thirdly, th

       How do we make such a KWIC table using UNIX utilities? Answer:

           1. use grep to select the relevant lines
           2. use cut to extract the contexts
           3. use sed to make the contexts long enough (padding with spaces)
           4. use cut to make the contexts short enough (with rev where needed)
           5. use paste to put the columns next to each other again

       Here we go. Suppose that, as above, we want to make a KWIC for the key word itself from the file fed60.txt. Unfortunately, sed has no option
       for 'ignore case':

           1. grep -i itself ~vannoord/Federalist/fed60.txt  (continue with a pipe:)
              | sed -e 's/[Ii][Tt][Ss][Ee][Ll][Ff]/#&#/'>lines

              The file lines looks as follows. We have used the special symbol # to mark the key word clearly (for cut below).

               regulating its own elections to the Union #itself#. It is not
               gain admittance into them, it would display #itself# in a form
               preference in which #itself# would not be included? Or to what
               #itself# could desire. And thirdly, that men accustomed to

           2. We create separate files for the contexts and the key word using cut. cut extracts pieces from a line of text. There are various options; the most
              important ones for now:
              cut -c with numbers or ranges of numbers: the characters at those positions, e.g. cut -c 2,5-8,14- extracts from each line the characters at position 2, 5 up to and including 8, and from 14 to the end of the line.
              cut -f with numbers or ranges of numbers extracts the desired fields from the line. By default these are separated by a tab (cf. paste), but you can
              specify your own 'delimiter' with -d, such as a space for word boundaries (-d' '). Below we use the added # as delimiter.

              cut -d# -f 1 < lines > before
              cut -d# -f 2 < lines > itself
              cut -d# -f 3 < lines > after

           3. Make the contexts long enough by using sed to add e.g. 35 spaces to the beginning / end of the line. Unfortunately, sed does not 'understand' a
              request for a specific number of spaces, so they are typed out in full:

              sed -e 's/^/                                   /' < before > before2
              sed -e 's/$/                                   /' < after > after2

           4. Shorten each line to e.g. 35 characters. Do you see why the before file is reversed for this?

              cut -c 1-35  < after2 > after3
              rev before2 | cut -c 1-35 | rev > before3

           5. The result is obtained with:

              paste before3 itself after3

       The whole set of commands, as you would type them at the terminal, put together:

       grep -i itself ~vannoord/Federalist/fed60.txt | sed -e 's/[Ii][Tt][Ss][Ee][Ll][Ff]/#&#/'>lines
       cut -d# -f 1 < lines > before
       cut -d# -f 2 < lines > itself
       cut -d# -f 3 < lines > after
       sed -e 's/^/                                   /' < before > before2
       sed -e 's/$/                                   /' < after > after2
       cut -c 1-35  < after2 > after3
       rev before2 | cut -c 1-35 | rev > before3
       paste before3 itself after3
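
The same steps can be wrapped into one reusable function. This is only a sketch under a few assumptions: the function name kwic, its arguments and the temporary directory are made up here; the pattern is an ordinary grep/sed regular expression that must not contain the characters / or #; and the spaces used for padding are generated with printf instead of being typed out.

```shell
# kwic PATTERN FILE [WIDTH]: print a KWIC table for PATTERN in FILE,
# with WIDTH characters of context on each side (default 35).
kwic() (
    pattern=$1 file=$2 width=${3:-35}
    tmp=$(mktemp -d) || exit 1
    pad=$(printf "%${width}s" '')        # a string of WIDTH spaces

    # Step 1: select the lines and mark the first match with #...#
    grep "$pattern" "$file" | sed -e "s/$pattern/#&#/" > "$tmp/lines"
    # Steps 2-4, left context: pad on the left, keep the last WIDTH characters
    cut -d'#' -f 1 < "$tmp/lines" | sed -e "s/^/$pad/" |
        rev | cut -c "1-$width" | rev > "$tmp/before"
    # Step 2, key word
    cut -d'#' -f 2 < "$tmp/lines" > "$tmp/key"
    # Steps 2-4, right context: pad on the right, keep the first WIDTH characters
    cut -d'#' -f 3 < "$tmp/lines" | sed -e "s/\$/$pad/" |
        cut -c "1-$width" > "$tmp/after"
    # Step 5: put the three columns next to each other
    paste "$tmp/before" "$tmp/key" "$tmp/after"
    rm -r "$tmp"
)

# Usage on two of the example lines from above:
sample=$(mktemp)
printf '%s\n' 'gain admittance into them, it would display itself in a form' \
              'preference in which itself would not be included? Or to what' > "$sample"
kwic '[Ii]tself' "$sample" 20
rm -f "$sample"
```

Writing the pattern as '[Ii]tself' plays the role of the longer [Ii][Tt][Ss][Ee][Ll][Ff] above for the most common spellings; the function body runs in a subshell, so its variables do not leak into the calling shell.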



5. Zipf's law


It is very interesting to draw the following diagram: after having put our words in decreasing order of frequency, we no longer look at the words themselves, just at the frequency f as a function of the rank r. So f(1) is the frequency of the most frequent word, f(2) is the frequency of the second most frequent word, ... f(100) is the frequency of the 100th most frequent word, etc.

This f(r) function has interesting properties. In most cases, it has the same form, independently of the language, style or content of the given text. Apart from the first couple of values (the smallest ranks), this function fits a hyperbola very well, i.e. the y = c / x function, where c is some constant. To be more precise, a good approximation is the following power law:

f = c / r^a
where a is a little bit smaller than 1 (approx. 0.7 - 0.8). (Remark: ^ means here "to the power of...".)
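
The rank-frequency list that such a diagram is drawn from can be produced with the filters from the earlier weeks. The following is a sketch with a made-up function name, rank_frequency; words are taken to be runs of letters, and case is ignored. Each output line contains the rank r, the frequency f(r) and the word itself.

```shell
# rank_frequency: read text on standard input, print "rank frequency word".
rank_frequency() {
    tr '[:upper:]' '[:lower:]' |
    tr -cs '[:alpha:]' '\n' |          # one word per line
    grep -v '^$' |
    sort | uniq -c |                   # frequency of each word
    awk '{ print $1, $2 }' |
    sort -k1,1nr -k2,2 |               # decreasing frequency
    awk '{ print NR, $1, $2 }'         # prepend the rank
}

printf 'the cat and the dog and the bird\n' | rank_frequency
```

For this toy sentence the first two lines are "1 3 the" and "2 2 and". On a real text, plotting the second column against the first on logarithmic axes should give a roughly straight line if the power law holds.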

The fact that the Zipf-function shows such power-law behaviour, and is independent of the kind of text, is very surprising, and it has generated a sizeable literature since the 1930s and 1940s, when the law was discovered (without the use of computers, yet!). For more information consult me (i.e. Tamas).

You can find here a nice program in Perl, by Henny Klein, that does the Zipf analysis of a text.

Bíró Tamás:
English web site
Magyar honlap

Last modified: Thu Jul 3 11:39:17 METDST 2003