From the online manual:
Sed is a stream editor. A stream editor is used
to perform basic text transformations on an input stream (a file or
input from a pipeline). While in some ways similar to
an editor which permits scripted edits (such as ed), sed
works by making only one pass over the input(s), and
is consequently more efficient. But it is sed's ability to filter
text in a pipeline which particularly distinguishes it from
other types of editors.
In other words:
The 'sed' command is like a sewing machine that goes through a file (a text) line by line, and performs some basic operations on it. You can define the operations to be performed in two ways: directly on the command line, or in a separate script file.
What are the basic operations you can use?
sed s/regex/newstring/ file1 > file2
This will replace, on each line, the first string that matches the given regular expression (for what can be used in regex, see the grep command) with newstring. The input file is file1, and the output file is file2. If the expression contains spaces or characters that are special to the shell, put it between single quotes.
sed /expr1/s/regex/newstring/
This will do the same replacement, but only on the lines that contain something matching expr1.
sed /expr1/s/regex/newstring/g
will replace all instances of regex on those lines.
sed /expr1/s/regex/newstring/15
will replace only the 15th instance on a line. (Any number can be given.) E.g.:
echo aaaaaaaaaaaaaaaaaaaaaaaaa | sed -e s/a/b/15
Then:
aaaaaaaaaaaaaabaaaaaaaaaa
sed /regex/d
This will delete all lines that contain a match for regex. If you want to delete only the matching string rather than the whole line, you can do it by replacing it with nothing: s/regex//.
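The variants above can be compared side by side on a small input. This is only a sketch: the sample lines and the variable names are made up for illustration.

```shell
# Hypothetical three-line sample input.
sample='Sally Smith and Tom Smith
Henry Smith
Tom Smith'

# First match on every line.
first=$(printf '%s\n' "$sample" | sed 's/Smith/White/')
# All matches, but only on lines that contain Sally.
addressed=$(printf '%s\n' "$sample" | sed '/Sally/s/Smith/White/g')
# Only the second match on each line.
second=$(printf '%s\n' "$sample" | sed 's/Smith/White/2')
# Delete every line containing Henry.
deleted=$(printf '%s\n' "$sample" | sed '/Henry/d')

printf '%s\n' "$first" --- "$addressed" --- "$second" --- "$deleted"
```

Lines that have no second match (here the Henry and Tom lines) pass through the third command unchanged.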
If you want to put more than one operation on the command line (or you combine a script file with at least one operation on the command line), use the -e option before each operation:
E.g. sed -e /Henry/d -e /Sally/s/Smith/White/ people.old > people
This will delete all lines containing 'Henry' from the file called people.old, and change 'Smith' to 'White' in all lines containing 'Sally'. The result goes to the file called 'people'.
The & character has a special meaning: it stands for the string that has been matched. E.g.:
echo "ik ben jan" | sed -e 's/j[a-z]*n/piet&piet/'
ik ben pietjanpiet
A script consists of commands, each of them on a new line (or separated by a semicolon). An example:
cat > changes
/Sally/s/Smith/White/g
/Henry/d
$a\
James Walker 112
^D
sed -f changes people.old > people.new
This script deletes all lines containing 'Henry', changes 'Smith' to 'White' in lines containing 'Sally', and adds the line 'James Walker 112' to the end of the file. The a\ command stands for "add" or "append". The $ symbol stands for "the last line of the input". (The ^D means: press CTRL-D to end the file.)
If you want to know more about sed (cycles, buffers; suppressing the default output with the -n option, so that only the lines selected by a p command are printed; etc.), consult any UNIX manual.
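As a minimal sketch of the -n option: with -n, sed prints nothing by itself, so only the lines that a p command explicitly prints reach the output. The sample lines below are invented.

```shell
# Hypothetical sample input.
people='Sally Smith 101
Henry Jones 102
Sally Brown 103'

# Without -n every line is printed, and the matching lines are printed twice.
without_n=$(printf '%s\n' "$people" | sed '/Sally/p')
# With -n only the lines selected by p appear, much like grep Sally.
with_n=$(printf '%s\n' "$people" | sed -n '/Sally/p')

printf '%s\n' "$with_n"
```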
A large number of Unix commands are so-called "filters". These are small programs that read the standard input (or the redirected standard input; in some cases the name of the input file can be given as an argument, too), do something with it, and write the result to the standard output (or to the redirected standard output).
Filters can be combined, i.e. chained into a series, with pipelines ('|').
The simplest filter is cat: it just copies the input (which can also be the concatenation of several files given as arguments) to the output.
Further filters were (cf. 3rd week): rev, sort, tr, uniq, wc, head, tail, as well as (cf. 4th week) grep. The previously introduced 'sed' command can also be seen as a filter.
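To see how such filters combine, here is a small sketch of a pipeline built only from the filters listed above. The input (one fruit name per line) is made up.

```shell
# Hypothetical input: one fruit name per line, in mixed case.
fruits='pear
Apple
pear
apple
Plum'

# tr lowercases, sort brings equal lines together, uniq drops the repeats,
# wc -l counts the remaining lines (tr -d removes padding some wc versions add).
distinct=$(printf '%s\n' "$fruits" | tr 'A-Z' 'a-z' | sort | uniq | wc -l | tr -d ' ')
echo "$distinct"
```

Each filter reads the output of the previous one, so no intermediate files are needed.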
Further filters are: colrm (removing columns from a file), crypt (coding and decoding data; theoretically this command shouldn't be available in systems outside the US -- for federal security reasons...), look (displaying lines beginning with a given string), spell (spell checker, not available on all systems). If you are interested, check the online manual to get more information about them.
Now we are going to deal with two further commands: cut and paste.
In Unix, "everything is a file", and "every file is a text". This means that each file is considered to be a series of lines, each ending with an end-of-line character, and a line is a series of characters (printable or non-printable). (If our file does not contain any end-of-line character, because it is not meant to be a text, then it is seen as a text consisting of one long line.) Consequently, "columns" can be defined under Unix: the nth column is composed of the nth character of each line. (We have already encountered this with the +n option of the sort command, where n is a number.)
Suppose we have a file called grades, containing the student number, the name of the student and the final grade, laid out in fixed-width columns:
0513678 John   8
0612942 Kathy  7
0418365 Pieter 6
0539482 Judith 9
Suppose you want to post this list, but without the names (just student no. and grade). Therefore you want to remove the columns 9 to 15. (The first character of a line is in column no. 1.) There are several ways to do that (the lpr command sends its input to the printer):
colrm 9 15 < grades | lpr
cut -c1-8,16-18 grades | lpr
You could also redirect the output to another file (> grades_without_names), and then print it (lpr grades_without_names), of course.
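The cut variant can be tried without a printer. The sketch below rebuilds the grades data under an assumed layout (number in columns 1-7, name in columns 9-15, grade in column 16) and keeps the result in a variable instead of sending it to lpr.

```shell
# Rebuild the grades data with fixed-width columns (assumed layout:
# number in columns 1-7, a space, name padded to columns 9-15, grade in 16).
grades=$(printf '%-7s %-7s%s\n' 0513678 John 8 0612942 Kathy 7 \
                                0418365 Pieter 6 0539482 Judith 9)

# Keep columns 1-8 and 16-18, i.e. drop the name columns 9-15.
without_names=$(printf '%s\n' "$grades" | cut -c1-8,16-18)
printf '%s\n' "$without_names"
```

Asking cut for columns beyond the end of a line (17 and 18 here) does no harm; it simply outputs what is there.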
The moral is:
colrm [startcol [endcol]] : removes the columns from no. startcol on (up to and including no. endcol, if specified); these are two separate arguments of the command. The input file should be given as redirected standard input. The output lacks the specified columns.
cut -cLIST file : outputs only the specified columns. The input file is given as an argument, and the list of columns to be extracted is given after the option -c. The LIST must meet the following criteria: numbers are separated by commas (but no spaces, because a space would mean 'end of the argument'), and a range of neighbouring columns can be given using a minus sign. E.g. '3,5-8,11-20,28' means column no. 3, as well as columns 5 up to and including 8, columns 11 to 20, and column 28.
The paste command can be used for many purposes. Its basic use is merging columns. If you imagine the cat command as joining your files "vertically", then the paste command does the same "horizontally" (thanks to the philosophy of seeing your files as being composed of lines). The syntax of this command is:
paste [-d char] file [file...]
The files to be combined are listed as arguments. The optional -d option (remember: [..] stands for optionality) defines the delimiter between the columns. By default it is the TAB character ('\t', which moves on to the next of the columns 1, 9, 17, 25, 33, etc.), but you can change it (using the -d option) to be just a space (or '&' for TeX, etc.).
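A small sketch of the default delimiter versus -d. The two input files (words and counts) and their contents are hypothetical, and they are created in a scratch directory so that nothing of yours gets overwritten.

```shell
# Work in a scratch directory.
dir=$(mktemp -d)
printf '%s\n' the cat > "$dir/words"
printf '%s\n' 42 7 > "$dir/counts"

# Default delimiter: a TAB between the columns.
tabbed=$(paste "$dir/words" "$dir/counts")
# -d changes the delimiter, here to '&' as one might for a TeX table.
tex=$(paste -d '&' "$dir/words" "$dir/counts")

printf '%s\n' "$tex"
rm -r "$dir"
```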
Suppose you have a file containing 5 names (names), another file containing 5 birthdays (birthdays), and a third one containing 5 addresses (addresses). The following command will create a file containing the combined information:
paste names birthdays addresses > info
cat info
Jane 23/11 9722EK Groningen....
Jack 05/09 9718UW Groningen....
...
Now how can you change the order of the columns of a given file? By combining cut and paste:
cut -c1-8 info > naam; cut -c9-14 info > jaarig; cut -c15-40 info > woning; paste naam woning jaarig > new_info
Note that cutting by character position only works if the columns of info occupy fixed character positions. Since paste separates the columns with TAB characters (and cut -c counts a TAB as a single character), selecting the fields with cut -f1, cut -f2 and cut -f3 is more robust.
(Remark: the semicolon is used as a delimiter between commands. You could put them on separate lines, too. When they are on separate lines, the shell deals with them separately: pre-processing and executing the first line, then pre-processing and executing the second line, etc. When they are on one line, the shell pre-processes them together, and then executes them one by one.)
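Here is a sketch of the same reordering done by field rather than by character position. It assumes the columns of info are separated by TABs (the paste default); the two sample records are shortened versions of the ones shown, and everything happens in a scratch directory.

```shell
dir=$(mktemp -d)
# Shortened, illustrative sample data.
printf '%s\n' Jane Jack > "$dir/names"
printf '%s\n' 23/11 05/09 > "$dir/birthdays"
printf '%s\n' '9722EK Groningen' '9718UW Groningen' > "$dir/addresses"
paste "$dir/names" "$dir/birthdays" "$dir/addresses" > "$dir/info"

# Split on the TAB delimiter: -f selects whole fields, whatever their width.
cut -f1 "$dir/info" > "$dir/naam"
cut -f2 "$dir/info" > "$dir/jaarig"
cut -f3 "$dir/info" > "$dir/woning"

# Glue the columns together again in the new order.
new_info=$(paste "$dir/naam" "$dir/woning" "$dir/jaarig")
printf '%s\n' "$new_info"
rm -r "$dir"
```

Because the delimiter is a TAB, the space inside an address does not split it into two fields.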
The topic of this course gives us an excellent opportunity to mention a few basic ideas of statistical linguistics. What are the motivations for statistical approaches to linguistics?
First: why deal only with qualitative, and not also with quantitative properties of human languages? For instance: the lyric style of a romantic poem could be explained in many cases by the statistically significant overuse of some 'soft' sounds (e.g. [l], [n]), as opposed to the 'sharp' sounds ([t], [k],...) in revolutionary poems. Furthermore, many modern linguists emphasize that linguistic statements should be checked with quantitative methods. For instance, the paradigm of Noam Chomsky (a sentence is either grammatical or ungrammatical) has slowly been replaced by many with a 'corpus-based paradigm': some types are very frequent (therefore 'more correct'), others are pretty rare ('less or marginally correct'), or absolutely absent ('incorrect'?) in a given corpus (set of texts).
A second motivation for these statistical games is real-life applications. For instance, when writing a document in several languages, you don't want to keep changing the language of your spell checker; you would like your spell checker to recognize the language of each sentence. Here is a demo for guessing the language a text has been written in.
A very important application of statistical linguistics is text classification (cf. 6th week). Imagine you work at a big news agency or an intelligence agency, and thousands of articles come in day after day. You want to sort them by language and by topic in order to be able to send them to the appropriate person. Instead of having human beings read all this material and categorize it, you can automate this task if you are clever enough. This has been a huge topic of research in the last couple of decades.
The basic information you need in statistical linguistics is the frequency of some items. You should distinguish between absolute frequency (number of occurrences) and relative frequency (percentage: absolute frequency / total number of items). Items can be, for instance, characters (e.g. the percentage of the 'w' character will be much higher in English than in French), words (e.g. the word 'computer' is much more frequent in articles related to information science than in articles related to pre-school education), or sentence types (e.g. embedded sentences are much more likely in written texts produced by university students than in an oral corpus produced by speakers without higher education).
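Both kinds of frequency can be sketched with the filters we already know. The sample sentence is made up, and awk (not discussed in this course so far) is brought in only because the other filters cannot do the division.

```shell
# Hypothetical sample text.
text='the cat saw the dog and the dog saw the cat'

# One word per line, then count identical lines: absolute frequencies,
# most frequent word first.
absolute=$(printf '%s\n' "$text" | tr -s ' ' '\n' | sort | uniq -c | sort -rn)

# Relative frequency = absolute frequency / total number of items.
total=$(printf '%s\n' "$text" | wc -w | tr -d ' ')
the_count=$(printf '%s\n' "$text" | tr -s ' ' '\n' | grep -c '^the$')
relative=$(awk -v n="$the_count" -v t="$total" 'BEGIN { printf "%.1f", 100 * n / t }')

printf '%s\n' "$absolute"
echo "the: $the_count of $total tokens = $relative%"
```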
A further basic idea is the distinction between type and token. You usually have several tokens of the same type: several occurrences of the same character, word, expression, sentence class, etc. For instance, in my previous sentence the type 'several' is represented by two tokens. The type-token ratio is the number of types divided by the number of tokens in a given text. A small child (or somebody learning a new language) uses few words, resulting in a low type-token ratio (e.g. 10 different words in a text of 100 words, i.e. a lot of repetition), while an educated speaker with a rich vocabulary might use a lot of different words (especially in written texts), resulting in a much higher type-token ratio.
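Counting types and tokens is a matter of counting lines before and after uniq. A minimal sketch on a made-up sentence (awk is again used only for the division):

```shell
# Hypothetical sample text.
text='several tokens of the same type several occurrences of the same word'

# Tokens: every word, one per line.
tokens=$(printf '%s\n' "$text" | tr -s ' ' '\n' | wc -l | tr -d ' ')
# Types: the same list with the repetitions removed.
types=$(printf '%s\n' "$text" | tr -s ' ' '\n' | sort | uniq | wc -l | tr -d ' ')

ratio=$(awk -v ty="$types" -v to="$tokens" 'BEGIN { printf "%.2f", ty / to }')
echo "$types types / $tokens tokens = $ratio"
```

Note that this simple version treats 'Several' and 'several' as different types; lowercasing with tr first would merge them.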
On the other hand, some words (articles, prepositions, pronouns,...) will occur very frequently, independently of the richness of one's vocabulary. The frequent use of the words (nouns, verbs,...) characteristic of the topic in question is also unavoidable. The distribution of frequent and less frequent words follows a Zipf-function (cf. next week). It is remarkable that similar Zipf-functions can be found in many different fields (DNA, the distribution of town sizes in a given country, etc.; see also topics related to fractals, chaotic behavior, critical phenomena, etc.).
You will very often find that you would rather use the frequency of a combination of two or more items than the frequency of one item. French texts can be recognized very easily by the relatively high frequency of the character pairs (bigrams) 'ou' and 'oi'. The bigram 'sh' is characteristic of English, 'aa' of Dutch, the trigram 'sch' of German and Dutch, etc. You can also look at bigrams, trigrams, etc. (n-grams, in general) of words: e.g. some English speakers use the bigram 'you know' much more often than others.
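One way to count word bigrams with the tools of this course is to paste the word list next to a copy of itself shifted by one line. This is only a sketch on a made-up sentence, not necessarily the method of the lecture notes referred to below.

```shell
dir=$(mktemp -d)
# Hypothetical sample text.
text='you know I think you know what you mean'

# One word per line.
printf '%s\n' "$text" | tr -s ' ' '\n' > "$dir/words"
# The same list starting at line 2: line i now holds word i+1.
tail -n +2 "$dir/words" > "$dir/next"

# Pair every word with its successor. The last word has no successor,
# so the final, incomplete pair is dropped with sed '$d'.
bigrams=$(paste -d ' ' "$dir/words" "$dir/next" | sed '$d' |
          sort | uniq -c | sort -rn)
printf '%s\n' "$bigrams"
rm -r "$dir"
```

Adding a third file shifted by two lines would give trigrams in the same way.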
For creating n-gram frequency counts, please have a look at Henny Klein's lecture notes from last year.
(From H. Klein's web site.)
Concordance = a collection of occurrences of words in their context. From early on, concordances of the Bible, for example, were made (by hand, of course): lists of words and names with the places where they could be found. The modern, electronic version: KWIC (key word in context). In this format, a table is produced for a requested key word, showing the left and right context up to a certain depth (for example 35 characters). Example for the word itself in Federalist/fed60.txt:
ts own elections to the Union   itself   . It is not
e into them, it would display   itself   in a form
          preference in which   itself   would not be included? Or to
                                itself   could desire. And thirdly, th
How do we make such a KWIC table with the help of UNIX utilities? Answer:
1. use grep to select the relevant lines
2. use cut to extract the contexts
3. use sed to make the contexts long enough (padding with spaces)
4. use cut to make the contexts short enough (possibly with rev)
5. use paste to put the columns next to each other again
Here we go. Suppose that, as above, we want to make a KWIC for the key word itself from the file fed60.txt. Unfortunately, sed has no 'ignore case' option, so every letter of the key word is written as a character class:
1. Select the lines and mark the key word (the output of grep is piped into sed):
grep -i itself ~vannoord/Federalist/fed60.txt |
sed -e 's/[Ii][Tt][Ss][Ee][Ll][Ff]/#&#/' > lines
The file lines looks as follows. We have used the special symbol # to mark the key word clearly (for cut below).
regulating its own elections to the Union #itself#. It is not
gain admittance into them, it would display #itself# in a form
preference in which #itself# would not be included? Or to what
#itself# could desire. And thirdly, that men accustomed to
2. We make separate files for the contexts and the key word with the help of cut. cut extracts pieces from a line of text. There are all kinds of options; the most important ones for now:
cut -c with numbers or ranges of numbers: the characters at those positions, e.g. cut -c 2,5-8,14- takes from each line the characters at position 2, 5 up to and including 8, and from 14 to the end.
cut -f with numbers or ranges of numbers takes the requested fields from the line. By default these are separated by TABs (cf. paste), but you can specify your own delimiter via -d, such as a space for word boundaries (-d' '). Below we use the added # as delimiter.
cut -d# -f 1 < lines > before
cut -d# -f 2 < lines > itself
cut -d# -f 3 < lines > after
3. Make the contexts long enough by adding, say, 35 spaces to the beginning / end of each line with sed. Unfortunately, sed does not 'understand' a request for a specific number of spaces, so they have to be typed out:
sed -e 's/^/                                   /' < before > before2
sed -e 's/$/                                   /' < after > after2
4. Shorten each line to, say, 35 characters. Do you see why the before file is reversed for this?
cut -c 1-35 < after2 > after3
rev before2 | cut -c 1-35 | rev > before3
5. The result is obtained with:
paste before3 itself after3
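The reversal in step 4 can be checked on a single line. cut can only count from the start of a line, but for the left context we need the characters at the end, next to the key word. A minimal sketch, with a width of 10 instead of 35 to keep it readable:

```shell
# One left context, as it would appear in the file before.
line='gain admittance into them, it would display '

# cut -c 1-10 keeps the FIRST ten characters, far away from the key word ...
head_part=$(printf '%s\n' "$line" | cut -c 1-10)
# ... while rev | cut | rev keeps the LAST ten, the ones next to the key word.
tail_part=$(printf '%s\n' "$line" | rev | cut -c 1-10 | rev)

printf '[%s]\n[%s]\n' "$head_part" "$tail_part"
```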
The whole set of commands, as you would type them at the terminal, together:
grep -i itself ~vannoord/Federalist/fed60.txt |
sed -e 's/[Ii][Tt][Ss][Ee][Ll][Ff]/#&#/' > lines
cut -d# -f 1 < lines > before
cut -d# -f 2 < lines > itself
cut -d# -f 3 < lines > after
sed -e 's/^/                                   /' < before > before2
sed -e 's/$/                                   /' < after > after2
cut -c 1-35 < after2 > after3
rev before2 | cut -c 1-35 | rev > before3
paste before3 itself after3
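If you do not have access to fed60.txt, the same procedure can be tried on a few lines of your own. The sketch below is a variation under stated assumptions, not the version from the lecture notes: the three-line input is a made-up stand-in, the intermediate files live in a scratch directory, and the 35 spaces are produced by the shell's printf (in the variable pad) instead of being typed out.

```shell
dir=$(mktemp -d)
# Hypothetical stand-in for fed60.txt.
cat > "$dir/text" <<'EOF'
regulating its own elections to the Union itself. It is not
nothing relevant on this line
Itself could desire. And thirdly, that men accustomed to
EOF

pad=$(printf '%35s' '')      # a string of 35 spaces

# Step 1: select the lines and mark the key word with #.
grep -i itself "$dir/text" |
    sed -e 's/[Ii][Tt][Ss][Ee][Ll][Ff]/#&#/' > "$dir/lines"
# Steps 2-4 per column: extract, pad to at least 35, trim to exactly 35.
cut -d'#' -f 1 < "$dir/lines" | sed -e "s/^/$pad/" |
    rev | cut -c 1-35 | rev > "$dir/before"
cut -d'#' -f 2 < "$dir/lines" > "$dir/itself"
cut -d'#' -f 3 < "$dir/lines" | sed -e "s/\$/$pad/" |
    cut -c 1-35 > "$dir/after"
# Step 5: put the three columns next to each other.
kwic=$(paste "$dir/before" "$dir/itself" "$dir/after")
printf '%s\n' "$kwic"
rm -r "$dir"
```

Chaining the steps with pipes avoids the numbered intermediate files (before2, before3, ...) but does exactly the same work.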
Limitations
It is very interesting to draw the following diagram: after having put our words in decreasing order of frequency, we no longer look at the words themselves, just at the frequency f as a function of the rank r. So f(1) is the frequency of the most frequent word, f(2) is the frequency of the second most frequent word, ... f(100) is the frequency of the 100th most frequent word, etc.
This f(r) function has interesting properties. In most cases it has the same form, independently of the language, style or content of the given text. Apart from the first couple of values (the smallest ranks), this function fits a hyperbola very well, i.e. the y = c / x function, where c is some constant. To be more precise, a good approximation is the following power law:
f = c / r^a
where a is a little bit smaller than 1 (approx. 0.7 - 0.8). (Remark: ^ means here "to the power of...".)
The fact that the Zipf-function has such a power-law behaviour, and is independent of the kind of text, is very surprising, and has generated a rich literature since the 1930s and 1940s, when the law was discovered (without the use of computers, yet!). For more information consult me (i.e. Tamas).
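The rank-frequency table behind such a diagram can be sketched with the usual pipeline. The toy sentence is made up and far too short to show the law itself; awk is used only to number the lines, so that each output line holds the rank r, the frequency f(r) and the word.

```shell
# Hypothetical toy text; a real analysis needs a much larger corpus.
text='the cat saw the dog and the dog saw the cat and the bird'

# Frequency list in decreasing order, then number the lines.
ranked=$(printf '%s\n' "$text" | tr -s ' ' '\n' | sort | uniq -c | sort -rn |
         awk '{ print NR, $1, $2 }')
printf '%s\n' "$ranked"
```

For a real text, plotting the second column against the first on logarithmic axes should give a roughly straight line if the frequencies follow a power law.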
You can find here a nice program in Perl, by Henny Klein, that does the Zipf analysis of a text.