From the online manual:
Sed is a stream editor. A stream editor is used to perform basic text transformations on an input stream (whatever the shell sends to its standard input, be it the contents of a file or the output of a pipeline). The sed command always makes only one pass over the input, which makes it simpler and more efficient. To pass over the input several times, you need several sed commands.
In other words:
The 'sed' command is like a sewing machine that goes through a file (a text) and performs some basic operations. You can define the operations to be performed in two ways: directly on the command line, or in a script file (see the -e and -f options below).
When you run the 'sed' command, the computer takes all the commands appearing either on the command line or in the specified script files, and decides itself in which order to execute them. Therefore, if you want the operations to be executed in a specific order and the result is not what you wanted, you had better use a pipeline, and make sure that the different commands are executed in the order you want.
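For instance, here is a minimal sketch (made up for this point) that swaps every a and b in its input; the three substitutions only give the right result if they are executed in exactly this order, so each of them gets its own sed command in a pipeline:

echo abba | sed 's/a/@/g' | sed 's/b/a/g' | sed 's/@/b/g'    # assumes @ does not occur in the input

The output is baab: first every a becomes the placeholder @, then every b becomes a, and finally every placeholder becomes b.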
What are the basic operations you can use?
sed s/regex/newstring/ file1 > file2
In each line, this command will replace the first string that matches the given regular expression (for the possibilities that can be used in regex, see the grep command) with newstring. The command

sed /expr1/s/regex/newstring/

does the same, but only in the lines that include something matching expr1. The command

sed /expr1/s/regex/newstring/g

will replace all instances of regex, and not only the first one in each line ('g' stands for global). Further, the command

sed /expr1/s/regex/newstring/15

will replace the 15th instance. (Any number can be given.)
E.g.:

echo aaaaaaaaaaaaaaaaaaaaaaaaa | sed -e s/a/b/15

Then:

aaaaaaaaaaaaaabaaaaaaaaaa
The rule

sed /regex/d

will delete all lines that include regex, yielding a similar result to the -v option of grep.
If you want to delete a given string or a regular expression, you can do it by replacing it with "nothing" (to be more exact: by replacing it with the empty string):

sed s/regex//
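A minimal made-up illustration:

echo "hello world" | sed "s/o//g"

yields hell wrld: every o has been replaced by the empty string, that is, deleted.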
If you want to put more than one operation on the command line (or you combine a script file with at least one operation on the command line), use the -e option before each operation on the command line. E.g.:

sed -e /Henry/d -e /Sally/s/Smith/White/ people.old > people

This will delete all lines containing 'Henry' from the file called people.old, and change 'Smith' to 'White' in all lines containing 'Sally'. The result goes to the file called 'people'.
The & character has a special meaning: it stands for the string that has been matched. E.g.:

echo "ik ben jan" | sed -e 's/j[a-z]*n/piet&piet/'
ik ben pietjanpiet
You will frequently use & when you want to put each character of the input file on a new line. For that, you have to insert a newline character after each character. With the following command line, you can insert the character @ after each character of the text:

sed "s/./&@/g"
This works because . stands for any character whatsoever, similarly to what we have learnt concerning grep. Then, the safest way to insert a newline character after each character is (supposing that the character @ does not occur in the original input):

sed "s/./&@/g" | tr @ '\012'
You can also collect your operations in a script file and run them with the -f option. A script consists of commands, each of them on a new line (or separated with a semicolon). An example:

cat > changes
/Sally/s/Smith/White/g
/Henry/d
$a\
James Walker 112
^D

sed -f changes people.old > people.new

This script changes 'Smith' to 'White' in lines containing 'Sally', deletes all lines containing 'Henry', and adds the line 'James Walker 112' to the end of the file. The a\ command stands for "add" or "append". The $ symbol stands for "last input line". (The ^D means pressing CTRL-D to end the file.)

If you want to know more about sed (cycles, buffers; suppressing the output with the -n option, so that only the lines to which the p command applies are printed, etc.), consult any UNIX manual.
An important note: the s/regex/string/ functionality of sed always replaces the longest string that matches the regular expression. For instance, what is the output of the following command line?

echo baaaaaab | sed "s/a*/@/"

The sed command replaces the first string in the line that matches the regular expression a*. This means zero or more consecutive a's, and already the beginning of the line matches it (namely, zero a's). Therefore, the output will be:

@baaaaaab
However, if we require at least one character a in the regular expression (s/aa*/@/), then the string to be replaced by @ can be either a, or aa, or aaa, etc., yielding one of the following outputs:

b@aaaaab
b@aaaab
b@aaab
b@aab
b@ab
b@b

Which one of these will be the output of the above command line? Here the rule about the longest possible match becomes important: since sed looks for the longest string matching the regular expression, the last one (b@b) will be the output produced.
sed and tr

What is the difference between sed and tr? Which one should you use? The tr command transforms or deletes characters, while the s/regex/string/g functionality of sed replaces each string that matches the regular expression with the second string. With tr, unlike with sed, you cannot change one character into a string of several characters, or a string of several characters into something else. On the other hand, because a single character is also a regular expression, you can (almost) always use sed instead of tr. It seems as if sed were just more general than tr. However, there are a few cases when you may prefer using tr, or simply cannot avoid using it.
First, imagine that you want to change each occurrence of the character a into 1, and simultaneously each occurrence of the character b into 2, and also each c into 3. The following command lines all yield the same result; you may choose the simplest one yourself:

tr abc 123
sed s/a/1/g | sed s/b/2/g | sed s/c/3/g
sed -e s/a/1/g -e s/b/2/g -e s/c/3/g
(The last option works because it does not matter in which order the different rules are applied. If the order of applying the rules is important, you had better use a pipeline.)
Second, if you want to change all occurrences of a, b or c into 1, you can do one of the following (I guess you would again go for tr):

tr abc 111
tr abc 1
sed s/a/1/g | sed s/b/1/g | sed s/c/1/g
sed -e s/a/1/g -e s/b/1/g -e s/c/1/g
sed s/[abc]/1/g
Here, the second command line is a shorthand for the first one: whenever the second parameter of tr is a shorter string than the first parameter, the last character of the second parameter is also used to replace the remaining characters of the first string. Furthermore, the regular expression [abc] in the last command line refers to any one-character-long string composed of the elements of the set {a, b, c}.
Third, with tr you can easily refer to the ASCII (or ISO 8859) code of some special characters, something that is much more complicated in sed.
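For instance, a minimal sketch (the file name file is made up here) that turns every TAB character (ASCII code 9, i.e. octal 011) into a space:

tr '\011' ' ' < file    # '\011' refers to the TAB character by its octal code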
The last important difference is that you can delete newline characters with tr (tr -d '\n' or tr -d '\012').
However, this is not possible in sed, since sed reads its input line by line, and writes its output line by line. Consequently, with sed you can make changes within one line; and you can delete entire lines by using the rule /regex/d, that is, by not allowing them to be printed on the output. However, deleting the newline characters in the file and turning the entire file into one long line contradicts the whole idea behind sed.
A large number of Unix commands are so-called "filters". These are small programs that read the standard input (or the redirected standard input; in some cases the name of the input file can also be given as an argument), do something with it, and write the result to the standard output (or to the redirected standard output). Combining filters, i.e. building a series of filters, is possible with pipelines ('|'). The simplest filter is cat: it just copies the input (which can also be the concatenation of several files given as arguments) to the output.
Further filters were (cf. 3rd
week): rev, sort, tr, uniq, wc, head, tail, as well as (cf. 4th
week) grep. The previously introduced 'sed' command can also be seen
as a filter.
Further filters are: colrm (removing columns from
a file), crypt (coding and decoding data; theoretically this command shouldn't
be available in systems outside the US -- for federal security reasons...),
look (displaying lines beginning with a given string), spell (spell checker,
not available on all systems). If you are interested, check the online
manual to get more information about them.
2. cut, paste

Now we are going to deal with two further commands: cut and paste.
In Unix, "everything is a file", and "every file is a
text". This means that each file is considered to be a series of lines,
each endig with an end-of-line character, and a line being a series of
characters (printable or not printable ones). (If our file does not contain
any end-of-line character, because it is not meant to be a text, then it
is seen to be an one-line-long text.)
Columns

Under Unix, a file can be seen not only as a text document composed of lines, but also as a table. Lines correspond to rows, and each character in a line is a different cell of the table. Each cell is filled with a value between 0 and 255, or, if you want, with one character. The only difference between a file and a regular table is that a file does not necessarily have the same number of characters in each line; therefore, it is like a table which may have a different number of cells in each row. Now, the nth column of a table is composed of the nth cells of each row. Similarly, the nth column of a file is composed of the nth character of each line.
(We have already encountered this idea with the +n option of the sort command, where n is a number.)
The second way to define columns is to define what the delimiter between two columns is: for instance, by default, a tabulator character (TAB, \t) may serve as the delimiter. Now, one cell (called a field in this context) is what occurs between two delimiter characters, and it contains a string of characters. The nth column, then, is defined as the sequence of the nth cells (fields) of each row. (For the sake of simplicity, let us always consider a newline character and an end-of-file character as delimiter characters, too.)
Let us see an example of the first definition of a column, where a cell is one character. Suppose we have a file called grades, containing the student number, the name of the student and the final grade:

0513678 John   8
0612942 Kathy  7
0418365 Pieter 6
0539482 Judith 9

Suppose you want to put up this list, but without the names (just student number and grade). If each character in a line is seen as a separate cell of a table, then you want to keep the first and the last columns of this "table". Therefore, you want to remove columns 9 to 15. (The first character of a line is in column no. 1.) There are several options to do that (the lpr command sends its input to the printer):

colrm 9 15 < grades | lpr
cut -c1-8,16-18 grades | lpr

You could also redirect the output to another file (> grades_without_names), and then print it (lpr grades_without_names), of course.
The moral is:

colrm [startcol [endcol]] : will remove the columns from no. startcol (until no. endcol, if specified); these are two separate arguments of the command. The input file should be given as redirected standard input. The output lacks the specified columns.

cut -cLIST file : will output only the specified columns. The input file is given as an argument, and the list of columns to be cut is given as an argument after the option -c. The LIST of columns should meet the following criteria: numbers are separated by commas (but no space! because a space would mean 'end of the argument'), and a series of neighbouring columns can be given using a minus sign. E.g. '3,5-8,11-20,28' would mean column no. 3, as well as columns 5 to 8, columns 11 to 20, and column 28.

Coming back to our example using the grades of the students, how would the story look if we used the second definition of what a column is (a cell = a field, that is, a string of characters between two delimiter characters)?
Let us use the % character as the delimiter between the fields (cells), and let us introduce this character into the file in the following way:

0513678 % John   % 8
0612942 % Kathy  % 7
0418365 % Pieter % 6
0539482 % Judith % 9
Now, the first column contains the student numbers, the second column the names, and the third one the grades. The -d option of the cut command defines what the delimiter is, and you can refer to a field by giving its number after the option -f. Thus, to cut the first column, that is, the student numbers, use

cut -d % -f 1

(or, cut -d "%" -f 1: quoting is required whenever your delimiter would otherwise be understood as a metacharacter by the shell). The command line cut -d "%" -f 3 > file3 will save the third column of its input into another file.
paste

The paste command can be used for many purposes. Its basic use is merging columns.
Typically, you give several file names as its arguments; these are its input files. What paste does is read the first line of its first input file, followed by the first line of its second input file, etc. Then it concatenates them (writes one after the other), and this merger of the lines is written as the first line of its standard output. Then the same happens to the second lines of the input files, etc.
To be more precise, the standard output of paste is a table. The first cell of the first row is the first line of its first input file, etc. In general, the ith field (cell) in the jth row is the jth line of the ith input file. With the option -d, you can define what you want the delimiter character to be or, more generally, the delimiter string (it can be more than one character, or even the empty string, "", too). In fact, the jth line of the output is the concatenation of the jth line of the first input file, followed by the delimiter character or string, followed by the jth line of the second input file, followed by the delimiter character or string, etc.
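As a small made-up illustration, suppose the file names contains the two lines Jane and Jack, and the file birthdays contains the two lines 23/11 and 05/09. Then

paste -d: names birthdays

produces:

Jane:23/11
Jack:05/09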
The delimiter character, by default, is TAB (\t), which is visualized as if you had a series of spaces (the TAB character jumps to the 1st, 9th, 17th, 25th, 33rd, etc. columns). But, remember, this is not really the case: if you look at the bytes in the file, having three spaces (ASCII code 32) is not the same as having one TAB character (ASCII code 9).
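If you want to check which of the two a file contains, a minimal sketch (the file name info is made up here; od -c shows every byte of its input in a readable form):

od -c info | head    # a TAB byte is displayed as \t, a space as a blank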
Note that if you look at the output of Unix commands like wc or uniq -c,
you will see the same tabular structure: fields are divided by a TAB character.
If you imagine the cat command as linking your files "vertically", then the paste command does the same "horizontally" (thanks to the philosophy of seeing your files as being composed of lines).
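A minimal sketch of the contrast, assuming two small files a and b of a few lines each:

cat a b      # all lines of a, followed by all lines of b ("vertically")
paste a b    # line i of a next to line i of b, separated by a TAB ("horizontally")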
Thus, the syntax of this command is:

paste [-d char] file [file...]

Suppose you have a file containing 5 names (names), another file containing 5 birthdays (birthdays), and a third one containing 5 addresses (addresses). The following command will create a file containing the combined information:

paste names birthdays addresses > info

Now how can you change the order of the columns of a given file? By combining cut and paste:

cat info
Jane    23/11 9722EK Groningen....
Jack    05/09 9718UW Groningen....
...

cut -c1-8 info > naam; cut -c9-14 info > jaarig; cut -c15-40 info > woning; paste naam woning jaarig > new_info

(Remark: the semicolon is used as a delimiter between commands. You could put them on separate lines, too. When they are on separate lines, the shell deals with them separately: pre-processing and executing the first line, then pre-processing and executing the second line, etc. When they are on one line, the shell pre-processes them together, and then executes them one by one.)
3. N-grams

The topic of this course gives us an excellent opportunity to mention a few basic ideas of statistical linguistics. What are the motivations of statistical approaches to linguistics?
First: why deal only with qualitative, and not also with quantitative properties of human languages? For instance: the lyrical style of a romantic poem could in many cases be explained by the statistically significant overuse of some 'soft' sounds (e.g. [l], [n]), as opposed to the 'sharp' sounds ([t], [k], ...) in revolutionary poems. Furthermore, many modern linguists emphasize that linguistic statements should be checked with quantitative methods. For instance, the paradigm of Noam Chomsky (a sentence is either grammatical or agrammatical) has slowly been changed by many into a 'corpus-based paradigm': some types are very frequent (therefore 'more correct'), others are pretty rare ('less or marginally correct') or absolutely absent ('incorrect'?) in a given corpus (a set of texts).
A second motivation for these statistical games is real-life applications. For instance, when writing a document in several languages, you don't want to change the language of your spell checker all the time; instead, you would like your spell checker to recognize the language of your sentence.
Here is a demo
for guessing the language a text has been written in.
A very important application of statistical linguistics is text classification (cf. 6th week). Imagine you are in a big news agency or in an intelligence agency, and there are thousands of articles coming in day after day. You want to sort them by language and by topic in order to be able to send them to the appropriate person. Instead of using human beings just to read all these materials and categorize them, you can automate this task if you are clever enough. This has been a huge topic of research in the last couple of decades.
The basic information you need in statistical linguistics is the frequency of some items. You should distinguish between absolute frequency (the number of occurrences) and relative frequency (a percentage: absolute frequency / total number of items). Items can be, for instance, characters (e.g. the percentage of the character 'w' will be much higher in English than in French), words (e.g. the word 'computer' is much more frequent in articles related to information science than in articles related to pre-school education), or sentence types (e.g. embedded sentences are much more likely in written texts produced by university students than in an oral corpus produced by uneducated people).
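With the trick from the sed section above, a minimal sketch for counting absolute character frequencies (assuming a plain-text file called text in which the character @ does not occur):

sed "s/./&@/g" < text | tr @ '\012' | grep -v '^$' | sort | uniq -c | sort -rn

Each character ends up on its own line (grep -v '^$' throws away the empty lines stemming from the original line ends), identical lines are grouped and counted by sort and uniq -c, and the counts are sorted in decreasing order.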
A further basic idea is the distinction between type and token. You usually have several tokens of the same type: several occurrences of the same character, word, expression, sentence class, etc. For instance, in my previous sentence the type 'several' is represented by two tokens. The type-token ratio is the ratio of the number of types to the number of tokens in a given text. A small child (or somebody learning a new language) uses few words, resulting in a low type-token ratio (e.g. 10 different words in a text of 100 words, i.e. a lot of repetition), while an educated speaker with a rich vocabulary might use a lot of different words (especially in written texts), resulting in an extremely high type-token ratio.
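A minimal sketch for computing the two numbers (assuming a plain-text file called text; the tr step puts every maximal run of letters, that is, every word token, on its own line):

tr -cs 'A-Za-z' '\012' < text | grep -v '^$' | wc -l                  # number of tokens
tr -cs 'A-Za-z' '\012' < text | grep -v '^$' | sort | uniq | wc -l    # number of types

Dividing the second number by the first gives the type-token ratio.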
On the other hand, some words (articles, prepositions, pronouns, ...) will occur very frequently, independently of the richness of one's vocabulary. The frequent use of the words (nouns, verbs, ...) characteristic of the topic in question is also inevitable. The distribution of frequent and less frequent words follows a Zipf function (cf. next week). It is remarkable that similar Zipf functions can be found in many different fields (DNA, the distribution of town sizes in a given country, etc.; see also topics related to fractals, chaotic behaviour, critical phenomena, etc.).
You will very often find that you would rather use the frequency of a combination of two or more items than the frequency of a single item. French texts can be recognized very easily by the relatively high frequency of the character pairs (bigrams) 'ou' and 'oi'. The bigram 'sh' is characteristic of English, 'aa' of Dutch, the trigram 'sch' of German and Dutch, etc. You can also look at bigrams, trigrams, etc. (n-grams, in general) of words: e.g. some English speakers use the bigram 'you know' much more often than others.
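As a minimal sketch (again assuming a plain-text file called text), word bigrams can be counted by pasting the list of words next to a copy of itself shifted by one line:

tr -cs 'A-Za-z' '\012' < text | grep -v '^$' > words
sed 1d words > words2                 # words2 = words without its first line
paste words words2 | sort | uniq -c | sort -rn

The jth line of the paste output is the jth word followed by the (j+1)th word, i.e. one bigram token.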
About creating n-gram frequency counts, please do have
a look at Henny Klein's lecture-notes
from last year.
4. Making a concordance (KWIC)

(From H. Klein's web site.)

A concordance is a collection of occurrences of words in their context. Concordances of, for example, the Bible have been made since ancient times (by hand, of course): lists of words and names with the places where they could be found. The modern, electronic version is KWIC (key word in context). In this format, a table is produced for a requested key word, together with its left and right context up to a certain depth (for example, 35 characters). An example for the word itself in Federalist/fed60.txt:
      e into them, it would display   itself    in a form
      ts own elections to the Union   itself   . It is not
                preference in which   itself    would not be included? Or to
                                      itself    could desire. And thirdly, th
How do we make such a KWIC table with the help of UNIX utilities? Answer:

1. use grep to select the relevant lines
2. use cut to extract the contexts
3. use sed to make the contexts long enough (padding with spaces)
4. use cut to make the contexts short enough (possibly with rev)
5. use paste to put the columns next to each other again
Here we go, then. Imagine that, just as above, we want to make a KWIC for the key word itself from the file fed60.txt.

1. Select the relevant lines with grep, and mark the key word in one go (continuing with a pipe; unfortunately, sed has no option for 'ignore case'):

grep -i itself ~vannoord/Federalist/fed60.txt | sed -e 's/[Ii][Tt][Ss][Ee][Ll][Ff]/#&#/' > lines

The file lines looks as follows. We have used the special symbol # to mark the key word clearly (for cut below):

gain admittance into them, it would display #itself# in a form
regulating its own elections to the Union #itself#. It is not
preference in which #itself# would not be included? Or to what
#itself# could desire. And thirdly, that men accustomed to
2. We make separate files for the contexts and the key word with the help of cut. cut extracts pieces from a line of text. There are all kinds of options; the most important ones at this moment:

cut -c with numbers or ranges of numbers extracts the characters at those positions, e.g. cut -c 2,5-8,14- extracts from each line the characters at position 2, positions 5 up to and including 8, and from position 14 to the end.

cut -f with numbers or ranges of numbers extracts the desired fields from the line. By default these are separated by a TAB (cf. paste), but you can specify your own delimiter via -d, such as a space for word boundaries (-d' '). Below we use the added # as the delimiter:

cut -d# -f 1 < lines > before
cut -d# -f 2 < lines > itself
cut -d# -f 3 < lines > after
3. Make the contexts long enough by adding, say, 35 spaces to the beginning / end of each line with sed. Unfortunately, sed does not 'understand' a specification of a specific number of repetitions, so the spaces have to be typed out (there are 35 spaces between the two slashes):

sed -e 's/^/                                   /' < before > before2
sed -e 's/$/                                   /' < after > after2
4. Shorten each line to, say, 35 characters. Do you see why the before file is reversed for this?

cut -c 1-35 < after2 > after3
rev before2 | cut -c 1-35 | rev > before3
5. The result is obtained with:

paste before3 itself after3

The whole set of commands, as you would type them at the terminal, all together:

grep -i itself ~vannoord/Federalist/fed60.txt | sed -e 's/[Ii][Tt][Ss][Ee][Ll][Ff]/#&#/' > lines
cut -d# -f 1 < lines > before
cut -d# -f 2 < lines > itself
cut -d# -f 3 < lines > after
sed -e 's/^/                                   /' < before > before2
sed -e 's/$/                                   /' < after > after2
cut -c 1-35 < after2 > after3
rev before2 | cut -c 1-35 | rev > before3
paste before3 itself after3
Limitations
Next week, we are going to see how to put this whole set of commands into one "program", a so-called shell script.
5. Zipf's law

It is very interesting to draw the following diagram: after having put our words in decreasing order of frequency, we do not look at the words themselves, just at the frequency f as a function of the rank r. So f(1) is the frequency of the most frequent word, f(2) is the frequency of the second most frequent word, ..., f(100) is the frequency of the 100th most frequent word, etc.
This f(r) function has interesting properties. In most cases, it has the same form, independently of the language, style or content of the given text. Disregarding the first couple of values (the smallest ranks), this function fits a hyperbola very well, i.e. the function y = c / x, where c is some constant. To be more precise, a good approximation is the following power law:

f = c / r^a

where a is a little bit smaller than 1 (approx. 0.7 - 0.8). (Remark: ^ here means "to the power of ...".)
The fact that the Zipf function shows such power-law behaviour, and is independent of the kind of text, is very surprising, and it has generated a nice literature ever since the 1930s and 1940s, when the law was discovered (still without the use of computers!). For more information, consult me (i.e. Tamas).
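As a minimal sketch with the tools from this course (assuming a plain-text file called text), the rank-frequency list itself is easy to produce:

tr -cs 'A-Za-z' '\012' < text | grep -v '^$' | sort | uniq -c | sort -rn | sed 25q

Line r of the output contains f(r) followed by the word of rank r (sed 25q stops after printing the first 25 lines).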
You can find here
a nice program in Perl, by Henny Klein, that does the Zipf analysis of
a text.