Tekstmanipulatie, week 5


1. Emacs and XEmacs

Emacs is a text editor with more options than pico, and it is easier to use than vi. Although one has to get used to it...

But there is a tutorial (CTRL-H T), and by going through it (with a lot of exercises) you will get the practice...

XEmacs is a more modern version of Emacs.

You can also use them for writing and reading emails.

Read the manual for more information, and ask me for a synopsis of the commands.

2. Shell scripts

After having solved a number of assignments, you might want to save some of the solutions, so that you won't need to reinvent them each time you need them. You can save them in a file (that is what you do when sending the solution to Mariette), and just check that file each time before retyping the long chain of commands. But why not let the computer itself read this file and then execute it? To make a long story short: can we write programs using UNIX?

There are two arguments pointing toward this possibility:

Is Unix a programming language? It has been designed as an operating system, but it has so many possibilities that you can even write simple programs using it. What is a program?

All of these are possible within UNIX. We shall come back to some of these later.

At the moment what we want is to put a sequence of commands into a file, and then just run it.

How do you get a sequence of (complex) commands? If you simply want to combine a sequence of commands, pipes, etc., just write them on separate lines, or separate them with a semicolon (;).

For instance:

cat > a_simple_shell_script
echo Now I will list the subdirectories of the directories whose name contains exactly 4 characters.
ls -l ???? | grep ^d
echo Thank you for waiting.
echo What about an alphabetical order of these?
ls -l ???? | grep ^d | sort
echo Here you have it.
(End the input by pressing CTRL-D.) Now we have a file named a_simple_shell_script that contains six lines. What can we do with this? We want to run it. Let's type the file name after the prompt, press enter, and... we get an error message:
bash: a_simple_shell_script: command not found
What is wrong? Let's type './a_simple_shell_script'; on some systems this is the way you run programs that are in your own directory. Did it help? No, you get the same error message, because the machine doesn't know that this file was written in order to be run (and is not just a text file, which can, e.g., be sent to Mariette as the solution of your assignment). What should you do? There are two steps: first, tell the system that the file is executable (chmod u+x a_simple_shell_script); second, run it by giving an explicit path (./a_simple_shell_script).
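As a sketch of the two steps (the printf line simply recreates a script file, so that the commands can be tried on their own):

```shell
# Recreate a one-line script file (normally you would use cat or an editor).
printf 'echo It works\n' > a_simple_shell_script
# Step 1: mark the file as executable for yourself.
chmod u+x a_simple_shell_script
# Step 2: run it by giving an explicit path.
./a_simple_shell_script
# Output: It works
```

The chmod is only needed once; the file stays executable afterwards.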

When you have a file that you want to use pretty often, it might be cumbersome to always give the entire path. Why not make it into a "real" command? There is a system variable (we will speak about these later) that contains a set of paths: when you type the name of a program to be run without giving its exact (absolute or relative) path, the shell will look in the directories listed in this variable. You can add additional paths to this variable by typing:

PATH=$PATH:newpath

The meaning of this is the following: the new value of the variable PATH should be its current value, followed by a colon (separating the different paths within the variable), and then the new path to be added. Suppose it is a directory called shellscripts within your own home directory. You can save typing the exact path of your home directory by referring to another system variable, HOME.
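For instance (assuming, as an example, that your scripts live in a directory called shellscripts under your home directory):

```shell
# $HOME expands to your home directory; 'shellscripts' is a made-up name.
PATH=$PATH:$HOME/shellscripts
export PATH
```

From now on, a script in that directory can be started by typing just its name.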

You might want to add arguments to your shell scripts, similarly to the arguments of the standard Unix commands. The way to do this is by referring to them within your shell script as $1, $2, etc. These refer respectively to the first, second, etc. argument given after the script's name. The arguments are separated by spaces. (Unless the space is neutralized by an escape character or a quote.)

Thus a shell script containing

ls -l $1 | grep $2
will look for the second argument as a regular expression within the long listing of the directory given by the first argument.

$* refers to all arguments, $# gives the number of arguments, while $0 gives the zeroth argument, which is the file's name itself.
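A small demonstration (the script name 'argdemo' is made up):

```shell
# Create a script that reports on its own arguments.
printf 'echo script name: $0\necho number of arguments: $#\necho all arguments: $*\n' > argdemo
chmod u+x argdemo
./argdemo one two three
# Output:
#   script name: ./argdemo
#   number of arguments: 3
#   all arguments: one two three
```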

3. The 'sed' command

From the online manual:

 Sed  is  a stream editor.  A stream editor is used to perform basic text transformations on an input stream (a file or  input from a pipeline).  While in some ways similar to
an editor which permits scripted edits (such as  ed),  sed works  by  making  only one pass over the input(s), and is consequently more efficient.  But it is sed's  ability  to
filter text in a pipeline which particularly distinguishes it from other types of editors.

In other words:

The 'sed' command is like a sewing machine that goes through a file (a text) and performs some basic operations. You can define the operations to be performed in two ways: directly on the command line (with the -e option), or in a script file (with the -f option).

When you run the 'sed' command, all the editing commands (those on the command line and those in the specified script files) are applied, in the given order, to each input line in turn. Each command therefore works on the line as already modified by the previous commands, which is not always what you want; if the result is not what you intended, you had better pipe several 'sed' commands together.
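For instance, two substitutions in a guaranteed order:

```shell
# The second sed only ever sees the output of the first one.
echo paris | sed -e s/paris/london/ | sed -e s/london/rome/
# Output: rome
```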

What are the basic operations you can use?

sed s/regex/newstring/ file1 > file2
This will replace, on each line, the first string that matches the given regular expression (for the possibilities that can be used in regex, see the grep command) by newstring. The input file is file1, and the output file is file2.
sed /expr1/s/regex/newstring/
This will also replace the first match of regex by newstring, but only lines that include something matching expr1 are taken into consideration.
sed /expr1/s/regex/newstring/g
will replace all instances of regex on those lines.
sed /expr1/s/regex/newstring/15
will replace the 15th instance on each such line. (Any number can be given.)  E.g.:
echo aaaaaaaaaaaaaaaaaaaaaaaaa | sed -e s/a/b/15
sed /regex/d
This will delete all lines that include regex. If you want to delete a given string or regular expression itself, you can do it by replacing it with nothing: s/regex//.
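A quick combined illustration of deleting lines and deleting strings:

```shell
# Delete every line containing 'bad', then erase ' noise' from the rest.
printf 'good line with noise\nbad line\n' | sed -e /bad/d -e 's/ noise//'
# Output: good line with
```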

If you want to put more than one operation on the command line (or you have a script file plus at least one operation on the command line), use the -e option before each operation on the command line:

E.g. sed -e /Henry/d -e /Sally/s/Smith/White/ people.old > people
This will delete all lines containing 'Henry' from the file called people.old, and change 'Smith' to 'White' in all lines containing 'Sally'. The result goes to the file called 'people'.
The & character has a special meaning: it stands for the string that has been matched. E.g.:
echo "ik ben jan" | sed -e 's/j[a-z]*n/piet&piet/'
ik ben pietjanpiet

A script consists of commands, each of them on a new line (or separated with a semicolon). An example:

cat > changes
/Henry/d
/Sally/s/Smith/White/
$a\
   James Walker 112
^D
sed -f changes people.old > people.new
This script deletes all lines containing 'Henry', changes 'Smith' to 'White' in lines containing 'Sally', and adds the line '   James Walker 112' to the end of the file. The a\ command stands for "add" or "append". The $ symbol stands for "the last line". (The ^D means to press CTRL-D to end the file.)

If you want to know more about sed (cycles, buffers; suppressing the automatic output with the -n option, so that only the lines to which the p command applies are printed, etc.), consult any UNIX manual.

4. About N-grams

The topic of this course gives us an excellent opportunity to mention a few basic ideas of statistical linguistics. What is the motivation for statistical approaches to linguistics?

First: why deal only with qualitative, and not also with quantitative, properties of human languages? For instance: the lyric style of a romantic poem can in many cases be explained by the statistically significant overuse of some 'soft' sounds (e.g. [l], [n]), as opposed to the 'sharp' sounds ([t], [k],...) in revolutionary poems. Furthermore, many modern linguists emphasize that linguistic statements should be checked with quantitative methods. For instance, the paradigm of Noam Chomsky (a sentence is either grammatical or ungrammatical) has slowly been changed by many into a 'corpus-based paradigm': some types are very frequent (therefore 'more correct'), others are pretty rare ('less or marginally correct'), or absolutely absent ('incorrect'?) in a given corpus (set of texts).

A second motivation for these statistical games is real-life applications. For instance, when writing a document in several languages, you don't always want to change the language of your spell checker; you would like your spell checker to recognize the language of your sentence. Here is a demo for guessing the language a text has been written in.

A very important application of statistical linguistics is text classification (cf. 6th week). Imagine you are in a big news agency or an intelligence agency, and there are thousands of articles coming in day after day. You want to sort them by language and by topic, in order to be able to send them to the appropriate person. Instead of using human beings just to read all these materials and categorize them, you can automate this task if you are clever enough. This has been a huge topic of research in the last couple of decades.

The basic information you need in statistical linguistics is the frequency of some items. You should distinguish between absolute frequency (number of occurrences) and relative frequency (percentage: absolute frequency / total number of items). Items can be, for instance, characters (e.g. the percentage of the 'w' character will be much higher in English than in French), words (e.g. the word 'computer' is much more frequent in articles related to information science than in articles related to pre-school education), or sentence types (e.g. embedded sentences are much more likely in written texts produced by university students than in an oral corpus produced by uneducated people).
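Such frequencies can already be counted with the filters of the previous weeks. A sketch of a word-frequency table (the file sample.txt and its contents are made up):

```shell
printf 'the cat saw the dog and the dog saw the cat\n' > sample.txt
# Put every word on its own line, then count identical lines.
tr -sc 'A-Za-z' '\n' < sample.txt | sort | uniq -c | sort -rn
# The most frequent word ('the', 4 tokens) comes out on top.
```

The first column gives absolute frequencies; dividing by the total number of words gives the relative frequencies.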

A further basic idea is the distinction between type and token. You usually have several tokens of the same type: several occurrences of the same character, word, expression, sentence class, etc. For instance, in my previous sentence the type 'several' is represented by two tokens. The type-token ratio is the ratio of the number of types to the number of tokens in a given text. A small child (or somebody learning a new language) uses few words, resulting in a low type-token ratio (e.g. 10 different words in a text of 100 words, i.e. a lot of repetition), while an educated speaker with a rich vocabulary might use a lot of different words (especially in written texts), resulting in an extremely high type-token ratio.
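Both counts can be sketched in the shell (again with a made-up sample.txt):

```shell
printf 'to be or not to be\n' > sample.txt
# One token per line:
tr -sc 'A-Za-z' '\n' < sample.txt > tokens
# Types are the different words; tokens are all of them.
echo types:  $(sort -u tokens | wc -l)
echo tokens: $(wc -l < tokens)
# Output: types: 4 and tokens: 6, i.e. a type-token ratio of 4/6.
```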

On the other hand, some words (articles, prepositions, pronouns,...) will occur very frequently independently of the richness of one's vocabulary. The frequent use of the words (nouns, verbs,...) characteristic of the topic in question is also inevitable. The distribution of frequent and less frequent words follows a Zipf-function (cf. next week). It is remarkable that similar Zipf-functions turn up in many different fields (DNA, the distribution of town sizes in a given country, etc.; see also topics related to fractals, chaotic behaviour, critical phenomena, etc.).

You will very often notice that you would rather use the frequency of a combination of two or more items than the frequency of a single item. French texts can very easily be recognized by the relatively high frequency of the character pairs (bigrams) 'ou' and 'oi'. The bigram 'sh' is characteristic of English, 'aa' of Dutch, the trigram 'sch' of German and Dutch, etc. You can also look at bigrams, trigrams, etc. (n-grams, in general) of words: e.g. some English speakers use the bigram 'you know' much more often than others.
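Word bigrams can be counted by pairing each word with its successor, for instance with the paste command of section 5 (the file names here are made up):

```shell
printf 'you know what you know\n' > sample.txt
# One word per line:
tr -sc 'A-Za-z' '\n' < sample.txt > w1
# The same list, shifted by one word:
tail -n +2 w1 > w2
# Pair each word with the next one and count the pairs:
paste w1 w2 | sort | uniq -c | sort -rn
# The bigram 'you know' comes out on top, with count 2.
```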

About creating n-gram frequency counts, please do have a look at Henny Klein's lecture-notes from last year.

5. More "filters", dealing with columns: cut and paste

A large number of the Unix commands are so-called "filters". These are small programs that read the standard input (or the redirected standard input; in some cases the name of the input file can be given as an argument, too), do something with it, and write the result to the standard output (or to the redirected standard output).

Combining filters, i.e. building a series of filters is possible with pipe-lines ('|').

The simplest filter is cat: it just copies the input (which can also be the concatenation of several files given as arguments) to the output.

Further filters were (cf. 3rd week): rev, sort, tr, uniq, wc, head, tail, as well as (cf. 4th week) grep. The previously introduced 'sed' command can also be seen as a filter.

Further filters are: colrm (removing columns from a file), crypt (coding and decoding data; theoretically this command shouldn't be available in systems outside the US -- for federal security reasons...), look (displaying lines beginning with a given string), spell (spell checker, not available on all systems). If you are interested, check the online manual to get more information about them.

Now we are going to deal with two further commands: cut and paste.

In Unix, "everything is a file", and "every file is a text". This means that each file is considered to be a series of lines, each ending with an end-of-line character, and a line being a series of characters (printable or not printable ones). (If our file does not contain any end-of-line character, because it is not meant to be a text, then it is seen as a one-line-long text.) Consequently, "columns" can be defined under Unix: the nth column is composed of the nth character of each line. (We have already encountered this with the +n option of the sort command, where n is a number.)

Suppose we have a file called grades, containing student number, the name of the student and the final grade:

0513678 John   8
0612942 Kathy  7
0418365 Pieter 6
0539482 Judith 9
Suppose you want to put up this list, but without the names (just student no. and grade). Therefore you want to remove columns 9 to 15. (The first character of a line is in column no. 1.) There are several ways to do that (the lpr command sends its input to the printer):
colrm 9 15 < grades | lpr
cut -c1-8,16-18 grades | lpr
You could also redirect the output to another file (> grades_without_names), and then print it (lpr grades_without_names), of course.

The moral is:

colrm [startcol [endcol]]  :  will remove the columns from no. startcol (until no. endcol, if specified); these are two separate arguments of the command. The input file should be given as a redirected standard input. The output lacks the specified columns.

cut -cLIST file    :  will output only the specified columns. The input file is given as an argument, and the list of columns to be kept is given as an argument after the option -c. The LIST of columns should meet the following criteria: numbers are separated by commas (but no space! because a space would mean 'end of the argument'), and a range of neighbouring columns can be given using a minus sign. E.g. '3,5-8,11-20,28' would mean column no. 3, as well as columns 5 to (t/m) 8, columns 11 to 20, and column 28.

The paste command can be used for many purposes. Its basic use is merging columns. If you imagine the cat command as linking your files "vertically", then the paste command does the same "horizontally" (thanks to the philosophy of seeing your files as being composed of lines). The syntax of this command is:
paste [-d char] file [file...]
The files to be combined are listed as arguments. The optional (remember: [..] stands for optionality) -d option defines the delimiter between the columns. By default it is the TAB character ('\t': jumping to columns 1, 9, 17, 25, 33, etc.), but you can reset it (using the -d option) to be just a space (or '&' for TeX, etc.).
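For example, with a single space as delimiter (the files are made up):

```shell
printf 'Jane\nJack\n' > names
printf '23/11\n05/09\n' > birthdays
paste -d' ' names birthdays
# Output:
#   Jane 23/11
#   Jack 05/09
```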

Suppose you have a file containing 5 names (names), another file containing 5 birthdays (birthdays), and a third one containing 5 addresses (addresses). The following command will create a file containing the combined information:

paste names birthdays addresses > info
cat info
Jane    23/11    9722EK Groningen....
Jack    05/09    9718UW Groningen....
Now how can you change the order of columns of a given file? Combining cut and paste:
cut -c1-8 info > naam; cut -c9-14 info > jaarig; cut -c15-40 info > woning; paste naam woning jaarig > new_info
(Remark: the semicolon is used as a delimiter between commands. You could put them on separate lines, too. When they are on separate lines, the shell deals with them separately: pre-processing and executing the first line, then pre-processing and executing the second line, etc. When they are on one line, the shell pre-processes them together, and then executes them one by one.)

6. Telnet, ssh, protocols and lynx

Suppose you are home and you want to log in to Hagen. Or you are anywhere else but in the Unix computer room. How can you log in to a remote computer? How can you make your local computer become a terminal of another computer?

Telnet is one solution. It exists on almost any system: DOS, Windows, UNIX, etc. Just run it, and you will get a window in front of you that is a terminal of a remote computer. The only disadvantage is that it is not a graphical interface, so you won't be able to use the most comfortable tools, like clicking with the mouse, etc.

In fact, "telnet" is a protocol. What is a protocol? It is a standard of communication between two different systems that may be very far from each other, and may operate in very different ways. Independently of their inner structure, they can still understand each other, thanks to the standardized protocols.

For what reason would two computers communicate with each other? They may want to exchange information, like emails or files. Therefore one basic protocol is FTP (File Transfer Protocol, see later in this course). Another one is the well-known http (hypertext transfer protocol), which is a more advanced protocol for transferring more specialized files (allowing e.g. links). Telnet is thus the protocol with which you can log in to a remote computer, opening a terminal of the remote machine on your local computer.

There is a newer alternative called 'ssh', which stands for 'secure shell'. It does the same as telnet, i.e. it opens a terminal (a shell) belonging to the remote computer on your local computer. The only difference is that it encrypts the information, so no third party can have access to what travels between you and the remote computer, in either direction. Therefore it is becoming more and more popular, and many system administrators tend to forbid telnet, allowing only ssh. (There is a secure version of http, too; it is called https.)

How would you surf the web without a graphical interface? How can you run your favourite Netscape? Well, this is a big disadvantage of not having a graphical interface, indeed. But, believe it or not, life existed even before graphical interfaces! The prehistoric browser you can still use is called 'lynx'. Use the cursor-up and cursor-down keys to read the page and to move between the links (these are highlighted), use cursor-right to follow a link, and use cursor-left to go back. To open a new URL, either type it on the command line when starting lynx, or -- once it is running -- press 'G'. Other commands appear in the bottom line (including H for help and O for options). To quit, press Q.

It is worth trying once, to make up your mind about lynx: press Q, and you are asked whether you are sure that you want to leave. Press N for 'no', and look at lynx's reaction!

If you design a web site, no matter how wonderful its graphics are, it should also be usable with lynx...