Practicum - week 5


Read the following article for next week:

William B. Cavnar and John M. Trenkle: N-Gram-Based Text Categorization

NEW: available in PDF format


Start XEmacs by typing 'xemacs &'.

Remark: the & symbol means that you want to run the program "in the background", i.e. you get the prompt back immediately while the application keeps running. If you leave out this symbol, you cannot use that terminal (console) while XEmacs is open. (Of course, you could open another console...)
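
The effect of & can be tried out with any command, not only XEmacs. The lines below are a minimal sketch that uses 'sleep' as a stand-in for a long-running program; the variable names bg_pid and bg_status are chosen here for illustration only.

```shell
# Start a command in the background: the shell does not wait for it.
sleep 1 &
bg_pid=$!    # $! holds the process ID of the most recent background job
echo "started background job with PID $bg_pid"

# The shell remains usable here while the job is still running.
echo "the prompt is free for other commands"

# 'wait' blocks until the given background job has finished.
wait "$bg_pid"
bg_status=$?
echo "background job finished with exit status $bg_status"
```

In an interactive shell, 'jobs' lists the programs currently running in the background.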

Go through the XEmacs manual. You can open it by typing ^H T (CTRL-H, then T). Work through it very carefully and practice a lot.

There is no assignment on this for now, but the final test will include some questions about using Emacs or XEmacs (for example, simple commands that you use very often when working in Emacs or XEmacs). You should therefore acquire basic skills in Emacs or XEmacs by working in these editors instead of pine.


In German there are a few characters with accents (diacritical marks), such as ü, ä and ö. When writing an email (or, in ancient times, sending a telex or telegram...), Germans usually use the combinations 'ue', 'ae' and 'oe' instead, respectively, as in Oesterreich, Muenchen, Koeln, Duesseldorf, etc.

Give a command line that transforms a text written in the standard way into a text written in the telegram way. Give another command that does the reverse transformation. (3 points)

Could you do that using only 'tr'? Explain your answer. (1 point)
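
One possible starting point for the two transformations is sketched below with 'sed'. This is only one approach among several, not a model answer. The function names to_telegram and from_telegram are made up for this illustration, and the sketch assumes that the text and the script are both UTF-8 encoded.

```shell
# Standard spelling -> telegram spelling:
# every accented character is replaced by a two-letter combination.
to_telegram() {
    sed -e 's/ä/ae/g' -e 's/ö/oe/g' -e 's/ü/ue/g' \
        -e 's/Ä/Ae/g' -e 's/Ö/Oe/g' -e 's/Ü/Ue/g'
}

# Telegram spelling -> standard spelling.
# Naive: it also rewrites every 'ae', 'oe' or 'ue' that was never an
# accented character in the first place (as in "Quelle" or "Poet").
from_telegram() {
    sed -e 's/ae/ä/g' -e 's/oe/ö/g' -e 's/ue/ü/g' \
        -e 's/Ae/Ä/g' -e 's/Oe/Ö/g' -e 's/Ue/Ü/g'
}

printf 'Österreich, München, Köln, Düsseldorf\n' | to_telegram
printf 'Oesterreich, Muenchen, Koeln, Duesseldorf\n' | from_telegram
```

Note that the reverse direction cannot be fully reliable without a word list, because the telegram spelling no longer says which 'ue' was originally an 'ü'.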




Under /users1/birot/Federalist-html you will find a few of the Federalist Papers in HTML format. Suppose you want to get the text in plain text format, because you don't have access to them. The way to do this is to remove the HTML formatting information. We are lucky, because most of this information is found between '<' and '>' brackets. (Look at these files...)

Create a command line (and maybe a script file) that deletes these. Try to get as close to plain text as possible. (3 points)

(About HTML: see today's lecture notes.)
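
As a rough sketch of the idea of deleting everything between '<' and '>', the function below uses 'sed'. The name strip_tags and the sample line are invented for this illustration, and the sketch has known gaps: a tag that is split over several lines is not removed, and only a handful of character entities are translated.

```shell
# Remove HTML tags and translate a few common character entities.
strip_tags() {
    sed -e 's/<[^>]*>//g' \
        -e 's/&nbsp;/ /g' \
        -e 's/&lt;/</g' \
        -e 's/&gt;/>/g' \
        -e 's/&amp;/\&/g'
}

# Hypothetical sample line, not taken from the actual files.
printf '<p>To the <b>People</b> of the State of New York:</p>\n' | strip_tags
```

The entity '&amp;' is translated last on purpose, so that a literal '&amp;lt;' in the source does not end up as '<'.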


Create a command line that calculates the type-token ratio (cf. today's lecture notes) of its input.

The way to do that:

Hints: check Henny's description of how to make a frequency count (Excursus in the week 3 lecture notes), as well as the extra exercise in the week 3 practicum.

Another hint: you may want to use a temporary file to save information... You can also use the 'tee' command, which creates a T-junction in your pipeline.

(3 points)
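
The two hints (a temporary file and 'tee') can be combined as in the sketch below. It assumes the usual definition of the type-token ratio, namely the number of distinct words (types) divided by the total number of words (tokens); check the lecture notes for the exact definition used in this course. Treating upper and lower case as the same word and counting only runs of letters as words are further assumptions, and the function name type_token_ratio is made up for this illustration.

```shell
# Print the type-token ratio of standard input.
type_token_ratio() {
    tmp=$(mktemp)    # temporary file that will hold one token per line

    # 1. tr -cs: turn every run of non-letters into a single newline.
    # 2. sed:    drop the empty line that leading non-letters can produce.
    # 3. tr:     fold upper case to lower case.
    # 4. tee:    save the token list in $tmp AND pass it on down the pipe.
    # 5. sort -u | wc -l: count the distinct tokens, i.e. the types.
    types=$(tr -cs '[:alpha:]' '\n' | sed '/^$/d' |
            tr '[:upper:]' '[:lower:]' | tee "$tmp" | sort -u | wc -l)

    tokens=$(wc -l < "$tmp")    # every line of the saved list is one token
    rm -f "$tmp"

    awk -v ty="$types" -v to="$tokens" \
        'BEGIN { if (to > 0) printf "%.3f\n", ty / to; else print "0" }'
}

# 5 tokens, 4 types ("the" occurs twice)
printf 'The cat and the dog\n' | type_token_ratio
```

The 'tee' step is what makes a single pass over the input sufficient: the same token list feeds both the type count and, through the temporary file, the token count.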