
Practicum - week 5

Read the following article for next week:

William B. Cavnar and John M. Trenkle: N-Gram-Based Text Categorization

NEW: available in PDF format

Start XEmacs by typing 'xemacs &'.

Remark: the & symbol means that you want to run the program "in the background", i.e. you get the prompt back immediately while the application keeps running. If you don't add this symbol, you can't use that terminal (console) while XEmacs is running. (Of course, you can open another console...)
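The effect of '&' can be tried out without XEmacs. The following is only a small sketch: 'sleep 1' stands in for any long-running program, and the variable name bg_pid is made up for this illustration.

```shell
# Start a command in the background; the shell does not wait for it.
sleep 1 &
bg_pid=$!        # $! holds the process ID of the most recent background job
echo "The prompt is back; the background job has PID $bg_pid"

# 'wait' blocks until the given background job has finished.
wait "$bg_pid"
bg_status=$?
echo "The background job finished with exit status $bg_status"
```

Typed interactively, the first echo appears at once, although 'sleep' is still running.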

Go through the XEmacs tutorial. You can open it by typing ^H T (CTRL-H, then T). Work through it very carefully and practice a lot.

There is no assignment about this now, but there will be some questions about using Emacs or XEmacs in the final test. (The questions will be about simple commands that you use very often when working in Emacs or XEmacs.) So you should acquire basic skills in Emacs or XEmacs by working in these editors instead of pine.

In German there are a few characters with diacritical marks, such as ü, ä and ö. When writing an email (or sending a telex or telegram in ancient times...), German speakers usually use the combinations 'ue', 'ae' and 'oe' instead, as in Oesterreich, Muenchen, Koeln, Duesseldorf, etc.

Give a command line that will transform a text written in the standard way into a text written in the telegram way. Give another command that does the transformation the other way round. (3 points)

Could you do that using only 'tr'? Explain your answer. (1 point)

Tips:

You can refer to a character by its ASCII or ISO code: the command 'tr' accepts a character given by its octal (ASCII, etc.) code if you write a backslash followed by the octal code, and put the whole in single quotation marks. E.g. the ASCII code of the "new line" character is 10, i.e. 012 in octal. Therefore '\012' refers to the new-line character.
Once you can capture these characters using 'tr', try transforming them first into something else (supposing that your text doesn't contain e.g. numerals, or some rare characters), which you can then transform further into what you want.
Try out 'man ascii', and also check its "SEE ALSO" section for further information (using further 'man's)!
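The octal notation from the first tip can be tried out on plain ASCII characters. The lines below are only an illustration with made-up sample text, not a solution to the exercise; the variable names are invented here.

```shell
# Turn every newline (octal 012) into a space (octal 040).
joined=$(printf 'one\ntwo\nthree\n' | tr '\012' '\040')
echo "[$joined]"

# Turn every 'a' (octal 141) into 'A' (octal 101).
upper_a=$(printf 'banana\n' | tr '\141' '\101')
echo "$upper_a"

# Look up the octal code of a character with od ('man ascii' lists the same codes).
octal=$(printf 'A' | od -An -to1 | tr -d ' \n')
echo "octal code of A: $octal"
```

The same 'od' call can be used on a character from your own text to find out which code (or codes) it is stored as.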

Under /users1/birot/Federalist-html you will find a few of the Federalist papers in HTML format. Suppose you want to get the text in a plain text format, because you don't have access to them. The way to do this is to remove the HTML formatting information. We are lucky, because most of this information is to be found between '<' and '>' brackets. (Look at these files...)

Create a command line (and maybe a script file) that will delete these. Try to get a text that is as plain as possible. (3 points)

(About HTML: see today's lecture notes.)

Create a command line that calculates the type-token ratio (cf. today's lecture notes) of its input.

The way to do that:

Understand the problem and define the different parts of the problem to be solved (what information do you need for a type-token ratio?).
First implement the parts separately, testing them one by one.
Then find out how to put them together.

Hints: check Henny's description of how to make a frequency count (Excursus of the 3rd week's lecture notes), as well as the extra exercise in 3rd week's practicum.

Another hint: you may want to use a temporary file to save information... You can also use the 'tee' command, which creates a T-junction in your pipeline.
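To see what such a T-junction does, here is a small sketch on made-up input. It counts something other than the type-token ratio, so it only shows the mechanism; the temporary file is created with 'mktemp' (assumed to be available on your system), and the variable names are invented for this illustration.

```shell
# Create a temporary file to hold the copy that tee branches off.
tmp=$(mktemp)

# tee writes its input to "$tmp" AND passes it on down the pipeline.
lines=$(printf 'a b\nc d e\n' | tee "$tmp" | wc -l | tr -d ' ')

# The saved copy can now be used for a second, independent count.
words=$(wc -w < "$tmp" | tr -d ' ')

echo "lines: $lines, words: $words"
rm -f "$tmp"
```

One stream thus feeds two different counts: one at the end of the pipeline, and one computed from the saved copy.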

(3 points)