Reminder: Please don't forget to read the article for the last class!

Final assignment

The assignment can be written as a shell script, in Perl, or as a combination of both. If you combine them, please attach some information about how to use your package (what to run first, etc.). One idea: I run the shell script, and it launches the Perl program.

Please put as many comments into your programs and scripts as possible, so that they are easy to read and understand. Please use self-explanatory variable names, etc., and write everything in English (always a practical thing to do when you write a program). (Don't worry, the answers in the final test can be given in Dutch.)

Please do not work together on the final assignment. Solutions that are too similar will be rejected immediately, and there might be even more severe consequences.

The task:

Write a program that reads two files and compares them using the similarity measure in the article by William B. Cavnar and John M. Trenkle (N-Gram-Based Text Categorization). The output should be a measure of similarity, the so-called "out-of-place measure". In fact, this is rather a measure of distance: the higher it is, the more the two documents differ. Using this package you can build tables like the ones I will show you based on my research (see also below).

The procedure is described in section 3.2 of the article, and figure 3 helps you to understand it. The input of this procedure consists of the two N-gram profiles, which are described in section 3.1 of the paper.

A remark: when calculating the out-of-place value of an n-gram (the difference between the ranks of that n-gram in the two files), you need to take the absolute value of the difference. (In other words: if the n-gram "_th" has rank 5 in one file and rank 9 in the other, the difference is 4. If it has rank 9 in the first file and rank 5 in the second, the difference should not be -4, but +4 again.) In Perl you can do that with the abs function, e.g.: $difference = abs($rank1{$ngram} - $rank2{$ngram}). An alternative solution (if you happen to write a shell script, for instance): if (rank1 > rank2) then {difference = rank1-rank2} else {difference = rank2-rank1}.
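For those writing a shell script, the if/else alternative in the remark above could look roughly like this. It is only a sketch: the function name abs_diff is made up for illustration, and it assumes that both ranks are plain integers.

```shell
# abs_diff RANK1 RANK2: print the absolute difference of two integer ranks.
# (abs_diff is a hypothetical helper name, not something the assignment prescribes.)
abs_diff() {
    if [ "$1" -gt "$2" ]; then
        echo $(( $1 - $2 ))
    else
        echo $(( $2 - $1 ))
    fi
}

abs_diff 5 9   # prints 4
abs_diff 9 5   # prints 4
```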

If you prefer, you can do the following simplifications:

After having discarded digits and punctuation, you don't need to split the text into separate tokens. Instead, use the way to build n-grams described in assignment 1, week 8.
Use N=2, N=3 and N=4, instead of N=1 to 5, when building the N-gram profile.
Take the 300 most frequent N-grams when building your N-gram profile.
When calculating the "out-of-place measure", the proposed "maximum out-of-place value" is 100.
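To make the simplifications above more concrete, here is a rough shell sketch of the two steps: building a profile and comparing two profiles. It rests on several assumptions that are not part of the task description. The function names make_profile and out_of_place are made up. The text is lowercased and every run of whitespace becomes a single underscore (the n-gram construction of assignment 1, week 8 may differ from this). N-grams with equal frequency are ranked alphabetically. The maximum out-of-place value is read from a hypothetical MAX variable that defaults to 100.

```shell
# make_profile FILE: print the 300 most frequent 2-, 3- and 4-grams of FILE,
# one per line, most frequent first, so the line number is the rank.
make_profile() {
    tr -d '[:digit:][:punct:]' < "$1" |     # discard digits and punctuation
    tr '[:upper:]' '[:lower:]' |            # assumption: ignore case
    tr -s '[:space:]' '_' |                 # assumption: whitespace runs become one "_"
    awk '{
        for (n = 2; n <= 4; n++)
            for (i = 1; i + n - 1 <= length($0); i++)
                count[substr($0, i, n)]++
    }
    END { for (gram in count) print count[gram], gram }' |
    LC_ALL=C sort -t ' ' -k1,1nr -k2,2 |    # by frequency, ties alphabetically
    head -n 300 |
    cut -d ' ' -f 2
}

# out_of_place PROFILE1 PROFILE2: for every n-gram of PROFILE1, add the absolute
# rank difference, or the maximum value if the n-gram is missing from PROFILE2.
out_of_place() {
    awk -v other="$2" -v max="${MAX:-100}" '
        BEGIN { while ((getline gram < other) > 0) rank2[gram] = ++r }
        {
            if ($0 in rank2) {
                d = FNR - rank2[$0]
                total += (d < 0 ? -d : d)
            } else
                total += max
        }
        END { print total + 0 }' "$1"
}
```

A possible way to use it: make_profile text1.txt > p1; make_profile text2.txt > p2; out_of_place p1 p2. The file names here are placeholders.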

When you are done, I suggest (it is not compulsory) that you run this procedure on a set of 6-8 files. Let's say: on 3-4 English and 3-4 Dutch texts, each 5-10K long (e.g. on long e-mails). Calculate the distance of each pair of texts, and put your results into a table like this:

	Eng1	Eng2	Eng3	Dutch1	Dutch2	Dutch3
Eng1	-	55	66	211	321	231
Eng2		-	77	232	199	231
Eng3			-	256	354	342
Dutch1				-	87	78
Dutch2					-	66
Dutch3						-

I would really appreciate it (and give extra points) if you did this experiment and sent me the files you used, as well as your results. Write a very short discussion of your results: could you sort the files according to language, based on this method? If you look at the made-up numbers above, you can see that the English texts have a low distance among themselves, and the same applies to the Dutch texts. But the distance between a Dutch and an English text (the numbers where an English row meets a Dutch column) is significantly higher.
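If you do the experiment, the pairwise comparison can be automated with a small loop. The sketch below is only one way to do it. The function name print_table is made up, and it assumes that your comparison program can be called as a command with two file names and prints a single number.

```shell
# print_table COMMAND FILE...: print a tab-separated, upper-triangular table
# of pairwise distances. COMMAND is called as: COMMAND FILE1 FILE2
# (print_table is a hypothetical helper, not part of the assignment.)
print_table() {
    compare=$1
    shift
    # header row: an empty corner cell, then one column per file
    for col in "$@"; do printf '\t%s' "$col"; done
    printf '\n'
    i=0
    for row in "$@"; do
        i=$(( i + 1 ))
        printf '%s' "$row"
        j=0
        for col in "$@"; do
            j=$(( j + 1 ))
            if [ "$j" -lt "$i" ]; then
                printf '\t'                  # below the diagonal: leave empty
            elif [ "$j" -eq "$i" ]; then
                printf '\t-'                 # a file compared with itself
            else
                printf '\t%s' "$("$compare" "$row" "$col")"
            fi
        done
        printf '\n'
    done
}

# Example call (compare.sh and the file names are placeholders):
# print_table ./compare.sh Eng1.txt Eng2.txt Dutch1.txt Dutch2.txt
```

Each pair is compared only once, which is enough if your distance gives the same value in both directions.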

If you have any questions regarding the final assignment or anything else, please don't be shy, ask me! Write me an email at birot@let.rug.nl!

Notes about the final exam:

1. During the first part (general test) you cannot use anything. It will be on paper, and you will have 45 minutes for it.

2. I will ask you to send me (birot@let.rug.nl) your solution to the above assignment no later than the beginning of the second part. (I'd appreciate it if you sent it to me beforehand.)

3. During the second part (making some changes to your solution of the above assignment) you can use anything except help from other people (including by email, etc.). You will have 45 minutes for this, again. It will be done in two rounds, with a maximum of 6 people working in the computer room at the same time.

4. You have to send me (birot@let.rug.nl) the solution of the final assignment by the end of the exam at the latest.
