Final assignment

Note: The shell scripts that I showed you on Wednesday are here. You can download them and play around with them. Next week, I plan to put them on the web site of the lecture notes as well, under week 15, together with some further explanations. If you have any further questions, please send me an email.

About the final test

The final test is on Tuesday January 25th, at 2 p.m. sharp, in the Unix lab (H12.102).

As I explained in class, the final test consists of two parts:

  1. A written test, on paper. It lasts 45 minutes, and no aids may be used. I recommend that you read through the lecture notes (including the few sections we did not discuss during the course), and go through the assignments once again, together with the corrections you received (or will have received) from us and the solutions to be published on the web site.
  2. A task to be solved on the computer in 45-60 minutes, based on your solution to the final assignment. This means that you have to have your solution with you (either on the machine, or on a diskette). You may use any non-human help: your own books, your own notes, the web,... But you may not exchange information: lending your books or your notes to somebody else, sending emails, etc.

Final assignment

Our aim is to create a system for automatic text categorization. For this, we need to be able to calculate the "distance" or the "similarity" of documents. Once we have calculated the similarity of all document pairs, we can put them into a table and try to find clusters of documents. Alternatively, given a new document, we can determine which older document it is most similar to.

Here is a hypothetical example of what I mean by "clustering" documents based on their similarity. Suppose that the maximum similarity is 1 (a document is maximally similar to itself; therefore, there are 1s on the diagonal), and the minimum is 0. The similarity of document A to B is equal to the similarity of B to A; therefore, the following matrix is symmetric.
    A    B    C    D    E
A   1    0.9  0.1  0.2  0.1
B   0.9  1    0.4  0.3  0.2
C   0.1  0.4  1    0.8  0.2
D   0.2  0.3  0.8  1    0.1
E   0.1  0.2  0.2  0.1  1

It is clear that A and B form a cluster, C and D form a cluster, and E forms a cluster in itself. The similarity of the elements of a cluster is quite high (>= 0.8), whereas the similarity of documents belonging to two different clusters is very low (<= 0.4).

Your task is to write a shell script that measures the similarity of two documents. (You do not have to cluster documents.) The two arguments of the script are the names of the files containing the documents, and the standard output of the script is a single number, namely the similarity measure (or an error message). The similarity measure is defined as follows:

Take the 100 most frequent words in each of the documents. The similarity measure of the two documents is the number of words that occur among the 100 most frequent words of both documents:

Similarity(A,B) := #{ w | w is among the 100 most frequent words of document A and w is among the 100 most frequent words of document B}

[ #{...} means the cardinality, the number in the given set]
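As a rough sketch, this definition can be computed with standard Unix tools. This is only one possible approach, not the required solution: it assumes plain-text input and does only minimal pre-processing (lowercasing and splitting on non-letters), leaving out most of the pre-processing discussed below.

```shell
# Rough sketch of the similarity measure with standard Unix tools.
# Assumes plain-text input; N plays the role of the constant 100.
N=100

top_words() {
    # the $N most frequent words of a file, one per line, sorted alphabetically
    tr 'A-Z' 'a-z' < "$1" | tr -cs 'a-z' '\n' | grep -v '^$' |
        sort | uniq -c | sort -rn | head -n "$N" |
        awk '{print $2}' | sort
}

similarity() {
    top_words "$1" > /tmp/top_a.$$
    top_words "$2" > /tmp/top_b.$$
    # comm -12 keeps only the lines present in both sorted lists;
    # their count is the similarity measure
    comm -12 /tmp/top_a.$$ /tmp/top_b.$$ | wc -l
    rm -f /tmp/top_a.$$ /tmp/top_b.$$
}
```

Calling `similarity doc1.txt doc2.txt` then prints the number of shared top-N words. The alphabetical `sort` at the end of `top_words` is essential, because `comm` requires both inputs to be sorted.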

In addition to that:

  1. Before creating the word frequency lists, you have to pre-process your documents. This pre-processing consists of removing features that are irrelevant to the similarity of the two documents. For example, remove numbers, punctuation marks and HTML codes, change upper-case characters to lower case (or vice versa), etc. The more pre-processing you do, the better your solution is.
  2. Create a list of so-called stop words. These are words that occur frequently in any texts (such as "the", "and", "so", "have", etc.), and you remove these words from your documents before creating the word frequency lists. (In fact, if your goal is text categorization according to language, you do not want to have such a list. Yet, if your goal is text categorization according to topic, such a list will significantly improve your results.) For instance, on the web you may find the frequency ranking of words in a large corpus, and use the topmost words as your stop word list.
  3. If the script receives a third argument "-b", then perform the same procedure on bigrams of words instead of on single words (similarity = number of shared bigrams among the 100 most frequent bigrams of each document).
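To illustrate points 1-3, here is one possible sketch of a pre-processing pipeline and of bigram formation. The file name `stopwords.txt` and the exact expressions are my own assumptions, not requirements; your pipeline can and should do more.

```shell
# Sketch of pre-processing: strip HTML tags, lowercase, keep only letters
# (numbers and punctuation become word separators), drop stop words.
# stopwords.txt is a hypothetical file with one stop word per line.
preprocess() {
    sed 's/<[^>]*>//g' "$1" |           # remove HTML tags
        tr 'A-Z' 'a-z' |                # upper case -> lower case
        tr -cs 'a-z' '\n' |             # one word per line
        grep -v '^$' |                  # drop empty lines
        grep -v -x -F -f stopwords.txt  # drop stop words
}

# Bigrams: pair every word with its successor (for the "-b" option).
bigrams() {
    preprocess "$1" > /tmp/words.$$
    tail -n +2 /tmp/words.$$ > /tmp/next.$$
    # paste pads the shorter file, so drop the final incomplete pair
    paste -d ' ' /tmp/words.$$ /tmp/next.$$ | sed '$d'
    rm -f /tmp/words.$$ /tmp/next.$$
}
```

Feeding the output of `preprocess` (or `bigrams`) into a `sort | uniq -c | sort -rn | head` pipeline then yields the frequency lists.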

Some hints

  1. You may want to use a variable instead of the constant 100, because if your documents are short, you obtain better results by taking only the 10 or 25 most frequent words. Then you can simply change the value of this variable at the beginning of your shell script, instead of rewriting the script.
  2. This similarity measure is symmetric: the similarity of document A to document B is equal to the similarity of document B to document A. Check that your script indeed returns the same result if you reverse the order of its arguments.
  3. After you are sure your script works, you can remove all temporary files it creates. But do not delete them as long as your script is not complete: analyzing these files may help you debug your script.
  4. With Lonneke, we have come up with two very different ways of solving this problem. One makes heavy use of the different tricks presented in the last lecture. The other involves a trick similar to the one used to find palindromes. Nevertheless, you may also find a third way to solve this assignment. We would be very happy if you came up with different solutions. If you have more than one way to solve this problem, don't keep them secret from us :-).
  5. Slowly but surely, we are posting the solutions to the weekly assignments on the web. Reading through them may give you ideas and help you avoid mistakes.
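The symmetry check from hint 2 can itself be automated with a few lines. This is a pure sketch: `similarity.sh` in the usage example stands for your own script and is an assumed name.

```shell
# Run a script twice with its two file arguments swapped and report
# whether the outputs agree (hint 2). $1 = script, $2 and $3 = documents.
check_symmetry() {
    a=$("$1" "$2" "$3")
    b=$("$1" "$3" "$2")
    if [ "$a" = "$b" ]; then
        echo "symmetric (similarity = $a)"
    else
        echo "NOT symmetric: $a vs. $b"
    fi
}
```

For example: `check_symmetry ./similarity.sh doc1.txt doc2.txt`.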

We will reward you with extra points if you show us the results of your experiments. Take a set of documents (e.g. from the Federalist Papers, from the Internet, or from your own writings), and calculate their pairwise similarity. Based on this, can you cluster the documents into sub-groups? Does this measure help to cluster documents according to language, author, style or content? Is it true that the most frequent words shared by two documents belonging to the same cluster are characteristic of (keywords of) the given cluster? What is the role of the stop word list? Do you have any additional observations?

Good luck with the assignment!