Final assignment
Note: The shell scripts that I showed you on Wednesday
are available here. You can download them and play around with them.
Next week, I plan to put them on the web site of the lecture notes as well, under week
15, together with some further explanations.
If you have any further questions, please send me an email.
About the final test
The final test is on Tuesday January 25th, at 2 p.m. sharp, in the
Unix lab (H12.102).
As I explained in class, the final test will be composed of two parts:
- A written test, on paper. It lasts 45 minutes and no aids may be used. I recommend
that you read through the lecture notes (including the few sections we did not discuss
during the course) and go through the assignments once again, together
with the corrections you received (or will have received) from us and the solutions
to be published on the web site.
- A task to be solved on the computer in 45-60 minutes, based on your solution of
the final assignment. This means that you must have your solution
with you (either on the machine or on a diskette). You may use
any non-human help: your own books, your own notes, the web, etc. But you
may not exchange information: lending your books or your notes
to somebody else, sending emails, etc., are not allowed.
Final assignment
Our aim is to create a system for automatic text categorization. To do so,
we need to be able to calculate the "distance" or the "similarity" of documents.
Once we have calculated the similarity of all document pairs, we can put them into
a table and try to find clusters of documents. Alternatively, given a new
document, we can determine which older document it is most similar to.
Here is a toy example of what I mean by "clustering" documents based on their
similarity. Suppose that the maximum similarity is 1 (a document is maximally
similar to itself, so the diagonal contains 1s) and the minimum is 0.
The similarity of document A to B is equal to the similarity of B to A, so the
following matrix is symmetric.
      |  A  |  B  |  C  |  D  |  E
  ----+-----+-----+-----+-----+-----
    A |  1  | 0.9 | 0.1 | 0.2 | 0.1
    B | 0.9 |  1  | 0.4 | 0.3 | 0.2
    C | 0.1 | 0.4 |  1  | 0.8 | 0.2
    D | 0.2 | 0.3 | 0.8 |  1  | 0.1
    E | 0.1 | 0.2 | 0.2 | 0.1 |  1
It is clear that A and B form a cluster, C and D form a cluster, and E forms a cluster
by itself. The similarity between elements of the same cluster is quite high (>= 0.8),
whereas the similarity between documents belonging to two different clusters is low (<= 0.4).
Your task is to write a shell script that measures the similarity of two
documents. (You do not have to cluster documents.)
The two arguments of the script are the names of the files containing the documents,
and the standard output of the script is a single number, namely, the similarity measure
(or an error message). The similarity measure is defined as follows:
Take the 100 most frequent words in each of the documents. The similarity
measure of the two documents is the number of words that occur among the 100
most frequent words of both documents:
Similarity(A,B) := #{ w | w is among the 100 most frequent words of document A
                         and w is among the 100 most frequent words of document B }
[ #{...} denotes the cardinality, that is, the number of elements of the given set ]
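As an illustration, the definition above can be computed with standard Unix tools: build a frequency list for each file, keep the top N words, and count the lines common to both lists with comm. This is only a minimal sketch, not the expected solution; the helper names top_words and similarity, and the temporary file names, are made up.

```shell
N=100

top_words() {                          # print the N most frequent words of a file
    tr -cs 'a-zA-Z' '\n' < "$1" |      # one word per line, drop digits/punctuation
    tr 'A-Z' 'a-z' |                   # fold upper case to lower case
    grep -v '^$' |                     # remove empty lines
    sort | uniq -c | sort -rn |        # frequency list, most frequent first
    head -n "$N" | awk '{print $2}' |  # keep the top N words only
    sort                               # alphabetical order, as comm requires
}

similarity() {                         # count shared top-N words of two files
    top_words "$1" > top1.tmp
    top_words "$2" > top2.tmp
    comm -12 top1.tmp top2.tmp | wc -l # lines common to both sorted lists
    rm -f top1.tmp top2.tmp
}
```

For example, `similarity doc1.txt doc2.txt` prints a single number, as the assignment requires. A stop word filter (see below) could be spliced in after the case-folding step, e.g. with `grep -v -w -f stopwords.txt`.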
In addition to that:
- Before creating the word frequency lists, you have to pre-process your documents.
This pre-processing includes removing features that are irrelevant to the similarity
of the two documents. For example, remove numbers, punctuation marks and HTML codes,
change upper-case characters to lower case (or vice versa), etc.
The more pre-processing you do, the better your solution is.
- Create a list of so-called stop words. These are words that occur
frequently in any text (such as "the", "and", "so", "have", etc.); you
remove these words from your documents before creating the word frequency lists.
(In fact, if your goal is text categorization according to language, you do not
want to have such a list. Yet, if your goal is text categorization according to
topic, such a list will significantly improve your results.) For instance, you
may find on the web the frequency ranking of words in a large corpus, and use the
topmost words as your stop word list.
- If the script receives a third argument that is "-b", then do the same with
bigrams of words instead of single words (similarity = number of shared bigrams
among the 100 most frequent bigrams of each document).
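For the bigram option, one possible trick (again only a sketch; the helper name make_bigrams is made up) is to paste the one-word-per-line stream against a copy of itself shifted by one line. The resulting bigram stream can then be fed into the same frequency-counting pipeline as the single words.

```shell
make_bigrams() {                        # print the word bigrams of a file, one per line
    tr -cs 'a-zA-Z' '\n' < "$1" |       # one word per line
    tr 'A-Z' 'a-z' | grep -v '^$' > wl1.tmp
    tail -n +2 wl1.tmp > wl2.tmp        # the same list, shifted by one line
    paste -d ' ' wl1.tmp wl2.tmp |      # pair each word with its successor
    sed '$d'                            # drop the last, incomplete pair
    rm -f wl1.tmp wl2.tmp
}
```

For the input "one two three", this prints the two bigrams "one two" and "two three"; piping the output into `sort | uniq -c | sort -rn | head` then gives the most frequent bigrams.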
Tips:
- You may want to use a variable instead of the literal 100, because if your documents
are short, you obtain better results if you take the 10 or 25 most frequent words only.
Then, you can simply change the value of this variable at the beginning of
your shell script, instead of rewriting it.
- This similarity measure is symmetric: the similarity of document A to
document B is equal to the similarity of document B to document A. Check that your
script indeed returns the same result if you reverse the order of its arguments.
- After you are sure your script works, you can remove all temporary files created by
your script. But do not delete them as long as your script is not complete: analyzing
these files may help you debug your script.
- Together with Lonneke, we have come up with two very different ways of solving this
problem. One makes heavy use of all the different tricks presented in the last
lecture. The other involves a trick similar to the one used to find palindromes.
Nevertheless, you may also find a third way to solve this assignment. We would be
very happy if you came up with different solutions. If you have more than one
way to solve this problem, don't keep them secret from us :-).
- Slowly but surely, we are posting the solutions of the weekly assignments on the
web. Reading through them may give you ideas and help you avoid mistakes.
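The symmetry check mentioned above can itself be scripted. In this sketch, sim is only a tiny stand-in measure (number of distinct shared words); substitute your own script or function in its place.

```shell
sim() {                                 # stand-in measure: distinct shared words
    tr -s ' ' '\n' < "$1" | sort -u > s1.tmp
    tr -s ' ' '\n' < "$2" | sort -u > s2.tmp
    comm -12 s1.tmp s2.tmp | wc -l
    rm -f s1.tmp s2.tmp
}
printf 'red green blue\n'    > doc1.txt
printf 'green blue yellow\n' > doc2.txt
a=$(sim doc1.txt doc2.txt)              # one order of the arguments...
b=$(sim doc2.txt doc1.txt)              # ...and the reverse order
[ "$a" = "$b" ] && echo symmetric       # both runs must print the same number
rm -f doc1.txt doc2.txt
```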
We will reward you with extra points if you show us the results of your experiments.
Take a set of documents (e.g. from the Federalist Papers, from the Internet or
from your own writings), and calculate their distances. Based on these distances, can
you cluster the documents into sub-groups? Does this measure help to cluster documents
according to language, author, style or content?
Is it true that the most frequent words shared by two documents belonging to the
same cluster are characteristic of (keywords of) the given cluster?
What is the role of the stop word list? Do you have any additional observations?
Good luck with the assignment!