Practicum - week 13

To make your life easier, create a soft link in your home directory pointing to /users1/birot/Federalist/. Once you have done this, you can work in your own directory. For instance, if Fed is the name of the link in your directory that points to my directory above, then you can use paths like Fed/jay64.txt or Fed/*.
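As a reminder of how soft links behave, here is a minimal sketch on a throw-away directory. The names shared, paper.txt and Fed are only illustrative, and the temporary directory stands in for a real home directory; ln -s takes the target first and the name of the link second.

```shell
# Build a small sandbox so nothing in your real home directory is touched.
work=$(mktemp -d)
mkdir "$work/shared"
printf 'sample text\n' > "$work/shared/paper.txt"

# ln -s TARGET LINKNAME: create a soft link called Fed that points to shared.
ln -s "$work/shared" "$work/Fed"

# The file can now be reached through the link.
content=$(cat "$work/Fed/paper.txt")
echo "$content"

# Record whether Fed really is a symbolic link, then clean up.
if [ -L "$work/Fed" ]; then is_link=yes; else is_link=no; fi
rm -rf "$work"
```

Running ls -l on the link shows an arrow from the link name to its target, which is a quick way to check that the link points where you expect.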

Send me the command line that creates the soft link.

(1 point)

In the following, we are going to (partially) solve a problem that was recently raised at the department. The aim is to count the number of syllables in a text, a piece of information that is often very useful.

We have used the command wc to count the number of bytes / characters, the number of words (one word = approx. a string of characters between two spaces) and the number of lines (one line = approx. a string of characters between two new-line characters). These three tasks can be performed quite mechanically, because the only knowledge the computer needs is how to see a file as a text (remember: 1 byte = 1 character, based on the character table used). But how can we count the number of syllables in a text? The definition of a syllable may vary from language to language, so you need an algorithm that is much better informed about the structure of the given language, as well as about its orthographical tradition.
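As a quick refresher, the three counts can be obtained as sketched below. The sample text is made up for illustration; the final tr -d ' ' only strips the padding that some versions of wc put in front of the number.

```shell
# A two-line sample text (hypothetical content).
text='one two three
four five'

# wc -l counts new-line characters, wc -w counts words, wc -c counts bytes.
lines=$(printf '%s\n' "$text" | wc -l | tr -d ' ')
words=$(printf '%s\n' "$text" | wc -w | tr -d ' ')
bytes=$(printf '%s\n' "$text" | wc -c | tr -d ' ')

echo "lines: $lines, words: $words, bytes: $bytes"
# prints: lines: 2, words: 5, bytes: 24
```

Note that the byte count includes the two new-line characters, which matters as soon as you use wc -c to count something other than "everything in the file".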

We are going to solve this problem in three steps, the fourth step being open to further ideas.

2a.

In some languages, it is enough to count the number of vowels (like 'a', 'e', 'i', 'o' and 'u'), because each syllable contains exactly one vowel. This is the case in Hungarian. Yet in Dutch, this will give you only a first approximation.

Give a command line that returns the number of vowels in a text.

You can use any of the commands we have covered so far. (If you are a Unix expert, please do not use other tools, because the goal of this assignment is to be creative and to practice what we have learnt so far.) However, let me give you some ideas:

Using tr, you can simply transform all vowels into some character (say, into `a'), and everything else into some other character. You may not need this second step, though, depending on how you proceed. You can also use sed "s/something/something/g": then either you pipe several sed commands together (first transform all 'a's, then all 'e's, then all 'i's, etc.), or you use [something] to refer to the elements of a set.
You can delete everything that is not a vowel, either using tr -d, or using sed "s/something//g". For instance, tr -d abc will delete every occurrence of the character `a', every occurrence of the character `b', and every occurrence of the character `c'. On the other hand, sed "s/abc//g" will delete every occurrence of the string `abc' (replace it with "nothing", that is, with the empty string). Consequently, tr -d is handy only when you want to delete single characters, and not strings of characters. However, you can delete three different things with one command using tr -d abc, whereas you may need a pipe line of sed's for the same purpose: sed "s/a//g" | sed "s/b//g" | sed "s/c//g" (or you use sed "s/[abc]//g").
ADDITION: You may want to delete the new-line characters; otherwise the character count of your file will also include them. Remember, however, that sed works on lines (a line of the input file will become a line of the output file, unless the whole line is deleted), whereas tr is able to transform a file into one long line.
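The difference between tr -d and sed can be seen on a toy input. This is only a sketch of the hints above, not a solution to the exercise; the input strings are invented.

```shell
input='xabc cab abcx'

# tr -d abc deletes every a, b and c, wherever they occur.
by_tr=$(printf '%s\n' "$input" | tr -d abc)
echo "$by_tr"        # prints: x  x

# sed "s/abc//g" deletes only the exact string abc, so cab survives.
by_sed=$(printf '%s\n' "$input" | sed "s/abc//g")
echo "$by_sed"       # prints: x cab x

# With a set in square brackets, sed deletes single characters like tr -d.
by_sed_set=$(printf '%s\n' "$input" | sed "s/[abc]//g")
echo "$by_sed_set"   # prints: x  x

# New-line characters are counted by wc -c unless tr removes them first.
with_nl=$(printf 'ab\ncd\n' | wc -c | tr -d ' ')
without_nl=$(printf 'ab\ncd\n' | tr -d '\012' | wc -c | tr -d ' ')
echo "$with_nl $without_nl"   # prints: 6 4
```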

(2 points)

2b.

In the next step, let us get a better approximation of the number of syllables in a Dutch text. Dutch, like some other languages, writes some of its (long) vowels with two or more characters (e.g.: `oe', `aa', `ee'). Furthermore, Dutch has diphthongs, which are also written as a string of several vowel characters (e.g.: `ei', `oi', `eeu', etc.). A syllable in Dutch contains exactly one vowel, but that vowel may be long or may be a diphthong. Therefore, a Dutch syllable contains exactly one sequence of vowels (to be more precise: a sequence of vowel characters).

Consequently, if we can count the number of vowel sequences, we can get a better approximation of the number of syllables in the text. Can you do that in one command line?

A tip: Suppose that you have transformed all vowels to the character `a'. Now, how do you squeeze consecutive occurrences of the same character into one? Either you use tr -s, or you put each character on its own line (e.g.: sed "s/./&@/g" | tr @ '\012' ), so that you can use uniq.
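Both squeezing techniques from the tip can be tried out on a toy string first. The sketch below assumes an invented input in which `a' stands for any vowel and `x' for anything else, and it assumes that the input contains no `@'. Note the difference between the two results: tr -s a squeezes only runs of `a', whereas uniq squeezes runs of every character.

```shell
input='aaxaaaxxa'

# tr -s a squeezes each run of a into a single a; the xx run is left alone.
squeezed=$(printf '%s\n' "$input" | tr -s a)
echo "$squeezed"     # prints: axaxxa

# Alternative: one character per line, let uniq drop adjacent duplicates,
# then glue the lines back together by deleting the new-lines.
via_uniq=$(printf '%s\n' "$input" | sed "s/./&@/g" | tr @ '\012' | uniq | tr -d '\012')
echo "$via_uniq"     # prints: axaxa
```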

Another solution is to write a regular expression in sed that immediately replaces each sequence of `a', `e', `i', `o' or `u' with something.

(2 points)

2c.

However, Dutch has words like `ideeën'. The Dutch orthographical tradition uses the trema to "break apart" combinations of vowels. (The same symbol is called an umlaut when it changes the quality of the vowel, as it does in German, Finnish or Hungarian.)

Therefore, we could obtain an even better approximation of the number of syllables in the text if we introduced a fictive consonant (let's call it a glottal stop or a glide) before each vowel with a trema. Imagine that you rewrite the word `ideeën' as `ideevën', by replacing `ë' with `vë': now we can use our solution to assignment 2b (counting the number of consecutive sequences of vowels), and we will obtain the right result (the word `ideeën' contains three syllables).

Yet, a problem arises: how can we refer to the vowels containing a trema?

I want you to use the ISO 8859 codes of these characters, because the command tr is able to handle all these characters if you refer to their codes in octal. You remember, for instance, that the new-line character can be referred to as \n, but also as \012, because its ASCII code is 10 (that is, 012 in octal, 0A in hexadecimal). (Do not forget to escape the \ character, for instance, by using quotation marks!)
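As a reminder of the octal notation, here is a small sketch that uses only ASCII codes: 012 for the new-line and 011 for the tab. The replacement characters `+' and `_' are arbitrary choices for illustration. The single quotes keep the shell from interpreting the backslash before tr sees it.

```shell
# Replace every new-line (octal 012) with a plus sign.
joined=$(printf 'one\ntwo\nthree\n' | tr '\012' '+')
echo "$joined"       # prints: one+two+three+

# Replace every tab (octal 011) with an underscore.
untabbed=$(printf 'a\tb\n' | tr '\011' '_')
echo "$untabbed"     # prints: a_b
```

Codes above 127 are written in exactly the same way, as a backslash followed by three octal digits.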

How do you find out what the ISO 8859 codes of the different vowels with a trema are? Try man ascii: here, you will find a table containing the ASCII standard. However, characters with diacritics (such as a trema / umlaut / diaeresis) do not appear in the standard ASCII table (which only includes the codes between 0 and 127); they appear in the different extensions of the ASCII standard. Still, man ascii will turn out to be useful: look at the end of this man-page, where you will find pointers to other man-pages to look up. (Remember that one of the most useful parts of man-pages is their "see also" section.)

To sum up, you can transform the vowels with a trema into something else (say, into a `@') by using tr, before you manipulate them further (e.g., before you replace them with a sequence of consonant + vowel).

If you want to try out whether your command line works, create a simple file yourself (or download something from the web), and try it out.

(2 points)

2d.

Our solution is still not perfect. For instance, the word "uien" has two syllables, but our command line will count only one.

You are welcome to collect further problematic cases where our solution fails, and to try to come up with solutions to them. If you have something interesting, please send it to me to get extra points. You can also approach Prof. John Nerbonne with your ideas about the problem.

In the lecture, Lonneke showed you how to create bigrams from a text on a word level. On the web site, you will also find a pointer to a description of how to do it. However, you can also make trigrams, and you can also work on a character level.

For instance, take the following sentence:

I love sunny weather and sun.

This sentence contains the following bigrams on a word level: [I love], [love sunny], [sunny weather], [weather and], [and sun.]. The trigrams are: [I love sunny], [love sunny weather], [sunny weather and], [weather and sun.].
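One common way to obtain the word-level bigrams is sketched below; it is not necessarily the exact recipe from the lecture. The idea is to put one word per line, make a second copy of the list shifted by one line, and paste the two lists side by side. The file names words and next are only illustrative.

```shell
tmp=$(mktemp -d)

# One word per line: turn spaces into new-lines (tr -s squeezes repeated ones).
printf 'I love sunny weather and sun.\n' | tr -s ' ' '\012' > "$tmp/words"

# The same list, starting from the second word.
tail -n +2 "$tmp/words" > "$tmp/next"

# Paste the lists side by side; the last line has no partner, so sed drops it.
bigrams=$(paste -d ' ' "$tmp/words" "$tmp/next" | sed '$d')
echo "$bigrams"

rm -rf "$tmp"
```

For the sentence above this prints the five bigrams, one per line, from "I love" to "and sun.". The same shifting idea extends to longer n-grams by adding further shifted copies.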

On a character level, the following bigrams can be obtained from this sentence: [I _], [_ l], [l o], [o v], [v e], [e _], [_ s], etc. Notice that I have written an underscore instead of each space in order to make what is happening more visible. I also recommend that you first transform all spaces in the input file into underscores.

Similarly, trigrams on a character level are: [I _ l], [_ l o], [l o v], etc. In the same way, we can speak of four-grams, 5-grams, etc. (in general: n-grams).

Now, I want you to produce an alphabetical list of the trigrams on a character level, starting from some Federalist paper. In fact, you will probably need four command lines (four pipe lines) to be executed one after the other. Yet, you can put them on one line by adding a semi-colon (;) between them.

A remark. Sometimes it may be useful to add extra spaces before and after the document. Thus, you can say that the above sentence also contains the bigrams [___ I] and [sun ___], or the trigrams [_ _ I], [_ I _], [n . _] and [. _ _]. So, it is fine if you also have these trigrams in your output.

(3 points)