Lab assignment 1

This week, our goal is to analyze word frequencies in different languages.

The frequency of a given word depends on both linguistic and extra-linguistic factors. For instance, if we compare the frequency of the prepositions standing for on and behind in different languages, part of the variation can be due to the fact that on usually has many more meanings (including its uses in fixed expressions such as to count on something). It is also possible that a certain language uses the preposition on in fewer metaphorical senses than another language does. However, the difference between the frequency of on and that of behind is also due to extra-linguistic factors, such as the number of situations in which each preposition is the one that adequately describes the spatial relationship between two objects. These extra-linguistic factors are usually constant cross-linguistically, even though some slight variation might be due to cultural factors.

Remember the three steps of working with quantitative data:

  1. data collection,
  2. statistical analysis, starting with visualization,
  3. discussion of your findings.

We shall collect data about word frequencies with the cheapest (but least reliable) technique available nowadays: Google. Whenever you type in a search expression, Google reports the approximate number of pages matching the query. Even though this value is only an approximation, it will help us get an impression of which words are (much) more frequent than others.

If Google appears in Dutch, click on "voorkeuren" (Dutch for 'preferences', next to the button and the search expression) and then change the "interfacetaal" to your favorite language. ('English' is Engels in Dutch.) Finally, click on "voorkeuren opslaan".

On the same page you can also set the language(s) that you want to restrict your search to. That is, if you would like Google to return only hits on websites in a particular language, put a mark in the box next to the name of that language. This way, if your task is to count the frequency of the French personal pronoun il, you can avoid the Italian article il and all hits mentioning Kim Jong-il (unless it appears in a French context). NB: Do not forget to reset this function to its default value before you leave your computer!

 

Assignments:

1. Collect data on the frequencies of personal pronouns in your native tongue/favorite language.

Report the frequencies you have found and explain them (formulate possible hypotheses). Draw the distribution and find its mode. Does it make sense to speak of median, mean, etc.?
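If you prefer to check your result outside Word or Excel, the mode of such a distribution can be found as in the following minimal Python sketch. The hit counts are invented placeholders, not real Google figures, and the name mode_of is only an illustrative choice; replace the numbers with the counts you collected.

```python
# Hypothetical hit counts, invented for illustration only.
# Replace them with the numbers Google reported for your pronouns.
hits = {
    "I": 900,
    "you": 1200,
    "he": 400,
    "she": 300,
    "it": 1100,
    "we": 500,
    "they": 450,
}


def mode_of(counts):
    """Return the category (here: the pronoun) with the largest count."""
    return max(counts, key=counts.get)


mode = mode_of(hits)
print("Mode:", mode)
```

Note that the sketch only picks out the most frequent category; whether the median or the mean is meaningful for this kind of data is a question you still have to answer yourself.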

Compare your results with what two of your colleagues have found in two different languages: explain the similarities and the differences.

 

2. Collect data on the frequencies of the cardinal numbers written with letters ("one", "two", etc.) from one to forty.

Report the frequencies you have found and explain them (formulate possible hypotheses). Draw the distribution and find its mode. Are there outliers? (If so, why?) Does it make sense to speak of median, mean, etc.?

Draw the distribution again after combining the results into groups of four (1-4, 5-8, etc.). Does it make sense?
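The grouping into blocks of four can also be done programmatically. The sketch below is one possible way, assuming your counts are stored in a list ordered from "one" to "forty"; the function name group_counts and the placeholder numbers are hypothetical and not part of the assignment.

```python
def group_counts(counts, size=4):
    """Sum consecutive runs of `size` counts.

    counts[0] is assumed to belong to the number 1, counts[1] to 2, etc.
    Returns a dict mapping labels such as "1-4" to the summed count.
    """
    groups = {}
    for start in range(0, len(counts), size):
        chunk = counts[start:start + size]
        label = f"{start + 1}-{start + len(chunk)}"
        groups[label] = sum(chunk)
    return groups


# Invented placeholder counts for "one" .. "forty"; use your own data instead.
counts = [1000 // n for n in range(1, 41)]

grouped = group_counts(counts)
for label, total in grouped.items():
    print(label, total)
```

Each bar of the grouped histogram then corresponds to one entry of the returned dictionary.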

Compare your results with what two of your colleagues have found in two different languages: explain the similarities and the differences.

 

For both questions, please hand in a Word document of approximately one page. To report your data and to draw a histogram, use the table and chart features of Word (or Excel).

Hints:

  1. Always look carefully at the way you are collecting data. For instance, look not only at the number of hits, but also at the first few hits themselves.
  2. When comparing data collected within a language, as well as in cross-linguistic comparisons, always ponder whether it is better to use absolute frequencies or relative frequencies (the number of websites varies enormously across languages...).
  3. When forming hypotheses about the factors behind your observations, always distinguish linguistic factors from extra-linguistic ones. Moreover, there are frequently artifacts: features of the observed data that do not actually belong to the phenomenon being observed, but are rather due to a detail of the way data were collected (in this case, to the way Google estimates the number of pages matching the query).
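To illustrate the relative frequencies mentioned in hint 2, here is a minimal Python sketch that divides each count by the total for its language. The two sets of counts are invented placeholders (the language labels are hypothetical as well), chosen so that the absolute numbers differ by a factor of one hundred.

```python
def relative_frequencies(counts):
    """Divide every count by the total, so that the results sum to 1."""
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


# Invented placeholder counts for two hypothetical languages.
# Language A is assumed to have far more web pages than language B.
language_a = {"on": 8000, "behind": 2000}
language_b = {"on": 60, "behind": 40}

rel_a = relative_frequencies(language_a)
rel_b = relative_frequencies(language_b)

print(rel_a)
print(rel_b)
```

With numbers like these, the absolute counts mostly reflect how many pages exist in each language, whereas the relative frequencies can be compared directly.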

 

NEW: some hints on using Word to create diagrams.

