Lab 3: statistics for EMCL students

Lab assignment 3

This week, we start using SPSS. The assignments are getting less labor-intensive, as you will get the mathematics for free. Still, you should always keep in mind that software can only help you if you understand what it does and in which cases you can use this or that function.

SPSS stands for Statistical Package for the Social Sciences, and is the most frequently used software among psychologists, sociologists and linguists (and probably in many other fields) to perform statistical computations.

You can send me the solutions (report) as a pdf or a Word file (please, not in the Word 2007 format, that is, with a .docx extension), or print the report and put it in my postbox. In the latter case, a plastic bag is not necessary, just fold it up. Do not forget to mention your name in the report, as well as in the file name.

Each assignment is worth 10 points (as the previous ones... sorry for the delay in correcting them). Handing in the report more than a week after the lab earns only half of the points, and no assignment is accepted with a delay of two weeks or more.

Reports must be as short as possible, that is, copy-and-paste only the SPSS results that are necessary. Explain the results in one sentence, especially if you needed to do more than just copy-paste (e.g., find the lowest value or calculate the difference of two values). Do not add any further information. Reporting irrelevant information can result in fewer points, as filtering the relevant information is one of your tasks.

Tasks that you simply have to do (before you get to the questions) appear below with a > starting the line in bold letters. Concerning these, you need not report anything, simply perform these tasks. The questions to be answered in the report are given below with a * starting the line, and in bold letters.

Answer the questions in a short but exact way, starting with the number of the question. For instance:

…
3. 20 measurements.
4. Word length 3.
…

Try to finish all assignments during the lab. Should this fail, you can go on working on the assignments in your own time.

Aims of Lab 3

A Getting familiar with SPSS
B1 Entering data by hand
B2 Using “Variable View”
B3 Creating a frequency table
C Creating a histogram
D Creating a boxplot
E Calculating mean, mode and median
F Calculating measures of spread

Lab 3

A. Getting familiar with SPSS

> Turn on computer and screen.
> Enter username and password to log in.
> Look up SPSS 14 (or higher) within the RUG menu (within Mathematics & Statistics) and launch SPSS.

In case SPSS has not been installed on your machine yet, you get a window saying that you have to restart your computer. Do that, otherwise SPSS may have problems running.

> Once SPSS is running, you are offered a menu with choices. Click on “cancel”.

Now you are in the Data Editor, the window of SPSS in which you can enter data and work on them. It is a spreadsheet of the kind you might be familiar with from other applications. At the top you find the name of the data file you are working with, but at this moment it is still: Untitled1 [DataSet0].

In the Data Editor, each (vertical) column of numbers represents a variable. Each variable is given a name, which appears at the top of the column. Use meaningful names, such as LENGTH, and not something like X24A06.

Each (horizontal) row represents a case. A case is a series of observations belonging together, such as the answers of a respondent to the questions in a questionnaire, or different values measured on the same subject of the experiment. For instance, if you have 32 respondents, then you need 32 rows for the 32 cases. If the questionnaire contained 40 questions, then you most probably need 40 columns, and so you have 40 variables. (Next week, we learn how to calculate new, derived variables from existing ones.)
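
The row-and-column layout described above can also be pictured outside SPSS. The following is only a minimal sketch in Python, not something SPSS itself does: the variable names and all values are made up for illustration.

```python
# Hypothetical questionnaire: 3 respondents (cases), 2 questions (variables).
# All names and values are made up for illustration.
variable_names = ["AGE", "LENGTH"]
cases = [
    [23, 171],  # case 1: the answers of one respondent
    [31, 165],  # case 2
    [27, 180],  # case 3
]

n_cases = len(cases)                # number of rows
n_variables = len(variable_names)   # number of columns

# A variable is one column: the same question across all cases.
length_index = variable_names.index("LENGTH")
length_column = [row[length_index] for row in cases]

print(n_cases, "cases,", n_variables, "variables")
print("LENGTH:", length_column)
```

With 32 respondents and 40 questions, the same structure would simply have 32 rows of 40 values each.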

The Data Editor is composed of two parts: the Data View and the Variable View. By clicking on the tabs at the bottom left of the window you can switch between them.

The Variable View offers an overview of your variables, and you can also define some features of these variables. The most important features are:

1. Name: the name of the variable.
2. Type: defines the type of the variable. Some of the types offered by SPSS:
a.    Numeric: the usual way of rendering numbers (e.g., 12345,67).
b.    Comma: comma before each group of three digits, dot before decimal digits (e.g., 12,345.67).
c.    Dot: dot before each group of three digits, comma before decimal digits (e.g., 12.345,67).
d.    String: any textual information (e.g., answers to an open question).
3. Width: the number of positions available in the Data View window.
4. Decimals: the number of decimal digits after the comma/dot.
5. Label: text providing more information about the variable.
6. Values: texts providing information about each value of the variable.
7. Missing: the value used to denote missing values (e.g., “no answer”).
8. Column: the width of the column in the Data View window.
9. Measure: the “measurement scale” of the variable (nominal, ordinal or scale, the last covering all types of numeric scales).

At the top of the window you find the menu of SPSS: FILE, EDIT, VIEW, DATA, etc. All statistical calculations are found under ANALYZE, and all diagrams and charts under GRAPHS. To calculate new variables based on the existing ones, use the commands under TRANSFORM. The HELP menu provides further assistance, although its explanations may prove rather terse in the beginning.

> Have a look at the different menus to get a general overview of them.

B. Entering data and creating a frequency table

The MLU (Mean Length of Utterance) measures the length of an utterance (a well-formed sentence or a sentence-like series of words) by counting the number of words it contains. It is an important measure of the linguistic capabilities of children acquiring a language and of patients with impaired language, and it is also useful in identifying authors of texts.

Here are the lengths of a test utterance measured in 20 patients:

3, 5, 4, 4, 10, 4, 11, 4, 4, 6, 3, 4, 4, 8, 8, 8, 5, 8, 4, 9.

> Enter these values by hand and give the variable the name MLU.
> In the Variable View, set the number of decimals to 0 (as utterance length always has an integer value).

When you work with SPSS (as with any other application), it is good practice to regularly save your data files. Output files are often simpler to create again, but data files are certainly not. Moreover, SPSS 14 is not always stable, causing the program to terminate unexpectedly. Finally, we may want to use some of the data files during several labs.

> Therefore, save your data file to your own network drive (X:\) in a separate folder that you create specifically for this lab.

A frequency table is a table that shows how often each value of a variable appears among your data.

> Create a frequency table from this variable.
Hint: 'Analyze', 'Descriptive Statistics', 'Frequencies'.
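
To see what a frequency table contains, here is a minimal sketch in Python. It is not how SPSS computes the table internally; it only mirrors the usual columns (value, frequency, percent, cumulative percent). The sample is made up and is deliberately not the MLU data, so the questions below remain yours to answer.

```python
from collections import Counter

# Made-up sample (not the lab data): number of words in 8 utterances.
sample = [2, 3, 3, 5, 3, 2, 7, 5]

counts = Counter(sample)   # how often each distinct value occurs
n = len(sample)            # total number of measurements

# One row per distinct value, sorted by value:
# (value, frequency, percent, cumulative percent)
table = []
cumulative = 0
for value in sorted(counts):
    frequency = counts[value]
    cumulative += frequency
    table.append((value, frequency, 100 * frequency / n, 100 * cumulative / n))

print("Value  Frequency  Percent  Cumulative")
for value, frequency, percent, cum_percent in table:
    print(f"{value:5d}  {frequency:9d}  {percent:7.1f}  {cum_percent:10.1f}")
```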

During the data entry process, one quite often makes errors. Hence, it is imperative always to check the data you have just entered. Besides rereading the numbers in Data View, you should also look for outliers “created” by erroneous data entry: for instance, typing too many zeros or entering two values in a single cell will create values much greater than the other values. In the present case, check whether the frequency table contains only values you remember having entered (and that make sense). Also compare your frequency table to those of your neighbours in the lab.

> Check the frequency of each value in your frequency table together with your neighbour.
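
The check for implausible values can also be automated. The sketch below, in Python, flags entries outside a plausible range; the entries and the two bounds are hypothetical and were chosen by hand for this illustration only.

```python
# Made-up entries; 80 stands in for a typing error (one zero too many).
entered = [3, 5, 4, 80, 6]

# Hypothetical plausibility bounds, chosen by hand for this illustration.
LOWEST_PLAUSIBLE = 1
HIGHEST_PLAUSIBLE = 30

# Collect (row number, value) for every entry outside the plausible range.
suspicious = [
    (position, value)
    for position, value in enumerate(entered, start=1)
    if not LOWEST_PLAUSIBLE <= value <= HIGHEST_PLAUSIBLE
]

print(suspicious)
```

Such a check only finds values that are impossible or very unlikely; a wrong but plausible value (say, 5 instead of 6) can only be found by rereading the data.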

* 1. Copy the table into your report.
* 2. How many measurements (data) do you have?
* 3. Which MLU is the second most frequent?
* 4. How often does the highest value of MLU occur?

C. Creating a histogram

A histogram (or frequency diagram) is a graph displaying how frequently the possible values of a variable occur (or how frequently values falling within a certain range occur) among the data that have been entered.

> Create a histogram based on the variable MLU.
Hint: 'Graphs', 'Histogram'.

> Do it again, but have SPSS also draw a Normal curve.
Hint: mark the checkbox ‘display normal curve’.
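
A histogram is essentially a frequency table drawn as bars. As a rough illustration, the following Python sketch prints a sideways text histogram with one bar per integer value. The sample is made up (not the MLU data), and the sketch does not draw a Normal curve.

```python
from collections import Counter

# Made-up sample (not the lab data).
sample = [2, 3, 3, 5, 3, 2, 7, 5]
counts = Counter(sample)

# One line per integer between the minimum and the maximum,
# so that values that never occur show up as empty bars.
lines = []
for value in range(min(sample), max(sample) + 1):
    bar = "#" * counts[value]   # a Counter gives 0 for absent values
    lines.append(f"{value:2d} | {bar}")

print("\n".join(lines))
```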

* 5. Copy this second graph to your report.
* 6. What does the vertical axis display: numbers or percentages?
* 7. What is the highest value and what is the lowest value of the variable?
* 8. How many peaks are there?
* 9. There is a gap in the graph. At what value can this gap be found? What does this observation mean? Would you expect to find this gap if you had many more data?
* 10. Is this distribution approximately Normal?

D. Creating a boxplot

A boxplot can be seen as a simplified histogram turned on its side, but it will also prove useful for other purposes later on.

> Create a boxplot of your variable.

Hint: 'Graphs', 'Boxplot'. Choose: “Simple” and “Summaries of separate variables”.
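
The numbers behind a boxplot can be sketched in Python as follows. Two assumptions are made here: the quartiles follow the default convention of Python's statistics.quantiles, and the whiskers follow the common rule of reaching at most 1.5 times the height of the box. SPSS may use a slightly different quartile convention, so its values need not match exactly. The sample is made up (not the MLU data).

```python
import statistics

# Made-up sample (not the lab data); 15 is far away from the rest.
sample = [2, 3, 3, 5, 3, 2, 7, 5, 15]

# The three cut points that divide the sorted data into four parts.
q1, median, q3 = statistics.quantiles(sample, n=4)
iqr = q3 - q1   # height of the box

# Assumed whisker rule: at most 1.5 box heights beyond the box.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

inside = [x for x in sample if lower_fence <= x <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)
outliers = sorted(x for x in sample if x < lower_fence or x > upper_fence)

print("box:", q1, median, q3)
print("whiskers:", whisker_low, whisker_high)
print("outside the whiskers:", outliers)
```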

* 11. Copy this boxplot to your report.
* 12. What are the lowest and the highest values according to the boxplot?
* 13. What is approximately the median according to the boxplot?
* 14. What percentage of the data lies outside the box?
* 15. Which data are outside of the “whiskers” of the boxplot?

E. Calculating mean, mode and median

We would often like to summarize a variable as a single number that tells us roughly where the values of that variable are located. Generally the mean (average) is used for that purpose. Another option is the mode, that is, the value that appears most frequently. One can also use the median, the middle value when the observations are sorted from lowest to highest.

When a histogram is created, the mean is automatically calculated. The mode, the median and the mean can also be obtained by choosing “Analyze”, “Descriptive Statistics”, and then “Frequencies” in the menu. If you wish, uncheck the box next to ‘Display Frequency Table’, and ignore the warning. Then select the mean, the mode and the median via the “Statistics” button.

> Have SPSS calculate the mean, the mode and the median, and report them to you in a single table.
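
For comparison, the three measures of location (the mean, the median and the most frequent value) can be computed with Python's standard statistics module. This is only a sketch on a made-up sample, not the MLU data.

```python
import statistics

# Made-up sample (not the lab data).
sample = [2, 3, 3, 5, 3, 2, 7, 5]

mean = statistics.mean(sample)      # sum of the values divided by their number
median = statistics.median(sample)  # middle of the sorted values
mode = statistics.mode(sample)      # the most frequent value

print("mean:", mean)
print("median:", median)
print("mode:", mode)
```

With an even number of observations, as here, the median is the average of the two middle values.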

* 16. Copy this table to your report.
* 17. Suppose you make an error during data entry: you type 80 instead of 8. Which of these values will change, and which will not? (Why? What does M&M call this feature of a statistical measure?)
* 18. The median of MLU is lower than its mean. This is because the histogram is skewed to the … (left or right?), and it has a longer tail to the … (left or right?).

F. Calculating measures of spread

In many cases we are interested not only in roughly where the values of the variable are located, but also in the “width” of the frequency distribution. There are different measures for describing the “width” of the histogram. The best known one is the standard deviation (SD), but the range and the interquartile range (IQR) are also used. The drawback of the range (the difference between the maximum and minimum values) is that it depends entirely on the two most extreme values measured.

> Have SPSS calculate for you the SD, the range and the quartiles.
Hint: “Analyze”, “Descriptive Statistics”, “Frequencies”.
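
The three measures of spread can be sketched in Python as follows. Two assumptions: the SD is the sample standard deviation (dividing by n - 1), and the quartiles follow the default convention of statistics.quantiles, which may differ slightly from the one SPSS uses. The sample is made up (not the MLU data).

```python
import statistics

# Made-up sample (not the lab data).
sample = [2, 3, 3, 5, 3, 2, 7, 5]

sd = statistics.stdev(sample)               # sample SD, divides by n - 1
value_range = max(sample) - min(sample)     # maximum minus minimum

q1, _, q3 = statistics.quantiles(sample, n=4)
iqr = q3 - q1                               # third quartile minus first quartile

print("SD:", round(sd, 3))
print("range:", value_range)
print("IQR:", iqr)
```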

* 19. Report the SD, the range and the IQR.
* 20. If the range is seen as the width of the histogram, then how many SDs wide is this histogram? (How many times larger is the range than the SD?)