Lab 1, Methodological skills

SPSS lab 1

SPSS stands for Statistical Package for the Social Sciences. It is among the software packages most frequently used by psychologists, sociologists and linguists (and probably in many other fields) to perform statistical computations. With statistics software such as SPSS, you get the mathematics for free. Still, you should always keep in mind that software can only help you if you understand what it does and in which cases each function may be used.

Aims of Lab 1:

A Getting familiar with SPSS
B1 Entering data by hand
B2 Using “Variable View”
B3 Creating a frequency table
C Creating a histogram
D Creating a boxplot
E Calculating mean, mode and median
F Calculating measures of spread

A. Getting familiar with SPSS

> Launch SPSS from the Start Menu.

> Once SPSS is running, you are offered a menu with choices. Click on “cancel”.

Now you are in the Data Editor, the window of SPSS in which you can enter data and work with them. (The results will be presented in a different window.) At the top of the window you find the name of the data file you are working with; at this moment it is still Untitled1 [DataSet0].

It is a spreadsheet of the kind you might be familiar with from other applications (such as Excel). There is, however, a big difference: rows and columns cannot be interchanged, because they have different meanings. In this Data Editor, each (vertical) column represents a variable. Each variable is given a name, which appears at the top of the column. Use meaningful names, such as LENGTH, and not something like X24A06.

Each (horizontal) row represents a case. A case is a series of observations that belong together, such as the answers of one respondent to the questions in a questionnaire, or different values measured on the same subject in an experiment. For instance, if you have 32 respondents, then you need 32 rows for the 32 cases. If the questionnaire contained 40 questions, then you most probably need 40 columns, and so you have 40 variables. You can also calculate new, derived variables from existing ones.
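The layout of cases and variables can also be pictured outside SPSS. The following Python sketch is only an illustration and not part of the lab: the three respondents, the variable names AGE, Q1, Q2 and Q_TOTAL, and all values are invented for the example.

```python
# Each dictionary is one case (a row); each key is one variable (a column).
# The respondents and their answers are made up for illustration only.
cases = [
    {"AGE": 34, "Q1": 4, "Q2": 2},
    {"AGE": 27, "Q1": 5, "Q2": 3},
    {"AGE": 45, "Q1": 2, "Q2": 2},
]

# A derived variable is computed from existing variables, case by case.
for case in cases:
    case["Q_TOTAL"] = case["Q1"] + case["Q2"]

# Reading one variable means taking the same column from every row.
q_total = [case["Q_TOTAL"] for case in cases]
print(q_total)
```

Swapping rows and columns here would mix up respondents and questions, which is why the two cannot be interchanged.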

The Data Editor is composed of two parts: the Data View and the Variable View. You can switch between them by clicking on the tabs at the bottom left of the window.

The Variable View offers an overview of your variables, and you can also define some features of these variables. The most important features are:

1. Name: the name of the variable.
2. Type: defines the type of the variable. Some of the types offered by SPSS:
a.    Numeric: the usual way of rendering numbers (e.g., 12345,67).
b.    Comma: comma before each group of three digits, dot before decimal digits (e.g., 12,345.67).
c.    Dot: dot before each group of three digits, comma before decimal digits (e.g., 12.345,67).
d.    String: any textual information (e.g., answers to an open question).
3. Width: the number of positions available in the Data View window.
4. Decimals: the number of decimal digits after the comma/dot.
5. Label: text providing more information about the variable.
6. Values: texts providing information about each value of the variable.
7. Missing: the value used to denote missing values (e.g., “no answer”).
8. Column: the width of the column in the Data View window.
9. Measure: the “measurement scale” of the variable (nominal, ordinal or scale, the last covering all types of numeric scales).

At the top of the window you find the menu of SPSS: FILE, EDIT, VIEW, DATA, etc. All statistical calculations are found under ANALYZE, and all diagrams and charts under GRAPHS. To calculate new variables based on existing ones, use the commands under TRANSFORM. The HELP menu provides further assistance.

> Have a look at the different menus to get a general overview of them.

B. Entering data and creating a frequency table

The MLU (Mean Length of Utterance) measures the length of an utterance (a well-formed sentence or a sentence-like series of words) by counting the number of words it contains. It is an important measure of the linguistic capabilities of children acquiring a language and of patients with impaired language. It can also be useful in identifying the authors of texts, since authors tend to have a characteristic MLU.

Here is the MLU measured on 20 patients:

3, 5, 4, 4, 10, 4, 11, 4, 4, 6, 3, 4, 4, 8, 8, 8, 5, 8, 4, 9.

> Enter these values by hand (in Data View).
> In the Variable View, specify that the variable is named MLU.
> Then, set the number of decimals to 0. Utterance length always has an integer value, so displaying decimals makes no sense and would be an error.

When you work with SPSS (as with any other application), it is good practice to save your data files regularly. Output files are often simple to create again, but data files are certainly not. Moreover, SPSS is not always stable and may terminate unexpectedly.

> Therefore, save your data file to your own network drive in a separate folder that you create specifically for this lab.

A frequency table is a table that shows how often each value of a variable appears among your data.

> Create a frequency table from this variable.
Hint: 'Analyze', 'Descriptive Statistics', 'Frequencies'.
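To see what such a table contains, here is a minimal Python sketch that counts the 20 MLU values by hand. It is not part of the lab and is not how SPSS works internally; the function name frequency_table and the choice of columns (frequency, percent, cumulative percent) are assumptions made for this illustration.

```python
from collections import Counter

# The 20 MLU values given in this lab.
mlu = [3, 5, 4, 4, 10, 4, 11, 4, 4, 6, 3, 4, 4, 8, 8, 8, 5, 8, 4, 9]


def frequency_table(values):
    """Return (value, frequency, percent, cumulative percent) rows, sorted by value."""
    counts = Counter(values)
    n = len(values)
    rows = []
    cumulative = 0
    for value in sorted(counts):
        cumulative += counts[value]
        rows.append((value, counts[value], 100 * counts[value] / n, 100 * cumulative / n))
    return rows


table = frequency_table(mlu)
print(f"{'Value':>5} {'Freq':>5} {'Percent':>8} {'Cum. %':>8}")
for value, freq, pct, cum in table:
    print(f"{value:>5} {freq:>5} {pct:>8.1f} {cum:>8.1f}")
```

The frequencies in the table should add up to the number of cases, which is a quick first check on your data entry.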

During the data entry process, one quite often makes errors. Hence, it is imperative always to check the data you have just entered. Besides re-reading the numbers in Data View, you should also look for outliers “created” by erroneous data entry: for instance, typing too many zeros or entering two values in a single cell will create values much greater than the other values. In the present case, check whether the frequency table contains only values you remember having entered (and that make sense). Also compare your frequency table with that of your neighbours in the lab.

> Check the frequency of each value in your frequency table together with your neighbour.
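The kind of check described above can also be automated. The sketch below is a hypothetical illustration in Python, not an SPSS feature: the function name suspicious_values, the example data and the plausible range of 1 to 30 words are all assumptions chosen for the example.

```python
def suspicious_values(values, lowest_plausible, highest_plausible):
    """Return (row number, value) pairs that fall outside the plausible range."""
    return [
        (row, value)
        for row, value in enumerate(values, start=1)
        if not lowest_plausible <= value <= highest_plausible
    ]


# Hypothetical example: the value in row 5 was mistyped as 80.
entered = [3, 5, 4, 4, 80, 4, 11]

# The bounds 1 and 30 are an assumption about what counts as a plausible MLU.
flagged = suspicious_values(entered, lowest_plausible=1, highest_plausible=30)
print(flagged)
```

A check like this only catches values that are implausible. A typo that yields a plausible value (say, 5 instead of 4) can only be found by comparing with the original data.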

> Copy-paste the table into a Word file.
Q: How many measurements (data) do you have?
Q: Which MLU is the second most frequent?
Q: How often does the highest value of MLU occur?

C. Creating a histogram

A histogram (or frequency diagram) is a graph displaying how frequently the possible values of a variable occur (or how frequently values falling within a certain range occur) among the data you have entered.

> Create a histogram based on the variable MLU.
Hint: 'Graphs', 'Histogram'.
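The idea behind a histogram can be sketched in a few lines of Python, drawing one bar of "#" characters per value. This is only an illustration outside SPSS; the function name text_histogram is made up, and the sketch assumes one bar per integer value, whereas SPSS may choose different intervals.

```python
from collections import Counter

# The 20 MLU values given in this lab.
mlu = [3, 5, 4, 4, 10, 4, 11, 4, 4, 6, 3, 4, 4, 8, 8, 8, 5, 8, 4, 9]


def text_histogram(values):
    """Return one line per integer from min to max, with one '#' per observation."""
    counts = Counter(values)
    lines = []
    for value in range(min(values), max(values) + 1):
        # Counter returns 0 for values that never occur, giving an empty bar.
        lines.append(f"{value:>3} | {'#' * counts[value]}")
    return lines


histogram_lines = text_histogram(mlu)
print("\n".join(histogram_lines))
```

Unlike the frequency table, this drawing also shows values within the observed range that never occur, because every integer between the minimum and the maximum gets its own line.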

A Normal curve (a Gaussian distribution) is a very important function in statistics. Many statistical procedures require that the data (approximately) follow a Normal distribution. A number of tricks exist to check whether your data meet this requirement. The simplest one is to have SPSS draw a Normal curve over your data when plotting a histogram.

> Create again a histogram, but now have SPSS also draw a Normal curve.
Hint: mark the checkbox ‘display normal curve’.
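To get a feeling for what such a curve represents, the following Python sketch compares the observed frequencies with the frequencies a Normal curve would predict. It is a rough illustration under stated assumptions: the curve is taken to have the same mean and (sample) standard deviation as the data, each bar is assumed to be 1 wide, and the function name expected_frequency is invented for the example.

```python
from statistics import NormalDist, mean, stdev

# The 20 MLU values given in this lab.
mlu = [3, 5, 4, 4, 10, 4, 11, 4, 4, 6, 3, 4, 4, 8, 8, 8, 5, 8, 4, 9]

# Assumption: the curve has the same mean and sample SD as the data.
fitted = NormalDist(mu=mean(mlu), sigma=stdev(mlu))


def expected_frequency(value, bin_width=1.0):
    """Approximate count a Normal curve predicts for a bar centred on value."""
    return len(mlu) * bin_width * fitted.pdf(value)


for value in range(min(mlu), max(mlu) + 1):
    observed = mlu.count(value)
    print(f"{value:>3}  observed={observed:>2}  expected={expected_frequency(value):.1f}")
```

Large differences between the observed and expected columns suggest that the data deviate from a Normal distribution.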

> Copy-paste this second graph to a Word file.
Q: What does the vertical axis display: numbers or percentages?
Q: What is the highest value and what is the lowest value of the variable?
Q: How many peaks are there?
Q: There is a gap on the graph. At what value? What does this observation mean? Would you expect to find this gap if you had many more data?
Q: Is this distribution approximately Normal?

D. Creating a boxplot

A boxplot can be seen as a simplified histogram turned on its side, but it will also prove useful for other purposes later on.

> Create a boxplot of your variable.

Hint: 'Graphs', 'Boxplot'. Choose: “Simple” and “Summary separate variable”.
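A boxplot is drawn from only a handful of numbers. The Python sketch below computes them under common textbook conventions, which are assumptions here: quartiles from statistics.quantiles with its default method, and whiskers reaching to the most extreme values within 1.5 box lengths of the box. SPSS may use a slightly different quartile definition for its boxplots, so small differences are possible. The function name boxplot_summary is made up for this illustration.

```python
from statistics import quantiles

# The 20 MLU values given in this lab.
mlu = [3, 5, 4, 4, 10, 4, 11, 4, 4, 6, 3, 4, 4, 8, 8, 8, 5, 8, 4, 9]


def boxplot_summary(values):
    """Return the numbers a boxplot is drawn from (textbook 1.5 * IQR rule)."""
    q1, median, q3 = quantiles(values, n=4)  # default "exclusive" method
    iqr = q3 - q1  # length of the box
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inside = [v for v in values if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": sorted(v for v in values if v < low_fence or v > high_fence),
    }


summary = boxplot_summary(mlu)
for name, value in summary.items():
    print(f"{name}: {value}")
```

The box runs from the first to the third quartile, so by construction about half of the data lie inside it.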

> Copy-paste this boxplot to a Word file.
Q: What are the lowest and the highest values according to the boxplot?
Q: What is approximately the median according to the boxplot?
Q: What percentage of the data lies outside the box?
Q: Which data are outside of the “whiskers” of the boxplot?

E. Calculating mean, mode and median

We would often like to summarize the distribution of a variable in just a few numbers that tell us roughly where the many observed values of that variable are located. Generally the mean (average) is used for that purpose. Another option is the mode, that is, the value that appears most frequently. One can also use the median, the middle value when the observations are sorted from lowest to highest.

When a histogram is created, the mean is calculated automatically. The mode, the median and the mean can also be obtained by choosing “Analyze”, “Descriptive Statistics”, and then “Frequencies” in the menu. If you wish, uncheck the mark next to ‘Display Frequency Table’, and ignore the warning. Then select the mean, the mode and the median via the “Statistics” button.

> Have SPSS calculate the mean, the mode and the median.
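The three definitions can be checked against each other with a short Python sketch. It is only an illustration outside SPSS; the helper name median_by_hand is invented, and the standard library functions are used for comparison.

```python
from statistics import mean, median, multimode

# The 20 MLU values given in this lab.
mlu = [3, 5, 4, 4, 10, 4, 11, 4, 4, 6, 3, 4, 4, 8, 8, 8, 5, 8, 4, 9]


def median_by_hand(values):
    """Middle value of the sorted data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2


centre = {
    "mean": sum(mlu) / len(mlu),    # sum of the values divided by their number
    "median": median_by_hand(mlu),  # middle of the sorted values
    "modes": multimode(mlu),        # most frequent value(s); a list, as ties are possible
}
print(centre)
```

Note that the mean uses every value in its calculation, whereas the median and the mode depend only on the order and the frequency of the values.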

Q: Suppose you make an error during data entry: you type 80 instead of 8. Which of these values will change, and which will not?
Q: The median of MLU is lower than its mean. This is because the histogram is skewed to the … (left or right?), and it has a longer tail to the … (left or right?).

F. Calculating measures of spread

In many cases we are interested not only in roughly where the values of the variable are located, but also in the “width” of the frequency distribution. There are different measures for describing the “width” of the histogram. The best known one is the standard deviation (SD), but the range and the interquartile range are also used. The drawback of the range (the difference between the maximum and minimum values) is that it depends only on the two most extreme values observed. The other measures are less influenced by outliers and are determined rather by the bulk of the data; this holds especially for the interquartile range, whereas the SD still reacts noticeably to extreme values.

> Have SPSS calculate for you the SD, the range and the quartiles.
Hint: “Analyze”, “Descriptive Statistics”, “Frequencies”.
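How these measures are defined can be sketched in Python. This is an illustration outside SPSS under stated assumptions: the SD is the sample SD (dividing by n - 1), the quartiles come from statistics.quantiles with its default method, and the helper name sample_sd is invented for the example.

```python
from math import sqrt
from statistics import quantiles, stdev

# The 20 MLU values given in this lab.
mlu = [3, 5, 4, 4, 10, 4, 11, 4, 4, 6, 3, 4, 4, 8, 8, 8, 5, 8, 4, 9]


def sample_sd(values):
    """Sample standard deviation: root of the summed squared deviations over n - 1."""
    n = len(values)
    average = sum(values) / n
    squared_deviations = sum((v - average) ** 2 for v in values)
    return sqrt(squared_deviations / (n - 1))


value_range = max(mlu) - min(mlu)  # depends only on the two extremes
sd = sample_sd(mlu)                # uses every value
q1, q2, q3 = quantiles(mlu, n=4)   # quartiles; q2 is the median
iqr = q3 - q1                      # width of the middle half of the data

print(f"range={value_range}  SD={sd:.2f}  quartiles=({q1}, {q2}, {q3})  IQR={iqr}")
```

The hand-written sample_sd should agree with statistics.stdev, which is a useful check that the formula has been understood correctly.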

Q: If the range is seen as the width of the histogram, then how many SDs wide is this histogram? (How many times larger is the range than the SD?)

This material is an adapted version of the assignments of the statistics courses developed by John Nerbonne at the University of Groningen.