- A1 Entering data for a two-way table
- A2 Weighting cases with frequencies
- A3 Setting the meaning of a value/code
- B Creating a two-way table with all the occurrences
- C Choosing the most adequate two-way table form
- D Visualizing the counts from a two-way table
- E Significance tests and confidence intervals for proportions
- F Testing the association of two categorical variables using chi-square test

Is there a connection between gender and (academic) accomplishment? A large university carried out a study of PhD students who had started their PhD research six years earlier. The following two-way table shows how these students are distributed according to two variables: the status of their research and their gender.

| Status | Gender: Man | Gender: Woman |
| --- | --- | --- |
| Quitted | 238 | 98 |
| Still in progress | 134 | 33 |
| Thesis defended | 423 | 98 |

**> Enter these values into SPSS and specify the correct names of the variables.**

Suggestion: A useful trick to do this in SPSS is to define three different variables: GENDER, STATUS and COUNT. Each cell of the above table then becomes one case (one row), producing six rows in total. Strictly speaking, every student is a separate case described by two variables (gender and status), so we should enter 238 cases of "man/quitted", 98 cases of "woman/quitted", etc. Doing so hundreds of times is not only time-consuming but also a potential source of errors. Instead, we enter "man/quitted" only once and add a third variable COUNT, which we shall soon use as a "weight".

Another useful practice in SPSS is to use numeric values even for categorical variables, because this gives you access to more of the functionality of SPSS. For instance, you can encode 'man' as '1' and 'woman' as '2'; 'quitted' as '0', 'still in progress' as '1' and 'thesis defended' as '2'. Note that any other numbers could also be used, and that in many cases the categories of a categorical variable have no natural order, unlike numbers. (So, the numbers associated with 'man' and 'woman' could be reversed. Yet, it might make sense to use a number for 'still in progress' that lies between the numbers used for 'quitted' and for 'thesis defended'.)

**> Weight the cases using variable COUNT.**

Hint: Data, Weight Cases. By specifying this, we ask SPSS to perform *all* calculations and to draw *all* figures as if the row "man/thesis defended" occurred 423 times, etc.
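
As a rough illustration outside SPSS, the six-row layout and the effect of weighting can be sketched in Python. The names `rows`, `expand`, `GENDER_LABELS` and `STATUS_LABELS` are hypothetical and chosen only for this sketch; the numeric codes are the ones suggested above.

```python
# Six-row layout: one row per table cell, with COUNT as the case weight.
GENDER_LABELS = {1: "man", 2: "woman"}
STATUS_LABELS = {0: "quitted", 1: "still in progress", 2: "thesis defended"}

rows = [
    # (GENDER, STATUS, COUNT)
    (1, 0, 238), (2, 0, 98),
    (1, 1, 134), (2, 1, 33),
    (1, 2, 423), (2, 2, 98),
]

def expand(weighted_rows):
    """Turn weighted rows into one (gender, status) case per student."""
    return [(g, s) for g, s, count in weighted_rows for _ in range(count)]

cases = expand(rows)
total_weight = sum(count for _, _, count in rows)

# The expanded data set has as many cases as the weights add up to.
print(len(cases), total_weight)

# Number of cases labelled "man / thesis defended" after expansion.
print(GENDER_LABELS[1], "/", STATUS_LABELS[2], ":", cases.count((1, 2)))
```

Weighting by COUNT is meant to give the same results as analysing the expanded list `cases`, without having to type each student in separately.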

**>> In VARIABLE VIEW (column 'values') set the "meaning" of each possible value for both categorical variables: '1' is 'man', '2' is 'woman', etc.**

**> Check whether you have entered the counts correctly by creating a two-way table of GENDER and STATUS with the count of occurrences in the cells (that is, not with percentages, etc.).**

Hint: Analyze, Descriptive Statistics, Crosstabs... Click on "Cells" to define what you want to see.

*** 1. Copy the table to your report. What does SPSS display on the margins of this table? Explain how these values are obtained.**

**> Save the table in SPSS .sav format.**

A disadvantage of the two-way table above is that the connection between variables GENDER and STATUS (which is what interests us) is far from evident looking at it.

**> Create a two-way table of GENDER and STATUS with exclusively row conditional distributions.
> Create a two-way table of GENDER and STATUS with exclusively column conditional distributions.
> Create a two-way table of GENDER and STATUS with exclusively joint distributions.**

Hint: Descriptive Statistics, Crosstabs. Click on 'Cells', and choose what you would like to have. In order to keep the table clear and intelligible, make sure you always have only one distribution displayed at a time.
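
To see what the three kinds of percentages amount to, here is a minimal Python sketch, assuming STATUS defines the rows and GENDER the columns as in the table at the top of this page. Row percentages divide each count by its row total, column percentages by its column total, and total percentages by the grand total. All variable names are hypothetical, and the rounding to one decimal is a choice made for this sketch.

```python
# Observed counts from the table above: counts[status][gender]
counts = {
    "quitted":           {"man": 238, "woman": 98},
    "still in progress": {"man": 134, "woman": 33},
    "thesis defended":   {"man": 423, "woman": 98},
}
genders = ["man", "woman"]

# Marginal totals and grand total
row_totals = {s: sum(c.values()) for s, c in counts.items()}
col_totals = {g: sum(counts[s][g] for s in counts) for g in genders}
n = sum(row_totals.values())

def pct(part, whole):
    """Percentage rounded to one decimal."""
    return round(100.0 * part / whole, 1)

row_pct = {s: {g: pct(counts[s][g], row_totals[s]) for g in genders} for s in counts}
col_pct = {s: {g: pct(counts[s][g], col_totals[g]) for g in genders} for s in counts}
total_pct = {s: {g: pct(counts[s][g], n) for g in genders} for s in counts}

for name, table in [("row %", row_pct), ("column %", col_pct), ("total %", total_pct)]:
    print(name)
    for s in counts:
        print(f"  {s:<18}", table[s])
```

Printing one table at a time mirrors the advice above to display only one distribution per table.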

*** 2. Explain how the values in each of these three tables are obtained. Where can you find the conditional distributions and where can you find the marginal distributions?**

*** 3. Choose the table that you consider the most useful to show the difference between men and women. Write a short paragraph describing and explaining your observations with the table appended, as if it were a section in a scientific paper. Do not forget to add a descriptive caption to the table.**

For most people, including the readers of your future papers, a good diagram conveys what is going on far better than a table of numbers.

**> Create a diagram, for instance a stacked bar chart, that shows the difference between men and women as clearly as possible. Experiment with the different types of diagrams offered by SPSS: try out what the different options look like.**

Hint: Graphs, Bar...

*** 4. What is more useful: showing counts or percentages? Copy the diagram that you consider the most helpful, and add a caption to it.**

So far, we have been busy with descriptive statistics and visualization. Now, we turn to inferential statistics, that is, we seek to draw conclusions from the sample about the entire population.

In the present lab, we shall employ two different approaches: statistical procedures to estimate proportions in the population, and the chi-square test to examine the independence of the variables. The two approaches take two different views of the same data.

In the first approach, a population is described by three parameters: the proportion of students having quitted within six years, the proportion of students still in progress after six years, and the proportion of students having defended their thesis within six years. A population can be all students at a certain university, or all female students at a certain university, etc. For instance, we may estimate the quitting rate among men, or compare it to the quitting rate among women.

In what follows, we are going to employ the statistical procedures described in chapter 8 of M&M. Although we have three different proportions (quitting rate, in-progress rate and defense rate) that always sum to 1, these procedures focus on one rate at a time. So, we shall focus on the defense rate only.
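
As a rough sketch of what single-proportion and two-proportion procedures compute, the Python code below uses the usual large-sample z formulas. This is an assumption about the textbook's procedures, so check the details (for example, which standard error is used where) against chapter 8 of M&M. The function names are hypothetical, and the numbers are invented and unrelated to the PhD data.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, sd 1)

def proportion_ci(successes, n, level):
    """Large-sample confidence interval for one proportion."""
    p_hat = successes / n
    z_star = STD_NORMAL.inv_cdf(0.5 + level / 2)
    margin = z_star * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

def one_proportion_z(successes, n, p0):
    """z statistic for H0: p = p0, with the standard error computed under H0."""
    p_hat = successes / n
    return (p_hat - p0) / sqrt(p0 * (1 - p0) / n)

def two_proportion_z(x1, n1, x2, n2):
    """z statistic for H0: p1 = p2, using the pooled proportion."""
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (x1 / n1 - x2 / n2) / se

def two_sided_p(z):
    """Two-sided p-value of a z statistic."""
    return 2 * (1 - STD_NORMAL.cdf(abs(z)))

# Hypothetical numbers: 60 successes out of 100, and 45 out of 100
low, high = proportion_ci(60, 100, level=0.95)
z1 = one_proportion_z(60, 100, p0=0.5)
z2 = two_proportion_z(60, 100, 45, 100)
print(round(low, 3), round(high, 3))
print(round(z1, 2), round(two_sided_p(z1), 4))
print(round(z2, 2), round(two_sided_p(z2), 4))
```

For a one-sided alternative, the p-value would be one tail of the normal distribution rather than both.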

**> For each of the following questions, specify the population(s), the sample(s) drawn from the population(s), and what counts as a "success" (the event whose proportion these procedures deal with). Explain from which tables calculated in part C you take your values. Finally, explain which statistical procedure (for example, which test) you use, and check that the criteria for applying the statistical procedures (as described by M&M) are met.**

You can perform the computations either by hand (calculate the z-statistics as described in chapter 8, and use tables A or D of M&M), or using software such as http://www.quantitativeskills.com/sisa/statistics/t-test.htm. Unfortunately, SPSS is unable to help you in this task.

Hints for using this software:

- If you have X cases of success in a sample of n data, then enter X/n as mean and n as nr. of cases.
- The site rounds values, so make sure it uses the correct rounding.
- Make intensive use of the "clear" button.
- Ignore std dev and DEFF.
- At question 5: leave the Mean 2 and N of cases 2 empty (zero: so the "difference between means" will be your only mean); set the confidence interval C.I.
- At question 8, you want to compare two proportions (two "means" in this software), so you simply enter the proportion and total number of women (mean 1 and Number of cases 1), as well as the proportion and total number of men (mean 2 and Number of cases 2). You are returned a t-value and a very large df, so you can use this t-value as if it were a z-value, and estimate the p-values based on Table A. NB: the software gives you some probabilities, but make sure you do not misunderstand what they refer to; it is worth checking them in a Table.
- At questions 6 and 7, you want to compare two proportions (two "means" in this software) again, but the second one is not a measured value but a test value. In other words, you want to run a one-sample test, not a two-sample test. Yet, this software does not seem to offer that option. Still, you can use a trick (similar to the one used in section 9.3 of M&M): you present your test value (mean 2) as if it were a mean measured on an extremely large sample (Number of cases = 100000 at least). In fact, if you check the formulae of the one-sample and two-sample procedures, you will see that if n2 is much larger than n1, then the procedure is the same as comparing the first sample to mu2 in a one-sample procedure.
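
The large-sample trick from the last hint can be checked numerically. The sketch below assumes an unpooled two-sample statistic (difference of means divided by its standard error); the web calculator may compute things differently, so treat this only as an illustration of the limiting argument. The function names and all numbers are hypothetical and unrelated to the PhD data.

```python
from math import sqrt

def two_sample_stat(mean1, sd1, n1, mean2, sd2, n2):
    """Unpooled two-sample statistic: difference over its standard error."""
    return (mean1 - mean2) / sqrt(sd1**2 / n1 + sd2**2 / n2)

def one_sample_stat(mean1, sd1, n1, mu0):
    """One-sample statistic comparing a sample mean with a fixed test value."""
    return (mean1 - mu0) / (sd1 / sqrt(n1))

# Hypothetical sample and test value
mean1, sd1, n1 = 0.60, 0.49, 100
mu0 = 0.50

target = one_sample_stat(mean1, sd1, n1, mu0)
for n2 in (100, 10_000, 1_000_000):
    # The test value is presented as the mean of a second sample of size n2
    stat = two_sample_stat(mean1, sd1, n1, mu0, 0.50, n2)
    print(n2, round(stat, 4), round(target, 4))
```

As n2 grows, the term sd2²/n2 in the standard error shrinks towards zero, so the two-sample statistic approaches the one-sample statistic.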

*** 5. Provide a 90% confidence interval for the proportion of PhD students defending their thesis within six years.**

*** 6. Based on these data, can we safely (that is, with a significance level of 5%) say that the percentage of students defending their thesis within six years at this university is exactly 50%? (one-sided or two-sided? p-value?)**

*** 7. A national survey revealed that the percentage of students defending their thesis within six years is 47%. Can we conclude at a significance level of 0.05 that the percentage at this university is larger than the national average? (one-sided or two-sided? p-value?)**

*** 8. Can we conclude that there is a significant difference between the probability of a man finishing within six years and the probability of a woman finishing within six years?**

The second inferential approach considers these data as describing a single sample, originating from a single population. Yet, two variables are measured for each case: GENDER and STATUS. In other words, you do not compare 795 men to 229 women for STATUS, but you compare GENDER to STATUS in 1024 cases.

The chi-square test to be employed tests whether there is an association between the two variables (chapter 9 of M&M): does knowing the value of one variable help us predict the value of the other? A very strong association would arise, for instance, if all men had quitted and all women had defended their theses; that is, by knowing the value of GENDER for a certain case, we could predict the value of STATUS with full certainty. A somewhat weaker association would arise if 70% of men had quitted and 70% of women had defended their theses. In this case, if we were told that a certain student is male, we would bet that he has quitted, even though we could not be absolutely sure about it. Finally, in a situation with absolutely no association, the quitting rate among men is equal to the quitting rate among women: being told the gender of a student does not change what we know about the probability that the student has quitted.

Having entered the data in the two-way table earlier today, we can let SPSS do the job. Yet, we need to be able to interpret the output, and to know how to formulate our conclusion.

*** 9. Formulate the null hypothesis of the chi-square test, and the alternative hypothesis, in one full sentence each. Check whether the criteria for applying a chi-square test (as described by M&M) apply in our case.**

The chi-square test compares the observed counts to the expected counts in each cell. The latter are calculated using the totals on the margins. Can you let SPSS display the two-way table with the expected counts? Compare the observed and the expected counts in each cell: are they "very" different? The chi-square test answers this question in a precise way.
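
As a minimal sketch of how expected counts follow from the marginal totals, the Python code below uses the standard formula (row total times column total, divided by the grand total) and then sums the scaled squared differences into the chi-square statistic. The 2x2 table is hypothetical and unrelated to the PhD data, and the variable names are chosen only for this illustration.

```python
# Hypothetical 2x2 table of observed counts
observed = [
    [30, 20],
    [20, 30],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count under independence: row total * column total / grand total
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Chi-square statistic: sum over cells of (observed - expected)^2 / expected
chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

print(expected)
print(chi_square)
```

The larger the gaps between observed and expected counts, relative to the expected counts, the larger the statistic becomes.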

**> Let SPSS run the chi-square test.**

Hint: go to CROSSTABS and choose STATISTICS.

*** 10. Summarize your conclusion concerning rejection or non-rejection of the null hypothesis, and what it means: is there a statistically significant association between gender and status? As usual, provide the details of the statistical procedure in parentheses: in this case, the chi-square value, the degrees of freedom and the p-value (probability, significance). Additionally, explain why df has this value.**

Note: a "statistically significant" association means that it can be observed using statistical techniques and based on our data. It is, however, not necessarily a "significant" association in the sense of a strong association. The "strength" of such an association can be measured in different ways. For instance, as we assigned '0', '1' and '2' to the different possible values of status, we can compute the mean of the status for men, as well as for women, and we can compare these two means using a two-sample t-test. Moreover, since both variables have been encoded numerically, we can also calculate the correlation *r* of the two variables (within 'crosstabs' in SPSS).
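
The two measures mentioned in this note can be sketched in Python from the weighted six-row data, assuming the numeric codes suggested earlier (GENDER: 1 = man, 2 = woman; STATUS: 0, 1, 2). The helper `weighted_mean` is hypothetical, and the correlation is the ordinary Pearson formula with each row weighted by COUNT.

```python
from math import sqrt

# (GENDER, STATUS, COUNT) with GENDER 1 = man, 2 = woman and
# STATUS 0 = quitted, 1 = still in progress, 2 = thesis defended
rows = [
    (1, 0, 238), (2, 0, 98),
    (1, 1, 134), (2, 1, 33),
    (1, 2, 423), (2, 2, 98),
]

def weighted_mean(values_and_weights):
    """Mean of the values, each counted as often as its weight says."""
    pairs = list(values_and_weights)
    total = sum(w for _, w in pairs)
    return sum(v * w for v, w in pairs) / total

# Mean STATUS score per gender
mean_status_men = weighted_mean((s, w) for g, s, w in rows if g == 1)
mean_status_women = weighted_mean((s, w) for g, s, w in rows if g == 2)

# Pearson correlation of the two coded variables, weighting each row by COUNT
mean_g = weighted_mean((g, w) for g, s, w in rows)
mean_s = weighted_mean((s, w) for g, s, w in rows)
cov = weighted_mean(((g - mean_g) * (s - mean_s), w) for g, s, w in rows)
var_g = weighted_mean(((g - mean_g) ** 2, w) for g, s, w in rows)
var_s = weighted_mean(((s - mean_s) ** 2, w) for g, s, w in rows)
r = cov / sqrt(var_g * var_s)

print(mean_status_men, mean_status_women, r)
```

The sign of `r` depends on the arbitrary choice of codes: swapping the codes for 'man' and 'woman' flips the sign without changing the strength.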

Both these techniques will also provide us with information on the *direction* of the association: whether men or women tend to have a higher score on variable STATUS. Chi-square does not tell us this direction, as it is designed to be employed on categorical data (such as ours), in which case direction is theoretically meaningless.

The usual way to formulate the conclusion of the statistical procedure is as follows:

. . . we conclude that . . . (χ² = . . . , df = . . . , p = . . .).

**Optional task:** perform the above computations yourself, describing the mathematical details of the procedures.
