Lab 2: statistics for EMCL students

Lab assignment 2

This week, our goal is to get a sense of what is going on in descriptive statistics. You should do the basic calculations by hand at least once, before you have fancy software do them for you.

Moreover, we also practice doing calculations with Normal distributions. This will help you better understand the notions of inferential statistics.

First of all, make sure you understand everything in Chapter 1 of Moore and McCabe. If you do not, this is your last chance to ask me questions.

Then, locate some calculator program on your machine.

Using a calculator

Launch the Windows program called Calculator (Start menu --> Programs --> Accessories). Within its "View" menu, you can choose between using a "standard" and a "scientific" calculator.

I suggest you familiarize yourself with the statistical functions of the "scientific" calculator (or with those of any other calculator with similar functions). First, click on "Sta" to open a "Statistics Box", and move this new window away from the original one. Then, come back to the Calculator window, and enter your data, clicking on "Dat" after each value. Observe what happens in the Statistics Box. To remove a data point, click on CD in the Statistics Box. The number of data points, n, is automatically shown in the Statistics Box. Buttons Ave, Sum and s in the main window will give you the average (mean), the sum and the standard deviation (using n-1), respectively, of the data currently in the Statistics Box. By clicking first on Inv, and then on buttons Ave, Sum and s, you obtain the mean of the squares, the sum of the squares and the standard deviation using n, respectively. For more information, right-click on the buttons (and then on "What is this?"; you may want to use the shortcuts mentioned there, especially during data entry), or read the Help.

In Task 1, you should use the "standard" calculator (or any other calculator without statistical functions). Still, you may check your own calculations using the statistical functions. From Task 2 onwards, you may use the scientific calculator.

Technicalities:

Please hand in your solutions on paper, by leaving them in my post box (floor 4).

3 points for Task 1,
5 points for Task 2,
2 points for Task 3.

Task 1: Basic descriptive statistics

In a test, you ask children of age 10 to read a text, and then to fill in a multiple-choice test of 20 questions. Out of 16 children, eight had been diagnosed as dyslexic, whereas the other eight children are normal controls. Here are the scores of the 16 children in a random order:

15, 13, 6, 7, 8, 16, 15, 17, 8, 6, 10, 13, 8, 14, 5, 15

While you analyze these numbers, use only a "standard" calculator, that is, without statistical functions (in fact, with a single exception, you will not need a calculator at all!):

Draw a histogram, and describe the general shape of the distribution (symmetry, skew, modes, etc.). Do you have any explanation for what you observe?
Find the minimum, the maximum, the median, the first and third quartiles, the IQR. Draw a boxplot.
Calculate the mean, the variance and the standard deviation (both with n and n-1) of the distribution.

Explain how you have obtained these results, especially for the mean and the standard deviation.
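
Once you have finished the hand calculations, you may want to check them. Here is a minimal sketch of such a check in Python (my choice of tool here; any software will do). It uses the standard `statistics` module. The helper names `five_number_summary` and `text_histogram` are made up for this illustration, and the quartiles are taken as the medians of the lower and upper halves of the sorted data, which is an assumption about the convention you follow; other programs may use a slightly different rule.

```python
import statistics

scores = [15, 13, 6, 7, 8, 16, 15, 17, 8, 6, 10, 13, 8, 14, 5, 15]

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max); quartiles are medians of the two halves."""
    ordered = sorted(data)
    half = len(ordered) // 2
    lower = ordered[:half]    # observations below the position of the median
    upper = ordered[-half:]   # observations above the position of the median
    return (ordered[0], statistics.median(lower), statistics.median(ordered),
            statistics.median(upper), ordered[-1])

def text_histogram(data):
    """One line per possible score, one star per child with that score."""
    lines = []
    for value in range(min(data), max(data) + 1):
        lines.append(f"{value:2d} | " + "*" * data.count(value))
    return "\n".join(lines)

minimum, q1, median, q3, maximum = five_number_summary(scores)
summary = {
    "min": minimum, "Q1": q1, "median": median, "Q3": q3, "max": maximum,
    "IQR": q3 - q1,
    "mean": statistics.mean(scores),
    "variance (n)": statistics.pvariance(scores),    # divides by n
    "variance (n-1)": statistics.variance(scores),   # divides by n-1
    "sd (n)": statistics.pstdev(scores),
    "sd (n-1)": statistics.stdev(scores),
}

print(text_histogram(scores))
for name, value in summary.items():
    print(f"{name}: {value}")
```

Note that `pvariance` and `pstdev` divide by n, whereas `variance` and `stdev` divide by n-1, so the two pairs correspond to the two versions asked for above.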

Task 2: Sampling distribution from a population with a uniform distribution

Imagine I have founded the Club of Fans of Statistics. I admit new members only in September, and members are so fond of statistics that they never quit the Club. The Club was founded 9 years ago, so some have been members for 9 years, others for 8 years, etc., and those admitted a few weeks ago have been members for 0 years.

Now, you are interested in knowing the average length of time people have been members. The Club has so many members that it is impossible for you to ask everyone, and I am not giving you confidential data on our members. So you come to our meeting where all members are present, and you randomly ask 20 people how long they have been members.

Use the first 20 digits in line 105 of Table B in Moore and McCabe to simulate the responses in your survey. (This table contains random digits, that is, something you typically get if you throw a ten-sided die many times.) You hope that the average (mean) of these answers will provide a good estimate of what you really want to know, that is, the average (mean) of the corresponding values in the entire Club.

Question 1: in your survey, what is the population, what is the sample, what is the variable being measured, what is the parameter you are interested in, and what kind of statistic do you calculate?

Question 2: Plot the distribution of the variable you measure in your sample (a histogram), and calculate the statistic from it.

What you do not know is that each year I only admit fifty people (even though many more would like to become members...). Consequently, there are fifty people who have been members for 9 years, another fifty people who have been members for 8 years, etc., and fifty people who have been members for 0 years. (Each "cohort" has the same number of people, and this is why you could simulate the responses to your survey using the random numbers in Table B.)

Question 3: Given this new piece of information, plot the distribution of the variable being measured in the entire population, and calculate the true value of the parameter you are interested in. (Please, not with a calculator!) Is the statistic calculated in the previous question a good estimate of the parameter?

Now we turn to the sampling distribution of the statistic. Repeat the sampling process another nine times, so that in the end you have used the first 20 digits of each of lines 101 to 110 in Table B. For each sample, calculate the statistic. So now you get 10 (different) estimates of the parameter. How are they distributed?

Question 4: Plot the sampling distribution of the statistic: create a stemplot or a histogram showing how often you got a value between 3.25 and 3.5, how often you got a value between 3.5 and 3.75, etc. What is the shape of this distribution? Calculate the mean and standard deviation of these 10 values (now, you may use the "scientific" function of your calculator).
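
To illustrate the binning only, here is a small Python sketch that counts how many values fall into intervals of width 0.25 and prints a text histogram. The function names are made up, the ten values in `example_means` are invented for the illustration (they are NOT taken from Table B), and I assume that a value lying exactly on a boundary, such as 3.5, is counted in the interval that starts there.

```python
import math
from collections import Counter

def bin_counts(values, width=0.25):
    """Count the values in each interval [k*width, (k+1)*width), empty ones included."""
    counts = Counter(math.floor(v / width) for v in values)
    first, last = min(counts), max(counts)
    return {k * width: counts[k] for k in range(first, last + 1)}

def text_histogram(values, width=0.25):
    """One line per interval, one star per value in that interval."""
    lines = []
    for left, count in bin_counts(values, width).items():
        lines.append(f"{left:.2f}-{left + width:.2f} | " + "*" * count)
    return "\n".join(lines)

# Invented sample means, for illustration only (NOT from Table B).
example_means = [3.3, 3.6, 3.7, 4.1, 4.1, 4.4, 4.6, 4.8, 5.0, 5.3]
print(text_histogram(example_means))
```

With your own ten sample means in place of `example_means`, the printout should match the histogram you draw by hand.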

A hint: first take a sheet of paper and sum up the digits in each group of five in Table B. So line 101 becomes: 17, 21, 23, 21, 28, 12, 19 and 26. Then calculate the statistic. Actually, this way you can even observe the spread of the n=5 case.
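
The same bookkeeping can be sketched in Python, if you prefer that to a sheet of paper. The group sums below are the ones quoted in the hint for line 101; the function names are hypothetical, and the digit string passed to `digit_group_sums` at the end is a made-up example, not a line of Table B.

```python
GROUP_SIZE = 5

# Sums of the eight five-digit groups of line 101, as quoted in the hint.
group_sums_101 = [17, 21, 23, 21, 28, 12, 19, 26]

def digit_group_sums(digits, group_size=GROUP_SIZE):
    """Turn a string of digits (spaces are ignored) into sums of consecutive groups."""
    clean = [int(ch) for ch in digits if ch.isdigit()]
    return [sum(clean[i:i + group_size]) for i in range(0, len(clean), group_size)]

def mean_of_first(sums, n_digits, group_size=GROUP_SIZE):
    """Mean of the first n_digits digits, computed from the group sums."""
    if n_digits % group_size != 0:
        raise ValueError("n_digits must be a multiple of the group size")
    n_groups = n_digits // group_size
    return sum(sums[:n_groups]) / n_digits

mean_n20 = mean_of_first(group_sums_101, 20)                 # the statistic for n=20
group_means_n5 = [s / GROUP_SIZE for s in group_sums_101]    # eight samples of n=5

print(mean_n20)
print(group_means_n5)
print(digit_group_sums("12345 67890"))   # made-up digits, just to show the grouping
```

The list `group_means_n5` shows the spread of the n=5 case mentioned above: eight means from a single line of the table.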

Another (new) hint: maybe it is even better if you don't do that. Type the first ten digits in a row in Calculator, then note the mean (cf. Question 5). Then type another ten digits, and note the mean of these 20 digits so far (Question 4). Then type the remaining 20 digits in the line, and do the same (Question 5). Now clear the Statistics Box (button CAD), and repeat the same for another line.

Three more tips: before noting the mean, you should check in the Statistics Box whether you have typed the digits correctly, for instance, by checking the value of n. Moreover, you could do this task in pairs: one of you reads the numbers and the other types. Finally, don't use the mouse to click on Dat, but use one finger to type the numbers and another finger to hit the "insert" key on the keyboard, which is a shortcut for Dat.

What happens if you change the sample size?

Question 5: Repeat what you have just done, but using a sample size of 10 and another sample size of 40, instead of 20. That is, use only the first ten digits, and then all 40 digits, of each of lines 101 to 110 in Table B. Use a calculator to obtain the mean and the standard deviation of the sample means for each of the two sample sizes. (It is worth drawing a histogram for yourself, but I don't ask you to hand it in in this case.)
How are the population mean (Question 3), the mean of the sample means with n=20 (Question 4), and the means of the sample means with n=10 and n=40 (Question 5) related to each other? How are the standard deviation of the sample means with n=20 (Question 4) and the standard deviations of the sample means with n=10 and with n=40 (Question 5) related to each other? (Optional task: determine the standard deviation of the population, too.)
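
If you are curious how these quantities behave with many more samples than the ten you collect by hand, here is a minimal simulation sketch in Python. Several things in it are my assumptions rather than part of the task: the seed and the 2000 repetitions are arbitrary choices, the function name is made up, and the samples are drawn with replacement, which mimics reading random digits from Table B rather than interviewing distinct members.

```python
import random
import statistics

# Population as described above: ten cohorts (0 to 9 years) of fifty members each.
population = [years for years in range(10) for _ in range(50)]

def sample_means(n, repetitions, rng):
    """Means of `repetitions` random samples of size n, drawn with replacement."""
    return [statistics.mean(rng.choices(population, k=n)) for _ in range(repetitions)]

rng = random.Random(2024)   # fixed, arbitrary seed so that runs are repeatable
results = {}
for n in (10, 20, 40):
    means = sample_means(n, repetitions=2000, rng=rng)
    results[n] = (statistics.mean(means), statistics.stdev(means))

sigma = statistics.pstdev(population)   # population standard deviation (divides by n)
print("population mean:", statistics.mean(population))
print("population sd:", round(sigma, 3))
for n, (mean_of_means, sd_of_means) in results.items():
    print(f"n={n}: mean of sample means {mean_of_means:.3f}, "
          f"sd of sample means {sd_of_means:.3f}, "
          f"sigma/sqrt(n) {sigma / n ** 0.5:.3f}")
```

Compare the last two columns of the printout with each other, and then with the values you obtained from your ten samples; with only ten samples you should expect a much rougher agreement.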

Task 3: Normal distributions and the use of Standard Normal Tables

The distribution of an IQ test in the population is N(100,16). That is, its mean is mu=100, and its standard deviation is sigma=16.

What percentage of the population has an IQ higher than (or equal to) 92?
What value is the 90th percentile? (That is, what is the IQ value above which we find only 10% of the population?)

Use the Standard Normal Table (Table A in Moore and McCabe).

To answer the first question, transform this Normal distribution to the standard Normal distribution (z = (x-mu)/sigma). So, what z value of the standard Normal distribution corresponds to x=92 in the N(100,16) distribution?

To answer the second question: first solve the problem for a standard Normal distribution, and then transform the z value found to the x value in the N(100,16) distribution. Hint: if z = (x-mu)/sigma, then x = z*sigma + mu.
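
After you have worked through Table A, you can check both transformations with software. The sketch below assumes Python 3.8 or later, whose standard library provides `statistics.NormalDist`; the variable names are my own. It follows the two recipes above step by step, and then verifies that working directly on N(100,16) gives the same results as going through the standard Normal distribution.

```python
from statistics import NormalDist

mu, sigma = 100, 16
iq = NormalDist(mu, sigma)     # the N(100,16) distribution of the task
standard = NormalDist(0, 1)    # the standard Normal distribution of Table A

# First question: standardize x = 92, then take the area to the right of z.
x = 92
z = (x - mu) / sigma
share_above = 1 - standard.cdf(z)

# Second question: find the z with 90% of the area to its left,
# then transform it back with x = z*sigma + mu.
z90 = standard.inv_cdf(0.90)
x90 = z90 * sigma + mu

# The same quantities computed directly on N(100,16), without standardizing.
assert abs(share_above - (1 - iq.cdf(x))) < 1e-12
assert abs(x90 - iq.inv_cdf(0.90)) < 1e-9

print(f"z = {z}, share with IQ >= {x}: {share_above:.4f}")
print(f"z for the 90th percentile = {z90:.2f}, x = {x90:.1f}")
```

Small differences from your table-based answers are to be expected, because Table A lists z only to two decimal places.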

The key is that the area under the standard Normal curve N(0,1) between points z1 and z2 is equal to the area under any Normal curve N(mu, sigma) between points x1 and x2, if z1 = (x1-mu)/sigma and z2 = (x2-mu)/sigma.

In case you have learned calculus: you can prove this by simply changing the variable of integration from x to z in the integral.
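
For those who want to see it written out, here is a sketch of that change of variable, using the usual formula for the Normal density (which this assignment does not otherwise require). With z = (x-mu)/sigma we have dx = sigma dz, and the limits x1 and x2 become z1 and z2:

```latex
\int_{x_1}^{x_2} \frac{1}{\sigma\sqrt{2\pi}}
    \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) dx
= \int_{z_1}^{z_2} \frac{1}{\sigma\sqrt{2\pi}}
    \exp\!\left(-\frac{z^2}{2}\right) \sigma \, dz
= \int_{z_1}^{z_2} \frac{1}{\sqrt{2\pi}}
    \exp\!\left(-\frac{z^2}{2}\right) dz ,
\qquad z_i = \frac{x_i - \mu}{\sigma}.
```

The last integral is exactly the area under the standard Normal curve between z1 and z2.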

Some final notes:

(Added after the lab.)

I suggest you start with Task 1, then do Task 3, and then Task 2. If you really run out of time, omit the n=40 case in Task 2, and collect just 5 samples (both for n=10 and n=20), not 10. Still, I would like you to tell me what behavior you expect based on the Central Limit Theorem.

It is fine to discuss with each other what should be done. I even encourage you to do the data entry in Task 2 in pairs. I also know that data entry is extremely boring... still, this kind of work is part of research, and the earlier you develop some routine or some "strategy" for doing it, the better for you. This, too, is simply part of the everyday practice of statistics.

Concerning file formats (now and in the future): MS Word is fine, pdf is even better but not required. But please avoid Word for Vista (.docx) format, as I am unable to open it. (Save the file using an earlier version of Word.) From next week onwards: simple answers within the body of the email will also do in most of the cases.

In case you send me a file: please make sure your name appears both in the file name and within the file! Imagine how many of you have sent me a file named assignment1.doc. Also imagine how long it takes to find out whose assignment a printed sheet is when the author's name is missing... Thanks!

Further practice recommended before the final test

Assignments of John Nerbonne: part 1 and then part 2. You can expect very similar questions in the final test.