M&M 6.1 (p. 359, or p. 369): the confidence level states the
probability that the method gives a correct answer (namely, that
the unknown population mean *mu* lies within the interval of margin
of error *m* around the observed sample mean *x-bar*).

In other words,
if you repeat the method for many SRS samples drawn from the same
population, in C% of the cases (for C% of the samples) the interval
provided by the method will contain the population mean.
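This "repeated sampling" interpretation can be checked with a small simulation. The sketch below is only illustrative (it is not part of M&M or the labs, and the population parameters mu = 10, sigma = 2 and the sample size n = 25 are made up): it draws many SRSs from a Normal population with known sigma and counts how often the 95% z-interval covers the true mean.

```python
import random
from math import sqrt

random.seed(42)
mu, sigma, n = 10.0, 2.0, 25   # made-up population and sample size
z_star = 1.96                  # critical value for C = 95%
trials, covered = 2000, 0

for _ in range(trials):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    x_bar = sum(sample) / n
    m = z_star * sigma / sqrt(n)      # margin of error (sigma known)
    if x_bar - m <= mu <= x_bar + m:  # did the interval catch mu?
        covered += 1

print(covered / trials)  # fluctuates around 0.95
```

The observed coverage hovers around 0.95, which is exactly what the confidence level promises: the method is correct for about C% of the samples.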

It means that if the null hypothesis were true, then the probability
*P* (the probability of drawing a sample whose statistic is at
least as extreme as the statistic calculated from our present sample)
is lower than the threshold *alpha*.

In other words, it is very unlikely to draw a sample like ours if the null hypothesis were true. Therefore, the data in our sample provide a strong argument against the truth of the null hypothesis. Put yet another way, the data in our sample provide a strong argument in favor of the alternative hypothesis.

If, however, the P-value is larger than *alpha*, then the
evidence is not strong enough for or against either hypothesis.
Collecting more data or altering the null hypothesis can lead to a
stronger conclusion.

The level of significance *alpha* for statistical tests
(similarly to the level of confidence *C* for confidence intervals)
sets the strength of our conclusions. If the consequences of our
conclusions are more serious (e.g., political decisions on life and
death), we certainly need a smaller *alpha* (that is, stronger
evidence before rejecting the null hypothesis).

Yes, I plan to compile a sample exam, etc. But even now you can check the material for the final and the example final that appear on John Nerbonne's page, which is quite similar to what I plan to do, even if some of the emphasis is put elsewhere.

Another piece of information is that the final exam will also mirror what you have done during the labs. Things that you have calculated by hand during labs (mean, quartiles, variance, std. deviations, etc.) will also reappear in the final: I won't ask for the formula, but I will ask you to perform similar calculations with really simple numbers. If there was a task in which you had to use a table yourself, a similar task might occur in the final test as well (I will provide you with the table). However, if the only thing you had to do in the lab was to let SPSS do the job for you, you only need to know when and how to use that procedure (in which cases, under what conditions, what information you need to supply, what the meaning of the software's output is, etc.).

Certainly :) Here it is:

- Descriptive statistics: different measures of center and spread and their properties, different visualization techniques, etc.
- Basics of statistical inference: population vs. sample, parameter vs. statistic, sampling distribution of a statistic, etc.
- Statistical procedures: concept of a confidence interval and concept of a test.
- Distributions: Normal (Gaussian) distribution; distribution families characterized by one or two degrees-of-freedom parameters, such as t, chi-square, F. Use of tables.
- Statistics used for inference: z, t, chi-square, etc. When to apply them? Which distribution do they follow? (Sometimes this is easy, because the statistic and the distribution have the same name, but not always...)
- One-sample procedures vs. two-sample procedures vs. more-sample procedures.
- Specific statistical procedures: z, t, proportions, chi-square, ANOVA, etc.
- More variables: correlation, regression, two-way table, independence.
- Parametric vs. non-parametric tests.

Additionally, in each case: 1. the idea behind the concept; 2. the technical details (learn them, or know where to find them in the book/software); 3. points of caution, criteria for using a test, etc.

Many of you have used Table A from M&M when solving the one-sample t-test in assignment 4. Why?

You have correctly calculated the t-statistic. (Not the z-statistic, as you do not know the population standard deviation.) The t-statistic follows a t-distribution, and it is Table D, not Table A, that contains information on the t-distribution.

First you have to set your alpha level (e.g., 0.05), and then you should look up the corresponding critical value t* in Table D (based on the degrees of freedom). If the t-statistic of your sample is greater than t* (in absolute value), then it is "more extreme" (provided the null hypothesis is true); that is, its p-value is less than alpha, and you can reject the null hypothesis. If, on the other hand, the t-statistic of your sample is less than t*, it is "less extreme", and you do not have sufficient evidence to reject the null hypothesis.

This method does not give you the exact p-value. If you would also like to know the exact p-value, you have several options (as so often in statistics): use software (in SPSS, the value given under "Sig. (2-tailed)"), use tables in more advanced books, or use a Normal approximation. The last option means that you indeed use Table A, but you should then emphasize that this is only a rough approximation. Had the sample size been larger than 30, the approximation would have been more reliable.
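As a quick illustration of the Normal-approximation route (a sketch only, not the procedure from the labs; the t value 2.1 is made up, and the true t-based p-value would be somewhat larger for a small sample), one can turn a statistic into an approximate two-sided p-value using the standard Normal cumulative distribution function:

```python
from math import erf, sqrt

def phi(x):
    """Cumulative distribution function of the standard Normal."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def two_sided_p_normal(stat):
    """Approximate two-sided p-value, treating the statistic as a z-score."""
    return 2.0 * (1.0 - phi(abs(stat)))

# Made-up example: a t-statistic of 2.1 from a largish sample.
print(two_sided_p_normal(2.1))   # roughly 0.036
```

This is the same computation Table A encodes; for small samples, Table D (or software) remains the reliable route.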

Suppose you are working on some phenomenon, and you have a
hypothesis. Let's call it *my hypothesis*. In order to
support it, you need to collect data that are consistent with
this view and cannot be due to another hypothesis. Therefore,
you formulate a *null hypothesis*, which you hope to be able
to reject. (The *power* of a statistical test tells you
what is the chance of doing so, if in fact the null hypothesis is
false.) The *alternative hypothesis* is the negation of
the null hypothesis, more or less the same as "my hypothesis".

Importantly, all statistical procedures work only with well-defined null hypotheses: A is equal to B, two variables are independent, etc. So you cannot have a null hypothesis of the form "A is less than or equal to B". Yet, most often you will not collect a sample in which A is absolutely equal to B. The question is whether the difference is due to chance, or to something systematic.

A statistical test is always about calculating the p-value:

what is the probability of drawing a random sample whose statistic is as extreme as, or more extreme than, the statistic calculated from our sample, supposing that the null hypothesis is true?

The p-value is a measure of how extreme our sample is with respect to the null hypothesis; in other words, of how frequently one would draw such a sample if the null hypothesis is true. If p is very low, then we can be quite confident that the null hypothesis is false; that is, our sample does not result from chance interfering with what one would expect based on the null hypothesis. Rather, the truth is different from the null hypothesis.

The *sampling distribution of the statistic* will tell
you what is the chance of drawing a sample whose statistic is
"as extreme as, or more extreme than" the statistic calculated
from your sample.

Some statistics can take only nonnegative values; this is the case for the chi-square statistic and the F statistic of ANOVA. If the statistic has its lowest "typical" value (around 1 for F), then the sample is "most common", provided that the null hypothesis is true, so you cannot reject the null hypothesis. The higher the statistic, the further away you are from the null hypothesis. See the pictures above Tables E and F in M&M. Therefore, in these cases we can only speak of a "one-sided" test.

Yet, the mean of a sample can be either higher or lower than the population mean. So "being extreme" in that case (supposing the null hypothesis is true) can happen both if the sample has an unusually large mean and if it has an unusually small mean. This corresponds to a two-sided test. If your "my hypothesis" is simply that the two populations have different means, then you use a two-sided test.

In other cases, "my hypothesis" says that A is
*significantly larger* than B. In that case, I run a double
risk when I run the experiment: not only are cases in which A is
(almost) equal to B a counterargument to my hypothesis, but so
are cases in which A is less than B. On the other hand, I can
then use a one-sided test: the extreme samples that support
the alternative hypothesis against the null hypothesis are only
those in which A is larger than B, so I will obtain lower p-values.
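The same point in numbers: for a given statistic, the one-sided p-value is exactly half of the two-sided one, which is why a (justified) one-sided test reaches significance more easily. A minimal sketch with a made-up z = 1.8, using the standard Normal cdf:

```python
from math import erf, sqrt

def phi(x):
    """Cumulative distribution function of the standard Normal."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

z = 1.8                                  # made-up statistic; H_a: A > B
p_one_sided = 1.0 - phi(z)               # extreme = "A much larger than B" only
p_two_sided = 2.0 * (1.0 - phi(abs(z)))  # extreme on both sides

print(p_one_sided, p_two_sided)  # about 0.036 and 0.072
```

At alpha = 0.05 this made-up sample is significant one-sided but not two-sided, which is exactly the "lower p-value" advantage described above.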

Software will return three kinds of df in an ANOVA. Here they are, with the names used by SPSS and by M&M:

- **Between Groups df / DFG:** the number of groups minus 1. For instance, if you have 4 groups, DFG = 3.
- **Within Groups df / DFE:** the sum of the degrees of freedom within each group. The degrees of freedom of a single group is the number of cases (individuals) within that group minus 1. For instance, if your four groups contain 9, 11, 13 and 13 subjects respectively, then the corresponding degrees of freedom are 8, 10, 12 and 12. Consequently, DFE = 8 + 10 + 12 + 12 = 42.
- **Total df / DFT:** the number of individuals involved in the experiment (all groups in total) minus 1. In our case, there are 9 + 11 + 13 + 13 = 46 individuals, so DFT = 45. Importantly, DFT = DFG + DFE. Indeed, 45 = 3 + 42.

Why is it important? Because the mathematical-statistical
procedure behind ANOVA requires it. The statistic on which ANOVA
is based is a ratio (MSG/MSE), and its sampling distribution
follows an F-distribution whose degrees of freedom are DFG and DFE:
in short, that is an *F(DFG, DFE)* distribution.
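The bookkeeping above is easy to script. A minimal sketch using the group sizes from the example (9, 11, 13, 13):

```python
group_sizes = [9, 11, 13, 13]            # the four groups from the example

dfg = len(group_sizes) - 1               # Between Groups df: k - 1
dfe = sum(n - 1 for n in group_sizes)    # Within Groups df: sum of (n_i - 1)
dft = sum(group_sizes) - 1               # Total df: N - 1

print(dfg, dfe, dft)                     # 3 42 45
assert dft == dfg + dfe                  # the identity DFT = DFG + DFE
```

These are the two numbers you need to pick the right column and row when reading the critical value off an F table.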

I am sorry for my delay. In the **fifth assignment**, the
last few questions were the most difficult ones, because there
you were already expected to work out for yourselves which statistical
procedure to use. The solution is to combine the two procedures
used just before: run a two-sample t-test on the variable IMPROVEMENT.
This way you will know whether there is a significant difference
between the averages in improvement of the two groups.

In the **sixth assignment** there are no clear "rules" to
define which ways of visualization are "good", and which are "bad".
It depends on what you want to argue for. In the future you will
learn for yourself what good practice is: not to put too little or too much
information on a single figure, not to use colors in an article (even
if the journal allows for colors, people might photocopy your article),
use contrastive colors in a presentation, etc. The point of the
assignment was to motivate you to think about these questions and
to let you experiment with visualization.

Some additional explanation concerning questions 6 and 7. In
both cases you wanted to compare the proportion observed to a
fixed value: is the proportion you observed significantly different
from 0.5000 (in question 6), or from 0.4700 (in question 7)? In the
first case it turns out that your sample is too small to argue that
the proportion observed is not 50% + random variation/fluctuation.
It is safe to round it off to 0.50. You use a two-sided test because
you want to know if the observed proportion is *different* from
0.5000. In the second case, however, you can use a one-sided test,
because the question was whether the observed proportion is significantly
*larger* than the national average.

How do you get these results using the software mentioned? The problem is that the software pointed to is unable to perform a one-sample test for a proportion ("given the observed proportion in a sample, is the population proportion equal to 0.5000/0.4700?"). Rather, you have to present this question as if it were a two-sample test: "is the proportion in the population from which we have drawn our sample different from the proportion in another population?" The trick is to define a fictitious second sample whose proportion is 0.5000 (then, 0.4700) and which is huge, say 100 or 1000 times larger than our sample. As this second sample is huge, its proportion approximates the proportion in the second population very well (at least, much better than the way the first sample reflects the first population). Therefore, we can "force" the software to compare our sample (and the population from which it has been drawn) to another, fictitious "sample/population" whose proportion is 0.5000 (and then 0.4700). I hope this note clarifies the trick.

This trick of performing a one-sample test as if it were a two-sample test with a huge second sample may be useful for you in the future, even beyond testing proportions. M&M 9.3 uses the same trick: it employs a chi-square test with a second, fictitious row to test goodness of fit.
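To see in numbers why the huge fictitious sample works, compare the one-sample z-statistic with the pooled two-sample z-statistic computed against a second "sample" that is 1000 times larger and has proportion exactly p0. The counts below (31 successes out of 50, tested against 0.4700) are made up for illustration; the two statistics nearly coincide:

```python
from math import sqrt

def one_sample_z(x, n, p0):
    """z-statistic for H0: the population proportion equals p0."""
    p_hat = x / n
    return (p_hat - p0) / sqrt(p0 * (1 - p0) / n)

def two_sample_z(x1, n1, x2, n2):
    """Pooled z-statistic for H0: the two population proportions are equal."""
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (x1 / n1 - x2 / n2) / se

x, n, p0 = 31, 50, 0.47            # made-up observed sample and null value
n2 = 1000 * n                      # huge fictitious second sample
x2 = round(p0 * n2)                # its proportion is exactly 0.47

print(one_sample_z(x, n, p0))      # about 2.13
print(two_sample_z(x, n, x2, n2))  # about 2.12 -- nearly the same
```

The huge second sample makes both the pooled proportion and the standard error collapse to their one-sample values, so the software's two-sample answer approximates the one-sample test very closely.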

The **seventh assignment** should have been straightforward.
The ANOVA is significant, but only the two extreme groups have
a mean that is different at the 0.05 level. The increase in the
proportion of women receiving a Nobel prize is, unfortunately, not
significant. Let's return to this second question in 30 years...

Yes, I would love to... I think there are formal evaluation forms for you to fill in at the end of the semester. But I would also appreciate it if you sent me any remarks (even anonymously). If you want to keep it truly anonymous, you can collect the remarks, send them to Julia Klitsch (or anybody else), and ask her to forward them to me.