2008-12-10

Learning Statistics

[Image: Student's t-distribution, via Wikipedia]

Learning statistics is tough. How can we be sure about a population's distribution from the distribution of sample means? The answer is "not really", but statistics does give us a unique way of measuring the unknown, based on assumptions.

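The distribution of the sample mean is approximately normal, provided the sample size is big enough (this is the central limit theorem). A normal distribution is defined by only two parameters: mean and variance. The mean of the sample mean distribution is the same as the constant value, the population mean, and its variance is the population variance divided by the sample size. This is guaranteed by averaging out.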
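To see the averaging-out claim in action, here is a minimal simulation sketch. The population (an exponential distribution with mean 1 and standard deviation 1), the sample size of 50, and the function name `sample_means` are all my own assumptions for illustration; the point is only that a clearly non-normal population still gives sample means that centre on the population mean with a spread near the population standard deviation divided by the square root of the sample size.

```python
import math
import random
import statistics


def sample_means(n, trials=10000, seed=0):
    """Draw `trials` samples of size n from Exponential(1); return each sample's mean.

    Exponential(1) is an assumed, deliberately skewed population
    with mean 1 and standard deviation 1.
    """
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.expovariate(1.0) for _ in range(n))
        for _ in range(trials)
    ]


n = 50
means = sample_means(n)

centre = statistics.fmean(means)   # should be close to the population mean, 1.0
spread = statistics.stdev(means)   # should be close to 1 / sqrt(50), about 0.141
expected_spread = 1.0 / math.sqrt(n)

print(f"mean of sample means: {centre:.3f}")
print(f"sd of sample means:   {spread:.3f} (theory: {expected_spread:.3f})")
```

Changing `n` shows how the spread shrinks as the sample size grows, while the centre stays put.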

Once we know the sample mean distribution is normal, we can use the Z-value: the deviation of the sample mean from the population mean, measured with the SE (standard error, the population standard deviation divided by the square root of the sample size) as the unit of the yardstick. The Z-value tells us the probability that the sample mean falls within reach of that yardstick. If the Z-value is +/-1.96, the chances are 95%. This seems fair, as any normal distribution is defined by its mean and variance. Here come my questions:

Q1. How does the sample mean distribution guarantee the shape of the normal distribution by averaging out?
Q2. How come a normal distribution can be defined by only two parameters?
Q3. Where does the number 1.96 associated with 95% come from?
Q4. How can we state, “the sample is big enough” universally without knowing the real shape of the population distribution?
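A partial sketch toward Q2 and Q3, using Python's standard `statistics.NormalDist`. The number 1.96 is the point on the standard normal curve that leaves 2.5% in the upper tail (and, by symmetry, 2.5% in the lower tail), so 95% sits in between. The second normal distribution below (mean 100, standard deviation 15) is an arbitrary assumption of mine, there only to show that the same yardstick works once the mean and standard deviation are known.

```python
from statistics import NormalDist

standard = NormalDist()  # mean 0, standard deviation 1

# Q3: the Z-value that leaves 2.5% in each tail, i.e. 95% in the middle.
z_crit = standard.inv_cdf(0.975)          # about 1.96

# Going the other way: the probability of landing within +/-1.96.
coverage = standard.cdf(1.96) - standard.cdf(-1.96)   # about 0.95

# Q2: any other normal is the same curve, shifted and stretched.
# mu=100 and sigma=15 are assumed values for illustration only.
other = NormalDist(mu=100, sigma=15)
other_coverage = other.cdf(100 + 1.96 * 15) - other.cdf(100 - 1.96 * 15)

print(f"z for 95%: {z_crit:.4f}")
print(f"coverage within +/-1.96 (standard): {coverage:.4f}")
print(f"coverage within +/-1.96 sd (other): {other_coverage:.4f}")
```

This does not settle Q4: how big "big enough" is depends on how far the population is from normal, which the code above takes for granted.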

The T-value (Student's t) is used when we do not know the population variance and replace it with the sample variance to construct the yardstick. For a 95% chance, the T-value varies with the degrees of freedom (d.f.), the amount of information available, which depends on the sample size. For d.f. 120, the T-value is 1.98, slightly bigger than the Z-value because we use a less trustworthy yardstick, the sample variance, rather than the population variance. Again, here come my questions:

Q5. Where does the number 1.98 come from?
Q6. What accounts for the difference of 0.02 at d.f. 120?
Q7. Why is the divisor of the sample variance always “n–1” to adjust the measure of spread? And how can we be assured that –1 is always sufficient for any sample size?
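A rough sketch toward Q5 to Q7, using only the standard library. The first part finds the 95% T-value by numerically integrating the Student's t density (Simpson's rule) and bisecting; a statistics package would normally do this for you, and the helper names `t_pdf`, `t_cdf` and `t_crit` are my own. The second part is a simulation, with an assumed normal population of variance 4 and samples of size 5, comparing the divisor n with the divisor n-1.

```python
import math
import random


def t_pdf(x, df):
    """Density of Student's t with `df` degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def t_cdf(x, df, steps=2000):
    """CDF for x >= 0: one half plus Simpson's rule on [0, x] (steps must be even)."""
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3


def t_crit(df, p=0.975):
    """T-value leaving 1 - p in the upper tail, found by bisection on [0, 50]."""
    lo, hi = 0.0, 50.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def average_variance(offset, n=5, trials=20000, seed=1):
    """Average of sum((x - mean)^2) / (n - offset) over many samples.

    The population is an assumed normal with standard deviation 2 (variance 4).
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        xs = [rng.gauss(0.0, 2.0) for _ in range(n)]
        m = sum(xs) / n
        total += sum((x - m) ** 2 for x in xs) / (n - offset)
    return total / trials


# Q5/Q6: the T-value shrinks toward the Z-value (about 1.96) as d.f. grows.
for df in (5, 30, 120):
    print(f"d.f. {df:>3}: t = {t_crit(df):.4f}")

# Q7: dividing by n underestimates the true variance of 4 on average;
# dividing by n - 1 lands close to it.
with_n = average_variance(offset=0)
with_n_minus_1 = average_variance(offset=1)
print(f"divisor n:   {with_n:.3f}")
print(f"divisor n-1: {with_n_minus_1:.3f}")
```

The -1 reflects that the deviations are measured from the sample mean, which is itself estimated from the same data, so one piece of information is used up before the spread is measured. That is also why the d.f. for a single sample is n-1.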
