## Statistics Lecture #1, Errata – examples of bias

### Standard deviation

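The standard deviation of a collection of numbers $x_1, \dots, x_n$ is defined as

$$\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2}$$

where $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ is the mean of the numbers.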

However, suppose that, from a sample of n objects, we want to predict the standard deviation of the population from which they come. Then, the above formula is biased. (That much I got right.)

The easiest way to see the problem is to consider the case n=1. In that case the formula amounts to the square root of 0/1, which is 0. That can't be right!

The way to fix the problem is to divide by n-1 rather than by n. Then, when n=1, you get the indeterminate form 0/0, which is reasonable in that it correctly represents the fact that a sample of only one element gives no information whatever as to the standard deviation of a population.
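The effect of the divisor can be checked exactly on a toy case. The sketch below is only an illustration under stated assumptions: the two-value population `[0, 1]` and the helper name `mean_variance_estimate` are hypothetical, and samples are drawn with replacement. It works with the variance (the square of the standard deviation), since that is the quantity for which the n-1 correction is exact, and averages the estimate over every equally likely sample.

```python
from fractions import Fraction
from itertools import product


def mean_variance_estimate(population, n, divisor_offset):
    """Average, over every equally likely sample of size n drawn with
    replacement, of sum((x - mean)^2) / (n - divisor_offset)."""
    total = Fraction(0)
    count = 0
    for sample in product(population, repeat=n):
        mean = Fraction(sum(sample), n)
        squares = sum((x - mean) ** 2 for x in sample)
        total += squares / (n - divisor_offset)
        count += 1
    return total / count


# Hypothetical population: values 0 and 1, equally likely (true variance 1/4).
population = [0, 1]
for n in (2, 3, 4):
    divide_by_n = mean_variance_estimate(population, n, 0)
    divide_by_n_minus_1 = mean_variance_estimate(population, n, 1)
    print(n, divide_by_n, divide_by_n_minus_1)
```

For this population, dividing by n gives an average that falls short of 1/4 for every sample size tried, while dividing by n-1 gives exactly 1/4 each time.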

I incorrectly gave n+1 instead of n-1 in the heat of the lecture, which is nonsense. (Did I also omit the symbol?)

### Heterozygosity

The "heterozygosity" example was at best very unclear. The intended idea is this:

Suppose that, at a particular locus, nature provides two alleles, A1 and A2, with frequencies p and q=1-p, which, unknown to us, are both ½. We assume Hardy-Weinberg equilibrium (HWE) for the locus in question, so the rate of heterozygosity, h=2pq, is also ½, although we don't know that number either.

We propose to estimate h = the rate of heterozygosity, by a sampling experiment. The experiment consists of examining a sample of size n. Now, if we were to calculate the proportion of heterozygotes in the sample (by counting them and dividing by n), we would get an estimate of the population heterozygosity that would be sometimes too large and sometimes too small, but on the average would be just right. If we did that, there would be no story.

What we decide to do instead is to take advantage of our assumption of HWE and to estimate the population heterozygosity based on the allele frequencies in the sample. (In principle this ought to give a more accurate estimate of h.)

However, this idea will give a biased answer if we apply it in the obvious way. To see the problem, let's consider as an example the case of n=1; a sample of only one person.

The sample will have one of three genotypes:

• Case A1A1 – In this case, which occurs 1/4 of the time, we would estimate the gene frequencies as p=1 and q=0, hence h=2pq=0. An underestimate.
• Case A1A2 – In this case, which occurs 1/2 of the time, we would estimate the gene frequencies correctly as p=q=½, and hence would estimate h=2pq=½ correctly as well.
• Case A2A2 – This case, which also occurs 1/4 of the time, works out the same as the first case: h is underestimated as 0.

Net result: When the sample size is n=1, the value of h will sometimes be underestimated, and never overestimated. Therefore it will on average be underestimated: the average estimate is ¼·0 + ½·½ + ¼·0 = ¼, half the true value of ½.

That proves nothing, because I have only examined the case n=1. However, I hope it does give some insight into what might happen (and what I claim does happen) for larger sample sizes and for other values of p. When the sample size is larger, there will be some samples that result in an overestimate of h. But the more typical situation is the one illustrated here; on the average, h is underestimated.
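The claim for larger samples can be checked by exact computation rather than by argument. The following is a minimal sketch, assuming that under HWE a sample of n people amounts to 2n independent allele draws, so the number of A1 alleles is binomial with parameters 2n and p. The function name `expected_h_estimate` is hypothetical. It computes the average value of the estimate 2·p̂·q̂, where p̂ is the sample frequency of A1.

```python
from fractions import Fraction
from math import comb


def expected_h_estimate(n, p):
    """Exact average of 2 * p_hat * q_hat over all samples of n people,
    assuming HWE (2n independent allele draws, each A1 with probability p)."""
    alleles = 2 * n
    q = 1 - p
    total = Fraction(0)
    for k in range(alleles + 1):  # k = number of A1 alleles in the sample
        prob = comb(alleles, k) * p**k * q ** (alleles - k)
        p_hat = Fraction(k, alleles)
        total += prob * 2 * p_hat * (1 - p_hat)
    return total


p = Fraction(1, 2)
true_h = 2 * p * (1 - p)
for n in (1, 2, 10, 100):
    print(n, expected_h_estimate(n, p), true_h)
```

With p=½ and n=1 this reproduces the average of ¼ from the three cases above. For larger n the average moves closer to the true h=½ but stays below it, and trying other values of `p` (passed as a `Fraction`) shows the same shortfall.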
