Allele size | count or popularity |
---|---|
8 | 18 |
9 | 16 |
10 | 28 |
11 | 21 |
12 | 14 |
13 | 2 |
14 | 1 |
The table above is a typical reference sample for some STR locus. It shows each allele size and the number (count) of observations of that allele. Let's just focus on the count or popularity column, and suppose we examine many such tables. What will be the most popular number to appear as a count?
m = multiplicity of popularity p within a database | # of databases with m types of popularity p: p=1 (singletons) | p=2 (doubletons) | p=3 (tripletons) | fraction of databases with m types of popularity p: p=1 | p=2 | p=3 |
---|---|---|---|---|---|---|
0 | 296 | 467 | 547 | 0.37 | 0.58 | 0.68 |
1 | 282 | 237 | 213 | 0.35 | 0.3 | 0.27 |
2 | 121 | 75 | 35 | 0.15 | 0.09 | 0.04 |
3 | 48 | 14 | 4 | 0.06 | 0.02 | 0.005 |
4 | 31 | 6 | 2 | 0.04 | 0.007 | 0.002 |
5 | 11 | 1 | | 0.01 | 0.001 | |
6 | 4 | 1 | | 0.005 | 0.001 | |
7 | 5 | | | 0.006 | | |
8 | 3 | | | 0.004 | | |
αp = total p counts across 801 databases | α1=930 | α2=464 | α3=303 | 1.0 | 1.0 | 1.0 |
p·αp = total chromosomes accounted for | 1·α1=930 | 2·α2=928 | 3·α3=909 | ← equal under the Law | | |
I examined a large collection of mostly published STR reference "databases", or population samples of moderate size, and tabulated 801 of them, each having from 100 to 1000 chromosomes (observations). Singletons (allelic types with a count of one) are by a large margin the most popular, occurring 930 times in total among the 801 databases. 63% of the databases had one or more singletons, and on average there were 1.16 singletons per database. I suggest the word popularity for the number of times something has occurred. A singleton means an allelic type of count or popularity p=1 in a database. If we denote by αp the number of allelic types of popularity p found in the dataset, then we can say that α1=930 is the popularity of singletons, and that singletons are very popular. Obviously these 930 singletons represent 930 (fragments of) chromosomes.
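To make the definition of αp concrete, here is a minimal sketch that tallies it for the single example sample in the first table. The function name `popularity_spectrum` is a hypothetical helper chosen for illustration, not something from the study.

```python
from collections import Counter

# Count column of the example reference sample (allele sizes 8 through 14).
counts = [18, 16, 28, 21, 14, 2, 1]

def popularity_spectrum(counts):
    """Return {p: alpha_p}: how many allelic types were observed exactly p times."""
    return dict(Counter(counts))

alpha = popularity_spectrum(counts)
print(alpha.get(1, 0))  # singletons in this one sample
print(alpha.get(2, 0))  # doubletons in this one sample
# Every observation belongs to exactly one type, so sum(p * alpha_p) is the sample size.
print(sum(p * a for p, a in alpha.items()))
```

Summing such tallies over all 801 databases would give the overall α1, α2, α3 quoted in the text.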
Doubletons — types of count p=2 — had a total popularity of α2=464 among the 801 databases. Since each doubleton represents two observations, in total they account for 2·α2=928 chromosomes, nearly the same as the singletons. And the 303 tripletons represent a similar number, 3·α3=909, of total observations.
All of which suggests Brenner's Law, the rule that
The number p·αp of alleles represented by database popularity p is constant over p.
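The totals behind this statement can be rechecked from the multiplicity table. The sketch below re-derives αp and p·αp from the per-database counts; the variable and function names are mine, and the numbers are copied from the table.

```python
# databases_with_m[p][m] = number of the 801 databases that contain
# exactly m allelic types of popularity p (copied from the table).
databases_with_m = {
    1: [296, 282, 121, 48, 31, 11, 4, 5, 3],
    2: [467, 237, 75, 14, 6, 1, 1],
    3: [547, 213, 35, 4, 2],
}

def alpha_p(row):
    """Total number of types of a given popularity, summed over all databases."""
    return sum(m * n_databases for m, n_databases in enumerate(row))

for p, row in databases_with_m.items():
    a = alpha_p(row)
    # popularity, databases tabulated, alpha_p, chromosomes accounted for
    print(p, sum(row), a, p * a)
# 1 801 930 930
# 2 801 464 928
# 3 801 303 909
```

The last column is the quantity the Law says should be constant: 930, 928 and 909.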
How well does it hold up? Look at the dotted line in the image at right. It's not highly accurate; let's call it a rule of thumb. It's moderately supported by the data shown, but it is also suggested by more than the data presented here. I did an earlier study based on RFLP markers; they conform more closely. Most importantly, there is a theoretical underpinning. In fact I first investigated this distribution to compare STR markers with Ewens's sampling distribution for the ideal situation of "infinite alleles." Brenner's Law follows from Ewens's formula in the limit as the mutation rate goes to zero. Of course STRs violate all of the assumptions of the infinite alleles model with zero mutation,
so we cannot expect accuracy. But the main reason the data don't conform to the Law is point #1: the fact of convergent mutation for STRs is an influence towards common types. Point #2 compensates somewhat: a very high mutation rate, such as exists for Y-haplotypes, discourages common types.
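The limiting argument can be checked numerically. The sketch below uses the standard expectation under Ewens's sampling formula, E[αp] = (θ/p) · n!/(n−p)! · Γ(n−p+θ)/Γ(n+θ). Here θ (the scaled mutation parameter) and n (the sample size) are symbols I am introducing, and the values 200 and 0.01 are arbitrary choices for illustration, not taken from the study.

```python
from math import exp, lgamma, log

def ewens_expected_alpha(p, n, theta):
    """E[alpha_p] for a sample of n chromosomes under Ewens's sampling formula.

    Valid for 1 <= p < n and theta > 0; computed in log space for stability.
    """
    log_value = (
        log(theta) - log(p)
        + lgamma(n + 1) - lgamma(n - p + 1)          # n! / (n - p)!
        + lgamma(n - p + theta) - lgamma(n + theta)  # Gamma(n-p+theta) / Gamma(n+theta)
    )
    return exp(log_value)

n, theta = 200, 0.01  # illustrative values only
for p in (1, 2, 3, 10):
    # p * E[alpha_p] / theta tends to n / (n - p) as theta -> 0,
    # which is close to 1 whenever p is small compared with n.
    print(p, p * ewens_expected_alpha(p, n, theta) / theta, n / (n - p))
```

For small θ and p much smaller than n, p·E[αp] is therefore nearly the same for every p, which is the statement of the Law.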
The general point is that nature strongly favors rare alleles.
Brenner's Law is an observation about the comparative prevalence of rare and of common forensic STR allelic variants.
It is an example of a "frequency spectrum": the distribution of frequencies that we can expect nature to deal to us. It says that for any given locus, population, and small frequency range f±ε, the probability that an allelic type with frequency in the range f±ε exists is double the probability that an allelic type with frequency in the range 2f±ε exists, and so on.
A frequency spectrum is thus a prior probability distribution for allele frequencies, and it can be used to impute a match probability via Bayes' theorem.
One way to sample an allelic type at random is to list all the allelic types, for example those that are represented one or more times in a sample, then select at random from the list.
Under this sampling rule a singleton type in the sample has the same chance to be chosen as a common type. Hence this can be thought of as sampling according to the distribution of αp. This is the sampling experiment I have in mind when I say that the most likely allelic popularity is singletons, and that rare alleles are prevalent.
Assuming Brenner's Law, this sampling will choose a singleton twice as frequently as a doubleton, thrice as frequently as a tripleton, and so on.
Another way to think of random sampling is to choose a chromosome at random from all those counted in compiling the population sample. Brenner's Law predicts that if we sample by this method and then ask for the popularity p of the allelic type thus obtained in its database, all choices for p are equally likely.
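The difference between the two sampling rules can be seen on the example sample from the first table. This is a minimal sketch with hypothetical function names. A single sample cannot demonstrate the Law itself; it only shows how the two rules weight the same data.

```python
from collections import Counter
from fractions import Fraction

# Count column of the example reference sample (100 chromosomes, 7 allelic types).
counts = [18, 16, 28, 21, 14, 2, 1]

def popularity_by_type(counts):
    """P(popularity = p) when an allelic type is drawn uniformly from the list of types."""
    n_types = len(counts)
    return {p: Fraction(a, n_types) for p, a in Counter(counts).items()}

def popularity_by_chromosome(counts):
    """P(popularity = p) when a chromosome is drawn uniformly from the whole sample."""
    n_chromosomes = sum(counts)
    return {p: Fraction(p * a, n_chromosomes) for p, a in Counter(counts).items()}

by_type = popularity_by_type(counts)
by_chromosome = popularity_by_chromosome(counts)
print(by_type[1], by_chromosome[1])    # the singleton: 1/7 versus 1/100
print(by_type[28], by_chromosome[28])  # the commonest type: 1/7 versus 7/25
```

The first rule weights popularity p by αp and the second by p·αp. Under the Law, pooled over many databases, the first rule favours singletons and the second makes every p equally likely.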