In the 2007 committee publication
ISFG: Recommendations on Biostatistics in Paternity Testing
we (Gjertson, Brenner, et al.) included this recommendation:
R2 Population Genetics
R2.1 Allele probabilities
The probability of observing an allele, i, can be estimated as:
(xi+1) / (N+1)
where xi is the number of i alleles and N is the total number of alleles in the existing database.
The relevant probability of observing an allele is its conditional
probability given observation among tested individuals. The database
sample frequency of xi/N, ignoring a new observation in a tested
trio, is regularly biased toward paternity. Extending the database
with one extra observation is a simple and nearly accurate procedure
to overcome the bias.
[The above is the part I want to discuss here.]
[The rest of the recommendation reads as follows:]
In particular, occasionally, a new allele not
present in a reference database is observed in routine testing. Then
the formula reduces to 1/(N+1) since the marker went unobserved among
the N previous alleles in the database. Additionally, laboratories
may choose to follow a minimum count policy, such as the NRC II
recommendation of a minimum numerator of 5.
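As a minimal sketch of the recommended estimate, here is the arithmetic in Python. The function name `allele_probability` and the parameter names `x` and `n` are my own illustrative choices, not part of the recommendation, and the optional minimum count policy is deliberately left out.

```python
from fractions import Fraction


def allele_probability(x: int, n: int) -> Fraction:
    """Estimate (x + 1) / (n + 1): the database extended by one observation.

    x is the database count of the allele and n is the total number of
    alleles in the database (illustrative names, assumed for this sketch).
    """
    if n < 0 or not 0 <= x <= n:
        raise ValueError("need 0 <= x <= n")
    return Fraction(x + 1, n + 1)


# An allele seen 3 times among 400 database alleles.
print(allele_probability(3, 400))  # 4/401
# A new allele, never seen in the database: the formula reduces to 1/(N+1).
print(allele_probability(0, 400))  # 1/401
```

Exact fractions are used only to keep the small numbers readable; floating point would serve equally well in practice.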
There have been objections and questions about this formula.
For example, I've been asked why not +2 in the denominator, i.e. a formula of
(x+1) / (N+2)
I partially understand the sense of that. If the +1 in the recommendation is to condition
the probability on the observed instance of the paternal allele in the child, then wouldn't
it be more accurate to condition also on the non-instance of that allele as the child's
maternal allele?
Let's lay that out more explicitly. Here's a good way to look at the paternity question:
- A mother, her child, and an alleged father present for paternity testing.
- Put the man aside for the moment. Don't observe his type.
- Determine the genetic (i.e. DNA) types of mother and child at some locus.
Suppose the child type is PQ and from comparison with the
mother we can see that Q is the paternal type.
- At this moment we ask the question: Suppose a non-father is tested for
paternity. What is the probability that a (particular) allele of his will
be a Q?
- Before this case, our experience was a population study of size N
of which x were type Q. However, the additional
observation of Q just noticed in the child is as
good as any other, so we should toss it into the population study. That
is, our experience to date is having observed x+1
Q's out of a total of N+1 observations.
- ... or maybe out of a total of N+2 observations if we also notice
the other, non-Q in the child, the maternal P.
So I agree that the N+2 formula isn't illogical.
However it seems to me an unnecessary complication for negligible benefit.
As we stated in the recommendation, the formula is
"simple and nearly accurate".
Suppose we don't stop at the earliest step, and try to incorporate some complications.
Here are some examples of the tangled morass we can enjoy.
- The information of the other alleles from the reference mother and child
which are ignored in coming up with the recommended formula is on average
- The difference between the recommended formula and the
N+2 formula is about 1/N², a very small number
(in the direction that the recommended
formula is slightly "conservative", favoring non-paternity).
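To put a number on that difference, here is a small sketch comparing the two estimates. The gap works out to exactly (x+1)/((N+1)(N+2)), which is of order 1/N² when x is small. The function names and the sample database sizes are assumptions chosen only for illustration.

```python
from fractions import Fraction


def plus_one(x: int, n: int) -> Fraction:
    """Recommended estimate (x + 1) / (N + 1)."""
    return Fraction(x + 1, n + 1)


def plus_two(x: int, n: int) -> Fraction:
    """Alternative estimate (x + 1) / (N + 2)."""
    return Fraction(x + 1, n + 2)


def gap(x: int, n: int) -> Fraction:
    """Recommended minus alternative; always positive."""
    return plus_one(x, n) - plus_two(x, n)


# Hypothetical (x, N) pairs; the last column rescales the gap by N squared.
for x, n in [(0, 400), (4, 400), (0, 2000)]:
    d = gap(x, n)
    print(f"x={x} N={n} gap={float(d):.3e} gap*N^2={float(d * n * n):.3f}")
```

The gap is positive, so the recommended formula gives the slightly larger allele probability.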
Before worrying about so small a numerical discrepancy, it would be well
instead to consider the very basis, the mathematical philosophy,
of inferring a probability
(or of estimating a population frequency if you prefer to think of it that
way) from a sample frequency. Bridging the gap requires introducing another
modelling step, making an assumption about the prior expected distribution
of allele frequencies in the population (the expected
The idea that sample frequency is an unbiased estimate for population
frequency is not built into the universe. Rather, natural and obvious-seeming
as this assumption is, it is at best only approximately true and examples
could be given where it is far from true. It is probably a reasonably
close estimate for individual DNA STR loci, but the error is surely greater
Where do you stop?
The formula given is simple and reasonably accurate. Nothing is completely accurate.
- The N+2 formula
From the observation that the child type is PQ, we have in total
N+2 observations of which x+1 are
of allele Q. Hence
the sample frequency is given by the N+2 formula above.
- Mother PR, Child PQ
But wait, we have observations of the mother type. Why not
(x+1) / (N+3)
- Mother PQ, Child QQ
Now there are two instances of Q in the family. No problem,
you may say; just put +2 in the numerator:
(x+2) / (N+3)
- Mother PQ, Child PQ
That was fine, but how about this case where maternal and paternal alleles are not
known? We'll need probability estimates for both P and Q.
Which gets +1 and which gets +2?
Don't give up! There must be a probabilistic answer. Let's put
Pr(P | Mother, Child types) = (xp+j) / (N+3)
Pr(Q | Mother, Child types) = (xq+k) / (N+3)
Maybe it is reasonable to take j and k to be proportional to
xp and xq, giving rise to formulas like
Pr(Q | Mother, Child types) =
xq[1+3/(xq+xp)] / (N+3).
Is that attractive?
- Body identification using three siblings who are PQ,
PQ, and PR
Try this as an exercise. I think it will involve square roots.
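Going back to the Mother PQ, Child PQ case, the proportional split can be sketched as follows. I take j + k = 3, which is what the closed form xq[1+3/(xq+xp)] / (N+3) implies; the function name `split_probabilities` and the example counts are hypothetical.

```python
from fractions import Fraction


def split_probabilities(xp: int, xq: int, n: int):
    """Return (Pr(P), Pr(Q)), sharing 3 extra observations pro rata.

    j and k are taken proportional to the database counts xp and xq,
    with j + k = 3 (an assumption read off the closed-form expression).
    """
    if xp + xq == 0:
        raise ValueError("proportional split is undefined when xp + xq == 0")
    j = Fraction(3 * xp, xp + xq)
    k = Fraction(3 * xq, xp + xq)
    return (xp + j) / (n + 3), (xq + k) / (n + 3)


# Hypothetical database: 10 P's and 20 Q's among 400 alleles.
pr_p, pr_q = split_probabilities(10, 20, 400)
print(pr_p, pr_q)  # 11/403 22/403
```

Note that the split breaks down when neither allele is in the database, which is one more complication of this approach.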
The simpler formula x/N is too simple; it's biased
against the alleged father and the bias is significant for small x.
(The formula (x+1)/N may
seem simpler, but I don't like it for two reasons:
- It looks odd because it doesn't correspond to any model.
- It fails badly when N is 0.
To be fair, no one suggested it.)
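To see how much x/N and (x+1)/(N+1) differ for small x, here is a rough comparison. The database size of 400 and the function names are assumptions for illustration only.

```python
def naive(x: int, n: int) -> float:
    """Raw database sample frequency x / N."""
    return x / n


def recommended(x: int, n: int) -> float:
    """Recommended estimate (x + 1) / (N + 1)."""
    return (x + 1) / (n + 1)


def ratio(x: int, n: int) -> float:
    """How many times larger the recommended estimate is than x / N."""
    return recommended(x, n) / naive(x, n)


N = 400  # hypothetical database size
for x in (1, 2, 5, 20, 100):
    print(f"x={x:3d}  x/N={naive(x, N):.5f}  "
          f"(x+1)/(N+1)={recommended(x, N):.5f}  ratio={ratio(x, N):.3f}")
```

The ratio equals (1 + 1/x) times N/(N+1). It is therefore close to 2 for an allele seen once and within about one percent of 1 once x reaches 100.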
In the pursuit of greater accuracy you generally run into endless complication with zero
or negligible benefit. Only in the homozygous child QQ case
might it be justified to accept the complication of +2,
rather than +1. It does correct a situation where the recommendation is
anti-conservative. However, I am reluctant to recommend it because:
- It sacrifices the simplicity of always using the same number for the probability
of the same allele.
- The error that it remedies is significant only in the very rare situation of
homozygosity for a rare allele.
- The only natural stopping point is the one taken by the