Comments are welcome (see home page for email)
The problem is to evaluate given DNA data for paternity attribution in a case where the alleged father is not tested. Instead, DNA profiles are available for his sister (SI) and his brother (BR).
There are both autosomal (Identifiler) data and a Y-haplotype.
(U and A correspond to BR and SI in the official statement of the problem.)
|likelihood ratio (PI)||formula at this locus||Allele frequencies used||Mother||Child||SI||BR|
|cumulative LR for autosomal loci||78200||(meaning of the letters)||M||C||A||U|
|1.45||(1+2p+r) / (4r+8pr)||p=0.143 r=0.199||14||14||12 13||12 14|
|4.37||(2+7q+r) / (4q+4r+8qq+8qr)||q=0.112 r=0.0193||31 32||31 32||30 31||31 32|
|1.43||(1+2p+s+4t+ps+pt+5st+tt) / (4p+4t+4ps+4pt+4st+4tt+8pst+8stt)||p=0.137 s=0.215 t=0.146||8 12||8 12||11 12||11 12|
|1.17||(2+p+7q) / (4p+4q+8pq+8qq)||p=0.262 q=0.328||10 11||10 11||11 12||10 11|
|1.33||(1+5p) / (4p+8pp)||p=0.291||17 18||15 17||15 18||15 16|
|2.29||(1+a+r) / (4r+4ar)||a=0.349 r=0.119||6 7||6 8||8 9.3||9.3|
|1.36||(1+5q) / (4q+8qq)||q=0.284||12||12||11 12||12 13|
|18.9||(1+4p+q+pp+5pq) / (4p+4pp+4pq+8ppq)||p=0.0138 q=0.148||11||8 11||8 9||8 9|
|0.433||(2+r+s) / (4+4r+4s+8rs)||r=0.113 s=0.164||19 20||17 20||19 20||19 20|
|3.94||1 / 4p||p=0.0634||13 16||12 13||13 14||12 15|
|3.26||(1+5q) / (4q+8qq)||q=0.0952||16 17||15 16||14 15||15 17|
|1.17||(1+p+s) / (4s+4ps)||p=0.575 s=0.247||8||8 11||8 11||8|
|30.2||1 / 4z||z=0.00828||16 17||17 22||16 18||12 22|
|0.221||1 / (4+8r)||r=0.0662||9 11||11||8 10||9 10|
|2.18||(1+5p) / (4p+8pp)||p=0.156||23||21 23||21 22||21 23|
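As a cross-check, the per-locus formulas and allele frequencies in the table can be evaluated directly. The sketch below simply transcribes the table rows in order. It assumes that doubled letters such as pp mean p squared and that "1 / 4p" means 1/(4p). The names LOCI, computed and cumulative are illustrative only. Because the tabulated frequencies are rounded, the recomputed values can be expected to agree with the listed ones only to roughly three significant figures.

```python
import math

# Each entry: (LR as listed in the table, formula, allele frequencies used).
# Transcribed row by row; "pp" is read as p*p (an assumption about the notation).
LOCI = [
    (1.45, lambda p, r: (1 + 2*p + r) / (4*r + 8*p*r), dict(p=0.143, r=0.199)),
    (4.37, lambda q, r: (2 + 7*q + r) / (4*q + 4*r + 8*q*q + 8*q*r),
     dict(q=0.112, r=0.0193)),
    (1.43, lambda p, s, t: (1 + 2*p + s + 4*t + p*s + p*t + 5*s*t + t*t)
     / (4*p + 4*t + 4*p*s + 4*p*t + 4*s*t + 4*t*t + 8*p*s*t + 8*s*t*t),
     dict(p=0.137, s=0.215, t=0.146)),
    (1.17, lambda p, q: (2 + p + 7*q) / (4*p + 4*q + 8*p*q + 8*q*q),
     dict(p=0.262, q=0.328)),
    (1.33, lambda p: (1 + 5*p) / (4*p + 8*p*p), dict(p=0.291)),
    (2.29, lambda a, r: (1 + a + r) / (4*r + 4*a*r), dict(a=0.349, r=0.119)),
    (1.36, lambda q: (1 + 5*q) / (4*q + 8*q*q), dict(q=0.284)),
    (18.9, lambda p, q: (1 + 4*p + q + p*p + 5*p*q)
     / (4*p + 4*p*p + 4*p*q + 8*p*p*q), dict(p=0.0138, q=0.148)),
    (0.433, lambda r, s: (2 + r + s) / (4 + 4*r + 4*s + 8*r*s),
     dict(r=0.113, s=0.164)),
    (3.94, lambda p: 1 / (4*p), dict(p=0.0634)),
    (3.26, lambda q: (1 + 5*q) / (4*q + 8*q*q), dict(q=0.0952)),
    (1.17, lambda p, s: (1 + p + s) / (4*s + 4*p*s), dict(p=0.575, s=0.247)),
    (30.2, lambda z: 1 / (4*z), dict(z=0.00828)),
    (0.221, lambda r: 1 / (4 + 8*r), dict(r=0.0662)),
    (2.18, lambda p: (1 + 5*p) / (4*p + 8*p*p), dict(p=0.156)),
]

# Recompute every per-locus LR from its formula and frequencies.
computed = [formula(**freqs) for _, formula, freqs in LOCI]

# The cumulative autosomal LR is the product over loci.
cumulative = math.prod(computed)

for (listed, _, _), value in zip(LOCI, computed):
    print(f"listed {listed:>6}   recomputed {value:.4g}")
print(f"cumulative LR: {cumulative:.4g}")
```

The recomputed product lands close to the listed 78200; the small difference comes from the rounding of the tabulated frequencies.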
For the problem here, to compute X (for example), the relevant people would be Ma, Child, SI, BR, the father, and the paternal grandparents Gma and Gpa. If we consider the locus D8S1179 as an example, then the genotype combinations to consider for each person are as much as
However, this problem has an essential complication compared with the case of only one aunt or uncle, as there was in last year's paper challenge. Suppose, for example, that the type of BR were not given, only that of SI. Then two "slots" among the four grandparental alleles would be known to be 12 and 13, and the other two slots would be known to be unknown. From this, the probability that Father would pass a 14 allele is easily seen to be the probability that he receives and in turn transmits one of the two empty slots, times the chance that that slot is a 14.
When BR's type is also known, the above reasoning breaks down because we don't know whether the 12 in BR and the 12 in SI are the same 12 or two different 12's. The number of slots accounted for in the grandparents is somewhere between 3 and 4 (probabilistically speaking). Last year's shortcut doesn't work.
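One way to sidestep the slot-counting difficulty is brute force: sum over every possible assignment of the four grandparental alleles. The sketch below does this for the example above (SI = 12 13, BR = 12 14, paternal allele 14). It is an illustration under stated assumptions, not necessarily the method used to derive the table: no mutation, unrelated grandparents, and Hardy-Weinberg proportions. The function name sibling_lr is hypothetical. Reading p as the frequency of allele 12 and r as the frequency of allele 14 is an inference from the formula, and the frequency 0.10 for allele 13 is a placeholder (it cancels out of the ratio).

```python
from itertools import product


def sibling_lr(freqs, sib_genotypes, paternal_allele):
    """LR for an untested alleged father, given genotypes of his full siblings.

    Sums over all ordered grandparental allele assignments
    (gpa1, gpa2, gma1, gma2). Assumes no mutation and unrelated grandparents.
    """
    x = 0.0  # P(sibling genotypes AND father transmits the paternal allele)
    s = 0.0  # P(sibling genotypes)
    for gp in product(list(freqs), repeat=4):
        prior = 1.0
        for allele in gp:
            prior *= freqs[allele]
        like = 1.0
        for geno in sib_genotypes:
            target = sorted(geno)
            # A child draws one allele from each grandparent: 4 equally likely pairs.
            hits = sum(sorted((i, j)) == target for i in gp[:2] for j in gp[2:])
            like *= hits / 4
        # The father transmits each of the four grandparental slots with chance 1/4.
        transmit = gp.count(paternal_allele) / 4
        s += prior * like
        x += prior * like * transmit
    # Under the alternative, the paternal allele comes from a random man.
    return x / (s * freqs[paternal_allele])


p12, p13, p14 = 0.143, 0.10, 0.199  # p13 is a placeholder value
freqs = {"12": p12, "13": p13, "14": p14, "other": 1 - p12 - p13 - p14}

lr_si_only = sibling_lr(freqs, [("12", "13")], "14")
lr_both = sibling_lr(freqs, [("12", "13"), ("12", "14")], "14")
print(lr_si_only, lr_both)
```

With SI alone the enumeration reproduces the shortcut (chance 1/2 of transmitting an empty slot, so LR = 1/2). With both siblings it agrees with the tabulated formula (1+2p+r) / (4r+8pr).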
|person||C||U (i.e. BR)|
|other loci||16, 11, ...||16, 11, ...|
|# observations in database||N=170|
|# matching observations||k=1||k=0|
|name for the haplotype|
|notation for probability to see the type in unrelated person||c||u|
|DYS390 mutation frequency||μ=0.009|
There are several possible approaches. We use the notation LR for the likelihood ratio, and LR = X/Y, where
X = Prob(observed haplotypes | BR an uncle of C) and
Y = Prob(observed haplotypes | BR unrelated to C).
Y = cu. X is more difficult.
X = c(3μ/2) and
LR = X/Y = X/(cu) = 3μ/(2u).
It remains to estimate u.
X = c(μ/2) + u(2μ/2), so
LR = X/Y = X/(cu) = μ(1/(2u) + 1/c).
Postscript June 2008.
Thanks to Steve Myers & John Planz for noting that the grandfather-centric approach is the logical one.
Also, on reflection I'm not worried as to whether the present population data is appropriate for previous generations. Of course the population frequencies have changed, but frequency was never the issue anyway, it's a question of probability. And data about the present, if that's all you've got, is a valid and I think unbiased indication of the past state equally as of the present.
Note that all formulas are equivalent if c = u. Therefore to be conservative let's take the uncle-centric view and take c=2/171.
|Hence LR = 3μ/(2c) = 3(0.009) / (2(2/171)) = 1.15.|
The meaning of this neutral result is that the chance to see so rare a haplotype by mutation is about the same as the chance to see it at random in an unrelated individual.
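The arithmetic behind the Y-haplotype LR is easy to reproduce. The sketch below evaluates the formulas given above with μ = 0.009 and c = 2/171. The value u = 1/171 is an assumption: it applies the same counting to the k = 0, N = 170 entry in the table. The variable names are illustrative, and the first two formulas are labelled neutrally rather than by approach.

```python
MU = 0.009   # DYS390 mutation frequency, from the table
c = 2 / 171  # probability of the child's haplotype in an unrelated person
u = 1 / 171  # probability of BR's haplotype (assumed: k = 0 of N = 170, plus one)

# The two formulas that appear earlier in the text.
lr_first = 3 * MU / (2 * u)              # LR = 3μ/(2u), about 2.31
lr_second = MU * (1 / (2 * u) + 1 / c)   # LR = μ(1/(2u) + 1/c), about 1.54

# The conservative choice adopted in the text, with c in the denominator.
lr_conservative = 3 * MU / (2 * c)       # about 1.15

print(f"{lr_first:.3g} {lr_second:.3g} {lr_conservative:.3g}")

# Sanity check of the remark that the two formulas coincide when c = u.
same = 1 / 171
assert abs(3 * MU / (2 * same) - MU * (1 / (2 * same) + 1 / same)) < 1e-12
```

Of the three values, the conservative choice is the smallest, which is the sense in which it favors the alleged father least.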
The justification is that I consider the case occurrence of the allele in an unknown sample (or a child) as a (k+1)st observation of the allele out of now N+1 total observations. In other words, I (temporarily) toss it into the official database, producing what I call an "extended database."
At that point, (conceptually) before examining the suspect's (or the father's) allele, we ask what is the probability that an innocent suspect will match. Assuming the extended database is representative of the universe, the answer is (k+1)/(N+1).
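The rule is a one-liner in code. The minimal sketch below applies it to the counts from the haplotype table (N = 170, with k = 1 and k = 0). The function name extended_db_probability is hypothetical, and exact fractions are used only to keep the results readable.

```python
from fractions import Fraction


def extended_db_probability(k, n):
    """Matching probability after adding the case observation to the database.

    k: number of matching observations in the original database
    n: total number of observations in the original database
    """
    # The case observation becomes the (k+1)st match out of n+1 observations.
    return Fraction(k + 1, n + 1)


c = extended_db_probability(1, 170)  # child's haplotype: seen once in 170
u = extended_db_probability(0, 170)  # BR's haplotype: never seen in 170

print(c, u)  # prints: 2/171 1/171
```

Note that a haplotype never seen in the database still gets a nonzero probability, 1/(N+1), rather than zero.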
Postscript June 2008: see
Allele probability the
Others have claimed (Stockmarr, maybe Balding) that you should add the allele twice: once for the stain and once for the suspect. I think this is illogical because you should (in concept) evaluate the match probability before you know if the suspect matches. If you evaluate it afterwards, when you know if there is a match, isn't the probability either 0 or 1? Besides, Dawid and Mortera worked it out mathematically and their formula has +1.
So the (k+1)/(N+1) rule is rather conservative for rare haplotype systems.
For example, the
Another way to see the distinction clearly is to consider the so-called "frequency" of a full DNA profile. Typically the matching probability is far less than 1/world population, whereas a frequency would by definition need to be some integer out of the world (or whichever) population. It might help to realize that the probability experiment implied by saying that a matching probability is 1 in a trillion is not to consider trillions of different people and count how many match the given profile; it is to consider trillions of repetitions of the circumstances of this case (e.g., in parallel universes or over a long period of time).