Comments on:
Fundamental problem of forensic mathematics – The Evidential Value of a Rare Haplotype, Brenner CH (2010)
Fundamental problem of Forensic Mathematics – Evidential value of a rare haplotype, Nov 10, 2009 (maximally readable html version of the paper)
Understanding Y haplotype matching probability, Brenner CH (2014), Forensic Sci. Int. Genet. 8 233–243
When a rare haplotype is shared between suspect and crime scene, how strong is the evidence linking the two? The fundamental question is the matching probability:
What is the probability that an innocent suspect will match the crime scene haplotype?
The common and interesting situation is a previously unobserved haplotype. The traditional tools of product rule and sample frequency are not useful when there are no components to multiply and the sample frequency is zero. A useful statistic is the fraction κ (kappa) of the population sample that consists of "singletons", that is, of once-observed types. A simple argument shows that the probability for a random innocent suspect to match a previously unobserved crime scene type is (1-κ)/n, where n is the size of the population sample. That is distinctly less than 1/n, likely ten times less. The robust validity of this model is confirmed by testing it against a range of population models.
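As a rough illustration of the arithmetic, the following Python sketch computes κ and the (1-κ)/n matching probability from a list of observed haplotype labels. The function name, the toy labels, and the choice to add the crime scene type to the sample before computing both κ and n are assumptions made for this example; the paper should be consulted for the exact definitions.

```python
from collections import Counter


def kappa_match_probability(database, crime_scene_type):
    """Return (kappa, match_probability) for a previously unobserved type.

    Assumption for this sketch: the crime scene type is counted as part
    of the sample, so n and kappa are both computed on the augmented sample.
    """
    sample = list(database) + [crime_scene_type]
    counts = Counter(sample)

    # The (1 - kappa) / n formula is stated only for a crime scene type
    # that was not previously observed, i.e. a singleton once it is added.
    if counts[crime_scene_type] != 1:
        raise ValueError("crime scene type was already observed in the database")

    n = len(sample)
    # Number of types seen exactly once in the augmented sample.
    singletons = sum(1 for c in counts.values() if c == 1)
    kappa = singletons / n
    return kappa, (1 - kappa) / n


# Hypothetical toy data: nine database haplotypes plus a new crime scene type.
toy_database = ["A", "A", "B", "C", "D", "E", "F", "G", "H"]
kappa, p_match = kappa_match_probability(toy_database, "Z")
print(f"kappa = {kappa:.2f}, match probability = {p_match:.3f}")
```

In this toy sample of ten (nine database entries plus the crime scene type), eight entries are singletons, so κ = 0.8 and the matching probability is about 0.02, compared with 1/n = 0.1.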
I've been trying to write this paper since 1997 and had a lot of problems along the way. The goal was never to present a conservative number or to make an official recommendation having the force of authority, but to understand the problem.
As early as 1999 I presented an answer at a statistics conference in North Carolina (although a key point is that the problem is not a statistical problem), using the approach that the present paper calls the t model. However, I hit a snag trying to justify it mathematically, which I thought necessary since the mathematical derivation is unfortunately not quite simple.
By that time, though, I had arrived at three key insights that contradict common practice in the forensic community and, it is perhaps fair to say, commonly held beliefs:
- The crime stain must be counted as part of the population sample.
  - Implication: A "zero observations" situation never arises.
  - Logic: The validation criterion for the calculation is fairness to an innocent suspect, so consider the case of an innocent suspect. He is as unrelated to the crime scene for which he is suspected as he is to the sources of all the database profiles, hence all should be categorized together. That done, we consider prospectively (i.e. before knowing the suspect's type) the probability that he will match the crime scene profile. That probability would be the same for any other database profile about which the data (the known facts) are the same.
- The pertinent question is a question not of frequency but of probability (and there is a difference).
  - Implication: "Confidence intervals" are irrelevant to the problem.
  - Logic: Probability is a summary of evidence: data, i.e. known facts, which is what a court should and does deal with to make a decision. Confidence intervals are a way to speculate about unknown facts, such as what the (irrelevant) frequency might be or what the probability might be with additional unknown (hence irrelevant) information. Probability is also a measure of uncertainty. It makes no mathematical sense to talk about uncertainty of uncertainty (as confidence intervals applied to probability would purport to do), nor would there be any sensible way for the judge, even if a statistician or mathematician, to use such information.
- As an estimate of the chance to see a trait in a population, the chance to see it in a population sample may be neither neutral nor reasonable.
  - Implication: The matching LR may greatly exceed the size, n, of the reference sample.
  - Logic: Briefly, the point is that by noticing that the database consists mostly of types observed only once, we can infer that the population must consist of a very large number of rare types, of which we are seeing only a relative few. Therefore, for the few that we do see (including the crime scene type that is temporarily part of the database), the sample frequency must typically greatly overestimate the population frequency.
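The overestimation argument in the last point can be checked on a toy population. The sketch below is a simulation under assumed conditions, not a model taken from the paper: it invents a Zipf-like population of 10,000 types, draws a sample of 200, and compares the sample frequency of the singletons (1/n) with their average true population frequency. All names and parameters are hypothetical.

```python
import random
from collections import Counter


def singleton_overestimate(num_types=10_000, n=200, seed=0):
    """Return (number of singletons, their mean true frequency, 1/n).

    Assumption for this sketch: type i has population frequency
    proportional to 1 / (i + 1), a made-up Zipf-like distribution.
    """
    rng = random.Random(seed)
    weights = [1.0 / (i + 1) for i in range(num_types)]
    total = sum(weights)
    true_freq = [w / total for w in weights]

    # Draw a reference sample of size n from the toy population.
    sample = rng.choices(range(num_types), weights=true_freq, k=n)
    counts = Counter(sample)

    # Types observed exactly once in the sample.
    singletons = [t for t, c in counts.items() if c == 1]
    if not singletons:
        raise ValueError("no singletons in this sample")

    mean_true = sum(true_freq[t] for t in singletons) / len(singletons)
    return len(singletons), mean_true, 1.0 / n


count, mean_true, sample_freq = singleton_overestimate()
print(f"{count} singletons; sample frequency {sample_freq:.4f}, "
      f"mean true frequency {mean_true:.5f}")
```

In a population of this shape, most singletons come from types whose true frequency is well below 1/n, so their sample frequency overstates how common they are.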
A couple of years ago I found an easier way to arrive at approximately the same answer, the kappa model, and eagerly imagined that with this new-found simplicity the paper was only weeks away. The short version of a long story is that this proved not to be the case. Finally, however, after a last push of several months during which I determinedly avoided or put off almost all other priorities, I got the paper out the door. A link to the submission draft is above.