DNA Frequency Uncertainty – Why Bother?

Comments on probability and frequency added June 2012

The essay below makes its point indirectly. Written as a challenge, it is provocative rather than explanatory. "I'll bet you can't find a logical justification for supplying confidence intervals on DNA frequencies as evidence in a court case" is a taunt. The implication is that if you supply them without a firm understanding of what the court should do with them, then maybe there is no logical justification, and certainly no practical one: if the "expert" doesn't know how to use the information, the court will not know either.

I had a good pedagogic reason to put out a challenge rather than an explanation in the first place, but I do admit that the mere fact that my readers can't refute my challenge doesn't prove anything.

The explanation begins, as mentioned below, with the realization that frequency was never the issue; probability is. Specifically, the issue is the probability for an innocent suspect to match the crime scene profile, conditional on the crime scene profile having been observed. The difference? Probability is a summary of evidence, of facts actually observed. Frequency refers to facts typically not observed, namely the incidence of DNA traits in the entire population. Only evidence has any place in the courtroom; speculation about unknown facts (which is the apparent role of confidence intervals on sample frequencies) is not helpful.

If the mathematics is done right, the probability can be computed without ever mentioning or getting confused by the word "frequency." "What about a small database?" I am sometimes asked. "Shouldn't account be taken that the inference of an allele being rare is less robust when it comes from a small reference database?" Yes it should, and when the mathematics is done right the small size of the database will be reflected in a typically more generous matching probability. In a new essay, "Understanding How Matching Probability is Not Frequency," I present a detailed example.
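A minimal sketch of how database size can show up in the matching probability, assuming the (x+1)/(N+1) rule mentioned in Note 2 is the calculation in play. Here x is taken to be the number of times the allele appears in a reference database of N alleles; the function name and the example counts are illustrative assumptions, not taken from the essay.

```python
from fractions import Fraction


def extended_sample_probability(x: int, n: int) -> Fraction:
    """Return (x + 1) / (N + 1).

    Assumption: the allele seen at the crime scene is counted as one
    extra observation on top of a database of n alleles containing x copies.
    """
    if n < 0 or not 0 <= x <= n:
        raise ValueError("need 0 <= x <= n")
    return Fraction(x + 1, n + 1)


# Hypothetical case: an allele never seen in the database (x = 0).
small_db = extended_sample_probability(0, 100)
large_db = extended_sample_probability(0, 10_000)

print(small_db)  # 1/101
print(large_db)  # 1/10001
```

Under these assumptions the small database yields 1/101 and the large one 1/10001, so the smaller database produces the more generous (larger) matching probability without any separate confidence interval.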

Epilogue (June 2009)

Over the years I received a handful of responses to this page. Mostly they explained to me what confidence intervals are. I had thought it obvious from my desert example below that I already knew that. One person suggested that the appropriate confidence interval is a very large one, comparable to the weight of the evidence, an idea that I kind of like. The point here is that if you claim that the chance of a random match is "95% to be at most 1/million" then the strength of the conclusion is limited by the confidence interval; you're admitting that there's a 5% chance that the 1/million number is irrelevant, so all you've really said is that there's at most a 5% (plus a tiny bit) chance of a random match. I had considered that idea once too, and also Bruce Weir told me that Ian Evett propounded it. Well, I've always liked Ian.
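The "5% (plus a tiny bit)" figure can be checked with a line of arithmetic. The sketch below assumes the worst case for the uncovered 5%, namely that a random match is then certain; the function name is hypothetical.

```python
def random_match_bound(confidence: float, frequency_limit: float) -> float:
    """Upper bound on the chance of a random match.

    With probability `confidence` the match chance is at most
    `frequency_limit`; otherwise (worst-case assumption) it may be as
    large as 1.
    """
    covered = confidence * frequency_limit
    uncovered = (1.0 - confidence) * 1.0
    return covered + uncovered


bound = random_match_bound(0.95, 1e-6)
print(f"{bound:.8f}")
```

The bound is dominated by the 5% term: the 1/million figure contributes only the "tiny bit."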

But no one answered or even pretended to answer the direct question of my challenge. Little wonder. There's no answer. Confidence intervals apply to frequency estimates, but the forensic matching question isn't one of frequency but of probability. I have now submitted a paper that goes into the issue in detail and, I hope, with clarity.

Sampling variation
Benefit of sampling variation
Stupidity of sampling variation
What damn good is it?
Digression – a silly bet
A challenge

Comments on probability and frequency (2012)
Epilogue (2009)

(This essay was originally posted in about the year 2000)

Sampling variation

sampling variation

Benefit of sampling variation

Stupidity of sampling variation

not

My question about the 1/3000 number is

What damn good is it?

For the sake of argument, let's imagine that the point estimate, 1/5000, if accepted by the jury, would result in a conviction, whereas the jury would feel that if the weaker number, 1/3000, is the true chance for a person to match by chance, then they will not convict.

If the jury understands the matter correctly, what will they do?

Roughly speaking, we can imagine for simplicity that the facts are these:¹

1. We can suppose that this more suspect-friendly estimate, 1/3000, would occur with 5% of the re-collected databases.
2. However, equally the new databases might – another 5% of the time – give a matching estimate that is rarer than 1/5000 – say 1/15000.
3. The possibilities 1. and 2. cancel each other out. Based on all the evidence before us, we can say that the point estimate is correct²: There is 1 chance in 5000 that a randomly selected person will match the stain.
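The cancellation of possibilities 1 and 2 can be checked with exact arithmetic. This is a rough sketch under a simplifying assumption that is not in the essay: the remaining 90% of re-collected databases are lumped at the point estimate 1/5000.

```python
from fractions import Fraction

# (weight, matching estimate) for each simplified scenario.
scenarios = [
    (Fraction(5, 100), Fraction(1, 3000)),    # suspect-friendly estimate
    (Fraction(5, 100), Fraction(1, 15000)),   # rarer estimate
    (Fraction(90, 100), Fraction(1, 5000)),   # assumption: the rest sit at the point estimate
]

# The weights must account for every re-collected database.
assert sum(weight for weight, _ in scenarios) == 1

expected_estimate = sum(weight * estimate for weight, estimate in scenarios)
print(expected_estimate)  # 1/5000
```

With these numbers the two 5% tails average to exactly 1/5000, so the weighted result is the point estimate itself.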

Digression – a silly bet

imagine

But that is nothing like the question that the jury has to answer. So why burden them with an extra number – a useless number?

A challenge

email to email contact

Note 1: This rendition might offend a statistical "frequentist" purist. However, have I offended in a material way – has my unsophisticated rendering disguised the reason for reporting the confidence limit? Or is there nothing to disguise and I am just being unsophisticated?

Note 2: I'm not sure, but when I wrote "point estimate" years ago I may have had in mind the sample frequency. Today I would consider the extended-sample frequency (viz. the (x+1)/(N+1) rule) and, even more correctly, the expected value of the frequency. In any case there's no need to delve into the technical details to understand the gist of the argument, so let's ignore them for now.

Return to home page of Charles H. Brenner