DNA Frequency Uncertainty – Why Bother?

The essay below makes its point indirectly. Written as a challenge, it is provocative rather than explanatory. "I'll bet you can't find a logical justification for supplying confidence intervals on DNA frequencies as evidence in a court case" is a taunt, and the implication is this: if you supply them without a firm understanding of what the court should do with them, then maybe there is no logical justification (and certainly no practical one, for if the "expert" doesn't know how to use the information, the court certainly will not).

I had a good pedagogic reason to put out a challenge rather than an explanation in the first place, but I do admit that the mere fact that my readers can't refute my challenge doesn't prove anything.

The explanation begins, as mentioned below, with the realization that frequency was never the issue; probability is: specifically, the probability that an innocent suspect matches the crime scene profile, conditional on the crime scene profile having been observed. The difference? Probability is a summary of evidence, of facts actually observed. Frequency refers to facts typically not observed, namely the incidence of DNA traits in the entire population. Only evidence has any place in the courtroom; speculation about unknown facts (which is the apparent role of confidence intervals on sample frequencies) is not helpful.

If the mathematics is done right, the probability can be computed without ever mentioning, or getting confused by, the word "frequency." "What about a small database?" I am sometimes asked. "Shouldn't account be taken of the fact that the inference of an allele being rare is less robust when it comes from a small reference database?" Yes it should, and when the mathematics is done right the small size of the database will be reflected in a typically more generous matching probability. In a new essay, "Understanding How Matching Probability is Not Frequency," I present a detailed example.
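The small-database effect can be seen directly in the (x+1)/(N+1) adjustment mentioned in Note 2 below. A minimal sketch, with made-up database sizes:

```python
def match_probability(x, n):
    """Adjusted matching probability via the (x+1)/(n+1) rule:
    the crime-scene profile itself is counted along with the n database profiles."""
    return (x + 1) / (n + 1)

# A profile seen zero times: the smaller database properly yields
# the more generous (more suspect-friendly) matching probability.
small = match_probability(0, 100)      # 1/101,  about 0.0099
large = match_probability(0, 10_000)   # 1/10001, about 0.0001
print(small, large)
```

No confidence interval is needed: the size of the database feeds straight into the single probability that matters.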

Epilogue (June 2009)

Over the years I received a handful of responses to this page. Mostly they explained to me what confidence intervals are. I had thought it obvious from my desert example below that I already knew that. One person suggested that the appropriate confidence interval is a very large one, comparable to the weight of the evidence, an idea that I kind of like. The point here is that if you claim that the chance of a random match is "95% to be at most 1/million," then the strength of the conclusion is limited by the confidence interval; you're admitting that there's a 5% chance that the 1/million number is irrelevant, so all you've really said is that there's at most a 5% (plus a tiny bit) chance of a random match. I had considered that idea once too, and Bruce Weir told me that Ian Evett propounded it. Well, I've always liked Ian.
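The "5% plus a tiny bit" arithmetic can be spelled out in a few lines. A sketch in Python, using the illustrative 95% and 1/million figures from the example above:

```python
# Claim: "with 95% confidence, the random-match probability is at most 1/1,000,000."
confidence = 0.95
bound_if_interval_holds = 1e-6

# The remaining 5% of the time the interval may simply be wrong, in which
# case the match probability can only be bounded by 1.  So the best overall
# statement the claim supports is a weighted bound:
overall_bound = (1 - confidence) * 1.0 + confidence * bound_if_interval_holds
print(overall_bound)  # 0.05000095 -- at most 5%, plus a tiny bit
```

The 1/million figure contributes almost nothing; the conclusion is dominated by the confidence level itself.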

But no one answered or even pretended to answer the direct question of my challenge. Little wonder. There's no answer. Confidence intervals apply to frequency estimates, but the forensic matching question isn't one of frequency but of probability. I have now submitted a paper that goes into the issue in detail and, I hope, with clarity.

Sampling variation

DNA profile frequency estimates are based on population samples of limited size (not the whole population). Since a different population study would probably lead to a different frequency estimate, the estimate has an uncertainty known as sampling variation.

Benefit of sampling variation

Suppose you plan to drive to some point in the desert and must carry enough fuel for the round trip. Your best estimate is that ten gallons will be enough, but you know that this estimate carries some uncertainty, and there is, let us say, a 1% chance that you really will need 15 gallons. So 15 gallons is the "98% (or maybe 99%) upper confidence estimate", and you may well judge it prudent to carry this amount of gas, rather than the "point estimate" of 10 gallons.
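What makes the upper confidence estimate useful in the desert is the asymmetry of the consequences. A toy expected-cost comparison, with all costs hypothetical:

```python
# Hypothetical costs: being stranded in the desert is catastrophic
# compared with hauling 5 unneeded gallons.
cost_stranded = 100_000    # assumed cost of running out of fuel
cost_extra_fuel = 20       # assumed cost of carrying 5 extra gallons
p_need_15 = 0.01           # from the text: 1% chance that 10 gallons won't do

# Expected cost of each plan:
carry_10 = p_need_15 * cost_stranded           # risk being stranded
carry_15 = (1 - p_need_15) * cost_extra_fuel   # pay for (probably) unused fuel

print(carry_10, carry_15)  # 1000.0 vs 19.8 -- carry the 15 gallons
```

The upper confidence bound earns its keep here because acting on it changes the decision. The question below is whether anything analogous happens in the courtroom.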

Stupidity of sampling variation

In dealing with DNA matching frequencies, the NRC II report discusses (but does not recommend!) an analogous approach. Let's suppose we have a DNA profile, shared between the suspect and the crime scene. We have some (necessarily limited) databases from which to estimate the prevalence of this profile in the general population, and our best guess – the "point estimate" – is that the profile is shared by 1/5000 of the general population. Then NRC II shows how to compute a "95% lower confidence" number, which is, let us suppose, 1/3000.
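NRC II's exact recipe is not reproduced here; as a rough stand-in, a textbook one-sided normal-approximation bound on a sample proportion shows the kind of calculation involved. The database counts below are hypothetical:

```python
import math

# Hypothetical database: the profile appears 2 times in a sample of 10,000.
x, n = 2, 10_000
p_hat = x / n  # point estimate: 1/5000

# One-sided 95% upper bound via the normal approximation (z = 1.645).
# This is a generic textbook bound, not necessarily NRC II's exact method.
z = 1.645
upper = p_hat + z * math.sqrt(p_hat * (1 - p_hat) / n)
print(f"point estimate 1/{1/p_hat:.0f}, 95% upper bound about 1/{1/upper:.0f}")
```

The bound is a more common (less incriminating) frequency than the point estimate, which is what makes it the "suspect-friendly" number in the discussion that follows.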

My question about the 1/3000 number is

What damn good is it?

For the sake of argument, let's imagine that the point estimate, 1/5000, if accepted by the jury, would result in a conviction, whereas if the jury believes that the weaker number, 1/3000, is the true chance for a person to match by coincidence, they will not convict.

If the jury understands the matter correctly, what will they do?

Roughly speaking, we can imagine for simplicity that the facts are these (Note 1):

1. Were we to go back and collect databases again, the new databases might give a matching estimate of 1/3000. We can suppose that this more suspect-friendly estimate would occur with 5% of the re-collected databases.
2. However, equally the new databases might – another 5% of the time – give a matching estimate that is rarer than 1/5000 – say 1/15000.
3. The possibilities 1 and 2 cancel each other out. Based on all the evidence before us, we can say that the point estimate is correct (Note 2): there is 1 chance in 5000 that a randomly selected person will match the stain.
As far as I can see, the only use the jury can make of the "distribution" information – of the confidence limits – is somehow to distill it into a single number along the lines I have indicated above. Finally, they will act on the single number. And that single number is the point estimate. So why give them more than the single number in the first place?
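The cancellation claimed above can be checked by direct averaging, assuming the 5% / 90% / 5% split of re-collected databases described in the list:

```python
# Hypothetical resampling outcomes: 5% of re-collected databases give 1/3000,
# 5% give 1/15000, and the remaining 90% give the original 1/5000.
outcomes = [(0.05, 1 / 3000), (0.90, 1 / 5000), (0.05, 1 / 15000)]

expected = sum(prob * freq for prob, freq in outcomes)
print(expected, 1 / expected)  # 0.0002, i.e. exactly 1/5000 -- the point estimate
```

Averaging over the imagined re-collections simply reproduces the point estimate, which is the single number the jury would act on anyway.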

Digression – a silly bet

Now, I can imagine a situation where the confidence interval would be useful: Suppose you offer to bet me that the number of matching people, in a city of 1 million, is between 1/4000 of the people and 1/6000 of the people. In deciding whether to take the bet I of course would like to know the confidence interval around the 1/5000 point estimate.
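To see what the bettor (unlike the juror) gets from the interval, take the point estimate at face value and compute the chance of winning. A sketch using a normal approximation to the binomial; all numbers are the hypothetical ones above:

```python
from statistics import NormalDist

# Treat the point estimate 1/5000 as the true matching frequency in a city
# of 1,000,000.  The count of matchers is roughly Binomial(1e6, 1/5000):
# mean 200, standard deviation about 14.1.
n, p = 1_000_000, 1 / 5000
mean = n * p
sd = (n * p * (1 - p)) ** 0.5

# The bet: is the count between 1/6000 of the city (~167 people)
# and 1/4000 of the city (250 people)?
lo, hi = n / 6000, n / 4000
approx = NormalDist(mean, sd)
win = approx.cdf(hi) - approx.cdf(lo)
print(f"chance of winning the bet: about {win:.3f}")
```

If the frequency estimate itself carries wide uncertainty, this winning chance falls, so a bettor genuinely cares how tight the interval is. The jury's question, a single matching probability, has no such dependence.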

But that is nothing like the question that the jury has to answer. So why burden them with an extra number – a useless number?

A challenge

Will someone tell me, please, what rational difference it ever can make to know the confidence limits in addition to knowing the best point estimate? Specifically, can you give premises under which, for a fixed point estimate, the decision to convict or not to convict would depend on the size of the confidence interval?


Note 1: This rendition might offend a statistical "frequentist" purist. However, have I offended in a material way – has my unsophisticated rendering disguised the reason for reporting the confidence limit? Or is there nothing to disguise, and I am just being unsophisticated?

Note 2: I'm not sure, but when I wrote "point estimate" years ago I may have had in mind the sample frequency. Today I would consider the extended-sample frequency (viz. the (x+1)/(N+1) rule), and even more correctly the expected value of the frequency. In any case there's no need to delve into the technical details to understand the gist of the argument, so let's ignore them for now.