Arizona DNA databases "matches"

		Arizona DNA Database Matches
		Charles Brenner January 8, 2007
		analysis of partial matches in an offender database, and critique of a surprising apparent blunder by NPR's "Math Guy" Keith Devlin.

Offender database partial matches

Introduction

Postscript – source of data Feb 1, '07

No. The offender database itself was not released, just a "9+ locus match summary report" which lists, one line per match, all the 9+ locus matches and the corresponding identification codes (not the DNA profiles).

Background. One 9-locus match came to light accidentally and was reported as a conference poster (Troyer et al). Eventually counsel (Bicka Barlow, San Francisco Office of the Public Defender) got wind of this match and of additional data, and asked an Arizona judge to compel release of the additional data in the hope that it would aid the defense of a California man who'd been arrested on the basis of a database cold hit.

Barlow argued that "90 [the estimate then] matching pairs at 9 loci is an incredible fact. ... The State has information that they're not providing to the defense that says that, in fact, their statistical analysis is wrong and it could be wrong by orders of magnitude. ... I could have gotten a statistician to calculate a probability ... it's almost -- it's improbable [What did she stop herself from saying, and why?]."

There was no contradictory testimony at the hearing. The judge ordered the match report to be produced.

1. If the theory comports with the data, that supports using the theory for calculations presented in court as evidence in criminal cases.
2. If the theory and observation are inconsistent, that could be used by a criminal defendant to attack the DNA calculations presented against him.
3. Or as a matter of research, inconsistency could be used as a clue to modify the theory.
Without actually comparing theory and observation, the defense expert might be able to use innuendo to imply that #2 holds because the observations are counter-intuitive even though in fact they are as expected.

Keith Devlin

"Math Guy"

What is a "partial match"?

[Figure: TH01. "Actual electron micro-photograph." The TH01 locus, located at about p15.5 on chromosome 11.]

dna-view.com/profile.htm

¹, ²

At each locus, the between-individual variability of DNA is such that the probability that two randomly selected individuals have the same pair of numbers – that they "match" – is typically (depending on the actual pair of numbers, and on which locus) about one chance in 13.66, assuming the individuals are unrelated. (This figure is for the FBI Caucasian population study.³) Assuming the approximation that the various loci are independent, the chance of a full-profile match would nearly always be less than 1 chance in 10¹³, and the median chance is 1 in 5.7×10¹⁴. Whether these are meaningful numbers depends on how valid the assumptions of unrelatedness and of independence are, and beyond that, in talking about so small a number, there is some philosophical question as well.
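The product-rule arithmetic behind the full-profile figure can be sketched in a few lines. This is only an illustration under the idealized assumptions just stated: it treats every one of the 13 loci as having the same typical match chance of 1 in 13.66, which is a simplification, and the function name is illustrative. The result, about 1 in 5.8×10¹⁴, is of the same order as the median figure quoted above.

```python
# Product rule under the idealized model: 13 independent loci, each with
# the same per-locus match chance s = 1/13.66 (a simplifying assumption;
# real loci and real genotypes differ).
S = 1 / 13.66
N_LOCI = 13


def full_profile_match_chance(s: float = S, n_loci: int = N_LOCI) -> float:
    """Chance that two unrelated individuals match at every locus."""
    return s ** n_loci


if __name__ == "__main__":
    chance = full_profile_match_chance()
    print(f"full-profile match chance: about 1 in {1 / chance:.2e}")
```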

The term "partial match" is used to mean that two profiles match at some subset of their loci. For example, a 9-locus partial match between two profiles means that they match at some nine loci and fail to match at the remaining four loci. The Arizona DNA Offender database contained 65,493 profiles, and the following partial matches were observed:

Observed partial matches in Arizona data
number of matching loci	number of partial matches
9	122
10	20
11	1
12	1

Analysis

Are these results remarkable?

approximately 1 in every 228 profiles in the database matched another profile in the database at nine or more loci ...
[A careless reader might confuse the "1 in 228" statistic with the chance that two randomly selected profiles match, rather than the chance that a randomly selected profile will have a match with some one out of the 65,492 other profiles. Did Devlin phrase it this way to mislead, or just to be dramatic? Probably neither. As I discuss below, he was probably misled himself.]

How big a population does it take to produce so many matches that appear to contradict so dramatically [emphasis mine] the astronomical, theoretical figures given by the naive application of the product rule? The Arizona database contained at the time a mere 65,493 entries. Scary isn't it?
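The difference between the two readings of "1 in 228" can be made concrete with the observed counts. The sketch below is one plausible reconstruction, not Devlin's stated method: it assumes that each of the 144 observed matching pairs involves two profiles that appear in no other pair, and the names are illustrative. Per profile the rate comes out at roughly 1 in 227, close to the quoted figure; per pair of profiles it is roughly 1 in 15 million.

```python
from math import comb

N_PROFILES = 65_493
# Observed 9-, 10-, 11- and 12-locus partial matches in the Arizona report.
OBSERVED_PAIRS = 122 + 20 + 1 + 1


def one_in_profiles(n_profiles: int = N_PROFILES, pairs: int = OBSERVED_PAIRS) -> float:
    """Profiles per matching profile, assuming no profile is in two pairs."""
    return n_profiles / (2 * pairs)


def one_in_pairs(n_profiles: int = N_PROFILES, pairs: int = OBSERVED_PAIRS) -> float:
    """Pairs of profiles per matching pair."""
    return comb(n_profiles, 2) / pairs


if __name__ == "__main__":
    print(f"per profile: about 1 in {one_in_profiles():.0f}")
    print(f"per pair:    about 1 in {one_in_pairs():,.0f}")
```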

number (l) of matching loci	matching chance (s) per locus	matching chance (t) for l-locus profile, t = s^l × (1-s)^(13-l)	(p) pairs of individuals, p = 65493 choose 2 = 65493×65492/2	(w) ways to select l loci from 13, w = 13 choose l	(n) comparisons, n = p×w	(x) expected number of partial matches, x = n×t	(m) observed^† number of partial matches	relative excess = m/x
9	1/13.66	1 / 2.2×10¹⁰	2.14×10⁹	715	1.5×10¹²	68.3	122	1.8
10	1/13.66	1 / 2.8×10¹¹	2.14×10⁹	286	6.1×10¹¹	2.2	20	9
11	1/13.66	1 / 3.6×10¹²	2.14×10⁹	78	1.7×10¹¹	0.05	1	22
12	1/13.66	1 / 4.5×10¹³	2.14×10⁹	13	2.8×10¹⁰	1/1600	1	1600
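The expected counts in the table can be reproduced with a short script. This is a minimal sketch of the idealized model only: it assumes 13 loci, the same match chance s = 1/13.66 at every locus, independent loci and unrelated individuals. The function and variable names are illustrative and do not come from any forensic software.

```python
from math import comb

N_PROFILES = 65_493          # size of the Arizona offender database
N_LOCI = 13                  # loci per profile
S = 1 / 13.66                # assumed per-locus match chance, same at every locus
OBSERVED = {9: 122, 10: 20, 11: 1, 12: 1}


def expected_partial_matches(l: int, n_profiles: int = N_PROFILES,
                             n_loci: int = N_LOCI, s: float = S) -> float:
    """Expected number of pairs matching at exactly l of n_loci loci."""
    t = s ** l * (1 - s) ** (n_loci - l)   # chance for one choice of l loci
    p = comb(n_profiles, 2)                # pairs of individuals
    w = comb(n_loci, l)                    # ways to choose the l matching loci
    return p * w * t                       # x = n * t, with n = p * w


if __name__ == "__main__":
    for l, m in OBSERVED.items():
        x = expected_partial_matches(l)
        print(f"{l:2d} loci: expected {x:10.4f}, observed {m:3d}, excess {m / x:8.1f}")
```

Changing `S` or `N_PROFILES` shows how sensitive the expected counts are to the per-locus figure and to the database size.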

exactly

Reasonable conformity to expectations

Postscript – Myers' analysis Jan 30, '07

CAL-DOJ Criminalist Steve Myers presented a more careful analysis than mine at a recent meeting, taking into account estimates of the population structure. His model explains the observed results pretty well.

The observed number of 9-locus matches exceeds the prediction of the idealized, simplified model by only a factor of 1.8. If the source of this discrepancy were the product rule, then the implied departure from independence of loci would have to be considered very small. It would mean that the sequence of multiplications that leads to the number t, which amounts to about 40 factors of 1.8 (1.8⁴⁰ ≈ 1.6×10¹⁰, close to the 2.2×10¹⁰ in the table), is in error by only one factor of 1.8. More likely, however, the product rule is even more accurate than that.

In view of the quickly increasing "relative excess" as the number of matching loci increases, the more likely source of the discrepancy is the other ideal assumption of the calculation model, namely that all the individuals are unrelated. Just as the 11 and 12 locus matches supposedly represent siblings,¹ surely the 9 and 10 locus match count is also very much influenced by the presence of related criminals in the database.

There is scope for further analysis in order to understand better the population model implied by the observed data. That is, the data gives some clue as to the rate at which relatives occur in the criminal population. This is an interesting and practical question.⁴

Such research might also have implications for forensic practice, as alluded to in I.A.3 above. After all, if the model under which DNA profile computations are presented to the court is wrong, does it really matter that it's right with respect to independence of loci but wrong in some other respect, namely the assumption of unrelatedness? I agree with this concern, and in fact the population model generally accepted and expected by courts in the United Kingdom for DNA identification probability calculations takes plentiful, probably generous, account of possible relationships. The extent to which relationships within the offender database are relevant to a criminal trial is a subtle question, though, and beyond the scope of this discussion.

Devlin's surprising oversight

Far from supporting the naive intuition that the Arizona matches conflict with the product rule, we see that they are in fact surprisingly consistent with it.

Also surprising is how Devlin, a mathematician, could have made such an error. Continuing from the passage above though makes it clear what happened:

Postscript – Devlin disagrees Jan 30, '07

Devlin disagrees. He emailed me to point out his unfinished ms – Scientific Heat about Cold Hits, the appendix at the end of which takes into account the two multiplicative factors discussed in this section. I'm relieved that he didn't need me to figure it out. It is unclear which was written first. Clearly (to me) the ms and MAA articles are contradictory.

He further claims (January 2007 column) that a "legal consultant" and "blogger" (presumably meaning this page) has misrepresented the contents of his articles 1 and 2. Not in the least. My plain reading is the same as everyone else's. Please let me know if you see an alternative, any word or phrase to support Devlin's assertion that he presented not his own opinions but meant only to illustrate the errors that a non-mathematician might make.

It is not much of a leap to estimate that the FBI's national CODIS database of 3,000,000 entries will contain not just one but several pairs that match on all 13 loci,⁵ contrary (and how!) to the prediction made by proponents of the currently much touted RMP that you can expect a single match only when you have on the order of 15 quadrillion profiles.²

Note that there are two separate multiplicative factors that the naive tend to overlook when considering the number of possible 9-locus matches from a collection of profiles such as the Arizona data:

1. the factor, equal to about one-half the size of the sample, by which the number of pairs exceeds the size of the sample;
2. the combinatorial factor – 715 above – representing the number of different 9-locus selections from 13, each of which is an opportunity for two selected individuals to have a 9-locus match.

The first point is the same one that tricks people who mis-estimate the chance that a classroom of 23 children will have some pair sharing a birthday (slightly over half) – the famous Birthday Problem. For further explication, see Devlin's own essay on the subject.⁶
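The birthday figure, and the pair counts that drive the first factor, are easy to check. The sketch below assumes 365 equally likely birthdays and ignores leap years; the function names are illustrative. It also prints the number of pairs for a class of 23, for the 65,493 Arizona profiles, and for the 3,000,000-entry figure quoted above.

```python
from math import comb


def shared_birthday_chance(n_people: int, n_days: int = 365) -> float:
    """Chance that at least two of n_people share a birthday,
    assuming n_days equally likely birthdays."""
    all_distinct = 1.0
    for i in range(n_people):
        all_distinct *= (n_days - i) / n_days
    return 1.0 - all_distinct


def number_of_pairs(n: int) -> int:
    """Number of distinct pairs among n items: n choose 2."""
    return comb(n, 2)


if __name__ == "__main__":
    print(f"23 children: chance of a shared birthday = {shared_birthday_chance(23):.4f}")
    for n in (23, 65_493, 3_000_000):
        print(f"{n:>9,} items -> {number_of_pairs(n):>17,} pairs")
```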

Conclusions

Devlin, of course, looked at the match data and argued the opposite – that it proves something quite negative about the underlying population genetic theory upon which DNA identification rests.

Both are wrong. If Budowle had looked at an analysis before making his pronouncement, he would have realized there is no need to be defensive. Devlin could have followed through with his own suggestion:

So what should be done? To me, the answer is obvious. Instead of using mathematics, determine the various random match probabilities empirically.²

Footnotes

1. Keith Devlin, Statisticians not wanted, MAA Online
2. Keith Devlin, Damned lies, MAA Online
3. Journal of Forensic Science, Vol 44 number 6
4. Brenner, Bieber, Lazer, Finding Criminals through DNA of their Relatives
5. In fact there are many perfect matches, due mainly to the fact that offenders get catalogued from two different states or under different names. Hence the experiment is impossible to perform. However, checking for near-perfect matches is a good experiment. It's complicated, though, on account of related offenders.
6. Keith Devlin, Math Guy: The Birthday Problem
