Postscript source of data Feb1,'07
No. The offender database itself was not released, just a "9+ locus match summary report" which lists, one line per match, all the 9+ locus matches and the corresponding identification codes (not the DNA profiles).
Background. One 9-locus match came to light accidentally and was reported as a conference poster (Troyer et al). Eventually counsel (Bicka Barlow, San Francisco Office of the Public Defender) got wind of this and further data, and asked an Arizona judge to compel release of further data in the hope that it would aid the defense of a California man who'd been arrested on the basis of a database cold hit.
Barlow argued that "90 [the estimate then] matching pairs at 9 loci is an incredible fact. ... The State has information that they're not providing to the defense that says that, in fact, their statistical analysis is wrong and it could be wrong by orders of magnitude. ... I could have gotten a statistician to calculate a probability ... it's almost -- it's improbable [What did she stop herself from saying, and why?]."
There was no contradictory testimony at the hearing. The judge ordered the match report to be produced.
|The TH01 locus, located at about p15.5 on chromosomes 11|
At each locus, the between-individual variability of DNA is such that the probability that two randomly selected individuals have the same pair of numbers (to "match") is typically (depending on the actual pair of numbers, and on which locus) about one chance in 13.66, assuming the individuals are unrelated. (This figure is for the FBI Caucasian population study.3) Assuming the approximation that the various loci are independent, the chance of a full-profile match would nearly always be less than 1 chance in 10^13, and the median chance is 1 in 5.7×10^14. Whether these are meaningful numbers depends on how valid the assumptions of unrelatedness and of independence are; beyond that, in talking about so small a number there is some philosophical question as well.
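To make the product rule concrete, here is a minimal sketch that multiplies the per-locus figure across 13 loci. It assumes, purely for illustration, that every locus has the same "1 in 13.66" match chance; real loci differ, so this only shows the order of magnitude. The function name is hypothetical.

```python
# Product rule under the idealized assumptions of independent loci
# and unrelated individuals.
PER_LOCUS_MATCH_ODDS = 13.66  # "one chance in 13.66" per locus, from the text
NUM_LOCI = 13                 # loci in a full profile, from the text


def full_profile_match_odds(per_locus_odds: float = PER_LOCUS_MATCH_ODDS,
                            num_loci: int = NUM_LOCI) -> float:
    """Return N such that the full-profile match chance is 1 in N.

    Assumption (illustrative): every locus has the same match chance.
    """
    return per_locus_odds ** num_loci


if __name__ == "__main__":
    print(f"full-profile match chance: about 1 in {full_profile_match_odds():.2e}")
```

The result lands in the same range as the median figure quoted above, which is all this simplified calculation is meant to show.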
The term "partial match" is used to mean that two profiles match at some subset of their loci. For example, a 9-locus partial match between two profiles means that they match at some nine loci and fail to match at the remaining four loci. The Arizona DNA Offender database contained 65,493 profiles, and the following partial matches were observed:
|Observed partial matches in Arizona data|
|number of matching loci|number of partial matches|
|9|122|
|10|20|
|11|1|
|12|1|
approximately 1 in every 228 profiles in the database matched another profile in the database at nine or more loci ... How big a population does it take to produce so many matches that appear to contradict so dramatically [emphasis mine] the astronomical, theoretical figures given by the naive application of the product rule? The Arizona database contained at the time a mere 65,493 entries. Scary isn't it?

[A careless reader might confuse the "1 in 228" statistic with the chance that two randomly selected profiles match, rather than the chance that a randomly selected profile will have a match with some one out of the 65,492 other profiles. Did Devlin phrase it this way to mislead, or just to be dramatic? Probably neither. As I discuss below, he was probably misled himself.]

But this intuition is not accurate. Let's start from the theoretical matching rate assuming independence and unrelatedness, and see how many partial hits would be expected:
| number of matching loci (l) | matching chance per locus (s) | matching chance for l-locus profile (t) | pairs of individuals (p = 65493 choose 2) | ways to select l loci from 13 (w = 13 choose l) | p×w | expected partial matches (p×w×t) | observed partial matches (m)† | relative excess |
|---|---|---|---|---|---|---|---|---|
| 9 | 1/13.66 | 1 / 2.2×10^10 | 2.14×10^9 | 715 | 1.5×10^12 | 68.3 | 122 | 1.8 |
| 10 | 1/13.66 | 1 / 2.8×10^11 | 2.14×10^9 | 286 | 6.1×10^11 | 2.2 | 20 | 9 |
| 11 | 1/13.66 | 1 / 3.6×10^12 | 2.14×10^9 | 78 | 1.7×10^11 | 0.05 | 1 | 22 |
| 12 | 1/13.66 | 1 / 4.5×10^13 | 2.14×10^9 | 13 | 2.8×10^10 | 1/1600 | 1 | 1600 |
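The arithmetic behind the table can be sketched in a few lines. The formula for t used here, s^l × (1−s)^(13−l), meaning a match at l particular loci and a mismatch at the remaining ones, is my reading of how the tabulated values arise (it reproduces them), not something spelled out in the table itself. The function name is hypothetical.

```python
from math import comb

S = 1 / 13.66           # per-locus match chance, from the text
TOTAL_LOCI = 13         # loci per profile
DATABASE_SIZE = 65_493  # Arizona offender database size, from the text
OBSERVED = {9: 122, 10: 20, 11: 1, 12: 1}  # observed counts from the table


def expected_partial_matches(l: int, s: float = S,
                             n: int = DATABASE_SIZE,
                             total: int = TOTAL_LOCI) -> float:
    """Expected number of exactly-l-locus matches among n unrelated profiles.

    Assumption: t = s**l * (1 - s)**(total - l), i.e. independent loci,
    matching at l chosen loci and failing to match at the rest.
    """
    t = s ** l * (1 - s) ** (total - l)  # chance for one pair, one choice of loci
    p = comb(n, 2)                       # pairs of individuals
    w = comb(total, l)                   # ways to choose which l loci match
    return p * w * t


if __name__ == "__main__":
    for l, m in OBSERVED.items():
        e = expected_partial_matches(l)
        print(f"{l} loci: expected {e:.3g}, observed {m}, "
              f"relative excess {m / e:.3g}")
```

Both multiplicative factors, p and w, are large; leaving either out makes the expected count look far smaller than it is.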
Postscript Myers' analysis Jan30,'07
CAL-DOJ Criminalist Steve Myers presented a more careful analysis than mine at a recent meeting, taking into account estimates of the population structure. His model explains the observed results pretty well.
The observed number of 9-locus matches exceeds the prediction of the idealized, simplified model by only a factor of 1.8. If the source of this discrepancy were the product rule, then the implied departure from independence of loci would have to be considered very small. It would mean that the sequence of multiplications that leads to the number t, which contains 40 factors of 1.8, is in error by only one factor of 1.8. However, more likely the product rule is even more accurate than that.
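The "40 factors of 1.8" remark can be checked with logarithms. This is a rough sketch that counts only the nine matching-locus factors of 13.66 and ignores the smaller non-match factors, which is an assumption on my part; including them moves the answer only slightly. The function name is hypothetical.

```python
from math import log


def equivalent_factor_count(per_locus_odds: float, matching_loci: int,
                            base: float) -> float:
    """How many factors of `base` multiply out to per_locus_odds**matching_loci."""
    return matching_loci * log(per_locus_odds) / log(base)


if __name__ == "__main__":
    k = equivalent_factor_count(13.66, 9, 1.8)
    print(f"13.66^9 is about 1.8^{k:.1f}")
```

So an overall discrepancy of 1.8 corresponds to roughly one part in forty on the logarithmic scale.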
In view of the quickly increasing "relative excess" as the number of matching loci increases, the more likely source of the discrepancy is the other ideal assumption of the calculation model, namely that all the individuals are unrelated. Just as the 11 and 12 locus matches supposedly represent siblings,1 surely the 9 and 10 locus match count is also very much influenced by the presence of related criminals in the database.
There is scope for further analysis in order to understand better the population model implied by the observed data. That is, the data gives some clue as to the rate at which relatives occur in the criminal population. This is an interesting and practical question.4
Such research might also have implications for forensic practice, as alluded to in I.A.3 above.
After all, if the model under
which DNA profile computations are presented to the court is wrong, does it really matter that
it's right with respect to independence of loci but wrong in some other respect, namely the
assumption of unrelatedness? I agree with this concern, and in fact the population model
generally accepted and expected by courts in the United Kingdom for DNA identification
probability calculations takes plentiful, probably
generous, account of possible relationships. It's a subtle question, though, and beyond the
scope of this discussion, the extent to which relationships within the offender database
are relevant to a criminal trial.
Also surprising is how Devlin, a mathematician, could have made such an error.
Continuing from the passage above, though, makes it clear what happened:
Postscript Devlin disagrees Jan30,'07
Devlin disagrees. He emailed me to point out his unfinished ms Scientific Heat about Cold Hits, the appendix at the end of which takes into account the two multiplicative factors mentioned at the left. I'm relieved that he didn't need me to figure it out. Unclear which was written first. Clearly (to me) the ms and MAA articles are contradictory.
He further claims (January 2007 column) that a "legal consultant" and "blogger" (presumably meaning this page) has misrepresented the contents of his articles 1 and 2. Not in the least. My plain reading is the same as everyone else's. Please let me know if you see an alternative, any word or phrase to support Devlin's assertion that he presented not his own opinions but meant only to illustrate the errors that a non-mathematician might make.
It is not much of a leap to estimate that the FBI's national CODIS database of 3,000,000 entries will contain not just one but several pairs that match on all 13 loci, contrary (and how!) to the prediction made by proponents of the currently much touted RMP that you can expect a single match only when you have on the order of 15 quadrillion profiles.2

He has confused the number of pairs of individuals in a set (the number of possible comparisons) with the number of individuals! The actual number of pair-wise comparisons in the FBI database is 3,000,000 choose 2 ≈ 4.5×10^12, on the order of 10^13. That's not quite enough to expect a 13-locus match between unrelated individuals, but it's getting close.5 From ten times as many individuals as the FBI database holds (a Canada, approximately, of unrelated individuals) we would expect to find one pair with identical 13-locus DNA profiles.
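Here is a minimal sketch of the pairs-versus-individuals point. It assumes every profile is from an unrelated individual and applies the median full-profile figure (1 in 5.7×10^14) to every pair, which is a simplification; the function name is hypothetical.

```python
from math import comb

MEDIAN_FULL_MATCH_ODDS = 5.7e14  # median "1 in N" full-profile figure, from the text


def expected_full_matches(num_profiles: int,
                          odds: float = MEDIAN_FULL_MATCH_ODDS) -> float:
    """Expected number of 13-locus matching pairs among unrelated profiles.

    Assumption: the same median match chance applies to every pair.
    """
    return comb(num_profiles, 2) / odds


if __name__ == "__main__":
    for n in (3_000_000, 30_000_000):
        print(f"{n:,} profiles: {comb(n, 2):.3g} pairs, "
              f"expected full matches {expected_full_matches(n):.3g}")
```

Ten times as many individuals gives about a hundred times as many pairs, which is why the expected count climbs from well under one toward one.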
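Note that there are two separate multiplicative factors that the naive tend to overlook when considering the number of possible 9-locus matches from a collection of profiles such as the Arizona data: first, the number of pairs of individuals (p = 65493 choose 2), which is vastly larger than the number of individuals; and second, the number of ways to select which 9 of the 13 loci match (w = 13 choose 9 = 715).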
The first point is the same one that tricks people who mis-estimate the chance that a classroom of 23 children will have some pair sharing a birthday (slightly over half): the famous Birthday Problem. For further explication, see Devlin's own essay on the subject.6
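For anyone who wants to check the classroom figure, here is a short sketch. It assumes 365 equally likely birthdays and ignores leap years and seasonal variation; the function name is hypothetical.

```python
from math import comb


def prob_shared_birthday(n: int, days: int = 365) -> float:
    """Chance that at least two of n people share a birthday.

    Assumption: birthdays are independent and uniform over `days` days.
    """
    p_all_distinct = 1.0
    for k in range(n):
        p_all_distinct *= (days - k) / days
    return 1.0 - p_all_distinct


if __name__ == "__main__":
    n = 23
    print(f"{n} children form {comb(n, 2)} pairs; "
          f"chance of a shared birthday is {prob_shared_birthday(n):.3f}")
```

The surprise comes from the pair count: 23 children make 253 pairs, just as 65,493 profiles make over two billion.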
Devlin, of course, looked at the match data and argued the opposite: that it proves something quite negative about the underlying population genetic theory upon which DNA identification rests.
Both are wrong. If Budowle had looked at an analysis before making his pronouncement, he would have realized there is no need to be defensive. Devlin could have followed through with his own suggestion:
So what should be done? To me, the answer is obvious. Instead of using mathematics, determine the various random match probabilities empirically.2

with the Arizona data. Had he done so, he would have come to the correct conclusion: The product rule for DNA profile probabilities is very accurate. Surprisingly so.