Arizona DNA Database Matches
  1. Offender database partial matches
    1. Introduction
    2. What is a "partial match"?
    3. observed partial matches
  2. Analysis
    1. Are these results remarkable?
      calculation table
    2. Reasonable conformity to expectations
    3. Devlin's surprising oversight
  3. Conclusions
  4. References


Arizona DNA Database Matches

Charles Brenner
January 8, 2007
Analysis of partial matches in an offender database, and a critique of a surprising apparent blunder by NPR's "Math Guy" Keith Devlin.

  1. Offender database partial matches
    1. Introduction
      Several people have pointed me to some statistics derived from analysis of the Arizona DNA offender database. It is a collection of DNA profiles from convicted offenders. Apparently somehow – I am not sure if I have this part right – the entire database became publicly available and someone did an analysis of "partial matches".

      Postscript – source of data (Feb 1, '07)

      No. The offender database itself was not released, just a "9+ locus match summary report" which lists, one line per match, all the 9+ locus matches and the corresponding identification codes (not the DNA profiles).

      Background. One 9-locus match came to light accidentally and was reported as a conference poster (Troyer et al). Eventually counsel (Bicka Barlow, San Francisco Office of the Public Defender) got wind of this and further data, and asked an Arizona judge to compel release of further data in the hope that it would aid the defense of a California man who'd been arrested on the basis of a database cold hit.

      Barlow argued that "90 [the estimate then] matching pairs at 9 loci is an incredible fact. ... The State has information that they're not providing to the defense that says that, in fact, their statistical analysis is wrong and it could be wrong by orders of magnitude. ... I could have gotten a statistician to calculate a probability ... it's almost -- it's improbable [What did she stop herself from saying, and why?]."

      There was no contradictory testimony at the hearing. The judge ordered the match report to be produced.

      The idea behind such a study is that it might be interesting to compare the number of partial matches with theoretical expectation and then perhaps:
      1. If the theory comports with the data, that supports using the theory for calculations presented in court as evidence in criminal cases.
      2. If the theory and observation are inconsistent, that could be used by a criminal defendant to attack the DNA calculations presented against him.
      3. Or as a matter of research, inconsistency could be used as a clue to modify the theory.
      4. Without actually comparing theory and observation, the defense expert might be able to use innuendo to imply that #2 holds because the observations are counter-intuitive even though in fact they are as expected.
      In the past, #4 has been used from time to time by enterprising defense "experts" weak on statistics and long on bias. I can hardly imagine that Keith Devlin – National Public Radio's occasional and entertaining guest as "Math Guy" – personally belongs to that category. But his analysis in this case could not be more wrong, and at least one error is startlingly elementary.

    2. What is a "partial match"?
      [Figure: TH01. Actual electron micro-photograph. The TH01 locus, located at about p15.5 on chromosome 11.]
      A DNA profile, for the purpose of DNA identification, is thirteen pairs of numbers, each pair corresponding to a particular named location ("locus", plural "loci") on one of the chromosome pairs found in nearly every cell of the body of the human in question. My web page dna-view.com/profile.htm gives a simplified description of the identity calculation for forensic casework, consistent with the approach described (but disbelieved) by Devlin in his MAA Online articles [1, 2].

      At each locus, the between-individual variability of DNA is such that the probability that two randomly selected individuals have the same pair of numbers – that they "match" – is typically (depending on the actual pair of numbers, and on which locus) about one chance in 13.66, assuming the individuals are unrelated. (This figure is for the FBI Caucasian population study [3].) Assuming the approximation that the various loci are independent, the chance of a full-profile match would nearly always be less than 1 chance in 10^13, and the median chance is 1 in 5.7×10^14. Whether these are meaningful numbers depends on how valid the assumptions of unrelatedness and of independence are; beyond that, in talking about so small a number there is some philosophical question as well.
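As a rough check on the order of magnitude, the product rule can be applied with the typical per-locus figure used at every locus. This is only a sketch: it assumes a single match chance of 1 in 13.66 at all 13 loci, whereas a real calculation uses the actual frequencies locus by locus, and the function name `full_profile_match_odds` is illustrative rather than taken from any source.

```python
def full_profile_match_odds(per_locus_odds: float = 13.66, loci: int = 13) -> float:
    """Return N such that a full-profile match has chance 1 in N.

    Assumes independent loci (the product rule) and the same typical
    per-locus match chance of 1 in `per_locus_odds` at every locus.
    """
    return per_locus_odds ** loci


odds = full_profile_match_odds()
print(f"full 13-locus match: about 1 chance in {odds:.1e}")
```

The result lands within a few percent of the quoted median of 1 in 5.7×10^14, which is what one would expect if 13.66 is a typical per-locus value.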

      The term "partial match" is used to mean that two profiles match at some subset of their loci. For example, a 9-locus partial match between two profiles means that they match at some nine loci and fail to match at the remaining four loci. The Arizona DNA Offender database contained 65,493 profiles, and the following partial matches were observed:
      Observed partial matches in Arizona data
      number of matching loci    number of partial matches
       9                         122
      10                          20
      11                           1
      12                           1

  2. Analysis
    1. Are these results remarkable?
      Possibly the results are skewed by the presence of brothers, especially the 11- and 12-locus matches [1]. Assuming mostly unrelated pairs, even a 9-locus match seems intuitively improbable and it may be natural to suppose that 122 of them in a collection of only tens of thousands of people contradicts the independence assumption. Devlin [2] cast the observation with flair:
      approximately 1 in every 228 profiles in the database matched another profile in the database at nine or more loci ...
      [A careless reader might confuse the "1 in 228" statistic with the chance that two randomly selected profiles match, rather than the chance that a randomly selected profile will have a match with some one out of the 65,492 other profiles. Did Devlin phrase it this way to mislead, or just to be dramatic? Probably neither. As I discuss below, he was probably misled himself.]

      How big a population does it take to produce so many matches that appear to contradict so dramatically [emphasis mine] the astronomical, theoretical figures given by the naive application of the product rule? The Arizona database contained at the time a mere 65,493 entries. Scary isn't it?

      But this intuition is not accurate. Let's start from the theoretical matching rate assuming independence and unrelatedness, and see how many partial hits would be expected:
      Column definitions:
        l = number of matching loci
        s = matching chance per locus
        t = matching chance for an l-locus profile; t = s^l × (1-s)^(13-l)
        p = number of pairs of individuals; p = 65493 choose 2 = 65493×65492/2
        w = number of ways to select l loci from 13; w = 13 choose l
        n = number of comparisons; n = p×w
        x = number of partial matches expected; x = n×t
        m = number of partial matches observed†
        relative excess = m/x

      l    s         t                p           w     n           x        m†    m/x
      9    1/13.66   1/(2.2×10^10)    2.14×10^9   715   1.5×10^12   68.3     122   1.8
      10   1/13.66   1/(2.8×10^11)    2.14×10^9   286   6.1×10^11   2.2      20    9
      11   1/13.66   1/(3.6×10^12)    2.14×10^9   78    1.7×10^11   0.05     1     22
      12   1/13.66   1/(4.5×10^13)    2.14×10^9   13    2.8×10^10   1/1600   1     1600
      † Feb 1, '07 – corrected numbers of observed partial matches (from the court-ordered match summary report). E.g. 122 is the number of matches at exactly 9 loci.
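The arithmetic in the table can be reproduced with a short script. This is a sketch under the same idealized assumptions as the table (independent loci, unrelated individuals, one typical per-locus match chance of 1 in 13.66); the names `expected_partial_matches` and `OBSERVED` are illustrative and not part of the original analysis.

```python
from math import comb

PROFILES = 65493                  # profiles in the Arizona offender database
LOCI = 13                         # loci in a full profile
S = 1 / 13.66                     # s: typical per-locus match chance
OBSERVED = {9: 122, 10: 20, 11: 1, 12: 1}   # m: observed partial matches


def expected_partial_matches(l: int) -> float:
    """Expected number x of pairs matching at exactly l of the 13 loci."""
    t = S**l * (1 - S) ** (LOCI - l)   # chance for one pair and one choice of l loci
    p = comb(PROFILES, 2)              # pairs of individuals
    w = comb(LOCI, l)                  # ways to choose which l loci match
    n = p * w                          # comparisons
    return n * t


for l, m in OBSERVED.items():
    x = expected_partial_matches(l)
    print(f"l={l:2d}  expected x={x:.3g}  observed m={m}  relative excess m/x={m / x:.3g}")
```

Because the inputs are rounded in the table, the printed values agree with it only to about two significant figures.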

    2. Reasonable conformity to expectations
      Look at column n and note the number of 9-locus partial match comparisons that there are among the population of only 65,493. It's a huge number, getting up toward the neighborhood of the reciprocal of the probability of a full profile match. No wonder there are 9- and 10-locus matches.

      Postscript – Myers' analysis (Jan 30, '07)

      CAL-DOJ Criminalist Steve Myers presented a more careful analysis than mine at a recent meeting, taking into account estimates of the population structure. His model explains the observed results pretty well.

      The observed number of 9-locus matches exceeds the prediction of the idealized, simplified model by only a factor of 1.8. If the source of this discrepancy were the product rule, then the implied departure from independence of loci would have to be considered very small. It would mean that the sequence of multiplications that leads to the number t, which amounts to about 40 factors of 1.8 (since 1/t is about 2.2×10^10), is in error by only one factor of 1.8. However, more likely the product rule is even more accurate than that.
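The "40 factors of 1.8" figure is a logarithm in disguise: it asks how many times 1.8 must be multiplied by itself to reach 1/t for the 9-locus row. A minimal sketch, with the helper name `factors_of` chosen here purely for illustration:

```python
from math import log


def factors_of(base: float, value: float) -> float:
    """Return k such that base**k == value, i.e. log of value to the given base."""
    return log(value) / log(base)


# 1/t for a 9-locus partial match is about 2.2e10 (from the calculation table).
n_factors = factors_of(1.8, 2.2e10)
print(f"1/t is about 1.8 raised to the power {n_factors:.1f}")
```

The answer comes out a little over 40, so a discrepancy of a single factor of 1.8 is about one part in forty on the logarithmic scale.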

      In view of the quickly increasing "relative excess" as the number of matching loci increases, the more likely source of the discrepancy is the other ideal assumption of the calculation model, namely that all the individuals are unrelated. Just as the 11 and 12 locus matches supposedly represent siblings [1], surely the 9 and 10 locus match count is also very much influenced by the presence of related criminals in the database.

      There is scope for further analysis in order to understand better the population model implied by the observed data. That is, the data gives some clue as to the rate at which relatives occur in the criminal population. This is an interesting and practical question [4].

      Such research might also have implications for forensic practice, as alluded to in I.A.3 above. After all, if the model under which DNA profile computations are presented to the court is wrong, does it really matter that it's right with respect to independence of loci but wrong in some other respect, namely the assumption of unrelatedness? I agree with this concern, and in fact the population model generally accepted and expected by courts in the United Kingdom for DNA identification probability calculations takes plentiful, probably generous, account of possible relationships. The extent to which relationships within the offender database are relevant to a criminal trial is a subtle question, though, and beyond the scope of this discussion.

    3. Devlin's surprising oversight
      Far from supporting the naive intuition that the Arizona matches conflict with the product rule, we see that they are in fact surprisingly consistent with it.

      Also surprising is how Devlin, a mathematician, could have made such an error. Continuing from the passage quoted above, though, makes it clear what happened:

      It is not much of a leap to estimate that the FBI's national CODIS database of 3,000,000 entries will contain not just one but several pairs that match on all 13 loci, contrary (and how!) to the prediction made by proponents of the currently much touted RMP that you can expect a single match only when you have on the order of 15 quadrillion profiles. [2]
      He has confused the number of pairs of individuals in a set – the number of possible comparisons – with the number of individuals! The actual number of pair-wise comparisons in the FBI database is 3,000,000 choose 2 ≈ 4.5×10^12, or on the order of 10^13. That's not quite enough to expect a 13-locus match between unrelated individuals, but it's getting close [5]. From ten times that number of unrelated individuals – from a Canada, approximately, of unrelated individuals – we would expect to find one pair with identical 13-locus DNA profiles.
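The distinction between individuals and pairs is easy to check numerically. The sketch below counts pairs for a database of 3,000,000 and for one ten times larger, then divides by the median full-profile figure of 1 in 5.7×10^14. Applying the median to every pair is a simplifying assumption, and the name `expected_full_matches` is illustrative only.

```python
from math import comb

MEDIAN_FULL_MATCH_ODDS = 5.7e14   # median 13-locus match chance is 1 in this many


def expected_full_matches(profiles: int) -> float:
    """Expected number of unrelated pairs sharing a full 13-locus profile.

    Assumes every pair matches with the median chance, which is a simplification.
    """
    return comb(profiles, 2) / MEDIAN_FULL_MATCH_ODDS


for size in (3_000_000, 30_000_000):
    pairs = comb(size, 2)
    print(f"{size:>10,d} profiles -> {pairs:.2e} pairs, "
          f"expected full matches ~ {expected_full_matches(size):.3f}")
```

Multiplying the number of profiles by ten multiplies the number of pairs by about a hundred, which is why the larger population brings the expected count up to the neighborhood of one.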

      Note that there are two separate multiplicative factors that the naive tend to overlook when considering the number of possible 9-locus matches from a collection of profiles such as the Arizona data:

      1. the factor, equal to one-half the size of the sample, by which the number of pairs exceeds the size of the sample;
      2. the combinatorial factor – 715 above – representing the number of different 9-locus selections from 13, each of which is an opportunity for two selected individuals to have a 9-locus match.

      The first point is the same one that tricks people who mis-estimate the chance that a classroom of 23 children will have some pair sharing a birthday (slightly over half) – the famous Birthday Problem. For further explication, see Devlin's own essay on the subject [6].
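The birthday figure can be verified directly. This sketch assumes 365 equally likely birthdays and ignores leap years; the function name `shared_birthday_probability` is chosen here for illustration.

```python
from math import comb


def shared_birthday_probability(people: int, days: int = 365) -> float:
    """Chance that at least two of `people` share a birthday, all days equally likely."""
    p_all_distinct = 1.0
    for k in range(people):
        # The (k+1)-th person must avoid the k birthdays already taken.
        p_all_distinct *= (days - k) / days
    return 1.0 - p_all_distinct


print(f"pairs among 23 children: {comb(23, 2)}")
print(f"chance of a shared birthday: {shared_birthday_probability(23):.3f}")
```

With 23 children there are 253 pairs, far more than 23 individuals, and that pair count is what drives the probability above one half.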

  3. Conclusions
    Bruce Budowle, an FBI and "pro-DNA" guy, gave a talk at the recent (October 12, 2006) Promega meeting in which he stated that there is no point in looking for partial matches in a database because the results will be meaningless since one does not know the relatedness of the individuals in the database. I disagree. The possible relatedness may be a confounding factor in interpreting the results. But if a reasonable estimate of relatedness cannot explain the excess amount of partial matching, then the "anti-DNA" forces are entitled to their day. However, if the observed partial matching is consistent with a reasonable theory, then the experiment might advance our understanding of the population and/or buttress the traditional presentation of DNA evidence.

    Devlin, of course, looked at the match data and argued the opposite – that it proves something quite negative about the underlying population genetic theory upon which DNA identification rests.

    Both are wrong. If Budowle had looked at an analysis before making his pronouncement, he would have realized there is no need to be defensive. Devlin could have followed through with his own suggestion:

    So what should be done? To me, the answer is obvious. Instead of using mathematics, determine the various random match probabilities empirically. [2]
    with the Arizona data. Had he done so, he would have come to the correct conclusion: The product rule for DNA profile probabilities is very accurate. Surprisingly so.

Footnotes

  1. Keith Devlin, "Statisticians not wanted", MAA Online
  2. Keith Devlin, "Damned lies", MAA Online
  3. Journal of Forensic Science, Vol. 44, No. 6
  4. Brenner, Bieber, Lazer, "Finding Criminals through DNA of their Relatives"
  5. In fact there are many perfect matches, due mainly to the fact that offenders get catalogued from two different states or under different names. Hence the experiment is impossible to perform. However checking for near-perfect matches is a good experiment. It's complicated though, on account of related offenders.
  6. Keith Devlin, Math Guy: The Birthday Problem