Charles Brenner
Presentation at the 2005 APL2000 User Conference, Naples, Florida, November 8, 2005
in which a would-be mathematician, wannabe geneticist and uncultured programmer leverages the nimbleness of APL to identify criminals, fathers, World Trade Center and tsunami victims, and determine race using DNA in a world of fast-changing DNA identification technology.
The human genome consists of 46 chromosomes. There are two each of 1-22, the autosomal chromosomes. XX or XY are the two sex chromosomes.
Note that there is pairing at two levels: The chromosomes are paired in that there are two of each number, which is physically significant during cell division and reproduction and genetically significant because most genes occur in two copies. Second, the DNA strand in each chromosome is a double strand (double helix), with the sequence of A,C,G, and T's on one strand mirrored by a complementary sequence of T,G,C, and A's on the other by the pairing rule. During cell division the strands separate and each single strand then serves as a template to recreate a copy of the original double strand.
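The pairing rule can be written down in a couple of lines. This is only an illustrative sketch in Python (not the APL used in DNA·VIEW); the names `PAIR` and `complement` are made up for the example, and strand direction is ignored.

```python
# Base-pairing rule: A pairs with T, C pairs with G.
PAIR = {"A": "T", "C": "G", "G": "C", "T": "A"}

def complement(strand):
    """Return the complementary strand, base by base (orientation ignored)."""
    return "".join(PAIR[base] for base in strand)

# Each single strand is a template for the other:
# complementing twice recreates the original sequence.
original = "ACGT"
partner = complement(original)
restored = complement(partner)
```

Because the rule is its own inverse, either strand carries enough information to rebuild the double strand, which is what happens at cell division.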
An average chromosome is about 100,000,000 base pairs long. A gene is a small subset of a chromosome, typically a few thousand base pairs. Any random error in the sequence is likely to incapacitate the gene and in turn the organism; consequently there is relatively little variation in the genes among individuals.
By contrast the genome is replete with junk DNA. Variations in junk DNA carry no survival penalty, so a lot of variation has accumulated over time and can be used to distinguish individuals. The traditional (time span = a handful of years) approach therefore employs locations (loci, singular locus) in the genome that are sometimes called pseudo-genes. The locus TH01 (near the gene for tyrosine hydroxylase) is defined as a particular stretch of about 200 bases at 11p15.5, that is, chromosome 11, the p or short arm, within band 15. The "business" part of TH01 consists of 6 to 11 short tandem repeats (STRs) of the same 4-base motif (tetramer), AATG; for example, six repeats read AATGAATGAATGAATGAATGAATG. The variant forms, called alleles (analogous to isotopes or isomers), vary between individuals and also between the two chromosomes of a pair within an individual. For example, a person can have a TH01 type of 8,10.
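To make the idea of an STR allele concrete, here is a minimal Python sketch (not APL, and not how laboratory instruments actually size alleles) that names an allele by counting back-to-back copies of the motif. The function name `count_repeats` and the flanking bases in the example are invented for illustration.

```python
def count_repeats(sequence, motif="AATG"):
    """Length, in repeat units, of the longest uninterrupted run of motif."""
    best = 0
    step = len(motif)
    for start in range(len(sequence)):
        n = 0
        # Walk forward one motif at a time while the run continues.
        while sequence.startswith(motif, start + n * step):
            n += 1
        best = max(best, n)
    return best

# Two hypothetical chromosome copies; the flanking bases are made up.
copy_a = "CCT" + "AATG" * 8 + "GGA"
copy_b = "CCT" + "AATG" * 10 + "GGA"
th01_type = (count_repeats(copy_a), count_repeats(copy_b))
```

With these made-up sequences `th01_type` is the pair 8,10, matching the example type in the text.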
A DNA profile is typically 13 or so loci:
({13,15}, {28,28}, {8,10}, ...). Alleles vary greatly in frequency,
the average being around 1/5. The average match probability between
unrelated people is therefore about 0.1 per locus, or 10^-13 for a
13-locus profile. Hence near-certain association of criminals with crimes.
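The arithmetic behind those round numbers can be checked with a short calculation. The Python sketch below rests on simplifying assumptions that are not in the talk: five equally frequent alleles per locus, genotypes in Hardy-Weinberg proportions, and independent loci. The function name `match_probability` is hypothetical.

```python
from fractions import Fraction
from itertools import combinations_with_replacement

def match_probability(freqs):
    """Chance that two unrelated people share the same genotype at one locus.

    Assumes Hardy-Weinberg genotype proportions (an assumption of this sketch).
    """
    total = Fraction(0)
    for i, j in combinations_with_replacement(range(len(freqs)), 2):
        # Genotype frequency: p^2 for homozygotes, 2pq for heterozygotes.
        g = freqs[i] ** 2 if i == j else 2 * freqs[i] * freqs[j]
        total += g * g  # both people must have this genotype
    return total

# Five alleles, each with frequency 1/5.
per_locus = match_probability([Fraction(1, 5)] * 5)

# Rounding to 0.1 per locus and multiplying over 13 independent loci.
profile = Fraction(1, 10) ** 13
```

Under these assumptions `per_locus` is 9/125, or 0.072, which is the "about 0.1" of the text, and thirteen such loci multiply out to 10^-13.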
Naturally all this leads to myriad opportunities for computer tools to assist the DNA analyst. Biology and biological data, unlike engineering data, are inherently irregular, incomplete, and inaccurate, and software must therefore be forgiving. For this reason and others there are many technical complications.
Programming note
DNA·VIEW (the name for the package containing my 20 years of software development) includes 1000 or so population samples for various populations and loci. There is a menu with names for each, each name of course multi-part. Therefore I have found it very convenient to use an idea I learned from Eric Lescasse for menu selection.
For example, to select the entry (D18S51 STR Amsterdam Iraqi 70 03/03/27 2pm), I can type amstd18iraq or iraqd18 or many other possibilities.
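The exact matching rule is not spelled out here, so the following Python sketch (not the APL original) is only a guess at it: a typed string selects a menu name if it can be cut into fragments, each of which is the beginning of a different word of the name, in any order and ignoring case. The functions `matches` and `select`, and the second menu entry, are invented for the example.

```python
def matches(typed, name):
    """True if typed splits into prefixes of distinct words of name."""
    words = name.lower().split()

    def search(rest, unused):
        if not rest:
            return True  # every typed character has been accounted for
        for i, word in enumerate(unused):
            # Length of the common prefix of this word and the remaining input.
            k = 0
            while k < min(len(word), len(rest)) and word[k] == rest[k]:
                k += 1
            # Try the longest fragment first, then shorter ones.
            for n in range(k, 0, -1):
                if search(rest[n:], unused[:i] + unused[i + 1:]):
                    return True
        return False

    return search(typed.lower(), words)

def select(typed, names):
    """All menu names that the typed abbreviation could mean."""
    return [name for name in names if matches(typed, name)]

menu = [
    "D18S51 STR Amsterdam Iraqi 70 03/03/27 2pm",
    "TH01 STR Amsterdam Dutch 120 03/03/27 2pm",  # hypothetical entry
]
picked = select("amstd18iraq", menu)
```

Under this rule both amstd18iraq and iraqd18 pick out the first entry only, while a shorter abbreviation such as amst leaves both entries as candidates.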
The classic paternity question is to collect
some genetic information (DNA profiles) and ask which of these two pictures better
explains the data:
There are a few simple formulas, such as
1/q, for paternity analysis that are well
known to people in the business.
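As a reminder of where a formula like 1/q comes from, here is a Python sketch (not APL) of the standard textbook single-locus case; the setup is an assumption of the sketch rather than something stated in the talk. The child's paternal allele is known to be Q, with population frequency q; the tested man passes on Q with probability 1 if he is homozygous for it and 1/2 if he carries one copy, whereas a random man passes it on with probability q. The function name and the frequency value are illustrative.

```python
from fractions import Fraction

def paternity_index(man, paternal_allele, freq):
    """Likelihood ratio: tested man is the father vs. a random man is.

    man             - the man's two alleles at the locus, e.g. (8, 10)
    paternal_allele - the allele the child must have got from its father
    freq            - allele frequencies (made-up values in the example)
    """
    # Probability that this man transmits the required allele: 0, 1/2 or 1.
    transmit = Fraction(man.count(paternal_allele), 2)
    return transmit / freq[paternal_allele]

freq = {8: Fraction(1, 5)}                     # assumed frequency q = 1/5
homozygous = paternity_index((8, 8), 8, freq)     # this is the 1/q case
heterozygous = paternity_index((8, 10), 8, freq)  # 1/(2q)
excluded = paternity_index((9, 10), 8, freq)      # man lacks the allele
```

With q = 1/5 the homozygous man gives 1/q = 5, the heterozygous man gives 5/2, and a man without the allele gives 0.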
Implementation of this idea is simple. Wherever the original
program might add some probabilities, for example +/P, the new
program simply uses a defined function instead: Plus/P. The formulas
in question turn out to be polynomials in n+1 variables
(p, q, ... for the frequency of each mentioned
allele P, Q, ..., plus one symbol, z, for the total frequency
of all other alleles), hence the data structure to represent a
polynomial is two arrays: a matrix of exponents with a column per
variable and a row for each term, and a vector of coefficients for
each term. It is easy to write Plus, Times,
Simplify (collapse like
terms), Over (divide two polynomials and remove common factors) and
one function that turns out to have remarkable properties, UnZ, which
removes all the instances of the variable z in favor of the
equivalent expression 1-p-q-... . This is necessary
because z should only occur internally; the user doesn't
expect to see it. What is remarkable is that UnZ very much reduces the
complexity of the polynomial. Dozens of terms reduce to a few. Still,
the resulting expressions are funny looking, wrapping to several lines
when applied to an irregular collection of people who are hypothesized
to be half-siblings, uncles, cousins, and inbred.
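The data structure described above is easy to mimic outside APL. The Python sketch below is an illustration of the idea, not the DNA·VIEW code: a polynomial is a pair of a list of exponent rows (one column per variable) and a list of coefficients. The function names echo Plus, Times, Simplify and UnZ; Over is left out, and treating the last column as z is a convention assumed for the sketch.

```python
from collections import defaultdict
from fractions import Fraction

# A polynomial is (exps, coefs): exps is a list of exponent tuples, one per
# term and one column per variable; coefs holds the matching coefficients.
# Convention assumed here: the LAST column is the catch-all variable z.

def simplify(poly):
    """Collapse like terms and drop terms whose coefficient is zero."""
    acc = defaultdict(Fraction)
    for e, c in zip(*poly):
        acc[tuple(e)] += c
    items = sorted((e, c) for e, c in acc.items() if c != 0)
    return [e for e, _ in items], [c for _, c in items]

def plus(a, b):
    """Sum of two polynomials: concatenate the terms, then simplify."""
    return simplify((a[0] + b[0], a[1] + b[1]))

def times(a, b):
    """Product: add exponents and multiply coefficients, term by term."""
    exps, coefs = [], []
    for ea, ca in zip(*a):
        for eb, cb in zip(*b):
            exps.append(tuple(x + y for x, y in zip(ea, eb)))
            coefs.append(ca * cb)
    return simplify((exps, coefs))

def un_z(poly):
    """Replace every z by 1-p-q-... and simplify the result."""
    exps, coefs = poly
    if not exps:
        return poly
    n = len(exps[0]) - 1  # number of named alleles (columns before z)
    unit = [tuple(int(j == i) for j in range(n + 1)) for i in range(n)]
    one_minus = ([(0,) * (n + 1)] + unit, [1] + [-1] * n)  # 1 - p - q - ...
    result = ([], [])
    for e, c in zip(exps, coefs):
        term = ([tuple(e[:n]) + (0,)], [c])  # the term with z stripped out
        for _ in range(e[n]):                # once per power of z
            term = times(term, one_minus)
        result = plus(result, term)
    return result

# Variables (p, q, z). The sum p + q + z is the total allele frequency.
total = ([(1, 0, 0), (0, 1, 0), (0, 0, 1)], [1, 1, 1])
squared = times(total, total)   # six terms, three of them involving z
reduced = un_z(squared)         # a single constant term
```

In this small case the six terms of (p+q+z) squared collapse to the constant 1 once z is eliminated, a miniature of the reduction from dozens of terms to a few.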
A new problem that arose was to sort out the thousands of DNA profiles arising from the World Trade Center attack: the victim profiles from anonymous bodies, versus the reference profiles from relatives or toothbrushes. Given a suspected identity, it is all very well to use the Kinship program to test it, but kinship testing is inherently manual (the pedigree information is neither uniform nor accurate), so it is not practical to use the kinship program a million times with the wrong assumptions.
I don't have a lot to offer in return. My days at the forefront of computer literacy probably expired about 1963. I would of course be glad to share the utilities or the ideas behind them that I have developed, though I don't have any code in shape to include as a workspace. Adrian's wry comment a few years ago that Rex's utilities are typically about 90% complete would need to be revised negatively for my stuff as it stands.
I would, though, like to recommend a package that John Walker pointed out a few years ago, namely a free but very professional installer package called Inno Setup and a companion macro preprocessor called ISTool. Examination of the script file [omitted] that I use to create an install file for DNA·VIEW will give you an idea of its capabilities, and to the extent that I have learned some of the ins and outs of using it I will be glad to answer questions from any APLer who wishes to call or e-mail me.