Some Probabilistic Results on the
Non-randomness of Simple Sequence Repeats in DNA
Sequences
2007 DIMACS Workshop on Mathematical Modeling of Infectious Diseases in Africa, Stellenbosch, South Africa, June 25-27
Asamoah Nkwanta, Morgan State University
Joint work with Wilfred Ndifon & Dwayne Hill
Nonrandomness of Microsatellites
“Numerous lines of evidence have demonstrated that genomic distribution of simple sequence repeats (SSRs) is nonrandom because of their effects on chromatin organization, regulation of gene activity, recombination, DNA replication, cell cycle, …”
You-Chun Li et al., Microsatellites Within Genes: Structure, Function, and Evolution, Molecular Biology and Evolution 21 (2004)
TOPICS
Introduction DNA/RNA
Preliminaries on SSRs
Counting SSRs
Probability & Expectations of SSRs
Results & Conclusion
Introduction: Molecular Biology
Proteins are the building blocks of living organisms.
The information necessary for producing proteins is encoded in Deoxyribonucleic acid (DNA).
DNA can be treated as a set of words over the genetic alphabet consisting of the letters A (Adenine), T (Thymine), C (Cytosine), & G (Guanine).
Introduction: Molecular Biology (Cont.)
DNA has a double-helix structure
Introduction: Molecular Biology (Cont.)
Ribonucleic acid (RNA) mediates the translation of DNA into proteins.
An RNA molecule consists of a sequence of ribonucleotides. Each ribonucleotide contains one of four bases: A, C, G, and U (Uracil). (Note: in RNA, uracil takes the place of the thymine found in DNA.)
Introduction: Molecular Biology (Cont.)
Central Dogma of Molecular Biology
DNA → RNA → Protein
(DNA → RNA: transcription; RNA → Protein: translation)
Introduction: Molecular Biology (Cont.)
Revised Central Dogma
DNA Microsatellites: Agents of Evolution?; January 1999; Scientific American Magazine; by Moxon, Wills; 6 Page(s)
A human's genetic code consists of roughly three billion bases of DNA, the familiar "letters" of the DNA alphabet. But a mere 10 to 15 percent of those bases make up genes, the blueprints cells use to build proteins. Some of the remaining base sequences in humans, and in many other organisms, perform crucial functions, such as helping to turn genes "on" and "off" and holding chromosomes together. Much of the DNA, however, seems to have no obvious purpose at all, leading some to refer to it as "junk."
Part of this "junk DNA" includes strange regions known as DNA satellites. These are repetitive sequences made up of various combinations of the four DNA bases, adenine (A), cytosine (C), guanine (G) and thymine (T), repeated over and over, like a genetic stutter. In the past several years, researchers have begun to find that so-called microsatellites, those containing the shortest repeat sequences, have a significance disproportionately great for their size and perform a variety of remarkable functions.
Preliminaries: What are SSRs?
Simple Sequence Repeats (SSRs), or microsatellites, are regions (motifs) within genomic DNA sequences where short sequences of nucleotides, 1 to 6 base pairs in length, are repeated in tandem.
The repeat units most often used are di-, tri-, and tetranucleotides.
Preliminaries (cont.): What are SSRs?
Example: TACCCAGCAGGCCTATATATA.
This is a DNA sequence of length 21 that contains stretches of dimers (TA), trimers (CAG), and tetramers (TATA).
The CAG copies are contiguous (CAGCAG); the TA occurrences are non-contiguous (TA at the start, TATATATA at the end).
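The repeats in the example sequence can be located with a short scan. The following is a minimal Python sketch rather than the authors' method; the function name `find_tandem_runs` and the 1-based position convention are assumptions introduced here for illustration.

```python
import re

def find_tandem_runs(seq, motif, min_copies=1):
    """Return (start, copies) for each maximal tandem run of `motif` in `seq`.

    `start` is a 1-based position and `copies` is the number of
    back-to-back copies of the motif in that run (hypothetical helper).
    """
    # One or more consecutive copies of the motif, matched greedily.
    pattern = re.compile(f"(?:{re.escape(motif)})+")
    runs = []
    for match in pattern.finditer(seq):
        copies = (match.end() - match.start()) // len(motif)
        if copies >= min_copies:
            runs.append((match.start() + 1, copies))
    return runs

seq = "TACCCAGCAGGCCTATATATA"
print(len(seq))                      # 21
print(find_tandem_runs(seq, "TA"))   # [(1, 1), (14, 4)]
print(find_tandem_runs(seq, "CAG"))  # [(5, 2)]
```

The scan reports non-overlapping runs from left to right, which is enough for this example; motifs that overlap themselves (such as AA) would need more careful handling.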
Preliminaries (Cont.)
Table 1. Total lengths (a) of simple sequence repeats by repeated unit length

                    Length of repeated motif (bp)
Taxonomic group          1     2     3     4     5     6   Total
Primates              3429  1643   477  1368   898   341    8156
Human chromosome 22   5141  1511   604  1906  1097   419   10678
Rodentia              1839  5461  1196  2942  1417  1034   13889
Mammalia              1397  2312   532   915   774   693    6623
Vertebrata            1418  2449  1069  1279   709   220    7144
Arthropoda             985  1403   956   439   732   875    5390
C. elegans             428   556   337   144   225   449    2139
Embryophyta           1245  1067   880   184   491   272    4139
S. cerevisiae         1075   580   646    93   204   406    3004
Fungi                  905   272   485   194   395   426    2677

(a) Base pairs (bp) per megabase of DNA.
SSRs are relatively abundant.
Preliminaries (Cont.)
Some Characteristics of SSRs: They are
Highly mutable
Good molecular markers
Involved in gene regulation
Involved in the development of immune system cells
Associated with at least 20 human diseases, including Huntington’s disease and some cancers
Preliminaries (Cont.)
Real World Applications:
Linkage analysis (related to inheritance)
DNA fingerprinting
Genome sequencing (e.g., genome of the apple plant)
Diagnosis of genetic disorders
Paternity tests
Forensic studies
Population & ecological genetic studies
Ecological Genetics of Parasitic Sea Lice
Typical epidermal lesions caused by adult female Lepeophtheirus salmonis in the region of the anal fin of a wild-caught salmon
Marine Ecology Research Group: www.st-andrews.ac.uk/~merg/sea%20lice.htm
Recent research has focused on the development and screening of L. salmonis specific microsatellites. Microsatellites such as CACACACACACA are dispersed throughout the genome.
Ecological Genetics of Parasitic Sea Lice (cont.)
Chromas file depicting the base sequence of an L. salmonis repeat region [CA19-AA-CA4] (bases 117 to 164) and its flanking regions.
Primers designed to anneal to the DNA sequences flanking the microsatellite region allow the indirect measure of the number of repeat units. The variability in repeat number is often high, and the construction of multilocus genotypes may allow analyses at both the population and individual levels.
Population Genetic Studies
African American Lives, a four-part PBS series, shows how DNA analysis is used to trace lineage through American history and back to Africa. Microsatellites play an important role in lineage analysis.
Population Genetic Studies (cont.)
Lineage and Admixture: The Tests
Learning from DNA
Migration of Populations Around the World
Diagnosis of Genetic Disorders
Rethinking genotype and phenotype correlations in polyglutamine expansion disorders, Susan E. Andrew, Y. Paul Goldberg, and Michael R. Hayden
Counting Non-contiguous SSRs
Definition 1: A DNA sequence X of length n is denoted by the random sequence $X = x_1 x_2 \cdots x_n$, where each $x_i$ is defined over the 4-letter nucleotide alphabet $\{A, C, G, T\}$.
Note: randomness here refers to the non-uniform Bernoulli model (meaning all bases of X are independent and have possibly unequal probabilities).
For instance, $X = \text{TACCCAGCAGGCCTATATATA}$ is a DNA sequence of length $n = 21$. TA, CAG, and CAGCAG are SSRs of lengths 2, 3, and 6, respectively.
Counting Non-contiguous SSRs (cont.)
Definition 2: A k-mer Y is a subsequence $Y = y_1 y_2 \cdots y_k$ of a DNA sequence X of length n, where $1 \le k \le 6$.
For instance, for $X = \text{TACCCAGCAGGCCTATATATA}$, TA is a 2-mer (dimer) and CAG is a 3-mer (trimer).
Counting Non-contiguous SSRs (cont.)
Definition 3: A kt-linked SSR of a k-mer Y is a subsequence of X of length kt that consists of t tandem copies of Y.
For instance, for $X = \text{TACCCAGCAGGCCTATATATA}$, CAGCAG is a 6-linked SSR of the trimer $Y = \text{CAG}$, and TATATATA is an 8-linked SSR of the dimer $Y = \text{TA}$.
Counting Non-contiguous SSRs (cont.)
We count the number of ways of distributing l occurrences of a kt-linked SSR of Y into (n - klt + 1) possible positions in a DNA sequence X of length n by the binomial coefficient
$$\binom{n - klt + 1}{l}.$$
See the following example.
Counting Non-contiguous SSRs (cont.)
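Example 1: How many arrangements of the 3 non-contiguous occurrences of the 4-linked SSR of Y = GA are there in a sequence X of length 15 made up of three GAGA blocks and three arbitrary bases?
Using the above binomial coefficient with $n = 15$, $k = 2$, $t = 2$, and $l = 3$:
$$\binom{n - klt + 1}{l} = \binom{15 - 12 + 1}{3} = \binom{4}{3} = 4.$$
Thus there are 4 such arrangements.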
Counting Non-contiguous SSRs (cont.)
Example 1 (cont.): The 4 arrangements of the 3 non-contiguous occurrences of the 4-linked SSR of Y = GA are
GAGA*GAGA*GAGA*
GAGA*GAGA**GAGA
GAGA**GAGA*GAGA
*GAGA*GAGA*GAGA
where * denotes an arbitrary base.
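The count in Example 1 can be checked by brute force. The sketch below assumes that "non-contiguous" means consecutive blocks are separated by at least one arbitrary base; the function name `noncontiguous_arrangements` and the `*` placeholder for an arbitrary base are choices made here for illustration, not notation from the talk.

```python
from itertools import combinations
from math import comb

def noncontiguous_arrangements(n, motif, t, l, gap="*"):
    """List all layouts of l blocks of t tandem copies of `motif` in n positions.

    Blocks may not overlap or touch: at least one arbitrary base
    (shown as `gap`) must separate consecutive blocks.
    """
    block = motif * t
    size = len(block)  # equals k*t for a k-mer motif
    layouts = []
    for starts in combinations(range(n - size + 1), l):
        # Keep only placements with a gap of at least one base between blocks.
        if all(b - a > size for a, b in zip(starts, starts[1:])):
            chars = [gap] * n
            for s in starts:
                chars[s:s + size] = block
            layouts.append("".join(chars))
    return layouts

n, k, t, l = 15, 2, 2, 3
layouts = noncontiguous_arrangements(n, "GA", t, l)
for layout in layouts:
    print(layout)
print(len(layouts), comb(n - k * l * t + 1, l))  # 4 4
```

Under this reading, the enumeration agrees with the binomial coefficient for other sequence lengths as well, which is a useful sanity check on the counting argument.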
Counting Non-contiguous SSRs (cont.)
Lemma 1: The number of non-contiguous arrangements of l occurrences of a kt-linked SSR of a k-mer Y in a DNA sequence X of length n is given by the nth coefficient of the following generating function:
$$G(z) = \sum_{n \ge klt + l - 1} \binom{n - klt + 1}{l} z^n = \frac{z^{\,l(kt+1)-1}}{(1 - z)^{l+1}}.$$
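A quick numerical check of Lemma 1 is possible if one assumes the generating function has the closed form $z^{l(kt+1)-1}/(1-z)^{l+1}$. The sketch below expands that expression as a power series (multiplying by $1/(1-z)$ is a running sum of coefficients) and compares each coefficient with $\binom{n-klt+1}{l}$. The function name `gf_coefficients` is hypothetical.

```python
from itertools import accumulate
from math import comb

def gf_coefficients(k, t, l, n_max):
    """Coefficients of z^0..z^n_max in z**(l*(k*t+1)-1) / (1-z)**(l+1)."""
    shift = l * (k * t + 1) - 1
    coeffs = [0] * (n_max + 1)
    if shift <= n_max:
        coeffs[shift] = 1  # series of the numerator z**shift
    for _ in range(l + 1):
        # Dividing by (1 - z) turns coefficients into their running sums.
        coeffs = list(accumulate(coeffs))
    return coeffs

k, t, l = 2, 2, 3
coeffs = gf_coefficients(k, t, l, 20)
for n in range(21):
    m = n - k * l * t + 1
    expected = comb(m, l) if m >= 0 else 0
    assert coeffs[n] == expected
print(coeffs[15])  # 4, matching Example 1
```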
Probability of SSRs
An urn model approach is used to compute the probability of SSRs of Y at a position i in a DNA sequence X of length n.
Probability of SSRs (cont.)
Definition 4: Let $N_Y$ denote the number of occurrences of a k-mer Y in a DNA sequence X. Then $N_Y = \sum_{i=1}^{j} t_i$, where $t_i$ is the number of tandem copies found in the ith SSR of Y.
Note: $N_Y$ is used to compute statistics on the occurrence of SSRs of Y.
Probability of SSRs (cont.)
Theorem 1: Let $U_i$ denote a random variable representing the number of tandem copies of a k-mer Y occurring at position i in a DNA sequence X. Then
$$P(U_i = t) = \frac{N_Y^{\,t}\,(n - kN_Y + 1)}{\bigl(n - N_Y(k - 1) + 1\bigr)^{t+1}}$$
is the probability that an occurrence of a kt-linked SSR of Y starts at position i in X.
Probability of SSRs (cont.)
Corollary 1: The variance $\sigma^2$ of frequencies of SSRs of a k-mer Y in a DNA sequence of length n is given by a closed-form expression in $N_Y$, $n$, and $k$.
[Equation not recoverable from the source.]
Probability of SSRs (cont.)
Theorem 2: The expected number of non-contiguous occurrences of a kt-linked SSR of a k-mer Y in a DNA sequence X is given by
$$E_Y = P(U_i = t)\,(n - kN_Y + 1).$$
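The two theorems can be turned into a small calculator. The sketch below is illustrative only and rests on an assumption: that the probability in Theorem 1 takes the geometric form $P(U_i = t) = p^t(1 - p)$ with $p = N_Y / (n - N_Y(k-1) + 1)$, and that Theorem 2 multiplies it by $(n - kN_Y + 1)$. The function names and the sample values ($n = 21$, $k = 2$, $N_Y = 5$, taken from the TA example) are chosen here for illustration.

```python
from fractions import Fraction

def prob_tandem_copies(n, k, n_y, t):
    """P(U_i = t) under the assumed geometric form p**t * (1 - p)."""
    # Assumption: p = N_Y / (n - N_Y*(k-1) + 1), so 1 - p = (n - k*N_Y + 1) / same denominator.
    p = Fraction(n_y, n - n_y * (k - 1) + 1)
    return p**t * (1 - p)

def expected_occurrences(n, k, n_y, t):
    """E_Y = P(U_i = t) * (n - k*N_Y + 1), as read from Theorem 2 (assumed form)."""
    return prob_tandem_copies(n, k, n_y, t) * (n - k * n_y + 1)

# Dimer TA in the 21-base example: N_Y = 1 + 4 = 5 copies in total.
n, k, n_y = 21, 2, 5
for t in range(1, 5):
    print(t, float(prob_tandem_copies(n, k, n_y, t)), float(expected_occurrences(n, k, n_y, t)))
```

Exact fractions are used so the probabilities can be checked to sum to one without rounding error; floats are used only for display.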
Index of Nonrandomness
Metric:
$$I_Y = \frac{(1 - O_Y/E_Y)^2}{O_Y/E_Y}, \quad E_Y \neq 0,$$
where $O_Y$ is the observed number of SSRs, $E_Y$ is the expected number of SSRs, and $R_Y = O_Y/E_Y$ is the representation of SSRs of Y in X.
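As a minimal sketch, the index can be computed directly from observed and expected counts, assuming the metric has the form $I_Y = (1 - R_Y)^2 / R_Y$ with $R_Y = O_Y/E_Y$. The function name and the sample counts below are hypothetical, not data from the talk.

```python
def index_of_nonrandomness(observed, expected):
    """Return I_Y = (1 - R)**2 / R with R = observed / expected (assumed form)."""
    if expected == 0:
        raise ValueError("expected count must be nonzero")
    if observed == 0:
        raise ValueError("index is undefined when nothing is observed (R = 0)")
    r = observed / expected
    return (1 - r) ** 2 / r

# Hypothetical counts: matching expectation, twofold excess, twofold deficit.
print(index_of_nonrandomness(10, 10))  # 0.0
print(index_of_nonrandomness(20, 10))  # 0.5
print(index_of_nonrandomness(5, 10))   # 0.5
```

In this form the index is zero when observation matches expectation and treats a twofold excess and a twofold deficit symmetrically, since replacing R by 1/R leaves the value unchanged.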
Results of Index of Nonrandomness
The index of nonrandomness provides an approach to identifying genomic loci in which SSR occurrences exhibit significant deviations from random expectations
No simulations are needed to compute deviations from random expectations
Closed form expression for finding the variance of SSRs (non-uniform Bernoulli model)
Higher index implies more nonrandomness in microsatellite DNA
The trimer CCG exhibited a high degree of nonrandomness which was unexpected
Results of Index of Nonrandomness (cont.)
Potential Biological Applications
Screening organismal genomes for putative disease-associated genes
Identifying loci of interest for future genomic studies
Computing the exclusion probability of SSR-based genetic markers used in paternity tests
Establishing relationships between SSR nonrandomness and the incidence of particular infectious diseases (dynamics)
Other Applications of Microsatellite DNA and Index of Nonrandomness
Alzheimer’s Disease (2007)
Prostate Cancer (In Progress)
Sickle Cell Disease (TBD)
Malaria (TBD)
Tuberculosis, HIV, and E. coli (???)
Related Sources
Identifying nonrandom occurrences of simple sequence repeats in genomic DNA sequences (with W. Ndifon and D. Hill), Ethnicity and Disease, Proceedings From RCMI 9th Intl. Symposium on Health Disparities 15 (2005) S5-67 – S5-70.
Some probabilistic results on the nonrandomness of simple sequence repeats in DNA sequences (with W. Ndifon and D. Hill), Bulletin of Mathematical Biology 68 (2006) 1747 – 1759.
Differential enrichment of simple sequence repeats in selected Alzheimer-associated genes (with W. Ndifon and D. Hill), Cellular and Molecular Biology (Noisy-le-grand) 1553 (2007) 23 – 31.
Acknowledgments
National Science Foundation, DIMACS, SACEMA, University of Stellenbosch, AIMS
Office of Faculty Development, SCMNS & Departments of Mathematics and Chemistry at Morgan State University
Collaborators: Wilfred Ndifon, Princeton University and Dwayne Hill, Morgan State University