Some Probabilistic Results on the Non-randomness of Simple Sequence Repeats in DNA Sequences 2007...

Some Probabilistic Results on the Non-randomness of Simple Sequence Repeats in DNA Sequences

2007 DIMACS Workshop on Mathematical Modeling of Infectious Diseases in Africa, Stellenbosch, South Africa, June 25-27

Asamoah Nkwanta, Morgan State University

Joint work with Wilfred Ndifon & Dwayne Hill

Nonrandomness of Microsatellites

“Numerous lines of evidence have demonstrated that genomic distribution of simple sequence repeats (SSRs) is nonrandom because of their effects on chromatin organization, regulation of gene activity, recombination, DNA replication, cell cycle, …”

You-Chun Li et al., Microsatellites Within Genes: Structure, Function, and Evolution, Molecular Biology and Evolution 21 (2004)

TOPICS

Introduction DNA/RNA

Preliminaries on SSRs

Counting SSRs

Probability & Expectations of SSRs

Results & Conclusion

Introduction: Molecular Biology

Proteins are the building blocks of living organisms.

The information necessary for producing proteins is encoded in Deoxyribonucleic acid (DNA).

DNA is considered as a set of words defined over the genetic alphabet consisting of the letters A (Adenine), T (Thymine), C (Cytosine), & G (Guanine).

Introduction: Molecular Biology (Cont.)

DNA has a double-helix structure


Ribonucleic acid (RNA) mediates the translation of DNA into proteins.

An RNA molecule consists of a sequence of ribonucleotides. Each ribonucleotide contains one of four bases: A, C, G, and U (Uracil). Note: in RNA, uracil takes the place of the thymine found in DNA.


Central Dogma of Molecular Biology

DNA → RNA → Protein

DNA → RNA: transcription. RNA → Protein: translation.
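As a rough illustration of the transcription step, the sketch below turns a DNA string into its RNA counterpart by swapping T for U. The function name `transcribe` is hypothetical, and working directly from the coding strand is a simplifying assumption; in the cell, the RNA is synthesized against the complementary template strand.

```python
def transcribe(dna: str) -> str:
    """Return the RNA string for a DNA coding strand (T becomes U)."""
    dna = dna.upper()
    unexpected = set(dna) - set("ACGT")
    if unexpected:
        raise ValueError(f"unexpected bases: {sorted(unexpected)}")
    return dna.replace("T", "U")


# The 21-base example sequence used later in the talk.
rna = transcribe("TACCCAGCAGGCCTATATATA")
print(rna)
```

The input check is there only to keep the sketch honest about the four-letter alphabet; ambiguity codes such as N are deliberately not handled.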



Revised Central Dogma

DNA Microsatellites: Agents of Evolution? by Moxon and Wills, Scientific American, January 1999.

A human's genetic code consists of roughly three billion bases of DNA, the familiar "letters" of the DNA alphabet. But a mere 10 to 15 percent of those bases make up genes, the blueprints cells use to build proteins. Some of the remaining base sequences in humans, and in many other organisms, perform crucial functions, such as helping to turn genes "on" and "off" and holding chromosomes together. Much of the DNA, however, seems to have no obvious purpose at all, leading some to refer to it as "junk."

Part of this "junk DNA" includes strange regions known as DNA satellites. These are repetitive sequences made up of various combinations of the four DNA bases, adenine (A), cytosine (C), guanine (G) and thymine (T), repeated over and over, like a genetic stutter. In the past several years, researchers have begun to find that so-called microsatellites, those containing the shortest repeat sequences, have a significance disproportionately great for their size and perform a variety of remarkable functions.

Preliminaries: What are SSRs?

Simple Sequence Repeats (SSRs), or microsatellites, are regions of genomic DNA in which a short motif of nucleotides, 1 to 6 base pairs in length, is repeated in tandem.

The repeat units most often used are di-, tri-, and tetranucleotides.

Preliminaries (cont.): What are SSRs?

Example: TACCCAGCAGGCCTATATATA.

This is a DNA sequence of length 21 which contains stretches of dimers (TA), trimers (CAG), and tetramers (TATA).

CAG – contiguous & TA – non-contiguous
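The repeats in this example can be located mechanically. The sketch below scans a sequence left to right and reports each run of tandem copies of a given motif as a (start, copies) pair, with 0-based start positions. The function name `tandem_repeats` is hypothetical, and the greedy scan is an illustrative choice rather than the method used in the talk.

```python
def tandem_repeats(seq: str, motif: str, min_copies: int = 1):
    """Return (start, copies) for each run of tandem copies of motif in seq."""
    k = len(motif)
    runs = []
    i = 0
    while i <= len(seq) - k:
        if seq[i:i + k] == motif:
            copies = 0
            j = i
            # Extend the run while the next k bases are another copy.
            while seq[j:j + k] == motif:
                copies += 1
                j += k
            if copies >= min_copies:
                runs.append((i, copies))
            i = j  # continue after the run
        else:
            i += 1
    return runs


X = "TACCCAGCAGGCCTATATATA"
print(tandem_repeats(X, "TA"))
print(tandem_repeats(X, "CAG"))
```

For this sequence the scan finds an isolated TA at the start, four tandem copies of TA at the end, and two tandem copies of CAG in the middle.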

Preliminaries (Cont.)

Table 1. Total lengths (a) of simple sequence repeats by repeated unit length

                    Length of repeated motif (bp)
Taxonomic group          1     2     3     4     5     6   Total
Primates              3429  1643   477  1368   898   341    8156
Human chromosome 22   5141  1511   604  1906  1097   419   10678
Rodentia              1839  5461  1196  2942  1417  1034   13889
Mammalia              1397  2312   532   915   774   693    6623
Vertebrata            1418  2449  1069  1279   709   220    7144
Arthropoda             985  1403   956   439   732   875    5390
C. elegans             428   556   337   144   225   449    2139
Embryophyta           1245  1067   880   184   491   272    4139
S. cerevisiae         1075   580   646    93   204   406    3004
Fungi                  905   272   485   194   395   426    2677

(a) Base pairs (bp) per megabase of DNA.

SSRs are relatively abundant.


Some Characteristics of SSRs: They are

Highly mutable

Good molecular markers

Involved in gene regulation

Involved in the development of immune system cells

Associated with at least 20 human diseases, including Huntington’s disease and some cancers


Real World Applications:

Linkage analysis (related to inheritance)

DNA fingerprinting

Genome sequencing (e.g., genome of the apple plant)

Diagnosis of genetic disorders

Paternity tests

Forensic studies

Population & ecological genetic studies

Ecological Genetics of Parasitic Sea Lice

Typical epidermal lesions caused by adult female Lepeophtheirus salmonis in the region of the anal fin of a wild-caught salmon

Marine Ecology Research Group: www.st-andrews.ac.uk/~merg/sea%20lice.htm

Recent research has focused on the development and screening of L. salmonis specific microsatellites. Microsatellites such as CACACACACACA are dispersed throughout the genome.

Ecological Genetics of Parasitic Sea Lice (cont.)

Chromas file depicting base sequence of L. salmonis repeat region [(CA)19-AA-(CA)4] [bases 117 to 164] and flanking regions.

Primers designed to anneal to the DNA sequences flanking the microsatellite region allow indirect measurement of the number of repeat units. The variability in repeat number is often high, and the construction of multilocus genotypes may allow analyses at both the population and individual levels.

Population Genetic Studies

African American Lives, an unprecedented four-part PBS series, shows how DNA analysis is used to trace lineage through American history and back to Africa. Microsatellites play an important role in lineage analysis.

Population Genetic Studies (cont.)

LEARNING FROM DNA

LINEAGE AND ADMIXTURE: THE TESTS

Migration of Populations Around the World

Diagnosis of Genetic Disorders

Rethinking genotype and phenotype correlations in polyglutamine expansion disorders, Susan E. Andrew, Y. Paul Goldberg, and Michael R. Hayden

Counting Non-contiguous SSRs

Definition 1: A DNA sequence X of length n is denoted by the random sequence

$$X = x_1 x_2 \cdots x_n,$$

where each $x_i$ is defined over the 4-letter nucleotide alphabet $\{A, C, G, T\}$.

Note: randomness here refers to the non-uniform Bernoulli model (meaning the bases of X are independent and have possibly unequal probabilities).

For instance, X = TACCCAGCAGGCCTATATATA is a DNA sequence of length n = 21. TA, CAG, and CAGCAG are SSRs of lengths 2, 3, and 6, respectively.
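To make the non-uniform Bernoulli model concrete, the sketch below draws each base independently from a fixed, possibly unequal, probability distribution. The function name `random_dna` and the probability values are illustrative assumptions, not figures from the talk.

```python
import random


def random_dna(n, probs, seed=None):
    """Draw n independent bases; probs maps each of A, C, G, T to its probability."""
    bases = "ACGT"
    weights = [probs[b] for b in bases]
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("base probabilities must sum to 1")
    rng = random.Random(seed)  # local generator so runs are reproducible
    return "".join(rng.choices(bases, weights=weights, k=n))


# Illustrative (made-up) base probabilities for an AT-rich sequence.
probs = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}
X = random_dna(21, probs, seed=0)
print(X, len(X))
```

Setting all four probabilities to 0.25 recovers the uniform model as a special case.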

Counting Non-contiguous SSRs (cont.)

Definition 2: A k-mer Y is a subsequence $Y = y_1 y_2 \cdots y_k$ of a DNA sequence X of length n, where $1 \le k \le 6$.

For instance, for X = TACCCAGCAGGCCTATATATA, TA is a 2-mer (dimer) and CAG is a 3-mer (trimer).


Definition 3: A kt-linked SSR of a k-mer Y is a subsequence of X which is of length kt and consists of t tandem copies of Y.

For instance, for X = TACCCAGCAGGCCTATATATA, CAGCAG is a 6-linked SSR of the trimer Y = CAG and TATATATA is an 8-linked SSR of the dimer Y = TA.


We simply count the number of ways of distributing l occurrences of a kt-linked SSR of Y into (n - klt + 1) possible positions in a DNA sequence X of length n by the binomial coefficient:

$$\binom{n - klt + 1}{l}.$$

See the following example.


Example 1: How many arrangements of the 3 non-contiguous occurrences of the 4-linked SSR of Y = GA are there in a sequence X of length 15 made up of three copies of GAGA and three arbitrary bases?

Using the above binomial coefficient with n = 15, k = 2, t = 2, and l = 3, we have klt = 12. Thus

$$\binom{15 - 12 + 1}{3} = \binom{4}{3} = 4.$$


Example 1 (cont.): The 4 arrangements of the 3 non-contiguous occurrences of the 4-linked SSR of Y = GA are

*GAGA*GAGA*GAGA

GAGA**GAGA*GAGA

GAGA*GAGA**GAGA

GAGA*GAGA*GAGA*

where * denotes an arbitrary base.
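The count in Example 1 can be checked by brute force. The sketch below treats the sequence as a row of units (each unit is either one whole SSR block or one arbitrary base), enumerates every placement in which no two blocks are adjacent, and compares the total with the binomial coefficient C(n - klt + 1, l). The function name `arrangements` and the `*` filler symbol are assumptions made for illustration.

```python
from itertools import combinations
from math import comb


def arrangements(n, k, t, l, motif, filler="*"):
    """List every placement of l non-adjacent kt-linked SSRs in a length-n sequence."""
    assert len(motif) == k
    block = motif * t               # one kt-linked SSR
    free = n - k * t * l            # bases not covered by any SSR
    units = free + l                # blocks plus single free bases
    results = []
    for slots in combinations(range(units), l):
        # Non-contiguous: consecutive blocks need at least one base between them.
        if all(b - a > 1 for a, b in zip(slots, slots[1:])):
            chosen = set(slots)
            results.append("".join(block if u in chosen else filler
                                   for u in range(units)))
    return results


found = arrangements(n=15, k=2, t=2, l=3, motif="GA")
for s in found:
    print(s)
print(len(found), comb(15 - 2 * 2 * 3 + 1, 3))
```

Enumeration grows quickly with n, so this is only a sanity check on small cases; the closed-form coefficient is what one would use in practice.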


Lemma 1: The number of non-contiguous arrangements of l occurrences of a kt-linked SSR of a k-mer Y in a DNA sequence X of length n is given by the nth coefficient of the following generating function:

$$G(z) = \sum_{n \ge klt + l - 1} \binom{n - klt + 1}{l} z^n = \frac{z^{l(kt+1) - 1}}{(1 - z)^{l+1}}.$$
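A numerical check of Lemma 1 is sketched below, assuming the generating function has the closed form z^(l(kt+1)-1) / (1-z)^(l+1), which is the form implied by the binomial count C(n - klt + 1, l). The series is expanded by starting from the monomial in the numerator and applying a running sum once for each factor of 1/(1-z). The function name `gf_coefficients` is hypothetical.

```python
from math import comb


def gf_coefficients(k, t, l, n_max):
    """Coefficients of z^(l*(k*t+1)-1) / (1-z)^(l+1) for powers 0..n_max."""
    shift = l * (k * t + 1) - 1
    coeffs = [0] * (n_max + 1)
    if shift <= n_max:
        coeffs[shift] = 1           # the numerator monomial z^shift
    for _ in range(l + 1):          # dividing by (1 - z) is a running sum
        total = 0
        for i in range(n_max + 1):
            total += coeffs[i]
            coeffs[i] = total
    return coeffs


k, t, l = 2, 2, 3
coeffs = gf_coefficients(k, t, l, n_max=20)
for n in (14, 15, 16, 20):
    print(n, coeffs[n], comb(n - k * l * t + 1, l))
```

With k = 2, t = 2, and l = 3, the coefficient of z^15 should agree with the 4 arrangements counted in Example 1.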

Probability of SSRs

An urn model approach is used to compute the probability of SSRs of Y at a position i in a DNA sequence X of length n.

Probability of SSRs (cont.)

Definition 4: Let $N_Y$ denote the number of occurrences of a k-mer Y in a DNA sequence X. Then

$$N_Y = \sum_{i=1}^{j} t_i,$$

where $t_i$ is the number of tandem copies found in the ith SSR of Y.

Note. $N_Y$ is used to compute statistics on the occurrence of SSRs of Y.
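As a small worked instance of Definition 4, the sketch below collects the tandem copy counts of each SSR of Y and sums them to get the total number of occurrences of Y. The function name `count_occurrences` is hypothetical, and treating an isolated single copy of Y as an SSR with one tandem copy is an assumption made here for illustration.

```python
def count_occurrences(seq, motif):
    """Return (t_list, total): tandem copies per SSR of motif, and their sum."""
    k = len(motif)
    t_list = []
    i = 0
    while i <= len(seq) - k:
        copies = 0
        # Count back-to-back copies of the motif starting at position i.
        while seq[i + copies * k: i + (copies + 1) * k] == motif:
            copies += 1
        if copies:
            t_list.append(copies)
            i += copies * k
        else:
            i += 1
    return t_list, sum(t_list)


X = "TACCCAGCAGGCCTATATATA"
print(count_occurrences(X, "TA"))
print(count_occurrences(X, "CAG"))
```

For the running example, TA occurs in two SSRs (one copy and four copies, five occurrences in total) and CAG occurs in one SSR of two copies.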


Theorem 1: Let $U_i$ denote a random variable representing the number of tandem copies of a k-mer Y occurring at position i in a DNA sequence X. Then $P(U_i = t)$, the probability that an occurrence of a kt-linked SSR of Y starts at position i in X, is given by a closed-form expression in $N_Y$, n, k, and t.

[Expression not recoverable from the slide text.]


Corollary 1: The variance of frequencies of SSRs of a k-mer Y in a DNA sequence of length n is given by a closed-form expression in $N_Y$, n, and k.

[Expression not recoverable from the slide text.]


Theorem 2: The expected number of non-contiguous occurrences of a kt-linked SSR of a k-mer Y in a DNA sequence X is given by

$$E_Y = P(U_i = t)\,(n - kN_Y + 1).$$

Index of Nonrandomness

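Metric: the index $I_Y$ is an expression in $O_Y$ and $E_Y$, defined for $E_Y \neq 0$,

[Expression not recoverable from the slide text.]

where $O_Y$ is the observed number of SSRs, $E_Y$ is the expected number of SSRs, and $R_Y = O_Y / E_Y$ is the representation of SSRs of Y in X.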
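The representation ratio, observed SSRs divided by expected SSRs, is simple enough to sketch directly. The function name `representation` and the counts below are made up for illustration; the index of nonrandomness itself is not reproduced here.

```python
def representation(observed, expected):
    """Return observed / expected; the expected count must be nonzero."""
    if expected == 0:
        raise ValueError("expected count must be nonzero")
    return observed / expected


# Made-up counts: 12 SSRs observed where 8 were expected.
r = representation(12, 8.0)
print(r)
```

By construction, a ratio above 1 means the motif's SSRs occur more often than the random model predicts, and a ratio below 1 means they occur less often.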

Results of Index of Nonrandomness

The index of nonrandomness provides an approach to identifying genomic loci in which SSR occurrences exhibit significant deviations from random expectations

No simulations are needed to compute deviations from random expectations

Closed form expression for finding the variance of SSRs (non-uniform Bernoulli model)

Higher index implies more nonrandomness in microsatellite DNA

The trimer CCG exhibited a high degree of nonrandomness which was unexpected

Results of Index of Nonrandomness (cont.)

Potential Biological Applications

Screening organismal genomes for putative disease-associated genes

Identifying loci of interest for future genomic studies

Computing the exclusion probability of SSR-based genetic markers used in paternity tests

Establishing relationships between SSR nonrandomness and the incidence of particular infectious diseases (dynamics)

Other Applications of Microsatellite DNA and Index of Nonrandomness

Alzheimer’s Disease (2007)

Prostate Cancer (In Progress)

Sickle Cell Disease (TBD)

Malaria (TBD)

Tuberculosis, HIV, and E. coli (???)

Related Sources

Identifying nonrandom occurrences of simple sequence repeats in genomic DNA sequences (with W. Ndifon and D. Hill), Ethnicity and Disease, Proceedings From RCMI 9th Intl. Symposium on Health Disparities 15 (2005) S5-67 – S5-70.

Some probabilistic results on the nonrandomness of simple sequence repeats in DNA sequences (with W. Ndifon and D. Hill), Bulletin of Mathematical Biology 68 (2006) 1747 – 1759.

Differential enrichment of simple sequence repeats in selected Alzheimer-associated genes (with W. Ndifon and D. Hill), Cellular and Molecular Biology (Noisy-le-grand) 1553 (2007) 23 – 31.

Acknowledgments

National Science Foundation, DIMACS, SACEMA, University of Stellenbosch, AIMS

Office of Faculty Development, SCMNS & Departments of Mathematics and Chemistry at Morgan State University

Collaborators: Wilfred Ndifon, Princeton University and Dwayne Hill, Morgan State University
