Kolmogorov Complexity for analysis of DNA sequence


Shijun Tang

Thiraphat Meesumrarn

Gaith Albadarin

Outline

• Kolmogorov Complexity

• The Complexity of DNA

• Methods

• Quantum Kolmogorov Complexity

• Qubit and Definition of QKC

Kolmogorov Complexity

The Kolmogorov complexity of any string x ∈ {0, 1}* is defined as:

C(x) := min{ℓ(p) | U(p) = x}

where U is a universal computer and ℓ(p) is the length of program p. In words, the Kolmogorov complexity of x is the length of the shortest program that produces x as its output.
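C(x) itself is not computable, so in practice it is usually approximated from above by the output length of a real compressor. The sketch below uses zlib as a stand-in for the "shortest program"; the function name `compressed_length` and the two test sequences are illustrative assumptions, not part of the original slides.

```python
import random
import zlib


def compressed_length(seq: str) -> int:
    """Length in bytes of the zlib-compressed sequence.

    This is only an upper-bound proxy for C(x): a better compressor
    could always give a shorter description.
    """
    return len(zlib.compress(seq.encode("ascii"), 9))


# A strictly periodic sequence and a pseudo-random one of equal length.
periodic = "ACGT" * 250
rng = random.Random(0)  # fixed seed so the example is reproducible
shuffled = "".join(rng.choice("ACGT") for _ in range(1000))

print("periodic :", compressed_length(periodic), "bytes")
print("random   :", compressed_length(shuffled), "bytes")
```

The periodic sequence compresses to far fewer bytes than the pseudo-random one, which is the behaviour the definition predicts.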

The Complexity of DNA

• “genetic language” in DNA sequences (A, C, G, and T)

• heterogeneity in DNA sequences (not random)

• the long-range correlation

• Compression

Methods

• Entropy

• Spectral Analysis

• Kolmogorov Complexity

Entropy

• Clausius Entropy

• Boltzmann Entropy

• Shannon Entropy

• Kolmogorov Entropy

• Tsallis Entropy

• Approximate Entropy

• Sample Entropy

• Multiscale Entropy

• …

Entropy

• Jensen-Shannon distance: the difference between the entropy calculated from the whole system and the weighted sum of the entropies calculated from its subsystems

• The Jensen-Shannon distance D(i) is computed for each possible partition point i along the DNA sequence

[Figure: D(i) along a DNA sequence of 230,208 nucleotides, with a marker near position 189,000 and labels 1 to 4]

• The larger D(i), the more the two subsequences partitioned at point i differ, and the better that point is for partitioning the sequence

• The average value of D(i) for a random sequence is at least 10 times lower than that for the yeast sequence

• The ups and downs in D(i) for the random sequence are purely random fluctuations
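The partition profile D(i) described above can be sketched in a few lines. This is a minimal illustration, assuming Shannon entropy in bits over single nucleotides and weights i/n and (n − i)/n for the two subsequences; the function names `entropy` and `js_profile` are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(counts, total):
    """Shannon entropy (bits) of a composition given as letter counts."""
    return -sum(c / total * log2(c / total) for c in counts.values() if c)


def js_profile(seq):
    """D(i) for every partition point i = 1 .. n-1.

    D(i) = H(whole) - (i/n) H(left part) - ((n-i)/n) H(right part)
    """
    n = len(seq)
    whole = entropy(Counter(seq), n)
    left, right = Counter(), Counter(seq)
    profile = {}
    for i in range(1, n):
        ch = seq[i - 1]          # letter that moves from right to left
        left[ch] += 1
        right[ch] -= 1
        profile[i] = (whole
                      - (i / n) * entropy(left, i)
                      - ((n - i) / n) * entropy(right, n - i))
    return profile


# Two homogeneous halves: the best partition point is the boundary.
seq = "A" * 50 + "C" * 50
profile = js_profile(seq)
best = max(profile, key=profile.get)
print(best, round(profile[best], 3))
```

For this toy sequence the maximum of D(i) falls exactly on the boundary between the two halves, which is the point one would choose to split the sequence.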

Spectral Analysis

• Power spectrum: represents the correlation structure in a sequence according to wavelength (or frequency f = c/wavelength)

• The power at a given frequency, P(f), is the contribution from that frequency component to the total variance of the fluctuation in the sequence

• A random sequence lacks correlation at any length scale, so its power spectrum is flat
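As a rough illustration of the points above, the sketch below computes a power spectrum for one nucleotide of a DNA string. It assumes the sequence is first turned into a 0/1 indicator series with its mean removed, and it uses a plain O(n²) discrete Fourier transform so that no external library is needed; the function name `power_spectrum` and the normalization by n² are choices made for this example.

```python
import cmath


def power_spectrum(seq, base):
    """Power at each frequency index k for the indicator series of `base`.

    With the 1/n^2 normalization used here, the powers sum to the
    variance of the indicator series.
    """
    n = len(seq)
    x = [1.0 if ch == base else 0.0 for ch in seq]
    mean = sum(x) / n
    x = [v - mean for v in x]            # keep only the fluctuation
    spectrum = []
    for k in range(n):
        coeff = sum(v * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, v in enumerate(x))
        spectrum.append(abs(coeff) ** 2 / n ** 2)
    return spectrum


# A sequence with period 3 puts all its power at frequency k = n/3.
seq = "ACG" * 20
p = power_spectrum(seq, "A")
peak = max(range(1, len(seq) // 2 + 1), key=lambda k: p[k])
print(peak, round(p[peak], 4))
```

For the period-3 toy sequence the whole variance sits in a single frequency component, the opposite of the flat spectrum expected for a random sequence.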

Kolmogorov Complexity for Analysis of DNA

The search for DNA regions with low complexity is one of the pivotal tasks of modern structural analysis of complete genomes.

Low complexity may be caused by strong inequality in nucleotide content (biased composition), by tandem or dispersed repeats, by palindrome-hairpin structures, or by a combination of all these factors.

Four types of repeat, differing by orientation and by localization in the direct or complementary chain, are considered: direct, symmetric, inverted and direct complementary.

• Direct and inverted repeats are the standard prototypes

• Symmetric: the repeated sequence is oppositely oriented on the same DNA strand

• Direct complementary: a direct repeat on the complementary DNA strand
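The four repeat types can be made concrete by showing what copy of a fragment each one looks for. This sketch assumes the usual reading of the definitions above: symmetric is the reversed fragment, direct complementary is the complement without reversal, and inverted is the reverse complement. The helper `repeat_variants` is a hypothetical name.

```python
# Watson-Crick complement table for the four nucleotides.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def repeat_variants(fragment):
    """Return the string that each repeat type would match for `fragment`."""
    complement = fragment.translate(COMPLEMENT)
    return {
        "direct": fragment,                    # same strand, same orientation
        "symmetric": fragment[::-1],           # same strand, reversed
        "direct complementary": complement,    # complementary strand, same order
        "inverted": complement[::-1],          # reverse complement
    }


for name, variant in repeat_variants("GTGC").items():
    print(f"{name:22s} {variant}")
```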

Nucleotide sequence: the AP2 transcription factor binding site, GTGCCCCGCGGGAACCCCGC.

Black and gray arrows mark the copied fragments and their prototypes. A tandem repeat, characterized by partial overlapping of the prototype on the copied fragment, is marked by a dotted line. In this decomposition, the first one-lettered components, G and T, are produced by an operation generating a novel symbol. The complexity of this 20-lettered sequence is 10 [the number of components in H(S)].

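Lempel-Ziv complexity

• S and Q denote two strings; SQ = S + Q is their concatenation; SQP is SQ with its last letter deleted; V(SQP) is the set of all substrings of SQP

• Start with c(n) = 1 and assume S = s1 s2 … sr, Q = s(r+1)

• If Q ∈ V(SQP), keep S unchanged and extend Q to Q = s(r+1) s(r+2)

• Continue until Q ∉ V(SQP), that is, until Q = s(r+1) s(r+2) … s(r+i) is not a substring of s1 s2 … sr s(r+1) … s(r+i−1); then increase c(n) by 1

• Update S = s1 s2 … sr s(r+1) … s(r+i) and Q = s(r+i+1)

• Repeat until Q takes in the final letter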

b(n) is the complexity value of a random sequence as n → ∞:

b(n) = n / log2(n)

Thus the normalized complexity is CLZN = c(n)/b(n)

• for a random sequence, CLZN → 1

• for an ordered sequence, CLZN → 0

The smaller the complexity, the more slowly new patterns appear in the sequence: the data change in a regular way and show clear periodicity.

The calculation of c(n) (Lempel-Ziv complexity)

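Lempel-Ziv Complexity (1976)

For the sequence S = (10101010):

1) S = s1 = 1, Q = s2 = 0, SQ = 10, SQP = 1, Q ∉ V(SQP): Q is an insertion, SQ = 1●0

2) S = s1s2 = 10, Q = s3 = 1, SQ = 101, SQP = 10, Q ∈ V(SQP): Q is a duplication, SQ = 1●0●1

3) S = s1s2 = 10, Q = s3s4 = 10, SQ = 1010, SQP = 101, Q ∈ V(SQP): Q is a duplication, SQ = 1●0●10

• Repeating 2) and 3), Q remains a duplication up to the final letter: S = 1●0●101010, so c(n) = 3

• b(8) = 8/log2(8) ≈ 2.67, so the normalized complexity is CLZN = c(8)/b(8) = 3/2.67 ≈ 1.13. The asymptotic normalization is not reliable for so short a sequence; what matters is that c(n) stays at 3 however far the periodic pattern is continued, so CLZN → 0 as n grows

• Thus the complexity of this sequence is low, because the sequence is periodic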
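The counting procedure walked through above can be written compactly. This is a minimal sketch of the Lempel-Ziv (1976) component count, assuming the substring test Q ∈ V(SQP) is done with Python's `in` operator and that an unfinished copy at the end of the sequence counts as one more component; the function name `lz_complexity` is an assumption of this example.

```python
def lz_complexity(seq):
    """Number of components c(n) in the Lempel-Ziv (1976) decomposition."""
    n = len(seq)
    if n == 0:
        return 0
    c = 1        # the first letter is always a new component
    start = 1    # index where the current Q begins (S = seq[:start])
    length = 1   # current length of Q
    while start + length <= n:
        q = seq[start:start + length]
        sqp = seq[:start + length - 1]    # SQ with its last letter deleted
        if q in sqp:
            length += 1                   # Q is a copy: extend it
        else:
            c += 1                        # Q is new: close the component
            start += length
            length = 1
    if length > 1:
        c += 1                            # a copy that ran to the end
    return c


print(lz_complexity("10101010"))          # the worked example
print(lz_complexity("10" * 500))          # same pattern, much longer
print(lz_complexity("0001101001000101"))  # a less regular 16-bit string
```

The count for the periodic pattern does not grow with the length of the sequence, while a less regular string of only 16 letters already needs more components.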

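Other estimates of text complexity

• The complexity of a text region, CWF, as evaluated by Wootton and Federhen (7), is given by the formula CWF = (1/N) log_K (N! / ∏ n_i!), where N is the window size, K is the alphabet size (K = 4 for DNA) and n_i is the number of occurrences of the i-th letter in the window

• Linguistic complexity can also be defined as the ratio of the sum of the numbers of words occurring in the analyzed sequence to the maximum possible number of such words (12)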
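Both estimates are easy to try on a short window. The sketch below assumes the Wootton-Federhen measure has the form (1/N) log_K (N! / ∏ n_i!), with N the window length, K the alphabet size and n_i the letter counts, and it assumes linguistic complexity sums over word lengths 1 to `max_word`, where the maximum number of words of length w is min(K^w, N − w + 1). The function names and the default `max_word=3` are choices made for this example.

```python
from math import lgamma, log


def wootton_federhen(window, alphabet="ACGT"):
    """(1/N) * log_K( N! / prod(n_i!) ), computed with log-gamma."""
    n = len(window)
    k = len(alphabet)
    log_multinomial = lgamma(n + 1) - sum(
        lgamma(window.count(a) + 1) for a in alphabet)
    return log_multinomial / (n * log(k))


def linguistic_complexity(window, max_word=3, alphabet_size=4):
    """Observed distinct words divided by the maximum possible number."""
    n = len(window)
    observed = possible = 0
    for w in range(1, max_word + 1):
        words = {window[i:i + w] for i in range(n - w + 1)}
        observed += len(words)
        possible += min(alphabet_size ** w, n - w + 1)
    return observed / possible


for window in ("AAAAAAAA", "AAAAAAAC", "ACGTACGT"):
    print(window,
          round(wootton_federhen(window), 3),
          round(linguistic_complexity(window), 3))
```

A homopolymer window scores lowest on both measures, and a window using all four letters scores highest.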

Implementation and Results

Calculation mode in a sliding window

• Input: (i) a single extended sequence, or (ii) a group of relatively short sequences up to 1 kb in length

• A table of complexity values is constructed for a window of fixed size N sliding along the sequence

• The sequence complexity is assigned to the window center

• The calculation mode in a sliding window (complexity profile) is demonstrated here using the example of the Borrelia burgdorferi genome. In Figure 2, complexity profiles for a window sliding along the sequence are illustrated.
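A complexity profile of this kind can be sketched as follows. The example assumes the Lempel-Ziv component count as the per-window measure and a step of one nucleotide; the toy input is not genomic data, and the names `lz_complexity` and `complexity_profile` are hypothetical.

```python
def lz_complexity(seq):
    """Lempel-Ziv (1976) component count, used here as the window measure."""
    n = len(seq)
    if n == 0:
        return 0
    c, start, length = 1, 1, 1
    while start + length <= n:
        if seq[start:start + length] in seq[:start + length - 1]:
            length += 1
        else:
            c, start, length = c + 1, start + length, 1
    return c + 1 if length > 1 else c


def complexity_profile(seq, window, step=1):
    """List of (window center, complexity) pairs along the sequence."""
    profile = []
    for start in range(0, len(seq) - window + 1, step):
        center = start + window // 2     # value is assigned to the center
        profile.append((center, lz_complexity(seq[start:start + window])))
    return profile


# Toy input: a homopolymer run followed by a more varied stretch.
seq = "A" * 10 + "ACGTTGCAGT"
for center, value in complexity_profile(seq, window=10):
    print(center, value)
```

The profile stays low over the homopolymer run and rises as the window moves into the varied region, which is how low-complexity regions show up as dips.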

http://news.bbc.co.uk/2/hi/8236943.stm


Quantum Kolmogorov Complexity

Are quantum computers more powerful than classical computers?

Quantum Entanglement

Quantum Factorization of Integers

Quantum computers can solve some problems faster than classical computers (→ Shor’s factoring algorithm).

86782348943904553258203876589276488467282764884783575788579901017459395793602387575786897646492020929237475675203798980000847736223445526263778374774774764657586879989889999531190642287653930057686950486950384756567438556574648876589005088573342257947602867756958696986758959511122344756900768768779957500472667899533045786777487657783691190875682046392930000272645583936857939487456884763747949631611893900540958687034763637485696997576535644578499596997665561098443348899046881020480568572231018209586704589944806808908069887677575969061234894390498988999953119064799010174593957936

>> 2^(129/2) = 2.6088e+019

>> 2^125 = 4.2535e+037

>> 2^200 = 1.6069e+060

Prime factorization of large number

Principle of Current Cryptography

In 1994, about 1,600 high-speed workstations obtained the prime factors of a number with L = 129 in about 8 months. For L = 250, the same approach would take about 800,000 years.

Factorization of Integers

• The number N has an approximate length of L bits (0 to 2^L − 1)

• The number N has a factor in the range (1, √N]

• Trying each number in this range to find a factor of N takes at least S ~ √N = 2^(L/2) steps

• But for Shor's quantum computation, S ~ poly(log(N))
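The classical step count can be seen directly with trial division. This is a minimal sketch, assuming "one step" means one candidate divisor tested; the function name `trial_division` is an assumption of this example.

```python
from math import isqrt


def trial_division(n):
    """Return (smallest factor > 1, candidates tried), or (None, tried) if prime."""
    steps = 0
    for d in range(2, isqrt(n) + 1):     # a composite n has a factor <= sqrt(n)
        steps += 1
        if n % d == 0:
            return d, steps
    return None, steps


print(trial_division(91))    # 91 = 7 * 13
print(trial_division(97))    # prime: every candidate up to sqrt(97) is tried
```

In the worst case every candidate up to √N is tested, so the work grows like 2^(L/2) for an L-bit number.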

The Factoring Firestorm

188198812920607963838697239461650439807163563379417382700763356422988859715234665485319060606504743045317388011303396716199692321205734031879550656996221305168759307650257059

= 472772146107435302536223071973048224632914695302097116459852171130520711256363590397527

× 398075086424064937397125500550386491199064362342526708406385189575946388957261768583317

Best classical algorithm (the general number field sieve) takes time exp(O((log N)^(1/3) (log log N)^(2/3)))

Shor’s quantum algorithm takes time polynomial in log N

An efficient algorithm for factoring breaks the RSA public key cryptosystem

Peter Shor, 1994

Qubit

• Pure state of a qubit: |ψ> = cos(θ/2)|0> + e^(iφ) sin(θ/2)|1>

• The orthogonal state: |ψ⊥> = −e^(−iφ) sin(θ/2)|0> + cos(θ/2)|1>

• Basis: |0>, |1>

• Superposition of the states |0> and |1>

Qubit:

• The element that carries information is the quantum state

• |0>, |1> and any linear combination (superposition) c1|0> + c2|1>

Definition (Qubit Strings): A qubit string σ is a state vector or density operator
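A single-qubit superposition c1|0> + c2|1> can be represented as a pair of complex amplitudes. The sketch below assumes the common parametrization c1 = cos(θ/2), c2 = e^(iφ) sin(θ/2) and the usual rule that measurement probabilities are the squared magnitudes of the amplitudes; the function names are hypothetical.

```python
import cmath
import math


def qubit(theta, phi):
    """Amplitudes (c1, c2) of cos(theta/2)|0> + e^{i phi} sin(theta/2)|1>."""
    return (complex(math.cos(theta / 2)),
            cmath.exp(1j * phi) * math.sin(theta / 2))


def probabilities(state):
    """Probabilities of measuring 0 and 1 for the given amplitudes."""
    c1, c2 = state
    return abs(c1) ** 2, abs(c2) ** 2


state = qubit(math.pi / 2, 0.3)   # an equal-weight superposition
p0, p1 = probabilities(state)
print(round(p0, 3), round(p1, 3))
```

The phase φ changes the state but not the two measurement probabilities, which always sum to 1.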

A quantum computer can perform 2^n operations at the same time due to superposition:

F[000], F[001], F[010], …, F[111]

However, we get only one answer, F[a,b,c], when we measure the result.
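The "only one answer" point can be imitated classically by sampling. This sketch assumes a uniform superposition over all 2^n basis states for n = 3 and draws one outcome with probability equal to the squared amplitude; it simulates the statistics of a measurement only, and the function name `measure` is an assumption.

```python
import random


def measure(amplitudes, rng):
    """Sample one basis-state index with probability |amplitude|^2."""
    probs = [abs(a) ** 2 for a in amplitudes]
    return rng.choices(range(len(amplitudes)), weights=probs, k=1)[0]


n = 3
amplitudes = [1 / (2 ** n) ** 0.5] * 2 ** n   # all 2^n values present at once
outcome = measure(amplitudes, random.Random(1))
print("measured register:", format(outcome, "03b"))
```

All eight values are present in the state, but each run of the measurement returns just one three-bit string.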

The Discrete Fourier Transform

• Assume L qubits hold any number x, from 0 to 2^L − 1

• Any number x can be expressed as the state |x> = |x(L−1) x(L−2) … x1 x0> = |x(L−1)> |x(L−2)> … |x1> |x0>

• Here x = Σ_j x_j 2^j (j = 0, …, L−1), and the product of kets is a tensor product

• A_j acts only on the qubit represented by the j-th atom

• The operator |i_j><k_j| acting on the state |n_j>: |i_j><k_j| |n_j> = δ_kn |i_j>

• A_j|0_j> = (1/√2)(|0_j> + |1_j>), A_j|1_j> = (1/√2)(|0_j> − |1_j>)

• B_jk|0_jk> = |0_jk>, B_jk|1_jk> = |1_jk>, B_jk|2_jk> = |2_jk>

• B_jk|3_jk> = exp(iπ/2^(k−j)) |3_jk>

• A0 B01 B02 A1 B12 A2 |x> = (1/√8){(|0> + |4>) − (|2> + |6>) + i(|1> + |5>) − i(|3> + |7>)}

• Applied to |x>, the sequence A0 B01 B02 A1 B12 A2 performs a discrete Fourier transform!
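The claim that this gate sequence performs a discrete Fourier transform can be checked numerically for L = 3. The sketch rests on several assumptions that follow Shor's construction rather than the slide text: A_j carries a factor 1/√2, B_jk multiplies the amplitude by exp(iπ/2^(k−j)) when qubits j and k are both 1, the rightmost gate (taken to be A2) is applied first, and the output appears with its bits in reversed order. The function names are hypothetical.

```python
import cmath
import math

L = 3
DIM = 2 ** L


def apply_A(state, j):
    """A_j: |0> -> (|0>+|1>)/sqrt(2), |1> -> (|0>-|1>)/sqrt(2) on qubit j."""
    s = 1 / math.sqrt(2)
    new = [0j] * DIM
    for idx, amp in enumerate(state):
        sign = -1 if (idx >> j) & 1 else 1
        new[idx & ~(1 << j)] += s * amp          # component with qubit j = 0
        new[idx | (1 << j)] += s * sign * amp    # component with qubit j = 1
    return new


def apply_B(state, j, k):
    """B_jk: phase exp(i*pi/2^(k-j)) when qubits j and k are both 1."""
    phase = cmath.exp(1j * math.pi / 2 ** (k - j))
    mask = (1 << j) | (1 << k)
    return [amp * phase if (idx & mask) == mask else amp
            for idx, amp in enumerate(state)]


def fourier_gates(x):
    """Apply A0 B01 B02 A1 B12 A2 to |x>, rightmost gate first."""
    state = [0j] * DIM
    state[x] = 1
    state = apply_A(state, 2)
    state = apply_B(state, 1, 2)
    state = apply_A(state, 1)
    state = apply_B(state, 0, 2)
    state = apply_B(state, 0, 1)
    state = apply_A(state, 0)
    return state


def bit_reverse(idx):
    """Reverse the L-bit binary representation of idx."""
    return int(format(idx, f"0{L}b")[::-1], 2)


state = fourier_gates(2)
for idx, amp in enumerate(state):
    # The amplitude at idx corresponds to Fourier index c = bit_reverse(idx).
    print(idx, bit_reverse(idx), complex(round(amp.real, 3), round(amp.imag, 3)))
```

Every amplitude has magnitude 1/√8, and the phase at Fourier index c is exp(2πi·x·c/8), which is the discrete Fourier transform of the basis state |x>.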

Shor’s algorithm

[1] Quantum Fourier Transform ==> finding the period r of the periodic function f(a) = x^a mod N

[2] x^r ≡ 1 (mod N), then find the Greatest Common Divisors gcd(x^(r/2) − 1, N) and gcd(x^(r/2) + 1, N)

N-gate
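The classical part of Shor's algorithm, from a known period to the factors, can be sketched as follows. Here the period of x^a mod N is found by brute force, which is exactly the step a quantum computer replaces with the quantum Fourier transform; the values N = 15 and x = 7 are a standard textbook choice, and the function names are assumptions of this example.

```python
from math import gcd


def find_period(x, n):
    """Smallest r > 0 with x^r = 1 (mod n). Requires gcd(x, n) == 1."""
    r, value = 1, x % n
    while value != 1:
        value = (value * x) % n
        r += 1
    return r


def factors_from_period(x, n):
    """Use the period of x^a mod n to split n, when the period allows it."""
    r = find_period(x, n)
    if r % 2:
        return None                  # odd period: try another x
    half = pow(x, r // 2, n)
    if half == n - 1:
        return None                  # x^(r/2) = -1 (mod n): try another x
    return gcd(half - 1, n), gcd(half + 1, n)


print(find_period(7, 15))            # period of 7^a mod 15
print(factors_from_period(7, 15))    # the two factors of 15
```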

Quantum computers U : input qubit string σ → output qubit string U(σ)

• Definition (Quantum Kolmogorov Complexity): Let U be a universal quantum computer and δ > 0. Then, for every qubit string ρ, define QC^δ(ρ) = min{ℓ(σ) | ||ρ − U(σ)||_Tr ≤ δ}

• To measure the difference between two qubit strings, it is natural to use the trace distance, which is defined as ||ρ − σ||_Tr := (1/2) Tr|ρ − σ|
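For a single qubit the trace distance has a closed form, which makes the definition easy to try out. The sketch below assumes ρ and σ are 2×2 density matrices given as nested lists; their difference is then Hermitian and traceless, with eigenvalues ±√(a² + |b|²), where a and b are its top-left and top-right entries, so (1/2)Tr|ρ − σ| equals that square root. The function name `trace_distance` is hypothetical.

```python
import math


def trace_distance(rho, sigma):
    """(1/2) Tr|rho - sigma| for two single-qubit density matrices."""
    a = complex(rho[0][0] - sigma[0][0]).real   # diagonal entries are real
    b = complex(rho[0][1] - sigma[0][1])        # off-diagonal entry
    return math.sqrt(a * a + abs(b) ** 2)


zero = [[1, 0], [0, 0]]             # |0><0|
one = [[0, 0], [0, 1]]              # |1><1|
plus = [[0.5, 0.5], [0.5, 0.5]]     # |+><+| with |+> = (|0>+|1>)/sqrt(2)

print(trace_distance(zero, zero))   # identical states
print(trace_distance(zero, one))    # orthogonal states
print(round(trace_distance(zero, plus), 4))
```

In the definition of QC^δ, a description σ is accepted when the output U(σ) lies within trace distance δ of the target ρ.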

References:

• Y. L. Orlov and V. N. Potapov, Complexity: an internet resource for analysis of DNA sequence complexity, Nucleic Acids Research, 2004, Vol. 32

• Fabio Benatti, Tyll Krüger, Markus Müller, Rainer Siegmund-Schultze, Arleta Szkoła. Entropy and Quantum Kolmogorov Complexity: A Quantum Brudno’s Theorem, Communications in Mathematical Physics 265, 437–461 (2006)
