
Kolmogorov Complexity for analysis of DNA sequence

Date post: 02-Jan-2016
Kolmogorov Complexity for analysis of DNA sequence Shijun Tang Thiraphat Meesumrarn Gaith Albadarin
Transcript
Page 1: Kolmogorov  Complexity for analysis of DNA sequence

Kolmogorov Complexity for analysis of DNA sequence

Shijun Tang
Thiraphat Meesumrarn
Gaith Albadarin

Page 2: Kolmogorov  Complexity for analysis of DNA sequence

Outline

• Kolmogorov Complexity
  - The Complexity of DNA
  - Methods

• Quantum Kolmogorov Complexity
  - Qubit and Definition of QKC

Page 3: Kolmogorov  Complexity for analysis of DNA sequence

Kolmogorov Complexity

The Kolmogorov complexity of any string x ∈ {0, 1}∗ is defined as:

C(x) := min{ℓ(p) | U(p) = x}

The Kolmogorov complexity of x is the length ℓ(p) of the shortest program p which, run on a universal computer U, produces x as its output.
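C(x) itself cannot be computed, so in practice it is only bounded from above. The sketch below uses the length of a zlib-compressed string as such a bound; the choice of zlib, the helper name `compressed_length`, and the two test sequences are assumptions made for illustration, not part of the slides.

```python
import random
import zlib


def compressed_length(sequence: str) -> int:
    """Bytes needed by zlib for the sequence: an upper-bound proxy, not C(x)."""
    return len(zlib.compress(sequence.encode("ascii"), 9))


# A strictly periodic sequence and a pseudo-random one of the same length.
periodic = "ACGT" * 250
rng = random.Random(0)  # fixed seed so the example is reproducible
shuffled = "".join(rng.choice("ACGT") for _ in range(1000))

# The periodic sequence has a much shorter description than the random one.
print(compressed_length(periodic), compressed_length(shuffled))
```

A shorter compressed length only shows that a short description exists; a long one does not prove that none does.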

Page 4: Kolmogorov  Complexity for analysis of DNA sequence

The Complexity of DNA

• “genetic language” in DNA sequences (A, C, G, and T)

• heterogeneity in DNA sequences (not random)

• the long-range correlation

• Compression

Page 5: Kolmogorov  Complexity for analysis of DNA sequence

Methods

• Entropy

• Spectral Analysis

• Kolmogorov Complexity

Page 6: Kolmogorov  Complexity for analysis of DNA sequence

Entropy

• Clausius Entropy
• Boltzmann Entropy
• Shannon Entropy
• Kolmogorov Entropy
• Tsallis Entropy
• Approximate Entropy
• Sample Entropy
• Multiscale Entropy
• …

Page 7: Kolmogorov  Complexity for analysis of DNA sequence

Entropy

• Jensen-Shannon distance: the difference between the entropy calculated from the whole system and the weighted sum of the entropies calculated from the subsystems

• The Jensen-Shannon distance D(i) is calculated for each possible partition point i along the DNA sequence
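The definition above can be sketched in a few lines. The code assumes that the entropy is the Shannon entropy of the single-nucleotide frequencies and that each subsequence is weighted by its share of the total length; both choices, and the names `shannon_entropy` and `js_profile`, are assumptions for illustration.

```python
import math
from collections import Counter


def shannon_entropy(seq: str) -> float:
    """Shannon entropy, in bits per symbol, of the letter frequencies in seq."""
    n = len(seq)
    if n == 0:
        return 0.0
    return -sum((k / n) * math.log2(k / n) for k in Counter(seq).values())


def js_profile(seq: str) -> list[float]:
    """D(i) for every partition point i = 1 .. len(seq) - 1."""
    n = len(seq)
    h_whole = shannon_entropy(seq)
    profile = []
    for i in range(1, n):
        left, right = seq[:i], seq[i:]
        # Weighted sum of the entropies of the two subsystems.
        weighted = (i / n) * shannon_entropy(left) + ((n - i) / n) * shannon_entropy(right)
        profile.append(h_whole - weighted)
    return profile


# Two homogeneous halves: D(i) peaks exactly at the boundary i = 50.
d = js_profile("A" * 50 + "C" * 50)
print(d.index(max(d)) + 1, max(d))
```

This direct version recomputes the entropies at every point and is quadratic in the sequence length; for a chromosome-sized sequence running counts would be needed.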

Page 8: Kolmogorov  Complexity for analysis of DNA sequence

[Figure: sequence of 230,208 nucleotides; marked position near 189,000; labels 1 to 4]

Page 9: Kolmogorov  Complexity for analysis of DNA sequence

• The bigger the difference between the two subsequences partitioned at point i, the better that point is for partitioning the sequence

• The average value of D(i) for a random sequence is at least 10 times lower than that for the yeast sequence

• The ups and downs in D(i) for the random sequence are purely random fluctuations

Page 10: Kolmogorov  Complexity for analysis of DNA sequence

Spectral Analysis

• Power spectrum: represents the correlation structure in a sequence according to wavelength (or frequency f = c/wavelength).

• The power at a given frequency, P(f), is the contribution from that frequency component to the total variance of the fluctuation in the sequence.

• A random sequence lacks correlation at any length scale, and its power spectrum is flat
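A minimal sketch of P(f) for a DNA sequence follows. It assumes the sequence is first turned into a 0/1 indicator series for one nucleotide and that P is normalized so that the powers sum to the variance of that series; the function name `power_spectrum` and the direct (slow) Fourier sum are illustrative choices.

```python
import cmath


def power_spectrum(seq: str, base: str) -> list[float]:
    """P(k) for k = 0 .. N-1 of the mean-removed indicator series of `base`."""
    n = len(seq)
    x = [1.0 if s == base else 0.0 for s in seq]
    mean = sum(x) / n
    x = [v - mean for v in x]
    spectrum = []
    for k in range(n):
        # Direct discrete Fourier sum; index k corresponds to wavelength N/k.
        coeff = sum(v * cmath.exp(-2j * cmath.pi * k * t / n) for t, v in enumerate(x))
        spectrum.append(abs(coeff) ** 2 / n**2)
    return spectrum


# A sequence with period 4: all the variance sits at wavelengths 4, 2 and 4/3.
p = power_spectrum("GATC" * 8, "G")
print([k for k, value in enumerate(p) if value > 1e-9])
```

With this normalization the sum of all P(k) equals the variance of the indicator series, which is the sense in which each P(f) is a contribution to the total variance.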

Page 11: Kolmogorov  Complexity for analysis of DNA sequence
Page 12: Kolmogorov  Complexity for analysis of DNA sequence

Kolmogorov Complexity for Analysis of DNA

The search for DNA regions with low complexity is one of the pivotal tasks of modern structural analysis of complete genomes.

The low complexity may be preconditioned by strong inequality in nucleotide content (biased composition), by tandem or dispersed repeats or by palindrome-hairpin structures, as well as by a combination of all these factors.

Page 13: Kolmogorov  Complexity for analysis of DNA sequence

Four types of repeat, differing by orientation and localization in direct or complementary chains, are considered: direct, symmetric, inverted and direct complementary.

Direct and inverted repeats are the standard prototypes. A symmetric repeat is a repeated sequence that is oppositely oriented on the same DNA strand. A direct complementary repeat is a direct repeat on the complementary DNA strand.
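As a small illustration, the four repeat types can be read as string operations on a prototype fragment. The mapping used here (symmetric = reversed, direct complementary = complemented, inverted = reversed and complemented) is one reading of the definitions above, and `repeat_variants` is a hypothetical helper name.

```python
# Watson-Crick complement table for the four nucleotides.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def repeat_variants(fragment: str) -> dict[str, str]:
    """The copy that each repeat type would place later in the sequence."""
    complemented = fragment.translate(COMPLEMENT)
    return {
        "direct": fragment,
        "symmetric": fragment[::-1],
        "direct complementary": complemented,
        "inverted": complemented[::-1],
    }


for name, copy in repeat_variants("GTGC").items():
    print(f"{name:21s} {copy}")
```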

Page 14: Kolmogorov  Complexity for analysis of DNA sequence
Page 15: Kolmogorov  Complexity for analysis of DNA sequence

Nucleotide sequence: the AP2 transcription factor binding site, GTGCCCCGCGGGAACCCCGC.

Black and gray arrows mark the copied fragments and their prototypes. A tandem repeat characterized by partial overlapping of the prototype on the copied fragment is marked by a dotted line. In this decomposition, the first one-lettered components, G and T, are produced by an operation generating a novel symbol. The complexity of this 20-lettered sequence = 10 [the number of components in H(S)].

Page 16: Kolmogorov  Complexity for analysis of DNA sequence

Lempel-Ziv complexity

Let S and Q be two strings, SQ their concatenation, and SQP the string SQ with its last letter deleted. V(SQP) is the set of all substrings of SQP.

Start with c(n) = 1, S = s1s2…sr and Q = sr+1.

If Q ∈ V(SQP), keep S the same and extend Q to Q = sr+1sr+2.

Continue until Q ∉ V(SQP), that is, until Q = sr+1sr+2…sr+i is not a substring of s1s2…srsr+1sr+2…sr+i-1; then increase c(n) by 1.

Update S = s1s2…srsr+1sr+2…sr+i and Q = sr+i+1.

Repeat until Q takes the final letter.

Page 17: Kolmogorov  Complexity for analysis of DNA sequence

b(n) is the complexity value of a random sequence as n → infinity:

b(n) = n/log2(n)

Thus, CLZN = c(n)/b(n)

the complexity of a random sequence ---> 1; the complexity of an ordered sequence ---> 0

The smaller the complexity, the slower the rate of variation ===> the data change regularly and show clear periodicity.

Page 18: Kolmogorov  Complexity for analysis of DNA sequence

The calculation of c(n) (Lempel-Ziv complexity)

Lempel-Ziv complexity (1976). For the sequence S = (10101010):

1) S = s1 = 1, Q = s2 = 0, SQ = 10, SQP = 1, Q ∉ V(SQP), Q is an insertion, SQ = 1●0

2) S = s1s2 = 10, Q = s3 = 1, SQ = 101, SQP = 10, Q ∈ V(SQP), Q is a duplication, SQ = 1●0●1

3) S = s1s2 = 10, Q = s3s4 = 10, SQ = 1010, SQP = 101, Q ∈ V(SQP), Q is a duplication, SQ = 1●0●10

• Repeating 2) and 3), Q remains a duplication up to the final letter: S = 1●0●101010, c(n) = 3
• b(8) = 8/log2(8) = 8/3, so the normalized complexity is CLZN = c(8)/b(8) = 9/8 = 1.125
• With only 8 symbols the asymptotic normalization is not yet meaningful. Because the sequence is periodic, c(n) stays at 3 however many periods are appended, so CLZN tends to 0 as n grows: the complexity of this sequence is low.
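The counting procedure and the worked example can be sketched as follows. The function name `lz_complexity` and the index variables are illustrative; the sketch only counts components c(n) and leaves the normalization aside.

```python
def lz_complexity(s: str) -> int:
    """Number of components c(n) in the Lempel-Ziv (1976) parsing of s."""
    n = len(s)
    if n == 0:
        return 0
    c = 1  # the first letter is always a new component
    r = 1  # length of S, the part already parsed
    i = 1  # current length of Q
    while r + i <= n:
        q = s[r:r + i]        # candidate component Q
        sqp = s[:r + i - 1]   # SQ with its last letter deleted
        if q in sqp:
            i += 1            # Q can be copied from SQP: extend it
        else:
            c += 1            # Q is new: close the component
            r += i
            i = 1
    if i > 1:
        c += 1                # trailing Q that was still being copied
    return c


print(lz_complexity("10101010"))  # parsing 1 . 0 . 101010
```

The substring test `q in sqp` makes this quadratic or worse in the sequence length, which is acceptable for short windows but not for whole genomes.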

Page 19: Kolmogorov  Complexity for analysis of DNA sequence

Other estimates of text complexity

The evaluation of complexity in a text region, CWF, by Wootton and Federhen (7) is given by the formula
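The formula itself is not reproduced in this transcript. The sketch below assumes the form commonly quoted for this measure, CWF = (1/N) log_K(N! / (n1! n2! … nK!)), where N is the window length, ni the count of each letter and K the alphabet size (4 for nucleotides); treat that form and the name `wootton_federhen` as assumptions.

```python
import math
from collections import Counter


def wootton_federhen(window: str, alphabet_size: int = 4) -> float:
    """Assumed form: (1/N) * log_K( N! / product of n_i! ) for a non-empty window."""
    n = len(window)
    arrangements = math.factorial(n)
    for count in Counter(window).values():
        # Exact integer division: the running value is a multinomial coefficient.
        arrangements //= math.factorial(count)
    return math.log(arrangements, alphabet_size) / n


print(wootton_federhen("AAAAAAAA"))  # single-letter window: 0.0
print(wootton_federhen("ACGTACGT"))
```

The measure depends only on the letter composition of the window, not on the order of the letters.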

Page 20: Kolmogorov  Complexity for analysis of DNA sequence

Linguistic complexity can also be defined as the ratio of the sum of numbers of words occurring in a sequence analyzed to the maximum possible number of such words (12):
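A sketch of this ratio, assuming that "words" are all substrings of length 1 up to a chosen maximum, that each distinct word is counted once, and that the maximum possible number of words of length k in a sequence of length N is min(K^k, N − k + 1). The function name and the `max_word` parameter are illustrative.

```python
def linguistic_complexity(seq: str, max_word: int, alphabet_size: int = 4) -> float:
    """Observed distinct words over the maximum possible, for lengths 1..max_word."""
    n = len(seq)
    observed = 0
    possible = 0
    for k in range(1, max_word + 1):
        words = {seq[i:i + k] for i in range(n - k + 1)}
        observed += len(words)
        # At most K**k different words exist, and at most n - k + 1 fit in seq.
        possible += min(alphabet_size**k, n - k + 1)
    return observed / possible


print(linguistic_complexity("AAAA", 3))
print(linguistic_complexity("ACGT", 3))
```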

Page 21: Kolmogorov  Complexity for analysis of DNA sequence
Page 22: Kolmogorov  Complexity for analysis of DNA sequence

Implementation and Results

Calculation modes: (i) a single extended sequence, analyzed in a sliding window; (ii) a group of relatively short sequences up to 1 kb in length.

A table of complexity values is constructed for a window of a given size N sliding along the sequence. The sequence complexity is assigned to the window center. The calculation mode in a sliding window (complexity profile) is demonstrated here using the example of the Borrelia burgdorferi genome. In Figure 2, complexity profiles for a window sliding along the sequence are illustrated.
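The sliding-window mode can be sketched as below. The complexity measure is passed in as a function; the default used here, a Lempel-Ziv component count, is only one possible choice, and the names `lz_components` and `complexity_profile` are hypothetical.

```python
def lz_components(s: str) -> int:
    """Lempel-Ziv (1976) component count, used here as the default measure."""
    n = len(s)
    if n == 0:
        return 0
    c, r, i = 1, 1, 1
    while r + i <= n:
        if s[r:r + i] in s[:r + i - 1]:
            i += 1
        else:
            c, r, i = c + 1, r + i, 1
    return c + 1 if i > 1 else c


def complexity_profile(seq: str, window: int, measure=lz_components):
    """List of (window center, complexity) for a window sliding one step at a time."""
    half = window // 2
    return [
        (start + half, measure(seq[start:start + window]))
        for start in range(len(seq) - window + 1)
    ]


# A homopolymer run followed by a mixed region: the profile rises across the border.
sequence = "A" * 40 + "ACGTTGCAAGCTAGGCTTACGATCCGATAGCATTGACGGA"
profile = complexity_profile(sequence, window=20)
print(profile[0], profile[-1])
```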

Page 23: Kolmogorov  Complexity for analysis of DNA sequence
Page 24: Kolmogorov  Complexity for analysis of DNA sequence
Page 25: Kolmogorov  Complexity for analysis of DNA sequence

http://news.bbc.co.uk/2/hi/8236943.stm

Page 26: Kolmogorov  Complexity for analysis of DNA sequence

Quantum Kolmogorov Complexity

Are quantum computers more powerful than classical computers?

Quantum Entanglement

Quantum Factorization of Integers

Quantum computers can solve some problems faster than classical computers (→ Shor’s factoring algorithm).

Page 27: Kolmogorov  Complexity for analysis of DNA sequence

86782348943904553258203876589276488467282764884783575788579901017459395793602387575786897646492020929237475675203798980000847736223445526263778374774774764657586879989889999531190642287653930057686950486950384756567438556574648876589005088573342257947602867756958696986758959511122344756900768768779957500472667899533045786777487657783691190875682046392930000272645583936857939487456884763747949631611893900540958687034763637485696997576535644578499596997665561098443348899046881020480568572231018209586704589944806808908069887677575969061234894390498988999953119064799010174593957936

>> 2^(129/2) = 2.6088e+019
>> 2^125 = 4.2535e+037
>> 2^200 = 1.6069e+060

Page 28: Kolmogorov  Complexity for analysis of DNA sequence

Prime factorization of large number

Principle of Current Cryptography

In 1994, 1,600 fast workstations obtained the prime factors of a number with L = 129 in about 8 months. For L = 250, the same approach would take about 800,000 years.

Factorization of Integers

The number N has an approximate length of L bits (0 to 2^L − 1).
The number N has a factor in the range (1, √N].
Try each number in this range to find a factor of N: in the worst case about S ~ √N = 2^(L/2) steps.

But for Shor's quantum computation, S ~ poly(log(N))
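The classical search described above is trial division. A minimal sketch that also counts the candidates tried, with `trial_division` as an illustrative name:

```python
import math


def trial_division(n: int) -> tuple[int, int]:
    """Return (smallest factor > 1 of n, number of candidates tried)."""
    steps = 0
    for candidate in range(2, math.isqrt(n) + 1):
        steps += 1
        if n % candidate == 0:
            return candidate, steps
    return n, steps  # no factor up to sqrt(n): n itself is prime


print(trial_division(91))   # 7 is found after trying 2, 3, 4, 5, 6, 7
print(trial_division(101))  # prime: every candidate up to 10 is tried
```

For a product of two primes of similar size the loop runs almost to √N, which is where the 2^(L/2) estimate comes from.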

Page 29: Kolmogorov  Complexity for analysis of DNA sequence

The Factoring Firestorm

188198812920607963838697239461650439807163563379417382700763356422988859715234665485319060606504743045317388011303396716199692321205734031879550656996221305168759307650257059

472772146107435302536223071973048224632914695302097116459852171130520711256363590397527

398075086424064937397125500550386491199064362342526708406385189575946388957261768583317

Best classical algorithm takes time

Shor’s quantum algorithm takes time

An efficient algorithm for factoring breaks the RSA public key cryptosystem

Peter Shor, 1994

Page 30: Kolmogorov  Complexity for analysis of DNA sequence

Qubit

• Pure state of a qubit: |ψ> = cos(θ/2)|0> + e^(iφ) sin(θ/2)|1>, with orthogonal state cos(θ/2)|1> − e^(−iφ) sin(θ/2)|0>

• Basis: |0>, |1>

• Superposition of the states |0> and |1>

Page 31: Kolmogorov  Complexity for analysis of DNA sequence

Qubit:
• The element that carries information: the quantum state
• |0>, |1> and any linear combination (superposition) c1|0> + c2|1>

Definition (Qubit Strings): A qubit string σ is a state vector or density operator
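A single qubit c1|0> + c2|1> can be held as a pair of complex amplitudes. The sketch below normalizes such a pair and returns the probabilities |c1|² and |c2|² of measuring 0 or 1; the helper names are illustrative.

```python
import math


def normalize(c1: complex, c2: complex) -> tuple[complex, complex]:
    """Scale the amplitudes so that |c1|^2 + |c2|^2 = 1."""
    norm = math.sqrt(abs(c1) ** 2 + abs(c2) ** 2)
    return c1 / norm, c2 / norm


def measurement_probabilities(state: tuple[complex, complex]) -> tuple[float, float]:
    """Probabilities of reading 0 and 1 when the qubit is measured."""
    c1, c2 = state
    return abs(c1) ** 2, abs(c2) ** 2


# An equal superposition with a relative phase of i between |0> and |1>.
state = normalize(1, 1j)
print(measurement_probabilities(state))
```

The relative phase does not change these probabilities, but it matters as soon as further operations act on the qubit.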

Page 32: Kolmogorov  Complexity for analysis of DNA sequence

A quantum computer can perform 2^n operations at the same time due to superposition:

However we get only one answer when we measure the result:

F[000] F[001] F[010] . . F[111]

Only one answer F[a,b,c]

Page 33: Kolmogorov  Complexity for analysis of DNA sequence

The Discrete Fourier Transform

• Assume L qubits hold any number x, from 0 to 2^L − 1
• Any number x can be expressed as the state
• |x> = |xL-1 xL-2 … x1 x0> = |xL-1> |xL-2> … |x1> |x0>
• where x = Σ_j xj 2^j (j = 0, …, L−1) and |a>|b> denotes a tensor product
• Aj acts only on the qubit represented by the j-th atom

• The operator |ij><kj| on the state |nj>:

• |ij><kj||nj> = δ(k, n) |ij>

• Aj|0j> = (1/√2)(|0j> + |1j>), Aj|1j> = (1/√2)(|0j> − |1j>)

• Bjk|0jk> = |0jk>, Bjk|1jk> = |1jk>, Bjk|2jk> = |2jk>

• Bjk|3jk> = exp(iπ/2^(k−j)) |3jk>

Page 34: Kolmogorov  Complexity for analysis of DNA sequence

A0 B01 B02 A1 B12 A2 |x> = (1/√8) {(|0>+|4>) − (|2>+|6>) + i(|1>+|5>) − i(|3>+|7>)}

|x> ==> A0 B01 B02 A1 B12 A2 performs a discrete Fourier transform!
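The pattern of signs above can be checked against the definition of the discrete Fourier transform. The sketch assumes the sign convention exp(+2πi·x·c/2^L) and the natural (not bit-reversed) ordering of the output states; under those assumptions the input x = 2 on three qubits gives exactly the listed phases. It computes amplitudes directly and does not simulate the gate sequence.

```python
import cmath


def dft_of_basis_state(x: int, n_qubits: int) -> list[complex]:
    """Amplitudes on |0>, |1>, ... of the discrete Fourier transform of |x>."""
    q = 2**n_qubits
    return [cmath.exp(2j * cmath.pi * x * c / q) / q**0.5 for c in range(q)]


# Phases (amplitudes times sqrt(8)) for x = 2 on three qubits.
for c, amplitude in enumerate(dft_of_basis_state(2, 3)):
    phase = amplitude * 8**0.5
    print(c, round(phase.real), round(phase.imag))
```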

Page 35: Kolmogorov  Complexity for analysis of DNA sequence
Page 36: Kolmogorov  Complexity for analysis of DNA sequence

Shor’s algorithm

[1] Quantum Fourier Transform ==> finding the period of a periodic function

[2] Then find the greatest common divisor
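The formulas on this slide did not survive, so the sketch below follows the textbook outline rather than the slide: take f(x) = a^x mod N, find its period r, then compute gcd(a^(r/2) − 1, N) and gcd(a^(r/2) + 1, N). The period is found here by brute force, which stands in for the quantum Fourier transform step; the function names and the choice of a are assumptions.

```python
import math


def period(a: int, n: int) -> int:
    """Smallest r > 0 with a**r % n == 1; requires gcd(a, n) == 1."""
    r, value = 1, a % n
    while value != 1:
        value = (value * a) % n
        r += 1
    return r


def factors_from_period(a: int, n: int):
    """Try to split n using the period of a modulo n; None if this a fails."""
    shared = math.gcd(a, n)
    if shared != 1:
        return shared, n // shared  # lucky guess: a already shares a factor
    r = period(a, n)
    if r % 2 == 1:
        return None                 # odd period: pick another a
    half = pow(a, r // 2, n)
    if half == n - 1:
        return None                 # trivial square root: pick another a
    return math.gcd(half - 1, n), math.gcd(half + 1, n)


print(period(7, 15), factors_from_period(7, 15))
```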

Page 37: Kolmogorov  Complexity for analysis of DNA sequence

N-gate

Page 38: Kolmogorov  Complexity for analysis of DNA sequence

Quantum computers U : input qubit string σ → output qubit string U(σ)

• Definition (Quantum Kolmogorov Complexity): Let U be a universal quantum computer and δ > 0. Then, for every qubit string ρ, define QCδ(ρ) = min{ℓ(σ) | ||ρ − U(σ)||Tr ≤ δ}

• To measure the difference between two qubit strings ρ and σ, it is natural to use the trace distance, which is defined as ||ρ − σ||Tr := (1/2) Tr|ρ − σ|
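For a single qubit the trace distance has a closed form, which gives a quick way to see what the tolerance δ measures. The sketch is restricted to 2×2 density matrices: their difference is Hermitian and traceless, so its eigenvalues are ±λ and (1/2)Tr|ρ − σ| = λ. The function name and the example states are illustrative.

```python
import math


def trace_distance(rho, sigma) -> float:
    """(1/2) Tr|rho - sigma| for two 2x2 density matrices given as nested lists."""
    # Entries of the difference matrix [[a, b], [conj(b), -a]].
    a = (complex(rho[0][0]) - complex(sigma[0][0])).real
    b = complex(rho[0][1]) - complex(sigma[0][1])
    # Eigenvalues are +/- sqrt(a^2 + |b|^2); half the sum of their moduli is that root.
    return math.sqrt(a * a + abs(b) ** 2)


zero = [[1, 0], [0, 0]]          # |0><0|
one = [[0, 0], [0, 1]]           # |1><1|
plus = [[0.5, 0.5], [0.5, 0.5]]  # equal superposition of |0> and |1>

print(trace_distance(zero, one), trace_distance(zero, plus), trace_distance(zero, zero))
```

Orthogonal states are at distance 1 and identical states at distance 0, so δ in the definition of QCδ ranges between exact reproduction and no constraint at all.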

Page 39: Kolmogorov  Complexity for analysis of DNA sequence

References:

• Y. L. Orlov and V. N. Potapov, Complexity: an internet resource for analysis of DNA sequence complexity, Nucleic Acids Research, 2004, Vol. 32

• Fabio Benatti, Tyll Krüger, Markus Müller, Rainer Siegmund-Schultze, Arleta Szkoła. Entropy and Quantum Kolmogorov Complexity: A Quantum Brudno’s Theorem, Communications in Mathematical Physics 265, 437–461 (2006)

