Algorithm of NGS Data

Speaker: Eric C.Y., LEE

Advisor: I-Fang Chung

2011.Mar.21


Outline

• Motivation

• Workflow

• Result

• Conclusion

• My Comment


Motivation

• High-throughput sequencing (HTS) technology now plays an important role in the life sciences.

• Competing high-throughput sequencing technologies aim to sequence an individual human genome for less than $1,000 within a few years.

2006.Mar.17 Vol.311 Science


Motivation

• The amount of data produced by HTS technologies creates a significant bioinformatics challenge: understanding, storing, and sharing the data.


Workflow

Evaluate algorithms: Golomb-Rice, Elias Gamma, MOV, Huffman, ...

Analyze datasets: Dataset1, Dataset2, Dataset3, ...

Preliminary result: for location, for mismatch, ...


Coding Strategy

Claude Shannon, 1916~2001

Optimal encoding of these integers from a compression standpoint depends on their distribution in order to assign shorter binary codes to more probable symbols.

~ Shannon’s Entropy Coding Theory


Encoding Strategies

• Fixed Codes

    • Golomb-Rice Codes

    • Elias Gamma Codes

    • Monotone Value Codes

• Variable Codes

    • Huffman Code


Golomb-Rice Codes

Set m=10, and try to encode 42.

Encoding of quotient part

q    output bits
0    0
1    10
2    110
3    1110
4    11110
5    111110
6    1111110
..   ..
N    <N repetitions of 1> 0

Encoding of remainder part

r    binary    output bits
0    0000      000
1    0001      001
2    0010      010
3    0011      011
4    0100      100
5    0101      101
6    1100      1100
7    1101      1101
8    1110      1110
9    1111      1111

For r >= 6 the binary column shows r + 6, the value actually written in 4 bits (truncated binary coding for m=10).

n=42, m=10: q=4, r=2

Output is 11110010 (11110 for the quotient, 010 for the remainder).
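The two tables above can be reproduced with a few lines of Python. This is a minimal sketch, assuming the quotient is written in unary and the remainder in truncated binary, as the tables suggest; the function names are illustrative and not taken from the slides.

```python
def unary(q: int) -> str:
    """Quotient code from the table: q ones followed by a terminating zero."""
    return "1" * q + "0"


def truncated_binary(r: int, m: int) -> str:
    """Remainder code: small remainders get b-1 bits, the rest get b bits."""
    b = (m - 1).bit_length()       # b = ceil(log2(m)); 4 when m = 10
    if b == 0:                     # m == 1: the remainder is always 0
        return ""
    cutoff = (1 << b) - m          # 6 when m = 10
    if r < cutoff:
        return format(r, "b").zfill(b - 1)
    return format(r + cutoff, "b").zfill(b)


def golomb_encode(n: int, m: int) -> str:
    """Concatenate the unary quotient and the truncated-binary remainder."""
    q, r = divmod(n, m)
    return unary(q) + truncated_binary(r, m)


print(golomb_encode(42, 10))  # 11110010
```

With m=10 the first six remainders fit in 3 bits and the last four need 4 bits, which is why the remainder table switches width at r=6.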

Elias Gamma Codes

number   2^n + offset   output
1        2^0 + 0        1
2        2^1 + 0        010
3        2^1 + 1        011
4        2^2 + 0        00100
5        2^2 + 1        00101
6        2^2 + 2        00110
7        2^2 + 3        00111
8        2^3 + 0        0001000
9        2^3 + 1        0001001
10       2^3 + 2        0001010
11       2^3 + 3        0001011
12       2^3 + 4        0001100
13       2^3 + 5        0001101
14       2^3 + 6        0001110
15       2^3 + 7        0001111
16       2^4 + 0        000010000
17       2^4 + 1        000010001

Example

42 = 2^5 + 10

00000101010
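The table follows one rule: write n in binary and put one zero in front for every bit after the leading 1. The sketch below assumes code words are handled as Python strings of "0" and "1"; the function names are illustrative.

```python
def elias_gamma_encode(n: int) -> str:
    """Return the Elias gamma code word of a positive integer."""
    if n < 1:
        raise ValueError("Elias gamma codes are defined for integers >= 1")
    binary = format(n, "b")                     # 42 -> "101010"
    return "0" * (len(binary) - 1) + binary     # 5 zeros + "101010"


def elias_gamma_decode(bits: str) -> list[int]:
    """Decode a concatenation of Elias gamma code words."""
    values, i = [], 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":                   # count the zero prefix
            zeros += 1
            i += 1
        values.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return values


print(elias_gamma_encode(42))  # 00000101010
```

Because the zero prefix announces the length of each code word, several code words can be concatenated and still be decoded without separators.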


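MOV Coding

number   2^n + offset   output
1        2^0 + 0        1
2        2^1 + 0        10
3        2^1 + 1        11
4        2^2 + 0        100
5        2^2 + 1        101
6        2^2 + 2        110
7        2^2 + 3        111
8        2^3 + 0        1000
9        2^3 + 1        1001
10       2^3 + 2        1010
11       2^3 + 3        1011
12       2^3 + 4        1100
13       2^3 + 5        1101
14       2^3 + 6        1110
15       2^3 + 7        1111
16       2^4 + 0        10000
17       2^4 + 1        10001

Each code word begins at the most significant 1-bit of the Elias Gamma code; the leading zeros are dropped.

Decode: 10001

The leading 1 is followed by {4 bit}: 2^4 + (0001)_2 = 17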
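The relation between the two tables can be checked in code. This is only a sketch of the code word shown on the slide: the slide does not say how a decoder finds where one code word ends in a stream, so `mov_decode` below assumes it is handed a single, already delimited code word. All function names are illustrative.

```python
def elias_gamma_encode(n: int) -> str:
    """Elias gamma code word: zero prefix plus the binary form of n."""
    binary = format(n, "b")
    return "0" * (len(binary) - 1) + binary


def mov_codeword(n: int) -> str:
    """Elias gamma code word with its leading zeros dropped."""
    return elias_gamma_encode(n).lstrip("0")


def mov_decode(codeword: str) -> int:
    """Decode one delimited code word: 2^k plus the k-bit offset."""
    k = len(codeword) - 1                  # bits after the leading 1
    offset = int(codeword[1:], 2) if k else 0
    return (1 << k) + offset


print(mov_codeword(17))     # 10001
print(mov_decode("10001"))  # 17
```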


Huffman Codes

Example text: “this is an example of a huffman tree”
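A Huffman code for the example sentence can be built by repeatedly merging the two least frequent subtrees. The sketch below uses Python's `heapq` and `collections.Counter`; the function name and the tie-breaking order are assumptions, so the individual code words may differ from the tree drawn on the slide even though the total encoded length is the same.

```python
import heapq
from collections import Counter


def huffman_code(text: str) -> dict[str, str]:
    """Return a prefix-free code table {symbol: bit string} for text."""
    if not text:
        raise ValueError("cannot build a code for empty input")
    freq = Counter(text)
    if len(freq) == 1:                       # only one distinct symbol
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, unique tie breaker, {symbol: partial code}).
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)    # two lightest subtrees
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


text = "this is an example of a huffman tree"
table = huffman_code(text)
encoded = "".join(table[ch] for ch in text)
print(len(encoded))  # 135 bits, versus 36 * 8 = 288 bits as 8-bit characters
```

Frequent symbols such as the space receive short code words, which is the behaviour the Coding Strategy slide attributes to entropy coding.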


Workflow

Evaluate algorithms

Analyze datasets

Preliminary result





Dataset1

• Retrotransposon Ty3 insertion sites in the yeast genome.

• 6,439,584 reads of 19 bp.

• Highly clustered.

• High degree of repetition.

• At most two substitutions per read.

Mismatches per read: 0: 54%, 1: 14%, 2: 32%


Dataset2

• In vivo binding site locations of the neuron-restrictive silencer factor (NRSF) in humans.

• Mapped to hg18.

• 1,697,990 reads of 25 bp.

• At most two substitutions per read.

Mismatches per read: 0: 76%, 1: 18%, 2: 6%


Dataset2 Nucleotide Substitutions


Dataset3

• Corresponds to a full diploid human genome sequencing experiment for an Asian individual.

• Large dataset, so only reads mapped to chr. 22 are used.

• 31,118,531 reads of 30~40 bp.

Mismatches per read: 0: 61%, 1: 20%, 2: 19%


Workflow

Evaluate algorithms

Analyze datasets

Preliminary result





Alignment Result Example

Bowtie

Name of read that aligned

Strand

Name of reference sequence where the alignment occurs

0-based offset into the forward reference strand

Read sequence

Read quality

Value of ceiling

Mismatch descriptors
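The eight columns listed above can be read with a small parser. This is a sketch, assuming Bowtie's default tab-separated output with mismatch descriptors written as `offset:reference-base>read-base`; the class names, the function name, and the example line are hypothetical and not taken from the datasets.

```python
from typing import NamedTuple


class Mismatch(NamedTuple):
    offset: int       # 0-based offset of the mismatch within the read
    ref_base: str
    read_base: str


class Alignment(NamedTuple):
    read_name: str
    strand: str
    ref_name: str
    offset: int       # 0-based offset into the forward reference strand
    sequence: str
    quality: str
    ceiling: int
    mismatches: list


def parse_bowtie_line(line: str) -> Alignment:
    """Split one alignment line into the eight columns listed above."""
    fields = line.rstrip("\n").split("\t")
    fields += [""] * (8 - len(fields))     # mismatch column may be missing
    name, strand, ref, offset, seq, qual, ceiling, mm = fields[:8]
    mismatches = []
    for descriptor in filter(None, mm.split(",")):
        pos, bases = descriptor.split(":")
        ref_base, read_base = bases.split(">")
        mismatches.append(Mismatch(int(pos), ref_base, read_base))
    return Alignment(name, strand, ref, int(offset), seq, qual,
                     int(ceiling), mismatches)


# Hypothetical example line, built here only for illustration.
example_line = "\t".join(
    ["read_1", "+", "chr22", "16050000", "ACGTA" * 5, "I" * 25, "0", "22:G>T"]
)
print(parse_bowtie_line(example_line).mismatches)
```

Only the strand, reference name, offset, and mismatch columns feed the location and mismatch encodings discussed in the following slides.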


Encoding Location Information

• Standalone: Encoding each column independently.

• Combine: Combining the chromosome, strand, and mismatch columns, then compressing them together.


Apply the Algorithms

• Elias Gamma (EG) Absolute

    • Reads cannot be sorted.

    • Applicable to Dataset3.



• Relative Elias Gamma (REG)

    • Reads can be sorted, which gives much better compression.

    • The location addresses are sorted and stored as relative offsets instead of absolute addresses.



• Relative Elias Gamma Indexed (REG Indexed)

    • Sorts the reads and creates an index file.

    • Combines chromosome, strand, and mismatches, then compresses them by relative location.

    • Cannot be applied to Dataset3.



• Monotone Value (MOV)

    • Sorts the reads by chromosome and location.

    • Encodes the absolute address.



• Huffman codes

    • Focuses on the “relative” start position.

    • The Huffman tree must be stored for decompression.
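The difference between absolute and relative Elias Gamma coding can be illustrated on a handful of positions. This is a sketch under stated assumptions: the positions are made up, and every value is shifted by +1 before coding so that position 0 and a gap of 0 (two reads at the same location) remain encodable. The slides do not say how the original tool handles these cases.

```python
def elias_gamma_encode(n: int) -> str:
    """Elias gamma code word of a positive integer."""
    binary = format(n, "b")
    return "0" * (len(binary) - 1) + binary


def encode_absolute(positions: list[int]) -> str:
    """Code every position on its own, in the given order."""
    return "".join(elias_gamma_encode(p + 1) for p in positions)


def encode_relative(positions: list[int]) -> str:
    """Sort the positions and code the gap to the previous one."""
    out, prev = [], 0
    for p in sorted(positions):
        out.append(elias_gamma_encode(p - prev + 1))
        prev = p
    return "".join(out)


def decode_relative(bits: str) -> list[int]:
    """Rebuild the sorted positions from the coded gaps."""
    positions, prev, i = [], 0, 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":
            zeros += 1
            i += 1
        gap = int(bits[i:i + zeros + 1], 2) - 1   # undo the +1 shift
        i += zeros + 1
        prev += gap
        positions.append(prev)
    return positions


# Hypothetical clustered read positions on one chromosome.
positions = [1_000_250, 1_000_000, 1_000_010, 1_000_010]
print(len(encode_absolute(positions)), len(encode_relative(positions)))
```

For clustered reads the gaps are small numbers, so their code words are far shorter than those of the absolute addresses; the price is that the original read order is lost by sorting.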


Comments for encoding location

• REG suits all three datasets.

• In dataset 1, only the unique chromosome locations are coded, together with their frequency counts. REG is an ideal solution for highly repetitive datasets.

• Huffman coding does not perform well on dataset 1.


Encoding Mismatch Information

• Each read may contain 1 or 2 mismatches, each with its nucleotide value.

• One line records the mismatch information of each read. If there is no mismatch, the line is left blank.


Mismatches of Dataset2

Calculate the mismatch position from the end of the read.

If the mismatch is at base 23 of a 25 bp read:

From start, the offset is 22: 10110

From end, the offset is 2: 10
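The arithmetic on this slide is a single subtraction. The sketch below assumes 0-based offsets and the 25 bp read length of Dataset2; the function names are illustrative.

```python
READ_LENGTH = 25  # Dataset2 reads are 25 bp


def offset_from_end(offset_from_start: int, read_length: int) -> int:
    """Convert a 0-based offset from the read start into one from the end."""
    return read_length - 1 - offset_from_start


def to_binary(n: int) -> str:
    """Plain binary form without leading zeros."""
    return format(n, "b")


start = 22                                   # base 23 of the read
end = offset_from_end(start, READ_LENGTH)    # 2
print(to_binary(start), to_binary(end))      # 10110 10
```

Counting from the end only pays off when mismatches tend to sit near the end of the read, since small offsets have short binary forms.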


Nucleotide Substitution

• Using numbers instead of characters.

ASCII codes:

A: 65 = 1000001
C: 67 = 1000011
G: 71 = 1000111
T: 84 = 1010100

Two-bit codes:

A: 00  C: 01  G: 10  T: 11
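The saving from the two-bit codes can be checked directly. The mapping below is the one on the slide; the function names and the bit-string representation are assumptions made for illustration.

```python
TWO_BIT = {"A": "00", "C": "01", "G": "10", "T": "11"}
BASES = {bits: base for base, bits in TWO_BIT.items()}


def encode_bases(seq: str) -> str:
    """Replace each nucleotide with its two-bit code."""
    return "".join(TWO_BIT[base] for base in seq)


def decode_bases(bits: str) -> str:
    """Read the bit string two bits at a time."""
    return "".join(BASES[bits[i:i + 2]] for i in range(0, len(bits), 2))


def ascii_bits(seq: str) -> str:
    """Seven-bit ASCII form of each character, as listed on the slide."""
    return "".join(format(ord(base), "07b") for base in seq)


print(ascii_bits("G"), encode_bases("G"))  # 1000111 10
```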

Combining Location and Mismatch

19G

30A

34T

Count the frequencies, then code the location and mismatch together.

19G: 00001010110 {11 bit}

19G: 10110 {5 bit}


Final Encoding

• Dataset1: Mismatch information takes up most of the space, because the locations are already sorted.

• Dataset2: Locations are sparse, so location information takes up most of the storage.

• Dataset3: Storage is balanced between the two, because the reads cover the full genome.


Implementation

• Based on REG indexed for location information and combined encoding for mismatch information.

• Pass 1: Count the mismatches.

• Pass 2: Perform the actual encoding.
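The two-pass structure can be sketched as follows. The slides do not say which code is built from the counts, so this example simply ranks the combined tokens by frequency and writes each rank as an Elias Gamma code word; that choice, the comma separator, and the sample lines are all assumptions made for illustration.

```python
from collections import Counter


def elias_gamma_encode(n: int) -> str:
    """Elias gamma code word of a positive integer."""
    binary = format(n, "b")
    return "0" * (len(binary) - 1) + binary


def two_pass_encode(mismatch_lines: list[str]):
    """Return (rank table, encoded lines) for per-read mismatch lines."""
    # Pass 1: count how often each combined token such as "19G" occurs.
    counts = Counter(
        tok for line in mismatch_lines for tok in line.split(",") if tok
    )
    # Rank 1 is the most frequent token and gets the shortest code word.
    ranks = {
        tok: rank
        for rank, (tok, _) in enumerate(counts.most_common(), start=1)
    }
    # Pass 2: replace every token by the code word of its rank.
    encoded = [
        "".join(elias_gamma_encode(ranks[tok])
                for tok in line.split(",") if tok)
        for line in mismatch_lines
    ]
    return ranks, encoded


# Hypothetical mismatch lines; a blank line means the read has no mismatch.
lines = ["19G", "", "19G,30A", "34T", "19G"]
ranks, encoded = two_pass_encode(lines)
print(ranks["19G"], encoded)
```

The first pass has to see the whole file before any code word can be assigned, which is why the encoding cannot be done in a single streaming pass.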


Result

Dataset1 (bytes)

[Bar chart of file sizes. Bars: Original, Best Compression, GenCompress, gzip, bzip2, 7zip. Original: 1,030,333,440 bytes. Values on the remaining bars, in the order they appear in the slide text: 30,651,664; 42,233,336; 41,378,624; 56,166,419; 56,078,940.]


Result

Dataset2 (bytes)

[Bar chart of file sizes. Bars: Original, Best Compression, GenCompress, gzip, bzip2, 7zip. Original: 353,181,920 bytes. Values on the remaining bars, in the order they appear in the slide text: 83,319,584; 94,030,320; 95,688,992; 36,099,480; 35,983,322.]


Result

Dataset3 (bytes)

[Bar chart of file sizes. Bars: Original, Best Compression, GenCompress, gzip, bzip2, 7zip. Original: 8,869,613,392 bytes. Values on the remaining bars, in the order they appear in the slide text: 411,811,520; 955,061,616; 618,818,824; 390,541,330; 390,541,330.]


Conclusion

• Any genome sequence can be used for mapping the reads.

• In terms of running time, GenCompress is worth using.


Compression Time

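[Bar chart: compression time in seconds for Dataset1, Dataset2, and Dataset3. Series: GenCompress, gzip, bzip2, 7zip. Values in the order they appear in the slide text: 447, 77, 107, 422, 20, 78, 70, 13, 10, 111, 5, 20.]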


Decompression Time

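[Bar chart: decompression time in seconds for Dataset1, Dataset2, and Dataset3. Series: GenCompress, gzip, bzip2, 7zip. Values in the order they appear in the slide text: 21, 2, 4, 53, 4, 7, 13, 1, 2, 15, 1, 2.]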


Conclusion

• Hard drives are not expensive; the real cost is bandwidth.

• Quality scores are not considered.

• Read identifiers are also important.

• Mismatches may be contaminants or de novo sequence, or the reference sequence may be unfinished.

• Only the best match is considered.


Conclusion

• Huffman tree in dataset 1 and 2.


My Comments

• They should release the tool as open source.

• Hardware configuration.

Why RAID1?


Thanks for your attention!

