Date posted: 12-Nov-2014
Category: Technology
Uploaded by: eric-lee
Speaker: Eric C.Y., LEE
Advisor: I-Fang Chung
2011.Mar.21
Monday, March 21, 2011
Outline
• Motivation
• Workflow
• Result
• Conclusion
• My Comment
Motivation
• High-throughput sequencing (HTS) technology now plays an important role in the life sciences.
• Different high-throughput sequencing technologies are competing to sequence an individual human genome for less than $1,000 within a few years.
2006.Mar.17 Vol.311 Science
Motivation
• The amount of data produced by HTS technologies creates significant bioinformatics challenges in understanding, storing, and sharing the data.
Workflow
• Evaluate algorithms: Golomb-Rice, Elias Gamma, MOV, Huffman, ...
• Analyze datasets: Dataset1, Dataset2, Dataset3, ...
• Preliminary result: for location, for mismatch, ...
Coding Strategy
Claude Shannon (1916-2001)
Optimal encoding of these integers from a compression standpoint depends on their distribution in order to assign shorter binary codes to more probable symbols.
~ Shannon’s Entropy Coding Theory
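As a rough illustration of the principle above, the following sketch computes the Shannon entropy of an empirical distribution, which is the lower bound in bits per symbol for a code that assigns one codeword per symbol. The function name `shannon_entropy` and the sample values are illustrative assumptions, not part of the talk.

```python
import math
from collections import Counter

def shannon_entropy(symbols):
    """Entropy, in bits per symbol, of the empirical distribution of `symbols`."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical samples: a skewed distribution needs fewer bits per symbol
skewed = [1] * 8 + [2] * 4 + [3] * 2 + [4] * 2
uniform = [1, 2, 3, 4] * 4
print(shannon_entropy(skewed))   # 1.75
print(shannon_entropy(uniform))  # 2.0
```

The skewed sample can be coded in fewer bits per symbol than the uniform one, which is why the choice of code depends on how the integers are distributed.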
Encoding Strategies
• Fixed Codes
  • Golomb-Rice Codes
  • Elias Gamma Codes
  • Monotone Value Codes
• Variable Codes
  • Huffman Code
Golomb-Rice Codes
Set m = 10 and encode n = 42.

Encoding of quotient part
q     output bits
0     0
1     10
2     110
3     1110
4     11110
5     111110
6     1111110
..    ..
N     <N repetitions of 1>0

Encoding of remainder part
r     binary    output bits
0     0000      000
1     0001      001
2     0010      010
3     0011      011
4     0100      100
5     0101      101
6     1100      1100
7     1101      1101
8     1110      1110
9     1111      1111
For r >= 6 the binary column shows r + 6, so that the 4-bit codes never start with a 3-bit code.

n = 42: q = n / m = 4, r = 2
Output is 11110 010 = 11110010
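The two tables above can be sketched in code as follows. This is a minimal illustration that assumes the quotient is written in unary and the remainder in truncated binary, which is what the tables show for m = 10. The function name `golomb_encode` is an assumption for the example.

```python
def golomb_encode(n, m):
    """Golomb code of a non-negative integer n with parameter m >= 2, as a bit string."""
    if n < 0 or m < 2:
        raise ValueError("expects n >= 0 and m >= 2")
    q, r = divmod(n, m)
    unary = "1" * q + "0"        # quotient: q ones, then a terminating zero
    b = (m - 1).bit_length()     # ceil(log2(m))
    cutoff = (1 << b) - m        # remainders below the cutoff use b - 1 bits
    if r < cutoff:
        remainder = format(r, "0{}b".format(b - 1))
    else:
        remainder = format(r + cutoff, "0{}b".format(b))
    return unary + remainder

print(golomb_encode(42, 10))  # 11110010
```

For m = 10 this gives b = 4 and a cutoff of 6, matching the remainder table: remainders 0 to 5 take 3 bits, and remainders 6 to 9 are shifted by 6 and take 4 bits.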
Elias Gamma Codes
number   2^n + r    output
1        2^0 + 0    1
2        2^1 + 0    010
3        2^1 + 1    011
4        2^2 + 0    00100
5        2^2 + 1    00101
6        2^2 + 2    00110
7        2^2 + 3    00111
8        2^3 + 0    0001000
9        2^3 + 1    0001001
10       2^3 + 2    0001010
11       2^3 + 3    0001011
12       2^3 + 4    0001100
13       2^3 + 5    0001101
14       2^3 + 6    0001110
15       2^3 + 7    0001111
16       2^4 + 0    000010000
17       2^4 + 1    000010001

Example: 42 = 2^5 + 10, output 00000101010
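A short sketch of the Elias Gamma table above: the code for n is its binary form preceded by one zero for every bit after the leading 1. The function names are assumptions chosen for this example.

```python
def elias_gamma_encode(n):
    """Elias Gamma code of a positive integer n, as a bit string."""
    if n < 1:
        raise ValueError("Elias Gamma codes are defined for n >= 1")
    binary = format(n, "b")
    return "0" * (len(binary) - 1) + binary

def elias_gamma_decode(bits):
    """Decode a single Elias Gamma codeword given as a bit string."""
    zeros = 0
    while bits[zeros] == "0":
        zeros += 1               # the zero prefix gives the number of low bits
    return int(bits[zeros:2 * zeros + 1], 2)

print(elias_gamma_encode(42))              # 00000101010
print(elias_gamma_decode("00000101010"))   # 42
```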
MOV Coding
number   2^n + r    output
1        2^0 + 0    1
2        2^1 + 0    10
3        2^1 + 1    11
4        2^2 + 0    100
5        2^2 + 1    101
6        2^2 + 2    110
7        2^2 + 3    111
8        2^3 + 0    1000
9        2^3 + 1    1001
10       2^3 + 2    1010
11       2^3 + 3    1011
12       2^3 + 4    1100
13       2^3 + 5    1101
14       2^3 + 6    1110
15       2^3 + 7    1111
16       2^4 + 0    10000
17       2^4 + 1    10001

Each code begins at the Elias Gamma code's most significant 1-bit.
Decode: 10001 {4 bits}
2^4 + (0001)_2 = 17
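The decoding step above can be sketched as follows. The slides do not say how the decoder learns the number of low bits, so this sketch simply assumes that the width is supplied alongside the codeword. The function names `mov_encode` and `mov_decode` are hypothetical.

```python
def mov_encode(n):
    """Return (bits, width): the binary form of n from its leading 1-bit,
    plus the number of bits that follow that leading 1."""
    if n < 1:
        raise ValueError("expects n >= 1")
    bits = format(n, "b")
    return bits, len(bits) - 1

def mov_decode(bits, width):
    """Rebuild n as 2^width plus the value of the `width` low bits.
    Assumption: `width` is known to the decoder from elsewhere."""
    low = bits[1:1 + width]
    return (1 << width) + (int(low, 2) if low else 0)

print(mov_encode(17))          # ('10001', 4)
print(mov_decode("10001", 4))  # 17
```

Compared with the Elias Gamma code for 17 (000010001, 9 bits), the codeword here is 5 bits, at the cost of tracking the width separately.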
Huffman Codes
Example: “this is an example of a huffman tree”
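A minimal sketch of Huffman coding applied to the example sentence. It repeatedly merges the two least frequent subtrees, so frequent symbols end up with shorter codes. The function name `huffman_code` is an assumption, and the individual codewords depend on how ties are broken, although the total length does not.

```python
import heapq
from collections import Counter

def huffman_code(text):
    """Return a {symbol: bit string} prefix code for a non-empty text."""
    counts = Counter(text)
    if len(counts) == 1:
        return {next(iter(counts)): "0"}
    # Heap entries: (weight, tie breaker, {symbol: partial code})
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)    # two lightest subtrees
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]

text = "this is an example of a huffman tree"
code = huffman_code(text)
encoded = "".join(code[ch] for ch in text)
print(len(encoded), "bits instead of", 8 * len(text))  # 135 bits instead of 288
```

The code table itself has to be stored or transmitted as well, which this sketch does not count.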
Dataset1
• Retrotransposon Ty3 insertion sites in the yeast genome.
• 6,439,584 reads of 19 bp.
• Highly clustered.
• High degree of repetition.
• At most two substitutions per read.
• Mismatches per read: 0 (54%), 1 (14%), 2 (32%).
Dataset2
• In vivo binding site locations of the neuron-restrictive silencer factor (NRSF) in humans.
• Mapped to hg18.
• 1,697,990 reads of 25 bp.
• At most two substitutions per read.
• Mismatches per read: 0 (76%), 1 (18%), 2 (6%).
Dataset2 Nucleotide Substitutions
Dataset3
• A full diploid human genome sequencing experiment for an Asian individual.
• Large dataset; only the reads mapped to chr22 are used.
• 31,118,531 reads of 30~40 bp.
• Mismatches per read: 0 (61%), 1 (20%), 2 (19%).
Alignment Result Example
Bowtie output fields:
• Name of the read that aligned
• Strand
• Name of the reference sequence where the alignment occurs
• 0-based offset into the forward reference strand
• Read sequence
• Read quality
• Value of ceiling
• Mismatch descriptors
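The field list above can be turned into a small parser. This sketch assumes Bowtie's default tab-separated output, with mismatch descriptors written as comma-separated `offset:reference-base>read-base` entries. The `Alignment` type, the function name, and the example line are hypothetical.

```python
from typing import List, NamedTuple, Tuple

class Alignment(NamedTuple):
    read_name: str
    strand: str
    reference: str
    offset: int                              # 0-based, forward strand
    sequence: str
    quality: str
    ceiling: int
    mismatches: List[Tuple[int, str, str]]   # (offset, reference base, read base)

def parse_alignment(line: str) -> Alignment:
    fields = line.rstrip("\n").split("\t")
    name, strand, ref, offset, seq, qual, ceiling = fields[:7]
    descriptors = fields[7] if len(fields) > 7 else ""
    mismatches = []
    for desc in descriptors.split(","):
        if not desc:
            continue                         # read aligned without mismatches
        pos, change = desc.split(":")
        ref_base, read_base = change.split(">")
        mismatches.append((int(pos), ref_base, read_base))
    return Alignment(name, strand, ref, int(offset), seq, qual, int(ceiling), mismatches)

# Hypothetical 25 bp read with one mismatch at offset 22
example = "\t".join(["read_1", "+", "chr22", "14430101",
                     "ACGT" * 6 + "A", "I" * 25, "0", "22:G>A"])
aln = parse_alignment(example)
print(aln.reference, aln.offset, aln.mismatches)
```

Only the reference name, strand, offset, and mismatches are needed to rebuild a read from the reference, which is what the location and mismatch encodings in the following slides target.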
Encoding Location Information
• Standalone: encode each column independently.
• Combined: merge the chromosome, strand, and mismatch columns, then compress them together.
Apply the Algorithms
• Elias Gamma (EG) Absolute
• The reads cannot be sorted.
• Applied to Dataset3.
Apply the Algorithms
• Elias Gamma Relative (REG)
• The reads can be sorted, so compression performance is much better.
• Sort by location address and encode relative addresses instead of absolute ones.
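The relative scheme above can be sketched as follows: sort the locations, then Elias Gamma encode the gap to the previous location. Because Elias Gamma cannot represent 0, this sketch adds 1 to every value. That offset, the function names, and the sample locations are assumptions for illustration.

```python
def elias_gamma(n):
    """Elias Gamma code of a positive integer n, as a bit string."""
    binary = format(n, "b")
    return "0" * (len(binary) - 1) + binary

def absolute_encode(locations):
    """Encode every location on its own (value + 1, since gamma needs n >= 1)."""
    return "".join(elias_gamma(loc + 1) for loc in locations)

def reg_encode(locations):
    """Sort, then encode each gap to the previous location (gap + 1)."""
    bits, previous = [], 0
    for loc in sorted(locations):
        bits.append(elias_gamma(loc - previous + 1))
        previous = loc
    return "".join(bits)

def reg_decode(bits):
    """Inverse of reg_encode: returns the sorted locations."""
    locations, previous, i = [], 0, 0
    while i < len(bits):
        zeros = 0
        while bits[i + zeros] == "0":
            zeros += 1
        previous += int(bits[i + zeros:i + 2 * zeros + 1], 2) - 1
        locations.append(previous)
        i += 2 * zeros + 1
    return locations

# Hypothetical clustered read locations, including a repeated one
locations = [1000003, 1000000, 1000050, 1000003]
print(len(absolute_encode(locations)))  # 156
print(len(reg_encode(locations)))       # 56
```

Only the first location pays for its full magnitude; clustered or repeated locations then cost a few bits each, which is consistent with the better compression noted above.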
Apply the Algorithms
• Relative Elias Gamma Indexed (REG Indexed)
• Sort the reads and create an index file.
• Combine chromosome, strand, and mismatches, then compress them by relative location.
• Cannot be applied to Dataset3.
Apply the Algorithms
• Monotone Value (MOV)
• Sort the reads by chromosome and location.
• Encode the absolute address.
Apply the Algorithms
• Huffman codes
• Focuses on the “relative” start position.
• The Huffman tree must be stored for decompression.
Comments for encoding location
• REG suits all three datasets.
• For Dataset1, use the unique locations on each chromosome and count their frequencies for coding. REG is an ideal solution for highly repetitive datasets.
• Huffman coding does not work well for Dataset1.
Encoding Mismatch Information
• Each read may contain 1 or 2 mismatches, each with a nucleotide value.
• One line per read records the mismatch information; if there is no mismatch, the line is left blank.
Mismatches of Dataset2
• Calculate the mismatch position from the end of the read.
• Example: a mismatch at position 23.
  • From the start: 22, encoded as 10110.
  • From the end: 2, encoded as 10.
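The arithmetic in this example can be checked with a few lines. The slide does not state the indexing convention, so this sketch assumes that position 23 is 1-based, that the offset from the start is 0-based, and that the read length is the 25 bp of Dataset2. The function name is hypothetical.

```python
def offsets(position, read_length):
    """Return (offset from start, offset from end) for a 1-based position."""
    from_start = position - 1            # 0-based offset from the first base
    from_end = read_length - position    # bases remaining after the mismatch
    return from_start, from_end

from_start, from_end = offsets(position=23, read_length=25)
print(from_start, format(from_start, "b"))  # 22 10110
print(from_end, format(from_end, "b"))      # 2 10
```

When mismatches fall near the end of the read, counting from the end yields smaller numbers and therefore shorter codes.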
Nucleotide Substitution
• Use numbers instead of characters.
base   ASCII   7-bit binary   2-bit code
A      65      1000001        00
C      67      1000011        01
G      71      1000111        10
T      84      1010100        11
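A small sketch of the substitution above, replacing each nucleotide character with its 2-bit code. The names `pack` and `unpack` are assumptions for this example.

```python
CODE = {"A": "00", "C": "01", "G": "10", "T": "11"}
BASE = {bits: base for base, bits in CODE.items()}

def pack(sequence):
    """Encode a nucleotide string with 2 bits per base."""
    return "".join(CODE[base] for base in sequence)

def unpack(bits):
    """Decode a bit string produced by pack()."""
    return "".join(BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))

ascii_bits = "".join(format(ord(base), "07b") for base in "GATC")
print(pack("GATC"))                        # 10001101
print(len(ascii_bits), len(pack("GATC")))  # 28 8
```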
Combining Location and Mismatch
19G
30A
34T
Count the frequencies and code the location and mismatch together.
19G: 00001010110 {11 bits}
19G: 10110 {5 bits}
Final Encoding
• Dataset1: mismatches take up most of the space, because the data is already sorted.
• Dataset2: locations are sparse, so they take up most of the storage.
• Dataset3: the dataset is balanced, because it has full coverage of the genome.
Implementation
• Based on REG indexed for location information and combined encoding for mismatch information.
• Pass1: Counting the mismatches.
• Pass2: Actual encoding.
Result: Dataset1 (bytes)
Original            1,030,333,440
Best Compression       56,078,940
GenCompress            56,166,419
gzip                   41,378,624
bzip2                  42,233,336
7zip                   30,651,664
Result: Dataset2 (bytes)
Original              353,181,920
Best Compression       35,983,322
GenCompress            36,099,480
gzip                   95,688,992
bzip2                  94,030,320
7zip                   83,319,584
Result: Dataset3 (bytes)
Original            8,869,613,392
Best Compression      390,541,330
GenCompress           390,541,330
gzip                  618,818,824
bzip2                 955,061,616
7zip                  411,811,520
Conclusion
• Any genome sequence can be used for mapping the reads.
• In terms of time consumption, GenCompress is worth using.
Compression Time (sec)
               Dataset1   Dataset2   Dataset3
GenCompress          20          5        111
gzip                 10         13         70
bzip2                78         20        422
7zip                107         77        447
Decompression Time (sec)
               Dataset1   Dataset2   Dataset3
GenCompress           2          1         15
gzip                  2          1         13
bzip2                 7          4         53
7zip                  4          2         21
Conclusion
• Hard drives are not expensive; the real cost is bandwidth.
• Quality scores are not considered.
• The read identifier is also important.
• Mismatches may be contaminants or de novo, or the reference sequence may be unfinished.
• Only the best match is considered.
Conclusion
• Huffman tree in Dataset1 and Dataset2.
My Comments
• They should release the software as open source.
• Hardware configuration: why RAID1?
Thanks for your attention!