
A4-Long Fragment Read Human Whole Genome …...structural variation analysis, genomic de novo...

Date post: 03-Jul-2020

DNBseq™ lfrWGS

Long Fragment Read Human Whole Genome Sequencing

Technical Note

Short-read sequencing technology can detect small variants (SNPs and InDels) at low cost and with high precision, but it cannot read information from long fragments of DNA. Long-read sequencing technologies, such as PacBio or Oxford Nanopore, enable accurate SV detection, haplotyping and coverage of high-homology regions, but show lower accuracy for small variant detection (SNPs, InDels) and come at a high cost. These trade-offs have limited the applications of both approaches. To address these limitations, we introduce lfrWGS, human whole genome sequencing based on single tube Long Fragment Read (stLFR) technology [1]. stLFR is developed from DNA co-barcoding technology [2], which adds the same barcode sequence to the sub-fragments of each original long DNA molecule (Figure 1). Combining this cost-effective and accurate library methodology with DNBseq™ sequencing technology, DNBseq™ lfrWGS enables high-quality mutation detection, diploid phasing of human genomic regions, structural variation analysis, genomic de novo assembly and other long-read applications.

Figure 1. lfrWGS library construction and sequencing workflow. Library construction starts with inserting transposons into long genomic DNA, followed by hybridization of the transposon-integrated DNA onto clonally barcoded beads. After barcode ligation, adapter ligation and PCR amplification, the co-barcoded sub-fragments are ready for high-throughput sequencing on our proprietary DNBseq™ platform. DNBseq sequencers can effectively avoid the accumulation of errors and improve sequencing accuracy.
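The co-barcoding idea in Figure 1 can be illustrated with a minimal Python sketch: reads that carry the same barcode and map close together are assumed to come from the same original long molecule. The input format (barcode, chromosome, position), the function name infer_molecules and the 50 kb gap threshold are hypothetical choices for illustration, not part of the lfrWGS pipeline.

```python
from collections import defaultdict


def infer_molecules(reads, max_gap=50_000):
    """Group aligned reads into inferred long molecules.

    reads: iterable of (barcode, chrom, pos) tuples (hypothetical format).
    Reads sharing a barcode and chromosome stay in one molecule as long as
    consecutive positions are at most max_gap bases apart.
    Returns a list of (barcode, chrom, start, end, n_reads) tuples.
    """
    by_key = defaultdict(list)
    for barcode, chrom, pos in reads:
        by_key[(barcode, chrom)].append(pos)

    molecules = []
    for (barcode, chrom), positions in sorted(by_key.items()):
        positions.sort()
        start = prev = positions[0]
        count = 1
        for pos in positions[1:]:
            if pos - prev > max_gap:
                # Gap too large: close the current molecule, start a new one.
                molecules.append((barcode, chrom, start, prev, count))
                start, count = pos, 0
            prev = pos
            count += 1
        molecules.append((barcode, chrom, start, prev, count))
    return molecules


# Toy reads, invented for illustration only.
reads = [
    ("BC1", "chr8", 1_000), ("BC1", "chr8", 21_000), ("BC1", "chr8", 55_000),
    ("BC1", "chr8", 900_000),  # far from the others: a second molecule
    ("BC2", "chr8", 30_000), ("BC2", "chr8", 42_000),
]
for molecule in infer_molecules(reads):
    print(molecule)
```

In this toy input, barcode BC1 yields two inferred molecules because one read lies far from the others. Real data would need a gap threshold tuned to the fragment length distribution.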

Highlights of lfrWGS:

High-quality WGS data from as little as 1 ng of DNA.
Haplotype contig N50 of over 30 Mb and powerful detection of structural variations, such as deletions, inversions, translocations and insertions.
About 85% of long DNA molecules are labeled with a unique barcode.
High accuracy and sensitivity for SNP and InDel detection.
Effective detection of structural variations larger than 20 kb, such as inversions, translocations, deletions and insertions.
Capable of analyzing genome regions that are difficult to process with regular WGS, such as highly homologous and highly repetitive regions.


High-Quality Data Performance

Read long fragment information

The average length of long DNA fragments in lfrWGS is 50-70 kb (maximum length up to 300 kb). With over 30 million molecular barcodes available, more than 85% of long DNA fragments are co-barcoded by a single unique barcode. This makes co-barcoded reads analogous to direct single-molecule sequencing, but without the high error rates and low throughput.

SNP & InDel calling

At 30x coverage, lfrWGS demonstrated variant calling performance equivalent to that of standard short-read WGS. Both the positive predictive value (PPV) and the sensitivity of SNP detection are 0.99 or above (Figure 3). In addition, F-measures above 0.95 are achievable for InDel detection, indicating strong performance of lfrWGS in both SNP and InDel calling.

Figure 3. Small variant (SNP & InDel) calling performance on the HG001-HG005 standard samples. HG001: NA12878; HG002-HG004: Ashkenazim father-mother-son trio; HG005: Asian (Han Chinese) son.
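For readers less familiar with the metrics quoted for small variant calling, the sketch below shows how PPV, sensitivity and the F-measure are conventionally derived from true positive, false positive and false negative counts. The counts in the example are invented for illustration and are not taken from the benchmark in Figure 3.

```python
def calling_metrics(tp, fp, fn):
    """Return (PPV, sensitivity, F-measure) from variant comparison counts.

    tp: calls that match the truth set
    fp: calls absent from the truth set
    fn: truth-set variants that were not called
    """
    if tp <= 0:
        raise ValueError("tp must be positive for the metrics to be defined")
    ppv = tp / (tp + fp)              # also called precision
    sensitivity = tp / (tp + fn)      # also called recall
    f_measure = 2 * ppv * sensitivity / (ppv + sensitivity)
    return ppv, sensitivity, f_measure


# Illustrative counts only (not from the technical note).
ppv, sensitivity, f_measure = calling_metrics(tp=990, fp=10, fn=10)
print(f"PPV={ppv:.3f} sensitivity={sensitivity:.3f} F={f_measure:.3f}")
```

The F-measure is the harmonic mean of PPV and sensitivity, so it is only high when both are high.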

Figure 2. DNA fragment length distribution (left) and number of DNA molecules per barcode (right). (A) Typically, lfrWGS can analyze long fragments with an average length of 50-70 kb. (B) When starting from 1 ng of high molecular weight DNA, over 85% of DNA molecules are co-barcoded by a single unique barcode.
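The relationship between barcode diversity and the share of uniquely barcoded molecules can be explored with a simplified model. The sketch below assumes each captured molecule is assigned to one of the available barcodes uniformly at random. That is an assumption made here for illustration, and the note does not state how many molecules are captured. Under this model, a molecule has its barcode to itself with probability (1 - 1/B)^(M - 1) for M molecules and B barcodes.

```python
import math


def fraction_uniquely_barcoded(n_molecules, n_barcodes):
    """Expected fraction of molecules that share their barcode with no other
    molecule, assuming uniform random barcode assignment (an assumption)."""
    return (1.0 - 1.0 / n_barcodes) ** (n_molecules - 1)


def molecules_for_fraction(target_fraction, n_barcodes):
    """Invert the model: number of molecules at which the expected uniquely
    barcoded fraction equals target_fraction."""
    return 1 + math.log(target_fraction) / math.log1p(-1.0 / n_barcodes)


n_barcodes = 30_000_000  # "over 30 million" barcodes, per the text
implied = molecules_for_fraction(0.85, n_barcodes)
print(f"molecules implied by 85% unique: {implied:,.0f}")
print(f"check: {fraction_uniquely_barcoded(round(implied), n_barcodes):.3f}")
```

Under these assumptions, 85% unique co-barcoding with 30 million barcodes corresponds to roughly 4.9 million captured molecules. Real bead loading is unlikely to be perfectly uniform, so this is an order-of-magnitude guide only.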



Figure 4. An ideogram of the phasing blocks on each chromosome of the NA12878 sample. Phased contigs are represented by alternating colors (blue and gray).

Figure 5. A large deletion detected by lfrWGS. The top left panel is a heat map based on barcode overlap. Regions of high overlap are shown in dark red, and those with no overlap in beige. Arrows indicate how regions that are spatially distant from each other on chromosome 8 have increased overlap, marking the location of the deletion. Co-barcoded reads are separated by haplotype and plotted by unique barcode on the y axis and chromosome 8 position on the x axis. The heterozygous deletion is found in a single haplotype.

Phasing

To evaluate variant phasing performance, high-confidence variants from GIAB (NA12878) were phased using the publicly available software package HapCUT2 [3]. An ideogram of the phasing blocks on each chromosome is shown in Figure 4. With 40x coverage, the phasing block N50 reaches 34 Mb, with practically all heterozygous SNPs phased. Notably, the arms of some chromosomes, such as Chr5 and Chr6, are almost completely phased.
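The phasing block N50 quoted above is a length-weighted summary: it is the block length at which blocks of that size or larger contain at least half of the total phased length. A minimal sketch of the calculation follows. The block lengths in the example are invented so that the result is easy to verify by hand, and they are not the NA12878 blocks.

```python
def n50(lengths):
    """Return the N50 of a collection of block (or contig) lengths."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("lengths must be non-empty")


# Illustrative block lengths in Mb (total 120, so the halfway point is 60).
blocks_mb = [50, 34, 20, 10, 6]
print(n50(blocks_mb))  # 50 + 34 = 84 >= 60, so the N50 is 34
```

Because N50 is weighted by length, a few very long blocks dominate it, and many short blocks have little effect.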

Structural variation detection

With phasing and co-barcoding information, lfrWGS can also be used to detect large-scale structural variations. To demonstrate the power of stLFR technology to detect SVs, we examined barcode overlap data and found that deletions previously reported in NA12878 by Zhang et al. [4] were also detected using lfrWGS data. Notably, as shown in Figure 5, lfrWGS detected a heterozygous deletion of 150 kb on chromosome 8 in the NA12878 sample.
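The barcode overlap signal used for SV detection can be sketched in a few lines: bin the reads along a chromosome, collect the set of barcodes seen in each bin, and compare bins pairwise. The Jaccard index used here is one reasonable overlap measure chosen for illustration, and the note does not specify which statistic underlies the heat map in Figure 5. The bin size and toy reads are likewise hypothetical.

```python
from collections import defaultdict


def barcode_sets(reads, bin_size):
    """Map bin index -> set of barcodes with at least one read in that bin.

    reads: iterable of (barcode, pos) tuples on a single chromosome.
    """
    bins = defaultdict(set)
    for barcode, pos in reads:
        bins[pos // bin_size].add(barcode)
    return bins


def jaccard(a, b):
    """Shared barcodes divided by all barcodes seen in either bin."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Toy data: BC1 and BC2 come from a haplotype in which bins 1-2 are deleted,
# so their reads land in bins 0 and 3 only. BC3 and BC4 are ordinary molecules.
reads = [
    ("BC1", 5_000), ("BC1", 35_000),
    ("BC2", 6_000), ("BC2", 36_000),
    ("BC3", 5_500), ("BC3", 15_000),
    ("BC4", 25_000), ("BC4", 34_000),
]
bins = barcode_sets(reads, bin_size=10_000)
print(jaccard(bins[0], bins[3]))  # distant bins that share barcodes
print(jaccard(bins[0], bins[2]))  # bins with no shared barcodes
```

In real data the overlap between two bins would be compared against what is expected at that genomic distance, since neighbouring bins always share many barcodes.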


% heterozygous SNPs phased: 99%
Phasing block N50 size (Mb): 34.0
Short switch error rate: 0.0025
Long switch error rate: 0.0020
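Switch error rates such as those listed above are computed by comparing a phased haplotype with a trusted reference phasing over the same heterozygous sites. The sketch below uses one common convention, which is an assumption here because the note does not define its terms: a single site that disagrees with its neighbours on both sides counts as a short switch, and any other change in agreement counts as a long switch. Both rates are divided by the number of adjacent site pairs.

```python
def switch_errors(truth, pred):
    """Return (short_switch_rate, long_switch_rate) within one phase block.

    truth, pred: equal-length sequences of 0/1 alleles assigned to
    haplotype 1 at each heterozygous site.
    """
    if len(truth) != len(pred) or len(truth) < 2:
        raise ValueError("need two equal-length sequences with >= 2 sites")
    agree = [t == p for t, p in zip(truth, pred)]
    # Indices where agreement with the truth changes relative to the
    # previous site.
    flips = [i for i in range(1, len(agree)) if agree[i] != agree[i - 1]]

    short = long_ = 0
    i = 0
    while i < len(flips):
        if i + 1 < len(flips) and flips[i + 1] == flips[i] + 1:
            short += 1   # one site flipped and immediately flipped back
            i += 2
        else:
            long_ += 1   # phase stays switched for more than one site
            i += 1
    n_pairs = len(truth) - 1
    return short / n_pairs, long_ / n_pairs


# Illustrative haplotypes: one isolated error at site 2 and one long switch
# starting at site 6.
truth = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
pred = [0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1]
print(switch_errors(truth, pred))
```

Tools differ in how they treat block boundaries and in the choice of denominator, so values from this sketch are not directly comparable with the figures reported above.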


Copyright ©2019 BGI. The BGI logo is a trademark of BGI. All rights reserved. All brand and product names are trademarks or registered trademarks of their respective holders. Information, descriptions and specifications in this publication are subject to change without notice. DNBSEQ is a trademark of MGI CO. Ltd.

Published September 2019.

Request Information or Quotation

Contact your BGI account representative for the most affordable rates in the industry and to discuss how we can meet your specific project requirements or for expert advice on experiment design, from sample to bioinformatics.

[email protected]

www.bgi.com

BGI Genomics BGI_Genomics

We Sequence, You Discover

All Services and Solutions are for research use only.


Perfect coverage of high homology regions

Spinal Muscular Atrophy (SMA) is an autosomal recessive neuromuscular disease, commonly characterized by muscle weakness, low muscle tone and a weakened sputum response. There is no effective treatment for this disease yet. Genetic studies have confirmed that SMA is closely associated with mutations in SMN1. SMN2 is a highly homologous gene of SMN1, and the two genes are distinguished in exon 7 and exon 8. SMN1 and SMN2 differ by only five bases at the DNA level, and by only two bases in the coding region, making the two genes effectively impossible to resolve with regular WGS. With the powerful DNA co-barcoding strategy, lfrWGS enables analysis of regions that are difficult for regular WGS.

Conclusion

Benefiting from co-barcoding technology, DNBseq™ lfrWGS labels more than 85% of long DNA fragments with a single unique barcode, with up to 20% of sub-fragments reaching 300 kb in length. Importantly, this is achieved without any amplification of the initial long DNA fragments, which limits the representation bias that arises from PCR amplification.

The quality of variant calling using lfrWGS is high and reproducible. Co-barcoding also enables advanced informatics applications, such as near-complete phasing of the genome into long contigs with extremely low error rates, detection of SVs and coverage of high-homology regions. Together, these make lfrWGS a promising technology to fill the gap between short-read and long-read sequencing.

Figure 6. Accessing regions that are difficult for regular WGS. lfrWGS (upper panels) successfully sequenced and identified the culprit SMN1 gene (left panels) and its highly homologous counterpart, the SMN2 gene (right panels), which are inaccessible to regular WGS.
[Figure 6 panels: columns SMN1 and SMN2; rows lfrWGS and Regular WGS; scale 27 kb]

References

[1] Wang, O., et al., Efficient and unique cobarcoding of second-generation sequencing reads from long DNA molecules enabling cost-effective and accurate sequencing, haplotyping, and de novo assembly. Genome Res, 2019. 29(5): p. 798-808.

[2] Peters, B.A., J. Liu, and R. Drmanac, Co-barcoded sequence reads from long DNA fragments: a cost-effective solution for "perfect genome" sequencing. Front Genet, 2014. 5: p. 466.

[3] Edge, P., V. Bafna, and V. Bansal, HapCUT2: robust and accurate haplotype assembly for diverse sequencing technologies. Genome Res, 2017. 27(5): p. 801-812.

[4] Zhang, F., et al., Haplotype phasing of whole human genomes using bead-based barcode partitioning in a single tube. Nat Biotechnol, 2017. 35(9): p. 852-857.

