D02-NextGenSeq-MOLGENIS

Date post: 04-Jul-2015
Category:
Upload: bioinformatics-open-source-conference
Description:
Large scale NGS pipelines using the MOLGENIS platform: processing the Genome of the Netherlands (Morris Swertz)
[email protected] Large scale NGS pipelines using the MOLGENIS platform: processing the Genome of the Netherlands Morris Swertz , UMC Groningen, Netherlands and members of BBMRI-NL, NBIC, MOLGENIS BOSC 2011, July 15, Vienna
Transcript
Page 1: D02-NextGenSeq-MOLGENIS

[email protected]

Large scale NGS pipelines using the MOLGENIS platform: processing the Genome of the Netherlands

Morris Swertz, UMC Groningen, Netherlands

and members of BBMRI-NL, NBIC, MOLGENIS

BOSC 2011, July 15, Vienna

Page 2: D02-NextGenSeq-MOLGENIS

Use (web)

Animal Observatory

NextGenSeq

Mutation database

Model organisms

Model (xml)

Generator (java)

At BOSC 2010 we demonstrated the MOLGENIS software toolkit

Swertz et al (2010) BMC Bioinformatics 11(Suppl 12):S12, http://www.molgenis.org

Page 3: D02-NextGenSeq-MOLGENIS

Get stuff for free as others build it already

Connect to annotation services

Plugin rich analysis tools

Connect to statistics

UML documentation of your model

Edit & trace your data

Import/export to Excel

find.investigation()
102 downloaded

obs<-find.observedvalue(
43,920 downloaded

#some calculation
add.inferredvalue(res)
36 added

Page 4: D02-NextGenSeq-MOLGENIS

Three steps: Model –> Generate –> Use

Page 5: D02-NextGenSeq-MOLGENIS

9200 INFO [FormScreenGen] generated generated\java\ui\screen\TopMenu\Main\ProtocolsForm.java
9293 INFO [FormScreenGen] generated generated\java\ui\screen\TopMenu\Main\Protocols\ProtocolMenu\ParametersForm.java
9325 INFO [FormScreenGen] generated generated\java\ui\screen\TopMenu\Main\Protocols\ProtocolMenu\ProtocolComponentsForm.java
9496 INFO [FormScreenGen] generated generated\java\ui\screen\TopMenu\Main\Ontologies\OntologyTermsForm.java
9528 INFO [FormScreenGen] generated generated\java\ui\screen\TopMenu\Main\Ontologies\OntologySourcesForm.java
9606 INFO [FormScreenGen] generated generated\java\ui\screen\TopMenu\Main\Ontologies\OntologySources\OntologyTermsForm.java
9638 INFO [FormScreenGen] generated generated\java\ui\screen\TopMenu\Main\Ontologies\CodeListsForm.java
9700 INFO [FormScreenGen] generated generated\java\ui\screen\TopMenu\Main\Ontologies\CodeLists\CodesForm.java
9965 INFO [MenuScreenGen] generated generated\java\ui\screen\TopMenuMenu.java
10012 INFO [MenuScreenGen] generated generated\java\ui\screen\TopMenu\MainMenu.java
10059 INFO [MenuScreenGen] generated generated\java\ui\screen\TopMenu\Main\Investigations\InvestigationMenuMenu.java
10152 INFO [MenuScreenGen] generated generated\java\ui\screen\TopMenu\Main\Investigations\InvestigationMenu\ProtocolApplications\ProtocolApplicationMenuMenu.java
10230 INFO [MenuScreenGen] generated generated\java\ui\screen\TopMenu\Main\ObservationTargetsMenu.java
10293 INFO [MenuScreenGen] generated generated\java\ui\screen\TopMenu\Main\Protocols\ProtocolMenuMenu.java
10324 INFO [MenuScreenGen] generated generated\java\ui\screen\TopMenu\Main\OntologiesMenu.java
11354 INFO [PluginScreenGen] generated Molgenis33Workspace\molgenis4phenotype\generated\java\ui\screen\TopMenu\Main\ReportPlugin.java
11557 INFO [PluginScreenGen] generated Molgenis33Workspace\molgenis4phenotype\generated\java\ui\screen\TopMenu\Main\Ontologies\OntologyManagerPlugin.java
11604 INFO [PluginScreenGen] generated Molgenis33Workspace\molgenis4phenotype\generated\java\ui\screen\TopMenu\Model_documentationPlugin.java
11604 INFO [PluginScreenGen] generated Molgenis33Workspace\molgenis4phenotype\generated\java\ui\screen\TopMenu\RprojectApiPlugin.java
11620 INFO [PluginScreenGen] generated Molgenis33Workspace\molgenis4phenotype\generated\java\ui\screen\TopMenu\HttpApiPlugin.java
11635 INFO [PluginScreenGen] generated Molgenis33Workspace\molgenis4phenotype\generated\java\ui\screen\TopMenu\WebServicesApiPlugin.java
11651 WARN [PluginScreenFTLTemplateGen] Skipped because exists: handwritten\java\plugin\report\InvestigationOverview.ftl
11807 WARN [PluginScreenFTLTemplateGen] Skipped because exists: handwritten\java\plugin\OntologyBrowser\OntologyBrowserPlugin.ftl
11807 WARN [PluginScreenFTLTemplateGen] Skipped because exists: handwritten\java\plugin\topmenu\DocumentationScreen.ftl
11807 WARN [PluginScreenFTLTemplateGen] Skipped because exists: handwritten\java\plugin\topmenu\RprojectApiScreen.ftl
11823 WARN [PluginScreenFTLTemplateGen] Skipped because exists: handwritten\java\plugin\topmenu\HttpAPiScreen.ftl
11823 WARN [PluginScreenFTLTemplateGen] Skipped because exists: handwritten\java\plugin\topmenu\SoapApiScreen.ftl
11854 WARN [PluginScreenJavaTemplateGen] Skipped because exists: handwritten\java\plugin\report\InvestigationOverview.java
12057 WARN [PluginScreenJavaTemplateGen] Skipped because exists: handwritten\java\plugin\OntologyBrowser\OntologyBrowserPlugin.java
12072 WARN [PluginScreenJavaTemplateGen] Skipped because exists: handwritten\java\plugin\topmenu\DocumentationScreen.java
12088 WARN [PluginScreenJavaTemplateGen] Skipped because exists: handwritten\java\plugin\topmenu\RprojectApiScreen.java
12088 WARN [PluginScreenJavaTemplateGen] Skipped because exists: handwritten\java\plugin\topmenu\HttpAPiScreen.java
12088 WARN [PluginScreenJavaTemplateGen] Skipped because exists: handwritten\java\plugin\topmenu\SoapApiScreen.java
12103 INFO [MolgenisServletContextGen] generated WebContent\META-INF\context.xml
12259 INFO [SoapApiGen] generated generated\java\ui\SoapApi.java
12353 INFO [CsvExportGen] generated generated\java\tools\CsvExport.java
12431 INFO [CsvImportByNameGen] generated generated\java\tools\CsvImportByName.java
12636 INFO [CopyMemoryToDatabaseGen] generated generated\java\ui\tools\CopyMemoryToDatabase.java

Real example: generates 150 files, 30k lines of Java, MySQL, CXF, Tomcat config, and R code + docs

Three steps: Model –> Generate –> Use

Page 6: D02-NextGenSeq-MOLGENIS

Three steps: Model –> Generate –> Use

Page 7: D02-NextGenSeq-MOLGENIS

XGAP for GWAS/GWL

Disease specific databases

BBMRI biobank catalogue

GWAS central data manager

NGS cyber infrastructure

MAGE-TAB microarray

AnimalDB

Currently: Towards an integrated app suite

Page 8: D02-NextGenSeq-MOLGENIS

Large scale NGS pipelines using the MOLGENIS platform: processing the Genome of the Netherlands

• Background: Genome of the Netherlands project
   • Why: create a Dutch genetic hapmap to find rarer variants
   • Aim: genome sequence of 1000 chromosomes (12x)

• Challenge: analyze 2250 Illumina lanes
   • Alignment and SNP calling of 760 samples
   • Data handling, QC, reports, etc.

• Solution: NGS software/hardware infrastructure
   • GPFS storage for >100TB of data files
   • Template system for compute protocols
   • Generators to automatically produce analysis scripts
   • MOLGENIS to run and track inputs, analyses, output data

• Demo movie

• Conclusion

Page 10: D02-NextGenSeq-MOLGENIS

Motivation: GWAS revolution in human genetics

Page 15: D02-NextGenSeq-MOLGENIS

GREAT!

Ankylosing Spondylitis

Celiac Disease

Crohn’s disease

Multiple Sclerosis

Psoriasis

Rheumatoid Arthritis

Systemic Lupus Erythematosus

Type 1 Diabetes

Ulcerative Colitis

Page 16: D02-NextGenSeq-MOLGENIS

BUT … these explain a small part of heritability

Page 17: D02-NextGenSeq-MOLGENIS

Missing heritability?

Where might it be hiding?

Page 18: D02-NextGenSeq-MOLGENIS

However: sequencing candidate loci implicates unknown (rare) variants

Page 19: D02-NextGenSeq-MOLGENIS

common

known

First analysis of 1000G project data

Durbin et al., Nature 2010

More insight into the specific genetic architecture of individual populations is crucial

Page 20: D02-NextGenSeq-MOLGENIS

common

known

new

First analysis of 1000G project data shows that the majority of the newly identified and rare variants are population specific

(and there are no Dutch in 1000G)

Durbin et al., Nature 2010

More insight into the specific genetic architecture of individual populations is crucial

Page 21: D02-NextGenSeq-MOLGENIS

Genome of the Netherlands (GoNL):

• Unique family-based design: 250 trios
   • 230 x 2 parents – 1 offspring
   • 10 x 2 parents – 2 offspring
   • 10 x 2 parents – 1 MZ twin offspring

• Immunochip microarray QC control data

• Specifications:
   • Families equally distributed over the Dutch provinces
   • Genomic DNA, paired-end sequencing on HiSeq2000, 12x coverage
   • Trios allow phase information; accurate haplotypes
   • Other results: structural variation, detection of de novo variants

Idea 1: sequence 1000 independent Dutch chromosomes

Biobanks * analysis teams

Page 22: D02-NextGenSeq-MOLGENIS

Idea 2: let's impute the existing GWAS data of 100,000 Dutch samples

GWAS data

Imputation is the process of inferring any missing or untyped genetic variants from typed flanking genetic variants, based on the known local LD relationship
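
The definition above can be illustrated with a toy example. The Python sketch below is not the imputation method used for GoNL (the slides do not name one); it assumes a tiny, made-up reference panel of phased haplotypes and fills an untyped position with the allele most common among the reference haplotypes that agree with the typed flanking alleles. Real tools use probabilistic LD models rather than exact matching.

```python
from collections import Counter

# Hypothetical reference panel: phased haplotypes over four SNPs (alleles 0/1).
REFERENCE_PANEL = [
    (0, 0, 1, 0),
    (0, 0, 1, 0),
    (1, 1, 0, 1),
    (1, 1, 0, 1),
    (0, 1, 1, 0),
]


def impute(typed, panel=REFERENCE_PANEL):
    """Fill untyped positions (None) using reference haplotypes that
    agree with every typed allele; ties are not handled specially."""
    matches = [
        hap for hap in panel
        if all(a is None or a == hap[i] for i, a in enumerate(typed))
    ]
    if not matches:
        return tuple(typed)  # nothing to borrow from: leave the gaps
    result = []
    for i, allele in enumerate(typed):
        if allele is None:
            # Most frequent allele at this position among matching haplotypes
            allele = Counter(hap[i] for hap in matches).most_common(1)[0][0]
        result.append(allele)
    return tuple(result)


print(impute((1, None, 0, 1)))
print(impute((0, None, 1, 0)))
```

The larger and more population-specific the reference panel, the more often a matching haplotype exists, which is the motivation for a Dutch panel.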

Page 24: D02-NextGenSeq-MOLGENIS

GoNL: sequence 1000 independent Dutch chromosomes

Sequence analysis

• 230 trios (690)
• 10 quartets (40)
• 10 MZ twin (40)

•Immunochip GWAS data for QC (UMCG)

Page 25: D02-NextGenSeq-MOLGENIS

GoNL: sequence 1000 independent Dutch chromosomes

Sequence analysis

• 230 trios (690)
• 10 quartets (40)
• 10 MZ twin (40)

• Immunochip GWAS data for QC (UMCG)

Data analysis & method development

• ~75% of data aligned to reference (hg19)

• In-depth analysis on 20 trios (pilot1)

Page 26: D02-NextGenSeq-MOLGENIS

GoNL: sequence 1000 independent Dutch chromosomes

Sequence analysis

• 230 trios (690)
• 10 quartets (40)
• 10 MZ twin (40)

• Immunochip GWAS data for QC (UMCG)

TODO: Imputation

~100,000 Dutch samples with GWAS data

Data analysis & method development

• ~50% of data aligned to reference (hg19)

• In-depth analysis on 20 trios (pilot)

Page 27: D02-NextGenSeq-MOLGENIS

GoNL: sequence 1000 independent Dutch chromosomes

Sequence analysis

• 230 trios (690)
• 10 quartets (40)
• 10 MZ twin (40)

TODO: Imputation

~100,000 Dutch samples with GWAS data

Data analysis & method development

• ~50% of data aligned to reference (hg19)

• In-depth analysis on 20 trios (pilot)

TODO: Further analysis

Structural variation, Population Genetics, De novo mutations, Mitochondrial DNA

This is an open national project: please contact [email protected], [email protected] and [email protected] for analysis ideas.

Page 28: D02-NextGenSeq-MOLGENIS

Data analysis & method development

• ~75% of data aligned to reference (hg19)

• In-depth analysis on 20 trios (pilot)

GoNL: sequence 1000 independent Dutch chromosomes

Sequence analysis

• 230 trios (690)
• 10 quartets (40)
• 10 MZ twin (40)

Imputation existing GWAS

~100,000 Dutch samples with GWAS data

Further analysis

Structural variation, Population Genetics, De novo mutations, Mitochondrial DNA

This is an open national project: please contact [email protected]; [email protected]; [email protected] for analysis ideas.

Page 29: D02-NextGenSeq-MOLGENIS

Challenge 1: Data storage

• 45TB raw data (fq.gz)

• 450TB intermediate data (bam)

• 90TB results (bam + vcf)

Page 30: D02-NextGenSeq-MOLGENIS

Challenge 2: Alignment, Variant Calling, and QC pipelines

Alignment (~1 week)

• Alignment to human genome (Build 37)
• Clean up alignment (mark duplicates, realignment, recalibration)
• Quality control

Variant calling (~1 week)

• SNP calling
• Indel calling
• Variant Filtering

QC: Immunochip concordance

Page 31: D02-NextGenSeq-MOLGENIS

2300 lanes * 15 analysis steps => 34,500 commands needed

• > 2300 * 15 files, 2300 + 750 QC reports, a nightmare to track

/data/gcc/tools/bwa-0.5.8c_patched/bwa aln \
  /data/gcc/resources/hg19/indices/human_g1k_v37.fa \
  /data/gcc/rawdata/ngs/in-house/28may11/24173/110303_SN163_0393_L6_A80MP0ABXX_AGAGAT_1.fq.gz \
  -t 4 \
  -f /data/gcc/rawdata/ngs/in-house/28may11/results/24173/24173.393_L6.HSpe01.bwa_align_pair1.ftl.human_g1k_v37.2011_05_30_20_22.1.sai

/data/gcc/tools/bwa_45_patched/bwa sampe -P \
  -p illumina \
  -i L6 \
  -m 24173 \
  -l A80MP0ABXX \
  /data/gcc/resources/hg19/indices/human_g1k_v37.fa \
  /data/gcc/rawdata/ngs/in-house/28may11/results/24173/24173.393_L6.HSpe01.bwa_align_pair1.ftl.human_g1k_v37.2011_05_30_20_22.1.sai \
  /data/gcc/rawdata/ngs/in-house/28may11/results/24173/24173.393_L6.HSpe02.bwa_align_pair2.ftl.human_g1k_v37.2011_05_30_20_22.2.sai \
  /data/gcc/rawdata/ngs/in-house/28may11/24173/110303_SN163_0393_L6_A80MP0ABXX_AGAGAT_1.fq.gz \
  /data/gcc/rawdata/ngs/in-house/28may11/24173/110303_SN163_0393_L6_A80MP0ABXX_AGAGAT_2.fq.gz \
  -f /data/gcc/rawdata/ngs/in-house/28may11/results/24173/24173.393_L6.HSpe03.bwa_sampe.ftl.human_g1k_v37.2011_05_30_20_22.sam

java -jar -Xmx3g /data/gcc/tools/picard-tools-1.32/SamFormatConverter.jar \
  INPUT=/data/gcc/rawdata/ngs/in-house/28may11/results/24173/24173.393_L6.HSpe03.bwa_sampe.ftl.human_g1k_v37.2011_05_30_20_22.sam \
  OUTPUT=/data/gcc/rawdata/ngs/in-house/28may11/results/24173/24173.393_L6.HSpe04.sam_to_bam.ftl.human_g1k_v37.2011_05_30_20_22.bam \
  VALIDATION_STRINGENCY=LENIENT \
  MAX_RECORDS_IN_RAM=2000000 \
  TMP_DIR=/local

java -jar -Xmx3g /data/gcc/tools/picard-tools-1.32/SortSam.jar \
  INPUT=/data/gcc/rawdata/ngs/in-house/28may11/results/24173/24173.393_L6.HSpe04.sam_to_bam.ftl.human_g1k_v37.2011_05_30_20_22.bam \
  OUTPUT=/data/gcc/rawdata/ngs/in-house/28may11/results/24173/24173.393_L6.HSpe05.sam_sort.ftl.human_g1k_v37.2011_05_30_20_22.sorted.bam \
  SORT_ORDER=coordinate \
  VALIDATION_STRINGENCY=LENIENT \
  MAX_RECORDS_IN_RAM=1000000 \
  TMP_DIR=/local

java -jar -Xmx3g /data/gcc/tools/picard-tools-1.32/BuildBamIndex.jar \
  INPUT=/data/gcc/rawdata/ngs/in-house/28may11/results/24173/24173.393_L6.HSpe05.sam_sort.ftl.human_g1k_v37.2011_05_30_20_22.sorted.bam \
  OUTPUT=/data/gcc/rawdata/ngs/in-house/28may11/results/24173/24173.393_L6.HSpe05.sam_sort.ftl.human_g1k_v37.2011_05_30_20_22.sorted.bam.bai \
  VALIDATION_STRINGENCY=LENIENT \
  MAX_RECORDS_IN_RAM=1000000 \
  TMP_DIR=/local

Page 32: D02-NextGenSeq-MOLGENIS

Challenge 3: > 200,000 compute hours

• Alignment 2300 lanes, 15 steps, ~75 hours per lane

• SNP calling 760 samples, 6 steps, ~50 hours per sample

• Immunochip QC 760 samples, 5 steps, ~1 hour per sample
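
The total follows from the three bullets above. The short Python check below assumes the quoted hours are per lane or per sample for all steps combined (the slide does not say whether they are per step), and the cluster size at the end is a hypothetical figure used only to give a sense of scale.

```python
# (task, number of units, hours per unit) as listed on the slide
WORKLOADS = [
    ("Alignment", 2300, 75),
    ("SNP calling", 760, 50),
    ("Immunochip QC", 760, 1),
]


def total_hours(workloads):
    """Sum of units * hours-per-unit over all tasks."""
    return sum(units * hours for _, units, hours in workloads)


total = total_hours(WORKLOADS)
print(total)  # 211260

# Hypothetical: perfectly parallel jobs on an assumed 500-CPU cluster
cpus = 500
print(f"{total / cpus / 24:.1f} days on {cpus} CPUs")
```

Alignment dominates the budget (172,500 of the roughly 211,000 hours), so lane-level parallelism matters most.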

Compute power

Network and storage I/O

Page 33: D02-NextGenSeq-MOLGENIS

Challenge 4: Did we analyze it all? Correctly? Completely?

Batches: UModqR 60, HUMcriR 90, HUMhxsR 222, HUMrutR 235, HUMjxbR 153, HUMsnrR 10
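
One way to answer the completeness question is to compare the outputs that should exist (one per lane per analysis step) against what is actually on disk. The sketch below is illustrative only: the `lane.step` naming scheme and the helper names are assumptions, not the project's actual file layout or tooling.

```python
def expected_outputs(lanes, steps):
    """All output names we expect: one per lane per analysis step."""
    return {f"{lane}.{step}" for lane in lanes for step in steps}


def missing_outputs(lanes, steps, existing):
    """Expected outputs that are absent from the existing file names."""
    return sorted(expected_outputs(lanes, steps) - set(existing))


# Toy input; in practice `existing` would come from a directory listing.
lanes = ["FC80R35ABXX_L3", "FC80F2RABXX_L3"]
steps = ["bwa_align_pair1", "bwa_align_pair2", "bwa_sampe"]
existing = {
    "FC80R35ABXX_L3.bwa_align_pair1",
    "FC80R35ABXX_L3.bwa_align_pair2",
    "FC80R35ABXX_L3.bwa_sampe",
    "FC80F2RABXX_L3.bwa_align_pair1",
}

for name in missing_outputs(lanes, steps, existing):
    print("missing:", name)
```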

Page 35: D02-NextGenSeq-MOLGENIS

Kickstart the project building on NBIC/BioAssist

• NGS task force

• Biobanking task force

• e-BioGrid team

Page 36: D02-NextGenSeq-MOLGENIS

Solution 1: GPFS shared data storage

• Primary storage in Groningen on ‘Target’

• Backup storage in Amsterdam on ‘BigGrid’

• Data transfer via hard drives

• Systematic organization of raw data, result data, logs

2,000 TB

750 x 3TB disks

3200 tapes

GPFS

http://www.bbmriwiki.nl/wiki/DataManagement
http://www.rug.nl/target/index

Page 37: D02-NextGenSeq-MOLGENIS

Solution 2: data management via sample-lane worksheet

sample  flowcell     lane  lib              machine  date    file
A24a    FC80R35ABXX  L3    HUMhxsRJODIAAPE  I433     101119  101119_I433_FC80R35ABXX_L3_HUMhxsRJODIAAPE
A24a    FC80F2RABXX  L3    HUMhxsRJODIABPE  I481     101120  101120_I481_FC80F2RABXX_L3_HUMhxsRJODIABPE
A24a    FC80GHKABXX  L2    HUMhxsRJODIBAPE  I114     101202  101202_I114_FC80GHKABXX_L2_HUMhxsRJODIBAPE
A24b    FC80R35ABXX  L4    HUMhxsRJPDIAAPE  I433     101119  101119_I433_FC80R35ABXX_L4_HUMhxsRJPDIAAPE
A24b    FC80F2RABXX  L4    HUMhxsRJPDIABPE  I481     101120  101120_I481_FC80F2RABXX_L4_HUMhxsRJPDIABPE
A24b    FC80GHKABXX  L3    HUMhxsRJPDIBAPE  I114     101202  101202_I114_FC80GHKABXX_L3_HUMhxsRJPDIBAPE
A24b    FC81C8UABXX  L3    HUMhxsRJPDIBAPE  I340     110114  110114_I340_FC81C8UABXX_L3_HUMhxsRJPDIBAPE
A24c    FC80R35ABXX  L5    HUMhxsRJQDIAAPE  I433     101119  101119_I433_FC80R35ABXX_L5_HUMhxsRJQDIAAPE
A24c    FC80F2RABXX  L6    HUMhxsRJQDIABPE  I481     101120  101120_I481_FC80F2RABXX_L6_HUMhxsRJQDIABPE
A24c    FC80GHKABXX  L4    HUMhxsRJQDIBAPE  I114     101202  101202_I114_FC80GHKABXX_L4_HUMhxsRJQDIBAPE
A25a    FC80R35ABXX  L6    HUMhxsRJRDIAAPE  I433     101119  101119_I433_FC80R35ABXX_L6_HUMhxsRJRDIAAPE
A25a    FC81C8UABXX  L2    HUMhxsRJRDIAAPE  I340     110114  110114_I340_FC81C8UABXX_L2_HUMhxsRJRDIAAPE
A25a    FC80F54ABXX  L7    HUMhxsRJRDIABPE  I171     101122  101122_I171_FC80F54ABXX_L7_HUMhxsRJRDIABPE
A25a    FC80GHKABXX  L5    HUMhxsRJRDIBAPE  I114     101202  101202_I114_FC80GHKABXX_L5_HUMhxsRJRDIBAPE
A25b    FC80R35ABXX  L7    HUMhxsRJSDIAAPE  I433     101119  101119_I433_FC80R35ABXX_L7_HUMhxsRJSDIAAPE
A25b    FC80EE1ABXX  L5    HUMhxsRJSDIABPE  I171     101122  101122_I171_FC80EE1ABXX_L5_HUMhxsRJSDIABPE
A25b    FC80GHKABXX  L6    HUMhxsRJSDIBAPE  I114     101202  101202_I114_FC80GHKABXX_L6_HUMhxsRJSDIBAPE
A25b    FC80GHJABXX  L1    HUMhxsRJSDIBAPE  I117     101208  101208_I117_FC80GHJABXX_L1_HUMhxsRJSDIBAPE
A25c    FC80R35ABXX  L8    HUMhxsRJTDIAAPE  I433     101119  101119_I433_FC80R35ABXX_L8_HUMhxsRJTDIAAPE
A25c    FC80F54ABXX  L5    HUMhxsRJTDIABPE  I171     101122  101122_I171_FC80F54ABXX_L5_HUMhxsRJTDIABPE
A25c    FC80GHKABXX  L7    HUMhxsRJTDIBAPE  I114     101202  101202_I114_FC80GHKABXX_L7_HUMhxsRJTDIBAPE
A25c    FC81C7KABXX  L5    HUMhxsRJTDIBAPE  I125     110115  110115_I125_FC81C7KABXX_L5_HUMhxsRJTDIBAPE
A26a    FC80PEWABXX  L5    HUMhxsRJUDIAAPE  I198     101120  101120_I198_FC80PEWABXX_L5_HUMhxsRJUDIAAPE
A26a    FC80F2RABXX  L7    HUMhxsRJUDIABPE  I481     101120  101120_I481_FC80F2RABXX_L7_HUMhxsRJUDIABPE
A26a    FC80GHKABXX  L8    HUMhxsRJUDIBAPE  I114     101202  101202_I114_FC80GHKABXX_L8_HUMhxsRJUDIBAPE
A26b    FC80N58ABXX  L5    HUMhxsRJVDIAAPE  I245     101120  101120_I245_FC80N58ABXX_L5_HUMhxsRJVDIAAPE
A26b    FC80PNWABXX  L2    HUMhxsRJVDIABPE  I453     101119  101119_I453_FC80PNWABXX_L2_HUMhxsRJVDIABPE
A26b    FC80G37ABXX  L1    HUMhxsRJVDIBAPE  I127     101126  101126_I127_FC80G37ABXX_L1_HUMhxsRJVDIBAPE
A26c    FC80LDLABXX  L1    HUMhxsRJWDIAAPE  I453     101119  101119_I453_FC80LDLABXX_L1_HUMhxsRJWDIAAPE
A26c    FC80PNWABXX  L3    HUMhxsRJWDIABPE  I453     101119  101119_I453_FC80PNWABXX_L3_HUMhxsRJWDIABPE
A26c    FC80G37ABXX  L2    HUMhxsRJWDIBAPE  I127     101126  101126_I127_FC80G37ABXX_L2_HUMhxsRJWDIBAPE
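
In the worksheet rows above, the `file` column is always the `date`, `machine`, `flowcell`, `lane` and `lib` values joined by underscores. The Python sketch below derives it from the other columns and counts lanes per sample; it is an illustration of the worksheet idea, not MOLGENIS code, and it embeds only three of the rows shown on the slide.

```python
import csv
import io
from collections import Counter

# Three rows copied from the worksheet; the file column is derived below.
WORKSHEET = """sample flowcell lane lib machine date
A24a FC80R35ABXX L3 HUMhxsRJODIAAPE I433 101119
A24a FC80F2RABXX L3 HUMhxsRJODIABPE I481 101120
A24b FC80R35ABXX L4 HUMhxsRJPDIAAPE I433 101119
"""


def read_worksheet(text):
    """Parse the space-separated worksheet and add the derived file name."""
    rows = list(csv.DictReader(io.StringIO(text), delimiter=" "))
    for row in rows:
        row["file"] = "_".join(
            row[key] for key in ("date", "machine", "flowcell", "lane", "lib")
        )
    return rows


def lanes_per_sample(rows):
    """Number of sequenced lanes recorded for each sample."""
    return Counter(row["sample"] for row in rows)


rows = read_worksheet(WORKSHEET)
print(rows[0]["file"])
print(dict(lanes_per_sample(rows)))
```

Keeping one row per sample-lane pair makes it straightforward to regroup lanes by sample later, for example when merging per-lane alignments before SNP calling.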

Page 38: D02-NextGenSeq-MOLGENIS

(of course it is a bit more advanced than that)

NB:
• we have a beta Galaxy tool.xml mapper
• based on GEN2PHEN ‘observation’ model
• we would love to have a shared workflow model

Page 39: D02-NextGenSeq-MOLGENIS

Solution 3: auto-generate all computational protocols

• Auto-generate all the analysis commands:

Generate scripts

1. Create SampleLane list

2. Generate pipeline from templates

3. Submit to Compute cluster

bwa aln ${lane}

bwa aln FC80R35ABXX_L3.fq.gz
bwa aln FC80R35ABXX_L3.fq.gz
bwa aln FC80R35ABXX_L3.fq.gz

15 templates

34,500 scripts

http://www.bbmriwiki.nl/svn/ngs_pipelines/templates/ngs/
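
The generation step can be sketched in a few lines. The slides use FreeMarker templates; the example below swaps in Python's `string.Template`, which happens to share the `${lane}` placeholder syntax, so it shows the idea rather than the project's generator. The step names and the second template are placeholders invented for illustration.

```python
from string import Template

# Step name -> command template. Only the first mirrors the slide.
TEMPLATES = {
    "bwa_align": Template("bwa aln ${lane}.fq.gz"),
    "qc_report": Template('echo "QC report for ${lane}"'),  # placeholder step
}


def generate_scripts(lanes, templates=TEMPLATES):
    """Render every template for every lane: one script per (lane, step)."""
    return {
        (lane, step): template.substitute(lane=lane)
        for lane in lanes
        for step, template in templates.items()
    }


lanes = ["FC80R35ABXX_L3", "FC80F2RABXX_L3", "FC80GHKABXX_L2"]
scripts = generate_scripts(lanes)
print(len(scripts))
print(scripts[("FC80R35ABXX_L3", "bwa_align")])
```

With 2300 lanes and 15 templates the same loop yields the 34,500 scripts quoted on the slide.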

Page 40: D02-NextGenSeq-MOLGENIS

Solution 4: distributed compute efforts > 200,000 hours

• Alignment 2300 lanes, 15 steps, ~75 hours per lane

• SNP calling 760 samples, 6 steps, ~50 hours per sample

• Immunochip QC 760 samples, 5 steps, ~1 hour per sample

RUG CIT/Target
• ~900 lanes done
• ~240 per week
• 360 cpus

AMC/BigGrid
• ~250 lanes done
• ~30 per week
• ~270 cpus

EMC

Hubrecht

Other BigGrid

Page 41: D02-NextGenSeq-MOLGENIS

Solution 5: a tool to submit and monitor compute jobs

Page 42: D02-NextGenSeq-MOLGENIS

Solution 6: REST based services

• To interact with R, Galaxy, Taverna (WSDL), Shell etc

e.g. simply upload a csv from shell

curl -d 'data_type_input=org.molgenis.pheno.Individual&data_input=Name,Descriptio%0AInd1,Desc1%0AInd2,Desc2&data_action=ADD&data_silent=F&submit_input=submit' \
  http://vm7.target.rug.nl/ngs_test/api/add

e.g. simply get data via R

source("http://a.host:8080/molgenis_ngs/api/R")
res <- find.NgsSample();

http://www.molgenis.org/wiki/MolgenisRestInterface
http://www.molgenis.org/wiki/MolgenisRinterface
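
The same upload can be prepared from Python using only the standard library. This sketch reuses the form field names and URL from the curl example on this slide; the function name is hypothetical, and the request is only built, not sent, since the test server may no longer exist.

```python
from urllib.parse import parse_qs, urlencode
from urllib.request import Request


def build_add_request(base_url, entity, csv_text):
    """Build (but do not send) a POST that mirrors the curl upload."""
    fields = {
        "data_type_input": entity,
        "data_input": csv_text,
        "data_action": "ADD",
        "data_silent": "F",
        "submit_input": "submit",
    }
    body = urlencode(fields).encode("ascii")  # newlines become %0A
    return Request(base_url + "/api/add", data=body)


# CSV payload copied as printed on the slide (header spelling included).
csv_text = "Name,Descriptio\nInd1,Desc1\nInd2,Desc2"
request = build_add_request(
    "http://vm7.target.rug.nl/ngs_test",
    "org.molgenis.pheno.Individual",
    csv_text,
)
print(request.get_method(), request.full_url)
print(request.data.decode("ascii"))
# To actually send it: urllib.request.urlopen(request)
```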

Page 43: D02-NextGenSeq-MOLGENIS

All working together (beta)

MOLGENIS user interface for NGS (Java)

Petabyte file storage (GPFS, GridFS?)

compute cluster (PBS, Grid?)

bwa aln ${lane}

Protocol catalogue (FreeMarker)

Lane & sample metadata and QC reports (MySQL)

MOLGENIS/compute

Generate ‘ProtocolApplications’

Submit and monitor (GridGain)

uses

API-R-Galaxy-Taverna-IGV-UCSC

Data & protocols Result exploration

uses

Test & play

Page 47: D02-NextGenSeq-MOLGENIS

Alignment results

Alignment (~1 week)

• Alignment to human genome (Build 37)
• Clean up alignment (mark duplicates, realignment, recalibration)
• Quality control

Variant calling (~1 week)

• Individual SNP calling
• Indel calling
• Variant Filtering

>94% reads aligned

>13x avg coverage

Page 48: D02-NextGenSeq-MOLGENIS

SNP calling result (GoNL Pilot Chr20 – 1KG Phase I)

Set               SNPs     %dbSNP  Ti/Tv
GoNL Pilot only   16,045   2.05    2.20
Intersection      177,389  65.91   2.41
1KG Phase 1 only  648,284  10.23   2.36

1KG estimated Chr20 Ti/Tv: 2.36
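
The overlap between the two call sets can be read off the three SNP counts above. The short calculation below uses only those counts; the variable names are illustrative.

```python
# Chr20 SNP counts from the slide
gonl_only = 16_045
shared = 177_389
kg_only = 648_284

# Share of GoNL pilot SNPs that are also in 1KG Phase 1, and vice versa
gonl_in_1kg = shared / (gonl_only + shared)
kg_in_gonl = shared / (kg_only + shared)

print(f"GoNL pilot SNPs also in 1KG Phase 1: {gonl_in_1kg:.1%}")
print(f"1KG Phase 1 SNPs also in GoNL pilot: {kg_in_gonl:.1%}")
```

Roughly nine in ten GoNL pilot SNPs on chromosome 20 are shared with 1KG Phase 1, while the GoNL-only set has both the lowest dbSNP fraction and the lowest Ti/Tv of the three groups.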

Page 49: D02-NextGenSeq-MOLGENIS

Next…

• Polish the software ... a lot
   • It's MOLGENIS so anybody can download and customize (ideas anyone?)
   • Integrate the login/security module
   • Providing reports for the ‘end-users’
   • Enabling trend analyses, etc.

• Integrate and run more pipelines for GoNL
   • Structural Variation Group
      • Finalize GoNL SV pipeline
      • Integrate SNP Calling / SV pipelines
   • Imputation Group
      • Phase Pilot data
      • Impute sequence data
      • Estimate gain of GoNL vs HapMap/1KG as Imputation panel

Page 50: D02-NextGenSeq-MOLGENIS

Get all as open source:
• GoNL - http://www.nlgenome.nl
• MOLGENIS - http://www.molgenis.org
• Analysis team - http://www.bbmriwiki.nl

Acknowledgements

• GoNL / MOLGENIS Infrastructure team
   • George Byelas, Martijn Dijkstra, Robert Wagner, Pieter Neerincx, Abhishek Narain, Jan Bot and indirectly GEN2PHEN, EBI, FIMM, ...

• GoNL Analysis team (creating pipelines and tools)
   • Freerk van Dijk (UMCG), Barbera van Schaik (AMC), Ies Nijman (Hubrecht), Slavik Koval (EMC), Laurent Francioli (UU), Kai Ye (LUMC), Jeroen Laros (LUMC), Lennart Karssen (EMC), JoukeJan Hottenga (VU), Mathijs Kattenberg (VU), David van Enckvort (NBIC), Leon Mei (NBIC), Elise van Leeuwen (EMC), … and many, many others

• GoNL Steering group (coordination)
   • Cisca Wijmenga (PI GoNL), Morris Swertz (PI analysis), Gertjan van Ommen (LUMC), Eline Slagboom (LUMC), Jasper Bovenberg (ELSI issues), Cornelia van Duijn (EMC), Dorret Boomsma (VU), Paul de Bakker (co-PI analysis, UU)
