Large scale NGS pipelines using the MOLGENIS platform: processing the Genome of the Netherlands
Morris Swertz, UMC Groningen, Netherlands
and members of BBMRI-NL, NBIC, MOLGENIS
BOSC 2011, July 15, Vienna
Use (web)
Animal Observatory
NextGenSeq
Mutation database
Model organisms
Model (xml)
Generator (java)
At BOSC 2010 we demonstrated the MOLGENIS software toolkit
Swertz et al (2010) BMC Bioinformatics 11(Suppl 12):S12, http://www.molgenis.org
Get stuff for free as others build it already
Connect to annotation services
Plugin rich analysis tools
Connect to statistics
UML documentation of your model
Edit & trace your data
Import/export to Excel
find.investigation()
102 downloaded
obs <- find.observedvalue()
43,920 downloaded
# some calculation
add.inferredvalue(res)
36 added
Three steps: Model –> Generate –> Use
9200 INFO [FormScreenGen] generated generated\java\ui\screen\TopMenu\Main\ProtocolsForm.java
9293 INFO [FormScreenGen] generated generated\java\ui\screen\TopMenu\Main\Protocols\ProtocolMenu\ParametersForm.java
9325 INFO [FormScreenGen] generated generated\java\ui\screen\TopMenu\Main\Protocols\ProtocolMenu\ProtocolComponentsForm.java
9496 INFO [FormScreenGen] generated generated\java\ui\screen\TopMenu\Main\Ontologies\OntologyTermsForm.java
9528 INFO [FormScreenGen] generated generated\java\ui\screen\TopMenu\Main\Ontologies\OntologySourcesForm.java
9606 INFO [FormScreenGen] generated generated\java\ui\screen\TopMenu\Main\Ontologies\OntologySources\OntologyTermsForm.java
9638 INFO [FormScreenGen] generated generated\java\ui\screen\TopMenu\Main\Ontologies\CodeListsForm.java
9700 INFO [FormScreenGen] generated generated\java\ui\screen\TopMenu\Main\Ontologies\CodeLists\CodesForm.java
9965 INFO [MenuScreenGen] generated generated\java\ui\screen\TopMenuMenu.java
10012 INFO [MenuScreenGen] generated generated\java\ui\screen\TopMenu\MainMenu.java
10059 INFO [MenuScreenGen] generated generated\java\ui\screen\TopMenu\Main\Investigations\InvestigationMenuMenu.java
10152 INFO [MenuScreenGen] generated generated\java\ui\screen\TopMenu\Main\Investigations\InvestigationMenu\ProtocolApplications\ProtocolApplicationMenuMenu.java
10230 INFO [MenuScreenGen] generated generated\java\ui\screen\TopMenu\Main\ObservationTargetsMenu.java
10293 INFO [MenuScreenGen] generated generated\java\ui\screen\TopMenu\Main\Protocols\ProtocolMenuMenu.java
10324 INFO [MenuScreenGen] generated generated\java\ui\screen\TopMenu\Main\OntologiesMenu.java
11354 INFO [PluginScreenGen] generated Molgenis33Workspace\molgenis4phenotype\generated\java\ui\screen\TopMenu\Main\ReportPlugin.java
11557 INFO [PluginScreenGen] generated Molgenis33Workspace\molgenis4phenotype\generated\java\ui\screen\TopMenu\Main\Ontologies\OntologyManagerPlugin.java
11604 INFO [PluginScreenGen] generated Molgenis33Workspace\molgenis4phenotype\generated\java\ui\screen\TopMenu\Model_documentationPlugin.java
11604 INFO [PluginScreenGen] generated Molgenis33Workspace\molgenis4phenotype\generated\java\ui\screen\TopMenu\RprojectApiPlugin.java
11620 INFO [PluginScreenGen] generated Molgenis33Workspace\molgenis4phenotype\generated\java\ui\screen\TopMenu\HttpApiPlugin.java
11635 INFO [PluginScreenGen] generated Molgenis33Workspace\molgenis4phenotype\generated\java\ui\screen\TopMenu\WebServicesApiPlugin.java
11651 WARN [PluginScreenFTLTemplateGen] Skipped because exists: handwritten\java\plugin\report\InvestigationOverview.ftl
11807 WARN [PluginScreenFTLTemplateGen] Skipped because exists: handwritten\java\plugin\OntologyBrowser\OntologyBrowserPlugin.ftl
11807 WARN [PluginScreenFTLTemplateGen] Skipped because exists: handwritten\java\plugin\topmenu\DocumentationScreen.ftl
11807 WARN [PluginScreenFTLTemplateGen] Skipped because exists: handwritten\java\plugin\topmenu\RprojectApiScreen.ftl
11823 WARN [PluginScreenFTLTemplateGen] Skipped because exists: handwritten\java\plugin\topmenu\HttpAPiScreen.ftl
11823 WARN [PluginScreenFTLTemplateGen] Skipped because exists: handwritten\java\plugin\topmenu\SoapApiScreen.ftl
11854 WARN [PluginScreenJavaTemplateGen] Skipped because exists: handwritten\java\plugin\report\InvestigationOverview.java
12057 WARN [PluginScreenJavaTemplateGen] Skipped because exists: handwritten\java\plugin\OntologyBrowser\OntologyBrowserPlugin.java
12072 WARN [PluginScreenJavaTemplateGen] Skipped because exists: handwritten\java\plugin\topmenu\DocumentationScreen.java
12088 WARN [PluginScreenJavaTemplateGen] Skipped because exists: handwritten\java\plugin\topmenu\RprojectApiScreen.java
12088 WARN [PluginScreenJavaTemplateGen] Skipped because exists: handwritten\java\plugin\topmenu\HttpAPiScreen.java
12088 WARN [PluginScreenJavaTemplateGen] Skipped because exists: handwritten\java\plugin\topmenu\SoapApiScreen.java
12103 INFO [MolgenisServletContextGen] generated WebContent\META-INF\context.xml
12259 INFO [SoapApiGen] generated generated\java\ui\SoapApi.java
12353 INFO [CsvExportGen] generated generated\java\tools\CsvExport.java
12431 INFO [CsvImportByNameGen] generated generated\java\tools\CsvImportByName.java
12636 INFO [CopyMemoryToDatabaseGen] generated generated\java\ui\tools\CopyMemoryToDatabase.java
Real example: generates 150 files, 30k lines of Java, MySQL, CXF, Tomcat config, and R code + docs
XGAP for GWAS/GWL
Disease specific databases
BBMRI biobank catalogue
GWAS central data manager
NGS cyber infrastructure
MAGE-TAB microarray
AnimalDB
Currently: Towards an integrated app suite
Large scale NGS pipelines using the MOLGENIS platform: processing the Genome of the Netherlands
• Background: Genome of the Netherlands project
  • Why: create a Dutch genetic hapmap to find rarer variants
  • Aim: genome sequence of 1000 chromosomes (12x)
• Challenge: analyze 2250 Illumina lanes
  • Alignment and SNP calls of 760 samples
  • Data handling, QC, reports, etc.
• Solution: NGS software/hardware infrastructure
  • GPFS storage for >100TB of data files
  • Template system for compute protocols
  • Generators to automatically produce analysis scripts
  • MOLGENIS to run and track inputs, analyses, output data
• Demo movie
• Conclusion
GREAT!
Ankylosing Spondylitis
Celiac Disease
Crohn’s disease
Multiple Sclerosis
Psoriasis
Rheumatoid Arthritis
Systemic Lupus Erythematosus
Type 1 Diabetes
Ulcerative Colitis
However: sequencing candidate loci implicates unknown (rare) variants
Variant classes in the figure: common, known, new
First analysis of 1000G project data shows that the majority of the newly identified and rare variants are population specific (and there are no Dutch in 1000G)
Durbin et al., Nature 2010
More insight into the specific genetic architecture of individual populations is crucial
Genome of the Netherlands (GoNL):
• Unique family-based design: 250 trios
  • 230 x 2 parents – 1 offspring
  • 10 x 2 parents – 2 offspring
  • 10 x 2 parents – 1 MZ twin offspring
• Immunochip microarray QC data
• Specifications:
  • Families equally distributed over the Dutch provinces
  • Genomic DNA, paired-end sequencing on HiSeq2000, 12x coverage
  • Trios allow phase information; accurate haplotypes
  • Other results: structural variation, detection of de novo variants
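The family design fixes both the number of samples to sequence and the number of independent chromosomes. A minimal sketch of that arithmetic follows. It assumes that only the two parents of each family are unrelated, that each parent carries two chromosome copies, and that an MZ twin family contributes two offspring samples; the variable names are illustrative.

```python
# Family design from the slide: family type -> (number of families, offspring samples).
families = {
    "trio": (230, 1),
    "quartet": (10, 2),
    "mz_twin": (10, 2),  # assumption: both twins of the MZ pair are sampled
}

PARENTS_PER_FAMILY = 2
CHROMOSOME_COPIES_PER_PARENT = 2  # diploid genome

n_families = sum(count for count, _ in families.values())
n_samples = sum(count * (PARENTS_PER_FAMILY + offspring)
                for count, offspring in families.values())
# Offspring inherit their chromosomes from the parents, so only
# parental chromosomes count as independent.
n_independent_chromosomes = (
    n_families * PARENTS_PER_FAMILY * CHROMOSOME_COPIES_PER_PARENT
)

print(n_families, n_samples, n_independent_chromosomes)
```

Under these assumptions the 250 families yield the 1000 independent chromosomes named in the project aim.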
Idea 1: sequence 1000 independent Dutch chromosomes
Biobanks * analysis teams
Idea 2: let's impute the 100,000 existing Dutch GWAS samples
GWAS data
Imputation is the process of inferring missing or untyped genetic variants from typed flanking variants, based on the known local LD (linkage disequilibrium) relationship.
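To make the idea concrete, here is a toy sketch of imputation. It is not how production imputation tools work (those use probabilistic haplotype models); it only picks, for each untyped position, the most common allele among reference haplotypes that agree with the target at every typed position. The function name, the `?` marker for untyped positions and the tiny panel are all invented for illustration.

```python
from collections import Counter

def impute(target, reference_panel, missing="?"):
    """Fill untyped positions using reference haplotypes that agree with
    the target at every typed position (a crude stand-in for LD)."""
    typed = [i for i, allele in enumerate(target) if allele != missing]
    matches = [hap for hap in reference_panel
               if all(hap[i] == target[i] for i in typed)]
    if not matches:
        return target  # no compatible haplotype: leave the gaps untouched
    result = list(target)
    for i, allele in enumerate(target):
        if allele == missing:
            # Most common allele among the compatible reference haplotypes.
            result[i] = Counter(hap[i] for hap in matches).most_common(1)[0][0]
    return "".join(result)

# Hypothetical reference panel of four haplotypes over five markers.
panel = ["ACGTA", "ACGTA", "ATGCA", "GTGCC"]
print(impute("A?G?A", panel))
```

A population-specific reference panel matters here because the quality of the guess depends on how well the panel haplotypes represent the target population.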
GoNL: sequence 1000 independent Dutch chromosomes
Sequence analysis
• 230 trios (690)
• 10 quartets (40)
• 10 MZ twin (40)
•Immunochip GWAS data for QC (UMCG)
Data analysis & method development
• ~75% of data aligned to reference (hg19)
• In-depth analysis on 20 trios (pilot1)
TODO: Imputation
~100,000 Dutch samples with GWAS data
Data analysis & method development
• ~50% of data aligned to reference (hg19)
• In-depth analysis on 20 trios (pilot)
TODO: Further analysis: structural variation, population genetics, de novo mutations, mitochondrial DNA
This is an open national project: please contact [email protected], [email protected] and [email protected] for analysis ideas.
Challenge 1: Data storage
• 45TB raw data (fq.gz)
• 450TB intermediate data (bam)
• 90TB results (bam + vcf)
Challenge 2: Alignment, Variant Calling, and QC pipelines
Alignment (~1 week):
• Alignment to human genome (Build 37)
• Clean up alignment (mark duplicates, realignment, recalibration)
• Quality control
Variant calling (~1 week):
• SNP calling
• Indel calling
• Variant filtering
QC: Immunochip concordance
2300 lanes * 15 analysis steps => 34,500 commands needed
• > 2300 * 15 files, 2300 + 750 QC reports, a nightmare to track
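The bookkeeping load follows directly from the numbers on the slide. A small sketch of the arithmetic, using only the quoted figures (the variable names are illustrative):

```python
lanes = 2300
steps_per_lane = 15
sample_qc_reports = 750  # figure quoted on the slide

commands = lanes * steps_per_lane        # one command per lane per step
qc_reports = lanes + sample_qc_reports   # per-lane plus per-sample reports

print(f"{commands:,} commands, {qc_reports:,} QC reports")
```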
/data/gcc/tools/bwa-0.5.8c_patched/bwa aln \
  /data/gcc/resources/hg19/indices/human_g1k_v37.fa \
  /data/gcc/rawdata/ngs/in-house/28may11/24173/110303_SN163_0393_L6_A80MP0ABXX_AGAGAT_1.fq.gz \
  -t 4 \
  -f /data/gcc/rawdata/ngs/in-house/28may11/results/24173/24173.393_L6.HSpe01.bwa_align_pair1.ftl.human_g1k_v37.2011_05_30_20_22.1.sai

/data/gcc/tools/bwa_45_patched/bwa sampe -P \
  -p illumina \
  -i L6 \
  -m 24173 \
  -l A80MP0ABXX \
  /data/gcc/resources/hg19/indices/human_g1k_v37.fa \
  /data/gcc/rawdata/ngs/in-house/28may11/results/24173/24173.393_L6.HSpe01.bwa_align_pair1.ftl.human_g1k_v37.2011_05_30_20_22.1.sai \
  /data/gcc/rawdata/ngs/in-house/28may11/results/24173/24173.393_L6.HSpe02.bwa_align_pair2.ftl.human_g1k_v37.2011_05_30_20_22.2.sai \
  /data/gcc/rawdata/ngs/in-house/28may11/24173/110303_SN163_0393_L6_A80MP0ABXX_AGAGAT_1.fq.gz \
  /data/gcc/rawdata/ngs/in-house/28may11/24173/110303_SN163_0393_L6_A80MP0ABXX_AGAGAT_2.fq.gz \
  -f /data/gcc/rawdata/ngs/in-house/28may11/results/24173/24173.393_L6.HSpe03.bwa_sampe.ftl.human_g1k_v37.2011_05_30_20_22.sam

java -jar -Xmx3g /data/gcc/tools/picard-tools-1.32/SamFormatConverter.jar \
  INPUT=/data/gcc/rawdata/ngs/in-house/28may11/results/24173/24173.393_L6.HSpe03.bwa_sampe.ftl.human_g1k_v37.2011_05_30_20_22.sam \
  OUTPUT=/data/gcc/rawdata/ngs/in-house/28may11/results/24173/24173.393_L6.HSpe04.sam_to_bam.ftl.human_g1k_v37.2011_05_30_20_22.bam \
  VALIDATION_STRINGENCY=LENIENT \
  MAX_RECORDS_IN_RAM=2000000 \
  TMP_DIR=/local

java -jar -Xmx3g /data/gcc/tools/picard-tools-1.32/SortSam.jar \
  INPUT=/data/gcc/rawdata/ngs/in-house/28may11/results/24173/24173.393_L6.HSpe04.sam_to_bam.ftl.human_g1k_v37.2011_05_30_20_22.bam \
  OUTPUT=/data/gcc/rawdata/ngs/in-house/28may11/results/24173/24173.393_L6.HSpe05.sam_sort.ftl.human_g1k_v37.2011_05_30_20_22.sorted.bam \
  SORT_ORDER=coordinate \
  VALIDATION_STRINGENCY=LENIENT \
  MAX_RECORDS_IN_RAM=1000000 \
  TMP_DIR=/local

java -jar -Xmx3g /data/gcc/tools/picard-tools-1.32/BuildBamIndex.jar \
  INPUT=/data/gcc/rawdata/ngs/in-house/28may11/results/24173/24173.393_L6.HSpe05.sam_sort.ftl.human_g1k_v37.2011_05_30_20_22.sorted.bam \
  OUTPUT=/data/gcc/rawdata/ngs/in-house/28may11/results/24173/24173.393_L6.HSpe05.sam_sort.ftl.human_g1k_v37.2011_05_30_20_22.sorted.bam.bai \
  VALIDATION_STRINGENCY=LENIENT \
  MAX_RECORDS_IN_RAM=1000000 \
  TMP_DIR=/local
Challenge 3: > 200,000 compute hours
• Alignment 2300 lanes, 15 steps, ~75 hours per lane
• SNP calling 760 samples, 6 steps, ~50 hours per sample
• Immunochip QC 760 samples, 5 steps, 1 hour per sample
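The three workloads add up to the headline figure. A short sketch of the sum, assuming the quoted hours are per lane or per sample on a single core (the slide does not say so explicitly):

```python
# (task, number of units, hours per unit) as quoted on the slide.
workload = [
    ("Alignment", 2300, 75),
    ("SNP calling", 760, 50),
    ("Immunochip QC", 760, 1),
]

total_hours = sum(units * hours for _, units, hours in workload)
for task, units, hours in workload:
    print(f"{task:14s} {units * hours:>8,} h")
print(f"{'Total':14s} {total_hours:>8,} h")

# Rough scale: how long one always-busy core would need.
core_years = total_hours / (24 * 365)
print(f"about {core_years:.1f} core-years")
```

Alignment dominates the total, which is why distributing lanes over several compute sites is the obvious way to parallelize.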
Compute power
Network and storage I/O
Challenge 4: Did we analyze it all? Correctly? Completely?
Batches:
UModqR 60
HUMcriR 90
HUMhxsR 222
HUMrutR 235
HUMjxbR 153
HUMsnrR 10
Kickstart the project building on NBIC/BioAssist
• NGS task force
• Biobanking task force
• e-BioGrid team
Solution 1: GPFS shared data storage
• Primary storage in Groningen on ‘Target’
• Backup storage in Amsterdam on ‘BigGrid’
• Data transfer via hard drives
• Systematic organization of raw data, result data, logs
2,000 TB
750 x 3TB disks
3200 tapes
GPFS
http://www.bbmriwiki.nl/wiki/DataManagement
http://www.rug.nl/target/index
Solution 2: data management via sample-lane worksheet
sample flowcell    lane lib             machine date   file
A24a   FC80R35ABXX L3   HUMhxsRJODIAAPE I433    101119 101119_I433_FC80R35ABXX_L3_HUMhxsRJODIAAPE
A24a   FC80F2RABXX L3   HUMhxsRJODIABPE I481    101120 101120_I481_FC80F2RABXX_L3_HUMhxsRJODIABPE
A24a   FC80GHKABXX L2   HUMhxsRJODIBAPE I114    101202 101202_I114_FC80GHKABXX_L2_HUMhxsRJODIBAPE
A24b   FC80R35ABXX L4   HUMhxsRJPDIAAPE I433    101119 101119_I433_FC80R35ABXX_L4_HUMhxsRJPDIAAPE
A24b   FC80F2RABXX L4   HUMhxsRJPDIABPE I481    101120 101120_I481_FC80F2RABXX_L4_HUMhxsRJPDIABPE
A24b   FC80GHKABXX L3   HUMhxsRJPDIBAPE I114    101202 101202_I114_FC80GHKABXX_L3_HUMhxsRJPDIBAPE
A24b   FC81C8UABXX L3   HUMhxsRJPDIBAPE I340    110114 110114_I340_FC81C8UABXX_L3_HUMhxsRJPDIBAPE
A24c   FC80R35ABXX L5   HUMhxsRJQDIAAPE I433    101119 101119_I433_FC80R35ABXX_L5_HUMhxsRJQDIAAPE
A24c   FC80F2RABXX L6   HUMhxsRJQDIABPE I481    101120 101120_I481_FC80F2RABXX_L6_HUMhxsRJQDIABPE
A24c   FC80GHKABXX L4   HUMhxsRJQDIBAPE I114    101202 101202_I114_FC80GHKABXX_L4_HUMhxsRJQDIBAPE
A25a   FC80R35ABXX L6   HUMhxsRJRDIAAPE I433    101119 101119_I433_FC80R35ABXX_L6_HUMhxsRJRDIAAPE
A25a   FC81C8UABXX L2   HUMhxsRJRDIAAPE I340    110114 110114_I340_FC81C8UABXX_L2_HUMhxsRJRDIAAPE
A25a   FC80F54ABXX L7   HUMhxsRJRDIABPE I171    101122 101122_I171_FC80F54ABXX_L7_HUMhxsRJRDIABPE
A25a   FC80GHKABXX L5   HUMhxsRJRDIBAPE I114    101202 101202_I114_FC80GHKABXX_L5_HUMhxsRJRDIBAPE
A25b   FC80R35ABXX L7   HUMhxsRJSDIAAPE I433    101119 101119_I433_FC80R35ABXX_L7_HUMhxsRJSDIAAPE
A25b   FC80EE1ABXX L5   HUMhxsRJSDIABPE I171    101122 101122_I171_FC80EE1ABXX_L5_HUMhxsRJSDIABPE
A25b   FC80GHKABXX L6   HUMhxsRJSDIBAPE I114    101202 101202_I114_FC80GHKABXX_L6_HUMhxsRJSDIBAPE
A25b   FC80GHJABXX L1   HUMhxsRJSDIBAPE I117    101208 101208_I117_FC80GHJABXX_L1_HUMhxsRJSDIBAPE
A25c   FC80R35ABXX L8   HUMhxsRJTDIAAPE I433    101119 101119_I433_FC80R35ABXX_L8_HUMhxsRJTDIAAPE
A25c   FC80F54ABXX L5   HUMhxsRJTDIABPE I171    101122 101122_I171_FC80F54ABXX_L5_HUMhxsRJTDIABPE
A25c   FC80GHKABXX L7   HUMhxsRJTDIBAPE I114    101202 101202_I114_FC80GHKABXX_L7_HUMhxsRJTDIBAPE
A25c   FC81C7KABXX L5   HUMhxsRJTDIBAPE I125    110115 110115_I125_FC81C7KABXX_L5_HUMhxsRJTDIBAPE
A26a   FC80PEWABXX L5   HUMhxsRJUDIAAPE I198    101120 101120_I198_FC80PEWABXX_L5_HUMhxsRJUDIAAPE
A26a   FC80F2RABXX L7   HUMhxsRJUDIABPE I481    101120 101120_I481_FC80F2RABXX_L7_HUMhxsRJUDIABPE
A26a   FC80GHKABXX L8   HUMhxsRJUDIBAPE I114    101202 101202_I114_FC80GHKABXX_L8_HUMhxsRJUDIBAPE
A26b   FC80N58ABXX L5   HUMhxsRJVDIAAPE I245    101120 101120_I245_FC80N58ABXX_L5_HUMhxsRJVDIAAPE
A26b   FC80PNWABXX L2   HUMhxsRJVDIABPE I453    101119 101119_I453_FC80PNWABXX_L2_HUMhxsRJVDIABPE
A26b   FC80G37ABXX L1   HUMhxsRJVDIBAPE I127    101126 101126_I127_FC80G37ABXX_L1_HUMhxsRJVDIBAPE
A26c   FC80LDLABXX L1   HUMhxsRJWDIAAPE I453    101119 101119_I453_FC80LDLABXX_L1_HUMhxsRJWDIAAPE
A26c   FC80PNWABXX L3   HUMhxsRJWDIABPE I453    101119 101119_I453_FC80PNWABXX_L3_HUMhxsRJWDIABPE
A26c   FC80G37ABXX L2   HUMhxsRJWDIBAPE I127    101126 101126_I127_FC80G37ABXX_L2_HUMhxsRJWDIBAPE
(of course it is a bit more advanced than that)
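In the worksheet rows shown, the file column is always the date, machine, flowcell, lane and library joined by underscores, so the worksheet can be checked mechanically. A minimal sketch of such a check, assuming a single-space-separated text export (the real worksheet format and the helper names `read_worksheet` and `expected_file` are assumptions):

```python
import csv
import io
from collections import defaultdict

# Three rows copied from the worksheet; the space-separated layout is assumed.
WORKSHEET = """\
sample flowcell lane lib machine date file
A24a FC80R35ABXX L3 HUMhxsRJODIAAPE I433 101119 101119_I433_FC80R35ABXX_L3_HUMhxsRJODIAAPE
A24a FC80F2RABXX L3 HUMhxsRJODIABPE I481 101120 101120_I481_FC80F2RABXX_L3_HUMhxsRJODIABPE
A24b FC80R35ABXX L4 HUMhxsRJPDIAAPE I433 101119 101119_I433_FC80R35ABXX_L4_HUMhxsRJPDIAAPE
"""

def read_worksheet(text):
    """Parse the worksheet into one dict per sample-lane row."""
    return list(csv.DictReader(io.StringIO(text), delimiter=" "))

def expected_file(row):
    """File name implied by the other columns of a row."""
    return "_".join(row[key] for key in ("date", "machine", "flowcell", "lane", "lib"))

rows = read_worksheet(WORKSHEET)
lanes_per_sample = defaultdict(list)
for row in rows:
    # Fail early on a typo instead of discovering it after days of compute.
    if row["file"] != expected_file(row):
        raise ValueError(f"inconsistent worksheet row: {row}")
    lanes_per_sample[row["sample"]].append(row["file"])

for sample, files in lanes_per_sample.items():
    print(sample, len(files), "lanes")
```

Grouping lanes by sample is what later lets per-lane alignments be merged into per-sample variant calls.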
NB:
• we have a beta Galaxy tool.xml mapper
• based on GEN2PHEN ‘observation’ model
• we would love to have a shared workflow model
Solution 3: auto-generate all computational protocols
• Auto-generate all the analysis commands:
Generate scripts
1. Create SampleLane list
2. Generate pipeline from templates
3. Submit to Compute cluster
bwa aln ${lane}
bwa aln FC80R35ABXX_L3.fq.gz
bwa aln FC80R35ABXX_L3.fq.gz
bwa aln FC80R35ABXX_L3.fq.gz
34,500 scripts
15 templates
http://www.bbmriwiki.nl/svn/ngs_pipelines/templates/ngs/
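The generation step amounts to filling one template per analysis step with the values of each worksheet row. The sketch below imitates this with Python's `string.Template`, which happens to share the `${lane}` placeholder syntax; the actual pipeline uses FreeMarker templates, and the function name and lane list here are illustrative only.

```python
from string import Template

def generate_scripts(templates, lanes):
    """Return one command per (lane, template) pair."""
    return [template.substitute(lane=lane)
            for lane in lanes
            for template in templates]

# One template as shown on the slide; the real pipeline has 15.
templates = [Template("bwa aln ${lane}")]

# Illustrative lane files, named flowcell_lane as in the worksheet.
lanes = [
    "FC80R35ABXX_L3.fq.gz",
    "FC80F2RABXX_L3.fq.gz",
    "FC80GHKABXX_L2.fq.gz",
]

scripts = generate_scripts(templates, lanes)
for script in scripts:
    print(script)
```

The number of generated scripts is the number of templates times the number of lanes, which is how 15 templates and 2300 lanes give 34,500 scripts.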
Solution 4: distributed compute efforts > 200,000 hours
• Alignment 2300 lanes, 15 steps, ~75 hours per lane
• SNP calling 760 samples, 6 steps, ~50 hours per sample
• Immunochip QC 760 samples, 5 steps, 1 hour per sample
RUG CIT/Target: ~900 lanes done, ~240 per week, 360 cpus
AMC/BigGrid: ~250 lanes done, ~30 per week, ~270 cpus
EMC
Hubrecht
Other BigGrid
Solution 6: REST based services
• To interact with R, Galaxy, Taverna (WSDL), Shell etc
http://www.molgenis.org/wiki/MolgenisRestInterface
http://www.molgenis.org/wiki/MolgenisRinterface
e.g. simply upload a csv from shell:
curl -d 'data_type_input=org.molgenis.pheno.Individual&data_input=Name,Descriptio%0AInd1,Desc1%0AInd2,Desc2&data_action=ADD&data_silent=F&submit_input=submit' \
  http://vm7.target.rug.nl/ngs_test/api/add
e.g. simply get data via R:
source("http://a.host:8080/molgenis_ngs/api/R")
res <- find.NgsSample();
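The same upload can be prepared from any language that can send a form-encoded POST. The sketch below only builds the request and does not send it, since the test server from the slide may no longer exist. The field names and URL are copied from the curl example; the function name is hypothetical, and nothing here is confirmed MOLGENIS client code.

```python
from urllib.parse import urlencode
from urllib.request import Request

API_URL = "http://vm7.target.rug.nl/ngs_test/api/add"  # URL from the slide

def build_add_request(entity, csv_text, url=API_URL):
    """Build (but do not send) a form-encoded POST that adds CSV rows."""
    fields = {
        "data_type_input": entity,
        "data_input": csv_text,
        "data_action": "ADD",
        "data_silent": "F",
        "submit_input": "submit",
    }
    body = urlencode(fields).encode("ascii")
    return Request(url, data=body, method="POST")

# CSV text copied verbatim from the slide, including its header spelling.
request = build_add_request(
    "org.molgenis.pheno.Individual",
    "Name,Descriptio\nInd1,Desc1\nInd2,Desc2",
)
print(request.get_method(), request.full_url)
print(request.data.decode("ascii"))
```

`urlencode` escapes the commas as `%2C` in addition to the newlines as `%0A`, which decodes to the same form data as the curl call. Passing the request to `urllib.request.urlopen` would send it.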
All working together (beta)
MOLGENIS user interface for NGS (Java)
Petabyte file storage (GPFS, GridFS?)
Compute cluster (PBS, Grid?)
bwa aln ${lane}
Protocol catalogue (FreeMarker)
Lane & sample metadata and QC reports (MySQL)
MOLGENIS/compute
Generate ‘ProtocolApplications’
Submit and monitor (GridGain)
uses
API-R-Galaxy-Taverna-IGV-UCSC
Data & protocols Result exploration
uses
Test & play
Download demo from DropBox
http://dl.dropbox.com/u/1839500/Swertz_BOSC_2011.mp4
Alignment results
Alignment (~1 week):
• Alignment to human genome (Build 37)
• Clean up alignment (mark duplicates, realignment, recalibration)
• Quality control
Variant calling (~1 week):
• Individual SNP calling
• Indel calling
• Variant filtering
>94% reads aligned
>13x avg coverage
SNP calling result (GoNL Pilot Chr20 – 1KG Phase I)
Set               SNPs     %dbSNP  Ti/Tv
GoNL Pilot Only   16,045   2.05    2.20
Intersection      177,389  65.91   2.41
1KG Phase 1 Only  648,284  10.23   2.36
1KG Estimated Chr20 Ti/Tv: 2.36
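The three counts determine the size of each call set and how much the two overlap. A small sketch of that arithmetic, using only the numbers above (the variable names are illustrative):

```python
# Chr20 SNP counts from the comparison above.
gonl_only = 16_045
shared = 177_389
kg_only = 648_284

gonl_total = gonl_only + shared   # all GoNL pilot calls on chr20
kg_total = kg_only + shared       # all 1KG Phase 1 calls on chr20

gonl_in_kg = shared / gonl_total  # share of GoNL calls also seen in 1KG
kg_in_gonl = shared / kg_total    # share of 1KG calls also seen in GoNL

print(f"GoNL pilot: {gonl_total:,} SNPs, {gonl_in_kg:.1%} also in 1KG")
print(f"1KG Phase 1: {kg_total:,} SNPs, {kg_in_gonl:.1%} also in GoNL pilot")
```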
Next…
• Polish the software ... a lot
  • It's MOLGENIS so anybody can download and customize (ideas anyone?)
  • Integrate the login/security module
  • Providing reports for the ‘end-users’
  • Enabling trend analyses, etc.
• Integrate and run more pipelines for GoNL
  • Structural Variation Group
    • Finalize GoNL SV pipeline
    • Integrate SNP Calling / SV pipelines
  • Imputation Group
    • Phase Pilot data
    • Impute sequence data
    • Estimate gain of GoNL vs HapMap/1KG as imputation panel
Get all as open source:
GoNL - http://www.nlgenome.nl
MOLGENIS - http://www.molgenis.org
Analysis team - http://www.bbmriwiki.nl
• Acknowledgements
• GoNL / MOLGENIS Infrastructure team
  • George Byelas, Martijn Dijkstra, Robert Wagner, Pieter Neerincx, Abhishek Narain, Jan Bot and indirectly GEN2PHEN, EBI, FIMM, ...
• GoNL Analysis team (creating pipelines and tools)
  • Freerk van Dijk (UMCG), Barbera van Schaik (AMC), Ies Nijman (Hubrecht), Slavik Koval (EMC), Laurent Francioli (UU), Kai Ye (LUMC), Jeroen Laros (LUMC), Lennart Karssen (EMC), JoukeJan Hottenga (VU), Mathijs Kattenberg (VU), David van Enckvort (NBIC), Leon Mei (NBIC), Elise van Leeuwen (EMC), … and many, many others
• GoNL Steering group (coordination)
  • Cisca Wijmenga (PI GoNL), Morris Swertz (PI analysis), Gertjan van Ommen (LUMC), Eline Slagboom (LUMC), Jasper Bovenberg (ELSI issues), Cornelia van Duijn (EMC), Dorret Boomsma (VU), Paul de Bakker (co-PI analysis, UU)