Date posted: 27-Jan-2015
Uploaded by: andrew-stewart
Do It Yourself Annotator
An annotation pipeline for every genomics lab
Andrew Stewart *, Timothy Read
●Genomics Department, Biological Defense Research
Directorate, Naval Medical Research Center,
Rockville, Maryland, United States
●Distribution and source code are available at
https://sourceforge.net/projects/diyg/
●Contact: [email protected]
●DIYA is an open source pipeline for the rapid annotation of genomic sequences.
The software is designed to use as input DNA contigs, either in the form of
complete genomes or the result of shotgun sequencing of a genome library, and
produce as output a fully annotated sequence.
●The DIYA pipeline is modular in nature, and easily expandable to include further
forms of feature finding. Each module follows a similar structure, using for input
and output a standard format as a conduit between stages in the pipeline. The
usefulness of BioPerl (http://bioperl.org) as a format conversion utility and parser
is demonstrated in this system. SGE support allows running multiple sequences
in parallel.
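The modular structure described above can be illustrated with a minimal sketch. It is written in Python for brevity and is not DIYA code (DIYA itself is Perl and uses BioPerl). The `Record` and `Feature` classes, the two toy stages, and `run_pipeline` are all hypothetical names; `Record` stands in for whatever standard format is passed between stages.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Feature:
    kind: str    # e.g. "CDS", "tRNA"
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    source: str  # name of the stage that produced the feature


@dataclass
class Record:
    """Hypothetical stand-in for the standard format shared by all stages."""
    seq_id: str
    sequence: str
    features: List[Feature] = field(default_factory=list)


# Every stage has the same shape: record in, record out.
Stage = Callable[[Record], Record]


def length_stage(record: Record) -> Record:
    # Toy stage: add one feature spanning the whole sequence.
    record.features.append(
        Feature("source", 1, len(record.sequence), "length_stage"))
    return record


def start_codon_stage(record: Record) -> Record:
    # Toy feature finder: mark every ATG on the forward strand.
    # A real stage would run an external tool and parse its output.
    pos = record.sequence.find("ATG")
    while pos != -1:
        record.features.append(
            Feature("start_codon", pos + 1, pos + 3, "start_codon_stage"))
        pos = record.sequence.find("ATG", pos + 1)
    return record


def run_pipeline(record: Record, stages: List[Stage]) -> Record:
    # Stages are interchangeable because they share one input/output format.
    for stage in stages:
        record = stage(record)
    return record
```

Under this design, adding a new kind of feature finding means writing one more function with the same signature and appending it to the stage list.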
Background
●“A sequencing center in every genomics lab”
●Thus, an annotation pipeline in every genomics lab
●Decentralization of sequencing technology creates a
need for sequence analysis tools
Background
●Explosion of tools in the bioinformatics community
●Inconsistent formats, need for ‘pipelining’, BioPerl
Background: BDRD
●454 Life Sciences FLX sequencers
●Push data off onto servers
○Assembly
○Annotation
○Analysis
Outline of the pipeline
●diya-assemble-pseudocontig
●diya-glimmer
●diya-blast
●diya-rfam_scan
●diya-tRNAscan
●Auxiliary scripts
Installation requirements
●Software
○Perl v5+, SGE, MUMmer, Glimmer, BLAST, tRNAscan-SE, Infernal, rfamscan.pl
●Databases
○Protein Clusters, Rfam
●Perl libraries
○BioPerl, Getopt::Long, Data::Dumper, XML::Simple, etc.
Pipeline: diya.pl
●Controller script for the pipeline
●Manages configuration and project data table
generation
●Fires off jobs to SGE
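How a controller might hand stages to SGE can be sketched as follows. This is an illustration in Python, not the logic of diya.pl. The function `build_qsub_commands` and the stage command strings are hypothetical; the `qsub` options used (`-N`, `-cwd`, `-b y`, `-hold_jid`) are standard SGE options. The sketch only builds command lines and does not submit anything.

```python
import shlex
from typing import List, Tuple


def build_qsub_commands(project: str,
                        stages: List[Tuple[str, str]]) -> List[str]:
    """Build one qsub command line per stage for a single project.

    Each job is named after the project and stage, and each job after the
    first waits (-hold_jid) for the previous job to finish, so the stages
    of one project run in order while separate projects run in parallel.
    """
    commands = []
    previous_job = None
    for stage_name, stage_command in stages:
        job_name = f"{project}_{stage_name}"
        args = ["qsub", "-N", job_name, "-cwd", "-b", "y"]
        if previous_job is not None:
            args += ["-hold_jid", previous_job]
        args += shlex.split(stage_command)
        commands.append(" ".join(shlex.quote(a) for a in args))
        previous_job = job_name
    return commands


if __name__ == "__main__":
    # The stage command lines here are placeholders, not real DIYA usage.
    for line in build_qsub_commands(
            "genomeA",
            [("glimmer", "diya-glimmer genomeA"),
             ("blast", "diya-blast genomeA")]):
        print(line)
```

Calling the function once per input sequence gives each sequence its own independent chain of jobs, which is one way to get the parallelism that SGE support allows.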
Pipeline: Assembly
●Generate a ‘pseudocontig’
●MUMmer v3.20 (http://mummer.sourceforge.net/)
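The slides do not say how the pseudocontig is built. A minimal sketch, assuming it is a concatenation of contigs (already placed in order, for example from MUMmer alignments against a reference) joined by a spacer of N characters, might look like this in Python. The function names, the spacer length, and the layout table are all assumptions for illustration.

```python
from typing import List, Optional, Tuple

SPACER = "N" * 10  # assumed spacer between contigs; the length is arbitrary

Layout = List[Tuple[str, int, int]]  # (contig_id, start, end), 1-based inclusive


def build_pseudocontig(contigs: List[Tuple[str, str]],
                       spacer: str = SPACER) -> Tuple[str, Layout]:
    """Join ordered (id, sequence) contigs and record where each one lands."""
    pieces = []
    layout: Layout = []
    cursor = 0  # number of bases written so far
    for index, (contig_id, seq) in enumerate(contigs):
        if index > 0:
            pieces.append(spacer)
            cursor += len(spacer)
        layout.append((contig_id, cursor + 1, cursor + len(seq)))
        pieces.append(seq)
        cursor += len(seq)
    return "".join(pieces), layout


def locate(position: int, layout: Layout) -> Optional[Tuple[str, int]]:
    """Map a pseudocontig position back to (contig_id, position in contig).

    Returns None when the position falls inside a spacer.
    """
    for contig_id, start, end in layout:
        if start <= position <= end:
            return contig_id, position - start + 1
    return None
```

Keeping the layout table is what makes the later disassembly step possible: feature coordinates found on the pseudocontig can be translated back to the original contigs with `locate`.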
Pipeline: Glimmer
●Prediction of gene coding regions
●Glimmer v3.02 (http://www.cbcb.umd.edu/software/glimmer/)
○g3-iterated.csh - two rounds of iteration
●Uses interpolated Markov models to distinguish
between coding and non-coding regions
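The idea of scoring sequence with Markov models can be shown with a deliberately simplified sketch. It uses a single fixed-order Markov model per class with add-one smoothing, whereas an interpolated Markov model combines several context lengths; it is not Glimmer's algorithm. The class and function names are hypothetical.

```python
import math
from collections import defaultdict
from typing import Iterable


class MarkovModel:
    """Fixed-order Markov model over the DNA alphabet (A, C, G, T)."""

    def __init__(self, order: int, training: Iterable[str]):
        self.order = order
        # counts[context][next_base] = number of times seen in training
        self.counts = defaultdict(lambda: defaultdict(int))
        for seq in training:
            for i in range(order, len(seq)):
                self.counts[seq[i - order:i]][seq[i]] += 1

    def log_prob(self, seq: str) -> float:
        """Log-probability of seq given its first `order` bases."""
        total = 0.0
        for i in range(self.order, len(seq)):
            context = self.counts.get(seq[i - self.order:i], {})
            seen = sum(context.values())
            # Add-one smoothing over the four possible next bases.
            total += math.log((context.get(seq[i], 0) + 1) / (seen + 4))
        return total


def coding_log_odds(seq: str, coding: MarkovModel,
                    noncoding: MarkovModel) -> float:
    # Positive values mean the coding model explains seq better.
    return coding.log_prob(seq) - noncoding.log_prob(seq)
```

A region is then called coding when its log-odds score is positive (or above some chosen threshold); the training sequences for each model would come from known coding and non-coding regions.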
Pipeline: Blast
●BLAST v2.2.16 (ftp://ftp.ncbi.nih.gov/blast/)
●Two rounds of BLAST, against:
○Reference genome
○Protein Clusters database
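The slides do not describe how the two rounds are combined. One plausible arrangement, sketched here in Python as an assumption, is to take the best hit per gene from each round and prefer the reference genome hit when both exist. The input is assumed to be BLAST tabular output (12 tab-separated columns, with e-value in column 11 and bit score in column 12); the function names and the e-value cutoff are hypothetical.

```python
from typing import Dict, Iterable, Tuple

Hits = Dict[str, Tuple[str, float]]  # query -> (subject, bit score)


def best_hits(lines: Iterable[str], max_evalue: float = 1e-5) -> Hits:
    """Keep the highest-scoring hit per query from tabular BLAST output."""
    best: Hits = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        cols = line.rstrip("\n").split("\t")
        query, subject = cols[0], cols[1]
        evalue, bits = float(cols[10]), float(cols[11])
        if evalue > max_evalue:
            continue
        if query not in best or bits > best[query][1]:
            best[query] = (subject, bits)
    return best


def merge_rounds(reference_hits: Hits,
                 cluster_hits: Hits) -> Dict[str, Tuple[str, str]]:
    """Combine both rounds; a reference hit overrides a cluster hit."""
    merged = {q: ("protein_clusters", s) for q, (s, _) in cluster_hits.items()}
    merged.update({q: ("reference", s) for q, (s, _) in reference_hits.items()})
    return merged
```

Each gene in the merged result carries the name of the round that supplied its annotation, so the source of every assignment stays traceable.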
Pipeline: rfam_scan
●Identification of ncRNA (rRNA, tRNA)
●Infernal v0.81 (http://infernal.janelia.org/)
●Rfam (http://www.sanger.ac.uk/Software/Rfam/)
●rfamscan.pl v0.1 (http://www.sanger.ac.uk/Users/sgj/code/)
Pipeline: tRNAscan-SE
●Identification of tRNA
●tRNAscan-SE v1.23 (http://lowelab.ucsc.edu/tRNAscan-SE/)
Pipeline: Auxiliary scripts
●Locus tag reordering (cleanup)
●Protein extraction (i.e., PIPA input)
●Pseudocontig disassembly
●Hooks
○Load databases
○Report genome statistics
○WikiLIMS integration
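As an illustration of the locus tag cleanup step, the sketch below renumbers features in order of genomic position. It is a Python sketch under assumptions, not the DIYA script: the tag format (`PREFIX_0001`), the dictionary keys, and the function name are all hypothetical.

```python
from typing import Dict, List


def reorder_locus_tags(features: List[Dict], prefix: str,
                       width: int = 4) -> List[Dict]:
    """Return copies of the features, sorted by position and retagged.

    Features found by different stages arrive in arbitrary order, so tags
    are reassigned sequentially after sorting by start (then end).
    """
    ordered = sorted(features, key=lambda f: (f["start"], f["end"]))
    return [
        dict(feature, locus_tag=f"{prefix}_{number:0{width}d}")
        for number, feature in enumerate(ordered, start=1)
    ]
```

The input list is left unmodified; each returned feature is a copy carrying its new tag.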
Modularity
●Adding extra modules is rather simple
●Things to come...
○CRISPR elements
○pseudogenes
○prophages
Do It Yourself Genomics
●A project community and collection of bioinformatics
tools and applications for the analysis of genomic
sequence data, with the intent of bringing these tools
into the hands of medium to small scale sequencing
labs.
DIYG on disk
●OS (Linux) distribution with DIYG pre-installed
●Simplifies process of installation, compilation,
‘prerequisite gathering’
●Run analysis directly on sequencer workstation?
●Easy deployment across a high performance
computing cluster
DIYG: Virtual Machine
●Virtualization creates a complete, self-contained
deployment of an operating system
●“Disposable” analysis machine
DIYG: Cloud Computing
●Ideal for labs without direct access to an HPC cluster
●Truly an annotation pipeline in every genomics lab
Deployment at BHSAI
●Make sequence annotation available to wider DoD
community
●Concerns about the Perl-based nature of DIYA
●Need to determine HPC guidelines
●Possible integration / hook into PIPA
Deployment at BHSAI
●Conventional installation (integration into existing
systems, à la PIPA)
●Sourced from disk image
●Virtualization servers? (if available)