
Using BioHPC Lab Software - Cornell...

Date post: 15-Jul-2020
Using BioHPC Lab Software Qi Sun Computational Biology Service Unit Cornell University 3CPG
Transcript
Page 1: Using BioHPC Lab Software - Cornell Universitycbsu.tc.cornell.edu/lab/doc/ApplicationsofBioHPCLab.pdf · 2014-10-31 · grmzm2g059865_t03 99290 1 4856 6355 4.18e‐09 2.67e‐10 7.70e‐11

Using BioHPC Lab Software

Qi Sun

Computational Biology Service Unit

Cornell University

3CPG

Page 2:

CBSU/3CPG workshops

Series 1

1. Linux
2. Using the BioHPC lab software
3. BioHPC web site
4. PERL scripting ?

Series 2

1. Genetic variation detection
2. RNAseq

Page 3:

Practice in the BioHPC lab

1. You can download the PowerPoint slides and a Word file with step-by-step instructions for two exercises: http://cbsu.tc.cornell.edu/lab/workshops.aspx

2. The data files used for the exercises are available in /shared_data/cbsuworkshop_2.

Page 4:

What software is available?

Categories:

1. SNP/INDEL detection
2. RNA-seq and ChIP-seq
3. De novo assembly
4. Genotyping-by-sequencing
5. Metagenomics

Reference genome dependent

Reference genome independent

Not ready yet

Page 5:

Commercial software for next-gen data analysis is not mature yet

[Slide diagram labels: Raw data files, Computer software, Excel spreadsheets]

A partial list of commercial software:

• Softgenetics NextGENe
• DNAstar Ngen
• GenomeQuest
• CLC Genomics Workbench

Page 6:

You have more options with open source software

Pipeline for SNP/INDEL calling:

FASTQ file
  Step 1: BWA
SAM file
  Step 2: Samtools view
BAM file
  Step 3: Samtools pileup
VCF/BCF

Page 7:

You have more options with open source software

Pipeline for SNP/INDEL calling:

FASTQ file
  Step 1: BWA
SAM file
  Step 2: Samtools view
BAM file
  Step 3: Samtools pileup
VCF/BCF

Alternative tools shown on the slide: Bowtie, GATK, Picard

Page 8:

Software tools installed on the BioHPC labs

BWA
Bowtie/Tophat/Cufflinks
Samtools
Picard
GATK
Annovar
Velvet
gsAssembler
BLAST
BLAT
…

Page 9:

Software tools installed on the BioHPC labs

BWA
Bowtie/Tophat/Cufflinks
Samtools
Picard
GATK
Velvet
gsAssembler
BLAST
BLAT
…

Each project requires a combination of multiple tools

Page 10:

Commonly Used File Formats

Category             File extension   Reference
Sequence             fasta            http://en.wikipedia.org/wiki/FASTA_format
Sequence             fastq            http://en.wikipedia.org/wiki/FASTQ_format
Alignment            SAM/BAM          http://samtools.sourceforge.net/SAM-1.3.pdf
Sequence variation   VCF/BCF          http://www.1000genomes.org/node/101
Genome annotation    gff/gff3         http://gmod.org/wiki/GFF3
Genome annotation    gtf              http://genome.ucsc.edu/FAQ/FAQformat#format4

Most files that you download from a web site are compressed .gz files. Use the gunzip command to decompress a file, e.g.:

gunzip s_1_sequence.txt.gz
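
Two variations on the gunzip step can be handy when disk space is tight or when you only want a quick look at a file. The sketch below assumes a standard gzip installation; the function names peek_gz and gunzip_keep are hypothetical helpers written for this example, not BioHPC tools.

```shell
# Hypothetical helpers (not part of any toolkit named in the slides).

# Print the first N lines (default 4, i.e. one FASTQ record) of a .gz file
# without writing an uncompressed copy to disk.
peek_gz() {
    gzip -dc "$1" | head -n "${2:-4}"
}

# Decompress FILE.gz to FILE while leaving the original .gz file in place
# (plain "gunzip FILE.gz" removes the compressed file after decompressing).
gunzip_keep() {
    gzip -dc "$1" > "${1%.gz}"
}
```

For example, "peek_gz s_1_sequence.txt.gz" shows the first read, and "gunzip_keep s_1_sequence.txt.gz" creates s_1_sequence.txt next to the compressed file.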

Page 11:

[Slide graphic listing tool and file format names: SAM, BAM, BWA, VCF, BCF, Bowtie, Cufflinks, Tophat, GATK, PERL, Samtools, Bamtools, IGV, VCFtools, Velvet, FASTQ, GFF, BED, GTF, gz, GFF3]

Page 12:

User guide and FAQ for BioHPC lab: http://cbsu.tc.cornell.edu/lab/use.aspx

We welcome suggestions and contributions to our documentation library.  

Page 13:

Designing a pipeline

RNAseq:
1. Splicing junction identification
2. Alignment to reference genome and junctions
3. Quantify reads aligned to each gene
4. Identify novel transcriptionally active regions

SNP/INDEL detection:
1. Alignment to reference genome
2. Filter out PCR duplicates
3. Realignment around INDELs
4. Variation calling
5. Filtering
6. Annotation

Page 14:

Selection of alignment tools

• Gapped or ungapped alignment
  BWA: gapped; Bowtie: ungapped

• Dealing with ambiguous hits
  BWA and Bowtie report ambiguous hits in different ways.

• Alignment to splicing junctions
  Use Tophat (canonical splicing sites only).

Page 15:

CBSU recommended alignment tools

BWA (for genomic sequencing)

Tophat (for RNA-seq)

Page 16:

Get reference genome and annotation files

Use the UCSC site to download the genome fasta files: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/

(Use the "cat chr* > allchr.fa" command to concatenate the individual chromosomes into one file.)

Use the UCSC Table Browser to create the GTF file: http://genome.ucsc.edu/cgi-bin/hgTables?command=start
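
As a sanity check on the concatenation step, it can help to count the sequence headers in the combined file. The following is a minimal sketch; concat_fasta is a hypothetical helper name, and the input files are assumed to be uncompressed fasta files with one ">" header per chromosome.

```shell
# Hypothetical helper: concatenate fasta files into one output file and
# print how many sequences (lines starting with ">") the result contains.
# usage: concat_fasta OUTPUT.fa INPUT1.fa [INPUT2.fa ...]
concat_fasta() {
    out="$1"
    shift
    cat "$@" > "$out" && grep -c '^>' "$out"
}
```

A call such as "concat_fasta allchr.fa chr*.fa" should print the number of chromosome files you downloaded. Note that the shell expands chr* in lexical order (chr1, chr10, chr11, ...), so list the files explicitly if the order of chromosomes in the output matters to you.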

Page 17:

What to do after you receive the data?

Page 18:

What to do after you receive the data?

Fastq file: post QC filter

Qseq file: pre QC filter

Page 19:

Transfer the files to the workstation

• Use WinSCP (win) or FETCH (mac) to move files to the workstation. Decompress the file using the gunzip command:

gunzip s_3_sequence.txt.gz

Page 20:

Check the data quality (optional)

Run fastx_quality_stats tool (fastx toolkit) to get the quality report.

fastx_quality_stats -i s_3_sequence.txt -o stat_report.xls &
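
Before running the full quality report, a quick count of reads and their average length can confirm that the file transferred intact. This is only a rough check, not a replacement for fastx_quality_stats; fastq_summary is a hypothetical helper, and it assumes plain four-line FASTQ records (sequence and quality not wrapped over several lines).

```shell
# Hypothetical helper: report the number of reads and the mean read length
# of a FASTQ file, assuming each record occupies exactly four lines.
# usage: fastq_summary READS.fastq
fastq_summary() {
    awk '
        NR % 4 == 2 { n++; total += length($0) }   # line 2 of each record is the sequence
        END {
            if (n > 0) { printf "reads=%d mean_length=%.1f\n", n, total / n }
            else       { print "reads=0" }
        }
    ' "$1"
}
```

For example, "fastq_summary s_3_sequence.txt" prints one summary line for the lane.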

Page 21:

FASTX output

Page 22:

Decode the multiplexed data file

1. Use the fastx tool kit (fastx_barcode_splitter.pl):

cat s_3_sequence.txt | fastx_barcode_splitter.pl --bcfile mybarcodes.txt --bol --mismatches 0 --prefix mysample

Note: This tool cannot be applied to paired-end reads. If you have paired-end reads, you can use the NovoBarCode tool. It is free for academic users. http://www.novocraft.com/userfiles/file/NovoBarcode.pdf

2. Buckler Lab GBS barcode system:

GBS_barcode.pl yourBarCodeFile yourEnzymeFile

Note: Documentation for GBS_barcode.pl is on the workstation (/shared_data/doc/GBS_barcode.pdf).
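
To illustrate what splitting by a barcode at the beginning of the read with zero mismatches amounts to, the sketch below sorts single-end reads by an exact barcode match at the start of each sequence. It is a simplified stand-in written in awk, not the fastx_barcode_splitter.pl implementation. The function name split_by_barcode and the "unmatched" output name are made up for this example, the barcode file is assumed to hold one "name barcode" pair per line, and the barcode is left on the read.

```shell
# Hypothetical helper (not the fastx toolkit): split a single-end FASTQ file
# by exact barcode match at the start of each read. Reads are appended to
# PREFIX<name>, or to PREFIXunmatched when no barcode matches.
# usage: split_by_barcode BARCODES.txt READS.fastq PREFIX
split_by_barcode() {
    awk -v prefix="$3" '
        NR == FNR { if (NF >= 2) code[$2] = $1; next }   # first file: name, barcode
        { rec[(FNR - 1) % 4] = $0 }                      # collect the 4 lines of a record
        FNR % 4 == 0 {
            out = prefix "unmatched"
            for (b in code) {
                if (index(rec[1], b) == 1) { out = prefix code[b]; break }
            }
            print (rec[0] "\n" rec[1] "\n" rec[2] "\n" rec[3]) >> out
        }
    ' "$1" "$2"
}
```

For example, "split_by_barcode mybarcodes.txt s_3_sequence.txt mysample_" writes one file per barcode name. The sketch assumes a non-empty barcode file, that no barcode is a prefix of another, and four-line FASTQ records.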

Page 23:

Alignment with BWA

1. Reformat and index the genome fasta file.

bwa index -a bwtsw maize.fa &

2. Do alignment

bwa aln -n 2 -t 2 maize.fa s_1_sequence.txt > s_1.sai &

bwa samse -n 5 maize.fa s_1.sai s_1_sequence.txt > s_1.sam &

Manual: http://bio-bwa.sourceforge.net/bwa.shtml#3
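
When several lanes need the same treatment, one option is to generate the commands with a small script and review them before running anything. The sketch below only prints the two alignment commands per lane, using the same options as above; bwa_lane_commands is a hypothetical helper, and the lane names and the "_sequence.txt" suffix are assumptions taken from the example file names. The commands are printed without a trailing "&" because the samse step reads the .sai file written by the aln step, so within a script the two should run one after the other.

```shell
# Hypothetical helper: print (do not run) the bwa commands for each lane.
# usage: bwa_lane_commands GENOME.fa LANE [LANE ...]
bwa_lane_commands() {
    genome="$1"
    shift
    for lane in "$@"; do
        printf 'bwa aln -n 2 -t 2 %s %s_sequence.txt > %s.sai\n' "$genome" "$lane" "$lane"
        printf 'bwa samse -n 5 %s %s.sai %s_sequence.txt > %s.sam\n' "$genome" "$lane" "$lane" "$lane"
    done
}
```

For example, "bwa_lane_commands maize.fa s_1 s_2 > run_bwa.sh" writes four commands to run_bwa.sh, which can then be inspected and started with "sh run_bwa.sh".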

Page 24:

Alignment with Tophat

1. Reformat and index the genome fasta file.

bowtie-build maize.fa maize &

2. Do alignment (with or without annotation)

tophat -p 3 -o s1_guided -G ZmB73_5a_WGS.gtf --no-novel-juncs maize s_1_sequence.txt &

tophat -p 3 -o s1_unguided maize s_1_sequence.txt &

Manual: http://bio-bwa.sourceforge.net/bwa.shtml#3

Page 25:

Convert SAM to BAM

• samtools view: convert SAM to BAM

samtools view -bS -o s_1.bam s_1.sam

• samtools sort: sort by genome coordinates or read names

samtools sort s_1.bam s_1.sorted

• samtools index: index the BAM file

samtools index s_1.sorted.bam

Page 26:

Visualize the sequence alignment

Page 27:

How to use IGV:

1. Copy the .bam and .bai files, as well as the genome fasta and gtf files, to your local computer.

2. Launch the IGV program from http://www.broadinstitute.org/software/igv/download

3. Load the files.

Page 28:

Use Cufflinks to assemble/quantify RNA‐seq

Single sample:

cufflinks -G ZmB73_5a_WGS.gtf s1_tophat.bam

Multiple samples:

cuffdiff ZmB73_5a_WGS.gtf s1_tophat.bam s5_tophat.bam

Page 29:

Cufflinks output

trans_id bundle_id chr left right FPKM FMI frac FPKM_conf_lo FPKM_conf_hi coverage length effective_length status
GRMZM2G060082_T01 99289 1 2 3807 2.05938 0.507199 0.576667 0 5.04574 0.25115 2804 2769 OK
GRMZM2G060082_T02 99289 1 2606 3754 4.0603 1 0.423333 0 8.65942 0.495171 1066 1031 OK
GRMZM2G059865_T01 99290 1 4853 9652 15.6517 1 0.471931 7.73925 23.5641 1.90879 1966 1931 OK
GRMZM2G059865_T03 99290 1 4856 6355 4.18E-09 2.67E-10 7.70E-11 0 0.000129 5.10E-10 1214 1179 OK
GRMZM2G059865_T02 99290 1 4856 9652 14.2274 0.909003 0.528069 6.68358 21.7713 1.7351 2412 2377 OK
GRMZM2G059856_T01 99291 1 9855 10388 0 0 0 0 0 0 533 498 OK
GRMZM5G888250_T01 99291 1 9881 10387 0 0 0 0 0 0 506 471 OK
GRMZM2G059843_T01 99292 1 11454 14988 0 0 0 0 0 0 1788 1788 OK
GRMZM5G866996_T01 99293 1 46227 47746 0 0 0 0 0 0 472 437 OK
GRMZM2G059818_T02 99294 1 50452 54182 0 0 0 0 0 0 3099 3099 OK
GRMZM2G059818_T01 99294 1 50452 56348 0 0 0 0 0 0 4379 4379 OK
GRMZM2G059818_T03 99294 1 52003 52543 0 0 0 0 0 0 540 540 OK
GRMZM2G360269_T01 99295 1 57418 61452 0 0 0 0 0 0 2556 2556 OK
GRMZM2G518629_T01 99296 1 62320 62588 0 0 0 0 0 0 98 98 OK
GRMZM5G811273_T02 99296 1 62501 64014 3.65473 0.943998 0.41328 0 8.63598 0.44571 279 244 OK
GRMZM5G811273_T01 99296 1 62733 64014 3.87154 1 0.58672 0 8.47176 0.472151 362 327 OK
AC177838.2_FGT002 99297 1 70594 71919 0 0 0 0 0 0 633 633 OK
GRMZM2G518627_T01 99298 1 73839 74024 0 0 0 0 0 0 185 185 OK
GRMZM2G059778_T01 99299 1 76119 76752 0 0 0 0 0 0 411 376 OK
GRMZM2G518609_T01 99300 1 90684 90815 0 0 0 0 0 0 131 96 OK
GRMZM2G059745_T01 99301 1 92353 93541 0 0 0 0 0 0 425 390 OK
GRMZM2G093344_T01 99302 1 109518 111769 8.56066 1 1 2.70894 14.4124 1.04401 1012 977 OK
GRMZM2G394757_T01 99302 1 110764 111506 0 0 0 0 0 0 419 419 OK
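
Most transcripts in a table like this have an FPKM of 0, so a first step is often to pull out the expressed ones. The sketch below is one way to do that with awk, assuming a whitespace-delimited file with a single header line and FPKM in the sixth column, as in the output shown here. expressed_transcripts is a hypothetical helper name, and the cutoff is a parameter you choose, not a recommended value.

```shell
# Hypothetical helper: list transcript IDs and FPKM values above a cutoff.
# Assumes one header line and FPKM in column 6 of a whitespace-delimited file.
# usage: expressed_transcripts CUFFLINKS_TABLE [MIN_FPKM]   (default cutoff: 0)
expressed_transcripts() {
    awk -v min="${2:-0}" '
        NR > 1 && ($6 + 0) > (min + 0) { print $1 "\t" $6 }   # "+ 0" forces numeric comparison
    ' "$1"
}
```

Values in scientific notation such as 4.18E-09 are compared numerically, so they pass a cutoff of 0 but not a cutoff of 1.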

Page 30:

Some issues with Cufflinks

• Optimized for human RNAseq data.

• Problems in its model for weighting ambiguously mapped reads and for weighting the contribution of reads to different isoforms.

• In our tests, simple average weighting gives better correlation with the RT-PCR data.

Page 31:

CBSU RNAseq pipeline

Available on both the workstations and the CBSU BioHPC web site.

To use this tool, we will need to create a customized database for you. Premade databases are available for 5 species, including maize and rice.

Page 32:

Use Samtools to call variations

1. Pileup
2. Variation calling
3. Filtering

(example in next slide)

samtools mpileup -uf bwa/maize.fa s1.sorted.bam | bcftools view -bvcg - > s1.raw.bcf

bcftools view s1.raw.bcf | vcfutils.pl varFilter -D100 > s1.vcf

Page 33:

Pileup:

Chromosome 1, position 7080:
Reference: "C"
Reads: 10 reads covered this position: 8 "T", 2 "C"

Possible interpretations:
1. PCR or sequencing error
2. Systematic machine error
3. Heterozygous
4. Homologous region
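
One number that helps when weighing these interpretations is the fraction of reads carrying the non-reference base. For the position above it is 8 / 10 = 0.80. A minimal sketch of the arithmetic follows; alt_fraction is a hypothetical helper, and it deliberately applies no threshold, since what counts as a sequencing error or a heterozygous site depends on your data.

```shell
# Hypothetical helper: fraction of reads that carry the non-reference base.
# usage: alt_fraction ALT_READ_COUNT TOTAL_DEPTH
alt_fraction() {
    awk -v alt="$1" -v depth="$2" 'BEGIN {
        if (depth > 0) { printf "%.2f\n", alt / depth }
        else           { print "NA" }    # no reads cover the position
    }'
}
```

For the example position, "alt_fraction 8 10" prints 0.80.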

Page 34:

Functional annotation of the SNPs

Page 35:

In the coming months …

• Using Picard/GATK to replace Samtools

• De‐novo assembly

• Genotyping‐by‐sequencing

