Post on 05-Mar-2018
Linux and RNA-Seq read alignment
Brian J. Knaus
USDA Forest Service
Pacific Northwest Research Station
Outline
•Intro to Linux
•Reference types
•Read filtering
•Short read alignment
The Linux operating system
•Many ‘flavors’ of Linux (Ubuntu, Fedora, CentOS, openSUSE, Slackware).
•Frequently includes a GUI (GNOME, KDE).
•Strength is in the shell, a programmer’s OS.
•Permissions.
•Multiple shells (bash, tcsh, ksh).
•Text editors (gedit, vi, emacs).
•Finding help.
Interacting with a server (PC options)
PuTTY: http://www.chiark.greenend.org.uk/~sgtatham/putty/
Xming: http://www.straightrunning.com/XmingNotes/
Shell commands
ls
ls -lh
cd ~
cd ..
pwd
mv
cp
mkdir
df
rm
rmdir
rm -rf # Will delete everything without asking.
cat filename.txt
head filename.txt
less filename.txt
gedit filename.txt &
top
chmod u+x filename.txt
tar -xvzf file.tar.gz
(Google ‘linux cheat sheet’)
Tab completion
history
Finding help with Linux
$ man command
$ info command
Google ‘Linux what you need help on’.
O’Reilly books (http://oreilly.com/).
Reference types
•From a genome project (model organisms).
•De novo or from cDNA.
Are all isoforms present?
How will exon skipping affect inference of regulation?
What’s in a name?
•Bowtie truncates reference names at spaces.
•Some characters don’t mix well with the sequence ontologies.
http://www.sequenceontology.org/resources/gff3.html
Note the difference between sequence ontology and gene ontology.
http://www.geneontology.org/
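Because Bowtie keeps only the part of a reference name before the first space, it can help to normalize FASTA headers before building an index. The sketch below is one way to do that with awk. The function name `clean_fasta_names` and the conservative character set it keeps (letters, digits, `_`, `.`, `-`) are assumptions made for illustration; they are not rules taken from Bowtie or from the GFF3 specification.

```shell
#!/bin/bash
# clean_fasta_names: hypothetical helper that rewrites FASTA header lines.
# Reads FASTA on stdin and writes FASTA on stdout.
clean_fasta_names() {
    awk '
        /^>/ {
            name = substr($0, 2)                 # drop the leading ">"
            sub(/[ \t].*$/, "", name)            # keep text before the first space or tab
            gsub(/[^A-Za-z0-9_.-]/, "_", name)   # assumed "safe" character set
            print ">" name
            next
        }
        { print }                                # sequence lines pass through unchanged
    '
}

# Example: one header with a description, one with awkward characters.
printf '>isotig01880 gene=isogroup00225 length=652\nACGT\n>contig|7;x\nGGCC\n' |
    clean_fasta_names
```

Running the cleaned file through `bowtie-build` then gives index names that match what appears in the alignment output. If two headers collapse to the same name after cleaning, they need to be disambiguated by hand.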
FASTQ file
40-mer sequences
@HWI-EAS121:1:1:0:952#0/1
CGTTNCCACTTCCTCCATCATGTCATCATGTGCGACAGGA
+HWI-EAS121:1:1:0:952#0/1
aab^D\babbbabbbbabbaaaabaabaaa_`aaaaa]PY
@HWI-EAS121:1:1:0:405#0/1
CGTTNTAAAGGTGCACCAGGGATCAAATCAATGGAATGCT
+HWI-EAS121:1:1:0:405#0/1
aa^[DVa^`^_Y`a^a`[\^\Z^aaYZ`a`X__]ZZ_]`_
@HWI-EAS121:1:1:0:724#0/1
CGTTNCATGCCCTTCTTTAATTTTTACACATGGTTCTTCT
+HWI-EAS121:1:1:0:724#0/1
aa`[D^aa`aaaaaaaaa_R`aaaaaaaa`aa`Y`aa``a
@HWI-EAS121:1:1:0:666#0/1
TTGTNAAAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAG
+HWI-EAS121:1:1:0:666#0/1
a`bOD[]R]`a__aT^YX\a`aMXaa[a[_a\HT\_``\[
@HWI-EAS121:1:1:0:1591#0/1
TTGTNCTCACCTATAATTTGACTTTGACATGCTACCTAGC
+HWI-EAS121:1:1:0:1591#0/1
aaaYD[aaa`aaaaWZaaaaa``_aaaa`aa`_V_``Y[a
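A FASTQ record is four lines: name, sequence, a `+` line, and one quality character per base. As a rough sketch of how to check a file before alignment, the hypothetical helper `fastq_summary` below counts reads, reports the shortest and longest read, and reports the lowest and highest ASCII code seen in the quality lines. It assumes every record is exactly four lines, as in the excerpt above.

```shell
#!/bin/bash
# fastq_summary: hypothetical helper; assumes strict four-line FASTQ records.
# Reads the named files, or stdin when no file is given.
fastq_summary() {
    awk '
        BEGIN {
            # Lookup table from printable character to ASCII code.
            for (i = 33; i < 127; i++) ord[sprintf("%c", i)] = i
            qmin = 127; qmax = 0
        }
        NR % 4 == 2 {                      # sequence line
            n++
            len = length($0)
            if (n == 1 || len < lmin) lmin = len
            if (len > lmax) lmax = len
        }
        NR % 4 == 0 {                      # quality line
            for (i = 1; i <= length($0); i++) {
                c = ord[substr($0, i, 1)]
                if (c < qmin) qmin = c
                if (c > qmax) qmax = c
            }
        }
        END {
            printf "reads=%d min_len=%d max_len=%d min_qual_ascii=%d max_qual_ascii=%d\n", n, lmin, lmax, qmin, qmax
        }
    ' "$@"
}

# Example: the first five bases and qualities of the first read above.
printf '@r1\nCGTTN\n+r1\naab^D\n' | fastq_summary
```

The quality range hints at the encoding: codes below 59 can only come from the Phred offset of 33, while a minimum of 64 or more is consistent with the Illumina 1.3+ offset of 64.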
Read filtering
•Adapter dimers.
•FASTQ quality format (Phred, Illumina pre1.3, Illumina post1.3).
http://maq.sourceforge.net/qual.shtml
•Poly(A).
•Non-target organism.
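Two of the filters listed above can be sketched with plain awk. The hypothetical helper `filter_reads` drops any four-line FASTQ record whose sequence contains a given adapter string or a run of at least `min_a` consecutive A bases. The adapter string in the example is an assumption borrowed from the fourth read of the FASTQ excerpt, which appears to contain adapter sequence, and the poly(A) threshold of 10 is arbitrary. Filtering non-target organisms needs an alignment step and is not covered here.

```shell
#!/bin/bash
# filter_reads: hypothetical helper. Usage: filter_reads ADAPTER MIN_A < in.fq
# Keeps only records with no exact adapter match and no poly(A) run of MIN_A.
filter_reads() {
    local adapter=$1 min_a=$2
    awk -v adapter="$adapter" -v min_a="$min_a" '
        BEGIN {
            # Build a string of min_a "A" characters to search for.
            polya = sprintf("%" min_a "s", "")
            gsub(/ /, "A", polya)
        }
        { rec[NR % 4] = $0 }               # buffer the four lines of a record
        NR % 4 == 0 {
            seq = rec[2]
            if (index(seq, adapter) == 0 && index(seq, polya) == 0)
                print rec[1] "\n" rec[2] "\n" rec[3] "\n" rec[0]
        }
    '
}

# Example: r1 is kept, r2 carries the adapter, r3 has ten consecutive A bases.
printf '@r1\nACGTACGT\n+\nIIIIIIII\n@r2\nTTGATCGGAAGAGC\n+\nIIIIIIIIIIIIII\n@r3\nCCAAAAAAAAAAGG\n+\nIIIIIIIIIIIIII\n' |
    filter_reads GATCGGAAGAGC 10
```

Exact string matching misses adapters that contain sequencing errors, so a dedicated trimming tool is the safer choice for real data; the sketch only shows the logic.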
Alignment software
•Bowtie: http://bowtie-bio.sourceforge.net/index.shtml
Persistent index
Heterogeneous read-length
•BWA: http://bio-bwa.sourceforge.net/
Persistent index
Heterogeneous read-length
Gapped alignment
•CASHX: http://jcclab.science.oregonstate.edu/?q=node/view/56095
Non-Smith-Waterman alignment
SAMtools: http://samtools.sourceforge.net/
Manipulate SAM files.
Index creation - Bowtie
mkdir btindex
mv rna_ref.fa btindex
cd btindex
bowtie-build rna_ref.fa rna_ref
bowtie-inspect -s rna_ref
cd ..
Read alignment - Bowtie
mkdir btout
cd btout
bowtie -q -n 2 -S ../btindex/rna_ref ../fastq/sample1.fq > sample1.sam
samcounter.pl -a ../btindex/rna_ref.fa -b sample1.sam
samtools view -b -S sample1.sam > sample1.bam
samtools sort sample1.bam sample1
samtools pileup -f ../btindex/rna_ref.fa sample1.bam > sample-pileup.txt
samtools index sample1.bam
Read alignment - Bowtie
#!/bin/tcsh
set index='../btindex/rna_ref'
set reads='../fastq/sample1.fq'
set samp='sample1'
##### ##### ##### ##### #####
# Main.
bowtie -q -n 2 -S $index $reads > $samp.sam
samcounter.pl -a $index.fa -b $samp.sam
samtools view -b -S $samp.sam > $samp.bam
samtools sort $samp.bam $samp
samtools pileup -f $index.fa $samp.bam > $samp-pileup.txt
samtools index $samp.bam
##### ##### ##### ##### #####
# EOF.
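The script above handles one sample. As a sketch of how the same steps might be repeated for several samples, the hypothetical helper `print_pipeline` below only prints the commands (a dry run), so they can be inspected before being piped to a shell. It is written in bash rather than tcsh, the sample name `sample2` is made up, and the `../fastq/NAME.fq` layout is assumed from the paths in the script.

```shell
#!/bin/bash
# print_pipeline: hypothetical helper. Usage: print_pipeline INDEX SAMPLE...
# Prints, without running them, the per-sample commands from the script above.
print_pipeline() {
    local index=$1
    shift
    local samp
    for samp in "$@"; do
        echo "bowtie -q -n 2 -S $index ../fastq/$samp.fq > $samp.sam"
        echo "samcounter.pl -a $index.fa -b $samp.sam"
        echo "samtools view -b -S $samp.sam > $samp.bam"
        echo "samtools sort $samp.bam $samp"
        echo "samtools pileup -f $index.fa $samp.bam > $samp-pileup.txt"
        echo "samtools index $samp.bam"
    done
}

# Dry run for two samples; append "| bash" to actually execute the commands.
print_pipeline ../btindex/rna_ref sample1 sample2
```

Printing first and executing second keeps a record of exactly what was run, and the same output could be handed line by line to a queue submission wrapper.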
Read alignment - Bowtie
#!/bin/tcsh
set index="../btindex/rna_ref"
set reads="../fastq/sample1.fq"
set samp="sample1"
##### ##### ##### ##### #####
# Main.
easyqsub.pl -a "bowtie -q -n 2 -S $index $reads > $samp.sam"
easyqsub.pl -a "samcounter.pl -a $index.fa -b $samp.sam"
easyqsub.pl -a "samtools view -b -S $samp.sam > $samp.bam"
easyqsub.pl -a "samtools sort $samp.bam $samp"
easyqsub.pl -a "samtools pileup -f $index.fa $samp.bam > $samp-pileup.txt"
easyqsub.pl -a "samtools index $samp.bam"
##### ##### ##### ##### #####
# EOF.
Alignment viewer - SAMtools
samtools tview sample1.bam ../btindex/rna_ref.fa
Index creation - BWA
mkdir bwaindex
cp btindex/rna_ref.fa bwaindex/
cd bwaindex
bwa index -a is -p rna_ref rna_ref.fa
Read alignment - BWA
cd ..
mkdir bwaout
cd bwaout
bwa aln -o 0 ../bwaindex/rna_ref ../fastq/sample1.fq > sample1.sai
bwa samse ../bwaindex/rna_ref sample1.sai ../fastq/sample1.fq > sample1.sam
samcounter.pl -a ../btindex/rna_ref.fa -b sample1.sam
samtools view -b -S sample1.sam > sample1.bam
samtools sort sample1.bam sample1
samtools pileup -f ../btindex/rna_ref.fa sample1.bam > sample-pileup.txt
samtools index sample1.bam
Alignment viewer - SAMtools
samtools tview sample1.bam ../btindex/rna_ref.fa
SAM file format
@HD VN:1.0 SO:sorted
@PG TopHat VN:1.0.13 CL:/local/cluster/bin/tophat -p 4 --solexa1.3-quals ../indexes/psme_ref ../psme_seqs.fq
ILLUMINA-3AB384_0001:6:24:19059:8781#GATT 0 0_54_255 1 255 80M * 0 0 TCTTCTTCATGTTTGGCACGTGTATTCGGGCCTACTTCGCCTTTCCTTCACAGTAGGCGCCTTATCATTATTGGTCAGTT CCCCCCCCCCCCCCCCDCCCCCCCC@CBCBBCCBCCCCCCCCCCCCCCCCCCCDCD@C@CCCC4=CCBCCCCAC>B>BBC NM:i:1
HWI-EAS121_0024_FC61F8DAAXX:7:101:7452:15154#CTGT 0 0_54_255 17 255 76M * 0 0 CACGTGTATTCGGGCCTACTTCGCCTTTCCTTCACAGTAGGCGCCTTGTCATTATTGGTCAGTTATGACCTTAATT GGGGGGGGGGFEGFFGFEEFFBEECEFFFFFGGDGFDDGE:FBBFEGFFD?DEDEFB=DDD=ECCC=EAACDEDC= NM:i:0
@header line1 – file format version
@header line2 – program which created the file
1 Query (read) name
2 flag
3 Reference name
4 Leftmost mapping position
5 Mapping quality
6 CIGAR string
7 Reference name of mate
8 Position of the mate
9 Template length
10 Fragment sequence
11 Fragment quality
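The field list above is enough to pull simple counts straight out of a SAM file. The sketch below tallies mapped reads per reference name using field 2 (flag) and field 3 (reference name). The helper name `count_sam_refs` is hypothetical, and whether its output matches what `samcounter.pl` produces is not something the slides show. It relies on SAM fields being tab-separated and on bit 0x4 of the flag marking an unmapped read.

```shell
#!/bin/bash
# count_sam_refs: hypothetical helper that tallies mapped reads per reference.
# Reads SAM text from the named files, or from stdin when no file is given.
count_sam_refs() {
    awk -F '\t' '
        /^@/ { next }                      # skip header lines
        int($2 / 4) % 2 == 1 { next }      # flag bit 0x4 set: read is unmapped
        { count[$3]++ }                    # field 3 is the reference name
        END { for (ref in count) print ref "\t" count[ref] }
    ' "$@" | sort
}

# Example: two reads mapped to geneA and one unmapped read (flag 4).
printf '@HD\tVN:1.0\nr1\t0\tgeneA\t1\t255\t4M\t*\t0\t0\tACGT\tIIII\nr2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\nr3\t16\tgeneA\t5\t255\t4M\t*\t0\t0\tACGT\tIIII\n' |
    count_sam_refs
```

The integer division stands in for a bitwise AND, which POSIX awk does not provide. Reads that map to more than one location are counted once per alignment line.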
Gene cb_a cb_b yk_a yk_b
isotig18613_gene=isogroup07808_length=677_numContigs=117 18 139 159
isotig01880_gene=isogroup00225_length=652_numContigs=4 11 10 162 56
isotig07160_gene=isogroup01638_length=3698_numContigs=4 31 81 276 226
isotig06362_gene=isogroup01321_length=1396_numContigs=4 32 31 149 91
isotig06005_gene=isogroup01197_length=1204_numContigs=4 52 68 169 198
isotig06363_gene=isogroup01321_length=1470_numContigs=4 21 27 73 100
contig29123_gene=isogroup00629_length=686 30 15 75 161
isotig30058_gene=isogroup19254_length=1101_numContigs=131 36 75 400
contig50604_gene=isogroup01657_length=1247 272 405 1153 724
contig21101_gene=isogroup01657_length=559 47 96 264 165
isotig05419_gene=isogroup01011_length=1938_numContigs=4 32 49 103 126
contig03433_gene=isogroup00629_length=496 21 10 55 71
isotig05877_gene=isogroup01156_length=2570_numContigs=491 70 154 762
Strand specificity
Parkhomchuk et al. 2009. Transcriptome analysis by strand-specific sequencing of complementary DNA. Nucleic Acids Research 37(18):e123.