Part # 1005020, Rev. A
May 2008
Using Pipeline Output Data for Whole Genome Alignment
FOR RESEARCH ONLY
Topics
Introduction
    Pipeline
    Maq
    GBrowse
    Hardware Requirements
    Workflow
Preparing to Run Maq
    UNIX/Linux Environment
    Testing PERL
    Installing Maq
    Getting Reference Sequences
    Reference Genome with Multiple Chromosomes
Output File from Pipeline
    Required Pipeline Output File
    Format of Sequence.txt File
    Quality Values
Getting Consensus, Identifying SNPs and Indels
    Building Consensus
    Extracting Consensus Information
    SNP Calling
    Indel Discovery
Viewing SNPs and Indels with GBrowse
    GBrowse
    Reformatting Data
    Using GBrowse
Appendix A: Installing Maq Yourself
Appendix B: Quality Value Tables
    Illumina Symbolic ASCII Quality Values
    Sanger Symbolic ASCII Quality Values
This publication and its contents are proprietary to Illumina, Inc., and are intended solely for the contractual use of its customers and for no other purpose than to operate the system described herein. This publication and its contents shall not be used or distributed for any other purpose and/or otherwise communicated, disclosed, or reproduced in any way whatsoever without the prior written consent of Illumina, Inc.
For the proper operation of this system and/or all parts thereof, the instructions in this guide must be strictly and explicitly followed by experienced personnel. All of the contents of this guide must be fully read and understood prior to operating the system or any of the parts thereof.
FAILURE TO COMPLETELY READ AND FULLY UNDERSTAND AND FOLLOW ALL OF THE CONTENTS OF THIS GUIDE PRIOR TO OPERATING THIS SYSTEM, OR PARTS THEREOF, MAY RESULT IN DAMAGE TO THE EQUIPMENT, OR PARTS THEREOF, AND INJURY TO ANY PERSONS OPERATING THE SAME.
Illumina, Inc. does not assume any liability arising out of the application or use of any products, component parts, or software described herein. Illumina, Inc. further does not convey any license under its patent, trademark, copyright, or common-law rights nor the similar rights of others. Illumina, Inc. further reserves the right to make any changes in any processes, products, or parts thereof, described herein without notice. While every effort has been made to make this guide as complete and accurate as possible as of the publication date, no warranty or fitness is implied, nor does Illumina accept any liability for damages resulting from the information contained in this guide.
© 2008 Illumina, Inc. All rights reserved. Illumina, Solexa, Making Sense Out of Life, Oligator, Sentrix, GoldenGate, DASL, BeadArray, Array of Arrays, Infinium, BeadXpress, VeraCode, IntelliHyb, iSelect, CSPro, iScan, and GenomeStudio are registered trademarks or trademarks of Illumina. All other brands and names contained herein are the property of their respective owners.
Introduction
The Genome Analyzer can generate several Gb of data a week. Converting these huge amounts of sequence data into usable information requires fast and efficient downstream analysis. This document describes how to align Genome Analyzer Pipeline sequence data to a known genome using the Mapping and Assembly with Quality (Maq) application. Results can then be assessed by opening the output files, or imported into a GBrowse implementation to view them in the genomic context.
The key sections of this guide are:
• Preparing to Run Maq on page 6: Gives information on installing Maq.
• Output File from Pipeline on page 9: Describes the fields in the relevant Pipeline files and the various metrics.
• Getting Consensus, Identifying SNPs and Indels on page 11: Explains how to get a consensus sequence, SNPs and indels from Maq.
• Viewing SNPs and Indels with GBrowse on page 18: Explains how to use GBrowse to view SNPs and indels.
Pipeline
The Genome Analyzer Pipeline software is a highly customizable workflow engine capable of taking the raw image data generated by the Genome Analyzer and producing intensity scores, base calls, quality metrics, and quality scored alignments. This software is the result of extensive collaborations with many of the world’s leading sequencing centers.
Maq
Maq is a third party open source software tool that builds mapping assemblies from short reads generated by next-generation sequencing machines. Maq was developed specifically for the Genome Analyzer by Heng Li and Richard Durbin from the Sanger Institute. Maq runs on UNIX/Linux, so you will need a computer that uses Linux or UNIX as the operating system.
GBrowse
GBrowse is an open source genome viewer, generated as part of the Generic Model Organism Database project (GMOD). Many genome centers and universities have implemented GBrowse to enable you to view their genomic data.
Hardware Requirements
At minimum, you will need 1 GB of memory. This should be enough to map 2 million reads to a bacterial genome, though 4 GB is preferable. For mammalian-sized genome alignments, you will need to map many batches of about 2 million reads, and you will be better served with 16 GB of memory.
NOTE
This guide does not explain how to use Pipeline, and only provides limited information for the use of Maq and GBrowse. The main goal is to provide a path to efficiently use Pipeline output for whole genome alignment.
Pipeline to Maq to GBrowse
Workflow
The workflow for generating consensus, SNPs and indels is illustrated in Figure 1.
Figure 1 Workflow Generating Consensus, SNPs and Indels
Preparing to Run Maq
Before you can install Maq, there are a number of requirements you need to fulfill. This section lists these requirements and gives some options for meeting them.
UNIX/Linux Environment
You need to install Maq in an environment that runs on UNIX or Linux (a version of UNIX).
Workstation
Your best option is to run Maq on a dedicated UNIX or Linux workstation. See if you can find such a workstation in your department where you can install and run Maq.
You may need to install Linux on a computer from scratch. Talk to your IT department to see what is required, and whether they can help.
Linux Distributions
If you do not have access to a workstation running UNIX/Linux and you need to install Linux, there are many different distributions of Linux available, paid or free. Good choices are Red Hat Linux (paid) and Fedora Linux (free), but others should work too. Use the documentation provided with your Linux distribution for installation.
Testing PERL
Maq uses a number of scripts that are written in the programming language Perl. Many UNIX/Linux distributions already have Perl installed, so first check whether Perl is installed in your UNIX/Linux environment:
1. Go to your UNIX/Linux environment.
2. At the command prompt, enter:
perl -v
3. Evaluate whether you have Perl installed:
• If Perl has been installed, you will get a message stating the version of Perl, copyright and other information. Continue with the section Installing Maq.
• If Perl is not installed yet, you will get a message like this:
perl: command not found
If Perl is not installed yet, go to www.activestate.com and install the most recent fully released version of Perl for Linux and your hardware configuration.
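The same check can be scripted, which is convenient when several prerequisites have to be verified on a new workstation. The following is a minimal Python sketch, not part of Maq or Pipeline; the function name tool_report is hypothetical, and it assumes Python 3.7 or later is available.

```python
import shutil
import subprocess


def tool_report(command, version_flag="-v"):
    """Return the version message printed by a tool, or None if it is not on the PATH."""
    path = shutil.which(command)
    if path is None:
        # Equivalent to the shell reporting "command not found".
        return None
    result = subprocess.run([path, version_flag], capture_output=True, text=True)
    # Some tools print their version message to stderr instead of stdout.
    return (result.stdout or result.stderr).strip()
```

Calling tool_report("perl") returns the Perl version message when Perl is installed, and None otherwise.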
Installing Maq
When your Linux environment is set up, ask your IT department to install Maq. The download is available from maq.sourceforge.net (Figure 2). We used Maq versions 0.6.5 and 0.6.6 to test the application.
Figure 2 Maq Home Page
Getting Reference Sequences
You need to download a reference genome for the organism you sequenced, so that you can compare your reads to it. Many reference genomes are available from the NCBI website.
1. Open your browser and navigate to www.ncbi.nlm.nih.gov.
2. Click on the link Genomic Biology in the left navigation bar.
3. Browse to your species under Genome Projects Database in the right navigation bar.
4. Navigate to or search for the species you are looking for, and click on Project data | Genomic.
5. Download the genomic files in fasta format (*.fasta, *.fa or *.fna). Download each chromosome of your organism.
6. Make sure to keep track of the exact build of the genome you are using. You can find this in the genbank file, in the Comments section.
NOTE
If you have to install Maq yourself, refer to Appendix A: Installing Maq Yourself on page 25.
[Figure 2 callouts: Download page; Maq FAQ; Maq User’s Manual; Maq Reference Manual; Maq Wiki]
NOTE
Another good source for reference genomes is UCSC (hgdownload.cse.ucsc.edu).
Reference Genome with Multiple Chromosomes
If you use a reference genome with multiple chromosomes, you may only find them as one fasta file per chromosome. You will need to combine these fasta files into one file for the reference genome; otherwise your alignment scores may be affected. Perform the following:
1. Open the command line (Terminal) in Linux.
2. Go to the directory containing the downloaded reference genome files using the cd command.
3. Enter the following:
cat chr1.fa chr2.fa chr3.fa >ref.fa
where:
• chr1.fa, chr2.fa and chr3.fa are the fasta input files.
• ref.fa is the fasta reference genome output file.
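If you prefer to script this step, the concatenation can be sketched in Python as follows. This is an illustrative equivalent of the cat command above, with one addition that is our own assumption rather than something cat does: it inserts a newline when an input file does not end with one, so that the next chromosome header cannot be glued onto the previous sequence line. The function name combine_fasta is hypothetical.

```python
def combine_fasta(chromosome_files, output_file):
    """Concatenate per-chromosome FASTA files into a single reference FASTA file."""
    with open(output_file, "w") as out:
        for name in chromosome_files:
            with open(name) as source:
                last_line = "\n"
                for line in source:
                    out.write(line)
                    last_line = line
                # Make sure the next file's header starts on its own line.
                if not last_line.endswith("\n"):
                    out.write("\n")
    return output_file
```

For example, combine_fasta(["chr1.fa", "chr2.fa", "chr3.fa"], "ref.fa") writes the three files to ref.fa in the order given.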
Output File from Pipeline
After you have called the bases in Pipeline, Pipeline saves files containing the sequence information. This section specifies which file you need from Pipeline for alignment in Maq, and explains the different elements in this file.
Required Pipeline Output File
The Pipeline output file you should use for alignment in Maq has the following naming scheme:
s_N_R_sequence.txt (for paired-end sequence files)
or
s_N_sequence.txt (for single-read sequence files)
where:
• N stands for the lane.
• R stands for the read, in case of paired-end sequencing.
An example file name is s_3_2_sequence.txt; this file contains information from read 2 of lane 3.
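When many lanes have to be processed, the lane and read numbers can be recovered from the file names. The sketch below is an illustration only; it assumes the two naming schemes shown above and nothing else, and the function name parse_sequence_filename is hypothetical.

```python
import re

# s_N_R_sequence.txt (paired-end) or s_N_sequence.txt (single-read)
SEQUENCE_FILE = re.compile(r"^s_(\d+)(?:_(\d+))?_sequence\.txt$")


def parse_sequence_filename(filename):
    """Return (lane, read); read is None for single-read files."""
    match = SEQUENCE_FILE.match(filename)
    if match is None:
        raise ValueError("not a Pipeline sequence file name: " + filename)
    lane = int(match.group(1))
    read = int(match.group(2)) if match.group(2) is not None else None
    return lane, read
```

With this sketch, s_3_2_sequence.txt yields lane 3, read 2, and a single-read name such as s_5_sequence.txt yields lane 5 with no read number.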
Format of Sequence.txt File
The s_N_R_sequence.txt file contains sequence and quality information for one read from one sequencing lane. The files are in FASTQ format.
An example of an entry for one read is shown below:
@SLXA-B3_604:2:1:512:767/1
GCCTAACCTTTCTGAACCTCATGCGGAAAAACTGTTT
+SLXA-B3_604:2:1:512:767/1
ccccccccccccchKhcchcU`]`LPVRTINKSNLAA
Every entry contains the following lines:
Read Identifier: The line @SLXA-B3_604:2:1:512:767/1 contains the read identifier, which has the following elements:
Description                                Element
Abbreviated run name                       SLXA-B3_604
Lane                                       2
Tile                                       1
Coordinates of the cluster on tile         512,767
Indicates the read of a paired end run     /1
The read identifier line starts with an '@', which indicates this line is going to be followed by a sequence line.
Sequence: The line GCCTAACCTTTCTGAACCTCATGCGGAAAAACTGTTT contains the called sequence for this entry.
Read Identifier: The line +SLXA-B3_604:2:1:512:767/1 contains the same read identifier as above, but this time the line starts with a '+', which indicates it is going to be followed by a quality score line.
Quality scores: The line ccccccccccccchKhcchcU`]`LPVRTINKSNLAA contains the quality scores for this entry. Every base call in an entry has a corresponding quality score, i.e., the nth position in the quality scores line corresponds to the nth nucleotide in the sequence line.
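The four-line structure described above can be read with a few lines of code. The following Python sketch is illustrative and makes some assumptions: the function names are hypothetical, the identifier is assumed to have exactly the five colon-separated elements listed in the table, and the two coordinate values are simply kept as a pair.

```python
def parse_read_identifier(line):
    """Split an identifier line such as '@SLXA-B3_604:2:1:512:767/1' into its elements."""
    body = line.strip()[1:]  # drop the leading '@' or '+'
    name, _, read = body.partition("/")
    run, lane, tile, first, second = name.split(":")
    return {
        "run": run,
        "lane": int(lane),
        "tile": int(tile),
        "coordinates": (int(first), int(second)),
        "read": int(read) if read else None,  # None when no '/N' suffix is present
    }


def parse_fastq_entry(lines):
    """Turn the four lines of one entry into (identifier, sequence, quality string)."""
    header, sequence, plus, quality = [line.rstrip("\n") for line in lines]
    if not header.startswith("@") or not plus.startswith("+"):
        raise ValueError("entry does not follow the four-line layout")
    if len(sequence) != len(quality):
        raise ValueError("every base call needs exactly one quality character")
    return parse_read_identifier(header), sequence, quality
```

The length check reflects the rule that the nth quality character belongs to the nth nucleotide.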
Quality Values
The quality scores are in Illumina symbolic ASCII format, according to the following formula:
Quality value = (ASCII character code) - 64.
The values of the characters in the Illumina symbolic ASCII format are listed in the Appendix, section Illumina Symbolic ASCII Quality Values on page 26.
For a single basecall, a Q value of 30 is great, Q20 is a good score, while Q10 is still usable.
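The formula above translates directly into code. This is a minimal Python sketch of the decoding step only; the function name illumina_qualities is hypothetical.

```python
ILLUMINA_OFFSET = 64  # Quality value = (ASCII character code) - 64


def illumina_qualities(quality_line):
    """Decode an Illumina symbolic ASCII quality line into a list of quality values."""
    return [ord(character) - ILLUMINA_OFFSET for character in quality_line]
```

For example, the character h decodes to 40 and the character ; decodes to -5, in agreement with the table in Appendix B.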
Difference of Illumina and Phred Scoring Scheme
The Illumina quality scoring scheme and the Phred quality scoring scheme are different:
Illumina: 10 x log10((1-e)/e)
Phred: -10 x log10(e)
where: e=error probability.
The two definitions round to the same value from approximately Q15 and above; however, Illumina scores can go as low as -5, whereas Phred scores cannot be negative.
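The relation between the two schemes can be checked numerically. The sketch below implements the two formulas as given, plus a conversion that follows from them by eliminating e: since (1-e)/e = 10^(Q_illumina/10), it follows that 1/e = 1 + 10^(Q_illumina/10), and therefore Q_phred = 10 x log10(1 + 10^(Q_illumina/10)). The function names are hypothetical, and this is not Maq's own conversion code.

```python
import math


def illumina_quality(e):
    """Illumina scheme: 10 x log10((1-e)/e), for an error probability 0 < e < 1."""
    return 10 * math.log10((1 - e) / e)


def phred_quality(e):
    """Phred scheme: -10 x log10(e)."""
    return -10 * math.log10(e)


def illumina_to_phred(q_illumina):
    """Convert an Illumina quality value to the Phred scale (derived from the two formulas)."""
    return 10 * math.log10(1 + 10 ** (q_illumina / 10))
```

At e = 0.5 the Illumina value is 0 while the Phred value is about 3, which illustrates why the scales diverge for low-quality base calls and converge for high-quality ones.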
Difference of Illumina and Sanger FASTQ
The Sanger FASTQ format, which is used by Maq, differs slightly from the Illumina FASTQ format. The main difference is that the quality of the base calls is scored using different scales (Illumina versus Phred quality scores). Maq comes with tools to convert Illumina FASTQ (also often called Solexa FASTQ) to Sanger FASTQ; see Preparing to Run Maq on page 6 and the Maq documentation for more information.
Getting Consensus, Identifying SNPs and Indels
Maq aligns your sequence reads to a reference sequence, builds a consensus and calls single nucleotide polymorphisms (SNPs), and can identify insertion/deletions (indels) if you have performed paired-end sequencing. This section explains briefly how to perform these actions, and what output files you will get when you call SNPs and identify indels.
A lot of this information has been summarized from the Maq user’s manual and the Maq reference manual, available at maq.sourceforge.net (see Figure 2). For more detailed instructions and comprehensive descriptions of the commands in Maq, see these documents; additional information is present in the FAQ section and in the Maq Wiki.
Generating Analysis Folder
You need to generate a folder in which you run the analysis. Copy the following files to this folder:
• Read files (Illumina FASTQ format).
• Reference sequence file (FASTA format).
All output files that Maq generates will be stored in this folder (unless you specifically direct Maq to another folder).
Building Consensus
The first thing you need to do is align the reads to the reference, and build a consensus. This is described in this section.
Converting Illumina FASTQ to Sanger FASTQ
As described in Quality Values on page 10, the FASTQ format used by Maq is different from the Illumina FASTQ format. To use Maq, you need to first convert the format for all read files by entering:
maq sol2sanger s_N_R_sequence.txt s_N_R_sequence.fastq
where:
• s_N_R_sequence.txt is the Illumina read sequence file.
• s_N_R_sequence.fastq is the output file in Sanger FASTQ.
Converting Sanger FASTQ to BFQ
Next you need to convert Sanger FASTQ to binary FASTQ (bfq) for all read files by entering:
maq fastq2bfq s_N_R_sequence.fastq s_N_R_sequence.bfq
where:
• s_N_R_sequence.fastq is the Sanger FASTQ read sequence file.
• s_N_R_sequence.bfq is the output file in binary FASTQ.
NOTE
For small sequencing projects (1 lane of sequence data from a procaryote), many of these steps can be combined as a batch using the easyrun command. See the Maq user’s manual for information.
Converting Reference FASTA to BFA
Next you need to convert FASTA to binary FASTA (bfa) for the reference sequence by entering:
maq fasta2bfa ref.fasta ref.bfa
where:
• ref.fasta is the FASTA reference sequence file.
• ref.bfa is the output reference file in binary FASTA.
Aligning Reads to Reference
For single-read sequencing, you align the reads from one file to the reference sequence by entering:
maq map s_N_sequence.map ref.bfa s_N_sequence.bfq
For paired-end sequencing, you align the reads from two matching paired-end files to the reference sequence by entering:
maq map s_N_sequence.map ref.bfa s_N_1_sequence.bfq s_N_2_sequence.bfq
where:
• s_N_sequence.map is the mapped alignment output file.
• ref.bfa is the reference file in binary FASTA.
• s_N_sequence.bfq is the single-read file in binary FASTQ.
• s_N_1_sequence.bfq is the paired-end first read file in binary FASTQ.
• s_N_2_sequence.bfq is the paired-end second read file in binary FASTQ.
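The conversion and alignment steps above always follow the same order, so the command lines for a lane can be generated rather than typed. The following Python sketch only builds the commands described in this section (fasta2bfa, sol2sanger, fastq2bfq and map); it does not run them. The function name maq_commands_for_lane and its parameters are hypothetical, and the file names follow the naming scheme used in this guide.

```python
def maq_commands_for_lane(lane, paired_end=False, reference_fasta="ref.fasta",
                          convert_reference=True):
    """Return the Maq command lines for one lane, in the order they should be run."""
    reference_bfa = "ref.bfa"
    commands = []
    if convert_reference:
        # The reference only needs to be converted once per project.
        commands.append(["maq", "fasta2bfa", reference_fasta, reference_bfa])
    if paired_end:
        stems = ["s_%d_%d_sequence" % (lane, read) for read in (1, 2)]
    else:
        stems = ["s_%d_sequence" % lane]
    for stem in stems:
        commands.append(["maq", "sol2sanger", stem + ".txt", stem + ".fastq"])
        commands.append(["maq", "fastq2bfq", stem + ".fastq", stem + ".bfq"])
    commands.append(["maq", "map", "s_%d_sequence.map" % lane, reference_bfa]
                    + [stem + ".bfq" for stem in stems])
    return commands
```

Each returned item is a list of arguments, so it can be printed with " ".join(command) for review, or passed to subprocess.run(command, check=True) on a system where Maq is installed.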
Merging Map Files
Maq works best with 1 to 3 million reads as input when aligning reads to the reference sequence. If you have a big sequencing project with multiple lanes, you should perform the alignment per lane first, and then combine the map files using mapmerge.
So if you used multiple lanes to sequence the same sample, you can combine the mapped alignments now by entering:
maq mapmerge s_123_sequence.map s_1_sequence.map s_2_sequence.map s_3_sequence.map
where:
• s_123_sequence.map is the combined mapped alignment output file for lanes 1, 2, and 3.
• s_N_sequence.map is the mapped alignment file for lane N.
NOTE
When you align paired-end reads, you will get a message that indicates the success of the pairing:
(total, isPE, mapped, paired) = (4316000, 1, 4226477, 6142)
The number of paired reads should be close to the number of mapped reads. If the number of paired reads is very low (6142 in the example above), and you have done long distance paired-end reads, you need to specify the maximum fragment length (which should be slightly longer than the average paired-end fragment length).
For example, for paired-end reads from 500 bp fragments, add a maximum fragment length of 550 bp by adding the argument -a 550, i.e. enter the following:
maq map -a 550 s_N_sequence.map ref.bfa s_N_1_sequence.bfq s_N_2_sequence.bfq
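When aligning paired-end reads, the pairing message printed by maq map, of the form (total, isPE, mapped, paired) = (...), can be checked automatically when many lanes are aligned. The sketch below assumes the message has exactly the layout shown in this guide; the function name pairing_summary is hypothetical, and what counts as a "very low" paired fraction is left to the reader.

```python
import re

PAIRING_MESSAGE = re.compile(
    r"\(total, isPE, mapped, paired\)\s*=\s*\((\d+),\s*(\d+),\s*(\d+),\s*(\d+)\)"
)


def pairing_summary(message):
    """Extract the four counts from the pairing message and add the paired fraction."""
    match = PAIRING_MESSAGE.search(message)
    if match is None:
        raise ValueError("no pairing message found")
    total, is_pe, mapped, paired = (int(value) for value in match.groups())
    return {
        "total": total,
        "paired_end": bool(is_pe),
        "mapped": mapped,
        "paired": paired,
        # Fraction of mapped reads that were also paired; close to 1 is expected.
        "paired_fraction": paired / mapped if mapped else 0.0,
    }
```

For the example message in this guide, the paired fraction is well below 1 percent, which is the situation in which adding the -a argument is recommended.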
Building Consensus
Now you can assemble the consensus from the (merged) map files:
maq assemble s123.cns ref.bfa s_123_sequence.map
where:
• s123.cns is the consensus output file.
• ref.bfa is the reference file in binary FASTA.
• s_123_sequence.map is the merged mapped alignment file.
Extracting Consensus Information
Once you have built the consensus, you can extract the new consensus sequence in FASTA format, or in FASTQ format (containing Sanger quality scores).
Extracting Consensus in FASTA Format
To extract the consensus in FASTA format, enter the following:
maq cns2ref s123.cns >s123.cns.fasta
where:
• s123.cns is the consensus file.
• s123.cns.fasta is the output consensus file in FASTA.
Extracting Consensus in FASTQ Format
To extract the consensus in Sanger FASTQ format, enter the following:
maq cns2fq s123.cns >s123.cns.fastq
where:
• s123.cns is the consensus file.
• s123.cns.fastq is the output consensus file in FASTQ.
The files are saved in the Sanger FASTQ format, with quality scores in the Sanger symbolic ASCII format (see Quality Values on page 10 for differences with the Illumina quality scheme).
The quality scores are in Sanger symbolic ASCII format, according to the following formula:
Quality value = (ASCII character code) - 33
The values of the characters in the Sanger symbolic ASCII format are listed in the Appendix, section Sanger Symbolic ASCII Quality Values on page 27.
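To illustrate how the two symbolic ASCII formats relate, the sketch below re-encodes an Illumina quality line as a Sanger quality line. It decodes each character with offset 64, converts the Illumina quality to the Phred scale with Q_phred = 10 x log10(1 + 10^(Q_illumina/10)) (derived from the two scoring formulas in Quality Values), rounds, and encodes with offset 33. This is an illustration of the arithmetic under those assumptions, not Maq's sol2sanger implementation, and the function names are hypothetical.

```python
import math


def illumina_char_to_sanger_char(character):
    """Re-encode one Illumina quality character as a Sanger quality character."""
    q_illumina = ord(character) - 64
    q_phred = round(10 * math.log10(1 + 10 ** (q_illumina / 10)))
    return chr(q_phred + 33)


def illumina_line_to_sanger_line(quality_line):
    """Re-encode a whole Illumina quality line, character by character."""
    return "".join(illumina_char_to_sanger_char(c) for c in quality_line)
```

For example, the Illumina character h (quality 40) becomes the Sanger character I (quality 40), while the Illumina character @ (quality 0) becomes the Sanger character $ (quality 3).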
SNP Calling
Extracting SNP Calls
Once you have built the consensus, extract SNPs the following way:
maq cns2snp s123.cns >s123.snp
where:
• s123.cns is the consensus file.
• s123.snp is the tab-delimited, output snp file.
SNP File
To view the SNP calls, open the snp file in Excel (Figure 3).
Figure 3 SNP File Opened in Excel
[Figure 3 callouts: Chromosome/Reference, Position, Reference Base, Consensus Base, Consensus Quality, Read Depth, Highest Mapping Quality, Quality Difference, Average # Hits]
The columns contain the following information:
Column  Name                      Description
A       Chromosome / Reference    Chromosome or reference sequence.
B       Position                  Position of SNP on the reference sequence.
C       Reference Base            The base as present in the reference sequence.
D       Consensus Base            The base called in the consensus of your sequencing reads.
E       Consensus Quality         The quality of the base called in the consensus. This is the Sanger quality, which is different from the Illumina quality scores (see Difference of Illumina and Phred Scoring Scheme on page 10).
F       Read Depth                The number of reads covering the position.
G       Average # Hits            The average number of hits of reads covering this position, which roughly equals the copy number of the flanking region in the reference genome.
H       Highest Mapping Quality   The highest mapping quality of the reads covering the position.
I       Quality Difference        The quality difference between the strong allele and the weak allele. If the quality difference is close to the highest mapping quality, you may be looking at a read error.
For the consensus bases, heterozygotes are designated using IUB codes:
IUB code  Bases
A         A
C         C
G         G
T         T
M         A/C
K         G/T
Y         C/T
R         A/G
W         A/T
S         G/C
D         A/G/T
B         C/G/T
H         A/C/T
V         A/C/G
N         A/C/G/T
Improving SNP Quality
In addition, the following commands are useful for filtering SNP calls:
SNPfilter. SNPfilter removes SNPs that are covered by just one read, fall in a repetitive region, or fall in a 10 bp region with at least 3 SNPs. Enter the following:
perl maq.pl SNPfilter s123.snp >s123.filtered.snp
where:
• s123.snp is the input snp file.
• s123.filtered.snp is the tab-delimited, output filtered snp file.
rmdup. Rmdup removes pairs with identical ends, which could have been caused by PCR at sample prep. Removing duplicates may improve SNP calling accuracy. This filter needs to be done before the consensus is assembled (Building Consensus on page 13); use it as follows:
maq rmdup s_123_rmdup.map s_123_sequence.map
where:
• s_123_rmdup.map is the output filtered mapped alignment file.
• s_123_sequence.map is the input mapped alignment file.
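Because the snp file is tab-delimited, it can also be read with a short script instead of Excel. The Python sketch below assumes the nine columns A to I described in the SNP File section, in that order, and ignores any further columns; the function names are hypothetical, and the example values used to try it out should be replaced by lines from your own snp file.

```python
# IUB codes as listed in this guide, mapped to the bases they stand for.
IUB_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "K": "GT", "Y": "CT", "R": "AG", "W": "AT", "S": "GC",
    "D": "AGT", "B": "CGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def parse_snp_line(line):
    """Split one tab-delimited SNP line into named fields (columns A to I)."""
    fields = line.rstrip("\n").split("\t")
    return {
        "reference": fields[0],
        "position": int(fields[1]),
        "reference_base": fields[2],
        "consensus_base": fields[3],
        "consensus_quality": int(fields[4]),
        "read_depth": int(fields[5]),
        "average_hits": float(fields[6]),
        "highest_mapping_quality": int(fields[7]),
        "quality_difference": int(fields[8]),
    }


def is_heterozygous(consensus_base):
    """A consensus base is treated as heterozygous when its IUB code covers more than one base."""
    return len(IUB_CODES[consensus_base.upper()]) > 1
```

A record whose consensus base is R, for example, is reported as heterozygous (A/G), while a plain A is not.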
Indel Discovery
Extracting Indels
Once you have built the consensus, you can extract the indels the following way:
maq indelpe ref.bfa s_123_sequence.map >s_123_sequence.indelpe
where:
• ref.bfa is the reference file in binary FASTA.
• s_123_sequence.map is the merged mapped alignment file.
• s_123_sequence.indelpe is the tab-delimited, output indel file.
Indel File
To view the indels found, open the indel file in Excel (Figure 4).
Figure 4 Indel File Opened in Excel
[Figure 4 callouts: Chromosome/Reference, Position, Indel Type, # Ref Reads, Indel Size, Reverse Reads, Forward Reads]
NOTE
You can only find indels using Maq with paired-end data.
The columns contain the following information:
Column  Name                      Description
A       Chromosome / Reference    Chromosome or reference sequence.
B       Start Position            Start position of indel on reference sequence.
C       Indel Type                * Indicates the indel is confirmed by reads from both strands.
                                  + Means the indel is hit by at least two reads but from the same strand.
                                  - Shows the indel is only found on one read.
                                  . Means the indel is too close to another indel and is filtered out.
D       # Ref Reads               The number of reads across the indel.
E       Indel Size                Size of indel.
F       Forward Reads             Number of reads on the forward strand confirming the consensus.
G       Reverse Reads             Number of reads on the reverse strand confirming the consensus.
NOTE
If you want to concentrate on the most promising indels, filter the indel file in Excel for * in the Indel Type field.
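The filter on the Indel Type field can also be applied with a short script. The sketch below assumes the column order A to G given in the table above; note that the order of the forward and reverse read columns is taken from that table and is worth checking against your own Maq output. The indel size is kept as text rather than converted to a number, and the function names are hypothetical.

```python
def parse_indel_line(line):
    """Split one tab-delimited indel line into named fields (columns A to G)."""
    fields = line.rstrip("\n").split("\t")
    return {
        "reference": fields[0],
        "start": int(fields[1]),
        "type": fields[2],
        "reads_across": int(fields[3]),
        "size": fields[4],  # kept as text; no assumption about its exact layout
        "forward_reads": int(fields[5]),
        "reverse_reads": int(fields[6]),
    }


def confirmed_on_both_strands(lines):
    """Keep only indels of type '*', i.e. confirmed by reads from both strands."""
    records = (parse_indel_line(line) for line in lines if line.strip())
    return [record for record in records if record["type"] == "*"]
```

Passing the lines of the indel file to confirmed_on_both_strands returns the same subset that the Excel filter on * would show.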
Viewing SNPs and Indels with GBrowse
Once you have files with SNPs and indels, you may want to view them in a genomic context. Many genome centers have implemented GBrowse, an open source genome viewer. This section helps you view your results in a GBrowse viewer.
You will need to perform the following steps:
1. Find a GBrowse implementation for the organism and build you are interested in.
2. Transfer your SNP or indel data to the proper file format.
3. Upload the file to GBrowse.
Now you are ready to look at your SNPs and indels as annotations in a genomic context.
GBrowse
GBrowse is an open source genome viewer, generated as part of the Generic Model Organism Database project (GMOD). Many genome centers and universities have implemented GBrowse to enable you to view their genomic data.
Finding Suitable GBrowse Implementation
Lists of implementations can be found at the following two websites:
http://www.gmod.org/wiki/index.php/GMOD_Users
http://www.gmod.org/wiki/index.php/Gbrowse
Browse through these lists and see if there is a GBrowse implementation for the organism and build you are interested in. These lists are not comprehensive; if you can’t find one you can use, try entering GBrowse and your particular build in Google, and see if you can find an appropriate implementation that way.
Alternative Solutions
If no suitable implementation of GBrowse exists, you can do two things:
• Redo your alignments with a build that is supported in a GBrowse implementation.
• Install GBrowse locally. This is possible, but requires more work and skill. See http://www.gmod.org/wiki/index.php/GBrowse for instructions.
Reformatting Data
The SNP and indel files do not have the appropriate format for GBrowse to recognize. Fortunately, they are usually not extremely big, and can be handled in Excel, and you do not need a Perl script to change the format. This section explains how to reformat your SNP or indel data.
Annotation File Format
GBrowse can read a number of different file formats. Here we explain the annotation file format that works well with our data (Figure 5).
Figure 5 GBrowse File
The annotation file is a text file, and has to start with the following line:
reference=landmark name
The reference line has the following properties:
• The line starts with reference= (in lowercase).
• The line refers to the chromosome (reference=chr1) or the accession number of the organism (reference=NC_000913).
• No spaces are allowed.
• The reference applies to all entries below it, until a new reference is found. Multiple reference lines are allowed.
The reference line is followed by data lines, which have the following fields:
Column  Entry                   Description
A       Feature Type            In our case SNP or INDEL.
B       Feature Name            A unique name for each entry.
C       Feature Position        One or more ranges in the format 123-456,987-654 or 123...456,987...654.
D       Description (optional)  A description that will be displayed in the viewer.
E       URL (optional)          If you have a hyperlink, provide it here.
NOTE
Do not use spaces, unless you put quotation marks around the field entry.
Reformatting SNP Files
To reformat the SNP file, perform the following steps:
1. Open the SNP file in Excel.
2. To get a unique SNP name, enter SNP1 in the top field of the empty column J.
3. You need to have a range of nucleotides for the feature position field. In the top field of the empty column K, enter:
=CONCATENATE(B1,"-",B1)
4. You need one field with an informative description for every SNP. In the top field of the empty column L, enter:
=CONCATENATE(C1,">",D1,",Q",E1,",",B1)
The SNP description will consist of the following information:
reference base>consensus base,quality score,position
5. To copy all formulas and calculate values for every entry:
a. Select fields J1, K1, and L1.
b. Drag down the selected fields by the bottom right corner (Figure 6).
Figure 6 Drag Down Bottom Right Corner
[Figure 6 callouts: Select Bottom Right Corner; Drag Down to Last Entry]
The values in columns K and L should automatically recalculate, and column J should be filled with unique names (SNP1, SNP2, and so on).
6. Save the file in Excel format (*.xls).
7. Open a new book. This will be the annotation file.
8. Copy the values from columns J, K and L of the modified SNP file to columns B, C and D of the annotation file (paste values only).
9. Enter “SNP” in the top field of the empty column A of the annotation file. Copy “SNP” all the way down to the last data line.
10. Select the first row and insert an empty line by pressing Ctrl+Shift+Plus (+).
11. Enter the reference line in field A1, for example:
reference=chr1
or
reference=NC_000913
The SNP annotation file should look like this (Figure 7):
Figure 7 SNP Annotation File
12. Save the SNP annotation file as a text (tab delimited) file (*.txt).
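For SNP files that are too large to handle comfortably in Excel, the same reformatting can be scripted. The Python sketch below mirrors the Excel steps above: it numbers the SNPs, builds the position range and the description (reference base>consensus base,quality score,position), and writes a new reference line whenever the chromosome changes. It is an illustration under the assumption that the input rows carry the first five columns of the snp file; the function name snp_annotation_lines is hypothetical.

```python
def snp_annotation_lines(snp_rows):
    """Build annotation lines from (reference, position, ref_base, cons_base, quality) rows."""
    lines = []
    current_reference = None
    for number, row in enumerate(snp_rows, start=1):
        reference, position, ref_base, cons_base, quality = row
        if reference != current_reference:
            # A reference line applies to all entries below it.
            lines.append("reference=%s" % reference)
            current_reference = reference
        description = "%s>%s,Q%s,%s" % (ref_base, cons_base, quality, position)
        lines.append("\t".join(
            ["SNP", "SNP%d" % number, "%s-%s" % (position, position), description]
        ))
    return lines
```

Writing the returned lines to a text file, one per line, gives a tab-delimited annotation file with the same layout as the one produced in Excel.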
Reformatting Indel Files
To reformat the indel file, perform the following steps:
1. Open the indel file in Excel.
NOTE
If you want to concentrate on the most promising indels, filter the indel file in Excel for * in the Indel Type field (column C), and copy all the promising indels to a new book.
2. To get a unique indel name, enter INDEL1 in the top field of the empty column H.
3. You need to have a range of nucleotides for the feature position field. In the top field of the empty column I, enter:
=CONCATENATE(B1,"-",B1)
4. You need one field with an informative description for every indel. In the top field of the empty column J, enter:
=CONCATENATE(C1,",",E1,",f",F1,",r",G1)
The indel description will consist of the following information:
Indel type,indel size,f forward reads,r reverse reads
5. To copy all formulas and calculate values for every entry:
a. Select fields H1, I1, and J1.
b. Drag down the selected fields by the bottom right corner (Figure 6).
The values in columns I and J should automatically recalculate, and column H should be filled with unique names (INDEL1, INDEL2, and so on).
6. Save the file in Excel format (*.xls).
7. Open a new book. This will be the annotation file.
8. Copy the values from columns H, I, and J of the modified indel file to columns B, C and D of the annotation file (paste values only).
9. Enter “INDEL” in the top field of the empty column A of the annotation file. Copy “INDEL” all the way down to the last data line.
10. Select the first row and insert an empty line by pressing Ctrl+Shift+Plus (+).
11. Enter the reference line in field A1, for example:
reference=chr1
or
reference=NC_000913
The indel annotation file should look like this (Figure 8):
Figure 8 Indel Annotation File
12. Save the indel annotation file as a text (tab delimited) file (*.txt).
NOTE
You can refer to multiple chromosomes per file; just insert a reference line with the new chromosome above the data line where the next chromosome starts. The reference applies to all entries below it, until a new reference is found.
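The indel reformatting can be scripted in the same spirit as the Excel steps above. The sketch below builds the description as indel type,indel size,f forward reads,r reverse reads and can optionally keep only indels of type *. It assumes the input rows carry columns A, B, C, E, F and G of the indel file in the order given in this guide; the function name indel_annotation_lines and the only_confirmed parameter are hypothetical.

```python
def indel_annotation_lines(indel_rows, only_confirmed=False):
    """Build annotation lines from (reference, start, type, size, forward, reverse) rows."""
    lines = []
    current_reference = None
    number = 0
    for reference, start, indel_type, size, forward, reverse in indel_rows:
        if only_confirmed and indel_type != "*":
            continue  # keep only indels confirmed by reads from both strands
        number += 1
        if reference != current_reference:
            lines.append("reference=%s" % reference)
            current_reference = reference
        description = "%s,%s,f%s,r%s" % (indel_type, size, forward, reverse)
        lines.append("\t".join(
            ["INDEL", "INDEL%d" % number, "%s-%s" % (start, start), description]
        ))
    return lines
```

With only_confirmed=True the numbering covers only the indels that are kept, so the names stay consecutive.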
Using GBrowse
When you have generated your annotation file, and found a suitable GBrowse implementation, you can start viewing your indels or SNPs in a genomic context.
For comprehensive GBrowse help, FAQs and a tutorial, see http://www.gmod.org/wiki/index.php/Gbrowse.
Upload the Annotation File
1. Navigate your web browser to the website running GBrowse.
2. Scroll down to the bottom of the page, where you can upload your own annotations (Figure 9). Different GBrowse implementations may look slightly different.
Figure 9 Upload Annotation File
3. Click Browse, go to the annotation file, select the file, and click Open.
4. Click Upload.
Viewing SNPs and Indels
Once your annotation file is uploaded you will see the file appear with the separate features (Figure 10).
Figure 10 Uploaded Annotation File
Make sure the annotation check box is selected. You can now edit the uploaded annotation file, or click on the separate features (SNPs or indels). This will display the feature in the viewer panel (Figure 11 and Figure 12).
Figure 11 Your Favorite SNP in the GBrowse Viewer
[Callouts in Figures 9 to 11: Browse to File; Upload File; Annotation Check Box; Clickable Features; Uploaded Annotation File; Edit File; Published SNPs; Your Favorite SNP; Gene Information; Zoom and Browse Area]
Figure 12 Your Favorite Indels in the GBrowse Viewer
[Callouts in Figure 12: Your Favorite Indels; Gene Information; Zoom and Browse Area]
Appendix A: Installing Maq Yourself
If you decide to install Maq yourself, do the following:
1. Open your browser in Linux and navigate to maq.sourceforge.net.
2. Click on the link download page (see Figure 2).
3. Click on the link Download for the most recent version of Maq.
4. Click on the package for your Linux and hardware configuration. If you are not sure which one is best, choose platform independent.
5. Click Save to download the package.
6. Repeat steps 3 to 5 for Maqview and Maq-Data.
7. Open the command line (Terminal).
8. Go to the directory containing the downloaded files using the cd command. The exact location depends on how your Linux is set up.
9. To unzip the packages, type the following in the command line:
bunzip2 *.bz2
10. List the directory contents by using the ls command.
11. To extract the files from the archive, type the following for every *.tar file in the directory:
tar xvf name.tar
You should get three new directories (check by using the ls command).
12. Go to the directory containing the Maq files:
cd maq-x.x.x
13. Install the package by entering the following three commands in succession:
./configure
make
make install
14. If you get a message that access is denied to the default install directory, you need to specify a directory that you do have access to. Enter the following two commands:
./configure --prefix=/home/share/yourfolder
make install
where /home/share/yourfolder is a directory you have access to.
15. Go one directory up:
cd ..
16. Test whether Maq is working by entering:
maq
You should get a message explaining Maq usage. If the command maq is not recognized, try the second method described in the Maq User Manual, or ask a Linux expert for help.
Appendix B: Quality Value Tables
Illumina Symbolic ASCII Quality Values
The quality values of the characters in the Illumina symbolic ASCII quality values are listed in the table below:
Table 1 Quality Value of Characters in the Illumina Symbolic ASCII Format
Char. Qual.   Char. Qual.   Char. Qual.   Char. Qual.   Char. Qual.   Char. Qual.
Code  Value   Code  Value   Code  Value   Code  Value   Code  Value   Code  Value
;     -5      C     3       K     11      S     19      [     27      c     35
<     -4      D     4       L     12      T     20      \     28      d     36
=     -3      E     5       M     13      U     21      ]     29      e     37
>     -2      F     6       N     14      V     22      ^     30      f     38
?     -1      G     7       O     15      W     23      _     31      g     39
@     0       H     8       P     16      X     24      `     32      h     40
A     1       I     9       Q     17      Y     25      a     33
B     2       J     10      R     18      Z     26      b     34
Sanger Symbolic ASCII Quality Values
The quality values of the characters in the Sanger Symbolic ASCII Quality Values are listed in the table below:
Table 2 Quality Value of Characters in the Sanger Symbolic ASCII Format
Char. Qual.   Char. Qual.   Char. Qual.   Char. Qual.   Char. Qual.   Char. Qual.   Char. Qual.
Code  Value   Code  Value   Code  Value   Code  Value   Code  Value   Code  Value   Code  Value
!     0       /     14      =     28      K     42      Y     56      g     70      u     84
"     1       0     15      >     29      L     43      Z     57      h     71      v     85
#     2       1     16      ?     30      M     44      [     58      i     72      w     86
$     3       2     17      @     31      N     45      \     59      j     73      x     87
%     4       3     18      A     32      O     46      ]     60      k     74      y     88
&     5       4     19      B     33      P     47      ^     61      l     75      z     89
'     6       5     20      C     34      Q     48      _     62      m     76      {     90
(     7       6     21      D     35      R     49      `     63      n     77      |     91
)     8       7     22      E     36      S     50      a     64      o     78      }     92
*     9       8     23      F     37      T     51      b     65      p     79      ~     93
+     10      9     24      G     38      U     52      c     66      q     80
,     11      :     25      H     39      V     53      d     67      r     81
-     12      ;     26      I     40      W     54      e     68      s     82
.     13      <     27      J     41      X     55      f     69      t     83
Illumina, Inc.
9885 Towne Centre Drive
San Diego, CA 92121-1975
+1.800.809.ILMN (4566)
+1.858.202.4566 (outside North America)
[email protected]