DNA-Seq Analysis Tutorial - Cancer Genomics...

DNA-Seq Analysis Tutorial, Release 10.0

OmicSoft Corporation

August 07, 2017

CONTENTS

1 Introduction
    1.1 Array Studio
    1.2 Test Dataset
    1.3 DNA-Seq Analysis Workflow

2 Create Array Studio Project

3 DNA-Seq Analysis Pipeline

4 QC of Raw Data Files
    4.1 Basic Statistics
    4.2 Base Distribution
    4.3 Read Quality QC
    4.4 K-Mer analysis

5 Alignment to the Genome

6 QC of Aligned Data
    6.1 Alignment Report
    6.2 DNA-Seq Aligned QC
    6.3 Individual Aligned Data QC
    6.4 Coverage Summary Statistics

7 DNA-Seq Fusion Gene Detection
    7.1 Report Paired-End Fusion Genes
    7.2 Map Fusion Reads

8 DNA-Seq Mutation Detection
    8.1 Summarize/Annotate Variant
    8.2 Generate VCF files + Annotated VCF files
    8.3 Summarize/Annotate Matched Pair Variation

9 DNA-Seq Copy Number Analysis
    9.1 Summarize Copy Number
    9.2 Segment Copy Number

10 Save & Close Project


CHAPTER

ONE

INTRODUCTION

1.1 Array Studio

Array Studio runs local NGS analysis in 64-bit mode on a Windows 64-bit computer with at least 8GB of RAM. Server-based analysis allows the user to run jobs on a remote server (Linux or Windows), which usually has more computing power than a desktop computer. This tutorial is based on Windows Server based NGS analysis using a Windows Workstation with 3.4GHz Intel® Core™ i7-6700 Processor (# of cores: 4; # of threads: 8) with 40GB RAM.

It is highly recommended that the user complete the prerequisite for this tutorial, the Microarray tutorial, as a way to learn the basics of Array Studio. This tutorial assumes working knowledge of Array Studio, standard visualizations, and different modules within the software. As many of the downstream data types from next generation sequencing are “Microarray” datasets, the Microarray tutorial is an invaluable starting tool.

1.2 Test Dataset

This DNA-Seq tutorial covers the import and some analysis of three public datasets. We have selected three SRR run .fastq files. The full dataset is available from the SRA:

SRR097848 (MCF10A cell line): http://www.ncbi.nlm.nih.gov/sra/SRX040530

SRR097849 (MCF7 cell line): http://www.ncbi.nlm.nih.gov/sra/SRX040531

SRR064173 (K562 cell line): http://www.ncbi.nlm.nih.gov/sra/SRX025461

The whole dataset is ~2.5GB. For convenience, we provide a subset (10% of reads) of the dataset, which can be downloaded at:

http://omicsoft.com/downloads/data/tutorial/DNASeq.zip

Note: The tutorial is based on the whole dataset; results will be somewhat different if you are using the subset dataset. Also, Omicsoft keeps updating algorithms and data to make sure that users have the most accurate results. Therefore, you may have slightly different results when you compare your results with the results shown in this tutorial.

1.3 DNA-Seq Analysis Workflow

In this tutorial, we will introduce the DNA-Seq data analysis workflow in Array Studio, step by step. The workflow consists of a number of modules for DNA-Seq data processing, including raw data quality control (QC), alignment, aligned data QC, detection of fusion and mutation, and copy number analysis, as shown in the schematic chart below:


CHAPTER

TWO

CREATE ARRAY STUDIO PROJECT

Array Studio provides an integrated environment for analyzing and visualizing high dimensional data. Its Solution Explorer makes it convenient to organize and visualize data, storing data and results as projects. Users can create a local project, which uses the local computer to do the analysis, or a server project, which runs all analyses on Array Server. Server-based analysis allows the user to run jobs on a remote server (Linux or Windows), which usually has more computing power than a desktop computer. The Array Studio client software, installed on a local desktop machine, is used to interact with the ArrayServer (like a server terminal). Tasks such as job submission, monitoring, file transfer and data visualization can be done through the client software. Moreover, ArrayServer has a built-in scheduling system that supports high performance computing clusters (both SGE and PBS/Torque), accelerating the analysis of large amounts of NGS data. In this tutorial, we will create a server project to run the analyses. This assumes the user has Array Server installed. Users without Array Server can run this tutorial as a local project (distributed); the analysis steps are almost the same as described here. The interface for creating a new local project differs from that for creating a new server project, but the remaining steps, such as data import and data analysis, share the same interface.

Different project types tell Array Studio where to run the analysis: for a local project, Array Studio runs the analysis on the local machine; for a server project, it runs the analysis on the server.

Once Array Studio has been opened, click File | New Server Project from the File Menu (also accessible via the New button on the toolbar).

Note: For a local project, the user needs approximately 10GB of available space on their hard drive for this tutorial. The general rule of thumb is to have 3x the size of the raw data files available for the import of data. The user might also want to specify a different location for the “Omicsoft” home folder using Tools Menu | Preferences | Advanced. This will place the Omicsoft folder, which is used as storage for any genome reference index and temp files for the process, at a specified user location. The reference library and index will take ~10GB of space.



Users may enter some basic metadata and click Create to create an empty server project:

Another way to create a new project is by this icon:



Once a project is created, users can choose to follow the DNA-Seq Analysis Pipeline module (all-in-one, Chapter 3) or perform QC/filtering/alignment/mutation analyses (Chapters 4-6, 8) as individual steps.


CHAPTER

THREE

DNA-SEQ ANALYSIS PIPELINE

Array Studio provides an easy way for new users to do all analyses in one module:

This function has a similar interface to the DNAseq alignment module.



Users need to provide the DNA-Seq files and choose the genome. Then, in the reporting section, users can choose which related analyses to perform. The available analyses include:

• Perform raw data QC

• Filter Raw data

• Perform post alignment QC

• Summarize mutation + SNP

• Generate BAM Summary (.bas Files)

Each analysis is run with default settings and will generate data objects in the Solution Explorer when the entire pipeline is complete. Users can perform step-by-step analysis with customized settings as described in later chapters.


CHAPTER

FOUR

QC OF RAW DATA FILES

Array Studio contains several modules for QC of raw data files. The easiest way is to run the Raw Data QC Wizard, which scans each file once to calculate all quality statistics such as GC content, per position nucleotide distribution, read length distribution, quality box plot, sequence duplication and K-mer analysis. Users can also run each function individually, which offers more options, such as adapter stripping and max read position.

Click Add to find all 6 files for the three samples. The Array Studio Raw Data QC Wizard provides many QC metrics, such as Basic Statistics and Base Distribution. Users can select each QC metric they want to run. Optionally, for a quicker analysis, the user can choose preview mode to generate QC on only a 5% sample of the reads. This is, in most cases, good enough to get an assessment of quality. Leave Quality encoding as Automatic to automatically set the correct quality encoding method (the fastq files in this tutorial use the Sanger method of quality encoding).
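To illustrate what automatic detection of the quality encoding can look like, here is a minimal Python sketch. It is not Array Studio's implementation; the function names are hypothetical, and the heuristic (any quality character with ASCII code below 59 implies the Sanger Phred+33 encoding) is a common rule of thumb that can misclassify very high quality Sanger data.

```python
def guess_phred_offset(quality_strings):
    """Guess the Phred offset (33 or 64) from FASTQ quality strings.

    Heuristic (an assumption, not Array Studio's method): characters with
    ASCII code below 59 cannot occur in +64 encodings, so seeing one
    implies Phred+33 (Sanger). Otherwise +64 is only a guess.
    """
    codes = [ord(ch) for q in quality_strings for ch in q]
    if not codes:
        raise ValueError("no quality characters supplied")
    return 33 if min(codes) < 59 else 64


def decode_qualities(quality_string, offset=33):
    """Convert a FASTQ quality string into a list of Phred scores."""
    return [ord(ch) - offset for ch in quality_string]


if __name__ == "__main__":
    sanger_like = ["IIII#III", "IIIIIIII"]
    print(guess_phred_offset(sanger_like))   # offset guessed for these strings
    print(decode_qualities("I#", offset=33))
```

In practice a detector would sample many reads before deciding, which fits naturally with the 5% preview mode described above.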

Click Submit to begin the analysis.



The Raw Data QC module returns multiple raw data QC results/reports in the Raw Data QC folder, which are described in the following subsections.



4.1 Basic Statistics

The basic statistics table contains some important information about your samples, including total Sample #, Minimum and Maximum read length (if pre-filtering has occurred), total Nucleotide #, and GC%. Use this table to confirm any expected values, as well as to get an idea of the overall size of your experiment.
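As a rough illustration of how such a table can be derived, the following Python sketch computes read count, minimum and maximum read length, total nucleotides and GC% from a list of read sequences. The function name and the choice to count every base (including N) in the GC% denominator are assumptions made for this example, not a description of the BasicStats implementation.

```python
def basic_stats(reads):
    """Summarize a list of read sequences (strings over A/C/G/T/N)."""
    if not reads:
        raise ValueError("no reads supplied")
    lengths = [len(r) for r in reads]
    total_nt = sum(lengths)
    # Count G and C bases case-insensitively across all reads.
    gc = sum(r.upper().count("G") + r.upper().count("C") for r in reads)
    return {
        "reads": len(reads),
        "min_length": min(lengths),
        "max_length": max(lengths),
        "nucleotides": total_nt,
        # Assumption: every base, including N, counts in the denominator.
        "gc_percent": 100.0 * gc / total_nt if total_nt else 0.0,
    }


if __name__ == "__main__":
    print(basic_stats(["ACGT", "GGCC", "AT"]))
```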

Basic statistics QC results are located in the Raw Data QC folder with name BasicStats. Double-click the table view to open it if you do not see the basic statistics table in the middle window:

4.2 Base Distribution

The base distribution of each raw data file is useful for ensuring that the base distribution is as expected (it can sometimes reveal adapter sequences if the user is not aware that adapter sequences are in the read file). Base distribution QC results are located in the Raw Data QC folder with name BaseDistribution. By default, the BaseDistribution ProfileView should be shown, but if not, open it by double-clicking the Profile view from the Solution Explorer.



Switch to the Legend in the View Controller to see the percentages of A, G, C, and T for each base pair position. Notice that there are a total of 6 charts (scroll through them to look at each sample), one for each file that was QC’ed. Selecting points on the chart will also show additional details in the Details Window, as shown below.



One can also switch to line plot view by going to View Controller | Task | Customize | Change To Line Type.
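The per-position percentages plotted in the base distribution view can be sketched in a few lines of Python. This is an illustrative example only; the function name is hypothetical, and shorter reads are simply ignored at positions beyond their length.

```python
from collections import Counter


def base_distribution(reads):
    """Return, for each read position, the percentage of each base seen there."""
    max_len = max((len(r) for r in reads), default=0)
    columns = [Counter() for _ in range(max_len)]
    for read in reads:
        for pos, base in enumerate(read.upper()):
            columns[pos][base] += 1
    result = []
    for counts in columns:
        total = sum(counts.values())  # reads long enough to reach this position
        result.append({base: 100.0 * n / total for base, n in counts.items()})
    return result


if __name__ == "__main__":
    for pos, dist in enumerate(base_distribution(["ACGT", "ACGA", "AC"]), start=1):
        print(pos, dist)
```

A position where one base is far above the others across many reads is the kind of pattern that can hint at adapter sequence.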



4.3 Read Quality QC

The QC results include a PerSequenceQuality (view and table), a QualityBoxPlot (view and table) and an OverallQualityReport (view and table) in the Solution Explorer. Per Sequence Quality calculates the average quality score for each read and shows the distribution of quality for all reads in each file.

In Quality BoxPlot, all reads in a file are overlaid and a box plot for each base pair position is shown. This gives the user an idea of where the quality starts to drop for most reads in a sample. It is useful to compare plots when evaluating sequencing quality for multiple samples.



From the QualityBoxPlot view (shown above), it is clear that the quality of read 1 starts to drop off earlier than that of read 2 in sample SRR097848. Scroll through each of the 6 charts to see the quality BoxPlots for each individual fastq file. Selecting a point on the chart will show additional details in the Details Window below the plot.

The Overall Quality Report summarizes the quality of each base pair. It shows the total number of base pairs in one input file that have a certain quality score.
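The two summaries described in this section, the per-read average quality and the count of bases at each quality score, can be sketched as follows. The function names are hypothetical and a Phred+33 (Sanger) offset is assumed, matching the encoding of the tutorial files.

```python
from collections import Counter


def per_read_mean_quality(quality_strings, offset=33):
    """Average Phred score of each read (empty reads are skipped)."""
    return [
        sum(ord(ch) - offset for ch in q) / len(q)
        for q in quality_strings
        if q
    ]


def overall_quality_counts(quality_strings, offset=33):
    """Number of bases observed at each Phred score across all reads."""
    counts = Counter()
    for q in quality_strings:
        counts.update(ord(ch) - offset for ch in q)
    return counts


if __name__ == "__main__":
    quals = ["II", "I#"]
    print(per_read_mean_quality(quals))
    print(dict(overall_quality_counts(quals)))
```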



4.4 K-Mer analysis

The K-Mer Analysis module counts the enrichment of every possible 5-mer across the positions of the reads. This analysis identifies whether there is an enrichment of a k-mer in a particular region of the read. It can help find overrepresented patterns, such as adapters being read through when the inserted fragment is short (e.g. miRNA-seq) and unfiltered adapter dimers.



In the KMerAnalysis profile view, the Y-axis is the fraction of reads (0.01 means 1%) that contain each k-mer. There is no significant k-mer enrichment (all less than 1.5%) in this tutorial dataset.
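A minimal sketch of a positional k-mer count is shown below. It is an illustration under stated assumptions, not the module's algorithm: the function name is hypothetical, and the denominator is the total number of reads, so reads too short to reach a position still count in the total.

```python
from collections import Counter, defaultdict


def kmer_position_fractions(reads, k=5):
    """Map each k-mer to {start position: fraction of reads carrying it there}."""
    counts = defaultdict(Counter)
    for read in reads:
        seq = read.upper()
        for pos in range(len(seq) - k + 1):
            counts[seq[pos:pos + k]][pos] += 1
    n = len(reads)
    return {
        kmer: {pos: c / n for pos, c in by_pos.items()}
        for kmer, by_pos in counts.items()
    }


if __name__ == "__main__":
    fractions = kmer_position_fractions(["ACGTAC", "ACGTAG"])
    for kmer, by_pos in sorted(fractions.items()):
        print(kmer, by_pos)
```

A k-mer whose fraction climbs toward the 3' end of the reads is the typical signature of adapter read-through.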




CHAPTER

FIVE

ALIGNMENT TO THE GENOME

The second step in most DNA-Seq analyses is the alignment of the reads to the genome. Alternatively, if the data have already been aligned, the alignment (BAM/SAM) files can be imported through Add NGS Data | Add DNA-Seq Data | Add Genome Mapped Reads.

For this experiment, we will align the data using Array Studio. To align the reads, go to the Add Data dropdown menu on the toolbar, then choose Add NGS Data | Add DNA-Seq Data | Map Reads to Genome (Illumina).

At this point, the Map DNA-Seq Reads to Genome module appears. The first step is to click the Add button to specify the location of the files. Choose the 6 files that were downloaded from SRA or the subset of the dataset downloaded from the OmicSoft website. Note that these files are in .gz (gzip) format. The alignment process reads gzipped files directly, which saves space when importing files, as there is no need to extract them.



As this is a paired experiment, click the Reads are paired checkbox. This will ensure that the pairing information is used in the alignment and counting process.

Choose the Genome for the experiment. In this case, use Human.B37.3. Omicsoft supplies standard genome builds for common organisms, but the user can always choose to build and use their own genome.

Leave the quality encoding set to automatic. For your information, however, these files were encoded using the Sanger quality scoring system. Total penalty should be left on automatic, and is described completely in Omicsoft’s white paper on alignment.

Thread number indicates the number of threads to use per alignment, and usually this number should be less than 6. Job number refers to the number of parallel jobs (independent processes).

Non-unique mapping indicates how many “ties” for non-unique reads should be reported, or if they should be excluded altogether.

Output folder allows the user to explicitly specify a location to store the alignment BAM files. Otherwise, the BAM files will be saved in a default location (a random number/letter folder in the project folder). In this tutorial, it is recommended to specify a folder, so that the BAM files can be found easily in a later step (fusion detection).

There are a few options in the Advanced Tab (e.g. detailed parameter settings for Indel detection). In general, the default values have been tuned and should work well in most cases.

Leave the Exclude unmapped reads in BAM file unchecked, so that the generated BAM file will contain the information for all the reads (i.e. mapped and unmapped). The generated BAM files (which contain unmapped reads, possible fusion reads) can be directly used as input to the single-end fusion detection algorithm (see the fusion chapter in this tutorial).



Click Submit to submit the analysis. This could take hours, depending on the number of threads, type of computer (64-bit/32-bit), etc.

After the alignment, you will see an NgsData object and an alignment report table in the Solution Explorer:


CHAPTER

SIX

QC OF ALIGNED DATA

6.1 Alignment Report

By default, an alignment report is generated any time an alignment is done in Array Studio. If it is not already open, go to your Solution Explorer and double click on Report from the AlignmentReport table.

This will show, for each pair (or single file if the user did not do a paired alignment), some information regarding mapping. One of the key statistics is the number of uniquely paired reads (uniquely mapped and properly paired).



6.2 DNA-Seq Aligned QC

This module can be used to generate a table of metrics, along with a visualization for each metric, by scanning the alignment BAM files. These metrics provide an overview of different statistics on alignment, coverage, flag, mapping location, insert size, strandness, and more, in a single table. The module also generates a ProfileView showing a chart for each metric.

Here we assume that the alignment data (NGS data) exist in a project. If not, one can add BAM files to a project by going to Add Data | Add NGS Data | Add DNA-Seq Data | Add Genome-Mapped Reads. To run the DNA-Seq QC module, go to NGS | Aligned Data QC | DNA-Seq QC Metrics now.



Choose the NGS data and leave all other settings as their defaults and click Submit to run the module.

Profile metric is based on the provided gene model. It provides the most information with gene models like Ensembl that have detailed information for the source of each transcript. Here, we specify the gene model OmicsoftGene20130723.



The analysis returns a Table View of QC metrics and Profile view in the Aligned Data QC folder:

In the Table view, you will find the following sections:

6.2.1 Alignment Metrics

These metrics can be used to give an overall idea of the quality of the alignment for your samples.

6.2.2 Duplication Metrics

The duplication metrics can give you an idea of the total level of duplication for an experiment (after alignment). This is based on coordinates (start position), unlike the raw data QC, which was based on sequence. It is expected that a DNA-Seq experiment will have a large amount of duplication, so do not be alarmed if these metrics show high values.
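The idea of coordinate-based duplication can be sketched as follows: reads that share the same chromosome, start position and strand are counted as copies of one another. Including the strand in the key, and the function name itself, are assumptions for illustration; the exact definition used by Array Studio is not stated here.

```python
from collections import Counter


def duplication_rate(alignments):
    """Fraction of mapped reads that duplicate an earlier read's coordinates.

    alignments: iterable of (chromosome, start, strand) tuples.
    """
    counts = Counter(alignments)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Every read beyond the first at a given coordinate is a duplicate.
    duplicates = total - len(counts)
    return duplicates / total


if __name__ == "__main__":
    reads = [
        ("chr1", 100, "+"),
        ("chr1", 100, "+"),
        ("chr1", 100, "-"),
        ("chr2", 5, "+"),
    ]
    print(duplication_rate(reads))
```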



6.2.3 Feature Metrics

Feature metrics measure the rate of CDS/Exon/Gene/Transcript coverage by DNA-Seq data.



6.2.4 Flag Metrics

Flag metrics are generally only useful for paired end reads when data has been aligned with OSA. They provide metrics using the SAM flags.
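For reference, the SAM flag is a bit field defined by the SAM format specification, and a flag summary amounts to counting reads by which bits are set. The short Python sketch below decodes a flag value; the descriptive names in the dictionary are chosen for this example and are not Array Studio column names.

```python
# Bit values are defined by the SAM specification; the labels are illustrative.
SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag):
    """Return the labels of all bits set in a SAM flag value."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]


if __name__ == "__main__":
    print(decode_flag(99))   # a typical properly paired first read
    print(decode_flag(4))
```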

6.2.5 Genome Coverage Metrics

Genome coverage metrics provide the genome coverage at different depths, from 1X to 100X.
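A breadth-of-coverage table of this kind can be sketched as the fraction of positions whose depth reaches each threshold. Only the 1X and 100X end points come from the text above; the intermediate thresholds and the function name are assumptions for illustration.

```python
def coverage_breadth(depths, thresholds=(1, 10, 20, 50, 100)):
    """Fraction of positions covered at or above each depth threshold.

    depths: per-position read depth for the region of interest.
    """
    n = len(depths)
    if n == 0:
        raise ValueError("no positions supplied")
    return {
        f"{t}X": sum(1 for d in depths if d >= t) / n
        for t in thresholds
    }


if __name__ == "__main__":
    print(coverage_breadth([0, 1, 10, 100]))
```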



6.2.6 Insert Size Metrics

Insert size metrics provide some basic metrics on the insert sizes for paired end experiments. Use these metrics to ensure that the paired end experiment is performing as expected, and to look for any outlier values such as mean insert sizes that are significantly different from the library’s size-selection range.

6.2.7 Profile Metrics

Profile Metrics provide important overall statistics based on the provided gene model. They are usually used for RNA-Seq data, but are also useful for exon capture DNA-Seq sequencing, since DNA fragments are captured based on a gene model. Metrics include the rate of reads mapped to an exon, exon junction, intron, anywhere in a known gene, in a known gene with an insertion or deletion, an inter-gene region, inter-gene region with insertion, inter-gene region with deletion, or a deep inter-gene region (>5kb outside the known gene model). Use these metrics to determine the overall success of the profiling.



6.2.8 Strand Metrics

The strand metrics give you the rates at which reads are aligned to the sense or anti-sense strands. For most Illumina RNA-Seq experiments (in which the reads are unstranded), it is expected that reads would align in equal proportion (50/50). However, for some stranded protocols, this might not be the case.



6.3 Individual Aligned Data QC

DNA-Seq QC Metrics provides a comprehensive assessment of the alignment data. We also provide metrics such as Flag Summary Statistics, Mapping Summary Statistics and Paired End Insert Size as separate functions, where users can specify more analysis options.

6.4 Coverage Summary Statistics

Additional coverage summary statistics can be generated by going to NGS | Coverage | Coverage Summary Statistics now.



The user can set the size of each coverage bin, and whether to output bedGraph files for use in outside programs.

Leave options as-is and click Submit to continue.

This generates a new table, NgsCoverageReport, which can be used for downstream analysis and visualization.
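To make the binning and bedGraph output more concrete, here is a simplified Python sketch. It counts read start positions per fixed-size bin rather than computing true per-base coverage, and the 100,000 bp default bin size and the function names are assumptions for illustration. The bedGraph lines use the format's 0-based, half-open coordinates.

```python
from collections import Counter


def bin_read_starts(read_starts, bin_size=100_000):
    """Count reads per fixed-size bin.

    read_starts: iterable of (chromosome, 1-based start position).
    Returns a Counter keyed by (chromosome, bin index).
    """
    counts = Counter()
    for chrom, start in read_starts:
        counts[(chrom, (start - 1) // bin_size)] += 1
    return counts


def to_bedgraph(counts, bin_size=100_000):
    """Format bin counts as bedGraph lines (0-based start, exclusive end)."""
    lines = []
    for (chrom, idx), n in sorted(counts.items()):
        lines.append(f"{chrom}\t{idx * bin_size}\t{(idx + 1) * bin_size}\t{n}")
    return lines


if __name__ == "__main__":
    starts = [("22", 23600001), ("22", 23700000), ("22", 23700001)]
    for line in to_bedgraph(bin_read_starts(starts)):
        print(line)
```

With these settings, a 1-based region such as 23600001-23700000 corresponds to the single bedGraph interval 23600000-23700000.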



Open the histogram, filter to Chromosome=22 only under View Controller | Row, and lay out the three charts in 3*1 mode. The result looks like this:

There is a clear enrichment at chr22:23600001-23700000 in SRR064173, which is expected since this sample is a targeted sequencing experiment of the BCR-ABL1 fusion DNA fragment.

In the detail window, users can further check read information in the Genome browser by right clicking the row:




CHAPTER

SEVEN

DNA-SEQ FUSION GENE DETECTION

In DNA-Seq datasets, fusion genes can be detected based on both paired- and single-end reads. In a paired-end NGS dataset, a discordant read pair is one that is not aligned to the reference genome with the expected distance or orientation. If a set of discordant read pairs are mapped to two different genes, a fusion gene is suggested. On the other hand, single-end reads that span the fusion junctions provide base-pair evidence for the fusion events. In paired-end datasets, two fusion reports from junction-spanning reads and discordant read pairs can be combined to eliminate false positives and provide accurate base pair resolution detection of fusion.
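The definition of a discordant read pair can be expressed as a small predicate. The sketch below is illustrative only: the function name and the 1000 bp distance limit are hypothetical, and it assumes the standard Illumina paired-end layout in which the leftmost read maps to the forward strand and its mate to the reverse strand.

```python
def is_discordant(chrom1, pos1, strand1, chrom2, pos2, strand2, max_insert=1000):
    """Return True when a read pair does not look like one normal fragment.

    max_insert is a hypothetical cutoff; a real pipeline would derive it
    from the observed insert size distribution.
    """
    if chrom1 != chrom2:
        return True                      # mates on different chromosomes
    if abs(pos2 - pos1) > max_insert:
        return True                      # mates too far apart
    # Expect forward/reverse orientation with the forward read leftmost.
    if pos1 <= pos2:
        left, right = strand1, strand2
    else:
        left, right = strand2, strand1
    return not (left == "+" and right == "-")


if __name__ == "__main__":
    print(is_discordant("22", 100, "+", "22", 400, "-"))
    print(is_discordant("22", 100, "+", "9", 400, "-"))
```

A fusion caller would then group discordant pairs by the two genes their mates fall in, as described above.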

Two fusion detection functions for DNA-Seq data can be found in the NGS | Fusion menu: Map Fusion Reads (Illumina) and Report Fusion Genes (Paired End). Note that the first function in the menu, Combined Fusion Analysis, is designed only for RNA-Seq data.

7.1 Report Paired-End Fusion Genes

The Report Fusion Genes (Paired End) module detects fusion genes from inter-transcript paired-end reads based on DNA-Seq alignment (NgsData).



Choose the NGS data and Gene model; specify the fusion report cutoff and alignment tie cutoff. Check the Output fusion reads option and specify the directory path; supporting fusion reads will be saved as BAM files, which can be used for a visual check in the genome browser. Leave all other settings as their defaults and click Submit to run the module. The output is a paired fusion report table listed under Table in the Solution Explorer:

The information in the Filter column of the report table comes from a fusion blacklist. For more information about the blacklist, please read the following wiki article:

http://www.arrayserver.com/wiki/index.php?title=Fusion_gene_detection_in_RNA-Seq#Blacklist_filter

In the report table, there are three columns for each sample. The first column shows the number of unique mapping positions from reads in Gene1, the second column shows the number of unique mapping positions from reads in Gene2, and the third column shows the total count of read pairs mapped to that fusion. If reads map to only a small number of unique positions, this could indicate a false positive (potentially PCR duplicates). There are 3*3=9 columns of data, plus annotation columns: gene name, strand and genomic locations. The start and end positions in the table describe the genomic coordinates of the gene, not the breakpoints of the fusion gene. The exact breakpoint cannot be determined by fusion detection based on discordant read pairs.
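The three per-sample values can be sketched as a simple summary over the read pairs supporting one candidate fusion. The function name and input layout are hypothetical and chosen only to illustrate why few unique positions relative to the pair count hints at PCR duplicates.

```python
def summarize_fusion_pairs(pairs):
    """Summarize supporting read pairs for one candidate fusion in one sample.

    pairs: list of (mapping position in Gene1, mapping position in Gene2).
    Returns (unique positions in Gene1, unique positions in Gene2, pair count).
    """
    unique_gene1 = len({p1 for p1, _ in pairs})
    unique_gene2 = len({p2 for _, p2 in pairs})
    return unique_gene1, unique_gene2, len(pairs)


if __name__ == "__main__":
    # Two of the three pairs share identical coordinates, as PCR duplicates would.
    print(summarize_fusion_pairs([(100, 900), (100, 900), (105, 910)]))
```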

Below are the rows for the known BCR-ABL1 fusion identified in the SRR064173 (K562) sample at the genomic level:



The Report Fusion Genes (Paired End) module reports fusion events by grouping gene pairs by rows in one table. It provides an easy way to detect recurrent fusion events when the analysis is run on multiple samples.

7.2 Map Fusion Reads

The Map Fusion Reads module detects fusion genes from fusion junction-spanning reads, which can characterize fusion genes at base pair resolution. It is the preferred approach to detect fusion events, using OmicSoft’s fusion alignment method (FusionMap, Ge, H, et al. Bioinformatics (2011): 1922-1928).

It is not recommended to run the Map Fusion Reads module on multiple samples with different read lengths. In this tutorial, the read length of SRR064173 is 38bp and that of SRR097848/SRR097849 is 50bp. Thus, this tutorial step will focus on fusion junction detection in SRR064173, since it is enriched for BCR-ABL1 fusion regions.

Fusion detection can use raw sequence files (fastq, fasta, or qseq format) or alignment NGS data (BAM/SAM). If the user is using the original FASTQ files, the first step is to filter out normal reads and get a pool of potential fusion reads. Make sure to check the Reads are paired option so that a pair of files will be considered as one sample during fusion detection. The module will automatically pair two files based on file names. If it is unchecked, each file will be treated as one sample.



If the user is using BAM files, potential fusion reads in the alignment (such as reads spanning two nearby genes) and unmapped reads will be extracted for fusion detection. This is the preferred approach, as it saves running time at the filtering step.



Remember to uncheck Reads are RNA-Seq reads. By default, this module is used for RNA-Seq data.

Choose the Gene model to be OmicsoftGene20130723.

Minimal cut size is the minimal seed length for fusion detection, i.e. the minimal length of a fusion read that must map to each of the two fusion partners. For details, see the wiki page:

http://www.arrayserver.com/wiki/index.php?title=Seed_Read

The default Cut size is min(25, max(18, readLength/3)). However, the read length for dataset SRR064173 (K562) is 38. For this tutorial dataset, we use a Fixed cut size of 16 to require at least 16 nucleotides to match each of the two genomic locations, allowing 6 nucleotides in the middle of each read to detect fusion breakpoints.
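The arithmetic behind this choice can be checked with a short Python sketch. The function names are hypothetical, and rounding readLength/3 down to an integer is an assumption; the formula in the text does not say how the division is rounded.

```python
def default_cut_size(read_length):
    """Default cut size: min(25, max(18, readLength/3)).

    Integer division is an assumption made for this sketch.
    """
    return min(25, max(18, read_length // 3))


def bases_left_in_middle(read_length, cut_size):
    """Bases not covered by the two seeds, available to locate the breakpoint."""
    middle = read_length - 2 * cut_size
    if middle < 0:
        raise ValueError("two seeds of this size do not fit in the read")
    return middle


if __name__ == "__main__":
    read_length = 38
    default = default_cut_size(read_length)
    print(default, bases_left_in_middle(read_length, default))
    print(16, bases_left_in_middle(read_length, 16))
```

For a 38 bp read the default evaluates to 18, which leaves only 2 bases between the seeds; the fixed cut size of 16 leaves the 6 bases mentioned above.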

There are more fusion alignment options. Leave all other settings as their defaults and click Submit to run the module. The output is a fusion report table listed under Table in the Solution Explorer:

In the report table, there are three columns for each sample. The first column shows the number of unique mapping positions from fusion junction-spanning reads, the second column shows the number of fusion seed reads, and the third column shows the number of fusion rescued reads. There are 3 columns of data and 11 annotation columns covering fusion strand, fusion breakpoint, known gene/transcript names of the two genes, fusion junction sequence, splice pattern, and predicted fusion gene.

Below is the result for the known BCR-ABL1 fusion identified in the K562 sample at the genomic level:




CHAPTER

EIGHT

DNA-SEQ MUTATION DETECTION

Mutation data can be generated from DNA-Seq data. This allows the user to compare frequencies of mutation, for individual sites, between groups of samples. All mutation functions can be found in NGS | Variation.

In this tutorial, we will cover three mutation detection workflows:

• Summarize Variant Data + Annotated Variant Report

• Generate VCF files + Annotated VCF files

• Summarize Matched Pair Variation Data + Annotated Mutation/SNP Report

Users can find the documentation for other functions by clicking the Help button in each function menu.



8.1 Summarize/Annotate Variant

Summarize Variant Data is developed by OmicSoft. Variants (mutations and SNPs) are reported based on the pileup data from alignment data (NgsData).

Choose the NGS data. In the reference section, all references are selected by default. Users can select a list of regions in which to summarize mutations. Selections can be a Gene list (a list of gene symbols from project lists), Customized regions (load a BED file), or Filtered by region (e.g. chr9:133710831-133763062, or more regions separated by |). Keep the default selection for this tutorial.
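The region syntax shown above (chromosome:start-end, with multiple regions separated by |) is straightforward to parse. The sketch below is a hypothetical helper for preparing or validating such strings outside Array Studio; it is not part of the software.

```python
def parse_regions(text):
    """Parse 'chr:start-end|chr:start-end' into (chromosome, start, end) tuples."""
    regions = []
    for part in text.split("|"):
        chrom, span = part.strip().split(":")
        start_text, end_text = span.split("-")
        start, end = int(start_text), int(end_text)
        if start > end:
            raise ValueError(f"start is after end in region: {part!r}")
        regions.append((chrom, start, end))
    return regions


if __name__ == "__main__":
    print(parse_regions("chr9:133710831-133763062"))
```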

Specify the base and mapping quality cutoff; choose the minimal coverage and mutation options. By default, the module only counts a mutation if the coverage >= 20 reads, # of reads supporting the minor allele >= 5 and the minor allele read frequency >= 5%. Note: you should lower the coverage cutoff if you are using the subset (10%) tutorial dataset.
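The three default cutoffs combine as a simple conjunction, sketched below in Python. The function and parameter names are hypothetical; only the threshold values (20 reads, 5 reads, 5%) come from the text.

```python
def passes_default_cutoffs(coverage, minor_allele_reads,
                           min_coverage=20, min_reads=5, min_frequency=0.05):
    """Return True when a site meets all three default reporting cutoffs."""
    if coverage < min_coverage or minor_allele_reads < min_reads:
        return False
    return minor_allele_reads / coverage >= min_frequency


if __name__ == "__main__":
    print(passes_default_cutoffs(coverage=20, minor_allele_reads=5))
    print(passes_default_cutoffs(coverage=200, minor_allele_reads=5))
    # With the 10% subset, coverage drops roughly tenfold, so a lower
    # coverage cutoff (value chosen here only as an example) is needed.
    print(passes_default_cutoffs(coverage=8, minor_allele_reads=5, min_coverage=5))
```

The second call shows that a site with enough supporting reads can still fail on frequency when coverage is high.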

In the Advanced tab, users can adjust quality by neighbours at each position, check more output options and ask the module to add dbSNP information at this step (dbSNP annotation can also be added in the next step). Users can also adjust the Score cutoff and maximal ratio for the SNP calls.

Omicsoft also provides options to generate VCF files (both merged VCF and individual VCF files). The generated VCF files can be used for further annotation by Array Studio or other tools.



Leave all settings as their defaults and click Submit to run the module. The output is a mutation2Snp report table listed under Table in the Solution Explorer:

In the mutation2Snp report table, there are four columns for each sample:

• MutationFrequency;

• Coverage at this genomic location;

• Percentage of mutation detected on the plus strand (MutationReadOnPlus/MutationReadTotal).

• Genotype for this mutation allele.

If there are multiple minor alleles at the same location with frequency >= cutoff, they will be reported as different rows. The report comes complete with annotation for each site, including chromosome, position, reference nucleotide, and mutation (either a change in sequence or an insertion/deletion). The whole table is a merged report from multiple samples, grouped by mutation site (location + type) in rows. A mutation will be reported if at least one sample contains it with read frequency >= cutoff. The mutation read frequency, even 0%, and coverage in the other samples will also be reported. Dots in the table are missing values, indicating that the coverage at this position in that sample is less than the cutoff.
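The merging rules described above can be sketched as follows. This is an illustration under stated assumptions, not the module's code: the function name and input layout are hypothetical, and the cutoffs reuse the default values of 20 reads coverage, 5 supporting reads and 5% frequency.

```python
def merge_reports(pileups, min_coverage=20, min_reads=5, min_frequency=0.05):
    """Merge per-sample pileup counts into one table keyed by mutation site.

    pileups: {sample: {site: (mutation_reads, coverage)}} where site is any
    hashable key such as (chromosome, position, change).
    A site is kept when at least one sample passes all cutoffs. Samples with
    coverage below the cutoff get ".", others get their frequency (even 0.0).
    """
    def passes(reads, cov):
        return (cov > 0 and cov >= min_coverage and reads >= min_reads
                and reads / cov >= min_frequency)

    sites = sorted({
        site
        for calls in pileups.values()
        for site, (reads, cov) in calls.items()
        if passes(reads, cov)
    })
    table = {}
    for site in sites:
        row = {}
        for sample, calls in pileups.items():
            reads, cov = calls.get(site, (0, 0))
            row[sample] = "." if cov == 0 or cov < min_coverage else reads / cov
        table[site] = row
    return table


if __name__ == "__main__":
    site = ("chr1", 100, "A>G")
    pileups = {
        "S1": {site: (10, 40), ("chr1", 200, "C>T"): (1, 50)},
        "S2": {site: (0, 30)},
        "S3": {site: (2, 10)},
    }
    print(merge_reports(pileups))
```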



The variants are then annotated with gene coding information, known SNPs and functional prediction databases.

Open the NGS | Variation | Annotate Variant Table Report module; choose the mutation2Snp report in Data and OmicsoftGene20130723 as the gene model in Gene model. Leave all settings as their defaults in this tab. The first time the user specifies a reference/gene model, Array Studio will download the files from Omicsoft.

In the Annotation Sources and Additional Sources tabs, users can specify more annotation sources such as functional prediction and COSMIC annotation, or custom mutation annotators built using NGS | Build Mutation Annotator.

Users also have the option to write the annotated mutation result directly to a text file.



Leave all settings as their defaults and click Submit to run the module. The output is a MutationAnnotation report table listed under Table in the Solution Explorer:

Besides the columns for each sample (mutation frequency, coverage, PlusStrand%, and Genotype) in the mutation annotation table, there are a number of annotation columns: Dbsnp ID, Dbsnp Category, Gene name/Transcript ID, Type, Location, Consequence, mutation position in the open reading frame (Position On Cds), AAMutation and RS ID. In addition, there are columns that correspond to the annotations chosen in the Annotation Sources prior to running the analysis. Using the View Controller, choose filters to focus on items of interest.



8.2 Generate VCF files + Annotated VCF files

With Summarize Variant Data, the user can export VCF files, both as a merged file and as individual files, for further analysis.

Under the Advanced tab, the user can choose to generate a VCF for each observation or a merged VCF file. The user can also change the output folder by providing a directory path in the Output folder option.



Leave the options as their defaults and click Submit. Both the individual VCF files and a merged VCF file are generated in the specified output folder:
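An exported file can be sanity-checked outside Array Studio before further annotation. The following Python sketch assumes the export follows the standard VCF layout (meta lines starting with ##, one #CHROM header line, then tab-delimited records). The function name read_vcf_records and the example content are made up for illustration and do not come from the tutorial data.

```python
import io
from typing import Dict, Iterator, TextIO


def read_vcf_records(handle: TextIO) -> Iterator[Dict[str, str]]:
    """Yield one dict per VCF data line, keyed by the header column names."""
    columns = None
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        if line.startswith("#"):
            columns = line[1:].split("\t")  # "#CHROM POS ID REF ALT ..." header
            continue
        if columns is None:
            raise ValueError("VCF header line (#CHROM ...) not found before data")
        yield dict(zip(columns, line.split("\t")))


# Illustrative content only; a real export would be opened with open(path).
example = io.StringIO(
    "##fileformat=VCFv4.1\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "chr1\t100\t.\tA\tG\t.\tPASS\t.\n"
)
for record in read_vcf_records(example):
    print(record["CHROM"], record["POS"], record["REF"], record["ALT"])
```

Counting the yielded records per file is a quick way to confirm that the merged file and the individual files were written as expected.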

Next, the user might want to further annotate the VCF variation. This can be done with the NGS | Variation | Annotate Variant Files(VCF/BED/GTT/RS_ID) module.

Here, users need to specify the VCF file location; note that only the merged VCF file is supported. Users also need to specify the Reference and Gene model in the General tab. In the Annotation Source tab, users can choose extra annotators to further annotate the VCF file.



Leave all other settings as-is, choosing to extract genotype only, and click Submit to continue. A new table is output to the Solution Explorer.

Each gene that shows variation in the mutation report is returned in the resulting table, along with further annotation for that gene, including any known dbSNPs at that position, annotation type, amino acid position (if AA change), AA Change, transcript ID, transcript name, transcript strand, and distance to the 5’/3’ ends and to the closest exon boundary.



8.3 Summarize/Annotate Matched Pair Variation

Summarize Matched Pair Variation Data is implemented based on the principle of VarScan2, so the options are the same as in VarScan2, including a few pileup and filtering options.

The module requires a matched normal sample. It was initially designed to detect somatic mutations in tumor samples compared to matched normal samples. It can also be applied to detect mutations in other cases, such as comparing induced pluripotent stem cells (iPSC) vs. somatic cells. To incorporate the sample information, the user has to prepare a design table and import it for NgsData. Double-click the Table under Design to show the design table. To import a new design table, right-click the Design folder under NgsData and choose Import:

The design file for this tutorial is located in the downloaded zip file. Once a design file (usually a tab-delimited file) is selected and imported, the user can choose to replace or append to the existing design table:

Click OK to import the design table.



In this tutorial, only SRR097848 and SRR097849 are paired: SRR097849 is from the breast cancer cell line MCF7, while SRR097848 is from the non-tumor breast cell line MCF10A. Left-click to select the two NGS samples:

Both sample IDs will be highlighted. Then open the NGS | Variation | Summarize Matched Pair Variation Data module:

Choose the NGS data and change Observations to Selected observations only. The analysis will use the two NGS samples selected in the design table.

Specify the Pair based on the Tissue column (Breast for both) and the Tumor status based on the cell type column, and choose the Non-tumor epithelium cell lines factor level to be Normal. There are more pileup, variation calling, and filtering options in the Advanced tab.
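The pairing logic behind these options can be sketched in Python. This is an illustration only, not how Array Studio reads the design table: the ID column name, the tumor factor level name, and the function match_pairs are assumptions. Only the two sample IDs, the Tissue value, and the Non-tumor epithelium cell lines level come from the tutorial.

```python
import csv
import io
from collections import defaultdict
from typing import Dict

# Tab-delimited design table. The "ID" header and the tumor level name are
# hypothetical; the real design file ships in the tutorial zip file.
DESIGN_TSV = (
    "ID\tTissue\tcell type\n"
    "SRR097848\tBreast\tNon-tumor epithelium cell lines\n"
    "SRR097849\tBreast\tBreast cancer cell lines\n"
)


def match_pairs(design_text: str,
                pair_column: str = "Tissue",
                status_column: str = "cell type",
                normal_level: str = "Non-tumor epithelium cell lines"
                ) -> Dict[str, Dict[str, str]]:
    """Group samples by the pair column and label each as normal or tumor."""
    rows = list(csv.DictReader(io.StringIO(design_text), delimiter="\t"))
    if len(rows) % 2 != 0:
        raise ValueError("odd number of observations; expecting matched pairs")

    groups = defaultdict(list)
    for row in rows:
        groups[row[pair_column]].append(row)

    pairs = {}
    for key, members in groups.items():
        normals = [r["ID"] for r in members if r[status_column] == normal_level]
        tumors = [r["ID"] for r in members if r[status_column] != normal_level]
        if len(normals) != 1 or len(tumors) != 1:
            raise ValueError(f"pair {key!r} needs exactly one normal and one tumor sample")
        pairs[key] = {"normal": normals[0], "tumor": tumors[0]}
    return pairs


print(match_pairs(DESIGN_TSV))
```

The sketch makes the role of each option explicit: the pair column decides which samples belong together, and the chosen factor level decides which member of each pair is treated as the normal sample.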

Leave all settings as their defaults and click Submit to run the module. The output is a matched pair variation (MPV) report table listed under Table in the Solution Explorer:



In the MPV report table, there are ten columns for each sample:

• Minor allele MutationFrequency in normal

• Coverage in normal

• Minor allele MutationFrequency in tumor

• Coverage in tumor

• Somatic P value for somatic or LOH events

• Variant P value from testing whether the variant allele exists in at least one of the (two) samples

• Filters for the status of mutation calling, such as strandness and mapping quality difference

• Somatic status call (Germline, Somatic, LOH, or Unknown)

• Predicted genotype in normal

• Predicted genotype in tumor

If there are multiple minor alleles at the same location with frequency >= cutoff, they are reported as separate rows. The report includes basic annotation for each site: chromosome, position, reference nucleotide, and mutation (either a change in sequence or an insertion/deletion).

In this tutorial, only one pair of samples is analyzed. If multiple pairs are analyzed in the same run, the whole table is a merged report from multiple samples, grouped by mutation site (location + type) in rows. A mutation is reported if at least one sample contains it with read frequency >= cutoff. The mutation read frequency (even 0%) and the coverage in the other samples are also reported. Dots in the table are missing values, indicating that the coverage at this position in that sample is less than the cutoff.
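To give a feel for what the Somatic P value column measures, the sketch below computes a one-sided Fisher's exact test on tumor and normal read counts, which is the kind of test VarScan2 documents for its somatic P value. Whether Array Studio computes the column in exactly this way is an assumption; the function name and the read counts are hypothetical.

```python
from math import comb


def fisher_exact_greater(tumor_var: int, tumor_ref: int,
                         normal_var: int, normal_ref: int) -> float:
    """One-sided Fisher's exact test.

    Returns the probability of observing at least `tumor_var` variant reads
    in the tumor sample, given the row and column totals of the 2x2 table.
    """
    tumor_total = tumor_var + tumor_ref
    var_total = tumor_var + normal_var
    total = tumor_total + normal_var + normal_ref
    denominator = comb(total, var_total)
    upper = min(tumor_total, var_total)
    # Sum hypergeometric probabilities for k = tumor_var .. upper.
    numerator = sum(
        comb(tumor_total, k) * comb(total - tumor_total, var_total - k)
        for k in range(tumor_var, upper + 1)
    )
    return min(1.0, numerator / denominator)


# Hypothetical counts: 10 of 20 tumor reads carry the variant, 0 of 20 normal reads.
p_value = fisher_exact_greater(10, 10, 0, 20)
print(f"somatic p-value sketch: {p_value:.3g}")
```

A small value indicates that the variant allele is over-represented in the tumor reads relative to the normal reads, which is the pattern expected for a somatic event.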

As with the mutation table, the MPV table can also be annotated with the NGS | Variation | Annotate Variant Table Report module. Open the module, then choose the MPV mutation report in Data, the UCSC gene model in Gene model, and v137 in DBSNP version. The user also needs to select the data columns in the mutation report; however, Array Studio should select them automatically in most cases. Users can also specify an Output name such as “MPV”.



Leave all other settings as-is and click Submit to continue. An annotated MPV mutation table with the specified name (MPV) is output to the Solution Explorer:

Besides the columns in the MPV mutation table, there are the following annotation columns: gene/transcript name, dbSNP name, mutation type, mutation position in the open reading frame, amino acid position and change in the transcript, and distance to the 5’ and 3’ ends of the transcript and to the closest exon boundary. Annotation columns from functional prediction and COSMIC are also attached if these options are checked in the Advanced tab.






CHAPTER

NINE

DNA-SEQ COPY NUMBER ANALYSIS

DNA-Seq copy number analysis requires a matched normal sample. The module was initially designed to detect copy number changes in tumor samples compared to matched normal samples from exon-captured DNA-Seq data.

There are two steps in copy number analysis: Summarize Copy Number (Whole Exome Sequencing or Target Sequencing) and Segment Copy Number:

9.1 Summarize Copy Number

Check that the two samples are still selected in the NgsData Design Table.



Summarize Copy Number summarizes the copy number log2 ratio between two samples. In Summarize Copy Number, choose the NGS data and set Observations to Selected observations only. The analysis will use the two NGS samples selected in the design table. If the Observations option is not set this way for the tutorial data, the module reports an error: Observation number cannot be divided by two (expecting matched pairs).

Specify the Pair based on the Tissue column (Breast for both) and the Tumor status based on the cell type column, and choose the Non-tumor epithelium cell lines factor level to be Normal. There are more pileup and copy number options in the Advanced tab.

Leave all other settings as their defaults and click Submit to run the module. The output is a copy number report table listed under Table in the Solution Explorer:

The report contains the observation ID, copy number log2 ratio, predicted copy number (a normal human genome has 2 copies), coverage in the tumor and normal samples, and genomic bin start/end.
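The relationship between the log2 ratio and the predicted copy number columns can be sketched as follows. This is a minimal illustration under stated assumptions, not the module's formula: it assumes coverage is normalized by each sample's total coverage, that a small pseudocount guards against empty bins, and that the normal sample carries 2 copies. The function names and numbers are hypothetical.

```python
from math import log2


def log2_ratio(tumor_cov: float, normal_cov: float,
               tumor_total: float, normal_total: float,
               pseudocount: float = 0.5) -> float:
    """log2 of the depth-normalized tumor/normal coverage ratio for one bin."""
    tumor_fraction = (tumor_cov + pseudocount) / tumor_total
    normal_fraction = (normal_cov + pseudocount) / normal_total
    return log2(tumor_fraction / normal_fraction)


def predicted_copy_number(ratio_log2: float, normal_copies: int = 2) -> float:
    """Convert a log2 ratio to a copy number, assuming a diploid normal sample."""
    return normal_copies * 2 ** ratio_log2


# Hypothetical bin: twice the normalized coverage in tumor compared to normal.
ratio = log2_ratio(200, 100, 1000, 1000, pseudocount=0)
print(ratio, predicted_copy_number(ratio))
```

Under these assumptions a log2 ratio of 0 corresponds to 2 copies, +1 to 4 copies, and -1 to a single copy, which helps when sorting the report by Log2 Ratio or Copy Number.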



As shown in the GUI, the user can also select a baseline sample as a common control and summarize copy number by comparing every other sample to that control. Users can sort the data by Log2 Ratio or Copy Number to find regions with large differences in copy number. Right-click the row ID of a CNV to visualize the pileup in the Genome Browser.

9.2 Segment Copy Number

The CNV Segmentation command generates segmentation results for Log2Ratio copy number data. The contiguous small segments/regions are processed and joined by a segmentation algorithm (similar to circular binary segmentation).
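The idea of joining contiguous bins into segments can be illustrated with a deliberately simple greedy merge. This is not circular binary segmentation and not Array Studio's algorithm; it only shows how neighboring bins with similar log2 ratios collapse into one segment. The threshold max_diff, the data layout, and the function name merge_bins are hypothetical.

```python
from typing import List, Tuple

# Hypothetical bin layout: chromosome, start, end, log2 ratio.
Bin = Tuple[str, int, int, float]
# Segment layout: chromosome, start, end, mean log2 ratio, number of bins.
Segment = Tuple[str, int, int, float, int]


def merge_bins(bins: List[Bin], max_diff: float = 0.3) -> List[Segment]:
    """Greedily join neighboring bins whose log2 ratio is close to the running mean.

    Assumes `bins` is already sorted by chromosome and start position.
    """
    segments: List[Segment] = []
    for chrom, start, end, value in bins:
        if segments:
            s_chrom, s_start, _s_end, s_mean, s_count = segments[-1]
            if s_chrom == chrom and abs(value - s_mean) <= max_diff:
                # Extend the current segment and update its running mean.
                new_mean = (s_mean * s_count + value) / (s_count + 1)
                segments[-1] = (chrom, s_start, end, new_mean, s_count + 1)
                continue
        # Start a new segment on a chromosome change or a jump in log2 ratio.
        segments.append((chrom, start, end, value, 1))
    return segments


example_bins = [
    ("chr1", 1, 100, 0.0),
    ("chr1", 101, 200, 0.1),
    ("chr1", 201, 300, 1.0),
    ("chr1", 301, 400, 1.1),
    ("chr2", 1, 100, 1.0),
]
for segment in merge_bins(example_bins):
    print(segment)
```

A real segmentation method tests change points statistically rather than using a fixed threshold, so the segment report produced by the module should be treated as the reference result.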

In NGS | Copy Number | Segment Copy Number, choose the CopyNumberReport data.



Leave all other settings as their defaults and click Submit to run the module. The output is a segment report table listed under Table | Segment in the Solution Explorer:

By default, there are Table, Scatter, Segment, and SegmentChromosome views for the report. Open the SegmentChromosome view and click on any segment to show details.






CHAPTER

TEN

SAVE & CLOSE PROJECT

Go to File Menu | Save to save your results. Please refer to the MicroArray tutorial for more details on the Audit Trail, which records all the analysis steps in the form of an Omic script.

Congratulations! You are done with the analysis. You can reopen this project later to return to the state in which you saved it. This includes all views, filters, analyses, etc.

This tutorial represents just a piece of what Array Studio is capable of. Feel free to try different options in the Task tab or the NGS menu to get a feel for what Array Studio can do. For additional information, don’t hesitate to contact Omicsoft’s support team ([email protected]).

Thank you for using Array Studio. Please contact Omicsoft Support ([email protected]) for support questions or Omicsoft Sales ([email protected]) for sales-related questions.
