of 42
7/27/2019 Lecture 2102
1/42
Lecture 2
DNA sequencing
Whole genome sequencing
Sequence data formats
SeqLab
7/27/2019 Lecture 2102
2/42
DNA Sequencing
Sanger's method was invented in 1977. It uses purified
template DNA, a DNA primer, deoxynucleotide triphosphates
(dNTPs), di-deoxynucleotide triphosphates (ddNTPs),enzymes (Taq polymerase), and gel electrophoresis.
dNTPs + ddNTPsenzymes
7/27/2019 Lecture 2102
3/42
primer
= a piece of RNA or DNA whose purpose is
to allow the enzyme DNA polymerase to
make new DNA. It "primes" the reaction.
7/27/2019 Lecture 2102
4/42
Base
OH
BaseOO P O
O
...
DNA Sequencing
Base
BaseOO P O
O
XdNTP dNTP
dNTP ddNTP
Taq polymerase adds the complementary nucleotide (normal
or di-deoxy) to the 3' end of the growing strand. If the 3' endhas no 3'-OH group, then growth stops.
7/27/2019 Lecture 2102
5/42
results of polymerization rx:
GCCGATCTAGAAATCTAAGAGGAGAG
AGGCTAGATCTTTAGATTCTCCTCTC
AGGC*
AGGCTAGATC*
AGGCTAGATCTTTAGATTC*
AGGCTAGATCTTTAGATTCTC*
AGGCTAGATCTTTAGATTCTCC*
AGGCTAGATCTTTAGATTCTCCTC*
template
reverse
complement
d
dCTP-termin
atedfragmen
ts
3'
5'
5'
3'
7/27/2019 Lecture 2102
6/42
Separating fragments on a gel
larger
smaller
Gel (or capillary)
electrophoresis separates the
fragments by charge, which is
proportional to length.
Now the sequence can be read
from the gel. Top to bottom is
the reverse complement
sequence. Bottom to top is thetemplate sequence if we
switch the labels of the lanes
to the complement bases.
ddA ddT ddC ddG
A
G
G
C
T
A
G
7/27/2019 Lecture 2102
7/42
DNA SequencingGels can resolve around 500 bases.
dNTP/ddNTP ratio is optimized so that ~1 ddNTP is
incorporated every ~500 bases.
Fragments are visualized by
(1) Using 32P dNTPs in the primer. 32P gives off beta
radiation, which is captured when the gel is layed out on a
piece of film. b won't pass through glass, so the gel must beremoved from the gel apparatus.
(2) Using fluorescent dyes attached to the primer.
The fluorescence is detected by scanning the gel or capillary.
No need to remove the gel.
7/27/2019 Lecture 2102
8/42
Sequence trace
F
Each color is one lane of an electrophoresis gel.
7/27/2019 Lecture 2102
9/42
base-calling
"I call this a G"
"N" = no call, trace
not good enough to
call the base.
"phred" (by Phil Green) is the most widely used base-calling program.
7/27/2019 Lecture 2102
10/42
Naive strategy for sequencing long
pieces of DNAOrder a primer to start the sequencing process.
Sequence ~500 bases.
Design a primer based on the last good part of the sequence.
Order that primer. (wait a few weeks for delivery)
Repeat.
Time required to sequence a small genome (E. coli) this way if
delivery takes one week: 9200 weeks = 176 years
7/27/2019 Lecture 2102
11/42
Whole Genome SequencingDifferent strategies have been tried. We will discuss one:
Shotgun Sequencing.
Start with purified DNA.Shear it into random sized fragments.
Clone fragments into yeast artificial chromosomes (YAC)
orbacterial artificial chromosomes (BAC). Grow
these up.
Sequence the ends of these fragments, and determine the total fragment
size. (These pairs of fragments are calledBAC ends orYAC ends)
Assemble the fragments.
7/27/2019 Lecture 2102
12/42
The Sequence Assembly Problem
Imagine several copies of a book are cut by
scissors into 10 million small pieces. Each copy
is cut in an individual way so that a piece from
one copy may overlap a piece from another
copy. Assuming that 1 million pieces are lost
and the remaining 9 million are splashed with
ink.
Try to recover the original text.
7/27/2019 Lecture 2102
13/42
Sequence assembly strategy
Sequence at least 10 times as much DNA as contained in the
genome. i.e. If the genome has 4.6 Mb (mega-bases) then
sequence 46 Mb. This is called "10-fold redundancy".
Find all overlapping sequences. (sometimes the overlap is
ambiguous)
If the overlap is ambiguous on one end of the BAC or YAC,
the ambiguity can be resolved using the other end.
Errors in assembly can still occur in highly repetitive
regions of the genome (such as near the centromeres).
7/27/2019 Lecture 2102
14/42
Bacterial Artificial Chromosomes
TTAGCTGATACAGGGGCTCAAA
GGGGCTCAAAGTGTCACACATTCA
Size of BAC is known, so distancebetween BAC ends is known
Only 500 bp on the ends of each BACare sequenced
Relative position of BACs is determined by sequence overlap
7/27/2019 Lecture 2102
15/42
Alignment
7/27/2019 Lecture 2102
16/42
Contig
end of one contig
start of next contig
}no data zone
Throughout a "contig" there is a continuous
tiling of overlapping fragments with no gaps
in the data.
7/27/2019 Lecture 2102
17/42
Sources of sequence data
NCBI Washington,DC GenBank
EMBL Heidelberg, Germany EMBL
NIG Shizuoka-ken, Japan DDBJ
Members ofInternational Nucleotide Sequence Database Collaboration
Web sites:
NCBI www.ncbi.nlm.nih.gov
EMBL www.embl-heidelberg.de
DDBJ www.ddbj.nig.ac.jp
7/27/2019 Lecture 2102
18/42
NCBI tour
Log onto yourmodlab (client) machine.
Start a web browser.
Set the browser to
www.ncbi.nlm.nih.gov
And follow along.
S fil hi d bl
7/27/2019 Lecture 2102
19/42
"machine readable" files...
are keyworded
have space delimited fields
contain special characters like /, :,=,{}, etc (/product)
contain database identifiers, accession number
(gi:123456789)
sometimes have a checksum, to guard against corruption.
Sequence files are machine readable
Generally it is better to let the machine do the reading.
7/27/2019 Lecture 2102
20/42
>> > >
>
>>>
>
server
clients
The prompt indicates which machine you are talking to.
7/27/2019 Lecture 2102
21/42
SCP a file to the server
Send the file "lotsofjunk" to bioinf45.bio.rpi.edu as follows:
scp lotsofjunk [email protected]:lotsofjunk
where n is the same as the number of yourmodlab machine(i.e. if you are on modlab16, use bio454016)
Enter your password: (see whiteboard)
7/27/2019 Lecture 2102
22/42
Start SeqLab on the server
client_prompt>ssh -X [email protected]
where n is the same as the number of yourmodlab machine(i.e. if you are on modlab16, login as bio454016)
password: ******* (see whiteboard)
server_prompt>seqlab &
How to start seqlab on the bioinf45 server
background it
7/27/2019 Lecture 2102
23/42
Show otherdisplay
windows
change your
working
directory
and fonts
add-onsHundreds of
functions
some ofthese are
button
functions
Open,
Save,Import,
Export,
Print, etc
Intro to SeqLab
This introduction covers only the basics. Check out the
SeqLab Guide for detailed descriptions.
File Edit Functions Extensions Options Windows
MAIN MENU
7/27/2019 Lecture 2102
24/42
Two SeqLab modes:
Editor The editor window. Where the action is.
Main list A list of available files for editing.
7/27/2019 Lecture 2102
25/42
Basic SeqLab operations
Get a sequence from the local database:
File-->Add Sequence From-->Databases
(only GenEMBL database is actually present)
Select one. Add to Main Window.
Get a sequence from a file:
File-->Import
Use "Filter" button and line to view the files you want.
SeqLab recognizes GenBank format.
7/27/2019 Lecture 2102
26/42
Basic SeqLab operationsSelect a sequence: click on sequence name.
Select a region of a sequence: click and drag. Or click on the
start, then shift-click on the end.
Move a sequence: select a position and hitspace ordelete.(protections may have to be set. Hit the lockicon)
Cut, copy, paste: use icons. (There are two copy buffers, one
for wholesequences, one forregions. SeqLab will ask you
which one you want.)
Create a new sequence: File-->New Sequence (select DNA,
RNA, or protein). Click on blank sequence (~) and type or
paste.
7/27/2019 Lecture 2102
27/42
SeqLab windows
Job manager -- use this to check the status of a job
request, or to kill a job that is taking too long.
Output manager -- find the output of a job.Database browser -- find a sequence if you know
the database and sequence identifier/accession
number.
Traces -- look at trace files
Features -- locate annotated sequence features such
as ORFs, SNPs. Long sequences can be color-coded
by features.
7/27/2019 Lecture 2102
28/42
In class exercise: Seqlab
In the Editor, import the file "lotsofjunk" (GenBank format)
Answer the following:
What kind of sequence is it?
What is the GenBank identifier(accession number) of this
sequence?
What does the keyword CDS mean?
What is its G/C content?
7/27/2019 Lecture 2102
29/42
In class exercise: Seqlab
Display using graphic features and scale so that the whole
sequence fits in the window.
Open the Features window and edit the features as follows:Make all polyproteins purple (with different fills).
Make all BGI-PUPs lime-green.
Make all nucleocapsid and envelope proteins orange.
Make any feature shorter than 100 bases a diamond.
7/27/2019 Lecture 2102
30/42
Edit-->Find
Display as monochrome text, with 1:1 scaling.
Find all A-rich regions with up to 3 mismatches.
enter "AAAAAAAAAAAA"
and allow 3 mismatches. Note coloring. What does it
mean?
Find all potential start codons "atg". Which ones are at thestart of CDS's?
7/27/2019 Lecture 2102
31/42
Save
Save your modified sequence in a new file called "sars.ref"
Explore other functions using the Help button.
You may also use one of the copies of the SeqLab Guide.
7/27/2019 Lecture 2102
32/42
Dot Matrix
Each position in the matrix D[i,j] is either
has a dot, if A[i] == B[j]
or blank, otherwise.
AAGACGTTTAGA
CGTACT
7/27/2019 Lecture 2102
33/42
A more advanced dot matrix
Seqlab function "Compare"
With thousands of bases, it is impossible to plot all dots in
the matrix. Instead we look for stretches of sequence withfew mismatches. If the number of mismatches is less than
the cutoff, plot a dot or line.
AAGACGTTTAGACGTA
CT
All diagonals with
at least 4 out of 5
matches.
In SeqLab:
"window" is the
length of adiagonal,
"stringency" is
minimum number
of matches.
7/27/2019 Lecture 2102
34/42
What you can do with a DNA dot matrix
Find regions of self-similarity (microrepeats, paralogs)
Find regions of complementarity (RNA secondary
structure)
Find the locations of genes between two genomes.
Find Non-sequential alignments
Weaknesses:Dot plots show only the most obvious similarities.
Dot plots are not alignments, yet.
In class exercise:
7/27/2019 Lecture 2102
35/42
Dot matrix for a phageIn SeqLab:
File-->Add Sequence From-->Databases
GenEMBL, phage (click "Show matching entries")
Select first entry (em_ph:s66725)Add to Main Menu
Select the first windowfull (about 80-90 bases), copy it and
use it to make a New DNA sequence.
File-->New Sequence (choose DNA)select a position in the empty sequence (~)
paste
In class exercise:
In class exercise:
7/27/2019 Lecture 2102
36/42
Making the reverse compliment
Copy the sequence. (select, copy, paste) Now you have two
identical sequences.
Make the reverse compliment of the second one. (Edit--
>Reverse. Select "reverse and compliment")
In class exercise:
In class exercise:
7/27/2019 Lecture 2102
37/42
Making a dot plot
Make a dotplot. Select both sequences.
Function-->Pairwise comparison-->Compare
Options...
set Window = 1 (close, Run)Check progress using the Jobs Window
Display the Dotplot
Repeat using Window=8, Stringency=5
Then repeat again using Window=16, stringency=10.
Where is the longest region of self-complimentarity?
In class exercise:
7/27/2019 Lecture 2102
38/42
Self-hybridization plot
Dotplot for randomly-selected DNA and its reverse complement.
DNA sequence
reversecomplementsequence
In class exercise:
7/27/2019 Lecture 2102
39/42
Dot matrix for two viral genomesIn SeqLab:
Import the SARS genome (File-->Import, select lotsofjunk)
Download another coronavirus from the NCBI website.
Search Genome for coronavirus.
Select "Porcine epidemic diarrhea virus" NC_003436
Click on NC_003436 to go to the sequence.
Display GenBank. Send to File.
(Rename it, say,pig.gbk. Copy it to bioinf45)
In SeqLab: File-->Import
Select the two viral genomes, and run
Functions-->Pairwise-->Compare
Use window length =50 andstringency= 30
(Why?)
Try other settings.
In class exercise:
7/27/2019 Lecture 2102
40/42
Parametric search
7/27/2019 Lecture 2102
41/42
Dotplot between two viruses
Results of SeqLabs Compare program using window=50, stringency=26
7/27/2019 Lecture 2102
42/42
Homework 1: Exercise in database searching
Do Problem 1 (a throughj), p.61 in Mount, Bioinformatics2nd Ed.
Write down the number of hits for steps b-i.
Copy the search results into SeqLab for further manipulation.
See details on course web page.