Dynamic Programming for Pairwise Alignment 2


Dr Alexei Drummond

Department of Computer Science

alexei@cs.auckland.ac.nz

Semester 2, 2006

Review

Dynamic programming algorithm for global alignment (Needleman & Wunsch)

Given sequences:

X = (x_1, x_2, ..., x_m) and Y = (y_1, y_2, ..., y_n)

F(i, j) = score of best alignment between (x_1, x_2, ..., x_i) and (y_1, y_2, ..., y_j)

Principle of Optimality

Optimal alignment of

x_1, x_2, x_3, ..., x_i
y_1, y_2, y_3, ..., y_j

has score F(i, j).

Looks like an optimal alignment of

x_1, x_2, x_3, ..., x_{i−1}
y_1, y_2, y_3, ..., y_{j−1}

followed by x_i aligned with y_j: F(i−1, j−1) + s(x_i, y_j)

or an optimal alignment of

x_1, x_2, x_3, ..., x_i
y_1, y_2, y_3, ..., y_{j−1}

followed by y_j aligned with a gap: F(i, j−1) − d

or an optimal alignment of

x_1, x_2, x_3, ..., x_{i−1}
y_1, y_2, y_3, ..., y_j

followed by x_i aligned with a gap: F(i−1, j) − d

so

F(i, j) = max { F(i−1, j−1) + s(x_i, y_j),  F(i, j−1) − d,  F(i−1, j) − d }

Boundary conditions (a prefix aligned entirely against gaps):

x_1, x_2, x_3, ..., x_i
−    −    −    ...  −

F(i, 0) = F(i−1, 0) + s(x_i, −)

y_1, y_2, y_3, ..., y_j
−    −    −    ...  −

F(0, j) = F(0, j−1) + s(−, y_j)

F(0, 0) = 0
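The recurrence and boundary conditions above can be sketched in Python as follows. This is a minimal illustration, not the lecture's own code: it assumes a linear gap penalty d (so s(x_i, −) = s(−, y_j) = −d) and a simple match/mismatch score instead of a substitution matrix. The name `nw_fill` and the default scores are hypothetical.

```python
def nw_fill(x, y, d=8, match=5, mismatch=-4):
    """Fill the (m+1) x (n+1) global-alignment table F.

    Assumption: s(a, b) is `match` if a == b, else `mismatch`,
    and every gap symbol costs d.
    """
    m, n = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    # Boundary: a prefix of X (or Y) aligned entirely against gaps.
    for i in range(1, m + 1):
        F[i][0] = F[i - 1][0] - d
    for j in range(1, n + 1):
        F[0][j] = F[0][j - 1] - d
    # Interior: best of the three ways the alignment can end.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,   # x_i aligned with y_j
                          F[i][j - 1] - d,       # y_j aligned with a gap
                          F[i - 1][j] - d)       # x_i aligned with a gap
    return F


F = nw_fill("AC", "AGC")
print(F[-1][-1])  # optimal global alignment score
```

The bottom-right entry F[m][n] is the optimal score; the rest of the table is kept so that an alignment can be recovered by traceback.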

Filling up table

[Figure: F matrix with columns 0, 1, 2, ..., n; label: optimal alignment score.]

Constructing alignment

[Figure: F matrix with columns 0, 1, 2, ..., n.]

Example

F matrix

          H    E    A    G    A    W    G    H    E    E
     0   -8  -16  -24  -32  -40  -48  -56  -64  -72  -80
P   -8   -2   -9  -17  -25  -33  -41  -49  -57  -65  -73
A  -16  -10   -3   -4  -12  -20  -28  -36  -44  -52  -60
W  -24  -18  -11   -6   -7  -15   -5  -13  -21  -29  -37
H  -32  -14  -18  -13   -8   -9  -13   -7   -3  -11  -19
E  -40  -22   -8  -16  -16   -9  -12  -15   -7    3   -5
A  -48  -30  -16   -3  -11  -11  -12  -12  -15   -5    2
E  -56  -38  -24  -11   -6  -12  -14  -15  -12   -9    1

Alignment of X and Y:

H E A G A W G H E - E
- - P - A W - H E A E
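Constructing the alignment means walking back from F(m, n) to F(0, 0), at each cell choosing a predecessor that explains its value. The sketch below is illustrative only: it assumes the same simple match/mismatch scoring and linear gap penalty d as a stand-in for the substitution matrix used in the example, and the name `nw_align` is hypothetical. When several predecessors tie, it prefers the diagonal, so it returns one optimal alignment among possibly many.

```python
def nw_align(x, y, d=8, match=5, mismatch=-4):
    """Global alignment by table fill plus traceback (illustrative scoring)."""
    def s(a, b):
        return match if a == b else mismatch

    m, n = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        F[i][0] = -i * d
    for j in range(1, n + 1):
        F[0][j] = -j * d
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            F[i][j] = max(F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),
                          F[i][j - 1] - d,
                          F[i - 1][j] - d)

    # Traceback from the bottom-right corner to the top-left corner.
    ax, ay = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif j > 0 and F[i][j] == F[i][j - 1] - d:
            ax.append("-"); ay.append(y[j - 1]); j -= 1   # y_j against a gap
        else:
            ax.append(x[i - 1]); ay.append("-"); i -= 1   # x_i against a gap
    return F[m][n], "".join(reversed(ax)), "".join(reversed(ay))


score, top, bottom = nw_align("AC", "AGC")
print(score)
print(top)
print(bottom)
```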

Time and space

[Figure: F matrix with columns 0, 1, 2, ..., n.]

(m+1) × (n+1) table entries ⇒ Θ(mn) space

Each entry computed in constant time ⇒ Θ(mn) time

Smith & Waterman algorithm

Computes local alignment.

i.e. look for the best alignment of subsequences of X and Y, ignoring scores of regions on either side.

Best subsequence alignment

Recurrences

F(i, j) = max { 0,  F(i−1, j−1) + s(x_i, y_j),  F(i, j−1) − d,  F(i−1, j) − d }

The 0 option starts a new local alignment whenever extending the current one would score below zero.

Basis: F(i, 0) = F(0, j) = 0
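A minimal Python sketch of local alignment, assuming the standard Smith & Waterman formulation in which every cell may also take the value 0 and the traceback starts at the highest-scoring cell and stops at the first 0. The match/mismatch scores stand in for a substitution matrix, and the name `sw_align` is hypothetical.

```python
def sw_align(x, y, d=8, match=5, mismatch=-4):
    """Best local alignment score and one alignment achieving it."""
    def s(a, b):
        return match if a == b else mismatch

    m, n = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]   # basis: F(i,0) = F(0,j) = 0
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            F[i][j] = max(0,
                          F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),
                          F[i][j - 1] - d,
                          F[i - 1][j] - d)
            if F[i][j] > best:
                best, best_pos = F[i][j], (i, j)

    # Traceback from the best cell until a cell with value 0 is reached.
    ax, ay = [], []
    i, j = best_pos
    while i > 0 and j > 0 and F[i][j] > 0:
        if F[i][j] == F[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i][j - 1] - d:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
        else:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))


print(sw_align("TTACGTT", "GGACGGG"))
```

Only the shared core ACG is reported; the unrelated flanking regions contribute nothing to the score.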

Example

F       H   E   A   G   A   W   G   H   E   E
    0   0   0   0   0   0   0   0   0   0   0
P   0   0   0   0   0   0   0   0   0   0   0
A   0   0   0   5   0   5   0   0   0   0   0
W   0   0   0   0   2   0  20  12   4   0   0
H   0  10   2   0   0   0  12  18  22  14   6
E   0   2  16   8   0   0   4  10  18  28  20
A   0   0   8  21  13   5   0   4  10  20  27
E   0   0   6  13  18  12   4   0   4  16  26

Alignment of X and Y:

A W G H E
A W - H E

Repeated (local) matches

Long sequences: we are interested in all local alignments with a significant score, i.e. score > threshold T.

e.g. copies of repeated domain or motif in a protein.

X = sequence containing motif

Y = target sequence

Method is asymmetric

Matching parts of X

Given sequences

X = (x_1, x_2, ..., x_m) and Y = (y_1, y_2, ..., y_n)

Define F(i, j) (i ≥ 1) = best sum of match scores in (x_1, x_2, ..., x_i) and (y_1, y_2, ..., y_j), assuming y_j is in a matched region and match ends in x_i and y_j

Ends of matches

F(0, 0) = 0

F(0, j) = best sum of completed match scores to (y_1, y_2, ..., y_j), assuming that y_j is not in a matched region

F(0, j) = max { F(0, j−1),  F(i, j−1) − T for i = 1, ..., m }

Row 0 therefore marks unmatched regions and ends of matches in Y.

General recurrence

F(i, j) = max { F(0, j),  F(i−1, j−1) + s(x_i, y_j),  F(i, j−1) − d,  F(i−1, j) − d }

F(0, j): start of new match

Other three options: extension of previous match
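The two recurrences for repeated matches can be sketched as below. The table is filled column by column, because F(0, j) depends on all of column j−1. This is only an illustration under stated assumptions: match/mismatch scores replace the substitution matrix, F(i, 0) is taken to be 0 for i ≥ 1 (as in the first column of the example table), and an extra cell F(0, n+1) collects the final total. The name `repeated_matches` is hypothetical.

```python
def repeated_matches(x, y, T=20, d=8, match=5, mismatch=-4):
    """Total score of all matches of parts of X in Y, each reduced by T.

    x is the sequence containing the motif (rows), y the target (columns).
    """
    m, n = len(x), len(y)
    # Column n+1 is used only in row 0, as the extra cell for the final total.
    F = [[0] * (n + 2) for _ in range(m + 1)]
    for j in range(1, n + 2):
        # Row 0: stay unmatched, or end a match that finished in column j-1.
        F[0][j] = max([F[0][j - 1]] + [F[i][j - 1] - T for i in range(1, m + 1)])
        if j == n + 1:
            break
        for i in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[0][j],                 # start of new match
                          F[i - 1][j - 1] + s,     # extension of previous match
                          F[i][j - 1] - d,
                          F[i - 1][j] - d)
    return F[0][n + 1]


# Two copies of the motif ACGT, each scoring 20 - T = 10 with T = 10.
print(repeated_matches("ACGT", "ACGTTTTTACGT", T=10))
```

Raising T suppresses weak matches: with T above the best attainable match score, no match pays for itself and the total stays 0.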

Filling up table

[Figure: F matrix with columns 0, 1, 2, ..., n, built up over successive slides; label: sum of alignment scores.]

Example

F       H   E   A   G   A   W   G   H   E   E
    0   0   0   0   1   1   1   1   1   3   9
P   0   0   0   0   1   1   1   1   1   3   9
A   0   0   0   5   1   6   1   1   1   3   9
W   0   0   0   0   2   1  21  13   5   3   9
H   0  10   2   0   1   1  13  19  23  15   9
E   0   2  16   8   1   1   5  11  19  29  21
A   0   0   8  21  13   6   1   5  11  21  28
E   0   0   6  13  18  12   4   1   5  17  27

Extra cell for final total score

Alignment of X and Y:

H E A G A W G H E E
H E A . A W - H E .

Overlap matches

Don't penalize overhanging ends, i.e. set F(i, 0) = F(0, j) = 0

Otherwise

F(i, j) = max { F(i−1, j−1) + s(x_i, y_j),  F(i, j−1) − d,  F(i−1, j) − d }

        H   E   A   G   A   W   G   H   E   E
    0   0   0   0   0   0   0   0   0   0   0
P   0  -2  -1  -1  -2  -1  -4  -2  -2  -1  -1
A   0  -2  -2   4  -1   3  -4  -4  -4  -3  -2
W   0  -3  -5  -4   1  -4  18  10   2  -6  -6
H   0  10   2  -6  -6  -1  10  16  20  12   4
E   0   2  16   8   0  -7   2   8  16  26  18
A   0  -2   8  21  13   5  -3   2   8  18  25
E   0   0   4  13  18  12   4  -4   2  14  24

Alignment of X and Y:

G A W G H E E
P A W - H E A
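A short Python sketch of overlap matching, assuming (as is usual for this variant) that the alignment may end anywhere on the bottom row or rightmost column, so the score is the maximum over those border cells. The match/mismatch scoring and the name `overlap_score` are illustrative assumptions.

```python
def overlap_score(x, y, d=8, match=5, mismatch=-4):
    """Best overlap-alignment score: overhanging ends are not penalised."""
    m, n = len(x), len(y)
    # Zero borders: an alignment may start anywhere on the top row or left column.
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,
                          F[i][j - 1] - d,
                          F[i - 1][j] - d)
    # The alignment must run to the bottom row or the rightmost column.
    return max(max(F[m]), max(row[n] for row in F))


# The suffix ACGT of the first sequence overlaps the prefix ACGT of the second.
print(overlap_score("GGGACGT", "ACGTCCC"))
```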

Affine gap penalties

• Affine score: γ(g) = −d − (g−1)e, where d is the gap-open penalty and e is the gap-extension penalty

• Different penalties associated with extending alignment with gap symbol

Y = C C T W P
X = C S T W -

different from

Y = C C T W P
X = C S T - -
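As a small worked illustration of the affine score, the function below evaluates γ(g) and compares one gap of length 2 with two separate gaps of length 1. The values d = 12 and e = 2 are arbitrary example penalties, not taken from the lecture.

```python
def gamma(g, d=12, e=2):
    """Affine gap score for a gap of length g >= 1: open once, extend g-1 times."""
    return -d - (g - 1) * e


one_long_gap = gamma(2)          # a single gap of length 2
two_short_gaps = 2 * gamma(1)    # two separate gaps of length 1
print(one_long_gap, two_short_gaps)
```

With e < d, a single long gap is penalised less than several short ones of the same total length, which is the motivation for the affine model.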

General recurrence

For i, j > 0:

F(i, j) = max of

F(i−1, j−1) + s(x_i, y_j): extend by matching x_i and y_j

F(k, j) + γ(i − k), k = 0, 1, ..., i−1: extend by aligning x_{k+1}, ..., x_i to a gap of length i − k

F(i, k) + γ(j − k), k = 0, 1, ..., j−1: extend by aligning y_{k+1}, ..., y_j to a gap of length j − k

Problem: Procedure runs in worst-case time Θ(n^3)

Θ(n^2) version

Extra variables

M(i, j) = best score of alignment of (x_1, x_2, ..., x_i) and (y_1, y_2, ..., y_j) given that x_i is aligned with y_j

Ix(i, j) = best score of alignment of (x_1, x_2, ..., x_i) and (y_1, y_2, ..., y_j) given that x_i is aligned with a gap

Iy(i, j) = best score of alignment of (x_1, x_2, ..., x_i) and (y_1, y_2, ..., y_j) given that y_j is aligned with a gap

Recurrences

For i, j > 0:

Ix(i, j) = max of

M(i−1, j) − d: x_i aligned to start of gap

Ix(i−1, j) − e: x_i aligned to continuation of gap

Iy(i, j) = max of

M(i, j−1) − d: y_j aligned to start of gap

Iy(i, j−1) − e: y_j aligned to continuation of gap

M(i, j) = max { M(i−1, j−1) + s(x_i, y_j),  Ix(i−1, j−1) + s(x_i, y_j),  Iy(i−1, j−1) + s(x_i, y_j) }

Procedure runs in worst-case time Θ(n^2)
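The three-table recurrences can be sketched in Python as follows. The slides do not give boundary conditions, so the ones used here are an assumption: M(0, 0) = 0, a leading gap of length g scores −d − (g−1)e, and every other boundary cell is −∞. Match/mismatch scoring and the name `affine_score` are likewise illustrative.

```python
NEG_INF = float("-inf")


def affine_score(x, y, d=12, e=2, match=5, mismatch=-4):
    """Global alignment score with affine gaps, using tables M, Ix and Iy."""
    m, n = len(x), len(y)
    M = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    Ix = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    Iy = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    # Assumed boundary conditions (not stated on the slides).
    M[0][0] = 0
    for i in range(1, m + 1):
        Ix[i][0] = -d - (i - 1) * e    # x_1..x_i all against one leading gap
    for j in range(1, n + 1):
        Iy[0][j] = -d - (j - 1) * e    # y_1..y_j all against one leading gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1], Ix[i - 1][j - 1], Iy[i - 1][j - 1]) + s
            Ix[i][j] = max(M[i - 1][j] - d,     # x_i opens a gap
                           Ix[i - 1][j] - e)    # x_i continues a gap
            Iy[i][j] = max(M[i][j - 1] - d,     # y_j opens a gap
                           Iy[i][j - 1] - e)    # y_j continues a gap
    return max(M[m][n], Ix[m][n], Iy[m][n])


# Six matches (30) and one gap of length 2 (-12 - 2 = -14).
print(affine_score("AAACCC", "AAAGGCCC"))
```

Each cell of each table looks at a constant number of neighbours, which is where the Θ(n^2) bound comes from.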

Linear space alignment

[Figure: F matrix with columns 0, 1, 2, ..., n, built up over successive slides; row ⌊m/2⌋ is marked.]

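Linear space algorithm

From top: Ftop(j) = score of best alignment of (x_1, ..., x_⌊m/2⌋) and (y_1, ..., y_j)

From bottom: Fbot(j) = score of best alignment of (x_⌊m/2⌋+1, ..., x_m) and (y_{j+1}, ..., y_n)

Ftop(j) + Fbot(j) = score of best alignment that crosses row ⌊m/2⌋ at column j

Find k ∈ {0, 1, ..., n} such that Ftop(k) + Fbot(k) is maximized

k is on path of optimal alignment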
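One step of the linear-space idea can be sketched as below: compute only the last row of the table for the top half of X (forwards) and for the bottom half (on reversed sequences), then pick the column k where the two scores add up to the maximum. The helper names `last_row` and `split_column`, and the match/mismatch scoring with linear gap penalty d, are assumptions for illustration. A full linear-space aligner would recurse on the two halves on either side of (⌊m/2⌋, k).

```python
def last_row(x, y, d=8, match=5, mismatch=-4):
    """Last row of the global-alignment table, keeping only two rows in memory."""
    prev = [-j * d for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        cur = [-i * d] + [0] * len(y)
        for j in range(1, len(y) + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            cur[j] = max(prev[j - 1] + s, cur[j - 1] - d, prev[j] - d)
        prev = cur
    return prev


def split_column(x, y):
    """Column k where an optimal alignment crosses row floor(m/2), and its score."""
    mid = len(x) // 2
    # Ftop(j): top half of x against y[:j].
    f_top = last_row(x[:mid], y)
    # Fbot(j): bottom half of x against y[j:], computed on reversed sequences.
    f_bot = last_row(x[mid:][::-1], y[::-1])[::-1]
    k = max(range(len(y) + 1), key=lambda j: f_top[j] + f_bot[j])
    return k, f_top[k] + f_bot[k]


k, total = split_column("ACGTACGT", "ACGTTACGT")
print(k, total, last_row("ACGTACGT", "ACGTTACGT")[-1])
```

The maximised sum equals the optimal global score, which is a convenient check that the split point lies on an optimal path.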

Linear space alignment: Hirschberg's insight

[Figure: F matrix with row ⌊m/2⌋ marked.]

Software for pairwise alignment

Pure D.P. runs in Θ(mn) time

Example

100 million residues in database

Search sequence of length 10,000

# F matrix cells to be calculated: 10^8 × 10^4 = 10^12

Computer speed: 10 million cells a second

Total time: 100,000 seconds = 28 hours (approx.)
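The arithmetic behind this estimate, written out as a few lines of Python using only the figures from the example (the variable names are illustrative):

```python
db_residues = 100_000_000        # residues in the database
query_len = 10_000               # length of the search sequence
cells_per_second = 10_000_000    # assumed computer speed

cells = db_residues * query_len          # one F-matrix cell per residue pair
seconds = cells / cells_per_second
hours = seconds / 3600

print(cells, seconds, round(hours, 1))
```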

Heuristic methods

FASTA (Pearson & Lipman, 1988)

Words in X and Y (length ktup)

Each word, e.g. cgtta, has a list of matches …, (i, j), … where i = position in X and j = position in Y

• sort matches on j − i
• extend best matches (ungapped)
• join neighbouring matches by inserting gaps
• realign best matches by dynamic programming
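The first two steps (finding word matches and grouping them by diagonal j − i) can be sketched as follows. This is not FASTA's actual implementation, only an illustration of the idea; positions are 0-based here, and the names `word_hits` and `best_diagonals` are hypothetical.

```python
from collections import Counter, defaultdict


def word_hits(x, y, ktup=2):
    """All (i, j) such that the word of length ktup at x[i:] equals the one at y[j:]."""
    index = defaultdict(list)
    for i in range(len(x) - ktup + 1):
        index[x[i:i + ktup]].append(i)          # lookup table of words in X
    hits = []
    for j in range(len(y) - ktup + 1):
        for i in index.get(y[j:j + ktup], []):
            hits.append((i, j))
    return hits


def best_diagonals(hits, top=3):
    """Diagonals j - i ranked by number of word matches lying on them."""
    return Counter(j - i for i, j in hits).most_common(top)


hits = word_hits("cgtta", "aacgtta", ktup=3)
print(hits)
print(best_diagonals(hits))
```

Matches on the same diagonal belong to the same ungapped alignment, which is why sorting on j − i finds candidate regions quickly.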

Sensitivity

Tradeoff

High values of ktup: faster search, but may miss significant matches

Low values of ktup: catches more matches, but slower

ktup = 1 for sensitivity close to dynamic programming

Available from

http://www.fasta.bioch.virginia.edu/

Example

Input sequences

>short1.seq Length: 100 August 1, 2003 11:09 Type: N Check: 5940
atgaaattaacagcaatagctaaagcaacattagcattaggaatattaacaacaggtgtgatgacagcagaaagtcaaactgtaaacgcgaaagtaaagt

>short2.seq Length: 100 August 1, 2003 10:43 Type: N Check: 1744
atgaagatgacagcaattgcgaaagccagtttagctctaagtattttagcgactggggttataacatcaacggctcaaactgtaaatgcgagcgaacatg

/seqprg/slib/bin/lalign -N 5000 -n -r "+5/-4" -f -12 -g -4 -w 75 -q @ @

Output matches

resetting to DNA matrix
resetting to DNA matrix
LALIGN finds the best local alignments between two sequences
version 2.1u03 April 2000
Please cite: X. Huang and W. Miller (1991) Adv. Appl. Math. 12:373-381

resetting to DNA matrix
alignments < E( 0.05):score: 51 (50 max)
Comparison of:
(A) @ short1.seq Length: 100 August 1, 2003 11:09 Type - 100 nt
(B) @ short2.seq Length: 100 August 1, 2003 10:43 Typ - 100 nt
using matrix file: DNA, gap penalties: -12/-4 E(limit) 0.05

71.4% identity in 91 nt overlap (1-91:1-91); score: 221 E(10000): 3.7e-12

               10        20        30        40        50        60        70
short1 ATGAAATTAACAGCAATAGCTAAAGCAACATTAGCATTAGGAATATTAACAACAGGTGTGATGACAGCAGAAAGT
       :::::  : :::::::: :: ::::: :  :::::  :: : :: ::: : :: :: :: :: ::: ::     :
short2 ATGAAGATGACAGCAATTGCGAAAGCCAGTTTAGCTCTAAGTATTTTAGCGACTGGGGTTATAACATCAACGGCT
               10        20        30        40        50        60        70

          80        90
short1 CAAACTGTAAACGCGA
       ::::::::::: ::::
short2 CAAACTGTAAATGCGA
          80        90

----------

Example

More matches

64.1% identity in 39 nt overlap (17-54:32-69); score: 53 E(10000): 3.7e+02

         20        30         40        50
short1 TAGCTAAAGCAACATTAGC-ATTAGGAATATTAACAACA
       :::::  :   :  ::::: : : ::  ::: :::: ::
short2 TAGCTCTAAGTATTTTAGCGACTGGGGTTAT-AACATCA
              40        50        60

----------

73.9% identity in 23 nt overlap (60-77:6-28); score: 53 E(10000): 3.7e+02

      60             70
short1 GATGACAGCA-----GAAAGTCA
       ::::::::::     ::::: ::
short2 GATGACAGCAATTGCGAAAGCCA
          10        20

----------

BLAST

Developed by Altschul et al. (1990)

Preprocesses query sequence

Makes list of “neighbourhood words” with match > T

Tries to extend “seed” matches (ungapped) in database sequences

GAPPED-BLAST looks for gapped alignments
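The idea of a neighbourhood word list can be illustrated with a toy sketch: enumerate every word of the same length as a query word and keep those whose ungapped score against it exceeds T. This is not BLAST's implementation; the DNA alphabet, the match/mismatch scores (standing in for a substitution matrix) and the name `neighbourhood` are all assumptions.

```python
from itertools import product


def neighbourhood(word, T, alphabet="ACGT", match=5, mismatch=-4):
    """All words of len(word) whose ungapped score against `word` exceeds T."""
    out = []
    for cand in product(alphabet, repeat=len(word)):
        score = sum(match if a == b else mismatch for a, b in zip(word, cand))
        if score > T:
            out.append("".join(cand))
    return out


words = neighbourhood("ACG", T=5)
print(len(words), words)
```

With these scores, the exact word (15) and every word with a single mismatch (6) pass the threshold, while words with two mismatches (−3) do not.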

Genetics Computer Group package

GCG at University of Wisconsin

Commercial package (http://www.gcg.com/)

* assemble * backtranslate * bestfit * blast * breakup * chopup * circles * codonfrequency * codonpreference * coilscan * compare * composition * compresstext * comptable * consensus * correspond * corrupt * dataset * detab * distances * diverge * domes * dotplot * extractpeptide * fasta * fasta_parsable_output * fetch * figure * findpatterns * fingerprint * fitconsensus * foldrna * framealign * frames * framesearch * fromembl * fromfasta * fromgenbank * fromig * frompir * fromstaden * gap * gapshow * gcgtoblast * gelassemble * geldisassemble

* gelenter * gelintroduction * gelmerge * gelstart * gelview * getseq * growtree * helicalwheel * hthscan * isoelectric * lineup * listfile * lookup * lprint * map * mapplot * mapsort * mfold * moment * motifs * mountains * name * names * netblast * nooverlap * olddistances * onecase * overlap * paupdisplay * paupsearch * pepdata * pepplot * peptidemap * peptidesort * peptidestructure * pileup * plasmidmap * plotfold * plotsimilarity * plotstructure * plottest * pretty * prime * profileanalysis * profilegap * profilemake

* profilescan * profilesearch * profilesegments * publish * red * reformat * repeat * replace * reverse * sample * seg * segments * seqed * seqlab * setkeys * setplot * shiftover * shuffle * simplify * spew * spscan * squiggles * statplot * stemloop * stringsearch * symbol * terminator * testcode * tfasta * tofasta * toig * topir * tostaden * translate * whats_new_9.0 * whats_new_9.1 * window * wordsearch * xnu

GAP (“Global Alignment Program” ?)

Needleman & Wunsch algorithm

Input in GCG format

Use GETSEQ

!!NA_SEQUENCE 1.0 GETSEQ from gcg, August 14, 19103 12:19.

Length: 389 August 14, 19103 12:19 Type: N Check: 9580 ..

1 AAATGATAAA CTATTTTACT TTATGTCTAA GGTCTTTCAT AATATGAAAT

51 AGAATGTAGA TATTGCAACA ATAGCATTTT TGGAGACAGC TACCTCCTTT

101 ACCAGGAATA ATCTTTGCAT GTCACATTTA GAGATAAAGC TCAAAATGCA

151 AATCCTTCCC CTGAGAGTGG GAAAGCATTA ACAAATGAGA GTGGGAAAAG

201 CATTAACAAA GCATTAACAC AGGTCTTTAC ATATTCAAAA TATTAAACTA

251 ATGCTAGGAT TATAGACTTG ATTTTAAGAC ATGGTAGTTA ATAGAAAAGT

301 TCTAGATTGA AAACAATTTT GCAAAAATAT ACATTTGGTA TATGTGTATA

351 TATGTATGTG GTATATATAT ATCNACTAGG GAAAATATA

Example

<heman.lsb.sbs.auckland.ac.nz:/usr/users/gcg/359Stuff>gap

Gap uses the algorithm of Needleman and Wunsch to find the alignment of two complete sequences that maximizes the number of matches and minimizes the number of gaps.

GAP of what sequence 1 ? Hs#S374655.gcg

Begin (* 1 *) ? End (* 389 *) ? Reverse (* No *) ?

to what sequence 2 (* Hs#S374655.gcg *) ? Hs#S1117589.gcg

What is the gap creation penalty (* 50 *) ?

What is the gap extension penalty (* 3 *) ?

What should I call the paired output display file (* Hs#S374655.pair *) ?

Aligning ................-. Aligning ................-..

Gaps: 0 Quality: 3080 Quality Ratio: 9.536 % Similarity: 95.356 Length: 389

Display file

GAP of: Hs#S374655.gcg check: 9580 from: 1 to: 389

GETSEQ from gcg, August 14, 19103 12:19.

to: Hs#S1117589.gcg check: 8814 from: 1 to: 323

GETSEQ from gcg, August 14, 19103 12:20.

Symbol comparison table: /usr/users/gcg/gcgcore/data/rundata/nwsgapdna.cmp CompCheck: 8760

Gap Weight: 50 Average Match: 10.000 Length Weight: 3 Average Mismatch: 0.000

Quality: 3080 Length: 389 Ratio: 9.536 Gaps: 0 Percent Similarity: 95.356 Percent Identity: 95.356

Match display thresholds for the alignment(s): | = IDENTITY : = 5 . = 1

Hs#S374655.gcg x Hs#S1117589.gcg August 18, 19103 17:59 ..

                  .         .         .         .         .
       1 AAATGATAAACTATTTTACTTTATGTCTAAGGTCTTTCATAATATGAAAT 50
            |||||||||||||||||||||||||||||||||||||||||||||||
       1 ...TGATAAACTATTTTACTTTATGTCTAAGGTCTTTCATAATATGAAAT 47
                  .         .         .         .         .
      51 AGAATGTAGATATTGCAACAATAGCATTTTTGGAGACAGCTACCTCCTTT 100
         ||||||||||||||||||||||||||||||||||||||||||||||||||
      48 AGAATGTAGATATTGCAACAATAGCATTTTTGGAGACAGCTACCTCCTTT 97
                  .         .         .         .         .
     101 ACCAGGAATAATCTTTGCATGTCACATTTAGAGATAAAGCTCAAAATGCA 150
         |||||||||||||||||||||||||||||||||||||||||||| |||||
      98 ACCAGGAATAATCTTTGCATGTCACATTTAGAGATAAAGCTCAAGATGCA 147
                  .         .         .         .         .
     151 AATCCTTCCCCTGAGAGTGGGAAAGCATTAACAAATGAGAGTGGGAAAAG 200
         ||||||||||||||||||||||||||||||||||||||||||||||||||
     148 AATCCTTCCCCTGAGAGTGGGAAAGCATTAACAAATGAGAGTGGGAAAAG 197
                  .         .         .         .         .
     201 CATTAACAAAGCATTAACACAGGTCTTTACATATTCAAAATATTAAACTA 250
         ||||||||||||||||||||||||||||||||||||||||||||||||||
     198 CATTAACAAAGCATTAACACAGGTCTTTACATATTCAAAATATTAAACTA 247
                  .         .         .         .         .
     251 ATGCTAGGATTATAGACTTGATTTTAAGACATGGTAGTTAATAGAAAAGT 300
         |||||||||||||||||||||||||||     ||||||||    |||||
     248 ATGCTAGGATTATAGACTTGATTTTAAACATGGGTAGTTATAGAAAAAGG 297
                  .         .         .         .         .
     301 TCTAGATTGAAAACAATTTTGCAAAAATATACATTTGGTATATGTGTATA 350
         |||||||||||||||| |||   |||
     298 TCTAGATTGAAAACAAATTTTGCAAA........................ 323
                  .         .

Bestfit

<heman.lsb.sbs.auckland.ac.nz:/usr/users/gcg/359Stuff>bestfit

BestFit makes an optimal alignment of the best segment of similarity between two sequences. Optimal alignments are found by inserting gaps to maximize the number of matches using the local homology algorithm of Smith and Waterman.

BESTFIT of what sequence 1 ? short1.gcg

to what sequence 2 (* short1.gcg *) ? short2.gcg

What is the gap creation penalty (* 50 *) ?

What is the gap extension penalty (* 3 *) ?

What should I call the paired output display file (* short1.pair *) ?

Aligning ....-. Aligning ....-.

Gaps: 0 Quality: 416 Quality Ratio: 4.571 % Similarity: 71.429 Length: 91

Smith & Waterman algorithm

Local alignment

Same interface as GAP

Bestfit display file

BESTFIT of: short1.gcg check: 2998 from: 1 to: 100

GETSEQ from gcg, August 18, 19103 15:25.

to: short2.gcg check: 6455 from: 1 to: 100

Symbol comparison table: /usr/users/gcg/gcgcore/data/rundata/swgapdna.cmp CompCheck: 2335

Gap Weight: 50 Average Match: 10.000 Length Weight: 3 Average Mismatch: -9.000

Quality: 416 Length: 91 Ratio: 4.571 Gaps: 0 Percent Similarity: 71.429 Percent Identity: 71.429

Match display thresholds for the alignment(s): | = IDENTITY : = 5 . = 1

short1.gcg x short2.gcg August 18, 19103 15:27 ..

                  .         .         .         .         .
       1 atgaaattaacagcaatagctaaagcaacattagcattaggaatattaac 50
         |||||  | |||||||| || ||||| |  |||||  || | || ||| |
       1 atgaagatgacagcaattgcgaaagccagtttagctctaagtattttagc 50
                  .         .         .         .
      51 aacaggtgtgatgacagcagaaagtcaaactgtaaacgcga 91
          || || || || ||| ||     |||||||||||| ||||
      51 gactggggttataacatcaacggctcaaactgtaaatgcga 91

Wordsearch

Algorithm similar to algorithm of Wilbur and Lipman (1983).

Compares one sequence (the query) to any group of sequences.

Comparisons can be viewed as set of dot-plots.

Search finds registers of comparison (diagonals) that have the largest number of short perfect matches (words).

Best segment of similarity along each diagonal viewed with program SEGMENTS.

Wordsearch example

<heman.lsb.sbs.auckland.ac.nz:/usr/users/gcg/359Stuff>wordsearch

WordSearch identifies sequences in the database that share large numbers of common words in the same register of comparison with your query sequence. The output of WordSearch can be displayed with Segments.

WORDSEARCH with what query sequence ? short1.gcg

Begin (* 1 *) ? End (* 100 *) ?

Search for query in what sequence(s) (* GenEMBL:* *) ? short2.gcg

What word size (* 6 *) ?

List how many best diagonals (* 50 *) ? 4

Integrate how many adjacent diagonals (* 3 *) ?

What should I call the output file (* short1.word *) ?

1 short2.gcg Len: 100

6-mers found: 168 Diagonals with words: 6 Total diagonals: 398 Sequences searched: 1 CPU time: 00.03

Output file: /usr/users/gcg/359Stuff/short1.word

Short1.word contents

!!SEQUENCE_LIST 1.0 (Nucleotide) WORDSEARCH of: /usr/users/gcg/359Stuff/short1.gcg check: 2998 from: 1 to:100

TO: short2.gcg Sequences: 1 Total-length: 100 August 18, 19103 15:47

Word-size: 6 Words: 168 Diagonals: 6 Total-diagonals: 398 Integral-width: 3 Alphabet: 4 List-size: 4 CPU minutes: 0.00

Sequence Strd Diag Score Width Documentation ..

/short2.gcg + 0 20 3 GETSEQ from gcg, August 18, 19103 15:26.
/short2.gcg + -54 10 3 GETSEQ from gcg, August 18, 19103 15:26.
/short2.gcg - -47 7 3 GETSEQ from gcg, August 18, 19103 15:26.
/short2.gcg + -69 7 3 GETSEQ from gcg, August 18, 19103 15:26.

Run SEGMENTS

<heman.lsb.sbs.auckland.ac.nz:/usr/users/gcg/359Stuff>segments

Segments aligns and displays the segments of similarity found by WordSearch.

(BestFit) SEGMENTS from what WORDSEARCH file ? short1.word

What should I call the output file (* short1.pairs *) ?

Aligning ....-. /usr/users/gcg/359Stuff/short2.gcg 100 bp Gaps: 0 Quality:500 / Length: 98
Aligning ..-. /usr/users/gcg/359Stuff/short2.gcg 100 bp Gaps: 0 Quality:100 / Length: 10
Aligning ..-. /usr/users/gcg/359Stuff/short2.gcg 100 bp Gaps: 0 Quality:112 / Length: 32
Aligning .-. /usr/users/gcg/359Stuff/short2.gcg 100 bp Gaps: 0 Quality: 96 / Length: 16

Short1.pairs contents

(BestFit) SEGMENTS from: short1.word August 18, 19103 15:48

(Nucleotide) WORDSEARCH of: /usr/users/gcg/359Stuff/short1.gcg check: 2998 from: 1 to: 100
GETSEQ from gcg, August 18, 19103 15:25.
TO: short2.gcg Sequences: 1 Total-length: 100 August 18, 19103 15:47
Word-size: 6 Words: 168 Diagonals: 6 Total-diagonals: 398
Integral-width: 3 Alphabet: 4 List-size: 4 CPU minutes: 0.00

AvMatch: 3.84 AvMisMatch: -6.00 GapWeight: 50 LengthWeight: 3 ..

Match display thresholds for the alignment(s): | = IDENTITY : = 3 . = 1

short1.gcg check: 2998 from: 1 to: 100
/usr/users/gcg/359Stuff/short2.gcg check: 6455 from: 1 to: 100
GETSEQ from gcg, August 18, 19103 15:26.
Gaps: 0 Quality: 500 Ratio:5.102 Score:20 Width:3 Limits: +/-4

                  .         .         .         .         .
       1 atgaaattaacagcaatagctaaagcaacattagcattaggaatattaac 50
         |||||  | |||||||| || ||||| |  |||||  || | || ||| |
       1 atgaagatgacagcaattgcgaaagccagtttagctctaagtattttagc 50
                  .         .         .         .
      51 aacaggtgtgatgacagcagaaagtcaaactgtaaacgcgaaagtaaa 98
          || || || || ||| ||     |||||||||||| ||||  | | |
      51 gactggggttataacatcaacggctcaaactgtaaatgcgagcgaaca 98

Short1.pairs (continued)

short1.gcg check: 2998 from: 54 to: 100
/usr/users/gcg/359Stuff/short2.gcg check: 6455 from: 1 to: 100
GETSEQ from gcg, August 18, 19103 15:26.
Gaps: 0 Quality: 100 Ratio:10.000 Score:10 Width:3 Limits: +/-4

                  .
      60 gatgacagca 69
         ||||||||||
       6 gatgacagca 15

short1.gcg check: 2998 from: 47 to: 100 /Reverse
/usr/users/gcg/359Stuff/short2.gcg check: 6455 from: 1 to: 100
GETSEQ from gcg, August 18, 19103 15:26.
Gaps: 0 Quality: 112 Ratio:3.500 Score:7 Width:3 Limits: +/-4

                  .         .         .
      40 ctaatgctaatgttgctttagctattgctgtt 9
         | | ||| || |    ||||||| |   | ||
      14 caattgcgaaagccagtttagctctaagtatt 45

short1.gcg check: 2998 from: 69 to: 100
/usr/users/gcg/359Stuff/short2.gcg check: 6455 from: 1 to: 100
GETSEQ from gcg, August 18, 19103 15:26.
Gaps: 0 Quality: 96 Ratio:6.000 Score:7 Width:3 Limits: +/-4

                  .
      79 actgtaaacgcgaaag 94
         || | ||  |||||||
      10 acagcaattgcgaaag 25

EMBOSS

What is EMBOSS?
The European Molecular Biology Open Software Suite
EMBOSS is a package of high-quality FREE Open Source software for sequence analysis.

Applications in EMBOSS
The EMBOSS programs and their documentation.

User Documentation
Tutorial, Command syntax, Sequences and Databases, Reference

Jemboss and other Interfaces
Many groups are creating graphical interfaces to EMBOSS
Jemboss is our supported interface

Downloading the software
You can download, install and run the software on most UNIX computers
It is known to work on: Irix, AIX (4.3.3 and 5.1), Red Hat, SuSe, Debian, HPUX11/IA64, MacOSX, Mandrake, NetBSD, Slackware, Solaris, Tru64 Unix (Full support soon. Loan machine being arranged)
It is reported to work on: FreeBSD, OSF, SuSE-PPC

LATEST NEWS: Release 2.7.1 available as of 3rd June 2003

Suite of programs

getorf          HGMP    Finds and extracts open reading frames (ORFs)
helixturnhelix  HGMP    Finds nucleic acid binding domains.
hmoment         HGMP    Hydrophobic moment calculation
iep             HGMP    Calculates the isoelectric point of a protein
infoalign       HGMP    Information on a multiple sequence alignment
infoseq         HGMP    Displays some simple information about sequences
isochore        Sanger  Plots isochores in large DNA sequences
jembossctl      HGMP    Jemboss Authentication Control
lindna          Norway  Draws linear maps of DNA constructs
listor          HGMP    Writes a list file of the logical OR of two sets of sequences
marscan         HGMP    Finds MAR/SAR sites in nucleic sequences
maskfeat        HGMP    Mask off features of a sequence
maskseq         HGMP    Mask off regions of a sequence.
matcher         Sanger  Local alignment of two sequences
megamerger      HGMP    Merge two large overlapping nucleic acid sequences
merger          HGMP    Merge two overlapping sequences
msbar           HGMP    Mutate sequence beyond all recognition
mwcontam        HGMP    Shows molwts that match across a set of files
mwfilter        HGMP    Filter noisy molwts from mass spec output
needle          HGMP    Needleman-Wunsch global alignment.
