Interoperability of Web Services: granularity and data types

Peter Fischer Hallin, Center for Biological Sequence Analysis

EMBRACE Workshop on Client Side Scripting for Web Services

February 2008

Take home from yesterday ...

Web Services:

• are software designed to enable computer-to-computer interaction

• should aim to enhance interoperability

• exchanged objects defined in XSD

• described using WSDL

• commonly exchange data over SOAP/HTTP.

• Services described in WSDL and input/output objects typed using XSD

• Solutions for asynchronous job handling defined using job operations (submit-poll-fetch)

• EMBRACE encourages typing input/output objects according to their conceptual content

• Services must be documented within Service Descriptions

EMBRACE technology

Outline

• Design Phase: Initial considerations for deploying a Web Service

• Case Story I: RNAmmer: Consistent annotation of rRNA genes

• Case Story II: BLASTatlas; large XML messages and RAW out

• Exercise: Preparation for today's exercise: programmatic workflow

Design phase

Making your in-house resource available to the community ...

• Map the tasks of your resource onto the I/O of SOAP operations. A task may be a workflow that requires multiple operations.

• Queuing: If an operation takes a long time, the job cannot be handled within a single request-response exchange, and queuing (submit-poll-fetch) will be required.

Which operations will be needed?
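The submit-poll-fetch pattern can be sketched from the client side as follows. This is a minimal illustration only: `FakeJobService`, its method names, and the status strings are hypothetical stand-ins for a queued service, not the operations of any real EMBRACE Web Service.

```python
import itertools
import time


class FakeJobService:
    """In-memory stand-in for a queued Web Service (hypothetical, not a real API)."""

    def __init__(self, polls_until_done=3):
        self._ids = itertools.count(1)
        self._jobs = {}
        self._polls_until_done = polls_until_done

    def submit(self, payload):
        # The server only queues the job and returns an identifier immediately.
        job_id = f"job-{next(self._ids)}"
        self._jobs[job_id] = {"payload": payload, "polls": 0}
        return job_id

    def poll(self, job_id):
        # Each poll reports the status; this fake job finishes after a few polls.
        job = self._jobs[job_id]
        job["polls"] += 1
        return "FINISHED" if job["polls"] >= self._polls_until_done else "RUNNING"

    def fetch(self, job_id):
        # Results are only available once the job has finished.
        job = self._jobs[job_id]
        if job["polls"] < self._polls_until_done:
            raise RuntimeError("job not finished")
        return job["payload"].upper()


def run_job(service, payload, interval=0.01, max_polls=100):
    """Client side of submit-poll-fetch: several short calls instead of one long one."""
    job_id = service.submit(payload)
    for _ in range(max_polls):
        if service.poll(job_id) == "FINISHED":
            return service.fetch(job_id)
        time.sleep(interval)
    raise TimeoutError(f"{job_id} did not finish after {max_polls} polls")


if __name__ == "__main__":
    print(run_job(FakeJobService(), "acgt"))
```

Each call returns quickly, so no single request-response exchange has to stay open for the duration of the job.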

• Make a careful analysis of the conceptual content of your resource.

• Are there existing standards for the specific data you are working with?

• If not, can you find partners with whom you can develop data standards?

• If not, it’s likely you are about to write your own ...

What data is exchanged?

• SOAP operation I/O is defined using XSD. All aspects of your data should be defined within the XSD (arrays, complex data, enumerations, choices, attributes, etc.)

• XSD is quite a rich format for defining XML. There are, however, rare cases where it lacks options. A rule of thumb: if you take the time, XSD can always fully describe your XML (= no excuses for shortcuts!)

Having the overview, define the data you need to exchange

• This is likely not the difficult part once all the analysis and XSD writing is done

• It involves defining the messages from your XSD types, binding them to operations, defining your server endpoint, and composing a WSDL that contains everything

Hooking your data to operations

Ensure proper documentation

• EMBRACE defines methods to put documentation at different levels in the WSDL file: All XSD elements can be documented as can the ‘service’ section of your WSDL

documentation
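To illustrate where such documentation lives, the sketch below reads the standard `wsdl:documentation` and `xsd:annotation/xsd:documentation` elements from a WSDL with Python's standard library. The WSDL fragment, its service name, and the helper `collect_documentation` are made up for illustration and do not describe any real service.

```python
import xml.etree.ElementTree as ET

WSDL_NS = "http://schemas.xmlsoap.org/wsdl/"
XSD_NS = "http://www.w3.org/2001/XMLSchema"

# Hypothetical, heavily trimmed WSDL used only for illustration.
EXAMPLE_WSDL = f"""
<wsdl:definitions xmlns:wsdl="{WSDL_NS}" xmlns:xsd="{XSD_NS}" name="ExampleService">
  <wsdl:types>
    <xsd:schema>
      <xsd:element name="sequence" type="xsd:string">
        <xsd:annotation>
          <xsd:documentation>Protein sequence in one-letter code</xsd:documentation>
        </xsd:annotation>
      </xsd:element>
    </xsd:schema>
  </wsdl:types>
  <wsdl:service name="ExampleService">
    <wsdl:documentation>Predicts features in a protein sequence</wsdl:documentation>
  </wsdl:service>
</wsdl:definitions>
"""


def collect_documentation(wsdl_text):
    """Return a dict of documentation strings found at service and element level."""
    root = ET.fromstring(wsdl_text)
    docs = {}
    # Service-level documentation sits directly under <wsdl:service>.
    for service in root.iter(f"{{{WSDL_NS}}}service"):
        doc = service.find(f"{{{WSDL_NS}}}documentation")
        if doc is not None and doc.text:
            docs["service:" + service.get("name")] = doc.text.strip()
    # Element-level documentation sits inside <xsd:annotation>.
    doc_path = f"{{{XSD_NS}}}annotation/{{{XSD_NS}}}documentation"
    for element in root.iter(f"{{{XSD_NS}}}element"):
        doc = element.find(doc_path)
        if doc is not None and doc.text:
            docs["element:" + element.get("name")] = doc.text.strip()
    return docs


if __name__ == "__main__":
    for key, text in collect_documentation(EXAMPLE_WSDL).items():
        print(f"{key}: {text}")
```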

It doesn’t work ...

• Of course it doesn’t: you haven’t yet written the software that connects the server endpoint to your resource ...

But this is a different story, and not so relevant to you as a client user ...

Case story I: RNAmmer

http://www.psb.ugent.be/rRNA/

Tool for predicting rRNA genes in full genome sequences



Performance: selectivity and sensitivity in the range 0.98-1.00:

Lagesen K, Hallin P, Rodland EA, Staerfeldt HH, Rognes T, Ussery DW. RNAmmer: consistent and rapid annotation of ribosomal RNA genes. Nucleic Acids Res 35:3100-8 (2007)

Case story II: the BLASTatlas WS

A service which allows visualization of homology between a reference genome and any number of genomes, metagenomic samples, or sequence databases.

Example: Seven ocean samples from various depths (surface to 4km): 63,837,557 nucleotides, in 65,674 sequences.

24,978 proteins from 12 fully sequenced Prochlorococcus marinus genomes

[Figure: BLASTatlas of P. marinus str. MIT 9303 (2,682,675 bp), genome axis from 0M to 2.5M. Green = P. marinus genomes; blue = ocean samples, from surface to 4 km]

• BLASTatlas was implemented as a WS, and the method was described in a peer-reviewed manuscript

• Perl’s SOAP::Lite consumes ten times more memory during the 20 minutes it takes just to prepare the BLASTatlas request. Memory climbs from 200 MB to 2 GB, just doing ... well, we don’t know!

• Bad Web Service design or poor clients?

Challenges: Massive amounts of input data

http://www.cbs.dtu.dk/ws/BLASTatlas



• Output is a PostScript document. How does this fit into a SOAP response?

• Two options: MIME attachments, or encoding the raw file content and placing it in an XML element.

output
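The second option can be sketched as below, assuming base64 as the encoding (XSD offers the built-in type `xsd:base64Binary` for such content). The element name `output`, the `encoding` attribute, and both helper functions are illustrative choices, not part of the BLASTatlas interface.

```python
import base64
import xml.etree.ElementTree as ET


def wrap_raw_output(raw_bytes, element_name="output"):
    """Encode raw file content as base64 text inside a single XML element."""
    element = ET.Element(element_name, {"encoding": "base64"})
    element.text = base64.b64encode(raw_bytes).decode("ascii")
    return ET.tostring(element, encoding="unicode")


def unwrap_raw_output(xml_text):
    """Client side: parse the element and decode the original bytes."""
    element = ET.fromstring(xml_text)
    return base64.b64decode(element.text)


if __name__ == "__main__":
    # A tiny PostScript-like payload containing bytes that are not valid XML text.
    postscript = b"%!PS-Adobe-3.0\n% binary-safe payload \x00\xff\n"
    xml_text = wrap_raw_output(postscript)
    assert unwrap_raw_output(xml_text) == postscript
    print(len(postscript), "bytes ->", len(xml_text), "characters of XML")
```

Base64 makes arbitrary bytes safe to embed in XML at the cost of roughly one third more data, which matters for large outputs.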

XSD allows you to...

• Define (almost) any complex XML structure.

• XSD has built-in data types for basic content

• <complexType> allows you to build your own

• Supports regular expressions and enumerations

Data types

Key design considerations

• Common data types

• Granularity

• Typing

Common data types

• Common data types could allow seamless connection between different services

• Saves time on documentation and development

• Would however take efforts to administer - and in some cases creating a new type is easier

• Not essential

(if we could all agree ... )

Granularity, Granularity, and Granularity

• Our choice of technology sets standards for typing any element in input/output - and these standards should be exploited!

• To a certain extent, Web Services is all about plumbing - connecting objects (pipes) from different operations to build a workflow and finally to generate a result for you

• This plumbing gets increasingly difficult, the poorer the granularity

Granularity examples:

[Figure labels: granularity and object typing; standardization and management]

A logical threshold for typing: data that is likely to be the endpoint of workflows or to have limited scientific meaning in the context of bioinformatics: JPEG/PNG, PostScript/PDF, ...

[Figure: Tropomyosin isoforms, labeled 1, 2, 3]

Granularity

A ridiculous example: Substance P (a neurotransmitter)





Typing

array

id: string

sequence: string - restrictions? /^[ACDEFGHIKLMNPQRSTVWY]+$/
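A minimal sketch of what the restriction buys: a typed entry that rejects anything outside the 20-letter amino acid alphabet. The class name `SequenceEntry` is hypothetical. Note that Python's `$` also matches just before a trailing newline, so the sketch uses `fullmatch` on the same character class rather than the anchored pattern.

```python
import re
from dataclasses import dataclass

# Same character class as the restriction on the slide.
AMINO_ACIDS = re.compile(r"[ACDEFGHIKLMNPQRSTVWY]+")


@dataclass(frozen=True)
class SequenceEntry:
    """One element of the array: an identifier plus a restricted sequence string."""

    id: str
    sequence: str

    def __post_init__(self):
        # Reject empty strings and any character outside the allowed alphabet.
        if not AMINO_ACIDS.fullmatch(self.sequence):
            raise ValueError(f"invalid amino acid sequence for {self.id!r}")


if __name__ == "__main__":
    entry = SequenceEntry(id="substance_P", sequence="RPKPQQFFGLM")
    print(entry.id, len(entry.sequence))
    try:
        SequenceEntry(id="bad", sequence="RPKPQQFFGLX")
    except ValueError as error:
        print("rejected:", error)
```

With the restriction in the type itself, a service can reject malformed input before any computation starts, and a client knows what the output is guaranteed to contain.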

Today's Exercise

(warning: a little bit of biology ahead)

• The sigma-70 transcription factor is a protein that binds at -10 and -35 nt upstream of the transcription start site during prokaryotic gene transcription.

• The sigma factor facilitates transcription by binding to the RNA polymerase

• Sigma factors serve a regulatory function

• The aim of the exercise is to predict the promoter binding sites of sigma factors, more specifically the P1 and P2 promoters of ribosomal RNA genes.

Real-life research project

E. coli rRNA operons

tion of equimolar amounts of the three rRNA species. While it is the general rule, this grouping of rRNA genes is not universal, and several examples in which not all rRNA genes are linked have been noted in the Bacteria (96, 119, 212, 291) and in the Archaea (265, 334).

Sequence comparison suggests that some of the seven E. coli rrn operons could be functionally distinct. Information from extensive hybridization experiments also suggests both large and small differences. The rrnB, -C, -E, and -G operons all have a gene for tRNAGlu-2 in their spacer region, while rrnA, -D, and -H have genes for tRNAIle-1 and tRNAAla-1B (211) (Fig. 2). Also, the rrnB and rrnG operon spacers contain a 106-nucleotide sequence of potential secondary structure called the ribosomal spacer loop. The ribosomal spacer loop is absent from the spacer regions of the rrnC, rrnE, and rrnB operons of some descendants of the original K-12 strain (rrnB1), in which the ribosomal spacer loop is replaced by a shorter sequence (20 nucleotides) (120). Some rrn operons also differ in the genes at their distal ends. The rrnC operon has genes for tRNAAsp and tRNATrp (the sole gene for this tRNA in E. coli). The rrnD operon has a distal tRNAThr-1 gene, and the rrnH operon has one for tRNAAsp-1 (210). The rrnD operon is also unique in possessing two distal 5S RNA sequences (67). In addition to the above-mentioned heterogeneities, the individual rrn operons contain many small sequence differences, which occur in regions specifying the structural RNAs (e.g., reference 138), promoters, spacers, and terminators (120, 296). Whether any of these heterogeneities cause differences in the regulation or function of particular rRNAs has yet to be established.

Multiple rrn operons are found in the genomes of many organisms, and although the advantages conferred on E. coli by rrn operon redundancy are still only partially understood, the phenomenon is clearly important for the survival of E. coli and many other bacterial species. Experiments in which multiple E. coli rrn operons have been inactivated show that nearly optimal growth rates are obtained on complex medium with only five intact rrn operons (49). However, all seven rrn operons are necessary for rapid adaptation to certain nutrient and temperature changes (50). We have concluded from these experiments that a major function of multiple rrn operons is to allow E. coli to commence synthesis of ribosomes faster upon encountering more favorable growth conditions (50). Other possible roles of rrn operon multiplicity might be that ribosomes derived from specific rrn operons are required to translate certain mRNAs or that specific rrn operons are expressed under special physiological conditions. These speculations seem plausible considering the operon heterogeneities mentioned above. Although no evidence of such roles has yet been demonstrated in E. coli (51), growth stage-specific ribosomes have been demonstrated in the malaria-causing Plasmodium berghei parasite. P. berghei growing in its mosquito host makes 18S rRNA predominantly from one gene (type C) but transcribes 18S from a structurally distinct gene (type A) in the mammalian host (351, 366). The purpose of this interesting adaptation is unknown but could be necessitated by different temperatures encountered in the parasite’s two hosts.

The E. coli rrn operons are transcribed to an exceptional degree, accounting for more than half of the cell’s total RNA synthesis under rapid growth conditions. The promoter regions of all seven operons have been sequenced, and they all have the same general structure (Fig. 3). Each operon has tandem σ70 promoters, P1 and P2, separated by about 100 bp. The P2 promoter in turn is separated by a 200-bp leader region from the beginning of the mature 16S rRNA. None of the core promoter sequences (defined as −41 to +1 with respect to the transcription initiation site [14]) has a perfect consensus σ70 promoter sequence in terms of either its −35 and −10 sequences or its spacing between these two regions. In general, these deviations from the consensus tend to reduce the strength of the rrn core promoters, but some are responsible for increased control over rrn synthesis (14, 66). Ross and coworkers have subsequently shown that the exceptional

FIG. 1. Location in minutes of the seven rrn operons on the chromosome of E. coli K-12. Shaded arrows indicate the direction of transcription. The origin of replication, oriC, is shown at 84.0 min.

FIG. 2. Structure of the seven E. coli rrn operons. P1 and P2 are the tandem operon promoters. Shaded boxes represent the 16S, 23S, and 5S genes. Spacer and distal tRNAs are indicated as solid boxes, and the small open boxes in the 16S-23S spacer regions of rrnB and rrnG indicate the ribosomal spacer loop (RSL) (120). The complex terminator region (ter) at the end of each operon consists of a rho-independent terminator(s) followed by a rho-dependent terminator (3, 304).

624 CONDON ET AL. MICROBIOL. REV.


• E. coli has seven rRNA operons, rrnB being the most intensively studied

• All rRNA genes of 16S, 23S, and 5S plus at least one tRNA gene are encoded on the same transcript

• Regulated by two promoters, P1 and P2, with near-consensus σ70 core promoter regions.

• P1 is stronger during fast growth rates, whereas transcription from P2 is predominant at slow rates or during prolonged stationary phase

[Figure: map of the rrnB region with labels btuB, murI, 16S, tRNAGlu, 23S, 5S, murB, the coordinate 4,164,689, and a "you are here" callout; promoter P1 with Fis III, Fis II, Fis I, UP, -35, and -10 elements, and promoter P2 with -35 and -10 elements. Sources cited on the slide: Estrem et al. 1998; Huerta et al. 2003; Hengen et al. 1997]

[Figure 2.4: Combined P1 and P2 promoter elements before matrix adjustment, using all E. coli currently available. Sequence logos (x-axis: Position; y-axis: Bits, 0.0 to 2.0) with consensus letters along the x-axis: (a) –10 hexamer, TATAAT; (b) UP element, TGAAATTTTTTTTTGAAAAGTA; (c) –35 hexamer, TTGACA; (d) FIS binding, ATTGGTYAAAWTTTRACCAAT]

[Figure 2.5: Results of iscan using initial weight matrices, described in section 2.1.2. Profiles of combined –10, –35, and UP scores for E. coli (N=63) (x-axis: position relative to 16S gene start, –500 to 0; y-axis: Ri): (a) P1 model, raw results; (b) P1 model, adjusted results; (c) P2 model, raw results; (d) P2 model, adjusted results]

The E. coli rrnB operons

tRNA and rRNA prediction

• Prediction of tRNA genes using tRNAscan-SE (Lowe et al. 1997) and of rRNA genes using RNAmmer (Lagesen et al. 2007)

http://lowelab.ucsc.edu/tRNAscan-SE/

http://www.cbs.dtu.dk/ws/ws.php?entry=RNAmmer

Lowe TM, Eddy SR. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res 25:955-64 (1997)

Lagesen K, Hallin P, Rodland EA, Staerfeldt HH, Rognes T, Ussery DW. RNAmmer: consistent and rapid annotation of ribosomal RNA genes. Nucleic Acids Res 35:3100-8 (2007)





γ-proteobacteria

Escherichia coli 536

Escherichia coli APEC O1

Escherichia coli CFT073

Shigella sonnei Ss046

Shigella boydii Sb227

Shigella flexneri 2a str. 301

Shigella flexneri 2a str. 2457T

Escherichia coli UTI89

Escherichia coli K12

Escherichia coli O157:H7 EDL933

Escherichia coli O157:H7 str. Sakai

Escherichia coli W3110

Shigella dysenteriae Sd197

Salmonella enterica subsp. enterica serovar Choleraesuis str. SC-B67

Salmonella enterica subsp. enterica serovar Paratyphi A str. ATCC 9150

Salmonella enterica subsp. enterica serovar Typhi Ty2

Salmonella enterica subsp. enterica serovar Typhi str. CT18

Salmonella typhimurium LT2

Yersinia pestis Antiqua

Yersinia pestis CO92

Yersinia pestis KIM

Yersinia pestis Nepal516

Yersinia pestis Pestoides F

Yersinia pestis biovar Microtus str. 91001

Yersinia pseudotuberculosis IP 32953

Figure 2.2: Neighbor-Joining tree of the first 1k bases of all 16S rRNA genes of Yersinia, Salmonella, Shigella, and E. coli

General σ70 core, FIS site, and UP element (E. coli)

[Figure 2.4: Logo plots showing the initial weight matrices used for searching E. coli genomes: –10 hexamer (a), –35 hexamer (b), UP element (c), and FIS binding motif (d)]

[Figure 2.5: Profiles showing the iscan scores of the initial weight matrices, described in section 2.1.2: unadjusted P1 scores (a), adjusted P1 scores (b), unadjusted P2 scores (c), and adjusted P2 scores (d)]

[Figure 3.1: Sequence logo describing the information content as described in table ?? (x-axis: Position, consensus letters TTGACA; y-axis: Bits, 0.0 to 2.0)]

Where j iterates over the four bases A, T, G, C, $p_{wj}$ is the frequency of base j at position w, and J = 4 corresponds to the length of the alphabet. The program matrix2logotbl converts a .matrix file and computes the total information content as well as relative frequencies normalized to the total information content of the bases at each column of the matrix:

>matrix2logotbl < Huerta-m35.matrix
axis:0.0,0.5,1.0,1.5,2.0
ticks:0,0.5,1,1.5,2.0
ylab:Bits
xlab:Position
max:2.0
1 T A:0.000 C:0.113 G:0.138 T:0.477
2 T G:0.033 A:0.080 C:0.093 T:0.563
3 G A:0.015 C:0.054 T:0.146 G:0.230
4 A T:0.024 G:0.029 C:0.054 A:0.090
5 C G:0.000 T:0.115 A:0.156 C:0.159
6 A C:0.018 T:0.039 G:0.058 A:0.089

3.1.2 logovar: Visualizing information content

The table output provided by matrix2logotbl (see section ) is intended for use together with logovar, which reads in a table of stack heights and produces a PostScript logo plot (in file output.pdf) as seen below. The graphic is shown in figure 3.1.

matrix2logotbl < Huerta-m35.matrix | logovar output
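The kind of table matrix2logotbl prints can be approximated with the sketch below. It assumes the usual information content per column, $R_w = \log_2 J + \sum_j p_{wj} \log_2 p_{wj}$, with no small-sample correction; the equation itself is not shown in this excerpt, so this is an assumption, and the code is not the matrix2logotbl implementation. The counts are the –10 matrix from the iscan configuration quoted elsewhere in this document, and the function names are hypothetical.

```python
import math

# Counts of the -10 matrix as quoted in the iscan configuration in this document.
MINUS10_COUNTS = {
    "A": [1, 116, 35, 52, 44, 7],
    "C": [15, 0, 13, 24, 28, 3],
    "G": [3, 0, 17, 12, 26, 0],
    "T": [97, 0, 51, 28, 18, 106],
}


def column_information(counts, column):
    """Return (information content in bits, per-base stack heights) for one column."""
    total = sum(row[column] for row in counts.values())
    freqs = {base: row[column] / total for base, row in counts.items()}
    # Shannon entropy of the column; bases with zero frequency contribute nothing.
    entropy = -sum(p * math.log2(p) for p in freqs.values() if p > 0)
    info = math.log2(len(counts)) - entropy
    # Stack heights: relative frequencies scaled by the column information content.
    heights = {base: p * info for base, p in freqs.items()}
    return info, heights


def logo_rows(counts):
    """One row per column: (position, tallest base, information, sorted stack)."""
    width = len(next(iter(counts.values())))
    rows = []
    for column in range(width):
        info, heights = column_information(counts, column)
        stack = sorted(heights.items(), key=lambda item: item[1])
        rows.append((column + 1, stack[-1][0], info, stack))
    return rows


if __name__ == "__main__":
    for position, top_base, info, stack in logo_rows(MINUS10_COUNTS):
        cells = " ".join(f"{base}:{height:.3f}" for base, height in stack)
        print(position, top_base, cells)
```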

3.1.3 iscan: motif information

By using information theory one can compose a measure of how well a given aligned query sequence conforms to a given weight matrix, measured in bits of information. By producing a matrix of $R_{b,p}$ values (see equation 3.3), the $R_i$ value is obtained by aligning the query sequence to the matrix columns and summing the $R_{b,p}$ values.

$$R_{b,p} = \log_2(4) + \log_2\frac{n_{b,p}}{N}, \qquad R_{tot} = \sum_{p=1}^{L} R_{B,p} \tag{3.3}$$

where b ∈ {A, T, G, C} iterates through the four bases, p denotes the position in the alignment, L is the length of the alignment (or width of the matrix), $n_{b,p}$ is the number of bases b at position p, N is the total number of sequences, and B denotes the base at position p in the query sequence. Recently, Shultzaberger and co-workers proposed an information-theory-based measure to quantify the helical phasing of adjacent binding motifs (Shultzaberger et al., 2007). The authors define an accessibility, n(d) (equation 3.4), and a gap surprisal, GS(d).


Spacing and scoring of multiple motifs

where b ∈ {A, T, G, C} iterates through the possible bases, N is the total number of sequences, and $n_{b,l}$ is the number of bases b observed at position l of the aligned sequences. Recently, Shultzaberger and co-workers proposed an information-theory-based measure to quantify the helical phasing of adjacent binding motifs (Shultzaberger et al., 2007). The authors define an accessibility, n(d) (equation 3.4), and a gap surprisal, GS(d).

$$n(d) = 1 + \cos\left[\frac{2\pi}{w}(d - c)\right] \tag{3.4}$$

where c is the center distance between two binding sites (e.g. optimally spaced), d is the query distance, and w = 10.6 is the length of one helical turn of B-form DNA. Finally, this gives GS(d) as follows:

$$GS(d) = \log_2\frac{n(d)}{N} \tag{3.5}$$

where N is the sum of all n(d), defined in equation 3.6. The sign of GS(d) was changed from the original equation (Shultzaberger et al., 2007) to allow all scores to be combined by addition.

$$N = \sum_{d=\min}^{\max} n(d) \tag{3.6}$$

where min and max are the boundaries of the window examined. The program iscan was written based on the framework of the gap surprisal, helical accessibility, and individual information-based weight matrices just described. The program supports any number of PWMs, separated by user-defined spacers to which the GS(d) measure is applied. The bit score scheme allows the individual parts of the alignment to be added, as shown in equation 3.7:

$$R_i(tot) = R_i(m_1) + GS(d, m_1) + R_i(m_2) + \dots + GS(d, m_{n-1}) + R_i(m_n) \tag{3.7}$$

A σ70 model for iscan

A simple model format was composed for iscan, which allows multiple weight matrices and inter-matrix spacers to be defined. The format also allows recursive parsing of # include statements. Using the –35 box described above and the –10 box also described by Huerta and co-workers (Huerta and Collado-Vides, 2003), the following iscan configuration file was constructed:

[pwm]=-10 region
weight=6
[A] 1 116 35 52 44 7
[C] 15 0 13 24 28 3
[G] 3 0 17 12 26 0
[T] 97 0 51 28 18 106

[spacer]
min=14
center=16
max=18

Slide annotations: accessibility (equation 3.4); distance score in bits (equation 3.5); sum of all n(d) (equation 3.6); total bit score of all boxes (equation 3.7).
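The scoring scheme of equations 3.3 to 3.6 can be sketched in a few lines. This is an illustration under stated assumptions, not the iscan source: the function names are hypothetical, zero counts are scored as minus infinity (the excerpt does not say how iscan treats them), and only the –10 matrix and spacer values from the configuration above are used.

```python
import math

HELIX_TURN = 10.6  # w: length of one helical turn of B-form DNA

# Counts and spacer copied from the configuration file quoted above.
MINUS10_COUNTS = {
    "A": [1, 116, 35, 52, 44, 7],
    "C": [15, 0, 13, 24, 28, 3],
    "G": [3, 0, 17, 12, 26, 0],
    "T": [97, 0, 51, 28, 18, 106],
}
SPACER = {"min": 14, "center": 16, "max": 18}


def position_scores(counts):
    """R_{b,p} = log2(4) + log2(n_{b,p} / N), as in eq. 3.3; zero counts give -inf (assumption)."""
    width = len(counts["A"])
    table = []
    for p in range(width):
        n_total = sum(counts[b][p] for b in "ACGT")
        column = {}
        for b in "ACGT":
            n = counts[b][p]
            column[b] = 2.0 + math.log2(n / n_total) if n > 0 else float("-inf")
        table.append(column)
    return table


def motif_score(table, site):
    """R_tot of eq. 3.3: sum of R_{B,p} over the bases B of the aligned site."""
    if len(site) != len(table):
        raise ValueError("site length must equal the matrix width")
    return sum(column[base] for column, base in zip(table, site))


def accessibility(d, center, w=HELIX_TURN):
    """n(d) of eq. 3.4."""
    return 1.0 + math.cos(2.0 * math.pi / w * (d - center))


def gap_surprisal(d, spacer, w=HELIX_TURN):
    """GS(d) of eq. 3.5, with N summed over the spacer window as in eq. 3.6."""
    window = range(spacer["min"], spacer["max"] + 1)
    total = sum(accessibility(x, spacer["center"], w) for x in window)
    return math.log2(accessibility(d, spacer["center"], w) / total)


if __name__ == "__main__":
    table = position_scores(MINUS10_COUNTS)
    for d in range(SPACER["min"], SPACER["max"] + 1):
        # One matrix term plus one gap term of eq. 3.7; a second matrix would add R_i(m2).
        partial = motif_score(table, "TATAAT") + gap_surprisal(d, SPACER)
        print(f"d={d}: {partial:.2f} bits")
```

Because every term is expressed in bits, the matrix scores and the gap term can simply be added, which is the point of the sign change noted for equation 3.5.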

Initial combined σ70 core, FIS I site, and UP element models (E. coli and Shigella)

[Figure 2.4: Logo plots showing the initial weight matrices used for searching E. coli and Shigella genomes: –10 hexamer (a), –35 hexamer (b), UP element (c), and FIS binding motif (d)]

[Figure 2.5: Profiles showing the iscan scores of the initial weight matrices (see section 2.1.2) applied to E. coli and Shigella: unadjusted P1 scores (a), adjusted P1 scores (b), unadjusted P2 scores (c), and adjusted P2 scores (d)]

1st iteration: model scores (E.coli+Shigella)

[Figure 2.7: Raw and adjusted iscan profiles of E. coli and Shigella (N=91) using refined P1 and P2 matrices for E. coli (x-axis: –500 to 0; y-axis: Ri): unadjusted P1 scores (a), adjusted P1 scores (b), unadjusted P2 scores (c), and adjusted P2 scores (d). The P1 panels combine –10, –35, UP, and FIS scores; the P2 panels combine –10, –35, and UP scores]

[Figure 2.8: Logos showing base composition of P1 and P2 promoter elements in E. coli and Shigella, with consensus letters along the x-axis: P1 –10 hexamer (a), TATAAT; P1 –35 hexamer (b), TTGTCA; P1 UP element (c), TCAGAAAATTATTTTAAATTTC; P1 FIS binding motif (d), TTTGCTTGAAAAATGAGCGGT; P2 –10 hexamer (e), TATTAT; P2 –35 hexamer (f), TTGACT; P2 UP element (g), TCCGAAAAAGAAAGCAAAAAAA]

Strong similarity of P1 and P2 sites

[Figure: grid of sequence logos. Columns: P1 (-10, -35, UP, FIS I) and P2 (-10, -35, UP). Rows: Yersinia, E. coli/Shigella, Salmonella]

[Figure 2.11: Raw and adjusted iscan profiles of Yersinia using refined P1 and P2 matrices for E. coli (y-axis: Ri): unadjusted P1 scores (a), adjusted P1 scores (b), unadjusted P2 scores (c), and adjusted P2 scores (d). Panel titles of (a) and (b): P1 raw and adjusted combined scores, –10, –35, UP (Yersinia) (N=52)]

[Figure 2.12: Logos showing base composition of P1 and P2 promoter elements in Yersinia, with consensus letters along the x-axis: P1 –10 hexamer (a), TATAAT; P1 –35 hexamer (b), GTTGCA; P1 UP element (c), TTGAAAAGTTTTTTGAAATTAG; P1 FIS binding motif (d), TTTGCTGAAAAAATAAGCGGT; P2 –10 hexamer (e), TATTAT; P2 –35 hexamer (f), TTGACT; P2 UP element (g), TTCGAAAGAAATAAAAGAAAAC]


!500 !400 !300 !200 !100 0

!1

00

10

20

P1: Raw combined scores, !10,!35, UP (Salmonella) (N=35)


Ri

(a)

!500 !400 !300 !200 !100 0

!1

00

10

20

30

40

P1: Adjusted combined scores, !10,!35, UP (Salmonella) (N=35)


Ri

(b)

!500 !400 !300 !200 !100 0

!2

0!

10

01

02

03

0



Ri

(c)

!500 !400 !300 !200 !100 0

!2

0!

10

01

02

03

0



Ri

(d)

Figure 2.9: Raw and adjusted iscan profiles of Salmonella using refined P1 and P2matrices for E. coli : Unadjusted P1 scores (a), adjusted P1 scores (b), unadjusted P2scores (c), and adjusted P2 scores (d)

[Figure: seven sequence-logo panels (a)-(g); x-axis: Position, y-axis: Bits (0.0 to 2.0)]

Figure 2.10: Logos showing base composition of P1 and P2 promoter elements in Salmonella: P1 –10 hexamer (a), P1 –35 hexamer (b), P1 UP element (c), P1 FIS binding motif (d), P2 –10 hexamer (e), P2 –35 hexamer (f), P2 UP element (g)


[Figure: four profile panels (a)-(d); x-axis ticks from −500 to 0, y-axis: Ri]

[Figure: seven sequence-logo panels (a)-(g); x-axis: Position, y-axis: Bits (0.0 to 2.0)]



exercise 2

• Locate rRNA genes

• Extract promoter regions

• Predict the presence of P1 and P2

• Extract the -10 and -35 regions

• Build a workflow
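
The middle steps of the exercise can be sketched roughly as follows. This is a minimal illustration, assuming the rRNA gene coordinates are already known (for example from a predictor such as RNAmmer). The names `Gene`, `upstream_region` and `best_match` are hypothetical, the 500-base default simply mirrors the −500 to 0 range of the profile plots, and the consensus mismatch count is a simple stand-in, not the matrix-based scoring behind the figures.

```python
from typing import NamedTuple

# Translation table for complementing DNA bases (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


class Gene(NamedTuple):
    """Hypothetical gene record; coordinates are 1-based and inclusive."""
    start: int
    end: int
    strand: str  # "+" or "-"


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def upstream_region(genome: str, gene: Gene, length: int = 500) -> str:
    """Return up to `length` bases directly upstream of the gene,
    written 5'->3' on the gene's own strand."""
    if gene.strand == "+":
        lo = max(0, gene.start - 1 - length)
        return genome[lo:gene.start - 1]
    # Minus strand: upstream lies to the right on the forward strand.
    hi = min(len(genome), gene.end + length)
    return reverse_complement(genome[gene.end:hi])


def best_match(region: str, consensus: str) -> tuple[int, int]:
    """Return (offset, mismatches) of the window closest to `consensus`.
    The first window wins on ties. Assumption: a plain mismatch count."""
    best = (-1, len(consensus) + 1)
    for i in range(len(region) - len(consensus) + 1):
        window = region[i:i + len(consensus)]
        mismatches = sum(a != b for a, b in zip(window, consensus))
        if mismatches < best[1]:
            best = (i, mismatches)
    return best


# Toy genome: a -35-like and a -10-like hexamer upstream of a short "gene".
genome = "GGGG" + "TTGACA" + "C" * 17 + "TATAAT" + "CCCCC" + "ATGAAA"
gene = Gene(start=39, end=44, strand="+")
region = upstream_region(genome, gene, length=30)
offset, mismatches = best_match(region, "TATAAT")
```

In a real workflow the gene coordinates and the promoter predictions would come from the Web Service operations rather than from local helpers like these.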

Acknowledgements

David W. Ussery
Craig Benham
Karin Lagesen
Jan Christian Bryne
Francisco Roque
Kristoffer Rapacki
Hans Henrik Stærfeldt
