Deep Web Service Crawler

Dresden University of Technology (Germany)

Deep Web Services Crawler

Name: Duan Dehua
Matrikel-Nr.: 3459827
Faculty: Department of Computer Science
Major: Computational Engineering
Kind of Topic: Master Thesis
Supervisor: Dipl.-Inf. Josef Spillner
            Prof. Dr. rer. nat. habil. Dr. h. c. Alexander Schill
Start Date: May 01, 2010
Finish Date: October 31, 2010


Acknowledgements

The work presented in this master thesis is the result of the master task for the Web Service project, which was provided by the Computer Networks group at Dresden University of Technology.

I am heartily thankful to my supervisor, Dipl.-Inf. Josef Spillner, whose encouragement, guidance and support from the initial to the final level enabled me to develop an understanding of the subject.

Lastly, I offer my regards and blessings to all of those who supported me in any respect during the completion of the project.


Abstract

Nowadays, Web Service Registries offer convenient access for offering, searching and using electronic Web Services. Usually they host Web Service descriptions along with related metadata generated by both the system and the users. Hence, monitoring and rating information can help users to distinguish between similar Web Service offerings. However, at present there is little support for comparing these Web Services across platforms and for building a global view. Moreover, not all of the metadata, especially the non-functional property descriptions, is made available in a structured format.

The task of this master thesis is therefore to apply Deep Web analysis techniques to extract as much information about these published Web Services as possible. The result shall be the largest annotated Service Catalogue ever produced.

Index Terms

Web Service, Deep Web Service Crawler, Service-Finder, Pica-Pica Web Service Description Crawler, WSDL


Table of Contents

Acknowledgements
Abstract
1 Introduction
1.1 Background/Motivation
1.2 Initial Designing of the Deep Web Service Crawler Approach
1.3 Goals of this Master Thesis
1.4 Outline of this Master Thesis
2 State of the Art
2.1 Service-Finder Project
2.1.1 Use Cases for Service-Finder Project
2.1.1.1 Use Case Methodology
2.1.1.2 System Administrator
2.1.2 Architecture Plan for the Service-Finder Project
2.1.2.1 The Principle of the Service Crawler Component
2.1.2.2 The Principle of the Automatic Annotator Component
2.1.2.3 The Principle of the Conceptual Indexer and Matcher Component
2.1.2.4 The Principle of the Service-Finder Portal Interface Component
2.1.2.5 The Principle of the Cluster Engine Component
2.2 Information Extraction
2.2.1 Input Types of Information Extraction
2.2.2 Extraction Targets of Information Extraction
2.2.3 The Used Techniques in Information Extraction
2.3 Pica-Pica Web Service Description Crawler
2.3.1 Needed Libraries of the Pica-Pica Web Service Description Crawler
2.3.2 Architecture of the Pica-Pica Web Service Description Crawler
2.3.3 Implementation of the Pica-Pica Web Service Description Crawler
2.4 Conclusions of the Existing Strategies
3 Design and Implementation
3.1 Deep Web Services Crawler Requirements
3.1.1 Basic Requirements for DWSC
3.1.2 System Requirements for DWSC
3.1.3 Non-Functional Requirements for DWSC
3.2 Deep Web Services Crawler Architecture
3.2.1 The Function of Web Service Extractor Component
3.2.1.1 Features of the Web Service Extractor Component
3.2.1.2 Input of the Web Service Extractor Component
3.2.1.3 Output of the Web Service Extractor Component
3.2.1.4 Demonstration for Web Service Extractor
3.2.2 The Function of WSDL Grabber Component
3.2.2.1 Features of the WSDL Grabber Component
3.2.2.2 Input of the WSDL Grabber Component
3.2.2.3 Output of the WSDL Grabber Component
3.2.2.4 Demonstration for WSDL Grabber Component
3.2.3 The Function of Property Grabber Component
3.2.3.1 Features of the Property Grabber Component
3.2.3.2 Input of the Property Grabber Component
3.2.3.3 Output of the Property Grabber Component
3.2.3.4 Demonstration for Property Grabber Component
3.2.4 The Function of Storage Component
3.2.4.1 Features of the Storage Component
3.2.4.2 Input of the Storage Component
3.2.4.3 Output of the Storage Component
3.2.4.4 Demonstration for Storage Component
3.3 Multithreaded Programming for DWSC
3.4 Sleep Time Configuration for Web Service Registries
4 Experimental Results and Analysis
4.1 Statistic Information for Different Web Service Registries
4.2 Statistic Information for WSDL Document
4.3 Comparison of Different Average Number of Service Properties
4.4 Different Outputs of Web Services
4.5 Comparison of Average Time Cost for Different Parts of Single Web Service
5 Conclusion and Further Direction
6 Bibliography
7 Appendixes
Table of Figures
Table of Tables
Table of Abbreviations


1 Introduction

This introduction chapter first concisely explains the background of the current situation and then gives a basic introduction to the proposed approach, which is called the Deep Web Service Extraction Crawler.

1.1 Background/Motivation

In the late 1990s, the Web Service Registry was a hot commodity. A Web Service Registry is essentially a links page: its function is to uniformly present information that comes from various sources. Hence, it provides a convenient channel for users to offer, search for and use Web Services. The related metadata of the Web Services, submitted by both the system and the users, is commonly hosted along with the service descriptions.

Nevertheless, when users enter one of these Web Service Registries to look for Web Services, they may run into situations that cause them a lot of trouble. One such situation is that a Web Service Registry returns several similar published Web Services after a search. For example, two or more Web Services may have the same name but different versions, or two or more Web Services may be derived from the same server but have different contents. Furthermore, most users are also interested in a global view of the published services; for instance, they want to know which Web Service Registry provides better quality for a given Web Service. Therefore, in order to help users differentiate similar published Web Services and gain a global view of them, this information should be monitored and rated.

Moreover, there are a great many Web Service Registries on the Internet, and each of them can provide a great number of Web Services. Obviously, there may be similar Web Services among these registries, or a Web Service in one registry may be related to another Web Service in a different registry. Hence, these Web Services should be comparable across different Web Service Registries; however, there is currently not much support for this. In addition, not all of the metadata is structured, especially the descriptions of the non-functional properties. Therefore, the task at hand is to turn those non-functional property descriptions into a structured format. In other words, as much information as possible about the Web Services offered in the Web Service Registries needs to be extracted. Eventually, after extracting all this information from the Web Service Registries, it is necessary to store it on disk. This procedure should be efficient, flexible and complete.

1.2 Initial Designing of the Deep Web Service Crawler Approach

The problems have already been stated in the previous section; hence, the following work is to solve them. This section presents the basic principle of the Deep Web Service Crawler approach.

First, a short introduction to how the Deep Web Service Crawler approach addresses these problems. As already mentioned, each Web Service Registry offers Web Services, and each Web Service Registry has its own HTML page structures, which may be the same or completely different. Therefore, the first step is to identify which Web Service Registry is going to be explored. Since each Web Service Registry owns a unique URL, this job can be done by directly analyzing the corresponding URL address of that Web Service Registry. After identifying the Web Service Registry, the following step is to obtain all the Web Services published in it. Then, with all these obtained Web Services, it is time to extract, analyze and gather information about the services. This information can be in a structured or an unstructured format. In this master thesis, Deep Web analysis techniques are applied to obtain this information, so that the information about each Web Service can be annotated as richly as possible. Last but not least, all the information about the Web Services needs to be stored.
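As a simple illustration of the first step, identifying the registry from its seed URL, consider the following minimal Java sketch. The class name and the string matching are only illustrative assumptions; the actual implementation described in Chapter 3 uses a dedicated crawling process per registry.

import java.net.URI;

// Minimal sketch: map a seed URL to the Web Service Registry it belongs to,
// so that the registry-specific crawling process can be selected.
public class RegistryIdentifier {

    public static String identify(String seedUrl) throws Exception {
        String host = new URI(seedUrl).getHost();
        if (host == null) {
            throw new IllegalArgumentException("Not a valid seed URL: " + seedUrl);
        }
        if (host.contains("service-repository.com")) return "Service-Repository";
        if (host.contains("xmethods.net"))           return "Xmethods";
        if (host.contains("seekda.com"))             return "Seekda";
        if (host.contains("biocatalogue"))           return "Biocatalogue";
        if (host.contains("ebi.ac.uk"))              return "Ebi";
        return "Unknown registry";
    }

    public static void main(String[] args) throws Exception {
        // Example: the Service-Repository seed selects its dedicated crawling process.
        System.out.println(identify("http://www.service-repository.com"));
    }
}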

1.3 Goals of this Master Thesis

The goals of this master thesis are listed in the following.

- Produce the largest annotated Service Catalogue
  The Service Catalogue is a list of service properties. The more properties a service has, the larger the Service Catalogue it owns. Therefore, this master program should extract as many service properties as possible.

- Flexible storage of the metadata of each service as annotations or dedicated documents
  The metadata of one service includes not only the WSDL document but also the service properties. All this metadata is important information about the service. Therefore, this master program should provide flexible ways to store this metadata on disk.

- Improve the comparability of the Web Services across different Web Service Registries
  The names of the service properties in one Web Service Registry can differ from those in another Web Service Registry. Hence, in order to improve comparability, all these service property names should be unified and well-defined.

1.4 Outline of this Master Thesis

In this chapter, the motivation, objectives and initial approach plan have been discussed. The remainder of this thesis is structured as follows.

Chapter 2 reviews existing techniques. Section 2.1 gives a detailed introduction to the technique of the Service-Finder project. Section 2.2 then presents the Information Extraction technique. After that, Section 2.3 introduces and discusses the already implemented Pica-Pica Web Service Description Crawler.

Chapter 3 explains the design details of the Deep Web Service Crawler approach. Section 3.1 gives a short description of the different requirements of this approach. Section 3.2 presents the actual design of the Deep Web Service Crawler. Sections 3.3 and 3.4 then introduce the multithreaded programming and the sleep time configuration used in this master program, respectively.

Chapter 4 presents the experiments with the Deep Web Service Crawler approach and gives an evaluation of them.

Finally, Chapter 5 presents the conclusions, a discussion of the work that has been done, and the directions for future work on this master task.


2 State of the Art

This chapter presents existing techniques and strategies related to the Deep Web Service Extraction Crawler approach. Section 2.1 discusses an existing catalogue, the Service-Finder project. Section 2.2 then presents some details of the Information Extraction technique. Finally, Section 2.3 explains an existing, already implemented crawler, the Pica-Pica Web Service Description Crawler.

2.1 Service-Finder Project

The Service-Finder project aims at developing a platform for Web Service discovery, especially for Web Services that are embedded in a Web 2.0 environment [1]. Hence, it can provide efficient access to publicly available services. The goals of the Service-Finder project are as follows [1]:

- Automatically gather Web Services and their related information.
- Semi-automatically create semantic service descriptions based on the information available on the Web.
- Create and improve semantic annotations via user feedback.
- Describe the aggregated information in semantic models and allow reasoning and querying.

However, before describing the basic functionality of the Service-Finder project, one of its use cases and its requirements are presented first.

2.1.1 Use Cases for Service-Finder Project

The Service-Finder project adopted the use case methodology of the W3C Use Case description [6] for its needs and then applied this methodology to the use cases it enumerated.

2.1.1.1 Use Case Methodology

Three aspects need to be considered for the use case definitions of the Service-Finder project [1]:

(1) Description: used to describe the information of the use case.
(2) Actors, Roles and Goals: used to identify the actors, the roles they act in, and the goals they need to achieve in the scenario.
(3) Storyboard: used to describe the series of interactions among the actors and the Service-Finder Portal.

2.1.1.2 System Administrator

This section presents the use case that was applied to the Service-Finder Portal and that illustrates the requirements on its functionality from a user's point of view. All the information in this use case is derived from [1]. In this use case there is a system administrator named "Sam Adams". He works for a bank, and his job is to keep the online payment facilities up and working all day and night. Therefore, if there is any system failure, Sam Adams should fix the problem as early as he can. That is why he wants to use an SMS Messaging service which will alert him immediately by sending him an SMS message in case of a system failure.

- Description
This use case describes the system administrator "Sam Adams", who is looking for an SMS Messaging Service that he wants to build into his application.

- Actors, Roles and Goals
The name of the actor is "Sam Adams". His role is that of a system administrator at a bank. His goals are immediate service delivery, reliability of the service, and a low base fee and transaction fee.

- Storyboard

Step 1: Sam Adams knows the Service-Finder Portal and knows that he can find many useful services there; in particular, he knows what he is looking for. Hence, he visits the Service-Finder Portal and starts to search by entering the keywords "SMS Alert".
Requirement 1: Search functionality.

Step 2: The Service-Finder Portal now returns a list of matching services. However, Sam wants to choose the number of matching services that will be displayed on one page. He would also expect short information about the service functionality, the service provider and the service availability, so that he can decide which service he will read about further.
Requirement 2: Enable configurable pagination of the matching results and show short information for each service.

Step 3: When Sam looks through the short information about the services displayed on the first page, he expects to find the services most relevant to his request. After that, he would like to read more detailed information about a service to see whether it can provide the needed functionality.
Requirement 3: Rank the returned matching services and provide the ability to read more details of a service.

Step 4: It can happen that all the returned matching services provide quite different functionalities or belong to different service categories; for example, some SMS messaging services alert users not through SMS but through voice messaging. For this reason, Sam would like to see other categories that may contain the services he wants, or the services of other categories in which he is also interested (like "SMS Messaging"). Another possible way is for Sam to further refine his search by browsing through categories.
Requirement 4: Provide service categories and allow the user to see all services belonging to a specific category. If possible, also allow the user to browse through categories.

Step 5: When Sam has got all the services that could provide an SMS messaging service via the methods described in Step 4, he then wants to look for the services offered by an Austrian provider and, if possible, with no base fees.
Requirement 5: Faceted search.

Step 6: After Sam has got all these specific services, he would like to choose the services that can provide high reliability.
Requirement 6: Sort functionality based on the user's choices.

Step 7: Sam now wants to compare the service availability promised by the service provider with the availability actually provided. This should be contained in the service details. There also needs to be service coverage information, so that Sam can know whether the service covers the areas where he lives and works. Moreover, Sam would also like to compare these services in other ways, for instance by putting some services into a structured table to compare the transaction fees.
Requirement 7: A side-by-side comparison table for services and a functionality that enables the user to select the services he wants to compare.

Step 8: Finally, Sam wants to know whether the service providers offer a free trial of the services, so that he can test the service functionality.
Requirement 8: If possible, display a note for services offering free trials.

2.1.2 Architecture Plan for the Service-Finder Project

The architecture plan of the Service-Finder project contains five basic components: Service Crawler, Automatic Annotator, Conceptual Indexer and Matcher, Cluster Engine, and Service-Finder Portal Interface. Figure 2-1 presents a high-level overview of the components and the data flow among them.

Figure 2-1: Data flow of Service-Finder and its components [3]


2.1.2.1 The Principle of the Service Crawler Component

The Service Crawler is responsible for gathering available services and their related information from the Web. The overall cycle is as follows:

(1) A Web developer publishes a Web Service.
(2) The crawling component then begins to harvest the Web in order to identify Web Services, e.g. via WSDL (Web Service Description Language) documents.
(3) As soon as a service is discovered, the crawler also searches for other related information.
(4) After each periodic interval, the crawler produces a consistent snapshot of the relevant part of the Web.

Finally, the output of the crawler is forwarded to the subsequent components for analyzing, indexing and displaying.

2.1.2.2 The Principle of the Automatic Annotator Component

The Automatic Annotator receives the relevant data from the previous component and generates semantic service descriptions of the WSDL documents and their related information, based on the Service-Finder Ontology and the Service Category Ontology.

First, the two compatible ontologies that are used throughout the whole process are briefly introduced [2]:

- Generic Service Ontology: an ontology used to describe the data objects, for example the services, the service providers, availability, payment modalities and so on.
- Service Category Ontology: an ontology used to categorize the functionalities or applications of the services, for instance data verification, messaging, data storage, weather, etc.

The function of this component, together with its input and output, is as follows.

Input:
- Crawled data from the Service Crawler
- Service-Finder ontologies
- Feedback on or corrections of previous annotations

Function:
- Enrich the information about the service and extract semantic statements according to the Service-Finder ontologies, for example categorizing the service according to the Service Category Ontology.
- Determine whether a particular document is relevant or not through the Web link graph; if not, discard the irrelevant document.
- Classify the pages into their genres, for instance pricing, user comments, FAQ and so on.

Output:
- Semantic annotations of the services


2.1.2.3 The Principle of the Conceptual Indexer and Matcher Component

The Conceptual Indexer and Matcher is essentially a data store center that aims at storing all the extracted information about the services and supplying users with retrieval and semantic query capabilities, for example matchmaking between user requests and service offers, and retrieving user feedback on extracted annotations.

The function of this component, together with its input and output, is as follows.

Input:
- Semantic annotation data and full-text information obtained from the Automatic Annotator
- Semantic annotation data and full-text information that comes from the user interface
- Cluster data from the user and service clustering component

Function:
- Store the semantic annotations received from the Automatic Annotator component and from the user interface.
- Store the cluster data procured through the clustering component.
- Store and index the textual descriptions offered by the Automatic Annotator component and the textual comments offered by users.
- Perform ontological queries on the semantic data in the data store center.
- Perform combined keyword and ontological querying for user queries.
- Provide a list of similar services for a given service.

Output:
- A list of matching services for a user query; in particular, these services should be sorted by ranking, and it should be possible to iterate over them.
- All available data related to a particular entity must be retrievable at the user interface.

2.1.2.4 The Principle of the Service-Finder Portal Interface Component

The Service-Finder Portal Interface is the main entry point provided to the users of the Service-Finder system for searching and browsing the data managed by the Conceptual Indexer and Matcher component. In addition, the users can also contribute information by providing tags, comments, categorizations and ratings for the data they browse. Furthermore, developers can directly invoke the Service-Finder functionalities from their custom applications via an API.

The details of this component's function, input and output are given below.

Input:
- A list of ordered services for a query
- Detailed information about a service, a set of services, or a service provider
- Query access to the service category ontology and the most used tags provided by the users
- Service availability information

Function:
- The Web interface allows the users to search for services by keyword, tag or concept in the categorization; to sort and filter query results by refining the query; to compare and bookmark services; and to try out the services that offer this functionality.
- The API allows the developers to invoke Service-Finder functionalities.

Output:
- Explicit user annotations such as tags, ratings, comments, descriptions and so on
- Implicit user data, for example click streams of users, bookmarks, comparisons, links sent, etc.
- Manual advertising of available new services

2.1.2.5 The Principle of the Cluster Engine Component

The Cluster Engine is in charge of collecting and analyzing information about user behavior from the Service-Finder Portal, e.g. the services queried and compared by the users. It also provides cluster data to the Conceptual Indexer and Matcher for providing service recommendations.

The component's function, input and output are as follows.

Input:
- Service annotation data, both extracted and from user feedback
- Users' click streams, used for extracting user behavior

Function:
- Obtain user clusters from user behavior.
- Obtain service clusters from service annotation data in order to be able to find similar services.

Output:
- Clusters of users and services

2.2 Information Extraction

Due to the rapid development and use of the World Wide Web, a huge number of information sources have appeared on the Internet, but access to them has been limited to browsing and searching because of the heterogeneity and the lack of structure of Web information sources. Therefore, Information Extraction, which transforms Web pages into program-friendly structures for post-processing, has become a great necessity. The task of Information Extraction is specified in terms of its inputs and its extraction targets, and the technique used in the process of Information Extraction is called an extractor.

2.2.1 Input Types of Information Extraction

Generally speaking, there are three different input types. The first input type is the unstructured document, for example the free text shown in Figure 2-2. It is unstructured and written in natural language, so it requires substantial natural language processing. The second input type is the structured document, for instance XML documents, in which the data can be described through an available DTD (Document Type Definition) or XML (eXtensible Markup Language) schema. Finally, the third input type is the semi-structured document, which is widespread on the Web, such as the large volume of HTML pages containing tables, itemized lists and enumerated lists. This is because HTML tags are often used to render the embedded data in these HTML pages; see Figure 2-3.

Figure 2-2: Left is the free text input type and right is its output [4]

Figure 2-3: A semi-structured page containing data records (in rectangular boxes) to be extracted [4]

Inputs of the semi-structured type can thus be seen as documents with a fairly regular structure, and the data in these documents can be displayed in an HTML or non-HTML format. Moreover, since the Web pages of the Deep Web are dynamic and generated from structured databases using templates or layouts, the Deep Web can be considered one of the input sources that provide such semi-structured documents. For example, the authors, price and comments of the book pages provided by Amazon all have the same layout, because these Web pages are generated from the same database and use the same template or layout. Furthermore, there is another option, namely manually generated HTML pages of the semi-structured type. For example, although the publication lists provided on different researchers' homepages are produced by diverse users, they all have a title and a source property for every single paper. Finally, the inputs for Information Extraction can also be pages of the same class or pages from various Web Service Registries.

2.2.2 Extraction Targets of Information Extraction

Regarding the task of Information Extraction, the extraction target also has to be considered. There are two different extraction targets. The first one is the relation of a k-tuple, where k is the number of attributes in a record. In some cases an attribute of a record may have no instantiation, while in other cases an attribute may have multiple instantiations. The second extraction target is the complex object with hierarchically organized data. Though the ways of depicting the extraction targets in a page are diverse, the most common structure is the hierarchical tree. This hierarchical tree may contain only one leaf node, or one or more lists of leaf nodes attached to internal nodes. The structure of a data object may also be flat or nested: in brief, if the structure is flat, then there is only one leaf node, which can also be called the root; if it is a nested structure, then the internal nodes involved in this data object span more than two levels.

Furthermore, in order to make the Web pages readable for human beings and easier to visualize, the tables or tuples of the same list, or the elements of a tuple, are usually isolated or demarcated. However, the display of a data object in a Web page can be affected by the following conditions [4]:

- The attribute of a data object has zero or several values.
(1) If there is no value for an attribute of a data object, this attribute is called a "none" attribute. For example, a special offer only available for certain books might be a "none" attribute.
(2) If there is more than one value for an attribute of a data object, it is called a "multiValue" attribute. For instance, the name of the author of a book could be a "multiValue" attribute.

- The set of attributes (A1, A2, A3, ...) has multiple orderings.
That is to say, within this set of attributes, the position of an attribute might change according to the different instances of a data object. Such an attribute is called a "multiOrdering" attribute. For instance, for movies before 1999 a movie site might list the release date in front of the movie's title, while for movies from 1999 onwards (including 1999) it lists the release date behind the movie's title.

- The attribute has different formats.
This means the display format of the data object can be completely different for different instances. Therefore, if the format of an attribute is free, a lot of rules are needed to deal with all the possible cases. This kind of attribute is called a "multiFormat" attribute. For example, an e-commerce Web site might use a bold font to present the regular prices while using a red color to display the sale prices. There is also the situation that different attributes of a data object have the same format: for example, various attributes are presented using <TD> tags in a table presentation. Such attributes can be differentiated by means of their order information. However, in cases where a "none" attribute occurs or "multiOrdering" attributes exist, the rules for extracting these attributes have to be revised.

- The attribute cannot be decomposed.
For easier processing, the input documents are sometimes treated as strings of tokens instead of strings of characters. However, some attributes cannot be decomposed into several individual tokens. These attributes are called "untokenized" attributes. For example, in college course catalogue entries like "COMP4016" or "GEOL2001", the department code and the course number cannot be separated into two different strings of characters such as "COMP" and "4016" or "GEOL" and "2001".

2.2.3 The Used Techniques in Information Extraction

The extractor used in the process of Information Extraction aims at providing a single, uniform query interface to access information sources such as database servers and Web servers. It consists of the following phases: collecting the returned Web pages, labeling these Web pages, generalizing extraction rules, extracting the relevant data, and outputting the result in an appropriate format (an XML format or a relational database) for further information integration. For example, the extractor first queries the Web server to gather the returned pages through the HTTP protocol; after that it extracts the contents of these HTML documents and thereafter integrates them with other data sources. The whole process of the extractor follows the steps below.

Step 1:
At the beginning, the input has to be tokenized. There are two different granularities for tokenizing the input string: tag-level encoding and word-level encoding. Tag-level encoding transforms the tags of the HTML page into general tokens and transforms all text strings between two tags into a special token. Word-level encoding does this in another way: it treats each word in a document as a token.

Step 2:
Next, the extraction rules for every attribute of the data object in the Web pages are applied. These extraction rules can be induced in terms of a top-down or bottom-up generalization, pattern mining, or logic programming. In addition, the type of extraction rules may be expressed by means of regular grammars or logic rules. For example, some use path expressions of the HTML parse tree, such as html.head.title or html->table[0]; some use syntactic or semantic constraints; and some use delimiter-based constraints such as HTML tags or literal words.

Step 3:
After that, all the extracted data is assembled into records.

Step 4:
Finally, this process is iterated until all the data objects in the input have been processed.
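To make the delimiter-based variant of Step 1 and Step 2 concrete, the following Java sketch tokenizes a small HTML fragment at tag level and applies a simple extraction rule that takes the text between two delimiter tags. It is only an illustration of the general technique described above, not code from any cited extractor; the sample fragment and the chosen delimiters are invented for the example.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative extractor: tag-level tokenization plus one delimiter-based
// extraction rule ("take the text between <td> and </td>").
public class SimpleExtractor {

    public static void main(String[] args) {
        // Invented sample of a semi-structured HTML fragment.
        String html = "<tr><td>BLZService</td><td>http://example.org/blz?wsdl</td></tr>";

        // Step 1 (tag-level encoding): split the input into tag tokens and text tokens.
        Matcher tokenizer = Pattern.compile("<[^>]+>|[^<]+").matcher(html);
        while (tokenizer.find()) {
            System.out.println("token: " + tokenizer.group());
        }

        // Step 2 (delimiter-based rule): extract every value enclosed by <td> ... </td>.
        Matcher rule = Pattern.compile("<td>([^<]*)</td>").matcher(html);
        while (rule.find()) {
            // Step 3 would assemble these attribute values into a record.
            System.out.println("attribute: " + rule.group(1));
        }
    }
}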


2.3 Pica-Pica Web Service Description Crawler

Pica pica is known as a bird species, also called the magpie. Here, however, Pica-Pica is a Web Service Description Crawler which is designed to address the quality of Web Services, for example by evaluating the descriptive quality of the offered Web Services and how well these Web Services are described in today's Web Service Registries.

2.3.1 Needed Libraries of the Pica-Pica Web Service Description Crawler

This version of the Pica-Pica Web Service Description Crawler was written by Anton Caceres and Josef Spillner and is programmed in the Python language. In order to run these scripts to parse the HTML pages, two additional libraries are needed: Beautiful Soup and html5lib.

- Beautiful Soup
It is an HTML/XML parser for Python, and it can even turn invalid markup into a parse tree [5]. Moreover, the following three features make it more powerful:
  - Bad markup does not choke Beautiful Soup. In fact, it generates a parse tree that makes approximately as much sense as the original document, so you can still obtain the data you want.
  - Beautiful Soup provides a toolkit of simple, idiomatic methods for navigating, searching and modifying the parse tree. Hence, you do not need to create a custom parser for every application.
  - If the document already specifies an encoding, you can ignore encodings, since Beautiful Soup automatically converts documents to Unicode and outputs UTF-8. Otherwise, all you have to do is specify the encoding of the original document.

Furthermore, the ways of including Beautiful Soup in an application are shown in the following [5]:

from BeautifulSoup import BeautifulSoup          # For processing HTML
from BeautifulSoup import BeautifulStoneSoup     # For processing XML
import BeautifulSoup                             # To get everything

- html5lib
It is a Python package which implements the HTML5 [8] parsing algorithm. In order to gain maximum compatibility with the current major desktop web browsers, this implementation is based on the WHATWG (Web Hypertext Application Technology Working Group) HTML5 specification.


2.3.2 Architecture of the Pica-Pica Web Service Description Crawler

Figure 2-4: Architecture of the Pica-Pica Web Service Description Crawler

Figure 2-4 illustrates the primary architecture of the Pica-Pica Web Service Description Crawler. It includes four fundamental components: the Service Page Grabber component, the WSDL Grabber component, the Property Grabber component and the WSML Register component.

(1) The Service Page Grabber component takes the URL seed as input and outputs the link of each service page to the following two components: the WSDL Grabber component and the Property Grabber component.
(2) The WSDL Grabber component is responsible for obtaining the WSDL document based on the delivered service page link and for checking the validity of the obtained WSDL document. Only valid WSDL documents are passed on to the WSML Register component for further processing.
(3) The Property Grabber component tries to extract the service properties hosted in the service page, if there are any. All these service properties are then saved into an INI file as the information about that service.
(4) The functionality of the WSML Register component is to write appropriate WSML documents by means of the valid WSDL documents delivered from the WSDL Grabber component and the optional INI files delivered from the Property Grabber component. Afterwards, it registers them in ConQo.

- WSML [9]
It stands for Web Service Modeling Language, which provides a framework with different language variants. Hence, it is often used to describe the different aspects of semantic Web Services according to the conceptual model of WSMO.

- WSMO [10]
WSMO, whose full name is Web Service Modeling Ontology, is dedicated to describing various aspects of Semantic Web Services based on ontologies. An ontology is a formal, explicit specification of a shared conceptualization. In fact, these ontologies are the sticking points that link the agreement of the communities of users with the defined conceptual semantics of the real world.

- ConQo [11]


It is a discovery framework that considers not only the Quality of Service (QoS) but also context information. It uses a Web Service Repository to manage service descriptions based on WSML.

2.3.3 Implementation of the Pica-Pica Web Service Description Crawler

This section describes the processes of the implementation of the Pica-Pica Web Service Description Crawler in detail.

(1) To start the whole crawling process of the Pica-Pica Web Service Description Crawler, an input is needed as the initial seed. In this crawler there are five Web Service Registries, listed below. The URL addresses of these five Web Service Registries are used as the input seeds for the Pica-Pica Web Service Description Crawler. Moreover, in this version of the Pica-Pica Web Service Description Crawler there is a separate Python script for each Web Service Registry, and the crawling processes of these Web Service Registries' Python scripts are executed one after another.

Biocatalogue: http://www.biocatalogue.com
Ebi: http://www.ebi.ac.uk
Seekda: http://www.seekda.com
Service-Repository: http://www.service-repository.com
Xmethods: http://www.xmethods.net

(2) After being fed with the input seed, the process steps into the next component, the Service Page Grabber. At first this component tries to read the data from the Web based on the input seed. Then it builds a parse tree of the read data using the functions of the Beautiful Soup library. After that, the Service Page Grabber component starts to look for the service page link of each service published in the Web Service Registry by means of the functions of the html5lib library. When the service page link of a single service is found, the component first checks whether this service page link is valid. Once the service page link is valid, it is passed to the following two components for further processing: the WSDL Grabber component and the Property Grabber component.

(3) When the WSDL Grabber component receives a service page link from its previous component, it sets out to extract the WSDL link address for that service from the parse tree of the data in this service page. Next, this component downloads the WSDL document of that service via the WSDL link address. Thereafter, the obtained WSDL document is stored on disk. The process of the WSDL Grabber component is carried on continually until no more service links are passed to it. Certainly not all grabbed WSDL documents are usable: they may contain bad definitions or a bad namespace URI, be an empty document, or, even worse, not be in XML format at all. Hence, in order to pick these out, this component further analyzes the obtained WSDL documents. All valid documents are put into a "validWSDLs" folder, whereas the invalid documents are put into a folder named "invalidWSDLs" in order to gather statistical information. Finally, only the WSDL documents in the "validWSDLs" folder are passed to the subsequent component.

(4) Moreover, since some Web Service Registries give additional information about the services, such as availability, service provider and version, the Property Grabber component sets out to extract this information as the service's properties and thereafter saves these properties into an INI file. However, if no additional information is available, there is no need to extract service properties, and thus there is no INI file for that service. In this implementation of the Pica-Pica Web Service Description Crawler, only the Python scripts for the Seekda and Service-Repository Web Service Registries have functions to extract the services' properties, while the other three Web Service Registries lack such functions.

(5) Furthermore, it is optional to create a report file which contains statistical information about this process, such as the total number of services for one Web Service Registry, the number of services whose WSDL document is invalid, etc.

(6) As has been stated, at this point there is a folder with all valid WSDL documents and possibly also some INI files. Therefore, the task of the WSML Register component is now to generate the appropriate WSML documents from these valid WSDL documents and INI files, and then to register them in ConQo.

2.4 Conclusions of the Existing Strategies

This chapter has presented three aspects of the existing strategies: the Service-Finder project, the Information Extraction technique and the Pica-Pica Web Service Description Crawler.

The task of this master program is to obtain the available Web Services and their related information from the Web. This is in fact a procedure of extracting the needed information for each service, such as the service's WSDL document and its properties. Therefore, the Information Extraction technique, which is used to extract information hosted on the Web, can be applied in this master program.

Moreover, Service-Finder is a large project that is not only able to obtain the available Web Services and their related information from the Web, but also enriches these crawled data with annotations by means of the Service-Finder ontology and the Service Category ontology and integrates all the information into a coherent semantic model. Furthermore, it also provides capabilities for searching and browsing the data through a user interface and gives users service recommendations. However, for a master program the Service-Finder project far exceeds the requirements. Therefore, it is considered only as a reference for this master program.

Furthermore, since the Pica-Pica Web Service Description Crawler aims only at obtaining the available Web Services and their related information, it fulfills the primary task of this master program. Nevertheless, regarding the information about the services, this Pica-Pica Web Service Description Crawler extracts only a few properties, sometimes even none. Consequently, in order to improve the quality of the service information, as many properties about each service as possible have to be extracted. Hence, Chapter 3 presents an extension of this Pica-Pica Web Service Description Crawler.


3 Design and Implementation

The previous chapter on the state of the art presented already existing techniques and implementations. The following introduces the basic principle of the proposed approach, the Deep Web Services Crawler, which is based on these existing techniques, especially the Pica-Pica Web Service Description Crawler.

3.1 Deep Web Services Crawler Requirements

This section discusses the goals of the Deep Web Service Crawler approach, the system requirements of the approach, and some non-functional requirements.

3.1.1 Basic Requirements for DWSC

The following are the basic requirements which should be achieved.

(1) Produce the largest annotated Service Catalogue. Here, the Service Catalogue is a list of the Web Services published in a Web Service Registry. It contains not only the WSDL document of each service but also a file with the information about that service. Therefore, for the purpose of producing the largest annotated Service Catalogue, the proposed approach needs to extract as many properties about those Web Services as possible. Moreover, it also has to download the WSDL document hosted along with each Web Service. That is to say, these properties are not only the interesting structured properties, such as the service name and its WSDL link address, but also other non-functional properties, for example the endpoint and its monitoring information.

(2) Flexible storage: each service has a Service Catalogue entry that contains all its interesting properties. How to deal with those services' properties, i.e. which schemes will be used to store them, is a major question. Hence, in order to store them in a flexible way, the proposed approach provides three methods for storage: the first stores them as an XML file, the second stores them in an INI file, and the third uses a database for storage.
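As a small illustration of the first two storage methods, the following Java sketch writes a few invented service properties both as an INI-style file and as an XML file, using only the standard java.util.Properties class. The actual Storage component described in Section 3.2 is more elaborate; this is only a sketch of the idea, and the property names and file names are illustrative.

import java.io.FileOutputStream;
import java.util.Properties;

// Sketch: store service properties as an INI-style file and as an XML file.
public class PropertyStorageSketch {

    public static void main(String[] args) throws Exception {
        Properties serviceProperties = new Properties();
        // Invented example values; the real crawler fills these from the registry pages.
        serviceProperties.setProperty("name", "BLZService");
        serviceProperties.setProperty("wsdl", "http://example.org/blz?wsdl");
        serviceProperties.setProperty("provider", "Example Provider");

        // Method 1: INI-like key=value file.
        try (FileOutputStream ini = new FileOutputStream("BLZService.ini")) {
            serviceProperties.store(ini, "Service properties");
        }

        // Method 2: XML file.
        try (FileOutputStream xml = new FileOutputStream("BLZService.xml")) {
            serviceProperties.storeToXML(xml, "Service properties");
        }
        // Method 3 (not shown) would insert the same values as a record into a database table.
    }
}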

3.1.2 System Requirements for DWSC

Generally speaking, the requirements for carrying out a programming project include the following:

1) Operating system: Linux/Unix, Windows (XP, Vista, 2000, etc.)
2) Programming language: C++, Java, Python, C, etc.
3) Programming tool: NetBeans, Eclipse, Visual Studio, and so on

In this master thesis, the scripts of the Deep Web Service Crawler approach are written in the Java programming language. These code scripts have only been tested on the Windows XP and Linux operating systems and have not been tested on other operating systems.


3.1.3 Non-Functional Requirements for DWSC

In this part, several non-functional requirements for the Deep Web Service Crawler approach are presented.

1) Transparency: the process of data exploration and data storage should be done automatically, without the user's intervention. However, at the start the user should specify the path on the hard disk which will be used to store the outputs of this program.

2) Fault tolerance: during the execution of this program, errors can inevitably happen. Therefore, in order to keep the process from being interrupted, there must be the necessary error handling for process recovery.

3) Completeness: this approach should extract as many of the interesting properties of each Web Service as possible, e.g. endpoint, monitoring information, etc.

In addition, since the Pica-Pica Web Service Crawler has already implemented strategies for the following five URLs, the proposed approach must support no fewer than these five Web Service Registries:

Biocatalogue: http://www.biocatalogue.com
Ebi: http://www.ebi.ac.uk
Seekda: http://www.seekda.com
Service-Repository: http://www.service-repository.com
Xmethods: http://www.xmethods.net

3.2 Deep Web Services Crawler Architecture

This section first gives an overview of the high-level architecture of the Deep Web Services Crawler approach. Thereafter, four subsections outline each single component and how they play together.

The current components and the flow of data in the Deep Web Service Crawler can be summarized as depicted in Figure 3-1, using the continuous arrows. The crawler first tries to obtain the available service list page links and the related service page links by crawling the Web (Web Service Extractor). Those gathered links are then processed in two separate steps: one gets the service's WSDL document (WSDL Grabber), and the other collects the properties of each service (Property Grabber). Finally, all these data are stored on the storage device (Storage). The whole process shown in Figure 3-1 proceeds as follows.

Step 1:
When the Deep Web Service Crawler starts to run, the File Chooser dialog requires the user to specify a path on the computer or on any other hard disk. The reason for specifying the path is that the Deep Web Service Crawler program needs a place to store all its outputs.

Step 2:
After that, the Web Service Extractor is triggered. It is the main entry point of the actual crawling process. Since the Deep Web Service Crawler program is a procedure that is supposed to crawl for Web Services in given Web Service Registries, the URL addresses of these Web Service Registries are given as the initial seeds for the Web Service Extractor process. However, since the page structures of these Web Service Registries are completely different, there is a dedicated process for each Web Service Registry.

Figure 3-1: Overview of the basic architecture of the Deep Web Services Crawler

Step 3:
According to the given seed, two types of links are obtained by the Web Service Extractor component: the service list page link and the service page link. A service list page is a page that contains a list of Web Services and possibly some information about these Web Services, while a service page is a page that contains much more information about a single service. Finally, the component forwards these two types of links to the next two components: the Property Grabber and the WSDL Grabber.

Step 4:
On the one hand, the Property Grabber component tries to gather the information about the service hosted in the service list page and the service page, such as the name of the service, its description, the ranking of the service, etc. Finally, all the information about the service is collected together as the service properties, which are then delivered to the Storage component for further processing.

Step 5:
On the other hand, the WSDL Grabber tries to obtain the WSDL link from the service list page or the service page. This is because for some Web Service Registries the WSDL link is hosted in the service list page, as for Biocatalogue, while for other Web Service Registries it is hosted in the service page, as for Xmethods. After the WSDL link has been obtained, it is also transmitted to the Storage component for further processing.

Step 6:
When the service properties and the WSDL link of a service are received by the Storage component, it stores them on disk. The service properties are stored on disk in one of three different ways: as an XML file, as an INI file, or as one record inside a database table. For the WSDL link, the Storage component first tries to download the page content according to the URL address of the WSDL link. If this succeeds, the page content of the service is stored as a WSDL document on disk (a sketch of this download step is given after this list).

Step 7:
Nevertheless, this is just the crawling process for a single service, from step 3 to step 6. Hence, if there is more than one service, or more than one service list page, in those Web Service Registries, the crawling process from step 3 to step 6 is repeated again and again until there are no more services or service list pages in that Web Service Registry.

Step 8:
Furthermore, after the crawling process for one Web Service Registry finishes, a file is generated that contains some statistical information about this crawling process, for example when the crawling process for this Web Service Registry started and when it finished, the total number of Web Services in this Web Service Registry, how many services have an empty WSDL document, the average number of service properties in this Web Service Registry, and the average time cost for extracting service properties, getting the WSDL document and generating the XML file, INI file, etc.
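The following Java sketch illustrates the download part of Step 6: fetching the page content behind a WSDL link and saving it to the user-specified output path. It is a simplified sketch using only the standard library; the method name downloadWsdl, the file names and the WSDL link are illustrative assumptions, not taken from the actual implementation.

import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Sketch of Step 6: download the content behind a WSDL link and store it on disk.
public class WsdlDownloadSketch {

    public static void downloadWsdl(String wsdlLink, Path outputFile) throws Exception {
        try (InputStream in = new URL(wsdlLink).openStream()) {
            Files.copy(in, outputFile);
        }
    }

    public static void main(String[] args) throws Exception {
        // Illustrative values: output path as specified by the user in Step 1,
        // WSDL link as delivered by the WSDL Grabber (invented example URL).
        Path output = Paths.get("output", "BLZService.wsdl");
        Files.createDirectories(output.getParent());
        downloadWsdl("http://example.org/services/BLZService?wsdl", output);
    }
}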

3.2.1 The Function of Web Service Extractor Component

The Web Service Extractor component is responsible for gathering service list page links and service page links. It pursues a focused crawl of the Web and only forwards service list page and service page links to subsequent components for analyzing, collecting and gathering purposes. Therefore, it has to identify both the service list page links and the related service page links on these Web Service Registries.

As can be seen from Figure 3-2, a crawl for Web Services needs to start from a seed URL. The seed is almost as important as the Web Service Extractor itself, as it highly influences the part of the Web that needs to be crawled. The seed can or shall contain, e.g., Web pages where these Web Services are published or which talk about Web Services.


Figure 3-2: Overview of the process flow of the Web Service Extractor component

After feeding with the seed of URL the Web Service Extractor component starts to get the link of

service list page from the initial page of this URL seed However this process would be diverse for

these five Web Service Registries Following shows the different situation in these Web Service

Registries

Oslash Service-Repository Web Service Registry

In this Web Service Registry the link of the first service list page is the URL address of its seed

which means some Web Services can be found in the home page of the Service-Repository Web

Service Registry The ldquofirstrdquo in here implies there are more than one service list page link in this

Web Service Registry Therefore the process of getting service list page link in this registry will be

continually carried on until there no more service list page link exists

Oslash Xmethods Web Service Registry

Although there has Web Services in the home page of the Xmethods Web Service Registry the

number of those Web Services is only a small subset of these in this Web Service Registry

Moreover in the Xmethods Web Service Registry there is only one page containing all Web

Services Therefore it must have to get the service list page link for that page

Oslash Ebi Web Service Registry

The situation in Ebi Web Service Registry is a little bit like in Xmethods Web Service Registry That

is to say there also has one page that contains all Web Services in this Web Service Registry

However this page is not the initial page of the input seed Therefore there needs more than

one operation steps to get the service list page link of that page

• Seekda Web Service Registry
In the Seekda Web Service Registry, the Web Services are not contained in the initial page of the input seed. The service list page link can be obtained only after several additional operation steps. However, there is a problem with getting the service list page links in this registry. Simply put, if there is more than one page containing Web Services, it cannot, for some unknown reason, get the links of the remaining service list pages. In other words, it can only get the link of the first service list page.

• Biocatalogue Web Service Registry
The process of getting service list pages in the Biocatalogue Web Service Registry is almost the same as in the Seekda Web Service Registry. The only difference is that in the Biocatalogue Web Service Registry all service list page links can be obtained if there is more than one service list page.

Then, after getting the link of the service list page, the Web Service Extractor begins to get the link of the service page for each service listed in the service list page. The reason why it can do this is that there is an internal link for every service which addresses the service page. It is worth noting that once a service page link is obtained, this component immediately forwards the service page link and the service list page link to the subsequent two components for further processing. Nevertheless, the process of obtaining service page links is carried out continuously until all services listed in that service list page are crawled. Analogously, the process of getting service list pages is also carried out continuously until no more service list pages exist.
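To make this control flow concrete, the following is a minimal, simplified sketch of such a crawl loop. It is not the actual implementation of the Web Service Extractor; it assumes the jsoup HTML parser and uses hypothetical helper methods (nextServiceListPageLink, extractServicePageLinks, forwardToGrabbers) that stand in for the registry-specific logic described above.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import java.util.ArrayList;
import java.util.List;

public class ExtractorSketch {

    // Crawl every service list page of one registry, starting from the seed URL.
    public void crawlRegistry(String seedUrl) throws Exception {
        String listPageLink = seedUrl;               // e.g. the first service list page
        while (listPageLink != null) {               // until no more service list page exists
            Document listPage = Jsoup.connect(listPageLink).get();
            for (String servicePageLink : extractServicePageLinks(listPage)) {
                // forward both links to the WSDL Grabber and the Property Grabber at once
                forwardToGrabbers(listPageLink, servicePageLink);
            }
            listPageLink = nextServiceListPageLink(listPage); // registry-specific, may be null
        }
    }

    // Hypothetical helper: collect the internal links of the services on a list page.
    // A real crawler would keep only the links that point to service pages (registry-specific).
    private List<String> extractServicePageLinks(Document listPage) {
        List<String> links = new ArrayList<String>();
        for (Element a : listPage.select("a[href]")) {
            links.add(a.absUrl("href"));             // resolve relative links against the page URL
        }
        return links;
    }

    // Hypothetical helpers, left empty in this sketch.
    private String nextServiceListPageLink(Document listPage) { return null; }
    private void forwardToGrabbers(String listPageLink, String servicePageLink) { }
}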

3211 Features of the Web Service Extractor Component

The main features are described in the following paragraphs

1) Obtain service list page links
A central task is to obtain the corresponding service list page links. A service list page link is a URL address that points to a public list of Web Services together with some brief information about these Web services, like the name of the service, an internal URL that links to another page containing the detailed information about that service, and sometimes the link address of the WSDL document.

2) Obtain service page links
Once service list page links are found, a crucial aspect is to extract the internal link of each service to aid the task of service information discovery. Therefore, it is the task of the Web Service Extractor to harvest the HTML content of the service list page, so that the service page links, which contain much more detailed information about the single Web service, can be obtained.

3212 Input of the Web Service Extractor Component

This component depends on specific input seeds. The only input required for this component is a URL seed. Actually, this URL seed will be one of the URLs displayed in section 313.

3213 Output of the Web Service Extractor Component

The component will produce two types of service-related page links from the Web:
• Service list page links
• Service page links


3214 Demonstration for Web Service Extractor

In order to gain a comprehensive understanding of the process of the Web Service Extractor component, the following figures are given as explanation. Though there are five URL addresses covered by this section, only the URL address of the Service-Repository is shown as an example.

1) The input seed is the initial URL address of the Service-Repository, which is "http://www.service-repository.com".
2) As already said in section 321, the first service list page link of this Web Service Registry is its input seed "http://www.service-repository.com". Figure 3-3 shows the corresponding service list page of that link.

Figure 3-3 Service list page of the Service-Repository
Figure 3-4 Original source code of the internal link for Web service "BLZService"
Figure 3-5 Code overview of getting the service page link in the Service Repository
Figure 3-6 Service page of the Web service "BLZService"


3) Now that the service list page link is already known, the next step is to acquire the service page link for each of the services listed in the service list page. The text in the red box of figure 3-4 shows the internal link of the Web service "BLZService". However, this is not the complete link of the service page; it has to be prefixed with the initial URL address of the Service-Repository Web Service Registry, "http://www.service-repository.com". The code for getting the service page link of a Web service in the Service Repository is shown in figure 3-5. Therefore, the final link of this service page is "http://www.service-repository.com/service/overview-210897616". Figure 3-6 is the corresponding service page of that link.

4) Afterwards, those two links, the service list page link and the service page link, which are gathered by the Web Service Extractor component, are immediately forwarded to the next two components, the WSDL Grabber component and the Property Grabber component.
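The prefixing in step 3 can be illustrated with a few lines of Java; the method, variable names and the assumed relative path are illustrative only and do not correspond to the code in figure 3-5.

public class ServicePageLinkDemo {
    public static void main(String[] args) {
        // Hypothetical illustration of step 3: the internal (relative) link taken from the
        // href attribute in the service list page is prefixed with the registry seed URL.
        String registrySeed = "http://www.service-repository.com";
        String internalLink = "/service/overview-210897616";    // value of the href attribute (assumed)

        String servicePageLink = internalLink.startsWith("http")
                ? internalLink                                   // already an absolute URL
                : registrySeed + internalLink;                   // prefix with the seed

        System.out.println(servicePageLink);
        // prints: http://www.service-repository.com/service/overview-210897616
    }
}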

322 The Function of WSDL Grabber Component

The WSDL Grabber component acquires the WSDL link hosted on the Web, based on the service list page link or the service page link. The whole process flow is illustrated in figure 3-7.

Figure 3-7 Overview of the process flow of the WSDL Grabber Component

When the WSDL Grabber component receives the inputs delivered from the previous component, it starts to get the WSDL link for the service based on these inputs. However, although the inputs of the WSDL Grabber component are the links of the service page and the service list page, only one of them contains the WSDL link for the corresponding service. That is to say, the WSDL link exists either in the service page or in the service list page. The reason why both links need to be delivered to this component is that only one of the five Web Service Registries, namely Biocatalogue, hosts the WSDL link in the service list page, while for the other four Web Service Registries the WSDL link is hosted in the service page. Therefore, the WSDL links of these four Web Service Registries are obtained by means of the service page link, and for the Biocatalogue Web Service Registry the WSDL link is obtained through the service list page link. However, there is a problem with getting the WSDL links of the Biocatalogue Web Service Registry. In brief, some of the Web services listed in the service list page of the Biocatalogue Web Service Registry have no WSDL link; in other words, these services have no WSDL document. For situations like this, the value of the WSDL link of these Web services is assigned the value "NULL". Nevertheless, for the Web Services in the other four Web Service Registries the WSDL link always exists in the service page. Eventually, whenever the WSDL Grabber component extracts the WSDL link of one single Web service, it is immediately forwarded to the Storage component for downloading the WSDL document at once.

3221 Features of the WSDL Grabber Component

The WSDL Grabber component has to provide the following feature:
• Obtain WSDL links
The WSDL link is the direct way to get to the page that contains the contents of the WSDL document. It is actually a URL address, but the end of this URL address contains something like "wsdl" or "WSDL" to indicate that this address points to the page of the WSDL document.

3222 Input of the WSDL Grabber Component

This component requires the following input data:
• Service list page link
• Service page link

3223 Output of the WSDL Grabber Component

The component will only produce the following output data:
• The URL address of the WSDL link for each service

3224 Demonstration for WSDL Grabber Component

This section demonstrates a list of figures in order to give a comprehensive understanding of the process of the WSDL Grabber component. It uses the same Web service of the Service-Repository as an example.
1) The input for this WSDL Grabber component is the link of the service page obtained from the Web Service Extractor component. The address of this link is "http://www.service-repository.com/service/overview-210897616".

Figure 3-8 WSDL link of the Web service "BLZService" in the service page


2) Figure 3-9 shows the corresponding original source code of the WSDL link that is highlighted in figure 3-8.
Figure 3-9 Original source code of the WSDL link for Web service "BLZService"
3) Figure 3-10 and figure 3-11 show the code used to extract the WSDL link shown in figure 3-9. However, figure 3-10 is the particular code only for the Service-Repository Web Service Registry; for the other four Web Service Registries this code is different. The function "getServiceRepositoryWSDLLink" first gets a list of nodes that have the HTML tag name "b". Then it checks all of these nodes one by one to see whether the text value of a node is "WSDL". If a node fulfills this condition, the attribute value of its sibling, which is "a" here, is extracted as the value of the WSDL link for this Web service (a simplified sketch of this rule is given after this demonstration).

Figure 3-10 Code overview of the "getServiceRepositoryWSDLLink" function
Figure 3-11 Code overview of the "oneParameter" function

4) Finally, after applying the functions shown in figure 3-10 and figure 3-11, the WSDL link for the Web service "BLZService" can be obtained. Its value is "http://services.unitedplanet.de/blz/BlzService.asmx?WSDL".
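Since figure 3-10 is not reproduced here, the following is only a rough approximation of the rule it implements, written with the jsoup HTML parser (an assumption; the actual implementation may use a different HTML parsing library).

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class WsdlLinkSketch {

    // Approximation of the rule described for figure 3-10: find a <b> node whose text is "WSDL"
    // and take the href of the neighbouring <a> element as the WSDL link.
    public static String getServiceRepositoryWsdlLink(String servicePageUrl) throws Exception {
        Document page = Jsoup.connect(servicePageUrl).get();
        for (Element b : page.select("b")) {             // all nodes with tag name "b"
            if ("WSDL".equals(b.text().trim())) {        // text value must be "WSDL"
                Element sibling = b.nextElementSibling();
                if (sibling != null && "a".equals(sibling.tagName())) {
                    return sibling.absUrl("href");       // the attribute value of the sibling <a>
                }
            }
        }
        return null;                                     // treated as "no WSDL link" later on
    }

    public static void main(String[] args) throws Exception {
        System.out.println(getServiceRepositoryWsdlLink(
                "http://www.service-repository.com/service/overview-210897616"));
    }
}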


323 The Function of Property Grabber Component

The Property Grabber component is a module used to extract and gather all the Web service information hosted on the Web, namely the information shown in the service list page and the service page. In the end, all the obtained Web service information is collected together as the service properties, which are delivered to the Storage component for storing. The detailed process flow of the Property Grabber component is illustrated in figure 3-12.

The inputs of the Property Grabber component are the same as the inputs of the WSDL Grabber component. However, there is still a little difference between them with respect to the seed. As already mentioned in section 322, for the WSDL Grabber component only one of the inputs is sufficient to get the WSDL link, whereas the Property Grabber component needs both of them as seeds. Therefore, once the Property Grabber component receives the needed inputs, it starts to extract the information of that single Web service.

Figure 3-12 Overview of the process flow of the Property Grabber Component

After the Property Grabber component receives the inputs, it starts to extract the service information for the Web service. Generally speaking, the service information consists of four aspects: structured information, endpoint information, monitoring information and whois information.

(1) Structured Information
The structured information can be obtained by extracting the information hosted in the service page and the service list page. It is the basic descriptive information about the service, such as the name of the service, the URL address through which the WSDL document can be obtained, the description introducing the service, the provider who provides that service, its rating, and the server that owns this service, etc. However, the elements constituting this structured information differ among these Web Service Registries. For example, rating information of a Web service exists in the Service-Repository Web Service Registry, while the Xmethods Web Service Registry does not have this information. In addition, even for Web services in the same Web Service Registry, some elements of the structured information may not exist. For instance, one service in a Web Service Registry may have a description while another service in the same Web Service Registry does not. Tables 3-1 to 3-5 list the structured information that should be extracted in these five Web Service Registries. Moreover, if the style of a Web service in the Biocatalogue Web Service Registry is SOAP, there is additional information describing the SOAP operations of this service; if it is REST, this additional information describes the REST operations. This should also be considered as a part of the structured information. Table 3-6 and table 3-7 illustrate the information for these two different operation types.

Service Name | WSDL Link | WSDL Version
Provider | Server | Rating
Homepage | Owner Homepage | Description
Table 3-1 Structured Information of Service-Repository Web Service Registry

Service Name | WSDL Link | Provider
Service Style | Homepage | Implementation Language
Description | User Description | Contributed Client Name
Type of this Client | Publisher for this client | Used Toolkit of this client
Used Language of this client | Used Operation System of this client
Table 3-2 Structured Information of Xmethods Web Service Registry

Service Name | WSDL Link | Server
Provider | Provider's Country | Service Style
Rating | Description | User Description
Service Tags | Documentation (within WSDL)
Table 3-3 Structured Information of Seekda Web Service Registry

Service Name | WSDL Link | Port Name
Service URL Address | Implementation Class
Table 3-4 Structured Information of Ebi Web Service Registry

Service Name | WSDL Link | Style
Provider | Provider's Country | View Times
Favorite Times | Submitter | Service Tags
Total Annotation | Provider Annotation | Member Annotation
Registry Annotation | Base URL | SOAP Lab Server Base URL
Description | User Description | Category
Table 3-5 Structured Information of Biocatalogue Web Service Registry


SOAP Operation Name | Inputs and Outputs | Operation Description
Operation Tags | Part of Which Service
Table 3-6 SOAP Operation Information of Biocatalogue Web Service Registry

REST Operation Name | Service Tags | Used Template
Operation Description | Part of Which Service | Part of Which Endpoint Group
Table 3-7 REST Operation Information of Biocatalogue Web Service Registry

(2) Endpoint Information
Endpoint information is the port information of the Web service, which can be extracted only through the service page. However, different Web Service Registries have different structures of the endpoint information, so some elements of the endpoint information can be very diverse. One thing to pay attention to is that the Ebi Web Service Registry does not have endpoint information for any of the Web services published in this registry. Moreover, even though the Web services in the same Web Service Registry share the same structure of this endpoint information, some elements of the endpoint information can be missing or empty. Furthermore, these Web Service Registries may even have no endpoint information at all for some Web services published by them. Nevertheless, whenever there is endpoint information for a Web service, there is at least one element, namely the URL address of the endpoint. The following table 3-8 shows the endpoint information that should be extracted in these five Web Service Registries.

Web Service Registry Name | Elements of the Endpoint Information
Service Repository | Endpoint Name, Endpoint URL, Endpoint Critical, Endpoint Type, Bound Endpoint
Xmethods | Endpoint URL, Publisher of this Endpoint, Contact Email of this Publisher, Implementation Language of this Endpoint
Seekda | Endpoint URL
Biocatalogue | Endpoint Name, Endpoint URL
Table 3-8 Endpoint Information of these five Web Service Registries

Web Service Registry Name | Elements of the Monitoring Information
Service Repository | Service Availability, Number of Downs, Total Uptime, Total Downtime, MTTR, MTBF, RTT Max of Endpoint, RTT Min of Endpoint, RTT Average of Endpoint, Ping Count of Endpoint
Seekda | Service Availability, Begin Time of Monitoring
Biocatalogue | Monitored Status of Endpoint, Overall Status of Service
Table 3-9 Monitoring Information of these five Web Service Registries

(3) Monitoring Information
Monitoring information is the tested statistic information for the Web service. It is worth noting that the Ebi and Xmethods Web Service Registries do not have monitoring information for any of the Web services published by them, while for the other three Web Service Registries only a few Web services may lack this information. Table 3-9 displays the monitoring information for these three Web Service Registries.

(4) Whois Information
Whois information is not extracted from the information hosted in the service page and the service list page. It is the descriptive information for the service domain, which can be gained by means of the address of the WSDL link. Because of that, the process of getting the Whois information starts by obtaining the service domain first. The final value of the service domain must not contain strings like "http", "https" or "www"; it must be the top-level form of the domain. After that, the service domain database is queried by sending the value of the service domain to a Whois client, which is just a Web site on the Internet, for example "http://www.whois365.com/cn/domain". Then, if information about that service domain exists, a list of its information is returned as the output. However, the structure of the returned information differs among service domains. Therefore, the most challenging part is that the extraction process has to deal with each different variant of the returned information. Table 3-10 gives the Whois information that needs to be extracted for all these five Web Service Registries.

Service Domain URL | Domain Name | Domain Type
Domain Address | Domain Description | State
Postal Code | City | Country
Country Code | Phone | Fax
Email | Organization | Established Time
Table 3-10 Whois Information for these five Web Service Registries
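As an illustration of the whois step described above, the following sketch derives the service domain from a WSDL link and fetches the page of a web-based whois client. The whois URL pattern and the domain-trimming rule are simplified assumptions; the parsing of the returned page, which is the actual challenge mentioned above, is omitted.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public class WhoisSketch {

    // Reduce a WSDL link to its registrable domain, e.g.
    // "http://www.thomas-bayer.com/axis2/services/BLZService?wsdl" -> "thomas-bayer.com".
    static String serviceDomain(String wsdlLink) throws Exception {
        String host = new URL(wsdlLink).getHost();         // e.g. "www.thomas-bayer.com"
        String[] parts = host.split("\\.");
        int n = parts.length;                               // keep only the last two labels (simplification)
        return n <= 2 ? host : parts[n - 2] + "." + parts[n - 1];
    }

    // Query a web-based whois client and return the raw HTML answer (parsing omitted).
    static String queryWhois(String domain) throws Exception {
        // The exact URL pattern of the whois site is an assumption for this sketch.
        URL url = new URL("http://www.whois365.com/cn/domain/" + domain);
        StringBuilder answer = new StringBuilder();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                answer.append(line).append('\n');
            }
        }
        return answer.toString();
    }

    public static void main(String[] args) throws Exception {
        String domain = serviceDomain("http://www.thomas-bayer.com/axis2/services/BLZService?wsdl");
        System.out.println(queryWhois(domain).length() + " characters returned for " + domain);
    }
}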

Finally, all the information of these four aspects is collected together and then delivered to the Storage component for further storage processing.

3231 Features of the Property Grabber Component

The Property Grabber component has to provide the following features:
• Obtain basic information
Generally speaking, the more information one Web service has, the better you can judge how good this Web service is. Hence, the Property Grabber component has to extract all the basic information hosted in the service list page and the service page. This basic information comprises the structured information, the endpoint information and the monitoring information.
• Obtain Whois information
Since more information about a Web service indicates a better quality of the Web service, it is necessary to extract as much information about the Web service as possible. Therefore, besides the basic information, the Property Grabber component can also obtain some additional information called Whois information, such as the type of the domain, the name of the person in charge, the postal code of the domain, city, phone, fax, detailed address, email, etc.


3232 Input of the Property Grabber Component

This component requires the following input data:
• Service list page link
• Service page link

3233 Output of the Property Grabber Component

The component will produce the following output data:
• Structured information of each service
• Endpoint information about each service, if it exists
• Monitoring information for the service and endpoint, if it exists
• Whois information of the service domain
However, all this information is collected together as the properties for each service. Thereafter, the collected properties are sent to the Storage component.

3234 Demonstration for Property Grabber Component

The pictures from figure 3-13 to figure 3-16 depict the fundamental procedure of the Property Grabber component. To simplify the explanation, the example shown in this section uses the same Web service as before.

1) The inputs of the Property Grabber component are the links of the service list page and the service page received from the Web Service Extractor component. These links are "http://www.service-repository.com" and "http://www.service-repository.com/service/overview-210897616".
2) When the "GetProperty" function is triggered, it calls its four sub functions shown in figure 3-12.
Figure 3-13 Structure properties of the Service "BLZService" in the service list page


Figure 3-14 Structure properties of the Service "BLZService" in the service page

3) Firstly, the "getStructuredProperty" function tries to extract the structured information displayed in the red boxes of figure 3-13 and figure 3-14. However, several elements of the structured information have the same content, like the description shown in the service page and in the service list page. Hence, in order to save time in the extracting process and space in the storing process, elements with the same content are only extracted once. Moreover, a transformation of non-descriptive content into descriptive text is needed for the rating information, because its content consists of several images of stars. The final results of the extracted structured information for this Web service are shown in table 3-11. Because there is no descriptive information for the provider, homepage and owner homepage, their values are assigned as "NULL".

Service Name | BLZService
WSDL Link | http://www.thomas-bayer.com/axis2/services/BLZService?wsdl
WSDL Version | 0
Server | Apache-Coyote/1.1
Description | BLZService
Rating | Four stars and A Half
Provider | NULL
Homepage | NULL
Owner Homepage | NULL
Table 3-11 Extracted Structured Information of Web Service "BLZService"

4) Secondly, the "getEndpointProperty" function extracts the endpoint information displayed in the red box of figure 3-15. Though there is more than one endpoint record in the list, only one of them is extracted as the endpoint information. That is because this master program is intended to extract as much information as possible, but this information should not contain redundant entries. Therefore, only one record is extracted as the endpoint information even if there is more than one endpoint record. Table 3-12 shows the final results of the endpoint information of this Web service.
Figure 3-15 Endpoint information of the Web service "BLZService" in the service page


Endpoint Name | BLZServiceSOAP12port_http
Endpoint URL | http://www.thomas-bayer.com:80/axis2/services/BLZService
Endpoint Critical | True
Endpoint Type | production
Bound Endpoint | BLZServiceSOAP12Binding
Table 3-12 Extracted Endpoint Information of the Web service "BLZService"

5) Then it is time to extract the monitoring information by invoking the function "getMonitoringProperty". Figure 3-16 displays two types of monitoring properties: the red box above contains the monitoring information about the Web service, and the red box below lists the monitoring information for its endpoints. As already mentioned before, only one endpoint statistic record is extracted. Besides, as can be seen from figure 3-16, there are two types of availability. Actually, they all represent the availability of this Web service, just like the availability shown in figure 3-14; therefore only one of these availability values is sufficient. Table 3-13 shows the final results of this extracting process.
Figure 3-16 Monitoring Information of the Service "BLZService" in the service page

Service Availability | 100
Number of Downs | 0
Total Uptime | 1 day 19 hours 19 minutes
Total Downtime | 0 second
MTBF | 1 day 19 hours 19 minutes
MTTR | 0 second
RTT Max of Endpoint | 141 ms
RTT Min of Endpoint | 0 ms
RTT Average of Endpoint | 577 ms
Ping Count of Endpoint | 112
Table 3-13 Extracted Monitoring Information of the Web service "BLZService"

6) Next, in order to extract the Whois properties, the "getWhoisProperty" function has to gain the service domain from the WSDL link first. For this Web service, the gained service domain is "thomas-bayer.com". Then it sends this service domain as input to the Whois client for the querying process. After that, a list of information for that service domain is returned, see figure 3-17; table 3-14 shows the extracted Whois information.


Figure 3-17 Whois Information of the service domain "thomas-bayer.com"

Service Domain URL | thomas-bayer.com
Domain Name | Thomas Bayer
Domain Type | NULL
Domain Address | Moltkestr. 40
Domain Description | NULL
State | NULL
Postal Code | 54173
City | Bonn
Country | NULL
Country Code | DE
Phone | +4922855525760
Fax | NULL
Email | info@predic8.de
Organization | predic8 GmbH
Established Time | NULL
Table 3-14 Extracted Whois Information of service domain "thomas-bayer.com"

7) Finally, all the information of these four aspects is collected together as the service properties, and these service properties are then forwarded to the Storage component.

324 The Function of Storage Component

The WSDL link from the WSDL Grabber component is used by this Storage component to download the WSDL document from the Web and store it on the disk thereafter. In addition, the service properties from the Property Grabber component are also directly stored on the disk in three different manners by this Storage component. Figure 3-18 illustrates the process flow of the Storage component.

When the Storage component receives the needed inputs, the mediator function "Storager" of the Storage component is triggered. After that, it transforms the service properties into three different output formats and stores them on the disk. These output formats are an XML file, an INI file and database records. Besides, it also tries to download the WSDL document using the URL address of the WSDL link and then stores the obtained WSDL document on the disk too. This "Storager" function is composed of four sub functions, namely the "getWSDL", "generateXML", "generateDatabase" and "generateINI" sub functions. Each sub function is in charge of one aspect of the storage tasks.


Figure 3-18 Overview of the process flow of the Storage Component

(1) "getWSDL" sub function
The task of the "getWSDL" sub function is to download the WSDL document and then store it on the disk. Therefore, above all, it has to get the content of the WSDL document. This procedure is done as follows. First, the "getWSDL" sub function checks whether the value of the received WSDL link equals "NULL" or not. As already presented in section 322, if the Web service does not have a WSDL link, the value of its WSDL link is assigned "NULL". For a case like that, it creates a WSDL document whose name is the service name appended with the mark "No WSDL Document", and obviously this document does not contain any content; it is an empty document. Nevertheless, if the service does have a WSDL link, this sub function tries to connect to the Internet based on the URL address of the WSDL link. Once it succeeds, all the content hosted on the Web is downloaded, stored on the disk, and named only with the name of the service. Otherwise it creates a WSDL document whose name is prefixed with "Bad" before the service name.
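A simplified sketch of this download logic is shown below. The file-naming conventions follow the description above, but the error handling and the statistics/log bookkeeping of the real "getWSDL" function are left out; the method and parameter names are illustrative only.

import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;

public class GetWsdlSketch {

    // Download the WSDL document of one service, or create the empty/"Bad" marker files
    // described above when no WSDL link exists or when the download fails.
    static void getWsdl(String path, String serviceName, String wsdlLink) {
        try {
            if (wsdlLink == null || wsdlLink.equals("NULL")) {
                // no WSDL link: create an empty document marked accordingly
                new FileOutputStream(path + serviceName + "[No WSDL Document].wsdl").close();
                return;
            }
            try (InputStream in = new URL(wsdlLink).openStream();
                 OutputStream out = new FileOutputStream(path + serviceName + ".wsdl")) {
                byte[] buffer = new byte[4096];
                int read;
                while ((read = in.read(buffer)) != -1) {
                    out.write(buffer, 0, read);          // copy the WSDL content to disk
                }
            }
        } catch (Exception e) {
            // download failed: create a marker document prefixed with "Bad"
            try {
                new FileOutputStream(path + "Bad" + serviceName + ".wsdl").close();
            } catch (Exception ignored) { }
        }
    }
}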

(2) "generateXML" sub function
The "generateXML" sub function takes the service properties as the input, transforms them into an XML file and stores it on the disk with the name structure of the service name plus ".xml". XML stands for eXtensible Markup Language, a markup language designed to transport and store data. The first line of an XML file is the XML declaration, which defines the XML version and the encoding used. For example, "<?xml version="1.0" encoding="UTF-8"?>" means this is an XML file whose version is 1.0 and whose encoding is the 8-bit Unicode Transformation Format character set. Besides, an XML file contains XML elements, which span everything from the element's start tag to the element's end tag. Moreover, an XML element can also contain other elements, simple text or a mixture of both. However, an XML file must contain a root element as the parent of all other elements.
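A minimal sketch of such a transformation is given below. It writes the name-value pairs of the service properties under a single root element; the root element name "service" follows the description of the generated XML files in chapter 4, while the file layout and escaping are simplifications and not the original "generateXML" code.

import java.io.FileWriter;
import java.util.Map;

public class GenerateXmlSketch {

    // Turn the collected service properties into a simple XML document on disk.
    // Property keys are assumed to be valid XML element names (no spaces).
    static void generateXml(String path, String serviceName, Map<String, String> properties)
            throws Exception {
        StringBuilder xml = new StringBuilder();
        xml.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
        xml.append("<!-- generated service description -->\n");
        xml.append("<service>\n");                       // root element, parent of all others
        for (Map.Entry<String, String> p : properties.entrySet()) {
            xml.append("  <").append(p.getKey()).append(">")
               .append(escape(p.getValue()))
               .append("</").append(p.getKey()).append(">\n");
        }
        xml.append("</service>\n");
        try (FileWriter out = new FileWriter(path + serviceName + ".xml")) {
            out.write(xml.toString());
        }
    }

    // Escape the characters that are not allowed in XML text nodes.
    static String escape(String value) {
        return value == null ? "" : value.replace("&", "&amp;")
                                         .replace("<", "&lt;")
                                         .replace(">", "&gt;");
    }
}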

(3) "generateINI" sub function
The "generateINI" sub function also takes the service's properties as the input, but it transforms them into an INI file and then stores it on the disk with the name structure of the service's name plus ".ini". "ini" stands for initialization. The INI file format is a de facto standard for configuration files; such files are just simple text files with a basic structure. Generally speaking, an INI file contains three different parts: section, parameter and comment. The parameter is the basic element contained in an INI file. Its format is a key-value pair, also called a name-value pair. This pair is delimited by an equals sign "=", and the key or name always appears to the left of the equals sign. The section is more like a room that groups all its parameters together. It always appears on a single line within a pair of square brackets "[]", and sections may not be nested. A comment is some descriptive text which begins with a semicolon ";"; anything between the semicolon and the end of the line is ignored.
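A corresponding sketch for the INI output could look as follows; the comment text is illustrative, and the choice of the service name as the section name is an assumption based on the description of the generated INI files in chapter 4.

import java.io.FileWriter;
import java.io.PrintWriter;
import java.util.Map;

public class GenerateIniSketch {

    // Write the service properties as an INI file: comments, one section, key=value pairs.
    static void generateIni(String path, String serviceName, Map<String, String> properties)
            throws Exception {
        try (PrintWriter out = new PrintWriter(new FileWriter(path + serviceName + ".ini"))) {
            out.println("; service properties collected by the Deep Web Service Crawler");
            out.println("; this file was generated automatically");
            out.println("[" + serviceName + "]");            // the single section of the file
            for (Map.Entry<String, String> p : properties.entrySet()) {
                out.println(p.getKey() + "=" + p.getValue()); // parameter as key-value pair
            }
        }
    }
}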

(4) "generateDatabase" sub function
The inputs of the "generateDatabase" sub function are the same as for the previous two sub functions. Instead of transforming them into a file like XML or INI, the "generateDatabase" sub function turns them into data of the database by using SQL statements. SQL stands for Structured Query Language, a database language designed for accessing and manipulating data in a database. Most of the actions performed on a database are done with SQL statements, and the primary statements of SQL include insert into, delete, update, select, create, alter and drop. Therefore, for the purpose of transforming these service properties into data of the database, this sub function has to create a database first, using the "create database" statement. Then it should create a table to store the data. A table is a collection of related data entries and consists of columns and rows. Since the data for all these five Web Service Registries is not very large, one database table is enough for storing these service properties. Because of that, the field names of the service properties in the columns for all these five Web Service Registries should be uniform and well-defined. Afterwards, the service properties of each single service can be put into the table as a record with the "insert into" statement of SQL.
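The SQL statements mentioned above could, for example, be issued through JDBC as sketched below. The database name, table name, column list and connection parameters are placeholders; the actual program defines one uniform column per well-defined property name, and the concrete database product used is not specified here.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.Statement;

public class GenerateDatabaseSketch {

    public static void main(String[] args) throws Exception {
        // Connection URL, user and password are placeholders for a local MySQL instance.
        Connection con = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/", "user", "password");

        try (Statement st = con.createStatement()) {
            st.executeUpdate("CREATE DATABASE IF NOT EXISTS service_crawler");
            st.executeUpdate("USE service_crawler");
            // every property column uses TEXT because the length of a property is unknown
            st.executeUpdate("CREATE TABLE IF NOT EXISTS services ("
                    + "id INT AUTO_INCREMENT PRIMARY KEY, "
                    + "service_name TEXT, wsdl_link TEXT, description TEXT)");
        }

        // insert one service record; missing properties would simply stay NULL
        try (PreparedStatement ps = con.prepareStatement(
                "INSERT INTO services (service_name, wsdl_link, description) VALUES (?, ?, ?)")) {
            ps.setString(1, "BLZService");
            ps.setString(2, "http://www.thomas-bayer.com/axis2/services/BLZService?wsdl");
            ps.setString(3, "BLZService");
            ps.executeUpdate();
        }
        con.close();
    }
}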

3241 Features of the Storage Component

The Storage component has to provide the following features:
• Generate different output formats
The final result of this master program is to store the information of the services on the disk for future work. This Storage component provides three different formats for storing the service properties of the services published in the Web Service Registries. This makes the storage of the services very flexible and long-lived.
• Obtain the WSDL document
The important task of this component is to check whether the WSDL document can be obtained from the WSDL link, because the WSDL document plays a decisive role in determining the quality of the service. This Storage component provides the ability to deal with the different situations in the process of obtaining the WSDL document.


3242 Input of the Storage Component

This component requires the following input data:
• WSDL link of each service
• Each service property information

3243 Output of the Storage Component

The component will produce the following output data:
• WSDL document of the service
• XML document, INI file and tables in the database

3244 Demonstration for Storage Component

The following figures describe the fundamental implementation code of this Storage component. The detailed depiction is given below.
1) As can be seen from the implementation code in figures 3-19 to 3-23, there are several common places among them. The first common place concerns the parameters defined in each of these sub functions, "path" and "SecurityInt". The parameter "path" is the absolute path on the computer disk; it is used for the storing procedure and has already been specified by the user at the beginning of the whole program. The parameter "SecurityInt" is an increasing Integer which is used as a part of the name of the service. The reason for doing this is that it prevents services that have the same name from overriding each other on the disk. The content marked in red in the code of these figures is the second common place; its function is to create a file or document for the service with the corresponding parameters.

2) Figure 3-19 displays the implementation code of the "getWSDL" sub function, which is designed to get the WSDL document of the service based on its WSDL link. The parameter "name" is the name of the service, which will be used as the name of the WSDL document. The parameter "linkStr" is the actual WSDL link, which is the most important one for this sub function. The other two parameters, "statistic" and "log", are the objects of the text files called "Statistic Information" and "Log Information" respectively. The "Statistic Information" text file is used to record the statistic data of the services in each Web Service Registry, such as the overall number of properties for that Web Service Registry, the overall number of services, the number of services that have no WSDL link, the number of services whose WSDL documents contain no content, the number of services whose WSDL links are not available, etc. The "Log Information" text file records the results of the process steps and the problems encountered, for example which service is being crawled at the moment, which Web Service Registry it belongs to, the reason why the WSDL document for this service cannot be obtained, the reason why this service is unreachable, and so on.


Figure3-19 Implementation code for getting WSDL document

3) Figure 3-20 and figure 3-21 show the code for turning the service properties into the XML file and the INI file, and for storing those two files on the disk thereafter. The parameter "vec" is a Vector of the "PropertyStruct" data type. This "PropertyStruct" data type is an object of a class consisting of two variables: name and value.

Figure3-20 Implementation code for generating XML file


Figure3-21 Implementation code for generating INI file

4) The code in figure 3-22 and figure 3-23 shows the process that turns the service properties of all services in these five Web Service Registries into records of the database. Therefore, a database has to be created first. The name of the database can be arbitrary as long as it conforms to the naming rules of the database; the same holds for the name of the table. Figure 3-22 displays the procedure of creating a table in the database. Because it is hard to decide on the length of each service property, the data types for all service properties are set to "Text". Figure 3-23 is the code for inserting the service properties into the table with the "update" statement.

Figure3-22 Implementation code for creating table in database


Figure3-23 Implementation code for generating table records

33 Multithreaded Programming for DWSC

Multithreaded programming is a built-in characteristic provided by the Java language. A multithreaded program contains two or more separate parts that can execute concurrently; each such part is called a thread. The use of multithreading makes it possible to create programs that use the system resources efficiently, for example making maximum use of the CPU, because the idle time of the CPU can be kept to a minimum.

In this master program there are five Web Service Registries that need to be crawled for the services among them. Moreover, the number of services published in each Web Service Registry is quite different, which makes the running time cost by each Web Service Registry different as well. It can then happen that a Web Service Registry with fewer services has to wait until another Web Service Registry with many more services finishes. Therefore, in order to reduce the waiting time for other Web Service Registries and to maximize the use of the system resources, it is necessary to apply multithreaded programming to this master program. That is to say, this master program creates a thread for each Web Service Registry, and these threads are executed independently.
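A minimal sketch of this one-thread-per-registry scheme is shown below; the registry names are taken from this thesis, while the crawl method is only a stand-in for the actual crawler logic.

public class CrawlerThreadsSketch {

    // Each registry is crawled by its own thread so that small registries
    // do not have to wait for large ones to finish.
    public static void main(String[] args) throws InterruptedException {
        String[] registries = {
                "Service Repository", "Ebi", "Xmethods", "Seekda", "Biocatalogue" };

        Thread[] threads = new Thread[registries.length];
        for (int i = 0; i < registries.length; i++) {
            final String registry = registries[i];
            threads[i] = new Thread(new Runnable() {
                public void run() {
                    crawlRegistry(registry);          // placeholder for the real crawl
                }
            }, registry);
            threads[i].start();                       // threads execute independently
        }
        for (Thread t : threads) {
            t.join();                                 // wait until every registry is finished
        }
    }

    static void crawlRegistry(String registry) {
        System.out.println("crawling " + registry + " ...");
    }
}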

34 Sleep Time Configuration for Web Service Registries

Since this master program is intended to download the WSDL documents and extract the service information of the Web services published in the Web Service Registries, it inevitably affects the performance of these registries. In addition, in order not to exceed their throughput capability, these Web Service Registries will surely restrict the rate of access. Because of that, unknown errors sometimes happen while this master program is executing. For instance, the master program may continually halt at one point without getting any more WSDL documents and service information, the WSDL documents of some services in some Web Service Registries cannot be obtained, or some service information is missing. Therefore, in order to obtain the largest possible set of Web services published in these five Web Service Registries without affecting their throughput, the accessing rate for each service of all Web Service Registries has to be configured.

In consequence, before going into the essential procedure for each single service of these Web Service Registries, this master program calls the system's built-in function "sleep(long milliseconds)". It is a public static function which causes the currently executing thread to sleep for the specified number of milliseconds; in other words, it temporarily ceases execution for a while. The following table shows the time interval of the sleep function for each Web Service Registry.

Web Service Registry Name | Time Interval (milliseconds)
Service Repository | 8000
Ebi | 3000
Xmethods | 10000
Seekda | 20000
Biocatalogue | 10000
Table 3-15 Sleep Time of these five Web Service Registries
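Applied to the intervals of table 3-15, this rate limiting could be sketched as follows; the map-based configuration is only an illustration of how the per-registry interval might be kept, not the original code.

import java.util.HashMap;
import java.util.Map;

public class SleepConfigSketch {

    // Sleep time in milliseconds per registry, taken from table 3-15.
    static final Map<String, Long> SLEEP_TIME = new HashMap<String, Long>();
    static {
        SLEEP_TIME.put("Service Repository", 8000L);
        SLEEP_TIME.put("Ebi", 3000L);
        SLEEP_TIME.put("Xmethods", 10000L);
        SLEEP_TIME.put("Seekda", 20000L);
        SLEEP_TIME.put("Biocatalogue", 10000L);
    }

    // Called before the essential procedure for each single service of a registry.
    static void throttle(String registry) throws InterruptedException {
        Thread.sleep(SLEEP_TIME.get(registry));   // temporarily cease execution for a while
    }
}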


4 Experimental Results and Analysis

This chapter shows the quantitative experimental results of the prototype presented in chapter 3. Besides, the analysis of these results is also described and explained. In order to gain a rather accurate result, the experiments were carried out more than five times; all the data displayed in the following tables and charts are the averages of these runs.

41 Statistic Information for Different Web Service Registries

This section mainly discusses the amount statistics of the Web services published in these five Web Service Registries. It includes the overall number of Web services published in each Web Service Registry and the number of unavailable Web services, which have been archived because they may not be active anymore or are close to being non-active. Table 4-1 shows the service amount statistics of these five Web Service Registries.

Web Service Registry Name | Service Repository | Ebi | Xmethods | Seekda | Biocatalogue
Overall Services | 57 | 289 | 382 | 853 | 2567
Unavailable Services | 0 | 0 | 0 | 0 | 125
Table 4-1 Service amount statistic of these five Web Service Registries

Nevertheless, in order to give an intuitive view of the service amount statistics in these five Web Service Registries, figure 4-1 contains a bar chart derived from table 4-1. As can be seen from the bar chart, on the one hand there is an ascending increase in the overall number of Web services from the Service Repository Web Service Registry to the Biocatalogue Web Service Registry. The Biocatalogue Web Service Registry in particular owns the largest number of Web services. This indicates that the Biocatalogue Web Service Registry has a much more powerful ability to provide Web services to users, because it contains far more services than the other four Web Service Registries. On the other hand, there are no unavailable services in any Web Service Registry except for the Biocatalogue Web Service Registry. That is to say, only the Biocatalogue Web Service Registry contains some Web services that cannot be used by the users. To some degree this is useless, because these services cannot be used anymore and they waste network resources on the Web. Therefore, all these unavailable services should be eliminated in order to reduce the waste of network resources.


Figure4-1 Service amount statistic of these five Web Service Registries

42 Statistic Information for WSDL Document

Web Service Registry Name | Service Repository | Ebi | Xmethods | Seekda | Biocatalogue
Failed WSDL Links | 1 | 0 | 23 | 145 | 32
Without WSDL Links | 0 | 0 | 0 | 0 | 16
Empty Content | 0 | 0 | 2 | 0 | 2
Table 4-2 Statistic information for WSDL Document

Table 4-2 and figure 4-2 show the error statistics of the WSDL documents of the Web services in these five Web Service Registries. There are three aspects to this. The first one is the "Failed WSDL Links" of the Web services among these Web Service Registries. It is the overall number of Web services whose WSDL links are invalid. In other words, it is impossible to get the WSDL documents of these Web services with the URL addresses of these WSDL links, so no WSDL document will be created for them. The second aspect is the "Without WSDL Links" of the Web services for these Web Service Registries. It is the overall number of Web services in each Web Service Registry that have no WSDL link at all. That is to say, there will be no WSDL document for Web services like that, so the value of the WSDL link for such a Web service will be "NULL". However, a WSDL document will still be created; it has no content and its name contains the string "[No WSDL Document]". The third aspect is the "Empty Content", which represents the overall number of Web services that have WSDL links whose URL addresses


are valid but whose WSDL documents contain no content. In this case, a WSDL document whose name contains the string "(BAD)" will be created.

Figure4-2 Statistic information for WSDL Document

43 Comparison of Different Average Number of Service Properties

This section compares the average number of service properties in these five Web Service Registries. This average number of service properties is calculated by means of the following equation:

ASP = ONSP / ONS    (1)

Where
ASP is the average number of service properties for one Web Service Registry
ONSP is the overall number of service properties in one Web Service Registry
ONS is the overall number of Web services that have already been crawled from the corresponding Web Service Registry

Figure 4-3 shows the average number of service properties of each Web service in these five Web Service Registries. As already mentioned before, one of the measurements for testing the quality of Web services in a Web Service Registry is the service information: the more information a Web service has, the better you know about that service, and consequently the corresponding Web Service Registry can offer a better quality of Web services to users. As seen from figure 4-3, the Service Repository and Biocatalogue Web Service Registries own a larger number of service properties than the other three Web Service Registries. This directly reflects that these two Web Service Registries can provide more detailed information about the Web services published in them, so that the users can more easily choose the service they need and would also like to use the Web services published in these two Web Service Registries. By contrast, the Xmethods and Seekda


Web Service Registries, which have less service information about the Web services, offer less quality for these Web services. Therefore, users may not like to use the Web services provided in these two Web Service Registries, not to mention the Web services published in the Ebi Web Service Registry.

Figure4-3 Average Number of Service Properties

From the description presented in section 323, the causes of the different numbers of service properties in these Web Service Registries may consist of the following points. First, the number of structured information items for the Web services differs among these five Web Service Registries; part of the information for some Web services in one Web Service Registry can even be missing or have an empty value. For example, the number of structured information items to be extracted for the Web services in the Ebi Web Service Registry differs greatly from that in the Biocatalogue Web Service Registry. Secondly, there is a certain amount of endpoint information in almost all Web Service Registries except the Ebi Web Service Registry; its absence more or less reduces the overall number of service properties. Thirdly, some Web Service Registries do not have monitoring information, such as Xmethods and Ebi. In particular, the Service Repository Web Service Registry has a large amount of monitoring information about the Web services that can be extracted from the Web. Obviously, the last point is the number of whois information items for these Web services. If the database of the whois client does not contain information about the service domain of a Web service in one Web Service Registry, then no whois information can be extracted. Moreover, even if there is information about the service domain, the amount of information can be very diverse. Therefore, if a Web Service Registry is in a situation where many service domains of its Web services have no or few whois information items, then the average number of service properties in that registry decreases greatly.
As a result, in order to help users get better acquainted with the Web services provided in a Web Service Registry and to distinguish these Web services, the Web Service Registry should do its best to offer as much information as possible for each of its published Web services.

[Figure 4-3 is a bar chart of the average number of service properties per Web Service Registry: Service Repository 23, Ebi 7, Xmethods 17, Seekda 17, Biocatalogue 32.]

44 Different Outputs of Web Services

The fundamental task of this master program is to obtain the WSDL documents of the Web services hosted in the Web Service Registries, as well as to extract and gather the properties of these Web services and thereafter store them on disk. Therefore, this section describes the different outputs of this master program, which include the WSDL documents of the Web services, the generated XML and INI files, and the data records of these service properties in the database.

Figure4-4 WSDL Document format of one Web service

The WSDL document of the Web service is simply read from the Web according to the URL address of its WSDL link, and this data is then stored on the disk as the WSDL document. The name of the WSDL document is the service name plus the ending ".wsdl", which marks it as a WSDL document. In order to distinguish WSDL documents whose names would be the same while their contents differ, the name of each obtained WSDL document in one Web Service Registry contains a unique Integer in front of its name. Figure 4-4 shows the valid WSDL document format of the Web service; its name is "1BLZService.wsdl".
Besides, the obtained service properties are transformed into an XML file, an INI file and data records in the database. Figure 4-5, figure 4-6 and figure 4-7 show these three different output formats respectively. As can be seen from figure 4-5, this is the INI file of the Web service, whose name is "1BLZService.ini". The Integer is the same as in the WSDL document, because they are materials belonging to the same Web service. The first three lines in that INI file are service comments, which start from the semicolon and extend to the end of the line; they are the basic information describing this INI file. The following line is the section, enclosed in a pair of brackets. It is important because it indicates that the lines following it are the information of this Web service. Therefore, the rest of the lines are the actual service information, given as key-value pairs with an equals sign between the pair; each service property is displayed from the beginning of the line.


Figure4-5 INI File format of one Web service

Figure4-6 XML File format of one Web service

Figure4-7 Database data format for all Web services


Furthermore, figure 4-6 shows the generated XML file format of the Web service "BLZService"; its name is "1BLZService.xml". Needless to say, this XML file is part of the materials of the same Web service. Though the format of the XML file is different from that of the INI file, the essential contents of them are the same; that is to say, the values of these service properties are no different. This is because they are files generated from the same collection of properties for the same Web service. The XML file also has some comments like those in the INI file, which are displayed between "<!--" and "-->", and the section in the INI file corresponds to the root in the XML file. Therefore, all values of the elements under the root "service" in this XML file are the values of the service properties of this Web service.
Eventually, as can be seen from figure 4-7, a database table is used to store the data of the service information for all Web services in these five Web Service Registries. The entire service information for one Web service forms only one record in this table. Because of that, the column names of that table should be the union of the names of the service information in each Web Service Registry. However, since the column names of the table must be unique, the redundant names in this union have to be eliminated. This makes sense and is possible because the names of the service information are well-defined and uniform for all these five Web Service Registries. In addition, the first column of this table is the primary key, an increasing Integer; its function is similar to the Integer contained in the names of the XML file and the INI file. The rest of the columns in that table are the corresponding service properties of the Web services. The value "NULL" or "Null" in this table indicates that this property of that Web service is empty or missing.

45 Comparison of Average Time Cost for Different Parts of Single Web Service

This section compares the average time cost for the different parts of getting one single Web service in all these five Web Service Registries. At first, the average time cost of getting one single service in a Web Service Registry has to be calculated. It can be obtained through the following equation:

ATC = OTS / ONS    (2)

Where
ATC is the average time cost for one single Web service
OTS is the overall time cost of all the Web services in one Web Service Registry
ONS is the overall number of Web services that have already been crawled from the corresponding Web Service Registry

In addition, the different parts of the average time cost for getting one single service consist of the following six aspects: the average time cost for extracting the service properties, the average time cost for obtaining the WSDL document, the average time cost for generating the XML file, the average time cost for generating the INI file, the average time cost for inserting the service properties into the table of the database, and the average time cost for some other procedures, such as getting the service list page link, getting the service page link and so on. The average time cost for extracting the service properties is obtained by means of the following equation:

ATCSI = OTSSI / ONS    (3)

Where
ATCSI is the average time cost for extracting the service properties of one single Web service
OTSSI is the overall time cost for extracting the service properties of all the Web services in one Web Service Registry
ONS is the overall number of Web services that have already been crawled from the corresponding Web Service Registry

The calculation of the other parts is similar to the equation for calculating the average time cost for extracting the service properties, while the average time cost for the other procedures equals the average time cost for one single Web service minus the sum of the average time costs of the other five parts.

Web Service Registry | Service property | WSDL Document | XML File | INI File | Database | Others | Overall
Service Repository | 8801 | 918 | 2 | 1 | 53 | 267 | 10042
Ebi | 699 | 82 | 2 | 1 | 28 | 11 | 823
Xmethods | 5801 | 1168 | 2 | 1 | 45 | 12 | 7029
Seekda | 5186 | 1013 | 2 | 1 | 41 | 23 | 6266
Biocatalogue | 39533 | 762 | 2 | 1 | 66 | 1636 | 42000
Table 4-3 Average time cost information for all Web Service Registries (in milliseconds)

Table 4-3 displays the average time cost of one Web service and its different parts in all these five Web Service Registries. The first column of table 4-3 is the name of these five Web Service Registries, and the last column is the average time cost for a single service in one Web Service Registry, while the remaining columns are the average time costs of the six different parts. In order to give an intuitive view of the data in table 4-3, the data in each column of this table are illustrated with the following corresponding figures, see figure 4-8 to figure 4-14.

Figure4-8 Average time cost for extracting service property in all Web Service Registries


As can be seen from figure 4-8, the average time cost for extracting the service properties in the Biocatalogue Web Service Registry is 39533 milliseconds. This value is much larger than those in the other four Web Service Registries, which are 8801, 699, 5801 and 5186 for Service Repository, Ebi, Xmethods and Seekda respectively. That is to say, it takes much longer to extract the service properties of the Web services published by the Biocatalogue Web Service Registry. In addition, this can indirectly indicate that the Biocatalogue Web Service Registry has an even higher average number of service properties, which has already been discussed in section 43. On the contrary, the average number of service properties in the Ebi Web Service Registry is the smallest. Moreover, the average time cost in the Xmethods Web Service Registry is larger than that in the Seekda Web Service Registry, even though, as already seen, the average number of service properties is the same for these two Web Service Registries. One cause might explain why Xmethods costs more time than Seekda: the process of extracting service properties in the Xmethods Web Service Registry needs both the service page and the service list page, while only the service page link is needed for the Seekda Web Service Registry.

The average time cost for obtaining the WSDL document in all these five Web Service Registries is displayed in figure 4-9. This average time cost is the sum of the average time for extracting the WSDL link of a Web service and the average time for reading the data of the WSDL document from the Web and storing it on the disk. As seen from this figure, the average time cost for obtaining the WSDL document is largest in the Xmethods Web Service Registry, at 1168 milliseconds. Although extracting the WSDL link costs a certain amount of time, it does not have a significant influence on the total average time spent for obtaining the WSDL document, because the WSDL link of a Web service is usually gained in a single step. This therefore implies that the average size of the WSDL documents in the Xmethods Web Service Registry is larger than in the other four Web Service Registries, in particular the Ebi Web Service Registry, which takes just 82 milliseconds for obtaining the WSDL document.
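As an illustration only, the following Python sketch shows how the two sub-steps of this part (resolving the WSDL link, then downloading and storing the document) could be timed separately; the get_wsdl_link parameter stands for the registry-specific link extraction and is a hypothetical name, not a function of the actual implementation.

```python
import time
from urllib.request import urlopen

def timed_wsdl_download(service_page_link, target_path, get_wsdl_link):
    """Time the two sub-steps that make up the 'WSDL Document' part:
    (1) extracting the WSDL link, (2) reading the WSDL data and storing it."""
    start = time.time()
    wsdl_link = get_wsdl_link(service_page_link)   # registry-specific, assumed given
    link_ms = (time.time() - start) * 1000

    start = time.time()
    with urlopen(wsdl_link, timeout=30) as response:
        data = response.read()
    with open(target_path, "wb") as wsdl_file:
        wsdl_file.write(data)
    download_ms = (time.time() - start) * 1000

    # Their sum corresponds to the 'WSDL Document' column of table 4-3.
    return link_ms + download_ms
```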

Figure4-9 Average time cost for obtaining WSDL document in all Web Service Registries (y-axis: time in milliseconds; x-axis: Web Service Registry)


Figure 4-10, figure 4-11 and figure 4-12 show the average time cost of generating these three different outputs in all these five Web Service Registries. As can be seen from figures 4-10 and 4-11, the average time for generating the XML file of one Web service is the same for all five Web Service Registries, namely only 2 milliseconds. The average time for generating the INI file of one Web service is likewise the same everywhere, with a value of just 1 millisecond. Even the sum of these two average time costs is still so small that it can be neglected when compared to the overall average time cost for getting one Web service in each Web Service Registry, as shown in figure 4-13. This implies that the generation of the XML and INI files finishes almost immediately after the service properties of a Web service have been received. Furthermore, figure 4-12 shows that, although the average time cost for creating the database record of a Web service is larger in all five Web Service Registries than the time for generating the XML and INI files, the creation of the database record is still a fast operation.

Figure4-10 Average time cost for generating XML file in all Web Service Registries (2 milliseconds for every registry)
Figure4-11 Average time cost for generating INI file in all Web Service Registries (1 millisecond for every registry)


Figure4-12 Average time cost for creating database record in all Web Service Registries

Figure4-13 Average time cost for getting one Web service in all Web Service Registries

Figure 4-13 gives the average time cost for getting one single Web service in all these five Web Service Registries. Without any doubt, the Biocatalogue Web Service Registry takes the longest time for this process. This is because, as the presentation of the five different parts above shows, each part needs more time in the Biocatalogue Web Service Registry than elsewhere, except for the process of obtaining the WSDL document, where Biocatalogue does not have the largest average time. Moreover, a remarkable observation can be made when looking at figures 4-8, 4-12 and 4-13: the shapes of these curves follow almost the same trend. This further indicates that a Web Service Registry that spends more time to get the description information of one Web service also offers more information about that Web service.



5 Conclusion and Further Direction

This master thesis provides a schema which aims to explore the description information of the Web services hosted in different Web Service Registries. The description information of a Web service consists of the WSDL document and the service information about that Web service. Although there is an existing approach, the Pica-Pica Web Service Description Crawler, that can be used to obtain the WSDL document and service information of the Web services in these Web Service Registries, its functionality is restricted: it explores only a small subset of the Web services hosted in these Web Service Registries, the storage of these Web services is not flexible, and, most importantly, only a little service information is extracted for each Web service; for some Web Service Registries no service information is extracted at all. The work presented in this master thesis, in contrast, is able to explore all Web services in these Web Service Registries. Moreover, with the approach presented here, as much service information as possible is extracted for each Web service, so that the final result is the largest annotated service catalogue ever produced. Furthermore, regarding the storage of the description information of the Web services, this master thesis provides three different ways that guarantee not only the completeness but also the longevity of this description information.

However, in the implementation performed in this master thesis, the whois client used for querying the information of a service domain returns free text if the information exists, and this free text sometimes differs completely between domains. As a consequence, every Web service in all Web Service Registries had to be crawled at least once during the experiment stage, so that all the cases of this free text could be foreseen and processed afterwards. This is nevertheless a huge amount of work, because there are lots of Web services in these Web Service Registries. Therefore, in order to simplify this work, another whois client that eases this task needs to be found and used.
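As a purely illustrative sketch of how such differing free-text answers could be handled, the following Python fragment calls the system whois command and scans its output for a few assumed label variants; both the helper name and the label list are assumptions made here and are not part of the implementation described above.

```python
import re
import subprocess

# Assumed label variants for the registration date; real whois output uses
# many more spellings, which is exactly the problem described above.
CREATION_LABELS = ("Creation Date", "created", "Registered on")

def domain_creation_date(domain):
    """Query the system whois client and scan its free-text answer."""
    result = subprocess.run(["whois", domain], capture_output=True, text=True)
    for line in result.stdout.splitlines():
        for label in CREATION_LABELS:
            match = re.match(r"\s*" + re.escape(label) + r"\s*[:.]?\s*(.+)",
                             line, re.IGNORECASE)
            if match:
                return match.group(1).strip()
    return None  # unknown free-text format

print(domain_creation_date("thomas-bayer.com"))
```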

Moreover, in the experiment stage of this master thesis, the time cost for getting a Web service is still large. For the purpose of reducing this time, multithreaded programming could be applied to some parts of the process for getting one Web service, as sketched below.
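A minimal sketch of this idea, assuming Python with the standard thread pool and a hypothetical fetch_service() helper that stands for the per-service work described in chapter 3, could look as follows; the worker count would still have to respect the sleep-time configuration of section 3.4.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

def fetch_service(service_page_link):
    """Hypothetical per-service step: download the service page so that the
    property extraction and WSDL grabbing can run on its content afterwards."""
    with urlopen(service_page_link, timeout=30) as response:
        return service_page_link, response.read()

def crawl_in_parallel(service_page_links, workers=8):
    """Process several Web services concurrently instead of one after another."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch_service, service_page_links))
```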

Although the work performed here is specialized for only these five Web Service Registries, the main principles used here are adaptable to other Web Service Registries with only small changes in the implementation code or the structure.


6 Bibliography

[1] Nathalie Steinmetz, Holger Lausen, Dario Cerizza, Andrea Turati, Irene Celino, Adam Funk and Michael Erdmann, "D1.1 – Requirement Analysis and Architectural Plan". Available from http://www.service-finder.eu/deliverables/public/7-public/32-d11, Emanuele Della Valle (CEFRIEL), June 27, 2008.

[2] Nathalie Steinmetz, Holger Lausen, Dario Cerizza, Andrea Turati, Irene Celino, Adam Funk and Michael Erdmann, "D1.2 – First Design of Service-Finder as a Whole". Available from http://www.service-finder.eu/deliverables/public/7-public/37-d12-first-design-of-service-finder-as-a-whole, Emanuele Della Valle (CEFRIEL), July 1, 2008.

[3] Nathalie Steinmetz, Holger Lausen, Irene Celino, Dario Cerizza, Saartje Brockmans, Adam Funk, "D1.3 – Revised Requirement Analysis and Architectural Plan". Available from http://www.service-finder.eu/deliverables/public/7-public/57-d13-revised-requirement-analysis-and-architectural-plan, Emanuele Della Valle (CEFRIEL), April 1, 2009.

[4] Chia-Hui Chang, Mohammed Kayed, Moheb Ramzy Girgis, Khaled Shaalan, "A Survey of Web Information Extraction Systems", Volume 18, Issue 10, IEEE Computer Society, pp. 1411-1428, October 2006.

[5] Leonard Richardson, "Beautiful Soup Documentation". Available from http://www.crummy.com/software/BeautifulSoup/documentation.html, October 13, 2008.

[6] Hao He, Hugo Haas, David Orchard, "Web Services Architecture Usage Scenarios". Available from http://www.w3.org/TR/ws-arch-scenarios/, February 11, 2004.

[7] Stephen Soderland, "Learning Information Extraction Rules for Semi-Structured and Free Text", Volume 34, Issue 1-3, Journal of Machine Learning, Department of Computer Science and Engineering, University of Washington, Seattle, pp. 233-272, February 1999.

[8] Ian Hickson, "A Vocabulary and Associated APIs for HTML and XHTML", World Wide Web Consortium, Working Draft WD-html5-20100624, January 22, 2008.

[9] Holger Lausen, Jos de Bruijn, Axel Polleres, Dieter Fensel, "The Web Service Modeling Language WSML", WSML Deliverable D16.1v0.2, March 20, 2005. Available from http://www.wsmo.org/TR/d16/d16.1/v0.2/

[10] Dumitru Roman, Holger Lausen, Uwe Keller, "Web Service Modeling Ontology - Standard (WSMO-Standard)", WSMO deliverable D2 version 1.1, 06 March 2004. Available from http://www.wsmo.org/TR/d2/v1.1/

[11] Iris Braum, Anja Strunk, Gergana Stoyanova, Bastian Buder, "ConQo – A Context- and QoS-Aware Service Discovery", TU Dresden, Department of Computer Science, in Proceedings of WWW/Internet 2008.


7 Appendixes

There are two additional outputs of this master program: a log information file and a statistic report file. Figure 8-1 shows one of the basic output formats of the log information for these five Web Service Registries.

Figure8-1 Log information of the "Service Repository" Web Service Registry

Figure8-2 Statistic information of the "Service Repository" Web Service Registry


Figures 8-2 to 8-6 show the output format of the statistic information for the "Service Repository", "Ebi", "Xmethods", "Seekda" and "Biocatalogue" Web Service Registries respectively.

Figure8-3 Statistic information of the "Ebi" Web Service Registry

Figure8-4 Statistic information of the "Xmethods" Web Service Registry


Figure8-5 Statistic information of the "Seekda" Web Service Registry

Figure8-6 Statistic information of the "Biocatalogue" Web Service Registry


Table of Figures

Figure2-1 Dataflow of Service-Finder and Its Components ... 12
Figure2-2 Left is the free text input type and right is its output ... 16
Figure2-3 A Semi-structured page containing data records (in rectangular box) to be extracted ... 16
Figure2-4 Architecture of the Pica-Pica Web Service Description Crawler ... 20
Figure3-1 Overview of the Basic Architecture for the Deep Web Services crawler ... 25
Figure3-2 Overview the process flow of the Web Service Extractor Component ... 27
Figure3-3 Service list page of the Service-Repository ... 29
Figure3-4 Original source code of the internal link for Web service "BLZService" ... 29
Figure3-5 Code Overview of getting service page link in Service Repository ... 29
Figure3-6 Service page of the Web service "BLZService" ... 29
Figure3-7 Overview the process flow of the WSDL Grabber Component ... 30
Figure3-8 WSDL link of the Web service "BLZService" in the service page ... 31
Figure3-9 Original source code of the WSDL link for Web service "BLZService" ... 32
Figure3-10 Code overview of "getServiceRepositoryWSDLLink" function ... 32
Figure3-11 Code overview of "oneParameter" function ... 32
Figure3-12 Overview the process flow of the Property Grabber Component ... 33
Figure3-13 Structure properties of the Service "BLZService" in service list page ... 37
Figure3-14 Structure properties of the Service "BLZService" in service page ... 38
Figure3-15 Endpoint information of the Web service "BLZService" in service page ... 38
Figure3-16 Monitoring Information of the Service "BLZService" in service page ... 39
Figure3-17 Whois Information of the service domain "thomas-bayer.com" ... 40
Figure3-18 Overview the process flow of the Storage Component ... 41
Figure3-19 Implementation code for getting WSDL document ... 44
Figure3-20 Implementation code for generating XML file ... 44
Figure3-21 Implementation code for generating INI file ... 45
Figure3-22 Implementation code for creating table in database ... 45
Figure3-23 Implementation code for generating table records ... 46
Figure4-1 Service amount statistic of these five Web Service Registries ... 49
Figure4-2 Statistic information for WSDL Document ... 50
Figure4-3 Average Number of Service Properties ... 51
Figure4-4 WSDL Document format of one Web service ... 52
Figure4-5 INI File format of one Web service ... 53
Figure4-6 XML File format of one Web service ... 53
Figure4-7 Database data format for all Web services ... 53
Figure4-8 Average time cost for extracting service property in all Web Service Registries ... 55
Figure4-9 Average time cost for obtaining WSDL document in all Web Service Registries ... 56
Figure4-10 Average time cost for generating XML file in all Web Service Registries ... 57
Figure4-11 Average time cost for generating INI file in all Web Service Registries ... 57
Figure4-12 Average time cost for creating database record in all Web Service Registries ... 58
Figure4-13 Average time cost for getting one Web service in all Web Service Registries ... 58


Table of Tables

Table 3-1 Structured Information of Service-Repository Web Service Registry ... 34
Table 3-2 Structured Information of Xmethods Web Service Registry ... 34
Table 3-3 Structured Information of Seekda Web Service Registry ... 34
Table 3-4 Structured Information of Ebi Web Service Registry ... 34
Table 3-5 Structured Information of Biocatalogue Web Service Registry ... 34
Table 3-6 SOAP Operation Information of Biocatalogue Web Service Registry ... 35
Table 3-7 REST Operation Information of Biocatalogue Web Service Registry ... 35
Table 3-8 Endpoint Information of these five Web Service Registries ... 35
Table 3-9 Monitoring Information of these five Web Service Registries ... 35
Table 3-10 Whois Information for these five Web Service Registries ... 36
Table 3-11 Extracted Structured Information of Web Service "BLZService" ... 38
Table 3-12 Extracted Endpoint Information of the Web service "BLZService" ... 39
Table 3-13 Extracted Monitoring Information of the Web service "BLZService" ... 39
Table 3-14 Extracted Whois Information of service domain "thomas-bayer.com" ... 40
Table 3-15 Sleep Time of these five Web Service Registries ... 47
Table4-1 Service amount statistic of these five Web Service Registries ... 48
Table4-2 Statistic information for WSDL Document ... 49
Table4-3 Average time cost information for all Web Service Registries ... 55


Table of Abbreviations

DTD Document Type Definition

DWSC Deep Web Service Crawler

HTML HyperText Markup Language

MTBF Mean Time between Failures

MTTR Mean Time to Recovery

QoS Quality of Service

REST Representational State Transfer

RTT Round Trip Time

SOAP Simple Object Access Protocol

SQL Structured Query Language

URL Uniform Resource Locator

WHATWG Web Hypertext Application Technology Working Group

XML eXtensible Markup Language

WSDL Web Service Description Language

WSML Web Service Modeling Language

WSMO Web Service Modeling Ontology

Page 2: Deep Web Service Crawler

Deep Web Service Crawler

2

Acknowledgements

The work presented in this master thesis is a result of the master task for Web Service project

which is provided by Computer Networks at Dresden University of Technology

In here I am heartily thankful to my supervisor Dipl-infJosef Spillner whose encouragement

guidance and support from the initial to the final level enabled me to develop an understanding of

the subject

Lastly I offer my regards and blessings to all of those who supported me in any respect during the

completion of the project

Deep Web Service Crawler

3

Abstract

Nowadays Web Service Registries offer convenient access to offering searching and using

electronic Web services Usually they host Web service descriptions along with related metadata

generated by both the system and the users Hence monitoring and rating information can help

users to distinguish similar Web service offerings However at present there is little support to

compare these Web services across platforms and to build a global view Besides for the metadata

not all of them especially non-functional property descriptions are made available in s structured

format

Therefore the task of this master thesis is to apply Deep Web analysis techniques to extract as

much information about these published Web services as possible Corresponding the result shall

be the largest annotated Service Catalogue ever produced

Index Terms

Web Service Deep Web Service Crawler Service-Finder Pica-Pica Web Service Description Crawler

WSDL

Deep Web Service Crawler

4

Table of Contents

Acknowledgements 2

Abstract 3

1 Introduction 7

11 BackgroundMotivation 7

12 Initial Designing of the Deep Web Service Crawler Approach 7

13 Goals of this Master Thesis 8

14 Outline of this Master Thesis 8

2 State of the Art 10

21 Service Finder Project 10

211 Use Cases for Service-Finder Project 10

2111 Use Case Methodology 10

2112 System Administrator 10

212 Architecture Plan for the Service-Finder Project 12

2121 The Principle of the Service Crawler Component 13

2122 The Principle of the Automatic Annotator Component 13

2123 The Principle of the Conceptual Indexer and Matcher Component 14

2124 The Principle of the Service-Finder Portal Interface Component 14

2125 The Principle of the Cluster Engine Component 15

22 Information Extraction 15

221 Input Types of Information Extraction 15

222 Extraction Targets of Information Extraction 17

223 The Used Techniques in Information Extraction 18

23 Pica-Pica Web Service Description Crawler 19

231 Needed Libraries of the Pica-Pica Web Service Description Crawler 19

232 Architecture of the Pica-Pica Web Service Description Crawler 20

233 Implementation of the Pica-Pica Web Service Description Crawler 21

24 Conclusions of the Existing Strategies 22

3 Design and Implementation 23

31 Deep Web Services Crawler Requirements 23

311 Basic Requirements for DWSC 23

Deep Web Service Crawler

5

312 System Requirements for DWSC 23

313 Non-Functional Requirements for DWSC 24

32 Deep Web Services Crawler Architecture 24

321 The Function of Web Service Extractor Component 26

3211 Features of the Web Service Extractor Component 28

3212 Input of the Web Service Extractor Component 28

3213 Output of the Web Service Extractor Component 28

3214 Demonstration for Web Service Extractor 29

322 The Function of WSDL Grabber Component 30

3221 Features of the WSDL Grabber Component 31

3222 Input of the WSDL Grabber Component 31

3223 Output of the WSDL Grabber Component 31

3224 Demonstration for WSDL Grabber Component 31

323 The Function of Property Grabber Component 33

3231 Features of the Property Grabber Component 36

3232 Input of the Property Grabber Component 37

3233 Output of the Property Grabber Component 37

3234 Demonstration for Property Grabber Component 37

324 The Function of Storage Component 40

3241 Features of the Storage Component 42

3242 Input of the Storage Component 43

3243 Output of the Storage Component 43

3244 Demonstration for Storage Component 43

33 Multithreaded Programming for DWSC 46

34 Sleep Time Configuration for Web Service Registries 46

4 Experimental Results and Analysis 48

41 Statistic Information for Different Web Service Registries 48

42 Statistic Information for WSDL Document 49

43 Comparison of Different Average Number of Service Properties 50

44 Different Outputs of Web Services 52

45 Comparison of Average Time Cost for Different Parts of Single Web Service 54

5 Conclusion and Further Direction 59

6 Bibliography 60

Deep Web Service Crawler

6

7 Appendixes 61

Table of Figures 64

Table of Tables 65

Table of Abbreviations 66

Deep Web Service Crawler

7

1 Introduction

In this introduction chapter of the master thesis at first itrsquos going to concisely explain the background

of the current situation then is the basic introduction of the proposed approach which is called Deep

Web Service Extraction Crawler

11 BackgroundMotivation

In the late 1990rsquos the Web Service Registry was a hot commodity The formal definition of Web

Service Registry is known as a link links page Its function is to uniformly present information that

comes from various sources Hence it can provide a convenient channel to the users for offering

searching and using the Web Services Actually the related metadata of the Web Services that

submitted by both the system and users are commonly hosted along with the Service descriptions

Nevertheless in fact when users enter one of the Web Service Registries to look for some Web

Services they might meet some situations that would bring lots of trouble to them One of the

situations may be like that these Web Service Registries return several similar published Web Services

after the users search on it For example two or more Web Services have the same name but their

versions are not the same Or two or more Web Services that derived from the same server but have

different contents etc Furthermore most users are also interested in a global view of the published

services For instance they want to know which Web Service Registry can provide better quality for

the Web Service Therefore in order to help users to differentiate those similar published Web

Services and have a global view of the Web Services this information should be monitored and rated

Moreover there are a great many Web Service Registries in the Internet Each Web Service Registry

can provide a great number of Web Services Obviously there might have some similar Web Services

among these Web Service Registries Or a Web Service in one of the Web Service Registry is related to

another Web Service in other Web Service Registries Hence these Web Services should be

comparable across different Web Service Registries However recently there has not much support of

this In addition towards the metadata actually not all of them are structured especially the

descriptions of the non-functional property Therefore what have to do now is to turn those

non-functional property descriptions into the structured format Clearly speaking it needs to extract

as much information as possible about the Web Services that offered in the Web Service Registries

Eventually after extracting all the information from the Web Service Registries it is necessary to store

them into the disk This procedure should be efficient flexible and completeness

12 Initial Designing of the Deep Web Service

Crawler Approach

The problems have already been stated in the previous section Hence the following work is to solve

Deep Web Service Crawler

8

these problems In this section it will present the basic principle of Deep Web Service Crawler

approach

At first it is the simply introduction of the Deep Web Service Crawler approach to these problems As

have already been mentioned each Web Service Registry can offer Web Services Moreover each

Web Service Registry has its own html page structures These structures may be the same or even

complete different Therefore the first thing is to identify which Web Service Registry that it will be

going to explore Since each Web Service Registry owns a unique URL this job can be done by directly

analyzing the corresponding URL address of that Web Service Registry After identifying which Web

Service Registry it is going to explore the following step is to obtain all these Web Services that

published in that Web Service Registry Then with all these obtained Web Services it is time to extract

analyze and gather the information of the services That information can be in structured format or

even in unstructured format In this master thesis some Deep Web Analysis Techniques will be

applied to obtain this information So that the information about each Web Service shall be the

largest annotated The last but not the least important all the information about the Web Services

need to be stored

13 Goals of this Master Thesis

The lists in the following are the goals of this master thesis

n Produce the largest annotated Service Catalogue

Service Catalogue is a list of service properties The more properties the service has the larger

Service Catalogue it owns Therefore this master program should extract as much service

properties as possible

n Flexible storage of these metadata of each service as annotations or dedicated documents

The metadata of one service includes not only the WSDL document but also service properties

All these metadata are important information for the service Therefore this master program

should provide flexible ways to store these metadata into the disk

n Improve the comparable property of the Web Services across different Web Service Registries

The names of service properties for one Web Service Registry could be different from another

Web Service Registry Hence for the purpose of improving the comparable ability all these

names of the service properties should be uniformed and well-defined

14 Outline of this Master Thesis

In this chapter the motivation objective and initial approach plan have already been discussed

Thereafter the remaining paper is structured as follows

Firstly in chapter 2 the work presented is almost based on these existing techniques In section 21

there is given a detailed introduction to the technique of the Service-Finder project Then in section

22 the already implemented technique of Pica-Pica Web Service Description Crawler is introduction

and discussed After that in section 23 the Information Retrieval technique is presented

Deep Web Service Crawler

9

Then following in chapter 3 it is explained the designing details of this Deep Web Service Crawler

approach In section 31 it gives a short description for the different requirements of this approach

Next in section 32 the actual design for the Deep Web Service Crawler is presented Then in section

33 34 the multithreaded programming and sleep time configuration that used in this master

program are introduced respectively

In chapter 4 it is supposed to display the experiments of this Deep Web Service Crawler approach

and then give some evaluation of it

Finally in Section 5 the conclusion and discussion of the work that already done as well as the work in

the future for this master task are presented respectively

Deep Web Service Crawler

10

2 State of the Art

This chapter aims at presenting some existing techniques or Strategies that related to the work of

applying this Deep Web Service Extraction Crawler approach Section 21 plans to talk about the

existing catalogues Service-Finder project And then in section 22 it is going to explain the existing

implemented crawlers Pica-Pica Web Service Description Crawler Finally in the section 23 it is

supposed to present some details about the Information Extraction technique

21 Service Finder Project

Service-Finder Project aims at developing a platform for Web Service discovery especially for the Web

Services that are embedded in a Web 20 environment [1] Hence it can provide an efficient access to

publicly available services The goals of the Service-Finder project are depicted as follows [1]

n Automatically gather Web Services and their related information

n Semi-automatically create semantic service description based on the information that available

on the Web

n Create and improve semantic annotations via the user feedback

n Describe the aggregated information in semantic models and allow reasoning query

However before describing the basic functionality of the Service-Finder Project there is going to

present one of its use cases and requirements first

211 Use Cases for Service-Finder Project

The Service-Finder Project employed the use case methodology of the W3C Use Case description [6]

for its needs and then applied this methodology to the use cases that it enumerated

2111 Use Case Methodology

There are three aspects needed to be considered for use case definitions of Service-Finder Project [1]

(1) Description that used to describe information of the use case

(2) Actors Roles and Goals that used to identify the actors they would be the roles they act and the

goals they need to achieve in the scenario

(3) Storyboard that used to describe the serial of interactions among the actors and the

Service-Finder Portal

2112 System Administrator

This section is going to present the use case that applied to the Service-Finder portal and that

illustrated the requirements on its functionality from a user point of view However all these

Deep Web Service Crawler

11

information in this use case are derived from [1] In this use case there has a system administrator

whose name is ldquoSam Adamsrdquo He works for the bank His job is to keep the online payment facilities

online and working all day and night Therefore if there is any system failures Sam Adams should fix

the problems as early as he can Thatrsquos why he wants to use an SMS Messaging service which will

alert him immediately by sending him a SMS Message in the case of a system failure

n Description

This use case is dedicated to describe the system administrator ldquoSam Adamsrdquo who is looking for an

SMS Messaging Service that he wants to build it into his application

n Actors Roles and Goals

The name of the Actor is ldquoSam Adamsrdquo His Role is a system administrator at a bank The goals of him

are the immediate service delivery the reliability of the service and low base fee and transaction fee

n Storyboard

Step 1 Because of Sam Adams knows the Service-Finder portal and he also knows that he can find

many useful services from it especially he know what he is looking for Hence he visits the

Service-Finder portal and starts to search by entering the keywords ldquoSMS Alertrdquo

Requirement 1 Search functionality

Step 2 Now the Service-Finder returns a list of matching services However Sam wants to choose the

number of matching services that will be displayed on one page And he would also expect there has

short information about the service functionality the service provider and the service availability So

that he could decide which service he will choose to read further

Requirement 2 Enable configurable pagination of the matching results and have some short

information for each service

Step 3 When Sam looks through the short information about the services that displayed on the first

page he expects to find the most relevant services that related to his request After that he would

like to read more detailed information about that service to see whether this service can provide the

needed functionality

Requirement 3 Rank the returned matching services and must provide ability to read more details of

a service

Step 4 In the case that all the returned matching services Sam got provide quite different

functionalities or they belong to different service categories for example the SMS messaging services

alert users not through SMS but voice messaging For this reason Sam would like to see other

different categories that may be contain the services he wants Or the services of other categories

which he is also interested in (like ldquoSMS Messagingrdquo) Besides another possible way is that Sam can

further filter his search in terms of browsing through categories

Requirement 4 Service categories and allow the user to look all services that belonged to that specific

category If possible it should also allow the user to browse through categories

Step 5 When Sam got all the services that could provide a SMS messaging service via the methods

described in the Step 4 at present he wants to look for the services that offered by an Austrian

provider and have no base fees if possible

Requirement 5 Faceted search

Deep Web Service Crawler

12

Step 6 After Sam got all these specific services now he would like to choose the services that can

provide a high reliability

Requirement 6 Sort functionality based on usersrsquo chooses

Step 7 For now Sam expects to compare the service availability between the promised to the service

provider and the actually provided This should be contained in the servicesrsquo details And there needs

also have service coverage information so that Sam can know whether this service covers the areas

he lives and works Moreover Sam would also like to compare these services in other way For

instance put some services into a structured table to compare the transaction fees

Requirement 7 A side-by-side comparison table for services and a functionality that enable users to

select services he wants to compare

Step 8 At last Sam wants to know whether the service providers offer a free try out of the services

So that he can test the service functionality

Requirement 8 If possible display a note that offering free service trials

212 Architecture Plan for the Service-Finder Project

The architecture plan of the Service-Finder project contains five basic components Service Crawler

Automatic Annotator Conceptual Indexer and Matcher Cluster Engine and Service-Finder Portal

Interface Figure 2-1 presents a high level overview of the components and the data flow among

them

Figure2-1Dataflow of Service-Finder and Its Components [3]

Deep Web Service Crawler

13

2121 The Principle of the Service Crawler Component

The Service Crawler is responsible for gathering available services and their related information from

the Web The overall cycle is depicted as following

(1) A Web developer publishes a Web Service

(2) Then the Crawling component begins to harvest the Web in order to identify the Web Services

like WSDL (Web Service Description Language) documents

(3) The Crawler is also going to search for other related information as long as a service is discovered

(4) Later after each periodic interval the Crawler will produce a consistent snapshot of the relevant

part of the Web

At last the output of the crawler would be forwarded to the subsequent components for analyzing

indexing and displaying

2122 The Principle of the Automatic Annotator Component

The Automatic Annotator receives the relevant data from previous component and generates

semantic service descriptions about the WSDL documents and its related information based on the

Service-Finder Ontology and Service Category Ontology

Firstly it will simply introduce those two compatible ontologies that would be used throughout the

whole process [2]

n Generic Service Ontology it is an ontology which is functional to describe the data objects For

example the services the service providers availability payment modalities and so on

n Service Category Ontology it is an ontology which is used to categorize the functionalities or

applications of the services For instance data verification messaging data storage weather etc

Afterwards it is going to talk about the function of this component with its input output

Oslash Input

u Crawled data from Service Crawler

u Service-Finder Ontologies

u Feedback or Correction of before annotations

Oslash Function

u Enrich the information about the service and extract semantic statements according to the

Service-Finder Ontologies For example categorize the service according to the Service

Category Ontology

u Determine whether a particular document is relevant or not through the Web link graph If

not discard these irrelevant documents

u Classify the pages into their genres For instance pricing user comments FAQ and so on

Oslash Output

u Semantic annotation of the services

Deep Web Service Crawler

14

2123 The Principle of the Conceptual Indexer and Matcher

Component

The Conceptual Indexer and Matcher is more like a data store center that aims at storing all extracted

information of the services and supplying users the capability of retrieval and semantic query For

example the matchmaking between user requests and service offers and the act of retrieving user

feedback on extracted annotations

In addition letrsquos have a look of the function of this component and its input output

Oslash Input

u Semantic annotation data and full text information obtained from Automatic Annotation

u Semantic annotation data and full text information that come from user interfaces

u Cluster data from user and service clustering component

Oslash Function

u Store the semantic annotations received from the Automatic Annotation component and

from the user interface

u Store the cluster data that procured through the clustering component

u Store and index the textual description offered by the Automatic Annotation component

and the textual comments offered by users

u Ontological query the semantic data from the data store center

u Combined keyword and Ontological querying used for user queries

u Provide a list of similar services for a given service

Oslash Output

u A list of matching services that are queried by users In particular these services should be

sorted by ranking and can also be iterated

u All available data that related to a particular entity must be retrievable at the user interface

2124 The Principle of the Service-Finder Portal Interface

Component

The Service-Finder Portal Interface is the main entry point that provided for users of the

Service-Finder system to search and browse the data which is managed by the Conceptual Indexer

and Matcher component In addition the users can also contribute information by means of providing

tags comments categorizations and ratings to the data browsed Furthermore the developers can

still directly invoke the Service-Finder functionalities from their custom applications in terms of an API

Besides the details of this componentrsquos function input and output are represented as below

Oslash Input

u A list of ordered services for a query

u Detailed information about a service or a set of services and a service provider

u Query access to service category ontology and the most used tags provided by the users

Deep Web Service Crawler

15

u Service availability information

Oslash Function

u The Web Interface allows the users to search services by keyword tag or concept in the

categorization sort and filter query results by refining the query compare and bookmark

services try out the services that offer this functionality

u The API allows the developers to invoke Service-Finder functionalities

Oslash Output

u Explicit user annotations such as tags ratings comments decryptions and so on

u Implicit user data for example click stream of users bookmarks comparisons links sent

etc

u Manual advertising of available new services

2125 The Principle of the Cluster Engine Component

The Cluster Engine is in charge of collecting and analyzing information of user behaviors from the

Service-Finder Portal eg the queried services and the compared services of the users Moreover it

also provides cluster data to the Conceptual Indexer and Matcher for providing service

recommendations

Furthermore letrsquos detailed introduce this componentrsquos function input and output

Oslash Input

u Service annotation data of both extracted and user feedback

u Usersrsquo Click streams used for extracting user behaviors

Oslash Function

u Obtain user clusters from user behaviors

u Obtain service clusters from service annotation data to enable to find similar services

Oslash Output

u Clusters of users and services

22 Information Extraction

Due to the cause of the rapid development and use of the World-Wide Web there procedure a huge

amount of information sources on the Internet which has been limited the access to browsing and

searching for the reason of the heterogeneity and the lack of structure of Web information sources

Therefore the appearance of Information Extraction that transforms the Web pages into

program-friendly structures for post-processing would become a great necessity However the task of

Information Extraction is specified in terms of the inputs and the extraction targets And the

techniques used in the process of Information Extraction called extractor

221 Input Types of Information Extraction

Generally speaking there are three different input types The first input type is the unstructured

Deep Web Service Crawler

16

document For example the free text that showed in figure 2-2 It is unstructured and written in

natural language So that it will require substantial natural language processing While the second

input type is called the structured document For instance the XML documents based on the reason

that the data can be described through the available DTD (Document Type Definition) or XML

(eXtensible Markup Language) schema Finally but obviously the third input type is the

semi-structured document that are widespread on the Web Such as the large volume of HTML

pages like tables itemized lists and enumerated lists This is because HTML tags are often used to

render these embedded data in the HTML pages See figure 2-3

Figure2-2Left is the free text input type and right is its output [4]

Figure2-3A Semi-structured page containing data records

(in rectangular box) to be extracted [4]

Therefore in this way the inputs of semi-structured type can be seen as the documents with a fairly

regular structure And the data of these documents can be displayed in a format of HTML way or

non-HTML way Besides owing to the reason that the Web pages of the Deep Web are dynamic and

generated from structured databases in terms of some templates or layouts thus it would be

considered as one of the input sources which could provide some of these semi-structured documents

For example the authors price and comments of the book pages that provided by Amazon have the

Deep Web Service Crawler

17

same layout That is because these Web pages are generated from the same database and applied

with the same template or layout Furthermore there has another option which could manually

generate HTML pages of semi-structured type For example although the publication lists that

provided from different kinds of researchersrsquo homepages are produced by diverse uses they all have

title and source property for every single pager Eventually the inputs for some Information Extraction

can also be the pages with the same class or among various Web Service Registries

222 Extraction Targets of Information Extraction

Moreover regarding the task of the Information Extraction it has to consider the extraction target

There also have two different extraction targets The first one is the relation of k-tuple And the k in

there means the number of attributes in a record Nevertheless in some cases an attribute of one

record may have none instantiation Otherwise the attribute owns multiple instantiations In addition

the complex object with hierarchically organized data would be the second extraction target Though

the ways for depicting the extraction targets in a page are diverse the most common structure is the

hierarchical tree This hierarchical tree may contain only one leaf node or one or more lists of leaf

nodes which called internal nodes And the structure for a data object may also be flat or nested To

be brief if the structure is flat then there is only one leaf node that call also be called root Otherwise

if it is nested structure then the internal nodes that involved in this data object would be more than

two levels

Furthermore in order to make the Web pages readable for human being and having an easier

visualization these tables or tuples of the same list or elements of a tuple should be definitely isolated

or demarcated However the displaying for a data object in a Web page would be affected by

following conditions [4]

Oslash The attribute of a data object has zero or several values

(1) If there is no value for the attribute of a data object this attribute will be called the ldquononerdquo

attribute For example a special offer only available for certain books might be a ldquononerdquo

attribute

(2) If there are more than one values for the attribute of a data object it will be called the

ldquomultiValuerdquo attribute For instance the name of the author for a book could be a

ldquomultiValuerdquo attribute

Oslash The set of attributes (A1 A2 A3 hellip) has multiple ordering

That is to say among this set of attribute the position of the attribute might be changed

according to the diverse instances of a data object Thus this attribute will be called the

ldquomultiOrderingrdquo attribute For instance for the moives before year 1999 the move site would

enumerate the release data in front of the movesrsquo title while for the movies after year 1999

(including 1999) it will enumerate the release data behind the movesrsquo title

Oslash The attribute has different formats

This means the displaying format of the data object could be completely distinct with respect to

these different instances Therefore if the format of an attribute is free then a lot of rules will be

needed to deal with all kinds of possible cases This kind attribute will be called ldquomultiFormatrdquo

attribute For example an ecommerce Web site would use the bold font format to present the

general prices while use the red color format to display the sale prices Nevertheless there has

Deep Web Service Crawler

18

another situation that some different attributes for a data object have the same format For

example various attributes are presented in terms of using the ltTDgt tags in a table presentation

And the attributes like those could be differentiated by means of the order information of these

attributes However for cases that there occurs ldquononerdquo attribute or exists ldquomultiOrderingrdquo

attributes it must have to revise the rules for extracting these attributes

Oslash The attribute cannot be decomposed

Because of the easier processing sometimes the input documents would like to be treated as

strings of tokens instead of the strings of characters In addition some of the attribute cannot

even be decomposed into several individual tokens These attributes are called the ldquountokenizedrdquo

attributes For example the college course catalogue like ldquoCOMP4016rdquo or ldquoGEOL2001rdquo The

department code and the course number in them cannot be separated into two different strings

of characters like that ldquoCOMPrdquo and ldquo4016rdquo or ldquoGEOLrdquo and ldquo2001rdquo

223 The Used Techniques in Information Extraction

The extractor used in the process of Information Extraction aims at providing a single uniform query

interface to access information sources like database server and Web server It consists of following

phases collecting returned Web pages labeling these Web pages generalizing extraction rules

extracting the relevant data and outputting the result in an appropriate format (XML format or

relational database) for further information integration For example at first the extractor queries the

Web server to gather the returned pages through the HTTP protocols after that it starts to extract the

contents among these HTML documents and integrate with other data sources thereafter Actually

the whole process of the extractor follows below steps

Oslash Step 1

At the beginning it must have to tokenize the input However there are two different

granularities for the input string tokenization They are tag-level encoding and word-level

encoding The tag-level encoding will transform the tags of HTML page into the general tokens

while transform all text string between two tags into a special token Nevertheless the

word-level encoding does this in another way It treats each word in a document as a token

Oslash Step 2

Next it should apply the extraction rules for every attributes of the data object in the Web pages

These extraction rules could be induced in terms of a top-down or bottom-up generalization

pattern mining or logic programming In addition the type of extraction rules may be indicated

by means of regular grammars or logic rules For example some use path-expressions of the

HTML parse tree path like htmlheadtitle or html-gttable[0] some use syntactic or semantic

constraints and some use delimiter-based constraints such as HTML tags or literal words

Oslash Step 3

After that all these extracted data would be assembled into the records

Oslash Step 4

Finally iterate this process until all these data objects in the input

Deep Web Service Crawler

19

23 Pica-Pica Web Service Description Crawler

The Pica-Pica is knows as a kind of bird species it can also be called pie However at the moment the

Pica-Pica here is a Web Service Description Crawler which is designed to solve the quality of Web

Services problem For example the evaluation of the descriptive quality of Web Services that offered

and how well are these Web Services described in nowadaysrsquo Web Service Registries

231 Needed Libraries of the Pica-Pica Web Service

Description Crawler

This version of the Pica-Pica Web Service Description Crawler is written by Anton Caceres and Josef

Spillner and programmed in terms of the Python language Actually in order to run these scripts to

parse the HTML pages it needs two additional libraries Beautiful Soup and Html5lib

Beautiful Soup

It is an HTML/XML parser for the Python language, and it can even turn invalid markup into a parse tree [5]. Moreover, the following three features make it more powerful:

- Bad markup does not choke Beautiful Soup. It generates a parse tree that makes approximately as much sense as the original document, so the desired data can still be obtained.
- Beautiful Soup provides a toolkit with simple, idiomatic methods for navigating, searching and modifying the parse tree. Hence there is no need to create a custom parser for every application.
- If the document already specifies an encoding, it can be ignored, since Beautiful Soup converts documents to Unicode and outputs UTF-8 automatically. Otherwise, all that has to be done is to specify the encoding of the original document.

Furthermore, the ways of including Beautiful Soup in an application are shown in the following [5]:

    from BeautifulSoup import BeautifulSoup          # for processing HTML
    from BeautifulSoup import BeautifulStoneSoup     # for processing XML
    import BeautifulSoup                             # to get everything

Html5lib

It is a Python package which implements the HTML5 [8] parsing algorithm. In order to gain maximum compatibility with the current major desktop web browsers, this implementation is based on the WHATWG (Web Hypertext Application Technology Working Group) HTML5 specification.


2.3.2 Architecture of the Pica-Pica Web Service Description Crawler

Figure 2-4: Architecture of the Pica-Pica Web Service Description Crawler

Figure 2-4 illustrates the primary architecture of the Pica-Pica Web Service Description Crawler. It includes four fundamental components: the Service Page Grabber component, the WSDL Grabber component, the Property Grabber component and the WSML Register component.

(1) The Service Page Grabber component takes the URL seed as input and outputs the links of the service pages to the following two components: the WSDL Grabber component and the Property Grabber component.

(2) The WSDL Grabber component is responsible for obtaining the WSDL document based on the delivered service page link and then checks whether the obtained WSDL document is valid. Finally, only the valid WSDL documents are passed on to the WSML Register component for further processing.

(3) The Property Grabber component tries to extract the service properties hosted in the service page, if any exist. Afterwards, all these service properties are saved into an INI file as the information of that service.

(4) The functionality of the WSML Register component is to write an appropriate WSML document by means of the valid WSDL documents delivered from the WSDL Grabber component and the optional INI files delivered from the Property Grabber component. Afterwards, it registers them in Conqo.

WSML [9]

It stands for Web Service Modeling Language, which provides a framework with different language variants. Hence it is often used to describe the different aspects of semantic Web services according to the conceptual model of WSMO.

WSMO [10]

WSMO, whose full name is Web Service Modeling Ontology, is dedicated to describing various aspects related to Semantic Web Services based on ontologies. An ontology is a formal, explicit specification of a shared conceptualization. In fact, these ontologies are the sticking points that link the agreement of the communities of users with the defined conceptual semantics of the real world.

Conqo [11]


It is a discovery framework that considers not only the Quality of Service (QoS) but also context information. It uses a Web Service Repository to manage service descriptions that are based on WSML.

2.3.3 Implementation of the Pica-Pica Web Service Description Crawler

This section describes the implementation of the Pica-Pica Web Service Description Crawler in detail.

(1) Firstly, in order to start the whole crawling process of the Pica-Pica Web Service Description Crawler, an initial seed is needed as input. This crawler covers the five Web Service Registries listed below; the URL addresses of these five Web Service Registries are used as the input seed for the Pica-Pica Web Service Description Crawler. Moreover, in this version of the Pica-Pica Web Service Description Crawler there is a single Python script for each Web Service Registry, and the crawling processes of these Web Service Registries' Python scripts are executed one after another.

Biocatalogue: http://www.biocatalogue.com
Ebi: http://www.ebi.ac.uk
Seekda: http://www.seekda.com
Service-Repository: http://www.service-repository.com
Xmethods: http://www.xmethods.net

(2) Then, after being fed with the input seed, the crawler steps into the next component, the Service Page Grabber. At first this component tries to read the data from the Web based on the input seed. Then it establishes a parse tree of the read data by means of the functions of the Beautiful Soup library. After that, the Service Page Grabber component starts to look for the service page link of each service published in the Web Service Registry by means of the functions of the Html5lib library. In case the service page link of a single service is found, the component first checks whether this service page link is valid or not. Once the service page link is valid, it is passed into the following two components for further processing: the WSDL Grabber component and the Property Grabber component.

(3) When the WSDL Grabber component receives a service page link from its previous component, it sets out to extract the WSDL link address of that service from the parse tree of the data in this service page. Next, this component starts to download the WSDL document of that service via the WSDL link address; thereafter, the obtained WSDL document is stored on the disk. The process of this WSDL Grabber component is carried on continually until no more service links are passed to it. Certainly, not all grabbed WSDL documents are usable: they may contain bad definitions or a bad namespace URI, be empty documents or, even worse, not be in XML format at all. Hence, in order to pick them out, this component further analyzes the involved WSDL documents. All valid documents are then put into a "validWSDLs" folder, whereas the invalid documents are put into a folder named "invalidWSDLs" in order to gather statistical information. Finally, only the WSDL documents in the "validWSDLs" folder are passed to the subsequent component. (A small sketch of such a validity check is given after this list.)

(4) Moreover, since some Web Service Registries provide additional information about the services, such as availability, service provider and version, the Property Grabber component sets out to extract this information as the service's properties and thereafter saves these properties into an INI file. However, if no additional information is available, there is no need to extract service properties, and thus there is no INI file for that service. In the implementation of the Pica-Pica Web Service Description Crawler, only the Python scripts for the Seekda and Service-Repository Web Service Registries have functions to extract the services' properties, while the scripts for the other three Web Service Registries have no such function.

(5) Furthermore, it is optional to create a report file which contains statistical information about this process, such as the total number of services of one Web Service Registry, the number of services whose WSDL document is invalid, etc.

(6) As has been stated, at this point there is a folder with all valid WSDL documents and possibly also some INI files. The task of the WSML Register component is now to generate the appropriate WSML documents from these valid WSDL documents and INI files, and then register them in Conqo.
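The thesis does not reproduce the validity check of the Pica-Pica crawler here, so the following minimal Java sketch (in the language used for the rest of this thesis, not the original Python code) only illustrates the kind of checks described in step (3) above: an empty file, content that is not well-formed XML, or a missing WSDL "definitions" root element all make a document invalid.

    import java.io.File;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;

    public class WsdlValidityCheck {
        // Returns true if the file looks like a usable WSDL document
        public static boolean isValidWsdl(File wsdlFile) {
            if (!wsdlFile.exists() || wsdlFile.length() == 0) {
                return false;                      // empty document
            }
            try {
                DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
                factory.setNamespaceAware(true);
                Document doc = factory.newDocumentBuilder().parse(wsdlFile);
                String root = doc.getDocumentElement().getLocalName();
                return "definitions".equals(root); // WSDL 1.1 root element
            } catch (Exception e) {
                return false;                      // not well-formed XML
            }
        }
    }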

2.4 Conclusions of the Existing Strategies

This chapter presented three aspects of the existing strategies: the Service-Finder project, the Information Extraction technique and the Pica-Pica Web Service Description Crawler.

The task of this master program is to obtain the available Web services and their related information from the Web. This is essentially a procedure of extracting the needed information for each service, such as the service's WSDL document and its properties. Therefore, the Information Extraction technique, which is used to extract information hosted in the Web, can be applied in this master program.

Moreover, Service-Finder is a large project that is not only able to obtain the available Web services and their related information from the Web, but also enriches the crawled data with annotations by means of the Service-Finder ontology and the Service Category ontology and integrates all the information into a coherent semantic model. Furthermore, it also provides capabilities for searching and browsing the data through a user interface and gives users service recommendations. However, for a master program the Service-Finder project far exceeds the requirements. Therefore, it is only considered as a reference for this master program.

Furthermore, since the Pica-Pica Web Service Description Crawler aims only at obtaining the available Web services and their related information, it fulfills the primary task of this master program. Nevertheless, regarding the information about a service, the Pica-Pica Web Service Description Crawler extracts only a few properties, sometimes even none. Consequently, in order to improve the quality of the service information, as many properties about each service as possible have to be extracted. Thence, chapter 3 presents an extension of this Pica-Pica Web Service Description Crawler.


3 Design and Implementation

In the previous chapter on the state of the art, the already existing techniques and implementations were presented. In the following, the basic principle of the proposed approach, the Deep Web Services Crawler, is introduced; it is based on these existing techniques, especially the Pica-Pica Web Service Description Crawler.

3.1 Deep Web Services Crawler Requirements

This section mainly discusses the goals of the Deep Web Service Crawler approach, the system requirements with respect to the approach, and some non-functional requirements.

3.1.1 Basic Requirements for DWSC

The following list contains the basic requirements which should be achieved:

(1) Produce the largest annotated Service Catalogue. Here, the Service Catalogue is actually a list of Web services that are published in the Web Service Registries. It contains not only the WSDL document of each service but also a file with the information about that service. Therefore, for the purpose of producing the largest annotated Service Catalogue, the proposed approach needs to extract as many properties about those Web services as possible. Moreover, it also has to download the WSDL document that is hosted along with the Web service. That is to say, these properties include not only the interesting structured properties, such as the service name and its WSDL link address, but also other non-functional properties, for example the endpoint and its monitoring information.

(2) Flexible storage: each service has a Service Catalogue entry that contains all its interesting properties. A major question is how to deal with those service properties, that is, which schemes are used to store them. Hence, in order to store them in a flexible way, the proposed approach provides three methods for storage: the first stores them as an XML file, the second stores them in an INI file, and the third uses a database for the storage.

3.1.2 System Requirements for DWSC

Generally speaking, the requirements for realizing a programming project comprise the following list:

1) Operating system: Linux/Unix, Windows (XP, Vista, 2000, etc.)
2) Programming language: C++, Java, Python, C, etc.
3) Programming tool: NetBeans, Eclipse, Visual Studio, and so on

In this master thesis, the scripts of the Deep Web Service Crawler approach are written in the Java programming language. Besides, these code scripts have only been tested on the Windows XP and Linux operating systems, but not on other operating systems.


3.1.3 Non-Functional Requirements for DWSC

In this part, several non-functional requirements for the Deep Web Service Crawler approach are presented:

1) Transparency: the process of data exploration and data storage should be done automatically without the user's intervention. However, at first the user should specify the path on the hard disk which will be used to store the outputs of this program.

2) Fault tolerance: during the execution of this program, errors can inevitably happen. Therefore, in order to keep the process from being interrupted, there must be some necessary error handling for process recovery.

3) Completeness: this approach should extract as many of the interesting properties about each Web service as possible, e.g. endpoint and monitoring information.

In addition, since the Pica-Pica Web Service Crawler has already implemented strategies for the following five URLs, the proposed approach must support no fewer than these five Web Service Registries:

Biocatalogue: http://www.biocatalogue.com
Ebi: http://www.ebi.ac.uk
Seekda: http://www.seekda.com
Service-Repository: http://www.service-repository.com
Xmethods: http://www.xmethods.net

3.2 Deep Web Services Crawler Architecture

In this section, an overview of the high-level architecture of the Deep Web Services Crawler approach is introduced first. Thereafter, four subsections are presented that focus on outlining each single component and how they play together.

The current components and flows of data in the Deep Web Service Crawler can be summarized as depicted in Figure 3-1, using the continuous arrows. The crawler first tries to obtain the available service list page links and related service page links by crawling the Web (Web Service Extractor). Then those gathered links are processed in two separate steps: one obtains the service's WSDL document (WSDL Grabber) and the other collects the properties of each service (Property Grabber). Finally, all these data are stored on the storage device (Storage). The whole detailed process of figure 3-1 is illustrated as follows.

Step 1:
When the Deep Web Service Crawler starts to run, the File Chooser container requires the user to specify a path on the computer or on any other hard disk. The reason for specifying the path is that the Deep Web Service Crawler program needs a place to store all its outputs.

Step 2:
After that, the Web Service Extractor is triggered. It is the main entry to the specific crawling process. Since the Deep Web Service Crawler program is supposed to crawl for Web services in some given Web Service Registries, the URL addresses of these Web Service Registries are given as an initial seed for this Web Service Extractor process. However, since the page structures of these Web Service Registries are completely different, there is a dedicated process for each Web Service Registry.

Figure 3-1: Overview of the basic architecture of the Deep Web Services Crawler

Step 3:
According to the given seed, two types of links are obtained by the Web Service Extractor component: one is the service list page link and the other is the service page link. A service list page is a page that contains a list of Web services and possibly some information about these Web services, while a service page is a page that contains much more information about a single service. Finally, the component forwards these two types of links to the next two components, the Property Grabber and the WSDL Grabber.

Step 4:
On the one hand, the Property Grabber component tries to gather the information about the service that is hosted in the service list page and the service page, such as the name of the service, its description, the rating of this service, etc. Finally, all the information about the service is collected together as the service properties, which are then delivered to the Storage component for further processing.

Step 5:
On the other hand, the WSDL Grabber tries to obtain the WSDL link from the service list page or the service page. That is because for some Web Service Registries the WSDL link is hosted in the service list page, like in Biocatalogue, and for other Web Service Registries it is hosted in the service page, such as in Xmethods. After the WSDL link has been obtained, it is also transmitted to the Storage component for further processing.

Step 6:
When the service properties and the WSDL link of the service are received by the Storage component, it stores them on the disk. The service properties are stored in three different ways: as an XML file, as an INI file, or as one record inside a table of a database. For the WSDL link, the Storage component first tries to download the page content according to the URL address of the WSDL link. If this works successfully, the page content of the service is stored as a WSDL document on the disk.

Step 7:
Nevertheless, this is just the crawling process of a single service, from step 3 to step 6. Hence, if there is more than one service, or more than one service list page, in those Web Service Registries, the crawling process from step 3 to step 6 is repeated again and again until there are no more services or service list pages left in those Web Service Registries.

Step 8:
Furthermore, after the crawling process for one Web Service Registry finishes, a file is generated that contains some statistical information about this crawling process, for example the time when the crawling of this Web Service Registry started and when it finished, the total number of Web services in this Web Service Registry, how many services have an empty WSDL document, the average number of service properties in this Web Service Registry, and the average time cost for extracting service properties, getting the WSDL document and generating the XML file, INI file, etc.

3.2.1 The Function of the Web Service Extractor Component

The Web Service Extractor component is responsible for gathering service list page links and service page links. It pursues a focused crawl of the Web and only forwards service list page and service page links to subsequent components for analyzing, collecting and gathering purposes. Therefore, it identifies both service list page links and related service page links on these Web Service Registries.

As can be seen from figure 3-2, a crawl for Web services needs to start from a seed URL. This seed is almost as important as the Web Service Extractor itself, as it highly influences the part of the Web which needs to be crawled. The seed can or shall contain, e.g., Web pages where these Web services are published or that talk about Web services.


Figure 3-2: Overview of the process flow of the Web Service Extractor component

After being fed with the seed URL, the Web Service Extractor component starts to get the link of the service list page from the initial page of this URL seed. However, this process differs between the five Web Service Registries. The following shows the different situations in these Web Service Registries.

Service-Repository Web Service Registry:
In this Web Service Registry, the link of the first service list page is the URL address of its seed, which means some Web services can be found on the home page of the Service-Repository Web Service Registry. The "first" here implies that there is more than one service list page link in this Web Service Registry. Therefore, the process of getting service list page links in this registry is carried on continually until no more service list page links exist.

Xmethods Web Service Registry:
Although there are Web services on the home page of the Xmethods Web Service Registry, these Web services are only a small subset of those in this Web Service Registry. Moreover, in the Xmethods Web Service Registry there is exactly one page containing all Web services. Therefore, the service list page link of that page has to be obtained.

Ebi Web Service Registry:
The situation in the Ebi Web Service Registry is a little bit like that in the Xmethods Web Service Registry. That is to say, there is also one page that contains all Web services of this Web Service Registry. However, this page is not the initial page of the input seed. Therefore, more than one operation step is needed to get the service list page link of that page.

Seekda Web Service Registry:
In the Seekda Web Service Registry, the Web services are not contained in the initial page of the input seed. The service list page link can be obtained after several additional operation steps. However, there is a problem with getting the service list page links in this registry: simply put, if there is more than one page containing the Web services, for some unknown reason the links of the remaining service list pages cannot be obtained. In other words, only the link of the first service list page can be obtained.

Biocatalogue Web Service Registry:
The process of getting the service list page in the Biocatalogue Web Service Registry is almost the same as in the Seekda Web Service Registry. The only difference is that in the Biocatalogue Web Service Registry all service list page links can be obtained if there is more than one service list page.

Then, after getting the link of the service list page, the Web Service Extractor begins to get the link of the service page for each service that is listed in the service list page. The reason why it can do this is that there is an internal link for every service which addresses its service page. It is worth noting that once a service page link is obtained, this component immediately forwards the service page link and the service list page link to the subsequent two components for further processing. Nevertheless, the process of obtaining the service page links is carried out continuously until all services listed in that service list page are crawled. Analogously, the process of getting service list pages is also carried out continuously until no more service list pages exist.

3.2.1.1 Features of the Web Service Extractor Component

The main features are described in the following paragraphs.

1) Obtain service list page links
A central task is to obtain the corresponding service list page links. Such a link is a URL address that points to a public list of Web services together with some simple information about these Web services, like the name of the service, an internal URL that links to another page which contains detailed information about that service, and sometimes the link address of the WSDL document.

2) Obtain service page links
Once service list page links are found, a crucial aspect is to extract the internal link of each service to aid the task of service information discovery. Therefore, it is the task of the Web Service Extractor to harvest the HTML page content of the service list page, so that the service page links, which contain much more detailed information about the single Web service, can be obtained.

3.2.1.2 Input of the Web Service Extractor Component

This component is dependent on some specific input seeds. The only input required for this component is a seed URL. Actually, this URL seed will be one of the URLs that are displayed in section 3.1.3.

3.2.1.3 Output of the Web Service Extractor Component

The component produces two kinds of service-related page links from the Web:
- Service list page links
- Service page links


3.2.1.4 Demonstration of the Web Service Extractor

In order to provide a comprehensive understanding of the process of the Web Service Extractor component, the following gives some figures for explanation. Though there are five URL addresses in this section, only the URL of the Service-Repository is shown as an example.

1) The input seed is the initial URL address of the Service-Repository, which is "http://www.service-repository.com".

2) As already said in section 3.2.1, the first service list link of this Web Service Registry is its input seed, "http://www.service-repository.com". Figure 3-3 shows the corresponding service list page of that link.

Figure 3-3: Service list page of the Service-Repository
Figure 3-4: Original source code of the internal link for the Web service "BLZService"
Figure 3-5: Code overview of getting the service page link in the Service-Repository
Figure 3-6: Service page of the Web service "BLZService"


3) Now, because the service list page link is already known, the next step is to acquire the service page link of each service listed in the service list page. The text in the red box of figure 3-4 shows the internal link of the Web service "BLZService". However, this is not the complete link of the service page: it has to be prefixed with the initial URL address of the Service-Repository Web Service Registry, "http://www.service-repository.com". The code for getting the service page link of a Web service in the Service-Repository is shown in figure 3-5 (a small illustrative sketch of this prefixing step is also given at the end of this section). Therefore, the final link of this service page is "http://www.service-repository.com/service/overview-210897616". Figure 3-6 shows the corresponding service page of that link.

4) Afterwards, those two links, the service list page link and the service page link, which were gathered by the Web Service Extractor component, are immediately forwarded to the next two components, the WSDL Grabber component and the Property Grabber component.
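Since figure 3-5 is not reproduced here, the following minimal Java sketch only illustrates the idea of step 3): relative service links are collected from the service list page and prefixed with the registry's base URL. It uses the jsoup HTML parser, and the CSS selector for the internal links is an assumption, not the selector of the actual implementation.

    import java.io.IOException;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class ServicePageLinkSketch {
        private static final String BASE_URL = "http://www.service-repository.com";

        public static void main(String[] args) throws IOException {
            // Download and parse the service list page (here: the registry home page)
            Document listPage = Jsoup.connect(BASE_URL).get();
            // Assumption: internal service links are relative hrefs starting with "/service/"
            for (Element link : listPage.select("a[href^=/service/]")) {
                // Prefix the relative link with the registry's base URL
                String servicePageLink = BASE_URL + link.attr("href");
                System.out.println(servicePageLink);
            }
        }
    }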

3.2.2 The Function of the WSDL Grabber Component

The WSDL Grabber component acquires the WSDL link that is hosted in the Web, based on the service list page link or the service page link. The whole process flow is illustrated in figure 3-7.

Figure 3-7: Overview of the process flow of the WSDL Grabber component

When the WSDL Grabber component receives the inputs delivered from the previous component, it starts to get the WSDL link for the service based on these inputs. However, although the inputs of the WSDL Grabber component are the links of the service page and the service list page, only one of them contains the WSDL link of the corresponding service. That is to say, the WSDL link exists either in the service page or in the service list page. The reason why both links need to be delivered to this component is that only one of the five Web Service Registries, namely Biocatalogue, hosts the WSDL link in the service list page, while for the other four Web Service Registries the WSDL link is hosted in the service page. Therefore, the WSDL links of these four Web Service Registries are obtained via the service page link, and for the Biocatalogue Web Service Registry the WSDL link is obtained through the service list page link. However, there is a problem with getting the WSDL links of the Biocatalogue Web Service Registry: in brief, some of the Web services listed in the service list page of the Biocatalogue Web Service Registry have no WSDL link, in other words these services have no WSDL document. In situations like this, the WSDL link of these Web services is assigned a "NULL" value. Nevertheless, for the Web services in the other four Web Service Registries, the WSDL link always exists in the service page. Eventually, whenever the WSDL Grabber component extracts the WSDL link of a single Web service, it is immediately forwarded to the Storage component so that the WSDL document can be downloaded at once.

3.2.2.1 Features of the WSDL Grabber Component

The WSDL Grabber component has to provide the following feature:

- Obtain WSDL links
The WSDL link is the direct way to get to the page that contains the contents of the WSDL document. It is actually a URL address, but at the end of this URL address there is something like "wsdl" or "WSDL" to indicate that this address points to the page of a WSDL document.

3.2.2.2 Input of the WSDL Grabber Component

This component requires the following input data:
- Service list page link
- Service page link

3.2.2.3 Output of the WSDL Grabber Component

The component only produces the following output data:
- The URL address of the WSDL link of each service

3.2.2.4 Demonstration of the WSDL Grabber Component

This section demonstrates a list of figures in order to give a comprehensive understanding of the process of the WSDL Grabber component. It uses the same Web service of the Service-Repository as an example.

1) The input for this WSDL Grabber component is the link of the service page that was obtained from the Web Service Extractor component. The address of this link is "http://www.service-repository.com/service/overview-210897616".

Figure 3-8: WSDL link of the Web service "BLZService" in the service page


2) Figure 3-9 shows the corresponding original source code of the WSDL link that is highlighted in figure 3-8.

Figure 3-9: Original source code of the WSDL link for the Web service "BLZService"

3) Figure 3-10 and figure 3-11 show the code that is used to extract the WSDL link shown in figure 3-9. However, figure 3-10 contains the particular code only for the Service-Repository Web Service Registry; for the other four Web Service Registries this code is different. The function "getServiceRepositoryWSDLLink" first gets a list of nodes that have the same HTML tag name "b". Then it checks all of these nodes one by one to see whether the text value of a node is "WSDL". If a node fulfills this condition, the attribute value of its sibling, which is an "a" element here, is extracted as the value of the WSDL link of this Web service. (A sketch based on this description is given at the end of this section.)

Figure 3-10: Code overview of the "getServiceRepositoryWSDLLink" function
Figure 3-11: Code overview of the "oneParameter" function

4) Finally, after applying the functions shown in figures 3-10 and 3-11, the WSDL link of the Web service "BLZService" can be obtained. Its value is "http://services.unitedplanet.de/blz/BlzService.asmx?WSDL".
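Because figures 3-10 and 3-11 are not reproduced here, the following minimal Java sketch merely follows the textual description of step 3): scan the <b> nodes of the service page and, when one carries the text "WSDL", take the href attribute of the neighbouring <a> element. The use of jsoup and the assumption that the anchor is the direct element sibling are simplifications, not the actual thesis code.

    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class WsdlLinkSketch {
        public static String getWsdlLink(Document servicePage) {
            for (Element bold : servicePage.select("b")) {
                if ("WSDL".equals(bold.text().trim())) {
                    Element anchor = bold.nextElementSibling();
                    if (anchor != null && "a".equals(anchor.tagName())) {
                        return anchor.attr("href");
                    }
                }
            }
            return "NULL"; // convention used when no WSDL link exists
        }
    }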


3.2.3 The Function of the Property Grabber Component

The Property Grabber component is a module which is used to extract and gather all the Web service information hosted in the Web, which is in fact the information shown in the service list page and the service page. In the end, all the obtained Web service information is collected together as the service properties, which are delivered to the Storage component for storing. The detailed process flow of the Property Grabber component is illustrated in figure 3-12.

The inputs of the Property Grabber component are the same as the inputs of the WSDL Grabber component. However, there is still a small difference between them with respect to the seed. As already mentioned in section 3.2.2, for the WSDL Grabber component only one of the inputs is sufficient to get the WSDL link, while the Property Grabber component needs both of them as seeds. Therefore, once the Property Grabber component receives the needed inputs, it starts to extract the information of that single Web service.

Figure 3-12: Overview of the process flow of the Property Grabber component

After the Property Grabber component receives the inputs, it starts to extract the service information for the Web service. Generally speaking, the service information consists of four aspects: structured information, endpoint information, monitoring information and Whois information.

(1) Structured Information
The structured information can be obtained by extracting the information hosted in the service page and the service list page. It is the basic descriptive information about the service, such as the name of the service, the URL address through which the WSDL document can be obtained, the description introducing the service, the provider who offers that service, its rating, and the server which hosts this service, etc. However, the elements constituting this structured information differ between the different Web Service Registries. For example, the rating information of a Web service exists in the Service-Repository Web Service Registry, while the Xmethods Web Service Registry does not have this information. In addition, even for Web services in the same Web Service Registry, some elements of the structured information may not exist. For instance, one service in a Web Service Registry may have a description, while another service in the same Web Service Registry does not. Tables 3-1 to 3-5 show the structured information that should be extracted in these five Web Service Registries. Moreover, if the style of a Biocatalogue Web service is SOAP, there is some additional information describing the SOAP operations of this service, while if it is REST, this additional information describes the REST operations. This should also be considered as a part of the structured information. Table 3-6 and table 3-7 illustrate the information for these two different kinds of operations.

Table 3-1 (Structured information of the Service-Repository Web Service Registry): Service Name, WSDL Link, WSDL Version, Provider, Server, Rating, Homepage, Owner Homepage, Description

Table 3-2 (Structured information of the Xmethods Web Service Registry): Service Name, WSDL Link, Provider, Service Style, Homepage, Implementation Language, Description, User Description, Contributed Client Name, Type of this Client, Publisher for this Client, Used Toolkit of this Client, Used Language of this Client, Used Operating System of this Client

Table 3-3 (Structured information of the Seekda Web Service Registry): Service Name, WSDL Link, Server, Provider, Provider's Country, Service Style, Rating, Description, User Description, Service Tags, Documentation (within WSDL)

Table 3-4 (Structured information of the Ebi Web Service Registry): Service Name, WSDL Link, Port Name, Service URL Address, Implementation Class

Table 3-5 (Structured information of the Biocatalogue Web Service Registry): Service Name, WSDL Link, Style, Provider, Provider's Country, View Times, Favorite Times, Submitter, Service Tags, Total Annotation, Provider Annotation, Member Annotation, Registry Annotation, Base URL, SOAP Lab Server Base URL, Description, User Description, Category


Table 3-6 (SOAP operation information of the Biocatalogue Web Service Registry): SOAP Operation Name, Inputs and Outputs, Operation Description, Operation Tags, Part of Which Service

Table 3-7 (REST operation information of the Biocatalogue Web Service Registry): REST Operation Name, Service Tags, Used Template, Operation Description, Part of Which Service, Part of Which Endpoint Group

(2) Endpoint Information
Endpoint information is the port information of the Web service, which can be extracted only through the service page. However, different Web Service Registries structure the endpoint information of a Web service differently; hence, some elements of the endpoint information are very diverse. One thing needs attention: the Ebi Web Service Registry has no endpoint information for any of the Web services published in this registry. Moreover, even though the Web services in the same Web Service Registry share the same structure of the endpoint information, some elements of the endpoint information may be missing or empty. Furthermore, these Web Service Registries may even have no endpoint information at all for some of the Web services they publish. Nevertheless, whenever there is endpoint information for a Web service, there is at least one element, namely the URL address of the endpoint. Table 3-8 shows the endpoint information that should be extracted in these five Web Service Registries.

Table 3-8: Endpoint information of these five Web Service Registries
- Service Repository: Endpoint Name, Endpoint URL, Endpoint Critical, Endpoint Type, Bound Endpoint
- Xmethods: Endpoint URL, Publisher of this Endpoint, Contact Email of this Publisher, Implementation Language of this Endpoint
- Seekda: Endpoint URL
- Biocatalogue: Endpoint Name, Endpoint URL

Table 3-9: Monitoring information of these five Web Service Registries
- Service Repository: Service Availability, Number of Downs, Total Uptime, Total Downtime, MTTR, MTBF, RTT Max of Endpoint, RTT Min of Endpoint, RTT Average of Endpoint, Ping Count of Endpoint
- Seekda: Service Availability, Begin Time of Monitoring
- Biocatalogue: Monitored Status of Endpoint, Overall Status of Service

(3) Monitoring Information
Monitoring information is the measured statistical information about the Web service. It is worth noting that the Ebi and Xmethods Web Service Registries have no monitoring information for any of the Web services they publish, while for the other three Web Service Registries only a few Web services may lack this information. Table 3-9 displays the monitoring information for these three Web Service Registries.

(4) Whois Information
Whois information is not extracted from the information hosted in the service page and service list page. It is descriptive information about the service domain, which can be gained by means of the address of the WSDL link. Because of that, the process of getting the Whois information starts after the service domain has been obtained. The final value of the service domain must not contain strings like "http", "https", "www", etc.; it must be the plain domain under the top-level domain. After that, the service domain database is queried by sending the value of the service domain to a Whois client, which is simply a Web site on the Internet, for example "http://www.whois365.com/cn/domain". If information about that service domain exists, a list of its information is returned as the output. However, the structure of the returned information differs for different service domains. Therefore, the most challenging part is that the extraction process has to deal with each different form of the returned information. Table 3-10 gives the Whois information that needs to be extracted for all five Web Service Registries.

Table 3-10 (Whois information for these five Web Service Registries): Service Domain URL, Domain Name, Domain Type, Domain Address, Domain Description, State, Postal Code, City, Country, Country Code, Phone, Fax, Email, Organization, Established Time

Finally, all the information of these four aspects is collected together and then delivered to the Storage component for further storage processing.

3.2.3.1 Features of the Property Grabber Component

The Property Grabber component has to provide the following features:

- Obtain basic information
Generally speaking, the more information a Web service has, the better one can judge how good this Web service is. Hence, it is necessary for the Property Grabber component to extract all the basic information hosted in the service list page and the service page. This basic information comprises the structured information, the endpoint information and the monitoring information.

- Obtain Whois information
Since the more information a Web service has, the better its quality can be assessed, it is necessary to extract as much information about the Web service as possible. Therefore, besides the basic information, the Property Grabber component also obtains some additional information called Whois information, such as the type of the domain, the name of the person in charge, the postal code of the domain, city, phone, fax, detailed address, email, etc.


3.2.3.2 Input of the Property Grabber Component

This component requires the following input data:
- Service list page link
- Service page link

3.2.3.3 Output of the Property Grabber Component

The component produces the following output data:
- Structured information of each service
- Endpoint information about each service, if it exists
- Monitoring information for the service and its endpoint, if it exists
- Whois information of the service domain

All this information is collected together as the properties of each service; thereafter, the collected properties are sent to the Storage component.

3.2.3.4 Demonstration of the Property Grabber Component

The pictures from figure 3-13 to figure 3-16 depict the fundamental and primary procedure of the Property Grabber component. To simplify the explanation, the example shown in this section uses the same Web service as before.

1) The inputs of the Property Grabber component are the links of the service list page and the service page that are received from the Web Service Extractor component. These links are "http://www.service-repository.com" and "http://www.service-repository.com/service/overview-210897616".

2) When the "GetProperty" function is triggered, it calls its four sub functions shown in figure 3-12.

Figure 3-13: Structured properties of the service "BLZService" in the service list page


Figure 3-14: Structured properties of the service "BLZService" in the service page

3) Firstly, the "getStructuredProperty" function tries to extract the structured information displayed in the red boxes of figure 3-13 and figure 3-14. However, several elements of the structured information have the same content, like the descriptions shown in the service page and the service list page. Hence, in order to save time in the extraction process and space in the storing process, elements with the same content are only extracted once. Moreover, a transformation from non-descriptive content to descriptive text is needed for the rating information, because its content is an image of stars. The final results of the extracted structured information for this Web service are shown in table 3-11. Because there is no descriptive information for the provider, the homepage and the owner homepage, their values are assigned as "NULL".

Service Name: BLZService
WSDL Link: http://www.thomas-bayer.com/axis2/services/BLZService?wsdl
WSDL Version: 0
Server: Apache-Coyote/1.1
Description: BLZService
Rating: Four and a half stars
Provider: NULL
Homepage: NULL
Owner Homepage: NULL
Table 3-11: Extracted structured information of the Web service "BLZService"

4) Secondly, the "getEndpointProperty" function extracts the endpoint information displayed in the red box of figure 3-15. Though there is more than one endpoint record in the list, only one of them is extracted as the endpoint information. That is because this master program is intended to extract as much information as possible, but this information should not contain redundant information. Therefore, only one record is extracted as the endpoint information, even if there is more than one endpoint record. Table 3-12 shows the final results of the endpoint information of this Web service.

Figure 3-15: Endpoint information of the Web service "BLZService" in the service page


Endpoint Name: BLZServiceSOAP12port_http
Endpoint URL: http://www.thomas-bayer.com:80/axis2/services/BLZService
Endpoint Critical: True
Endpoint Type: production
Bound Endpoint: BLZServiceSOAP12Binding
Table 3-12: Extracted endpoint information of the Web service "BLZService"

5) Then it is time to extract the monitoring information by invoking the function "getMonitoringProperty". Figure 3-16 displays two types of monitoring properties: the red box above contains the monitoring information about the Web service, and the red box below lists the monitoring information for the endpoints. As already mentioned before, only one endpoint statistic record is extracted. Besides, as can be seen from figure 3-16, there are two availability values; they both represent the availability of this Web service, just like the availability shown in figure 3-14. Therefore, only one of these availability values is sufficient. Table 3-13 shows the final results of this extraction process.

Figure 3-16: Monitoring information of the service "BLZService" in the service page

Service Availability: 100
Number of Downs: 0
Total Uptime: 1 day, 19 hours, 19 minutes
Total Downtime: 0 seconds
MTBF: 1 day, 19 hours, 19 minutes
MTTR: 0 seconds
RTT Max of Endpoint: 141 ms
RTT Min of Endpoint: 0 ms
RTT Average of Endpoint: 577 ms
Ping Count of Endpoint: 112
Table 3-13: Extracted monitoring information of the Web service "BLZService"

6) Next, in order to extract the Whois properties, the "getWhoisProperty" function first has to obtain the service domain from the WSDL link. For this Web service, the obtained service domain is "thomas-bayer.com". It then sends this service domain as input to the Whois client for the querying process, which afterwards returns a list of information about that service domain, see figure 3-17. Table 3-14 shows the extracted Whois information. (A sketch of this domain derivation is given after this demonstration.)


Figure 3-17: Whois information of the service domain "thomas-bayer.com"

Service Domain URL: thomas-bayer.com
Domain Name: Thomas Bayer
Domain Type: NULL
Domain Address: Moltkestr. 40
Domain Description: NULL
State: NULL
Postal Code: 54173
City: Bonn
Country: NULL
Country Code: DE
Phone: +4922855525760
Fax: NULL
Email: info@predic8.de
Organization: predic8 GmbH
Established Time: NULL
Table 3-14: Extracted Whois information of the service domain "thomas-bayer.com"

7) Finally, all the information of these four aspects is collected together as the service properties, and these service properties are then forwarded to the Storage component.
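The code that reduces the WSDL link to the service domain is not shown here; the following minimal Java sketch only illustrates one possible way to derive a domain such as "thomas-bayer.com" from the WSDL link used above. The helper name and the simple two-label heuristic are assumptions; multi-part suffixes like "ac.uk" would need extra handling.

    import java.net.URI;

    public class ServiceDomainSketch {
        // Reduces a WSDL link to a plain domain such as "thomas-bayer.com"
        public static String serviceDomain(String wsdlLink) throws Exception {
            String host = new URI(wsdlLink).getHost();
            if (host.startsWith("www.")) {
                host = host.substring(4);          // strip a leading "www."
            }
            String[] labels = host.split("\\.");
            if (labels.length <= 2) {
                return host;
            }
            // keep only the last two labels of the host name
            return labels[labels.length - 2] + "." + labels[labels.length - 1];
        }

        public static void main(String[] args) throws Exception {
            System.out.println(serviceDomain(
                    "http://www.thomas-bayer.com/axis2/services/BLZService?wsdl"));
        }
    }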

3.2.4 The Function of the Storage Component

The WSDL link from the WSDL Grabber component is used by the Storage component to download the WSDL document from the Web and store it on the disk thereafter. In addition, the service properties from the Property Grabber component are also directly stored on the disk in three different manners by this Storage component. Figure 3-18 illustrates the process flow of the Storage component.

When the Storage component receives the needed inputs, the mediator function "Storager" of the Storage component is triggered. After that, it transforms the service properties into three different output formats and stores them on the disk. These output formats are an XML file, an INI file and database records. Besides, it also tries to download the WSDL document from the URL address of the WSDL link and then stores the obtained WSDL document on the disk too. This "Storager" function is composed of four sub functions, namely the "getWSDL", "generateXML", "generateDatabase" and "generateINI" sub functions. Each sub function is in charge of one aspect of the storage tasks.


Figure 3-18: Overview of the process flow of the Storage component

(1) "getWSDL" sub function
The task of the "getWSDL" sub function is to download the WSDL document and then store it on the disk. Above all, it has to get the content of the WSDL document. This procedure is done as follows. First, the "getWSDL" sub function checks whether the value of the received WSDL link equals "NULL" or not. As already presented in section 3.2.2, if a Web service does not have a WSDL link, the value of its WSDL link is assigned "NULL". For a case like that, it creates a WSDL document whose name is the service name appended with the mark "No WSDL Document"; obviously, this document does not contain any content, it is an empty document. Nevertheless, if the service does have a WSDL link, this sub function tries to connect to the Internet based on the URL address of the WSDL link. Once it succeeds, all the content hosted in the Web is downloaded, stored on the disk and named only with the name of the service. Otherwise, it creates a WSDL document whose name is prefixed with "Bad" before the service name.
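Figure 3-19 contains the actual implementation; the following simplified Java sketch only restates the behaviour just described. The file naming details and the use of java.nio are assumptions for illustration.

    import java.io.InputStream;
    import java.net.URL;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    public class GetWsdlSketch {
        public static void storeWsdl(String path, String serviceName, String wsdlLink) {
            try {
                if ("NULL".equals(wsdlLink)) {
                    // No WSDL link: create an empty marker document
                    Files.createFile(Paths.get(path, serviceName + " No WSDL Document.wsdl"));
                    return;
                }
                // Download the WSDL content and store it under the service name
                try (InputStream in = new URL(wsdlLink).openStream()) {
                    Path target = Paths.get(path, serviceName + ".wsdl");
                    Files.copy(in, target);
                }
            } catch (Exception e) {
                // Download failed: store an empty document prefixed with "Bad"
                try {
                    Files.createFile(Paths.get(path, "Bad" + serviceName + ".wsdl"));
                } catch (Exception ignored) { }
            }
        }
    }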

(2) "generateXML" sub function
The "generateXML" sub function takes the service properties as input, transforms them into an XML file and stores it on the disk, with the file name being the service name plus ".xml". XML stands for eXtensible Markup Language, a markup language designed to transport and store data. The first line of an XML file is the XML declaration, which defines the XML version and the encoding used. For example, <?xml version="1.0" encoding="UTF-8"?> means this is an XML file whose version is 1.0 and whose encoding is the 8-bit Unicode Transformation Format character set. Besides, an XML file contains XML elements, each reaching from the element's start tag to the element's end tag. Moreover, an XML element can also contain other elements, simple text, or a mixture of both. However, an XML file must contain a root element as the parent of all other elements.
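The element names used by the actual "generateXML" implementation (figure 3-20) are not given in the text; the following Java sketch therefore uses hypothetical element names and merely illustrates how the collected name-value properties of one service could be written into such an XML file (escaping of special characters is omitted).

    import java.io.FileWriter;
    import java.util.Map;

    public class GenerateXmlSketch {
        public static void generateXml(String path, String serviceName,
                                       Map<String, String> properties) throws Exception {
            StringBuilder xml = new StringBuilder("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
            xml.append("<service>\n");                    // root element
            for (Map.Entry<String, String> p : properties.entrySet()) {
                xml.append("  <property name=\"").append(p.getKey()).append("\">")
                   .append(p.getValue()).append("</property>\n");
            }
            xml.append("</service>\n");
            try (FileWriter out = new FileWriter(path + "/" + serviceName + ".xml")) {
                out.write(xml.toString());
            }
        }
    }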

(3) "generateINI" sub function
The "generateINI" sub function also takes the service's properties as input, but it transforms them into an INI file and then stores it on the disk, with the file name being the service's name plus ".ini". "ini" stands for initialization. The INI file format is a de facto standard for configuration files; INI files are simple text files with a basic structure. Generally speaking, an INI file contains three different kinds of parts: sections, parameters and comments. The parameter is the basic element contained in an INI file. Its format is a key-value pair, also called a name-value pair. This pair is delimited by an equals sign "=", and the key or name always appears to the left of the equals sign. A section is more like a room that groups all its parameters together. It always appears on a single line within a pair of square brackets "[ ]", and sections may not be nested. A comment is descriptive text which begins with a semicolon ";". Anything between the semicolon and the end of the line is ignored.
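As an illustration of the format just described, a generated INI file for the example service could look roughly like the following fragment; the section and key names are hypothetical, since the actual file is only shown in figure 3-21.

    ; properties of the service BLZService (illustrative fragment)
    [Structured Information]
    Service Name=BLZService
    Rating=Four and a half stars

    [Endpoint Information]
    Endpoint Type=production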

(4) "generateDatabase" sub function
The inputs of the "generateDatabase" sub function are the same as those of the previous two sub functions. Instead of transforming them into a file like XML or INI, the "generateDatabase" sub function turns them into data in the database by using SQL statements. SQL stands for Structured Query Language, a database language designed for accessing and manipulating data in a database. Most of the actions performed on a database are done with SQL statements, and the primary statements of SQL include insert into, delete, update, select, create, alter and drop. Therefore, for the purpose of transforming these service properties into data in the database, this sub function has to create a database first, using the "create database" statement. Then it creates a table to store the data. A table is a collection of related data entries and consists of columns and rows. Since the data for all these five Web Service Registries are not very large, one database table is enough for storing these service properties. Because of that, the field names of the service properties in the columns have to be uniform and well-defined for all five Web Service Registries. Afterwards, the service properties of each single service can be put into the table as a record with the "insert into" statement of SQL.
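Figures 3-22 and 3-23 contain the actual code; the following Java/JDBC sketch only illustrates the sequence of SQL statements described above. The MySQL connection URL, the credentials and the database, table and column names are assumptions.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.Statement;

    public class GenerateDatabaseSketch {
        public static void main(String[] args) throws Exception {
            Connection con = DriverManager.getConnection(
                    "jdbc:mysql://localhost:3306/", "user", "password");
            try (Statement st = con.createStatement()) {
                st.executeUpdate("CREATE DATABASE IF NOT EXISTS dwsc");
                st.execute("USE dwsc");
                // all property columns are declared as TEXT, as described above
                st.executeUpdate("CREATE TABLE IF NOT EXISTS service_properties ("
                        + "service_name TEXT, wsdl_link TEXT, description TEXT, rating TEXT)");
            }
            // one record per crawled service
            try (PreparedStatement ps = con.prepareStatement(
                    "INSERT INTO service_properties VALUES (?, ?, ?, ?)")) {
                ps.setString(1, "BLZService");
                ps.setString(2, "http://www.thomas-bayer.com/axis2/services/BLZService?wsdl");
                ps.setString(3, "BLZService");
                ps.setString(4, "Four and a half stars");
                ps.executeUpdate();
            }
            con.close();
        }
    }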

3.2.4.1 Features of the Storage Component

The Storage component has to provide the following features:

- Generate different output formats
The final result of this master program is to store the information about the services on the disk for future work. The Storage component provides three different formats for storing the service properties of the services published in the Web Service Registries. This makes the storage of the services very flexible and also durable.

- Obtain the WSDL document
The important task of this component is to check whether the WSDL document can be obtained from the WSDL link, because the WSDL document plays a decisive role in determining the quality of the service. The Storage component provides the ability to deal with the different situations that can occur in the process of obtaining the WSDL document.


3.2.4.2 Input of the Storage Component

This component requires the following input data:
- The WSDL link of each service
- The property information of each service

3.2.4.3 Output of the Storage Component

The component produces the following output data:
- The WSDL document of the service
- An XML document, an INI file and table records in the database

3.2.4.4 Demonstration of the Storage Component

The following figures describe the fundamental implementation code of this Storage component. The detailed depiction is given below.

1) As can be seen from figure 3-18 to figure 3-20, there are several common parts in the implementation code. The first common part concerns the parameters that are defined in each of these sub functions, namely "path" and "SecurityInt". The parameter "path" is the absolute path on the computer disk; it is used for the storing procedure and has already been specified by the user at the beginning of the whole program. The parameter "SecurityInt" is an increasing integer which is used as a part of the name of the service; the reason for this is that it prevents services that have the same name from overriding each other on the disk. The content of the red marks in the code of these figures is the second common part; its function is to create a file or document for the service with the corresponding parameters.

2) Figure 3-19 displays the implementation code of the "getWSDL" sub function, which is designed to get the WSDL document of a service based on its WSDL link. The parameter "name" is the name of the service, which is used as the name of the WSDL document. The parameter "linkStr" is the actual WSDL link, which is the most important one for this sub function. The other two parameters, "statistic" and "log", are the objects of the text files called "Statistic Information" and "Log Information" respectively. The "Statistic Information" text file is used to record the statistical data of the services in each Web Service Registry, such as the overall number of properties for that Web Service Registry, the overall number of services, the number of services that have no WSDL link, the number of services that contain no contents, the number of services whose WSDL link is not available, etc. The "Log Information" text file records the results of the process steps and the problems encountered, for example which service is being crawled now, which Web Service Registry it belongs to, the reason why the WSDL document for this service cannot be obtained, the reason why this service is unreachable, and so on.


Figure 3-19: Implementation code for getting the WSDL document

3) Figure 3-20 and figure 3-21 show the code for turning the service properties into the XML file and the INI file, and for storing these two files on the disk thereafter. The parameter "vec" is a Vector of the "PropertyStruct" data type. This "PropertyStruct" data type is an object of a class consisting of two variables: name and value.

Figure 3-20: Implementation code for generating the XML file


Figure 3-21: Implementation code for generating the INI file

4) The code in figure 3-22 and figure 3-23 shows the process that turns the service properties of all services in these five Web Service Registries into records of the database. Therefore, a database has to be created first. The name of the database can be arbitrary as long as it conforms to the naming rules of the database; the same holds for the name of the table. Figure 3-22 displays the procedure of creating a table in the database. Because it is hard to decide on the length of each service property, the data types of all service property columns are set to "Text". Figure 3-23 shows the code for inserting the service properties into the table with the "update" statement.

Figure 3-22: Implementation code for creating a table in the database


Figure3-23 Implementation code for generating table records
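A minimal JDBC sketch of the two steps (creating a table with "Text" columns and inserting one service record) could look as follows. The connection URL, table name, column names and values are placeholders, not the ones used in the thesis implementation, and a reachable database is assumed.

    // Sketch only: create the table and insert one service record via JDBC.
    import java.sql.*;

    public class DatabaseStorageSketch {
        public static void main(String[] args) throws SQLException {
            // Placeholder connection URL; the real database name can be chosen arbitrarily.
            try (Connection con = DriverManager.getConnection("jdbc:mysql://localhost/services", "user", "pwd");
                 Statement st = con.createStatement()) {

                // All service property columns use TEXT because their length is hard to predict.
                st.executeUpdate("CREATE TABLE IF NOT EXISTS service ("
                        + "id INT AUTO_INCREMENT PRIMARY KEY, "
                        + "Name TEXT, Provider TEXT, WSDL TEXT)");

                // Insert the properties of one Web service as a single record.
                try (PreparedStatement ps = con.prepareStatement(
                        "INSERT INTO service (Name, Provider, WSDL) VALUES (?, ?, ?)")) {
                    ps.setString(1, "BLZService");
                    ps.setString(2, "example provider");
                    ps.setString(3, "http://example.org/BLZService?wsdl");
                    ps.executeUpdate();
                }
            }
        }
    }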

33 Multithreaded Programming for DWSC

Multithreaded programming is a built-in feature of the Java language. A multithreaded program contains two or more separate parts that can execute concurrently; each such part is called a thread. Multithreading makes it possible to write programs that use the system resources efficiently, for example by keeping the idle time of the CPU to a minimum.

In this master program five Web Service Registries need to be crawled for the services published in them. Moreover, the number of services published in each Web Service Registry differs considerably, so the running time spent on each Web Service Registry differs as well. With a sequential design, a Web Service Registry that owns fewer services would have to wait until another Web Service Registry with many more services has finished. Therefore, in order to reduce this waiting time and to maximize the use of the system resources, multithreaded programming is applied to this master program: the program creates one thread for each Web Service Registry, and these threads are executed independently.
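As an illustration of this design, the following minimal Java sketch starts one crawler thread per registry and waits for all of them to finish. The class and method names are chosen for this example only and are not taken from the thesis implementation.

    // Sketch only: one crawler thread per Web Service Registry.
    public class RegistryCrawlerDemo {

        // Hypothetical worker that crawls a single Web Service Registry.
        static class RegistryCrawler implements Runnable {
            private final String registryName;

            RegistryCrawler(String registryName) {
                this.registryName = registryName;
            }

            @Override
            public void run() {
                System.out.println("Crawling " + registryName + " ...");
                // here: fetch service list pages, service pages, WSDL documents, ...
            }
        }

        public static void main(String[] args) throws InterruptedException {
            String[] registries = {"Service Repository", "Ebi", "Xmethods", "Seekda", "Biocatalogue"};
            Thread[] threads = new Thread[registries.length];

            // Start one independent thread per Web Service Registry.
            for (int i = 0; i < registries.length; i++) {
                threads[i] = new Thread(new RegistryCrawler(registries[i]));
                threads[i].start();
            }
            // Wait until all registries have been processed.
            for (Thread t : threads) {
                t.join();
            }
        }
    }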

34 Sleep Time Configuration for Web Service Registries

Since this master program downloads the WSDL documents and extracts the service information of the Web services published in a Web Service Registry, it inevitably affects the performance of that Web Service Registry. In addition, in order not to exceed their throughput capacity, these Web Service Registries restrict the rate at which they can be accessed. As a consequence, unknown errors may occur while this master program is running: for instance, the program may halt at one point without retrieving any further WSDL documents and service information, the WSDL documents of some services cannot be obtained, or some service information is missing. Therefore, in order to obtain as many of the Web services published in these five Web Service Registries as possible without affecting their throughput, the access rate has to be configured for each service of all Web Service Registries.

Consequently, before entering the essential procedure for each single service of these Web Service Registries, this master program calls the built-in function "sleep (long milliseconds)". It is a public static method that causes the currently executing thread to sleep for the specified number of milliseconds, i.e. to temporarily cease execution for a while. The following table shows the time interval of the sleep call for each Web Service Registry; a minimal usage sketch is given after the table.

Web Service Registry Name | Time Interval (milliseconds)
Service Repository | 8000
Ebi | 3000
Xmethods | 10000
Seekda | 20000
Biocatalogue | 10000

Table 3-15 Sleep Time of these five Web Service Registries
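The following minimal Java sketch shows how such a per-registry sleep interval can be applied before each request. The class and method names are illustrative only; the intervals are the ones from table 3-15.

    // Sketch only: throttle requests with the per-registry intervals of table 3-15.
    import java.util.HashMap;
    import java.util.Map;

    public class SleepConfigDemo {
        private static final Map<String, Long> SLEEP_TIME = new HashMap<>();
        static {
            SLEEP_TIME.put("Service Repository", 8000L);
            SLEEP_TIME.put("Ebi", 3000L);
            SLEEP_TIME.put("Xmethods", 10000L);
            SLEEP_TIME.put("Seekda", 20000L);
            SLEEP_TIME.put("Biocatalogue", 10000L);
        }

        // Called before the essential procedure for each single service.
        static void throttle(String registryName) throws InterruptedException {
            Thread.sleep(SLEEP_TIME.getOrDefault(registryName, 5000L));
        }

        public static void main(String[] args) throws InterruptedException {
            throttle("Seekda"); // waits 20 seconds before the next request
            System.out.println("continue crawling ...");
        }
    }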


4 Experimental Results and Analysis

This chapter presents the quantitative experimental results of the prototype described in chapter 3, together with an analysis and explanation of these results. In order to obtain reasonably accurate results, the experiments were carried out more than five times; all data displayed in the following tables and charts are the averages of these runs.

41 Statistic Information for Different Web Service Registries

This section discusses the service amount statistics of the Web services published in these five Web Service Registries. It covers the overall number of Web services published in each Web Service Registry as well as the number of unavailable Web services, i.e. services that have been archived because they may no longer be active or are close to becoming inactive. Table 4-1 shows the service amount statistics of these five Web Service Registries.

Web Service Registry Name | Service Repository | Ebi | Xmethods | Seekda | Biocatalogue
Overall Services | 57 | 289 | 382 | 853 | 2567
Unavailable Services | 0 | 0 | 0 | 0 | 125

Table 4-1 Service amount statistic of these five Web Service Registries

In order to give an intuitive view of the service amount statistics in these five Web Service Registries, figure 4-1 shows a bar chart derived from table 4-1. As can be seen from the bar chart, on the one hand the overall number of Web services increases from the Service Repository Web Service Registry to the Biocatalogue Web Service Registry; the Biocatalogue Web Service Registry owns the largest number of Web services. This indicates that the Biocatalogue Web Service Registry has a much stronger ability to provide Web services to users, because it contains far more services than the other four Web Service Registries. On the other hand, no unavailable services exist in any Web Service Registry except for the Biocatalogue Web Service Registry; that is to say, only the Biocatalogue Web Service Registry contains Web services that cannot be used by the users. To some degree such services are useless, since they can no longer be used and merely waste network resources on the Web. Therefore, all these unavailable services should be eliminated in order to reduce the waste of network resources.


Figure4-1 Service amount statistic of these five Web Service Registries

42 Statistic Information for WSDL Document

Web Service Registry Name | Service Repository | Ebi | Xmethods | Seekda | Biocatalogue
Failed WSDL Links | 1 | 0 | 23 | 145 | 32
Without WSDL Links | 0 | 0 | 0 | 0 | 16
Empty Content | 0 | 0 | 2 | 0 | 2

Table 4-2 Statistic information for WSDL Document

Table 4-2 and figure 4-2 show the error statistics for the WSDL documents of the Web services in these five Web Service Registries. There are three aspects. The first one, "Failed WSDL Links", is the overall number of Web services whose WSDL links are invalid; in other words, it is impossible to get the WSDL documents of these Web services via the URL addresses of their WSDL links, and therefore no WSDL document is created for them. The second aspect, "Without WSDL Links", is the overall number of Web services in each Web Service Registry that have no WSDL link at all, so no actual WSDL document exists for such Web services and the value of their WSDL link is "NULL". A WSDL document file is nevertheless created, but it has no content and its name contains the string "[No WSDL Document]". The third aspect, "Empty Content", represents the overall number of Web services whose WSDL links


are valid but whose WSDL documents contain no content. In this case a WSDL document whose name contains the string "(BAD)" is created.

Figure4-2 Statistic information for WSDL Document

43 Comparison of Different Average Number of Service Properties

This section compares the average number of service properties in these five Web Service Registries. The average number of service properties is calculated by means of the following equation:

ASP = ONSP / ONS (1)

Where

ASP is the average number of service properties for one Web Service Registry

ONSP is the overall number of service properties in one Web Service Registry

ONS is the overall number of Web services that have already been crawled from the

corresponding Web Service Registry
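For illustration with hypothetical numbers: a Web Service Registry from which 100 services have been crawled and for which 1700 service properties have been extracted in total would yield ASP = 1700 / 100 = 17 properties per service.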

Figure 4-3 shows the average number of service properties per Web service in these five Web Service Registries. As mentioned before, one measure of the quality of the Web services in a Web Service Registry is the amount of service information: the more information is available about a Web service, the better users know that service, and consequently the better quality the corresponding Web Service Registry can offer to its users. As can be seen from figure 4-3, the Service Repository and Biocatalogue Web Service Registries own a larger number of service properties than the other three Web Service Registries. This directly reflects that these two Web Service Registries provide more detailed information about the Web services published in them, so users can more easily choose the services they need and are more likely to use the Web services published in these two Web Service Registries. By contrast, the Xmethods and Seekda


Web Service Registries, which provide less service information about their Web services, offer a lower quality for these Web services. Therefore users may be less inclined to use the Web services provided in these two Web Service Registries, not to mention the Web services published in the Ebi Web Service Registry.

Figure4-3 Average Number of Service Properties

Based on the description presented in section 323, the causes for the different numbers of service properties in these Web Service Registries may consist of the following points. First, the amount of structured information for the Web services differs between these five Web Service Registries; part of the information of some Web services in a Web Service Registry may even be missing or have an empty value. For example, the number of structured information items that can be extracted for the Web services in the Ebi Web Service Registry differs greatly from that in the Biocatalogue Web Service Registry. Secondly, there is a certain amount of endpoint information for almost all Web Service Registries except the Ebi Web Service Registry, which more or less reduces its overall number of service properties. Thirdly, some Web Service Registries, such as Xmethods and Ebi, have no monitoring information, whereas in particular the Service Repository Web Service Registry has a large amount of monitoring information about its Web services that can be extracted from the Web. Obviously, the last point is the amount of whois information for these Web services: if the database of the whois client does not contain information about the service domain of a Web service in a Web Service Registry, then no whois information can be extracted. Moreover, even if information about the service domain exists, the amount of information can be very diverse. Therefore, if many service domains of the Web services in a registry have no or only little whois information, the average number of service properties in that registry will decrease greatly.

As a result, in order to help users become better acquainted with the Web services provided in a Web Service Registry and to distinguish these Web services, the Web Service Registry should do its best to offer as much information as possible for each of its published Web services.


44 Different Outputs of Web Services

The fundamental task of this master program is to obtain the WSDL documents of the Web services hosted in the Web Service Registries, to extract and gather the properties of these Web services, and thereafter to store them on disk. Therefore this section describes the different outputs of this master program, which include the WSDL documents of the Web services, the generated XML and INI files, and the data records of these service properties in the database.

Figure4-4 WSDL Document format of one Web service

The WSDL document of a Web service is simply read from the Web according to the URL address of its WSDL link, and these data are then stored on disk as the WSDL document. The name of the WSDL document is the service name plus the ending ".wsdl", which marks it as a WSDL document. In order to distinguish WSDL documents whose names would be the same although their contents differ, the name of each obtained WSDL document in a Web Service Registry additionally contains a unique Integer in front of the name. Figure 4-4 shows the valid WSDL document format of a Web service; its name is "1BLZService.wsdl".

Besides, the obtained service properties are transformed into an XML file, an INI file and data records in the database. Figure 4-5, figure 4-6 and figure 4-7 show these three different output formats respectively. As can be seen from figure 4-5, this is the INI file of the Web service and its name is "1BLZService.ini"; the Integer is the same as in the WSDL document because both belong to the same Web service. The first three lines in that INI file are service comments, which start with a semicolon and run to the end of the line; they are basic information describing this INI file. The following line is the section, enclosed in a pair of brackets. It is important because it indicates that the lines behind it contain the information of this Web service. The remaining lines are the actual service information, each given as a key-value pair separated by an equals sign, with one service property per line (an illustrative example is sketched after figure 4-5).


Figure4-5 INI File format of one Web service
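The original screenshot of figure 4-5 is not reproduced here. Purely as an illustration of the described layout, a shortened, hypothetical INI file might look as follows; the property names and values are invented and are not taken from the actual figure.

    ; 1BLZService.ini
    ; service information collected by the Deep Web Service Crawler
    ; registry: Service Repository
    [service]
    Name=BLZService
    Provider=example provider
    WSDL=http://example.org/BLZService?wsdl
    Availability=100%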

Figure4-6 XML File format of one Web service

Figure4-7 Database data format for all Web services


Furthermore, figure 4-6 shows the generated XML file of the Web service "BLZService"; its name is "1BLZService.xml". Needless to say, this XML file is part of the materials belonging to the same Web service. Although the format of the XML file differs from that of the INI file, their essential contents are the same, that is, the values of the service properties do not differ, because both files are generated from the same collection of properties of that Web service. The XML file also contains some comments, like those in the INI file, which are enclosed between "<!--" and "-->". The section in the INI file corresponds roughly to the root element in the XML file; therefore all values of the elements below the root "service" in this XML file are the values of the service properties of this Web service.

Finally, figure 4-7 shows the database table that is used to store the service information of all Web services in these five Web Service Registries. The entire service information of one Web service forms exactly one record in this table. Because of that, the column names of the table are the union of the names of the service information in each Web Service Registry. However, since the column names of a table must be unique, the redundant names in this union have to be eliminated; this is sensible and possible because the names of the service information are well-defined and uniform across all five Web Service Registries. In addition, the first column of the table is the primary key, an increasing Integer whose function is similar to the Integer contained in the names of the XML and INI files, while the remaining columns hold the corresponding service properties of the Web services. The value "NULL" or "Null" in this table indicates that the respective property of that Web service is empty or missing.

45 Comparison of Average Time Cost for Different Parts of Single Web Service

This section compares the average time cost of the different parts of getting one single Web service in all these five Web Service Registries. First, the average time cost of getting one single service in a Web Service Registry has to be calculated. It is obtained through the following equation:

ATC = OTS / ONS (2)

Where

ATC is the average time cost for one single Web service

OTS is the overall time cost of all the Web services in one Web Service Registry

ONS is the overall number of Web services that have already been crawled from the

corresponding Web Service Registry

In addition, the average time cost for getting one single service consists of the following six parts: the average time cost for extracting the service properties, the average time cost for obtaining the WSDL document, the average time cost for generating the XML file, the average time cost for generating the INI file, the average time cost for inserting the service properties into the database table, and the average time cost for some other procedures, such as getting the service list page link, getting the service page link, and so on. The average time cost for extracting the service properties is obtained by means of the following equation:


ATCSI = OTSSI / ONS (3)

Where

ATCSI is the average time cost for extracting service property of one single Web service

OTSSI is the overall time cost for extracting service property of all the Web services in one Web

Service Registry

ONS is the overall number of Web services that have already been crawled from the

corresponding Web Service Registry

The calculation of the other parts is analogous to the equation for the average time cost for extracting the service properties, while the average time cost for the other procedures equals the average time cost for one single Web service minus the sum of the average time costs of the other five parts; a short worked example based on table 4-3 is given below.

Web Service Registry Name | Service property | WSDL Document | XML File | INI File | Database | Others | Overall
Service Repository | 8801 | 918 | 2 | 1 | 53 | 267 | 10042
Ebi | 699 | 82 | 2 | 1 | 28 | 11 | 823
Xmethods | 5801 | 1168 | 2 | 1 | 45 | 12 | 7029
Seekda | 5186 | 1013 | 2 | 1 | 41 | 23 | 6266
Biocatalogue | 39533 | 762 | 2 | 1 | 66 | 1636 | 42000

Table 4-3 Average time cost information for all Web Service Registries (all values in milliseconds)

Table 4-3 displays the average time cost of one Web service and of its different parts in all these five Web Service Registries. The first column of table 4-3 holds the names of the five Web Service Registries and the last column the average time cost for a single service in that Web Service Registry, while the remaining columns contain the average time costs of the six different parts. In order to give an intuitive view of the data in table 4-3, the data in each column are illustrated by the corresponding figures 4-8 to 4-13.
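As a consistency check of this decomposition, consider the Ebi row of table 4-3: the time for the other procedures is 823 - (699 + 82 + 2 + 1 + 28) = 11 milliseconds, which is exactly the value in the "Others" column.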

Figure4-8 Average time cost for extracting service property in all Web Service Registries


As can be seen from figure 4-8, the average time cost for extracting the service properties in the Biocatalogue Web Service Registry is 39533 milliseconds. This value is much larger than in the other four Web Service Registries, which take 8801, 699, 5801 and 5186 milliseconds for Service Repository, Ebi, Xmethods and Seekda respectively. That is to say, it takes much longer to extract the service properties of the Web services published by the Biocatalogue Web Service Registry. This also indirectly indicates that the Biocatalogue Web Service Registry has the highest average number of service properties, which has already been discussed in section 43; conversely, the average number of service properties in the Ebi Web Service Registry is the smallest. Moreover, the average time cost in the Xmethods Web Service Registry is larger than in the Seekda Web Service Registry, although, as already known, the average number of service properties is the same for these two Web Service Registries. One cause that may explain why the average time in Xmethods is higher than in Seekda is that the extraction of the service properties in the Xmethods Web Service Registry has to work on both the service page and the service list page, whereas for the Seekda Web Service Registry only the service page link is needed.

The average time cost for obtaining the WSDL document in all these five Web Service Registries is displayed in figure 4-9. This average time cost is the sum of the average time for extracting the WSDL link of a Web service and the average time for reading the data of the WSDL document from the Web and then storing it on disk. As seen from this figure, the average time cost for obtaining the WSDL document is largest in the Xmethods Web Service Registry, namely 1168 milliseconds. Although the process of extracting the WSDL link costs a certain amount of time, it does not have a significant influence on the total average time spent for obtaining the WSDL document, because the WSDL link of a Web service is almost always obtained in one step. This implies that the average size of the WSDL documents in the Xmethods Web Service Registry is larger than in the other four Web Service Registries, in particular the Ebi Web Service Registry, which takes just 82 milliseconds for obtaining the WSDL document.

Figure4-9 Average time cost for obtaining WSDL document in all Web Service Registries


Figure 4-10, figure 4-11 and figure 4-12 show the average time cost of generating these three different outputs in all these five Web Service Registries. As can be seen from figures 4-10 and 4-11, the average time for generating the XML file of one Web service is the same for all five Web Service Registries, namely only 2 milliseconds, and the average time for generating the INI file is likewise the same, just 1 millisecond. Even the sum of these two average time costs is so small that it can be neglected compared to the overall average time cost of getting one Web service in each Web Service Registry, shown in figure 4-13. This implies that the generation of the XML and INI files finishes almost immediately after receiving the service properties of a Web service as input. Furthermore, figure 4-12 shows that although the average time cost of creating a database record for each Web service is larger than the time for generating the XML and INI files in all five Web Service Registries, the operation of creating a database record is still fast.

Figure4-10 Average time cost for generating XML file in all Web Service Registries

Figure4-11 Average time cost for generating INI file in all Web Service Registries


Figure4-12 Average time cost for creating database record in all Web Service Registries

Figure4-13 Average time cost for getting one Web service in all Web Service Registries

Figure 4-13 gives the average time cost for getting one single Web service in all these five Web Service Registries. Without any doubt the Biocatalogue Web Service Registry takes the longest time for this process. This is because, as the presentation of the five different parts above shows, each part needs more time to finish in the Biocatalogue Web Service Registry, except for the process of obtaining the WSDL document, where the Biocatalogue Web Service Registry does not have the longest average time. Moreover, there is a striking observation when looking at figures 4-8, 4-12 and 4-13: the shapes of these curves follow almost the same trend. This further indicates that a Web Service Registry that spends more time on getting the description information of a Web service also offers more information about that Web service.


5 Conclusion and Further Direction

This master thesis provides a scheme that aims to explore the description information of the Web services hosted in different Web Service Registries. The description information of a Web service consists of its WSDL document and the service information about that Web service. Although there is an existing approach, the Pica-Pica Web Service Description Crawler, that can be used to obtain the WSDL document and service information of the Web services in these Web Service Registries, its functionality is restricted: it explores only a small subset of the Web services hosted in these Web Service Registries, the storage of these Web services is not flexible, and, most importantly, only a small part of the service information of a Web service is extracted, while for some Web Service Registries no service information is extracted at all. The work presented in this master thesis, however, is able to explore all Web services in these Web Service Registries. Moreover, with the approach presented in this master thesis as much service information of a Web service as possible is extracted, so that the final result is the largest annotated service catalogue ever produced. Furthermore, regarding the storage of the description information of the Web services, this master thesis provides three different ways that guarantee not only the completeness but also the longevity of the description information of the Web services.

However, in the implementation performed for this master thesis the whois client used for querying the information of a service domain returns a free text if the information exists, and sometimes this free text differs completely from one domain to another. As a consequence, each Web service in all Web Service Registries had to be crawled at least once in the experiment stage, so that all variants of this free text could be foreseen and processed afterwards. This is a huge effort, because there are lots of Web services in these Web Service Registries. Therefore, in order to simplify this work, another whois client that eases this task needs to be found and used.

Moreover, in the experiment stage of this master thesis the time cost for getting a Web service is still large. In order to reduce this time, multithreaded programming could also be applied to some parts of the process of getting one Web service.

Although the work performed here is specialized to only these five Web Service Registries, the main principles used here are adaptable to other Web Service Registries with only small changes in the implementation code or the structure.


6 Bibliography

[1] Nathalie Steinmetz, Holger Lausen, Dario Cerizza, Andrea Turati, Irene Celino, Adam Funk and Michael Erdmann, "D11 – Requirement Analysis and Architectural Plan", available from httpwwwservice-findereudeliverablespublic7-public32-d11, Emanuele Della Valle (CEFRIEL), June 27, 2008

[2] Nathalie Steinmetz, Holger Lausen, Dario Cerizza, Andrea Turati, Irene Celino, Adam Funk and Michael Erdmann, "D12 – First Design of Service-Finder as a Whole", available from httpwwwservice-findereudeliverablespublic7-public37-d12-first-design-of-service-finder-as-a-whole, Emanuele Della Valle (CEFRIEL), July 1, 2008

[3] Nathalie Steinmetz, Holger Lausen, Irene Celino, Dario Cerizza, Saartje Brockmans, Adam Funk, "D13 – Revised Requirement Analysis and Architectural Plan", available from httpwwwservice-findereudeliverablespublic7-public57-d13-revised-requirement-analysis-and-architectural-plan, Emanuele Della Valle (CEFRIEL), April 1, 2009

[4] Chia-Hui Chang, Mohammed Kayed, Moheb Ramzy Girgis, Khaled Shaalan, "A Survey of Web Information Extraction Systems", Volume 18, Issue 10, IEEE Computer Society, pp. 1411-1428, October 2006

[5] Leonard Richardson, "Beautiful Soup Documentation", available from httpwwwcrummycomsoftwareBeautifulSoupdocumentationhtml, October 13, 2008

[6] Hao He, Hugo Haas, David Orchard, "Web Services Architecture Usage Scenarios", available from httpwwww3orgTRws-arch-scenarios, February 11, 2004

[7] Stephen Soderland, "Learning Information Extraction Rules for Semi-Structured and Free Text", Volume 34, Issue 1-3, Journal of Machine Learning, Department of Computer Science and Engineering, University of Washington, Seattle, pp. 233-272, February 1999

[8] Ian Hickson, "A Vocabulary and Associated APIs for HTML and XHTML", World Wide Web Consortium, Working Draft WD-html5-20100624, January 22, 2008

[9] Holger Lausen, Jos de Bruijn, Axel Polleres, Dieter Fensel, "The Web Service Modeling Language WSML", WSML Deliverable D161v02, March 20, 2005, available from httpwwwwsmoorgTRd16d161v02

[10] Dumitru Roman, Holger Lausen, Uwe Keller, "Web Service Modeling Ontology - Standard (WSMO-Standard)", WSMO deliverable D2, version 11, 06 March 2004, available from httpwwwwsmoorgTRd2v11

[11] Iris Braum, Anja Strunk, Gergana Stoyanova, Bastian Buder, "ConQo – A Context- And QoS-Aware Service Discovery", TU Dresden, Department of Computer Science, in Proceedings of WWW/Internet 2008


7 Appendixes

There are additional outputs of this master program, namely the log information file and the statistic report file. Figure 8-1 shows one of the basic output formats of the log information for these five Web Service Registries.

Figure8-1 Log information of the "Service Repository" Web Service Registry

Figure8-2 Statistic information of the "Service Repository" Web Service Registry


Figure 8-2 to figure 8-6 show the output format of the statistic information for the "Service Repository", "Ebi", "Xmethods", "Seekda" and "Biocatalogue" Web Service Registries respectively.

Figure8-3 Statistic information of the "Ebi" Web Service Registry

Figure8-4 Statistic information of the "Xmethods" Web Service Registry


Figure8-5 Statistic information of the "Seekda" Web Service Registry

Figure8-6 Statistic information of the "Biocatalogue" Web Service Registry


Table of Figures

Figure2-1 Dataflow of Service-Finder and Its Components 12
Figure2-2 Left is the free text input type and right is its output 16
Figure2-3 A Semi-structured page containing data records (in rectangular box) to be extracted 16
Figure2-4 Architecture of the Pica-Pica Web Service Description Crawler 20
Figure3-1 Overview of the Basic Architecture for the Deep Web Services crawler 25
Figure3-2 Overview the process flow of the Web Service Extractor Component 27
Figure3-3 Service list page of the Service-Repository 29
Figure3-4 Original source code of the internal link for Web service "BLZService" 29
Figure3-5 Code Overview of getting service page link in Service Repository 29
Figure3-6 Service page of the Web service "BLZService" 29
Figure3-7 Overview the process flow of the WSDL Grabber Component 30
Figure3-8 WSDL link of the Web service "BLZService" in the service page 31
Figure3-9 Original source code of the WSDL link for Web service "BLZService" 32
Figure3-10 Code overview of "getServiceRepositoryWSDLLink" function 32
Figure3-11 Code overview of "oneParameter" function 32
Figure3-12 Overview the process flow of the Property Grabber Component 33
Figure3-13 Structure properties of the Service "BLZService" in service list page 37
Figure3-14 Structure properties of the Service "BLZService" in service page 38
Figure3-15 Endpoint information of the Web service "BLZService" in service page 38
Figure3-16 Monitoring Information of the Service "BLZService" in service page 39
Figure3-17 Whois Information of the service domain "thomas-bayer.com" 40
Figure3-18 Overview the process flow of the Storage Component 41
Figure3-19 Implementation code for getting WSDL document 44
Figure3-20 Implementation code for generating XML file 44
Figure3-21 Implementation code for generating INI file 45
Figure3-22 Implementation code for creating table in database 45
Figure3-23 Implementation code for generating table records 46
Figure4-1 Service amount statistic of these five Web Service Registries 49
Figure4-2 Statistic information for WSDL Document 50
Figure4-3 Average Number of Service Properties 51
Figure4-4 WSDL Document format of one Web service 52
Figure4-5 INI File format of one Web service 53
Figure4-6 XML File format of one Web service 53
Figure4-7 Database data format for all Web services 53
Figure4-8 Average time cost for extracting service property in all Web Service Registries 55
Figure4-9 Average time cost for obtaining WSDL document in all Web Service Registries 56
Figure4-10 Average time cost for generating XML file in all Web Service Registries 57
Figure4-11 Average time cost for generating INI file in all Web Service Registries 57
Figure4-12 Average time cost for creating database record in all Web Service Registries 58
Figure4-13 Average time cost for getting one Web service in all Web Service Registries 58


Table of Tables

Table 3-1 Structured Information of Service-Repository Web Service Registry 34
Table 3-2 Structured Information of Xmethods Web Service Registry 34
Table 3-3 Structured Information of Seekda Web Service Registry 34
Table 3-4 Structured Information of Ebi Web Service Registry 34
Table 3-5 Structured Information of Biocatalogue Web Service Registry 34
Table 3-6 SOAP Operation Information of Biocatalogue Web Service Registry 35
Table 3-7 REST Operation Information of Biocatalogue Web Service Registry 35
Table 3-8 Endpoint Information of these five Web Service Registries 35
Table 3-9 Monitoring Information of these five Web Service Registries 35
Table 3-10 Whois Information for these five Web Service Registries 36
Table 3-11 Extracted Structured Information of Web Service "BLZService" 38
Table 3-12 Extracted Endpoint Information of the Web service "BLZService" 39
Table 3-13 Extracted Monitoring Information of the Web service "BLZService" 39
Table 3-14 Extracted Whois Information of service domain "thomas-bayer.com" 40
Table 3-15 Sleep Time of these five Web Service Registries 47
Table 4-1 Service amount statistic of these five Web Service Registries 48
Table 4-2 Statistic information for WSDL Document 49
Table 4-3 Average time cost information for all Web Service Registries 55


Table of Abbreviations

DTD Document Type Definition

DWSC Deep Web Service Crawler

HTML HyperText Markup Language

MTBF Mean Time between Failures

MTTR Mean Time to Recovery

QoS Quality of Service

REST Representational State Transfer

RTT Round Trip Time

SOAP Simple Object Access Protocol

SQL Structured Query Language

URL Uniform Resource Locator

WHATWG Web Hypertext Application Technology Working Group

XML eXtensible Markup Language

WSDL Web Service Description Language

WSML Web Service Modeling Language

WSMO Web Service Modeling Ontology

Page 3: Deep Web Service Crawler

Deep Web Service Crawler

3

Abstract

Nowadays Web Service Registries offer convenient access to offering searching and using

electronic Web services Usually they host Web service descriptions along with related metadata

generated by both the system and the users Hence monitoring and rating information can help

users to distinguish similar Web service offerings However at present there is little support to

compare these Web services across platforms and to build a global view Besides for the metadata

not all of them especially non-functional property descriptions are made available in s structured

format

Therefore the task of this master thesis is to apply Deep Web analysis techniques to extract as

much information about these published Web services as possible Corresponding the result shall

be the largest annotated Service Catalogue ever produced

Index Terms

Web Service Deep Web Service Crawler Service-Finder Pica-Pica Web Service Description Crawler

WSDL

Deep Web Service Crawler

4

Table of Contents

Acknowledgements 2

Abstract 3

1 Introduction 7

11 BackgroundMotivation 7

12 Initial Designing of the Deep Web Service Crawler Approach 7

13 Goals of this Master Thesis 8

14 Outline of this Master Thesis 8

2 State of the Art 10

21 Service Finder Project 10

211 Use Cases for Service-Finder Project 10

2111 Use Case Methodology 10

2112 System Administrator 10

212 Architecture Plan for the Service-Finder Project 12

2121 The Principle of the Service Crawler Component 13

2122 The Principle of the Automatic Annotator Component 13

2123 The Principle of the Conceptual Indexer and Matcher Component 14

2124 The Principle of the Service-Finder Portal Interface Component 14

2125 The Principle of the Cluster Engine Component 15

22 Information Extraction 15

221 Input Types of Information Extraction 15

222 Extraction Targets of Information Extraction 17

223 The Used Techniques in Information Extraction 18

23 Pica-Pica Web Service Description Crawler 19

231 Needed Libraries of the Pica-Pica Web Service Description Crawler 19

232 Architecture of the Pica-Pica Web Service Description Crawler 20

233 Implementation of the Pica-Pica Web Service Description Crawler 21

24 Conclusions of the Existing Strategies 22

3 Design and Implementation 23

31 Deep Web Services Crawler Requirements 23

311 Basic Requirements for DWSC 23

Deep Web Service Crawler

5

312 System Requirements for DWSC 23

313 Non-Functional Requirements for DWSC 24

32 Deep Web Services Crawler Architecture 24

321 The Function of Web Service Extractor Component 26

3211 Features of the Web Service Extractor Component 28

3212 Input of the Web Service Extractor Component 28

3213 Output of the Web Service Extractor Component 28

3214 Demonstration for Web Service Extractor 29

322 The Function of WSDL Grabber Component 30

3221 Features of the WSDL Grabber Component 31

3222 Input of the WSDL Grabber Component 31

3223 Output of the WSDL Grabber Component 31

3224 Demonstration for WSDL Grabber Component 31

323 The Function of Property Grabber Component 33

3231 Features of the Property Grabber Component 36

3232 Input of the Property Grabber Component 37

3233 Output of the Property Grabber Component 37

3234 Demonstration for Property Grabber Component 37

324 The Function of Storage Component 40

3241 Features of the Storage Component 42

3242 Input of the Storage Component 43

3243 Output of the Storage Component 43

3244 Demonstration for Storage Component 43

33 Multithreaded Programming for DWSC 46

34 Sleep Time Configuration for Web Service Registries 46

4 Experimental Results and Analysis 48

41 Statistic Information for Different Web Service Registries 48

42 Statistic Information for WSDL Document 49

43 Comparison of Different Average Number of Service Properties 50

44 Different Outputs of Web Services 52

45 Comparison of Average Time Cost for Different Parts of Single Web Service 54

5 Conclusion and Further Direction 59

6 Bibliography 60

Deep Web Service Crawler

6

7 Appendixes 61

Table of Figures 64

Table of Tables 65

Table of Abbreviations 66

Deep Web Service Crawler

7

1 Introduction

In this introduction chapter of the master thesis at first itrsquos going to concisely explain the background

of the current situation then is the basic introduction of the proposed approach which is called Deep

Web Service Extraction Crawler

11 BackgroundMotivation

In the late 1990rsquos the Web Service Registry was a hot commodity The formal definition of Web

Service Registry is known as a link links page Its function is to uniformly present information that

comes from various sources Hence it can provide a convenient channel to the users for offering

searching and using the Web Services Actually the related metadata of the Web Services that

submitted by both the system and users are commonly hosted along with the Service descriptions

Nevertheless in fact when users enter one of the Web Service Registries to look for some Web

Services they might meet some situations that would bring lots of trouble to them One of the

situations may be like that these Web Service Registries return several similar published Web Services

after the users search on it For example two or more Web Services have the same name but their

versions are not the same Or two or more Web Services that derived from the same server but have

different contents etc Furthermore most users are also interested in a global view of the published

services For instance they want to know which Web Service Registry can provide better quality for

the Web Service Therefore in order to help users to differentiate those similar published Web

Services and have a global view of the Web Services this information should be monitored and rated

Moreover there are a great many Web Service Registries in the Internet Each Web Service Registry

can provide a great number of Web Services Obviously there might have some similar Web Services

among these Web Service Registries Or a Web Service in one of the Web Service Registry is related to

another Web Service in other Web Service Registries Hence these Web Services should be

comparable across different Web Service Registries However recently there has not much support of

this In addition towards the metadata actually not all of them are structured especially the

descriptions of the non-functional property Therefore what have to do now is to turn those

non-functional property descriptions into the structured format Clearly speaking it needs to extract

as much information as possible about the Web Services that offered in the Web Service Registries

Eventually after extracting all the information from the Web Service Registries it is necessary to store

them into the disk This procedure should be efficient flexible and completeness

12 Initial Designing of the Deep Web Service

Crawler Approach

The problems have already been stated in the previous section Hence the following work is to solve

Deep Web Service Crawler

8

these problems In this section it will present the basic principle of Deep Web Service Crawler

approach

At first it is the simply introduction of the Deep Web Service Crawler approach to these problems As

have already been mentioned each Web Service Registry can offer Web Services Moreover each

Web Service Registry has its own html page structures These structures may be the same or even

complete different Therefore the first thing is to identify which Web Service Registry that it will be

going to explore Since each Web Service Registry owns a unique URL this job can be done by directly

analyzing the corresponding URL address of that Web Service Registry After identifying which Web

Service Registry it is going to explore the following step is to obtain all these Web Services that

published in that Web Service Registry Then with all these obtained Web Services it is time to extract

analyze and gather the information of the services That information can be in structured format or

even in unstructured format In this master thesis some Deep Web Analysis Techniques will be

applied to obtain this information So that the information about each Web Service shall be the

largest annotated The last but not the least important all the information about the Web Services

need to be stored

13 Goals of this Master Thesis

The lists in the following are the goals of this master thesis

n Produce the largest annotated Service Catalogue

Service Catalogue is a list of service properties The more properties the service has the larger

Service Catalogue it owns Therefore this master program should extract as much service

properties as possible

n Flexible storage of these metadata of each service as annotations or dedicated documents

The metadata of one service includes not only the WSDL document but also service properties

All these metadata are important information for the service Therefore this master program

should provide flexible ways to store these metadata into the disk

n Improve the comparable property of the Web Services across different Web Service Registries

The names of service properties for one Web Service Registry could be different from another

Web Service Registry Hence for the purpose of improving the comparable ability all these

names of the service properties should be uniformed and well-defined

14 Outline of this Master Thesis

In this chapter the motivation objective and initial approach plan have already been discussed

Thereafter the remaining paper is structured as follows

Firstly in chapter 2 the work presented is almost based on these existing techniques In section 21

there is given a detailed introduction to the technique of the Service-Finder project Then in section

22 the already implemented technique of Pica-Pica Web Service Description Crawler is introduction

and discussed After that in section 23 the Information Retrieval technique is presented

Deep Web Service Crawler

9

Then following in chapter 3 it is explained the designing details of this Deep Web Service Crawler

approach In section 31 it gives a short description for the different requirements of this approach

Next in section 32 the actual design for the Deep Web Service Crawler is presented Then in section

33 34 the multithreaded programming and sleep time configuration that used in this master

program are introduced respectively

In chapter 4 it is supposed to display the experiments of this Deep Web Service Crawler approach

and then give some evaluation of it

Finally in Section 5 the conclusion and discussion of the work that already done as well as the work in

the future for this master task are presented respectively

Deep Web Service Crawler

10

2 State of the Art

This chapter aims at presenting some existing techniques or Strategies that related to the work of

applying this Deep Web Service Extraction Crawler approach Section 21 plans to talk about the

existing catalogues Service-Finder project And then in section 22 it is going to explain the existing

implemented crawlers Pica-Pica Web Service Description Crawler Finally in the section 23 it is

supposed to present some details about the Information Extraction technique

21 Service Finder Project

Service-Finder Project aims at developing a platform for Web Service discovery especially for the Web

Services that are embedded in a Web 20 environment [1] Hence it can provide an efficient access to

publicly available services The goals of the Service-Finder project are depicted as follows [1]

n Automatically gather Web Services and their related information

n Semi-automatically create semantic service description based on the information that available

on the Web

n Create and improve semantic annotations via the user feedback

n Describe the aggregated information in semantic models and allow reasoning query

However before describing the basic functionality of the Service-Finder Project there is going to

present one of its use cases and requirements first

211 Use Cases for Service-Finder Project

The Service-Finder Project employed the use case methodology of the W3C Use Case description [6]

for its needs and then applied this methodology to the use cases that it enumerated

2111 Use Case Methodology

There are three aspects needed to be considered for use case definitions of Service-Finder Project [1]

(1) Description that used to describe information of the use case

(2) Actors Roles and Goals that used to identify the actors they would be the roles they act and the

goals they need to achieve in the scenario

(3) Storyboard that used to describe the serial of interactions among the actors and the

Service-Finder Portal

2112 System Administrator

This section is going to present the use case that applied to the Service-Finder portal and that

illustrated the requirements on its functionality from a user point of view However all these

Deep Web Service Crawler

11

information in this use case are derived from [1] In this use case there has a system administrator

whose name is ldquoSam Adamsrdquo He works for the bank His job is to keep the online payment facilities

online and working all day and night Therefore if there is any system failures Sam Adams should fix

the problems as early as he can Thatrsquos why he wants to use an SMS Messaging service which will

alert him immediately by sending him a SMS Message in the case of a system failure

n Description

This use case is dedicated to describe the system administrator ldquoSam Adamsrdquo who is looking for an

SMS Messaging Service that he wants to build it into his application

n Actors Roles and Goals

The name of the Actor is ldquoSam Adamsrdquo His Role is a system administrator at a bank The goals of him

are the immediate service delivery the reliability of the service and low base fee and transaction fee

- Storyboard
Step 1: Sam Adams knows the Service-Finder portal and that he can find many useful services through it, and he knows what he is looking for. Hence he visits the Service-Finder portal and starts to search by entering the keywords "SMS Alert".
Requirement 1: Search functionality

Step 2: The Service-Finder portal now returns a list of matching services. However, Sam wants to choose the number of matching services displayed on one page. He would also expect short information about the service functionality, the service provider and the service availability, so that he can decide which service to read about further.
Requirement 2: Enable configurable pagination of the matching results and show short information for each service

Step 3: When Sam looks through the short information about the services displayed on the first page, he expects to find the most relevant services related to his request. After that, he would like to read more detailed information about such a service to see whether it provides the needed functionality.
Requirement 3: Rank the returned matching services and provide the ability to read more details of a service

Step 4: It may be the case that the returned matching services provide quite different functionalities or belong to different service categories; for example, some SMS messaging services alert users not through SMS but through voice messaging. For this reason Sam would like to see other categories that may contain the services he wants, or services of other categories which he is also interested in (like "SMS Messaging"). Another possible way is that Sam can further refine his search by browsing through categories.
Requirement 4: Provide service categories and allow the user to see all services that belong to a specific category. If possible, it should also allow the user to browse through categories

Step 5: When Sam has got all the services that could provide SMS messaging via the methods described in Step 4, he now wants to look for the services that are offered by an Austrian provider and, if possible, have no base fees.
Requirement 5: Faceted search


Step 6: After Sam has got all these specific services, he would like to choose the services that provide a high reliability.
Requirement 6: Sort functionality based on the user's choices

Step 7: Now Sam expects to compare the service availability promised by the service provider with the availability actually provided; this should be contained in the services' details. There also needs to be service coverage information, so that Sam can know whether a service covers the areas where he lives and works. Moreover, Sam would also like to compare these services in other ways, for instance by putting some services into a structured table to compare the transaction fees.
Requirement 7: A side-by-side comparison table for services and a functionality that enables users to select the services they want to compare

Step 8: At last, Sam wants to know whether the service providers offer a free trial of the services, so that he can test the service functionality.
Requirement 8: If possible, display a note for services offering free trials

212 Architecture Plan for the Service-Finder Project

The architecture plan of the Service-Finder project contains five basic components: Service Crawler, Automatic Annotator, Conceptual Indexer and Matcher, Cluster Engine and Service-Finder Portal Interface. Figure 2-1 presents a high-level overview of these components and the data flow among them.

Figure 2-1: Dataflow of Service-Finder and Its Components [3]


2121 The Principle of the Service Crawler Component

The Service Crawler is responsible for gathering available services and their related information from the Web. The overall cycle is as follows:

(1) A Web developer publishes a Web Service.

(2) The Crawling component begins to harvest the Web in order to identify Web Services, e.g. WSDL (Web Service Description Language) documents.

(3) As soon as a service is discovered, the Crawler also searches for other related information.

(4) After each periodic interval, the Crawler produces a consistent snapshot of the relevant part of the Web.

Finally, the output of the crawler is forwarded to the subsequent components for analyzing, indexing and displaying.

2122 The Principle of the Automatic Annotator Component

The Automatic Annotator receives the relevant data from the previous component and generates semantic service descriptions for the WSDL documents and their related information, based on the Service-Finder Ontology and the Service Category Ontology.

First, the two ontologies that are used throughout the whole process are briefly introduced [2]:

- Generic Service Ontology: an ontology used to describe the data objects, for example the services, the service providers, availability, payment modalities and so on.
- Service Category Ontology: an ontology used to categorize the functionalities or application areas of the services, for instance data verification, messaging, data storage, weather, etc.

Afterwards, the function of this component, together with its input and output, is described:

Input:
- Crawled data from the Service Crawler
- Service-Finder ontologies
- Feedback on or corrections of previous annotations

Function:
- Enrich the information about the service and extract semantic statements according to the Service-Finder ontologies, for example categorize the service according to the Service Category Ontology
- Determine whether a particular document is relevant or not through the Web link graph; if not, discard these irrelevant documents
- Classify the pages into their genres, for instance pricing, user comments, FAQ and so on

Output:
- Semantic annotations of the services


2123 The Principle of the Conceptual Indexer and Matcher Component

The Conceptual Indexer and Matcher acts as a data store that holds all the extracted information about the services and supplies users with retrieval and semantic query capabilities, for example the matchmaking between user requests and service offers, and the retrieval of user feedback on extracted annotations.

The function of this component and its input and output are as follows:

Input:
- Semantic annotation data and full-text information obtained from the Automatic Annotator
- Semantic annotation data and full-text information that comes from the user interfaces
- Cluster data from the user and service clustering component

Function:
- Store the semantic annotations received from the Automatic Annotator component and from the user interface
- Store the cluster data procured through the clustering component
- Store and index the textual descriptions offered by the Automatic Annotator component and the textual comments offered by users
- Ontological querying of the semantic data in the data store
- Combined keyword and ontological querying used for user queries
- Provide a list of similar services for a given service

Output:
- A list of matching services for a user query; in particular, these services should be sorted by ranking and should be iterable
- All available data related to a particular entity must be retrievable at the user interface

2124 The Principle of the Service-Finder Portal Interface Component

The Service-Finder Portal Interface is the main entry point for users of the Service-Finder system to search and browse the data that is managed by the Conceptual Indexer and Matcher component. In addition, users can contribute information by providing tags, comments, categorizations and ratings for the browsed data. Furthermore, developers can directly invoke the Service-Finder functionalities from their custom applications via an API.

The details of this component's function, input and output are as follows:

Input:
- A list of ordered services for a query
- Detailed information about a service or a set of services and a service provider
- Query access to the service category ontology and the most used tags provided by the users


- Service availability information

Function:
- The Web interface allows users to search services by keyword, tag or concept in the categorization, sort and filter query results by refining the query, compare and bookmark services, and try out the services that offer this functionality
- The API allows developers to invoke Service-Finder functionalities

Output:
- Explicit user annotations such as tags, ratings, comments, descriptions and so on
- Implicit user data, for example click streams of users, bookmarks, comparisons, links sent, etc.
- Manual advertising of newly available services

2125 The Principle of the Cluster Engine Component

The Cluster Engine is in charge of collecting and analyzing information about user behavior from the Service-Finder Portal, e.g. the queried services and the services compared by the users. Moreover, it provides cluster data to the Conceptual Indexer and Matcher for generating service recommendations.

This component's function, input and output are as follows:

Input:
- Service annotation data, both extracted and from user feedback
- Users' click streams, used for extracting user behavior

Function:
- Obtain user clusters from user behavior
- Obtain service clusters from service annotation data in order to find similar services

Output:
- Clusters of users and services

22 Information Extraction

Due to the rapid development and use of the World-Wide Web, a huge number of information sources have appeared on the Internet, but access to them is limited to browsing and searching because of the heterogeneity and the lack of structure of Web information sources. Therefore, Information Extraction, which transforms Web pages into program-friendly structures for post-processing, has become a great necessity. The task of Information Extraction is specified in terms of its inputs and its extraction targets, and the technique used in the process of Information Extraction is called an extractor.

221 Input Types of Information Extraction

Generally speaking, there are three different input types. The first input type is the unstructured


document, for example the free text shown in figure 2-2. It is unstructured and written in natural language, so it requires substantial natural language processing. The second input type is the structured document, for instance XML documents, because the data can be described through an available DTD (Document Type Definition) or XML (eXtensible Markup Language) schema. Finally, the third input type is the semi-structured document, which is widespread on the Web, such as the large volume of HTML pages containing tables, itemized lists and enumerated lists. This is because HTML tags are often used to render these embedded data in the HTML pages (see figure 2-3).

Figure 2-2: Left is the free text input type and right is its output [4]

Figure 2-3: A semi-structured page containing data records (in rectangular box) to be extracted [4]

In this way, inputs of the semi-structured type can be seen as documents with a fairly regular structure, and the data of these documents can be displayed in an HTML or a non-HTML format. Besides, since the Web pages of the Deep Web are dynamic and generated from structured databases using templates or layouts, the Deep Web can be considered as one of the input sources that provide such semi-structured documents. For example, the authors, price and comments of the book pages provided by Amazon have the


same layout. That is because these Web pages are generated from the same database and rendered with the same template or layout. Furthermore, HTML pages of the semi-structured type can also be generated manually. For example, although the publication lists provided on the homepages of different researchers are produced by diverse users, they all have a title and a source property for every single paper. Eventually, the inputs for Information Extraction can also be pages of the same class or pages from various Web Service Registries.

222 Extraction Targets of Information Extraction

Moreover, regarding the task of Information Extraction, the extraction target has to be considered. There are two different kinds of extraction targets. The first one is the relation of a k-tuple, where k is the number of attributes in a record. In some cases an attribute of a record may have no instantiation, while in other cases an attribute has multiple instantiations. The second extraction target is the complex object with hierarchically organized data. Though the ways of depicting the extraction targets in a page are diverse, the most common structure is the hierarchical tree. This hierarchical tree may contain only one leaf node, or one or more lists of leaf nodes grouped under internal nodes. The structure of a data object may also be flat or nested. In brief, if the structure is flat, then there is only one leaf node, which can also be called the root; if it is a nested structure, then the internal nodes involved in this data object span more than two levels.

Furthermore, in order to make the Web pages readable for human beings and easier to visualize, the tables, the tuples of the same list, or the elements of a tuple should be clearly isolated or demarcated. However, the display of a data object in a Web page can be affected by the following conditions [4]:

- The attribute of a data object has zero or several values:
(1) If there is no value for the attribute of a data object, this attribute is called a "none" attribute. For example, a special offer that is only available for certain books might be a "none" attribute.
(2) If there is more than one value for the attribute of a data object, it is called a "multiValue" attribute. For instance, the name of the author of a book could be a "multiValue" attribute.

- The set of attributes (A1, A2, A3, ...) has multiple orderings:
That is to say, within this set of attributes, the position of an attribute might change for different instances of a data object. Such an attribute is called a "multiOrdering" attribute. For instance, for movies before the year 1999 a movie site might enumerate the release date in front of the movie's title, while for movies from 1999 onwards it enumerates the release date behind the movie's title.

- The attribute has different formats:
This means the display format of the data object can be completely different for different instances. Therefore, if the format of an attribute is free, then a lot of rules are needed to deal with all the possible cases. Such an attribute is called a "multiFormat" attribute. For example, an e-commerce Web site might use a bold font to present the regular prices while using a red color to display the sale prices. Nevertheless, there is


another situation in which different attributes of a data object have the same format. For example, various attributes may be presented using <TD> tags in a table presentation. Attributes like those can be differentiated by means of their order information. However, in cases where a "none" attribute occurs or "multiOrdering" attributes exist, the rules for extracting these attributes have to be revised.

- The attribute cannot be decomposed:
For easier processing, the input documents are sometimes treated as strings of tokens instead of strings of characters. However, some attributes cannot be decomposed into several individual tokens. These attributes are called "untokenized" attributes. For example, in college course catalogue entries like "COMP4016" or "GEOL2001", the department code and the course number cannot be separated into two different strings of characters such as "COMP" and "4016" or "GEOL" and "2001".

223 The Used Techniques in Information Extraction

The extractor used in the process of Information Extraction aims at providing a single uniform query interface to access information sources like database servers and Web servers. It consists of the following phases: collecting returned Web pages, labeling these Web pages, generalizing extraction rules, extracting the relevant data, and outputting the result in an appropriate format (XML format or a relational database) for further information integration. For example, the extractor first queries the Web server to gather the returned pages through the HTTP protocol; after that it starts to extract the contents of these HTML documents and integrates them with other data sources. The whole process of the extractor follows the steps below.

Step 1:
At the beginning, the input has to be tokenized. There are two different granularities for the input string tokenization: tag-level encoding and word-level encoding. Tag-level encoding transforms the tags of an HTML page into general tokens and transforms each text string between two tags into a special token. Word-level encoding does this in another way: it treats each word in a document as a token.

Step 2:
Next, the extraction rules are applied for every attribute of the data object in the Web pages. These extraction rules can be induced by top-down or bottom-up generalization, pattern mining or logic programming. In addition, the type of extraction rules may be expressed by means of regular grammars or logic rules. For example, some use path expressions of the HTML parse tree, like html/head/title or html->table[0]; some use syntactic or semantic constraints; and some use delimiter-based constraints such as HTML tags or literal words.

Step 3:
After that, all the extracted data are assembled into records.

Step 4:
Finally, this process is iterated until all the data objects in the input have been processed.
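To make the tokenization and rule-application steps more concrete, the following is a minimal Java sketch of a tag-level tokenizer combined with a simple delimiter-based extraction rule. It is only an illustration of the technique described above; the regular expression and the example rule (extracting the text token between <b> and </b>) are assumptions for this sketch, not part of any existing extractor.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagLevelTokenizer {

    // Tag-level encoding: every HTML tag becomes one token and every
    // text string between two tags becomes one special text token.
    public static List<String> tokenize(String html) {
        List<String> tokens = new ArrayList<>();
        Matcher m = Pattern.compile("<[^>]+>|[^<]+").matcher(html);
        while (m.find()) {
            String token = m.group().trim();
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // A simple delimiter-based extraction rule: return the text token
    // that appears between the given opening and closing delimiter tokens.
    public static String extractBetween(List<String> tokens, String open, String close) {
        for (int i = 1; i < tokens.size() - 1; i++) {
            if (open.equalsIgnoreCase(tokens.get(i - 1)) && close.equalsIgnoreCase(tokens.get(i + 1))) {
                return tokens.get(i);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<String> tokens = tokenize("<tr><td><b>Price</b></td><td>42 EUR</td></tr>");
        System.out.println(tokens);                                  // tag and text tokens
        System.out.println(extractBetween(tokens, "<b>", "</b>"));   // prints "Price"
    }
}
```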


23 Pica-Pica Web Service Description Crawler

Pica pica is known as a bird species, also called magpie. Here, however, Pica-Pica is a Web Service Description Crawler which is designed to address the problem of Web Service quality, for example to evaluate the descriptive quality of the offered Web Services and how well these Web Services are described in today's Web Service Registries.

231 Needed Libraries of the Pica-Pica Web Service Description Crawler

This version of the Pica-Pica Web Service Description Crawler was written by Anton Caceres and Josef Spillner in the Python language. In order to run these scripts and parse the HTML pages, two additional libraries are needed: Beautiful Soup and html5lib.

- Beautiful Soup
It is an HTML/XML parser for the Python language, and it can even turn invalid markup into a parse tree [5]. Moreover, the following three features make it more powerful:
  - Bad markup does not choke Beautiful Soup. In fact, it generates a parse tree that makes approximately as much sense as the original document, so you can still obtain the data that you want.
  - Beautiful Soup provides a toolkit with simple, idiomatic methods for navigating, searching and modifying the parse tree. Hence you do not need to create a custom parser for every application.
  - Beautiful Soup converts incoming documents to Unicode and outgoing documents to UTF-8 automatically. If the document declares an encoding you can ignore the issue; otherwise you just have to specify the encoding of the original document.

Furthermore, the ways of including Beautiful Soup in an application are shown below [5]:

  - from BeautifulSoup import BeautifulSoup (for processing HTML)
  - from BeautifulSoup import BeautifulStoneSoup (for processing XML)
  - import BeautifulSoup (to get everything)

- html5lib
It is a Python package which implements the HTML5 [8] parsing algorithm. In order to gain maximum compatibility with the current major desktop web browsers, this implementation is based on the WHATWG (Web Hypertext Application Technology Working Group) HTML5 specification.


232 Architecture of the Pica-Pica Web Service Description Crawler

Figure 2-4: Architecture of the Pica-Pica Web Service Description Crawler

Figure 2-4 illustrates the primary architecture of the Pica-Pica Web Service Description Crawler. It includes four fundamental components: the Service Page Grabber component, the WSDL Grabber component, the Property Grabber component and the WSML Register component.

(1) The Service Page Grabber component takes the URL seed as input and outputs the link of the service page to the following two components, the WSDL Grabber component and the Property Grabber component.

(2) The WSDL Grabber component is responsible for obtaining the WSDL document based on the delivered service page link and then checking whether the obtained WSDL document is valid. Finally, only the valid WSDL documents are passed to the WSML Register component for further processing.

(3) The Property Grabber component tries to extract the service's properties hosted in the service page, if they exist. After that, all these service properties are saved into an INI file as the information of that service.

(4) The functionality of the WSML Register component is to write appropriate WSML documents by means of the valid WSDL documents delivered from the WSDL Grabber component and the optional INI files delivered from the Property Grabber component, and afterwards to register them in Conqo.

- WSML [9]
WSML stands for Web Service Modeling Language, which provides a framework with different language variants. Hence it is often used to describe the different aspects of semantic Web Services according to the conceptual model of WSMO.

- WSMO [10]
WSMO, whose full name is Web Service Modeling Ontology, is dedicated to describing the various aspects of Semantic Web Services based on ontologies. An ontology is a formal, explicit specification of a shared conceptualization. In fact, these ontologies are the sticking points that establish the linkage between the agreement of the communities of users and the defined conceptual semantics of the real world.

- Conqo [11]


It is a discovery framework that considers not only the Quality of Service (QoS) but also context information. It uses a Web Service Repository to manage the service descriptions, which are based on WSML.

233 Implementation of the Pica-Pica Web Service Description Crawler

This section describes the implementation of the Pica-Pica Web Service Description Crawler in detail.

(1) First, to start the whole crawling process of the Pica-Pica Web Service Description Crawler, an input is needed as the initial seed. In this crawler there are five Web Service Registries, which are listed below. The URL addresses of these five Web Service Registries are used as the input seeds for the Pica-Pica Web Service Description Crawler. Moreover, in this version of the Pica-Pica Web Service Description Crawler there is a single Python script for each Web Service Registry, and the crawling scripts for these Web Service Registries are executed one after another.

Biocatalogue: http://www.biocatalogue.com
Ebi: http://www.ebi.ac.uk
Seekda: http://www.seekda.com
Service-Repository: http://www.service-repository.com
Xmethods: http://www.xmethods.net

(2) After being fed with the input seed, the crawler steps into the next component, the Service Page Grabber. At first this component tries to read the data from the Web based on the input seed. Then it establishes a parse tree of the read data using the functions of the Beautiful Soup library. After that, the Service Page Grabber component starts to look for the service page link of each service published in the Web Service Registry by means of the functions in the html5lib library. When the service page link of a single service is found, the component first checks whether this service page link is valid. Once the service page link is valid, it is passed to the following two components for further processing: the WSDL Grabber component and the Property Grabber component.

(3) When the WSDL Grabber component receives a service page link from the previous component, it sets out to extract the WSDL link address of that service from the parse tree of the data in this service page. Next, this component starts to download the WSDL document of that service using the WSDL link address. Thereafter the obtained WSDL document is stored on disk. The process of this WSDL Grabber component is carried on continually until no more service links are passed to it. Certainly, not all grabbed WSDL documents are valid: they may contain bad definitions or a bad namespaceURI, be empty documents, or not even be in XML format. Hence, in order to pick them out, this component further analyzes the involved WSDL documents (a minimal sketch of such a validity check is given after this list). All valid documents are then put into a "validWSDLs" folder, whereas the invalid documents are put into a folder named "invalidWSDLs" in order to gather statistical information. Finally, only the WSDL documents in the "validWSDLs" folder are passed to the subsequent component.

(4) Moreover, some Web Service Registries provide additional information about the


services, such as availability, service provider and version. Therefore the Property Grabber component sets out to extract this information as the service's properties and thereafter saves these properties into an INI file. However, if no additional information is available, then there is no need to extract the service properties, and thus there is no INI file for that service. In this implementation of the Pica-Pica Web Service Description Crawler, only the Python scripts for the Seekda and Service-Repository Web Service Registries have functions to extract the services' properties, while for the other three Web Service Registries there is no such function.

(5) Furthermore, it is optional to create a report file which contains statistical information about this process, such as the total number of services for one Web Service Registry, the number of services whose WSDL document is invalid, etc.

(6) As has been stated, at this point there is a folder with all valid WSDL documents and possibly also some INI files. The task of the WSML Register component is therefore to generate the appropriate WSML documents from these valid WSDL documents and INI files, and then register them in Conqo.
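Although Pica-Pica itself is implemented in Python, the following Java sketch illustrates the kind of WSDL validity check described in step (3): a document is considered invalid if it is empty, is not well-formed XML, or does not have a WSDL definitions element as its root. The method name and the exact checks are assumptions made for this illustration, not code taken from Pica-Pica.

```java
import java.io.File;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class WsdlValidityCheck {

    // Returns true if the file looks like a usable WSDL document:
    // non-empty, well-formed XML, and with a root element named "definitions"
    // (or "description" for WSDL 2.0).
    public static boolean isValidWsdl(File wsdlFile) {
        if (!wsdlFile.exists() || wsdlFile.length() == 0) {
            return false;                       // empty or missing document
        }
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(wsdlFile);        // fails if not XML
            String root = doc.getDocumentElement().getLocalName();
            return "definitions".equals(root) || "description".equals(root);
        } catch (Exception e) {
            return false;                       // not parseable as XML
        }
    }

    public static void main(String[] args) {
        File f = new File(args[0]);
        System.out.println(f + " valid WSDL? " + isValidWsdl(f));
    }
}
```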

24 Conclusions of the Existing Strategies

This chapter presented three aspects of the existing strategies: the Service-Finder project, the Information Extraction technique and the Pica-Pica Web Service Description Crawler.

The task of this master program is to obtain the available Web services and their related information from the Web. This is essentially a procedure of extracting the needed information for each service, such as the service's WSDL document and its properties. Therefore the Information Extraction technique, which is used to extract information hosted on the Web, can be applied in this master program.

Moreover, Service-Finder is a large project that is not only able to obtain the available Web services and their related information from the Web, but also enriches the crawled data with annotations by means of the Service-Finder ontology and the Service Category ontology and integrates all the information into a coherent semantic model. Furthermore, it provides capabilities for searching and browsing the data through a user interface and gives users service recommendations. However, the scope of the Service-Finder project far exceeds the requirements of a master program. Therefore it is considered only as a reference for this master program.

Furthermore, since the Pica-Pica Web Service Description Crawler aims only at obtaining the available Web services and their related information, it fulfills the primary task of this master program. Nevertheless, regarding the information about a service, the Pica-Pica Web Service Description Crawler extracts only a few properties, and sometimes even none. Consequently, in order to improve the assessment of service quality, as many properties about each service as possible have to be extracted. Hence chapter 3 presents an extension of this Pica-Pica Web Service Description Crawler.


3 Design and Implementation

In the previous chapter on the state of the art, the already existing techniques and implementations were presented. In the following, the basic principle of the proposed approach, the Deep Web Services Crawler, is introduced; it is based on these existing techniques, especially the Pica-Pica Web Service Description Crawler.

31 Deep Web Services Crawler Requirements

This section mainly describes the goals of the Deep Web Service Crawler approach, the system requirements of the approach, and some non-functional requirements.

311 Basic Requirements for DWSC

The following are the basic requirements which should be achieved:

(1) Produce the largest annotated Service Catalogue. Here, the Service Catalogue is a list of the Web services published in a Web Service Registry. It contains not only the WSDL document of each service but also a file with information about that service. Therefore, for the purpose of producing the largest annotated Service Catalogue, the proposed approach needs to extract as many properties about those Web services as possible. Moreover, it also has to download the WSDL document that is hosted along with the Web service. That is to say, these properties include not only the interesting structured properties, such as the service name and its WSDL link address, but also other non-functional properties, for example the endpoint and its monitoring information.

(2) Flexible storage: each service has a Service Catalogue entry that contains all its interesting properties. The question is which kinds of schemes are used to store those service properties. Hence, in order to store them in a flexible way, the proposed approach provides three storage methods: the first one stores them as an XML file, the second stores them in an INI file, and the third uses a database for the storage.
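As a simple illustration of the INI variant, the following Java sketch writes a few service properties into an INI-like key-value file using java.util.Properties. The property keys and values shown here are hypothetical examples, not the actual key names used by the Deep Web Service Crawler.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.Properties;

public class IniStorageExample {

    public static void main(String[] args) throws IOException {
        // Hypothetical properties of a single crawled Web service.
        Properties service = new Properties();
        service.setProperty("serviceName", "BLZService");
        service.setProperty("wsdlLink", "http://services.unitedplanet.de/blz/BlzService.asmx?WSDL");
        service.setProperty("provider", "unitedplanet.de");

        // Store them as a simple key=value (INI-like) file on disk.
        try (FileOutputStream out = new FileOutputStream("BLZService.ini")) {
            service.store(out, "Properties of the crawled Web service");
        }
    }
}
```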

312 System Requirements for DWSC

Generally speaking, the requirements for realizing a programming project comprise the following:

1) Operating system: Linux/Unix, Windows (XP, Vista, 2000, etc.)
2) Programming language: C++, Java, Python, C, etc.
3) Programming tool: NetBeans, Eclipse, Visual Studio, and so on

In this master thesis, the scripts of the Deep Web Service Crawler approach are written in the Java programming language. These scripts have only been tested on the Windows XP and Linux operating systems and have not been tested on other operating systems.


313 Non-Functional Requirements for DWSC

In this part, several non-functional requirements for the Deep Web Service Crawler approach are presented.

1) Transparency: the process of data exploration and data storage should be done automatically, without the user's intervention. However, at first the user should specify the path on the hard disk which will be used to store the outputs of this program.

2) Fault tolerance: during the execution of this program, errors can inevitably happen. Therefore, in order to keep the process from being interrupted, there must be the necessary error handling so that the process can recover (a minimal sketch of such a retry mechanism is given after this list).

3) Completeness: this approach should extract as many of the interesting properties of each Web Service as possible, e.g. endpoint and monitoring information.
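The following Java sketch illustrates one possible form of such error handling: a small retry wrapper around a network operation that catches failures and tries again a limited number of times before giving up. It is only an assumed example of the fault-tolerance requirement, not the actual error handling used in the implementation.

```java
import java.util.concurrent.Callable;

public class RetryHelper {

    // Runs the given task, retrying up to maxAttempts times when it throws.
    public static <T> T withRetry(Callable<T> task, int maxAttempts) throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.call();
            } catch (Exception e) {
                last = e;                          // remember the failure
                System.err.println("Attempt " + attempt + " failed: " + e.getMessage());
                Thread.sleep(1000L * attempt);     // back off before retrying
            }
        }
        throw last;                                // give up after maxAttempts
    }
}
```

A page download or a WSDL fetch could then be wrapped, for example, as withRetry(() -> fetch(url), 3), so that a temporary network error does not abort the whole crawl.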

In addition, since the Pica-Pica Web Service Crawler has already implemented strategies for the following five URLs, the proposed approach must cover no fewer than these five Web Service Registries:

Biocatalogue: http://www.biocatalogue.com
Ebi: http://www.ebi.ac.uk
Seekda: http://www.seekda.com
Service-Repository: http://www.service-repository.com
Xmethods: http://www.xmethods.net

32 Deep Web Services Crawler Architecture

In this section, an overview of the high-level architecture of the Deep Web Services Crawler approach is given first. Thereafter, four subsections outline each single component and how they play together.

The current components and the flow of data in the Deep Web Service Crawler can be summarized as depicted in Figure 3-1, using the continuous arrows. The crawler first tries to obtain the available service list page links and the related service page links by crawling the Web (Web Service Extractor). Then those gathered links are processed in two separate steps: one gets the service's WSDL document (WSDL Grabber) and the other collects the properties of each service (Property Grabber). Finally, all these data are stored on the storage device (Storage). The whole process in figure 3-1 proceeds as follows.

Step 1:
When the Deep Web Service Crawler starts to run, the File Chooser container requires the user to specify a path on the computer or on any other hard disk. The reason for specifying the path is that the Deep Web Service Crawler program needs a place to store all its outputs.

Step 2:
After that, the Web Service Extractor is triggered. It is the main entry point into the specific crawling process. Since the Deep Web Service Crawler is a program which is supposed to crawl for Web Services in some given Web Service Registries, the URL addresses of


these Web Service Registries have to be given as the initial seeds for this Web Service Extractor process. However, since the page structures of these Web Service Registries are completely different, there is a dedicated process for each Web Service Registry.

Figure 3-1: Overview of the Basic Architecture of the Deep Web Services Crawler

Step 3:
According to the given seed, two types of links are obtained by the Web Service Extractor component: the service list page link and the service page link. A service list page is a page that contains a list of Web Services and possibly some information about these Web Services, while a service page is a page that contains much more information about a single service. Finally, the component forwards these two types of links to the next two components, the Property Grabber and the WSDL Grabber.

Step 4:
On the one hand, the Property Grabber component tries to gather the information about the service that is hosted in the service list page and the service page, such as the name of the service, its description, the ranking of this service, etc. Finally, all the information about the service is collected together as the service properties, which are then delivered to the Storage component for further processing.

Step 5:
On the other hand, the WSDL Grabber tries to obtain the WSDL link from the service list page or the service page. For some Web Service Registries the WSDL link is hosted in


the service list page, as in Biocatalogue, while for the other Web Service Registries it is hosted in the service page, as in Xmethods. After obtaining the WSDL link, it is also transmitted to the Storage component for further processing.

Step 6:
When the service properties and the WSDL link of the service are received by the Storage component, it stores them on disk. The service properties are stored in one of three different ways: as an XML file, as an INI file, or as a record inside a database table. For the WSDL link, the Storage component first tries to download the page content according to the URL address of the WSDL link. If this succeeds, the page content is stored as the WSDL document of the service on disk (a minimal sketch of this download step is given after this list).

Step 7:
Steps 3 to 6 describe the crawling process for a single service. Hence, if there is more than one service or more than one service list page in a Web Service Registry, the crawling process from step 3 to step 6 is repeated again and again until there are no services or service list pages left in that Web Service Registry.

Step 8:
Furthermore, after the crawling process for one Web Service Registry finishes, a file is generated that contains some statistical information about this crawling process, for example the time when the crawling process of this Web Service Registry started and finished, the total number of Web services in this Web Service Registry, how many services have an empty WSDL document, the average number of service properties in this Web Service Registry, and the average time cost for extracting service properties, getting the WSDL document and generating the XML file, INI file, etc.
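The following is a minimal Java sketch of the WSDL download described in Step 6: it fetches the page content behind a WSDL link over HTTP and writes it to disk as the service's WSDL document. The method name and the target file name are assumptions for illustration; the actual Storage component may differ in its details.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class WsdlDownloader {

    // Downloads the content behind the WSDL link and stores it on disk.
    public static Path downloadWsdl(String wsdlLink, String serviceName, String outputDir)
            throws IOException {
        Path target = Paths.get(outputDir, serviceName + ".wsdl");
        try (InputStream in = new URL(wsdlLink).openStream()) {
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
        return target;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical call for the BLZService example used later in this chapter.
        Path stored = downloadWsdl(
                "http://services.unitedplanet.de/blz/BlzService.asmx?WSDL",
                "BLZService", ".");
        System.out.println("WSDL stored at " + stored.toAbsolutePath());
    }
}
```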

321 The Function of Web Service Extractor Component

The Web Service Extractor component is responsible for gathering service list page links and service page links. It pursues a focused crawl of the Web and only forwards service list page and service page links to the subsequent components for analysis and collection purposes. Therefore it identifies both the service list page links and the related service page links on these Web Service Registries.

As can be seen in figure 3-2, a crawl for Web Services needs to start from a URL seed. The seed is almost as important as the Web Service Extractor itself, as it highly influences the part of the Web which has to be crawled. The seed can or shall contain, e.g., Web pages where these Web Services are published or which talk about Web Services.


Figure 3-2: Overview of the process flow of the Web Service Extractor Component

After being fed with the URL seed, the Web Service Extractor component starts to get the link of the service list page from the initial page of this URL seed. However, this process differs for the five Web Service Registries. The following shows the different situations in these Web Service Registries.

- Service-Repository Web Service Registry:
In this Web Service Registry, the link of the first service list page is the URL address of its seed, which means some Web Services can already be found on the home page of the Service-Repository Web Service Registry. The word "first" implies that there is more than one service list page link in this Web Service Registry. Therefore, the process of getting service list page links in this registry is carried on continually until no more service list page links exist.

- Xmethods Web Service Registry:
Although there are Web Services on the home page of the Xmethods Web Service Registry, they are only a small subset of the Web Services in this registry. Moreover, in the Xmethods Web Service Registry there is only one page containing all Web Services. Therefore the service list page link of that page has to be obtained.

- Ebi Web Service Registry:
The situation in the Ebi Web Service Registry is a little bit like that in the Xmethods Web Service Registry. That is to say, there is also one page that contains all Web Services in this Web Service Registry. However, this page is not the initial page of the input seed. Therefore, more than one operation step is needed to get the service list page link of that page.

- Seekda Web Service Registry:
In the Seekda Web Service Registry, the Web Services are not contained in the initial page of the input seed. The service list page link can be obtained after several additional operation steps.


However, there is a problem with getting the service list page links in this registry. Simply put, if more than one page contains Web Services, then for some unknown reason the links of the remaining service list pages cannot be obtained. In other words, only the link of the first service list page can be obtained.

- Biocatalogue Web Service Registry:
The process of getting service list pages in the Biocatalogue Web Service Registry is almost the same as in the Seekda Web Service Registry. The only difference is that in the Biocatalogue Web Service Registry all service list page links can be obtained if there is more than one service list page.

Then, after getting the link of a service list page, the Web Service Extractor begins to get the link of the service page for each service listed in that service list page. The reason why it can do this is that there is an internal link for every service which points to the service page. It is worth noting that once a service page link is obtained, this component immediately forwards the service page link and the service list page link to the subsequent two components for further processing. Nevertheless, the process of obtaining the service page links is carried out continuously until all services listed in that service list page have been crawled. Analogously, the process of getting service list pages is also carried out continuously until no more service list pages exist. A minimal sketch of this nested crawl loop is given below.
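The following Java sketch outlines the nested loop just described: it walks over the service list pages of one registry, extracts the service page links from each of them, and forwards both kinds of links. The helper methods fetchHtml, extractServicePageLinks and findNextListPageLink are hypothetical placeholders standing for the registry-specific extraction logic, not existing methods of the implementation.

```java
import java.util.List;

public abstract class RegistryCrawlLoop {

    // Registry-specific parts, assumed to be implemented per Web Service Registry.
    protected abstract String fetchHtml(String url);
    protected abstract List<String> extractServicePageLinks(String listPageHtml);
    protected abstract String findNextListPageLink(String listPageHtml); // null if none

    // Forwarding hook towards the Property Grabber and WSDL Grabber components.
    protected abstract void forward(String serviceListPageLink, String servicePageLink);

    // The nested loop: over all service list pages, and over all services per page.
    public void crawl(String seedUrl) {
        String listPageLink = seedUrl;
        while (listPageLink != null) {
            String html = fetchHtml(listPageLink);
            for (String servicePageLink : extractServicePageLinks(html)) {
                forward(listPageLink, servicePageLink);
            }
            listPageLink = findNextListPageLink(html);
        }
    }
}
```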

3211 Features of the Web Service Extractor Component

The main features are described in the following paragraphs.

1) Obtain service list page links
A central task is to obtain the corresponding service list page links. A service list page link is the URL address of a page that includes a public list of Web Services together with some simple information about these Web services, like the name of the service, an internal URL that links to another page containing the detailed information about that service, and sometimes the link address of the WSDL document.

2) Obtain service page links
Once service list page links are found, a crucial aspect is to extract the internal link of each service to aid the task of service information discovery. Therefore it is the task of the Web Service Extractor to harvest the HTML page content of the service list page, so that the service page links, which lead to much more detailed information about the single Web service, can be obtained.
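As a concrete counterpart to the hypothetical fetchHtml helper used in the loop sketch above, the following Java method shows one simple way to harvest the HTML content of a page using only the standard JDK. It is a sketch under the assumption that plain HTTP access is sufficient; the real implementation may use a different HTTP client or HTML parsing library.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class PageFetcher {

    // Fetches the HTML content of the given page as one string.
    public static String fetchHtml(String pageUrl) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(pageUrl).openConnection();
        conn.setRequestMethod("GET");
        conn.setConnectTimeout(10000);
        conn.setReadTimeout(10000);
        StringBuilder html = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                html.append(line).append('\n');
            }
        }
        return html.toString();
    }
}
```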

3212 Input of the Web Service Extractor Component

This component depends on specific input seeds. The only input required for this component is a URL seed. This URL seed is one of the URLs displayed in section 313.

3213 Output of the Web Service Extractor Component

The component produces two kinds of service-related page links from the Web:
- Service list page links
- Service page links


3214 Demonstration for Web Service Extractor

In order to give a comprehensive understanding of the process of the Web Service Extractor component, some figures are given below for explanation. Though there are five URL addresses, only the URL of the Service-Repository is shown as an example.

1) The input seed is the initial URL address of the Service-Repository, which is "http://www.service-repository.com".

2) As already said in section 321, the first service list page link of this Web Service Registry is its input seed "http://www.service-repository.com". Figure 3-3 shows the corresponding service list page of that link.

Figure 3-3: Service list page of the Service-Repository

Figure 3-4: Original source code of the internal link for the Web service "BLZService"

Figure 3-5: Code overview of getting the service page link in the Service-Repository

Figure 3-6: Service page of the Web service "BLZService"


3) Now that the service list page link is known, the next step is to acquire the service page links of the services listed in the service list page. The text in the red box of figure 3-4 shows the internal link of the Web service "BLZService". However, this is not the complete link of the service page: it has to be prefixed with the initial URL address of the Service-Repository Web Service Registry, "http://www.service-repository.com". The code for getting the service page link of a Web service in the Service-Repository is shown in figure 3-5. Therefore the final link of this service page is "http://www.service-repository.com/service/overview-210897616". Figure 3-6 is the corresponding service page of that link.

4) Afterwards, those two links, the service list page link and the service page link, which are gathered by the Web Service Extractor component, are immediately forwarded to the next two components, the WSDL Grabber component and the Property Grabber component.

322 The Function of WSDL Grabber Component

The WSDL Grabber component acquires the WSDL link hosted on the Web based on the service list page link or the service page link. The whole process flow is illustrated in figure 3-7.

Figure 3-7: Overview of the process flow of the WSDL Grabber Component

When the WSDL Grabber component receives the inputs delivered from the previous component, it starts to get the WSDL link for the service based on these inputs. Although the inputs of the WSDL Grabber component are the links of the service page and the service list page, only one of them contains the WSDL link for the corresponding service. That is to say, the WSDL link exists either in the service page or in the service list page. The reason why both links need to be delivered to this component is that only one of the five Web Service Registries, namely Biocatalogue, hosts the WSDL link in the service list page, while for the other four Web Service Registries the WSDL link is hosted in the service page. Therefore, the WSDL link for these four Web Service Registries is obtained via the service page link, and for the Biocatalogue Web Service Registry the WSDL link is obtained through the service list page link. However, there is a problem with getting the WSDL links of the Biocatalogue Web Service Registry. In brief, some of the Web services listed in


the service list page of the Biocatalogue Web Service Registry do not have a WSDL link; in other words, these services do not have a WSDL document. In such a situation, the value of the WSDL link for these Web services is assigned a "NULL" value. Nevertheless, for the Web Services in the other four Web Service Registries, the WSDL link always exists in the service page. Eventually, whenever the WSDL Grabber component extracts the WSDL link for a single Web service, this link is immediately forwarded to the Storage component for downloading the WSDL document.

3221 Features of the WSDL Grabber Component

The WSDL Grabber component has to provide the following feature:
- Obtain WSDL links
The WSDL link is the direct way to reach the page that contains the content of the WSDL document. It is actually a URL address whose ending contains something like "wsdl" or "WSDL" to indicate that this is an address which points to the page of a WSDL document.

3222 Input of the WSDL Grabber Component

This component requires the following input data:
- Service list page link
- Service page link

3223 Output of the WSDL Grabber Component

The component only produces the following output data:
- The URL address of the WSDL link for each service

3224 Demonstration for WSDL Grabber Component

This section presents a list of figures in order to give a comprehensive understanding of the process of the WSDL Grabber component. It uses the same Web service of the Service-Repository as an example.

1) The input for this WSDL Grabber component is the link of the service page obtained from the Web Service Extractor component. The address of this link is "http://www.service-repository.com/service/overview-210897616".

Figure 3-8: WSDL link of the Web service "BLZService" in the service page


2) Figure 3-9 shows the corresponding original source code of the WSDL link that is highlighted in figure 3-8.

Figure 3-9: Original source code of the WSDL link for the Web service "BLZService"

3) Figure 3-10 and figure 3-11 show the code used to extract the WSDL link shown in figure 3-9. However, figure 3-10 is the particular code for the Service-Repository Web Service Registry only; for the other four Web Service Registries this code is different. The function "getServiceRepositoryWSDLLink" first gets a list of nodes that have the HTML tag name "b". Then it checks all of the nodes one by one to see whether the text value of a node is "WSDL". If a node fulfills this condition, then the attribute value of its sibling, which is an "a" element here, is extracted as the value of the WSDL link for this Web service.

Figure 3-10: Code overview of the "getServiceRepositoryWSDLLink" function

Figure 3-11: Code overview of the "oneParameter" function

4) Finally, after applying the functions shown in figures 3-10 and 3-11, the WSDL link for the Web service "BLZService" can be obtained. Its value is "http://services.unitedplanet.de/blz/BlzService.asmx?WSDL".
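Since figures 3-10 and 3-11 are only shown as screenshots, the following Java sketch reproduces the described logic in a simplified form: it scans the HTML of the service page for a <b> element whose text is "WSDL" and extracts the href attribute of the neighbouring <a> element. The regular-expression approach is an assumption made for this sketch; the actual implementation works on a parsed node list as described above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WsdlLinkExtractor {

    // Looks for a pattern like: <b>WSDL</b> ... <a href="...">
    // and returns the value of the href attribute, or null if not found.
    public static String extractWsdlLink(String servicePageHtml) {
        Pattern p = Pattern.compile(
                "<b>\\s*WSDL\\s*</b>.*?<a[^>]*href=\"([^\"]+)\"",
                Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
        Matcher m = p.matcher(servicePageHtml);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<tr><td><b>WSDL</b></td>"
                + "<td><a href=\"http://services.unitedplanet.de/blz/BlzService.asmx?WSDL\">link</a></td></tr>";
        System.out.println(extractWsdlLink(html));
    }
}
```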


323 The Function of Property Grabber Component

The Property Grabber component is a module which is used to extract and gather the Web service information hosted on the Web, namely the information shown in the service list page and the service page. In the end, all the obtained Web service information is collected together as the service properties, which are delivered to the Storage component for storing. The detailed process flow of the Property Grabber component is illustrated in figure 3-12.

The inputs of the Property Grabber component are the same as the inputs of the WSDL Grabber component. However, there is still a little difference between them with respect to the seed. As already mentioned in section 322, for the WSDL Grabber component only one of the inputs is sufficient to get the WSDL link, while the Property Grabber component needs both of them as seeds. Therefore, once the Property Grabber component receives the needed inputs, it starts to extract the information about that single Web service.

Figure 3-12: Overview of the process flow of the Property Grabber Component

After the Property Grabber component receives the inputs, it starts to extract the service information for the Web service. Generally speaking, the service information consists of four aspects: structured information, endpoint information, monitoring information and whois information.

(1) Structured Information
The structured information can be obtained by extracting the information hosted in the service page and the service list page. It is the basic descriptive information about the service, such as the name of the service, the URL address through which the WSDL document can be obtained, the description introducing the service, the provider who provides that service, its rating, the server which owns this service, etc. However, the elements constituting this


structured information differ among the Web Service Registries. For example, rating information for a Web service exists in the Service-Repository Web Service Registry, while the Xmethods Web Service Registry does not have this information. In addition, even for Web services within the same Web Service Registry, some elements of the structured information may not exist. For instance, one service in a Web Service Registry may have a description, while another service in the same Web Service Registry does not. Tables 3-1 to 3-5 show the structured information that should be extracted for these five Web Service Registries. Moreover, if the style of a service in the Biocatalogue Web Service Registry is SOAP, there is some additional information describing the SOAP operations of this service, while if it is REST, this additional information describes the REST operations. These should also be considered as part of the structured information. Table 3-6 and table 3-7 illustrate the information for these two different kinds of operations.

Service Name WSDL Link WSDL Version

Provider Server Rating

Homepage Owner Homepage Description

Table 3-1: Structured Information of the Service-Repository Web Service Registry

Service Name WSDL Link Provider

Service Style Homepage Implementation Language

Description User Description Contributed Client Name

Type of this Client Publisher for this client Used Tookit of this client

Used Language of this client Used Operation System of this client

Table 3-2: Structured Information of the Xmethods Web Service Registry

Service Name WSDL Link Server

Provider Providerrsquos Country Service Style

Rating Description User Description

Service Tags Documentation (within WSDL)

Table 3-3: Structured Information of the Seekda Web Service Registry

Service Name WSDL Link Port Name

Service URL Address Implementation Class

Table 3-4: Structured Information of the Ebi Web Service Registry

Service Name WSDL Link Style

Provider Providerrsquos Country View Times

Favorite Times Submitter Service Tags

Total Annotation Provider Annotation Member Annotation

Registry Annotation Base URL SOAP Lab Server Base URL

Description User Description Category

Table 3-5: Structured Information of the Biocatalogue Web Service Registry


SOAP Operation Name Inputs and Outputs Operation Description

Operation Tags Part of Which Service

Table 3-6: SOAP Operation Information of the Biocatalogue Web Service Registry

REST Operation Name Service Tags Used Template

Operation Description Part of Which Service Part of Which Endpoint Group

Table 3-7 REST Operation Information of Biocatalogue Web Service Registry

(2) Endpoint Information
Endpoint information is the port information of the Web service, which can be extracted only from the service page. Since different Web Service Registries structure the endpoint information differently, some elements of the endpoint information can be very diverse. One thing should be noted: the Ebi Web Service Registry does not have endpoint information for any of the Web services published in this registry. Moreover, even though the Web services in the same Web Service Registry have the same structure of endpoint information, some elements of the endpoint information may be missing or empty. Furthermore, these Web Service Registries may even have no endpoint information at all for some Web services published by them. Nevertheless, whenever there is endpoint information for a Web service, there is at least one element, namely the URL address of the endpoint. Table 3-8 shows the endpoint information that should be extracted for these five Web Service Registries.

Web Service Registry Name | Elements of the Endpoint Information
Service-Repository | Endpoint Name, Endpoint URL, Endpoint Critical, Endpoint Type, Bound Endpoint
Xmethods | Endpoint URL, Publisher of this Endpoint, Contact Email of this Publisher, Implementation Language of this Endpoint
Seekda | Endpoint URL
Biocatalogue | Endpoint Name, Endpoint URL

Table 3-8: Endpoint Information of these five Web Service Registries

Web Service Registry Name | Elements of the Monitoring Information
Service-Repository | Service Availability, Number of Downs, Total Uptime, Total Downtime, MTTR, MTBF, RTT Max of Endpoint, RTT Min of Endpoint, RTT Average of Endpoint, Ping Count of Endpoint
Seekda | Service Availability, Begin Time of Monitoring
Biocatalogue | Monitored Status of Endpoint, Overall Status of Service

Table 3-9: Monitoring Information of these five Web Service Registries

(3) Monitoring Information

Monitoring information is the tested statistic information for the Web service But it is worth

Deep Web Service Crawler

36

noting that Ebi and Xmethods Web Service Registries do not have the monitoring information of

all these Web services published by them While for other three Web Service Registries only a

few of Web services may not have this information Table 3-12 displays the monitoring

information for these three Web Service Registries

(4) Whois Information
Whois information is not extracted from the information hosted in the service page or the service list page; it is descriptive information about the service domain and is gained by means of the address of the WSDL link. Because of that, the process of getting the Whois information starts with obtaining the service domain. The final value of the service domain must not contain strings such as "http", "https" or "www"; it has to be reduced to the registrable domain directly below the top level domain (a small illustrative sketch of this step is given after table 3-10). After that, the service domain database is queried by sending the value of the service domain to a Whois client, which is simply a Web site on the Internet, for example "http://www.whois365.com/cn/domain". If information about that service domain exists, a list of its information is returned as the output. However, the structure of the returned information differs from one service domain to another; therefore the most challenging part is that the extraction process has to deal with each different form of the returned information. Table 3-10 gives the Whois information that needs to be extracted for all these five Web Service Registries.

Service Domain URL, Domain Name, Domain Type, Domain Address, Domain Description, State, Postal Code, City, Country, Country Code, Phone, Fax, Email, Organization, Established Time
Table 3-10 Whois Information for these five Web Service Registries
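The following is a minimal sketch of how the service domain could be derived from a WSDL link before it is handed to the Whois client; the class and method names as well as the simple suffix handling are illustrative assumptions rather than the actual implementation.

    import java.net.URI;

    public class DomainExtractor {

        // Illustrative helper: reduce a WSDL link to its registrable service domain,
        // i.e. strip the protocol, the "www" prefix, the port and the path.
        public static String getServiceDomain(String wsdlLink) throws Exception {
            String host = new URI(wsdlLink).getHost();        // e.g. "www.thomas-bayer.com"
            if (host.startsWith("www.")) {
                host = host.substring(4);                      // drop the leading "www."
            }
            // Keep only the last two labels; this simplification ignores
            // multi-part suffixes such as "co.uk".
            String[] labels = host.split("\\.");
            int n = labels.length;
            return n <= 2 ? host : labels[n - 2] + "." + labels[n - 1];
        }

        public static void main(String[] args) throws Exception {
            // Prints "thomas-bayer.com" for the example service used in this chapter.
            System.out.println(getServiceDomain(
                    "http://www.thomas-bayer.com/axis2/services/BLZService?wsdl"));
        }
    }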

Finally, all the information of these four aspects is collected together and then delivered to the Storage component for further storage processing.

3231 Features of the Property Grabber Component
The Property Grabber component has to provide the following features:
- Obtain basic information
Generally speaking, the more information is available about a Web service, the better one can judge how good that Web service is. Hence this Property Grabber component has to extract all the basic information hosted in the service list page and the service page. This basic information consists of the structured information, the endpoint information and the monitoring information.
- Obtain Whois information
Since more information about a Web service allows a better assessment of its quality, it is necessary to extract as much information about the Web service as possible. Therefore, in addition to the basic information, the Property Grabber component also obtains additional information called Whois information, such as the type of the domain, the name of the person in charge, the postal code of the domain, the city, phone, fax, detailed address, email, etc.


3232 Input of the Property Grabber Component
This component requires the following input data:
- Service list page link
- Service page link
3233 Output of the Property Grabber Component
The component will produce the following output data:
- Structured information of each service
- Endpoint information of each service, if it exists
- Monitoring information for the service and its endpoint, if it exists
- Whois information of the service domain
All these pieces of information are collected together as the properties of each service. Thereafter the collected properties are sent to the Storage component.

3234 Demonstration for Property Grabber Component
The pictures from figure 3-13 to figure 3-16 depict the fundamental procedure of the Property Grabber component. To simplify the explanation, the example shown in this section uses the same Web service as before.
1) The inputs of the Property Grabber component are the links of the service list page and the service page received from the Web Service Extractor component. These links are "http://www.service-repository.com/" and "http://www.service-repository.com/service/overview/-210897616".
2) When the "GetProperty" function is triggered, it calls its four sub functions shown in figure 3-12.
Figure3-13 Structure properties of the Service "BLZService" in service list page


Figure3-14 Structure properties of the Service "BLZService" in service page

3) Firstly, the "getStructuredProperty" function tries to extract the structured information displayed in the red boxes of figure 3-13 and figure 3-14. However, several elements of the structured information have the same content, for example the description shown in the service page and in the service list page. Hence, in order to save time during the extraction process and space during the storing process, elements with the same content are extracted only once. Moreover, the rating information needs a transformation from non-descriptive to descriptive text, because its content is a set of star images. The final results of the extracted structured information for this Web service are shown in table 3-11. Because there is no descriptive information for the Provider, Homepage and Owner Homepage, their values are assigned as "NULL".

Service Name: BLZService
WSDL Link: http://www.thomas-bayer.com/axis2/services/BLZService?wsdl
WSDL Version: 0
Server: Apache-Coyote/1.1
Description: BLZService
Rating: Four stars and A Half
Provider: NULL
Homepage: NULL
Owner Homepage: NULL
Table 3-11 Extracted Structured Information of Web Service "BLZService"

4) Secondly, the "getEndpointProperty" function extracts the endpoint information displayed in the red box of figure 3-15. Although there is more than one endpoint record in the list, only one of them is extracted as the endpoint information. The reason is that this master program is intended to extract as much information as possible while avoiding redundant information; therefore only one record is taken as the endpoint information, even if there are several endpoint records. Table 3-12 shows the final results of the endpoint information of this Web service.

Figure3-15 Endpoint information of the Web service "BLZService" in service page


Endpoint Name: BLZServiceSOAP12port_http
Endpoint URL: http://www.thomas-bayer.com:80/axis2/services/BLZService
Endpoint Critical: True
Endpoint Type: production
Bound Endpoint: BLZServiceSOAP12Binding
Table 3-12 Extracted Endpoint Information of the Web service "BLZService"

5) Then the monitoring information is extracted by invoking the function "getMonitoringProperty". Figure 3-16 displays two types of monitoring properties: the red box above contains the monitoring information about the Web service, and the red box below lists the monitoring information for its endpoints. As already mentioned, only one endpoint statistic record is extracted. Besides, as can be seen from figure 3-16, there are two availability values; both represent the availability of this Web service, just like the availability shown in figure 3-14, so one of them is sufficient. Table 3-13 shows the final results of this extracting process.
Figure3-16 Monitoring Information of the Service "BLZService" in service page

Service Availability: 100
Number of Downs: 0
Total Uptime: 1 day 19 hours 19 minutes
Total Downtime: 0 second
MTBF: 1 day 19 hours 19 minutes
MTTR: 0 second
RTT Max of Endpoint: 141 ms
RTT Min of Endpoint: 0 ms
RTT Average of Endpoint: 577 ms
Ping Count of Endpoint: 112
Table 3-13 Extracted Monitoring Information of the Web service "BLZService"

6) Next, in order to extract the Whois properties, the "getWhoisProperty" function first has to gain the service domain from the WSDL link. For this Web service the gained service domain is "thomas-bayer.com". It then sends this service domain as input to the Whois client for the querying process, after which a list of information about that service domain is returned, see figure 3-17. Table 3-14 shows the extracted Whois information.


Figure3-17 Whois Information of the service domain "thomas-bayer.com"

Service Domain URL: thomas-bayer.com
Domain Name: Thomas Bayer
Domain Type: NULL
Domain Address: Moltkestr. 40
Domain Description: NULL
State: NULL
Postal Code: 54173
City: Bonn
Country: NULL
Country Code: DE
Phone: +4922855525760
Fax: NULL
Email: info@predic8.de
Organization: predic8 GmbH
Established Time: NULL
Table 3-14 Extracted Whois Information of service domain "thomas-bayer.com"

7) Finally, all the information of these four aspects is collected together as the service properties, and these service properties are then forwarded to the Storage component.

324 The Function of Storage Component
The WSDL link received from the WSDL Grabber component is used by the Storage component to download the WSDL document from the Web and to store it on the disk. In addition, the service properties received from the Property Grabber component are stored on the disk in three different manners by this Storage component. Figure 3-18 illustrates the process flow of the Storage component.
When the Storage component receives the needed inputs, its mediator function "Storager" is triggered. It transforms the service properties into three different output formats and stores them on the disk; these output formats are an XML file, an INI file and database records. Besides, it also tries to download the WSDL document from the URL address of the WSDL link and then stores the obtained WSDL document on the disk as well. The "Storager" function is composed of four sub functions, namely the "getWSDL", "generateXML", "generateDatabase" and "generateINI" sub functions, and each sub function is in charge of one aspect of the storage tasks.


Figure3-18 Overview the process flow of the Storage Component

(1) "getWSDL" sub function
The task of the "getWSDL" sub function is to download the WSDL document and then store it on the disk. Above all, it has to get the content of the WSDL document, which is done as follows. First, the "getWSDL" sub function checks whether the value of the received WSDL link equals "NULL". As already presented in section 322, if the Web service does not have a WSDL link, the value of its WSDL link is assigned "NULL". In that case a WSDL document is created whose name is the service name appended with the mark "No WSDL Document"; obviously this document does not contain any content, it is an empty document. If the service does have a WSDL link, this sub function tries to connect to the Internet with the URL address of the WSDL link. Once it succeeds, the content hosted on the Web is downloaded, stored on the disk and named only with the name of the service. Otherwise a WSDL document is created whose name is prefixed with "Bad".
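A minimal sketch of this download step is given below. It follows the naming conventions described above, but the class name, the parameter list and the details of the file naming are illustrative assumptions, not the actual implementation.

    import java.io.BufferedReader;
    import java.io.FileWriter;
    import java.io.InputStreamReader;
    import java.net.URL;

    public class WsdlDownloader {

        // Illustrative sketch: fetch the WSDL content of one service and store it on disk.
        public static void getWSDL(String path, String serviceName, String wsdlLink) {
            try {
                if (wsdlLink == null || wsdlLink.equals("NULL")) {
                    // No WSDL link: create an empty document marked accordingly.
                    new FileWriter(path + serviceName + "[No WSDL Document].wsdl").close();
                    return;
                }
                BufferedReader in = new BufferedReader(
                        new InputStreamReader(new URL(wsdlLink).openStream(), "UTF-8"));
                FileWriter out = new FileWriter(path + serviceName + ".wsdl");
                String line;
                while ((line = in.readLine()) != null) {
                    out.write(line + "\n");
                }
                in.close();
                out.close();
            } catch (Exception e) {
                // Link not reachable or download failed: mark the document as bad.
                try {
                    new FileWriter(path + "Bad" + serviceName + ".wsdl").close();
                } catch (Exception ignored) { }
            }
        }
    }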

(2) "generateXML" sub function
The "generateXML" sub function takes the service properties as input, transforms them into an XML file and stores it on the disk under the name of the service plus ".xml". XML stands for eXtensible Markup Language, a markup language designed to transport and store data. The first line of an XML file is the XML declaration, which defines the XML version and the encoding used. For example, the declaration <?xml version="1.0" encoding="UTF-8"?> means that this is an XML file whose version is 1.0 and whose encoding is the 8-bit Unicode Transformation Format character set. Besides, an XML file contains XML elements, where an element spans everything from its start tag to its end tag. An XML element can also contain other elements, simple text or a mixture of both. However, an XML file must contain a root element as the parent of all other elements.
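As an illustration, the generated file for the example service could look roughly like the following fragment. The root element "service" is the one described in section 44, while the comment text and the child element names are only assumptions made for this sketch.

    <?xml version="1.0" encoding="UTF-8"?>
    <!-- service properties collected by the Deep Web Service Crawler -->
    <service>
        <name>BLZService</name>
        <wsdlLink>http://www.thomas-bayer.com/axis2/services/BLZService?wsdl</wsdlLink>
        <rating>Four stars and A Half</rating>
    </service>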

(3) "generateINI" sub function
The "generateINI" sub function also takes the service properties as input, but it transforms them into an INI file and stores it on the disk under the name of the service plus ".ini", where "ini" stands for initialization. The INI file format is a de facto standard for configuration files; such files are simple text files with a basic structure. Generally speaking, an INI file contains three different kinds of parts: sections, parameters and comments. The parameter is the basic element contained in an INI file; its format is a key-value pair, also called a name-value pair, delimited by an equals sign "=", where the key or name always appears to the left of the equals sign. A section is more like a room that groups its parameters together; it always appears on a single line within a pair of square brackets "[]", and sections may not be nested. A comment is descriptive text which begins with a semicolon ";"; anything between the semicolon and the end of the line is ignored.
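An illustrative fragment of such a file is shown below; the actual output format is presented later in figure 4-5, and the comment lines, the section name and the property keys used here are assumptions.

    ; Deep Web Service Crawler output
    ; Registry: Service Repository
    ; Service: BLZService
    [service]
    Service Name=BLZService
    WSDL Link=http://www.thomas-bayer.com/axis2/services/BLZService?wsdl
    Rating=Four stars and A Half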

(4) "generateDatabase" sub function
The inputs of the "generateDatabase" sub function are the same as for the previous two sub functions. Instead of transforming them into a file like XML or INI, the "generateDatabase" sub function turns them into data of the database by using SQL statements. SQL stands for Structured Query Language, a database language designed for accessing and manipulating data in a database. Most of the actions performed on a database are done with SQL statements, and the primary statements of SQL include insert into, delete, update, select, create, alter and drop. For the purpose of transforming these service properties into database records, this sub function has to create a database first, using the "create database" statement, and then create a table to store the data. A table is a collection of related data entries and consists of columns and rows. Since the data for all these five Web Service Registries is not very large, one database table is enough for storing these service properties. Because of that, the field names of the service properties in the columns have to be uniform and well-defined for all these five Web Service Registries. Afterwards, the service properties of each single service can be put into the table as one record with the "insert into" statement of SQL.
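A condensed sketch of these SQL steps via JDBC is given below. The connection URL, the database and table names and the two example columns are assumptions; in the actual table every property column is of type Text, as described in the demonstration later in this section.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.Statement;

    public class ServiceDatabaseWriter {

        public static void main(String[] args) throws Exception {
            // Illustrative sketch: connection URL, table name and columns are assumptions.
            Connection con = DriverManager.getConnection(
                    "jdbc:mysql://localhost:3306/", "user", "password");
            Statement stmt = con.createStatement();

            // Create the database and one table; every property column uses type TEXT.
            stmt.executeUpdate("CREATE DATABASE IF NOT EXISTS services");
            stmt.executeUpdate("CREATE TABLE IF NOT EXISTS services.service_properties ("
                    + "id INT AUTO_INCREMENT PRIMARY KEY, "
                    + "service_name TEXT, wsdl_link TEXT)");

            // Insert the properties of one service as a single record.
            PreparedStatement insert = con.prepareStatement(
                    "INSERT INTO services.service_properties (service_name, wsdl_link) VALUES (?, ?)");
            insert.setString(1, "BLZService");
            insert.setString(2, "http://www.thomas-bayer.com/axis2/services/BLZService?wsdl");
            insert.executeUpdate();

            insert.close();
            stmt.close();
            con.close();
        }
    }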

3241 Features of the Storage Component
The Storage component has to provide the following features:
- Generate different output formats
The final result of this master program is to store the information of the services on the disk for future work. This Storage component provides three different formats for storing the service properties of the services published in the Web Service Registries, which makes the storage of the services very flexible and also ensures longevity.
- Obtain the WSDL document
The important task of this component is to check whether the WSDL document can be obtained from the WSDL link, because the WSDL document plays a decisive role in determining the quality of the service. This Storage component provides the ability to deal with the different situations that may occur while obtaining the WSDL document.


3242 Input of the Storage Component
This component requires the following input data:
- WSDL link of each service
- The property information of each service
3243 Output of the Storage Component
The component will produce the following output data:
- WSDL document of the service
- XML document, INI file and table records in the database
3244 Demonstration for Storage Component
The following figures describe the fundamental implementation code of this Storage component.

The detailed depiction is given below.
1) As can be seen from figure 3-19 to figure 3-21, there are several common places among the implementation codes. The first common place concerns the parameters defined in each of these sub functions, which are "path" and "SecurityInt". The parameter "path" is the absolute path on the computer disk; it is used for the storing procedure and has already been specified by the user at the beginning of the whole program. The parameter "SecurityInt" is an increasing Integer which is used as a part of the name of the service; the reason for this is that it prevents services with the same name from overriding each other on the disk. The content of the red marks in the code of these figures is the second common place; its function is to create a file or document for the service with the corresponding parameters.
2) Figure 3-19 displays the implementation code of the "getWSDL" sub function, which is designed to get the WSDL document of the service based on its WSDL link. The parameter "name" is the name of the service and is used as the name of the WSDL document. The parameter "linkStr" is the actual WSDL link, which is the most important one for this sub function. The other two parameters, "statistic" and "log", are the objects of the text files called "Statistic Information" and "Log Information" respectively. The "Statistic Information" text file records the statistic data of the services in each Web Service Registry, such as the overall number of properties for that Web Service Registry, the overall number of services, the number of services that have no WSDL link, the number of services that contain no content, the number of services whose WSDL links are not available, etc. The "Log Information" text file records the results of the process steps and the problems encountered, for example which service is being crawled at the moment, which Web Service Registry it belongs to, the reason why the WSDL document of a service could not be obtained, the reason why a service is unreachable, and so on.


Figure3-19 Implementation code for getting WSDL document
3) Figure 3-20 and figure 3-21 show the code for turning the service properties into the XML file and the INI file, and for storing those two files on the disk thereafter. The parameter "vec" is a Vector of the "PropertyStruct" data type; this "PropertyStruct" data type is an object of a class that consists of two variables, name and value.
Figure3-20 Implementation code for generating XML file


Figure3-21 Implementation code for generating INI file
4) The code in figure 3-22 and figure 3-23 shows the process that turns the service properties of all services in these five Web Service Registries into records of the database. Therefore a database has to be created first. The name of the database can be arbitrary as long as it conforms to the naming rules of the database, and the same holds for the name of the table. Figure 3-22 displays the procedure of creating a table in the database; because it is hard to decide on the length of each service property, the data types of all service property columns are set to "Text". Figure 3-23 is the code for inserting the service properties into the table with the "update" statement.

Figure3-22 Implementation code for creating table in database


Figure3-23 Implementation code for generating table records

33 Multithreaded Programming for DWSC
Multithreaded programming is a built-in characteristic of the Java language. A multithreaded program contains two or more separate parts that can execute concurrently; each such part is called a thread. The use of multithreading makes it possible to create programs that use the system resources efficiently, for example by keeping the idle time of the CPU to a minimum.
In this master program there are five Web Service Registries whose services need to be crawled. Moreover, the number of services published in each Web Service Registry is quite different, so the running time spent on each Web Service Registry differs as well. With sequential execution, a Web Service Registry that owns only a few services would have to wait until another Web Service Registry with far more services finishes. Therefore, in order to reduce this waiting time and to maximize the use of the system resources, it is necessary to apply multithreaded programming to this master program. That is to say, this master program creates one thread for each Web Service Registry, and these threads are executed independently.
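A minimal sketch of this design is given below; the class and method names are assumptions, and the per-registry crawling logic is only indicated by a placeholder.

    public class RegistryCrawlerLauncher {

        public static void main(String[] args) {
            // One thread per Web Service Registry, executed independently.
            String[] registries = {"Service Repository", "Ebi", "Xmethods", "Seekda", "Biocatalogue"};
            for (final String registry : registries) {
                new Thread(new Runnable() {
                    public void run() {
                        // crawlRegistry(...) stands for the whole per-registry process:
                        // extracting services, grabbing WSDL links and properties, storing them.
                        crawlRegistry(registry);
                    }
                }, registry).start();
            }
        }

        private static void crawlRegistry(String registryName) {
            // Placeholder for the real crawling logic of one registry.
            System.out.println("Crawling " + registryName);
        }
    }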

34 Sleep Time Configuration for Web Service Registries
Since this master program is intended to download the WSDL documents and to extract the service information of the Web services published in the Web Service Registries, it inevitably affects the performance of these registries. In addition, in order not to exceed their throughput capability, these Web Service Registries restrict the rate of access. Because of that, unknown errors sometimes occur while this master program is executing: for instance, the master program may halt at one point without getting any more WSDL documents and service information, the WSDL documents of some services of some Web Service Registries cannot be obtained, or some service information is missing. Therefore, in order to obtain the largest possible number of Web services published in these five Web Service Registries without affecting their throughput, the access rate has to be configured for each service of all Web Service Registries.
In consequence, before going into the essential procedure for each single service of these Web Service Registries, this master program calls the system's built-in function "sleep(long milliseconds)". It is a public static function which causes the currently executing thread to sleep for the specified number of milliseconds; in other words, it temporarily ceases execution for a while. The following table shows the time interval of the sleep function for each Web Service Registry.

Web Service Registry Name    Time Interval (milliseconds)
Service Repository           8000
Ebi                          3000
Xmethods                     10000
Seekda                       20000
Biocatalogue                 10000
Table 3-15 Sleep Time of these five Web Service Registries
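A small sketch of how such an interval might be applied before each single service is processed is shown below; the service page links and the method name are hypothetical.

    import java.util.Arrays;
    import java.util.List;

    public class ThrottledCrawl {

        public static void main(String[] args) throws InterruptedException {
            // Per-registry interval taken from Table 3-15 (here: Seekda, 20000 ms).
            long sleepInterval = 20000;
            List<String> servicePageLinks = Arrays.asList(
                    "http://example.org/service1", "http://example.org/service2");
            for (String link : servicePageLinks) {
                Thread.sleep(sleepInterval);   // throttle the access rate of this registry
                processService(link);          // hypothetical per-service crawling step
            }
        }

        private static void processService(String servicePageLink) {
            System.out.println("Processing " + servicePageLink);
        }
    }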


4 Experimental Results and Analysis
This chapter presents the quantitative experimental results of the prototype described in chapter 3, together with an analysis and explanation of these results. In order to gain rather accurate results, the experiments were carried out more than five times; all data displayed in the following tables and charts are the averages of these runs.

41 Statistic Information for Different Web Service Registries
This section discusses the amount statistics of the Web services published in these five Web Service Registries. They comprise the overall number of Web services published in each Web Service Registry and the number of unavailable Web services, which have been archived because they may not be active anymore or are close to being non-active. Table 4-1 shows the service amount statistics of these five Web Service Registries.

Web Service Registry Name    Overall Services    Unavailable Services
Service Repository           57                  0
Ebi                          289                 0
Xmethods                     382                 0
Seekda                       853                 0
Biocatalogue                 2567                125
Table4-1 Service amount statistic of these five Web Service Registries

Nevertheless, in order to give an intuitive view of the service amount statistics in these five Web Service Registries, the data of table 4-1 are also presented as a bar chart in figure 4-1. As can be seen from the bar chart, on the one hand the overall number of Web services increases steadily from the Service Repository Web Service Registry to the Biocatalogue Web Service Registry. The Biocatalogue Web Service Registry owns by far the largest number of Web services, which indicates that it has a much more powerful ability to provide Web services to the users, because it contains far more services than the other four Web Service Registries. On the other hand, there are no unavailable services in any Web Service Registry except for the Biocatalogue Web Service Registry; that is to say, only the Biocatalogue Web Service Registry contains Web services that cannot be used by the users. To some degree this is useless, because these services cannot be used anymore and only waste network resources on the Web. Therefore, all these unavailable services should be eliminated in order to reduce this waste of network resources.
Figure4-1 Service amount statistic of these five Web Service Registries

42 Statistic Information for WSDL Document

Web Service Registry Name    Failed WSDL Links    Without WSDL Links    Empty Content
Service Repository           1                    0                     0
Ebi                          0                    0                     0
Xmethods                     23                   0                     2
Seekda                       145                  0                     0
Biocatalogue                 32                   16                    2
Table4-2 Statistic information for WSDL Document

Table 4-2 and figure 4-2 show the error statistics for the WSDL documents of the Web services in these five Web Service Registries. There are three aspects. The first one is the "Failed WSDL Links" count of the Web services in these Web Service Registries; it is the overall number of Web services whose WSDL links are invalid. In other words, it is impossible to get the WSDL documents of these Web services from the URL addresses of their WSDL links, and therefore no WSDL document is created for them. The second aspect is the "Without WSDL Links" count; it is the overall number of Web services in each Web Service Registry that have no WSDL link at all. For such Web services the value of the WSDL link is "NULL"; a WSDL document is still created, but it contains no content and its name includes the string "[No WSDL Document]". The third aspect is the "Empty Content" count, which represents the overall number of Web services whose WSDL links and URL addresses are valid but whose WSDL documents contain no content. In this case a WSDL document whose name contains the string "(BAD)" is created.

Figure4-2 Statistic information for WSDL Document

43 Comparison of Different Average Number of Service Properties
This section compares the average number of service properties in these five Web Service Registries. This average number of service properties is calculated by means of the following equation:

    ASP = ONSP / ONS    (1)

Where
ASP is the average number of service properties for one Web Service Registry,
ONSP is the overall number of service properties in one Web Service Registry,
ONS is the overall number of Web services that have already been crawled from the corresponding Web Service Registry.
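For instance, assuming hypothetically that about 1300 service properties in total (ONSP) were gathered for the 57 crawled services (ONS) of the Service Repository registry, equation (1) would give ASP = 1300 / 57, which is approximately 23 and corresponds to the value shown for this registry in figure 4-3.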

Figure 4-3 shows the average number of service properties per Web service in these five Web Service Registries. As already mentioned, one of the measurements for assessing the quality of the Web services in a Web Service Registry is the service information: the more information is available about a Web service, the better that service is known, and consequently the corresponding Web Service Registry can offer a better quality of Web services to its users. As seen from figure 4-3, the Service Repository and Biocatalogue Web Service Registries own a larger number of service properties than the other three Web Service Registries. This directly reflects that these two Web Service Registries provide more detailed information about the Web services published in them, so that users can choose the services they need more easily and are also more likely to use the Web services published in these two registries. By contrast, the Xmethods and Seekda


Web Service Registries, which offer less service information about their Web services, provide a lower quality for these Web services; therefore users may be less inclined to use the Web services provided in these two Web Service Registries, not to mention the Web services published in the Ebi Web Service Registry.

Figure4-3 Average Number of Service Properties

From the description presented in section 323, the causes of the different numbers of service properties in these Web Service Registries may consist of the following points. First, the amount of structured information differs between these five Web Service Registries, and part of the information of some Web services in one Web Service Registry may be missing or have empty values; for example, the amount of structured information that is supposed to be extracted for the Web services in the Ebi Web Service Registry differs greatly from that in the Biocatalogue Web Service Registry. Secondly, there is a certain amount of endpoint information for almost all Web Service Registries except the Ebi Web Service Registry, which more or less reduces the overall number of service properties. Thirdly, some Web Service Registries, such as Xmethods and Ebi, do not have monitoring information, whereas the Service Repository Web Service Registry in particular has a large amount of monitoring information that can be extracted from the Web. The last point is the amount of Whois information for these Web services: if the database of the Whois client does not contain information about the service domain of a Web service, then no Whois information can be extracted, and even if there is information about the service domain, its amount can be very diverse. Therefore, if many service domains of the Web services in one registry have no or only little Whois information, then the average number of service properties in that registry decreases greatly.
As a result, in order to help users get better acquainted with the Web services provided in a Web Service Registry and to distinguish these Web services, each Web Service Registry should do its best to offer more and more information for each of its published Web services.


44 Different Outputs of Web Services
The fundamental task of this master program is to obtain the WSDL documents of the Web services hosted in the Web Service Registries, to extract and gather the properties of these Web services, and thereafter to store them on the disk. Therefore this section describes the different outputs of this master program, which include the WSDL documents of the Web services, the generated XML and INI files, and the data records of the service properties in the database.
Figure4-4 WSDL Document format of one Web service

The WSDL document of a Web service is simply read from the Web according to the URL address of its WSDL link, and these data are then stored on the disk as the WSDL document. The name of the WSDL document is the service name plus the ending ".wsdl", which marks it as a WSDL document. In order to distinguish WSDL documents that have the same name but different content, the name of each obtained WSDL document in one Web Service Registry additionally contains a unique Integer in front of it. Figure 4-4 shows a valid WSDL document of a Web service; its name is "1BLZService.wsdl".
Besides, the obtained service properties are transformed into an XML file, an INI file and data records in the database. Figure 4-5, figure 4-6 and figure 4-7 show these three different output formats respectively. As can be seen from figure 4-5, this is the INI file of the Web service, named "1BLZService.ini"; the Integer is the same as in the WSDL document, because both belong to the same Web service. The first three lines in that INI file are service comments, which start at the semicolon and run to the end of the line; they are basic information describing this INI file. The following line is the section, enclosed in a pair of brackets; it is important because it indicates that the lines behind it contain the information of this Web service. The rest of the lines are the actual service information, given as key-value pairs with an equals sign between key and value. Each service property is displayed from the beginning of the line.


Figure4-5 INI File format of one Web service

Figure4-6 XML File format of one Web service

Figure4-7 Database data format for all Web services


Furthermore, figure 4-6 shows the generated XML file of the Web service "BLZService"; its name is "1BLZService.xml", and it likewise belongs to the same Web service. Though the format of the XML file differs from that of the INI file, the essential contents are the same; that is to say, the values of the service properties do not differ, because both files are generated from the same collection of properties of the same Web service. The XML file also contains some comments, like the INI file, which are placed between "<!--" and "-->". The section of the INI file corresponds roughly to the root element of the XML file; therefore, the values of the elements below the root element "service" in this XML file are the values of the service properties of this Web service.
Finally, figure 4-7 shows the database table which is used to store the service information of all Web services in these five Web Service Registries. The entire service information of one Web service forms exactly one record in this table. Because of that, the column names of the table are the union of the names of the service information fields in each Web Service Registry. Since column names have to be unique, redundant names in this union must be eliminated; this is possible because the names of the service information fields are well-defined and uniform for all these five Web Service Registries. In addition, the first column of this table is the primary key, an increasing Integer whose function is similar to the Integer contained in the names of the XML and INI files. The remaining columns of the table are the corresponding service properties of the Web services. The value "NULL" or "Null" in this table indicates that this property of that Web service is empty or missing.

45 Comparison of Average Time Cost for Different Parts of Single Web Service
This section compares the average time cost of the different parts of getting one single Web service in all these five Web Service Registries. At first the average time cost of getting one single service in a Web Service Registry has to be calculated, which can be obtained through the following equation:

    ATC = OTS / ONS    (2)

Where
ATC is the average time cost for one single Web service,
OTS is the overall time cost of all the Web services in one Web Service Registry,
ONS is the overall number of Web services that have already been crawled from the corresponding Web Service Registry.

In addition, the average time cost for getting one single service is split into the following six parts: the average time cost for extracting the service properties, the average time cost for obtaining the WSDL document, the average time cost for generating the XML file, the average time cost for generating the INI file, the average time cost for inserting the service properties into the database table, and the average time cost for the remaining procedures, such as getting the service list page link, getting the service page link and so on. The average time cost for extracting the service properties is obtained by means of the following equation:

    ATCSI = OTSSI / ONS    (3)

Where

ATCSI is the average time cost for extracting service property of one single Web service

OTSSI is the overall time cost for extracting service property of all the Web services in one Web

Service Registry

ONS is the overall number of Web services that have already been crawled from the

corresponding Web Service Registry

The other parts are calculated analogously to the equation for the average time cost for extracting the service properties, while the average time cost for the remaining procedures equals the average time cost for one single Web service minus the sum of the average time costs of the other five parts.

Web Service Registry    Service Property    WSDL Document    XML File    INI File    Database    Others    Overall
Service Repository      8801                918              2           1           53          267       10042
Ebi                     699                 82               2           1           28          11        823
Xmethods                5801                1168             2           1           45          12        7029
Seekda                  5186                1013             2           1           41          23        6266
Biocatalogue            39533               762              2           1           66          1636      42000
Table4-3 Average time cost information (in milliseconds) for all Web Service Registries

Table 4-3 displays the average time cost of one Web service and of its different parts in all these five Web Service Registries. The first column of table 4-3 contains the names of these five Web Service Registries, and the last column is the average time cost of a single service in the corresponding Web Service Registry, while the remaining columns hold the average time costs of the six different parts. In order to give an intuitive view of the data in table 4-3, the data in each column of this table are also illustrated by the corresponding figures 4-8 to 4-13.
Figure4-8 Average time cost for extracting service property in all Web Service Registries


As can be seen from figure 4-8, the average time cost for extracting the service properties in the Biocatalogue Web Service Registry is 39533 milliseconds. This value is much larger than the values of the other four Web Service Registries, which are 8801, 699, 5801 and 5186 for Service Repository, Ebi, Xmethods and Seekda respectively. That is to say, it takes much longer to extract the service properties of the Web services published by the Biocatalogue Web Service Registry. This also indirectly indicates that the Biocatalogue Web Service Registry has the highest average number of service properties, as already discussed in section 43; on the contrary, the average number of service properties in the Ebi Web Service Registry is the smallest. Moreover, the average time cost in the Xmethods Web Service Registry is larger than that in the Seekda Web Service Registry, although the average number of service properties is the same for these two Web Service Registries. One cause that might explain why Xmethods costs more time than Seekda is that the process of extracting the service properties in the Xmethods Web Service Registry has to use both the service page and the service list page, while only the service page link is needed for the Seekda Web Service Registry.
The average time cost for obtaining the WSDL document in all these five Web Service Registries is displayed in figure 4-9. This average time cost is actually the sum of the average time for extracting the WSDL link of a Web service and the average time for reading the data of the WSDL document from the Web and storing it on the disk. As seen from this figure, the average time cost for obtaining the WSDL document in the Xmethods Web Service Registry is the largest, namely 1168 milliseconds. Although the process of extracting the WSDL link costs a certain amount of time, it does not have a significant influence on the total average time spent on obtaining the WSDL document, because the WSDL link of a Web service is almost always gained in one step. Therefore, this implies that the average size of the WSDL documents in the Xmethods Web Service Registry is larger than that of the other four Web Service Registries, in particular the Ebi Web Service Registry, which takes just 82 milliseconds for the process of obtaining the WSDL document.

Figure4-9 Average time cost for obtaining WSDL document in all Web Service Registries


Figure 4-10, figure 4-11 and figure 4-12 show the average time cost of generating these three different outputs in all these five Web Service Registries. As can be seen from figures 4-10 and 4-11, the average time for generating the XML file of one Web service is the same for all these five Web Service Registries, namely only 2 milliseconds, and the average time for generating the INI file of one Web service is likewise the same, with a value of just 1 millisecond. Even the sum of these two average time costs is still so small that it can be omitted when compared to the overall average time cost of getting one Web service in each corresponding Web Service Registry, as shown in figure 4-13. This implies that the process of generating the XML and INI files finishes immediately after receiving the service properties of one Web service as input. Furthermore, as can be seen from figure 4-12, although the average time cost of creating the database record of one Web service is larger than the time for generating the XML and INI files in all these five Web Service Registries, the operation of creating a database record is still fast.

Figure4-10 Average time cost for generating XML file in all Web Service Registries

Figure4-11 Average time cost for generating INI file in all Web Service Registries


Figure4-12 Average time cost for creating database record in all Web Service Registries

Figure4-13 Average time cost for getting one Web service in all Web Service Registries

Figure 4-13 gives the average time cost for getting one single Web service in all these five Web Service Registries. Without any doubt, the Biocatalogue Web Service Registry takes the longest time for this process. This is because, as shown in the preceding comparisons, the average time cost of almost every part is largest in the Biocatalogue Web Service Registry, with the exception of the process of obtaining the WSDL document, for which Biocatalogue does not have the largest average time. Moreover, there is a remarkable observation when looking at figures 4-8, 4-12 and 4-13: the shapes of these curves follow almost the same trend. This further indicates that a Web Service Registry that spends more time getting the description information of one Web service also offers more information about that Web service.


5 Conclusion and Further Direction
This master thesis provides a schema which aims to explore the description information of the Web services hosted in different Web Service Registries. The description information of a Web service consists of the WSDL document and the service information about that Web service. Although there is an existing approach, the Pica-Pica Web Service Description Crawler, that can be used to obtain the WSDL document and service information of the Web services in these Web Service Registries, its functionality is restricted: it explores only a small subset of the Web services hosted in these Web Service Registries, the storage of these Web services is not flexible, and most importantly only a few pieces of service information are extracted for each Web service, while for some Web Service Registries no service information is extracted at all. The work presented in this master thesis, however, is able to explore all Web services in these Web Service Registries. Moreover, with the approach presented in this master thesis, as much service information as possible is extracted for each Web service, so that the final result is the largest annotated service catalogue ever produced. Furthermore, regarding the storage of the description information of the Web services, this master thesis provides three different ways that guarantee not only the completeness but also the longevity of the description information of the Web services.
However, in the implementation performed in this master thesis, the Whois client used for querying the information of a service domain returns a free text if the information exists, and sometimes this free text differs completely. Consequently, each Web service in all Web Service Registries had to be crawled at least once during the experiment stage, so that all the cases of these free texts could be foreseen and processed afterwards. Nevertheless, this is a huge amount of work, because there are lots of Web services in these Web Service Registries. Therefore, in order to simplify the work, another Whois client which eases this task needs to be found and used.
Moreover, in the experiment stage of this master thesis, the time cost for getting a Web service is still large. For the purpose of reducing this time, multithreaded programming could also be applied to some parts of the process of getting one Web service.
Although the work performed here is specialized for only these five Web Service Registries, the main parts of the principles used here are adaptable to other Web Service Registries with only small changes in the implementation code or the structure.


6 Bibliography

[1] Nathalie Steinmetz Holger Lausen Dario Cerizza Andrea Turati Irene Celino Adam Funk and

Michael Erdmann ldquoD11 ndash Requirement Analysis and Architectural Planrdquo Available from

httpwwwservice-findereudeliverablespublic7-public32-d11 Emanuele Della Valle (CEFRIEL)

June 27 2008

[2] Nathalie Steinmetz Holger Lausen Dario Cerizza Andrea Turati Irene Celino Adam Funk and

Michael Erdmann ldquoD12 - First Design of Service-Finder as a Wholerdquo Available from

httpwwwservice-findereudeliverablespublic7-public37-d12-first-design-of-service-finder-as-a-

whole Emanuele Della Valle (CEFRIEL) July 1 2008

[3] Nathalie Steinmetz Holger Lausen Irene Celino Dario Cerizza Saartje Brockmans Adam Funk

ldquoD13 ndash Revised Requirement Analysis and Architectural Planrdquo Available from

httpwwwservice-findereudeliverablespublic7-public57-d13-revised-requirement-analysis-and-

architectural-plan Emanuele Della Valle (CEFRIEL) April 1 2009

[4] Chia-Hui Chang Mohammed Kayed Moheb Ramzy Girgis Khaled Shaalan ldquoA Survey of Web

Information Extraction Systemsrdquo Volume 18 Issue 10 IEEE Computer Society pp1411-1428 October

2006

[5] Leonard Richardson ldquoBeautiful Soup Documentationrdquo Available from

httpwwwcrummycomsoftwareBeautifulSoupdocumentationhtml October 13 2008

[6] Hao He Hugo Haas David Orchard ldquoWeb Services Architecture Usage Scenariosrdquo Available from

httpwwww3orgTRws-arch-scenarios February 11 2004

[7] Stephen Soderland ldquoLearning Information Extraction Rules for Semi-Structured and Free Textrdquo

Volume 34 Issue 1-3 Journal of Machine Learning Department Computer Science and Engineering

University of Washington Seattle pp233-272 February 1999

[8] Ian Hickson ldquoA Vocabulary and Associated APIs for HTML and XHTMLrdquo World Wide Web

Consortium Working Draft WD-html5-20100624 January 22 2008

[9] Holger Lausen Jos de Bruijn Axel Polleres Dieter Fensel ldquoThe Web Service Modeling Language

WSMLrdquo WSML Deliverable D161v02 March 20 2005 Available from

httpwwwwsmoorgTRd16d161v02

[10] Dumitru Roman Holger Lausen Uwe Keller ldquoWeb Service Modeling Ontology - Standard (WSMO

-Standard)rdquo WSMO deliverable D2 version 11 06 March 2004 Available from

httpwwwwsmoorgTRd2v11

[11] Iris Braum Anja Strunk Gergana Stoyanova Bastian Buder ldquoConQo ndash A Context- And QoS-Aware

Service Discoveryrdquo TU Dresden Department of Computer Science in Proceedings of WWWInternet

2008


7 Appendixes
There are additional outputs of this master program, namely the log information file and the statistic report file. Figure 8-1 shows one of the basic output formats of the log information for these five Web Service Registries.
Figure8-1 Log information of the "Service Repository" Web Service Registry
Figure8-2 Statistic information of the "Service Repository" Web Service Registry


Figure 8-2 to figure 8-6 show the output format of the statistic information for the "Service Repository", "Ebi", "Xmethods", "Seekda" and "Biocatalogue" Web Service Registries respectively.
Figure8-3 Statistic information of the "Ebi" Web Service Registry
Figure8-4 Statistic information of the "Xmethods" Web Service Registry


Figure8-5 Statistic information of the "Seekda" Web Service Registry
Figure8-6 Statistic information of the "Biocatalogue" Web Service Registry


Table of Figures

Figure2-1 Dataflow of Service-Finder and Its Components ... 12
Figure2-2 Left is the free text input type and right is its output ... 16
Figure2-3 A Semi-structured page containing data records (in rectangular box) to be extracted ... 16
Figure2-4 Architecture of the Pica-Pica Web Service Description Crawler ... 20
Figure3-1 Overview of the Basic Architecture for the Deep Web Services Crawler ... 25
Figure3-2 Overview the process flow of the Web Service Extractor Component ... 27
Figure3-3 Service list page of the Service-Repository ... 29
Figure3-4 Original source code of the internal link for Web service "BLZService" ... 29
Figure3-5 Code Overview of getting service page link in Service Repository ... 29
Figure3-6 Service page of the Web service "BLZService" ... 29
Figure3-7 Overview the process flow of the WSDL Grabber Component ... 30
Figure3-8 WSDL link of the Web service "BLZService" in the service page ... 31
Figure3-9 Original source code of the WSDL link for Web service "BLZService" ... 32
Figure3-10 Code overview of "getServiceRepositoryWSDLLink" function ... 32
Figure3-11 Code overview of "oneParameter" function ... 32
Figure3-12 Overview the process flow of the Property Grabber Component ... 33
Figure3-13 Structure properties of the Service "BLZService" in service list page ... 37
Figure3-14 Structure properties of the Service "BLZService" in service page ... 38
Figure3-15 Endpoint information of the Web service "BLZService" in service page ... 38
Figure3-16 Monitoring Information of the Service "BLZService" in service page ... 39
Figure3-17 Whois Information of the service domain "thomas-bayer.com" ... 40
Figure3-18 Overview the process flow of the Storage Component ... 41
Figure3-19 Implementation code for getting WSDL document ... 44
Figure3-20 Implementation code for generating XML file ... 44
Figure3-21 Implementation code for generating INI file ... 45
Figure3-22 Implementation code for creating table in database ... 45
Figure3-23 Implementation code for generating table records ... 46
Figure4-1 Service amount statistic of these five Web Service Registries ... 49
Figure4-2 Statistic information for WSDL Document ... 50
Figure4-3 Average Number of Service Properties ... 51
Figure4-4 WSDL Document format of one Web service ... 52
Figure4-5 INI File format of one Web service ... 53
Figure4-6 XML File format of one Web service ... 53
Figure4-7 Database data format for all Web services ... 53
Figure4-8 Average time cost for extracting service property in all Web Service Registries ... 55
Figure4-9 Average time cost for obtaining WSDL document in all Web Service Registries ... 56
Figure4-10 Average time cost for generating XML file in all Web Service Registries ... 57
Figure4-11 Average time cost for generating INI file in all Web Service Registries ... 57
Figure4-12 Average time cost for creating database record in all Web Service Registries ... 58
Figure4-13 Average time cost for getting one Web service in all Web Service Registries ... 58

Table of Tables

Table 3-1 Structured Information of Service-Repository Web Service Registry ... 34
Table 3-2 Structured Information of Xmethods Web Service Registry ... 34
Table 3-3 Structured Information of Seekda Web Service Registry ... 34
Table 3-4 Structured Information of Ebi Web Service Registry ... 34
Table 3-5 Structured Information of Biocatalogue Web Service Registry ... 34
Table 3-6 SOAP Operation Information of Biocatalogue Web Service Registry ... 35
Table 3-7 REST Operation Information of Biocatalogue Web Service Registry ... 35
Table 3-8 Endpoint Information of these five Web Service Registries ... 35
Table 3-9 Monitoring Information of these five Web Service Registries ... 35
Table 3-10 Whois Information for these five Web Service Registries ... 36
Table 3-11 Extracted Structured Information of Web Service "BLZService" ... 38
Table 3-12 Extracted Endpoint Information of the Web service "BLZService" ... 39
Table 3-13 Extracted Monitoring Information of the Web service "BLZService" ... 39
Table 3-14 Extracted Whois Information of service domain "thomas-bayer.com" ... 40
Table 3-15 Sleep Time of these five Web Service Registries ... 47
Table 4-1 Service amount statistic of these five Web Service Registries ... 48
Table 4-2 Statistic information for WSDL Document ... 49
Table 4-3 Average time cost information for all Web Service Registries ... 55

Table of Abbreviations

DTD Document Type Definition

DWSC Deep Web Service Crawler

HTML HyperText Markup Language

MTBF Mean Time between Failures

MTTR Mean Time to Recovery

QoS Quality of Service

REST Representational State Transfer

RTT Round Trip Time

SOAP Simple Object Access Protocol

SQL Structured Query Language

URL Uniform Resource Locator

WHATWG Web Hypertext Application Technology Working Group

XML eXtensible Markup Language

WSDL Web Service Description Language

WSML Web Service Modeling Language

WSMO Web Service Modeling Ontology

Table of Contents

Acknowledgements 2

Abstract 3

1 Introduction 7

11 Background/Motivation 7

12 Initial Designing of the Deep Web Service Crawler Approach 7

13 Goals of this Master Thesis 8

14 Outline of this Master Thesis 8

2 State of the Art 10

21 Service Finder Project 10

211 Use Cases for Service-Finder Project 10

2111 Use Case Methodology 10

2112 System Administrator 10

212 Architecture Plan for the Service-Finder Project 12

2121 The Principle of the Service Crawler Component 13

2122 The Principle of the Automatic Annotator Component 13

2123 The Principle of the Conceptual Indexer and Matcher Component 14

2124 The Principle of the Service-Finder Portal Interface Component 14

2125 The Principle of the Cluster Engine Component 15

22 Information Extraction 15

221 Input Types of Information Extraction 15

222 Extraction Targets of Information Extraction 17

223 The Used Techniques in Information Extraction 18

23 Pica-Pica Web Service Description Crawler 19

231 Needed Libraries of the Pica-Pica Web Service Description Crawler 19

232 Architecture of the Pica-Pica Web Service Description Crawler 20

233 Implementation of the Pica-Pica Web Service Description Crawler 21

24 Conclusions of the Existing Strategies 22

3 Design and Implementation 23

31 Deep Web Services Crawler Requirements 23

311 Basic Requirements for DWSC 23

312 System Requirements for DWSC 23

313 Non-Functional Requirements for DWSC 24

32 Deep Web Services Crawler Architecture 24

321 The Function of Web Service Extractor Component 26

3211 Features of the Web Service Extractor Component 28

3212 Input of the Web Service Extractor Component 28

3213 Output of the Web Service Extractor Component 28

3214 Demonstration for Web Service Extractor 29

322 The Function of WSDL Grabber Component 30

3221 Features of the WSDL Grabber Component 31

3222 Input of the WSDL Grabber Component 31

3223 Output of the WSDL Grabber Component 31

3224 Demonstration for WSDL Grabber Component 31

323 The Function of Property Grabber Component 33

3231 Features of the Property Grabber Component 36

3232 Input of the Property Grabber Component 37

3233 Output of the Property Grabber Component 37

3234 Demonstration for Property Grabber Component 37

324 The Function of Storage Component 40

3241 Features of the Storage Component 42

3242 Input of the Storage Component 43

3243 Output of the Storage Component 43

3244 Demonstration for Storage Component 43

33 Multithreaded Programming for DWSC 46

34 Sleep Time Configuration for Web Service Registries 46

4 Experimental Results and Analysis 48

41 Statistic Information for Different Web Service Registries 48

42 Statistic Information for WSDL Document 49

43 Comparison of Different Average Number of Service Properties 50

44 Different Outputs of Web Services 52

45 Comparison of Average Time Cost for Different Parts of Single Web Service 54

5 Conclusion and Further Direction 59

6 Bibliography 60

7 Appendixes 61

Table of Figures 64

Table of Tables 65

Table of Abbreviations 66

1 Introduction

This introductory chapter of the master thesis first concisely explains the background of the current situation and then gives a basic introduction to the proposed approach, which is called the Deep Web Service Extraction Crawler.

11 Background/Motivation

In the late 1990s the Web Service Registry became a hot commodity. Formally, a Web Service Registry is a link/links page whose function is to uniformly present information that comes from various sources. Hence it provides a convenient channel for users to offer, search for and use Web Services. The related metadata of the Web Services, submitted by both the system and the users, is commonly hosted along with the service descriptions.

Nevertheless, when users enter one of these Web Service Registries to look for Web Services, they may run into several troublesome situations. One situation is that a registry returns several similar published Web Services for a search, for example two or more Web Services with the same name but different versions, or two or more Web Services derived from the same server but with different contents. Furthermore, most users are also interested in a global view of the published services; for instance, they want to know which Web Service Registry provides the better quality for a Web Service. Therefore, in order to help users differentiate similar published Web Services and gain a global view of the Web Services, this information should be monitored and rated.

Moreover, there are a great many Web Service Registries on the Internet, and each of them can provide a great number of Web Services. Obviously there may be similar Web Services among these registries, or a Web Service in one registry may be related to another Web Service in a different registry. Hence these Web Services should be comparable across different Web Service Registries; however, at present there is little support for this. In addition, not all of the metadata is structured, especially the descriptions of the non-functional properties. Therefore, the task at hand is to turn these non-functional property descriptions into a structured format. In other words, as much information as possible about the Web Services offered in the Web Service Registries needs to be extracted.

Eventually, after extracting all the information from the Web Service Registries, it is necessary to store it on disk. This procedure should be efficient, flexible and complete.

12 Initial Designing of the Deep Web Service Crawler Approach

The problems have already been stated in the previous section; the following work is to solve them. This section presents the basic principle of the Deep Web Service Crawler approach.

As already mentioned, each Web Service Registry offers Web Services, and each registry has its own HTML page structure. These structures may be similar or completely different. Therefore the first step is to identify which Web Service Registry is going to be explored. Since every Web Service Registry has a unique URL, this can be done by directly analyzing the corresponding URL address of that registry. After identifying the Web Service Registry, the next step is to obtain all the Web Services published in it. With all these Web Services obtained, the information about each service is then extracted, analyzed and gathered. That information may be available in a structured or even in an unstructured format, so in this master thesis some Deep Web analysis techniques are applied to obtain it, with the aim of producing the largest possible amount of annotations for each Web Service. Last but not least, all the information about the Web Services needs to be stored.
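The registry identification step can be illustrated with a small Java sketch. The class, enum and method names below are hypothetical and are not taken from the actual implementation; the sketch only shows the idea of selecting a registry-specific routine by inspecting the host part of the seed URL.

import java.net.URI;
import java.net.URISyntaxException;

public class RegistryIdentifier {

    // Hypothetical registry names used to pick a registry-specific crawling routine.
    enum Registry { BIOCATALOGUE, EBI, SEEKDA, SERVICE_REPOSITORY, XMETHODS, UNKNOWN }

    // Decide which Web Service Registry a seed URL belongs to by looking at its host name.
    static Registry identify(String seedUrl) throws URISyntaxException {
        String host = new URI(seedUrl).getHost().toLowerCase();
        if (host.contains("biocatalogue"))       return Registry.BIOCATALOGUE;
        if (host.contains("ebi.ac.uk"))          return Registry.EBI;
        if (host.contains("seekda"))             return Registry.SEEKDA;
        if (host.contains("service-repository")) return Registry.SERVICE_REPOSITORY;
        if (host.contains("xmethods"))           return Registry.XMETHODS;
        return Registry.UNKNOWN;
    }

    public static void main(String[] args) throws URISyntaxException {
        System.out.println(identify("http://www.service-repository.com")); // SERVICE_REPOSITORY
    }
}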

13 Goals of this Master Thesis

The goals of this master thesis are the following:

• Produce the largest annotated Service Catalogue
A Service Catalogue is a list of service properties; the more properties a service has, the larger the Service Catalogue it owns. Therefore this master program should extract as many service properties as possible.

• Flexible storage of the metadata of each service as annotations or dedicated documents
The metadata of one service includes not only the WSDL document but also the service properties. All this metadata is important information about the service, so this master program should provide flexible ways to store it on disk.

• Improve the comparability of the Web Services across different Web Service Registries
The names of the service properties in one Web Service Registry can differ from those in another registry. Hence, in order to improve comparability, these property names should be unified and well-defined.

14 Outline of this Master Thesis

In this chapter the motivation, objective and initial approach plan have already been discussed. The remainder of this thesis is structured as follows.

Chapter 2 presents work that is based on existing techniques. Section 21 gives a detailed introduction to the Service-Finder project. Section 22 then presents the Information Extraction technique, and section 23 introduces and discusses the already implemented Pica-Pica Web Service Description Crawler.

Chapter 3 explains the design details of the Deep Web Service Crawler approach. Section 31 gives a short description of the different requirements of this approach, and section 32 presents the actual design of the Deep Web Service Crawler. Sections 33 and 34 introduce the multithreaded programming and the sleep time configuration used in this master program, respectively.

Chapter 4 presents the experiments with this Deep Web Service Crawler approach and gives an evaluation of them.

Finally, chapter 5 concludes and discusses the work already done as well as the future work for this master task.

2 State of the Art

This chapter presents some existing techniques and strategies that are related to the work of applying the Deep Web Service Extraction Crawler approach. Section 21 talks about an existing service catalogue, the Service-Finder project. Section 22 then presents some details of the Information Extraction technique. Finally, section 23 explains an existing, already implemented crawler, the Pica-Pica Web Service Description Crawler.

21 Service Finder Project

The Service-Finder project aims at developing a platform for Web Service discovery, especially for Web Services that are embedded in a Web 2.0 environment [1]. Hence it can provide efficient access to publicly available services. The goals of the Service-Finder project are as follows [1]:

• Automatically gather Web Services and their related information
• Semi-automatically create semantic service descriptions based on the information that is available on the Web
• Create and improve semantic annotations via user feedback
• Describe the aggregated information in semantic models and allow reasoning and querying

However, before describing the basic functionality of the Service-Finder project, one of its use cases and the resulting requirements are presented first.

211 Use Cases for Service-Finder Project

The Service-Finder Project employed the use case methodology of the W3C Use Case description [6]

for its needs and then applied this methodology to the use cases that it enumerated

2111 Use Case Methodology

Three aspects need to be considered for the use case definitions of the Service-Finder project [1]:

(1) Description, which describes the information of the use case

(2) Actors, Roles and Goals, which identify the actors, the roles they act in and the goals they need to achieve in the scenario

(3) Storyboard, which describes the series of interactions among the actors and the Service-Finder Portal

2112 System Administrator

This section presents a use case that is applied to the Service-Finder portal and that illustrates the requirements on its functionality from a user point of view. All the information in this use case is derived from [1]. In this use case there is a system administrator named "Sam Adams". He works for a bank, and his job is to keep the online payment facilities online and working all day and night. Therefore, if there are any system failures, Sam Adams has to fix the problems as early as he can. That is why he wants to use an SMS messaging service which will alert him immediately by sending him an SMS message in the case of a system failure.

• Description
This use case describes the system administrator "Sam Adams", who is looking for an SMS messaging service that he wants to build into his application.

• Actors, Roles and Goals
The name of the actor is "Sam Adams". His role is that of a system administrator at a bank. His goals are immediate service delivery, reliability of the service, and a low base fee and transaction fee.

• Storyboard
Step 1: Sam Adams knows the Service-Finder portal, and he also knows that he can find many useful services on it; in particular, he knows what he is looking for. Hence he visits the Service-Finder portal and starts to search by entering the keywords "SMS Alert".

Requirement 1: Search functionality.

Step 2: The Service-Finder now returns a list of matching services. Sam wants to choose the number of matching services that are displayed on one page, and he also expects short information about the service functionality, the service provider and the service availability, so that he can decide which service to read about further.

Requirement 2: Enable configurable pagination of the matching results and show some short information for each service.

Step 3: When Sam looks through the short information about the services displayed on the first page, he expects to find the services most relevant to his request. After that he would like to read more detailed information about a service, to see whether it can provide the needed functionality.

Requirement 3: Rank the returned matching services and provide the ability to read more details of a service.

Step 4: It may be the case that the returned matching services provide quite different functionalities or belong to different service categories, for example services that alert users not through SMS but through voice messaging. For this reason Sam would like to see other categories that may contain the services he wants, or services of other categories in which he is also interested (like "SMS Messaging"). Another possibility is that Sam can further filter his search by browsing through categories.

Requirement 4: Provide service categories and allow the user to look at all services that belong to a specific category. If possible, also allow the user to browse through categories.

Step 5: When Sam has got all the services that could provide an SMS messaging service via the methods described in Step 4, he wants to look for services that are offered by an Austrian provider and, if possible, have no base fees.

Requirement 5: Faceted search.

Step 6: After Sam has got all these specific services, he would now like to choose the services that provide a high reliability.

Requirement 6: Sort functionality based on the user's choices.

Step 7: Sam now expects to compare the service availability promised by the service provider with the availability actually provided. This should be contained in the services' details. There also needs to be service coverage information, so that Sam can know whether a service covers the areas where he lives and works. Moreover, Sam would also like to compare these services in another way, for instance by putting some services into a structured table to compare the transaction fees.

Requirement 7: A side-by-side comparison table for services and a functionality that enables users to select the services they want to compare.

Step 8: Finally, Sam wants to know whether the service providers offer a free trial of the services, so that he can test the service functionality.

Requirement 8: If possible, display a note about services offering free trials.

212 Architecture Plan for the Service-Finder Project

The architecture plan of the Service-Finder project contains five basic components: Service Crawler, Automatic Annotator, Conceptual Indexer and Matcher, Cluster Engine, and Service-Finder Portal Interface. Figure 2-1 presents a high-level overview of these components and the data flow among them.

Figure2-1 Dataflow of Service-Finder and Its Components [3]

2121 The Principle of the Service Crawler Component

The Service Crawler is responsible for gathering available services and their related information from the Web. The overall cycle is as follows:

(1) A Web developer publishes a Web Service.

(2) The Crawling component then begins to harvest the Web in order to identify Web Services, i.e. WSDL (Web Service Description Language) documents.

(3) As soon as a service is discovered, the Crawler also searches for other related information.

(4) After each periodic interval, the Crawler produces a consistent snapshot of the relevant part of the Web.

Finally, the output of the crawler is forwarded to the subsequent components for analyzing, indexing and displaying.
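The exact recognition rules used by such a crawler are not detailed here. Purely as an illustration, and not as part of Service-Finder itself, a crawler could apply a simple file-name heuristic such as the following hypothetical Java helper to spot WSDL candidates among discovered links:

public class WsdlLinkHeuristic {

    // Hypothetical heuristic: treat a URL as a WSDL candidate if it ends in ".wsdl"
    // or carries a "?wsdl" query, which is a common convention for SOAP endpoints.
    static boolean looksLikeWsdlLink(String url) {
        String u = url.toLowerCase();
        return u.endsWith(".wsdl") || u.endsWith("?wsdl") || u.contains("?wsdl&");
    }

    public static void main(String[] args) {
        System.out.println(looksLikeWsdlLink("http://example.org/service?wsdl")); // true
        System.out.println(looksLikeWsdlLink("http://example.org/index.html"));   // false
    }
}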

2122 The Principle of the Automatic Annotator Component

The Automatic Annotator receives the relevant data from the previous component and generates semantic service descriptions for the WSDL documents and their related information, based on the Service-Finder Ontology and the Service Category Ontology.

First, the two compatible ontologies that are used throughout the whole process are briefly introduced [2]:

• Generic Service Ontology: an ontology which is used to describe the data objects, for example the services, the service providers, availability, payment modalities and so on.

• Service Category Ontology: an ontology which is used to categorize the functionalities or applications of the services, for instance data verification, messaging, data storage, weather, etc.

Afterwards, the function of this component is described together with its input and output.

Oslash Input

u Crawled data from Service Crawler

u Service-Finder Ontologies

u Feedback or corrections of previous annotations

Oslash Function

u Enrich the information about the service and extract semantic statements according to the

Service-Finder Ontologies For example categorize the service according to the Service

Category Ontology

u Determine whether a particular document is relevant or not through the Web link graph If

not discard these irrelevant documents

u Classify the pages into their genres For instance pricing user comments FAQ and so on

Oslash Output

u Semantic annotation of the services

2123 The Principle of the Conceptual Indexer and Matcher Component

The Conceptual Indexer and Matcher is more like a data store center that aims at storing all extracted

information of the services and supplying users the capability of retrieval and semantic query For

example the matchmaking between user requests and service offers and the act of retrieving user

feedback on extracted annotations

In addition, let us have a look at the function of this component and its input and output:

Oslash Input

u Semantic annotation data and full text information obtained from Automatic Annotation

u Semantic annotation data and full text information that come from user interfaces

u Cluster data from user and service clustering component

Oslash Function

u Store the semantic annotations received from the Automatic Annotation component and

from the user interface

u Store the cluster data that procured through the clustering component

u Store and index the textual description offered by the Automatic Annotation component

and the textual comments offered by users

u Ontological query the semantic data from the data store center

u Combined keyword and Ontological querying used for user queries

u Provide a list of similar services for a given service

Oslash Output

u A list of matching services that are queried by users In particular these services should be

sorted by ranking and can also be iterated

u All available data that related to a particular entity must be retrievable at the user interface

2124 The Principle of the Service-Finder Portal Interface Component

The Service-Finder Portal Interface is the main entry point provided for users of the Service-Finder system to search and browse the data which is managed by the Conceptual Indexer and Matcher component. In addition, the users can contribute information by providing tags, comments, categorizations and ratings for the browsed data. Furthermore, developers can directly invoke the Service-Finder functionalities from their custom applications via an API.

The details of this component's function, input and output are presented below:

Oslash Input

u A list of ordered services for a query

u Detailed information about a service or a set of services and a service provider

u Query access to service category ontology and the most used tags provided by the users

u Service availability information

Oslash Function

u The Web Interface allows the users to search for services by keyword, tag or concept in the categorization, to sort and filter query results by refining the query, to compare and bookmark services, and to try out the services that offer this functionality

u The API allows the developers to invoke Service-Finder functionalities

Oslash Output

u Explicit user annotations such as tags, ratings, comments, descriptions and so on

u Implicit user data for example click stream of users bookmarks comparisons links sent

etc

u Manual advertising of available new services

2125 The Principle of the Cluster Engine Component

The Cluster Engine is in charge of collecting and analyzing information of user behaviors from the

Service-Finder Portal eg the queried services and the compared services of the users Moreover it

also provides cluster data to the Conceptual Indexer and Matcher for providing service

recommendations

Furthermore, let us introduce this component's function, input and output in detail:

Oslash Input

u Service annotation data of both extracted and user feedback

u Users' click streams, used for extracting user behaviors

Oslash Function

u Obtain user clusters from user behaviors

u Obtain service clusters from service annotation data to enable to find similar services

Oslash Output

u Clusters of users and services

22 Information Extraction

Due to the rapid development and use of the World-Wide Web, a huge number of information sources has been produced on the Internet, but the heterogeneity and the lack of structure of these Web information sources limit the access to them to browsing and searching. Therefore Information Extraction, which transforms Web pages into program-friendly structures for post-processing, has become a great necessity. The task of Information Extraction is specified in terms of the inputs and the extraction targets, and the technique used in the process of Information Extraction is called an extractor.

221 Input Types of Information Extraction

Generally speaking, there are three different input types. The first input type is the unstructured document, for example the free text shown in figure 2-2. It is unstructured and written in natural language, so it requires substantial natural language processing. The second input type is the structured document, for instance XML documents, whose data can be described through an available DTD (Document Type Definition) or XML (eXtensible Markup Language) schema. Finally, the third input type is the semi-structured document, which is widespread on the Web, such as the large volume of HTML pages with tables, itemized lists and enumerated lists. This is because HTML tags are often used to render such embedded data in HTML pages; see figure 2-3.

Figure2-2 Left: the free text input type; right: its output [4]

Figure2-3 A semi-structured page containing data records (in rectangular box) to be extracted [4]

In this way, inputs of the semi-structured type can be seen as documents with a fairly regular structure, and the data of these documents can be displayed in an HTML or a non-HTML format. Moreover, since the Web pages of the Deep Web are dynamic and generated from structured databases by means of templates or layouts, the Deep Web can be considered as one of the input sources that provide such semi-structured documents. For example, the author, price and comment sections of the book pages provided by Amazon have the same layout, because these Web pages are generated from the same database and rendered with the same template or layout. Furthermore, there is another option, namely manually generated HTML pages of semi-structured type. For example, although the publication lists provided on different researchers' homepages are produced by diverse users, they all have a title and a source property for every single paper. Eventually, the inputs for Information Extraction can also be pages of the same class, or pages from various Web Service Registries.

222 Extraction Targets of Information Extraction

Moreover, regarding the task of Information Extraction, the extraction target has to be considered. There are two different kinds of extraction targets. The first one is a relation of k-tuples, where k is the number of attributes in a record. In some cases an attribute of a record may have no instantiation, while in other cases an attribute owns multiple instantiations. The second extraction target is a complex object with hierarchically organized data. Although the ways of depicting extraction targets in a page are diverse, the most common structure is the hierarchical tree. Such a tree may contain only one leaf node, or one or more lists of leaf nodes grouped under internal nodes. The structure of a data object may thus be flat or nested: if the structure is flat, there is only a single leaf node, which can also be called the root; if it is a nested structure, the data object involves internal nodes and consists of more than two levels.

Furthermore, in order to make Web pages readable for human beings and easier to visualize, the tables or tuples of the same list, or the elements of a tuple, should be clearly isolated or demarcated. However, the display of a data object in a Web page can be affected by the following conditions [4]:

• An attribute of a data object has zero or several values
(1) If there is no value for an attribute of a data object, this attribute is called a "none" attribute. For example, a special offer that is only available for certain books might be a "none" attribute.
(2) If there is more than one value for an attribute of a data object, it is called a "multiValue" attribute. For instance, the name of the author of a book could be a "multiValue" attribute.

• The set of attributes (A1, A2, A3, ...) has multiple orderings
That is to say, within this set of attributes the position of an attribute might change across the diverse instances of a data object. Such an attribute is called a "multiOrdering" attribute. For instance, for movies before the year 1999 a movie site might list the release date in front of the movie's title, while for movies from 1999 onwards it lists the release date behind the movie's title.

• An attribute has different formats
This means that the display format of the data object can be completely different for different instances. Therefore, if the format of an attribute is free, a lot of rules are needed to deal with all the possible cases. This kind of attribute is called a "multiFormat" attribute. For example, an e-commerce Web site might use a bold font to present the regular prices, while using a red color to display the sale prices. There is also the opposite situation, in which several different attributes of a data object have the same format; for example, various attributes are presented using <TD> tags in a table presentation. Such attributes can be differentiated by means of their order information. However, in cases where a "none" attribute occurs or "multiOrdering" attributes exist, the rules for extracting these attributes have to be revised.

• An attribute cannot be decomposed
For easier processing, the input documents are sometimes treated as strings of tokens instead of strings of characters. However, some attributes cannot be decomposed into several individual tokens. These attributes are called "untokenized" attributes. An example is a college course catalogue entry like "COMP4016" or "GEOL2001": the department code and the course number cannot be separated into two different strings of characters such as "COMP" and "4016" or "GEOL" and "2001".
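To make these attribute conditions concrete, the following small Java sketch (an illustration only, not taken from any of the systems surveyed in [4]) models a record whose attributes may have zero, one or several values:

import java.util.*;

// A minimal record model for extraction targets: each attribute name maps to a
// list of values, so "none" attributes (empty list) and "multiValue" attributes
// (several entries) can both be represented.
public class ExtractedRecord {
    private final Map<String, List<String>> attributes = new LinkedHashMap<>();

    public void addValue(String attribute, String value) {
        attributes.computeIfAbsent(attribute, k -> new ArrayList<>()).add(value);
    }

    public List<String> values(String attribute) {
        return attributes.getOrDefault(attribute, Collections.emptyList());
    }

    public static void main(String[] args) {
        ExtractedRecord book = new ExtractedRecord();
        book.addValue("title", "Web Data Mining");
        book.addValue("author", "First Author");   // a "multiValue" attribute ...
        book.addValue("author", "Second Author");  // ... with two instantiations
        System.out.println(book.values("author"));       // [First Author, Second Author]
        System.out.println(book.values("specialOffer")); // [] -- a "none" attribute
    }
}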

223 The Used Techniques in Information Extraction

The extractor used in the process of Information Extraction aims at providing a single uniform query interface to access information sources like database servers and Web servers. It consists of the following phases: collecting the returned Web pages, labeling these Web pages, generalizing extraction rules, extracting the relevant data, and outputting the result in an appropriate format (XML format or a relational database) for further information integration. For example, the extractor first queries the Web server to gather the returned pages through the HTTP protocol; after that it extracts the contents from these HTML documents and then integrates them with other data sources. The whole process of the extractor follows the steps below:

• Step 1
At the beginning the input has to be tokenized. There are two different granularities for the tokenization of the input string: tag-level encoding and word-level encoding. Tag-level encoding transforms the tags of an HTML page into general tokens, while transforming each text string between two tags into a special token. Word-level encoding does this in another way: it treats each word in a document as a token.

• Step 2
Next, the extraction rules are applied for every attribute of the data object in the Web pages. These extraction rules can be induced by means of top-down or bottom-up generalization, pattern mining or logic programming. In addition, the type of the extraction rules may be indicated by regular grammars or logic rules. For example, some use path expressions over the HTML parse tree such as html.head.title or html->table[0], some use syntactic or semantic constraints, and some use delimiter-based constraints such as HTML tags or literal words.

• Step 3
After that, all the extracted data are assembled into records.

• Step 4
Finally, this process is iterated until all the data objects in the input have been processed.
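As a small, self-contained illustration of a delimiter-based extraction rule (a simplified sketch, not the rule language of any particular extractor), the following Java snippet collects all text strings that appear between <td> and </td> delimiters of an HTML fragment:

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DelimiterRule {

    // Extract every text string enclosed by the given opening and closing delimiters.
    static List<String> extractBetween(String html, String open, String close) {
        Pattern p = Pattern.compile(Pattern.quote(open) + "(.*?)" + Pattern.quote(close),
                                    Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
        Matcher m = p.matcher(html);
        List<String> result = new ArrayList<>();
        while (m.find()) {
            result.add(m.group(1).trim());
        }
        return result;
    }

    public static void main(String[] args) {
        String row = "<tr><td>BLZService</td><td>available</td></tr>";
        System.out.println(extractBetween(row, "<td>", "</td>")); // [BLZService, available]
    }
}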

23 Pica-Pica Web Service Description Crawler

Pica pica is known as a bird species, also called the magpie. Here, however, Pica-Pica is a Web Service Description Crawler which is designed to address the quality of Web Services, for example the evaluation of the descriptive quality of the offered Web Services and of how well these Web Services are described in today's Web Service Registries.

231 Needed Libraries of the Pica-Pica Web Service Description Crawler

This version of the Pica-Pica Web Service Description Crawler was written by Anton Caceres and Josef Spillner and is programmed in the Python language. In order to run these scripts and parse the HTML pages, it needs two additional libraries: Beautiful Soup and html5lib.

• Beautiful Soup
It is an HTML/XML parser for the Python language, and it can turn even invalid markup into a parse tree [5]. Moreover, the following three features make it powerful:
- Bad markup does not choke Beautiful Soup. It generates a parse tree that makes approximately as much sense as the original document, so the wanted data can still be obtained.
- Beautiful Soup provides a toolkit with simple, idiomatic methods for navigating, searching and modifying the parse tree, so there is no need to create a custom parser for every application.
- Beautiful Soup converts incoming documents to Unicode and outgoing documents to UTF-8 automatically. The encoding only has to be specified if the document does not declare one and it cannot be detected.

Furthermore, the ways of including Beautiful Soup into an application are the following [5]:
- from BeautifulSoup import BeautifulSoup          # for processing HTML
- from BeautifulSoup import BeautifulStoneSoup     # for processing XML
- import BeautifulSoup                              # to get everything

• html5lib
It is a Python package which implements the HTML5 [8] parsing algorithm. In order to gain maximum compatibility with the current major desktop web browsers, this implementation is based on the WHATWG (Web Hypertext Application Technology Working Group) HTML5 specification.

232 Architecture of the Pica-Pica Web Service Description Crawler

Figure2-4 Architecture of the Pica-Pica Web Service Description Crawler

Figure 2-4 illustrates the primary architecture of the Pica-Pica Web Service Description Crawler. It includes four fundamental components: the Service Page Grabber component, the WSDL Grabber component, the Property Grabber component and the WSML Register component.

(1) The Service Page Grabber component takes the URL seed as its input and outputs the link of each service page to the following two components: the WSDL Grabber component and the Property Grabber component.

(2) The WSDL Grabber component is responsible for obtaining the WSDL document based on the delivered service page link and then checking the validity of the obtained WSDL document. Only valid WSDL documents are passed on to the WSML Register component for further processing.

(3) The Property Grabber component tries to extract the service properties hosted in the service page, if there are any. After that, all these service properties are saved into an INI file as the information about that service.

(4) The functionality of the WSML Register component is to write an appropriate WSML document by means of the valid WSDL document delivered from the WSDL Grabber component and, optionally, the INI file delivered from the Property Grabber component, and afterwards to register them in Conqo.

• WSML [9]
WSML stands for Web Service Modeling Language. It provides a framework with different language variants and is used to describe the different aspects of semantic Web Services according to the conceptual model of WSMO.

• WSMO [10]
WSMO, whose full name is Web Service Modeling Ontology, is dedicated to describing various aspects of Semantic Web Services based on ontologies. An ontology is a formal, explicit specification of a shared conceptualization. These ontologies are the sticking points that link the agreement of the communities of users with the defined conceptual semantics of the real world.

• Conqo [11]
Conqo is a discovery framework that considers not only the Quality of Service (QoS) but also context information. It uses a Web Service Repository to manage service descriptions based on WSML.

233 Implementation of the Pica-Pica Web Service Description Crawler

This section describes the implementation of the Pica-Pica Web Service Description Crawler in detail.

(1) Firstly, to start the whole crawling process of the Pica-Pica Web Service Description Crawler, an input is needed as the initial seed. In this crawler there are five Web Service Registries, which are listed below. The URL addresses of these five Web Service Registries are used as the input seeds for the Pica-Pica Web Service Description Crawler. Moreover, in this version of the crawler there is a single Python script for each Web Service Registry, and the crawling processes of these per-registry Python scripts are executed one after another.

Biocatalogue http://www.biocatalogue.com
Ebi http://www.ebi.ac.uk
Seekda http://www.seekda.com
Service-Repository http://www.service-repository.com
Xmethods http://www.xmethods.net

(2) After being fed with the input seed, the crawler steps into the next component, the Service Page Grabber. At first this component tries to read the data from the Web based on the input seed. Then it builds a parse tree of the read data by means of the functions of the Beautiful Soup library. After that, the Service Page Grabber component starts to look for the service page link of each service published in the Web Service Registry by means of the functions of the html5lib library. When the service page link of a single service is found, it is first checked whether this service page link is valid or not. Once the service page link is valid, it is passed on to the following two components for further processing: the WSDL Grabber component and the Property Grabber component.

(3) When the WSDL Grabber component receives a service page link from the previous component, it sets out to extract the WSDL link address of that service from the parse tree of the data of this service page. Next, this component starts to download the WSDL document of that service using the WSDL link address, and the obtained WSDL document is then stored on disk. The process of the WSDL Grabber component is carried on until no more service links are passed to it. Certainly, not all grabbed WSDL documents are usable: they may contain bad definitions or a bad namespaceURI, be an empty document, or not even be in XML format. Hence, in order to pick them out, this component further analyzes the obtained WSDL documents. All valid documents are put into a "validWSDLs" folder, whereas the invalid documents are put into a folder named "invalidWSDLs" in order to gather statistical information. Finally, only the WSDL documents in the "validWSDLs" folder are passed to the subsequent component.

(4) Moreover, since some Web Service Registries provide additional information about the services, such as availability, service provider and version, the Property Grabber component sets out to extract this information as the service's properties and thereafter saves these properties into an INI file. If no additional information is available, there is no need to extract service properties, and thus there is no INI file for that service. In this implementation of the Pica-Pica Web Service Description Crawler, only the Python scripts for the Seekda and Service-Repository Web Service Registries have functions for extracting the services' properties, while for the other three Web Service Registries there is no such function.

(5) Furthermore, it is optional to create a report file which contains statistical information about this process, such as the total number of services of one Web Service Registry, the number of services whose WSDL document is invalid, etc.

(6) As has been stated, at this point there is a folder with all valid WSDL documents and possibly some INI files. The task of the WSML Register component is now to generate the appropriate WSML documents from these valid WSDL documents and INI files, and then to register them in Conqo.

24 Conclusions of the Existing Strategies

This chapter presented three aspects of the existing strategies: the Service-Finder project, the Information Extraction technique and the Pica-Pica Web Service Description Crawler.

The task of this master program is to obtain the available Web services and their related information from the Web. This is in fact a procedure of extracting the needed information for each service, such as the service's WSDL document and its properties. Therefore the Information Extraction technique, which is used to extract information hosted on the Web, can be used in this master program.

Moreover, the Service-Finder is a large project that is not only able to obtain the available Web services and their related information from the Web, but also enriches the crawled data with annotations by means of the Service-Finder ontology and the Service Category ontology and integrates all the information into a coherent semantic model. Furthermore, it also provides capabilities for searching and browsing the data via a user interface and gives users service recommendations. However, for a master program the Service-Finder project far exceeds the requirements; therefore it is considered only as a reference for this master program.

Furthermore, since the Pica-Pica Web Service Description Crawler aims only at obtaining the available Web services and their related information, it fulfills the primary task of this master program. Nevertheless, regarding the information about each service, the Pica-Pica Web Service Description Crawler extracts only a few properties, and sometimes even none. Consequently, in order to improve the quality of the service descriptions, as many properties about each service as possible have to be extracted. Hence, chapter 3 presents an extension of the Pica-Pica Web Service Description Crawler.

3 Design and Implementation

In the previous chapter, the State of the Art, the already existing techniques and implementations were presented. In the following, the basic principle of the proposed approach, the Deep Web Services Crawler, is introduced; it is based on these existing techniques, especially on the Pica-Pica Web Service Description Crawler.

31 Deep Web Services Crawler Requirements

This section mainly describes the goals of the Deep Web Service Crawler approach, the system requirements with respect to the approach, and some non-functional requirements.

311 Basic Requirements for DWSC

The following are the basic requirements which should be achieved:

(1) Produce the largest annotated Service Catalogue. Here the Service Catalogue is actually the list of Web services published in a Web Service Registry. It contains not only the WSDL document of each service but also a file with the information about that service. Therefore, for the purpose of producing the largest annotated Service Catalogue, the proposed approach needs to extract as many properties about those Web services as possible, and it also has to download the WSDL document hosted along with each Web service. That is to say, these properties are not only the interesting structured properties, such as the service name and its WSDL link address, but also other non-functional properties, for example the endpoint and its monitoring information.

(2) Flexible storage. Each service has a Service Catalogue that contains all its interesting properties. How to deal with these service properties, i.e. which kinds of schemes are used to store them, is an important question. In order to store them in a flexible way, the proposed approach provides three methods for storage: the first stores them as an XML file, the second stores them in an INI file, and the third uses a database for the storage. A small sketch of the XML and INI variants is given below.
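The following minimal Java sketch illustrates the idea of the first two storage variants with the standard java.util.Properties class. The property names, values and file names are example data only; the actual file formats produced by this program are shown in chapter 4.

import java.io.FileOutputStream;
import java.io.IOException;
import java.util.Properties;

public class FlexibleStorageSketch {
    public static void main(String[] args) throws IOException {
        // Hypothetical properties of one crawled Web service (example values only).
        Properties service = new Properties();
        service.setProperty("name", "BLZService");
        service.setProperty("wsdlLink", "http://www.thomas-bayer.com/axis2/services/BLZService?wsdl");
        service.setProperty("availability", "98%");

        // INI-style storage (simple key=value lines).
        try (FileOutputStream ini = new FileOutputStream("BLZService.ini")) {
            service.store(ini, "service properties");
        }

        // XML storage of the same properties.
        try (FileOutputStream xml = new FileOutputStream("BLZService.xml")) {
            service.storeToXML(xml, "service properties");
        }
    }
}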

312 System Requirements for DWSC

Generally speaking, the requirements for realizing a programming project include the following:

1) Operating system: Linux/Unix, Windows (XP, Vista, 2000, etc.)

2) Programming language: C++, Java, Python, C, etc.

3) Programming tool: NetBeans, Eclipse, Visual Studio, and so on

In this master thesis, the scripts of the Deep Web Service Crawler approach are written in the Java programming language. These code scripts have only been tested on the Windows XP and Linux operating systems, and have not been tested on other operating systems.

313 Non-Functional Requirements for DWSC

In this part, several non-functional requirements for the Deep Web Service Crawler approach are presented:

1) Transparency: the process of data exploration and data storage should be done automatically, without the user's intervention. However, at the beginning the user has to specify the path on the hard disk which will be used to store the outputs of this program.

2) Fault tolerance: during the execution of this program, errors can inevitably happen. Therefore, in order to keep the process from being interrupted, some necessary error handling for recovering the process must be in place (a small retry sketch is given at the end of this section).

3) Completeness: this approach should extract as many of the interesting properties of each Web Service as possible, e.g. the endpoint, monitoring information, etc.

In addition, since the Pica-Pica Web Service Crawler has already implemented strategies for the following five URLs, the proposed approach must support no fewer than these five Web Service Registries:

Biocatalogue http://www.biocatalogue.com
Ebi http://www.ebi.ac.uk
Seekda http://www.seekda.com
Service-Repository http://www.service-repository.com
Xmethods http://www.xmethods.net
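As a minimal illustration of the fault-tolerance requirement (a sketch only, not the actual error handling code of this program), a download step can be wrapped in a retry loop so that a single network error does not interrupt the whole crawl:

import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class RetrySketch {

    // Try to read a URL up to maxAttempts times before giving up;
    // a failure is logged instead of aborting the crawling process.
    static String fetchWithRetry(String url, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try (InputStream in = new URL(url).openStream()) {
                return new String(in.readAllBytes(), StandardCharsets.UTF_8);
            } catch (IOException e) {
                System.err.println("Attempt " + attempt + " failed for " + url + ": " + e.getMessage());
            }
        }
        return null; // caller records the service as having no retrievable document
    }
}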

32 Deep Web Services Crawler Architecture

In this section, an overview of the high-level architecture of the Deep Web Services Crawler approach is introduced first. Thereafter, four subsections follow that outline each single component and how they play together.

The current components and data flows of the Deep Web Service Crawler can be summarized as depicted in Figure 3-1 using the continuous arrows. It first tries to obtain the available service list page links and the related service page links by crawling the Web (Web Service Extractor). Then those gathered links are processed in two separate steps: one gets the service's WSDL document (WSDL Grabber), and the other collects the properties of each service (Property Grabber). Finally, all these data are stored on the storage device (Storage). The whole process in figure 3-1 is illustrated in detail as follows.

Oslash Step 1

When the Deep Web Service Crawler starts to run the File Chooser container would require the

user to specify a path in the computer or in any other hard disk The reason for specifying the

path is that this Deep Web Service Crawler program needs a place to store all its outputs

Oslash Step 2

After that the Web Service Extractor will be triggered It is the main entry to the specific crawling

process However since the Deep Web Service Crawler program is a procedure which supposes

to crawl for Web Services in some given Web Service Registries Hence the URL addresses of

Deep Web Service Crawler

25

these Web Service Registries should be given as an initial seed for this Web Service Extractor

process However since the page structures of these Web Service Registries are completely

different thus there will be a dependent process for each Web Service Registry

Figure3-1 Overview of the Basic Architecture for the Deep Web Services crawler

Oslash Step 3

So that according to the given seed two types of links would be obtained by the Web Service

Extractor component One is the service list page link And another is the service page link

Service list page is page that contains a list of Web Services and may be some information about

these Web Services While service page is page that contains much more information for single

service Finally it forwards these two types of links into the next two components Property

Grabber and WSDL Grabber

Oslash Step 4

Then on the one hand the Property Grabber component tries to gather the information about

the service that hosted in the service list page and service page Such as the name of the service

its description the ranking for this service etc Finally all the information of the service will be

collected together as the service properties which then will be delivered to the storage

component for further processing

Oslash Step 5

On the other hand the WSDL Grabber will try to obtain the WSDL link from the service list page

or the service page That is because for some Web Service Registry the WSDL link is hosted in

Deep Web Service Crawler

26

the service list page like the Biocatalogue and for other Web Service Registries it is hosted in

the service page such as Xmethods Then after obtaining the WSDL link it will also be

transmitted to the Storage component for further processing

Oslash Step6

When service properties information and WSDL link of the service are received by Storage

component it will be going to store them into the disk For the service properties they will be

stored into the disk by means of three different ways They are an XML file an INI file or one

record inside the table of Database However for the WSDL link the Storage component will try

to download the page content first according to the URL address of the WSDL link If it can work

successful the page content of the service will be stored as a WSDL document into the disk

Ø Step 7
Steps 3 to 6 describe the crawling process for a single service. Hence, if there is more than one service or more than one service list page in a Web Service Registry, the crawling process from step 3 to step 6 is repeated again and again until no further service or service list page remains in that Web Service Registry.

Ø Step 8
Furthermore, after the crawling process for one Web Service Registry finishes, a file is generated that contains statistical information about this crawling process. For example: when the crawling process of this Web Service Registry started and when it finished, the total number of Web services in this Web Service Registry, how many services have an empty WSDL document, the average number of service properties in this Web Service Registry, and the average time cost for extracting service properties, getting the WSDL document, generating the XML file, the INI file, etc.

321 The Function of Web Service Extractor Component

The Web Service Extractor component is responsible for gathering service list page links and service page links. It pursues a focused crawl of the Web and only forwards service list page and service page links to the subsequent components for analysis and collection purposes. Therefore, it identifies both the service list page links and the related service page links on these Web Service Registries.

As can be seen from figure 3-2, a crawl for Web Services needs to start from a URL seed. The seed is almost as important as the Web Service Extractor itself, as it highly influences the part of the Web that needs to be crawled. The seed can or shall contain, e.g., Web pages where these Web Services are published or pages that talk about Web Services.


Figure3-2 Overview the process flow of the Web Service Extractor Component

After being fed with the URL seed, the Web Service Extractor component starts to get the service list page link from the initial page of this URL seed. However, this process differs among the five Web Service Registries. The following shows the different situations in these Web Service Registries:

Ø Service-Repository Web Service Registry
In this Web Service Registry the link of the first service list page is the URL address of its seed, which means that some Web Services can already be found on the home page of the Service-Repository Web Service Registry. The word "first" here implies that there is more than one service list page link in this Web Service Registry. Therefore, the process of getting service list page links in this registry is carried on continually until no more service list page links exist.

Ø Xmethods Web Service Registry
Although there are Web Services on the home page of the Xmethods Web Service Registry, they are only a small subset of the Web Services in this registry. Moreover, in the Xmethods Web Service Registry there is only one page containing all Web Services. Therefore, the service list page link of that page has to be obtained.

Ø Ebi Web Service Registry
The situation in the Ebi Web Service Registry is somewhat similar to that of the Xmethods Web Service Registry. That is to say, there is also one page that contains all Web Services of this Web Service Registry. However, this page is not the initial page of the input seed. Therefore, more than one operation step is needed to get the service list page link of that page.

Ø Seekda Web Service Registry
In the Seekda Web Service Registry the Web Services are not contained in the initial page of the input seed. The service list page link can be obtained after several additional operation steps.


However, there is a problem with getting the service list page links in this registry. Simply put, if more than one page contains Web Services, then for some unknown reason the links of the remaining service list pages cannot be obtained. In other words, only the link of the first service list page can be obtained.

Ø Biocatalogue Web Service Registry
The process of getting service list pages in the Biocatalogue Web Service Registry is almost the same as in the Seekda Web Service Registry. The only difference is that all service list page links can be obtained in the Biocatalogue Web Service Registry if there is more than one service list page.

Then, after getting the link of the service list page, the Web Service Extractor begins to get the service page link for each service listed in the service list page. This is possible because every service has an internal link that addresses its service page. It is worth noting that once a service page link is obtained, this component immediately forwards the service page link and the service list page link to the subsequent two components for further processing. Nevertheless, the process of obtaining service page links is carried out continuously until all services listed in that service list page are crawled. Analogously, the process of getting service list pages is also carried out continuously until no more service list pages exist.

3211 Features of the Web Service Extractor Component

The main features are described in the following paragraphs.

1) Obtain service list page links

A central task is to obtain the corresponding service list page links. A service list page link is a URL address pointing to a public list of Web Services, together with some basic information about these Web services, such as the name of the service, an internal URL that links to another page containing the detailed information about that service, and sometimes the link address of the WSDL document.

2) Obtain service page links

Once service list page links are found, a crucial aspect is to extract the internal link of each service in order to aid the task of service information discovery. Therefore, it is the task of the Web Service Extractor to harvest the HTML page content of the service list page, so that the service page links, which contain much more detailed information about the single Web service, can be obtained.

3212 Input of the Web Service Extractor Component

This component depends on specific input seeds. The only input required for this component is a URL seed. This URL seed will be one of the URLs displayed in section 313.

3213 Output of the Web Service Extractor Component

The component produces two types of service-related page links from the Web:
• Service list page links
• Service page links


3214 Demonstration for Web Service Extractor

In order to give a comprehensive understanding of the process of the Web Service Extractor component, the following figures serve as an explanation. Although there are five URL addresses in this section, only the URL of the Service-Repository is shown as an example.

1) The input seed is the initial URL address of the Service-Repository, which is "http://www.service-repository.com".

2) As already said in section 321, the first service list page link of this Web Service Registry is its input seed "http://www.service-repository.com". Figure 3-3 shows the corresponding service list page of that link.

Figure3-3 Service list page of the Service-Repository

Figure3-4 Original source code of the internal link for Web service "BLZService"
Figure3-5 Code overview of getting service page link in Service Repository
Figure3-6 Service page of the Web service "BLZService"


3) Now that the service list page link is already known, the next step is to acquire the service page link for each of the services listed in the service list page. The text in the red box of figure 3-4 shows the internal link of the Web service "BLZService". However, this is not the complete link of the service page. It has to be prefixed with the initial URL address of the Service-Repository Web Service Registry, "http://www.service-repository.com". The code for getting the service page link of a Web service in the Service-Repository is shown in figure 3-5. Therefore, the final link of this service page is
"http://www.service-repository.com/service/overview-210897616".
Figure 3-6 is the corresponding service page of that link.
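The prefixing described in step 3 can be sketched in a few lines of Java; the snippet below is illustrative only (the internal link value is the one shown in figure 3-4) and is not the thesis code from figure 3-5.

    import java.net.URI;

    public class ServicePageLinkDemo {
        public static void main(String[] args) {
            // The input seed, i.e. the initial URL address of the Service-Repository registry
            URI base = URI.create("http://www.service-repository.com/");
            // Internal link of the Web service "BLZService" taken from figure 3-4 (illustrative)
            String internalLink = "/service/overview-210897616";
            // Prefix the internal link with the registry's initial URL address
            String servicePageLink = base.resolve(internalLink).toString();
            System.out.println(servicePageLink);
            // prints: http://www.service-repository.com/service/overview-210897616
        }
    }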

4) Afterwards, those two links, the service list page link and the service page link, which are gathered by the Web Service Extractor component, are immediately forwarded to the next two components, the WSDL Grabber component and the Property Grabber component.

322 The Function of WSDL Grabber Component

The WSDL Grabber component acquires the WSDL link hosted on the Web based on the service list page link or the service page link. The whole process flow is illustrated in figure 3-7.

Figure3-7 Overview the process flow of the WSDL Grabber Component

When the WSDL Grabber component receives the inputs delivered from the previous component, it starts to get the WSDL link for the service based on these inputs. However, although the inputs of the WSDL Grabber component are the links of the service page and the service list page, only one of them contains the WSDL link for the corresponding service. That is to say, the WSDL link exists either in the service page or in the service list page. The reason why both links need to be delivered into this component is that only one of the five Web Service Registries, namely Biocatalogue, hosts the WSDL link in the service list page, while for the other four Web Service Registries the WSDL link is hosted in the service page. Therefore, the WSDL link for these four Web Service Registries is obtained from the service page link, and for the Biocatalogue Web Service Registry the WSDL link is obtained through the service list page link. However, there is a problem with getting the WSDL link in the Biocatalogue Web Service Registry. In brief, some of the Web services listed in


the service list page of the Biocatalogue Web Service Registry do not have a WSDL link; in other words, these services do not have a WSDL document. In such a situation, the value of the WSDL link of these Web services is assigned a "NULL" value. Nevertheless, for the Web Services in the other four Web Service Registries the WSDL link always exists in the service page. Eventually, whenever the WSDL Grabber component extracts the WSDL link for one single Web service, it is immediately forwarded to the Storage component for downloading the WSDL document.

3221 Features of the WSDL Grabber Component

The WSDL Grabber component has to provide the following feature:

• Obtain WSDL links

The WSDL link is the direct way to get to the page that contains the contents of the WSDL document. It is actually a URL address, but at its end it has something like "wsdl" or "WSDL" to indicate that this is an address that points to the page of the WSDL document.

3222 Input of the WSDL Grabber Component

This component requires the following input data:
• Service list page link
• Service page link

3223 Output of the WSDL Grabber Component

The component only produces the following output data:
• The URL address of the WSDL link for each service

3224 Demonstration for WSDL Grabber Component

This section presents a list of figures in order to give a comprehensive understanding of the process of the WSDL Grabber component. It uses the same Web service of the Service-Repository as an example.

1) The input of the WSDL Grabber component is the service page link obtained from the Web Service Extractor component. The address of this link is
"http://www.service-repository.com/service/overview-210897616".

Figure3-8 WSDL link of the Web service "BLZService" in the service page


2) Figure 3-9 shows the corresponding original source code of the WSDL link that is highlighted in figure 3-8.

Figure3-9 Original source code of the WSDL link for Web service "BLZService"

3) Figure 3-10 and figure 3-11 show the code used to extract the WSDL link shown in figure 3-9. However, figure 3-10 is the particular code for the Service-Repository Web Service Registry only; for the other four Web Service Registries it is different. The function "getServiceRepositoryWSDLLink" first gets a list of nodes that have the same HTML tag name "b". Then it checks all of the nodes one by one to see whether the text value of a node equals "WSDL". If a node fulfills this condition, then the attribute value of its sibling, which is an "a" element here, is extracted as the value of the WSDL link for this Web service.

Figure3-10 Code overview of the "getServiceRepositoryWSDLLink" function
Figure3-11 Code overview of the "oneParameter" function
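To make this lookup concrete, the following hedged sketch reproduces the same idea with the jsoup HTML parser; jsoup is used here for illustration only and is not necessarily the parser used in the thesis.

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class WsdlLinkDemo {
        // Returns the WSDL link found next to a <b>WSDL</b> label, or null if none exists.
        static String getServiceRepositoryWSDLLink(String html) {
            Document doc = Jsoup.parse(html);
            for (Element b : doc.select("b")) {                // all nodes with the HTML tag name "b"
                if (b.text().trim().equals("WSDL")) {          // does the text value equal "WSDL"?
                    Element sibling = b.nextElementSibling();  // the neighbouring element
                    if (sibling != null && sibling.tagName().equals("a")) {
                        return sibling.attr("href");           // its attribute value is the WSDL link
                    }
                }
            }
            return null;
        }
    }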

4) Finally, after applying the functions shown in figures 3-10 and 3-11, the WSDL link for the Web service "BLZService" can be obtained. Its value is
"http://services.unitedplanet.de/blz/BlzService.asmx?WSDL".


323 The Function of Property Grabber Component

The Property Grabber component is a module that is used to extract and gather the Web service information hosted on the Web, namely the information shown in the service list page and the service page. In the end, all the obtained Web service information is collected together as the service properties, which are delivered into the Storage component for storing. The detailed process flow of the Property Grabber component is illustrated in figure 3-12.

The inputs of the Property Grabber component are the same as the inputs of the WSDL Grabber component. However, there is still a small difference between them with respect to the seed. As already mentioned in section 322, for the WSDL Grabber component only one of the inputs is sufficient to get the WSDL link, whereas the Property Grabber component needs both of them as seeds. Once the Property Grabber component receives the needed inputs, it starts to extract the information of that single Web service.

Figure3-12 Overview the process flow of the Property Grabber Component

After the Property Grabber component receives the inputs, it starts to extract the service information of the Web service. Generally speaking, the service information consists of four aspects: structured information, endpoint information, monitoring information and whois information.

(1) Structured Information

The structured information can be obtained by extracting the information hosted in the service page and the service list page. It is the basic descriptive information about the service, such as the name of the service, the URL address through which the WSDL document can be obtained, the description introducing the service, the provider who provides that service, its rating and the server that owns this service. However, the elements constituting this


structured information differ among the different Web Service Registries. For example, the rating information of a Web service exists in the Service-Repository Web Service Registry, while the Xmethods Web Service Registry does not have this information. In addition, even for Web services in the same Web Service Registry, some elements of the structured information may not exist. For instance, one service in a Web Service Registry may have a description, while another service in the same Web Service Registry does not. Tables 3-1 to 3-5 show the structured information that should be extracted for these five Web Service Registries. Moreover, if the style of a service in the Biocatalogue Web Service Registry is SOAP, there is some additional information describing the SOAP operations of this service, and if it is REST, this additional information describes the REST operations. This should also be considered as a part of the structured information. Table 3-6 and table 3-7 illustrate the information for these two different operation types.

Service Name WSDL Link WSDL Version

Provider Server Rating

Homepage Owner Homepage Description

Table 3-1 Structured Information of Service-Repository Web Service Registry

Service Name WSDL Link Provider

Service Style Homepage Implementation Language

Description User Description Contributed Client Name

Type of this Client Publisher for this client Used Toolkit of this client

Used Language of this client Used Operation System of this client

Table 3-2 Structured Information of Xmethods Web Service Registry

Service Name WSDL Link Server

Provider Provider's Country Service Style

Rating Description User Description

Service Tags Documentation (within WSDL)

Table 3-3 Structured Information of Seekda Web Service Registry

Service Name WSDL Link Port Name

Service URL Address Implementation Class

Table 3-4 Structured Information of Ebi Web Service Registry

Service Name WSDL Link Style

Provider Provider's Country View Times

Favorite Times Submitter Service Tags

Total Annotation Provider Annotation Member Annotation

Registry Annotation Base URL SOAP Lab Server Base URL

Description User Description Category

Table 3-5 Structured Information of Biocatalogue Web Service Registry


SOAP Operation Name Inputs and Outputs Operation Description

Operation Tags Part of Which Service

Table 3-6 SOAP Operation Information of Biocatalogue Web Service Registry

REST Operation Name Service Tags Used Template

Operation Description Part of Which Service Part of Which Endpoint Group

Table 3-7 REST Operation Information of Biocatalogue Web Service Registry

(2) Endpoint Information

Endpoint information is the port information of the Web service, which can only be extracted from the service page. However, since different Web Service Registries structure the endpoint information differently, some elements of the endpoint information can be very diverse. One thing has to be noted: the Ebi Web Service Registry does not have endpoint information for any of the Web services published in this registry. Moreover, even though the Web services in the same Web Service Registry share the same structure of the endpoint information, some elements of the endpoint information may be missing or empty. Furthermore, these Web Service Registries may even have no endpoint information at all for some of the Web services published by them. Nevertheless, whenever there is endpoint information for a Web service, it contains at least one element, namely the URL address of the endpoint. Table 3-8 shows the endpoint information that should be extracted for these five Web Service Registries.

Web Service Registry Name: Elements of the Endpoint Information
Service Repository: Endpoint Name, Endpoint URL, Endpoint Critical, Endpoint Type, Bound Endpoint
Xmethods: Endpoint URL, Publisher of this Endpoint, Contact Email of this Publisher, Implementation Language of this Endpoint
Seekda: Endpoint URL
Biocatalogue: Endpoint Name, Endpoint URL
Table 3-8 Endpoint Information of these five Web Service Registries

Web Service Registry Name: Elements of the Monitoring Information
Service Repository: Service Availability, Number of Downs, Total Uptime, Total Downtime, MTTR, MTBF, RTT Max of Endpoint, RTT Min of Endpoint, RTT Average of Endpoint, Ping Count of Endpoint
Seekda: Service Availability, Begin Time of Monitoring
Biocatalogue: Monitored Status of Endpoint, Overall Status of Service
Table 3-9 Monitoring Information of these five Web Service Registries

(3) Monitoring Information

Monitoring information is the measured statistical information about the Web service. It is worth


noting that the Ebi and Xmethods Web Service Registries do not have monitoring information for any of the Web services published by them, while in the other three Web Service Registries only a few Web services may lack this information. Table 3-9 displays the monitoring information for these three Web Service Registries.

(4) Whois Information

Whois information is not extracted from the information hosted in the service page and the service list page. It is the descriptive information about the service domain, which can be gained by means of the address of the WSDL link. Because of that, the process of getting the whois information starts by obtaining the service domain first. The final value of the service domain must not contain strings such as "http", "https" or "www"; it has to be the registrable domain under the top-level domain. After that, the service domain database is queried by sending the value of the service domain to a whois client, which is simply a Web site on the Internet, for example "http://www.whois365.com/cn/domain". If information about that service domain exists, a list of its information is returned as the output. However, the structure of the returned information differs for different service domains. Therefore, the most challenging part is to deal with the extraction process for each different form of the returned information. Table 3-10 lists the whois information that needs to be extracted for all five Web Service Registries, and a small sketch of the domain derivation follows table 3-10.

Service Domain URL Domain Name Domain Type

Domain Address Domain Description State

Postal Code City Country

Country Code Phone Fax

Email Organization Established Time

Table 3-10 Whois Information for these five Web Service Registries
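A small Java sketch of the domain derivation is given below; it assumes that removing the protocol, the path and a leading "www." is sufficient to obtain the service domain, which simplifies the actual extraction logic of the thesis.

    import java.net.URI;

    public class ServiceDomainDemo {
        // Reduce a WSDL link to its service domain, e.g.
        // "http://www.thomas-bayer.com/axis2/services/BLZService?wsdl" -> "thomas-bayer.com"
        static String serviceDomainOf(String wsdlLink) {
            String host = URI.create(wsdlLink).getHost();   // drops "http://", the path and the query
            if (host == null) {
                return null;                                // not an absolute URL
            }
            if (host.startsWith("www.")) {
                host = host.substring(4);                   // drop the leading "www."
            }
            return host;
        }
    }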

Finally, all the information of these four aspects is collected together and then delivered into the Storage component for further storage processing.

3231 Features of the Property Grabber Component

The Property Grabber component has to provide the following features:

• Obtain basic information

Generally speaking, the more information a Web service has, the better one can judge how good this Web service is. Hence, it is necessary for the Property Grabber component to extract all the basic information hosted in the service list page and the service page. This basic information comprises the structured information, the endpoint information and the monitoring information.

• Obtain Whois information

Because the more information a Web service has, the better its quality can be assessed, it is necessary to extract as much information about the Web service as possible. Therefore, besides the basic information, the Property Grabber component also obtains some additional information called whois information, such as the type of the domain, the name of the person in charge, the postal code of the domain, city, phone, fax, detailed address, email, etc.


3232 Input of the Property Grabber Component

This component requires the following input data:
• Service list page link
• Service page link

3233 Output of the Property Grabber Component

The component produces the following output data:
• Structured information of each service
• Endpoint information about each service, if it exists
• Monitoring information for the service and its endpoint, if it exists
• Whois information of the service domain

All this information is collected together as the properties of each service. Thereafter, the collected properties are sent to the Storage component.

3234 Demonstration for Property Grabber Component

The pictures in figure 3-13 to figure 3-16 depict the fundamental procedure of the Property Grabber component. To simplify the explanation, the example shown in this section uses the same Web service as before.

1) The inputs of the Property Grabber component are the links of the service list page and the service page received from the Web Service Extractor component. These links are
"http://www.service-repository.com"
and "http://www.service-repository.com/service/overview-210897616".

2) When the "GetProperty" function is triggered, it calls its four sub functions shown in figure 3-12.

Figure3-13 Structured properties of the service "BLZService" in the service list page


Figure3-14 Structured properties of the service "BLZService" in the service page

3) First, the "getStructuredProperty" function tries to extract the structured information displayed in the red boxes of figure 3-13 and figure 3-14. However, several elements of the structured information have the same content, like the description shown in the service page and the service list page. Hence, in order to save time during the extraction process and space during the storing process, elements with the same content are only extracted once. Moreover, the rating information needs a transformation from non-descriptive to descriptive text, because its content consists of several star images (a small illustrative sketch of such a transformation follows table 3-11). The final results of the extracted structured information for this Web service are shown in table 3-11. Because there is no descriptive information for the provider, homepage and owner homepage, their values are assigned as "NULL".

Service Name BLZService
WSDL Link http://www.thomas-bayer.com/axis2/services/BLZService?wsdl
WSDL Version 0
Server Apache-Coyote/1.1
Description BLZService
Rating Four stars and A Half
Provider NULL
Homepage NULL
Owner Homepage NULL
Table 3-11 Extracted Structured Information of Web Service "BLZService"
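The star-to-text transformation mentioned above could, for instance, be sketched as follows; the way the star images are counted and the exact wording are assumptions made purely for illustration.

    public class RatingTextDemo {
        // Turn a counted number of full star images plus an optional half star into
        // descriptive text, e.g. 4 full stars and a half star -> "Four stars and A Half".
        static String ratingText(int fullStars, boolean halfStar) {
            String[] words = {"Zero", "One", "Two", "Three", "Four", "Five"};
            return words[fullStars] + " stars" + (halfStar ? " and A Half" : "");
        }
    }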

4) Second, the "getEndpointProperty" function extracts the endpoint information displayed in the red box of figure 3-15. Although there is more than one endpoint record in the list, only one of them is extracted as the endpoint information. The reason is that this master program is intended to extract as much information as possible, but this information should not contain redundant entries. Therefore, only one record is extracted as the endpoint information, even if there is more than one endpoint record. Table 3-12 shows the final results of the endpoint information of this Web service.

Figure3-15 Endpoint information of the Web service "BLZService" in the service page


Endpoint Name BLZServiceSOAP12port_http
Endpoint URL http://www.thomas-bayer.com:80/axis2/services/BLZService
Endpoint Critical True
Endpoint Type production
Bound Endpoint BLZServiceSOAP12Binding
Table 3-12 Extracted Endpoint Information of the Web service "BLZService"

5) Then it is time to extract the monitoring information by invoking the function "getMonitoringProperty". Figure 3-16 displays two types of monitoring properties: the red box above contains the monitoring information about the Web service, and the red box below lists the monitoring information for the endpoints. As already mentioned before, only one endpoint statistic record is extracted. Besides, as can be seen from figure 3-16, there are two kinds of availability. Actually, they all represent the availability of this Web service, just like the availability shown in figure 3-14. Therefore, keeping only one of these availability values is sufficient. Table 3-13 shows the final results of this extraction process.

Figure3-16 Monitoring Information of the Service "BLZService" in the service page

Service Availability 100

Number of Downs 0

Total Uptime 1 day 19 hours 19 minutes

Total Downtime 0 second

MTBF 1 day 19 hours 19 minutes

MTTR 0 second

RTT Max of Endpoint 141 ms

RTT Min of Endpoint 0 ms

RTT Average of Endpoint 577 ms

Ping Count of Endpoint 112

Table 3-13 Extracted Monitoring Information of the Web service "BLZService"

6) Next, in order to extract the whois properties, the "getWhoisProperty" function first has to obtain the service domain from the WSDL link. For this Web service, the obtained service domain is "thomas-bayer.com". Then it sends this service domain as input to the whois client for the querying process, which returns a list of information for that service domain, see figure 3-17. Table 3-14 shows the extracted whois information.


Figure3-17 Whois Information of the service domain "thomas-bayer.com"

Service Domain URL thomas-bayer.com
Domain Name Thomas Bayer
Domain Type NULL
Domain Address Moltkestr. 40
Domain Description NULL
State NULL
Postal Code 54173
City Bonn
Country NULL
Country Code DE
Phone +4922855525760
Fax NULL
Email info@predic8.de
Organization predic8 GmbH
Established Time NULL
Table 3-14 Extracted Whois Information of service domain "thomas-bayer.com"

7) Finally, all the information of these four aspects is collected together as the service properties, and then these service properties are forwarded into the Storage component.

324 The Function of Storage Component

The WSDL link from the WSDL Grabber component is used by the Storage component to download the WSDL document from the Web and store it on the disk. In addition, the service properties from the Property Grabber component are also directly stored on the disk in three different manners by this Storage component. Figure 3-18 illustrates the process flow of the Storage component.

When the Storage component receives the needed inputs, the mediator function "Storager" of the Storage component is triggered. It then transforms the service properties into three different output formats and stores them on the disk. These output formats are an XML file, an INI file and database records. Besides, it also tries to download the WSDL document from the URL address of the WSDL link and then stores the obtained WSDL document on the disk too. This "Storager" function is composed of four sub functions, namely the "getWSDL", "generateXML", "generateDatabase" and "generateINI" sub functions. Each sub function is in charge of one aspect of the storage tasks.


Figure3-18 Overview the process flow of the Storage Component

(1) "getWSDL" sub function

The task of the "getWSDL" sub function is to download the WSDL document and then store it on the disk. Above all, it has to get the content of the WSDL document. This procedure is done as follows. First, the "getWSDL" sub function checks whether the value of the received WSDL link equals "NULL". As already presented in section 322, if the Web service does not have a WSDL link, the value of its WSDL link is assigned "NULL". In that case it creates a WSDL document whose name is the service name appended with the mark "No WSDL Document"; obviously, this document does not contain any content, it is an empty document. Nevertheless, if the service does have a WSDL link, this sub function tries to connect to the Internet using the URL address of the WSDL link. Once it succeeds, the content hosted on the Web is downloaded and stored on the disk, named only with the name of the service. Otherwise, it creates a WSDL document whose name is prefixed with "Bad" before the service name.
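A hedged sketch of this behaviour is shown below. It follows the three cases just described (no WSDL link, successful download, failed download), but the exact file naming and error handling of the thesis code in figure 3-19 may differ.

    import java.io.IOException;
    import java.io.InputStream;
    import java.net.URL;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardCopyOption;

    public class GetWsdlDemo {
        // Download the WSDL document of one service, or create a marker document instead.
        static void getWSDL(String name, String linkStr, Path dir) {
            try {
                if (linkStr == null || linkStr.equals("NULL")) {
                    // service without a WSDL link: create an empty, specially marked document
                    Files.createFile(dir.resolve(name + "[No WSDL Document].wsdl"));
                } else {
                    try (InputStream in = new URL(linkStr).openStream()) {
                        // store the downloaded page content under the name of the service
                        Files.copy(in, dir.resolve(name + ".wsdl"), StandardCopyOption.REPLACE_EXISTING);
                    }
                }
            } catch (IOException e) {
                // the WSDL link is unreachable: create a document prefixed with "Bad"
                try {
                    Files.createFile(dir.resolve("Bad" + name + ".wsdl"));
                } catch (IOException ignored) {
                }
            }
        }
    }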

(2) "generateXML" sub function

The "generateXML" sub function takes the service properties as input, transforms them into an XML file and stores it on the disk with a name consisting of the service name plus ".xml". XML stands for eXtensible Markup Language, which is a markup language designed to transport and store data. The first line of an XML file is the XML declaration, which defines the XML version and the encoding used. For example, <?xml version="1.0" encoding="UTF-8"?> means this is an XML file whose version is 1.0 and whose encoding is the 8-bit Unicode Transformation Format character set. Besides, an XML file contains XML elements, where an element spans everything from the element's start tag to the element's end tag. Moreover, an XML element can also contain other elements, simple text or a mixture of both.


However, an XML file must contain a root element as the parent of all other elements.
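A minimal sketch of this transformation is shown below. A Map of property names to values stands in for the Vector of "PropertyStruct" objects used by the thesis code, and it is assumed that the property names are valid XML element names; the sketch is illustrative rather than the actual implementation of figure 3-20.

    import java.io.FileWriter;
    import java.util.Map;
    import javax.xml.stream.XMLOutputFactory;
    import javax.xml.stream.XMLStreamWriter;

    public class GenerateXmlDemo {
        // Write one XML file per service: a root element "service" with one child element per property.
        static void generateXML(String serviceName, Map<String, String> properties, String path) throws Exception {
            try (FileWriter out = new FileWriter(path + serviceName + ".xml")) {
                XMLStreamWriter w = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
                w.writeStartDocument("UTF-8", "1.0");        // the XML declaration
                w.writeStartElement("service");              // the root element
                for (Map.Entry<String, String> p : properties.entrySet()) {
                    w.writeStartElement(p.getKey());         // one element per service property
                    w.writeCharacters(p.getValue() == null ? "" : p.getValue());
                    w.writeEndElement();
                }
                w.writeEndElement();                         // close the root element
                w.writeEndDocument();
                w.close();
            }
        }
    }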

(3) "generateINI" sub function

The "generateINI" sub function also takes the service's properties as input, but it transforms them into an INI file and then stores it on the disk with a name consisting of the service's name plus ".ini". "ini" stands for initialization. The INI file format is a de facto standard for configuration files; such files are just simple text files with a basic structure. Generally speaking, an INI file contains three different parts: section, parameter and comment. The parameter is the basic element contained in an INI file. Its format is a key-value pair, also called a name-value pair. This pair is delimited by an equals sign "=", and the key or name always appears to the left of the equals sign. A section is more like a room that groups all its parameters together. It always appears on a single line within a pair of square brackets "[ ]", and sections may not be nested. A comment is descriptive text that begins with a semicolon ";". Hence, anything between the semicolon and the end of the line is ignored.
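A hedged sketch of this sub function is given below; as in the XML example, a Map stands in for the thesis's Vector of "PropertyStruct" objects, and the comment and section texts are illustrative.

    import java.io.IOException;
    import java.io.PrintWriter;
    import java.util.Map;

    public class GenerateIniDemo {
        // Write one INI file per service: comment lines, one section named after the service,
        // and one key=value parameter line per service property.
        static void generateINI(String serviceName, Map<String, String> properties, String path) throws IOException {
            try (PrintWriter out = new PrintWriter(path + serviceName + ".ini")) {
                out.println("; service properties collected by the Deep Web Service Crawler"); // comment
                out.println("[" + serviceName + "]");                                          // section
                for (Map.Entry<String, String> p : properties.entrySet()) {
                    out.println(p.getKey() + "=" + p.getValue());                              // parameter
                }
            }
        }
    }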

(4) "generateDatabase" sub function

The inputs of the "generateDatabase" sub function are the same as for the previous two sub functions. Instead of transforming them into a file like XML or INI, the "generateDatabase" sub function turns them into database records by using SQL statements. SQL stands for Structured Query Language, which is a database language designed for accessing and manipulating data in a database. Most of the actions performed on a database are done with SQL statements, and the primary statements of SQL include insert into, delete, update, select, create, alter and drop. Therefore, for the purpose of transforming these service properties into database data, this sub function first has to create a database using the "create database" statement. Then it should create a table to store the data. A table is a collection of related data entries; it consists of columns and rows. Since the data for all five Web Service Registries are not very large, one database table is enough for storing these service properties. Because of that, the field names of the service properties in the columns must be uniform and well-defined across all five Web Service Registries. Afterwards, the service properties of each single service can be put into the table as a record with the "insert into" statement of SQL.
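The following JDBC sketch illustrates the "create" and "insert into" steps described above. The connection URL, the table name and the two example columns are assumptions, and the MySQL-flavoured AUTO_INCREMENT clause is only one possible choice; in the thesis, the table has one TEXT column per uniform property name.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;
    import java.sql.Statement;

    public class GenerateDatabaseDemo {
        // Create the table once and insert one record per crawled Web service.
        static void storeService(String jdbcUrl, String user, String password,
                                 String serviceName, String wsdlLink) throws SQLException {
            try (Connection con = DriverManager.getConnection(jdbcUrl, user, password);
                 Statement st = con.createStatement()) {
                // all service property columns use TEXT because their length is hard to predict
                st.executeUpdate("CREATE TABLE IF NOT EXISTS services ("
                        + "id INTEGER PRIMARY KEY AUTO_INCREMENT, service_name TEXT, wsdl_link TEXT)");
                try (PreparedStatement ps = con.prepareStatement(
                        "INSERT INTO services (service_name, wsdl_link) VALUES (?, ?)")) {
                    ps.setString(1, serviceName);
                    ps.setString(2, wsdlLink);
                    ps.executeUpdate();
                }
            }
        }
    }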

3241 Features of the Storage Component

The Storage component has to provide the following features:

• Generate different output formats

The final result of this master program is to store the information about the services on the disk for future work. This Storage component provides three different formats for storing the service properties of the services published in the Web Service Registries. This makes the storage of the services very flexible and also durable.

• Obtain the WSDL document

An important task of this component is to check whether the WSDL document can be obtained from the WSDL link, because the WSDL document plays a decisive role in determining the quality of the service. This Storage component provides the ability to deal with the different situations that occur during the process of obtaining the WSDL document.


3242 Input of the Storage Component

This component requires the following input data:
• WSDL link of each service
• The property information of each service

3243 Output of the Storage Component

The component produces the following output data:
• WSDL document of the service
• XML document, INI file and table records in the database

3244 Demonstration for Storage Component

The following figures describe the fundamental implementation code of this Storage component. The detailed depiction is given below.

1) As can be seen from figure 3-19 to figure 3-21, there are several common elements among the implementation code of these sub functions. The first common element concerns the parameters defined in each of these sub functions, namely "path" and "SecurityInt". The parameter "path" is the absolute path on the computer disk. It is used for the storing procedure and has already been specified by the user at the beginning of the whole program. The parameter "SecurityInt" is an increasing Integer which is used as a part of the name of the service. The reason for this is that it prevents services that have the same name from overriding each other on the disk. The content marked in red in the code of these figures is the second common element. Its function is to create a file or document for the service with the corresponding parameters.

2) Figure 3-19 displays the implementation code of the "getWSDL" sub function, which is designed to get the WSDL document of the service based on its WSDL link. The parameter "name" is the name of the service, which will be used as the name of the WSDL document. The parameter "linkStr" is the actual WSDL link, which is the most important one for this sub function. The other two parameters, "statistic" and "log", are the objects of the text files called "Statistic Information" and "Log Information" respectively. The "Statistic Information" text file is used to record the statistical data of the services in each Web Service Registry, such as the overall number of properties for that Web Service Registry, the overall number of services, the number of services that have no WSDL link, the number of services that contain no content, the number of services whose WSDL links are not available, etc. The "Log Information" text file records the results of the process steps and the problems encountered, for example which service is being crawled now, which Web Service Registry it belongs to, the reason why the WSDL document of a service cannot be obtained, the reason why a service is unreachable, and so on.


Figure3-19 Implementation code for getting WSDL document

3) Figure 3-20 and figure 3-21 show the code for turning the service properties into the XML file and the INI file and storing those two files on the disk thereafter. The parameter "vec" is a Vector of the "PropertyStruct" data type. This "PropertyStruct" data type is an object of a class that consists of two variables, name and value.

Figure3-20 Implementation code for generating XML file


Figure3-21 Implementation code for generating INI file

4) The code in figure 3-22 and figure 3-23 shows the process that turns the service properties of all services in these five Web Service Registries into records of the database. Therefore, a database has to be created first. The name of the database can be arbitrary as long as it conforms to the naming rules of the database; the same holds for the name of the table. Figure 3-22 displays the procedure of creating a table in the database. Because it is hard to decide on the length of each service property, the data types of all service property columns are set to "Text". Figure 3-23 shows the code for inserting the service properties into the table with the "update" statement.

Figure3-22 Implementation code for creating table in database


Figure3-23 Implementation code for generating table records

33 Multithreaded Programming for DWSC

Multithreaded programming is a built-in characteristic provided by the Java language. A multithreaded program contains two or more separate parts that can execute concurrently. Each part of such a program is called a thread. The use of multithreading makes it possible to create programs that make efficient use of the system resources, for example maximum use of the CPU, because the idle time of the CPU can be kept to a minimum.

In this master program there are five Web Service Registries whose services need to be crawled. Moreover, the number of services published in each Web Service Registry is quite different, which makes the running time cost by each Web Service Registry different. Without multithreading, a Web Service Registry that owns fewer services would have to wait until another Web Service Registry with many more services finishes. Therefore, in order to reduce the waiting time and to maximize the use of the system resources, it is necessary to apply multithreaded programming to this master program. That is to say, this master program creates a thread for each Web Service Registry, and these threads are executed independently.
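A minimal sketch of this per-registry threading is given below; the nested RegistryCrawler class that stands for the crawl of one Web Service Registry is a hypothetical placeholder, not part of the thesis code.

    import java.util.ArrayList;
    import java.util.List;

    public class CrawlerThreadsDemo {
        // Hypothetical placeholder for the per-registry crawling logic (steps 3 to 8 of the architecture).
        static class RegistryCrawler implements Runnable {
            private final String registryName;
            RegistryCrawler(String registryName) { this.registryName = registryName; }
            public void run() {
                // crawl the given Web Service Registry here
            }
        }

        public static void crawlAllRegistries() throws InterruptedException {
            String[] registries = {"Service Repository", "Ebi", "Xmethods", "Seekda", "Biocatalogue"};
            List<Thread> threads = new ArrayList<Thread>();
            for (String registry : registries) {
                Thread t = new Thread(new RegistryCrawler(registry), registry);
                t.start();                      // each registry is crawled independently
                threads.add(t);
            }
            for (Thread t : threads) {
                t.join();                       // wait until every registry has finished
            }
        }
    }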

34 Sleep Time Configuration for Web Service

Registries

Since this master program is intended to download the WSDL documents and extract the service information of the Web services published in a Web Service Registry, it inevitably affects the performance of that Web Service Registry. In addition, in order not to exceed their throughput capability, these Web Service Registries surely restrict the access rate. Because of that, unknown errors sometimes happen while this master program is executing. For instance, the master program may halt at one point without getting any more WSDL documents and service information, some services' WSDL documents of some Web


Service Registries cannot be obtained, or some service information may be missing. Therefore, in order to obtain as much data as possible about the Web services published in these five Web Service Registries without affecting their throughput, the access rate has to be configured for each service of all Web Service Registries.

Consequently, before going into the essential procedure for each single service of these Web Service Registries, this master program calls the built-in function "sleep(long milliseconds)". It is a public static method that causes the currently executing thread to sleep for the specified number of milliseconds; in other words, it temporarily ceases execution for a while. The following table shows the time interval of the sleep function for each Web Service Registry; a short sketch of this throttling follows the table.

Web Service Registry Name Time Interval (milliseconds)

Service Repository 8000

Ebi 3000

Xmethods 10000

Seekda 20000

Biocatalogue 10000

Table 3-15 Sleep Time of these five Web Service Registries
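The sketch below simply mirrors the intervals of table 3-15, while the method name and the surrounding crawl loop are assumptions made for illustration.

    import java.util.HashMap;
    import java.util.Map;

    public class SleepConfigDemo {
        // Sleep intervals per Web Service Registry in milliseconds, taken from table 3-15.
        private static final Map<String, Long> SLEEP_MILLIS = new HashMap<String, Long>();
        static {
            SLEEP_MILLIS.put("Service Repository", 8000L);
            SLEEP_MILLIS.put("Ebi", 3000L);
            SLEEP_MILLIS.put("Xmethods", 10000L);
            SLEEP_MILLIS.put("Seekda", 20000L);
            SLEEP_MILLIS.put("Biocatalogue", 10000L);
        }

        // Called before the essential procedure for each single service of a registry.
        static void politePause(String registryName) throws InterruptedException {
            Thread.sleep(SLEEP_MILLIS.get(registryName));  // temporarily cease execution
        }
    }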


4 Experimental Results and Analysis

This chapter shows the quantitative experimental results of the prototype presented in chapter 3. Besides, the analysis of these results is also described and explained. In order to gain rather accurate results, the experiments were carried out more than five times; all the data displayed in the following tables and charts are the averages of these runs.

41 Statistic Information for Different Web Service

Registries

This section discusses the amount statistics of the Web services published in these five Web Service Registries. This includes the overall number of the Web services published in each Web Service Registry and the number of unavailable Web services, which have been archived because they may not be active anymore or are close to being non-active. Table 4-1 shows the service amount statistics of these five Web Service Registries.

                       Service Repository   Ebi   Xmethods   Seekda   Biocatalogue
Overall Services               57           289      382       853        2567
Unavailable Services            0             0        0         0         125
Table4-1 Service amount statistic of these five Web Service Registries

In order to give an intuitive view of the service amount statistics in these five Web Service Registries, figure 4-1 contains a bar chart derived from table 4-1. As can be seen from the bar chart, on the one hand there is an ascending increase in the overall number of Web services from the Service Repository Web Service Registry to the Biocatalogue Web Service Registry. The Biocatalogue Web Service Registry in particular owns the largest number of Web services. This indicates that the Biocatalogue Web Service Registry has a much more powerful ability to provide Web services to the users, because it contains far more services than the other four Web Service Registries. On the other hand, there are no unavailable services in any of the Web Service Registries except for the Biocatalogue Web Service Registry. That is to say, only the Biocatalogue Web Service Registry contains some Web services that cannot be used by the users. To some degree this is useless, because these services cannot be used anymore and they waste network resources on the Web. Therefore, all these unavailable services should be eliminated in order to reduce the waste of network resources.


Figure4-1 Service amount statistic of these five Web Service Registries

42 Statistic Information for WSDL Document

                       Service Repository   Ebi   Xmethods   Seekda   Biocatalogue
Failed WSDL Links               1             0       23       145         32
Without WSDL Links              0             0        0         0         16
Empty Content                   0             0        2         0          2
Table4-2 Statistic information for WSDL Document

Table 4-2 and figure 4-2 show the error statistics of the WSDL documents of the Web services in these five Web Service Registries. There are three aspects to this. The first one concerns the "Failed WSDL Links" of the Web services among these Web Service Registries. It is the overall number of the Web services whose WSDL links are invalid. In other words, it is impossible to get the WSDL documents of these Web services via the URL addresses of their WSDL links; therefore no WSDL document is created for them. The second aspect is "Without WSDL Links", the overall number of Web services in each Web Service Registry that have no WSDL link at all. That is to say, there is no WSDL document for such Web services, and the value of the WSDL link for such a Web service is "NULL". However, a WSDL document is created anyway, but it has no content and its name contains the string "[No WSDL Document]". The third aspect is "Empty Content", which represents the overall number of Web services that have WSDL links whose URL addresses



are valid but whose WSDL documents contain no content. In this case a WSDL document whose name contains the string "(BAD)" is created.

Figure4-2 Statistic information for WSDL Document

43 Comparison of Different Average Number of

Service Properties

This section compares the average number of service properties in these five Web Service Registries. This average number of service properties is calculated by means of the following equation:

ASP = ONSP / ONS    (1)

Where
ASP is the average number of service properties for one Web Service Registry,
ONSP is the overall number of service properties in one Web Service Registry,
ONS is the overall number of Web services that have already been crawled from the corresponding Web Service Registry.

Figure 4-3 shows the average number of service properties per Web service in these five Web Service Registries. As already mentioned before, one of the measures for assessing the quality of Web services in a Web Service Registry is the service information: the more information a Web service has, the better one knows that service, and consequently the corresponding Web Service Registry can offer Web services of better quality to its users. As can be seen from figure 4-3, the Service Repository and Biocatalogue Web Service Registries have a larger average number of service properties than the other three Web Service Registries. This directly reflects that these two Web Service Registries provide more detailed information about the Web services published in them, so that users can choose the service they need more easily and are more likely to use the Web services published in these two Web Service Registries. By contrast, the Xmethods and Seekda



Web Service Registries, which have less service information about their Web services, offer lower quality for these Web services. Therefore, users may not like to use the Web services provided in these two Web Service Registries, not to mention the Web services published in the Ebi Web Service Registry.

Figure4-3 Average Number of Service Properties

Based on the description presented in section 323, the causes of the different numbers of service properties in these Web Service Registries may consist of the following points. First, the amount of structured information for the Web services differs among the five Web Service Registries; part of the information for some Web services in a Web Service Registry may even be missing or have an empty value. For example, the amount of structured information that is supposed to be extracted for the Web services in the Ebi Web Service Registry differs greatly from that in the Biocatalogue Web Service Registry. Secondly, there is a certain amount of endpoint information in almost all Web Service Registries except the Ebi Web Service Registry; its absence more or less reduces the overall number of service properties. Thirdly, some Web Service Registries, such as Xmethods and Ebi, do not have monitoring information. In particular, the Service Repository Web Service Registry has a large amount of monitoring information about its Web services that can be extracted from the Web. Obviously, the last point is the amount of whois information for these Web services. If the database of the whois client does not contain the information about the service domain of a Web service in a Web Service Registry, then no whois information can be extracted. Moreover, even if there is information about the service domain, the amount of this information can be very diverse. Therefore, if a Web Service Registry is in the situation that many service domains of its Web services have no or only little whois information, then the average number of service properties in that registry decreases greatly.

As a result, in order to help users get better acquainted with the Web services provided in a Web Service Registry and to distinguish these Web services, the Web Service Registry should do its best to offer more and more information for each of its published Web services.

(Figure 4-3 is a bar chart of the average number of service properties per registry; the plotted values are approximately 23 for Service Repository, 7 for Ebi, 17 for Xmethods, 17 for Seekda and 32 for Biocatalogue.)


44 Different Outputs of Web Services

The fundamental task of this master program is to obtain the WSDL documents of the Web services hosted in the Web Service Registries as well as to extract and gather the properties of these Web services and thereafter store them on disk. Therefore, this section describes the different outputs of this master program, which include the WSDL documents of the Web services, the generated XML and INI files, and the data records of these service properties in the database.

Figure4-4 WSDL Document format of one Web service

The WSDL document of a Web service is simply read from the Web according to the URL address of its WSDL link, and these data are then stored on the disk as the WSDL document. The name of the WSDL document is the service name plus the ending ".wsdl", which marks it as a WSDL document. In order to distinguish WSDL documents whose names would be the same although their contents differ, the name of each obtained WSDL document in one Web Service Registry contains a unique Integer in front of the name. Figure 4-4 shows the valid WSDL document format of a Web service; its name is "1BLZService.wsdl".

Besides, the obtained service properties are transformed into an XML file, an INI file and data records in the database. Figure 4-5, figure 4-6 and figure 4-7 show these three different output formats respectively. Figure 4-5 shows the INI file of the Web service, whose name is "1BLZService.ini". The Integer is the same as in the WSDL document, because they are materials belonging to the same Web service. The first three lines in that INI file are service comments, which start from the semicolon and run to the end of the line; they are the basic information describing this INI file. The following line is the section, enclosed in a pair of brackets. It is important because it indicates that the lines behind it are the information of this Web service. Therefore, the rest of the lines are the actual service information, given as key-value pairs with an equals sign between key and value. Each service property is displayed from the beginning of a line.


Figure4-5 INI File format of one Web service

Figure4-6 XML File format of one Web service

Figure4-7 Database data format for all Web services


Furthermore, figure 4-6 shows the generated XML file format of the Web service "BLZService"; its name is "1BLZService.xml". Needless to say, this XML file is part of the materials of the same Web service. Though the format of the XML file is different from that of the INI file, their essential contents are the same. That is to say, the values of the service properties do not differ, because both files are generated from the collection of properties of the same Web service. The XML file also has some comments like those in the INI file, which are displayed between "<!--" and "-->". The section of the INI file corresponds roughly to the root element of the XML file. Therefore, the values of the elements below the root "service" in this XML file are the values of the service properties of this Web service.

Eventually, figure 4-7 shows the database table which is used to store the service information of all Web services in these five Web Service Registries. The entire service information of one Web service is only one record in this table. Because of that, the column names of that table have to be the union of the names of the service information in each Web Service Registry. However, since the column names of the table must be unique, the redundant names in this union must be eliminated. This is possible and makes sense because the names of the service information are well-defined and uniform for all five Web Service Registries. In addition, the first column of this table is the primary key, which is an increasing Integer. Its function is similar to the Integer contained in the names of the XML file and the INI file. The rest of the columns in that table are the corresponding service properties of the Web services. The value "NULL" or "Null" in this table represents that this property of that Web service is empty or missing.

45 Comparison of Average Time Cost for Different

Parts of Single Web Service

This section aims to describe the comparison of the average time cost for different parts of getting

one single Web service in all these five Web Service Registries At first it has to calculate the average

time cost of getting one single service in the Web Service Registry The calculation of it can be

obtained through following equation = (2)

Where

ATC is the average time cost for one single Web service

OTS is the overall time cost of all the Web services in one Web Service Registry

ONS is the overall number of Web services that have already been crawled from the

corresponding Web Service Registry
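For illustration only, with assumed totals: a registry from which 1,000 services were crawled in an overall time of 10,042,000 milliseconds would give ATC = 10,042,000 / 1,000 = 10042 milliseconds per service, the order of magnitude reported for the Service Repository in table 4-3.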

In addition, the average time cost for getting one single service is split into the following six parts: the average time cost for extracting the service properties, the average time cost for obtaining the WSDL document, the average time cost for generating the XML file, the average time cost for generating the INI file, the average time cost for inserting the service properties into the database table, and the average time cost for the remaining procedures, such as getting the service list page link and the service page link. The average time cost for extracting the service properties is obtained by means of the following equation:


ATCSI = OTSSI / ONS    (3)

Where

ATCSI is the average time cost for extracting service property of one single Web service

OTSSI is the overall time cost for extracting service property of all the Web services in one Web

Service Registry

ONS is the overall number of Web services that have already been crawled from the

corresponding Web Service Registry

The other parts are calculated analogously to the equation for the average time cost for extracting the service properties. The average time cost for the remaining procedures, however, equals the average time cost for one single Web service minus the sum of the average time costs of the other five parts.
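Written out with subscripts of my own choosing (this notation is not used elsewhere in the thesis), that last relation reads ATC_Others = ATC - (ATC_Property + ATC_WSDL + ATC_XML + ATC_INI + ATC_Database).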

Registry             Service property   WSDL Document   XML File   INI File   Database   Others   Overall
Service Repository   8801               918             2          1          53         267      10042
Ebi                  699                82              2          1          28         11       823
Xmethods             5801               1168            2          1          45         12       7029
Seekda               5186               1013            2          1          41         23       6266
Biocatalogue         39533              762             2          1          66         1636     42000

Table4-3 Average time cost information for all Web Service Registries (all values in milliseconds)

Table 4-3 displays the average time cost of one Web service and of its different parts in all these five Web Service Registries. The first column of table 4-3 names the five Web Service Registries, the last column gives the overall average time cost for a single service in each Web Service Registry, and the remaining columns give the average time cost of the six different parts. To provide an intuitive view of the data in table 4-3, each column of this table is illustrated by a corresponding figure; see figure 4-8 to figure 4-13.

Figure4-8 Average time cost for extracting service property in all Web Service Registries (bar chart; x-axis: the name of the Web Service Registry, y-axis: the time in milliseconds; values 8801, 699, 5801, 5186 and 39533 for Service Repository, Ebi, Xmethods, Seekda and Biocatalogue respectively)


As can be seen from figure 4-8, the average time cost for extracting the service properties in the Biocatalogue Web Service Registry is 39533 milliseconds. This is much larger than in the other four Web Service Registries, where it amounts to 8801, 699, 5801 and 5186 milliseconds for Service Repository, Ebi, Xmethods and Seekda respectively. In other words, it takes much longer to extract the service properties of the Web services published by the Biocatalogue Web Service Registry. This also indirectly reflects that the Biocatalogue Web Service Registry has the highest average number of service properties, as already discussed in section 4.3, whereas the average number of service properties in the Ebi Web Service Registry is the smallest. Moreover, the average time cost in the Xmethods Web Service Registry is larger than that in the Seekda Web Service Registry, although, as shown before, the average number of service properties is the same for these two Web Service Registries. One cause that can explain why Xmethods costs more time than Seekda is that the extraction of service properties in the Xmethods Web Service Registry has to work on both the service page and the service list page, while for the Seekda Web Service Registry only the service page link is needed.

The average time cost for obtaining the WSDL document in all these five Web Service Registries is displayed in figure 4-9. This average time cost is the sum of the average time for extracting the WSDL link of a Web service and the average time for reading the WSDL document from the Web and storing it on disk. As the figure shows, the average time cost for obtaining the WSDL document is largest in the Xmethods Web Service Registry, at 1168 milliseconds. Although extracting the WSDL link takes a certain amount of time, it does not significantly influence the total average time spent on obtaining the WSDL document, because the WSDL link of one Web service is usually obtained in a single step. This implies that the average size of the WSDL documents in the Xmethods Web Service Registry is larger than in the other four Web Service Registries, in particular the Ebi Web Service Registry, which needs only 82 milliseconds for obtaining the WSDL document.

Figure4-9 Average time cost for obtaining WSDL document in all Web Service Registries (bar chart; x-axis: the name of the Web Service Registry, y-axis: the time in milliseconds; values 918, 82, 1168, 1013 and 762 for Service Repository, Ebi, Xmethods, Seekda and Biocatalogue respectively)


Figure 4-10, figure 4-11 and figure 4-12 show the average time cost of generating the three different outputs in all these five Web Service Registries. As figures 4-10 and 4-11 show, the average time for generating the XML file of one Web service is the same for all five Web Service Registries, namely only 2 milliseconds, and the average time for generating the INI file is likewise constant at just 1 millisecond. Even the sum of these two average time costs is so small that it can be neglected compared to the overall average time cost of getting one Web service in each Web Service Registry, shown in figure 4-13. This implies that the generation of the XML and INI files is finished almost immediately once the service properties of a Web service are available. Furthermore, figure 4-12 shows that although the average time cost of creating the database record for each Web service is larger than the time for generating the XML and INI files in all five Web Service Registries, the operation of creating the database record is still fast.

Figure4-10 Average time cost for generating XML file in all Web Service Registries (bar chart; x-axis: the name of the Web Service Registry, y-axis: the time in milliseconds; the value is 2 for all five registries)

Figure4-11 Average time cost for generating INI file in all Web Service Registries (bar chart; x-axis: the name of the Web Service Registry, y-axis: the time in milliseconds; the value is 1 for all five registries)


Figure4-12 Average time cost for creating database record in all Web Service Registries

Figure4-13 Average time cost for getting one Web service in all Web Service Registries

Figure 4-13 gives the average time cost for getting one single Web service in all these five Web Service Registries. Unsurprisingly, the Biocatalogue Web Service Registry takes the longest time for this process: the presentation of the five different parts above has shown that it needs the most time for almost every part, with the exception of obtaining the WSDL document, where Biocatalogue is not the slowest. Moreover, a striking observation emerges when comparing figures 4-8, 4-12 and 4-13: the shapes of these curves follow almost the same trend. This further indicates that a Web Service Registry which spends more time on getting the description information of one Web service also offers more information about that Web service.

(Chart data of figure 4-12: 53, 28, 45, 41 and 66 milliseconds for Service Repository, Ebi, Xmethods, Seekda and Biocatalogue respectively; chart data of figure 4-13: 10042, 823, 7029, 6266 and 42000 milliseconds in the same order; both bar charts plot the time in milliseconds over the name of the Web Service Registry.)


5 Conclusion and Further Direction

This master thesis provides a schema that aims to explore the description information of the Web services hosted in different Web Service Registries. The description information of a Web service consists of the WSDL document and the service information about that Web service. Although there is an existing approach, the Pica-Pica Web Service Description Crawler, which can be used to obtain the WSDL document and the service information of the Web services in these Web Service Registries, its functionality is restricted: it explores only a small subset of the Web services hosted in these Web Service Registries, the storage of these Web services is not flexible, and, most importantly, only a few pieces of service information are extracted, while for some Web Service Registries no service information is extracted at all. The work presented in this master thesis, in contrast, is able to explore all Web services in these Web Service Registries. Moreover, with the approach presented here, as much service information as possible is extracted for each Web service, so that the final result is the largest annotated service catalogue ever produced. Furthermore, regarding the storage of the description information of the Web services, this master thesis provides three different ways that guarantee not only the completeness but also the longevity of this description information.

However, in the implementation performed for this master thesis, the whois client used for querying the information of the service domain returns free text if the information exists, and this free text sometimes differs completely from domain to domain. As a consequence, every Web service in all Web Service Registries had to be crawled at least once during the experiment stage, so that all variants of this free text could be foreseen and processed afterwards. This is a huge amount of work, because there are lots of Web services in these Web Service Registries. Therefore, in order to simplify the work, another whois client that eases this task needs to be found and used.
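One possible direction, sketched below under the assumption that the third-party python-whois package would be acceptable, is to rely on a client that already parses the free text into named fields instead of handling every text variant in the crawler itself.

import whois  # third-party package "python-whois"; an assumption, not the client used in this thesis

# Sketch only: query one service domain and read already-parsed fields.
record = whois.whois("thomas-bayer.com")
print(record.registrar)        # e.g. the registrar name
print(record.creation_date)    # e.g. the registration date of the domain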

Moreover, in the experiment stage of this master thesis, the time cost for getting a Web service is still large. To reduce this time, multithreaded programming could be applied to some parts of the process of getting one Web service.
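As an illustration of that idea, the Python sketch below fetches several WSDL documents concurrently with a thread pool; the function, the example link and the degree of parallelism are assumptions, not part of the existing implementation.

from concurrent.futures import ThreadPoolExecutor
import urllib.request

def fetch_wsdl(link):
    # Sketch only: download one WSDL document and return its content.
    with urllib.request.urlopen(link, timeout=30) as response:
        return response.read()

wsdl_links = ["http://www.thomas-bayer.com/axis2/services/BLZService?wsdl"]  # illustrative link
with ThreadPoolExecutor(max_workers=8) as pool:
    documents = list(pool.map(fetch_wsdl, wsdl_links))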

Although the work performed here is specialized for these five Web Service Registries only, the main principles used here are adaptable to other Web Service Registries with only small changes to the implementation code or the structure.


6 Bibliography

[1] Nathalie Steinmetz, Holger Lausen, Dario Cerizza, Andrea Turati, Irene Celino, Adam Funk and Michael Erdmann, "D1.1 - Requirement Analysis and Architectural Plan", available from http://www.service-finder.eu/deliverables/public/7-public/32-d11, Emanuele Della Valle (CEFRIEL), June 27, 2008

[2] Nathalie Steinmetz, Holger Lausen, Dario Cerizza, Andrea Turati, Irene Celino, Adam Funk and Michael Erdmann, "D1.2 - First Design of Service-Finder as a Whole", available from http://www.service-finder.eu/deliverables/public/7-public/37-d12-first-design-of-service-finder-as-a-whole, Emanuele Della Valle (CEFRIEL), July 1, 2008

[3] Nathalie Steinmetz, Holger Lausen, Irene Celino, Dario Cerizza, Saartje Brockmans, Adam Funk, "D1.3 - Revised Requirement Analysis and Architectural Plan", available from http://www.service-finder.eu/deliverables/public/7-public/57-d13-revised-requirement-analysis-and-architectural-plan, Emanuele Della Valle (CEFRIEL), April 1, 2009

[4] Chia-Hui Chang, Mohammed Kayed, Moheb Ramzy Girgis, Khaled Shaalan, "A Survey of Web Information Extraction Systems", Volume 18, Issue 10, IEEE Computer Society, pp. 1411-1428, October 2006

[5] Leonard Richardson, "Beautiful Soup Documentation", available from http://www.crummy.com/software/BeautifulSoup/documentation.html, October 13, 2008

[6] Hao He, Hugo Haas, David Orchard, "Web Services Architecture Usage Scenarios", available from http://www.w3.org/TR/ws-arch-scenarios/, February 11, 2004

[7] Stephen Soderland, "Learning Information Extraction Rules for Semi-Structured and Free Text", Machine Learning, Volume 34, Issue 1-3, pp. 233-272, February 1999 (Department of Computer Science and Engineering, University of Washington, Seattle)

[8] Ian Hickson, "A Vocabulary and Associated APIs for HTML and XHTML", World Wide Web Consortium, Working Draft WD-html5-20100624, January 22, 2008

[9] Holger Lausen, Jos de Bruijn, Axel Polleres, Dieter Fensel, "The Web Service Modeling Language WSML", WSML Deliverable D16.1 v0.2, March 20, 2005, available from http://www.wsmo.org/TR/d16/d16.1/v0.2/

[10] Dumitru Roman, Holger Lausen, Uwe Keller, "Web Service Modeling Ontology - Standard (WSMO-Standard)", WSMO Deliverable D2 version 1.1, 06 March 2004, available from http://www.wsmo.org/TR/d2/v1.1/

[11] Iris Braun, Anja Strunk, Gergana Stoyanova, Bastian Buder, "ConQo - A Context- and QoS-Aware Service Discovery", TU Dresden, Department of Computer Science, in Proceedings of WWW/Internet 2008


7 Appendixes

There are two additional outputs of this master program: the log information file and the statistic report file. Figure 8-1 shows one of the basic output formats of the log information for these five Web Service Registries.

Figure8-1 Log information of the "Service Repository" Web Service Registry

Figure8-2 Statistic information of the "Service Repository" Web Service Registry


Figure 8-2 to figure 8-6 show the output format of the statistic information for the "Service Repository", "Ebi", "Xmethods", "Seekda" and "Biocatalogue" Web Service Registries respectively.

Figure8-3 Statistic information of the "Ebi" Web Service Registry

Figure8-4 Statistic information of the "Xmethods" Web Service Registry


Figure8-5 Statistic information of the "Seekda" Web Service Registry

Figure8-6 Statistic information of the "Biocatalogue" Web Service Registry


Table of Figures

Figure2-1 Dataflow of Service-Finder and Its Components
Figure2-2 Left is the free text input type and right is its output
Figure2-3 A Semi-structured page containing data records (in rectangular box) to be extracted
Figure2-4 Architecture of the Pica-Pica Web Service Description Crawler
Figure3-1 Overview of the Basic Architecture for the Deep Web Services Crawler
Figure3-2 Overview of the process flow of the Web Service Extractor Component
Figure3-3 Service list page of the Service-Repository
Figure3-4 Original source code of the internal link for Web service "BLZService"
Figure3-5 Code overview of getting service page link in Service Repository
Figure3-6 Service page of the Web service "BLZService"
Figure3-7 Overview of the process flow of the WSDL Grabber Component
Figure3-8 WSDL link of the Web service "BLZService" in the service page
Figure3-9 Original source code of the WSDL link for Web service "BLZService"
Figure3-10 Code overview of the "getServiceRepositoryWSDLLink" function
Figure3-11 Code overview of the "oneParameter" function
Figure3-12 Overview of the process flow of the Property Grabber Component
Figure3-13 Structure properties of the service "BLZService" in the service list page
Figure3-14 Structure properties of the service "BLZService" in the service page
Figure3-15 Endpoint information of the Web service "BLZService" in the service page
Figure3-16 Monitoring information of the service "BLZService" in the service page
Figure3-17 Whois information of the service domain "thomas-bayer.com"
Figure3-18 Overview of the process flow of the Storage Component
Figure3-19 Implementation code for getting WSDL document
Figure3-20 Implementation code for generating XML file
Figure3-21 Implementation code for generating INI file
Figure3-22 Implementation code for creating table in database
Figure3-23 Implementation code for generating table records
Figure4-1 Service amount statistic of these five Web Service Registries
Figure4-2 Statistic information for WSDL Document
Figure4-3 Average Number of Service Properties
Figure4-4 WSDL Document format of one Web service
Figure4-5 INI File format of one Web service
Figure4-6 XML File format of one Web service
Figure4-7 Database data format for all Web services
Figure4-8 Average time cost for extracting service property in all Web Service Registries
Figure4-9 Average time cost for obtaining WSDL document in all Web Service Registries
Figure4-10 Average time cost for generating XML file in all Web Service Registries
Figure4-11 Average time cost for generating INI file in all Web Service Registries
Figure4-12 Average time cost for creating database record in all Web Service Registries
Figure4-13 Average time cost for getting one Web service in all Web Service Registries


Table of Tables

Table 3-1 Structured Information of Service-Repository Web Service Registry
Table 3-2 Structured Information of Xmethods Web Service Registry
Table 3-3 Structured Information of Seekda Web Service Registry
Table 3-4 Structured Information of Ebi Web Service Registry
Table 3-5 Structured Information of Biocatalogue Web Service Registry
Table 3-6 SOAP Operation Information of Biocatalogue Web Service Registry
Table 3-7 REST Operation Information of Biocatalogue Web Service Registry
Table 3-8 Endpoint Information of these five Web Service Registries
Table 3-9 Monitoring Information of these five Web Service Registries
Table 3-10 Whois Information for these five Web Service Registries
Table 3-11 Extracted Structured Information of Web Service "BLZService"
Table 3-12 Extracted Endpoint Information of the Web service "BLZService"
Table 3-13 Extracted Monitoring Information of the Web service "BLZService"
Table 3-14 Extracted Whois Information of service domain "thomas-bayer.com"
Table 3-15 Sleep Time of these five Web Service Registries
Table 4-1 Service amount statistic of these five Web Service Registries
Table 4-2 Statistic information for WSDL Document
Table 4-3 Average time cost information for all Web Service Registries


Table of Abbreviations

DTD Document Type Definition

DWSC Deep Web Service Crawler

HTML HyperText Markup Language

MTBF Mean Time between Failures

MTTR Mean Time to Recovery

QoS Quality of Service

REST Representational State Transfer

RTT Round Trip Time

SOAP Simple Object Access Protocol

SQL Structured Query Language

URL Uniform Resource Locator

WHATWG Web Hypertext Application Technology Working Group

XML eXtensible Markup Language

WSDL Web Service Description Language

WSML Web Service Modeling Language

WSMO Web Service Modeling Ontology

Page 5: Deep Web Service Crawler

Deep Web Service Crawler

5

312 System Requirements for DWSC 23

313 Non-Functional Requirements for DWSC 24

32 Deep Web Services Crawler Architecture 24

321 The Function of Web Service Extractor Component 26

3211 Features of the Web Service Extractor Component 28

3212 Input of the Web Service Extractor Component 28

3213 Output of the Web Service Extractor Component 28

3214 Demonstration for Web Service Extractor 29

322 The Function of WSDL Grabber Component 30

3221 Features of the WSDL Grabber Component 31

3222 Input of the WSDL Grabber Component 31

3223 Output of the WSDL Grabber Component 31

3224 Demonstration for WSDL Grabber Component 31

323 The Function of Property Grabber Component 33

3231 Features of the Property Grabber Component 36

3232 Input of the Property Grabber Component 37

3233 Output of the Property Grabber Component 37

3234 Demonstration for Property Grabber Component 37

324 The Function of Storage Component 40

3241 Features of the Storage Component 42

3242 Input of the Storage Component 43

3243 Output of the Storage Component 43

3244 Demonstration for Storage Component 43

33 Multithreaded Programming for DWSC 46

34 Sleep Time Configuration for Web Service Registries 46

4 Experimental Results and Analysis 48

41 Statistic Information for Different Web Service Registries 48

42 Statistic Information for WSDL Document 49

43 Comparison of Different Average Number of Service Properties 50

44 Different Outputs of Web Services 52

45 Comparison of Average Time Cost for Different Parts of Single Web Service 54

5 Conclusion and Further Direction 59

6 Bibliography 60

Deep Web Service Crawler

6

7 Appendixes 61

Table of Figures 64

Table of Tables 65

Table of Abbreviations 66

Deep Web Service Crawler

7

1 Introduction

In this introduction chapter of the master thesis at first itrsquos going to concisely explain the background

of the current situation then is the basic introduction of the proposed approach which is called Deep

Web Service Extraction Crawler

11 BackgroundMotivation

In the late 1990rsquos the Web Service Registry was a hot commodity The formal definition of Web

Service Registry is known as a link links page Its function is to uniformly present information that

comes from various sources Hence it can provide a convenient channel to the users for offering

searching and using the Web Services Actually the related metadata of the Web Services that

submitted by both the system and users are commonly hosted along with the Service descriptions

Nevertheless in fact when users enter one of the Web Service Registries to look for some Web

Services they might meet some situations that would bring lots of trouble to them One of the

situations may be like that these Web Service Registries return several similar published Web Services

after the users search on it For example two or more Web Services have the same name but their

versions are not the same Or two or more Web Services that derived from the same server but have

different contents etc Furthermore most users are also interested in a global view of the published

services For instance they want to know which Web Service Registry can provide better quality for

the Web Service Therefore in order to help users to differentiate those similar published Web

Services and have a global view of the Web Services this information should be monitored and rated

Moreover there are a great many Web Service Registries in the Internet Each Web Service Registry

can provide a great number of Web Services Obviously there might have some similar Web Services

among these Web Service Registries Or a Web Service in one of the Web Service Registry is related to

another Web Service in other Web Service Registries Hence these Web Services should be

comparable across different Web Service Registries However recently there has not much support of

this In addition towards the metadata actually not all of them are structured especially the

descriptions of the non-functional property Therefore what have to do now is to turn those

non-functional property descriptions into the structured format Clearly speaking it needs to extract

as much information as possible about the Web Services that offered in the Web Service Registries

Eventually after extracting all the information from the Web Service Registries it is necessary to store

them into the disk This procedure should be efficient flexible and completeness

12 Initial Designing of the Deep Web Service

Crawler Approach

The problems have already been stated in the previous section Hence the following work is to solve

Deep Web Service Crawler

8

these problems In this section it will present the basic principle of Deep Web Service Crawler

approach

At first it is the simply introduction of the Deep Web Service Crawler approach to these problems As

have already been mentioned each Web Service Registry can offer Web Services Moreover each

Web Service Registry has its own html page structures These structures may be the same or even

complete different Therefore the first thing is to identify which Web Service Registry that it will be

going to explore Since each Web Service Registry owns a unique URL this job can be done by directly

analyzing the corresponding URL address of that Web Service Registry After identifying which Web

Service Registry it is going to explore the following step is to obtain all these Web Services that

published in that Web Service Registry Then with all these obtained Web Services it is time to extract

analyze and gather the information of the services That information can be in structured format or

even in unstructured format In this master thesis some Deep Web Analysis Techniques will be

applied to obtain this information So that the information about each Web Service shall be the

largest annotated The last but not the least important all the information about the Web Services

need to be stored

13 Goals of this Master Thesis

The lists in the following are the goals of this master thesis

n Produce the largest annotated Service Catalogue

Service Catalogue is a list of service properties The more properties the service has the larger

Service Catalogue it owns Therefore this master program should extract as much service

properties as possible

n Flexible storage of these metadata of each service as annotations or dedicated documents

The metadata of one service includes not only the WSDL document but also service properties

All these metadata are important information for the service Therefore this master program

should provide flexible ways to store these metadata into the disk

n Improve the comparable property of the Web Services across different Web Service Registries

The names of service properties for one Web Service Registry could be different from another

Web Service Registry Hence for the purpose of improving the comparable ability all these

names of the service properties should be uniformed and well-defined

14 Outline of this Master Thesis

In this chapter the motivation objective and initial approach plan have already been discussed

Thereafter the remaining paper is structured as follows

Firstly in chapter 2 the work presented is almost based on these existing techniques In section 21

there is given a detailed introduction to the technique of the Service-Finder project Then in section

22 the already implemented technique of Pica-Pica Web Service Description Crawler is introduction

and discussed After that in section 23 the Information Retrieval technique is presented

Deep Web Service Crawler

9

Then following in chapter 3 it is explained the designing details of this Deep Web Service Crawler

approach In section 31 it gives a short description for the different requirements of this approach

Next in section 32 the actual design for the Deep Web Service Crawler is presented Then in section

33 34 the multithreaded programming and sleep time configuration that used in this master

program are introduced respectively

In chapter 4 it is supposed to display the experiments of this Deep Web Service Crawler approach

and then give some evaluation of it

Finally in Section 5 the conclusion and discussion of the work that already done as well as the work in

the future for this master task are presented respectively

Deep Web Service Crawler

10

2 State of the Art

This chapter aims at presenting some existing techniques or Strategies that related to the work of

applying this Deep Web Service Extraction Crawler approach Section 21 plans to talk about the

existing catalogues Service-Finder project And then in section 22 it is going to explain the existing

implemented crawlers Pica-Pica Web Service Description Crawler Finally in the section 23 it is

supposed to present some details about the Information Extraction technique

21 Service Finder Project

Service-Finder Project aims at developing a platform for Web Service discovery especially for the Web

Services that are embedded in a Web 20 environment [1] Hence it can provide an efficient access to

publicly available services The goals of the Service-Finder project are depicted as follows [1]

n Automatically gather Web Services and their related information

n Semi-automatically create semantic service description based on the information that available

on the Web

n Create and improve semantic annotations via the user feedback

n Describe the aggregated information in semantic models and allow reasoning query

However before describing the basic functionality of the Service-Finder Project there is going to

present one of its use cases and requirements first

211 Use Cases for Service-Finder Project

The Service-Finder Project employed the use case methodology of the W3C Use Case description [6]

for its needs and then applied this methodology to the use cases that it enumerated

2111 Use Case Methodology

There are three aspects needed to be considered for use case definitions of Service-Finder Project [1]

(1) Description that used to describe information of the use case

(2) Actors Roles and Goals that used to identify the actors they would be the roles they act and the

goals they need to achieve in the scenario

(3) Storyboard that used to describe the serial of interactions among the actors and the

Service-Finder Portal

2112 System Administrator

This section is going to present the use case that applied to the Service-Finder portal and that

illustrated the requirements on its functionality from a user point of view However all these

Deep Web Service Crawler

11

information in this use case are derived from [1] In this use case there has a system administrator

whose name is ldquoSam Adamsrdquo He works for the bank His job is to keep the online payment facilities

online and working all day and night Therefore if there is any system failures Sam Adams should fix

the problems as early as he can Thatrsquos why he wants to use an SMS Messaging service which will

alert him immediately by sending him a SMS Message in the case of a system failure

n Description

This use case is dedicated to describe the system administrator ldquoSam Adamsrdquo who is looking for an

SMS Messaging Service that he wants to build it into his application

n Actors Roles and Goals

The name of the Actor is ldquoSam Adamsrdquo His Role is a system administrator at a bank The goals of him

are the immediate service delivery the reliability of the service and low base fee and transaction fee

n Storyboard

Step 1 Because of Sam Adams knows the Service-Finder portal and he also knows that he can find

many useful services from it especially he know what he is looking for Hence he visits the

Service-Finder portal and starts to search by entering the keywords ldquoSMS Alertrdquo

Requirement 1 Search functionality

Step 2 Now the Service-Finder returns a list of matching services However Sam wants to choose the

number of matching services that will be displayed on one page And he would also expect there has

short information about the service functionality the service provider and the service availability So

that he could decide which service he will choose to read further

Requirement 2 Enable configurable pagination of the matching results and have some short

information for each service

Step 3 When Sam looks through the short information about the services that displayed on the first

page he expects to find the most relevant services that related to his request After that he would

like to read more detailed information about that service to see whether this service can provide the

needed functionality

Requirement 3 Rank the returned matching services and must provide ability to read more details of

a service

Step 4 In the case that all the returned matching services Sam got provide quite different

functionalities or they belong to different service categories for example the SMS messaging services

alert users not through SMS but voice messaging For this reason Sam would like to see other

different categories that may be contain the services he wants Or the services of other categories

which he is also interested in (like ldquoSMS Messagingrdquo) Besides another possible way is that Sam can

further filter his search in terms of browsing through categories

Requirement 4 Service categories and allow the user to look all services that belonged to that specific

category If possible it should also allow the user to browse through categories

Step 5 When Sam got all the services that could provide a SMS messaging service via the methods

described in the Step 4 at present he wants to look for the services that offered by an Austrian

provider and have no base fees if possible

Requirement 5 Faceted search

Deep Web Service Crawler

12

Step 6 After Sam got all these specific services now he would like to choose the services that can

provide a high reliability

Requirement 6 Sort functionality based on usersrsquo chooses

Step 7 For now Sam expects to compare the service availability between the promised to the service

provider and the actually provided This should be contained in the servicesrsquo details And there needs

also have service coverage information so that Sam can know whether this service covers the areas

he lives and works Moreover Sam would also like to compare these services in other way For

instance put some services into a structured table to compare the transaction fees

Requirement 7 A side-by-side comparison table for services and a functionality that enable users to

select services he wants to compare

Step 8 At last Sam wants to know whether the service providers offer a free try out of the services

So that he can test the service functionality

Requirement 8 If possible display a note that offering free service trials

212 Architecture Plan for the Service-Finder Project

The architecture plan of the Service-Finder project contains five basic components Service Crawler

Automatic Annotator Conceptual Indexer and Matcher Cluster Engine and Service-Finder Portal

Interface Figure 2-1 presents a high level overview of the components and the data flow among

them

Figure2-1Dataflow of Service-Finder and Its Components [3]

Deep Web Service Crawler

13

2121 The Principle of the Service Crawler Component

The Service Crawler is responsible for gathering available services and their related information from

the Web The overall cycle is depicted as following

(1) A Web developer publishes a Web Service

(2) Then the Crawling component begins to harvest the Web in order to identify the Web Services

like WSDL (Web Service Description Language) documents

(3) The Crawler is also going to search for other related information as long as a service is discovered

(4) Later after each periodic interval the Crawler will produce a consistent snapshot of the relevant

part of the Web

At last the output of the crawler would be forwarded to the subsequent components for analyzing

indexing and displaying

2122 The Principle of the Automatic Annotator Component

The Automatic Annotator receives the relevant data from previous component and generates

semantic service descriptions about the WSDL documents and its related information based on the

Service-Finder Ontology and Service Category Ontology

Firstly it will simply introduce those two compatible ontologies that would be used throughout the

whole process [2]

n Generic Service Ontology it is an ontology which is functional to describe the data objects For

example the services the service providers availability payment modalities and so on

n Service Category Ontology it is an ontology which is used to categorize the functionalities or

applications of the services For instance data verification messaging data storage weather etc

Afterwards it is going to talk about the function of this component with its input output

Oslash Input

u Crawled data from Service Crawler

u Service-Finder Ontologies

u Feedback or Correction of before annotations

Oslash Function

u Enrich the information about the service and extract semantic statements according to the

Service-Finder Ontologies For example categorize the service according to the Service

Category Ontology

u Determine whether a particular document is relevant or not through the Web link graph If

not discard these irrelevant documents

u Classify the pages into their genres For instance pricing user comments FAQ and so on

Oslash Output

u Semantic annotation of the services

Deep Web Service Crawler

14

2123 The Principle of the Conceptual Indexer and Matcher

Component

The Conceptual Indexer and Matcher is more like a data store center that aims at storing all extracted

information of the services and supplying users the capability of retrieval and semantic query For

example the matchmaking between user requests and service offers and the act of retrieving user

feedback on extracted annotations

In addition letrsquos have a look of the function of this component and its input output

Oslash Input

u Semantic annotation data and full text information obtained from Automatic Annotation

u Semantic annotation data and full text information that come from user interfaces

u Cluster data from user and service clustering component

Oslash Function

u Store the semantic annotations received from the Automatic Annotation component and

from the user interface

u Store the cluster data that procured through the clustering component

u Store and index the textual description offered by the Automatic Annotation component

and the textual comments offered by users

u Ontological query the semantic data from the data store center

u Combined keyword and Ontological querying used for user queries

u Provide a list of similar services for a given service

Oslash Output

u A list of matching services that are queried by users In particular these services should be

sorted by ranking and can also be iterated

u All available data that related to a particular entity must be retrievable at the user interface

2124 The Principle of the Service-Finder Portal Interface

Component

The Service-Finder Portal Interface is the main entry point that provided for users of the

Service-Finder system to search and browse the data which is managed by the Conceptual Indexer

and Matcher component In addition the users can also contribute information by means of providing

tags comments categorizations and ratings to the data browsed Furthermore the developers can

still directly invoke the Service-Finder functionalities from their custom applications in terms of an API

Besides the details of this componentrsquos function input and output are represented as below

Oslash Input

u A list of ordered services for a query

u Detailed information about a service or a set of services and a service provider

u Query access to service category ontology and the most used tags provided by the users

Deep Web Service Crawler

15

u Service availability information

Oslash Function

u The Web Interface allows the users to search services by keyword tag or concept in the

categorization sort and filter query results by refining the query compare and bookmark

services try out the services that offer this functionality

u The API allows the developers to invoke Service-Finder functionalities

Oslash Output

u Explicit user annotations such as tags ratings comments decryptions and so on

u Implicit user data for example click stream of users bookmarks comparisons links sent

etc

u Manual advertising of available new services

2125 The Principle of the Cluster Engine Component

The Cluster Engine is in charge of collecting and analyzing information of user behaviors from the

Service-Finder Portal eg the queried services and the compared services of the users Moreover it

also provides cluster data to the Conceptual Indexer and Matcher for providing service

recommendations

Furthermore letrsquos detailed introduce this componentrsquos function input and output

Oslash Input

u Service annotation data of both extracted and user feedback

u Usersrsquo Click streams used for extracting user behaviors

Oslash Function

u Obtain user clusters from user behaviors

u Obtain service clusters from service annotation data to enable to find similar services

Oslash Output

u Clusters of users and services

22 Information Extraction

Due to the cause of the rapid development and use of the World-Wide Web there procedure a huge

amount of information sources on the Internet which has been limited the access to browsing and

searching for the reason of the heterogeneity and the lack of structure of Web information sources

Therefore the appearance of Information Extraction that transforms the Web pages into

program-friendly structures for post-processing would become a great necessity However the task of

Information Extraction is specified in terms of the inputs and the extraction targets And the

techniques used in the process of Information Extraction called extractor

221 Input Types of Information Extraction

Generally speaking there are three different input types The first input type is the unstructured

Deep Web Service Crawler

16

document For example the free text that showed in figure 2-2 It is unstructured and written in

natural language So that it will require substantial natural language processing While the second

input type is called the structured document For instance the XML documents based on the reason

that the data can be described through the available DTD (Document Type Definition) or XML

(eXtensible Markup Language) schema Finally but obviously the third input type is the

semi-structured document that are widespread on the Web Such as the large volume of HTML

pages like tables itemized lists and enumerated lists This is because HTML tags are often used to

render these embedded data in the HTML pages See figure 2-3

Figure2-2Left is the free text input type and right is its output [4]

Figure2-3A Semi-structured page containing data records

(in rectangular box) to be extracted [4]

Therefore in this way the inputs of semi-structured type can be seen as the documents with a fairly

regular structure And the data of these documents can be displayed in a format of HTML way or

non-HTML way Besides owing to the reason that the Web pages of the Deep Web are dynamic and

generated from structured databases in terms of some templates or layouts thus it would be

considered as one of the input sources which could provide some of these semi-structured documents

For example the authors price and comments of the book pages that provided by Amazon have the

Deep Web Service Crawler

17

same layout That is because these Web pages are generated from the same database and applied

with the same template or layout Furthermore there has another option which could manually

generate HTML pages of semi-structured type For example although the publication lists that

provided from different kinds of researchersrsquo homepages are produced by diverse uses they all have

title and source property for every single pager Eventually the inputs for some Information Extraction

can also be the pages with the same class or among various Web Service Registries

222 Extraction Targets of Information Extraction

Moreover regarding the task of the Information Extraction it has to consider the extraction target

There also have two different extraction targets The first one is the relation of k-tuple And the k in

there means the number of attributes in a record Nevertheless in some cases an attribute of one

record may have none instantiation Otherwise the attribute owns multiple instantiations In addition

the complex object with hierarchically organized data would be the second extraction target Though

the ways for depicting the extraction targets in a page are diverse the most common structure is the

hierarchical tree This hierarchical tree may contain only one leaf node or one or more lists of leaf

nodes which called internal nodes And the structure for a data object may also be flat or nested To

be brief if the structure is flat then there is only one leaf node that call also be called root Otherwise

if it is nested structure then the internal nodes that involved in this data object would be more than

two levels

Furthermore in order to make the Web pages readable for human being and having an easier

visualization these tables or tuples of the same list or elements of a tuple should be definitely isolated

or demarcated However the displaying for a data object in a Web page would be affected by

following conditions [4]

Oslash The attribute of a data object has zero or several values

(1) If there is no value for the attribute of a data object this attribute will be called the ldquononerdquo

attribute For example a special offer only available for certain books might be a ldquononerdquo

attribute

(2) If there are more than one values for the attribute of a data object it will be called the

ldquomultiValuerdquo attribute For instance the name of the author for a book could be a

ldquomultiValuerdquo attribute

Oslash The set of attributes (A1 A2 A3 hellip) has multiple ordering

That is to say among this set of attribute the position of the attribute might be changed

according to the diverse instances of a data object Thus this attribute will be called the

ldquomultiOrderingrdquo attribute For instance for the moives before year 1999 the move site would

enumerate the release data in front of the movesrsquo title while for the movies after year 1999

(including 1999) it will enumerate the release data behind the movesrsquo title

Oslash The attribute has different formats

This means the displaying format of the data object could be completely distinct with respect to

these different instances Therefore if the format of an attribute is free then a lot of rules will be

needed to deal with all kinds of possible cases This kind attribute will be called ldquomultiFormatrdquo

attribute For example an ecommerce Web site would use the bold font format to present the

general prices while use the red color format to display the sale prices Nevertheless there has

Deep Web Service Crawler

18

another situation that some different attributes for a data object have the same format For

example various attributes are presented in terms of using the ltTDgt tags in a table presentation

And the attributes like those could be differentiated by means of the order information of these

attributes However for cases that there occurs ldquononerdquo attribute or exists ldquomultiOrderingrdquo

attributes it must have to revise the rules for extracting these attributes

Oslash The attribute cannot be decomposed

Because of the easier processing sometimes the input documents would like to be treated as

strings of tokens instead of the strings of characters In addition some of the attribute cannot

even be decomposed into several individual tokens These attributes are called the ldquountokenizedrdquo

attributes For example the college course catalogue like ldquoCOMP4016rdquo or ldquoGEOL2001rdquo The

department code and the course number in them cannot be separated into two different strings

of characters like that ldquoCOMPrdquo and ldquo4016rdquo or ldquoGEOLrdquo and ldquo2001rdquo

223 The Used Techniques in Information Extraction

The extractor used in the process of Information Extraction aims at providing a single uniform query

interface to access information sources like database server and Web server It consists of following

phases collecting returned Web pages labeling these Web pages generalizing extraction rules

extracting the relevant data and outputting the result in an appropriate format (XML format or

relational database) for further information integration For example at first the extractor queries the

Web server to gather the returned pages through the HTTP protocols after that it starts to extract the

contents among these HTML documents and integrate with other data sources thereafter Actually

the whole process of the extractor follows below steps

Oslash Step 1

At the beginning it must have to tokenize the input However there are two different

granularities for the input string tokenization They are tag-level encoding and word-level

encoding The tag-level encoding will transform the tags of HTML page into the general tokens

while transform all text string between two tags into a special token Nevertheless the

word-level encoding does this in another way It treats each word in a document as a token

Oslash Step 2

Next it should apply the extraction rules for every attributes of the data object in the Web pages

These extraction rules could be induced in terms of a top-down or bottom-up generalization

pattern mining or logic programming In addition the type of extraction rules may be indicated

by means of regular grammars or logic rules For example some use path-expressions of the

HTML parse tree path like htmlheadtitle or html-gttable[0] some use syntactic or semantic

constraints and some use delimiter-based constraints such as HTML tags or literal words

Oslash Step 3

After that all these extracted data would be assembled into the records

Oslash Step 4

Finally iterate this process until all these data objects in the input

Deep Web Service Crawler

19

23 Pica-Pica Web Service Description Crawler

The Pica-Pica is knows as a kind of bird species it can also be called pie However at the moment the

Pica-Pica here is a Web Service Description Crawler which is designed to solve the quality of Web

Services problem For example the evaluation of the descriptive quality of Web Services that offered

and how well are these Web Services described in nowadaysrsquo Web Service Registries

231 Needed Libraries of the Pica-Pica Web Service

Description Crawler

This version of the Pica-Pica Web Service Description Crawler is written by Anton Caceres and Josef

Spillner and programmed in terms of the Python language Actually in order to run these scripts to

parse the HTML pages it needs two additional libraries Beautiful Soup and Html5lib

n Beautiful Soup
It is an HTML/XML parser for the Python language that can turn even invalid markup into a parse tree [5]. The following three features make it particularly powerful:
u Bad markup does not choke Beautiful Soup; it generates a parse tree that makes approximately as much sense as the original document, so the desired data can still be obtained.
u Beautiful Soup provides a toolkit with simple, idiomatic methods for navigating, searching and modifying the parse tree, so there is no need to write a custom parser for every application.
u If the document specifies an encoding, there is no need to care about encodings, because Beautiful Soup converts incoming documents to Unicode and outgoing documents to UTF-8 automatically; otherwise only the encoding of the original document has to be specified.
Furthermore, the ways of including Beautiful Soup into an application are the following [5]:
sup2 from BeautifulSoup import BeautifulSoup (for processing HTML)
sup2 from BeautifulSoup import BeautifulStoneSoup (for processing XML)
sup2 import BeautifulSoup (to get everything)

n Html5lib
It is a Python package that implements the HTML5 [8] parsing algorithm. In order to achieve maximum compatibility with the current major desktop web browsers, this implementation is based on the WHATWG (Web Hypertext Application Technology Working Group) HTML5 specification.


232 Architecture of the Pica-Pica Web Service

Description Crawler

Figure2-4 Architecture of the Pica-Pica Web Service Description Crawler

Figure 2-4 illustrates the primary architecture of the Pica-Pica Web Service Description Crawler. It includes four fundamental components: the Service Page Grabber component, the WSDL Grabber component, the Property Grabber component and the WSML Register component.
(1) The Service Page Grabber component takes the URL seed as input and outputs the link of the service page to the following two components: the WSDL Grabber component and the Property Grabber component.
(2) The WSDL Grabber component is responsible for obtaining the WSDL document based on the delivered service page link and then checking whether the obtained WSDL document is valid. Only valid WSDL documents are passed on to the WSML Register component for further processing.
(3) The Property Grabber component tries to extract the service properties hosted in the service page, if any exist. After that, all these service properties are saved into an INI file as the information of that service.
(4) The functionality of the WSML Register component is to write an appropriate WSML document by means of the valid WSDL documents delivered from the WSDL Grabber component and the optional INI files delivered from the Property Grabber component, and afterwards to register them in Conqo.

n WSML [9]
WSML stands for Web Service Modeling Language, which provides a framework with different language variants. It is used to describe the different aspects of semantic Web Services according to the conceptual model of WSMO.
n WSMO [10]
WSMO, whose full name is Web Service Modeling Ontology, is dedicated to describing various aspects of Semantic Web Services based on ontologies. An ontology is a formal, explicit specification of a shared conceptualization. These ontologies are the connecting points that link the agreement within communities of users with the defined conceptual semantics of the real world.
n Conqo [11]


It is a discovery framework that considers not only the Quality of Service (QoS) but also context information. It uses a Web Service Repository to manage service descriptions that are based on WSML.

233 Implementation of the Pica-Pica Web Service

Description Crawler

This section describes the implementation of the Pica-Pica Web Service Description Crawler in detail.

(1) First, to start the whole crawling process, the Pica-Pica Web Service Description Crawler needs an input as the initial seed. The five Web Service Registries listed below are used for this crawler; their URL addresses serve as the input seeds. Moreover, in this version of the Pica-Pica Web Service Description Crawler there is a single Python script for each Web Service Registry, and the crawling processes of these scripts are executed one after another.

Biocatalogue http://www.biocatalogue.com
Ebi http://www.ebi.ac.uk
Seekda http://www.seekda.com
Service-Repository http://www.service-repository.com
Xmethods http://www.xmethods.net

(2) After being fed with the input seed, the crawler steps into the next component, the Service Page Grabber. At first this component reads the data from the Web based on the input seed. Then it builds a parse tree of the read data using the functions of the Beautiful Soup library. After that, the Service Page Grabber component looks for the service page link of each service published in the Web Service Registry by means of the functions of the Html5lib library. When the service page link of a single service is found, the component first checks whether this service page link is valid. Once the service page link is valid, it is passed to the following two components for further processing: the WSDL Grabber component and the Property Grabber component.

(3) When the WSDL Grabber component receives a service page link from the previous component, it sets out to extract the WSDL link address of that service from the parse tree of the data in this service page. Next, this component downloads the WSDL document of that service via the WSDL link address, and the obtained WSDL document is then stored on disk. The process of the WSDL Grabber component is carried on continually until no more service links are passed to it. Certainly, not all grabbed WSDL documents are usable: they may contain bad definitions or a bad namespaceURI, be empty documents, or not even be in XML format at all. Hence, in order to sort them out, this component further analyzes the involved WSDL documents. All valid documents are put into a "validWSDLs" folder, whereas invalid documents are put into a folder named "invalidWSDLs" in order to gather statistical information. Finally, only the WSDL documents in the "validWSDLs" folder are passed to the subsequent component.

(4) Moreover, since some Web Service Registries provide additional information about the


services, such as availability, service provider and version, the Property Grabber component sets out to extract this information as the service's properties and thereafter saves these properties into an INI file. However, if no additional information is available, there is no need to extract service properties and thus there is no INI file for that service. In this implementation of the Pica-Pica Web Service Description Crawler, only the Python scripts for the Seekda and Service-Repository Web Service Registries have functions for extracting the services' properties, while for the other three Web Service Registries no such function exists.

(5) Furthermore, it is optional to create a report file that contains statistical information about this process, such as the total number of services for one Web Service Registry, the number of services whose WSDL document is invalid, etc.

(6) As has been stated, at this point there is a folder with all valid WSDL documents and possibly some INI files. The task of the WSML Register component is therefore to generate the appropriate WSML documents from these valid WSDL documents and INI files and then register them in Conqo.

24 Conclusions of the Existing Strategies

This chapter presented three aspects of the existing strategies: the Service-Finder project, the Information Extraction technique and the Pica-Pica Web Service Description Crawler.
The task of this master program is to obtain the available Web services and their related information from the Web. This is essentially a procedure of extracting the needed information for each service, such as the service's WSDL document and its properties. Therefore the Information Extraction technique, which is used to extract information hosted in the Web, can be applied in this master program.
Moreover, Service-Finder is a large project that is not only able to obtain the available Web services and their related information from the Web, but also enriches the crawled data with annotations by means of the Service-Finder ontology and the Service Category ontology and integrates all the information into a coherent semantic model. Furthermore, it provides capabilities for searching and browsing the data through a user interface and gives users service recommendations. However, for a master program the Service-Finder project far exceeds the requirements; therefore it is only considered as a reference for this master program.
Furthermore, since the Pica-Pica Web Service Description Crawler aims only at obtaining the available Web services and their related information, it fulfills the primary task of this master program. Nevertheless, regarding the information about the service, the Pica-Pica Web Service Description Crawler extracts only a few properties, sometimes even none. Consequently, in order to improve the quality of the collected service descriptions, as many properties about each service as possible have to be extracted. Hence, chapter 3 presents an extension of the Pica-Pica Web Service Description Crawler.


3 Design and Implementation

In the previous chapter on the State of the Art, already existing techniques and implementations were presented. In the following, the basic principle of the proposed approach, the Deep Web Services Crawler, is introduced. It is based on these existing techniques, especially the Pica-Pica Web Service Description Crawler.

31 Deep Web Services Crawler Requirements

This section discusses the goals of the Deep Web Service Crawler approach, the system requirements with respect to the approach, and some non-functional requirements.

311 Basic Requirements for DWSC

The following are the basic requirements that should be achieved:
(1) Produce the largest annotated Service Catalogue. Here, the Service Catalogue is a list of Web services published in the Web Service Registries. It contains not only the WSDL document of each service but also a file with information about that service. Therefore, in order to produce the largest annotated Service Catalogue, the proposed approach needs to extract as many properties about those Web services as possible. Moreover, it also has to download the WSDL document hosted along with each Web service. That is to say, these properties are not only the interesting structured properties such as the service name and its WSDL link address, but also other non-functional properties, for example the endpoint and its monitoring information.
(2) Flexible storage: each service has a Service Catalogue that contains all its interesting properties. A central question is how to deal with these service properties, that is, which kinds of schemes will be used to store them. Hence, in order to store them in a flexible way, the proposed approach provides three methods of storage: the first one stores them as an XML file, the second stores them in an INI file, and the third uses a database for storage.

312 System Requirements for DWSC

Generally speaking, the requirements for realizing a programming project comprise the following:
1) Operating System: Linux/Unix, Windows (XP, Vista, 2000, etc.)
2) Programming language: C++, Java, Python, C, etc.
3) Programming Tool: NetBeans, Eclipse, Visual Studio, and so on
In this master thesis, the scripts of the Deep Web Service Crawler approach are written in the Java programming language. Besides, these scripts have only been tested on the Windows XP and Linux operating systems and have not been tested on other operating systems.


313 Non-Functional Requirements for DWSC

In this part, several non-functional requirements for the Deep Web Service Crawler approach are presented.
1) Transparency: the process of data exploration and data storage should be done automatically without user intervention. However, at the beginning the user has to specify the path on the hard disk that will be used to store the outputs of this program.
2) Fault tolerance: during the execution of this program, errors inevitably happen. Therefore, in order to keep the process from being interrupted, the necessary error handling for process recovery must be in place.
3) Completeness: this approach should extract as many of the interesting properties of each Web Service as possible, e.g. endpoint and monitoring information.
In addition, since the Pica-Pica Web Service Crawler has already implemented strategies for the following five URLs, the proposed approach must cover no fewer than these five Web Service Registries:

Biocatalogue http://www.biocatalogue.com
Ebi http://www.ebi.ac.uk
Seekda http://www.seekda.com
Service-Repository http://www.service-repository.com
Xmethods http://www.xmethods.net

32 Deep Web Services Crawler Architecture

In this section, an overview of the high-level architecture of the Deep Web Services Crawler approach is introduced first. Thereafter, four subsections are presented that focus on outlining each single component and how they play together.
The current components and flows of data in the Deep Web Service Crawler can be summarized as depicted in Figure 3-1 using the continuous arrows. It first tries to obtain the available service list page links and related service page links by crawling the Web (Web Service Extractor). Then the gathered links are processed in two separate steps: one obtains the service's WSDL document (WSDL Grabber) and the other collects the properties of each service (Property Grabber). Finally, all these data are stored on the storage device (Storage). The detailed process of figure 3-1 is illustrated in the following.

Oslash Step 1
When the Deep Web Service Crawler starts to run, the File Chooser container asks the user to specify a path on the computer or on any other hard disk. The reason for specifying this path is that the Deep Web Service Crawler program needs a place to store all its outputs.
Oslash Step 2
After that, the Web Service Extractor is triggered. It is the main entry point to the actual crawling process. Since the Deep Web Service Crawler program is supposed to crawl for Web Services in some given Web Service Registries, the URL addresses of


these Web Service Registries have to be given as the initial seed for this Web Service Extractor process. However, since the page structures of these Web Service Registries are completely different, there is a separate process for each Web Service Registry.

Figure3-1 Overview of the Basic Architecture for the Deep Web Services crawler

Oslash Step 3
According to the given seed, two types of links are obtained by the Web Service Extractor component: the service list page link and the service page link. A service list page is a page that contains a list of Web Services and possibly some information about these Web Services, while a service page is a page that contains much more information about a single service. Finally, it forwards these two types of links to the next two components, the Property Grabber and the WSDL Grabber.
Oslash Step 4
On the one hand, the Property Grabber component tries to gather the information about the service that is hosted in the service list page and the service page, such as the name of the service, its description, its ranking, etc. All the information about the service is collected together as the service properties, which are then delivered to the Storage component for further processing.
Oslash Step 5
On the other hand, the WSDL Grabber tries to obtain the WSDL link from the service list page or the service page. That is because for some Web Service Registries the WSDL link is hosted in


the service list page, as in Biocatalogue, and for other Web Service Registries it is hosted in the service page, as in Xmethods. After obtaining the WSDL link, it is also transmitted to the Storage component for further processing.
Oslash Step 6
When the service properties and the WSDL link of the service are received by the Storage component, they are stored on disk. The service properties are stored in three different ways: as an XML file, as an INI file, or as one record inside a database table. For the WSDL link, the Storage component first tries to download the page content according to the URL address of the WSDL link. If this succeeds, the page content is stored as the WSDL document of the service on disk.
Oslash Step 7
Nevertheless, this is just the crawling process for a single service, from step 3 to step 6. Hence, if there is more than one service or more than one service list page in a Web Service Registry, the crawling process from step 3 to step 6 is repeated again and again until there is no service or service list page left in that Web Service Registry.
Oslash Step 8
Furthermore, after the crawling process for one Web Service Registry finishes, a file is generated that contains some statistical information about this crawling process, for example the time when the crawling process of this Web Service Registry started and finished, the total number of Web services in this Web Service Registry, how many services have an empty WSDL document, the average number of service properties in this Web Service Registry, and the average time cost for extracting service properties, getting the WSDL document and generating the XML file, INI file, etc. A structural sketch of this per-registry crawling loop is given below.
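The following Java sketch illustrates how steps 3 to 8 could be orchestrated for one Web Service Registry. All interface and method names (WebServiceExtractor, PropertyGrabber, WSDLGrabber, Storage and their methods) are hypothetical placeholders chosen for illustration; they are not the actual signatures used in the implementation.

    import java.util.List;
    import java.util.Map;

    // Structural sketch only; the interfaces stand in for the real components
    public class RegistryCrawlSketch {

        interface WebServiceExtractor {
            List<String> getServiceListPageLinks();            // Step 3: service list pages
            List<String> getServicePageLinks(String listPage); // Step 3: service pages
        }
        interface PropertyGrabber {
            Map<String, String> getProperties(String listPage, String servicePage); // Step 4
        }
        interface WSDLGrabber {
            String getWSDLLink(String listPage, String servicePage); // Step 5
        }
        interface Storage {
            void store(Map<String, String> properties, String wsdlLink); // Step 6
            void writeStatistics();                                      // Step 8
        }

        // Steps 3-8 for one Web Service Registry (step 7 is the repetition of the loops)
        static void crawlRegistry(WebServiceExtractor extractor, PropertyGrabber props,
                                  WSDLGrabber wsdl, Storage storage) {
            for (String listPage : extractor.getServiceListPageLinks()) {
                for (String servicePage : extractor.getServicePageLinks(listPage)) {
                    Map<String, String> properties = props.getProperties(listPage, servicePage);
                    String wsdlLink = wsdl.getWSDLLink(listPage, servicePage);
                    storage.store(properties, wsdlLink); // persist as XML, INI and database record
                }
            }
            storage.writeStatistics();
        }
    }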

321 The Function of Web Service Extractor Component

The Web Service Extractor component is responsible for gathering service list page links and service page links. It pursues a focused crawl of the Web and only forwards service list page and service page links to the subsequent components for analysis and collection purposes. Therefore, it identifies both service list page links and the related service page links on these Web Service Registries.
As can be seen from figure 3-2, a crawl for Web Services needs to start from a URL seed. The seed is almost as important as the Web Service Extractor itself, as it highly influences the part of the Web that needs to be crawled. The seed can or shall contain, e.g., Web pages where these Web Services are published or which talk about Web Services.


Figure3-2 Overview of the process flow of the Web Service Extractor Component

After being fed with the URL seed, the Web Service Extractor component starts to get the service list page link from the initial page of this URL seed. However, this process differs for the five Web Service Registries. The following shows the different situations in these Web Service Registries.

Oslash Service-Repository Web Service Registry
In this Web Service Registry, the link of the first service list page is the URL address of its seed, which means some Web Services can already be found on the home page of the Service-Repository Web Service Registry. The word "first" implies that there is more than one service list page link in this Web Service Registry. Therefore, the process of getting service list page links in this registry is carried on continually until no further service list page link exists.
Oslash Xmethods Web Service Registry
Although there are Web Services on the home page of the Xmethods Web Service Registry, they are only a small subset of the Web Services in this registry. Moreover, in the Xmethods Web Service Registry there is only one page containing all Web Services. Therefore, the service list page link of that page has to be obtained.
Oslash Ebi Web Service Registry
The situation in the Ebi Web Service Registry is a little bit like that in the Xmethods Web Service Registry. That is to say, there is also one page that contains all Web Services of this Web Service Registry. However, this page is not the initial page of the input seed; therefore, more than one operation step is needed to get the service list page link of that page.
Oslash Seekda Web Service Registry
In the Seekda Web Service Registry, the Web Services are not contained in the initial page of the input seed. The service list page link can be obtained after several additional operation steps.


However, there is a problem with getting the service list page links in this registry. Simply put, if more than one page contains Web Services, the links of the remaining service list pages cannot be obtained for some unknown reason. In other words, only the link of the first service list page can be obtained.

Oslash Biocatalogue Web Service Registry
The process of getting service list pages in the Biocatalogue Web Service Registry is almost the same as in the Seekda Web Service Registry. The only difference is that all service list page links can be obtained in the Biocatalogue Web Service Registry if there is more than one service list page.

After getting the link of a service list page, the Web Service Extractor begins to get the service page link of each service listed in that service list page. This is possible because there is an internal link for every service which addresses its service page. It is worth noting that as soon as a service page link is obtained, this component immediately forwards the service page link and the service list page link to the two subsequent components for further processing. The process of obtaining the service page links is carried out continuously until all services listed in that service list page have been crawled. Analogously, the process of getting service list pages is carried out continuously until no more service list pages exist.

3211 Features of the Web Service Extractor Component

The main features are described in the following paragraphs.
1) Obtain service list page links
A central task is to obtain the corresponding service list page links. A service list page link is a URL address that leads to a public list of Web Services together with brief information about these Web services, such as the name of the service, an internal URL that links to another page containing the detailed information about that service, and sometimes the link address of the WSDL document.
2) Obtain service page links
Once the service list page links are found, a crucial aspect is to extract the internal link of each service to aid the task of service information discovery. Therefore, it is the task of the Web Service Extractor to harvest the HTML page content of the service list page, so that the service page links, which lead to much more detailed information about the single Web service, can be obtained.

3212 Input of the Web Service Extractor Component

This component depends on specific input seeds. The only input required for this component is a URL seed, which is one of the URLs listed in section 313.

3213 Output of the Web Service Extractor Component

The component produces two types of service-related page links from the Web:

l Service list page links

l Service page links


3214 Demonstration for Web Service Extractor

In order to give a comprehensive understanding of the process of the Web Service Extractor component, the following figures are given for explanation. Though there are five URL addresses, only the URL of the Service-Repository is shown as an example.
1) The input seed is the initial URL address of the Service-Repository, which is "http://www.service-repository.com".
2) As already stated in section 321, the first service list page link of this Web Service Registry is its input seed "http://www.service-repository.com". Figure 3-3 shows the corresponding service list page of that link.
Figure3-3 Service list page of the Service-Repository
Figure3-4 Original source code of the internal link for the Web service "BLZService"
Figure3-5 Code overview of getting the service page link in the Service-Repository
Figure3-6 Service page of the Web service "BLZService"


3) Now that the service list page link is known, the next step is to acquire the service page links of the services listed in that service list page. The text in the red box of figure 3-4 shows the internal link of the Web service "BLZService". However, this is not the complete link of the service page; it has to be prefixed with the initial URL address of the Service-Repository Web Service Registry, "http://www.service-repository.com". The code for getting the service page link of a Web service in the Service-Repository is shown in figure 3-5. Therefore the final link of this service page is "http://www.service-repository.com/service/overview-210897616". Figure 3-6 is the corresponding service page of that link. A small sketch of this prefixing step is given below.
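The following minimal Java sketch illustrates how such a relative internal link could be turned into the absolute service page URL. The base URL and the relative path are taken from the example above; the method name is an illustrative assumption, not the actual implementation.

    import java.net.URI;

    public class ServicePageLinkExample {
        // Prefix a relative internal link with the registry's base URL
        static String toAbsoluteLink(String baseUrl, String internalLink) {
            return URI.create(baseUrl).resolve(internalLink).toString();
        }

        public static void main(String[] args) {
            String base = "http://www.service-repository.com";
            String internal = "/service/overview-210897616";
            // Prints: http://www.service-repository.com/service/overview-210897616
            System.out.println(toAbsoluteLink(base, internal));
        }
    }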

4) Afterwards, the two links gathered by the Web Service Extractor component, the service list page link and the service page link, are immediately forwarded to the next two components, the WSDL Grabber component and the Property Grabber component.

322 The Function of WSDL Grabber Component

The WSDL Grabber component acquires the WSDL link hosted in the Web based on the service list page link or the service page link. The whole process flow is illustrated in figure 3-7.

Figure3-7 Overview of the process flow of the WSDL Grabber Component

When the WSDL Grabber component receives the inputs delivered from the previous component, it starts to get the WSDL link of the service based on these inputs. Although the inputs of the WSDL Grabber component are the links of both the service page and the service list page, only one of them contains the WSDL link of the corresponding service. That is to say, the WSDL link exists either in the service page or in the service list page. The reason why both links need to be delivered to this component is that only one of the five Web Service Registries, namely Biocatalogue, hosts the WSDL link in the service list page, while for the other four Web Service Registries the WSDL link is hosted in the service page. Therefore, the WSDL links of these four Web Service Registries are obtained via the service page link, and for the Biocatalogue Web Service Registry the WSDL link is obtained via the service list page link. However, there is a problem with getting the WSDL links of the Biocatalogue Web Service Registry. In brief, some of the Web services listed in


the service list page of the Biocatalogue Web Service Registry do not have a WSDL link; in other words, these services do not have a WSDL document. In this situation, the WSDL link of these Web services is assigned the value "NULL". For the Web Services of the other four Web Service Registries, however, the WSDL link always exists in the service page. Eventually, whenever the WSDL Grabber component has extracted the WSDL link of a single Web service, it is immediately forwarded to the Storage component for downloading the WSDL document.

3221 Features of the WSDL Grabber Component

The WSDL Grabber component has to provide the following feature:
l Obtain WSDL links
The WSDL link is the direct way to reach the page that contains the contents of the WSDL document. It is actually a URL address, but at its end there is something like "wsdl" or "WSDL" to indicate that this is an address pointing to the page of a WSDL document.

3222 Input of the WSDL Grabber Component

This component requires the following input data

l Service list page link

l Service page link

3223 Output of the WSDL Grabber Component

The component produces only the following output data:
l The URL address of the WSDL link for each service

3224 Demonstration for WSDL Grabber Component

This section presents a series of figures in order to give a comprehensive understanding of the process of the WSDL Grabber component. It uses the same Web service of the Service-Repository as an example.
1) The input of the WSDL Grabber component is the service page link obtained from the Web Service Extractor component. The address of this link is "http://www.service-repository.com/service/overview-210897616".
Figure3-8 WSDL link of the Web service "BLZService" in the service page


2) Figure 3-9 shows the corresponding original source code of the WSDL link highlighted in figure 3-8.
Figure3-9 Original source code of the WSDL link for the Web service "BLZService"
3) Figure 3-10 and figure 3-11 show the code used to extract the WSDL link shown in figure 3-9. Figure 3-10 is the particular code for the Service-Repository Web Service Registry only; for the other four Web Service Registries this code is different. The function "getServiceRepositoryWSDLLink" first gets a list of nodes that have the HTML tag name "b". Then it checks all of these nodes one by one to see whether the text value of a node equals "WSDL". If this condition is fulfilled, the attribute value of its sibling, which is an "a" element here, is extracted as the value of the WSDL link for this Web service (an illustrative sketch of this kind of lookup follows below).
Figure3-10 Code overview of the "getServiceRepositoryWSDLLink" function
Figure3-11 Code overview of the "oneParameter" function
4) Finally, after applying the functions shown in figures 3-10 and 3-11, the WSDL link for the Web service "BLZService" can be obtained. Its value is "http://services.unitedplanet.de/blz/BlzService.asmx?WSDL".
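To illustrate the kind of lookup described for figure 3-10, the following Java sketch performs the same pattern with the jsoup HTML parser: find a <b> element whose text is "WSDL" and read the href of the neighbouring <a> element. The use of jsoup and the assumed HTML structure are illustrative choices only; which parsing code the actual implementation uses is shown in figures 3-10 and 3-11, not here.

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class WsdlLinkSketch {
        // Returns the WSDL link found next to a <b>WSDL</b> label, or null if none is found
        static String findWsdlLink(String html) {
            Document doc = Jsoup.parse(html);
            for (Element bold : doc.getElementsByTag("b")) {
                if ("WSDL".equals(bold.text().trim())) {
                    Element sibling = bold.nextElementSibling();
                    if (sibling != null && "a".equals(sibling.tagName())) {
                        return sibling.attr("href");
                    }
                }
            }
            return null;
        }

        public static void main(String[] args) {
            // Hypothetical fragment of a service page
            String html = "<p><b>WSDL</b><a href=\"http://example.org/BLZService?wsdl\">link</a></p>";
            System.out.println(findWsdlLink(html)); // prints the href value
        }
    }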


323 The Function of Property Grabber Component

The Property Grabber component is a module used to extract and gather all the Web service information hosted in the Web, namely the information shown in the service list page and the service page. In the end, all the obtained Web service information is collected together as the service properties, which are delivered to the Storage component for storing. The detailed process flow of the Property Grabber component is illustrated in figure 3-12.
The inputs of the Property Grabber component are the same as the inputs of the WSDL Grabber component. However, there is still a small difference between them with respect to the seed. As already mentioned in section 322, for the WSDL Grabber component only one of the inputs is sufficient to get the WSDL link, whereas the Property Grabber component needs both of them as seeds. Once the Property Grabber component receives the needed inputs, it starts to extract the information of that single Web service.

Figure3-12 Overview of the process flow of the Property Grabber Component

After the Property Grabber component receives the inputs, it starts to extract the service information of the Web service. Generally speaking, the service information consists of four aspects: structured information, endpoint information, monitoring information and whois information.
(1) Structured Information
The structured information is obtained by extracting the information hosted in the service page and the service list page. It is the basic descriptive information about the service, such as the name of the service, the URL address through which the WSDL document can be obtained, the description introducing the service, the provider who offers that service, its rating, the server that hosts this service, etc. However, the elements constituting this


structured information differ between the Web Service Registries. For example, the rating information of a Web service exists in the Service-Repository Web Service Registry, while the Xmethods Web Service Registry does not have this information. In addition, even for Web services within the same Web Service Registry, some elements of the structured information may not exist. For instance, one service in a Web Service Registry may have a description, while another service in the same Web Service Registry does not. Tables 3-1 to 3-5 show the structured information that should be extracted for these five Web Service Registries. Moreover, if the style of a service in the Biocatalogue Web Service Registry is SOAP, there is some additional information describing the SOAP operations of this service; if it is REST, this additional information concerns the REST operations. This should also be considered as part of the structured information. Table 3-6 and table 3-7 list the information for these two different kinds of operations.

Service Name WSDL Link WSDL Version

Provider Server Rating

Homepage Owner Homepage Description

Table 3-1Structured Information of Service-Repository Web Service Registry

Service Name WSDL Link Provider

Service Style Homepage Implementation Language

Description User Description Contributed Client Name

Type of this Client Publisher for this client Used Toolkit of this client

Used Language of this client Used Operation System of this client

Table 3-2Structured Information of Xmethods Web Service Registry

Service Name WSDL Link Server

Provider Providerrsquos Country Service Style

Rating Description User Description

Service Tags Documentation (within WSDL)

Table 3-3Structured Information of Seekda Web Service Registry

Service Name WSDL Link Port Name

Service URL Address Implementation Class

Table 3-4Structured Information of Ebi Web Service Registry

Service Name WSDL Link Style

Provider Providerrsquos Country View Times

Favorite Times Submitter Service Tags

Total Annotation Provider Annotation Member Annotation

Registry Annotation Base URL SOAP Lab Server Base URL

Description User Description Category

Table 3-5Structured Information of Biocatalogue Web Service Registry


SOAP Operation Name Inputs and Outputs Operation Description

Operation Tags Part of Which Service

Table 3-6SOAP Operation Information of Biocatalogue Web Service Registry

REST Operation Name Service Tags Used Template

Operation Description Part of Which Service Part of Which Endpoint Group

Table 3-7 REST Operation Information of Biocatalogue Web Service Registry

(2) Endpoint Information
Endpoint information is the port information of the Web service, which can be extracted only from the service page. Since different Web Service Registries structure the endpoint information differently, some elements of the endpoint information vary considerably between registries. One thing needs attention: the Ebi Web Service Registry does not provide endpoint information for any of the Web services published in this registry. Moreover, although the Web services within the same Web Service Registry share the same structure of endpoint information, some of its elements may be missing or empty. Furthermore, these Web Service Registries may even have no endpoint information at all for some Web services published by them. Nevertheless, whenever there is endpoint information for a Web service, there is at least one element, namely the URL address of the endpoint. Table 3-8 shows the endpoint information that should be extracted for these five Web Service Registries.

Web Service

Registry Name

Elements of the Endpoint Information

Service

Repository

Endpoint Name Endpoint URL

Endpoint Critical Endpoint Type

Bound Endpoint

Xmethods

Endpoint URL Publisher of this Endpoint

Contact Email of this Publisher Implementation Language of this

Endpoint

Seekda Endpoint URL

Biocatalogue Endpoint Name Endpoint URL

Table 3-8 Endpoint Information of these five Web Service Registries

Web Service

Registry Name

Elements of the Monitoring Information

Service

Repository

Service Availability Number of Downs

Total Uptime Total Downtime

MTTR MTBF

RTT Max of Endpoint RTT Min of Endpoint

RTT Average of Endpoint Ping Count of Endpoint

Seekda Service Availability Begin Time of Monitoring

Biocatalogue Monitored Status of Endpoint Overall Status of Service

Table 3-9 Monitoring Information of these five Web Service Registries

(3) Monitoring Information
Monitoring information is the measured statistical information about the Web service. It is worth


noting that the Ebi and Xmethods Web Service Registries do not provide monitoring information for any of the Web services published by them, while for the other three Web Service Registries only a few Web services may lack this information. Table 3-9 displays the monitoring information for these three Web Service Registries.

(4) Whois Information
Whois information is not extracted from the information hosted in the service page and service list page. It is descriptive information about the service domain, which can be gained by means of the address of the WSDL link. Therefore, the process of getting the Whois information starts by obtaining the service domain first. The final value of the service domain must not contain strings like "http", "https", "www", etc.; only the top-level service domain remains. After that, the service domain database is queried by sending the value of the service domain to a Whois client, which is simply a Web site on the Internet, for example "http://www.whois365.com/cn/domain". If information about that service domain exists, a list of its information is returned as the output. However, the structure of the returned information differs between service domains; therefore the most challenging part is to handle the extraction process for each different form of the returned information. Table 3-10 lists the Whois information that needs to be extracted for all five Web Service Registries. A sketch of the domain-normalization step is given after table 3-10.

Service Domain URL Domain Name Domain Type

Domain Address Domain Description State

Postal Code City Country

Country Code Phone Fax

Email Organization Established Time

Table 3-10Whois Information for these five Web Service Registries
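The following small Java sketch shows one possible way to reduce a WSDL link to such a service domain before querying the Whois client. The exact normalization rules of the actual implementation may differ; this is only an assumption-based illustration, and the WSDL link used below is the one from the running example.

    import java.net.URI;

    public class ServiceDomainSketch {
        // Reduce a WSDL link to a bare service domain such as "thomas-bayer.com"
        static String toServiceDomain(String wsdlLink) {
            String host = URI.create(wsdlLink).getHost();   // e.g. "www.thomas-bayer.com"
            if (host != null && host.startsWith("www.")) {
                host = host.substring(4);                    // strip the leading "www."
            }
            return host;
        }

        public static void main(String[] args) {
            String wsdl = "http://www.thomas-bayer.com/axis2/services/BLZService?wsdl";
            System.out.println(toServiceDomain(wsdl));       // prints "thomas-bayer.com"
        }
    }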

Finally, the information of these four aspects is collected together and then delivered to the Storage component for further storage processing.

3231 Features of the Property Grabber Component

The Property Grabber component has to provide the following features:
l Obtain basic information
Generally speaking, the more information is known about a Web service, the better it can be judged how good this Web service is. Hence, the Property Grabber component needs to extract all the basic information hosted in the service list page and the service page. This basic information comprises the structured information, the endpoint information and the monitoring information.
l Obtain Whois information
The more information a Web service has, the better its descriptive quality; this makes it necessary to extract as much information about the Web service as possible. Therefore, in addition to the basic information, the Property Grabber component also obtains some additional information, called Whois information, such as the type of the domain, the name of the person in charge, the postal code of the domain, city, phone, fax, detailed address, email, etc.


3232 Input of the Property Grabber Component

This component requires the following input data

l Service list page link

l Service page link

3233 Output of the Property Grabber Component

The component will produce the following output data

l Structured information of each service

l Endpoint information about each service, if it exists
l Monitoring information for the service and its endpoint, if it exists
l Whois information of the service domain
All this information is collected together as the properties of each service; thereafter the collected properties are sent to the Storage component.

3234 Demonstration for Property Grabber Component

The pictures in figure 3-13 to figure 3-16 depict the fundamental procedure of the Property Grabber component. To simplify the explanation, the example shown in this section uses the same Web service as before.
1) The inputs of the Property Grabber component are the links of the service list page and the service page received from the Web Service Extractor component. These links are "http://www.service-repository.com" and "http://www.service-repository.com/service/overview-210897616".
2) When the "GetProperty" function is triggered, it calls its four sub functions shown in figure 3-12.
Figure3-13 Structured properties of the service "BLZService" in the service list page


Figure3-14 Structured properties of the service "BLZService" in the service page
3) First, the "getStructuredProperty" function tries to extract the structured information displayed in the red boxes of figures 3-13 and 3-14. However, several elements of the structured information have the same content, such as the description shown in the service page and in the service list page. Therefore, in order to save time during extraction and space during storage, elements with the same content are only extracted once. Moreover, the rating information requires a transformation from non-descriptive to descriptive text, because its content consists of several star images (a sketch of such a mapping is given after table 3-11). The final results of the extracted structured information for this Web service are shown in table 3-11. Because there is no descriptive information for the provider, homepage and owner homepage, their values are assigned as "NULL".

Service Name BLZService

WSDL Link http://www.thomas-bayer.com/axis2/services/BLZService?wsdl

WSDL Version 0

Server Apache-Coyote11

Description BLZService

Rating Four stars and A Half

Provider NULL

Homepage NULL

Owner Homepage NULL

Table 3-11Extracted Structured Information of Web Service ldquoBLZServicerdquo
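As mentioned in step 3, the star images of the rating have to be turned into descriptive text. The following Java sketch shows one possible mapping from a counted number of full and half star images to such a textual rating; the counting of the images itself and the exact wording are illustrative assumptions rather than the actual implementation.

    public class RatingTextSketch {
        // Map a counted number of full and half star images to a descriptive rating text
        static String ratingToText(int fullStars, boolean halfStar) {
            String text = fullStars + (fullStars == 1 ? " star" : " stars");
            return halfStar ? text + " and a half" : text;
        }

        public static void main(String[] args) {
            // Four full star images and one half star image, as in figure 3-14
            System.out.println(ratingToText(4, true)); // prints "4 stars and a half"
        }
    }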

4) Secondly, the "getEndpointProperty" function extracts the endpoint information displayed in the red box of figure 3-15. Though there is more than one endpoint record in the list, only one of them is extracted as the endpoint information. The reason is that this master program is intended to extract as much information as possible, but this information should not contain redundancy; therefore only one record is extracted as the endpoint information even if there are several endpoint records. Table 3-12 shows the final results of the endpoint information of this Web service.
Figure3-15 Endpoint information of the Web service "BLZService" in the service page


Endpoint Name BLZServiceSOAP12port_http

Endpoint URL http://www.thomas-bayer.com:80/axis2/services/BLZService

Endpoint Critical True

Endpoint Type production

Bound Endpoint BLZServiceSOAP12Binding

Table 3-12Extracted Endpoint Information of the Web service ldquoBLZServicerdquo

5) Then the monitoring information is extracted by invoking the "getMonitoringProperty" function. Figure 3-16 displays two types of monitoring properties: the red box above contains the monitoring information about the Web service, and the red box below lists the monitoring information for its endpoints. As already mentioned, only one endpoint statistics record is extracted. Besides, as can be seen from figure 3-16, there are two kinds of availability values; they all represent the availability of this Web service, just like the availability shown in figure 3-14. Therefore, one of these availability values is sufficient. Table 3-13 shows the final results of this extraction process.
Figure3-16 Monitoring information of the service "BLZService" in the service page

Service Availability 100

Number of Downs 0

Total Uptime 1 day 19 hours 19 minutes

Total Downtime 0 second

MTBF 1 day 19 hours 19 minutes

MTTR 0 second

RTT Max of Endpoint 141 ms

RTT Min of Endpoint 0 ms

RTT Average of Endpoint 577 ms

Ping Count of Endpoint 112

Table 3-13Extracted Monitoring Information of the Web service ldquoBLZServicerdquo

6) Next, in order to extract the Whois properties, the "getWhoisProperty" function first has to obtain the service domain from the WSDL link. For this Web service, the obtained service domain is "thomas-bayer.com". This service domain is then sent as input to the Whois client for the querying process, which returns a list of information for that service domain, see figure 3-17. Table 3-14 shows the extracted Whois information.


Figure3-17 Whois Information of the service domain "thomas-bayer.com"

Service Domain URL thomas-bayer.com

Domain Name Thomas Bayer

Domain Type NULL

Domain Address Moltkestr40

Domain Description NULL

State NULL

Postal Code 54173

City Bonn

Country NULL

Country Code DE

Phone +4922855525760

Fax NULL

Email info@predic8.de

Organization predic8 GmbH

Established Time NULL

Table 3-14 Extracted Whois Information of the service domain "thomas-bayer.com"

7) Finally, the information of these four aspects is collected together as the service properties, and these service properties are then forwarded to the Storage component.

324 The Function of Storage Component

The WSDL link from the WSDL Grabber component is used by the Storage component to download the WSDL document from the Web and store it on disk. In addition, the service properties from the Property Grabber component are also stored directly on disk in three different manners by this Storage component. Figure 3-18 illustrates the process flow of the Storage component.
When the Storage component receives the needed inputs, its mediator function "Storager" is triggered. It transforms the service properties into three different output formats and stores them on disk; these output formats are an XML file, an INI file and database records. Besides, it also tries to download the WSDL document from the URL address of the WSDL link and then stores the obtained WSDL document on disk as well. The "Storager" function is composed of four sub functions, namely the "getWSDL", "generateXML", "generateDatabase" and "generateINI" sub functions. Each sub function is in charge of one aspect of the storage tasks.


Figure3-18 Overview of the process flow of the Storage Component

(1) "getWSDL" sub function
The task of the "getWSDL" sub function is to download the WSDL document and then store it on disk. Above all, it has to get the content of the WSDL document. This procedure is done as follows. First, the "getWSDL" sub function checks whether the value of the received WSDL link equals "NULL". As already presented in section 322, if the Web service does not have a WSDL link, the value of its WSDL link is assigned as "NULL". In that case, it creates a WSDL document whose name is the service name appended with the mark "No WSDL Document"; obviously, this document does not contain any content and is an empty document. If the service does have a WSDL link, this sub function tries to connect to the Internet using the URL address of the WSDL link. Once it succeeds, the contents hosted on the Web are downloaded, stored on disk and named only with the name of the service. Otherwise, it creates a WSDL document whose name is prefixed with "Bad" before the service name. A minimal sketch of this download step is given below.
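The following Java sketch illustrates the core of such a download step with standard library classes. The file extension, the method and variable names and the simplified error handling are illustrative assumptions; the actual implementation shown in figure 3-19 is richer than this.

    import java.io.InputStream;
    import java.net.URL;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    public class GetWsdlSketch {
        // Download the WSDL document behind wsdlLink and store it as <serviceName>.wsdl in outputDir
        static void getWSDL(String serviceName, String wsdlLink, String outputDir) {
            Path target = Paths.get(outputDir, serviceName + ".wsdl");
            try {
                if (wsdlLink == null || wsdlLink.equals("NULL")) {
                    // No WSDL link: create an empty marker document
                    Files.createFile(Paths.get(outputDir, serviceName + " No WSDL Document.wsdl"));
                    return;
                }
                try (InputStream in = new URL(wsdlLink).openStream()) {
                    Files.copy(in, target); // store the downloaded content on disk
                }
            } catch (Exception e) {
                // Download failed: create a document whose name is prefixed with "Bad"
                try {
                    Files.createFile(Paths.get(outputDir, "Bad" + serviceName + ".wsdl"));
                } catch (Exception ignored) {
                }
            }
        }
    }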

(2) "generateXML" sub function
The "generateXML" sub function takes the service properties as input, transforms them into an XML file and stores it on disk with a name consisting of the service name plus ".xml". XML stands for eXtensible Markup Language, a markup language designed to transport and store data. The first line of an XML file is the XML declaration, which defines the XML version and the encoding used. For example, <?xml version="1.0" encoding="UTF-8"?> means that this is an XML file whose version is 1.0 and whose encoding is the 8-bit Unicode Transformation Format character set. Besides, an XML file contains XML elements, where an element is everything from the element's start tag to the element's end tag. Moreover, an XML element can also contain other elements, simple text, or a mixture of both.


However, an XML file must contain a root element as the parent of all other elements. A small example of such an output file is sketched below.
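As an illustration, the XML file for the example service could look roughly like the following fragment. The element names and the nesting are hypothetical choices for this sketch; the actual layout produced by the "generateXML" sub function may differ.

    <?xml version="1.0" encoding="UTF-8"?>
    <service>
        <name>BLZService</name>
        <wsdlLink>http://www.thomas-bayer.com/axis2/services/BLZService?wsdl</wsdlLink>
        <rating>Four stars and A Half</rating>
    </service>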

(3) "generateINI" sub function
The "generateINI" sub function also takes the service's properties as input, but it transforms them into an INI file and stores it on disk with a name consisting of the service's name plus ".ini". "ini" stands for initialization. The INI file format is a de facto standard for configuration files; these are just simple text files with a basic structure. Generally speaking, an INI file contains three different parts: sections, parameters and comments. The parameter is the basic element contained in an INI file. Its format is a key-value pair, also called a name-value pair. This pair is delimited by an equals sign "=", and the key or name always appears to the left of the equals sign. A section is more like a room that groups its parameters together. It always appears on a single line within a pair of square brackets "[]", and sections may not be nested. A comment is descriptive text that begins with a semicolon ";"; anything between the semicolon and the end of the line is ignored. An example of such an INI file is sketched below.
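An INI file for the same example service could therefore look roughly as follows. The section and key names are hypothetical choices made for this sketch; only the values come from the running example.

    ; properties of the Web service BLZService
    [Service]
    Name=BLZService
    WSDLLink=http://www.thomas-bayer.com/axis2/services/BLZService?wsdl
    Rating=Four stars and A Half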

(4) "generateDatabase" sub function
The inputs of the "generateDatabase" sub function are the same as those of the previous two sub functions. Instead of transforming them into a file like XML or INI, the "generateDatabase" sub function turns them into data in the database by using SQL statements. SQL stands for Structured Query Language, a database language designed for accessing and manipulating data in a database. Most of the actions performed on a database are done with SQL statements, and the primary statements of SQL include insert into, delete, update, select, create, alter and drop. Therefore, for the purpose of transforming these service properties into database data, this sub function first has to create a database using the "create database" statement. Then it has to create a table to store the data. A table is a collection of related data entries and consists of columns and rows. Since the data for all five Web Service Registries is not very large, one database table is enough for storing these service properties. Because of that, the field names of the service properties in the columns have to be uniform and well-defined for all five Web Service Registries. Afterwards, the service properties of each single service can be put into the table as a record with the "insert into" statement of SQL. A reduced sketch of these steps is given below.
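The following Java/JDBC sketch illustrates these steps in a reduced form. The JDBC URL, the credentials, the table name and the column set are illustrative assumptions and only cover a few of the actual property fields.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.Statement;

    public class GenerateDatabaseSketch {
        public static void main(String[] args) throws Exception {
            // Hypothetical JDBC URL; the actual database product and credentials may differ
            try (Connection con = DriverManager.getConnection("jdbc:mysql://localhost/dwsc", "user", "password")) {
                try (Statement st = con.createStatement()) {
                    // One table for the service properties of all five registries; all fields stored as TEXT
                    st.executeUpdate("CREATE TABLE IF NOT EXISTS service_properties ("
                            + "service_name TEXT, wsdl_link TEXT, rating TEXT)");
                }
                // Insert the properties of one service as a single record
                try (PreparedStatement ps = con.prepareStatement(
                        "INSERT INTO service_properties (service_name, wsdl_link, rating) VALUES (?, ?, ?)")) {
                    ps.setString(1, "BLZService");
                    ps.setString(2, "http://www.thomas-bayer.com/axis2/services/BLZService?wsdl");
                    ps.setString(3, "Four stars and A Half");
                    ps.executeUpdate();
                }
            }
        }
    }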

3241 Features of the Storage Component

The Storage component has to provide the following features:
l Generate different output formats
The final result of this master program is to store the information about the services on disk for future work. This Storage component provides three different formats for storing the service properties of the services published in the Web Service Registries, which makes the storage of the services very flexible and durable.
l Obtain the WSDL document
The important task of this component is to check whether the WSDL document can be obtained from the WSDL link, because the WSDL document plays a decisive role in determining the quality of the service. This Storage component provides the ability to deal with the different situations that occur in the process of obtaining the WSDL document.


3242 Input of the Storage Component

This component requires the following input data

l WSDL link of each service

l The property information of each service

3243 Output of the Storage Component

The component will produce the following output data

l WSDL document of the service

l XML document, INI file and table records in the database

3244 Demonstration for Storage Component

The following figures show the fundamental implementation code of this Storage component. The detailed description is given below.
1) As can be seen from figures 3-19 to 3-21, there are several common elements among the implementation code of these sub functions. The first common element concerns the parameters defined in each of these sub functions, which are "path" and "SecurityInt". The parameter "path" is the absolute path on the computer disk; it is used for the storing procedure and has already been specified by the user at the beginning of the whole program. The parameter "SecurityInt" is an increasing integer that is used as part of the name of the service; the reason for this is that it prevents services with the same name from overwriting each other on disk. The content marked in red in the code of these figures is the second common element; its function is to create a file or document for the service with the corresponding parameters.

2) Figure 3-19 displays the implementation code of the "getWSDL" sub function, which is designed to get the WSDL document of the service based on its WSDL link. The parameter "name" is the name of the service, which is used as the name of the WSDL document. The parameter "linkStr" is the actual WSDL link, which is the most important parameter of this sub function. The other two parameters, "statistic" and "log", are the objects of the text files called "Statistic Information" and "Log Information" respectively. The "Statistic Information" text file is used to record the statistical data of the services in each Web Service Registry, such as the overall number of properties for that Web Service Registry, the overall number of services, the number of services that have no WSDL link, the number of services that contain no contents, the number of services whose WSDL link is not available, etc. The "Log Information" text file records the results of the process steps and the problems encountered, for example which service is being crawled at the moment, which Web Service Registry it belongs to, the reason why the WSDL document of a service cannot be obtained, the reason why a service is unreachable, and so on.


Figure3-19 Implementation code for getting WSDL document

3) Figure 3-20 and figure 3-21 show the code for turning the service properties into the XML file and the INI file and storing those two files on disk thereafter. The parameter "vec" is a Vector of the "PropertyStruct" data type. This "PropertyStruct" data type is a class consisting of two variables: name and value.
Figure3-20 Implementation code for generating the XML file


Figure3-21 Implementation code for generating INI file
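A simplified sketch of this idea, assuming the class layout described above, might look like the following code; it is an illustration, not the code of figures 3-20 and 3-21.

import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.Vector;

// Assumed sketch: "PropertyStruct" holds one name/value pair, and the writer turns a
// Vector of such pairs into an INI file with comment lines, a [service] section and
// one "name=value" line per service property.
class PropertyStruct {
    String name;
    String value;

    PropertyStruct(String name, String value) {
        this.name = name;
        this.value = value;
    }
}

public class IniGenerator {

    public static void writeIni(String path, int securityInt, String serviceName,
                                Vector<PropertyStruct> vec) throws IOException {
        File target = new File(path, securityInt + serviceName + ".ini");
        try (PrintWriter out = new PrintWriter(new FileWriter(target))) {
            out.println("; service properties of " + serviceName);   // comment line
            out.println("[service]");                                 // section
            for (PropertyStruct p : vec) {
                out.println(p.name + "=" + p.value);                  // key-value pairs
            }
        }
    }
}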

4) The code in figure 3-22 and figure 3-23 shows the process that turns the service properties of all services in these five Web Service Registries into records of the database. For this purpose a database has to be created first. The name of the database can be arbitrary as long as it conforms to the naming rules of the database system; the same holds for the name of the table. Figure 3-22 displays the procedure of creating a table in the database. Because it is hard to predict the length of each service property, the data types of all service property columns are set to "Text". Figure 3-23 shows the code for inserting the service properties into the table by executing an update statement.

Figure3-22 Implementation code for creating table in database


Figure3-23 Implementation code for generating table records
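The database step can be sketched with plain JDBC as follows. The table layout, the column names and the SQL dialect are assumptions made only for illustration; the real table uses one TEXT column per service property.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Statement;

// Hedged JDBC sketch (assumed table and column names): the table is created once with
// TEXT columns because the length of a service property is hard to predict, and every
// Web service then becomes exactly one record.
public class DatabaseStorage {

    public static void createTable(Connection con) throws SQLException {
        try (Statement st = con.createStatement()) {
            st.executeUpdate("CREATE TABLE IF NOT EXISTS services ("
                    + "id INTEGER PRIMARY KEY AUTO_INCREMENT, "
                    + "service_name TEXT, wsdl_link TEXT)");
        }
    }

    public static void insertService(Connection con, String serviceName, String wsdlLink)
            throws SQLException {
        try (PreparedStatement ps = con.prepareStatement(
                "INSERT INTO services (service_name, wsdl_link) VALUES (?, ?)")) {
            ps.setString(1, serviceName);
            ps.setString(2, wsdlLink);
            ps.executeUpdate();                  // one record per Web service
        }
    }
}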

33 Multithreaded Programming for DWSC

Multithreaded programming is a built-in characteristic of the Java language. A multithreaded program contains two or more separate parts that can execute concurrently; each such part is called a thread. Multithreading makes it possible to write programs that use the system resources efficiently, for example by making maximum use of the CPU, because the idle time of the CPU can be kept to a minimum.
In this master program there are five Web Service Registries whose services have to be crawled. Moreover, the number of services published in each Web Service Registry differs considerably, so the running time spent on each Web Service Registry differs as well. As a consequence, a Web Service Registry that owns fewer services would have to wait until another Web Service Registry with many more services has finished. Therefore, in order to reduce this waiting time and to maximize the use of the system resources, multithreaded programming is applied to this master program: the program creates one thread for each Web Service Registry, and these threads are executed independently, as the sketch below illustrates.
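A minimal sketch of this thread-per-registry idea is given below; the crawl method is only a placeholder for the real per-registry crawling procedure.

// Minimal sketch: every Web Service Registry gets its own thread, so a registry with
// few services does not have to wait for a large one to finish.
public class RegistryThreads {

    public static void main(String[] args) throws InterruptedException {
        String[] registries = {"Service Repository", "Ebi", "Xmethods", "Seekda", "Biocatalogue"};
        Thread[] threads = new Thread[registries.length];
        for (int i = 0; i < registries.length; i++) {
            final String registry = registries[i];
            threads[i] = new Thread(new Runnable() {
                public void run() {
                    crawl(registry);             // executed independently of the other registries
                }
            });
            threads[i].start();
        }
        for (Thread t : threads) {
            t.join();                            // wait until all registries are finished
        }
    }

    private static void crawl(String registry) {
        System.out.println("crawling " + registry);   // placeholder for the real crawler
    }
}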

34 Sleep Time Configuration for Web Service Registries

Because this master program downloads the WSDL documents and extracts the service information of the Web services published in a Web Service Registry, it inevitably affects the performance of that Web Service Registry. In addition, in order not to exceed their throughput capability, these Web Service Registries restrict the rate of access. Because of that, unknown errors can occur while this master program is executing: for instance, the program may halt at one point without obtaining any further WSDL documents and service information, the WSDL documents of some services of some Web Service Registries cannot be obtained, or some service information is missing. Therefore, in order to obtain the largest possible number of Web services published in these five Web Service Registries without affecting their throughput, the access rate has to be configured for each service of all Web Service Registries.
Consequently, before going into the essential procedure for each single service of these Web Service Registries, this master program calls the built-in function "sleep (long milliseconds)". It is a public static function which causes the currently executing thread to sleep for the specified number of milliseconds; in other words, it temporarily ceases execution for a while. The following table shows the time interval of the sleep function for each Web Service Registry, and a short sketch of this configuration follows the table.

Web Service Registry Name Time Interval (milliseconds)

Service Repository 8000

Ebi 3000

Xmethods 10000

Seekda 20000

Biocatalogue 10000

Table 3-15 Sleep Time of these five Web Service Registries
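The following short Java sketch shows how such a configuration can be applied; the map of intervals mirrors table 3-15, while the class and method names are assumptions.

import java.util.HashMap;
import java.util.Map;

// Sketch of the per-registry sleep configuration: before each single service is
// processed, the crawling thread pauses for the interval configured in table 3-15 so
// that the throughput limits of the registry are respected.
public class SleepConfiguration {

    private static final Map<String, Long> SLEEP_MILLIS = new HashMap<String, Long>();
    static {
        SLEEP_MILLIS.put("Service Repository", 8000L);
        SLEEP_MILLIS.put("Ebi", 3000L);
        SLEEP_MILLIS.put("Xmethods", 10000L);
        SLEEP_MILLIS.put("Seekda", 20000L);
        SLEEP_MILLIS.put("Biocatalogue", 10000L);
    }

    public static void pauseBeforeService(String registryName) throws InterruptedException {
        Thread.sleep(SLEEP_MILLIS.get(registryName));   // temporarily cease execution
    }
}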


4 Experimental Results and Analysis

This chapter presents the quantitative experimental results of the prototype described in chapter 3, together with an analysis and explanation of these results. In order to obtain reasonably accurate results, the experiments were carried out more than five times; all data displayed in the following tables and charts are the averages of these runs.

41 Statistic Information for Different Web Service Registries
This section discusses the service amount statistics of the Web services published in these five Web Service Registries. It covers the overall number of Web services published in each Web Service Registry and the number of unavailable Web services, which have been archived because they may not be active anymore or are close to becoming inactive. Table 4-1 shows the service amount statistics of these five Web Service Registries.

Web Service Registry Name    Service Repository    Ebi    Xmethods    Seekda    Biocatalogue
Overall Services             57                    289    382         853       2567
Unavailable Services         0                     0      0           0         125
Table4-1 Service amount statistic of these five Web Service Registries

In order to give an intuitive view of the service amount statistics in these five Web Service Registries, figure 4-1 shows a bar chart derived from table 4-1. As can be seen from the bar chart, on the one hand the overall number of Web services increases from the Service Repository Web Service Registry to the Biocatalogue Web Service Registry, and the Biocatalogue Web Service Registry owns the largest number of Web services. This indicates that the Biocatalogue Web Service Registry has a much more powerful ability to provide Web services to users, because it contains far more services than the other four Web Service Registries. On the other hand, there are no unavailable services in any of the Web Service Registries except the Biocatalogue Web Service Registry; that is to say, only the Biocatalogue Web Service Registry contains some Web services that cannot be used by the users. To some degree this is wasteful, because these services cannot be used anymore and they still occupy network resources on the Web. Therefore, all these unavailable services should be eliminated in order to reduce the waste of network resources.


Figure4-1 Service amount statistic of these five Web Service Registries

42 Statistic Information for WSDL Document

Web Service Registry Name    Service Repository    Ebi    Xmethods    Seekda    Biocatalogue
Failed WSDL Links            1                     0      23          145       32
Without WSDL Links           0                     0      0           0         16
Empty Content                0                     0      2           0         2
Table4-2 Statistic information for WSDL Document

Table 4-2 and figure 4-2 show the error statistics of the WSDL documents of the Web services in these five Web Service Registries. There are three aspects. The first one is the "Failed WSDL Links" of the Web services in these Web Service Registries, that is, the overall number of Web services whose WSDL links are invalid. In other words, it is impossible to get the WSDL documents of these Web services via the URL addresses of their WSDL links, and therefore no WSDL document is created for them. The second aspect is the "Without WSDL Links" of the Web services in these Web Service Registries, which is the overall number of Web services in each Web Service Registry that have no WSDL link at all; the value of the WSDL link of such a Web service is "NULL". A WSDL document is still created for it, but it has no content and its name contains the string "[No WSDL Document]". The third aspect is the "Empty Content", which represents the overall number of Web services that have WSDL links whose URL addresses


are valid, but whose WSDL documents contain no content. In this case a WSDL document whose name contains the string "(BAD)" is created.

Figure4-2 Statistic information for WSDL Document

43 Comparison of Different Average Number of Service Properties
This section compares the average number of service properties in these five Web Service Registries. This average number of service properties is calculated by means of the following equation:
ASP = ONSP / ONS (1)

Where

ASP is the average number of service properties for one Web Service Registry

ONSP is the overall number of service properties in one Web Service Registry

ONS is the overall number of Web services that have already been crawled from the

corresponding Web Service Registry
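As a purely hypothetical example, if a registry had 100 crawled services with 2300 extracted service properties in total, equation (1) would give ASP = 2300 / 100 = 23 properties per service.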

Figure 4-3 shows the average number of service properties per Web service in these five Web Service Registries. As already mentioned, one of the measurements for assessing the quality of the Web services in a Web Service Registry is the service information: the more information is available about a Web service, the better users know that service, and consequently the corresponding Web Service Registry can offer Web services of better quality to its users. As seen from figure 4-3, the Service Repository and Biocatalogue Web Service Registries own a larger number of service properties than the other three Web Service Registries. This directly reflects that these two Web Service Registries provide more detailed information about the Web services published in them, so that users can choose the services they need more easily and are also more willing to use the Web services published in these two Web Service Registries. By contrast, the Xmethods and Seekda


Web Service Registries, which offer less service information about their Web services, provide lower quality for these Web services. Therefore users may be less willing to use the Web services provided in these two Web Service Registries, not to mention the Web services published in the Ebi Web Service Registry.

Figure4-3 Average Number of Service Properties

From the description presented in section 323, the causes of the different numbers of service properties in these Web Service Registries may consist of the following points. First, the amount of structured information for the Web services differs between these five Web Service Registries, and part of the information for some Web services in a Web Service Registry may even be missing or have an empty value. For example, the amount of structured information that is supposed to be extracted for the Web services in the Ebi Web Service Registry differs greatly from that in the Biocatalogue Web Service Registry. Secondly, there is a certain amount of endpoint information for almost all Web Service Registries except the Ebi Web Service Registry, which also influences the overall number of service properties. Thirdly, some Web Service Registries, such as Xmethods and Ebi, do not have monitoring information, whereas the Service Repository Web Service Registry in particular provides a large amount of monitoring information of the Web services that can be extracted from the Web. Finally, there is the amount of whois information for these Web services: if the database of the whois client does not contain information about the service domain of a Web service, then no whois information can be extracted. Moreover, even if there is information about the service domain, the amount of this information can be very diverse. Therefore, if many service domains of the Web services in a registry have no or only little whois information, then the average number of service properties in that registry decreases greatly.
As a result, in order to help users become better acquainted with the Web services provided in a Web Service Registry and to distinguish these Web services, the Web Service Registry should do its best to offer as much information as possible for each of its published Web services.


44 Different Outputs of Web Services

The fundamental task of this master program is to obtain the WSDL documents of the Web services hosted in the Web Service Registries, to extract and gather the properties of these Web services, and thereafter to store them on disk. Therefore this section describes the different outputs of this master program, which include the WSDL documents of the Web services, the generated XML and INI files, and the data records of the service properties in the database.

Figure4-4 WSDL Document format of one Web service

The WSDL document of a Web service is simply read from the Web according to the URL address of its WSDL link, and the data is then stored on disk as the WSDL document. The name of the WSDL document is the service name plus the ending "wsdl", which marks it as a WSDL document. In order to distinguish WSDL documents whose names would be the same although their contents are different, the name of each obtained WSDL document in one Web Service Registry contains a unique Integer in front of the service name. Figure 4-4 shows the valid WSDL document format of one Web service; its name is "1BLZServicewsdl".
The obtained service properties are transformed into an XML file, an INI file and data records in the database. Figure 4-5, figure 4-6 and figure 4-7 show these three different output formats respectively. Figure 4-5 shows the INI file of the Web service, whose name is "1BLZServiceini". The Integer is the same as in the WSDL document, because both files belong to the same Web service. The first three lines in that INI file are service comments, which start with a semicolon and run to the end of the line; they contain basic information describing this INI file. The following line is the section, which is enclosed in a pair of brackets. It is important because it indicates that the lines behind it are the information of this Web service. The rest of the lines are the actual service information, each given as a key-value pair with an equals sign between key and value. Each service property starts at the beginning of a line.


Figure4-5 INI File format of one Web service

Figure4-6 XML File format of one Web service

Figure4-7 Database data format for all Web services


Furthermore, figure 4-6 shows the generated XML file of the Web service "BLZService"; its name is "1BLZServicexml". Needless to say, this XML file is also part of the materials of the same Web service. Although the format of the XML file is different from that of the INI file, their essential contents are the same, that is, the values of the service properties do not differ, because both files are generated from the same collection of properties of the same Web service. The XML file also has some comments, like those in the INI file, which are displayed between "<!--" and "-->". The section in the INI file corresponds to the root element in the XML file; therefore all values of the elements below the root "service" in this XML file are the values of the service properties of this Web service.
Finally, figure 4-7 shows the database table which is used to store the service information of all Web services in these five Web Service Registries. The entire service information of one Web service is stored as exactly one record in this table. Because of that, the column names of the table are the union of the names of the service information in each Web Service Registry. However, since the column names of the table have to be unique, the redundant names in this union must be eliminated. This is sensible and possible because the names of the service information are well-defined and uniform for all these five Web Service Registries. In addition, the first column of this table is the primary key, which is an increasing Integer; its function is similar to the Integer contained in the names of the XML and INI files. The remaining columns of the table are the corresponding service properties of the Web services. The value "NULL" or "Null" in this table indicates that this property of that Web service is empty or missing.

45 Comparison of Average Time Cost for Different Parts of Single Web Service
This section describes the comparison of the average time cost of the different parts of getting one single Web service in all these five Web Service Registries. At first, the average time cost of getting one single service in a Web Service Registry has to be calculated, which can be obtained through the following equation:
ATC = OTS / ONS (2)

Where

ATC is the average time cost for one single Web service

OTS is the overall time cost of all the Web services in one Web Service Registry

ONS is the overall number of Web services that have already been crawled from the

corresponding Web Service Registry

In addition, the average time cost for getting one single service consists of the following six parts: the average time cost for extracting the service properties, the average time cost for obtaining the WSDL document, the average time cost for generating the XML file, the average time cost for generating the INI file, the average time cost for inserting the service properties into the database table, and the average time cost for some other procedures, such as getting the service list page link, getting the service page link and so on. The average time cost for extracting the service properties is obtained by means of the following equation:
ATCSI = OTSSI / ONS (3)

Where

ATCSI is the average time cost for extracting service property of one single Web service

OTSSI is the overall time cost for extracting service property of all the Web services in one Web

Service Registry

ONS is the overall number of Web services that have already been crawled from the

corresponding Web Service Registry

The calculation of the other parts is analogous to the equation for the average time cost of extracting the service properties, whereas the average time cost for the other procedures equals the average time cost for one single Web service minus the sum of the average time costs of the other five parts.
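For illustration, taking the Service Repository row of table 4-3 below, the time attributed to the other procedures is 10042 - (8801 + 918 + 2 + 1 + 53) = 267 milliseconds.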

Web Service Registry    Service property    WSDL Document    XML File    INI File    Database    Others    Overall
Service Repository      8801                918              2           1           53          267       10042
Ebi                     699                 82               2           1           28          11        823
Xmethods                5801                1168             2           1           45          12        7029
Seekda                  5186                1013             2           1           41          23        6266
Biocatalogue            39533               762              2           1           66          1636      42000
Table4-3 Average time cost information for all Web Service Registries (in milliseconds)

Table 4-3 displays the average time cost of one Web service and of its different parts in all these five Web Service Registries. The first column of table 4-3 contains the names of these five Web Service Registries, and the last column is the average time cost for a single service in the respective Web Service Registry, while the remaining columns are the average time costs of the six different parts. In order to give an intuitive view of the data in table 4-3, the data in each column of this table are illustrated in the corresponding figures 4-8 to 4-13.

Figure4-8 Average time cost for extracting service property in all Web Service Registries


As can be seen from figure 4-8, the average time cost for extracting the service properties in the Biocatalogue Web Service Registry is 39533 milliseconds. This value is much larger than in the other four Web Service Registries, which show 8801, 699, 5801 and 5186 milliseconds for Service Repository, Ebi, Xmethods and Seekda respectively. That is to say, it takes much longer to extract the service properties of the Web services published by the Biocatalogue Web Service Registry. In addition, this indirectly indicates that the Biocatalogue Web Service Registry has the highest average number of service properties, which has already been discussed in section 43. Conversely, the average number of service properties in the Ebi Web Service Registry is the smallest. Moreover, the average time cost in the Xmethods Web Service Registry is larger than that in the Seekda Web Service Registry, although the average number of service properties is the same for these two Web Service Registries. One cause that might explain why the extraction in Xmethods costs more time than in Seekda is that the process for extracting the service properties in the Xmethods Web Service Registry has to use both the service page and the service list page, whereas only the service page link is needed for the Seekda Web Service Registry.

The average time cost for obtaining the WSDL document in all these five Web Service Registries is displayed in figure 4-9. This average time cost is the sum of the average time for extracting the WSDL link of a Web service and the average time for reading the data of the WSDL document from the Web and storing it on disk. As seen from this figure, the average time cost for obtaining the WSDL document is largest in the Xmethods Web Service Registry, namely 1168 milliseconds. Although the process of extracting the WSDL link costs a certain amount of time, it does not have a significant influence on the total average time spent for obtaining the WSDL document, because the WSDL link of a Web service is usually gained in one step. Therefore this implies that the average size of the WSDL documents in the Xmethods Web Service Registry is larger than in the other four Web Service Registries, in particular the Ebi Web Service Registry, which takes just 82 milliseconds for obtaining the WSDL document.

Figure4-9 Average time cost for obtaining WSDL document in all Web Service Registries


Figure 4-10, figure 4-11 and figure 4-12 show the average time cost of generating these three different outputs in all these five Web Service Registries. As can be seen from figures 4-10 and 4-11, the average time for generating the XML file of one Web service is the same for all five Web Service Registries, namely only 2 milliseconds, and the average time for generating the INI file is also identical everywhere, with a value of just 1 millisecond. Even the sum of these two average time costs is still so small that it can be neglected when compared to the overall average time cost of getting one Web service in each Web Service Registry, as shown in figure 4-13. This implies that the generation of the XML and INI files is finished immediately after receiving the service properties of a Web service as input. Furthermore, figure 4-12 shows that although the average time cost of creating the database record for each Web service is larger than the time for generating the XML and INI files in all five Web Service Registries, the operation of creating the database record is still fast.

Figure4-10 Average time cost for generating XML file in all Web Service Registries

Figure4-11 Average time cost for generating INI file in all Web Service Registries


Figure4-12 Average time cost for creating database record in all Web Service Registries

Figure4-13 Average time cost for getting one Web service in all Web Service Registries

Figure 4-13 gives the average time cost for getting one single Web service in all these five Web Service Registries. Without any doubt the Biocatalogue Web Service Registry takes the longest time for this process. This is because, as the presentation of the different parts above shows, the average time cost of each part is higher in the Biocatalogue Web Service Registry than elsewhere, except for the process of obtaining the WSDL document, where the Biocatalogue Web Service Registry is not the slowest. Moreover, a remarkable observation can be made when looking at figures 4-8, 4-12 and 4-13: the shapes of these curves show almost the same trend. This further indicates that a Web Service Registry that spends more time to get the description information of a Web service also offers more information about that Web service.


5 Conclusion and Further Direction

This master thesis provides a schema which aims to explore the description information of the Web services hosted in different Web Service Registries. The description information of a Web service consists of the WSDL document and the service information about that Web service. Although there is an existing approach, the Pica-Pica Web Service Description Crawler, that can be used to obtain the WSDL document and service information of the Web services in these Web Service Registries, its functionality is restricted: it explores only a small subset of the Web services hosted in these Web Service Registries, the storage of these Web services is not flexible, and, most importantly, only a little service information of the Web services is extracted; for some Web Service Registries no service information is extracted at all. The work presented in this master thesis, in contrast, is able to explore all Web services in these Web Service Registries. Moreover, with the approach presented in this master thesis as much service information of the Web services as possible is extracted, so that the final result is the largest annotated service catalogue ever produced. Furthermore, regarding the storage of the description information of the Web services, this master thesis provides three different ways that guarantee not only the completeness but also the longevity of the description information of the Web services.
However, in the implementation performed in this master thesis the whois client used for querying the information of a service domain returns free text if the information exists, and sometimes this free text differs completely from case to case. Because of that, each Web service in all Web Service Registries had to be crawled at least once in the experiment stage, so that all variants of this free text could be foreseen and processed afterwards. Nevertheless this is a huge amount of work, because there are lots of Web services in these Web Service Registries. Therefore, in order to simplify the work, another whois client that eases this task needs to be found and used.
Moreover, in the experiment stage of this master thesis the time cost for getting a Web service is still large. For the purpose of reducing this time, multithreaded programming could also be applied to some parts of the process for getting one single Web service.
Although the work performed here is specialized for only these five Web Service Registries, the main parts of the principles used here are adaptable to other Web Service Registries with only small changes in the implementation code or the structure.


6 Bibliography

[1] Nathalie Steinmetz, Holger Lausen, Dario Cerizza, Andrea Turati, Irene Celino, Adam Funk and Michael Erdmann, "D1.1 – Requirement Analysis and Architectural Plan", available from http://www.service-finder.eu/deliverables/public/7-public/32-d11, Emanuele Della Valle (CEFRIEL), June 27, 2008
[2] Nathalie Steinmetz, Holger Lausen, Dario Cerizza, Andrea Turati, Irene Celino, Adam Funk and Michael Erdmann, "D1.2 – First Design of Service-Finder as a Whole", available from http://www.service-finder.eu/deliverables/public/7-public/37-d12-first-design-of-service-finder-as-a-whole, Emanuele Della Valle (CEFRIEL), July 1, 2008
[3] Nathalie Steinmetz, Holger Lausen, Irene Celino, Dario Cerizza, Saartje Brockmans, Adam Funk, "D1.3 – Revised Requirement Analysis and Architectural Plan", available from http://www.service-finder.eu/deliverables/public/7-public/57-d13-revised-requirement-analysis-and-architectural-plan, Emanuele Della Valle (CEFRIEL), April 1, 2009
[4] Chia-Hui Chang, Mohammed Kayed, Moheb Ramzy Girgis, Khaled Shaalan, "A Survey of Web Information Extraction Systems", Volume 18, Issue 10, IEEE Computer Society, pp. 1411-1428, October 2006
[5] Leonard Richardson, "Beautiful Soup Documentation", available from http://www.crummy.com/software/BeautifulSoup/documentation.html, October 13, 2008
[6] Hao He, Hugo Haas, David Orchard, "Web Services Architecture Usage Scenarios", available from http://www.w3.org/TR/ws-arch-scenarios/, February 11, 2004
[7] Stephen Soderland, "Learning Information Extraction Rules for Semi-Structured and Free Text", Volume 34, Issue 1-3, Journal of Machine Learning, Department of Computer Science and Engineering, University of Washington, Seattle, pp. 233-272, February 1999
[8] Ian Hickson, "A Vocabulary and Associated APIs for HTML and XHTML", World Wide Web Consortium, Working Draft WD-html5-20100624, January 22, 2008
[9] Holger Lausen, Jos de Bruijn, Axel Polleres, Dieter Fensel, "The Web Service Modeling Language WSML", WSML Deliverable D16.1 v0.2, March 20, 2005, available from http://www.wsmo.org/TR/d16/d16.1/v0.2/
[10] Dumitru Roman, Holger Lausen, Uwe Keller, "Web Service Modeling Ontology - Standard (WSMO-Standard)", WSMO deliverable D2 version 1.1, 06 March 2004, available from http://www.wsmo.org/TR/d2/v1.1/
[11] Iris Braum, Anja Strunk, Gergana Stoyanova, Bastian Buder, "ConQo – A Context- And QoS-Aware Service Discovery", TU Dresden, Department of Computer Science, in Proceedings of WWW/Internet 2008


7 Appendixes

There are additional outputs of this master program, namely the log information file and the statistic report file. Figure 8-1 shows one of the basic output formats of the log information for these five Web Service Registries.
Figure8-1 Log information of the "Service Repository" Web Service Registry
Figure8-2 Statistic information of the "Service Repository" Web Service Registry


Figure 8-2 to figure 8-6 show the output format of the statistic information for the "Service Repository", "Ebi", "Xmethods", "Seekda" and "Biocatalogue" Web Service Registries respectively.
Figure8-3 Statistic information of the "Ebi" Web Service Registry
Figure8-4 Statistic information of the "Xmethods" Web Service Registry


Figure8-5 Statistic information of the "Seekda" Web Service Registry
Figure8-6 Statistic information of the "Biocatalogue" Web Service Registry


Table of Figures

Figure2-1 Dataflow of Service-Finder and Its Components ... 12
Figure2-2 Left is the free text input type and right is its output ... 16
Figure2-3 A Semi-structured page containing data records (in rectangular box) to be extracted ... 16
Figure2-4 Architecture of the Pica-Pica Web Service Description Crawler ... 20
Figure3-1 Overview of the Basic Architecture for the Deep Web Services crawler ... 25
Figure3-2 Overview the process flow of the Web Service Extractor Component ... 27
Figure3-3 Service list page of the Service-Repository ... 29
Figure3-4 Origianl source code of the internal link for Web service "BLZService" ... 29
Figure3-5 Code Overview of getting service page link in Service Repository ... 29
Figure3-6 Service page of the Web service "BLZService" ... 29
Figure3-7 Overview the process flow of the WSDL Grabber Component ... 30
Figure3-8 WSDL link of the Web service "BLZService" in the service page ... 31
Figure3-9 Original source code of the WSDL link for Web service "BLZService" ... 32
Figure3-10 Code overview of "getServiceRepositoryWSDLLink" function ... 32
Figure3-11 Code overview of "oneParameter" function ... 32
Figure3-12 Overview the process flow of the Property Grabber Component ... 33
Figure3-13 Structure properties of the Service "BLZService" in service list page ... 37
Figure3-14 Structure properties of the Service "BLZService" in service page ... 38
Figure3-15 Endpoint information of the Web service "BLZService" in service page ... 38
Figure3-16 Monitoring Information of the Service "BLZService" in service page ... 39
Figure3-17 Whois Information of the service domain "thomas-bayercom" ... 40
Figure3-18 Overview the process flow of the Storage Component ... 41
Figure3-19 Implementation code for getting WSDL document ... 44
Figure3-20 Implementation code for generating XML file ... 44
Figure3-21 Implementation code for generating INI file ... 45
Figure3-22 Implementation code for creating table in database ... 45
Figure3-23 Implementation code for generating table records ... 46
Figure4-1 Service amount statistic of these five Web Service Registries ... 49
Figure4-2 Statistic information for WSDL Document ... 50
Figure4-3 Average Number of Service Properties ... 51
Figure4-4 WSDL Document format of one Web service ... 52
Figure4-5 INI File format of one Web service ... 53
Figure4-6 XML File format of one Web service ... 53
Figure4-7 Database data format for all Web services ... 53
Figure4-8 Average time cost for extracting service property in all Web Service Registries ... 55
Figure4-9 Average time cost for obtaining WSDL document in all Web Service Registries ... 56
Figure4-10 Average time cost for generating XML file in all Web Service Registries ... 57
Figure4-11 Average time cost for generating INI file in all Web Service Registries ... 57
Figure4-12 Average time cost for creating database record in all Web Service Registries ... 58
Figure4-13 Average time cost for getting one Web service in all Web Service Registries ... 58


Table of Tables

Table 3-1 Structured Information of Service-Repository Web Service Registry ... 34
Table 3-2 Structured Information of Xmethods Web Service Registry ... 34
Table 3-3 Structured Information of Seekda Web Service Registry ... 34
Table 3-4 Structured Information of Ebi Web Service Registry ... 34
Table 3-5 Structured Information of Biocatalogue Web Service Registry ... 34
Table 3-6 SOAP Operation Information of Biocatalogue Web Service Registry ... 35
Table 3-7 REST Operation Information of Biocatalogue Web Service Registry ... 35
Table 3-8 Endpoint Information of these five Web Service Registries ... 35
Table 3-9 Monitoring Information of these five Web Service Registries ... 35
Table 3-10 Whois Information for these five Web Service Registries ... 36
Table 3-11 Extracted Structured Information of Web Service "BLZService" ... 38
Table 3-12 Extracted Endpoint Information of the Web service "BLZService" ... 39
Table 3-13 Extracted Monitoring Information of the Web service "BLZService" ... 39
Table 3-14 Extracted Whois Information of service domain "thomas-bayercom" ... 40
Table 3-15 Sleep Time of these five Web Service Registries ... 47
Table4-1 Service amount statistic of these five Web Service Registries ... 48
Table4-2 Statistic information for WSDL Document ... 49
Table4-3 Average time cost information for all Web Service Registries ... 55


Table of Abbreviations

DTD Document Type Definition

DWSC Deep Web Service Crawler

HTML HyperText Markup Language

MTBF Mean Time between Failures

MTTR Mean Time to Recovery

QoS Quality of Service

REST Representational State Transfer

RTT Round Trip Time

SOAP Simple Object Access Protocol

SQL Structured Query Language

URL Uniform Resource Locator

WHATWG Web Hypertext Application Technology Working Group

XML eXtensible Markup Language

WSDL Web Service Description Language

WSML Web Service Modeling Language

WSMO Web Service Modeling Ontology

Page 6: Deep Web Service Crawler

Deep Web Service Crawler

6

7 Appendixes 61

Table of Figures 64

Table of Tables 65

Table of Abbreviations 66

Deep Web Service Crawler

7

1 Introduction

In this introduction chapter of the master thesis at first itrsquos going to concisely explain the background

of the current situation then is the basic introduction of the proposed approach which is called Deep

Web Service Extraction Crawler

11 BackgroundMotivation

In the late 1990rsquos the Web Service Registry was a hot commodity The formal definition of Web

Service Registry is known as a link links page Its function is to uniformly present information that

comes from various sources Hence it can provide a convenient channel to the users for offering

searching and using the Web Services Actually the related metadata of the Web Services that

submitted by both the system and users are commonly hosted along with the Service descriptions

Nevertheless in fact when users enter one of the Web Service Registries to look for some Web

Services they might meet some situations that would bring lots of trouble to them One of the

situations may be like that these Web Service Registries return several similar published Web Services

after the users search on it For example two or more Web Services have the same name but their

versions are not the same Or two or more Web Services that derived from the same server but have

different contents etc Furthermore most users are also interested in a global view of the published

services For instance they want to know which Web Service Registry can provide better quality for

the Web Service Therefore in order to help users to differentiate those similar published Web

Services and have a global view of the Web Services this information should be monitored and rated

Moreover there are a great many Web Service Registries in the Internet Each Web Service Registry

can provide a great number of Web Services Obviously there might have some similar Web Services

among these Web Service Registries Or a Web Service in one of the Web Service Registry is related to

another Web Service in other Web Service Registries Hence these Web Services should be

comparable across different Web Service Registries However recently there has not much support of

this In addition towards the metadata actually not all of them are structured especially the

descriptions of the non-functional property Therefore what have to do now is to turn those

non-functional property descriptions into the structured format Clearly speaking it needs to extract

as much information as possible about the Web Services that offered in the Web Service Registries

Eventually after extracting all the information from the Web Service Registries it is necessary to store

them into the disk This procedure should be efficient flexible and completeness

12 Initial Designing of the Deep Web Service

Crawler Approach

The problems have already been stated in the previous section Hence the following work is to solve

Deep Web Service Crawler

8

these problems In this section it will present the basic principle of Deep Web Service Crawler

approach

At first it is the simply introduction of the Deep Web Service Crawler approach to these problems As

have already been mentioned each Web Service Registry can offer Web Services Moreover each

Web Service Registry has its own html page structures These structures may be the same or even

complete different Therefore the first thing is to identify which Web Service Registry that it will be

going to explore Since each Web Service Registry owns a unique URL this job can be done by directly

analyzing the corresponding URL address of that Web Service Registry After identifying which Web

Service Registry it is going to explore the following step is to obtain all these Web Services that

published in that Web Service Registry Then with all these obtained Web Services it is time to extract

analyze and gather the information of the services That information can be in structured format or

even in unstructured format In this master thesis some Deep Web Analysis Techniques will be

applied to obtain this information So that the information about each Web Service shall be the

largest annotated The last but not the least important all the information about the Web Services

need to be stored

13 Goals of this Master Thesis

The lists in the following are the goals of this master thesis

n Produce the largest annotated Service Catalogue

Service Catalogue is a list of service properties The more properties the service has the larger

Service Catalogue it owns Therefore this master program should extract as much service

properties as possible

n Flexible storage of these metadata of each service as annotations or dedicated documents

The metadata of one service includes not only the WSDL document but also service properties

All these metadata are important information for the service Therefore this master program

should provide flexible ways to store these metadata into the disk

n Improve the comparable property of the Web Services across different Web Service Registries

The names of service properties for one Web Service Registry could be different from another

Web Service Registry Hence for the purpose of improving the comparable ability all these

names of the service properties should be uniformed and well-defined

14 Outline of this Master Thesis

In this chapter the motivation objective and initial approach plan have already been discussed

Thereafter the remaining paper is structured as follows

Firstly in chapter 2 the work presented is almost based on these existing techniques In section 21

there is given a detailed introduction to the technique of the Service-Finder project Then in section

22 the already implemented technique of Pica-Pica Web Service Description Crawler is introduction

and discussed After that in section 23 the Information Retrieval technique is presented

Deep Web Service Crawler

9

Then following in chapter 3 it is explained the designing details of this Deep Web Service Crawler

approach In section 31 it gives a short description for the different requirements of this approach

Next in section 32 the actual design for the Deep Web Service Crawler is presented Then in section

33 34 the multithreaded programming and sleep time configuration that used in this master

program are introduced respectively

In chapter 4 it is supposed to display the experiments of this Deep Web Service Crawler approach

and then give some evaluation of it

Finally in Section 5 the conclusion and discussion of the work that already done as well as the work in

the future for this master task are presented respectively

Deep Web Service Crawler

10

2 State of the Art

This chapter aims at presenting some existing techniques or Strategies that related to the work of

applying this Deep Web Service Extraction Crawler approach Section 21 plans to talk about the

existing catalogues Service-Finder project And then in section 22 it is going to explain the existing

implemented crawlers Pica-Pica Web Service Description Crawler Finally in the section 23 it is

supposed to present some details about the Information Extraction technique

21 Service Finder Project

Service-Finder Project aims at developing a platform for Web Service discovery especially for the Web

Services that are embedded in a Web 20 environment [1] Hence it can provide an efficient access to

publicly available services The goals of the Service-Finder project are depicted as follows [1]

n Automatically gather Web Services and their related information

n Semi-automatically create semantic service description based on the information that available

on the Web

n Create and improve semantic annotations via the user feedback

n Describe the aggregated information in semantic models and allow reasoning query

However before describing the basic functionality of the Service-Finder Project there is going to

present one of its use cases and requirements first

211 Use Cases for Service-Finder Project

The Service-Finder Project employed the use case methodology of the W3C Use Case description [6]

for its needs and then applied this methodology to the use cases that it enumerated

2111 Use Case Methodology

There are three aspects needed to be considered for use case definitions of Service-Finder Project [1]

(1) Description that used to describe information of the use case

(2) Actors Roles and Goals that used to identify the actors they would be the roles they act and the

goals they need to achieve in the scenario

(3) Storyboard that used to describe the serial of interactions among the actors and the

Service-Finder Portal

2112 System Administrator

This section is going to present the use case that applied to the Service-Finder portal and that

illustrated the requirements on its functionality from a user point of view However all these

Deep Web Service Crawler

11

information in this use case are derived from [1] In this use case there has a system administrator

whose name is ldquoSam Adamsrdquo He works for the bank His job is to keep the online payment facilities

online and working all day and night Therefore if there is any system failures Sam Adams should fix

the problems as early as he can Thatrsquos why he wants to use an SMS Messaging service which will

alert him immediately by sending him a SMS Message in the case of a system failure

n Description

This use case is dedicated to describe the system administrator ldquoSam Adamsrdquo who is looking for an

SMS Messaging Service that he wants to build it into his application

n Actors Roles and Goals

The name of the Actor is ldquoSam Adamsrdquo His Role is a system administrator at a bank The goals of him

are the immediate service delivery the reliability of the service and low base fee and transaction fee

n Storyboard

Step 1 Because of Sam Adams knows the Service-Finder portal and he also knows that he can find

many useful services from it especially he know what he is looking for Hence he visits the

Service-Finder portal and starts to search by entering the keywords ldquoSMS Alertrdquo

Requirement 1 Search functionality

Step 2 Now the Service-Finder returns a list of matching services However Sam wants to choose the

number of matching services that will be displayed on one page And he would also expect there has

short information about the service functionality the service provider and the service availability So

that he could decide which service he will choose to read further

Requirement 2 Enable configurable pagination of the matching results and have some short

information for each service

Step 3 When Sam looks through the short information about the services that displayed on the first

page he expects to find the most relevant services that related to his request After that he would

like to read more detailed information about that service to see whether this service can provide the

needed functionality

Requirement 3 Rank the returned matching services and must provide ability to read more details of

a service

Step 4 In the case that all the returned matching services Sam got provide quite different

functionalities or they belong to different service categories for example the SMS messaging services

alert users not through SMS but voice messaging For this reason Sam would like to see other

different categories that may be contain the services he wants Or the services of other categories

which he is also interested in (like ldquoSMS Messagingrdquo) Besides another possible way is that Sam can

further filter his search in terms of browsing through categories

Requirement 4 Service categories and allow the user to look all services that belonged to that specific

category If possible it should also allow the user to browse through categories

Step 5 When Sam got all the services that could provide a SMS messaging service via the methods

described in the Step 4 at present he wants to look for the services that offered by an Austrian

provider and have no base fees if possible

Requirement 5 Faceted search

Deep Web Service Crawler

12

Step 6 After Sam got all these specific services now he would like to choose the services that can

provide a high reliability

Requirement 6 Sort functionality based on usersrsquo chooses

Step 7 For now Sam expects to compare the service availability between the promised to the service

provider and the actually provided This should be contained in the servicesrsquo details And there needs

also have service coverage information so that Sam can know whether this service covers the areas

he lives and works Moreover Sam would also like to compare these services in other way For

instance put some services into a structured table to compare the transaction fees

Requirement 7 A side-by-side comparison table for services and a functionality that enable users to

select services he wants to compare

Step 8 At last Sam wants to know whether the service providers offer a free try out of the services

So that he can test the service functionality

Requirement 8 If possible display a note that offering free service trials

212 Architecture Plan for the Service-Finder Project

The architecture plan of the Service-Finder project contains five basic components Service Crawler

Automatic Annotator Conceptual Indexer and Matcher Cluster Engine and Service-Finder Portal

Interface Figure 2-1 presents a high level overview of the components and the data flow among

them

Figure 2-1: Dataflow of Service-Finder and Its Components [3]


2121 The Principle of the Service Crawler Component

The Service Crawler is responsible for gathering available services and their related information from

the Web The overall cycle is depicted as following

(1) A Web developer publishes a Web Service

(2) Then the Crawling component begins to harvest the Web in order to identify the Web Services

like WSDL (Web Service Description Language) documents

(3) The Crawler is also going to search for other related information as long as a service is discovered

(4) Later after each periodic interval the Crawler will produce a consistent snapshot of the relevant

part of the Web

At last the output of the crawler would be forwarded to the subsequent components for analyzing

indexing and displaying

2122 The Principle of the Automatic Annotator Component

The Automatic Annotator receives the relevant data from previous component and generates

semantic service descriptions about the WSDL documents and its related information based on the

Service-Finder Ontology and Service Category Ontology

First, the two compatible ontologies that are used throughout the whole process are briefly introduced [2]:

n Generic Service Ontology: an ontology used to describe the data objects, for example the services, the service providers, availability, payment modalities and so on.

n Service Category Ontology: an ontology used to categorize the functionalities or applications of the services, for instance data verification, messaging, data storage, weather, etc.

Afterwards, the function of this component, together with its input and output, is described.

Oslash Input

u Crawled data from Service Crawler

u Service-Finder Ontologies

u Feedback or Correction of before annotations

Oslash Function

u Enrich the information about the service and extract semantic statements according to the

Service-Finder Ontologies For example categorize the service according to the Service

Category Ontology

u Determine whether a particular document is relevant or not through the Web link graph If

not discard these irrelevant documents

u Classify the pages into their genres For instance pricing user comments FAQ and so on

Oslash Output

u Semantic annotation of the services


2123 The Principle of the Conceptual Indexer and Matcher

Component

The Conceptual Indexer and Matcher is essentially a data store center that stores all extracted information about the services and supplies users with retrieval and semantic query capabilities, for example the matchmaking between user requests and service offers and the retrieval of user feedback on extracted annotations.

In addition, the function of this component and its input and output are listed below.

Oslash Input

u Semantic annotation data and full text information obtained from Automatic Annotation

u Semantic annotation data and full text information that come from user interfaces

u Cluster data from user and service clustering component

Oslash Function

u Store the semantic annotations received from the Automatic Annotation component and

from the user interface

u Store the cluster data that procured through the clustering component

u Store and index the textual description offered by the Automatic Annotation component

and the textual comments offered by users

u Ontological query the semantic data from the data store center

u Combined keyword and Ontological querying used for user queries

u Provide a list of similar services for a given service

Oslash Output

u A list of matching services that are queried by users In particular these services should be

sorted by ranking and can also be iterated

u All available data that related to a particular entity must be retrievable at the user interface

2124 The Principle of the Service-Finder Portal Interface

Component

The Service-Finder Portal Interface is the main entry point that provided for users of the

Service-Finder system to search and browse the data which is managed by the Conceptual Indexer

and Matcher component In addition the users can also contribute information by means of providing

tags, comments, categorizations and ratings to the browsed data. Furthermore, developers can directly invoke the Service-Finder functionalities from their custom applications by means of an API.

The details of this component's function, input and output are presented below.

Oslash Input

u A list of ordered services for a query

u Detailed information about a service or a set of services and a service provider

u Query access to service category ontology and the most used tags provided by the users


u Service availability information

Oslash Function

u The Web Interface allows the users to search services by keyword tag or concept in the

categorization sort and filter query results by refining the query compare and bookmark

services try out the services that offer this functionality

u The API allows the developers to invoke Service-Finder functionalities

Oslash Output

u Explicit user annotations such as tags, ratings, comments, descriptions and so on

u Implicit user data for example click stream of users bookmarks comparisons links sent

etc

u Manual advertising of available new services

2125 The Principle of the Cluster Engine Component

The Cluster Engine is in charge of collecting and analyzing information of user behaviors from the

Service-Finder Portal eg the queried services and the compared services of the users Moreover it

also provides cluster data to the Conceptual Indexer and Matcher for providing service

recommendations

Furthermore, this component's function, input and output are introduced in detail below.

Oslash Input

u Service annotation data of both extracted and user feedback

u Usersrsquo Click streams used for extracting user behaviors

Oslash Function

u Obtain user clusters from user behaviors

u Obtain service clusters from service annotation data to enable to find similar services

Oslash Output

u Clusters of users and services

22 Information Extraction

Due to the rapid development and use of the World-Wide Web, a huge number of information sources have appeared on the Internet, but access to them has been limited to browsing and searching because of the heterogeneity and the lack of structure of Web information sources. Therefore Information Extraction, which transforms Web pages into program-friendly structures for post-processing, becomes a great necessity. The task of Information Extraction is specified in terms of its inputs and its extraction targets, and the technique used in the process of Information Extraction is called an extractor.

221 Input Types of Information Extraction

Generally speaking, there are three different input types. The first input type is the unstructured document, for example the free text shown in figure 2-2; it is unstructured and written in natural language, so it requires substantial natural language processing. The second input type is the structured document, for instance XML documents, whose data can be described through an available DTD (Document Type Definition) or XML (eXtensible Markup Language) schema. The third input type is the semi-structured document, which is widespread on the Web, such as the large volume of HTML pages with tables, itemized lists and enumerated lists; HTML tags are often used to render these embedded data in the HTML pages. See figure 2-3.

Figure 2-2: Left is the free text input type and right is its output [4]

Figure 2-3: A semi-structured page containing data records (in rectangular box) to be extracted [4]

Inputs of the semi-structured type can therefore be seen as documents with a fairly regular structure, and the data of these documents can be displayed in an HTML or non-HTML format. Besides, since the Web pages of the Deep Web are dynamic and generated from structured databases in terms of some templates or layouts, the Deep Web can be considered as one of the input sources that provide such semi-structured documents. For example, the authors, price and comments of the book pages provided by Amazon have the same layout, because these Web pages are generated from the same database and rendered with the same template or layout. Furthermore, semi-structured HTML pages can also be generated manually. For example, although the publication lists on different researchers' homepages are produced by diverse users, they all have a title and a source property for every single paper. Finally, the inputs for some Information Extraction tasks can also be pages of the same class or pages from various Web Service Registries.

222 Extraction Targets of Information Extraction

Moreover, the task of Information Extraction has to consider the extraction target. There are two different extraction targets. The first one is the relation of a k-tuple, where k is the number of attributes in a record; in some cases an attribute of a record has no instantiation, while in other cases an attribute has multiple instantiations. The second extraction target is the complex object with hierarchically organized data. Though the ways of depicting extraction targets in a page are diverse, the most common structure is the hierarchical tree. Such a tree may contain only one leaf node, or one or more lists of leaf nodes grouped under internal nodes. The structure of a data object may also be flat or nested: if the structure is flat, there is only one leaf node, which can also be called the root; if it is nested, the internal nodes involved in this data object extend over more than two levels.

Furthermore, in order to make Web pages readable for human beings and easier to visualize, the tables or tuples of the same list, or the elements of a tuple, are usually clearly isolated or demarcated. However, the display of a data object in a Web page can be affected by the following conditions [4]; a small illustrative sketch follows this list.

Oslash The attribute of a data object has zero or several values

(1) If there is no value for an attribute of a data object, this attribute is called a "none" attribute. For example, a special offer that is only available for certain books might be a "none" attribute.

(2) If there is more than one value for an attribute of a data object, it is called a "multiValue" attribute. For instance, the name of the author of a book could be a "multiValue" attribute.

Oslash The set of attributes (A1, A2, A3, ...) has multiple orderings

That is to say, within this set of attributes the position of an attribute might change across different instances of a data object. Such an attribute is called a "multiOrdering" attribute. For instance, for movies before the year 1999 a movie site might list the release date in front of the movie's title, while for movies from 1999 onwards it lists the release date behind the movie's title.

Oslash The attribute has different formats

This means the display format of the data object can be completely distinct for different instances. Therefore, if the format of an attribute is free, a lot of rules are needed to deal with all possible cases. This kind of attribute is called a "multiFormat" attribute. For example, an e-commerce Web site might use a bold font to present the regular prices while using a red color to display the sale prices. There is also another situation in which several different attributes of a data object have the same format: for example, various attributes are all presented with <TD> tags in a table presentation. Attributes like these can be differentiated by means of their order information. However, for cases where a "none" attribute occurs or "multiOrdering" attributes exist, the rules for extracting these attributes have to be revised.

Oslash The attribute cannot be decomposed

For easier processing, the input documents are sometimes treated as strings of tokens instead of strings of characters. However, some attributes cannot be decomposed into several individual tokens; these attributes are called "untokenized" attributes. Examples are college course codes like "COMP4016" or "GEOL2001": the department code and the course number cannot be separated into two different strings of characters such as "COMP" and "4016" or "GEOL" and "2001".
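To make these cases concrete, the following is a minimal, hypothetical Java sketch of a record for one extracted data object (a book page). The class and field names are illustrative only and do not come from any particular extraction system; they merely show a "multiValue" attribute and a "none" attribute side by side.

import java.util.List;
import java.util.Optional;

// Hypothetical record for one extracted data object (a book page).
// "authors" may hold several values (a "multiValue" attribute),
// "specialOffer" may be absent for most instances (a "none" attribute).
public class BookRecord {
    private final String title;                  // always present
    private final List<String> authors;          // "multiValue" attribute
    private final Optional<String> specialOffer; // "none" attribute

    public BookRecord(String title, List<String> authors, Optional<String> specialOffer) {
        this.title = title;
        this.authors = authors;
        this.specialOffer = specialOffer;
    }

    public String getTitle() { return title; }
    public List<String> getAuthors() { return authors; }
    public Optional<String> getSpecialOffer() { return specialOffer; }
}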

223 The Used Techniques in Information Extraction

The extractor used in the process of Information Extraction aims at providing a single uniform query interface to access information sources such as database servers and Web servers. It consists of the following phases: collecting returned Web pages, labeling these Web pages, generalizing extraction rules, extracting the relevant data, and outputting the result in an appropriate format (XML format or relational database) for further information integration. For example, the extractor first queries the Web server to gather the returned pages through the HTTP protocol; after that it starts to extract the contents of these HTML documents and integrates them with other data sources thereafter. The whole process of the extractor follows the steps below:

Oslash Step 1

At the beginning it must have to tokenize the input However there are two different

granularities for the input string tokenization They are tag-level encoding and word-level

encoding The tag-level encoding will transform the tags of HTML page into the general tokens

while transform all text string between two tags into a special token Nevertheless the

word-level encoding does this in another way It treats each word in a document as a token

Oslash Step 2

Next, it applies the extraction rules for every attribute of the data object in the Web pages. These extraction rules can be induced by top-down or bottom-up generalization, pattern mining, or logic programming. In addition, the type of extraction rule may be expressed by means of regular grammars or logic rules. For example, some use path expressions of the HTML parse tree, like html.head.title or html->table[0]; some use syntactic or semantic constraints; and some use delimiter-based constraints such as HTML tags or literal words. A small sketch of such a delimiter-based rule is given after these steps.

Oslash Step 3

After that, all the extracted data are assembled into records.

Oslash Step 4

Finally, this process is iterated until all the data objects in the input have been processed.
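As an illustration of step 2, the following small Java fragment applies one delimiter-based extraction rule: it uses the surrounding HTML tags as delimiters to pull the text of every table cell out of a page. The rule and the sample input are invented for demonstration and are not taken from any of the systems discussed here.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DelimiterRuleDemo {
    public static void main(String[] args) {
        // Sample semi-structured input; in a real extractor this would be a crawled page.
        String html = "<tr><td>BLZService</td><td>4.5</td></tr>";

        // Delimiter-based rule: everything between <td> and </td> is one attribute value.
        Pattern rule = Pattern.compile("<td>(.*?)</td>");
        Matcher m = rule.matcher(html);
        while (m.find()) {
            System.out.println("extracted value: " + m.group(1));
        }
    }
}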


23 Pica-Pica Web Service Description Crawler

Pica-Pica is known as a bird species, also called the magpie. Here, however, Pica-Pica is a Web Service Description Crawler which is designed to address the problem of Web Service quality, for example evaluating the descriptive quality of the offered Web Services and how well these Web Services are described in today's Web Service Registries.

231 Needed Libraries of the Pica-Pica Web Service

Description Crawler

This version of the Pica-Pica Web Service Description Crawler was written by Anton Caceres and Josef Spillner and is implemented in the Python language. In order to run these scripts to parse the HTML pages, two additional libraries are needed: Beautiful Soup and Html5lib.

n Beautiful Soup

It is an HTML/XML parser for the Python language, and it can even turn invalid markup into a parse tree [5]. Moreover, the following three features make it more powerful:

u Bad markup doesn't choke Beautiful Soup. In fact it will generate a parse tree that makes approximately as much sense as the original document, so you can obtain the data that you want.

u Beautiful Soup provides a toolkit with simple, idiomatic methods for navigating, searching and modifying the parse tree. Hence you don't need to create a custom parser for every application.

u If the document has already specified an encoding, you can ignore it, since Beautiful Soup can convert the documents from Unicode to UTF-8 automatically. Otherwise, all you have to do is to specify the encoding of the original documents.

Furthermore, the ways of including Beautiful Soup in an application are shown in the following [5]:

sup2 from BeautifulSoup import BeautifulSoup          # for processing HTML

sup2 from BeautifulSoup import BeautifulStoneSoup     # for processing XML

sup2 import BeautifulSoup                             # to get everything

n Html5lib

It is a Python package which can implement the HTML5 [8] parsing algorithm And in order to

gain maximum compatibility with the current major desktop web browsers this implementation

will be based on the WHATWG (Web Hypertext Application Technology Working Group) HTML5

specification


232 Architecture of the Pica-Pica Web Service

Description Crawler

Figure 2-4: Architecture of the Pica-Pica Web Service Description Crawler

Figure 2-4 illustrates the primary architecture of Pica-Pica Web Service Description Crawler It includes

four fundamental components Service Page Grabber component WSDL Grabber component

Property Grabber component and WSML Register component

(1) The Service Page Grabber component takes the URL seed as input and outputs the link of the service page to the following two components: the WSDL Grabber component and the Property Grabber component.

(2) The WSDL Grabber component is responsible for obtaining the WSDL document based on the delivered service page's link and then checking whether the obtained WSDL document is valid. Finally, only the valid WSDL documents are passed to the WSML Register component for further processing.

(3) The Property Grabber component tries to extract the service's properties hosted in the service page, if there are any. After that, all these properties are saved into an INI file as the information about that service.

(4) The functionality of the WSML Register component is to write an appropriate WSML document by means of the valid WSDL documents delivered from the WSDL Grabber component and the optional INI files delivered from the Property Grabber component, and afterwards to register them in Conqo.

n WSML [9]

It stands for Web Service Modeling Language which provides a framework with different

language variants Hence it is often used to describe the different aspects of the semantic Web

Services according to the conceptual model of WSMO

n WSMO [10]

WSMO, whose full name is Web Service Modeling Ontology, is dedicated to describing various aspects of Semantic Web Services based on ontologies. An ontology is a formal, explicit specification of a shared conceptualization. These ontologies are the linking points between the agreed terminology of the communities of users and the defined conceptual semantics of the real world.

n Conqo [11]


It is a discovery framework that considers not only the Quality of Service (QoS) but also context information. It uses a Web Service Repository to manage the service descriptions, which are based on WSML.

233 Implementation of the Pica-Pica Web Service

Description Crawler

This section is going to describe the processes of the implementation of the Pica-Pica Web Service

Description Crawler in detail

(1) First, to start the whole crawling process of the Pica-Pica Web Service Description Crawler, an input is needed as the initial seed. In this crawler there are five Web Service Registries, which are listed below; the URL addresses of these five Web Service Registries are used as the input seeds. Moreover, in this version of the Pica-Pica Web Service Description Crawler there is a single Python script for each Web Service Registry, and the crawling processes of these registry scripts are executed one after another.

Biocatalogue http://www.biocatalogue.com

Ebi http://www.ebi.ac.uk

Seekda http://www.seekda.com

Service-Repository http://www.service-repository.com

Xmethods http://www.xmethods.net

(2) Then after feeding with the input seed it will step into the next component Service Page Grabber

At first this component will try to read the data from the Web based on the input seed Then it

will establish a parsing tree of the read data in terms of the functions of the Beautiful Soup library

After that this Service Page Grabber component starts to look for the service page link for each

service that published in the Web Service Registry by means of the functions in Html5lib library In

the case that the service page link of one single service is found it will firstly check whether this

service page link is valid or not Once the service page link is valid it will pass it into the following

two components for further processing which are WSDL Grabber component and Property

Grabber component

(3) When the WSDL Grabber component receives a service page link from its previous component it

sets out to extract the WSDL link address for that service through the parsing tree of the data in

this service page Next this component will start to download the WSDL document of that service

in terms of the WSDL link address Thereafter the obtained WSDL document would be stored into

the disk. The process of the WSDL Grabber component is carried on continually until no more service links are passed to it. Certainly, not all grabbed WSDL documents are usable: they may contain bad definitions or a bad namespaceURI, be an empty document, or even not be in XML format at all. Hence, in order to pick these out, this component further analyzes the involved WSDL documents and puts all valid documents into a "validWSDLs" folder, whereas the invalid documents are put into a folder named "invalidWSDLs" in order to gather statistic information. Finally, only the WSDL documents in the "validWSDLs" folder are passed to the subsequent component.

(4) Moreover, since some Web Service Registries give additional information about the services, such as availability, service provider or version, the Property Grabber component sets out to extract this information as the service's properties and thereafter saves these properties into an INI file. However, if no additional information is available, there is no need to extract service properties, and thus there is no INI file for that service. In this implementation of the Pica-Pica Web Service Description Crawler, only the Python scripts for the Seekda and Service-Repository Web Service Registries have functions to extract the services' properties, while for the other three Web Service Registries there is no such function.

(5) Furthermore, it is optional to create a report file which contains statistic information about this process, such as the total number of services for one Web Service Registry, the number of services whose WSDL document is invalid, etc.

(6) As has been stated, at this point there is a folder with all valid WSDL documents and possibly some INI files. Therefore the task of the WSML Register component is now to generate the appropriate WSML documents from these valid WSDL documents and INI files and then register them in Conqo.

24 Conclusions of the Existing Strategies

This chapter presented three aspects of the existing strategies: the Service-Finder project, the Information Extraction technique and the Pica-Pica Web Service Description Crawler.

The task of this master thesis is to obtain the available Web services and their related information from the Web. This is essentially a procedure of extracting the needed information for each service, such as the service's WSDL document and its properties. Therefore the Information Extraction technique, which is used to extract information hosted on the Web, can be applied in this work.

Moreover, the Service-Finder is a large project that is not only able to obtain the available Web services and their related information from the Web, but also enriches the crawled data with annotations by means of the Service-Finder ontology and the Service Category ontology and integrates all the information into a coherent semantic model. Furthermore, it also provides capabilities for searching and browsing the data through a user interface and gives users service recommendations. However, the Service-Finder project far exceeds the requirements of a master thesis; therefore it is only considered as a reference for this work.

Furthermore, since the Pica-Pica Web Service Description Crawler aims only at obtaining the available Web services and their related information, it fulfills the primary task of this master thesis. Nevertheless, regarding the information about a service, the Pica-Pica Web Service Description Crawler extracts only a few properties, sometimes even none. Consequently, in order to improve the description quality of a service, as many properties about the service as possible have to be extracted. Hence, chapter 3 presents an extension of this Pica-Pica Web Service Description Crawler.


3 Design and Implementation

In the previous chapter on the state of the art, the already existing techniques and implementations were presented. In the following, the basic principle of the proposed approach, the Deep Web Services Crawler, is introduced; it is based on these existing techniques, especially the Pica-Pica Web Service Description Crawler.

31 Deep Web Services Crawler Requirements

This section discusses the goals of the Deep Web Service Crawler approach, the system requirements with respect to the approach, and some non-functional requirements.

311 Basic Requirements for DWSC

The following lists are the basic requirements which should be achieved

(1) Produce the largest annotated Service Catalogue. Here, the Service Catalogue is a list of Web services published in the Web Service Registries. It contains not only the WSDL document of each service but also a file with the information about that service. Therefore, for the purpose of producing the largest annotated Service Catalogue, the proposed approach needs to extract as many properties about those Web services as possible, and it also has to download the WSDL document hosted along with each Web service. That is to say, these properties are not only the interesting structured properties, such as the service name and its WSDL link address, but also other non-functional properties, for example the endpoint and its monitoring information.

(2) Flexible storage: each service has a Service Catalogue entry that contains all its interesting properties. A central question is which schemes will be used to store these service properties. Hence, in order to store them in a flexible way, the proposed approach provides three methods for the storage: the first stores them as an XML file, the second stores them in an INI file, and the third uses a database, as sketched below.
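As a small sketch of the second storage variant, the following Java fragment writes a few service properties into a simple key-value (INI-style) file using java.util.Properties. The property names and values are only examples; the actual file layout produced by the crawler may differ.

import java.io.FileOutputStream;
import java.io.IOException;
import java.util.Properties;

public class IniStorageSketch {
    public static void main(String[] args) throws IOException {
        Properties serviceProps = new Properties();
        // Example properties of one crawled service (values are illustrative).
        serviceProps.setProperty("ServiceName", "BLZService");
        serviceProps.setProperty("WSDLLink", "http://www.thomas-bayer.com/axis2/services/BLZService?wsdl");
        serviceProps.setProperty("Server", "Apache-Coyote/1.1");

        // Write them as simple key=value lines, one file per service.
        try (FileOutputStream out = new FileOutputStream("BLZService.ini")) {
            serviceProps.store(out, "Service Catalogue entry");
        }
    }
}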

312 System Requirements for DWSC

Generally speaking, the requirements for realizing a programming project include the following:

1) Operating System LinuxUnix Windows (XP Vista 2000 etc)

2) Programming language C++ Java Python C etc

3) Programming Tool NetBean Eclipse Visual Studio and so on

However, in this master thesis the scripts of the Deep Web Service Crawler approach are written in the Java programming language. Besides, these code scripts have only been tested on the Windows XP and Linux operating systems and have not been tested on other operating systems.


313 Non-Functional Requirements for DWSC

In this part several non-functional requirements for the Deep Web Service Crawler approach are

presented

1) Transparency: the process of data exploration and data storage should be done automatically without the user's intervention. However, at first the user should specify the path on the hard disk which will be used to store the outputs of this program.

2) Fault tolerance: during the execution of this program errors can inevitably happen. Therefore, in order to keep the process from being interrupted, there must be the necessary error handling so that the process can recover; a small sketch of this idea is given after this list.

3) Completeness: this approach should extract as many of the interesting properties about each Web Service as possible, e.g. endpoint and monitoring information.
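The fault-tolerance requirement can be met with ordinary exception handling around the per-service work, so that a failure for one service does not abort the whole crawl. The following Java fragment is only a sketch of this idea; the method and class names are hypothetical and not the actual interfaces of the implementation.

import java.util.List;

public class FaultTolerantLoopSketch {
    // Hypothetical per-service processing; may throw on network or parsing errors.
    static void crawlSingleService(String servicePageLink) throws Exception {
        // ... extract properties, download WSDL, store results ...
    }

    public static void crawlAll(List<String> servicePageLinks) {
        for (String link : servicePageLinks) {
            try {
                crawlSingleService(link);
            } catch (Exception e) {
                // Log the error and continue with the next service instead of stopping.
                System.err.println("Skipping " + link + ": " + e.getMessage());
            }
        }
    }
}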

In addition, since the Pica-Pica Web Service Crawler has already implemented strategies for the following five URLs, the proposed approach must cover at least these five Web Service Registries:

Biocatalogue http://www.biocatalogue.com

Ebi http://www.ebi.ac.uk

Seekda http://www.seekda.com

Service-Repository http://www.service-repository.com

Xmethods http://www.xmethods.net

32 Deep Web Services Crawler Architecture

In this section, an overview of the high-level architecture of the Deep Web Services Crawler approach is introduced first. Thereafter, four subsections outline each single component and how they play together.

The current components and flows of data in the Deep Web Service Crawler can be summarized as depicted in Figure 3-1, using the continuous arrows. It first tries to obtain the available service list page links and related service page links by crawling the Web (Web Service Extractor). Then those gathered links are processed in two separate steps: one gets the service's WSDL document (WSDL Grabber) and the other collects the properties of each service (Property Grabber). Finally, all these data are stored on the storage device (Storage). The whole detailed process of figure 3-1 is illustrated in the following steps; a condensed code sketch of this crawl loop is given after step 8.

Oslash Step 1

When the Deep Web Service Crawler starts to run the File Chooser container would require the

user to specify a path in the computer or in any other hard disk The reason for specifying the

path is that this Deep Web Service Crawler program needs a place to store all its outputs

Oslash Step 2

After that, the Web Service Extractor is triggered. It is the main entry to the actual crawling process. Since the Deep Web Service Crawler program is supposed to crawl for Web Services in some given Web Service Registries, the URL addresses of these Web Service Registries are given as the initial seed for this Web Service Extractor process. However, since the page structures of these Web Service Registries are completely different, there is a separate, registry-specific process for each Web Service Registry.

Figure 3-1: Overview of the Basic Architecture for the Deep Web Services Crawler

Oslash Step 3

According to the given seed, two types of links are obtained by the Web Service Extractor component: one is the service list page link and the other is the service page link. A service list page is a page that contains a list of Web Services and possibly some information about these Web Services, while a service page is a page that contains much more information about a single service. Finally, it forwards these two types of links to the next two components, the Property Grabber and the WSDL Grabber.

Oslash Step 4

Then on the one hand the Property Grabber component tries to gather the information about

the service that hosted in the service list page and service page Such as the name of the service

its description the ranking for this service etc Finally all the information of the service will be

collected together as the service properties which then will be delivered to the storage

component for further processing

Oslash Step 5

On the other hand, the WSDL Grabber tries to obtain the WSDL link from the service list page or the service page. The reason is that for some Web Service Registries the WSDL link is hosted in the service list page, like in the Biocatalogue, and for other Web Service Registries it is hosted in the service page, such as in Xmethods. After obtaining the WSDL link, it is also transmitted to the Storage component for further processing.

Oslash Step 6

When the service properties and the WSDL link of the service are received by the Storage component, it stores them on the disk. The service properties are stored in one of three different ways: as an XML file, as an INI file, or as one record inside a database table. For the WSDL link, the Storage component first tries to download the page content according to the URL address of the WSDL link; if this succeeds, the page content of the service is stored as a WSDL document on the disk.

Oslash Step 7

Nevertheless, this is just the crawling process for a single service, from step 3 to step 6. Hence, if there is more than one service or more than one service list page in a Web Service Registry, the crawling process from step 3 to step 6 is repeated again and again until there is no service or service list page left in that Web Service Registry.

Oslash Step 8

Furthermore, after the crawling process for one Web Service Registry finishes, a file is generated that contains some statistic information about this crawling process: for example, the time when the crawling process of this Web Service Registry started and when it finished, the total number of Web services in this Web Service Registry, how many services have an empty WSDL document, the average number of service properties in this Web Service Registry, and the average time cost for extracting service properties, getting the WSDL document and generating the XML file, INI file, etc.

321 The Function of Web Service Extractor Component

The Web Service Extractor component is responsible for gathering service list page links and service

page links It pursues a focused crawl of the Web and only forwards service list page and service page

links to subsequent components for analyzing collecting gathering purposes Therefore it is going to

identify both service list page links and related service page links on these Web Service Registries

As can be seen from figure 3-2, a crawl for Web Services needs to start from a URL seed. The seed is almost as important as the Web Service Extractor itself, as it highly influences the part of the Web which needs to be crawled. The seed can or shall contain, e.g., Web pages where these Web Services are published or pages that talk about Web Services.


Figure 3-2: Overview of the process flow of the Web Service Extractor Component

After being fed with the URL seed, the Web Service Extractor component starts to get the link of the service list page from the initial page of this URL seed. However, this process differs among the five Web Service Registries. The following shows the different situations in these Web Service Registries.

Oslash Service-Repository Web Service Registry

In this Web Service Registry the link of the first service list page is the URL address of its seed, which means some Web Services can already be found on the home page of the Service-Repository Web Service Registry. The "first" here implies that there is more than one service list page link in this Web Service Registry. Therefore, the process of getting service list page links in this registry is carried on continually until no more service list page links exist.

Oslash Xmethods Web Service Registry

Although there are Web Services on the home page of the Xmethods Web Service Registry, the number of those Web Services is only a small subset of those in this Web Service Registry. Moreover, in the Xmethods Web Service Registry there is only one page containing all Web Services. Therefore, the service list page link of that page has to be obtained.

Oslash Ebi Web Service Registry

The situation in the Ebi Web Service Registry is a little bit like that in the Xmethods Web Service Registry. That is to say, there is also one page that contains all Web Services in this Web Service Registry. However, this page is not the initial page of the input seed. Therefore, more than one operation step is needed to get the service list page link of that page.

Oslash Seekda Web Service Registry

In the Seekda Web Service Registry the Web Services are not contained in the initial page of the input seed. The service list page link can be obtained after several additional operation steps. However, there is a problem with getting the service list page links in this registry: simply put, if there is more than one page containing Web Services, then for some unknown reason the links of the remaining service list pages cannot be obtained. In other words, only the link of the first service list page can be obtained.

Oslash Biocatalogue Web Service Registry

The process of getting service list pages in the Biocatalogue Web Service Registry is almost the same as in the Seekda Web Service Registry. The only difference is that in the Biocatalogue Web Service Registry all service list page links can be obtained if there is more than one service list page.

Then, after getting the link of a service list page, the Web Service Extractor begins to get the link of the service page for each service listed in that service list page. This is possible because there is an internal link for every service which addresses its service page. It is worth noting that once a service page link is obtained, this component immediately forwards the service page link and the service list page link to the subsequent two components for further processing. The process of obtaining service page links is carried out continuously until all services listed in that service list page are crawled. Analogously, the process of getting service list pages is carried out continuously until no more service list page exists.

3211 Features of the Web Service Extractor Component

The main features are described in the following paragraphs

1) Obtain service list page links

A central task is to obtain the corresponding service list page links. A service list page link is a URL address that leads to a public list of Web Services with some brief information about these Web services, like the name of the service, an internal URL that links to another page containing the detailed information about that service, and sometimes the link address of the WSDL document.

2) Obtain service page links

Once service list page links are found a crucial aspect is to extract the internal link of each service to

aid the task of service information discovery Therefore it is the task of the Web Service Extractor to

harvest the html page content of the service list page so that the service page links which would

contain much more detailed information of the single Web service can be obtained

3212 Input of the Web Service Extractor Component

This component is dependent on some specific input seeds And the only input required for this

component is a seed of URL Actually this URL seed will be one of the URLs that displayed in section

313

3213 Output of the Web Service Extractor Component

The component will produce two service related page links from the Web

l Service list page links

l Service page links


3214 Demonstration for Web Service Extractor

In order to give a comprehensive understanding of the process of the Web Service Extractor component, the following figures serve as explanation. Though there are five URL addresses in this section, only the URL of the Service-Repository is shown as an example.

1) The input seed is the initial URL address of the Service-Repository, which is "http://www.service-repository.com".

2) As already said in section 321, the first service list link of this Web Service Registry is its input seed "http://www.service-repository.com". Figure 3-3 shows the corresponding service list page of that link.

Figure 3-3: Service list page of the Service-Repository

Figure 3-4: Original source code of the internal link for Web service "BLZService"

Figure 3-5: Code overview of getting the service page link in Service-Repository

Figure 3-6: Service page of the Web service "BLZService"


3) Now that the service list page link is known, the next step is to acquire the service page links for the services listed in the service list page. The text in the red box of figure 3-4 shows the internal link of the Web service "BLZService". However, this is not the complete link of the service page: it has to be prefixed with the initial URL address of the Service-Repository Web Service Registry, "http://www.service-repository.com". The code for getting the service page link of a Web service in the Service-Repository is shown in figure 3-5; see also the sketch after this list. Therefore, the final link of this service page is "http://www.service-repository.com/service/overview-210897616". Figure 3-6 is the corresponding service page of that link.

4) Afterwards those two links service list page link and service page link which are gathered by the

Web Service Extractor component will be immediately forwarded to the next two components

which are WSDL Grabber component and Property Grabber component
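The prefixing step described in item 3 can be expressed with the standard java.net.URI class. The following fragment is only a sketch of the idea and is not the code shown in figure 3-5; the relative form of the internal link is an assumption made for the demonstration.

import java.net.URI;

public class ServicePageLinkSketch {
    public static void main(String[] args) {
        // Base address of the Service-Repository registry and the relative link
        // taken from the anchor tag of the service "BLZService" (see figure 3-4).
        URI base = URI.create("http://www.service-repository.com/");
        String internalLink = "service/overview-210897616"; // assumed relative form

        // Resolve the relative link against the registry's base URL.
        URI servicePageLink = base.resolve(internalLink);
        System.out.println(servicePageLink); // http://www.service-repository.com/service/overview-210897616
    }
}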

322 The Function of WSDL Grabber Component

The WSDL Grabber component acquires the WSDL link hosted on the Web based on the service list page link or the service page link. The whole process flow is illustrated in figure 3-7.

Figure 3-7: Overview of the process flow of the WSDL Grabber Component

When the WSDL Grabber component receives the inputs delivered from the previous component, it starts to get the WSDL link for the service based on these inputs. Although the inputs of the WSDL Grabber component are the links of the service page and the service list page, only one of them contains the WSDL link for the corresponding service; that is to say, the WSDL link exists either in the service page or in the service list page. The reason why these two links need to be delivered to this component is that only one of the five Web Service Registries, namely Biocatalogue, hosts the WSDL link in the service list page, while for the other four Web Service Registries the WSDL link is hosted in the service page. Therefore, the WSDL link for these four Web Service Registries is obtained through the service page link, and for the Biocatalogue Web Service Registry the WSDL link is obtained through the service list page link. However, there is a problem with getting the WSDL link in the Biocatalogue Web Service Registry: in brief, some of the Web services listed in the service list page of the Biocatalogue Web Service Registry have no WSDL link, in other words these services have no WSDL document. In such a situation the value of the WSDL link for these Web services is assigned a "NULL" value. For the Web Services in the other four Web Service Registries, the WSDL link always exists in the service page. Eventually, whenever the WSDL Grabber component extracts the WSDL link for one single Web service, this link is immediately forwarded to the Storage component for downloading the WSDL document.

3221 Features of the WSDL Grabber Component

The WSDL Grabber component has to provide the following features

l Obtain WSDL links

The WSDL link is the direct way to get to the page that contains the contents of the WSDL document. It is actually a URL address, but at the end of this URL address there is something like "wsdl" or "WSDL" to indicate that this is an address that points to the page of a WSDL document.

3222 Input of the WSDL Grabber Component

This component requires the following input data

l Service list page link

l Service page link

3223 Output of the WSDL Grabber Component

The component will only produce the following output data

l The URL address of WSDL link for each service

3224 Demonstration for WSDL Grabber Component

In this section a series of figures is presented in order to give a comprehensive understanding of the process of the WSDL Grabber component. It uses the same Web service from the Service-Repository as an example.

1) The input for this WSDL Grabber component is the link of the service page obtained from the Web Service Extractor component. The address of this link is "http://www.service-repository.com/service/overview-210897616".

Figure 3-8: WSDL link of the Web service "BLZService" in the service page


2) Figure 3-9 shows the corresponding original source code of the WSDL link highlighted in figure 3-8.

Figure 3-9: Original source code of the WSDL link for Web service "BLZService"

3) Figure 3-10 and figure 3-11 show the code used to extract the WSDL link shown in figure 3-9. Figure 3-10 is the code specific to the Service-Repository Web Service Registry; for the other four Web Service Registries this code differs. The function "getServiceRepositoryWSDLLink" first gets a list of nodes that have the HTML tag name "b". Then it checks the nodes one by one to see whether the text value of a node is "WSDL". If this condition is fulfilled, the attribute value in its sibling node, an "a" element, is extracted as the value of the WSDL link for this Web service; a simplified sketch of this idea is given after this list.

Figure 3-10: Code overview of the "getServiceRepositoryWSDLLink" function

Figure 3-11: Code overview of the "oneParameter" function

4) Finally, after applying the functions shown in figures 3-10 and 3-11, the WSDL link for the Web service "BLZService" is obtained. Its value is "http://services.unitedplanet.de/blz/BlzService.asmx?WSDL".
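The lookup described in item 3 walks the HTML parse tree. As a simplified stand-in, the following Java fragment finds the same pattern (a bold "WSDL" label followed by an anchor) directly in the page source with a regular expression. It is only an illustration of the idea, not the code of figures 3-10 and 3-11, and the shortened page snippet is invented for the demonstration.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WsdlLinkRegexSketch {
    public static void main(String[] args) {
        // Shortened sample of the service page source (compare figure 3-9); invented for the demo.
        String html = "<b>WSDL:</b> <a href=\"http://services.unitedplanet.de/blz/BlzService.asmx?WSDL\">WSDL</a>";

        // Look for a bold "WSDL" label and take the href of the following anchor.
        Pattern p = Pattern.compile("<b>\\s*WSDL[^<]*</b>\\s*<a[^>]*href=\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(html);
        if (m.find()) {
            System.out.println("WSDL link: " + m.group(1));
        }
    }
}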


323 The Function of Property Grabber Component

The Property Grabber component is a module which is used to extract and gather all the Web service information hosted on the Web, namely the information shown in the service list page and the service page. In the end, all the obtained Web service information is collected together as the service properties, which are delivered to the Storage component for storing. The detailed process flow of the Property Grabber component is illustrated in figure 3-12.

The inputs of the Property Grabber component are the same as the inputs of the WSDL Grabber component. However, there is still a small difference between them with respect to the seed. As already mentioned in section 322, for the WSDL Grabber component only one of the inputs is sufficient to get the WSDL link, whereas the Property Grabber component needs both of them as seeds. Once the Property Grabber component receives the needed inputs, it starts to extract the information of that single Web service.

Figure 3-12: Overview of the process flow of the Property Grabber Component

Therefore after the Property Grabber component receives the inputs it starts to extract the service

information for the Web service Generally speaking the service information consists of four aspects

which are structured information endpoint information monitoring information and whois

information respectively

(1) Structured Information

The structured information can be obtained by extracting the information hosted in the service page and the service list page. It is the basic descriptive information about the service, such as the name of the service, the URL address through which the WSDL document can be obtained, the description introducing the service, the provider who provides the service, its rating, and the server which hosts this service. However, the elements constituting this structured information differ among the Web Service Registries. For example, the rating information of a Web service exists in the Service-Repository Web Service Registry, while the Xmethods Web Service Registry does not have this information. In addition, even for Web services in the same Web Service Registry, some elements of the structured information may not exist; for instance, one service in a Web Service Registry may have a description while another service in the same Web Service Registry does not. Tables 3-1 to 3-5 list the structured information that should be extracted from these five Web Service Registries. Moreover, if the style of a service in the Biocatalogue Web Service Registry is SOAP, there is some additional information describing the SOAP operations of this service, while if it is REST, this additional information concerns the REST operations. These should also be considered as a part of the structured information. Table 3-6 and table 3-7 list the information for these two kinds of operations.

Service Name, WSDL Link, WSDL Version, Provider, Server, Rating, Homepage, Owner Homepage, Description

Table 3-1: Structured Information of the Service-Repository Web Service Registry

Service Name, WSDL Link, Provider, Service Style, Homepage, Implementation Language, Description, User Description, Contributed Client Name, Type of this Client, Publisher for this Client, Used Toolkit of this Client, Used Language of this Client, Used Operation System of this Client

Table 3-2: Structured Information of the Xmethods Web Service Registry

Service Name, WSDL Link, Server, Provider, Provider's Country, Service Style, Rating, Description, User Description, Service Tags, Documentation (within WSDL)

Table 3-3: Structured Information of the Seekda Web Service Registry

Service Name, WSDL Link, Port Name, Service URL Address, Implementation Class

Table 3-4: Structured Information of the Ebi Web Service Registry

Service Name, WSDL Link, Style, Provider, Provider's Country, View Times, Favorite Times, Submitter, Service Tags, Total Annotation, Provider Annotation, Member Annotation, Registry Annotation, Base URL, SOAP Lab Server Base URL, Description, User Description, Category

Table 3-5: Structured Information of the Biocatalogue Web Service Registry

SOAP Operation Name, Inputs and Outputs, Operation Description, Operation Tags, Part of Which Service

Table 3-6: SOAP Operation Information of the Biocatalogue Web Service Registry

REST Operation Name, Service Tags, Used Template, Operation Description, Part of Which Service, Part of Which Endpoint Group

Table 3-7: REST Operation Information of the Biocatalogue Web Service Registry

(2) Endpoint Information

Endpoint information is the port information of the Web service, which can be extracted only from the service page. However, since the different Web Service Registries structure the endpoint information differently, some elements of the endpoint information can be very diverse. One thing needs attention: the Ebi Web Service Registry does not provide endpoint information for any of the Web services published in this registry. Moreover, even though the Web services in the same Web Service Registry share the same structure of the endpoint information, some elements of it may be missing or empty. Furthermore, these Web Service Registries may even have no endpoint information at all for some Web services published by them. Nevertheless, whenever there is endpoint information for a Web service, there is at least one element, namely the URL address of the endpoint. Table 3-8 shows the endpoint information that should be extracted from these five Web Service Registries.

Web Service Registry Name / Elements of the Endpoint Information

Service-Repository: Endpoint Name, Endpoint URL, Endpoint Critical, Endpoint Type, Bound Endpoint

Xmethods: Endpoint URL, Publisher of this Endpoint, Contact Email of this Publisher, Implementation Language of this Endpoint

Seekda: Endpoint URL

Biocatalogue: Endpoint Name, Endpoint URL

Table 3-8: Endpoint Information of these five Web Service Registries

Web Service Registry Name / Elements of the Monitoring Information

Service-Repository: Service Availability, Number of Downs, Total Uptime, Total Downtime, MTTR, MTBF, RTT Max of Endpoint, RTT Min of Endpoint, RTT Average of Endpoint, Ping Count of Endpoint

Seekda: Service Availability, Begin Time of Monitoring

Biocatalogue: Monitored Status of Endpoint, Overall Status of Service

Table 3-9: Monitoring Information of these five Web Service Registries

(3) Monitoring Information

Monitoring information is the measured statistic information for the Web service. It is worth noting that the Ebi and Xmethods Web Service Registries do not provide monitoring information for any of the Web services published by them, while for the other three Web Service Registries only a few Web services may lack this information. Table 3-9 displays the monitoring information for these three Web Service Registries.

(4) Whois Information

Whois information is not extracted from the information hosted in the service page and the service list page. It is descriptive information about the service domain, which can be gained by means of the address of the WSDL link. Because of that, the process of getting the Whois information starts after the service domain has been obtained first. The final value of the service domain must not contain strings like "http", "https", "www", etc.; it must be the registered domain under the top-level domain. After that, the service domain database is queried by sending the value of the service domain to a Whois client, which is simply a Web site on the Internet, for example "http://www.whois365.com/cn/domain". If information about that service domain exists, a list of its information is returned as the output. However, the structure of the returned information differs from one service domain to another; therefore the most challenging thing is to deal with the extraction process for each different form of the returned information. Table 3-10 gives the Whois information that needs to be extracted for all five Web Service Registries; a small sketch of the domain-extraction step follows the table.

Service Domain URL, Domain Name, Domain Type, Domain Address, Domain Description, State, Postal Code, City, Country, Country Code, Phone, Fax, Email, Organization, Established Time

Table 3-10: Whois Information for these five Web Service Registries
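As sketched here, the host part of the WSDL link can be obtained with the standard java.net.URL class, after which the whois page for that domain can simply be fetched over HTTP. The whois query URL pattern below is only assumed from the example given above and has not been verified, and the parsing of the returned page into the fields of table 3-10 is omitted.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class WhoisSketch {
    public static void main(String[] args) throws Exception {
        // Take the host of the WSDL link and strip a leading "www." to get the service domain.
        URL wsdlLink = new URL("http://www.thomas-bayer.com/axis2/services/BLZService?wsdl");
        String domain = wsdlLink.getHost();
        if (domain.startsWith("www.")) {
            domain = domain.substring(4);
        }
        System.out.println("service domain: " + domain); // thomas-bayer.com

        // Query a Web-based whois client for that domain (URL pattern assumed, not verified).
        URL whoisQuery = new URL("http://www.whois365.com/cn/domain/" + domain);
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(whoisQuery.openStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                // The returned page would be parsed here for the fields of table 3-10.
            }
        }
    }
}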

Finally, the information of these four aspects is collected together and then delivered to the Storage component for further storage processing.

3231 Features of the Property Grabber Component

The Property Grabber component has to provide the following features

l Obtain basic information

Generally speaking, the more information is available about a Web service, the better one can judge how good this Web service is. Hence this Property Grabber component needs to extract all the basic information hosted in the service list page and the service page. This basic information comprises the structured information, the endpoint information and the monitoring information.

l Obtain Whois information

Since the more information a Web service has, the better its quality can be assessed, it is necessary to extract as much information about the Web service as possible. Therefore, in addition to the basic information, the Property Grabber component also obtains some additional information called Whois information, such as the type of the domain, the name of the person in charge, the postal code of the domain, city, phone, fax, detailed address, email, etc.


3232 Input of the Property Grabber Component

This component requires the following input data

• Service list page link
• Service page link

3233 Output of the Property Grabber Component

The component will produce the following output data

• Structured information of each service
• Endpoint information about each service, if it exists
• Monitoring information for the service and endpoint, if it exists
• Whois information of the service domain

All of this information is collected together as the properties of each service; thereafter the collected properties are sent to the Storage component.

3234 Demonstration for Property Grabber Component

The pictures from figure 3-13 to figure 3-16 depict the fundamental procedure of the Property Grabber component. To simplify the explanation, the example shown in this section uses the same Web service as before.

1) The inputs of the Property Grabber component are the links of the service list page and the service page received from the Web Service Extractor component. These links are "http://www.service-repository.com" and "http://www.service-repository.com/service/overview-210897616".
2) When the "GetProperty" function is triggered, it calls its four sub functions shown in figure 3-12.

Figure 3-13 Structure properties of the Service "BLZService" in service list page


Figure 3-14 Structure properties of the Service "BLZService" in service page

3) First, the "getStructuredProperty" function tries to extract the structured information highlighted in the red boxes of figure 3-13 and figure 3-14. However, several elements of the structured information have the same content, such as the description shown in both the service page and the service list page. Hence, in order to save extraction time and storage space, elements with the same content are only extracted once. Moreover, the rating information requires a transformation from non-descriptive to descriptive text, because its content is a set of star images. The final results of the extracted structured information for this Web service are shown in table 3-11. Because there is no descriptive information for the Provider, Homepage and Owner Homepage fields, their values are assigned "NULL".

Service Name BLZService

WSDL Link http://www.thomas-bayer.com/axis2/services/BLZService?wsdl

WSDL Version 0

Server Apache-Coyote/1.1

Description BLZService

Rating Four stars and A Half

Provider NULL

Homepage NULL

Owner Homepage NULL

Table 3-11 Extracted Structured Information of Web Service "BLZService"

4) Second, the "getEndpointProperty" function extracts the endpoint information displayed in the red box of figure 3-15. Although there is more than one endpoint record in the list, only one of them is extracted as the endpoint information. The reason is that this master program is intended to extract as much information as possible, but without keeping redundant information; therefore only one endpoint record is extracted even if several are listed. Table 3-12 shows the final results of the endpoint information of this Web service.

Figure 3-15 Endpoint information of the Web service "BLZService" in service page


Endpoint Name BLZServiceSOAP12port_http

Endpoint URL http://www.thomas-bayer.com:80/axis2/services/BLZService

Endpoint Critical True

Endpoint Type production

Bound Endpoint BLZServiceSOAP12Binding

Table 3-12 Extracted Endpoint Information of the Web service "BLZService"

5) Then the monitoring information is extracted by invoking the function "getMonitoringProperty". Figure 3-16 displays two types of monitoring properties: the red box above contains the monitoring information about the Web service, and the red box below lists the monitoring information for its endpoints. As already mentioned, only one endpoint statistic record is extracted. Besides, as can be seen from figure 3-16, there are two availability values; they both represent the availability of this Web service, just like the availability shown in figure 3-14, so only one of them is kept. Table 3-13 shows the final results of this extraction process.

Figure 3-16 Monitoring Information of the Service "BLZService" in service page

Service Availability 100

Number of Downs 0

Total Uptime 1 day 19 hours 19 minutes

Total Downtime 0 second

MTBF 1 day 19 hours 19 minutes

MTTR 0 second

RTT Max of Endpoint 141 ms

RTT Min of Endpoint 0 ms

RTT Average of Endpoint 57.7 ms

Ping Count of Endpoint 112

Table 3-13 Extracted Monitoring Information of the Web service "BLZService"

6) Next, in order to extract the Whois properties, the "getWhoisProperty" function first has to derive the service domain from the WSDL link. For this Web service the obtained service domain is "thomas-bayer.com". It then sends this service domain as input to the Whois client for the querying process, which returns a list of information for that service domain, see figure 3-17. Table 3-14 shows the extracted Whois information.


Figure 3-17 Whois Information of the service domain "thomas-bayer.com"

Service Domain URL thomas-bayer.com

Domain Name Thomas Bayer

Domain Type NULL

Domain Address Moltkestr. 40

Domain Description NULL

State NULL

Postal Code 54173

City Bonn

Country NULL

Country Code DE

Phone +4922855525760

Fax NULL

Email info@predic8.de

Organization predic8 GmbH

Established Time NULL

Table 3-14 Extracted Whois Information of service domain "thomas-bayer.com"

7) Finally, the information of these four aspects is collected together as the service properties, and these service properties are then forwarded to the Storage component.

324 The Function of Storage Component

The Storage component uses the WSDL link from the WSDL Grabber component to download the WSDL document from the Web and store it on disk. In addition, the service properties from the Property Grabber component are stored on disk in three different manners by this Storage component. Figure 3-18 illustrates the process flow of the Storage component.

When the Storage component receives the needed inputs, its mediator function "Storager" is triggered. It then transforms the service properties into three different output formats and stores them on disk: an XML file, an INI file and database records. Besides, it also tries to download the WSDL document from the URL address of the WSDL link and stores the obtained WSDL document on disk. This "Storager" function is composed of four sub functions, namely the "getWSDL", "generateXML", "generateDatabase" and "generateINI" sub functions; each sub function is in charge of one aspect of the storage tasks.


Figure3-18 Overview the process flow of the Storage Component

(1) "getWSDL" sub function

The task of the "getWSDL" sub function is to download the WSDL document and then store it on disk. First of all, it has to obtain the content of the WSDL document. This is done as follows. First, the "getWSDL" sub function checks whether the value of the received WSDL link equals "NULL". As already presented in section 322, if a Web service has no WSDL link, the value of its WSDL link is assigned "NULL". In such a case it creates a WSDL document whose name is the service name appended with the mark "No WSDL Document"; obviously this document has no content, it is an empty document. If the service does have a WSDL link, this sub function tries to connect to the Internet using the URL address of the WSDL link. Once it succeeds, the content hosted on the Web is downloaded, stored on disk and named only with the name of the service. Otherwise, it creates a WSDL document whose name is prefixed with "Bad" before the service name.
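As a rough illustration of this logic, the sketch below uses invented, simplified signatures (the actual implementation takes additional parameters for statistics and logging); it only shows how the three cases could be told apart.

import java.io.FileOutputStream;
import java.io.InputStream;
import java.net.URL;

// Sketch of the "getWSDL" logic with simplified, invented signatures.
public class GetWsdlSketch {

    static void getWSDL(String path, String serviceName, String wsdlLink) {
        try {
            if (wsdlLink == null || wsdlLink.equals("NULL")) {
                // case 1: service has no WSDL link -> empty marker document
                new FileOutputStream(path + serviceName + "[No WSDL Document].wsdl").close();
                return;
            }
            // case 2: try to download the document from the Web
            try (InputStream in = new URL(wsdlLink).openStream();
                 FileOutputStream out = new FileOutputStream(path + serviceName + ".wsdl")) {
                byte[] buffer = new byte[4096];
                int read;
                while ((read = in.read(buffer)) != -1) {
                    out.write(buffer, 0, read);
                }
            }
        } catch (Exception e) {
            // case 3: link exists but cannot be read -> "Bad" prefixed document
            try {
                new FileOutputStream(path + "Bad" + serviceName + ".wsdl").close();
            } catch (Exception ignored) { }
        }
    }
}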

(2) "generateXML" sub function

The "generateXML" sub function takes the service properties as input, transforms them into an XML file and stores it on disk under the name of the service plus the extension ".xml". XML stands for eXtensible Markup Language, a markup language designed to transport and store data. The first line of an XML file is the XML declaration, which defines the XML version and the encoding used. For example, <?xml version="1.0" encoding="UTF-8"?> means this is an XML file whose version is 1.0 and whose encoding is the 8-bit Unicode Transformation Format character set. Besides, an XML file contains XML elements, each spanning from the element's start tag to the element's end tag. Moreover, an XML element can contain other elements, simple text, or a mixture of both. However, an XML file must contain a root element that is the parent of all other elements.
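The following minimal Java sketch illustrates how such an XML file could be produced from a list of name-value pairs. The class name, the tag-naming rule and the helper structure are assumptions made for this example; only the root element "service" and the file naming scheme follow the description in this thesis.

import java.io.PrintWriter;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of turning service properties (name-value pairs) into an XML file.
public class GenerateXmlSketch {

    static void generateXML(String path, String serviceName, Map<String, String> properties) throws Exception {
        try (PrintWriter out = new PrintWriter(path + serviceName + ".xml", "UTF-8")) {
            out.println("<?xml version=\"1.0\" encoding=\"UTF-8\"?>");
            out.println("<!-- generated service properties -->");
            out.println("<service>");
            for (Map.Entry<String, String> p : properties.entrySet()) {
                String tag = p.getKey().replace(' ', '_');          // simplified tag naming
                String value = p.getValue() == null ? "NULL" : p.getValue();
                out.println("  <" + tag + ">" + value + "</" + tag + ">");
            }
            out.println("</service>");
        }
    }

    public static void main(String[] args) throws Exception {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("Service Name", "BLZService");
        props.put("Rating", "Four stars and A Half");
        generateXML("./", "1BLZService", props);
    }
}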

(3) "generateINI" sub function

The "generateINI" sub function also takes the service's properties as input, but it transforms them into an INI file and stores it on disk under the name of the service plus the extension ".ini". "ini" stands for initialization. The INI file format is a de facto standard for configuration files; such files are simple text files with a basic structure. Generally speaking, an INI file contains three different parts: sections, parameters and comments. The parameter is the basic element contained in an INI file. Its format is a key-value pair, also called a name-value pair; the pair is delimited by an equals sign "=", and the key or name always appears to the left of the equals sign. A section is like a container that groups its parameters together; it always appears on a single line within a pair of square brackets "[]", and sections may not be nested. A comment is descriptive text which begins with a semicolon ";"; anything between the semicolon and the end of the line is ignored.
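Analogously, a minimal sketch of the INI output could look as follows; again, the class name and the helper signature are invented for illustration.

import java.io.PrintWriter;
import java.util.Map;

// Sketch of writing service properties into an INI file with one section,
// following the comment/section/parameter structure described above.
public class GenerateIniSketch {

    static void generateINI(String path, String serviceName, Map<String, String> properties) throws Exception {
        try (PrintWriter out = new PrintWriter(path + serviceName + ".ini", "UTF-8")) {
            out.println("; service properties of " + serviceName);   // comment line
            out.println("[" + serviceName + "]");                    // section
            for (Map.Entry<String, String> p : properties.entrySet()) {
                out.println(p.getKey() + "=" + (p.getValue() == null ? "NULL" : p.getValue()));
            }
        }
    }
}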

(4) "generateDatabase" sub function

The inputs of the "generateDatabase" sub function are the same as for the previous two sub functions. Instead of transforming them into a file like XML or INI, the "generateDatabase" sub function turns them into database records by using SQL statements. SQL stands for Structured Query Language, a database language designed for accessing and manipulating data in a database. Most of the actions performed on a database are done with SQL statements, the primary ones being insert into, delete, update, select, create, alter and drop. Therefore, for the purpose of transforming these service properties into database records, this sub function first has to create a database using the "create database" statement. Then it creates a table to store the data. A table is a collection of related data entries and consists of columns and rows. Since the data for all five Web Service Registries is not very large, one database table is enough for storing these service properties. Because of that, the field names of the service properties in the columns must be uniform and well-defined for all five Web Service Registries. Afterwards, the service properties of each single service can be inserted into the table as one record with the "insert into" statement of SQL.
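The following JDBC sketch illustrates these two steps, creating a table with TEXT columns and inserting one record. The JDBC URL, the credentials and the reduced column list are placeholders and not part of the actual implementation; the "create database" step is assumed to have been done already.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.Statement;

// Sketch of the database part: one table whose columns are the unified
// property names, all typed as TEXT, and one record per Web service.
public class GenerateDatabaseSketch {

    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/services", "user", "password")) {

            try (Statement st = con.createStatement()) {
                st.executeUpdate("CREATE TABLE IF NOT EXISTS service_properties ("
                        + "id INT AUTO_INCREMENT PRIMARY KEY, "
                        + "service_name TEXT, wsdl_link TEXT, rating TEXT)");
            }

            try (PreparedStatement ps = con.prepareStatement(
                    "INSERT INTO service_properties (service_name, wsdl_link, rating) VALUES (?, ?, ?)")) {
                ps.setString(1, "BLZService");
                ps.setString(2, "http://www.thomas-bayer.com/axis2/services/BLZService?wsdl");
                ps.setString(3, "Four stars and A Half");
                ps.executeUpdate();
            }
        }
    }
}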

3241 Features of the Storage Component

The Storage component has to provide the following features

• Generate different output formats
The final result of this master program is to store the information of the services on disk for future work. This Storage component provides three different formats for storing the service properties of the services published in the Web Service Registries, which makes the storage of the services both flexible and long-lived.

• Obtain the WSDL document
An important task of this component is to check whether the WSDL document can be obtained from the WSDL link, because the WSDL document plays a decisive role in determining the quality of the service. This Storage component provides the ability to deal with the different situations that can occur while obtaining the WSDL document.


3242 Input of the Storage Component

This component requires the following input data

• WSDL link of each service
• Each service property information

3243 Output of the Storage Component

The component will produce the following output data

• WSDL document of the service
• XML document, INI file and tables in database

3244 Demonstration for Storage Component

The following figures describe the fundamental implementation code of this Storage component. The detailed depiction is as follows:
1) As can be seen from figure 3-19 to figure 3-21, there are several common elements among the implementation codes. The first common element is the pair of parameters defined in each of these sub functions, "path" and "SecurityInt". The parameter "path" is the absolute path on the computer disk; it is used for the storing procedure and has already been specified by the user at the beginning of the whole program. The parameter "SecurityInt" is an increasing Integer which is used as part of the name of the service; the reason for this is that it prevents services with the same name from overriding each other on disk. The content of the red mark among the codes of these figures is the second common element; its function is to create a file or document for the service with the corresponding parameters.

2) Figure 3-19 displays the implementation code of the "getWSDL" sub function, which is designed to get the WSDL document of the service based on its WSDL link. The parameter "name" is the name of the service and is used as the name of the WSDL document. The parameter "linkStr" is the actual WSDL link, which is the most important parameter for this sub function. The other two parameters, "statistic" and "log", are the objects of the text files called "Statistic Information" and "Log Information" respectively. The "Statistic Information" text file records the statistical data of the services in each Web Service Registry, such as the overall number of properties for that Web Service Registry, the overall number of services, the number of services that have no WSDL link, the number of services that contain no content, the number of services whose WSDL links are not available, etc. The "Log Information" text file records the results of the process steps and the problems encountered, for example which service is being crawled now, which Web Service Registry it belongs to, the reason why the WSDL document for a service could not be obtained, the reason why a service is unreachable, and so on.


Figure3-19 Implementation code for getting WSDL document

3) Figure 3-20 and figure 3-21 show the codes for turning the service properties into the XML file and the INI file and storing those two files on disk thereafter. The parameter "vec" is a Vector of the "PropertyStruct" data type; "PropertyStruct" is a class consisting of two variables, name and value.

Figure3-20 Implementation code for generating XML file


Figure3-21 Implementation code for generating INI file

4) The codes in figure 3-22 and figure 3-23 show the process that turns the service properties of all services in these five Web Service Registries into records of the database. Therefore a database has to be created first. The name of the database can be arbitrary as long as it conforms to the naming rules of the database; the same holds for the name of the table. Figure 3-22 displays the procedure of creating a table in the database. Because it is hard to decide on the length of each service property, the data types of all service properties are set to "Text". Figure 3-23 shows the code for inserting the service properties into the table with the "update" statement.

Figure3-22 Implementation code for creating table in database


Figure3-23 Implementation code for generating table records

33 Multithreaded Programming for DWSC

Multithreaded programming is a built-in capability of the Java language. A multithreaded program contains two or more separate parts that can execute concurrently; each such part is called a thread. The use of multithreading makes it possible to create programs that use the system resources efficiently, for example by keeping the idle time of the CPU to a minimum.

In this master program there are five Web Service Registries whose services need to be crawled. Moreover, the number of services published in each Web Service Registry is quite different, so the running time required for each Web Service Registry differs as well. Without multithreading, a Web Service Registry with few services would have to wait until another Web Service Registry with many more services has finished. Therefore, in order to reduce the waiting time for the other Web Service Registries and to maximize the use of the system resources, it is necessary to apply multithreaded programming to this master program. That is to say, this master program creates one thread for each Web Service Registry, and these threads are executed independently, as sketched below.
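A minimal sketch of this idea, with a placeholder body instead of the real crawling code, could look as follows:

// Sketch of starting one crawler thread per Web Service Registry.
// The registry names are taken from this thesis; the thread body is a placeholder.
public class MultithreadSketch {

    public static void main(String[] args) throws InterruptedException {
        String[] registries = {"Service Repository", "Ebi", "Xmethods", "Seekda", "Biocatalogue"};
        Thread[] threads = new Thread[registries.length];

        for (int i = 0; i < registries.length; i++) {
            final String registry = registries[i];
            threads[i] = new Thread(() -> {
                // crawl all services of this registry independently of the others
                System.out.println("crawling " + registry);
            });
            threads[i].start();
        }
        for (Thread t : threads) {
            t.join();   // wait until every registry has been processed
        }
    }
}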

34 Sleep Time Configuration for Web Service Registries

Since this master program downloads the WSDL documents and extracts the service information of the Web services published in a Web Service Registry, it inevitably affects the performance of that Web Service Registry. In addition, for the purpose of not exceeding their throughput capability, these Web Service Registries will restrict the rate of access. Because of that, unknown errors sometimes occur while this master program is executing: for instance, the program keeps halting at one point without getting any more WSDL documents and service information, the WSDL documents of some services in some Web Service Registries cannot be obtained, or some service information is missing. Therefore, in order to obtain the largest possible number of Web services published in these five Web Service Registries without affecting their throughput, the access rate has to be configured for each service of all Web Service Registries.

Consequently, before going into the essential procedure for each single service of these Web Service Registries, this master program calls the built-in function "Thread.sleep(long milliseconds)". It is a public static function which causes the currently executing thread to sleep for the specified number of milliseconds; in other words, it temporarily ceases execution for a while. The following table shows the time interval of the sleep function for each Web Service Registry; a small sketch of this throttling step follows the table.

Web Service Registry Name Time Interval (milliseconds)

Service Repository 8000

Ebi 3000

Xmethods 10000

Seekda 20000

Biocatalogue 10000

Table 3-15 Sleep Time of these five Web Service Registries
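A minimal sketch of this throttling step, with placeholder links and the Service Repository interval as an example, is shown below:

import java.util.Arrays;
import java.util.List;

// Sketch of the throttling step: before each service of a registry is
// processed, the crawler sleeps for the interval from Table 3-15.
public class SleepSketch {

    public static void main(String[] args) throws InterruptedException {
        long sleepMillis = 8000; // e.g. Service Repository
        List<String> servicePageLinks = Arrays.asList("link1", "link2"); // placeholders

        for (String link : servicePageLinks) {
            Thread.sleep(sleepMillis);          // cease execution to respect the registry
            System.out.println("processing " + link);
        }
    }
}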


4 Experimental Results and Analysis

This chapter presents the quantitative experimental results of the prototype described in chapter 3, together with an analysis and explanation of these results. In order to obtain reasonably accurate results, the experiments were carried out more than five times; all data displayed in the following tables and charts are the averages of these runs.

41 Statistic Information for Different Web Service Registries

This section discusses the amount of Web services published in these five Web Service Registries. It includes the overall number of Web services published in each Web Service Registry and the number of unavailable Web services, which have been archived because they may not be active anymore or are close to being inactive. Table 4-1 shows the service amount statistics of these five Web Service Registries.

Web Service Registry Name   Service Repository   Ebi   Xmethods   Seekda   Biocatalogue
Overall Services            57                   289   382        853      2567
Unavailable Services        0                    0     0          0        125
Table 4-1 Service amount statistic of these five Web Service Registries

In order to give an intuitive view of the service amount statistics in these five Web Service Registries, figure 4-1 shows a bar chart derived from table 4-1. As can be seen from the bar chart, on the one hand there is an ascending trend in the overall number of Web services from the Service Repository Web Service Registry to the Biocatalogue Web Service Registry. The Biocatalogue Web Service Registry in particular owns the largest number of Web services, which indicates that it has a much more powerful ability to provide Web services to users, because it contains far more services than the other four Web Service Registries. On the other hand, there are no unavailable services in any Web Service Registry except for Biocatalogue; that is to say, only the Biocatalogue Web Service Registry contains Web services that can no longer be used by the users. To some degree this is wasteful, because these services cannot be used anymore and they waste network resources on the Web. Therefore all these unavailable services should be eliminated in order to reduce the waste of network resources.


Figure4-1 Service amount statistic of these five Web Service Registries

42 Statistic Information for WSDL Document

Web Service Registry Name   Service Repository   Ebi   Xmethods   Seekda   Biocatalogue
Failed WSDL Links           1                    0     23         145      32
Without WSDL Links          0                    0     0          0        16
Empty Content               0                    0     2          0        2
Table 4-2 Statistic information for WSDL Document

Table 4-2 and figure 4-2 show the error statistics of the WSDL documents of the Web services in these five Web Service Registries. There are three aspects. The first one is the "Failed WSDL Links" of the Web services in these Web Service Registries: the overall number of Web services whose WSDL links are invalid. In other words, it is impossible to get the WSDL documents of these Web services via the URL addresses of their WSDL links, and therefore no WSDL document is created for them. The second aspect is the "Without WSDL Links" of the Web services: the overall number of Web services in each Web Service Registry that have no WSDL link at all, so that no real WSDL document exists for them. The value of the WSDL link for such a Web service is "NULL"; a WSDL document is still created, but it has no content and its name contains the string "[No WSDL Document]". The third aspect is the "Empty Content", which represents the overall number of Web services that have WSDL links whose URL addresses


are valid, but whose WSDL documents contain no content. In this case a WSDL document whose name contains the string "(BAD)" is created.

Figure4-2 Statistic information for WSDL Document

43 Comparison of Different Average Number of Service Properties

This section compares the average number of service properties in these five Web Service Registries. This average number of service properties is calculated by means of the following equation:

ASP = ONSP / ONS    (1)

Where
ASP is the average number of service properties for one Web Service Registry
ONSP is the overall number of service properties in one Web Service Registry
ONS is the overall number of Web services that have already been crawled from the corresponding Web Service Registry

Figure 4-3 shows the average number of service properties per Web service in these five Web Service Registries. As mentioned before, one of the measurements for assessing the quality of the Web services in a Web Service Registry is the amount of service information: the more information a Web service has, the better one knows that service, and consequently the corresponding Web Service Registry can offer Web services of better quality to its users. As seen in figure 4-3, the Service Repository and Biocatalogue Web Service Registries own a larger number of service properties than the other three Web Service Registries. This directly reflects that these two Web Service Registries provide more detailed information about the Web services published in them, so users can more easily choose the services they need and are more likely to use the Web services published in these two registries. By contrast, the Xmethods and Seekda

services that published in these two Web Service Registries By contrast the Xmethods and Seekda


Web Service Registries, which have less service information about their Web services, offer lower quality for these Web services; therefore users may be less inclined to use the Web services provided in these two Web Service Registries, not to mention the Web services published in the Ebi Web Service Registry.

Figure4-3 Average Number of Service Properties

From the description presented in section 323, the causes of the different numbers of service properties in these Web Service Registries may consist of the following points. First, the number of structured information elements for the Web services differs between the five Web Service Registries, and part of the information for some Web services in a registry may even be missing or have an empty value; for example, the number of structured information elements that can be extracted for the Web services in the Ebi Web Service Registry differs greatly from that in the Biocatalogue Web Service Registry. Second, there is a certain amount of endpoint information for almost all Web Service Registries except the Ebi Web Service Registry, which more or less reduces the overall number of service properties. Third, some Web Service Registries, such as Xmethods and Ebi, have no monitoring information, whereas the Service Repository Web Service Registry in particular offers a large amount of monitoring information about its Web services that can be extracted from the Web. Finally, there is the amount of whois information for these Web services: if the database of the whois client does not contain information about the service domain of a Web service, then no whois information can be extracted, and even if information about the service domain exists, its amount can be very diverse. Therefore, if many service domains of the Web services in a registry have no or only little whois information, the average number of service properties in that registry decreases greatly.

As a result, in order to help users get better acquainted with the Web services provided in a Web Service Registry and to distinguish these Web services, the Web Service Registry should do its best to offer more and more information for each of its published Web services.

[Figure 4-3 above plots the average number of service properties per registry: Service Repository 23, Ebi 7, Xmethods 17, Seekda 17, Biocatalogue 32.]

44 Different Outputs of Web Services

The fundamental task of this master program is to obtain the WSDL documents of the Web services hosted in the Web Service Registries, to extract and gather the properties of these Web services, and thereafter to store them on disk. Therefore this section describes the different outputs of this master program, which include the WSDL documents of the Web services, the generated XML and INI files, and the data records of these service properties in the database.

Figure4-4 WSDL Document format of one Web service

The WSDL document of a Web service is simply read from the Web according to the URL address of its WSDL link, and these data are then stored on disk as the WSDL document. The name of the WSDL document is the service name plus the ending ".wsdl", which marks it as a WSDL document. In order to distinguish WSDL documents whose names would otherwise be the same while their contents differ, the name of each obtained WSDL document in one Web Service Registry contains a unique Integer in front of the service name. Figure 4-4 shows the valid WSDL document format of the Web service; its name is "1BLZService.wsdl".

Besides, the obtained service properties are transformed into an XML file, an INI file and data records in the database. Figure 4-5, figure 4-6 and figure 4-7 show these three different output formats respectively. As can be seen from figure 4-5, this is the INI file of the Web service, whose name is "1BLZService.ini". The Integer is the same as in the WSDL document, because both belong to the same Web service. The first three lines of that INI file are service comments, which run from the semicolon to the end of the line; they are the basic descriptive information for this INI file. The following line is the section, enclosed in a pair of brackets; it is important because it indicates that the lines behind it are the information of this Web service. The rest of the lines are the actual service information, stored as key-value pairs with an equals sign between key and value; each service property starts at the beginning of a line.


Figure4-5 INI File format of one Web service

Figure4-6 XML File format of one Web service

Figure4-7 Database data format for all Web services


Furthermore, figure 4-6 shows the generated XML file format of the Web service "BLZService"; its name is "1BLZService.xml". Needless to say, this XML file is part of the materials of the same Web service. Although the format of the XML file differs from that of the INI file, the essential contents of both are the same; that is to say, the values of the service properties are no different, because both files are generated from the same collection of properties of the same Web service. The XML file also has some comments, like those in the INI file, which are displayed between "<!--" and "-->", and the section in the INI file corresponds roughly to the root element in the XML file. Therefore, all values of the elements under the root element "service" in this XML file are the values of the service properties of this Web service.

Finally, as can be seen from figure 4-7, there is a database table which is used to store the service information of all Web services in these five Web Service Registries. The entire service information of one Web service forms exactly one record in this table. Because of that, the column names of the table are the union of the names of the service information in each Web Service Registry; since column names must be unique, redundant names in this union are eliminated. This is possible because the names of the service information are well-defined and uniform for all five Web Service Registries. In addition, the first column of this table is the primary key, an increasing Integer whose function corresponds to the Integer contained in the names of the XML and INI files. The remaining columns of the table are the corresponding service properties of the Web services. The value "NULL" or "Null" in this table means that this property of that Web service is empty or missing.

45 Comparison of Average Time Cost for Different Parts of Single Web Service

This section compares the average time cost of the different parts of getting one single Web service in all five Web Service Registries. First, the average time cost of getting one single service in a Web Service Registry has to be calculated, which is obtained through the following equation:

ATC = OTS / ONS    (2)

Where
ATC is the average time cost for one single Web service
OTS is the overall time cost of all the Web services in one Web Service Registry
ONS is the overall number of Web services that have already been crawled from the corresponding Web Service Registry
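For instance, rearranging equation (2) as OTS = ATC x ONS and taking the values reported below, the Service Repository registry (about 10042 milliseconds per service and 57 services) implies an overall time cost of roughly 9.5 minutes, whereas Biocatalogue (about 42000 milliseconds per service and 2567 services) implies roughly 30 hours.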

In addition, the average time cost for getting one single service consists of the following six parts: the average time cost for extracting the service properties, the average time cost for obtaining the WSDL document, the average time cost for generating the XML file, the average time cost for generating the INI file, the average time cost for inserting the service properties into the database table, and the average time cost for other procedures such as getting the service list page link, getting the service page link, and so on. The average time cost for extracting the service properties is obtained by means of the following equation:

ATCSI = OTSSI / ONS    (3)

Where
ATCSI is the average time cost for extracting the service properties of one single Web service
OTSSI is the overall time cost for extracting the service properties of all the Web services in one Web Service Registry
ONS is the overall number of Web services that have already been crawled from the corresponding Web Service Registry

The calculation of the other parts is similar to the equation for the average time cost for extracting the service properties, while the average time cost for the other procedures equals the average time cost for one single Web service minus the sum of the average time costs of the other five parts.
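As a consistency check of this last relation, consider the Biocatalogue row of table 4-3 below: 42000 - (39533 + 762 + 2 + 1 + 66) = 1636 milliseconds, which is exactly the value listed in the "Others" column.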

Web Service Registry   Service Property   WSDL Document   XML File   INI File   Database   Others   Overall
Service Repository     8801               918             2          1          53         267      10042
Ebi                    699                82              2          1          28         11       823
Xmethods               5801               1168            2          1          45         12       7029
Seekda                 5186               1013            2          1          41         23       6266
Biocatalogue           39533              762             2          1          66         1636     42000
(all values in milliseconds)
Table 4-3 Average time cost information for all Web Service Registries

Table 4-3 displays the average time cost of one Web service and its different parts in all five Web Service Registries. The first column of table 4-3 is the name of the Web Service Registry, and the last column is the average time cost for a single service in that Web Service Registry; the remaining columns are the average time costs of the six different parts. In order to give an intuitive view of the data in table 4-3, the data in each column are illustrated in the corresponding figures 4-8 to 4-13.

Figure4-8 Average time cost for extracting service property in all Web Service Registries


As can be seen from figure 4-8, the average time cost for extracting the service properties in the Biocatalogue Web Service Registry is 39533 milliseconds. This value is much larger than in the other four Web Service Registries, which take 8801, 699, 5801 and 5186 milliseconds for Service Repository, Ebi, Xmethods and Seekda respectively. That is to say, it takes much longer to extract the service properties of the Web services published by the Biocatalogue Web Service Registry. This also indirectly indicates that the Biocatalogue Web Service Registry has the highest average number of service properties, as already discussed in section 43, whereas the average number of service properties in the Ebi Web Service Registry is the smallest. Moreover, the average time cost in the Xmethods Web Service Registry is larger than in the Seekda Web Service Registry, even though the average number of service properties is the same for these two registries. One explanation is that the process for extracting the service properties in the Xmethods Web Service Registry has to use both the service page and the service list page, while only the service page link is needed for the Seekda Web Service Registry.

The average time cost for obtaining the WSDL document in all five Web Service Registries is displayed in figure 4-9. This average time cost is actually the sum of the average time for extracting the WSDL link of a Web service and the average time for reading the data of the WSDL document from the Web and storing it on disk. As seen from this figure, the average time cost for obtaining the WSDL document in the Xmethods Web Service Registry is the largest, namely 1168 milliseconds. Although the process for extracting the WSDL link costs a certain amount of time, it does not have a significant influence on the total average time spent for obtaining the WSDL document, because the WSDL link of a Web service is almost always obtained in one step. This therefore implies that the average size of the WSDL documents for the Xmethods Web Service Registry is larger than for the other four Web Service Registries, in particular the Ebi Web Service Registry, which takes just 82 milliseconds for obtaining the WSDL document.

Figure4-9 Average time cost for obtaining WSDL document in all Web Service Registries


Figure 4-10, figure 4-11 and figure 4-12 show the average time cost of generating these three different outputs in all five Web Service Registries. As can be seen from figures 4-10 and 4-11, the average time for generating the XML file for one Web service is the same for all five Web Service Registries, namely only 2 milliseconds, and the average time for generating the INI file is likewise the same, with a value of just 1 millisecond. Even when these two average time costs are summed up, the result is still so small that it can be omitted when compared to the overall average time cost of getting one Web service for each Web Service Registry shown in figure 4-13. This implies that the generation of the XML and INI files finishes almost immediately after receiving the service properties of a Web service as input. Furthermore, figure 4-12 shows that although the average time cost of creating the database record for each Web service is larger than the time for generating the XML and INI files in all five Web Service Registries, the operation of creating the database record is still fast.

Figure4-10 Average time cost for generating XML file in all Web Service Registries

Figure4-11 Average time cost for generating INI file in all Web Service Registries


Figure4-12 Average time cost for creating database record in all Web Service Registries

Figure4-13 Average time cost for getting one Web service in all Web Service Registries

Figure 4-13 gives the average time cost for getting one single Web service in all five Web Service Registries. Without any doubt the Biocatalogue Web Service Registry takes the longest time for this process, because the presentation of the different parts described above shows that in the Biocatalogue Web Service Registry each part needs more time to finish, except for the process of obtaining the WSDL document, where Biocatalogue does not have the highest cost. Moreover, when looking at figures 4-8, 4-12 and 4-13 it is remarkable that the shapes of these curves follow almost the same trend. This further indicates that a Web Service Registry that spends more time to get the description information of one Web service also offers more information about that Web service.


5 Conclusion and Further Direction

This master thesis provides a schema which aims to explore the description information of the Web services hosted in different Web Service Registries. The description information of a Web service consists of the WSDL document and the service information about that Web service. Although there is an existing approach, the Pica-Pica Web Service Description Crawler, that can be used to obtain the WSDL document and service information of the Web services in these Web Service Registries, its functionality is restricted: it explores only a small subset of the Web services hosted in these Web Service Registries, the storage of these Web services is not flexible and, most importantly, only a few pieces of service information are extracted for each Web service, and for some Web Service Registries no service information is extracted at all. The work presented in this master thesis, however, is able to explore all Web services in these Web Service Registries. Moreover, with the approach presented in this master thesis the service information of a Web service is extracted as completely as possible, so that the final result is the largest annotated service catalogue ever produced. Furthermore, regarding the storage of the description information of the Web services, this master thesis provides three different ways that guarantee not only the completeness but also the longevity of the description information of the Web services.

However, in the implementation performed in this master thesis, the whois client used for querying the information of a service domain returns free text if the information exists, and sometimes this free text differs completely from case to case. As a consequence, every Web service in all Web Service Registries had to be crawled at least once in the experimental stage, so that all variants of this free text could be foreseen and processed afterwards. Nevertheless, this is a huge amount of work because there are lots of Web services in these Web Service Registries. Therefore, in order to simplify the work, another whois client that eases this task needs to be found and used.

Moreover, in the experimental stage of this master thesis the time cost for getting a Web service is still large. For the purpose of reducing this time, multithreaded programming could also be applied to some parts of the process of getting one single Web service.

Although the work performed here is specialized for only these five Web Service Registries, the main parts of the principles used here are adaptable to other Web Service Registries with only small changes in the implementation code or structure.


6 Bibliography

[1] Nathalie Steinmetz Holger Lausen Dario Cerizza Andrea Turati Irene Celino Adam Funk and

Michael Erdmann ldquoD11 ndash Requirement Analysis and Architectural Planrdquo Available from

httpwwwservice-findereudeliverablespublic7-public32-d11 Emanuele Della Valle (CEFRIEL)

June 27 2008

[2] Nathalie Steinmetz Holger Lausen Dario Cerizza Andrea Turati Irene Celino Adam Funk and

Michael Erdmann ldquoD12 - First Design of Service-Finder as a Wholerdquo Available from

httpwwwservice-findereudeliverablespublic7-public37-d12-first-design-of-service-finder-as-a-

whole Emanuele Della Valle (CEFRIEL) July 1 2008

[3] Nathalie Steinmetz Holger Lausen Irene Celino Dario Cerizza Saartje Brockmans Adam Funk

ldquoD13 ndash Revised Requirement Analysis and Architectural Planrdquo Available from

httpwwwservice-findereudeliverablespublic7-public57-d13-revised-requirement-analysis-and-

architectural-plan Emanuele Della Valle (CEFRIEL) April 1 2009

[4] Chia-Hui Chang Mohammed Kayed Moheb Ramzy Girgis Khaled Shaalan ldquoA Survey of Web

Information Extraction Systemsrdquo Volume 18 Issue 10 IEEE Computer Society pp1411-1428 October

2006

[5] Leonard Richardson ldquoBeautiful Soup Documentationrdquo Available from

httpwwwcrummycomsoftwareBeautifulSoupdocumentationhtml October 13 2008

[6] Hao He Hugo Haas David Orchard ldquoWeb Services Architecture Usage Scenariosrdquo Available from

httpwwww3orgTRws-arch-scenarios February 11 2004

[7] Stephen Soderland ldquoLearning Information Extraction Rules for Semi-Structured and Free Textrdquo

Volume 34 Issue 1-3 Journal of Machine Learning Department Computer Science and Engineering

University of Washington Seattle pp233-272 February 1999

[8] Ian Hickson ldquoA Vocabulary and Associated APIs for HTML and XHTMLrdquo World Wide Web

Consortium Working Draft WD-html5-20100624 January 22 2008

[9] Holger Lausen Jos de Bruijn Axel Polleres Dieter Fensel ldquoThe Web Service Modeling Language

WSMLrdquo WSML Deliverable D161v02 March 20 2005 Available from

httpwwwwsmoorgTRd16d161v02

[10] Dumitru Roman Holger Lausen Uwe Keller ldquoWeb Service Modeling Ontology - Standard (WSMO

-Standard)rdquo WSMO deliverable D2 version 11 06 March 2004 Available from

httpwwwwsmoorgTRd2v11

[11] Iris Braum Anja Strunk Gergana Stoyanova Bastian Buder ldquoConQo ndash A Context- And QoS-Aware

Service Discoveryrdquo TU Dresden Department of Computer Science in Proceedings of WWWInternet

2008


7 Appendixes

There are two additional outputs of this master program: the log information file and the statistic report file. Figure 8-1 shows one of the basic output formats of the log information for these five Web Service Registries.

Figure 8-1 Log information of the "Service Repository" Web Service Registry

Figure 8-2 Statistic information of the "Service Repository" Web Service Registry


Figures 8-2 to 8-6 show the output format of the statistic information for the "Service Repository", "Ebi", "Xmethods", "Seekda" and "Biocatalogue" Web Service Registries respectively.

Figure 8-3 Statistic information of the "Ebi" Web Service Registry

Figure 8-4 Statistic information of the "Xmethods" Web Service Registry


Figure 8-5 Statistic information of the "Seekda" Web Service Registry

Figure 8-6 Statistic information of the "Biocatalogue" Web Service Registry


Table of Figures

Figure 2-1 Dataflow of Service-Finder and Its Components ........ 12
Figure 2-2 Left is the free text input type and right is its output ........ 16
Figure 2-3 A Semi-structured page containing data records (in rectangular box) to be extracted ........ 16
Figure 2-4 Architecture of the Pica-Pica Web Service Description Crawler ........ 20
Figure 3-1 Overview of the Basic Architecture for the Deep Web Services crawler ........ 25
Figure 3-2 Overview the process flow of the Web Service Extractor Component ........ 27
Figure 3-3 Service list page of the Service-Repository ........ 29
Figure 3-4 Original source code of the internal link for Web service "BLZService" ........ 29
Figure 3-5 Code Overview of getting service page link in Service Repository ........ 29
Figure 3-6 Service page of the Web service "BLZService" ........ 29
Figure 3-7 Overview the process flow of the WSDL Grabber Component ........ 30
Figure 3-8 WSDL link of the Web service "BLZService" in the service page ........ 31
Figure 3-9 Original source code of the WSDL link for Web service "BLZService" ........ 32
Figure 3-10 Code overview of "getServiceRepositoryWSDLLink" function ........ 32
Figure 3-11 Code overview of "oneParameter" function ........ 32
Figure 3-12 Overview the process flow of the Property Grabber Component ........ 33
Figure 3-13 Structure properties of the Service "BLZService" in service list page ........ 37
Figure 3-14 Structure properties of the Service "BLZService" in service page ........ 38
Figure 3-15 Endpoint information of the Web service "BLZService" in service page ........ 38
Figure 3-16 Monitoring Information of the Service "BLZService" in service page ........ 39
Figure 3-17 Whois Information of the service domain "thomas-bayer.com" ........ 40
Figure 3-18 Overview the process flow of the Storage Component ........ 41
Figure 3-19 Implementation code for getting WSDL document ........ 44
Figure 3-20 Implementation code for generating XML file ........ 44
Figure 3-21 Implementation code for generating INI file ........ 45
Figure 3-22 Implementation code for creating table in database ........ 45
Figure 3-23 Implementation code for generating table records ........ 46
Figure 4-1 Service amount statistic of these five Web Service Registries ........ 49
Figure 4-2 Statistic information for WSDL Document ........ 50
Figure 4-3 Average Number of Service Properties ........ 51
Figure 4-4 WSDL Document format of one Web service ........ 52
Figure 4-5 INI File format of one Web service ........ 53
Figure 4-6 XML File format of one Web service ........ 53
Figure 4-7 Database data format for all Web services ........ 53
Figure 4-8 Average time cost for extracting service property in all Web Service Registries ........ 55
Figure 4-9 Average time cost for obtaining WSDL document in all Web Service Registries ........ 56
Figure 4-10 Average time cost for generating XML file in all Web Service Registries ........ 57
Figure 4-11 Average time cost for generating INI file in all Web Service Registries ........ 57
Figure 4-12 Average time cost for creating database record in all Web Service Registries ........ 58
Figure 4-13 Average time cost for getting one Web service in all Web Service Registries ........ 58

Table of Tables

Table 3-1 Structured Information of Service-Repository Web Service Registry ... 34
Table 3-2 Structured Information of Xmethods Web Service Registry ... 34
Table 3-3 Structured Information of Seekda Web Service Registry ... 34
Table 3-4 Structured Information of Ebi Web Service Registry ... 34
Table 3-5 Structured Information of Biocatalogue Web Service Registry ... 34
Table 3-6 SOAP Operation Information of Biocatalogue Web Service Registry ... 35
Table 3-7 REST Operation Information of Biocatalogue Web Service Registry ... 35
Table 3-8 Endpoint Information of these five Web Service Registries ... 35
Table 3-9 Monitoring Information of these five Web Service Registries ... 35
Table 3-10 Whois Information for these five Web Service Registries ... 36
Table 3-11 Extracted Structured Information of Web Service "BLZService" ... 38
Table 3-12 Extracted Endpoint Information of the Web service "BLZService" ... 39
Table 3-13 Extracted Monitoring Information of the Web service "BLZService" ... 39
Table 3-14 Extracted Whois Information of service domain "thomas-bayer.com" ... 40
Table 3-15 Sleep Time of these five Web Service Registries ... 47
Table 4-1 Service amount statistic of these five Web Service Registries ... 48
Table 4-2 Statistic information for WSDL Document ... 49
Table 4-3 Average time cost information for all Web Service Registries ... 55


Table of Abbreviations

DTD Document Type Definition

DWSC Deep Web Service Crawler

HTML HyperText Markup Language

MTBF Mean Time between Failures

MTTR Mean Time to Recovery

QoS Quality of Service

REST Representational State Transfer

RTT Round Trip Time

SOAP Simple Object Access Protocol

SQL Structured Query Language

URL Uniform Resource Locator

WHATWG Web Hypertext Application Technology Working Group

XML eXtensible Markup Language

WSDL Web Service Description Language

WSML Web Service Modeling Language

WSMO Web Service Modeling Ontology


1 Introduction

This introductory chapter of the master thesis first concisely explains the background of the current situation and then gives a basic introduction to the proposed approach, which is called the Deep Web Service Extraction Crawler.

11 BackgroundMotivation

In the late 1990s, the Web Service Registry became a hot commodity. A Web Service Registry is essentially a links page: its function is to uniformly present information that comes from various sources. Hence, it provides a convenient channel for users to offer, search for and use Web Services. The related metadata of the Web Services, submitted by both the system and its users, are commonly hosted along with the service descriptions.

Nevertheless, when users enter one of these Web Service Registries to look for Web Services, they may run into situations that cause considerable trouble. One such situation is that a registry returns several similar published Web Services for a single search, for example two or more Web Services with the same name but different versions, or two or more Web Services that are derived from the same server but have different contents. Furthermore, most users are also interested in a global view of the published services; for instance, they want to know which Web Service Registry provides the better quality for a Web Service. Therefore, in order to help users differentiate those similar published Web Services and obtain a global view of them, this information should be monitored and rated.

Moreover, there are a great many Web Service Registries on the Internet, and each of them can provide a great number of Web Services. Obviously, there may be similar Web Services among these registries, or a Web Service in one registry may be related to another Web Service in a different registry. Hence, these Web Services should be comparable across different Web Service Registries; however, at present there is little support for this. In addition, not all of the metadata are structured, especially the descriptions of the non-functional properties. Therefore, the task at hand is to turn those non-functional property descriptions into a structured format. In other words, as much information as possible about the Web Services offered in the Web Service Registries needs to be extracted. Eventually, after extracting all this information from the Web Service Registries, it is necessary to store it on disk. This procedure should be efficient, flexible and complete.

12 Initial Designing of the Deep Web Service

Crawler Approach

The problems have already been stated in the previous section; hence, the following work is to solve these problems. This section presents the basic principle of the Deep Web Service Crawler approach.

First, a brief overview of how the Deep Web Service Crawler approach addresses these problems is given. As already mentioned, each Web Service Registry offers Web Services. Moreover, each Web Service Registry has its own HTML page structures; these structures may be identical or completely different. Therefore, the first step is to identify which Web Service Registry is going to be explored. Since each Web Service Registry owns a unique URL, this can be done by directly analyzing the corresponding URL address of that Web Service Registry. After identifying the Web Service Registry to be explored, the next step is to obtain all the Web Services published in that registry. Then, with all these obtained Web Services, the information about the services is extracted, analyzed and gathered. This information can be in a structured or even in an unstructured format. In this master thesis, Deep Web analysis techniques are applied to obtain this information, so that the information about each Web Service becomes as richly annotated as possible. Last but not least, all the information about the Web Services needs to be stored.

13 Goals of this Master Thesis

The goals of this master thesis are the following:

- Produce the largest annotated Service Catalogue
A Service Catalogue is a list of service properties. The more properties a service has, the larger the Service Catalogue it owns. Therefore, this master program should extract as many service properties as possible.

- Flexible storage of the metadata of each service as annotations or dedicated documents
The metadata of a service include not only the WSDL document but also the service properties. All these metadata are important information about the service. Therefore, this master program should provide flexible ways to store these metadata on disk.

- Improve the comparability of the Web Services across different Web Service Registries
The names of the service properties in one Web Service Registry may differ from those in another Web Service Registry. Hence, for the purpose of improving comparability, the names of the service properties should be unified and well-defined.

14 Outline of this Master Thesis

In this chapter, the motivation, objective and initial approach plan have been discussed. The remainder of this thesis is structured as follows.

Firstly, chapter 2 presents work that is based on existing techniques. Section 21 gives a detailed introduction to the Service-Finder project. Then, section 22 presents the Information Extraction technique. After that, section 23 introduces and discusses the already implemented Pica-Pica Web Service Description Crawler.


Chapter 3 then explains the design details of the Deep Web Service Crawler approach. Section 31 gives a short description of the different requirements of this approach. Next, section 32 presents the actual design of the Deep Web Service Crawler. Then, sections 33 and 34 introduce the multithreaded programming and the sleep time configuration used in this master program, respectively.

Chapter 4 presents the experiments with this Deep Web Service Crawler approach and gives an evaluation of them.

Finally, chapter 5 presents the conclusion and a discussion of the work already done, as well as the future work for this master task.


2 State of the Art

This chapter presents existing techniques and strategies that are related to the Deep Web Service Extraction Crawler approach. Section 21 discusses an existing service catalogue, the Service-Finder project. Section 22 then presents some details about the Information Extraction technique. Finally, section 23 explains an existing implemented crawler, the Pica-Pica Web Service Description Crawler.

21 Service Finder Project

The Service-Finder project aims at developing a platform for Web Service discovery, especially for Web Services that are embedded in a Web 2.0 environment [1]. Hence, it can provide efficient access to publicly available services. The goals of the Service-Finder project are as follows [1]:

- Automatically gather Web Services and their related information
- Semi-automatically create semantic service descriptions based on the information that is available on the Web
- Create and improve semantic annotations via user feedback
- Describe the aggregated information in semantic models and allow reasoning and querying

However, before describing the basic functionality of the Service-Finder project, one of its use cases and its requirements are presented first.

211 Use Cases for Service-Finder Project

The Service-Finder project employed the use case methodology of the W3C use case description [6] for its needs and then applied this methodology to the use cases it enumerated.

2111 Use Case Methodology

Three aspects need to be considered for the use case definitions of the Service-Finder project [1]:

(1) Description, which describes the information of the use case.
(2) Actors, Roles and Goals, which identify the actors, the roles they act in and the goals they need to achieve in the scenario.
(3) Storyboard, which describes the series of interactions among the actors and the Service-Finder Portal.

2112 System Administrator

This section presents the use case that was applied to the Service-Finder portal and that illustrates the requirements on its functionality from a user point of view. All the information in this use case is derived from [1]. In this use case there is a system administrator whose name is "Sam Adams". He works for a bank. His job is to keep the online payment facilities online and working all day and night. Therefore, if there are any system failures, Sam Adams should fix the problems as early as he can. That is why he wants to use an SMS messaging service which will alert him immediately by sending him an SMS message in the case of a system failure.

- Description
This use case describes the system administrator "Sam Adams", who is looking for an SMS messaging service that he wants to build into his application.

- Actors, Roles and Goals
The name of the actor is "Sam Adams". His role is that of a system administrator at a bank. His goals are immediate service delivery, reliability of the service, and a low base fee and transaction fee.

- Storyboard

Step 1: Sam Adams knows the Service-Finder portal and knows that he can find many useful services there, especially since he knows what he is looking for. Hence, he visits the Service-Finder portal and starts to search by entering the keywords "SMS Alert".
Requirement 1: Search functionality.

Step 2: The Service-Finder portal returns a list of matching services. However, Sam wants to choose the number of matching services that are displayed on one page. He would also expect short information about the service functionality, the service provider and the service availability, so that he can decide which service to read about further.
Requirement 2: Enable configurable pagination of the matching results and provide short information for each service.

Step 3: When Sam looks through the short information about the services displayed on the first page, he expects to find the services most relevant to his request. After that, he would like to read more detailed information about a service to see whether it provides the needed functionality.
Requirement 3: Rank the returned matching services and provide the ability to read more details of a service.

Step 4: It may happen that the returned matching services provide quite different functionalities or belong to different service categories, for example SMS messaging services that alert users not through SMS but through voice messaging. For this reason, Sam would like to see other categories that may contain the services he wants, or services of other categories he is also interested in (like "SMS Messaging"). Another possible way is that Sam further filters his search by browsing through categories.
Requirement 4: Provide service categories and allow the user to look at all services that belong to a specific category. If possible, it should also allow the user to browse through categories.

Step 5: When Sam has got all the services that could provide an SMS messaging service via the methods described in step 4, he now wants to look for services that are offered by an Austrian provider and, if possible, have no base fees.
Requirement 5: Faceted search.

Step 6: After Sam has got all these specific services, he would like to choose the services that provide a high reliability.
Requirement 6: Sort functionality based on the user's choices.

Step 7: Sam now expects to compare the service availability promised by the service provider with the availability actually provided. This should be contained in the services' details. There also needs to be service coverage information, so that Sam can know whether a service covers the areas where he lives and works. Moreover, Sam would also like to compare these services in other ways, for instance by putting some services into a structured table to compare the transaction fees.
Requirement 7: A side-by-side comparison table for services, and a functionality that enables users to select the services they want to compare.

Step 8: At last, Sam wants to know whether the service providers offer a free trial of the services, so that he can test the service functionality.
Requirement 8: If possible, display a note for services offering free trials.

212 Architecture Plan for the Service-Finder Project

The architecture plan of the Service-Finder project contains five basic components: Service Crawler, Automatic Annotator, Conceptual Indexer and Matcher, Cluster Engine, and Service-Finder Portal Interface. Figure 2-1 presents a high-level overview of the components and the data flow among them.

Figure 2-1 Dataflow of Service-Finder and its components [3]


2121 The Principle of the Service Crawler Component

The Service Crawler is responsible for gathering available services and their related information from the Web. The overall cycle is as follows:

(1) A Web developer publishes a Web Service.
(2) The Crawling component then begins to harvest the Web in order to identify the Web Services, i.e. WSDL (Web Service Description Language) documents.
(3) As soon as a service is discovered, the Crawler also searches for other related information.
(4) After each periodic interval, the Crawler produces a consistent snapshot of the relevant part of the Web.

Finally, the output of the Crawler is forwarded to the subsequent components for analyzing, indexing and displaying.

2122 The Principle of the Automatic Annotator Component

The Automatic Annotator receives the relevant data from the previous component and generates semantic service descriptions for the WSDL documents and their related information, based on the Service-Finder Ontology and the Service Category Ontology.

First, the two compatible ontologies that are used throughout the whole process are briefly introduced [2]:

- Generic Service Ontology: an ontology used to describe the data objects, for example the services, the service providers, availability, payment modalities and so on.
- Service Category Ontology: an ontology used to categorize the functionalities or application areas of the services, for instance data verification, messaging, data storage, weather, etc.

Afterwards, the function of this component together with its input and output is as follows:

- Input
  - Crawled data from the Service Crawler
  - Service-Finder ontologies
  - Feedback on or corrections of previous annotations
- Function
  - Enrich the information about the service and extract semantic statements according to the Service-Finder ontologies, for example categorize the service according to the Service Category Ontology
  - Determine whether a particular document is relevant or not through the Web link graph; if not, discard these irrelevant documents
  - Classify the pages into their genres, for instance pricing, user comments, FAQ and so on
- Output
  - Semantic annotations of the services


2123 The Principle of the Conceptual Indexer and Matcher

Component

The Conceptual Indexer and Matcher is essentially a data store center that stores all extracted information about the services and supplies users with retrieval and semantic query capabilities, for example the matchmaking between user requests and service offers and the retrieval of user feedback on extracted annotations.

Its input, function and output are as follows:

- Input
  - Semantic annotation data and full-text information obtained from the Automatic Annotator
  - Semantic annotation data and full-text information that come from the user interfaces
  - Cluster data from the user and service clustering component
- Function
  - Store the semantic annotations received from the Automatic Annotator component and from the user interface
  - Store the cluster data procured through the clustering component
  - Store and index the textual descriptions offered by the Automatic Annotator component and the textual comments offered by users
  - Ontologically query the semantic data from the data store center
  - Combined keyword and ontological querying used for user queries
  - Provide a list of similar services for a given service
- Output
  - A list of matching services that are queried by users; in particular, these services should be sorted by ranking and can also be iterated over
  - All available data related to a particular entity must be retrievable at the user interface

2124 The Principle of the Service-Finder Portal Interface

Component

The Service-Finder Portal Interface is the main entry point provided for users of the Service-Finder system to search and browse the data that is managed by the Conceptual Indexer and Matcher component. In addition, the users can also contribute information by providing tags, comments, categorizations and ratings for the browsed data. Furthermore, developers can directly invoke the Service-Finder functionalities from their custom applications through an API.

The details of this component's function, input and output are given below:

- Input
  - A list of ordered services for a query
  - Detailed information about a service or a set of services and a service provider
  - Query access to the service category ontology and the most used tags provided by the users
  - Service availability information
- Function
  - The Web interface allows the users to search services by keyword, tag or concept in the categorization, sort and filter query results by refining the query, compare and bookmark services, and try out the services that offer this functionality
  - The API allows the developers to invoke Service-Finder functionalities
- Output
  - Explicit user annotations such as tags, ratings, comments, descriptions and so on
  - Implicit user data, for example click streams of users, bookmarks, comparisons, links sent, etc.
  - Manual advertising of available new services

2125 The Principle of the Cluster Engine Component

The Cluster Engine is in charge of collecting and analyzing information about user behavior from the Service-Finder Portal, e.g. the queried services and the compared services of the users. Moreover, it also provides cluster data to the Conceptual Indexer and Matcher for providing service recommendations.

Its input, function and output are as follows:

- Input
  - Service annotation data, both extracted and from user feedback
  - Users' click streams, used for extracting user behavior
- Function
  - Obtain user clusters from user behavior
  - Obtain service clusters from service annotation data to enable finding similar services
- Output
  - Clusters of users and services

22 Information Extraction

Due to the rapid development and use of the World-Wide Web, a huge number of information sources have appeared on the Internet, but access to them has been limited to browsing and searching because of the heterogeneity and the lack of structure of Web information sources. Therefore, Information Extraction, which transforms Web pages into program-friendly structures for post-processing, has become a great necessity. The task of Information Extraction is specified in terms of its inputs and its extraction targets, and the technique used in the process of Information Extraction is called an extractor.

221 Input Types of Information Extraction

Generally speaking, there are three different input types. The first input type is the unstructured document, for example the free text shown in figure 2-2. It is unstructured and written in natural language, so it requires substantial natural language processing. The second input type is the structured document, for instance XML documents, because their data can be described through an available DTD (Document Type Definition) or XML (eXtensible Markup Language) schema. The third input type is the semi-structured document, which is widespread on the Web, such as the large volume of HTML pages with tables, itemized lists and enumerated lists. This is because HTML tags are often used to render such embedded data in HTML pages; see figure 2-3.

Figure 2-2 Left is the free text input type and right is its output [4]

Figure 2-3 A semi-structured page containing data records (in rectangular box) to be extracted [4]

In this way, the inputs of the semi-structured type can be seen as documents with a fairly regular structure, and the data of these documents can be displayed in an HTML or non-HTML format. Besides, since the Web pages of the Deep Web are dynamic and generated from structured databases by means of templates or layouts, the Deep Web can be considered one of the input sources that provide such semi-structured documents. For example, the authors, price and comments of the book pages provided by Amazon have the same layout; that is because these Web pages are generated from the same database and apply the same template or layout. Furthermore, semi-structured HTML pages can also be generated manually. For example, although the publication lists on the homepages of different researchers are produced by diverse users, they all have a title and a source property for every single paper. Finally, the inputs for Information Extraction can also be pages of the same class within or across various Web Service Registries.

222 Extraction Targets of Information Extraction

Regarding the task of Information Extraction, the extraction target also has to be considered. There are two different extraction targets. The first one is the relation of a k-tuple, where k is the number of attributes in a record; in some cases an attribute of a record may have no instantiation, in others it owns multiple instantiations. The second extraction target is the complex object with hierarchically organized data. Though the ways of depicting extraction targets in a page are diverse, the most common structure is the hierarchical tree. This hierarchical tree may contain only one leaf node, or one or more lists of leaf nodes grouped under internal nodes. The structure of a data object may also be flat or nested: if the structure is flat, there is only one leaf node, which can also be called the root; if it is a nested structure, the internal nodes involved in this data object span more than two levels.

Furthermore, in order to make Web pages readable for human beings and easier to visualize, the tables or tuples of the same list, or the elements of a tuple, are usually clearly isolated or demarcated. However, the display of a data object in a Web page is affected by the following conditions [4] (a small illustrative data model follows this list):

- The attribute of a data object has zero or several values
(1) If there is no value for the attribute of a data object, the attribute is called a "none" attribute. For example, a special offer only available for certain books might be a "none" attribute.
(2) If there is more than one value for the attribute of a data object, it is called a "multiValue" attribute. For instance, the name of the author of a book could be a "multiValue" attribute.

- The set of attributes (A1, A2, A3, ...) has multiple orderings
That is to say, within this set of attributes the position of an attribute might change according to the different instances of a data object. Such an attribute is called a "multiOrdering" attribute. For instance, for movies before the year 1999 a movie site might list the release date in front of the movie's title, while for movies from 1999 onwards it lists the release date behind the movie's title.

- The attribute has different formats
This means the display format of the data object can be completely different for different instances. Therefore, if the format of an attribute is free, a lot of rules are needed to deal with all possible cases. This kind of attribute is called a "multiFormat" attribute. For example, an e-commerce Web site may use a bold font to present the regular prices while using a red color to display the sale prices. There is also the opposite situation, in which different attributes of a data object have the same format; for example, various attributes are presented using <TD> tags in a table presentation. Such attributes can be differentiated by means of their order information. However, for cases in which a "none" attribute occurs or "multiOrdering" attributes exist, the rules for extracting these attributes have to be revised.

- The attribute cannot be decomposed
For easier processing, the input documents are sometimes treated as strings of tokens instead of strings of characters. However, some attributes cannot be decomposed into several individual tokens; these attributes are called "untokenized" attributes. An example are college course codes like "COMP4016" or "GEOL2001": the department code and the course number cannot be separated into two different strings of characters such as "COMP" and "4016" or "GEOL" and "2001".
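To make these attribute conditions more concrete, the following is a minimal Java sketch of how such a data object could be modelled; the class BookRecord and its fields are purely illustrative assumptions and do not appear anywhere in the described techniques.

    import java.util.List;
    import java.util.Optional;

    // Illustrative model of an extraction target: a book record whose
    // attributes exhibit the cases discussed above.
    public class BookRecord {

        // Ordinary single-valued attribute.
        private final String title;

        // "multiValue" attribute: a book may have several authors.
        private final List<String> authors;

        // "none" attribute: a special offer exists only for certain books,
        // so it is modelled as an Optional that may be empty.
        private final Optional<String> specialOffer;

        public BookRecord(String title, List<String> authors, Optional<String> specialOffer) {
            this.title = title;
            this.authors = authors;
            this.specialOffer = specialOffer;
        }

        @Override
        public String toString() {
            return title + " by " + String.join(", ", authors)
                    + specialOffer.map(o -> " (offer: " + o + ")").orElse("");
        }
    }

An "untokenized" attribute such as the course code "COMP4016" would simply be kept as one opaque string in such a model.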

223 The Used Techniques in Information Extraction

The extractor used in the process of Information Extraction aims at providing a single uniform query interface to access information sources like database servers and Web servers. It consists of the following phases: collecting the returned Web pages, labeling these Web pages, generalizing extraction rules, extracting the relevant data and outputting the result in an appropriate format (XML format or relational database) for further information integration. For example, the extractor first queries the Web server to gather the returned pages through the HTTP protocol; after that, it starts to extract the contents from these HTML documents and integrates them with other data sources. The whole process of the extractor follows the steps below (a small illustrative sketch of a delimiter-based rule is given after this list):

- Step 1
At the beginning, the input has to be tokenized. There are two different granularities for the input string tokenization: tag-level encoding and word-level encoding. Tag-level encoding transforms the tags of an HTML page into general tokens, while transforming every text string between two tags into a special token. Word-level encoding, in contrast, treats each word in a document as a token.

- Step 2
Next, the extraction rules are applied for every attribute of the data object in the Web pages. These extraction rules can be induced by means of top-down or bottom-up generalization, pattern mining or logic programming. In addition, the type of extraction rules may be expressed by means of regular grammars or logic rules. For example, some use path expressions on the HTML parse tree, like html.head.title or html->table[0], some use syntactic or semantic constraints, and some use delimiter-based constraints such as HTML tags or literal words.

- Step 3
After that, all the extracted data are assembled into records.

- Step 4
Finally, this process is iterated until all data objects in the input have been processed.
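The following is a minimal, self-contained Java sketch of steps 1 to 3 for one delimiter-based extraction rule; the HTML fragment and the tag names used here are invented for illustration and do not come from any of the five registries.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class DelimiterExtractor {

        public static void main(String[] args) {
            // Invented sample input: a fragment of a service list page.
            String html = "<tr><td class=\"name\">BLZService</td>"
                        + "<td class=\"provider\">thomas-bayer.com</td></tr>"
                        + "<tr><td class=\"name\">WeatherService</td>"
                        + "<td class=\"provider\">example.org</td></tr>";

            // Delimiter-based extraction rule: the service name is the text
            // between <td class="name"> and the closing </td> tag.
            Pattern rule = Pattern.compile("<td class=\"name\">(.*?)</td>");

            // Apply the rule to every occurrence and collect the values.
            List<String> serviceNames = new ArrayList<>();
            Matcher m = rule.matcher(html);
            while (m.find()) {
                serviceNames.add(m.group(1));
            }
            System.out.println(serviceNames); // prints [BLZService, WeatherService]
        }
    }

A real extractor would combine several such rules, one per attribute, and assemble the matched values into complete records as described in step 3.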


23 Pica-Pica Web Service Description Crawler

Pica pica is known as a bird species, also called the magpie. Here, however, Pica-Pica is a Web Service Description Crawler which is designed to address the quality of Web Services, for example the evaluation of the descriptive quality of offered Web Services and how well these Web Services are described in today's Web Service Registries.

231 Needed Libraries of the Pica-Pica Web Service

Description Crawler

This version of the Pica-Pica Web Service Description Crawler was written by Anton Caceres and Josef Spillner and is programmed in the Python language. In order to run these scripts to parse the HTML pages, it needs two additional libraries: Beautiful Soup and html5lib.

- Beautiful Soup
It is an HTML/XML parser for the Python language, and it can even turn invalid markup into a parse tree [5]. Moreover, the following three features make it powerful:
  - Bad markup does not choke Beautiful Soup; it generates a parse tree that makes approximately as much sense as the original document. Therefore, you can still obtain the data that you want.
  - Beautiful Soup provides a toolkit with simple, idiomatic methods for navigating, searching and modifying the parse tree. Hence, you do not need to create a custom parser for every application.
  - Beautiful Soup converts incoming documents to Unicode and outgoing documents to UTF-8 automatically. If the document already specifies an encoding, you can ignore the issue; otherwise, you just have to specify the encoding of the original document.
Furthermore, the ways of including Beautiful Soup into an application are the following [5]:
  - from BeautifulSoup import BeautifulSoup (for processing HTML)
  - from BeautifulSoup import BeautifulStoneSoup (for processing XML)
  - import BeautifulSoup (to get everything)

- Html5lib
It is a Python package which implements the HTML5 [8] parsing algorithm. In order to gain maximum compatibility with the current major desktop web browsers, this implementation is based on the WHATWG (Web Hypertext Application Technology Working Group) HTML5 specification.


232 Architecture of the Pica-Pica Web Service

Description Crawler

Figure 2-4 Architecture of the Pica-Pica Web Service Description Crawler

Figure 2-4 illustrates the primary architecture of the Pica-Pica Web Service Description Crawler. It includes four fundamental components: the Service Page Grabber component, the WSDL Grabber component, the Property Grabber component and the WSML Register component.

(1) The Service Page Grabber component takes the URL seed as input and outputs the link of the service page to the following two components: the WSDL Grabber component and the Property Grabber component.
(2) The WSDL Grabber component is responsible for obtaining the WSDL document based on the delivered service page link and then checking whether the obtained WSDL document is valid. Finally, only the valid WSDL documents are passed into the WSML Register component for further processing.
(3) The Property Grabber component tries to extract the service's properties hosted in the service page, if they exist. After that, all these service properties are saved into an INI file as the information of that service.
(4) The functionality of the WSML Register component is to write an appropriate WSML document by means of the valid WSDL documents delivered from the WSDL Grabber component and, optionally, the INI files delivered from the Property Grabber component, and afterwards to register them in Conqo.

- WSML [9]
It stands for Web Service Modeling Language, which provides a framework with different language variants. Hence, it is often used to describe the different aspects of semantic Web Services according to the conceptual model of WSMO.

- WSMO [10]
WSMO, whose full name is Web Service Modeling Ontology, is dedicated to describing various aspects related to Semantic Web Services based on ontologies. An ontology is a formal, explicit specification of a shared conceptualization. These ontologies are the sticking points that link the agreement of the communities of users with the defined conceptual semantics of the real world.

- Conqo [11]
It is a discovery framework that considers not only the Quality of Service (QoS) but also context information. It uses a Web Service Repository to manage service descriptions that are based on WSML.

233 Implementation of the Pica-Pica Web Service

Description Crawler

This section describes the implementation of the Pica-Pica Web Service Description Crawler in detail.

(1) First, to start the whole crawling process of the Pica-Pica Web Service Description Crawler, an input is needed as the initial seed. In this crawler, there are the five Web Service Registries listed below, and their URL addresses are used as the input seed for the Pica-Pica Web Service Description Crawler. Moreover, in this version of the Pica-Pica Web Service Description Crawler there is a separate Python script for each Web Service Registry, and the crawling processes of these scripts are executed one after another.

Biocatalogue http://www.biocatalogue.com
Ebi http://www.ebi.ac.uk
Seekda http://www.seekda.com
Service-Repository http://www.service-repository.com
Xmethods http://www.xmethods.net

(2) Then, after being fed with the input seed, the crawler steps into the next component, the Service Page Grabber. At first, this component tries to read the data from the Web based on the input seed. Then it builds a parse tree of the read data by means of the functions of the Beautiful Soup library. After that, the Service Page Grabber component starts to look for the service page link of each service published in the Web Service Registry by means of the functions in the html5lib library. If the service page link of a single service is found, it first checks whether this service page link is valid. Once the service page link is valid, it is passed into the following two components for further processing, the WSDL Grabber component and the Property Grabber component.

(3) When the WSDL Grabber component receives a service page link from the previous component, it sets out to extract the WSDL link address of that service through the parse tree of the data in this service page. Next, this component starts to download the WSDL document of that service by means of the WSDL link address. Thereafter, the obtained WSDL document is stored on disk. The process of this WSDL Grabber component is carried on continually until no more service links are passed to it. Certainly, not all grabbed WSDL documents are usable: they may contain bad definitions or a bad namespaceURI, be an empty document, or not even be in XML format. Hence, in order to pick them out, this component further analyzes the involved WSDL documents (a minimal sketch of such a validity check is given after this list). All valid documents are then put into a "validWSDLs" folder, whereas the invalid documents are put into a folder named "invalidWSDLs" in order to gather statistical information. Finally, only the WSDL documents in the "validWSDLs" folder are passed into the subsequent component.

(4) Moreover, since some Web Service Registries give additional information about the services, such as availability, service provider or version, the Property Grabber component sets out to extract this information as the service's properties and thereafter saves these properties into an INI file. However, if no additional information is available, there is no need to extract service properties, and thus there is no INI file for that service. In this implementation of the Pica-Pica Web Service Description Crawler, only the Python scripts for the Seekda and Service-Repository Web Service Registries have functions to extract the services' properties, while for the other three Web Service Registries there is no such function.

(5) Furthermore, it is optional to create a report file which contains the statistical information of this process, such as the total number of services of one Web Service Registry, the number of services whose WSDL document is invalid, etc.

(6) As stated above, at this point there is a folder with all valid WSDL documents and possibly also some INI files. Therefore, the task of the WSML Register component is now to generate the appropriate WSML documents from these valid WSDL documents and INI files, and then register them in Conqo.
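Pica-Pica itself is written in Python, so the following Java fragment is only a rough sketch of the kind of validity check described in step (3): a WSDL document is treated as valid if it is non-empty, parses as XML and has a definitions root element. The class name, the method name and the exact criteria are assumptions made for illustration.

    import java.io.File;
    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;

    public class WsdlValidityCheck {

        // Returns true if the file looks like a usable WSDL document:
        // it exists, is not empty, is well-formed XML and its root
        // element is named "definitions".
        public static boolean isValidWsdl(File wsdlFile) {
            if (!wsdlFile.exists() || wsdlFile.length() == 0) {
                return false;
            }
            try {
                DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
                factory.setNamespaceAware(true);
                DocumentBuilder builder = factory.newDocumentBuilder();
                Document doc = builder.parse(wsdlFile);
                return "definitions".equals(doc.getDocumentElement().getLocalName());
            } catch (Exception e) {
                // Not parseable as XML, hence not a valid WSDL document.
                return false;
            }
        }

        public static void main(String[] args) {
            File f = new File("BLZService.wsdl"); // example file name only
            System.out.println(f + " valid: " + isValidWsdl(f));
        }
    }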

24 Conclusions of the Existing Strategies

This chapter presented three aspects of the existing strategies: the Service-Finder project, the Information Extraction technique and the Pica-Pica Web Service Description Crawler.

The task of this master program is to obtain the available Web services and their related information from the Web. This is essentially a procedure of extracting the needed information about each service, such as the service's WSDL document and its properties. Therefore, the Information Extraction technique, which is used to extract information hosted on the Web, can be used in this master program.

Moreover, Service-Finder is a large project that is not only able to obtain the available Web services and their related information from the Web, but also enriches the crawled data with annotations by means of the Service-Finder ontology and the Service Category ontology and integrates all the information into a coherent semantic model. Furthermore, it also provides capabilities for searching and browsing the data with a user interface and gives users service recommendations. However, for a master program the Service-Finder project far exceeds the requirements; therefore, it is only considered as a reference for this master program.

Furthermore, since the Pica-Pica Web Service Description Crawler aims only at obtaining the available Web services and their related information, it fulfills the primary task of this master program. Nevertheless, regarding the information about a service, the Pica-Pica Web Service Description Crawler extracts only a few properties, sometimes even none. Consequently, in order to improve the quality of the service descriptions, as many properties about each service as possible have to be extracted. Hence, chapter 3 presents an extension of this Pica-Pica Web Service Description Crawler.


3 Design and Implementation

In the previous chapter on the state of the art, the already existing techniques and implementations were presented. This chapter introduces the basic principle of the proposed approach, the Deep Web Services Crawler, which is based on these existing techniques, especially the Pica-Pica Web Service Description Crawler.

31 Deep Web Services Crawler Requirements

This section describes the goals of the Deep Web Service Crawler approach, the system requirements of the approach and some non-functional requirements.

311 Basic Requirements for DWSC

The following are the basic requirements which should be achieved:

(1) Produce the largest annotated Service Catalogue. Here, the Service Catalogue is a list of Web services published in a Web Service Registry. It contains not only the WSDL document of each service but also a file with the information about that service. Therefore, for the purpose of producing the largest annotated Service Catalogue, the proposed approach needs to extract as many properties about those Web services as possible. Moreover, it also has to download the WSDL document that is hosted along with the Web service. That is to say, these properties are not only the interesting structured properties, such as the service name and its WSDL link address, but also other non-functional properties, for example the endpoint and its monitoring information.

(2) Flexible storage: each service has a Service Catalogue that contains all its interesting properties. An important question is how to deal with these service properties, i.e. which schemes will be used to store them. Hence, in order to store them in a flexible way, the proposed approach provides three methods for storage. The first one stores them as an XML file, the second method stores them in an INI file, and the third method uses a database for storage (a small sketch of the INI variant follows).
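As an illustration of the second storage method, the following minimal Java sketch writes a few service properties into an INI-style key/value file by means of java.util.Properties. The property names, the values and the file name are examples only; the actual INI layout produced by the crawler is the one shown later in Figure 4-5.

    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.util.Properties;

    public class IniStorageExample {

        public static void main(String[] args) throws IOException {
            // Example properties of one Web service (names and values are illustrative).
            Properties serviceProperties = new Properties();
            serviceProperties.setProperty("Name", "BLZService");
            serviceProperties.setProperty("WSDL", "http://www.example.org/BLZService?wsdl");
            serviceProperties.setProperty("Provider", "thomas-bayer.com");

            // Write them to an INI-style key=value file on disk.
            try (FileOutputStream out = new FileOutputStream("BLZService.ini")) {
                serviceProperties.store(out, "Service Catalogue entry");
            }
        }
    }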

312 System Requirements for DWSC

Generally speaking, the requirements for realizing a programming project comprise the following:

1) Operating system: Linux/Unix, Windows (XP, Vista, 2000, etc.)
2) Programming language: C++, Java, Python, C, etc.
3) Programming tool: NetBeans, Eclipse, Visual Studio and so on

In this master thesis, the scripts of the Deep Web Service Crawler approach are written in the Java programming language. Moreover, these code scripts have only been tested on the Windows XP and Linux operating systems, but not on other operating systems.


313 Non-Functional Requirements for DWSC

In this part, several non-functional requirements for the Deep Web Service Crawler approach are presented.

1) Transparency: the process of data exploration and data storage should be done automatically, without the user's intervention. However, at first the user should specify the path on the hard disk which will be used to store the outputs of this program.
2) Fault tolerance: during the execution of this program, errors will inevitably happen. Therefore, in order to keep the process from being interrupted, there must be the necessary error handling for process recovery.
3) Completeness: this approach should extract as many of the interesting properties of each Web Service as possible, e.g. endpoint, monitoring information, etc.

In addition, since the Pica-Pica Web Service Crawler has already implemented strategies for the following five URLs, the proposed approach must support no fewer than these five Web Service Registries:

Biocatalogue http://www.biocatalogue.com
Ebi http://www.ebi.ac.uk
Seekda http://www.seekda.com
Service-Repository http://www.service-repository.com
Xmethods http://www.xmethods.net

32 Deep Web Services Crawler Architecture

This section first gives an overview of the high-level architecture of the Deep Web Services Crawler approach. Thereafter, four subsections outline each single component and how they play together.

The current components and flows of data in the Deep Web Service Crawler can be summarized as depicted in Figure 3-1, using the continuous arrows. The crawler first tries to obtain the available service list page links and the related service page links by crawling the Web (Web Service Extractor). Then those gathered links are processed in two separate steps: one obtains the service's WSDL document (WSDL Grabber) and the other collects the properties of each service (Property Grabber). Finally, all these data are stored on the storage device (Storage). The whole process in figure 3-1 is illustrated in detail as follows:

- Step 1
When the Deep Web Service Crawler starts to run, the File Chooser container requires the user to specify a path on the computer or on any other hard disk. The reason for specifying the path is that the Deep Web Service Crawler program needs a place to store all its outputs.

- Step 2
After that, the Web Service Extractor is triggered. It is the main entry point to the specific crawling process. Since the Deep Web Service Crawler program is supposed to crawl for Web Services in some given Web Service Registries, the URL addresses of these Web Service Registries are given as the initial seed for this Web Service Extractor process. However, since the page structures of these Web Service Registries are completely different, there is a dedicated process for each Web Service Registry.

Figure 3-1 Overview of the Basic Architecture for the Deep Web Services Crawler

- Step 3
According to the given seed, two types of links are obtained by the Web Service Extractor component: one is the service list page link and the other is the service page link. A service list page is a page that contains a list of Web Services and possibly some information about these Web Services, while a service page is a page that contains much more information about a single service. Finally, it forwards these two types of links to the next two components, the Property Grabber and the WSDL Grabber.

- Step 4
On the one hand, the Property Grabber component tries to gather the information about the service that is hosted in the service list page and the service page, such as the name of the service, its description, the ranking of this service, etc. Finally, all the information about the service is collected together as the service properties, which are then delivered to the Storage component for further processing.

- Step 5
On the other hand, the WSDL Grabber tries to obtain the WSDL link from the service list page or the service page. This is because for some Web Service Registries the WSDL link is hosted in the service list page, like for Biocatalogue, and for other Web Service Registries it is hosted in the service page, such as for Xmethods. After obtaining the WSDL link, it is also transmitted to the Storage component for further processing.

- Step 6
When the service properties and the WSDL link of the service are received by the Storage component, it stores them on disk. The service properties are stored on disk in three different ways: as an XML file, as an INI file, or as one record inside a database table. For the WSDL link, the Storage component first tries to download the page content according to the URL address of the WSDL link. If this succeeds, the page content of the service is stored as a WSDL document on disk (a minimal sketch of this download step is given after this list).

- Step 7
Nevertheless, this is just the crawling process for a single service, from step 3 to step 6. Hence, if there is more than one service or more than one service list page in those Web Service Registries, the crawling process from step 3 to step 6 is repeated again and again until there is no service or service list page left in that Web Service Registry.

- Step 8
Furthermore, after the crawling process for one Web Service Registry finishes, a file is generated that contains some statistical information about this crawling process, for example the time when the crawling process of this Web Service Registry started and when it finished, the total number of Web services in this Web Service Registry, how many services have an empty WSDL document, the average number of service properties in this Web Service Registry, and the average time cost for extracting service properties, getting the WSDL document, generating the XML file and INI file, etc.
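To make step 6 more tangible, the following is a minimal Java sketch of how the Storage component could download the page content behind a WSDL link and save it as a WSDL document on disk. The class name, the method signature and the example URL are illustrative assumptions, not the actual implementation.

    import java.io.IOException;
    import java.io.InputStream;
    import java.net.URL;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.nio.file.StandardCopyOption;

    public class WsdlDownloadExample {

        // Downloads the content behind the given WSDL link and stores it
        // in the target directory; returns true on success.
        public static boolean storeWsdl(String wsdlLink, Path targetDir, String serviceName) {
            try (InputStream in = new URL(wsdlLink).openStream()) {
                Files.createDirectories(targetDir);
                Path wsdlFile = targetDir.resolve(serviceName + ".wsdl");
                Files.copy(in, wsdlFile, StandardCopyOption.REPLACE_EXISTING);
                return true;
            } catch (IOException e) {
                // The link may be dead or the content unreachable.
                return false;
            }
        }

        public static void main(String[] args) {
            Path outputDir = Paths.get("output", "wsdl"); // below the user-specified path from step 1
            storeWsdl("http://www.example.org/BLZService?wsdl", outputDir, "BLZService");
        }
    }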

321 The Function of Web Service Extractor Component

The Web Service Extractor component is responsible for gathering service list page links and service page links. It pursues a focused crawl of the Web and only forwards service list page and service page links to the subsequent components for analyzing, collecting and gathering purposes. Therefore, it identifies both service list page links and related service page links on these Web Service Registries.

As can be seen from figure 3-2, a crawl for Web Services needs to start from a URL seed. This seed is almost as important as the Web Service Extractor itself, as it highly influences the part of the Web which needs to be crawled. The seed can or shall contain, e.g., Web pages where these Web Services are published or that talk about Web Services.

Figure 3-2 Overview of the process flow of the Web Service Extractor Component

After being fed with the URL seed, the Web Service Extractor component starts to get the link of the service list page from the initial page of this URL seed. However, this process differs for the five Web Service Registries. The following shows the different situations in these Web Service Registries.

- Service-Repository Web Service Registry
In this Web Service Registry, the link of the first service list page is the URL address of its seed, which means some Web Services can already be found on the home page of the Service-Repository Web Service Registry. The "first" here implies that there is more than one service list page link in this Web Service Registry. Therefore, the process of getting service list page links in this registry is continually carried on until no more service list page links exist.

- Xmethods Web Service Registry
Although there are Web Services on the home page of the Xmethods Web Service Registry, they are only a small subset of those in this Web Service Registry. Moreover, in the Xmethods Web Service Registry there is only one page containing all Web Services. Therefore, the service list page link of that page has to be obtained.

- Ebi Web Service Registry
The situation in the Ebi Web Service Registry is a little bit like in the Xmethods Web Service Registry. That is to say, there is also one page that contains all Web Services of this Web Service Registry. However, this page is not the initial page of the input seed. Therefore, more than one operation step is needed to get the service list page link of that page.

- Seekda Web Service Registry
In the Seekda Web Service Registry, the Web Services are not contained in the initial page of the input seed. The service list page link can be obtained after several additional operation steps. However, there is a problem with getting the service list page links in this registry: simply put, if there is more than one page containing Web Services, for some unknown reason the links of the remaining service list pages cannot be obtained. In other words, only the link of the first service list page can be obtained.

- Biocatalogue Web Service Registry
The process of getting service list pages in the Biocatalogue Web Service Registry is almost the same as in the Seekda Web Service Registry. The only difference is that in the Biocatalogue Web Service Registry all service list page links can be obtained if there is more than one service list page.

Then, after getting the link of a service list page, the Web Service Extractor begins to get the link of the service page for each service listed in the service list page. The reason why it can do this is that there is an internal link for every service which addresses the service page. It is worth noting that once a service page link is obtained, this component immediately forwards the service page link and the service list page link to the subsequent two components for further processing. The process of obtaining the service page links is carried out continuously until all services listed in that service list page are crawled. Analogously, the process of getting service list pages is also carried out continuously until no more service list pages exist.

3211 Features of the Web Service Extractor Component

The main features are described in the following paragraphs.

1) Obtain service list page links
A central task is to obtain the corresponding service list page links. A service list page link is a URL address that leads to a public list of Web Services with only brief information about these Web services, like the name of the service, an internal URL that links to another page containing the detailed information about that service, and sometimes the link address of the WSDL document.

2) Obtain service page links
Once service list page links are found, a crucial aspect is to extract the internal link of each service to aid the task of service information discovery. Therefore, it is the task of the Web Service Extractor to harvest the HTML page content of the service list page, so that the service page links, which lead to much more detailed information about the single Web service, can be obtained.

3212 Input of the Web Service Extractor Component

This component depends on specific input seeds. The only input required for this component is a URL seed. This URL seed is one of the URLs listed in section 313.

3213 Output of the Web Service Extractor Component

The component produces two kinds of service-related page links from the Web:
- Service list page links
- Service page links


3214 Demonstration for Web Service Extractor

In order to give a comprehensive understanding of the process of the Web Service Extractor component, the following figures provide an explanation. Although there are five URL addresses in this section, only the URL of the Service-Repository is shown as an example.

1) The input seed is the initial URL address of the Service-Repository, which is "http://www.service-repository.com".

2) As already stated in section 321, the first service list page link of this Web Service Registry is its input seed "http://www.service-repository.com". Figure 3-3 shows the corresponding service list page of that link.

Figure 3-3 Service list page of the Service-Repository

Figure 3-4 Original source code of the internal link for the Web service "BLZService"

Figure 3-5 Code overview of getting the service page link in the Service-Repository

Figure 3-6 Service page of the Web service "BLZService"


3) Now that the service list page link is known, the next step is to acquire the service page link for each of the services listed on the service list page. The text in the red box of figure 3-4 shows the internal link of the Web service "BLZService". However, this is not the complete link of the service page: it has to be prefixed with the initial URL address of the Service-Repository Web Service Registry, "http://www.service-repository.com". The code for getting the service page link of a Web service in the Service-Repository is shown in figure 3-5. Therefore the final link of this service page is "http://www.service-repository.com/service/overview-210897616". Figure 3-6 shows the corresponding service page of that link.

4) Afterwards those two links, the service list page link and the service page link, which are gathered by the Web Service Extractor component, are immediately forwarded to the next two components, the WSDL Grabber component and the Property Grabber component.

322 The Function of WSDL Grabber Component

The WSDL Grabber component acquires the WSDL link that is hosted in the Web, based on the service list page link or the service page link. The whole process flow is illustrated in figure 3-7.

Figure 3-7 Overview of the process flow of the WSDL Grabber Component

When the WSDL Grabber component receives the inputs delivered from the previous component, it starts to get the WSDL link for the service based on these inputs. However, although the inputs of the WSDL Grabber component are the links of the service page and the service list page, only one of them contains the WSDL link for the corresponding service. That is to say, the WSDL link exists either in the service page or in the service list page. The reason why both links need to be delivered to this component is that only one of the five Web Service Registries, namely Biocatalogue, hosts the WSDL link in the service list page, while for the other four Web Service Registries the WSDL link is hosted in the service page. Therefore the WSDL links of these four Web Service Registries are obtained from the service page link, and for the Biocatalogue Web Service Registry the WSDL link is obtained through the service list page link. However, there is a problem with getting the WSDL link in the Biocatalogue Web Service Registry. In brief, some of the Web services listed on the service list page of the Biocatalogue Web Service Registry do not have a WSDL link; in other words, these services have no WSDL document. In such a situation the value of the WSDL link for these Web services is assigned a "NULL" value. Nevertheless, for the Web Services in the other four Web Service Registries the WSDL link always exists in the service page. Eventually, whenever the WSDL Grabber component extracts the WSDL link of a single Web service, it is immediately forwarded to the Storage component for downloading the WSDL document.

3221 Features of the WSDL Grabber Component

The WSDL Grabber component has to provide the following feature:

l Obtain WSDL links

The WSDL link is the direct way to reach the page that contains the contents of the WSDL document. It is actually a URL address, but at the end of this URL address there is usually something like "wsdl" or "WSDL" to indicate that this address points to the page of a WSDL document.

3222 Input of the WSDL Grabber Component

This component requires the following input data

l Service list page link

l Service page link

3223 Output of the WSDL Grabber Component

The component will only produce the following output data

l The URL address of the WSDL link for each service

3224 Demonstration for WSDL Grabber Component

This section presents a list of figures in order to give a comprehensive understanding of the process of the WSDL Grabber component. It uses the same Web service of the Service-Repository as an example.

1) The input for the WSDL Grabber component is the link of the service page obtained from the Web Service Extractor component. The address of this link is "http://www.service-repository.com/service/overview-210897616".

Figure 3-8 WSDL link of the Web service "BLZService" in the service page


2) Figure 3-9 shows the corresponding original source code of the WSDL link that is highlighted in figure 3-8.

Figure 3-9 Original source code of the WSDL link for the Web service "BLZService"

3) Figure 3-10 and figure 3-11 show the code used to extract the WSDL link presented in figure 3-9. However, figure 3-10 is the particular code for the Service-Repository Web Service Registry only; for the other four Web Service Registries this code is different. The function "getServiceRepositoryWSDLLink" first gets a list of nodes that have the HTML tag name "b". Then it checks these nodes one by one to see whether the text value of a node is "WSDL". If the condition is fulfilled, the attribute value of its sibling, which is an "a" tag here, is extracted as the value of the WSDL link for this Web service.

Figure 3-10 Code overview of the "getServiceRepositoryWSDLLink" function

Figure 3-11 Code overview of the "oneParameter" function

4) Finally, after applying the functions shown in figures 3-10 and 3-11, the WSDL link for the Web service "BLZService" can be obtained. Its value is "http://services.unitedplanet.de/blz/BlzService.asmx?WSDL".
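The logic described in step 3 can be summarised with the following minimal sketch, again written with the jsoup HTML parser as an assumption; the actual "getServiceRepositoryWSDLLink" function of figure 3-10 is implemented differently.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class WsdlLinkSketch {
    // Sketch of the described logic: look for a <b> node whose text is "WSDL"
    // and read the link from its neighbouring <a> tag.
    public static String getServiceRepositoryWsdlLink(String servicePageLink) throws Exception {
        Document doc = Jsoup.connect(servicePageLink).get();
        for (Element b : doc.select("b")) {              // all nodes with the HTML tag name "b"
            if ("WSDL".equals(b.text().trim())) {        // does the text value equal "WSDL"?
                Element a = b.nextElementSibling();      // the sibling that carries the link
                if (a != null && a.tagName().equals("a")) {
                    return a.absUrl("href");             // attribute value used as the WSDL link
                }
            }
        }
        return "NULL";                                    // no WSDL link found for this service
    }
}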


323 The Function of Property Grabber Component

The Property Grabber component is a module which is used to extract and gather the Web service information hosted in the Web, which is in fact the information shown on the service list page and the service page. In the end all the obtained Web service information is collected together as the service properties, which are delivered to the Storage component for storing. The detailed process flow of the Property Grabber component is illustrated in figure 3-12.

The inputs of the Property Grabber component are the same as the inputs of the WSDL Grabber component. However, there is still a small difference between them with respect to the seed. As already mentioned in section 322, for the WSDL Grabber component only one of the inputs is sufficient to get the WSDL link, whereas the Property Grabber component needs both of them as seeds. Therefore, once the Property Grabber component receives the needed inputs, it starts to extract the information of that single Web service.

Figure 3-12 Overview of the process flow of the Property Grabber Component

After the Property Grabber component receives the inputs, it starts to extract the service information for the Web service. Generally speaking, the service information consists of four aspects, which are structured information, endpoint information, monitoring information and whois information.

(1) Structured Information

The structured information can be obtained by extracting the information hosted on the service page and the service list page. It is the basic descriptive information about the service, such as the name of the service, the URL address through which the WSDL document can be obtained, the description introducing the service, the provider who offers that service, its rating, the server who owns this service, etc. However, the elements constituting this structured information vary across the different Web Service Registries. For example, the rating information of a Web service exists in the Service-Repository Web Service Registry, while the Xmethods Web Service Registry does not have this information. In addition, even for Web services in the same Web Service Registry, some elements of the structured information may not exist. For instance, one service in a Web Service Registry may have description information while another service in the same Web Service Registry does not. Tables 3-1 to 3-5 list the structured information that should be extracted from these five Web Service Registries. Moreover, if the style of a service in the Biocatalogue Web Service Registry is SOAP, there is some additional information describing the SOAP operations of this service; if it is REST, this additional information describes the REST operations instead. These should also be considered part of the structured information. Table 3-6 and table 3-7 illustrate the information for these two kinds of operations.

Service Name WSDL Link WSDL Version

Provider Server Rating

Homepage Owner Homepage Description

Table 3-1 Structured Information of the Service-Repository Web Service Registry

Service Name WSDL Link Provider

Service Style Homepage Implementation Language

Description User Description Contributed Client Name

Type of this Client Publisher for this Client Used Toolkit of this Client

Used Language of this Client Used Operating System of this Client

Table 3-2 Structured Information of the Xmethods Web Service Registry

Service Name WSDL Link Server

Provider Provider's Country Service Style

Rating Description User Description

Service Tags Documentation (within WSDL)

Table 3-3 Structured Information of the Seekda Web Service Registry

Service Name WSDL Link Port Name

Service URL Address Implementation Class

Table 3-4 Structured Information of the Ebi Web Service Registry

Service Name WSDL Link Style

Provider Provider's Country View Times

Favorite Times Submitter Service Tags

Total Annotation Provider Annotation Member Annotation

Registry Annotation Base URL SOAP Lab Server Base URL

Description User Description Category

Table 3-5 Structured Information of the Biocatalogue Web Service Registry


SOAP Operation Name Inputs and Outputs Operation Description

Operation Tags Part of Which Service

Table 3-6 SOAP Operation Information of the Biocatalogue Web Service Registry

REST Operation Name Service Tags Used Template

Operation Description Part of Which Service Part of Which Endpoint Group

Table 3-7 REST Operation Information of Biocatalogue Web Service Registry

(2) Endpoint Information

Endpoint information is the port information of the Web service, which can be extracted only through the service page. However, since different Web Service Registries structure the endpoint information differently, some elements of the endpoint information can be very diverse. One thing needs attention: the Ebi Web Service Registry does not provide endpoint information for the Web services published in this registry. Moreover, even though the Web services in the same Web Service Registry share the same structure for this endpoint information, some elements of it may be missing or empty. Furthermore, these Web Service Registries may even have no endpoint information at all for some Web services published by them. Nevertheless, whenever there is endpoint information for a Web service, there is at least one element, namely the URL address of the endpoint. Table 3-8 shows the endpoint information that should be extracted from these five Web Service Registries.

Web Service Registry Name: Elements of the Endpoint Information

Service Repository: Endpoint Name, Endpoint URL, Endpoint Critical, Endpoint Type, Bound Endpoint

Xmethods: Endpoint URL, Publisher of this Endpoint, Contact Email of this Publisher, Implementation Language of this Endpoint

Seekda: Endpoint URL

Biocatalogue: Endpoint Name, Endpoint URL

Table 3-8 Endpoint Information of these five Web Service Registries

Web Service Registry Name: Elements of the Monitoring Information

Service Repository: Service Availability, Number of Downs, Total Uptime, Total Downtime, MTTR, MTBF, RTT Max of Endpoint, RTT Min of Endpoint, RTT Average of Endpoint, Ping Count of Endpoint

Seekda: Service Availability, Begin Time of Monitoring

Biocatalogue: Monitored Status of Endpoint, Overall Status of Service

Table 3-9 Monitoring Information of these five Web Service Registries

(3) Monitoring Information

Monitoring information is the measured statistical information for the Web service. It is worth noting that the Ebi and Xmethods Web Service Registries do not provide monitoring information for the Web services published by them, while for the other three Web Service Registries only a few Web services may lack this information. Table 3-9 displays the monitoring information for these three Web Service Registries.

(4) Whois Information

Whois information is not extracted from the information hosted on the service page and the service list page. It is the descriptive information about the service domain, which can be gained by means of the address of the WSDL link. Because of that, the process of getting the Whois information starts after first obtaining the service domain. The final value of the service domain must not contain strings like "http", "https" or "www"; it must be the registrable domain directly under the top level domain. After that, the service domain database is queried by sending the value of the service domain to the Whois client, which is simply a Web site on the Internet, for example "http://www.whois365.com/cn/domain". Then, if information about that service domain exists, a list of this information is returned as the output. However, the structure of the returned information varies for different service domains. Therefore the most challenging part is to handle the extraction process for each different form of the returned information. Table 3-10 lists the Whois information that needs to be extracted for all these five Web Service Registries. A small sketch of the domain extraction step is given after table 3-10.

Service Domain URL Domain Name Domain Type

Domain Address Domain Description State

Postal Code City Country

Country Code Phone Fax

Email Organization Established Time

Table 3-10 Whois Information for these five Web Service Registries
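To illustrate how the service domain could be derived from the WSDL link before querying the Whois client, the following minimal Java sketch uses the standard java.net.URI class. The simple "last two labels" rule is an assumption for illustration only and not necessarily the exact rule used by the crawler; country-code second-level domains such as "co.uk" would need a more careful rule.

import java.net.URI;

public class ServiceDomainSketch {
    // Reduces a WSDL link to a bare service domain such as "thomas-bayer.com".
    public static String getServiceDomain(String wsdlLink) throws Exception {
        String host = new URI(wsdlLink).getHost();         // e.g. "www.thomas-bayer.com"
        String[] labels = host.split("\\.");
        if (labels.length <= 2) {
            return host;                                    // already a registrable domain
        }
        // Assumption: keep only the last two labels of the host name
        return labels[labels.length - 2] + "." + labels[labels.length - 1];
    }

    public static void main(String[] args) throws Exception {
        System.out.println(getServiceDomain("http://www.thomas-bayer.com/axis2/services/BLZService?wsdl"));
    }
}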

Finally, all the information of these four aspects is collected together and then delivered to the Storage component for further storage processing.

3231 Features of the Property Grabber Component

The Property Grabber component has to provide the following features:

l Obtain basic information

Generally speaking, the more information a Web service has, the better one can judge how good this Web service is. Hence it is necessary for the Property Grabber component to extract all the basic information hosted on the service list page and the service page. This basic information contains the structured information, the endpoint information and the monitoring information.

l Obtain Whois information

Since more information about a Web service allows a better assessment of its quality, it is necessary to extract as much information about the Web service as possible. Therefore, besides the basic information, the Property Grabber component also obtains some additional information called Whois information, such as the type of the domain, the name of the person in charge, the postal code of the domain, city, phone, fax, detailed address, email, etc.


3232 Input of the Property Grabber Component

This component requires the following input data

l Service list page link

l Service page link

3233 Output of the Property Grabber Component

The component will produce the following output data

l Structured information of each service

l Endpoint information about each service, if it exists

l Monitoring information for the service and its endpoint, if it exists

l Whois information of the service domain

All this information is collected together as the properties of each service. Thereafter the collected properties are sent to the Storage component.

3234 Demonstration for Property Grabber Component

The pictures from figure 3-13 to figure 3-16 depict the fundamental procedure of the Property Grabber component. To simplify the explanation, the example shown in this section uses the same Web service as before.

1) The inputs of the Property Grabber component are the links of the service list page and the service page received from the Web Service Extractor component. These links are "http://www.service-repository.com" and "http://www.service-repository.com/service/overview-210897616".

2) When the "GetProperty" function is triggered, it calls its four sub functions shown in figure 3-12.

Figure 3-13 Structured properties of the service "BLZService" in the service list page


Figure 3-14 Structured properties of the service "BLZService" in the service page

3) Firstly, the "getStructuredProperty" function tries to extract the structured information displayed in the red boxes of figure 3-13 and figure 3-14. However, several elements of the structured information have the same content, such as the description shown on the service page and on the service list page. Hence, in order to save time in the extraction process and space in the storing process, elements with the same content are only extracted once. Moreover, a transformation from non-descriptive to descriptive text is needed for the rating information, because its content consists of several star images. The final results of the extracted structured information for this Web service are shown in table 3-11. Because there is no descriptive information for the provider, homepage and owner homepage, their values are assigned as "NULL".

Service Name: BLZService

WSDL Link: http://www.thomas-bayer.com/axis2/services/BLZService?wsdl

WSDL Version: 0

Server: Apache-Coyote/1.1

Description: BLZService

Rating: Four stars and a half

Provider: NULL

Homepage: NULL

Owner Homepage: NULL

Table 3-11 Extracted Structured Information of the Web Service "BLZService"

4) Secondly, the "getEndpointProperty" function extracts the endpoint information displayed in the red box of figure 3-15. Although there is more than one endpoint record in the list, only one of them is extracted as the endpoint information. The reason is that this master program is intended to extract as much information as possible, but this information should not contain redundant entries. Therefore only one record is extracted as the endpoint information, even if there is more than one endpoint record. Table 3-12 shows the final results of the endpoint information for this Web service.

Figure 3-15 Endpoint information of the Web service "BLZService" in the service page


Endpoint Name: BLZServiceSOAP12port_http

Endpoint URL: http://www.thomas-bayer.com:80/axis2/services/BLZService

Endpoint Critical: True

Endpoint Type: production

Bound Endpoint: BLZServiceSOAP12Binding

Table 3-12 Extracted Endpoint Information of the Web service "BLZService"

5) Then it is time to extract the monitoring information by invoking the function "getMonitoringProperty". Figure 3-16 displays two types of monitoring properties: the red box above contains the monitoring information about the Web service, and the red box below lists the monitoring information for the endpoints. As already mentioned before, only one endpoint statistic record is extracted. Besides, as can be seen in figure 3-16, there are two kinds of availability values; they all represent the availability of this Web service, just like the availability shown in figure 3-14. Therefore only one of these availability values is sufficient. Table 3-13 shows the final results of this extraction process.

Figure 3-16 Monitoring Information of the Service "BLZService" in the service page

Service Availability 100

Number of Downs 0

Total Uptime 1 day 19 hours 19 minutes

Total Downtime 0 second

MTBF 1 day 19 hours 19 minutes

MTTR 0 second

RTT Max of Endpoint 141 ms

RTT Min of Endpoint 0 ms

RTT Average of Endpoint 577 ms

Ping Count of Endpoint 112

Table 3-13 Extracted Monitoring Information of the Web service "BLZService"

6) Next, in order to extract the Whois properties, the "getWhoisProperty" function first has to obtain the service domain from the WSDL link. For this Web service the obtained service domain is "thomas-bayer.com". Then it sends this service domain as input to the Whois client for the querying process, which returns a list of information for that service domain, see figure 3-17. Table 3-14 shows the extracted Whois information.


Figure 3-17 Whois Information of the service domain "thomas-bayer.com"

Service Domain URL: thomas-bayer.com

Domain Name: Thomas Bayer

Domain Type: NULL

Domain Address: Moltkestr. 40

Domain Description: NULL

State: NULL

Postal Code: 54173

City: Bonn

Country: NULL

Country Code: DE

Phone: +4922855525760

Fax: NULL

Email: info@predic8.de

Organization: predic8 GmbH

Established Time: NULL

Table 3-14 Extracted Whois Information of the service domain "thomas-bayer.com"

7) Finally, all the information of these four aspects is collected together as the service properties, and these service properties are then forwarded to the Storage component.

324 The Function of Storage Component

The WSDL link from the WSDL Grabber component is used by the Storage component to download the WSDL document from the Web and store it on the disk. In addition, the service properties from the Property Grabber component are also stored directly on the disk in three different formats by this Storage component. Figure 3-18 illustrates the process flow of the Storage component.

When the Storage component receives the needed inputs, the mediator function "Storager" of the Storage component is triggered. It then transforms the service properties into three different output formats and stores them on the disk. These output formats are an XML file, an INI file and database records. Besides, it also tries to download the WSDL document from the URL address of the WSDL link and then stores the obtained WSDL document on the disk too. This "Storager" function is composed of four sub functions, namely the "getWSDL", "generateXML", "generateDatabase" and "generateINI" sub functions. Each sub function is in charge of one aspect of the storage tasks.


Figure 3-18 Overview of the process flow of the Storage Component

(1) "getWSDL" sub function

The task of the "getWSDL" sub function is to download the WSDL document and then store it on the disk. Therefore, above all, it has to get the content of the WSDL document. This procedure is done as follows. First, the "getWSDL" sub function checks whether the value of the received WSDL link equals "NULL" or not. As already presented in section 322, if the Web service does not have a WSDL link, the value of its WSDL link is assigned "NULL". In that case it creates a WSDL document whose name is the service name appended with the mark "No WSDL Document"; obviously this document contains no content, it is an empty document. Nevertheless, if the service does have a WSDL link, this sub function tries to connect to the Internet using the URL address of the WSDL link. Once it succeeds, all the content hosted in the Web is downloaded, stored on the disk and named with the name of the service. Otherwise it creates a WSDL document whose name is prefixed with "Bad" before the service name.
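The behaviour just described can be sketched as follows; the "path" parameter, the exact file names and the use of java.nio are assumptions of this sketch and not the actual implementation shown in figure 3-19.

import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class GetWsdlSketch {
    // Downloads the WSDL document of one service and stores it on disk,
    // following the three cases described above.
    public static void getWSDL(String path, String serviceName, String wsdlLink) {
        try {
            if ("NULL".equals(wsdlLink)) {
                // no WSDL link: create an empty document marked accordingly
                Files.createFile(Paths.get(path, serviceName + "[No WSDL Document].wsdl"));
                return;
            }
            try (InputStream in = new URL(wsdlLink).openStream()) {
                Path target = Paths.get(path, serviceName + ".wsdl");
                Files.copy(in, target);                   // store the downloaded content
            }
        } catch (Exception e) {
            try {
                // download failed: create a document prefixed with "Bad"
                Files.createFile(Paths.get(path, "Bad" + serviceName + ".wsdl"));
            } catch (Exception ignored) { }
        }
    }
}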

(2) "generateXML" sub function

The "generateXML" sub function takes the service properties as input, transforms them into an XML file and stores it on the disk with a name consisting of the service name plus ".xml". XML stands for eXtensible Markup Language, a markup language designed to transport and store data. The first line of an XML file is the XML declaration, which defines the XML version and the encoding used. For example, "<?xml version="1.0" encoding="UTF-8"?>" means this is an XML file whose version is 1.0 and whose encoding is the 8-bit Unicode Transformation Format character set. Besides, an XML file contains XML elements, where an element is everything from its start tag to its end tag. Moreover, an XML element can contain other elements, simple text, or a mixture of both. However, an XML file must contain one root element which is the parent of all other elements.
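A minimal sketch of such an XML generation step is given below; the element names derived from the property names and the lack of character escaping are simplifying assumptions, not the actual "generateXML" code of figure 3-20.

import java.io.PrintWriter;
import java.util.Map;

public class GenerateXmlSketch {
    // Builds a minimal XML document with one root element "service" whose children
    // hold the service properties, and writes it to "<serviceName>.xml".
    public static void generateXML(String path, String serviceName, Map<String, String> properties) throws Exception {
        StringBuilder xml = new StringBuilder();
        xml.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
        xml.append("<!-- service properties extracted by the Deep Web Service Crawler -->\n");
        xml.append("<service>\n");
        for (Map.Entry<String, String> p : properties.entrySet()) {
            String tag = p.getKey().replace(" ", "_");    // turn "Service Name" into a legal tag name
            xml.append("  <").append(tag).append(">")
               .append(p.getValue())
               .append("</").append(tag).append(">\n");
        }
        xml.append("</service>\n");
        try (PrintWriter out = new PrintWriter(path + "/" + serviceName + ".xml", "UTF-8")) {
            out.print(xml);
        }
    }
}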

(3) "generateINI" sub function

The "generateINI" sub function also takes the service's properties as input, but it transforms them into an INI file and then stores it on the disk with a name consisting of the service's name plus ".ini". "ini" stands for initialization. The INI file format is a de facto standard for configuration files; such files are just simple text files with a basic structure. Generally speaking, an INI file contains three different parts: section, parameter and comment. The parameter is the basic element contained in an INI file. Its format is a key-value pair, also called a name-value pair. This pair is delimited by an equals sign "=", and the key or name always appears to the left of the equals sign. The section is more like a room that groups all its parameters together. It always appears on a single line within a pair of square brackets "[ ]", and sections may not be nested. A comment is some descriptive text which begins with a semicolon ";". Hence anything between the semicolon and the end of the line is ignored.
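A minimal sketch of such an INI generation step is given below; the comment line and the use of the service name as section name are assumptions of this sketch, not the actual "generateINI" code of figure 3-21.

import java.io.PrintWriter;
import java.util.LinkedHashMap;
import java.util.Map;

public class GenerateIniSketch {
    // Writes the service properties as an INI file: a comment, one section, key-value pairs.
    public static void generateINI(String path, String serviceName, Map<String, String> properties) throws Exception {
        try (PrintWriter out = new PrintWriter(path + "/" + serviceName + ".ini", "UTF-8")) {
            out.println("; service properties extracted by the Deep Web Service Crawler");  // comment part
            out.println("[" + serviceName + "]");                                            // section part
            for (Map.Entry<String, String> p : properties.entrySet()) {
                out.println(p.getKey() + "=" + p.getValue());                                // parameter part
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("Service Name", "BLZService");
        props.put("Rating", "Four stars and a half");
        generateINI(".", "1BLZService", props);
    }
}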

(4) "generateDatabase" sub function

The inputs of the "generateDatabase" sub function are the same as those of the previous two sub functions. Instead of transforming them into a file like XML or INI, the "generateDatabase" sub function turns them into data in the database by using SQL statements. SQL stands for Structured Query Language, a database language designed for accessing and manipulating data in a database. Most of the actions performed on a database are done with SQL statements, and the primary statements of SQL include insert into, delete, update, select, create, alter and drop. Therefore, for the purpose of transforming these service properties into data of the database, this sub function first has to create a database using the "create database" statement. Then it creates a table to store the data. A table is a collection of related data entries and it consists of columns and rows. Since the amount of data for all these five Web Service Registries is not very large, one database table is enough for storing these service properties. Because of that, the field names of the service properties in the columns must be uniform and well-defined across all five Web Service Registries. Afterwards the service properties of each single service can be put into the table as one record with the "insert into" statement of SQL.
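The following minimal JDBC sketch illustrates the "create table" and "insert into" steps just described; the JDBC URL, the table name and the three example columns are assumptions and not the actual schema used by the program.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.Statement;

public class GenerateDatabaseSketch {
    // Creates one table for all service properties and inserts a single service as one record.
    public static void storeService(String serviceName, String wsdlLink, String description) throws Exception {
        try (Connection con = DriverManager.getConnection("jdbc:mysql://localhost/services", "user", "password")) {
            try (Statement st = con.createStatement()) {
                // every property column is declared as TEXT because the length of a property is hard to predict
                st.executeUpdate("CREATE TABLE IF NOT EXISTS service_properties ("
                        + "id INT AUTO_INCREMENT PRIMARY KEY, "
                        + "service_name TEXT, wsdl_link TEXT, description TEXT)");
            }
            try (PreparedStatement ps = con.prepareStatement(
                    "INSERT INTO service_properties (service_name, wsdl_link, description) VALUES (?, ?, ?)")) {
                ps.setString(1, serviceName);
                ps.setString(2, wsdlLink);
                ps.setString(3, description);
                ps.executeUpdate();
            }
        }
    }
}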

3241 Features of the Storage Component

The Storage component has to provide the following features:

l Generate different output formats

The final result of this master program is to store the information of the services on the disk for future work. This Storage component provides three different formats for storing the service properties of the services published in the Web Service Registries. This makes the storage of the services very flexible and also long-lived.

l Obtain the WSDL document

The important task of this component is to check whether the WSDL document can be obtained from the WSDL link, because the WSDL document plays a decisive role in determining the quality of the service. This Storage component provides the ability to deal with the different situations that occur in the process of obtaining the WSDL document.


3242 Input of the Storage Component

This component requires the following input data

l WSDL link of each service

l The property information of each service

3243 Output of the Storage Component

The component will produce the following output data

l WSDL document of the service

l XML document, INI file and table records in the database

3244 Demonstration for Storage Component

The following figures describe the fundamental implementation code of this Storage component. The detailed depiction is given below.

1) As can be seen from figure 3-19 to figure 3-21, there are several common places among the implementation code. The first common place concerns the parameters defined in each of these sub functions, which are "path" and "SecurityInt". The parameter "path" is the absolute path on the computer disk; it is used for the storing procedure and has already been specified by the user at the beginning of the whole program. The parameter "SecurityInt" is an increasing Integer which is used as a part of the name of the service; the reason for this is that it prevents services with the same name from overriding each other on the disk. The content of the red mark among the code in these figures is the second common place. Its function is to create a file or document for the service with the corresponding parameters.

2) Figure 3-19 displays the implementation code of the "getWSDL" sub function, which is designed to get the WSDL document of the service based on its WSDL link. The parameter "name" is the name of the service, which is used as the name of the WSDL document. The parameter "linkStr" is the actual WSDL link, which is the most important one for this sub function. The other two parameters, "statistic" and "log", are the objects of the text files called "Statistic Information" and "Log Information" respectively. The "Statistic Information" text file is used to record the statistical data of the services in each Web Service Registry, such as the overall number of properties for that Web Service Registry, the overall number of services, the number of services that have no WSDL link, the number of services whose WSDL documents contain no content, the number of services whose WSDL links are not available, etc. The "Log Information" text file records the results of the process steps and the problems encountered, for example which service is being crawled at the moment, which Web Service Registry it belongs to, the reason why the WSDL document of a service cannot be obtained, the reason why a service is unreachable, and so on.


Figure 3-19 Implementation code for getting the WSDL document

3) Figure 3-20 and figure 3-21 show the code for turning the service properties into the XML file and the INI file, and for storing those two files on the disk thereafter. The parameter "vec" is a Vector of the "PropertyStruct" data type. This "PropertyStruct" data type is a class that consists of two variables, name and value.

Figure 3-20 Implementation code for generating the XML file


Figure 3-21 Implementation code for generating the INI file

4) The code in figure 3-22 and figure 3-23 shows the process that turns the service properties of all services in these five Web Service Registries into records of the database. Therefore a database has to be created first. The name of the database can be arbitrary as long as it conforms to the naming rules of the database; the same holds for the name of the table. Figure 3-22 displays the procedure of creating a table in the database. Because it is hard to predict the length of each service property, the data types of all service property columns are set to "Text". Figure 3-23 shows the code for inserting the service properties into the table with the "insert into" statement.

Figure 3-22 Implementation code for creating the table in the database


Figure 3-23 Implementation code for generating table records

33 Multithreaded Programming for DWSC

Multithreaded programming is a built-in characteristic provided by the Java language. A multithreaded program contains two or more separate parts that can execute concurrently. Each part of such a program is called a thread. The use of multithreading makes it possible to create programs that use the system resources efficiently, for example by making maximum use of the CPU, because the idle time of the CPU can be kept to a minimum.

In this master program there are five Web Service Registries whose services need to be crawled. Moreover, the number of services published in each Web Service Registry is quite different, which makes the running time spent on each Web Service Registry different as well. Without multithreading, a Web Service Registry that owns fewer services would have to wait until another Web Service Registry with many more services finishes. Therefore, in order to reduce this waiting time and maximize the use of the system resources, it is necessary to apply multithreaded programming to this master program. That is to say, this master program creates one thread for each Web Service Registry, and these threads are executed independently. A minimal sketch of this idea is given below.
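The following sketch shows the one-thread-per-registry idea; the registry names are taken from this thesis, while the "crawlRegistry" method is only a placeholder for the actual crawling steps and not the real classes of the crawler.

public class CrawlerThreadsSketch {
    public static void main(String[] args) throws InterruptedException {
        String[] registries = {"Service Repository", "Ebi", "Xmethods", "Seekda", "Biocatalogue"};
        Thread[] threads = new Thread[registries.length];
        for (int i = 0; i < registries.length; i++) {
            final String registry = registries[i];
            threads[i] = new Thread(() -> crawlRegistry(registry));  // each registry runs independently
            threads[i].start();
        }
        for (Thread t : threads) {
            t.join();                                                 // wait until every registry has finished
        }
    }

    private static void crawlRegistry(String registry) {
        System.out.println("Crawling " + registry + " ...");
        // the actual extraction, grabbing and storing steps would be called here
    }
}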

34 Sleep Time Configuration for Web Service Registries

Since this master program is intended to download the WSDL documents and extract the service information of the Web services published in the Web Service Registries, it inevitably affects the performance of these registries. In addition, in order not to exceed their throughput capability, these Web Service Registries restrict the rate of access. Because of that, unknown errors sometimes occur while this master program is executing. For instance, the program may halt at one point without getting any more WSDL documents and service information, the WSDL documents of some services in some Web Service Registries cannot be obtained, or some service information is missing. Therefore, in order to obtain the largest possible set of Web services published in these five Web Service Registries without affecting their throughput, the access rate for each service has to be configured for all Web Service Registries.

Consequently, before going into the essential procedure for each single service of these Web Service Registries, this master program calls the system's built-in function "sleep(long milliseconds)". It is a public static function which causes the currently executing thread to sleep for the specified number of milliseconds; in other words, it temporarily ceases execution for a while. The following table shows the time interval of the sleep function for each Web Service Registry; a small usage sketch follows the table.

Web Service Registry Name Time Interval (milliseconds)

Service Repository 8000

Ebi 3000

Xmethods 10000

Seekda 20000

Biocatalogue 10000

Table 3-15 Sleep Time of these five Web Service Registries
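The following minimal sketch shows how the sleep intervals of table 3-15 could be applied before processing each single service; the map-based configuration and the default value are assumptions of this sketch.

import java.util.HashMap;
import java.util.Map;

public class SleepConfigSketch {
    // Sleep intervals in milliseconds per registry, taken from table 3-15.
    private static final Map<String, Long> SLEEP_MS = new HashMap<>();
    static {
        SLEEP_MS.put("Service Repository", 8000L);
        SLEEP_MS.put("Ebi", 3000L);
        SLEEP_MS.put("Xmethods", 10000L);
        SLEEP_MS.put("Seekda", 20000L);
        SLEEP_MS.put("Biocatalogue", 10000L);
    }

    // Called before processing each single service of a registry to throttle the access rate.
    public static void throttle(String registry) throws InterruptedException {
        Thread.sleep(SLEEP_MS.getOrDefault(registry, 5000L));  // default interval is an assumption
    }
}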


4 Experimental Results and Analysis

This chapter shows the quantitative experimental results of the prototype presented in chapter 3, together with a description and explanation of the analysis of these results. In order to gain a rather accurate result, the experiments were carried out more than five times; all data displayed in the following tables and charts are the averages of these runs.

41 Statistic Information for Different Web Service Registries

This section discusses the amount statistics of the Web services published in these five Web Service Registries. It includes the overall number of Web services published in each Web Service Registry and the number of unavailable Web services, which have been archived because they may not be active anymore or are close to being inactive. Table 4-1 shows the service amount statistics of these five Web Service Registries.

Web Service Registry Name: Service Repository, Ebi, Xmethods, Seekda, Biocatalogue

Overall Services: 57, 289, 382, 853, 2567

Unavailable Services: 0, 0, 0, 0, 125

Table 4-1 Service amount statistic of these five Web Service Registries

Nevertheless, in order to have an intuitive view of the service amount statistics of these five Web Service Registries, figure 4-1 contains a bar chart derived from table 4-1. As can be seen from the bar chart, on the one hand there is an ascending increase of the overall number of Web services from the Service Repository Web Service Registry to the Biocatalogue Web Service Registry. The Biocatalogue Web Service Registry in particular owns the largest number of Web services. This indicates that the Biocatalogue Web Service Registry has a much more powerful ability to provide Web services to users, because it contains far more services than the other four Web Service Registries. On the other hand, there are no unavailable services in any Web Service Registry except the Biocatalogue Web Service Registry. That is to say, only the Biocatalogue Web Service Registry contains some Web services that cannot be used by users. To some degree this is useless, because these services cannot be used anymore and they waste network resources on the Web. Therefore all these unavailable services should be eliminated in order to reduce this waste of network resources.


Figure 4-1 Service amount statistic of these five Web Service Registries

42 Statistic Information for WSDL Document

Web Service Registry Name: Service Repository, Ebi, Xmethods, Seekda, Biocatalogue

Failed WSDL Links: 1, 0, 23, 145, 32

Without WSDL Links: 0, 0, 0, 0, 16

Empty Content: 0, 0, 2, 0, 2

Table 4-2 Statistic information for WSDL Document

Table 4-2 and figure 4-2 show the error statistics of the WSDL documents of the Web services in these five Web Service Registries. There are three aspects to this. The first one is the "Failed WSDL Links" of the Web services in these Web Service Registries. It is the overall number of Web services whose WSDL links are invalid; in other words, it is impossible to get the WSDL documents of these Web services from the URL addresses of their WSDL links, and therefore no WSDL document is created. The second aspect is the "Without WSDL Links" of the Web services for these Web Service Registries. It is the overall number of Web services in each Web Service Registry that have no WSDL link at all. That is to say, there is no WSDL document for such Web services, and the value of the WSDL link of such a Web service is "NULL". However a WSDL document is still created, but it has no content and its name contains the string "[No WSDL Document]". The third aspect is the "Empty Content", which represents the overall number of Web services whose WSDL links and URL addresses


are valid but whose WSDL documents contain no content. In this case a WSDL document whose name contains the string "(BAD)" is created.

Figure 4-2 Statistic information for WSDL Document

43 Comparison of Different Average Number of Service Properties

This section compares the average number of service properties in these five Web Service Registries. This average number of service properties is calculated by means of the following equation:

ASP = ONSP / ONS    (1)

Where

ASP is the average number of service properties for one Web Service Registry

ONSP is the overall number of service properties in one Web Service Registry

ONS is the overall number of Web services that have already been crawled from the corresponding Web Service Registry

Figure 4-3 shows the average number of service properties per Web service in these five Web Service Registries. As already mentioned, one of the measurements for assessing the quality of Web services in a Web Service Registry is the service information: the more information a Web service has, the better one knows that service, and consequently the corresponding Web Service Registry can offer better quality Web services to its users. As seen from figure 4-3, the Service Repository and Biocatalogue Web Service Registries own a larger number of service properties than the other three Web Service Registries. This directly reflects that these two Web Service Registries provide more detailed information about the Web services published in them, so that users can more easily choose the service they need, and would also be more willing to use the Web services published in these two Web Service Registries. By contrast the Xmethods and Seekda


Web Service Registries, which have less service information about their Web services, offer less quality for these Web services. Therefore users may not like to use the Web services provided in these two Web Service Registries, not to mention the Web services published in the Ebi Web Service Registry.

Figure 4-3 Average Number of Service Properties

From the description presented in section 323, the causes of the different numbers of service properties in these Web Service Registries may consist of the following points. First, the amount of structured information for the Web services differs between these five Web Service Registries; part of the information for some Web services in one Web Service Registry may even be missing or have an empty value. For example, the amount of structured information supposed to be extracted for the Web services in the Ebi Web Service Registry differs greatly from that in the Biocatalogue Web Service Registry. Secondly, there is a certain amount of endpoint information for almost all Web Service Registries except the Ebi Web Service Registry; this more or less reduces the overall number of service properties. Thirdly, some Web Service Registries, such as Xmethods and Ebi, do not have monitoring information, whereas the Service Repository Web Service Registry in particular has a large amount of monitoring information about its Web services that can be extracted from the Web. Obviously, the last point is the amount of Whois information for these Web services. If the database of the Whois client does not contain information about the service domain of a Web service in one Web Service Registry, then no Whois information can be extracted. Moreover, even if there is information about the service domain, the amount of this information can be very diverse. Therefore, if many service domains of the Web services in a registry have no or little Whois information, then the average number of service properties in that registry decreases greatly.

As a result, in order to help users get better acquainted with the Web services provided in a Web Service Registry and distinguish these Web services, the Web Service Registry should do its best to offer more and more information for each of its published Web services.


44 Different Outputs of Web Services

The fundamental task of this master program is to obtain the WSDL documents of the Web services hosted in the Web Service Registries, as well as to extract and gather the properties of these Web services and thereafter store them on disk. Therefore this section describes the different outputs of this master program, which include the WSDL documents of the Web services, the generated XML and INI files, and the data records of the service properties in the database.

Figure 4-4 WSDL Document format of one Web service

The WSDL document of the Web service is simply read from the Web according to the URL address of its WSDL link, and these data are then stored on the disk as the WSDL document. The name of the WSDL document is the service name plus the ending ".wsdl", which marks it as a WSDL document. In order to distinguish WSDL documents whose names are the same while their contents differ, the name of each obtained WSDL document in one Web Service Registry contains a unique Integer in front of the name. Figure 4-4 shows the valid WSDL document format of the Web service; its name is "1BLZService.wsdl".

Besides, the obtained service properties are transformed into an XML file, an INI file and data records in the database. Figure 4-5, figure 4-6 and figure 4-7 show these three different output formats respectively. Figure 4-5 shows the INI file of the Web service, whose name is "1BLZService.ini". The Integer is the same as in the WSDL document, because they are materials belonging to the same Web service. The first three lines in that INI file are service comments, which run from the semicolon to the end of the line; they are the basic descriptive information of this INI file. The following line is the section, enclosed in a pair of brackets. It is important because it indicates that the lines behind it are the information of this Web service. Therefore the rest of the lines are the actual service information, each existing as a key-value pair with an equals sign between key and value. Each service property is displayed from the beginning of the line.


Figure 4-5 INI File format of one Web service

Figure 4-6 XML File format of one Web service

Figure 4-7 Database data format for all Web services


Furthermore, figure 4-6 shows the generated XML file format of the Web service "BLZService"; its name is "1BLZService.xml". Needless to say, this XML file is part of the materials of the same Web service. Although the format of the XML file is different from that of the INI file, their essential contents are the same; that is to say, the values of the service properties do not differ, because both files are generated from the same collection of properties of the same Web service. The XML file also has some comments, like those in the INI file, which are displayed between "<!--" and "-->". The section in the INI file corresponds roughly to the root element in the XML file. Therefore all values of the elements under the root "service" in this XML file are the values of the service properties of this Web service.

Eventually, as can be seen from figure 4-7, there is a database table which is used to store the service information of all Web services in these five Web Service Registries. The entire service information of one Web service is exactly one record in this table. Because of that, the column names of this table are the union of the names of the service information of each Web Service Registry. However, since the column names of the table must be unique, the redundant names in this union have to be eliminated. This is sensible and possible because the names of the service information are well-defined and uniform for all these five Web Service Registries. In addition, the first column of this table is the primary key, which is an increasing Integer; its function is similar to that of the Integer contained in the names of the XML file and INI file. The remaining columns of the table are the corresponding service properties of the Web services. The value "NULL" or "Null" in this table indicates that this property of that Web service is empty or missing.

45 Comparison of Average Time Cost for Different Parts of Single Web Service

This section describes the comparison of the average time cost of the different parts of getting one single Web service in all these five Web Service Registries. At first the average time cost of getting one single service in a Web Service Registry has to be calculated. It is obtained through the following equation:

ATC = OTS / ONS    (2)

Where

ATC is the average time cost for one single Web service

OTS is the overall time cost of all the Web services in one Web Service Registry

ONS is the overall number of Web services that have already been crawled from the corresponding Web Service Registry

In addition, the average time cost for getting one single service consists of the following six parts: the average time cost for extracting the service properties, the average time cost for obtaining the WSDL document, the average time cost for generating the XML file, the average time cost for generating the INI file, the average time cost for inserting the service properties into the table of the database, and the average time cost for some other procedures, such as getting the service list page link, getting the service page link and so on. The average time cost for extracting the service properties is obtained by means of the following equation:


ATCSI = OTSSI / ONS    (3)

Where

ATCSI is the average time cost for extracting the service properties of one single Web service

OTSSI is the overall time cost for extracting the service properties of all the Web services in one Web Service Registry

ONS is the overall number of Web services that have already been crawled from the corresponding Web Service Registry

The calculation of the other parts is similar to the equation for calculating the average time cost for extracting the service properties, while the average time cost for the other procedures equals the average time cost for one single Web service minus the sum of the average time costs of the other five parts.

Registry Name: Service Property, WSDL Document, XML File, INI File, Database, Others, Overall (all values in milliseconds)

Service Repository: 8801, 918, 2, 1, 53, 267, 10042

Ebi: 699, 82, 2, 1, 28, 11, 823

Xmethods: 5801, 1168, 2, 1, 45, 12, 7029

Seekda: 5186, 1013, 2, 1, 41, 23, 6266

Biocatalogue: 39533, 762, 2, 1, 66, 1636, 42000

Table 4-3 Average time cost information for all Web Service Registries

Table 4-3 displays the average time cost of one Web service and of its different parts in all these five Web Service Registries. The first column of table 4-3 contains the names of these five Web Service Registries and the last column is the average time cost for a single service in the corresponding Web Service Registry, while the remaining columns are the average time costs of the six different parts. In order to have an intuitive view of the data in table 4-3, the data in each column of this table are illustrated with the corresponding figures 4-8 to 4-13.

Figure 4-8 Average time cost for extracting service property in all Web Service Registries


As can be seen from figure 4-8, the average time cost for extracting the service properties in the Biocatalogue Web Service Registry is 39533 milliseconds. This value is much larger than in the other four Web Service Registries, which take 8801, 699, 5801 and 5186 milliseconds for Service Repository, Ebi, Xmethods and Seekda respectively. That is to say, it takes much longer to extract the service properties of the Web services published by the Biocatalogue Web Service Registry. In addition, this indirectly indicates that the Biocatalogue Web Service Registry has the highest average number of service properties, which was already discussed in section 43. On the contrary, the average number of service properties in the Ebi Web Service Registry is the smallest. Moreover, the average time cost in the Xmethods Web Service Registry is larger than that in the Seekda Web Service Registry, although, as already known, the average number of service properties is the same for these two Web Service Registries. Nevertheless there is a cause that might explain why the average time in Xmethods is higher than in Seekda: the process of extracting service properties in the Xmethods Web Service Registry has to be executed by means of both the service page and the service list page, while only the service page link is needed for the Seekda Web Service Registry.

The average time cost for obtaining the WSDL document in all these five Web Service Registries is displayed in figure 4-9. This average time cost is the sum of the average time for extracting the WSDL link of a Web service and the average time for reading the data of the WSDL document from the Web and then storing it on the disk. As seen from this figure, the average time cost for obtaining the WSDL document in the Xmethods Web Service Registry is the largest, namely 1168 milliseconds. Although the process of extracting the WSDL link costs a certain amount of time, it does not have a significant influence on the total average time spent on obtaining the WSDL document, because the WSDL link of one Web service is almost always gained in one step. Therefore this implies that the average size of the WSDL documents in the Xmethods Web Service Registry is larger than in the other four Web Service Registries, in particular the Ebi Web Service Registry, which takes just 82 milliseconds for the process of obtaining the WSDL document.

Figure 4-9 Average time cost for obtaining WSDL document in all Web Service Registries


Figure 4-10, figure 4-11 and figure 4-12 show the average time cost of generating the three different outputs in all these five Web Service Registries. As can be seen from figures 4-10 and 4-11, the average time for generating the XML file of one Web service is the same for all five Web Service Registries, namely only 2 milliseconds. Likewise, the average time for generating the INI file of one Web service is identical across the registries, with a value of just 1 millisecond. Even the sum of these two average time costs is so small that it can be neglected when compared with the overall average time cost of getting one Web service for each Web Service Registry, which is shown in figure 4-13. This implies that the generation of the XML and INI files is finished almost immediately after the service properties of one Web service have been received. Furthermore, figure 4-12 shows that although the average time cost of creating the database record for each Web service is larger than the time for generating the XML and INI files in all five Web Service Registries, the database insertion is still a fast operation.

Figure 4-10: Average time cost for generating the XML file in all Web Service Registries
(bar chart; the value is 2 milliseconds for each of the five registries)
Figure 4-11: Average time cost for generating the INI file in all Web Service Registries
(bar chart; the value is 1 millisecond for each of the five registries)


Figure 4-12: Average time cost for creating the database record in all Web Service Registries
Figure 4-13: Average time cost for getting one Web service in all Web Service Registries

Figure 4-13 gives the average time cost for getting one single Web service in all these five Web Service Registries. Without any doubt, the Biocatalogue Web Service Registry takes the longest time for this process. This is because, as the presentation of the five individual parts above has shown, almost every part needs more time in the Biocatalogue Web Service Registry than in the other registries; the only exception is the process of obtaining the WSDL document, where Biocatalogue does not have the highest cost. The overall value is roughly the sum of the individual parts; for instance, for the Ebi Web Service Registry the parts add up to 699 + 82 + 2 + 1 + 28 = 812 milliseconds, which is close to its overall value of 823 milliseconds. Moreover, an interesting observation can be made when looking at figures 4-8, 4-12 and 4-13: the shapes of these curves follow almost the same trend. This further indicates that a Web Service Registry that spends more time on getting the description information of one Web service also offers more information about that Web service.

(Figure 4-12, bar chart; values in milliseconds: Service Repository 53, Ebi 28, Xmethods 45, Seekda 41, Biocatalogue 66)
(Figure 4-13, bar chart; values in milliseconds: Service Repository 10042, Ebi 823, Xmethods 7029, Seekda 6266, Biocatalogue 42000)
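All of the values above are per-phase averages over the crawled Web services. The following minimal sketch, which is written for illustration only and is not taken from the actual implementation, shows how such per-phase averages could be collected during a crawl; the class and phase names are assumptions.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: accumulate the time spent in each phase of getting one
// Web service (property extraction, WSDL download, XML/INI generation,
// database record creation) and report the average per phase in milliseconds.
public class PhaseTimer {
    private final Map<String, Long> totalMillis = new HashMap<>();
    private final Map<String, Integer> counts = new HashMap<>();

    // Measure one execution of a phase and add it to the statistics.
    public void time(String phase, Runnable work) {
        long start = System.currentTimeMillis();
        work.run();
        long elapsed = System.currentTimeMillis() - start;
        totalMillis.merge(phase, elapsed, Long::sum);
        counts.merge(phase, 1, Integer::sum);
    }

    // Average duration of a phase over all measured executions.
    public double averageMillis(String phase) {
        return (double) totalMillis.getOrDefault(phase, 0L)
                / Math.max(1, counts.getOrDefault(phase, 0));
    }
}
```

With such a helper, a call like timer.time("properties", () -> grabProperties(link)) (grabProperties being a hypothetical method) would contribute one sample per crawled Web service, and per-phase means of this kind are what the figures and table 4-3 report.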


5 Conclusion and Further Direction

This master thesis provides an approach which aims to explore the description information of the Web services hosted in different Web Service Registries. The description information of a Web service consists of the WSDL document and the service information about that Web service. Although there is an existing approach, the Pica-Pica Web Service Description Crawler, that can be used to obtain the WSDL document and service information of the Web services in these Web Service Registries, its functionality is restricted: it explores only a small subset of the Web services hosted in these Web Service Registries, the storage of these Web services is not flexible, and, most importantly, only a few service properties are extracted; for some Web Service Registries no service information is extracted at all. The work presented in this master thesis, in contrast, is able to explore all Web services in these Web Service Registries. Moreover, with the approach presented in this master thesis the service information of each Web service is extracted as completely as possible, so that the final result is the largest annotated Service Catalogue ever produced. Furthermore, regarding the storage of the description information of the Web services, this master thesis provides three different ways that guarantee not only the completeness but also the longevity of the description information of the Web services.

However, in the implementation performed in this master thesis, the whois client used for querying the information of a service domain returns free text if the information exists, and this free text can differ completely from one domain to another. As a consequence, every Web service in all Web Service Registries had to be crawled at least once during the experiment stage, so that all variants of this free text could be foreseen and handled afterwards. Nevertheless, this is an enormous amount of work because there are a lot of Web services in these Web Service Registries. Therefore, in order to simplify this work, another whois client that eases the parsing of this information should be found and used.
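To illustrate the problem, the sketch below shows how a few variants of such free-text whois output might be matched with regular expressions; the field labels used here are examples only, since every registry operator may format its output differently.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch: whois output is free text and its field labels vary
// between registry operators, so several alternative spellings have to be
// tried for each property of interest.
public class WhoisParser {
    // Example label variants for the registrant name; real output may use others.
    private static final Pattern REGISTRANT = Pattern.compile(
            "^(?:Registrant Name|registrant|holder):\\s*(.+)$",
            Pattern.MULTILINE | Pattern.CASE_INSENSITIVE);

    public static String extractRegistrant(String whoisText) {
        Matcher m = REGISTRANT.matcher(whoisText);
        return m.find() ? m.group(1).trim() : null;
    }
}
```

Every additional output variant discovered during crawling would require a new alternative in such a pattern, which is exactly the maintenance burden described above.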

Moreover, in the experiment stage of this master thesis the time cost for getting one Web service is still large. For the purpose of reducing this time, multithreaded programming could be applied to some parts of the process of getting one Web service.
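One possible way to do this, sketched below under the assumption that the work for different services is independent, is to process several service page links in parallel with a fixed thread pool; crawlOneService stands for the existing single-service routine and is not a name taken from the actual implementation.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Illustrative sketch: crawl several Web services concurrently instead of
// strictly one after another.
public class ParallelCrawl {
    public static void crawlAll(List<String> servicePageLinks) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(8); // bounded pool size
        for (String link : servicePageLinks) {
            pool.submit(() -> crawlOneService(link)); // hypothetical single-service routine
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }

    private static void crawlOneService(String link) {
        // placeholder for the existing steps: extract the properties, download
        // the WSDL document and store the results
    }
}
```

Of course, the sleep time configuration discussed earlier would still have to be respected for each Web Service Registry, so that parallel requests do not overload the crawled sites.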

Although the work performed here is specialized for these five Web Service Registries only, the main parts of the principles used here are adaptable to other Web Service Registries with only small changes in the implementation code or the structure.


6 Bibliography

[1] Nathalie Steinmetz, Holger Lausen, Dario Cerizza, Andrea Turati, Irene Celino, Adam Funk and Michael Erdmann, "D1.1 – Requirement Analysis and Architectural Plan". Available from http://www.service-finder.eu/deliverables/public/7-public/32-d11, Emanuele Della Valle (CEFRIEL), June 27, 2008.
[2] Nathalie Steinmetz, Holger Lausen, Dario Cerizza, Andrea Turati, Irene Celino, Adam Funk and Michael Erdmann, "D1.2 - First Design of Service-Finder as a Whole". Available from http://www.service-finder.eu/deliverables/public/7-public/37-d12-first-design-of-service-finder-as-a-whole, Emanuele Della Valle (CEFRIEL), July 1, 2008.
[3] Nathalie Steinmetz, Holger Lausen, Irene Celino, Dario Cerizza, Saartje Brockmans, Adam Funk, "D1.3 – Revised Requirement Analysis and Architectural Plan". Available from http://www.service-finder.eu/deliverables/public/7-public/57-d13-revised-requirement-analysis-and-architectural-plan, Emanuele Della Valle (CEFRIEL), April 1, 2009.
[4] Chia-Hui Chang, Mohammed Kayed, Moheb Ramzy Girgis, Khaled Shaalan, "A Survey of Web Information Extraction Systems", Volume 18, Issue 10, IEEE Computer Society, pp. 1411-1428, October 2006.
[5] Leonard Richardson, "Beautiful Soup Documentation". Available from http://www.crummy.com/software/BeautifulSoup/documentation.html, October 13, 2008.
[6] Hao He, Hugo Haas, David Orchard, "Web Services Architecture Usage Scenarios". Available from http://www.w3.org/TR/ws-arch-scenarios/, February 11, 2004.
[7] Stephen Soderland, "Learning Information Extraction Rules for Semi-Structured and Free Text", Volume 34, Issue 1-3, Journal of Machine Learning, Department of Computer Science and Engineering, University of Washington, Seattle, pp. 233-272, February 1999.
[8] Ian Hickson, "A Vocabulary and Associated APIs for HTML and XHTML", World Wide Web Consortium, Working Draft WD-html5-20100624, January 22, 2008.
[9] Holger Lausen, Jos de Bruijn, Axel Polleres, Dieter Fensel, "The Web Service Modeling Language WSML", WSML Deliverable D16.1 v0.2, March 20, 2005. Available from http://www.wsmo.org/TR/d16/d16.1/v0.2/
[10] Dumitru Roman, Holger Lausen, Uwe Keller, "Web Service Modeling Ontology - Standard (WSMO-Standard)", WSMO Deliverable D2, version 1.1, 06 March 2004. Available from http://www.wsmo.org/TR/d2/v1.1/
[11] Iris Braun, Anja Strunk, Gergana Stoyanova, Bastian Buder, "ConQo – A Context- and QoS-Aware Service Discovery", TU Dresden, Department of Computer Science, in Proceedings of WWW/Internet 2008.


7 Appendixes

There are two additional outputs of this master program: a log information file and a statistic report file. Figure 8-1 shows one of the basic output formats of the log information for these five Web Service Registries.

Figure 8-1: Log information of the "Service Repository" Web Service Registry
Figure 8-2: Statistic information of the "Service Repository" Web Service Registry


Figures 8-2 to 8-6 show the output format of the statistic information for the "Service Repository", "Ebi", "Xmethods", "Seekda" and "Biocatalogue" Web Service Registries respectively.

Figure 8-3: Statistic information of the "Ebi" Web Service Registry
Figure 8-4: Statistic information of the "Xmethods" Web Service Registry


Figure 8-5: Statistic information of the "Seekda" Web Service Registry
Figure 8-6: Statistic information of the "Biocatalogue" Web Service Registry


Table of Figures

Figure 2-1: Dataflow of Service-Finder and Its Components ... 12
Figure 2-2: Left is the free text input type and right is its output ... 16
Figure 2-3: A Semi-structured page containing data records (in rectangular box) to be extracted ... 16
Figure 2-4: Architecture of the Pica-Pica Web Service Description Crawler ... 20
Figure 3-1: Overview of the Basic Architecture for the Deep Web Services Crawler ... 25
Figure 3-2: Overview the process flow of the Web Service Extractor Component ... 27
Figure 3-3: Service list page of the Service-Repository ... 29
Figure 3-4: Original source code of the internal link for Web service "BLZService" ... 29
Figure 3-5: Code Overview of getting service page link in Service Repository ... 29
Figure 3-6: Service page of the Web service "BLZService" ... 29
Figure 3-7: Overview the process flow of the WSDL Grabber Component ... 30
Figure 3-8: WSDL link of the Web service "BLZService" in the service page ... 31
Figure 3-9: Original source code of the WSDL link for Web service "BLZService" ... 32
Figure 3-10: Code overview of "getServiceRepositoryWSDLLink" function ... 32
Figure 3-11: Code overview of "oneParameter" function ... 32
Figure 3-12: Overview the process flow of the Property Grabber Component ... 33
Figure 3-13: Structure properties of the Service "BLZService" in service list page ... 37
Figure 3-14: Structure properties of the Service "BLZService" in service page ... 38
Figure 3-15: Endpoint information of the Web service "BLZService" in service page ... 38
Figure 3-16: Monitoring Information of the Service "BLZService" in service page ... 39
Figure 3-17: Whois Information of the service domain "thomas-bayer.com" ... 40
Figure 3-18: Overview the process flow of the Storage Component ... 41
Figure 3-19: Implementation code for getting WSDL document ... 44
Figure 3-20: Implementation code for generating XML file ... 44
Figure 3-21: Implementation code for generating INI file ... 45
Figure 3-22: Implementation code for creating table in database ... 45
Figure 3-23: Implementation code for generating table records ... 46
Figure 4-1: Service amount statistic of these five Web Service Registries ... 49
Figure 4-2: Statistic information for WSDL Document ... 50
Figure 4-3: Average Number of Service Properties ... 51
Figure 4-4: WSDL Document format of one Web service ... 52
Figure 4-5: INI File format of one Web service ... 53
Figure 4-6: XML File format of one Web service ... 53
Figure 4-7: Database data format for all Web services ... 53
Figure 4-8: Average time cost for extracting service property in all Web Service Registries ... 55
Figure 4-9: Average time cost for obtaining WSDL document in all Web Service Registries ... 56
Figure 4-10: Average time cost for generating XML file in all Web Service Registries ... 57
Figure 4-11: Average time cost for generating INI file in all Web Service Registries ... 57
Figure 4-12: Average time cost for creating database record in all Web Service Registries ... 58
Figure 4-13: Average time cost for getting one Web service in all Web Service Registries ... 58


Table of Tables

Table 3-1: Structured Information of Service-Repository Web Service Registry ... 34
Table 3-2: Structured Information of Xmethods Web Service Registry ... 34
Table 3-3: Structured Information of Seekda Web Service Registry ... 34
Table 3-4: Structured Information of Ebi Web Service Registry ... 34
Table 3-5: Structured Information of Biocatalogue Web Service Registry ... 34
Table 3-6: SOAP Operation Information of Biocatalogue Web Service Registry ... 35
Table 3-7: REST Operation Information of Biocatalogue Web Service Registry ... 35
Table 3-8: Endpoint Information of these five Web Service Registries ... 35
Table 3-9: Monitoring Information of these five Web Service Registries ... 35
Table 3-10: Whois Information for these five Web Service Registries ... 36
Table 3-11: Extracted Structured Information of Web Service "BLZService" ... 38
Table 3-12: Extracted Endpoint Information of the Web service "BLZService" ... 39
Table 3-13: Extracted Monitoring Information of the Web service "BLZService" ... 39
Table 3-14: Extracted Whois Information of service domain "thomas-bayer.com" ... 40
Table 3-15: Sleep Time of these five Web Service Registries ... 47
Table 4-1: Service amount statistic of these five Web Service Registries ... 48
Table 4-2: Statistic information for WSDL Document ... 49
Table 4-3: Average time cost information for all Web Service Registries ... 55


Table of Abbreviations

DTD Document Type Definition

DWSC Deep Web Service Crawler

HTML HyperText Markup Language

MTBF Mean Time between Failures

MTTR Mean Time to Recovery

QoS Quality of Service

REST Representational State Transfer

RTT Round Trip Time

SOAP Simple Object Access Protocol

SQL Structured Query Language

URL Uniform Resource Locator

WHATWG Web Hypertext Application Technology Working Group

XML eXtensible Markup Language

WSDL Web Service Description Language

WSML Web Service Modeling Language

WSMO Web Service Modeling Ontology

Page 8: Deep Web Service Crawler

Deep Web Service Crawler

8

these problems In this section it will present the basic principle of Deep Web Service Crawler

approach

At first it is the simply introduction of the Deep Web Service Crawler approach to these problems As

have already been mentioned each Web Service Registry can offer Web Services Moreover each

Web Service Registry has its own html page structures These structures may be the same or even

complete different Therefore the first thing is to identify which Web Service Registry that it will be

going to explore Since each Web Service Registry owns a unique URL this job can be done by directly

analyzing the corresponding URL address of that Web Service Registry After identifying which Web

Service Registry it is going to explore the following step is to obtain all these Web Services that

published in that Web Service Registry Then with all these obtained Web Services it is time to extract

analyze and gather the information of the services That information can be in structured format or

even in unstructured format In this master thesis some Deep Web Analysis Techniques will be

applied to obtain this information So that the information about each Web Service shall be the

largest annotated The last but not the least important all the information about the Web Services

need to be stored

13 Goals of this Master Thesis

The lists in the following are the goals of this master thesis

n Produce the largest annotated Service Catalogue

Service Catalogue is a list of service properties The more properties the service has the larger

Service Catalogue it owns Therefore this master program should extract as much service

properties as possible

n Flexible storage of these metadata of each service as annotations or dedicated documents

The metadata of one service includes not only the WSDL document but also service properties

All these metadata are important information for the service Therefore this master program

should provide flexible ways to store these metadata into the disk

n Improve the comparable property of the Web Services across different Web Service Registries

The names of service properties for one Web Service Registry could be different from another

Web Service Registry Hence for the purpose of improving the comparable ability all these

names of the service properties should be uniformed and well-defined

14 Outline of this Master Thesis

In this chapter the motivation objective and initial approach plan have already been discussed

Thereafter the remaining paper is structured as follows

Firstly in chapter 2 the work presented is almost based on these existing techniques In section 21

there is given a detailed introduction to the technique of the Service-Finder project Then in section

22 the already implemented technique of Pica-Pica Web Service Description Crawler is introduction

and discussed After that in section 23 the Information Retrieval technique is presented

Deep Web Service Crawler

9

Then following in chapter 3 it is explained the designing details of this Deep Web Service Crawler

approach In section 31 it gives a short description for the different requirements of this approach

Next in section 32 the actual design for the Deep Web Service Crawler is presented Then in section

33 34 the multithreaded programming and sleep time configuration that used in this master

program are introduced respectively

In chapter 4 it is supposed to display the experiments of this Deep Web Service Crawler approach

and then give some evaluation of it

Finally in Section 5 the conclusion and discussion of the work that already done as well as the work in

the future for this master task are presented respectively

Deep Web Service Crawler

10

2 State of the Art

This chapter aims at presenting some existing techniques or Strategies that related to the work of

applying this Deep Web Service Extraction Crawler approach Section 21 plans to talk about the

existing catalogues Service-Finder project And then in section 22 it is going to explain the existing

implemented crawlers Pica-Pica Web Service Description Crawler Finally in the section 23 it is

supposed to present some details about the Information Extraction technique

21 Service Finder Project

Service-Finder Project aims at developing a platform for Web Service discovery especially for the Web

Services that are embedded in a Web 20 environment [1] Hence it can provide an efficient access to

publicly available services The goals of the Service-Finder project are depicted as follows [1]

n Automatically gather Web Services and their related information

n Semi-automatically create semantic service description based on the information that available

on the Web

n Create and improve semantic annotations via the user feedback

n Describe the aggregated information in semantic models and allow reasoning query

However before describing the basic functionality of the Service-Finder Project there is going to

present one of its use cases and requirements first

211 Use Cases for Service-Finder Project

The Service-Finder Project employed the use case methodology of the W3C Use Case description [6]

for its needs and then applied this methodology to the use cases that it enumerated

2111 Use Case Methodology

There are three aspects needed to be considered for use case definitions of Service-Finder Project [1]

(1) Description that used to describe information of the use case

(2) Actors Roles and Goals that used to identify the actors they would be the roles they act and the

goals they need to achieve in the scenario

(3) Storyboard that used to describe the serial of interactions among the actors and the

Service-Finder Portal

2112 System Administrator

This section is going to present the use case that applied to the Service-Finder portal and that

illustrated the requirements on its functionality from a user point of view However all these

Deep Web Service Crawler

11

information in this use case are derived from [1] In this use case there has a system administrator

whose name is ldquoSam Adamsrdquo He works for the bank His job is to keep the online payment facilities

online and working all day and night Therefore if there is any system failures Sam Adams should fix

the problems as early as he can Thatrsquos why he wants to use an SMS Messaging service which will

alert him immediately by sending him a SMS Message in the case of a system failure

n Description

This use case is dedicated to describe the system administrator ldquoSam Adamsrdquo who is looking for an

SMS Messaging Service that he wants to build it into his application

n Actors Roles and Goals

The name of the Actor is ldquoSam Adamsrdquo His Role is a system administrator at a bank The goals of him

are the immediate service delivery the reliability of the service and low base fee and transaction fee

n Storyboard

Step 1 Because of Sam Adams knows the Service-Finder portal and he also knows that he can find

many useful services from it especially he know what he is looking for Hence he visits the

Service-Finder portal and starts to search by entering the keywords ldquoSMS Alertrdquo

Requirement 1 Search functionality

Step 2 Now the Service-Finder returns a list of matching services However Sam wants to choose the

number of matching services that will be displayed on one page And he would also expect there has

short information about the service functionality the service provider and the service availability So

that he could decide which service he will choose to read further

Requirement 2 Enable configurable pagination of the matching results and have some short

information for each service

Step 3 When Sam looks through the short information about the services that displayed on the first

page he expects to find the most relevant services that related to his request After that he would

like to read more detailed information about that service to see whether this service can provide the

needed functionality

Requirement 3 Rank the returned matching services and must provide ability to read more details of

a service

Step 4 In the case that all the returned matching services Sam got provide quite different

functionalities or they belong to different service categories for example the SMS messaging services

alert users not through SMS but voice messaging For this reason Sam would like to see other

different categories that may be contain the services he wants Or the services of other categories

which he is also interested in (like ldquoSMS Messagingrdquo) Besides another possible way is that Sam can

further filter his search in terms of browsing through categories

Requirement 4 Service categories and allow the user to look all services that belonged to that specific

category If possible it should also allow the user to browse through categories

Step 5 When Sam got all the services that could provide a SMS messaging service via the methods

described in the Step 4 at present he wants to look for the services that offered by an Austrian

provider and have no base fees if possible

Requirement 5 Faceted search

Deep Web Service Crawler

12

Step 6 After Sam got all these specific services now he would like to choose the services that can

provide a high reliability

Requirement 6 Sort functionality based on usersrsquo chooses

Step 7 For now Sam expects to compare the service availability between the promised to the service

provider and the actually provided This should be contained in the servicesrsquo details And there needs

also have service coverage information so that Sam can know whether this service covers the areas

he lives and works Moreover Sam would also like to compare these services in other way For

instance put some services into a structured table to compare the transaction fees

Requirement 7 A side-by-side comparison table for services and a functionality that enable users to

select services he wants to compare

Step 8 At last Sam wants to know whether the service providers offer a free try out of the services

So that he can test the service functionality

Requirement 8 If possible display a note that offering free service trials

212 Architecture Plan for the Service-Finder Project

The architecture plan of the Service-Finder project contains five basic components Service Crawler

Automatic Annotator Conceptual Indexer and Matcher Cluster Engine and Service-Finder Portal

Interface Figure 2-1 presents a high level overview of the components and the data flow among

them

Figure2-1Dataflow of Service-Finder and Its Components [3]

Deep Web Service Crawler

13

2121 The Principle of the Service Crawler Component

The Service Crawler is responsible for gathering available services and their related information from

the Web The overall cycle is depicted as following

(1) A Web developer publishes a Web Service

(2) Then the Crawling component begins to harvest the Web in order to identify the Web Services

like WSDL (Web Service Description Language) documents

(3) The Crawler is also going to search for other related information as long as a service is discovered

(4) Later after each periodic interval the Crawler will produce a consistent snapshot of the relevant

part of the Web

At last the output of the crawler would be forwarded to the subsequent components for analyzing

indexing and displaying

2122 The Principle of the Automatic Annotator Component

The Automatic Annotator receives the relevant data from previous component and generates

semantic service descriptions about the WSDL documents and its related information based on the

Service-Finder Ontology and Service Category Ontology

Firstly it will simply introduce those two compatible ontologies that would be used throughout the

whole process [2]

n Generic Service Ontology it is an ontology which is functional to describe the data objects For

example the services the service providers availability payment modalities and so on

n Service Category Ontology it is an ontology which is used to categorize the functionalities or

applications of the services For instance data verification messaging data storage weather etc

Afterwards it is going to talk about the function of this component with its input output

Oslash Input

u Crawled data from Service Crawler

u Service-Finder Ontologies

u Feedback or Correction of before annotations

Oslash Function

u Enrich the information about the service and extract semantic statements according to the

Service-Finder Ontologies For example categorize the service according to the Service

Category Ontology

u Determine whether a particular document is relevant or not through the Web link graph If

not discard these irrelevant documents

u Classify the pages into their genres For instance pricing user comments FAQ and so on

Oslash Output

u Semantic annotation of the services

Deep Web Service Crawler

14

2123 The Principle of the Conceptual Indexer and Matcher

Component

The Conceptual Indexer and Matcher is more like a data store center that aims at storing all extracted

information of the services and supplying users the capability of retrieval and semantic query For

example the matchmaking between user requests and service offers and the act of retrieving user

feedback on extracted annotations

In addition letrsquos have a look of the function of this component and its input output

Oslash Input

u Semantic annotation data and full text information obtained from Automatic Annotation

u Semantic annotation data and full text information that come from user interfaces

u Cluster data from user and service clustering component

Oslash Function

u Store the semantic annotations received from the Automatic Annotation component and

from the user interface

u Store the cluster data that procured through the clustering component

u Store and index the textual description offered by the Automatic Annotation component

and the textual comments offered by users

u Ontological query the semantic data from the data store center

u Combined keyword and Ontological querying used for user queries

u Provide a list of similar services for a given service

Oslash Output

u A list of matching services that are queried by users In particular these services should be

sorted by ranking and can also be iterated

u All available data that related to a particular entity must be retrievable at the user interface

2124 The Principle of the Service-Finder Portal Interface

Component

The Service-Finder Portal Interface is the main entry point that provided for users of the

Service-Finder system to search and browse the data which is managed by the Conceptual Indexer

and Matcher component In addition the users can also contribute information by means of providing

tags comments categorizations and ratings to the data browsed Furthermore the developers can

still directly invoke the Service-Finder functionalities from their custom applications in terms of an API

Besides the details of this componentrsquos function input and output are represented as below

Oslash Input

u A list of ordered services for a query

u Detailed information about a service or a set of services and a service provider

u Query access to service category ontology and the most used tags provided by the users

Deep Web Service Crawler

15

u Service availability information

Oslash Function

u The Web Interface allows the users to search services by keyword tag or concept in the

categorization sort and filter query results by refining the query compare and bookmark

services try out the services that offer this functionality

u The API allows the developers to invoke Service-Finder functionalities

Oslash Output

u Explicit user annotations such as tags ratings comments decryptions and so on

u Implicit user data for example click stream of users bookmarks comparisons links sent

etc

u Manual advertising of available new services

2125 The Principle of the Cluster Engine Component

The Cluster Engine is in charge of collecting and analyzing information of user behaviors from the

Service-Finder Portal eg the queried services and the compared services of the users Moreover it

also provides cluster data to the Conceptual Indexer and Matcher for providing service

recommendations

Furthermore letrsquos detailed introduce this componentrsquos function input and output

Oslash Input

u Service annotation data of both extracted and user feedback

u Usersrsquo Click streams used for extracting user behaviors

Oslash Function

u Obtain user clusters from user behaviors

u Obtain service clusters from service annotation data to enable to find similar services

Oslash Output

u Clusters of users and services

22 Information Extraction

Due to the cause of the rapid development and use of the World-Wide Web there procedure a huge

amount of information sources on the Internet which has been limited the access to browsing and

searching for the reason of the heterogeneity and the lack of structure of Web information sources

Therefore the appearance of Information Extraction that transforms the Web pages into

program-friendly structures for post-processing would become a great necessity However the task of

Information Extraction is specified in terms of the inputs and the extraction targets And the

techniques used in the process of Information Extraction called extractor

221 Input Types of Information Extraction

Generally speaking there are three different input types The first input type is the unstructured

Deep Web Service Crawler

16

document For example the free text that showed in figure 2-2 It is unstructured and written in

natural language So that it will require substantial natural language processing While the second

input type is called the structured document For instance the XML documents based on the reason

that the data can be described through the available DTD (Document Type Definition) or XML

(eXtensible Markup Language) schema Finally but obviously the third input type is the

semi-structured document that are widespread on the Web Such as the large volume of HTML

pages like tables itemized lists and enumerated lists This is because HTML tags are often used to

render these embedded data in the HTML pages See figure 2-3

Figure2-2Left is the free text input type and right is its output [4]

Figure2-3A Semi-structured page containing data records

(in rectangular box) to be extracted [4]

Therefore in this way the inputs of semi-structured type can be seen as the documents with a fairly

regular structure And the data of these documents can be displayed in a format of HTML way or

non-HTML way Besides owing to the reason that the Web pages of the Deep Web are dynamic and

generated from structured databases in terms of some templates or layouts thus it would be

considered as one of the input sources which could provide some of these semi-structured documents

For example the authors price and comments of the book pages that provided by Amazon have the

Deep Web Service Crawler

17

same layout That is because these Web pages are generated from the same database and applied

with the same template or layout Furthermore there has another option which could manually

generate HTML pages of semi-structured type For example although the publication lists that

provided from different kinds of researchersrsquo homepages are produced by diverse uses they all have

title and source property for every single pager Eventually the inputs for some Information Extraction

can also be the pages with the same class or among various Web Service Registries

222 Extraction Targets of Information Extraction

Moreover regarding the task of the Information Extraction it has to consider the extraction target

There also have two different extraction targets The first one is the relation of k-tuple And the k in

there means the number of attributes in a record Nevertheless in some cases an attribute of one

record may have none instantiation Otherwise the attribute owns multiple instantiations In addition

the complex object with hierarchically organized data would be the second extraction target Though

the ways for depicting the extraction targets in a page are diverse the most common structure is the

hierarchical tree This hierarchical tree may contain only one leaf node or one or more lists of leaf

nodes which called internal nodes And the structure for a data object may also be flat or nested To

be brief if the structure is flat then there is only one leaf node that call also be called root Otherwise

if it is nested structure then the internal nodes that involved in this data object would be more than

two levels

Furthermore in order to make the Web pages readable for human being and having an easier

visualization these tables or tuples of the same list or elements of a tuple should be definitely isolated

or demarcated However the displaying for a data object in a Web page would be affected by

following conditions [4]

Oslash The attribute of a data object has zero or several values

(1) If there is no value for the attribute of a data object this attribute will be called the ldquononerdquo

attribute For example a special offer only available for certain books might be a ldquononerdquo

attribute

(2) If there are more than one values for the attribute of a data object it will be called the

ldquomultiValuerdquo attribute For instance the name of the author for a book could be a

ldquomultiValuerdquo attribute

Oslash The set of attributes (A1 A2 A3 hellip) has multiple ordering

That is to say among this set of attribute the position of the attribute might be changed

according to the diverse instances of a data object Thus this attribute will be called the

ldquomultiOrderingrdquo attribute For instance for the moives before year 1999 the move site would

enumerate the release data in front of the movesrsquo title while for the movies after year 1999

(including 1999) it will enumerate the release data behind the movesrsquo title

Oslash The attribute has different formats

This means the displaying format of the data object could be completely distinct with respect to

these different instances Therefore if the format of an attribute is free then a lot of rules will be

needed to deal with all kinds of possible cases This kind attribute will be called ldquomultiFormatrdquo

attribute For example an ecommerce Web site would use the bold font format to present the

general prices while use the red color format to display the sale prices Nevertheless there has

Deep Web Service Crawler

18

another situation that some different attributes for a data object have the same format For

example various attributes are presented in terms of using the ltTDgt tags in a table presentation

And the attributes like those could be differentiated by means of the order information of these

attributes However for cases that there occurs ldquononerdquo attribute or exists ldquomultiOrderingrdquo

attributes it must have to revise the rules for extracting these attributes

Oslash The attribute cannot be decomposed

Because of the easier processing sometimes the input documents would like to be treated as

strings of tokens instead of the strings of characters In addition some of the attribute cannot

even be decomposed into several individual tokens These attributes are called the ldquountokenizedrdquo

attributes For example the college course catalogue like ldquoCOMP4016rdquo or ldquoGEOL2001rdquo The

department code and the course number in them cannot be separated into two different strings

of characters like that ldquoCOMPrdquo and ldquo4016rdquo or ldquoGEOLrdquo and ldquo2001rdquo

223 The Used Techniques in Information Extraction

The extractor used in the process of Information Extraction aims at providing a single uniform query

interface to access information sources like database server and Web server It consists of following

phases collecting returned Web pages labeling these Web pages generalizing extraction rules

extracting the relevant data and outputting the result in an appropriate format (XML format or

relational database) for further information integration For example at first the extractor queries the

Web server to gather the returned pages through the HTTP protocols after that it starts to extract the

contents among these HTML documents and integrate with other data sources thereafter Actually

the whole process of the extractor follows below steps

Oslash Step 1

At the beginning it must have to tokenize the input However there are two different

granularities for the input string tokenization They are tag-level encoding and word-level

encoding The tag-level encoding will transform the tags of HTML page into the general tokens

while transform all text string between two tags into a special token Nevertheless the

word-level encoding does this in another way It treats each word in a document as a token

Oslash Step 2

Next it should apply the extraction rules for every attributes of the data object in the Web pages

These extraction rules could be induced in terms of a top-down or bottom-up generalization

pattern mining or logic programming In addition the type of extraction rules may be indicated

by means of regular grammars or logic rules For example some use path-expressions of the

HTML parse tree path like htmlheadtitle or html-gttable[0] some use syntactic or semantic

constraints and some use delimiter-based constraints such as HTML tags or literal words

Oslash Step 3

After that all these extracted data would be assembled into the records

Oslash Step 4

Finally iterate this process until all these data objects in the input

Deep Web Service Crawler

19

23 Pica-Pica Web Service Description Crawler

The Pica-Pica is knows as a kind of bird species it can also be called pie However at the moment the

Pica-Pica here is a Web Service Description Crawler which is designed to solve the quality of Web

Services problem For example the evaluation of the descriptive quality of Web Services that offered

and how well are these Web Services described in nowadaysrsquo Web Service Registries

231 Needed Libraries of the Pica-Pica Web Service

Description Crawler

This version of the Pica-Pica Web Service Description Crawler is written by Anton Caceres and Josef

Spillner and programmed in terms of the Python language Actually in order to run these scripts to

parse the HTML pages it needs two additional libraries Beautiful Soup and Html5lib

n Beautiful Soup

It is an HTMLXML parser for Python language And it can even turn these invalid markups into a

parse tree [5] Moreover the following three features can make it more powerful

u The bad markup doesnrsquot choke the Beautiful Soup In fact it will generate a parse tree that

makes approximately as much sense as the original document Therefore you can obtain

the data that you want

u The Beautiful Soup has a toolkit that can provide simple idiomatic methods for navigating

searching and modifying the parse tree Hence you donrsquot need to create a custom parse

for every application

u If the document has already specified an encoding then you can ignore it since the

Beautiful Soup can convert the documents from Unicode to UTF-8 in an automatic way

Otherwise what you have to do is just to specify the encoding of the original documents

Furthermore the ways of including Beautiful Soup into the application are displayed in the

following [5]

sup2 From BeautifulSoup import BeautifulSoup For processing HTML

sup2 From BeautifulSoup import BeautifulStoneSoup For processing XML

sup2 Import BeautifulSoup To get everything

n Html5lib

It is a Python package which can implement the HTML5 [8] parsing algorithm And in order to

gain maximum compatibility with the current major desktop web browsers this implementation

will be based on the WHATWG (Web Hypertext Application Technology Working Group) HTML5

specification

Deep Web Service Crawler

20

232 Architecture of the Pica-Pica Web Service

Description Crawler

Figure2-4 Architecture of the Pica-Pica Web Service Description Crawler

Figure 2-4 illustrates the primary architecture of Pica-Pica Web Service Description Crawler It includes

four fundamental components Service Page Grabber component WSDL Grabber component

Property Grabber component and WSML Register component

(1) The Service Page Grabber components is going to take the URL seed as the input and output the

link of the service page into following two components WSDL Grabber component and Property

Grabber component

(2) The WSDL Grabber component is responsible for obtaining the WSDL document based on the

delivered service pagersquos link And then check whether the validation for these obtained WSDL

document Finally only these valid WSDL document will be passed into the WSML Register

component for further processing

(3) The Property Grabber component will try to extract the servicersquos property the hosted in the

service page if there exists After that all these servicersquos properties would be saved into an INI

file as the information of that service

(4) The functionality of the WSML Register component is to write appropriate WSML document by

means of the valid WSDL documents that delivered from WSDL Grabber component and the

optionally INI files that delivered from Property Grabber component Afterwards register them in

Conqo

n WSML [9]

It stands for Web Service Modeling Language which provides a framework with different

language variants Hence it is often used to describe the different aspects of the semantic Web

Services according to the conceptual model of WSMO

n WSMO [10]

WSMO whose full name is Web Service Modeling Ontology is dedicated to describe various

aspects related to the Semantic Web Services based on the ontologies Ontology is a formal

explicit specification of a shared conceptualization In fact these ontologies are the sticking points

that can satisfy the linkage between the agreement of the communities of users and the defined

conceptual semantics of the real-world

n Conqo [11]

Deep Web Service Crawler

21

It is a discovery framework that considers not only the Quality of Service (QoS) but also the

context information It will use a Web Service Repository to manage these service descriptions

that based on WSML

233 Implementation of the Pica-Pica Web Service

Description Crawler

This section is going to describe the processes of the implementation of the Pica-Pica Web Service

Description Crawler in detail

(1) Firstly for starting the whole crawling process of the Pica-Pica Web Service Description Crawler it

needs an input as the initial seed In this crawler there are five Web Service Registries which are

listed in the below The URL address of these five Web Service Registries will be used as the input

seed for this Pica-Pica Web Service Description Crawler Moreover in this version of Pica-Pica Web

Service Description Crawler there has a single Python script for each Web Service Registry And

the crawling process for these Web Service Registriesrsquo Python script will be executed one after

another

Biocataloue httpwwwbiocataloguecom

Ebi httpwwwebiacuk

Seekda httpwwwseekdacom

Service-Repository httpwwwsrvice-repositorycom

Xmethods httpwwwxmethodsnet

(2) Then after feeding with the input seed it will step into the next component Service Page Grabber

At first this component will try to read the data from the Web based on the input seed Then it

will establish a parsing tree of the read data in terms of the functions of the Beautiful Soup library

After that this Service Page Grabber component starts to look for the service page link for each

service that published in the Web Service Registry by means of the functions in Html5lib library In

the case that the service page link of one single service is found it will firstly check whether this

service page link is valid or not Once the service page link is valid it will pass it into the following

two components for further processing which are WSDL Grabber component and Property

Grabber component

(3) When the WSDL Grabber component receives a service page link from its previous component it

sets out to extract the WSDL link address for that service through the parsing tree of the data in

this service page Next this component will start to download the WSDL document of that service

in terms of the WSDL link address Thereafter the obtained WSDL document would be stored into

the disk The process of this WSDL Grabber component will be continually carried on until there is

no more servicersquos link passed to it Certainly not all grabbed WSDL documents are effective They

may either contain bad definitions or bad namespaceURI or be an empty document What is

worse it is not even of XML format Hence in order to pick them out this component will further

analyze the involved WSDL documents Then put all these valid documents into a ldquovalidWSDLsrdquo

folder Whereas other invalid documents will be put into a folder named ldquoinvalidWSDLsrdquo in

order to gather statistic information Finally only these WSDL documents in the ldquovalidWSDLsrdquo

folder will be passed into the subsequent component

(4) Moreover since some Web Service Registries would give some addition information about the

Deep Web Service Crawler

22

services such as availability service provider version therefore the Property Grabber

component will set out to extract these information as this servicersquos properties And thereafter

save these servicersquos properties into an INI file However if there has no available additional

information then it is no need to extract the service property thus there is no INI file for that

service However for the implementation of this Pica-Pica Web Service Description Crawler only

the Python scripts for Seekda and Service-Repository Web Service Registries has the functions to

extract the servicesrsquo properties While for other three Web Service Registries there has such

function

(5) Furthermore this is an optional to create a report file which contains the statistic information of

this process Such as the total number of services for one Web Service Registry the number of

services whose WSDL document is invalid etc

(6) As has been stated at present there has a folder with all valid WSDL documents and might also

have some INI files Therefore at this time the task of WSML Register component is to generate

the appropriate WSML documents with these valid WSDL document and INI files And then

register them in Conqo

24 Conclusions of the Existing Strategies

In this chapter it presents three aspects of the existing Strategies which includes the Service-Finder

project Information Extraction technique and Pica-Pica Web Service Description Crawler

The task of this master program is going to obtain the available Web services and their related

information from the Web Actually this is a procedure of extracting the needed information for the

service such as the servicersquos WSDL document and its properties Therefore the Information

Extraction technique which used to extract the information that hosted in the Web could be used in

this master program

Moreover the Service-Finder is a large project that not only able to obtain the available Web services

and their related information from the Web but also enrich these crawled data with annotations by

means of the Service-Finder ontology and the Service Category ontology and integrate all the

information into a coherent semantic mode Furthermore it also provides the capabilities for

searching and browsing the data with the user interface and gives users with service

recommendations However as a master program the Service-Finder project is far more exceeding

the requirements Therefore it is just considered as a consultation for this master program

Furthermore since the Pica-Pica Web Service Description Crawler aims at only obtaining the available

Web services and their related information this fulfills the primary task of this master program

Nevertheless regarding the information about the service this Pica-Pica Web Service Description

Crawler extracts only few properties sometime even no property Consequently in order to improve

the quality of the service it has to extract as much properties about the service as possible Thence in

chapter 3 it presents an extension of this Pica-Pica Web Service Description Crawler

Deep Web Service Crawler

23

3 Design and Implementation

In the previous chapter of the State of Art those already existing techniques or implementations are

presented In the following it is time to introduce the basic principle of the proposed approach Deep

Web Services Crawler which is based on these previous existing techniques especially the Pica-Pica

Web Service Description Crawler

31 Deep Web Services Crawler Requirements

This section is mainly talking about the goals of the Deep Web Service Crawler approach system

requirements with respect to approach and some non-functional requirements

311 Basic Requirements for DWSC

The following lists are the basic requirements which should be achieved

(1) Produce the largest annotated Service Catalogue. Here the Service Catalogue is a list of Web services published in the Web Service Registries. It contains not only the WSDL document of each service but also a file with the information about that service. Therefore, for the purpose of producing the largest annotated Service Catalogue, the proposed approach needs to extract as many properties about those Web services as possible. Moreover it also has to download the WSDL document that is hosted along with the Web service. These properties comprise not only the interesting structured properties, such as the service name and its WSDL link address, but also some non-functional properties, for example the endpoint and its monitoring information.

(2) Flexible storage: each service has a Service Catalogue entry that contains all its interesting properties. An important question is how to deal with these service properties, that is, which schemes will be used to store them. In order to store them in a flexible way, the proposed approach provides three methods for storage: the first one stores them as an XML file, the second one stores them in an INI file, and the third one uses a database.

312 System Requirements for DWSC

Generally speaking, the requirements for realizing a programming project include the following:

1) Operating System: Linux/Unix, Windows (XP, Vista, 2000, etc.)
2) Programming language: C++, Java, Python, C, etc.
3) Programming Tools: NetBeans, Eclipse, Visual Studio, and so on

In this master thesis the scripts of the Deep Web Service Crawler approach are written in the Java programming language. Moreover, these scripts have only been tested on the Windows XP and Linux operating systems, but not on other operating systems.


313 Non-Functional Requirements for DWSC

In this part several non-functional requirements for the Deep Web Service Crawler approach are

presented

1) Transparency: the process of data exploration and data storage should be done automatically, without the user's intervention. However, at the beginning the user has to specify the path on the hard disk which will be used to store the outputs of this program.
2) Fault tolerance: during the execution of this program errors can inevitably happen. Therefore, in order to prevent the process from being interrupted, the necessary error handling for recovering the process must be in place.
3) Completeness: this approach should extract as many of the interesting properties about each Web service as possible, e.g. endpoint and monitoring information.
In addition, since the Pica-Pica Web Service Description Crawler has already implemented strategies for the following five URLs, the proposed approach must cover no fewer than these five Web Service Registries:

Biocatalogue http://www.biocatalogue.com
Ebi http://www.ebi.ac.uk
Seekda http://www.seekda.com
Service-Repository http://www.service-repository.com
Xmethods http://www.xmethods.net

32 Deep Web Services Crawler Architecture

In this section an overview of the high-level architecture of the Deep Web Services Crawler approach is introduced first. Thereafter four subsections outline each single component and how they play together.

The current components and flows of data in the Deep Web Service Crawler can be summarized as depicted in Figure 3-1 using the continuous arrows. The crawler first tries to obtain the available service list page links and related service page links by crawling the Web (Web Service Extractor). Then those gathered links are processed in two separate steps: one obtains the service's WSDL document (WSDL Grabber) and the other collects the properties of each service (Property Grabber). Finally all these data are stored on the storage device (Storage). The whole process in figure 3-1 is illustrated in detail as follows:

Step 1
When the Deep Web Service Crawler starts to run, the File Chooser container requires the user to specify a path on the computer or on any other hard disk. The reason for specifying the path is that the Deep Web Service Crawler program needs a place to store all its outputs.
Step 2
After that the Web Service Extractor is triggered. It is the main entry point to the specific crawling process. Since the Deep Web Service Crawler program is a procedure that is supposed to crawl for Web Services in some given Web Service Registries, the URL addresses of these Web Service Registries are given as initial seeds for the Web Service Extractor process. However, since the page structures of these Web Service Registries are completely different, there is a separate process for each Web Service Registry.

Figure3-1 Overview of the Basic Architecture for the Deep Web Services crawler

Step 3
According to the given seed, two types of links are obtained by the Web Service Extractor component: the service list page link and the service page link. A service list page is a page that contains a list of Web Services and possibly some information about them, while a service page is a page that contains much more information about a single service. Finally the component forwards these two types of links to the next two components, the Property Grabber and the WSDL Grabber.
Step 4
On the one hand, the Property Grabber component tries to gather the information about the service that is hosted in the service list page and the service page, such as the name of the service, its description, its rating, etc. Finally all the information about the service is collected together as the service properties, which are then delivered to the Storage component for further processing.
Step 5
On the other hand, the WSDL Grabber tries to obtain the WSDL link from the service list page or the service page. For some Web Service Registries the WSDL link is hosted in the service list page, as in the Biocatalogue, and for the other Web Service Registries it is hosted in the service page, as in Xmethods. After the WSDL link has been obtained, it is also transmitted to the Storage component for further processing.

Step 6
When the service properties and the WSDL link of the service are received by the Storage component, they are stored on the disk. The service properties are stored in three different ways: as an XML file, as an INI file, or as one record inside a database table. For the WSDL link, the Storage component first tries to download the page content at the URL address of the WSDL link. If this succeeds, the page content of the service is stored as a WSDL document on the disk.
Step 7
Steps 3 to 6 describe the crawling process for a single service. Hence, if there is more than one service or more than one service list page in a Web Service Registry, the crawling process from step 3 to step 6 is repeated again and again until there is no further service or service list page in that Web Service Registry.
Step 8
Furthermore, after the crawling process for one Web Service Registry finishes, a file with some statistical information about this crawling process is generated. It includes, for example, the time when the crawling of this Web Service Registry started and finished, the total number of Web services in this Web Service Registry, how many services have an empty WSDL document, the average number of service properties in this Web Service Registry, and the average time needed for extracting the service properties, getting the WSDL document, and generating the XML file, INI file, etc.

321 The Function of Web Service Extractor Component

The Web Service Extractor component is responsible for gathering service list page links and service page links. It pursues a focused crawl of the Web and only forwards service list page and service page links to the subsequent components for analysis and collection purposes. Therefore it identifies both service list page links and the related service page links in these Web Service Registries.

As can be seen from figure 3-2, a crawl for Web Services needs to start from a seed URL. This seed is almost as important as the Web Service Extractor itself, as it highly influences the part of the Web that needs to be crawled. The seed can or shall contain, e.g., Web pages where these Web Services are published or that talk about Web Services.


Figure3-2 Overview the process flow of the Web Service Extractor Component

After being fed with the seed URL, the Web Service Extractor component starts to obtain the link of the service list page from the initial page of this URL seed. This process, however, differs among the five Web Service Registries. The following shows the different situations in these Web Service Registries:

Service-Repository Web Service Registry
In this Web Service Registry the link of the first service list page is the URL address of its seed, which means some Web Services can already be found on the home page of the Service-Repository Web Service Registry. The word "first" here implies that there is more than one service list page link in this Web Service Registry. Therefore the process of getting service list page links in this registry is carried on continually until no more service list page links exist.

Xmethods Web Service Registry
Although there are Web Services on the home page of the Xmethods Web Service Registry, they are only a small subset of the services in this registry. Moreover, in the Xmethods Web Service Registry there is only one page containing all Web Services. Therefore the service list page link of that page has to be obtained.

Ebi Web Service Registry
The situation in the Ebi Web Service Registry is a little bit like that in the Xmethods Web Service Registry. That is to say, there is also one page that contains all Web Services in this Web Service Registry. However, this page is not the initial page of the input seed. Therefore more than one operation step is needed to get the service list page link of that page.

Seekda Web Service Registry
In the Seekda Web Service Registry the Web Services are not contained in the initial page of the input seed. The service list page link can be obtained after several additional operation steps. However, there is a problem with getting the service list page links in this registry: if there is more than one page containing Web Services, then for some unknown reason the links of the remaining service list pages cannot be obtained. In other words, only the link of the first service list page can be obtained.

Biocatalogue Web Service Registry
The process of getting the service list page in the Biocatalogue Web Service Registry is almost the same as in the Seekda Web Service Registry. The only difference is that in the Biocatalogue Web Service Registry all service list page links can be obtained if there is more than one service list page.

After getting the link of a service list page, the Web Service Extractor begins to obtain the link of the service page for each service listed in the service list page. This is possible because every service has an internal link that leads to its service page. It is worth noting that once a service page link is obtained, this component immediately forwards the service page link and the service list page link to the subsequent two components for further processing. The process of obtaining service page links is carried out continuously until all services listed in that service list page have been crawled. Analogously, the process of getting service list pages is carried out continuously until no more service list pages exist.

3211 Features of the Web Service Extractor Component

The main features are described in the following paragraphs

1) Obtain service list page links
A central task is to obtain the corresponding service list page links. A service list page link is a URL address of a page that contains a public list of Web Services together with some brief information about them, such as the name of the service, an internal URL that links to another page containing the detailed information about that service, and sometimes the link address of the WSDL document.

2) Obtain service page links
Once service list page links are found, a crucial aspect is to extract the internal link of each service to aid the task of service information discovery. Therefore it is the task of the Web Service Extractor to harvest the HTML content of the service list page, so that the service page links, which contain much more detailed information about the single Web service, can be obtained.

3212 Input of the Web Service Extractor Component

This component depends on specific input seeds. The only input required for this component is a seed URL, which is one of the URLs listed in section 313.

3213 Output of the Web Service Extractor Component

The component produces two types of service-related page links from the Web:

l Service list page links

l Service page links


3214 Demonstration for Web Service Extractor

In order to give a comprehensive understanding of the process of the Web Service Extractor component, the following figures are used for explanation. Although there are five URL addresses, only the URL of the Service-Repository is shown as an example.

1) The input seed is the initial URL address of the Service-Repository, which is "http://www.service-repository.com".
2) As already stated in section 321, the first service list page link of this Web Service Registry is its input seed "http://www.service-repository.com". Figure 3-3 shows the corresponding service list page of that link.

Figure3-3 Service list page of the Service-Repository
Figure3-4 Original source code of the internal link for Web service "BLZService"
Figure3-5 Code Overview of getting service page link in Service Repository
Figure3-6 Service page of the Web service "BLZService"


3) Now that the service list page link is already known, the next step is to acquire the service page links of the services listed in the service list page. The text in the red box of figure 3-4 shows the internal link of the Web service "BLZService". However, this is not the complete link of the service page: it has to be prefixed with the initial URL address of the Service-Repository Web Service Registry, "http://www.service-repository.com" (see the sketch after this list). The code for getting the service page link of a Web service in the Service-Repository is shown in figure 3-5. Therefore the final link of this service page is "http://www.service-repository.com/serviceoverview-210897616". Figure 3-6 is the corresponding service page of that link.

4) Afterwards, these two links, the service list page link and the service page link, which are gathered by the Web Service Extractor component, are immediately forwarded to the next two components, which are the WSDL Grabber component and the Property Grabber component.
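To illustrate the link-prefixing step described in 3), the following minimal sketch shows how a relative internal link could be turned into an absolute service page link; the class, method and example link are illustrative assumptions and not the code of figure 3-5.

public class ServicePageLinkBuilder {

    // Seed URL of the Service-Repository registry, as used in step 1)
    private static final String BASE_URL = "http://www.service-repository.com";

    // Turns the relative internal link of a service entry into the absolute
    // service page link by prefixing it with the registry seed URL.
    public static String buildServicePageLink(String internalLink) {
        if (internalLink.startsWith("http")) {
            return internalLink;                    // already an absolute link
        }
        if (!internalLink.startsWith("/")) {
            internalLink = "/" + internalLink;
        }
        return BASE_URL + internalLink;
    }

    public static void main(String[] args) {
        // hypothetical internal link, only for demonstration purposes
        System.out.println(buildServicePageLink("/service/overview/12345"));
    }
}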

322 The Function of WSDL Grabber Component

The WSDL Grabber component acquires the WSDL link that is hosted in the Web, based on the service list page link or the service page link. The whole process flow is illustrated in figure 3-7.

Figure3-7 Overview the process flow of the WSDL Grabber Component

When the WSDL Grabber component receives the inputs delivered from the previous component, it starts to get the WSDL link for the service based on these inputs. Although the inputs of the WSDL Grabber component are the links of the service page and the service list page, only one of them contains the WSDL link for the corresponding service. That is to say, the WSDL link exists either in the service page or in the service list page. The reason why both links are delivered to this component is that only one of the five Web Service Registries, namely Biocatalogue, hosts the WSDL link in the service list page, while the other four Web Service Registries host the WSDL link in the service page. Therefore the WSDL link of these four Web Service Registries is obtained from the service page link, and for the Biocatalogue Web Service Registry the WSDL link is obtained through the service list page link. However, there is a problem with getting the WSDL link in the Biocatalogue Web Service Registry: some of the Web services listed in its service list pages have no WSDL link, in other words these services have no WSDL document. In such a case the value of the WSDL link of these Web services is assigned the value "NULL". For the Web services in the other four Web Service Registries the WSDL link always exists in the service page. Eventually, whenever the WSDL Grabber component has extracted the WSDL link for a single Web service, it is immediately forwarded to the Storage component so that the WSDL document can be downloaded at once.

3221 Features of the WSDL Grabber Component

The WSDL Grabber component has to provide the following feature:
l Obtain WSDL links
The WSDL link is the direct way to get to the page that contains the content of the WSDL document. It is actually a URL address, but at the end of this URL address there is usually something like "wsdl" or "WSDL" to indicate that it addresses the page of a WSDL document.

3222 Input of the WSDL Grabber Component

This component requires the following input data

l Service list page link

l Service page link

3223 Output of the WSDL Grabber Component

The component will only produce the following output data

l The URL address of WSDL link for each service

3224 Demonstration for WSDL Grabber Component

This section presents a list of figures in order to give a comprehensive understanding of the process of the WSDL Grabber component. It uses the same Web service of the Service-Repository as an example.

1) The input for the WSDL Grabber component is the link of the service page obtained from the Web Service Extractor component. The address of this link is "http://www.service-repository.com/serviceoverview-210897616".
Figure3-8 WSDL link of the Web service "BLZService" in the service page


2) Figure 3-9 shows the corresponding original source code of the WSDL link that is highlighted in figure 3-8.

Figure3-9 Original source code of the WSDL link for Web service ldquoBLZServicerdquo

3) Figure 3-10 and figure 3-11 show the code used to extract the WSDL link shown in figure 3-9. Note that figure 3-10 contains the specific code for the Service-Repository Web Service Registry only; for the other four Web Service Registries this code is different. The function "getServiceRepositoryWSDLLink" first gets a list of nodes that have the HTML tag name "b". Then it checks these nodes one by one to see whether the text value of a node is "WSDL". If a node fulfills this condition, the attribute value of its sibling, which is an "a" element here, is extracted as the value of the WSDL link for this Web service (a sketch of this extraction logic is given after step 4).

Figure3-10 Code overview of the "getServiceRepositoryWSDLLink" function
Figure3-11 Code overview of the "oneParameter" function
4) Finally, after applying the functions shown in figures 3-10 and 3-11, the WSDL link for the Web service "BLZService" is obtained. Its value is "http://services.unitedplanet.de/blz/BlzService.asmx?WSDL".
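Since figures 3-10 and 3-11 are only reproduced as images, the following sketch outlines how the described extraction logic could look when an HTML parser such as jsoup is used; the thesis does not state which parser is actually employed, so the class and the parsing calls are assumptions.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class WsdlLinkExtractor {

    // Looks for a <b> element whose text is "WSDL" and returns the href of
    // its neighbouring <a> element, mirroring the behaviour described for
    // "getServiceRepositoryWSDLLink"; returns null if no such element exists.
    public static String getServiceRepositoryWsdlLink(String servicePageHtml) {
        Document doc = Jsoup.parse(servicePageHtml);
        for (Element bold : doc.getElementsByTag("b")) {
            if ("WSDL".equals(bold.text().trim())) {
                Element sibling = bold.nextElementSibling();
                if (sibling != null && "a".equals(sibling.tagName())) {
                    return sibling.attr("href");     // value of the WSDL link
                }
            }
        }
        return null;   // the page contains no WSDL link
    }
}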


323 The Function of Property Grabber Component

The Property Grabber component is a module used to extract and gather all the Web service information that is hosted in the Web, namely the information shown on the service list page and the service page. In the end all the obtained Web service information is collected together as the service properties, which are delivered to the Storage component for storing. The detailed process flow of the Property Grabber component is illustrated in figure 3-12.

The inputs of the Property Grabber component are the same as the inputs of the WSDL Grabber component. However, there is a small difference between them with respect to the seed. As already mentioned in section 322, for the WSDL Grabber component only one of the inputs is sufficient to get the WSDL link, while the Property Grabber component needs both of them as seeds. Once the Property Grabber component has received the needed inputs, it starts to extract the information of the single Web service.

Figure3-12 Overview the process flow of the Property Grabber Component

After the Property Grabber component receives the inputs, it starts to extract the service information for the Web service. Generally speaking, the service information consists of four aspects: structured information, endpoint information, monitoring information, and whois information.

(1) Structured Information
The structured information can be obtained by extracting the information hosted in the service page and the service list page. It is the basic descriptive information about the service, such as the name of the service, the URL address through which the WSDL document can be obtained, the description introducing the service, the provider who offers that service, its rating, and the server that hosts this service. However, the elements constituting this structured information differ among the Web Service Registries. For example, the rating information of a Web service exists in the Service-Repository Web Service Registry, while the Xmethods Web Service Registry does not have this information. In addition, even for Web services within the same Web Service Registry, some elements of the structured information may be missing. For instance, one service in a Web Service Registry may have a description while another service in the same registry does not. Tables 3-1 to 3-5 list the structured information that should be extracted for each of these five Web Service Registries. Moreover, if the style of a service in the Biocatalogue Web Service Registry is SOAP, there is some additional information describing the SOAP operations of this service; if it is REST, the additional information describes the REST operations. This information should also be considered part of the structured information. Table 3-6 and table 3-7 list the information for these two kinds of operations.

Service Name WSDL Link WSDL Version

Provider Server Rating

Homepage Owner Homepage Description

Table 3-1Structured Information of Service-Repository Web Service Registry

Service Name WSDL Link Provider

Service Style Homepage Implementation Language

Description User Description Contributed Client Name

Type of this Client Publisher for this client Used Tookit of this client

Used Language of this client Used Operation System of this client

Table 3-2Structured Information of Xmethods Web Service Registry

Service Name WSDL Link Server

Provider Providerrsquos Country Service Style

Rating Description User Description

Service Tags Documentation (within WSDL)

Table 3-3Structured Information of Seekda Web Service Registry

Service Name WSDL Link Port Name

Service URL Address Implementation Class

Table 3-4Structured Information of Ebi Web Service Registry

Service Name WSDL Link Style

Provider Providerrsquos Country View Times

Favorite Times Submitter Service Tags

Total Annotation Provider Annotation Member Annotation

Registry Annotation Base URL SOAP Lab Server Base URL

Description User Description Category

Table 3-5Structured Information of Biocatalogue Web Service Registry


SOAP Operation Name Inputs and Outputs Operation Description

Operation Tags Part of Which Service

Table 3-6SOAP Operation Information of Biocatalogue Web Service Registry

REST Operation Name Service Tags Used Template

Operation Description Part of Which Service Part of Which Endpoint Group

Table 3-7 REST Operation Information of Biocatalogue Web Service Registry

(2) Endpoint Information
Endpoint information is the port information of the Web service, which can be extracted only from the service page. Since different Web Service Registries structure the endpoint information differently, the elements of the endpoint information for a Web service can be very diverse. One thing has to be noted: the Ebi Web Service Registry does not provide endpoint information for any of the Web services published in it. Moreover, although the Web services in the same Web Service Registry share the same structure of the endpoint information, some elements of the endpoint information may be missing or empty. Furthermore, these Web Service Registries may even have no endpoint information at all for some of the Web services published by them. Nevertheless, whenever there is endpoint information for a Web service, it contains at least one element, namely the URL address of the endpoint. Table 3-8 shows the endpoint information that should be extracted for these five Web Service Registries.

Web Service Registry Name Elements of the Endpoint Information
Service Repository Endpoint Name, Endpoint URL, Endpoint Critical, Endpoint Type, Bound Endpoint
Xmethods Endpoint URL, Publisher of this Endpoint, Contact Email of this Publisher, Implementation Language of this Endpoint
Seekda Endpoint URL
Biocatalogue Endpoint Name, Endpoint URL
Table 3-8 Endpoint Information of these five Web Service Registries

Web Service Registry Name Elements of the Monitoring Information
Service Repository Service Availability, Number of Downs, Total Uptime, Total Downtime, MTTR, MTBF, RTT Max of Endpoint, RTT Min of Endpoint, RTT Average of Endpoint, Ping Count of Endpoint
Seekda Service Availability, Begin Time of Monitoring
Biocatalogue Monitored Status of Endpoint, Overall Status of Service
Table 3-9 Monitoring Information of these five Web Service Registries

(3) Monitoring Information
Monitoring information is the measured statistical information for the Web service. It is worth noting that the Ebi and Xmethods Web Service Registries do not provide monitoring information for the Web services published by them, while in the other three Web Service Registries only a few Web services may lack this information. Table 3-9 lists the monitoring information for these three Web Service Registries.

(4) Whois Information
Whois information is not extracted from the information hosted in the service page and the service list page. It is descriptive information about the service domain, which is obtained by means of the address of the WSDL link. Because of that, the process of getting the whois information starts by determining the service domain first. The final value of the service domain must not contain strings like "http", "https", "www", etc.; it has to be reduced to the registered domain (see the sketch after Table 3-10). After that, the service domain database is queried by sending the value of the service domain to a whois client, which is simply a Web site on the Internet, for example "http://www.whois365.com/cn/domain". If information about that service domain exists, a list of its information is returned as the output. However, the structure of the returned information differs for different service domains. Therefore the most challenging part is that the extraction process has to deal with each different form of the returned information. Table 3-10 lists the whois information that should be extracted for all five Web Service Registries.

Service Domain URL Domain Name Domain Type

Domain Address Domain Description State

Postal Code City Country

Country Code Phone Fax

Email Organization Established Time

Table 3-10Whois Information for these five Web Service Registries
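A minimal sketch of the domain-normalization step described in aspect (4) could look as follows; the class name and the use of java.net.URL are assumptions rather than the actual implementation.

import java.net.MalformedURLException;
import java.net.URL;

public class ServiceDomainExtractor {

    // Derives the service domain from a WSDL link by dropping the protocol,
    // the path and a leading "www."; e.g. a WSDL link on the host
    // "www.thomas-bayer.com" yields the service domain "thomas-bayer.com".
    public static String extractServiceDomain(String wsdlLink) throws MalformedURLException {
        String host = new URL(wsdlLink).getHost();   // strips protocol and path
        if (host.startsWith("www.")) {
            host = host.substring(4);                // strip the "www." prefix
        }
        return host;
    }
}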

Finally, all the information of these four aspects is collected together and then delivered to the Storage component for further storage processing.

3231 Features of the Property Grabber Component

The Property Grabber component has to provide the following features:
l Obtain basic information
Generally speaking, the more information is available about a Web service, the better one can judge how good this Web service is. Hence the Property Grabber component has to extract all the basic information hosted in the service list page and the service page. This basic information comprises the structured information, the endpoint information, and the monitoring information.
l Obtain whois information
Since more information about a Web service allows a better assessment of its quality, it is necessary to extract as much information about the Web service as possible. Therefore, in addition to the basic information, the Property Grabber component also obtains some additional information, the so-called whois information, such as the type of the domain, the name of the person in charge, the postal code of the domain, city, phone, fax, detailed address, email, etc.


3232 Input of the Property Grabber Component

This component requires the following input data

l Service list page link

l Service page link

3233 Output of the Property Grabber Component

The component will produce the following output data

l Structured information of each service

l Endpoint information about each service if exists

l Monitoring information for the service and endpoint if exists

l Whois information of the service domain

All this information is collected together as the properties of each service. Thereafter the collected properties are sent to the Storage component.

3234 Demonstration for Property Grabber Component

The figures 3-13 to 3-16 depict the fundamental procedure of the Property Grabber component. To simplify the explanation, the example shown in this section uses the same Web service as before.
1) The inputs of the Property Grabber component are the links of the service list page and the service page received from the Web Service Extractor component. These links are "http://www.service-repository.com" and "http://www.service-repository.com/serviceoverview-210897616".

2) When the "GetProperty" function is triggered, it calls its four sub functions shown in figure 3-12.

Figure3-13 Structure properties of the Service ldquoBLZServicerdquo in service list page


Figure3-14 Structure properties of the Service ldquoBLZServicerdquo in service page

3) Firstly, the "getStructuredProperty" function tries to extract the structured information displayed in the red boxes of figure 3-13 and figure 3-14. Since several elements of the structured information have the same content, like the description shown on the service page and the service list page, elements with the same content are only extracted once in order to save extraction time and storage space. Moreover, the rating information has to be transformed from non-descriptive content into descriptive text, because it is represented by several star images (a small sketch of this transformation follows Table 3-11). The final results of the extracted structured information for this Web service are shown in table 3-11. Because there is no descriptive information for the provider, the homepage, and the owner homepage, their values are assigned "NULL".

Service Name BLZService

WSDL Link httpwwwthomas-bayercomaxis2servicesBLZServicewsdl

WSDL Version 0

Server Apache-Coyote11

Description BLZService

Rating Four stars and A Half

Provider NULL

Homepage NULL

Owner Homepage NULL

Table 3-11Extracted Structured Information of Web Service ldquoBLZServicerdquo
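The transformation of the star images into descriptive text mentioned in step 3) could, for instance, be realized as in the following sketch; the mapping function is an assumption and merely reproduces the wording of the rating value in table 3-11.

public class RatingTransformer {

    // Turns a numeric star rating (e.g. obtained by counting the star images
    // on the service page) into a descriptive text; the wording follows the
    // "Four stars and A Half" value shown in table 3-11.
    public static String toDescriptiveRating(double stars) {
        String[] words = {"Zero", "One", "Two", "Three", "Four", "Five"};
        int full = (int) stars;                      // number of full stars
        boolean half = (stars - full) >= 0.5;        // is there a half star?
        String text = words[full] + (full == 1 ? " star" : " stars");
        return half ? text + " and A Half" : text;
    }
}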

4) Secondly, the "getEndpointProperty" function extracts the endpoint information displayed in the red box of figure 3-15. Although there is more than one endpoint record in the list, only one of them is extracted as the endpoint information. The reason is that this master program is intended to extract as much information as possible without storing redundant information. Therefore only one record is extracted as the endpoint information, even if there are several endpoint records. Table 3-12 shows the final results of the endpoint information of this Web service.

Figure3-15 Endpoint information of the Web service ldquoBLZServicerdquo in service page


Endpoint Name BLZServiceSOAP12port_http

Endpoint URL httpwwwthomas-bayercom80axis2servicesBLZService

Endpoint Critical True

Endpoint Type production

Bound Endpoint BLZServiceSOAP12Binding

Table 3-12Extracted Endpoint Information of the Web service ldquoBLZServicerdquo

5) Then the monitoring information is extracted by invoking the function "getMonitoringProperty". Figure 3-16 displays two types of monitoring properties: the red box above contains the monitoring information about the Web service, and the red box below lists the monitoring information for its endpoints. As mentioned before, only one endpoint statistic record is extracted. Besides, as can be seen from figure 3-16, there are two kinds of availability values. They all represent the availability of this Web service, just like the availability shown in figure 3-14; therefore one of these availability values is sufficient. Table 3-13 shows the final results of this extraction process.

Figure3-16 Monitoring Information of the Service ldquoBLZServicerdquo in service page

Service Availability 100

Number of Downs 0

Total Uptime 1 day 19 hours 19 minutes

Total Downtime 0 second

MTBF 1 day 19 hours 19 minutes

MTTR 0 second

RTT Max of Endpoint 141 ms

RTT Min of Endpoint 0 ms

RTT Average of Endpoint 577 ms

Ping Count of Endpoint 112

Table 3-13Extracted Monitoring Information of the Web service ldquoBLZServicerdquo

6) Next, in order to extract the whois properties, the "getWhoisProperty" function first has to derive the service domain from the WSDL link. For this Web service the obtained service domain is "thomas-bayer.com". This service domain is then sent as input to the whois client for the query process, which returns a list of information for that service domain, see figure 3-17. Table 3-14 shows the extracted whois information.


Figure3-17 Whois Information of the service domain ldquothomas-bayercomrdquo

Service Domain URL thomas-bayercom

Domain Name Thomas Bayer

Domain Type NULL

Domain Address Moltkestr40

Domain Description NULL

State NULL

Postal Code 54173

City Bonn

Country NULL

Country Code DE

Phone +4922855525760

Fax NULL

Email infopredic8de

Organization predic8 GmbH

Established Time NULL

Table 3-14Extracted Whois Information of service domain ldquothomas-bayercomrdquo

7) Finally all the information of these four aspects will be collected together as the service

properties and then these service properties are forwarded into the Storage component

324 The Function of Storage Component

The WSDL link from the WSDL Grabber component is used by the Storage component to download the WSDL document from the Web and to store it on the disk. In addition, the service properties from the Property Grabber component are also stored on the disk in three different ways by the Storage component. Figure 3-18 illustrates the process flow of the Storage component.

When the Storage component receives the needed inputs, the mediator function "Storager" of the Storage component is triggered. It transforms the service properties into three different output formats and stores them on the disk: an XML file, an INI file, and database records. Besides, it also tries to download the WSDL document from the URL address of the WSDL link and then stores the obtained WSDL document on the disk as well. This "Storager" function is composed of four sub functions, namely the "getWSDL", "generateXML", "generateDatabase", and "generateINI" sub functions. Each sub function is in charge of one aspect of the storage tasks.


Figure3-18 Overview the process flow of the Storage Component

(1) "getWSDL" sub function
The task of the "getWSDL" sub function is to download the WSDL document and then store it on the disk. First of all it has to obtain the content of the WSDL document. This procedure works as follows. First, the "getWSDL" sub function checks whether the value of the received WSDL link equals "NULL". As already described in section 322, if a Web service does not have a WSDL link, the value of its WSDL link is assigned "NULL". In such a case an empty WSDL document is created whose name is the service name appended with the marker "No WSDL Document"; obviously this document contains no content. If the service does have a WSDL link, this sub function tries to connect to the Internet using the URL address of the WSDL link. If this succeeds, the content hosted on the Web is downloaded, stored on the disk, and named with the name of the service. Otherwise a WSDL document is created whose name is prefixed with "Bad" before the service name.

(2) "generateXML" sub function
The "generateXML" sub function takes the service properties as input, transforms them into an XML file, and stores it on the disk with a name consisting of the service name plus ".xml". XML stands for eXtensible Markup Language, a markup language designed to transport and store data. The first line of an XML file is the XML declaration, which defines the XML version and the encoding used. For example, <?xml version="1.0" encoding="UTF-8"?> means this is an XML file whose version is 1.0 and whose encoding is the 8-bit Unicode Transformation Format character set. Besides, an XML file contains XML elements, which span everything from an element's start tag to its end tag. An XML element can also contain other elements, simple text, or a mixture of both. However, an XML file must contain a root element that is the parent of all other elements.

(3) "generateINI" sub function
The "generateINI" sub function also takes the service properties as input, but it transforms them into an INI file and stores it on the disk with a name consisting of the service name plus ".ini". "ini" stands for initialization. The INI file format is a de facto standard for configuration files. INI files are simple text files with a basic structure. Generally speaking, an INI file contains three different kinds of parts: sections, parameters, and comments. A parameter is the basic element contained in an INI file. Its format is a key-value pair, also called a name-value pair. This pair is delimited by an equals sign "=", and the key or name always appears to the left of the equals sign. A section is like a container that groups its parameters together. It always appears on a single line within a pair of square brackets "[]", and sections may not be nested. A comment is descriptive text which begins with a semicolon ";". Anything between the semicolon and the end of the line is ignored.

(4) "generateDatabase" sub function
The inputs of the "generateDatabase" sub function are the same as for the previous two sub functions. Instead of transforming them into a file like XML or INI, the "generateDatabase" sub function turns them into database records by using SQL statements. SQL stands for Structured Query Language, a database language designed for accessing and manipulating data in a database. Most actions performed on a database are done with SQL statements, the primary ones being insert into, delete, update, select, create, alter, and drop. Therefore, for the purpose of transforming the service properties into database records, this sub function first has to create a database using the "create database" statement. Then it has to create a table to store the data. A table is a collection of related data entries and consists of columns and rows. Since the data volumes for these five Web Service Registries are not very large, one database table is sufficient for storing the service properties. For this reason, the field names of the service properties in the columns have to be uniform and well-defined across all five Web Service Registries. Afterwards the service properties of each single service can be inserted into the table as one record with the "insert into" statement of SQL.

3241 Features of the Storage Component

The Storage component has to provide the following features:
l Generate different output formats
The final result of this master program is to store the information about the services on the disk for future work. This Storage component provides three different formats for storing the service properties of the services published in the Web Service Registries. This makes the storage of the services very flexible and also durable.
l Obtain the WSDL document
An important task of this component is to check whether the WSDL document can be obtained from the WSDL link, because the WSDL document plays a decisive role in determining the quality of the service. This Storage component provides the ability to deal with the different situations that occur in the process of obtaining the WSDL document.


3242 Input of the Storage Component

This component requires the following input data

l WSDL link of each service

l Each service property information

3243 Output of the Storage Component

The component will produce the following output data

l WSDL document of the service

l XML document INI file and tables in database

3244 Demonstration for Storage Component

The following figures describe the fundamental implementation code of this Storage component. The detailed description is given below:
1) As can be seen from the following figures, there are several commonalities among the implementation codes. The first one concerns the parameters defined in each of these sub functions, which are "path" and "SecurityInt". The parameter "path" is the absolute path on the computer disk. It is used for the storing procedure and has already been specified by the user at the beginning of the whole program. The parameter "SecurityInt" is an increasing Integer which is used as part of the name of the service. The reason for this is that it prevents services that have the same name from overriding each other on the disk. The content of the red mark among the codes of these figures is the second commonality; its function is to create a file or document for the service with the corresponding parameters.

2) Figure 3-19 displays the implementation code of the "getWSDL" sub function, which is designed to get the WSDL document of the service based on its WSDL link. The parameter "name" is the name of the service and is used as the name of the WSDL document. The parameter "linkStr" is the actual WSDL link, which is the most important parameter for this sub function. The other two parameters, "statistic" and "log", are the objects of the text files called "Statistic Information" and "Log Information" respectively. The "Statistic Information" text file is used to record the statistical data of the services in each Web Service Registry, such as the overall number of properties for that Web Service Registry, the overall number of services, the number of services that have no WSDL link, the number of services whose WSDL document contains no content, the number of services whose WSDL link is not available, etc. The "Log Information" text file records the results of the process steps and the problems encountered, for example which service is being crawled at the moment, which Web Service Registry it belongs to, the reason why the WSDL document of a service could not be obtained, the reason why a service is unreachable, and so on.


Figure3-19 Implementation code for getting WSDL document
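As figure 3-19 is only available as an image, the following sketch indicates what such a download routine could look like; it follows the three cases described in step 2), but the concrete stream handling and file naming details are assumptions.

import java.io.BufferedReader;
import java.io.FileWriter;
import java.io.InputStreamReader;
import java.net.URL;

public class WsdlDownloader {

    // Downloads the WSDL document of one service and stores it in "path";
    // mirrors the three cases described above: no WSDL link, a working link
    // and a link whose content cannot be read.
    public static void getWSDL(String path, int securityInt, String name, String linkStr) {
        String fileName;
        String content = "";
        if (linkStr == null || linkStr.equals("NULL")) {
            // service without WSDL link -> empty marker document
            fileName = securityInt + name + "[No WSDL Document].wsdl";
        } else {
            try {
                URL url = new URL(linkStr);
                StringBuilder sb = new StringBuilder();
                try (BufferedReader in = new BufferedReader(
                        new InputStreamReader(url.openStream(), "UTF-8"))) {
                    String line;
                    while ((line = in.readLine()) != null) {
                        sb.append(line).append("\n");
                    }
                }
                content = sb.toString();
                fileName = securityInt + name + ".wsdl";
            } catch (Exception e) {
                // WSDL link not reachable -> document prefixed with "Bad"
                fileName = "Bad" + securityInt + name + ".wsdl";
            }
        }
        try (FileWriter out = new FileWriter(path + "/" + fileName)) {
            out.write(content);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}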

3) Figure 3-20 and figure 3-21 show the code for turning the service properties into the XML file and the INI file and for storing those two files on the disk afterwards. The parameter "vec" is a Vector of the "PropertyStruct" data type. This "PropertyStruct" data type is a class that consists of two variables, name and value.

Figure3-20 Implementation code for generating XML file
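A corresponding sketch of the "generateXML" sub function, assuming a simple PropertyStruct holder with public name and value fields and a root element called <service>, could look as follows; it is not the code of figure 3-20.

import java.io.FileWriter;
import java.util.Vector;

public class XmlGenerator {

    // Simple name/value holder as described for "PropertyStruct";
    // public fields and this constructor are assumptions.
    public static class PropertyStruct {
        public String name;
        public String value;
        public PropertyStruct(String name, String value) {
            this.name = name;
            this.value = value;
        }
    }

    // Writes the service properties into "<SecurityInt><name>.xml" with the
    // XML declaration, one root element and one child element per property.
    public static void generateXML(String path, int securityInt, String name,
                                   Vector<PropertyStruct> vec) {
        StringBuilder xml = new StringBuilder();
        xml.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
        xml.append("<service>\n");
        for (PropertyStruct p : vec) {
            // property names may contain spaces, which are not allowed in
            // XML element names; removing them here is a simplification
            String tag = p.name.trim().replace(" ", "");
            xml.append("  <").append(tag).append(">")
               .append(p.value == null ? "NULL" : p.value)
               .append("</").append(tag).append(">\n");
        }
        xml.append("</service>\n");
        try (FileWriter out = new FileWriter(path + "/" + securityInt + name + ".xml")) {
            out.write(xml.toString());
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}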


Figure3-21 Implementation code for generating INI file
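Analogously, a sketch of the "generateINI" sub function could be written as below; it reuses the PropertyStruct holder from the XML sketch, and the section name [Service] is an assumption.

import java.io.FileWriter;
import java.util.Vector;

public class IniGenerator {

    // Writes the service properties into "<SecurityInt><name>.ini" as one
    // section with "key=value" parameters; reuses XmlGenerator.PropertyStruct.
    public static void generateINI(String path, int securityInt, String name,
                                   Vector<XmlGenerator.PropertyStruct> vec) {
        StringBuilder ini = new StringBuilder();
        ini.append("; service properties of ").append(name).append("\n"); // comment line
        ini.append("[Service]\n");                                        // section header
        for (XmlGenerator.PropertyStruct p : vec) {
            ini.append(p.name).append("=")                                // name-value pair
               .append(p.value == null ? "NULL" : p.value).append("\n");
        }
        try (FileWriter out = new FileWriter(path + "/" + securityInt + name + ".ini")) {
            out.write(ini.toString());
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}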

4) The code in figures 3-22 and 3-23 shows the process that turns the service properties of all services in these five Web Service Registries into records of the database. To this end a database has to be created first. The name of the database can be arbitrary as long as it conforms to the naming rules of the database system; the same holds for the name of the table. Figure 3-22 displays the procedure of creating a table in the database. Because it is hard to decide on the length of each service property, the data types of all service properties are set to "Text". Figure 3-23 is the code for inserting the service properties into the table with the "update" statement.

Figure3-22 Implementation code for creating table in database


Figure3-23 Implementation code for generating table records
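For the database output, a sketch based on JDBC could look as follows; the thesis does not state which database system is used, so the JDBC URL, the table layout with only three example columns, and the use of CREATE TABLE IF NOT EXISTS are assumptions.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.Statement;

public class DatabaseWriter {

    // Creates the (single) table for the service properties and inserts one
    // service as one record; only three example columns are shown here.
    public static void storeService(String jdbcUrl, String serviceName,
                                    String wsdlLink, String description) throws Exception {
        try (Connection con = DriverManager.getConnection(jdbcUrl)) {
            try (Statement st = con.createStatement()) {
                // all columns use the type TEXT, as described in step 4)
                st.executeUpdate("CREATE TABLE IF NOT EXISTS service_properties ("
                        + "service_name TEXT, wsdl_link TEXT, description TEXT)");
            }
            try (PreparedStatement ps = con.prepareStatement(
                    "INSERT INTO service_properties (service_name, wsdl_link, description) "
                            + "VALUES (?, ?, ?)")) {
                ps.setString(1, serviceName);
                ps.setString(2, wsdlLink);
                ps.setString(3, description);
                ps.executeUpdate();
            }
        }
    }
}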

33 Multithreaded Programming for DWSC

Multithreaded programming is a built-in characteristic provided by the Java language. A multithreaded program contains two or more separate parts that can execute concurrently. Each such part of a program is called a thread. The use of multithreading makes it possible to create programs that use the system resources efficiently, for example by making maximum use of the CPU, because the idle time of the CPU can be kept to a minimum.

In this master program there are five Web Service Registries whose services need to be crawled. Moreover, the number of services published in each Web Service Registry is quite different, so the running time needed for each Web Service Registry differs as well. As a consequence, a Web Service Registry with fewer services would have to wait until another Web Service Registry with many more services has finished. Therefore, in order to reduce the waiting time for the other Web Service Registries and to maximize the use of the system resources, it is necessary to apply multithreaded programming to this master program. That is to say, this master program creates one thread for each Web Service Registry, and these threads are executed independently.
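A minimal sketch of this thread-per-registry design could look as follows; the class names are illustrative, and the run method only stands in for the actual per-registry crawling procedure.

public class RegistryCrawlerThreads {

    // One Runnable per Web Service Registry; run() stands in for the actual
    // crawling procedure (Web Service Extractor, WSDL Grabber, Property
    // Grabber, Storage) described in this chapter.
    static class RegistryCrawler implements Runnable {
        private final String registryName;
        private final String seedUrl;

        RegistryCrawler(String registryName, String seedUrl) {
            this.registryName = registryName;
            this.seedUrl = seedUrl;
        }

        @Override
        public void run() {
            System.out.println("Start crawling " + registryName + " from " + seedUrl);
            // ... per-registry crawling process ...
        }
    }

    public static void main(String[] args) throws InterruptedException {
        String[][] registries = {
                {"Service Repository", "http://www.service-repository.com"},
                {"Ebi", "http://www.ebi.ac.uk"},
                {"Xmethods", "http://www.xmethods.net"},
                {"Seekda", "http://www.seekda.com"},
                {"Biocatalogue", "http://www.biocatalogue.com"}
        };
        Thread[] threads = new Thread[registries.length];
        for (int i = 0; i < registries.length; i++) {
            threads[i] = new Thread(new RegistryCrawler(registries[i][0], registries[i][1]));
            threads[i].start();   // each registry is crawled in its own thread
        }
        for (Thread t : threads) {
            t.join();             // wait until all registry threads have finished
        }
    }
}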

34 Sleep Time Configuration for Web Service Registries
Since this master program is intended to download the WSDL documents and to extract the service information of the Web services published in the Web Service Registries, it inevitably affects the performance of these Web Service Registries. In addition, in order not to exceed their throughput capability, these Web Service Registries restrict the rate of access. Because of that, unknown errors can sometimes happen while this master program is executing. For instance, the program may halt at one point without obtaining any more WSDL documents and service information, the WSDL documents of some services of some Web Service Registries may not be obtainable, or some service information may be missing. Therefore, in order to obtain the largest possible set of Web services published in these five Web Service Registries without affecting their throughput, the access rate for each service of all Web Service Registries has to be configured.

Consequently, before going into the essential procedure for each single service of these Web Service Registries, this master program calls the system's built-in function "sleep (long milliseconds)". It is a public static function which causes the currently executing thread to sleep for the specified number of milliseconds; in other words, it temporarily ceases execution for a while. The following table shows the time interval of the sleep function for each Web Service Registry.

Web Service Registry Name Time Interval (milliseconds)

Service Repository 8000

Ebi 3000

Xmethods 10000

Seekda 20000

Biocatalogue 10000

Table 3-15Sleep Time of these five Web Service Registries
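The configured intervals of table 3-15 could, for example, be applied as in the following sketch; the helper class and method names are assumptions.

import java.util.HashMap;
import java.util.Map;

public class SleepTimeConfig {

    // Sleep intervals per registry in milliseconds, taken from table 3-15.
    private static final Map<String, Long> SLEEP_TIME = new HashMap<>();
    static {
        SLEEP_TIME.put("Service Repository", 8000L);
        SLEEP_TIME.put("Ebi", 3000L);
        SLEEP_TIME.put("Xmethods", 10000L);
        SLEEP_TIME.put("Seekda", 20000L);
        SLEEP_TIME.put("Biocatalogue", 10000L);
    }

    // Pauses the current crawler thread before the next service of the given
    // registry is processed.
    public static void pauseBeforeNextService(String registryName) {
        try {
            Thread.sleep(SLEEP_TIME.get(registryName));
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}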


4 Experimental Results and Analysis

This chapter shows the quantitative experimental results of the prototype presented in chapter 3, together with an analysis and explanation of these results. In order to obtain reasonably accurate results, the experiments were carried out more than five times; all data displayed in the following tables and charts are the averages of these runs.

41 Statistic Information for Different Web Service Registries
This section discusses the amount statistics of the Web services published in these five Web Service Registries. They include the overall number of Web services published in each Web Service Registry and the number of unavailable Web services, which have been archived because they may not be active anymore or are close to becoming inactive. Table 4-1 shows the service amount statistics of these five Web Service Registries.

Web Service Registry Name Service Repository Ebi Xmethods Seekda Biocatalogue
Overall Services 57 289 382 853 2567
Unavailable Services 0 0 0 0 125
Table4-1 Service amount statistic of these five Web Service Registries

In order to provide an intuitive view of the service amount statistics of these five Web Service Registries, figure 4-1 shows a bar chart derived from table 4-1. As can be seen from the bar chart, on the one hand there is an ascending trend in the overall number of Web services from the Service Repository Web Service Registry to the Biocatalogue Web Service Registry. The Biocatalogue Web Service Registry in particular owns the largest number of Web services. This indicates that the Biocatalogue Web Service Registry has a much more powerful ability to provide Web services to users, because it contains far more services than the other four Web Service Registries. On the other hand, there are no unavailable services in any of the Web Service Registries except for the Biocatalogue Web Service Registry. That is to say, only the Biocatalogue Web Service Registry contains Web services that cannot be used. To some degree this is useless, since these services cannot be used anymore and they waste network resources on the Web. Therefore all these unavailable services should be eliminated in order to reduce the waste of network resources.


Figure4-1 Service amount statistic of these five Web Service Registries

42 Statistic Information for WSDL Document

Web Service Registry Name Service Repository Ebi Xmethods Seekda Biocatalogue
Failed WSDL Links 1 0 23 145 32
Without WSDL Links 0 0 0 0 16
Empty Content 0 0 2 0 2
Table4-2 Statistic information for WSDL Document

Table 4-2 and figure 4-2 show the error statistics for the WSDL documents of the Web services in these five Web Service Registries. There are three aspects. The first one is the "Failed WSDL Links" of the Web services in these Web Service Registries, that is, the overall number of Web services whose WSDL links are invalid. In other words, it is impossible to get the WSDL documents of these Web services with the URL addresses of their WSDL links, and therefore no WSDL document is created for them. The second aspect is the "Without WSDL Links" count, the overall number of Web services in each Web Service Registry that have no WSDL link at all. For such Web services the value of the WSDL link is "NULL"; a WSDL document is still created, but it has no content and its name contains the string "[No WSDL Document]". The third aspect is the "Empty Content" count, which represents the overall number of Web services that have WSDL links whose URL addresses


are valid but whose WSDL documents contain no content. In this case a WSDL document whose name contains the string "(BAD)" is created.

Figure4-2 Statistic information for WSDL Document

43 Comparison of Different Average Number of Service Properties
This section compares the average number of service properties in these five Web Service Registries. This average number is calculated by means of the following equation:

ASP = ONSP / ONS (1)

Where

ASP is the average number of service properties for one Web Service Registry

ONSP is the overall number of service properties in one Web Service Registry

ONS is the overall number of Web services that have already been crawled from the

corresponding Web Service Registry

Figure 4-3 shows the average number of service properties per Web service in these five Web Service Registries. As mentioned before, one of the measurements for assessing the quality of the Web services in a Web Service Registry is the amount of service information: the more information about a Web service is available, the better that service is known, and consequently the corresponding Web Service Registry can offer better quality of Web services to the users. As seen from figure 4-3, the Service Repository and Biocatalogue Web Service Registries own a larger number of service properties than the other three Web Service Registries. This directly reflects that these two Web Service Registries provide more detailed information about the Web services published in them, so that users can more easily choose the services they need and would also be more willing to use the Web services published in these two Web Service Registries. By contrast, the Xmethods and Seekda


Web Service Registries, which provide less information about their Web services, offer lower quality for these Web services. Therefore users may be less willing to use the Web services provided in these two Web Service Registries, not to mention the Web services published in the Ebi Web Service Registry.

Figure4-3 Average Number of Service Properties

From the description presented in section 3.2.3, the causes of the differing numbers of service properties in these Web Service Registries may consist of the following points. First, the amount of structured information for these Web services differs between the five Web Service Registries, and part of the information for some Web services in a registry may even be missing or have an empty value. For example, the amount of structured information that is supposed to be extracted for the Web services in the Ebi Web Service Registry differs greatly from that in the Biocatalogue Web Service Registry. Secondly, there is a certain amount of endpoint information for almost all Web Service Registries except the Ebi Web Service Registry, which more or less reduces the overall number of service properties. Thirdly, some Web Service Registries, such as Xmethods and Ebi, do not offer monitoring information; in particular, the Service Repository Web Service Registry has a large amount of monitoring information about its Web services that can be extracted from the Web. Obviously, the last point is the amount of whois information for these Web services. If the database of the whois client does not contain information about the service domain of a Web service in one Web Service Registry, then no whois information can be extracted. Moreover, even if there is information about the service domain, the amount of information can vary greatly. Therefore, if many service domains of the Web services in a registry have no or only little whois information, then the average number of service properties in that registry will decrease greatly.

As a result, in order to help users become better acquainted with the Web services provided in a Web Service Registry and to distinguish these Web services, the Web Service Registry should do its best to offer as much information as possible for each of its published Web services.

[Figure 4-3 chart: the average number of service properties per Web service is 23 for Service Repository, 7 for Ebi, 17 for Xmethods, 17 for Seekda and 32 for Biocatalogue]

44 Different Outputs of Web Services

The fundamental task of this master program is to obtain the WSDL documents of the Web services hosted in the Web Service Registries, to extract and gather the properties of these Web services, and thereafter to store them on disk. Therefore, this section describes the different outputs of this master program, which include the WSDL documents of the Web services, the generated XML and INI files, and the data records of these service properties in the database.

Figure4-4 WSDL Document format of one Web service

The WSDL document of a Web service is simply read from the Web according to the URL address of its WSDL link, and these data are then stored on disk as the WSDL document. The name of the WSDL document is the service name plus the ending ".wsdl", which marks it as a WSDL document. In order to distinguish WSDL documents whose names would otherwise be the same although their contents differ, the name of each obtained WSDL document in one Web Service Registry is prefixed with a unique integer. Figure 4-4 shows the valid WSDL document format of one Web service; its name is "1BLZService.wsdl".

Besides, the obtained service properties are transformed into an XML file, an INI file and data records in the database. Figure 4-5, figure 4-6 and figure 4-7 show these three different output formats respectively. As can be seen from figure 4-5, this is the INI file of the Web service, whose name is "1BLZService.ini". The integer is the same as in the WSDL document, because both files belong to the same Web service. The first three lines of that INI file are service comments, which start with a semicolon and run to the end of the line; they are basic information describing this INI file. The following line is the section, enclosed in a pair of brackets. It is important because it indicates that the lines behind it contain the information of this Web service. The remaining lines therefore hold the actual service information as key-value pairs, with an equals sign between key and value; each service property starts at the beginning of a line. A small sketch of how such a file can be generated is given below.
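For illustration only, the following minimal Python sketch (it is not the actual implementation of this master program; the file name, section name and property names are assumptions) writes service properties into an INI file of the shape described above:

import configparser

def write_service_ini(path, service_name, properties):
    parser = configparser.ConfigParser()
    parser[service_name] = properties                      # one section per Web service
    with open(path, "w") as ini_file:
        # the comment lines described above, each starting with a semicolon
        ini_file.write("; generated by the Deep Web Service Crawler\n")
        ini_file.write("; service properties of one Web service\n")
        ini_file.write("; stored as key = value pairs\n")
        parser.write(ini_file)                              # [section] followed by key = value lines

write_service_ini("1BLZService.ini", "BLZService",
                  {"Service Name": "BLZService", "Service Provider": "thomas-bayer.com"})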


Figure4-5 INI File format of one Web service

Figure4-6 XML File format of one Web service

Figure4-7 Database data format for all Web services


Furthermore, figure 4-6 shows the generated XML file format of the Web service "BLZService"; its name is "1BLZService.xml". Needless to say, this XML file is also part of the materials of the same Web service. Although the format of the XML file differs from that of the INI file, their essential contents are the same, that is, the values of the service properties do not differ, because both files are generated from the same collection of properties of that Web service. The XML file also has some comments, like those in the INI file, which are enclosed between "<!--" and "-->". The section of the INI file corresponds to the root element of the XML file. Therefore, the values of the elements below the root "service" in this XML file are the values of the service properties of this Web service. A corresponding sketch for generating such an XML file follows.
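Again purely as an illustration (not the actual implementation of this master program; element names other than the root "service" are assumptions), an XML file with this structure can be produced in Python as follows:

import xml.etree.ElementTree as ET

def write_service_xml(path, properties):
    root = ET.Element("service")                                   # root element, as described above
    root.append(ET.Comment("service properties of one Web service"))
    for name, value in properties.items():
        child = ET.SubElement(root, name.replace(" ", ""))         # one child element per property
        child.text = value
    ET.ElementTree(root).write(path, encoding="utf-8", xml_declaration=True)

write_service_xml("1BLZService.xml",
                  {"Service Name": "BLZService", "Service Provider": "thomas-bayer.com"})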

Eventually, figure 4-7 shows the database table that is used to store the service information of all Web services in these five Web Service Registries. The entire service information of one Web service occupies exactly one record in this table. Because of that, the column names of the table are the union of the names of the service information fields of all Web Service Registries. However, since column names must be unique, the redundant names in this union have to be eliminated; this is sensible and possible because the names of the service information fields are well defined and uniform across all five Web Service Registries. In addition, the first column of this table is the primary key, an increasing integer whose function resembles that of the integer contained in the names of the XML and INI files, while the remaining columns are the corresponding service properties of the Web services. The value "NULL" or "Null" in this table indicates that the respective property of that Web service is empty or missing. A small sketch of such a table and one inserted record is shown below.
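The following sketch illustrates this storage step with SQLite in Python; it is not the actual schema of this master program, and the column names used here are only a small hypothetical subset of the real union of property names:

import sqlite3

columns = ["ServiceName", "ServiceProvider", "WSDLLink"]           # hypothetical subset of the union

connection = sqlite3.connect("services.db")
cursor = connection.cursor()
cursor.execute(
    "CREATE TABLE IF NOT EXISTS services ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, "                       # increasing integer primary key
    + ", ".join(name + " TEXT" for name in columns) + ")")

record = {"ServiceName": "BLZService", "ServiceProvider": "thomas-bayer.com"}   # WSDLLink is missing
cursor.execute(
    "INSERT INTO services (" + ", ".join(columns) + ") VALUES ("
    + ", ".join("?" for _ in columns) + ")",
    [record.get(name) for name in columns])                        # missing values are stored as NULL
connection.commit()
connection.close()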

45 Comparison of Average Time Cost for Different

Parts of Single Web Service

This section describes the comparison of the average time cost of the different parts of getting one single Web service in all these five Web Service Registries. At first, the average time cost of getting one single service in a Web Service Registry has to be calculated, which is obtained through the following equation:

ATC = OTS / ONS (2)

Where

ATC is the average time cost for one single Web service

OTS is the overall time cost of all the Web services in one Web Service Registry

ONS is the overall number of Web services that have already been crawled from the

corresponding Web Service Registry

In addition, the different parts of the average time cost for getting one single service consist of the following six aspects: the average time cost for extracting the service properties, the average time cost for obtaining the WSDL document, the average time cost for generating the XML file, the average time cost for generating the INI file, the average time cost for inserting the service properties into the database table, and the average time cost for some other procedures, such as getting the service list page link, getting the service page link, and so on. The average time cost for extracting the service properties is obtained by means of the following equation:

ATCSI = OTSSI / ONS (3)

Where

ATCSI is the average time cost for extracting service property of one single Web service

OTSSI is the overall time cost for extracting service property of all the Web services in one Web

Service Registry

ONS is the overall number of Web services that have already been crawled from the

corresponding Web Service Registry

The calculation of the other parts is analogous to the equation for the average time cost for extracting the service properties, while the average time cost for the other procedures equals the average time cost for one single Web service minus the sum of the average time costs of the other five parts. A small sketch of this calculation is given below.
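As an illustration of equations (2) and (3) and of the remainder for the other procedures, the following Python sketch computes these averages; the function and variable names as well as the example totals are hypothetical (the totals are merely chosen so that the resulting averages resemble one row of table 4-3):

def average_time_costs(overall_ms, part_totals_ms, number_of_services):
    # equation (2): ATC = OTS / ONS
    atc = overall_ms / number_of_services
    # equation (3) and its analogues for the other parts
    part_averages = {name: total / number_of_services for name, total in part_totals_ms.items()}
    # the remaining share belongs to the other procedures (service list page, service page, ...)
    part_averages["others"] = atc - sum(part_averages.values())
    return atc, part_averages

atc, parts = average_time_costs(
    overall_ms=10_042_000,
    part_totals_ms={"service property": 8_801_000, "wsdl document": 918_000,
                    "xml file": 2_000, "ini file": 1_000, "database": 53_000},
    number_of_services=1_000)
print(atc, parts)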

(all values are average times in milliseconds)

Registry             Service property   WSDL Document   XML File   INI File   Database   Others   Overall
Service Repository   8801               918             2          1          53         267      10042
Ebi                  699                82              2          1          28         11       823
Xmethods             5801               1168            2          1          45         12       7029
Seekda               5186               1013            2          1          41         23       6266
Biocatalogue         39533              762             2          1          66         1636     42000

Table4-3 Average time cost information for all Web Service Registries

Table 4-3 displays the average time cost of one Web service and of its different parts in all these five Web Service Registries. The first column of table 4-3 contains the names of the five Web Service Registries, the last column is the overall average time cost for a single service in one Web Service Registry, and the remaining columns are the average time costs of the six different parts. In order to give an intuitive view of the data in table 4-3, the data in each column of this table are illustrated by the corresponding figures 4-8 to 4-13.

Figure4-8 Average time cost for extracting service property in all Web Service Registries


As can be seen from figure 4-8, the average time cost for extracting the service properties in the Biocatalogue Web Service Registry is 39533 milliseconds. This value is much larger than those of the other four Web Service Registries, which are 8801, 699, 5801 and 5186 milliseconds for Service Repository, Ebi, Xmethods and Seekda respectively. That is to say, it takes much longer to extract the service properties of the Web services published by the Biocatalogue Web Service Registry. This also indirectly indicates that the Biocatalogue Web Service Registry has the largest average number of service properties, as already discussed in section 4.3, whereas the average number of service properties in the Ebi Web Service Registry is the smallest. Moreover, the average time cost in the Xmethods Web Service Registry is larger than that in the Seekda Web Service Registry, although the average number of service properties is the same for these two Web Service Registries. One cause that might explain why Xmethods costs more time than Seekda is that the extraction of service properties in the Xmethods Web Service Registry has to be carried out via both the service page and the service list page, while for the Seekda Web Service Registry only the service page link is needed.

The average time cost for obtaining the WSDL document in all these five Web Service Registries is displayed in figure 4-9. This average time cost is the sum of the average time for extracting the WSDL link of a Web service and the average time for reading the data of the WSDL document from the Web and storing it on disk. As seen from this figure, the average time cost for obtaining the WSDL document is largest in the Xmethods Web Service Registry, at 1168 milliseconds. Although the process of extracting the WSDL link costs a certain amount of time, it does not have a significant influence on the total average time spent for obtaining the WSDL document, because the WSDL link of one Web service is usually gained in a single step. This implies that the average size of the WSDL documents in the Xmethods Web Service Registry is larger than in the other four Web Service Registries, in particular larger than in the Ebi Web Service Registry, which takes just 82 milliseconds for obtaining the WSDL document.

Figure4-9 Average time cost for obtaining WSDL document in all Web Service Registries


Figure 4-10, figure 4-11 and figure 4-12 show the average time cost of generating these three different outputs in all five Web Service Registries. As can be seen from figures 4-10 and 4-11, the average time for generating the XML file of one Web service is the same for all five Web Service Registries, namely only 2 milliseconds, and the average time for generating the INI file is likewise the same, at just 1 millisecond. Even the sum of these two average time costs is still so small that it can be neglected when compared with the overall average time cost of getting one Web service for each corresponding Web Service Registry, which is shown in figure 4-13. This implies that the generation of the XML and INI files is finished almost immediately after receiving the service properties of one Web service as input. Furthermore, figure 4-12 shows that, although the average time cost of creating the database record for each Web service is larger than the time for generating the XML and INI files in all five Web Service Registries, the operation of creating the database record is still fast.

Figure4-10 Average time cost for generating XML file in all Web Service Registries

Figure4-11 Average time cost for generating INI file in all Web Service Registries


Figure4-12 Average time cost for creating database record in all Web Service Registries

Figure4-13 Average time cost for getting one Web service in all Web Service Registries

Figure 4-13 gives the average time cost for getting one single Web service in all these five Web Service Registries. Without any doubt, the Biocatalogue Web Service Registry takes the longest time for this process. This is because, as the presentation of the five different parts above shows, each part needs more time in the Biocatalogue Web Service Registry than in the others, except for the process of obtaining the WSDL document, where Biocatalogue is not the slowest. Moreover, a striking observation can be made when looking at figures 4-8, 4-12 and 4-13: the shapes of these curves follow almost the same trend. This further indicates that a Web Service Registry that spends more time on getting the description information of a Web service also offers more information about that Web service.


5 Conclusion and Further Direction

This master thesis provides a schema which aims to explore the description information of the Web services hosted in different Web Service Registries. The description information of a Web service consists of its WSDL document and the service information about that Web service. Although there is an existing approach, the Pica-Pica Web Service Description Crawler, that can be used to obtain the WSDL document and the service information of the Web services in these Web Service Registries, its functionality is restricted: it explores only a small subset of the Web services hosted in these Web Service Registries, the storage of these Web services is not flexible, and, most importantly, only a little service information is extracted per Web service; for some Web Service Registries no service information is extracted at all. The work presented in this master thesis, in contrast, is able to explore all Web services in these Web Service Registries. Moreover, with the approach presented in this master thesis, as much service information as possible is extracted for each Web service, so that the final result is the largest annotated service catalogue ever produced. Furthermore, regarding the storage of the description information of the Web services, this master thesis provides three different ways that guarantee not only the completeness but also the longevity of the description information of the Web services.

However, in the implementation performed in this master thesis, the whois client used for querying the information of a service domain returns free text if the information exists, and sometimes this free text differs completely from domain to domain. As a consequence, each Web service in all Web Service Registries had to be crawled at least once in the experiment stage, so that all variants of this free text could be foreseen and processed afterwards. Nevertheless, this is a huge amount of work because there are lots of Web services in these Web Service Registries. Therefore, in order to simplify this work, another whois client that eases this task needs to be found and used; the sketch below illustrates the problem.
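For illustration, the following Python sketch (the domain is taken from the example in chapter 3; the rest is an assumption, not the code of this master program) simply calls the system whois command line client and shows why its result is hard to process: it is unstructured free text whose layout differs between domain registries:

import subprocess

def raw_whois(domain):
    # returns the unstructured free-text answer of the whois command line client
    result = subprocess.run(["whois", domain], capture_output=True, text=True)
    return result.stdout

text = raw_whois("thomas-bayer.com")
# a structured whois client would have to map lines such as "Registrar: ..." onto
# uniform fields instead of leaving this free text to be parsed case by case
print(text.splitlines()[:15])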

Moreover, in the experiment stage of this master thesis, the time cost for getting a Web service is still large. In order to reduce this time, multithreaded programming could be applied to some parts of the process of getting one Web service, as sketched below.
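A minimal sketch of this idea, assuming a thread pool over hypothetical WSDL links (this is not part of the implemented crawler), could look as follows; the network-bound steps, such as downloading WSDL documents, would benefit most:

from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

def fetch_wsdl(url):
    # download one WSDL document; several of these calls can run in parallel
    with urlopen(url, timeout=30) as response:
        return url, response.read()

wsdl_links = ["http://www.example.org/service1?wsdl",    # hypothetical WSDL links
              "http://www.example.org/service2?wsdl"]

with ThreadPoolExecutor(max_workers=8) as pool:
    for url, document in pool.map(fetch_wsdl, wsdl_links):
        print(url, len(document), "bytes")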

Although the work performed here is specialized for these five Web Service Registries only, the main principles used here are adaptable to other Web Service Registries with only small changes to the implementation code or the structure.


6 Bibliography

[1] Nathalie Steinmetz, Holger Lausen, Dario Cerizza, Andrea Turati, Irene Celino, Adam Funk and Michael Erdmann, "D1.1 - Requirement Analysis and Architectural Plan", available from http://www.service-finder.eu/deliverables/public/7-public/32-d11, Emanuele Della Valle (CEFRIEL), June 27, 2008.

[2] Nathalie Steinmetz, Holger Lausen, Dario Cerizza, Andrea Turati, Irene Celino, Adam Funk and Michael Erdmann, "D1.2 - First Design of Service-Finder as a Whole", available from http://www.service-finder.eu/deliverables/public/7-public/37-d12-first-design-of-service-finder-as-a-whole, Emanuele Della Valle (CEFRIEL), July 1, 2008.

[3] Nathalie Steinmetz, Holger Lausen, Irene Celino, Dario Cerizza, Saartje Brockmans, Adam Funk, "D1.3 - Revised Requirement Analysis and Architectural Plan", available from http://www.service-finder.eu/deliverables/public/7-public/57-d13-revised-requirement-analysis-and-architectural-plan, Emanuele Della Valle (CEFRIEL), April 1, 2009.

[4] Chia-Hui Chang, Mohammed Kayed, Moheb Ramzy Girgis, Khaled Shaalan, "A Survey of Web Information Extraction Systems", Volume 18, Issue 10, IEEE Computer Society, pp. 1411-1428, October 2006.

[5] Leonard Richardson, "Beautiful Soup Documentation", available from http://www.crummy.com/software/BeautifulSoup/documentation.html, October 13, 2008.

[6] Hao He, Hugo Haas, David Orchard, "Web Services Architecture Usage Scenarios", available from http://www.w3.org/TR/ws-arch-scenarios, February 11, 2004.

[7] Stephen Soderland, "Learning Information Extraction Rules for Semi-Structured and Free Text", Volume 34, Issue 1-3, Journal of Machine Learning, Department of Computer Science and Engineering, University of Washington, Seattle, pp. 233-272, February 1999.

[8] Ian Hickson, "A Vocabulary and Associated APIs for HTML and XHTML", World Wide Web Consortium, Working Draft WD-html5-20100624, January 22, 2008.

[9] Holger Lausen, Jos de Bruijn, Axel Polleres, Dieter Fensel, "The Web Service Modeling Language WSML", WSML Deliverable D16.1 v0.2, March 20, 2005, available from http://www.wsmo.org/TR/d16/d16.1/v0.2/.

[10] Dumitru Roman, Holger Lausen, Uwe Keller, "Web Service Modeling Ontology - Standard (WSMO-Standard)", WSMO deliverable D2 version 1.1, 06 March 2004, available from http://www.wsmo.org/TR/d2/v1.1/.

[11] Iris Braum, Anja Strunk, Gergana Stoyanova, Bastian Buder, "ConQo - A Context- And QoS-Aware Service Discovery", TU Dresden, Department of Computer Science, in Proceedings of WWW/Internet 2008.


7 Appendixes

There are two additional outputs of this master program: a log information file and a statistic report file. Figure 8-1 shows the basic output format of the log information, using the "Service Repository" Web Service Registry as an example.

Figure8-1 Log information of the "Service Repository" Web Service Registry

Figure8-2 Statistic information of the "Service Repository" Web Service Registry


Figure 8-2 to figure 8-6 show the output format of the statistic information for the "Service Repository", "Ebi", "Xmethods", "Seekda" and "Biocatalogue" Web Service Registries respectively.

Figure8-3 Statistic information of the "Ebi" Web Service Registry

Figure8-4 Statistic information of the "Xmethods" Web Service Registry


Figure8-5 Statistic information of the "Seekda" Web Service Registry

Figure8-6 Statistic information of the "Biocatalogue" Web Service Registry


Table of Figures

Figure2-1 Dataflow of Service-Finder and Its Components ... 12
Figure2-2 Left is the free text input type and right is its output ... 16
Figure2-3 A Semi-structured page containing data records (in rectangular box) to be extracted ... 16
Figure2-4 Architecture of the Pica-Pica Web Service Description Crawler ... 20
Figure3-1 Overview of the Basic Architecture for the Deep Web Services crawler ... 25
Figure3-2 Overview the process flow of the Web Service Extractor Component ... 27
Figure3-3 Service list page of the Service-Repository ... 29
Figure3-4 Original source code of the internal link for Web service "BLZService" ... 29
Figure3-5 Code Overview of getting service page link in Service Repository ... 29
Figure3-6 Service page of the Web service "BLZService" ... 29
Figure3-7 Overview the process flow of the WSDL Grabber Component ... 30
Figure3-8 WSDL link of the Web service "BLZService" in the service page ... 31
Figure3-9 Original source code of the WSDL link for Web service "BLZService" ... 32
Figure3-10 Code overview of "getServiceRepositoryWSDLLink" function ... 32
Figure3-11 Code overview of "oneParameter" function ... 32
Figure3-12 Overview the process flow of the Property Grabber Component ... 33
Figure3-13 Structure properties of the Service "BLZService" in service list page ... 37
Figure3-14 Structure properties of the Service "BLZService" in service page ... 38
Figure3-15 Endpoint information of the Web service "BLZService" in service page ... 38
Figure3-16 Monitoring Information of the Service "BLZService" in service page ... 39
Figure3-17 Whois Information of the service domain "thomas-bayer.com" ... 40
Figure3-18 Overview the process flow of the Storage Component ... 41
Figure3-19 Implementation code for getting WSDL document ... 44
Figure3-20 Implementation code for generating XML file ... 44
Figure3-21 Implementation code for generating INI file ... 45
Figure3-22 Implementation code for creating table in database ... 45
Figure3-23 Implementation code for generating table records ... 46
Figure4-1 Service amount statistic of these five Web Service Registries ... 49
Figure4-2 Statistic information for WSDL Document ... 50
Figure4-3 Average Number of Service Properties ... 51
Figure4-4 WSDL Document format of one Web service ... 52
Figure4-5 INI File format of one Web service ... 53
Figure4-6 XML File format of one Web service ... 53
Figure4-7 Database data format for all Web services ... 53
Figure4-8 Average time cost for extracting service property in all Web Service Registries ... 55
Figure4-9 Average time cost for obtaining WSDL document in all Web Service Registries ... 56
Figure4-10 Average time cost for generating XML file in all Web Service Registries ... 57
Figure4-11 Average time cost for generating INI file in all Web Service Registries ... 57
Figure4-12 Average time cost for creating database record in all Web Service Registries ... 58
Figure4-13 Average time cost for getting one Web service in all Web Service Registries ... 58


Table of Tables

Table 3-1 Structured Information of Service-Repository Web Service Registry ... 34
Table 3-2 Structured Information of Xmethods Web Service Registry ... 34
Table 3-3 Structured Information of Seekda Web Service Registry ... 34
Table 3-4 Structured Information of Ebi Web Service Registry ... 34
Table 3-5 Structured Information of Biocatalogue Web Service Registry ... 34
Table 3-6 SOAP Operation Information of Biocatalogue Web Service Registry ... 35
Table 3-7 REST Operation Information of Biocatalogue Web Service Registry ... 35
Table 3-8 Endpoint Information of these five Web Service Registries ... 35
Table 3-9 Monitoring Information of these five Web Service Registries ... 35
Table 3-10 Whois Information for these five Web Service Registries ... 36
Table 3-11 Extracted Structured Information of Web Service "BLZService" ... 38
Table 3-12 Extracted Endpoint Information of the Web service "BLZService" ... 39
Table 3-13 Extracted Monitoring Information of the Web service "BLZService" ... 39
Table 3-14 Extracted Whois Information of service domain "thomas-bayer.com" ... 40
Table 3-15 Sleep Time of these five Web Service Registries ... 47
Table 4-1 Service amount statistic of these five Web Service Registries ... 48
Table 4-2 Statistic information for WSDL Document ... 49
Table 4-3 Average time cost information for all Web Service Registries ... 55


Table of Abbreviations

DTD Document Type Definition

DWSC Deep Web Service Crawler

HTML HyperText Markup Language

MTBF Mean Time between Failures

MTTR Mean Time to Recovery

QoS Quality of Service

REST Representational State Transfer

RTT Round Trip Time

SOAP Simple Object Access Protocol

SQL Structured Query Language

URL Uniform Resource Locator

WHATWG Web Hypertext Application Technology Working Group

XML eXtensible Markup Language

WSDL Web Service Description Language

WSML Web Service Modeling Language

WSMO Web Service Modeling Ontology

Page 9: Deep Web Service Crawler

Deep Web Service Crawler

9

Then following in chapter 3 it is explained the designing details of this Deep Web Service Crawler

approach In section 31 it gives a short description for the different requirements of this approach

Next in section 32 the actual design for the Deep Web Service Crawler is presented Then in section

33 34 the multithreaded programming and sleep time configuration that used in this master

program are introduced respectively

In chapter 4 it is supposed to display the experiments of this Deep Web Service Crawler approach

and then give some evaluation of it

Finally in Section 5 the conclusion and discussion of the work that already done as well as the work in

the future for this master task are presented respectively

Deep Web Service Crawler

10

2 State of the Art

This chapter aims at presenting some existing techniques or Strategies that related to the work of

applying this Deep Web Service Extraction Crawler approach Section 21 plans to talk about the

existing catalogues Service-Finder project And then in section 22 it is going to explain the existing

implemented crawlers Pica-Pica Web Service Description Crawler Finally in the section 23 it is

supposed to present some details about the Information Extraction technique

21 Service Finder Project

Service-Finder Project aims at developing a platform for Web Service discovery especially for the Web

Services that are embedded in a Web 20 environment [1] Hence it can provide an efficient access to

publicly available services The goals of the Service-Finder project are depicted as follows [1]

n Automatically gather Web Services and their related information

n Semi-automatically create semantic service description based on the information that available

on the Web

n Create and improve semantic annotations via the user feedback

n Describe the aggregated information in semantic models and allow reasoning query

However before describing the basic functionality of the Service-Finder Project there is going to

present one of its use cases and requirements first

211 Use Cases for Service-Finder Project

The Service-Finder Project employed the use case methodology of the W3C Use Case description [6]

for its needs and then applied this methodology to the use cases that it enumerated

2111 Use Case Methodology

There are three aspects needed to be considered for use case definitions of Service-Finder Project [1]

(1) Description that used to describe information of the use case

(2) Actors Roles and Goals that used to identify the actors they would be the roles they act and the

goals they need to achieve in the scenario

(3) Storyboard that used to describe the serial of interactions among the actors and the

Service-Finder Portal

2112 System Administrator

This section is going to present the use case that applied to the Service-Finder portal and that

illustrated the requirements on its functionality from a user point of view However all these

Deep Web Service Crawler

11

information in this use case are derived from [1] In this use case there has a system administrator

whose name is ldquoSam Adamsrdquo He works for the bank His job is to keep the online payment facilities

online and working all day and night Therefore if there is any system failures Sam Adams should fix

the problems as early as he can Thatrsquos why he wants to use an SMS Messaging service which will

alert him immediately by sending him a SMS Message in the case of a system failure

n Description

This use case is dedicated to describe the system administrator ldquoSam Adamsrdquo who is looking for an

SMS Messaging Service that he wants to build it into his application

n Actors Roles and Goals

The name of the Actor is ldquoSam Adamsrdquo His Role is a system administrator at a bank The goals of him

are the immediate service delivery the reliability of the service and low base fee and transaction fee

n Storyboard

Step 1 Because of Sam Adams knows the Service-Finder portal and he also knows that he can find

many useful services from it especially he know what he is looking for Hence he visits the

Service-Finder portal and starts to search by entering the keywords ldquoSMS Alertrdquo

Requirement 1 Search functionality

Step 2 Now the Service-Finder returns a list of matching services However Sam wants to choose the

number of matching services that will be displayed on one page And he would also expect there has

short information about the service functionality the service provider and the service availability So

that he could decide which service he will choose to read further

Requirement 2 Enable configurable pagination of the matching results and have some short

information for each service

Step 3 When Sam looks through the short information about the services that displayed on the first

page he expects to find the most relevant services that related to his request After that he would

like to read more detailed information about that service to see whether this service can provide the

needed functionality

Requirement 3 Rank the returned matching services and must provide ability to read more details of

a service

Step 4 In the case that all the returned matching services Sam got provide quite different

functionalities or they belong to different service categories for example the SMS messaging services

alert users not through SMS but voice messaging For this reason Sam would like to see other

different categories that may be contain the services he wants Or the services of other categories

which he is also interested in (like ldquoSMS Messagingrdquo) Besides another possible way is that Sam can

further filter his search in terms of browsing through categories

Requirement 4 Service categories and allow the user to look all services that belonged to that specific

category If possible it should also allow the user to browse through categories

Step 5 When Sam got all the services that could provide a SMS messaging service via the methods

described in the Step 4 at present he wants to look for the services that offered by an Austrian

provider and have no base fees if possible

Requirement 5 Faceted search

Deep Web Service Crawler

12

Step 6 After Sam got all these specific services now he would like to choose the services that can

provide a high reliability

Requirement 6 Sort functionality based on usersrsquo chooses

Step 7 For now Sam expects to compare the service availability between the promised to the service

provider and the actually provided This should be contained in the servicesrsquo details And there needs

also have service coverage information so that Sam can know whether this service covers the areas

he lives and works Moreover Sam would also like to compare these services in other way For

instance put some services into a structured table to compare the transaction fees

Requirement 7 A side-by-side comparison table for services and a functionality that enable users to

select services he wants to compare

Step 8 At last Sam wants to know whether the service providers offer a free try out of the services

So that he can test the service functionality

Requirement 8 If possible display a note that offering free service trials

212 Architecture Plan for the Service-Finder Project

The architecture plan of the Service-Finder project contains five basic components Service Crawler

Automatic Annotator Conceptual Indexer and Matcher Cluster Engine and Service-Finder Portal

Interface Figure 2-1 presents a high level overview of the components and the data flow among

them

Figure2-1Dataflow of Service-Finder and Its Components [3]

Deep Web Service Crawler

13

2121 The Principle of the Service Crawler Component

The Service Crawler is responsible for gathering available services and their related information from

the Web The overall cycle is depicted as following

(1) A Web developer publishes a Web Service

(2) Then the Crawling component begins to harvest the Web in order to identify the Web Services

like WSDL (Web Service Description Language) documents

(3) The Crawler is also going to search for other related information as long as a service is discovered

(4) Later after each periodic interval the Crawler will produce a consistent snapshot of the relevant

part of the Web

At last the output of the crawler would be forwarded to the subsequent components for analyzing

indexing and displaying

2122 The Principle of the Automatic Annotator Component

The Automatic Annotator receives the relevant data from previous component and generates

semantic service descriptions about the WSDL documents and its related information based on the

Service-Finder Ontology and Service Category Ontology

Firstly it will simply introduce those two compatible ontologies that would be used throughout the

whole process [2]

n Generic Service Ontology it is an ontology which is functional to describe the data objects For

example the services the service providers availability payment modalities and so on

n Service Category Ontology it is an ontology which is used to categorize the functionalities or

applications of the services For instance data verification messaging data storage weather etc

Afterwards it is going to talk about the function of this component with its input output

Oslash Input

u Crawled data from Service Crawler

u Service-Finder Ontologies

u Feedback or Correction of before annotations

Oslash Function

u Enrich the information about the service and extract semantic statements according to the

Service-Finder Ontologies For example categorize the service according to the Service

Category Ontology

u Determine whether a particular document is relevant or not through the Web link graph If

not discard these irrelevant documents

u Classify the pages into their genres For instance pricing user comments FAQ and so on

Oslash Output

u Semantic annotation of the services

Deep Web Service Crawler

14

2123 The Principle of the Conceptual Indexer and Matcher

Component

The Conceptual Indexer and Matcher is more like a data store center that aims at storing all extracted

information of the services and supplying users the capability of retrieval and semantic query For

example the matchmaking between user requests and service offers and the act of retrieving user

feedback on extracted annotations

In addition letrsquos have a look of the function of this component and its input output

Oslash Input

u Semantic annotation data and full text information obtained from Automatic Annotation

u Semantic annotation data and full text information that come from user interfaces

u Cluster data from user and service clustering component

Oslash Function

u Store the semantic annotations received from the Automatic Annotation component and

from the user interface

u Store the cluster data that procured through the clustering component

u Store and index the textual description offered by the Automatic Annotation component

and the textual comments offered by users

u Ontological query the semantic data from the data store center

u Combined keyword and Ontological querying used for user queries

u Provide a list of similar services for a given service

Oslash Output

u A list of matching services that are queried by users In particular these services should be

sorted by ranking and can also be iterated

u All available data that related to a particular entity must be retrievable at the user interface

2124 The Principle of the Service-Finder Portal Interface

Component

The Service-Finder Portal Interface is the main entry point that provided for users of the

Service-Finder system to search and browse the data which is managed by the Conceptual Indexer

and Matcher component In addition the users can also contribute information by means of providing

tags comments categorizations and ratings to the data browsed Furthermore the developers can

still directly invoke the Service-Finder functionalities from their custom applications in terms of an API

Besides the details of this componentrsquos function input and output are represented as below

Oslash Input

u A list of ordered services for a query

u Detailed information about a service or a set of services and a service provider

u Query access to service category ontology and the most used tags provided by the users

Deep Web Service Crawler

15

u Service availability information

Oslash Function

u The Web Interface allows the users to search services by keyword tag or concept in the

categorization sort and filter query results by refining the query compare and bookmark

services try out the services that offer this functionality

u The API allows the developers to invoke Service-Finder functionalities

Oslash Output

u Explicit user annotations such as tags ratings comments decryptions and so on

u Implicit user data for example click stream of users bookmarks comparisons links sent

etc

u Manual advertising of available new services

2125 The Principle of the Cluster Engine Component

The Cluster Engine is in charge of collecting and analyzing information of user behaviors from the

Service-Finder Portal eg the queried services and the compared services of the users Moreover it

also provides cluster data to the Conceptual Indexer and Matcher for providing service

recommendations

Furthermore letrsquos detailed introduce this componentrsquos function input and output

Oslash Input

u Service annotation data of both extracted and user feedback

u Usersrsquo Click streams used for extracting user behaviors

Oslash Function

u Obtain user clusters from user behaviors

u Obtain service clusters from service annotation data to enable to find similar services

Oslash Output

u Clusters of users and services

22 Information Extraction

Due to the cause of the rapid development and use of the World-Wide Web there procedure a huge

amount of information sources on the Internet which has been limited the access to browsing and

searching for the reason of the heterogeneity and the lack of structure of Web information sources

Therefore the appearance of Information Extraction that transforms the Web pages into

program-friendly structures for post-processing would become a great necessity However the task of

Information Extraction is specified in terms of the inputs and the extraction targets And the

techniques used in the process of Information Extraction called extractor

221 Input Types of Information Extraction

Generally speaking there are three different input types The first input type is the unstructured

Deep Web Service Crawler

16

document For example the free text that showed in figure 2-2 It is unstructured and written in

natural language So that it will require substantial natural language processing While the second

input type is called the structured document For instance the XML documents based on the reason

that the data can be described through the available DTD (Document Type Definition) or XML

(eXtensible Markup Language) schema Finally but obviously the third input type is the

semi-structured document that are widespread on the Web Such as the large volume of HTML

pages like tables itemized lists and enumerated lists This is because HTML tags are often used to

render these embedded data in the HTML pages See figure 2-3

Figure2-2Left is the free text input type and right is its output [4]

Figure2-3A Semi-structured page containing data records

(in rectangular box) to be extracted [4]

Therefore in this way the inputs of semi-structured type can be seen as the documents with a fairly

regular structure And the data of these documents can be displayed in a format of HTML way or

non-HTML way Besides owing to the reason that the Web pages of the Deep Web are dynamic and

generated from structured databases in terms of some templates or layouts thus it would be

considered as one of the input sources which could provide some of these semi-structured documents

For example the authors price and comments of the book pages that provided by Amazon have the

Deep Web Service Crawler

17

same layout That is because these Web pages are generated from the same database and applied

with the same template or layout Furthermore there has another option which could manually

generate HTML pages of semi-structured type For example although the publication lists that

provided from different kinds of researchersrsquo homepages are produced by diverse uses they all have

title and source property for every single pager Eventually the inputs for some Information Extraction

can also be the pages with the same class or among various Web Service Registries

222 Extraction Targets of Information Extraction

Moreover regarding the task of the Information Extraction it has to consider the extraction target

There also have two different extraction targets The first one is the relation of k-tuple And the k in

there means the number of attributes in a record Nevertheless in some cases an attribute of one

record may have none instantiation Otherwise the attribute owns multiple instantiations In addition

the complex object with hierarchically organized data would be the second extraction target Though

the ways for depicting the extraction targets in a page are diverse the most common structure is the

hierarchical tree This hierarchical tree may contain only one leaf node or one or more lists of leaf

nodes which called internal nodes And the structure for a data object may also be flat or nested To

be brief if the structure is flat then there is only one leaf node that call also be called root Otherwise

if it is nested structure then the internal nodes that involved in this data object would be more than

two levels

Furthermore in order to make the Web pages readable for human being and having an easier

visualization these tables or tuples of the same list or elements of a tuple should be definitely isolated

or demarcated However the displaying for a data object in a Web page would be affected by

following conditions [4]

Oslash The attribute of a data object has zero or several values

(1) If there is no value for the attribute of a data object this attribute will be called the ldquononerdquo

attribute For example a special offer only available for certain books might be a ldquononerdquo

attribute

(2) If there are more than one values for the attribute of a data object it will be called the

ldquomultiValuerdquo attribute For instance the name of the author for a book could be a

ldquomultiValuerdquo attribute

Oslash The set of attributes (A1 A2 A3 hellip) has multiple ordering

That is to say among this set of attribute the position of the attribute might be changed

according to the diverse instances of a data object Thus this attribute will be called the

ldquomultiOrderingrdquo attribute For instance for the moives before year 1999 the move site would

enumerate the release data in front of the movesrsquo title while for the movies after year 1999

(including 1999) it will enumerate the release data behind the movesrsquo title

Oslash The attribute has different formats

This means the displaying format of the data object could be completely distinct with respect to

these different instances Therefore if the format of an attribute is free then a lot of rules will be

needed to deal with all kinds of possible cases This kind attribute will be called ldquomultiFormatrdquo

attribute For example an ecommerce Web site would use the bold font format to present the

general prices while use the red color format to display the sale prices Nevertheless there has

Deep Web Service Crawler

18

another situation that some different attributes for a data object have the same format For

example various attributes are presented in terms of using the ltTDgt tags in a table presentation

And the attributes like those could be differentiated by means of the order information of these

attributes However for cases that there occurs ldquononerdquo attribute or exists ldquomultiOrderingrdquo

attributes it must have to revise the rules for extracting these attributes

Oslash The attribute cannot be decomposed

Because of the easier processing sometimes the input documents would like to be treated as

strings of tokens instead of the strings of characters In addition some of the attribute cannot

even be decomposed into several individual tokens These attributes are called the ldquountokenizedrdquo

attributes For example the college course catalogue like ldquoCOMP4016rdquo or ldquoGEOL2001rdquo The

department code and the course number in them cannot be separated into two different strings

of characters like that ldquoCOMPrdquo and ldquo4016rdquo or ldquoGEOLrdquo and ldquo2001rdquo

223 The Used Techniques in Information Extraction

The extractor used in the process of Information Extraction aims at providing a single uniform query

interface to access information sources like database server and Web server It consists of following

phases collecting returned Web pages labeling these Web pages generalizing extraction rules

extracting the relevant data and outputting the result in an appropriate format (XML format or

relational database) for further information integration For example at first the extractor queries the

Web server to gather the returned pages through the HTTP protocols after that it starts to extract the

contents among these HTML documents and integrate with other data sources thereafter Actually

the whole process of the extractor follows below steps

Oslash Step 1

At the beginning it must have to tokenize the input However there are two different

granularities for the input string tokenization They are tag-level encoding and word-level

encoding The tag-level encoding will transform the tags of HTML page into the general tokens

while transform all text string between two tags into a special token Nevertheless the

word-level encoding does this in another way It treats each word in a document as a token

Oslash Step 2

Next it should apply the extraction rules for every attributes of the data object in the Web pages

These extraction rules could be induced in terms of a top-down or bottom-up generalization

pattern mining or logic programming In addition the type of extraction rules may be indicated

by means of regular grammars or logic rules For example some use path-expressions of the

HTML parse tree path like htmlheadtitle or html-gttable[0] some use syntactic or semantic

constraints and some use delimiter-based constraints such as HTML tags or literal words

Oslash Step 3

After that all these extracted data would be assembled into the records

Oslash Step 4

Finally iterate this process until all these data objects in the input

Deep Web Service Crawler

19

23 Pica-Pica Web Service Description Crawler

The Pica-Pica is knows as a kind of bird species it can also be called pie However at the moment the

Pica-Pica here is a Web Service Description Crawler which is designed to solve the quality of Web

Services problem For example the evaluation of the descriptive quality of Web Services that offered

and how well are these Web Services described in nowadaysrsquo Web Service Registries

231 Needed Libraries of the Pica-Pica Web Service

Description Crawler

This version of the Pica-Pica Web Service Description Crawler is written by Anton Caceres and Josef

Spillner and programmed in terms of the Python language Actually in order to run these scripts to

parse the HTML pages it needs two additional libraries Beautiful Soup and Html5lib

• Beautiful Soup
It is an HTML/XML parser for the Python language, and it can even turn invalid markup into a parse tree [5]. Moreover, the following three features make it more powerful:
- Bad markup does not choke Beautiful Soup. It generates a parse tree that makes approximately as much sense as the original document, so that the desired data can still be obtained.
- Beautiful Soup provides a toolkit with simple, idiomatic methods for navigating, searching and modifying the parse tree. Hence there is no need to write a custom parser for every application.
- If the document has already specified an encoding, it can be ignored, since Beautiful Soup converts incoming documents to Unicode and outgoing documents to UTF-8 automatically. Otherwise, only the encoding of the original document has to be specified.
Furthermore, the ways of including Beautiful Soup into an application are the following [5]:
- from BeautifulSoup import BeautifulSoup (for processing HTML)
- from BeautifulSoup import BeautifulStoneSoup (for processing XML)
- import BeautifulSoup (to get everything)

• Html5lib
It is a Python package which implements the HTML5 [8] parsing algorithm. In order to gain maximum compatibility with the current major desktop web browsers, this implementation is based on the WHATWG (Web Hypertext Application Technology Working Group) HTML5 specification.


232 Architecture of the Pica-Pica Web Service Description Crawler

Figure 2-4 Architecture of the Pica-Pica Web Service Description Crawler

Figure 2-4 illustrates the primary architecture of the Pica-Pica Web Service Description Crawler. It includes four fundamental components: the Service Page Grabber component, the WSDL Grabber component, the Property Grabber component and the WSML Register component.

(1) The Service Page Grabber component takes the URL seed as input and outputs the links of the service pages to the following two components: the WSDL Grabber component and the Property Grabber component.
(2) The WSDL Grabber component is responsible for obtaining the WSDL document based on the delivered service page link, and then for checking whether the obtained WSDL document is valid. Finally, only the valid WSDL documents are passed to the WSML Register component for further processing.
(3) The Property Grabber component tries to extract the service's properties hosted in the service page, if they exist. After that, all these service properties are saved into an INI file as the information of that service.
(4) The functionality of the WSML Register component is to write an appropriate WSML document by means of the valid WSDL documents delivered from the WSDL Grabber component and the optional INI files delivered from the Property Grabber component, and afterwards to register them in Conqo.

• WSML [9]
It stands for Web Service Modeling Language, which provides a framework with different language variants. Hence it is often used to describe the different aspects of semantic Web Services according to the conceptual model of WSMO.

• WSMO [10]
WSMO, whose full name is Web Service Modeling Ontology, is dedicated to describing various aspects related to Semantic Web Services based on ontologies. An ontology is a formal explicit specification of a shared conceptualization. In fact, these ontologies are the sticking points that establish the linkage between the agreement of the communities of users and the defined conceptual semantics of the real world.

• Conqo [11]


It is a discovery framework that considers not only the Quality of Service (QoS) but also context information. It uses a Web Service Repository to manage the service descriptions, which are based on WSML.

233 Implementation of the Pica-Pica Web Service Description Crawler

This section describes the implementation of the Pica-Pica Web Service Description Crawler in detail.

(1) Firstly, to start the whole crawling process of the Pica-Pica Web Service Description Crawler, an input is needed as the initial seed. For this crawler there are five Web Service Registries, which are listed below. The URL addresses of these five Web Service Registries are used as the input seeds for the Pica-Pica Web Service Description Crawler. Moreover, in this version of the Pica-Pica Web Service Description Crawler there is a single Python script for each Web Service Registry, and the crawling scripts for these Web Service Registries are executed one after another:

Biocatalogue http://www.biocatalogue.com
Ebi http://www.ebi.ac.uk
Seekda http://www.seekda.com
Service-Repository http://www.service-repository.com
Xmethods http://www.xmethods.net

(2) Then, after being fed with the input seed, the process steps into the next component, the Service Page Grabber. At first this component tries to read the data from the Web based on the input seed. Then it establishes a parse tree of the read data by means of the functions of the Beautiful Soup library. After that, the Service Page Grabber component starts to look for the service page link of each service published in the Web Service Registry by means of the functions of the Html5lib library. In case the service page link of a single service is found, it first checks whether this service page link is valid or not. Once the service page link is valid, it is passed to the following two components for further processing: the WSDL Grabber component and the Property Grabber component.

(3) When the WSDL Grabber component receives a service page link from its previous component, it sets out to extract the WSDL link address of that service from the parse tree of the data in this service page. Next, this component starts to download the WSDL document of that service by means of the WSDL link address. Thereafter, the obtained WSDL document is stored on disk. The process of the WSDL Grabber component is carried on continually until no more service links are passed to it. Certainly not all grabbed WSDL documents are effective: they may contain bad definitions or a bad namespaceURI, be an empty document, or, what is worse, not even be in XML format. Hence, in order to pick them out, this component further analyzes the involved WSDL documents. All valid documents are put into a "validWSDLs" folder, whereas the invalid documents are put into a folder named "invalidWSDLs" in order to gather statistical information. Finally, only the WSDL documents in the "validWSDLs" folder are passed to the subsequent component.

(4) Moreover, some Web Service Registries give additional information about the


services, such as availability, service provider and version. Therefore the Property Grabber component sets out to extract this information as the service's properties and thereafter saves these service properties into an INI file. However, if there is no available additional information, then there is no need to extract the service properties, and thus there is no INI file for that service. Note that in this implementation of the Pica-Pica Web Service Description Crawler only the Python scripts for the Seekda and Service-Repository Web Service Registries have the functions to extract the services' properties, while for the other three Web Service Registries there is no such function.

(5) Furthermore, it is optional to create a report file which contains the statistical information of this process, such as the total number of services of one Web Service Registry, the number of services whose WSDL document is invalid, etc.

(6) As has been stated, at this point there is a folder with all valid WSDL documents and possibly also some INI files. Therefore, the task of the WSML Register component is now to generate the appropriate WSML documents from these valid WSDL documents and INI files, and then to register them in Conqo.

24 Conclusions of the Existing Strategies

This chapter presented three existing strategies: the Service-Finder project, the Information Extraction technique and the Pica-Pica Web Service Description Crawler.

The task of this master program is to obtain the available Web services and their related information from the Web. This is essentially a procedure of extracting the needed information about each service, such as the service's WSDL document and its properties. Therefore the Information Extraction technique, which is used to extract information hosted in the Web, can be applied in this master program.

Moreover, Service-Finder is a large project that is not only able to obtain the available Web services and their related information from the Web, but also enriches the crawled data with annotations by means of the Service-Finder ontology and the Service Category ontology and integrates all the information into a coherent semantic model. Furthermore, it also provides the capabilities for searching and browsing the data with a user interface and gives users service recommendations. However, the Service-Finder project far exceeds the requirements of a master program. Therefore it is only considered as a reference for this master program.

Furthermore, since the Pica-Pica Web Service Description Crawler aims only at obtaining the available Web services and their related information, it fulfills the primary task of this master program. Nevertheless, regarding the information about a service, the Pica-Pica Web Service Description Crawler extracts only a few properties, and sometimes even none. Consequently, in order to better judge the quality of a service, as many properties about the service as possible have to be extracted. Hence chapter 3 presents an extension of this Pica-Pica Web Service Description Crawler.


3 Design and Implementation

In the previous chapter, the State of the Art, already existing techniques and implementations were presented. In the following, the basic principle of the proposed approach, the Deep Web Services Crawler, is introduced; it is based on these existing techniques, especially the Pica-Pica Web Service Description Crawler.

31 Deep Web Services Crawler Requirements

This section mainly describes the goals of the Deep Web Service Crawler approach, the system requirements of the approach, and some non-functional requirements.

311 Basic Requirements for DWSC

The following are the basic requirements which should be achieved:

(1) Produce the largest annotated Service Catalogue. Here the Service Catalogue is actually a list of Web services published in the Web Service Registries. It contains not only the WSDL document of each service but also a file with the information about that service. Therefore, for the purpose of producing the largest annotated Service Catalogue, the proposed approach needs to extract as many properties about those Web services as possible. Moreover, it also has to download the WSDL document hosted along with the Web service. That is to say, these properties are not only the interesting structured properties, such as the service name and its WSDL link address, but also other non-functional properties, for example the endpoint and its monitoring information.

(2) Flexible storage: each service has a Service Catalogue entry that contains all its interesting properties. A central question is how to deal with those service properties, that is, which schemes will be used to store them. Hence, in order to store them in a flexible way, the proposed approach provides three methods for the storage: the first one stores them as an XML file, the second one stores them in an INI file, and the third method uses a database for the storage.

312 System Requirements for DWSC

Generally speaking, the requirements for realizing a programming project comprise the following:
1) Operating System: Linux/Unix, Windows (XP, Vista, 2000, etc.)
2) Programming Language: C++, Java, Python, C, etc.
3) Programming Tool: NetBeans, Eclipse, Visual Studio, and so on
However, in this master thesis the scripts of the Deep Web Service Crawler approach are written in the Java programming language. Besides, these code scripts have only been tested on the Windows XP and Linux operating systems, but have not been tested on other operating systems.


313 Non-Functional Requirements for DWSC

In this part several non-functional requirements for the Deep Web Service Crawler approach are

presented

1) Transparency: the process of data exploration and data storage should be done automatically without the user's intervention. However, at the beginning the user should specify the path on the hard disk which will be used to store the outputs of this program.
2) Fault tolerance: during the execution of this program errors will inevitably happen. Therefore, in order to keep the process from being interrupted, there must be the necessary error handling for process recovery.
3) Completeness: this approach should extract as many of the interesting properties of each Web Service as possible, e.g. endpoint, monitoring information, etc.

In addition, since the Pica-Pica Web Service Crawler has already implemented crawling strategies for the following five URLs, the proposed approach must support no fewer than these five Web Service Registries:

Biocatalogue http://www.biocatalogue.com
Ebi http://www.ebi.ac.uk
Seekda http://www.seekda.com
Service-Repository http://www.service-repository.com
Xmethods http://www.xmethods.net

32 Deep Web Services Crawler Architecture

In this section, an overview of the high-level architecture of the Deep Web Services Crawler approach is introduced first. Thereafter, four subsections are presented that focus on outlining each single component and how they play together.

The current components and flows of data in the Deep Web Service Crawler can be summarized as depicted in Figure 3-1, using the continuous arrows. It first tries to obtain the available service list page links and related service page links by crawling the Web (Web Service Extractor). Then those gathered links are processed in two separate steps: one gets the service's WSDL document (WSDL Grabber), and the other collects the properties of each service (Property Grabber). Finally, all these data are stored on the storage device (Storage). The detailed process of Figure 3-1 is illustrated in the following:

Ø Step 1
When the Deep Web Service Crawler starts to run, the File Chooser container requires the user to specify a path on the computer or on any other hard disk. The reason for specifying the path is that this Deep Web Service Crawler program needs a place to store all its outputs.

Ø Step 2
After that, the Web Service Extractor is triggered. It is the main entry point to the specific crawling process. Since the Deep Web Service Crawler program is a procedure which is supposed to crawl for Web Services in some given Web Service Registries, the URL addresses of


these Web Service Registries should be given as the initial seeds for this Web Service Extractor process. However, since the page structures of these Web Service Registries are completely different, there is a separate process for each Web Service Registry.

Figure 3-1 Overview of the Basic Architecture for the Deep Web Services Crawler

Ø Step 3
According to the given seed, two types of links are obtained by the Web Service Extractor component: one is the service list page link and the other is the service page link. A service list page is a page that contains a list of Web Services and possibly some information about these Web Services, while a service page is a page that contains much more information about a single service. Finally, it forwards these two types of links to the next two components: the Property Grabber and the WSDL Grabber.

Ø Step 4
Then, on the one hand, the Property Grabber component tries to gather the information about the service that is hosted in the service list page and the service page, such as the name of the service, its description, the rating of this service, etc. Finally, all the information about the service is collected together as the service properties, which are then delivered to the Storage component for further processing.

Ø Step 5
On the other hand, the WSDL Grabber tries to obtain the WSDL link from the service list page or the service page. That is because for some Web Service Registries the WSDL link is hosted in


the service list page, like the Biocatalogue, and for the other Web Service Registries it is hosted in the service page, such as Xmethods. Then, after the WSDL link has been obtained, it is also transmitted to the Storage component for further processing.

Ø Step 6
When the service properties and the WSDL link of the service are received by the Storage component, it stores them on disk. The service properties are stored on disk in three different ways: as an XML file, as an INI file, or as one record inside a table of the database. For the WSDL link, however, the Storage component first tries to download the page content according to the URL address of the WSDL link. If this works successfully, the page content of the service is stored as a WSDL document on disk.

Ø Step 7
Nevertheless, this is just the crawling process of a single service, from step 3 to step 6. Hence, if there is more than one service or more than one service list page in those Web Service Registries, the crawling process from step 3 to step 6 is repeated again and again until there are no more services or service list pages in that Web Service Registry.

Ø Step 8
Furthermore, after the crawling process for one Web Service Registry finishes, a file is generated that contains some statistical information about this crawling process, for example the time when the crawling process of this Web Service Registry started and when it finished, the total number of Web services in this Web Service Registry, how many services have an empty WSDL document, the average number of service properties in this Web Service Registry, and the average time cost for extracting the service properties, getting the WSDL document and generating the XML file, the INI file, etc.

321 The Function of Web Service Extractor Component

The Web Service Extractor component is responsible for gathering service list page links and service page links. It pursues a focused crawl of the Web and only forwards service list page and service page links to the subsequent components for analyzing, collecting and gathering purposes. Therefore it identifies both service list page links and related service page links on these Web Service Registries.

As can be seen from Figure 3-2, a crawl for Web Services needs to start from a URL seed. The seed is almost as important as the Web Service Extractor itself, as it highly influences the part of the Web which needs to be crawled. The seed can or shall contain, e.g., Web pages where these Web Services are published or which talk about Web Services.


Figure 3-2 Overview of the process flow of the Web Service Extractor Component

After being fed with the URL seed, the Web Service Extractor component starts to get the link of the service list page from the initial page of this URL seed. However, this process differs among the five Web Service Registries. The following shows the different situations in these Web Service Registries.

Ø Service-Repository Web Service Registry
In this Web Service Registry, the link of the first service list page is the URL address of its seed, which means that some Web Services can be found on the home page of the Service-Repository Web Service Registry. The "first" here implies that there is more than one service list page link in this Web Service Registry. Therefore, the process of getting service list page links in this registry is carried on continually until no more service list page links exist.

Ø Xmethods Web Service Registry
Although there are Web Services on the home page of the Xmethods Web Service Registry, the number of those Web Services is only a small subset of those in this Web Service Registry. Moreover, in the Xmethods Web Service Registry there is only one page containing all Web Services. Therefore, the service list page link of that page has to be obtained.

Ø Ebi Web Service Registry
The situation in the Ebi Web Service Registry is a little bit like that in the Xmethods Web Service Registry. That is to say, there is also one page that contains all Web Services of this Web Service Registry. However, this page is not the initial page of the input seed. Therefore, more than one operation step is needed to get the service list page link of that page.

Ø Seekda Web Service Registry
In the Seekda Web Service Registry, the Web Services are not contained in the initial page of the input seed. The service list page link can be obtained after several additional operation steps.


However, there is a problem with getting the service list page links in this registry. Simply put, if there is more than one page containing the Web Services, for some unknown reason the links of the remaining service list pages cannot be obtained. In other words, only the link of the first service list page can be obtained.

Ø Biocatalogue Web Service Registry
The process of getting the service list page in the Biocatalogue Web Service Registry is almost the same as in the Seekda Web Service Registry. The only difference is that in the Biocatalogue Web Service Registry all service list page links can be obtained if there is more than one service list page.

Then, after getting the link of the service list page, the Web Service Extractor begins to get the link of the service page of each service listed in the service list page. The reason why it can do this is that there is an internal link for every service which addresses the service page. It is worth noting that once a service page link is obtained, this component immediately forwards the service page link and the service list page link to the subsequent two components for further processing. Nevertheless, the process of obtaining service page links is continuously carried out until all services listed in that service list page are crawled. Analogously, the process of getting service list pages is also continuously carried out until no more service list pages exist.

3211 Features of the Web Service Extractor Component

The main features are described in the following paragraphs

1) Obtain service list page links
A central task is to obtain the corresponding service list page links. A service list page link is a URL address that leads to a public list of Web Services together with some simple information about these Web services, like the name of the service, an internal URL that links to another page which contains the detailed information about that service, and sometimes the link address of the WSDL document.

2) Obtain service page links
Once service list page links are found, a crucial aspect is to extract the internal link of each service to aid the task of service information discovery. Therefore it is the task of the Web Service Extractor to harvest the HTML page content of the service list page so that the service page links, which contain much more detailed information about the single Web service, can be obtained.

3212 Input of the Web Service Extractor Component

This component depends on some specific input seeds. The only input required for this component is a URL seed. This URL seed will be one of the URLs displayed in section 313.

3213 Output of the Web Service Extractor Component

The component produces two types of service-related page links from the Web:
• Service list page links
• Service page links


3214 Demonstration for Web Service Extractor

In order to give a comprehensive understanding of the process of the Web Service Extractor component, the following figures are given as explanation. Though there are five URL addresses, in this section only the URL of the Service-Repository is shown as an example.
1) The input seed is the initial URL address of the Service-Repository, which is "http://www.service-repository.com".
2) As already said in section 321, the first service list page link of this Web Service Registry is its input seed "http://www.service-repository.com". Figure 3-3 shows the corresponding service list page of that link.

Figure 3-3 Service list page of the Service-Repository
Figure 3-4 Original source code of the internal link for the Web service "BLZService"
Figure 3-5 Code overview of getting the service page link in the Service-Repository
Figure 3-6 Service page of the Web service "BLZService"


3) Now that the service list page link is known, the next step is to acquire the service page link for each of the services listed in the service list page. The text in the red box of Figure 3-4 shows the internal link of the Web service "BLZService". However, this is not the complete link of the service page: it has to be prefixed with the initial URL address of the Service-Repository Web Service Registry, "http://www.service-repository.com". The code for getting the service page link of a Web service in the Service-Repository is shown in Figure 3-5 (a minimal sketch of this prefixing step is given after this demonstration). Therefore the final link of this service page is "http://www.service-repository.com/service/overview-210897616". Figure 3-6 is the corresponding service page of that link.

4) Afterwards, those two links, the service list page link and the service page link, which were gathered by the Web Service Extractor component, are immediately forwarded to the next two components, the WSDL Grabber component and the Property Grabber component.
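Since Figure 3-5 is not reproduced in this text, the following minimal sketch only illustrates the described prefixing step; the class and method names are hypothetical, and the relative link value is the assumed form of the text shown in Figure 3-4.

// Hypothetical sketch: builds the absolute service page link by prefixing the
// registry's base URL to the relative link extracted from the service list page.
public class ServicePageLinkSketch {
    static String toServicePageLink(String baseUrl, String relativeLink) {
        return baseUrl + relativeLink;
    }

    public static void main(String[] args) {
        String base = "http://www.service-repository.com";
        String relative = "/service/overview-210897616"; // assumed value from the red box in Figure 3-4
        System.out.println(toServicePageLink(base, relative));
        // prints "http://www.service-repository.com/service/overview-210897616"
    }
}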

322 The Function of WSDL Grabber Component

The WSDL Grabber component acquires the WSDL link hosted in the Web based on the service list page link or the service page link. The whole process flow is illustrated in Figure 3-7.

Figure 3-7 Overview of the process flow of the WSDL Grabber Component

When the WSDL Grabber component receives the inputs delivered from the previous component, it starts to get the WSDL link of the service based on these inputs. However, although the inputs of the WSDL Grabber component are the links of the service page and the service list page, only one of them contains the WSDL link of the corresponding service. That is to say, the WSDL link exists either in the service page or in the service list page. The reason why both links are delivered to this component is that only one of the five Web Service Registries, namely Biocatalogue, hosts the WSDL link in the service list page, while for the other four Web Service Registries the WSDL link is hosted in the service page. Therefore the WSDL link for these four Web Service Registries is obtained via the service page link, and for the Biocatalogue Web Service Registry the WSDL link is obtained through the service list page link. However, there is a problem with getting the WSDL link in the Biocatalogue Web Service Registry. In short, some of the Web services listed in


the service list page of the Biocatalogue Web Service Registry do not have a WSDL link; in other words, these services do not have a WSDL document. In a situation like this, the value of the WSDL link of these Web services is assigned "NULL". Nevertheless, for the Web Services in the other four Web Service Registries the WSDL link always exists in the service page. Eventually, whenever the WSDL Grabber component extracts the WSDL link of a single Web service, it is immediately forwarded to the Storage component for downloading the WSDL document at once.

3221 Features of the WSDL Grabber Component

The WSDL Grabber component has to provide the following feature:
• Obtain WSDL links
The WSDL link is the direct way to get to the page that contains the contents of the WSDL document. It is actually a URL address, but at the end of this URL address there is something like "wsdl" or "WSDL" to indicate that this is an address which points to the page of the WSDL document.

3222 Input of the WSDL Grabber Component

This component requires the following input data:
• Service list page link
• Service page link

3223 Output of the WSDL Grabber Component

The component only produces the following output data:
• The URL address of the WSDL link of each service

3224 Demonstration for WSDL Grabber Component

In this section a list of figures is shown in order to give a comprehensive understanding of the process of the WSDL Grabber component. Without a doubt, it uses the same Web service of the Service-Repository as an example too.
1) The input for the WSDL Grabber component is the link of the service page obtained from the Web Service Extractor component. The address of this link is "http://www.service-repository.com/service/overview-210897616".

Figure 3-8 WSDL link of the Web service "BLZService" in the service page


2) Figure 3-9 shows the corresponding original source code of the WSDL link that is highlighted in Figure 3-8.
Figure 3-9 Original source code of the WSDL link for the Web service "BLZService"

3) Figure 3-10 and Figure 3-11 show the code used to extract the WSDL link shown in Figure 3-9. Figure 3-10, however, is the particular code only for the Service-Repository Web Service Registry; for the other four Web Service Registries this is different. The function "getServiceRepositoryWSDLLink" first gets a list of nodes that have the same HTML tag name "b". Then it checks all of the nodes one by one to see whether the text value of a node is "WSDL". If a node fulfills this condition, then the attribute value of its sibling, which is an "a" element here, is extracted as the value of the WSDL link of this Web service (a sketch of this lookup is given after this demonstration).
Figure 3-10 Code overview of the "getServiceRepositoryWSDLLink" function
Figure 3-11 Code overview of the "oneParameter" function

4) Finally, after applying the functions shown in Figures 3-10 and 3-11, the WSDL link of the Web service "BLZService" can be obtained. Its value is "http://services.unitedplanet.de/blz/BlzService.asmx?WSDL".
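Since the code in Figures 3-10 and 3-11 is not reproduced in this text, the following is only a minimal sketch of the lookup described above. It is written here with the jsoup HTML parser, which is an assumption (the parser library used in the thesis code is not named in this section), and the HTML snippet is hypothetical.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hedged sketch of the described logic: find a <b> node whose text is "WSDL"
// and read the link from its sibling <a> element.
public class WsdlLinkSketch {
    static String findWsdlLink(String html) {
        Document doc = Jsoup.parse(html);
        for (Element b : doc.getElementsByTag("b")) {
            if ("WSDL".equals(b.text().trim())) {
                Element sibling = b.nextElementSibling();
                if (sibling != null && "a".equals(sibling.tagName())) {
                    return sibling.attr("href");
                }
            }
        }
        return null; // no WSDL link found on this page
    }

    public static void main(String[] args) {
        String html = "<div><b>WSDL</b> <a href='http://services.unitedplanet.de/blz/BlzService.asmx?WSDL'>link</a></div>";
        System.out.println(findWsdlLink(html));
    }
}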


323 The Function of Property Grabber Component

The Property Grabber component is a module which is used to extract and gather all the Web service information hosted in the Web, namely the information shown in the service list page and the service page. In the end, all the obtained Web service information is collected together as the service properties, which are delivered to the Storage component for storing. The detailed process flow of the Property Grabber component is illustrated in Figure 3-12.

The inputs of the Property Grabber component are the same as the inputs of the WSDL Grabber component. However, there is still a little difference between them with respect to the seed. As already mentioned in section 322, for the WSDL Grabber component only one of the inputs is sufficient to get the WSDL link, while the Property Grabber component needs both of them as seeds. Therefore, once the Property Grabber component receives the needed inputs, it starts to extract the information of that single Web service.

Figure 3-12 Overview of the process flow of the Property Grabber Component

After the Property Grabber component receives the inputs, it starts to extract the service information of the Web service. Generally speaking, the service information consists of four aspects: structured information, endpoint information, monitoring information and Whois information.

(1) Structured Information
The structured information can be obtained by extracting the information hosted in the service page and the service list page. It is the basic descriptive information about the service, such as the name of the service, the URL address through which the WSDL document can be obtained, the description introducing the service, the provider who provides the service, its rating, the server which owns the service, etc. However, the elements constituting this


structured information differ among the different Web Service Registries. For example, the rating information of the Web service exists in the Service-Repository Web Service Registry, while the Xmethods Web Service Registry does not have this information. In addition, even for Web services in the same Web Service Registry, some elements of the structured information may not exist. For instance, one service in a Web Service Registry may have description information, while another service in the same Web Service Registry does not. Tables 3-1 to 3-5 show the structured information that should be extracted from these five Web Service Registries. Moreover, if the style of a service in the Biocatalogue Web Service Registry is SOAP, there is some additional information describing the SOAP operations of this service, while if it is REST, this additional information is about the REST operations. It should also be considered as a part of the structured information. Table 3-6 and Table 3-7 illustrate the information for these two kinds of operations.

Service Name | WSDL Link | WSDL Version
Provider | Server | Rating
Homepage | Owner Homepage | Description
Table 3-1 Structured Information of the Service-Repository Web Service Registry

Service Name | WSDL Link | Provider
Service Style | Homepage | Implementation Language
Description | User Description | Contributed Client Name
Type of this Client | Publisher for this Client | Used Toolkit of this Client
Used Language of this Client | Used Operating System of this Client
Table 3-2 Structured Information of the Xmethods Web Service Registry

Service Name | WSDL Link | Server
Provider | Provider's Country | Service Style
Rating | Description | User Description
Service Tags | Documentation (within WSDL)
Table 3-3 Structured Information of the Seekda Web Service Registry

Service Name | WSDL Link | Port Name
Service URL Address | Implementation Class
Table 3-4 Structured Information of the Ebi Web Service Registry

Service Name | WSDL Link | Style
Provider | Provider's Country | View Times
Favorite Times | Submitter | Service Tags
Total Annotation | Provider Annotation | Member Annotation
Registry Annotation | Base URL | SOAP Lab Server Base URL
Description | User Description | Category
Table 3-5 Structured Information of the Biocatalogue Web Service Registry


SOAP Operation Name | Inputs and Outputs | Operation Description
Operation Tags | Part of Which Service
Table 3-6 SOAP Operation Information of the Biocatalogue Web Service Registry

REST Operation Name | Service Tags | Used Template
Operation Description | Part of Which Service | Part of Which Endpoint Group
Table 3-7 REST Operation Information of the Biocatalogue Web Service Registry

(2) Endpoint Information
Endpoint information is the port information of the Web service, which can be extracted only from the service page. However, different Web Service Registries have different structures for the endpoint information of a Web service; hence some elements of the endpoint information can be very diverse. One thing needs attention: the Ebi Web Service Registry does not have endpoint information for any of the Web services published in this registry. Moreover, even though the Web services in the same Web Service Registry share the same structure of the endpoint information, some elements of the endpoint information may be missing or empty. Furthermore, these Web Service Registries may even have no endpoint information at all for some Web services published by them. Nevertheless, whenever there is endpoint information for a Web service, there is at least one element, namely the URL address of the endpoint. Table 3-8 shows the endpoint information that should be extracted from these five Web Service Registries.

Web Service Registry Name | Elements of the Endpoint Information
Service-Repository | Endpoint Name, Endpoint URL, Endpoint Critical, Endpoint Type, Bound Endpoint
Xmethods | Endpoint URL, Publisher of this Endpoint, Contact Email of this Publisher, Implementation Language of this Endpoint
Seekda | Endpoint URL
Biocatalogue | Endpoint Name, Endpoint URL
Table 3-8 Endpoint Information of these five Web Service Registries

Web Service Registry Name | Elements of the Monitoring Information
Service-Repository | Service Availability, Number of Downs, Total Uptime, Total Downtime, MTTR, MTBF, RTT Max of Endpoint, RTT Min of Endpoint, RTT Average of Endpoint, Ping Count of Endpoint
Seekda | Service Availability, Begin Time of Monitoring
Biocatalogue | Monitored Status of Endpoint, Overall Status of Service
Table 3-9 Monitoring Information of these five Web Service Registries

(3) Monitoring Information
Monitoring information is the measured statistical information about the Web service. It is worth


noting that the Ebi and Xmethods Web Service Registries do not have monitoring information for any of the Web services published by them, while for the other three Web Service Registries only a few Web services may lack this information. Table 3-9 displays the monitoring information for these three Web Service Registries.

(4) Whois Information
Whois information is not extracted from the information hosted in the service page and the service list page. It is the descriptive information of the service domain, which can be gained by means of the address of the WSDL link. Because of that, the process of getting the Whois information starts with obtaining the service domain first. The final value of the service domain must not contain strings like http, https, www, etc.; it must only consist of the domain name together with its top-level domain. After that, the service domain database is queried by sending the value of the service domain to a Whois client, which is just a Web site on the Internet, for example "http://www.whois365.com/cn/domain". Then, if information for that service domain exists, a list of its information is returned as the output. However, the structure of the returned information differs for different service domains. Therefore the most challenging thing is that the extraction process has to deal with each different form of the returned information. Table 3-10 gives the Whois information that needs to be extracted for all these five Web Service Registries (a small sketch of the domain derivation is given after Table 3-10).

Service Domain URL | Domain Name | Domain Type
Domain Address | Domain Description | State
Postal Code | City | Country
Country Code | Phone | Fax
Email | Organization | Established Time
Table 3-10 Whois Information for these five Web Service Registries
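The following is a minimal sketch of deriving the service domain from a WSDL link, as described above. The class and method names are hypothetical, and the sketch assumes a simple "name.tld" host; country-specific multi-part suffixes (e.g. ".co.uk") would need extra handling.

import java.net.URL;

// Hedged sketch: derive the service domain used for the Whois query from the WSDL link,
// stripping the protocol, the "www" prefix and the path.
public class ServiceDomainSketch {
    static String serviceDomain(String wsdlLink) throws Exception {
        String host = new URL(wsdlLink).getHost();   // e.g. "www.thomas-bayer.com"
        String[] parts = host.split("\\.");
        if (parts.length <= 2) {
            return host;                              // already of the form "name.tld"
        }
        return parts[parts.length - 2] + "." + parts[parts.length - 1];
    }

    public static void main(String[] args) throws Exception {
        System.out.println(serviceDomain("http://www.thomas-bayer.com/axis2/services/BLZService?wsdl"));
        // prints "thomas-bayer.com"
    }
}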

Finally, the information of these four aspects is collected together and then delivered to the Storage component for further storage processing.

3231 Features of the Property Grabber Component

The Property Grabber component has to provide the following features:
• Obtain basic information
Generally speaking, the more information a Web service has, the better one can judge how good this Web service is. Hence this Property Grabber component extracts all the basic information hosted in the service list page and the service page. This basic information comprises the structured information, the endpoint information and the monitoring information.
• Obtain Whois information
For the same reason, namely that the more information a Web service has, the better its quality can be judged, it is necessary to extract as much information about the Web service as possible. Therefore, besides the basic information, the Property Grabber component also obtains some additional information, called Whois information, such as the type of the domain, the name of the person in charge, the postal code of the domain, city, phone, fax, detailed address, email, etc.


3232 Input of the Property Grabber Component

This component requires the following input data:
• Service list page link
• Service page link

3233 Output of the Property Grabber Component

The component produces the following output data:
• Structured information of each service
• Endpoint information about each service, if it exists
• Monitoring information for the service and endpoint, if it exists
• Whois information of the service domain
All this information is collected together as the properties of each service. Thereafter, the collected properties are sent to the Storage component.

3234 Demonstration for Property Grabber Component

The pictures from Figure 3-13 to Figure 3-16 depict the fundamental procedure of the Property Grabber component. To simplify the explanation, the example shown in this section uses the same Web service as before.
1) The inputs of the Property Grabber component are the links of the service list page and the service page received from the Web Service Extractor component. Their links are "http://www.service-repository.com" and "http://www.service-repository.com/service/overview-210897616".

2) When the "GetProperty" function is triggered, it calls its four sub functions shown in Figure 3-12.
Figure 3-13 Structured properties of the service "BLZService" in the service list page


Figure 3-14 Structured properties of the service "BLZService" in the service page
3) Firstly, the "getStructuredProperty" function tries to extract the structured information displayed in the red boxes of Figure 3-13 and Figure 3-14. However, several elements of the structured information have the same content, like the descriptions shown in the service page and the service list page. Hence, in order to save time in the extraction process and space in the storage process, elements with the same content are only extracted once. Moreover, a transformation from non-descriptive to descriptive text is needed for the rating information, because its content consists of several star images. The final results of the extracted structured information of this Web service are shown in Table 3-11. Because there is no descriptive information for the provider, homepage and owner homepage, their values are assigned "NULL".

Service Name | BLZService
WSDL Link | http://www.thomas-bayer.com/axis2/services/BLZService?wsdl
WSDL Version | 0
Server | Apache-Coyote/1.1
Description | BLZService
Rating | Four stars and a half
Provider | NULL
Homepage | NULL
Owner Homepage | NULL
Table 3-11 Extracted Structured Information of the Web Service "BLZService"

4) Secondly, the "getEndpointProperty" function extracts the endpoint information displayed in the red box of Figure 3-15. Though there is more than one endpoint record in the list, only one of them is extracted as the endpoint information. That is because this master program is intended to extract as much information as possible, but the information should not contain redundancies. Therefore only one record is extracted as the endpoint information, even if there is more than one endpoint record. Table 3-12 shows the final results of the endpoint information of this Web service.
Figure 3-15 Endpoint information of the Web service "BLZService" in the service page


Endpoint Name | BLZServiceSOAP12port_http
Endpoint URL | http://www.thomas-bayer.com:80/axis2/services/BLZService
Endpoint Critical | True
Endpoint Type | production
Bound Endpoint | BLZServiceSOAP12Binding
Table 3-12 Extracted Endpoint Information of the Web service "BLZService"

5) Then the monitoring information is extracted by invoking the function "getMonitoringProperty". Figure 3-16 displays two types of monitoring properties: the red box above contains the monitoring information about the Web service, and the red box below lists the monitoring information of its endpoints. As already mentioned before, only one endpoint statistic record is extracted. Besides, as can be seen from Figure 3-16, there are two values for availability. They both represent the availability of this Web service, just like the availability shown in Figure 3-14. Therefore only one of these availability values is sufficient. Table 3-13 shows the final results of this extraction process.
Figure 3-16 Monitoring Information of the service "BLZService" in the service page

Service Availability | 100
Number of Downs | 0
Total Uptime | 1 day, 19 hours, 19 minutes
Total Downtime | 0 second
MTBF | 1 day, 19 hours, 19 minutes
MTTR | 0 second
RTT Max of Endpoint | 141 ms
RTT Min of Endpoint | 0 ms
RTT Average of Endpoint | 577 ms
Ping Count of Endpoint | 112
Table 3-13 Extracted Monitoring Information of the Web service "BLZService"

6) Next, in order to extract the Whois properties, the "getWhoisProperty" function first has to obtain the service domain from the WSDL link. For this Web service the obtained service domain is "thomas-bayer.com". Then it sends this service domain as input to the Whois client for the query process. After that, a list of information for that service domain is returned, see Figure 3-17. Table 3-14 shows the extracted Whois information.


Figure 3-17 Whois Information of the service domain "thomas-bayer.com"

Service Domain URL | thomas-bayer.com
Domain Name | Thomas Bayer
Domain Type | NULL
Domain Address | Moltkestr. 40
Domain Description | NULL
State | NULL
Postal Code | 54173
City | Bonn
Country | NULL
Country Code | DE
Phone | +4922855525760
Fax | NULL
Email | info@predic8.de
Organization | predic8 GmbH
Established Time | NULL
Table 3-14 Extracted Whois Information of the service domain "thomas-bayer.com"

7) Finally, the information of these four aspects is collected together as the service properties, and then these service properties are forwarded to the Storage component.

324 The Function of Storage Component

The WSDL link from the WSDL Grabber component is used by the Storage component to download the WSDL document from the Web and store it on disk. In addition, the service properties from the Property Grabber component are also directly stored on disk in three different manners by this Storage component. Figure 3-18 illustrates the process flow of the Storage component.

When the Storage component receives the needed inputs, the mediator function "Storager" of the Storage component is triggered. After that, it transforms the service properties into three different output formats and stores them on disk. These output formats are an XML file, an INI file and database records. Besides, it also tries to download the WSDL document from the URL address of the WSDL link and then stores the obtained WSDL document on disk too. This "Storager" function is composed of four sub functions, namely the "getWSDL", "generateXML", "generateDatabase" and "generateINI" sub functions. Each sub function is in charge of one aspect of the storage tasks.


Figure 3-18 Overview of the process flow of the Storage Component

(1) "getWSDL" sub function
The task of the "getWSDL" sub function is to download the WSDL document and then store it on disk. Above all, it has to get the content of the WSDL document. This is done as follows. First, the "getWSDL" sub function checks whether the value of the received WSDL link equals "NULL" or not. As already presented in section 322, if a Web service does not have a WSDL link, the value of its WSDL link is assigned "NULL". In such a case it creates a WSDL document whose name is the service name appended with the mark "No WSDL Document"; obviously this document does not contain any content, it is an empty document. Nevertheless, if the service does have a WSDL link, this sub function tries to connect to the Internet using the URL address of the WSDL link. Once it succeeds, the contents hosted in the Web are downloaded, stored on disk and named only with the name of the service. Otherwise, a WSDL document is created whose name is prefixed with "Bad" before the service name.
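A minimal sketch of this download step is shown below. It is not the thesis' own "getWSDL" code (that code is shown in Figure 3-19); the class name, the ".wsdl" file extension and the exact file-name markers are assumptions.

import java.io.FileOutputStream;
import java.io.InputStream;
import java.net.URL;

// Hedged sketch: download the content behind a WSDL link and store it on disk.
// If the link is "NULL" or unreachable, an appropriately marked (empty) file is created instead.
public class GetWsdlSketch {
    static void getWsdl(String serviceName, String wsdlLink, String path) {
        try {
            if ("NULL".equals(wsdlLink)) {
                new FileOutputStream(path + serviceName + " - No WSDL Document.wsdl").close();
                return;
            }
            InputStream in = new URL(wsdlLink).openStream();
            FileOutputStream out = new FileOutputStream(path + serviceName + ".wsdl");
            byte[] buffer = new byte[4096];
            int read;
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
            out.close();
            in.close();
        } catch (Exception e) {
            try {
                // Mark services whose WSDL link could not be downloaded.
                new FileOutputStream(path + "Bad " + serviceName + ".wsdl").close();
            } catch (Exception ignored) { }
        }
    }

    public static void main(String[] args) {
        getWsdl("BLZService", "http://www.thomas-bayer.com/axis2/services/BLZService?wsdl", "./");
    }
}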

(2) "generateXML" sub function
The "generateXML" sub function takes the service properties as input, transforms them into an XML file and stores it on disk, with a name consisting of the service name plus ".xml". XML stands for eXtensible Markup Language, which is a markup language designed to transport and store data. The first line of an XML file is the XML declaration, which defines the XML version and the encoding used. For example, "<?xml version="1.0" encoding="UTF-8"?>" means this is an XML file whose version is 1.0 and whose encoding is the 8-bit Unicode Transformation Format character set. Besides, an XML file contains XML elements, which span everything from the element's start tag to the element's end tag. Moreover, an XML element can also contain other elements, simple text, or a mixture of both.


However, an XML file must contain a root element that is the parent of all other elements.
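For illustration, a generated service property file could look roughly like the following; the element names are hypothetical, since the exact XML layout used by the program is not listed here.

<?xml version="1.0" encoding="UTF-8"?>
<service>
    <ServiceName>BLZService</ServiceName>
    <WSDLLink>http://www.thomas-bayer.com/axis2/services/BLZService?wsdl</WSDLLink>
    <Server>Apache-Coyote/1.1</Server>
    <Description>BLZService</Description>
</service>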

(3) "generateINI" sub function
The "generateINI" sub function also takes the service's properties as input, but it transforms them into an INI file and then stores it on disk, with a name consisting of the service's name plus ".ini". "ini" stands for initialization. The INI file format is a de facto standard for configuration files. INI files are just simple text files with a basic structure. Generally speaking, an INI file contains three different kinds of parts: sections, parameters and comments. The parameter is the basic element contained in an INI file. Its format is a key-value pair, also called a name-value pair. This pair is delimited by an equals sign "=", and the key or name always appears to the left of the equals sign. A section is more like a room that groups all its parameters together. It always appears on a single line within a pair of square brackets "[ ]", and sections may not be nested. A comment is some descriptive text which begins with a semicolon ";". Hence anything between the semicolon and the end of the line is ignored.
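For illustration, a generated INI file could look roughly like the following; the section and key names are hypothetical.

; Service properties of the Web service "BLZService" (hypothetical layout)
[Structured Information]
ServiceName=BLZService
WSDLLink=http://www.thomas-bayer.com/axis2/services/BLZService?wsdl
[Endpoint Information]
EndpointURL=http://www.thomas-bayer.com:80/axis2/services/BLZService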

(4) "generateDatabase" sub function
The inputs of the "generateDatabase" sub function are the same as those of the previous two sub functions. Instead of transforming them into a file like XML or INI, the "generateDatabase" sub function turns them into data of the database by using SQL statements. SQL stands for Structured Query Language, which is a database language designed for accessing and manipulating data in a database. Most of the actions performed on a database are done with SQL statements, and the primary statements of SQL include insert into, delete, update, select, create, alter and drop. Therefore, for the purpose of transforming these service properties into data of the database, this sub function first has to create a database using the "create database" statement. Then it creates a table to store the data. A table is a collection of related data entries and consists of columns and rows. Since the data of all these five Web Service Registries are not very large, one database table is enough for storing these service properties. Because of that, the field names of the service properties in the columns have to be uniform and well-defined for all five Web Service Registries. Afterwards, the service properties of each single service can be put into the table as a record with the "insert into" statement of SQL.
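The following is a minimal JDBC sketch of these steps. The table name, column names, JDBC URL and credentials are assumptions (the thesis does not name the database product here), and only three of the property columns are shown.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.Statement;

public class ServiceTableSketch {
    public static void main(String[] args) throws Exception {
        // Assumed MySQL connection data; any relational database would work similarly.
        Connection con = DriverManager.getConnection(
                "jdbc:mysql://localhost/servicecatalogue", "user", "password");
        Statement st = con.createStatement();
        // One table is enough; every property column uses the TEXT type because
        // the length of a property value cannot be predicted.
        st.executeUpdate("CREATE TABLE IF NOT EXISTS services "
                + "(ServiceName TEXT, WsdlLink TEXT, Description TEXT)");
        // Insert the properties of one service as a single record.
        PreparedStatement ps = con.prepareStatement(
                "INSERT INTO services (ServiceName, WsdlLink, Description) VALUES (?, ?, ?)");
        ps.setString(1, "BLZService");
        ps.setString(2, "http://www.thomas-bayer.com/axis2/services/BLZService?wsdl");
        ps.setString(3, "BLZService");
        ps.executeUpdate();
        ps.close();
        st.close();
        con.close();
    }
}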

3241 Features of the Storage Component

The Storage component has to provide the following features:
• Generate different output formats
The final result of this master program is to store the information of the services on disk for future work. This Storage component provides three different formats for storing the service properties of the services published in the Web Service Registries. This makes the storage of the services very flexible and also long-lived.
• Obtain the WSDL document
The important task of this component is to check whether the WSDL document can be obtained from the WSDL link, because the WSDL document plays a decisive role in determining the quality of the service. This Storage component provides the ability to deal with the different situations in the process of obtaining the WSDL document.


3242 Input of the Storage Component

This component requires the following input data:
• WSDL link of each service
• The property information of each service

3243 Output of the Storage Component

The component produces the following output data:
• WSDL document of each service
• XML document, INI file and table records in the database

3244 Demonstration for Storage Component

The following figures describe the fundamental implementation code of this Storage component. The detailed description is as follows:
1) As can be seen from Figures 3-19 to 3-21, there are several common places among the implementation code. The first common place concerns the parameters defined in each of these sub functions, which are "path" and "SecurityInt". The parameter "path" is the absolute path on the computer disk. It is used for the storing procedure and has already been specified by the user at the beginning of the whole program. The parameter "SecurityInt" is an increasing integer which is used as a part of the name of the service. The reason for doing this is to prevent services that have the same name from overriding each other on disk. The content of the red mark among the code in these figures is the second common place. Its function is to create a file or document for the service with the corresponding parameters.

2) Figure 3-19 displays the implementation code of the "getWSDL" sub function, which is designed to get the WSDL document of the service based on its WSDL link. The parameter "name" is the name of the service, which is used as the name of the WSDL document. The parameter "linkStr" is the actual WSDL link, which is the most important one for this sub function. The other two parameters, "statistic" and "log", are the objects of the text files called "Statistic Information" and "Log Information" respectively. The "Statistic Information" text file is used to record the statistical data of the services in each Web Service Registry, such as the overall number of properties for that Web Service Registry, the overall number of services, the number of services that have no WSDL link, the number of services that contain no contents, the number of services whose WSDL link is not available, etc. The "Log Information" text file records the results of the process steps and the problems encountered, for example which service is being crawled now, which Web Service Registry it belongs to, the reason why the WSDL document of this service cannot be obtained, the reason why this service is unreachable, and so on.


Figure 3-19 Implementation code for getting WSDL document

3) Figure 3-20 and Figure 3-21 show the code for turning the service properties into the XML file and the INI file, and for storing those two files on disk thereafter. The parameter "vec" is a Vector of the "PropertyStruct" data type. This "PropertyStruct" data type is an object of a class consisting of two variables: name and value.

Figure 3-20 Implementation code for generating XML file


Figure 3-21 Implementation code for generating INI file

4) The code in Figure 3-22 and Figure 3-23 shows the process that turns the service properties of all services in these five Web Service Registries into records of the database. Therefore a database has to be created first. The name of the database can be arbitrary as long as it conforms to the naming rules of the database; the same holds for the name of the table. Figure 3-22 displays the procedure of creating a table in the database. Because it is hard to decide the length of each service property, the data types of all service properties are set to "Text". Figure 3-23 shows the code for inserting the service properties into the table with the "update" statement.

Figure3-22 Implementation code for creating table in database


Figure3-23 Implementation code for generating table records
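Because figures 3-22 and 3-23 are images as well, the following JDBC sketch shows the idea of both steps in one place. The JDBC URL, the table name and the column names are placeholders, since the text only states that every property column uses the "Text" type and that the records are written with an update statement.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.SQLException;
    import java.sql.Statement;

    public class DatabaseSketch {
        public static void main(String[] args) throws SQLException {
            String jdbcUrl = "jdbc:h2:mem:servicedb";   // assumption: any JDBC-capable database will do
            try (Connection con = DriverManager.getConnection(jdbcUrl);
                 Statement stmt = con.createStatement()) {
                // figure 3-22: create the table; every service property column is typed as Text
                stmt.executeUpdate("CREATE TABLE Services ("
                        + "Id INT AUTO_INCREMENT PRIMARY KEY, "
                        + "Name TEXT, WsdlLink TEXT, Provider TEXT, Description TEXT)");
                // figure 3-23: insert the extracted properties of one service as one record
                stmt.executeUpdate("INSERT INTO Services (Name, WsdlLink, Provider, Description) "
                        + "VALUES ('BLZService', 'http://example.org/BLZService?wsdl', "
                        + "'example.org', 'illustrative values only')");
            }
        }
    }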

33 Multithreaded Programming for DWSC

Multithreading is a built-in feature of the Java language. A multithreaded program contains two or more separate parts that can execute concurrently; each such part is called a thread. Multithreading makes it possible to write programs that use the system resources efficiently, for example by keeping the idle time of the CPU to a minimum.

In this master program there are five Web Service Registries whose services need to be crawled. Moreover, the number of services published in each Web Service Registry differs considerably, so the running time required for each registry differs as well. Without multithreading, a registry that hosts only a few services would have to wait until a registry with many more services has finished. Therefore, in order to avoid this waiting time and to maximize the use of the system resources, multithreaded programming is applied to this master program: the program creates one thread per Web Service Registry, and these threads are executed independently.
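The following minimal Java sketch illustrates this thread-per-registry idea. The registry names are those used throughout this thesis, whereas the RegistryCrawler class and the body of its run() method are placeholders standing in for the actual per-registry crawling code.

    public class CrawlerThreads {
        static class RegistryCrawler implements Runnable {
            private final String registryName;
            RegistryCrawler(String registryName) { this.registryName = registryName; }
            @Override
            public void run() {
                System.out.println("Start crawling " + registryName);
                // ... crawl all services of this registry here ...
            }
        }

        public static void main(String[] args) throws InterruptedException {
            String[] registries = {"Service Repository", "Ebi", "Xmethods", "Seekda", "Biocatalogue"};
            Thread[] threads = new Thread[registries.length];
            for (int i = 0; i < registries.length; i++) {
                threads[i] = new Thread(new RegistryCrawler(registries[i]));
                threads[i].start();               // each registry is crawled in its own thread
            }
            for (Thread t : threads) {
                t.join();                         // wait until all registries have finished
            }
        }
    }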

34 Sleep Time Configuration for Web Service Registries

Since this master program downloads the WSDL documents and extracts the service information of the Web services published in the Web Service Registries, it inevitably puts load on these registries. In addition, in order not to exceed their throughput capabilities, the Web Service Registries restrict the rate at which they can be accessed. Because of this, unknown errors can occur while the master program is running: for instance, the program may halt at one point without retrieving any further WSDL documents or service information, the WSDL documents of some services may not be obtainable, or some service information may be missing. Therefore, in order to obtain as many as possible of the Web services published in these five Web Service Registries without affecting their throughput, the access rate has to be configured for each registry.

Consequently, before starting the essential processing of each single service of a Web Service Registry, the master program calls the built-in function "sleep(long milliseconds)". This public static function causes the currently executing thread to sleep for the specified number of milliseconds, that is, to temporarily cease execution for a while. The following table shows the time interval used in the sleep call for each Web Service Registry.

Web Service Registry Name    Time Interval (milliseconds)
Service Repository           8000
Ebi                          3000
Xmethods                     10000
Seekda                       20000
Biocatalogue                 10000

Table 3-15 Sleep Time of these five Web Service Registries
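A minimal sketch of how these intervals could be applied is given below; the map initialisation mirrors table 3-15, while the crawlService() method and its second parameter are illustrative assumptions.

    import java.util.HashMap;
    import java.util.Map;

    public class SleepConfigSketch {
        private static final Map<String, Long> SLEEP_MS = new HashMap<>();
        static {
            SLEEP_MS.put("Service Repository", 8000L);
            SLEEP_MS.put("Ebi", 3000L);
            SLEEP_MS.put("Xmethods", 10000L);
            SLEEP_MS.put("Seekda", 20000L);
            SLEEP_MS.put("Biocatalogue", 10000L);
        }

        static void crawlService(String registry, String servicePageLink) throws InterruptedException {
            Thread.sleep(SLEEP_MS.get(registry));  // pause before accessing the registry again
            // ... download the WSDL document and extract the service properties here ...
        }
    }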


4 Experimental Results and Analysis

This chapter presents the quantitative experimental results of the prototype described in chapter 3, together with an analysis and explanation of these results. In order to obtain reasonably accurate figures, the experiments were carried out more than five times; all data displayed in the following tables and charts are the averages of these runs.

41 Statistic Information for Different Web Service Registries

This section discusses the amount of Web services published in these five Web Service Registries. It covers the overall number of Web services published in each Web Service Registry as well as the number of unavailable Web services, i.e. services that have been archived because they may no longer be active or are close to becoming inactive. Table 4-1 shows the service amount statistics of these five Web Service Registries.

                       Service Repository   Ebi   Xmethods   Seekda   Biocatalogue
Overall Services       57                   289   382        853      2567
Unavailable Services   0                    0     0          0        125

Table 4-1 Service amount statistic of these five Web Service Registries

To give an intuitive view of the service amount statistics of these five Web Service Registries, the bar chart in figure 4-1 visualizes the data of table 4-1. As the chart shows, on the one hand the overall number of Web services increases from the Service Repository registry up to the Biocatalogue registry. The Biocatalogue Web Service Registry owns by far the largest number of Web services, which indicates that it is best placed to provide Web services to users, since it hosts far more services than the other four registries. On the other hand, there are no unavailable services in any registry except the Biocatalogue Web Service Registry; in other words, only the Biocatalogue registry contains Web services that can no longer be used. Such services are of little value, because they cannot be invoked anymore and merely waste network resources; they should therefore be eliminated.

Figure4-1 Service amount statistic of these five Web Service Registries

42 Statistic Information for WSDL Document

                       Service Repository   Ebi   Xmethods   Seekda   Biocatalogue
Failed WSDL Links      1                    0     23         145      32
Without WSDL Links     0                    0     0          0        16
Empty Content          0                    0     2          0        2

Table 4-2 Statistic information for WSDL Document

Table 4-2 and figure 4-2 show the error statistics of the WSDL documents of the Web services in these five Web Service Registries. Three aspects are considered. The first one, "Failed WSDL Links", is the overall number of Web services per registry whose WSDL links are invalid; in other words, it is impossible to retrieve the WSDL documents of these Web services from the URL addresses of their WSDL links, so no WSDL document is created for them. The second aspect, "Without WSDL Links", is the overall number of Web services in each Web Service Registry that have no WSDL link at all; for such services the value of the WSDL link is "NULL". A WSDL document is still created for them, but it has no content and its name contains the string "[No WSDL Document]". The third aspect, "Empty Content", represents the overall number of Web services that do have WSDL links with valid URL addresses but whose WSDL documents contain no content. In this case a WSDL document whose name contains the string "(BAD)" is created.

Figure4-2 Statistic information for WSDL Document

43 Comparison of Different Average Number of Service Properties

This section compares the average number of service properties in these five Web Service Registries. This average number of service properties is calculated by means of the following equation:

    ASP = ONSP / ONS                                                            (1)

Where
ASP is the average number of service properties for one Web Service Registry
ONSP is the overall number of service properties in one Web Service Registry
ONS is the overall number of Web services that have already been crawled from the corresponding Web Service Registry
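As an illustration of equation (1): if, for example, ONS = 57 services were crawled from a registry and a total of ONSP = 1311 service properties were extracted from them, then ASP = 1311 / 57 = 23, which corresponds to the value reported for the Service Repository registry in figure 4-3. The overall totals themselves are not listed in this thesis, so the figure 1311 is only an illustrative back-calculation.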

Figure 4-3 shows the average number of service properties per Web service in these five Web Service Registries. As mentioned before, one of the measures for the quality of the Web services in a Web Service Registry is the amount of service information: the more information is available for a Web service, the better users know that service, and consequently the better the quality the corresponding Web Service Registry can offer to its users. As figure 4-3 shows, the Service Repository and Biocatalogue Web Service Registries have a larger number of service properties than the other three registries. This directly reflects that these two registries provide more detailed information about the Web services published in them, so that users can more easily choose the services they need and will prefer to use the Web services published there. By contrast, the Xmethods and Seekda Web Service Registries, which offer less information about their Web services, provide a lower quality for these services, so users may be less inclined to use them, let alone the Web services published in the Ebi Web Service Registry.

Figure4-3 Average Number of Service Properties

Based on the description presented in section 323, the different numbers of service properties in these Web Service Registries can be explained by several factors. First, the amount of structured information differs between these five Web Service Registries, and part of the information of some Web services in a registry can be missing or have an empty value; for example, the number of structured information items that can be extracted for the Web services in the Ebi Web Service Registry differs greatly from that in the Biocatalogue Web Service Registry. Secondly, there is a certain amount of endpoint information in almost all Web Service Registries except the Ebi Web Service Registry, which influences the overall number of service properties. Thirdly, some Web Service Registries, such as Xmethods and Ebi, do not provide monitoring information, whereas the Service Repository Web Service Registry in particular offers a large amount of monitoring information for its Web services that can be extracted from the Web. Finally, there is the amount of whois information for these Web services: if the database of the whois client does not contain information about the service domain of a Web service, no whois information can be extracted, and even if information about the service domain exists, its amount can vary greatly. Therefore, if many service domains of the Web services in a registry have no or only little whois information, the average number of service properties in that registry decreases considerably.

As a result, in order to help users get better acquainted with the Web services provided in a Web Service Registry and to distinguish these Web services, a Web Service Registry should do its best to offer as much information as possible for each of its published Web services.

[Figure 4-3 plots the average number of service properties per registry: Service Repository 23, Ebi 7, Xmethods 17, Seekda 17 and Biocatalogue 32.]

44 Different Outputs of Web Services

The fundamental task of this master program is to obtain the WSDL documents of the Web services hosted in the Web Service Registries, to extract and gather the properties of these Web services, and to store them on disk afterwards. This section therefore describes the different outputs of the master program, which comprise the WSDL documents of the Web services, the generated XML and INI files, and the data records of the service properties in the database.

Figure4-4 WSDL Document format of one Web service

The WSDL document of a Web service is simply read from the Web according to the URL address of its WSDL link, and the data is then stored on disk as the WSDL document. The name of the WSDL document is the service name plus the ending ".wsdl", which marks it as a WSDL document. In order to distinguish WSDL documents that would have the same name although their contents differ, the name of each obtained WSDL document in one Web Service Registry additionally contains a unique integer in front of it. Figure 4-4 shows the valid WSDL document format of a Web service; its name is "1BLZService.wsdl".

Besides this, the obtained service properties are transformed into an XML file, an INI file and data records in the database. Figure 4-5, figure 4-6 and figure 4-7 show these three different output formats respectively. As can be seen from figure 4-5, this is the INI file of the Web service, named "1BLZService.ini". The integer is the same as in the WSDL document because both files belong to the same Web service. The first three lines of the INI file are service comments, which start with a semicolon and run to the end of the line; they provide the basic information describing this INI file. The following line is the section, enclosed in a pair of brackets; it is important because it indicates that the lines behind it contain the information of this Web service. The remaining lines hold the actual service information as key-value pairs separated by an equals sign, with each service property starting at the beginning of a line.

Figure4-5 INI File format of one Web service

Figure4-6 XML File format of one Web service

Figure4-7 Database data format for all Web services


Furthermore, figure 4-6 shows the generated XML file of the Web service "BLZService"; its name is "1BLZService.xml". Needless to say, this XML file belongs to the same Web service. Although the format of the XML file differs from that of the INI file, their essential contents are the same, that is, the values of the service properties do not differ, because both files are generated from the same collection of properties of the same Web service. The XML file also contains some comments, like the INI file, which are placed between "<!--" and "-->", and the section of the INI file corresponds to the root element of the XML file. Therefore all values of the elements below the root element "service" in this XML file are the values of the service properties of this Web service.

Finally, figure 4-7 shows the database table that is used to store the service information of all Web services in these five Web Service Registries. The entire service information of one Web service forms exactly one record in this table. For this reason the column names of the table are the union of the names of the service information items of each Web Service Registry; since column names have to be unique, redundant names in this union are eliminated. This is possible because the names of the service information items are well defined and uniform across all five Web Service Registries. In addition, the first column of this table is the primary key, an increasing integer whose function is similar to the integer contained in the names of the XML and INI files. The remaining columns of the table hold the corresponding service properties of the Web services. The value "NULL" or "Null" in this table indicates that the respective property of that Web service is empty or missing.

45 Comparison of Average Time Cost for Different Parts of a Single Web Service

This section compares the average time cost of the different parts of obtaining one single Web service in all these five Web Service Registries. First, the average time cost of getting one single service in a Web Service Registry has to be calculated, which can be obtained through the following equation:

    ATC = OTS / ONS                                                             (2)

Where
ATC is the average time cost for one single Web service
OTS is the overall time cost of all the Web services in one Web Service Registry
ONS is the overall number of Web services that have already been crawled from the corresponding Web Service Registry

The average time cost for getting one single service is split into the following six parts: the average time cost for extracting the service properties, the average time cost for obtaining the WSDL document, the average time cost for generating the XML file, the average time cost for generating the INI file, the average time cost for inserting the service properties into the database table, and the average time cost for the remaining procedures, such as getting the service list page link, getting the service page link and so on. The average time cost for extracting the service properties is obtained by means of the following equation:

    ATCSI = OTSSI / ONS                                                         (3)

Where
ATCSI is the average time cost for extracting the service properties of one single Web service
OTSSI is the overall time cost for extracting the service properties of all the Web services in one Web Service Registry
ONS is the overall number of Web services that have already been crawled from the corresponding Web Service Registry

The other parts are calculated analogously to the average time cost for extracting the service properties, while the average time cost for the other procedures equals the average time cost for one single Web service minus the sum of the average time costs of the other five parts.
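To make the calculation of the last part concrete with the values of table 4-3 below: for the Service Repository registry the five measured parts sum to 8801 + 918 + 2 + 1 + 53 = 9775 milliseconds, so with an overall average of 10042 milliseconds the remaining 10042 - 9775 = 267 milliseconds are attributed to the other procedures.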

                     Service    WSDL       XML    INI    Database   Others   Overall
                     property   Document   File   File
Service Repository   8801       918        2      1      53         267      10042
Ebi                  699        82         2      1      28         11       823
Xmethods             5801       1168       2      1      45         12       7029
Seekda               5186       1013       2      1      41         23       6266
Biocatalogue         39533      762        2      1      66         1636     42000

Table 4-3 Average time cost information for all Web Service Registries (in milliseconds)

Table 4-3 displays the average time cost of one Web service and its different parts in all these five Web Service Registries. The first column of table 4-3 gives the names of the five Web Service Registries, and the last column contains the overall average time cost for a single service in the respective registry, while the remaining columns show the average time cost of the six different parts. To give an intuitive view of these data, the values of each column of this table are illustrated in the corresponding figures 4-8 to 4-13.

Figure4-8 Average time cost for extracting service property in all Web Service Registries


As can be seen from figure 4-8, the average time cost for extracting the service properties in the Biocatalogue Web Service Registry is 39533 milliseconds, which is much larger than the values of the other four Web Service Registries (8801, 699, 5801 and 5186 milliseconds for Service Repository, Ebi, Xmethods and Seekda respectively). In other words, it takes much longer to extract the service properties of the Web services published by the Biocatalogue Web Service Registry. This also indirectly indicates that the Biocatalogue Web Service Registry has an even higher average number of service properties, as already discussed in section 43, whereas the average number of service properties in the Ebi Web Service Registry is the smallest. Moreover, the average time cost in the Xmethods Web Service Registry is larger than that in the Seekda Web Service Registry, although the average number of service properties is the same for these two registries. One reason that may explain why Xmethods costs more time than Seekda is that the extraction of the service properties in the Xmethods Web Service Registry has to work on both the service page and the service list page, while for the Seekda Web Service Registry only the service page link is needed.

The average time cost for obtaining the WSDL document in all these five Web Service Registries is displayed in figure 4-9. This average time cost is actually the sum of the average time for extracting the WSDL link of a Web service and the average time for reading the data of the WSDL document from the Web and storing it on disk. As seen from the figure, the average time cost for obtaining the WSDL document in the Xmethods Web Service Registry is the largest, namely 1168 milliseconds. Although the process of extracting the WSDL link costs a certain amount of time, it does not have a significant influence on the total average time spent for obtaining the WSDL document, because the WSDL link of a Web service is almost always obtained in a single step. This implies that the average size of the WSDL documents in the Xmethods Web Service Registry is larger than in the other four Web Service Registries, and in particular larger than in the Ebi Web Service Registry, which takes just 82 milliseconds for obtaining the WSDL document.

Figure4-9 Average time cost for obtaining WSDL document in all Web Service Registries


Figure 4-10, figure 4-11 and figure 4-12 show the average time cost of generating the three different outputs in all these five Web Service Registries. As can be seen from figures 4-10 and 4-11, the average time for generating the XML file of one Web service is the same for all five Web Service Registries, namely only 2 milliseconds, and the average time for generating the INI file is likewise identical everywhere, at just 1 millisecond. Even the sum of these two average time costs is so small that it can be neglected when compared to the overall average time cost of getting one Web service in the corresponding Web Service Registry shown in figure 4-13. This implies that the generation of the XML and INI files finishes immediately once the service properties of a Web service are available as input. Furthermore, figure 4-12 shows that although the average time cost of creating the database record for one Web service is larger than the time for generating the XML and INI files in all five Web Service Registries, the database operation is still fast.

Figure4-10 Average time cost for generating XML file in all Web Service Registries

Figure4-11 Average time cost for generating INI file in all Web Service Registries


Figure4-12 Average time cost for creating database record in all Web Service Registries

Figure4-13 Average time cost for getting one Web service in all Web Service Registries

Figure 4-13 gives the average time cost for getting one single Web service in all these five Web Service Registries. Unsurprisingly, the Biocatalogue Web Service Registry takes the longest time for this process: as the discussion of the individual parts above shows, each part needs more time in the Biocatalogue Web Service Registry than elsewhere, except for the process of obtaining the WSDL document, where Biocatalogue does not have the largest average time. Moreover, a striking observation can be made when looking at figures 4-8, 4-12 and 4-13: the shapes of these curves follow almost the same trend. This further indicates that a Web Service Registry that spends more time on getting the description information of one Web service also offers more information about that Web service.


5 Conclusion and Further Direction

This master thesis provides a schema that aims to explore the description information of the Web services hosted in different Web Service Registries. The description information of a Web service consists of its WSDL document and the service information about that Web service. Although there is an existing approach, the Pica-Pica Web Service Description Crawler, that can be used to obtain the WSDL documents and service information of the Web services in these Web Service Registries, its functionality is restricted: it explores only a small subset of the Web services hosted in these registries, the storage of these Web services is not flexible, and, most importantly, only a few pieces of service information are extracted per Web service, while for some Web Service Registries no service information is extracted at all. The work presented in this master thesis, in contrast, is able to explore all Web services in these Web Service Registries. Moreover, with the approach presented here as much service information as possible is extracted for each Web service, so that the final result is the largest annotated service catalogue ever produced. Furthermore, regarding the storage of the description information of the Web services, this master thesis provides three different output formats that guarantee not only the completeness but also the longevity of the description information.

However, in the implementation performed in this master thesis the whois client used for querying the information of a service domain returns free text if the information exists, and sometimes this free text differs completely from case to case. As a consequence, during the experiment stage every Web service in all Web Service Registries had to be crawled at least once, so that all variants of this free text could be foreseen and processed afterwards. This is a huge amount of work because there are lots of Web services in these Web Service Registries. Therefore, in order to simplify the work, another whois client that eases this task needs to be found and used.

Moreover, in the experiment stage of this master thesis the time cost for getting a Web service is still large. In order to reduce this time, multithreaded programming could also be applied to some parts of the process of getting one single Web service.

Although the work performed here is specialized for only these five Web Service Registries, the main parts of the principles used here are adaptable to other Web Service Registries with only small changes in the implementation code or structure.


6 Bibliography

[1] Nathalie Steinmetz, Holger Lausen, Dario Cerizza, Andrea Turati, Irene Celino, Adam Funk and Michael Erdmann, "D11 - Requirement Analysis and Architectural Plan", available from httpwwwservice-findereudeliverablespublic7-public32-d11, Emanuele Della Valle (CEFRIEL), June 27, 2008.

[2] Nathalie Steinmetz, Holger Lausen, Dario Cerizza, Andrea Turati, Irene Celino, Adam Funk and Michael Erdmann, "D12 - First Design of Service-Finder as a Whole", available from httpwwwservice-findereudeliverablespublic7-public37-d12-first-design-of-service-finder-as-a-whole, Emanuele Della Valle (CEFRIEL), July 1, 2008.

[3] Nathalie Steinmetz, Holger Lausen, Irene Celino, Dario Cerizza, Saartje Brockmans, Adam Funk, "D13 - Revised Requirement Analysis and Architectural Plan", available from httpwwwservice-findereudeliverablespublic7-public57-d13-revised-requirement-analysis-and-architectural-plan, Emanuele Della Valle (CEFRIEL), April 1, 2009.

[4] Chia-Hui Chang, Mohammed Kayed, Moheb Ramzy Girgis, Khaled Shaalan, "A Survey of Web Information Extraction Systems", Volume 18, Issue 10, IEEE Computer Society, pp. 1411-1428, October 2006.

[5] Leonard Richardson, "Beautiful Soup Documentation", available from httpwwwcrummycomsoftwareBeautifulSoupdocumentationhtml, October 13, 2008.

[6] Hao He, Hugo Haas, David Orchard, "Web Services Architecture Usage Scenarios", available from httpwwww3orgTRws-arch-scenarios, February 11, 2004.

[7] Stephen Soderland, "Learning Information Extraction Rules for Semi-Structured and Free Text", Volume 34, Issue 1-3, Journal of Machine Learning, Department of Computer Science and Engineering, University of Washington, Seattle, pp. 233-272, February 1999.

[8] Ian Hickson, "A Vocabulary and Associated APIs for HTML and XHTML", World Wide Web Consortium, Working Draft WD-html5-20100624, January 22, 2008.

[9] Holger Lausen, Jos de Bruijn, Axel Polleres, Dieter Fensel, "The Web Service Modeling Language WSML", WSML Deliverable D161v02, March 20, 2005, available from httpwwwwsmoorgTRd16d161v02.

[10] Dumitru Roman, Holger Lausen, Uwe Keller, "Web Service Modeling Ontology - Standard (WSMO-Standard)", WSMO deliverable D2 version 11, 06 March 2004, available from httpwwwwsmoorgTRd2v11.

[11] Iris Braum, Anja Strunk, Gergana Stoyanova, Bastian Buder, "ConQo - A Context- and QoS-Aware Service Discovery", TU Dresden, Department of Computer Science, in Proceedings of WWW/Internet 2008.


7 Appendixes

There are two additional outputs of this master program, namely the log information file and the statistic report file. Figure 8-1 shows one of the basic output formats of the log information for these five Web Service Registries.

Figure8-1 Log information of the "Service Repository" Web Service Registry

Figure8-2 Statistic information of the "Service Repository" Web Service Registry


Figure 8-2 to figure 8-6 show the output format of the statistic information for the "Service Repository", "Ebi", "Xmethods", "Seekda" and "Biocatalogue" Web Service Registries respectively.

Figure8-3 Statistic information of the "Ebi" Web Service Registry

Figure8-4 Statistic information of the "Xmethods" Web Service Registry


Figure8-5 Statistic information of the "Seekda" Web Service Registry

Figure8-6 Statistic information of the "Biocatalogue" Web Service Registry


Table of Figures

Figure2-1 Dataflow of Service-Finder and Its Components .......... 12
Figure2-2 Left is the free text input type and right is its output .......... 16
Figure2-3 A Semi-structured page containing data records (in rectangular box) to be extracted .......... 16
Figure2-4 Architecture of the Pica-Pica Web Service Description Crawler .......... 20
Figure3-1 Overview of the Basic Architecture for the Deep Web Services crawler .......... 25
Figure3-2 Overview the process flow of the Web Service Extractor Component .......... 27
Figure3-3 Service list page of the Service-Repository .......... 29
Figure3-4 Original source code of the internal link for Web service "BLZService" .......... 29
Figure3-5 Code Overview of getting service page link in Service Repository .......... 29
Figure3-6 Service page of the Web service "BLZService" .......... 29
Figure3-7 Overview the process flow of the WSDL Grabber Component .......... 30
Figure3-8 WSDL link of the Web service "BLZService" in the service page .......... 31
Figure3-9 Original source code of the WSDL link for Web service "BLZService" .......... 32
Figure3-10 Code overview of "getServiceRepositoryWSDLLink" function .......... 32
Figure3-11 Code overview of "oneParameter" function .......... 32
Figure3-12 Overview the process flow of the Property Grabber Component .......... 33
Figure3-13 Structure properties of the Service "BLZService" in service list page .......... 37
Figure3-14 Structure properties of the Service "BLZService" in service page .......... 38
Figure3-15 Endpoint information of the Web service "BLZService" in service page .......... 38
Figure3-16 Monitoring Information of the Service "BLZService" in service page .......... 39
Figure3-17 Whois Information of the service domain "thomas-bayer.com" .......... 40
Figure3-18 Overview the process flow of the Storage Component .......... 41
Figure3-19 Implementation code for getting WSDL document .......... 44
Figure3-20 Implementation code for generating XML file .......... 44
Figure3-21 Implementation code for generating INI file .......... 45
Figure3-22 Implementation code for creating table in database .......... 45
Figure3-23 Implementation code for generating table records .......... 46
Figure4-1 Service amount statistic of these five Web Service Registries .......... 49
Figure4-2 Statistic information for WSDL Document .......... 50
Figure4-3 Average Number of Service Properties .......... 51
Figure4-4 WSDL Document format of one Web service .......... 52
Figure4-5 INI File format of one Web service .......... 53
Figure4-6 XML File format of one Web service .......... 53
Figure4-7 Database data format for all Web services .......... 53
Figure4-8 Average time cost for extracting service property in all Web Service Registries .......... 55
Figure4-9 Average time cost for obtaining WSDL document in all Web Service Registries .......... 56
Figure4-10 Average time cost for generating XML file in all Web Service Registries .......... 57
Figure4-11 Average time cost for generating INI file in all Web Service Registries .......... 57
Figure4-12 Average time cost for creating database record in all Web Service Registries .......... 58
Figure4-13 Average time cost for getting one Web service in all Web Service Registries .......... 58


Table of Tables

Table 3-1 Structured Information of Service-Repository Web Service Registry .......... 34
Table 3-2 Structured Information of Xmethods Web Service Registry .......... 34
Table 3-3 Structured Information of Seekda Web Service Registry .......... 34
Table 3-4 Structured Information of Ebi Web Service Registry .......... 34
Table 3-5 Structured Information of Biocatalogue Web Service Registry .......... 34
Table 3-6 SOAP Operation Information of Biocatalogue Web Service Registry .......... 35
Table 3-7 REST Operation Information of Biocatalogue Web Service Registry .......... 35
Table 3-8 Endpoint Information of these five Web Service Registries .......... 35
Table 3-9 Monitoring Information of these five Web Service Registries .......... 35
Table 3-10 Whois Information for these five Web Service Registries .......... 36
Table 3-11 Extracted Structured Information of Web Service "BLZService" .......... 38
Table 3-12 Extracted Endpoint Information of the Web service "BLZService" .......... 39
Table 3-13 Extracted Monitoring Information of the Web service "BLZService" .......... 39
Table 3-14 Extracted Whois Information of service domain "thomas-bayer.com" .......... 40
Table 3-15 Sleep Time of these five Web Service Registries .......... 47
Table 4-1 Service amount statistic of these five Web Service Registries .......... 48
Table 4-2 Statistic information for WSDL Document .......... 49
Table 4-3 Average time cost information for all Web Service Registries .......... 55


Table of Abbreviations

DTD Document Type Definition

DWSC Deep Web Service Crawler

HTML HyperText Markup Language

MTBF Mean Time between Failures

MTTR Mean Time to Recovery

QoS Quality of Service

REST Representational State Transfer

RTT Round Trip Time

SOAP Simple Object Access Protocol

SQL Structured Query Language

URL Uniform Resource Locator

WHATWG Web Hypertext Application Technology Working Group

XML eXtensible Markup Language

WSDL Web Service Description Language

WSML Web Service Modeling Language

WSMO Web Service Modeling Ontology

Page 10: Deep Web Service Crawler

Deep Web Service Crawler

10

2 State of the Art

This chapter aims at presenting some existing techniques or Strategies that related to the work of

applying this Deep Web Service Extraction Crawler approach Section 21 plans to talk about the

existing catalogues Service-Finder project And then in section 22 it is going to explain the existing

implemented crawlers Pica-Pica Web Service Description Crawler Finally in the section 23 it is

supposed to present some details about the Information Extraction technique

21 Service Finder Project

Service-Finder Project aims at developing a platform for Web Service discovery especially for the Web

Services that are embedded in a Web 20 environment [1] Hence it can provide an efficient access to

publicly available services The goals of the Service-Finder project are depicted as follows [1]

n Automatically gather Web Services and their related information

n Semi-automatically create semantic service description based on the information that available

on the Web

n Create and improve semantic annotations via the user feedback

n Describe the aggregated information in semantic models and allow reasoning query

However before describing the basic functionality of the Service-Finder Project there is going to

present one of its use cases and requirements first

211 Use Cases for Service-Finder Project

The Service-Finder Project employed the use case methodology of the W3C Use Case description [6]

for its needs and then applied this methodology to the use cases that it enumerated

2111 Use Case Methodology

There are three aspects needed to be considered for use case definitions of Service-Finder Project [1]

(1) Description that used to describe information of the use case

(2) Actors Roles and Goals that used to identify the actors they would be the roles they act and the

goals they need to achieve in the scenario

(3) Storyboard that used to describe the serial of interactions among the actors and the

Service-Finder Portal

2112 System Administrator

This section is going to present the use case that applied to the Service-Finder portal and that

illustrated the requirements on its functionality from a user point of view However all these

Deep Web Service Crawler

11

information in this use case are derived from [1] In this use case there has a system administrator

whose name is ldquoSam Adamsrdquo He works for the bank His job is to keep the online payment facilities

online and working all day and night Therefore if there is any system failures Sam Adams should fix

the problems as early as he can Thatrsquos why he wants to use an SMS Messaging service which will

alert him immediately by sending him a SMS Message in the case of a system failure

n Description

This use case is dedicated to describe the system administrator ldquoSam Adamsrdquo who is looking for an

SMS Messaging Service that he wants to build it into his application

n Actors Roles and Goals

The name of the Actor is ldquoSam Adamsrdquo His Role is a system administrator at a bank The goals of him

are the immediate service delivery the reliability of the service and low base fee and transaction fee

n Storyboard

Step 1 Because of Sam Adams knows the Service-Finder portal and he also knows that he can find

many useful services from it especially he know what he is looking for Hence he visits the

Service-Finder portal and starts to search by entering the keywords ldquoSMS Alertrdquo

Requirement 1 Search functionality

Step 2 Now the Service-Finder returns a list of matching services However Sam wants to choose the

number of matching services that will be displayed on one page And he would also expect there has

short information about the service functionality the service provider and the service availability So

that he could decide which service he will choose to read further

Requirement 2 Enable configurable pagination of the matching results and have some short

information for each service

Step 3 When Sam looks through the short information about the services that displayed on the first

page he expects to find the most relevant services that related to his request After that he would

like to read more detailed information about that service to see whether this service can provide the

needed functionality

Requirement 3 Rank the returned matching services and must provide ability to read more details of

a service

Step 4 In the case that all the returned matching services Sam got provide quite different

functionalities or they belong to different service categories for example the SMS messaging services

alert users not through SMS but voice messaging For this reason Sam would like to see other

different categories that may be contain the services he wants Or the services of other categories

which he is also interested in (like ldquoSMS Messagingrdquo) Besides another possible way is that Sam can

further filter his search in terms of browsing through categories

Requirement 4 Service categories and allow the user to look all services that belonged to that specific

category If possible it should also allow the user to browse through categories

Step 5 When Sam got all the services that could provide a SMS messaging service via the methods

described in the Step 4 at present he wants to look for the services that offered by an Austrian

provider and have no base fees if possible

Requirement 5 Faceted search

Deep Web Service Crawler

12

Step 6 After Sam got all these specific services now he would like to choose the services that can

provide a high reliability

Requirement 6 Sort functionality based on usersrsquo chooses

Step 7 For now Sam expects to compare the service availability between the promised to the service

provider and the actually provided This should be contained in the servicesrsquo details And there needs

also have service coverage information so that Sam can know whether this service covers the areas

he lives and works Moreover Sam would also like to compare these services in other way For

instance put some services into a structured table to compare the transaction fees

Requirement 7 A side-by-side comparison table for services and a functionality that enable users to

select services he wants to compare

Step 8 At last Sam wants to know whether the service providers offer a free try out of the services

So that he can test the service functionality

Requirement 8 If possible display a note that offering free service trials

212 Architecture Plan for the Service-Finder Project

The architecture plan of the Service-Finder project contains five basic components Service Crawler

Automatic Annotator Conceptual Indexer and Matcher Cluster Engine and Service-Finder Portal

Interface Figure 2-1 presents a high level overview of the components and the data flow among

them

Figure2-1Dataflow of Service-Finder and Its Components [3]

Deep Web Service Crawler

13

2121 The Principle of the Service Crawler Component

The Service Crawler is responsible for gathering available services and their related information from

the Web The overall cycle is depicted as following

(1) A Web developer publishes a Web Service

(2) Then the Crawling component begins to harvest the Web in order to identify the Web Services

like WSDL (Web Service Description Language) documents

(3) The Crawler is also going to search for other related information as long as a service is discovered

(4) Later after each periodic interval the Crawler will produce a consistent snapshot of the relevant

part of the Web

At last the output of the crawler would be forwarded to the subsequent components for analyzing

indexing and displaying

2122 The Principle of the Automatic Annotator Component

The Automatic Annotator receives the relevant data from previous component and generates

semantic service descriptions about the WSDL documents and its related information based on the

Service-Finder Ontology and Service Category Ontology

Firstly it will simply introduce those two compatible ontologies that would be used throughout the

whole process [2]

n Generic Service Ontology it is an ontology which is functional to describe the data objects For

example the services the service providers availability payment modalities and so on

n Service Category Ontology it is an ontology which is used to categorize the functionalities or

applications of the services For instance data verification messaging data storage weather etc

Afterwards it is going to talk about the function of this component with its input output

Oslash Input

u Crawled data from Service Crawler

u Service-Finder Ontologies

u Feedback or Correction of before annotations

Oslash Function

u Enrich the information about the service and extract semantic statements according to the

Service-Finder Ontologies For example categorize the service according to the Service

Category Ontology

u Determine whether a particular document is relevant or not through the Web link graph If

not discard these irrelevant documents

u Classify the pages into their genres For instance pricing user comments FAQ and so on

Oslash Output

u Semantic annotation of the services

Deep Web Service Crawler

14

2123 The Principle of the Conceptual Indexer and Matcher

Component

The Conceptual Indexer and Matcher is more like a data store center that aims at storing all extracted

information of the services and supplying users the capability of retrieval and semantic query For

example the matchmaking between user requests and service offers and the act of retrieving user

feedback on extracted annotations

In addition letrsquos have a look of the function of this component and its input output

Oslash Input

u Semantic annotation data and full text information obtained from Automatic Annotation

u Semantic annotation data and full text information that come from user interfaces

u Cluster data from user and service clustering component

Oslash Function

u Store the semantic annotations received from the Automatic Annotation component and

from the user interface

u Store the cluster data that procured through the clustering component

u Store and index the textual description offered by the Automatic Annotation component

and the textual comments offered by users

u Ontological query the semantic data from the data store center

u Combined keyword and Ontological querying used for user queries

u Provide a list of similar services for a given service

Oslash Output

u A list of matching services that are queried by users In particular these services should be

sorted by ranking and can also be iterated

u All available data that related to a particular entity must be retrievable at the user interface

2124 The Principle of the Service-Finder Portal Interface

Component

The Service-Finder Portal Interface is the main entry point that provided for users of the

Service-Finder system to search and browse the data which is managed by the Conceptual Indexer

and Matcher component In addition the users can also contribute information by means of providing

tags comments categorizations and ratings to the data browsed Furthermore the developers can

still directly invoke the Service-Finder functionalities from their custom applications in terms of an API

Besides the details of this componentrsquos function input and output are represented as below

Oslash Input

u A list of ordered services for a query

u Detailed information about a service or a set of services and a service provider

u Query access to service category ontology and the most used tags provided by the users

Deep Web Service Crawler

15

u Service availability information

Oslash Function

u The Web Interface allows the users to search services by keyword tag or concept in the

categorization sort and filter query results by refining the query compare and bookmark

services try out the services that offer this functionality

u The API allows the developers to invoke Service-Finder functionalities

Oslash Output

u Explicit user annotations such as tags ratings comments decryptions and so on

u Implicit user data for example click stream of users bookmarks comparisons links sent

etc

u Manual advertising of available new services

2125 The Principle of the Cluster Engine Component

The Cluster Engine is in charge of collecting and analyzing information of user behaviors from the

Service-Finder Portal eg the queried services and the compared services of the users Moreover it

also provides cluster data to the Conceptual Indexer and Matcher for providing service

recommendations

Furthermore letrsquos detailed introduce this componentrsquos function input and output

Oslash Input

u Service annotation data of both extracted and user feedback

u Usersrsquo Click streams used for extracting user behaviors

Oslash Function

u Obtain user clusters from user behaviors

u Obtain service clusters from service annotation data to enable to find similar services

Oslash Output

u Clusters of users and services

22 Information Extraction

Due to the cause of the rapid development and use of the World-Wide Web there procedure a huge

amount of information sources on the Internet which has been limited the access to browsing and

searching for the reason of the heterogeneity and the lack of structure of Web information sources

Therefore the appearance of Information Extraction that transforms the Web pages into

program-friendly structures for post-processing would become a great necessity However the task of

Information Extraction is specified in terms of the inputs and the extraction targets And the

techniques used in the process of Information Extraction called extractor

221 Input Types of Information Extraction

Generally speaking, there are three different input types. The first input type is the unstructured document, for example the free text shown in figure 2-2. It is unstructured and written in natural language, so it requires substantial natural language processing. The second input type is the structured document, for instance XML documents, whose data can be described through an available DTD (Document Type Definition) or XML (eXtensible Markup Language) schema. The third input type is the semi-structured document, which is widespread on the Web, such as the large volume of HTML pages containing tables, itemized lists, and enumerated lists. This is because HTML tags are often used to render embedded data in HTML pages; see figure 2-3.

Figure 2-2: Left is the free text input type and right is its output [4]
Figure 2-3: A semi-structured page containing data records (in rectangular box) to be extracted [4]

In this way, inputs of the semi-structured type can be seen as documents with a fairly regular structure, and the data of these documents can be displayed in an HTML or non-HTML format. Since the Web pages of the Deep Web are dynamic and generated from structured databases by means of templates or layouts, the Deep Web can be considered one of the input sources providing such semi-structured documents. For example, the authors, prices, and comments on the book pages provided by Amazon all have the same layout, because these Web pages are generated from the same database and rendered with the same template or layout. Furthermore, semi-structured HTML pages can also be generated manually. For example, although the publication lists on different researchers' homepages are produced by diverse users, they all have a title and a source property for every single paper. Finally, the inputs for Information Extraction can also be pages of the same class within or among various Web Service Registries.

222 Extraction Targets of Information Extraction

Regarding the task of Information Extraction, the extraction target also has to be considered. There are two different extraction targets. The first one is the relation of a k-tuple, where k is the number of attributes in a record. In some cases an attribute of a record may have no instantiation, while in other cases an attribute owns multiple instantiations. The second extraction target is the complex object with hierarchically organized data. Although the ways of depicting extraction targets in a page are diverse, the most common structure is the hierarchical tree. Such a tree may contain only one leaf node, or one or more lists of leaf nodes grouped under internal nodes. The structure of a data object may be either flat or nested. In brief, if the structure is flat then there is only one leaf node, which can also be called the root; if the structure is nested, then the internal nodes involved in this data object span more than two levels.

Furthermore, in order to make Web pages readable for human beings and easier to visualize, tables, tuples of the same list, or elements of a tuple should be clearly isolated or demarcated. However, the display of a data object in a Web page can be affected by the following conditions [4]:

Ø The attribute of a data object has zero or several values
(1) If there is no value for an attribute of a data object, this attribute is called a "none" attribute. For example, a special offer only available for certain books might be a "none" attribute.
(2) If there is more than one value for an attribute of a data object, it is called a "multiValue" attribute. For instance, the name of the author of a book could be a "multiValue" attribute.

Ø The set of attributes (A1, A2, A3, ...) has multiple orderings
That is to say, within this set of attributes, the position of an attribute might change across the diverse instances of a data object. Such an attribute is called a "multiOrdering" attribute. For instance, for movies before 1999 a movie site might list the release date in front of the movie's title, while for movies from 1999 on (including 1999) it lists the release date behind the movie's title.

Ø The attribute has different formats
This means the display format of the data object can be completely different across instances. If the format of an attribute is free, a lot of rules are needed to deal with all possible cases; such an attribute is called a "multiFormat" attribute. For example, an e-commerce Web site might use a bold font to present regular prices while using a red color to display sale prices. There is also the opposite situation, in which different attributes of a data object have the same format, for example various attributes presented with <TD> tags in a table. Attributes like these can be differentiated by means of their order information. However, in cases where "none" attributes or "multiOrdering" attributes occur, the rules for extracting these attributes have to be revised.

Ø The attribute cannot be decomposed
For easier processing, input documents are sometimes treated as strings of tokens instead of strings of characters. However, some attributes cannot be decomposed into several individual tokens; these are called "untokenized" attributes. For example, in college course catalogue entries like "COMP4016" or "GEOL2001", the department code and the course number cannot be separated into two different strings such as "COMP" and "4016" or "GEOL" and "2001".

223 The Used Techniques in Information Extraction

The extractor used in the process of Information Extraction aims at providing a single uniform query interface to access information sources such as database servers and Web servers. It consists of the following phases: collecting returned Web pages, labeling these Web pages, generalizing extraction rules, extracting the relevant data, and outputting the result in an appropriate format (an XML format or a relational database) for further information integration. For example, the extractor first queries the Web server to gather the returned pages through the HTTP protocol; after that it extracts the contents of these HTML documents and then integrates them with other data sources. The whole process of the extractor follows the steps below; a small code sketch follows the step list.

Ø Step 1
At the beginning the input has to be tokenized. There are two different granularities for the input string tokenization: tag-level encoding and word-level encoding. Tag-level encoding transforms the tags of the HTML page into general tokens, while transforming every text string between two tags into a special token. Word-level encoding works differently: it treats each word in the document as a token.
Ø Step 2
Next, the extraction rules are applied for every attribute of the data object in the Web pages. These extraction rules can be induced by means of top-down or bottom-up generalization, pattern mining, or logic programming. In addition, the extraction rules may be expressed as regular grammars or logic rules. For example, some use path expressions over the HTML parse tree, such as html->head->title or html->table[0]; some use syntactic or semantic constraints; and some use delimiter-based constraints such as HTML tags or literal words.
Ø Step 3
After that, all the extracted data are assembled into records.
Ø Step 4
Finally, this process is iterated until all data objects in the input have been processed.
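To make the delimiter-based kind of extraction rule from Step 2 more concrete, the following minimal Java sketch applies one hand-written rule to an HTML fragment. The jsoup parser and the names DelimiterRuleExample and extractPrice are assumptions made purely for illustration; they are not part of any system described in this thesis.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class DelimiterRuleExample {

    // A delimiter-based rule: the price is the text of the cell that follows
    // a <td> whose literal content is "Price:".
    static String extractPrice(String html) {
        Document doc = Jsoup.parse(html);
        for (Element cell : doc.select("td")) {
            if (cell.text().equals("Price:")) {
                Element next = cell.nextElementSibling();
                return (next != null) ? next.text() : null;
            }
        }
        return null; // "none" attribute: no price present in this record
    }

    public static void main(String[] args) {
        String html = "<table><tr><td>Title:</td><td>Deep Web</td></tr>"
                    + "<tr><td>Price:</td><td>19.99 EUR</td></tr></table>";
        System.out.println(extractPrice(html)); // prints "19.99 EUR"
    }
}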


23 Pica-Pica Web Service Description Crawler

Pica-Pica is known as a kind of bird species, which can also be called pie. Here, however, Pica-Pica is the name of a Web Service Description Crawler which is designed to address the quality of Web Services, for example evaluating the descriptive quality of the Web Services on offer and how well these Web Services are described in today's Web Service Registries.

231 Needed Libraries of the Pica-Pica Web Service Description Crawler

This version of the Pica-Pica Web Service Description Crawler was written by Anton Caceres and Josef Spillner and is programmed in the Python language. In order to run these scripts to parse the HTML pages, two additional libraries are needed: Beautiful Soup and Html5lib.

• Beautiful Soup
It is an HTML/XML parser for the Python language, and it can even turn invalid markup into a parse tree [5]. Moreover, the following three features make it more powerful:
- Bad markup does not choke Beautiful Soup. It generates a parse tree that makes approximately as much sense as the original document, so you can still obtain the data that you want.
- Beautiful Soup provides a toolkit with simple, idiomatic methods for navigating, searching, and modifying the parse tree. Hence you do not need to create a custom parser for every application.
- If the document has already specified an encoding, you do not have to care about it, since Beautiful Soup converts the documents to Unicode and to UTF-8 automatically. Otherwise, all you have to do is specify the encoding of the original document.
Furthermore, the ways of including Beautiful Soup in an application are shown in the following [5]:
- from BeautifulSoup import BeautifulSoup (for processing HTML)
- from BeautifulSoup import BeautifulStoneSoup (for processing XML)
- import BeautifulSoup (to get everything)

• Html5lib
It is a Python package which implements the HTML5 [8] parsing algorithm. In order to gain maximum compatibility with the current major desktop web browsers, this implementation is based on the WHATWG (Web Hypertext Application Technology Working Group) HTML5 specification.


232 Architecture of the Pica-Pica Web Service Description Crawler

Figure 2-4: Architecture of the Pica-Pica Web Service Description Crawler
Figure 2-4 illustrates the primary architecture of the Pica-Pica Web Service Description Crawler. It includes four fundamental components: the Service Page Grabber component, the WSDL Grabber component, the Property Grabber component, and the WSML Register component.
(1) The Service Page Grabber component takes the URL seed as input and outputs the link of the service page to the following two components: the WSDL Grabber component and the Property Grabber component.
(2) The WSDL Grabber component is responsible for obtaining the WSDL document based on the delivered service page link and then checking whether the obtained WSDL document is valid. Only valid WSDL documents are passed on to the WSML Register component for further processing.
(3) The Property Grabber component tries to extract the service's properties hosted in the service page, if there are any. After that, all these service properties are saved into an INI file as the information of that service.
(4) The functionality of the WSML Register component is to write an appropriate WSML document by means of the valid WSDL documents delivered from the WSDL Grabber component and the optional INI files delivered from the Property Grabber component, and afterwards to register them in Conqo.

• WSML [9]
WSML stands for Web Service Modeling Language, which provides a framework with different language variants. Hence it is often used to describe the different aspects of semantic Web Services according to the conceptual model of WSMO.
• WSMO [10]
WSMO, whose full name is Web Service Modeling Ontology, is dedicated to describing various aspects related to Semantic Web Services based on ontologies. An ontology is a formal, explicit specification of a shared conceptualization. These ontologies are the linking points between the agreement of the communities of users and the defined conceptual semantics of the real world.
• Conqo [11]
It is a discovery framework that considers not only the Quality of Service (QoS) but also context information. It uses a Web Service Repository to manage service descriptions based on WSML.

233 Implementation of the Pica-Pica Web Service Description Crawler

This section describes the implementation of the Pica-Pica Web Service Description Crawler in detail.

(1) First, to start the whole crawling process of the Pica-Pica Web Service Description Crawler, an input is needed as the initial seed. In this crawler there are five Web Service Registries, which are listed below. The URL addresses of these five Web Service Registries are used as the input seeds for the Pica-Pica Web Service Description Crawler. Moreover, in this version of the Pica-Pica Web Service Description Crawler there is a single Python script for each Web Service Registry, and the crawling processes of these registry-specific Python scripts are executed one after another.
Biocatalogue: http://www.biocatalogue.com
Ebi: http://www.ebi.ac.uk
Seekda: http://www.seekda.com
Service-Repository: http://www.service-repository.com
Xmethods: http://www.xmethods.net

(2) After being fed with the input seed, the process steps into the next component, the Service Page Grabber. At first this component tries to read the data from the Web based on the input seed. Then it establishes a parse tree of the read data by means of the functions of the Beautiful Soup library. After that, the Service Page Grabber component starts to look for the service page link of each service published in the Web Service Registry by means of the functions in the Html5lib library. When the service page link of a single service is found, the component first checks whether this service page link is valid. Once the service page link is valid, it is passed into the following two components for further processing: the WSDL Grabber component and the Property Grabber component.

(3) When the WSDL Grabber component receives a service page link from the previous component, it sets out to extract the WSDL link address of that service from the parse tree of the data in this service page. Next, this component starts to download the WSDL document of that service by means of the WSDL link address. Thereafter, the obtained WSDL document is stored on disk. The process of this WSDL Grabber component is carried on continually until no more service links are passed to it. Certainly not all grabbed WSDL documents are valid: they may contain bad definitions or a bad namespaceURI, be an empty document, or, even worse, not be in XML format at all. Hence, in order to pick them out, this component further analyzes the obtained WSDL documents. All valid documents are then put into a "validWSDLs" folder, whereas the invalid documents are put into a folder named "invalidWSDLs" in order to gather statistical information. Finally, only the WSDL documents in the "validWSDLs" folder are passed into the subsequent component.

(4) Moreover, since some Web Service Registries give additional information about the services, such as availability, service provider, or version, the Property Grabber component sets out to extract this information as the service's properties and thereafter saves these properties into an INI file. However, if no additional information is available, there is no need to extract service properties, and thus there is no INI file for that service. In this implementation of the Pica-Pica Web Service Description Crawler, only the Python scripts for the Seekda and Service-Repository Web Service Registries have functions to extract the services' properties, while for the other three Web Service Registries there is no such function.

(5) Furthermore, it is optional to create a report file which contains statistical information about this process, such as the total number of services of one Web Service Registry, the number of services whose WSDL document is invalid, etc.

(6) As stated above, at this point there is a folder with all valid WSDL documents and possibly also some INI files. The task of the WSML Register component is now to generate the appropriate WSML documents from these valid WSDL documents and INI files and then register them in Conqo.

24 Conclusions of the Existing Strategies

In this chapter three existing strategies have been presented: the Service-Finder project, the Information Extraction technique, and the Pica-Pica Web Service Description Crawler.
The task of this master thesis is to obtain the available Web services and their related information from the Web. This is essentially a procedure of extracting the needed information about each service, such as the service's WSDL document and its properties. Therefore the Information Extraction technique, which is used to extract information hosted on the Web, can be applied in this master thesis.

Moreover, Service-Finder is a large project that is not only able to obtain the available Web services and their related information from the Web, but also enriches the crawled data with annotations by means of the Service-Finder ontology and the Service Category ontology and integrates all the information into a coherent semantic model. Furthermore, it provides capabilities for searching and browsing the data through a user interface and gives users service recommendations. However, the Service-Finder project far exceeds the requirements of a master thesis. Therefore it is only considered as a reference for this work.

Since the Pica-Pica Web Service Description Crawler aims only at obtaining the available Web services and their related information, it fulfills the primary task of this master thesis. Nevertheless, regarding the information about a service, the Pica-Pica Web Service Description Crawler extracts only a few properties, sometimes even none. Consequently, in order to improve the description quality of a service, as many properties about the service as possible have to be extracted. Hence, chapter 3 presents an extension of this Pica-Pica Web Service Description Crawler.


3 Design and Implementation

In the previous chapter, the State of the Art, already existing techniques and implementations were presented. In the following, the basic principle of the proposed approach, the Deep Web Services Crawler, is introduced; it is based on these existing techniques, especially the Pica-Pica Web Service Description Crawler.

31 Deep Web Services Crawler Requirements

This section mainly describes the goals of the Deep Web Service Crawler approach, the system requirements of the approach, and some non-functional requirements.

311 Basic Requirements for DWSC

The following are the basic requirements which should be achieved:
(1) Produce the largest annotated Service Catalogue. Here, the Service Catalogue is actually a list of Web services published in the Web Service Registries. It contains not only the WSDL document of each service but also a file with the information about that service. Therefore, for the purpose of producing the largest annotated Service Catalogue, the proposed approach needs to extract as many properties about those Web services as possible. Moreover, it also has to download the WSDL document hosted along with the Web service. That is to say, these properties are not only the interesting structured properties, such as the service name and its WSDL link address, but also other non-functional properties, for example the endpoint and its monitoring information.
(2) Flexible storage: each service has a Service Catalogue that contains all its interesting properties. An important question is how to deal with those service properties, i.e. which kinds of schemes are used to store them. Hence, in order to store them in a flexible way, the proposed approach provides three storage methods: the first one stores them as an XML file, the second one stores them in an INI file, and the third method uses a database for the storage.

312 System Requirements for DWSC

Generally speaking, the requirements for realizing a programming project include the following:
1) Operating system: Linux/Unix, Windows (XP, Vista, 2000, etc.)
2) Programming language: C++, Java, Python, C, etc.
3) Programming tool: NetBeans, Eclipse, Visual Studio, and so on
In this master thesis the scripts of the Deep Web Service Crawler approach are written in the Java programming language. These code scripts have only been tested on the Windows XP and Linux operating systems, but not on other operating systems.


313 Non-Functional Requirements for DWSC

In this part several non-functional requirements for the Deep Web Service Crawler approach are presented.
1) Transparency: the process of data exploration and data storage should be done automatically, without the user's intervention. However, the user has to specify first the path on the hard disk that will be used to store the outputs of this program.
2) Fault tolerance: during the execution of this program, errors can inevitably happen. Therefore, in order to keep the process from being interrupted, the necessary error handling for process recovery must be in place.
3) Completeness: this approach should extract as many of the interesting properties about each Web Service as possible, e.g. endpoint, monitoring information, etc.
In addition, since the Pica-Pica Web Service Crawler has already implemented strategies for the following five URLs, the proposed approach must cover no fewer than these five Web Service Registries:
Biocatalogue: http://www.biocatalogue.com
Ebi: http://www.ebi.ac.uk
Seekda: http://www.seekda.com
Service-Repository: http://www.service-repository.com
Xmethods: http://www.xmethods.net

32 Deep Web Services Crawler Architecture

In this section an overview of the high-level architecture of the Deep Web Services Crawler approach is introduced first. Thereafter, four subsections are presented that focus on outlining each single component and on how they play together.
The current components and flows of data in the Deep Web Service Crawler can be summarized as depicted in Figure 3-1 using the continuous arrows. The crawler first tries to obtain the available service list page links and related service page links by crawling the Web (Web Service Extractor). Then those gathered links are processed in two separate steps: one gets the service's WSDL document (WSDL Grabber) and the other collects the properties of each service (Property Grabber). Finally, all these data are stored on the storage device (Storage). The whole detailed process in figure 3-1 is illustrated in the following steps; a condensed code sketch is given after the step list.

Ø Step 1
When the Deep Web Service Crawler starts to run, the File Chooser container requires the user to specify a path on the computer or on any other hard disk. The reason for specifying the path is that the Deep Web Service Crawler program needs a place to store all its outputs.
Ø Step 2
After that, the Web Service Extractor is triggered. It is the main entry to the actual crawling process. Since the Deep Web Service Crawler program is a procedure that is supposed to crawl for Web Services in some given Web Service Registries, the URL addresses of these Web Service Registries have to be given as the initial seeds for this Web Service Extractor process. However, since the page structures of these Web Service Registries are completely different, there is a dedicated process for each Web Service Registry.
Figure 3-1: Overview of the Basic Architecture of the Deep Web Services Crawler

Ø Step 3
According to the given seed, two types of links are obtained by the Web Service Extractor component: the service list page link and the service page link. A service list page is a page that contains a list of Web Services and possibly some information about these Web Services, while a service page is a page that contains much more information about a single service. Finally, it forwards these two types of links to the next two components, the Property Grabber and the WSDL Grabber.
Ø Step 4
On the one hand, the Property Grabber component tries to gather the information about the service hosted in the service list page and the service page, such as the name of the service, its description, the ranking of this service, etc. Finally, all the information about the service is collected together as the service properties, which are then delivered to the Storage component for further processing.
Ø Step 5
On the other hand, the WSDL Grabber tries to obtain the WSDL link from the service list page or the service page. This is because for some Web Service Registries the WSDL link is hosted in the service list page, like in the Biocatalogue, and for other Web Service Registries it is hosted in the service page, such as in Xmethods. After the WSDL link is obtained, it is also transmitted to the Storage component for further processing.

Ø Step 6
When the service properties and the WSDL link of the service are received by the Storage component, they are stored on disk. The service properties are stored in three different ways: as an XML file, as an INI file, or as one record inside a database table. For the WSDL link, the Storage component first tries to download the page content according to the URL address of the WSDL link. If this succeeds, the page content of the service is stored as a WSDL document on disk.
Ø Step 7
Nevertheless, steps 3 to 6 describe the crawling process for only a single service. Hence, if there is more than one service or more than one service list page in a Web Service Registry, the crawling process from step 3 to step 6 is repeated again and again until there are no more services or service list pages in that Web Service Registry.
Ø Step 8
Furthermore, after the crawling process for one Web Service Registry finishes, a file is generated that contains some statistical information about this crawling process, for example the times when the crawling process of this Web Service Registry started and finished, the total number of Web services in this Web Service Registry, how many services have an empty WSDL document, the average number of service properties in this Web Service Registry, and the average time cost for extracting service properties, getting the WSDL document, and generating the XML file, INI file, etc.
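The following minimal Java sketch condenses steps 1 to 7 into one loop. The class and method names mirror the component names of figure 3-1, but their signatures and the placeholder bodies are assumptions made only to illustrate how the components could play together; they are not the actual thesis code.

import java.util.Collections;
import java.util.List;
import java.util.Map;

public class DeepWebServiceCrawlerSketch {

    // Placeholder components; the real registry-specific logic is described in sections 321 to 324.
    static List<String> getServiceListPageLinks(String registrySeed) { return Collections.emptyList(); }
    static List<String> getServicePageLinks(String listPageLink)     { return Collections.emptyList(); }
    static Map<String, String> getProperties(String listPage, String servicePage) { return Collections.emptyMap(); }
    static String getWsdlLink(String listPage, String servicePage)   { return null; }
    static void store(Map<String, String> properties, String wsdlLink, String outputPath) { }

    public static void crawlRegistry(String registrySeed, String outputPath) {
        for (String listPage : getServiceListPageLinks(registrySeed)) {            // step 3
            for (String servicePage : getServicePageLinks(listPage)) {             // step 3
                Map<String, String> props = getProperties(listPage, servicePage);  // step 4
                String wsdlLink = getWsdlLink(listPage, servicePage);              // step 5
                store(props, wsdlLink, outputPath);                                // step 6
            }
        }
        // step 8: a statistics report for this registry would be written here
    }

    public static void main(String[] args) {
        // step 1: user-chosen output path; step 2: one registry seed as input
        crawlRegistry("http://www.service-repository.com", "C:/dwsc-output");
    }
}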

321 The Function of Web Service Extractor Component

The Web Service Extractor component is responsible for gathering service list page links and service page links. It pursues a focused crawl of the Web and only forwards service list page and service page links to subsequent components for analyzing, collecting, and gathering purposes. Therefore it identifies both service list page links and related service page links on these Web Service Registries.
As can be seen from figure 3-2, a crawl for Web Services needs to start from a URL seed. The seed is almost as important as the Web Service Extractor itself, as it highly influences the part of the Web which needs to be crawled. The seed can or shall contain e.g. Web pages where these Web Services are published or pages that talk about Web Services.


Figure 3-2: Overview of the process flow of the Web Service Extractor Component
After being fed with the URL seed, the Web Service Extractor component starts to get the link of the service list page from the initial page of this URL seed. However, this process is different for each of the five Web Service Registries. The following shows the situation in each of these Web Service Registries.

Ø Service-Repository Web Service Registry
In this Web Service Registry, the link of the first service list page is the URL address of its seed, which means some Web Services can already be found on the home page of the Service-Repository Web Service Registry. The "first" here implies that there is more than one service list page link in this Web Service Registry. Therefore, the process of getting service list page links in this registry is carried on continually until no more service list page links exist.
Ø Xmethods Web Service Registry
Although there are Web Services on the home page of the Xmethods Web Service Registry, their number is only a small subset of those in this Web Service Registry. Moreover, in the Xmethods Web Service Registry there is only one page containing all Web Services. Therefore, the service list page link of that page has to be obtained.
Ø Ebi Web Service Registry
The situation in the Ebi Web Service Registry is a little bit like that in the Xmethods Web Service Registry. That is to say, there is also one page that contains all Web Services of this Web Service Registry. However, this page is not the initial page of the input seed. Therefore, more than one operation step is needed to get the service list page link of that page.
Ø Seekda Web Service Registry
In the Seekda Web Service Registry, the Web Services are not contained in the initial page of the input seed. The service list page link can be obtained after several additional operation steps. However, there is a problem with getting the service list page links in this registry: simply put, if there is more than one page containing Web Services, for some unknown reason the links of the remaining service list pages cannot be obtained. In other words, only the link of the first service list page can be obtained.
Ø Biocatalogue Web Service Registry
The process of getting service list pages in the Biocatalogue Web Service Registry is almost the same as in the Seekda Web Service Registry. The only difference is that in the Biocatalogue Web Service Registry all service list page links can be obtained if there is more than one service list page.

After getting the link of the service list page, the Web Service Extractor begins to get the link of the service page for each service listed in the service list page. This is possible because there is an internal link for every service which addresses the service page. It is worth noting that once a service page link is obtained, this component immediately forwards the service page link and the service list page link to the subsequent two components for further processing. The process of obtaining the service page links is carried out continuously until all services listed in that service list page are crawled. Analogously, the process of getting service list pages is also carried out continuously until no more service list pages exist.

3211 Features of the Web Service Extractor Component
The main features are described in the following paragraphs.
1) Obtain service list page links
A central task is to obtain the corresponding service list page links. A service list page link is a URL address that leads to a public list of Web Services together with some simple information about these Web Services, like the name of the service, an internal URL that links to another page containing the detailed information about that service, and sometimes the link address of the WSDL document.
2) Obtain service page links
Once service list page links are found, a crucial aspect is to extract the internal link of each service to aid the task of service information discovery. Therefore it is the task of the Web Service Extractor to harvest the HTML page content of the service list page, so that the service page links, which lead to much more detailed information about the single Web service, can be obtained.
3212 Input of the Web Service Extractor Component
This component depends on specific input seeds. The only input required for this component is a URL seed; this URL seed is one of the URLs listed in section 313.
3213 Output of the Web Service Extractor Component
The component produces two kinds of service-related page links from the Web:
- Service list page links
- Service page links


3214 Demonstration for Web Service Extractor
In order to give a comprehensive understanding of the process of the Web Service Extractor component, the following figures are given for explanation. Though there are five URL addresses in this section, only the URL of the Service-Repository is shown as an example.
1) The input seed is the initial URL address of the Service-Repository, which is "http://www.service-repository.com".
2) As already said in section 321, the first service list page link of this Web Service Registry is its input seed "http://www.service-repository.com". Figure 3-3 shows the corresponding service list page of that link.
Figure 3-3: Service list page of the Service-Repository
Figure 3-4: Original source code of the internal link of the Web service "BLZService"
Figure 3-5: Code overview of getting the service page link in the Service-Repository
Figure 3-6: Service page of the Web service "BLZService"


3) Now that the service list page link is known, the next step is to acquire the service page links of the services listed in the service list page. The text in the red box of figure 3-4 shows the internal link of the Web service "BLZService". However, this is not the complete link of the service page: it has to be prefixed with the initial URL address of the Service-Repository Web Service Registry, "http://www.service-repository.com". The code for getting the service page link of a Web service in the Service-Repository is shown in figure 3-5; a simplified sketch of this idea is also given after this list. Therefore the final link of this service page is "http://www.service-repository.com/service/overview-210897616". Figure 3-6 is the corresponding service page of that link.
4) Afterwards, these two links, the service list page link and the service page link, which are gathered by the Web Service Extractor component, are immediately forwarded to the next two components, the WSDL Grabber component and the Property Grabber component.
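As a rough illustration of what the code in figure 3-5 does, the following Java sketch extracts the internal service links from a Service-Repository list page and prefixes them with the registry's base URL. The jsoup HTML parser and the CSS selector used here are assumptions made for this example; the actual implementation may rely on a different parser and on the concrete page structure.

import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ServicePageLinkSketch {

    private static final String BASE_URL = "http://www.service-repository.com";

    // Collect the absolute service page links found on one service list page.
    static List<String> getServicePageLinks(String serviceListPageUrl) throws Exception {
        Document listPage = Jsoup.connect(serviceListPageUrl).get();
        List<String> servicePageLinks = new ArrayList<>();
        // Assumption: every service entry contains an anchor whose href starts with "/service".
        for (Element anchor : listPage.select("a[href^=/service]")) {
            servicePageLinks.add(BASE_URL + anchor.attr("href"));   // prefix the relative link
        }
        return servicePageLinks;
    }

    public static void main(String[] args) throws Exception {
        for (String link : getServicePageLinks(BASE_URL)) {
            System.out.println(link);
        }
    }
}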

322 The Function of WSDL Grabber Component

The WSDL Grabber component acquires the WSDL link hosted on the Web based on the service list page link or the service page link. The whole process flow is illustrated in figure 3-7.
Figure 3-7: Overview of the process flow of the WSDL Grabber Component

When the WSDL Grabber component receives the inputs delivered from the previous component, it starts to get the WSDL link of the service based on these inputs. Although the inputs of the WSDL Grabber component are the links of the service page and the service list page, only one of them contains the WSDL link of the corresponding service. That is to say, the WSDL link exists either in the service page or in the service list page. The reason why both links need to be delivered into this component is that only one of the five Web Service Registries, namely the Biocatalogue, hosts the WSDL link in the service list page, while for the other four Web Service Registries the WSDL link is hosted in the service page. Therefore, the WSDL link for these four Web Service Registries is obtained from the service page link, and for the Biocatalogue Web Service Registry the WSDL link is obtained from the service list page link. However, there is a problem with getting the WSDL link in the Biocatalogue Web Service Registry: in brief, some of the Web services listed in the service list page of the Biocatalogue Web Service Registry do not have a WSDL link, in other words these services do not have a WSDL document. In such a situation, the value of the WSDL link of these Web services is assigned a "NULL" value. For the Web Services in the other four Web Service Registries, the WSDL link always exists in the service page. Eventually, whenever the WSDL Grabber component extracts the WSDL link of a single Web service, the link is immediately forwarded to the Storage component for downloading the WSDL document.

3221 Features of the WSDL Grabber Component
The WSDL Grabber component has to provide the following feature:
- Obtain WSDL links
The WSDL link is the direct way to reach the page that contains the contents of the WSDL document. It is actually a URL address, but at the end of this URL address there is something like "wsdl" or "WSDL" to indicate that this address points to the page of a WSDL document.
3222 Input of the WSDL Grabber Component
This component requires the following input data:
- Service list page link
- Service page link
3223 Output of the WSDL Grabber Component
The component only produces the following output data:
- The URL address of the WSDL link for each service

3224 Demonstration for WSDL Grabber Component
This section presents a list of figures in order to give a comprehensive understanding of the process of the WSDL Grabber component. It uses the same Web service of the Service-Repository as an example.
1) The input for the WSDL Grabber component is the link of the service page obtained from the Web Service Extractor component. The address of this link is "http://www.service-repository.com/service/overview-210897616".
Figure 3-8: WSDL link of the Web service "BLZService" in the service page


2) Figure 3-9 shows the corresponding original source code of the WSDL link highlighted in figure 3-8.
Figure 3-9: Original source code of the WSDL link of the Web service "BLZService"
3) Figure 3-10 and figure 3-11 show the code used to extract the WSDL link shown in figure 3-9. Figure 3-10 is the particular code for the Service-Repository Web Service Registry only; for the other four Web Service Registries this is different. The function "getServiceRepositoryWSDLLink" first gets a list of nodes that have the HTML tag name "b". Then it checks all of these nodes one by one to see whether the text value of the node is "WSDL". If the condition is fulfilled, the attribute value of its sibling, which is an "a" element here, is extracted as the value of the WSDL link of this Web service. A simplified sketch of this extraction idea is given after this list.
Figure 3-10: Code overview of the "getServiceRepositoryWSDLLink" function
Figure 3-11: Code overview of the "oneParameter" function
4) Finally, after applying the functions shown in figures 3-10 and 3-11, the WSDL link of the Web service "BLZService" is obtained. Its value is "http://services.unitedplanet.de/blz/BlzService.asmx?WSDL".
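The following minimal Java sketch mirrors the idea described in item 3: look for a <b> node whose text is "WSDL" and take the href of the neighbouring <a> element as the WSDL link. It is not the thesis code from figures 3-10 and 3-11; the jsoup parser and the sibling handling are assumptions made for illustration only.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class WsdlLinkSketch {

    // Extract the WSDL link from a Service-Repository service page, or return null.
    static String getServiceRepositoryWsdlLink(String servicePageHtml) {
        Document page = Jsoup.parse(servicePageHtml);
        for (Element bold : page.select("b")) {
            if (bold.text().trim().equals("WSDL")) {
                // Assumption: the link is the next <a> sibling of the <b> label.
                Element sibling = bold.nextElementSibling();
                if (sibling != null && sibling.tagName().equals("a")) {
                    return sibling.attr("href");
                }
            }
        }
        return null; // no WSDL link found on this page
    }

    public static void main(String[] args) {
        String html = "<p><b>WSDL</b> "
                    + "<a href=\"http://services.unitedplanet.de/blz/BlzService.asmx?WSDL\">link</a></p>";
        System.out.println(getServiceRepositoryWsdlLink(html));
    }
}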


323 The Function of Property Grabber Component

The Property Grabber component is a module which is used to extract and gather the Web service information hosted on the Web, namely the information shown in the service list page and the service page. In the end, all the obtained Web service information is collected together as the service properties, which are delivered into the Storage component for storing. The detailed process flow of the Property Grabber component is illustrated in figure 3-12.
The inputs of the Property Grabber component are the same as the inputs of the WSDL Grabber component. However, there is still a small difference between them with respect to the seed. As already mentioned in section 322, for the WSDL Grabber component only one of the inputs is sufficient to get the WSDL link, while the Property Grabber component needs both of them as seeds. Once the Property Grabber component receives the needed inputs, it starts to extract the information of that single Web service.
Figure 3-12: Overview of the process flow of the Property Grabber Component

After the Property Grabber component receives the inputs, it starts to extract the service information of the Web service. Generally speaking, the service information consists of four aspects: structured information, endpoint information, monitoring information, and whois information.
(1) Structured Information
The structured information can be obtained by extracting the information hosted in the service page and the service list page. It is the basic descriptive information about the service, such as the name of the service, the URL address through which the WSDL document can be obtained, the description introducing the service, the provider who provides that service, its rating, the server which owns this service, etc. However, the elements constituting this structured information differ between the different Web Service Registries. For example, the rating information of a Web service exists in the Service-Repository Web Service Registry, while the Xmethods Web Service Registry does not have this information. In addition, even for Web services in the same Web Service Registry, some elements of the structured information may not exist. For instance, one service in a Web Service Registry may have description information while another service in the same Web Service Registry does not. Tables 3-1 to 3-5 show the structured information that should be extracted from these five Web Service Registries. Moreover, if the style of a service in the Biocatalogue Web Service Registry is SOAP, there is some additional information describing the SOAP operations of this service, while if it is REST, this additional information describes the REST operations. This should also be considered part of the structured information. Table 3-6 and table 3-7 illustrate the information for these two different kinds of operations.

Service Name, WSDL Link, WSDL Version
Provider, Server, Rating
Homepage, Owner Homepage, Description
Table 3-1: Structured Information of the Service-Repository Web Service Registry

Service Name, WSDL Link, Provider
Service Style, Homepage, Implementation Language
Description, User Description, Contributed Client Name
Type of this Client, Publisher of this Client, Used Toolkit of this Client
Used Language of this Client, Used Operating System of this Client
Table 3-2: Structured Information of the Xmethods Web Service Registry

Service Name, WSDL Link, Server
Provider, Provider's Country, Service Style
Rating, Description, User Description
Service Tags, Documentation (within WSDL)
Table 3-3: Structured Information of the Seekda Web Service Registry

Service Name, WSDL Link, Port Name
Service URL Address, Implementation Class
Table 3-4: Structured Information of the Ebi Web Service Registry

Service Name, WSDL Link, Style
Provider, Provider's Country, View Times
Favorite Times, Submitter, Service Tags
Total Annotation, Provider Annotation, Member Annotation
Registry Annotation, Base URL, SOAP Lab Server Base URL
Description, User Description, Category
Table 3-5: Structured Information of the Biocatalogue Web Service Registry


SOAP Operation Name, Inputs and Outputs, Operation Description
Operation Tags, Part of Which Service
Table 3-6: SOAP Operation Information of the Biocatalogue Web Service Registry
REST Operation Name, Service Tags, Used Template
Operation Description, Part of Which Service, Part of Which Endpoint Group
Table 3-7: REST Operation Information of the Biocatalogue Web Service Registry

(2) Endpoint Information
The endpoint information is the port information of the Web service, which can be extracted only from the service page. Since different Web Service Registries structure the endpoint information of a Web service differently, some elements of the endpoint information are very diverse. One thing has to be noted: the Ebi Web Service Registry does not have endpoint information for any of the Web services published in this registry. Moreover, although the Web services in the same Web Service Registry share the same structure of the endpoint information, some elements of the endpoint information may be missing or empty. Furthermore, these Web Service Registries may even have no endpoint information at all for some Web services published by them. Nevertheless, whenever there is endpoint information for a Web service, there is at least one element, namely the URL address of the endpoint. Table 3-8 shows the endpoint information that should be extracted from these five Web Service Registries.

Web Service Registry Name: Elements of the Endpoint Information
Service-Repository: Endpoint Name, Endpoint URL, Endpoint Critical, Endpoint Type, Bound Endpoint
Xmethods: Endpoint URL, Publisher of this Endpoint, Contact Email of this Publisher, Implementation Language of this Endpoint
Seekda: Endpoint URL
Biocatalogue: Endpoint Name, Endpoint URL
Table 3-8: Endpoint Information of these five Web Service Registries

Web Service Registry Name: Elements of the Monitoring Information
Service-Repository: Service Availability, Number of Downs, Total Uptime, Total Downtime, MTTR, MTBF, RTT Max of Endpoint, RTT Min of Endpoint, RTT Average of Endpoint, Ping Count of Endpoint
Seekda: Service Availability, Begin Time of Monitoring
Biocatalogue: Monitored Status of Endpoint, Overall Status of Service
Table 3-9: Monitoring Information of these five Web Service Registries

(3) Monitoring Information
The monitoring information is the measured statistical information about the Web service. It is worth noting that the Ebi and Xmethods Web Service Registries do not have monitoring information for any of the Web services published by them, while for the other three Web Service Registries only a few Web services may lack this information. Table 3-9 displays the monitoring information for these three Web Service Registries.

(4) Whois Information
The whois information is not extracted from the information hosted in the service page and the service list page. It is descriptive information about the service domain, which is gained by means of the address of the WSDL link. Because of that, the process of getting the whois information starts with obtaining the service domain first. The final value of the service domain must not contain strings like "http", "https", "www", etc.; it must be the top-level registered domain name of the service. After that, the service domain database is queried by sending the value of the service domain to a whois client, which is simply a Web site on the Internet, for example "http://www.whois365.com/cn/domain". If information about that service domain exists, a list of its information is returned as the output. However, the structure of the returned information differs for different service domains. Therefore, the most challenging thing is that the extraction process has to deal with each different form of the returned information. Table 3-10 gives the whois information that needs to be extracted for all five Web Service Registries; a small code sketch of the domain extraction step is given below.

Service Domain URL, Domain Name, Domain Type
Domain Address, Domain Description, State
Postal Code, City, Country
Country Code, Phone, Fax
Email, Organization, Established Time
Table 3-10: Whois Information for these five Web Service Registries

Finally, all the information of these four aspects is collected together and then delivered into the Storage component for further storage processing.
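The following Java sketch illustrates how the service domain could be derived from a WSDL link before querying a whois client, as described under (4). The host-reduction rule of keeping only the last two labels is a simplifying assumption made for illustration; a real implementation has to handle more cases (country-specific second-level domains, IP addresses, etc.).

import java.net.URI;

public class ServiceDomainSketch {

    // Reduce a WSDL link to a plain registered domain, e.g.
    // "http://www.thomas-bayer.com/axis2/services/BLZService?wsdl" -> "thomas-bayer.com".
    static String getServiceDomain(String wsdlLink) throws Exception {
        String host = new URI(wsdlLink).getHost();           // e.g. "www.thomas-bayer.com"
        if (host == null) {
            return null;
        }
        String[] labels = host.split("\\.");
        if (labels.length <= 2) {
            return host;                                      // already a bare domain
        }
        // Simplifying assumption: keep only the last two labels.
        return labels[labels.length - 2] + "." + labels[labels.length - 1];
    }

    public static void main(String[] args) throws Exception {
        String domain = getServiceDomain(
                "http://www.thomas-bayer.com/axis2/services/BLZService?wsdl");
        System.out.println(domain);                           // prints "thomas-bayer.com"
        // The domain would then be sent to a whois client, for example by requesting
        // "http://www.whois365.com/cn/domain/" + domain, and the returned page parsed.
    }
}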

3231 Features of the Property Grabber Component
The Property Grabber component has to provide the following features:
- Obtain basic information
Generally speaking, the more information one Web service has, the better one can judge how good this Web service is. Hence it is necessary for the Property Grabber component to extract all the basic information hosted in the service list page and the service page. This basic information comprises the structured information, the endpoint information, and the monitoring information.
- Obtain whois information
Since more information about a Web service allows a better assessment of its quality, it is necessary to extract as much information about the Web service as possible. Therefore, besides the basic information, the Property Grabber component also obtains some additional information, called whois information, such as the type of the domain, the name of the person in charge, the postal code of the domain, city, phone, fax, detailed address, email, etc.


3232 Input of the Property Grabber Component
This component requires the following input data:
- Service list page link
- Service page link
3233 Output of the Property Grabber Component
The component produces the following output data:
- Structured information of each service
- Endpoint information of each service, if it exists
- Monitoring information for the service and its endpoint, if it exists
- Whois information of the service domain
All this information is collected together as the properties of each service. Thereafter the collected properties are sent to the Storage component.

3234 Demonstration for Property Grabber Component
The pictures from figure 3-13 to figure 3-16 depict the fundamental procedure of the Property Grabber component. To simplify the explanation, the example shown in this section uses the same Web service as before. The extracted results are finally collected as the service properties, as sketched after this list.
1) The inputs of the Property Grabber component are the links of the service list page and the service page received from the Web Service Extractor component. These links are "http://www.service-repository.com" and "http://www.service-repository.com/service/overview-210897616".
2) When the "GetProperty" function is triggered, it calls its four sub functions shown in figure 3-12.
Figure 3-13: Structured properties of the service "BLZService" in the service list page


Figure 3-14: Structured properties of the service "BLZService" in the service page
3) First, the "getStructuredProperty" function tries to extract the structured information displayed in the red boxes of figures 3-13 and 3-14. Several elements of the structured information have the same content, like the contents of the description shown in the service page and in the service list page. Hence, in order to save time during extraction and space during storing, elements with the same content are only extracted once. Moreover, a transformation of non-descriptive content into descriptive text is needed for the rating information, because its content consists of several star images. The final results of the extracted structured information for this Web service are shown in table 3-11. Because there is no descriptive information for the provider, the homepage, and the owner homepage, their values are assigned "NULL".
Service Name: BLZService
WSDL Link: http://www.thomas-bayer.com/axis2/services/BLZService?wsdl
WSDL Version: 0
Server: Apache-Coyote/1.1
Description: BLZService
Rating: Four Stars and a Half
Provider: NULL
Homepage: NULL
Owner Homepage: NULL
Table 3-11: Extracted Structured Information of the Web Service "BLZService"

4) Secondly, the "getEndpointProperty" function extracts the endpoint information displayed in the red box of figure 3-15. Though there is more than one endpoint record in the list, only one of them is extracted as the endpoint information. This is because this master program is intended to extract as much information as possible, but this information should not contain redundant entries. Therefore only one record is extracted as the endpoint information, even if there is more than one endpoint record. Table 3-12 shows the final results of the endpoint information of this Web service.
Figure 3-15: Endpoint information of the Web service "BLZService" in the service page
Endpoint Name: BLZServiceSOAP12port_http
Endpoint URL: http://www.thomas-bayer.com:80/axis2/services/BLZService
Endpoint Critical: True
Endpoint Type: production
Bound Endpoint: BLZServiceSOAP12Binding
Table 3-12: Extracted Endpoint Information of the Web service "BLZService"

5) Then the monitoring information is extracted by invoking the function "getMonitoringProperty". Figure 3-16 displays two types of monitoring properties: the red box above contains the monitoring information about the Web service, and the red box below lists the monitoring information for its endpoints. As already mentioned, only one endpoint statistics record is extracted. Besides, as can be seen from figure 3-16, there are two kinds of availability values. They all represent the availability of this Web service, just like the availability shown in figure 3-14. Therefore one of these availability values is sufficient. Table 3-13 shows the final results of this extraction process.
Figure 3-16: Monitoring Information of the Service "BLZService" in the service page
Service Availability: 100 %
Number of Downs: 0
Total Uptime: 1 day, 19 hours, 19 minutes
Total Downtime: 0 seconds
MTBF: 1 day, 19 hours, 19 minutes
MTTR: 0 seconds
RTT Max of Endpoint: 141 ms
RTT Min of Endpoint: 0 ms
RTT Average of Endpoint: 57.7 ms
Ping Count of Endpoint: 112
Table 3-13: Extracted Monitoring Information of the Web service "BLZService"

6) Next, in order to extract the whois properties, the "getWhoisProperty" function first has to gain the service domain from the WSDL link. For this Web service the gained service domain is "thomas-bayer.com". Then it sends this service domain as input to the whois client for the querying process, which returns a list of information about that service domain; see figure 3-17. Table 3-14 shows the extracted whois information.


Figure 3-17: Whois Information of the service domain "thomas-bayer.com"
Service Domain URL: thomas-bayer.com
Domain Name: Thomas Bayer
Domain Type: NULL
Domain Address: Moltkestr. 40
Domain Description: NULL
State: NULL
Postal Code: 54173
City: Bonn
Country: NULL
Country Code: DE
Phone: +4922855525760
Fax: NULL
Email: info@predic8.de
Organization: predic8 GmbH
Established Time: NULL
Table 3-14: Extracted Whois Information of the service domain "thomas-bayer.com"

7) Finally, all the information of these four aspects is collected together as the service properties, and these service properties are then forwarded into the Storage component.
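A minimal sketch of how the four property aspects could be collected into one object before being handed to the Storage component is shown below; the class and field names are hypothetical and chosen only for illustration, not taken from the actual implementation.

import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical container for the four property aspects of one Web service.
public class ServiceProperties {

    private final Map<String, String> structured = new LinkedHashMap<>();
    private final Map<String, String> endpoint   = new LinkedHashMap<>();
    private final Map<String, String> monitoring = new LinkedHashMap<>();
    private final Map<String, String> whois      = new LinkedHashMap<>();

    public Map<String, String> structured() { return structured; }
    public Map<String, String> endpoint()   { return endpoint; }
    public Map<String, String> monitoring() { return monitoring; }
    public Map<String, String> whois()      { return whois; }

    public static void main(String[] args) {
        // Example values taken from tables 3-11 and 3-12.
        ServiceProperties props = new ServiceProperties();
        props.structured().put("Service Name", "BLZService");
        props.structured().put("Rating", "Four Stars and a Half");
        props.endpoint().put("Endpoint Type", "production");
        System.out.println(props.structured());
    }
}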

324 The Function of Storage Component

The WSDL link from the WSDL Grabber component is used by the Storage component to download the WSDL document from the Web and store it on disk thereafter. In addition, the service properties from the Property Grabber component are also directly stored on disk in three different manners by this Storage component. Figure 3-18 illustrates the process flow of the Storage component.
When the Storage component receives the needed inputs, the mediator function "Storager" of the Storage component is triggered. It transforms the service properties into three different output formats and stores them on disk; these output formats are an XML file, an INI file, and database records. Besides, it also tries to download the WSDL document using the URL address of the WSDL link and then stores the obtained WSDL document on disk too. This "Storager" function is composed of four sub functions, namely the "getWSDL", "generateXML", "generateDatabase", and "generateINI" sub functions. Each sub function is in charge of one aspect of the storage tasks.


Figure 3-18 Overview of the process flow of the Storage Component

(1) "getWSDL" sub function

The task of the "getWSDL" sub function is to download the WSDL document and store it on disk. First of all it has to obtain the content of the WSDL document. This is done as follows. The "getWSDL" sub function checks whether the value of the received WSDL link equals "NULL". As already described in section 322, if a Web service has no WSDL link, the value of its WSDL link is assigned "NULL". In that case the sub function creates a WSDL document whose name is the service name appended with the mark "No WSDL Document"; obviously this document contains no content, it is an empty document. If the service does have a WSDL link, the sub function tries to connect to the Internet using the URL address of that WSDL link. Once it succeeds, the content hosted on the Web is downloaded, stored on disk and named only with the name of the service. Otherwise it creates a WSDL document whose name is prefixed with "Bad" before the service name.

(2) "generateXML" sub function

The "generateXML" sub function takes the service properties as input, transforms them into an XML file and stores it on disk under the name of the service plus the extension ".xml". XML stands for eXtensible Markup Language, a markup language designed to transport and store data. The first line of an XML file is the XML declaration, which defines the XML version and the encoding used. For example, the declaration <?xml version="1.0" encoding="UTF-8"?> states that this is an XML file whose version is 1.0 and whose encoding is the 8-bit Unicode Transformation Format character set. Besides the declaration, an XML file consists of XML elements, which span everything from an element's start tag to its end tag. An XML element can contain other elements, simple text, or a mixture of both. In addition, an XML file must contain a root element that is the parent of all other elements.
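To make the described structure concrete, a minimal, purely illustrative example of such a generated XML file could look as follows (the element names below root are hypothetical; the actual elements correspond to the extracted service properties shown later in figure 4-6):

    <?xml version="1.0" encoding="UTF-8"?>
    <!-- service properties extracted by the Deep Web Service Crawler -->
    <service>
        <Name>BLZService</Name>
        <Provider>thomas-bayer.com</Provider>
        <Availability>100%</Availability>
    </service>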

(3) "generateINI" sub function

The "generateINI" sub function also takes the service properties as input, but it transforms them into an INI file and stores it on disk under the name of the service plus the extension ".ini". "ini" stands for initialization; the INI file format is a de facto standard for configuration files. INI files are simple text files with a basic structure. Generally speaking, an INI file contains three different kinds of parts: sections, parameters and comments. The parameter is the basic element of an INI file. Its format is a key-value pair (also called a name-value pair), delimited by an equals sign "=", with the key or name always on the left of the equals sign. A section is like a container that groups its parameters together. It always appears on a single line, enclosed in a pair of square brackets "[]", and sections may not be nested. A comment is descriptive text that begins with a semicolon ";"; anything between the semicolon and the end of the line is ignored.
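A minimal, purely illustrative example of such a generated INI file could look as follows (the section name and the keys are hypothetical; the actual keys correspond to the extracted service properties shown later in figure 4-5):

    ; service properties extracted by the Deep Web Service Crawler
    ; registry: Service Repository
    [BLZService]
    Name=BLZService
    Provider=thomas-bayer.com
    Availability=100%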

(4) "generateDatabase" sub function

The inputs of the "generateDatabase" sub function are the same as for the previous two sub functions. Instead of transforming them into a file like XML or INI, the "generateDatabase" sub function turns them into database records by means of SQL statements. SQL stands for Structured Query Language, a database language designed for accessing and manipulating data in a database. Most actions performed on a database are done with SQL statements; the primary statements include insert into, delete, update, select, create, alter and drop. In order to transform the service properties into database records, this sub function first has to create a database using the "create database" statement. Then it has to create a table to store the data. A table is a collection of related data entries and consists of columns and rows. Since the amount of data for all these five Web Service Registries is not very large, one database table is sufficient for storing the service properties. For this to work, the field names of the service properties used as columns must be uniform and well-defined across all five Web Service Registries. Afterwards the service properties of each single service can be inserted into the table as one record with the "insert into" statement of SQL.
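As a sketch, the kind of SQL statements involved might look as follows; the database, table and column names here are hypothetical, and, as described above, every property column uses the Text type because the length of a property value is unknown:

    CREATE DATABASE servicecrawler;

    CREATE TABLE services (
        id INTEGER PRIMARY KEY,   -- increasing integer primary key
        Name TEXT,
        Availability TEXT
        -- ... one TEXT column per uniform service property name
    );

    INSERT INTO services (id, Name, Availability)
    VALUES (1, 'BLZService', '100%');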

3241 Features of the Storage Component

The Storage component has to provide the following features:

- Generate different output formats

The final result of this master program is to store the information about the services on disk for future work. This Storage component provides three different formats for storing the service properties of the services published in the Web Service Registries. This makes the storage of the services both flexible and long-lived.

- Obtain the WSDL document

The important task of this component is to check whether the WSDL document can be obtained from the WSDL link, because the WSDL document plays a decisive role in determining the quality of the service. This Storage component is able to deal with the different situations that can occur while obtaining the WSDL document.


3242 Input of the Storage Component

This component requires the following input data:

- WSDL link of each service
- Each service's property information

3243 Output of the Storage Component

The component will produce the following output data:

- WSDL document of the service
- XML document, INI file and tables in the database

3244 Demonstration for Storage Component

The following figures show the essential implementation code of this Storage component. The detailed description is given below.

1) As can be seen from figure 3-18 to figure 3-20, there are several common elements among these pieces of implementation code. The first common element concerns the parameters defined in each of these sub functions, namely "path" and "SecurityInt". The parameter "path" is the absolute path on the computer disk; it is used for the storing procedure and has already been specified by the user at the beginning of the whole program. The parameter "SecurityInt" is an increasing Integer which is used as a part of the name of the service; this prevents services that have the same name from overwriting each other on disk. The content of the red mark in the code of these figures is the second common element; its function is to create a file or document for the service with the corresponding parameters.

2) Figure 3-19 displays the implementation code of the "getWSDL" sub function, which is designed to get the WSDL document of a service based on its WSDL link. The parameter "name" is the name of the service, which will be used as the name of the WSDL document. The parameter "linkStr" is the actual WSDL link, which is the most important input of this sub function. The other two parameters, "statistic" and "log", are the objects of the text files called "Statistic Information" and "Log Information" respectively. The "Statistic Information" text file records the statistical data of the services in each Web Service Registry, such as the overall number of properties for that Web Service Registry, the overall number of services, the number of services that have no WSDL link, the number of services that contain no content, the number of services whose WSDL links are not available, and so on. The "Log Information" text file records the results of the process steps and the problems encountered, for example which service is currently being crawled, which Web Service Registry it belongs to, the reason why the WSDL document for a service could not be obtained, the reason why a service is unreachable, and so on.


Figure 3-19 Implementation code for getting the WSDL document
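Since figure 3-19 is a screenshot of the original code, the following minimal Java sketch only illustrates the "getWSDL" logic described above; it is not the original implementation, and the parameter set and file naming are simplified:

    import java.io.BufferedReader;
    import java.io.FileWriter;
    import java.io.InputStreamReader;
    import java.net.URL;

    public class WsdlDownloader {

        // path, securityInt, name and linkStr mirror the parameters described above
        public static void getWSDL(String path, int securityInt, String name, String linkStr) {
            String fileName = path + securityInt + name;
            try {
                if (linkStr == null || linkStr.equals("NULL")) {
                    // service without a WSDL link: create an empty marker document
                    new FileWriter(fileName + "[No WSDL Document].wsdl").close();
                    return;
                }
                // read the WSDL content from the Web via the URL address of the link
                URL url = new URL(linkStr);
                BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
                StringBuilder content = new StringBuilder();
                String line;
                while ((line = in.readLine()) != null) {
                    content.append(line).append("\n");
                }
                in.close();
                FileWriter out = new FileWriter(fileName + ".wsdl");
                out.write(content.toString());
                out.close();
            } catch (Exception e) {
                // WSDL link unreachable: create a document prefixed with "Bad"
                try {
                    new FileWriter(path + securityInt + "Bad" + name + ".wsdl").close();
                } catch (Exception ignored) { }
            }
        }
    }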

3) Figure 3-20 and figure 3-21 show the code for turning the service properties into the XML file and the INI file and then storing those two files on disk. The parameter "vec" is a Vector of the "PropertyStruct" data type. "PropertyStruct" is a class that consists of two variables, name and value.

Figure 3-20 Implementation code for generating the XML file


Figure 3-21 Implementation code for generating the INI file
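As figures 3-20 and 3-21 are likewise screenshots, the following minimal sketch shows how the INI generation step could be written; it is not the original code, the PropertyStruct class is simplified, and the section name used here is only an illustrative assumption:

    import java.io.FileWriter;
    import java.io.PrintWriter;
    import java.util.Vector;

    class PropertyStruct {          // simplified stand-in for the class described above
        String name;
        String value;
        PropertyStruct(String name, String value) { this.name = name; this.value = value; }
    }

    public class IniGenerator {
        public static void generateINI(String path, int securityInt, String serviceName,
                                       Vector<PropertyStruct> vec) throws Exception {
            PrintWriter out = new PrintWriter(new FileWriter(path + securityInt + serviceName + ".ini"));
            out.println("; service properties extracted by the Deep Web Service Crawler");
            out.println("[" + serviceName + "]");       // section grouping the parameters
            for (PropertyStruct p : vec) {
                out.println(p.name + "=" + p.value);    // one key-value pair per property
            }
            out.close();
        }
    }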

4) The code in figure 3-22 and figure 3-23 shows the process that turns the service properties of all services in these five Web Service Registries into records of the database. To do so, a database has to be created first. The name of the database can be arbitrary as long as it conforms to the naming rules of the database system; the same holds for the name of the table. Figure 3-22 displays the procedure of creating a table in the database. Because it is hard to decide on the length of each service property, the data types of all service property columns are set to "Text". Figure 3-23 is the code for inserting the service properties into the table with the "update" statement.

Figure 3-22 Implementation code for creating the table in the database


Figure 3-23 Implementation code for generating the table records
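Since figures 3-22 and 3-23 are also screenshots, a minimal JDBC sketch of the two steps (creating the table, then inserting one record) might look as follows; the connection URL, table name and column names are assumptions made only for illustration, and the record is written via executeUpdate, which corresponds to the "update" statement mentioned above:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.Statement;

    public class DatabaseWriter {
        public static void main(String[] args) throws Exception {
            Connection con = DriverManager.getConnection(
                    "jdbc:mysql://localhost/servicecrawler", "user", "password");

            // create the table once; every property column uses TEXT because lengths are unknown
            Statement st = con.createStatement();
            st.executeUpdate("CREATE TABLE IF NOT EXISTS services "
                    + "(id INT PRIMARY KEY, Name TEXT, Availability TEXT)");

            // insert the properties of one service as a single record
            PreparedStatement ps = con.prepareStatement(
                    "INSERT INTO services (id, Name, Availability) VALUES (?, ?, ?)");
            ps.setInt(1, 1);
            ps.setString(2, "BLZService");
            ps.setString(3, "100%");
            ps.executeUpdate();

            ps.close();
            st.close();
            con.close();
        }
    }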

33 Multithreaded Programming for DWSC

Multithreaded programming is a built-in feature of the Java language. A multithreaded program contains two or more separate parts that can execute concurrently; each such part is called a thread. The use of multithreading makes it possible to create programs that use the system resources efficiently, for example by making maximum use of the CPU, because the idle time of the CPU can be kept to a minimum.

In this master program there are five Web Service Registries whose services need to be crawled. Moreover, the number of services published in each Web Service Registry differs considerably, which means that the running time required for each Web Service Registry also differs. As a consequence, a Web Service Registry that owns fewer services would have to wait until another Web Service Registry with many more services finishes. Therefore, in order to reduce the waiting time for the other Web Service Registries and to maximize the use of the system resources, it is necessary to apply multithreaded programming to this master program. That is to say, this master program creates one thread for each Web Service Registry, and these threads are executed independently, as sketched below.
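A minimal sketch of this idea, assuming a hypothetical per-registry crawl routine named crawlRegistry (not the original method name), could look like this:

    public class CrawlerMain {
        public static void main(String[] args) {
            String[] registries = { "Service Repository", "Ebi", "Xmethods", "Seekda", "Biocatalogue" };
            for (final String registry : registries) {
                Thread t = new Thread(new Runnable() {
                    public void run() {
                        // crawl all services of this registry: extract links, grab WSDL,
                        // grab properties and store the results (see sections 321 to 324)
                        crawlRegistry(registry);
                    }
                });
                t.start();   // the threads run independently of each other
            }
        }

        static void crawlRegistry(String registry) {
            // placeholder for the per-registry crawling procedure
            System.out.println("Crawling " + registry);
        }
    }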

34 Sleep Time Configuration for Web Service Registries

Because this master program is intended to download the WSDL documents and extract the service information of the Web services published in the Web Service Registries, it inevitably affects the performance of those registries. In addition, in order not to exceed their throughput capacity, these Web Service Registries restrict the rate of access. Because of that, unknown errors sometimes occur while this master program is executing: for instance, the program may halt at one point without obtaining any further WSDL documents and service information, the WSDL documents of some services in some Web Service Registries cannot be obtained, or some service information is missing. Therefore, in order to obtain as many of the Web services published in these five Web Service Registries as possible without affecting their throughput, the access rate has to be configured for each service of all Web Service Registries.

Consequently, before going into the essential procedure for each single service of these Web Service Registries, this master program calls the built-in function "sleep(long milliseconds)". It is a public static function which causes the currently executing thread to sleep for the specified number of milliseconds; in other words, it temporarily ceases execution for a while. The following table shows the time interval used for the sleep function for each Web Service Registry.

Web Service Registry Name Time Interval (milliseconds)

Service Repository 8000

Ebi 3000

Xmethods 10000

Seekda 20000

Biocatalogue 10000

Table 3-15 Sleep Time of these five Web Service Registries
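As a minimal sketch, the configured interval could be applied as follows before each single service is processed; the method and variable names here are illustrative, not the original code:

    import java.util.List;

    public class RateLimiter {
        // pause before each service according to the registry's configured interval (table 3-15)
        static void crawlWithDelay(List<String> serviceLinks, long sleepMillis) {
            for (String link : serviceLinks) {
                try {
                    Thread.sleep(sleepMillis);           // e.g. 20000 ms for Seekda
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                // ... process the single service at 'link' (WSDL download, property extraction, storage)
            }
        }
    }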


4 Experimental Results and Analysis

This chapter presents the quantitative experimental results of the prototype described in chapter 3, together with an analysis and explanation of these results. In order to obtain reasonably accurate results, the experiments were carried out more than five times; all data displayed in the following tables and charts are the averages of these runs.

41 Statistic Information for Different Web Service Registries

This section discusses the amount of Web services published in these five Web Service Registries. It covers the overall number of Web services published in each Web Service Registry and the number of unavailable Web services, i.e. services that have been archived because they may no longer be active or are close to becoming inactive. Table 4-1 shows the service amount statistics of these five Web Service Registries.

                       Service Repository   Ebi   Xmethods   Seekda   Biocatalogue
Overall Services       57                   289   382        853      2567
Unavailable Services   0                    0     0          0        125

Table 4-1 Service amount statistic of these five Web Service Registries

In order to give an intuitive view of the service amount statistics in these five Web Service Registries, the bar chart in figure 4-1 presents the data of table 4-1. As the bar chart shows, on the one hand the overall number of Web services increases from the Service Repository Web Service Registry to the Biocatalogue Web Service Registry. The Biocatalogue Web Service Registry in particular owns the largest number of Web services, which indicates that it is far more capable of providing Web services to users, because it contains many more services than the other four Web Service Registries. On the other hand, there are no unavailable services in any of the Web Service Registries except for the Biocatalogue Web Service Registry. That is to say, only the Biocatalogue Web Service Registry contains Web services that cannot be used by the users. To some degree this is wasteful, because these services cannot be used anymore and they still consume network resources on the Web. Therefore all these unavailable services should be eliminated in order to reduce the waste of network resources.


Figure 4-1 Service amount statistic of these five Web Service Registries

42 Statistic Information for WSDL Document

                     Service Repository   Ebi   Xmethods   Seekda   Biocatalogue
Failed WSDL Links    1                    0     23         145      32
Without WSDL Links   0                    0     0          0        16
Empty Content        0                    0     2          0        2

Table 4-2 Statistic information for WSDL documents

Table 4-2 and figure 4-2 show the error statistics of the WSDL documents of the Web services in these five Web Service Registries. Three aspects are covered. The first is the number of "Failed WSDL Links" of the Web services in these Web Service Registries, that is the overall number of Web services whose WSDL links are invalid. In other words, it is impossible to get the WSDL documents of these Web services via the URL addresses of their WSDL links, and therefore no WSDL document is created for them. The second aspect is the number of Web services "Without WSDL Links", that is the overall number of Web services in each Web Service Registry that have no WSDL link at all. There is consequently no WSDL document for such Web services, and the value of their WSDL link is set to "NULL". A WSDL document is still created, but it has no content and its name contains the string "[No WSDL Document]". The third aspect, "Empty Content", is the overall number of Web services that do have WSDL links with valid URL addresses but whose WSDL documents contain no content. In this case a WSDL document whose name contains the string "(BAD)" is created.

Figure 4-2 Statistic information for WSDL documents

43 Comparison of Different Average Number of Service Properties

This section compares the average number of service properties in these five Web Service Registries. This average number of service properties is calculated by means of the following equation:

    ASP = ONSP / ONS        (1)

Where
ASP is the average number of service properties for one Web Service Registry,
ONSP is the overall number of service properties in one Web Service Registry,
ONS is the overall number of Web services that have already been crawled from the corresponding Web Service Registry.
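For example, under the purely hypothetical assumption that a registry has 100 crawled services with 1700 extracted service property values in total, the average would be ASP = 1700 / 100 = 17 service properties per service.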

Figure 4-3 shows the average number of service properties per Web service in these five Web Service Registries. As mentioned before, one measure of the quality of the Web services in a Web Service Registry is the amount of service information: the more information is available about a Web service, the better users can understand that service, and consequently the corresponding Web Service Registry can offer Web services of better quality to its users. As can be seen from figure 4-3, the Service Repository and Biocatalogue Web Service Registries own a larger number of service properties than the other three Web Service Registries. This directly reflects that these two Web Service Registries provide more detailed information about the Web services published in them, so users can more easily choose the services they need and are more likely to use the Web services published in these two registries. By contrast, the Xmethods and Seekda Web Service Registries, which provide less service information about their Web services, offer a lower quality for these Web services, so users may be less inclined to use the Web services they provide; this holds even more for the Web services published in the Ebi Web Service Registry.

Figure 4-3 Average Number of Service Properties

Following the description presented in section 323, the causes of the different numbers of service properties in these Web Service Registries may be the following. First, the amount of structured information differs between these five Web Service Registries, and part of the information for some Web services in a registry can be missing or have an empty value; for example, the amount of structured information that can be extracted for the Web services in the Ebi Web Service Registry differs greatly from that in the Biocatalogue Web Service Registry. Second, there is a certain amount of endpoint information in almost all Web Service Registries except the Ebi Web Service Registry, and its absence reduces the overall number of service properties. Third, some Web Service Registries, such as Xmethods and Ebi, have no monitoring information, whereas the Service Repository Web Service Registry in particular offers a large amount of monitoring information about its Web services that can be extracted from the Web. The last cause is the amount of Whois information for these Web services: if the database of the Whois client does not contain information about the service domain of a Web service, then no Whois information can be extracted, and even if information about the service domain exists, its amount can vary considerably. Therefore, if many service domains of the Web services in a registry have no or only little Whois information, the average number of service properties in that registry decreases greatly.

As a result, in order to help users become better acquainted with the Web services provided in a Web Service Registry and to distinguish these Web services, the Web Service Registry should do its best to offer as much information as possible for each Web service it publishes.

(Figure 4-3, bar chart: average number of service properties per registry - Service Repository 23, Ebi 7, Xmethods 17, Seekda 17, Biocatalogue 32.)

44 Different Outputs of Web Services

The fundamental task of this master program is to obtain the WSDL documents of the Web services hosted in the Web Service Registries, to extract and gather the properties of these Web services, and to store them on disk. This section therefore describes the different outputs of this master program, which comprise the WSDL documents of the Web services, the generated XML and INI files, and the records of these service properties in the database.

Figure 4-4 WSDL document format of one Web service

The WSDL document of a Web service is simply read from the Web via the URL address of its WSDL link, and the data is then stored on disk as the WSDL document. The name of the WSDL document is the service name plus the ending ".wsdl", which marks it as a WSDL document. In order to distinguish WSDL documents that would have the same name but different content, the name of each obtained WSDL document in one Web Service Registry additionally contains a unique Integer in front of the name. Figure 4-4 shows the format of a valid WSDL document of a Web service; its name is "1BLZService.wsdl".

The obtained service properties are furthermore transformed into an XML file, an INI file and data records in the database. Figure 4-5, figure 4-6 and figure 4-7 show these three different output formats respectively. As can be seen from figure 4-5, this is the INI file of the Web service, named "1BLZService.ini"; the Integer is the same as in the WSDL document, because both belong to the same Web service. The first three lines of that INI file are service comments, which run from the semicolon to the end of the line; they are the basic descriptive information of this INI file. The following line is the section, enclosed in a pair of brackets; it is important because it indicates that the lines behind it are the information of this Web service. The remaining lines are the actual service information, given as key-value pairs with an equals sign between key and value; each service property starts at the beginning of a line.


Figure 4-5 INI file format of one Web service

Figure 4-6 XML file format of one Web service

Figure 4-7 Database data format for all Web services


Furthermore, figure 4-6 shows the generated XML file of the Web service "BLZService", named "1BLZService.xml"; needless to say, this XML file belongs to the same Web service. Although the format of the XML file differs from that of the INI file, their essential contents are the same, that is, the values of the service properties do not differ, because both files are generated from the same collection of properties of the same Web service. The XML file also contains comments, similar to those in the INI file, which are placed between "<!--" and "-->", and the section of the INI file corresponds roughly to the root element of the XML file. Therefore all values of the elements below the root element "service" in this XML file are the values of the service properties of this Web service.

Finally, figure 4-7 shows the database table which is used to store the service information of all Web services in these five Web Service Registries. The entire service information of one Web service forms exactly one record in this table. Because of that, the column names of the table are the union of the names of the service information fields of each Web Service Registry; since column names must be unique, redundant names in this union are eliminated. This is possible because the names of the service information fields are well-defined and uniform for all five Web Service Registries. In addition, the first column of this table is the primary key, an increasing Integer whose function resembles that of the Integer contained in the names of the XML and INI files. The remaining columns of the table are the corresponding service properties of the Web services. The value "NULL" or "Null" in this table indicates that this property of that Web service is empty or missing.

45 Comparison of Average Time Cost for Different Parts of Single Web Service

This section describes the comparison of the average time cost of the different parts of obtaining one single Web service from each of these five Web Service Registries. First, the average time cost of getting one single service from a Web Service Registry has to be calculated. It is obtained through the following equation:

    ATC = OTS / ONS        (2)

Where
ATC is the average time cost for one single Web service,
OTS is the overall time cost of all the Web services in one Web Service Registry,
ONS is the overall number of Web services that have already been crawled from the corresponding Web Service Registry.

In addition, the average time cost of getting one single service is split into the following six parts: the average time cost for extracting the service properties, the average time cost for obtaining the WSDL document, the average time cost for generating the XML file, the average time cost for generating the INI file, the average time cost for inserting the service properties into the database table, and the average time cost for the remaining procedures, such as getting the service list page link, getting the service page link, and so on. The average time cost for extracting the service properties is obtained by means of the following equation:

    ATCSI = OTSSI / ONS        (3)

Where
ATCSI is the average time cost for extracting the service properties of one single Web service,
OTSSI is the overall time cost for extracting the service properties of all the Web services in one Web Service Registry,
ONS is the overall number of Web services that have already been crawled from the corresponding Web Service Registry.

The other parts are calculated analogously to the average time cost for extracting the service properties, while the average time cost for the remaining procedures equals the average time cost for one single Web service minus the sum of the average time costs of the other five parts.
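As a quick check of this last rule against table 4-3: for the Ebi registry, the "Others" value follows as 823 - (699 + 82 + 2 + 1 + 28) = 11 milliseconds.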

                     Service property   WSDL Document   XML File   INI File   Database   Others   Overall
Service Repository   8801               918             2          1          53         267      10042
Ebi                  699                82              2          1          28         11       823
Xmethods             5801               1168            2          1          45         12       7029
Seekda               5186               1013            2          1          41         23       6266
Biocatalogue         39533              762             2          1          66         1636     42000

Table 4-3 Average time cost information (in milliseconds) for all Web Service Registries

Table 4-3 displays the average time cost of one Web service and of its different parts for all these five Web Service Registries. The first column of table 4-3 contains the names of these five Web Service Registries, and the last column gives the average time cost for a single service in each Web Service Registry, while the remaining columns hold the average time cost of the six different parts. To give an intuitive view of the data in table 4-3, the data in each column of this table are illustrated by the corresponding figures 4-8 to 4-13.

Figure 4-8 Average time cost for extracting service properties in all Web Service Registries


As can be seen from figure 4-8, the average time cost for extracting the service properties in the Biocatalogue Web Service Registry is 39533 milliseconds. This value is much larger than in the other four Web Service Registries, which take 8801, 699, 5801 and 5186 milliseconds for Service Repository, Ebi, Xmethods and Seekda respectively. That is to say, it takes much longer to extract the service properties of the Web services published in the Biocatalogue Web Service Registry. This also indirectly indicates that the Biocatalogue Web Service Registry has the highest average number of service properties, as already discussed in section 43; conversely, the average number of service properties in the Ebi Web Service Registry is the smallest. Moreover, the average time cost in the Xmethods Web Service Registry is larger than in the Seekda Web Service Registry, even though, as shown above, the average number of service properties is the same for these two registries. One possible explanation is that the extraction of service properties in the Xmethods Web Service Registry has to go through both the service page and the service list page, whereas for the Seekda Web Service Registry only the service page link is needed.

The average time cost for obtaining the WSDL document in all these five Web Service Registries is displayed in figure 4-9. This average time cost is the sum of the average time for extracting the WSDL link of a Web service and the average time for reading the data of the WSDL document from the Web and storing it on disk. As seen from this figure, the average time cost for obtaining the WSDL document in the Xmethods Web Service Registry is the largest, namely 1168 milliseconds. Although the process of extracting the WSDL link costs a certain amount of time, it does not have a significant influence on the total average time spent obtaining the WSDL document, because the WSDL link of a Web service is usually obtained in a single step. This implies that the average size of the WSDL documents in the Xmethods Web Service Registry is larger than in the other four Web Service Registries, in particular the Ebi Web Service Registry, which takes only 82 milliseconds for obtaining the WSDL document.

Figure 4-9 Average time cost for obtaining the WSDL document in all Web Service Registries


Figure 4-10, figure 4-11 and figure 4-12 show the average time cost of generating the three different outputs in all these five Web Service Registries. As can be seen from figures 4-10 and 4-11, the average time for generating the XML file of one Web service is the same for all five Web Service Registries, only 2 milliseconds, and the average time for generating the INI file is likewise the same for all of them, just 1 millisecond. Even the sum of these two average time costs is so small that it can be neglected in comparison with the overall average time cost of getting one Web service for each Web Service Registry, shown in figure 4-13. This implies that the generation of the XML and INI files finishes almost immediately after receiving the service properties of a Web service as input. Furthermore, figure 4-12 shows that although the average time cost of creating the database record for each Web service is larger than the time for generating the XML and INI files in all five Web Service Registries, the operation of creating a database record is still fast.

Figure 4-10 Average time cost for generating the XML file in all Web Service Registries

Figure 4-11 Average time cost for generating the INI file in all Web Service Registries


Figure 4-12 Average time cost for creating the database record in all Web Service Registries

Figure 4-13 Average time cost for getting one Web service in all Web Service Registries

Figure 4-13 gives the average time cost for getting one single Web service in all these five Web Service Registries. Unsurprisingly, the Biocatalogue Web Service Registry takes the longest time for this process, because, as the presentation of the five different parts above shows, each part needs more time in the Biocatalogue Web Service Registry than elsewhere, with the exception of obtaining the WSDL document, for which the Biocatalogue Web Service Registry is not the slowest. Moreover, a striking observation when looking at figures 4-8, 4-12 and 4-13 is that the shapes of these curves follow almost the same trend. This further indicates that a Web Service Registry which requires more time to get the description information of one Web service also offers more information about that Web service.


5 Conclusion and Further Direction

This master thesis provides a schema that aims to explore the description information of the Web services hosted in different Web Service Registries. The description information of a Web service consists of its WSDL document and the service information about that Web service. Although there is an existing approach, the Pica-Pica Web Service Description Crawler, that can be used to obtain the WSDL documents and service information of the Web services in these Web Service Registries, its functionality is restricted: it explores only a small subset of the Web services hosted in these Web Service Registries, the storage of these Web services is not flexible, and, most importantly, only a small amount of service information is extracted for each Web service, while for some Web Service Registries no service information is extracted at all. The work presented in this master thesis, in contrast, is able to explore all Web services in these Web Service Registries. Moreover, with the approach presented in this master thesis, as much service information as possible is extracted for each Web service, so that the final result is the largest annotated service catalogue ever produced. Furthermore, regarding the storage of the description information of the Web services, this master thesis provides three different formats that guarantee not only the completeness but also the longevity of the description information of the Web services.

However, in the implementation performed in this master thesis, the Whois client used for querying the information of a service domain returns free text if the information exists, and sometimes this free text differs completely from case to case. As a consequence, each Web service in all Web Service Registries had to be crawled at least once during the experiment stage so that all variants of this free text could be foreseen and processed afterwards. This is a huge amount of work, because there are many Web services in these Web Service Registries. Therefore, in order to simplify this work, another Whois client that eases this task needs to be found and used.

Moreover, in the experiment stage of this master thesis, the time cost for getting a Web service is still large. In order to reduce this time, multithreaded programming could also be applied to individual parts of the process of getting one Web service.

Although the work performed here is specialized for only these five Web Service Registries, the main principles used here are adaptable to other Web Service Registries with only small changes in the implementation code or the structure.


6 Bibliography

[1] Nathalie Steinmetz, Holger Lausen, Dario Cerizza, Andrea Turati, Irene Celino, Adam Funk and Michael Erdmann, "D1.1 - Requirement Analysis and Architectural Plan", available from http://www.service-finder.eu/deliverables/public/7-public/32-d11, Emanuele Della Valle (CEFRIEL), June 27, 2008

[2] Nathalie Steinmetz, Holger Lausen, Dario Cerizza, Andrea Turati, Irene Celino, Adam Funk and Michael Erdmann, "D1.2 - First Design of Service-Finder as a Whole", available from http://www.service-finder.eu/deliverables/public/7-public/37-d12-first-design-of-service-finder-as-a-whole, Emanuele Della Valle (CEFRIEL), July 1, 2008

[3] Nathalie Steinmetz, Holger Lausen, Irene Celino, Dario Cerizza, Saartje Brockmans, Adam Funk, "D1.3 - Revised Requirement Analysis and Architectural Plan", available from http://www.service-finder.eu/deliverables/public/7-public/57-d13-revised-requirement-analysis-and-architectural-plan, Emanuele Della Valle (CEFRIEL), April 1, 2009

[4] Chia-Hui Chang, Mohammed Kayed, Moheb Ramzy Girgis, Khaled Shaalan, "A Survey of Web Information Extraction Systems", Volume 18, Issue 10, IEEE Computer Society, pp. 1411-1428, October 2006

[5] Leonard Richardson, "Beautiful Soup Documentation", available from http://www.crummy.com/software/BeautifulSoup/documentation.html, October 13, 2008

[6] Hao He, Hugo Haas, David Orchard, "Web Services Architecture Usage Scenarios", available from http://www.w3.org/TR/ws-arch-scenarios/, February 11, 2004

[7] Stephen Soderland, "Learning Information Extraction Rules for Semi-Structured and Free Text", Volume 34, Issue 1-3, Journal of Machine Learning, Department of Computer Science and Engineering, University of Washington, Seattle, pp. 233-272, February 1999

[8] Ian Hickson, "A Vocabulary and Associated APIs for HTML and XHTML", World Wide Web Consortium, Working Draft WD-html5-20100624, January 22, 2008

[9] Holger Lausen, Jos de Bruijn, Axel Polleres, Dieter Fensel, "The Web Service Modeling Language WSML", WSML Deliverable D16.1v0.2, March 20, 2005, available from http://www.wsmo.org/TR/d16/d16.1/v0.2/

[10] Dumitru Roman, Holger Lausen, Uwe Keller, "Web Service Modeling Ontology - Standard (WSMO-Standard)", WSMO deliverable D2 version 1.1, 06 March 2004, available from http://www.wsmo.org/TR/d2/v1.1/

[11] Iris Braum, Anja Strunk, Gergana Stoyanova, Bastian Buder, "ConQo - A Context- and QoS-Aware Service Discovery", TU Dresden, Department of Computer Science, in Proceedings of WWW/Internet 2008


7 Appendixes

There are two additional outputs of this master program, namely the log information file and the statistic report file. Figure 8-1 shows one of the basic output formats of the log information for these five Web Service Registries.

Figure 8-1 Log information of the "Service Repository" Web Service Registry

Figure 8-2 Statistic information of the "Service Repository" Web Service Registry


Figure 8-2 to figure 8-6 show the output format of the statistic information for the "Service Repository", "Ebi", "Xmethods", "Seekda" and "Biocatalogue" Web Service Registries respectively.

Figure 8-3 Statistic information of the "Ebi" Web Service Registry

Figure 8-4 Statistic information of the "Xmethods" Web Service Registry

Figure 8-5 Statistic information of the "Seekda" Web Service Registry

Figure 8-6 Statistic information of the "Biocatalogue" Web Service Registry


Table of Figures

Figure 2-1 Dataflow of Service-Finder and Its Components ..... 12
Figure 2-2 Left is the free text input type and right is its output ..... 16
Figure 2-3 A Semi-structured page containing data records (in rectangular box) to be extracted ..... 16
Figure 2-4 Architecture of the Pica-Pica Web Service Description Crawler ..... 20
Figure 3-1 Overview of the Basic Architecture for the Deep Web Services Crawler ..... 25
Figure 3-2 Overview the process flow of the Web Service Extractor Component ..... 27
Figure 3-3 Service list page of the Service-Repository ..... 29
Figure 3-4 Original source code of the internal link for Web service "BLZService" ..... 29
Figure 3-5 Code Overview of getting service page link in Service Repository ..... 29
Figure 3-6 Service page of the Web service "BLZService" ..... 29
Figure 3-7 Overview the process flow of the WSDL Grabber Component ..... 30
Figure 3-8 WSDL link of the Web service "BLZService" in the service page ..... 31
Figure 3-9 Original source code of the WSDL link for Web service "BLZService" ..... 32
Figure 3-10 Code overview of "getServiceRepositoryWSDLLink" function ..... 32
Figure 3-11 Code overview of "oneParameter" function ..... 32
Figure 3-12 Overview the process flow of the Property Grabber Component ..... 33
Figure 3-13 Structure properties of the Service "BLZService" in service list page ..... 37
Figure 3-14 Structure properties of the Service "BLZService" in service page ..... 38
Figure 3-15 Endpoint information of the Web service "BLZService" in service page ..... 38
Figure 3-16 Monitoring Information of the Service "BLZService" in service page ..... 39
Figure 3-17 Whois Information of the service domain "thomas-bayer.com" ..... 40
Figure 3-18 Overview of the process flow of the Storage Component ..... 41
Figure 3-19 Implementation code for getting WSDL document ..... 44
Figure 3-20 Implementation code for generating XML file ..... 44
Figure 3-21 Implementation code for generating INI file ..... 45
Figure 3-22 Implementation code for creating table in database ..... 45
Figure 3-23 Implementation code for generating table records ..... 46
Figure 4-1 Service amount statistic of these five Web Service Registries ..... 49
Figure 4-2 Statistic information for WSDL Document ..... 50
Figure 4-3 Average Number of Service Properties ..... 51
Figure 4-4 WSDL Document format of one Web service ..... 52
Figure 4-5 INI File format of one Web service ..... 53
Figure 4-6 XML File format of one Web service ..... 53
Figure 4-7 Database data format for all Web services ..... 53
Figure 4-8 Average time cost for extracting service property in all Web Service Registries ..... 55
Figure 4-9 Average time cost for obtaining WSDL document in all Web Service Registries ..... 56
Figure 4-10 Average time cost for generating XML file in all Web Service Registries ..... 57
Figure 4-11 Average time cost for generating INI file in all Web Service Registries ..... 57
Figure 4-12 Average time cost for creating database record in all Web Service Registries ..... 58
Figure 4-13 Average time cost for getting one Web service in all Web Service Registries ..... 58


Table of Tables

Table 3-1 Structured Information of Service-Repository Web Service Registry ..... 34
Table 3-2 Structured Information of Xmethods Web Service Registry ..... 34
Table 3-3 Structured Information of Seekda Web Service Registry ..... 34
Table 3-4 Structured Information of Ebi Web Service Registry ..... 34
Table 3-5 Structured Information of Biocatalogue Web Service Registry ..... 34
Table 3-6 SOAP Operation Information of Biocatalogue Web Service Registry ..... 35
Table 3-7 REST Operation Information of Biocatalogue Web Service Registry ..... 35
Table 3-8 Endpoint Information of these five Web Service Registries ..... 35
Table 3-9 Monitoring Information of these five Web Service Registries ..... 35
Table 3-10 Whois Information for these five Web Service Registries ..... 36
Table 3-11 Extracted Structured Information of Web Service "BLZService" ..... 38
Table 3-12 Extracted Endpoint Information of the Web service "BLZService" ..... 39
Table 3-13 Extracted Monitoring Information of the Web service "BLZService" ..... 39
Table 3-14 Extracted Whois Information of service domain "thomas-bayer.com" ..... 40
Table 3-15 Sleep Time of these five Web Service Registries ..... 47
Table 4-1 Service amount statistic of these five Web Service Registries ..... 48
Table 4-2 Statistic information for WSDL Document ..... 49
Table 4-3 Average time cost information for all Web Service Registries ..... 55


Table of Abbreviations

DTD Document Type Definition

DWSC Deep Web Service Crawler

HTML HyperText Markup Language

MTBF Mean Time between Failures

MTTR Mean Time to Recovery

QoS Quality of Service

REST Representational State Transfer

RTT Round Trip Time

SOAP Simple Object Access Protocol

SQL Structured Query Language

URL Uniform Resource Locator

WHATWG Web Hypertext Application Technology Working Group

XML eXtensible Markup Language

WSDL Web Service Description Language

WSML Web Service Modeling Language

WSMO Web Service Modeling Ontology

Page 11: Deep Web Service Crawler

Deep Web Service Crawler

11

information in this use case are derived from [1] In this use case there has a system administrator

whose name is ldquoSam Adamsrdquo He works for the bank His job is to keep the online payment facilities

online and working all day and night Therefore if there is any system failures Sam Adams should fix

the problems as early as he can Thatrsquos why he wants to use an SMS Messaging service which will

alert him immediately by sending him a SMS Message in the case of a system failure

n Description

This use case is dedicated to describe the system administrator ldquoSam Adamsrdquo who is looking for an

SMS Messaging Service that he wants to build it into his application

n Actors Roles and Goals

The name of the Actor is ldquoSam Adamsrdquo His Role is a system administrator at a bank The goals of him

are the immediate service delivery the reliability of the service and low base fee and transaction fee

n Storyboard

Step 1 Because of Sam Adams knows the Service-Finder portal and he also knows that he can find

many useful services from it especially he know what he is looking for Hence he visits the

Service-Finder portal and starts to search by entering the keywords ldquoSMS Alertrdquo

Requirement 1 Search functionality

Step 2 Now the Service-Finder returns a list of matching services However Sam wants to choose the

number of matching services that will be displayed on one page And he would also expect there has

short information about the service functionality the service provider and the service availability So

that he could decide which service he will choose to read further

Requirement 2 Enable configurable pagination of the matching results and have some short

information for each service

Step 3 When Sam looks through the short information about the services that displayed on the first

page he expects to find the most relevant services that related to his request After that he would

like to read more detailed information about that service to see whether this service can provide the

needed functionality

Requirement 3 Rank the returned matching services and must provide ability to read more details of

a service

Step 4 In the case that all the returned matching services Sam got provide quite different

functionalities or they belong to different service categories for example the SMS messaging services

alert users not through SMS but voice messaging For this reason Sam would like to see other

different categories that may be contain the services he wants Or the services of other categories

which he is also interested in (like ldquoSMS Messagingrdquo) Besides another possible way is that Sam can

further filter his search in terms of browsing through categories

Requirement 4 Service categories and allow the user to look all services that belonged to that specific

category If possible it should also allow the user to browse through categories

Step 5 When Sam got all the services that could provide a SMS messaging service via the methods

described in the Step 4 at present he wants to look for the services that offered by an Austrian

provider and have no base fees if possible

Requirement 5 Faceted search

Deep Web Service Crawler

12

Step 6 After Sam got all these specific services now he would like to choose the services that can

provide a high reliability

Requirement 6 Sort functionality based on usersrsquo chooses

Step 7 For now Sam expects to compare the service availability between the promised to the service

provider and the actually provided This should be contained in the servicesrsquo details And there needs

also have service coverage information so that Sam can know whether this service covers the areas

he lives and works Moreover Sam would also like to compare these services in other way For

instance put some services into a structured table to compare the transaction fees

Requirement 7 A side-by-side comparison table for services and a functionality that enable users to

select services he wants to compare

Step 8 At last Sam wants to know whether the service providers offer a free try out of the services

So that he can test the service functionality

Requirement 8 If possible display a note that offering free service trials

212 Architecture Plan for the Service-Finder Project

The architecture plan of the Service-Finder project contains five basic components: Service Crawler, Automatic Annotator, Conceptual Indexer and Matcher, Cluster Engine and Service-Finder Portal Interface. Figure 2-1 presents a high-level overview of the components and the data flow among them.

Figure 2-1: Dataflow of Service-Finder and Its Components [3]


2121 The Principle of the Service Crawler Component

The Service Crawler is responsible for gathering available services and their related information from the Web. The overall cycle is depicted as follows:

(1) A Web developer publishes a Web Service.
(2) The Crawling component then begins to harvest the Web in order to identify Web Service descriptions such as WSDL (Web Service Description Language) documents.
(3) Once a service is discovered, the Crawler also searches for other related information.
(4) After each periodic interval, the Crawler produces a consistent snapshot of the relevant part of the Web.

Finally, the output of the crawler is forwarded to the subsequent components for analyzing, indexing and displaying.

2122 The Principle of the Automatic Annotator Component

The Automatic Annotator receives the relevant data from the previous component and generates semantic service descriptions for the WSDL documents and their related information, based on the Service-Finder Ontology and the Service Category Ontology.

First, the two compatible ontologies that are used throughout the whole process are briefly introduced [2]:

- Generic Service Ontology: an ontology used to describe the data objects, for example the services, the service providers, availability, payment modalities and so on.
- Service Category Ontology: an ontology used to categorize the functionalities or application areas of the services, for instance data verification, messaging, data storage, weather, etc.

The function of this component, together with its input and output, is described below.

- Input
  - Crawled data from the Service Crawler
  - Service-Finder Ontologies
  - Feedback on or corrections of previous annotations
- Function
  - Enrich the information about the service and extract semantic statements according to the Service-Finder Ontologies, for example categorize the service according to the Service Category Ontology
  - Determine whether a particular document is relevant or not through the Web link graph; if not, discard these irrelevant documents
  - Classify the pages into their genres, for instance pricing, user comments, FAQ and so on
- Output
  - Semantic annotation of the services


2123 The Principle of the Conceptual Indexer and Matcher Component

The Conceptual Indexer and Matcher is essentially a data store that aims at storing all extracted information about the services and giving users the capability of retrieval and semantic querying, for example the matchmaking between user requests and service offers, or retrieving user feedback on extracted annotations.

The function of this component and its input and output are listed below.

- Input
  - Semantic annotation data and full-text information obtained from the Automatic Annotation component
  - Semantic annotation data and full-text information that come from the user interfaces
  - Cluster data from the user and service clustering component
- Function
  - Store the semantic annotations received from the Automatic Annotation component and from the user interface
  - Store the cluster data procured through the clustering component
  - Store and index the textual descriptions offered by the Automatic Annotation component and the textual comments offered by users
  - Ontological querying of the semantic data in the data store
  - Combined keyword and ontological querying used for user queries
  - Provide a list of similar services for a given service
- Output
  - A list of matching services for a user query; these services should be sorted by ranking and it should be possible to iterate over them
  - All available data related to a particular entity must be retrievable at the user interface

2124 The Principle of the Service-Finder Portal Interface Component

The Service-Finder Portal Interface is the main entry point provided for users of the Service-Finder system to search and browse the data managed by the Conceptual Indexer and Matcher component. In addition, users can contribute information by providing tags, comments, categorizations and ratings for the data they browse. Furthermore, developers can directly invoke the Service-Finder functionalities from their custom applications via an API. The details of this component's function, input and output are given below.

- Input
  - A list of ordered services for a query
  - Detailed information about a service or a set of services and a service provider
  - Query access to the service category ontology and the most used tags provided by the users
  - Service availability information
- Function
  - The Web interface allows the users to search services by keyword, tag or concept in the categorization, sort and filter query results by refining the query, compare and bookmark services, and try out the services that offer this functionality
  - The API allows the developers to invoke Service-Finder functionalities
- Output
  - Explicit user annotations such as tags, ratings, comments, descriptions and so on
  - Implicit user data, for example click streams of users, bookmarks, comparisons, links sent, etc.
  - Manual advertising of available new services

2125 The Principle of the Cluster Engine Component

The Cluster Engine is in charge of collecting and analyzing information about user behavior from the Service-Finder Portal, e.g. which services users query and which services they compare. Moreover, it provides cluster data to the Conceptual Indexer and Matcher for generating service recommendations.

This component's function, input and output are as follows.

- Input
  - Service annotation data, both extracted and from user feedback
  - Users' click streams, used for extracting user behavior
- Function
  - Obtain user clusters from user behavior
  - Obtain service clusters from service annotation data in order to find similar services
- Output
  - Clusters of users and services

22 Information Extraction

Due to the rapid development and use of the World-Wide Web, a huge number of information sources have appeared on the Internet, but access to them by browsing and searching is limited because of the heterogeneity and the lack of structure of Web information sources. Therefore Information Extraction, which transforms Web pages into program-friendly structures for post-processing, has become a great necessity. The task of Information Extraction is specified in terms of its inputs and its extraction targets, and the technique used in the process of Information Extraction is called an extractor.

221 Input Types of Information Extraction

Generally speaking, there are three different input types. The first input type is the unstructured document, for example the free text shown in figure 2-2; it is written in natural language and therefore requires substantial natural language processing. The second input type is the structured document, for instance XML documents, whose data can be described through an available DTD (Document Type Definition) or XML (eXtensible Markup Language) schema. The third input type is the semi-structured document, which is widespread on the Web, such as the large volume of HTML pages containing tables, itemized lists and enumerated lists; HTML tags are often used to render the data embedded in these pages (see figure 2-3).

Figure 2-2: Left is the free text input type and right is its output [4]

Figure 2-3: A semi-structured page containing data records (in rectangular box) to be extracted [4]

In this way, inputs of the semi-structured type can be seen as documents with a fairly regular structure, and the data in these documents can be displayed in an HTML or a non-HTML format. Since the Web pages of the Deep Web are dynamic and generated from structured databases using templates or layouts, the Deep Web can be considered one of the input sources providing such semi-structured documents. For example, the authors, price and comments on the book pages provided by Amazon all have the same layout, because these Web pages are generated from the same database and use the same template. Semi-structured HTML pages can also be generated manually: for example, although the publication lists on different researchers' homepages are produced by different people, they all have a title and a source property for every single paper. Finally, the inputs for some Information Extraction tasks can also be pages of the same class or pages from various Web Service Registries.

222 Extraction Targets of Information Extraction

The task of Information Extraction also has to consider the extraction target, and there are two different kinds of extraction targets. The first one is the relation of k-tuples, where k is the number of attributes in a record; in some cases an attribute of a record may have no instantiation, while in others an attribute may have multiple instantiations. The second extraction target is the complex object with hierarchically organized data. Although the ways of depicting the extraction targets in a page are diverse, the most common structure is the hierarchical tree. Such a tree may contain only one leaf node, or one or more lists of leaf nodes grouped under internal nodes. The structure of a data object may thus be flat or nested: if the structure is flat, there is only one leaf node, which can also be called the root; if it is a nested structure, the internal nodes involved in the data object span more than two levels.
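To make the difference between a flat and a nested structure concrete, consider a hypothetical book record (the field names and values are chosen only for illustration). A flat target is a single tuple such as (title, price), whereas a nested target groups repeated sub-records under internal nodes:

    book
     |-- title: "Example Title"
     |-- price: 19.99
     |-- authors
          |-- author: "A. Smith"
          |-- author: "B. Jones"

Here "authors" is an internal node with several leaf nodes below it, so the data object is nested rather than flat.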

Furthermore, in order to make Web pages readable for human beings and easier to visualize, the tables, the tuples of the same list, and the elements of a tuple should be clearly isolated or demarcated. However, the display of a data object in a Web page is affected by the following conditions [4]:

- The attribute of a data object has zero or several values
(1) If there is no value for an attribute of a data object, this attribute is called a "none" attribute. For example, a special offer that is only available for certain books might be a "none" attribute.
(2) If there is more than one value for an attribute of a data object, it is called a "multiValue" attribute. For instance, the author name of a book could be a "multiValue" attribute.

- The set of attributes (A1, A2, A3, ...) has multiple orderings
That is to say, within this set of attributes the position of an attribute might change across the different instances of a data object; such an attribute is called a "multiOrdering" attribute. For instance, for movies before the year 1999 a movie site might list the release date in front of the movie's title, while for movies from 1999 onwards it lists the release date behind the title.

- The attribute has different formats
This means the display format of the data object can differ completely between instances. If the format of an attribute is free, a lot of rules are needed to deal with all possible cases; this kind of attribute is called a "multiFormat" attribute. For example, an e-commerce Web site might use a bold font to present regular prices while using a red color to display sale prices. There is also the opposite situation, where different attributes of a data object have the same format, for example when various attributes are presented using <TD> tags in a table; such attributes can be differentiated by means of their order. However, when "none" attributes or "multiOrdering" attributes occur, the rules for extracting these attributes have to be revised.

- The attribute cannot be decomposed
For easier processing, the input documents are sometimes treated as strings of tokens instead of strings of characters. However, some attributes cannot be decomposed into several individual tokens; these are called "untokenized" attributes. Examples are college course codes like "COMP4016" or "GEOL2001", where the department code and the course number cannot be separated into two different strings such as "COMP" and "4016" or "GEOL" and "2001".

223 The Used Techniques in Information Extraction

The extractor used in the process of Information Extraction aims at providing a single uniform query interface to access information sources such as database servers and Web servers. It consists of the following phases: collecting returned Web pages, labeling these Web pages, generalizing extraction rules, extracting the relevant data and outputting the result in an appropriate format (XML or a relational database) for further information integration. For example, the extractor first queries the Web server to gather the returned pages via the HTTP protocol; after that it starts to extract the contents of these HTML documents and then integrates them with other data sources. The whole process of the extractor follows the steps below.

- Step 1
First, the input has to be tokenized. There are two different granularities for tokenizing the input string: tag-level encoding and word-level encoding. Tag-level encoding transforms the tags of an HTML page into general tokens and transforms each text string between two tags into a special token, whereas word-level encoding treats each word in a document as a token.

- Step 2
Next, the extraction rules are applied for every attribute of the data object in the Web pages. These extraction rules can be induced by top-down or bottom-up generalization, pattern mining or logic programming. The type of extraction rule may be expressed by means of regular grammars or logic rules; for example, some use path expressions over the HTML parse tree such as html.head.title or html->table[0], some use syntactic or semantic constraints, and some use delimiter-based constraints such as HTML tags or literal words.

- Step 3
After that, all the extracted data are assembled into records.

- Step 4
Finally, this process is iterated until all the data objects in the input have been processed.
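As a small illustration of Step 2, the following Java sketch applies a delimiter-based extraction rule with a regular expression. The HTML fragment, the tag names and the price format are assumptions made only for this example and do not come from a real extractor:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class DelimiterRuleExample {
        public static void main(String[] args) {
            // A fragment of a semi-structured page (hypothetical content).
            String html = "<tr><td>Deep Web Crawling</td><td><b>19.99 EUR</b></td></tr>";

            // Delimiter-based rule: the price is whatever stands between <b> and </b>.
            Pattern priceRule = Pattern.compile("<b>(.*?)</b>");
            Matcher m = priceRule.matcher(html);
            if (m.find()) {
                System.out.println("Extracted price: " + m.group(1)); // prints "19.99 EUR"
            }
        }
    }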


23 Pica-Pica Web Service Description Crawler

Pica-Pica is known as a bird species, also called magpie. Here, however, Pica-Pica is a Web Service Description Crawler which is designed to address the quality of Web Services, for example to evaluate the descriptive quality of the Web Services on offer and how well these Web Services are described in today's Web Service Registries.

231 Needed Libraries of the Pica-Pica Web Service Description Crawler

This version of the Pica-Pica Web Service Description Crawler was written by Anton Caceres and Josef Spillner and is programmed in the Python language. In order to run these scripts to parse the HTML pages, two additional libraries are needed: Beautiful Soup and Html5lib.

- Beautiful Soup
It is an HTML/XML parser for the Python language, and it can even turn invalid markup into a parse tree [5]. The following three features make it particularly useful:
  - Bad markup does not choke Beautiful Soup; it generates a parse tree that makes approximately as much sense as the original document, so you can still obtain the data you want.
  - Beautiful Soup provides a toolkit with simple, idiomatic methods for navigating, searching and modifying the parse tree, so you do not need to write a custom parser for every application.
  - If the document already specifies an encoding, you do not need to care about encodings, since Beautiful Soup converts incoming documents to Unicode and outgoing documents to UTF-8 automatically; otherwise you only have to specify the encoding of the original document.
The ways of including Beautiful Soup into an application are the following [5]:
  - from BeautifulSoup import BeautifulSoup          # For processing HTML
  - from BeautifulSoup import BeautifulStoneSoup     # For processing XML
  - import BeautifulSoup                             # To get everything

- Html5lib
It is a Python package which implements the HTML5 [8] parsing algorithm. In order to gain maximum compatibility with the current major desktop web browsers, this implementation is based on the WHATWG (Web Hypertext Application Technology Working Group) HTML5 specification.


232 Architecture of the Pica-Pica Web Service Description Crawler

Figure 2-4: Architecture of the Pica-Pica Web Service Description Crawler

Figure 2-4 illustrates the primary architecture of the Pica-Pica Web Service Description Crawler. It includes four fundamental components: the Service Page Grabber component, the WSDL Grabber component, the Property Grabber component and the WSML Register component.

(1) The Service Page Grabber component takes the URL seed as its input and outputs the link of the service page to the following two components: the WSDL Grabber component and the Property Grabber component.
(2) The WSDL Grabber component is responsible for obtaining the WSDL document based on the delivered service page link and then checking whether the obtained WSDL document is valid. Only the valid WSDL documents are passed on to the WSML Register component for further processing.
(3) The Property Grabber component tries to extract the service's properties hosted in the service page, if there are any. Afterwards, all these service properties are saved into an INI file as the information about that service.
(4) The functionality of the WSML Register component is to write an appropriate WSML document by means of the valid WSDL documents delivered from the WSDL Grabber component and the optional INI files delivered from the Property Grabber component, and afterwards to register them in Conqo.

- WSML [9]
It stands for Web Service Modeling Language, which provides a framework with different language variants. Hence it is often used to describe the different aspects of semantic Web Services according to the conceptual model of WSMO.
- WSMO [10]
WSMO, whose full name is Web Service Modeling Ontology, is dedicated to describing various aspects of Semantic Web Services based on ontologies. An ontology is a formal, explicit specification of a shared conceptualization. These ontologies are the sticking points that link the agreement of the communities of users with the defined conceptual semantics of the real world.
- Conqo [11]
It is a discovery framework that considers not only the Quality of Service (QoS) but also context information. It uses a Web Service Repository to manage service descriptions based on WSML.

233 Implementation of the Pica-Pica Web Service Description Crawler

This section describes the implementation of the Pica-Pica Web Service Description Crawler in detail.

(1) First, to start the whole crawling process, the Pica-Pica Web Service Description Crawler needs an input as the initial seed. For this crawler there are five Web Service Registries, listed below; their URL addresses are used as the input seeds. Moreover, in this version of the Pica-Pica Web Service Description Crawler there is a separate Python script for each Web Service Registry, and the crawling scripts for these Web Service Registries are executed one after another.

Biocatalogue: http://www.biocatalogue.com
Ebi: http://www.ebi.ac.uk
Seekda: http://www.seekda.com
Service-Repository: http://www.service-repository.com
Xmethods: http://www.xmethods.net

(2) After being fed with the input seed, the process steps into the next component, the Service Page Grabber. This component first reads the data from the Web based on the input seed and then builds a parse tree of the read data using the functions of the Beautiful Soup library. After that, the Service Page Grabber component looks for the service page link of each service published in the Web Service Registry by means of the functions of the Html5lib library. When the service page link of a single service is found, it first checks whether this service page link is valid; once it is, the link is passed to the following two components for further processing, the WSDL Grabber component and the Property Grabber component.

(3) When the WSDL Grabber component receives a service page link from the previous component, it sets out to extract the WSDL link address of that service from the parse tree of the data in this service page. Next, this component downloads the WSDL document of that service from the WSDL link address, and the obtained WSDL document is stored on disk. The WSDL Grabber component keeps running until no more service links are passed to it. Of course, not all grabbed WSDL documents are usable: they may contain bad definitions or a bad namespaceURI, be empty documents or, even worse, not be in XML format at all. Hence, in order to pick these out, the component further analyzes the obtained WSDL documents; all valid documents are put into a "validWSDLs" folder, whereas the invalid documents are put into a folder named "invalidWSDLs" in order to gather statistical information. Finally, only the WSDL documents in the "validWSDLs" folder are passed to the subsequent component.

(4) Moreover, since some Web Service Registries give additional information about the services, such as availability, service provider or version, the Property Grabber component sets out to extract this information as the service's properties and thereafter saves these properties into an INI file. However, if no additional information is available, there is no need to extract service properties and thus no INI file is created for that service. Note that in this implementation of the Pica-Pica Web Service Description Crawler only the Python scripts for the Seekda and Service-Repository Web Service Registries have functions to extract the services' properties, while the other three Web Service Registries have no such function.

(5) Furthermore, it is optional to create a report file which contains statistical information about this process, such as the total number of services for one Web Service Registry, the number of services whose WSDL document is invalid, etc.

(6) As stated above, at this point there is a folder with all valid WSDL documents and possibly also some INI files. The task of the WSML Register component is therefore to generate the appropriate WSML documents from these valid WSDL documents and INI files, and then to register them in Conqo.

24 Conclusions of the Existing Strategies

This chapter has presented three aspects of the existing strategies: the Service-Finder project, the Information Extraction technique and the Pica-Pica Web Service Description Crawler.

The task of this master program is to obtain the available Web services and their related information from the Web. This is essentially a procedure of extracting the information needed for each service, such as the service's WSDL document and its properties. Therefore the Information Extraction technique, which is used to extract information hosted on the Web, can be used in this master program.

Moreover, Service-Finder is a large project that is not only able to obtain the available Web services and their related information from the Web, but also enriches the crawled data with annotations by means of the Service-Finder ontology and the Service Category ontology and integrates all the information into a coherent semantic model. Furthermore, it provides capabilities for searching and browsing the data via a user interface and gives users service recommendations. However, the Service-Finder project far exceeds the requirements of a master program; therefore it is only considered as a reference for this master program.

Finally, since the Pica-Pica Web Service Description Crawler aims only at obtaining the available Web services and their related information, it fulfills the primary task of this master program. Nevertheless, regarding the information about a service, the Pica-Pica Web Service Description Crawler extracts only a few properties, sometimes even none. Consequently, in order to improve the quality of the service information, as many properties about each service as possible have to be extracted. Hence chapter 3 presents an extension of the Pica-Pica Web Service Description Crawler.


3 Design and Implementation

The previous chapter on the state of the art presented already existing techniques and implementations. In the following, the basic principle of the proposed approach, the Deep Web Services Crawler, is introduced; it is based on these existing techniques, especially the Pica-Pica Web Service Description Crawler.

31 Deep Web Services Crawler Requirements

This section describes the goals of the Deep Web Service Crawler approach, the system requirements for the approach, and some non-functional requirements.

311 Basic Requirements for DWSC

The following are the basic requirements which should be achieved:

(1) Produce the largest annotated Service Catalogue. Here, the Service Catalogue is a list of Web services published in the Web Service Registries. It contains not only the WSDL document of each service but also a file with information about that service. Therefore, for the purpose of producing the largest annotated Service Catalogue, the proposed approach needs to extract as many properties about those Web services as possible. Moreover, it also has to download the WSDL document hosted along with each Web service. These properties are not only the interesting structured properties, such as the service name and its WSDL link address, but also other non-functional properties, for example the endpoint and its monitoring information.

(2) Flexible storage: each service has a Service Catalogue entry that contains all its interesting properties. A central question is which schemes are used to store these service properties. In order to store them in a flexible way, the proposed approach provides three storage methods: the first stores them as an XML file, the second stores them in an INI file, and the third uses a database for the storage.
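To give an impression of the flexible storage, the following is a minimal sketch of how the properties of one service might look in the INI format. The section names, keys and values are only illustrative assumptions and not the exact schema used by this approach:

    [Structured]
    ServiceName = BLZService
    Server = Apache-Coyote/1.1
    Rating = Four stars and A Half

    [Endpoint]
    EndpointURL = http://example.org/axis2/services/BLZService

The same properties could equally be written as elements of an XML file or as one record in a database table, which is exactly the flexibility this requirement asks for.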

312 System Requirements for DWSC

Generally speaking, the requirements for realizing a programming project include the following:

1) Operating System: Linux/Unix, Windows (XP, Vista, 2000, etc.)

2) Programming language: C++, Java, Python, C, etc.

3) Programming Tool: NetBeans, Eclipse, Visual Studio and so on

In this master thesis, the scripts of the Deep Web Service Crawler approach are written in the Java programming language. These scripts have only been tested under the Windows XP and Linux operating systems and have not been tested on other operating systems.


313 Non-Functional Requirements for DWSC

In this part several non-functional requirements for the Deep Web Service Crawler approach are

presented

1) Transparency: the process of data exploration and data storage should be done automatically, without the user's intervention. However, at the beginning the user has to specify the path on the hard disk which will be used to store the outputs of this program.

2) Fault tolerance: during the execution of this program, errors will inevitably happen. Therefore, in order to keep the process from being interrupted, there must be the necessary error handling so that the process can recover.

3) Completeness: this approach should extract as many of the interesting properties of each Web Service as possible, e.g. endpoint and monitoring information.

In addition, since the Pica-Pica Web Service Crawler has already implemented strategies for the following five URLs, the proposed approach must cover not less than these five Web Service Registries:

Biocatalogue: http://www.biocatalogue.com
Ebi: http://www.ebi.ac.uk
Seekda: http://www.seekda.com
Service-Repository: http://www.service-repository.com
Xmethods: http://www.xmethods.net

32 Deep Web Services Crawler Architecture

In this section, an overview of the high-level architecture of the Deep Web Services Crawler approach is introduced first. Thereafter, four subsections outline each single component and how they play together.

The current components and data flows of the Deep Web Service Crawler are summarized in Figure 3-1, using the continuous arrows. The crawler first tries to obtain the available service list page links and the related service page links by crawling the Web (Web Service Extractor). Then the gathered links are processed in two separate steps: one gets the service's WSDL document (WSDL Grabber) and the other collects the properties of each service (Property Grabber). Finally, all these data are stored on the storage device (Storage). The detailed process in figure 3-1 is as follows.

- Step 1
When the Deep Web Service Crawler starts to run, the File Chooser container requires the user to specify a path on the computer or on any other hard disk. The reason for specifying the path is that the Deep Web Service Crawler program needs a place to store all its outputs.

- Step 2
After that, the Web Service Extractor is triggered. It is the main entry point to the actual crawling process. Since the Deep Web Service Crawler is supposed to crawl for Web Services in some given Web Service Registries, the URL addresses of these Web Service Registries have to be given as the initial seed for the Web Service Extractor process. However, since the page structures of these Web Service Registries are completely different, there is a separate process for each Web Service Registry.

Figure 3-1: Overview of the Basic Architecture of the Deep Web Services Crawler

- Step 3
According to the given seed, two types of links are obtained by the Web Service Extractor component: the service list page link and the service page link. A service list page is a page that contains a list of Web Services and possibly some information about these Web Services, while a service page is a page that contains much more information about a single service. Finally, the component forwards these two types of links to the next two components, the Property Grabber and the WSDL Grabber.

- Step 4
On the one hand, the Property Grabber component tries to gather the information about the service hosted in the service list page and the service page, such as the name of the service, its description, its rating, etc. All the information about the service is collected together as the service properties, which are then delivered to the Storage component for further processing.

- Step 5
On the other hand, the WSDL Grabber tries to obtain the WSDL link from the service list page or from the service page. For some Web Service Registries the WSDL link is hosted in the service list page, as in Biocatalogue, while for the other Web Service Registries it is hosted in the service page, as in Xmethods. After the WSDL link has been obtained, it is also transmitted to the Storage component for further processing.

- Step 6
When the service properties and the WSDL link of the service are received by the Storage component, it stores them on disk. The service properties can be stored in three different ways: as an XML file, as an INI file, or as one record in a database table. For the WSDL link, the Storage component first tries to download the page content from the URL address of the WSDL link; if this succeeds, the page content is stored on disk as the WSDL document of the service.

- Step 7
Steps 3 to 6 describe the crawling process for a single service. Hence, if there is more than one service or more than one service list page in a Web Service Registry, the crawling process from step 3 to step 6 is repeated again and again until there are no more services or service list pages in that Web Service Registry.

- Step 8
Furthermore, after the crawling process for one Web Service Registry finishes, a file is generated that contains some statistical information about this crawling process, for example when the crawling process of this Web Service Registry started and finished, the total number of Web services in this Web Service Registry, how many services have an empty WSDL document, the average number of service properties in this Web Service Registry, and the average time needed for extracting service properties, getting the WSDL document and generating the XML file, INI file, etc.

321 The Function of Web Service Extractor Component

The Web Service Extractor component is responsible for gathering service list page links and service page links. It pursues a focused crawl of the Web and only forwards service list page and service page links to the subsequent components for analyzing, collecting and gathering purposes. Therefore it identifies both service list page links and the related service page links in these Web Service Registries.

As can be seen from figure 3-2, a crawl for Web Services needs to start from a URL seed. The seed is almost as important as the Web Service Extractor itself, as it highly influences the part of the Web which needs to be crawled. The seed can or shall contain, e.g., Web pages where these Web Services are published or which talk about Web Services.


Figure 3-2: Overview of the process flow of the Web Service Extractor Component

After being fed with the URL seed, the Web Service Extractor component starts to get the service list page link from the initial page of this URL seed. However, this process differs for the five Web Service Registries. The following shows the different situations in these Web Service Registries.

- Service-Repository Web Service Registry
In this Web Service Registry, the link of the first service list page is the URL address of its seed, which means that some Web Services can already be found on the home page of the Service-Repository Web Service Registry. The word "first" implies that there is more than one service list page link in this Web Service Registry. Therefore the process of getting service list page links in this registry is carried on until no more service list page links exist.

- Xmethods Web Service Registry
Although there are Web Services on the home page of the Xmethods Web Service Registry, they are only a small subset of the services in this Web Service Registry. In the Xmethods Web Service Registry there is exactly one page containing all Web Services; therefore the service list page link of that page has to be obtained.

- Ebi Web Service Registry
The situation in the Ebi Web Service Registry is similar to that in the Xmethods Web Service Registry: there is also one page that contains all Web Services in this Web Service Registry. However, this page is not the initial page of the input seed; therefore more than one operation step is needed to get the service list page link of that page.

- Seekda Web Service Registry
In the Seekda Web Service Registry, the Web Services are not contained in the initial page of the input seed; the service list page link can only be obtained after several additional operation steps. However, there is a problem with getting the service list page links in this registry: if the Web Services are spread over more than one page, for some unknown reason the links of the remaining service list pages cannot be obtained. In other words, only the link of the first service list page can be obtained.

- Biocatalogue Web Service Registry
The process of getting service list pages in the Biocatalogue Web Service Registry is almost the same as in the Seekda Web Service Registry. The only difference is that in the Biocatalogue Web Service Registry all service list page links can be obtained if there is more than one service list page.

After getting the link of a service list page, the Web Service Extractor begins to get the service page link of each service listed in that service list page. This is possible because there is an internal link for every service which points to its service page. It is worth noting that as soon as a service page link is obtained, this component immediately forwards the service page link and the service list page link to the subsequent two components for further processing. The process of obtaining service page links is carried out until all services listed in that service list page have been crawled; analogously, the process of getting service list pages is carried out until no more service list pages exist.

3211 Features of the Web Service Extractor Component

The main features are described in the following paragraphs

1) Obtain service list page links
A central task is to obtain the corresponding service list page links. A service list page link is a URL address of a page that contains a public list of Web Services together with brief information about them, such as the name of the service, an internal URL that links to another page containing detailed information about that service, and sometimes the link address of the WSDL document.

2) Obtain service page links
Once the service list page links are found, a crucial aspect is to extract the internal link of each service in order to aid the task of service information discovery. It is therefore the task of the Web Service Extractor to harvest the HTML page content of the service list page, so that the service page links, whose pages contain much more detailed information about the single Web service, can be obtained.

3212 Input of the Web Service Extractor Component

This component depends on specific input seeds. The only input required for this component is a URL seed, which is one of the URLs listed in section 313.

3213 Output of the Web Service Extractor Component

The component produces two kinds of service-related page links from the Web:
- Service list page links
- Service page links


3214 Demonstration for Web Service Extractor

In order to give a comprehensive understanding of the process of the Web Service Extractor component, the following figures are given for explanation. Although there are five URL addresses, in this section only the Service-Repository URL is shown as an example.

1) The input seed is the initial URL address of the Service-Repository, which is "http://www.service-repository.com".

2) As already said in section 321, the first service list page link of this Web Service Registry is its input seed "http://www.service-repository.com". Figure 3-3 shows the corresponding service list page of that link.

Figure 3-3: Service list page of the Service-Repository

Figure 3-4: Original source code of the internal link for the Web service "BLZService"

Figure 3-5: Code overview of getting the service page link in Service-Repository

Figure 3-6: Service page of the Web service "BLZService"


3) Now that the service list page link is already known, the next step is to acquire the service page link for each of the services listed in the service list page. The text in the red box of figure 3-4 shows the internal link of the Web service "BLZService". However, this is not the complete link of the service page: it has to be prefixed with the initial URL address of the Service-Repository Web Service Registry, "http://www.service-repository.com". The code for getting the service page link of a Web service in Service-Repository is shown in figure 3-5. Therefore the final link of this service page is "http://www.service-repository.com/service/overview-210897616". Figure 3-6 is the corresponding service page of that link.

4) Afterwards, those two links, the service list page link and the service page link, which are gathered by the Web Service Extractor component, are immediately forwarded to the next two components, the WSDL Grabber component and the Property Grabber component.
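To illustrate this step, the following is a minimal Java sketch of how the internal service links of a service list page could be collected and resolved against the registry's base URL. It uses the jsoup library and a CSS selector chosen only for illustration; the actual thesis code shown in figure 3-5 may use a different HTML parser and selection logic:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class ServicePageLinkSketch {
        public static void main(String[] args) throws Exception {
            String base = "http://www.service-repository.com";
            // Download and parse the service list page.
            Document listPage = Jsoup.connect(base).get();
            // Assumption: every listed service links to a page named "overview-...".
            for (Element a : listPage.select("a[href*=overview-]")) {
                // absUrl resolves the relative link against the base URL,
                // i.e. the prefixing described in step 3 above.
                System.out.println(a.absUrl("href"));
            }
        }
    }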

322 The Function of WSDL Grabber Component

The WSDL Grabber component acquires the WSDL link hosted on the Web, based on the service list page link or the service page link. The whole process flow is illustrated in figure 3-7.

Figure 3-7: Overview of the process flow of the WSDL Grabber Component

When the WSDL Grabber component receives the inputs delivered from the previous component, it starts to get the WSDL link of the service based on these inputs. Although the inputs of the WSDL Grabber component are the links of the service page and the service list page, only one of them contains the WSDL link for the corresponding service; that is to say, the WSDL link exists either in the service page or in the service list page. The reason why both links have to be delivered to this component is that only one of the five Web Service Registries, Biocatalogue, hosts the WSDL link in the service list page, while for the other four Web Service Registries the WSDL link is hosted in the service page. Therefore the WSDL link for these four Web Service Registries is obtained via the service page link, and for the Biocatalogue Web Service Registry via the service list page link. However, there is a problem with getting the WSDL link in the Biocatalogue Web Service Registry: some of the Web services listed in its service list page have no WSDL link, in other words these services have no WSDL document. In such a situation the WSDL link of these Web services is assigned a "NULL" value. For the Web Services in the other four Web Service Registries, on the other hand, the WSDL link always exists in the service page. Finally, whenever the WSDL Grabber component extracts the WSDL link of a single Web service, the link is immediately forwarded to the Storage component for downloading the WSDL document.

3221 Features of the WSDL Grabber Component

The WSDL Grabber component has to provide the following feature:
- Obtain WSDL links
The WSDL link is the direct way to get to the page that contains the contents of the WSDL document. It is actually a URL address, but at its end there is something like "wsdl" or "WSDL" to indicate that this address points to the page of a WSDL document.

3222 Input of the WSDL Grabber Component

This component requires the following input data:
- Service list page link
- Service page link

3223 Output of the WSDL Grabber Component

The component only produces the following output data:
- The URL address of the WSDL link for each service

3224 Demonstration for WSDL Grabber Component

This section presents a list of figures in order to give a comprehensive understanding of the process of the WSDL Grabber component. It uses the same Web service of the Service-Repository as an example.

1) The input for the WSDL Grabber component is the service page link obtained from the Web Service Extractor component. The address of this link is "http://www.service-repository.com/service/overview-210897616".

Figure 3-8: WSDL link of the Web service "BLZService" in the service page


2) Figure 3-9 shows the corresponding original source code of the WSDL link highlighted in figure 3-8.

Figure 3-9: Original source code of the WSDL link for the Web service "BLZService"

3) Figure 3-10 and figure 3-11 show the code used to extract the WSDL link shown in figure 3-9. Figure 3-10 is the particular code for the Service-Repository Web Service Registry only; for the other four Web Service Registries this code is different. The function "getServiceRepositoryWSDLLink" first gets a list of nodes that have the HTML tag name "b". Then it checks all of these nodes one by one to see whether the text value of the node is "WSDL". If a node fulfills this condition, the attribute value of its sibling, which is an "a" element here, is extracted as the value of the WSDL link of this Web service.

Figure 3-10: Code overview of the "getServiceRepositoryWSDLLink" function

Figure 3-11: Code overview of the "oneParameter" function

4) Finally, after applying the functions shown in figures 3-10 and 3-11, the WSDL link of the Web service "BLZService" can be obtained. Its value is "http://services.unitedplanet.de/blz/BlzService.asmx?WSDL".
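The extraction logic described in step 3 could be sketched in Java as follows. This is only an illustrative re-implementation on top of the jsoup library; the real code in figures 3-10 and 3-11 may use a different parser, and the sibling relationship of the "a" element is an assumption taken from the description above:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class WsdlLinkSketch {
        // Returns the WSDL link found on a Service-Repository service page, or null.
        static String getServiceRepositoryWsdlLink(Document servicePage) {
            for (Element b : servicePage.getElementsByTag("b")) {
                if ("WSDL".equals(b.text().trim())) {
                    Element sibling = b.nextElementSibling();   // the neighbouring element
                    if (sibling != null && "a".equals(sibling.tagName())) {
                        return sibling.attr("href");            // its href attribute is the WSDL link
                    }
                }
            }
            return null;
        }

        public static void main(String[] args) throws Exception {
            Document page = Jsoup.connect(
                    "http://www.service-repository.com/service/overview-210897616").get();
            System.out.println(getServiceRepositoryWsdlLink(page));
        }
    }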


323 The Function of Property Grabber Component

The Property Grabber component is a module which is used to extract and gather the Web service information hosted on the Web, namely the information shown in the service list page and the service page. In the end, all the obtained Web service information is collected together as the service properties, which are delivered to the Storage component for storing. The detailed process flow of the Property Grabber component is illustrated in figure 3-12.

The inputs of the Property Grabber component are the same as the inputs of the WSDL Grabber component. However, there is a small difference between them with respect to the seed: as already mentioned in section 322, for the WSDL Grabber component only one of the inputs is sufficient to get the WSDL link, while the Property Grabber component needs both of them as seeds. Once the Property Grabber component receives the needed inputs, it starts to extract the information of that single Web service.

Figure 3-12: Overview of the process flow of the Property Grabber Component

After the Property Grabber component receives the inputs, it starts to extract the service information for the Web service. Generally speaking, the service information consists of four aspects: structured information, endpoint information, monitoring information and whois information.

(1) Structured Information
The structured information is obtained by extracting the information hosted in the service page and the service list page. It is the basic descriptive information about the service, such as the name of the service, the URL address through which the WSDL document can be obtained, the description introducing the service, the provider who offers that service, its rating, the server that hosts this service, etc. However, the elements constituting this structured information differ between the Web Service Registries. For example, the rating of a Web service exists in the Service-Repository Web Service Registry, while the Xmethods Web Service Registry does not have this information. In addition, even for Web services within the same Web Service Registry, some elements of the structured information may be missing; for instance, one service in a Web Service Registry may have a description while another service in the same registry does not. Tables 3-1 to 3-5 show the structured information that should be extracted from these five Web Service Registries. Moreover, if the style of a service in the Biocatalogue Web Service Registry is SOAP, there is some additional information describing the SOAP operations of this service, while if it is REST, this additional information describes the REST operations; this should also be considered part of the structured information. Table 3-6 and table 3-7 illustrate the information for these two different operation types.

Service Name WSDL Link WSDL Version

Provider Server Rating

Homepage Owner Homepage Description

Table 3-1: Structured Information of the Service-Repository Web Service Registry

Service Name WSDL Link Provider

Service Style Homepage Implementation Language

Description User Description Contributed Client Name

Type of this Client Publisher for this client Used Tookit of this client

Used Language of this client Used Operation System of this client

Table 3-2: Structured Information of the Xmethods Web Service Registry

Service Name WSDL Link Server

Provider Providerrsquos Country Service Style

Rating Description User Description

Service Tags Documentation (within WSDL)

Table 3-3: Structured Information of the Seekda Web Service Registry

Service Name WSDL Link Port Name

Service URL Address Implementation Class

Table 3-4: Structured Information of the Ebi Web Service Registry

Service Name WSDL Link Style

Provider Providerrsquos Country View Times

Favorite Times Submitter Service Tags

Total Annotation Provider Annotation Member Annotation

Registry Annotation Base URL SOAP Lab Server Base URL

Description User Description Category

Table 3-5: Structured Information of the Biocatalogue Web Service Registry


SOAP Operation Name Inputs and Outputs Operation Description

Operation Tags Part of Which Service

Table 3-6: SOAP Operation Information of the Biocatalogue Web Service Registry

REST Operation Name Service Tags Used Template

Operation Description Part of Which Service Part of Which Endpoint Group

Table 3-7 REST Operation Information of Biocatalogue Web Service Registry

(2) Endpoint Information
Endpoint information is the port information of the Web service, which can be extracted only from the service page. Since different Web Service Registries structure the endpoint information differently, the elements of the endpoint information vary considerably. One thing to pay attention to is that the Ebi Web Service Registry has no endpoint information for any of the Web services published in this registry. Moreover, even though the Web services in the same Web Service Registry share the same structure of endpoint information, some elements of it may be missing or empty; a Web Service Registry may even have no endpoint information at all for some Web services published by it. Nevertheless, whenever there is endpoint information for a Web service, there is at least one element, namely the URL address of the endpoint. Table 3-8 below shows the endpoint information that should be extracted from these five Web Service Registries.

Web Service Registry Name | Elements of the Endpoint Information
Service-Repository | Endpoint Name, Endpoint URL, Endpoint Critical, Endpoint Type, Bound Endpoint
Xmethods | Endpoint URL, Publisher of this Endpoint, Contact Email of this Publisher, Implementation Language of this Endpoint
Seekda | Endpoint URL
Biocatalogue | Endpoint Name, Endpoint URL
Table 3-8: Endpoint Information of these five Web Service Registries

Web Service Registry Name | Elements of the Monitoring Information
Service-Repository | Service Availability, Number of Downs, Total Uptime, Total Downtime, MTTR, MTBF, RTT Max of Endpoint, RTT Min of Endpoint, RTT Average of Endpoint, Ping Count of Endpoint
Seekda | Service Availability, Begin Time of Monitoring
Biocatalogue | Monitored Status of Endpoint, Overall Status of Service
Table 3-9: Monitoring Information of these three Web Service Registries

(3) Monitoring Information
Monitoring information is the measured statistical information about the Web service. It is worth noting that the Ebi and Xmethods Web Service Registries have no monitoring information for any of the Web services published by them, while in the other three Web Service Registries only a few Web services may lack this information. Table 3-9 displays the monitoring information for these three Web Service Registries.

(4) Whois Information
Whois information is not extracted from the information hosted in the service page and the service list page. It is the descriptive information about the service domain, which can be gained by means of the address of the WSDL link. Because of that, the process of getting the Whois information starts with obtaining the service domain. The final value of the service domain must not contain strings like "http", "https", "www", etc.; it must be the registrable domain under the top-level domain. After that, the service domain database is queried by sending the value of the service domain to a Whois client, which is simply a Web site on the Internet, for example "http://www.whois365.com/cn/domain". If information about that service domain exists, a list of its information is returned as the output. However, the structure of the returned information differs for different service domains; therefore the most challenging part is handling the extraction process for each different form of the returned information. Table 3-10 lists the Whois information that needs to be extracted for all five Web Service Registries.

Service Domain URL Domain Name Domain Type

Domain Address Domain Description State

Postal Code City Country

Country Code Phone Fax

Email Organization Established Time

Table 3-10: Whois Information for these five Web Service Registries
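As an illustration of the first part of this step, the following minimal Java sketch derives the service domain from a WSDL link by stripping the protocol, an optional "www." prefix and the path. The class and method names are assumptions for this example, and real domains may need additional sub-domain handling to reach the registrable domain described above:

    import java.net.URI;

    public class ServiceDomainSketch {
        // Derives the service domain (no protocol, no "www.", no path) from a WSDL link.
        static String serviceDomain(String wsdlLink) throws Exception {
            String host = new URI(wsdlLink).getHost();   // e.g. "www.thomas-bayer.com"
            if (host.startsWith("www.")) {
                host = host.substring(4);                // drop the "www." prefix
            }
            return host;                                 // e.g. "thomas-bayer.com"
        }

        public static void main(String[] args) throws Exception {
            System.out.println(serviceDomain(
                    "http://www.thomas-bayer.com/axis2/services/BLZService?wsdl"));
        }
    }

The resulting domain string is then sent to the Whois client, and the returned record is parsed according to its particular structure.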

Finally, all the information from these four aspects is collected together and then delivered to the Storage component for further storage processing.

3231 Features of the Property Grabber Component

The Property Grabber component has to provide the following features:

- Obtain basic information
Generally speaking, the more information is available about a Web service, the better one can judge how good this Web service is. Hence it is necessary for the Property Grabber component to extract all the basic information hosted in the service list page and the service page. This basic information comprises the structured information, the endpoint information and the monitoring information.

- Obtain Whois information
Since more information about a Web service means a better picture of its quality, it is necessary to extract as much information about the Web service as possible. Therefore, in addition to the basic information, the Property Grabber component also obtains some additional information called Whois information, such as the type of the domain, the name of the person in charge, the postal code of the domain, city, phone, fax, detailed address, email, etc.


3232 Input of the Property Grabber Component

This component requires the following input data:

- Service list page link
- Service page link

3233 Output of the Property Grabber Component

The component produces the following output data:

- Structured information of each service
- Endpoint information of each service, if it exists
- Monitoring information for the service and its endpoint, if it exists
- Whois information of the service domain

All this information is collected together as the properties of each service. Thereafter the collected properties are sent to the Storage component.

3234 Demonstration for Property Grabber Component

The pictures in figure 3-13 to figure 3-16 depict the fundamental procedure of the Property Grabber component. To simplify the explanation, the example shown in this section uses the same Web service as before.

1) The inputs of the Property Grabber component are the links of the service list page and the service page that are received from the Web Service Extractor component. These links are "http://www.service-repository.com/" and "http://www.service-repository.com/service/overview/-210897616".

2) When the "GetProperty" function is triggered, it calls its four sub functions shown in figure 3-12.

Figure3-13 Structure properties of the Service "BLZService" in service list page


Figure3-14 Structure properties of the Service "BLZService" in service page

3) Firstly, the "getStructuredProperty" function tries to extract the structured information that is displayed in the red boxes of figure 3-13 and figure 3-14. However, several elements of the structured information have the same content, such as the description shown in the service page and in the service list page. Hence, in order to save extraction time and storage space, elements with the same content are only extracted once. Moreover, the rating information requires a transformation from non-descriptive to descriptive text, because its content consists of several star images. The final results of the extracted structured information for this Web service are shown in table 3-11. Since there is no descriptive information for the Provider, Homepage and Owner Homepage, their values are assigned "NULL".

Service Name       BLZService
WSDL Link          http://www.thomas-bayer.com/axis2/services/BLZService?wsdl
WSDL Version       0
Server             Apache-Coyote/1.1
Description        BLZService
Rating             Four stars and A Half
Provider           NULL
Homepage           NULL
Owner Homepage     NULL

Table 3-11 Extracted Structured Information of Web Service "BLZService"

4) Secondly, the "getEndpointProperty" function extracts the endpoint information that is displayed in the red box of figure 3-15. Although there is more than one endpoint record in the list, only one of them is extracted as the endpoint information. The reason is that this master program is intended to extract as much information as possible without storing redundant information; therefore a single endpoint record is sufficient even if several records are listed. Table 3-12 shows the final result of the endpoint information of this Web service.

Figure3-15 Endpoint information of the Web service "BLZService" in service page


Endpoint Name      BLZServiceSOAP12port_http
Endpoint URL       http://www.thomas-bayer.com:80/axis2/services/BLZService
Endpoint Critical  True
Endpoint Type      production
Bound Endpoint     BLZServiceSOAP12Binding

Table 3-12 Extracted Endpoint Information of the Web service "BLZService"

5) Then the monitoring information is extracted by invoking the function "getMonitoringProperty". Figure 3-16 displays two types of monitoring properties: the red box above contains the monitoring information about the Web service, and the red box below lists the monitoring information of its endpoints. As already mentioned, only one endpoint statistic record is extracted. Besides, as can be seen in figure 3-16, there are two availability values; both represent the availability of this Web service, just like the availability shown in figure 3-14, so one of them is sufficient. Table 3-13 shows the final results of this extraction process.

Figure3-16 Monitoring Information of the Service "BLZService" in service page

Service Availability 100

Number of Downs 0

Total Uptime 1 day 19 hours 19 minutes

Total Downtime 0 second

MTBF 1 day 19 hours 19 minutes

MTTR 0 second

RTT Max of Endpoint 141 ms

RTT Min of Endpoint 0 ms

RTT Average of Endpoint 577 ms

Ping Count of Endpoint 112

Table 3-13 Extracted Monitoring Information of the Web service "BLZService"

6) Next, in order to extract the Whois properties, the "getWhoisProperty" function first has to derive the service domain from the WSDL link. For this Web service the obtained service domain is "thomas-bayer.com". It then sends this service domain as input to the Whois client for the querying process, which returns a list of information about that service domain, see figure 3-17. Table 3-14 shows the extracted Whois information.


Figure3-17 Whois Information of the service domain "thomas-bayer.com"

Service Domain URL   thomas-bayer.com
Domain Name          Thomas Bayer
Domain Type          NULL
Domain Address       Moltkestr. 40
Domain Description   NULL
State                NULL
Postal Code          54173
City                 Bonn
Country              NULL
Country Code         DE
Phone                +4922855525760
Fax                  NULL
Email                info@predic8.de
Organization         predic8 GmbH
Established Time     NULL

Table 3-14 Extracted Whois Information of service domain "thomas-bayer.com"

7) Finally, the information of all four aspects is collected together as the service properties, and these service properties are then forwarded to the Storage component.

324 The Function of Storage Component

The Storage component uses the WSDL link received from the WSDL Grabber component to download the WSDL document from the Web and store it on disk. In addition, the service properties received from the Property Grabber component are stored on disk in three different formats by this component. Figure 3-18 illustrates the process flow of the Storage component.

When the Storage component receives the required inputs, its mediator function "Storager" is triggered. It then transforms the service properties into three different output formats and stores them on disk: an XML file, an INI file and database records. Besides, it also tries to download the WSDL document from the URL address of the WSDL link and stores the obtained WSDL document on disk as well. The "Storager" function is composed of four sub functions, namely the "getWSDL", "generateXML", "generateDatabase" and "generateINI" sub functions; each sub function is in charge of one aspect of the storage task.


Figure3-18 Overview the process flow of the Storage Component

(1) "getWSDL" sub function

The task of the "getWSDL" sub function is to download the WSDL document and store it on disk. First of all it has to obtain the content of the WSDL document, which is done as follows. The "getWSDL" sub function first checks whether the value of the received WSDL link equals "NULL". As already presented in section 322, if a Web service has no WSDL link, the value of its WSDL link is assigned "NULL". In that case the sub function creates a WSDL document whose name is the service name appended with the mark "No WSDL Document"; obviously this document has no content, it is an empty document. If the service does have a WSDL link, this sub function tries to connect to the Internet using the URL address of the WSDL link. Once it succeeds, the content hosted on the Web is downloaded, stored on disk and named only with the name of the service. Otherwise a WSDL document is created whose name is prefixed with "Bad" before the service name.
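A minimal Java sketch of this behaviour could look as follows (illustrative only, not the code of the master program; the exact naming of the marker documents is simplified here):

    import java.io.InputStream;
    import java.net.URL;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class WsdlDownloaderSketch {

        public static void getWsdl(String path, String serviceName, String wsdlLink) {
            try {
                if (wsdlLink == null || wsdlLink.equals("NULL")) {
                    // no WSDL link: create an empty marker document
                    Files.write(Paths.get(path, serviceName + "[No WSDL Document].wsdl"),
                                new byte[0]);
                    return;
                }
                try (InputStream in = new URL(wsdlLink).openStream()) {
                    // WSDL link available: store the downloaded content under the service name
                    Files.copy(in, Paths.get(path, serviceName + ".wsdl"));
                }
            } catch (Exception e) {
                try {
                    // WSDL link unreachable: create a document prefixed with "Bad"
                    Files.write(Paths.get(path, "Bad" + serviceName + ".wsdl"), new byte[0]);
                } catch (Exception ignored) {
                }
            }
        }
    }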

(2) "generateXML" sub function

The "generateXML" sub function takes the service properties as input, transforms them into an XML file and stores it on disk under the name of the service plus the extension ".xml". XML stands for eXtensible Markup Language, a markup language designed to transport and store data. The first line of an XML file is the XML declaration, which defines the XML version and the encoding used. For example, "<?xml version="1.0" encoding="UTF-8"?>" means that this is an XML file whose version is 1.0 and whose encoding is the 8-bit Unicode Transformation Format character set. Besides, an XML file contains XML elements, each spanning everything from the element's start tag to its end tag. Moreover, an XML element can contain other elements, simple text, or a mixture of both. However, an XML file must contain a root element that is the parent of all other elements.
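To illustrate the idea behind this sub function, the following minimal Java sketch (not the thesis implementation; the class, method and file names are illustrative) writes a set of name-value pairs as child elements of a "service" root element into an XML file:

    import java.io.PrintWriter;
    import java.util.LinkedHashMap;
    import java.util.Map;

    public class XmlWriterSketch {

        static String escape(String s) {
            return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
        }

        public static void generateXml(String path, String serviceName,
                                       Map<String, String> properties) throws Exception {
            try (PrintWriter out = new PrintWriter(path + serviceName + ".xml", "UTF-8")) {
                out.println("<?xml version=\"1.0\" encoding=\"UTF-8\"?>");
                out.println("<!-- service properties collected by the crawler -->");
                out.println("<service>");                         // root element
                for (Map.Entry<String, String> p : properties.entrySet()) {
                    // element name derived from the property name, value as text content
                    String tag = p.getKey().replace(" ", "");
                    out.println("  <" + tag + ">" + escape(p.getValue()) + "</" + tag + ">");
                }
                out.println("</service>");
            }
        }

        public static void main(String[] args) throws Exception {
            Map<String, String> props = new LinkedHashMap<>();
            props.put("Service Name", "BLZService");
            props.put("Rating", "Four stars and A Half");
            generateXml("./", "1BLZService", props);
        }
    }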

(3) "generateINI" sub function

The "generateINI" sub function also takes the service properties as input, but it transforms them into an INI file and stores it on disk under the name of the service plus the extension ".ini". "ini" stands for initialization. The INI file format is a de facto standard for configuration files; INI files are simple text files with a basic structure. Generally speaking, an INI file contains three different kinds of parts: sections, parameters and comments. A parameter is the basic element contained in an INI file. Its format is a key-value pair (also called a name-value pair), delimited by an equals sign "=", where the key or name always appears to the left of the equals sign. A section is like a room that groups its parameters together; it always appears on a single line enclosed in a pair of square brackets "[]", and sections may not be nested. A comment is descriptive text that begins with a semicolon ";"; anything between the semicolon and the end of the line is ignored.
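For illustration, a small INI file of the kind described here could look as follows (the concrete comment, section and property lines are examples only, based on the values of table 3-11):

    ; 1BLZService.ini
    ; service properties collected by the crawler (illustrative excerpt)
    [service]
    Service Name=BLZService
    Rating=Four stars and A Half
    Provider=NULL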

(4) "generateDatabase" sub function

The inputs of the "generateDatabase" sub function are the same as for the previous two sub functions. Instead of transforming them into a file like XML or INI, the "generateDatabase" sub function turns them into database records by using SQL statements. SQL stands for Structured Query Language, a database language designed for accessing and manipulating data in a database. Most actions performed on a database are done with SQL statements, and the primary statements of SQL include insert into, delete, update, select, create, alter and drop. In order to transform the service properties into database records, this sub function first has to create a database using the "create database" statement. Then it has to create a table to store the data. A table is a collection of related data entries and consists of columns and rows. Since the data volume for all five Web Service Registries is not very large, one database table is sufficient for storing the service properties. For this reason the field names of the service properties, i.e. the columns, have to be uniform and well-defined across all five Web Service Registries. Afterwards the service properties of each single service can be put into the table as one record with the SQL "insert into" statement.
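The following minimal JDBC sketch outlines these two steps, creating a table with Text columns and inserting one record per service. It is illustrative only: the table and column names are assumptions, SQLite is used merely as an example, and the sqlite-jdbc driver would have to be on the classpath.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.Statement;

    public class DatabaseWriterSketch {

        public static void main(String[] args) throws Exception {
            try (Connection con = DriverManager.getConnection("jdbc:sqlite:webservices.db")) {
                try (Statement st = con.createStatement()) {
                    st.executeUpdate("CREATE TABLE IF NOT EXISTS service_properties ("
                            + "id INTEGER PRIMARY KEY, "          // increasing Integer key
                            + "ServiceName TEXT, WSDLLink TEXT, Rating TEXT)");
                }
                String sql = "INSERT INTO service_properties (ServiceName, WSDLLink, Rating) "
                           + "VALUES (?, ?, ?)";
                try (PreparedStatement ps = con.prepareStatement(sql)) {
                    ps.setString(1, "BLZService");
                    ps.setString(2, "http://www.thomas-bayer.com/axis2/services/BLZService?wsdl");
                    ps.setString(3, "Four stars and A Half");
                    ps.executeUpdate();                            // one record per crawled service
                }
            }
        }
    }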

3241 Features of the Storage Component

The Storage component has to provide the following features:

- Generate different output formats
The final result of this master program is to store the information about each service on disk for future work. The Storage component provides three different formats for storing the service properties of the services published in the Web Service Registries, which makes the storage of the services both flexible and long-lived.

- Obtain the WSDL document
The important task of this component is to check whether the WSDL document can be obtained from the WSDL link, because the WSDL document plays a decisive role in determining the quality of the service. The Storage component provides the ability to deal with the different situations that can occur while obtaining the WSDL document.


3242 Input of the Storage Component

This component requires the following input data:

- WSDL link of each service
- The property information of each service

3243 Output of the Storage Component

The component produces the following output data:

- WSDL document of the service
- XML document, INI file and table records in the database

3244 Demonstration for Storage Component

The following figures show the fundamental implementation code of the Storage component; the detailed description is given below.

1) As can be seen from figure 3-19 to figure 3-21, the implementation code of these sub functions has several things in common. The first one concerns the parameters defined in each sub function, namely "path" and "SecurityInt". The parameter "path" is the absolute path on the computer disk; it is used for the storing procedure and has already been specified by the user at the beginning of the whole program. The parameter "SecurityInt" is an increasing Integer that is used as part of the name of the service; this prevents services with the same name from overwriting each other on the disk. The code marked in red in these figures is the second common part; its function is to create a file or document for the service with the corresponding parameters.

2) Figure 3-19 displays the implementation code of the "getWSDL" sub function, which is designed to get the WSDL document of a service based on its WSDL link. The parameter "name" is the name of the service and is used as the name of the WSDL document. The parameter "linkStr" is the actual WSDL link, the most important parameter of this sub function. The other two parameters, "statistic" and "log", are objects for the text files called "Statistic Information" and "Log Information" respectively. The "Statistic Information" text file records the statistical data of the services in each Web Service Registry, such as the overall number of properties for that Web Service Registry, the overall number of services, the number of services that have no WSDL link, the number of services whose WSDL documents contain no content, the number of services whose WSDL links are not available, etc. The "Log Information" text file records the results of the processing steps and the problems encountered, for example which service is currently being crawled, which Web Service Registry it belongs to, the reason why the WSDL document of a service could not be obtained, the reason why a service is unreachable, and so on.


Figure3-19 Implementation code for getting WSDL document

3) Figure 3-20 and figure 3-21 show the code for turning the service properties into the XML file and the INI file and for storing those two files on disk afterwards. The parameter "vec" is a Vector of the "PropertyStruct" data type; "PropertyStruct" is a class that consists of the two variables name and value.

Figure3-20 Implementation code for generating XML file


Figure3-21 Implementation code for generating INI file

4) The code in figure 3-22 and figure 3-23 shows the process that turns the service properties of all services in these five Web Service Registries into database records. To this end a database has to be created first. The name of the database can be arbitrary as long as it conforms to the naming rules of the database system; the same holds for the name of the table. Figure 3-22 displays the procedure of creating a table in the database. Since it is hard to decide on the length of each service property, the data type of all service property columns is set to "Text". Figure 3-23 shows the code for inserting the service properties into the table.

Figure3-22 Implementation code for creating table in database


Figure3-23 Implementation code for generating table records

33 Multithreaded Programming for DWSC

Multithreaded programming is a built-in feature of the Java language. A multithreaded program contains two or more separate parts that can execute concurrently; each such part is called a thread. The use of multithreading makes it possible to create programs that use the system resources efficiently, for example by making maximum use of the CPU, because the idle time of the CPU can be kept to a minimum.

In this master program there are five Web Service Registries whose services need to be crawled. Moreover, the number of services published in each Web Service Registry differs considerably, which means that the running time spent on each Web Service Registry differs as well. Without concurrency, a Web Service Registry with few services would have to wait until another Web Service Registry with far more services has finished. Therefore, in order to reduce the waiting time and maximize the use of the system resources, it is necessary to apply multithreaded programming to this master program. That is to say, the master program creates one thread for each Web Service Registry, and these threads are executed independently.
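A minimal Java sketch of this idea (illustrative class names, not the thesis code) creates and starts one thread per registry and waits for all of them to finish:

    public class RegistryCrawlerLauncher {

        // Hypothetical worker; in the real program each registry has its own crawling logic.
        static class RegistryCrawler implements Runnable {
            private final String registryName;

            RegistryCrawler(String registryName) {
                this.registryName = registryName;
            }

            @Override
            public void run() {
                System.out.println("Crawling " + registryName + " ...");
                // crawl service list pages, service pages, WSDL documents, properties ...
            }
        }

        public static void main(String[] args) throws InterruptedException {
            String[] registries = {"Service Repository", "Ebi", "Xmethods", "Seekda", "Biocatalogue"};
            Thread[] threads = new Thread[registries.length];
            for (int i = 0; i < registries.length; i++) {
                threads[i] = new Thread(new RegistryCrawler(registries[i]));
                threads[i].start();           // registries are crawled concurrently
            }
            for (Thread t : threads) {
                t.join();                     // wait until all registries are finished
            }
        }
    }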

34 Sleep Time Configuration for Web Service Registries

Since this master program downloads the WSDL documents and extracts the service information of the Web services published in a Web Service Registry, it inevitably puts load on that Web Service Registry. In addition, in order not to exceed their throughput capacity, these Web Service Registries restrict the rate of access. Because of that, unknown errors sometimes occur while the master program is executing: for instance, the program may halt at one point without obtaining any further WSDL documents or service information, the WSDL documents of some services of a Web Service Registry may not be obtainable, or some service information may be missing. Therefore, in order to obtain the largest possible share of the Web services published in these five Web Service Registries without affecting their throughput, the access rate has to be configured for each service of every Web Service Registry.

Consequently, before going into the essential procedure for each single service of these Web Service Registries, the master program calls the system's built-in function "sleep(long milliseconds)". It is a public static function that causes the currently executing thread to sleep for the specified number of milliseconds, in other words to temporarily cease execution for a while. The following table shows the time interval of the sleep function for each Web Service Registry.

Web Service Registry Name Time Interval (milliseconds)

Service Repository 8000

Ebi 3000

Xmethods 10000

Seekda 20000

Biocatalogue 10000

Table 3-15 Sleep Time of these five Web Service Registries
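A minimal sketch of such a throttling step (the method and class names are illustrative, not the thesis code) could wrap the built-in sleep function with the intervals of table 3-15:

    public final class Throttle {

        public static void pauseFor(String registryName) {
            long interval;
            switch (registryName) {
                case "Service Repository": interval = 8000;  break;
                case "Ebi":                interval = 3000;  break;
                case "Xmethods":           interval = 10000; break;
                case "Seekda":             interval = 20000; break;
                case "Biocatalogue":       interval = 10000; break;
                default:                   interval = 5000;  break;
            }
            try {
                Thread.sleep(interval);   // temporarily cease execution before the next request
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }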


4 Experimental Results and Analysis

This chapter presents the quantitative experimental results of the prototype described in chapter 3, together with an analysis and explanation of these results. In order to obtain reasonably accurate results, the experiments were carried out more than five times; all data displayed in the following tables and charts are the averages of these runs.

41 Statistic Information for Different Web Service Registries

This section presents statistics on the number of Web services published in these five Web Service Registries. They include the overall number of Web services published in each Web Service Registry and the number of unavailable Web services, i.e. services that have been archived because they may no longer be active or are close to becoming inactive. Table 4-1 shows the service amount statistic of these five Web Service Registries.

                       Service Repository   Ebi   Xmethods   Seekda   Biocatalogue
Overall Services               57           289      382       853        2567
Unavailable Services            0             0        0         0         125

Table 4-1 Service amount statistic of these five Web Service Registries

In order to provide an intuitive view of the service amount statistic of these five Web Service Registries, the bar chart in figure 4-1 visualizes the data of table 4-1. As the bar chart shows, on the one hand the overall number of Web services increases from the Service Repository Web Service Registry to the Biocatalogue Web Service Registry. The Biocatalogue Web Service Registry owns the largest number of Web services, which indicates that it has a much more powerful ability to provide Web services to users, because it contains far more services than the other four Web Service Registries. On the other hand, there are no unavailable services in any Web Service Registry except the Biocatalogue Web Service Registry; that is to say, only the Biocatalogue Web Service Registry contains Web services that can no longer be used by users. To some degree this is useless, since these services cannot be used anymore and they waste network resources on the Web. Therefore all these unavailable services should be eliminated in order to reduce the waste of network resources.


Figure4-1 Service amount statistic of these five Web Service Registries

42 Statistic Information for WSDL Document

                       Service Repository   Ebi   Xmethods   Seekda   Biocatalogue
Failed WSDL Links               1             0       23       145         32
Without WSDL Links              0             0        0         0         16
Empty Content                   0             0        2         0          2

Table 4-2 Statistic information for WSDL Document

Table 4-2 and figure 4-2 show the error statistics of the WSDL documents of the Web services in these five Web Service Registries. There are three aspects. The first one, "Failed WSDL Links", is the overall number of Web services whose WSDL links are invalid; in other words, it is impossible to obtain the WSDL documents of these Web services via the URL addresses of their WSDL links, so no WSDL document is created for them. The second aspect, "Without WSDL Links", is the overall number of Web services in each Web Service Registry that have no WSDL link at all, which means there can be no WSDL document for such Web services. The value of the WSDL link of such a Web service is therefore "NULL"; a WSDL document is still created, but it has no content, and its name contains the string "[No WSDL Document]". The third aspect, "Empty Content", represents the overall number of Web services whose WSDL links are valid but whose WSDL documents contain no content.


In this case a WSDL document whose name contains the string "(BAD)" is created.

Figure4-2 Statistic information for WSDL Document

43 Comparison of Different Average Number of Service Properties

This section compares the average number of service properties in these five Web Service Registries. This average number of service properties is calculated by means of the following equation:

ASP = ONSP / ONS (1)

Where
ASP is the average number of service properties for one Web Service Registry,
ONSP is the overall number of service properties in one Web Service Registry,
ONS is the overall number of Web services that have already been crawled from the corresponding Web Service Registry.

Figure 4-3 shows the average number of service properties per Web service in these five Web Service Registries. As already mentioned, one of the measures for assessing the quality of the Web services in a Web Service Registry is the amount of service information: the more information is available about a Web service, the better that service is known, and consequently the corresponding Web Service Registry can offer Web services of better quality to its users. As can be seen from figure 4-3, the Service Repository and Biocatalogue Web Service Registries own a larger number of service properties than the other three Web Service Registries. This directly reflects that these two registries provide more detailed information about the Web services published in them, so users can more easily choose the services they need and are more likely to use the Web services published in these two Web Service Registries.


By contrast, the Xmethods and Seekda Web Service Registries provide less service information about their Web services and therefore offer lower quality for them; users may be less inclined to use the Web services provided in these two Web Service Registries, not to mention the Web services published in the Ebi Web Service Registry.

Figure4-3 Average Number of Service Properties

Based on the description presented in section 323, the causes for the different numbers of service properties in these Web Service Registries may consist of the following points. First, the number of structured information elements for the Web services differs among these five Web Service Registries; part of the information of some Web services in a registry may even be missing or have an empty value. For example, the number of structured information elements that is supposed to be extracted for the Web services in the Ebi Web Service Registry differs greatly from that in the Biocatalogue Web Service Registry. Secondly, there is a certain amount of endpoint information for almost all Web Service Registries except the Ebi Web Service Registry; its absence reduces the overall number of service properties accordingly. Thirdly, some Web Service Registries, such as Xmethods and Ebi, have no monitoring information at all, whereas the Service Repository Web Service Registry in particular offers a large amount of monitoring information about its Web services that can be extracted from the Web. Finally, the amount of Whois information matters: if the database of the Whois client does not contain information about the service domain of a Web service, then no Whois information can be extracted; and even if information about the service domain exists, its amount can vary considerably. Therefore, if many service domains of the Web services in a registry have little or no Whois information, the average number of service properties in that registry decreases greatly.

As a result, in order to help users get better acquainted with the Web services provided in a Web Service Registry and to distinguish these Web services, the Web Service Registry should do its best to offer more and more information for each of its published Web services.

(Values shown in figure 4-3: Service Repository 23, Ebi 7, Xmethods 17, Seekda 17, Biocatalogue 32 properties on average.)


44 Different Outputs of Web Services

The fundamental task of this master program is to obtain the WSDL documents of the Web services hosted in the Web Service Registries, to extract and gather the properties of these Web services, and to store them on disk. Therefore this section describes the different outputs of this master program, which include the WSDL documents of the Web services, the generated XML and INI files, and the data records of the service properties in the database.

Figure4-4 WSDL Document format of one Web service

The WSDL document of a Web service is simply read from the Web via the URL address of its WSDL link, and the data is then stored on disk as the WSDL document. The name of the WSDL document is the service name plus the ending ".wsdl", which marks it as a WSDL document. In order to distinguish WSDL documents that would have the same name although their contents differ, the name of each obtained WSDL document in a Web Service Registry is prefixed with a unique Integer. Figure 4-4 shows the valid WSDL document format of the example Web service; its name is "1BLZService.wsdl".

Regarding the obtained service properties, they are transformed into an XML file, an INI file and data records in the database. Figure 4-5, figure 4-6 and figure 4-7 show these three different output formats respectively. As can be seen in figure 4-5, this is the INI file of the Web service, named "1BLZService.ini"; the Integer is the same as in the WSDL document because both belong to the same Web service. The first three lines of the INI file are service comments, which start with a semicolon and run to the end of the line; they contain basic information describing this INI file. The following line is the section, enclosed in a pair of brackets; it is important because it indicates that the lines behind it contain the information of this Web service. The remaining lines are the actual service information, given as key-value pairs with an equals sign between key and value; each service property is displayed from the beginning of a line.


Figure4-5 INI File format of one Web service

Figure4-6 XML File format of one Web service

Figure4-7 Database data format for all Web services


Furthermore, figure 4-6 shows the generated XML file of the Web service "BLZService"; its name is "1BLZService.xml", and needless to say it belongs to the materials of the same Web service. Although the format of the XML file differs from that of the INI file, their essential contents are the same, that is, the values of the service properties do not differ, because both files are generated from the same collection of properties of the same Web service. The XML file also contains comments, like the INI file, which are displayed between "<!--" and "-->", and the section of the INI file corresponds to the root element of the XML file. Therefore the values of the elements below the root "service" in this XML file are the values of the service properties of this Web service.

Finally, figure 4-7 shows the database table that is used to store the service information of all Web services in these five Web Service Registries. The entire service information of one Web service forms exactly one record in this table. Because of that, the column names of the table are the union of the names of the service information elements of each Web Service Registry; and since column names must be unique, redundant names in this union are eliminated. This is possible because the names of the service information elements are well-defined and uniform across all five Web Service Registries. In addition, the first column of the table is the primary key, an increasing Integer whose function is similar to the Integer contained in the names of the XML and INI files. The remaining columns of the table hold the corresponding service properties of the Web services. The value "NULL" or "Null" in this table indicates that the respective property of that Web service is empty or missing.

45 Comparison of Average Time Cost for Different Parts of Single Web Service

This section describes the comparison of the average time cost of the different parts of getting one single Web service in all these five Web Service Registries. First, the average time cost of getting one single service in a Web Service Registry has to be calculated, which is done with the following equation:

ATC = OTS / ONS (2)

Where
ATC is the average time cost for one single Web service,
OTS is the overall time cost of all the Web services in one Web Service Registry,
ONS is the overall number of Web services that have already been crawled from the corresponding Web Service Registry.

In addition, the average time cost of getting one single service is split into the following six parts: the average time cost for extracting the service properties, the average time cost for obtaining the WSDL document, the average time cost for generating the XML file, the average time cost for generating the INI file, the average time cost for inserting the service properties into the database table, and the average time cost for some other procedures, such as getting the service list page link, getting the service page link and so on. The average time cost for extracting the service properties is obtained by means of the following equation:

ATCSI = OTSSI / ONS (3)

Where

ATCSI is the average time cost for extracting service property of one single Web service

OTSSI is the overall time cost for extracting service property of all the Web services in one Web

Service Registry

ONS is the overall number of Web services that have already been crawled from the

corresponding Web Service Registry

The calculation of the other parts is analogous to the equation for the average time cost for extracting the service properties, while the average time cost for the other procedures equals the average time cost for one single Web service minus the sum of the average time costs of the other five parts.
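For example, with the values of table 4-3 the Service Repository registry spends 10042 - (8801 + 918 + 2 + 1 + 53) = 267 milliseconds on these other procedures.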

All values are given in milliseconds.

                       Service property  WSDL Document  XML File  INI File  Database  Others  Overall
Service Repository           8801             918           2         1        53       267    10042
Ebi                           699              82           2         1        28        11      823
Xmethods                     5801            1168           2         1        45        12     7029
Seekda                       5186            1013           2         1        41        23     6266
Biocatalogue                39533             762           2         1        66      1636    42000

Table 4-3 Average time cost information for all Web Service Registries

Table 4-3 displays the average time cost of one Web service and of its different parts in all these five Web Service Registries. The first column of table 4-3 contains the names of the five Web Service Registries, and the last column gives the average time cost for a single service in the respective Web Service Registry, while the remaining columns give the average time costs of the six different parts. In order to provide an intuitive view of the data in table 4-3, the data of each column are also illustrated in the corresponding figures 4-8 to 4-13.

Figure4-8 Average time cost for extracting service property in all Web Service Registries


As can be seen from figure 4-8, the average time cost for extracting the service properties in the Biocatalogue Web Service Registry is 39533 milliseconds. This value is much larger than in the other four Web Service Registries, which take 8801, 699, 5801 and 5186 milliseconds for Service Repository, Ebi, Xmethods and Seekda respectively. That is to say, it takes much longer to extract the service properties of the Web services published by the Biocatalogue Web Service Registry. This also indirectly indicates that the Biocatalogue Web Service Registry has the highest average number of service properties, as already discussed in section 43; conversely, the average number of service properties in the Ebi Web Service Registry is the smallest. Moreover, the average time cost in the Xmethods Web Service Registry is larger than in the Seekda Web Service Registry, although, as already known, the average number of service properties is the same for these two Web Service Registries. One explanation is that the extraction of service properties in the Xmethods Web Service Registry has to be carried out via both the service page and the service list page, whereas only the service page link is needed for the Seekda Web Service Registry.

The average time cost for obtaining the WSDL document in all these five Web Service Registries is displayed in figure 4-9. This average time cost is in fact the sum of the average time for extracting the WSDL link of a Web service and the average time for reading the data of the WSDL document from the Web and storing it on disk. As seen from this figure, the average time cost for obtaining the WSDL document is largest in the Xmethods Web Service Registry, namely 1168 milliseconds. Although the process of extracting the WSDL link costs a certain amount of time, it does not have a significant influence on the total average time spent for obtaining the WSDL document, because the WSDL link of a Web service is usually obtained in a single step. Therefore this implies that the average size of the WSDL documents in the Xmethods Web Service Registry is larger than in the other four Web Service Registries, in particular the Ebi Web Service Registry, which takes only 82 milliseconds for obtaining the WSDL document.

Figure4-9 Average time cost for obtaining WSDL document in all Web Service Registries


Figure 4-10, figure 4-11 and figure 4-12 show the average time cost of generating the three different outputs in all these five Web Service Registries. As can be seen from figures 4-10 and 4-11, the average time for generating the XML file of one Web service is the same for all five Web Service Registries, namely only 2 milliseconds, and the average time for generating the INI file of one Web service is likewise identical, namely just 1 millisecond. Even the sum of these two average time costs is so small that it can be neglected when compared to the overall average time cost of getting one Web service in each Web Service Registry, shown in figure 4-13. This implies that the generation of the XML and INI files finishes almost immediately after the service properties of a Web service have been received as input. Furthermore, figure 4-12 shows that, although the average time cost of creating the database record for each Web service is larger than the time for generating the XML and INI files in all five Web Service Registries, the operation of creating a database record is still fast.

Figure4-10 Average time cost for generating XML file in all Web Service Registries

Figure4-11 Average time cost for generating INI file in all Web Service Registries


Figure4-12 Average time cost for creating database record in all Web Service Registries

Figure4-13 Average time cost for getting one Web service in all Web Service Registries

Figure 4-13 gives the average time cost for getting one single Web service in all these five Web Service Registries. Without any doubt, the Biocatalogue Web Service Registry takes the longest time for this process. This is because the discussion of the five different parts above shows that almost every part needs more time in the Biocatalogue Web Service Registry, the only exception being the process of obtaining the WSDL document, for which the Biocatalogue Web Service Registry does not have the longest average time. Moreover, a striking observation can be made when looking at figures 4-8, 4-12 and 4-13: the shapes of these curves follow almost the same trend. This further indicates that a Web Service Registry that spends more time on getting the description information of a Web service also offers more information about that Web service.


5 Conclusion and Further Direction

This master thesis provides a scheme that aims to explore the description information of the Web services hosted in different Web Service Registries. The description information of a Web service consists of its WSDL document and the service information about that Web service. Although there is an existing approach, the Pica-Pica Web Service Description Crawler, that can be used to obtain the WSDL documents and service information of the Web services in these Web Service Registries, its functionality is restricted: it explores only a small subset of the Web services hosted in these Web Service Registries, the storage of these Web services is not flexible, and, most importantly, only little service information is extracted, and for some Web Service Registries no service information is extracted at all. The work presented in this master thesis, in contrast, is able to explore all Web services in these Web Service Registries. Moreover, with the approach presented here as much service information as possible is extracted for each Web service, so that the final result is the largest annotated service catalogue ever produced. Furthermore, regarding the storage of the description information of the Web services, this master thesis provides three different formats that guarantee not only the completeness but also the longevity of the description information of the Web services.

However, in the implementation performed in this master thesis, the Whois client used for querying the information of a service domain returns free text if the information exists, and sometimes this free text differs completely from domain to domain. As a consequence, each Web service in all Web Service Registries had to be crawled at least once during the experiment stage, so that all variants of this free text could be foreseen and handled afterwards. This is a huge effort, because there are many Web services in these Web Service Registries. Therefore, in order to simplify the work, another Whois client that eases this task needs to be found and used.

Moreover, in the experiment stage of this master thesis the time cost for getting a Web service is still large. For the purpose of reducing this time, multithreaded programming could also be applied to some parts of the process of getting one Web service.

Although the work performed here is specialized for only these five Web Service Registries, the main principles used here are adaptable to other Web Service Registries with only small changes in the implementation code or the structure.


6 Bibliography

[1] Nathalie Steinmetz Holger Lausen Dario Cerizza Andrea Turati Irene Celino Adam Funk and

Michael Erdmann ldquoD11 ndash Requirement Analysis and Architectural Planrdquo Available from

httpwwwservice-findereudeliverablespublic7-public32-d11 Emanuele Della Valle (CEFRIEL)

June 27 2008

[2] Nathalie Steinmetz Holger Lausen Dario Cerizza Andrea Turati Irene Celino Adam Funk and

Michael Erdmann ldquoD12 - First Design of Service-Finder as a Wholerdquo Available from

httpwwwservice-findereudeliverablespublic7-public37-d12-first-design-of-service-finder-as-a-

whole Emanuele Della Valle (CEFRIEL) July 1 2008

[3] Nathalie Steinmetz Holger Lausen Irene Celino Dario Cerizza Saartje Brockmans Adam Funk

ldquoD13 ndash Revised Requirement Analysis and Architectural Planrdquo Available from

httpwwwservice-findereudeliverablespublic7-public57-d13-revised-requirement-analysis-and-

architectural-plan Emanuele Della Valle (CEFRIEL) April 1 2009

[4] Chia-Hui Chang Mohammed Kayed Moheb Ramzy Girgis Khaled Shaalan ldquoA Survey of Web

Information Extraction Systemsrdquo Volume 18 Issue 10 IEEE Computer Society pp1411-1428 October

2006

[5] Leonard Richardson ldquoBeautiful Soup Documentationrdquo Available from

httpwwwcrummycomsoftwareBeautifulSoupdocumentationhtml October 13 2008

[6] Hao He Hugo Haas David Orchard ldquoWeb Services Architecture Usage Scenariosrdquo Available from

httpwwww3orgTRws-arch-scenarios February 11 2004

[7] Stephen Soderland ldquoLearning Information Extraction Rules for Semi-Structured and Free Textrdquo

Volume 34 Issue 1-3 Journal of Machine Learning Department Computer Science and Engineering

University of Washington Seattle pp233-272 February 1999

[8] Ian Hickson ldquoA Vocabulary and Associated APIs for HTML and XHTMLrdquo World Wide Web

Consortium Working Draft WD-html5-20100624 January 22 2008

[9] Holger Lausen Jos de Bruijn Axel Polleres Dieter Fensel ldquoThe Web Service Modeling Language

WSMLrdquo WSML Deliverable D161v02 March 20 2005 Available from

httpwwwwsmoorgTRd16d161v02

[10] Dumitru Roman Holger Lausen Uwe Keller ldquoWeb Service Modeling Ontology - Standard (WSMO

-Standard)rdquo WSMO deliverable D2 version 11 06 March 2004 Available from

httpwwwwsmoorgTRd2v11

[11] Iris Braum Anja Strunk Gergana Stoyanova Bastian Buder ldquoConQo ndash A Context- And QoS-Aware

Service Discoveryrdquo TU Dresden Department of Computer Science in Proceedings of WWWInternet

2008


7 Appendixes

There are two additional outputs of this master program: the log information file and the statistic report file. Figure 8-1 shows one of the basic output formats of the log information for these five Web Service Registries.

Figure8-1 Log information of the "Service Repository" Web Service Registry

Figure8-2 Statistic information of the "Service Repository" Web Service Registry


Figure 8-2 to figure 8-6 show the output format of the statistic information for the "Service Repository", "Ebi", "Xmethods", "Seekda" and "Biocatalogue" Web Service Registries respectively.

Figure8-3 Statistic information of the "Ebi" Web Service Registry

Figure8-4 Statistic information of the "Xmethods" Web Service Registry


Figure8-5 Statistic information of the "Seekda" Web Service Registry

Figure8-6 Statistic information of the "Biocatalogue" Web Service Registry


Table of Figures

Figure2-1 Dataflow of Service-Finder and Its Components ... 12
Figure2-2 Left is the free text input type and right is its output ... 16
Figure2-3 A Semi-structured page containing data records (in rectangular box) to be extracted ... 16
Figure2-4 Architecture of the Pica-Pica Web Service Description Crawler ... 20
Figure3-1 Overview of the Basic Architecture for the Deep Web Services crawler ... 25
Figure3-2 Overview the process flow of the Web Service Extractor Component ... 27
Figure3-3 Service list page of the Service-Repository ... 29
Figure3-4 Original source code of the internal link for Web service "BLZService" ... 29
Figure3-5 Code Overview of getting service page link in Service Repository ... 29
Figure3-6 Service page of the Web service "BLZService" ... 29
Figure3-7 Overview the process flow of the WSDL Grabber Component ... 30
Figure3-8 WSDL link of the Web service "BLZService" in the service page ... 31
Figure3-9 Original source code of the WSDL link for Web service "BLZService" ... 32
Figure3-10 Code overview of "getServiceRepositoryWSDLLink" function ... 32
Figure3-11 Code overview of "oneParameter" function ... 32
Figure3-12 Overview the process flow of the Property Grabber Component ... 33
Figure3-13 Structure properties of the Service "BLZService" in service list page ... 37
Figure3-14 Structure properties of the Service "BLZService" in service page ... 38
Figure3-15 Endpoint information of the Web service "BLZService" in service page ... 38
Figure3-16 Monitoring Information of the Service "BLZService" in service page ... 39
Figure3-17 Whois Information of the service domain "thomas-bayer.com" ... 40
Figure3-18 Overview the process flow of the Storage Component ... 41
Figure3-19 Implementation code for getting WSDL document ... 44
Figure3-20 Implementation code for generating XML file ... 44
Figure3-21 Implementation code for generating INI file ... 45
Figure3-22 Implementation code for creating table in database ... 45
Figure3-23 Implementation code for generating table records ... 46
Figure4-1 Service amount statistic of these five Web Service Registries ... 49
Figure4-2 Statistic information for WSDL Document ... 50
Figure4-3 Average Number of Service Properties ... 51
Figure4-4 WSDL Document format of one Web service ... 52
Figure4-5 INI File format of one Web service ... 53
Figure4-6 XML File format of one Web service ... 53
Figure4-7 Database data format for all Web services ... 53
Figure4-8 Average time cost for extracting service property in all Web Service Registries ... 55
Figure4-9 Average time cost for obtaining WSDL document in all Web Service Registries ... 56
Figure4-10 Average time cost for generating XML file in all Web Service Registries ... 57
Figure4-11 Average time cost for generating INI file in all Web Service Registries ... 57
Figure4-12 Average time cost for creating database record in all Web Service Registries ... 58
Figure4-13 Average time cost for getting one Web service in all Web Service Registries ... 58


Table of Tables

Table 3-1 Structured Information of Service-Repository Web Service Registry ... 34
Table 3-2 Structured Information of Xmethods Web Service Registry ... 34
Table 3-3 Structured Information of Seekda Web Service Registry ... 34
Table 3-4 Structured Information of Ebi Web Service Registry ... 34
Table 3-5 Structured Information of Biocatalogue Web Service Registry ... 34
Table 3-6 SOAP Operation Information of Biocatalogue Web Service Registry ... 35
Table 3-7 REST Operation Information of Biocatalogue Web Service Registry ... 35
Table 3-8 Endpoint Information of these five Web Service Registries ... 35
Table 3-9 Monitoring Information of these five Web Service Registries ... 35
Table 3-10 Whois Information for these five Web Service Registries ... 36
Table 3-11 Extracted Structured Information of Web Service "BLZService" ... 38
Table 3-12 Extracted Endpoint Information of the Web service "BLZService" ... 39
Table 3-13 Extracted Monitoring Information of the Web service "BLZService" ... 39
Table 3-14 Extracted Whois Information of service domain "thomas-bayer.com" ... 40
Table 3-15 Sleep Time of these five Web Service Registries ... 47
Table 4-1 Service amount statistic of these five Web Service Registries ... 48
Table 4-2 Statistic information for WSDL Document ... 49
Table 4-3 Average time cost information for all Web Service Registries ... 55


Table of Abbreviations

DTD Document Type Definition

DWSC Deep Web Service Crawler

HTML HyperText Markup Language

MTBF Mean Time between Failures

MTTR Mean Time to Recovery

QoS Quality of Service

REST Representational State Transfer

RTT Round Trip Time

SOAP Simple Object Access Protocol

SQL Structured Query Language

URL Uniform Resource Locator

WHATWG Web Hypertext Application Technology Working Group

XML eXtensible Markup Language

WSDL Web Service Description Language

WSML Web Service Modeling Language

WSMO Web Service Modeling Ontology


Step 6 After Sam has found all these specific services he would like to choose the services that provide a high reliability

Requirement 6 Sort functionality based on the users' choices

Step 7 Now Sam expects to compare the service availability that is promised by the service provider with the availability that is actually delivered This should be contained in the services' details In addition there needs to be service coverage information so that Sam can know whether a service covers the areas where he lives and works Moreover Sam would also like to compare these services in other ways for instance by putting some services into a structured table to compare the transaction fees

Requirement 7 A side-by-side comparison table for services and a functionality that enables users to select the services they want to compare

Step 8 At last Sam wants to know whether the service providers offer a free trial of the services so that he can test the service functionality

Requirement 8 If possible display a note that indicates free service trials

212 Architecture Plan for the Service-Finder Project

The architecture plan of the Service-Finder project contains five basic components Service Crawler

Automatic Annotator Conceptual Indexer and Matcher Cluster Engine and Service-Finder Portal

Interface Figure 2-1 presents a high level overview of the components and the data flow among

them

Figure2-1Dataflow of Service-Finder and Its Components [3]


2121 The Principle of the Service Crawler Component

The Service Crawler is responsible for gathering available services and their related information from

the Web The overall cycle is depicted as follows

(1) A Web developer publishes a Web Service

(2) Then the Crawling component begins to harvest the Web in order to identify the Web Services

like WSDL (Web Service Description Language) documents

(3) The Crawler also searches for other related information as soon as a service is discovered

(4) Later after each periodic interval the Crawler will produce a consistent snapshot of the relevant

part of the Web

At last the output of the crawler would be forwarded to the subsequent components for analyzing

indexing and displaying

2122 The Principle of the Automatic Annotator Component

The Automatic Annotator receives the relevant data from previous component and generates

semantic service descriptions about the WSDL documents and its related information based on the

Service-Finder Ontology and Service Category Ontology

First the two compatible ontologies that are used throughout the whole process are briefly introduced [2]

n Generic Service Ontology it is an ontology which is functional to describe the data objects For

example the services the service providers availability payment modalities and so on

n Service Category Ontology it is an ontology which is used to categorize the functionalities or

applications of the services For instance data verification messaging data storage weather etc

Afterwards it is going to talk about the function of this component with its input output

Oslash Input

u Crawled data from Service Crawler

u Service-Finder Ontologies

u Feedback or Correction of before annotations

Oslash Function

u Enrich the information about the service and extract semantic statements according to the

Service-Finder Ontologies For example categorize the service according to the Service

Category Ontology

u Determine whether a particular document is relevant or not through the Web link graph If

not discard these irrelevant documents

u Classify the pages into their genres For instance pricing user comments FAQ and so on

Oslash Output

u Semantic annotation of the services


2123 The Principle of the Conceptual Indexer and Matcher

Component

The Conceptual Indexer and Matcher is more like a data store center that aims at storing all extracted

information of the services and supplying users the capability of retrieval and semantic query For

example the matchmaking between user requests and service offers and the act of retrieving user

feedback on extracted annotations

In addition let us have a look at the function of this component and its input and output

Oslash Input

u Semantic annotation data and full text information obtained from Automatic Annotation

u Semantic annotation data and full text information that come from user interfaces

u Cluster data from user and service clustering component

Oslash Function

u Store the semantic annotations received from the Automatic Annotation component and

from the user interface

u Store the cluster data that procured through the clustering component

u Store and index the textual description offered by the Automatic Annotation component

and the textual comments offered by users

u Ontological query the semantic data from the data store center

u Combined keyword and Ontological querying used for user queries

u Provide a list of similar services for a given service

Oslash Output

u A list of matching services that are queried by users In particular these services should be

sorted by ranking and can also be iterated

u All available data that related to a particular entity must be retrievable at the user interface

2124 The Principle of the Service-Finder Portal Interface

Component

The Service-Finder Portal Interface is the main entry point that provided for users of the

Service-Finder system to search and browse the data which is managed by the Conceptual Indexer

and Matcher component In addition the users can also contribute information by means of providing

tags comments categorizations and ratings to the data browsed Furthermore the developers can

still directly invoke the Service-Finder functionalities from their custom applications in terms of an API

Besides the details of this component's function, input and output are presented below

Oslash Input

u A list of ordered services for a query

u Detailed information about a service or a set of services and a service provider

u Query access to service category ontology and the most used tags provided by the users


u Service availability information

Oslash Function

u The Web Interface allows the users to search services by keyword, tag, or concept in the categorization, to sort and filter query results by refining the query, to compare and bookmark services, and to try out the services that offer this functionality

u The API allows the developers to invoke Service-Finder functionalities

Oslash Output

u Explicit user annotations such as tags, ratings, comments, descriptions and so on

u Implicit user data for example click stream of users bookmarks comparisons links sent

etc

u Manual advertising of available new services

2125 The Principle of the Cluster Engine Component

The Cluster Engine is in charge of collecting and analyzing information of user behaviors from the

Service-Finder Portal eg the queried services and the compared services of the users Moreover it

also provides cluster data to the Conceptual Indexer and Matcher for providing service

recommendations

Furthermore let us introduce this component's function, input and output in detail

Oslash Input

u Service annotation data of both extracted and user feedback

u Users' click streams used for extracting user behaviors

Oslash Function

u Obtain user clusters from user behaviors

u Obtain service clusters from service annotation data to enable to find similar services

Oslash Output

u Clusters of users and services

22 Information Extraction

Due to the rapid development and use of the World-Wide Web a huge amount of information sources has emerged on the Internet However because of the heterogeneity and the lack of structure of these Web information sources the access to them is limited to browsing and searching Therefore Information Extraction which transforms Web pages into program-friendly structures for post-processing becomes a great necessity The task of Information Extraction is specified in terms of its inputs and its extraction targets and the technique used in the process of Information Extraction is called an extractor

221 Input Types of Information Extraction

Generally speaking there are three different input types The first input type is the unstructured document for example the free text shown in figure 2-2 It is unstructured and written in natural language so it requires substantial natural language processing The second input type is the structured document for instance XML documents whose data can be described through an available DTD (Document Type Definition) or XML (eXtensible Markup Language) schema Finally the third input type is the semi-structured document which is widespread on the Web such as the large volume of HTML pages with tables itemized lists and enumerated lists This is because HTML tags are often used to render these embedded data in the HTML pages See figure 2-3

Figure2-2Left is the free text input type and right is its output [4]

Figure2-3A Semi-structured page containing data records

(in rectangular box) to be extracted [4]

Therefore the inputs of the semi-structured type can be seen as documents with a fairly regular structure and the data of these documents can be displayed in an HTML way or a non-HTML way Besides since the Web pages of the Deep Web are dynamic and generated from structured databases in terms of some templates or layouts the Deep Web can be considered as one of the input sources that provide such semi-structured documents For example the authors prices and comments of the book pages provided by Amazon have the same layout That is because these Web pages are generated from the same database and applied with the same template or layout Furthermore manually generated HTML pages can also be of the semi-structured type For example although the publication lists provided on the homepages of different researchers are produced by diverse users they all have a title and a source property for every single paper Eventually the inputs for some Information Extraction tasks can also be pages of the same class or pages from various Web Service Registries

222 Extraction Targets of Information Extraction

Moreover regarding the task of Information Extraction the extraction target has to be considered There are two different extraction targets The first one is the relation of a k-tuple where k is the number of attributes in a record Nevertheless in some cases an attribute of one record may have no instantiation while in other cases the attribute owns multiple instantiations The second extraction target is the complex object with hierarchically organized data Though the ways of depicting the extraction targets in a page are diverse the most common structure is the hierarchical tree This hierarchical tree may contain only one leaf node or one or more lists of leaf nodes which are called internal nodes The structure of a data object may also be flat or nested To be brief if the structure is flat then there is only one leaf node which can also be called the root Otherwise if it is a nested structure then the internal nodes that are involved in this data object span more than two levels

Furthermore in order to make the Web pages readable for human beings and easier to visualize these tables or tuples of the same list or elements of a tuple should be clearly isolated or demarcated However the display of a data object in a Web page is affected by the following conditions [4]

Oslash The attribute of a data object has zero or several values

(1) If there is no value for the attribute of a data object this attribute is called the "none" attribute For example a special offer only available for certain books might be a "none" attribute

(2) If there is more than one value for the attribute of a data object it is called the "multiValue" attribute For instance the name of the author of a book could be a "multiValue" attribute

Oslash The set of attributes (A1 A2 A3 ...) has multiple orderings

That is to say among this set of attributes the position of an attribute might change according to the diverse instances of a data object Thus this attribute is called the "multiOrdering" attribute For instance for the movies before the year 1999 a movie site would enumerate the release date in front of the movies' titles while for the movies after the year 1999 (including 1999) it would enumerate the release date behind the movies' titles

Oslash The attribute has different formats

This means the display format of the data object could be completely distinct with respect to different instances Therefore if the format of an attribute is free then a lot of rules will be needed to deal with all kinds of possible cases This kind of attribute is called the "multiFormat" attribute For example an e-commerce Web site would use a bold font format to present the general prices while using a red color format to display the sale prices Nevertheless there is another situation in which some different attributes of a data object have the same format For example various attributes are presented by means of the <TD> tags in a table presentation Attributes like those can be differentiated by means of the order information of these attributes However for cases where a "none" attribute occurs or "multiOrdering" attributes exist the rules for extracting these attributes have to be revised

Oslash The attribute cannot be decomposed

For easier processing the input documents are sometimes treated as strings of tokens instead of strings of characters In addition some of the attributes cannot even be decomposed into several individual tokens These attributes are called the "untokenized" attributes For example in college course catalogue entries like "COMP4016" or "GEOL2001" the department code and the course number cannot be separated into two different strings of characters such as "COMP" and "4016" or "GEOL" and "2001"

223 The Used Techniques in Information Extraction

The extractor used in the process of Information Extraction aims at providing a single uniform query interface to access information sources like database servers and Web servers It consists of the following phases collecting returned Web pages labeling these Web pages generalizing extraction rules extracting the relevant data and outputting the result in an appropriate format (XML format or relational database) for further information integration For example at first the extractor queries the Web server to gather the returned pages through the HTTP protocol after that it extracts the contents from these HTML documents and integrates them with other data sources Actually the whole process of the extractor follows the steps below

Oslash Step 1

At the beginning the input has to be tokenized However there are two different granularities for the input string tokenization tag-level encoding and word-level encoding The tag-level encoding transforms the tags of an HTML page into general tokens while it transforms every text string between two tags into a special token The word-level encoding does this in another way it treats each word in a document as a token

Oslash Step 2

Next the extraction rules are applied for every attribute of the data object in the Web pages These extraction rules can be induced in terms of a top-down or bottom-up generalization pattern mining or logic programming In addition the type of extraction rules may be indicated by means of regular grammars or logic rules For example some use path-expressions of the HTML parse tree such as html.head.title or html->table[0] some use syntactic or semantic constraints and some use delimiter-based constraints such as HTML tags or literal words

Oslash Step 3

After that all these extracted data are assembled into records

Oslash Step 4

Finally this process is iterated until all data objects in the input are covered
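To make the notion of a delimiter-based extraction rule from Step 2 more concrete, the following short Java sketch treats the HTML tags <td> and </td> as the left and right delimiters of one attribute value and collects every value between them from one table row. The class name and the sample row are purely illustrative and are not taken from any real extractor; a practical extractor would combine many such rules and assemble the extracted values into records as described in Step 3.

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DelimiterRuleExample {

    // Collect every string that appears between the delimiters <td> and </td>.
    static List<String> extractCells(String html) {
        List<String> values = new ArrayList<String>();
        Matcher m = Pattern.compile("<td>(.*?)</td>").matcher(html);
        while (m.find()) {
            values.add(m.group(1).trim());
        }
        return values;
    }

    public static void main(String[] args) {
        String row = "<tr><td>BLZService</td><td>SOAP</td></tr>"; // illustrative input
        System.out.println(extractCells(row));                    // prints [BLZService, SOAP]
    }
}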


23 Pica-Pica Web Service Description Crawler

Pica-Pica is known as a kind of bird species which can also be called pie However the Pica-Pica here is a Web Service Description Crawler which is designed to address the problem of the quality of Web Services for example the evaluation of the descriptive quality of the offered Web Services and of how well these Web Services are described in today's Web Service Registries

231 Needed Libraries of the Pica-Pica Web Service

Description Crawler

This version of the Pica-Pica Web Service Description Crawler is written by Anton Caceres and Josef Spillner and is programmed in the Python language Actually in order to run these scripts to parse the HTML pages two additional libraries are needed Beautiful Soup and Html5lib

n Beautiful Soup

It is an HTML/XML parser for the Python language and it can even turn invalid markup into a parse tree [5] Moreover the following three features make it more powerful

u Bad markup does not choke Beautiful Soup In fact it will generate a parse tree that makes approximately as much sense as the original document Therefore you can obtain the data that you want

u Beautiful Soup has a toolkit that provides simple idiomatic methods for navigating searching and modifying the parse tree Hence you do not need to create a custom parser for every application

u If the document has already specified an encoding then you can ignore it since Beautiful Soup can convert the documents from Unicode to UTF-8 in an automatic way Otherwise what you have to do is just to specify the encoding of the original documents

Furthermore the ways of including Beautiful Soup into an application are displayed in the following [5]

from BeautifulSoup import BeautifulSoup           # For processing HTML

from BeautifulSoup import BeautifulStoneSoup      # For processing XML

import BeautifulSoup                              # To get everything

n Html5lib

It is a Python package which can implement the HTML5 [8] parsing algorithm And in order to

gain maximum compatibility with the current major desktop web browsers this implementation

will be based on the WHATWG (Web Hypertext Application Technology Working Group) HTML5

specification


232 Architecture of the Pica-Pica Web Service

Description Crawler

Figure2-4 Architecture of the Pica-Pica Web Service Description Crawler

Figure 2-4 illustrates the primary architecture of Pica-Pica Web Service Description Crawler It includes

four fundamental components Service Page Grabber component WSDL Grabber component

Property Grabber component and WSML Register component

(1) The Service Page Grabber component takes the URL seed as the input and outputs the link of the service page into the following two components the WSDL Grabber component and the Property Grabber component

(2) The WSDL Grabber component is responsible for obtaining the WSDL document based on the delivered service page link and then checking whether the obtained WSDL document is valid Finally only the valid WSDL documents are passed into the WSML Register component for further processing

(3) The Property Grabber component tries to extract the service properties hosted in the service page if there are any After that all these service properties are saved into an INI file as the information of that service

(4) The functionality of the WSML Register component is to write appropriate WSML documents by means of the valid WSDL documents delivered from the WSDL Grabber component and the optional INI files delivered from the Property Grabber component and afterwards to register them in Conqo

n WSML [9]

It stands for Web Service Modeling Language which provides a framework with different

language variants Hence it is often used to describe the different aspects of the semantic Web

Services according to the conceptual model of WSMO

n WSMO [10]

WSMO whose full name is Web Service Modeling Ontology is dedicated to describing various aspects related to Semantic Web Services based on ontologies An ontology is a formal explicit specification of a shared conceptualization In fact these ontologies are the sticking points that link the agreement of the communities of users with the defined conceptual semantics of the real world

n Conqo [11]


It is a discovery framework that considers not only the Quality of Service (QoS) but also the

context information It uses a Web Service Repository to manage these service descriptions which are based on WSML

233 Implementation of the Pica-Pica Web Service

Description Crawler

This section is going to describe the processes of the implementation of the Pica-Pica Web Service

Description Crawler in detail

(1) Firstly for starting the whole crawling process of the Pica-Pica Web Service Description Crawler an input is needed as the initial seed In this crawler there are five Web Service Registries which are listed below The URL addresses of these five Web Service Registries are used as the input seeds for this Pica-Pica Web Service Description Crawler Moreover in this version of the Pica-Pica Web Service Description Crawler there is a single Python script for each Web Service Registry and the crawling processes of these Web Service Registries' Python scripts are executed one after another

Biocatalogue http://www.biocatalogue.com

Ebi http://www.ebi.ac.uk

Seekda http://www.seekda.com

Service-Repository http://www.service-repository.com

Xmethods http://www.xmethods.net

(2) Then after feeding with the input seed it will step into the next component Service Page Grabber

At first this component will try to read the data from the Web based on the input seed Then it

will establish a parsing tree of the read data in terms of the functions of the Beautiful Soup library

After that this Service Page Grabber component starts to look for the service page link of each service that is published in the Web Service Registry by means of the functions of the Html5lib library When the service page link of one single service is found the component first checks whether this service page link is valid or not Once the service page link is valid it is passed into the following

two components for further processing which are WSDL Grabber component and Property

Grabber component

(3) When the WSDL Grabber component receives a service page link from its previous component it

sets out to extract the WSDL link address for that service through the parsing tree of the data in

this service page Next this component will start to download the WSDL document of that service

in terms of the WSDL link address Thereafter the obtained WSDL document would be stored into

the disk The process of this WSDL Grabber component is continually carried on until there is no more service link passed to it Certainly not all grabbed WSDL documents are effective They may contain bad definitions or a bad namespaceURI or be an empty document or what is worse not even be of XML format Hence in order to pick them out this component further analyzes the involved WSDL documents Then all valid documents are put into a "validWSDLs" folder whereas the invalid documents are put into a folder named "invalidWSDLs" in order to gather statistic information Finally only the WSDL documents in the "validWSDLs" folder are passed into the subsequent component

(4) Moreover since some Web Service Registries provide some additional information about the services such as availability service provider and version the Property Grabber component sets out to extract this information as the service's properties and thereafter saves these properties into an INI file However if there is no available additional information then there is no need to extract service properties and thus there is no INI file for that service Note that in the implementation of this Pica-Pica Web Service Description Crawler only the Python scripts for the Seekda and Service-Repository Web Service Registries have the functions to extract the services' properties while for the other three Web Service Registries there is no such function

(5) Furthermore it is optional to create a report file which contains the statistic information of this process such as the total number of services for one Web Service Registry the number of services whose WSDL document is invalid etc

(6) As has been stated at this point there is a folder with all valid WSDL documents and possibly some INI files Therefore the task of the WSML Register component is now to generate the appropriate WSML documents from these valid WSDL documents and INI files and then to register them in Conqo

24 Conclusions of the Existing Strategies

In this chapter three aspects of the existing strategies are presented the Service-Finder project the Information Extraction technique and the Pica-Pica Web Service Description Crawler

The task of this master program is to obtain the available Web services and their related information from the Web Actually this is a procedure of extracting the needed information for the service such as the service's WSDL document and its properties Therefore the Information Extraction technique which is used to extract information hosted in the Web can be applied in this master program

Moreover the Service-Finder is a large project that is not only able to obtain the available Web services and their related information from the Web but also enriches these crawled data with annotations by means of the Service-Finder ontology and the Service Category ontology and integrates all the information into a coherent semantic model Furthermore it also provides the capabilities for searching and browsing the data with the user interface and gives users service recommendations However the Service-Finder project far exceeds the requirements of this master program Therefore it is just considered as a consultation for this master program

Furthermore since the Pica-Pica Web Service Description Crawler aims only at obtaining the available Web services and their related information it fulfills the primary task of this master program Nevertheless regarding the information about the service this Pica-Pica Web Service Description Crawler extracts only a few properties sometimes even no property at all Consequently in order to improve the quality of the service information as many properties about the service as possible have to be extracted Thence chapter 3 presents an extension of this Pica-Pica Web Service Description Crawler


3 Design and Implementation

In the previous chapter of the State of Art those already existing techniques or implementations are

presented In the following it is time to introduce the basic principle of the proposed approach Deep

Web Services Crawler which is based on these previous existing techniques especially the Pica-Pica

Web Service Description Crawler

31 Deep Web Services Crawler Requirements

This section mainly talks about the goals of the Deep Web Service Crawler approach, the system requirements with respect to the approach and some non-functional requirements

311 Basic Requirements for DWSC

The following list contains the basic requirements which should be achieved

(1) Produce the largest annotated Service Catalogue Here the Service Catalogue is actually a list of Web services that are published in the Web Service Registries It contains not only the WSDL document of the service but also a file with the information about that service Therefore for the purpose of producing the largest annotated Service Catalogue the proposed approach needs to extract as many properties about those Web services as possible Moreover it also has to download the WSDL document that is hosted along with the Web service That is to say these properties are not only the interesting structured properties such as the service name and its WSDL link address but also some other non-functional properties for example the endpoint and its monitoring information

(2) Flexible storage each service has a Service Catalogue that contains all its interesting properties An important question is how to deal with those service properties that is what kinds of schemes will be used to store them Hence in order to store them in a flexible way the proposed approach provides three methods for the storage The first one stores them as an XML file the second method stores them in an INI file and the third method uses a database for the storage
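The three storage methods can be regarded as interchangeable back-ends behind one common interface. The following Java sketch only illustrates this idea under that assumption; the interface and class names are invented for the example and are not taken from the actual implementation, and only the INI back-end is shown in detail.

import java.io.FileWriter;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical abstraction over the three storage schemes; all names are invented.
interface ServiceCatalogueStore {
    void store(String serviceName, Map<String, String> properties) throws IOException;
}

// INI back-end: one [section] per service, one key=value line per property.
class IniFileStore implements ServiceCatalogueStore {
    private final String path;
    IniFileStore(String path) { this.path = path; }

    public void store(String serviceName, Map<String, String> properties) throws IOException {
        try (FileWriter out = new FileWriter(path, true)) {
            out.write("[" + serviceName + "]\n");
            for (Map.Entry<String, String> p : properties.entrySet()) {
                out.write(p.getKey() + "=" + p.getValue() + "\n");
            }
        }
    }
}

public class StorageSketch {
    public static void main(String[] args) throws IOException {
        Map<String, String> props = new LinkedHashMap<String, String>();
        props.put("Provider", "example provider");              // illustrative values only
        props.put("WSDLLink", "http://example.org/service?wsdl");
        new IniFileStore("services.ini").store("ExampleService", props);
        // An XmlFileStore or a DatabaseStore could implement the same interface,
        // so the crawler can switch between the three storage schemes.
    }
}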

312 System Requirements for DWSC

Generally speaking the requirements needed for realizing a programming project include the following

1) Operating System Linux/Unix Windows (XP Vista 2000 etc)

2) Programming language C++ Java Python C etc

3) Programming Tool NetBean Eclipse Visual Studio and so on

However in this master thesis the scripts of the Deep Web Service Crawler approach are written in the Java programming language Besides these code scripts have only been tested on the Windows XP and Linux operating systems but not on other operating systems


313 Non-Functional Requirements for DWSC

In this part several non-functional requirements for the Deep Web Service Crawler approach are

presented

1) Transparency the process of data exploration and data storage should be done automatically without the user's intervention However at first the user should specify the path on the hard disk which will be used to store the outputs of this program

2) Fault-tolerance during the execution of this program errors can inevitably happen Therefore in order to keep the process from being interrupted there must be some necessary error handling for the process to recover

3) Completeness this approach should extract the interesting properties about each Web Service as completely as possible eg endpoint monitoring information etc

In addition since the Pica-Pica Web Service Crawler has already implemented the strategies for the following five URLs the proposed approach must cover not less than those five Web Service Registries

Biocatalogue http://www.biocatalogue.com

Ebi http://www.ebi.ac.uk

Seekda http://www.seekda.com

Service-Repository http://www.service-repository.com

Xmethods http://www.xmethods.net

32 Deep Web Services Crawler Architecture

In this section an overview of the high-level architecture of the Deep Web Services Crawler approach is introduced first Thereafter four subsections are presented that focus on outlining each single component and how they play together

The current components and flows of data in the Deep Web Service Crawler can be summarized as depicted in Figure 3-1 using the continuous arrows It first tries to obtain the available service list page links and related service page links by crawling the Web (Web Service Extractor) Then those gathered links are processed in two separated steps one is going to get the service's WSDL document (WSDL Grabber) and another is for collecting the properties for each service (Property Grabber) Finally all these data will be stored into the storage device (Storage) The whole detailed process in figure 3-1 is illustrated as follows

Oslash Step 1

When the Deep Web Service Crawler starts to run the File Chooser container would require the

user to specify a path in the computer or in any other hard disk The reason for specifying the

path is that this Deep Web Service Crawler program needs a place to store all its outputs

Oslash Step 2

After that the Web Service Extractor is triggered It is the main entry to the specific crawling process Since the Deep Web Service Crawler program is a procedure which is supposed to crawl for Web Services in some given Web Service Registries the URL addresses of these Web Service Registries should be given as an initial seed for this Web Service Extractor process Moreover since the page structures of these Web Service Registries are completely different there is a dependent process for each Web Service Registry

Figure3-1 Overview of the Basic Architecture for the Deep Web Services crawler

Oslash Step 3

According to the given seed two types of links are obtained by the Web Service Extractor component One is the service list page link and the other is the service page link A service list page is a page that contains a list of Web Services and possibly some information about these Web Services while a service page is a page that contains much more information about a single service Finally it forwards these two types of links into the next two components Property Grabber and WSDL Grabber

Oslash Step 4

Then on the one hand the Property Grabber component tries to gather the information about the service that is hosted in the service list page and the service page such as the name of the service its description the ranking of this service etc Finally all the information about the service is collected together as the service properties which are then delivered to the Storage component for further processing

Oslash Step 5

On the other hand the WSDL Grabber tries to obtain the WSDL link from the service list page or the service page That is because for some Web Service Registries the WSDL link is hosted in the service list page as in the Biocatalogue while for other Web Service Registries it is hosted in the service page as in Xmethods Then after the WSDL link has been obtained it is also transmitted to the Storage component for further processing

Oslash Step6

When the service properties information and the WSDL link of the service are received by the Storage component they are stored on the disk The service properties are stored on the disk in one of three different ways as an XML file as an INI file or as one record inside a table of the database For the WSDL link the Storage component first tries to download the page content according to the URL address of the WSDL link If this works successfully the page content of the service is stored as a WSDL document on the disk (a sketch of this download step is given after this list)

Oslash Step 7

Nevertheless this is just a single service crawling process from step 3 to step 6 Thence if there is more than one service or more than one service list page in those Web Service Registries the crawling process from step 3 to step 6 is continued again and again until there is no service or service list page in that Web Service Registry any more

Oslash Step 8

Furthermore after the crawling process for one Web Service Registry finishes a file is generated that contains some statistic information about this crawling process for example the time when the crawling process of this Web Service Registry started and when it finished the total number of Web services in this Web Service Registry how many services have an empty WSDL document the average number of service properties in this Web Service Registry and the average time cost for extracting service properties getting the WSDL document and generating the XML file INI file etc
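As a rough illustration of the download part of Step 6, the following Java sketch fetches the page content behind a WSDL link and, only if the download succeeds, writes it to disk as a WSDL document. It uses the standard java.net.URL API; the class name, the example link and the file name are illustrative and do not reproduce the actual implementation.

import java.io.BufferedReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;

public class WsdlDownloadSketch {

    // Returns the page content behind the WSDL link, or null if the download fails.
    static String download(String wsdlLink) {
        StringBuilder content = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new URL(wsdlLink).openStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                content.append(line).append('\n');
            }
        } catch (IOException e) {
            return null; // fault tolerance: a broken link must not interrupt the crawl
        }
        return content.toString();
    }

    public static void main(String[] args) throws IOException {
        String wsdl = download("http://example.org/service?wsdl"); // example link only
        if (wsdl != null) {
            try (FileWriter out = new FileWriter("ExampleService.wsdl")) {
                out.write(wsdl);
            }
        }
    }
}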

321 The Function of Web Service Extractor Component

The Web Service Extractor component is responsible for gathering service list page links and service

page links It pursues a focused crawl of the Web and only forwards service list page and service page

links to subsequent components for analyzing collecting gathering purposes Therefore it is going to

identify both service list page links and related service page links on these Web Service Registries

As you can see from figure 3-2 a crawl for Web Services needs to start from a seed of URL It is almost

as important as the Web Service Extractor itself as it highly influences the part of the Web which it

needs to crawl The seed can or shall contain eg Web pages where these Web Services are

published or that they talk about Web Services


Figure3-2 Overview the process flow of the Web Service Extractor Component

After being fed with the URL seed the Web Service Extractor component starts to get the link of the service list page from the initial page of this URL seed However this process is different for these five Web Service Registries The following shows the situation in each of these Web Service Registries

Oslash Service-Repository Web Service Registry

In this Web Service Registry the link of the first service list page is the URL address of its seed which means some Web Services can be found in the home page of the Service-Repository Web Service Registry The "first" here implies that there is more than one service list page link in this Web Service Registry Therefore the process of getting service list page links in this registry is continually carried on until no more service list page link exists

Oslash Xmethods Web Service Registry

Although there are Web Services in the home page of the Xmethods Web Service Registry the number of those Web Services is only a small subset of those in this Web Service Registry Moreover in the Xmethods Web Service Registry there is only one page containing all Web Services Therefore the service list page link of that page has to be obtained

Oslash Ebi Web Service Registry

The situation in the Ebi Web Service Registry is a little bit like that in the Xmethods Web Service Registry That is to say there is also one page that contains all Web Services of this Web Service Registry However this page is not the initial page of the input seed Therefore more than one operation step is needed to get the service list page link of that page

Oslash Seekda Web Service Registry

In the Seekda Web Service Registry the Web Services are not contained in the initial page of the input seed The service list page link can be obtained after several additional operation steps However there is a problem with getting the service list page links in this registry Simply put if there is more than one page containing the Web Services then for some unknown reasons the links of the remaining service list pages cannot be obtained In other words only the link of the first service list page can be obtained

Oslash Biocatalogue Web Service Registry

The process of getting service list page in Biocatalogue Web Service Registry is almost the same

as in Seekda Web Service Registry The only difference is that it can get all these service list page

links in Biocatalogue Web Service Registry if there are more than one service list pages

Then after getting the link of the service list page the Web Service Extractor begins to get the link of the service page for each service that is listed in the service list page The reason why it can do this is that there is an internal link for every service which addresses the service page It is worth noting that once a service page link is obtained this component immediately forwards the service page link and the service list page link into the subsequent two components for further processing Nevertheless the process of obtaining the service page links is continuously carried out until all services listed in that service list page are crawled Analogously the process of getting service list pages is also continuously carried out until no more service list page exists

3211 Features of the Web Service Extractor Component

The main features are described in the following paragraphs

1) Obtain service list page links

A central task is to obtain the corresponding service list page links A service list page link is a URL address that leads to a public list of Web Services and some simple information about these Web services like the name of the service an internal URL that links to another page which contains the detailed information about that service and sometimes the link address of the WSDL document

2) Obtain service page links

Once service list page links are found a crucial aspect is to extract the internal link of each service to

aid the task of service information discovery Therefore it is the task of the Web Service Extractor to

harvest the html page content of the service list page so that the service page links which would

contain much more detailed information of the single Web service can be obtained

3212 Input of the Web Service Extractor Component

This component is dependent on some specific input seeds And the only input required for this

component is a seed of URL Actually this URL seed will be one of the URLs that displayed in section

313

3213 Output of the Web Service Extractor Component

The component will produce two service related page links from the Web

l Service list page links

l Service page links


3214 Demonstration for Web Service Extractor

In order to give a comprehensive understanding of the process of the Web Service Extractor component the following gives some figures for explanation Though there are five URL addresses in this section only the URL of the Service-Repository is shown as an example

1) The input seed is the initial URL address of the Service-Repository which is "http://www.service-repository.com"

2) As has already been said in section 321 the first service list link of this Web Service Registry is its input seed "http://www.service-repository.com" Figure 3-3 shows the corresponding service list page of that link

Figure3-3 Service list page of the Service-Repository

Figure3-4 Original source code of the internal link for Web service "BLZService"

Figure3-5 Code overview of getting service page link in Service Repository

Figure3-6 Service page of the Web service "BLZService"


3) Now that the service list page link is already known the next step is to acquire the service page link for each of the services listed in the service list page The text in the red box of figure 3-4 shows the internal link of the Web service "BLZService" However this is not the complete link of the service page It has to be prefixed with the initial URL address of the Service-Repository Web Service Registry "http://www.service-repository.com" The code for getting the service page link of a Web service in the Service Repository is shown in figure 3-5 and a small sketch of this prefixing step follows after this list Therefore the final link of this service page is "http://www.service-repository.com/service/overview/-210897616"

Figure 3-6 is the corresponding service page of that link

4) Afterwards those two links the service list page link and the service page link which are gathered by the Web Service Extractor component are immediately forwarded to the next two components which are the WSDL Grabber component and the Property Grabber component
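The prefixing step described in item 3 above can be sketched as follows in Java. The helper method is illustrative only and does not reproduce the code of figure 3-5; the internal link in the example merely stands in for the one highlighted in figure 3-4.

// Illustrative helper: turn the internal (relative) link of a service into the
// complete service page link by prefixing it with the registry's base URL.
public class ServicePageLinkSketch {

    static String buildServicePageLink(String baseUrl, String internalLink) {
        if (internalLink.startsWith("http")) {
            return internalLink;                  // already a complete link
        }
        if (!internalLink.startsWith("/")) {
            internalLink = "/" + internalLink;
        }
        return baseUrl + internalLink;
    }

    public static void main(String[] args) {
        String base = "http://www.service-repository.com";
        // the internal link below only stands in for the one highlighted in figure 3-4
        System.out.println(buildServicePageLink(base, "service/overview/-210897616"));
    }
}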

322 The Function of WSDL Grabber Component

The WSDL Grabber component acquires the WSDL link that is hosted in the Web based on the service list page link or the service page link The whole process flow is illustrated in figure 3-7

Figure3-7 Overview the process flow of the WSDL Grabber Component

When the WSDL Grabber component receives the inputs delivered from the previous component it starts to get the WSDL link for the service based on these inputs However although the inputs of the WSDL Grabber component are the links of the service page and the service list page only one of them contains the WSDL link for the corresponding service That is to say the WSDL link exists either in the service page or in the service list page The reason why both links need to be delivered into this component is that only one of the five Web Service Registries namely Biocatalogue hosts the WSDL link in the service list page while for the other four Web Service Registries the WSDL link is hosted in the service page Therefore the WSDL links of these four Web Service Registries are obtained in terms of the service page link and for the Biocatalogue Web Service Registry the WSDL link is obtained through the service list page link However there is a problem with getting the WSDL links of the Biocatalogue Web Service Registry To be brief some of the Web services that are listed in the service list page of the Biocatalogue Web Service Registry do not have a WSDL link in other words these services do not have a WSDL document For a situation like this the value of the WSDL link with respect to these Web services is assigned a "NULL" value Nevertheless for the Web Services in the other four Web Service Registries the WSDL link always exists in the service page Eventually whenever the WSDL Grabber component extracts the WSDL link for one single Web service it is immediately forwarded into the Storage component for downloading the WSDL document

3221 Features of the WSDL Grabber Component

The WSDL Grabber component has to provide the following features

l Obtain WSDL links

The WSDL link is the direct way to get to the page that contains the contents of the WSDL document It is actually a URL address but at the end of this URL address there is something like "wsdl" or "WSDL" to indicate that this is an address that points to the page of a WSDL document
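As a small illustration of this convention, the following Java fragment shows one possible way to check whether a URL address looks like a WSDL link; the helper name is illustrative and the check is only a heuristic.

public class WsdlLinkCheck {

    // Heuristic: a WSDL link usually ends with "wsdl" or "WSDL" (often as "?wsdl").
    static boolean looksLikeWsdlLink(String url) {
        return url != null && url.toLowerCase().endsWith("wsdl");
    }

    public static void main(String[] args) {
        System.out.println(looksLikeWsdlLink("http://example.org/BlzService.asmx?WSDL")); // true
        System.out.println(looksLikeWsdlLink("http://example.org/index.html"));           // false
    }
}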

3222 Input of the WSDL Grabber Component

This component requires the following input data

l Service list page link

l Service page link

3223 Output of the WSDL Grabber Component

The component will only produce the following output data

l The URL address of WSDL link for each service

3224 Demonstration for WSDL Grabber Component

This section demonstrates a list of figures in order to give a comprehensive understanding of the process of the WSDL Grabber component Without a doubt it uses the same Web service of the Service-Repository as an example too

1) The input for this WSDL Grabber component is the link of the service page that is obtained from the Web Service Extractor component The address of this link is "http://www.service-repository.com/service/overview/-210897616"

Figure3-8 WSDL link of the Web service "BLZService" in the service page


2) Figure 3-9 shows the corresponding original source code of the WSDL link that is highlighted in figure 3-8

Figure3-9 Original source code of the WSDL link for Web service "BLZService"

3) Figure 3-10 and figure 3-11 show the code that is used to extract the WSDL link shown in figure 3-9 However figure 3-10 is the particular code only for the Service-Repository Web Service Registry for the other four Web Service Registries this will be different The function "getServiceRepositoryWSDLLink" first gets a list of nodes that have the html tag name "b" Then it checks all of these nodes one by one to see whether the text value of the node is "WSDL" If a node fulfills this condition then the attribute value in its sibling which is an "a" element here is extracted as the value of the WSDL link for this Web service

Figure3-10 Code overview of the "getServiceRepositoryWSDLLink" function

Figure3-11 Code overview of the "oneParameter" function

4) Finally after applying the functions shown in figures 3-10 and 3-11 the WSDL link for the Web service "BLZService" can be obtained Its value is "http://services.unitedplanet.de/blz/BlzService.asmx?WSDL"
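The lookup logic described in item 3 can be sketched as follows. The example uses the jsoup HTML parser for brevity, which is an assumption and not necessarily the library used in the actual implementation; it only mimics the described behaviour of the "getServiceRepositoryWSDLLink" function, namely finding a <b> node whose text is "WSDL" and reading the link from the neighbouring <a> element.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class WsdlLinkSketch {

    // Find the <b> node labelled "WSDL" and take the href of the <a> element next to it.
    static String findWsdlLink(String servicePageHtml) {
        Document doc = Jsoup.parse(servicePageHtml);
        for (Element bold : doc.select("b")) {
            if ("WSDL".equalsIgnoreCase(bold.text().trim())) {
                Element sibling = bold.nextElementSibling();
                if (sibling != null && "a".equals(sibling.tagName())) {
                    return sibling.attr("href");  // the WSDL link of this service
                }
            }
        }
        return null;                              // no WSDL link found on this page
    }

    public static void main(String[] args) {
        // a tiny stand-in for the real service page of figure 3-8
        String html = "<p><b>WSDL</b> <a href=\"http://example.org/BlzService?wsdl\">wsdl</a></p>";
        System.out.println(findWsdlLink(html));
    }
}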


323 The Function of Property Grabber Component

The Property Grabber component is a module which is used to extract and gather all the Web service information that is hosted in the Web actually the information shown in the service list page and the service page In the end all the obtained Web service information is collected together as the service properties which are then delivered into the Storage component for storing The detailed process flow of the Property Grabber component is illustrated in figure 3-12

The inputs of the Property Grabber component are the same as the inputs of the WSDL Grabber component However there is still a little difference between them with respect to the seed As has already been mentioned in section 322 for the WSDL Grabber component only one of the inputs is sufficient to get the WSDL link whereas the Property Grabber component needs both of them as seeds Therefore once the Property Grabber component receives the needed inputs it starts to extract the information of that single Web service

Figure3-12 Overview the process flow of the Property Grabber Component

Therefore after the Property Grabber component receives the inputs it starts to extract the service

information for the Web service Generally speaking the service information consists of four aspects

which are structured information endpoint information monitoring information and whois

information respectively

(1) Structured Information

The structured information can be obtained by extracting the information hosted in the service page and the service list page It is the basic descriptive information about the service such as the name of the service the URL address through which the WSDL document can be obtained the description introducing the service the provider who provides the service its rating and the server who owns the service etc However the elements constituting this structured information are diverse among the different Web Service Registries For example the rating information of a Web service exists in the Service-Repository Web Service Registry while the Xmethods Web Service Registry does not have this information In addition even for Web services in the same Web Service Registry some elements of the structured information may not exist For instance one service in a Web Service Registry may have a description while another service in the same Web Service Registry does not Tables 3-1 to 3-5 show the structured information that should be extracted from these five Web Service Registries Moreover if the style of a service in the Biocatalogue Web Service Registry is SOAP there is some additional information describing the SOAP operations of this service while if it is REST this additional information describes the REST operations This should also be considered as a part of the structured information Table 3-6 and table 3-7 illustrate the information for these two different kinds of operations

Service Name WSDL Link WSDL Version

Provider Server Rating

Homepage Owner Homepage Description

Table 3-1Structured Information of Service-Repository Web Service Registry

Service Name WSDL Link Provider

Service Style Homepage Implementation Language

Description User Description Contributed Client Name

Type of this Client Publisher for this client Used Toolkit of this client

Used Language of this client Used Operation System of this client

Table 3-2Structured Information of Xmethods Web Service Registry

Service Name WSDL Link Server

Provider Providerrsquos Country Service Style

Rating Description User Description

Service Tags Documentation (within WSDL)

Table 3-3Structured Information of Seekda Web Service Registry

Service Name WSDL Link Port Name

Service URL Address Implementation Class

Table 3-4Structured Information of Ebi Web Service Registry

Service Name WSDL Link Style

Provider Providerrsquos Country View Times

Favorite Times Submitter Service Tags

Total Annotation Provider Annotation Member Annotation

Registry Annotation Base URL SOAP Lab Server Base URL

Description User Description Category

Table 3-5Structured Information of Biocatalogue Web Service Registry


SOAP Operation Name Inputs and Outputs Operation Description

Operation Tags Part of Which Service

Table 3-6SOAP Operation Information of Biocatalogue Web Service Registry

REST Operation Name Service Tags Used Template

Operation Description Part of Which Service Part of Which Endpoint Group

Table 3-7 REST Operation Information of Biocatalogue Web Service Registry

(2) Endpoint Information

Endpoint information is the port information of the Web service which can be extracted only through the service page However since different Web Service Registries have different structures of the endpoint information with respect to the Web service some elements of the endpoint information are very diverse One thing needs attention the Ebi Web Service Registry does not have endpoint information for any of the Web services published in this registry Moreover although the Web services in the same Web Service Registry have the same structure of this endpoint information some elements of the endpoint information may be missing or empty Furthermore these Web Service Registries may even have no endpoint information for some Web services published by them Nevertheless whenever there is endpoint information for a Web service there is at least one element which is the URL address of the endpoint The following table 3-8 shows the endpoint information that should be extracted from these five Web Service Registries

Web Service Registry Name | Elements of the Endpoint Information
Service Repository | Endpoint Name, Endpoint URL, Endpoint Critical, Endpoint Type, Bound Endpoint
Xmethods | Endpoint URL, Publisher of this Endpoint, Contact Email of this Publisher, Implementation Language of this Endpoint
Seekda | Endpoint URL
Biocatalogue | Endpoint Name, Endpoint URL
Table 3-8: Endpoint Information of these five Web Service Registries

(3) Monitoring Information
Monitoring information is the measured statistical information about the Web service. It is worth noting that the Ebi and Xmethods Web Service Registries do not provide monitoring information for the Web services published by them, while in the other three Web Service Registries only a few Web services may lack this information. Table 3-9 lists the monitoring information for these three Web Service Registries.

Web Service Registry Name | Elements of the Monitoring Information
Service Repository | Service Availability, Number of Downs, Total Uptime, Total Downtime, MTTR, MTBF, RTT Max of Endpoint, RTT Min of Endpoint, RTT Average of Endpoint, Ping Count of Endpoint
Seekda | Service Availability, Begin Time of Monitoring
Biocatalogue | Monitored Status of Endpoint, Overall Status of Service
Table 3-9: Monitoring Information of these five Web Service Registries

(4) Whois Information
Whois information is not extracted from the content hosted in the service page or the service list page. It is descriptive information about the service domain, which is obtained via the address of the WSDL link. Consequently, the process of getting the Whois information starts by deriving the service domain first. The final value of the service domain must not contain prefixes such as "http", "https" or "www"; it has to be the plain registrable domain. After that, the service domain database is queried by sending the value of the service domain to a Whois client, which in this case is simply a Web site on the Internet, for example "http://www.whois365.com/cn/domain". If information about that service domain exists, a list of this information is returned as the output. However, the structure of the returned information differs from one service domain to another. Therefore, the most challenging part is that the extraction process has to handle each of these different forms of the returned information. Table 3-10 lists the Whois information that needs to be extracted for all five Web Service Registries; a small sketch of the domain-extraction and querying step follows the table.

Service Domain URL, Domain Name, Domain Type, Domain Address, Domain Description, State, Postal Code, City, Country, Country Code, Phone, Fax, Email, Organization, Established Time
Table 3-10: Whois Information for these five Web Service Registries
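The following minimal sketch illustrates how the service domain could be derived from a WSDL link and then sent to a Web-based Whois client. The class name, the helper methods and the exact query URL are illustrative assumptions and not part of the actual implementation.

    import java.net.URL;

    public class WhoisHelper {

        // Reduce a WSDL link such as "http://www.thomas-bayer.com/axis2/services/BLZService?wsdl"
        // to the bare service domain "thomas-bayer.com".
        public static String extractServiceDomain(String wsdlLink) throws Exception {
            String host = new URL(wsdlLink).getHost();      // e.g. "www.thomas-bayer.com"
            if (host.startsWith("www.")) {
                host = host.substring(4);                    // drop the leading "www."
            }
            return host;
        }

        // Query a whois Web site (hypothetical endpoint) and return the free-text answer,
        // which is parsed afterwards into the fields of Table 3-10.
        public static String queryWhois(String serviceDomain) throws Exception {
            URL whoisUrl = new URL("http://www.whois365.com/cn/domain/" + serviceDomain);
            try (java.io.BufferedReader in = new java.io.BufferedReader(
                    new java.io.InputStreamReader(whoisUrl.openStream()))) {
                StringBuilder answer = new StringBuilder();
                String line;
                while ((line = in.readLine()) != null) {
                    answer.append(line).append('\n');
                }
                return answer.toString();
            }
        }
    }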

Finally, the information of all four aspects is collected together and then delivered to the Storage component for further processing.

3231 Features of the Property Grabber Component
The Property Grabber component has to provide the following features:
• Obtain basic information
Generally speaking, the more information is available about a Web service, the better one can judge how good this Web service is. Hence, the Property Grabber component has to extract all the basic information hosted in the service list page and the service page. This basic information comprises the structured information, the endpoint information and the monitoring information.
• Obtain Whois information
Since more information about a Web service also allows a better assessment of its quality, it is desirable to extract as much information about the Web service as possible. Therefore, in addition to the basic information, the Property Grabber component also obtains additional information, called Whois information, such as the type of the domain, the name of the person in charge, the postal code of the domain, city, phone, fax, detailed address, email, etc.


3232 Input of the Property Grabber Component
This component requires the following input data:
• Service list page link
• Service page link

3233 Output of the Property Grabber Component
The component produces the following output data:
• Structured information of each service
• Endpoint information of each service, if it exists
• Monitoring information of the service and its endpoint, if it exists
• Whois information of the service domain
All this information is collected together as the properties of each service. Thereafter, the collected properties are sent to the Storage component.

3234 Demonstration of the Property Grabber Component
Figures 3-13 to 3-16 depict the fundamental procedure of the Property Grabber component. To simplify the explanation, the example shown in this section uses the same Web service as before.
1) The inputs of the Property Grabber component are the links of the service list page and the service page received from the Web Service Extractor component. These links are "http://www.service-repository.com" and "http://www.service-repository.com/service/overview/-210897616".
2) When the "GetProperty" function is triggered, it calls its four sub-functions shown in figure 3-12.
Figure 3-13: Structured properties of the service "BLZService" in the service list page


Figure 3-14: Structured properties of the service "BLZService" in the service page
3) First, the "getStructuredProperty" function tries to extract the structured information highlighted by the red boxes in figure 3-13 and figure 3-14. However, several elements of the structured information have identical content, for example the description shown in the service page and in the service list page. Hence, in order to save extraction time and storage space, elements with identical content are extracted only once. Moreover, the rating information has to be transformed from non-descriptive content into descriptive text, because it is represented by several star images; a small sketch of this transformation follows Table 3-11. The final results of the extracted structured information for this Web service are shown in table 3-11. Since there is no descriptive information for the provider, the homepage and the owner homepage, their values are assigned as "NULL".

Service Name: BLZService
WSDL Link: http://www.thomas-bayer.com/axis2/services/BLZService?wsdl
WSDL Version: 0
Server: Apache-Coyote/1.1
Description: BLZService
Rating: Four stars and a half
Provider: NULL
Homepage: NULL
Owner Homepage: NULL
Table 3-11: Extracted Structured Information of the Web Service "BLZService"
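As a simple illustration of the rating transformation mentioned in step 3, the sketch below maps a counted number of full and half star images to a descriptive text. How the star images are counted depends on the HTML of the registry and is assumed here; the class and method names are hypothetical.

    public class RatingConverter {

        private static final String[] WORDS =
                {"Zero", "One", "Two", "Three", "Four", "Five"};

        // Convert e.g. 4 full stars and 1 half star into "Four stars and a half".
        public static String toDescriptiveText(int fullStars, boolean halfStar) {
            String text = WORDS[fullStars] + " star" + (fullStars == 1 ? "" : "s");
            return halfStar ? text + " and a half" : text;
        }

        public static void main(String[] args) {
            System.out.println(toDescriptiveText(4, true)); // prints "Four stars and a half"
        }
    }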

4) Secondly, the "getEndpointProperty" function extracts the endpoint information displayed in the red box of figure 3-15. Although there is more than one endpoint record in the list, only one of them is extracted as the endpoint information. The reason is that this master program is intended to extract as much information as possible while avoiding redundant information; therefore only one record is extracted even if several endpoint records exist. Table 3-12 shows the final results of the endpoint information of this Web service.
Figure 3-15: Endpoint information of the Web service "BLZService" in the service page


Endpoint Name: BLZServiceSOAP12port_http
Endpoint URL: http://www.thomas-bayer.com:80/axis2/services/BLZService
Endpoint Critical: True
Endpoint Type: production
Bound Endpoint: BLZServiceSOAP12Binding
Table 3-12: Extracted Endpoint Information of the Web service "BLZService"

5) Then the monitoring information is extracted by invoking the function "getMonitoringProperty". Figure 3-16 displays two groups of monitoring properties: the upper red box contains the monitoring information about the Web service itself, and the lower red box lists the monitoring information of its endpoints. As mentioned before, only one endpoint statistic record is extracted. Besides, as can be seen from figure 3-16, there are two availability values; both represent the availability of this Web service, just like the availability shown in figure 3-14, so extracting one of them is sufficient. Table 3-13 shows the final results of this extraction process.
Figure 3-16: Monitoring information of the service "BLZService" in the service page

Service Availability: 100
Number of Downs: 0
Total Uptime: 1 day 19 hours 19 minutes
Total Downtime: 0 second
MTBF: 1 day 19 hours 19 minutes
MTTR: 0 second
RTT Max of Endpoint: 141 ms
RTT Min of Endpoint: 0 ms
RTT Average of Endpoint: 577 ms
Ping Count of Endpoint: 112
Table 3-13: Extracted Monitoring Information of the Web service "BLZService"

6) Next, in order to extract the Whois properties, the "getWhoisProperty" function first has to derive the service domain from the WSDL link. For this Web service the obtained service domain is "thomas-bayer.com". It then sends this service domain as input to the Whois client for the querying process, which returns a list of information for that service domain, see figure 3-17. Table 3-14 shows the extracted Whois information.


Figure 3-17: Whois information of the service domain "thomas-bayer.com"

Service Domain URL: thomas-bayer.com
Domain Name: Thomas Bayer
Domain Type: NULL
Domain Address: Moltkestr. 40
Domain Description: NULL
State: NULL
Postal Code: 54173
City: Bonn
Country: NULL
Country Code: DE
Phone: +4922855525760
Fax: NULL
Email: info@predic8.de
Organization: predic8 GmbH
Established Time: NULL
Table 3-14: Extracted Whois Information of the service domain "thomas-bayer.com"

7) Finally, the information of all four aspects is collected together as the service properties, and these service properties are then forwarded to the Storage component.

324 The Function of the Storage Component
The Storage component uses the WSDL link received from the WSDL Grabber component to download the WSDL document from the Web and store it on disk. In addition, the service properties received from the Property Grabber component are stored on disk in three different formats by this component. Figure 3-18 illustrates the process flow of the Storage component.
When the Storage component receives the needed inputs, its mediator function "Storager" is triggered. It transforms the service properties into three different output formats and stores them on disk: an XML file, an INI file and database records. Besides, it tries to download the WSDL document from the URL address of the WSDL link and then stores the obtained WSDL document on disk as well. The "Storager" function is composed of the four sub-functions "getWSDL", "generateXML", "generateDatabase" and "generateINI", each of which is in charge of one aspect of the storage task.


Figure 3-18: Overview of the process flow of the Storage component

(1) "getWSDL" sub-function
The task of the "getWSDL" sub-function is to download the WSDL document and store it on disk. First of all, it has to obtain the content of the WSDL document. This is done as follows: the "getWSDL" sub-function first checks whether the value of the received WSDL link equals "NULL". As already described in section 322, if a Web service has no WSDL link, the value of its WSDL link is assigned as "NULL". In this case a WSDL document is created whose name is the service name appended with the mark "No WSDL Document"; obviously this document has no content, it is an empty document. If the service does have a WSDL link, this sub-function tries to connect to the Internet using the URL address of that WSDL link. If it succeeds, the content hosted on the Web is downloaded, stored on disk and named only with the name of the service. Otherwise a WSDL document is created whose name is prefixed with "Bad" before the service name. A simplified sketch of these three cases is given below.
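The following sketch only illustrates the three cases under simplified assumptions (file naming, no statistic and log bookkeeping); the actual implementation is the one shown in figure 3-19.

    import java.io.InputStream;
    import java.net.URL;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.nio.file.StandardCopyOption;

    public class WsdlDownloader {

        // path: target directory, name: service name, linkStr: WSDL link or "NULL"
        public static void getWSDL(String path, String name, String linkStr) {
            try {
                if (linkStr == null || linkStr.equals("NULL")) {
                    // Case 1: the service has no WSDL link at all -> empty marker document
                    Files.write(Paths.get(path, name + "[No WSDL Document].wsdl"), new byte[0]);
                } else {
                    // Case 2: try to download the document from the Web
                    try (InputStream in = new URL(linkStr).openStream()) {
                        Files.copy(in, Paths.get(path, name + ".wsdl"),
                                StandardCopyOption.REPLACE_EXISTING);
                    }
                }
            } catch (Exception e) {
                // Case 3: the link exists but cannot be read -> document prefixed with "Bad"
                try {
                    Files.write(Paths.get(path, "Bad" + name + ".wsdl"), new byte[0]);
                } catch (Exception ignored) { }
            }
        }
    }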

(2) "generateXML" sub-function
The "generateXML" sub-function takes the service properties as input, transforms them into an XML file and stores it on disk under the name of the service plus the extension ".xml". XML stands for eXtensible Markup Language, a markup language designed to transport and store data. The first line of an XML file is the XML declaration, which defines the XML version and the encoding used. For example, <?xml version="1.0" encoding="UTF-8"?> means that this is an XML file whose version is 1.0 and whose encoding is the 8-bit Unicode Transformation Format character set. Besides, an XML file contains XML elements, which span everything from the element's start tag to the element's end tag. An XML element can also contain other elements, simple text, or a mixture of both. However, an XML file must contain exactly one root element as the parent of all other elements. An illustrative sketch of such a transformation follows.
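The following sketch indicates how such an XML file with a declaration and a single root element could be produced from the service properties; it assumes a simple name-value map as input and omits character escaping, so it is not the actual "generateXML" code (which is shown in figure 3-20).

    import java.io.PrintWriter;
    import java.util.Map;

    public class XmlWriter {

        // Write the collected service properties as a flat XML file with one root element.
        public static void generateXML(String fileName, Map<String, String> properties) throws Exception {
            try (PrintWriter out = new PrintWriter(fileName, "UTF-8")) {
                out.println("<?xml version=\"1.0\" encoding=\"UTF-8\"?>");
                out.println("<!-- generated description of one Web service -->");
                out.println("<service>");
                for (Map.Entry<String, String> p : properties.entrySet()) {
                    String tag = p.getKey().replace(' ', '_'); // element names must not contain spaces
                    out.println("  <" + tag + ">" + p.getValue() + "</" + tag + ">");
                }
                out.println("</service>");
            }
        }
    }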

(3) "generateINI" sub-function
The "generateINI" sub-function also takes the service properties as input, but it transforms them into an INI file and stores it on disk under the name of the service plus the extension ".ini" ("ini" stands for initialization). The INI file format is a de facto standard for configuration files; such files are simple text files with a basic structure. Generally speaking, an INI file contains three different kinds of parts: sections, parameters and comments. A parameter is the basic element of an INI file; its format is a key-value pair (also called name-value pair), delimited by an equals sign "=", where the key or name always appears to the left of the equals sign. A section is like a room that groups its parameters together; it always appears on a single line within a pair of square brackets "[ ]", and sections may not be nested. A comment is descriptive text that begins with a semicolon ";"; anything between the semicolon and the end of the line is ignored. A short illustrative sketch follows.
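Analogously, the sketch below shows how an INI file with comments, one section and key-value pairs could be written; the input type and the comment texts are assumptions, and the real code is given in figure 3-21.

    import java.io.PrintWriter;
    import java.util.Map;

    public class IniWriter {

        // Write the service properties as an INI file: comment lines, one section, key=value pairs.
        public static void generateINI(String fileName, String serviceName,
                                       Map<String, String> properties) throws Exception {
            try (PrintWriter out = new PrintWriter(fileName, "UTF-8")) {
                out.println("; description information of one Web service");  // comments
                out.println("; generated by the Deep Web Service Crawler");
                out.println("[" + serviceName + "]");                          // the section
                for (Map.Entry<String, String> p : properties.entrySet()) {
                    out.println(p.getKey() + "=" + p.getValue());              // key-value pairs
                }
            }
        }
    }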

(4) "generateDatabase" sub-function
The inputs of the "generateDatabase" sub-function are the same as those of the previous two sub-functions. Instead of transforming them into a file like XML or INI, the "generateDatabase" sub-function turns them into database records by means of SQL statements. SQL stands for Structured Query Language, a database language designed for accessing and manipulating data in a database. Most actions performed on a database are expressed with SQL statements; the primary statements include INSERT INTO, DELETE, UPDATE, SELECT, CREATE, ALTER and DROP. In order to transform the service properties into database records, this sub-function first has to create a database using the CREATE DATABASE statement. Then it creates a table to store the data; a table is a collection of related data entries and consists of columns and rows. Since the data volume for all five Web Service Registries is not very large, one database table is sufficient for storing the service properties. For this reason, the column names for the service properties of all five Web Service Registries have to be uniform and well-defined. Afterwards, the service properties of each single service can be inserted into the table as one record with the INSERT INTO statement of SQL. An illustrative sketch is given below.
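For illustration, the SQL statements could be embedded in Java roughly as follows; the connection URL, the database driver and the column names are placeholders only, whereas the actual table uses one "Text" column per service property as described above.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.Statement;

    public class DatabaseWriter {

        public static void storeService(String name, String wsdlLink, String description) throws Exception {
            // Connection URL, user and password are placeholders.
            try (Connection con = DriverManager.getConnection(
                    "jdbc:mysql://localhost/servicecrawler", "user", "password")) {
                try (Statement st = con.createStatement()) {
                    // All property columns use TEXT because the length of a property value is unknown.
                    st.executeUpdate("CREATE TABLE IF NOT EXISTS services ("
                            + "id INT AUTO_INCREMENT PRIMARY KEY, "
                            + "service_name TEXT, wsdl_link TEXT, description TEXT)");
                }
                try (PreparedStatement ps = con.prepareStatement(
                        "INSERT INTO services (service_name, wsdl_link, description) VALUES (?, ?, ?)")) {
                    ps.setString(1, name);
                    ps.setString(2, wsdlLink);
                    ps.setString(3, description);
                    ps.executeUpdate();
                }
            }
        }
    }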

3241 Features of the Storage Component
The Storage component has to provide the following features:
• Generate different output formats
The final result of this master program is to store the information about each service on disk for future work. The Storage component provides three different formats for storing the service properties of the services published in the Web Service Registries. This makes the storage of the services both flexible and durable.
• Obtain the WSDL document
The important task of this component is to check whether the WSDL document can be obtained from the WSDL link, because the WSDL document plays a decisive role in determining the quality of the service. The Storage component provides the ability to deal with the different situations that can occur while obtaining the WSDL document.


3242 Input of the Storage Component
This component requires the following input data:
• WSDL link of each service
• The property information of each service

3243 Output of the Storage Component
The component produces the following output data:
• WSDL document of the service
• XML document, INI file and table records in the database

3244 Demonstration of the Storage Component
The following figures show the fundamental implementation code of the Storage component.

The detailed description is as follows:
1) As can be seen from figures 3-19 to 3-21, there are several common elements among these implementation codes. The first common element concerns the parameters defined in each of these sub-functions, namely "path" and "SecurityInt". The parameter "path" is an absolute path on the computer disk; it is used for the storing procedure and has already been specified by the user at the beginning of the whole program. The parameter "SecurityInt" is an increasing Integer that is used as part of the name of the service; it prevents services with the same name from overwriting each other on disk. The content marked in red in these figures is the second common element; its function is to create a file or document for the service with the corresponding parameters.
2) Figure 3-19 displays the implementation code of the "getWSDL" sub-function, which is designed to get the WSDL document of the service based on its WSDL link. The parameter "name" is the name of the service and is used as the name of the WSDL document. The parameter "linkStr" is the actual WSDL link, which is the most important parameter for this sub-function. The other two parameters, "statistic" and "log", are objects for the text files called "Statistic Information" and "Log Information" respectively. The "Statistic Information" text file records the statistical data of the services in each Web Service Registry, such as the overall number of properties for that Web Service Registry, the overall number of services, the number of services that have no WSDL link, the number of services whose WSDL document contains no content, the number of services whose WSDL link is not available, etc. The "Log Information" text file records the results of the processing steps and the problems encountered, for example which service is currently being crawled, which Web Service Registry it belongs to, the reason why the WSDL document of a service could not be obtained, the reason why a service is unreachable, and so on.


Figure 3-19: Implementation code for getting the WSDL document

3) Figure 3-20 and figure 3-21 show the code for turning the service properties into the XML file and the INI file and for storing these two files on disk afterwards. The parameter "vec" is a Vector of the "PropertyStruct" data type; "PropertyStruct" is a class consisting of the two variables name and value.
Figure 3-20: Implementation code for generating the XML file


Figure 3-21: Implementation code for generating the INI file
4) The code in figure 3-22 and figure 3-23 shows the process that turns the service properties of all services in these five Web Service Registries into records of the database. To do so, a database has to be created first. The name of the database can be chosen arbitrarily as long as it conforms to the naming rules of the database system; the same holds for the name of the table. Figure 3-22 displays the procedure of creating a table in the database. Since it is hard to predict the length of each service property, the data type of all service property columns is set to "Text". Figure 3-23 is the code for inserting the service properties into the table with the "insert into" statement.
Figure 3-22: Implementation code for creating the table in the database


Figure 3-23: Implementation code for generating table records

33 Multithreaded Programming for DWSC
Multithreaded programming is a built-in capability of the Java language. A multithreaded program contains two or more separate parts that can execute concurrently; each such part is called a thread. The use of multithreading makes it possible to write programs that use the system resources efficiently, for example by keeping the idle time of the CPU to a minimum.
In this master program there are five Web Service Registries whose services need to be crawled. Moreover, the number of services published in each Web Service Registry differs considerably, so the running time spent on each Web Service Registry differs as well. Without multithreading, a Web Service Registry with few services would have to wait until another Web Service Registry with many more services has finished. Therefore, in order to avoid this waiting time and to maximize the use of the system resources, multithreaded programming is applied to this master program: the program creates one thread for each Web Service Registry, and these threads are executed independently. A minimal sketch of this idea is given below.
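The following minimal sketch, with a placeholder crawling routine per registry, illustrates this design; the class and method names are assumptions for illustration only.

    public class RegistryCrawlerThreads {

        public static void main(String[] args) throws InterruptedException {
            String[] registries = {"Service Repository", "Ebi", "Xmethods", "Seekda", "Biocatalogue"};
            Thread[] threads = new Thread[registries.length];

            for (int i = 0; i < registries.length; i++) {
                final String registry = registries[i];
                threads[i] = new Thread(() -> crawlRegistry(registry)); // one thread per registry
                threads[i].start();
            }
            for (Thread t : threads) {
                t.join(); // wait until all registries are finished
            }
        }

        private static void crawlRegistry(String registry) {
            // placeholder for the actual crawling of one Web Service Registry
            System.out.println("Crawling " + registry);
        }
    }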

34 Sleep Time Configuration for the Web Service Registries
Since this master program is intended to download the WSDL documents and extract the service information of the Web services published in the Web Service Registries, it inevitably affects the performance of these registries. In addition, in order not to exceed their throughput capability, the Web Service Registries restrict the rate of access. Because of that, unknown errors can occur while this master program is executing: for instance, the program may halt at one point without obtaining any further WSDL documents or service information, the WSDL documents of some services of a registry may not be obtainable, or some service information may be missing. Therefore, in order to obtain the largest possible number of Web services published in these five Web Service Registries without affecting their throughput, the access rate has to be configured for each service of every Web Service Registry.
Consequently, before entering the essential procedure for each single service of these Web Service Registries, this master program calls the system's built-in function "sleep(long milliseconds)". It is a public static function that causes the currently executing thread to sleep for the specified number of milliseconds, in other words to temporarily cease execution for a while; a short usage sketch follows Table 3-15. The following table shows the time interval of the sleep function for each Web Service Registry.

Web Service Registry Name | Time Interval (milliseconds)
Service Repository | 8000
Ebi | 3000
Xmethods | 10000
Seekda | 20000
Biocatalogue | 10000
Table 3-15: Sleep time of these five Web Service Registries
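For illustration, the call could be wrapped as in the following sketch, with the interval taken from Table 3-15; the helper class is hypothetical.

    public class Throttle {

        // Pause the current registry thread before requesting the next service page.
        public static void pause(long intervalMillis) {
            try {
                Thread.sleep(intervalMillis); // e.g. 8000 for Service Repository, 20000 for Seekda
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // restore the interrupt flag
            }
        }
    }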


4 Experimental Results and Analysis
This chapter presents the quantitative experimental results of the prototype described in chapter 3, together with an analysis and explanation of these results. In order to obtain reasonably accurate results, the experiments were carried out more than five times; all data displayed in the following tables and charts are the averages of these runs.

41 Statistic Information for the Different Web Service Registries
This section discusses the amount of Web services published in these five Web Service Registries. It includes the overall number of Web services published in each Web Service Registry and the number of unavailable Web services, which have been archived because they may not be active anymore or are close to being inactive. Table 4-1 shows the service amount statistics of these five Web Service Registries.

Web Service Registry Name | Service Repository | Ebi | Xmethods | Seekda | Biocatalogue
Overall Services | 57 | 289 | 382 | 853 | 2567
Unavailable Services | 0 | 0 | 0 | 0 | 125
Table 4-1: Service amount statistics of these five Web Service Registries

In order to provide an intuitive view of the service amount statistics of these five Web Service Registries, the data of table 4-1 are also shown as a bar chart, see figure 4-1. As the bar chart shows, on the one hand the overall number of Web services increases from the Service Repository Web Service Registry to the Biocatalogue Web Service Registry. The Biocatalogue Web Service Registry owns by far the largest number of Web services, which indicates that it has a much greater ability to provide Web services to users, since it contains far more services than the other four Web Service Registries. On the other hand, there are no unavailable services in any Web Service Registry except for the Biocatalogue Web Service Registry; in other words, only the Biocatalogue Web Service Registry contains Web services that can no longer be used by the users. To some degree this is wasteful, because these services cannot be used anymore and still consume network resources on the Web. Therefore all these unavailable services should be eliminated in order to reduce the waste of network resources.


Figure 4-1: Service amount statistics of these five Web Service Registries

42 Statistic Information for the WSDL Documents

Web Service Registry Name | Service Repository | Ebi | Xmethods | Seekda | Biocatalogue
Failed WSDL Links | 1 | 0 | 23 | 145 | 32
Without WSDL Links | 0 | 0 | 0 | 0 | 16
Empty Content | 0 | 0 | 2 | 0 | 2
Table 4-2: Statistic information for the WSDL documents

Table 4-2 and figure 4-2 show the error statistics of the WSDL documents of the Web services in these five Web Service Registries. Three aspects are covered. The first one is the number of "Failed WSDL Links" of the Web services in these Web Service Registries, that is, the overall number of Web services whose WSDL links are invalid. In other words, it is impossible to obtain the WSDL documents of these Web services via the URL addresses of their WSDL links, so no WSDL document is created for them. The second aspect is the number of Web services "Without WSDL Links", which is the overall number of Web services in each Web Service Registry that have no WSDL link at all. For such Web services the value of the WSDL link is "NULL"; a WSDL document is still created, but it has no content and its name contains the string "[No WSDL Document]". The third aspect is the "Empty Content" count, which represents the overall number of Web services that have WSDL links whose URL addresses


are valid but whose WSDL documents contain no content; in this case a WSDL document whose name contains the string "(BAD)" is created.

Figure 4-2: Statistic information for the WSDL documents

43 Comparison of the Different Average Numbers of Service Properties
This section compares the average number of service properties in these five Web Service Registries. This average number of service properties is calculated by means of the following equation:

    ASP = ONSP / ONS        (1)

where
ASP is the average number of service properties for one Web Service Registry,
ONSP is the overall number of service properties in that Web Service Registry, and
ONS is the overall number of Web services that have already been crawled from the corresponding Web Service Registry.

Figure 4-3 shows the average number of service properties per Web service in these five Web Service Registries. As mentioned before, one measure of the quality of the Web services in a Web Service Registry is the amount of service information: the more information is available about a Web service, the better one knows that service, and consequently the registry can offer Web services of better quality to its users. As can be seen from figure 4-3, the Service Repository and Biocatalogue Web Service Registries own a larger number of service properties than the other three Web Service Registries. This directly reflects that these two Web Service Registries provide more detailed information about the Web services published in them, so users can more easily choose the services they need and are more likely to use the Web services published there. By contrast, the Xmethods and Seekda


Web Service Registries, which provide less service information about their Web services, offer a lower quality for these services; therefore users may be less inclined to use the Web services provided in these two Web Service Registries, not to mention the Web services published in the Ebi Web Service Registry.

Figure 4-3: Average number of service properties

Following the description presented in section 323, the causes of the differing numbers of service properties in these Web Service Registries consist of several points. First, the amount of structured information differs among these five Web Service Registries, and part of the information for some Web services in a registry can be missing or empty; for example, the structured information that can be extracted for the Web services in the Ebi Web Service Registry differs greatly from that in the Biocatalogue Web Service Registry. Secondly, there is a certain amount of endpoint information in almost all Web Service Registries except the Ebi Web Service Registry, which more or less reduces the overall number of service properties. Thirdly, some Web Service Registries, such as Xmethods and Ebi, do not provide monitoring information, whereas the Service Repository Web Service Registry in particular offers a large amount of monitoring information that can be extracted from the Web. The last point, obviously, is the amount of Whois information for these Web services: if the database of the Whois client does not contain information about the service domain of a Web service, then no Whois information can be extracted, and even if information about the service domain exists, its amount can vary greatly. Therefore, if many service domains of the Web services in a registry have little or no Whois information, the average number of service properties in that registry decreases considerably.
As a result, in order to help users become better acquainted with the Web services provided in a Web Service Registry and to distinguish these Web services, each Web Service Registry should do its best to offer more and more information about each of its published Web services.


44 Different Outputs for the Web Services
The fundamental task of this master program is to obtain the WSDL documents of the Web services hosted in the Web Service Registries, to extract and gather the properties of these Web services, and to store them on disk afterwards. Therefore, this section describes the different outputs of this master program, which comprise the WSDL documents of the Web services, the generated XML and INI files, and the data records of the service properties in the database.

Figure 4-4: WSDL document format of one Web service

The WSDL document of a Web service is simply read from the Web according to the URL address of its WSDL link, and these data are then stored on disk as the WSDL document. The name of the WSDL document is the service name plus the ending ".wsdl", which marks it as a WSDL document. In order to distinguish WSDL documents whose names would otherwise be the same although their contents differ, the name of each obtained WSDL document within one Web Service Registry additionally contains a unique Integer in front of the name. Figure 4-4 shows the valid WSDL document format of the Web service; its name is "1BLZService.wsdl".
Besides, the obtained service properties are transformed into an XML file, an INI file and data records in the database. Figure 4-5, figure 4-6 and figure 4-7 show these three different output formats respectively. As can be seen from figure 4-5, this is the INI file of the Web service, named "1BLZService.ini"; the Integer is the same as in the WSDL document because both belong to the same Web service. The first three lines of this INI file are service comments, which start at the semicolon and run to the end of the line; they give basic information describing this INI file. The following single line is the section, enclosed in a pair of square brackets; it is important because it indicates that the lines behind it contain the information of this Web service. The remaining lines are therefore the actual service information, given as key-value pairs with an equals sign between key and value, and each service property starts at the beginning of a line.


Figure 4-5: INI file format of one Web service
Figure 4-6: XML file format of one Web service
Figure 4-7: Database data format for all Web services


Furthermore, figure 4-6 shows the generated XML file of the Web service "BLZService"; its name is "1BLZService.xml", and needless to say this XML file belongs to the same Web service as well. Although the format of the XML file differs from that of the INI file, their essential contents are the same, that is, the values of the service properties do not differ, because both files are generated from the same collection of properties of one Web service. The XML file also has comments, like the INI file, which are placed between "<!--" and "-->", and the section of the INI file corresponds roughly to the root element of the XML file. Therefore, the values of the elements below the root element "service" in this XML file are the values of the service properties of this Web service.
Finally, figure 4-7 shows the database table that is used to store the service information of all Web services in these five Web Service Registries. The entire service information of one Web service forms exactly one record in this table. Because of that, the column names of the table are the union of the names of the service information of all Web Service Registries; since column names must be unique, redundant names in this union are eliminated. This is feasible because the names of the service information are well-defined and uniform for all five Web Service Registries. In addition, the first column of this table is the primary key, an increasing Integer whose function resembles the Integer contained in the names of the XML and INI files, while the remaining columns hold the corresponding service properties of the Web services. The value "NULL" or "Null" in this table indicates that the respective property of that Web service is empty or missing.

45 Comparison of the Average Time Cost for the Different Parts of a Single Web Service
This section compares the average time cost of the different parts of obtaining one single Web service in all five Web Service Registries. First, the average time cost of getting one single service from a Web Service Registry has to be calculated; it is obtained through the following equation:

    ATC = OTS / ONS        (2)

where
ATC is the average time cost for one single Web service,
OTS is the overall time cost of all the Web services in one Web Service Registry, and
ONS is the overall number of Web services that have already been crawled from the corresponding Web Service Registry.

In addition, the average time cost for getting one single service is split into the following six parts: the average time cost for extracting the service properties, the average time cost for obtaining the WSDL document, the average time cost for generating the XML file, the average time cost for generating the INI file, the average time cost for inserting the service properties into the database table, and the average time cost for the remaining procedures, such as getting the service list page link, getting the service page link, and so on. The average time cost for extracting the service properties is obtained by means of the following equation:


    ATCSI = OTSSI / ONS        (3)

where
ATCSI is the average time cost for extracting the service properties of one single Web service,
OTSSI is the overall time cost for extracting the service properties of all the Web services in one Web Service Registry, and
ONS is the overall number of Web services that have already been crawled from the corresponding Web Service Registry.

The calculation of the other parts is analogous to the equation for the average time cost of extracting the service properties, while the average time cost for the remaining procedures equals the average time cost for one single Web service minus the sum of the average time costs of the other five parts. For example, for the Service Repository registry in table 4-3 this yields 10042 - (8801 + 918 + 2 + 1 + 53) = 267 milliseconds.

Web Service Registry Name | Service Property | WSDL Document | XML File | INI File | Database | Others | Overall
Service Repository | 8801 | 918 | 2 | 1 | 53 | 267 | 10042
Ebi | 699 | 82 | 2 | 1 | 28 | 11 | 823
Xmethods | 5801 | 1168 | 2 | 1 | 45 | 12 | 7029
Seekda | 5186 | 1013 | 2 | 1 | 41 | 23 | 6266
Biocatalogue | 39533 | 762 | 2 | 1 | 66 | 1636 | 42000
Table 4-3: Average time cost information (in milliseconds) for all Web Service Registries

Table 4-3 displays the average time cost of one Web service and of its different parts in all five Web Service Registries. The first column of table 4-3 is the name of the Web Service Registry, the last column is the average time cost for a single service in that registry, and the remaining columns are the average time costs of the six different parts. In order to provide an intuitive view of the data in table 4-3, the data in each column are also illustrated with the corresponding figures 4-8 to 4-13.
Figure 4-8: Average time cost for extracting the service properties in all Web Service Registries


As can be seen from figure 4-8, the average time cost for extracting the service properties in the Biocatalogue Web Service Registry is 39533 milliseconds. This value is much larger than in the other four Web Service Registries, namely 8801, 699, 5801 and 5186 for Service Repository, Ebi, Xmethods and Seekda respectively. That is to say, it takes much longer to extract the service properties of the Web services published by the Biocatalogue Web Service Registry. This also indirectly indicates that the Biocatalogue Web Service Registry has the highest average number of service properties, as already discussed in section 43, whereas the average number of service properties in the Ebi Web Service Registry is the smallest. Moreover, the average time cost in the Xmethods Web Service Registry is larger than in the Seekda Web Service Registry, although the average number of service properties is the same for these two registries. One reason that may explain why Xmethods costs more time than Seekda is that the extraction of the service properties in the Xmethods Web Service Registry has to use both the service page and the service list page, while for the Seekda Web Service Registry only the service page link is needed.

The average time cost for obtaining the WSDL document in all five Web Service Registries is displayed in figure 4-9. This average time cost is actually the sum of the average time for extracting the WSDL link of a Web service and the average time for reading the data of the WSDL document from the Web and storing it on disk. As seen from this figure, the average time cost for obtaining the WSDL document is largest in the Xmethods Web Service Registry, at 1168 milliseconds. Although the process of extracting the WSDL link costs a certain amount of time, it does not significantly influence the total average time spent on obtaining the WSDL document, because the WSDL link of a Web service is almost always obtained in one step. This therefore implies that the average size of the WSDL documents in the Xmethods Web Service Registry is larger than in the other four Web Service Registries, in particular the Ebi Web Service Registry, which takes only 82 milliseconds for obtaining the WSDL document.
Figure 4-9: Average time cost for obtaining the WSDL document in all Web Service Registries


Figure 4-10, figure 4-11 and figure 4-12 show the average time cost of generating the three different outputs in all five Web Service Registries. As can be seen from figures 4-10 and 4-11, the average time for generating the XML file of one Web service is the same for all five Web Service Registries, namely only 2 milliseconds, and the average time for generating the INI file of one Web service is likewise identical everywhere, at just 1 millisecond. Even the sum of these two average time costs is so small that it can be neglected when compared to the overall average time cost of getting one Web service for the corresponding Web Service Registry shown in figure 4-13. This implies that the generation of the XML and INI files is finished immediately after receiving the service properties of a Web service as input. Furthermore, figure 4-12 shows that although the average time cost of creating the database record for each Web service is larger than the time for generating the XML and INI files in all five Web Service Registries, the operation of creating a database record is still fast.
Figure 4-10: Average time cost for generating the XML file in all Web Service Registries
Figure 4-11: Average time cost for generating the INI file in all Web Service Registries


Figure 4-12: Average time cost for creating the database record in all Web Service Registries
Figure 4-13: Average time cost for getting one Web service in all Web Service Registries

Figure 4-13 gives the average time cost for getting one single Web service in all five Web Service Registries. Without any doubt, the Biocatalogue Web Service Registry takes the longest time for this process. This is because, as shown above, the average time cost of almost every part is highest in the Biocatalogue Web Service Registry, except for the process of obtaining the WSDL document, where Biocatalogue does not have the largest average time. Moreover, a remarkable observation can be made when looking at figures 4-8, 4-12 and 4-13: the shapes of these curves follow almost the same trend. This further indicates that a Web Service Registry that spends more time on getting the description information of a Web service also offers more information about that Web service.


5 Conclusion and Further Directions
This master thesis provides a schema that aims to explore the description information of the Web services hosted in different Web Service Registries. The description information of a Web service consists of its WSDL document and the service information about that Web service. Although there is an existing approach, the Pica-Pica Web Service Description Crawler, that can be used to obtain the WSDL documents and service information of the Web services in these Web Service Registries, its functionality is restricted: it explores only a small subset of the Web services hosted in these Web Service Registries, the storage of these Web services is not flexible, and, most importantly, only a little service information is extracted per Web service, and for some Web Service Registries no service information is extracted at all. The work presented in this master thesis, however, is able to explore all Web services in these Web Service Registries. Moreover, with the approach presented here, as much service information of the Web services as possible is extracted, so that the final result is the largest annotated service catalogue ever produced. Furthermore, regarding the storage of the description information of the Web services, this master thesis provides three different formats that guarantee not only the completeness but also the longevity of the description information of the Web services.
However, in the implementation performed for this master thesis, the Whois client used for querying the information of a service domain returns free text if the information exists, and sometimes this free text differs completely. As a consequence, each Web service in all Web Service Registries had to be crawled at least once in the experiment stage so that all variants of this free text could be foreseen and handled afterwards. This is a huge amount of work because there are lots of Web services in these Web Service Registries. Therefore, in order to simplify this work, another Whois client that eases this task should be found and used.
Moreover, in the experiment stage of this master thesis, the time cost for getting a Web service is still large. In order to reduce this time, multithreaded programming could also be applied to individual parts of the process of getting one Web service.
Although the work performed here is specialized to only these five Web Service Registries, the main parts of the principles used are adaptable to other Web Service Registries with only small changes in the implementation code or structure.


6 Bibliography

[1] Nathalie Steinmetz, Holger Lausen, Dario Cerizza, Andrea Turati, Irene Celino, Adam Funk and Michael Erdmann: "D1.1 - Requirement Analysis and Architectural Plan". Emanuele Della Valle (CEFRIEL), June 27, 2008. Available from http://www.service-finder.eu/deliverables/public/7-public/32-d11
[2] Nathalie Steinmetz, Holger Lausen, Dario Cerizza, Andrea Turati, Irene Celino, Adam Funk and Michael Erdmann: "D1.2 - First Design of Service-Finder as a Whole". Emanuele Della Valle (CEFRIEL), July 1, 2008. Available from http://www.service-finder.eu/deliverables/public/7-public/37-d12-first-design-of-service-finder-as-a-whole
[3] Nathalie Steinmetz, Holger Lausen, Irene Celino, Dario Cerizza, Saartje Brockmans, Adam Funk: "D1.3 - Revised Requirement Analysis and Architectural Plan". Emanuele Della Valle (CEFRIEL), April 1, 2009. Available from http://www.service-finder.eu/deliverables/public/7-public/57-d13-revised-requirement-analysis-and-architectural-plan
[4] Chia-Hui Chang, Mohammed Kayed, Moheb Ramzy Girgis, Khaled Shaalan: "A Survey of Web Information Extraction Systems". Volume 18, Issue 10, IEEE Computer Society, pp. 1411-1428, October 2006.
[5] Leonard Richardson: "Beautiful Soup Documentation". Available from http://www.crummy.com/software/BeautifulSoup/documentation.html, October 13, 2008.
[6] Hao He, Hugo Haas, David Orchard: "Web Services Architecture Usage Scenarios". Available from http://www.w3.org/TR/ws-arch-scenarios/, February 11, 2004.
[7] Stephen Soderland: "Learning Information Extraction Rules for Semi-Structured and Free Text". Journal of Machine Learning, Volume 34, Issue 1-3, pp. 233-272, February 1999. Department of Computer Science and Engineering, University of Washington, Seattle.
[8] Ian Hickson: "A Vocabulary and Associated APIs for HTML and XHTML". World Wide Web Consortium, Working Draft WD-html5-20100624, January 22, 2008.
[9] Holger Lausen, Jos de Bruijn, Axel Polleres, Dieter Fensel: "The Web Service Modeling Language WSML". WSML Deliverable D16.1v0.2, March 20, 2005. Available from http://www.wsmo.org/TR/d16/d16.1/v0.2/
[10] Dumitru Roman, Holger Lausen, Uwe Keller: "Web Service Modeling Ontology - Standard (WSMO-Standard)". WSMO Deliverable D2, version 1.1, March 06, 2004. Available from http://www.wsmo.org/TR/d2/v1.1/
[11] Iris Braun, Anja Strunk, Gergana Stoyanova, Bastian Buder: "ConQo - A Context- and QoS-Aware Service Discovery". TU Dresden, Department of Computer Science, in Proceedings of WWW/Internet 2008.


7 Appendixes
There are additional outputs of this master program, namely the log information file and the statistic report file. Figure 8-1 shows one of the basic output formats of the log information for these five Web Service Registries.
Figure 8-1: Log information of the "Service Repository" Web Service Registry
Figure 8-2: Statistic information of the "Service Repository" Web Service Registry


Figures 8-2 to 8-6 show the output format of the statistic information for the "Service Repository", "Ebi", "Xmethods", "Seekda" and "Biocatalogue" Web Service Registries respectively.
Figure 8-3: Statistic information of the "Ebi" Web Service Registry
Figure 8-4: Statistic information of the "Xmethods" Web Service Registry


Figure 8-5: Statistic information of the "Seekda" Web Service Registry
Figure 8-6: Statistic information of the "Biocatalogue" Web Service Registry


Table of Figures
Figure 2-1: Dataflow of Service-Finder and its components
Figure 2-2: Left is the free text input type and right is its output
Figure 2-3: A semi-structured page containing data records (in rectangular box) to be extracted
Figure 2-4: Architecture of the Pica-Pica Web Service Description Crawler
Figure 3-1: Overview of the basic architecture for the Deep Web Services Crawler
Figure 3-2: Overview of the process flow of the Web Service Extractor component
Figure 3-3: Service list page of the Service-Repository
Figure 3-4: Original source code of the internal link for the Web service "BLZService"
Figure 3-5: Code overview of getting the service page link in Service Repository
Figure 3-6: Service page of the Web service "BLZService"
Figure 3-7: Overview of the process flow of the WSDL Grabber component
Figure 3-8: WSDL link of the Web service "BLZService" in the service page
Figure 3-9: Original source code of the WSDL link for the Web service "BLZService"
Figure 3-10: Code overview of the "getServiceRepositoryWSDLLink" function
Figure 3-11: Code overview of the "oneParameter" function
Figure 3-12: Overview of the process flow of the Property Grabber component
Figure 3-13: Structured properties of the service "BLZService" in the service list page
Figure 3-14: Structured properties of the service "BLZService" in the service page
Figure 3-15: Endpoint information of the Web service "BLZService" in the service page
Figure 3-16: Monitoring information of the service "BLZService" in the service page
Figure 3-17: Whois information of the service domain "thomas-bayer.com"
Figure 3-18: Overview of the process flow of the Storage component
Figure 3-19: Implementation code for getting the WSDL document
Figure 3-20: Implementation code for generating the XML file
Figure 3-21: Implementation code for generating the INI file
Figure 3-22: Implementation code for creating the table in the database
Figure 3-23: Implementation code for generating table records
Figure 4-1: Service amount statistics of these five Web Service Registries
Figure 4-2: Statistic information for the WSDL documents
Figure 4-3: Average number of service properties
Figure 4-4: WSDL document format of one Web service
Figure 4-5: INI file format of one Web service
Figure 4-6: XML file format of one Web service
Figure 4-7: Database data format for all Web services
Figure 4-8: Average time cost for extracting the service properties in all Web Service Registries
Figure 4-9: Average time cost for obtaining the WSDL document in all Web Service Registries
Figure 4-10: Average time cost for generating the XML file in all Web Service Registries
Figure 4-11: Average time cost for generating the INI file in all Web Service Registries
Figure 4-12: Average time cost for creating the database record in all Web Service Registries
Figure 4-13: Average time cost for getting one Web service in all Web Service Registries


Table of Tables

Table 3-1 Structured Information of Service-Repository Web Service Registry 34

Table 3-2 Structured Information of Xmethods Web Service Registry 34

Table 3-3 Structured Information of Seekda Web Service Registry 34

Table 3-4 Structured Information of Ebi Web Service Registry 34

Table 3-5 Structured Information of Biocatalogue Web Service Registry 34

Table 3-6 SOAP Operation Information of Biocatalogue Web Service Registry 35

Table 3-7 REST Operation Information of Biocatalogue Web Service Registry 35

Table 3-8 Endpoint Information of these five Web Service Registries 35

Table 3-9 Monitoring Information of these five Web Service Registries 35

Table 3-10 Whois Information for these five Web Service Registries 36

Table 3-11 Extracted Structured Information of Web Service "BLZService" 38

Table 3-12 Extracted Endpoint Information of the Web service "BLZService" 39

Table 3-13 Extracted Monitoring Information of the Web service "BLZService" 39

Table 3-14 Extracted Whois Information of service domain "thomas-bayer.com" 40

Table 3-15 Sleep Time of these five Web Service Registries 47

Table 4-1 Service amount statistic of these five Web Service Registries 48

Table 4-2 Statistic information for WSDL Document 49

Table 4-3 Average time cost information for all Web Service Registries 55


Table of Abbreviations

DTD Document Type Definition

DWSC Deep Web Service Crawler

HTML HyperText Markup Language

MTBF Mean Time between Failures

MTTR Mean Time to Recovery

QoS Quality of Service

REST Representational State Transfer

RTT Round Trip Time

SOAP Simple Object Access Protocol

SQL Structured Query Language

URL Uniform Resource Locator

WHATWG Web Hypertext Application Technology Working Group

XML eXtensible Markup Language

WSDL Web Service Description Language

WSML Web Service Modeling Language

WSMO Web Service Modeling Ontology


2.1.2.1 The Principle of the Service Crawler Component

The Service Crawler is responsible for gathering available services and their related information from the Web. The overall cycle is as follows:

(1) A Web developer publishes a Web Service.

(2) The Crawling component then begins to harvest the Web in order to identify Web Services, for example through their WSDL (Web Service Description Language) documents.

(3) As soon as a service is discovered, the Crawler also searches for other related information.

(4) After each periodic interval, the Crawler produces a consistent snapshot of the relevant part of the Web.

Finally, the output of the Crawler is forwarded to the subsequent components for analyzing, indexing and displaying.

2.1.2.2 The Principle of the Automatic Annotator Component

The Automatic Annotator receives the relevant data from the previous component and generates semantic service descriptions for the WSDL documents and their related information, based on the Service-Finder Ontology and the Service Category Ontology.

First, the two compatible ontologies that are used throughout the whole process are briefly introduced [2]:

- Generic Service Ontology: an ontology used to describe the data objects, for example the services, the service providers, availability, payment modalities and so on.
- Service Category Ontology: an ontology used to categorize the functionalities or applications of the services, for instance data verification, messaging, data storage, weather, etc.

The function of this component, together with its input and output, is as follows:

Input:
- Crawled data from the Service Crawler
- Service-Finder Ontologies
- Feedback on or corrections of earlier annotations

Function:
- Enrich the information about the service and extract semantic statements according to the Service-Finder Ontologies, for example categorize the service according to the Service Category Ontology
- Determine whether a particular document is relevant or not through the Web link graph; if not, discard these irrelevant documents
- Classify the pages into their genres, for instance pricing, user comments, FAQ and so on

Output:
- Semantic annotations of the services


2.1.2.3 The Principle of the Conceptual Indexer and Matcher Component

The Conceptual Indexer and Matcher is essentially a data store center that aims at storing all extracted information about the services and at supplying users with the capability of retrieval and semantic query, for example the matchmaking between user requests and service offers, or retrieving user feedback on extracted annotations.

The function of this component and its input and output are as follows:

Input:
- Semantic annotation data and full-text information obtained from the Automatic Annotator
- Semantic annotation data and full-text information that come from the user interfaces
- Cluster data from the user and service clustering component

Function:
- Store the semantic annotations received from the Automatic Annotator component and from the user interface
- Store the cluster data procured through the clustering component
- Store and index the textual descriptions offered by the Automatic Annotator component and the textual comments offered by users
- Ontological querying of the semantic data in the data store center
- Combined keyword and ontological querying used for user queries
- Provide a list of similar services for a given service

Output:
- A list of matching services for a user query; in particular, these services should be sorted by ranking and can also be iterated
- All available data related to a particular entity must be retrievable at the user interface

2.1.2.4 The Principle of the Service-Finder Portal Interface Component

The Service-Finder Portal Interface is the main entry point for users of the Service-Finder system to search and browse the data which is managed by the Conceptual Indexer and Matcher component. In addition, the users can also contribute information by providing tags, comments, categorizations and ratings for the browsed data. Furthermore, developers can directly invoke the Service-Finder functionalities from their custom applications through an API.

The details of this component's function, input and output are given below:

Input:
- A list of ordered services for a query
- Detailed information about a service or a set of services and a service provider
- Query access to the service category ontology and the most used tags provided by the users
- Service availability information

Function:
- The Web interface allows the users to search services by keyword, tag or concept in the categorization, sort and filter query results by refining the query, compare and bookmark services, and try out the services that offer this functionality
- The API allows the developers to invoke Service-Finder functionalities

Output:
- Explicit user annotations such as tags, ratings, comments, descriptions and so on
- Implicit user data, for example click streams of users, bookmarks, comparisons, links sent, etc.
- Manual advertising of available new services

2.1.2.5 The Principle of the Cluster Engine Component

The Cluster Engine is in charge of collecting and analyzing information about user behavior from the Service-Finder Portal, e.g. the queried services and the compared services of the users. Moreover, it also provides cluster data to the Conceptual Indexer and Matcher for providing service recommendations.

This component's function, input and output are as follows:

Input:
- Service annotation data, both extracted and from user feedback
- Users' click streams, used for extracting user behavior

Function:
- Obtain user clusters from user behavior
- Obtain service clusters from service annotation data to make it possible to find similar services

Output:
- Clusters of users and services

2.2 Information Extraction

Due to the rapid development and use of the World-Wide Web, a huge amount of information sources has been produced on the Internet. However, access to them through browsing and searching is limited because of the heterogeneity and the lack of structure of Web information sources. Therefore, Information Extraction, which transforms Web pages into program-friendly structures for post-processing, has become a great necessity. The task of Information Extraction is specified in terms of the inputs and the extraction targets, and the technique used in the process of Information Extraction is called an extractor.

2.2.1 Input Types of Information Extraction

Generally speaking, there are three different input types. The first input type is the unstructured document, for example the free text shown in figure 2-2. It is unstructured and written in natural language, so it requires substantial natural language processing. The second input type is the structured document, for instance XML documents, because their data can be described through an available DTD (Document Type Definition) or XML (eXtensible Markup Language) schema. The third input type is the semi-structured document, which is widespread on the Web, such as the large volume of HTML pages with tables, itemized lists and enumerated lists. This is because HTML tags are often used to render such embedded data in HTML pages (see figure 2-3).

Figure 2-2 Left is the free text input type and right is its output [4]

Figure 2-3 A semi-structured page containing data records (in rectangular box) to be extracted [4]

Inputs of the semi-structured type can therefore be seen as documents with a fairly regular structure, and the data of these documents can be displayed in an HTML or non-HTML way. Besides, since the Web pages of the Deep Web are dynamic and generated from structured databases in terms of some templates or layouts, the Deep Web can be considered as one of the input sources that provide such semi-structured documents. For example, the author, price and comment sections of the book pages provided by Amazon have the same layout, because these Web pages are generated from the same database and rendered with the same template or layout. Furthermore, semi-structured HTML pages can also be generated manually. For example, although the publication lists provided on different researchers' homepages are produced by diverse users, they all have a title and a source property for every single paper. Finally, the inputs for Information Extraction can also be pages of the same class within or among various Web Service Registries.

2.2.2 Extraction Targets of Information Extraction

Regarding the task of Information Extraction, the extraction target also has to be considered. There are two different extraction targets. The first one is the relation of a k-tuple, where k is the number of attributes in a record. In some cases an attribute of a record may have no instantiation, while in other cases an attribute has multiple instantiations. The second extraction target is the complex object with hierarchically organized data. Though the ways of depicting extraction targets in a page are diverse, the most common structure is the hierarchical tree. This hierarchical tree may contain only one leaf node, or one or more lists of leaf nodes grouped under internal nodes. The structure of a data object may thus be flat or nested. In brief, if the structure is flat then there is only one leaf node, which can also be called the root; if it is a nested structure, then the internal nodes involved in this data object span more than two levels.

Furthermore, in order to make the Web pages readable for human beings and easier to visualize, tables or tuples of the same list, or elements of a tuple, are usually isolated or demarcated. However, the display of a data object in a Web page can be affected by the following conditions [4] (a small illustrative sketch follows this list):

- The attribute of a data object has zero or several values:
(1) If there is no value for the attribute of a data object, this attribute is called a "none" attribute. For example, a special offer only available for certain books might be a "none" attribute.
(2) If there is more than one value for the attribute of a data object, it is called a "multiValue" attribute. For instance, the name of the author of a book could be a "multiValue" attribute.

- The set of attributes (A1, A2, A3, ...) has multiple orderings:
That is to say, within this set of attributes the position of an attribute might change according to the different instances of a data object. Such an attribute is called a "multiOrdering" attribute. For instance, for movies before the year 1999 a movie site might list the release date in front of the movie's title, while for movies from 1999 onwards it lists the release date behind the movie's title.

- The attribute has different formats:
This means the display format of the data object can be completely different for different instances. Therefore, if the format of an attribute is free, a lot of rules are needed to deal with all possible cases. This kind of attribute is called a "multiFormat" attribute. For example, an e-commerce Web site might use a bold font to present regular prices while using a red color to display sale prices. Nevertheless, there is also the situation that several different attributes of a data object share the same format. For example, various attributes are presented using <TD> tags in a table presentation. Attributes like those can be differentiated by means of their order information. However, for cases where a "none" attribute occurs or "multiOrdering" attributes exist, the rules for extracting these attributes have to be revised.

- The attribute cannot be decomposed:
For easier processing, the input documents are sometimes treated as strings of tokens instead of strings of characters. In addition, some attributes cannot be decomposed into several individual tokens. These attributes are called "untokenized" attributes. For example, in college course catalogue entries like "COMP4016" or "GEOL2001", the department code and the course number cannot be separated into two different strings of characters such as "COMP" and "4016" or "GEOL" and "2001".
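To make the distinction between "none" and "multiValue" attributes more tangible, the following minimal Java sketch models the book example from above. The class and field names are purely illustrative and are not part of any crawler discussed in this thesis.

    import java.util.List;
    import java.util.Optional;

    public class BookRecordDemo {
        Optional<String> specialOffer; // "none" attribute: may have no instantiation for a given book
        List<String> authors;          // "multiValue" attribute: may have several instantiations
        String title;                  // ordinary single-valued attribute

        public static void main(String[] args) {
            BookRecordDemo book = new BookRecordDemo();
            book.title = "Example Book";
            book.authors = List.of("First Author", "Second Author"); // multiValue
            book.specialOffer = Optional.empty();                    // none
            System.out.println(book.title + " by " + book.authors + ", offer: " + book.specialOffer);
        }
    }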

2.2.3 The Used Techniques in Information Extraction

The extractor used in the process of Information Extraction aims at providing a single uniform query interface to access information sources like database servers and Web servers. It consists of the following phases: collecting returned Web pages, labeling these Web pages, generalizing extraction rules, extracting the relevant data and outputting the result in an appropriate format (XML format or relational database) for further information integration. For example, at first the extractor queries the Web server to gather the returned pages through the HTTP protocol; after that it starts to extract the contents of these HTML documents and integrates them with other data sources. The whole process of the extractor follows the steps below:

Step 1:
At the beginning the input has to be tokenized. There are two different granularities for the input string tokenization: tag-level encoding and word-level encoding. Tag-level encoding transforms the tags of an HTML page into general tokens, while transforming each text string between two tags into a special token. Word-level encoding does this in another way: it treats each word in a document as a token.

Step 2:
Next, the extraction rules are applied for every attribute of the data object in the Web pages. These extraction rules can be induced in terms of a top-down or bottom-up generalization, pattern mining or logic programming. In addition, the type of extraction rules may be expressed by means of regular grammars or logic rules. For example, some use path expressions over the HTML parse tree, like html.head.title or html->table[0], some use syntactic or semantic constraints, and some use delimiter-based constraints such as HTML tags or literal words. (A small illustrative sketch of such a delimiter-based rule follows this list.)

Step 3:
After that, all the extracted data are assembled into records.

Step 4:
Finally, this process is iterated until all data objects in the input have been processed.
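As a concrete illustration of steps 1 to 3, the following Java sketch applies a very small delimiter-based extraction rule. It is not part of any of the systems discussed here; the HTML fragment, the regular expression and the class name are invented solely for illustration.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class DelimiterRuleDemo {
        public static void main(String[] args) {
            // A tiny semi-structured fragment, as it might appear in a service list page.
            String html = "<td>BLZService</td><td>Provider A</td>"
                        + "<td>CurrencyConverter</td><td>Provider B</td>";

            // Delimiter-based rule: a value is whatever stands between <td> and </td>.
            Pattern cell = Pattern.compile("<td>(.*?)</td>");
            Matcher m = cell.matcher(html);

            List<String> tokens = new ArrayList<>();
            while (m.find()) {
                tokens.add(m.group(1));   // step 1: the text between two tags becomes one token
            }

            // steps 2 and 3: apply the rule "name, provider, name, provider, ..." and assemble records
            for (int i = 0; i + 1 < tokens.size(); i += 2) {
                System.out.println("record: name=" + tokens.get(i) + ", provider=" + tokens.get(i + 1));
            }
        }
    }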


2.3 Pica-Pica Web Service Description Crawler

Pica pica is known as a bird species, also called pie. Here, however, Pica-Pica is a Web Service Description Crawler which is designed to address the quality of Web Services, for example the evaluation of the descriptive quality of the offered Web Services and of how well these Web Services are described in today's Web Service Registries.

2.3.1 Needed Libraries of the Pica-Pica Web Service Description Crawler

This version of the Pica-Pica Web Service Description Crawler was written by Anton Caceres and Josef Spillner and is programmed in the Python language. In order to run these scripts to parse the HTML pages, two additional libraries are needed: Beautiful Soup and Html5lib.

- Beautiful Soup
It is an HTML/XML parser for the Python language, and it can even turn invalid markup into a parse tree [5]. Moreover, the following three features make it more powerful:
  - Bad markup does not choke Beautiful Soup. In fact, it will generate a parse tree that makes approximately as much sense as the original document. Therefore you can obtain the data that you want.
  - Beautiful Soup provides a toolkit with simple, idiomatic methods for navigating, searching and modifying the parse tree. Hence you do not need to create a custom parser for every application.
  - If the document already specifies an encoding, you do not have to care about it, since Beautiful Soup converts incoming documents to Unicode and outgoing documents to UTF-8 automatically. Otherwise, you only have to specify the encoding of the original document.

The ways of including Beautiful Soup in an application are shown in the following [5]:
  - from BeautifulSoup import BeautifulSoup (for processing HTML)
  - from BeautifulSoup import BeautifulStoneSoup (for processing XML)
  - import BeautifulSoup (to get everything)

- Html5lib
It is a Python package which implements the HTML5 [8] parsing algorithm. In order to gain maximum compatibility with the current major desktop web browsers, this implementation is based on the WHATWG (Web Hypertext Application Technology Working Group) HTML5 specification.


2.3.2 Architecture of the Pica-Pica Web Service Description Crawler

Figure 2-4 Architecture of the Pica-Pica Web Service Description Crawler

Figure 2-4 illustrates the primary architecture of the Pica-Pica Web Service Description Crawler. It includes four fundamental components: the Service Page Grabber component, the WSDL Grabber component, the Property Grabber component and the WSML Register component.

(1) The Service Page Grabber component takes the URL seed as the input and outputs the links of the service pages to the following two components: the WSDL Grabber component and the Property Grabber component.

(2) The WSDL Grabber component is responsible for obtaining the WSDL document based on the delivered service page link and for checking the validity of the obtained WSDL document. Finally, only the valid WSDL documents are passed to the WSML Register component for further processing.

(3) The Property Grabber component tries to extract the service's properties hosted in the service page, if there are any. After that, all these service properties are saved into an INI file as the information about that service.

(4) The functionality of the WSML Register component is to write appropriate WSML documents by means of the valid WSDL documents delivered from the WSDL Grabber component and the optional INI files delivered from the Property Grabber component, and afterwards to register them in Conqo.

- WSML [9]
It stands for Web Service Modeling Language, which provides a framework with different language variants. Hence it is often used to describe the different aspects of semantic Web Services according to the conceptual model of WSMO.

- WSMO [10]
WSMO, whose full name is Web Service Modeling Ontology, is dedicated to describing various aspects related to Semantic Web Services based on ontologies. An ontology is a formal, explicit specification of a shared conceptualization. In fact, these ontologies are the sticking points that establish the linkage between the agreement of the communities of users and the defined conceptual semantics of the real world.

- Conqo [11]
It is a discovery framework that considers not only the Quality of Service (QoS) but also context information. It uses a Web Service Repository to manage service descriptions that are based on WSML.

2.3.3 Implementation of the Pica-Pica Web Service Description Crawler

This section describes the processes of the implementation of the Pica-Pica Web Service Description Crawler in detail.

(1) First, to start the whole crawling process of the Pica-Pica Web Service Description Crawler, an input is needed as the initial seed. In this crawler there are five Web Service Registries, which are listed below. The URL addresses of these five Web Service Registries are used as the input seeds for the Pica-Pica Web Service Description Crawler. Moreover, in this version of the Pica-Pica Web Service Description Crawler there is a single Python script for each Web Service Registry, and the crawling processes of these Web Service Registries' Python scripts are executed one after another.

Biocatalogue: http://www.biocatalogue.com
Ebi: http://www.ebi.ac.uk
Seekda: http://www.seekda.com
Service-Repository: http://www.service-repository.com
Xmethods: http://www.xmethods.net

(2) Then, after being fed with the input seed, it steps into the next component, the Service Page Grabber. At first this component tries to read the data from the Web based on the input seed. Then it builds a parse tree of the read data in terms of the functions of the Beautiful Soup library. After that, this Service Page Grabber component starts to look for the service page link of each service published in the Web Service Registry by means of the functions in the Html5lib library. In the case that the service page link of a single service is found, it first checks whether this service page link is valid or not. Once the service page link is valid, it passes it to the following two components for further processing, which are the WSDL Grabber component and the Property Grabber component.

(3) When the WSDL Grabber component receives a service page link from its previous component, it sets out to extract the WSDL link address of that service through the parse tree of the data in this service page. Next, this component starts to download the WSDL document of that service in terms of the WSDL link address. Thereafter, the obtained WSDL document is stored on the disk. The process of this WSDL Grabber component is continually carried on until there are no more service links passed to it. Certainly, not all grabbed WSDL documents are valid. They may either contain bad definitions or a bad namespaceURI, or be an empty document; in the worst case they are not even of XML format. Hence, in order to pick them out, this component further analyzes the involved WSDL documents. It then puts all valid documents into a "validWSDLs" folder, whereas the other, invalid documents are put into a folder named "invalidWSDLs" in order to gather statistical information. Finally, only the WSDL documents in the "validWSDLs" folder are passed to the subsequent component.

(4) Moreover, since some Web Service Registries give some additional information about the services, such as availability, service provider and version, the Property Grabber component sets out to extract this information as the service's properties and thereafter saves these properties into an INI file. However, if no additional information is available, then there is no need to extract service properties and thus there is no INI file for that service. In this implementation of the Pica-Pica Web Service Description Crawler, only the Python scripts for the Seekda and Service-Repository Web Service Registries have functions to extract the services' properties, while for the other three Web Service Registries there is no such function.

(5) Furthermore, it is optional to create a report file which contains the statistical information of this process, such as the total number of services for one Web Service Registry, the number of services whose WSDL document is invalid, etc.

(6) As has been stated, at this point there is a folder with all valid WSDL documents and possibly also some INI files. Therefore, the task of the WSML Register component is now to generate the appropriate WSML documents from these valid WSDL documents and INI files and then register them in Conqo.

2.4 Conclusions of the Existing Strategies

This chapter presented three aspects of the existing strategies: the Service-Finder project, the Information Extraction technique and the Pica-Pica Web Service Description Crawler.

The task of this master program is to obtain the available Web services and their related information from the Web. This is actually a procedure of extracting the needed information about a service, such as the service's WSDL document and its properties. Therefore the Information Extraction technique, which is used to extract information hosted on the Web, can be applied in this master program.

Moreover, Service-Finder is a large project that is not only able to obtain the available Web services and their related information from the Web, but also enriches the crawled data with annotations by means of the Service-Finder ontology and the Service Category ontology and integrates all the information into a coherent semantic model. Furthermore, it also provides the capabilities for searching and browsing the data with a user interface and gives users service recommendations. However, for a master program the Service-Finder project far exceeds the requirements. Therefore it is considered only as a reference for this master program.

Furthermore, since the Pica-Pica Web Service Description Crawler aims only at obtaining the available Web services and their related information, it fulfills the primary task of this master program. Nevertheless, regarding the information about a service, the Pica-Pica Web Service Description Crawler extracts only a few properties, sometimes even none. Consequently, in order to improve the quality of the service information, as many properties about the service as possible have to be extracted. Hence chapter 3 presents an extension of this Pica-Pica Web Service Description Crawler.


3 Design and Implementation

In the previous chapter on the state of the art, the already existing techniques and implementations were presented. In the following, the basic principle of the proposed approach, the Deep Web Services Crawler, is introduced; it builds on these existing techniques, especially on the Pica-Pica Web Service Description Crawler.

3.1 Deep Web Services Crawler Requirements

This section mainly describes the goals of the Deep Web Service Crawler approach, the system requirements with respect to the approach and some non-functional requirements.

3.1.1 Basic Requirements for DWSC

The following list contains the basic requirements which should be achieved:

(1) Produce the largest annotated Service Catalogue. Here, the Service Catalogue is actually a list of Web services published in the Web Service Registries. It contains not only the WSDL document of each service but also a file with the information about that service. Therefore, for the purpose of producing the largest annotated Service Catalogue, the proposed approach needs to extract as many properties about those Web services as possible. Moreover, it also has to download the WSDL document that is hosted along with the Web service. That is to say, these properties are not only the interesting structured properties, such as the service name and its WSDL link address, but also other non-functional properties, for example the endpoint and its monitoring information.

(2) Flexible storage: each service has a Service Catalogue that contains all its interesting properties. How to deal with those service properties is a central question, namely which schemes will be used to store them. Hence, in order to store them in a flexible way, the proposed approach provides three methods for the storage: the first one stores them as an XML file, the second stores them in an INI file, and the third uses a database for the storage (a small sketch of the INI variant is given below).
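One possible way to realize the INI-style variant in Java is sketched below; it relies on java.util.Properties, which writes simple key=value lines. The chosen keys and values are illustrative only and do not reproduce the actual INI layout of the crawler, which is shown in chapter 4.

    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.util.Properties;

    public class IniStorageSketch {
        public static void main(String[] args) throws IOException {
            Properties props = new Properties();
            // Hypothetical structured properties of one crawled Web service.
            props.setProperty("serviceName", "BLZService");
            props.setProperty("wsdlLink", "http://example.org/BlzService?WSDL"); // placeholder link
            props.setProperty("provider", "example provider");                   // placeholder value

            // Write them as a simple key=value file, one property per line.
            try (FileOutputStream out = new FileOutputStream("BLZService.ini")) {
                props.store(out, "Service properties of BLZService");
            }
        }
    }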

3.1.2 System Requirements for DWSC

Generally speaking, the requirements for carrying out a programming project include the following:

1) Operating system: Linux/Unix, Windows (XP, Vista, 2000, etc.)
2) Programming language: C++, Java, Python, C, etc.
3) Programming tool: NetBeans, Eclipse, Visual Studio and so on

In this master thesis, however, the scripts of the Deep Web Service Crawler approach are written in the Java programming language. Besides, these code scripts have only been tested on the Windows XP operating system and the Linux operating system, but have not been tested on other operating systems.


3.1.3 Non-Functional Requirements for DWSC

In this part, several non-functional requirements for the Deep Web Service Crawler approach are presented:

1) Transparency: the process of data exploration and data storage should be done automatically, without the user's intervention. However, at first the user should specify the path on the hard disk which will be used to store the outputs of this program.

2) Fault tolerance: during the execution of this program errors can inevitably happen. Therefore, in order to keep the process from being interrupted, there must be some necessary error handling for process recovery (a minimal sketch of such handling is given after this list).

3) Completeness: this approach should extract as many of the interesting properties of each Web Service as possible, e.g. endpoint and monitoring information.
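To illustrate the fault-tolerance requirement, the Java sketch below retries a failing page download a few times before giving up, so that a single network error does not abort the whole crawl. It is only an assumption about how such error handling could look; the retry count, the helper name fetchPage and the use of java.net.URL are illustrative and not prescribed by the thesis.

    import java.io.IOException;
    import java.io.InputStream;
    import java.net.URL;

    public class RetryFetchSketch {
        // Try to download a page up to maxRetries times before giving up.
        static String fetchPage(String address, int maxRetries) {
            for (int attempt = 1; attempt <= maxRetries; attempt++) {
                try (InputStream in = new URL(address).openStream()) {
                    return new String(in.readAllBytes());
                } catch (IOException e) {
                    System.err.println("attempt " + attempt + " failed: " + e.getMessage());
                }
            }
            return null; // the caller records the failure and continues with the next service
        }

        public static void main(String[] args) {
            String page = fetchPage("http://www.service-repository.com/", 3);
            System.out.println(page == null ? "giving up" : "fetched " + page.length() + " characters");
        }
    }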

In addition, since the Pica-Pica Web Service Crawler has already implemented strategies for the following five URLs, the proposed approach must cover no fewer than these five Web Service Registries:

Biocatalogue: http://www.biocatalogue.com
Ebi: http://www.ebi.ac.uk
Seekda: http://www.seekda.com
Service-Repository: http://www.service-repository.com
Xmethods: http://www.xmethods.net

3.2 Deep Web Services Crawler Architecture

In this section an overview of the high-level architecture of the Deep Web Services Crawler approach is introduced first. Thereafter four subsections follow that outline each single component and how they play together.

The current components and flows of data in the Deep Web Service Crawler can be summarized as depicted in Figure 3-1 using the continuous arrows. The crawler first tries to obtain the available service list page links and related service page links by crawling the Web (Web Service Extractor). Then those gathered links are processed in two separate steps: one gets the service's WSDL document (WSDL Grabber) and the other collects the properties of each service (Property Grabber). Finally all these data are stored on the storage device (Storage). The whole process shown in figure 3-1 proceeds as follows (a condensed code-level sketch of the same flow is given after the steps):

Step 1:
When the Deep Web Service Crawler starts to run, the File Chooser container requires the user to specify a path on the computer or on any other hard disk. The reason for specifying the path is that the Deep Web Service Crawler program needs a place to store all its outputs.

Step 2:
After that the Web Service Extractor is triggered. It is the main entry point to the actual crawling process. Since the Deep Web Service Crawler program is supposed to crawl for Web Services in some given Web Service Registries, the URL addresses of these Web Service Registries are given as the initial seeds for this Web Service Extractor process. However, since the page structures of these Web Service Registries are completely different, there is a separate process for each Web Service Registry.

Figure 3-1 Overview of the Basic Architecture for the Deep Web Services Crawler

Step 3:
According to the given seed, two types of links are obtained by the Web Service Extractor component: one is the service list page link and the other is the service page link. A service list page is a page that contains a list of Web Services and possibly some information about these Web Services, while a service page is a page that contains much more information about a single service. Finally, the component forwards these two types of links to the next two components, the Property Grabber and the WSDL Grabber.

Step 4:
On the one hand, the Property Grabber component tries to gather the information about the service that is hosted in the service list page and in the service page, such as the name of the service, its description, the ranking of this service, etc. Finally, all the information about the service is collected together as the service properties, which are then delivered to the Storage component for further processing.

Step 5:
On the other hand, the WSDL Grabber tries to obtain the WSDL link from the service list page or the service page. The reason is that for some Web Service Registries the WSDL link is hosted in the service list page, like in the Biocatalogue, and for other Web Service Registries it is hosted in the service page, such as in Xmethods. After obtaining the WSDL link, it is likewise transmitted to the Storage component for further processing.

Step 6:
When the service properties and the WSDL link of the service are received by the Storage component, it stores them on the disk. The service properties are stored on the disk in three different ways: as an XML file, as an INI file, or as one record inside a database table. For the WSDL link, the Storage component first tries to download the page content according to the URL address of the WSDL link; if this succeeds, the page content of the service is stored as a WSDL document on the disk.

Step 7:
Nevertheless, this is just the crawling process for a single service, from step 3 to step 6. Hence, if there is more than one service, or more than one service list page, in those Web Service Registries, the crawling process from step 3 to step 6 is repeated again and again until there is no service or service list page left in that Web Service Registry.

Step 8:
Furthermore, after the crawling process for one Web Service Registry finishes, a file is also generated that contains some statistical information about this crawling process, for example the time when the crawling process of this Web Service Registry started and when it finished, the total number of Web services in this Web Service Registry, how many services have an empty WSDL document, the average number of service properties in this Web Service Registry, and the average time cost for extracting service properties, getting the WSDL document and generating the XML file, INI file, etc.
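The steps above can be condensed into a short control-flow sketch. The Java code below is only a schematic outline with stubbed helper methods standing in for the components described in the following subsections; it is not the actual implementation of the crawler, and all method names are hypothetical.

    import java.util.List;
    import java.util.Map;

    public class CrawlLoopSketch {

        public static void main(String[] args) {
            crawlRegistry("http://www.service-repository.com/", "/tmp/dwsc-output");
        }

        // Schematic main loop for one Web Service Registry (steps 2 to 8).
        static void crawlRegistry(String seedUrl, String outputPath) {
            for (String listPage : serviceListPageLinks(seedUrl)) {                       // step 3
                for (String servicePage : servicePageLinks(listPage)) {
                    Map<String, String> properties = grabProperties(listPage, servicePage); // step 4
                    String wsdlLink = grabWsdlLink(listPage, servicePage);                  // step 5
                    saveProperties(properties, outputPath);                                 // step 6
                    downloadWsdl(wsdlLink, outputPath);
                }
            }
            writeStatisticsReport(outputPath);                                              // step 8
        }

        // Stub helpers; the real components are described in the following subsections.
        static List<String> serviceListPageLinks(String seed) { return List.of(); }
        static List<String> servicePageLinks(String listPage) { return List.of(); }
        static Map<String, String> grabProperties(String listPage, String servicePage) { return Map.of(); }
        static String grabWsdlLink(String listPage, String servicePage) { return null; }
        static void saveProperties(Map<String, String> properties, String outputPath) { }
        static void downloadWsdl(String wsdlLink, String outputPath) { }
        static void writeStatisticsReport(String outputPath) { }
    }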

3.2.1 The Function of the Web Service Extractor Component

The Web Service Extractor component is responsible for gathering service list page links and service page links. It pursues a focused crawl of the Web and only forwards service list page and service page links to the subsequent components for analyzing, collecting and gathering purposes. Therefore, it identifies both service list page links and the related service page links on these Web Service Registries.

As can be seen in figure 3-2, a crawl for Web Services needs to start from a URL seed. The seed is almost as important as the Web Service Extractor itself, as it highly influences the part of the Web which needs to be crawled. The seed can or shall contain e.g. Web pages where these Web Services are published or that talk about Web Services.

Figure 3-2 Overview of the process flow of the Web Service Extractor Component

After being fed with the URL seed, the Web Service Extractor component starts to get the link of the service list page from the initial page of this URL seed. However, this process differs among the five Web Service Registries. The following shows the different situations in these Web Service Registries:

- Service-Repository Web Service Registry:
In this Web Service Registry the link of the first service list page is the URL address of its seed, which means that some Web Services can already be found on the home page of the Service-Repository Web Service Registry. The "first" here implies that there is more than one service list page link in this Web Service Registry. Therefore the process of getting service list page links in this registry is continually carried on until no more service list page links exist.

- Xmethods Web Service Registry:
Although there are Web Services on the home page of the Xmethods Web Service Registry, they are only a small subset of the Web Services in this registry. Moreover, in the Xmethods Web Service Registry there is only one page containing all Web Services. Therefore the service list page link of that page has to be obtained.

- Ebi Web Service Registry:
The situation in the Ebi Web Service Registry is a little bit like in the Xmethods Web Service Registry. That is to say, there is also one page that contains all Web Services of this Web Service Registry. However, this page is not the initial page of the input seed. Therefore more than one operation step is needed to get the service list page link of that page.

- Seekda Web Service Registry:
In the Seekda Web Service Registry the Web Services are not contained in the initial page of the input seed. The service list page link can be obtained after several additional operation steps. However, there is a problem with getting the service list page links in this registry. Simply put, if there is more than one page containing Web Services, then, for unknown reasons, the links of the remaining service list pages cannot be obtained. In other words, only the link of the first service list page can be obtained.

- Biocatalogue Web Service Registry:
The process of getting the service list pages in the Biocatalogue Web Service Registry is almost the same as in the Seekda Web Service Registry. The only difference is that in the Biocatalogue Web Service Registry all service list page links can be obtained, even if there is more than one service list page.

Then, after getting the link of a service list page, the Web Service Extractor begins to get the link of the service page for each service listed in that service list page. The reason why it can do this is that there is an internal link for every service which addresses its service page. It is worth noting that once a service page link is obtained, this component immediately forwards the service page link and the service list page link to the subsequent two components for further processing. Nevertheless, the process of obtaining the service page links is continuously carried out until all services listed in that service list page are crawled. Analogously, the process of getting service list pages is also continuously carried out until no more service list pages exist.

3.2.1.1 Features of the Web Service Extractor Component

The main features are described in the following paragraphs.

1) Obtain service list page links
A central task is to obtain the corresponding service list page links. A service list page link is a URL address that points to a page with a public list of Web Services and some simple information about these Web services, like the name of the service, an internal URL that links to another page containing the detailed information about that service, and sometimes the link address of the WSDL document.

2) Obtain service page links
Once service list page links are found, a crucial aspect is to extract the internal link of each service to aid the task of service information discovery. Therefore it is the task of the Web Service Extractor to harvest the HTML page content of the service list page, so that the service page links, which contain much more detailed information about the single Web service, can be obtained.

3.2.1.2 Input of the Web Service Extractor Component

This component depends on specific input seeds. The only input required for this component is a URL seed. This URL seed will be one of the URLs listed in section 3.1.3.

3.2.1.3 Output of the Web Service Extractor Component

The component produces two kinds of service-related page links from the Web:

- Service list page links
- Service page links


3.2.1.4 Demonstration for the Web Service Extractor

In order to provide a comprehensive understanding of the process of the Web Service Extractor component, the following gives some figures for explanation. Though there are five URL addresses in this section, only the URL of the Service-Repository is shown as an example.

1) The input seed is the initial URL address of the Service-Repository, which is "http://www.service-repository.com".

2) As already stated in section 3.2.1, the first service list page link of this Web Service Registry is its input seed "http://www.service-repository.com". Figure 3-3 shows the corresponding service list page of that link.

Figure 3-3 Service list page of the Service-Repository

Figure 3-4 Original source code of the internal link for Web service "BLZService"

Figure 3-5 Code overview of getting the service page link in Service Repository

Figure 3-6 Service page of the Web service "BLZService"

3) Now that the service list page link is already known, the next step is to acquire the service page link of each service listed in the service list page. The text in the red box of figure 3-4 shows the internal link of the Web service "BLZService". However, this is not the complete link of the service page; it has to be prefixed with the initial URL address of the Service-Repository Web Service Registry, "http://www.service-repository.com". The code for getting the service page link of a Web service in the Service-Repository is shown in figure 3-5; a hedged sketch of this kind of link extraction is also given after this list. Therefore the final link of this service page is "http://www.service-repository.com/service/overview/-210897616". Figure 3-6 is the corresponding service page of that link.

4) Afterwards, those two links, the service list page link and the service page link, which are gathered by the Web Service Extractor component, are immediately forwarded to the next two components, the WSDL Grabber component and the Property Grabber component.
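The following sketch illustrates how such internal links could be collected and prefixed with the registry's base URL. It does not reproduce the code of figure 3-5; it uses the open-source jsoup HTML parser purely for illustration, and the assumption that the internal links start with "/service/" reflects only this running example, not a documented property of the registry.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class ServicePageLinkSketch {
        public static void main(String[] args) throws IOException {
            String base = "http://www.service-repository.com";
            Document listPage = Jsoup.connect(base).get();   // download the service list page

            List<String> servicePageLinks = new ArrayList<>();
            for (Element anchor : listPage.select("a[href]")) {
                String href = anchor.attr("href");
                if (href.startsWith("/service/")) {           // assumption about the link layout
                    servicePageLinks.add(base + href);        // prefix with the registry's base URL
                }
            }
            servicePageLinks.forEach(System.out::println);
        }
    }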

3.2.2 The Function of the WSDL Grabber Component

The WSDL Grabber component acquires the WSDL link that is hosted on the Web, based on the service list page link or the service page link. The whole process flow is illustrated in figure 3-7.

Figure 3-7 Overview of the process flow of the WSDL Grabber Component

When the WSDL Grabber component receives the inputs delivered from the previous component, it starts to get the WSDL link of the service based on these inputs. However, although the inputs of the WSDL Grabber component are the links of the service page and the service list page, only one of them contains the WSDL link of the corresponding service. That is to say, the WSDL link exists either in the service page or in the service list page. The reason why both links need to be delivered to this component is that only one of the five Web Service Registries, namely Biocatalogue, hosts the WSDL link in the service list page, while for the other four Web Service Registries the WSDL link is hosted in the service page. Therefore the WSDL link for these four Web Service Registries is obtained via the service page link, and for the Biocatalogue Web Service Registry the WSDL link is obtained through the service list page link. However, there is a problem with getting the WSDL links of the Biocatalogue Web Service Registry. In brief, some of the Web services listed in the service list page of the Biocatalogue Web Service Registry do not have a WSDL link; in other words, these services do not have a WSDL document. In such a situation the value of the WSDL link of these Web services is assigned a "NULL" value. Nevertheless, for the Web Services in the other four Web Service Registries the WSDL link always exists in the service page. Eventually, whenever the WSDL Grabber component extracts the WSDL link of a single Web service, it is immediately forwarded to the Storage component for downloading the WSDL document.

3.2.2.1 Features of the WSDL Grabber Component

The WSDL Grabber component has to provide the following feature:

- Obtain WSDL links
The WSDL link is the direct way to get to the page that contains the contents of the WSDL document. It is actually a URL address, but at the end of this URL address there is something like "wsdl" or "WSDL" to indicate that this address points to the page of a WSDL document.

3.2.2.2 Input of the WSDL Grabber Component

This component requires the following input data:

- Service list page link
- Service page link

3.2.2.3 Output of the WSDL Grabber Component

The component produces only the following output data:

- The URL address of the WSDL link for each service

3.2.2.4 Demonstration for the WSDL Grabber Component

This section presents a list of figures in order to give a comprehensive understanding of the process of the WSDL Grabber component. It uses the same Web service of the Service-Repository as an example.

1) The input for this WSDL Grabber component is the link of the service page obtained from the Web Service Extractor component. The address of this link is "http://www.service-repository.com/service/overview/-210897616".

Figure 3-8 WSDL link of the Web service "BLZService" in the service page

2) Figure 3-9 shows the corresponding original source code of the WSDL link that is highlighted in figure 3-8.

Figure 3-9 Original source code of the WSDL link for Web service "BLZService"

3) Figures 3-10 and 3-11 show the code used to extract the WSDL link shown in figure 3-9. However, figure 3-10 is the particular code only for the Service-Repository Web Service Registry; for the other four Web Service Registries this is different. The function "getServiceRepositoryWSDLLink" first gets a list of nodes that have the same HTML tag name "b". Then it checks all of the nodes one by one to see whether the text value of that node equals "WSDL". If the condition is fulfilled, the attribute value in its sibling, which is an "a" element here, is extracted as the value of the WSDL link for this Web service. (A hedged sketch of this extraction logic is given after this list.)

Figure 3-10 Code overview of the "getServiceRepositoryWSDLLink" function

Figure 3-11 Code overview of the "oneParameter" function

4) Finally, after applying the functions shown in figures 3-10 and 3-11, the WSDL link for the Web service "BLZService" can be obtained. Its value is "http://services.unitedplanet.de/blz/BlzService.asmx?WSDL".
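The following is a minimal re-implementation sketch of the logic just described, not the original code of figure 3-10. It assumes the jsoup HTML parser and a page layout in which the anchor with the WSDL address directly follows a <b> element whose text is "WSDL"; both the library choice and the layout assumption are illustrative.

    import java.io.IOException;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class WsdlLinkSketch {
        // Look for a <b>WSDL</b> label and return the href of the <a> element next to it.
        static String findWsdlLink(Document servicePage) {
            for (Element bold : servicePage.getElementsByTag("b")) {
                if ("WSDL".equals(bold.text().trim())) {
                    Element sibling = bold.nextElementSibling();
                    if (sibling != null && "a".equals(sibling.tagName())) {
                        return sibling.attr("href");
                    }
                }
            }
            return null; // no WSDL link found on this service page
        }

        public static void main(String[] args) throws IOException {
            Document page = Jsoup.connect(
                    "http://www.service-repository.com/service/overview/-210897616").get();
            System.out.println(findWsdlLink(page));
        }
    }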


3.2.3 The Function of the Property Grabber Component

The Property Grabber component is a module which is used to extract and gather all the Web service information that is hosted on the Web, namely the information shown in the service list page and the service page. In the end, all the obtained Web service information is collected together as the service properties, which are then delivered to the Storage component for storing. The detailed process flow of the Property Grabber component is illustrated in figure 3-12.

The inputs of the Property Grabber component are the same as the inputs of the WSDL Grabber component. However, there is still a small difference between them with respect to the seed. As already mentioned in section 3.2.2, for the WSDL Grabber component only one of the inputs is sufficient to get the WSDL link, while the Property Grabber component needs both of them as seeds. Therefore, once the Property Grabber component receives the needed inputs, it starts to extract the information of that single Web service.

Figure 3-12 Overview of the process flow of the Property Grabber Component

After the Property Grabber component receives the inputs, it starts to extract the service information of the Web service. Generally speaking, the service information consists of four aspects: structured information, endpoint information, monitoring information and whois information.

(1) Structured Information

The structured information is obtained by extracting the information hosted in the service page and the service list page. It is the basic descriptive information about the service, such as the name of the service, the URL address through which the WSDL document can be obtained, the description introducing the service, the provider who offers the service, its rating and the server that hosts it. However, the elements constituting this structured information differ among the Web Service Registries. For example, the rating information of a Web service exists in the Service-Repository Web Service Registry, while the Xmethods Web Service Registry does not have this information. In addition, even for Web services within the same Web Service Registry, some elements of the structured information may not exist. For instance, one service in a Web Service Registry may have a description while another service in the same registry does not. Tables 3-1 to 3-5 list the structured information that should be extracted from these five Web Service Registries. Moreover, if the style of a service in the Biocatalogue Web Service Registry is SOAP, there is additional information describing the SOAP operations of this service; if it is REST, this additional information describes the REST operations. It is also considered a part of the structured information. Table 3-6 and table 3-7 list the information for these two different operation types.

Service Name, WSDL Link, WSDL Version, Provider, Server, Rating, Homepage, Owner Homepage, Description
Table 3-1 Structured Information of the Service-Repository Web Service Registry

Service Name, WSDL Link, Provider, Service Style, Homepage, Implementation Language, Description, User Description, Contributed Client Name, Type of this Client, Publisher for this Client, Used Toolkit of this Client, Used Language of this Client, Used Operating System of this Client
Table 3-2 Structured Information of the Xmethods Web Service Registry

Service Name, WSDL Link, Server, Provider, Provider's Country, Service Style, Rating, Description, User Description, Service Tags, Documentation (within WSDL)
Table 3-3 Structured Information of the Seekda Web Service Registry

Service Name, WSDL Link, Port Name, Service URL Address, Implementation Class
Table 3-4 Structured Information of the Ebi Web Service Registry

Service Name, WSDL Link, Style, Provider, Provider's Country, View Times, Favorite Times, Submitter, Service Tags, Total Annotation, Provider Annotation, Member Annotation, Registry Annotation, Base URL, SOAP Lab Server Base URL, Description, User Description, Category
Table 3-5 Structured Information of the Biocatalogue Web Service Registry


SOAP Operation Name, Inputs and Outputs, Operation Description, Operation Tags, Part of Which Service
Table 3-6 SOAP Operation Information of the Biocatalogue Web Service Registry

REST Operation Name, Service Tags, Used Template, Operation Description, Part of Which Service, Part of Which Endpoint Group
Table 3-7 REST Operation Information of the Biocatalogue Web Service Registry

(2) Endpoint Information

Endpoint information is the port information of the Web service, which can be extracted only from the service page. Since different Web Service Registries structure the endpoint information differently, some of its elements can be quite diverse. One thing needs attention: the Ebi Web Service Registry does not provide endpoint information for any of the Web services published in it. Moreover, even though the Web services in the same Web Service Registry share the same structure of endpoint information, some elements may be missing or empty, and a registry may even have no endpoint information at all for some of its Web services. Nevertheless, whenever there is endpoint information for a Web service, it contains at least one element, namely the URL address of the endpoint. Table 3-8 shows the endpoint information that should be extracted from these five Web Service Registries.

Web Service Registry Name | Elements of the Endpoint Information
Service Repository | Endpoint Name, Endpoint URL, Endpoint Critical, Endpoint Type, Bound Endpoint
Xmethods | Endpoint URL, Publisher of this Endpoint, Contact Email of this Publisher, Implementation Language of this Endpoint
Seekda | Endpoint URL
Biocatalogue | Endpoint Name, Endpoint URL
Table 3-8 Endpoint Information of these five Web Service Registries

Web Service Registry Name | Elements of the Monitoring Information
Service Repository | Service Availability, Number of Downs, Total Uptime, Total Downtime, MTTR, MTBF, RTT Max of Endpoint, RTT Min of Endpoint, RTT Average of Endpoint, Ping Count of Endpoint
Seekda | Service Availability, Begin Time of Monitoring
Biocatalogue | Monitored Status of Endpoint, Overall Status of Service
Table 3-9 Monitoring Information of these five Web Service Registries

(3) Monitoring Information

Monitoring information is the measured statistic information about the Web service. It is worth noting that the Ebi and Xmethods Web Service Registries do not provide monitoring information for any of their published Web services, while for the other three Web Service Registries only a few Web services may lack this information. Table 3-9 lists the monitoring information for these three Web Service Registries.

(4) Whois Information

Whois information is not extracted from the information hosted in the service page and the service list page. It is the descriptive information about the service domain, which is gained by means of the address of the WSDL link. Because of that, the process of getting the whois information starts by determining the service domain first. The final value of the service domain must not contain strings like "http", "https", "www" etc.; it must be the top level domain. After that, the service domain database is queried by sending the value of the service domain to the whois client, which is simply a Web site on the Internet, for example "http://www.whois365.com/cn/domain". If information about that service domain exists, a list of its information is returned as the output. However, the structure of the returned information differs from one service domain to another. Therefore, the most challenging part is that the extracting process has to deal with each different form of the returned information. Table 3-10 lists the whois information that should be extracted for all these five Web Service Registries.

Service Domain URL, Domain Name, Domain Type, Domain Address, Domain Description, State, Postal Code, City, Country, Country Code, Phone, Fax, Email, Organization, Established Time
Table 3-10 Whois Information for these five Web Service Registries
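To make this step more concrete, the following minimal Java sketch derives the service domain from a WSDL link and fetches the free-text response of a Web-based whois client. The class name, the query URL pattern and the error handling are illustrative assumptions rather than the code actually used in this master program.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;

    // Illustrative sketch only: derive the service domain from a WSDL link and
    // query a Web-based whois client for its free-text response.
    public class WhoisLookupSketch {

        // Reduce a host such as "www.thomas-bayer.com" to the top level domain
        // "thomas-bayer.com", dropping prefixes like "http://" or "www".
        static String serviceDomain(String wsdlLink) throws Exception {
            String host = new URL(wsdlLink).getHost();
            String[] parts = host.split("\\.");
            if (parts.length <= 2) {
                return host;
            }
            return parts[parts.length - 2] + "." + parts[parts.length - 1];
        }

        // Send the service domain to the whois client (a plain Web site) and
        // return the page it serves as free text; the URL pattern is an assumption.
        static String queryWhois(String domain) throws Exception {
            URL url = new URL("http://www.whois365.com/cn/domain/" + domain);
            StringBuilder response = new StringBuilder();
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(url.openStream(), "UTF-8"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    response.append(line).append('\n');
                }
            }
            return response.toString();
        }

        public static void main(String[] args) throws Exception {
            String domain = serviceDomain("http://www.thomas-bayer.com/axis2/services/BLZService?wsdl");
            System.out.println(queryWhois(domain));
        }
    }

The returned free text then has to be parsed case by case, since its structure differs between service domains, as described above.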

Finally, the information of these four aspects is collected together and then delivered to the Storage component for further storage processing.

3231 Features of the Property Grabber Component

The Property Grabber component has to provide the following features

- Obtain basic information
Generally speaking, the more information one Web service has, the better one can judge how good this Web service is. Hence, the Property Grabber component needs to extract all the basic information hosted in the service list page and the service page. This basic information consists of the structured information, the endpoint information and the monitoring information.

- Obtain whois information
Due to the fact that the more information a Web service has, the better its quality can be assessed, it is necessary to extract as much information about the Web service as possible. Therefore, besides the basic information, the Property Grabber component also obtains some additional information called whois information, such as the type of the domain, the name of the person in charge, the postal code of the domain, city, phone, fax, detailed address, email etc.


3232 Input of the Property Grabber Component

This component requires the following input data

- Service list page link
- Service page link

3233 Output of the Property Grabber Component

The component will produce the following output data

- Structured information of each service
- Endpoint information of each service, if it exists
- Monitoring information for the service and its endpoint, if it exists
- Whois information of the service domain

All this information is collected together as the properties of each service. Thereafter, the collected properties are sent to the Storage component.

3234 Demonstration for Property Grabber Component

The pictures from figure 3-13 to figure 3-16 depict the fundamental procedure of the Property Grabber component. To simplify the explanation, the example shown in this section uses the same Web service as before.

1) The inputs of the Property Grabber component are the links of the service list page and the service page received from the Web Service Extractor component. These links are
"http://www.service-repository.com"
and "http://www.service-repository.com/service/overview/-210897616".

2) When the "GetProperty" function is triggered, it calls its four sub functions shown in figure 3-12.

Figure3-13 Structure properties of the Service "BLZService" in service list page


Figure3-14 Structure properties of the Service "BLZService" in service page

3) Firstly, the "getStructuredProperty" function tries to extract the structured information displayed in the red boxes of figure 3-13 and figure 3-14. However, several elements of the structured information have the same content, such as the description shown in the service page and in the service list page. Hence, in order to save time during the extracting process and space during the storing process, elements with the same content are only extracted once. Moreover, the rating information needs a transformation from non-descriptive to descriptive text, because its content consists of several star images. The final results of the extracted structured information for this Web service are shown in table 3-11. Because there is no descriptive information for the provider, the homepage and the owner homepage, their values are assigned "NULL".

Service Name: BLZService
WSDL Link: http://www.thomas-bayer.com/axis2/services/BLZService?wsdl
WSDL Version: 0
Server: Apache-Coyote/1.1
Description: BLZService
Rating: Four stars and A Half
Provider: NULL
Homepage: NULL
Owner Homepage: NULL
Table 3-11 Extracted Structured Information of the Web Service "BLZService"

4) Secondly, the "getEndpointProperty" function extracts the endpoint information displayed in the red box of figure 3-15. Although there is more than one endpoint record in the list, only one of them is extracted as the endpoint information. The reason is that this master program is intended to extract as much information as possible, but this information should not contain redundant entries. Therefore, only one record is extracted as the endpoint information even if there are several endpoint records. Table 3-12 shows the final results of the endpoint information of this Web service.

Figure3-15 Endpoint information of the Web service "BLZService" in service page


Endpoint Name: BLZServiceSOAP12port_http
Endpoint URL: http://www.thomas-bayer.com:80/axis2/services/BLZService
Endpoint Critical: True
Endpoint Type: production
Bound Endpoint: BLZServiceSOAP12Binding
Table 3-12 Extracted Endpoint Information of the Web service "BLZService"

5) Then the monitoring information is extracted by invoking the function "getMonitoringProperty". Figure 3-16 displays two types of monitoring properties: the red box above contains the monitoring information about the Web service, and the red box below lists the monitoring information for its endpoints. As mentioned before, only one endpoint statistic record is extracted. Besides, as can be seen in figure 3-16, there are two availability values; they both represent the availability of this Web service, just like the availability shown in figure 3-14, so only one of them is kept. Table 3-13 shows the final results of this extracting process.

Figure3-16 Monitoring Information of the Service "BLZService" in service page

Service Availability: 100
Number of Downs: 0
Total Uptime: 1 day 19 hours 19 minutes
Total Downtime: 0 second
MTBF: 1 day 19 hours 19 minutes
MTTR: 0 second
RTT Max of Endpoint: 141 ms
RTT Min of Endpoint: 0 ms
RTT Average of Endpoint: 577 ms
Ping Count of Endpoint: 112
Table 3-13 Extracted Monitoring Information of the Web service "BLZService"

6) Next, in order to extract the whois properties, the "getWhoisProperty" function first has to obtain the service domain from the WSDL link. For this Web service, the obtained service domain is "thomas-bayer.com". Then it sends this service domain as input to the whois client for the querying process, which returns a list of information about that service domain, see figure 3-17. Table 3-14 shows the extracted whois information.


Figure3-17 Whois Information of the service domain "thomas-bayer.com"

Service Domain URL: thomas-bayer.com
Domain Name: Thomas Bayer
Domain Type: NULL
Domain Address: Moltkestr. 40
Domain Description: NULL
State: NULL
Postal Code: 54173
City: Bonn
Country: NULL
Country Code: DE
Phone: +4922855525760
Fax: NULL
Email: info@predic8.de
Organization: predic8 GmbH
Established Time: NULL
Table 3-14 Extracted Whois Information of the service domain "thomas-bayer.com"

7) Finally all the information of these four aspects will be collected together as the service

properties and then these service properties are forwarded into the Storage component

324 The Function of Storage Component

The Storage component uses the WSDL link from the WSDL Grabber component to download the WSDL document from the Web and store it on the disk. In addition, the service properties from the Property Grabber component are also stored directly on the disk, in three different manners, by this Storage component. Figure 3-18 illustrates the process flow of the Storage component.

When the Storage component receives the needed inputs, the mediator function "Storager" of the Storage component is triggered. After that, it transforms the service properties into three different output formats and stores them on the disk. These output formats are an XML file, an INI file and database records. Besides, it also tries to download the WSDL document from the URL address of the WSDL link and then stores the obtained WSDL document on the disk too. This "Storager" function is composed of four sub functions, namely the "getWSDL", "generateXML", "generateDatabase" and "generateINI" sub functions. Each sub function is in charge of one aspect of the storage tasks.


Figure3-18 Overview the process flow of the Storage Component

(1) ldquogetWSDLrdquo sub function

The task of the ldquogetWSDLrdquo sub function is going to down the WSDL document and then store it

into the disk Therefore above all it has to get the content of the WSDL document This

procedure will be done as following First the ldquogetWSDLrdquo sub function tries to check if the value

of the received WSDL link equals to ldquoNULLrdquo or not As have already presented in section 322 if

the Web service does not have WSDL link the value of its WSDL link will be assigned with ldquoNULLrdquo

For the case like that it will create a WSDL document whose name is the service name appended

with a mark of ldquoNo WSDL Documentrdquo and obviously this document does not contain any content

It is an empty document Nevertheless if that service does have WSDL link this sub function will

try to connect to the Internet based on the URL address of the WSDL link Once it succeeds all

these contents that hosted in the Web would be downloaded and stored into the disk and named

only with the name of the service Otherwise it would create a WSDL document which prefixes

ldquoBadrdquo before the service name
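As a rough illustration of this logic (and not the actual code shown later in figure 3-19), the following Java sketch handles the three cases just described; the exact file naming scheme and error handling are simplified assumptions.

    import java.io.FileOutputStream;
    import java.io.InputStream;
    import java.net.URL;

    // Illustrative sketch of the "getWSDL" idea; file naming is simplified and
    // does not reproduce the exact scheme of the actual implementation.
    public class GetWsdlSketch {

        static void getWSDL(String path, String serviceName, String wsdlLink) {
            if (wsdlLink == null || wsdlLink.equals("NULL")) {
                // No WSDL link at all: create an empty, specially marked document.
                writeFile(path + serviceName + " [No WSDL Document].wsdl", new byte[0]);
                return;
            }
            try (InputStream in = new URL(wsdlLink).openStream()) {
                byte[] content = in.readAllBytes();
                if (content.length == 0) {
                    // Link is valid but the downloaded document is empty.
                    writeFile(path + "(BAD)" + serviceName + ".wsdl", content);
                } else {
                    writeFile(path + serviceName + ".wsdl", content);
                }
            } catch (Exception e) {
                // Link could not be resolved or read: prefix the name with "Bad".
                writeFile(path + "Bad" + serviceName + ".wsdl", new byte[0]);
            }
        }

        private static void writeFile(String fileName, byte[] content) {
            try (FileOutputStream out = new FileOutputStream(fileName)) {
                out.write(content);
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }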

(2) "generateXML" sub function
The "generateXML" sub function takes the service properties as input, transforms them into an XML file and stores it on the disk under a name consisting of the service name plus ".xml". XML stands for eXtensible Markup Language, a markup language designed to transport and store data. The first line of an XML file is the XML declaration, which defines the XML version and the encoding used. For example, <?xml version="1.0" encoding="UTF-8"?> means this is an XML file whose version is 1.0 and whose encoding is the 8-bit Unicode Transformation Format character set. Besides, an XML file contains XML elements, where an element spans everything from its start tag to its end tag. An XML element can also contain other elements, simple text or a mixture of both. However, an XML file must contain a root element as the parent of all other elements.
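A minimal Java sketch of this idea is given below. It assumes the service properties arrive as name-value pairs and uses the standard DOM and Transformer APIs; the element naming scheme is an illustrative assumption rather than the one used in figure 3-20.

    import java.io.File;
    import java.util.LinkedHashMap;
    import java.util.Map;
    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.transform.TransformerFactory;
    import javax.xml.transform.dom.DOMSource;
    import javax.xml.transform.stream.StreamResult;
    import org.w3c.dom.Document;
    import org.w3c.dom.Element;

    // Illustrative sketch: service properties as name-value pairs become child
    // elements of a <service> root element; element naming is an assumption.
    public class GenerateXmlSketch {

        static void generateXML(String path, String serviceName,
                                Map<String, String> properties) throws Exception {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element root = doc.createElement("service");   // the required root element
            doc.appendChild(root);
            for (Map.Entry<String, String> p : properties.entrySet()) {
                Element e = doc.createElement(p.getKey().replace(' ', '_'));
                e.setTextContent(p.getValue() == null ? "NULL" : p.getValue());
                root.appendChild(e);
            }
            // Serialize the document with an XML declaration (UTF-8 by default).
            TransformerFactory.newInstance().newTransformer().transform(
                    new DOMSource(doc),
                    new StreamResult(new File(path + serviceName + ".xml")));
        }

        public static void main(String[] args) throws Exception {
            Map<String, String> props = new LinkedHashMap<>();
            props.put("Service Name", "BLZService");
            props.put("Rating", "Four stars and A Half");
            generateXML("./", "1BLZService", props);
        }
    }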

(3) "generateINI" sub function
The "generateINI" sub function also takes the service properties as input, but it transforms them into an INI file and stores it on the disk under a name consisting of the service name plus ".ini". "ini" stands for initialization. The INI file format is a de facto standard for configuration files; INI files are simple text files with a basic structure. Generally speaking, an INI file contains three different kinds of parts: sections, parameters and comments. A parameter is the basic element of an INI file. Its format is a key-value pair, also called a name-value pair; the pair is delimited by an equals sign "=", and the key or name always appears to the left of the equals sign. A section is like a room that groups its parameters together. It always appears on a single line within a pair of square brackets "[ ]", and sections may not be nested. A comment is descriptive text which begins with a semicolon ";"; anything between the semicolon and the end of the line is ignored.
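The following small sketch illustrates how such an INI file could be written in Java; the comment lines and the section name are illustrative assumptions, not the exact output of the program.

    import java.io.PrintWriter;
    import java.util.LinkedHashMap;
    import java.util.Map;

    // Illustrative sketch: comments (';'), one section and one key=value pair
    // per service property.
    public class GenerateIniSketch {

        static void generateINI(String path, String serviceName,
                                Map<String, String> properties) throws Exception {
            try (PrintWriter out = new PrintWriter(path + serviceName + ".ini", "UTF-8")) {
                out.println("; service properties of " + serviceName);
                out.println("; generated by the Deep Web Service Crawler");
                out.println("[" + serviceName + "]");
                for (Map.Entry<String, String> p : properties.entrySet()) {
                    out.println(p.getKey() + "=" + (p.getValue() == null ? "NULL" : p.getValue()));
                }
            }
        }

        public static void main(String[] args) throws Exception {
            Map<String, String> props = new LinkedHashMap<>();
            props.put("Service Name", "BLZService");
            props.put("Server", "Apache-Coyote/1.1");
            generateINI("./", "1BLZService", props);
        }
    }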

(4) "generateDatabase" sub function
The inputs of the "generateDatabase" sub function are the same as for the previous two sub functions. Instead of transforming them into a file like XML or INI, the "generateDatabase" sub function turns them into data in the database by using SQL statements. SQL stands for Structured Query Language, a database language designed for accessing and manipulating data in a database. Most of the actions performed on a database are done with SQL statements, and the primary statements of SQL include insert into, delete, update, select, create, alter and drop. Therefore, in order to transform these service properties into data in the database, this sub function first has to create a database using the "create database" statement. Then it creates a table to store the data. A table is a collection of related data entries and consists of columns and rows. Since the data for all these five Web Service Registries are not very large, one database table is enough for storing these service properties. Because of that, the field names of the service properties in the columns must be uniform and well-defined for all five Web Service Registries. Afterwards, the service properties of each single service can be put into the table as one record with the "insert into" statement of SQL.
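The sketch below illustrates these two SQL steps with JDBC. The JDBC URL, the credentials, the table name and the reduced set of columns are illustrative assumptions (the actual program uses one column of type "Text" per unified property name), and a suitable JDBC driver must be on the classpath.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.Statement;

    // Illustrative JDBC sketch of the "generateDatabase" idea.
    public class GenerateDatabaseSketch {

        public static void main(String[] args) throws Exception {
            try (Connection con = DriverManager.getConnection(
                    "jdbc:mysql://localhost/webservices", "user", "password")) {
                try (Statement st = con.createStatement()) {
                    // One table is enough; every property column is plain text.
                    st.executeUpdate("CREATE TABLE IF NOT EXISTS services ("
                            + "id INT AUTO_INCREMENT PRIMARY KEY, "
                            + "service_name TEXT, wsdl_link TEXT, rating TEXT)");
                }
                // Insert the properties of one Web service as a single record.
                String sql = "INSERT INTO services (service_name, wsdl_link, rating) VALUES (?, ?, ?)";
                try (PreparedStatement ps = con.prepareStatement(sql)) {
                    ps.setString(1, "BLZService");
                    ps.setString(2, "http://www.thomas-bayer.com/axis2/services/BLZService?wsdl");
                    ps.setString(3, "Four stars and A Half");
                    ps.executeUpdate();
                }
            }
        }
    }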

3241 Features of the Storage Component

The Storage component has to provide the following features:

- Generate different output formats
The final result of this master program is to store the information of the services on the disk for future work. This Storage component provides three different formats for storing the service properties of the services published in the Web Service Registries. This makes the storage of the services very flexible and also durable.

- Obtain the WSDL document
The important task of this component is to check whether the WSDL document can be obtained from the WSDL link, because the WSDL document plays a decisive role in determining the quality of the service. This Storage component provides the ability to deal with the different situations that can occur while obtaining the WSDL document.


3242 Input of the Storage Component

This component requires the following input data

- WSDL link of each service
- The property information of each service

3243 Output of the Storage Component

The component will produce the following output data

- WSDL document of the service
- XML file, INI file and table records in the database

3244 Demonstration for Storage Component

The following figures describe the fundamental implementation code of this Storage component. The detailed description is given below.
1) As can be seen from the implementation code in the following figures, there are several places that these sub functions have in common. The first common place concerns the parameters defined in each of these sub functions, which are "path" and "SecurityInt". The parameter "path" is an absolute path on the computer disk; it is used for the storing procedure and has already been specified by the user at the beginning of the whole program. The parameter "SecurityInt" is an increasing Integer which is used as a part of the name of the service. The reason for this is that it prevents services that have the same name from overwriting each other on the disk. The content of the red marks in the code of these figures is the second common place; its function is to create a file or document for the service with the corresponding parameters.

2) Figure 3-19 displays the implementation code of the "getWSDL" sub function, which is designed to get the WSDL document of the service based on its WSDL link. The parameter "name" is the name of the service and is used as the name of the WSDL document. The parameter "linkStr" is the actual WSDL link, which is the most important one for this sub function. The other two parameters, "statistic" and "log", are the objects of the text files called "Statistic Information" and "Log Information" respectively. The "Statistic Information" text file records the statistic data of the services in each Web Service Registry, such as the overall number of properties for that Web Service Registry, the overall number of services, the number of services that have no WSDL link, the number of services whose WSDL documents contain no content, the number of services whose WSDL links are not available, etc. The "Log Information" text file records the results of the process steps and the problems encountered, for example which service is being crawled now, which Web Service Registry it belongs to, the reason why the WSDL document of a service could not be obtained, the reason why a service is unreachable, and so on.


Figure3-19 Implementation code for getting WSDL document

3) Figure 3-20 and figure 3-21 show the code for turning the service properties into the XML file and the INI file and storing these two files on the disk thereafter. The parameter "vec" is a Vector of the "PropertyStruct" data type. This "PropertyStruct" data type is an object of a class consisting of two variables: name and value.

Figure3-20 Implementation code for generating XML file


Figure3-21 Implementation code for generating INI file

4) The code in figure 3-22 and figure 3-23 shows the process that turns the service properties of all services in these five Web Service Registries into records of the database. Therefore a database has to be created first. The name of the database can be arbitrary as long as it conforms to the naming rules of the database; the same holds for the name of the table. Figure 3-22 displays the procedure of creating a table in the database. Because it is hard to decide on the length of each service property, the data types of all service properties are set to "Text". Figure 3-23 shows the code for inserting the service properties into the table with the "update" statement.

Figure3-22 Implementation code for creating table in database


Figure3-23 Implementation code for generating table records

33 Multithreaded Programming for DWSC

Multithreaded programming is a built-in characteristic provided by the Java language. A multithreaded program contains two or more separate parts that can execute concurrently; each such part is called a thread. The use of multithreading makes it possible to create programs that use the system resources efficiently, for example by keeping the idle time of the CPU to a minimum.

In this master program there are five Web Service Registries that need to be crawled for the services published in them. Moreover, the number of services published in each Web Service Registry differs considerably, so the running time needed for each Web Service Registry differs as well. With sequential execution, a Web Service Registry that owns fewer services would have to wait until another Web Service Registry with many more services has finished. Therefore, in order to avoid this waiting time and to maximize the use of the system resources, multithreaded programming is applied to this master program. That is to say, this master program creates one thread for each Web Service Registry, and these threads are executed independently.
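The following sketch shows this one-thread-per-registry idea; the method crawlRegistry is a hypothetical stand-in for the complete crawling procedure of one Web Service Registry.

    // Illustrative sketch of one-thread-per-registry crawling.
    public class RegistryThreadsSketch {

        static void crawlRegistry(String registryName) {
            System.out.println("crawling " + registryName);   // placeholder work
        }

        public static void main(String[] args) throws InterruptedException {
            String[] registries = {"Service Repository", "Ebi", "Xmethods", "Seekda", "Biocatalogue"};
            Thread[] threads = new Thread[registries.length];
            for (int i = 0; i < registries.length; i++) {
                final String name = registries[i];
                threads[i] = new Thread(() -> crawlRegistry(name), name);
                threads[i].start();   // all registries are crawled concurrently
            }
            for (Thread t : threads) {
                t.join();             // wait until every registry thread has finished
            }
        }
    }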

34 Sleep Time Configuration for Web Service

Registries

Since this master program downloads the WSDL documents and extracts the service information of the Web services published in the Web Service Registries, it inevitably affects the performance of these registries. In addition, for the purpose of not exceeding their throughput capability, these Web Service Registries restrict the rate of access. Because of that, unknown errors can happen while this master program is executing: for instance, the program may halt at one point without getting any more WSDL documents and service information, the WSDL documents of some services in some Web Service Registries may not be obtainable, or some service information may be missing. Therefore, in order to obtain the largest possible number of Web services published in these five Web Service Registries without affecting their throughput, the access rate has to be configured for each service of all Web Service Registries.

In consequence, before going into the essential procedure for each single service of these Web Service Registries, this master program calls the system's built-in function "sleep(long milliseconds)". It is a public static function which causes the currently executing thread to sleep for the specified number of milliseconds; in other words, it temporarily ceases execution for a while. The following table shows the time interval of the sleep function for each Web Service Registry.

Web Service Registry Name | Time Interval (milliseconds)
Service Repository | 8000
Ebi | 3000
Xmethods | 10000
Seekda | 20000
Biocatalogue | 10000
Table 3-15 Sleep Time of these five Web Service Registries
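Applied to the program, this could look roughly like the following sketch, where the map mirrors the intervals of table 3-15 and pauseBeforeNextService would be called before each single service is processed; the method and class names are illustrative.

    import java.util.Map;

    // Illustrative sketch of the per-registry sleep configuration.
    public class SleepConfigSketch {

        static final Map<String, Long> SLEEP_MS = Map.of(
                "Service Repository", 8000L,
                "Ebi", 3000L,
                "Xmethods", 10000L,
                "Seekda", 20000L,
                "Biocatalogue", 10000L);

        static void pauseBeforeNextService(String registryName) {
            try {
                Thread.sleep(SLEEP_MS.getOrDefault(registryName, 5000L));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();   // restore the interrupt flag
            }
        }
    }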


4 Experimental Results and Analysis

This chapter shows the quantitative experimental results of the prototype presented in chapter 3, together with an analysis and explanation of these results. In order to obtain reasonably accurate results, the experiments were carried out more than five times; all data displayed in the following tables and charts are the averages of these runs.

41 Statistic Information for Different Web Service

Registries

This section discusses the amount of Web services published in these five Web Service Registries. It includes the overall number of Web services published in each Web Service Registry and the number of unavailable Web services, which have been archived because they may not be active anymore or are close to being non-active. Table 4-1 shows the service amount statistic of these five Web Service Registries.

Web Service Registry Name | Service Repository | Ebi | Xmethods | Seekda | Biocatalogue
Overall Services | 57 | 289 | 382 | 853 | 2567
Unavailable Services | 0 | 0 | 0 | 0 | 125
Table 4-1 Service amount statistic of these five Web Service Registries

In order to give an intuitive view of the service amount statistic in these five Web Service Registries, figure 4-1 shows a bar chart derived from table 4-1. As the bar chart shows, on the one hand the overall number of Web services increases from the Service Repository Web Service Registry to the Biocatalogue Web Service Registry. The Biocatalogue Web Service Registry owns by far the largest number of Web services, which indicates that it has a much more powerful ability to provide Web services to users, because it contains far more services than the other four Web Service Registries. On the other hand, there are no unavailable services in any of the Web Service Registries except for the Biocatalogue Web Service Registry. That is to say, only the Biocatalogue Web Service Registry contains Web services that cannot be used by the users. To some degree this is useless, since these services cannot be used anymore and they waste network resources on the Web. Therefore, all these unavailable services should be eliminated in order to reduce the waste of network resources.


Figure4-1 Service amount statistic of these five Web Service Registries

42 Statistic Information for WSDL Document

Web Service Registry Name | Service Repository | Ebi | Xmethods | Seekda | Biocatalogue
Failed WSDL Links | 1 | 0 | 23 | 145 | 32
Without WSDL Links | 0 | 0 | 0 | 0 | 16
Empty Content | 0 | 0 | 2 | 0 | 2
Table 4-2 Statistic information for WSDL Documents

Table 4-2 and figure 4-2 show the error statistics of the WSDL documents of the Web services in these five Web Service Registries. There are three aspects. The first one is the "Failed WSDL Links" count of the Web services in these Web Service Registries: the overall number of Web services whose WSDL links are invalid. In other words, it is impossible to get the WSDL documents of these Web services from the URL addresses of their WSDL links, and therefore no WSDL document is created for them. The second aspect is the "Without WSDL Links" count: the overall number of Web services in each Web Service Registry that have no WSDL link at all, so there can be no WSDL document content for such Web services. The value of the WSDL link of such a Web service is "NULL"; a WSDL document is still created, but it has no content and its name contains the string "[No WSDL Document]". The third aspect, "Empty Content", represents the overall number of Web services that have WSDL links whose URL addresses


are valid, but whose WSDL documents contain no content. In this case a WSDL document whose name contains the string "(BAD)" is created.

Figure4-2 Statistic information for WSDL Document

43 Comparison of Different Average Number of

Service Properties

This section compares the average number of service properties in these five Web Service Registries. This average number of service properties is calculated by means of the following equation:

ASP = ONSP / ONS (1)

Where
ASP is the average number of service properties for one Web Service Registry,
ONSP is the overall number of service properties in one Web Service Registry,
ONS is the overall number of Web services that have already been crawled from the corresponding Web Service Registry.

Figure 4-3 shows the average number of service properties per Web service in these five Web Service Registries. As mentioned before, one of the measures for the quality of the Web services in a Web Service Registry is the service information: the more information is available about a Web service, the better one knows that service, and consequently the corresponding Web Service Registry can offer Web services of better quality to its users. As seen in figure 4-3, the Service Repository and Biocatalogue Web Service Registries have a larger number of service properties than the other three Web Service Registries. This directly reflects that these two Web Service Registries provide more detailed information about the Web services published in them, so users can more easily choose the service they need and will also be more willing to use the Web services published there. By contrast, the Xmethods and Seekda


Web Service Registries, which provide less information about their Web services, offer less quality for these Web services. Therefore, users may not like to use the Web services provided in these two Web Service Registries, not to mention the Web services published in the Ebi Web Service Registry.

Figure4-3 Average Number of Service Properties

Based on the description presented in section 323, the causes of the different numbers of service properties in these Web Service Registries may consist of the following points. First, the number of structured information elements for the Web services differs among these five Web Service Registries, and part of the information for some Web services in one Web Service Registry may be missing or have an empty value. For example, the amount of structured information that is supposed to be extracted for the Web services in the Ebi Web Service Registry differs greatly from that in the Biocatalogue Web Service Registry. Secondly, there is a certain amount of endpoint information for almost all Web Service Registries except the Ebi Web Service Registry; this more or less reduces the overall number of service properties. Thirdly, some Web Service Registries, such as Xmethods and Ebi, do not have monitoring information, whereas the Service Repository Web Service Registry in particular has a large amount of monitoring information that can be extracted from the Web. Obviously, the last point is the amount of whois information for these Web services. If the database of the whois client does not contain information about the service domain of a Web service, then no whois information can be extracted; and even when there is information about the service domain, its amount can be very diverse. Therefore, if many service domains of the Web services in a registry have no or only little whois information, the average number of service properties in that registry decreases greatly.

As a result, in order to help users get better acquainted with the Web services provided in a Web Service Registry and to distinguish these Web services, the Web Service Registry should do its best to offer more and more information for each of its published Web services.

(Figure 4-3 plots the average number of properties per service: roughly 23 for Service Repository, 7 for Ebi, 17 for Xmethods, 17 for Seekda and 32 for Biocatalogue.)

44 Different Outputs of Web Services

The fundamental task of this master program is to obtain the WSDL documents of the Web services hosted in the Web Service Registries, as well as to extract and gather the properties of these Web services and thereafter store them on disk. Therefore, this section describes the different outputs of this master program, which include the WSDL documents of the Web services, the generated XML and INI files, and the data records of the service properties in the database.

Figure4-4 WSDL Document format of one Web service

The WSDL document of a Web service is simply read from the Web according to the URL address of its WSDL link, and these data are then stored on the disk as the WSDL document. The name of the WSDL document is the service name plus the ending ".wsdl", which marks it as a WSDL document. In order to distinguish WSDL documents whose names would be the same although their contents differ, the name of each obtained WSDL document in one Web Service Registry additionally contains a unique Integer in front of the name. Figure 4-4 shows the valid WSDL document format of a Web service; its name is "1BLZService.wsdl".

Besides, the obtained service properties are transformed into an XML file, an INI file and data records in the database. Figure 4-5, figure 4-6 and figure 4-7 show these three different output formats respectively. As can be seen from figure 4-5, this is the INI file of the Web service, whose name is "1BLZService.ini". The Integer is the same as in the WSDL document, because both belong to the same Web service. The first three lines in that INI file are service comments, which start at the semicolon and extend to the end of the line; they are basic information describing this INI file. The following line is the section, enclosed in a pair of brackets. It is important because it indicates that the lines following it contain the information of this Web service. The rest of the lines are the actual service information, each as a key-value pair with an equals sign between key and value. Each service property is displayed from the beginning of the line.


Figure4-5 INI File format of one Web service
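To give a rough impression of this format, an INI file such as the one in figure 4-5 looks approximately like the following snippet; the concrete comment text, property names and values shown here are only illustrative:

    ; service properties of BLZService
    ; crawled from the Service Repository Web Service Registry
    [BLZService]
    Service Name=BLZService
    WSDL Link=http://www.thomas-bayer.com/axis2/services/BLZService?wsdl
    Rating=Four stars and A Half
    Provider=NULL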

Figure4-6 XML File format of one Web service

Figure4-7 Database data format for all Web services


Furthermore, figure 4-6 shows the generated XML file of the Web service "BLZService"; its name is "1BLZService.xml". Needless to say, this XML file is part of the materials belonging to the same Web service. Although the format of the XML file differs from that of the INI file, the essential contents are the same, that is, the values of the service properties do not differ, because both files are generated from the same collection of properties of the same Web service. The XML file also has comments like the INI file, which are displayed between "<!--" and "-->", and the section of the INI file corresponds to the root element of the XML file. Therefore, the values of the elements below the root element "service" in this XML file are the values of the service properties of this Web service.

Finally, as can be seen from figure 4-7, there is a database table which is used to store the service information of all Web services in these five Web Service Registries. The entire service information of one Web service forms exactly one record in this table. Because of that, the column names of the table are the union of the names of the service information in each Web Service Registry. However, since the column names of a table must be unique, the redundant names in this union must be eliminated. This is possible because the names of the service information are well-defined and uniform for all these five Web Service Registries. In addition, the first column of this table is the primary key, an increasing Integer whose function is similar to the Integer contained in the names of the XML and INI files. The remaining columns of the table are the corresponding service properties of the Web services. The value "NULL" or "Null" in this table indicates that this property of that Web service is empty or missing.

45 Comparison of Average Time Cost for Different

Parts of Single Web Service

This section compares the average time cost of the different parts of getting one single Web service in all these five Web Service Registries. First, the average time cost of getting one single service in a Web Service Registry has to be calculated, which is done with the following equation:

ATC = OTS / ONS (2)

Where
ATC is the average time cost for one single Web service,
OTS is the overall time cost of all the Web services in one Web Service Registry,
ONS is the overall number of Web services that have already been crawled from the corresponding Web Service Registry.

In addition, the average time cost of getting one single service consists of the following six parts: the average time cost for extracting the service properties, the average time cost for obtaining the WSDL document, the average time cost for generating the XML file, the average time cost for generating the INI file, the average time cost for inserting the service properties into the database table, and the average time cost for some other procedures, such as getting the service list page link, getting the service page link and so on. The average time cost for extracting the service properties is obtained by means of the following equation:

be obtained by means of following equation

Deep Web Service Crawler

55

ATCSI = OTSSI / ONS (3)

Where
ATCSI is the average time cost for extracting the service properties of one single Web service,
OTSSI is the overall time cost for extracting the service properties of all the Web services in one Web Service Registry,
ONS is the overall number of Web services that have already been crawled from the corresponding Web Service Registry.

The calculation of the other parts is analogous to the equation for the average time cost for extracting the service properties, while the average time cost for the other procedures equals the average time cost for one single Web service minus the sum of the average time costs of the other five parts.

Web Service Registry Name | Service Property | WSDL Document | XML File | INI File | Database | Others | Overall
Service Repository | 8801 | 918 | 2 | 1 | 53 | 267 | 10042
Ebi | 699 | 82 | 2 | 1 | 28 | 11 | 823
Xmethods | 5801 | 1168 | 2 | 1 | 45 | 12 | 7029
Seekda | 5186 | 1013 | 2 | 1 | 41 | 23 | 6266
Biocatalogue | 39533 | 762 | 2 | 1 | 66 | 1636 | 42000
Table 4-3 Average time cost information (in milliseconds) for all Web Service Registries

Table 4-3 displays the average time cost of one Web service and of its different parts in all these five Web Service Registries. The first column of table 4-3 contains the names of the five Web Service Registries, and the last column contains the average time cost of a single service in the corresponding Web Service Registry, while the remaining columns contain the average time costs of the six different parts. In order to give an intuitive view of the data in table 4-3, the data in each column of this table are also illustrated in the corresponding figures 4-8 to 4-13.

Figure4-8 Average time cost for extracting service property in all Web Service Registries


As can be seen from figure 4-8, the average time cost for extracting the service properties in the Biocatalogue Web Service Registry is 39533 milliseconds. This value is much larger than those of the other four Web Service Registries, which are 8801, 699, 5801 and 5186 for Service Repository, Ebi, Xmethods and Seekda respectively. That is to say, it takes much longer to extract the service properties of the Web services published by the Biocatalogue Web Service Registry. This also indirectly indicates that the Biocatalogue Web Service Registry has the highest average number of service properties, which has already been discussed in section 43; conversely, the average number of service properties in the Ebi Web Service Registry is the smallest. Moreover, the average time cost in the Xmethods Web Service Registry is larger than that in the Seekda Web Service Registry, although, as already known, the average number of service properties is the same for these two Web Service Registries. One reason that may explain why Xmethods costs more time than Seekda is that the process of extracting the service properties in the Xmethods Web Service Registry has to use both the service page and the service list page, while only the service page link is needed for the Seekda Web Service Registry.

The average time cost for obtaining the WSDL document in all these five Web Service Registries is displayed in figure 4-9. This average time cost is actually the sum of the average time for extracting the WSDL link of a Web service and the average time for reading the data of the WSDL document from the Web and then storing it on the disk. As seen from this figure, the average time cost for obtaining the WSDL document in the Xmethods Web Service Registry is the largest, namely 1168 milliseconds. Although the process of extracting the WSDL link costs a certain amount of time, it does not have a significant influence on the total average time spent for obtaining the WSDL document, because the WSDL link of a Web service is almost always gained in one step. This implies that the average size of the WSDL documents in the Xmethods Web Service Registry is larger than in the other four Web Service Registries, in particular the Ebi Web Service Registry, which takes only 82 milliseconds for obtaining the WSDL document.

Figure4-9 Average time cost for obtaining WSDL document in all Web Service Registries


Figure 4-10, figure 4-11 and figure 4-12 show the average time cost of generating the three different outputs in all these five Web Service Registries. As can be seen from figures 4-10 and 4-11, the average time for generating the XML file of one Web service is the same for all five Web Service Registries, namely only 2 milliseconds, and the average time for generating the INI file of one Web service is also the same everywhere, namely just 1 millisecond. Even the sum of these two average time costs is so small that it can be neglected when compared to the overall average time cost of getting one Web service in the corresponding Web Service Registry shown in figure 4-13. This implies that the generation of the XML and INI files is finished almost immediately after receiving the service properties of a Web service. Furthermore, figure 4-12 shows that although the average time costs for creating the database record of a Web service are larger than the times for generating the XML and INI files in all five Web Service Registries, the operation of creating a database record is still fast.

Figure4-10 Average time cost for generating XML file in all Web Service Registries

Figure4-11 Average time cost for generating INI file in all Web Service Registries


Figure4-12 Average time cost for creating database record in all Web Service Registries

Figure4-13 Average time cost for getting one Web service in all Web Service Registries

Figure 4-13 gives the average time cost of getting one single Web service in all these five Web Service Registries. Without any doubt, the Biocatalogue Web Service Registry takes the longest time for this process. This is because, as the presentation of the different parts above shows, the average time cost of almost every part is largest in the Biocatalogue Web Service Registry; the only exception is the process of obtaining the WSDL document, where the Biocatalogue Web Service Registry does not cost the most time. Moreover, a remarkable observation can be made when looking at figures 4-8, 4-12 and 4-13: the shapes of these curves follow almost the same trend. This further indicates that a Web Service Registry that spends more time to get the description information of one Web service also offers more information about that Web service.


5 Conclusion and Further Direction

This master thesis provides a schema which aims to explore the description information of the Web services hosted in different Web Service Registries. The description information of a Web service consists of the WSDL document and the service information about that Web service. Although there is an existing approach, the Pica-Pica Web Service Description Crawler, that can be used to obtain the WSDL documents and service information of the Web services in these Web Service Registries, its functionality is restricted: it explores only a small subset of the Web services hosted in these Web Service Registries, the storage of these Web services is not flexible, and, most importantly, only a little service information is extracted for each Web service, and for some Web Service Registries no service information is extracted at all. The work presented in this master thesis, in contrast, is able to explore all Web services in these Web Service Registries. Moreover, with the approach presented in this master thesis, as much service information as possible is extracted for each Web service, so that the final result is the largest annotated service catalogue ever produced. Furthermore, regarding the storage of the description information of the Web services, this master thesis provides three different ways that guarantee not only the completeness but also the longevity of the description information of the Web services.

However, in the implementation performed in this master thesis, the whois client used for querying the information of a service domain returns free text if the information exists, and sometimes this free text differs completely from domain to domain. Because of that, each Web service in all Web Service Registries had to be crawled at least once during the experiment stage, so that all the variants of this free text could be foreseen and processed afterwards. This is a huge amount of work, because there are lots of Web services in these Web Service Registries. Therefore, in order to simplify this work, another whois client that eases this task needs to be found and used.

Moreover, in the experiment stage of this master thesis the time cost for getting a Web service is still large. For the purpose of reducing this time, multithreaded programming could also be applied to some parts of the process of getting one Web service.

Although the work performed here is specialized for only these five Web Service Registries, the main parts of the principles used here are adaptable to other Web Service Registries with only small changes in the implementation code or the structure.


6 Bibliography

[1] Nathalie Steinmetz Holger Lausen Dario Cerizza Andrea Turati Irene Celino Adam Funk and

Michael Erdmann ldquoD11 ndash Requirement Analysis and Architectural Planrdquo Available from

httpwwwservice-findereudeliverablespublic7-public32-d11 Emanuele Della Valle (CEFRIEL)

June 27 2008

[2] Nathalie Steinmetz Holger Lausen Dario Cerizza Andrea Turati Irene Celino Adam Funk and

Michael Erdmann ldquoD12 - First Design of Service-Finder as a Wholerdquo Available from

httpwwwservice-findereudeliverablespublic7-public37-d12-first-design-of-service-finder-as-a-

whole Emanuele Della Valle (CEFRIEL) July 1 2008

[3] Nathalie Steinmetz Holger Lausen Irene Celino Dario Cerizza Saartje Brockmans Adam Funk

ldquoD13 ndash Revised Requirement Analysis and Architectural Planrdquo Available from

httpwwwservice-findereudeliverablespublic7-public57-d13-revised-requirement-analysis-and-

architectural-plan Emanuele Della Valle (CEFRIEL) April 1 2009

[4] Chia-Hui Chang Mohammed Kayed Moheb Ramzy Girgis Khaled Shaalan ldquoA Survey of Web

Information Extraction Systemsrdquo Volume 18 Issue 10 IEEE Computer Society pp1411-1428 October

2006

[5] Leonard Richardson ldquoBeautiful Soup Documentationrdquo Available from

httpwwwcrummycomsoftwareBeautifulSoupdocumentationhtml October 13 2008

[6] Hao He Hugo Haas David Orchard ldquoWeb Services Architecture Usage Scenariosrdquo Available from

httpwwww3orgTRws-arch-scenarios February 11 2004

[7] Stephen Soderland ldquoLearning Information Extraction Rules for Semi-Structured and Free Textrdquo

Volume 34 Issue 1-3 Journal of Machine Learning Department Computer Science and Engineering

University of Washington Seattle pp233-272 February 1999

[8] Ian Hickson ldquoA Vocabulary and Associated APIs for HTML and XHTMLrdquo World Wide Web

Consortium Working Draft WD-html5-20100624 January 22 2008

[9] Holger Lausen Jos de Bruijn Axel Polleres Dieter Fensel ldquoThe Web Service Modeling Language

WSMLrdquo WSML Deliverable D161v02 March 20 2005 Available from

httpwwwwsmoorgTRd16d161v02

[10] Dumitru Roman Holger Lausen Uwe Keller ldquoWeb Service Modeling Ontology - Standard (WSMO

-Standard)rdquo WSMO deliverable D2 version 11 06 March 2004 Available from

httpwwwwsmoorgTRd2v11

[11] Iris Braum Anja Strunk Gergana Stoyanova Bastian Buder ldquoConQo ndash A Context- And QoS-Aware

Service Discoveryrdquo TU Dresden Department of Computer Science in Proceedings of WWWInternet

2008


7 Appendixes

There are two additional outputs of this master program, namely the log information file and the statistic report file. Figure 8-1 shows one of the basic output formats of the log information for these five Web Service Registries.

Figure8-1 Log information of the "Service Repository" Web Service Registry

Figure8-2 Statistic information of the "Service Repository" Web Service Registry


Figures 8-2 to 8-6 show the output format of the statistic information for the "Service Repository", "Ebi", "Xmethods", "Seekda" and "Biocatalogue" Web Service Registries respectively.

Figure8-3 Statistic information of the "Ebi" Web Service Registry

Figure8-4 Statistic information of the "Xmethods" Web Service Registry


Figure8-5 Statistic information of the "Seekda" Web Service Registry

Figure8-6 Statistic information of the "Biocatalogue" Web Service Registry


Table of Figures

Figure 2-1: Dataflow of Service-Finder and Its Components ... 12
Figure 2-2: Left is the free text input type and right is its output ... 16
Figure 2-3: A semi-structured page containing data records (in rectangular box) to be extracted ... 16
Figure 2-4: Architecture of the Pica-Pica Web Service Description Crawler ... 20
Figure 3-1: Overview of the Basic Architecture for the Deep Web Services Crawler ... 25
Figure 3-2: Overview of the process flow of the Web Service Extractor Component ... 27
Figure 3-3: Service list page of the Service-Repository ... 29
Figure 3-4: Original source code of the internal link for the Web service "BLZService" ... 29
Figure 3-5: Code overview of getting the service page link in the Service-Repository ... 29
Figure 3-6: Service page of the Web service "BLZService" ... 29
Figure 3-7: Overview of the process flow of the WSDL Grabber Component ... 30
Figure 3-8: WSDL link of the Web service "BLZService" in the service page ... 31
Figure 3-9: Original source code of the WSDL link for the Web service "BLZService" ... 32
Figure 3-10: Code overview of the "getServiceRepositoryWSDLLink" function ... 32
Figure 3-11: Code overview of the "oneParameter" function ... 32
Figure 3-12: Overview of the process flow of the Property Grabber Component ... 33
Figure 3-13: Structure properties of the service "BLZService" in the service list page ... 37
Figure 3-14: Structure properties of the service "BLZService" in the service page ... 38
Figure 3-15: Endpoint information of the Web service "BLZService" in the service page ... 38
Figure 3-16: Monitoring information of the service "BLZService" in the service page ... 39
Figure 3-17: Whois information of the service domain "thomas-bayer.com" ... 40
Figure 3-18: Overview of the process flow of the Storage Component ... 41
Figure 3-19: Implementation code for getting the WSDL document ... 44
Figure 3-20: Implementation code for generating the XML file ... 44
Figure 3-21: Implementation code for generating the INI file ... 45
Figure 3-22: Implementation code for creating the table in the database ... 45
Figure 3-23: Implementation code for generating the table records ... 46
Figure 4-1: Service amount statistic of these five Web Service Registries ... 49
Figure 4-2: Statistic information for the WSDL documents ... 50
Figure 4-3: Average number of service properties ... 51
Figure 4-4: WSDL document format of one Web service ... 52
Figure 4-5: INI file format of one Web service ... 53
Figure 4-6: XML file format of one Web service ... 53
Figure 4-7: Database data format for all Web services ... 53
Figure 4-8: Average time cost for extracting service properties in all Web Service Registries ... 55
Figure 4-9: Average time cost for obtaining the WSDL document in all Web Service Registries ... 56
Figure 4-10: Average time cost for generating the XML file in all Web Service Registries ... 57
Figure 4-11: Average time cost for generating the INI file in all Web Service Registries ... 57
Figure 4-12: Average time cost for creating the database record in all Web Service Registries ... 58
Figure 4-13: Average time cost for getting one Web service in all Web Service Registries ... 58


Table of Tables

Table 3-1: Structured Information of Service-Repository Web Service Registry ... 34
Table 3-2: Structured Information of Xmethods Web Service Registry ... 34
Table 3-3: Structured Information of Seekda Web Service Registry ... 34
Table 3-4: Structured Information of Ebi Web Service Registry ... 34
Table 3-5: Structured Information of Biocatalogue Web Service Registry ... 34
Table 3-6: SOAP Operation Information of Biocatalogue Web Service Registry ... 35
Table 3-7: REST Operation Information of Biocatalogue Web Service Registry ... 35
Table 3-8: Endpoint Information of these five Web Service Registries ... 35
Table 3-9: Monitoring Information of these five Web Service Registries ... 35
Table 3-10: Whois Information for these five Web Service Registries ... 36
Table 3-11: Extracted Structured Information of the Web service "BLZService" ... 38
Table 3-12: Extracted Endpoint Information of the Web service "BLZService" ... 39
Table 3-13: Extracted Monitoring Information of the Web service "BLZService" ... 39
Table 3-14: Extracted Whois Information of the service domain "thomas-bayer.com" ... 40
Table 3-15: Sleep Time of these five Web Service Registries ... 47
Table 4-1: Service amount statistic of these five Web Service Registries ... 48
Table 4-2: Statistic information for the WSDL documents ... 49
Table 4-3: Average time cost information for all Web Service Registries ... 55


Table of Abbreviations

DTD Document Type Definition

DWSC Deep Web Service Crawler

HTML HyperText Markup Language

MTBF Mean Time between Failures

MTTR Mean Time to Recovery

QoS Quality of Service

REST Representational State Transfer

RTT Round Trip Time

SOAP Simple Object Access Protocol

SQL Structured Query Language

URL Uniform Resource Locator

WHATWG Web Hypertext Application Technology Working Group

XML eXtensible Markup Language

WSDL Web Service Description Language

WSML Web Service Modeling Language

WSMO Web Service Modeling Ontology


2.1.2.3 The Principle of the Conceptual Indexer and Matcher Component

The Conceptual Indexer and Matcher is essentially a data store center that aims at storing all extracted information about the services and supplying users with retrieval and semantic query capabilities, for example the matchmaking between user requests and service offers, or retrieving user feedback on extracted annotations.

In addition, let us have a look at the function of this component and its input and output.

Input:
- Semantic annotation data and full text information obtained from the Automatic Annotation component
- Semantic annotation data and full text information that come from the user interfaces
- Cluster data from the user and service clustering component

Function:
- Store the semantic annotations received from the Automatic Annotation component and from the user interface
- Store the cluster data procured through the clustering component
- Store and index the textual descriptions offered by the Automatic Annotation component and the textual comments offered by users
- Ontological querying of the semantic data in the data store center
- Combined keyword and ontological querying used for user queries
- Provide a list of similar services for a given service

Output:
- A list of matching services that are queried by users; in particular, these services should be sorted by ranking and can also be iterated
- All available data related to a particular entity must be retrievable at the user interface

2.1.2.4 The Principle of the Service-Finder Portal Interface Component

The Service-Finder Portal Interface is the main entry point provided for users of the Service-Finder system to search and browse the data which is managed by the Conceptual Indexer and Matcher component. In addition, the users can also contribute information by providing tags, comments, categorizations and ratings for the data they browse. Furthermore, developers can directly invoke the Service-Finder functionalities from their custom applications by means of an API.

The details of this component's function, input and output are presented below.

Input:
- A list of ordered services for a query
- Detailed information about a service or a set of services and a service provider
- Query access to the service category ontology and the most used tags provided by the users
- Service availability information

Function:
- The Web Interface allows the users to search services by keyword, tag or concept in the categorization, to sort and filter query results by refining the query, to compare and bookmark services, and to try out the services that offer this functionality
- The API allows the developers to invoke Service-Finder functionalities

Output:
- Explicit user annotations such as tags, ratings, comments, descriptions and so on
- Implicit user data, for example the click streams of users, bookmarks, comparisons, links sent, etc.
- Manual advertising of available new services

2.1.2.5 The Principle of the Cluster Engine Component

The Cluster Engine is in charge of collecting and analyzing information about user behaviors from the Service-Finder Portal, e.g. the queried services and the compared services of the users. Moreover, it also provides cluster data to the Conceptual Indexer and Matcher for providing service recommendations.

Furthermore, this component's function, input and output are introduced in detail below.

Input:
- Service annotation data, both extracted and from user feedback
- Users' click streams, used for extracting user behaviors

Function:
- Obtain user clusters from user behaviors
- Obtain service clusters from service annotation data to enable finding similar services

Output:
- Clusters of users and services

2.2 Information Extraction

Due to the rapid development and use of the World-Wide Web, a huge amount of information sources has been produced on the Internet. However, access to them by browsing and searching is limited because of the heterogeneity and the lack of structure of these Web information sources. Therefore, Information Extraction, which transforms Web pages into program-friendly structures for post-processing, becomes a great necessity. The task of Information Extraction is specified in terms of its inputs and its extraction targets, and the technique used in the process of Information Extraction is called the extractor.

2.2.1 Input Types of Information Extraction

Generally speaking, there are three different input types. The first input type is the unstructured document, for example the free text shown in figure 2-2. It is unstructured and written in natural language, so it requires substantial natural language processing. The second input type is the structured document, for instance XML documents, because their data can be described through an available DTD (Document Type Definition) or XML (eXtensible Markup Language) schema. Finally, the third input type is the semi-structured document, which is widespread on the Web, such as the large volume of HTML pages with tables, itemized lists and enumerated lists. This is because HTML tags are often used to render these embedded data in the HTML pages. See figure 2-3.

Figure 2-2: Left is the free text input type and right is its output [4]

Figure 2-3: A semi-structured page containing data records (in rectangular box) to be extracted [4]

Therefore, the inputs of the semi-structured type can be seen as documents with a fairly regular structure, and the data of these documents can be displayed in an HTML or non-HTML format. Besides, since the Web pages of the Deep Web are dynamic and generated from structured databases in terms of templates or layouts, the Deep Web can be considered as one of the input sources which provide such semi-structured documents. For example, the authors, price and comments of the book pages provided by Amazon have the same layout; that is because these Web pages are generated from the same database and rendered with the same template or layout. Furthermore, semi-structured HTML pages can also be generated manually. For example, although the publication lists provided on different researchers' homepages are produced by diverse users, they all have a title and a source property for every single paper. Eventually, the inputs for some Information Extraction tasks can also be pages of the same class within or among various Web Service Registries.

2.2.2 Extraction Targets of Information Extraction

Moreover, regarding the task of Information Extraction, the extraction target has to be considered. There are two different extraction targets. The first one is a relation of k-tuples, where k is the number of attributes in a record. Nevertheless, in some cases an attribute of a record may have no instantiation, while in other cases an attribute owns multiple instantiations. The second extraction target is the complex object with hierarchically organized data. Though the ways of depicting the extraction targets in a page are diverse, the most common structure is the hierarchical tree. This hierarchical tree may contain only one leaf node, or one or more lists of leaf nodes grouped under internal nodes. The structure of a data object may also be flat or nested. In brief, if the structure is flat, then there is only one leaf node, which can also be called the root. Otherwise, if it is a nested structure, then the internal nodes involved in this data object span more than two levels.
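To make the flat and nested cases concrete, the following tiny sketch shows the two shapes as Python dictionaries. It is only an illustration of the idea described above, not an example taken from [4]; all field names and values are made up.

flat_record = {"title": "BLZService", "provider": "thomas-bayer.com", "rating": 4}

nested_record = {
    "title": "Some Book",
    "authors": [                      # internal node holding a list of leaf nodes
        {"name": "First Author"},
        {"name": "Second Author"},
    ],
    "offer": {"price": "19.99", "currency": "EUR"},   # a second internal node
}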

Furthermore, in order to make the Web pages readable for human beings and easier to visualize, the tables or tuples of the same list, or the elements of a tuple, should be clearly isolated or demarcated. However, the displaying of a data object in a Web page can be affected by the following conditions [4]:

- The attribute of a data object has zero or several values:
  (1) If there is no value for the attribute of a data object, this attribute is called a "none" attribute. For example, a special offer only available for certain books might be a "none" attribute.
  (2) If there is more than one value for the attribute of a data object, it is called a "multiValue" attribute. For instance, the name of the author of a book could be a "multiValue" attribute.

- The set of attributes (A1, A2, A3, ...) has multiple orderings:
  That is to say, among this set of attributes the position of an attribute might change according to the diverse instances of a data object. Such an attribute is called a "multiOrdering" attribute. For instance, for movies before the year 1999 a movie site might enumerate the release date in front of the movie's title, while for movies from 1999 onwards it enumerates the release date behind the movie's title.

- The attribute has different formats:
  This means the displaying format of the data object can be completely distinct for different instances. Therefore, if the format of an attribute is free, then a lot of rules are needed to deal with all possible cases. This kind of attribute is called a "multiFormat" attribute. For example, an e-commerce Web site might use a bold font to present regular prices, while using a red color to display sale prices. Nevertheless, there is another situation in which different attributes of a data object have the same format. For example, various attributes may be presented using the <TD> tags in a table presentation. Attributes like those can be differentiated by means of the order information of these attributes. However, for cases where a "none" attribute occurs or "multiOrdering" attributes exist, the rules for extracting these attributes have to be revised.

- The attribute cannot be decomposed:
  For easier processing, the input documents are sometimes treated as strings of tokens instead of strings of characters. In addition, some attributes cannot be decomposed into several individual tokens. These attributes are called "untokenized" attributes. For example, in college course catalogue entries like "COMP4016" or "GEOL2001", the department code and the course number cannot be separated into two different strings of characters such as "COMP" and "4016" or "GEOL" and "2001".

2.2.3 The Used Techniques in Information Extraction

The extractor used in the process of Information Extraction aims at providing a single uniform query interface to access information sources like database servers and Web servers. It consists of the following phases: collecting returned Web pages, labeling these Web pages, generalizing extraction rules, extracting the relevant data, and outputting the result in an appropriate format (XML format or relational database) for further information integration. For example, at first the extractor queries the Web server to gather the returned pages through the HTTP protocol; after that, it starts to extract the contents of these HTML documents and integrates them with other data sources. The whole process of the extractor follows the steps below.

Step 1: At the beginning, the input has to be tokenized. There are two different granularities for the input string tokenization: tag-level encoding and word-level encoding. Tag-level encoding transforms the tags of an HTML page into general tokens, while transforming every text string between two tags into a special token. Word-level encoding does this in another way: it treats each word in a document as a token.

Step 2: Next, the extraction rules are applied for every attribute of the data object in the Web pages. These extraction rules can be induced by top-down or bottom-up generalization, pattern mining or logic programming. In addition, the type of extraction rules may be expressed by means of regular grammars or logic rules. For example, some use path expressions over the HTML parse tree, like html->head->title or html->table[0], some use syntactic or semantic constraints, and some use delimiter-based constraints such as HTML tags or literal words (a small sketch of a delimiter-based rule follows these steps).

Step 3: After that, all the extracted data are assembled into records.

Step 4: Finally, this process is iterated until all the data objects in the input are processed.
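As an illustration of the delimiter-based rules mentioned in step 2, the following small Python sketch applies one left/right delimiter pair per attribute to a page string and assembles the record. It is my own example, not code from any of the cited systems; the HTML snippet and the rule set are assumptions.

import re

def extract(page, rules):
    """Apply one delimiter-based rule per attribute (step 2) and assemble a record (step 3)."""
    record = {}
    for attribute, (left, right) in rules.items():
        match = re.search(re.escape(left) + r"(.*?)" + re.escape(right), page, re.S)
        record[attribute] = match.group(1).strip() if match else None
    return record

page = ("<tr><td class='name'>BLZService</td>"
        "<td class='provider'>thomas-bayer.com</td></tr>")
rules = {
    "service name": ("<td class='name'>", "</td>"),
    "provider": ("<td class='provider'>", "</td>"),
}
print(extract(page, rules))
# {'service name': 'BLZService', 'provider': 'thomas-bayer.com'}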


2.3 Pica-Pica Web Service Description Crawler

Pica pica is known as a bird species, which can also be called the pie. Here, however, Pica-Pica is a Web Service Description Crawler which is designed to address the quality of Web Services, for example by evaluating the descriptive quality of the offered Web Services and how well these Web Services are described in today's Web Service Registries.

2.3.1 Needed Libraries of the Pica-Pica Web Service Description Crawler

This version of the Pica-Pica Web Service Description Crawler was written by Anton Caceres and Josef Spillner and is programmed in the Python language. In order to run its scripts to parse the HTML pages, it needs two additional libraries: Beautiful Soup and html5lib.

- Beautiful Soup
  It is an HTML/XML parser for the Python language, and it can even turn invalid markup into a parse tree [5] (a short usage sketch is given at the end of this section). Moreover, the following three features make it more powerful:
  - Bad markup does not choke Beautiful Soup. In fact, it generates a parse tree that makes approximately as much sense as the original document, so you can still obtain the data that you want.
  - Beautiful Soup provides a toolkit with simple, idiomatic methods for navigating, searching and modifying the parse tree. Hence you do not need to create a custom parser for every application.
  - If the document has already specified an encoding, then you can ignore it, since Beautiful Soup can convert the documents from Unicode to UTF-8 in an automatic way. Otherwise, what you have to do is just specify the encoding of the original documents.

  Furthermore, the ways of including Beautiful Soup into an application are the following [5]:
  - from BeautifulSoup import BeautifulSoup (for processing HTML)
  - from BeautifulSoup import BeautifulStoneSoup (for processing XML)
  - import BeautifulSoup (to get everything)

- html5lib
  It is a Python package which implements the HTML5 [8] parsing algorithm. In order to gain maximum compatibility with the current major desktop web browsers, this implementation is based on the WHATWG (Web Hypertext Application Technology Working Group) HTML5 specification.
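The following short sketch illustrates how Beautiful Soup turns sloppy HTML into a parse tree and lets the data be navigated with simple calls. It assumes a Python 2 environment with Beautiful Soup 3 installed (the version whose imports are listed above); the HTML snippet is made up and deliberately left broken.

from BeautifulSoup import BeautifulSoup   # Beautiful Soup 3, Python 2 era

# deliberately broken markup: unclosed <td>, <tr> and <table> tags
html = ("<table><tr><td><a href='/service/1'>BLZService</a>"
        "<tr><td><a href='/service/2'>WeatherService</a>")

soup = BeautifulSoup(html)                 # the bad markup is repaired into a parse tree
links = [(a['href'], a.string) for a in soup.findAll('a')]
# links is roughly [(u'/service/1', u'BLZService'), (u'/service/2', u'WeatherService')]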


2.3.2 Architecture of the Pica-Pica Web Service Description Crawler

Figure 2-4: Architecture of the Pica-Pica Web Service Description Crawler

Figure 2-4 illustrates the primary architecture of the Pica-Pica Web Service Description Crawler. It includes four fundamental components: the Service Page Grabber component, the WSDL Grabber component, the Property Grabber component and the WSML Register component.

(1) The Service Page Grabber component takes the URL seed as the input and outputs the link of the service page to the following two components: the WSDL Grabber component and the Property Grabber component.

(2) The WSDL Grabber component is responsible for obtaining the WSDL document based on the delivered service page link, and then checking the validity of the obtained WSDL document. Finally, only the valid WSDL documents are passed into the WSML Register component for further processing.

(3) The Property Grabber component tries to extract the service's properties hosted in the service page, if there are any. After that, all these service properties are saved into an INI file as the information of that service.

(4) The functionality of the WSML Register component is to write appropriate WSML documents by means of the valid WSDL documents delivered from the WSDL Grabber component and the optional INI files delivered from the Property Grabber component, and afterwards to register them in ConQo.

- WSML [9]
  WSML stands for Web Service Modeling Language, which provides a framework with different language variants. Hence it is often used to describe the different aspects of semantic Web Services according to the conceptual model of WSMO.

- WSMO [10]
  WSMO, whose full name is Web Service Modeling Ontology, is dedicated to describing various aspects related to Semantic Web Services based on ontologies. An ontology is a formal, explicit specification of a shared conceptualization. In fact, these ontologies are the sticking points that can satisfy the linkage between the agreement of the communities of users and the defined conceptual semantics of the real world.

- ConQo [11]
  It is a discovery framework that considers not only the Quality of Service (QoS) but also context information. It uses a Web Service Repository to manage these service descriptions based on WSML.

2.3.3 Implementation of the Pica-Pica Web Service Description Crawler

This section describes the processes of the implementation of the Pica-Pica Web Service Description Crawler in detail.

(1) Firstly, for starting the whole crawling process of the Pica-Pica Web Service Description Crawler, an input is needed as the initial seed. In this crawler there are five Web Service Registries, which are listed below. The URL addresses of these five Web Service Registries are used as the input seeds for this Pica-Pica Web Service Description Crawler. Moreover, in this version of the Pica-Pica Web Service Description Crawler there is a single Python script for each Web Service Registry, and the crawling processes of these Web Service Registries' Python scripts are executed one after another.

Biocatalogue: http://www.biocatalogue.com
Ebi: http://www.ebi.ac.uk
Seekda: http://www.seekda.com
Service-Repository: http://www.service-repository.com
Xmethods: http://www.xmethods.net

(2) Then, after being fed with the input seed, it steps into the next component, the Service Page Grabber. At first, this component tries to read the data from the Web based on the input seed. Then it establishes a parse tree of the read data by means of the functions of the Beautiful Soup library. After that, the Service Page Grabber component starts to look for the service page link of each service published in the Web Service Registry by means of the functions in the html5lib library. In case the service page link of a single service is found, it first checks whether this service page link is valid or not. Once the service page link is valid, it is passed into the following two components for further processing: the WSDL Grabber component and the Property Grabber component.

(3) When the WSDL Grabber component receives a service page link from its previous component, it sets out to extract the WSDL link address of that service through the parse tree of the data in this service page. Next, this component starts to download the WSDL document of that service in terms of the WSDL link address. Thereafter, the obtained WSDL document is stored on the disk. The process of this WSDL Grabber component is continually carried on until no more service links are passed to it. Certainly, not all grabbed WSDL documents are valid: they may contain bad definitions or a bad namespaceURI, be an empty document or, even worse, not be in XML format at all. Hence, in order to pick them out, this component further analyzes the involved WSDL documents and puts all the valid documents into a "validWSDLs" folder, whereas the other, invalid documents are put into a folder named "invalidWSDLs" in order to gather statistic information. Finally, only the WSDL documents in the "validWSDLs" folder are passed into the subsequent component.

(4) Moreover, since some Web Service Registries provide additional information about the services, such as availability, service provider or version, the Property Grabber component sets out to extract this information as the service's properties and thereafter saves these properties into an INI file. However, if there is no additional information available, then there is no need to extract service properties, and thus there is no INI file for that service. In this implementation of the Pica-Pica Web Service Description Crawler only the Python scripts for the Seekda and Service-Repository Web Service Registries have functions to extract the services' properties, while the scripts for the other three Web Service Registries lack such functions.

(5) Furthermore, it is optional to create a report file which contains the statistic information of this process, such as the total number of services of one Web Service Registry, the number of services whose WSDL document is invalid, etc.

(6) As has been stated, at this point there is a folder with all valid WSDL documents and possibly some INI files. Therefore, the task of the WSML Register component is now to generate the appropriate WSML documents from these valid WSDL documents and INI files, and then register them in ConQo. A condensed sketch of the overall control flow is given below.
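The following condensed Python sketch outlines the control flow described in steps (1) to (6). It is an assumption-based illustration, not the actual Pica-Pica scripts: the two parsing helpers are placeholders for the registry-specific Beautiful Soup/html5lib code, only two of the five seeds are listed, and the folder names simply follow the description above.

import os
import urllib.request
import xml.etree.ElementTree as ET

REGISTRY_SEEDS = {
    "Service-Repository": "http://www.service-repository.com",
    "Xmethods": "http://www.xmethods.net",
}

def find_service_page_links(listing_html):
    return []        # registry-specific parsing (Beautiful Soup / html5lib) would go here

def find_wsdl_link(service_page_html):
    return None      # registry-specific parsing would go here

def crawl(registry, seed):
    listing = urllib.request.urlopen(seed).read()
    for service_link in find_service_page_links(listing):
        service_page = urllib.request.urlopen(service_link).read()
        wsdl_url = find_wsdl_link(service_page)
        wsdl = urllib.request.urlopen(wsdl_url).read() if wsdl_url else b""
        try:
            if not wsdl.strip():
                raise ValueError("empty document")
            ET.fromstring(wsdl)                 # "valid" here: non-empty, well-formed XML
            folder = "validWSDLs"
        except (ET.ParseError, ValueError):
            folder = "invalidWSDLs"
        os.makedirs(folder, exist_ok=True)
        name = wsdl_url.rsplit("/", 1)[-1] if wsdl_url else "missing.wsdl"
        with open(os.path.join(folder, registry + "_" + name), "wb") as out:
            out.write(wsdl)

if __name__ == "__main__":
    for registry, seed in REGISTRY_SEEDS.items():   # one script per registry, run in turn
        crawl(registry, seed)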

2.4 Conclusions of the Existing Strategies

This chapter presented three aspects of the existing strategies: the Service-Finder project, the Information Extraction technique and the Pica-Pica Web Service Description Crawler.

The task of this master program is to obtain the available Web services and their related information from the Web. This is essentially a procedure of extracting the needed information for each service, such as the service's WSDL document and its properties. Therefore, the Information Extraction technique, which is used to extract information hosted on the Web, can be applied in this master program.

Moreover, Service-Finder is a large project that is not only able to obtain the available Web services and their related information from the Web, but also enriches these crawled data with annotations by means of the Service-Finder ontology and the Service Category ontology and integrates all the information into a coherent semantic model. Furthermore, it also provides capabilities for searching and browsing the data with a user interface and gives users service recommendations. However, for a master program the Service-Finder project far exceeds the requirements. Therefore, it is just considered as a reference for this master program.

Furthermore, since the Pica-Pica Web Service Description Crawler aims only at obtaining the available Web services and their related information, it fulfills the primary task of this master program. Nevertheless, regarding the information about the services, this Pica-Pica Web Service Description Crawler extracts only a few properties, sometimes even none. Consequently, in order to improve the quality of the service descriptions, as many properties about each service as possible have to be extracted. Hence, chapter 3 presents an extension of this Pica-Pica Web Service Description Crawler.


3 Design and Implementation

The previous chapter on the state of the art presented the already existing techniques and implementations. In the following, the basic principle of the proposed approach, the Deep Web Services Crawler, is introduced. It is based on these previously existing techniques, especially the Pica-Pica Web Service Description Crawler.

3.1 Deep Web Services Crawler Requirements

This section mainly discusses the goals of the Deep Web Service Crawler approach, the system requirements with respect to the approach, and some non-functional requirements.

3.1.1 Basic Requirements for DWSC

The following list contains the basic requirements which should be achieved:

(1) Produce the largest annotated Service Catalogue. Here the Service Catalogue is actually a list of Web services published in the Web Service Registries. It contains not only the WSDL document of each service but also a file with the information about that service. Therefore, for the purpose of producing the largest annotated Service Catalogue, the proposed approach needs to extract as many properties about those Web services as possible. Moreover, it also has to download the WSDL document that is hosted along with the Web service. That is to say, these properties are not only the interesting structured properties, such as the service name and its WSDL link address, but also other non-functional properties, for example the endpoint and its monitoring information.

(2) Flexible storage: each service has a Service Catalogue that contains all its interesting properties. A major question is how to deal with those service properties, i.e. what kinds of schemes will be used to store them. Hence, in order to store them in a flexible way, the proposed approach provides three methods for the storage (see the sketch below): the first one stores them as an XML file, the second method stores them in an INI file, and the third method uses a database for the storage.
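The following minimal Python sketch illustrates the three storage schemes for one service's properties. It is only an illustration of the idea (the thesis implementation itself is written in Java); the file names, the table layout and the example property values are assumptions.

import sqlite3
import configparser
import xml.etree.ElementTree as ET

properties = {"Service Name": "BLZService",
              "WSDL Link": "http://services.unitedplanet.de/blz/BlzService.asmx?WSDL"}

# 1) XML file: one element per property
root = ET.Element("service")
for key, value in properties.items():
    ET.SubElement(root, "property", name=key).text = value
ET.ElementTree(root).write("BLZService.xml", encoding="utf-8", xml_declaration=True)

# 2) INI file: one section per service
ini = configparser.ConfigParser()
ini["BLZService"] = properties
with open("BLZService.ini", "w") as ini_file:
    ini.write(ini_file)

# 3) Database: one record per service
db = sqlite3.connect("services.db")
db.execute("CREATE TABLE IF NOT EXISTS service (name TEXT, wsdl_link TEXT)")
db.execute("INSERT INTO service VALUES (?, ?)",
           (properties["Service Name"], properties["WSDL Link"]))
db.commit()
db.close()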

3.1.2 System Requirements for DWSC

Generally speaking, the requirements for realizing such a programming project are the following:

1) Operating system: Linux/Unix, Windows (XP, Vista, 2000, etc.)
2) Programming language: C++, Java, Python, C, etc.
3) Programming tool: NetBeans, Eclipse, Visual Studio, and so on

In this master thesis, the scripts of the Deep Web Service Crawler approach are written in the Java programming language. Besides, these code scripts have only been tested on the Windows XP and Linux operating systems, but not on other operating systems.


3.1.3 Non-Functional Requirements for DWSC

In this part, several non-functional requirements for the Deep Web Service Crawler approach are presented:

1) Transparency: the process of data exploration and data storage should be done automatically, without the user's intervention. However, at first the user should specify the path on the hard disk which will be used to store the outputs of this program.

2) Fault tolerance: during the execution of this program errors can happen inevitably. Therefore, in order to keep the process from being interrupted, there must be some necessary error handling so that the process can recover.

3) Completeness: this approach should extract as many of the interesting properties about each Web service as possible, e.g. endpoint and monitoring information, etc.

In addition, since the Pica-Pica Web Service Crawler has already implemented strategies for the following five URLs, the proposed approach must cover not less than those five Web Service Registries:

Biocatalogue: http://www.biocatalogue.com
Ebi: http://www.ebi.ac.uk
Seekda: http://www.seekda.com
Service-Repository: http://www.service-repository.com
Xmethods: http://www.xmethods.net

3.2 Deep Web Services Crawler Architecture

In this section, an overview of the high-level architecture of the Deep Web Services Crawler approach is introduced first. Thereafter, four subsections follow that outline each single component and how they play together.

The current components and flows of data in the Deep Web Service Crawler can be summarized as depicted in Figure 3-1, using the continuous arrows. It first tries to obtain the available service list page links and related service page links by crawling the Web (Web Service Extractor). Then those gathered links are processed in two separate steps: one is to get the service's WSDL document (WSDL Grabber) and the other is to collect the properties of each service (Property Grabber). Finally, all these data are stored in the storage device (Storage). The whole detailed process in figure 3-1 is illustrated as follows.

Step 1: When the Deep Web Service Crawler starts to run, the File Chooser container requires the user to specify a path on the computer or on any other hard disk. The reason for specifying the path is that this Deep Web Service Crawler program needs a place to store all of its outputs.

Step 2: After that, the Web Service Extractor is triggered. It is the main entry to the specific crawling process. Since the Deep Web Service Crawler program is a procedure which is supposed to crawl for Web Services in some given Web Service Registries, the URL addresses of these Web Service Registries should be given as the initial seeds for this Web Service Extractor process. However, since the page structures of these Web Service Registries are completely different, there is a separate, registry-dependent process for each Web Service Registry.

Figure 3-1: Overview of the Basic Architecture for the Deep Web Services Crawler

Step 3: According to the given seed, two types of links are obtained by the Web Service Extractor component. One is the service list page link and the other is the service page link. A service list page is a page that contains a list of Web Services and maybe some information about these Web Services, while a service page is a page that contains much more information about a single service. Finally, it forwards these two types of links to the next two components: the Property Grabber and the WSDL Grabber.

Step 4: Then, on the one hand, the Property Grabber component tries to gather the information about the service that is hosted in the service list page and the service page, such as the name of the service, its description, the ranking of this service, etc. Finally, all the information about the service is collected together as the service properties, which are then delivered to the Storage component for further processing.

Step 5: On the other hand, the WSDL Grabber tries to obtain the WSDL link from the service list page or the service page. That is because for some Web Service Registries the WSDL link is hosted in the service list page, like in Biocatalogue, and for other Web Service Registries it is hosted in the service page, such as in Xmethods. After obtaining the WSDL link, it is also transmitted to the Storage component for further processing.

Step 6: When the service properties and the WSDL link of the service are received by the Storage component, it stores them on the disk. The service properties are stored on the disk in three different ways: as an XML file, as an INI file, or as one record inside a table of the database. For the WSDL link, however, the Storage component first tries to download the page content according to the URL address of the WSDL link. If this works successfully, the page content of the service is stored as a WSDL document on the disk.

Step 7: Nevertheless, this is just the crawling process for a single service, from step 3 to step 6. Hence, if there is more than one service or more than one service list page in those Web Service Registries, the crawling process from step 3 to step 6 is continued again and again until there are no services or service list pages left in those Web Service Registries.

Step 8: Furthermore, after the crawling process for one Web Service Registry finishes, a file is generated that contains some statistic information about this crawling process, for example the time when the crawling process of this Web Service Registry started and when it finished, the total number of Web services in this Web Service Registry, how many services have an empty WSDL document, the average number of service properties in this Web Service Registry, and the average time cost for extracting service properties, getting the WSDL document and generating the XML file, INI file, etc. A small sketch of how such a per-registry report could be assembled is given below.
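The following Python sketch shows one possible way to collect and write out the per-registry statistic report described in step 8. It is an assumption (the actual implementation is Java); the field names simply mirror the ones listed above, and the output file name is made up.

import time

class RegistryStatistics:
    def __init__(self, registry):
        self.registry = registry
        self.start = time.time()
        self.total_services = 0
        self.empty_wsdl = 0
        self.property_counts = []
        self.property_times = []

    def record_service(self, n_properties, empty_wsdl, property_seconds):
        self.total_services += 1
        self.empty_wsdl += 1 if empty_wsdl else 0
        self.property_counts.append(n_properties)
        self.property_times.append(property_seconds)

    def write(self, path):
        avg = lambda xs: sum(xs) / len(xs) if xs else 0.0
        with open(path, "w") as report:
            report.write("Registry: %s\n" % self.registry)
            report.write("Started: %s\n" % time.ctime(self.start))
            report.write("Finished: %s\n" % time.ctime())
            report.write("Total services: %d\n" % self.total_services)
            report.write("Services with empty WSDL: %d\n" % self.empty_wsdl)
            report.write("Average number of properties: %.2f\n" % avg(self.property_counts))
            report.write("Average property extraction time: %.2f s\n" % avg(self.property_times))

stats = RegistryStatistics("Service-Repository")
stats.record_service(n_properties=9, empty_wsdl=False, property_seconds=0.8)
stats.write("ServiceRepository_statistics.txt")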

3.2.1 The Function of the Web Service Extractor Component

The Web Service Extractor component is responsible for gathering service list page links and service page links. It pursues a focused crawl of the Web and only forwards service list page and service page links to subsequent components for analyzing, collecting and gathering purposes. Therefore, it identifies both service list page links and related service page links on these Web Service Registries.

As can be seen from figure 3-2, a crawl for Web Services needs to start from a URL seed. The seed is almost as important as the Web Service Extractor itself, as it highly influences the part of the Web which needs to be crawled. The seed can or shall contain e.g. Web pages where these Web Services are published or pages that talk about Web Services.

Figure 3-2: Overview of the process flow of the Web Service Extractor Component

After being fed with the URL seed, the Web Service Extractor component starts to get the link of the service list page from the initial page of this URL seed. However, this process differs for each of these five Web Service Registries. The following shows the different situations in these Web Service Registries.

- Service-Repository Web Service Registry:
  In this Web Service Registry, the link of the first service list page is the URL address of its seed, which means some Web Services can already be found on the home page of the Service-Repository Web Service Registry. The "first" here implies that there is more than one service list page link in this Web Service Registry. Therefore, the process of getting service list page links in this registry is continually carried on until no more service list page links exist.

- Xmethods Web Service Registry:
  Although there are Web Services on the home page of the Xmethods Web Service Registry, these are only a small subset of the Web Services in this registry. Moreover, in the Xmethods Web Service Registry there is only one page containing all Web Services. Therefore, the service list page link of that page has to be obtained.

- Ebi Web Service Registry:
  The situation in the Ebi Web Service Registry is a little bit like in the Xmethods Web Service Registry. That is to say, there is also one page that contains all Web Services of this Web Service Registry. However, this page is not the initial page of the input seed. Therefore, more than one operation step is needed to get the service list page link of that page.

- Seekda Web Service Registry:
  In the Seekda Web Service Registry, the Web Services are not contained in the initial page of the input seed. The service list page link can be obtained after several additional operation steps. However, there is a problem with getting the service list page links in this registry. Simply put, if there is more than one page containing the Web Services, then for some unknown reason the links of the remaining service list pages cannot be obtained. In other words, only the link of the first service list page can be obtained.

- Biocatalogue Web Service Registry:
  The process of getting the service list page in the Biocatalogue Web Service Registry is almost the same as in the Seekda Web Service Registry. The only difference is that in the Biocatalogue Web Service Registry all service list page links can be obtained if there is more than one service list page.

Then, after getting the link of a service list page, the Web Service Extractor begins to get the link of the service page for each service that is listed in the service list page (a sketch of this loop is given below). The reason why it can do this is that there is an internal link for every service which addresses the service page. It is worth noting that once a service page link is obtained, this component immediately forwards the service page link and the service list page link to the subsequent two components for further processing. Nevertheless, the process of obtaining the service page links is continuously carried out until all services listed in that service list page are crawled. Analogously, the process of getting service list pages is also continuously carried out until no more service list pages exist.
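The following simplified Python sketch shows the two nested loops just described: walk through the service list pages of one registry and, for each listed service, collect its internal service page link. It is an assumption about the control flow, not the thesis' Java code; the helper callables stand in for the registry-specific parsing.

def crawl_registry(seed, fetch, next_service_list_page, internal_service_links):
    """fetch(url) -> html; the other two callables encapsulate registry-specific parsing."""
    service_page_links = []
    list_page = seed
    while list_page is not None:                              # loop over service list pages
        html = fetch(list_page)
        for internal_link in internal_service_links(html):    # loop over the listed services
            service_page_links.append((list_page, internal_link))
        list_page = next_service_list_page(html)              # None once no further list page exists
    return service_page_links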

3.2.1.1 Features of the Web Service Extractor Component

The main features are described in the following paragraphs.

1) Obtain service list page links:
A central task is to obtain the corresponding service list page links. A service list page link is a URL address that includes a public list of Web Services and just some simple information about these Web Services, like the name of the service, an internal URL that links to another page which contains the detailed information about that service, and sometimes the link address of the WSDL document.

2) Obtain service page links:
Once the service list page links are found, a crucial aspect is to extract the internal link of each service to aid the task of service information discovery. Therefore, it is the task of the Web Service Extractor to harvest the HTML page content of the service list page, so that the service page links, which contain much more detailed information about the single Web service, can be obtained.

3.2.1.2 Input of the Web Service Extractor Component

This component depends on specific input seeds. The only input required for this component is a URL seed. Actually, this URL seed will be one of the URLs displayed in section 3.1.3.

3.2.1.3 Output of the Web Service Extractor Component

The component produces two kinds of service-related page links from the Web:
- Service list page links
- Service page links


3.2.1.4 Demonstration of the Web Service Extractor

In order to give a comprehensive understanding of the process of the Web Service Extractor component, some figures are given below for explanation. Though there are five URL addresses in section 3.1.3, only the URL of the Service-Repository is shown as an example.

1) The input seed is the initial URL address of the Service-Repository, which is "http://www.service-repository.com".

2) As already said in section 3.2.1, the first service list page link of this Web Service Registry is its input seed "http://www.service-repository.com". Figure 3-3 shows the corresponding service list page of that link.

Figure 3-3: Service list page of the Service-Repository

Figure 3-4: Original source code of the internal link for the Web service "BLZService"

Figure 3-5: Code overview of getting the service page link in the Service-Repository

Figure 3-6: Service page of the Web service "BLZService"

3) Now that the service list page link is already known, the next step is to acquire the service page links of the services listed in the service list page. The text in the red box of figure 3-4 shows the internal link of the Web service "BLZService". However, this is not the complete link of the service page: it has to be prefixed with the initial URL address of the Service-Repository Web Service Registry, "http://www.service-repository.com" (see the sketch after this list). The code for getting the service page link of a Web service in the Service-Repository is shown in figure 3-5. Therefore, the final link of this service page is "http://www.service-repository.com/service/overview-210897616". Figure 3-6 is the corresponding service page of that link.

4) Afterwards, those two links, the service list page link and the service page link, which are gathered by the Web Service Extractor component, are immediately forwarded to the next two components, the WSDL Grabber component and the Property Grabber component.
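The prefixing step from item 3) can be expressed with a standard URL join. The following small Python illustration is not the Java code of figure 3-5; it only restates the idea with the values from the example above.

from urllib.parse import urljoin

base = "http://www.service-repository.com"
internal_link = "/service/overview-210897616"     # relative link taken from the service list page

service_page_link = urljoin(base, internal_link)
print(service_page_link)   # http://www.service-repository.com/service/overview-210897616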

3.2.2 The Function of the WSDL Grabber Component

The WSDL Grabber component acquires the WSDL link that is hosted on the Web, based on the service list page link or the service page link. The whole process flow is illustrated in figure 3-7.

Figure 3-7: Overview of the process flow of the WSDL Grabber Component

When the WSDL Grabber component receives the inputs delivered from the previous component, it starts to get the WSDL link for the service based on these inputs. Although the inputs of the WSDL Grabber component are the links of the service page and the service list page, only one of them contains the WSDL link for the corresponding service. That is to say, the WSDL link exists either in the service page or in the service list page. The reason why both links need to be delivered into this component is that only one of the five Web Service Registries, namely Biocatalogue, hosts the WSDL link in the service list page, while for the other four Web Service Registries the WSDL link is hosted in the service page. Therefore, the WSDL links of these four Web Service Registries are obtained in terms of the service page link, and for the Biocatalogue Web Service Registry the WSDL link is obtained through the service list page link. However, there is a problem with getting the WSDL links of the Biocatalogue Web Service Registry. In brief, some of the Web services listed in the service list page of the Biocatalogue Web Service Registry have no WSDL link; in other words, these services have no WSDL document. In a situation like this, the WSDL link of such a Web service is assigned a "NULL" value. For the Web services in the other four Web Service Registries, the WSDL link always exists in the service page. Eventually, whenever the WSDL Grabber component extracts the WSDL link for a single Web service, it is immediately forwarded to the Storage component for downloading the WSDL document at once (a small download sketch follows below).
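The following minimal Python sketch illustrates this download step: fetch the page content behind a WSDL link and save it as a .wsdl file on the disk, skipping services whose link is "NULL". It is an assumption, not the Storage component's Java code; the function name and the output file name are made up.

import urllib.request

def download_wsdl(wsdl_link, target_path):
    if not wsdl_link or wsdl_link == "NULL":     # services without a WSDL document are skipped
        return False
    content = urllib.request.urlopen(wsdl_link).read()
    with open(target_path, "wb") as wsdl_file:
        wsdl_file.write(content)
    return True

# example: download_wsdl("http://services.unitedplanet.de/blz/BlzService.asmx?WSDL", "BLZService.wsdl")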

3.2.2.1 Features of the WSDL Grabber Component

The WSDL Grabber component has to provide the following feature:

- Obtain WSDL links:
  The WSDL link is the direct way to get to the page that contains the contents of the WSDL document. It is actually a URL address, but at the end of this URL address there is something like "wsdl" or "WSDL" to indicate that this is an address that points to the page of a WSDL document.

3.2.2.2 Input of the WSDL Grabber Component

This component requires the following input data:
- Service list page link
- Service page link

3.2.2.3 Output of the WSDL Grabber Component

The component only produces the following output data:
- The URL address of the WSDL link for each service

3.2.2.4 Demonstration of the WSDL Grabber Component

This section demonstrates a list of figures in order to give a comprehensive understanding of the process of the WSDL Grabber component. It uses the same Web service of the Service-Repository as an example too.

1) The input for this WSDL Grabber component is the link of the service page obtained from the Web Service Extractor component. The address of this link is "http://www.service-repository.com/service/overview-210897616".

Figure 3-8: WSDL link of the Web service "BLZService" in the service page

2) Figure 3-9 shows the corresponding original source code of the WSDL link that is highlighted in figure 3-8.

Figure 3-9: Original source code of the WSDL link for the Web service "BLZService"

3) Figure 3-10 and figure 3-11 show the code that is used to extract the WSDL link shown in figure 3-9. However, figure 3-10 is the particular code only for the Service-Repository Web Service Registry; for the other four Web Service Registries this is different. The function "getServiceRepositoryWSDLLink" first gets a list of nodes that have the same HTML tag name "b". Then it checks all of the nodes one by one to see if the text value of a node is "WSDL". If a node fulfills this condition, then the attribute value of its sibling, which is an "a" element here, is extracted as the value of the WSDL link for this Web service (a sketch of this logic is given after this list).

Figure 3-10: Code overview of the "getServiceRepositoryWSDLLink" function

Figure 3-11: Code overview of the "oneParameter" function

4) Finally, after applying the functions shown in figures 3-10 and 3-11, the WSDL link for the Web service "BLZService" can be obtained. Its value is "http://services.unitedplanet.de/blz/BlzService.asmx?WSDL".
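The following Python sketch is an assumption-based analogue (using Beautiful Soup 3) of the logic described in item 3). The thesis' own function in figure 3-10 is Java code with a different HTML parser, so this only restates the idea; the HTML snippet is made up to mirror the structure of figure 3-9.

from BeautifulSoup import BeautifulSoup

def get_service_repository_wsdl_link(service_page_html):
    soup = BeautifulSoup(service_page_html)
    for b_node in soup.findAll('b'):                        # all nodes with the tag name "b"
        if b_node.string and b_node.string.strip() == 'WSDL':
            sibling = b_node.findNextSibling('a')           # the neighbouring "a" element
            if sibling is not None:
                return sibling['href']                      # its attribute holds the WSDL link
    return None

html = "<p><b>WSDL</b> <a href='http://services.unitedplanet.de/blz/BlzService.asmx?WSDL'>WSDL</a></p>"
print(get_service_repository_wsdl_link(html))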


3.2.3 The Function of the Property Grabber Component

The Property Grabber component is a module which is used to extract and gather all the Web service information that is hosted on the Web, which is in fact the information shown in the service list page and the service page. In the end, all the obtained Web service information is collected together as the service properties, which are delivered to the Storage component for storing. The detailed process flow of the Property Grabber component is illustrated in figure 3-12.

The inputs of the Property Grabber component are the same as the inputs of the WSDL Grabber component. However, there is still a little difference between them with respect to the seed. As already mentioned in section 3.2.2, for the WSDL Grabber component only one of the inputs is sufficient to get the WSDL link, while the Property Grabber component needs both of them as seeds. Once the Property Grabber component receives the needed inputs, it starts to extract the information of that single Web service.

Figure 3-12: Overview of the process flow of the Property Grabber Component

After the Property Grabber component receives the inputs, it starts to extract the service information for the Web service. Generally speaking, the service information consists of four aspects: structured information, endpoint information, monitoring information and whois information.

(1) Structured Information

The structured information can be obtained by extracting the information hosted in the service page and the service list page. It is the basic descriptive information about the service, such as the name of the service, the URL address through which the WSDL document can be obtained, the description introducing the service, the provider who provides that service, its rating, and the server which hosts this service, etc. However, the elements constituting this structured information are diverse in these different Web Service Registries. For example, the rating information of a Web service exists in the Service-Repository Web Service Registry, while the Xmethods Web Service Registry does not have this information. In addition, even for Web services in the same Web Service Registry some elements of the structured information may not exist. For instance, one service in a Web Service Registry may have description information, while another service in the same Web Service Registry does not have this description information. Tables 3-1 to 3-5 show the structured information that should be extracted in these five Web Service Registries. Moreover, if the style of a service in the Biocatalogue Web Service Registry is SOAP, there is some additional information describing the SOAP operations of this service, while if it is REST, this additional information describes the REST operations. They should also be considered as a part of the structured information. Table 3-6 and table 3-7 illustrate the information for these two different operation types.

Service Name | WSDL Link | WSDL Version
Provider | Server | Rating
Homepage | Owner Homepage | Description

Table 3-1 Structured Information of Service-Repository Web Service Registry

Service Name | WSDL Link | Provider
Service Style | Homepage | Implementation Language
Description | User Description | Contributed Client Name
Type of this Client | Publisher for this Client | Used Toolkit of this Client
Used Language of this Client | Used Operation System of this Client

Table 3-2 Structured Information of Xmethods Web Service Registry

Service Name | WSDL Link | Server
Provider | Provider's Country | Service Style
Rating | Description | User Description
Service Tags | Documentation (within WSDL)

Table 3-3 Structured Information of Seekda Web Service Registry

Service Name | WSDL Link | Port Name
Service URL Address | Implementation Class

Table 3-4 Structured Information of Ebi Web Service Registry

Service Name | WSDL Link | Style
Provider | Provider's Country | View Times
Favorite Times | Submitter | Service Tags
Total Annotation | Provider Annotation | Member Annotation
Registry Annotation | Base URL | SOAP Lab Server Base URL
Description | User Description | Category

Table 3-5 Structured Information of Biocatalogue Web Service Registry


SOAP Operation Name | Inputs and Outputs | Operation Description
Operation Tags | Part of Which Service

Table 3-6 SOAP Operation Information of Biocatalogue Web Service Registry

REST Operation Name | Service Tags | Used Template
Operation Description | Part of Which Service | Part of Which Endpoint Group

Table 3-7 REST Operation Information of Biocatalogue Web Service Registry

(2) Endpoint Information

Endpoint information is the port information of the Web service, which can be extracted only from the service page. Since the Web Service Registries structure the endpoint information differently, some of its elements vary considerably between registries. One thing needs particular attention: the Ebi Web Service Registry does not provide endpoint information for any of the Web services published in it. Moreover, even though the Web services within one Web Service Registry share the same structure of endpoint information, some elements may be missing or empty, and a registry may even have no endpoint information at all for some of its Web services. Nevertheless, whenever endpoint information exists for a Web service, it contains at least one element, namely the URL address of the endpoint. Table 3-8 shows the endpoint information that should be extracted from these five Web Service Registries.

Web Service Registry Name | Elements of the Endpoint Information
Service Repository | Endpoint Name, Endpoint URL, Endpoint Critical, Endpoint Type, Bound Endpoint
Xmethods | Endpoint URL, Publisher of this Endpoint, Contact Email of this Publisher, Implementation Language of this Endpoint
Seekda | Endpoint URL
Biocatalogue | Endpoint Name, Endpoint URL

Table 3-8 Endpoint Information of these five Web Service Registries

Web Service Registry Name | Elements of the Monitoring Information
Service Repository | Service Availability, Number of Downs, Total Uptime, Total Downtime, MTTR, MTBF, RTT Max of Endpoint, RTT Min of Endpoint, RTT Average of Endpoint, Ping Count of Endpoint
Seekda | Service Availability, Begin Time of Monitoring
Biocatalogue | Monitored Status of Endpoint, Overall Status of Service

Table 3-9 Monitoring Information of these five Web Service Registries

(3) Monitoring Information

Monitoring information is the measured statistical information about the Web service. It is worth noting that the Ebi and Xmethods Web Service Registries do not provide monitoring information for any of the Web services published by them, while in the other three Web Service Registries only a few Web services may lack this information. Table 3-9 above lists the monitoring information for these three Web Service Registries.

(4) Whois Information

Whois information is not extracted from the information hosted in the service page and the service list page. It is descriptive information about the service domain, which is obtained via the address of the WSDL link. Hence the process of getting the whois information starts by deriving the service domain first. The final value of the service domain must not contain prefixes such as "http", "https" or "www"; it must be the registrable domain under the top level domain. Afterwards the service domain database is queried by sending the value of the service domain to the whois client, which is simply a Web site on the Internet, for example "httpwwwwhois365comcndomain". If information about that service domain exists, a list of its information is returned as output. However, the structure of the returned information differs from one service domain to another. Therefore the most challenging part is handling the extraction for each different form of the returned information. Table 3-10 lists the whois information that needs to be extracted for all these five Web Service Registries, and a small sketch of deriving the service domain follows the table.

Service Domain URL | Domain Name | Domain Type
Domain Address | Domain Description | State
Postal Code | City | Country
Country Code | Phone | Fax
Email | Organization | Established Time

Table 3-10 Whois Information for these five Web Service Registries
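As a supplementary illustration of the first step of the whois processing, the following sketch (an assumption, not the thesis implementation) derives the service domain from the WSDL link before it is sent to the whois client; the simple "www." stripping is only a simplification.

import java.net.URL;

public class ServiceDomainSketch {
    // Reduce a WSDL link such as "http://www.thomas-bayer.com/..." to "thomas-bayer.com".
    public static String getServiceDomain(String wsdlLink) throws Exception {
        String host = new URL(wsdlLink).getHost();   // e.g. "www.thomas-bayer.com"
        if (host.startsWith("www.")) {
            host = host.substring(4);                // remove the "www." prefix
        }
        return host;                                 // value handed to the whois client
    }
}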

Finally, the information of these four aspects is collected together and then delivered to the Storage component for further storage processing.

3231 Features of the Property Grabber Component

The Property Grabber component has to provide the following features

l Obtain basic information

Generally speaking, the more information a Web service has, the better one can judge how good this Web service is. Hence the Property Grabber component needs to extract all the basic information hosted in the service list page and the service page. This basic information comprises the structured information, the endpoint information and the monitoring information.

l Obtain Whois information

Since more information about a Web service also means a better assessment of its quality, it is necessary to extract as much information about the Web service as possible. Therefore, besides the basic information, the Property Grabber component also obtains some additional information, called whois information, such as the type of the domain, the name of the person in charge, the postal code of the domain, city, phone, fax, detailed address, email, and so on.


3232 Input of the Property Grabber Component

This component requires the following input data

l Service list page link

l Service page link

3233 Output of the Property Grabber Component

The component will produce the following output data

l Structured information of each service

l Endpoint information of each service, if it exists

l Monitoring information for the service and its endpoints, if they exist

l Whois information of the service domain

All this information is collected together as the properties of each service; thereafter the collected properties are sent to the Storage component.

3234 Demonstration for Property Grabber Component

Figures 3-13 to 3-16 depict the fundamental procedure of the Property Grabber component. To simplify the explanation, the example shown in this section uses the same Web service as before.

1) The inputs of the Property Grabber component are the links of the service list page and the service page received from the Web Service Extractor component. These links are

"httpwwwservice-repositorycom"

and "httpwwwservice-repositorycomserviceoverview-210897616"

2) When the "GetProperty" function is triggered, it calls its four sub functions shown in figure 3-12.

Figure3-13 Structure properties of the Service "BLZService" in service list page


Figure3-14 Structure properties of the Service "BLZService" in service page

3) First, the "getStructuredProperty" function extracts the structured information displayed in the red boxes of figure 3-13 and figure 3-14. Several elements of the structured information have the same content, such as the description shown in the service page and in the service list page. In order to save extraction time and storage space, elements with the same content are extracted only once. Moreover, the rating information has to be transformed from non-descriptive into descriptive text, because its content consists of several star images. The final results of the extracted structured information for this Web service are shown in table 3-11. Since there is no descriptive information for the Provider, Homepage and Owner Homepage, their values are assigned as "NULL".

Service Name BLZService

WSDL Link httpwwwthomas-bayercomaxis2servicesBLZServicewsdl

WSDL Version 0

Server Apache-Coyote11

Description BLZService

Rating Four stars and A Half

Provider NULL

Homepage NULL

Owner Homepage NULL

Table 3-11 Extracted Structured Information of Web Service "BLZService"

4) Secondly, the "getEndpointProperty" function extracts the endpoint information displayed in the red box of figure 3-15. Although there is more than one endpoint record in the list, only one of them is extracted as the endpoint information. The reason is that this master program is intended to extract as much information as possible, but this information should not contain redundancy; therefore only one record is extracted even if there are several endpoint records. Table 3-12 shows the final results of the endpoint information of this Web service.

Figure3-15 Endpoint information of the Web service "BLZService" in service page


Endpoint Name BLZServiceSOAP12port_http

Endpoint URL httpwwwthomas-bayercom80axis2servicesBLZService

Endpoint Critical True

Endpoint Type production

Bound Endpoint BLZServiceSOAP12Binding

Table 3-12 Extracted Endpoint Information of the Web service "BLZService"

5) Then the monitoring information is extracted by invoking the function "getMonitoringProperty". Figure 3-16 displays two types of monitoring properties: the upper red box contains the monitoring information about the Web service, and the lower red box lists the monitoring information of its endpoints. As already mentioned, only one endpoint statistic record is extracted. Besides, as can be seen in figure 3-16, there are two availability values; they both represent the availability of this Web service, just like the availability shown in figure 3-14. Therefore one of these availability values is sufficient. Table 3-13 shows the final results of this extraction process.

Figure3-16 Monitoring Information of the Service "BLZService" in service page

Service Availability 100

Number of Downs 0

Total Uptime 1 day 19 hours 19 minutes

Total Downtime 0 second

MTBF 1 day 19 hours 19 minutes

MTTR 0 second

RTT Max of Endpoint 141 ms

RTT Min of Endpoint 0 ms

RTT Average of Endpoint 577 ms

Ping Count of Endpoint 112

Table 3-13 Extracted Monitoring Information of the Web service "BLZService"

6) Next, in order to extract the whois properties, the "getWhoisProperty" function first has to derive the service domain from the WSDL link. For this Web service the obtained service domain is "thomas-bayercom". The function then sends this service domain as input to the whois client for querying, which returns a list of information about that service domain, see figure 3-17. Table 3-14 shows the extracted whois information.


Figure3-17 Whois Information of the service domain "thomas-bayercom"

Service Domain URL thomas-bayercom

Domain Name Thomas Bayer

Domain Type NULL

Domain Address Moltkestr40

Domain Description NULL

State NULL

Postal Code 54173

City Bonn

Country NULL

Country Code DE

Phone +4922855525760

Fax NULL

Email infopredic8de

Organization predic8 GmbH

Established Time NULL

Table 3-14 Extracted Whois Information of service domain "thomas-bayercom"

7) Finally, the information of these four aspects is collected together as the service properties, and these service properties are then forwarded to the Storage component.

324 The Function of Storage Component

The Storage component uses the WSDL link from the WSDL Grabber component to download the WSDL document from the Web and then stores it on disk. In addition, the service properties from the Property Grabber component are stored on disk in three different manners by this Storage component. Figure 3-18 illustrates the process flow of the Storage component.

When the Storage component receives the needed inputs, its mediator function "Storager" is triggered. It transforms the service properties into three different output formats and stores them on disk: an XML file, an INI file and database records. Besides, it also tries to download the WSDL document from the URL address of the WSDL link and stores the obtained WSDL document on disk as well. The "Storager" function is composed of four sub functions, namely "getWSDL", "generateXML", "generateDatabase" and "generateINI"; each sub function is in charge of one aspect of the storage tasks.


Figure3-18 Overview the process flow of the Storage Component

(1) "getWSDL" sub function

The task of the "getWSDL" sub function is to download the WSDL document and store it on disk. Above all, it has to get the content of the WSDL document, which is done as follows. First, the sub function checks whether the value of the received WSDL link equals "NULL". As already presented in section 322, if a Web service has no WSDL link, the value of its WSDL link is assigned "NULL". In that case it creates a WSDL document whose name is the service name appended with the mark "No WSDL Document"; this document contains no content, it is an empty document. If the service does have a WSDL link, the sub function tries to connect to the Internet using the URL address of the WSDL link. Once this succeeds, the content hosted on the Web is downloaded, stored on disk and named with the name of the service only. Otherwise it creates a WSDL document whose name is prefixed with "Bad" before the service name. A minimal sketch of this procedure is given below.
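The following sketch is only an approximation of the behaviour described above, not the code of figure 3-19; the plain java.net.URL download and the exact file names are assumptions.

import java.io.*;
import java.net.URL;

public class WsdlDownloadSketch {
    public static void getWSDL(String serviceName, String wsdlLink, File dir) {
        try {
            if (wsdlLink == null || wsdlLink.equals("NULL")) {
                // no WSDL link: create an empty marker document
                new File(dir, serviceName + "[No WSDL Document]").createNewFile();
                return;
            }
            try (InputStream in = new URL(wsdlLink).openStream();
                 OutputStream out = new FileOutputStream(new File(dir, serviceName))) {
                byte[] buffer = new byte[4096];
                int read;
                while ((read = in.read(buffer)) != -1) {
                    out.write(buffer, 0, read);      // copy the WSDL content to disk
                }
            }
        } catch (IOException e) {
            // download failed: create a marker document prefixed with "Bad"
            try {
                new File(dir, "Bad" + serviceName).createNewFile();
            } catch (IOException ignored) {
            }
        }
    }
}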

(2) "generateXML" sub function

The "generateXML" sub function takes the service properties as input, transforms them into an XML file and stores it on disk under the name of the service plus the extension "xml". XML stands for eXtensible Markup Language, a markup language designed to transport and store data. The first line of an XML file is the XML declaration, which defines the XML version and the encoding used. For example, the declaration <?xml version="1.0" encoding="UTF-8"?> means that this is an XML file whose version is 1.0 and whose encoding is the 8-bit Unicode Transformation Format character set. Besides, an XML file contains XML elements, which span everything from an element's start tag to its end tag. An XML element can in turn contain other elements, simple text, or a mixture of both. However, an XML file must contain exactly one root element as the parent of all other elements. A small illustrative sketch follows.
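The sketch below is an assumption (the thesis code is shown in figure 3-20) and presumes that the property names are valid XML element names; it only illustrates the transformation described above.

import java.io.FileWriter;
import java.io.IOException;
import java.util.Map;

public class XmlWriterSketch {
    public static void generateXML(String serviceName, Map<String, String> properties,
                                   String dir) throws IOException {
        StringBuilder xml = new StringBuilder();
        xml.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");   // XML declaration
        xml.append("<!-- service properties of ").append(serviceName).append(" -->\n");
        xml.append("<service>\n");                                     // root element
        for (Map.Entry<String, String> p : properties.entrySet()) {
            xml.append("  <").append(p.getKey()).append(">")
               .append(p.getValue())
               .append("</").append(p.getKey()).append(">\n");
        }
        xml.append("</service>\n");
        try (FileWriter out = new FileWriter(dir + "/" + serviceName + ".xml")) {
            out.write(xml.toString());
        }
    }
}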

(3) "generateINI" sub function

The "generateINI" sub function also takes the service's properties as input, but it transforms them into an INI file and stores it on disk under the name of the service plus the extension "ini". "ini" stands for initialization; the INI file format is a de facto standard for configuration files. INI files are simple text files with a basic structure. Generally speaking, an INI file contains three different kinds of parts: sections, parameters and comments. The parameter is the basic element of an INI file. Its format is a key-value pair, also called a name-value pair; the pair is delimited by an equals sign "=" and the key or name always appears to the left of the equals sign. A section groups its parameters together, much like a room. It always appears on a single line enclosed in a pair of square brackets "[]", and sections may not be nested. A comment is descriptive text that begins with a semicolon ";", so anything between the semicolon and the end of the line is ignored.
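The following short sketch (again an assumption rather than the code of figure 3-21) writes the properties in exactly this comment/section/parameter layout.

import java.io.FileWriter;
import java.io.IOException;
import java.util.Map;

public class IniWriterSketch {
    public static void generateINI(String serviceName, Map<String, String> properties,
                                   String dir) throws IOException {
        try (FileWriter out = new FileWriter(dir + "/" + serviceName + ".ini")) {
            out.write("; service properties of " + serviceName + "\n");   // comment
            out.write("[" + serviceName + "]\n");                          // section
            for (Map.Entry<String, String> p : properties.entrySet()) {
                out.write(p.getKey() + "=" + p.getValue() + "\n");         // parameters
            }
        }
    }
}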

(4) "generateDatabase" sub function

The inputs of the "generateDatabase" sub function are the same as those of the previous two sub functions. Instead of transforming them into a file like XML or INI, the "generateDatabase" sub function turns them into database records by using SQL statements. SQL stands for Structured Query Language, a database language designed for accessing and manipulating data in a database. Most actions performed on a database are done with SQL statements; the primary statements include insert into, delete, update, select, create, alter and drop. In order to transform the service properties into database records, this sub function first has to create a database with the "create database" statement. Then it creates a table to store the data. A table is a collection of related data entries and consists of columns and rows. Since the data for all these five Web Service Registries is not very large, one database table is sufficient for storing the service properties. Because of that, the column names for the service properties must be uniform and well-defined across all five Web Service Registries. Afterwards the service properties of each single service can be inserted into the table as one record with the "insert into" statement of SQL. A small sketch of this step is shown below.
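The following minimal JDBC sketch illustrates this step; the table name, the reduced set of columns and the SQLite-style SQL are illustrative assumptions, not the code of figures 3-22 and 3-23.

import java.sql.*;

public class DatabaseStorageSketch {
    public static void storeService(Connection conn, String name, String wsdlLink,
                                    String description) throws SQLException {
        try (Statement stmt = conn.createStatement()) {
            // one table with uniform, well-defined columns; all property values stored as Text
            stmt.execute("CREATE TABLE IF NOT EXISTS services ("
                    + "id INTEGER PRIMARY KEY, "
                    + "service_name TEXT, wsdl_link TEXT, description TEXT)");
        }
        String sql = "INSERT INTO services (service_name, wsdl_link, description) VALUES (?, ?, ?)";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, name);
            ps.setString(2, wsdlLink);
            ps.setString(3, description);
            ps.executeUpdate();   // one record per Web service
        }
    }
}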

3241 Features of the Storage Component

The Storage component has to provide the following features

l Generate different output formats

The final result of this master program is to store the information of the services on disk for future work. This Storage component provides three different formats for storing the service properties of the services published in the Web Service Registries, which makes the storage of the services flexible and long-lived.

l Obtain the WSDL document

The important task of this component is to check whether the WSDL document can be obtained from the WSDL link, because the WSDL document plays a decisive role in determining the quality of the service. The Storage component is able to deal with the different situations that occur while obtaining the WSDL document.


3242 Input of the Storage Component

This component requires the following input data

l WSDL link of each service

l The property information of each service

3243 Output of the Storage Component

The component will produce the following output data

l WSDL document of the service

l XML file, INI file and table records in the database

3244 Demonstration for Storage Component

The following figures show the fundamental implementation code of this Storage component. The detailed description is given below.

1) As can be seen from figures 3-19 to 3-21, the implementation code of the sub functions has several parts in common. The first common part is the set of parameters defined in each of these sub functions, namely "path" and "SecurityInt". The parameter "path" is an absolute path on the computer disk; it is used for the storing procedure and has already been specified by the user at the beginning of the whole program. The parameter "SecurityInt" is an increasing Integer which is used as a part of the name of the service; this prevents services with the same name from overwriting each other on disk. The second common part is the code highlighted in red in these figures, whose function is to create a file or document for the service with the corresponding parameters.

2) Figure 3-19 displays the implementation code of the "getWSDL" sub function, which is designed to get the WSDL document of the service based on its WSDL link. The parameter "name" is the name of the service and is used as the name of the WSDL document. The parameter "linkStr" is the actual WSDL link, the most important parameter for this sub function. The other two parameters, "statistic" and "log", are the objects of the text files called "Statistic Information" and "Log Information" respectively. The "Statistic Information" text file records the statistical data of the services in each Web Service Registry, such as the overall number of properties for that Web Service Registry, the overall number of services, the number of services that have no WSDL link, the number of services without content, the number of services whose WSDL link is not available, and so on. The "Log Information" text file records the results of the process steps and the problems encountered, for example which service is being crawled at the moment, which Web Service Registry it belongs to, the reason why the WSDL document of a service could not be obtained, the reason why a service is unreachable, and so on.


Figure3-19 Implementation code for getting WSDL document

3) Figure 3-20 and figure 3-21 show the code for turning the service properties into the XML file and the INI file and for storing those two files on disk afterwards. The parameter "vec" is a Vector of the "PropertyStruct" data type; "PropertyStruct" is a class that consists of two variables, name and value.

Figure3-20 Implementation code for generating XML file


Figure3-21 Implementation code for generating INI file

4) The code in figures 3-22 and 3-23 shows the process that turns the service properties of all services in these five Web Service Registries into records of the database. To do so, a database has to be created first. The name of the database can be arbitrary as long as it conforms to the naming rules of the database system; the same holds for the name of the table. Figure 3-22 displays the procedure of creating a table in the database. Since it is hard to predict the length of each service property, the data types of all service property columns are set to "Text". Figure 3-23 shows the code for inserting the service properties into the table with the "update" statement.

Figure3-22 Implementation code for creating table in database


Figure3-23 Implementation code for generating table records

33 Multithreaded Programming for DWSC

Multithreaded programming is a built-in capability of the Java language. A multithreaded program contains two or more separate parts that can execute concurrently; each such part is called a thread. Multithreading makes it possible to write programs that use the system resources efficiently, for example by making maximum use of the CPU, because the idle time of the CPU can be kept to a minimum.

In this master program there are five Web Service Registries whose services need to be crawled. The number of services published in each Web Service Registry differs considerably, so the running time spent on each Web Service Registry differs as well. Without multithreading, a Web Service Registry with few services would have to wait until another Web Service Registry with many more services has finished. Therefore, in order to reduce this waiting time and to maximize the use of the system resources, multithreaded programming is applied to this master program: it creates one thread for each Web Service Registry, and these threads are executed independently, as sketched below.
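A minimal sketch of this idea is shown below; the placeholder crawl method only stands for the per-registry processing described in the previous sections.

public class RegistryThreadsSketch {

    static void crawlRegistry(String registry) {
        // placeholder for crawling all services of one Web Service Registry
        System.out.println("Crawling services of " + registry);
    }

    public static void main(String[] args) {
        String[] registries = {"Service Repository", "Ebi", "Xmethods", "Seekda", "Biocatalogue"};
        for (String registry : registries) {
            Thread worker = new Thread(() -> crawlRegistry(registry), registry + "-crawler");
            worker.start();   // the five threads run concurrently and independently
        }
    }
}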

34 Sleep Time Configuration for Web Service Registries

Since this master program downloads the WSDL documents and extracts the service information of the Web services published in the Web Service Registries, it inevitably affects the performance of these registries. In order not to exceed their throughput capacity, the Web Service Registries restrict the rate of access. Because of that, unknown errors sometimes occur while this master program is executing: the program may halt at one point without obtaining any more WSDL documents and service information, the WSDL documents of some services cannot be obtained, or some service information is missing. Therefore, in order to obtain the largest possible share of the Web services published in these five Web Service Registries without affecting their throughput, the access rate has to be configured for each service of all Web Service Registries.

Consequently, before entering the essential procedure for each single service of these Web Service Registries, this master program calls the system's built-in function "sleep (long milliseconds)". It is a public static function that causes the currently executing thread to sleep for the specified number of milliseconds, in other words to temporarily cease execution for a while. The following table shows the time interval of the sleep function for each Web Service Registry, and a short sketch of applying it follows the table.

Web Service Registry Name Time Interval (milliseconds)

Service Repository 8000

Ebi 3000

Xmethods 10000

Seekda 20000

Biocatalogue 10000

Table 3-15 Sleep Time of these five Web Service Registries
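The small sketch below shows how such a per-registry pause can be applied before each single service is processed; the map-based lookup is an assumption, only the intervals are taken from table 3-15.

import java.util.HashMap;
import java.util.Map;

public class RegistrySleepSketch {
    private static final Map<String, Long> SLEEP_MS = new HashMap<>();
    static {
        SLEEP_MS.put("Service Repository", 8000L);
        SLEEP_MS.put("Ebi", 3000L);
        SLEEP_MS.put("Xmethods", 10000L);
        SLEEP_MS.put("Seekda", 20000L);
        SLEEP_MS.put("Biocatalogue", 10000L);
    }

    // called before the essential procedure for each single service of a registry
    public static void throttle(String registry) {
        try {
            Thread.sleep(SLEEP_MS.getOrDefault(registry, 5000L));
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();   // restore the interrupt flag
        }
    }
}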


4 Experimental Results and Analysis

This chapter presents the quantitative experimental results of the prototype described in chapter 3, together with an analysis and explanation of these results. In order to gain rather accurate results, the experiments were carried out more than five times; all data displayed in the following tables and charts are the averages of these runs.

41 Statistic Information for Different Web Service Registries

This section discusses the amount statistics of the Web services published in these five Web Service Registries. It includes the overall number of Web services published in each Web Service Registry and the number of unavailable Web services, which have been archived because they may not be active anymore or are close to being inactive. Table 4-1 shows the service amount statistics of these five Web Service Registries.

Web Service Registry Name | Service Repository | Ebi | Xmethods | Seekda | Biocatalogue
Overall Services | 57 | 289 | 382 | 853 | 2567
Unavailable Services | 0 | 0 | 0 | 0 | 125

Table 4-1 Service amount statistic of these five Web Service Registries

To give an intuitive view of the service amount statistics in these five Web Service Registries, figure 4-1 shows a bar chart derived from table 4-1. As can be seen from the bar chart, on the one hand the overall number of Web services increases from the Service Repository Web Service Registry to the Biocatalogue Web Service Registry. The Biocatalogue Web Service Registry owns the largest number of Web services, which indicates that it has a much stronger ability to provide Web services to users, since it contains far more services than the other four Web Service Registries. On the other hand, no unavailable services exist in any Web Service Registry except for the Biocatalogue Web Service Registry. In other words, only the Biocatalogue Web Service Registry contains Web services that cannot be used by the users. To some degree this is useless, because these services cannot be used anymore and they waste network resources on the Web. Therefore all these unavailable services should be eliminated in order to reduce the waste of network resources.


Figure4-1 Service amount statistic of these five Web Service Registries

42 Statistic Information for WSDL Document

Web Service Registry Name | Service Repository | Ebi | Xmethods | Seekda | Biocatalogue
Failed WSDL Links | 1 | 0 | 23 | 145 | 32
Without WSDL Links | 0 | 0 | 0 | 0 | 16
Empty Content | 0 | 0 | 2 | 0 | 2

Table 4-2 Statistic information for WSDL Document

Table 4-2 and figure 4-2 show the error statistics of the WSDL documents of the Web services in these five Web Service Registries. There are three aspects. The first one is the number of "Failed WSDL Links" of the Web services in these Web Service Registries, i.e. the overall number of Web services whose WSDL links are invalid. In other words, it is impossible to get the WSDL documents of these Web services from the URL addresses of their WSDL links, and therefore no WSDL document is created for them. The second aspect is the number of Web services "Without WSDL Links", i.e. the overall number of Web services in each Web Service Registry that have no WSDL link at all. For such Web services there is no WSDL document either; the value of their WSDL link is "NULL". A WSDL document is still created, but it has no content and its name contains the string "[No WSDL Document]". The third aspect is the "Empty Content", which represents the overall number of Web services that have WSDL links and whose URL addresses


are valid, but whose WSDL documents contain no content. In this case a WSDL document whose name contains the string "(BAD)" is created.

Figure4-2 Statistic information for WSDL Document

43 Comparison of Different Average Number of Service Properties

This section is going to compare the average number of service properties in these five Web Service Registries. This average number of service properties is calculated by means of the following equation:

ASP = ONSP / ONS    (1)

Where
ASP is the average number of service properties for one Web Service Registry
ONSP is the overall number of service properties in one Web Service Registry
ONS is the overall number of Web services that have already been crawled from the corresponding Web Service Registry

Figure 4-3 shows the average number of service properties per Web service in these five Web Service Registries. As already mentioned, one of the measures for judging the quality of the Web services in a Web Service Registry is the service information: the more information a Web service has, the better one knows that service, and consequently the corresponding Web Service Registry can offer Web services of better quality to its users. As seen from figure 4-3, the Service Repository and Biocatalogue Web Service Registries have a larger number of service properties than the other three Web Service Registries. This directly reflects that these two Web Service Registries provide more detailed information about the Web services published in them, so that users can more easily choose the services they need and are also more willing to use the Web services published in these two registries. By contrast, the Xmethods and Seekda


Web Service Registries, which have less service information about their Web services, offer lower quality for these Web services. Therefore users may not like to use the Web services provided in these two Web Service Registries, not to mention the Web services published in the Ebi Web Service Registry.

Figure4-3 Average Number of Service Properties

Following the description presented in section 323, the causes for the different number of service properties in these Web Service Registries can be summarized in several points. First, the amount of structured information differs between these five Web Service Registries, and part of the information of some Web services within one registry can be missing or empty. For example, the amount of structured information that is supposed to be extracted for the Web services in the Ebi Web Service Registry differs greatly from that in the Biocatalogue Web Service Registry. Secondly, there is a certain amount of endpoint information in almost all Web Service Registries except the Ebi Web Service Registry, and a missing endpoint section reduces the overall number of service properties accordingly. Thirdly, some Web Service Registries, such as Xmethods and Ebi, have no monitoring information at all, whereas the Service Repository Web Service Registry in particular provides a large amount of monitoring information about its Web services that can be extracted from the Web. The last point is the amount of whois information for these Web services. If the database of the whois client does not contain information about the service domain of a Web service, then no whois information can be extracted; and even if information about the service domain exists, its amount can vary greatly. Therefore, if many service domains of the Web services in a registry have no or only little whois information, the average number of service properties in that registry decreases considerably.

As a result, in order to help users get better acquainted with the Web services provided in a Web Service Registry and to distinguish these Web services, the Web Service Registry should do its best to offer as much information as possible for each of its published Web services.

[Figure 4-3 bar chart data: average number of service properties per registry - Service Repository 23, Ebi 7, Xmethods 17, Seekda 17, Biocatalogue 32]


44 Different Outputs of Web Services

The fundamental task of this master program is to obtain the WSDL documents of the Web services hosted in the Web Service Registries, to extract and gather the properties of these Web services, and to store them on disk. This section therefore describes the different outputs of this master program, which include the WSDL documents of the Web services, the generated XML and INI files, and the records of the service properties in the database.

Figure4-4 WSDL Document format of one Web service

The WSDL document of a Web service is simply read from the Web according to the URL address of its WSDL link, and these data are then stored on disk as the WSDL document. The name of the WSDL document is the service name plus the ending "wsdl", which marks it as a WSDL document. In order to distinguish WSDL documents whose names would be identical although their contents differ, the name of each obtained WSDL document within one Web Service Registry additionally contains a unique Integer in front of the name. Figure 4-4 shows the valid WSDL document format of the Web service; its name is "1BLZServicewsdl".

Besides, the obtained service properties are transformed into an XML file, an INI file and data records in the database. Figure 4-5, figure 4-6 and figure 4-7 show these three different output formats respectively. Figure 4-5 shows the INI file of the Web service; its name is "1BLZServiceini". The Integer is the same as in the WSDL document, because both files belong to the same Web service. The first three lines in that INI file are comments, which run from the semicolon to the end of the line; they are basic information describing this INI file. The following line is the section, enclosed in a pair of brackets. It is important because it indicates that the lines behind it contain the information of this Web service. The remaining lines are the actual service information, given as key-value pairs with an equals sign between key and value; each service property starts at the beginning of a line.


Figure4-5 INI File format of one Web service

Figure4-6 XML File format of one Web service

Figure4-7 Database data format for all Web services


Furthermore, figure 4-6 shows the generated XML file of the Web service "BLZService"; its name is "1BLZServicexml". Needless to say, this XML file is part of the materials of the same Web service. Although the format of the XML file differs from that of the INI file, their essential contents are the same, that is, the values of the service properties do not differ, because both files are generated from the same collection of properties of the same Web service. The XML file also contains comments like the INI file, displayed between "<!--" and "-->". The section of the INI file corresponds roughly to the root element of the XML file; therefore the values of all elements under the root "service" in this XML file are the values of the service properties of this Web service.

Finally, figure 4-7 shows the database table that is used to store the service information of all Web services in these five Web Service Registries. The entire service information of one Web service forms exactly one record in this table. Because of that, the column names of the table are the union of the names of the service information of each Web Service Registry. Since column names must be unique, redundant names in this union are eliminated; this is possible because the names of the service information are well-defined and uniform for all five Web Service Registries. In addition, the first column of this table is the primary key, an increasing Integer whose function is similar to the Integer contained in the names of the XML and INI files. The remaining columns of the table hold the corresponding service properties of the Web services. The value "NULL" or "Null" in this table indicates that this property of the respective Web service is empty or missing.

45 Comparison of Average Time Cost for Different Parts of Single Web Service

This section compares the average time cost of the different parts of getting one single Web service in all these five Web Service Registries. First, the average time cost of getting one single service in a Web Service Registry has to be calculated, which is done with the following equation:

ATC = OTS / ONS    (2)

Where
ATC is the average time cost for one single Web service
OTS is the overall time cost of all the Web services in one Web Service Registry
ONS is the overall number of Web services that have already been crawled from the corresponding Web Service Registry

The average time cost for getting one single service consists of the following six parts: the average time cost for extracting the service properties, the average time cost for obtaining the WSDL document, the average time cost for generating the XML file, the average time cost for generating the INI file, the average time cost for inserting the service properties into the database table, and the average time cost of some other procedures, such as getting the service list page link, getting the service page link, and so on. The average time cost for extracting the service properties is obtained by means of the following equation:


ATCSI = OTSSI / ONS    (3)

Where

ATCSI is the average time cost for extracting service property of one single Web service

OTSSI is the overall time cost for extracting service property of all the Web services in one Web

Service Registry

ONS is the overall number of Web services that have already been crawled from the

corresponding Web Service Registry

The calculation of the other parts is analogous to the equation for the average time cost of extracting the service properties, while the average time cost of the other procedures equals the average time cost of one single Web service minus the sum of the average time costs of the other five parts; a short worked example follows.
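For example, using the Service Repository row of table 4-3 below, the value of the "Others" column follows as 10042 - (8801 + 918 + 2 + 1 + 53) = 10042 - 9775 = 267 milliseconds.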

Registry | Service Property | WSDL Document | XML File | INI File | Database | Others | Overall
Service Repository | 8801 | 918 | 2 | 1 | 53 | 267 | 10042
Ebi | 699 | 82 | 2 | 1 | 28 | 11 | 823
Xmethods | 5801 | 1168 | 2 | 1 | 45 | 12 | 7029
Seekda | 5186 | 1013 | 2 | 1 | 41 | 23 | 6266
Biocatalogue | 39533 | 762 | 2 | 1 | 66 | 1636 | 42000

Table 4-3 Average time cost information (in milliseconds) for all Web Service Registries

Table 4-3 displays the average time cost of one Web service and of its different parts in all these five Web Service Registries. The first column of table 4-3 contains the names of the five Web Service Registries, and the last column is the average time cost of a single service in the respective Web Service Registry. The remaining columns are the average time costs of the six different parts. In order to give an intuitive view of the data in table 4-3, the data of the individual columns are illustrated in the corresponding figures 4-8 to 4-13.

Figure4-8 Average time cost for extracting service property in all Web Service Registries


As can be seen from figure 4-8, the average time cost for extracting the service properties in the Biocatalogue Web Service Registry is 39533 milliseconds. This value is much larger than in the other four Web Service Registries, which show 8801, 699, 5801 and 5186 milliseconds for Service Repository, Ebi, Xmethods and Seekda respectively. In other words, it takes much longer to extract the service properties of the Web services published by the Biocatalogue Web Service Registry. This also indirectly indicates that the Biocatalogue Web Service Registry has the highest average number of service properties, as already discussed in section 43; conversely, the average number of service properties in the Ebi Web Service Registry is the smallest. Moreover, the average time cost in the Xmethods Web Service Registry is larger than that in the Seekda Web Service Registry, although the average number of service properties is the same for these two registries. One cause that might explain why extracting properties in Xmethods costs more time than in Seekda is that the extraction in the Xmethods Web Service Registry has to be carried out via both the service page and the service list page, while for the Seekda Web Service Registry only the service page link is needed.

The average time cost for obtaining the WSDL document in all these five Web Service Registries is displayed in figure 4-9. This average time cost is the sum of the average time for extracting the WSDL link of a Web service and the average time for reading the WSDL document data from the Web and storing it on disk. As seen from this figure, the average time cost for obtaining the WSDL document is largest in the Xmethods Web Service Registry, namely 1168 milliseconds. Although the process of extracting the WSDL link takes a certain amount of time, it does not have a significant influence on the total average time spent for obtaining the WSDL document, because the WSDL link of a Web service is usually obtained in a single step. This implies that the average size of the WSDL documents in the Xmethods Web Service Registry is larger than in the other four Web Service Registries, in particular larger than in the Ebi Web Service Registry, which takes only 82 milliseconds for obtaining the WSDL document.

Figure4-9 Average time cost for obtaining WSDL document in all Web Service Registries


Figure 4-10, figure 4-11 and figure 4-12 show the average time cost of generating these three different outputs in all five Web Service Registries. As can be seen from figures 4-10 and 4-11, the average time for generating the XML file of one Web service is the same for all five Web Service Registries, namely only 2 milliseconds, and the average time for generating the INI file is likewise the same everywhere, with a value of just 1 millisecond. Even the sum of these two average time costs is so small that it can be neglected when compared with the overall average time cost of getting one Web service in each Web Service Registry, shown in figure 4-13. This implies that the generation of the XML and INI files finishes immediately after receiving the service properties of a Web service as input. Furthermore, figure 4-12 shows that although the average time cost of creating the database record for each Web service is larger than the time for generating the XML and INI files in all five Web Service Registries, the operation of creating the database record is still fast.

Figure4-10 Average time cost for generating XML file in all Web Service Registries

Figure4-11 Average time cost for generating INI file in all Web Service Registries


Figure4-12 Average time cost for creating database record in all Web Service Registries

Figure4-13 Average time cost for getting one Web service in all Web Service Registries

Figure 4-13 gives the average time cost for getting one single Web service in all these five Web Service Registries. Without any doubt, the Biocatalogue Web Service Registry takes the longest time for this process. This is because, as the presentation of the individual parts above has shown, almost every part needs more time in the Biocatalogue Web Service Registry, with the exception of obtaining the WSDL document, for which Biocatalogue is not the slowest. Moreover, a striking observation can be made when looking at figures 4-8, 4-12 and 4-13: the shapes of these curves follow almost the same trend. This further indicates that a Web Service Registry that spends more time on getting the description information of a Web service also offers more information about that Web service.


5 Conclusion and Further Direction

This master thesis provides a schema that aims to explore the description information of the Web services hosted in different Web Service Registries. The description information of a Web service consists of its WSDL document and the service information about that Web service. Although there is an existing approach, the Pica-Pica Web Service Description Crawler, that can be used to obtain the WSDL document and service information of the Web services in these Web Service Registries, its functionality is restricted: it explores only a small subset of the Web services hosted in these Web Service Registries, the storage of these Web services is not flexible, and, most importantly, only a little service information of the Web services is extracted, for some Web Service Registries even none at all. The work presented in this master thesis, in contrast, is able to explore all Web services in these Web Service Registries. Moreover, with the approach presented here the service information of a Web service is extracted as completely as possible, so that the final result is the largest annotated service catalogue ever produced. Furthermore, regarding the storage of the description information of the Web services, this master thesis provides three different ways that guarantee not only the completeness but also the longevity of the description information.

However, in the implementation performed in this master thesis, the whois client used for querying the information of a service domain returns free text if the information exists, and sometimes this free text differs completely from case to case. As a consequence, each Web service in all Web Service Registries had to be crawled at least once in the experiment stage, so that all variants of this free text could be foreseen and processed afterwards. This is a huge amount of work, because there are a lot of Web services in these Web Service Registries. Therefore, in order to simplify the work, another whois client that eases this task needs to be found and used.

Moreover, in the experiment stage of this master thesis the time cost for getting a Web service is still large. For the purpose of reducing this time, multithreaded programming could also be applied to some parts of the process for getting one single Web service.

Although the work performed here is specialized for these five Web Service Registries only, the main parts of the principles used here are adaptable to other Web Service Registries with only small changes in the implementation code or the structure.


6 Bibliography

[1] Nathalie Steinmetz, Holger Lausen, Dario Cerizza, Andrea Turati, Irene Celino, Adam Funk and Michael Erdmann, "D11 - Requirement Analysis and Architectural Plan". Available from httpwwwservice-findereudeliverablespublic7-public32-d11, Emanuele Della Valle (CEFRIEL), June 27, 2008

[2] Nathalie Steinmetz, Holger Lausen, Dario Cerizza, Andrea Turati, Irene Celino, Adam Funk and Michael Erdmann, "D12 - First Design of Service-Finder as a Whole". Available from httpwwwservice-findereudeliverablespublic7-public37-d12-first-design-of-service-finder-as-a-whole, Emanuele Della Valle (CEFRIEL), July 1, 2008

[3] Nathalie Steinmetz, Holger Lausen, Irene Celino, Dario Cerizza, Saartje Brockmans, Adam Funk, "D13 - Revised Requirement Analysis and Architectural Plan". Available from httpwwwservice-findereudeliverablespublic7-public57-d13-revised-requirement-analysis-and-architectural-plan, Emanuele Della Valle (CEFRIEL), April 1, 2009

[4] Chia-Hui Chang, Mohammed Kayed, Moheb Ramzy Girgis, Khaled Shaalan, "A Survey of Web Information Extraction Systems", Volume 18, Issue 10, IEEE Computer Society, pp. 1411-1428, October 2006

[5] Leonard Richardson, "Beautiful Soup Documentation". Available from httpwwwcrummycomsoftwareBeautifulSoupdocumentationhtml, October 13, 2008

[6] Hao He, Hugo Haas, David Orchard, "Web Services Architecture Usage Scenarios". Available from httpwwww3orgTRws-arch-scenarios, February 11, 2004

[7] Stephen Soderland, "Learning Information Extraction Rules for Semi-Structured and Free Text", Volume 34, Issue 1-3, Journal of Machine Learning, Department of Computer Science and Engineering, University of Washington, Seattle, pp. 233-272, February 1999

[8] Ian Hickson, "A Vocabulary and Associated APIs for HTML and XHTML", World Wide Web Consortium, Working Draft WD-html5-20100624, January 22, 2008

[9] Holger Lausen, Jos de Bruijn, Axel Polleres, Dieter Fensel, "The Web Service Modeling Language WSML", WSML Deliverable D161v02, March 20, 2005. Available from httpwwwwsmoorgTRd16d161v02

[10] Dumitru Roman, Holger Lausen, Uwe Keller, "Web Service Modeling Ontology - Standard (WSMO-Standard)", WSMO deliverable D2 version 11, March 06, 2004. Available from httpwwwwsmoorgTRd2v11

[11] Iris Braum, Anja Strunk, Gergana Stoyanova, Bastian Buder, "ConQo - A Context- and QoS-Aware Service Discovery", TU Dresden, Department of Computer Science, in Proceedings of WWW/Internet 2008


7 Appendixes

There are additional outputs of this master program, namely the log information file and the statistic report file. Figure 8-1 shows one of the basic output formats of the log information for these five Web Service Registries.

Figure8-1 Log information of the "Service Repository" Web Service Registry

Figure8-2 Statistic information of the "Service Repository" Web Service Registry


Figures 8-2 to 8-6 show the output format of the statistic information for the "Service Repository", "Ebi", "Xmethods", "Seekda" and "Biocatalogue" Web Service Registries respectively.

Figure8-3 Statistic information of the "Ebi" Web Service Registry

Figure8-4 Statistic information of the "Xmethods" Web Service Registry


Figure8-5 Statistic information of the "Seekda" Web Service Registry

Figure8-6 Statistic information of the "Biocatalogue" Web Service Registry


Table of Figures

Figure 2-1: Dataflow of Service-Finder and Its Components .......... 12
Figure 2-2: Left is the free text input type and right is its output .......... 16
Figure 2-3: A semi-structured page containing data records (in rectangular box) to be extracted .......... 16
Figure 2-4: Architecture of the Pica-Pica Web Service Description Crawler .......... 20
Figure 3-1: Overview of the Basic Architecture for the Deep Web Services Crawler .......... 25
Figure 3-2: Overview of the process flow of the Web Service Extractor Component .......... 27
Figure 3-3: Service list page of the Service-Repository .......... 29
Figure 3-4: Original source code of the internal link for the Web service "BLZService" .......... 29
Figure 3-5: Code overview of getting the service page link in the Service-Repository .......... 29
Figure 3-6: Service page of the Web service "BLZService" .......... 29
Figure 3-7: Overview of the process flow of the WSDL Grabber Component .......... 30
Figure 3-8: WSDL link of the Web service "BLZService" in the service page .......... 31
Figure 3-9: Original source code of the WSDL link for the Web service "BLZService" .......... 32
Figure 3-10: Code overview of the "getServiceRepositoryWSDLLink" function .......... 32
Figure 3-11: Code overview of the "oneParameter" function .......... 32
Figure 3-12: Overview of the process flow of the Property Grabber Component .......... 33
Figure 3-13: Structured properties of the Service "BLZService" in the service list page .......... 37
Figure 3-14: Structured properties of the Service "BLZService" in the service page .......... 38
Figure 3-15: Endpoint information of the Web service "BLZService" in the service page .......... 38
Figure 3-16: Monitoring information of the Service "BLZService" in the service page .......... 39
Figure 3-17: Whois information of the service domain "thomas-bayer.com" .......... 40
Figure 3-18: Overview of the process flow of the Storage Component .......... 41
Figure 3-19: Implementation code for getting the WSDL document .......... 44
Figure 3-20: Implementation code for generating the XML file .......... 44
Figure 3-21: Implementation code for generating the INI file .......... 45
Figure 3-22: Implementation code for creating a table in the database .......... 45
Figure 3-23: Implementation code for generating table records .......... 46
Figure 4-1: Service amount statistic of these five Web Service Registries .......... 49
Figure 4-2: Statistic information for WSDL Documents .......... 50
Figure 4-3: Average Number of Service Properties .......... 51
Figure 4-4: WSDL Document format of one Web service .......... 52
Figure 4-5: INI File format of one Web service .......... 53
Figure 4-6: XML File format of one Web service .......... 53
Figure 4-7: Database data format for all Web services .......... 53
Figure 4-8: Average time cost for extracting service properties in all Web Service Registries .......... 55
Figure 4-9: Average time cost for obtaining the WSDL document in all Web Service Registries .......... 56
Figure 4-10: Average time cost for generating the XML file in all Web Service Registries .......... 57
Figure 4-11: Average time cost for generating the INI file in all Web Service Registries .......... 57
Figure 4-12: Average time cost for creating the database record in all Web Service Registries .......... 58
Figure 4-13: Average time cost for getting one Web service in all Web Service Registries .......... 58


Table of Tables

Table 3-1: Structured Information of the Service-Repository Web Service Registry .......... 34
Table 3-2: Structured Information of the Xmethods Web Service Registry .......... 34
Table 3-3: Structured Information of the Seekda Web Service Registry .......... 34
Table 3-4: Structured Information of the Ebi Web Service Registry .......... 34
Table 3-5: Structured Information of the Biocatalogue Web Service Registry .......... 34
Table 3-6: SOAP Operation Information of the Biocatalogue Web Service Registry .......... 35
Table 3-7: REST Operation Information of the Biocatalogue Web Service Registry .......... 35
Table 3-8: Endpoint Information of these five Web Service Registries .......... 35
Table 3-9: Monitoring Information of these five Web Service Registries .......... 35
Table 3-10: Whois Information for these five Web Service Registries .......... 36
Table 3-11: Extracted Structured Information of the Web Service "BLZService" .......... 38
Table 3-12: Extracted Endpoint Information of the Web service "BLZService" .......... 39
Table 3-13: Extracted Monitoring Information of the Web service "BLZService" .......... 39
Table 3-14: Extracted Whois Information of the service domain "thomas-bayer.com" .......... 40
Table 3-15: Sleep Time of these five Web Service Registries .......... 47
Table 4-1: Service amount statistic of these five Web Service Registries .......... 48
Table 4-2: Statistic information for WSDL Documents .......... 49
Table 4-3: Average time cost information for all Web Service Registries .......... 55


Table of Abbreviations

DTD Document Type Definition

DWSC Deep Web Service Crawler

HTML HyperText Markup Language

MTBF Mean Time between Failures

MTTR Mean Time to Recovery

QoS Quality of Service

REST Representational State Transfer

RTT Round Trip Time

SOAP Simple Object Access Protocol

SQL Structured Query Language

URL Uniform Resource Locator

WHATWG Web Hypertext Application Technology Working Group

XML eXtensible Markup Language

WSDL Web Service Description Language

WSML Web Service Modeling Language

WSMO Web Service Modeling Ontology


  - Service availability information
- Function
  - The Web Interface allows the users to search services by keyword, tag or concept in the categorization, to sort and filter query results by refining the query, to compare and bookmark services, and to try out the services that offer this functionality.
  - The API allows the developers to invoke Service-Finder functionalities.
- Output
  - Explicit user annotations such as tags, ratings, comments, descriptions and so on
  - Implicit user data, for example click streams of users, bookmarks, comparisons, links sent, etc.
  - Manual advertising of available new services

2.1.2.5 The Principle of the Cluster Engine Component

The Cluster Engine is in charge of collecting and analyzing information about user behaviors from the Service-Finder Portal, e.g. the queried services and the compared services of the users. Moreover, it also provides cluster data to the Conceptual Indexer and Matcher for providing service recommendations.

Furthermore, this component's function, input and output are introduced in detail below.

- Input
  - Service annotation data, both extracted and user feedback
  - Users' click streams, used for extracting user behaviors
- Function
  - Obtain user clusters from user behaviors
  - Obtain service clusters from service annotation data, in order to be able to find similar services
- Output
  - Clusters of users and services

2.2 Information Extraction

Due to the rapid development and use of the World-Wide Web, a huge number of information sources have appeared on the Internet, but access to them has been limited to browsing and searching because of the heterogeneity and the lack of structure of Web information sources. Therefore, Information Extraction, which transforms Web pages into program-friendly structures for post-processing, has become a great necessity. The task of Information Extraction is specified in terms of its inputs and its extraction targets, and the technique used in the process of Information Extraction is called the extractor.

2.2.1 Input Types of Information Extraction

Generally speaking, there are three different input types. The first input type is the unstructured document, for example the free text shown in figure 2-2. It is unstructured and written in natural language, so it requires substantial natural language processing. The second input type is the structured document, for instance XML documents, whose data can be described through an available DTD (Document Type Definition) or XML (eXtensible Markup Language) schema. The third input type is the semi-structured document, which is widespread on the Web, such as the large volume of HTML pages containing tables, itemized lists and enumerated lists. This is because HTML tags are often used to render these embedded data in the HTML pages (see figure 2-3).

Figure 2-2: Left is the free text input type and right is its output [4]

Figure 2-3: A semi-structured page containing data records (in rectangular box) to be extracted [4]

In this way, inputs of the semi-structured type can be seen as documents with a fairly regular structure, and the data of these documents can be displayed in an HTML or a non-HTML format. Besides, since the Web pages of the Deep Web are dynamic and generated from structured databases in terms of some templates or layouts, the Deep Web can be considered as one of the input sources that provide such semi-structured documents. For example, the authors, price and comments on the book pages provided by Amazon all have the same layout, because these Web pages are generated from the same database and rendered with the same template or layout. Furthermore, semi-structured HTML pages can also be generated manually. For example, although the publication lists provided on different researchers' homepages are produced by diverse users, they all have a title and a source property for every single paper. Eventually, the inputs for Information Extraction can also be pages of the same class within or among various Web Service Registries.

2.2.2 Extraction Targets of Information Extraction

Regarding the task of Information Extraction, the extraction target also has to be considered. There are two different extraction targets. The first one is the relation of a k-tuple, where k denotes the number of attributes in a record. In some cases an attribute of a record may have no instantiation, while in other cases an attribute owns multiple instantiations. The second extraction target is the complex object with hierarchically organized data. Though the ways of depicting the extraction targets in a page are diverse, the most common structure is the hierarchical tree. This hierarchical tree may contain only one leaf node, or one or more lists of leaf nodes grouped under internal nodes. The structure of a data object may be either flat or nested. In brief, if the structure is flat, then there is only one leaf node, which can also be called the root; otherwise, if it is a nested structure, the data object is organized in more than two levels of internal nodes.

Furthermore, in order to make Web pages readable for human beings and easier to visualize, the tuples of the same list or the elements of a tuple should be clearly isolated or demarcated. However, the display of a data object in a Web page is affected by the following conditions [4]:

- The attribute of a data object has zero or several values
  (1) If there is no value for an attribute of a data object, this attribute is called a "none" attribute. For example, a special offer only available for certain books might be a "none" attribute.
  (2) If there is more than one value for an attribute of a data object, it is called a "multiValue" attribute. For instance, the name of the author of a book could be a "multiValue" attribute.

- The set of attributes (A1, A2, A3, ...) has multiple orderings
  That is to say, among this set of attributes the position of an attribute may change for different instances of a data object. Such an attribute is called a "multiOrdering" attribute. For instance, for movies before the year 1999 a movie site may list the release date in front of the movie's title, while for movies from 1999 onwards it lists the release date behind the movie's title.

- The attribute has different formats
  This means the display format of the data object can be completely distinct for different instances. Therefore, if the format of an attribute is free, a lot of rules are needed to deal with all possible cases; this kind of attribute is called a "multiFormat" attribute. For example, an e-commerce Web site may use a bold font to present the regular prices while using a red color to display the sale prices. There is also the opposite situation, in which different attributes of a data object have the same format. For example, various attributes are presented using <TD> tags in a table presentation; attributes like those can only be differentiated by means of their order information. However, for cases in which a "none" attribute occurs or "multiOrdering" attributes exist, the rules for extracting these attributes have to be revised.

- The attribute cannot be decomposed
  For easier processing, the input documents are sometimes treated as strings of tokens instead of strings of characters. However, some attributes cannot be decomposed into several individual tokens; these attributes are called "untokenized" attributes. Examples are college course codes like "COMP4016" or "GEOL2001": the department code and the course number in them cannot be separated into two different strings such as "COMP" and "4016" or "GEOL" and "2001".

2.2.3 The Used Techniques in Information Extraction

The extractor used in the process of Information Extraction aims at providing a single uniform query interface to access information sources like database servers and Web servers. It consists of the following phases: collecting returned Web pages, labeling these Web pages, generalizing extraction rules, extracting the relevant data, and outputting the result in an appropriate format (XML format or relational database) for further information integration. For example, at first the extractor queries the Web server to gather the returned pages through the HTTP protocol; after that it starts to extract the contents of these HTML documents and integrates them with other data sources. The whole process of the extractor follows the steps below.

- Step 1
  At the beginning, the input has to be tokenized. There are two different granularities for the input string tokenization: tag-level encoding and word-level encoding. Tag-level encoding transforms the tags of an HTML page into general tokens, while transforming every text string between two tags into a special token. Word-level encoding does this in another way: it treats each word in a document as a token.
- Step 2
  Next, the extraction rules are applied for every attribute of the data object in the Web pages. These extraction rules can be induced in terms of a top-down or bottom-up generalization, pattern mining or logic programming. In addition, the type of extraction rules may be expressed by means of regular grammars or logic rules. For example, some use path expressions of the HTML parse tree like html->head->title or html->table[0], some use syntactic or semantic constraints, and some use delimiter-based constraints such as HTML tags or literal words (a small illustrative sketch of such a delimiter-based rule is given after this list).
- Step 3
  After that, all the extracted data are assembled into records.
- Step 4
  Finally, this process is iterated until all data objects in the input have been processed.
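The following is a minimal, illustrative Java sketch of a delimiter-based extraction rule as mentioned in Step 2. It is not taken from any of the discussed systems; the HTML fragment and the rule itself are made-up examples.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DelimiterRuleExample {
    public static void main(String[] args) {
        // A made-up fragment of a service list page.
        String html = "<td class=\"name\">BLZService</td><td class=\"provider\">thomas-bayer.com</td>";

        // Delimiter-based rule: the service name is the text between
        // the literal delimiters <td class="name"> and </td>.
        Pattern rule = Pattern.compile("<td class=\"name\">(.*?)</td>");
        Matcher m = rule.matcher(html);
        if (m.find()) {
            System.out.println("Extracted service name: " + m.group(1));
        }
    }
}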


2.3 Pica-Pica Web Service Description Crawler

Pica-Pica is known as a bird species, also called magpie. In this context, however, Pica-Pica is a Web Service Description Crawler which is designed to address the quality of Web Services, for example the evaluation of the descriptive quality of offered Web Services and how well these Web Services are described in today's Web Service Registries.

2.3.1 Needed Libraries of the Pica-Pica Web Service Description Crawler

This version of the Pica-Pica Web Service Description Crawler was written by Anton Caceres and Josef Spillner and is programmed in the Python language. In order to run these scripts to parse the HTML pages, two additional libraries are needed: Beautiful Soup and html5lib.

- Beautiful Soup
  It is an HTML/XML parser for the Python language, and it can even turn invalid markup into a parse tree [5]. Moreover, the following three features make it more powerful:
  - Bad markup does not choke Beautiful Soup. In fact, it generates a parse tree that makes approximately as much sense as the original document. Therefore, you can obtain the data that you want.
  - Beautiful Soup provides a toolkit with simple, idiomatic methods for navigating, searching and modifying the parse tree. Hence, you do not need to create a custom parser for every application.
  - If the document specifies an encoding, you do not have to care about it, since Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. Otherwise, you just have to specify the encoding of the original document.
  Furthermore, the ways of including Beautiful Soup into an application are shown in the following [5]:
  - from BeautifulSoup import BeautifulSoup (for processing HTML)
  - from BeautifulSoup import BeautifulStoneSoup (for processing XML)
  - import BeautifulSoup (to get everything)
- html5lib
  It is a Python package which implements the HTML5 [8] parsing algorithm. In order to gain maximum compatibility with the current major desktop web browsers, this implementation is based on the WHATWG (Web Hypertext Application Technology Working Group) HTML5 specification.


2.3.2 Architecture of the Pica-Pica Web Service Description Crawler

Figure 2-4: Architecture of the Pica-Pica Web Service Description Crawler

Figure 2-4 illustrates the primary architecture of the Pica-Pica Web Service Description Crawler. It includes four fundamental components: the Service Page Grabber component, the WSDL Grabber component, the Property Grabber component and the WSML Register component.

(1) The Service Page Grabber component takes the URL seed as the input and outputs the link of the service page to the following two components: the WSDL Grabber component and the Property Grabber component.

(2) The WSDL Grabber component is responsible for obtaining the WSDL document based on the delivered service page link and then checking the validity of the obtained WSDL document. Finally, only the valid WSDL documents are passed into the WSML Register component for further processing.

(3) The Property Grabber component tries to extract the service's properties hosted in the service page, if there are any. After that, all these service properties are saved into an INI file as the information of that service.

(4) The functionality of the WSML Register component is to write appropriate WSML documents by means of the valid WSDL documents delivered from the WSDL Grabber component and the optional INI files delivered from the Property Grabber component, and afterwards to register them in ConQo.

- WSML [9]
  It stands for Web Service Modeling Language, which provides a framework with different language variants. Hence it is often used to describe the different aspects of semantic Web Services according to the conceptual model of WSMO.
- WSMO [10]
  WSMO, whose full name is Web Service Modeling Ontology, is dedicated to describing various aspects related to Semantic Web Services based on ontologies. An ontology is a formal, explicit specification of a shared conceptualization. In fact, these ontologies are the sticking points that provide the linkage between the agreement of the communities of users and the defined conceptual semantics of the real world.
- ConQo [11]
  It is a discovery framework that considers not only the Quality of Service (QoS) but also context information. It uses a Web Service Repository to manage service descriptions that are based on WSML.

2.3.3 Implementation of the Pica-Pica Web Service Description Crawler

This section describes the processes of the implementation of the Pica-Pica Web Service Description Crawler in detail.

(1) Firstly, for starting the whole crawling process of the Pica-Pica Web Service Description Crawler, an input is needed as the initial seed. In this crawler there are five Web Service Registries, which are listed below. The URL addresses of these five Web Service Registries are used as the input seeds for this Pica-Pica Web Service Description Crawler. Moreover, in this version of the Pica-Pica Web Service Description Crawler there is a single Python script for each Web Service Registry, and the crawling processes of these Web Service Registries' Python scripts are executed one after another.

Biocatalogue: http://www.biocatalogue.com
Ebi: http://www.ebi.ac.uk
Seekda: http://www.seekda.com
Service-Repository: http://www.service-repository.com
Xmethods: http://www.xmethods.net

(2) Then, after being fed with the input seed, the process steps into the next component, the Service Page Grabber. At first this component tries to read the data from the Web based on the input seed. Then it establishes a parse tree of the read data by means of the functions of the Beautiful Soup library. After that, this Service Page Grabber component starts to look for the service page link of each service published in the Web Service Registry by means of the functions of the html5lib library. In the case that the service page link of a single service is found, it first checks whether this service page link is valid or not. Once the service page link is valid, it is passed into the following two components for further processing, which are the WSDL Grabber component and the Property Grabber component.

(3) When the WSDL Grabber component receives a service page link from the previous component, it sets out to extract the WSDL link address of that service through the parse tree of the data in this service page. Next, this component starts to download the WSDL document of that service in terms of the WSDL link address. Thereafter, the obtained WSDL document is stored on the disk. The process of this WSDL Grabber component is continually carried on until no more service links are passed to it. Certainly, not all grabbed WSDL documents are effective: they may contain bad definitions or a bad namespaceURI, be an empty document or, even worse, not be of XML format at all. Hence, in order to pick them out, this component further analyzes the involved WSDL documents. All valid documents are put into a "validWSDLs" folder, whereas the invalid documents are put into a folder named "invalidWSDLs" in order to gather statistic information. Finally, only the WSDL documents in the "validWSDLs" folder are passed to the subsequent component.

(4) Moreover, since some Web Service Registries give some additional information about the services, such as availability, service provider or version, the Property Grabber component sets out to extract this information as the service's properties and thereafter saves these properties into an INI file. However, if no additional information is available, there is no need to extract service properties, and thus there is no INI file for that service. In the implementation of this Pica-Pica Web Service Description Crawler, only the Python scripts for the Seekda and Service-Repository Web Service Registries have the functions to extract the services' properties, while for the other three Web Service Registries there is no such function.

(5) Furthermore, it is optional to create a report file which contains the statistic information of this process, such as the total number of services of one Web Service Registry, the number of services whose WSDL document is invalid, etc.

(6) As has been stated, at this point there is a folder with all valid WSDL documents and possibly also some INI files. Therefore, the task of the WSML Register component is now to generate the appropriate WSML documents from these valid WSDL documents and INI files, and then register them in ConQo.

2.4 Conclusions on the Existing Strategies

This chapter presented three aspects of the existing strategies, namely the Service-Finder project, the Information Extraction technique and the Pica-Pica Web Service Description Crawler.

The task of this master program is to obtain the available Web services and their related information from the Web. This is essentially a procedure of extracting the needed information for a service, such as the service's WSDL document and its properties. Therefore, the Information Extraction technique, which is used to extract information hosted in the Web, can be applied in this master program.

Moreover, Service-Finder is a large project that is not only able to obtain the available Web services and their related information from the Web, but also enriches these crawled data with annotations by means of the Service-Finder ontology and the Service Category ontology and integrates all the information into a coherent semantic model. Furthermore, it also provides capabilities for searching and browsing the data through a user interface and gives users service recommendations. However, for a master program the Service-Finder project far exceeds the requirements; therefore it is considered only as a reference for this master program.

Furthermore, since the Pica-Pica Web Service Description Crawler aims only at obtaining the available Web services and their related information, it fulfills the primary task of this master program. Nevertheless, regarding the information about a service, this Pica-Pica Web Service Description Crawler extracts only few properties, sometimes even no property at all. Consequently, in order to improve the quality of the service descriptions, as many properties about a service as possible have to be extracted. Hence, chapter 3 presents an extension of this Pica-Pica Web Service Description Crawler.


3 Design and Implementation

In the previous chapter on the State of the Art, the already existing techniques and implementations were presented. In the following, the basic principle of the proposed approach, the Deep Web Services Crawler, is introduced. It is based on these previously existing techniques, especially the Pica-Pica Web Service Description Crawler.

3.1 Deep Web Services Crawler Requirements

This section mainly discusses the goals of the Deep Web Service Crawler approach, the system requirements of the approach and some non-functional requirements.

3.1.1 Basic Requirements for DWSC

The following are the basic requirements which should be achieved:

(1) Produce the largest annotated Service Catalogue. Here, the Service Catalogue is actually a list of Web services published in a Web Service Registry. It contains not only the WSDL document of each service but also a file with the information about that service. Therefore, for the purpose of producing the largest annotated Service Catalogue, this proposed approach needs to extract as many properties about those Web services as possible. Moreover, it also has to download the WSDL document that is hosted along with the Web service. That is to say, these properties are not only the interesting structured properties, such as the service name and its WSDL link address, but also some other non-functional properties, for example the endpoint and its monitoring information.

(2) Flexible storage: each service has a Service Catalogue that contains all its interesting properties. How to deal with those service properties is a substantial problem; this means deciding what kinds of schemes will be used to store them. Hence, in order to store them in a flexible way, this proposed approach provides three methods for the storage: the first one stores them as an XML file, the second one stores them in an INI file, and the third one uses a database for the storage. A small sketch of the first two storage variants is given below.
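As a rough illustration of how a service's properties could be written to both an INI-style file and an XML file, the following minimal Java sketch uses the standard java.util.Properties class. The property names, values and file paths are made-up examples and do not reflect the actual implementation.

import java.io.FileOutputStream;
import java.util.Properties;

public class StorageSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical properties of one crawled Web service.
        Properties service = new Properties();
        service.setProperty("serviceName", "BLZService");
        service.setProperty("wsdlLink", "http://services.unitedplanet.de/blz/BlzService.asmx?WSDL");
        service.setProperty("provider", "thomas-bayer.com");

        // INI-style storage: simple key=value lines.
        try (FileOutputStream ini = new FileOutputStream("BLZService.ini")) {
            service.store(ini, "Service Catalogue entry");
        }

        // XML storage: the same properties as an XML document.
        try (FileOutputStream xml = new FileOutputStream("BLZService.xml")) {
            service.storeToXML(xml, "Service Catalogue entry");
        }
    }
}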

3.1.2 System Requirements for DWSC

Generally speaking, the requirements needed for realizing a programming project comprise the following:

1) Operating system: Linux/Unix, Windows (XP, Vista, 2000, etc.)
2) Programming language: C++, Java, Python, C, etc.
3) Programming tool: NetBeans, Eclipse, Visual Studio, and so on

In this master thesis, the scripts of the Deep Web Service Crawler approach are written in the Java programming language. Besides, these code scripts have only been tested on the Windows XP and Linux operating systems, but have not been tested on other operating systems.


3.1.3 Non-Functional Requirements for DWSC

In this part, several non-functional requirements for the Deep Web Service Crawler approach are presented:

1) Transparency: the process of data exploration and data storage should be done automatically, without the user's intervention. However, at first the user should specify the path on the hard disk which will be used to store the outputs of this program.
2) Fault tolerance: during the execution of this program, errors can inevitably happen. Therefore, in order to keep the process from being interrupted, there must be some necessary error handling for recovering the process (a simple sketch of such an error-handling loop is shown after this list).
3) Completeness: this approach should extract as many of the interesting properties of each Web Service as possible, e.g. endpoint, monitoring information, etc.
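The following minimal Java sketch illustrates one possible way to realize the fault-tolerance requirement for a single page download: a bounded retry loop around the network access. The method name, retry count and delay are made up for illustration and are not taken from the actual implementation.

import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class RetryExample {
    // Hypothetical helper: try to read a page up to maxAttempts times
    // before giving up, so a single network error does not stop the crawl.
    static String readPageWithRetry(String address, int maxAttempts) throws IOException {
        IOException lastError = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try (InputStream in = new URL(address).openStream()) {
                return new String(in.readAllBytes(), StandardCharsets.UTF_8);
            } catch (IOException e) {
                lastError = e;            // remember the error and try again
                try {
                    Thread.sleep(1000L);  // short pause between attempts
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                }
            }
        }
        throw lastError;                  // all attempts failed
    }
}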

In addition, since the Pica-Pica Web Service Crawler has already implemented the strategies for the following five URLs, the proposed approach must cover not less than these five Web Service Registries:

Biocatalogue: http://www.biocatalogue.com
Ebi: http://www.ebi.ac.uk
Seekda: http://www.seekda.com
Service-Repository: http://www.service-repository.com
Xmethods: http://www.xmethods.net

3.2 Deep Web Services Crawler Architecture

In this section, an overview of the high-level architecture of the Deep Web Services Crawler approach is introduced first. Thereafter, four subsections follow that outline each single component and how they play together.

The current components and flows of data in the Deep Web Service Crawler can be summarized as depicted in Figure 3-1 using the continuous arrows. It first tries to obtain the available service list page links and related service page links by crawling the Web (Web Service Extractor). Then those gathered links are processed in two separate steps: one gets the service's WSDL document (WSDL Grabber) and the other collects the properties of each service (Property Grabber). Finally, all these data are stored in the storage device (Storage). The whole process of figure 3-1 is illustrated in detail in the following steps.

- Step 1
  When the Deep Web Service Crawler starts to run, the File Chooser container requires the user to specify a path on the computer or on any other hard disk. The reason for specifying the path is that this Deep Web Service Crawler program needs a place to store all its outputs.
- Step 2
  After that, the Web Service Extractor is triggered. It is the main entry to the specific crawling process. Since the Deep Web Service Crawler program is a procedure that is supposed to crawl for Web Services in some given Web Service Registries, the URL addresses of these Web Service Registries should be given as the initial seed for this Web Service Extractor process. However, since the page structures of these Web Service Registries are completely different, there is a dedicated process for each Web Service Registry.

Figure 3-1: Overview of the Basic Architecture for the Deep Web Services Crawler

- Step 3
  According to the given seed, two types of links are obtained by the Web Service Extractor component: one is the service list page link and the other is the service page link. A service list page is a page that contains a list of Web Services and possibly some information about these Web Services, while a service page is a page that contains much more information about a single service. Finally, it forwards these two types of links into the next two components, the Property Grabber and the WSDL Grabber.
- Step 4
  Then, on the one hand, the Property Grabber component tries to gather the information about the service that is hosted in the service list page and the service page, such as the name of the service, its description, the ranking of this service, etc. Finally, all the information about the service is collected together as the service properties, which are then delivered to the Storage component for further processing.

- Step 5
  On the other hand, the WSDL Grabber tries to obtain the WSDL link from the service list page or the service page. That is because for some Web Service Registries the WSDL link is hosted in the service list page, like in the Biocatalogue, while for other Web Service Registries it is hosted in the service page, such as in Xmethods. After obtaining the WSDL link, it is also transmitted to the Storage component for further processing.
- Step 6
  When the service properties and the WSDL link of the service are received by the Storage component, it stores them on the disk. The service properties are stored on the disk in three different ways: as an XML file, as an INI file, or as one record inside a table of the database. For the WSDL link, the Storage component first tries to download the page content according to the URL address of the WSDL link. If this is successful, the page content of the service is stored as a WSDL document on the disk (a small sketch of this download step is given after this list).
- Step 7
  Nevertheless, this is just the crawling process of a single service, from step 3 to step 6. Hence, if there is more than one service, or more than one service list page, in those Web Service Registries, the crawling process from step 3 to step 6 is repeated again and again until there is no service or service list page left in that Web Service Registry.
- Step 8
  Furthermore, after the crawling process for one Web Service Registry finishes, a file is generated that contains some statistic information about this crawling process, for example the time when the crawling process of this Web Service Registry started and finished, the total number of Web services in this Web Service Registry, how many services have an empty WSDL document, the average number of service properties in this Web Service Registry, and the average time cost for extracting service properties, getting the WSDL document and generating the XML file, INI file, etc.
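The following minimal Java sketch illustrates the download part of Step 6: fetching the content behind a WSDL link and storing it as a file on disk. It is only an illustration under the assumption that plain HTTP access is sufficient; the names and paths are made up and it is not the actual Storage component code.

import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class WsdlDownloadSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical WSDL link taken from a service page.
        String wsdlLink = "http://services.unitedplanet.de/blz/BlzService.asmx?WSDL";
        Path target = Paths.get("output", "BLZService.wsdl");

        Files.createDirectories(target.getParent());
        // Download the page content behind the WSDL link and store it as a file.
        try (InputStream in = new URL(wsdlLink).openStream()) {
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
        System.out.println("Stored WSDL document at " + target);
    }
}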

3.2.1 The Function of the Web Service Extractor Component

The Web Service Extractor component is responsible for gathering service list page links and service page links. It pursues a focused crawl of the Web and only forwards service list page and service page links to the subsequent components for analyzing, collecting and gathering purposes. Therefore, it identifies both service list page links and related service page links on these Web Service Registries.

As can be seen from figure 3-2, a crawl for Web Services needs to start from a URL seed. The seed is almost as important as the Web Service Extractor itself, as it highly influences the part of the Web which needs to be crawled. The seed can or shall contain, e.g., Web pages where these Web Services are published or pages that talk about Web Services.


Figure 3-2: Overview of the process flow of the Web Service Extractor Component

After being fed with the URL seed, the Web Service Extractor component starts to get the link of the service list page from the initial page of this URL seed. However, this process is different for the five Web Service Registries. The following shows the different situations in these Web Service Registries.

- Service-Repository Web Service Registry
  In this Web Service Registry, the link of the first service list page is the URL address of its seed, which means some Web Services can already be found on the home page of the Service-Repository Web Service Registry. The "first" here implies that there is more than one service list page link in this Web Service Registry. Therefore, the process of getting service list page links in this registry is continually carried on until no more service list page links exist.
- Xmethods Web Service Registry
  Although there are Web Services on the home page of the Xmethods Web Service Registry, these Web Services are only a small subset of those in this Web Service Registry. Moreover, in the Xmethods Web Service Registry there is only one page containing all Web Services. Therefore, the service list page link of that page has to be obtained.
- Ebi Web Service Registry
  The situation in the Ebi Web Service Registry is a little bit like that in the Xmethods Web Service Registry. That is to say, there is also one page that contains all Web Services in this Web Service Registry. However, this page is not the initial page of the input seed. Therefore, more than one operation step is needed to get the service list page link of that page.
- Seekda Web Service Registry
  In the Seekda Web Service Registry, the Web Services are not contained in the initial page of the input seed. The service list page link can be obtained after several additional operation steps. However, there is a problem with getting the service list page links in this registry: simply put, if there is more than one page containing the Web Services, then, for some unknown reason, the links of the remaining service list pages cannot be obtained. In other words, only the link of the first service list page can be obtained.
- Biocatalogue Web Service Registry
  The process of getting service list pages in the Biocatalogue Web Service Registry is almost the same as in the Seekda Web Service Registry. The only difference is that in the Biocatalogue Web Service Registry all service list page links can be obtained if there is more than one service list page.

Then, after getting the link of a service list page, the Web Service Extractor begins to get the link of the service page for each service listed in that service list page. The reason why it can do this is that there is an internal link for every service which addresses its service page. It is worth noting that once a service page link is obtained, this component immediately forwards the service page link and the service list page link into the subsequent two components for further processing. Nevertheless, the process of obtaining the service page links is continuously carried out until all services listed in that service list page are crawled. Analogously, the process of getting service list pages is also continuously carried out until no more service list pages exist.
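The nested iteration described above can be sketched roughly as follows in Java. The method names are placeholders; the real implementation differs for each registry and is not shown here.

import java.util.Collections;
import java.util.List;

public class ExtractorLoopSketch {
    // Placeholder lookups: the real logic is registry-specific.
    static List<String> getServiceListPageLinks(String seed) { return Collections.emptyList(); }
    static List<String> getServicePageLinks(String listPageLink) { return Collections.emptyList(); }
    static void forward(String listPageLink, String servicePageLink) { /* hand over to WSDL and Property Grabber */ }

    // Outer loop over service list pages, inner loop over the services they list.
    public static void main(String[] args) {
        String seed = "http://www.service-repository.com";   // example seed
        for (String listPage : getServiceListPageLinks(seed)) {
            for (String servicePage : getServicePageLinks(listPage)) {
                forward(listPage, servicePage);
            }
        }
    }
}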

3.2.1.1 Features of the Web Service Extractor Component

The main features are described in the following paragraphs.

1) Obtain service list page links
A central task is to obtain the corresponding service list page links. A service list page link is a URL address that leads to a public list of Web Services with just some simple information about these Web services, like the name of the service, an internal URL that links to another page containing the detailed information about that service, and sometimes the link address of the WSDL document.

2) Obtain service page links
Once service list page links are found, a crucial aspect is to extract the internal link of each service to aid the task of service information discovery. Therefore, it is the task of the Web Service Extractor to harvest the HTML page content of the service list page, so that the service page links, which lead to much more detailed information about the single Web service, can be obtained.

3.2.1.2 Input of the Web Service Extractor Component

This component depends on specific input seeds. The only input required for this component is a URL seed. Actually, this URL seed will be one of the URLs displayed in section 3.1.3.

3.2.1.3 Output of the Web Service Extractor Component

The component produces two kinds of service-related page links from the Web:
- Service list page links
- Service page links


3.2.1.4 Demonstration of the Web Service Extractor

In order to give a comprehensive understanding of the process of the Web Service Extractor component, the following figures serve as an explanation. Though there are five URL addresses in this section, only the URL of the Service-Repository is shown as an example.

1) The input seed is the initial URL address of the Service-Repository, which is "http://www.service-repository.com".
2) As already said in section 3.2.1, the first service list page link of this Web Service Registry is its input seed "http://www.service-repository.com". Figure 3-3 shows the corresponding service list page of that link.

Figure 3-3: Service list page of the Service-Repository
Figure 3-4: Original source code of the internal link for the Web service "BLZService"
Figure 3-5: Code overview of getting the service page link in the Service-Repository
Figure 3-6: Service page of the Web service "BLZService"


3) Now that the service list page link is already known, the next step is to acquire the service page links of the services listed in the service list page. The text in the red box of figure 3-4 shows the internal link of the Web service "BLZService". However, this is not the complete link of the service page: it has to be prefixed with the initial URL address of the Service-Repository Web Service Registry, "http://www.service-repository.com". The code for getting the service page link of a Web service in the Service-Repository is shown in figure 3-5. Therefore, the final link of this service page is "http://www.service-repository.com/service/overview-210897616". Figure 3-6 is the corresponding service page of that link. A small sketch of this link resolution is given after this list.
4) Afterwards, those two links, the service list page link and the service page link, which are gathered by the Web Service Extractor component, are immediately forwarded to the next two components, which are the WSDL Grabber component and the Property Grabber component.
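As a rough illustration of the prefixing described in step 3, the following Java sketch resolves a relative internal link against the registry's base URL using the standard java.net.URL class. The concrete link values are just the ones used in the example above and are assumptions for illustration.

import java.net.URL;

public class LinkResolutionSketch {
    public static void main(String[] args) throws Exception {
        // Base URL of the registry and the relative internal link from the service list page.
        URL base = new URL("http://www.service-repository.com");
        String internalLink = "/service/overview-210897616";   // example value

        // Resolve the relative link against the base to obtain the full service page link.
        URL servicePage = new URL(base, internalLink);
        System.out.println(servicePage);   // http://www.service-repository.com/service/overview-210897616
    }
}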

3.2.2 The Function of the WSDL Grabber Component

The WSDL Grabber component acquires the WSDL link hosted in the Web based on the service list page link or the service page link. The whole process flow is illustrated in figure 3-7.

Figure 3-7: Overview of the process flow of the WSDL Grabber Component

When the WSDL Grabber component receives the inputs delivered from the previous component, it starts to get the WSDL link of the service based on these inputs. However, although the inputs of the WSDL Grabber component are the links of the service page and the service list page, only one of them contains the WSDL link of the corresponding service. That is to say, the WSDL link exists either in the service page or in the service list page. The reason why both links need to be delivered into this component is that only one of the five Web Service Registries, namely the Biocatalogue, hosts the WSDL link in the service list page, while for the other four Web Service Registries the WSDL link is hosted in the service page. Therefore, the WSDL links of these four Web Service Registries are obtained in terms of the service page link, and for the Biocatalogue Web Service Registry the WSDL link is obtained through the service list page link. However, there is a problem with getting the WSDL link in the Biocatalogue Web Service Registry: in brief, some of the Web services listed in the service list page of the Biocatalogue Web Service Registry do not have a WSDL link, in other words these services do not have a WSDL document. For a situation like this, the value of the WSDL link of these Web services is assigned a "NULL" value. Nevertheless, for the Web Services in the other four Web Service Registries, the WSDL link always exists in the service page. Eventually, whenever the WSDL Grabber component extracts the WSDL link of a single Web service, it is immediately forwarded into the Storage component for downloading the WSDL document.

3.2.2.1 Features of the WSDL Grabber Component

The WSDL Grabber component has to provide the following feature:
- Obtain WSDL links
  A WSDL link is the direct way to get to the page that contains the contents of the WSDL document. It is actually a URL address, but at the end of this URL address there is something like "wsdl" or "WSDL" to indicate that this is an address that leads to the page of a WSDL document.

3.2.2.2 Input of the WSDL Grabber Component

This component requires the following input data:
- Service list page link
- Service page link

3.2.2.3 Output of the WSDL Grabber Component

The component only produces the following output data:
- The URL address of the WSDL link for each service

3.2.2.4 Demonstration of the WSDL Grabber Component

This section presents a list of figures in order to give a comprehensive understanding of the process of the WSDL Grabber component. It uses the same Web service of the Service-Repository as an example.

1) The input for this WSDL Grabber component is the link of the service page obtained from the Web Service Extractor component. The address of this link is "http://www.service-repository.com/service/overview-210897616".

Figure 3-8: WSDL link of the Web service "BLZService" in the service page

2) Figure 3-9 shows the corresponding original source code of the WSDL link that is highlighted in figure 3-8.

Figure 3-9: Original source code of the WSDL link for the Web service "BLZService"

3) Figures 3-10 and 3-11 show the code used to extract the WSDL link shown in figure 3-9. However, figure 3-10 is the particular code for the Service-Repository Web Service Registry only; for the other four Web Service Registries this is different. The function "getServiceRepositoryWSDLLink" first gets a list of nodes that have the HTML tag name "b". Then it checks all of the nodes one by one to see whether the text value of a node is "WSDL". If a node fulfills this condition, then the attribute value in its sibling, which is an "a" element here, is extracted as the value of the WSDL link for this Web service. A hedged sketch of this extraction logic is given after this list.

Figure 3-10: Code overview of the "getServiceRepositoryWSDLLink" function
Figure 3-11: Code overview of the "oneParameter" function

4) Finally, after applying the functions shown in figures 3-10 and 3-11, the WSDL link for the Web service "BLZService" can be obtained. Its value is "http://services.unitedplanet.de/blz/BlzService.asmx?WSDL".
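The extraction logic described in step 3 could look roughly like the following Java sketch. It uses the jsoup library for HTML parsing, which is only an assumption made for illustration; the thesis implementation shown in figures 3-10 and 3-11 may use a different parser, and the surrounding class and method names are made up.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class WsdlLinkSketch {
    // Illustrative variant of the "getServiceRepositoryWSDLLink" idea:
    // find a <b> element whose text is "WSDL" and take the href of the
    // neighbouring <a> element as the WSDL link.
    static String getWsdlLink(String servicePageHtml) {
        Document doc = Jsoup.parse(servicePageHtml);
        for (Element bold : doc.select("b")) {
            if ("WSDL".equals(bold.text().trim())) {
                Element sibling = bold.nextElementSibling();
                if (sibling != null && "a".equals(sibling.tagName())) {
                    return sibling.attr("href");
                }
            }
        }
        return null;   // no WSDL link found on this service page
    }
}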


3.2.3 The Function of the Property Grabber Component

The Property Grabber component is a module used to extract and gather all the Web service information hosted in the Web, which is in fact the information shown in the service list page and the service page. In the end, all the obtained Web service information is collected together as the service properties, which are delivered into the Storage component for storing. The detailed process flow of the Property Grabber component is illustrated in figure 3-12.

The inputs of the Property Grabber component are the same as the inputs of the WSDL Grabber component. However, there is still a little difference between them with respect to the seed. As already mentioned in section 3.2.2, for the WSDL Grabber component only one of the inputs is sufficient to get the WSDL link, while the Property Grabber component needs both of them as seeds. Therefore, once the Property Grabber component receives the needed inputs, it starts to extract the information of that single Web service.

Figure 3-12: Overview of the process flow of the Property Grabber Component

After the Property Grabber component receives the inputs, it starts to extract the service information of the Web service. Generally speaking, the service information consists of four aspects, which are structured information, endpoint information, monitoring information and whois information, respectively.

(1) Structured Information
The structured information can be obtained by extracting the information hosted in the service page and the service list page. It is the basic descriptive information about the service, such as the name of the service, the URL address through which the WSDL document can be obtained, the description introducing the service, the provider who provides that service, its rating and the server that owns this service, etc. However, the elements constituting this structured information differ between the different Web Service Registries. For example, the rating information of a Web service exists in the Service-Repository Web Service Registry, while the Xmethods Web Service Registry does not have this information. In addition, even for Web services in the same Web Service Registry, some elements of the structured information may not exist. For instance, one service in a Web Service Registry may have a description, while another service in the same Web Service Registry does not. Tables 3-1 to 3-5 show the structured information that should be extracted in these five Web Service Registries. Moreover, if the style of a service in the Biocatalogue Web Service Registry is SOAP, there is some additional information describing the SOAP operations of this service, while if it is REST, this additional information describes the REST operations. These should also be considered as a part of the structured information. Table 3-6 and table 3-7 illustrate the information for these two different kinds of operations.

Service Name, WSDL Link, WSDL Version
Provider, Server, Rating
Homepage, Owner Homepage, Description
Table 3-1: Structured Information of the Service-Repository Web Service Registry

Service Name, WSDL Link, Provider
Service Style, Homepage, Implementation Language
Description, User Description, Contributed Client Name
Type of this Client, Publisher of this Client, Used Toolkit of this Client
Used Language of this Client, Used Operating System of this Client
Table 3-2: Structured Information of the Xmethods Web Service Registry

Service Name, WSDL Link, Server
Provider, Provider's Country, Service Style
Rating, Description, User Description
Service Tags, Documentation (within WSDL)
Table 3-3: Structured Information of the Seekda Web Service Registry

Service Name, WSDL Link, Port Name
Service URL Address, Implementation Class
Table 3-4: Structured Information of the Ebi Web Service Registry

Service Name, WSDL Link, Style
Provider, Provider's Country, View Times
Favorite Times, Submitter, Service Tags
Total Annotation, Provider Annotation, Member Annotation
Registry Annotation, Base URL, SOAP Lab Server Base URL
Description, User Description, Category
Table 3-5: Structured Information of the Biocatalogue Web Service Registry

SOAP Operation Name, Inputs and Outputs, Operation Description
Operation Tags, Part of Which Service
Table 3-6: SOAP Operation Information of the Biocatalogue Web Service Registry

REST Operation Name, Service Tags, Used Template
Operation Description, Part of Which Service, Part of Which Endpoint Group
Table 3-7: REST Operation Information of the Biocatalogue Web Service Registry

(2) Endpoint Information

Endpoint information is the port information of the Web service, which can only be extracted from the service page. However, the Web Service Registries structure this endpoint information differently, so some of its elements vary considerably. One thing needs particular attention: the Ebi Web Service Registry does not provide endpoint information for any of the Web services published in it. Moreover, even though the Web services within one Web Service Registry share the same structure of endpoint information, some elements may be missing or empty, and a registry may even have no endpoint information at all for some of its published Web services. Nevertheless, whenever endpoint information exists for a Web service, it contains at least one element, namely the URL address of the endpoint. Table 3-8 shows the endpoint information that should be extracted from these five Web Service Registries.

Web Service Registry Name    Elements of the Endpoint Information
Service Repository           Endpoint Name, Endpoint URL, Endpoint Critical, Endpoint Type, Bound Endpoint
Xmethods                     Endpoint URL, Publisher of this Endpoint, Contact Email of this Publisher, Implementation Language of this Endpoint
Seekda                       Endpoint URL
Biocatalogue                 Endpoint Name, Endpoint URL
Table 3-8 Endpoint Information of these five Web Service Registries

Web Service Registry Name    Elements of the Monitoring Information
Service Repository           Service Availability, Number of Downs, Total Uptime, Total Downtime, MTTR, MTBF, RTT Max of Endpoint, RTT Min of Endpoint, RTT Average of Endpoint, Ping Count of Endpoint
Seekda                       Service Availability, Begin Time of Monitoring
Biocatalogue                 Monitored Status of Endpoint, Overall Status of Service
Table 3-9 Monitoring Information of these five Web Service Registries

(3) Monitoring Information

Monitoring information is the measured statistic information about the Web service. It is worth noting that the Ebi and Xmethods Web Service Registries do not provide monitoring information for any of the Web services published by them, while in the other three Web Service Registries only a few Web services may lack this information. Table 3-9 lists the monitoring information for these three Web Service Registries.

(4) Whois Information

Whois information is not extracted from the information hosted in the service page and service list page. It is the descriptive information about the service domain, which is obtained by means of the address of the WSDL link. Consequently, the process of getting the whois information starts with determining the service domain first. The final value of the service domain must not contain prefixes such as "http", "https" or "www"; it has to be reduced to the top level domain. After that, the service domain database is queried by sending the value of the service domain to a whois client, which is simply a Web site on the Internet, for example "http://www.whois365.com/cn/domain". If information about that service domain exists, a list of its attributes is returned as output. However, the structure of the returned information differs from one service domain to another; therefore the most challenging part is that the extracting process has to handle each different form of the returned information. Table 3-10 lists the whois information that needs to be extracted for all these five Web Service Registries, and a small sketch of this procedure follows the table.

Service Domain URL, Domain Name, Domain Type, Domain Address, Domain Description, State, Postal Code, City, Country, Country Code, Phone, Fax, Email, Organization, Established Time
Table 3-10 Whois Information for these five Web Service Registries
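As the actual implementation is not listed here, the following is only a minimal Java sketch of how the service domain could be derived from the WSDL link and sent to such a Web based whois client. The exact query URL pattern of the whois client and the simple host-stripping rules shown are assumptions made for illustration, not the exact code of this master program.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;

    public class WhoisQuery {

        // Reduce a WSDL link to its service domain, e.g.
        // "http://www.thomas-bayer.com/axis2/services/BLZService?wsdl" -> "thomas-bayer.com"
        static String getServiceDomain(String wsdlLink) throws Exception {
            String host = new URL(wsdlLink).getHost();   // drops "http"/"https" and the path
            if (host.startsWith("www.")) {
                host = host.substring(4);                // drops the leading "www."
            }
            String[] parts = host.split("\\.");
            int n = parts.length;                        // keep only the top level domain part
            return n <= 2 ? host : parts[n - 2] + "." + parts[n - 1];
        }

        // Send the service domain to the Web based whois client and return the raw answer.
        static String queryWhois(String serviceDomain) throws Exception {
            // assumed query pattern of the whois client named in the text
            URL whois = new URL("http://www.whois365.com/cn/domain/" + serviceDomain);
            StringBuilder answer = new StringBuilder();
            try (BufferedReader in = new BufferedReader(new InputStreamReader(whois.openStream()))) {
                String line;
                while ((line = in.readLine()) != null) {
                    answer.append(line).append('\n');
                }
            }
            return answer.toString();                    // free text, parsed afterwards per domain
        }
    }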

Finally, the information of these four aspects is collected together and then delivered to the Storage component for further storage processing.

3231 Features of the Property Grabber Component

The Property Grabber component has to provide the following features

- Obtain basic information
Generally speaking, the more information a Web service has, the better one can judge how good this Web service is. Hence, the Property Grabber component extracts all the basic information hosted in the service list page and the service page. This basic information comprises the structured information, the endpoint information and the monitoring information.
- Obtain whois information
For the same reason, it is necessary to extract as much information about the Web service as possible. Therefore, besides the basic information, the Property Grabber component also obtains some additional information called whois information, such as the type of the domain, the name of the person in charge, the postal code of the domain, city, phone, fax, detailed address, email, etc.


3232 Input of the Property Grabber Component

This component requires the following input data

- Service list page link
- Service page link

3233 Output of the Property Grabber Component

The component will produce the following output data

- Structured information of each service
- Endpoint information of each service, if it exists
- Monitoring information of the service and its endpoint, if it exists
- Whois information of the service domain
All this information is collected together as the properties of each service. Thereafter, the collected properties are sent to the Storage component.

3234 Demonstration for Property Grabber Component

The pictures from figure 3-13 to figure 3-16 depict the fundamental procedure of the Property Grabber component. To simplify the explanation, the example shown in this section uses the same Web service as before.
1) The inputs of the Property Grabber component are the links of the service list page and the service page received from the Web Service Extractor component. These links are "http://www.service-repository.com" and "http://www.service-repository.com/service/overview/-210897616".
2) When the "GetProperty" function is triggered, it calls its four sub functions shown in figure 3-12.

Figure3-13 Structure properties of the Service "BLZService" in service list page
Figure3-14 Structure properties of the Service "BLZService" in service page

3) Firstly, the "getStructuredProperty" function tries to extract the structured information displayed in the red boxes of figure 3-13 and figure 3-14. However, several elements of the structured information have the same content, such as the description shown in the service page and in the service list page. Hence, in order to save extraction time and storage space, elements with the same content are extracted only once. Moreover, the rating information needs a transformation from non-descriptive content into descriptive text, because it is represented by several star images (a small sketch of this transformation is given after table 3-11). The final results of the extracted structured information for this Web service are shown in table 3-11. Since there is no descriptive information for the provider, the homepage and the owner homepage, their values are assigned "NULL".

Service Name BLZService
WSDL Link http://www.thomas-bayer.com/axis2/services/BLZService?wsdl
WSDL Version 0
Server Apache-Coyote/1.1
Description BLZService
Rating Four stars and A Half
Provider NULL
Homepage NULL
Owner Homepage NULL
Table 3-11 Extracted Structured Information of Web Service "BLZService"
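The rating transformation mentioned in step 3 could, for instance, look like the following minimal sketch. The image file names used to count full and half stars are assumptions for illustration; the actual markup of the registry page may differ.

    // Turns the star images of the rating cell into descriptive text such as "4 stars and a half".
    static String ratingToText(String ratingCellHtml) {
        int fullStars = countOccurrences(ratingCellHtml, "star.gif");      // assumed image name
        int halfStars = countOccurrences(ratingCellHtml, "halfstar.gif");  // assumed image name
        String text = fullStars + " star" + (fullStars == 1 ? "" : "s");
        return halfStars > 0 ? text + " and a half" : text;
    }

    static int countOccurrences(String text, String token) {
        int count = 0;
        for (int i = text.indexOf(token); i >= 0; i = text.indexOf(token, i + token.length())) {
            count++;
        }
        return count;
    }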

4) Secondly, the "getEndpointProperty" function extracts the endpoint information displayed in the red box of figure 3-15. Although there is more than one endpoint record in the list, only one of them is extracted as the endpoint information. The reason is that this master program intends to extract as much information as possible, but this information should not contain redundant entries. Therefore, only one record is taken as the endpoint information even if several endpoint records exist. Table 3-12 shows the final results of the endpoint information of this Web service.

Figure3-15 Endpoint information of the Web service "BLZService" in service page


Endpoint Name BLZServiceSOAP12port_http
Endpoint URL http://www.thomas-bayer.com:80/axis2/services/BLZService
Endpoint Critical True
Endpoint Type production
Bound Endpoint BLZServiceSOAP12Binding
Table 3-12 Extracted Endpoint Information of the Web service "BLZService"

5) Then it is time to extract the monitoring information by invoking the function "getMonitoringProperty". Figure 3-16 displays two types of monitoring properties: the red box above contains the monitoring information about the Web service, and the red box below lists the monitoring information of its endpoints. As mentioned before, only one endpoint statistic record will be extracted. Besides, as can be seen from figure 3-16, there are two availability values; they both represent the availability of this Web service, just like the availability shown in figure 3-14, so one of these availability values is sufficient. Table 3-13 shows the final results of this extracting process.

Figure3-16 Monitoring Information of the Service "BLZService" in service page

Service Availability 100

Number of Downs 0

Total Uptime 1 day 19 hours 19 minutes

Total Downtime 0 second

MTBF 1 day 19 hours 19 minutes

MTTR 0 second

RTT Max of Endpoint 141 ms

RTT Min of Endpoint 0 ms

RTT Average of Endpoint 577 ms

Ping Count of Endpoint 112

Table 3-13 Extracted Monitoring Information of the Web service "BLZService"

6) Next, in order to extract the whois properties, the "getWhoisProperty" function first has to gain the service domain from the WSDL link. For this Web service the gained service domain is "thomas-bayer.com". It then sends this service domain as input to the whois client for the querying process, which returns a list of information for that service domain, see figure 3-17. Table 3-14 shows the extracted whois information.

Deep Web Service Crawler

40

Figure3-17 Whois Information of the service domain "thomas-bayer.com"

Service Domain URL thomas-bayer.com
Domain Name Thomas Bayer
Domain Type NULL
Domain Address Moltkestr. 40
Domain Description NULL
State NULL
Postal Code 54173
City Bonn
Country NULL
Country Code DE
Phone +4922855525760
Fax NULL
Email info@predic8.de
Organization predic8 GmbH
Established Time NULL
Table 3-14 Extracted Whois Information of service domain "thomas-bayer.com"

7) Finally, the information of these four aspects is collected together as the service properties, and these service properties are then forwarded to the Storage component.

324 The Function of Storage Component

The WSDL link from the WSDL Grabber component is used by the Storage component to download the WSDL document from the Web and store it on disk. In addition, the service properties from the Property Grabber component are stored on disk in three different formats by this Storage component. Figure 3-18 illustrates the process flow of the Storage component.
When the Storage component receives the needed inputs, the mediator function "Storager" of the Storage component is triggered. It transforms the service properties into three different output formats and stores them on disk; these output formats are an XML file, an INI file and database records. Besides, it also tries to download the WSDL document from the URL address of the WSDL link and then stores the obtained WSDL document on disk as well. This "Storager" function is composed of four sub functions, namely the "getWSDL", "generateXML", "generateDatabase" and "generateINI" sub functions; each sub function is in charge of one aspect of the storage tasks.


Figure3-18 Overview the process flow of the Storage Component

(1) "getWSDL" sub function
The task of the "getWSDL" sub function is to download the WSDL document and then store it on disk. Above all, it has to get the content of the WSDL document. This procedure works as follows. First, the "getWSDL" sub function checks whether the value of the received WSDL link equals "NULL". As already presented in section 3.2.2, if the Web service does not have a WSDL link, the value of its WSDL link is assigned "NULL". In that case it creates a WSDL document whose name is the service name appended with the mark "No WSDL Document"; obviously this document does not contain any content, it is an empty document. If the service does have a WSDL link, this sub function tries to connect to the Internet based on the URL address of the WSDL link. Once it succeeds, the content hosted on the Web is downloaded, stored on disk and named with the name of the service. Otherwise it creates a WSDL document whose name is prefixed with "Bad" before the service name.
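Since the actual code is only visible in figure 3-19, the following is an illustrative Java sketch of the behaviour just described. The file naming and the simplified error handling are assumptions; the parameters correspond to those explained in section 3.2.4.4.

    import java.io.BufferedReader;
    import java.io.FileWriter;
    import java.io.InputStreamReader;
    import java.net.URL;

    public class WsdlStorage {

        // path and securityInt as described later; name is the service name, linkStr the WSDL link
        static void getWSDL(String path, int securityInt, String name, String linkStr) {
            String base = path + securityInt + name;
            try {
                if (linkStr == null || linkStr.equals("NULL")) {
                    // no WSDL link: create an empty document marked "No WSDL Document"
                    new FileWriter(base + "[No WSDL Document].wsdl").close();
                    return;
                }
                StringBuilder content = new StringBuilder();
                try (BufferedReader in = new BufferedReader(
                        new InputStreamReader(new URL(linkStr).openStream()))) {
                    String line;
                    while ((line = in.readLine()) != null) {
                        content.append(line).append('\n');
                    }
                }
                try (FileWriter out = new FileWriter(base + ".wsdl")) {
                    out.write(content.toString());          // store the downloaded WSDL content
                }
            } catch (Exception e) {
                try {
                    // WSDL link exists but cannot be read: create a document prefixed with "Bad"
                    new FileWriter(path + "Bad" + securityInt + name + ".wsdl").close();
                } catch (Exception ignored) {
                }
            }
        }
    }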

(2) "generateXML" sub function
The "generateXML" sub function takes the service properties as input, transforms them into an XML file and stores it on disk with a file name consisting of the service name plus ".xml". XML stands for eXtensible Markup Language, a markup language designed to transport and store data. The first line of an XML file is the XML declaration, which defines the XML version and the encoding used. For example, <?xml version="1.0" encoding="UTF-8"?> means this is an XML file whose version is 1.0 and whose encoding is the 8-bit Unicode Transformation Format character set. Besides, an XML file consists of XML elements, each reaching from the element's start tag to the element's end tag. An XML element can contain other elements, simple text or a mixture of both. However, an XML file must contain exactly one root element that is the parent of all other elements.
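The following minimal Java sketch illustrates this transformation under a few assumptions: the service properties are given as name-value pairs, "service" is used as the root element (as in the generated output shown in section 4.4), the property names are valid XML element names, and special characters are escaped naively.

    import java.io.FileWriter;
    import java.util.Map;

    public class XmlStorage {

        static String escape(String s) {
            return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
        }

        static void generateXML(String path, int securityInt, String serviceName,
                                Map<String, String> properties) throws Exception {
            StringBuilder xml = new StringBuilder();
            xml.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");   // XML declaration
            xml.append("<!-- service properties of ").append(escape(serviceName)).append(" -->\n");
            xml.append("<service>\n");                                    // the root element
            for (Map.Entry<String, String> p : properties.entrySet()) {
                // assumption: the property name is usable as an XML element name
                xml.append("  <").append(p.getKey()).append(">")
                   .append(escape(p.getValue()))
                   .append("</").append(p.getKey()).append(">\n");
            }
            xml.append("</service>\n");
            try (FileWriter out = new FileWriter(path + securityInt + serviceName + ".xml")) {
                out.write(xml.toString());
            }
        }
    }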

(3) "generateINI" sub function
The "generateINI" sub function also takes the service properties as input, but it transforms them into an INI file and stores it on disk with a file name consisting of the service name plus ".ini". "ini" stands for initialization. The INI file format is a de facto standard for configuration files; such files are simple text files with a basic structure. Generally speaking, an INI file contains three different kinds of parts: sections, parameters and comments. The parameter is the basic element of an INI file; its format is a key-value pair (also called name-value pair), delimited by an equals sign "=", where the key or name always appears to the left of the equals sign. A section groups all of its parameters together; it always appears on a single line enclosed in a pair of square brackets "[ ]", and sections may not be nested. A comment is descriptive text that begins with a semicolon ";"; anything between the semicolon and the end of the line is ignored.
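A minimal sketch of writing such a file in Java could look as follows; the section name and the comment text are assumptions chosen for illustration, while the overall layout follows the format description above.

    import java.io.FileWriter;
    import java.io.PrintWriter;
    import java.util.Map;

    public class IniStorage {

        static void generateINI(String path, int securityInt, String serviceName,
                                Map<String, String> properties) throws Exception {
            try (PrintWriter out = new PrintWriter(
                    new FileWriter(path + securityInt + serviceName + ".ini"))) {
                out.println("; service properties of " + serviceName);    // comment lines
                out.println("; generated by the Deep Web Service Crawler");
                out.println("[service]");                                  // the single section
                for (Map.Entry<String, String> p : properties.entrySet()) {
                    out.println(p.getKey() + "=" + p.getValue());          // parameter as key-value pair
                }
            }
        }
    }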

(4) "generateDatabase" sub function
The inputs of the "generateDatabase" sub function are the same as for the previous two sub functions. Instead of transforming them into a file like XML or INI, the "generateDatabase" sub function turns them into database records by using SQL statements. SQL stands for Structured Query Language, a database language designed for accessing and manipulating data in a database. Most of the actions performed on a database are done with SQL statements, and the primary statements of SQL include insert into, delete, update, select, create, alter and drop. Therefore, for the purpose of transforming these service properties into database records, this sub function first has to create a database using the "create database" statement. Then it creates a table to store the data. A table is a collection of related data entries and consists of columns and rows. Since the data for all these five Web Service Registries is not very large, one database table is enough for storing these service properties. Because of that, the field names of the service properties in the columns have to be uniform and well-defined for all these five Web Service Registries. Afterwards, the service properties of each single service can be put into the table as one record with the "insert into" statement of SQL.
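A minimal JDBC sketch of this step is given below. The database URL, credentials, the table name "services" and the MySQL-style "CREATE TABLE IF NOT EXISTS" syntax are assumptions; it is also assumed that the database itself has already been created and that the uniform column names are valid SQL identifiers.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.Statement;
    import java.util.List;

    public class DatabaseStorage {

        static void storeService(List<String> columnNames, List<String> values) throws Exception {
            try (Connection con = DriverManager.getConnection(
                    "jdbc:mysql://localhost/crawler", "user", "password")) {  // placeholder connection data
                // create the single table once, with one Text column per well-defined property name
                StringBuilder create = new StringBuilder(
                        "CREATE TABLE IF NOT EXISTS services (id INT AUTO_INCREMENT PRIMARY KEY");
                for (String column : columnNames) {
                    create.append(", ").append(column).append(" TEXT");
                }
                create.append(")");
                try (Statement st = con.createStatement()) {
                    st.executeUpdate(create.toString());
                }
                // insert the properties of one Web service as a single record
                StringBuilder insert = new StringBuilder("INSERT INTO services (");
                insert.append(String.join(", ", columnNames)).append(") VALUES (");
                for (int i = 0; i < values.size(); i++) {
                    insert.append(i == 0 ? "?" : ", ?");
                }
                insert.append(")");
                try (PreparedStatement ps = con.prepareStatement(insert.toString())) {
                    for (int i = 0; i < values.size(); i++) {
                        ps.setString(i + 1, values.get(i));
                    }
                    ps.executeUpdate();
                }
            }
        }
    }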

3241 Features of the Storage Component

The Storage component has to provide the following features:
- Generate different output formats
The final result of this master program is to store the information of the services on disk for future work. This Storage component provides three different formats for storing the service properties of the services published in the Web Service Registries, which makes the storage of the services very flexible and also long-lived.
- Obtain the WSDL document
The important task of this component is to check whether the WSDL document can be obtained from the WSDL link, because the WSDL document plays a decisive role for determining the quality of the service. This Storage component provides the ability to deal with the different situations that occur while obtaining the WSDL document.


3242 Input of the Storage Component

This component requires the following input data

- WSDL link of each service
- Property information of each service

3243 Output of the Storage Component

The component will produce the following output data

- WSDL document of the service
- XML document, INI file and tables in the database

3244 Demonstration for Storage Component

The following figures describe the fundamental implementation code of this Storage component. The detailed depiction is given below.
1) As can be seen from figure 3-18 to figure 3-20, there are several common places among the implementation codes. The first common place concerns the parameters defined in each of these sub functions, which are "path" and "SecurityInt". The parameter "path" is the absolute path on the computer disk; it is used for the storing procedure and has already been specified by the user at the beginning of the whole program. The parameter "SecurityInt" is an increasing Integer that is used as part of the name of the service; this prevents services with the same name from overwriting each other on disk. The content of the red marks among the code in these figures is the second common place; its function is to create a file or document for the service with the corresponding parameters.

2) Figure 3-19 displays the implementation code of the "getWSDL" sub function, which is designed to get the WSDL document of the service based on its WSDL link. The parameter "name" is the name of the service and is used as the name of the WSDL document. The parameter "linkStr" is the actual WSDL link, which is the most important one for this sub function. The other two parameters, "statistic" and "log", are the objects of the text files called "Statistic Information" and "Log Information" respectively. The "Statistic Information" text file records the statistic data of the services in each Web Service Registry, such as the overall number of properties for that Web Service Registry, the overall number of services, the number of services that have no WSDL link, the number of services whose WSDL documents contain no content, the number of services whose WSDL links are not available, and so on. The "Log Information" text file records the results of the process steps and the problems encountered, for example which service is being crawled now, which Web Service Registry it belongs to, the reason why the WSDL document of a service cannot be obtained, the reason why a service is unreachable, and so on.


Figure3-19 Implementation code for getting WSDL document

3) Figure 3-20 and figure 3-21 show the code for turning the service properties into the XML file and the INI file and storing those two files on disk thereafter. The parameter "vec" is a Vector of the "PropertyStruct" data type; "PropertyStruct" is a class consisting of two variables, name and value.

Figure3-20 Implementation code for generating XML file


Figure3-21 Implementation code for generating INI file

4) The code in figures 3-22 and 3-23 shows the process that turns the service properties of all services in these five Web Service Registries into records of the database. Therefore a database has to be created first. The name of the database can be arbitrary as long as it conforms to the naming rules of the database; the same holds for the name of the table. Figure 3-22 displays the procedure of creating a table in the database. Because it is hard to decide on the length of each service property, the data types of all service property columns are set to "Text". Figure 3-23 shows the code for inserting the service properties into the table with the "update" statement.

Figure3-22 Implementation code for creating table in database


Figure3-23 Implementation code for generating table records

33 Multithreaded Programming for DWSC

Multithreaded programming is a built-in characteristic provided by the Java language. A multithreaded program contains two or more separate parts that can execute concurrently; each part of such a program is called a thread. The use of multithreading makes it possible to create programs that use the system resources efficiently, for example by making maximum use of the CPU, because the idle time of the CPU can be kept to a minimum.
In this master program there are five Web Service Registries that need to be crawled for the services published among them. Moreover, the number of services published in each Web Service Registry is quite different, which makes the running time required for each Web Service Registry different as well. As a consequence, a Web Service Registry that owns fewer services would have to wait until another Web Service Registry with many more services finishes. Therefore, in order to reduce the waiting time for the other Web Service Registries and to maximize the use of the system resources, it is necessary to apply multithreaded programming to this master program. That is to say, this master program creates one thread for each Web Service Registry, and these threads are executed independently.
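The one-thread-per-registry idea can be sketched in Java as follows. The method "crawlRegistry" and the list of registry names are stand-ins for the real per-registry crawling logic of this master program.

    public class RegistryThreads {

        static final String[] REGISTRIES = {
            "Service Repository", "Ebi", "Xmethods", "Seekda", "Biocatalogue"
        };

        public static void main(String[] args) throws InterruptedException {
            Thread[] threads = new Thread[REGISTRIES.length];
            for (int i = 0; i < REGISTRIES.length; i++) {
                final String registry = REGISTRIES[i];
                threads[i] = new Thread(new Runnable() {
                    public void run() {
                        crawlRegistry(registry);     // each registry is crawled independently
                    }
                });
                threads[i].start();                  // all five threads run concurrently
            }
            for (Thread t : threads) {
                t.join();                            // wait until every registry has finished
            }
        }

        static void crawlRegistry(String registry) {
            // placeholder for the actual crawling of one Web Service Registry
            System.out.println("Crawling " + registry);
        }
    }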

34 Sleep Time Configuration for Web Service Registries

Because this master program is intended for downloading the WSDL documents and extracting the service information of the Web services published in the Web Service Registries, it inevitably affects the performance of these Web Service Registries. In addition, for the purpose of not exceeding their throughput capability, these Web Service Registries restrict the rate of access. Because of that, unknown errors sometimes happen while this master program is executing: for instance, the program may halt at one point without getting any more WSDL documents and service information, the WSDL documents of some services of some Web Service Registries cannot be obtained, or some service information is missing. Therefore, in order to obtain the largest possible number of Web services published in these five Web Service Registries without affecting their throughput, the access rate has to be configured for each service of all Web Service Registries.
Consequently, before going into the essential procedure for each single service of these Web Service Registries, this master program calls the system's built-in function "sleep(long milliseconds)". It is a public static function which causes the currently executing thread to sleep for the specified number of milliseconds; in other words, it temporarily ceases execution for a while. The following table shows the time interval of the sleep function for each Web Service Registry, and a small sketch of its use follows the table.

Web Service Registry Name Time Interval (milliseconds)

Service Repository 8000

Ebi 3000

Xmethods 10000

Seekda 20000

Biocatalogue 10000

Table 3-15 Sleep Time of these five Web Service Registries
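A minimal sketch of applying these intervals is shown below, assuming the values of table 3-15 are kept in a simple map and the sleep is performed before the essential procedure for each single service of a registry.

    import java.util.HashMap;
    import java.util.Map;

    public class SleepConfiguration {

        static final Map<String, Long> SLEEP_MILLIS = new HashMap<String, Long>();
        static {
            SLEEP_MILLIS.put("Service Repository", 8000L);
            SLEEP_MILLIS.put("Ebi", 3000L);
            SLEEP_MILLIS.put("Xmethods", 10000L);
            SLEEP_MILLIS.put("Seekda", 20000L);
            SLEEP_MILLIS.put("Biocatalogue", 10000L);
        }

        // called before the essential procedure for each single service of a registry
        static void sleepBeforeService(String registry) throws InterruptedException {
            Thread.sleep(SLEEP_MILLIS.get(registry));   // temporarily cease execution
        }
    }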


4 Experimental Results and Analysis

This chapter presents the quantitative experimental results of the prototype described in chapter 3, together with an analysis and explanation of these results. In order to gain rather accurate results, the experiments were carried out more than five times; all data displayed in the following tables and charts are the averages of these runs.

41 Statistic Information for Different Web Service Registries

This section discusses the amount statistics of the Web services published in these five Web Service Registries. It includes the overall number of Web services published in each Web Service Registry and the number of unavailable Web services, which have been archived because they may not be active anymore or are close to becoming inactive. Table 4-1 shows the service amount statistics of these five Web Service Registries.

Web Service Registry Name   Service Repository   Ebi   Xmethods   Seekda   Biocatalogue
Overall Services            57                   289   382        853      2567
Unavailable Services        0                    0     0          0        125
Table 4-1 Service amount statistic of these five Web Service Registries

Nevertheless, in order to give an intuitive view of the service amount statistics in these five Web Service Registries, the bar chart in figure 4-1 visualizes the data of table 4-1. As the bar chart shows, on the one hand the overall number of Web services increases from the Service Repository Web Service Registry to the Biocatalogue Web Service Registry. The Biocatalogue Web Service Registry owns by far the largest number of Web services, which indicates that it is much better positioned to provide Web services to users because it contains far more services than the other four Web Service Registries. On the other hand, there are no unavailable services in any registry except for the Biocatalogue Web Service Registry; that is to say, only the Biocatalogue Web Service Registry contains Web services that cannot be used by the users. To some degree this is useless, since these services cannot be used anymore and they waste network resources on the Web. Therefore, all these unavailable services should be eliminated in order to reduce this waste of network resources.


Figure4-1 Service amount statistic of these five Web Service Registries

42 Statistic Information for WSDL Document

Web Service Registry Name   Service Repository   Ebi   Xmethods   Seekda   Biocatalogue
Failed WSDL Links           1                    0     23         145      32
Without WSDL Links          0                    0     0          0        16
Empty Content               0                    0     2          0        2
Table 4-2 Statistic information for WSDL Document

Table 4-2 and figure 4-2 show the error statistics of the WSDL documents of the Web services in these five Web Service Registries. There are three aspects. The first one is the "Failed WSDL Links" of the Web services among these Web Service Registries: the overall number of Web services whose WSDL links are invalid. In other words, it is impossible to get the WSDL documents of these Web services via the URL addresses of their WSDL links, and therefore no WSDL document is created for them. The second aspect is the "Without WSDL Links" count, the overall number of Web services in each Web Service Registry that have no WSDL link at all. The value of the WSDL link of such a Web service is "NULL"; a WSDL document is still created, but it has no content and its name contains the string "[No WSDL Document]". The third aspect is the "Empty Content", which represents the overall number of Web services that do have WSDL links whose URL addresses


are valid, but whose WSDL documents contain no content. In this case a WSDL document whose name contains the string "(BAD)" is created.

Figure4-2 Statistic information for WSDL Document

43 Comparison of Different Average Number of Service Properties

This section compares the average number of service properties in these five Web Service Registries. This average number of service properties is calculated by means of the following equation:
ASP = ONSP / ONS (1)
Where
ASP is the average number of service properties for one Web Service Registry
ONSP is the overall number of service properties in one Web Service Registry
ONS is the overall number of Web services that have already been crawled from the corresponding Web Service Registry

Figure 4-3 shows the average number of service properties per Web service in these five Web Service Registries. As already mentioned, one of the measurements for assessing the quality of the Web services in a Web Service Registry is the service information: the more information there is about a Web service, the better that service is known, and consequently the corresponding Web Service Registry can offer Web services of better quality to its users. As seen from figure 4-3, the Service Repository and Biocatalogue Web Service Registries own a larger number of service properties than the other three Web Service Registries. This directly reflects that these two Web Service Registries provide more detailed information about the Web services published in them, so users can more easily choose the services they need and would also prefer to use the Web services published in these two Web Service Registries. By contrast, the Xmethods and Seekda


Web Service Registries, which have less service information about their Web services, offer lower quality for these Web services. Therefore users may not like to use the Web services provided in these two Web Service Registries, not to mention the Web services published in the Ebi Web Service Registry.

Figure4-3 Average Number of Service Properties

According to the description presented in section 3.2.3, the causes of the different numbers of service properties in these Web Service Registries may consist of the following points. First, the amount of structured information differs across these five Web Service Registries, and part of the information of some Web services in one registry can be missing or empty; for example, the amount of structured information that is supposed to be extracted for the Web services in the Ebi Web Service Registry differs greatly from that in the Biocatalogue Web Service Registry. Secondly, there is a certain amount of endpoint information for almost all Web Service Registries except the Ebi Web Service Registry, which more or less reduces the overall number of service properties. Thirdly, some Web Service Registries, such as Xmethods and Ebi, do not have monitoring information, while the Service Repository Web Service Registry in particular has a large amount of monitoring information that can be extracted from the Web. Obviously, the last point is the amount of whois information for these Web services. If the database of the whois client does not contain information about the service domain of a Web service, then no whois information can be extracted; and even if there is information about the service domain, its amount can be very diverse. Therefore, if many service domains of the Web services in one registry have no or only little whois information, the average number of service properties in that registry decreases greatly.
As a result, in order to help users get better acquainted with the Web services provided in a Web Service Registry and to distinguish these Web services, the Web Service Registry should do its best to offer more and more information for each of its published Web services.

(The bar chart in figure 4-3 shows an average of about 23 service properties for Service Repository, 7 for Ebi, 17 for Xmethods, 17 for Seekda and 32 for Biocatalogue.)


44 Different Outputs of Web Services

The fundamental task of this master program is to obtain the WSDL documents of the Web services hosted in the Web Service Registries, as well as to extract and gather the properties of these Web services, and thereafter store them on disk. Therefore this section describes the different outputs of this master program, which include the WSDL documents of the Web services, the generated XML and INI files, and the data records of these service properties in the database.

Figure4-4 WSDL Document format of one Web service

The WSDL document of a Web service is simply read from the Web according to the URL address of its WSDL link, and these data are then stored on disk as the WSDL document. The name of the WSDL document is the service name plus the ending ".wsdl", which marks it as a WSDL document. In order to distinguish WSDL documents whose names would be the same although their contents differ, the name of each obtained WSDL document in one Web Service Registry additionally contains a unique Integer in front of the name. Figure 4-4 shows the valid WSDL document format of a Web service; its name is "1BLZService.wsdl".
Besides, the obtained service properties are transformed into an XML file, an INI file and data records in the database. Figure 4-5, figure 4-6 and figure 4-7 show these three different output formats respectively. Figure 4-5 shows the INI file of the Web service, named "1BLZService.ini"; the Integer is the same as in the WSDL document because both belong to the same Web service. The first three lines in the INI file are service comments, which start with a semicolon and extend to the end of the line; they are basic information describing this INI file. The following line is the section, enclosed in a pair of brackets; it is important because it indicates that the lines behind it contain the information of this Web service. Hence the rest of the lines hold the actual service information as key-value pairs with an equals sign between key and value, and each service property starts at the beginning of a line.


Figure4-5 INI File format of one Web service

Figure4-6 XML File format of one Web service

Figure4-7 Database data format for all Web services


Furthermore, figure 4-6 shows the generated XML file of the Web service "BLZService"; its name is "1BLZService.xml". Needless to say, this XML file is part of the materials of the same Web service. Although the format of the XML file differs from that of the INI file, their essential contents are the same, that is, the values of the service properties do not differ, because both files are generated from the same collection of properties of the same Web service. The XML file also has some comments like those in the INI file, which are enclosed between "<!--" and "-->". The section in the INI file corresponds roughly to the root element in the XML file; therefore all values of the elements under the root "service" in this XML file are the values of the service properties of this Web service.
Finally, figure 4-7 shows the database table that is used to store the service information of all Web services in these five Web Service Registries. The entire service information of one Web service forms exactly one record in this table. Because of that, the column names of the table are the union of the names of the service information of each Web Service Registry; since column names have to be unique, redundant names in this union are eliminated. This is possible because the names of the service information are well-defined and uniform for all these five Web Service Registries. In addition, the first column of this table is the primary key, an increasing Integer whose function resembles that of the Integer contained in the names of the XML and INI files, while the remaining columns are the corresponding service properties of the Web services. The value "NULL" or "Null" in this table indicates that this property of that Web service is empty or missing.

45 Comparison of Average Time Cost for Different Parts of Single Web Service

This section compares the average time cost of the different parts of getting one single Web service in all these five Web Service Registries. First, the average time cost of getting one single service in a Web Service Registry has to be calculated, which is obtained through the following equation:
ATC = OTS / ONS (2)
Where
ATC is the average time cost for one single Web service
OTS is the overall time cost of all the Web services in one Web Service Registry
ONS is the overall number of Web services that have already been crawled from the corresponding Web Service Registry

In addition, the average time cost for getting one single service is split into the following six parts: the average time cost for extracting the service properties, the average time cost for obtaining the WSDL document, the average time cost for generating the XML file, the average time cost for generating the INI file, the average time cost for inserting the service properties into the database table, and the average time cost for some other procedures, such as getting the service list page link, getting the service page link and so on. The average time cost for extracting the service properties is obtained by means of the following equation:
ATCSI = OTSSI / ONS (3)

Where
ATCSI is the average time cost for extracting the service properties of one single Web service
OTSSI is the overall time cost for extracting the service properties of all the Web services in one Web Service Registry
ONS is the overall number of Web services that have already been crawled from the corresponding Web Service Registry
The calculation of the other parts is analogous to this equation for the average time cost of extracting the service properties, while the average time cost for the other procedures equals the average time cost for one single Web service minus the sum of the average time costs of the other five parts.
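Written out with the abbreviations introduced above, and using ATCW, ATCX, ATCI, ATCD and ATCO as shorthands (introduced here only for illustration) for the WSDL, XML, INI, database and other parts, this reads:
ATCO = ATC - (ATCSI + ATCW + ATCX + ATCI + ATCD)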

Registry Name        Service Property   WSDL Document   XML File   INI File   Database   Others   Overall
Service Repository   8801               918             2          1          53         267      10042
Ebi                  699                82              2          1          28         11       823
Xmethods             5801               1168            2          1          45         12       7029
Seekda               5186               1013            2          1          41         23       6266
Biocatalogue         39533              762             2          1          66         1636     42000
Table 4-3 Average time cost information for all Web Service Registries (in milliseconds)

Table 4-3 displays the average time cost of one Web service and of its different parts in all these five Web Service Registries. The first column of table 4-3 contains the names of these five Web Service Registries, and the last column is the average time cost for a single service in the respective Web Service Registry, while the remaining columns are the average time costs of the six different parts. In order to give an intuitive view of the data in table 4-3, the data of the columns are illustrated in the corresponding figures 4-8 to 4-13.

Figure4-8 Average time cost for extracting service property in all Web Service Registries


As can be seen from figure 4-8, the average time cost for extracting the service properties in the Biocatalogue Web Service Registry is 39533 milliseconds. This value is much larger than in the other four Web Service Registries, which take 8801, 699, 5801 and 5186 milliseconds for Service Repository, Ebi, Xmethods and Seekda respectively. That is to say, it takes much longer to extract the service properties of the Web services published by the Biocatalogue Web Service Registry. This also indirectly indicates that the Biocatalogue Web Service Registry has the highest average number of service properties, which has already been discussed in section 4.3; conversely, the average number of service properties in the Ebi Web Service Registry is the smallest. Moreover, the average time cost in the Xmethods Web Service Registry is larger than in the Seekda Web Service Registry, although, as already known, the average number of service properties is the same for these two Web Service Registries. One cause that may explain why Xmethods costs more time than Seekda is that the extraction of the service properties in the Xmethods Web Service Registry has to be carried out by means of both the service page and the service list page, whereas only the service page link is needed for the Seekda Web Service Registry.

The average time cost for obtaining the WSDL document in all these five Web Service Registries is displayed in figure 4-9. This average time cost is the sum of the average time for extracting the WSDL link of a Web service and the average time for reading the data of the WSDL document from the Web and storing it on disk. As seen from this figure, the average time cost for obtaining the WSDL document is largest in the Xmethods Web Service Registry, namely 1168 milliseconds. Although the process of extracting the WSDL link costs a certain amount of time, it does not have a significant influence on the total average time spent for obtaining the WSDL document, because the WSDL link of a Web service is almost always gained in one step. Therefore this implies that the average size of the WSDL documents in the Xmethods Web Service Registry is larger than in the other four Web Service Registries, in particular the Ebi Web Service Registry, which takes just 82 milliseconds for obtaining the WSDL document.

Figure4-9 Average time cost for obtaining WSDL document in all Web Service Registries


Figure 4-10, figure 4-11 and figure 4-12 show the average time cost of generating these three different outputs in all these five Web Service Registries. As can be seen from figures 4-10 and 4-11, the average time for generating the XML file of one Web service is the same for all five Web Service Registries, namely only 2 milliseconds, and the average time for generating the INI file is the same as well, with a value of just 1 millisecond. Even the sum of these two average time costs is still so small that it can be neglected when compared to the overall average time cost of getting one Web service in each Web Service Registry shown in figure 4-13. This implies that the generation of the XML and INI files finishes immediately after receiving the service properties of a Web service as input. Furthermore, figure 4-12 shows that, although the average time cost of creating the database record of a Web service is larger than the time for generating the XML and INI files in all five Web Service Registries, the operation of creating the database record is still fast.

Figure4-10 Average time cost for generating XML file in all Web Service Registries

Figure4-11 Average time cost for generating INI file in all Web Service Registries


Figure4-12 Average time cost for creating database record in all Web Service Registries

Figure4-13 Average time cost for getting one Web service in all Web Service Registries

Figure 4-13 gives the average time cost for getting one single Web service in all these five Web Service Registries. Without any doubt, the Biocatalogue Web Service Registry takes the longest time for this process. This is because the presentation of the different parts described above shows that, for the Biocatalogue Web Service Registry, each part needs more time to finish its corresponding process, except for the process of obtaining the WSDL document, for which the Biocatalogue Web Service Registry does not have the longest average time. Moreover, an interesting observation can be made when looking at figures 4-8, 4-12 and 4-13: the shapes of these curves follow almost the same trend. This further indicates that a Web Service Registry that spends more time to get the description information of a Web service also offers more information about that Web service.


5 Conclusion and Further Direction

This master thesis provides a schema which aims to explore the description information of the Web services hosted in different Web Service Registries. The description information of a Web service consists of the WSDL document and the service information about that Web service. Although there is an existing approach, the Pica-Pica Web Service Description Crawler, that can be used to obtain the WSDL document and service information of the Web services in these Web Service Registries, its functionality is restricted: it explores only a small subset of the Web services hosted in these Web Service Registries, the storage of these Web services is not flexible and, most importantly, only a few pieces of service information are extracted per Web service, and for some Web Service Registries no service information is extracted at all. The work presented in this master thesis, in contrast, is able to explore all Web services in these Web Service Registries. Moreover, with the approach presented in this master thesis as much service information as possible is extracted for each Web service, so that the final result is the largest annotated service catalogue ever produced. Furthermore, regarding the storage of the description information of the Web services, this master thesis provides three different formats that guarantee not only the completeness but also the longevity of the description information of the Web services.

However, in the implementation performed in this master thesis, the whois client used for querying the information of a service domain returns free text if the information exists, and sometimes this free text differs completely from domain to domain. As a consequence, each Web service in all Web Service Registries had to be crawled at least once during the experiment stage, so that all cases of this free text could be foreseen and processed afterwards. This is a huge amount of work because there are lots of Web services in these Web Service Registries. Therefore, in order to simplify the work, another whois client that eases this task needs to be found and used.
Moreover, in the experiment stage of this master thesis the time cost for getting a Web service is still large. For the purpose of reducing this time, multithreaded programming could also be applied to some parts of the process of getting one single Web service.
Although the work performed here is specialized for only these five Web Service Registries, the main parts of the principles used here are adaptable to other Web Service Registries with only small changes in the implementation code or the structure.



7 Appendixes

There are additional outputs of this master program, namely the log information file and the statistic report file. Figure 8-1 shows one of the basic output formats of the log information for these five Web Service Registries.
Figure8-1 Log information of the "Service Repository" Web Service Registry
Figure8-2 Statistic information of the "Service Repository" Web Service Registry


Figures 8-2 to 8-6 show the output format of the statistic information for the "Service Repository", "Ebi", "Xmethods", "Seekda" and "Biocatalogue" Web Service Registries respectively.
Figure8-3 Statistic information of the "Ebi" Web Service Registry
Figure8-4 Statistic information of the "Xmethods" Web Service Registry


Figure8-5 Statistic information of the "Seekda" Web Service Registry
Figure8-6 Statistic information of the "Biocatalogue" Web Service Registry


Table of Figures

Figure 2-1 Dataflow of Service-Finder and Its Components .......... 12
Figure 2-2 Left is the free text input type and right is its output .......... 16
Figure 2-3 A Semi-structured page containing data records (in rectangular box) to be extracted .......... 16
Figure 2-4 Architecture of the Pica-Pica Web Service Description Crawler .......... 20
Figure 3-1 Overview of the Basic Architecture for the Deep Web Services Crawler .......... 25
Figure 3-2 Overview of the process flow of the Web Service Extractor Component .......... 27
Figure 3-3 Service list page of the Service-Repository .......... 29
Figure 3-4 Original source code of the internal link for Web service "BLZService" .......... 29
Figure 3-5 Code overview of getting service page link in Service Repository .......... 29
Figure 3-6 Service page of the Web service "BLZService" .......... 29
Figure 3-7 Overview of the process flow of the WSDL Grabber Component .......... 30
Figure 3-8 WSDL link of the Web service "BLZService" in the service page .......... 31
Figure 3-9 Original source code of the WSDL link for Web service "BLZService" .......... 32
Figure 3-10 Code overview of "getServiceRepositoryWSDLLink" function .......... 32
Figure 3-11 Code overview of "oneParameter" function .......... 32
Figure 3-12 Overview of the process flow of the Property Grabber Component .......... 33
Figure 3-13 Structured properties of the service "BLZService" in service list page .......... 37
Figure 3-14 Structured properties of the service "BLZService" in service page .......... 38
Figure 3-15 Endpoint information of the Web service "BLZService" in service page .......... 38
Figure 3-16 Monitoring information of the service "BLZService" in service page .......... 39
Figure 3-17 Whois information of the service domain "thomas-bayer.com" .......... 40
Figure 3-18 Overview of the process flow of the Storage Component .......... 41
Figure 3-19 Implementation code for getting WSDL document .......... 44
Figure 3-20 Implementation code for generating XML file .......... 44
Figure 3-21 Implementation code for generating INI file .......... 45
Figure 3-22 Implementation code for creating table in database .......... 45
Figure 3-23 Implementation code for generating table records .......... 46
Figure 4-1 Service amount statistic of these five Web Service Registries .......... 49
Figure 4-2 Statistic information for WSDL Document .......... 50
Figure 4-3 Average Number of Service Properties .......... 51
Figure 4-4 WSDL Document format of one Web service .......... 52
Figure 4-5 INI File format of one Web service .......... 53
Figure 4-6 XML File format of one Web service .......... 53
Figure 4-7 Database data format for all Web services .......... 53
Figure 4-8 Average time cost for extracting service property in all Web Service Registries .......... 55
Figure 4-9 Average time cost for obtaining WSDL document in all Web Service Registries .......... 56
Figure 4-10 Average time cost for generating XML file in all Web Service Registries .......... 57
Figure 4-11 Average time cost for generating INI file in all Web Service Registries .......... 57
Figure 4-12 Average time cost for creating database record in all Web Service Registries .......... 58
Figure 4-13 Average time cost for getting one Web service in all Web Service Registries .......... 58


Table of Tables

Table 3-1 Structured Information of Service-Repository Web Service Registry .......... 34
Table 3-2 Structured Information of Xmethods Web Service Registry .......... 34
Table 3-3 Structured Information of Seekda Web Service Registry .......... 34
Table 3-4 Structured Information of Ebi Web Service Registry .......... 34
Table 3-5 Structured Information of Biocatalogue Web Service Registry .......... 34
Table 3-6 SOAP Operation Information of Biocatalogue Web Service Registry .......... 35
Table 3-7 REST Operation Information of Biocatalogue Web Service Registry .......... 35
Table 3-8 Endpoint Information of these five Web Service Registries .......... 35
Table 3-9 Monitoring Information of these five Web Service Registries .......... 35
Table 3-10 Whois Information for these five Web Service Registries .......... 36
Table 3-11 Extracted Structured Information of Web Service "BLZService" .......... 38
Table 3-12 Extracted Endpoint Information of the Web service "BLZService" .......... 39
Table 3-13 Extracted Monitoring Information of the Web service "BLZService" .......... 39
Table 3-14 Extracted Whois Information of service domain "thomas-bayer.com" .......... 40
Table 3-15 Sleep Time of these five Web Service Registries .......... 47
Table 4-1 Service amount statistic of these five Web Service Registries .......... 48
Table 4-2 Statistic information for WSDL Document .......... 49
Table 4-3 Average time cost information for all Web Service Registries .......... 55


Table of Abbreviations

DTD: Document Type Definition

DWSC: Deep Web Service Crawler

HTML: HyperText Markup Language

MTBF: Mean Time between Failures

MTTR: Mean Time to Recovery

QoS: Quality of Service

REST: Representational State Transfer

RTT: Round Trip Time

SOAP: Simple Object Access Protocol

SQL: Structured Query Language

URL: Uniform Resource Locator

WHATWG: Web Hypertext Application Technology Working Group

WSDL: Web Service Description Language

WSML: Web Service Modeling Language

WSMO: Web Service Modeling Ontology

XML: eXtensible Markup Language
