Dresden University of Technology (Germany)
Deep Web Services Crawler
Name: Duan Dehua
Matrikel-Nr.: 3459827
Faculty: Department of Computer Science
Major: Computational Engineering
Kind of Topic: Master Thesis
Supervisor: Dipl.-Inf. Josef Spillner
Prof. Dr. rer. nat. habil. Dr. h. c. Alexander Schill
Start Date: May 01, 2010
Finish Date: October 31, 2010
Acknowledgements
The work presented in this master thesis is a result of the master task for the Web Service project provided by the Chair of Computer Networks at Dresden University of Technology.
I am heartily thankful to my supervisor, Dipl.-Inf. Josef Spillner, whose encouragement, guidance, and support from the initial to the final level enabled me to develop an understanding of the subject.
Lastly, I offer my regards and blessings to all of those who supported me in any respect during the completion of this project.
Abstract
Nowadays, Web Service Registries offer convenient access for offering, searching, and using electronic Web Services. Usually they host Web Service descriptions along with related metadata generated by both the system and the users. Hence, monitoring and rating information can help users to distinguish similar Web Service offerings. However, at present there is little support for comparing these Web Services across platforms and for building a global view. Besides, not all of the metadata, especially the non-functional property descriptions, are made available in a structured format.
Therefore, the task of this master thesis is to apply Deep Web analysis techniques to extract as much information about these published Web Services as possible. Accordingly, the result shall be the largest annotated Service Catalogue ever produced.
Index Terms
Web Service, Deep Web Service Crawler, Service-Finder, Pica-Pica Web Service Description Crawler, WSDL
Table of Contents
Acknowledgements
Abstract
1 Introduction
1.1 Background/Motivation
1.2 Initial Design of the Deep Web Service Crawler Approach
1.3 Goals of this Master Thesis
1.4 Outline of this Master Thesis
2 State of the Art
2.1 Service-Finder Project
2.1.1 Use Cases for the Service-Finder Project
2.1.1.1 Use Case Methodology
2.1.1.2 System Administrator
2.1.2 Architecture Plan for the Service-Finder Project
2.1.2.1 The Principle of the Service Crawler Component
2.1.2.2 The Principle of the Automatic Annotator Component
2.1.2.3 The Principle of the Conceptual Indexer and Matcher Component
2.1.2.4 The Principle of the Service-Finder Portal Interface Component
2.1.2.5 The Principle of the Cluster Engine Component
2.2 Information Extraction
2.2.1 Input Types of Information Extraction
2.2.2 Extraction Targets of Information Extraction
2.2.3 Techniques Used in Information Extraction
2.3 Pica-Pica Web Service Description Crawler
2.3.1 Required Libraries of the Pica-Pica Web Service Description Crawler
2.3.2 Architecture of the Pica-Pica Web Service Description Crawler
2.3.3 Implementation of the Pica-Pica Web Service Description Crawler
2.4 Conclusions on the Existing Strategies
3 Design and Implementation
3.1 Deep Web Services Crawler Requirements
3.1.1 Basic Requirements for the DWSC
3.1.2 System Requirements for the DWSC
3.1.3 Non-Functional Requirements for the DWSC
3.2 Deep Web Services Crawler Architecture
3.2.1 The Function of the Web Service Extractor Component
3.2.1.1 Features of the Web Service Extractor Component
3.2.1.2 Input of the Web Service Extractor Component
3.2.1.3 Output of the Web Service Extractor Component
3.2.1.4 Demonstration of the Web Service Extractor
3.2.2 The Function of the WSDL Grabber Component
3.2.2.1 Features of the WSDL Grabber Component
3.2.2.2 Input of the WSDL Grabber Component
3.2.2.3 Output of the WSDL Grabber Component
3.2.2.4 Demonstration of the WSDL Grabber Component
3.2.3 The Function of the Property Grabber Component
3.2.3.1 Features of the Property Grabber Component
3.2.3.2 Input of the Property Grabber Component
3.2.3.3 Output of the Property Grabber Component
3.2.3.4 Demonstration of the Property Grabber Component
3.2.4 The Function of the Storage Component
3.2.4.1 Features of the Storage Component
3.2.4.2 Input of the Storage Component
3.2.4.3 Output of the Storage Component
3.2.4.4 Demonstration of the Storage Component
3.3 Multithreaded Programming for the DWSC
3.4 Sleep Time Configuration for Web Service Registries
4 Experimental Results and Analysis
4.1 Statistical Information for Different Web Service Registries
4.2 Statistical Information for WSDL Documents
4.3 Comparison of the Average Number of Service Properties
4.4 Different Outputs of Web Services
4.5 Comparison of the Average Time Cost for Different Parts of a Single Web Service
5 Conclusion and Further Directions
6 Bibliography
7 Appendixes
Table of Figures
Table of Tables
Table of Abbreviations
1 Introduction
In this introduction chapter of the master thesis at first itrsquos going to concisely explain the background
of the current situation then is the basic introduction of the proposed approach which is called Deep
Web Service Extraction Crawler
11 BackgroundMotivation
In the late 1990rsquos the Web Service Registry was a hot commodity The formal definition of Web
Service Registry is known as a link links page Its function is to uniformly present information that
comes from various sources Hence it can provide a convenient channel to the users for offering
searching and using the Web Services Actually the related metadata of the Web Services that
submitted by both the system and users are commonly hosted along with the Service descriptions
Nevertheless in fact when users enter one of the Web Service Registries to look for some Web
Services they might meet some situations that would bring lots of trouble to them One of the
situations may be like that these Web Service Registries return several similar published Web Services
after the users search on it For example two or more Web Services have the same name but their
versions are not the same Or two or more Web Services that derived from the same server but have
different contents etc Furthermore most users are also interested in a global view of the published
services For instance they want to know which Web Service Registry can provide better quality for
the Web Service Therefore in order to help users to differentiate those similar published Web
Services and have a global view of the Web Services this information should be monitored and rated
Moreover there are a great many Web Service Registries in the Internet Each Web Service Registry
can provide a great number of Web Services Obviously there might have some similar Web Services
among these Web Service Registries Or a Web Service in one of the Web Service Registry is related to
another Web Service in other Web Service Registries Hence these Web Services should be
comparable across different Web Service Registries However recently there has not much support of
this In addition towards the metadata actually not all of them are structured especially the
descriptions of the non-functional property Therefore what have to do now is to turn those
non-functional property descriptions into the structured format Clearly speaking it needs to extract
as much information as possible about the Web Services that offered in the Web Service Registries
Eventually after extracting all the information from the Web Service Registries it is necessary to store
them into the disk This procedure should be efficient flexible and completeness
1.2 Initial Design of the Deep Web Service Crawler Approach
The problems have already been stated in the previous section; hence, the following work is to solve these problems. This section presents the basic principle of the Deep Web Service Crawler approach.
First comes a simple introduction of how the Deep Web Service Crawler approach addresses these problems. As already mentioned, each Web Service Registry offers Web Services. Moreover, each Web Service Registry has its own HTML page structures; these structures may be the same or completely different. Therefore, the first step is to identify which Web Service Registry is going to be explored. Since each Web Service Registry owns a unique URL, this job can be done by directly analyzing the corresponding URL address of that Web Service Registry. After identifying the Web Service Registry, the following step is to obtain all the Web Services published in it. Then, with all these obtained Web Services, it is time to extract, analyze, and gather the information about the services. That information may be in a structured format or even in an unstructured format. In this master thesis, Deep Web analysis techniques are applied to obtain this information, so that each Web Service is annotated with as much information as possible. Last but not least, all the information about the Web Services needs to be stored.
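As a small illustration of this first step, the following hedged Python sketch identifies a registry by analyzing its URL; the lookup table and helper function are invented for illustration (the actual crawler is written in Java, see chapter 3):

    import urlparse  # Python 2 module name; urllib.parse in Python 3

    # Hypothetical seed-to-registry table, based on the seeds in chapter 3
    REGISTRIES = {
        'www.biocatalogue.com': 'Biocatalogue',
        'www.ebi.ac.uk': 'Ebi',
        'www.seekda.com': 'Seekda',
        'www.service-repository.com': 'Service-Repository',
        'www.xmethods.net': 'Xmethods',
    }

    def identify_registry(seed_url):
        # Each Web Service Registry owns a unique URL, so the host name
        # alone tells the crawler which page structure to expect
        host = urlparse.urlparse(seed_url).netloc
        return REGISTRIES.get(host, 'unknown')

    print(identify_registry('http://www.service-repository.com'))  # Service-Repository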
1.3 Goals of this Master Thesis
The goals of this master thesis are the following:
- Produce the largest annotated Service Catalogue
A Service Catalogue is a list of service properties. The more properties a service has, the larger the Service Catalogue it owns. Therefore, this master program should extract as many service properties as possible.
- Flexible storage of the metadata of each service as annotations or dedicated documents
The metadata of a service include not only the WSDL document but also the service properties. All these metadata are important information about the service. Therefore, this master program should provide flexible ways to store these metadata on disk.
- Improve the comparability of Web Services across different Web Service Registries
The names of the service properties in one Web Service Registry may differ from those in another Web Service Registry. Hence, for the purpose of improving comparability, all these service property names should be unified and well-defined.
1.4 Outline of this Master Thesis
In this chapter, the motivation, objectives, and initial approach plan have already been discussed. The remainder of this thesis is structured as follows.
Chapter 2 presents work that is based on existing techniques. Section 2.1 gives a detailed introduction to the Service-Finder project. Then section 2.2 presents the Information Extraction technique. After that, section 2.3 introduces and discusses the already implemented Pica-Pica Web Service Description Crawler.
Chapter 3 explains the design details of the Deep Web Service Crawler approach. Section 3.1 gives a short description of the different requirements of this approach. Next, section 3.2 presents the actual design of the Deep Web Service Crawler. Then sections 3.3 and 3.4 introduce, respectively, the multithreaded programming and the sleep time configuration used in this master program.
Chapter 4 displays the experiments with this Deep Web Service Crawler approach and then gives an evaluation of them.
Finally, chapter 5 presents the conclusions, a discussion of the work already done, and the future work for this master task.
2 State of the Art
This chapter presents some existing techniques and strategies related to the work of applying the Deep Web Service Extraction Crawler approach. Section 2.1 talks about an existing catalogue, the Service-Finder project. Section 2.2 then presents some details of the Information Extraction technique. Finally, section 2.3 explains an existing implemented crawler, the Pica-Pica Web Service Description Crawler.
2.1 Service-Finder Project
The Service-Finder project aims at developing a platform for Web Service discovery, especially for Web Services that are embedded in a Web 2.0 environment [1]. Hence, it can provide efficient access to publicly available services. The goals of the Service-Finder project are as follows [1]:
- Automatically gather Web Services and their related information
- Semi-automatically create semantic service descriptions based on the information available on the Web
- Create and improve semantic annotations via user feedback
- Describe the aggregated information in semantic models and allow reasoning and querying
However, before describing the basic functionality of the Service-Finder project, one of its use cases and its requirements are presented first.
2.1.1 Use Cases for the Service-Finder Project
The Service-Finder project employed the use case methodology of the W3C Use Case description [6] for its needs and then applied this methodology to the use cases it enumerated.
2.1.1.1 Use Case Methodology
Three aspects need to be considered for the use case definitions of the Service-Finder project [1]:
(1) Description: used to describe the information of the use case
(2) Actors, Roles and Goals: used to identify the actors, the roles they act, and the goals they need to achieve in the scenario
(3) Storyboard: used to describe the series of interactions among the actors and the Service-Finder Portal
2.1.1.2 System Administrator
This section presents the use case that applies to the Service-Finder Portal and illustrates the requirements on its functionality from a user's point of view. All the information in this use case is derived from [1]. In this use case there is a system administrator named "Sam Adams". He works for a bank, and his job is to keep the online payment facilities online and working day and night. Therefore, if there is any system failure, Sam Adams should fix the problem as early as he can. That is why he wants to use an SMS Messaging service, which will alert him immediately by sending him an SMS message in the case of a system failure.
- Description
This use case describes the system administrator "Sam Adams", who is looking for an SMS Messaging service that he wants to build into his application.
- Actors, Roles and Goals
The name of the actor is "Sam Adams". His role is a system administrator at a bank. His goals are immediate service delivery, reliability of the service, and a low base fee and transaction fee.
- Storyboard
Step 1: Sam Adams knows the Service-Finder Portal, and he knows that he can find many useful services there, especially since he knows what he is looking for. Hence, he visits the Service-Finder Portal and starts to search by entering the keywords "SMS Alert".
Requirement 1: Search functionality
Step 2: The Service-Finder Portal returns a list of matching services. However, Sam wants to choose the number of matching services that will be displayed on one page. He would also expect some short information about the service functionality, the service provider, and the service availability, so that he can decide which service to read about further.
Requirement 2: Enable configurable pagination of the matching results and provide some short information for each service
Step 3: When Sam looks through the short information about the services displayed on the first page, he expects to find the most relevant services related to his request. After that, he would like to read more detailed information about a service to see whether it can provide the needed functionality.
Requirement 3: Rank the returned matching services and provide the ability to read more details of a service
Step 4: It may be the case that the returned matching services provide quite different functionalities or belong to different service categories; for example, some SMS messaging services alert users not through SMS but through voice messaging. For this reason, Sam would like to see other categories that may contain the services he wants, or services of other categories in which he is also interested (like "SMS Messaging"). Besides, another possible way is that Sam can further refine his search by browsing through categories.
Requirement 4: Provide service categories and allow the user to list all services belonging to a specific category; if possible, also allow the user to browse through categories
Step 5: When Sam has got all the services that could provide an SMS messaging service via the methods described in Step 4, he now wants to look for the services offered by an Austrian provider, with no base fees if possible.
Requirement 5: Faceted search
Step 6: After Sam has got all these specific services, he would like to choose the services that provide high reliability.
Requirement 6: Sort functionality based on the user's choices
Step 7: Sam now expects to compare the service availability promised by the service provider with the availability actually provided; this should be contained in the services' details. There also needs to be service coverage information, so that Sam can know whether a service covers the areas where he lives and works. Moreover, Sam would also like to compare these services in other ways, for instance by putting some services into a structured table to compare the transaction fees.
Requirement 7: A side-by-side comparison table for services, and functionality that enables the user to select the services he wants to compare
Step 8: At last, Sam wants to know whether the service providers offer a free trial of the services, so that he can test the service functionality.
Requirement 8: If possible, display a note offering free service trials
2.1.2 Architecture Plan for the Service-Finder Project
The architecture plan of the Service-Finder project contains five basic components: Service Crawler, Automatic Annotator, Conceptual Indexer and Matcher, Cluster Engine, and Service-Finder Portal Interface. Figure 2-1 presents a high-level overview of the components and the data flow among them.
Figure 2-1: Data flow of Service-Finder and its components [3]
2.1.2.1 The Principle of the Service Crawler Component
The Service Crawler is responsible for gathering available services and their related information from the Web. The overall cycle is depicted as follows:
(1) A Web developer publishes a Web Service.
(2) The Crawling component begins to harvest the Web in order to identify Web Services, e.g. by their WSDL (Web Service Description Language) documents.
(3) The Crawler also searches for other related information as soon as a service is discovered.
(4) After each periodic interval, the Crawler produces a consistent snapshot of the relevant part of the Web.
Finally, the output of the Crawler is forwarded to the subsequent components for analyzing, indexing, and displaying.
2.1.2.2 The Principle of the Automatic Annotator Component
The Automatic Annotator receives the relevant data from the previous component and generates semantic service descriptions of the WSDL documents and their related information, based on the Service-Finder Ontology and the Service Category Ontology.
First, the two ontologies used throughout the whole process are briefly introduced [2]:
- Generic Service Ontology: an ontology used to describe the data objects, for example the services, the service providers, availability, payment modalities, and so on.
- Service Category Ontology: an ontology used to categorize the functionalities or applications of the services, for instance data verification, messaging, data storage, weather, etc.
Afterwards, the function of this component, with its input and output:
- Input:
  - Crawled data from the Service Crawler
  - Service-Finder ontologies
  - Feedback on or corrections of previous annotations
- Function:
  - Enrich the information about the service and extract semantic statements according to the Service-Finder ontologies, for example categorizing the service according to the Service Category Ontology
  - Determine whether a particular document is relevant or not through the Web link graph; if not, discard these irrelevant documents
  - Classify the pages into their genres, for instance pricing, user comments, FAQ, and so on
- Output:
  - Semantic annotations of the services
2.1.2.3 The Principle of the Conceptual Indexer and Matcher Component
The Conceptual Indexer and Matcher is more like a data store center that aims at storing all extracted information about the services and supplying users with the capability of retrieval and semantic query, for example the matchmaking between user requests and service offers, or retrieving user feedback on extracted annotations.
The function of this component, with its input and output:
- Input:
  - Semantic annotation data and full-text information obtained from the Automatic Annotator
  - Semantic annotation data and full-text information that come from the user interface
  - Cluster data from the user and service clustering component
- Function:
  - Store the semantic annotations received from the Automatic Annotator component and from the user interface
  - Store the cluster data procured through the clustering component
  - Store and index the textual descriptions offered by the Automatic Annotator component and the textual comments offered by users
  - Ontological querying of the semantic data in the data store center
  - Combined keyword and ontological querying for user queries
  - Provide a list of similar services for a given service
- Output:
  - A list of matching services queried by users; in particular, these services should be sorted by ranking and should be iterable
  - All available data related to a particular entity must be retrievable at the user interface
2.1.2.4 The Principle of the Service-Finder Portal Interface Component
The Service-Finder Portal Interface is the main entry point provided for users of the Service-Finder system to search and browse the data managed by the Conceptual Indexer and Matcher component. In addition, users can also contribute information by providing tags, comments, categorizations, and ratings for the data browsed. Furthermore, developers can directly invoke the Service-Finder functionalities from their custom applications via an API.
The details of this component's function, input, and output are as follows:
- Input:
  - A list of ordered services for a query
  - Detailed information about a service or a set of services and a service provider
  - Query access to the service category ontology and the most used tags provided by the users
  - Service availability information
- Function:
  - The Web interface allows users to search services by keyword, tag, or concept in the categorization; sort and filter query results by refining the query; compare and bookmark services; and try out the services that offer this functionality
  - The API allows developers to invoke Service-Finder functionalities
- Output:
  - Explicit user annotations such as tags, ratings, comments, descriptions, and so on
  - Implicit user data, for example click streams of users, bookmarks, comparisons, links sent, etc.
  - Manual advertising of available new services
2.1.2.5 The Principle of the Cluster Engine Component
The Cluster Engine is in charge of collecting and analyzing information about user behavior from the Service-Finder Portal, e.g. the queried services and the compared services of the users. Moreover, it also provides cluster data to the Conceptual Indexer and Matcher for providing service recommendations.
This component's function, input, and output in detail:
- Input:
  - Service annotation data, both extracted and from user feedback
  - Users' click streams, used for extracting user behavior
- Function:
  - Obtain user clusters from user behavior
  - Obtain service clusters from service annotation data, to enable finding similar services
- Output:
  - Clusters of users and services
2.2 Information Extraction
Due to the rapid development and use of the World-Wide Web, there is a huge amount of information sources on the Internet, access to which by browsing and searching has been limited by the heterogeneity and the lack of structure of Web information sources. Therefore, Information Extraction, which transforms Web pages into program-friendly structures for post-processing, has become a great necessity. The task of Information Extraction is specified in terms of its inputs and its extraction targets, and the technique used in the process of Information Extraction is called an extractor.
2.2.1 Input Types of Information Extraction
Generally speaking, there are three different input types. The first input type is the unstructured document, for example the free text shown in figure 2-2; it is written in natural language, so it requires substantial natural language processing. The second input type is the structured document, for instance XML documents, whose data can be described through an available DTD (Document Type Definition) or XML (eXtensible Markup Language) schema. The third input type is the semi-structured document, which is widespread on the Web, such as the large volume of HTML pages with tables, itemized lists, and enumerated lists; HTML tags are often used to render such embedded data in HTML pages (see figure 2-3).
Figure 2-2: Left is the free text input type and right is its output [4]
Figure 2-3: A semi-structured page containing data records (in rectangular boxes) to be extracted [4]
In this way, inputs of the semi-structured type can be seen as documents with a fairly regular structure, and the data of these documents can be displayed in an HTML or a non-HTML format. Besides, since the Web pages of the Deep Web are dynamic and generated from structured databases by means of templates or layouts, the Deep Web can be considered one of the input sources that provide such semi-structured documents. For example, the authors, prices, and comments of the book pages provided by Amazon have the same layout, because these Web pages are generated from the same database and rendered with the same template or layout. Furthermore, there is another option: manually generated HTML pages of the semi-structured type. For example, although the publication lists provided on different researchers' homepages are produced by diverse users, they all have a title and a source property for every single paper. Eventually, the inputs for Information Extraction can also be pages of the same class or pages from various Web Service Registries.
2.2.2 Extraction Targets of Information Extraction
Regarding the task of Information Extraction, the extraction target has to be considered. There are two different extraction targets. The first one is a relation of k-tuples, where k is the number of attributes in a record; in some cases an attribute of a record may have no instantiation, while in others an attribute may own multiple instantiations. The second extraction target is a complex object with hierarchically organized data. Though the ways of depicting extraction targets in a page are diverse, the most common structure is a hierarchical tree. Such a hierarchical tree may contain only one leaf node, or one or more lists of leaf nodes attached to internal nodes. The structure of a data object may also be flat or nested. In brief, if the structure is flat, then there is only one leaf node, which can also be called the root; if it is a nested structure, then the internal nodes involved in this data object span more than two levels.
Furthermore, in order to make Web pages readable for human beings and easier to visualize, the tables or tuples of the same list, or the elements of a tuple, should be clearly isolated or demarcated. However, the display of a data object in a Web page is affected by the following conditions [4]:
- The attribute of a data object has zero or several values:
(1) If there is no value for an attribute of a data object, this attribute is called a "none" attribute. For example, a special offer only available for certain books might be a "none" attribute.
(2) If there is more than one value for an attribute of a data object, it is called a "multiValue" attribute. For instance, the name of the author of a book could be a "multiValue" attribute.
- The set of attributes (A1, A2, A3, ...) has multiple orderings:
That is to say, among this set of attributes, the position of an attribute might change according to the diverse instances of a data object. Such an attribute is called a "multiOrdering" attribute. For instance, for movies before the year 1999, a movie site would list the release date in front of the movie's title, while for movies from 1999 onwards (including 1999) it lists the release date behind the movie's title.
- The attribute has different formats:
This means the display format of the data object can be completely distinct for different instances. Therefore, if the format of an attribute is free, a lot of rules are needed to deal with all possible cases. This kind of attribute is called a "multiFormat" attribute. For example, an e-commerce Web site may use a bold font to present the regular prices while using a red color to display the sale prices. There is also the opposite situation, where different attributes of a data object have the same format; for example, various attributes are presented using <TD> tags in a table presentation. Attributes like these can be differentiated by means of their order information. However, for cases where a "none" attribute occurs or "multiOrdering" attributes exist, the rules for extracting these attributes have to be revised.
- The attribute cannot be decomposed:
For easier processing, the input documents are sometimes treated as strings of tokens instead of strings of characters. However, some attributes cannot be decomposed into several individual tokens. These attributes are called "untokenized" attributes. For example, in college course catalogue entries like "COMP4016" or "GEOL2001", the department code and the course number cannot be separated into two different strings of characters such as "COMP" and "4016" or "GEOL" and "2001".
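As a small illustration of these attribute variations, an extracted book record might be represented as follows; this sketch is illustrative only, and all field names and values are invented:

    # A flat extracted record: 'authors' is a "multiValue" attribute,
    # while 'special_offer' is a "none" attribute for this instance
    book = {
        'title': 'Example Book',
        'authors': ['First Author', 'Second Author'],  # multiValue
        'price': '29.99 EUR',
        'special_offer': None,                         # none attribute
    }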
2.2.3 Techniques Used in Information Extraction
The extractor used in the process of Information Extraction aims at providing a single uniform query interface to access information sources like database servers and Web servers. It consists of the following phases: collecting returned Web pages, labeling these Web pages, generalizing extraction rules, extracting the relevant data, and outputting the result in an appropriate format (XML format or a relational database) for further information integration. For example, at first the extractor queries the Web server to gather the returned pages through the HTTP protocol; after that it starts to extract the contents of these HTML documents and integrates them with other data sources thereafter. The whole process of the extractor follows the steps below:
- Step 1:
At the beginning, the input has to be tokenized. There are two different granularities for input string tokenization: tag-level encoding and word-level encoding. Tag-level encoding transforms the tags of an HTML page into general tokens, while transforming every text string between two tags into a special token. Word-level encoding does this in another way: it treats each word in a document as a token.
- Step 2:
Next, the extraction rules are applied for every attribute of the data object in the Web pages. These extraction rules can be induced in terms of top-down or bottom-up generalization, pattern mining, or logic programming. In addition, the type of extraction rules may be expressed by means of regular grammars or logic rules. For example, some use path expressions of the HTML parse tree, like html/head/title or html->table[0]; some use syntactic or semantic constraints; and some use delimiter-based constraints such as HTML tags or literal words.
- Step 3:
After that, all the extracted data are assembled into records.
- Step 4:
Finally, this process is iterated until all data objects in the input are processed.
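To make the two tokenization granularities of Step 1 concrete, the following hedged Python sketch encodes a small, invented HTML fragment at tag level and at word level (the TEXT(...) token name is an illustrative convention, not a standard notation):

    import re

    HTML = '<html><body><b>BLZService</b> checks bank codes</body></html>'

    def tag_level_tokens(html):
        # Tags become general tokens; every text string between two tags
        # is collapsed into a single special TEXT(...) token
        tokens = []
        for piece in re.findall(r'<[^>]+>|[^<]+', html):
            if piece.startswith('<'):
                tokens.append(piece)
            elif piece.strip():
                tokens.append('TEXT(%s)' % piece.strip())
        return tokens

    def word_level_tokens(html):
        # Each word of the document is treated as a token of its own
        tokens = []
        for piece in re.findall(r'<[^>]+>|[^<]+', html):
            if piece.startswith('<'):
                tokens.append(piece)
            else:
                tokens.extend(piece.split())
        return tokens

    print(tag_level_tokens(HTML))   # ... 'TEXT(checks bank codes)' ...
    print(word_level_tokens(HTML))  # ... 'checks', 'bank', 'codes' ...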
2.3 Pica-Pica Web Service Description Crawler
Pica pica is known as a bird species, also called magpie. Here, however, Pica-Pica is a Web Service Description Crawler which is designed to investigate the quality of Web Service descriptions, for example the descriptive quality of the offered Web Services and how well these Web Services are described in today's Web Service Registries.
2.3.1 Required Libraries of the Pica-Pica Web Service Description Crawler
This version of the Pica-Pica Web Service Description Crawler was written by Anton Caceres and Josef Spillner and is programmed in the Python language. In order to run the scripts that parse the HTML pages, it needs two additional libraries: Beautiful Soup and html5lib.
- Beautiful Soup
It is an HTML/XML parser for the Python language, and it can even turn invalid markup into a parse tree [5]. Moreover, the following three features make it more powerful:
  - Bad markup does not choke Beautiful Soup; in fact, it generates a parse tree that makes approximately as much sense as the original document, so you can obtain the data that you want.
  - Beautiful Soup provides a toolkit with simple, idiomatic methods for navigating, searching, and modifying the parse tree; hence you don't need to create a custom parser for every application.
  - Beautiful Soup converts documents to Unicode and can emit them as UTF-8 automatically; if the original document does not specify an encoding, you just have to specify it yourself.
Furthermore, the ways of including Beautiful Soup in an application are the following [5]:
    from BeautifulSoup import BeautifulSoup          # for processing HTML
    from BeautifulSoup import BeautifulStoneSoup     # for processing XML
    import BeautifulSoup                             # to get everything
- html5lib
It is a Python package which implements the HTML5 [8] parsing algorithm. In order to gain maximum compatibility with the current major desktop Web browsers, this implementation is based on the WHATWG (Web Hypertext Application Technology Working Group) HTML5 specification.
A short side-by-side usage sketch of both libraries follows.
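The sketch below is hedged: the markup fragment and link targets are invented, the Beautiful Soup calls follow the version 3 API matching the import lines above, and the html5lib call follows a recent release of the package (the exact 2010-era API may differ slightly):

    import html5lib
    from BeautifulSoup import BeautifulSoup  # Beautiful Soup 3 API

    # Deliberately sloppy markup: the second <a> element is never closed
    html = '<a href="/service/1">BLZService</a><a href="/service/2">Weather'

    # Beautiful Soup still builds a usable parse tree; findAll() and
    # dictionary-style attribute access search and navigate it
    for anchor in BeautifulSoup(html).findAll('a'):
        print(anchor['href'])  # /service/1, then /service/2

    # html5lib repairs the markup the way a WHATWG-conformant browser would
    doc = html5lib.parse(html, treebuilder='etree', namespaceHTMLElements=False)
    print([a.get('href') for a in doc.findall('.//a')])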
2.3.2 Architecture of the Pica-Pica Web Service Description Crawler
Figure 2-4: Architecture of the Pica-Pica Web Service Description Crawler
Figure 2-4 illustrates the primary architecture of the Pica-Pica Web Service Description Crawler. It includes four fundamental components: the Service Page Grabber component, the WSDL Grabber component, the Property Grabber component, and the WSML Register component.
(1) The Service Page Grabber component takes the URL seed as input and outputs the links of the service pages to the following two components: the WSDL Grabber component and the Property Grabber component.
(2) The WSDL Grabber component is responsible for obtaining the WSDL document based on the delivered service page link and then checking the validity of the obtained WSDL document. Finally, only the valid WSDL documents are passed to the WSML Register component for further processing.
(3) The Property Grabber component tries to extract the service's properties hosted in the service page, if any exist. After that, all these service properties are saved into an INI file as the information of that service.
(4) The functionality of the WSML Register component is to write appropriate WSML documents by means of the valid WSDL documents delivered from the WSDL Grabber component and the optional INI files delivered from the Property Grabber component, and afterwards to register them in ConQo.
- WSML [9]
It stands for Web Service Modeling Language, which provides a framework with different language variants. Hence it is often used to describe the different aspects of semantic Web Services according to the conceptual model of WSMO.
- WSMO [10]
WSMO, whose full name is Web Service Modeling Ontology, is dedicated to describing various aspects of Semantic Web Services based on ontologies. An ontology is a formal, explicit specification of a shared conceptualization. In fact, these ontologies are the sticking points that link the agreement of the communities of users with the defined conceptual semantics of the real world.
- ConQo [11]
It is a discovery framework that considers not only the Quality of Service (QoS) but also context information. It uses a Web Service repository to manage the service descriptions, which are based on WSML.
2.3.3 Implementation of the Pica-Pica Web Service Description Crawler
This section describes the implementation of the Pica-Pica Web Service Description Crawler in detail.
(1) To start the whole crawling process of the Pica-Pica Web Service Description Crawler, an input is needed as the initial seed. In this crawler there are five Web Service Registries, listed below. The URL addresses of these five Web Service Registries are used as the input seeds for the Pica-Pica Web Service Description Crawler. Moreover, in this version of the crawler there is a single Python script for each Web Service Registry, and the crawling processes of these per-registry Python scripts are executed one after another.
Biocatalogue: http://www.biocatalogue.com
Ebi: http://www.ebi.ac.uk
Seekda: http://www.seekda.com
Service-Repository: http://www.service-repository.com
Xmethods: http://www.xmethods.net
(2) After being fed with the input seed, the crawler steps into the next component, the Service Page Grabber. At first, this component tries to read the data from the Web based on the input seed. Then it establishes a parse tree of the read data by means of the functions of the Beautiful Soup library. After that, the Service Page Grabber component starts to look for the service page link of each service published in the Web Service Registry by means of the functions in the html5lib library. When the service page link of a single service is found, the component first checks whether this service page link is valid or not. Once the service page link is valid, it is passed into the following two components for further processing: the WSDL Grabber component and the Property Grabber component.
(3) When the WSDL Grabber component receives a service page link from the previous component, it sets out to extract the WSDL link address of that service through the parse tree of the data in the service page. Next, this component downloads the WSDL document of that service via the WSDL link address, and the obtained WSDL document is stored on disk. The process of the WSDL Grabber component is carried on continually until no more service links are passed to it. Certainly, not all grabbed WSDL documents are valid: they may contain bad definitions or a bad namespace URI, be an empty document, or, even worse, not be in XML format at all. Hence, in order to pick them out, this component further analyzes the obtained WSDL documents, then puts all valid documents into a "validWSDLs" folder, whereas the invalid documents are put into a folder named "invalidWSDLs" in order to gather statistical information (a sketch of such a validity check is given at the end of this section). Finally, only the WSDL documents in the "validWSDLs" folder are passed to the subsequent component.
(4) Moreover, since some Web Service Registries give additional information about the services, such as availability, service provider, or version, the Property Grabber component sets out to extract this information as the service's properties and thereafter saves these properties into an INI file (see the sketch at the end of this section). However, if no additional information is available, there is no need to extract service properties, and thus there is no INI file for that service. In this implementation of the Pica-Pica Web Service Description Crawler, only the Python scripts for the Seekda and Service-Repository Web Service Registries have functions to extract the services' properties, while the scripts for the other three Web Service Registries have no such function.
(5) Furthermore, it is optional to create a report file which contains the statistical information of this process, such as the total number of services in one Web Service Registry, the number of services whose WSDL document is invalid, etc.
(6) As stated above, at this point there is a folder with all valid WSDL documents and possibly some INI files. Therefore, the task of the WSML Register component is now to generate the appropriate WSML documents from these valid WSDL documents and INI files, and then register them in ConQo.
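To make steps (3) and (4) more concrete, the following is a hedged Python sketch, not the actual Pica-Pica code: it assumes that "valid" means well-formed XML whose root is a WSDL definitions element, and the INI property names are invented:

    import os
    import shutil
    import ConfigParser  # Python 2, matching the crawler's era
    import xml.etree.ElementTree as ET

    WSDL_NS = 'http://schemas.xmlsoap.org/wsdl/'

    def is_valid_wsdl(path):
        # Step (3): reject empty files, non-XML content, and XML whose
        # root element is not <definitions> in the WSDL namespace
        if os.path.getsize(path) == 0:
            return False
        try:
            root = ET.parse(path).getroot()
        except ET.ParseError:
            return False
        return root.tag == '{%s}definitions' % WSDL_NS

    def sort_wsdl(path):
        # Step (3): move the document into validWSDLs/ or invalidWSDLs/
        folder = 'validWSDLs' if is_valid_wsdl(path) else 'invalidWSDLs'
        if not os.path.isdir(folder):
            os.makedirs(folder)
        shutil.move(path, os.path.join(folder, os.path.basename(path)))

    def write_properties(service_name, properties):
        # Step (4): save the extracted service properties as an INI file
        ini = ConfigParser.RawConfigParser()
        ini.add_section('service')
        for key, value in properties.items():
            ini.set('service', key, value)
        with open(service_name + '.ini', 'w') as ini_file:
            ini.write(ini_file)

    write_properties('BLZService', {'provider': 'example.org', 'availability': '99%'})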
2.4 Conclusions on the Existing Strategies
This chapter presented three existing strategies: the Service-Finder project, the Information Extraction technique, and the Pica-Pica Web Service Description Crawler.
The task of this master program is to obtain the available Web Services and their related information from the Web. This is a procedure of extracting the needed information about each service, such as the service's WSDL document and its properties. Therefore, the Information Extraction technique, which is used to extract information hosted on the Web, can be used in this master program.
Moreover, the Service-Finder is a large project that is not only able to obtain the available Web Services and their related information from the Web, but also enriches the crawled data with annotations by means of the Service-Finder ontology and the Service Category ontology and integrates all the information into a coherent semantic model. Furthermore, it also provides capabilities for searching and browsing the data through a user interface and gives users service recommendations. However, for a master program the Service-Finder project far exceeds the requirements; therefore, it is just considered as a reference for this master program.
Furthermore, since the Pica-Pica Web Service Description Crawler aims only at obtaining the available Web Services and their related information, it fulfills the primary task of this master program. Nevertheless, regarding the information about the service, the Pica-Pica Web Service Description Crawler extracts only a few properties, sometimes even none. Consequently, in order to improve the quality of the service information, as many properties about each service as possible have to be extracted. Hence, chapter 3 presents an extension of this Pica-Pica Web Service Description Crawler.
3 Design and Implementation
In the previous State of the Art chapter, the already existing techniques and implementations were presented. In the following, the basic principle of the proposed approach, the Deep Web Services Crawler, is introduced; it is based on these previously existing techniques, especially the Pica-Pica Web Service Description Crawler.
3.1 Deep Web Services Crawler Requirements
This section mainly talks about the goals of the Deep Web Service Crawler approach, the system requirements of the approach, and some non-functional requirements.
3.1.1 Basic Requirements for the DWSC
The following are the basic requirements which should be achieved:
(1) Produce the largest annotated Service Catalogue. Here, the Service Catalogue is a list of the Web Services published in a Web Service Registry. It contains not only the WSDL document of each service but also a file with the information about that service. Therefore, for the purpose of producing the largest annotated Service Catalogue, the proposed approach needs to extract as many properties about these Web Services as possible. Moreover, it also has to download the WSDL document hosted along with each Web Service. That is to say, these properties comprise not only the interesting structured properties, such as the service name and its WSDL link address, but also other non-functional properties, for example the endpoint and its monitoring information.
(2) Flexible storage: each service has a Service Catalogue that contains all its interesting properties. How to deal with these service properties, i.e. what kinds of schemes will be used to store them, is a significant problem. Hence, in order to store them in a flexible way, the proposed approach provides three methods for storage: the first stores them as an XML file, the second stores them in an INI file, and the third uses a database for storage. A sketch of these storage options follows below.
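The following is a hedged outline of two of the three storage options behind simple functions (the INI variant would look like the Pica-Pica sketch in section 2.3.3). The actual implementation is written in Java; the field names and database schema here are invented:

    import sqlite3
    import xml.etree.ElementTree as ET

    def store_as_xml(props, path):
        # Option 1: one <service> element with one child per property
        root = ET.Element('service')
        for key, value in props.items():
            ET.SubElement(root, key).text = value
        ET.ElementTree(root).write(path)

    def store_in_db(props, db_path):
        # Option 3: one name/value row per property; a real schema
        # would likely hold one row per service instead
        db = sqlite3.connect(db_path)
        db.execute('CREATE TABLE IF NOT EXISTS properties (name TEXT, value TEXT)')
        db.executemany('INSERT INTO properties VALUES (?, ?)', list(props.items()))
        db.commit()
        db.close()

    service = {'name': 'BLZService', 'wsdl': 'http://example.org/blz?wsdl'}
    store_as_xml(service, 'BLZService.xml')
    store_in_db(service, 'services.db')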
3.1.2 System Requirements for the DWSC
Generally speaking, the requirements for realizing a programming project include the following:
1) Operating system: Linux/Unix, Windows (XP, Vista, 2000, etc.)
2) Programming language: C++, Java, Python, C, etc.
3) Programming tool: NetBeans, Eclipse, Visual Studio, and so on
In this master thesis, however, the scripts of the Deep Web Service Crawler approach are written in the Java programming language. Besides, these code scripts have only been tested on the Windows XP and Linux operating systems, and have not been tested on other operating systems.
3.1.3 Non-Functional Requirements for the DWSC
In this part, several non-functional requirements for the Deep Web Service Crawler approach are presented:
1) Transparency: the process of data exploration and data storage should be done automatically, without the user's intervention. However, at first the user should specify the path on the hard disk which will be used to store the outputs of this program.
2) Fault tolerance: during the execution of this program, errors can inevitably happen. Therefore, in order to keep the process from being interrupted, there must be the necessary error handling for process recovery (a small sketch follows at the end of this section).
3) Completeness: this approach should extract the interesting properties of each Web Service as completely as possible, e.g. endpoint, monitoring information, etc.
In addition, since the Pica-Pica Web Service Crawler has already implemented strategies for the following five URLs, the proposed approach must support at least these five Web Service Registries:
Biocatalogue: http://www.biocatalogue.com
Ebi: http://www.ebi.ac.uk
Seekda: http://www.seekda.com
Service-Repository: http://www.service-repository.com
Xmethods: http://www.xmethods.net
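As an illustration of the fault-tolerance requirement above, here is a hedged sketch of a retrying page fetch; the retry count and delay are arbitrary assumptions, not values from the thesis:

    import time
    import urllib2  # Python 2; urllib.request in Python 3

    def fetch_with_retry(url, attempts=3, delay_seconds=5):
        # Retry transient network errors instead of letting a single
        # failed request interrupt the whole crawling process
        for attempt in range(attempts):
            try:
                return urllib2.urlopen(url).read()
            except IOError:
                if attempt == attempts - 1:
                    raise  # give up after the last attempt
                time.sleep(delay_seconds)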
3.2 Deep Web Services Crawler Architecture
This section first introduces an overview of the high-level architecture of the Deep Web Services Crawler approach. Thereafter, four subsections outline each single component and how they play together.
The current components and data flows of the Deep Web Service Crawler are summarized in figure 3-1, using the continuous arrows. The crawler first tries to obtain the available service list page links and related service page links by crawling the Web (Web Service Extractor). Then those gathered links are processed in two separate steps: one gets the service's WSDL document (WSDL Grabber), and the other collects the properties of each service (Property Grabber). Finally, all these data are stored on the storage device (Storage). The whole process in figure 3-1 is illustrated in detail as follows:
- Step 1:
When the Deep Web Service Crawler starts to run, the File Chooser container requires the user to specify a path on the computer or on any other hard disk. The reason for specifying the path is that the Deep Web Service Crawler program needs a place to store all its outputs.
- Step 2:
After that, the Web Service Extractor is triggered. It is the main entry point to the specific crawling process. Since the Deep Web Service Crawler program is supposed to crawl for Web Services in given Web Service Registries, the URL addresses of these Web Service Registries are given as initial seeds for the Web Service Extractor process. However, since the page structures of these Web Service Registries are completely different, there is a separate process for each Web Service Registry.
Figure 3-1: Overview of the basic architecture of the Deep Web Services Crawler
- Step 3:
According to the given seed, two types of links are obtained by the Web Service Extractor component: one is the service list page link, and the other is the service page link. A service list page is a page that contains a list of Web Services and possibly some information about these Web Services, while a service page is a page that contains much more information about a single service. Finally, the extractor forwards these two types of links to the next two components: the Property Grabber and the WSDL Grabber.
- Step 4:
On the one hand, the Property Grabber component tries to gather the information about the service hosted in the service list page and the service page, such as the name of the service, its description, its ranking, etc. Finally, all the information about the service is collected together as the service properties, which are then delivered to the Storage component for further processing.
- Step 5:
On the other hand, the WSDL Grabber tries to obtain the WSDL link from the service list page or the service page. That is because for some Web Service Registries the WSDL link is hosted in the service list page, as in the Biocatalogue, while for other Web Service Registries it is hosted in the service page, as in Xmethods. After the WSDL link has been obtained, it is also transmitted to the Storage component for further processing.
- Step 6:
When the service properties and the WSDL link of the service are received by the Storage component, they are stored on disk. The service properties are stored in one of three different ways: as an XML file, as an INI file, or as one record inside a database table. For the WSDL link, the Storage component first tries to download the page content according to the URL address of the WSDL link; if this succeeds, the page content of the service is stored as a WSDL document on disk.
- Step 7:
Nevertheless, steps 3 to 6 describe the crawling process for only a single service. Hence, if there is more than one service, or more than one service list page, in a Web Service Registry, the crawling process from step 3 to step 6 is repeated again and again until there are no more services or service list pages in that Web Service Registry (this loop is sketched in code after this list).
- Step 8:
Furthermore, after the crawling process for one Web Service Registry finishes, a file is generated that contains some statistical information about this crawling process, for example the times when the crawling process of this Web Service Registry started and finished, the total number of Web Services in this Web Service Registry, how many services have an empty WSDL document, the average number of service properties in this Web Service Registry, and the average time cost for extracting service properties, getting the WSDL document, and generating the XML file, INI file, etc.
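The control flow of steps 3 to 7 can be summarized as two nested loops over service list pages and services. The following hedged Python sketch only outlines that flow; every helper is a stub standing in for a component described above, not the actual Java implementation:

    def service_list_pages(seed):
        # Stub for the Web Service Extractor's registry-specific
        # list-page discovery (see section 3.2.1)
        return [seed + '/services?page=1']

    def service_pages(list_page):
        # Stub: one internal link per service on the list page
        return [list_page + '#service-1', list_page + '#service-2']

    def grab_properties(list_page, service_page):
        return {'name': 'stub-service'}      # stub for the Property Grabber

    def grab_wsdl_link(list_page, service_page):
        return service_page + '?wsdl'        # stub for the WSDL Grabber

    def store(props, wsdl_link):
        pass                                 # stub for the Storage component

    def crawl_registry(seed):
        count = 0
        for list_page in service_list_pages(seed):          # step 3
            for page in service_pages(list_page):           # steps 3 and 7
                store(grab_properties(list_page, page),     # steps 4 and 6
                      grab_wsdl_link(list_page, page))      # steps 5 and 6
                count += 1
        return count                                        # feeds step 8

    print(crawl_registry('http://www.example-registry.com'))  # -> 2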
3.2.1 The Function of the Web Service Extractor Component
The Web Service Extractor component is responsible for gathering service list page links and service page links. It pursues a focused crawl of the Web and only forwards service list page and service page links to the subsequent components for analyzing, collecting, and gathering purposes. Therefore, it identifies both the service list page links and the related service page links on these Web Service Registries.
As can be seen from figure 3-2, a crawl for Web Services needs to start from a seed URL. The seed is almost as important as the Web Service Extractor itself, as it highly influences the part of the Web which needs to be crawled. The seed can or shall contain, e.g., Web pages where these Web Services are published or which talk about Web Services.
Figure 3-2: Overview of the process flow of the Web Service Extractor component
After being fed with the seed URL, the Web Service Extractor component starts to get the link of the service list page from the initial page of this URL seed. However, this process differs among the five Web Service Registries, as shown in the following:
- Service-Repository Web Service Registry:
In this Web Service Registry, the link of the first service list page is the URL address of its seed, which means some Web Services can already be found on the home page of the Service-Repository Web Service Registry. The "first" here implies that there is more than one service list page link in this Web Service Registry; therefore, the process of getting service list page links in this registry is carried on continually until no more service list page links exist (this paginated discovery is sketched in code after this list).
- Xmethods Web Service Registry:
Although there are Web Services on the home page of the Xmethods Web Service Registry, they are only a small subset of those in this Web Service Registry. Moreover, in the Xmethods Web Service Registry there is a single page containing all Web Services; therefore, the service list page link of that page has to be obtained.
- Ebi Web Service Registry:
The situation in the Ebi Web Service Registry is a little like that in the Xmethods Web Service Registry. That is to say, there is also one page that contains all Web Services in this Web Service Registry. However, this page is not the initial page of the input seed; therefore, more than one operation step is needed to get the service list page link of that page.
- Seekda Web Service Registry:
In the Seekda Web Service Registry, the Web Services are not contained in the initial page of the input seed. The service list page link can be obtained after several additional operation steps. However, there is a problem with getting the service list page links in this registry: simply put, if there is more than one page containing the Web Services, then, for some unknown reason, the links of the remaining service list pages cannot be obtained. In other words, only the link of the first service list page can be obtained.
- Biocatalogue Web Service Registry:
The process of getting the service list pages in the Biocatalogue Web Service Registry is almost the same as in the Seekda Web Service Registry. The only difference is that in the Biocatalogue Web Service Registry all the service list page links can be obtained if there is more than one service list page.
After getting the link of a service list page, the Web Service Extractor begins to get the link of the service page for each service listed in the service list page. The reason why it can do this is that there is an internal link for every service which leads to the service page. It is worth noting that once a service page link is obtained, this component immediately forwards the service page link and the service list page link to the subsequent two components for further processing. Nevertheless, the process of obtaining service page links is carried out continuously until all services listed in that service list page are crawled. Analogously, the process of getting service list pages is also carried out continuously until no more service list pages exist.
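For a Service-Repository-style registry, where the seed itself is the first service list page, the paginated discovery described above could look like the following hedged sketch; the 'Next' anchor text is an invented marker, and the calls follow the Beautiful Soup 3 API used by Pica-Pica (the actual implementation is in Java):

    import urllib2
    from BeautifulSoup import BeautifulSoup

    def service_list_pages(seed_url):
        # The seed is the first service list page; keep following a
        # 'next page' link until no more service list page exists
        url = seed_url
        while url:
            soup = BeautifulSoup(urllib2.urlopen(url).read())
            yield url, soup
            marker = soup.find(text='Next')  # BS3 text search, invented marker
            url = marker.parent.get('href') if marker else None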
3211 Features of the Web Service Extractor Component
The main features are described in the following paragraphs.
1) Obtain service list page links
A central task is to obtain the corresponding service list page links. A service list page is addressed by a URL and contains a publicly available list of Web Services together with brief information about them, such as the name of the service, an internal URL that links to another page with detailed information about that service, and sometimes the link address of the WSDL document.
2) Obtain service page links
Once service list page links are found, a crucial aspect is to extract the internal link of each service in order to aid the task of service information discovery. Therefore, it is the task of the Web Service Extractor to harvest the HTML content of the service list page, so that the service page links, which lead to much more detailed information about the single Web service, can be obtained.
3.2.1.2 Input of the Web Service Extractor Component
This component depends on specific input seeds. The only input required for this component is a URL seed. This URL seed is one of the URLs listed in section 3.1.3.
3.2.1.3 Output of the Web Service Extractor Component
The component produces two kinds of service-related page links from the Web:
- Service list page links
- Service page links
3.2.1.4 Demonstration of the Web Service Extractor Component
In order to provide a comprehensive understanding of the process of the Web Service Extractor component, the following figures give an explanation. Though there are five URL addresses in this section, only the URL of the Service-Repository is shown as an example.
1) The input seed is the initial URL address of the Service-Repository, which is "http://www.service-repository.com".
2) As already said in section 3.2.1, the first service list page link of this Web Service Registry is its input seed "http://www.service-repository.com". Figure 3-3 shows the corresponding service list page of that link.
Figure 3-3 Service list page of the Service-Repository
Figure 3-4 Original source code of the internal link for the Web service "BLZService"
Figure 3-5 Code overview of getting the service page link in the Service-Repository
Figure 3-6 Service page of the Web service "BLZService"
3) Now that the service list page link is already known, the next step is to acquire the service page links of the services listed on the service list page. The text in the red box of figure 3-4 shows the internal link of the Web service "BLZService". However, this is not the complete link of the service page: it has to be prefixed with the initial URL address of the Service-Repository Web Service Registry, "http://www.service-repository.com". The code for getting the service page link of a Web service in the Service-Repository is shown in figure 3-5. Therefore, the final link of this service page is "http://www.service-repository.com/service/overview-210897616". Figure 3-6 shows the corresponding service page of that link.
4) Afterwards, the two links gathered by the Web Service Extractor component, the service list page link and the service page link, are immediately forwarded to the next two components, the WSDL Grabber component and the Property Grabber component.
3.2.2 The Function of the WSDL Grabber Component
The WSDL Grabber component acquires the WSDL link hosted on the Web, based on the service list page link or the service page link. The whole process flow is illustrated in figure 3-7.
Figure 3-7 Overview of the process flow of the WSDL Grabber Component
When the WSDL Grabber component receives the inputs delivered by the previous component, it starts to get the WSDL link for the service based on these inputs. However, although the inputs of the WSDL Grabber component are the links of the service page and the service list page, only one of them contains the WSDL link for the corresponding service. That is to say, the WSDL link exists either in the service page or in the service list page. The reason why both links need to be delivered into this component is that exactly one of the five Web Service Registries, namely Biocatalogue, hosts the WSDL link in the service list page, while for the other four Web Service Registries the WSDL link is hosted in the service page. Therefore, the WSDL link for those four Web Service Registries is obtained by means of the service page link, and for the Biocatalogue Web Service Registry the WSDL link is obtained through the service list page link. However, there is a problem with getting the WSDL links of the Biocatalogue Web Service Registry. In brief, some of the Web services listed on the service list page of the Biocatalogue Web Service Registry have no WSDL link; in other words, these services have no WSDL document. In a situation like this, the WSDL link of such a Web service is assigned a "NULL" value. Nevertheless, for the Web Services in the other four Web Service Registries the WSDL link always exists in the service page. Eventually, whenever the WSDL Grabber component extracts the WSDL link of a single Web service, the link is immediately forwarded to the Storage component for downloading the WSDL document at once.
3.2.2.1 Features of the WSDL Grabber Component
The WSDL Grabber component has to provide the following feature:
- Obtain WSDL links
The WSDL link is the direct way to get to the page that contains the contents of the WSDL document. It is actually a URL address, but its ending contains something like "wsdl" or "WSDL" to indicate that it addresses the page of a WSDL document.
3.2.2.2 Input of the WSDL Grabber Component
This component requires the following input data:
- Service list page link
- Service page link
3.2.2.3 Output of the WSDL Grabber Component
The component only produces the following output data:
- The URL address of the WSDL link of each service
3.2.2.4 Demonstration of the WSDL Grabber Component
This section presents a list of figures in order to give a comprehensive understanding of the process of the WSDL Grabber component. It uses the same Web service of the Service-Repository as an example, too.
1) The input of this WSDL Grabber component is the link of the service page obtained by the Web Service Extractor component. The address of this link is "http://www.service-repository.com/service/overview-210897616".
Figure 3-8 WSDL link of the Web service "BLZService" in the service page
2) Figure 3-9 shows the corresponding original source code of the WSDL link that is highlighted in figure 3-8.
Figure 3-9 Original source code of the WSDL link for the Web service "BLZService"
3) Figure 3-10 and figure 3-11 show the code used to extract the WSDL link shown in figure 3-9. Note that figure 3-10 is the particular code for the Service-Repository Web Service Registry only; for the other four Web Service Registries it is different. The function "getServiceRepositoryWSDLLink" first gets a list of nodes that have the same HTML tag name "b". It then checks these nodes one by one to see if the text value of a node is "WSDL". If a node fulfills this condition, the attribute value of its sibling, which here is an "a" element, is extracted as the value of the WSDL link of this Web service.
Figure 3-10 Code overview of the "getServiceRepositoryWSDLLink" function
Figure 3-11 Code overview of the "oneParameter" function
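For illustration, the logic described for "getServiceRepositoryWSDLLink" could be re-implemented roughly as follows; the jsoup HTML parser is an assumption here (the code in figures 3-10 and 3-11 may well use a different parser), and the lookup is merely a sketch of the described tag-and-sibling search:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class WsdlLinkSketch {
        // Sketch of the described logic: find a <b> node whose text is
        // "WSDL" and take the href of its sibling <a> element.
        static String getServiceRepositoryWSDLLink(Document page) {
            for (Element b : page.select("b")) {
                if ("WSDL".equals(b.text().trim())) {
                    Element sibling = b.nextElementSibling();
                    if (sibling != null && "a".equals(sibling.tagName())) {
                        return sibling.attr("abs:href");
                    }
                }
            }
            return null; // no WSDL link found on this page
        }

        public static void main(String[] args) throws Exception {
            Document page = Jsoup.connect(
                    "http://www.service-repository.com/service/overview-210897616").get();
            System.out.println(getServiceRepositoryWSDLLink(page));
        }
    }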
4) Finally, after applying the functions shown in figures 3-10 and 3-11, the WSDL link of the Web service "BLZService" can be obtained. Its value is "http://services.unitedplanet.de/blz/BlzService.asmx?WSDL".
3.2.3 The Function of the Property Grabber Component
The Property Grabber component is a module used to extract and gather the Web service information hosted on the Web, which is in fact the information shown in the service list page and the service page. In the end, all the obtained Web service information is collected together as the service properties, which are delivered to the Storage component for storing. The detailed process flow of the Property Grabber component is illustrated in figure 3-12.
The inputs of the Property Grabber component are the same as the inputs of the WSDL Grabber component. However, there is still a small difference between them with respect to the seed. As already mentioned in section 3.2.2, for the WSDL Grabber component only one of the inputs is sufficient to get the WSDL link, whereas the Property Grabber component needs both of them as seeds. Once the Property Grabber component receives the needed inputs, it starts to extract the information of that single Web service.
Figure 3-12 Overview of the process flow of the Property Grabber Component
After the Property Grabber component receives the inputs, it starts to extract the service information of the Web service. Generally speaking, the service information consists of four aspects, namely structured information, endpoint information, monitoring information and whois information.
(1) Structured Information
The structured information can be obtained by extracting the information hosted in the service page and the service list page. It is the basic descriptive information about the service, such as the name of the service, the URL address through which the WSDL document can be obtained, the description introducing the service, the provider who offers the service, its rating, the server that owns the service, etc. However, the elements constituting this structured information differ among the Web Service Registries. For example, the rating information of a Web service exists in the Service-Repository Web Service Registry, while the Xmethods Web Service Registry does not have this information. In addition, even for Web services in the same Web Service Registry, some elements of the structured information may not exist. For instance, one service in a Web Service Registry may have a description while another service in the same Web Service Registry does not. Tables 3-1 to 3-5 show the structured information that should be extracted from the five Web Service Registries. Moreover, if the style of a service in the Biocatalogue Web Service Registry is SOAP, there is some additional information describing the SOAP operations of this service, while if it is REST, this additional information concerns the REST operations. These should also be considered a part of the structured information. Table 3-6 and table 3-7 list the information for these two kinds of operations.
Service Name, WSDL Link, WSDL Version, Provider, Server, Rating, Homepage, Owner Homepage, Description
Table 3-1 Structured Information of the Service-Repository Web Service Registry
Service Name, WSDL Link, Provider, Service Style, Homepage, Implementation Language, Description, User Description, Contributed Client Name, Type of this Client, Publisher of this Client, Used Toolkit of this Client, Used Language of this Client, Used Operating System of this Client
Table 3-2 Structured Information of the Xmethods Web Service Registry
Service Name, WSDL Link, Server, Provider, Provider's Country, Service Style, Rating, Description, User Description, Service Tags, Documentation (within WSDL)
Table 3-3 Structured Information of the Seekda Web Service Registry
Service Name, WSDL Link, Port Name, Service URL Address, Implementation Class
Table 3-4 Structured Information of the Ebi Web Service Registry
Service Name, WSDL Link, Style, Provider, Provider's Country, View Times, Favorite Times, Submitter, Service Tags, Total Annotation, Provider Annotation, Member Annotation, Registry Annotation, Base URL, SOAP Lab Server Base URL, Description, User Description, Category
Table 3-5 Structured Information of the Biocatalogue Web Service Registry
SOAP Operation Name, Inputs and Outputs, Operation Description, Operation Tags, Part of Which Service
Table 3-6 SOAP Operation Information of the Biocatalogue Web Service Registry
REST Operation Name, Service Tags, Used Template, Operation Description, Part of Which Service, Part of Which Endpoint Group
Table 3-7 REST Operation Information of the Biocatalogue Web Service Registry
(2) Endpoint Information
Endpoint information is the port information of the Web service, which can be extracted only from the service page. Since different Web Service Registries structure the endpoint information of a Web service differently, some elements of the endpoint information can be very diverse. One thing needs attention: the Ebi Web Service Registry does not have endpoint information for any of the Web services published in it. Moreover, even though the Web services within one Web Service Registry share the same structure of endpoint information, some elements of the endpoint information may be missing or empty. Furthermore, a Web Service Registry may even have no endpoint information at all for some of the Web services it publishes. Nevertheless, whenever there is endpoint information for a Web service, at least one element exists, namely the URL address of the endpoint. Table 3-8 shows the endpoint information that should be extracted from the five Web Service Registries.
Web Service Registry Name | Elements of the Endpoint Information
Service Repository | Endpoint Name, Endpoint URL, Endpoint Critical, Endpoint Type, Bound Endpoint
Xmethods | Endpoint URL, Publisher of this Endpoint, Contact Email of this Publisher, Implementation Language of this Endpoint
Seekda | Endpoint URL
Biocatalogue | Endpoint Name, Endpoint URL
Table 3-8 Endpoint Information of the five Web Service Registries
Web Service Registry Name | Elements of the Monitoring Information
Service Repository | Service Availability, Number of Downs, Total Uptime, Total Downtime, MTTR, MTBF, RTT Max of Endpoint, RTT Min of Endpoint, RTT Average of Endpoint, Ping Count of Endpoint
Seekda | Service Availability, Begin Time of Monitoring
Biocatalogue | Monitored Status of Endpoint, Overall Status of Service
Table 3-9 Monitoring Information of the five Web Service Registries
(3) Monitoring Information
Monitoring information is the measured statistical information about the Web service. It is worth noting that the Ebi and Xmethods Web Service Registries do not have monitoring information for any of the Web services published by them, while in the other three Web Service Registries only a few Web services may lack this information. Table 3-9 lists the monitoring information for these three Web Service Registries.
(4) Whois Information
Whois information is not extracted from the information hosted in the service page or the service list page. It is the descriptive information about the service domain, which can be gained by means of the address of the WSDL link. Because of that, the process of getting the whois information starts with obtaining the service domain first. The final value of the service domain must not contain strings like "http", "https" or "www"; it must be the bare registrable domain. After that, the service domain database is queried by sending the value of the service domain to a whois client, which is simply a Web site on the Internet, for example "http://www.whois365.com/cn/domain". Then, if information about that service domain exists, a list of its information is returned as the output. However, the structure of the returned information differs between service domains. Therefore, the most challenging part is that the extraction process has to deal with each different form of the returned information. Table 3-10 lists the whois information that needs to be extracted for all five Web Service Registries.
Service Domain URL, Domain Name, Domain Type, Domain Address, Domain Description, State, Postal Code, City, Country, Country Code, Phone, Fax, Email, Organization, Established Time
Table 3-10 Whois Information for the five Web Service Registries
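A minimal sketch of the domain-gaining step described above, using java.net.URL; the naive removal of a leading "www." is a simplification and does not handle every kind of top level domain:

    import java.net.URL;

    public class DomainSketch {
        // Reduce a WSDL link to the bare service domain, e.g.
        // "http://www.thomas-bayer.com/axis2/services/BLZService?wsdl"
        // becomes "thomas-bayer.com".
        static String serviceDomain(String wsdlLink) throws Exception {
            String host = new URL(wsdlLink).getHost(); // strips the protocol and path
            return host.startsWith("www.") ? host.substring(4) : host;
        }

        public static void main(String[] args) throws Exception {
            System.out.println(serviceDomain(
                "http://www.thomas-bayer.com/axis2/services/BLZService?wsdl"));
        }
    }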
Finally, all the information of these four aspects is collected together and then delivered to the Storage component for further storage processing.
3.2.3.1 Features of the Property Grabber Component
The Property Grabber component has to provide the following features:
- Obtain basic information
Generally speaking, the more information a Web service has, the better one can judge how good this Web service is. Hence, it is necessary for the Property Grabber component to extract all the basic information hosted in the service list page and the service page. This basic information comprises the structured information, the endpoint information and the monitoring information.
- Obtain whois information
Since the more information a Web service has, the better its quality can be assessed, it is necessary to extract as much information about the Web service as possible. Therefore, besides the basic information, the Property Grabber component can also obtain some additional information called whois information, such as the type of the domain, the name of the person in charge, the postal code of the domain, the city, phone, fax, detailed address, email, etc.
3.2.3.2 Input of the Property Grabber Component
This component requires the following input data:
- Service list page link
- Service page link
3.2.3.3 Output of the Property Grabber Component
The component produces the following output data:
- Structured information of each service
- Endpoint information of each service, if it exists
- Monitoring information of the service and its endpoints, if it exists
- Whois information of the service domain
All this information is collected together as the properties of each service. Thereafter, the collected properties are sent to the Storage component.
3.2.3.4 Demonstration of the Property Grabber Component
The pictures from figure 3-13 to figure 3-16 depict the fundamental procedure of the Property Grabber component. To simplify the explanation, the example shown in this section uses the same Web service as before.
1) The inputs of the Property Grabber component are the links of the service list page and the service page received from the Web Service Extractor component. These links are "http://www.service-repository.com" and "http://www.service-repository.com/service/overview-210897616".
2) When the "GetProperty" function is triggered, it calls its four sub functions shown in figure 3-12.
Figure 3-13 Structured properties of the service "BLZService" in the service list page
Figure 3-14 Structured properties of the service "BLZService" in the service page
3) First, the "getStructuredProperty" function tries to extract the structured information displayed in the red boxes of figure 3-13 and figure 3-14. However, several elements of the structured information have the same content, like the description shown in the service page and the service list page. Hence, in order to save time in the extraction process and space in the storing process, elements with the same content are extracted only once. Moreover, the rating information requires a transformation from non-descriptive to descriptive text, because its content consists of several star images. The final results of the extracted structured information of this Web service are shown in table 3-11. Because there is no descriptive information for the provider, homepage and owner homepage, their values are assigned "NULL".
Service Name BLZService
WSDL Link http://www.thomas-bayer.com/axis2/services/BLZService?wsdl
WSDL Version 0
Server Apache-Coyote/1.1
Description BLZService
Rating Four Stars and a Half
Provider NULL
Homepage NULL
Owner Homepage NULL
Table 3-11 Extracted Structured Information of the Web Service "BLZService"
4) Second, the "getEndpointProperty" function extracts the endpoint information displayed in the red box of figure 3-15. Though there is more than one endpoint record in the list, only one of them is extracted as the endpoint information. That is because this master program is intended to extract as much information as possible, but the extracted information should not contain redundant entries. Therefore, only one record is extracted as the endpoint information even if there is more than one endpoint record. Table 3-12 shows the final results of the endpoint information of this Web service.
Figure 3-15 Endpoint information of the Web service "BLZService" in the service page
Endpoint Name BLZServiceSOAP12port_http
Endpoint URL http://www.thomas-bayer.com:80/axis2/services/BLZService
Endpoint Critical True
Endpoint Type production
Bound Endpoint BLZServiceSOAP12Binding
Table 3-12 Extracted Endpoint Information of the Web service "BLZService"
5) Then it is time to extract the monitoring information by invoking the function "getMonitoringProperty". Figure 3-16 displays two types of monitoring properties: the upper red box contains the monitoring information about the Web service, and the lower red box lists the monitoring information of its endpoints. As mentioned before, only one endpoint statistic record is extracted. Besides, as can be seen in figure 3-16, there are two availability values. They both represent the availability of this Web service, just like the availability shown in figure 3-14; therefore, one of these availability values is sufficient. Table 3-13 shows the final results of this extraction process.
Figure 3-16 Monitoring information of the service "BLZService" in the service page
Service Availability 100
Number of Downs 0
Total Uptime 1 day 19 hours 19 minutes
Total Downtime 0 seconds
MTBF 1 day 19 hours 19 minutes
MTTR 0 seconds
RTT Max of Endpoint 141 ms
RTT Min of Endpoint 0 ms
RTT Average of Endpoint 57.7 ms
Ping Count of Endpoint 112
Table 3-13 Extracted Monitoring Information of the Web service "BLZService"
6) Next, in order to extract the whois properties, the "getWhoisProperty" function first has to gain the service domain from the WSDL link. For this Web service, the gained service domain is "thomas-bayer.com". Then it sends this service domain as input to the whois client for the querying process. After that, a list of information about that service domain is returned; see figure 3-17. Table 3-14 contains the extracted whois information.
Figure 3-17 Whois information of the service domain "thomas-bayer.com"
Service Domain URL thomas-bayer.com
Domain Name Thomas Bayer
Domain Type NULL
Domain Address Moltkestr. 40
Domain Description NULL
State NULL
Postal Code 54173
City Bonn
Country NULL
Country Code DE
Phone +4922855525760
Fax NULL
Email info@predic8.de
Organization predic8 GmbH
Established Time NULL
Table 3-14 Extracted Whois Information of the service domain "thomas-bayer.com"
7) Finally, all the information of these four aspects is collected together as the service properties, and then these service properties are forwarded to the Storage component.
3.2.4 The Function of the Storage Component
The WSDL link from the WSDL Grabber component is used by the Storage component to download the WSDL document from the Web and store it on disk thereafter. In addition, the service properties from the Property Grabber component are also directly stored on disk, in three different manners, by the Storage component. Figure 3-18 illustrates the process flow of the Storage component.
When the Storage component receives the needed inputs, the mediator function "Storager" of the Storage component is triggered. After that, it transforms the service properties into three different output formats and stores them on disk. These output formats are an XML file, an INI file and database records. Besides, it also tries to download the WSDL document from the URL address of the WSDL link and then stores the obtained WSDL document on disk, too. This "Storager" function is composed of four sub functions, namely the "getWSDL", "generateXML", "generateDatabase" and "generateINI" sub functions. Each sub function is in charge of one aspect of the storage tasks.
Figure 3-18 Overview of the process flow of the Storage Component
(1) "getWSDL" sub function
The task of the "getWSDL" sub function is to download the WSDL document and then store it on disk. Above all, it has to get the content of the WSDL document. This procedure is done as follows. First, the "getWSDL" sub function checks whether the value of the received WSDL link equals "NULL" or not. As already presented in section 3.2.2, if a Web service has no WSDL link, the value of its WSDL link is assigned "NULL". In such a case, this sub function creates a WSDL document whose name is the service name appended with the mark "No WSDL Document"; obviously, this document contains no content, it is an empty document. Nevertheless, if the service does have a WSDL link, this sub function tries to connect to the Internet using the URL address of the WSDL link. Once it succeeds, the contents hosted on the Web are downloaded, stored on disk and named only with the name of the service. Otherwise, it creates a WSDL document whose name is prefixed with "Bad" before the service name.
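The three cases just described (no WSDL link, a reachable link, an unreachable link) could be handled roughly as in the following sketch; the file-naming details follow the description above, but all identifiers are hypothetical and this is not the code of figure 3-19:

    import java.io.InputStream;
    import java.net.URL;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class GetWsdlSketch {
        static void getWSDL(String path, String name, String linkStr) {
            try {
                if ("NULL".equals(linkStr)) {
                    // case 1: no WSDL link -> create an empty, marked document
                    Files.write(Paths.get(path, name + "[No WSDL Document].wsdl"),
                                new byte[0]);
                    return;
                }
                // case 2: try to download the document from the Web
                try (InputStream in = new URL(linkStr).openStream()) {
                    Files.copy(in, Paths.get(path, name + ".wsdl"));
                }
            } catch (Exception e) {
                // case 3: link unreachable -> prefix the document name with "Bad"
                try {
                    Files.write(Paths.get(path, "Bad" + name + ".wsdl"), new byte[0]);
                } catch (Exception ignored) { }
            }
        }
    }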
(2) "generateXML" sub function
The "generateXML" sub function takes the service properties as input, transforms them into an XML file and stores it on disk with the name structure of the service name plus ".xml". XML stands for eXtensible Markup Language, a markup language designed to transport and store data. The first line of an XML file is the XML declaration, which defines the XML version and the encoding used. For example, "<?xml version="1.0" encoding="UTF-8"?>" means this is an XML file whose version is 1.0 and whose encoding is the 8-bit Unicode Transformation Format character set. Besides, an XML file contains XML elements, each comprising everything from the element's start tag to its end tag. Moreover, an XML element can also contain other elements, simple text, or a mixture of both. However, an XML file must contain a root element as the parent of all other elements.
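For illustration, a fragment of such a generated XML file could look as follows; the property values shown are hypothetical examples, not the literal output of the program:

    <?xml version="1.0" encoding="UTF-8"?>
    <!-- generated service properties (illustrative values) -->
    <service>
      <name>BLZService</name>
      <wsdlLink>http://www.thomas-bayer.com/axis2/services/BLZService?wsdl</wsdlLink>
      <rating>Four Stars and a Half</rating>
    </service>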
(3) "generateINI" sub function
The "generateINI" sub function also takes the service's properties as input, but it transforms them into an INI file and then stores it on disk with the name structure of the service's name plus ".ini". "ini" stands for initialization. The INI file format is a de facto standard for configuration files; INI files are simple text files with a basic structure. Generally speaking, an INI file contains three different parts: sections, parameters and comments. The parameter is the basic element contained in an INI file. Its format is a key-value pair, also called a name-value pair. The pair is delimited by an equals sign "=", and the key or name always appears to the left of the equals sign. A section is like a room that groups all its parameters together. It always appears on a single line within a pair of square brackets "[]", and sections may not be nested. A comment is descriptive text that begins with a semicolon ";"; anything between the semicolon and the end of the line is ignored.
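An illustrative fragment of such an INI file, again with hypothetical values, shows all three parts:

    ; Deep Web Service Crawler output (illustrative comment)
    ; service properties of one Web service
    [service]
    name=BLZService
    rating=Four Stars and a Half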
(4) "generateDatabase" sub function
The inputs of the "generateDatabase" sub function are the same as for the previous two sub functions. Instead of transforming them into a file like XML or INI, the "generateDatabase" sub function turns them into data in a database by using SQL statements. SQL stands for Structured Query Language, a database language designed for accessing and manipulating data in a database. Most of the actions performed on a database are done with SQL statements, and the primary statements of SQL include insert into, delete, update, select, create, alter and drop. Therefore, for the purpose of transforming these service properties into database data, this sub function has to create a database first, using the "create database" statement. Then it creates a table to store the data. A table is a collection of related data entries and consists of columns and rows. Since the data of all five Web Service Registries is not very large, one database table is enough for storing these service properties. Because of that, the field names of the service properties in the columns should be uniform and well-defined for all five Web Service Registries. Afterwards, the service properties of each single service can be put into the table as a record with the "insert into" statement of SQL.
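A minimal sketch of this create-and-insert procedure using JDBC; JDBC, the connection URL and the concrete table and column names are assumptions for illustration, since the actual code is only shown in figures 3-22 and 3-23:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.Statement;

    public class StoreSketch {
        public static void main(String[] args) throws Exception {
            // hypothetical connection URL; the actual database is not named here
            Connection con = DriverManager.getConnection("jdbc:mysql://localhost/dwsc");
            try (Statement st = con.createStatement()) {
                // one table for all five registries; every property column is TEXT
                st.executeUpdate("CREATE TABLE IF NOT EXISTS services ("
                        + "id INT PRIMARY KEY AUTO_INCREMENT, "
                        + "service_name TEXT, wsdl_link TEXT, rating TEXT)");
            }
            // one record per Web service, written with an INSERT INTO statement
            try (PreparedStatement ps = con.prepareStatement(
                    "INSERT INTO services (service_name, wsdl_link, rating) VALUES (?, ?, ?)")) {
                ps.setString(1, "BLZService");
                ps.setString(2, "http://www.thomas-bayer.com/axis2/services/BLZService?wsdl");
                ps.setString(3, "Four Stars and a Half");
                ps.executeUpdate();
            }
            con.close();
        }
    }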
3.2.4.1 Features of the Storage Component
The Storage component has to provide the following features:
- Generate different output formats
The final result of this master program is to store the information about the services on disk for future work. This Storage component provides three different formats for storing the service properties of the services published in the Web Service Registries. This makes the storage of the services very flexible and also long-lived.
- Obtain the WSDL document
The important task of this component is to check whether the WSDL document can be obtained from the WSDL link, because the WSDL document plays a decisive role in determining the quality of the service. This Storage component provides the ability to deal with the different situations arising in the process of obtaining the WSDL document.
3.2.4.2 Input of the Storage Component
This component requires the following input data:
- The WSDL link of each service
- The property information of each service
3.2.4.3 Output of the Storage Component
The component produces the following output data:
- The WSDL document of each service
- An XML document and an INI file for each service, and table records in the database
3.2.4.4 Demonstration of the Storage Component
The following figures describe the fundamental implementation code of this Storage component. The detailed depiction is given below.
1) As can be seen in figures 3-19 to 3-21, there are several commonalities among the implementation codes. The first commonality concerns the parameters defined in each of these sub functions, namely "path" and "SecurityInt". The parameter "path" is an absolute path on the computer disk; it is used for the storing procedure and has already been specified by the user at the beginning of the whole program. The parameter "SecurityInt" is an increasing integer that is used as part of the name of the service, which prevents services with the same name from overwriting each other on disk. The content marked in red in the code of these figures is the second commonality; its function is to create a file or document for the service with the corresponding parameters.
2) Figure 3-19 displays the implementation code of the "getWSDL" sub function, which is designed to get the WSDL document of a service based on its WSDL link. The parameter "name" is the name of the service, which is used as the name of the WSDL document. The parameter "linkStr" is the actual WSDL link, the most important parameter for this sub function. The other two parameters, "statistic" and "log", are the objects of the text files called "Statistic Information" and "Log Information" respectively. The "Statistic Information" text file is used to record the statistical data about the services in each Web Service Registry, such as the overall number of properties for that Web Service Registry, the overall number of services, the number of services that have no WSDL link, the number of services whose WSDL document contains no content, the number of services whose WSDL links are not available, etc. The "Log Information" text file records the results of the process steps and the problems encountered, for example which service is being crawled now, which Web Service Registry it belongs to, the reason why the WSDL document of a service cannot be obtained, the reason why a service is unreachable, and so on.
Figure 3-19 Implementation code for getting the WSDL document
3) Figure 3-20 and figure 3-21 show the code for turning the service properties into the XML file and the INI file, and for storing these two files on disk thereafter. The parameter "vec" is a Vector of the "PropertyStruct" data type. This "PropertyStruct" data type is an object of a class consisting of two variables: name and value.
Figure 3-20 Implementation code for generating the XML file
Figure 3-21 Implementation code for generating the INI file
4) The code in figure 3-22 and figure 3-23 shows the process that turns the service properties of all services in the five Web Service Registries into records of the database. A database has to be created first. The name of the database can be arbitrary as long as it conforms to the naming rules of the database; the same holds for the name of the table. Figure 3-22 displays the procedure of creating a table in the database. Because it is hard to decide on the length of each service property, the data types of all service properties are set to "Text". Figure 3-23 is the code for inserting the service properties into the table with an update statement.
Figure 3-22 Implementation code for creating the table in the database
Figure 3-23 Implementation code for generating the table records
3.3 Multithreaded Programming for the DWSC
Multithreaded programming is a built-in characteristic provided by the Java language. A multithreaded program contains two or more separate parts that can execute concurrently; each such part is called a thread. The use of multithreading makes it possible to create programs that use the system resources efficiently, for example by making maximum use of the CPU, because the idle time of the CPU can be kept to a minimum.
In this master program there are five Web Service Registries that need to be crawled for services, and the number of services published in each Web Service Registry differs considerably, which makes the running time needed for each Web Service Registry different as well. It can then happen that a Web Service Registry with fewer services has to wait until another Web Service Registry with many more services finishes. Therefore, in order to reduce the waiting time for the other Web Service Registries and to maximize the use of the system resources, it is necessary to apply multithreaded programming to this master program. That is to say, this master program creates a thread for each Web Service Registry, and these threads are executed independently, as the sketch below illustrates.
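A minimal sketch of this one-thread-per-registry scheme is given below; "crawlRegistry" is a hypothetical placeholder for the per-registry crawling procedure, not the actual method name:

    public class ThreadSketch {
        public static void main(String[] args) {
            String[] registries = { "Service Repository", "Ebi", "Xmethods",
                                    "Seekda", "Biocatalogue" };
            for (final String registry : registries) {
                // one independent thread per Web Service Registry
                new Thread(new Runnable() {
                    public void run() {
                        crawlRegistry(registry);
                    }
                }).start();
            }
        }

        static void crawlRegistry(String registry) {
            // placeholder for the actual crawling procedure
        }
    }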
3.4 Sleep Time Configuration for the Web Service Registries
Because this master program is intended for downloading the WSDL documents and extracting the service information of the Web services published in the Web Service Registries, it inevitably affects the performance of these registries. In addition, for the purpose of not being exceeded in their throughput capability, the Web Service Registries restrict the rate of accessing them. Because of that, unknown errors can sometimes happen while this master program is executing. For instance, the master program may continually halt at one point without getting any more WSDL documents and service information, the WSDL documents of some services of some Web Service Registries cannot be obtained, or some service information may be missing. Therefore, in order to obtain the largest possible number of Web services published in these five Web Service Registries without affecting their throughput, the accessing rate has to be configured for each service of all Web Service Registries.
In consequence, before going into the essential procedure for each single service of these Web Service Registries, this master program calls the system's built-in function "sleep(long milliseconds)". It is a public static function that causes the currently executing thread to sleep for the specified number of milliseconds; in other words, it temporarily ceases execution for a while. The following table shows the time interval of the sleep function for each Web Service Registry.
Web Service Registry Name Time Interval (milliseconds)
Service Repository 8000
Ebi 3000
Xmethods 10000
Seekda 20000
Biocatalogue 10000
Table 3-15 Sleep time of the five Web Service Registries
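Applied in code, the per-registry delay could look like the following sketch; the loop and the list of services are hypothetical simplifications, while the 8000 millisecond value corresponds to the Service Repository row of table 3-15:

    public class SleepSketch {
        public static void main(String[] args) throws InterruptedException {
            long sleepMillis = 8000; // Service Repository, see table 3-15
            String[] services = { "service1", "service2" }; // placeholder list
            for (String service : services) {
                // pause before each single service to respect the
                // registry's throughput limits
                Thread.sleep(sleepMillis);
                // ... crawl the service here ...
            }
        }
    }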
4 Experimental Results and Analysis
This chapter shows the quantitative experimental results of the prototype presented in chapter 3, together with a description and explanation of the analysis of these results. In order to gain rather accurate results, the experiments were carried out more than five times; all the data displayed in the following tables and charts are averages over these runs.
4.1 Statistic Information for the Different Web Service Registries
This section discusses the amount statistics of the Web services published in the five Web Service Registries. This includes the overall number of Web services published in each Web Service Registry and the number of unavailable Web services, which have been archived because they may not be active anymore or are close to being non-active. Table 4-1 shows the service amount statistics of the five Web Service Registries.
Web Service Registry Name | Service Repository | Ebi | Xmethods | Seekda | Biocatalogue
Overall Services | 57 | 289 | 382 | 853 | 2567
Unavailable Services | 0 | 0 | 0 | 0 | 125
Table 4-1 Service amount statistics of the five Web Service Registries
In order to provide an intuitive view of the service amount statistics of these five Web Service Registries, the bar chart in figure 4-1 presents the data of table 4-1. As can be seen from the bar chart, on the one hand there is an ascending increase in the overall number of Web services from the Service Repository Web Service Registry to the Biocatalogue Web Service Registry. The Biocatalogue Web Service Registry in particular owns the largest number of Web services. This indicates that the Biocatalogue Web Service Registry has a much more powerful ability to provide Web services to its users, because it contains far more services than the other four Web Service Registries. On the other hand, there are no unavailable services in any of the Web Service Registries except the Biocatalogue Web Service Registry. That is to say, only the Biocatalogue Web Service Registry contains some Web services that cannot be used by its users. To some degree this is useless, because these services cannot be used anymore, and they waste network resources on the Web. Therefore, all these unavailable services should be eliminated in order to reduce the waste of network resources.
Figure 4-1 Service amount statistics of the five Web Service Registries
4.2 Statistic Information for the WSDL Documents
Web Service Registry Name | Service Repository | Ebi | Xmethods | Seekda | Biocatalogue
Failed WSDL Links | 1 | 0 | 23 | 145 | 32
Without WSDL Links | 0 | 0 | 0 | 0 | 16
Empty Content | 0 | 0 | 2 | 0 | 2
Table 4-2 Statistic information for the WSDL documents
Table 4-2 and figure 4-2 show the error statistics of the WSDL documents of the Web services in these five Web Service Registries. There are three aspects. The first is the "Failed WSDL Links" of the Web services in these Web Service Registries: the overall number of Web services whose WSDL links are invalid. In other words, it is impossible to get the WSDL documents of these Web services via the URL addresses of their WSDL links; therefore, no WSDL document is created. The second aspect is the "Without WSDL Links" of the Web services in these Web Service Registries: the overall number of Web services in each Web Service Registry that have no WSDL link at all. There is no WSDL document for such Web services either, and the value of the WSDL link of such a Web service is "NULL". However, a WSDL document is created anyway; it has no content, and its name contains the string "[No WSDL Document]". The third aspect is the "Empty Content", which represents the overall number of Web services that have WSDL links whose URL addresses
are valid, but whose WSDL documents contain no content. In this case, a WSDL document whose name contains the string "(BAD)" is created.
Figure 4-2 Statistic information for the WSDL documents
4.3 Comparison of the Different Average Numbers of Service Properties
This section compares the average number of service properties in the five Web Service Registries. The average number of service properties is calculated by means of the following equation:

ASP = ONSP / ONS (1)

where
ASP is the average number of service properties for one Web Service Registry,
ONSP is the overall number of service properties in one Web Service Registry,
ONS is the overall number of Web services that have already been crawled from the corresponding Web Service Registry.
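For instance, under the hypothetical assumption that 1311 service properties in total were extracted from the 57 crawled services of the Service Repository, equation (1) would yield ASP = 1311 / 57 = 23, which is exactly the average number shown for this registry in figure 4-3.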
Figure 4-3 shows the average number of service properties per Web service in the five Web Service Registries. As mentioned before, one of the measurements for testing the quality of the Web services in a Web Service Registry is the service information: the more information about a Web service is available, the better you know that service, and consequently the better the quality of the Web services the corresponding Web Service Registry can offer to its users. As seen in figure 4-3, the Service Repository and Biocatalogue Web Service Registries own a larger number of service properties than the other three Web Service Registries. This directly reflects that these two Web Service Registries provide more detailed information about the Web services published in them, so users can more easily choose the services they need, and they are also more likely to use the Web services published in these two Web Service Registries. By contrast, the Xmethods and Seekda
Web Service Registries, which have less service information about their Web services, offer less quality for these Web services. Therefore, users may not like to use the Web services provided by these two Web Service Registries, not to mention the Web services published in the Ebi Web Service Registry.
Figure 4-3 Average number of service properties
From the description presented in section 3.2.3, the causes of the different numbers of service properties in these Web Service Registries may consist of the following points. First, the number of structured information elements for the Web services differs among the five Web Service Registries; part of the information for some Web services in one Web Service Registry may even be missing or have an empty value. For example, the number of structured information elements that are supposed to be extracted for the Web services in the Ebi Web Service Registry differs greatly from that in the Biocatalogue Web Service Registry. Second, there is a certain amount of endpoint information in almost all Web Service Registries except the Ebi Web Service Registry; this more or less affects the overall number of service properties. Third, some Web Service Registries, such as Xmethods and Ebi, have no monitoring information; the Service Repository Web Service Registry in particular has a large amount of monitoring information about its Web services that can be extracted from the Web. The last point is obviously the amount of whois information for these Web services. If the database of the whois client does not contain information about the service domain of a Web service in a Web Service Registry, then no whois information can be extracted. Moreover, even if there is information about the service domain, the amount of that information can be very diverse. Therefore, if many service domains of the Web services in a registry have no or few whois entries, the average number of service properties in that registry decreases greatly.
As a result, in order to help users become better acquainted with the Web services provided in a Web Service Registry and to distinguish these Web services, the Web Service Registry should do its best to offer more and more information about each of its published Web services.
4.4 Different Outputs for the Web Services
The fundamental task of this master program is to obtain the WSDL documents of the Web services hosted in the Web Service Registries, as well as to extract, gather and thereafter store the properties of these Web services on disk. Therefore, this section describes the different outputs of this master program, which comprise the WSDL documents of the Web services, the generated XML and INI files, and the data records of the service properties in the database.
Figure 4-4 WSDL document format of one Web service
The WSDL document of a Web service is simply read from the Web according to the URL address of its WSDL link, and the data are then stored on disk as the WSDL document. The name of the WSDL document is the service name plus the ending ".wsdl", which marks it as a WSDL document. In order to distinguish WSDL documents whose names would be identical although their contents differ, the name of each obtained WSDL document in one Web Service Registry additionally contains a unique integer in front of the name. Figure 4-4 shows the valid WSDL document format of the Web service; its name is "1BLZService.wsdl".
Besides, the obtained service properties are transformed into an XML file, an INI file and data records in the database. Figure 4-5, figure 4-6 and figure 4-7 show these three different output formats respectively. As can be seen in figure 4-5, this is the INI file of the Web service, named "1BLZService.ini". The integer is the same as in the WSDL document, because both belong to the same Web service. The first three lines in that INI file are service comments, which run from the semicolon to the end of the line; they are the basic information describing this INI file. The line following them is the section, enclosed in a pair of brackets. It is important because it indicates that the lines behind it are the information of this Web service. The rest of the lines are the actual service information, each given as a key-value pair with an equals sign between key and value. Each service property is displayed from the beginning of the line.
Figure 4-5 INI file format of one Web service
Figure 4-6 XML file format of one Web service
Figure 4-7 Database data format for all Web services
Furthermore, figure 4-6 shows the generated XML file format of the Web service "BLZService"; its name is "1BLZService.xml". Needless to say, this XML file is part of the materials of the same Web service. Though the format of the XML file differs from that of the INI file, the essential contents of the two are the same; that is to say, the values of the service properties do not differ. This is because both files are generated from the same collection of properties of the same Web service. The XML file also has some comments like those in the INI file, which are displayed between "<!--" and "-->". The section in the INI file corresponds roughly to the root element in the XML file; therefore, the values of the elements below the root "service" in this XML file are the values of the service properties of this Web service.
Eventually, as can be seen in figure 4-7, there is a database table which is used to store the data of the service information of all Web services in these five Web Service Registries. The entire service information of one Web service forms exactly one record in this table. Because of that, the column names of the table are the union of the names of the service information in each Web Service Registry. However, since the column names of the table must be unique, the redundant names in this union must be eliminated. This is sensible and possible because the names of the service information are well-defined and uniform for all five Web Service Registries. In addition, the first column of this table is the primary key, an increasing integer. Its function resembles that of the integer contained in the names of the XML and INI files. The remaining columns of the table are the corresponding service properties of the Web services. The value "NULL" or "Null" in this table indicates that this property of that Web service is empty or missing.
4.5 Comparison of the Average Time Cost for Different Parts of a Single Web Service
This section compares the average time cost of the different parts of getting one single Web service in all five Web Service Registries. At first, the average time cost of getting one single service in a Web Service Registry has to be calculated. It can be obtained through the following equation:

ATC = OTS / ONS (2)

where
ATC is the average time cost for one single Web service,
OTS is the overall time cost for all the Web services in one Web Service Registry,
ONS is the overall number of Web services that have already been crawled from the corresponding Web Service Registry.
In addition, the different parts of the average time cost of getting one single service consist of the following six aspects: the average time cost of extracting the service properties, the average time cost of obtaining the WSDL document, the average time cost of generating the XML file, the average time cost of generating the INI file, the average time cost of inserting the service properties into the database table, and the average time cost of the remaining procedures, such as getting the service list page link, getting the service page link and so on. The average time cost of extracting the service properties is obtained by means of the following equation:

ATCSI = OTSSI / ONS (3)
where
ATCSI is the average time cost of extracting the service properties of one single Web service,
OTSSI is the overall time cost of extracting the service properties of all the Web services in one Web Service Registry,
ONS is the overall number of Web services that have already been crawled from the corresponding Web Service Registry.
The calculation of the other parts is analogous to the equation for the average time cost of extracting the service properties, while the average time cost of the remaining procedures equals the average time cost of one single Web service minus the sum of the average time costs of the other five parts.
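For example, for the Service Repository row of table 4-3 this subtraction yields 10042 - (8801 + 918 + 2 + 1 + 53) = 267 milliseconds for the "Others" column.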
Web Service Registry Name | Service Property | WSDL Document | XML File | INI File | Database | Others | Overall
Service Repository | 8801 | 918 | 2 | 1 | 53 | 267 | 10042
Ebi | 699 | 82 | 2 | 1 | 28 | 11 | 823
Xmethods | 5801 | 1168 | 2 | 1 | 45 | 12 | 7029
Seekda | 5186 | 1013 | 2 | 1 | 41 | 23 | 6266
Biocatalogue | 39533 | 762 | 2 | 1 | 66 | 1636 | 42000
Table 4-3 Average time cost (in milliseconds) for all Web Service Registries
Table 4-3 displays the average time cost of one Web service and of its different parts in all five Web Service Registries. The first column of table 4-3 contains the names of the five Web Service Registries, and the last column is the average time cost of a single service in one Web Service Registry, while the remaining columns are the average time costs of the six different parts. In order to provide an intuitive view of the data in table 4-3, the data in each column of this table are illustrated in the corresponding figures 4-8 to 4-13.
Figure 4-8 Average time cost for extracting the service properties in all Web Service Registries
As can be seen from figure 4-8, the average time cost for extracting the service properties in the Biocatalogue Web Service Registry is 39533 milliseconds. This is much larger than in the other four Web Service Registries, where it is 8801, 699, 5801 and 5186 milliseconds for Service Repository, Ebi, Xmethods and Seekda respectively. That is to say, it takes much longer to extract the service properties of the Web services published by the Biocatalogue Web Service Registry. In addition, this indirectly indicates that the Biocatalogue Web Service Registry also has the largest average number of service properties, which was already discussed in section 4.3. On the contrary, the average number of service properties in the Ebi Web Service Registry is the smallest. Moreover, the average time cost in the Xmethods Web Service Registry is larger than in the Seekda Web Service Registry, although, as already known, the average number of service properties is the same for these two Web Service Registries. Nevertheless, there is a cause that might explain why the average time in Xmethods is higher than in Seekda: the process of extracting the service properties in the Xmethods Web Service Registry has to be executed by means of both the service page and the service list page, while only the service page link is needed for the Seekda Web Service Registry.
The average time cost for obtaining the WSDL document in all five Web Service Registries is displayed in figure 4-9. Actually, this average time cost is the sum of the average time for extracting the WSDL link of a Web service and the average time for reading the data of the WSDL document from the Web and then storing it on disk. As seen in this figure, the average time cost for obtaining the WSDL document is largest in the Xmethods Web Service Registry, namely 1168 milliseconds. Although the process of extracting the WSDL link costs a certain amount of time, it does not have a significant influence on the total average time spent on obtaining the WSDL document, because the WSDL link of one Web service is almost always gained in one step. Therefore, this implies that the average size of the WSDL documents in the Xmethods Web Service Registry is larger than in the other four Web Service Registries, in particular the Ebi Web Service Registry, which takes just 82 milliseconds for the process of obtaining the WSDL document.
Figure 4-9 Average time cost for obtaining the WSDL document in all Web Service Registries
Figures 4-10, 4-11 and 4-12 show the average time cost of generating the three different outputs in all five Web Service Registries. As can be seen from figures 4-10 and 4-11, the average time for generating the XML file of one Web service is the same for all five Web Service Registries, namely only 2 milliseconds. Likewise, the average time for generating the INI file of one Web service is the same everywhere, and its value is just 1 millisecond. Even when these two average time costs are summed up, the result is still so small that it can be neglected compared to the overall average time cost of getting one Web service for each Web Service Registry, shown in figure 4-13. This implies that the generation of the XML and INI files finishes practically at once after receiving the service properties of one Web service as input. Furthermore, as can be seen from figure 4-12, although the average time cost of creating the database record for each Web service is larger than that of generating the XML and INI files in all five Web Service Registries, the database operation is still fast.
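A plausible explanation is that all three outputs are produced locally from property data already held in memory, with the database commit adding a little extra work. A minimal sketch of such a storage step follows; it uses SQLite and invented element, section and column names purely for illustration, and is not the storage format of this master program:

import configparser
import sqlite3
import xml.etree.ElementTree as ET

def write_outputs(service_properties, xml_path, ini_path, db_path):
    # XML output: one child element per service property.
    root = ET.Element("service")
    for key, value in service_properties.items():
        ET.SubElement(root, key).text = str(value)
    ET.ElementTree(root).write(xml_path, encoding="utf-8", xml_declaration=True)
    # INI output: a single section holding the same properties.
    ini = configparser.ConfigParser()
    ini["service"] = {k: str(v) for k, v in service_properties.items()}
    with open(ini_path, "w", encoding="utf-8") as f:
        ini.write(f)
    # Database record: one row per service; the commit makes this step
    # slightly slower than the two pure file writes.
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS service (name TEXT, provider TEXT)")
    con.execute("INSERT INTO service VALUES (?, ?)",
                (service_properties.get("name"), service_properties.get("provider")))
    con.commit()
    con.close()

# Example call with two illustrative properties of the "BLZService" example.
write_outputs({"name": "BLZService", "provider": "thomas-bayer.com"},
              "BLZService.xml", "BLZService.ini", "services.db")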
Figure 4-10: Average time cost for generating the XML file in all Web Service Registries (bar chart; y-axis: time in milliseconds; value: 2 for every registry)
Figure 4-11: Average time cost for generating the INI file in all Web Service Registries (bar chart; y-axis: time in milliseconds; value: 1 for every registry)
Figure 4-12: Average time cost for creating the database record in all Web Service Registries (bar chart; y-axis: time in milliseconds; values: Service Repository 53, Ebi 28, Xmethods 45, Seekda 41, Biocatalogue 66)
Figure 4-13: Average time cost for getting one Web service in all Web Service Registries (bar chart; y-axis: time in milliseconds; values: Service Repository 10042, Ebi 823, Xmethods 7029, Seekda 6266, Biocatalogue 42000)
Figure 4-13 gives the average time cost for getting one single Web service in all five Web Service Registries. Without any doubt, the Biocatalogue Web Service Registry takes the longest time for this process. This is because, as the presentation of the five different parts above shows, each part needs more time in the Biocatalogue Web Service Registry than elsewhere, with the sole exception of obtaining the WSDL document, where Biocatalogue is not the slowest. The overall value is essentially the sum of the parts; for the Ebi Web Service Registry, for example, 699 + 82 + 2 + 1 + 28 = 812 milliseconds, which is close to the measured overall value of 823 milliseconds. Moreover, a striking observation appears when looking at figures 4-8, 4-12 and 4-13: the shapes of these curves follow almost the same trend. This further indicates that a Web Service Registry that spends more time to get the description information of one Web service also offers more information about that Web service.
5 Conclusion and Further Direction
This master thesis provides a schema that aims to explore the description information of the Web services hosted in different Web Service Registries. The description information of a Web service consists of the WSDL document and the service information about that Web service. Although there is an existing approach, the Pica-Pica Web Service Description Crawler, that can be used to obtain the WSDL document and the service information of the Web services in these Web Service Registries, its functionality is restricted: it explores only a small subset of the Web services hosted in these Web Service Registries, the storage of these Web services is not flexible, and, most importantly, only a little of the service information of a Web service is extracted; for some Web Service Registries no service information is extracted at all. The work presented in this master thesis, in contrast, is able to explore all Web services in these Web Service Registries. Moreover, with the approach presented in this master thesis, as much service information as possible is extracted for every Web service, so that the final result is the largest annotated service catalogue ever produced. Furthermore, regarding the storage of the description information of the Web services, this master thesis provides three different ways that guarantee not only the completeness but also the longevity of the description information.
However, in the implementation performed in this master thesis, the whois client used for querying the information of a service domain returns free text if the information exists, and sometimes this free text differs completely from one domain to another. This makes it necessary to crawl every Web service in all Web Service Registries at least once during the experiment stage, so that all variants of this free text can be foreseen and processed afterwards. Nevertheless, this is a huge amount of work, because there are lots of Web services in these Web Service Registries. Therefore, in order to simplify the work, another whois client that eases this task needs to be found and used.
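The following sketch illustrates the kind of tolerant, pattern-by-pattern parsing this free text forces; the patterns shown are hypothetical examples, far from the exhaustive rule set the experiments required:

import re

# Every whois server formats its free-text reply differently, so the same
# logical field needs several patterns; these three are examples only.
REGISTRAR_PATTERNS = [
    re.compile(r"Registrar:\s*(.+)", re.IGNORECASE),
    re.compile(r"Sponsoring Registrar:\s*(.+)", re.IGNORECASE),
    re.compile(r"registrar-name:\s*(.+)", re.IGNORECASE),
]

def extract_registrar(whois_text):
    """Return the registrar from free-text whois output, or None."""
    for pattern in REGISTRAR_PATTERNS:
        match = pattern.search(whois_text)
        if match:
            return match.group(1).strip()
    # An unseen format was encountered: a new pattern has to be added by
    # hand, which is exactly the maintenance burden described above.
    return None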
Moreover, in the experiment stage of this master thesis, the time cost for getting a Web service is still large. For the purpose of reducing this time, multithreaded programming could be applied to some parts of the process of getting one Web service.
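As a sketch of this direction, using Python's standard thread pool (the function names are hypothetical, and the sleep-time limits per registry from section 3.4 would still have to be respected):

from concurrent.futures import ThreadPoolExecutor

def get_one_service(service_page_link):
    # Placeholder for the real pipeline of this master program: extract the
    # service properties, grab the WSDL document and store all outputs.
    pass

def crawl_registry(service_page_links, workers=8):
    # The per-service steps are dominated by network waiting time, so letting
    # several services proceed in parallel threads reduces wall-clock time.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(get_one_service, service_page_links))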
Although the work performed here is specialized for only these five Web Service Registries, the main principles used here are adaptable to other Web Service Registries with only small changes in the implementation code or structure.
6 Bibliography
[1] Nathalie Steinmetz, Holger Lausen, Dario Cerizza, Andrea Turati, Irene Celino, Adam Funk and Michael Erdmann: "D1.1 – Requirement Analysis and Architectural Plan". Available from http://www.service-finder.eu/deliverables/public/7-public/32-d11. Emanuele Della Valle (CEFRIEL), June 27, 2008.
[2] Nathalie Steinmetz, Holger Lausen, Dario Cerizza, Andrea Turati, Irene Celino, Adam Funk and Michael Erdmann: "D1.2 – First Design of Service-Finder as a Whole". Available from http://www.service-finder.eu/deliverables/public/7-public/37-d12-first-design-of-service-finder-as-a-whole. Emanuele Della Valle (CEFRIEL), July 1, 2008.
[3] Nathalie Steinmetz, Holger Lausen, Irene Celino, Dario Cerizza, Saartje Brockmans and Adam Funk: "D1.3 – Revised Requirement Analysis and Architectural Plan". Available from http://www.service-finder.eu/deliverables/public/7-public/57-d13-revised-requirement-analysis-and-architectural-plan. Emanuele Della Valle (CEFRIEL), April 1, 2009.
[4] Chia-Hui Chang, Mohammed Kayed, Moheb Ramzy Girgis and Khaled Shaalan: "A Survey of Web Information Extraction Systems". IEEE Transactions on Knowledge and Data Engineering, Volume 18, Issue 10, IEEE Computer Society, pp. 1411–1428, October 2006.
[5] Leonard Richardson: "Beautiful Soup Documentation". Available from http://www.crummy.com/software/BeautifulSoup/documentation.html, October 13, 2008.
[6] Hao He, Hugo Haas and David Orchard: "Web Services Architecture Usage Scenarios". Available from http://www.w3.org/TR/ws-arch-scenarios/, February 11, 2004.
[7] Stephen Soderland: "Learning Information Extraction Rules for Semi-Structured and Free Text". Machine Learning, Volume 34, Issue 1–3, pp. 233–272. Department of Computer Science and Engineering, University of Washington, Seattle, February 1999.
[8] Ian Hickson: "A Vocabulary and Associated APIs for HTML and XHTML". World Wide Web Consortium, Working Draft WD-html5-20100624, June 24, 2010.
[9] Holger Lausen, Jos de Bruijn, Axel Polleres and Dieter Fensel: "The Web Service Modeling Language WSML". WSML Deliverable D16.1 v0.2, March 20, 2005. Available from http://www.wsmo.org/TR/d16/d16.1/v0.2/.
[10] Dumitru Roman, Holger Lausen and Uwe Keller: "Web Service Modeling Ontology – Standard (WSMO-Standard)". WSMO Deliverable D2, version 1.1, March 6, 2004. Available from http://www.wsmo.org/TR/d2/v1.1/.
[11] Iris Braun, Anja Strunk, Gergana Stoyanova and Bastian Buder: "ConQo – A Context- and QoS-Aware Service Discovery". TU Dresden, Department of Computer Science. In Proceedings of WWW/Internet 2008.
7 Appendixes
There are two additional outputs of this master program: a log information file and a statistic report file. Figure 8-1 shows one of the basic output formats of the log information for these five Web Service Registries.
Figure 8-1: Log information of the "Service Repository" Web Service Registry
Figure 8-2: Statistic information of the "Service Repository" Web Service Registry
Figures 8-2 to 8-6 show the output format of the statistic information for the "Service Repository", "Ebi", "Xmethods", "Seekda" and "Biocatalogue" Web Service Registries respectively.
Figure 8-3: Statistic information of the "Ebi" Web Service Registry
Figure 8-4: Statistic information of the "Xmethods" Web Service Registry
Figure 8-5: Statistic information of the "Seekda" Web Service Registry
Figure 8-6: Statistic information of the "Biocatalogue" Web Service Registry
Table of Figures
Figure 2-1: Dataflow of Service-Finder and Its Components 12
Figure 2-2: Left is the free text input type and right is its output 16
Figure 2-3: A semi-structured page containing data records (in rectangular box) to be extracted 16
Figure 2-4: Architecture of the Pica-Pica Web Service Description Crawler 20
Figure 3-1: Overview of the Basic Architecture for the Deep Web Services Crawler 25
Figure 3-2: Overview of the process flow of the Web Service Extractor Component 27
Figure 3-3: Service list page of the Service-Repository 29
Figure 3-4: Original source code of the internal link for Web service "BLZService" 29
Figure 3-5: Code overview of getting the service page link in Service Repository 29
Figure 3-6: Service page of the Web service "BLZService" 29
Figure 3-7: Overview of the process flow of the WSDL Grabber Component 30
Figure 3-8: WSDL link of the Web service "BLZService" in the service page 31
Figure 3-9: Original source code of the WSDL link for Web service "BLZService" 32
Figure 3-10: Code overview of the "getServiceRepositoryWSDLLink" function 32
Figure 3-11: Code overview of the "oneParameter" function 32
Figure 3-12: Overview of the process flow of the Property Grabber Component 33
Figure 3-13: Structured properties of the service "BLZService" in the service list page 37
Figure 3-14: Structured properties of the service "BLZService" in the service page 38
Figure 3-15: Endpoint information of the Web service "BLZService" in the service page 38
Figure 3-16: Monitoring information of the service "BLZService" in the service page 39
Figure 3-17: Whois information of the service domain "thomas-bayer.com" 40
Figure 3-18: Overview of the process flow of the Storage Component 41
Figure 3-19: Implementation code for getting the WSDL document 44
Figure 3-20: Implementation code for generating the XML file 44
Figure 3-21: Implementation code for generating the INI file 45
Figure 3-22: Implementation code for creating a table in the database 45
Figure 3-23: Implementation code for generating table records 46
Figure 4-1: Service amount statistic of these five Web Service Registries 49
Figure 4-2: Statistic information for WSDL documents 50
Figure 4-3: Average number of service properties 51
Figure 4-4: WSDL document format of one Web service 52
Figure 4-5: INI file format of one Web service 53
Figure 4-6: XML file format of one Web service 53
Figure 4-7: Database data format for all Web services 53
Figure 4-8: Average time cost for extracting service properties in all Web Service Registries 55
Figure 4-9: Average time cost for obtaining the WSDL document in all Web Service Registries 56
Figure 4-10: Average time cost for generating the XML file in all Web Service Registries 57
Figure 4-11: Average time cost for generating the INI file in all Web Service Registries 57
Figure 4-12: Average time cost for creating the database record in all Web Service Registries 58
Figure 4-13: Average time cost for getting one Web service in all Web Service Registries 58
Table of Tables
Table 3-1: Structured Information of Service-Repository Web Service Registry 34
Table 3-2: Structured Information of Xmethods Web Service Registry 34
Table 3-3: Structured Information of Seekda Web Service Registry 34
Table 3-4: Structured Information of Ebi Web Service Registry 34
Table 3-5: Structured Information of Biocatalogue Web Service Registry 34
Table 3-6: SOAP Operation Information of Biocatalogue Web Service Registry 35
Table 3-7: REST Operation Information of Biocatalogue Web Service Registry 35
Table 3-8: Endpoint Information of these five Web Service Registries 35
Table 3-9: Monitoring Information of these five Web Service Registries 35
Table 3-10: Whois Information for these five Web Service Registries 36
Table 3-11: Extracted Structured Information of Web Service "BLZService" 38
Table 3-12: Extracted Endpoint Information of the Web service "BLZService" 39
Table 3-13: Extracted Monitoring Information of the Web service "BLZService" 39
Table 3-14: Extracted Whois Information of service domain "thomas-bayer.com" 40
Table 3-15: Sleep Time of these five Web Service Registries 47
Table 4-1: Service amount statistic of these five Web Service Registries 48
Table 4-2: Statistic information for WSDL Document 49
Table 4-3: Average time cost information for all Web Service Registries 55
Table of Abbreviations
DTD Document Type Definition
DWSC Deep Web Service Crawler
HTML HyperText Markup Language
MTBF Mean Time between Failures
MTTR Mean Time to Recovery
QoS Quality of Service
REST Representational State Transfer
RTT Round Trip Time
SOAP Simple Object Access Protocol
SQL Structured Query Language
URL Uniform Resource Locator
WHATWG Web Hypertext Application Technology Working Group
XML eXtensible Markup Language
WSDL Web Service Description Language
WSML Web Service Modeling Language
WSMO Web Service Modeling Ontology
Deep Web Service Crawler
2
Acknowledgements
The work presented in this master thesis is a result of the master task for Web Service project
which is provided by Computer Networks at Dresden University of Technology
In here I am heartily thankful to my supervisor Dipl-infJosef Spillner whose encouragement
guidance and support from the initial to the final level enabled me to develop an understanding of
the subject
Lastly I offer my regards and blessings to all of those who supported me in any respect during the
completion of the project
Deep Web Service Crawler
3
Abstract
Nowadays Web Service Registries offer convenient access to offering searching and using
electronic Web services Usually they host Web service descriptions along with related metadata
generated by both the system and the users Hence monitoring and rating information can help
users to distinguish similar Web service offerings However at present there is little support to
compare these Web services across platforms and to build a global view Besides for the metadata
not all of them especially non-functional property descriptions are made available in s structured
format
Therefore the task of this master thesis is to apply Deep Web analysis techniques to extract as
much information about these published Web services as possible Corresponding the result shall
be the largest annotated Service Catalogue ever produced
Index Terms
Web Service Deep Web Service Crawler Service-Finder Pica-Pica Web Service Description Crawler
WSDL
Deep Web Service Crawler
4
Table of Contents
Acknowledgements 2
Abstract 3
1 Introduction 7
11 BackgroundMotivation 7
12 Initial Designing of the Deep Web Service Crawler Approach 7
13 Goals of this Master Thesis 8
14 Outline of this Master Thesis 8
2 State of the Art 10
21 Service Finder Project 10
211 Use Cases for Service-Finder Project 10
2111 Use Case Methodology 10
2112 System Administrator 10
212 Architecture Plan for the Service-Finder Project 12
2121 The Principle of the Service Crawler Component 13
2122 The Principle of the Automatic Annotator Component 13
2123 The Principle of the Conceptual Indexer and Matcher Component 14
2124 The Principle of the Service-Finder Portal Interface Component 14
2125 The Principle of the Cluster Engine Component 15
22 Information Extraction 15
221 Input Types of Information Extraction 15
222 Extraction Targets of Information Extraction 17
223 The Used Techniques in Information Extraction 18
23 Pica-Pica Web Service Description Crawler 19
231 Needed Libraries of the Pica-Pica Web Service Description Crawler 19
232 Architecture of the Pica-Pica Web Service Description Crawler 20
233 Implementation of the Pica-Pica Web Service Description Crawler 21
24 Conclusions of the Existing Strategies 22
3 Design and Implementation 23
31 Deep Web Services Crawler Requirements 23
311 Basic Requirements for DWSC 23
Deep Web Service Crawler
5
312 System Requirements for DWSC 23
313 Non-Functional Requirements for DWSC 24
32 Deep Web Services Crawler Architecture 24
321 The Function of Web Service Extractor Component 26
3211 Features of the Web Service Extractor Component 28
3212 Input of the Web Service Extractor Component 28
3213 Output of the Web Service Extractor Component 28
3214 Demonstration for Web Service Extractor 29
322 The Function of WSDL Grabber Component 30
3221 Features of the WSDL Grabber Component 31
3222 Input of the WSDL Grabber Component 31
3223 Output of the WSDL Grabber Component 31
3224 Demonstration for WSDL Grabber Component 31
323 The Function of Property Grabber Component 33
3231 Features of the Property Grabber Component 36
3232 Input of the Property Grabber Component 37
3233 Output of the Property Grabber Component 37
3234 Demonstration for Property Grabber Component 37
324 The Function of Storage Component 40
3241 Features of the Storage Component 42
3242 Input of the Storage Component 43
3243 Output of the Storage Component 43
3244 Demonstration for Storage Component 43
33 Multithreaded Programming for DWSC 46
34 Sleep Time Configuration for Web Service Registries 46
4 Experimental Results and Analysis 48
41 Statistic Information for Different Web Service Registries 48
42 Statistic Information for WSDL Document 49
43 Comparison of Different Average Number of Service Properties 50
44 Different Outputs of Web Services 52
45 Comparison of Average Time Cost for Different Parts of Single Web Service 54
5 Conclusion and Further Direction 59
6 Bibliography 60
Deep Web Service Crawler
6
7 Appendixes 61
Table of Figures 64
Table of Tables 65
Table of Abbreviations 66
Deep Web Service Crawler
7
1 Introduction
In this introduction chapter of the master thesis at first itrsquos going to concisely explain the background
of the current situation then is the basic introduction of the proposed approach which is called Deep
Web Service Extraction Crawler
11 BackgroundMotivation
In the late 1990rsquos the Web Service Registry was a hot commodity The formal definition of Web
Service Registry is known as a link links page Its function is to uniformly present information that
comes from various sources Hence it can provide a convenient channel to the users for offering
searching and using the Web Services Actually the related metadata of the Web Services that
submitted by both the system and users are commonly hosted along with the Service descriptions
Nevertheless in fact when users enter one of the Web Service Registries to look for some Web
Services they might meet some situations that would bring lots of trouble to them One of the
situations may be like that these Web Service Registries return several similar published Web Services
after the users search on it For example two or more Web Services have the same name but their
versions are not the same Or two or more Web Services that derived from the same server but have
different contents etc Furthermore most users are also interested in a global view of the published
services For instance they want to know which Web Service Registry can provide better quality for
the Web Service Therefore in order to help users to differentiate those similar published Web
Services and have a global view of the Web Services this information should be monitored and rated
Moreover there are a great many Web Service Registries in the Internet Each Web Service Registry
can provide a great number of Web Services Obviously there might have some similar Web Services
among these Web Service Registries Or a Web Service in one of the Web Service Registry is related to
another Web Service in other Web Service Registries Hence these Web Services should be
comparable across different Web Service Registries However recently there has not much support of
this In addition towards the metadata actually not all of them are structured especially the
descriptions of the non-functional property Therefore what have to do now is to turn those
non-functional property descriptions into the structured format Clearly speaking it needs to extract
as much information as possible about the Web Services that offered in the Web Service Registries
Eventually after extracting all the information from the Web Service Registries it is necessary to store
them into the disk This procedure should be efficient flexible and completeness
12 Initial Designing of the Deep Web Service
Crawler Approach
The problems have already been stated in the previous section Hence the following work is to solve
Deep Web Service Crawler
8
these problems In this section it will present the basic principle of Deep Web Service Crawler
approach
At first it is the simply introduction of the Deep Web Service Crawler approach to these problems As
have already been mentioned each Web Service Registry can offer Web Services Moreover each
Web Service Registry has its own html page structures These structures may be the same or even
complete different Therefore the first thing is to identify which Web Service Registry that it will be
going to explore Since each Web Service Registry owns a unique URL this job can be done by directly
analyzing the corresponding URL address of that Web Service Registry After identifying which Web
Service Registry it is going to explore the following step is to obtain all these Web Services that
published in that Web Service Registry Then with all these obtained Web Services it is time to extract
analyze and gather the information of the services That information can be in structured format or
even in unstructured format In this master thesis some Deep Web Analysis Techniques will be
applied to obtain this information So that the information about each Web Service shall be the
largest annotated The last but not the least important all the information about the Web Services
need to be stored
13 Goals of this Master Thesis
The lists in the following are the goals of this master thesis
n Produce the largest annotated Service Catalogue
Service Catalogue is a list of service properties The more properties the service has the larger
Service Catalogue it owns Therefore this master program should extract as much service
properties as possible
n Flexible storage of these metadata of each service as annotations or dedicated documents
The metadata of one service includes not only the WSDL document but also service properties
All these metadata are important information for the service Therefore this master program
should provide flexible ways to store these metadata into the disk
n Improve the comparable property of the Web Services across different Web Service Registries
The names of service properties for one Web Service Registry could be different from another
Web Service Registry Hence for the purpose of improving the comparable ability all these
names of the service properties should be uniformed and well-defined
14 Outline of this Master Thesis
In this chapter the motivation objective and initial approach plan have already been discussed
Thereafter the remaining paper is structured as follows
Firstly in chapter 2 the work presented is almost based on these existing techniques In section 21
there is given a detailed introduction to the technique of the Service-Finder project Then in section
22 the already implemented technique of Pica-Pica Web Service Description Crawler is introduction
and discussed After that in section 23 the Information Retrieval technique is presented
Deep Web Service Crawler
9
Then following in chapter 3 it is explained the designing details of this Deep Web Service Crawler
approach In section 31 it gives a short description for the different requirements of this approach
Next in section 32 the actual design for the Deep Web Service Crawler is presented Then in section
33 34 the multithreaded programming and sleep time configuration that used in this master
program are introduced respectively
In chapter 4 it is supposed to display the experiments of this Deep Web Service Crawler approach
and then give some evaluation of it
Finally in Section 5 the conclusion and discussion of the work that already done as well as the work in
the future for this master task are presented respectively
Deep Web Service Crawler
10
2 State of the Art
This chapter aims at presenting some existing techniques or Strategies that related to the work of
applying this Deep Web Service Extraction Crawler approach Section 21 plans to talk about the
existing catalogues Service-Finder project And then in section 22 it is going to explain the existing
implemented crawlers Pica-Pica Web Service Description Crawler Finally in the section 23 it is
supposed to present some details about the Information Extraction technique
21 Service Finder Project
Service-Finder Project aims at developing a platform for Web Service discovery especially for the Web
Services that are embedded in a Web 20 environment [1] Hence it can provide an efficient access to
publicly available services The goals of the Service-Finder project are depicted as follows [1]
n Automatically gather Web Services and their related information
n Semi-automatically create semantic service description based on the information that available
on the Web
n Create and improve semantic annotations via the user feedback
n Describe the aggregated information in semantic models and allow reasoning query
However before describing the basic functionality of the Service-Finder Project there is going to
present one of its use cases and requirements first
211 Use Cases for Service-Finder Project
The Service-Finder Project employed the use case methodology of the W3C Use Case description [6]
for its needs and then applied this methodology to the use cases that it enumerated
2111 Use Case Methodology
There are three aspects needed to be considered for use case definitions of Service-Finder Project [1]
(1) Description that used to describe information of the use case
(2) Actors Roles and Goals that used to identify the actors they would be the roles they act and the
goals they need to achieve in the scenario
(3) Storyboard that used to describe the serial of interactions among the actors and the
Service-Finder Portal
2112 System Administrator
This section is going to present the use case that applied to the Service-Finder portal and that
illustrated the requirements on its functionality from a user point of view However all these
Deep Web Service Crawler
11
information in this use case are derived from [1] In this use case there has a system administrator
whose name is ldquoSam Adamsrdquo He works for the bank His job is to keep the online payment facilities
online and working all day and night Therefore if there is any system failures Sam Adams should fix
the problems as early as he can Thatrsquos why he wants to use an SMS Messaging service which will
alert him immediately by sending him a SMS Message in the case of a system failure
n Description
This use case is dedicated to describe the system administrator ldquoSam Adamsrdquo who is looking for an
SMS Messaging Service that he wants to build it into his application
n Actors Roles and Goals
The name of the Actor is ldquoSam Adamsrdquo His Role is a system administrator at a bank The goals of him
are the immediate service delivery the reliability of the service and low base fee and transaction fee
n Storyboard
Step 1 Because of Sam Adams knows the Service-Finder portal and he also knows that he can find
many useful services from it especially he know what he is looking for Hence he visits the
Service-Finder portal and starts to search by entering the keywords ldquoSMS Alertrdquo
Requirement 1 Search functionality
Step 2 Now the Service-Finder returns a list of matching services However Sam wants to choose the
number of matching services that will be displayed on one page And he would also expect there has
short information about the service functionality the service provider and the service availability So
that he could decide which service he will choose to read further
Requirement 2 Enable configurable pagination of the matching results and have some short
information for each service
Step 3 When Sam looks through the short information about the services that displayed on the first
page he expects to find the most relevant services that related to his request After that he would
like to read more detailed information about that service to see whether this service can provide the
needed functionality
Requirement 3 Rank the returned matching services and must provide ability to read more details of
a service
Step 4 In the case that all the returned matching services Sam got provide quite different
functionalities or they belong to different service categories for example the SMS messaging services
alert users not through SMS but voice messaging For this reason Sam would like to see other
different categories that may be contain the services he wants Or the services of other categories
which he is also interested in (like ldquoSMS Messagingrdquo) Besides another possible way is that Sam can
further filter his search in terms of browsing through categories
Requirement 4 Service categories and allow the user to look all services that belonged to that specific
category If possible it should also allow the user to browse through categories
Step 5 When Sam got all the services that could provide a SMS messaging service via the methods
described in the Step 4 at present he wants to look for the services that offered by an Austrian
provider and have no base fees if possible
Requirement 5 Faceted search
Deep Web Service Crawler
12
Step 6 After Sam got all these specific services now he would like to choose the services that can
provide a high reliability
Requirement 6 Sort functionality based on usersrsquo chooses
Step 7 For now Sam expects to compare the service availability between the promised to the service
provider and the actually provided This should be contained in the servicesrsquo details And there needs
also have service coverage information so that Sam can know whether this service covers the areas
he lives and works Moreover Sam would also like to compare these services in other way For
instance put some services into a structured table to compare the transaction fees
Requirement 7 A side-by-side comparison table for services and a functionality that enable users to
select services he wants to compare
Step 8 At last Sam wants to know whether the service providers offer a free try out of the services
So that he can test the service functionality
Requirement 8 If possible display a note that offering free service trials
212 Architecture Plan for the Service-Finder Project
The architecture plan of the Service-Finder project contains five basic components Service Crawler
Automatic Annotator Conceptual Indexer and Matcher Cluster Engine and Service-Finder Portal
Interface Figure 2-1 presents a high level overview of the components and the data flow among
them
Figure2-1Dataflow of Service-Finder and Its Components [3]
Deep Web Service Crawler
13
2121 The Principle of the Service Crawler Component
The Service Crawler is responsible for gathering available services and their related information from
the Web The overall cycle is depicted as following
(1) A Web developer publishes a Web Service
(2) Then the Crawling component begins to harvest the Web in order to identify the Web Services
like WSDL (Web Service Description Language) documents
(3) The Crawler is also going to search for other related information as long as a service is discovered
(4) Later after each periodic interval the Crawler will produce a consistent snapshot of the relevant
part of the Web
At last the output of the crawler would be forwarded to the subsequent components for analyzing
indexing and displaying
2122 The Principle of the Automatic Annotator Component
The Automatic Annotator receives the relevant data from previous component and generates
semantic service descriptions about the WSDL documents and its related information based on the
Service-Finder Ontology and Service Category Ontology
Firstly it will simply introduce those two compatible ontologies that would be used throughout the
whole process [2]
n Generic Service Ontology it is an ontology which is functional to describe the data objects For
example the services the service providers availability payment modalities and so on
n Service Category Ontology it is an ontology which is used to categorize the functionalities or
applications of the services For instance data verification messaging data storage weather etc
Afterwards it is going to talk about the function of this component with its input output
Oslash Input
u Crawled data from Service Crawler
u Service-Finder Ontologies
u Feedback or Correction of before annotations
Oslash Function
u Enrich the information about the service and extract semantic statements according to the
Service-Finder Ontologies For example categorize the service according to the Service
Category Ontology
u Determine whether a particular document is relevant or not through the Web link graph If
not discard these irrelevant documents
u Classify the pages into their genres For instance pricing user comments FAQ and so on
Oslash Output
u Semantic annotation of the services
Deep Web Service Crawler
14
2123 The Principle of the Conceptual Indexer and Matcher
Component
The Conceptual Indexer and Matcher is more like a data store center that aims at storing all extracted
information of the services and supplying users the capability of retrieval and semantic query For
example the matchmaking between user requests and service offers and the act of retrieving user
feedback on extracted annotations
In addition letrsquos have a look of the function of this component and its input output
Oslash Input
u Semantic annotation data and full text information obtained from Automatic Annotation
u Semantic annotation data and full text information that come from user interfaces
u Cluster data from user and service clustering component
Oslash Function
u Store the semantic annotations received from the Automatic Annotation component and
from the user interface
u Store the cluster data that procured through the clustering component
u Store and index the textual description offered by the Automatic Annotation component
and the textual comments offered by users
u Ontological query the semantic data from the data store center
u Combined keyword and Ontological querying used for user queries
u Provide a list of similar services for a given service
Oslash Output
u A list of matching services that are queried by users In particular these services should be
sorted by ranking and can also be iterated
u All available data that related to a particular entity must be retrievable at the user interface
2124 The Principle of the Service-Finder Portal Interface
Component
The Service-Finder Portal Interface is the main entry point that provided for users of the
Service-Finder system to search and browse the data which is managed by the Conceptual Indexer
and Matcher component In addition the users can also contribute information by means of providing
tags comments categorizations and ratings to the data browsed Furthermore the developers can
still directly invoke the Service-Finder functionalities from their custom applications in terms of an API
Besides the details of this componentrsquos function input and output are represented as below
Oslash Input
u A list of ordered services for a query
u Detailed information about a service or a set of services and a service provider
u Query access to service category ontology and the most used tags provided by the users
Deep Web Service Crawler
15
u Service availability information
Oslash Function
u The Web Interface allows the users to search services by keyword tag or concept in the
categorization sort and filter query results by refining the query compare and bookmark
services try out the services that offer this functionality
u The API allows the developers to invoke Service-Finder functionalities
Oslash Output
u Explicit user annotations such as tags ratings comments decryptions and so on
u Implicit user data for example click stream of users bookmarks comparisons links sent
etc
u Manual advertising of available new services
2125 The Principle of the Cluster Engine Component
The Cluster Engine is in charge of collecting and analyzing information of user behaviors from the
Service-Finder Portal eg the queried services and the compared services of the users Moreover it
also provides cluster data to the Conceptual Indexer and Matcher for providing service
recommendations
Furthermore letrsquos detailed introduce this componentrsquos function input and output
Oslash Input
u Service annotation data of both extracted and user feedback
u Usersrsquo Click streams used for extracting user behaviors
Oslash Function
u Obtain user clusters from user behaviors
u Obtain service clusters from service annotation data to enable to find similar services
Oslash Output
u Clusters of users and services
22 Information Extraction
Due to the cause of the rapid development and use of the World-Wide Web there procedure a huge
amount of information sources on the Internet which has been limited the access to browsing and
searching for the reason of the heterogeneity and the lack of structure of Web information sources
Therefore the appearance of Information Extraction that transforms the Web pages into
program-friendly structures for post-processing would become a great necessity However the task of
Information Extraction is specified in terms of the inputs and the extraction targets And the
techniques used in the process of Information Extraction called extractor
221 Input Types of Information Extraction
Generally speaking there are three different input types The first input type is the unstructured
Deep Web Service Crawler
16
document For example the free text that showed in figure 2-2 It is unstructured and written in
natural language So that it will require substantial natural language processing While the second
input type is called the structured document For instance the XML documents based on the reason
that the data can be described through the available DTD (Document Type Definition) or XML
(eXtensible Markup Language) schema Finally but obviously the third input type is the
semi-structured document that are widespread on the Web Such as the large volume of HTML
pages like tables itemized lists and enumerated lists This is because HTML tags are often used to
render these embedded data in the HTML pages See figure 2-3
Figure2-2Left is the free text input type and right is its output [4]
Figure2-3A Semi-structured page containing data records
(in rectangular box) to be extracted [4]
Therefore in this way the inputs of semi-structured type can be seen as the documents with a fairly
regular structure And the data of these documents can be displayed in a format of HTML way or
non-HTML way Besides owing to the reason that the Web pages of the Deep Web are dynamic and
generated from structured databases in terms of some templates or layouts thus it would be
considered as one of the input sources which could provide some of these semi-structured documents
For example the authors price and comments of the book pages that provided by Amazon have the
Deep Web Service Crawler
17
same layout That is because these Web pages are generated from the same database and applied
with the same template or layout Furthermore there has another option which could manually
generate HTML pages of semi-structured type For example although the publication lists that
provided from different kinds of researchersrsquo homepages are produced by diverse uses they all have
title and source property for every single pager Eventually the inputs for some Information Extraction
can also be the pages with the same class or among various Web Service Registries
222 Extraction Targets of Information Extraction
Moreover regarding the task of the Information Extraction it has to consider the extraction target
There also have two different extraction targets The first one is the relation of k-tuple And the k in
there means the number of attributes in a record Nevertheless in some cases an attribute of one
record may have none instantiation Otherwise the attribute owns multiple instantiations In addition
the complex object with hierarchically organized data would be the second extraction target Though
the ways for depicting the extraction targets in a page are diverse the most common structure is the
hierarchical tree This hierarchical tree may contain only one leaf node or one or more lists of leaf
nodes which called internal nodes And the structure for a data object may also be flat or nested To
be brief if the structure is flat then there is only one leaf node that call also be called root Otherwise
if it is nested structure then the internal nodes that involved in this data object would be more than
two levels
Furthermore in order to make the Web pages readable for human being and having an easier
visualization these tables or tuples of the same list or elements of a tuple should be definitely isolated
or demarcated However the displaying for a data object in a Web page would be affected by
following conditions [4]
Oslash The attribute of a data object has zero or several values
(1) If there is no value for the attribute of a data object this attribute will be called the ldquononerdquo
attribute For example a special offer only available for certain books might be a ldquononerdquo
attribute
(2) If there are more than one values for the attribute of a data object it will be called the
ldquomultiValuerdquo attribute For instance the name of the author for a book could be a
ldquomultiValuerdquo attribute
Oslash The set of attributes (A1 A2 A3 hellip) has multiple ordering
That is to say among this set of attribute the position of the attribute might be changed
according to the diverse instances of a data object Thus this attribute will be called the
ldquomultiOrderingrdquo attribute For instance for the moives before year 1999 the move site would
enumerate the release data in front of the movesrsquo title while for the movies after year 1999
(including 1999) it will enumerate the release data behind the movesrsquo title
Oslash The attribute has different formats
This means the displaying format of the data object could be completely distinct with respect to
these different instances Therefore if the format of an attribute is free then a lot of rules will be
needed to deal with all kinds of possible cases This kind attribute will be called ldquomultiFormatrdquo
attribute For example an ecommerce Web site would use the bold font format to present the
general prices while use the red color format to display the sale prices Nevertheless there has
Deep Web Service Crawler
18
another situation that some different attributes for a data object have the same format For
example various attributes are presented in terms of using the ltTDgt tags in a table presentation
And the attributes like those could be differentiated by means of the order information of these
attributes However for cases that there occurs ldquononerdquo attribute or exists ldquomultiOrderingrdquo
attributes it must have to revise the rules for extracting these attributes
Oslash The attribute cannot be decomposed
Because of the easier processing sometimes the input documents would like to be treated as
strings of tokens instead of the strings of characters In addition some of the attribute cannot
even be decomposed into several individual tokens These attributes are called the ldquountokenizedrdquo
attributes For example the college course catalogue like ldquoCOMP4016rdquo or ldquoGEOL2001rdquo The
department code and the course number in them cannot be separated into two different strings
of characters like that ldquoCOMPrdquo and ldquo4016rdquo or ldquoGEOLrdquo and ldquo2001rdquo
223 The Used Techniques in Information Extraction
The extractor used in the process of Information Extraction aims at providing a single uniform query
interface to access information sources like database server and Web server It consists of following
phases collecting returned Web pages labeling these Web pages generalizing extraction rules
extracting the relevant data and outputting the result in an appropriate format (XML format or
relational database) for further information integration For example at first the extractor queries the
Web server to gather the returned pages through the HTTP protocols after that it starts to extract the
contents among these HTML documents and integrate with other data sources thereafter Actually
the whole process of the extractor follows below steps
Oslash Step 1
At the beginning it must have to tokenize the input However there are two different
granularities for the input string tokenization They are tag-level encoding and word-level
encoding The tag-level encoding will transform the tags of HTML page into the general tokens
while transform all text string between two tags into a special token Nevertheless the
word-level encoding does this in another way It treats each word in a document as a token
Oslash Step 2
Next it should apply the extraction rules for every attributes of the data object in the Web pages
These extraction rules could be induced in terms of a top-down or bottom-up generalization
pattern mining or logic programming In addition the type of extraction rules may be indicated
by means of regular grammars or logic rules For example some use path-expressions of the
HTML parse tree path like htmlheadtitle or html-gttable[0] some use syntactic or semantic
constraints and some use delimiter-based constraints such as HTML tags or literal words
Oslash Step 3
After that all these extracted data would be assembled into the records
Oslash Step 4
Finally iterate this process until all these data objects in the input
Deep Web Service Crawler
19
23 Pica-Pica Web Service Description Crawler
The Pica-Pica is knows as a kind of bird species it can also be called pie However at the moment the
Pica-Pica here is a Web Service Description Crawler which is designed to solve the quality of Web
Services problem For example the evaluation of the descriptive quality of Web Services that offered
and how well are these Web Services described in nowadaysrsquo Web Service Registries
231 Needed Libraries of the Pica-Pica Web Service
Description Crawler
This version of the Pica-Pica Web Service Description Crawler is written by Anton Caceres and Josef
Spillner and programmed in terms of the Python language Actually in order to run these scripts to
parse the HTML pages it needs two additional libraries Beautiful Soup and Html5lib
n Beautiful Soup
It is an HTMLXML parser for Python language And it can even turn these invalid markups into a
parse tree [5] Moreover the following three features can make it more powerful
u The bad markup doesnrsquot choke the Beautiful Soup In fact it will generate a parse tree that
makes approximately as much sense as the original document Therefore you can obtain
the data that you want
u The Beautiful Soup has a toolkit that can provide simple idiomatic methods for navigating
searching and modifying the parse tree Hence you donrsquot need to create a custom parse
for every application
u If the document has already specified an encoding then you can ignore it since the
Beautiful Soup can convert the documents from Unicode to UTF-8 in an automatic way
Otherwise what you have to do is just to specify the encoding of the original documents
Furthermore the ways of including Beautiful Soup into the application are displayed in the
following [5]
sup2 From BeautifulSoup import BeautifulSoup For processing HTML
sup2 From BeautifulSoup import BeautifulStoneSoup For processing XML
sup2 Import BeautifulSoup To get everything
n Html5lib
It is a Python package which can implement the HTML5 [8] parsing algorithm And in order to
gain maximum compatibility with the current major desktop web browsers this implementation
will be based on the WHATWG (Web Hypertext Application Technology Working Group) HTML5
specification
Deep Web Service Crawler
20
232 Architecture of the Pica-Pica Web Service
Description Crawler
Figure2-4 Architecture of the Pica-Pica Web Service Description Crawler
Figure 2-4 illustrates the primary architecture of Pica-Pica Web Service Description Crawler It includes
four fundamental components Service Page Grabber component WSDL Grabber component
Property Grabber component and WSML Register component
(1) The Service Page Grabber components is going to take the URL seed as the input and output the
link of the service page into following two components WSDL Grabber component and Property
Grabber component
(2) The WSDL Grabber component is responsible for obtaining the WSDL document based on the
delivered service pagersquos link And then check whether the validation for these obtained WSDL
document Finally only these valid WSDL document will be passed into the WSML Register
component for further processing
(3) The Property Grabber component will try to extract the servicersquos property the hosted in the
service page if there exists After that all these servicersquos properties would be saved into an INI
file as the information of that service
(4) The functionality of the WSML Register component is to write appropriate WSML document by
means of the valid WSDL documents that delivered from WSDL Grabber component and the
optionally INI files that delivered from Property Grabber component Afterwards register them in
Conqo
n WSML [9]
It stands for Web Service Modeling Language which provides a framework with different
language variants Hence it is often used to describe the different aspects of the semantic Web
Services according to the conceptual model of WSMO
n WSMO [10]
WSMO whose full name is Web Service Modeling Ontology is dedicated to describe various
aspects related to the Semantic Web Services based on the ontologies Ontology is a formal
explicit specification of a shared conceptualization In fact these ontologies are the sticking points
that can satisfy the linkage between the agreement of the communities of users and the defined
conceptual semantics of the real-world
n Conqo [11]
Deep Web Service Crawler
21
It is a discovery framework that considers not only the Quality of Service (QoS) but also the
context information It will use a Web Service Repository to manage these service descriptions
that based on WSML
233 Implementation of the Pica-Pica Web Service Description Crawler
This section describes the implementation of the Pica-Pica Web Service Description Crawler in
detail.
(1) Firstly, to start the whole crawling process, the Pica-Pica Web Service Description Crawler needs
an input as the initial seed. For this crawler there are five Web Service Registries, which are
listed below. The URL addresses of these five Web Service Registries are used as the input
seeds for the Pica-Pica Web Service Description Crawler. Moreover, in this version of the
crawler there is a separate Python script for each Web Service Registry, and the crawling
processes of these per-registry scripts are executed one after another.
Biocatalogue: http://www.biocatalogue.com
Ebi: http://www.ebi.ac.uk
Seekda: http://www.seekda.com
Service-Repository: http://www.service-repository.com
Xmethods: http://www.xmethods.net
(2) After being fed with the input seed, the crawler steps into the next component, the Service
Page Grabber. At first, this component tries to read the data from the Web based on the input
seed. Then it builds a parse tree of the retrieved data by means of the functions of the
Beautiful Soup library. After that, the Service Page Grabber component starts to look for the
service page link of each service published in the Web Service Registry, by means of the
functions of the Html5lib library. In the case that the service page link of a single service is
found, the component first checks whether this service page link is valid. Once the service page
link is valid, it is passed to the following two components for further processing: the WSDL
Grabber component and the Property Grabber component.
(3) When the WSDL Grabber component receives a service page link from the previous
component, it sets out to extract the WSDL link address of that service from the parse tree of
the data in this service page. Next, this component downloads the WSDL document of that
service via the WSDL link address. Thereafter, the obtained WSDL document is stored on disk.
The process of the WSDL Grabber component is carried on continually until no more service
links are passed to it. Certainly, not all grabbed WSDL documents are valid: they may contain
bad definitions or a bad namespace URI, be empty documents, or, even worse, not be in XML
format at all. Hence, in order to pick these out, this component further analyzes the involved
WSDL documents and puts all valid documents into a "validWSDLs" folder, whereas the invalid
documents are put into a folder named "invalidWSDLs" in order to gather statistical
information. Finally, only the WSDL documents in the "validWSDLs" folder are passed to the
subsequent component.
(4) Moreover, since some Web Service Registries provide additional information about the
services, such as availability, service provider, and version, the Property Grabber component
sets out to extract this information as the service's properties and thereafter saves these
properties into an INI file. However, if no additional information is available, there is no need
to extract the service properties, and thus no INI file is created for that service. In this
implementation of the Pica-Pica Web Service Description Crawler, only the Python scripts for
the Seekda and Service-Repository Web Service Registries have functions to extract the
services' properties, while the scripts for the other three Web Service Registries lack such
functions.
(5) Furthermore, it is optional to create a report file which contains the statistical information of
this process, such as the total number of services for one Web Service Registry, the number of
services whose WSDL document is invalid, etc.
(6) As has been stated, at this point there is a folder with all valid WSDL documents and possibly
also some INI files. Therefore, the task of the WSML Register component is now to generate
the appropriate WSML documents from these valid WSDL documents and INI files, and then to
register them in ConQo.
24 Conclusions of the Existing Strategies
This chapter presented three aspects of the existing strategies: the Service-Finder project, the
Information Extraction technique, and the Pica-Pica Web Service Description Crawler.
The task of this master program is to obtain the available Web services and their related
information from the Web. This is in fact a procedure of extracting the needed information about
each service, such as the service's WSDL document and its properties. Therefore, the Information
Extraction technique, which is used to extract information hosted on the Web, can be applied in
this master program.
Moreover, Service-Finder is a large project that is not only able to obtain the available Web services
and their related information from the Web, but also enriches the crawled data with annotations by
means of the Service-Finder ontology and the Service Category ontology, and integrates all the
information into a coherent semantic model. Furthermore, it also provides the capabilities for
searching and browsing the data with a user interface and gives users service recommendations.
However, for a master program the Service-Finder project far exceeds the requirements. Therefore,
it is only considered as a reference for this master program.
Furthermore, since the Pica-Pica Web Service Description Crawler aims only at obtaining the
available Web services and their related information, it fulfills the primary task of this master
program. Nevertheless, regarding the information about the services, the Pica-Pica Web Service
Description Crawler extracts only a few properties, sometimes even none. Consequently, in order
to improve the quality of the service information, as many properties about each service as
possible have to be extracted. Thence, chapter 3 presents an extension of this Pica-Pica Web
Service Description Crawler.
3 Design and Implementation
In the previous chapter on the State of the Art, the already existing techniques and
implementations were presented. In the following, the basic principle of the proposed approach,
the Deep Web Services Crawler, is introduced. It is based on these previously existing techniques,
especially on the Pica-Pica Web Service Description Crawler.
31 Deep Web Services Crawler Requirements
This section mainly discusses the goals of the Deep Web Service Crawler approach, the system
requirements with respect to the approach, and some non-functional requirements.
311 Basic Requirements for DWSC
The following are the basic requirements that should be achieved:
(1) Produce the largest annotated Service Catalogue. Here, the Service Catalogue is actually a list of
the Web services published in the Web Service Registries. It contains not only the WSDL document
of each service but also a file with the information about that service. Therefore, for the purpose of
producing the largest annotated Service Catalogue, the proposed approach needs to extract as
many properties about those Web services as possible. Moreover, it also has to download the
WSDL document hosted along with each Web service. That is to say, these properties comprise not
only the interesting structured properties, such as the service name and its WSDL link address, but
also other non-functional properties, for example the endpoint and its monitoring information.
(2) Flexible storage: each service has a Service Catalogue entry that contains all its interesting
properties. How to deal with those service properties, i.e., what kind of scheme should be used to
store them, is a major question. Hence, in order to store them in a flexible way, the proposed
approach provides three methods for storage: the first stores them as an XML file, the second
stores them in an INI file, and the third uses a database for the storage.
312 System Requirements for DWSC
Generally speaking, the requirements for carrying out a programming project include the following:
1) Operating System: Linux/Unix, Windows (XP, Vista, 2000, etc.)
2) Programming language: C++, Java, Python, C, etc.
3) Programming Tool: NetBeans, Eclipse, Visual Studio, and so on
In this master thesis, the scripts of the Deep Web Service Crawler approach are written in the Java
programming language. Besides, these code scripts have only been tested on the Windows XP and
Linux operating systems, but not on other operating systems.
313 Non-Functional Requirements for DWSC
In this part, several non-functional requirements for the Deep Web Service Crawler approach are
presented:
1) Transparency: the process of data exploration and data storage should be done automatically,
without the user's intervention. However, at the beginning the user should specify the path on the
hard disk which will be used to store the outputs of this program.
2) Fault tolerance: during the execution of this program, errors will inevitably happen. Therefore, in
order to keep the process from being interrupted, there must be some necessary error handling for
process recovery.
3) Completeness: this approach should extract as many of the interesting properties about each
Web service as possible, e.g., endpoint and monitoring information.
In addition, since the Pica-Pica Web Service Description Crawler has already implemented strategies
for the following five URLs, the proposed approach must support no fewer than these five Web
Service Registries:
Biocatalogue: http://www.biocatalogue.com
Ebi: http://www.ebi.ac.uk
Seekda: http://www.seekda.com
Service-Repository: http://www.service-repository.com
Xmethods: http://www.xmethods.net
32 Deep Web Services Crawler Architecture
In this section, an overview of the high-level architecture of the Deep Web Services Crawler
approach is introduced first. Thereafter, four subsections are presented that focus on outlining
each single component and on how they play together.
The current components and data flows of the Deep Web Service Crawler can be summarized as
depicted in Figure 3-1, using the continuous arrows. The crawler first tries to obtain the available
service list page links and the related service page links by crawling the Web (Web Service
Extractor). Then those gathered links are processed in two separate steps: one obtains the service's
WSDL document (WSDL Grabber), and the other collects the properties of each service (Property
Grabber). Finally, all these data are stored on the storage device (Storage). The whole detailed
process of Figure 3-1 is illustrated as follows.
Step 1:
When the Deep Web Service Crawler starts to run, the File Chooser container requires the
user to specify a path on the computer or on any other hard disk, because the Deep Web
Service Crawler program needs a place to store all of its outputs.
Step 2:
After that, the Web Service Extractor is triggered. It is the main entry point to the specific
crawling process. Since the Deep Web Service Crawler program is supposed to crawl for Web
services in some given Web Service Registries, the URL addresses of these Web Service
Registries are given as the initial seeds for this Web Service Extractor process. However,
since the page structures of these Web Service Registries are completely different, there is a
dedicated process for each Web Service Registry.
Figure 3-1 Overview of the Basic Architecture for the Deep Web Services Crawler
Step 3:
According to the given seed, two types of links are obtained by the Web Service Extractor
component: one is the service list page link, and the other is the service page link. A service
list page is a page that contains a list of Web services and possibly some information about
these Web services, while a service page is a page that contains much more information
about a single service. Finally, the component forwards these two types of links to the next
two components: the Property Grabber and the WSDL Grabber.
Step 4:
On the one hand, the Property Grabber component tries to gather the information about
the service that is hosted in the service list page and the service page, such as the name of
the service, its description, and its rating. Finally, all the information about the service is
collected together as the service properties, which are then delivered to the Storage
component for further processing.
Step 5:
On the other hand, the WSDL Grabber tries to obtain the WSDL link from the service list
page or the service page. This is because for some Web Service Registries the WSDL link is
hosted in the service list page, as in Biocatalogue, while for the other Web Service Registries
it is hosted in the service page, as in Xmethods. After the WSDL link is obtained, it is also
transmitted to the Storage component for further processing.
Step 6:
When the service properties and the WSDL link of a service are received by the Storage
component, they are stored on disk. The service properties are stored in three different
ways: as an XML file, as an INI file, or as one record inside a database table. For the WSDL
link, the Storage component first tries to download the page content according to the URL
address of the WSDL link. If this succeeds, the page content of the service is stored as a
WSDL document on disk.
Step 7:
Steps 3 to 6 describe the crawling process for only a single service. Hence, if there is more
than one service, or more than one service list page, in those Web Service Registries, the
crawling process from step 3 to step 6 is repeated again and again until no service or service
list page remains in those Web Service Registries.
Step 8:
Furthermore, after the crawling process for one Web Service Registry finishes, a file is
generated that contains some statistical information about this crawling process, for
example: the times when the crawling process of this Web Service Registry started and
finished, the total number of Web services in this Web Service Registry, how many services
have an empty WSDL document, the average number of service properties in this Web
Service Registry, and the average time cost for extracting the service properties, getting the
WSDL document, and generating the XML file, INI file, etc. A compact sketch of this overall
flow is given below.
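To make steps 3 to 8 concrete, the following is a minimal Java sketch of the overall crawling loop. It
is purely illustrative: all method names are assumptions modeled on the component names of
Figure 3-1, not the actual code of this thesis, and the component internals are left as stubs.

import java.util.*;

// Illustrative sketch of the DWSC control flow (steps 3 to 8); the methods are
// stubs standing in for the real Web Service Extractor, Property Grabber,
// WSDL Grabber, and Storage components.
public class DwscFlow {
    public static void crawlRegistry(String seedUrl) {
        for (String listPage : getServiceListPageLinks(seedUrl)) {                 // step 3
            for (String servicePage : getServicePageLinks(listPage)) {
                Map<String, String> props = grabProperties(listPage, servicePage); // step 4
                String wsdlLink = grabWsdlLink(listPage, servicePage);             // step 5
                store(props, wsdlLink);                                            // step 6
            }
        }                                       // step 7: repeat until no page is left
        writeStatisticsReport();                // step 8: per-registry statistics file
    }

    static List<String> getServiceListPageLinks(String seed) { return new ArrayList<String>(); }
    static List<String> getServicePageLinks(String listPage) { return new ArrayList<String>(); }
    static Map<String, String> grabProperties(String lp, String sp) { return new HashMap<String, String>(); }
    static String grabWsdlLink(String lp, String sp) { return null; }
    static void store(Map<String, String> props, String wsdlLink) { /* XML, INI, or database */ }
    static void writeStatisticsReport() { /* statistics and log files */ }
}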
321 The Function of Web Service Extractor Component
The Web Service Extractor component is responsible for gathering service list page links and service
page links. It pursues a focused crawl of the Web and only forwards service list page links and
service page links to the subsequent components for analysis and collection purposes. Therefore, it
identifies both the service list page links and the related service page links in these Web Service
Registries.
As can be seen from Figure 3-2, a crawl for Web services needs to start from a seed URL. The seed is
almost as important as the Web Service Extractor itself, as it highly influences the part of the Web
that needs to be crawled. The seed can, or shall, contain e.g. Web pages where Web services are
published or which talk about Web services.
Figure 3-2 Overview of the process flow of the Web Service Extractor Component
After being fed with the seed URL, the Web Service Extractor component starts to get the links of
the service list pages from the initial page of this URL seed. However, this process differs among
the five Web Service Registries. The following shows the different situations in these Web Service
Registries.
• Service-Repository Web Service Registry
In this Web Service Registry, the link of the first service list page is the URL address of its seed,
which means some Web services can already be found on the home page of the
Service-Repository Web Service Registry. The word "first" here implies that there is more than
one service list page link in this Web Service Registry. Therefore, the process of getting service
list page links in this registry is carried on continually until no more service list page links exist.
• Xmethods Web Service Registry
Although there are Web services on the home page of the Xmethods Web Service Registry, they
are only a small subset of the services in this Web Service Registry. Moreover, in the Xmethods
Web Service Registry there is a single page containing all Web services. Therefore, the service
list page link of that page has to be obtained.
• Ebi Web Service Registry
The situation in the Ebi Web Service Registry is a little bit like that in the Xmethods Web Service
Registry. That is to say, there is also one page that contains all Web services in this Web Service
Registry. However, this page is not the initial page of the input seed. Therefore, more than one
operation step is needed to get the service list page link of that page.
• Seekda Web Service Registry
In the Seekda Web Service Registry, the Web services are not contained in the initial page of the
input seed. The service list page link can be obtained after several additional operation steps.
However, there is a problem with getting the service list page links in this registry: simply put, if
there is more than one page containing Web services, then for some unknown reason the links
of the remaining service list pages cannot be obtained. In other words, only the link of the first
service list page can be obtained.
• Biocatalogue Web Service Registry
The process of getting service list pages in the Biocatalogue Web Service Registry is almost the
same as in the Seekda Web Service Registry. The only difference is that in the Biocatalogue Web
Service Registry all service list page links can be obtained if there is more than one service list
page.
After getting the links of the service list pages, the Web Service Extractor begins to get the link of
the service page of each service listed in a service list page. The reason why it can do this is that
there is an internal link for every service which leads to the service page. It is worth noting that
once a service page link is obtained, this component immediately forwards the service page link
and the service list page link to the subsequent two components for further processing.
Nevertheless, the process of obtaining the service page links is carried out continuously until all
services listed in that service list page are crawled. Analogously, the process of getting service list
pages is also carried out continuously until no more service list pages exist.
3211 Features of the Web Service Extractor Component
The main features are described in the following paragraphs
1) Obtain service list page links
A central task is to obtain the corresponding service list page links. A service list page link is a URL
address that leads to a public list of Web services together with some simple information about
these Web services, like the name of a service, an internal URL that links to another page containing
the detailed information about that service, and sometimes the link address of the WSDL document.
2) Obtain service page links
Once service list page links are found, a crucial aspect is to extract the internal link of each service to
aid the task of service information discovery. Therefore, it is the task of the Web Service Extractor to
harvest the HTML page content of the service list page, so that the service page links, which lead to
much more detailed information about the single Web services, can be obtained.
3212 Input of the Web Service Extractor Component
This component depends on specific input seeds, and the only input required for this component is
a seed URL. This URL seed will be one of the URLs listed in section 313.
3213 Output of the Web Service Extractor Component
The component produces two types of service-related page links from the Web:
• Service list page links
• Service page links
3214 Demonstration for Web Service Extractor
In order to give a comprehensive understanding of the process of the Web Service Extractor
component, the following figures provide an explanation. Though there are five URL addresses in
this section, only the URL of the Service-Repository is shown as an example.
1) The input seed is the initial URL address of the Service-Repository, which is
"http://www.service-repository.com".
2) As already said in section 321, the first service list page link of this Web Service Registry is its
input seed "http://www.service-repository.com". Figure 3-3 shows the corresponding service
list page of that link.
Figure 3-3 Service list page of the Service-Repository
Figure 3-4 Original source code of the internal link for the Web service "BLZService"
Figure 3-5 Code overview of getting the service page link in the Service-Repository
Figure 3-6 Service page of the Web service "BLZService"
3) Now that the service list page link is known, the next step is to acquire the service page link for
each of the services listed in the service list page. The text in the red box of Figure 3-4 shows
the internal link of the Web service "BLZService". However, this is not the complete link of the
service page: it has to be prefixed with the initial URL address of the Service-Repository Web
Service Registry, "http://www.service-repository.com". The code for getting the service page
link of a Web service in the Service-Repository is shown in Figure 3-5, and a sketch of this link
resolution follows this list. The final link of this service page is therefore
"http://www.service-repository.com/service/overview/-210897616", and Figure 3-6 is the
corresponding service page of that link.
4) Afterwards, those two links, the service list page link and the service page link, which were
gathered by the Web Service Extractor component, are immediately forwarded to the next two
components, the WSDL Grabber component and the Property Grabber component.
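In Java, this prefixing corresponds to resolving the relative internal link against the registry's base
URL. The following minimal sketch uses java.net.URL for this resolution; it is an illustrative
reconstruction, not the code of Figure 3-5, and the relative link value is taken from the "BLZService"
example above.

import java.net.URL;

// Sketch: resolve a service's internal link against the registry's base URL.
public class LinkResolver {
    public static void main(String[] args) throws Exception {
        URL base = new URL("http://www.service-repository.com");
        // Relative internal link as found in the service list page (cf. Figure 3-4):
        String internalLink = "/service/overview/-210897616";
        URL servicePage = new URL(base, internalLink);  // java.net.URL does the prefixing
        System.out.println(servicePage);                // the complete service page link
    }
}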
322 The Function of WSDL Grabber Component
The WSDL Grabber component acquires the WSDL link hosted on the Web, based on the service
list page link or the service page link. The whole process flow is illustrated in Figure 3-7.
Figure 3-7 Overview of the process flow of the WSDL Grabber Component
When the WSDL Grabber component receives the inputs delivered by the previous component, it
starts to get the WSDL link of the service based on these inputs. However, although the inputs of
the WSDL Grabber component are the links of the service page and the service list page, only one
of them contains the WSDL link of the corresponding service. That is to say, the WSDL link exists
either in the service page or in the service list page. The reason why both links need to be delivered
to this component is that only one of the five Web Service Registries, namely Biocatalogue, hosts
the WSDL link in the service list page, while for the other four Web Service Registries the WSDL link
is hosted in the service page. Therefore, the WSDL links for these four Web Service Registries are
obtained via the service page link, and for the Biocatalogue Web Service Registry the WSDL link is
obtained via the service list page link. However, there is a problem with getting the WSDL links in
the Biocatalogue Web Service Registry: in brief, some of the Web services listed in the service list
pages of the Biocatalogue Web Service Registry have no WSDL link; in other words, these services
have no WSDL document. In a situation like this, the value of the WSDL link of such a Web service is
assigned a "NULL" value. Nevertheless, for the Web services in the other four Web Service
Registries, the WSDL link always exists in the service page. Eventually, whenever the WSDL Grabber
component extracts the WSDL link of a single Web service, the link is immediately forwarded to the
Storage component for downloading the WSDL document at once.
3221 Features of the WSDL Grabber Component
The WSDL Grabber component has to provide the following feature:
• Obtain WSDL links
The WSDL link is the direct way to get to the page that contains the contents of the WSDL
document. It is actually a URL address, but at its end there is something like "wsdl" or "WSDL" to
indicate that it is an address that leads to the page of a WSDL document.
3222 Input of the WSDL Grabber Component
This component requires the following input data:
• Service list page link
• Service page link
3223 Output of the WSDL Grabber Component
The component will only produce the following output data:
• The URL address of the WSDL link for each service
3224 Demonstration for WSDL Grabber Component
This section presents a list of figures in order to give a comprehensive understanding of the process
of the WSDL Grabber component. Without a doubt, it uses the same Web service of the
Service-Repository as an example, too.
1) The input for this WSDL Grabber component is the link of the service page obtained from the
Web Service Extractor component. The address of this link is
"http://www.service-repository.com/service/overview/-210897616".
Figure 3-8 WSDL link of the Web service "BLZService" in the service page
2) Figure 3-9 shows the corresponding original source code of the WSDL link that is highlighted in
Figure 3-8.
Figure 3-9 Original source code of the WSDL link for the Web service "BLZService"
3) Figure 3-10 and Figure 3-11 show the code used to extract the WSDL link shown in Figure 3-9.
However, Figure 3-10 is the particular code for the Service-Repository Web Service Registry
only; for the other four Web Service Registries this code is different. The function
"getServiceRepositoryWSDLLink" first gets a list of nodes that have the same HTML tag name
"b". Then it checks all of these nodes one by one to see whether the text value of a node is
"WSDL". If a node fulfills this condition, the attribute value of its sibling, which is an "a"
element here, is extracted as the value of the WSDL link for this Web service. A sketch
approximating this logic follows this list.
Figure 3-10 Code overview of the "getServiceRepositoryWSDLLink" function
Figure 3-11 Code overview of the "oneParameter" function
4) Finally, after applying the functions shown in Figures 3-10 and 3-11, the WSDL link for the Web
service "BLZService" can be obtained. Its value is
"http://services.unitedplanet.de/blz/BlzService.asmx?WSDL".
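Since Figures 3-10 and 3-11 are not reproduced here, the tag-and-sibling lookup described in step 3
can be approximated as follows. This is a hedged reconstruction that uses the jsoup HTML parser as
a stand-in for whichever parsing library the thesis code actually employs; only the selection logic
(find a "b" node whose text is "WSDL", then read the "href" attribute of the neighbouring "a"
element) follows the description above.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Reconstruction of the "getServiceRepositoryWSDLLink" logic using jsoup.
public class WsdlLinkExtractor {
    public static String getServiceRepositoryWsdlLink(String servicePageUrl) throws Exception {
        Document doc = Jsoup.connect(servicePageUrl).get();
        for (Element b : doc.getElementsByTag("b")) {      // all nodes with tag name "b"
            if ("WSDL".equals(b.text().trim())) {          // text value equals "WSDL"?
                Element a = b.nextElementSibling();        // the sibling element
                if (a != null && a.tagName().equals("a")) {
                    return a.attr("href");                 // attribute value = WSDL link
                }
            }
        }
        return null;  // no WSDL link found on this service page
    }
}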
323 The Function of Property Grabber Component
The Property Grabber component is a module used to extract and gather all the Web service
information hosted on the Web, which is in fact the information shown in the service list page and
the service page. In the end, all the obtained Web service information is collected together as the
service properties, which are delivered to the Storage component for storing. The detailed process
flow of the Property Grabber component is illustrated in Figure 3-12.
The inputs of the Property Grabber component are the same as the inputs of the WSDL Grabber
component. However, there is still a little difference between them with respect to the seeds: as
already mentioned in section 322, for the WSDL Grabber component only one of the inputs is
sufficient to get the WSDL link, while the Property Grabber component needs both of them as
seeds. Therefore, once the Property Grabber component receives the needed inputs, it starts to
extract the information of that single Web service.
Figure 3-12 Overview of the process flow of the Property Grabber Component
After the Property Grabber component receives the inputs, it starts to extract the service
information of the Web service. Generally speaking, the service information consists of four
aspects, namely structured information, endpoint information, monitoring information, and Whois
information.
(1) Structured Information
The structured information can be obtained by extracting the information hosted in the
service page and the service list page. It is the basic descriptive information about the service,
such as the name of the service, the URL address through which the WSDL document can be
obtained, the description introducing the service, the provider who provides the service, its
rating, and the server that hosts this service. However, the elements constituting this
structured information differ among the Web Service Registries. For example, the rating
information of a Web service exists in the Service-Repository Web Service Registry, while the
Xmethods Web Service Registry does not have this information. In addition, even for Web
services in the same Web Service Registry, some elements of the structured information may
not exist. For instance, one service in a Web Service Registry may have a description, while
another service in the same Web Service Registry does not. Tables 3-1 to 3-5 show the
structured information that should be extracted for these five Web Service Registries.
Moreover, if the style of a Biocatalogue Web service is SOAP, there is some additional
information describing the SOAP operations of this service, while if it is REST, this additional
information describes the REST operations. This should also be considered a part of the
structured information. Table 3-6 and Table 3-7 list the information for these two kinds of
operations.
Service Name | WSDL Link | WSDL Version
Provider | Server | Rating
Homepage | Owner Homepage | Description
Table 3-1 Structured Information of the Service-Repository Web Service Registry
Service Name | WSDL Link | Provider
Service Style | Homepage | Implementation Language
Description | User Description | Contributed Client Name
Type of this Client | Publisher of this Client | Used Toolkit of this Client
Used Language of this Client | Used Operating System of this Client
Table 3-2 Structured Information of the Xmethods Web Service Registry
Service Name | WSDL Link | Server
Provider | Provider's Country | Service Style
Rating | Description | User Description
Service Tags | Documentation (within WSDL)
Table 3-3 Structured Information of the Seekda Web Service Registry
Service Name | WSDL Link | Port Name
Service URL Address | Implementation Class
Table 3-4 Structured Information of the Ebi Web Service Registry
Service Name | WSDL Link | Style
Provider | Provider's Country | View Times
Favorite Times | Submitter | Service Tags
Total Annotation | Provider Annotation | Member Annotation
Registry Annotation | Base URL | SOAP Lab Server Base URL
Description | User Description | Category
Table 3-5 Structured Information of the Biocatalogue Web Service Registry
SOAP Operation Name | Inputs and Outputs | Operation Description
Operation Tags | Part of Which Service
Table 3-6 SOAP Operation Information of the Biocatalogue Web Service Registry
REST Operation Name | Service Tags | Used Template
Operation Description | Part of Which Service | Part of Which Endpoint Group
Table 3-7 REST Operation Information of the Biocatalogue Web Service Registry
(2) Endpoint Information
Endpoint information is the port information of the Web service, which can be extracted only
from the service page. However, different Web Service Registries structure the endpoint
information of a Web service differently; hence some elements of the endpoint information
vary considerably. One thing deserves attention: the Ebi Web Service Registry has no endpoint
information at all for the Web services published in it. Moreover, although the Web services
within one Web Service Registry share the same structure of endpoint information, some
elements of the endpoint information may be missing or empty. Furthermore, these Web
Service Registries may even have no endpoint information for some of the Web services
published by them. Nevertheless, whenever endpoint information exists for a Web service,
there is at least one element, namely the URL address of the endpoint. Table 3-8 shows the
endpoint information that should be extracted for these five Web Service Registries.
Web Service Registry Name | Elements of the Endpoint Information
Service-Repository | Endpoint Name, Endpoint URL, Endpoint Critical, Endpoint Type, Bound Endpoint
Xmethods | Endpoint URL, Publisher of this Endpoint, Contact Email of this Publisher, Implementation Language of this Endpoint
Seekda | Endpoint URL
Biocatalogue | Endpoint Name, Endpoint URL
Table 3-8 Endpoint Information of these five Web Service Registries
Web Service Registry Name | Elements of the Monitoring Information
Service-Repository | Service Availability, Number of Downs, Total Uptime, Total Downtime, MTTR, MTBF, RTT Max of Endpoint, RTT Min of Endpoint, RTT Average of Endpoint, Ping Count of Endpoint
Seekda | Service Availability, Begin Time of Monitoring
Biocatalogue | Monitored Status of Endpoint, Overall Status of Service
Table 3-9 Monitoring Information of these five Web Service Registries
(3) Monitoring Information
Monitoring information is the measured statistical information about the Web service. It is
worth noting that the Ebi and Xmethods Web Service Registries do not have monitoring
information for any of the Web services published by them, while for the other three Web
Service Registries only a few Web services may lack this information. Table 3-9 displays the
monitoring information for these three Web Service Registries.
(4) Whois Information
Whois information is not extracted from the information hosted in the service page and the
service list page. It is the descriptive information about the service domain, which can be
gained by means of the address of the WSDL link. Because of that, the process of getting the
Whois information starts with first obtaining the service domain. The final value of the service
domain must not contain strings like "http", "https", "www", etc.; it must be reduced to the
top-level registrable domain (a sketch of this derivation in Java follows this list). After that, the
service domain database is queried by sending the value of the service domain to a Whois
client, which is simply a Web site on the Internet, for example
"http://www.whois365.com/cn/domain". If the information of that service domain exists, a
list of its information is returned as the output. However, the structure of the returned
information differs from service domain to service domain. Therefore, the most challenging
thing is that the extraction process has to deal with each different format of the returned
information. Table 3-10 lists the Whois information that needs to be extracted for all five Web
Service Registries.
Service Domain URL | Domain Name | Domain Type
Domain Address | Domain Description | State
Postal Code | City | Country
Country Code | Phone | Fax
Email | Organization | Established Time
Table 3-10 Whois Information for these five Web Service Registries
Finally, all the information of these four aspects is collected together and then delivered to the
Storage component for further storage processing.
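The derivation of the service domain from the WSDL link, described under (4), might look as
follows in Java. This is an illustrative sketch, not the thesis code: the two-label heuristic is sufficient
for the example domain "thomas-bayer.com" used in section 3234, but, as noted in the comments,
it would need a public-suffix list for multi-label domains such as "ebi.ac.uk".

import java.net.URI;

// Sketch: derive the service domain from a WSDL link (strip scheme, "www.", and path).
public class ServiceDomain {
    public static String fromWsdlLink(String wsdlLink) throws Exception {
        String host = new URI(wsdlLink).getHost();   // e.g. "www.thomas-bayer.com"
        if (host.startsWith("www.")) {
            host = host.substring(4);
        }
        String[] labels = host.split("\\.");
        // Naive heuristic: keep only the last two labels ("thomas-bayer.com").
        // Note: this is too coarse for domains like "ebi.ac.uk"; a real implementation
        // would consult a public-suffix list.
        if (labels.length > 2) {
            host = labels[labels.length - 2] + "." + labels[labels.length - 1];
        }
        return host;  // this value is then sent to the Whois client
    }

    public static void main(String[] args) throws Exception {
        System.out.println(fromWsdlLink("http://www.thomas-bayer.com/axis2/services/BLZService?wsdl"));
    }
}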
3231 Features of the Property Grabber Component
The Property Grabber component has to provide the following features:
• Obtain basic information
Generally speaking, the more information a Web service has, the better one can judge how
good this Web service is. Hence, it is necessary for the Property Grabber component to extract
all the basic information hosted in the service list page and the service page. This basic
information comprises the structured information, the endpoint information, and the
monitoring information.
• Obtain Whois information
For the same reason, namely that more information about a Web service allows a better
assessment of its quality, it is necessary to extract as much information about the Web service
as possible. Therefore, besides the basic information, the Property Grabber component also
obtains some additional information, called Whois information, such as the type of the domain,
the name of the person in charge, the postal code of the domain, the city, phone, fax, detailed
address, email, etc.
3232 Input of the Property Grabber Component
This component requires the following input data:
• Service list page link
• Service page link
3233 Output of the Property Grabber Component
The component produces the following output data:
• Structured information of each service
• Endpoint information of each service, if it exists
• Monitoring information for the service and its endpoint, if it exists
• Whois information of the service domain
All this information is collected together as the properties of each service. Thereafter, the collected
properties are sent to the Storage component.
3234 Demonstration for Property Grabber Component
The pictures from Figure 3-13 to Figure 3-16 depict the fundamental and primary procedure of the
Property Grabber component. To simplify the explanation, the example shown in this section uses
the same Web service as before.
1) The inputs of the Property Grabber component are the links of the service list page and the
service page received from the Web Service Extractor component. These links are
"http://www.service-repository.com" and
"http://www.service-repository.com/service/overview/-210897616".
2) When the "GetProperty" function is triggered, it calls its four sub functions shown in Figure
3-12.
Figure 3-13 Structured properties of the service "BLZService" in the service list page
Figure 3-14 Structured properties of the service "BLZService" in the service page
3) Firstly, the "getStructuredProperty" function tries to extract the structured information
displayed in the red boxes of Figure 3-13 and Figure 3-14. However, several elements of the
structured information have the same content, like the description shown in the service page
and in the service list page. Hence, in order to save extraction time and storage space,
elements with the same content are only extracted once. Moreover, the rating information
requires a transformation from non-descriptive to descriptive text, because its content consists
of several star images. The final results of the extracted structured information of this Web
service are shown in Table 3-11. Because there is no descriptive information for the provider,
homepage, and owner homepage, their values are assigned as "NULL".
Service Name: BLZService
WSDL Link: http://www.thomas-bayer.com/axis2/services/BLZService?wsdl
WSDL Version: 0
Server: Apache-Coyote/1.1
Description: BLZService
Rating: Four and a half stars
Provider: NULL
Homepage: NULL
Owner Homepage: NULL
Table 3-11 Extracted Structured Information of the Web service "BLZService"
4) Secondly, the "getEndpointProperty" function extracts the endpoint information displayed in
the red box of Figure 3-15. Though there is more than one endpoint record in the list, only one
of them is extracted as the endpoint information. That is because this master program is
intended to extract as much information as possible, but the extracted information should not
contain redundant entries. For this purpose, only one endpoint record is extracted, even if
there is more than one. Table 3-12 shows the final results of the endpoint information of this
Web service.
Figure 3-15 Endpoint information of the Web service "BLZService" in the service page
Endpoint Name: BLZServiceSOAP12port_http
Endpoint URL: http://www.thomas-bayer.com:80/axis2/services/BLZService
Endpoint Critical: True
Endpoint Type: production
Bound Endpoint: BLZServiceSOAP12Binding
Table 3-12 Extracted Endpoint Information of the Web service "BLZService"
5) Then it is time to extract the monitoring information by invoking the function
"getMonitoringProperty". Figure 3-16 displays two types of monitoring properties: the red box
above contains the monitoring information about the Web service, and the red box below lists
the monitoring information for its endpoints. As already mentioned before, only one endpoint
statistic record is extracted. Besides, as can be seen from Figure 3-16, there are two types of
availability; actually, they both represent the availability of this Web service, just like the
availability shown in Figure 3-14. Therefore, only one of these availability values is sufficient.
Table 3-13 shows the final results of this extraction process.
Figure 3-16 Monitoring information of the service "BLZService" in the service page
Service Availability: 100%
Number of Downs: 0
Total Uptime: 1 day, 19 hours, 19 minutes
Total Downtime: 0 seconds
MTBF: 1 day, 19 hours, 19 minutes
MTTR: 0 seconds
RTT Max of Endpoint: 141 ms
RTT Min of Endpoint: 0 ms
RTT Average of Endpoint: 57.7 ms
Ping Count of Endpoint: 112
Table 3-13 Extracted Monitoring Information of the Web service "BLZService"
6) Next, in order to extract the Whois properties, the "getWhoisProperty" function first has to
gain the service domain from the WSDL link. For this Web service, the gained service domain is
"thomas-bayer.com". Then it sends this service domain as input to the Whois client for the
querying process. After that, a list of information for that service domain is returned; see
Figure 3-17. Table 3-14 shows the extracted Whois information.
Figure 3-17 Whois information of the service domain "thomas-bayer.com"
Service Domain URL: thomas-bayer.com
Domain Name: Thomas Bayer
Domain Type: NULL
Domain Address: Moltkestr. 40
Domain Description: NULL
State: NULL
Postal Code: 54173
City: Bonn
Country: NULL
Country Code: DE
Phone: +4922855525760
Fax: NULL
Email: info@predic8.de
Organization: predic8 GmbH
Established Time: NULL
Table 3-14 Extracted Whois Information of the service domain "thomas-bayer.com"
7) Finally, all the information of these four aspects is collected together as the service properties,
and these service properties are then forwarded to the Storage component.
324 The Function of Storage Component
The Storage component uses the WSDL link from the WSDL Grabber component to download the
WSDL document from the Web and then stores it on disk. In addition, the service properties from
the Property Grabber component are also directly stored on disk, in three different manners, by
this Storage component. Figure 3-18 illustrates the process flow of the Storage component.
When the Storage component receives the needed inputs, the mediator function "Storager" of the
Storage component is triggered. It transforms the service properties into three different output
formats and stores them on disk: an XML file, an INI file, and database records. Besides, it also tries
to download the WSDL document via the URL address of the WSDL link and then stores the
obtained WSDL document on disk, too. This "Storager" function is composed of four sub functions,
namely the "getWSDL", "generateXML", "generateDatabase", and "generateINI" sub functions.
Each sub function is in charge of one aspect of the storage tasks.
Figure 3-18 Overview of the process flow of the Storage Component
(1) "getWSDL" sub function
The task of the "getWSDL" sub function is to download the WSDL document and then store it
on disk (a sketch of this logic follows the list of sub functions). Above all, it has to get the
content of the WSDL document. This procedure works as follows. First, the "getWSDL" sub
function checks whether the value of the received WSDL link equals "NULL". As already
presented in section 322, if a Web service has no WSDL link, the value of its WSDL link is
assigned "NULL". In such a case, the sub function creates a WSDL document whose name is
the service name appended with the mark "[No WSDL Document]"; obviously, this document
does not contain any content, it is an empty document. If the service does have a WSDL link,
this sub function tries to connect to the Internet via the URL address of the WSDL link. If it
succeeds, all the contents hosted on the Web are downloaded, stored on disk, and named
only with the name of the service. Otherwise, it creates a WSDL document whose name is
prefixed with "Bad" before the service name.
(2) "generateXML" sub function
The "generateXML" sub function takes the service properties as input, transforms them into
an XML file, and stores it on disk under the name of the service plus ".xml". XML stands for
eXtensible Markup Language, a markup language designed to transport and store data. The
first line of an XML file is the XML declaration, which defines the XML version and the
encoding used. For example, "<?xml version="1.0" encoding="UTF-8"?>" means this is an
XML file whose version is 1.0 and whose encoding is the 8-bit Unicode Transformation Format
character set. Besides, an XML file contains XML elements, each spanning everything from the
element's start tag to the element's end tag. Moreover, an XML element can also contain
other elements, simple text, or a mixture of both. However, an XML file must contain a root
element as the parent of all other elements.
(3) "generateINI" sub function
The "generateINI" sub function also takes the service's properties as input, but it transforms
them into an INI file and then stores it on disk under the name of the service plus ".ini". "ini"
stands for initialization. The INI file format is a de facto standard for configuration files; INI
files are simple text files with a basic structure. Generally speaking, an INI file contains three
different kinds of parts: sections, parameters, and comments. The parameter is the basic
element contained in an INI file. Its format is a key-value pair, which can also be called a
name-value pair. This pair is delimited by an equals sign "=", and the key (or name) always
appears to the left of the equals sign. A section is like a room that groups all its parameters
together. It always appears on a single line within a pair of square brackets "[]", and sections
may not be nested. A comment is descriptive text that begins with a semicolon ";"; anything
between the semicolon and the end of the line is ignored.
(4) "generateDatabase" sub function
The inputs of the "generateDatabase" sub function are the same as those of the previous two
sub functions. Instead of transforming them into a file like XML or INI, the "generateDatabase"
sub function turns them into data in a database by using SQL statements. SQL stands for
Structured Query Language, a database language designed for accessing and manipulating
data in a database. Most of the actions performed on a database are done with SQL
statements, and the primary statements of SQL include INSERT INTO, DELETE, UPDATE,
SELECT, CREATE, ALTER, and DROP. Therefore, for the purpose of transforming these service
properties into database records, this sub function first has to create a database, using the
"CREATE DATABASE" statement. Then it creates a table to store the data. A table is a
collection of related data entries and consists of columns and rows. Since the data for all five
Web Service Registries is not very large, one database table is enough for storing the service
properties. Because of that, the column names for the service properties must be uniform
and well-defined across all five Web Service Registries. Afterwards, the service properties of
each single service can be put into the table as a record with the "INSERT INTO" statement of
SQL.
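The download logic of the "getWSDL" sub function, described under (1), could be sketched as
follows. This is a hedged reconstruction, not the actual code shown later in Figure 3-19; the naming
conventions for the empty and failed cases follow the description above, while the ".wsdl" file
extension is an assumption.

import java.io.*;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Sketch of the "getWSDL" logic: handle NULL links, successful downloads, and failures.
public class GetWsdl {
    public static void getWsdl(String serviceName, String wsdlLink, File outputDir) {
        try {
            if (wsdlLink == null || wsdlLink.equals("NULL")) {
                // No WSDL link: create an empty, specially marked document.
                new File(outputDir, serviceName + "[No WSDL Document].wsdl").createNewFile();
                return;
            }
            // Download the content behind the WSDL link and store it under the service name.
            try (BufferedReader in = new BufferedReader(new InputStreamReader(
                     new URL(wsdlLink).openStream(), StandardCharsets.UTF_8));
                 PrintWriter out = new PrintWriter(
                     new File(outputDir, serviceName + ".wsdl"), "UTF-8")) {
                String line;
                while ((line = in.readLine()) != null) {
                    out.println(line);
                }
            }
        } catch (IOException e) {
            // Download failed: create a document prefixed with "Bad".
            try {
                new File(outputDir, "Bad" + serviceName + ".wsdl").createNewFile();
            } catch (IOException ignored) { }
        }
    }
}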
3241 Features of the Storage Component
The Storage component has to provide the following features:
• Generate different output formats
The final result of this master program is to store the information about the services on disk for
future work. This Storage component provides three different formats for storing the service
properties of the services published in the Web Service Registries. This makes the storage of
the services very flexible and also long-lived.
• Obtain the WSDL document
The important task of this component is to check whether the WSDL document can be
obtained from the WSDL link, because the WSDL document plays a decisive role in determining
the quality of the service. This Storage component provides the ability to deal with the
different situations in the process of obtaining the WSDL document.
3242 Input of the Storage Component
This component requires the following input data:
• The WSDL link of each service
• The property information of each service
3243 Output of the Storage Component
The component will produce the following output data:
• The WSDL document of each service
• An XML document, an INI file, and table records in the database
3244 Demonstration for Storage Component
The following figures describe the fundamental implementation code of this Storage component.
The detailed depiction is illustrated as follows.
1) As can be seen from Figure 3-19 to Figure 3-23, there are several common places among the
implementation codes. The first common place concerns the parameters defined in each of
these sub functions, namely "path" and "SecurityInt". The parameter "path" is the absolute
path on the computer disk. It is used for the storing procedure and has already been specified
by the user at the beginning of the whole program. The parameter "SecurityInt" is an
increasing integer which is used as a part of the name of the service; the reason for this is that
it prevents services that have the same name from overriding each other on disk. The content
of the red marks among the codes of these figures is the second common place. Its function is
to create a file or document for the service with the corresponding parameters.
2) Figure 3-19 displays the implementation code of the "getWSDL" sub function, which is
designed to get the WSDL document of a service based on its WSDL link. The parameter
"name" is the name of the service, which is used as the name of the WSDL document. The
parameter "linkStr" is the actual WSDL link, which is the most important parameter for this
sub function. The other two parameters, "statistic" and "log", are the objects of the text files
called "Statistic Information" and "Log Information", respectively. The "Statistic Information"
text file is used to record the statistical data of the services in each Web Service Registry, such
as the overall number of properties for that Web Service Registry, the overall number of
services, the number of services that have no WSDL link, the number of services that contain
no contents, the number of services whose WSDL links are not available, etc. The "Log
Information" text file records the results of the process steps and the problems encountered,
for example which service is being crawled now, which Web Service Registry it belongs to, the
reason why the WSDL document of a service could not be obtained, the reason why a service
is unreachable, and so on.
Figure 3-19 Implementation code for getting the WSDL document
3) Figure 3-20 and Figure 3-21 show the code for turning the service properties into the XML file
and the INI file, and for storing those two files on disk thereafter. The parameter "vec" is a
Vector of the "PropertyStruct" data type. This "PropertyStruct" data type is an instance of a
class consisting of two variables: name and value.
Figure 3-20 Implementation code for generating the XML file
Figure 3-21 Implementation code for generating the INI file
4) The code in Figure 3-22 and Figure 3-23 shows the process that turns the service properties of
all services in these five Web Service Registries into records of the database (a JDBC sketch of
this process follows this list). Therefore, a database has to be created first. The name of the
database can be arbitrary as long as it conforms to the naming rules of the database; the same
holds for the name of the table. Figure 3-22 displays the procedure of creating a table in the
database. Due to the reason that it is hard to decide on the length of each service property,
the data types of all service property columns are set to "Text". Figure 3-23 shows the code for
inserting the service properties into the table with the "INSERT INTO" statement.
Figure 3-22 Implementation code for creating the table in the database
Figure 3-23 Implementation code for generating the table records
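The table creation and record insertion of step 4 could look roughly as follows with JDBC. This is a
hedged sketch, not the code of Figures 3-22 and 3-23: the database name, table name, and the two
example columns are illustrative, and running it requires a JDBC driver (e.g. for MySQL, whose
"CREATE TABLE IF NOT EXISTS" syntax is used here) on the classpath.

import java.sql.*;

// Sketch: create one table with "Text" columns and insert a service's properties.
public class GenerateDatabase {
    public static void main(String[] args) throws SQLException {
        // Connection URL, user, and password are placeholders.
        try (Connection con = DriverManager.getConnection(
                "jdbc:mysql://localhost/dwsc", "user", "password");
             Statement st = con.createStatement()) {

            // One uniform table for all five registries; TEXT columns throughout,
            // since the length of each property is hard to predict.
            st.executeUpdate("CREATE TABLE IF NOT EXISTS services ("
                    + "service_name TEXT, wsdl_link TEXT)");

            // Insert one service's properties as a record.
            try (PreparedStatement ins = con.prepareStatement(
                    "INSERT INTO services (service_name, wsdl_link) VALUES (?, ?)")) {
                ins.setString(1, "BLZService");
                ins.setString(2, "http://www.thomas-bayer.com/axis2/services/BLZService?wsdl");
                ins.executeUpdate();
            }
        }
    }
}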
33 Multithreaded Programming for DWSC
Multithreaded programming is a built-in characteristic provided by the Java language. A
multithreaded program contains two or more separate parts that can execute concurrently; each
such part is called a thread. The use of multithreading makes it possible to create programs that
use the system resources efficiently, for example by making maximum use of the CPU, because the
idle time of the CPU can be kept to a minimum.
In this master program, there are five Web Service Registries that need to be crawled for the
services published in them. Moreover, the number of services published in each Web Service
Registry differs considerably, so the running time required for each Web Service Registry differs as
well. It can then happen that a Web Service Registry with fewer services has to wait until another
Web Service Registry with many more services finishes. Therefore, in order to reduce the waiting
time for the other Web Service Registries and to maximize the use of the system resources, it is
necessary to apply multithreaded programming to this master program. That is to say, this master
program creates a thread for each Web Service Registry, and these threads are executed
independently, as sketched below.
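A minimal sketch of this one-thread-per-registry scheme is given below. The "crawlRegistry"
method is a placeholder standing in for the whole per-registry crawling process, not the actual
entry point of the thesis code.

// Sketch: start one independent crawler thread per Web Service Registry.
public class RegistryThreads {
    static final String[] REGISTRIES = {
        "Service-Repository", "Ebi", "Xmethods", "Seekda", "Biocatalogue"
    };

    public static void main(String[] args) {
        for (final String registry : REGISTRIES) {
            new Thread(new Runnable() {        // anonymous Runnable, as in pre-Java-8 code
                public void run() {
                    crawlRegistry(registry);   // per-registry crawling process (placeholder)
                }
            }, registry).start();
        }
    }

    static void crawlRegistry(String registry) {
        System.out.println("Crawling " + registry + " in thread "
                + Thread.currentThread().getName());
    }
}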
34 Sleep Time Configuration for Web Service Registries
Since this master program is intended to download the WSDL documents and to extract the service
information of the Web services published in the Web Service Registries, it inevitably affects the
performance of the Web Service Registries. In addition, for the purpose of not exceeding their
throughput capability, these Web Service Registries surely restrict the rate of access. Because of
that, unknown errors sometimes happen when this master program is executing. For instance, the
program continually halts at one point without getting any more WSDL documents and service
information, the WSDL documents of some services of some Web Service Registries cannot be
obtained, or some service information is missing. Therefore, in order to obtain the largest number
of Web services published in these five Web Service Registries, and not to affect their throughput,
the access rate has to be configured for each service of all Web Service Registries.
In consequence, before going into the essential procedure for each single service of these Web
Service Registries, this master program calls the system's built-in function "sleep(long
milliseconds)". It is a public static function which causes the currently executing thread to sleep for
the specified number of milliseconds; in other words, it temporarily ceases execution for a while.
The following table shows the time interval of the sleep function for each Web Service Registry,
and a small code sketch follows the table.
Web Service Registry Name | Time Interval (milliseconds)
Service-Repository | 8000
Ebi | 3000
Xmethods | 10000
Seekda | 20000
Biocatalogue | 10000
Table 3-15 Sleep times of these five Web Service Registries
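In code, this throttling amounts to a short pause before each single service is processed, along the
lines of the following sketch. The interval lookup mirrors Table 3-15, while the method names are
illustrative assumptions.

// Sketch: pause before processing each service to respect a registry's access rate.
public class Throttle {
    // Sleep intervals in milliseconds, taken from Table 3-15.
    static long intervalFor(String registry) {
        if (registry.equals("Service-Repository")) return 8000;
        if (registry.equals("Ebi")) return 3000;
        if (registry.equals("Xmethods")) return 10000;
        if (registry.equals("Seekda")) return 20000;
        if (registry.equals("Biocatalogue")) return 10000;
        return 10000;  // conservative default
    }

    static void beforeEachService(String registry) {
        try {
            Thread.sleep(intervalFor(registry));  // temporarily cease execution
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();   // restore the interrupt flag
        }
    }
}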
4 Experimental Results and Analysis
This chapter shows the quantitative experimental results of the prototype presented in chapter 3,
and it also describes and explains the analysis of these results. In order to obtain rather accurate
results, the experiments were carried out more than five times; all data displayed in the following
tables and charts are the averages over these runs.
41 Statistic Information for Different Web Service Registries
This section discusses the amount statistics of the Web services published in these five Web Service
Registries. This includes the overall number of Web services published in each Web Service Registry
and the number of unavailable Web services, which have been archived because they may not be
active anymore or are close to being non-active. Table 4-1 shows the service amount statistics of
these five Web Service Registries.
Web Service Registry Name   Service Repository   Ebi   Xmethods   Seekda   Biocatalogue
Overall Services            57                   289   382        853      2567
Unavailable Services        0                    0     0          0        125
Table 4-1 Service amount statistic of these five Web Service Registries
Nevertheless, in order to give an intuitive view of the service amount statistics of these five Web Service Registries, the data of table 4-1 are also rendered as a bar chart, see figure 4-1. As can be seen from the bar chart, on the one hand the overall number of Web services increases from the Service Repository Web Service Registry to the Biocatalogue Web Service Registry. The Biocatalogue Web Service Registry owns the largest number of Web services, which indicates that it is far more capable of providing Web services to users, because it contains many more services than the other four Web Service Registries. On the other hand, there are no unavailable services in any Web Service Registry except Biocatalogue; that is to say, only the Biocatalogue Web Service Registry contains Web services that cannot be used by the users. To some degree this is wasteful, because these services cannot be used anymore yet still consume network resources on the Web. Therefore, all these unavailable services should be eliminated in order to reduce this waste of network resources.
Figure 4-1 Service amount statistic of these five Web Service Registries
42 Statistic Information for WSDL Document
Web Service Registry Name   Service Repository   Ebi   Xmethods   Seekda   Biocatalogue
Failed WSDL Links           1                    0     23         145      32
Without WSDL Links          0                    0     0          0        16
Empty Content               0                    0     2          0        2
Table 4-2 Statistic information for WSDL Document
Table 4-2 and figure 4-2 show the error statistics for the WSDL documents of the Web services in these five Web Service Registries. There are three aspects to this. The first one is the "Failed WSDL Links" of the Web services in these Web Service Registries, i.e. the overall number of Web services whose WSDL links are invalid. In other words, it is impossible to get the WSDL documents of these Web services via the URL addresses of their WSDL links, and therefore no WSDL document is created for them. The second aspect is the "Without WSDL Links" of the Web services in these Web Service Registries, i.e. the overall number of Web services in each Web Service Registry that have no WSDL link at all, so that there can be no WSDL document content for such Web services. The value of the WSDL link of such a Web service is therefore "NULL"; a WSDL document is nevertheless created, but it has no content and its name contains the string "[No WSDL Document]". The third aspect is the "Empty Content", which represents the overall number of Web services that do have WSDL links with valid URL addresses, but whose WSDL documents contain no content. In this case a WSDL document whose name contains the string "(BAD)" is created. A simplified sketch of these naming rules is given below.
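The method and parameter names in this sketch are assumptions, and the exact position of the marker strings within the file name is illustrative only:

public class WsdlNaming {

    static String wsdlFileName(int id, String serviceName,
                               String wsdlLink, String wsdlContent) {
        if (wsdlLink == null) {
            // "Without WSDL Links": an empty document is still created and
            // its name contains the marker string "[No WSDL Document]".
            return id + serviceName + "[No WSDL Document].wsdl";
        }
        if (wsdlContent == null || wsdlContent.isEmpty()) {
            // "Empty Content": the link is valid but delivers no data, so
            // the name of the created document contains the marker "(BAD)".
            return id + serviceName + "(BAD).wsdl";
        }
        // Normal case, e.g. "1BLZService.wsdl". For "Failed WSDL Links"
        // no WSDL document (and hence no file name) is created at all.
        return id + serviceName + ".wsdl";
    }

    public static void main(String[] args) {
        System.out.println(wsdlFileName(1, "BLZService",
                "http://www.thomas-bayer.com/axis2/services/BLZService?wsdl",
                "<definitions/>"));
    }
}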
Figure 4-2 Statistic information for WSDL Document
43 Comparison of Different Average Number of
Service Properties
This section compares the average number of service properties in these five Web Service Registries. This average number of service properties is calculated by means of the following equation:
ASP = ONSP / ONS (1)
Where
ASP is the average number of service properties for one Web Service Registry
ONSP is the overall number of service properties in one Web Service Registry
ONS is the overall number of Web services that have already been crawled from the
corresponding Web Service Registry
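As an illustration (the total number of properties here is back-computed from the reported average rather than measured separately): the Xmethods Web Service Registry has ONS = 382 crawled services, so an overall total of ONSP = 6494 extracted service properties yields ASP = 6494 / 382 = 17, the average shown for Xmethods in figure 4-3.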
Figure 4-3 shows the average number of service properties per Web service in these five Web Service Registries. As mentioned before, one measure of the quality of the Web services in a Web Service Registry is the amount of service information: the more information is available about a Web service, the better a user knows that service, and consequently the better the quality the corresponding Web Service Registry can offer to its users. As can be seen from figure 4-3, the Service Repository and Biocatalogue Web Service Registries own a larger number of service properties than the other three Web Service Registries. This directly reflects that these two Web Service Registries provide more detailed information about the Web services published in them, so users can choose the services they need more easily and are more likely to use the Web services published there. By contrast, the Xmethods and Seekda
Web Service Registries, which have less service information about their Web services, offer lower quality for these Web services. Therefore users may be less inclined to use the Web services provided in these two Web Service Registries, not to mention the Web services published in the Ebi Web Service Registry.
Figure 4-3 Average Number of Service Properties
From the description presented in section 323, the causes of the differing numbers of service properties in these Web Service Registries may consist of the following points. First, the amount of structured information for the Web services differs among these five Web Service Registries, and part of the information for some Web services in a registry may even be missing or have an empty value; for example, the amount of structured information that is supposed to be extracted for the Web services in the Ebi Web Service Registry differs greatly from that in the Biocatalogue Web Service Registry. Secondly, there is a certain amount of endpoint information in almost all Web Service Registries except the Ebi Web Service Registry, which more or less reduces the overall number of service properties there. Thirdly, some Web Service Registries, such as Xmethods and Ebi, do not have monitoring information, whereas the Service Repository Web Service Registry in particular has a large amount of monitoring information about its Web services that can be extracted from the Web. Obviously the last point is the amount of whois information for these Web services: if the database of the whois client does not contain information about the service domain of a Web service, then no whois information can be extracted; moreover, even where information about the service domain exists, its amount can be very diverse. Therefore, if many service domains of the Web services in a registry have no or only little whois information, then the average number of service properties in that registry decreases greatly.
As a result, in order to help users become better acquainted with the Web services provided in a Web Service Registry and to distinguish these Web services, a Web Service Registry should do its best to offer as much information as possible for each of its published Web services.
44 Different Outputs of Web Services
The fundamental task of this master program is to obtain the WSDL documents of the Web services hosted in the Web Service Registries, to extract and gather the properties of these Web services, and thereafter to store them on disk. Therefore this section describes the different outputs of this master program, which include the WSDL documents of the Web services, the generated XML and INI files, and the data records of the service properties in the database.
Figure 4-4 WSDL Document format of one Web service
The WSDL document of a Web service is simply read from the Web according to the URL address of its WSDL link, and the data is then stored on disk as the WSDL document. The name of the WSDL document is the service name plus the ending ".wsdl", which marks it as a WSDL document. Furthermore, in order to distinguish WSDL documents whose names would be identical although their contents differ, the name of each obtained WSDL document in one Web Service Registry is prefixed with a unique integer. Figure 4-4 shows the valid WSDL document format of a Web service; its name is "1BLZService.wsdl".

Besides, the obtained service properties are transformed into an XML file, an INI file and data records in the database. Figure 4-5, figure 4-6 and figure 4-7 show these three different output formats respectively. As can be seen from figure 4-5, this is the INI file of the Web service, whose name is "1BLZService.ini". The integer is the same as in the WSDL document, because both files are materials belonging to the same Web service. The first three lines of the INI file are service comments, which run from a semicolon to the end of the line; they are basic information describing this INI file. The line following them is the section, which is enclosed in a pair of brackets; it is important because it indicates that the lines after it contain the information of this Web service. The remaining lines hold the actual service information as key-value pairs with an equals sign between key and value, and each service property starts at the beginning of a line. An illustrative example of this layout is sketched below.
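For illustration only, such an INI file might look like the following sketch; the comment lines, the section name and the property keys and values are invented and abbreviated here, since the real files contain many more key-value pairs:

; Name of the Web service: BLZService
; Web Service Registry: Service Repository
; Generated by the Deep Web Service Crawler
[service]
Service Name=BLZService
Provider=thomas-bayer.com
WSDL Link=http://www.thomas-bayer.com/axis2/services/BLZService?wsdl
Availability=100%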
Figure 4-5 INI File format of one Web service
Figure 4-6 XML File format of one Web service
Figure 4-7 Database data format for all Web services
Furthermore, figure 4-6 shows the generated XML file of the Web service "BLZService"; its name is "1BLZService.xml". Needless to say, this XML file belongs to the materials of the same Web service. Though the format of the XML file differs from that of the INI file, their essential contents are the same, i.e. the values of the service properties do not differ, because both files are generated from the same collection of properties of one Web service. The XML file also has some comments like those in the INI file, which are enclosed between "<!--" and "-->", and the section of the INI file corresponds to the root element of the XML file. Therefore all values of the elements under the root "service" in this XML file are the values of the service properties of this Web service.
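Under the same caveat as above, an illustrative XML counterpart with the same invented property values could look like this:

<!-- Name of the Web service: BLZService -->
<!-- Web Service Registry: Service Repository -->
<service>
    <ServiceName>BLZService</ServiceName>
    <Provider>thomas-bayer.com</Provider>
    <WSDLLink>http://www.thomas-bayer.com/axis2/services/BLZService?wsdl</WSDLLink>
    <Availability>100%</Availability>
</service>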
Eventually, as can be seen from figure 4-7, there is a database table which is used to store the service information of all Web services in these five Web Service Registries; the entire service information of one Web service forms exactly one record in this table. Because of that, the column names of the table are the union of the names of the service information in each Web Service Registry. However, since the column names of the table must be unique, the redundant names in this union are eliminated. This is sensible and possible because the names of the service information are well-defined and uniform for all five Web Service Registries. In addition, the first column of this table is the primary key, an increasing integer whose function resembles that of the integer contained in the names of the XML and INI files, while the remaining columns are the corresponding service properties of the Web services. The value "NULL" or "Null" in this table indicates that this property of that Web service is empty or missing. A sketch of such a table definition is given below.
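A minimal sketch of how such a table could be created over JDBC, assuming MySQL syntax and invented column names (the real table unions all well-defined property names of the five registries):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class TableCreator {
    public static void main(String[] args) throws Exception {
        // The connection URL and credentials are illustrative assumptions.
        Connection con = DriverManager.getConnection(
                "jdbc:mysql://localhost/crawler", "user", "password");
        Statement st = con.createStatement();
        // The first column is the auto-incrementing primary key; the other
        // columns are the union of all well-defined service property names
        // across the five registries (heavily abbreviated here).
        st.executeUpdate("CREATE TABLE IF NOT EXISTS web_services ("
                + "id INT PRIMARY KEY AUTO_INCREMENT, "
                + "service_name VARCHAR(255), "
                + "provider VARCHAR(255), "
                + "wsdl_link TEXT, "
                + "availability VARCHAR(32))");
        st.close();
        con.close();
    }
}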
45 Comparison of Average Time Cost for Different
Parts of Single Web Service
This section describes and compares the average time cost of the different parts of getting one single Web service in all these five Web Service Registries. At first, the average time cost of getting one single service in a Web Service Registry has to be calculated. It is obtained through the following equation:
ATC = OTS / ONS (2)
Where
ATC is the average time cost for one single Web service
OTS is the overall time cost of all the Web services in one Web Service Registry
ONS is the overall number of Web services that have already been crawled from the
corresponding Web Service Registry
In addition, the average time cost of getting one single service is split into the following six parts: the average time cost for extracting the service properties, the average time cost for obtaining the WSDL document, the average time cost for generating the XML file, the average time cost for generating the INI file, the average time cost for inserting the service properties into the database table, and the average time cost for some other procedures, such as getting the service list page link and the service page link. The average time cost for extracting the service properties is obtained by means of the following equation:
ATCSI = OTSSI / ONS (3)
Where
ATCSI is the average time cost for extracting service property of one single Web service
OTSSI is the overall time cost for extracting service property of all the Web services in one Web
Service Registry
ONS is the overall number of Web services that have already been crawled from the
corresponding Web Service Registry
The other parts are calculated analogously to the equation for the average time cost of extracting the service properties, while the average time cost for the other procedures equals the average time cost of one single Web service minus the sum of the average time costs of the other five parts, as the following check illustrates.
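Concretely, taking the Service Repository row of table 4-3 below: 10042 - (8801 + 918 + 2 + 1 + 53) = 10042 - 9775 = 267 milliseconds, which is exactly the value reported in the "Others" column; the same check holds for the other four registries.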
Web Service Registry   Service property   WSDL Document   XML File   INI File   Database   Others   Overall
Service Repository     8801               918             2          1          53         267      10042
Ebi                    699                82              2          1          28         11       823
Xmethods               5801               1168            2          1          45         12       7029
Seekda                 5186               1013            2          1          41         23       6266
Biocatalogue           39533              762             2          1          66         1636     42000
Table 4-3 Average time cost information (in milliseconds) for all Web Service Registries
Table 4-3 displays the average time cost of one Web service and of its different parts for all five Web Service Registries. The first column of table 4-3 contains the names of the five Web Service Registries, the last column is the overall average time cost of a single service in each registry, and the remaining columns are the average time costs of the six different parts. In order to give an intuitive view of the data in table 4-3, the data of each column are also illustrated by the corresponding figures 4-8 to 4-13.
Figure 4-8 Average time cost for extracting service property in all Web Service Registries
As can be seen from figure 4-8, the average time cost for extracting the service properties in the Biocatalogue Web Service Registry is 39533 milliseconds, much larger than in the other four Web Service Registries, which take 8801, 699, 5801 and 5186 milliseconds for Service Repository, Ebi, Xmethods and Seekda respectively. That is to say, it takes much longer to extract the service properties of the Web services published in the Biocatalogue Web Service Registry. This also indirectly indicates that the Biocatalogue Web Service Registry has the highest average number of service properties, as already discussed in section 43; conversely, the average number of service properties in the Ebi Web Service Registry is the smallest. Moreover, the average time cost in the Xmethods Web Service Registry is larger than in the Seekda Web Service Registry, although, as seen before, the average number of service properties is the same for these two registries. One cause that might explain why Xmethods costs more time than Seekda is that extracting the service properties in the Xmethods Web Service Registry requires both the service page and the service list page, while for the Seekda Web Service Registry only the service page link is needed.
The average time cost for obtaining the WSDL document in all five Web Service Registries is displayed in figure 4-9. Actually, this average time cost is the sum of the average time for extracting the WSDL link of a Web service and the average time for reading the data of the WSDL document from the Web and storing it on disk. As seen from this figure, the average time cost for obtaining the WSDL document is largest in the Xmethods Web Service Registry, at 1168 milliseconds. Although the process of extracting the WSDL link costs a certain amount of time, it has no significant influence on the total average time spent obtaining the WSDL document, because the WSDL link of a Web service is almost always gained in one step. Therefore this implies that the average size of the WSDL documents in the Xmethods Web Service Registry is larger than in the other four Web Service Registries, and in particular than in the Ebi Web Service Registry, which takes just 82 milliseconds for obtaining the WSDL document.
Figure 4-9 Average time cost for obtaining WSDL document in all Web Service Registries
Figure 4-10, figure 4-11 and figure 4-12 show the average time cost of generating the three different outputs in all five Web Service Registries. As can be seen from figures 4-10 and 4-11, the average time for generating the XML file of one Web service is the same for all five Web Service Registries, namely only 2 milliseconds, and the average time for generating the INI file is likewise the same for all registries, at just 1 millisecond. Even the sum of these two average time costs is still so small that it can be neglected when compared to the overall average time cost of getting one Web service in the corresponding Web Service Registry, shown in figure 4-13. This implies that the generation of the XML and INI files finishes practically immediately after receiving the service properties of a Web service as input. Furthermore, as can be seen from figure 4-12, although the average time cost of creating the database record for each Web service is larger in all five Web Service Registries than the time for generating the XML and INI files, the operation of creating a database record is still fast.
Figure 4-10 Average time cost for generating XML file in all Web Service Registries
Figure 4-11 Average time cost for generating INI file in all Web Service Registries
Figure 4-12 Average time cost for creating database record in all Web Service Registries
Figure 4-13 Average time cost for getting one Web service in all Web Service Registries
Figure 4-13 gives the average time cost for getting one single Web service in all five Web Service Registries. Without any doubt, the Biocatalogue Web Service Registry takes the longest time for this process, because, as presented above, the average time cost of each part is largest in the Biocatalogue Web Service Registry, except for the process of obtaining the WSDL document, where Biocatalogue does not cost the most time. Moreover, a striking observation when looking at figures 4-8, 4-12 and 4-13 is that the shapes of these curves show almost the same trend. This further indicates that a Web Service Registry that spends more time getting the description information of a Web service also offers more information about that Web service.
5 Conclusion and Further Direction
This master thesis provides a schema that aims to explore the description information of the Web services hosted in different Web Service Registries. The description information of a Web service consists of its WSDL document and the service information about that Web service. Although there is an existing approach, the Pica-Pica Web Service Description Crawler, that can be used to obtain the WSDL documents and service information of the Web services in these Web Service Registries, its functionality is restricted: it explores only a small subset of the Web services hosted in these Web Service Registries, the storage of these Web services is not flexible, and, most importantly, only a small amount of service information is extracted per Web service; for some Web Service Registries no service information is extracted at all. The work presented in this master thesis, however, is able to explore all Web services in these Web Service Registries. Moreover, with the approach presented in this master thesis, as much service information as possible is extracted for each Web service, so that the final result is the largest annotated service catalogue ever produced. Furthermore, regarding the storage of the description information of the Web services, this master thesis provides three different ways that guarantee not only the completeness but also the longevity of the description information of each Web service.
However, in the implementation performed in this master thesis, the whois client used for querying the information of a service domain returns free text if the information exists, and this free text sometimes differs completely from case to case. As a consequence, every Web service in all Web Service Registries had to be crawled at least once during the experiment stage, so that all variants of this free text could be foreseen and processed afterwards. This is a huge amount of work, because there are lots of Web services in these Web Service Registries. Therefore, in order to simplify this work, another whois client that eases this task needs to be found and used.
Moreover, in the experiment stage of this master thesis, the time cost of getting a Web service is still large. For the purpose of reducing this time, multithreaded programming could also be applied to some parts of the process of getting one single Web service.
Although the work performed here is specialized for only these five Web Service Registries, the main parts of the principles used here are adaptable to other Web Service Registries with only small changes in the implementation code or structure.
6 Bibliography
[1] Nathalie Steinmetz, Holger Lausen, Dario Cerizza, Andrea Turati, Irene Celino, Adam Funk and Michael Erdmann, "D1.1 – Requirement Analysis and Architectural Plan", Emanuele Della Valle (CEFRIEL), June 27, 2008. Available from http://www.service-finder.eu/deliverables/public/7-public/32-d11
[2] Nathalie Steinmetz, Holger Lausen, Dario Cerizza, Andrea Turati, Irene Celino, Adam Funk and Michael Erdmann, "D1.2 – First Design of Service-Finder as a Whole", Emanuele Della Valle (CEFRIEL), July 1, 2008. Available from http://www.service-finder.eu/deliverables/public/7-public/37-d12-first-design-of-service-finder-as-a-whole
[3] Nathalie Steinmetz, Holger Lausen, Irene Celino, Dario Cerizza, Saartje Brockmans, Adam Funk, "D1.3 – Revised Requirement Analysis and Architectural Plan", Emanuele Della Valle (CEFRIEL), April 1, 2009. Available from http://www.service-finder.eu/deliverables/public/7-public/57-d13-revised-requirement-analysis-and-architectural-plan
[4] Chia-Hui Chang, Mohammed Kayed, Moheb Ramzy Girgis, Khaled Shaalan, "A Survey of Web Information Extraction Systems", Volume 18, Issue 10, IEEE Computer Society, pp. 1411-1428, October 2006.
[5] Leonard Richardson, "Beautiful Soup Documentation", October 13, 2008. Available from http://www.crummy.com/software/BeautifulSoup/documentation.html
[6] Hao He, Hugo Haas, David Orchard, "Web Services Architecture Usage Scenarios", February 11, 2004. Available from http://www.w3.org/TR/ws-arch-scenarios
[7] Stephen Soderland, "Learning Information Extraction Rules for Semi-Structured and Free Text", Journal of Machine Learning, Volume 34, Issue 1-3, Department of Computer Science and Engineering, University of Washington, Seattle, pp. 233-272, February 1999.
[8] Ian Hickson, "A Vocabulary and Associated APIs for HTML and XHTML", World Wide Web Consortium, Working Draft WD-html5-20100624, January 22, 2008.
[9] Holger Lausen, Jos de Bruijn, Axel Polleres, Dieter Fensel, "The Web Service Modeling Language WSML", WSML Deliverable D16.1v0.2, March 20, 2005. Available from http://www.wsmo.org/TR/d16/d16.1/v0.2
[10] Dumitru Roman, Holger Lausen, Uwe Keller, "Web Service Modeling Ontology - Standard (WSMO-Standard)", WSMO deliverable D2 version 1.1, 06 March 2004. Available from http://www.wsmo.org/TR/d2/v1.1
[11] Iris Braun, Anja Strunk, Gergana Stoyanova, Bastian Buder, "ConQo – A Context- and QoS-Aware Service Discovery", TU Dresden, Department of Computer Science, in Proceedings of WWW/Internet 2008.
7 Appendixes
There are additional outputs of this master program, namely a log information file and a statistic report file. Figure 8-1 shows one of the basic output formats of the log information for these five Web Service Registries.
Figure 8-1 Log information of the "Service Repository" Web Service Registry
Figure 8-2 Statistic information of the "Service Repository" Web Service Registry
Figure 8-2 to figure 8-6 show the output format of the statistic information for the "Service Repository", "Ebi", "Xmethods", "Seekda" and "Biocatalogue" Web Service Registries respectively.
Figure 8-3 Statistic information of the "Ebi" Web Service Registry
Figure 8-4 Statistic information of the "Xmethods" Web Service Registry
Figure 8-5 Statistic information of the "Seekda" Web Service Registry
Figure 8-6 Statistic information of the "Biocatalogue" Web Service Registry
Table of Figures
Figure 2-1 Dataflow of Service-Finder and Its Components  12
Figure 2-2 Left is the free text input type and right is its output  16
Figure 2-3 A Semi-structured page containing data records (in rectangular box) to be extracted  16
Figure 2-4 Architecture of the Pica-Pica Web Service Description Crawler  20
Figure 3-1 Overview of the Basic Architecture for the Deep Web Services Crawler  25
Figure 3-2 Overview the process flow of the Web Service Extractor Component  27
Figure 3-3 Service list page of the Service-Repository  29
Figure 3-4 Original source code of the internal link for Web service "BLZService"  29
Figure 3-5 Code Overview of getting service page link in Service Repository  29
Figure 3-6 Service page of the Web service "BLZService"  29
Figure 3-7 Overview the process flow of the WSDL Grabber Component  30
Figure 3-8 WSDL link of the Web service "BLZService" in the service page  31
Figure 3-9 Original source code of the WSDL link for Web service "BLZService"  32
Figure 3-10 Code overview of "getServiceRepositoryWSDLLink" function  32
Figure 3-11 Code overview of "oneParameter" function  32
Figure 3-12 Overview the process flow of the Property Grabber Component  33
Figure 3-13 Structure properties of the Service "BLZService" in service list page  37
Figure 3-14 Structure properties of the Service "BLZService" in service page  38
Figure 3-15 Endpoint information of the Web service "BLZService" in service page  38
Figure 3-16 Monitoring Information of the Service "BLZService" in service page  39
Figure 3-17 Whois Information of the service domain "thomas-bayer.com"  40
Figure 3-18 Overview the process flow of the Storage Component  41
Figure 3-19 Implementation code for getting WSDL document  44
Figure 3-20 Implementation code for generating XML file  44
Figure 3-21 Implementation code for generating INI file  45
Figure 3-22 Implementation code for creating table in database  45
Figure 3-23 Implementation code for generating table records  46
Figure 4-1 Service amount statistic of these five Web Service Registries  49
Figure 4-2 Statistic information for WSDL Document  50
Figure 4-3 Average Number of Service Properties  51
Figure 4-4 WSDL Document format of one Web service  52
Figure 4-5 INI File format of one Web service  53
Figure 4-6 XML File format of one Web service  53
Figure 4-7 Database data format for all Web services  53
Figure 4-8 Average time cost for extracting service property in all Web Service Registries  55
Figure 4-9 Average time cost for obtaining WSDL document in all Web Service Registries  56
Figure 4-10 Average time cost for generating XML file in all Web Service Registries  57
Figure 4-11 Average time cost for generating INI file in all Web Service Registries  57
Figure 4-12 Average time cost for creating database record in all Web Service Registries  58
Figure 4-13 Average time cost for getting one Web service in all Web Service Registries  58
Table of Tables
Table 3-1 Structured Information of Service-Repository Web Service Registry  34
Table 3-2 Structured Information of Xmethods Web Service Registry  34
Table 3-3 Structured Information of Seekda Web Service Registry  34
Table 3-4 Structured Information of Ebi Web Service Registry  34
Table 3-5 Structured Information of Biocatalogue Web Service Registry  34
Table 3-6 SOAP Operation Information of Biocatalogue Web Service Registry  35
Table 3-7 REST Operation Information of Biocatalogue Web Service Registry  35
Table 3-8 Endpoint Information of these five Web Service Registries  35
Table 3-9 Monitoring Information of these five Web Service Registries  35
Table 3-10 Whois Information for these five Web Service Registries  36
Table 3-11 Extracted Structured Information of Web Service "BLZService"  38
Table 3-12 Extracted Endpoint Information of the Web service "BLZService"  39
Table 3-13 Extracted Monitoring Information of the Web service "BLZService"  39
Table 3-14 Extracted Whois Information of service domain "thomas-bayer.com"  40
Table 3-15 Sleep Time of these five Web Service Registries  47
Table 4-1 Service amount statistic of these five Web Service Registries  48
Table 4-2 Statistic information for WSDL Document  49
Table 4-3 Average time cost information for all Web Service Registries  55
Table of Abbreviations
DTD Document Type Definition
DWSC Deep Web Service Crawler
HTML HyperText Markup Language
MTBF Mean Time between Failures
MTTR Mean Time to Recovery
QoS Quality of Service
REST Representational State Transfer
RTT Round Trip Time
SOAP Simple Object Access Protocol
SQL Structured Query Language
URL Uniform Resource Locator
WHATWG Web Hypertext Application Technology Working Group
XML eXtensible Markup Language
WSDL Web Service Description Language
WSML Web Service Modeling Language
WSMO Web Service Modeling Ontology
Deep Web Service Crawler
3
Abstract
Nowadays Web Service Registries offer convenient access to offering searching and using
electronic Web services Usually they host Web service descriptions along with related metadata
generated by both the system and the users Hence monitoring and rating information can help
users to distinguish similar Web service offerings However at present there is little support to
compare these Web services across platforms and to build a global view Besides for the metadata
not all of them especially non-functional property descriptions are made available in s structured
format
Therefore the task of this master thesis is to apply Deep Web analysis techniques to extract as
much information about these published Web services as possible Corresponding the result shall
be the largest annotated Service Catalogue ever produced
Index Terms
Web Service Deep Web Service Crawler Service-Finder Pica-Pica Web Service Description Crawler
WSDL
Deep Web Service Crawler
4
Table of Contents
Acknowledgements 2
Abstract 3
1 Introduction 7
11 BackgroundMotivation 7
12 Initial Designing of the Deep Web Service Crawler Approach 7
13 Goals of this Master Thesis 8
14 Outline of this Master Thesis 8
2 State of the Art 10
21 Service Finder Project 10
211 Use Cases for Service-Finder Project 10
2111 Use Case Methodology 10
2112 System Administrator 10
212 Architecture Plan for the Service-Finder Project 12
2121 The Principle of the Service Crawler Component 13
2122 The Principle of the Automatic Annotator Component 13
2123 The Principle of the Conceptual Indexer and Matcher Component 14
2124 The Principle of the Service-Finder Portal Interface Component 14
2125 The Principle of the Cluster Engine Component 15
22 Information Extraction 15
221 Input Types of Information Extraction 15
222 Extraction Targets of Information Extraction 17
223 The Used Techniques in Information Extraction 18
23 Pica-Pica Web Service Description Crawler 19
231 Needed Libraries of the Pica-Pica Web Service Description Crawler 19
232 Architecture of the Pica-Pica Web Service Description Crawler 20
233 Implementation of the Pica-Pica Web Service Description Crawler 21
24 Conclusions of the Existing Strategies 22
3 Design and Implementation 23
31 Deep Web Services Crawler Requirements 23
311 Basic Requirements for DWSC 23
Deep Web Service Crawler
5
312 System Requirements for DWSC 23
313 Non-Functional Requirements for DWSC 24
32 Deep Web Services Crawler Architecture 24
321 The Function of Web Service Extractor Component 26
3211 Features of the Web Service Extractor Component 28
3212 Input of the Web Service Extractor Component 28
3213 Output of the Web Service Extractor Component 28
3214 Demonstration for Web Service Extractor 29
322 The Function of WSDL Grabber Component 30
3221 Features of the WSDL Grabber Component 31
3222 Input of the WSDL Grabber Component 31
3223 Output of the WSDL Grabber Component 31
3224 Demonstration for WSDL Grabber Component 31
323 The Function of Property Grabber Component 33
3231 Features of the Property Grabber Component 36
3232 Input of the Property Grabber Component 37
3233 Output of the Property Grabber Component 37
3234 Demonstration for Property Grabber Component 37
324 The Function of Storage Component 40
3241 Features of the Storage Component 42
3242 Input of the Storage Component 43
3243 Output of the Storage Component 43
3244 Demonstration for Storage Component 43
33 Multithreaded Programming for DWSC 46
34 Sleep Time Configuration for Web Service Registries 46
4 Experimental Results and Analysis 48
41 Statistic Information for Different Web Service Registries 48
42 Statistic Information for WSDL Document 49
43 Comparison of Different Average Number of Service Properties 50
44 Different Outputs of Web Services 52
45 Comparison of Average Time Cost for Different Parts of Single Web Service 54
5 Conclusion and Further Direction 59
6 Bibliography 60
Deep Web Service Crawler
6
7 Appendixes 61
Table of Figures 64
Table of Tables 65
Table of Abbreviations 66
Deep Web Service Crawler
7
1 Introduction
In this introduction chapter of the master thesis at first itrsquos going to concisely explain the background
of the current situation then is the basic introduction of the proposed approach which is called Deep
Web Service Extraction Crawler
11 BackgroundMotivation
In the late 1990rsquos the Web Service Registry was a hot commodity The formal definition of Web
Service Registry is known as a link links page Its function is to uniformly present information that
comes from various sources Hence it can provide a convenient channel to the users for offering
searching and using the Web Services Actually the related metadata of the Web Services that
submitted by both the system and users are commonly hosted along with the Service descriptions
Nevertheless in fact when users enter one of the Web Service Registries to look for some Web
Services they might meet some situations that would bring lots of trouble to them One of the
situations may be like that these Web Service Registries return several similar published Web Services
after the users search on it For example two or more Web Services have the same name but their
versions are not the same Or two or more Web Services that derived from the same server but have
different contents etc Furthermore most users are also interested in a global view of the published
services For instance they want to know which Web Service Registry can provide better quality for
the Web Service Therefore in order to help users to differentiate those similar published Web
Services and have a global view of the Web Services this information should be monitored and rated
Moreover there are a great many Web Service Registries in the Internet Each Web Service Registry
can provide a great number of Web Services Obviously there might have some similar Web Services
among these Web Service Registries Or a Web Service in one of the Web Service Registry is related to
another Web Service in other Web Service Registries Hence these Web Services should be
comparable across different Web Service Registries However recently there has not much support of
this In addition towards the metadata actually not all of them are structured especially the
descriptions of the non-functional property Therefore what have to do now is to turn those
non-functional property descriptions into the structured format Clearly speaking it needs to extract
as much information as possible about the Web Services that offered in the Web Service Registries
Eventually after extracting all the information from the Web Service Registries it is necessary to store
them into the disk This procedure should be efficient flexible and completeness
12 Initial Designing of the Deep Web Service
Crawler Approach
The problems have already been stated in the previous section Hence the following work is to solve
Deep Web Service Crawler
8
these problems In this section it will present the basic principle of Deep Web Service Crawler
approach
At first it is the simply introduction of the Deep Web Service Crawler approach to these problems As
have already been mentioned each Web Service Registry can offer Web Services Moreover each
Web Service Registry has its own html page structures These structures may be the same or even
complete different Therefore the first thing is to identify which Web Service Registry that it will be
going to explore Since each Web Service Registry owns a unique URL this job can be done by directly
analyzing the corresponding URL address of that Web Service Registry After identifying which Web
Service Registry it is going to explore the following step is to obtain all these Web Services that
published in that Web Service Registry Then with all these obtained Web Services it is time to extract
analyze and gather the information of the services That information can be in structured format or
even in unstructured format In this master thesis some Deep Web Analysis Techniques will be
applied to obtain this information So that the information about each Web Service shall be the
largest annotated The last but not the least important all the information about the Web Services
need to be stored
13 Goals of this Master Thesis
The lists in the following are the goals of this master thesis
n Produce the largest annotated Service Catalogue
Service Catalogue is a list of service properties The more properties the service has the larger
Service Catalogue it owns Therefore this master program should extract as much service
properties as possible
n Flexible storage of these metadata of each service as annotations or dedicated documents
The metadata of one service includes not only the WSDL document but also service properties
All these metadata are important information for the service Therefore this master program
should provide flexible ways to store these metadata into the disk
n Improve the comparable property of the Web Services across different Web Service Registries
The names of service properties for one Web Service Registry could be different from another
Web Service Registry Hence for the purpose of improving the comparable ability all these
names of the service properties should be uniformed and well-defined
14 Outline of this Master Thesis
In this chapter the motivation objective and initial approach plan have already been discussed
Thereafter the remaining paper is structured as follows
Firstly in chapter 2 the work presented is almost based on these existing techniques In section 21
there is given a detailed introduction to the technique of the Service-Finder project Then in section
22 the already implemented technique of Pica-Pica Web Service Description Crawler is introduction
and discussed After that in section 23 the Information Retrieval technique is presented
Deep Web Service Crawler
9
Then following in chapter 3 it is explained the designing details of this Deep Web Service Crawler
approach In section 31 it gives a short description for the different requirements of this approach
Next in section 32 the actual design for the Deep Web Service Crawler is presented Then in section
33 34 the multithreaded programming and sleep time configuration that used in this master
program are introduced respectively
In chapter 4 it is supposed to display the experiments of this Deep Web Service Crawler approach
and then give some evaluation of it
Finally in Section 5 the conclusion and discussion of the work that already done as well as the work in
the future for this master task are presented respectively
Deep Web Service Crawler
10
2 State of the Art
This chapter aims at presenting some existing techniques or Strategies that related to the work of
applying this Deep Web Service Extraction Crawler approach Section 21 plans to talk about the
existing catalogues Service-Finder project And then in section 22 it is going to explain the existing
implemented crawlers Pica-Pica Web Service Description Crawler Finally in the section 23 it is
supposed to present some details about the Information Extraction technique
21 Service Finder Project
Service-Finder Project aims at developing a platform for Web Service discovery especially for the Web
Services that are embedded in a Web 20 environment [1] Hence it can provide an efficient access to
publicly available services The goals of the Service-Finder project are depicted as follows [1]
n Automatically gather Web Services and their related information
n Semi-automatically create semantic service description based on the information that available
on the Web
n Create and improve semantic annotations via the user feedback
n Describe the aggregated information in semantic models and allow reasoning query
However before describing the basic functionality of the Service-Finder Project there is going to
present one of its use cases and requirements first
211 Use Cases for Service-Finder Project
The Service-Finder Project employed the use case methodology of the W3C Use Case description [6]
for its needs and then applied this methodology to the use cases that it enumerated
2111 Use Case Methodology
There are three aspects needed to be considered for use case definitions of Service-Finder Project [1]
(1) Description that used to describe information of the use case
(2) Actors Roles and Goals that used to identify the actors they would be the roles they act and the
goals they need to achieve in the scenario
(3) Storyboard that used to describe the serial of interactions among the actors and the
Service-Finder Portal
2112 System Administrator
This section is going to present the use case that applied to the Service-Finder portal and that
illustrated the requirements on its functionality from a user point of view However all these
Deep Web Service Crawler
11
information in this use case are derived from [1] In this use case there has a system administrator
whose name is ldquoSam Adamsrdquo He works for the bank His job is to keep the online payment facilities
online and working all day and night Therefore if there is any system failures Sam Adams should fix
the problems as early as he can Thatrsquos why he wants to use an SMS Messaging service which will
alert him immediately by sending him a SMS Message in the case of a system failure
n Description
This use case is dedicated to describe the system administrator ldquoSam Adamsrdquo who is looking for an
SMS Messaging Service that he wants to build it into his application
n Actors Roles and Goals
The name of the Actor is ldquoSam Adamsrdquo His Role is a system administrator at a bank The goals of him
are the immediate service delivery the reliability of the service and low base fee and transaction fee
n Storyboard
Step 1 Because of Sam Adams knows the Service-Finder portal and he also knows that he can find
many useful services from it especially he know what he is looking for Hence he visits the
Service-Finder portal and starts to search by entering the keywords ldquoSMS Alertrdquo
Requirement 1 Search functionality
Step 2 Now the Service-Finder returns a list of matching services However Sam wants to choose the
number of matching services that will be displayed on one page And he would also expect there has
short information about the service functionality the service provider and the service availability So
that he could decide which service he will choose to read further
Requirement 2 Enable configurable pagination of the matching results and have some short
information for each service
Step 3 When Sam looks through the short information about the services that displayed on the first
page he expects to find the most relevant services that related to his request After that he would
like to read more detailed information about that service to see whether this service can provide the
needed functionality
Requirement 3 Rank the returned matching services and must provide ability to read more details of
a service
Step 4 In the case that all the returned matching services Sam got provide quite different
functionalities or they belong to different service categories for example the SMS messaging services
alert users not through SMS but voice messaging For this reason Sam would like to see other
different categories that may be contain the services he wants Or the services of other categories
which he is also interested in (like ldquoSMS Messagingrdquo) Besides another possible way is that Sam can
further filter his search in terms of browsing through categories
Requirement 4 Service categories and allow the user to look all services that belonged to that specific
category If possible it should also allow the user to browse through categories
Step 5 When Sam got all the services that could provide a SMS messaging service via the methods
described in the Step 4 at present he wants to look for the services that offered by an Austrian
provider and have no base fees if possible
Requirement 5 Faceted search
Deep Web Service Crawler
12
Step 6 After Sam got all these specific services now he would like to choose the services that can
provide a high reliability
Requirement 6 Sort functionality based on usersrsquo chooses
Step 7 For now Sam expects to compare the service availability between the promised to the service
provider and the actually provided This should be contained in the servicesrsquo details And there needs
also have service coverage information so that Sam can know whether this service covers the areas
he lives and works Moreover Sam would also like to compare these services in other way For
instance put some services into a structured table to compare the transaction fees
Requirement 7 A side-by-side comparison table for services and a functionality that enable users to
select services he wants to compare
Step 8 At last Sam wants to know whether the service providers offer a free try out of the services
So that he can test the service functionality
Requirement 8 If possible display a note that offering free service trials
212 Architecture Plan for the Service-Finder Project
The architecture plan of the Service-Finder project contains five basic components Service Crawler
Automatic Annotator Conceptual Indexer and Matcher Cluster Engine and Service-Finder Portal
Interface Figure 2-1 presents a high level overview of the components and the data flow among
them
Figure2-1Dataflow of Service-Finder and Its Components [3]
Deep Web Service Crawler
13
2121 The Principle of the Service Crawler Component
The Service Crawler is responsible for gathering available services and their related information from
the Web The overall cycle is depicted as following
(1) A Web developer publishes a Web Service
(2) Then the Crawling component begins to harvest the Web in order to identify the Web Services
like WSDL (Web Service Description Language) documents
(3) The Crawler is also going to search for other related information as long as a service is discovered
(4) Later after each periodic interval the Crawler will produce a consistent snapshot of the relevant
part of the Web
At last the output of the crawler would be forwarded to the subsequent components for analyzing
indexing and displaying
2122 The Principle of the Automatic Annotator Component
The Automatic Annotator receives the relevant data from the previous component and generates semantic service descriptions of the WSDL documents and their related information, based on the Service-Finder Ontology and the Service Category Ontology.
Firstly, the two compatible ontologies that are used throughout the whole process are briefly introduced [2]:
- Generic Service Ontology: an ontology used to describe the data objects, for example the services, the service providers, availability, payment modalities, and so on.
- Service Category Ontology: an ontology used to categorize the functionalities or applications of the services, for instance data verification, messaging, data storage, weather, etc.
Afterwards, the function of this component, together with its input and output, is described:
Input:
- Crawled data from the Service Crawler
- Service-Finder Ontologies
- Feedback or corrections of previous annotations
Function:
- Enrich the information about the service and extract semantic statements according to the Service-Finder Ontologies, for example categorize the service according to the Service Category Ontology
- Determine whether a particular document is relevant or not through the Web link graph; if not, discard the irrelevant documents
- Classify the pages into their genres, for instance pricing, user comments, FAQ, and so on
Output:
- Semantic annotations of the services
2123 The Principle of the Conceptual Indexer and Matcher Component
The Conceptual Indexer and Matcher is more like a data store center that aims at storing all extracted information about the services and supplying users with retrieval and semantic query capabilities, for example the matchmaking between user requests and service offers, and the retrieval of user feedback on extracted annotations.
In addition, the function of this component and its input and output are as follows:
Input:
- Semantic annotation data and full-text information obtained from the Automatic Annotator
- Semantic annotation data and full-text information that come from the user interfaces
- Cluster data from the user and service clustering component
Function:
- Store the semantic annotations received from the Automatic Annotator component and from the user interface
- Store the cluster data procured through the clustering component
- Store and index the textual descriptions offered by the Automatic Annotator component and the textual comments offered by users
- Ontological querying of the semantic data from the data store center
- Combined keyword and ontological querying for user queries
- Provide a list of similar services for a given service
Output:
- A list of matching services for a user query; in particular, these services should be sorted by ranking and should be iterable
- All available data related to a particular entity must be retrievable at the user interface
2124 The Principle of the Service-Finder Portal Interface Component
The Service-Finder Portal Interface is the main entry point provided for users of the Service-Finder system to search and browse the data which is managed by the Conceptual Indexer and Matcher component. In addition, the users can also contribute information by providing tags, comments, categorizations and ratings for the browsed data. Furthermore, developers can directly invoke the Service-Finder functionalities from their custom applications via an API.
The details of this component's function, input and output are presented below:
Input:
- A list of ordered services for a query
- Detailed information about a service or a set of services and a service provider
- Query access to the service category ontology and the most used tags provided by the users
- Service availability information
Function:
- The Web interface allows the users to search services by keyword, tag or concept in the categorization; to sort and filter query results by refining the query; to compare and bookmark services; and to try out the services that offer this functionality
- The API allows the developers to invoke Service-Finder functionalities
Output:
- Explicit user annotations such as tags, ratings, comments, descriptions and so on
- Implicit user data, for example click streams of users, bookmarks, comparisons, links sent, etc.
- Manual advertising of available new services
2125 The Principle of the Cluster Engine Component
The Cluster Engine is in charge of collecting and analyzing information about user behavior from the Service-Finder Portal, e.g. the services users queried and compared. Moreover, it also provides cluster data to the Conceptual Indexer and Matcher for providing service recommendations.
Furthermore, this component's function, input and output are detailed below:
Input:
- Service annotation data, both extracted and from user feedback
- Users' click streams, used for extracting user behavior
Function:
- Obtain user clusters from user behavior
- Obtain service clusters from service annotation data, to enable finding similar services
Output:
- Clusters of users and services
22 Information Extraction
Due to the rapid development and use of the World-Wide Web, a huge number of information sources has been produced on the Internet, but access to them has been limited to browsing and searching because of the heterogeneity and the lack of structure of Web information sources. Therefore, Information Extraction, which transforms Web pages into program-friendly structures for post-processing, has become a great necessity. The task of Information Extraction is specified in terms of its inputs and its extraction targets, and the technique used in the process of Information Extraction is called an extractor.
221 Input Types of Information Extraction
Generally speaking, there are three different input types. The first input type is the unstructured document, for example the free text shown in figure 2-2. It is unstructured and written in natural language, so it requires substantial natural language processing. The second input type is the structured document, for instance XML documents, whose data can be described through an available DTD (Document Type Definition) or XML (eXtensible Markup Language) schema. The third input type is the semi-structured document, which is widespread on the Web, for example the large volume of HTML pages with tables, itemized lists and enumerated lists. This is because HTML tags are often used to render such embedded data in HTML pages; see figure 2-3.
Figure 2-2: Left is the free text input type and right is its output [4]
Figure 2-3: A semi-structured page containing data records (in rectangular box) to be extracted [4]
Therefore, the inputs of the semi-structured type can be seen as documents with a fairly regular structure, and the data of these documents can be displayed in an HTML or non-HTML format. Besides, since the Web pages of the Deep Web are dynamic and generated from structured databases by means of templates or layouts, the Deep Web can be considered as one of the input sources that provide such semi-structured documents. For example, the author, price and comment sections of the book pages provided by Amazon all have the same layout, because these Web pages are generated from the same database with the same template or layout applied. Furthermore, there is another option: HTML pages of the semi-structured type can also be generated manually. For example, although the publication lists provided on different researchers' homepages are produced by diverse users, they all have a title and a source property for every single paper. Eventually, the inputs for some Information Extraction tasks can also be pages of the same class, or pages from various Web Service Registries.
222 Extraction Targets of Information Extraction
Moreover, regarding the task of Information Extraction, the extraction target has to be considered. There are two different extraction targets. The first one is the relation of a k-tuple, where k is the number of attributes in a record; in some cases an attribute of a record may have no instantiation, while in other cases the attribute has multiple instantiations. The second extraction target is the complex object with hierarchically organized data. Though the ways of depicting the extraction targets in a page are diverse, the most common structure is the hierarchical tree. Such a hierarchical tree may contain only one leaf node, or one or more lists of leaf nodes grouped under internal nodes. The structure of a data object may also be flat or nested. In brief, if the structure is flat, then there is only one leaf node, which can also be called the root; otherwise, if it is a nested structure, then the internal nodes involved in this data object span more than two levels.
Furthermore, in order to make the Web pages readable for human beings and easier to visualize, the tables or tuples of the same list, or the elements of a tuple, should be clearly isolated or demarcated. However, the display of a data object in a Web page can be affected by the following conditions [4]:
- The attribute of a data object has zero or several values:
(1) If there is no value for an attribute of a data object, this attribute is called a "none" attribute. For example, a special offer only available for certain books might be a "none" attribute.
(2) If there is more than one value for an attribute of a data object, it is called a "multiValue" attribute. For instance, the name of the author of a book could be a "multiValue" attribute.
- The set of attributes (A1, A2, A3, ...) has multiple orderings:
That is to say, within this set of attributes, the position of an attribute might change according to the diverse instances of a data object. Such an attribute is called a "multiOrdering" attribute. For instance, for movies before the year 1999 a movie site might enumerate the release date in front of the movie's title, while for movies from 1999 onwards it enumerates the release date behind the movie's title.
- The attribute has different formats:
This means the display format of the data object can be completely distinct across different instances. Therefore, if the format of an attribute is free, a lot of rules are needed to deal with all kinds of possible cases. This kind of attribute is called a "multiFormat" attribute. For example, an e-commerce Web site might use a bold font format to present the regular prices, while using a red color format to display the sale prices. Nevertheless, there is
another situation where some different attributes of a data object have the same format. For example, various attributes are presented using the <TD> tags in a table presentation. Attributes like these can be differentiated by means of their order information. However, for cases where a "none" attribute occurs or "multiOrdering" attributes exist, the rules for extracting these attributes have to be revised.
- The attribute cannot be decomposed:
For easier processing, the input documents are sometimes treated as strings of tokens instead of strings of characters. In addition, some attributes cannot be decomposed into several individual tokens. These attributes are called "untokenized" attributes. Examples are college course codes like "COMP4016" or "GEOL2001": the department code and the course number in them cannot be separated into two different strings of characters such as "COMP" and "4016", or "GEOL" and "2001".
223 The Used Techniques in Information Extraction
The extractor used in the process of Information Extraction aims at providing a single uniform query interface for accessing information sources like database servers and Web servers. It consists of the following phases: collecting returned Web pages, labeling these Web pages, generalizing extraction rules, extracting the relevant data, and outputting the result in an appropriate format (XML format or relational database) for further information integration. For example, at first the extractor queries the Web server to gather the returned pages through the HTTP protocol; after that it starts to extract the contents of these HTML documents and thereafter integrates them with other data sources. Actually, the whole process of the extractor follows the steps below:
- Step 1:
At the beginning, the input has to be tokenized. There are two different granularities for the input string tokenization: tag-level encoding and word-level encoding. Tag-level encoding transforms the tags of an HTML page into general tokens, while transforming every text string between two tags into a special token. Word-level encoding does this in another way: it treats each word in a document as a token.
- Step 2:
Next, the extraction rules are applied for every attribute of the data object in the Web pages. These extraction rules can be induced by means of top-down or bottom-up generalization, pattern mining, or logic programming. In addition, the type of extraction rules may be expressed by means of regular grammars or logic rules. For example, some use path expressions of the HTML parse tree, like html.head.title or html->table[0]; some use syntactic or semantic constraints; and some use delimiter-based constraints such as HTML tags or literal words (see the sketch after these steps).
- Step 3:
After that, all the extracted data are assembled into records.
- Step 4:
Finally, this process is iterated until all data objects in the input are covered.
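To make steps 1 and 2 concrete, the following is a minimal sketch in Python of tag-level tokenization combined with a delimiter-based extraction rule; the helper names and the sample markup are illustrative assumptions, not part of any of the systems described here.

import re

# Tag-level encoding (step 1): every HTML tag becomes a generic TAG token
# and the text between two tags becomes a single special TEXT token.
def tag_level_tokenize(html):
    tokens = []
    for part in re.split(r'(<[^>]+>)', html):
        if not part.strip():
            continue
        if part.startswith('<'):
            tokens.append(('TAG', part))
        else:
            tokens.append(('TEXT', part.strip()))
    return tokens

# A delimiter-based extraction rule (step 2): the value of interest is
# whatever TEXT token appears between the two given delimiter tags.
def extract_between(tokens, open_tag, close_tag):
    for i, (kind, value) in enumerate(tokens):
        if kind == 'TAG' and value == open_tag:
            if i + 2 < len(tokens) and tokens[i + 1][0] == 'TEXT' and tokens[i + 2][1] == close_tag:
                yield tokens[i + 1][1]

tokens = tag_level_tokenize('<tr><td>BLZService</td><td>4.5</td></tr>')
print(list(extract_between(tokens, '<td>', '</td>')))  # ['BLZService', '4.5']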
23 Pica-Pica Web Service Description Crawler
The pica pica is known as a kind of bird species; it can also be called pie. However, the Pica-Pica here is a Web Service Description Crawler which is designed to investigate the quality of Web Services, for example to evaluate the descriptive quality of the offered Web Services and how well these Web Services are described in today's Web Service Registries.
231 Needed Libraries of the Pica-Pica Web Service Description Crawler
This version of the Pica-Pica Web Service Description Crawler was written by Anton Caceres and Josef Spillner and is programmed in the Python language. In order to run these scripts to parse the HTML pages, it needs two additional libraries: Beautiful Soup and html5lib.
- Beautiful Soup
It is an HTML/XML parser for the Python language, and it can even turn invalid markup into a parse tree [5]. Moreover, the following three features make it more powerful:
- Bad markup doesn't choke Beautiful Soup; in fact, it will generate a parse tree that makes approximately as much sense as the original document, so you can still obtain the data that you want.
- Beautiful Soup provides a toolkit with simple, idiomatic methods for navigating, searching and modifying the parse tree, so you don't need to create a custom parser for every application.
- If the document has already specified an encoding you can ignore it, since Beautiful Soup converts incoming documents to Unicode and outgoing documents to UTF-8 automatically; otherwise all you have to do is specify the encoding of the original document.
Furthermore, the ways of including Beautiful Soup in an application are shown in the following [5]:
from BeautifulSoup import BeautifulSoup          # for processing HTML
from BeautifulSoup import BeautifulStoneSoup     # for processing XML
import BeautifulSoup                             # to get everything
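As a small illustration of the first feature, the following sketch (Beautiful Soup 3 under Python 2) parses a deliberately broken snippet and still navigates it; the markup and the link target are made-up examples.

from BeautifulSoup import BeautifulSoup

# The <a> tag below is never closed, yet Beautiful Soup still builds a
# usable parse tree out of the broken markup.
html = "<html><body><b>WSDL<a href='/service.wsdl'>link</body></html>"
soup = BeautifulSoup(html)
for anchor in soup.findAll('a'):
    print(anchor['href'])  # -> /service.wsdl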
- Html5lib
It is a Python package which implements the HTML5 [8] parsing algorithm. In order to gain maximum compatibility with the current major desktop Web browsers, this implementation is based on the WHATWG (Web Hypertext Application Technology Working Group) HTML5 specification.
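For completeness, a minimal usage sketch of html5lib is given below; the file name is an assumption, and the exact tree builder names may differ between html5lib versions.

import html5lib

# Parse a page with the HTML5 algorithm; the "etree" tree builder
# returns a standard xml.etree element tree that can be walked afterwards.
with open('servicelist.html') as f:
    document = html5lib.parse(f.read(), treebuilder='etree')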
232 Architecture of the Pica-Pica Web Service Description Crawler
Figure 2-4: Architecture of the Pica-Pica Web Service Description Crawler
Figure 2-4 illustrates the primary architecture of the Pica-Pica Web Service Description Crawler. It includes four fundamental components: the Service Page Grabber component, the WSDL Grabber component, the Property Grabber component and the WSML Register component.
(1) The Service Page Grabber component takes the URL seed as input and outputs the link of the service page to the following two components: the WSDL Grabber component and the Property Grabber component.
(2) The WSDL Grabber component is responsible for obtaining the WSDL document based on the delivered service page link, and then checking whether the obtained WSDL document is valid. Finally, only the valid WSDL documents are passed to the WSML Register component for further processing.
(3) The Property Grabber component tries to extract the service properties hosted in the service page, if they exist. After that, all these service properties are saved into an INI file as the information of that service.
(4) The functionality of the WSML Register component is to write appropriate WSML documents by means of the valid WSDL documents delivered from the WSDL Grabber component and the optional INI files delivered from the Property Grabber component, and afterwards to register them in ConQo.
- WSML [9]
It stands for Web Service Modeling Language, which provides a framework with different language variants. Hence, it is often used to describe the different aspects of semantic Web Services according to the conceptual model of WSMO.
- WSMO [10]
WSMO, whose full name is Web Service Modeling Ontology, is dedicated to describing various aspects related to Semantic Web Services based on ontologies. An ontology is a formal, explicit specification of a shared conceptualization. In fact, these ontologies are the sticking points that link the agreement of the communities of users with the defined conceptual semantics of the real world.
- ConQo [11]
It is a discovery framework that considers not only the Quality of Service (QoS) but also context information. It uses a Web Service Repository to manage the service descriptions, which are based on WSML.
233 Implementation of the Pica-Pica Web Service Description Crawler
This section describes the processes of the implementation of the Pica-Pica Web Service Description Crawler in detail.
(1) Firstly, to start the whole crawling process of the Pica-Pica Web Service Description Crawler, an input is needed as the initial seed. In this crawler there are five Web Service Registries, which are listed below; their URL addresses are used as the input seeds for the Pica-Pica Web Service Description Crawler. Moreover, in this version of the Pica-Pica Web Service Description Crawler there is a single Python script for each Web Service Registry, and the crawling processes of these Web Service Registries' Python scripts are executed one after another.
Biocatalogue: http://www.biocatalogue.com
Ebi: http://www.ebi.ac.uk
Seekda: http://www.seekda.com
Service-Repository: http://www.service-repository.com
Xmethods: http://www.xmethods.net
(2) Then, after being fed with the input seed, the process steps into the next component, the Service Page Grabber. At first this component tries to read the data from the Web based on the input seed. Then it establishes a parse tree of the read data by means of the functions of the Beautiful Soup library. After that, the Service Page Grabber component starts to look for the service page link of each service published in the Web Service Registry by means of the functions of the html5lib library. In case the service page link of a single service is found, it first checks whether this service page link is valid or not. Once the service page link is valid, it is passed to the following two components for further processing: the WSDL Grabber component and the Property Grabber component (a compact sketch of this grabbing process follows this list).
(3) When the WSDL Grabber component receives a service page link from its previous component, it sets out to extract the WSDL link address for that service through the parse tree of the data in this service page. Next, this component starts to download the WSDL document of that service via the WSDL link address. Thereafter, the obtained WSDL document is stored on disk. The process of this WSDL Grabber component is carried on continually until no more service links are passed to it. Certainly, not all grabbed WSDL documents are usable: they may contain bad definitions or a bad namespace URI, or be an empty document; what is worse, some are not even in XML format. Hence, in order to pick these out, this component further analyzes the WSDL documents involved and puts all valid documents into a "validWSDLs" folder, whereas the invalid documents are put into a folder named "invalidWSDLs" in order to gather statistical information. Finally, only the WSDL documents in the "validWSDLs" folder are passed to the subsequent component.
(4) Moreover, since some Web Service Registries give additional information about the services, such as availability, service provider and version, the Property Grabber component sets out to extract this information as the service's properties and thereafter saves these properties into an INI file. However, if there is no additional information available, then there is no need to extract service properties, and thus there is no INI file for that service. In this implementation of the Pica-Pica Web Service Description Crawler, only the Python scripts for the Seekda and Service-Repository Web Service Registries have functions to extract the services' properties, while the other three Web Service Registries lack such functions.
(5) Furthermore, it is optional to create a report file which contains the statistical information of this process, such as the total number of services for one Web Service Registry, the number of services whose WSDL document is invalid, etc.
(6) As has been stated, at this point there is a folder with all valid WSDL documents, and there might also be some INI files. Therefore, the task of the WSML Register component is now to generate the appropriate WSML documents from these valid WSDL documents and INI files, and then register them in ConQo.
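The following is a minimal Python sketch of the grabbing loop outlined in steps (2) and (3), using Beautiful Soup 3 under Python 2 as the original scripts do; the '/service/' link pattern and the assumption of relative links are illustrative guesses, and the XML well-formedness test is only a crude stand-in for the real validity checks.

import urllib2
from xml.dom.minidom import parseString
from BeautifulSoup import BeautifulSoup

def grab_service_page_links(seed_url):
    # Service Page Grabber: parse the registry page and collect the
    # internal links that lead to individual service pages.
    soup = BeautifulSoup(urllib2.urlopen(seed_url).read())
    links = []
    for anchor in soup.findAll('a', href=True):
        if '/service/' in anchor['href']:            # assumed link pattern
            links.append(seed_url + anchor['href'])  # assumes relative links
    return links

def grab_wsdl(wsdl_url):
    # WSDL Grabber: download a WSDL document and decide whether it goes
    # into the validWSDLs or the invalidWSDLs folder.
    content = urllib2.urlopen(wsdl_url).read()
    try:
        parseString(content)                     # crude well-formedness check
        return ('validWSDLs', content)
    except Exception:
        return ('invalidWSDLs', content)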
24 Conclusions of the Existing Strategies
This chapter presented three aspects of the existing strategies: the Service-Finder project, Information Extraction techniques and the Pica-Pica Web Service Description Crawler.
The task of this master program is to obtain the available Web services and their related information from the Web. This is in fact a procedure of extracting the needed information for each service, such as the service's WSDL document and its properties. Therefore, the Information Extraction techniques which are used to extract information hosted in the Web can be applied in this master program.
Moreover, Service-Finder is a large project that is not only able to obtain the available Web services and their related information from the Web, but also enriches these crawled data with annotations by means of the Service-Finder ontology and the Service Category ontology, and integrates all the information into a coherent semantic model. Furthermore, it also provides the capabilities for searching and browsing the data with a user interface, and gives users service recommendations. However, for a master program the Service-Finder project far exceeds the requirements; therefore it is just considered as a reference for this master program.
Furthermore, since the Pica-Pica Web Service Description Crawler aims only at obtaining the available Web services and their related information, it fulfills the primary task of this master program. Nevertheless, regarding the information about the services, this Pica-Pica Web Service Description Crawler extracts only a few properties, and sometimes even no property. Consequently, in order to improve the quality of the service information, as many properties about each service as possible have to be extracted. Hence, chapter 3 presents an extension of this Pica-Pica Web Service Description Crawler.
3 Design and Implementation
In the previous chapter on the state of the art, the already existing techniques and implementations were presented. In the following, the basic principle of the proposed approach, the Deep Web Services Crawler, is introduced; it is based on these previously existing techniques, especially the Pica-Pica Web Service Description Crawler.
31 Deep Web Services Crawler Requirements
This section mainly discusses the goals of the Deep Web Service Crawler approach, the system requirements with respect to the approach, and some non-functional requirements.
311 Basic Requirements for DWSC
The following list contains the basic requirements which should be achieved:
(1) Produce the largest annotated Service Catalogue. Here, the Service Catalogue is actually a list of Web services published in the Web Service Registries. It contains not only the WSDL document of each service but also a file with the information about that service. Therefore, for the purpose of producing the largest annotated Service Catalogue, this proposed approach needs to extract as many properties about those Web services as possible. Moreover, it also has to download the WSDL document hosted along with each Web service. That is to say, these properties are not only the interesting structured properties, such as the service name and its WSDL link address, but also other non-functional properties, for example the endpoint and its monitoring information.
(2) Flexible storage: each service has a Service Catalogue entry that contains all its interesting properties. How to deal with those service properties, i.e. what kind of scheme will be used to store them, is a major question. Hence, in order to store them in a flexible way, this proposed approach provides three methods for the storage (see the sketch below): the first stores them as an XML file, the second stores them in an INI file, and the third uses a database for the storage.
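To illustrate the second of the three storage variants, here is a minimal sketch of writing a service's properties into an INI file; it is given in Python for brevity (the thesis implementation itself is in Java), and the property names and values are taken from the demonstration example later in this chapter.

import ConfigParser  # the module is called 'configparser' on Python 3

def store_as_ini(path, properties):
    # Write one service's property dictionary into an INI file:
    # one [Service] section with a key = value pair per property.
    config = ConfigParser.RawConfigParser()
    config.add_section('Service')
    for key, value in properties.items():
        config.set('Service', key, value)
    with open(path, 'w') as f:
        config.write(f)

store_as_ini('BLZService.ini', {
    'Service Name': 'BLZService',
    'Rating': 'Four stars and A Half',
})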
312 System Requirements for DWSC
Generally speaking, the requirements needed for carrying out a programming project contain the following:
1) Operating System: Linux/Unix, Windows (XP, Vista, 2000, etc.)
2) Programming language: C++, Java, Python, C, etc.
3) Programming Tool: NetBeans, Eclipse, Visual Studio, and so on
In this master thesis, the scripts of the Deep Web Service Crawler approach are written in the Java programming language. Besides, these code scripts have only been tested on the Windows XP and Linux operating systems, and have not been tested on other operating systems.
313 Non-Functional Requirements for DWSC
In this part, several non-functional requirements for the Deep Web Service Crawler approach are presented:
1) Transparency: the process of data exploration and data storage should be done automatically, without the user's intervention. However, at first the user should specify the path on the hard disk which will be used to store the outputs of this program.
2) Fault tolerance: during the execution of this program, errors can inevitably happen. Therefore, in order to keep the process from being interrupted, there must be some necessary error handling for recovering the process (see the sketch after the following list).
3) Completeness: this approach should extract the interesting properties about each Web Service as completely as possible, e.g. endpoint and monitoring information, etc.
In addition, since the Pica-Pica Web Service Crawler has already implemented strategies for the following five URLs, the proposed approach must support no fewer than these five Web Service Registries:
Biocatalogue: http://www.biocatalogue.com
Ebi: http://www.ebi.ac.uk
Seekda: http://www.seekda.com
Service-Repository: http://www.service-repository.com
Xmethods: http://www.xmethods.net
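A minimal sketch of the fault-tolerance requirement, again in Python for brevity: a download helper that retries a few times before giving up on a single URL, so one transient network error does not interrupt the whole crawl. The retry counts and delays are arbitrary illustrative values.

import time
import urllib2

def fetch_with_retry(url, attempts=3, delay=5):
    # Try the download several times; log each failure and keep going,
    # so the surrounding crawling process is never interrupted.
    for i in range(attempts):
        try:
            return urllib2.urlopen(url, timeout=30).read()
        except Exception as e:
            print('attempt %d on %s failed: %s' % (i + 1, url, e))
            time.sleep(delay)
    return None  # give up on this URL but let the crawl continue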
32 Deep Web Services Crawler Architecture
In this section, an overview of the high-level architecture of the Deep Web Services Crawler approach is introduced first. Thereafter, four subsections are presented that focus on outlining each single component and how they play together.
The current components and flows of data in the Deep Web Service Crawler can be summarized as depicted in Figure 3-1, using the continuous arrows. It first tries to obtain the available service list page links and related service page links by crawling the Web (Web Service Extractor). Then those gathered links are processed in two separate steps: one gets the service's WSDL document (WSDL Grabber), and the other collects the properties of each service (Property Grabber). Finally, all these data are stored in the storage device (Storage). The whole detailed process in figure 3-1 is illustrated as follows:
- Step 1:
When the Deep Web Service Crawler starts to run, the File Chooser container requires the user to specify a path on the computer or on any other hard disk. The reason for specifying the path is that this Deep Web Service Crawler program needs a place to store all its outputs.
- Step 2:
After that, the Web Service Extractor is triggered. It is the main entry to the specific crawling process. Since the Deep Web Service Crawler program is a procedure that is supposed to crawl for Web Services in some given Web Service Registries, the URL addresses of these Web Service Registries are given as the initial seeds for this Web Service Extractor process. However, since the page structures of these Web Service Registries are completely different, there is a separate process for each Web Service Registry.
Figure 3-1: Overview of the basic architecture of the Deep Web Services Crawler
- Step 3:
According to the given seed, two types of links are obtained by the Web Service Extractor component: the service list page link and the service page link. A service list page is a page that contains a list of Web Services and maybe some information about these Web Services, while a service page is a page that contains much more information about a single service. Finally, it forwards these two types of links to the next two components: the Property Grabber and the WSDL Grabber.
- Step 4:
Then, on the one hand, the Property Grabber component tries to gather the information about the service hosted in the service list page and the service page, such as the name of the service, its description, the ranking of this service, etc. Finally, all the information about the service is collected together as the service properties, which are then delivered to the Storage component for further processing.
- Step 5:
On the other hand, the WSDL Grabber tries to obtain the WSDL link from the service list page or the service page. That is because for some Web Service Registries the WSDL link is hosted in the service list page, as in the Biocatalogue, while for other Web Service Registries it is hosted in the service page, as in Xmethods. After obtaining the WSDL link, it is also transmitted to the Storage component for further processing.
- Step 6:
When the service properties information and the WSDL link of the service are received by the Storage component, it stores them on disk. The service properties are stored on disk in one of three different ways: as an XML file, as an INI file, or as one record inside a database table. For the WSDL link, the Storage component first tries to download the page content according to the URL address of the WSDL link. If this succeeds, the page content of the service is stored as a WSDL document on disk.
- Step 7:
Nevertheless, this is just a single service crawling process, from step 3 to step 6. Hence, if there is more than one service, or more than one service list page, in those Web Service Registries, the crawling process from step 3 to step 6 is repeated again and again until there are no more services or service list pages in that Web Service Registry.
- Step 8:
Furthermore, after the crawling process for one Web Service Registry finishes, a file is generated that contains some statistical information about this crawling process, for example: the times when the crawling process of this Web Service Registry started and finished, the total number of Web services in this Web Service Registry, how many services have an empty WSDL document, the average number of service properties in this Web Service Registry, and the average time cost for extracting service properties, getting the WSDL document and generating the XML file, INI file, etc.
321 The Function of Web Service Extractor Component
The Web Service Extractor component is responsible for gathering service list page links and service page links. It pursues a focused crawl of the Web and only forwards service list page and service page links to the subsequent components for analyzing, collecting and gathering purposes. Therefore, it identifies both service list page links and related service page links on these Web Service Registries.
As can be seen in figure 3-2, a crawl for Web Services needs to start from a URL seed. The seed is almost as important as the Web Service Extractor itself, as it highly influences the part of the Web which needs to be crawled. The seed can or shall contain, e.g., Web pages where these Web Services are published, or pages that talk about Web Services.
Figure 3-2: Overview of the process flow of the Web Service Extractor Component
After being fed with the URL seed, the Web Service Extractor component starts to get the link of the service list page from the initial page of this URL seed. However, this process differs among the five Web Service Registries. The following shows the different situations in these Web Service Registries:
- Service-Repository Web Service Registry:
In this Web Service Registry, the link of the first service list page is the URL address of its seed, which means some Web Services can be found on the home page of the Service-Repository Web Service Registry. The "first" here implies that there is more than one service list page link in this Web Service Registry. Therefore, the process of getting service list page links in this registry is carried on continually until no more service list page links exist.
- Xmethods Web Service Registry:
Although there are Web Services on the home page of the Xmethods Web Service Registry, they are only a small subset of those in this Web Service Registry. Moreover, in the Xmethods Web Service Registry there is only one page containing all Web Services. Therefore, the service list page link for that page has to be obtained.
- Ebi Web Service Registry:
The situation in the Ebi Web Service Registry is a little bit like that in the Xmethods Web Service Registry. That is to say, there is also one page that contains all Web Services in this Web Service Registry. However, this page is not the initial page of the input seed; therefore, more than one operation step is needed to get the service list page link of that page.
- Seekda Web Service Registry:
In the Seekda Web Service Registry, the Web Services are not contained in the initial page of the input seed. The service list page link can be obtained after several additional operation steps.
However, there is a problem with getting the service list page links in this registry. Simply put, if there is more than one page containing Web Services, then for some unknown reason the links of the remaining service list pages cannot be obtained; in other words, only the link of the first service list page can be obtained.
- Biocatalogue Web Service Registry:
The process of getting service list pages in the Biocatalogue Web Service Registry is almost the same as in the Seekda Web Service Registry. The only difference is that in the Biocatalogue Web Service Registry all service list page links can be obtained if there is more than one service list page.
Then, after getting the link of a service list page, the Web Service Extractor begins to get the link of the service page for each service listed in the service list page (see the sketch below). The reason why it can do this is that there is an internal link for every service which leads to the service page. It is worth noting that once a service page link is obtained, this component immediately forwards the service page link and the service list page link to the subsequent two components for further processing. Nevertheless, the process of obtaining service page links is carried out continuously until all services listed in that service list page are crawled. Analogously, the process of getting service list pages is also carried out continuously until no more service list pages exist.
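A minimal sketch of this two-level loop, again with Beautiful Soup 3 under Python 2; the assumption that each service list page links to its successor through an anchor labeled 'next' is an illustrative guess, since the real pagination markup differs per registry.

import urllib2
from BeautifulSoup import BeautifulSoup

def iterate_service_list_pages(first_list_page_url):
    # Walk the chain of service list pages and yield the parsed soup of
    # every list page, until no further page exists.
    url = first_list_page_url
    while url:
        soup = BeautifulSoup(urllib2.urlopen(url).read())
        yield soup
        nxt = soup.find('a', text='next')          # assumed pagination label
        url = nxt.parent['href'] if nxt else None  # assumes absolute links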
3211 Features of the Web Service Extractor Component
The main features are described in the following paragraphs.
1) Obtain service list page links
A central task is to obtain the corresponding service list page links. A service list page link is a URL address that leads to a public list of Web Services with just some simple information about these Web Services, like the name of the service, an internal URL that links to another page which contains the detailed information about that service, and sometimes the link address of the WSDL document.
2) Obtain service page links
Once service list page links are found, a crucial aspect is to extract the internal link of each service to aid the task of service information discovery. Therefore, it is the task of the Web Service Extractor to harvest the HTML page content of the service list page, so that the service page links, which contain much more detailed information about the single Web service, can be obtained.
3212 Input of the Web Service Extractor Component
This component depends on some specific input seeds, and the only input required for this component is a URL seed. This URL seed will be one of the URLs listed in section 313.
3213 Output of the Web Service Extractor Component
The component will produce two types of service-related page links from the Web:
- Service list page links
- Service page links
3214 Demonstration for Web Service Extractor
In order to give a comprehensive understanding of the process of the Web Service Extractor component, the following figures are given for explanation. Though there are five URL addresses in this section, only the URL of the Service-Repository is shown as an example.
1) The input seed is the initial URL address of the Service-Repository, which is "http://www.service-repository.com".
2) As already stated in section 321, the first service list link of this Web Service Registry is its input seed, "http://www.service-repository.com". Figure 3-3 shows the corresponding service list page of that link.
Figure 3-3: Service list page of the Service-Repository
Figure 3-4: Original source code of the internal link for the Web service "BLZService"
Figure 3-5: Code overview of getting the service page link in the Service-Repository
Figure 3-6: Service page of the Web service "BLZService"
3) Now that the service list page link is already known, the next step is to acquire the service page links of the services listed in the service list page. The text in the red box of figure 3-4 shows the internal link of the Web service "BLZService". However, this is not the complete link of the service page: it has to be prefixed with the initial URL address of the Service-Repository Web Service Registry, "http://www.service-repository.com". The code for getting the service page link of a Web service in the Service-Repository is shown in figure 3-5. Therefore, the final link of this service page is "http://www.service-repository.com/service/overview/-210897616". Figure 3-6 is the corresponding service page of that link.
4) Afterwards, the two links, the service list page link and the service page link, which are gathered by the Web Service Extractor component, are immediately forwarded to the next two components, the WSDL Grabber component and the Property Grabber component.
322 The Function of WSDL Grabber Component
The WSDL Grabber component acquires the WSDL link hosted in the Web, based on the service list page link or the service page link. The whole process flow is illustrated in figure 3-7.
Figure 3-7: Overview of the process flow of the WSDL Grabber Component
When the WSDL Grabber component receives the inputs delivered from the previous component, it starts to get the WSDL link for the service based on these inputs. However, although the inputs of the WSDL Grabber component are the links of the service page and the service list page, only one of them contains the WSDL link for the corresponding service. That is to say, the WSDL link exists either in the service page or in the service list page. The reason why both links need to be delivered to this component is that only one of the five Web Service Registries, namely Biocatalogue, hosts the WSDL link in the service list page, while for the other four Web Service Registries the WSDL link is hosted in the service page. Therefore, the WSDL link for these four Web Service Registries is obtained from the service page link, while for the Biocatalogue Web Service Registry the WSDL link is obtained through the service list page link. However, there is a problem with getting the WSDL links of the Biocatalogue Web Service Registry. In brief, some of the Web services listed in the service list page of the Biocatalogue Web Service Registry have no WSDL link; in other words, these services have no WSDL document. In a situation like this, the value of the WSDL link for these Web services is assigned a "NULL" value. Nevertheless, for the Web Services in the other four Web Service Registries, the WSDL link always exists in the service page. Eventually, whenever the WSDL Grabber component extracts the WSDL link for a single Web service, the link is immediately forwarded to the Storage component so that the WSDL document can be downloaded at once.
3221 Features of the WSDL Grabber Component
The WSDL Grabber component has to provide the following feature:
- Obtain WSDL links
The WSDL link is the direct way to get to the page that contains the contents of the WSDL document. It is actually a URL address, but at the end of this URL address there is something like "wsdl" or "WSDL" to indicate that this address leads to the page of a WSDL document.
3222 Input of the WSDL Grabber Component
This component requires the following input data:
- Service list page link
- Service page link
3223 Output of the WSDL Grabber Component
The component will only produce the following output data:
- The URL address of the WSDL link for each service
3224 Demonstration for WSDL Grabber Component
In this section, a list of figures is presented in order to give a comprehensive understanding of the process of the WSDL Grabber component. Without a doubt, it uses the same Web service of the Service-Repository as an example, too.
1) The input for this WSDL Grabber component is the link of the service page obtained from the Web Service Extractor component. The address of this link is "http://www.service-repository.com/service/overview/-210897616".
Figure 3-8: WSDL link of the Web service "BLZService" in the service page
2) Figure 3-9 shows the corresponding original source code of the WSDL link highlighted in figure 3-8.
Figure 3-9: Original source code of the WSDL link for the Web service "BLZService"
3) Figure 3-10 and figure 3-11 show the code used to extract the WSDL link shown in figure 3-9. However, figure 3-10 is the particular code only for the Service-Repository Web Service Registry; for the other four Web Service Registries it is different. The function "getServiceRepositoryWSDLLink" first gets a list of nodes that have the same HTML tag name "b". Then it checks all of the nodes one by one to see if the text value of a node is "WSDL". If a node fulfills this condition, then the attribute value of its sibling, which is an "a" tag here, is extracted as the value of the WSDL link for this Web service (a rough Python equivalent is sketched below).
Figure 3-10: Code overview of the "getServiceRepositoryWSDLLink" function
Figure 3-11: Code overview of the "oneParameter" function
4) Finally, after applying the functions shown in figures 3-10 and 3-11, the WSDL link for the Web service "BLZService" can be obtained. Its value is "http://services.unitedplanet.de/blz/BlzService.asmx?WSDL".
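For illustration, a rough Python equivalent of the Java logic described in step 3), written with Beautiful Soup 3; the function name mirrors the one in figure 3-10, but the code itself is a reconstruction, not the thesis implementation.

from BeautifulSoup import BeautifulSoup

def get_service_repository_wsdl_link(service_page_html):
    # Find the <b> node whose text is 'WSDL' and read the href of the
    # neighbouring <a> tag, as figures 3-10 and 3-11 describe.
    soup = BeautifulSoup(service_page_html)
    for bold in soup.findAll('b'):
        if bold.string and bold.string.strip() == 'WSDL':
            anchor = bold.findNext('a')
            if anchor is not None:
                return anchor['href']
    return None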
323 The Function of Property Grabber Component
The Property Grabber component is a module used to extract and gather all the Web service information hosted in the Web, which is in fact the information shown in the service list page and the service page. In the end, all the obtained Web service information is collected together as the service properties, which are delivered to the Storage component for storing. The detailed process flow of the Property Grabber component is illustrated in figure 3-12.
The inputs of the Property Grabber component are the same as the inputs of the WSDL Grabber component. However, there is still a little difference between them with respect to the seeds. As already mentioned in section 322, for the WSDL Grabber component only one of the inputs is sufficient to get the WSDL link, while the Property Grabber component needs both of them as seeds. Therefore, once the Property Grabber component receives the needed inputs, it starts to extract the information of that single Web service.
Figure 3-12: Overview of the process flow of the Property Grabber Component
After the Property Grabber component receives the inputs, it starts to extract the service information for the Web service. Generally speaking, the service information consists of four aspects: structured information, endpoint information, monitoring information and Whois information.
(1) Structured Information
The structured information can be obtained by extracting the information hosted in the service page and the service list page. It is the basic descriptive information about the service, such as the name of the service, the URL address through which the WSDL document can be obtained, the description introducing the service, the provider who provides that service, its rating, and the server which owns this service, etc. However, the elements constituting this structured information are diverse across the different Web Service Registries. For example, the rating information of the Web service exists in the Service-Repository Web Service Registry, while the Xmethods Web Service Registry does not have this information. In addition, even for Web services in the same Web Service Registry, some elements of the structured information may not exist. For instance, one service in a Web Service Registry may have description information while another service in the same Web Service Registry does not. Tables 3-1 to 3-5 show the structured information that should be extracted for these five Web Service Registries. Moreover, if the style of a Biocatalogue Web service is SOAP, there is some additional information describing the SOAP operations of this service, while if it is REST, this additional information concerns the REST operations. These should also be considered as a part of the structured information. Table 3-6 and table 3-7 illustrate the information for these two different operation types.
Service Name WSDL Link WSDL Version
Provider Server Rating
Homepage Owner Homepage Description
Table 3-1: Structured Information of the Service-Repository Web Service Registry
Service Name WSDL Link Provider
Service Style Homepage Implementation Language
Description User Description Contributed Client Name
Type of this Client Publisher for this Client Used Toolkit of this Client
Used Language of this Client Used Operating System of this Client
Table 3-2: Structured Information of the Xmethods Web Service Registry
Service Name WSDL Link Server
Provider Providerrsquos Country Service Style
Rating Description User Description
Service Tags Documentation (within WSDL)
Table 3-3: Structured Information of the Seekda Web Service Registry
Service Name WSDL Link Port Name
Service URL Address Implementation Class
Table 3-4: Structured Information of the Ebi Web Service Registry
Service Name WSDL Link Style
Provider Providerrsquos Country View Times
Favorite Times Submitter Service Tags
Total Annotation Provider Annotation Member Annotation
Registry Annotation Base URL SOAP Lab Server Base URL
Description User Description Category
Table 3-5: Structured Information of the Biocatalogue Web Service Registry
SOAP Operation Name Inputs and Outputs Operation Description
Operation Tags Part of Which Service
Table 3-6: SOAP Operation Information of the Biocatalogue Web Service Registry
REST Operation Name Service Tags Used Template
Operation Description Part of Which Service Part of Which Endpoint Group
Table 3-7: REST Operation Information of the Biocatalogue Web Service Registry
(2) Endpoint Information
Endpoint information is the port information of the Web service, which can be extracted only from the service page. However, since different Web Service Registries have different structures for the endpoint information of a Web service, some elements of the endpoint information are very diverse. One thing has to be noted: the Ebi Web Service Registry does not have endpoint information for any of the Web services published in this registry. Moreover, although the Web services within one Web Service Registry have the same structure of endpoint information, some elements of the endpoint information may be missing or empty. Furthermore, these Web Service Registries may even have no endpoint information at all for some Web services they publish. Nevertheless, whenever there is endpoint information for a Web service, there is at least one element, namely the URL address of the endpoint. Table 3-8 shows the endpoint information that should be extracted for these five Web Service Registries.
Web Service Registry Name and the Elements of the Endpoint Information:
Service-Repository: Endpoint Name, Endpoint URL, Endpoint Critical, Endpoint Type, Bound Endpoint
Xmethods: Endpoint URL, Publisher of this Endpoint, Contact Email of this Publisher, Implementation Language of this Endpoint
Seekda: Endpoint URL
Biocatalogue: Endpoint Name, Endpoint URL
Table 3-8: Endpoint Information of these five Web Service Registries
Web Service Registry Name and the Elements of the Monitoring Information:
Service-Repository: Service Availability, Number of Downs, Total Uptime, Total Downtime, MTTR, MTBF, RTT Max of Endpoint, RTT Min of Endpoint, RTT Average of Endpoint, Ping Count of Endpoint
Seekda: Service Availability, Begin Time of Monitoring
Biocatalogue: Monitored Status of Endpoint, Overall Status of Service
Table 3-9: Monitoring Information of these five Web Service Registries
(3) Monitoring Information
Monitoring information is the measured statistical information for the Web service. It is worth noting that the Ebi and Xmethods Web Service Registries do not have monitoring information for any of the Web services published by them, while for the other three Web Service Registries only a few Web services may lack this information. Table 3-9 displays the monitoring information for these three Web Service Registries.
(4) Whois Information
Whois information is not extracted from the information hosted in the service page and service list page. It is the descriptive information for the service domain, which can be gained by means of the address of the WSDL link. Because of that, the process of getting the Whois information starts with obtaining the service domain. The final value of the service domain must not contain strings like "http", "https", "www", etc.; it must belong to the top-level domain (see the sketch after table 3-10). After that, the service domain database is queried by sending the value of the service domain to a Whois client, which is just a Web site on the Internet, for example "http://www.whois365.com/cn/domain". Then, if information about that service domain exists, a list of its information is returned as the output. However, the structure of the returned information differs with respect to the service domain. Therefore, the most challenging thing is that the extracting process has to deal with each different form of the returned information. Table 3-10 gives the Whois information that needs to be extracted for all five Web Service Registries.
Service Domain URL Domain Name Domain Type
Domain Address Domain Description State
Postal Code City Country
Country Code Phone Fax
Email Organization Established Time
Table 3-10: Whois Information for these five Web Service Registries
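A minimal sketch of the domain-reduction step described above, in Python; the 'last two labels' heuristic is an illustrative simplification that ignores multi-part top-level domains such as co.uk.

from urlparse import urlparse  # 'urllib.parse' on Python 3

def service_domain(wsdl_link):
    # Reduce a WSDL link to the service domain: strip the scheme, the
    # 'www.' prefix, the port and the path, keeping the top-level part.
    host = urlparse(wsdl_link).netloc.split(':')[0]
    if host.startswith('www.'):
        host = host[len('www.'):]
    return '.'.join(host.split('.')[-2:])

print(service_domain('http://www.thomas-bayer.com/axis2/services/BLZService?wsdl'))
# -> thomas-bayer.com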
Finally, all the information of these four aspects is collected together and then delivered to the Storage component for further storage processing.
3231 Features of the Property Grabber Component
The Property Grabber component has to provide the following features:
- Obtain basic information
Generally speaking, the more information one Web service has, the better one can judge how good this Web service is. Hence, it is necessary for this Property Grabber component to extract all the basic information hosted in the service list page and the service page. This basic information contains the structured information, the endpoint information and the monitoring information.
- Obtain Whois information
For the same reason, that more information about a Web service allows its quality to be judged better, it is necessary to extract as much information about the Web service as possible. Therefore, besides the basic information, the Property Grabber component also obtains some additional information called Whois information, such as the type of the domain, the name of the person in charge, the postal code of the domain, city, phone, fax, detailed address, email, etc.
3232 Input of the Property Grabber Component
This component requires the following input data:
- Service list page link
- Service page link
3233 Output of the Property Grabber Component
The component will produce the following output data:
- Structured information of each service
- Endpoint information about each service, if it exists
- Monitoring information for the service and endpoint, if it exists
- Whois information of the service domain
All this information is collected together as the properties of each service. Thereafter, the collected properties are sent to the Storage component.
3234 Demonstration for Property Grabber Component
The pictures from figure 3-13 to figure 3-16 depict the fundamental and primary procedure of the Property Grabber component. To simplify the explanation, the example shown in this section uses the same Web service as before.
1) The inputs of the Property Grabber component are the links of the service list page and the service page received from the Web Service Extractor component. These links are "http://www.service-repository.com" and "http://www.service-repository.com/service/overview/-210897616".
2) When the "GetProperty" function is triggered, it calls its four sub-functions shown in figure 3-12.
Figure 3-13: Structured properties of the service "BLZService" in the service list page
Figure 3-14: Structured properties of the service "BLZService" in the service page
3) Firstly, the "getStructuredProperty" function tries to extract the structured information displayed in the red boxes of figure 3-13 and figure 3-14. However, several elements of the structured information have the same content, like the contents of the description shown in the service page and in the service list page. Hence, in order to save extraction time and storage space, elements with the same content are only extracted once. Moreover, a transformation from non-descriptive content to descriptive text is needed for the rating information, because its contents are several star images (see the sketch below). The final results of the extracted structured information for this Web service are shown in table 3-11. Because there is no descriptive information for the provider, the homepage and the owner homepage, their values are assigned as "NULL".
Service Name: BLZService
WSDL Link: http://www.thomas-bayer.com/axis2/services/BLZService?wsdl
WSDL Version: 0
Server: Apache-Coyote/1.1
Description: BLZService
Rating: Four Stars and a Half
Provider: NULL
Homepage: NULL
Owner Homepage: NULL
Table 3-11 Extracted Structured Information of Web Service "BLZService"
4) Secondly, the "getEndpointProperty" function extracts the endpoint information displayed in the red box of figure 3-15. Although there is more than one endpoint record in the list, only one of them is extracted as the endpoint information. The reason is that this master program is intended to extract as much information as possible while avoiding redundant information; since these records describe the same endpoint, one of them is sufficient. Table 3-12 shows the final results of the endpoint information of this Web service.
Figure 3-15 Endpoint information of the Web service "BLZService" in the service page
Endpoint Name: BLZServiceSOAP12port_http
Endpoint URL: http://www.thomas-bayer.com:80/axis2/services/BLZService
Endpoint Critical: True
Endpoint Type: production
Bound Endpoint: BLZServiceSOAP12Binding
Table 3-12 Extracted Endpoint Information of the Web service "BLZService"
5) Then the monitoring information is extracted by invoking the "getMonitoringProperty" function. Figure 3-16 displays two types of monitoring properties: the upper red box contains the monitoring information about the Web service itself, and the lower red box lists the monitoring information for its endpoints. As already mentioned, only one endpoint statistic record is extracted. Besides, as can be seen from figure 3-16, there are two availability values; both represent the availability of this Web service, just like the availability shown in figure 3-14, so one of them is sufficient. Table 3-13 shows the final results of this extraction process.
Figure 3-16 Monitoring Information of the Service "BLZService" in the service page
Service Availability: 100 %
Number of Downs: 0
Total Uptime: 1 day 19 hours 19 minutes
Total Downtime: 0 seconds
MTBF: 1 day 19 hours 19 minutes
MTTR: 0 seconds
RTT Max of Endpoint: 141 ms
RTT Min of Endpoint: 0 ms
RTT Average of Endpoint: 577 ms
Ping Count of Endpoint: 112
Table 3-13 Extracted Monitoring Information of the Web service "BLZService"
6) Next, in order to extract the Whois properties, the "getWhoisProperty" function first has to obtain the service domain from the WSDL link. For this Web service, the obtained service domain is "thomas-bayer.com". It then sends this service domain as input to the Whois client for the querying process, which returns a list of information about that service domain; see figure 3-17. Table 3-14 shows the extracted Whois information.
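The implementation of this step is not reproduced here; the following is a minimal sketch of how the service domain could be derived from the WSDL link, assuming a simple last-two-labels heuristic (the method name and the heuristic are illustrative, not the author's actual code, and the heuristic would need refinement for country-specific second-level domains):

    import java.net.URL;

    // Hypothetical helper: derive the service domain from a WSDL link.
    // For "http://www.thomas-bayer.com/axis2/services/BLZService?wsdl"
    // it returns "thomas-bayer.com".
    static String serviceDomain(String wsdlLink) throws Exception {
        String host = new URL(wsdlLink).getHost();   // e.g. www.thomas-bayer.com
        String[] labels = host.split("\\.");
        int n = labels.length;
        if (n <= 2) {
            return host;                             // already a bare domain
        }
        return labels[n - 2] + "." + labels[n - 1];  // keep the last two labels
    }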
Figure 3-17 Whois Information of the service domain "thomas-bayer.com"
Service Domain URL: thomas-bayer.com
Domain Name: Thomas Bayer
Domain Type: NULL
Domain Address: Moltkestr. 40
Domain Description: NULL
State: NULL
Postal Code: 54173
City: Bonn
Country: NULL
Country Code: DE
Phone: +4922855525760
Fax: NULL
Email: info@predic8.de
Organization: predic8 GmbH
Established Time: NULL
Table 3-14 Extracted Whois Information of service domain "thomas-bayer.com"
7) Finally, the information from all four aspects is collected together as the service properties, and these service properties are then forwarded to the Storage component.
3.2.4 The Function of the Storage Component
The Storage component uses the WSDL link received from the WSDL Grabber component to download the WSDL document from the Web and then store it on disk. In addition, the service properties received from the Property Grabber component are stored on disk in three different formats by this Storage component. Figure 3-18 illustrates the process flow of the Storage component.
When the Storage component receives the required inputs, its mediator function "Storager" is triggered. It transforms the service properties into three different output formats and stores them on disk; these output formats are an XML file, an INI file, and database records. It also tries to download the WSDL document from the URL address of the WSDL link and stores the obtained WSDL document on disk as well. The "Storager" function is composed of four sub functions, "getWSDL", "generateXML", "generateDatabase", and "generateINI", each of which is in charge of one aspect of the storage task.
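A minimal sketch of such a mediator, assuming the sub function signatures used in the illustrative sketches later in this section (all names and parameters are assumptions, not the author's actual code):

    import java.sql.Connection;
    import java.util.Vector;

    // Hypothetical "Storager" mediator: fans the inputs out to the four
    // storage sub functions described in this section.
    static void storager(String name, String wsdlLink, Vector<PropertyStruct> vec,
                         String path, int securityInt, Connection conn) throws Exception {
        getWSDL(name, wsdlLink, path, securityInt);   // WSDL document to disk
        generateXML(vec, path, name, securityInt);    // properties as XML file
        generateINI(vec, path, name, securityInt);    // properties as INI file
        generateDatabase(vec, conn);                  // properties as a database record
    }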
Figure 3-18 Overview of the process flow of the Storage Component
(1) "getWSDL" sub function
The task of the "getWSDL" sub function is to download the WSDL document and then store it on disk. First of all, it has to obtain the content of the WSDL document, which proceeds as follows. The "getWSDL" sub function checks whether the value of the received WSDL link equals "NULL". As presented in section 3.2.2, if a Web service has no WSDL link, the value of its WSDL link is assigned "NULL". In that case, the sub function creates a WSDL document whose name is the service name appended with the mark "No WSDL Document"; this document contains no content, it is an empty document. If the service does have a WSDL link, the sub function tries to connect to the Internet using the URL address of the WSDL link. If it succeeds, the content hosted on the Web is downloaded, stored on disk, and named with the name of the service only. Otherwise, it creates a WSDL document whose name is prefixed with "Bad".
(2) "generateXML" sub function
The "generateXML" sub function takes the service properties as input, transforms them into an XML file, and stores it on disk under the name of the service plus the extension ".xml". XML stands for eXtensible Markup Language, a markup language designed to transport and store data. The first line of an XML file is the XML declaration, which defines the XML version and the encoding used; for example, <?xml version="1.0" encoding="UTF-8"?> means this is an XML file of version 1.0 encoded in the 8-bit Unicode Transformation Format character set. Besides, an XML file contains XML elements, which span everything from an element's start tag to its end tag. An XML element can in turn contain other elements, simple text, or a mixture of both. Finally, an XML file must contain a root element that is the parent of all other elements.
(3) "generateINI" sub function
The "generateINI" sub function also takes the service properties as input, but transforms them into an INI file and stores it on disk under the name of the service plus the extension ".ini". "ini" stands for initialization; the INI file format is a de facto standard for configuration files. INI files are simple text files with a basic structure consisting of three different parts: sections, parameters, and comments. The parameter is the basic element of an INI file; its format is a key-value pair (also called a name-value pair), delimited by an equals sign "=", with the key or name always on the left of the equals sign. A section groups its parameters together like a room; it always appears on a single line inside a pair of square brackets "[ ]", and sections may not be nested. A comment is descriptive text beginning with a semicolon ";"; anything between the semicolon and the end of the line is ignored.
(4) "generateDatabase" sub function
The inputs of the "generateDatabase" sub function are the same as those of the previous two sub functions. Instead of transforming them into a file like XML or INI, the "generateDatabase" sub function turns them into database records using SQL statements. SQL stands for Structured Query Language, a database language designed for accessing and manipulating data in a database. Most actions performed on a database are done with SQL statements; the primary statements of SQL include insert into, delete, update, select, create, alter, and drop. Therefore, in order to transform the service properties into database records, this sub function first has to create a database with the "create database" statement and then a table to store the data. A table is a collection of related data entries and consists of columns and rows. Since the data for all five Web Service Registries is not very large, one database table is enough for storing the service properties. For this to work, the field names of the service properties in the columns must be uniform and well-defined across all five Web Service Registries. Afterwards, the service properties of each single service can be put into the table as one record with the SQL "insert into" statement.
3.2.4.1 Features of the Storage Component
The Storage component has to provide the following features:
• Generate different output formats
The final goal of this master program is to store the information about the services on disk for future work. This Storage component provides three different formats for storing the service properties of the services published in the Web Service Registries, which makes the storage of the services both flexible and durable.
• Obtain the WSDL document
The important task of this component is to check whether the WSDL document can be obtained from the WSDL link, because the WSDL document plays a decisive role in determining the quality of the service. This Storage component provides the ability to deal with the different situations that arise while obtaining the WSDL document.
3.2.4.2 Input of the Storage Component
This component requires the following input data:
• WSDL link of each service
• Property information of each service
3.2.4.3 Output of the Storage Component
The component produces the following output data:
• WSDL document of the service
• XML file, INI file, and database records
3.2.4.4 Demonstration for Storage Component
The following figures show the fundamental implementation code of this Storage component. The detailed description is as follows:
1) As can be seen from figures 3-19 to 3-23, the implementation code of these sub functions has several aspects in common. The first common aspect is the pair of parameters defined in each sub function, "path" and "SecurityInt". The parameter "path" is the absolute path on the computer disk; it is used for the storing procedure and has already been specified by the user at the beginning of the whole program. The parameter "SecurityInt" is an increasing integer that is used as part of the name of the service; it prevents services with the same name from overwriting each other on disk. The content marked in red in these figures is the second common aspect; its function is to create a file or document for the service with the corresponding parameters.
2) Figure 3-19 displays the implementation code of the "getWSDL" sub function, which is designed to get the WSDL document of a service based on its WSDL link. The parameter "name" is the name of the service and is used as the name of the WSDL document. The parameter "linkStr" is the actual WSDL link, the most important parameter of this sub function. The other two parameters, "statistic" and "log", are objects for the text files "Statistic Information" and "Log Information" respectively. The "Statistic Information" text file records the statistical data of the services in each Web Service Registry, such as the overall number of properties for that registry, the overall number of services, the number of services that have no WSDL link, the number of services whose WSDL documents contain no content, the number of services whose WSDL links are not available, and so on. The "Log Information" text file records the results of the process steps and the problems encountered, for example which service is currently being crawled, which Web Service Registry it belongs to, the reason why the WSDL document of a service cannot be obtained, the reason why a service is unreachable, and so on.
Figure 3-19 Implementation code for getting the WSDL document
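Since the screenshot itself cannot be reproduced here, the following is a minimal sketch of what such a "getWSDL" sub function could look like, assuming the parameters described above (the "statistic" and "log" bookkeeping is omitted; all names and the exact error handling are illustrative, not the author's actual code):

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileWriter;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.io.PrintWriter;
    import java.net.URL;

    // Hypothetical sketch of the "getWSDL" sub function.
    static void getWSDL(String name, String linkStr, String path, int securityInt) {
        String base = path + File.separator + securityInt + name;
        try {
            if (linkStr == null || linkStr.equals("NULL")) {
                // Service without a WSDL link: create an empty, marked document.
                new File(base + "[No WSDL Document].wsdl").createNewFile();
                return;
            }
            try (BufferedReader in = new BufferedReader(
                     new InputStreamReader(new URL(linkStr).openStream()));
                 PrintWriter out = new PrintWriter(new FileWriter(base + ".wsdl"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    out.println(line);   // store the downloaded content on disk
                }
            }
        } catch (IOException e) {
            // Unreachable WSDL link: create a document prefixed with "Bad".
            try {
                new File(path + File.separator + securityInt + "Bad" + name + ".wsdl")
                    .createNewFile();
            } catch (IOException ignored) { }
        }
    }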
3) Figures 3-20 and 3-21 show the code for turning the service properties into the XML file and the INI file, and for storing those two files on disk afterwards. The parameter "vec" is a Vector of the "PropertyStruct" data type; "PropertyStruct" is a class consisting of two variables, name and value.
Figure 3-20 Implementation code for generating the XML file
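A minimal sketch of such a "generateXML" sub function, assuming the "PropertyStruct" type described above and assuming the property names form valid XML tag names (all names are illustrative):

    import java.io.File;
    import java.io.FileWriter;
    import java.io.IOException;
    import java.io.PrintWriter;
    import java.util.Vector;

    // Hypothetical "PropertyStruct": one name-value pair of a service property.
    class PropertyStruct {
        String name;
        String value;
        PropertyStruct(String name, String value) { this.name = name; this.value = value; }
    }

    // Hypothetical sketch of the "generateXML" sub function.
    static void generateXML(Vector<PropertyStruct> vec, String path,
                            String name, int securityInt) throws IOException {
        try (PrintWriter out = new PrintWriter(
                new FileWriter(path + File.separator + securityInt + name + ".xml"))) {
            out.println("<?xml version=\"1.0\" encoding=\"UTF-8\"?>");
            out.println("<!-- service properties of " + name + " -->");
            out.println("<service>");                              // root element
            for (PropertyStruct p : vec) {
                out.println("  <" + p.name + ">" + p.value + "</" + p.name + ">");
            }
            out.println("</service>");
        }
    }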
Figure 3-21 Implementation code for generating the INI file
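Analogously, a minimal sketch of the "generateINI" sub function, reusing the "PropertyStruct" type from the previous sketch and following the INI structure described in (3) (names are illustrative):

    import java.io.File;
    import java.io.FileWriter;
    import java.io.IOException;
    import java.io.PrintWriter;
    import java.util.Vector;

    // Hypothetical sketch of the "generateINI" sub function.
    static void generateINI(Vector<PropertyStruct> vec, String path,
                            String name, int securityInt) throws IOException {
        try (PrintWriter out = new PrintWriter(
                new FileWriter(path + File.separator + securityInt + name + ".ini"))) {
            out.println("; service properties of " + name);   // comment line
            out.println("[" + name + "]");                     // section header
            for (PropertyStruct p : vec) {
                out.println(p.name + "=" + p.value);           // key-value pairs
            }
        }
    }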
4) The codes in figures 3-22 and 3-23 show the process that turns the service properties of all services in these five Web Service Registries into database records. To that end, a database has to be created first. The name of the database can be arbitrary as long as it conforms to the database naming rules; the same holds for the name of the table. Figure 3-22 displays the procedure for creating a table in the database. Because it is hard to predict the length of each service property, the data type of all service property columns is set to "Text". Figure 3-23 shows the code for inserting the service properties into the table, executed with the "insert into" statement.
Figure 3-22 Implementation code for creating the table in the database
Figure 3-23 Implementation code for generating table records
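A minimal sketch covering both figures, assuming a JDBC connection and that all property names are already uniform across the registries (the table and column names are illustrative, not the author's actual schema):

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;
    import java.sql.Statement;
    import java.util.Vector;

    // Hypothetical sketch of the "generateDatabase" sub function: one table,
    // every property column typed TEXT, one record per service.
    static void generateDatabase(Vector<PropertyStruct> vec, Connection conn)
            throws SQLException {
        // Create the table once; the first column is the increasing primary key.
        StringBuilder create = new StringBuilder(
            "CREATE TABLE IF NOT EXISTS services (ID INTEGER PRIMARY KEY");
        for (PropertyStruct p : vec) {
            create.append(", ").append(p.name.replace(' ', '_')).append(" TEXT");
        }
        create.append(")");
        try (Statement stmt = conn.createStatement()) {
            stmt.executeUpdate(create.toString());
        }
        // Insert the service properties as one record; passing NULL for the key
        // lets e.g. SQLite or MySQL assign the next auto-increment value.
        StringBuilder insert = new StringBuilder("INSERT INTO services VALUES (NULL");
        for (int i = 0; i < vec.size(); i++) {
            insert.append(", ?");
        }
        insert.append(")");
        try (PreparedStatement ps = conn.prepareStatement(insert.toString())) {
            for (int i = 0; i < vec.size(); i++) {
                ps.setString(i + 1, vec.get(i).value);   // parameters are 1-indexed
            }
            ps.executeUpdate();
        }
    }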
3.3 Multithreaded Programming for DWSC
Multithreaded programming is a built-in characteristic of the Java language. A multithreaded program contains two or more separate parts that can execute concurrently; each such part is called a thread. Multithreading makes it possible to create programs that use system resources efficiently, for example by keeping the idle time of the CPU to a minimum.
In this master program, five Web Service Registries need to be crawled for the services published in them. Moreover, the number of services published in each Web Service Registry differs considerably, so the running time per registry differs as well. Without concurrency, a Web Service Registry with few services would have to wait until a registry with many more services finishes. Therefore, in order to reduce the waiting time and maximize the use of system resources, multithreaded programming is applied to this master program: it creates one thread per Web Service Registry, and these threads execute independently.
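As an illustration, a minimal sketch of this one-thread-per-registry scheme (the per-registry crawl logic is a placeholder for the pipeline described in section 3.2; class and method names are illustrative):

    // Minimal sketch: one thread per Web Service Registry, run concurrently.
    public class RegistryCrawler {
        public static void main(String[] args) throws InterruptedException {
            String[] registries = {"Service Repository", "Ebi", "Xmethods",
                                   "Seekda", "Biocatalogue"};
            Thread[] threads = new Thread[registries.length];
            for (int i = 0; i < registries.length; i++) {
                final String registry = registries[i];
                threads[i] = new Thread(new Runnable() {
                    public void run() {
                        // Placeholder for the per-registry pipeline: Web Service
                        // Extractor, WSDL Grabber, Property Grabber, Storage.
                        System.out.println("Crawling " + registry);
                    }
                });
                threads[i].start();   // registries are crawled concurrently
            }
            for (Thread t : threads) {
                t.join();             // wait until every registry has finished
            }
        }
    }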
3.4 Sleep Time Configuration for Web Service Registries
Because this master program downloads the WSDL documents and extracts the service information of the Web services published in the Web Service Registries, it inevitably affects the performance of these registries. In addition, in order not to exceed their throughput capacity, the Web Service Registries restrict the rate of access. Because of that, unknown errors can occur while this master program is executing: the program may continually halt at one point without getting any more WSDL documents or service information, the WSDL documents of some services in some registries may not be obtainable, or some service information may be missing. Therefore, in order to obtain the largest possible share of the Web services published in these five Web Service Registries without affecting their throughput, the access rate has to be configured for each service of all Web Service Registries.
Consequently, before entering the essential procedure for each single service of these Web Service Registries, this master program calls the built-in function "sleep(long milliseconds)". It is a public static function that causes the currently executing thread to sleep for the specified number of milliseconds, in other words, to temporarily cease execution for a while. The following table shows the sleep interval used for each Web Service Registry.
Web Service Registry Name Time Interval (milliseconds)
Service Repository 8000
Ebi 3000
Xmethods 10000
Seekda 20000
Biocatalogue 10000
Table 3-15 Sleep Time of these five Web Service Registries
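In code, this throttling amounts to a single call before each service is processed; a minimal sketch using the Seekda interval from table 3-15 (the method and constant names are illustrative):

    // Throttle the crawler before touching the registry for the next service.
    private static final long SEEKDA_SLEEP_MS = 20000;   // from table 3-15

    static void crawlSingleService(String servicePageLink) throws InterruptedException {
        Thread.sleep(SEEKDA_SLEEP_MS);   // public static built-in sleep function
        // ... then fetch the service page, WSDL document and properties ...
    }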
4 Experimental Results and Analysis
This chapter presents the quantitative experimental results of the prototype described in chapter 3, together with an analysis and explanation of these results. In order to obtain reasonably accurate results, the experiments were carried out more than five times; all data displayed in the following tables and charts are the averages of these runs.
4.1 Statistic Information for Different Web Service Registries
This section presents statistics on the number of Web services published in these five Web Service Registries. They include the overall number of Web services published in each Web Service Registry and the number of unavailable Web services, which have been archived because they may no longer be active or are close to becoming inactive. Table 4-1 shows the service amount statistics of these five Web Service Registries.
Web Service Registry Name   Service Repository   Ebi   Xmethods   Seekda   Biocatalogue
Overall Services            57                   289   382        853      2567
Unavailable Services        0                    0     0          0        125
Table 4-1 Service amount statistic of these five Web Service Registries
Nevertheless, in order to give an intuitive view of the service amount statistics of these five Web Service Registries, figure 4-1 presents the data of table 4-1 as a bar chart. As can be seen from the bar chart, on the one hand, the overall number of Web services increases steadily from the Service Repository to the Biocatalogue Web Service Registry. The Biocatalogue Web Service Registry owns the largest number of Web services, which indicates that it has a much more powerful ability to provide Web services to users, because it contains far more services than the other four registries. On the other hand, there are no unavailable services in any Web Service Registry except Biocatalogue; that is to say, only the Biocatalogue Web Service Registry contains Web services that cannot be used by users. To some degree this is wasteful, because these services cannot be used anymore yet still consume network resources on the Web. Therefore, all unavailable services should be eliminated in order to reduce this waste.
Figure 4-1 Service amount statistic of these five Web Service Registries
4.2 Statistic Information for WSDL Document
Web Service Registry Name   Service Repository   Ebi   Xmethods   Seekda   Biocatalogue
Failed WSDL Links           1                    0     23         145      32
Without WSDL Links          0                    0     0          0        16
Empty Content               0                    0     2          0        2
Table 4-2 Statistic information for WSDL Document
Table 4-2 and figure 4-2 show the error statistics for the WSDL documents of the Web services in these five Web Service Registries. Three aspects are covered. The first one is the "Failed WSDL Links" of the Web services in these registries, i.e. the overall number of Web services whose WSDL links are invalid. In other words, it is impossible to get the WSDL documents of these Web services via the URL addresses of their WSDL links, so no WSDL document is created for them. The second aspect is the "Without WSDL Links" of the Web services, the overall number of Web services in each registry that have no WSDL link at all; for such Web services the value of the WSDL link is "NULL". A WSDL document is nevertheless created, but it has no content and its name contains the string "[No WSDL Document]". The third aspect is the "Empty Content", which represents the overall number of Web services whose WSDL links and URL addresses are valid but whose WSDL documents contain no content. In this case, a WSDL document whose name contains the string "(BAD)" is created.
Figure 4-2 Statistic information for WSDL Document
4.3 Comparison of Different Average Number of Service Properties
This section compares the average number of service properties in these five Web Service Registries. The average number of service properties is calculated by means of the following equation:
ASP = ONSP / ONS    (1)
where
ASP is the average number of service properties for one Web Service Registry,
ONSP is the overall number of service properties in one Web Service Registry,
ONS is the overall number of Web services that have already been crawled from the corresponding Web Service Registry.
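For illustration, equation (1) can be rearranged: taking the Biocatalogue values of figure 4-3 and table 4-1 (ASP = 32, ONS = 2567), ONSP = ASP × ONS = 32 × 2567, i.e. roughly 82000 service properties were extracted from that registry in total (approximately, since ASP is itself a rounded value).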
Figure 4-3 shows the average number of service properties per Web service in these five Web Service Registries. As mentioned before, one measure of the quality of the Web services in a Web Service Registry is the amount of service information: the more information is available about a Web service, the better users know that service, and consequently the better the quality of the Web services the corresponding registry can offer to its users. As can be seen from figure 4-3, the Service Repository and Biocatalogue Web Service Registries own a larger number of service properties than the other three registries. This directly reflects that these two registries provide more detailed information about the Web services published in them, so users find it easier to choose the services they need and are more likely to use the Web services published there. By contrast, the Xmethods and Seekda Web Service Registries, which have less service information about their Web services, offer lower quality for these services, so users may be less willing to use them; the same holds even more for the Web services published in the Ebi Web Service Registry.
Figure 4-3 Average Number of Service Properties (Service Repository: 23, Ebi: 7, Xmethods: 17, Seekda: 17, Biocatalogue: 32)
From the description presented in section 3.2.3, the causes of the differing number of service properties in these Web Service Registries may consist of the following points. First, the amount of structured information for the Web services differs among the five registries, and part of the information for some Web services in a registry may be missing or empty; for example, the amount of structured information that can be extracted for the Web services in the Ebi Web Service Registry differs greatly from that in the Biocatalogue Web Service Registry. Secondly, there is a certain amount of endpoint information in almost all Web Service Registries except Ebi, which more or less reduces Ebi's overall number of service properties. Thirdly, some Web Service Registries, such as Xmethods and Ebi, do not provide monitoring information, whereas the Service Repository Web Service Registry in particular hosts a large amount of monitoring information that can be extracted from the Web. Finally, the amount of Whois information differs: if the database of the Whois client does not contain information about the service domain of a Web service, then no Whois information can be extracted, and even if information about the service domain exists, its amount can vary widely. Therefore, if many service domains of the Web services in a registry have no or only little Whois information, the average number of service properties in that registry decreases greatly.
As a result, in order to help users become better acquainted with the Web services provided in a Web Service Registry and to distinguish these Web services, the registry should do its best to offer as much information as possible for each published Web service.
4.4 Different Outputs of Web Services
The fundamental task of this master program is to obtain the WSDL documents of the Web services hosted in the Web Service Registries, to extract and gather the properties of these Web services, and to store them on disk afterwards. Therefore, this section describes the different outputs of this master program, which comprise the WSDL documents of the Web services, the generated XML and INI files, and the data records of the service properties in the database.
Figure 4-4 WSDL Document format of one Web service
The WSDL document of a Web service is simply read from the Web via the URL address of its WSDL link, and the data is then stored on disk as the WSDL document. The name of the WSDL document is the service name plus the ending ".wsdl", which marks it as a WSDL document. In order to distinguish WSDL documents whose names would be identical although their contents differ, the name of each obtained WSDL document within one Web Service Registry additionally contains a unique integer in front of the service name. Figure 4-4 shows a valid WSDL document of the Web service; its name is "1BLZService.wsdl".
Besides, the obtained service properties are transformed into an XML file, an INI file, and data records in the database. Figures 4-5, 4-6, and 4-7 show these three output formats respectively. As can be seen from figure 4-5, this is the INI file of the Web service, named "1BLZService.ini"; the integer is the same as in the WSDL document because both belong to the same Web service. The first three lines of the INI file are service comments, which run from the semicolon to the end of the line; they contain the basic information describing this INI file. The following line is the section, enclosed in a pair of brackets; it is important because it indicates that the lines behind it hold the information of this Web service. The remaining lines contain the actual service information as key-value pairs with an equals sign between key and value, and each service property starts at the beginning of a line.
Figure 4-5 INI File format of one Web service
Figure 4-6 XML File format of one Web service
Figure 4-7 Database data format for all Web services
Furthermore, figure 4-6 shows the generated XML file of the Web service "BLZService"; its name is "1BLZService.xml". Needless to say, this XML file belongs to the materials of the same Web service. Although the format of the XML file differs from that of the INI file, their essential contents are the same, i.e. the values of the service properties do not differ, because both files are generated from the same collection of properties of the same Web service. The XML file also has comments like the INI file, displayed between "<!--" and "-->", and the section of the INI file corresponds roughly to the root of the XML file. Hence all values of the elements under the root "service" in this XML file are the values of the service properties of this Web service.
Finally, as can be seen from figure 4-7, a database table is used to store the service information of all Web services in these five Web Service Registries, with the entire service information of one Web service forming exactly one record in this table. Because of that, the column names of the table are the union of the names of the service information fields of all registries; since column names must be unique, redundant names in this union are eliminated. This is possible because the names of the service information fields are well-defined and uniform across all five Web Service Registries. In addition, the first column of the table is the primary key, an increasing integer whose function resembles the integer contained in the names of the XML and INI files, while the remaining columns hold the corresponding service properties of the Web services. The value "NULL" or "Null" in this table indicates that the property of that Web service is empty or missing.
4.5 Comparison of Average Time Cost for Different Parts of a Single Web Service
This section compares the average time cost of the different parts of getting one single Web service in all these five Web Service Registries. First of all, the average time cost of getting one single service in a Web Service Registry has to be calculated, which is done with the following equation:
ATC = OTS / ONS    (2)
where
ATC is the average time cost for one single Web service,
OTS is the overall time cost of all the Web services in one Web Service Registry,
ONS is the overall number of Web services that have already been crawled from the corresponding Web Service Registry.
In addition, the average time cost for getting one single service is split into the following six parts: the average time cost for extracting the service properties, for obtaining the WSDL document, for generating the XML file, for generating the INI file, for inserting the service properties into the database table, and for the remaining procedures, such as getting the service list page link and the service page link. The average time cost for extracting the service properties is obtained by means of the following equation:
ATCSI = OTSSI / ONS    (3)
where
ATCSI is the average time cost for extracting the service properties of one single Web service,
OTSSI is the overall time cost for extracting the service properties of all the Web services in one Web Service Registry,
ONS is the overall number of Web services that have already been crawled from the corresponding Web Service Registry.
The other parts are calculated analogously to equation (3), while the average time cost for the remaining procedures equals the average time cost for one single Web service minus the sum of the average time costs of the other five parts.
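For illustration, rearranging equation (2) with the values reported below in tables 4-1 and 4-3: for the Ebi registry, OTS = ATC × ONS = 823 ms × 289 ≈ 238 seconds, i.e. crawling all Ebi services took roughly four minutes in total.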
                     Service    WSDL       XML    INI    Database   Others   Overall
                     Property   Document   File   File
Service Repository   8801       918        2      1      53         267      10042
Ebi                  699        82         2      1      28         11       823
Xmethods             5801       1168       2      1      45         12       7029
Seekda               5186       1013       2      1      41         23       6266
Biocatalogue         39533      762        2      1      66         1636     42000
Table 4-3 Average time cost information for all Web Service Registries (in milliseconds)
Table 4-3 displays the average time cost of one Web service and of its different parts in all these five Web Service Registries. The first column of table 4-3 contains the names of the five Web Service Registries, and the last column is the average time cost of a single service in one Web Service Registry, while the remaining columns are the average time costs of the six different parts. In order to give an intuitive view of the data in table 4-3, the data in each column of this table are illustrated in corresponding figures; see figures 4-8 to 4-13.
Figure 4-8 Average time cost for extracting service property in all Web Service Registries
As can be seen from figure 4-8, the average time cost for extracting the service properties in the Biocatalogue Web Service Registry is 39533 milliseconds. This value is much larger than in the other four Web Service Registries, which take 8801, 699, 5801, and 5186 milliseconds for Service Repository, Ebi, Xmethods, and Seekda respectively. That is to say, it takes much longer to extract the service properties of the Web services published by the Biocatalogue Web Service Registry. This also indirectly indicates that the Biocatalogue Web Service Registry has the highest average number of service properties, as already discussed in section 4.3, whereas the average number of service properties in the Ebi Web Service Registry is the smallest. Moreover, the average time cost in the Xmethods Web Service Registry is larger than in the Seekda Web Service Registry, although, as already known, the average number of service properties is the same for these two registries. One cause might explain why Xmethods costs more time than Seekda: extracting the service properties in the Xmethods Web Service Registry requires both the service page and the service list page, whereas only the service page link is needed for the Seekda Web Service Registry.
The average time cost for obtaining the WSDL document in all five Web Service Registries is displayed in figure 4-9. This average time cost is the sum of the average time for extracting the WSDL link of a Web service and the average time for reading the data of the WSDL document from the Web and storing it on disk. As seen from this figure, the average time cost for obtaining the WSDL document is largest in the Xmethods Web Service Registry, at 1168 milliseconds. Although extracting the WSDL link costs a certain amount of time, it does not significantly influence the total average time spent on obtaining the WSDL document, because the WSDL link of a Web service is almost always gained in one step. This implies that the average size of the WSDL documents in the Xmethods Web Service Registry is larger than in the other four registries, in particular the Ebi Web Service Registry, which takes just 82 milliseconds for obtaining the WSDL document.
Figure 4-9 Average time cost for obtaining WSDL document in all Web Service Registries
Figures 4-10, 4-11, and 4-12 show the average time cost of generating the three different outputs in all five Web Service Registries. As can be seen from figures 4-10 and 4-11, the average time for generating the XML file of one Web service is the same for all five registries, only 2 milliseconds, and the average time for generating the INI file is likewise constant at just 1 millisecond. Even the sum of these two average time costs is so small that it can be neglected when compared with the overall average time cost of getting one Web service shown in figure 4-13. This implies that generating the XML and INI files finishes immediately after the service properties of a Web service are received. Furthermore, figure 4-12 shows that although the average time cost of creating the database record of a Web service is larger than that of generating the XML and INI files in all five registries, the database operation is still fast.
Figure 4-10 Average time cost for generating XML file in all Web Service Registries
Figure 4-11 Average time cost for generating INI file in all Web Service Registries
Figure 4-12 Average time cost for creating database record in all Web Service Registries
Figure 4-13 Average time cost for getting one Web service in all Web Service Registries
Figure 4-13 gives the average time cost for getting one single Web service in all five Web Service Registries. Without any doubt, the Biocatalogue Web Service Registry takes the longest time for this process: as presented above, the average time cost of each part is largest in the Biocatalogue Web Service Registry, except for obtaining the WSDL document, where Biocatalogue does not take the longest. Moreover, a striking observation can be made from figures 4-8, 4-12, and 4-13: the shapes of these curves follow almost the same trend. This further indicates that a Web Service Registry that spends more time getting the description information of one Web service also offers more information about that Web service.
5 Conclusion and Further Direction
This master thesis provides a schema that aims to explore the description information of the Web services hosted in different Web Service Registries. The description information of a Web service consists of the WSDL document and the service information about that Web service. Although there is an existing approach, the Pica-Pica Web Service Description Crawler, that can obtain the WSDL documents and service information of the Web services in these Web Service Registries, its functionality is restricted: it explores only a small subset of the Web services hosted in these registries, its storage of the Web services is not flexible, and, most importantly, it extracts only little service information per Web service; for some Web Service Registries it extracts no service information at all. The work presented in this master thesis, in contrast, is able to explore all Web services in these Web Service Registries. Moreover, with the approach presented here, as much service information as possible is extracted for each Web service, so that the final result is the largest annotated service catalogue ever produced. Furthermore, regarding the storage of the description information of the Web services, this master thesis provides three different formats that guarantee not only the completeness but also the longevity of the description information.
However, in the implementation performed in this master thesis, the Whois client used for querying the information of a service domain returns free text if the information exists, and this free text can differ completely from domain to domain. As a consequence, each Web service in all Web Service Registries had to be crawled at least once during the experiment stage, so that all variants of this free text could be foreseen and processed afterwards. This is a huge amount of work because there are lots of Web services in these registries. Therefore, in order to simplify this work, another Whois client that eases this task needs to be found and used.
Moreover, in the experiment stage of this master thesis, the time cost for getting one Web service is still large. In order to reduce this time, multithreaded programming could also be applied to some parts of the process of getting one single Web service.
Although the work performed here is specialized for only these five Web Service Registries, the main principles used here are adaptable to other Web Service Registries with only small changes in the implementation code or structure.
7 Appendixes
There are two additional outputs of this master program: the log information file and the statistic report file. Figure 8-1 shows the basic output format of the log information for these five Web Service Registries.
Figure 8-1 Log information of the "Service Repository" Web Service Registry
Figure 8-2 Statistic information of the "Service Repository" Web Service Registry
Figures 8-2 to 8-6 show the output format of the statistic information for the "Service Repository", "Ebi", "Xmethods", "Seekda", and "Biocatalogue" Web Service Registries respectively.
Figure 8-3 Statistic information of the "Ebi" Web Service Registry
Figure 8-4 Statistic information of the "Xmethods" Web Service Registry
Figure 8-5 Statistic information of the "Seekda" Web Service Registry
Figure 8-6 Statistic information of the "Biocatalogue" Web Service Registry
Table of Figures
Figure 2-1 Dataflow of Service-Finder and Its Components ... 12
Figure 2-2 Left is the free text input type and right is its output ... 16
Figure 2-3 A semi-structured page containing data records (in rectangular box) to be extracted ... 16
Figure 2-4 Architecture of the Pica-Pica Web Service Description Crawler ... 20
Figure 3-1 Overview of the Basic Architecture for the Deep Web Services Crawler ... 25
Figure 3-2 Overview of the process flow of the Web Service Extractor Component ... 27
Figure 3-3 Service list page of the Service-Repository ... 29
Figure 3-4 Original source code of the internal link for Web service "BLZService" ... 29
Figure 3-5 Code overview of getting the service page link in Service Repository ... 29
Figure 3-6 Service page of the Web service "BLZService" ... 29
Figure 3-7 Overview of the process flow of the WSDL Grabber Component ... 30
Figure 3-8 WSDL link of the Web service "BLZService" in the service page ... 31
Figure 3-9 Original source code of the WSDL link for Web service "BLZService" ... 32
Figure 3-10 Code overview of the "getServiceRepositoryWSDLLink" function ... 32
Figure 3-11 Code overview of the "oneParameter" function ... 32
Figure 3-12 Overview of the process flow of the Property Grabber Component ... 33
Figure 3-13 Structure properties of the Service "BLZService" in the service list page ... 37
Figure 3-14 Structure properties of the Service "BLZService" in the service page ... 38
Figure 3-15 Endpoint information of the Web service "BLZService" in the service page ... 38
Figure 3-16 Monitoring Information of the Service "BLZService" in the service page ... 39
Figure 3-17 Whois Information of the service domain "thomas-bayer.com" ... 40
Figure 3-18 Overview of the process flow of the Storage Component ... 41
Figure 3-19 Implementation code for getting the WSDL document ... 44
Figure 3-20 Implementation code for generating the XML file ... 44
Figure 3-21 Implementation code for generating the INI file ... 45
Figure 3-22 Implementation code for creating the table in the database ... 45
Figure 3-23 Implementation code for generating table records ... 46
Figure 4-1 Service amount statistic of these five Web Service Registries ... 49
Figure 4-2 Statistic information for WSDL Document ... 50
Figure 4-3 Average Number of Service Properties ... 51
Figure 4-4 WSDL Document format of one Web service ... 52
Figure 4-5 INI File format of one Web service ... 53
Figure 4-6 XML File format of one Web service ... 53
Figure 4-7 Database data format for all Web services ... 53
Figure 4-8 Average time cost for extracting service property in all Web Service Registries ... 55
Figure 4-9 Average time cost for obtaining WSDL document in all Web Service Registries ... 56
Figure 4-10 Average time cost for generating XML file in all Web Service Registries ... 57
Figure 4-11 Average time cost for generating INI file in all Web Service Registries ... 57
Figure 4-12 Average time cost for creating database record in all Web Service Registries ... 58
Figure 4-13 Average time cost for getting one Web service in all Web Service Registries ... 58
Table of Tables
Table 3-1 Structured Information of Service-Repository Web Service Registry ... 34
Table 3-2 Structured Information of Xmethods Web Service Registry ... 34
Table 3-3 Structured Information of Seekda Web Service Registry ... 34
Table 3-4 Structured Information of Ebi Web Service Registry ... 34
Table 3-5 Structured Information of Biocatalogue Web Service Registry ... 34
Table 3-6 SOAP Operation Information of Biocatalogue Web Service Registry ... 35
Table 3-7 REST Operation Information of Biocatalogue Web Service Registry ... 35
Table 3-8 Endpoint Information of these five Web Service Registries ... 35
Table 3-9 Monitoring Information of these five Web Service Registries ... 35
Table 3-10 Whois Information for these five Web Service Registries ... 36
Table 3-11 Extracted Structured Information of Web Service "BLZService" ... 38
Table 3-12 Extracted Endpoint Information of the Web service "BLZService" ... 39
Table 3-13 Extracted Monitoring Information of the Web service "BLZService" ... 39
Table 3-14 Extracted Whois Information of service domain "thomas-bayer.com" ... 40
Table 3-15 Sleep Time of these five Web Service Registries ... 47
Table 4-1 Service amount statistic of these five Web Service Registries ... 48
Table 4-2 Statistic information for WSDL Document ... 49
Table 4-3 Average time cost information for all Web Service Registries ... 55
Table of Abbreviations
DTD Document Type Definition
DWSC Deep Web Service Crawler
HTML HyperText Markup Language
MTBF Mean Time between Failures
MTTR Mean Time to Recovery
QoS Quality of Service
REST Representational State Transfer
RTT Round Trip Time
SOAP Simple Object Access Protocol
SQL Structured Query Language
URL Uniform Resource Locator
WHATWG Web Hypertext Application Technology Working Group
XML eXtensible Markup Language
WSDL Web Service Description Language
WSML Web Service Modeling Language
WSMO Web Service Modeling Ontology
1 Introduction
This introductory chapter first concisely explains the background of the current situation and then gives a basic introduction to the proposed approach, which is called the Deep Web Service Extraction Crawler.
1.1 Background/Motivation
In the late 1990s, the Web Service Registry was a hot commodity. Formally, a Web Service Registry is defined as a links page: its function is to uniformly present information that comes from various sources. Hence, it can provide the users with a convenient channel for offering, searching, and using Web Services. The related metadata of the Web Services, submitted by both the system and the users, are commonly hosted along with the service descriptions.
Nevertheless, when users enter one of these Web Service Registries to look for Web Services, they may encounter situations that cause them a lot of trouble. One such situation is that a Web Service Registry returns several similar published Web Services after the users search on it; for example, two or more Web Services have the same name but different versions, or two or more Web Services are derived from the same server but have different contents. Furthermore, most users are also interested in a global view of the published services; for instance, they want to know which Web Service Registry provides better quality for a given Web Service. Therefore, in order to help users differentiate similar published Web Services and gain a global view of the Web Services, this information should be monitored and rated.
Moreover, there are a great many Web Service Registries on the Internet, and each of them can provide a great number of Web Services. Obviously, there may be similar Web Services among these Web Service Registries, or a Web Service in one registry may be related to another Web Service in other registries. Hence, these Web Services should be comparable across different Web Service Registries; however, at present there is little support for this. In addition, regarding the metadata, not all of it is structured, especially the descriptions of the non-functional properties. Therefore, what has to be done now is to turn those non-functional property descriptions into a structured format. Clearly speaking, as much information as possible has to be extracted about the Web Services offered in the Web Service Registries. Eventually, after extracting all the information from the Web Service Registries, it is necessary to store it on disk. This procedure should be efficient, flexible, and complete.
1.2 Initial Design of the Deep Web Service Crawler Approach
The problems have already been stated in the previous section; hence, the following work is to solve them. This section presents the basic principle of the Deep Web Service Crawler approach.
At first, a simple introduction of the Deep Web Service Crawler approach to these problems is given. As has already been mentioned, each Web Service Registry can offer Web Services. Moreover, each Web Service Registry has its own HTML page structures; these structures may be the same or even completely different. Therefore, the first task is to identify which Web Service Registry is going to be explored. Since each Web Service Registry owns a unique URL, this job can be done by directly analyzing the corresponding URL address of that Web Service Registry. After identifying the Web Service Registry to be explored, the following step is to obtain all the Web Services published in that registry. Then, with all these obtained Web Services, it is time to extract, analyze, and gather their information, which can be in a structured or even an unstructured format. In this master thesis, some Deep Web analysis techniques are applied to obtain this information, so that the annotation of each Web Service becomes as large as possible. Last but not least, all the information about the Web Services needs to be stored.
1.3 Goals of this Master Thesis
The following list presents the goals of this master thesis:
- Produce the largest annotated Service Catalogue
A Service Catalogue is a list of service properties. The more properties a service has, the larger the Service Catalogue it owns. Therefore, this master program should extract as many service properties as possible.
- Flexible storage of the metadata of each service as annotations or dedicated documents
The metadata of one service includes not only the WSDL document but also the service properties. All these metadata are important information for the service. Therefore, this master program should provide flexible ways to store these metadata on disk.
- Improve the comparability of the Web Services across different Web Service Registries
The names of the service properties in one Web Service Registry can differ from those in another Web Service Registry. Hence, for the purpose of improving comparability, all these names of the service properties should be unified and well-defined.
1.4 Outline of this Master Thesis
In this chapter, the motivation, objectives, and initial approach plan have already been discussed. The remainder of this thesis is structured as follows.
Chapter 2 presents work that is largely based on existing techniques. Section 2.1 gives a detailed introduction to the Service-Finder project. Section 2.2 then presents the Information Extraction technique. After that, in section 2.3, the already implemented Pica-Pica Web Service Description Crawler is introduced and discussed.
Chapter 3 explains the design details of the Deep Web Service Crawler approach. Section 3.1 gives a short description of the different requirements of this approach. Next, section 3.2 presents the actual design of the Deep Web Service Crawler. Then, sections 3.3 and 3.4 introduce the multithreaded programming and the sleep time configuration used in this master program, respectively.
Chapter 4 presents the experiments with this Deep Web Service Crawler approach and gives an evaluation of the results.
Finally, chapter 5 presents the conclusions and a discussion of the work already done, as well as the future work for this master task.
2 State of the Art
This chapter aims at presenting some existing techniques and strategies related to the work of applying this Deep Web Service Extraction Crawler approach. Section 2.1 talks about the existing catalogue, the Service-Finder project. Then, section 2.2 presents some details about the Information Extraction technique. Finally, section 2.3 explains an existing implemented crawler, the Pica-Pica Web Service Description Crawler.
2.1 Service-Finder Project
The Service-Finder project aims at developing a platform for Web Service discovery, especially for the Web Services that are embedded in a Web 2.0 environment [1]. Hence, it can provide efficient access to publicly available services. The goals of the Service-Finder project are depicted as follows [1]:
- Automatically gather Web Services and their related information
- Semi-automatically create semantic service descriptions based on the information that is available on the Web
- Create and improve semantic annotations via user feedback
- Describe the aggregated information in semantic models and allow reasoning and querying
However, before describing the basic functionality of the Service-Finder project, one of its use cases and its requirements are presented first.
2.1.1 Use Cases for the Service-Finder Project
The Service-Finder project employed the use case methodology of the W3C Use Case description [6] for its needs and then applied this methodology to the use cases it enumerated.
2.1.1.1 Use Case Methodology
Three aspects need to be considered for the use case definitions of the Service-Finder project [1]:
(1) Description: used to describe the information of the use case
(2) Actors, Roles and Goals: used to identify the actors, the roles they act in, and the goals they need to achieve in the scenario
(3) Storyboard: used to describe the series of interactions among the actors and the Service-Finder Portal
2.1.1.2 System Administrator
This section presents the use case that applies to the Service-Finder portal and illustrates the requirements on its functionality from a user's point of view. All the information in this use case is derived from [1]. In this use case, there is a system administrator named "Sam Adams". He works for a bank, and his job is to keep the online payment facilities up and working day and night. Therefore, if there are any system failures, Sam Adams should fix the problems as early as he can. That is why he wants to use an SMS Messaging service which will alert him immediately by sending him an SMS message in the case of a system failure.
- Description
This use case describes the system administrator "Sam Adams", who is looking for an SMS Messaging Service that he can build into his application.
- Actors, Roles and Goals
The name of the actor is "Sam Adams". His role is a system administrator at a bank. His goals are immediate service delivery, reliability of the service, and a low base fee and transaction fee.
- Storyboard
Step 1: Sam Adams knows the Service-Finder portal, he knows that he can find many useful services on it, and he knows what he is looking for. Hence, he visits the Service-Finder portal and starts to search by entering the keywords "SMS Alert".
Requirement 1: Search functionality.
Step 2: The Service-Finder now returns a list of matching services. However, Sam wants to choose the number of matching services that will be displayed on one page. He would also expect short information about the service functionality, the service provider, and the service availability, so that he can decide which service to read about further.
Requirement 2: Enable configurable pagination of the matching results and provide short information for each service.
Step 3: When Sam looks through the short information about the services displayed on the first page, he expects to find the services most relevant to his request. After that, he would like to read more detailed information about a service to see whether it can provide the needed functionality.
Requirement 3: Rank the returned matching services and provide the ability to read more details of a service.
Step 4: It may be the case that the returned matching services provide quite different functionalities or belong to different service categories; for example, some SMS messaging services alert users not through SMS but through voice messaging. For this reason, Sam would like to see other categories that may contain the services he wants, or services of other categories in which he is also interested (like "SMS Messaging"). Besides, another possible way is that Sam can further refine his search by browsing through categories.
Requirement 4: Provide service categories and allow the user to look at all services that belong to a specific category. If possible, it should also allow the user to browse through categories.
Step 5: When Sam has got all the services that can provide an SMS messaging service via the methods described in Step 4, he now wants to look for the services that are offered by an Austrian provider and, if possible, have no base fees.
Requirement 5: Faceted search.
Step 6: After Sam has got all these specific services, he would like to choose the services that provide a high reliability.
Requirement 6: Sort functionality based on the user's choices.
Step 7: Sam now expects to compare the service availability promised by the service provider with the availability actually provided; this should be contained in the services' details. There also needs to be service coverage information, so that Sam can know whether a service covers the areas where he lives and works. Moreover, Sam would also like to compare these services in other ways, for instance by putting some services into a structured table to compare the transaction fees.
Requirement 7: A side-by-side comparison table for services and a functionality that enables users to select the services they want to compare.
Step 8: At last, Sam wants to know whether the service providers offer a free trial of the services, so that he can test the service functionality.
Requirement 8: If possible, display a note for services offering free trials.
2.1.2 Architecture Plan for the Service-Finder Project
The architecture plan of the Service-Finder project contains five basic components: Service Crawler, Automatic Annotator, Conceptual Indexer and Matcher, Cluster Engine, and Service-Finder Portal Interface. Figure 2-1 presents a high-level overview of the components and the data flow among them.
Figure 2-1: Dataflow of Service-Finder and its components [3]
2.1.2.1 The Principle of the Service Crawler Component
The Service Crawler is responsible for gathering available services and their related information from the Web. The overall cycle is depicted as follows:
(1) A Web developer publishes a Web Service.
(2) The Crawling component then begins to harvest the Web in order to identify Web Services, like WSDL (Web Service Description Language) documents.
(3) The Crawler also searches for other related information as soon as a service is discovered.
(4) After each periodic interval, the Crawler produces a consistent snapshot of the relevant part of the Web.
At last, the output of the crawler is forwarded to the subsequent components for analyzing, indexing, and displaying.
2.1.2.2 The Principle of the Automatic Annotator Component
The Automatic Annotator receives the relevant data from the previous component and generates semantic service descriptions of the WSDL documents and their related information, based on the Service-Finder Ontology and the Service Category Ontology.
First, the two compatible ontologies that are used throughout the whole process are briefly introduced [2]:
- Generic Service Ontology: an ontology which is used to describe the data objects, for example the services, the service providers, availability, payment modalities, and so on.
- Service Category Ontology: an ontology which is used to categorize the functionalities or applications of the services, for instance data verification, messaging, data storage, weather, etc.
Afterwards, the function of this component, together with its input and output, is described:
- Input:
  - Crawled data from the Service Crawler
  - Service-Finder Ontologies
  - Feedback or corrections of previous annotations
- Function:
  - Enrich the information about the service and extract semantic statements according to the Service-Finder Ontologies, for example categorize the service according to the Service Category Ontology
  - Determine whether a particular document is relevant or not through the Web link graph; if not, discard these irrelevant documents
  - Classify the pages into their genres, for instance pricing, user comments, FAQ, and so on
- Output:
  - Semantic annotations of the services
2.1.2.3 The Principle of the Conceptual Indexer and Matcher Component
The Conceptual Indexer and Matcher is more like a data store center that aims at storing all extracted information about the services and supplying users with retrieval and semantic query capabilities, for example the matchmaking between user requests and service offers and the retrieval of user feedback on extracted annotations.
In addition, the function of this component and its input and output are as follows:
- Input:
  - Semantic annotation data and full-text information obtained from the Automatic Annotator
  - Semantic annotation data and full-text information that come from the user interfaces
  - Cluster data from the user and service clustering component
- Function:
  - Store the semantic annotations received from the Automatic Annotator component and from the user interface
  - Store the cluster data procured through the clustering component
  - Store and index the textual descriptions offered by the Automatic Annotator component and the textual comments offered by users
  - Ontological querying of the semantic data in the data store center
  - Combined keyword and ontological querying used for user queries
  - Provide a list of similar services for a given service
- Output:
  - A list of matching services that were queried by users; in particular, these services should be sorted by ranking, and it should be possible to iterate over them
  - All available data related to a particular entity must be retrievable at the user interface
2.1.2.4 The Principle of the Service-Finder Portal Interface Component
The Service-Finder Portal Interface is the main entry point provided for users of the Service-Finder system to search and browse the data which is managed by the Conceptual Indexer and Matcher component. In addition, the users can also contribute information by providing tags, comments, categorizations, and ratings for the data they browse. Furthermore, developers can also directly invoke the Service-Finder functionalities from their custom applications by means of an API.
The details of this component's function, input, and output are represented below:
- Input:
  - A list of ordered services for a query
  - Detailed information about a service or a set of services and a service provider
  - Query access to the service category ontology and the most used tags provided by the users
  - Service availability information
- Function:
  - The Web interface allows the users to search services by keyword, tag, or concept in the categorization; sort and filter query results by refining the query; compare and bookmark services; and try out the services that offer this functionality
  - The API allows the developers to invoke Service-Finder functionalities
- Output:
  - Explicit user annotations such as tags, ratings, comments, descriptions, and so on
  - Implicit user data, for example click streams of users, bookmarks, comparisons, links sent, etc.
  - Manual advertising of available new services
2.1.2.5 The Principle of the Cluster Engine Component
The Cluster Engine is in charge of collecting and analyzing information about user behaviors from the Service-Finder Portal, e.g. the queried services and the compared services of the users. Moreover, it also provides cluster data to the Conceptual Indexer and Matcher for providing service recommendations.
Furthermore, this component's function, input, and output are as follows:
- Input:
  - Service annotation data, both extracted and from user feedback
  - Users' click streams, used for extracting user behaviors
- Function:
  - Obtain user clusters from user behaviors
  - Obtain service clusters from service annotation data to enable finding similar services
- Output:
  - Clusters of users and services
2.2 Information Extraction
Due to the rapid development and use of the World-Wide Web, a huge number of information sources have appeared on the Internet, and access to them has been limited to browsing and searching because of the heterogeneity and the lack of structure of Web information sources. Therefore, the appearance of Information Extraction, which transforms Web pages into program-friendly structures for post-processing, has become a great necessity. The task of Information Extraction is specified in terms of its inputs and its extraction targets, and the technique used in the process of Information Extraction is called an extractor.
2.2.1 Input Types of Information Extraction
Generally speaking, there are three different input types. The first input type is the unstructured document, for example the free text shown in figure 2-2. It is unstructured and written in natural language, so it requires substantial natural language processing. The second input type is the structured document, for instance XML documents, since their data can be described through an available DTD (Document Type Definition) or XML (eXtensible Markup Language) schema. The third input type is the semi-structured document, which is widespread on the Web, such as the large volume of HTML pages with tables, itemized lists, and enumerated lists. This is because HTML tags are often used to render these embedded data in the HTML pages; see figure 2-3.
Figure 2-2: Left is the free text input type and right is its output [4]
Figure 2-3: A semi-structured page containing data records (in rectangular boxes) to be extracted [4]
Therefore, the inputs of the semi-structured type can be seen as documents with a fairly regular structure, and the data of these documents can be displayed in an HTML or non-HTML format. Besides, since the Web pages of the Deep Web are dynamic and generated from structured databases in terms of templates or layouts, the Deep Web can be considered as one of the input sources that provide such semi-structured documents. For example, the authors, price, and comments of the book pages provided by Amazon have the same layout, because these Web pages are generated from the same database and use the same template or layout. Furthermore, there is another option, namely manually generated HTML pages of the semi-structured type. For example, although the publication lists provided on different researchers' homepages are produced by diverse users, they all have a title and a source property for every single paper. Eventually, the inputs for Information Extraction can also be pages of the same class or pages from various Web Service Registries.
2.2.2 Extraction Targets of Information Extraction
Moreover, regarding the task of Information Extraction, the extraction target has to be considered. There are two different extraction targets. The first one is the relation of a k-tuple, where k is the number of attributes in a record. In some cases an attribute of a record may have no instantiation; in other cases an attribute may own multiple instantiations. The second extraction target is the complex object with hierarchically organized data. Though the ways of depicting the extraction targets in a page are diverse, the most common structure is the hierarchical tree. This hierarchical tree may contain only one leaf node, or one or more lists of leaf nodes grouped under internal nodes. The structure of a data object may also be flat or nested. To be brief: if the structure is flat, then there is only one leaf node, which can also be called the root; if it is a nested structure, then the internal nodes involved in this data object span more than two levels.
Furthermore, in order to make the Web pages readable for human beings and easier to visualize, the tables or tuples of the same list, or the elements of a tuple, should be clearly isolated or demarcated. However, the displaying of a data object in a Web page is affected by the following conditions [4]:
- The attribute of a data object has zero or several values:
(1) If there is no value for the attribute of a data object, this attribute is called a "none" attribute. For example, a special offer that is only available for certain books might be a "none" attribute.
(2) If there is more than one value for the attribute of a data object, it is called a "multiValue" attribute. For instance, the name of the author of a book could be a "multiValue" attribute.
- The set of attributes (A1, A2, A3, ...) has multiple orderings:
That is to say, among this set of attributes, the position of an attribute might change according to the diverse instances of a data object. Such an attribute is called a "multiOrdering" attribute. For instance, for movies before the year 1999, a movie site might enumerate the release date in front of the movie's title, while for movies after the year 1999 (including 1999) it enumerates the release date behind the movie's title.
- The attribute has different formats:
This means the displaying format of the data object can be completely distinct with respect to different instances. Therefore, if the format of an attribute is free, then a lot of rules are needed to deal with all kinds of possible cases. This kind of attribute is called a "multiFormat" attribute. For example, an e-commerce Web site might use a bold font to present the general prices, while using a red color to display the sale prices. Nevertheless, there is another situation, in which different attributes of a data object have the same format; for example, various attributes are presented using <TD> tags in a table presentation. Attributes like these can be differentiated by means of their order information. However, for cases where a "none" attribute occurs or "multiOrdering" attributes exist, the rules for extracting these attributes have to be revised.
- The attribute cannot be decomposed:
For easier processing, the input documents are sometimes treated as strings of tokens instead of strings of characters. However, some attributes cannot be decomposed into several individual tokens. These attributes are called "untokenized" attributes. For example, in college course catalogue entries like "COMP4016" or "GEOL2001", the department code and the course number cannot be separated into two different strings of characters such as "COMP" and "4016" or "GEOL" and "2001".
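To make these attribute variations concrete, the following is a minimal, illustrative sketch in Java (the implementation language of this master program); the class and field names are assumptions made for illustration only and are not taken from any cited extractor.

import java.util.List;
import java.util.Map;

// A sketch of a record with the attribute variations described above:
// a "none" attribute maps to an empty list, a "multiValue" attribute to
// a list with several entries, and a "multiFormat" attribute keeps the
// raw rendering (e.g. "bold" or "red") next to the extracted value.
public class ExtractedRecord {
    // attribute name -> zero or more extracted values
    private final Map<String, List<AttributeValue>> attributes;

    public ExtractedRecord(Map<String, List<AttributeValue>> attributes) {
        this.attributes = attributes;
    }

    public boolean isNoneAttribute(String name) {
        List<AttributeValue> values = attributes.get(name);
        return values == null || values.isEmpty();
    }

    public boolean isMultiValueAttribute(String name) {
        List<AttributeValue> values = attributes.get(name);
        return values != null && values.size() > 1;
    }

    // One extracted value together with the markup format it was found in,
    // e.g. "bold" for a regular price or "red" for a sale price.
    public static class AttributeValue {
        public final String value;
        public final String sourceFormat;

        public AttributeValue(String value, String sourceFormat) {
            this.value = value;
            this.sourceFormat = sourceFormat;
        }
    }
}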
2.2.3 The Techniques Used in Information Extraction
The extractor used in the process of Information Extraction aims at providing a single uniform query interface to access information sources like database servers and Web servers. It consists of the following phases: collecting returned Web pages, labeling these Web pages, generalizing extraction rules, extracting the relevant data, and outputting the result in an appropriate format (XML format or a relational database) for further information integration. For example, the extractor first queries the Web server to gather the returned pages through the HTTP protocol; after that, it starts to extract the contents of these HTML documents and integrates them with other data sources thereafter. The whole process of the extractor follows the steps below:
- Step 1:
At the beginning, the input has to be tokenized. There are two different granularities for the input string tokenization: tag-level encoding and word-level encoding. Tag-level encoding transforms the tags of an HTML page into general tokens, while transforming each text string between two tags into a special token. Word-level encoding does this in another way: it treats each word in a document as a token.
- Step 2:
Next, the extraction rules are applied for every attribute of the data object in the Web pages. These extraction rules can be induced in terms of a top-down or bottom-up generalization, pattern mining, or logic programming. In addition, the type of extraction rules may be indicated by means of regular grammars or logic rules. For example, some use path expressions of the HTML parse tree (paths like html.head.title or html->table[0]), some use syntactic or semantic constraints, and some use delimiter-based constraints such as HTML tags or literal words.
- Step 3:
After that, all the extracted data are assembled into records.
- Step 4:
Finally, this process is iterated until all data objects in the input have been extracted.
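As an illustration of Step 1 and Step 2, the following is a minimal sketch in Java of a tag-level tokenizer combined with a simple delimiter-based extraction rule. All names are illustrative assumptions; real extractors typically induce their rules rather than hard-coding them.

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SimpleExtractor {

    // Tag-level encoding: every HTML tag becomes a "TAG" token and every
    // text run between two tags becomes a single "TEXT" token.
    public static List<String> tokenize(String html) {
        List<String> tokens = new ArrayList<>();
        Matcher m = Pattern.compile("<[^>]+>|[^<]+").matcher(html);
        while (m.find()) {
            String chunk = m.group().trim();
            if (chunk.isEmpty()) continue;
            tokens.add(chunk.startsWith("<") ? "TAG:" + chunk : "TEXT:" + chunk);
        }
        return tokens;
    }

    // Delimiter-based rule: extract the text enclosed by a given pair of
    // literal delimiters, e.g. "<td>" and "</td>".
    public static List<String> applyRule(List<String> tokens, String open, String close) {
        List<String> values = new ArrayList<>();
        for (int i = 1; i + 1 < tokens.size(); i++) {
            if (tokens.get(i - 1).equals("TAG:" + open)
                    && tokens.get(i + 1).equals("TAG:" + close)
                    && tokens.get(i).startsWith("TEXT:")) {
                values.add(tokens.get(i).substring("TEXT:".length()));
            }
        }
        return values;
    }
}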
2.3 Pica-Pica Web Service Description Crawler
Pica pica is known as a bird species, which can also be called pie. In this context, however, Pica-Pica is a Web Service Description Crawler which is designed to address the quality of Web Services, for example the evaluation of the descriptive quality of the offered Web Services and how well these Web Services are described in today's Web Service Registries.
2.3.1 Needed Libraries of the Pica-Pica Web Service Description Crawler
This version of the Pica-Pica Web Service Description Crawler was written by Anton Caceres and Josef Spillner and is programmed in the Python language. In order to run these scripts to parse the HTML pages, two additional libraries are needed: Beautiful Soup and html5lib.
- Beautiful Soup
It is an HTML/XML parser for Python, and it can even turn invalid markup into a parse tree [5]. Moreover, the following three features make it more powerful:
  - Bad markup does not choke Beautiful Soup; in fact, it generates a parse tree that makes approximately as much sense as the original document, so you can obtain the data that you want.
  - Beautiful Soup provides a toolkit with simple, idiomatic methods for navigating, searching, and modifying the parse tree, so you do not need to create a custom parser for every application.
  - If the document has already specified an encoding, you can ignore it, since Beautiful Soup converts the documents to Unicode and to UTF-8 automatically. Otherwise, all you have to do is specify the encoding of the original documents.
Furthermore, the ways of including Beautiful Soup into an application are displayed in the following [5]:
  from BeautifulSoup import BeautifulSoup           # For processing HTML
  from BeautifulSoup import BeautifulStoneSoup      # For processing XML
  import BeautifulSoup                              # To get everything
- html5lib
It is a Python package which implements the HTML5 [8] parsing algorithm. In order to gain maximum compatibility with the current major desktop Web browsers, this implementation is based on the WHATWG (Web Hypertext Application Technology Working Group) HTML5 specification.
2.3.2 Architecture of the Pica-Pica Web Service Description Crawler
Figure 2-4: Architecture of the Pica-Pica Web Service Description Crawler
Figure 2-4 illustrates the primary architecture of the Pica-Pica Web Service Description Crawler. It includes four fundamental components: the Service Page Grabber component, the WSDL Grabber component, the Property Grabber component, and the WSML Register component.
(1) The Service Page Grabber component takes the URL seed as input and outputs the link of the service page to the following two components: the WSDL Grabber component and the Property Grabber component.
(2) The WSDL Grabber component is responsible for obtaining the WSDL document based on the delivered service page's link and then checking whether the obtained WSDL documents are valid. Finally, only the valid WSDL documents are passed to the WSML Register component for further processing.
(3) The Property Grabber component tries to extract the service's properties hosted in the service page, if any exist. After that, all these service properties are saved into an INI file as the information of that service.
(4) The functionality of the WSML Register component is to write appropriate WSML documents by means of the valid WSDL documents delivered from the WSDL Grabber component and, optionally, the INI files delivered from the Property Grabber component, and afterwards to register them in ConQo.
- WSML [9]
It stands for Web Service Modeling Language, which provides a framework with different language variants. Hence, it is often used to describe the different aspects of semantic Web Services according to the conceptual model of WSMO.
- WSMO [10]
WSMO, whose full name is Web Service Modeling Ontology, is dedicated to describing various aspects related to Semantic Web Services based on ontologies. An ontology is a formal, explicit specification of a shared conceptualization. In fact, these ontologies are the sticking points that can satisfy the linkage between the agreement of the communities of users and the defined conceptual semantics of the real world.
- ConQo [11]
It is a discovery framework that considers not only the Quality of Service (QoS) but also context information. It uses a Web Service repository to manage the service descriptions, which are based on WSML.
2.3.3 Implementation of the Pica-Pica Web Service Description Crawler
This section describes the processes of the implementation of the Pica-Pica Web Service Description Crawler in detail.
(1) First, to start the whole crawling process, the Pica-Pica Web Service Description Crawler needs an input as the initial seed. For this crawler, there are five Web Service Registries, which are listed below. The URL addresses of these five Web Service Registries are used as the input seeds for the Pica-Pica Web Service Description Crawler. Moreover, in this version of the Pica-Pica Web Service Description Crawler, there is a single Python script for each Web Service Registry, and the crawling processes of these Web Service Registries' Python scripts are executed one after another.
Biocatalogue: http://www.biocatalogue.com
Ebi: http://www.ebi.ac.uk
Seekda: http://www.seekda.com
Service-Repository: http://www.service-repository.com
Xmethods: http://www.xmethods.net
(2) After being fed with the input seed, the crawler steps into the next component, the Service Page Grabber. At first, this component tries to read the data from the Web based on the input seed. Then it establishes a parse tree of the read data by means of the functions of the Beautiful Soup library. After that, the Service Page Grabber component starts to look for the service page link of each service published in the Web Service Registry by means of the functions of the html5lib library. When the service page link of a single service is found, the component first checks whether this service page link is valid. Once the service page link is valid, it is passed to the following two components for further processing: the WSDL Grabber component and the Property Grabber component.
(3) When the WSDL Grabber component receives a service page link from its previous component, it sets out to extract the WSDL link address of that service through the parse tree of the data in this service page. Next, this component starts to download the WSDL document of that service in terms of the WSDL link address, and thereafter the obtained WSDL document is stored on disk. The process of this WSDL Grabber component is carried on continually until no more service links are passed to it. Certainly, not all grabbed WSDL documents are effective: they may contain bad definitions or a bad namespace URI, or be an empty document; worse, a document may not even be in XML format. Hence, in order to pick these out, this component further analyzes the involved WSDL documents and puts all valid documents into a "validWSDLs" folder, whereas the invalid documents are put into a folder named "invalidWSDLs" in order to gather statistical information (a sketch of such a validity check is given after this list). Finally, only the WSDL documents in the "validWSDLs" folder are passed to the subsequent component.
(4) Moreover, since some Web Service Registries give additional information about the services, such as availability, service provider, and version, the Property Grabber component sets out to extract this information as the service's properties and thereafter saves these properties into an INI file. However, if no additional information is available, there is no need to extract service properties, and thus there is no INI file for that service. In this implementation of the Pica-Pica Web Service Description Crawler, only the Python scripts for the Seekda and Service-Repository Web Service Registries have functions to extract the services' properties, while the scripts for the other three Web Service Registries have no such functions.
(5) Furthermore, it is optional to create a report file which contains statistical information about this process, such as the total number of services of one Web Service Registry, the number of services whose WSDL document is invalid, etc.
(6) As has been stated, at this point there is a folder with all valid WSDL documents, and there may also be some INI files. Therefore, the task of the WSML Register component is now to generate the appropriate WSML documents from these valid WSDL documents and INI files and then register them in ConQo.
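The validity check described in step (3) can be illustrated with the following minimal sketch. The actual Pica-Pica crawler performs this check in Python; the sketch below is written in Java only for consistency with the implementation language of this master program, and the class name and the exact checks are assumptions.

import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// A WSDL document is rejected if it is empty, not well-formed XML, or if
// its root element is not a "definitions" element in the WSDL namespace.
public class WsdlValidator {

    private static final String WSDL_NS = "http://schemas.xmlsoap.org/wsdl/";

    public static boolean isValidWsdl(File file) {
        if (file.length() == 0) {
            return false; // empty document
        }
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder().parse(file);
            Element root = doc.getDocumentElement();
            // bad definitions or bad namespace URI
            return "definitions".equals(root.getLocalName())
                    && WSDL_NS.equals(root.getNamespaceURI());
        } catch (Exception e) {
            return false; // not parseable as XML at all
        }
    }
}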
2.4 Conclusions on the Existing Strategies
This chapter presented three existing strategies: the Service-Finder project, the Information Extraction technique, and the Pica-Pica Web Service Description Crawler.
The task of this master program is to obtain the available Web Services and their related information from the Web. This is actually a procedure of extracting the needed information for each service, such as the service's WSDL document and its properties. Therefore, the Information Extraction technique, which is used to extract information hosted on the Web, can be used in this master program.
Moreover, the Service-Finder is a large project that is not only able to obtain the available Web Services and their related information from the Web, but also enriches the crawled data with annotations by means of the Service-Finder ontology and the Service Category ontology and integrates all the information into a coherent semantic model. Furthermore, it also provides capabilities for searching and browsing the data with a user interface and gives users service recommendations. However, for a master program, the Service-Finder project goes far beyond the requirements. Therefore, it is just considered as a reference for this master program.
Furthermore, since the Pica-Pica Web Service Description Crawler aims only at obtaining the available Web Services and their related information, it fulfills the primary task of this master program. Nevertheless, regarding the information about a service, the Pica-Pica Web Service Description Crawler extracts only a few properties, sometimes even none. Consequently, in order to improve the quality of the service information, as many properties about each service as possible have to be extracted. Hence, chapter 3 presents an extension of this Pica-Pica Web Service Description Crawler.
3 Design and Implementation
In the previous chapter on the state of the art, the already existing techniques and implementations were presented. In the following, the basic principle of the proposed approach, the Deep Web Services Crawler, is introduced; it is based on these previously existing techniques, especially the Pica-Pica Web Service Description Crawler.
3.1 Deep Web Services Crawler Requirements
This section mainly discusses the goals of the Deep Web Service Crawler approach, the system requirements with respect to the approach, and some non-functional requirements.
3.1.1 Basic Requirements for the DWSC
The following are the basic requirements which should be achieved:
(1) Produce the largest annotated Service Catalogue. Here, the Service Catalogue is actually a list of Web Services published in the Web Service Registry. It contains not only the WSDL document of each service but also a file with the information about that service. Therefore, for the purpose of producing the largest annotated Service Catalogue, the proposed approach needs to extract as many properties about those Web Services as possible. Moreover, it also has to download the WSDL document hosted along with each Web Service. That is to say, these properties are not only the interesting structured properties, such as the service name and its WSDL link address, but also some other non-functional properties, for example the endpoint and its monitoring information.
(2) Flexible storage: each service has a Service Catalogue that contains all its interesting properties. How to deal with those service properties, that is, what kinds of schemes will be used to store them, is a major question. Hence, in order to store them in a flexible way, the proposed approach provides three methods for storage: the first stores them as an XML file, the second stores them in an INI file, and the third uses a database for storage. A minimal sketch of how these three schemes can be kept interchangeable is given below.
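The following sketch illustrates how the three storage schemes could be kept interchangeable behind one common interface. All names are illustrative assumptions, not the actual classes of this program.

import java.util.Map;

// Each storage scheme implements one common interface, so the crawler can
// switch between XML file, INI file, and database storage without changes.
public interface ServiceCatalogueStore {
    // serviceName -> property name/value pairs extracted for that service
    void store(String serviceName, Map<String, String> properties) throws Exception;
}

class XmlFileStore implements ServiceCatalogueStore {
    @Override
    public void store(String serviceName, Map<String, String> properties) {
        // write the properties as elements of an XML document, e.g. with DOM
    }
}

class IniFileStore implements ServiceCatalogueStore {
    @Override
    public void store(String serviceName, Map<String, String> properties) {
        // write one "key = value" line per property into a .ini file
    }
}

class DatabaseStore implements ServiceCatalogueStore {
    @Override
    public void store(String serviceName, Map<String, String> properties) {
        // insert one record per service into a database table, e.g. via JDBC
    }
}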
3.1.2 System Requirements for the DWSC
Generally speaking, the requirements for realizing a programming project include the following:
1) Operating system: Linux/Unix, Windows (XP, Vista, 2000, etc.)
2) Programming language: C++, Java, Python, C#, etc.
3) Programming tool: NetBeans, Eclipse, Visual Studio, and so on
In this master thesis, the scripts of the Deep Web Service Crawler approach are written in the Java programming language. These code scripts have only been tested on the Windows XP and Linux operating systems, but not on other operating systems.
3.1.3 Non-Functional Requirements for the DWSC
In this part, several non-functional requirements for the Deep Web Service Crawler approach are presented:
1) Transparency: the process of data exploration and data storage should be done automatically, without the user's intervention. However, at first the user should specify the path on the hard disk which will be used to store the outputs of this program.
2) Fault tolerance: during the execution of this program, errors can inevitably happen. Therefore, in order to keep the process from being interrupted, the necessary error handling for process recovery must be in place.
3) Completeness: this approach should extract the interesting properties of each Web Service as completely as possible, e.g. endpoint, monitoring information, etc.
In addition, since the Pica-Pica Web Service Crawler has already implemented strategies for the following five URLs, the proposed approach must support no fewer than those five Web Service Registries:
Biocatalogue: http://www.biocatalogue.com
Ebi: http://www.ebi.ac.uk
Seekda: http://www.seekda.com
Service-Repository: http://www.service-repository.com
Xmethods: http://www.xmethods.net
3.2 Deep Web Services Crawler Architecture
In this section, an overview of the high-level architecture of the Deep Web Services Crawler approach is introduced first. Thereafter, four subsections follow, each outlining a single component and presenting how the components play together.
The current components and the flows of data in the Deep Web Service Crawler can be summarized as depicted in figure 3-1, using the continuous arrows. The crawler first tries to obtain the available service list page links and related service page links by crawling the Web (Web Service Extractor). Then those gathered links are processed in two separate steps: one gets the service's WSDL document (WSDL Grabber), and the other collects the properties of each service (Property Grabber). Finally, all these data are stored on the storage device (Storage). The whole detailed process of figure 3-1 is illustrated as follows:
- Step 1:
When the Deep Web Service Crawler starts to run, the File Chooser container requires the user to specify a path on the computer or on any other hard disk. The reason for specifying the path is that the Deep Web Service Crawler program needs a place to store all its outputs.
- Step 2:
After that, the Web Service Extractor is triggered. It is the main entry to the specific crawling process. Since the Deep Web Service Crawler program is a procedure which is supposed to crawl for Web Services in some given Web Service Registries, the URL addresses of these Web Service Registries should be given as initial seeds for this Web Service Extractor process. However, since the page structures of these Web Service Registries are completely different, there is a dedicated process for each Web Service Registry.
Figure 3-1: Overview of the basic architecture of the Deep Web Services Crawler
- Step 3:
According to the given seed, two types of links are obtained by the Web Service Extractor component: the service list page link and the service page link. A service list page is a page that contains a list of Web Services and possibly some information about these Web Services, while a service page is a page that contains much more information about a single service. Finally, the component forwards these two types of links to the next two components: the Property Grabber and the WSDL Grabber.
- Step 4:
On the one hand, the Property Grabber component tries to gather the information about the service that is hosted in the service list page and the service page, such as the name of the service, its description, its ranking, etc. Finally, all the information of the service is collected together as the service properties, which are then delivered to the Storage component for further processing.
- Step 5:
On the other hand, the WSDL Grabber tries to obtain the WSDL link from the service list page or the service page. This is because for some Web Service Registries the WSDL link is hosted in the service list page, like in Biocatalogue, while for other Web Service Registries it is hosted in the service page, such as in Xmethods. After obtaining the WSDL link, it is also transmitted to the Storage component for further processing.
- Step 6:
When the service properties and the WSDL link of a service are received by the Storage component, it stores them on disk. The service properties are stored on disk by means of three different schemes: an XML file, an INI file, or one record inside a database table. For the WSDL link, the Storage component first tries to download the page content according to the URL address of the WSDL link. If this succeeds, the page content of the service is stored as a WSDL document on disk.
- Step 7:
Nevertheless, steps 3 to 6 describe the crawling process of just a single service. Hence, if there is more than one service, or more than one service list page, in those Web Service Registries, the crawling process from step 3 to step 6 is repeated again and again until no service or service list page is left in those Web Service Registries (a minimal sketch of this loop follows after step 8).
- Step 8:
Furthermore, after the crawling process for one Web Service Registry finishes, a file is generated that contains some statistical information about this crawling process, for example when the crawling process of this Web Service Registry started and when it finished, the total number of Web Services in this Web Service Registry, how many services have an empty WSDL document, the average number of service properties in this Web Service Registry, and the average time cost for extracting service properties, getting the WSDL document, generating the XML file and INI file, etc.
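The following minimal sketch summarizes the control flow of steps 3 to 7 for a single Web Service Registry. The interfaces merely stand in for the four components of figure 3-1; all names are illustrative assumptions rather than the actual API of the implemented crawler.

import java.util.List;
import java.util.Map;

public class CrawlLoop {
    public static void crawlRegistry(String seedUrl, WebServiceExtractor extractor,
                                     PropertyGrabber propertyGrabber,
                                     WsdlGrabber wsdlGrabber, Storage storage) {
        // Step 3: obtain the service list pages, one after another
        for (String listPageLink : extractor.getServiceListPageLinks(seedUrl)) {
            // Step 3: obtain the service page link of every listed service
            for (String servicePageLink : extractor.getServicePageLinks(listPageLink)) {
                // Step 4: gather the service properties from both pages
                Map<String, String> properties =
                        propertyGrabber.grab(listPageLink, servicePageLink);
                // Step 5: obtain the WSDL link from whichever page hosts it
                String wsdlLink = wsdlGrabber.grab(listPageLink, servicePageLink);
                // Step 6: store the properties and download the WSDL document
                storage.storeProperties(properties);
                storage.downloadWsdl(wsdlLink);
            }
        }
    }

    // Interfaces standing in for the four components of figure 3-1.
    interface WebServiceExtractor {
        List<String> getServiceListPageLinks(String seed);
        List<String> getServicePageLinks(String listPage);
    }
    interface PropertyGrabber {
        Map<String, String> grab(String listPage, String servicePage);
    }
    interface WsdlGrabber {
        String grab(String listPage, String servicePage);
    }
    interface Storage {
        void storeProperties(Map<String, String> properties);
        void downloadWsdl(String wsdlLink);
    }
}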
3.2.1 The Function of the Web Service Extractor Component
The Web Service Extractor component is responsible for gathering service list page links and service page links. It pursues a focused crawl of the Web and only forwards service list page and service page links to subsequent components for analyzing, collecting, and gathering purposes. Therefore, it identifies both service list page links and related service page links in these Web Service Registries.
As can be seen from figure 3-2, a crawl for Web Services needs to start from a URL seed. The seed is almost as important as the Web Service Extractor itself, as it highly influences the part of the Web which needs to be crawled. The seed can or shall contain, e.g., Web pages where these Web Services are published or pages that talk about Web Services.
Figure 3-2: Overview of the process flow of the Web Service Extractor component
After being fed with the URL seed, the Web Service Extractor component starts to get the link of the service list page from the initial page of this URL seed. However, this process is different for each of the five Web Service Registries. The following shows the different situations in these Web Service Registries.
- Service-Repository Web Service Registry:
In this Web Service Registry, the link of the first service list page is the URL address of its seed, which means some Web Services can already be found on the home page of the Service-Repository Web Service Registry. The "first" here implies that there is more than one service list page link in this Web Service Registry. Therefore, the process of getting service list page links in this registry is carried on continually until no more service list page links exist.
- Xmethods Web Service Registry:
Although there are Web Services on the home page of the Xmethods Web Service Registry, these are only a small subset of the Web Services in this registry. Moreover, in the Xmethods Web Service Registry there is only one page containing all Web Services. Therefore, the service list page link of that page has to be obtained.
- Ebi Web Service Registry:
The situation in the Ebi Web Service Registry is a little bit like that in the Xmethods Web Service Registry. That is to say, there is also one page that contains all Web Services in this Web Service Registry. However, this page is not the initial page of the input seed. Therefore, more than one operation step is needed to get the service list page link of that page.
- Seekda Web Service Registry:
In the Seekda Web Service Registry, the Web Services are not contained in the initial page of the input seed. The service list page link can be obtained after several additional operation steps. However, there is a problem with getting the service list page links in this registry: simply put, if there is more than one page containing the Web Services, then for some unknown reason the links of the remaining service list pages cannot be obtained. In other words, only the link of the first service list page can be obtained.
- Biocatalogue Web Service Registry:
The process of getting service list pages in the Biocatalogue Web Service Registry is almost the same as in the Seekda Web Service Registry. The only difference is that in the Biocatalogue Web Service Registry all service list page links can be obtained if there is more than one service list page.
Then, after getting the link of a service list page, the Web Service Extractor begins to get the link of the service page for each service listed in the service list page. The reason why it can do this is that there is an internal link for every service which addresses the service page. It is worth noting that once a service page link is obtained, this component immediately forwards the service page link and the service list page link to the subsequent two components for further processing. The process of obtaining the service page links is carried out continuously until all services listed in that service list page are crawled. Analogously, the process of getting service list pages is also carried out continuously until no more service list pages exist.
3.2.1.1 Features of the Web Service Extractor Component
The main features are described in the following paragraphs.
1) Obtain service list page links
A central task is to obtain the corresponding service list page links. A service list page link is a URL address of a page that includes a public list of Web Services together with brief information about these Web Services, like the name of the service, an internal URL that links to another page containing the detailed information about that service, and sometimes the link address of the WSDL document.
2) Obtain service page links
Once service list page links are found, a crucial aspect is to extract the internal link of each service to aid the task of service information discovery. Therefore, it is the task of the Web Service Extractor to harvest the HTML page content of the service list page, so that the service page links, which contain much more detailed information about the single Web Service, can be obtained.
3.2.1.2 Input of the Web Service Extractor Component
This component depends on specific input seeds. The only input required for this component is a URL seed, which is one of the URLs displayed in section 3.1.3.
3.2.1.3 Output of the Web Service Extractor Component
The component produces two kinds of service-related page links from the Web:
- Service list page links
- Service page links
3214 Demonstration for Web Service Extractor
In order to have comprehensive understanding of the process of the Web Service Extractor
component following gives some figures for explanation Though there are five URL addresses in this
section only the URL of the service-repository address is showed as an example
1) The input seed is the initial URL address of the Service-Repository which is
ldquohttpwwwservice-repositorycomrdquo
2) As has already said in section 321 the first service list link of this Web Service Registry is its input
seed ldquohttpwwwservice-repositorycomrdquo Figure 3-3 shows the corresponding service list page
of that link
Figure3-3 Service list page of the Service-Repository
Figure3-4Origianl source code of the internal link for Web service ldquoBLZServicerdquo
Figure3-5Code Overview of getting service page link in Service Repository
Figure3-6 Service page of the Web service ldquoBLZServicerdquo
3) Now that the service list page link is known, the next step is to acquire the service page links of the
services listed on the service list page. The text in the red box of figure 3-4 shows the internal link of
the Web service "BLZService". However, this is not the complete link of the service page: it has to be
prefixed with the initial URL address of the Service-Repository Web Service Registry,
"http://www.service-repository.com". The code for getting the service page link of a Web service in
Service-Repository is shown in figure 3-5. The final link of this service page is therefore
"http://www.service-repository.com/service/overview-210897616". Figure 3-6 shows the corresponding service
page of that link.
4) Afterwards, the two gathered links, the service list page link and the service page link, are
immediately forwarded by the Web Service Extractor component to the next two components, the WSDL Grabber
component and the Property Grabber component.
322 The Function of WSDL Grabber Component
The WSDL Grabber component acquires the WSDL link hosted on the Web, based on the service list page link or
the service page link. The whole process flow is illustrated in figure 3-7.
Figure 3-7: Overview of the process flow of the WSDL Grabber Component
When the WSDL Grabber component receives the inputs delivered by the previous component, it starts to get
the WSDL link of the service based on these inputs. However, although the inputs of the WSDL Grabber
component are the links of the service page and the service list page, only one of them contains the WSDL
link of the corresponding service; that is to say, the WSDL link exists either in the service page or in
the service list page. The reason why both links need to be delivered to this component is that exactly one
of the five Web Service Registries, namely Biocatalogue, hosts the WSDL link in the service list page,
while for the other four Web Service Registries the WSDL link is hosted in the service page. Therefore, the
WSDL links in these four Web Service Registries are obtained by means of the service page link, and for the
Biocatalogue Web Service Registry the WSDL link is obtained through the service list page link. However,
there is a problem with getting the WSDL links of the Biocatalogue Web Service Registry: some of the Web
services listed in the service list pages of the Biocatalogue Web Service Registry have no WSDL link; in
other words, these services have no WSDL document. In such a situation, the WSDL link of these Web services
is assigned the value "NULL". For the Web services in the other four Web Service Registries, however, the
WSDL link always exists in the service page. Eventually, whenever the WSDL Grabber component has extracted
the WSDL link of a single Web service, this link is immediately forwarded to the Storage component, which
downloads the WSDL document at once.
3221 Features of the WSDL Grabber Component
The WSDL Grabber component has to provide the following feature:
• Obtain WSDL links
The WSDL link is the direct way to reach the page that contains the contents of the WSDL document. It is
actually a URL address, but it usually ends with something like "wsdl" or "WSDL" to indicate that it points
to the page of a WSDL document.
3222 Input of the WSDL Grabber Component
This component requires the following input data:
• Service list page link
• Service page link
3223 Output of the WSDL Grabber Component
The component will only produce the following output data:
• The URL address of the WSDL link of each service
3224 Demonstration for WSDL Grabber Component
This section presents a list of figures in order to give a comprehensive understanding of the process of
the WSDL Grabber component. It uses the same Web service of the Service-Repository as an example.
1) The input of the WSDL Grabber component is the link of the service page obtained by the Web Service
Extractor component. The address of this link is
"http://www.service-repository.com/service/overview-210897616".
Figure 3-8: WSDL link of the Web service "BLZService" in the service page
2) Figure 3-9 shows the corresponding original source code of the WSDL link highlighted in figure 3-8.
Figure 3-9: Original source code of the WSDL link for the Web service "BLZService"
3) Figures 3-10 and 3-11 show the code used to extract the WSDL link shown in figure 3-9. Note that figure
3-10 is the particular code for the Service-Repository Web Service Registry only; for the other four Web
Service Registries this code differs. The function "getServiceRepositoryWSDLLink" first gets a list of all
nodes that have the HTML tag name "b". Then it checks these nodes one by one to see whether the text value
of a node is "WSDL". If a node fulfills this condition, the attribute value of its sibling, which is an "a"
element here, is extracted as the WSDL link of this Web service. A sketch of this traversal is given below.
Figure 3-10: Code overview of the "getServiceRepositoryWSDLLink" function
Figure 3-11: Code overview of the "oneParameter" function
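Since figures 3-10 and 3-11 are code screenshots that cannot be reproduced here, the following sketch re-implements the described traversal with jsoup as a stand-in parser. It assumes that the "a" element is the direct next sibling of the "b" label, which is a simplification of the actual page structure.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class WsdlLinkExtractorSketch {

    // Scans all <b> nodes of the service page; where the text equals "WSDL",
    // the href of the neighbouring <a> element is taken as the WSDL link.
    public static String getServiceRepositoryWsdlLink(String servicePageUrl) throws Exception {
        Document page = Jsoup.connect(servicePageUrl).get();
        for (Element bold : page.select("b")) {
            if ("WSDL".equals(bold.text().trim())) {
                Element sibling = bold.nextElementSibling();  // the <a> next to the label
                if (sibling != null && "a".equals(sibling.tagName())) {
                    return sibling.attr("abs:href");
                }
            }
        }
        return "NULL";  // no WSDL link found on this page
    }
}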
4) Finally, after applying the functions shown in figures 3-10 and 3-11, the WSDL link of the Web service
"BLZService" is obtained. Its value is
"http://services.unitedplanet.de/blz/BlzService.asmx?WSDL".
323 The Function of Property Grabber Component
The Property Grabber component is a module that extracts and gathers the Web service information hosted on
the Web, namely the information shown in the service list page and the service page. In the end, all the
obtained Web service information is collected as the service properties, which are delivered to the Storage
component for storing. The detailed process flow of the Property Grabber component is illustrated in figure
3-12.
The inputs of the Property Grabber component are the same as the inputs of the WSDL Grabber component.
However, there is still a small difference between the two components with respect to the seeds: as already
mentioned in section 322, for the WSDL Grabber component one of the inputs is sufficient to get the WSDL
link, whereas the Property Grabber component needs both of them as seeds.
Figure 3-12: Overview of the process flow of the Property Grabber Component
Once the Property Grabber component has received the needed inputs, it starts to extract the service
information of the single Web service. Generally speaking, the service information consists of four
aspects: structured information, endpoint information, monitoring information, and whois information.
(1) Structured Information
The structured information is obtained by extracting the information hosted in the service page and the
service list page. It is the basic descriptive information about the service, such as the name of the
service, the URL address through which the WSDL document can be obtained, the description introducing the
service, the provider of the service, its rating, and the server that hosts the service. However, the
elements constituting this structured information differ among the Web Service Registries. For example,
the rating information of a Web service exists in the Service-Repository Web Service Registry, while the
Xmethods Web Service Registry does not have this information. In addition, even for Web services within the
same Web Service Registry, some elements of the structured information may be missing. For instance, one
service in a Web Service Registry may have a description, while another service in the same registry does
not. Tables 3-1 to 3-5 show the structured information that should be extracted from these five Web Service
Registries. Moreover, if the style of a Web service in the Biocatalogue Web Service Registry is SOAP, there
is some additional information describing the SOAP operations of this service; if it is REST, the
additional information concerns the REST operations. This information should also be considered a part of
the structured information. Table 3-6 and table 3-7 show the information for these two different operation
types.
Service Name, WSDL Link, WSDL Version, Provider, Server, Rating, Homepage, Owner Homepage, Description
Table 3-1: Structured Information of the Service-Repository Web Service Registry
Service Name, WSDL Link, Provider, Service Style, Homepage, Implementation Language, Description, User
Description, Contributed Client Name, Type of this Client, Publisher for this Client, Used Toolkit of this
Client, Used Language of this Client, Used Operating System of this Client
Table 3-2: Structured Information of the Xmethods Web Service Registry
Service Name, WSDL Link, Server, Provider, Provider's Country, Service Style, Rating, Description, User
Description, Service Tags, Documentation (within WSDL)
Table 3-3: Structured Information of the Seekda Web Service Registry
Service Name, WSDL Link, Port Name, Service URL Address, Implementation Class
Table 3-4: Structured Information of the Ebi Web Service Registry
Service Name, WSDL Link, Style, Provider, Provider's Country, View Times, Favorite Times, Submitter,
Service Tags, Total Annotation, Provider Annotation, Member Annotation, Registry Annotation, Base URL,
SOAP Lab Server Base URL, Description, User Description, Category
Table 3-5: Structured Information of the Biocatalogue Web Service Registry
SOAP Operation Name, Inputs and Outputs, Operation Description, Operation Tags, Part of Which Service
Table 3-6: SOAP Operation Information of the Biocatalogue Web Service Registry
REST Operation Name, Service Tags, Used Template, Operation Description, Part of Which Service, Part of
Which Endpoint Group
Table 3-7: REST Operation Information of the Biocatalogue Web Service Registry
(2) Endpoint Information
Endpoint information is the port information of the Web service, which can be extracted only from the
service page. Since different Web Service Registries structure the endpoint information of their Web
services differently, some elements of the endpoint information can be very diverse. One thing needs
attention: the Ebi Web Service Registry does not provide endpoint information for any of the Web services
published in this registry. Moreover, although the Web services within the same Web Service Registry share
the same structure of endpoint information, some elements of it may be missing or empty. Furthermore, these
Web Service Registries may even have no endpoint information at all for some of the Web services published
by them. Nevertheless, whenever there is endpoint information for a Web service, it contains at least one
element, namely the URL address of the endpoint. Table 3-8 shows the endpoint information that should be
extracted from these five Web Service Registries.
Web Service Registry Name: Elements of the Endpoint Information
Service Repository: Endpoint Name, Endpoint URL, Endpoint Critical, Endpoint Type, Bound Endpoint
Xmethods: Endpoint URL, Publisher of this Endpoint, Contact Email of this Publisher, Implementation
Language of this Endpoint
Seekda: Endpoint URL
Biocatalogue: Endpoint Name, Endpoint URL
Table 3-8: Endpoint Information of these five Web Service Registries
Web Service Registry Name: Elements of the Monitoring Information
Service Repository: Service Availability, Number of Downs, Total Uptime, Total Downtime, MTTR, MTBF, RTT
Max of Endpoint, RTT Min of Endpoint, RTT Average of Endpoint, Ping Count of Endpoint
Seekda: Service Availability, Begin Time of Monitoring
Biocatalogue: Monitored Status of Endpoint, Overall Status of Service
Table 3-9: Monitoring Information of these five Web Service Registries
(3) Monitoring Information
Monitoring information is statistical information gathered by testing the Web service. It is worth noting
that the Ebi and Xmethods Web Service Registries do not provide monitoring information for any of the Web
services published by them, while in the other three Web Service Registries only a few Web services may
lack this information. Table 3-9 displays the monitoring information for these three Web Service
Registries.
(4) Whois Information
Whois information is not extracted from the information hosted in the service page and the service list
page. It is the descriptive information about the service domain, which can be gained by means of the
address of the WSDL link. Because of that, the process of getting the whois information starts with
deriving the service domain. The final value of the service domain must not contain strings like "http",
"https", or "www"; it must be the registrable part under the top-level domain. After that, the service
domain database is queried by sending the value of the service domain to a whois client, which is simply a
Web site on the Internet, for example "http://www.whois365.com/cn/domain". If information about that
service domain exists, a list of its information is returned as the output. However, the structure of the
returned information differs from service domain to service domain. Therefore, the most challenging thing
is that the extraction process has to deal with each different form of the returned information. Table 3-10
gives the whois information that needs to be extracted for all five Web Service Registries.
Service Domain URL, Domain Name, Domain Type, Domain Address, Domain Description, State, Postal Code, City,
Country, Country Code, Phone, Fax, Email, Organization, Established Time
Table 3-10: Whois Information for these five Web Service Registries
Finally, the information of all four aspects is collected together and then delivered to the Storage
component for further storage processing. A sketch of the domain-derivation step described under aspect (4)
is given below.
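A possible implementation of this domain-derivation step is sketched below. The rule of keeping only the last two labels of the host name is an assumption that works for domains such as thomas-bayer.com, but not for every country-code top-level domain.

import java.net.URI;

public class ServiceDomainSketch {

    // Derives the service domain from a WSDL link: strip the scheme and a
    // leading "www.", then keep only the registrable part of the host name.
    public static String serviceDomain(String wsdlLink) throws Exception {
        String host = new URI(wsdlLink).getHost();   // e.g. www.thomas-bayer.com
        if (host.startsWith("www.")) {
            host = host.substring(4);
        }
        String[] labels = host.split("\\.");
        int n = labels.length;
        return n <= 2 ? host : labels[n - 2] + "." + labels[n - 1];
    }

    public static void main(String[] args) throws Exception {
        // prints "thomas-bayer.com"
        System.out.println(serviceDomain("http://www.thomas-bayer.com/axis2/services/BLZService?wsdl"));
    }
}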
3231 Features of the Property Grabber Component
The Property Grabber component has to provide the following features:
• Obtain basic information
Generally speaking, the more information a Web service provides, the better users can judge how good this
Web service is. Hence, the Property Grabber component has to extract all the basic information hosted in
the service list page and the service page. This basic information comprises the structured information,
the endpoint information, and the monitoring information.
• Obtain whois information
For the same reason, it is necessary to extract as much information about the Web service as possible.
Therefore, besides the basic information, the Property Grabber component also obtains some additional
information called whois information, such as the type of the domain, the name of the person in charge,
the postal code of the domain, the city, phone, fax, the detailed address, the email, etc.
3232 Input of the Property Grabber Component
This component requires the following input data:
• Service list page link
• Service page link
3233 Output of the Property Grabber Component
The component will produce the following output data:
• Structured information of each service
• Endpoint information of each service, if it exists
• Monitoring information of the service and its endpoints, if it exists
• Whois information of the service domain
All this information is collected together as the properties of each service. Thereafter, the collected
properties are sent to the Storage component.
3234 Demonstration for Property Grabber Component
The figures 3-13 to 3-16 depict the fundamental procedure of the Property Grabber component. To simplify
the explanation, the example shown in this section uses the same Web service as before.
1) The inputs of the Property Grabber component are the links of the service list page and the service page
received from the Web Service Extractor component. These links are "http://www.service-repository.com"
and "http://www.service-repository.com/service/overview-210897616".
2) When the "GetProperty" function is triggered, it calls its four sub functions shown in figure 3-12.
Figure 3-13: Structured properties of the service "BLZService" in the service list page
Figure 3-14: Structured properties of the service "BLZService" in the service page
3) Firstly, the "getStructuredProperty" function tries to extract the structured information displayed in
the red boxes of figure 3-13 and figure 3-14. However, several elements of the structured information have
the same content, such as the description shown in the service page and in the service list page. Hence, in
order to save extraction time and storage space, elements with the same content are extracted only once.
Moreover, the rating information requires a transformation from non-descriptive to descriptive text,
because its content consists of several star images. The final results of the extracted structured
information of this Web service are shown in table 3-11. Because there is no descriptive information for
the provider, the homepage, and the owner homepage, their values are assigned as "NULL".
Service Name: BLZService
WSDL Link: http://www.thomas-bayer.com/axis2/services/BLZService?wsdl
WSDL Version: 0
Server: Apache-Coyote/1.1
Description: BLZService
Rating: Four and a Half Stars
Provider: NULL
Homepage: NULL
Owner Homepage: NULL
Table 3-11: Extracted Structured Information of the Web service "BLZService"
4) Secondly, the "getEndpointProperty" function extracts the endpoint information displayed in the red box
of figure 3-15. Although there is more than one endpoint record in the list, only one of them is extracted
as the endpoint information. The reason is that this master program is intended to extract as much
information as possible, but this information should not contain redundant entries. Therefore, only one
record is extracted as the endpoint information, even if there are several endpoint records. Table 3-12
shows the final results of the endpoint information of this Web service.
Figure 3-15: Endpoint information of the Web service "BLZService" in the service page
Endpoint Name: BLZServiceSOAP12port_http
Endpoint URL: http://www.thomas-bayer.com:80/axis2/services/BLZService
Endpoint Critical: True
Endpoint Type: production
Bound Endpoint: BLZServiceSOAP12Binding
Table 3-12: Extracted Endpoint Information of the Web service "BLZService"
5) Then the monitoring information is extracted by invoking the function "getMonitoringProperty". Figure
3-16 displays two types of monitoring properties: the upper red box contains the monitoring information
about the Web service, and the lower red box lists the monitoring information of its endpoints. As already
mentioned before, only one endpoint statistic record is extracted. Besides, as can be seen in figure 3-16,
there are two availability values. They both represent the availability of this Web service, just like the
availability shown in figure 3-14; therefore, one of these availability values is sufficient. Table 3-13
shows the final results of this extraction process.
Figure 3-16: Monitoring information of the service "BLZService" in the service page
Service Availability: 100
Number of Downs: 0
Total Uptime: 1 day 19 hours 19 minutes
Total Downtime: 0 seconds
MTBF: 1 day 19 hours 19 minutes
MTTR: 0 seconds
RTT Max of Endpoint: 141 ms
RTT Min of Endpoint: 0 ms
RTT Average of Endpoint: 57.7 ms
Ping Count of Endpoint: 112
Table 3-13: Extracted Monitoring Information of the Web service "BLZService"
6) Next, in order to extract the whois properties, the "getWhoisProperty" function first has to derive the
service domain from the WSDL link. For this Web service, the obtained service domain is
"thomas-bayer.com". It then sends this service domain as input to the whois client for the querying
process, which returns a list of information about that service domain, see figure 3-17. Table 3-14 shows
the extracted whois information.
Figure 3-17: Whois information of the service domain "thomas-bayer.com"
Service Domain URL: thomas-bayer.com
Domain Name: Thomas Bayer
Domain Type: NULL
Domain Address: Moltkestr. 40
Domain Description: NULL
State: NULL
Postal Code: 54173
City: Bonn
Country: NULL
Country Code: DE
Phone: +4922855525760
Fax: NULL
Email: info@predic8.de
Organization: predic8 GmbH
Established Time: NULL
Table 3-14: Extracted Whois Information of the service domain "thomas-bayer.com"
7) Finally, all the information of these four aspects is collected together as the service properties, and
these service properties are then forwarded to the Storage component.
324 The Function of Storage Component
The WSDL link from the WSDL Grabber component is used by the Storage component to download the WSDL
document from the Web and to store it on disk thereafter. In addition, the service properties from the
Property Grabber component are also stored directly on disk, in three different formats, by this Storage
component. Figure 3-18 illustrates the process flow of the Storage component.
When the Storage component receives the needed inputs, the mediator function "Storager" of the Storage
component is triggered. It transforms the service properties into three different output formats and stores
them on disk: an XML file, an INI file, and database records. Besides, it also tries to download the WSDL
document from the URL address of the WSDL link and then stores the obtained WSDL document on disk as well.
This "Storager" function is composed of four sub functions, namely the "getWSDL", "generateXML",
"generateDatabase", and "generateINI" sub functions. Each sub function is in charge of one aspect of the
storage tasks.
Figure 3-18: Overview of the process flow of the Storage Component
(1) "getWSDL" sub function
The task of the "getWSDL" sub function is to download the WSDL document and then store it on disk. Above
all, it has to get the content of the WSDL document. This procedure works as follows. First, the "getWSDL"
sub function checks whether the value of the received WSDL link equals "NULL". As already presented in
section 322, if a Web service has no WSDL link, the value of its WSDL link is assigned "NULL". In this
case, it creates a WSDL document whose name is the service name appended with the mark "No WSDL Document";
obviously, this document contains no content, it is an empty document. If the service does have a WSDL
link, this sub function tries to connect to the Internet using the URL address of the WSDL link. Once it
succeeds, the contents hosted on the Web are downloaded, stored on disk, and named with the name of the
service only. Otherwise, it creates a WSDL document whose name is prefixed with "Bad" before the service
name. A sketch of this case analysis is given below.
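The case analysis just described could look roughly as follows. The parameter names are modelled on the text, and the exact file-naming details are assumptions rather than the code of figure 3-19.

import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;

public class GetWsdlSketch {

    // Mirrors the three cases described above: no WSDL link at all,
    // a reachable WSDL link, and a WSDL link that cannot be opened.
    public static void getWsdl(String path, String name, String linkStr) throws Exception {
        if ("NULL".equals(linkStr)) {
            // no WSDL link: create an empty, specially marked document
            Files.createFile(Path.of(path, name + "[No WSDL Document].wsdl"));
            return;
        }
        try (InputStream in = new URL(linkStr).openStream()) {
            Files.copy(in, Path.of(path, name + ".wsdl"));   // normal case
        } catch (Exception unreachable) {
            // link present but not reachable: mark the document as "Bad"
            Files.createFile(Path.of(path, "Bad" + name + ".wsdl"));
        }
    }
}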
(2) "generateXML" sub function
The "generateXML" sub function takes the service properties as input, transforms them into an XML file, and
stores it on disk under the name of the service plus ".xml". XML stands for eXtensible Markup Language, a
markup language designed to transport and store data. The first line of an XML file is the XML declaration,
which defines the XML version and the encoding used. For example, "<?xml version="1.0" encoding="UTF-8"?>"
means this is an XML file whose version is 1.0 and whose encoding is the 8-bit Unicode Transformation
Format character set. Besides, an XML file contains XML elements, each reaching from the element's start
tag to the element's end tag. Moreover, an XML element can contain other elements, simple text, or a
mixture of both. However, an XML file must contain a root element as the parent of all other elements. A
sketch of this generation step follows below.
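A minimal sketch of such an XML generation step is shown below. It is not the code of figure 3-20; in particular, it omits the escaping of XML special characters that a robust implementation would need.

import java.io.PrintWriter;
import java.util.LinkedHashMap;
import java.util.Map;

public class GenerateXmlSketch {

    // Writes the name-value pairs of one service under a single root element
    // "service", as described above; tag names are derived from the property names.
    public static void generateXml(String path, String serviceName,
                                   Map<String, String> properties) throws Exception {
        try (PrintWriter out = new PrintWriter(path + "/" + serviceName + ".xml", "UTF-8")) {
            out.println("<?xml version=\"1.0\" encoding=\"UTF-8\"?>");
            out.println("<!-- generated service properties -->");
            out.println("<service>");
            for (Map.Entry<String, String> p : properties.entrySet()) {
                String tag = p.getKey().replace(' ', '_');   // crude tag sanitizing
                out.println("  <" + tag + ">" + p.getValue() + "</" + tag + ">");
            }
            out.println("</service>");
        }
    }

    public static void main(String[] args) throws Exception {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("Service Name", "BLZService");
        props.put("WSDL Version", "0");
        generateXml(".", "1BLZService", props);
    }
}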
(3) "generateINI" sub function
The "generateINI" sub function also takes the service's properties as input, but it transforms them into an
INI file and then stores it on disk under the name of the service plus ".ini". "ini" stands for
initialization. The INI file format is a de facto standard for configuration files; INI files are simple
text files with a basic structure. Generally speaking, an INI file consists of three different parts:
sections, parameters, and comments. A parameter is the basic element contained in an INI file. Its format
is a key-value pair, also called a name-value pair. The pair is delimited by an equals sign "=", and the
key or name always appears to the left of the equals sign. A section is like a room that groups all its
parameters together. It always appears on a single line within a pair of square brackets "[]", and sections
may not be nested. A comment is descriptive text that begins with a semicolon ";"; anything between the
semicolon and the end of the line is ignored. A sketch of this generation step follows below.
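Analogously, a minimal sketch of the INI generation follows; the comment lines and the section content are illustrative only.

import java.io.PrintWriter;
import java.util.Map;

public class GenerateIniSketch {

    // Writes one service as an INI file with the three parts described above:
    // comments (";"), one section ("[...]"), and key-value parameters ("=").
    public static void generateIni(String path, String serviceName,
                                   Map<String, String> properties) throws Exception {
        try (PrintWriter out = new PrintWriter(path + "/" + serviceName + ".ini", "UTF-8")) {
            out.println("; generated service properties");     // comment lines
            out.println("; one section per service");
            out.println("[" + serviceName + "]");              // the section
            for (Map.Entry<String, String> p : properties.entrySet()) {
                out.println(p.getKey() + "=" + p.getValue());  // key-value parameters
            }
        }
    }
}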
(4) "generateDatabase" sub function
The inputs of the "generateDatabase" sub function are the same as for the previous two sub functions.
Instead of transforming them into a file like XML or INI, the "generateDatabase" sub function turns them
into database records by using SQL statements. SQL stands for Structured Query Language, a database
language designed for accessing and manipulating data in a database. Most of the actions performed on a
database are done with SQL statements, and the primary statements of SQL include insert into, delete,
update, select, create, alter, and drop. Therefore, for the purpose of transforming the service properties
into database records, this sub function first has to create a database, using the "create database"
statement. Then it creates a table to store the data. A table is a collection of related data entries and
consists of columns and rows. Since the data of all five Web Service Registries are not very large, one
database table is enough for storing these service properties. Because of that, the column names of the
service properties for all five Web Service Registries have to be uniform and well-defined. Afterwards, the
service properties of each single service can be put into the table as one record with the "insert into"
statement of SQL. A minimal sketch of this procedure follows below.
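The sketch below uses JDBC; it assumes an SQLite database reached through the sqlite-jdbc driver and shows only two property columns, whereas the thesis uses one wide table with the unified column names of all five registries.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.Statement;

public class GenerateDatabaseSketch {

    // Creates one table whose property columns are all TEXT (the property
    // lengths are unknown, as explained above) and inserts one service record.
    public static void storeService(String serviceName, String wsdlLink) throws Exception {
        try (Connection con = DriverManager.getConnection("jdbc:sqlite:services.db")) {
            try (Statement st = con.createStatement()) {
                st.executeUpdate("CREATE TABLE IF NOT EXISTS services ("
                        + "id INTEGER PRIMARY KEY, "            // increasing integer key
                        + "service_name TEXT, wsdl_link TEXT)");
            }
            try (PreparedStatement ins = con.prepareStatement(
                    "INSERT INTO services (service_name, wsdl_link) VALUES (?, ?)")) {
                ins.setString(1, serviceName);
                ins.setString(2, wsdlLink);
                ins.executeUpdate();
            }
        }
    }
}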
3241 Features of the Storage Component
The Storage component has to provide the following features:
• Generate different output formats
The final result of this master program is to store the information of the services on disk for future
work. This Storage component provides three different formats for storing the service properties of the
services published in the Web Service Registries. This makes the storage of the services very flexible and
also ensures its longevity.
• Obtain the WSDL document
The important task of this component is to check whether the WSDL document can be obtained from the WSDL
link, because the WSDL document plays a decisive role in determining the quality of the service. This
Storage component provides the ability to deal with the different situations that arise in the process of
obtaining the WSDL document.
3242 Input of the Storage Component
This component requires the following input data:
• WSDL link of each service
• The property information of each service
3243 Output of the Storage Component
The component will produce the following output data:
• WSDL document of each service
• XML file, INI file, and table records in the database
3244 Demonstration for Storage Component
The following figures describe the fundamental implementation code of this Storage component. The detailed
description is given below.
1) As can be seen from figures 3-19 to 3-21, there are several common elements among the implementation
codes. The first common element is the pair of parameters defined in each of these sub functions, "path"
and "SecurityInt". The parameter "path" is an absolute path on the computer disk. It is used during the
storing procedure and has already been specified by the user at the beginning of the whole program. The
parameter "SecurityInt" is an increasing integer that is used as a part of the name of the service; this
prevents services that have the same name from overriding each other on disk. The code highlighted in red
in these figures is the second common element; its function is to create a file or document for the service
with the corresponding parameters.
2) Figure 3-19 displays the implementation code of the "getWSDL" sub function, which is designed to get the
WSDL document of the service based on its WSDL link. The parameter "name" is the name of the service, which
is used as the name of the WSDL document. The parameter "linkStr" is the actual WSDL link, the most
important parameter of this sub function. The other two parameters, "statistic" and "log", are objects of
the text files called "Statistic Information" and "Log Information" respectively. The "Statistic
Information" text file is used to record the statistical data of the services in each Web Service Registry,
such as the overall number of properties for that Web Service Registry, the overall number of services, the
number of services that have no WSDL link, the number of services whose WSDL documents contain no content,
the number of services whose WSDL links are not available, etc. The "Log Information" text file records the
results of the process steps and the problems encountered, for example which service is being crawled at
the moment, to which Web Service Registry it belongs, the reason why the WSDL document of a service cannot
be obtained, the reason why a service is unreachable, and so on.
Figure 3-19: Implementation code for getting the WSDL document
3) Figure 3-20 and figure 3-21 show the code for turning the service properties into the XML file and the
INI file and for storing those two files on disk thereafter. The parameter "vec" is a Vector of the
"PropertyStruct" data type. "PropertyStruct" is a class consisting of two variables, name and value.
Figure 3-20: Implementation code for generating the XML file
Figure 3-21: Implementation code for generating the INI file
4) The code in figure 3-22 and figure 3-23 shows the process that turns the service properties of all
services in these five Web Service Registries into database records. For this, a database has to be created
first. The name of the database can be arbitrary as long as it conforms to the naming rules of the
database; the same holds for the name of the table. Figure 3-22 displays the procedure of creating a table
in the database. Because it is hard to decide the length of each service property, the data types of all
service property columns are set to "Text". Figure 3-23 shows the code for inserting the service properties
into the table with the "update" statement.
Figure 3-22: Implementation code for creating a table in the database
Figure 3-23: Implementation code for generating table records
33 Multithreaded Programming for DWSC
Multithreaded programming is a built-in feature of the Java language. A multithreaded program contains two
or more separate parts that can execute concurrently; each such part is called a thread. The use of
multithreading makes it possible to create programs that use the system resources efficiently, for example
by making maximum use of the CPU, because the idle time of the CPU can be kept to a minimum.
In this master program there are five Web Service Registries whose services need to be crawled. Moreover,
the number of services published in each Web Service Registry differs considerably, which makes the running
time spent on each Web Service Registry different. Without concurrency, a Web Service Registry that owns
fewer services would have to wait until another Web Service Registry with many more services has finished.
Therefore, in order to reduce this waiting time and to maximize the use of the system resources, it is
necessary to apply multithreaded programming to this master program. That is to say, this master program
creates a thread for each Web Service Registry, and these threads are executed independently. A minimal
sketch of this design is given below.
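In the following sketch, the crawl method is a placeholder for the real per-registry crawling procedure; only the one-thread-per-registry structure corresponds to the text above.

public class RegistryCrawlerThreads {

    // Stand-in for the complete extractor, grabber and storage pipeline
    // that runs for one registry.
    static void crawl(String registryName) {
        System.out.println("crawling " + registryName
                + " on " + Thread.currentThread().getName());
    }

    public static void main(String[] args) throws InterruptedException {
        String[] registries = { "Service Repository", "Ebi",
                                "Xmethods", "Seekda", "Biocatalogue" };
        Thread[] threads = new Thread[registries.length];
        for (int i = 0; i < registries.length; i++) {
            final String name = registries[i];
            threads[i] = new Thread(() -> crawl(name), name);
            threads[i].start();                 // the registries run concurrently
        }
        for (Thread t : threads) {
            t.join();                           // wait until every registry is done
        }
    }
}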
34 Sleep Time Configuration for Web Service Registries
Because this master program is intended to download the WSDL documents and to extract the service
information of the Web services published in the Web Service Registries, it inevitably affects the
performance of these registries. In addition, in order not to exceed their throughput capacity, these Web
Service Registries restrict the rate of access. Because of that, unknown errors sometimes occur while this
master program is executing. For instance, the master program may halt at one point without getting any
more WSDL documents and service information, the WSDL documents of some services in some Web Service
Registries cannot be obtained, or some service information is missing. Therefore, in order to obtain the
largest possible number of Web services published in these five Web Service Registries without affecting
their throughput, the access rate has to be configured for each service of all Web Service Registries.
In consequence, before going into the essential procedure for each single service of these Web Service
Registries, this master program calls Java's built-in function "Thread.sleep(long milliseconds)". It is a
public static function that causes the currently executing thread to sleep for the specified number of
milliseconds; in other words, it temporarily ceases execution for a while. The following table shows the
time interval of the sleep function for each Web Service Registry; a small sketch of this throttling
follows after the table.
Web Service Registry Name: Time Interval (milliseconds)
Service Repository: 8000
Ebi: 3000
Xmethods: 10000
Seekda: 20000
Biocatalogue: 10000
Table 3-15: Sleep Time of these five Web Service Registries
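The following sketch shows how the intervals of table 3-15 can be applied with "Thread.sleep"; the map-based lookup is an illustrative assumption, not the thesis code.

import java.util.Map;

public class SleepConfigSketch {

    // Intervals from table 3-15, applied before each single service is
    // processed; Thread.sleep pauses only the calling registry thread.
    static final Map<String, Long> SLEEP_MILLIS = Map.of(
            "Service Repository", 8000L,
            "Ebi", 3000L,
            "Xmethods", 10000L,
            "Seekda", 20000L,
            "Biocatalogue", 10000L);

    static void throttle(String registryName) throws InterruptedException {
        Thread.sleep(SLEEP_MILLIS.get(registryName));   // rate limiting per service
    }
}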
4 Experimental Results and Analysis
This chapter shows the quantitative experimental results of the prototype presented in chapter 3 and
describes and explains the analysis of these results. In order to obtain reasonably accurate results, the
experiments were carried out more than five times; all data displayed in the following tables and charts
are the averages of these runs.
41 Statistic Information for Different Web Service Registries
This section discusses the amount statistics of the Web services published in these five Web Service
Registries. This includes the overall number of Web services published in each Web Service Registry and the
number of unavailable Web services, i.e., services that have been archived because they may not be active
anymore or are close to being non-active. Table 4-1 shows the service amount statistics of these five Web
Service Registries.
Web Service Registry Name: Service Repository | Ebi | Xmethods | Seekda | Biocatalogue
Overall Services: 57 | 289 | 382 | 853 | 2567
Unavailable Services: 0 | 0 | 0 | 0 | 125
Table 4-1: Service amount statistics of these five Web Service Registries
In order to give an intuitive view of the service amount statistics in these five Web Service Registries,
figure 4-1 contains a bar chart derived from table 4-1. As can be seen from the bar chart, on the one hand,
the overall number of Web services increases from the Service Repository Web Service Registry to the
Biocatalogue Web Service Registry. The Biocatalogue Web Service Registry owns the largest number of Web
services, which indicates that it has a much more powerful ability to provide Web services to the users,
because it contains far more services than the other four Web Service Registries. On the other hand, there
are no unavailable services in any Web Service Registry except for the Biocatalogue Web Service Registry.
That is to say, only the Biocatalogue Web Service Registry contains some Web services that cannot be used
by the users. To some degree this is useless, because these services cannot be used anymore, and they waste
network resources on the Web. Therefore, all these unavailable services should be eliminated in order to
reduce the waste of network resources.
Figure 4-1: Service amount statistics of these five Web Service Registries
42 Statistic Information for WSDL Document
Web Service Registry Name: Service Repository | Ebi | Xmethods | Seekda | Biocatalogue
Failed WSDL Links: 1 | 0 | 23 | 145 | 32
Without WSDL Links: 0 | 0 | 0 | 0 | 16
Empty Content: 0 | 0 | 2 | 0 | 2
Table 4-2: Statistic information for WSDL Document
Table 4-2 and figure 4-2 show the error statistics of the WSDL documents of the Web services in these five
Web Service Registries. There are three aspects of these statistics. The first one is the "Failed WSDL
Links" of the Web services in these Web Service Registries: the overall number of Web services whose WSDL
links are invalid. In other words, it is impossible to get the WSDL documents of these Web services via the
URL addresses of their WSDL links; therefore, no WSDL document is created. The second aspect is the
"Without WSDL Links" of the Web services in these Web Service Registries: the overall number of Web
services in each Web Service Registry that have no WSDL link at all. Consequently, there can be no WSDL
document for such Web services, and the value of the WSDL link of such a Web service is "NULL". However, a
WSDL document is still created, but it has no content, and the name of this WSDL document contains the
string "[No WSDL Document]". The third aspect is the "Empty Content", which represents the overall number
of Web services that have WSDL links whose URL addresses
are valid but whose WSDL documents contain no content. In this case, a WSDL document whose name contains
the string "(BAD)" is created.
Figure 4-2: Statistic information for WSDL Document
43 Comparison of the Average Number of Service Properties
This section compares the average number of service properties in these five Web Service Registries. This
average number of service properties is calculated by means of the following equation:
ASP = ONSP / ONS (1)
Where
ASP is the average number of service properties for one Web Service Registry,
ONSP is the overall number of service properties in one Web Service Registry,
ONS is the overall number of Web services that have already been crawled from the corresponding Web
Service Registry.
Figure 4-3 shows the average number of service properties per Web service in these five Web Service
Registries. As already mentioned, one of the measurements for assessing the quality of the Web services in
a Web Service Registry is the service information: the more information about a Web service, the better you
know that service, and consequently the better the quality of the Web services that the corresponding Web
Service Registry can offer to its users. As seen in figure 4-3, the Service Repository and Biocatalogue Web
Service Registries own a larger number of service properties than the other three Web Service Registries.
This directly reflects that these two Web Service Registries provide more detailed information about the
Web services published in them, so that users can choose the services they need more easily, and they are
also more likely to use the Web services published in these two Web Service Registries. By contrast, the
Xmethods and Seekda
Web Service Registries, which have less service information about their Web services, offer less quality
for these Web services. Therefore, users may be less willing to use the Web services provided in these two
Web Service Registries, not to mention the Web services published in the Ebi Web Service Registry.
Figure 4-3: Average Number of Service Properties
From the description presented in section 323, the causes of the different numbers of service properties in
these Web Service Registries may consist of the following points. First, the number of structured
information elements for the Web services differs among these five Web Service Registries, and part of the
information for some Web services in a registry may even be missing or have an empty value. For example,
the amount of structured information that is supposed to be extracted for the Web services in the Ebi Web
Service Registry differs greatly from that in the Biocatalogue Web Service Registry. Secondly, there is a
certain amount of endpoint information in almost all Web Service Registries except the Ebi Web Service
Registry; this more or less reduces the overall number of service properties. Thirdly, some Web Service
Registries, such as Xmethods and Ebi, have no monitoring information, while in particular the Service
Repository Web Service Registry has a large amount of monitoring information of the Web services that can
be extracted from the Web. Obviously, the last point is the amount of whois information for these Web
services. If the database of the whois client does not contain information about the service domain of a
Web service, then no whois information can be extracted. Moreover, even if there is information about the
service domain, the amount of this information can be very diverse. Therefore, if many service domains of
the Web services in one registry have no or only little whois information, the average number of service
properties in that registry decreases greatly.
As a result, in order to help users get better acquainted with the Web services provided in a Web Service
Registry and to distinguish these Web services, the Web Service Registry should do its best to offer more
and more information about each of its published Web services.
(Figure 4-3, bar chart values: Service Repository 23, Ebi 7, Xmethods 17, Seekda 17, Biocatalogue 32
service properties on average.)
44 Different Outputs of Web Services
The fundamental task of this master program is to obtain the WSDL documents of the Web services hosted in
the Web Service Registries, to extract and gather the properties of these Web services, and thereafter to
store them on disk. Therefore, this section describes the different outputs of this master program, which
include the WSDL documents of the Web services, the generated XML and INI files, and the database records
of the service properties.
Figure 4-4: WSDL document format of one Web service
The WSDL document of a Web service is simply read from the Web according to the URL address of its WSDL
link, and these data are then stored on disk as the WSDL document. The name of the WSDL document is the
service name plus the ending ".wsdl", which marks it as a WSDL document. In order to distinguish WSDL
documents whose names would be the same although their contents differ, the name of each obtained WSDL
document in one Web Service Registry contains a unique integer in front of its name. Figure 4-4 shows the
valid WSDL document format of a Web service; its name is "1BLZService.wsdl".
Besides, the obtained service properties are transformed into an XML file, an INI file, and data records in
the database. Figure 4-5, figure 4-6, and figure 4-7 show these three different output formats
respectively. As can be seen from figure 4-5, this is the INI file of the Web service, named
"1BLZService.ini". The integer is the same as in the WSDL document, because both are materials belonging to
the same Web service. The first three lines in that INI file are service comments, which reach from the
semicolon to the end of the line; they are basic information describing this INI file. The following line
is the section, enclosed in a pair of brackets. It is important because it indicates that the lines behind
it contain the information of this Web service. Therefore, the rest of the lines are the actual service
information, given as key-value pairs with an equals sign between key and value. Each service property is
displayed from the beginning of the line.
Figure 4-5: INI file format of one Web service
Figure 4-6: XML file format of one Web service
Figure 4-7: Database data format for all Web services
Furthermore, figure 4-6 shows the generated XML file format of the Web service "BLZService"; its name is
"1BLZService.xml". Needless to say, this XML file is part of the materials of the same Web service. Though
the format of the XML file differs from that of the INI file, their essential contents are the same; that
is to say, the values of the service properties do not differ, because both files are generated from the
collection of properties of the same Web service. The XML file also has some comments like those in the INI
file, which are displayed between "<!--" and "-->", and the section in the INI file corresponds to the root
in the XML file. Therefore, all values of the elements within the root "service" in this XML file are the
values of the service properties of this Web service.
Eventually, as can be seen from figure 4-7, this is the database table used to store the service
information of all Web services in these five Web Service Registries. The entire service information of one
Web service forms a single record in this table. Because of that, the column names of that table should be
the union of the names of the service information in each Web Service Registry. However, since the column
names of the table have to be unique, the redundant names in this union must be eliminated. This is
sensible and possible because the names of the service information are well-defined and uniform for all
these five Web Service Registries. In addition, the first column of this table is the primary key, an
increasing integer; its function resembles that of the integer contained in the names of the XML and INI
files. The remaining columns of the table are the corresponding service properties of the Web services.
The value "NULL" or "Null" in this table indicates that this property of that Web service is empty or
missing.
45 Comparison of Average Time Cost for Different Parts of a Single Web Service
This section compares the average time cost of the different parts of getting one single Web service in all
these five Web Service Registries. First, the average time cost of getting one single service in a Web
Service Registry has to be calculated. It can be obtained through the following equation:
ATC = OTS / ONS (2)
Where
ATC is the average time cost for one single Web service,
OTS is the overall time cost of all the Web services in one Web Service Registry,
ONS is the overall number of Web services that have already been crawled from the corresponding Web
Service Registry.
In addition, the different parts of the average time cost of getting one single service consist of the
following six aspects: the average time cost for extracting the service properties, the average time cost
for obtaining the WSDL document, the average time cost for generating the XML file, the average time cost
for generating the INI file, the average time cost for inserting the service properties into the database
table, and the average time cost for some other procedures, such as getting the service list page link,
getting the service page link, and so on. The average time cost for extracting the service properties is
obtained by means of the following equation:
ATCSI = OTSSI / ONS (3)
Where
ATCSI is the average time cost for extracting the service properties of one single Web service,
OTSSI is the overall time cost for extracting the service properties of all the Web services in one Web
Service Registry,
ONS is the overall number of Web services that have already been crawled from the corresponding Web
Service Registry.
The calculation of the other parts is analogous to the equation for the average time cost for extracting
the service properties, while the average time cost for the other procedures equals the average time cost
for one single Web service minus the sum of the average time costs of the five other parts. For example,
for the Service Repository Web Service Registry in table 4-3 below, this remainder is 10042 - (8801 + 918 +
2 + 1 + 53) = 267 milliseconds.
Registry: Service Property | WSDL Document | XML File | INI File | Database | Others | Overall
Service Repository: 8801 | 918 | 2 | 1 | 53 | 267 | 10042
Ebi: 699 | 82 | 2 | 1 | 28 | 11 | 823
Xmethods: 5801 | 1168 | 2 | 1 | 45 | 12 | 7029
Seekda: 5186 | 1013 | 2 | 1 | 41 | 23 | 6266
Biocatalogue: 39533 | 762 | 2 | 1 | 66 | 1636 | 42000
Table 4-3: Average time cost information (in milliseconds) for all Web Service Registries
Table 4-3 displays the average time cost of one Web service and of its different parts in all these five
Web Service Registries. The first column of table 4-3 gives the names of these five Web Service Registries,
and the last column of this table is the average time cost of a single service in the respective Web
Service Registry, while the remaining columns of this table are the average time costs of the six different
parts. In order to give an intuitive view of the data in table 4-3, the data in each column of this table
are illustrated with the corresponding figures 4-8 to 4-13.
Figure 4-8: Average time cost for extracting the service properties in all Web Service Registries
As can be seen from figure 4-8, the average time cost for extracting the service properties in the
Biocatalogue Web Service Registry is 39533 milliseconds. This value is much larger than those of the other
four Web Service Registries, which are 8801, 699, 5801, and 5186 milliseconds for Service Repository, Ebi,
Xmethods, and Seekda respectively. That is to say, it takes much longer to extract the service properties
of the Web services published by the Biocatalogue Web Service Registry. In addition, this indirectly
indicates that the Biocatalogue Web Service Registry has the largest average number of service properties,
which has already been discussed in section 43. On the contrary, the average number of service properties
in the Ebi Web Service Registry is the smallest. Moreover, the average time cost in the Xmethods Web
Service Registry is larger than that in the Seekda Web Service Registry, although, as already known, the
average number of service properties is the same for these two Web Service Registries. One cause that might
explain why Xmethods costs more time than Seekda is that the process of extracting the service properties
in the Xmethods Web Service Registry has to be executed by means of both the service page and the service
list page, while only the service page link is needed for the Seekda Web Service Registry.
The average time cost for obtaining the WSDL document in all these five Web Service Registries is displayed
in figure 4-9. This average time cost is the sum of the average time for extracting the WSDL link of a Web
service and the average time for reading the data of the WSDL document from the Web and then storing it on
disk. As seen from this figure, the average time cost for obtaining the WSDL document in the Xmethods Web
Service Registry is the largest, namely 1168 milliseconds. Although the process of extracting the WSDL link
costs a certain amount of time, it does not have a significant influence on the total average time spent
for obtaining the WSDL document, because the WSDL link of a Web service is almost always gained in one
step. Therefore, this implies that the average size of the WSDL documents in the Xmethods Web Service
Registry is larger than in the other four Web Service Registries, in particular the Ebi Web Service
Registry, which takes just 82 milliseconds for the process of obtaining the WSDL document.
Figure 4-9: Average time cost for obtaining the WSDL document in all Web Service Registries
Figure 4-10, figure 4-11, and figure 4-12 show the average time cost of generating the three different
outputs in all these five Web Service Registries. As can be seen from figures 4-10 and 4-11, the average
time for generating the XML file of one Web service is the same for all these five Web Service Registries,
namely only 2 milliseconds; likewise, the average time for generating the INI file of one Web service is
the same everywhere, with a value of just 1 millisecond. Even the sum of these two average time costs is so
small that it can be omitted when comparing it to the overall average time cost of getting one Web service
in the corresponding Web Service Registry shown in figure 4-13. This implies that the process of generating
the XML and INI files finishes immediately after receiving the service properties of a Web service as
input. Furthermore, as can be seen from figure 4-12, although the average time cost of creating the
database record of a Web service is larger than the time for generating the XML and INI files in all these
five Web Service Registries, the operation of creating a database record is still fast.
Figure 4-10: Average time cost for generating the XML file in all Web Service Registries
Figure 4-11: Average time cost for generating the INI file in all Web Service Registries
Figure 4-12: Average time cost for creating the database record in all Web Service Registries
Figure 4-13: Average time cost for getting one Web service in all Web Service Registries
Figure 4-13 gives the average time cost of getting one single Web service in all these five Web Service
Registries. Without any doubt, the Biocatalogue Web Service Registry takes the longest time for this
process. This is because, as the presentation of the five different parts above shows, each part needs more
time in the Biocatalogue Web Service Registry, except for the process of obtaining the WSDL document, for
which Biocatalogue does not cost the most time. Moreover, an interesting observation arises when looking at
figures 4-8, 4-12, and 4-13: the shapes of these curves follow almost the same trend. This further
indicates that a Web Service Registry that spends more time getting the description information of a Web
service also offers more information about that Web service.
[Bar chart for figure 4-12; y-axis: time in milliseconds (0 to 70); values: Service Repository 53, Ebi 28, Xmethods 45, Seekda 41, Biocatalogue 66]
[Bar chart for figure 4-13; y-axis: time in milliseconds (0 to 45000); values: Service Repository 10042, Ebi 823, Xmethods 7029, Seekda 6266, Biocatalogue 42000]
5 Conclusion and Further Direction
This master thesis provides an approach that aims to explore the description information of the Web services hosted in different Web Service Registries. The description information of a Web service consists of its WSDL document and the service information about that Web service. Although there is an existing approach, the Pica-Pica Web Service Description Crawler, that can be used to obtain the WSDL documents and service information of the Web services in these Web Service Registries, its functionality is restricted: it explores only a small subset of the Web services hosted in these registries, its storage of the Web services is not flexible, and, most importantly, it extracts only a few pieces of service information per Web service; for some Web Service Registries it extracts none at all. The work presented in this master thesis, in contrast, is able to explore all Web services in these Web Service Registries. Moreover, with the approach presented here, the service information of a Web service is extracted as completely as possible, so that the final result is the largest annotated service catalogue ever produced. Furthermore, regarding the storage of the description information of the Web services, this master thesis provides three different methods that guarantee not only the completeness but also the longevity of the description information.
However, in the implementation performed in this master thesis, the whois client used for querying the information of a service domain returns free text if the information exists, and this free text sometimes differs completely from domain to domain. As a consequence, during the experiment stage every Web service in all Web Service Registries had to be crawled at least once, so that all variants of this free text could be foreseen and processed afterwards. This is a huge amount of work, because there are lots of Web services in these Web Service Registries. Therefore, in order to simplify this work, another whois client that eases this task needs to be found and used.
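Until such a client is found, a tolerant normalization layer can at least concentrate the free-text handling in one place. The following minimal Java sketch illustrates this idea with regular expressions; the field labels ("Registrant", "Creation Date" and their variants) are assumptions about common whois output formats, not an exhaustive list of the cases met in the experiments.

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WhoisNormalizer {

    // Whois servers label the same field differently; map several known
    // spellings (an assumption, not a complete list) to one uniform key.
    private static final Map<String, Pattern> FIELDS = new LinkedHashMap<>();
    static {
        FIELDS.put("registrant", Pattern.compile(
                "^(?:Registrant(?: Name)?|Holder|owner)\\s*:\\s*(.+)$",
                Pattern.MULTILINE | Pattern.CASE_INSENSITIVE));
        FIELDS.put("created", Pattern.compile(
                "^(?:Creation Date|created|Registered on)\\s*:\\s*(.+)$",
                Pattern.MULTILINE | Pattern.CASE_INSENSITIVE));
    }

    // Extracts the known fields from raw whois text; an unknown format
    // simply yields an empty map instead of breaking the crawl.
    public static Map<String, String> normalize(String rawWhoisText) {
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, Pattern> field : FIELDS.entrySet()) {
            Matcher m = field.getValue().matcher(rawWhoisText);
            if (m.find()) {
                result.put(field.getKey(), m.group(1).trim());
            }
        }
        return result;
    }
}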
Moreover, in the experiment stage of this master thesis, the time cost for getting a Web service is still large. For the purpose of reducing this time, multithreaded programming could be applied to some parts of the process of getting one Web service.
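A minimal sketch of this idea is shown below, assuming a hypothetical crawlService routine that performs the per-service work of obtaining the WSDL document and the service properties; it distributes the service page links of one registry over a fixed pool of worker threads.

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelCrawl {

    // Crawls every service page link with a fixed pool of worker threads.
    public static void crawlAll(List<String> servicePageLinks) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(8); // pool size is an assumption
        for (String link : servicePageLinks) {
            pool.submit(() -> crawlService(link));
        }
        pool.shutdown();                          // accept no new tasks
        pool.awaitTermination(1, TimeUnit.HOURS); // wait for the queued tasks to finish
    }

    private static void crawlService(String servicePageLink) {
        // Placeholder: obtain the WSDL document and service properties of one service.
    }
}

Note that the sleep time configuration of section 34 would presumably still have to be respected per registry, which limits the usable degree of parallelism on pages of the same Web Service Registry.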
Although the work performed here is specialized for these five Web Service Registries only, the main principles used are adaptable to other Web Service Registries with only small changes in the implementation code or its structure.
6 Bibliography
[1] Nathalie Steinmetz, Holger Lausen, Dario Cerizza, Andrea Turati, Irene Celino, Adam Funk and Michael Erdmann: "D1.1 - Requirement Analysis and Architectural Plan". Available from http://www.service-finder.eu/deliverables/public/7-public/32-d11, Emanuele Della Valle (CEFRIEL), June 27, 2008.
[2] Nathalie Steinmetz, Holger Lausen, Dario Cerizza, Andrea Turati, Irene Celino, Adam Funk and Michael Erdmann: "D1.2 - First Design of Service-Finder as a Whole". Available from http://www.service-finder.eu/deliverables/public/7-public/37-d12-first-design-of-service-finder-as-a-whole, Emanuele Della Valle (CEFRIEL), July 1, 2008.
[3] Nathalie Steinmetz, Holger Lausen, Irene Celino, Dario Cerizza, Saartje Brockmans, Adam Funk: "D1.3 - Revised Requirement Analysis and Architectural Plan". Available from http://www.service-finder.eu/deliverables/public/7-public/57-d13-revised-requirement-analysis-and-architectural-plan, Emanuele Della Valle (CEFRIEL), April 1, 2009.
[4] Chia-Hui Chang, Mohammed Kayed, Moheb Ramzy Girgis, Khaled Shaalan: "A Survey of Web Information Extraction Systems". IEEE Transactions on Knowledge and Data Engineering, Volume 18, Issue 10, IEEE Computer Society, pp. 1411-1428, October 2006.
[5] Leonard Richardson: "Beautiful Soup Documentation". Available from http://www.crummy.com/software/BeautifulSoup/documentation.html, October 13, 2008.
[6] Hao He, Hugo Haas, David Orchard: "Web Services Architecture Usage Scenarios". Available from http://www.w3.org/TR/ws-arch-scenarios/, February 11, 2004.
[7] Stephen Soderland: "Learning Information Extraction Rules for Semi-Structured and Free Text". Machine Learning, Volume 34, Issue 1-3, pp. 233-272, Department of Computer Science and Engineering, University of Washington, Seattle, February 1999.
[8] Ian Hickson: "A Vocabulary and Associated APIs for HTML and XHTML". World Wide Web Consortium, Working Draft WD-html5-20100624, January 22, 2008.
[9] Holger Lausen, Jos de Bruijn, Axel Polleres, Dieter Fensel: "The Web Service Modeling Language WSML". WSML Deliverable D16.1 v0.2, March 20, 2005. Available from http://www.wsmo.org/TR/d16/d16.1/v0.2/.
[10] Dumitru Roman, Holger Lausen, Uwe Keller: "Web Service Modeling Ontology - Standard (WSMO-Standard)". WSMO Deliverable D2, version 1.1, 06 March 2004. Available from http://www.wsmo.org/TR/d2/v1.1/.
[11] Iris Braun, Anja Strunk, Gergana Stoyanova, Bastian Buder: "ConQo - A Context- and QoS-Aware Service Discovery". TU Dresden, Department of Computer Science, in Proceedings of WWW/Internet 2008.
7 Appendixes
There are two additional outputs of this master program: a log information file and a statistic report file. Figure 8-1 shows one of the basic output formats of the log information for these five Web Service Registries.
Figure 8-1 Log information of the "Service Repository" Web Service Registry
Figure 8-2 Statistic information of the "Service Repository" Web Service Registry
Figures 8-2 to 8-6 show the output format of the statistic information for the "Service Repository", "Ebi", "Xmethods", "Seekda" and "Biocatalogue" Web Service Registries respectively.
Figure 8-3 Statistic information of the "Ebi" Web Service Registry
Figure 8-4 Statistic information of the "Xmethods" Web Service Registry
Figure 8-5 Statistic information of the "Seekda" Web Service Registry
Figure 8-6 Statistic information of the "Biocatalogue" Web Service Registry
Table of Figures
Figure 2-1 Dataflow of Service-Finder and Its Components ... 12
Figure 2-2 Left is the free text input type and right is its output ... 16
Figure 2-3 A semi-structured page containing data records (in rectangular box) to be extracted ... 16
Figure 2-4 Architecture of the Pica-Pica Web Service Description Crawler ... 20
Figure 3-1 Overview of the Basic Architecture for the Deep Web Services Crawler ... 25
Figure 3-2 Overview of the process flow of the Web Service Extractor Component ... 27
Figure 3-3 Service list page of the Service-Repository ... 29
Figure 3-4 Original source code of the internal link for Web service "BLZService" ... 29
Figure 3-5 Code overview of getting service page link in Service Repository ... 29
Figure 3-6 Service page of the Web service "BLZService" ... 29
Figure 3-7 Overview of the process flow of the WSDL Grabber Component ... 30
Figure 3-8 WSDL link of the Web service "BLZService" in the service page ... 31
Figure 3-9 Original source code of the WSDL link for Web service "BLZService" ... 32
Figure 3-10 Code overview of the "getServiceRepositoryWSDLLink" function ... 32
Figure 3-11 Code overview of the "oneParameter" function ... 32
Figure 3-12 Overview of the process flow of the Property Grabber Component ... 33
Figure 3-13 Structured properties of the service "BLZService" in the service list page ... 37
Figure 3-14 Structured properties of the service "BLZService" in the service page ... 38
Figure 3-15 Endpoint information of the Web service "BLZService" in the service page ... 38
Figure 3-16 Monitoring information of the service "BLZService" in the service page ... 39
Figure 3-17 Whois information of the service domain "thomas-bayer.com" ... 40
Figure 3-18 Overview of the process flow of the Storage Component ... 41
Figure 3-19 Implementation code for getting WSDL document ... 44
Figure 3-20 Implementation code for generating XML file ... 44
Figure 3-21 Implementation code for generating INI file ... 45
Figure 3-22 Implementation code for creating table in database ... 45
Figure 3-23 Implementation code for generating table records ... 46
Figure 4-1 Service amount statistic of these five Web Service Registries ... 49
Figure 4-2 Statistic information for WSDL Document ... 50
Figure 4-3 Average Number of Service Properties ... 51
Figure 4-4 WSDL Document format of one Web service ... 52
Figure 4-5 INI File format of one Web service ... 53
Figure 4-6 XML File format of one Web service ... 53
Figure 4-7 Database data format for all Web services ... 53
Figure 4-8 Average time cost for extracting service property in all Web Service Registries ... 55
Figure 4-9 Average time cost for obtaining WSDL document in all Web Service Registries ... 56
Figure 4-10 Average time cost for generating XML file in all Web Service Registries ... 57
Figure 4-11 Average time cost for generating INI file in all Web Service Registries ... 57
Figure 4-12 Average time cost for creating database record in all Web Service Registries ... 58
Figure 4-13 Average time cost for getting one Web service in all Web Service Registries ... 58
Table of Tables
Table 3-1 Structured Information of Service-Repository Web Service Registry ... 34
Table 3-2 Structured Information of Xmethods Web Service Registry ... 34
Table 3-3 Structured Information of Seekda Web Service Registry ... 34
Table 3-4 Structured Information of Ebi Web Service Registry ... 34
Table 3-5 Structured Information of Biocatalogue Web Service Registry ... 34
Table 3-6 SOAP Operation Information of Biocatalogue Web Service Registry ... 35
Table 3-7 REST Operation Information of Biocatalogue Web Service Registry ... 35
Table 3-8 Endpoint Information of these five Web Service Registries ... 35
Table 3-9 Monitoring Information of these five Web Service Registries ... 35
Table 3-10 Whois Information for these five Web Service Registries ... 36
Table 3-11 Extracted Structured Information of Web Service "BLZService" ... 38
Table 3-12 Extracted Endpoint Information of the Web service "BLZService" ... 39
Table 3-13 Extracted Monitoring Information of the Web service "BLZService" ... 39
Table 3-14 Extracted Whois Information of service domain "thomas-bayer.com" ... 40
Table 3-15 Sleep Time of these five Web Service Registries ... 47
Table 4-1 Service amount statistic of these five Web Service Registries ... 48
Table 4-2 Statistic information for WSDL Document ... 49
Table 4-3 Average time cost information for all Web Service Registries ... 55
Table of Abbreviations
DTD Document Type Definition
DWSC Deep Web Service Crawler
HTML HyperText Markup Language
MTBF Mean Time between Failures
MTTR Mean Time to Recovery
QoS Quality of Service
REST Representational State Transfer
RTT Round Trip Time
SOAP Simple Object Access Protocol
SQL Structured Query Language
URL Uniform Resource Locator
WHATWG Web Hypertext Application Technology Working Group
WSDL Web Service Description Language
WSML Web Service Modeling Language
WSMO Web Service Modeling Ontology
XML eXtensible Markup Language
312 System Requirements for DWSC 23
313 Non-Functional Requirements for DWSC 24
32 Deep Web Services Crawler Architecture 24
321 The Function of Web Service Extractor Component 26
3211 Features of the Web Service Extractor Component 28
3212 Input of the Web Service Extractor Component 28
3213 Output of the Web Service Extractor Component 28
3214 Demonstration for Web Service Extractor 29
322 The Function of WSDL Grabber Component 30
3221 Features of the WSDL Grabber Component 31
3222 Input of the WSDL Grabber Component 31
3223 Output of the WSDL Grabber Component 31
3224 Demonstration for WSDL Grabber Component 31
323 The Function of Property Grabber Component 33
3231 Features of the Property Grabber Component 36
3232 Input of the Property Grabber Component 37
3233 Output of the Property Grabber Component 37
3234 Demonstration for Property Grabber Component 37
324 The Function of Storage Component 40
3241 Features of the Storage Component 42
3242 Input of the Storage Component 43
3243 Output of the Storage Component 43
3244 Demonstration for Storage Component 43
33 Multithreaded Programming for DWSC 46
34 Sleep Time Configuration for Web Service Registries 46
4 Experimental Results and Analysis 48
41 Statistic Information for Different Web Service Registries 48
42 Statistic Information for WSDL Document 49
43 Comparison of Different Average Number of Service Properties 50
44 Different Outputs of Web Services 52
45 Comparison of Average Time Cost for Different Parts of Single Web Service 54
5 Conclusion and Further Direction 59
6 Bibliography 60
7 Appendixes 61
Table of Figures 64
Table of Tables 65
Table of Abbreviations 66
1 Introduction
This introductory chapter of the master thesis first concisely explains the background of the current situation and then gives a basic introduction of the proposed approach, which is called the Deep Web Service Crawler.
11 Background/Motivation
In the late 1990s, the Web Service Registry was a hot commodity. Formally, a Web Service Registry is defined as a links page: its function is to uniformly present information that comes from various sources. Hence it provides a convenient channel for users to offer, search and use Web services. The related metadata of the Web services, submitted by both the system and the users, are commonly hosted along with the service descriptions.
Nevertheless, when users enter one of these Web Service Registries to look for Web services, they may meet situations that cause them a lot of trouble. One such situation is that a Web Service Registry returns several similar published Web services after a search. For example, two or more Web services have the same name but different versions, or two or more Web services derive from the same server but have different contents. Furthermore, most users are also interested in a global view of the published services. For instance, they want to know which Web Service Registry can provide the better quality for a given Web service. Therefore, in order to help users to differentiate similar published Web services and to gain a global view of the Web services, this information should be monitored and rated.
Moreover, there are a great many Web Service Registries on the Internet, and each of them can provide a great number of Web services. Obviously, there may be similar Web services among these registries, or a Web service in one registry may be related to another Web service in other registries. Hence these Web services should be comparable across different Web Service Registries; however, at present there is not much support for this. In addition, not all of the metadata are structured, especially the descriptions of the non-functional properties. Therefore, what has to be done now is to turn those non-functional property descriptions into a structured format. Clearly speaking, as much information as possible about the Web services offered in the Web Service Registries needs to be extracted. Eventually, after extracting all the information from the Web Service Registries, it is necessary to store it on disk. This procedure should be efficient, flexible and complete.
12 Initial Designing of the Deep Web Service Crawler Approach
The problems have already been stated in the previous section; hence, the following work is to solve these problems. This section presents the basic principle of the Deep Web Service Crawler approach.
First comes a simple introduction of how the Deep Web Service Crawler approach addresses these problems. As already mentioned, each Web Service Registry offers Web services, and each Web Service Registry has its own HTML page structures. These structures may be the same or completely different from registry to registry. Therefore, the first thing to do is to identify which Web Service Registry is going to be explored. Since each Web Service Registry owns a unique URL, this job can be done by directly analyzing the corresponding URL address of that Web Service Registry. After identifying the Web Service Registry, the following step is to obtain all the Web services published in it. Then, with all these obtained Web services, it is time to extract, analyze and gather the information about the services. That information can be in a structured format or even in an unstructured format. In this master thesis, Deep Web analysis techniques are applied to obtain this information, so that the information about each Web service is annotated as completely as possible. Last but not least, all the information about the Web services needs to be stored.
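As an illustration of this identification step, the following minimal Java sketch derives the registry name from the host part of the URL; the host names follow the seed URLs of the five registries used in this thesis, and the registry-specific crawling routine behind each name is left out.

import java.net.URI;
import java.net.URISyntaxException;

public class RegistryDispatcher {

    // Identifies a Web Service Registry by the host part of its URL.
    public static String identifyRegistry(String registryUrl) throws URISyntaxException {
        String host = new URI(registryUrl).getHost().toLowerCase();
        if (host.contains("biocatalogue")) return "Biocatalogue";
        if (host.contains("ebi.ac.uk")) return "Ebi";
        if (host.contains("seekda")) return "Seekda";
        if (host.contains("service-repository")) return "Service-Repository";
        if (host.contains("xmethods")) return "Xmethods";
        throw new IllegalArgumentException("Unknown Web Service Registry: " + host);
    }

    public static void main(String[] args) throws URISyntaxException {
        System.out.println(identifyRegistry("http://www.xmethods.net")); // prints "Xmethods"
    }
}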
13 Goals of this Master Thesis
The following are the goals of this master thesis:
- Produce the largest annotated Service Catalogue
A Service Catalogue is a list of service properties. The more properties a service has, the larger the Service Catalogue it owns. Therefore, this master program should extract as many service properties as possible.
- Flexible storage of the metadata of each service as annotations or dedicated documents
The metadata of one service include not only the WSDL document but also the service properties. All these metadata are important information about the service. Therefore, this master program should provide flexible ways to store these metadata on disk.
- Improve the comparability of the Web services across different Web Service Registries
The names of the service properties in one Web Service Registry can differ from those in another. Hence, for the purpose of improving comparability, all these service property names should be unified and well-defined.
14 Outline of this Master Thesis
In this chapter the motivation, objective and initial approach plan have been discussed. The remainder of the thesis is structured as follows.
Chapter 2 presents work that is largely based on existing techniques. Section 21 gives a detailed introduction to the techniques of the Service-Finder project. Section 22 presents the Information Extraction technique. Then section 23 introduces and discusses the already implemented Pica-Pica Web Service Description Crawler.
Chapter 3 explains the design details of the Deep Web Service Crawler approach. Section 31 gives a short description of the different requirements of this approach. Next, section 32 presents the actual design of the Deep Web Service Crawler. Then sections 33 and 34 introduce the multithreaded programming and the sleep time configuration used in this master program, respectively.
Chapter 4 displays the experiments with the Deep Web Service Crawler approach and gives an evaluation of them.
Finally, chapter 5 presents the conclusion, a discussion of the work already done, and the future work for this master task.
2 State of the Art
This chapter presents existing techniques and strategies related to the work of the Deep Web Service Crawler approach. Section 21 talks about the existing catalogue of the Service-Finder project. Section 22 presents some details of the Information Extraction technique. Finally, section 23 explains the existing implemented crawler, the Pica-Pica Web Service Description Crawler.
21 Service Finder Project
The Service-Finder project aims at developing a platform for Web service discovery, especially for Web services that are embedded in a Web 2.0 environment [1]; hence it can provide efficient access to publicly available services. The goals of the Service-Finder project are as follows [1]:
- Automatically gather Web services and their related information
- Semi-automatically create semantic service descriptions based on the information available on the Web
- Create and improve semantic annotations via user feedback
- Describe the aggregated information in semantic models and allow reasoning and querying
However, before describing the basic functionality of the Service-Finder project, one of its use cases and the derived requirements are presented first.
211 Use Cases for Service-Finder Project
The Service-Finder project employed the use case methodology of the W3C use case description [6] for its needs and applied this methodology to the use cases it enumerated.
2111 Use Case Methodology
Three aspects need to be considered for the use case definitions of the Service-Finder project [1]:
(1) Description: describes the information of the use case.
(2) Actors, Roles and Goals: identifies the actors, the roles they act in, and the goals they need to achieve in the scenario.
(3) Storyboard: describes the series of interactions between the actors and the Service-Finder portal.
2112 System Administrator
This section presents the use case that was applied to the Service-Finder portal and that illustrates the requirements on its functionality from a user's point of view; all information in this use case is derived from [1]. In this use case there is a system administrator named "Sam Adams". He works for a bank, and his job is to keep the online payment facilities up and running all day and night. If there are any system failures, Sam Adams has to fix the problems as early as he can. That is why he wants to use an SMS messaging service which will alert him immediately by sending him an SMS message in the case of a system failure.
- Description
This use case describes the system administrator "Sam Adams", who is looking for an SMS messaging service that he wants to build into his application.
- Actors, Roles and Goals
The name of the actor is "Sam Adams". His role is that of a system administrator at a bank. His goals are immediate service delivery, reliability of the service, and a low base fee and transaction fee.
- Storyboard
Step 1: Sam Adams knows the Service-Finder portal and knows that he can find many useful services there, and he knows what he is looking for. Hence he visits the Service-Finder portal and starts to search by entering the keywords "SMS Alert".
Requirement 1: Search functionality.
Step 2: The Service-Finder portal returns a list of matching services. Sam wants to choose the number of matching services that are displayed on one page, and he also expects short information about the service functionality, the service provider and the service availability, so that he can decide which service to read about further.
Requirement 2: Enable configurable pagination of the matching results and show short information for each service.
Step 3: When Sam looks through the short information about the services displayed on the first page, he expects to find the most relevant services related to his request. After that, he would like to read more detailed information about a service to see whether it provides the needed functionality.
Requirement 3: Rank the returned matching services and provide the ability to read more details of a service.
Step 4: It may happen that the returned matching services provide quite different functionalities or belong to different service categories; for example, some SMS messaging services alert users not through SMS but through voice messaging. For this reason Sam would like to see other categories that may contain the services he wants, or services of other categories he is also interested in (like "SMS Messaging"). Another possible way is that Sam further refines his search by browsing through categories.
Requirement 4: Provide service categories and allow the user to see all services that belong to a specific category; if possible, also allow the user to browse through categories.
Step 5: When Sam has got all the services that could provide an SMS messaging service via the methods described in step 4, he wants to look for the services that are offered by an Austrian provider and, if possible, have no base fees.
Requirement 5: Faceted search.
Step 6: After Sam has got all these specific services, he would like to choose the services that provide a high reliability.
Requirement 6: Sort functionality based on the user's choices.
Step 7: Sam now expects to compare the service availability promised by the service provider with the actually provided availability; this should be contained in the services' details. There also needs to be service coverage information, so that Sam can know whether a service covers the areas where he lives and works. Moreover, Sam would like to compare these services in other ways, for instance by putting some services into a structured table to compare the transaction fees.
Requirement 7: A side-by-side comparison table for services and a functionality that enables the user to select the services he wants to compare.
Step 8: Finally, Sam wants to know whether the service providers offer a free trial of the services, so that he can test the service functionality.
Requirement 8: If possible, display a note for services offering free trials.
212 Architecture Plan for the Service-Finder Project
The architecture plan of the Service-Finder project contains five basic components: Service Crawler, Automatic Annotator, Conceptual Indexer and Matcher, Cluster Engine, and Service-Finder Portal Interface. Figure 2-1 presents a high-level overview of the components and the data flow among them.
Figure 2-1 Dataflow of Service-Finder and Its Components [3]
2121 The Principle of the Service Crawler Component
The Service Crawler is responsible for gathering available services and their related information from the Web. The overall cycle is as follows:
(1) A Web developer publishes a Web service.
(2) The crawling component harvests the Web in order to identify Web services, for example via WSDL (Web Service Description Language) documents.
(3) As soon as a service is discovered, the crawler also searches for other related information.
(4) After each periodic interval, the crawler produces a consistent snapshot of the relevant part of the Web.
Finally, the output of the crawler is forwarded to the subsequent components for analyzing, indexing and displaying.
2122 The Principle of the Automatic Annotator Component
The Automatic Annotator receives the relevant data from the previous component and generates semantic service descriptions of the WSDL documents and their related information, based on the Service-Finder ontology and the Service Category ontology.
First, the two compatible ontologies that are used throughout the whole process are briefly introduced [2]:
- Generic Service Ontology: an ontology that describes the data objects, for example the services, the service providers, availability, payment modalities, and so on.
- Service Category Ontology: an ontology that categorizes the functionalities or applications of the services, for instance data verification, messaging, data storage, weather, etc.
Afterwards, the function of this component with its input and output:
Input:
- Crawled data from the Service Crawler
- Service-Finder ontologies
- Feedback on or corrections of previous annotations
Function:
- Enrich the information about the service and extract semantic statements according to the Service-Finder ontologies, for example categorizing the service according to the Service Category ontology
- Determine whether a particular document is relevant or not through the Web link graph, and discard irrelevant documents
- Classify the pages into their genres, for instance pricing, user comments, FAQ, and so on
Output:
- Semantic annotations of the services
2123 The Principle of the Conceptual Indexer and Matcher Component
The Conceptual Indexer and Matcher is more like a data store center that aims at storing all extracted information about the services and supplying users with the capability of retrieval and semantic querying, for example the matchmaking between user requests and service offers, and the retrieval of user feedback on extracted annotations.
In addition, the function of this component and its input and output:
Input:
- Semantic annotation data and full-text information obtained from the Automatic Annotator
- Semantic annotation data and full-text information that come from the user interfaces
- Cluster data from the user and service clustering component
Function:
- Store the semantic annotations received from the Automatic Annotator component and from the user interface
- Store the cluster data procured through the clustering component
- Store and index the textual descriptions offered by the Automatic Annotator component and the textual comments offered by users
- Ontologically query the semantic data from the data store center
- Combined keyword and ontological querying for user queries
- Provide a list of similar services for a given service
Output:
- A list of matching services queried by users; in particular, these services should be sorted by ranking and should be iterable
- All available data related to a particular entity must be retrievable at the user interface
2124 The Principle of the Service-Finder Portal Interface Component
The Service-Finder Portal Interface is the main entry point for users of the Service-Finder system to search and browse the data managed by the Conceptual Indexer and Matcher component. In addition, users can contribute information by providing tags, comments, categorizations and ratings for the browsed data. Furthermore, developers can directly invoke the Service-Finder functionalities from their custom applications through an API.
The details of this component's function, input and output:
Input:
- A list of ordered services for a query
- Detailed information about a service or a set of services and a service provider
- Query access to the service category ontology and the most used tags provided by the users
- Service availability information
Function:
- The Web interface allows the users to search services by keyword, tag or concept in the categorization; to sort and filter query results by refining the query; to compare and bookmark services; and to try out the services that offer this functionality
- The API allows the developers to invoke Service-Finder functionalities
Output:
- Explicit user annotations such as tags, ratings, comments, descriptions and so on
- Implicit user data, for example click streams of users, bookmarks, comparisons, links sent, etc.
- Manual advertising of available new services
2125 The Principle of the Cluster Engine Component
The Cluster Engine is in charge of collecting and analyzing information about user behavior from the Service-Finder portal, e.g. the queried services and the compared services of the users. Moreover, it provides cluster data to the Conceptual Indexer and Matcher for providing service recommendations.
This component's function, input and output in detail:
Input:
- Service annotation data, both extracted and from user feedback
- Users' click streams, used for extracting user behavior
Function:
- Obtain user clusters from user behavior
- Obtain service clusters from service annotation data to enable finding similar services
Output:
- Clusters of users and services
22 Information Extraction
Due to the rapid development and use of the World Wide Web, a huge amount of information sources has been produced on the Internet, but access to them has been limited to browsing and searching because of the heterogeneity and the lack of structure of Web information sources. Therefore, Information Extraction, which transforms Web pages into program-friendly structures for post-processing, has become a great necessity. The task of Information Extraction is specified in terms of its inputs and its extraction targets, and the technique used in the process of Information Extraction is called the extractor.
221 Input Types of Information Extraction
Generally speaking, there are three different input types. The first input type is the unstructured document, for example the free text shown in figure 2-2. It is unstructured and written in natural language, so it requires substantial natural language processing. The second input type is the structured document, for instance XML documents, whose data can be described through an available DTD (Document Type Definition) or XML (eXtensible Markup Language) schema. The third input type is the semi-structured document, which is widespread on the Web, such as the large volume of HTML pages with tables, itemized lists and enumerated lists. This is because HTML tags are often used to render such embedded data in HTML pages; see figure 2-3.
Figure 2-2 Left is the free text input type and right is its output [4]
Figure 2-3 A semi-structured page containing data records (in rectangular box) to be extracted [4]
Inputs of the semi-structured type can thus be seen as documents with a fairly regular structure, and the data of these documents can be displayed in an HTML or non-HTML format. Besides, since the Web pages of the Deep Web are dynamic and generated from structured databases in terms of templates or layouts, the Deep Web can be considered one of the input sources that provide such semi-structured documents. For example, the author, price and comment sections of the book pages provided by Amazon all have the same layout, because these Web pages are generated from the same database and rendered with the same template or layout. Furthermore, manually generated HTML pages can also be of the semi-structured type: although the publication lists on different researchers' homepages are produced by diverse users, they all have a title and a source property for every single paper. Finally, the inputs for Information Extraction can also be pages of the same class, or pages from various Web Service Registries.
222 Extraction Targets of Information Extraction
Moreover, regarding the task of Information Extraction, the extraction target has to be considered. There are two different extraction targets. The first one is a relation of k-tuples, where k is the number of attributes in a record. In some cases an attribute of a record has no instantiation; in other cases the attribute owns multiple instantiations. The second extraction target is the complex object with hierarchically organized data. Though the ways of depicting extraction targets in a page are diverse, the most common structure is the hierarchical tree. This hierarchical tree may contain only one leaf node, or one or more lists of leaf nodes under internal nodes. The structure of a data object may also be flat or nested: if the structure is flat, there is only one leaf node, which can also be called the root; if it is a nested structure, the internal nodes involved in this data object span more than two levels.
Furthermore, in order to make the Web pages readable for human beings and easier to visualize, the tables, the tuples of the same list, and the elements of a tuple should be clearly isolated or demarcated. However, the display of a data object in a Web page is affected by the following conditions [4]:
- The attribute of a data object has zero or several values.
(1) If there is no value for the attribute of a data object, the attribute is called a "none" attribute. For example, a special offer that is only available for certain books might be a "none" attribute.
(2) If there is more than one value for the attribute of a data object, it is called a "multiValue" attribute. For instance, the name of the author of a book could be a "multiValue" attribute.
- The set of attributes (A1, A2, A3, ...) has multiple orderings.
That is to say, the position of an attribute may change between different instances of a data object. Such an attribute is called a "multiOrdering" attribute. For instance, for movies before the year 1999 a movie site may list the release date in front of the movie's title, while for movies from 1999 onwards it lists the release date behind the movie's title.
- The attribute has different formats.
This means that the display format of the data object can be completely distinct in different instances. If the format of an attribute is free, a lot of rules are needed to deal with all possible cases; such an attribute is called a "multiFormat" attribute. For example, an e-commerce Web site may use a bold font to present regular prices but a red color to display sale prices. There is also the opposite situation, in which different attributes of a data object have the same format; for example, various attributes may all be presented using <TD> tags in a table presentation. Attributes like these can be differentiated by means of their order information. However, in cases where a "none" attribute occurs or "multiOrdering" attributes exist, the rules for extracting these attributes have to be revised.
- The attribute cannot be decomposed.
For easier processing, input documents are sometimes treated as strings of tokens instead of strings of characters, but some attributes cannot be decomposed into several individual tokens. These attributes are called "untokenized" attributes. Examples are college course catalogue entries like "COMP4016" or "GEOL2001": the department code and the course number cannot be separated into two different strings of characters such as "COMP" and "4016", or "GEOL" and "2001".
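One simple way to accommodate the "none" and "multiValue" cases described above in a record model is to represent every attribute as a list of values, where an empty list encodes a missing attribute. The following Java sketch is only an illustration of this idea, not part of any system described in this chapter.

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// A k-tuple record whose attributes may have zero ("none") or many ("multiValue") values.
public class ExtractedRecord {

    private final Map<String, List<String>> attributes = new LinkedHashMap<>();

    public void addValue(String attribute, String value) {
        attributes.computeIfAbsent(attribute, k -> new ArrayList<>()).add(value);
    }

    // An empty list models a "none" attribute; several entries model a "multiValue" one.
    public List<String> valuesOf(String attribute) {
        return attributes.getOrDefault(attribute, List.of());
    }

    public static void main(String[] args) {
        ExtractedRecord book = new ExtractedRecord();
        book.addValue("author", "First Author");  // a "multiValue" attribute ...
        book.addValue("author", "Second Author"); // ... may hold several values
        System.out.println(book.valuesOf("author"));       // [First Author, Second Author]
        System.out.println(book.valuesOf("specialOffer")); // [] encodes a "none" attribute
    }
}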
223 The Used Techniques in Information Extraction
The extractor used in the process of Information Extraction aims at providing a single uniform query interface for accessing information sources such as database servers and Web servers. It consists of the following phases: collecting returned Web pages, labeling these Web pages, generalizing extraction rules, extracting the relevant data, and outputting the result in an appropriate format (XML format or a relational database) for further information integration. For example, the extractor first queries the Web server to gather the returned pages through the HTTP protocol; after that it extracts the contents of these HTML documents and integrates them with other data sources. The whole process of the extractor follows the steps below (a small sketch of step 2 is given after this list):
- Step 1
At the beginning, the input has to be tokenized. There are two different granularities for input string tokenization: tag-level encoding and word-level encoding. Tag-level encoding transforms the tags of an HTML page into general tokens, while transforming every text string between two tags into a special token. Word-level encoding does this in another way: it treats each word in a document as a token.
- Step 2
Next, the extraction rules are applied for every attribute of the data object in the Web pages. These extraction rules can be induced in terms of top-down or bottom-up generalization, pattern mining, or logic programming. In addition, the type of the extraction rules may be indicated by means of regular grammars or logic rules. For example, some use path expressions on the HTML parse tree, like html.head.title or html->table[0]; some use syntactic or semantic constraints; and some use delimiter-based constraints such as HTML tags or literal words.
- Step 3
After that, all the extracted data are assembled into records.
- Step 4
Finally, this process is iterated until all the data objects in the input are processed.
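To make step 2 concrete, the following Java sketch applies a simple delimiter-based extraction rule, expressed as a regular expression over the raw HTML, to pull one attribute out of a page. The HTML snippet and the rule are invented for illustration; real extraction rules would be induced as described above.

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DelimiterRule {

    // Delimiter-based rule: the attribute value sits between two literal HTML delimiters.
    private static final Pattern PRICE_RULE =
            Pattern.compile("<td class=\"price\">\\s*(.*?)\\s*</td>");

    public static List<String> extract(String html) {
        List<String> values = new ArrayList<>();
        Matcher m = PRICE_RULE.matcher(html);
        while (m.find()) {
            values.add(m.group(1)); // the text between the two delimiters
        }
        return values;
    }

    public static void main(String[] args) {
        String page = "<tr><td class=\"price\">19.99</td></tr>"
                    + "<tr><td class=\"price\">5.00</td></tr>";
        System.out.println(extract(page)); // [19.99, 5.00]
    }
}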
23 Pica-Pica Web Service Description Crawler
Pica pica is actually the name of a bird species, also known as the magpie. Here, however, Pica-Pica is a Web Service Description Crawler that was designed to investigate the quality of Web services, for example to evaluate the descriptive quality of the offered Web services and how well these Web services are described in today's Web Service Registries.
231 Needed Libraries of the Pica-Pica Web Service Description Crawler
This version of the Pica-Pica Web Service Description Crawler was written by Anton Caceres and Josef Spillner and is programmed in the Python language. In order to run these scripts to parse the HTML pages, two additional libraries are needed: Beautiful Soup and html5lib.
- Beautiful Soup
It is an HTML/XML parser for the Python language that can even turn invalid markup into a parse tree [5]. Moreover, the following three features make it especially powerful:
  - Bad markup does not choke Beautiful Soup; it generates a parse tree that makes approximately as much sense as the original document, so the desired data can still be obtained.
  - Beautiful Soup provides a toolkit with simple, idiomatic methods for navigating, searching and modifying the parse tree, so no custom parser has to be created for every application.
  - Beautiful Soup converts documents from Unicode to UTF-8 automatically; if the document has already specified an encoding, it can simply be ignored, otherwise the encoding of the original document has to be specified.
Furthermore, the ways of including Beautiful Soup in an application are as follows [5]:
  - from BeautifulSoup import BeautifulSoup (for processing HTML)
  - from BeautifulSoup import BeautifulStoneSoup (for processing XML)
  - import BeautifulSoup (to get everything)
- html5lib
It is a Python package which implements the HTML5 [8] parsing algorithm. In order to gain maximum compatibility with the current major desktop Web browsers, this implementation is based on the WHATWG (Web Hypertext Application Technology Working Group) HTML5 specification.
232 Architecture of the Pica-Pica Web Service Description Crawler
Figure 2-4 Architecture of the Pica-Pica Web Service Description Crawler
Figure 2-4 illustrates the primary architecture of the Pica-Pica Web Service Description Crawler. It includes four fundamental components: the Service Page Grabber component, the WSDL Grabber component, the Property Grabber component, and the WSML Register component.
(1) The Service Page Grabber component takes the URL seed as input and outputs the links of the service pages to the following two components: the WSDL Grabber component and the Property Grabber component.
(2) The WSDL Grabber component is responsible for obtaining the WSDL document via the delivered service page link and then checking the validity of the obtained WSDL document. Only valid WSDL documents are passed on to the WSML Register component for further processing.
(3) The Property Grabber component tries to extract the service properties hosted in the service page, if any exist. All these service properties are then saved into an INI file as the information of that service.
(4) The WSML Register component writes appropriate WSML documents by means of the valid WSDL documents delivered by the WSDL Grabber component and the optional INI files delivered by the Property Grabber component. Afterwards it registers them in ConQo.
- WSML [9]
WSML stands for Web Service Modeling Language. It provides a framework with different language variants and is therefore often used to describe the different aspects of semantic Web services according to the conceptual model of WSMO.
- WSMO [10]
WSMO, whose full name is Web Service Modeling Ontology, is dedicated to describing the various aspects of Semantic Web Services based on ontologies. An ontology is a formal, explicit specification of a shared conceptualization. These ontologies are the sticking points that link the agreement of the communities of users with the defined conceptual semantics of the real world.
- ConQo [11]
ConQo is a discovery framework that considers not only the Quality of Service (QoS) but also context information. It uses a Web service repository to manage service descriptions based on WSML.
233 Implementation of the Pica-Pica Web Service Description Crawler
This section describes the implementation of the Pica-Pica Web Service Description Crawler in detail.
(1) To start the whole crawling process, the Pica-Pica Web Service Description Crawler needs an input as the initial seed. In this crawler there are five Web Service Registries, listed below; the URL addresses of these five Web Service Registries are used as the input seeds. Moreover, in this version of the Pica-Pica Web Service Description Crawler there is a single Python script for each Web Service Registry, and the crawling scripts for these Web Service Registries are executed one after another.
Biocatalogue: http://www.biocatalogue.com
Ebi: http://www.ebi.ac.uk
Seekda: http://www.seekda.com
Service-Repository: http://www.service-repository.com
Xmethods: http://www.xmethods.net
(2) After being fed with the input seed, the crawler steps into the next component, the Service Page Grabber. At first this component reads the data from the Web based on the input seed. Then it establishes a parse tree of the read data by means of the functions of the Beautiful Soup library. After that, the Service Page Grabber component starts to look for the service page link of each service published in the Web Service Registry by means of the functions of the html5lib library. When the service page link of a single service is found, the component first checks whether this service page link is valid. Once the service page link is valid, it is passed to the following two components for further processing: the WSDL Grabber component and the Property Grabber component.
(3) When the WSDL Grabber component receives a service page link from its previous component, it sets out to extract the WSDL link address of that service from the parse tree of the data in the service page. Next, this component downloads the WSDL document of the service from the WSDL link address, and the obtained WSDL document is stored on disk. The process of the WSDL Grabber component is carried on until no more service links are passed to it. Certainly not all grabbed WSDL documents are usable: they may contain bad definitions or a bad namespace URI, be empty documents, or, what is worse, not even be of XML format. Hence, in order to pick these out, this component further analyzes the involved WSDL documents. All valid documents are put into a "validWSDLs" folder, whereas the invalid documents are put into a folder named "invalidWSDLs" in order to gather statistic information. Finally, only the WSDL documents in the "validWSDLs" folder are passed to the subsequent component. A minimal sketch of such a validity check is shown below.
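The Pica-Pica scripts themselves are written in Python, but the idea of such a validity check can be illustrated with the following Java sketch: parse the file as XML and verify that the root is a WSDL 1.1 definitions element with a target namespace. This is an assumption about what a reasonable check looks like, not a reproduction of the exact checks performed by Pica-Pica.

import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class WsdlValidityCheck {

    // Returns true if the file is well-formed XML with a WSDL 1.1 definitions root element.
    public static boolean isProbablyValidWsdl(File wsdlFile) {
        if (wsdlFile.length() == 0) {
            return false; // an empty document is invalid by definition
        }
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder().parse(wsdlFile); // fails if not XML
            Element root = doc.getDocumentElement();
            return "definitions".equals(root.getLocalName())
                    && "http://schemas.xmlsoap.org/wsdl/".equals(root.getNamespaceURI())
                    && !root.getAttribute("targetNamespace").isEmpty();
        } catch (Exception e) {
            return false; // not parseable as XML, hence invalid
        }
    }
}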
(4) Since some Web Service Registries give additional information about the services, such as availability, service provider, or version, the Property Grabber component sets out to extract this information as the service's properties and thereafter saves these properties into an INI file. If no additional information is available, there is no need to extract service properties, and no INI file is created for that service. However, in this implementation of the Pica-Pica Web Service Description Crawler only the Python scripts for the Seekda and Service-Repository Web Service Registries have functions to extract the services' properties, while the scripts for the other three Web Service Registries lack such functions.
(5) Furthermore, it is optional to create a report file which contains the statistic information of this process, such as the total number of services of one Web Service Registry, the number of services whose WSDL document is invalid, etc.
(6) As stated above, at this point there is a folder with all valid WSDL documents, and possibly some INI files. The task of the WSML Register component is therefore to generate the appropriate WSML documents from these valid WSDL documents and INI files, and then to register them in ConQo.
24 Conclusions of the Existing Strategies
This chapter has presented three existing strategies: the Service-Finder project, the Information Extraction technique, and the Pica-Pica Web Service Description Crawler.
The task of this master program is to obtain the available Web services and their related information from the Web. This is a procedure of extracting the needed information about a service, such as the service's WSDL document and its properties. Therefore, the Information Extraction technique, which is used to extract information hosted on the Web, can be used in this master program.
Moreover, Service-Finder is a large project that is not only able to obtain the available Web services and their related information from the Web, but also enriches the crawled data with annotations by means of the Service-Finder ontology and the Service Category ontology and integrates all the information into a coherent semantic model. Furthermore, it provides capabilities for searching and browsing the data through a user interface and gives users service recommendations. However, for a master program the Service-Finder project far exceeds the requirements; therefore it is considered only as a reference for this master program.
Furthermore, since the Pica-Pica Web Service Description Crawler aims only at obtaining the available Web services and their related information, it fulfills the primary task of this master program. Nevertheless, regarding the information about a service, the Pica-Pica Web Service Description Crawler extracts only a few properties, and sometimes even none. Consequently, in order to improve the quality of the service information, as many properties about each service as possible have to be extracted. Hence, chapter 3 presents an extension of the Pica-Pica Web Service Description Crawler.
3 Design and Implementation
In the previous chapter on the state of the art, the already existing techniques and implementations were presented. In the following, the basic principle of the proposed approach, the Deep Web Services Crawler, is introduced; it is based on these existing techniques, especially the Pica-Pica Web Service Description Crawler.
3.1 Deep Web Services Crawler Requirements
This section describes the goals of the Deep Web Service Crawler approach, the system requirements with respect to the approach, and some non-functional requirements.
3.1.1 Basic Requirements for DWSC
The following are the basic requirements that should be achieved:
(1) Produce the largest annotated Service Catalogue. Here, the Service Catalogue is essentially a list of Web services published in the Web Service Registries. It contains not only the WSDL document of each service but also a file with information about that service. Therefore, for the purpose of producing the largest annotated Service Catalogue, the proposed approach needs to extract as many properties about those Web services as possible. Moreover, it also has to download the WSDL document hosted along with each Web service. That is to say, these properties are not only the interesting structured properties, such as the service name and its WSDL link address, but also other non-functional properties, for example the endpoint and its monitoring information.
(2) Flexible storage: each service has a Service Catalogue entry that contains all its interesting properties. How to deal with these service properties, that is, which schemes will be used to store them, is a major question. Hence, in order to store them in a flexible way, the proposed approach provides three storage methods: the first stores them as an XML file, the second stores them in an INI file, and the third uses a database.
3.1.2 System Requirements for DWSC
Generally speaking, the requirements for realizing a programming project comprise the following:
1) Operating system: Linux/Unix, Windows (XP, Vista, 2000, etc.)
2) Programming language: C++, Java, Python, C, etc.
3) Programming tool: NetBeans, Eclipse, Visual Studio, and so on
In this master thesis, the scripts of the Deep Web Service Crawler approach are written in the Java programming language. Besides, these scripts have only been tested on the Windows XP and Linux operating systems, not on other operating systems.
3.1.3 Non-Functional Requirements for DWSC
In this part, several non-functional requirements for the Deep Web Service Crawler approach are presented:
1) Transparency: the process of data exploration and data storage should be done automatically, without the user's intervention. However, at the start the user should specify the path on the hard disk that will be used to store the outputs of this program.
2) Fault tolerance: during the execution of this program, errors will inevitably happen. Therefore, in order to keep the process from being interrupted, the necessary error handling for process recovery must be in place.
3) Completeness: this approach should extract as many of the interesting properties of each Web service as possible, e.g., endpoint and monitoring information.
In addition, since the Pica-Pica Web Service Crawler has already implemented strategies for the following five URLs, the proposed approach must support no fewer than these five Web Service Registries:
Biocatalogue: http://www.biocatalogue.com
Ebi: http://www.ebi.ac.uk
Seekda: http://www.seekda.com
Service-Repository: http://www.service-repository.com
Xmethods: http://www.xmethods.net
3.2 Deep Web Services Crawler Architecture
In this section, an overview of the high-level architecture of the Deep Web Services Crawler approach is introduced first. Thereafter, four subsections outline each single component and describe how the components play together.
The current components and flows of data in the Deep Web Service Crawler can be summarized as depicted in Figure 3-1, using the continuous arrows. The crawler first tries to obtain the available service list page links and related service page links by crawling the Web (Web Service Extractor). Then those gathered links are processed in two separate steps: one gets the service's WSDL document (WSDL Grabber), and the other collects the properties of each service (Property Grabber). Finally, all these data are stored on the storage device (Storage). The whole detailed process in figure 3-1 is illustrated as follows:
Ø Step 1
When the Deep Web Service Crawler starts to run, the File Chooser container requires the user to specify a path on the computer or on any other hard disk. The reason for specifying the path is that the Deep Web Service Crawler program needs a place to store all of its outputs.
Ø Step 2
After that, the Web Service Extractor is triggered. It is the main entry to the specific crawling process. Since the Deep Web Service Crawler program is a procedure that is supposed to crawl for Web services in given Web Service Registries, the URL addresses of
these Web Service Registries should be given as an initial seed for this Web Service Extractor process. However, since the page structures of these Web Service Registries are completely different, there is a registry-dependent process for each Web Service Registry.
Figure 3-1: Overview of the Basic Architecture of the Deep Web Services Crawler
Ø Step 3
According to the given seed, two types of links are obtained by the Web Service Extractor component: one is the service list page link, and the other is the service page link. A service list page is a page that contains a list of Web services and possibly some information about these Web services, while a service page is a page that contains much more information about a single service. Finally, the component forwards these two types of links to the next two components, the Property Grabber and the WSDL Grabber.
Ø Step 4
On the one hand, the Property Grabber component tries to gather the information about the service that is hosted in the service list page and the service page, such as the name of the service, its description, its rating, etc. Finally, all the information about the service is collected together as the service properties, which are then delivered to the Storage component for further processing.
Ø Step 5
On the other hand, the WSDL Grabber tries to obtain the WSDL link from the service list page or the service page. This is because for some Web Service Registries the WSDL link is hosted in
the service list page, as in Biocatalogue, while for other Web Service Registries it is hosted in the service page, as in Xmethods. After the WSDL link is obtained, it is likewise transmitted to the Storage component for further processing.
Ø Step 6
When the service properties and the WSDL link of a service are received by the Storage component, it stores them on disk. The service properties are stored on disk in three different ways: as an XML file, as an INI file, or as one record inside a database table. For the WSDL link, the Storage component first tries to download the page content according to the URL address of the WSDL link. If this succeeds, the page content of the service is stored as a WSDL document on disk.
Ø Step 7
Nevertheless, this describes just a single service crawling process, from step 3 to step 6. Hence, if there is more than one service, or more than one service list page, in those Web Service Registries, the crawling process from step 3 to step 6 is repeated again and again until no service or service list page is left in the Web Service Registries.
Ø Step 8
Furthermore, after the crawling process for one Web Service Registry finishes, a file is generated that contains some statistical information about this crawling process, for example: the times when the crawling process of this Web Service Registry started and finished, the total number of Web services in this Web Service Registry, how many services have an empty WSDL document, the average number of service properties in this Web Service Registry, and the average time cost for extracting service properties, getting the WSDL document, and generating the XML file, INI file, etc.
3.2.1 The Function of the Web Service Extractor Component
The Web Service Extractor component is responsible for gathering service list page links and service page links. It pursues a focused crawl of the Web and forwards only service list page and service page links to subsequent components for analyzing, collecting, and gathering purposes. Therefore, it identifies both service list page links and the related service page links on these Web Service Registries.
As can be seen from figure 3-2, a crawl for Web services needs to start from a URL seed. The seed is almost as important as the Web Service Extractor itself, as it strongly influences the part of the Web that needs to be crawled. The seed can, or shall, contain, e.g., Web pages where these Web services are published or that talk about Web services.
Figure 3-2: Overview of the process flow of the Web Service Extractor Component
After being fed with the URL seed, the Web Service Extractor component starts to get the link of the service list page from the initial page of this URL seed. However, this process is different for each of the five Web Service Registries. The following shows the different situations in these Web Service Registries:
Ø Service-Repository Web Service Registry
In this Web Service Registry, the link of the first service list page is the URL address of its seed, which means some Web services can be found on the home page of the Service-Repository Web Service Registry. The word "first" here implies that there is more than one service list page link in this Web Service Registry. Therefore, the process of getting service list page links in this registry is carried on continually until no more service list page links exist.
Ø Xmethods Web Service Registry
Although there are Web services on the home page of the Xmethods Web Service Registry, they are only a small subset of the services in this Web Service Registry. Moreover, in the Xmethods Web Service Registry there is only one page containing all Web services. Therefore, the crawler has to obtain the service list page link of that page.
Ø Ebi Web Service Registry
The situation in the Ebi Web Service Registry is somewhat like that in the Xmethods Web Service Registry. That is to say, there is also one page that contains all Web services in this Web Service Registry. However, this page is not the initial page of the input seed. Therefore, more than one operation step is needed to get the service list page link of that page.
Ø Seekda Web Service Registry
In the Seekda Web Service Registry, the Web services are not contained in the initial page of the input seed. The service list page link can be obtained after several additional operation steps.
However, there is a problem with getting the service list page links in this registry. Simply put, if there is more than one page containing Web services, then for some unknown reason the crawler cannot get the links of the remaining service list pages. In other words, it can only get the link of the first service list page.
Ø Biocatalogue Web Service Registry
The process of getting service list pages in the Biocatalogue Web Service Registry is almost the same as in the Seekda Web Service Registry. The only difference is that in the Biocatalogue Web Service Registry all service list page links can be obtained if there is more than one service list page.
After getting the link of a service list page, the Web Service Extractor begins to get the link of the service page for each service listed in the service list page. This is possible because there is an internal link for every service that leads to its service page. It is worth noting that once a service page link is obtained, this component immediately forwards the service page link and the service list page link to the subsequent two components for further processing. The process of obtaining service page links is carried out continuously until all services listed in that service list page are crawled. Analogously, the process of getting service list pages is carried out continuously until no more service list pages exist.
3.2.1.1 Features of the Web Service Extractor Component
The main features are described in the following paragraphs.
1) Obtain service list page links
A central task is to obtain the corresponding service list page links. A service list page link is a URL address that leads to a public list of Web services together with brief information about these Web services, such as the name of each service, an internal URL that links to another page containing detailed information about that service, and sometimes the link address of the WSDL document.
2) Obtain service page links
Once service list page links are found, a crucial aspect is to extract the internal link of each service to aid the task of service information discovery. Therefore, it is the task of the Web Service Extractor to harvest the HTML page content of the service list page, so that the service page links, which lead to much more detailed information about the single Web service, can be obtained.
3.2.1.2 Input of the Web Service Extractor Component
This component depends on specific input seeds: the only input required for this component is a URL seed. This URL seed is one of the URLs listed in section 3.1.3.
3.2.1.3 Output of the Web Service Extractor Component
The component produces two kinds of service-related page links from the Web:
• Service list page links
• Service page links
3.2.1.4 Demonstration of the Web Service Extractor
In order to give a comprehensive understanding of the process of the Web Service Extractor component, the following figures serve as explanation. Though there are five URL addresses in this section, only the URL of the Service-Repository is shown as an example.
1) The input seed is the initial URL address of the Service-Repository, which is "http://www.service-repository.com".
2) As already stated in section 3.2.1, the first service list link of this Web Service Registry is its input seed "http://www.service-repository.com". Figure 3-3 shows the corresponding service list page of that link.
Figure 3-3: Service list page of the Service-Repository
Figure 3-4: Original source code of the internal link for the Web service "BLZService"
Figure 3-5: Code overview of getting the service page link in the Service-Repository
Figure 3-6: Service page of the Web service "BLZService"
3) Now that the service list page link is known, the next step is to acquire the service page link of each service listed in the service list page. The text in the red box of figure 3-4 shows the internal link of the Web service "BLZService". However, this is not the complete link of the service page: it has to be prefixed with the initial URL address of the Service-Repository Web Service Registry, "http://www.service-repository.com". The code for getting the service page link of a Web service in the Service-Repository is shown in figure 3-5. Therefore, the final link of this service page is "http://www.service-repository.com/service/overview/-210897616". Figure 3-6 is the corresponding service page of that link.
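The original code for this step is only available as the figure 3-5 image. Purely as an illustration of the described prefixing logic, a sketch using the jsoup HTML parser might look as follows; the CSS selector, the library choice, and all names are assumptions rather than the actual implementation:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import java.util.ArrayList;
    import java.util.List;

    public class ServicePageLinkExtractor {

        static final String BASE_URL = "http://www.service-repository.com";

        // Collects the absolute service page links from one service list page.
        static List<String> getServicePageLinks(String serviceListPageUrl) throws Exception {
            Document doc = Jsoup.connect(serviceListPageUrl).get();
            List<String> links = new ArrayList<>();
            // Hypothetical selector: anchors pointing to a service overview page.
            for (Element a : doc.select("a[href^=/service/overview]")) {
                // The internal link is relative, so prefix it with the base URL.
                links.add(BASE_URL + a.attr("href"));
            }
            return links;
        }
    }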
4) Afterwards, the two links gathered by the Web Service Extractor component, the service list page link and the service page link, are immediately forwarded to the next two components, the WSDL Grabber component and the Property Grabber component.
3.2.2 The Function of the WSDL Grabber Component
The WSDL Grabber component acquires the WSDL link hosted on the Web, based on the service list page link or the service page link. The whole process flow is illustrated in figure 3-7.
Figure 3-7: Overview of the process flow of the WSDL Grabber Component
When the WSDL Grabber component receives the inputs delivered by the previous component, it starts to get the WSDL link of the service based on these inputs. However, although the inputs of the WSDL Grabber component are the links of the service page and the service list page, only one of them contains the WSDL link of the corresponding service. That is to say, the WSDL link exists either in the service page or in the service list page. The reason why both links need to be delivered to this component is that exactly one of the five Web Service Registries, namely Biocatalogue, hosts the WSDL link in the service list page, while for the other four Web Service Registries the WSDL link is hosted in the service page. Therefore, the WSDL links of these four Web Service Registries are obtained via the service page link, and for the Biocatalogue Web Service Registry the WSDL link is obtained through the service list page link. However, there is a problem with getting the WSDL links of the Biocatalogue Web Service Registry. In brief, some of the Web services listed in
the service list page of the Biocatalogue Web Service Registry have no WSDL link; in other words, these services have no WSDL document. In a situation like this, the WSDL link of such a Web service is assigned the value "NULL". Nevertheless, for the Web services of the other four Web Service Registries, the WSDL link always exists in the service page. Eventually, whenever the WSDL Grabber component extracts the WSDL link of a single Web service, the link is immediately forwarded to the Storage component so that the WSDL document can be downloaded at once.
3.2.2.1 Features of the WSDL Grabber Component
The WSDL Grabber component has to provide the following features:
• Obtain WSDL links
The WSDL link is the direct way to get to the page that contains the contents of the WSDL document. It is actually a URL address, but the end of this URL address contains something like "wsdl" or "WSDL", indicating that it is an address that leads to the page of a WSDL document.
3.2.2.2 Input of the WSDL Grabber Component
This component requires the following input data:
• Service list page link
• Service page link
3.2.2.3 Output of the WSDL Grabber Component
The component produces only the following output data:
• The URL address of the WSDL link for each service
3.2.2.4 Demonstration of the WSDL Grabber Component
This section demonstrates a list of figures in order to give a comprehensive understanding of the process of the WSDL Grabber component. It uses the same Web service of the Service-Repository as an example.
1) The input for this WSDL Grabber component is the link of the service page obtained by the Web Service Extractor component. The address of this link is "http://www.service-repository.com/service/overview/-210897616".
Figure 3-8: WSDL link of the Web service "BLZService" in the service page
2) Figure 3-9 shows the corresponding original source code of the WSDL link highlighted in figure 3-8.
Figure 3-9: Original source code of the WSDL link for the Web service "BLZService"
3) Figure 3-10 and figure 3-11 show the code used to extract the WSDL link displayed in figure 3-9. However, figure 3-10 shows the particular code for the Service-Repository Web Service Registry only; for the other four Web Service Registries it differs. The function "getServiceRepositoryWSDLLink" first gets a list of nodes that have the same HTML tag name "b". Then it checks all of the nodes one by one to see whether the text value of a node is "WSDL". If a node fulfills this condition, the attribute value of its sibling, which is an "a" element here, is extracted as the value of the WSDL link of this Web service.
Figure 3-10: Code overview of the "getServiceRepositoryWSDLLink" function
Figure 3-11: Code overview of the "oneParameter" function
4) Finally, after applying the functions shown in figures 3-10 and 3-11, the WSDL link of the Web service "BLZService" can be obtained. Its value is "http://services.unitedplanet.de/blz/BlzService.asmx?WSDL".
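Since figures 3-10 and 3-11 are only available as images, the following jsoup-based sketch merely mirrors the logic described above (scan all "b" nodes, and where the text is "WSDL", read the href of the neighbouring "a" element); the library choice and all names are illustrative assumptions:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class WsdlLinkExtractor {

        // Mirrors the described logic of "getServiceRepositoryWSDLLink":
        // find <b> nodes whose text is "WSDL" and take the href of the
        // adjacent <a> element as the WSDL link.
        static String getServiceRepositoryWsdlLink(String servicePageUrl) throws Exception {
            Document doc = Jsoup.connect(servicePageUrl).get();
            for (Element b : doc.select("b")) {
                if ("WSDL".equals(b.text().trim())) {
                    Element sibling = b.nextElementSibling();
                    if (sibling != null && "a".equals(sibling.tagName())) {
                        return sibling.attr("abs:href");  // absolute URL of the WSDL link
                    }
                }
            }
            return null;  // treated as "no WSDL link" by later components
        }
    }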
3.2.3 The Function of the Property Grabber Component
The Property Grabber component is a module used to extract and gather all the Web service information hosted on the Web, namely the information shown in the service list page and the service page. In the end, all the obtained Web service information is collected together as the service properties, which are delivered to the Storage component for storing. The detailed process flow of the Property Grabber component is illustrated in figure 3-12.
The inputs of the Property Grabber component are the same as the inputs of the WSDL Grabber component. However, there is still a small difference between them with respect to the seed. As already mentioned in section 3.2.2, for the WSDL Grabber component only one of the inputs is sufficient to get the WSDL link, whereas the Property Grabber component needs both of them as seeds. Once the Property Grabber component receives the needed inputs, it starts to extract the information of the single Web service.
Figure 3-12: Overview of the process flow of the Property Grabber Component
After the Property Grabber component receives the inputs, it starts to extract the service information of the Web service. Generally speaking, the service information consists of four aspects: structured information, endpoint information, monitoring information, and whois information.
(1) Structured Information
The structured information can be obtained by extracting the information hosted in the service page and the service list page. It is the basic descriptive information about the service, such as the name of the service, the URL address through which the WSDL document can be obtained, the description introducing the service, the provider of the service, its rating, and the server that hosts this service, etc. However, the elements constituting this
structured information differ among the Web Service Registries. For example, the rating information of a Web service exists in the Service-Repository Web Service Registry, while the Xmethods Web Service Registry does not have this information. In addition, even for Web services in the same Web Service Registry, some elements of the structured information may not exist. For instance, one service in a Web Service Registry may have a description, while another service in the same Web Service Registry does not. Tables 3-1 to 3-5 list the structured information that should be extracted for each of the five Web Service Registries. Moreover, if the style of a Biocatalogue Web service is SOAP, there is some additional information describing the SOAP operations of this service, while if it is REST, this additional information concerns the REST operations. These should also be considered part of the structured information. Table 3-6 and table 3-7 list the information for these two operation types.

Service Name, WSDL Link, WSDL Version, Provider, Server, Rating, Homepage, Owner Homepage, Description
Table 3-1: Structured Information of the Service-Repository Web Service Registry

Service Name, WSDL Link, Provider, Service Style, Homepage, Implementation Language, Description, User Description, Contributed Client Name, Type of this Client, Publisher for this Client, Used Toolkit of this Client, Used Language of this Client, Used Operating System of this Client
Table 3-2: Structured Information of the Xmethods Web Service Registry

Service Name, WSDL Link, Server, Provider, Provider's Country, Service Style, Rating, Description, User Description, Service Tags, Documentation (within WSDL)
Table 3-3: Structured Information of the Seekda Web Service Registry

Service Name, WSDL Link, Port Name, Service URL Address, Implementation Class
Table 3-4: Structured Information of the Ebi Web Service Registry

Service Name, WSDL Link, Style, Provider, Provider's Country, View Times, Favorite Times, Submitter, Service Tags, Total Annotation, Provider Annotation, Member Annotation, Registry Annotation, Base URL, SOAP Lab Server Base URL, Description, User Description, Category
Table 3-5: Structured Information of the Biocatalogue Web Service Registry
SOAP Operation Name, Inputs and Outputs, Operation Description, Operation Tags, Part of Which Service
Table 3-6: SOAP Operation Information of the Biocatalogue Web Service Registry

REST Operation Name, Service Tags, Used Template, Operation Description, Part of Which Service, Part of Which Endpoint Group
Table 3-7: REST Operation Information of the Biocatalogue Web Service Registry
(2) Endpoint Information
Endpoint information is the port information of the Web service, which can be extracted only from the service page. Since different Web Service Registries structure the endpoint information differently, the elements of the endpoint information can be very diverse across registries. One thing needs attention: the Ebi Web Service Registry has no endpoint information for any of the Web services published in it. Moreover, even though the Web services within one Web Service Registry share the same structure of endpoint information, some elements of the endpoint information may be missing or empty, and a Web Service Registry may even have no endpoint information at all for some of the Web services published by it. Nevertheless, wherever there is endpoint information for a Web service, there is at least one element, namely the URL address of the endpoint. Table 3-8 shows the endpoint information that should be extracted for the five Web Service Registries.

Service Repository: Endpoint Name, Endpoint URL, Endpoint Critical, Endpoint Type, Bound Endpoint
Xmethods: Endpoint URL, Publisher of this Endpoint, Contact Email of this Publisher, Implementation Language of this Endpoint
Seekda: Endpoint URL
Biocatalogue: Endpoint Name, Endpoint URL
Table 3-8: Endpoint Information of the five Web Service Registries

Service Repository: Service Availability, Number of Downs, Total Uptime, Total Downtime, MTTR, MTBF, RTT Max of Endpoint, RTT Min of Endpoint, RTT Average of Endpoint, Ping Count of Endpoint
Seekda: Service Availability, Begin Time of Monitoring
Biocatalogue: Monitored Status of Endpoint, Overall Status of Service
Table 3-9: Monitoring Information of the five Web Service Registries
(3) Monitoring Information
Monitoring information is the measured statistical information about the Web service. It is worth
noting that the Ebi and Xmethods Web Service Registries have no monitoring information for any of the Web services published by them, while in the other three Web Service Registries only a few Web services may lack this information. Table 3-9 lists the monitoring information of these three Web Service Registries.
(4) Whois Information
Whois information is not extracted from the information hosted in the service page and the service list page. It is descriptive information about the service domain, which can be gained by means of the address of the WSDL link. Because of that, the process of getting the whois information starts after the service domain has been obtained. The final value of the service domain must not contain strings like "http://", "https://", "www.", etc.; it must be a name directly under a top-level domain. After that, the service domain database is queried by sending the value of the service domain to a whois client, which is simply a Web site on the Internet, for example "http://www.whois365.com/cn/domain". If information about that service domain exists, a list of its information is returned as the output. However, the structure of the returned information differs from service domain to service domain. Therefore, the most challenging part is that the extraction process has to deal with each different form of the returned information. Table 3-10 lists the whois information that needs to be extracted for all five Web Service Registries.

Service Domain URL, Domain Name, Domain Type, Domain Address, Domain Description, State, Postal Code, City, Country, Country Code, Phone, Fax, Email, Organization, Established Time
Table 3-10: Whois Information for the five Web Service Registries
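The thesis does not show the domain extraction code; a minimal Java sketch of the described step, assuming the standard java.net.URI class and the simplification that the service domain is just the last two host labels (which would be wrong for multi-part suffixes such as "ac.uk"), could be:

    import java.net.URI;

    public class ServiceDomainExtractor {

        // Derives a service domain such as "thomas-bayer.com" from a WSDL
        // link such as "http://www.thomas-bayer.com/axis2/services/BLZService?wsdl".
        static String getServiceDomain(String wsdlLink) throws Exception {
            String host = new URI(wsdlLink).getHost();   // strips protocol, path, query
            if (host == null) {
                return null;
            }
            String[] labels = host.split("\\.");
            if (labels.length <= 2) {
                return host;                             // already a bare domain
            }
            // Simplification: keep only the last two labels (drops "www." etc.).
            return labels[labels.length - 2] + "." + labels[labels.length - 1];
        }
    }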
Finally, the information of these four aspects is collected together and then delivered to the Storage component for further storage processing.
3.2.3.1 Features of the Property Grabber Component
The Property Grabber component has to provide the following features:
• Obtain basic information
Generally speaking, the more information is available about a Web service, the better one can judge how good this Web service is. Hence, the Property Grabber component needs to extract all the basic information hosted in the service list page and the service page. This basic information comprises the structured information, the endpoint information, and the monitoring information.
• Obtain whois information
For the same reason, namely that more information about a Web service allows a better assessment of its quality, it is necessary to extract as much information about the Web service as possible. Therefore, in addition to the basic information, the Property Grabber component also obtains some additional information, called whois information, such as the type of the domain, the name of the person in charge, the postal code of the domain, the city, phone, fax, detailed address, email, etc.
3.2.3.2 Input of the Property Grabber Component
This component requires the following input data:
• Service list page link
• Service page link
3.2.3.3 Output of the Property Grabber Component
The component produces the following output data:
• Structured information of each service
• Endpoint information of each service, if it exists
• Monitoring information of the service and its endpoint, if it exists
• Whois information of the service domain
All this information is collected together as the properties of each service. Thereafter, the collected properties are sent to the Storage component.
3.2.3.4 Demonstration of the Property Grabber Component
The pictures in figures 3-13 to 3-16 depict the fundamental procedure of the Property Grabber component. To simplify the explanation, the example shown in this section uses the same Web service as before.
1) The inputs of the Property Grabber component are the links of the service list page and the service page received from the Web Service Extractor component. These links are "http://www.service-repository.com" and "http://www.service-repository.com/service/overview/-210897616".
2) When the "GetProperty" function is triggered, it calls its four sub-functions shown in figure 3-12.
Figure 3-13: Structured properties of the service "BLZService" in the service list page
Figure 3-14: Structured properties of the service "BLZService" in the service page
3) First, the "getStructuredProperty" function tries to extract the structured information displayed in the red boxes of figure 3-13 and figure 3-14. However, several elements of the structured information have the same content, like the description shown in the service page and in the service list page. Hence, in order to save extraction time and storage space, elements with the same content are only extracted once. Moreover, a transformation from non-descriptive to descriptive text is needed for the rating information, because its content consists of several star images. The final results of the extracted structured information for this Web service are shown in table 3-11. Because there is no descriptive information for the provider, homepage, and owner homepage, their values are assigned "NULL".

Service Name: BLZService
WSDL Link: http://www.thomas-bayer.com/axis2/services/BLZService?wsdl
WSDL Version: 0
Server: Apache-Coyote/1.1
Description: BLZService
Rating: Four and a half stars
Provider: NULL
Homepage: NULL
Owner Homepage: NULL
Table 3-11: Extracted Structured Information of the Web service "BLZService"
4) Second, the "getEndpointProperty" function extracts the endpoint information displayed in the red box of figure 3-15. Though there is more than one endpoint record in the list, only one of them is extracted as the endpoint information. That is because this master program is intended to extract as much information as possible, but this information should not contain redundant entries. Therefore, only one endpoint record is extracted as the endpoint information, even if there is more than one. Table 3-12 shows the final results of the endpoint information of this Web service.
Figure 3-15: Endpoint information of the Web service "BLZService" in the service page
Endpoint Name: BLZServiceSOAP12port_http
Endpoint URL: http://www.thomas-bayer.com:80/axis2/services/BLZService
Endpoint Critical: True
Endpoint Type: production
Bound Endpoint: BLZServiceSOAP12Binding
Table 3-12: Extracted Endpoint Information of the Web service "BLZService"
5) Then the monitoring information is extracted by invoking the function "getMonitoringProperty". Figure 3-16 displays two types of monitoring properties: the upper red box contains the monitoring information about the Web service, and the lower red box lists the monitoring information of the endpoints. As mentioned before, only one endpoint statistics record is extracted. Besides, as can be seen from figure 3-16, there are two types of availability. They both represent the availability of this Web service, just like the availability shown in figure 3-14; therefore one of these availability values is sufficient. Table 3-13 shows the final results of this extraction process.
Figure 3-16: Monitoring information of the service "BLZService" in the service page

Service Availability: 100%
Number of Downs: 0
Total Uptime: 1 day 19 hours 19 minutes
Total Downtime: 0 seconds
MTBF: 1 day 19 hours 19 minutes
MTTR: 0 seconds
RTT Max of Endpoint: 141 ms
RTT Min of Endpoint: 0 ms
RTT Average of Endpoint: 57.7 ms
Ping Count of Endpoint: 112
Table 3-13: Extracted Monitoring Information of the Web service "BLZService"
6) Next, in order to extract the whois properties, the "getWhoisProperty" function first has to obtain the service domain from the WSDL link. For this Web service, the obtained service domain is "thomas-bayer.com". The function then sends this service domain as input to the whois client for the querying process, which returns a list of information for that service domain; see figure 3-17. Table 3-14 shows the extracted whois information.
Figure 3-17: Whois information of the service domain "thomas-bayer.com"

Service Domain URL: thomas-bayer.com
Domain Name: Thomas Bayer
Domain Type: NULL
Domain Address: Moltkestr. 40
Domain Description: NULL
State: NULL
Postal Code: 54173
City: Bonn
Country: NULL
Country Code: DE
Phone: +4922855525760
Fax: NULL
Email: info@predic8.de
Organization: predic8 GmbH
Established Time: NULL
Table 3-14: Extracted Whois Information of the service domain "thomas-bayer.com"
7) Finally, the information of these four aspects is collected together as the service properties, and these service properties are then forwarded to the Storage component.
3.2.4 The Function of the Storage Component
The Storage component uses the WSDL link from the WSDL Grabber component to download the WSDL document from the Web and then stores it on disk. In addition, the service properties from the Property Grabber component are stored directly on disk in three different manners by this Storage component. Figure 3-18 illustrates the process flow of the Storage component.
When the Storage component receives the needed inputs, the mediator function "Storager" of the Storage component is triggered. It then transforms the service properties into three different output formats and stores them on disk. These output formats are an XML file, an INI file, and database records. Besides, it also tries to download the WSDL document from the URL address of the WSDL link and then stores the obtained WSDL document on disk as well. This "Storager" function is composed of four sub-functions, namely the "getWSDL", "generateXML", "generateDatabase", and "generateINI" sub-functions. Each sub-function is in charge of one aspect of the storage tasks.
Figure 3-18: Overview of the process flow of the Storage Component
(1) "getWSDL" sub-function
The task of the "getWSDL" sub-function is to download the WSDL document and then store it on disk. Above all, it has to get the content of the WSDL document. This procedure works as follows. First, the "getWSDL" sub-function checks whether the value of the received WSDL link equals "NULL". As already presented in section 3.2.2, if a Web service has no WSDL link, the value of its WSDL link is assigned "NULL". In that case, the sub-function creates a WSDL document whose name is the service name appended with the mark "[No WSDL Document]"; obviously, this document contains no content, it is an empty document. If the service does have a WSDL link, this sub-function tries to connect to the Internet using the URL address of the WSDL link. Once it succeeds, the content hosted on the Web is downloaded, stored on disk, and named with the name of the service only. Otherwise, it creates a WSDL document whose name is prefixed with "Bad" before the service name.
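A condensed, hypothetical sketch of the three cases just described (names and helpers are illustrative, not the actual figure 3-19 code):

    import java.io.InputStream;
    import java.net.URL;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    public class WsdlDownloader {

        // Downloads the WSDL document for one service, following the
        // naming scheme described above for the three possible outcomes.
        static void getWsdl(String serviceName, String wsdlLink, String basePath) {
            try {
                if (wsdlLink == null || wsdlLink.equals("NULL")) {
                    // No WSDL link: create an empty, specially marked document.
                    Files.createFile(Paths.get(basePath, serviceName + " [No WSDL Document]"));
                    return;
                }
                try (InputStream in = new URL(wsdlLink).openStream()) {
                    // Link works: store the downloaded content under the service name.
                    Path target = Paths.get(basePath, serviceName);
                    Files.copy(in, target);
                }
            } catch (Exception e) {
                try {
                    // Link unreachable: create a document prefixed with "Bad".
                    Files.createFile(Paths.get(basePath, "Bad" + serviceName));
                } catch (Exception ignored) { }
            }
        }
    }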
(2) "generateXML" sub-function
The "generateXML" sub-function takes the service properties as input, transforms them into an XML file, and stores it on disk under the name of the service plus ".xml". XML stands for eXtensible Markup Language, a markup language designed to transport and store data. The first line of an XML file is the XML declaration, which defines the XML version and the encoding used. For example, "<?xml version="1.0" encoding="UTF-8"?>" means this is an XML file whose version is 1.0 and whose encoding is the 8-bit Unicode Transformation Format character set. Besides, an XML file contains XML elements, which span everything from an element's start tag to its end tag. Moreover, an XML element can also contain other elements, simple text, or a mixture of both.
However, an XML file must contain one root element that is the parent of all other elements.
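The concrete element names of the generated XML file are not specified here, so the following fragment merely illustrates what such a service properties file might look like; all tag and attribute names are hypothetical:

    <?xml version="1.0" encoding="UTF-8"?>
    <service>
        <property name="Service Name">BLZService</property>
        <property name="WSDL Link">http://www.thomas-bayer.com/axis2/services/BLZService?wsdl</property>
        <property name="Rating">Four and a half stars</property>
    </service>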
(3) "generateINI" sub-function
The "generateINI" sub-function also takes the service's properties as input, but it transforms them into an INI file and stores it on disk under the service's name plus ".ini". "ini" stands for initialization. The INI file format is a de facto standard for configuration files. INI files are simple text files with a basic structure. Generally speaking, an INI file contains three different kinds of parts: sections, parameters, and comments. A parameter is the basic element contained in an INI file. Its format is a key-value pair, also called a name-value pair. The pair is delimited by an equals sign "="; the key, or name, always appears to the left of the equals sign. A section is like a room that groups all its parameters together. It always appears on a single line within a pair of square brackets "[]", and sections may not be nested. A comment is descriptive text that begins with a semicolon ";"; anything between the semicolon and the end of the line is ignored.
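Analogously, the actual section and key names are not specified; a hypothetical INI file of the described structure could look like this:

    ; service properties of BLZService (illustrative layout)
    [Structured Information]
    Service Name=BLZService
    WSDL Link=http://www.thomas-bayer.com/axis2/services/BLZService?wsdl

    [Monitoring Information]
    Service Availability=100%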
(4) "generateDatabase" sub-function
The inputs of the "generateDatabase" sub-function are the same as those of the previous two sub-functions. Instead of transforming them into a file like XML or INI, the "generateDatabase" sub-function turns them into data in a database by using SQL statements. SQL stands for Structured Query Language, a database language designed for accessing and manipulating data in a database. Most actions performed on a database are done with SQL statements, and the primary statements of SQL include INSERT INTO, DELETE, UPDATE, SELECT, CREATE, ALTER, and DROP. Therefore, for the purpose of transforming these service properties into database records, this sub-function first has to create a database, using the CREATE DATABASE statement. Then it should create a table to store the data. A table is a collection of related data entries and consists of columns and rows. Since the data for these five Web Service Registries are not very large, one database table is enough for storing the service properties. For that reason, the column names for the service properties must be uniform and well-defined across all five Web Service Registries. Afterwards, the properties of each single service can be put into the table as one record with the INSERT INTO statement of SQL.
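The actual schema is not given in the text; a sketch of the described statements, with hypothetical database, table, and column names and all columns typed as TEXT (see also section 3.2.4.4), might be:

    CREATE DATABASE dwsc;

    CREATE TABLE service_properties (
        service_name  TEXT,
        wsdl_link     TEXT,
        rating        TEXT
        -- ... one TEXT column per uniform property name
    );

    INSERT INTO service_properties (service_name, wsdl_link, rating)
    VALUES ('BLZService',
            'http://www.thomas-bayer.com/axis2/services/BLZService?wsdl',
            'Four and a half stars');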
3.2.4.1 Features of the Storage Component
The Storage component has to provide the following features:
• Generate different output formats
The final result of this master program is to store the information about each service on disk for future work. This Storage component provides three different formats for storing the service properties of the services published in the Web Service Registries. This makes the storage of the services very flexible and also durable.
• Obtain the WSDL document
The important task of this component is to check whether the WSDL document can be obtained from the WSDL link, because the WSDL document plays a decisive role in determining the quality of the service. The Storage component provides the ability to deal with the different situations that arise in the process of obtaining the WSDL document.
3.2.4.2 Input of the Storage Component
This component requires the following input data:
• The WSDL link of each service
• The property information of each service
3.2.4.3 Output of the Storage Component
The component produces the following output data:
• The WSDL document of the service
• An XML document, an INI file, and tables in the database
3.2.4.4 Demonstration of the Storage Component
The following figures describe the fundamental implementation code of this Storage component. The detailed depiction is as follows:
1) As can be seen in figures 3-19 to 3-21, there are several common places among the implementation code fragments. The first common place concerns the parameters defined in each of these sub-functions, namely "path" and "SecurityInt". The parameter "path" is the absolute path on the computer disk. It is used for the storing procedure and has already been specified by the user at the beginning of the whole program. The parameter "SecurityInt" is an increasing integer that is used as part of the name of the service. The reason for this is that it prevents services with the same name from overwriting each other on disk. The content of the red marks in the code of these figures is the second common place. Its function is to create a file or document for the service with the corresponding parameters.
2) Figure 3-19 displays the implementation code of the "getWSDL" sub-function, which is designed to get the WSDL document of a service based on its WSDL link. The parameter "name" is the name of the service, which is used as the name of the WSDL document. The parameter "linkStr" is the actual WSDL link, the most important parameter of this sub-function. The other two parameters, "statistic" and "log", are objects for the text files called "Statistic Information" and "Log Information", respectively. The "Statistic Information" text file is used to record the statistical data of the services in each Web Service Registry, such as the overall number of properties for that Web Service Registry, the overall number of services, the number of services without a WSDL link, the number of services whose WSDL document has no content, the number of services whose WSDL link is not available, etc. The "Log Information" text file records the results of the process steps and the problems encountered, for example: which service is being crawled now, which Web Service Registry it belongs to, the reason why the WSDL document of a service could not be obtained, the reason why a service is unreachable, and so on.
Figure 3-19: Implementation code for getting the WSDL document
3) Figure 3-20 and figure 3-21 show the code for turning the service properties into the XML file and the INI file, and for storing these two files on disk thereafter. The parameter "vec" is a Vector of the "PropertyStruct" data type. This "PropertyStruct" data type is an object of a class consisting of two variables: name and value.
Figure 3-20: Implementation code for generating the XML file
Figure 3-21: Implementation code for generating the INI file
4) The code in figure 3-22 and figure 3-23 shows the process that turns the service properties of all services in these five Web Service Registries into records of the database. For this, a database has to be created first. The name of the database can be arbitrary, as long as it conforms to the naming rules of the database system; the same holds for the name of the table. Figure 3-22 displays the procedure of creating a table in the database. Because it is hard to decide on the length of each service property, the data types of all service properties are set to "Text". Figure 3-23 shows the code for inserting the service properties into the table with the "update" statement.
Figure 3-22: Implementation code for creating the table in the database
Figure 3-23: Implementation code for generating table records
3.3 Multithreaded Programming for DWSC
Multithreaded programming is a built-in characteristic of the Java language. A multithreaded program contains two or more separate parts that can execute concurrently; each such part is called a thread. The use of multithreading makes it possible to create programs that use the system resources efficiently, for example by making maximum use of the CPU, because the idle time of the CPU can be kept to a minimum.
In this master program, there are five Web Service Registries whose services need to be crawled. Moreover, the number of services published in each Web Service Registry differs considerably, which means that the running time for each Web Service Registry differs as well. Without multithreading, a Web Service Registry with fewer services would have to wait until another Web Service Registry with many more services finishes. Therefore, in order to reduce the waiting time for the other Web Service Registries and to maximize the use of the system resources, it is necessary to apply multithreaded programming to this master program. That is to say, this master program creates one thread per Web Service Registry, and these threads execute independently.
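A minimal sketch of this one-thread-per-registry scheme (the registry names come from section 3.1.3; the crawlRegistry method is a hypothetical stand-in for the registry-specific crawling code):

    public class RegistryCrawlerLauncher {

        public static void main(String[] args) throws InterruptedException {
            String[] registries = {
                "Service-Repository", "Ebi", "Xmethods", "Seekda", "Biocatalogue"
            };
            Thread[] threads = new Thread[registries.length];
            for (int i = 0; i < registries.length; i++) {
                final String registry = registries[i];
                // One independent crawler thread per Web Service Registry.
                threads[i] = new Thread(new Runnable() {
                    public void run() {
                        crawlRegistry(registry);
                    }
                });
                threads[i].start();
            }
            for (Thread t : threads) {
                t.join();   // wait until all registry threads have finished
            }
        }

        // Placeholder for the registry-specific crawling procedure.
        static void crawlRegistry(String registryName) {
            System.out.println("Crawling " + registryName + " ...");
        }
    }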
3.4 Sleep Time Configuration for the Web Service Registries
Since this master program is intended to download the WSDL documents and extract the service information of the Web services published in the Web Service Registries, it inevitably affects the performance of those Web Service Registries. In addition, in order not to exceed their throughput capacity, these Web Service Registries restrict the rate of access. Because of that, unknown errors can happen while this master program is executing: for instance, the program may halt at one point without getting any more WSDL documents and service information, the WSDL documents of some services of some Web Service Registries may not be obtainable, or some service information may be missing. Therefore, in order to obtain the largest possible number of Web services published in these five Web Service Registries while not affecting their throughput, the access rate has to be configured for each service of all the Web Service Registries.
Consequently, before going into the essential procedure for each single service of these Web Service Registries, this master program calls the system's built-in function "sleep(long milliseconds)". It is a public static function that causes the currently executing thread to sleep for the specified number of milliseconds; in other words, it temporarily ceases execution for a while. The following table shows the time interval of the sleep function for each Web Service Registry.
Web Service Registry Name    Time Interval (milliseconds)
Service Repository           8000
Ebi                          3000
Xmethods                     10000
Seekda                       20000
Biocatalogue                 10000
Table 3-15: Sleep time of the five Web Service Registries
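As a sketch, the throttling just described could be applied inside the per-registry crawling loop as follows; the interval value is taken from table 3-15, while the loop body is a hypothetical placeholder:

    import java.util.Arrays;
    import java.util.List;

    public class ThrottledCrawlLoop {

        public static void main(String[] args) throws InterruptedException {
            long sleepInterval = 20000;   // milliseconds for Seekda, from table 3-15
            List<String> servicePageLinks = Arrays.asList(
                "http://example.org/service1", "http://example.org/service2");
            for (String link : servicePageLinks) {
                Thread.sleep(sleepInterval);                 // throttle before each service
                System.out.println("Processing " + link);    // stand-in for the real work
            }
        }
    }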
4 Experimental Results and Analysis
This chapter presents the quantitative experimental results of the prototype described in chapter 3, together with an analysis and explanation of these results. In order to obtain reasonably accurate results, the experiments were carried out more than five times; all the data displayed in the following tables and charts are averages over these runs.
4.1 Statistic Information for the Different Web Service Registries
This section discusses the amount statistics of the Web services published in the five Web Service Registries. This includes the overall number of Web services published in each Web Service Registry and the number of unavailable Web services, which have been archived because they may not be active anymore or are close to becoming inactive. Table 4-1 shows the service amount statistics of the five Web Service Registries.

Web Service Registry Name   Service Repository   Ebi   Xmethods   Seekda   Biocatalogue
Overall Services            57                   289   382        853      2567
Unavailable Services        0                    0     0          0        125
Table 4-1: Service amount statistics of the five Web Service Registries

In order to give an intuitive view of the service amount statistics of these five Web Service Registries, figure 4-1 presents the data of table 4-1 as a bar chart. As can be seen from the bar chart, on the one hand the overall number of Web services increases steadily from the Service Repository Web Service Registry to the Biocatalogue Web Service Registry. The Biocatalogue Web Service Registry in particular owns the largest number of Web services. This indicates that the Biocatalogue Web Service Registry has a much more powerful ability to provide Web services to users, because it contains far more services than the other four Web Service Registries. On the other hand, there are no unavailable services in any of the Web Service Registries except the Biocatalogue Web Service Registry. That is to say, only the Biocatalogue Web Service Registry contains Web services that cannot be used by users. To some degree this is wasteful, since these services cannot be used anymore yet still consume network resources on the Web. Therefore, all unavailable services should be eliminated in order to reduce the waste of network resources.
Figure 4-1: Service amount statistics of the five Web Service Registries
4.2 Statistic Information for the WSDL Documents

Web Service Registry Name   Service Repository   Ebi   Xmethods   Seekda   Biocatalogue
Failed WSDL Links           1                    0     23         145      32
Without WSDL Links          0                    0     0          0        16
Empty Content               0                    0     2          0        2
Table 4-2: Statistic information for the WSDL documents
Table 4-2 and figure 4-2 show the error statistics for the WSDL documents of the Web services in the five Web Service Registries. There are three aspects. The first one is the number of "Failed WSDL Links" of the Web services in these Web Service Registries, that is, the overall number of Web services whose WSDL links are invalid. In other words, it is impossible to get the WSDL documents of these Web services via the URL addresses of their WSDL links; therefore, no WSDL document is created for them. The second aspect, "Without WSDL Links", is the overall number of Web services in each Web Service Registry that have no WSDL link at all; for such Web services there is no WSDL document either. The value of the WSDL link for such a Web service is "NULL". A WSDL document is still created, but it has no content, and the name of this WSDL document contains the string "[No WSDL Document]". The third aspect, "Empty Content", represents the overall number of Web services that have WSDL links whose URL addresses are valid, but whose WSDL documents contain no content. In this case, a WSDL document whose name contains the string "(BAD)" is created.
Figure 4-2: Statistic information for the WSDL documents
4.3 Comparison of the Average Number of Service Properties
This section compares the average number of service properties across the five Web Service Registries. This average number of service properties is calculated by means of the following equation:

    ASP = ONSP / ONS    (1)

where
ASP is the average number of service properties for one Web Service Registry,
ONSP is the overall number of service properties in one Web Service Registry, and
ONS is the overall number of Web services that have already been crawled from the corresponding Web Service Registry.
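As a hypothetical worked example: if a registry with ONS = 57 crawled services yielded ONSP = 1311 extracted properties in total, its average would be ASP = 1311 / 57 = 23 properties per service.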
Figure 4-3 shows the average number of service properties per Web service in the five Web Service Registries. As already mentioned, one of the measures for assessing the quality of the Web services in a Web Service Registry is the amount of service information: the more information about a Web service is available, the better one knows that service, and consequently the better the quality of the Web services that the corresponding Web Service Registry can offer to its users. As can be seen from figure 4-3, the Service Repository and Biocatalogue Web Service Registries have a larger number of service properties than the other three Web Service Registries. This directly reflects that these two Web Service Registries provide more detailed information about the Web services published in them, so users can more easily choose the services they need, and they will also be more willing to use the Web services published in these two Web Service Registries. By contrast, the Xmethods and Seekda
Web Service Registries, which provide less service information about their Web services, would offer a lower quality for these Web services. Therefore, users may not like to use the Web services provided in these two Web Service Registries, not to mention the Web services published in the Ebi Web Service Registry.
Figure 4-3 Average Number of Service Properties
From the description presented in section 3.2.3, the causes of the different numbers of service properties in these Web Service Registries may consist of the following points. First, the amount of structured information for the Web services differs among these five Web Service Registries; part of the information for some Web services in one Web Service Registry may even be missing or have an empty value. For example, the amount of structured information that is supposed to be extracted for the Web services in the Ebi Web Service Registry differs greatly from that in the Biocatalogue Web Service Registry. Secondly, there is a certain amount of endpoint information for almost all Web Service Registries except the Ebi Web Service Registry; this more or less reduces the overall number of service properties. Thirdly, some Web Service Registries, such as Xmethods and Ebi, do not have monitoring information. In particular, the Service Repository Web Service Registry has a large amount of monitoring information of the Web services that can be extracted from the Web. Obviously, the last point is the amount of whois information for these Web services. If the database of the whois client does not contain the information about the service domain of a Web service in one Web Service Registry, then no whois information can be extracted. Moreover, even if there is information about the service domain, the amount of that information can be very diverse. Therefore, if a Web Service Registry undergoes the situation that many service domains of its Web services have no or only little whois information, then the average number of service properties in that registry will decrease greatly.
As a result, in order to help users get better acquainted with the Web services provided in a Web Service Registry and to distinguish these Web services, the Web Service Registry should do its best to offer as much information as possible for each of its published Web services.
[Figure 4-3 is a bar chart of the number of properties per registry; the plotted averages are approximately 23 (Service Repository), 7 (Ebi), 17 (Xmethods), 17 (Seekda) and 32 (Biocatalogue) properties per service.]
4.4 Different Outputs of Web Services
The fundamental task of this master program is to obtain the WSDL documents of the Web services hosted in the Web Service Registries, as well as to extract and gather the properties of these Web services and thereafter store them to disk. Therefore, this section describes the different outputs of this master program, which include the WSDL documents of the Web services, the generated XML and INI files, and the data records of these service properties in the database.
Figure 4-4 WSDL Document format of one Web service
The WSDL document of a Web service is simply read from the Web according to the URL address of its WSDL link, and these data are then stored to disk as the WSDL document. The name of the WSDL document is the service name plus the ending ".wsdl", which marks it as a WSDL document. Moreover, in order to distinguish WSDL documents whose names would be the same while their contents differ, the name of each obtained WSDL document in one Web Service Registry is prefixed with a unique Integer. Figure 4-4 shows the valid WSDL document format of a Web service; its name is "1BLZService.wsdl".
Besides, regarding the obtained service properties, they are transformed into an XML file, an INI file and data records in the database. Figure 4-5, figure 4-6 and figure 4-7 show these three different output formats respectively. As can be seen from figure 4-5, this is the INI file of the Web service, whose name is "1BLZService.ini". The Integer is the same as in the WSDL document, because both are materials belonging to the same Web service. The first three lines in that INI file are service comments, which run from the semicolon to the end of the line; they are basic information describing this INI file. The following line is the section, which is enclosed in a pair of brackets. It is important because it indicates that the lines behind it are the information of this Web service. The rest of the lines are the actual service information, each given as a key-value pair with an equals sign between key and value. Each service property is displayed from the beginning of the line.
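The described INI layout can be reproduced with Python's configparser module; the sketch below is only an illustration with invented property names, whereas the real files contain the properties extracted from the registry:

```python
import configparser

# Hypothetical service properties of one Web service.
properties = {"Name": "BLZService", "Provider": "thomas-bayer.com"}

config = configparser.ConfigParser()
config.optionxform = str           # keep the original case of the keys
config["BLZService"] = properties  # the section in a pair of brackets

with open("1BLZService.ini", "w") as f:
    # The first three lines are service comments starting with ";".
    f.write("; Generated by the Deep Web Service Crawler\n")
    f.write("; Registry: Service Repository\n")
    f.write("; Service:  BLZService\n")
    config.write(f)                # key = value lines of the section
```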
Figure 4-5 INI File format of one Web service
Figure 4-6 XML File format of one Web service
Figure 4-7 Database data format for all Web services
Furthermore, figure 4-6 shows the generated XML file format of the Web service "BLZService"; its name is "1BLZService.xml". Needless to say, this XML file is part of the materials of the same Web service. Though the format of the XML file is different from that of the INI file, their essential contents are the same; that is to say, the values of these service properties do not differ, because both files are generated from the same collection of properties of the same Web service. The XML file also has some comments like those in the INI file, which are enclosed between "<!--" and "-->". And the section of the INI file corresponds to the root of the XML file. Therefore, all values of the elements below the root "service" in this XML file are the values of the service properties of this Web service.
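A sketch of how such an XML file could be produced with Python's standard library follows; again the property names are invented for illustration, and only the root element "service" is taken from the description above:

```python
import xml.etree.ElementTree as ET

properties = {"Name": "BLZService", "Provider": "thomas-bayer.com"}

root = ET.Element("service")  # the root plays the role of the INI section
root.append(ET.Comment("Generated by the Deep Web Service Crawler"))
for key, value in properties.items():
    ET.SubElement(root, key).text = value  # one element per property

ET.ElementTree(root).write("1BLZService.xml", encoding="utf-8",
                           xml_declaration=True)
```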
Eventually, as can be seen from figure 4-7, this is the database table which is used to store the service information of all Web services in these five Web Service Registries. The entire service information of one Web service is exactly one record in this table. Because of that, the column names of the table are the union of the names of the service information in each Web Service Registry. However, since the column names of the table have to be unique, the redundant names in this union must be eliminated. This is sensible and possible because the names of the service information are well-defined and uniform across all five Web Service Registries. In addition, the first column of this table is the primary key, an increasing Integer; its function resembles that of the Integer contained in the names of the XML and INI files. The remaining columns of the table are the corresponding service properties of the Web services. The value "NULL" or "Null" in this table represents that the respective property of that Web service is empty or missing.
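As a rough illustration of this storage scheme, the following Python/SQLite sketch builds a table whose columns are the deduplicated union of property names and whose first column is an increasing integer key; the column names are invented and the real table contains many more columns:

```python
import sqlite3

# Hypothetical union of the uniform, well-defined property names.
columns = ["Name", "Provider", "WSDLLink"]

con = sqlite3.connect("services.db")
con.execute(
    "CREATE TABLE IF NOT EXISTS services ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, "  # increasing integer key
    + ", ".join(f"{c} TEXT" for c in columns) + ")"
)

record = {"Name": "BLZService", "Provider": "thomas-bayer.com",
          "WSDLLink": None}  # None is stored as NULL (missing property)
placeholders = ", ".join("?" for _ in columns)
con.execute(
    f"INSERT INTO services ({', '.join(columns)}) VALUES ({placeholders})",
    [record[c] for c in columns],
)
con.commit()
```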
4.5 Comparison of the Average Time Cost for Different Parts of a Single Web Service
This section describes the comparison of the average time cost of the different parts of getting one single Web service in all these five Web Service Registries. At first, the average time cost of getting one single service in a Web Service Registry has to be calculated. It can be obtained through the following equation:

ATC = OTS / ONS    (2)
Where
ATC is the average time cost for one single Web service
OTS is the overall time cost of all the Web services in one Web Service Registry
ONS is the overall number of Web services that have already been crawled from the
corresponding Web Service Registry
In addition, the different parts of the average time cost for getting one single service consist of the following six aspects: the average time cost for extracting the service properties, the average time cost for obtaining the WSDL document, the average time cost for generating the XML file, the average time cost for generating the INI file, the average time cost for inserting the service properties into the database table, and the average time cost for some other procedures, such as getting the service list page link, getting the service page link, and so on. The average time cost for extracting the service properties is obtained by means of the following equation:
ATCSI = OTSSI / ONS    (3)

Where
ATCSI is the average time cost for extracting the service properties of one single Web service
OTSSI is the overall time cost for extracting the service properties of all the Web services in one Web Service Registry
ONS is the overall number of Web services that have already been crawled from the corresponding Web Service Registry
In addition, the calculation of the other parts is similar to the equation for calculating the average time cost for extracting the service properties, while the average time cost for the other procedures equals the average time cost for one single Web service minus the sum of the average time costs of the other five parts.
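In other words, the "Others" column is a residual. Using the Service Repository row of table 4-3 below, a short Python check illustrates this:

```python
# Per-part average time costs in milliseconds (Service Repository row).
parts = {"property": 8801, "wsdl": 918, "xml": 2, "ini": 1, "database": 53}
atc = 10042                       # overall average time for one service
others = atc - sum(parts.values())
print(others)                     # 267, the "Others" column of table 4-3
```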
                     Service    WSDL       XML    INI    Database   Others   Overall
                     property   Document   File   File
Service Repository       8801        918      2      1         53      267     10042
Ebi                       699         82      2      1         28       11       823
Xmethods                 5801       1168      2      1         45       12      7029
Seekda                   5186       1013      2      1         41       23      6266
Biocatalogue            39533        762      2      1         66     1636     42000

Table 4-3 Average time cost information (in milliseconds) for all Web Service Registries
Table 4-3 displays the average time cost of one Web service and of its different parts in all these five Web Service Registries. The first column of table 4-3 contains the names of the five Web Service Registries, and the last column is the average time cost for a single service in the respective Web Service Registry, while the remaining columns are the average time costs of the six different parts. In order to give an intuitive view of the data in table 4-3, the data in each column of this table are illustrated with the corresponding figures; see figure 4-8 to figure 4-13.
Figure 4-8 Average time cost for extracting service property in all Web Service Registries
As can be seen from figure 4-8, the average time cost for extracting the service properties in the Biocatalogue Web Service Registry is 39533 milliseconds. This value is much larger than those of the other four Web Service Registries, which are 8801, 699, 5801 and 5186 milliseconds for Service Repository, Ebi, Xmethods and Seekda respectively. That is to say, it takes much longer to extract the service properties of the Web services published in the Biocatalogue Web Service Registry. In addition, this indirectly indicates that the Biocatalogue Web Service Registry also has the highest average number of service properties, which has already been discussed in section 4.3. On the contrary, the average number of service properties in the Ebi Web Service Registry is the smallest. Moreover, the average time cost in the Xmethods Web Service Registry is larger than that in the Seekda Web Service Registry, although, as already known, the average number of service properties is the same for these two Web Service Registries. Nevertheless, there is a cause that might explain why Xmethods costs more time on average than Seekda: the process of extracting the service properties in the Xmethods Web Service Registry has to work through both the service page and the service list page, while for the Seekda Web Service Registry only the service page link is needed.
The average time cost for obtaining the WSDL document in all these five Web Service Registries is displayed in figure 4-9. Actually, this average time cost is the sum of the average time for extracting the WSDL link of a Web service and the average time for reading the data of the WSDL document from the Web and storing it to disk. As seen from this figure, the average time cost for obtaining the WSDL document is largest in the Xmethods Web Service Registry, at 1168 milliseconds. Although the process of extracting the WSDL link costs a certain amount of time, it does not have a significant influence on the total average time spent for obtaining the WSDL document, because the WSDL link of a Web service is almost always gained in one step. Therefore, this implies that the average size of the WSDL documents in the Xmethods Web Service Registry is larger than in the other four Web Service Registries, in particular the Ebi Web Service Registry, which takes just 82 milliseconds for the process of obtaining the WSDL document.
Figure 4-9 Average time cost for obtaining WSDL document in all Web Service Registries
Figure 4-10, figure 4-11 and figure 4-12 show the average time cost of generating the three different outputs in all these five Web Service Registries. As can be seen from figures 4-10 and 4-11, the average time for generating the XML file of one Web service is the same for all five Web Service Registries, namely only 2 milliseconds; the average time for generating the INI file of one Web service is also identical everywhere, at just 1 millisecond. Even the sum of these two average time costs is still so small that it can be omitted when compared to the overall average time cost of getting one Web service in each Web Service Registry, as shown in figure 4-13. This implies that the process of generating the XML and INI files finishes at once after receiving the service properties of one Web service as input. Furthermore, as can be seen from figure 4-12, although the average time costs for creating the database record of a Web service in these five Web Service Registries are larger than the times for generating the XML and INI files, the operation of creating a database record is still fast.
Figure 4-10 Average time cost for generating XML file in all Web Service Registries
Figure 4-11 Average time cost for generating INI file in all Web Service Registries
Figure 4-12 Average time cost for creating database record in all Web Service Registries
Figure 4-13 Average time cost for getting one Web service in all Web Service Registries
Figure 4-13 gives the average time cost for getting one single Web service in all these five Web Service Registries. Without any doubt, the Biocatalogue Web Service Registry takes the longest time for this process. This is because, as the presentation of the five different parts above shows, each part needs more time in the Biocatalogue Web Service Registry, except for the process of obtaining the WSDL document, where Biocatalogue does not cost the longest time. Moreover, there is a striking observation when looking at figures 4-8, 4-12 and 4-13: the shapes of these curves have almost the same trend. This further indicates that a Web Service Registry which spends more time to get the description information of a Web service would offer more information about that Web service.
5 Conclusion and Further Direction
This master thesis provides a scheme which aims to explore the description information of the Web services hosted in different Web Service Registries. The description information of a Web service consists of the WSDL document and the service information about that Web service. Although there is an existing approach, the Pica-Pica Web Service Description Crawler, that can be used to obtain the WSDL documents and the service information of the Web services in these Web Service Registries, its functionality is restricted: it explores only a small subset of the Web services hosted in these Web Service Registries, the storage of these Web services is not flexible, and, most importantly, only a little service information of the Web services is extracted; for some Web Service Registries no service information is extracted at all. The work presented in this master thesis, however, is able to explore all Web services in these Web Service Registries. Moreover, with the approach presented in this master thesis the service information of a Web service is extracted as comprehensively as possible, so that the final result would be the largest annotated Service Catalogue ever produced. Furthermore, regarding the storage of the description information of the Web services, this master thesis provides three different ways that guarantee not only the completeness but also the longevity of the description information of the Web services.
However, in the implementation performed in this master thesis, the whois client used for querying the information of a service domain returns a free text if the information exists, and sometimes this free text differs completely. This makes it necessary to crawl each Web service in all Web Service Registries at least once in the experiment stage, so that all the cases of these free texts can be foreseen and processed afterwards. Nevertheless, this is a huge amount of work, because there are lots of Web services in these Web Service Registries. Therefore, in order to simplify the work, another whois client which eases the work here needs to be found and used.
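A minimal sketch of such a whois query in Python, assuming a system whois client is installed, is shown below; the selected field names are only examples of the many possible variants of the free text:

```python
import subprocess

def whois_lookup(domain: str) -> dict:
    """Query the system whois client and pick out a few fields.

    The returned free text differs between whois servers, so the
    patterns below cover only some of the possible cases.
    """
    text = subprocess.run(["whois", domain], capture_output=True,
                          text=True).stdout
    info = {}
    for line in text.splitlines():
        key, _, value = line.partition(":")
        key, value = key.strip().lower(), value.strip()
        if key in ("registrar", "country", "creation date") and value:
            info.setdefault(key, value)
    return info

# whois_lookup("thomas-bayer.com")
```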
Moreover, in the experiment stage of this master thesis, the time cost for getting a Web service is still large. For the purpose of reducing this time, multithreaded programming could be applied to some parts of the process of getting one Web service.
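One possible shape of such a multithreaded variant, sketched with Python's thread pool (the function body and the service links are placeholders):

```python
from concurrent.futures import ThreadPoolExecutor

def crawl_service(service_link: str) -> None:
    # extract properties, obtain the WSDL document, generate the
    # XML/INI files and insert the database record for one service
    ...

service_links = ["http://example.org/service/1",
                 "http://example.org/service/2"]

# Crawl several Web services in parallel instead of one after another.
with ThreadPoolExecutor(max_workers=8) as pool:
    pool.map(crawl_service, service_links)
```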
Although the work performed here is specialized for only these five Web Service Registries, the main parts of the principles used here are adaptable to other Web Service Registries, with only small changes in the implementation code or the structure.
6 Bibliography
[1] Nathalie Steinmetz, Holger Lausen, Dario Cerizza, Andrea Turati, Irene Celino, Adam Funk and Michael Erdmann, "D1.1 – Requirement Analysis and Architectural Plan", available from http://www.service-finder.eu/deliverables/public/7-public/32-d11, Emanuele Della Valle (CEFRIEL), June 27, 2008
[2] Nathalie Steinmetz, Holger Lausen, Dario Cerizza, Andrea Turati, Irene Celino, Adam Funk and Michael Erdmann, "D1.2 – First Design of Service-Finder as a Whole", available from http://www.service-finder.eu/deliverables/public/7-public/37-d12-first-design-of-service-finder-as-a-whole, Emanuele Della Valle (CEFRIEL), July 1, 2008
[3] Nathalie Steinmetz, Holger Lausen, Irene Celino, Dario Cerizza, Saartje Brockmans, Adam Funk, "D1.3 – Revised Requirement Analysis and Architectural Plan", available from http://www.service-finder.eu/deliverables/public/7-public/57-d13-revised-requirement-analysis-and-architectural-plan, Emanuele Della Valle (CEFRIEL), April 1, 2009
[4] Chia-Hui Chang, Mohammed Kayed, Moheb Ramzy Girgis, Khaled Shaalan, "A Survey of Web Information Extraction Systems", IEEE Transactions on Knowledge and Data Engineering, Volume 18, Issue 10, IEEE Computer Society, pp. 1411-1428, October 2006
[5] Leonard Richardson, "Beautiful Soup Documentation", available from http://www.crummy.com/software/BeautifulSoup/documentation.html, October 13, 2008
[6] Hao He, Hugo Haas, David Orchard, "Web Services Architecture Usage Scenarios", available from http://www.w3.org/TR/ws-arch-scenarios/, February 11, 2004
[7] Stephen Soderland, "Learning Information Extraction Rules for Semi-Structured and Free Text", Machine Learning, Volume 34, Issue 1-3, pp. 233-272, Department of Computer Science and Engineering, University of Washington, Seattle, February 1999
[8] Ian Hickson, "A Vocabulary and Associated APIs for HTML and XHTML", World Wide Web Consortium, Working Draft WD-html5-20100624, January 22, 2008
[9] Holger Lausen, Jos de Bruijn, Axel Polleres, Dieter Fensel, "The Web Service Modeling Language WSML", WSML Deliverable D16.1v0.2, March 20, 2005, available from http://www.wsmo.org/TR/d16/d16.1/v0.2/
[10] Dumitru Roman, Holger Lausen, Uwe Keller, "Web Service Modeling Ontology - Standard (WSMO-Standard)", WSMO deliverable D2 version 1.1, 06 March 2004, available from http://www.wsmo.org/TR/d2/v1.1/
[11] Iris Braun, Anja Strunk, Gergana Stoyanova, Bastian Buder, "ConQo – A Context- and QoS-Aware Service Discovery", TU Dresden, Department of Computer Science, in Proceedings of WWW/Internet 2008
7 Appendixes
There are additional outputs of this master program, namely the log information file and the statistic report file. Figure 8-1 shows one of the basic output formats of the log information for these five Web Service Registries.
Figure 8-1 Log information of the "Service Repository" Web Service Registry
Figure 8-2 Statistic information of the "Service Repository" Web Service Registry
Figure 8-2 to figure 8-6 show the output format of the statistic information for the "Service Repository", "Ebi", "Xmethods", "Seekda" and "Biocatalogue" Web Service Registries respectively.
Figure 8-3 Statistic information of the "Ebi" Web Service Registry
Figure 8-4 Statistic information of the "Xmethods" Web Service Registry
Figure 8-5 Statistic information of the "Seekda" Web Service Registry
Figure 8-6 Statistic information of the "Biocatalogue" Web Service Registry
Table of Figures
Figure 2-1 Dataflow of Service-Finder and Its Components ... 12
Figure 2-2 Left is the free text input type and right is its output ... 16
Figure 2-3 A Semi-structured page containing data records (in rectangular box) to be extracted ... 16
Figure 2-4 Architecture of the Pica-Pica Web Service Description Crawler ... 20
Figure 3-1 Overview of the Basic Architecture for the Deep Web Services Crawler ... 25
Figure 3-2 Overview of the process flow of the Web Service Extractor Component ... 27
Figure 3-3 Service list page of the Service-Repository ... 29
Figure 3-4 Original source code of the internal link for Web service "BLZService" ... 29
Figure 3-5 Code overview of getting service page link in Service Repository ... 29
Figure 3-6 Service page of the Web service "BLZService" ... 29
Figure 3-7 Overview of the process flow of the WSDL Grabber Component ... 30
Figure 3-8 WSDL link of the Web service "BLZService" in the service page ... 31
Figure 3-9 Original source code of the WSDL link for Web service "BLZService" ... 32
Figure 3-10 Code overview of the "getServiceRepositoryWSDLLink" function ... 32
Figure 3-11 Code overview of the "oneParameter" function ... 32
Figure 3-12 Overview of the process flow of the Property Grabber Component ... 33
Figure 3-13 Structure properties of the Service "BLZService" in service list page ... 37
Figure 3-14 Structure properties of the Service "BLZService" in service page ... 38
Figure 3-15 Endpoint information of the Web service "BLZService" in service page ... 38
Figure 3-16 Monitoring Information of the Service "BLZService" in service page ... 39
Figure 3-17 Whois Information of the service domain "thomas-bayer.com" ... 40
Figure 3-18 Overview of the process flow of the Storage Component ... 41
Figure 3-19 Implementation code for getting WSDL document ... 44
Figure 3-20 Implementation code for generating XML file ... 44
Figure 3-21 Implementation code for generating INI file ... 45
Figure 3-22 Implementation code for creating table in database ... 45
Figure 3-23 Implementation code for generating table records ... 46
Figure 4-1 Service amount statistics of these five Web Service Registries ... 49
Figure 4-2 Statistic information for WSDL Document ... 50
Figure 4-3 Average Number of Service Properties ... 51
Figure 4-4 WSDL Document format of one Web service ... 52
Figure 4-5 INI File format of one Web service ... 53
Figure 4-6 XML File format of one Web service ... 53
Figure 4-7 Database data format for all Web services ... 53
Figure 4-8 Average time cost for extracting service property in all Web Service Registries ... 55
Figure 4-9 Average time cost for obtaining WSDL document in all Web Service Registries ... 56
Figure 4-10 Average time cost for generating XML file in all Web Service Registries ... 57
Figure 4-11 Average time cost for generating INI file in all Web Service Registries ... 57
Figure 4-12 Average time cost for creating database record in all Web Service Registries ... 58
Figure 4-13 Average time cost for getting one Web service in all Web Service Registries ... 58
Table of Tables
Table 3-1 Structured Information of Service-Repository Web Service Registry ... 34
Table 3-2 Structured Information of Xmethods Web Service Registry ... 34
Table 3-3 Structured Information of Seekda Web Service Registry ... 34
Table 3-4 Structured Information of Ebi Web Service Registry ... 34
Table 3-5 Structured Information of Biocatalogue Web Service Registry ... 34
Table 3-6 SOAP Operation Information of Biocatalogue Web Service Registry ... 35
Table 3-7 REST Operation Information of Biocatalogue Web Service Registry ... 35
Table 3-8 Endpoint Information of these five Web Service Registries ... 35
Table 3-9 Monitoring Information of these five Web Service Registries ... 35
Table 3-10 Whois Information for these five Web Service Registries ... 36
Table 3-11 Extracted Structured Information of Web Service "BLZService" ... 38
Table 3-12 Extracted Endpoint Information of the Web service "BLZService" ... 39
Table 3-13 Extracted Monitoring Information of the Web service "BLZService" ... 39
Table 3-14 Extracted Whois Information of service domain "thomas-bayer.com" ... 40
Table 3-15 Sleep Time of these five Web Service Registries ... 47
Table 4-1 Service amount statistics of these five Web Service Registries ... 48
Table 4-2 Statistic information for WSDL Document ... 49
Table 4-3 Average time cost information for all Web Service Registries ... 55
Table of Abbreviations
DTD Document Type Definition
DWSC Deep Web Service Crawler
HTML HyperText Markup Language
MTBF Mean Time between Failures
MTTR Mean Time to Recovery
QoS Quality of Service
REST Representational State Transfer
RTT Round Trip Time
SOAP Simple Object Access Protocol
SQL Structured Query Language
URL Uniform Resource Locator
WHATWG Web Hypertext Application Technology Working Group
WSDL Web Service Description Language
WSML Web Service Modeling Language
WSMO Web Service Modeling Ontology
XML eXtensible Markup Language
7 Appendixes 61
Table of Figures 64
Table of Tables 65
Table of Abbreviations 66
1 Introduction
In this introduction chapter of the master thesis, at first the background of the current situation is concisely explained; then follows a basic introduction of the proposed approach, which is called the Deep Web Service Extraction Crawler.
1.1 Background/Motivation
In the late 1990's the Web Service Registry was a hot commodity. The formal definition of a Web Service Registry is a links page; its function is to uniformly present information that comes from various sources. Hence it can provide a convenient channel to the users for offering, searching and using the Web Services. Actually, the related metadata of the Web Services that are submitted by both the system and the users are commonly hosted along with the service descriptions.
Nevertheless, when users enter one of the Web Service Registries to look for some Web Services, they might meet situations that bring lots of trouble to them. One of these situations may be that the Web Service Registries return several similar published Web Services after the users search on them. For example, two or more Web Services have the same name but different versions; or two or more Web Services are derived from the same server but have different contents, etc. Furthermore, most users are also interested in a global view of the published services. For instance, they want to know which Web Service Registry can provide better quality for a Web Service. Therefore, in order to help users to differentiate those similar published Web Services and to get a global view of the Web Services, this information should be monitored and rated.
Moreover, there are a great many Web Service Registries on the Internet, and each Web Service Registry can provide a great number of Web Services. Obviously, there might be some similar Web Services among these Web Service Registries, or a Web Service in one of the Web Service Registries might be related to another Web Service in other Web Service Registries. Hence, these Web Services should be comparable across different Web Service Registries; however, at present there is not much support for this. In addition, regarding the metadata, actually not all of them are structured, especially the descriptions of the non-functional properties. Therefore, what has to be done now is to turn those non-functional property descriptions into a structured format. Clearly speaking, it is necessary to extract as much information as possible about the Web Services that are offered in the Web Service Registries. Eventually, after extracting all the information from the Web Service Registries, it is necessary to store it to disk. This procedure should be efficient, flexible and complete.
1.2 Initial Designing of the Deep Web Service Crawler Approach
The problems have already been stated in the previous section. Hence, the following work is to solve
these problems. This section presents the basic principle of the Deep Web Service Crawler approach.
At first, a simple introduction of the Deep Web Service Crawler approach to these problems is given. As has already been mentioned, each Web Service Registry can offer Web Services. Moreover, each Web Service Registry has its own HTML page structures; these structures may be the same or even completely different. Therefore, the first thing is to identify which Web Service Registry is going to be explored. Since each Web Service Registry owns a unique URL, this job can be done by directly analyzing the corresponding URL address of that Web Service Registry (a small sketch of this idea follows at the end of this section). After identifying which Web Service Registry is going to be explored, the following step is to obtain all the Web Services that are published in that Web Service Registry. Then, with all these obtained Web Services, it is time to extract, analyze and gather the information about the services. That information can be in a structured format or even in an unstructured format. In this master thesis some Deep Web Analysis Techniques will be applied to obtain this information, so that the information about each Web Service shall be annotated as completely as possible. Last but not least, all the information about the Web Services needs to be stored.
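A minimal sketch of this URL-based identification in Python follows; the host names in the mapping are illustrative guesses, not the exact addresses used by the implementation:

```python
from urllib.parse import urlparse

# Hypothetical mapping from registry hosts to registry identifiers.
REGISTRIES = {
    "www.service-repository.com": "Service Repository",
    "www.xmethods.net": "Xmethods",
    "webservices.seekda.com": "Seekda",
}

def identify_registry(url: str) -> str:
    """Decide which Web Service Registry a URL belongs to."""
    return REGISTRIES.get(urlparse(url).netloc, "unknown registry")
```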
1.3 Goals of this Master Thesis
The following list gives the goals of this master thesis:
• Produce the largest annotated Service Catalogue
The Service Catalogue is a list of service properties. The more properties a service has, the larger the Service Catalogue it owns. Therefore, this master program should extract as many service properties as possible.
• Flexible storage of the metadata of each service as annotations or dedicated documents
The metadata of one service include not only the WSDL document but also the service properties. All these metadata are important information for the service. Therefore, this master program should provide flexible ways to store these metadata to disk.
• Improve the comparability of the Web Services across different Web Service Registries
The names of the service properties in one Web Service Registry can differ from those in another Web Service Registry. Hence, for the purpose of improving the comparability, all these names of the service properties should be uniform and well-defined.
1.4 Outline of this Master Thesis
In this chapter the motivation, objective and initial approach plan have already been discussed. The remaining thesis is structured as follows:
Firstly, chapter 2 presents work that is almost entirely based on existing techniques. Section 2.1 gives a detailed introduction to the technique of the Service-Finder project. Then section 2.2 presents the Information Extraction technique, and section 2.3 introduces and discusses the already implemented Pica-Pica Web Service Description Crawler.
Chapter 3 then explains the design details of this Deep Web Service Crawler approach. Section 3.1 gives a short description of the different requirements of this approach. Next, section 3.2 presents the actual design of the Deep Web Service Crawler. Then sections 3.3 and 3.4 introduce the multithreaded programming and the sleep time configuration used in this master program, respectively.
Chapter 4 displays the experiments of this Deep Web Service Crawler approach and then gives some evaluation of it.
Finally, chapter 5 presents the conclusion and discussion of the work already done, as well as the future work for this master task.
2 State of the Art
This chapter aims at presenting some existing techniques or strategies that are related to the work of applying this Deep Web Service Extraction Crawler approach. Section 2.1 talks about the existing catalogue, the Service-Finder project. Section 2.2 then presents some details about the Information Extraction technique. Finally, section 2.3 explains the existing implemented crawler, the Pica-Pica Web Service Description Crawler.
2.1 Service-Finder Project
The Service-Finder project aims at developing a platform for Web Service discovery, especially for the Web Services that are embedded in a Web 2.0 environment [1]; hence it can provide efficient access to publicly available services. The goals of the Service-Finder project are depicted as follows [1]:
• Automatically gather Web Services and their related information
• Semi-automatically create semantic service descriptions based on the information that is available on the Web
• Create and improve semantic annotations via the user feedback
• Describe the aggregated information in semantic models and allow reasoning and querying
However, before describing the basic functionality of the Service-Finder project, one of its use cases and requirements is presented first.
2.1.1 Use Cases for the Service-Finder Project
The Service-Finder project employed the use case methodology of the W3C use case description [6] for its needs and then applied this methodology to the use cases that it enumerated.
2.1.1.1 Use Case Methodology
Three aspects need to be considered for the use case definitions of the Service-Finder project [1]:
(1) Description: used to describe the information of the use case
(2) Actors, Roles and Goals: used to identify the actors, the roles they act and the goals they need to achieve in the scenario
(3) Storyboard: used to describe the series of interactions among the actors and the Service-Finder Portal
2.1.1.2 System Administrator
This section presents the use case that applies to the Service-Finder portal and illustrates the requirements on its functionality from a user point of view. All the
information in this use case is derived from [1]. In this use case there is a system administrator whose name is "Sam Adams". He works for a bank; his job is to keep the online payment facilities online and working all day and night. Therefore, if there are any system failures, Sam Adams should fix the problems as early as he can. That is why he wants to use an SMS Messaging service which will alert him immediately by sending him an SMS message in the case of a system failure.
• Description
This use case describes the system administrator "Sam Adams", who is looking for an SMS Messaging Service that he wants to build into his application.
• Actors, Roles and Goals
The name of the actor is "Sam Adams". His role is a system administrator at a bank. His goals are the immediate service delivery, the reliability of the service, and low base and transaction fees.
Step 1 Because of Sam Adams knows the Service-Finder portal and he also knows that he can find
many useful services from it especially he know what he is looking for Hence he visits the
Service-Finder portal and starts to search by entering the keywords ldquoSMS Alertrdquo
Requirement 1 Search functionality
Step 2 Now the Service-Finder returns a list of matching services However Sam wants to choose the
number of matching services that will be displayed on one page And he would also expect there has
short information about the service functionality the service provider and the service availability So
that he could decide which service he will choose to read further
Requirement 2 Enable configurable pagination of the matching results and have some short
information for each service
Step 3 When Sam looks through the short information about the services that displayed on the first
page he expects to find the most relevant services that related to his request After that he would
like to read more detailed information about that service to see whether this service can provide the
needed functionality
Requirement 3 Rank the returned matching services and must provide ability to read more details of
a service
Step 4 In the case that all the returned matching services Sam got provide quite different
functionalities or they belong to different service categories for example the SMS messaging services
alert users not through SMS but voice messaging For this reason Sam would like to see other
different categories that may be contain the services he wants Or the services of other categories
which he is also interested in (like ldquoSMS Messagingrdquo) Besides another possible way is that Sam can
further filter his search in terms of browsing through categories
Requirement 4 Service categories and allow the user to look all services that belonged to that specific
category If possible it should also allow the user to browse through categories
Step 5 When Sam got all the services that could provide a SMS messaging service via the methods
described in the Step 4 at present he wants to look for the services that offered by an Austrian
provider and have no base fees if possible
Requirement 5 Faceted search
Step 6: After Sam has got all these specific services, he would now like to choose the services that can provide a high reliability.
Requirement 6: Sort functionality based on the user's choices
Step 7: Sam now expects to compare the service availability promised by the service provider with the actually provided one; this should be contained in the services' details. There also needs to be service coverage information, so that Sam can know whether a service covers the areas where he lives and works. Moreover, Sam would also like to compare these services in another way, for instance by putting some services into a structured table to compare the transaction fees.
Requirement 7: A side-by-side comparison table for services and a functionality that enables users to select the services they want to compare
Step 8: At last Sam wants to know whether the service providers offer a free trial of the services, so that he can test the service functionality.
Requirement 8: If possible, display a note that offers free service trials
2.1.2 Architecture Plan for the Service-Finder Project
The architecture plan of the Service-Finder project contains five basic components: Service Crawler, Automatic Annotator, Conceptual Indexer and Matcher, Cluster Engine, and Service-Finder Portal Interface. Figure 2-1 presents a high-level overview of the components and the data flow among them.
Figure 2-1 Dataflow of Service-Finder and Its Components [3]
2.1.2.1 The Principle of the Service Crawler Component
The Service Crawler is responsible for gathering the available services and their related information from the Web. The overall cycle is depicted as follows:
(1) A Web developer publishes a Web Service.
(2) The Crawling component begins to harvest the Web in order to identify Web Services, i.e. WSDL (Web Service Description Language) documents.
(3) The Crawler also searches for other related information as soon as a service is discovered.
(4) After each periodic interval, the Crawler produces a consistent snapshot of the relevant part of the Web.
At last, the output of the Crawler is forwarded to the subsequent components for analyzing, indexing and displaying.
2.1.2.2 The Principle of the Automatic Annotator Component
The Automatic Annotator receives the relevant data from the previous component and generates semantic service descriptions of the WSDL documents and their related information, based on the Service-Finder Ontology and the Service Category Ontology.
Firstly, the two compatible ontologies that are used throughout the whole process are briefly introduced [2]:
• Generic Service Ontology: an ontology which serves to describe the data objects, for example the services, the service providers, availability, payment modalities and so on.
• Service Category Ontology: an ontology which is used to categorize the functionalities or applications of the services, for instance data verification, messaging, data storage, weather, etc.
Afterwards, the function of this component with its input and output is described:
• Input:
- Crawled data from the Service Crawler
- Service-Finder Ontologies
- Feedback or corrections of previous annotations
• Function:
- Enrich the information about the services and extract semantic statements according to the Service-Finder Ontologies, for example categorize the services according to the Service Category Ontology
- Determine whether a particular document is relevant or not through the Web link graph; if not, discard these irrelevant documents
- Classify the pages into their genres, for instance pricing, user comments, FAQ and so on
• Output:
- Semantic annotations of the services
2.1.2.3 The Principle of the Conceptual Indexer and Matcher Component
The Conceptual Indexer and Matcher is more like a data store center that aims at storing all the extracted information of the services and supplying users with the capability of retrieval and semantic queries, for example the matchmaking between user requests and service offers, and the retrieval of user feedback on extracted annotations.
In addition, the function of this component and its input and output are as follows:
• Input:
- Semantic annotation data and full text information obtained from the Automatic Annotator
- Semantic annotation data and full text information that come from the user interfaces
- Cluster data from the user and service clustering component
• Function:
- Store the semantic annotations received from the Automatic Annotator component and from the user interface
- Store the cluster data procured through the clustering component
- Store and index the textual descriptions offered by the Automatic Annotator component and the textual comments offered by users
- Ontological querying of the semantic data from the data store center
- Combined keyword and ontological querying used for user queries
- Provide a list of similar services for a given service
• Output:
- A list of matching services that are queried by users; in particular, these services should be sorted by ranking and it should be possible to iterate over them
- All available data related to a particular entity must be retrievable at the user interface
2.1.2.4 The Principle of the Service-Finder Portal Interface Component
The Service-Finder Portal Interface is the main entry point provided for users of the Service-Finder system to search and browse the data which is managed by the Conceptual Indexer and Matcher component. In addition, the users can also contribute information by providing tags, comments, categorizations and ratings for the browsed data. Furthermore, developers can directly invoke the Service-Finder functionalities from their custom applications by means of an API.
Besides, the details of this component's function, input and output are represented below:
• Input:
- A list of ordered services for a query
- Detailed information about a service or a set of services and a service provider
- Query access to the service category ontology and the most used tags provided by the users
- Service availability information
• Function:
- The Web interface allows the users to search services by keyword, tag or concept in the categorization; to sort and filter query results by refining the query; to compare and bookmark services; and to try out the services that offer this functionality
- The API allows the developers to invoke Service-Finder functionalities
• Output:
- Explicit user annotations such as tags, ratings, comments, descriptions and so on
- Implicit user data, for example click streams of users, bookmarks, comparisons, links sent, etc.
- Manual advertising of available new services
2.1.2.5 The Principle of the Cluster Engine Component
The Cluster Engine is in charge of collecting and analyzing information about user behaviors from the Service-Finder Portal, e.g. the queried services and the compared services of the users. Moreover, it also provides cluster data to the Conceptual Indexer and Matcher for providing service recommendations.
Furthermore, this component's function, input and output are described in detail:
• Input:
- Service annotation data, both extracted and from user feedback
- Users' click streams, used for extracting user behaviors
• Function:
- Obtain user clusters from user behaviors
- Obtain service clusters from service annotation data to enable finding similar services
• Output:
- Clusters of users and services
2.2 Information Extraction
Due to the rapid development and use of the World-Wide Web, a huge amount of information sources has appeared on the Internet, whose access via browsing and searching has been limited by the heterogeneity and the lack of structure of Web information sources. Therefore, Information Extraction, which transforms the Web pages into program-friendly structures for post-processing, becomes a great necessity. The task of Information Extraction is specified in terms of its inputs and its extraction targets, and the technique used in the process of Information Extraction is called an extractor.
2.2.1 Input Types of Information Extraction
Generally speaking, there are three different input types. The first input type is the unstructured
document, for example the free text that is shown in figure 2-2. It is unstructured and written in natural language, so it requires substantial natural language processing. The second input type is the structured document, for instance XML documents, since their data can be described through an available DTD (Document Type Definition) or XML (eXtensible Markup Language) schema. Finally, the third input type is the semi-structured document, which is widespread on the Web, such as the large volume of HTML pages with tables, itemized lists and enumerated lists. This is because HTML tags are often used to render this embedded data in the HTML pages; see figure 2-3.
Figure 2-2 Left is the free text input type and right is its output [4]
Figure 2-3 A Semi-structured page containing data records (in rectangular box) to be extracted [4]
Therefore, the inputs of the semi-structured type can be seen as documents with a fairly regular structure, whose data can be displayed in an HTML format or a non-HTML format. Besides, since the Web pages of the Deep Web are dynamic and generated from structured databases by means of templates or layouts, the Deep Web can be considered as one of the input sources that provide such semi-structured documents. For example, the author, price and comment parts of the book pages provided by Amazon have the
same layout, because these Web pages are generated from the same database and rendered with the same template or layout. Furthermore, there is another option: manually generated HTML pages of the semi-structured type. For example, although the publication lists provided on different researchers' homepages are produced by diverse users, they all have a title and a source property for every single paper. Eventually, the inputs for some Information Extraction tasks can also be pages of the same class or from various Web Service Registries.
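Because the embedded data of semi-structured pages is rendered with HTML tags, a parser such as Beautiful Soup [5] can pull the data records out of such a page. A minimal sketch with invented tag and class names follows:

```python
from bs4 import BeautifulSoup

html = """
<table>
  <tr class="record"><td>BLZService</td><td>thomas-bayer.com</td></tr>
  <tr class="record"><td>WeatherService</td><td>example.org</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
for row in soup.find_all("tr", class_="record"):
    name, provider = [td.get_text() for td in row.find_all("td")]
    print(name, provider)
```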
2.2.2 Extraction Targets of Information Extraction
Moreover, regarding the task of Information Extraction, the extraction target has to be considered. There are two different extraction targets. The first one is the relation of a k-tuple, where k is the number of attributes in a record; nevertheless, in some cases an attribute of a record may have no instantiation, while in other cases an attribute owns multiple instantiations. The complex object with hierarchically organized data is the second extraction target. Though the ways of depicting the extraction targets in a page are diverse, the most common structure is the hierarchical tree. This hierarchical tree may contain only one leaf node, or one or more lists of leaf nodes under so-called internal nodes. The structure of a data object may also be flat or nested. In brief, if the structure is flat, then there is only one leaf node, which can also be called the root; otherwise, if it is a nested structure, then the internal nodes involved in this data object span more than two levels.
Furthermore, in order to make Web pages readable for human beings and easier to visualize, tables, tuples of the same list, and elements of a tuple should be clearly isolated or demarcated. However, the display of a data object in a Web page is affected by the following conditions [4]:
- The attribute of a data object has zero or several values:
(1) If there is no value for an attribute of a data object, this attribute is called a "none" attribute. For example, a special offer only available for certain books might be a "none" attribute.
(2) If there is more than one value for an attribute of a data object, it is called a "multiValue" attribute. For instance, the author name of a book could be a "multiValue" attribute.
- The set of attributes (A1, A2, A3, ...) has multiple orderings:
That is to say, the position of an attribute might change according to the diverse instances of a data object. Such an attribute is called a "multiOrdering" attribute. For instance, for movies before 1999 a movie site might list the release date in front of the movie's title, while for movies from 1999 onwards it lists the release date behind the title.
- The attribute has different formats:
This means the display format of the attribute can be completely distinct across different instances. Therefore, if the format of an attribute is free, a lot of rules are needed to deal with all kinds of possible cases; this kind of attribute is called a "multiFormat" attribute. For example, an e-commerce Web site might use a bold font to present regular prices, while using a red color to display sale prices. There is also the opposite situation, in which different attributes of a data object have the same format; for example, various attributes may all be presented with <TD> tags in a table, and such attributes can only be differentiated by means of their order. However, for cases in which a "none" attribute occurs or "multiOrdering" attributes exist, the rules for extracting these attributes have to be revised.
- The attribute cannot be decomposed:
For easier processing, input documents are sometimes treated as strings of tokens instead of strings of characters. However, some attributes cannot be decomposed into several individual tokens; these attributes are called "untokenized" attributes. An example is a college course code like "COMP4016" or "GEOL2001": the department code and the course number cannot be separated into two different strings such as "COMP" and "4016" or "GEOL" and "2001".
2.2.3 Techniques Used in Information Extraction
The extractor used in the process of Information Extraction aims at providing a single uniform query interface to access information sources such as database servers and Web servers. It consists of the following phases: collecting the returned Web pages, labeling these Web pages, generalizing extraction rules, extracting the relevant data, and outputting the result in an appropriate format (XML format or a relational database) for further information integration. For example, the extractor first queries the Web server to gather the returned pages via the HTTP protocol; after that, it starts to extract the contents of these HTML documents and thereafter integrates them with other data sources. The whole process of the extractor follows the steps below:
Step 1:
At the beginning, the input has to be tokenized. There are two different granularities for input string tokenization: tag-level encoding and word-level encoding. Tag-level encoding transforms the tags of an HTML page into generic tokens, while every text string between two tags becomes a special token. Word-level encoding works differently: it treats each word in a document as a token. (A small sketch of tag-level encoding follows these steps.)
Step 2:
Next, the extraction rules are applied to every attribute of the data objects in the Web pages. These extraction rules can be induced by top-down or bottom-up generalization, pattern mining, or logic programming. The type of extraction rule may be expressed by means of regular grammars or logic rules. For example, some systems use path expressions over the HTML parse tree such as html->head->title or html->table[0], some use syntactic or semantic constraints, and some use delimiter-based constraints such as HTML tags or literal words.
Step 3:
After that, all the extracted data are assembled into records.
Step 4:
Finally, this process is iterated until all data objects in the input have been processed.
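To make the tag-level encoding of Step 1 concrete, the following minimal Java sketch splits an HTML string into tag tokens and a special token for each text run; the class name and the "TEXT" token convention are illustrative only and not taken from any cited system:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class TagLevelTokenizer {
        // A token is either a complete HTML tag or a run of text between two tags.
        private static final Pattern TOKEN = Pattern.compile("<[^>]+>|[^<]+");

        public static List<String> tokenize(String html) {
            List<String> tokens = new ArrayList<String>();
            Matcher m = TOKEN.matcher(html);
            while (m.find()) {
                String piece = m.group();
                if (piece.startsWith("<")) {
                    tokens.add(piece);      // tags become general tokens, e.g. "<td>"
                } else if (piece.trim().length() > 0) {
                    tokens.add("TEXT");     // text between two tags becomes one special token
                }
            }
            return tokens;
        }

        public static void main(String[] args) {
            // prints: [<tr>, <td>, TEXT, </td>, </tr>]
            System.out.println(tokenize("<tr><td>BLZService</td></tr>"));
        }
    }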
2.3 Pica-Pica Web Service Description Crawler
Pica pica is known as a bird species, also called the magpie. Here, however, Pica-Pica is a Web Service Description Crawler which was designed to address the question of Web Service quality, for example the evaluation of the descriptive quality of the offered Web Services and of how well these Web Services are described in today's Web Service Registries.
2.3.1 Needed Libraries of the Pica-Pica Web Service Description Crawler
This version of the Pica-Pica Web Service Description Crawler was written by Anton Caceres and Josef Spillner and is programmed in the Python language. In order to run its scripts for parsing the HTML pages, it needs two additional libraries: Beautiful Soup and html5lib.
- Beautiful Soup
It is an HTML/XML parser for the Python language which can even turn invalid markup into a parse tree [5]. Moreover, the following three features make it particularly powerful:
  - Bad markup does not choke Beautiful Soup. It generates a parse tree that makes approximately as much sense as the original document, so you can still obtain the data you want.
  - Beautiful Soup provides a toolkit with simple, idiomatic methods for navigating, searching and modifying the parse tree; hence you do not need to create a custom parser for every application.
  - If the document already specifies an encoding, you can ignore the encoding issue, since Beautiful Soup converts the documents to Unicode and to UTF-8 automatically; otherwise, all you have to do is specify the encoding of the original document.
Furthermore, the ways of including Beautiful Soup in an application are displayed in the following [5]:

    from BeautifulSoup import BeautifulSoup           # for processing HTML
    from BeautifulSoup import BeautifulStoneSoup      # for processing XML
    import BeautifulSoup                              # to get everything
- html5lib
It is a Python package which implements the HTML5 [8] parsing algorithm. In order to gain maximum compatibility with the current major desktop Web browsers, this implementation is based on the WHATWG (Web Hypertext Application Technology Working Group) HTML5 specification.
2.3.2 Architecture of the Pica-Pica Web Service Description Crawler
Figure 2-4: Architecture of the Pica-Pica Web Service Description Crawler
Figure 2-4 illustrates the primary architecture of the Pica-Pica Web Service Description Crawler. It includes four fundamental components: the Service Page Grabber component, the WSDL Grabber component, the Property Grabber component and the WSML Register component.
(1) The Service Page Grabber component takes the URL seed as input and outputs the link of the service page to the following two components: the WSDL Grabber component and the Property Grabber component.
(2) The WSDL Grabber component is responsible for obtaining the WSDL document via the delivered service page link and then checks whether the obtained WSDL document is valid. Only the valid WSDL documents are passed to the WSML Register component for further processing.
(3) The Property Grabber component tries to extract the service properties hosted in the service page, if any exist. After that, all these service properties are saved into an INI file as the information of that service.
(4) The functionality of the WSML Register component is to write appropriate WSML documents by means of the valid WSDL documents delivered from the WSDL Grabber component and the optional INI files delivered from the Property Grabber component, and afterwards to register them in ConQo.
- WSML [9]
It stands for Web Service Modeling Language, which provides a framework with different language variants. Hence it is often used to describe the different aspects of semantic Web Services according to the conceptual model of WSMO.
- WSMO [10]
WSMO, whose full name is Web Service Modeling Ontology, is dedicated to describing various aspects of Semantic Web Services based on ontologies. An ontology is a formal, explicit specification of a shared conceptualization. In fact, these ontologies are the sticking points that establish the linkage between the agreement of the communities of users and the defined conceptual semantics of the real world.
- ConQo [11]
It is a discovery framework that considers not only the Quality of Service (QoS) but also context information. It uses a Web Service Repository to manage the service descriptions, which are based on WSML.
2.3.3 Implementation of the Pica-Pica Web Service Description Crawler
This section describes the implementation of the Pica-Pica Web Service Description Crawler in detail.
(1) Firstly, to start the whole crawling process of the Pica-Pica Web Service Description Crawler, an input is needed as the initial seed. In this crawler, there are five Web Service Registries, which are listed below; the URL addresses of these five Web Service Registries are used as the input seeds for the Pica-Pica Web Service Description Crawler. Moreover, in this version of the Pica-Pica Web Service Description Crawler there is a separate Python script for each Web Service Registry, and the crawling processes of these Web Service Registries' Python scripts are executed one after another.
Biocatalogue: http://www.biocatalogue.com
Ebi: http://www.ebi.ac.uk
Seekda: http://www.seekda.com
Service-Repository: http://www.service-repository.com
Xmethods: http://www.xmethods.net
(2) After being fed with the input seed, the crawler steps into the next component, the Service Page Grabber. At first, this component tries to read the data from the Web based on the input seed. Then it establishes a parse tree of the read data by means of the functions of the Beautiful Soup library. After that, the Service Page Grabber component starts to look for the service page link of each service published in the Web Service Registry by means of the functions of the html5lib library. Whenever the service page link of a single service is found, the component first checks whether this service page link is valid. Once the service page link is valid, it is passed to the following two components for further processing: the WSDL Grabber component and the Property Grabber component.
(3) When the WSDL Grabber component receives a service page link from its previous component, it sets out to extract the WSDL link address of that service from the parse tree of the data in this service page. Next, this component downloads the WSDL document of that service via the WSDL link address; thereafter, the obtained WSDL document is stored on disk. The process of the WSDL Grabber component is carried on continually until no more service links are passed to it. Certainly, not all grabbed WSDL documents are valid: they may contain bad definitions or bad namespace URIs, or be empty documents; even worse, a document may not be in XML format at all. Hence, in order to sort them out, this component further analyzes the obtained WSDL documents and puts all valid documents into a "validWSDLs" folder, whereas the invalid documents are put into a folder named "invalidWSDLs" in order to gather statistical information. Finally, only the WSDL documents in the "validWSDLs" folder are passed to the subsequent component.
(4) Moreover, since some Web Service Registries give additional information about the services, such as availability, service provider and version, the Property Grabber component sets out to extract this information as the service's properties and thereafter saves these properties into an INI file. However, if no additional information is available, there is no need to extract service properties, and thus there is no INI file for that service. In this implementation of the Pica-Pica Web Service Description Crawler, only the Python scripts for the Seekda and Service-Repository Web Service Registries have functions to extract the services' properties, while the scripts for the other three Web Service Registries have no such functions.
(5) Furthermore, it is optional to create a report file which contains the statistical information of this process, such as the total number of services of one Web Service Registry, the number of services whose WSDL document is invalid, etc.
(6) As stated above, at this point there is a folder with all valid WSDL documents, and there may also be some INI files. Therefore, the task of the WSML Register component is now to generate the appropriate WSML documents from these valid WSDL documents and INI files, and then to register them in ConQo.
2.4 Conclusions of the Existing Strategies
This chapter presented three aspects of the existing strategies: the Service-Finder project, Information Extraction techniques, and the Pica-Pica Web Service Description Crawler.
The task of this master thesis is to obtain the available Web services and their related information from the Web. This is in fact a procedure of extracting the needed information about each service, such as the service's WSDL document and its properties. Therefore, the Information Extraction techniques used to extract information hosted in the Web can be applied in this master thesis.
Moreover, Service-Finder is a large project that is not only able to obtain the available Web services and their related information from the Web, but also enriches the crawled data with annotations by means of the Service-Finder ontology and the Service Category ontology and integrates all the information into a coherent semantic model. Furthermore, it also provides capabilities for searching and browsing the data through a user interface and gives users service recommendations. However, for a master thesis the Service-Finder project far exceeds the requirements; therefore, it is only considered as a reference for this work.
Furthermore, since the Pica-Pica Web Service Description Crawler aims only at obtaining the available Web services and their related information, it fulfills the primary task of this master thesis. Nevertheless, regarding the information about each service, the Pica-Pica Web Service Description Crawler extracts only a few properties, and sometimes none at all. Consequently, in order to improve the quality of the service information, as many properties about each service as possible have to be extracted. Hence, chapter 3 presents an extension of this Pica-Pica Web Service Description Crawler.
3 Design and Implementation
In the previous chapter on the state of the art, the already existing techniques and implementations were presented. In the following, the basic principle of the proposed approach, the Deep Web Services Crawler, is introduced; it builds on these previously existing techniques, especially the Pica-Pica Web Service Description Crawler.
3.1 Deep Web Services Crawler Requirements
This section discusses the goals of the Deep Web Service Crawler approach, the system requirements of the approach, and some non-functional requirements.
3.1.1 Basic Requirements for DWSC
The following are the basic requirements which should be achieved:
(1) Produce the largest annotated Service Catalogue. Here, the Service Catalogue is a list of the Web services published in the Web Service Registries. It contains not only the WSDL document of each service but also a file with the information about that service. Therefore, for the purpose of producing the largest annotated Service Catalogue, the proposed approach needs to extract as many properties about those Web services as possible. Moreover, it also has to download the WSDL document hosted along with each Web service. That is to say, these properties comprise not only the interesting structured properties, such as the service name and its WSDL link address, but also other non-functional properties, for example the endpoint and its monitoring information.
(2) Flexible storage: each service has a Service Catalogue entry that contains all its interesting properties. A central question is how to deal with these service properties, that is, which schemes should be used to store them. Hence, in order to store them in a flexible way, the proposed approach provides three methods for storage: the first stores them as an XML file, the second stores them in an INI file, and the third uses a database.
3.1.2 System Requirements for DWSC
Generally speaking, the requirements for realizing a programming project comprise the following:
1) Operating system: Linux/Unix, Windows (XP, Vista, 2000, etc.)
2) Programming language: C++, Java, Python, C, etc.
3) Programming tool: NetBeans, Eclipse, Visual Studio, and so on
In this master thesis, the scripts of the Deep Web Service Crawler approach are written in the Java programming language. Besides, these scripts have only been tested on the Windows XP and Linux operating systems and not on any other operating system.
3.1.3 Non-Functional Requirements for DWSC
In this part, several non-functional requirements for the Deep Web Service Crawler approach are presented:
1) Transparency: the process of data exploration and data storage should run automatically, without the user's intervention. However, at the beginning the user has to specify the path on the hard disk which will be used to store the outputs of this program.
2) Fault tolerance: during the execution of this program, errors inevitably happen. Therefore, in order to keep the process from being interrupted, the necessary error handling for process recovery must be in place.
3) Completeness: this approach should extract as many of the interesting properties of each Web service as possible, e.g., endpoint and monitoring information.
In addition, since the Pica-Pica Web Service Crawler has already implemented strategies for the following five URLs, the proposed approach must support no fewer than these five Web Service Registries:
Biocatalogue: http://www.biocatalogue.com
Ebi: http://www.ebi.ac.uk
Seekda: http://www.seekda.com
Service-Repository: http://www.service-repository.com
Xmethods: http://www.xmethods.net
3.2 Deep Web Services Crawler Architecture
In this section, an overview of the high-level architecture of the Deep Web Services Crawler approach is given first. Thereafter, four subsections follow that outline each single component and how the components play together.
The current components and data flows in the Deep Web Service Crawler can be summarized as depicted in Figure 3-1 by the continuous arrows. The crawler first tries to obtain the available service list page links and related service page links by crawling the Web (Web Service Extractor). Then the gathered links are processed in two separate steps: one obtains the service's WSDL document (WSDL Grabber), and the other collects the properties of each service (Property Grabber). Finally, all these data are stored on the storage device (Storage). The whole detailed process of figure 3-1 is illustrated as follows:
Step 1:
When the Deep Web Service Crawler starts to run, the File Chooser container requires the user to specify a path on the computer or on any other hard disk. The reason for specifying the path is that the Deep Web Service Crawler program needs a place to store all its outputs.
Step 2:
After that, the Web Service Extractor is triggered. It is the main entry point to the actual crawling process. Since the Deep Web Service Crawler is a program that is supposed to crawl for Web Services in given Web Service Registries, the URL addresses of these Web Service Registries are given as the initial seeds for this Web Service Extractor process. However, since the page structures of these Web Service Registries are completely different, there is a dedicated process for each Web Service Registry.
Figure 3-1: Overview of the Basic Architecture of the Deep Web Services Crawler
Step 3:
According to the given seed, two types of links are obtained by the Web Service Extractor component: the service list page link and the service page link. A service list page is a page that contains a list of Web Services, and possibly some information about these Web Services, while a service page is a page that contains much more information about a single service. Finally, the component forwards these two types of links to the next two components: the Property Grabber and the WSDL Grabber.
Step 4:
On the one hand, the Property Grabber component tries to gather the information about the service that is hosted in the service list page and the service page, such as the name of the service, its description, and its rating. Finally, all the information about the service is collected together as the service properties, which are then delivered to the Storage component for further processing.
Step 5:
On the other hand, the WSDL Grabber tries to obtain the WSDL link from the service list page or the service page. This is because for some Web Service Registries the WSDL link is hosted in the service list page, as in Biocatalogue, while for the other Web Service Registries it is hosted in the service page, as in Xmethods. After the WSDL link has been obtained, it is likewise transmitted to the Storage component for further processing.
Step 6:
When the service properties and the WSDL link of the service are received by the Storage component, it stores them on disk. The service properties are stored in three different ways: as an XML file, as an INI file, or as a record in a database table. For the WSDL link, the Storage component first tries to download the page content according to the URL address of the WSDL link. If this succeeds, the page content of the service is stored as a WSDL document on disk.
Step 7:
Steps 3 to 6 describe the crawling process for a single service. Hence, if there is more than one service, or more than one service list page, in the given Web Service Registries, the crawling process from step 3 to step 6 is repeated again and again until no service or service list page is left in those Web Service Registries.
Step 8:
Furthermore, after the crawling process for one Web Service Registry has finished, a file is generated that contains statistical information about this crawling process, for example: the times when the crawling process of this Web Service Registry started and finished, the total number of Web services in this Web Service Registry, how many services have an empty WSDL document, the average number of service properties in this Web Service Registry, and the average time cost for extracting the service properties, getting the WSDL document, and generating the XML file, INI file, etc.
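The eight steps above can be summarized in the following compact Java sketch. All class names, method names and return values are illustrative stand-ins for the actual components and are not taken from the thesis's real source code:

    import java.util.Arrays;
    import java.util.List;

    public class CrawlerSketch {

        // Step 3: a registry-specific parser would return pairs of
        // (service list page link, service page link) here.
        static List<String[]> extractLinks(String seed) {
            return Arrays.asList(new String[][] {
                { seed, seed + "/serviceoverview-210897616" }
            });
        }

        // Step 4: extract name, description, rating, ... from both pages.
        static String grabProperties(String listPage, String servicePage) {
            return "Service Name=BLZService";
        }

        // Step 5: find the WSDL link in the service page (or list page).
        static String grabWsdlLink(String listPage, String servicePage) {
            return "http://www.thomas-bayer.com/axis2/services/BLZService?wsdl";
        }

        // Step 6: write the XML/INI/database record and download the WSDL document.
        static void store(String path, String properties, String wsdlLink) {
            System.out.println("storing " + properties + " and " + wsdlLink + " under " + path);
        }

        public static void main(String[] args) {
            String path = args.length > 0 ? args[0] : "./output";  // Step 1: user-chosen path
            String seed = "http://www.service-repository.com";     // Step 2: one seed per registry
            for (String[] links : extractLinks(seed)) {            // Step 7: repeat per service
                store(path, grabProperties(links[0], links[1]),
                      grabWsdlLink(links[0], links[1]));
            }
            // Step 8: a statistics file per registry would be written here.
        }
    }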
3.2.1 The Function of the Web Service Extractor Component
The Web Service Extractor component is responsible for gathering service list page links and service page links. It pursues a focused crawl of the Web and only forwards service list page and service page links to the subsequent components for analysis and collection purposes. It therefore identifies both the service list page links and the related service page links in these Web Service Registries.
As can be seen from figure 3-2, a crawl for Web Services needs to start from a URL seed. The seed is almost as important as the Web Service Extractor itself, as it highly influences the part of the Web which needs to be crawled. The seed can or shall contain, e.g., Web pages where Web Services are published or which talk about Web Services.
Figure 3-2: Overview of the process flow of the Web Service Extractor Component
After being fed with the URL seed, the Web Service Extractor component starts to get the link of the service list page from the initial page of this URL seed. However, this process differs among the five Web Service Registries. The following shows the different situations in these Web Service Registries:
- Service-Repository Web Service Registry
In this Web Service Registry, the link of the first service list page is the URL address of its seed, which means that some Web Services can already be found on the home page of the Service-Repository Web Service Registry. The word "first" here implies that there is more than one service list page link in this Web Service Registry. Therefore, the process of getting service list page links in this registry is carried on continually until no more service list page links exist.
- Xmethods Web Service Registry
Although there are Web Services on the home page of the Xmethods Web Service Registry, they are only a small subset of the Web Services in this registry. Moreover, in the Xmethods Web Service Registry there is a single page containing all Web Services. Therefore, the service list page link of that page has to be obtained.
- Ebi Web Service Registry
The situation in the Ebi Web Service Registry is a little like that in the Xmethods Web Service Registry. That is to say, there is also one page that contains all Web Services of this Web Service Registry. However, this page is not the initial page of the input seed, so more than one operation step is needed to get the service list page link of that page.
- Seekda Web Service Registry
In the Seekda Web Service Registry, the Web Services are not contained in the initial page of the input seed; the service list page link can only be obtained after several additional operation steps. Moreover, there is a problem with getting the service list page links in this registry. Simply put, if there is more than one page containing Web Services, then for some unknown reason the links of the remaining service list pages cannot be obtained; in other words, only the link of the first service list page can be obtained.
- Biocatalogue Web Service Registry
The process of getting service list pages in the Biocatalogue Web Service Registry is almost the same as in the Seekda Web Service Registry. The only difference is that in the Biocatalogue Web Service Registry all service list page links can be obtained if there is more than one service list page.
After getting the link of a service list page, the Web Service Extractor begins to get the link of the service page of each service listed in the service list page. This is possible because there is an internal link for every service which points to its service page. It is worth noting that as soon as a service page link is obtained, this component immediately forwards the service page link and the service list page link to the subsequent two components for further processing. Nevertheless, the process of obtaining service page links is carried out continuously until all services listed in that service list page have been crawled. Analogously, the process of getting service list pages is carried out continuously until no more service list pages exist.
3.2.1.1 Features of the Web Service Extractor Component
The main features are described in the following paragraphs
1) Obtain service list page links
A central task is to obtain the corresponding service list page links. A service list page link is a URL address that leads to a public list of Web Services together with some simple information about these Web services, such as the name of the service, an internal URL that links to another page containing detailed information about that service, and sometimes the link address of the WSDL document.
2) Obtain service page links
Once service list page links are found, a crucial aspect is to extract the internal link of each service to aid the task of service information discovery. Therefore, it is the task of the Web Service Extractor to harvest the HTML page content of the service list page, so that the service page links, which lead to much more detailed information about a single Web service, can be obtained.
3.2.1.2 Input of the Web Service Extractor Component
This component depends on specific input seeds. The only input required for this component is a URL seed, which is one of the URLs listed in section 3.1.3.
3.2.1.3 Output of the Web Service Extractor Component
The component produces two types of service-related page links from the Web:
- Service list page links
- Service page links
3.2.1.4 Demonstration of the Web Service Extractor
In order to give a comprehensive understanding of the process of the Web Service Extractor component, the following figures serve as explanation. Though there are five URL addresses in this section, only the URL of the Service-Repository is shown as an example.
1) The input seed is the initial URL address of the Service-Repository, which is "http://www.service-repository.com".
2) As already stated in section 3.2.1, the first service list link of this Web Service Registry is its input seed, "http://www.service-repository.com". Figure 3-3 shows the corresponding service list page of that link.
Figure 3-3: Service list page of the Service-Repository
Figure 3-4: Original source code of the internal link of the Web service "BLZService"
Figure 3-5: Code overview of getting the service page link in the Service-Repository
Figure 3-6: Service page of the Web service "BLZService"
3) Now that the service list page link is known, the next step is to acquire the service page links of the services listed in the service list page. The text in the red box of figure 3-4 shows the internal link of the Web service "BLZService". However, this is not the complete link of the service page: it has to be prefixed with the initial URL address of the Service-Repository Web Service Registry, "http://www.service-repository.com". The code for getting the service page link of a Web service in the Service-Repository is shown in figure 3-5. The final link of this service page is therefore "http://www.service-repository.com/serviceoverview-210897616". Figure 3-6 shows the corresponding service page of that link.
4) Afterwards, the two links gathered by the Web Service Extractor component, the service list page link and the service page link, are immediately forwarded to the next two components: the WSDL Grabber component and the Property Grabber component.
3.2.2 The Function of the WSDL Grabber Component
The WSDL Grabber component acquires the WSDL link hosted in the Web based on the service list page link or the service page link. The whole process flow is illustrated in figure 3-7.
Figure 3-7: Overview of the process flow of the WSDL Grabber Component
When the WSDL Grabber component receives the inputs delivered from the previous component, it starts to get the WSDL link of the service based on these inputs. Although the inputs of the WSDL Grabber component are the links of both the service page and the service list page, only one of them contains the WSDL link of the corresponding service; that is to say, the WSDL link exists either in the service page or in the service list page. The reason why both links need to be delivered to this component is that exactly one of the five Web Service Registries, namely Biocatalogue, hosts the WSDL link in the service list page, while for the other four Web Service Registries the WSDL link is hosted in the service page. Therefore, the WSDL links of these four Web Service Registries are obtained via the service page link, and for the Biocatalogue Web Service Registry the WSDL link is obtained through the service list page link. However, there is a problem with getting the WSDL links in the Biocatalogue Web Service Registry: in short, some of the Web services listed in the service list pages of the Biocatalogue Web Service Registry have no WSDL link, in other words these services have no WSDL document. In such a situation, the WSDL link of the Web service in question is assigned a "NULL" value. For the Web Services in the other four Web Service Registries, however, the WSDL link always exists in the service page. Eventually, whenever the WSDL Grabber component has extracted the WSDL link of a single Web service, the link is immediately forwarded to the Storage component so that the WSDL document can be downloaded at once.
3.2.2.1 Features of the WSDL Grabber Component
The WSDL Grabber component has to provide the following feature:
- Obtain WSDL links
The WSDL link is the direct way to reach the page that contains the contents of the WSDL document. It is actually a URL address, but at the end of this URL address there is usually something like "?wsdl" or "?WSDL" to indicate that this address leads to the page of a WSDL document.
3.2.2.2 Input of the WSDL Grabber Component
This component requires the following input data:
- Service list page link
- Service page link
3.2.2.3 Output of the WSDL Grabber Component
The component produces only the following output data:
- The URL address of the WSDL link of each service
3.2.2.4 Demonstration of the WSDL Grabber Component
This section presents a list of figures in order to give a comprehensive understanding of the process of the WSDL Grabber component. It uses the same Web service of the Service-Repository as an example.
1) The input of this WSDL Grabber component is the link of the service page obtained from the Web Service Extractor component. The address of this link is "http://www.service-repository.com/serviceoverview-210897616".
Figure 3-8: WSDL link of the Web service "BLZService" in the service page
2) Figure 3-9 shows the corresponding original source code of the WSDL link highlighted in figure 3-8.
Figure 3-9: Original source code of the WSDL link of the Web service "BLZService"
3) Figures 3-10 and 3-11 show the code used to extract the WSDL link shown in figure 3-9. Figure 3-10 shows the particular code for the Service-Repository Web Service Registry only; for the other four Web Service Registries this code is different. The function "getServiceRepositoryWSDLLink" first gets a list of nodes that have the HTML tag name "b". Then it checks all of the nodes one by one to see if the text value of a node is "WSDL". If a node fulfills this condition, the attribute value of its sibling, which is an "a" element here, is extracted as the value of the WSDL link of this Web service. (A hedged re-creation of this logic follows this demonstration.)
Figure 3-10: Code overview of the "getServiceRepositoryWSDLLink" function
Figure 3-11: Code overview of the "oneParameter" function
4) Finally, after applying the functions shown in figures 3-10 and 3-11, the WSDL link of the Web service "BLZService" is obtained. Its value is "http://services.unitedplanet.de/blz/BlzService.asmx?WSDL".
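The logic described in step 3 could be re-created roughly as in the following sketch, which uses the jsoup library for HTML parsing; the thesis's actual parser and exact code are only shown in figures 3-10 and 3-11, so everything here is an approximation:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class WsdlLinkSketch {
        // Walk all <b> nodes; where the text value is "WSDL", the WSDL link is
        // taken from the attribute of the sibling <a> element.
        static String getServiceRepositoryWsdlLink(Document servicePage) {
            for (Element b : servicePage.select("b")) {
                if ("WSDL".equals(b.text().trim())) {
                    Element sibling = b.nextElementSibling();
                    if (sibling != null && "a".equals(sibling.tagName())) {
                        return sibling.attr("href");
                    }
                }
            }
            return null; // no WSDL link found in this service page
        }

        public static void main(String[] args) throws Exception {
            Document page = Jsoup.connect(
                "http://www.service-repository.com/serviceoverview-210897616").get();
            System.out.println(getServiceRepositoryWsdlLink(page));
        }
    }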
3.2.3 The Function of the Property Grabber Component
The Property Grabber component is a module used to extract and gather all the Web service information hosted in the Web, which is in fact the information shown in the service list page and the service page. In the end, all the obtained Web service information is collected together as the service properties, which are delivered to the Storage component for storing. The detailed process flow of the Property Grabber component is illustrated in figure 3-12.
The inputs of the Property Grabber component are the same as the inputs of the WSDL Grabber component. However, there is still a small difference between them with respect to the seeds. As already mentioned in section 3.2.2, for the WSDL Grabber component one of the inputs is sufficient to get the WSDL link, whereas the Property Grabber component needs both of them as seeds. Once the Property Grabber component receives the needed inputs, it starts to extract the information of that single Web service.
Figure 3-12: Overview of the process flow of the Property Grabber Component
After the Property Grabber component receives the inputs, it starts to extract the service information of the Web service. Generally speaking, the service information consists of four aspects: structured information, endpoint information, monitoring information and Whois information.
(1) Structured Information
The structured information can be obtained by extracting the information hosted in the service page and the service list page. It is the basic descriptive information about the service, such as the name of the service, the URL address through which the WSDL document can be obtained, the description introducing the service, the provider who offers the service, its rating, and the server which hosts the service. However, the elements constituting this structured information differ among the Web Service Registries. For example, the rating information of a Web service exists in the Service-Repository Web Service Registry, while the Xmethods Web Service Registry does not have this information. In addition, even for Web services in the same Web Service Registry, some elements of the structured information may not exist. For instance, one service in a Web Service Registry may have a description of the service, while another service in the same Web Service Registry does not. Tables 3-1 to 3-5 list the structured information that should be extracted from these five Web Service Registries. Moreover, if the style of a Biocatalogue Web service is SOAP, there is some additional information describing the SOAP operations of this service, while if the style is REST, this additional information concerns the REST operations. These should also be considered a part of the structured information. Table 3-6 and table 3-7 list the information for these two different operation types.
Service Name, WSDL Link, WSDL Version, Provider, Server, Rating, Homepage, Owner Homepage, Description
Table 3-1: Structured Information of the Service-Repository Web Service Registry
Service Name, WSDL Link, Provider, Service Style, Homepage, Implementation Language, Description, User Description, Contributed Client Name, Type of this Client, Publisher of this Client, Used Toolkit of this Client, Used Language of this Client, Used Operating System of this Client
Table 3-2: Structured Information of the Xmethods Web Service Registry
Service Name, WSDL Link, Server, Provider, Provider's Country, Service Style, Rating, Description, User Description, Service Tags, Documentation (within WSDL)
Table 3-3: Structured Information of the Seekda Web Service Registry
Service Name, WSDL Link, Port Name, Service URL Address, Implementation Class
Table 3-4: Structured Information of the Ebi Web Service Registry
Service Name, WSDL Link, Style, Provider, Provider's Country, View Times, Favorite Times, Submitter, Service Tags, Total Annotation, Provider Annotation, Member Annotation, Registry Annotation, Base URL, SOAP Lab Server Base URL, Description, User Description, Category
Table 3-5: Structured Information of the Biocatalogue Web Service Registry
SOAP Operation Name, Inputs and Outputs, Operation Description, Operation Tags, Part of Which Service
Table 3-6: SOAP Operation Information of the Biocatalogue Web Service Registry
REST Operation Name, Service Tags, Used Template, Operation Description, Part of Which Service, Part of Which Endpoint Group
Table 3-7: REST Operation Information of the Biocatalogue Web Service Registry
(2) Endpoint Information
Endpoint information is the port information of the Web service, which can be extracted only from the service page. However, different Web Service Registries structure the endpoint information of a Web service differently, so some elements of the endpoint information can be very diverse. One thing has to be noted: the Ebi Web Service Registry has no endpoint information for any of the Web services published in it. Moreover, even though the Web services within one Web Service Registry share the same structure of endpoint information, some elements of the endpoint information may be missing or empty, and the registries may even have no endpoint information at all for some of their Web services. Nevertheless, whenever there is endpoint information for a Web service, there is at least one element, namely the URL address of the endpoint. Table 3-8 shows the endpoint information that should be extracted from these five Web Service Registries.
Web Service Registry Name | Elements of the Endpoint Information
Service-Repository | Endpoint Name, Endpoint URL, Endpoint Critical, Endpoint Type, Bound Endpoint
Xmethods | Endpoint URL, Publisher of this Endpoint, Contact Email of this Publisher, Implementation Language of this Endpoint
Seekda | Endpoint URL
Biocatalogue | Endpoint Name, Endpoint URL
Table 3-8: Endpoint Information of these five Web Service Registries
Web Service Registry Name | Elements of the Monitoring Information
Service-Repository | Service Availability, Number of Downs, Total Uptime, Total Downtime, MTTR, MTBF, RTT Max of Endpoint, RTT Min of Endpoint, RTT Average of Endpoint, Ping Count of Endpoint
Seekda | Service Availability, Begin Time of Monitoring
Biocatalogue | Monitored Status of Endpoint, Overall Status of Service
Table 3-9: Monitoring Information of these five Web Service Registries
(3) Monitoring Information
Monitoring information is the measured statistical information about the Web service. It is worth noting that the Ebi and Xmethods Web Service Registries do not have monitoring information for any of the Web services published by them, while in the other three Web Service Registries only a few Web services may lack this information. Table 3-9 displays the monitoring information for these three Web Service Registries.
(4) Whois Information
Whois information is not extracted from the information hosted in the service page and the service list page. It is descriptive information about the service domain, which can be gained by means of the address of the WSDL link. Because of that, the process of getting the Whois information starts with obtaining the service domain. The final value of the service domain must not contain strings like "http", "https" or "www"; it must be the registrable name directly under the top-level domain (a small sketch of this domain-derivation step follows this enumeration). After that, the service domain database is queried by sending the value of the service domain to a Whois client, which is simply a Web site on the Internet, for example "http://www.whois365.com/cn/domain". If information about that service domain exists, a list of its information is returned as output. However, the structure of the returned information differs from service domain to service domain; therefore, the most challenging part is that the extraction process has to deal with each different form of the returned information. Table 3-10 lists the Whois information that should be extracted for all five Web Service Registries.
Service Domain URL, Domain Name, Domain Type, Domain Address, Domain Description, State, Postal Code, City, Country, Country Code, Phone, Fax, Email, Organization, Established Time
Table 3-10: Whois Information for these five Web Service Registries
Finally, the information of these four aspects is collected together and then delivered to the Storage component for further storage processing.
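As referenced in (4), deriving the service domain from the WSDL link might look like the following minimal Java sketch; the "www." stripping heuristic is simplified and purely illustrative:

    import java.net.URI;

    public class ServiceDomainSketch {
        static String serviceDomain(String wsdlLink) {
            String host = URI.create(wsdlLink).getHost(); // e.g. "www.thomas-bayer.com"
            if (host.startsWith("www.")) {
                host = host.substring(4);                 // drop the "www." prefix
            }
            return host;                                  // e.g. "thomas-bayer.com"
        }

        public static void main(String[] args) {
            System.out.println(serviceDomain(
                "http://www.thomas-bayer.com/axis2/services/BLZService?wsdl"));
        }
    }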
3.2.3.1 Features of the Property Grabber Component
The Property Grabber component has to provide the following features:
- Obtain basic information
Generally speaking, the more information is available about a Web service, the better one can judge how good that Web service is. Hence, it is necessary for this Property Grabber component to extract all the basic information hosted in the service list page and the service page. This basic information comprises the structured information, the endpoint information and the monitoring information.
- Obtain Whois information
For the same reason, namely that more information about a Web service allows a better assessment of its quality, it is necessary to extract as much information about the Web service as possible. Therefore, besides the basic information, the Property Grabber component also obtains some additional information called Whois information, such as the type of the domain, the name of the person in charge, the postal code of the domain, the city, phone, fax, detailed address, email, etc.
3.2.3.2 Input of the Property Grabber Component
This component requires the following input data:
- Service list page link
- Service page link
3.2.3.3 Output of the Property Grabber Component
The component produces the following output data:
- Structured information of each service
- Endpoint information of each service, if it exists
- Monitoring information of the service and endpoint, if it exists
- Whois information of the service domain
All this information is collected together as the properties of each service. Thereafter, the collected properties are sent to the Storage component.
3.2.3.4 Demonstration of the Property Grabber Component
Figures 3-13 to 3-16 depict the fundamental procedure of the Property Grabber component. To simplify the explanation, the example shown in this section uses the same Web service as before.
1) The inputs of the Property Grabber component are the links of the service list page and the service page received from the Web Service Extractor component. These links are "http://www.service-repository.com" and "http://www.service-repository.com/serviceoverview-210897616".
2) When the "GetProperty" function is triggered, it calls its four sub-functions, as shown in figure 3-12.
Figure 3-13: Structured properties of the service "BLZService" in the service list page
Figure 3-14: Structured properties of the service "BLZService" in the service page
3) Firstly, the "getStructuredProperty" function tries to extract the structured information displayed in the red boxes of figure 3-13 and figure 3-14. However, several elements of the structured information have the same content, like the description shown in the service page and in the service list page. Hence, in order to save extraction time and storage space, elements with the same content are only extracted once. Moreover, the rating information needs a transformation from non-descriptive to descriptive text, because its content consists of several star images. The final results of the extracted structured information of this Web service are shown in table 3-11. Because there is no descriptive information for the provider, homepage and owner homepage, their values are assigned "NULL".
Service Name: BLZService
WSDL Link: http://www.thomas-bayer.com/axis2/services/BLZService?wsdl
WSDL Version: 0
Server: Apache-Coyote/1.1
Description: BLZService
Rating: Four Stars and a Half
Provider: NULL
Homepage: NULL
Owner Homepage: NULL
Table 3-11: Extracted Structured Information of the Web Service "BLZService"
4) Secondly, the "getEndpointProperty" function extracts the endpoint information displayed in the red box of figure 3-15. Though there is more than one endpoint record in the list, only one of them is extracted as the endpoint information. The reason is that this program is intended to extract as much information as possible, but this information should not contain redundancy. Therefore, only one record is extracted as the endpoint information, even if there are several endpoint records. Table 3-12 shows the final results of the endpoint information of this Web service.
Figure 3-15: Endpoint information of the Web service "BLZService" in the service page
Endpoint Name: BLZServiceSOAP12port_http
Endpoint URL: http://www.thomas-bayer.com:80/axis2/services/BLZService
Endpoint Critical: True
Endpoint Type: production
Bound Endpoint: BLZServiceSOAP12Binding
Table 3-12: Extracted Endpoint Information of the Web Service "BLZService"
5) Then it is time to extract the monitoring information by invoking the function "getMonitoringProperty". Figure 3-16 displays two types of monitoring properties: the upper red box contains the monitoring information about the Web service, and the lower red box lists the monitoring information of the endpoints. As already mentioned, only one endpoint statistics record is extracted. Besides, as can be seen in figure 3-16, there are two availability values; they both represent the availability of this Web service, just like the availability shown in figure 3-14, so one of these availability values is sufficient. Table 3-13 shows the final results of this extraction process.
Figure 3-16: Monitoring Information of the service "BLZService" in the service page
Service Availability: 100%
Number of Downs: 0
Total Uptime: 1 day, 19 hours, 19 minutes
Total Downtime: 0 seconds
MTBF: 1 day, 19 hours, 19 minutes
MTTR: 0 seconds
RTT Max of Endpoint: 141 ms
RTT Min of Endpoint: 0 ms
RTT Average of Endpoint: 57.7 ms
Ping Count of Endpoint: 112
Table 3-13: Extracted Monitoring Information of the Web Service "BLZService"
6) Next, in order to extract the Whois properties, the "getWhoisProperty" function first has to derive the service domain from the WSDL link. For this Web service, the obtained service domain is "thomas-bayer.com". It then sends this service domain as input to the Whois client for the querying process, which returns a list of information about that service domain; see figure 3-17. Table 3-14 shows the extracted Whois information.
Figure 3-17: Whois Information of the service domain "thomas-bayer.com"
Service Domain URL: thomas-bayer.com
Domain Name: Thomas Bayer
Domain Type: NULL
Domain Address: Moltkestr. 40
Domain Description: NULL
State: NULL
Postal Code: 54173
City: Bonn
Country: NULL
Country Code: DE
Phone: +4922855525760
Fax: NULL
Email: info@predic8.de
Organization: predic8 GmbH
Established Time: NULL
Table 3-14: Extracted Whois Information of the service domain "thomas-bayer.com"
7) Finally, the information of these four aspects is collected together as the service properties, and these service properties are then forwarded to the Storage component.
3.2.4 The Function of the Storage Component
The Storage component uses the WSDL link from the WSDL Grabber component to download the WSDL document from the Web and thereafter stores it on disk. In addition, the service properties from the Property Grabber component are also stored directly on disk, in three different manners, by this Storage component. Figure 3-18 illustrates the process flow of the Storage component.
When the Storage component receives the needed inputs, the mediator function "Storager" of the Storage component is triggered. It then transforms the service properties into three different output formats and stores them on disk; these output formats are an XML file, an INI file and database records. Besides, it also tries to download the WSDL document from the URL address of the WSDL link and then stores the obtained WSDL document on disk as well. This "Storager" function is composed of four sub-functions, namely the "getWSDL", "generateXML", "generateDatabase" and "generateINI" sub-functions, each of which is in charge of one aspect of the storage tasks.
Figure 3-18: Overview of the process flow of the Storage Component
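A hypothetical outline of this mediator and its four sub-functions is given below; all signatures are illustrative stand-ins, and the individual tasks are described in detail in (1) to (4) afterwards:

    import java.nio.file.Path;
    import java.util.Map;

    public class StoragerSketch {
        // Fan the received inputs out to the four storage tasks.
        static void storager(Path path, String serviceName, String wsdlLink,
                             Map<String, String> properties) {
            getWsdl(path, serviceName, wsdlLink);         // (1) download the WSDL document
            generateXml(path, serviceName, properties);   // (2) properties as an XML file
            generateIni(path, serviceName, properties);   // (3) properties as an INI file
            generateDatabase(serviceName, properties);    // (4) properties as a database record
        }

        static void getWsdl(Path path, String name, String link) { /* see (1) below */ }
        static void generateXml(Path path, String name, Map<String, String> p) { /* see (2) below */ }
        static void generateIni(Path path, String name, Map<String, String> p) { /* see (3) below */ }
        static void generateDatabase(String name, Map<String, String> p) { /* see (4) below */ }
    }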
(1) "getWSDL" sub-function
The task of the "getWSDL" sub-function is to download the WSDL document and then store it on disk. Above all, it has to get the content of the WSDL document. This procedure is done as follows. First, the "getWSDL" sub-function checks whether the value of the received WSDL link equals "NULL". As already presented in section 3.2.2, if a Web service has no WSDL link, the value of its WSDL link is assigned "NULL". In a case like that, the sub-function creates a WSDL document whose name is the service name appended with the mark "No WSDL Document"; obviously, this document does not contain any content, it is an empty document. If the service does have a WSDL link, this sub-function tries to connect to the Internet using the URL address of the WSDL link. Once it succeeds, the content hosted in the Web is downloaded, stored on disk and named only with the name of the service. Otherwise, it creates a WSDL document whose name is prefixed with "Bad" before the service name. (A hedged sketch of this behavior follows.)
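The behavior just described might be sketched in Java as follows; the file-naming details are simplified, and the actual implementation is the one shown in figure 3-19:

    import java.io.InputStream;
    import java.net.URL;
    import java.nio.file.Files;
    import java.nio.file.Path;

    public class GetWsdlSketch {
        static void getWsdl(Path dir, String serviceName, String wsdlLink) throws Exception {
            if (wsdlLink == null || "NULL".equals(wsdlLink)) {
                // no WSDL link: create an empty marker document
                Files.write(dir.resolve(serviceName + " No WSDL Document"), new byte[0]);
                return;
            }
            try (InputStream in = new URL(wsdlLink).openStream()) {
                // download succeeded: store the content under the service name
                Files.copy(in, dir.resolve(serviceName));
            } catch (Exception e) {
                // download failed: create a document prefixed with "Bad"
                Files.write(dir.resolve("Bad" + serviceName), new byte[0]);
            }
        }
    }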
(2) "generateXML" sub-function
The "generateXML" sub-function takes the service properties as input, transforms them into an XML file and stores it on disk under the name of the service plus ".xml". XML stands for eXtensible Markup Language, a markup language designed to transport and store data. The first line of an XML file is the XML declaration, which defines the XML version and the encoding used. For example, '<?xml version="1.0" encoding="UTF-8"?>' means this is an XML file whose version is 1.0 and whose encoding is the 8-bit Unicode Transformation Format character set. Besides, an XML file contains XML elements, where an element spans everything from its start tag to its end tag. Moreover, an XML element can also contain other elements, simple text, or a mixture of both. However, an XML file must contain a root element as the parent of all other elements.
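As an illustration, the generated XML file for the "BLZService" example might look like the following; the element names are hypothetical, since the exact schema is only shown in the thesis's figures:

    <?xml version="1.0" encoding="UTF-8"?>
    <service>
        <name>BLZService</name>
        <wsdlLink>http://www.thomas-bayer.com/axis2/services/BLZService?wsdl</wsdlLink>
        <server>Apache-Coyote/1.1</server>
        <rating>Four Stars and a Half</rating>
    </service>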
(3) "generateINI" sub-function
The "generateINI" sub-function likewise takes the service's properties as input, but transforms them into an INI file and stores it on disk under the name of the service plus ".ini". "ini" stands for initialization; the INI file format is a de facto standard for configuration files. INI files are simple text files with a basic structure. Generally speaking, an INI file contains three different kinds of parts: sections, parameters and comments. The parameter is the basic element contained in an INI file. Its format is a key-value pair, also called a name-value pair; the pair is delimited by an equals sign "=", and the key or name always appears to the left of the equals sign. A section is like a room that groups all its parameters together. It always appears on a single line within a pair of square brackets "[ ]", and sections may not be nested. A comment is descriptive text which begins with a semicolon ";"; anything between the semicolon and the end of the line is ignored.
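An illustrative INI file for the same example could look as follows; the section and key names are hypothetical:

    ; service properties of BLZService (illustrative layout)
    [Structured Information]
    Service Name=BLZService
    Rating=Four Stars and a Half

    [Endpoint Information]
    Endpoint URL=http://www.thomas-bayer.com:80/axis2/services/BLZService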
(4) "generateDatabase" sub function
The inputs of the "generateDatabase" sub function are the same as those of the previous two sub functions. Instead of transforming them into a file such as XML or INI, however, the "generateDatabase" sub function turns them into database records by using SQL statements. SQL stands for Structured Query Language, a database language designed for accessing and manipulating data in a database. Most of the actions performed on a database are done with SQL statements, and the primary SQL statements include insert into, delete, update, select, create, alter, and drop. Therefore, for the purpose of transforming the service properties into database records, this sub function first has to create a database using the "create database" statement. Then it should create a table to store the data. A table is a collection of related data entries and consists of columns and rows. Since the data for all these five Web Service Registries are not very large, one database table is enough for storing the service properties. Because of that, the field names of the service properties in the table columns must be uniform and well-defined for all these five Web Service Registries. Afterwards, the service properties of each single service can be put into the table as one record with the SQL "insert into" statement.
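A minimal sketch of this step in Java, using JDBC, might look as follows. The JDBC URL, the database, table, and column names are assumptions for illustration; the thesis does not specify them, and the one-column-per-property layout follows the description above (with the "Text" data type, cf. figure 3-22).

// Illustrative sketch of the "generateDatabase" sub function via JDBC.
import java.sql.*;

public class DatabaseWriter {
    public static void main(String[] args) throws SQLException {
        // assumed connection data; the database itself would be created
        // beforehand with a "create database" statement
        try (Connection con = DriverManager.getConnection(
                 "jdbc:mysql://localhost/services", "user", "password");
             Statement st = con.createStatement()) {
            // one table is enough, since the data of all five registries are small
            st.executeUpdate("CREATE TABLE IF NOT EXISTS service_properties ("
                    + "id INTEGER PRIMARY KEY,"      // increasing Integer
                    + "Name TEXT, Provider TEXT, Availability TEXT)");
            // each service becomes exactly one record
            st.executeUpdate("INSERT INTO service_properties (id, Name, Provider, Availability) "
                    + "VALUES (1, 'ExampleService', 'example.com', '98%')");
        }
    }
}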
3.2.4.1 Features of the Storage Component
The Storage component has to provide the following features:
• Generate different output formats
The final result of this master program is to store the information of the services on disk for future work. This Storage component provides three different formats for storing the service properties of the services published in the Web Service Registries, which makes the storage of the services both flexible and durable.
• Obtain the WSDL document
An important task of this component is to check whether the WSDL document can be obtained from the WSDL link, because the WSDL document plays a decisive role in determining the quality of the service. This Storage component provides the ability to deal with the different situations that arise in the process of obtaining the WSDL document.
3.2.4.2 Input of the Storage Component
This component requires the following input data:
• the WSDL link of each service
• the property information of each service
3.2.4.3 Output of the Storage Component
The component produces the following output data:
• the WSDL document of each service
• an XML document, an INI file, and table records in the database
3.2.4.4 Demonstration of the Storage Component
The following figures present the fundamental implementation code of this Storage component. The detailed depiction is given below.
1) As can be seen from figure 3-19 to figure 3-23, the implementation code has several places in common. The first is the pair of parameters defined in each of these sub functions, namely "path" and "SecurityInt". The parameter "path" is an absolute path on the computer disk; it is used for the storing procedure and has already been specified by the user at the beginning of the whole program. The parameter "SecurityInt" is an increasing Integer which is used as part of the name of the service; the reason for this is that it prevents services with the same name from overwriting each other on disk. The code marked in red in these figures is the second common place; its function is to create a file or document for the service with the corresponding parameters.
2) Figure 3-19 displays the implementation code of the "getWSDL" sub function, which is designed to get the WSDL document of a service based on its WSDL link. The parameter "name" is the name of the service, which is used as the name of the WSDL document. The parameter "linkStr" is the actual WSDL link, the most important parameter for this sub function. The other two parameters, "statistic" and "log", are objects for the text files called "Statistic Information" and "Log Information" respectively. The "Statistic Information" text file is used to record the statistical data of the services in each Web Service Registry, such as the overall number of properties for that Web Service Registry, the overall number of services, the number of services that have no WSDL link, the number of services whose WSDL documents contain no content, the number of services whose WSDL links are not available, etc. The "Log Information" text file records the results of the processing steps and the problems encountered, for example which service is currently being crawled, which Web Service Registry it belongs to, the reason why the WSDL document of a service could not be obtained, the reason why a service is unreachable, and so on.
Figure 3-19 Implementation code for getting WSDL document
3) Figure 3-20 and figure 3-21 show the code for turning the service properties into the XML file and the INI file, and for storing those two files on disk thereafter. The parameter "vec" is a Vector of the "PropertyStruct" data type. This "PropertyStruct" data type is an object of a class consisting of two variables, name and value.
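A minimal sketch of what such a class might look like; the exact field and accessor names are assumptions, since only the two variables are described in the text.

// Illustrative sketch of the "PropertyStruct" data type:
// one service property, held as a name-value pair.
public class PropertyStruct {
    private String name;   // the name of the service property
    private String value;  // the value of the service property

    public PropertyStruct(String name, String value) {
        this.name = name;
        this.value = value;
    }

    public String getName()  { return name; }
    public String getValue() { return value; }
}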
Figure 3-20 Implementation code for generating XML file
Figure 3-21 Implementation code for generating INI file
4) The code in figure 3-22 and figure 3-23 shows the process that turns the service properties of all services in these five Web Service Registries into records of the database. To do so, it first has to create a database. The name of the database can be arbitrary as long as it conforms to the database naming rules; the same holds for the name of the table. Figure 3-22 displays the procedure of creating a table in the database. Because it is hard to decide on the length of each service property, the data types of all service property columns are set to "Text". Figure 3-23 shows the code for inserting the service properties into the table with the "insert into" statement.
Figure 3-22 Implementation code for creating table in database
Figure 3-23 Implementation code for generating table records
3.3 Multithreaded Programming for DWSC
Multithreaded programming is a built-in characteristic of the Java language. A multithreaded program contains two or more separate parts that can execute concurrently; each such part is called a thread. The use of multithreading makes it possible to create programs that use the system resources efficiently, for example by making maximum use of the CPU, because the idle time of the CPU can be kept to a minimum.
In this master program there are five Web Service Registries whose services need to be crawled. Moreover, the number of services published in each Web Service Registry differs considerably, so the running time spent on each Web Service Registry differs as well. Processed sequentially, a Web Service Registry that owns fewer services would have to wait until another Web Service Registry with many more services finishes. Therefore, in order to reduce the waiting time for the other Web Service Registries and to maximize the use of the system resources, it is necessary to apply multithreaded programming to this master program. That is to say, this master program creates a thread for each Web Service Registry, and these threads are executed independently.
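A minimal sketch of this one-thread-per-registry scheme follows; the crawlRegistry method is a hypothetical placeholder for the whole crawling procedure of one registry, not the original code.

// Illustrative sketch: one thread per Web Service Registry.
public class RegistryCrawler {
    private static final String[] REGISTRIES = {
        "Service Repository", "Ebi", "Xmethods", "Seekda", "Biocatalogue"
    };

    public static void main(String[] args) throws InterruptedException {
        Thread[] threads = new Thread[REGISTRIES.length];
        for (int i = 0; i < REGISTRIES.length; i++) {
            final String registry = REGISTRIES[i];
            threads[i] = new Thread(() -> crawlRegistry(registry));
            threads[i].start();   // all registries are crawled concurrently
        }
        for (Thread t : threads) {
            t.join();             // wait until every registry has finished
        }
    }

    private static void crawlRegistry(String registry) {
        // hypothetical: extract services, grab WSDL documents and properties,
        // and store the results, as described in section 3.2
    }
}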
3.4 Sleep Time Configuration for Web Service Registries
Because this master program is intended to download the WSDL documents and to extract the service information of the Web services published in the Web Service Registries, it inevitably affects the performance of these Web Service Registries. In addition, for the purpose of not exceeding their throughput capability, these Web Service Registries will surely restrict the rate of access. Because of that, unknown errors sometimes happen while this master program is executing: for instance, the master program may continually halt at one point without getting any more WSDL documents and service information, the WSDL documents of some services of some Web Service Registries cannot be obtained, or some service information goes missing. Therefore, in order to obtain the largest possible set of Web services published in these five Web Service Registries without affecting their throughput, the access rate has to be configured for each service of all Web Service Registries.
In consequence, before going into the essential procedure for each single service of these Web Service Registries, this master program calls the system's built-in function "sleep(long milliseconds)". It is a public static function which causes the currently executing thread to sleep for the specified number of milliseconds; in other words, it temporarily ceases execution for a while. The following table shows the time interval of the sleep function for each Web Service Registry.
Web Service Registry Name Time Interval (milliseconds)
Service Repository 8000
Ebi 3000
Xmethods 10000
Seekda 20000
Biocatalogue 10000
Table 3-15 Sleep Time of these five Web Service Registries
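A sketch of how this throttling could look in code, assuming the per-registry interval is looked up from a small table like Table 3-15 (the class and method names are illustrative):

// Illustrative sketch: throttle the crawler by sleeping before each service.
import java.util.HashMap;
import java.util.Map;

public class SleepConfig {
    private static final Map<String, Long> SLEEP_MILLIS = new HashMap<>();
    static {
        SLEEP_MILLIS.put("Service Repository", 8000L);
        SLEEP_MILLIS.put("Ebi", 3000L);
        SLEEP_MILLIS.put("Xmethods", 10000L);
        SLEEP_MILLIS.put("Seekda", 20000L);
        SLEEP_MILLIS.put("Biocatalogue", 10000L);
    }

    // Called before processing each single service of a registry.
    public static void throttle(String registry) throws InterruptedException {
        Thread.sleep(SLEEP_MILLIS.get(registry));
    }
}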
4 Experimental Results and Analysis
This chapter presents the quantitative experimental results of the prototype described in chapter 3, together with an analysis and explanation of these results. In order to obtain reasonably accurate results, the experiments were carried out more than five times; all data displayed in the following tables and charts are the averages of these runs.
4.1 Statistic Information for Different Web Service Registries
This section discusses the amount statistics of the Web services published in these five Web Service Registries: the overall number of Web services published in each Web Service Registry and the number of unavailable Web services, which have been archived because they may not be active anymore or are close to being inactive. Table 4-1 shows the service amount statistic of these five Web Service Registries.
Web Service Registry     Service Repository   Ebi   Xmethods   Seekda   Biocatalogue
Overall Services                 57           289      382       853        2567
Unavailable Services              0             0        0         0         125
Table 4-1 Service amount statistic of these five Web Service Registries
In order to give an intuitive view of the service amount statistic of these five Web Service Registries, figure 4-1 presents the data of table 4-1 as a bar chart. As can be seen from the bar chart, on the one hand the overall number of Web services increases steadily from the Service Repository Web Service Registry to the Biocatalogue Web Service Registry. The Biocatalogue Web Service Registry in particular owns the largest number of Web services, which indicates that it has a much more powerful ability to provide Web services to its users, because it contains far more services than the other four Web Service Registries. On the other hand, there are no unavailable services in any of the Web Service Registries except for the Biocatalogue Web Service Registry. That is to say, only the Biocatalogue Web Service Registry contains some Web services that cannot be used by the users anymore. To some degree this is useless, because these services cannot be used and they waste network resources on the Web. Therefore, all these unavailable services should be eliminated in order to reduce this waste of network resources.
Figure 4-1 Service amount statistic of these five Web Service Registries
4.2 Statistic Information for WSDL Document
Web Service Registry     Service Repository   Ebi   Xmethods   Seekda   Biocatalogue
Failed WSDL Links                 1             0       23       145          32
Without WSDL Links                0             0        0         0          16
Empty Content                     0             0        2         0           2
Table 4-2 Statistic information for WSDL Document
Table 4-2 and figure 4-2 show the error statistics for the WSDL documents of the Web services in these five Web Service Registries. There are three aspects to this. The first one concerns the "Failed WSDL Links" of the Web services in these Web Service Registries, that is, the overall number of Web services whose WSDL links are invalid. In other words, it is impossible to get the WSDL documents of these Web services via the URL addresses of their WSDL links, and therefore no WSDL document is created for them. The second aspect is the "Without WSDL Links" count, the overall number of Web services in each Web Service Registry that have no WSDL link at all; for such Web services the value of the WSDL link is "NULL". A WSDL document is still created for them, but it has no content, and its name contains the string "[No WSDL Document]". The third aspect is the "Empty Content" count, which represents the overall number of Web services that have WSDL links whose URL addresses
are valid, but whose WSDL documents contain no content. In this case a WSDL document whose name contains the string "(BAD)" is created.
Figure 4-2 Statistic information for WSDL Document
4.3 Comparison of the Average Number of Service Properties
This section compares the average number of service properties across these five Web Service Registries. The average number of service properties is calculated by means of the following equation:

    ASP = ONSP / ONS    (1)

Where
ASP is the average number of service properties for one Web Service Registry,
ONSP is the overall number of service properties in one Web Service Registry, and
ONS is the overall number of Web services that have already been crawled from the corresponding Web Service Registry.
Figure 4-3 shows the average number of service properties per Web service in these five Web Service Registries. As already mentioned, one of the measurements for assessing the quality of the Web services in a Web Service Registry is the service information: the more information is given about a Web service, the better you know that service, and consequently the corresponding Web Service Registry can offer Web services of better quality to its users. As seen from figure 4-3, the Service Repository and Biocatalogue Web Service Registries own a larger number of service properties than the other three Web Service Registries. This directly reflects that these two Web Service Registries provide more detailed information about the Web services published in them, so that users can choose the services they need more easily and are also more likely to use the Web services published in these two Web Service Registries. By contrast, the Xmethods and Seekda
Web Service Registries, which provide less service information about their Web services, offer lower quality for these Web services. Therefore users may be less willing to use the Web services provided in these two Web Service Registries, not to mention the Web services published in the Ebi Web Service Registry.
Figure 4-3 Average Number of Service Properties (23, 7, 17, 17, and 32 properties on average for Service Repository, Ebi, Xmethods, Seekda, and Biocatalogue respectively)
From the description presented in section 3.2.3, the causes of the different numbers of service properties in these Web Service Registries may consist of the following points. First, the amount of structured information for the Web services differs among these five Web Service Registries, and part of the information for some Web services in a Web Service Registry can be missing or have an empty value. For example, the amount of structured information that is supposed to be extracted for the Web services in the Ebi Web Service Registry differs greatly from that in the Biocatalogue Web Service Registry. Secondly, there is a certain amount of endpoint information in almost all Web Service Registries except the Ebi Web Service Registry; this more or less reduces the overall number of service properties. Thirdly, some Web Service Registries, such as Xmethods and Ebi, do not have monitoring information. In particular, the Service Repository Web Service Registry has a large amount of monitoring information about its Web services that can be extracted from the Web. Obviously, the last point is the amount of whois information for these Web services. If the database of the whois client does not contain information about the service domain of a Web service in a Web Service Registry, then no whois information can be extracted. Moreover, even if there is information about the service domain, the amount of this information can be very diverse. Therefore, if a Web Service Registry is in the situation that many service domains of its Web services have no or only little whois information, then the average number of service properties in that registry decreases greatly.
As a result, in order to help users get better acquainted with the Web services provided in a Web Service Registry and to distinguish these Web services, the Web Service Registry should do its best to offer more and more information for each of its published Web services.
4.4 Different Outputs of Web Services
The fundamental task of this master program is to obtain the WSDL documents of the Web services hosted in the Web Service Registries, as well as to extract and gather the properties of these Web services and store them on disk afterwards. Therefore this section describes the different outputs of this master program, which include the WSDL documents of the Web services, the generated XML and INI files, and the data records of the service properties in the database.
Figure 4-4 WSDL Document format of one Web service
The WSDL document of a Web service is simply read from the Web according to the URL address of its WSDL link, and these data are then stored on disk as the WSDL document. The name of the WSDL document is the service name plus the ending ".wsdl", which marks it as a WSDL document. In order to distinguish WSDL documents whose names would be the same although their contents differ, the name of each obtained WSDL document in one Web Service Registry additionally carries a unique Integer in front of the name. Figure 4-4 shows the valid WSDL document format of a Web service; its name is "1BLZService.wsdl".
Regarding the obtained service properties, they are transformed into an XML file, an INI file, and data records in the database. Figure 4-5, figure 4-6, and figure 4-7 show these three different output formats respectively. As can be seen from figure 4-5, this is the INI file of the Web service, and its name is "1BLZService.ini". The Integer is the same as in the WSDL document, because both are materials belonging to the same Web service. The first three lines of that INI file are service comments, which run from the semicolon to the end of the line; they give basic information describing this INI file. The line following them is the section, enclosed in a pair of brackets. It is important because it indicates that the lines behind it carry the information of this Web service. Hence the remaining lines hold the actual service information as key-value pairs with an equals sign between key and value; each service property is displayed from the beginning of a line.
Figure 4-5 INI File format of one Web service
Figure 4-6 XML File format of one Web service
Figure 4-7 Database data format for all Web services
Furthermore, figure 4-6 shows the generated XML file of the Web service "BLZService"; its name is "1BLZService.xml". Needless to say, this XML file belongs to the materials of the same Web service. Though the format of the XML file is different from that of the INI file, the essential contents of the two are the same; that is to say, the values of the service properties do not differ, because both files are generated from the same collection of properties of the same Web service. The XML file also has some comments like those in the INI file, which are enclosed between "<!--" and "-->". And the section of the INI file corresponds roughly to the root of the XML file; therefore all values of the elements below the root "service" in this XML file are the values of the service properties of this Web service.
Finally, as can be seen from figure 4-7, this is the database table used to store the service information of all Web services in these five Web Service Registries. The entire service information of one Web service forms exactly one record in this table. Because of that, the column names of the table are the union of the names of the service information fields of the individual Web Service Registries. However, since the column names of a table must be unique, the redundant names in this union have to be eliminated. This is sensible and possible because the names of the service information fields are well-defined and uniform across all these five Web Service Registries. In addition, the first column of this table is the primary key, an increasing Integer whose function resembles that of the Integer contained in the names of the XML and INI files, while the remaining columns are the corresponding service properties of the Web services. The value "NULL" or "Null" in this table indicates that the respective property of that Web service is empty or missing.
4.5 Comparison of Average Time Cost for Different Parts of a Single Web Service
This section compares the average time cost of the different parts of getting one single Web service in all these five Web Service Registries. At first, the average time cost of getting one single service in a Web Service Registry has to be calculated, which can be obtained through the following equation:

    ATC = OTS / ONS    (2)

Where
ATC is the average time cost for one single Web service,
OTS is the overall time cost of all the Web services in one Web Service Registry, and
ONS is the overall number of Web services that have already been crawled from the corresponding Web Service Registry.
In addition, the average time cost for getting one single service consists of the following six parts: the average time cost for extracting the service properties, the average time cost for obtaining the WSDL document, the average time cost for generating the XML file, the average time cost for generating the INI file, the average time cost for inserting the service properties into the database table, and the average time cost for some other procedures, such as getting the service list page link, getting the service page link, and so on. The average time cost for extracting the service properties is obtained by means of the following equation:
    ATCSI = OTSSI / ONS    (3)

Where
ATCSI is the average time cost for extracting the service properties of one single Web service,
OTSSI is the overall time cost for extracting the service properties of all the Web services in one Web Service Registry, and
ONS is the overall number of Web services that have already been crawled from the corresponding Web Service Registry.

The calculation of the other parts is analogous to the equation for the average time cost of extracting the service properties, while the average time cost for the other procedures equals the average time cost for one single Web service minus the sum of the average time costs of the other five parts.
                     Service    WSDL       XML    INI    Database   Others   Overall
                     property   Document   File   File
Service Repository    8801       918        2      1       53        267      10042
Ebi                    699        82        2      1       28         11        823
Xmethods              5801      1168        2      1       45         12       7029
Seekda                5186      1013        2      1       41         23       6266
Biocatalogue         39533       762        2      1       66       1636      42000
Table 4-3 Average time cost information (in milliseconds) for all Web Service Registries
Table 4-3 displays the average time cost of one Web service and of its different parts in all these five Web Service Registries. The first column of table 4-3 holds the names of these five Web Service Registries, and the last column holds the average time cost of a single service in the respective Web Service Registry, while the remaining columns give the average time costs of the six different parts. Note that for each registry the six parts add up to the overall value; for the Service Repository, for instance, 8801 + 918 + 2 + 1 + 53 + 267 = 10042 milliseconds. In order to give an intuitive view of the data in table 4-3, the data in each column are also illustrated in the corresponding figures 4-8 to 4-13.
Figure 4-8 Average time cost for extracting service property in all Web Service Registries
As can be seen from figure 4-8, the average time cost for extracting the service properties in the Biocatalogue Web Service Registry is 39533 milliseconds. This value is much larger than those of the other four Web Service Registries, which are 8801, 699, 5801, and 5186 milliseconds for Service Repository, Ebi, Xmethods, and Seekda respectively. That is to say, it takes much longer to extract the service properties of the Web services published in the Biocatalogue Web Service Registry. This also indirectly indicates that the Biocatalogue Web Service Registry has the highest average number of service properties, as already discussed in section 4.3; conversely, the average number of service properties in the Ebi Web Service Registry is the smallest. Moreover, the average time cost in the Xmethods Web Service Registry is larger than that in the Seekda Web Service Registry, although, as shown before, the average number of service properties is the same for these two Web Service Registries. One cause that might explain why Xmethods costs more time than Seekda is that the process of extracting the service properties in the Xmethods Web Service Registry has to work through both the service page and the service list page, while only the service page link is needed for the Seekda Web Service Registry.
The average time cost for obtaining the WSDL document in all these five Web Service Registries is displayed in figure 4-9. Actually, this average time cost is the sum of the average time for extracting the WSDL link of a Web service and the average time for reading the data of the WSDL document from the Web and storing it on disk. As seen from this figure, the average time cost for obtaining the WSDL document in the Xmethods Web Service Registry is the largest, namely 1168 milliseconds. Although the process of extracting the WSDL link costs a certain amount of time, it does not have a significant influence on the total average time spent on obtaining the WSDL document, because the WSDL link of a Web service is almost always gained in one step. This therefore implies that the average size of the WSDL documents in the Xmethods Web Service Registry is larger than in the other four Web Service Registries, in particular larger than in the Ebi Web Service Registry, which takes just 82 milliseconds for the process of obtaining the WSDL document.
Figure 4-9 Average time cost for obtaining WSDL document in all Web Service Registries
Figure 4-10, figure 4-11, and figure 4-12 show the average time costs of generating the three different outputs in all these five Web Service Registries. As can be seen from figures 4-10 and 4-11, the average time for generating the XML file of one Web service is the same for all five Web Service Registries, namely only 2 milliseconds, and the average time for generating the INI file of one Web service is likewise constant, at just 1 millisecond. Even the sum of these two average time costs is still so small that it can be omitted when compared to the overall average time cost of getting one Web service for the corresponding Web Service Registry, shown in figure 4-13. This implies that the process of generating the XML and INI files finishes immediately after receiving the service properties of one Web service as input. Furthermore, it can be seen from figure 4-12 that, although the average time cost of creating the database record for each Web service is larger in all these five Web Service Registries than the time for generating the XML and INI files, the operation of creating a database record is still fast.
Figure 4-10 Average time cost for generating XML file in all Web Service Registries
Figure 4-11 Average time cost for generating INI file in all Web Service Registries
Figure 4-12 Average time cost for creating database record in all Web Service Registries
Figure 4-13 Average time cost for getting one Web service in all Web Service Registries
Figure 4-13 gives the average time cost for getting one single Web service in all these five Web Service Registries. Without any doubt, the Biocatalogue Web Service Registry takes the longest time for this process, because, as the presentation of the different parts above shows, the average time cost of each part is largest in the Biocatalogue Web Service Registry, except for the process of obtaining the WSDL document, where Biocatalogue does not cost the most. Moreover, there is a striking observation when looking at figures 4-8, 4-12, and 4-13: the shapes of these curves show almost the same trend. This further indicates that a Web Service Registry that spends more time on getting the description information of a Web service also offers more information about that Web service.
5 Conclusion and Further Directions
This master thesis provides a schema which aims to explore the description information of the Web services hosted in different Web Service Registries. The description information of a Web service consists of the WSDL document and the service information about that Web service. Although there is an existing approach, the Pica-Pica Web Service Description Crawler, that can be used to obtain the WSDL documents and service information of the Web services in these Web Service Registries, its functionality is restricted: it explores only a small subset of the Web services hosted in these Web Service Registries, the storage of these Web services is not flexible, and, most importantly, only little service information is extracted per Web service; for some Web Service Registries no service information is extracted at all. The work presented in this master thesis, in contrast, is able to explore all Web services in these Web Service Registries. Moreover, with the approach presented in this master thesis, as much service information as possible is extracted for each Web service, so that the final result is the largest annotated service catalogue ever produced. Furthermore, regarding the storage of the description information of the Web services, this master thesis provides three different ways that guarantee not only the completeness but also the longevity of the description information of the Web services.
However, in the implementation performed in this master thesis, the whois client used for querying the information of a service domain returns free text if the information exists, and sometimes this free text differs completely from case to case. This makes it necessary to crawl each Web service in all Web Service Registries at least once in the experiment stage, so that all variants of this free text can be foreseen and processed afterwards. Nevertheless, this is a huge amount of work, because there are lots of Web services in these Web Service Registries. Therefore, in order to simplify the work, another whois client that eases this task needs to be found and used.
Moreover, in the experiment stage of this master thesis, the time cost for getting a Web service is still large. For the purpose of reducing this time, multithreaded programming could also be applied to some parts of the process of getting one Web service.
Although the work performed here is specialized for only these five Web Service Registries, the main parts of the principles used here are adaptable to other Web Service Registries, with only small changes in the implementation code or the structure.
6 Bibliography
[1] Nathalie Steinmetz, Holger Lausen, Dario Cerizza, Andrea Turati, Irene Celino, Adam Funk and Michael Erdmann: "D1.1 - Requirement Analysis and Architectural Plan". Emanuele Della Valle (CEFRIEL), June 27, 2008. Available from http://www.service-finder.eu/deliverables/public/7-public/32-d11
[2] Nathalie Steinmetz, Holger Lausen, Dario Cerizza, Andrea Turati, Irene Celino, Adam Funk and Michael Erdmann: "D1.2 - First Design of Service-Finder as a Whole". Emanuele Della Valle (CEFRIEL), July 1, 2008. Available from http://www.service-finder.eu/deliverables/public/7-public/37-d12-first-design-of-service-finder-as-a-whole
[3] Nathalie Steinmetz, Holger Lausen, Irene Celino, Dario Cerizza, Saartje Brockmans, Adam Funk: "D1.3 - Revised Requirement Analysis and Architectural Plan". Emanuele Della Valle (CEFRIEL), April 1, 2009. Available from http://www.service-finder.eu/deliverables/public/7-public/57-d13-revised-requirement-analysis-and-architectural-plan
[4] Chia-Hui Chang, Mohammed Kayed, Moheb Ramzy Girgis, Khaled Shaalan: "A Survey of Web Information Extraction Systems". Volume 18, Issue 10, IEEE Computer Society, pp. 1411-1428, October 2006.
[5] Leonard Richardson: "Beautiful Soup Documentation". October 13, 2008. Available from http://www.crummy.com/software/BeautifulSoup/documentation.html
[6] Hao He, Hugo Haas, David Orchard: "Web Services Architecture Usage Scenarios". February 11, 2004. Available from http://www.w3.org/TR/ws-arch-scenarios/
[7] Stephen Soderland: "Learning Information Extraction Rules for Semi-Structured and Free Text". Machine Learning, Volume 34, Issue 1-3, pp. 233-272. Department of Computer Science and Engineering, University of Washington, Seattle, February 1999.
[8] Ian Hickson: "A Vocabulary and Associated APIs for HTML and XHTML". World Wide Web Consortium, Working Draft WD-html5-20100624, January 22, 2008.
[9] Holger Lausen, Jos de Bruijn, Axel Polleres, Dieter Fensel: "The Web Service Modeling Language WSML". WSML Deliverable D16.1 v0.2, March 20, 2005. Available from http://www.wsmo.org/TR/d16/d16.1/v0.2/
[10] Dumitru Roman, Holger Lausen, Uwe Keller: "Web Service Modeling Ontology - Standard (WSMO-Standard)". WSMO Deliverable D2, version 1.1, March 6, 2004. Available from http://www.wsmo.org/TR/d2/v1.1/
[11] Iris Braun, Anja Strunk, Gergana Stoyanova, Bastian Buder: "ConQo - A Context- and QoS-Aware Service Discovery". TU Dresden, Department of Computer Science, in Proceedings of WWW/Internet 2008.
7 Appendixes
There are two additional outputs of this master program: the log information file and the statistic report file. Figure 8-1 shows one of the basic output formats of the log information for these five Web Service Registries.
Figure 8-1 Log information of the "Service Repository" Web Service Registry
Figure 8-2 Statistic information of the "Service Repository" Web Service Registry
Figure 8-2 to figure 8-6 show the output format of the statistic information for the "Service Repository", "Ebi", "Xmethods", "Seekda", and "Biocatalogue" Web Service Registries respectively.
Figure 8-3 Statistic information of the "Ebi" Web Service Registry
Figure 8-4 Statistic information of the "Xmethods" Web Service Registry
Figure 8-5 Statistic information of the "Seekda" Web Service Registry
Figure 8-6 Statistic information of the "Biocatalogue" Web Service Registry
Table of Figures
Figure 2-1 Dataflow of Service-Finder and Its Components .......... 12
Figure 2-2 Left is the free text input type and right is its output .......... 16
Figure 2-3 A Semi-structured page containing data records (in rectangular box) to be extracted .......... 16
Figure 2-4 Architecture of the Pica-Pica Web Service Description Crawler .......... 20
Figure 3-1 Overview of the Basic Architecture for the Deep Web Services Crawler .......... 25
Figure 3-2 Overview of the process flow of the Web Service Extractor Component .......... 27
Figure 3-3 Service list page of the Service-Repository .......... 29
Figure 3-4 Original source code of the internal link for Web service "BLZService" .......... 29
Figure 3-5 Code overview of getting service page link in Service Repository .......... 29
Figure 3-6 Service page of the Web service "BLZService" .......... 29
Figure 3-7 Overview of the process flow of the WSDL Grabber Component .......... 30
Figure 3-8 WSDL link of the Web service "BLZService" in the service page .......... 31
Figure 3-9 Original source code of the WSDL link for Web service "BLZService" .......... 32
Figure 3-10 Code overview of the "getServiceRepositoryWSDLLink" function .......... 32
Figure 3-11 Code overview of the "oneParameter" function .......... 32
Figure 3-12 Overview of the process flow of the Property Grabber Component .......... 33
Figure 3-13 Structured properties of the service "BLZService" in the service list page .......... 37
Figure 3-14 Structured properties of the service "BLZService" in the service page .......... 38
Figure 3-15 Endpoint information of the Web service "BLZService" in the service page .......... 38
Figure 3-16 Monitoring information of the service "BLZService" in the service page .......... 39
Figure 3-17 Whois information of the service domain "thomas-bayer.com" .......... 40
Figure 3-18 Overview of the process flow of the Storage Component .......... 41
Figure 3-19 Implementation code for getting WSDL document .......... 44
Figure 3-20 Implementation code for generating XML file .......... 44
Figure 3-21 Implementation code for generating INI file .......... 45
Figure 3-22 Implementation code for creating table in database .......... 45
Figure 3-23 Implementation code for generating table records .......... 46
Figure 4-1 Service amount statistic of these five Web Service Registries .......... 49
Figure 4-2 Statistic information for WSDL Document .......... 50
Figure 4-3 Average Number of Service Properties .......... 51
Figure 4-4 WSDL Document format of one Web service .......... 52
Figure 4-5 INI File format of one Web service .......... 53
Figure 4-6 XML File format of one Web service .......... 53
Figure 4-7 Database data format for all Web services .......... 53
Figure 4-8 Average time cost for extracting service property in all Web Service Registries .......... 55
Figure 4-9 Average time cost for obtaining WSDL document in all Web Service Registries .......... 56
Figure 4-10 Average time cost for generating XML file in all Web Service Registries .......... 57
Figure 4-11 Average time cost for generating INI file in all Web Service Registries .......... 57
Figure 4-12 Average time cost for creating database record in all Web Service Registries .......... 58
Figure 4-13 Average time cost for getting one Web service in all Web Service Registries .......... 58
Table of Tables
Table 3-1 Structured Information of Service-Repository Web Service Registry .......... 34
Table 3-2 Structured Information of Xmethods Web Service Registry .......... 34
Table 3-3 Structured Information of Seekda Web Service Registry .......... 34
Table 3-4 Structured Information of Ebi Web Service Registry .......... 34
Table 3-5 Structured Information of Biocatalogue Web Service Registry .......... 34
Table 3-6 SOAP Operation Information of Biocatalogue Web Service Registry .......... 35
Table 3-7 REST Operation Information of Biocatalogue Web Service Registry .......... 35
Table 3-8 Endpoint Information of these five Web Service Registries .......... 35
Table 3-9 Monitoring Information of these five Web Service Registries .......... 35
Table 3-10 Whois Information for these five Web Service Registries .......... 36
Table 3-11 Extracted Structured Information of Web Service "BLZService" .......... 38
Table 3-12 Extracted Endpoint Information of the Web service "BLZService" .......... 39
Table 3-13 Extracted Monitoring Information of the Web service "BLZService" .......... 39
Table 3-14 Extracted Whois Information of service domain "thomas-bayer.com" .......... 40
Table 3-15 Sleep Time of these five Web Service Registries .......... 47
Table 4-1 Service amount statistic of these five Web Service Registries .......... 48
Table 4-2 Statistic information for WSDL Document .......... 49
Table 4-3 Average time cost information for all Web Service Registries .......... 55
Table of Abbreviations
DTD Document Type Definition
DWSC Deep Web Service Crawler
HTML HyperText Markup Language
MTBF Mean Time between Failures
MTTR Mean Time to Recovery
QoS Quality of Service
REST Representational State Transfer
RTT Round Trip Time
SOAP Simple Object Access Protocol
SQL Structured Query Language
URL Uniform Resource Locator
WHATWG Web Hypertext Application Technology Working Group
WSDL Web Service Description Language
WSML Web Service Modeling Language
WSMO Web Service Modeling Ontology
XML eXtensible Markup Language
Deep Web Service Crawler
7
1 Introduction
In this introduction chapter of the master thesis at first itrsquos going to concisely explain the background
of the current situation then is the basic introduction of the proposed approach which is called Deep
Web Service Extraction Crawler
11 BackgroundMotivation
In the late 1990rsquos the Web Service Registry was a hot commodity The formal definition of Web
Service Registry is known as a link links page Its function is to uniformly present information that
comes from various sources Hence it can provide a convenient channel to the users for offering
searching and using the Web Services Actually the related metadata of the Web Services that
submitted by both the system and users are commonly hosted along with the Service descriptions
Nevertheless in fact when users enter one of the Web Service Registries to look for some Web
Services they might meet some situations that would bring lots of trouble to them One of the
situations may be like that these Web Service Registries return several similar published Web Services
after the users search on it For example two or more Web Services have the same name but their
versions are not the same Or two or more Web Services that derived from the same server but have
different contents etc Furthermore most users are also interested in a global view of the published
services For instance they want to know which Web Service Registry can provide better quality for
the Web Service Therefore in order to help users to differentiate those similar published Web
Services and have a global view of the Web Services this information should be monitored and rated
Moreover there are a great many Web Service Registries in the Internet Each Web Service Registry
can provide a great number of Web Services Obviously there might have some similar Web Services
among these Web Service Registries Or a Web Service in one of the Web Service Registry is related to
another Web Service in other Web Service Registries Hence these Web Services should be
comparable across different Web Service Registries However recently there has not much support of
this In addition towards the metadata actually not all of them are structured especially the
descriptions of the non-functional property Therefore what have to do now is to turn those
non-functional property descriptions into the structured format Clearly speaking it needs to extract
as much information as possible about the Web Services that offered in the Web Service Registries
Eventually after extracting all the information from the Web Service Registries it is necessary to store
them into the disk This procedure should be efficient flexible and completeness
12 Initial Designing of the Deep Web Service
Crawler Approach
The problems have already been stated in the previous section Hence the following work is to solve
Deep Web Service Crawler
8
these problems In this section it will present the basic principle of Deep Web Service Crawler
approach
At first it is the simply introduction of the Deep Web Service Crawler approach to these problems As
have already been mentioned each Web Service Registry can offer Web Services Moreover each
Web Service Registry has its own html page structures These structures may be the same or even
complete different Therefore the first thing is to identify which Web Service Registry that it will be
going to explore Since each Web Service Registry owns a unique URL this job can be done by directly
analyzing the corresponding URL address of that Web Service Registry After identifying which Web
Service Registry it is going to explore the following step is to obtain all these Web Services that
published in that Web Service Registry Then with all these obtained Web Services it is time to extract
analyze and gather the information of the services That information can be in structured format or
even in unstructured format In this master thesis some Deep Web Analysis Techniques will be
applied to obtain this information So that the information about each Web Service shall be the
largest annotated The last but not the least important all the information about the Web Services
need to be stored
13 Goals of this Master Thesis
The lists in the following are the goals of this master thesis
n Produce the largest annotated Service Catalogue
Service Catalogue is a list of service properties The more properties the service has the larger
Service Catalogue it owns Therefore this master program should extract as much service
properties as possible
n Flexible storage of these metadata of each service as annotations or dedicated documents
The metadata of one service includes not only the WSDL document but also service properties
All these metadata are important information for the service Therefore this master program
should provide flexible ways to store these metadata into the disk
n Improve the comparable property of the Web Services across different Web Service Registries
The names of service properties for one Web Service Registry could be different from another
Web Service Registry Hence for the purpose of improving the comparable ability all these
names of the service properties should be uniformed and well-defined
14 Outline of this Master Thesis
In this chapter the motivation objective and initial approach plan have already been discussed
Thereafter the remaining paper is structured as follows
Firstly in chapter 2 the work presented is almost based on these existing techniques In section 21
there is given a detailed introduction to the technique of the Service-Finder project Then in section
22 the already implemented technique of Pica-Pica Web Service Description Crawler is introduction
and discussed After that in section 23 the Information Retrieval technique is presented
Deep Web Service Crawler
9
Then following in chapter 3 it is explained the designing details of this Deep Web Service Crawler
approach In section 31 it gives a short description for the different requirements of this approach
Next in section 32 the actual design for the Deep Web Service Crawler is presented Then in section
33 34 the multithreaded programming and sleep time configuration that used in this master
program are introduced respectively
In chapter 4 it is supposed to display the experiments of this Deep Web Service Crawler approach
and then give some evaluation of it
Finally in Section 5 the conclusion and discussion of the work that already done as well as the work in
the future for this master task are presented respectively
Deep Web Service Crawler
10
2 State of the Art
This chapter aims at presenting some existing techniques or Strategies that related to the work of
applying this Deep Web Service Extraction Crawler approach Section 21 plans to talk about the
existing catalogues Service-Finder project And then in section 22 it is going to explain the existing
implemented crawlers Pica-Pica Web Service Description Crawler Finally in the section 23 it is
supposed to present some details about the Information Extraction technique
21 Service Finder Project
Service-Finder Project aims at developing a platform for Web Service discovery especially for the Web
Services that are embedded in a Web 20 environment [1] Hence it can provide an efficient access to
publicly available services The goals of the Service-Finder project are depicted as follows [1]
n Automatically gather Web Services and their related information
n Semi-automatically create semantic service description based on the information that available
on the Web
n Create and improve semantic annotations via the user feedback
n Describe the aggregated information in semantic models and allow reasoning query
However before describing the basic functionality of the Service-Finder Project there is going to
present one of its use cases and requirements first
211 Use Cases for Service-Finder Project
The Service-Finder Project employed the use case methodology of the W3C Use Case description [6]
for its needs and then applied this methodology to the use cases that it enumerated
2111 Use Case Methodology
There are three aspects needed to be considered for use case definitions of Service-Finder Project [1]
(1) Description that used to describe information of the use case
(2) Actors Roles and Goals that used to identify the actors they would be the roles they act and the
goals they need to achieve in the scenario
(3) Storyboard that used to describe the serial of interactions among the actors and the
Service-Finder Portal
2112 System Administrator
This section is going to present the use case that applied to the Service-Finder portal and that
illustrated the requirements on its functionality from a user point of view However all these
Deep Web Service Crawler
11
information in this use case are derived from [1] In this use case there has a system administrator
whose name is ldquoSam Adamsrdquo He works for the bank His job is to keep the online payment facilities
online and working all day and night Therefore if there is any system failures Sam Adams should fix
the problems as early as he can Thatrsquos why he wants to use an SMS Messaging service which will
alert him immediately by sending him a SMS Message in the case of a system failure
• Description
This use case describes the system administrator "Sam Adams", who is looking for an SMS messaging service that he wants to build into his application.
• Actors, Roles and Goals
The name of the actor is "Sam Adams". His role is a system administrator at a bank. His goals are immediate service delivery, reliability of the service, and a low base fee and transaction fee.
• Storyboard
Step 1: Sam Adams knows the Service-Finder portal, knows that he can find many useful services on it, and knows what he is looking for. Hence he visits the Service-Finder portal and starts to search by entering the keywords "SMS Alert".
Requirement 1: Search functionality.
Step 2: Now the Service-Finder portal returns a list of matching services. However, Sam wants to choose the number of matching services that will be displayed on one page. He would also expect short information about the service functionality, the service provider and the service availability, so that he can decide which service he will read about further.
Requirement 2: Enable configurable pagination of the matching results and provide short information for each service.
Step 3: When Sam looks through the short information about the services displayed on the first page, he expects to find the most relevant services related to his request. After that, he would like to read more detailed information about a service to see whether it can provide the needed functionality.
Requirement 3: Rank the returned matching services and provide the ability to read more details of a service.
Step 4: It may happen that the returned matching services provide quite different functionalities or belong to different service categories; for example, some messaging services alert users not through SMS but through voice messaging. For this reason, Sam would like to see other categories that may contain the services he wants, or services of other categories in which he is also interested (like "SMS Messaging"). Another possible way is that Sam further refines his search by browsing through categories.
Requirement 4: Service categories that allow the user to see all services belonging to a specific category. If possible, the portal should also allow the user to browse through categories.
Step 5: When Sam has got all the services that could provide SMS messaging via the methods described in step 4, he now wants to look for services that are offered by an Austrian provider and, if possible, have no base fees.
Requirement 5: Faceted search.
Step 6: After Sam has got all these specific services, he would like to choose the services that provide high reliability.
Requirement 6: Sort functionality based on the user's choices.
Step 7: Sam now expects to compare the service availability promised by the service provider with the actually provided availability. This should be contained in the services' details. There also needs to be service coverage information, so that Sam can know whether a service covers the areas where he lives and works. Moreover, Sam would also like to compare these services in other ways, for instance by putting some services into a structured table to compare the transaction fees.
Requirement 7: A side-by-side comparison table for services and a functionality that enables users to select the services they want to compare.
Step 8: At last, Sam wants to know whether the service providers offer a free trial of their services, so that he can test the service functionality.
Requirement 8: If possible, display a note for services offering free trials.
2.1.2 Architecture Plan for the Service-Finder Project
The architecture plan of the Service-Finder project contains five basic components: Service Crawler, Automatic Annotator, Conceptual Indexer and Matcher, Cluster Engine, and Service-Finder Portal Interface. Figure 2-1 presents a high-level overview of the components and the data flow among them.
Figure 2-1: Data flow of Service-Finder and its components [3]
2.1.2.1 The Principle of the Service Crawler Component
The Service Crawler is responsible for gathering available services and their related information from the Web. The overall cycle is depicted as follows:
(1) A Web developer publishes a Web Service.
(2) Then the crawling component begins to harvest the Web in order to identify Web Service artifacts like WSDL (Web Service Description Language) documents.
(3) The Crawler also searches for other related information as soon as a service is discovered.
(4) After each periodic interval, the Crawler produces a consistent snapshot of the relevant part of the Web.
At last, the output of the crawler is forwarded to the subsequent components for analyzing, indexing and displaying.
2.1.2.2 The Principle of the Automatic Annotator Component
The Automatic Annotator receives the relevant data from the previous component and generates semantic service descriptions for the WSDL documents and their related information, based on the Service-Finder Ontology and the Service Category Ontology.
First, the two compatible ontologies that are used throughout the whole process are briefly introduced [2]:
• Generic Service Ontology: an ontology which is used to describe the data objects, for example the services, the service providers, availability, payment modalities and so on.
• Service Category Ontology: an ontology which is used to categorize the functionalities or applications of the services, for instance data verification, messaging, data storage, weather, etc.
Afterwards, the function of this component is described together with its input and output.
• Input
- Crawled data from the Service Crawler
- Service-Finder ontologies
- Feedback or corrections of previous annotations
• Function
- Enrich the information about the service and extract semantic statements according to the Service-Finder ontologies; for example, categorize the service according to the Service Category Ontology
- Determine whether a particular document is relevant or not through the Web link graph; if not, discard the irrelevant documents
- Classify the pages into their genres, for instance pricing, user comments, FAQ and so on
• Output
- Semantic annotations of the services
2.1.2.3 The Principle of the Conceptual Indexer and Matcher Component
The Conceptual Indexer and Matcher is essentially a data store center that aims at storing all extracted information about the services and supplying users with retrieval and semantic query capabilities, for example the matchmaking between user requests and service offers, and the retrieval of user feedback on extracted annotations.
In addition, the function of this component and its input and output are listed below.
• Input
- Semantic annotation data and full-text information obtained from the Automatic Annotator
- Semantic annotation data and full-text information that come from the user interface
- Cluster data from the user and service clustering component
• Function
- Store the semantic annotations received from the Automatic Annotator component and from the user interface
- Store the cluster data procured through the clustering component
- Store and index the textual descriptions offered by the Automatic Annotator component and the textual comments offered by users
- Ontological querying of the semantic data in the data store center
- Combined keyword and ontological querying for user queries
- Provide a list of similar services for a given service
• Output
- A list of matching services for user queries; in particular, these services should be sorted by ranking and should be iterable
- All available data related to a particular entity, retrievable at the user interface
2.1.2.4 The Principle of the Service-Finder Portal Interface Component
The Service-Finder Portal Interface is the main entry point provided for users of the Service-Finder system to search and browse the data which is managed by the Conceptual Indexer and Matcher component. In addition, the users can also contribute information by providing tags, comments, categorizations and ratings for the data they browse. Furthermore, developers can directly invoke the Service-Finder functionalities from their custom applications through an API.
The details of this component's function, input and output are presented below.
• Input
- A list of ordered services for a query
- Detailed information about a service or a set of services and a service provider
- Query access to the service category ontology and the most used tags provided by the users
- Service availability information
• Function
- The Web interface allows the users to search services by keyword, tag or concept in the categorization; to sort and filter query results by refining the query; to compare and bookmark services; and to try out the services that offer this functionality
- The API allows developers to invoke Service-Finder functionalities
• Output
- Explicit user annotations such as tags, ratings, comments, descriptions and so on
- Implicit user data, for example click streams of users, bookmarks, comparisons, links sent, etc.
- Manual advertising of available new services
2.1.2.5 The Principle of the Cluster Engine Component
The Cluster Engine is in charge of collecting and analyzing information about user behavior from the Service-Finder Portal, e.g. the services users queried and compared. Moreover, it also provides cluster data to the Conceptual Indexer and Matcher for providing service recommendations.
Furthermore, this component's function, input and output are introduced in detail below.
• Input
- Service annotation data, both extracted and from user feedback
- Users' click streams, used for extracting user behavior
• Function
- Obtain user clusters from user behavior
- Obtain service clusters from service annotation data, enabling similar services to be found
• Output
- Clusters of users and services
2.2 Information Extraction
Due to the rapid development and use of the World Wide Web, a huge amount of information sources has been produced on the Internet, yet access to them has been limited to browsing and searching because of the heterogeneity and the lack of structure of Web information sources. Therefore, Information Extraction, which transforms Web pages into program-friendly structures for post-processing, becomes a great necessity. The task of Information Extraction is specified in terms of its inputs and its extraction targets, and the technique used in the process of Information Extraction is called the extractor.
2.2.1 Input Types of Information Extraction
Generally speaking, there are three different input types. The first input type is the unstructured document, for example the free text shown in figure 2-2. It is unstructured and written in natural language, so it requires substantial natural language processing. The second input type is called the structured document, for instance XML documents, whose data can be described through an available DTD (Document Type Definition) or XML (eXtensible Markup Language) schema. Finally, the third input type is the semi-structured document, which is widespread on the Web, such as the large volume of HTML pages containing tables, itemized lists and enumerated lists. This is because HTML tags are often used to render such embedded data in HTML pages; see figure 2-3.
Figure 2-2: Left is the free text input type and right is its output [4]
Figure 2-3: A semi-structured page containing data records (in rectangular box) to be extracted [4]
In this way, the inputs of the semi-structured type can be seen as documents with a fairly regular structure, and the data of these documents can be displayed in an HTML or a non-HTML format. Besides, since the Web pages of the Deep Web are dynamic and generated from structured databases in terms of some templates or layouts, the Deep Web can be considered as one of the input sources which provide such semi-structured documents. For example, the authors, price and comments of the book pages provided by Amazon have the same layout, because these Web pages are generated from the same database and with the same template or layout. Furthermore, there is another option: manually generated HTML pages of the semi-structured type. For example, although the publication lists provided on different researchers' homepages are produced by diverse users, they all have a title and a source property for every single paper. Eventually, the inputs for some Information Extraction tasks can also be pages of the same class or pages from various Web Service Registries.
2.2.2 Extraction Targets of Information Extraction
Moreover, regarding the task of Information Extraction, the extraction target has to be considered. There are two different extraction targets. The first one is the relation of a k-tuple, where k is the number of attributes in a record. In some cases an attribute of a record may have no instantiation, while in other cases the attribute has multiple instantiations. The complex object with hierarchically organized data is the second extraction target. Though the ways of depicting the extraction targets in a page are diverse, the most common structure is the hierarchical tree. This hierarchical tree may contain only one leaf node, or one or more lists of leaf nodes attached to internal nodes. The structure of a data object may also be flat or nested. In brief, if the structure is flat, then there is only one leaf node, which can also be called the root; if it is a nested structure, then the internal nodes involved in this data object span more than two levels.
Furthermore, in order to make Web pages readable for human beings and easier to visualize, the tables or tuples of the same list, or the elements of a tuple, should be clearly isolated or demarcated. However, the display of a data object in a Web page can be affected by the following conditions [4]:
• The attribute of a data object has zero or several values
(1) If there is no value for an attribute of a data object, this attribute is called a "none" attribute. For example, a special offer only available for certain books might be a "none" attribute.
(2) If there is more than one value for an attribute of a data object, it is called a "multiValue" attribute. For instance, the name of the author of a book could be a "multiValue" attribute.
• The set of attributes (A1, A2, A3, ...) has multiple orderings
That is to say, among this set of attributes, the position of an attribute might change according to the diverse instances of a data object. Such an attribute is called a "multiOrdering" attribute. For instance, for movies before the year 1999, a movie site might list the release date in front of the movie's title, while for movies from 1999 onwards it lists the release date behind the movie's title.
• The attribute has different formats
This means the display format of the data object can be completely different for different instances. Therefore, if the format of an attribute is free, a lot of rules are needed to deal with all kinds of possible cases. This kind of attribute is called a "multiFormat" attribute. For example, an e-commerce Web site might use a bold font to present the general prices while using a red color to display the sale prices. There is also the opposite situation, in which different attributes of a data object have the same format; for example, various attributes are presented using <TD> tags in a table presentation. Such attributes can be differentiated by means of their order information. However, for cases in which a "none" attribute occurs or "multiOrdering" attributes exist, the rules for extracting these attributes have to be revised.
• The attribute cannot be decomposed
For easier processing, the input documents are sometimes treated as strings of tokens instead of strings of characters. However, some attributes cannot be decomposed into several individual tokens. These attributes are called "untokenized" attributes, for example college course codes like "COMP4016" or "GEOL2001". The department code and the course number in them cannot be separated into two different strings of characters like "COMP" and "4016" or "GEOL" and "2001".
2.2.3 The Techniques Used in Information Extraction
The extractor used in the process of Information Extraction aims at providing a single uniform query interface to access information sources like database servers and Web servers. It consists of the following phases: collecting returned Web pages, labeling these Web pages, generalizing extraction rules, extracting the relevant data, and outputting the result in an appropriate format (XML format or relational database) for further information integration. For example, the extractor first queries the Web server to gather the returned pages through the HTTP protocol; after that, it starts to extract the contents of these HTML documents and thereafter integrates them with other data sources. The whole process of the extractor follows the steps below.
• Step 1
At the beginning, the input has to be tokenized. There are two different granularities for the input string tokenization: tag-level encoding and word-level encoding. Tag-level encoding transforms the tags of an HTML page into general tokens, while transforming every text string between two tags into a special token. Word-level encoding does this in another way: it treats each word in a document as a token. (A small sketch of tag-level encoding is given after this list.)
• Step 2
Next, the extraction rules are applied for every attribute of the data object in the Web pages. These extraction rules can be induced in terms of a top-down or bottom-up generalization, pattern mining, or logic programming. In addition, the type of extraction rules may be indicated by means of regular grammars or logic rules. For example, some use path expressions of the HTML parse tree, like html->head->title or html->table[0]; some use syntactic or semantic constraints; and some use delimiter-based constraints such as HTML tags or literal words.
• Step 3
After that, all the extracted data are assembled into records.
• Step 4
Finally, this process is iterated until all the data objects in the input are processed.
2.3 Pica-Pica Web Service Description Crawler
Pica-Pica is known as the name of a bird species, which can also be called pie. Here, however, Pica-Pica is a Web Service Description Crawler which is designed to investigate the quality of Web Services, for example to evaluate the descriptive quality of the offered Web Services and how well these Web Services are described in today's Web Service Registries.
2.3.1 Needed Libraries of the Pica-Pica Web Service Description Crawler
This version of the Pica-Pica Web Service Description Crawler was written by Anton Caceres and Josef Spillner and is programmed in the Python language. In order to run these scripts to parse the HTML pages, it needs two additional libraries: Beautiful Soup and html5lib.
• Beautiful Soup
It is an HTML/XML parser for the Python language, and it can even turn invalid markup into a parse tree [5]. Moreover, the following three features make it more powerful:
- Bad markup does not choke Beautiful Soup. In fact, it generates a parse tree that makes approximately as much sense as the original document, so you can obtain the data that you want.
- Beautiful Soup provides a toolkit with simple, idiomatic methods for navigating, searching and modifying the parse tree. Hence you do not need to create a custom parser for every application.
- If the document has already specified an encoding, you can ignore it, since Beautiful Soup converts the documents to Unicode and outputs them as UTF-8 automatically. Otherwise, all you have to do is to specify the encoding of the original documents.
Furthermore, the ways of including Beautiful Soup into an application are shown below [5]:
- from BeautifulSoup import BeautifulSoup # for processing HTML
- from BeautifulSoup import BeautifulStoneSoup # for processing XML
- import BeautifulSoup # to get everything
• html5lib
It is a Python package which implements the HTML5 [8] parsing algorithm. In order to gain maximum compatibility with the current major desktop Web browsers, this implementation is based on the WHATWG (Web Hypertext Application Technology Working Group) HTML5 specification.
2.3.2 Architecture of the Pica-Pica Web Service Description Crawler
Figure 2-4: Architecture of the Pica-Pica Web Service Description Crawler
Figure 2-4 illustrates the primary architecture of the Pica-Pica Web Service Description Crawler. It includes four fundamental components: the Service Page Grabber component, the WSDL Grabber component, the Property Grabber component and the WSML Register component.
(1) The Service Page Grabber component takes the URL seed as input and outputs the links of the service pages to the following two components: the WSDL Grabber component and the Property Grabber component.
(2) The WSDL Grabber component is responsible for obtaining the WSDL document based on the delivered service page link and then checking the validity of the obtained WSDL document. Only valid WSDL documents are passed on to the WSML Register component for further processing.
(3) The Property Grabber component tries to extract the service's properties hosted in the service page, if any exist. Afterwards, all these service properties are saved into an INI file as the information of that service.
(4) The functionality of the WSML Register component is to write appropriate WSML documents by means of the valid WSDL documents delivered from the WSDL Grabber component and the optional INI files delivered from the Property Grabber component, and afterwards to register them in ConQo.
• WSML [9]
It stands for Web Service Modeling Language, which provides a framework with different language variants. Hence it is often used to describe the different aspects of semantic Web Services according to the conceptual model of WSMO.
• WSMO [10]
WSMO, whose full name is Web Service Modeling Ontology, is dedicated to describing various aspects related to Semantic Web Services based on ontologies. An ontology is a formal, explicit specification of a shared conceptualization. In fact, these ontologies are the sticking points that link the agreement of the communities of users with the defined conceptual semantics of the real world.
• ConQo [11]
It is a discovery framework that considers not only the Quality of Service (QoS) but also context information. It uses a Web Service repository to manage the WSML-based service descriptions.
2.3.3 Implementation of the Pica-Pica Web Service Description Crawler
This section describes the implementation of the Pica-Pica Web Service Description Crawler in detail.
(1) Firstly, to start the whole crawling process of the Pica-Pica Web Service Description Crawler, an input is needed as the initial seed. For this crawler there are five Web Service Registries, which are listed below. The URL addresses of these five Web Service Registries are used as the input seeds for the Pica-Pica Web Service Description Crawler. Moreover, in this version of the Pica-Pica Web Service Description Crawler there is a single Python script for each Web Service Registry, and the crawling processes of these Web Service Registries' Python scripts are executed one after another.
Biocatalogue: http://www.biocatalogue.com
Ebi: http://www.ebi.ac.uk
Seekda: http://www.seekda.com
Service-Repository: http://www.service-repository.com
Xmethods: http://www.xmethods.net
(2) Then, after being fed with the input seed, the process steps into the next component, the Service Page Grabber. At first, this component tries to read the data from the Web based on the input seed. Then it establishes a parse tree of the read data in terms of the functions of the Beautiful Soup library. After that, the Service Page Grabber component starts to look for the service page link of each service published in the Web Service Registry by means of the functions in the html5lib library. When the service page link of a single service is found, it is first checked whether this service page link is valid or not. Once the service page link is valid, it is passed to the following two components for further processing: the WSDL Grabber component and the Property Grabber component.
(3) When the WSDL Grabber component receives a service page link from its previous component, it sets out to extract the WSDL link address of that service through the parse tree of the data in this service page. Next, this component downloads the WSDL document of that service in terms of the WSDL link address. Thereafter, the obtained WSDL document is stored on the disk. The process of the WSDL Grabber component is continually carried on until no more service links are passed to it. Certainly, not all grabbed WSDL documents are valid: they may contain bad definitions or bad namespace URIs, be empty documents or, even worse, not be in XML format at all. Hence, in order to pick them out, this component further analyzes the involved WSDL documents. All valid documents are then put into a "validWSDLs" folder, whereas the invalid documents are put into a folder named "invalidWSDLs" in order to gather statistical information. Finally, only the WSDL documents in the "validWSDLs" folder are passed on to the subsequent component.
(4) Moreover, since some Web Service Registries give additional information about the services, such as availability, service provider and version, the Property Grabber component sets out to extract this information as the service's properties and thereafter saves these properties into an INI file. However, if there is no additional information available, then there is no need to extract service properties, and thus there is no INI file for that service. In this implementation of the Pica-Pica Web Service Description Crawler, only the Python scripts for the Seekda and Service-Repository Web Service Registries have functions to extract the services' properties, while the scripts for the other three Web Service Registries have no such function.
(5) Furthermore, it is optional to create a report file which contains the statistical information of this process, such as the total number of services for one Web Service Registry, the number of services whose WSDL document is invalid, etc.
(6) As has been stated, at this point there is a folder with all valid WSDL documents and possibly also some INI files. Therefore, the task of the WSML Register component is now to generate the appropriate WSML documents from these valid WSDL documents and INI files, and then to register them in ConQo.
2.4 Conclusions of the Existing Strategies
This chapter presented three aspects of the existing strategies: the Service-Finder project, the Information Extraction technique and the Pica-Pica Web Service Description Crawler.
The task of this master program is to obtain the available Web services and their related information from the Web. This is essentially a procedure of extracting the needed information for each service, such as the service's WSDL document and its properties. Therefore, the Information Extraction technique, which is used to extract information hosted on the Web, can be used in this master program.
Moreover, Service-Finder is a large project that is not only able to obtain the available Web services and their related information from the Web, but also enriches these crawled data with annotations by means of the Service-Finder ontology and the Service Category ontology and integrates all the information into a coherent semantic model. Furthermore, it also provides capabilities for searching and browsing the data with a user interface and gives users service recommendations. However, for a master program the Service-Finder project far exceeds the requirements. Therefore, it is considered only as a reference for this master program.
Furthermore, since the Pica-Pica Web Service Description Crawler aims only at obtaining the available Web services and their related information, it fulfills the primary task of this master program. Nevertheless, regarding the information about the services, the Pica-Pica Web Service Description Crawler extracts only a few properties, and sometimes even none. Consequently, in order to improve the quality of the service descriptions, as many properties about each service as possible have to be extracted. Hence, chapter 3 presents an extension of this Pica-Pica Web Service Description Crawler.
3 Design and Implementation
In the previous chapter on the state of the art, the already existing techniques and implementations were presented. In the following, the basic principle of the proposed approach, the Deep Web Services Crawler, is introduced. It is based on these previously existing techniques, especially the Pica-Pica Web Service Description Crawler.
3.1 Deep Web Services Crawler Requirements
This section mainly discusses the goals of the Deep Web Service Crawler approach, the system requirements of the approach, and some non-functional requirements.
3.1.1 Basic Requirements for DWSC
The following list contains the basic requirements which should be achieved:
(1) Produce the largest annotated Service Catalogue. Here, the Service Catalogue is actually a list of Web services published in the Web Service Registries. It contains not only the WSDL document of each service but also a file with the information about that service. Therefore, for the purpose of producing the largest annotated Service Catalogue, this proposed approach needs to extract as many properties about these Web services as possible. Moreover, it also has to download the WSDL document hosted along with each Web service. That is to say, these properties are not only the interesting structured properties, such as the service name and its WSDL link address, but also other non-functional properties, for example the endpoint and its monitoring information.
(2) Flexible storage: each service has a Service Catalogue that contains all its interesting properties. The question is how to deal with these service properties, i.e. what kind of scheme should be used to store them. In order to store them in a flexible way, this proposed approach provides three methods for the storage: the first one stores them as an XML file, the second one stores them in an INI file, and the third one uses a database for the storage. (A small sketch of the first two methods is given below.)
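As a brief illustration of these storage methods, the following Java sketch writes one set of service properties both as an INI-style file and as an XML file, using the standard java.util.Properties class; the property names and file names are examples only and are not taken from the actual implementation. The database method would map the same key-value pairs to one row of a table.

    import java.io.FileOutputStream;
    import java.util.Properties;

    // Illustrative sketch of two of the three storage methods (INI-style file
    // and XML file); the property values below are examples only.
    public class ServicePropertyStore {

        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.setProperty("serviceName", "BLZService");
            props.setProperty("wsdlLink", "http://services.unitedplanet.de/blz/BlzService.asmx?WSDL");
            props.setProperty("provider", "unitedplanet.de");

            // Method 1: key=value pairs, comparable to an INI file
            try (FileOutputStream ini = new FileOutputStream("BLZService.ini")) {
                props.store(ini, "Service Catalogue entry");
            }

            // Method 2: the same properties stored as an XML file
            try (FileOutputStream xml = new FileOutputStream("BLZService.xml")) {
                props.storeToXML(xml, "Service Catalogue entry");
            }
            // Method 3 (database): insert the same key-value pairs as one table record.
        }
    }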
3.1.2 System Requirements for DWSC
Generally speaking, the requirements for realizing a programming project include the following:
1) Operating system: Linux/Unix, Windows (XP, Vista, 2000, etc.)
2) Programming language: C++, Java, Python, C, etc.
3) Programming tool: NetBeans, Eclipse, Visual Studio, and so on
In this master thesis, however, the scripts of the Deep Web Service Crawler approach are written in the Java programming language. Besides, these code scripts have only been tested on the Windows XP and Linux operating systems, but not on other operating systems.
3.1.3 Non-Functional Requirements for DWSC
In this part several non-functional requirements for the Deep Web Service Crawler approach are
presented
1) Transparency: the process of data exploration and data storage should be done automatically, without the user's intervention. However, at first the user should specify the path on the hard disk which will be used to store the outputs of this program.
2) Fault tolerance: during the execution of this program, errors can inevitably happen. Therefore, in order to keep the process from being interrupted, there must be some necessary error handling for process recovery. (A small retry sketch is given after this list.)
3) Completeness: this approach should extract as many of the interesting properties of each Web service as possible, e.g. endpoint and monitoring information, etc.
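As an illustration of the fault-tolerance requirement, the following is a minimal, hypothetical Java sketch that retries a page download a few times instead of letting a single network error interrupt the whole crawling process. The class and method names are illustrative and are not taken from the actual implementation.

    import java.io.IOException;
    import java.io.InputStream;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;

    // Hypothetical sketch: a failed download is retried, and after the last
    // attempt the crawler simply skips this page instead of aborting.
    public class FaultTolerantDownloader {

        public static String downloadWithRetry(String url, int maxAttempts) {
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                try (InputStream in = new URL(url).openStream()) {
                    return new String(in.readAllBytes(), StandardCharsets.UTF_8);
                } catch (IOException e) {
                    System.err.println("Attempt " + attempt + " failed: " + e.getMessage());
                }
            }
            return null; // give up on this page, but keep the crawling process running
        }
    }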
In addition, since the Pica-Pica Web Service Crawler has already implemented strategies for the following five URLs, the proposed approach must support no fewer than these five Web Service Registries:
Biocatalogue: http://www.biocatalogue.com
Ebi: http://www.ebi.ac.uk
Seekda: http://www.seekda.com
Service-Repository: http://www.service-repository.com
Xmethods: http://www.xmethods.net
3.2 Deep Web Services Crawler Architecture
In this section, an overview of the high-level architecture of the Deep Web Services Crawler approach is introduced first. Thereafter, four subsections are presented, each focusing on the outline of a single component and on how the components play together.
The current components and flows of data in the Deep Web Service Crawler can be summarized as depicted in Figure 3-1, using the continuous arrows. The crawler first tries to obtain the available service list page links and related service page links by crawling the Web (Web Service Extractor). Then those gathered links are processed in two separate steps: one gets the service's WSDL document (WSDL Grabber), and the other collects the properties of each service (Property Grabber). Finally, all these data are stored on the storage device (Storage). The whole detailed process of figure 3-1 is illustrated as follows.
• Step 1
When the Deep Web Service Crawler starts to run, the File Chooser container requires the user to specify a path on the computer or on any other hard disk, because the Deep Web Service Crawler program needs a place to store all its outputs.
• Step 2
After that, the Web Service Extractor is triggered. It is the main entry to the specific crawling process. Since the Deep Web Service Crawler program is supposed to crawl for Web Services in some given Web Service Registries, the URL addresses of these Web Service Registries should be given as the initial seeds for the Web Service Extractor process. However, since the page structures of these Web Service Registries are completely different, there is a dedicated process for each Web Service Registry.
Figure 3-1: Overview of the basic architecture of the Deep Web Services Crawler
• Step 3
According to the given seed, two types of links are obtained by the Web Service Extractor component: the service list page link and the service page link. A service list page is a page that contains a list of Web Services and possibly some information about these Web Services, while a service page is a page that contains much more information about a single service. Finally, the component forwards these two types of links to the next two components: the Property Grabber and the WSDL Grabber.
• Step 4
Then, on the one hand, the Property Grabber component tries to gather the information about the service hosted in the service list page and the service page, such as the name of the service, its description, the rating of this service, etc. Finally, all the information about the service is collected together as the service properties, which are then delivered to the Storage component for further processing.
• Step 5
On the other hand, the WSDL Grabber tries to obtain the WSDL link from the service list page or the service page. That is because for some Web Service Registries the WSDL link is hosted in the service list page, like for Biocatalogue, while for other Web Service Registries it is hosted in the service page, such as for Xmethods. After the WSDL link has been obtained, it is also transmitted to the Storage component for further processing.
• Step 6
When the service properties and the WSDL link of the service are received by the Storage component, they are stored on the disk. The service properties are stored on the disk in one of three different ways: as an XML file, as an INI file, or as one record inside a table of a database. For the WSDL link, the Storage component first tries to download the page content according to the URL address of the WSDL link. If this succeeds, the page content of the service is stored as a WSDL document on the disk.
• Step 7
Nevertheless, steps 3 to 6 describe the crawling process for only a single service. Hence, if there is more than one service, or more than one service list page, in these Web Service Registries, the crawling process from step 3 to step 6 is repeated again and again until there are no more services or service list pages in that Web Service Registry.
• Step 8
Furthermore, after the crawling process for one Web Service Registry finishes, a file is generated that contains some statistical information about this crawling process, for example the times when the crawling process of this Web Service Registry started and finished, the total number of Web services in this Web Service Registry, how many services have an empty WSDL document, the average number of service properties in this Web Service Registry, and the average time cost for extracting service properties, getting the WSDL document and generating the XML file, INI file, etc.
3.2.1 The Function of the Web Service Extractor Component
The Web Service Extractor component is responsible for gathering service list page links and service page links. It pursues a focused crawl of the Web and only forwards service list page and service page links to the subsequent components for analyzing, collecting and gathering purposes. Therefore, it identifies both service list page links and related service page links on these Web Service Registries.
As can be seen from figure 3-2, a crawl for Web Services needs to start from a seed URL. This seed is almost as important as the Web Service Extractor itself, as it highly influences the part of the Web which needs to be crawled. The seed can or shall contain, e.g., Web pages where these Web Services are published, or pages that talk about Web Services.
Figure 3-2: Overview of the process flow of the Web Service Extractor component
After being fed with the seed URL, the Web Service Extractor component starts to get the link of the service list page from the initial page of this URL seed. However, this process differs among the five Web Service Registries. The following shows the different situations in these Web Service Registries.
• Service-Repository Web Service Registry
In this Web Service Registry, the link of the first service list page is the URL address of its seed, which means some Web Services can already be found on the home page of the Service-Repository Web Service Registry. The "first" here implies that there is more than one service list page link in this Web Service Registry. Therefore, the process of getting service list page links in this registry is continually carried on until no more service list page links exist.
• Xmethods Web Service Registry
Although there are Web Services on the home page of the Xmethods Web Service Registry, they are only a small subset of those in this Web Service Registry. Moreover, in the Xmethods Web Service Registry there is only one page containing all Web Services. Therefore, the service list page link of that page has to be obtained.
• Ebi Web Service Registry
The situation in the Ebi Web Service Registry is a little bit like that in the Xmethods Web Service Registry. That is to say, there is also one page that contains all Web Services in this Web Service Registry. However, this page is not the initial page of the input seed. Therefore, more than one operation step is needed to get the service list page link of that page.
• Seekda Web Service Registry
In the Seekda Web Service Registry, the Web Services are not contained in the initial page of the input seed. The service list page link can be obtained after several additional operation steps. However, there is a problem with getting the service list page links in this registry: simply put, if there is more than one page containing Web Services, then for some unknown reason the links of the remaining service list pages cannot be obtained. In other words, only the link of the first service list page can be obtained.
• Biocatalogue Web Service Registry
The process of getting service list pages in the Biocatalogue Web Service Registry is almost the same as in the Seekda Web Service Registry. The only difference is that in the Biocatalogue Web Service Registry all service list page links can be obtained if there is more than one service list page.
Then, after getting the link of a service list page, the Web Service Extractor begins to get the link of the service page for each service listed in that service list page. The reason why it can do this is that there is an internal link for every service which addresses the service page. It is worth noting that once a service page link is obtained, this component immediately forwards the service page link and the service list page link to the subsequent two components for further processing. Nevertheless, the process of obtaining service page links is continuously carried out until all services listed in that service list page are crawled. Analogously, the process of getting service list pages is also continuously carried out until no more service list pages exist. (The overall nested loop is sketched below.)
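The following is a minimal Java sketch of this nested loop. The abstract helper methods stand in for the registry-dependent logic described above and are not the actual implementation.

    import java.util.List;

    // Illustrative sketch of the nested crawling loop of the Web Service
    // Extractor: iterate over service list pages, and over all services
    // listed on each page, until no more service list pages exist.
    public abstract class WebServiceExtractorSketch {

        protected abstract List<String> extractServicePageLinks(String listPageLink);
        protected abstract String findNextListPageLink(String listPageLink);
        protected abstract void forward(String listPageLink, String servicePageLink);

        public void crawl(String seedUrl) {
            String listPageLink = seedUrl;       // e.g. the registry's first service list page
            while (listPageLink != null) {       // until no more service list pages exist
                for (String servicePageLink : extractServicePageLinks(listPageLink)) {
                    // each service page link is forwarded immediately, together
                    // with the service list page link, to the two grabber components
                    forward(listPageLink, servicePageLink);
                }
                listPageLink = findNextListPageLink(listPageLink);
            }
        }
    }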
3.2.1.1 Features of the Web Service Extractor Component
The main features are described in the following paragraphs
1) Obtain service list page links
A central task is to obtain the corresponding service list page links. A service list page link is a URL address of a page that includes a public list of Web Services together with some simple information about these Web Services, like the name of the service and an internal URL that links to another page which contains the detailed information about that service; sometimes it may also contain the link address of the WSDL document.
2) Obtain service page links
Once service list page links are found, a crucial aspect is to extract the internal link of each service to aid the task of service information discovery. Therefore, it is the task of the Web Service Extractor to harvest the HTML page content of the service list page, so that the service page links, which contain much more detailed information about the single Web service, can be obtained.
3.2.1.2 Input of the Web Service Extractor Component
This component depends on some specific input seeds. The only input required for this component is a seed URL. Actually, this URL seed will be one of the URLs listed in section 3.1.3.
3.2.1.3 Output of the Web Service Extractor Component
The component produces two kinds of service-related page links from the Web:
- Service list page links
- Service page links
3.2.1.4 Demonstration of the Web Service Extractor
In order to give a comprehensive understanding of the process of the Web Service Extractor component, the following figures are given for explanation. Though there are five URL addresses in this section, only the URL of the Service-Repository is shown as an example.
1) The input seed is the initial URL address of the Service-Repository, which is "http://www.service-repository.com".
2) As already said in section 3.2.1, the first service list page link of this Web Service Registry is its input seed, "http://www.service-repository.com". Figure 3-3 shows the corresponding service list page of that link.
Figure 3-3: Service list page of the Service-Repository
Figure 3-4: Original source code of the internal link for the Web service "BLZService"
Figure 3-5: Code overview of getting the service page link in the Service-Repository
Figure 3-6: Service page of the Web service "BLZService"
3) Now that the service list page link is known, the next step is to acquire the service page links of the services listed in the service list page. The text in the red box of figure 3-4 shows the internal link of the Web service "BLZService". However, this is not the complete link of the service page: it has to be prefixed with the initial URL address of the Service-Repository Web Service Registry, "http://www.service-repository.com". The code for getting the service page link of a Web service in the Service-Repository is shown in figure 3-5. Therefore, the final link of this service page is "http://www.service-repository.com/service/overview-210897616". Figure 3-6 is the corresponding service page of that link. (A small sketch of this link completion is given below.)
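The following is a small, hypothetical Java sketch of this link completion. It only illustrates the prefixing rule described above and is not the code shown in figure 3-5.

    // Hypothetical sketch: complete the relative internal link of a service
    // with the registry's base URL to obtain the full service page link.
    public class ServicePageLinkBuilder {

        private static final String BASE_URL = "http://www.service-repository.com";

        public static String toServicePageLink(String internalLink) {
            // the internal link from the service list page is relative, so it
            // has to be prefixed with the initial URL of the Service-Repository
            return internalLink.startsWith("http") ? internalLink : BASE_URL + internalLink;
        }

        public static void main(String[] args) {
            // prints: http://www.service-repository.com/service/overview-210897616
            System.out.println(toServicePageLink("/service/overview-210897616"));
        }
    }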
4) Afterwards, these two links, the service list page link and the service page link, which were gathered by the Web Service Extractor component, are immediately forwarded to the next two components, the WSDL Grabber component and the Property Grabber component.
3.2.2 The Function of the WSDL Grabber Component
The WSDL Grabber component acquires the WSDL link hosted on the Web, based on the service list page link or the service page link. The whole process flow is illustrated in figure 3-7.
Figure 3-7: Overview of the process flow of the WSDL Grabber component
When the WSDL Grabber component receives the inputs delivered from the previous component, it starts to get the WSDL link for the service based on these inputs. However, although the inputs of the WSDL Grabber component are the links of the service page and the service list page, only one of them contains the WSDL link for the corresponding service. That is to say, the WSDL link exists either in the service page or in the service list page. The reason why both links need to be delivered to this component is that only one of the five Web Service Registries, namely Biocatalogue, hosts the WSDL link in the service list page, while for the other four Web Service Registries the WSDL link is hosted in the service page. Therefore, the WSDL link in these four Web Service Registries is obtained via the service page link, and for the Biocatalogue Web Service Registry the WSDL link is obtained through the service list page link. However, there is a problem with getting the WSDL links in the Biocatalogue Web Service Registry: in brief, some of the Web services listed in the service list page of the Biocatalogue Web Service Registry have no WSDL link, in other words, these services have no WSDL document. In such a situation, the value of the WSDL link of these Web services is assigned a "NULL" value. Nevertheless, for the Web Services in the other four Web Service Registries the WSDL link always exists in the service page. Eventually, whenever the WSDL Grabber component extracts the WSDL link of a single Web service, that link is immediately forwarded to the Storage component for downloading the WSDL document.
3.2.2.1 Features of the WSDL Grabber Component
The WSDL Grabber component has to provide the following feature:
- Obtain WSDL links
The WSDL link is the direct way to get to the page that contains the contents of the WSDL document. It is actually a URL address, but the end of this URL address has something like "?wsdl" or "?WSDL" to indicate that this address points to a WSDL document.
3.2.2.2 Input of the WSDL Grabber Component
This component requires the following input data:
- Service list page link
- Service page link
3.2.2.3 Output of the WSDL Grabber Component
The component only produces the following output data:
- The URL address of the WSDL link of each service
3.2.2.4 Demonstration of the WSDL Grabber Component
This section demonstrates a list of figures in order to give a comprehensive understanding of the process of the WSDL Grabber component. It uses the same Web service of the Service-Repository as an example, too.
1) The input for the WSDL Grabber component is the link of the service page obtained from the Web Service Extractor component. The address of this link is "http://www.service-repository.com/service/overview-210897616".
Figure 3-8: WSDL link of the Web service "BLZService" in the service page
2) Figure 3-9 shows the corresponding original source code of the WSDL link that is highlighted in figure 3-8.
Figure 3-9: Original source code of the WSDL link for the Web service "BLZService"
3) Figures 3-10 and 3-11 show the code used to extract the WSDL link shown in figure 3-9. However, figure 3-10 shows the particular code for the Service-Repository Web Service Registry only; for the other four Web Service Registries this code is different. The function "getServiceRepositoryWSDLLink" first gets a list of nodes that have the same HTML tag name "b". Then it checks all of the nodes one by one to see whether the text value of a node is the value "WSDL". If a node fulfills this condition, the attribute value of its sibling, which is an "a" element here, is extracted as the value of the WSDL link of this Web service. (An illustrative re-implementation is sketched after step 4.)
Figure 3-10: Code overview of the "getServiceRepositoryWSDLLink" function
Figure 3-11: Code overview of the "oneParameter" function
4) Finally, after applying the functions shown in figures 3-10 and 3-11, the WSDL link of the Web service "BLZService" can be obtained. Its value is "http://services.unitedplanet.de/blz/BlzService.asmx?WSDL".
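Since the code figures are reproduced here only as captions, the following is an illustrative re-implementation of the described logic, written with the jsoup HTML parser. The use of jsoup is an assumption; the thesis does not state which Java HTML library is used in figures 3-10 and 3-11.

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    // Illustrative re-implementation (using jsoup, an assumption) of the logic
    // described for the "getServiceRepositoryWSDLLink" function: find a "b"
    // node with text "WSDL" and take the link from its sibling "a" element.
    public class WsdlLinkExtractor {

        public static String getServiceRepositoryWsdlLink(String servicePageUrl) throws Exception {
            Document doc = Jsoup.connect(servicePageUrl).get();
            for (Element b : doc.select("b")) {            // all nodes with tag name "b"
                if ("WSDL".equals(b.text().trim())) {      // text value must be "WSDL"
                    Element sibling = b.nextElementSibling();
                    if (sibling != null && "a".equals(sibling.tagName())) {
                        return sibling.attr("href");       // the WSDL link of the service
                    }
                }
            }
            return null;                                   // no WSDL link found
        }
    }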
3.2.3 The Function of the Property Grabber Component
The Property Grabber component is a module which is used to extract and gather all the Web service information hosted on the Web, which is in fact the information shown in the service list page and the service page. In the end, all the obtained Web service information is collected together as the service properties, which are delivered to the Storage component for storing. The detailed process flow of the Property Grabber component is illustrated in figure 3-12.
The inputs of the Property Grabber component are the same as the inputs of the WSDL Grabber component. However, there is still a little difference between them with respect to the seeds. As already mentioned in section 3.2.2, for the WSDL Grabber component only one of the inputs is sufficient to get the WSDL link, while the Property Grabber component needs both of them as seeds. Therefore, once the Property Grabber component has received the needed inputs, it starts to extract the information of that single Web service.
Figure 3-12: Overview of the process flow of the Property Grabber component
After the Property Grabber component receives the inputs, it starts to extract the service information for the Web service. Generally speaking, the service information consists of four aspects: structured information, endpoint information, monitoring information and Whois information.
(1) Structured Information
The structured information can be obtained by extracting the information hosted in the service page and the service list page. It is the basic descriptive information about the service, such as the name of the service, the URL address through which the WSDL document can be obtained, the description introducing the service, the provider who provides the service, its rating, the server which owns this service, etc. However, the elements constituting this structured information are diverse across the different Web Service Registries. For example, the rating information of a Web service exists in the Service-Repository Web Service Registry, while the Xmethods Web Service Registry does not have this information. In addition, even for Web services in the same Web Service Registry, some elements of the structured information may not exist. For instance, one service in a Web Service Registry may have a description, while another service in the same Web Service Registry does not. Tables 3-1 to 3-5 show the structured information that should be extracted for these five Web Service Registries. Moreover, if the style of a Biocatalogue Web service is SOAP, there is some additional information describing the SOAP operations of this service, while if it is REST, this additional information describes the REST operations. These should also be considered as a part of the structured information. Tables 3-6 and 3-7 illustrate the information for these two different kinds of operations.
Service Name, WSDL Link, WSDL Version, Provider, Server, Rating, Homepage, Owner Homepage, Description
Table 3-1: Structured information of the Service-Repository Web Service Registry
Service Name, WSDL Link, Provider, Service Style, Homepage, Implementation Language, Description, User Description, Contributed Client Name, Type of this Client, Publisher of this Client, Used Toolkit of this Client, Used Language of this Client, Used Operating System of this Client
Table 3-2: Structured information of the Xmethods Web Service Registry
Service Name, WSDL Link, Server, Provider, Provider's Country, Service Style, Rating, Description, User Description, Service Tags, Documentation (within WSDL)
Table 3-3: Structured information of the Seekda Web Service Registry
Service Name, WSDL Link, Port Name, Service URL Address, Implementation Class
Table 3-4: Structured information of the Ebi Web Service Registry
Service Name, WSDL Link, Style, Provider, Provider's Country, View Times, Favorite Times, Submitter, Service Tags, Total Annotation, Provider Annotation, Member Annotation, Registry Annotation, Base URL, SOAP Lab Server Base URL, Description, User Description, Category
Table 3-5: Structured information of the Biocatalogue Web Service Registry
SOAP Operation Name, Inputs and Outputs, Operation Description, Operation Tags, Part of Which Service
Table 3-6: SOAP operation information of the Biocatalogue Web Service Registry
REST Operation Name, Service Tags, Used Template, Operation Description, Part of Which Service, Part of Which Endpoint Group
Table 3-7: REST operation information of the Biocatalogue Web Service Registry
(2) Endpoint Information
Endpoint information is the port information of the Web service, which can be extracted only through the service page. However, different Web Service Registries have different structures of the endpoint information for a Web service; hence some elements of the endpoint information can be very diverse. One thing needs attention: the Ebi Web Service Registry does not have endpoint information for any of the Web services published in this registry. Moreover, although the Web services in the same Web Service Registry have the same structure of endpoint information, some elements of the endpoint information may be missing or empty. Furthermore, these Web Service Registries may even have no endpoint information at all for some of the Web services they publish. Nevertheless, whenever there is endpoint information for a Web service, there is at least one element, which is the URL address of the endpoint. Table 3-8 shows the endpoint information that should be extracted from these five Web Service Registries.
Web Service Registry Name: Elements of the Endpoint Information
Service-Repository: Endpoint Name, Endpoint URL, Endpoint Critical, Endpoint Type, Bound Endpoint
Xmethods: Endpoint URL, Publisher of this Endpoint, Contact Email of this Publisher, Implementation Language of this Endpoint
Seekda: Endpoint URL
Biocatalogue: Endpoint Name, Endpoint URL
Table 3-8: Endpoint information of these five Web Service Registries
Web Service Registry Name: Elements of the Monitoring Information
Service-Repository: Service Availability, Number of Downs, Total Uptime, Total Downtime, MTTR, MTBF, RTT Max of Endpoint, RTT Min of Endpoint, RTT Average of Endpoint, Ping Count of Endpoint
Seekda: Service Availability, Begin Time of Monitoring
Biocatalogue: Monitored Status of Endpoint, Overall Status of Service
Table 3-9: Monitoring information of these five Web Service Registries
(3) Monitoring Information
Monitoring information is the tested statistic information for the Web service But it is worth
Deep Web Service Crawler
36
noting that Ebi and Xmethods Web Service Registries do not have the monitoring information of
all these Web services published by them While for other three Web Service Registries only a
few of Web services may not have this information Table 3-12 displays the monitoring
information for these three Web Service Registries
(4) Whois Information
Whois information is not extracted from the content hosted on the service page or the service list page. It is descriptive information about the service domain, which can be obtained by means of the address of the WSDL link. Therefore, the process of getting the Whois information starts by deriving the service domain first. The final value of the service domain must not contain prefixes such as "http", "https" or "www"; it must be the registrable domain directly under the top-level domain. After that, the service domain database is queried by sending the value of the service domain to a Whois client, which is simply a Web site on the Internet, for example "http://www.whois365.com/cn/domain". If information about that service domain exists, a list of its information is returned as the output. However, the structure of the returned information differs from one service domain to another. Therefore, the most challenging thing is that the extracting process has to deal with each different form of the returned information. Table 3-10 gives the Whois information that needs to be extracted for all these five Web Service Registries.
Service Domain URL | Domain Name | Domain Type
Domain Address | Domain Description | State
Postal Code | City | Country
Country Code | Phone | Fax
Email | Organization | Established Time
Table 3-10 Whois Information for these five Web Service Registries
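The derivation of the service domain described above can be sketched in a few lines of Java. The following is a minimal example; the class and method names are hypothetical, and it assumes that stripping a leading "www" from the host of the WSDL link is sufficient to reach the registrable domain:

import java.net.URL;

public class DomainExtractor {
    // A minimal sketch: take the host of the WSDL link, drop the "www"
    // prefix, and use the remainder as the service domain that is sent
    // to the Whois client. Deeper subdomains are not handled here.
    public static String getServiceDomain(String wsdlLink) throws Exception {
        String host = new URL(wsdlLink).getHost();  // e.g. "www.thomas-bayer.com"
        if (host.startsWith("www.")) {
            host = host.substring(4);
        }
        return host;                                // e.g. "thomas-bayer.com"
    }
}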
Finally, the information of these four aspects is collected together and then delivered to the Storage component for further storage processing.
3231 Features of the Property Grabber Component
The Property Grabber component has to provide the following features:
• Obtain basic information
Generally speaking, the more information a Web service has, the better one can judge how good this Web service is. Hence, the Property Grabber component needs to extract all the basic information hosted on the service list page and the service page. This basic information comprises the structured information, the endpoint information and the monitoring information.
• Obtain Whois information
Since the more information a Web service has, the better its quality can be assessed, it is necessary to extract as much information about the Web service as possible. Therefore, besides the basic information, the Property Grabber component also obtains some additional information, called Whois information, such as the type of the domain, the name of the person in charge, the postal code of the domain, the city, phone, fax, detailed address, email, etc.
3232 Input of the Property Grabber Component
This component requires the following input data:
• Service list page link
• Service page link
3233 Output of the Property Grabber Component
The component will produce the following output data:
• Structured information of each service
• Endpoint information about each service, if it exists
• Monitoring information for the service and endpoint, if it exists
• Whois information of the service domain
All this information is collected together as the properties of each service. Thereafter, the collected properties are sent to the Storage component.
3234 Demonstration for Property Grabber Component
The pictures from figure 3-13 to figure 3-16 depict the fundamental and primary procedure of the Property Grabber component. To simplify the explanation, the example shown in this section uses the same Web service as before.
1) The inputs of the Property Grabber component are the links of the service list page and the service page received from the Web Service Extractor component. These links are "http://www.service-repository.com" and "http://www.service-repository.com/service/overview/-210897616".
2) When the "GetProperty" function is triggered, it calls its four sub functions, as shown in figure 3-12.
Figure 3-13 Structure properties of the Service "BLZService" in service list page
Figure 3-14 Structure properties of the Service "BLZService" in service page
3) Firstly, the "getStructuredProperty" function tries to extract the structured information displayed in the red boxes of figure 3-13 and figure 3-14. However, several elements of the structured information have the same content, such as the description shown on the service page and on the service list page. Hence, in order to save time during the extracting process and space during the storing process, elements with the same content are only extracted once. Moreover, the rating information requires a transformation from non-descriptive to descriptive text, because its content consists of several star images. The final results of the extracted structured information for this Web service are shown in table 3-11. Because there is no descriptive information for the Provider, Homepage and Owner Homepage, their values are assigned as "NULL".
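The thesis does not show code for this rating transformation, so the following is only a minimal sketch of how such a conversion could look. It assumes the rating is rendered as repeated star images whose file names contain "star.gif" and "halfstar.gif"; both markers are assumptions, not the actual file names used by the Service Repository pages:

public class RatingConverter {
    // Count the star images in the rating HTML fragment and render the
    // result as descriptive text, e.g. "Four Stars and A Half".
    public static String ratingToText(String ratingHtml) {
        int full = countOccurrences(ratingHtml, "star.gif");
        int half = countOccurrences(ratingHtml, "halfstar.gif");
        full -= half; // "halfstar.gif" also matches "star.gif", so correct the count
        String text = numberWord(full) + (full == 1 ? " Star" : " Stars");
        return (half > 0) ? text + " and A Half" : text;
    }

    private static int countOccurrences(String haystack, String needle) {
        int count = 0;
        for (int i = haystack.indexOf(needle); i >= 0; i = haystack.indexOf(needle, i + 1)) {
            count++;
        }
        return count;
    }

    private static String numberWord(int n) {
        String[] words = {"Zero", "One", "Two", "Three", "Four", "Five"};
        return (n >= 0 && n < words.length) ? words[n] : Integer.toString(n);
    }
}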
Service Name | BLZService
WSDL Link | http://www.thomas-bayer.com/axis2/services/BLZService?wsdl
WSDL Version | 0
Server | Apache-Coyote/1.1
Description | BLZService
Rating | Four Stars and A Half
Provider | NULL
Homepage | NULL
Owner Homepage | NULL
Table 3-11 Extracted Structured Information of Web Service "BLZService"
4) Secondly, the "getEndpointProperty" function extracts the endpoint information displayed in the red box of figure 3-15. Though there is more than one endpoint record in the list, only one of them is extracted as the endpoint information. The reason is that this master program is intended to extract as much information as possible, but without storing redundant information. Therefore, only one record is extracted as the endpoint information, even if there is more than one endpoint record. Table 3-12 shows the final results of the endpoint information of this Web service.
Figure 3-15 Endpoint information of the Web service "BLZService" in service page
Endpoint Name | BLZServiceSOAP12port_http
Endpoint URL | http://www.thomas-bayer.com:80/axis2/services/BLZService
Endpoint Critical | True
Endpoint Type | production
Bound Endpoint | BLZServiceSOAP12Binding
Table 3-12 Extracted Endpoint Information of the Web service "BLZService"
5) Then it is time to extract the monitoring information by invoking the function "getMonitoringProperty". Figure 3-16 displays two types of monitoring properties: the red box above contains the monitoring information about the Web service, and the red box below lists the monitoring information of its endpoints. As already mentioned, only one endpoint statistic record is extracted. Besides, as can be seen from figure 3-16, there are two types of availability. Actually, they both represent the availability of this Web service, just like the availability shown in figure 3-14; therefore one of these availability values is sufficient. Table 3-13 shows the final results of this extracting process.
Figure 3-16 Monitoring Information of the Service "BLZService" in service page
Service Availability | 100 %
Number of Downs | 0
Total Uptime | 1 day, 19 hours, 19 minutes
Total Downtime | 0 seconds
MTBF | 1 day, 19 hours, 19 minutes
MTTR | 0 seconds
RTT Max of Endpoint | 141 ms
RTT Min of Endpoint | 0 ms
RTT Average of Endpoint | 57.7 ms
Ping Count of Endpoint | 112
Table 3-13 Extracted Monitoring Information of the Web service "BLZService"
6) Next, in order to extract the Whois properties, the "getWhoisProperty" function has to obtain the service domain from the WSDL link first. For this Web service, the obtained service domain is "thomas-bayer.com". It then sends this service domain as input to the Whois client for the querying process, which returns a list of information for that service domain, see figure 3-17. Table 3-14 shows the extracted Whois information.
Figure 3-17 Whois Information of the service domain "thomas-bayer.com"
Service Domain URL | thomas-bayer.com
Domain Name | Thomas Bayer
Domain Type | NULL
Domain Address | Moltkestr. 40
Domain Description | NULL
State | NULL
Postal Code | 54173
City | Bonn
Country | NULL
Country Code | DE
Phone | +49 228 55525760
Fax | NULL
Email | info@predic8.de
Organization | predic8 GmbH
Established Time | NULL
Table 3-14 Extracted Whois Information of service domain "thomas-bayer.com"
7) Finally, all the information of these four aspects is collected together as the service properties, and these service properties are then forwarded to the Storage component.
324 The Function of Storage Component
The WSDL link from the WSDL Grabber component is used by the Storage component to download the WSDL document from the Web and store it on disk. In addition, the service properties from the Property Grabber component are also stored directly on disk in three different manners by this Storage component. Figure 3-18 illustrates the process flow of the Storage component.
When the Storage component receives the needed inputs, the mediator function "Storager" of the Storage component is triggered. After that, it transforms the service properties into three different output formats and stores them on disk. These output formats are an XML file, an INI file and database records. Besides, it also tries to download the WSDL document from the URL address of the WSDL link and then stores the obtained WSDL document on disk too. This "Storager" function is composed of four sub functions, namely the "getWSDL", "generateXML", "generateDatabase" and "generateINI" sub functions. Each sub function is in charge of one aspect of the storage tasks.
Figure 3-18 Overview of the process flow of the Storage Component
(1) "getWSDL" sub function
The task of the "getWSDL" sub function is to download the WSDL document and then store it on disk. Above all, it has to get the content of the WSDL document. This procedure works as follows. First, the "getWSDL" sub function checks whether the value of the received WSDL link equals "NULL" or not. As already presented in section 322, if a Web service does not have a WSDL link, the value of its WSDL link is assigned as "NULL". In that case, it creates a WSDL document whose name is the service name appended with the mark "[No WSDL Document]"; obviously this document contains no content, it is an empty document. If the service does have a WSDL link, this sub function tries to connect to the Internet via the URL address of the WSDL link. Once it succeeds, the content hosted on the Web is downloaded, stored on disk and named with the name of the service. Otherwise, it creates a WSDL document whose name is prefixed with "(BAD)".
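The original implementation is only shown as a screenshot in figure 3-19, so the following is a minimal, self-contained sketch of the three cases just described. The class name and method signature are hypothetical; only the naming scheme follows the description:

import java.io.*;
import java.net.URL;

public class WsdlDownloader {
    // Handle the three cases of the "getWSDL" sub function: no WSDL link,
    // successful download, and unreachable WSDL link.
    public static void getWSDL(String serviceName, String wsdlLink, File dir) throws IOException {
        if (wsdlLink == null || wsdlLink.equals("NULL")) {
            // no WSDL link: create an empty, specially marked document
            new File(dir, serviceName + "[No WSDL Document].wsdl").createNewFile();
            return;
        }
        File target = new File(dir, serviceName + ".wsdl");
        try (BufferedReader in = new BufferedReader(
                 new InputStreamReader(new URL(wsdlLink).openStream(), "UTF-8"));
             Writer out = new BufferedWriter(new FileWriter(target))) {
            String line;
            while ((line = in.readLine()) != null) { // copy the document line by line
                out.write(line);
                out.write('\n');
            }
        } catch (IOException e) {
            // WSDL link not reachable: leave a marked, empty document behind
            target.delete();
            new File(dir, "(BAD)" + serviceName + ".wsdl").createNewFile();
        }
    }
}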
(2) "generateXML" sub function
The "generateXML" sub function takes the service properties as input, transforms them into an XML file and stores it on disk under a name consisting of the service name plus ".xml". XML stands for eXtensible Markup Language, a markup language designed to transport and store data. The first line of an XML file is the XML declaration, which defines the XML version and the encoding used. For example, <?xml version="1.0" encoding="UTF-8"?> means this is an XML file whose version is 1.0 and whose encoding is the 8-bit Unicode Transformation Format character set. Besides, an XML file contains XML elements, each spanning everything from the element's start tag to the element's end tag. Moreover, an XML element can contain other elements, simple text, or a mixture of both. However, an XML file must contain a root element as the parent of all other elements.
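As an illustration of this structure, a generated file would look roughly like the following. The element names are a plausible reconstruction, not the exact output; the actual files use the root element "service", as described in section 44:

<?xml version="1.0" encoding="UTF-8"?>
<!-- properties of one Web service (a sketch, not the exact generated file) -->
<service>
    <ServiceName>BLZService</ServiceName>
    <WSDLLink>http://www.thomas-bayer.com/axis2/services/BLZService?wsdl</WSDLLink>
    <Rating>Four Stars and A Half</Rating>
</service>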
(3) "generateINI" sub function
The "generateINI" sub function also takes the service's properties as input, but it transforms them into an INI file and stores it on disk under a name consisting of the service's name plus ".ini". "ini" stands for initialization; the INI file format is a de facto standard for configuration files. INI files are simple text files with a basic structure. Generally speaking, an INI file contains three different kinds of parts: sections, parameters and comments. A parameter is the basic element contained in an INI file. Its format is a key-value pair, which can also be called a name-value pair; the pair is delimited by an equals sign "=", and the key or name always appears to the left of the equals sign. A section is more like a room that groups all its parameters together. It always appears on a single line within a pair of square brackets "[ ]", and sections may not be nested. A comment is descriptive text that begins with a semicolon ";"; anything between the semicolon and the end of the line is ignored.
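A generated INI file would therefore look roughly like the following sketch; the section name and comment text are illustrative, not the exact output of the program:

; properties of the service BLZService
[service]
ServiceName=BLZService
WSDLLink=http://www.thomas-bayer.com/axis2/services/BLZService?wsdl
Rating=Four Stars and A Half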
(4) "generateDatabase" sub function
The inputs of the "generateDatabase" sub function are the same as for the previous two sub functions. Instead of transforming them into a file like XML or INI, the "generateDatabase" sub function turns them into data in the database by using SQL statements. SQL stands for Structured Query Language, a database language designed for accessing and manipulating data in a database. Most actions performed on a database are done with SQL statements, and the primary statements of SQL include insert into, delete, update, select, create, alter and drop. Therefore, for the purpose of transforming the service properties into database data, this sub function has to create a database first, using the "create database" statement. Then it should create a table to store the data. A table is a collection of related data entries and consists of columns and rows. Since the data for all these five Web Service Registries is not very large, one database table is enough for storing the service properties. For this to work, the field names of the service properties in the columns have to be uniform and well-defined across all five Web Service Registries. Afterwards, the service properties of each single service can be put into the table as one record with the "insert into" statement of SQL.
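The following SQL statements sketch this sequence; the database, table and column names are illustrative, not the ones used in the actual program:

-- create the database and the single, uniform table (all property columns
-- use the Text type, since the length of a property value is unknown)
CREATE DATABASE servicecrawler;
CREATE TABLE services (
    id INTEGER PRIMARY KEY,
    ServiceName TEXT,
    WSDLLink TEXT,
    Rating TEXT
);

-- insert the properties of one service as a single record
INSERT INTO services (id, ServiceName, WSDLLink, Rating)
VALUES (1, 'BLZService',
        'http://www.thomas-bayer.com/axis2/services/BLZService?wsdl',
        'Four Stars and A Half');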
3241 Features of the Storage Component
The Storage component has to provide the following features:
• Generate different output formats
The final result of this master program is to store the information of the services on disk for future work. This Storage component provides three different formats for storing the service properties of the services published in the Web Service Registries. This makes the storage of the services very flexible and also long-lived.
• Obtain the WSDL document
The important task of this component is to check whether the WSDL document can be obtained from the WSDL link, because the WSDL document plays a decisive role in determining the quality of the service. This Storage component provides the ability to deal with the different situations that occur while obtaining the WSDL document.
3242 Input of the Storage Component
This component requires the following input data:
• WSDL link of each service
• Property information of each service
3243 Output of the Storage Component
The component will produce the following output data:
• WSDL document of the service
• XML document, INI file and table records in the database
3244 Demonstration for Storage Component
The following figures describe the fundamental implementation code of this Storage component.
The detailed depiction is given below.
1) As can be seen from figures 3-19 to 3-23, there are several commonalities in the implementation code. The first one concerns the parameters defined in each of these sub functions, namely "path" and "SecurityInt". The parameter "path" is the absolute path on the computer disk; it is used for the storing procedure and has already been specified by the user at the beginning of the whole program. The parameter "SecurityInt" is an increasing Integer that is used as part of the name of the service. The reason for this is that it prevents services with the same name from overriding each other on disk. The content marked in red in the code of these figures is the second commonality: its function is to create a file or document for the service with the corresponding parameters.
2) Figure 3-19 displays the implementation code of the "getWSDL" sub function, which is designed to get the WSDL document of a service based on its WSDL link. The parameter "name" is the name of the service, which is used as the name of the WSDL document. The parameter "linkStr" is the actual WSDL link, the most important parameter for this sub function. The other two parameters, "statistic" and "log", are objects for the text files called "Statistic Information" and "Log Information" respectively. The "Statistic Information" text file is used to record the statistic data of the services in each Web Service Registry, such as the overall number of properties for that Web Service Registry, the overall number of services, the number of services that have no WSDL link, the number of services whose WSDL documents contain no content, the number of services whose WSDL links are not available, etc. The "Log Information" text file records the results of the process steps and the problems encountered, for example which service is being crawled now, which Web Service Registry it belongs to, the reason why the WSDL document of a service cannot be obtained, the reason why a service is unreachable, and so on.
Figure 3-19 Implementation code for getting WSDL document
3) Figure 3-20 and figure 3-21 show the code that turns the service properties into the XML file and the INI file, and stores those two files on disk thereafter. The parameter "vec" is a Vector of the "PropertyStruct" data type. This "PropertyStruct" data type is a class consisting of two variables, name and value.
Figure 3-20 Implementation code for generating XML file
Figure 3-21 Implementation code for generating INI file
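Since the two figures show the original code only as screenshots, the following sketch illustrates the data type and the INI generation. Only the names "PropertyStruct", "vec", "name", "value", "path" and "SecurityInt" are taken from the text; everything else is illustrative:

import java.io.*;
import java.util.Vector;

class PropertyStruct {
    String name;
    String value;
    PropertyStruct(String name, String value) { this.name = name; this.value = value; }
}

public class IniGenerator {
    // Write one key-value pair per line, as described in section 324 (3).
    public static void generateINI(Vector<PropertyStruct> vec, String path,
                                   String name, int securityInt) throws IOException {
        // the increasing Integer keeps equally named services from overriding each other
        File file = new File(path, securityInt + name + ".ini");
        try (PrintWriter out = new PrintWriter(new FileWriter(file))) {
            out.println("; properties of the service " + name);
            out.println("[service]");
            for (PropertyStruct p : vec) {
                out.println(p.name + "=" + p.value);
            }
        }
    }
}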
4) The code in figures 3-22 and 3-23 shows the process that turns the service properties of all services in these five Web Service Registries into records of the database. To do so, it has to create a database first. The name of the database can be arbitrary as long as it conforms to the naming rules of the database; the same holds for the name of the table. Figure 3-22 displays the procedure of creating a table in the database. Because it is hard to decide on a length for each service property, the data types of all service properties are set to "Text". Figure 3-23 is the code for inserting the service properties into the table with the "insert into" statement.
Figure 3-22 Implementation code for creating table in database
Figure 3-23 Implementation code for generating table records
33 Multithreaded Programming for DWSC
Multithreaded programming is a built-in characteristic of the Java language. A multithreaded program contains two or more separate parts that can execute concurrently; each such part is called a thread. The use of multithreading makes it possible to create programs that use the system resources efficiently, for example by making maximum use of the CPU, because the idle time of the CPU can be kept to a minimum.
In this master program there are five Web Service Registries whose services need to be crawled. Moreover, the number of services published in each Web Service Registry differs considerably, which makes the running time spent on each Web Service Registry different as well. As a consequence, a Web Service Registry that owns fewer services would have to wait until another Web Service Registry with many more services finishes. Therefore, in order to reduce this waiting time and maximize the use of the system resources, it is necessary to apply multithreaded programming to this master program. That is to say, this master program creates one thread for each Web Service Registry, and these threads are executed independently, as sketched below.
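A minimal sketch of this thread-per-registry design; the "crawl" method is a hypothetical placeholder for the per-registry processing:

public class CrawlerThreads {
    public static void main(String[] args) throws InterruptedException {
        String[] registries = {"Service Repository", "Ebi", "Xmethods", "Seekda", "Biocatalogue"};
        Thread[] threads = new Thread[registries.length];
        for (int i = 0; i < registries.length; i++) {
            final String registry = registries[i];
            threads[i] = new Thread(new Runnable() {
                public void run() {
                    crawl(registry); // each registry is crawled independently
                }
            });
            threads[i].start();
        }
        for (Thread t : threads) {
            t.join(); // wait until all five registries are finished
        }
    }

    static void crawl(String registry) {
        System.out.println("crawling " + registry);
    }
}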
34 Sleep Time Configuration for Web Service Registries
Since this master program is intended to download the WSDL documents and extract the service information of the Web services published in a Web Service Registry, it inevitably affects the performance of that Web Service Registry. In addition, in order not to exceed their throughput capacity, these Web Service Registries restrict the rate of access. Because of that, unknown errors can occur while this master program is executing: for instance, the program may halt at one point without getting any more WSDL documents or service information, the WSDL documents of some services may not be obtainable, or some service information may be missing. Therefore, in order to obtain the largest possible set of Web services published in these five Web Service Registries without affecting their throughput, the access rate has to be configured for each service of all Web Service Registries.
Consequently, before entering the essential procedure for each single service of these Web Service Registries, this master program calls the system's built-in function "sleep(long milliseconds)". It is a public static function that causes the currently executing thread to sleep for the specified number of milliseconds; in other words, it temporarily ceases execution for a while. The following table shows the time interval of the sleep function for each Web Service Registry.
Web Service Registry Name | Time Interval (milliseconds)
Service Repository | 8000
Ebi | 3000
Xmethods | 10000
Seekda | 20000
Biocatalogue | 10000
Table 3-15 Sleep Time of these five Web Service Registries
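A minimal sketch of this rate limiting, here with the configured interval of the Seekda registry; the class and method names are illustrative:

public class RateLimitedCrawl {
    static final long SEEKDA_SLEEP_MILLIS = 20000; // see table 3-15

    static void crawlService(String servicePageLink) throws InterruptedException {
        Thread.sleep(SEEKDA_SLEEP_MILLIS); // pause before each single service
        // ... fetch and process the service page afterwards ...
    }
}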
4 Experimental Results and Analysis
This chapter shows the quantitative experimental results of the prototype presented in chapter 3, and describes and explains the analysis of these results. In order to obtain reasonably accurate results, the experiments were carried out more than five times; all data displayed in the following tables and charts are the averages of these runs.
41 Statistic Information for Different Web Service Registries
This section discusses the amount statistics of the Web services published in these five Web Service Registries. This includes the overall number of Web services published in each Web Service Registry and the number of unavailable Web services, which have been archived because they may not be active anymore or are close to becoming inactive. Table 4-1 shows the service amount statistics of these five Web Service Registries.
Web Service Registry Name | Service Repository | Ebi | Xmethods | Seekda | Biocatalogue
Overall Services | 57 | 289 | 382 | 853 | 2567
Unavailable Services | 0 | 0 | 0 | 0 | 125
Table 4-1 Service amount statistic of these five Web Service Registries
In order to give an intuitive view of the service amount statistics in these five Web Service Registries, the data of table 4-1 are also shown as a bar chart, see figure 4-1. As can be seen from the bar chart, on the one hand, the overall number of Web services ascends from the Service Repository Web Service Registry to the Biocatalogue Web Service Registry. The Biocatalogue Web Service Registry owns by far the largest number of Web services, which indicates that it has a much more powerful ability to provide Web services to users, because it contains far more services than the other four Web Service Registries. On the other hand, there are no unavailable services in any Web Service Registry except the Biocatalogue Web Service Registry. That is to say, only the Biocatalogue Web Service Registry contains Web services that cannot be used by the users anymore. To some degree this is wasteful, because these services cannot be used and still consume network resources on the Web. Therefore, all these unavailable services should be eliminated in order to reduce the waste of network resources.
Figure 4-1 Service amount statistic of these five Web Service Registries
42 Statistic Information for WSDL Document
Web Service Registry Name | Service Repository | Ebi | Xmethods | Seekda | Biocatalogue
Failed WSDL Links | 1 | 0 | 23 | 145 | 32
Without WSDL Links | 0 | 0 | 0 | 0 | 16
Empty Content | 0 | 0 | 2 | 0 | 2
Table 4-2 Statistic information for WSDL Document
Table 4-2 and figure 4-2 show the error statistics of the WSDL documents of the Web services in these five Web Service Registries. There are three aspects. The first one is the number of "Failed WSDL Links" of the Web services in these Web Service Registries, i.e. the overall number of Web services whose WSDL links are invalid. In other words, it is impossible to retrieve the WSDL documents of these Web services via the URL addresses of their WSDL links, and therefore no WSDL document is created for them. The second aspect is the number of Web services "Without WSDL Links", i.e. the overall number of Web services in each Web Service Registry that have no WSDL link at all. For such Web services the value of the WSDL link is "NULL". A WSDL document is still created, but it has no content and its name contains the string "[No WSDL Document]". The third aspect is the "Empty Content", which represents the overall number of Web services that do have WSDL links whose URL addresses
are valid, but whose WSDL documents contain no content. In this case a WSDL document whose name contains the string "(BAD)" is created.
Figure 4-2 Statistic information for WSDL Document
43 Comparison of Different Average Number of Service Properties
This section compares the average number of service properties per Web service in these five Web Service Registries. The average number of service properties is calculated by means of the following equation:

ASP = ONSP / ONS    (1)

Where
ASP is the average number of service properties for one Web Service Registry
ONSP is the overall number of service properties in one Web Service Registry
ONS is the overall number of Web services that have already been crawled from the corresponding Web Service Registry
Figure 4-3 shows the average number of service properties per Web service in these five Web Service Registries. As mentioned before, one of the measures for the quality of the Web services in a Web Service Registry is the service information: the more information is available about a Web service, the better a user knows that service, and consequently the better the quality the corresponding Web Service Registry can offer to its users. As seen in figure 4-3, the Service Repository and Biocatalogue Web Service Registries own a larger number of service properties than the other three Web Service Registries. This directly reflects that these two Web Service Registries provide more detailed information about the Web services published in them, so that users can more easily choose the services they need and are more likely to use Web services published there. By contrast, the Xmethods and Seekda
Web Service Registries, which have less service information about their Web services, offer less quality for these Web services. Therefore, users may be less inclined to use the Web services provided in these two Web Service Registries, not to mention the Web services published in the Ebi Web Service Registry.
Figure 4-3 Average Number of Service Properties (Service Repository 23, Ebi 7, Xmethods 17, Seekda 17, Biocatalogue 32)
From the description presented in section 323, the causes of the differing numbers of service properties in these Web Service Registries may consist of the following points. First, the amount of structured information differs between these five Web Service Registries, and part of the information of some Web services in a Web Service Registry may be missing or have empty values; for example, the amount of structured information to be extracted for the Web services in the Ebi Web Service Registry differs greatly from that in the Biocatalogue Web Service Registry. Secondly, there is a certain amount of endpoint information in almost all Web Service Registries except the Ebi Web Service Registry; its absence reduces the overall number of service properties accordingly. Thirdly, some Web Service Registries, such as Xmethods and Ebi, have no monitoring information, whereas the Service Repository Web Service Registry in particular has a large amount of monitoring information that can be extracted from the Web. The last point, obviously, is the amount of Whois information for these Web services: if the database of the Whois client does not contain information about the service domain of a Web service, then no Whois information can be extracted. Moreover, even if there is information about the service domain, its amount can be very diverse. Therefore, if many service domains of the Web services in a registry have no or only little Whois information, the average number of service properties in that registry decreases greatly.
As a result, in order to help users get better acquainted with the Web services provided in a Web Service Registry and to distinguish these Web services, the Web Service Registry should do its best to offer more and more information for each of its published Web services.
44 Different Outputs of Web Services
The fundamental task of this master program is to obtain the WSDL documents of the Web services hosted in the Web Service Registries, as well as to extract and gather the properties of these Web services and thereafter store them on disk. Therefore, this section describes the different outputs of this master program, which include the WSDL documents of the Web services, the generated XML and INI files, and the data records of the service properties in the database.
Figure 4-4 WSDL Document format of one Web service
The WSDL document of a Web service is simply read from the Web according to the URL address of its WSDL link, and these data are then stored on disk as the WSDL document. The name of the WSDL document is the service name plus the ending ".wsdl", which marks it as a WSDL document. In order to distinguish WSDL documents whose names would be the same while their contents differ, the name of each obtained WSDL document in one Web Service Registry additionally contains a unique Integer in front of the name. Figure 4-4 shows the valid WSDL document format of one Web service; its name is "1BLZService.wsdl".
Regarding the obtained service properties, they are transformed into an XML file, an INI file and data records in the database. Figure 4-5, figure 4-6 and figure 4-7 show these three different output formats respectively. As can be seen from figure 4-5, this is the INI file of the Web service, named "1BLZService.ini". The Integer is the same as in the WSDL document, because both materials belong to the same Web service. The first three lines in that INI file are service comments, which run from the semicolon to the end of the line; they are basic information describing this INI file. The following line is the section, enclosed in a pair of brackets. It is important because it indicates that the lines behind it are the information of this Web service. The rest of the lines are the actual service information, each given as a key-value pair with an equals sign between the two parts. Each service property is displayed from the beginning of the line.
Figure 4-5 INI File format of one Web service
Figure 4-6 XML File format of one Web service
Figure 4-7 Database data format for all Web services
Furthermore, figure 4-6 shows the generated XML file of the Web service "BLZService"; its name is "1BLZService.xml". Needless to say, this XML file belongs to the materials of the same Web service. Though the format of the XML file differs from that of the INI file, their essential contents are the same; that is, the values of the service properties do not differ, because both files are generated from the same collection of properties of the same Web service. The XML file also has some comments, like the INI file, which are displayed between "<!--" and "-->". The section of the INI file corresponds to the root of the XML file: all values of the elements under the root "service" in this XML file are the values of the service properties of this Web service.
Finally, as can be seen from figure 4-7, there is one database table that is used to store the service information of all Web services in these five Web Service Registries. The entire service information of one Web service forms exactly one record in this table. Because of that, the column names of the table are the union of the names of the service information in each Web Service Registry. However, since the column names of a table must be unique, the redundant names in this union must be eliminated. This is sensible and possible because the names of the service information are well-defined and uniform for all these five Web Service Registries. In addition, the first column of this table is the primary key, an increasing Integer; its function is similar to the Integer contained in the names of the XML and INI files. The remaining columns of the table are the corresponding service properties of the Web services. The value "NULL" or "Null" in this table indicates that this property of that Web service is empty or missing.
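A minimal sketch of building this unified column set, i.e. the union of all property names across the registries with duplicates eliminated; the class and method names are illustrative:

import java.util.*;

public class ColumnUnion {
    public static List<String> unifiedColumns(List<List<String>> perRegistryNames) {
        // LinkedHashSet keeps insertion order and drops redundant names automatically
        Set<String> union = new LinkedHashSet<String>();
        for (List<String> names : perRegistryNames) {
            union.addAll(names);
        }
        List<String> columns = new ArrayList<String>();
        columns.add("id"); // the increasing Integer primary key comes first
        columns.addAll(union);
        return columns;
    }
}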
45 Comparison of Average Time Cost for Different Parts of Single Web Service
This section compares the average time cost of the different parts of getting one single Web service in each of these five Web Service Registries. At first, the average time cost of getting one single service in a Web Service Registry has to be calculated, which is done with the following equation:

ATC = OTS / ONS    (2)

Where
ATC is the average time cost for one single Web service
OTS is the overall time cost of all the Web services in one Web Service Registry
ONS is the overall number of Web services that have already been crawled from the corresponding Web Service Registry
In addition, the different parts of the average time cost for getting one single service consist of the following six aspects: the average time cost for extracting the service properties, the average time cost for obtaining the WSDL document, the average time cost for generating the XML file, the average time cost for generating the INI file, the average time cost for inserting the service properties into the table of the database, and the average time cost for the remaining procedures, such as getting the service list page link, getting the service page link, and so on. The average time cost for extracting the service properties is obtained by means of the following equation:

ATCSI = OTSSI / ONS    (3)
Where
ATCSI is the average time cost for extracting the service properties of one single Web service
OTSSI is the overall time cost for extracting the service properties of all the Web services in one Web Service Registry
ONS is the overall number of Web services that have already been crawled from the corresponding Web Service Registry
The calculation of the other parts is similar to the equation for the average time cost for extracting the service properties, while the average time cost for the remaining procedures equals the average time cost for one single Web service minus the sum of the average time costs of the other five parts.
Web Service Registry Name | Service Property | WSDL Document | XML File | INI File | Database | Others | Overall
Service Repository | 8801 | 918 | 2 | 1 | 53 | 267 | 10042
Ebi | 699 | 82 | 2 | 1 | 28 | 11 | 823
Xmethods | 5801 | 1168 | 2 | 1 | 45 | 12 | 7029
Seekda | 5186 | 1013 | 2 | 1 | 41 | 23 | 6266
Biocatalogue | 39533 | 762 | 2 | 1 | 66 | 1636 | 42000
Table 4-3 Average time cost information (in milliseconds) for all Web Service Registries
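As a consistency check, the six parts of each row indeed add up to the overall value; for example, for the Service Repository registry:

8801 + 918 + 2 + 1 + 53 + 267 = 10042 milliseconds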
Table 4-3 displays the average time cost of one Web service and its different parts in all these five Web Service Registries. The first column of table 4-3 contains the names of these five Web Service Registries, and the last column is the average time cost of a single service in that Web Service Registry. The remaining columns are the average time costs of the six different parts. In order to give an intuitive view of the data in table 4-3, the data of each column are also illustrated with corresponding figures, see figure 4-8 to figure 4-13.
Figure 4-8 Average time cost for extracting service property in all Web Service Registries
As can be seen from figure 4-8, the average time cost for extracting the service properties in the Biocatalogue Web Service Registry is 39533 milliseconds. This value is much larger than in the other four Web Service Registries, which take 8801, 699, 5801 and 5186 milliseconds for Service Repository, Ebi, Xmethods and Seekda respectively. That is to say, it takes much longer to extract the service properties of the Web services published by the Biocatalogue Web Service Registry. This also indirectly indicates that the Biocatalogue Web Service Registry has the highest average number of service properties, which was already discussed in section 43; conversely, the average number of service properties in the Ebi Web Service Registry is the smallest. Moreover, the average time cost in the Xmethods Web Service Registry is larger than in the Seekda Web Service Registry, although, as already shown, the average number of service properties is the same for these two Web Service Registries. One cause that might explain why Xmethods costs more time than Seekda is that the process of extracting the service properties in the Xmethods Web Service Registry has to use both the service page and the service list page, while for the Seekda Web Service Registry only the service page link is needed.
The average time cost for obtaining the WSDL document in all these five Web Service Registries is displayed in figure 4-9. This average time cost is the sum of the average time for extracting the WSDL link of a Web service and the average time for reading the data of the WSDL document from the Web and storing it on disk. As seen in this figure, the average time cost for obtaining the WSDL document in the Xmethods Web Service Registry is the largest, namely 1168 milliseconds. Although the process of extracting the WSDL link costs a certain amount of time, it does not have a significant influence on the total average time spent on obtaining the WSDL document, because the WSDL link of a Web service is almost always obtained in one step. Therefore, this implies that the average size of the WSDL documents in the Xmethods Web Service Registry is larger than in the other four Web Service Registries, in particular the Ebi Web Service Registry, which takes just 82 milliseconds for the process of obtaining the WSDL document.
Figure 4-9 Average time cost for obtaining WSDL document in all Web Service Registries
Figure 4-10, figure 4-11 and figure 4-12 show the average time cost of generating the three different outputs in all these five Web Service Registries. As can be seen from figures 4-10 and 4-11, the average time for generating the XML file of one Web service is the same for all these five Web Service Registries, namely only 2 milliseconds; the average time for generating the INI file of one Web service is likewise the same everywhere, with a value of just 1 millisecond. Even the sum of these two average time costs is so small that it can be omitted when comparing it to the overall average time cost of getting one Web service in the corresponding Web Service Registry, shown in figure 4-13. This implies that the generation of the XML and INI files finishes almost immediately after receiving the service properties of a Web service as input. Furthermore, as can be seen from figure 4-12, although the average time cost of creating the database record of a Web service is larger than the time for generating the XML and INI files in all these five Web Service Registries, the operation of creating a database record is still fast.
Figure 4-10 Average time cost for generating XML file in all Web Service Registries
Figure 4-11 Average time cost for generating INI file in all Web Service Registries
Figure 4-12 Average time cost for creating database record in all Web Service Registries
Figure 4-13 Average time cost for getting one Web service in all Web Service Registries
Figure 4-13 gives the average time cost for getting one single Web service in all these five Web Service Registries. Without any doubt, the Biocatalogue Web Service Registry takes the longest time for this process. This is because, as the presentation of the different parts above shows, each part needs more time in the Biocatalogue Web Service Registry, except for the process of obtaining the WSDL document, where Biocatalogue does not cost the most time. Moreover, there is a remarkable observation when looking at figures 4-8, 4-12 and 4-13: the shapes of these curves follow almost the same trend. This further indicates that a Web Service Registry that spends more time on getting the description information of a Web service also offers more information about that Web service.
5 Conclusion and Further Direction
This master thesis provides a schema that aims to explore the description information of the Web services hosted in different Web Service Registries. The description information of a Web service consists of the WSDL document and the service information about that Web service. Although there is an existing approach, the Pica-Pica Web Service Description Crawler, that can be used to obtain the WSDL documents and service information of the Web services in these Web Service Registries, its functionality is restricted: it explores only a small subset of the Web services hosted in these Web Service Registries, the storage of these Web services is not flexible, and, most importantly, only little service information of a Web service is extracted, and for some Web Service Registries no service information is extracted at all. The work presented in this master thesis, however, is able to explore all Web services in these Web Service Registries. Moreover, with the approach presented in this master thesis, as much service information of each Web service as possible is extracted, so that the final result is the largest annotated service catalogue ever produced. Furthermore, regarding the storage of the description information of the Web services, this master thesis provides three different ways that guarantee not only the completeness but also the longevity of the description information of the Web services.
However, in the implementation performed in this master thesis, the Whois client used for querying the information of the service domain returns free text if the information exists, and sometimes this free text differs completely between domains. As a consequence, during the experiment stage each Web service in all Web Service Registries had to be crawled at least once, so that all the variants of this free text could be foreseen and processed afterwards. This is a huge effort, because there are lots of Web services in these Web Service Registries. Therefore, in order to simplify this work, another Whois client that eases the processing needs to be found and used.
Moreover, in the experiment stage of this master thesis, the time cost of getting a Web service is still large. For the purpose of reducing this time, multithreaded programming could also be applied to some parts of the process of getting one Web service.
Although the work performed here is specialized to only these five Web Service Registries, the main parts of the principles used here are adaptable to other Web Service Registries, with only small changes to the implementation code or the structure.
6 Bibliography
[1] Nathalie Steinmetz, Holger Lausen, Dario Cerizza, Andrea Turati, Irene Celino, Adam Funk and Michael Erdmann, "D1.1 – Requirement Analysis and Architectural Plan", Emanuele Della Valle (CEFRIEL), June 27, 2008. Available from http://www.service-finder.eu/deliverables/public/7-public/32-d11
[2] Nathalie Steinmetz, Holger Lausen, Dario Cerizza, Andrea Turati, Irene Celino, Adam Funk and Michael Erdmann, "D1.2 – First Design of Service-Finder as a Whole", Emanuele Della Valle (CEFRIEL), July 1, 2008. Available from http://www.service-finder.eu/deliverables/public/7-public/37-d12-first-design-of-service-finder-as-a-whole
[3] Nathalie Steinmetz, Holger Lausen, Irene Celino, Dario Cerizza, Saartje Brockmans, Adam Funk, "D1.3 – Revised Requirement Analysis and Architectural Plan", Emanuele Della Valle (CEFRIEL), April 1, 2009. Available from http://www.service-finder.eu/deliverables/public/7-public/57-d13-revised-requirement-analysis-and-architectural-plan
[4] Chia-Hui Chang, Mohammed Kayed, Moheb Ramzy Girgis, Khaled Shaalan, "A Survey of Web Information Extraction Systems", Volume 18, Issue 10, IEEE Computer Society, pp. 1411-1428, October 2006.
[5] Leonard Richardson, "Beautiful Soup Documentation", October 13, 2008. Available from http://www.crummy.com/software/BeautifulSoup/documentation.html
[6] Hao He, Hugo Haas, David Orchard, "Web Services Architecture Usage Scenarios", February 11, 2004. Available from http://www.w3.org/TR/ws-arch-scenarios/
[7] Stephen Soderland, "Learning Information Extraction Rules for Semi-Structured and Free Text", Machine Learning, Volume 34, Issue 1-3, Department of Computer Science and Engineering, University of Washington, Seattle, pp. 233-272, February 1999.
[8] Ian Hickson, "A Vocabulary and Associated APIs for HTML and XHTML", World Wide Web Consortium, Working Draft WD-html5-20100624, January 22, 2008.
[9] Holger Lausen, Jos de Bruijn, Axel Polleres, Dieter Fensel, "The Web Service Modeling Language WSML", WSML Deliverable D16.1 v0.2, March 20, 2005. Available from http://www.wsmo.org/TR/d16/d16.1/v0.2/
[10] Dumitru Roman, Holger Lausen, Uwe Keller, "Web Service Modeling Ontology - Standard (WSMO-Standard)", WSMO Deliverable D2, version 1.1, March 06, 2004. Available from http://www.wsmo.org/TR/d2/v1.1/
[11] Iris Braun, Anja Strunk, Gergana Stoyanova, Bastian Buder, "ConQo – A Context- and QoS-Aware Service Discovery", TU Dresden, Department of Computer Science, in Proceedings of WWW/Internet 2008.
7 Appendixes
There are two additional outputs of this master program, the log information file and the statistic report file. Figure 8-1 shows one of the basic output formats of the log information for these five Web Service Registries.
Figure 8-1 Log information of the "Service Repository" Web Service Registry
Figure 8-2 Statistic information of the "Service Repository" Web Service Registry
Figures 8-2 to 8-6 show the output format of the statistic information for the "Service Repository", "Ebi", "Xmethods", "Seekda" and "Biocatalogue" Web Service Registries respectively.
Figure 8-3 Statistic information of the "Ebi" Web Service Registry
Figure 8-4 Statistic information of the "Xmethods" Web Service Registry
Figure 8-5 Statistic information of the "Seekda" Web Service Registry
Figure 8-6 Statistic information of the "Biocatalogue" Web Service Registry
Table of Figures
Figure 2-1 Dataflow of Service-Finder and Its Components 12
Figure 2-2 Left is the free text input type and right is its output 16
Figure 2-3 A semi-structured page containing data records (in rectangular box) to be extracted 16
Figure 2-4 Architecture of the Pica-Pica Web Service Description Crawler 20
Figure 3-1 Overview of the Basic Architecture for the Deep Web Services Crawler 25
Figure 3-2 Overview of the process flow of the Web Service Extractor Component 27
Figure 3-3 Service list page of the Service-Repository 29
Figure 3-4 Original source code of the internal link for Web service "BLZService" 29
Figure 3-5 Code Overview of getting service page link in Service Repository 29
Figure 3-6 Service page of the Web service "BLZService" 29
Figure 3-7 Overview of the process flow of the WSDL Grabber Component 30
Figure 3-8 WSDL link of the Web service "BLZService" in the service page 31
Figure 3-9 Original source code of the WSDL link for Web service "BLZService" 32
Figure 3-10 Code overview of "getServiceRepositoryWSDLLink" function 32
Figure 3-11 Code overview of "oneParameter" function 32
Figure 3-12 Overview of the process flow of the Property Grabber Component 33
Figure 3-13 Structure properties of the Service "BLZService" in service list page 37
Figure 3-14 Structure properties of the Service "BLZService" in service page 38
Figure 3-15 Endpoint information of the Web service "BLZService" in service page 38
Figure 3-16 Monitoring Information of the Service "BLZService" in service page 39
Figure 3-17 Whois Information of the service domain "thomas-bayer.com" 40
Figure 3-18 Overview of the process flow of the Storage Component 41
Figure 3-19 Implementation code for getting WSDL document 44
Figure 3-20 Implementation code for generating XML file 44
Figure 3-21 Implementation code for generating INI file 45
Figure 3-22 Implementation code for creating table in database 45
Figure 3-23 Implementation code for generating table records 46
Figure 4-1 Service amount statistic of these five Web Service Registries 49
Figure 4-2 Statistic information for WSDL Document 50
Figure 4-3 Average Number of Service Properties 51
Figure 4-4 WSDL Document format of one Web service 52
Figure 4-5 INI File format of one Web service 53
Figure 4-6 XML File format of one Web service 53
Figure 4-7 Database data format for all Web services 53
Figure 4-8 Average time cost for extracting service property in all Web Service Registries 55
Figure 4-9 Average time cost for obtaining WSDL document in all Web Service Registries 56
Figure 4-10 Average time cost for generating XML file in all Web Service Registries 57
Figure 4-11 Average time cost for generating INI file in all Web Service Registries 57
Figure 4-12 Average time cost for creating database record in all Web Service Registries 58
Figure 4-13 Average time cost for getting one Web service in all Web Service Registries 58
Table of Tables
Table 3-1 Structured Information of the Service-Repository Web Service Registry 34
Table 3-2 Structured Information of the Xmethods Web Service Registry 34
Table 3-3 Structured Information of the Seekda Web Service Registry 34
Table 3-4 Structured Information of the Ebi Web Service Registry 34
Table 3-5 Structured Information of the Biocatalogue Web Service Registry 34
Table 3-6 SOAP Operation Information of the Biocatalogue Web Service Registry 35
Table 3-7 REST Operation Information of the Biocatalogue Web Service Registry 35
Table 3-8 Endpoint Information of these five Web Service Registries 35
Table 3-9 Monitoring Information of these five Web Service Registries 35
Table 3-10 Whois Information for these five Web Service Registries 36
Table 3-11 Extracted Structured Information of the Web service "BLZService" 38
Table 3-12 Extracted Endpoint Information of the Web service "BLZService" 39
Table 3-13 Extracted Monitoring Information of the Web service "BLZService" 39
Table 3-14 Extracted Whois Information of the service domain "thomas-bayer.com" 40
Table 3-15 Sleep Time of these five Web Service Registries 47
Table 4-1 Service amount statistics of these five Web Service Registries 48
Table 4-2 Statistical information for WSDL documents 49
Table 4-3 Average time cost information for all Web Service Registries 55
Table of Abbreviations
DTD Document Type Definition
DWSC Deep Web Service Crawler
HTML HyperText Markup Language
MTBF Mean Time Between Failures
MTTR Mean Time To Recovery
QoS Quality of Service
REST Representational State Transfer
RTT Round-Trip Time
SOAP Simple Object Access Protocol
SQL Structured Query Language
URL Uniform Resource Locator
WHATWG Web Hypertext Application Technology Working Group
WSDL Web Services Description Language
WSML Web Service Modeling Language
WSMO Web Service Modeling Ontology
XML eXtensible Markup Language
these problems. In this section, the basic principle of the Deep Web Service Crawler approach is presented.

At first, a brief introduction is given of how the Deep Web Service Crawler approach addresses these problems. As has already been mentioned, each Web Service Registry offers Web Services, and each Web Service Registry has its own HTML page structures; these structures may be the same or completely different. Therefore, the first thing is to identify which Web Service Registry is going to be explored. Since each Web Service Registry owns a unique URL, this job can be done by directly analyzing the corresponding URL address of that Web Service Registry. After identifying which Web Service Registry is to be explored, the following step is to obtain all the Web Services that are published in that Web Service Registry. Then, with all these obtained Web Services, it is time to extract, analyze, and gather their information. That information can be in a structured format or even in an unstructured format. In this master thesis, some Deep Web analysis techniques are applied to obtain this information, so that the annotated information about each Web Service becomes as large as possible. Last but not least, all the information about the Web Services needs to be stored.
1.3 Goals of this Master Thesis

The goals of this master thesis are the following:

• Produce the largest annotated Service Catalogue
A Service Catalogue is a list of service properties. The more properties a service has, the larger the Service Catalogue it owns. Therefore, this master program should extract as many service properties as possible.

• Flexible storage of the metadata of each service as annotations or dedicated documents
The metadata of a service includes not only the WSDL document but also the service properties. All these metadata are important information about the service. Therefore, this master program should provide flexible ways to store these metadata on disk.

• Improve the comparability of the Web Services across different Web Service Registries
The names of the service properties in one Web Service Registry can be different from those in another Web Service Registry. Hence, for the purpose of improving comparability, all these service property names should be unified and well-defined.
1.4 Outline of this Master Thesis

In this chapter, the motivation, objective, and initial approach plan have already been discussed. The remainder of this thesis is structured as follows.

First, chapter 2 presents work that is based on existing techniques. Section 2.1 gives a detailed introduction to the technique of the Service-Finder project. Then, in section 2.2, the Information Extraction technique is presented. After that, section 2.3 introduces and discusses the already implemented Pica-Pica Web Service Description Crawler.

Chapter 3 then explains the design details of this Deep Web Service Crawler approach. Section 3.1 gives a short description of the different requirements of this approach. Next, in section 3.2, the actual design of the Deep Web Service Crawler is presented. Then, sections 3.3 and 3.4 introduce the multithreaded programming and the sleep time configuration used in this master program, respectively.

Chapter 4 presents the experiments of this Deep Web Service Crawler approach and then gives an evaluation of it.

Finally, chapter 5 presents the conclusion and discussion of the work already done, as well as the future work for this master task.
2 State of the Art

This chapter aims at presenting existing techniques and strategies related to the work of applying this Deep Web Service Crawler approach. Section 2.1 talks about an existing catalogue, the Service-Finder project. Then, section 2.2 presents some details about the Information Extraction technique. Finally, section 2.3 explains an existing implemented crawler, the Pica-Pica Web Service Description Crawler.
2.1 Service-Finder Project

The Service-Finder project aims at developing a platform for Web Service discovery, especially for the Web Services that are embedded in a Web 2.0 environment [1]. Hence, it can provide efficient access to publicly available services. The goals of the Service-Finder project are depicted as follows [1]:

• Automatically gather Web Services and their related information
• Semi-automatically create semantic service descriptions based on the information that is available on the Web
• Create and improve semantic annotations via user feedback
• Describe the aggregated information in semantic models and allow reasoning and querying

However, before describing the basic functionality of the Service-Finder project, one of its use cases and the resulting requirements are presented first.
2.1.1 Use Cases for the Service-Finder Project

The Service-Finder project employed the use case methodology of the W3C Use Case description [6] for its needs and then applied this methodology to the use cases that it enumerated.

2.1.1.1 Use Case Methodology

Three aspects need to be considered for the use case definitions of the Service-Finder project [1]:

(1) Description: used to describe the information of the use case.
(2) Actors, Roles and Goals: used to identify the actors, the roles they act, and the goals they need to achieve in the scenario.
(3) Storyboard: used to describe the series of interactions among the actors and the Service-Finder Portal.
2.1.1.2 System Administrator

This section presents the use case that applies to the Service-Finder portal and that illustrates the requirements on its functionality from a user's point of view. All the information in this use case is derived from [1]. In this use case there is a system administrator named "Sam Adams". He works for a bank. His job is to keep the online payment facilities online and working day and night. Therefore, if there is any system failure, Sam Adams should fix the problem as early as he can. That is why he wants to use an SMS messaging service which will alert him immediately by sending him an SMS message in the case of a system failure.

• Description
This use case describes the system administrator "Sam Adams", who is looking for an SMS messaging service that he wants to build into his application.

• Actors, Roles and Goals
The name of the actor is "Sam Adams". His role is a system administrator at a bank. His goals are immediate service delivery, reliability of the service, and low base and transaction fees.

• Storyboard
Step 1: Sam Adams knows the Service-Finder portal, and he also knows that he can find many useful services with it, especially since he knows what he is looking for. Hence, he visits the Service-Finder portal and starts to search by entering the keywords "SMS Alert".
Requirement 1: Search functionality.
Step 2: The Service-Finder portal now returns a list of matching services. However, Sam wants to choose the number of matching services that will be displayed on one page. He would also expect short information about the service functionality, the service provider, and the service availability, so that he can decide which service to read further.
Requirement 2: Enable configurable pagination of the matching results, and show short information for each service.
Step 3: When Sam looks through the short information about the services displayed on the first page, he expects to find the most relevant services related to his request. After that, he would like to read more detailed information about a service, to see whether it can provide the needed functionality.
Requirement 3: Rank the returned matching services, and provide the ability to read more details of a service.
Step 4: It may be the case that the returned matching services provide quite different functionalities, or that they belong to different service categories; for example, some messaging services alert users not through SMS but through voice messaging. For this reason, Sam would like to see other categories that may contain the services he wants, or services of other categories in which he is also interested (like "SMS Messaging"). Another possible way is that Sam further refines his search by browsing through categories.
Requirement 4: Provide service categories and allow the user to look at all services that belong to a specific category. If possible, also allow the user to browse through categories.
Step 5: When Sam has got all the services that could provide an SMS messaging service via the methods described in Step 4, he now wants to look for the services that are offered by an Austrian provider and, if possible, have no base fees.
Requirement 5: Faceted search.
Step 6: After Sam has got all these specific services, he would like to choose the services that provide a high reliability.
Requirement 6: Sort functionality based on the user's choices.
Step 7: Sam now expects to compare the service availability promised by the service provider with the availability actually provided. This should be contained in the services' details. There also needs to be service coverage information, so that Sam can know whether a service covers the areas where he lives and works. Moreover, Sam would also like to compare these services in other ways, for instance by putting some services into a structured table to compare the transaction fees.
Requirement 7: A side-by-side comparison table for services, and a functionality that enables users to select the services they want to compare.
Step 8: At last, Sam wants to know whether the service providers offer a free trial of the services, so that he can test the service functionality.
Requirement 8: If possible, display a note for services offering free trials.
2.1.2 Architecture Plan for the Service-Finder Project

The architecture plan of the Service-Finder project contains five basic components: Service Crawler, Automatic Annotator, Conceptual Indexer and Matcher, Cluster Engine, and Service-Finder Portal Interface. Figure 2-1 presents a high-level overview of the components and the data flow among them.

Figure 2-1 Data flow of Service-Finder and its components [3]
2.1.2.1 The Principle of the Service Crawler Component

The Service Crawler is responsible for gathering available services and their related information from the Web. The overall cycle is depicted as follows:

(1) A Web developer publishes a Web Service.
(2) The Crawling component begins to harvest the Web in order to identify Web Services, e.g. their WSDL (Web Services Description Language) documents.
(3) As soon as a service is discovered, the Crawler also searches for other related information.
(4) After each periodic interval, the Crawler produces a consistent snapshot of the relevant part of the Web.

At last, the output of the crawler is forwarded to the subsequent components for analyzing, indexing, and displaying.

2.1.2.2 The Principle of the Automatic Annotator Component

The Automatic Annotator receives the relevant data from the previous component and generates semantic service descriptions of the WSDL documents and their related information, based on the Service-Finder Ontology and the Service Category Ontology.

First, the two compatible ontologies that are used throughout the whole process are briefly introduced [2]:

• Generic Service Ontology: an ontology which is used to describe the data objects, for example the services, the service providers, availability, payment modalities, and so on.
• Service Category Ontology: an ontology which is used to categorize the functionalities or applications of the services, for instance data verification, messaging, data storage, weather, etc.

Afterwards, the function of this component, together with its input and output, is as follows:

Input:
- Crawled data from the Service Crawler
- Service-Finder ontologies
- Feedback on or corrections of previous annotations

Function:
- Enrich the information about the service and extract semantic statements according to the Service-Finder ontologies; for example, categorize the service according to the Service Category Ontology
- Determine whether a particular document is relevant or not through the Web link graph; if not, discard these irrelevant documents
- Classify the pages into their genres, for instance pricing, user comments, FAQ, and so on

Output:
- Semantic annotations of the services
2.1.2.3 The Principle of the Conceptual Indexer and Matcher Component

The Conceptual Indexer and Matcher is more like a data store center that aims at storing all extracted information about the services and supplying users with the capability of retrieval and semantic query, for example the matchmaking between user requests and service offers, and the act of retrieving user feedback on extracted annotations.

In addition, the function of this component and its input and output are as follows:

Input:
- Semantic annotation data and full-text information obtained from the Automatic Annotator
- Semantic annotation data and full-text information that come from the user interfaces
- Cluster data from the user and service clustering component

Function:
- Store the semantic annotations received from the Automatic Annotator component and from the user interface
- Store the cluster data procured through the clustering component
- Store and index the textual descriptions offered by the Automatic Annotator component and the textual comments offered by users
- Ontological querying of the semantic data in the data store center
- Combined keyword and ontological querying, used for user queries
- Provide a list of similar services for a given service

Output:
- A list of matching services that are queried by users; in particular, these services should be sorted by ranking and should also be iterable
- All available data related to a particular entity must be retrievable at the user interface
2.1.2.4 The Principle of the Service-Finder Portal Interface Component

The Service-Finder Portal Interface is the main entry point provided for users of the Service-Finder system to search and browse the data which is managed by the Conceptual Indexer and Matcher component. In addition, the users can also contribute information by providing tags, comments, categorizations, and ratings for the browsed data. Furthermore, developers can directly invoke the Service-Finder functionalities from their custom applications in terms of an API.

The details of this component's function, input, and output are represented below:

Input:
- A list of ordered services for a query
- Detailed information about a service, or a set of services, and a service provider
- Query access to the service category ontology and the most used tags provided by the users
- Service availability information

Function:
- The Web interface allows the users to search services by keyword, tag, or concept in the categorization; to sort and filter query results by refining the query; to compare and bookmark services; and to try out the services that offer this functionality
- The API allows the developers to invoke Service-Finder functionalities

Output:
- Explicit user annotations such as tags, ratings, comments, descriptions, and so on
- Implicit user data, for example click streams of users, bookmarks, comparisons, links sent, etc.
- Manual advertising of available new services
2.1.2.5 The Principle of the Cluster Engine Component

The Cluster Engine is in charge of collecting and analyzing information about user behaviors from the Service-Finder Portal, e.g. the queried services and the compared services of the users. Moreover, it also provides cluster data to the Conceptual Indexer and Matcher for providing service recommendations.

This component's function, input, and output are as follows:

Input:
- Service annotation data, both extracted and from user feedback
- Users' click streams, used for extracting user behaviors

Function:
- Obtain user clusters from user behaviors
- Obtain service clusters from service annotation data, to enable finding similar services

Output:
- Clusters of users and services
2.2 Information Extraction

Due to the rapid development and use of the World-Wide Web, a huge number of information sources has been produced on the Internet, for which access has been limited to browsing and searching because of the heterogeneity and the lack of structure of Web information sources. Therefore, Information Extraction, which transforms the Web pages into program-friendly structures for post-processing, has become a great necessity. The task of Information Extraction is specified in terms of the inputs and the extraction targets, and the technique used in the process of Information Extraction is called the extractor.
2.2.1 Input Types of Information Extraction

Generally speaking, there are three different input types. The first input type is the unstructured document, for example the free text shown in figure 2-2. It is unstructured and written in natural language, so it requires substantial natural language processing. The second input type is the structured document, for instance XML documents, since their data can be described through an available DTD (Document Type Definition) or XML (eXtensible Markup Language) schema. Finally, the third input type is the semi-structured document, which is widespread on the Web, such as the large volume of HTML pages with tables, itemized lists, and enumerated lists. This is because HTML tags are often used to render these embedded data in the HTML pages; see figure 2-3.

Figure 2-2 Left is the free text input type and right is its output [4]

Figure 2-3 A semi-structured page containing data records (in rectangular box) to be extracted [4]

In this way, the inputs of the semi-structured type can be seen as documents with a fairly regular structure, and the data of these documents can be displayed in an HTML or a non-HTML format. Besides, owing to the fact that the Web pages of the Deep Web are dynamic and generated from structured databases in terms of some templates or layouts, the Deep Web can be considered as one of the input sources which could provide such semi-structured documents. For example, the author, price, and comments of the book pages provided by Amazon have the same layout. That is because these Web pages are generated from the same database and applied with the same template or layout. Furthermore, there is another option: manually generated HTML pages of the semi-structured type. For example, although the publication lists provided on different researchers' homepages are produced by diverse users, they all have a title and a source property for every single paper. Eventually, the inputs for some Information Extraction tasks can also be pages of the same class or pages from various Web Service Registries.
2.2.2 Extraction Targets of Information Extraction

Regarding the task of Information Extraction, the extraction target also has to be considered. There are two different extraction targets. The first one is the relation of a k-tuple, where k is the number of attributes in a record. Nevertheless, in some cases an attribute of a record may have no instantiation; in other cases the attribute owns multiple instantiations. The complex object with hierarchically organized data is the second extraction target. Though the ways of depicting the extraction targets in a page are diverse, the most common structure is the hierarchical tree. This hierarchical tree may contain only one leaf node, or one or more lists of leaf nodes under internal nodes. The structure of a data object may also be flat or nested. To be brief: if the structure is flat, then there is only one leaf node, which can also be called the root; otherwise, if it is a nested structure, then the internal nodes involved in this data object span more than two levels.

Furthermore, in order to make the Web pages readable for human beings and to allow an easier visualization, the tuples of the same list, or the elements of a tuple, should be clearly isolated or demarcated. However, the display of a data object in a Web page is affected by the following conditions [4]:

• The attribute of a data object has zero or several values
(1) If there is no value for the attribute of a data object, this attribute is called a "none" attribute. For example, a special offer that is only available for certain books might be a "none" attribute.
(2) If there is more than one value for the attribute of a data object, it is called a "multiValue" attribute. For instance, the name of the author of a book could be a "multiValue" attribute.

• The set of attributes (A1, A2, A3, ...) has multiple orderings
That is to say, among this set of attributes, the position of an attribute might change according to the diverse instances of a data object. Such an attribute is called a "multiOrdering" attribute. For instance, for the movies before the year 1999 a movie site might enumerate the release date in front of the movie's title, while for the movies after the year 1999 (including 1999) it enumerates the release date behind the movie's title.

• The attribute has different formats
This means the display format of the data object can be completely distinct with respect to different instances. Therefore, if the format of an attribute is free, a lot of rules will be needed to deal with all kinds of possible cases. This kind of attribute is called a "multiFormat" attribute. For example, an e-commerce Web site might use a bold font to present the general prices, while using a red color to display the sale prices. Nevertheless, there is another situation, in which different attributes of a data object have the same format. For example, various attributes are presented using <TD> tags in a table presentation. Attributes like those can be differentiated by means of their order information. However, for cases where a "none" attribute occurs or "multiOrdering" attributes exist, the rules for extracting these attributes have to be revised.

• The attribute cannot be decomposed
For easier processing, the input documents are sometimes treated as strings of tokens instead of strings of characters. In addition, some attributes cannot even be decomposed into several individual tokens. These attributes are called "untokenized" attributes. An example is a college course catalogue entry like "COMP4016" or "GEOL2001": the department code and the course number in them cannot be separated into two different strings of characters like "COMP" and "4016" or "GEOL" and "2001".
2.2.3 The Used Techniques in Information Extraction

The extractor used in the process of Information Extraction aims at providing a single uniform query interface to access information sources like database servers and Web servers. It consists of the following phases: collecting returned Web pages, labeling these Web pages, generalizing extraction rules, extracting the relevant data, and outputting the result in an appropriate format (XML format or a relational database) for further information integration. For example, at first the extractor queries the Web server to gather the returned pages through the HTTP protocol; after that, it starts to extract the contents of these HTML documents and integrates them with other data sources thereafter. Actually, the whole process of the extractor follows the steps below (a small code sketch follows this list):

Step 1:
At the beginning, the input has to be tokenized. There are two different granularities for the input string tokenization: tag-level encoding and word-level encoding. Tag-level encoding transforms the tags of an HTML page into general tokens, while transforming each text string between two tags into a special token. Word-level encoding does this in another way: it treats each word in a document as a token.

Step 2:
Next, the extraction rules are applied for every attribute of the data object in the Web pages. These extraction rules can be induced in terms of a top-down or bottom-up generalization, pattern mining, or logic programming. In addition, the type of the extraction rules may be indicated by means of regular grammars or logic rules. For example, some use path expressions of the HTML parse tree, like html/head/title or html->table[0]; some use syntactic or semantic constraints; and some use delimiter-based constraints such as HTML tags or literal words.

Step 3:
After that, all the extracted data are assembled into records.

Step 4:
Finally, this process is iterated until all the data objects in the input are processed.
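To make steps 1 and 2 more concrete, the following is a minimal sketch of tag-level encoding combined with a delimiter-based extraction rule. It is written in Java, the language of this master program (see section 3.1.2); the class name, the regular expression, and the sample rule are purely illustrative and are not taken from any of the crawlers discussed in this thesis.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class TagLevelTokenizer {

        // Step 1: tag-level encoding -- every HTML tag becomes one token,
        // every text string between two tags becomes one special TEXT token.
        public static List<String> tokenize(String html) {
            List<String> tokens = new ArrayList<>();
            Matcher m = Pattern.compile("<[^>]+>|[^<]+").matcher(html);
            while (m.find()) {
                String piece = m.group().trim();
                if (piece.isEmpty()) continue;
                tokens.add(piece.startsWith("<") ? piece : "TEXT(" + piece + ")");
            }
            return tokens;
        }

        // Step 2: a delimiter-based extraction rule -- the attribute value is
        // whatever TEXT token appears between the delimiters <td> and </td>.
        public static List<String> extractCells(List<String> tokens) {
            List<String> values = new ArrayList<>();
            for (int i = 1; i < tokens.size() - 1; i++) {
                if (tokens.get(i - 1).equalsIgnoreCase("<td>")
                        && tokens.get(i).startsWith("TEXT(")
                        && tokens.get(i + 1).equalsIgnoreCase("</td>")) {
                    values.add(tokens.get(i).substring(5, tokens.get(i).length() - 1));
                }
            }
            return values;
        }

        public static void main(String[] args) {
            String html = "<table><tr><td>BLZService</td><td>available</td></tr></table>";
            // Step 3: the extracted values would then be assembled into a record.
            System.out.println(extractCells(tokenize(html))); // [BLZService, available]
        }
    }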
2.3 Pica-Pica Web Service Description Crawler

Pica pica is known as a bird species, the magpie. At the moment, however, the Pica-Pica here is a Web Service Description Crawler which is designed to address the question of Web Service quality: for example, the evaluation of the descriptive quality of the offered Web Services, and how well these Web Services are described in today's Web Service Registries.
2.3.1 Needed Libraries of the Pica-Pica Web Service Description Crawler

This version of the Pica-Pica Web Service Description Crawler was written by Anton Caceres and Josef Spillner and is programmed in the Python language. In order to run these scripts to parse the HTML pages, two additional libraries are needed: Beautiful Soup and html5lib.

• Beautiful Soup
It is an HTML/XML parser for Python, and it can even turn invalid markup into a parse tree [5]. Moreover, the following three features make it more powerful:
- Bad markup does not choke Beautiful Soup; in fact, it generates a parse tree that makes approximately as much sense as the original document. Therefore, you can obtain the data that you want.
- Beautiful Soup provides a toolkit with simple, idiomatic methods for navigating, searching, and modifying the parse tree. Hence, you do not need to create a custom parser for every application.
- Beautiful Soup converts the documents from Unicode to UTF-8 automatically. If the document has already specified an encoding, you can ignore the encoding question entirely; otherwise, all you have to do is to specify the encoding of the original document.

Furthermore, the ways of including Beautiful Soup into an application are the following [5]:
- from BeautifulSoup import BeautifulSoup           # for processing HTML
- from BeautifulSoup import BeautifulStoneSoup      # for processing XML
- import BeautifulSoup                              # to get everything

• html5lib
It is a Python package which implements the HTML5 [8] parsing algorithm. In order to gain maximum compatibility with the current major desktop Web browsers, this implementation is based on the WHATWG (Web Hypertext Application Technology Working Group) HTML5 specification.
2.3.2 Architecture of the Pica-Pica Web Service Description Crawler

Figure 2-4 Architecture of the Pica-Pica Web Service Description Crawler

Figure 2-4 illustrates the primary architecture of the Pica-Pica Web Service Description Crawler. It includes four fundamental components: the Service Page Grabber component, the WSDL Grabber component, the Property Grabber component, and the WSML Register component.

(1) The Service Page Grabber component takes the URL seed as the input and outputs the link of the service page into the following two components: the WSDL Grabber component and the Property Grabber component.
(2) The WSDL Grabber component is responsible for obtaining the WSDL document based on the delivered service page link, and then checking the validity of the obtained WSDL document. Finally, only the valid WSDL documents are passed into the WSML Register component for further processing.
(3) The Property Grabber component tries to extract the service properties hosted in the service page, if any exist. After that, all these service properties are saved into an INI file as the information of that service.
(4) The functionality of the WSML Register component is to write an appropriate WSML document by means of the valid WSDL documents delivered from the WSDL Grabber component and the optional INI files delivered from the Property Grabber component, and afterwards to register them in Conqo.

• WSML [9]
It stands for Web Service Modeling Language, which provides a framework with different language variants. Hence, it is often used to describe the different aspects of semantic Web Services according to the conceptual model of WSMO.

• WSMO [10]
WSMO, whose full name is Web Service Modeling Ontology, is dedicated to describing various aspects related to Semantic Web Services based on ontologies. An ontology is a formal, explicit specification of a shared conceptualization. In fact, these ontologies are the sticking points that can establish the linkage between the agreement of the communities of users and the defined conceptual semantics of the real world.

• Conqo [11]
It is a discovery framework that considers not only the Quality of Service (QoS) but also context information. It uses a Web Service repository to manage the service descriptions that are based on WSML.
2.3.3 Implementation of the Pica-Pica Web Service Description Crawler

This section describes the processes of the implementation of the Pica-Pica Web Service Description Crawler in detail.

(1) Firstly, to start the whole crawling process of the Pica-Pica Web Service Description Crawler, an input is needed as the initial seed. In this crawler there are five Web Service Registries, which are listed below. The URL addresses of these five Web Service Registries are used as the input seeds for the Pica-Pica Web Service Description Crawler. Moreover, in this version of the Pica-Pica Web Service Description Crawler there is a single Python script for each Web Service Registry, and the crawling processes of these Web Service Registries' Python scripts are executed one after another.

Biocatalogue: http://www.biocatalogue.com
Ebi: http://www.ebi.ac.uk
Seekda: http://www.seekda.com
Service-Repository: http://www.service-repository.com
Xmethods: http://www.xmethods.net

(2) Then, after being fed with the input seed, the crawler steps into the next component, the Service Page Grabber. At first, this component tries to read the data from the Web based on the input seed. Then it establishes a parse tree of the read data in terms of the functions of the Beautiful Soup library. After that, this Service Page Grabber component starts to look for the service page link of each service published in the Web Service Registry by means of the functions of the html5lib library. In the case that the service page link of one single service is found, it first checks whether this service page link is valid or not. Once the service page link is valid, it is passed into the following two components for further processing: the WSDL Grabber component and the Property Grabber component.
(3) When the WSDL Grabber component receives a service page link from its previous component, it sets out to extract the WSDL link address for that service through the parse tree of the data in this service page. Next, this component starts to download the WSDL document of that service in terms of the WSDL link address. Thereafter, the obtained WSDL document is stored on the disk. The process of this WSDL Grabber component is continually carried on until there are no more service links passed to it. Certainly, not all grabbed WSDL documents are effective: they may contain bad definitions or a bad namespaceURI, or be an empty document; in the worst case a document is not even of XML format. Hence, in order to pick them out, this component further analyzes the involved WSDL documents and then puts all the valid documents into a "validWSDLs" folder, whereas the invalid documents are put into a folder named "invalidWSDLs" in order to gather statistical information. Finally, only the WSDL documents in the "validWSDLs" folder are passed to the subsequent component.
(4) Moreover, since some Web Service Registries give some additional information about the services, such as availability, service provider, and version, the Property Grabber component sets out to extract this information as the service's properties, and thereafter saves these properties into an INI file. However, if there is no available additional information, then there is no need to extract the service properties, and thus there is no INI file for that service. In this implementation of the Pica-Pica Web Service Description Crawler, only the Python scripts for the Seekda and Service-Repository Web Service Registries have the functions to extract the services' properties, while for the other three Web Service Registries there is no such function.
(5) Furthermore, it is optional to create a report file which contains the statistical information of this process, such as the total number of services for one Web Service Registry, the number of services whose WSDL document is invalid, etc.
(6) As has been stated, at this point there is a folder with all valid WSDL documents, and there might also be some INI files. Therefore, the task of the WSML Register component is now to generate the appropriate WSML documents from these valid WSDL documents and INI files, and then register them in Conqo.
2.4 Conclusions of the Existing Strategies

This chapter presented three aspects of the existing strategies: the Service-Finder project, the Information Extraction technique, and the Pica-Pica Web Service Description Crawler.

The task of this master program is to obtain the available Web services and their related information from the Web. This is actually a procedure of extracting the needed information for each service, such as the service's WSDL document and its properties. Therefore, the Information Extraction technique, which is used to extract information hosted on the Web, can be used in this master program.

Moreover, the Service-Finder is a large project that is not only able to obtain the available Web services and their related information from the Web, but also enriches these crawled data with annotations by means of the Service-Finder ontology and the Service Category ontology, and integrates all the information into a coherent semantic model. Furthermore, it also provides the capabilities for searching and browsing the data with a user interface, and gives users service recommendations. However, for a master program, the Service-Finder project far exceeds the requirements. Therefore, it is just considered as a reference for this master program.

Furthermore, since the Pica-Pica Web Service Description Crawler aims at only obtaining the available Web services and their related information, it fulfills the primary task of this master program. Nevertheless, regarding the information about the services, this Pica-Pica Web Service Description Crawler extracts only a few properties, sometimes even none at all. Consequently, in order to improve the quality of the service descriptions, as many properties about each service as possible have to be extracted. Hence, chapter 3 presents an extension of this Pica-Pica Web Service Description Crawler.
3 Design and Implementation

In the previous chapter on the state of the art, the already existing techniques and implementations were presented. In the following, the basic principle of the proposed approach, the Deep Web Services Crawler, is introduced; it is based on these previously existing techniques, especially the Pica-Pica Web Service Description Crawler.
3.1 Deep Web Services Crawler Requirements

This section is mainly about the goals of the Deep Web Service Crawler approach, the system requirements with respect to the approach, and some non-functional requirements.

3.1.1 Basic Requirements for DWSC

The following are the basic requirements which should be achieved:

(1) Produce the largest annotated Service Catalogue. Here, the Service Catalogue is actually a list of the Web services published in a Web Service Registry. It contains not only the WSDL document of each service but also a file with the information about that service. Therefore, for the purpose of producing the largest annotated Service Catalogue, the proposed approach needs to extract as many properties about those Web services as possible. Moreover, it also has to download the WSDL document hosted along with each Web service. That is to say, these properties are not only the interesting structured properties, such as the service name and its WSDL link address, but also some other non-functional properties, for example the endpoint and its monitoring information.

(2) Flexible storage: each service has a Service Catalogue entry that contains all its interesting properties. How to deal with those service properties, i.e. which schemes are used to store them, is a significant question. Hence, in order to store them in a flexible way, the proposed approach provides three methods for the storage (see the sketch below): the first one stores them as an XML file, the second method stores them in an INI file, and the third method uses a database for the storage.
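As a small illustration of the second, INI-based storage method, the following sketch uses java.util.Properties, whose key=value output is a close approximation of the INI format. The property names, values, and the file name below are merely examples and are not the actual keys produced by this program.

    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.util.Properties;

    public class IniStorageSketch {
        public static void main(String[] args) throws IOException {
            // Example service properties; the real crawler extracts these
            // from the service list page and the service page.
            Properties serviceProps = new Properties();
            serviceProps.setProperty("name", "BLZService");
            serviceProps.setProperty("provider", "thomas-bayer.com");
            serviceProps.setProperty("wsdlLink", "http://www.example.org/BLZService?wsdl");

            // java.util.Properties writes "key=value" lines, which is close
            // to the INI format used for the Service Catalogue entries.
            try (FileOutputStream out = new FileOutputStream("BLZService.ini")) {
                serviceProps.store(out, "Service Catalogue entry");
            }
        }
    }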
3.1.2 System Requirements for DWSC

Generally speaking, the requirements for realizing a programming project include the following:

1) Operating system: Linux/Unix, Windows (XP, Vista, 2000, etc.)
2) Programming language: C++, Java, Python, C, etc.
3) Programming tool: NetBeans, Eclipse, Visual Studio, and so on

In this master thesis, the scripts of the Deep Web Service Crawler approach are written in the Java programming language. Besides, these code scripts have only been tested on the Windows XP and Linux operating systems, but not on other operating systems.
3.1.3 Non-Functional Requirements for DWSC

In this part, several non-functional requirements for the Deep Web Service Crawler approach are presented:

1) Transparency: the process of data exploration and data storage should be done automatically, without the user's intervention. However, at first the user should specify the path on the hard disk which will be used to store the outputs of this program.
2) Fault tolerance: during the execution of this program, errors can inevitably happen. Therefore, in order to keep the process from being interrupted, there must be the necessary error handling so that the process can recover (a small sketch of such an error handling is given at the end of this section).
3) Completeness: this approach should extract as many of the interesting properties about each Web service as possible, e.g. endpoint, monitoring information, etc.

In addition, since the Pica-Pica Web Service Crawler has already implemented the strategies for the following five URLs, the proposed approach must support no fewer than these five Web Service Registries:

Biocatalogue: http://www.biocatalogue.com
Ebi: http://www.ebi.ac.uk
Seekda: http://www.seekda.com
Service-Repository: http://www.service-repository.com
Xmethods: http://www.xmethods.net
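Regarding the fault-tolerance requirement above, a typical error handling is to catch the exception, wait, and retry the failed download a few times before giving up, so that one broken page does not interrupt the whole crawling process. The following is a minimal sketch of such a retry wrapper; the method name and the retry parameters are illustrative only and are not taken from the actual implementation.

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;

    public class RetrySketch {

        // Reads the page content of the given URL address; on failure it waits
        // and retries instead of letting the whole crawling process die.
        public static String readPageWithRetry(String address, int maxRetries)
                throws InterruptedException {
            for (int attempt = 1; attempt <= maxRetries; attempt++) {
                try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                        new URL(address).openStream(), StandardCharsets.UTF_8))) {
                    StringBuilder page = new StringBuilder();
                    String line;
                    while ((line = reader.readLine()) != null) {
                        page.append(line).append('\n');
                    }
                    return page.toString();
                } catch (IOException e) {
                    System.err.println("Attempt " + attempt + " failed: " + e.getMessage());
                    Thread.sleep(1000L * attempt); // simple backoff before retrying
                }
            }
            return null; // give up; the caller can record this service as failed
        }
    }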
3.2 Deep Web Services Crawler Architecture

In this section, an overview of the high-level architecture of the Deep Web Services Crawler approach is introduced first. Thereafter, four subsections are presented, each focusing on the outline of a single component and on how the components play together.

The current components and flows of data in the Deep Web Service Crawler can be summarized as depicted in Figure 3-1, using the continuous arrows. The crawler first tries to obtain the available service list page links and related service page links by crawling the Web (Web Service Extractor). Then those gathered links are processed in two separate steps: one is to get the service's WSDL document (WSDL Grabber), and the other is to collect the properties of each service (Property Grabber). Finally, all these data are stored on the storage device (Storage). The whole detailed process of figure 3-1 is illustrated as follows (a compilable sketch of this control flow is given after the list):

Step 1:
When the Deep Web Service Crawler starts to run, the File Chooser container requires the user to specify a path on the computer or on any other hard disk. The reason for specifying the path is that this Deep Web Service Crawler program needs a place to store all its outputs.

Step 2:
After that, the Web Service Extractor is triggered. It is the main entry to the specific crawling process. The Deep Web Service Crawler program is a procedure which is supposed to crawl for Web Services in some given Web Service Registries. Hence, the URL addresses of these Web Service Registries should be given as initial seeds for this Web Service Extractor process. However, since the page structures of these Web Service Registries are completely different, there is a dedicated process for each Web Service Registry.

Figure 3-1 Overview of the basic architecture of the Deep Web Services Crawler

Step 3:
According to the given seed, two types of links are obtained by the Web Service Extractor component: one is the service list page link, and the other is the service page link. A service list page is a page that contains a list of Web Services and possibly some information about these Web Services, while a service page is a page that contains much more information about a single service. Finally, the component forwards these two types of links into the next two components: the Property Grabber and the WSDL Grabber.

Step 4:
On the one hand, the Property Grabber component tries to gather the information about the service that is hosted in the service list page and the service page, such as the name of the service, its description, the ranking of this service, etc. Finally, all the information about the service is collected together as the service properties, which are then delivered to the Storage component for further processing.

Step 5:
On the other hand, the WSDL Grabber tries to obtain the WSDL link from the service list page or the service page. That is because for some Web Service Registries the WSDL link is hosted in the service list page, like in the Biocatalogue, and for other Web Service Registries it is hosted in the service page, such as in Xmethods. After obtaining the WSDL link, it is also transmitted to the Storage component for further processing.

Step 6:
When the service properties and the WSDL link of the service are received by the Storage component, it stores them on the disk. The service properties are stored on the disk in one of three different ways: as an XML file, as an INI file, or as one record inside a table of a database. For the WSDL link, however, the Storage component first tries to download the page content according to the URL address of the WSDL link. If this succeeds, the page content of the service is stored as a WSDL document on the disk.

Step 7:
Nevertheless, this is just the crawling process of a single service, from step 3 to step 6. Hence, if there is more than one service, or more than one service list page, in those Web Service Registries, the crawling process from step 3 to step 6 is repeated again and again until there is no service or service list page left in that Web Service Registry.

Step 8:
Furthermore, after the crawling process for one Web Service Registry finishes, a file is generated that contains some statistical information about this crawling process, for example: the times when the crawling process of this Web Service Registry started and finished, the total number of Web services in this Web Service Registry, how many services have an empty WSDL document, the average number of service properties in this Web Service Registry, and the average time cost for extracting service properties, getting the WSDL document, generating the XML file and INI file, etc.
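The following compilable sketch summarizes steps 2 to 7 above as one nested loop over service list pages and service pages. All type and method names in it are invented for illustration and do not correspond one-to-one to the actual classes of this program.

    import java.util.List;
    import java.util.Map;

    // A compilable sketch of the overall control flow (steps 2-7).
    public class CrawlLoopSketch {

        interface WebServiceExtractor {
            List<String> getServiceListPageLinks();                 // step 3
            List<String> getServicePageLinks(String listPageLink);  // step 3
        }

        interface PropertyGrabber {
            Map<String, String> grab(String listPageLink, String servicePageLink); // step 4
        }

        interface WsdlGrabber {
            String grab(String listPageLink, String servicePageLink); // step 5, may return null
        }

        interface Storage {
            void store(Map<String, String> properties, String wsdlLink); // step 6
        }

        public static void crawlRegistry(WebServiceExtractor extractor,
                                         PropertyGrabber propertyGrabber,
                                         WsdlGrabber wsdlGrabber,
                                         Storage storage) {
            // Step 7: repeat until no service list page is left ...
            for (String listPage : extractor.getServiceListPageLinks()) {
                // ... and until no service on the current list page is left.
                for (String servicePage : extractor.getServicePageLinks(listPage)) {
                    storage.store(propertyGrabber.grab(listPage, servicePage),  // steps 4 + 6
                                  wsdlGrabber.grab(listPage, servicePage));     // steps 5 + 6
                }
            }
            // Step 8: a statistics file about this crawl run would be written here.
        }
    }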
3.2.1 The Function of the Web Service Extractor Component

The Web Service Extractor component is responsible for gathering service list page links and service page links. It pursues a focused crawl of the Web and only forwards service list page and service page links to the subsequent components for analyzing, collecting, and gathering purposes. Therefore, it identifies both the service list page links and the related service page links of these Web Service Registries.

As can be seen from figure 3-2, a crawl for Web Services needs to start from a URL seed. The seed is almost as important as the Web Service Extractor itself, as it highly influences the part of the Web which needs to be crawled. The seed can or shall contain, e.g., Web pages where these Web Services are published, or pages that talk about Web Services.

Figure 3-2 Overview of the process flow of the Web Service Extractor Component

After being fed with the URL seed, the Web Service Extractor component starts to get the link of the service list page from the initial page of this URL seed. However, this process differs among the five Web Service Registries. The following shows the different situations in these Web Service Registries:

Service-Repository Web Service Registry:
In this Web Service Registry, the link of the first service list page is the URL address of its seed, which means some Web Services can already be found on the home page of the Service-Repository Web Service Registry. The "first" here implies that there is more than one service list page link in this Web Service Registry. Therefore, the process of getting service list page links in this registry is continually carried on until no more service list page links exist.

Xmethods Web Service Registry:
Although there are Web Services on the home page of the Xmethods Web Service Registry, they are only a small subset of those in this Web Service Registry. Moreover, in the Xmethods Web Service Registry there is only one page containing all Web Services. Therefore, the service list page link for that page has to be obtained.

Ebi Web Service Registry:
The situation in the Ebi Web Service Registry is a little bit like that in the Xmethods Web Service Registry. That is to say, there is also one page that contains all Web Services in this Web Service Registry. However, this page is not the initial page of the input seed. Therefore, more than one operation step is needed to get the service list page link of that page.

Seekda Web Service Registry:
In the Seekda Web Service Registry, the Web Services are not contained in the initial page of the input seed. The service list page link can be obtained after several additional operation steps. However, there is a problem with getting the service list page links in this registry. Simply put, if there is more than one page containing the Web Services, then for some unknown reason the links of the remaining service list pages cannot be obtained. In other words, only the link of the first service list page can be obtained.

Biocatalogue Web Service Registry:
The process of getting the service list pages in the Biocatalogue Web Service Registry is almost the same as in the Seekda Web Service Registry. The only difference is that in the Biocatalogue Web Service Registry all service list page links can be obtained if there is more than one service list page.

After getting the link of a service list page, the Web Service Extractor begins to get the link of the service page for each service that is listed in the service list page. The reason why it can do this is that there is an internal link for every service which addresses the service page. It is worth noting that once a service page link is obtained, this component immediately forwards the service page link and the service list page link into the subsequent two components for further processing. Nevertheless, the process of obtaining the service page links is continuously carried out until all services listed in that service list page are crawled. Analogously, the process of getting service list pages is also continuously carried out until no more service list pages exist.
3.2.1.1 Features of the Web Service Extractor Component

The main features are described in the following paragraphs.

1) Obtain service list page links
A central task is to obtain the corresponding service list page links. A service list page link is a URL address that leads to a public list of Web Services with some simple information about these Web Services, like the name of the service, an internal URL that links to another page containing the detailed information about that service, and sometimes the link address of the WSDL document.

2) Obtain service page links
Once service list page links are found, a crucial aspect is to extract the internal link of each service to aid the task of service information discovery. Therefore, it is the task of the Web Service Extractor to harvest the HTML page content of the service list page, so that the service page links, which contain much more detailed information about the single Web service, can be obtained.

3.2.1.2 Input of the Web Service Extractor Component

This component is dependent on specific input seeds. The only input required for this component is a URL seed. This URL seed is one of the URLs listed in section 3.1.3.

3.2.1.3 Output of the Web Service Extractor Component

The component produces two kinds of service-related page links from the Web:
- Service list page links
- Service page links
3214 Demonstration for Web Service Extractor
In order to have comprehensive understanding of the process of the Web Service Extractor
component following gives some figures for explanation Though there are five URL addresses in this
section only the URL of the service-repository address is showed as an example
1) The input seed is the initial URL address of the Service-Repository which is
ldquohttpwwwservice-repositorycomrdquo
2) As has already said in section 321 the first service list link of this Web Service Registry is its input
seed ldquohttpwwwservice-repositorycomrdquo Figure 3-3 shows the corresponding service list page
of that link
Figure3-3 Service list page of the Service-Repository
Figure3-4Origianl source code of the internal link for Web service ldquoBLZServicerdquo
Figure3-5Code Overview of getting service page link in Service Repository
Figure3-6 Service page of the Web service ldquoBLZServicerdquo
Deep Web Service Crawler
30
3) Now that the service list page link is known, the next step is to acquire the service page links of the services listed in that service list page. The text in the red box of figure 3-4 shows the internal link of the Web service "BLZService". However, this is not the complete link of the service page: it has to be prefixed with the initial URL address of the Service-Repository Web Service Registry, "http://www.service-repository.com". The code for getting the service page link of a Web service in the Service-Repository is shown in figure 3-5. The final link of this service page is therefore "http://www.service-repository.com/service/overview-210897616", and figure 3-6 is the corresponding service page of that link. A sketch of this link resolution follows this list.
4) Afterwards, the two links gathered by the Web Service Extractor component, the service list page link and the service page link, are immediately forwarded to the next two components, the WSDL Grabber component and the Property Grabber component.
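The prefixing step of 3) can be sketched in Java as follows. The class and method names are hypothetical; java.net.URL performs the actual resolution of the extracted internal link against the registry's base address.

    import java.net.MalformedURLException;
    import java.net.URL;

    // Hypothetical sketch: building the absolute service page link by
    // prefixing the registry's base URL to the extracted internal link.
    public class LinkResolver {
        public static String toServicePageLink(String baseUrl, String internalLink)
                throws MalformedURLException {
            // java.net.URL resolves the relative part against the base address.
            return new URL(new URL(baseUrl), internalLink).toString();
        }

        public static void main(String[] args) throws MalformedURLException {
            System.out.println(toServicePageLink("http://www.service-repository.com",
                    "/service/overview-210897616"));
            // prints: http://www.service-repository.com/service/overview-210897616
        }
    }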
322 The Function of WSDL Grabber Component
The WSDL Grabber component acquires the WSDL link hosted in the Web, based on the service list page link or the service page link. The whole process flow is illustrated in figure 3-7.
Figure 3-7: Overview of the process flow of the WSDL Grabber component
When the WSDL Grabber component receives the inputs delivered by the previous component, it starts to get the WSDL link of the service based on these inputs. Although the inputs of the WSDL Grabber component are the links of both the service page and the service list page, only one of them contains the WSDL link of the corresponding service; that is to say, the WSDL link exists either in the service page or in the service list page. The reason why both links need to be delivered to this component is that exactly one of the five Web Service Registries, namely Biocatalogue, hosts the WSDL link in the service list page, while the other four Web Service Registries host it in the service page. Therefore, the WSDL link is obtained by means of the service page link for these four Web Service Registries, and through the service list page link for the Biocatalogue Web Service Registry. However, there is a problem when getting the WSDL links of the Biocatalogue Web Service Registry: some of the Web services listed in its service list pages have no WSDL link, in other words, these services have no WSDL document. In a situation like this, the WSDL link of such a Web service is assigned the value "NULL". Nevertheless, for the Web services in the other four Web Service Registries, the WSDL link always exists in the service page. Eventually, whenever the WSDL Grabber component extracts the WSDL link of one single Web service, the link is immediately forwarded to the Storage component for downloading the WSDL document.
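A minimal sketch of this per-registry dispatch could look as follows; the method names are hypothetical, and the two extract methods merely stand in for the registry-specific parsing code.

    // Hypothetical sketch of the dispatch in the WSDL Grabber: only
    // Biocatalogue hosts the WSDL link on the service list page.
    public class WsdlGrabber {
        public String getWsdlLink(String registry, String listPageLink, String servicePageLink) {
            String wsdlLink;
            if ("Biocatalogue".equals(registry)) {
                wsdlLink = extractFromListPage(listPageLink);       // may yield no link
            } else {
                wsdlLink = extractFromServicePage(servicePageLink);
            }
            // Services without a WSDL document are recorded with the value "NULL".
            return wsdlLink != null ? wsdlLink : "NULL";
        }

        private String extractFromListPage(String link) { return null; }    // placeholder
        private String extractFromServicePage(String link) { return null; } // placeholder
    }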
3221 Features of the WSDL Grabber Component
The WSDL Grabber component has to provide the following feature:
- Obtain WSDL links
The WSDL link is the direct way to get to the page that contains the content of the WSDL document. It is actually a URL address, but the end of this URL address carries something like "wsdl" or "WSDL" to indicate that it addresses the page of a WSDL document.
3222 Input of the WSDL Grabber Component
This component requires the following input data:
- Service list page link
- Service page link
3223 Output of the WSDL Grabber Component
The component will only produce the following output data:
- The URL address of the WSDL link of each service
3224 Demonstration for WSDL Grabber Component
This section presents a list of figures in order to give a comprehensive understanding of the process of the WSDL Grabber component. It uses the same Web service of the Service-Repository as an example, too.
1) The input for the WSDL Grabber component is the link of the service page obtained from the Web Service Extractor component. The address of this link is "http://www.service-repository.com/service/overview-210897616".
Figure 3-8: WSDL link of the Web service "BLZService" in the service page
2) Figure 3-9 shows the corresponding original source code of the WSDL link highlighted in figure 3-8.
Figure 3-9: Original source code of the WSDL link of the Web service "BLZService"
3) Figure 3-10 and figure 3-11 show the code used to extract the WSDL link shown in figure 3-9. Note that figure 3-10 is the particular code for the Service-Repository Web Service Registry only; for the other four Web Service Registries this code differs. The function "getServiceRepositoryWSDLLink" first gets a list of nodes with the HTML tag name "b". Then it checks these nodes one by one to see whether the text value of a node is "WSDL". If a node fulfills this condition, the attribute value of its sibling, which is an "a" element here, is extracted as the value of the WSDL link of this Web service.
Figure 3-10: Code overview of the "getServiceRepositoryWSDLLink" function
Figure 3-11: Code overview of the "oneParameter" function
4) Finally, after applying the functions shown in figures 3-10 and 3-11, the WSDL link of the Web service "BLZService" can be obtained. Its value is "http://services.unitedplanet.de/blz/BlzService.asmx?WSDL".
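The logic described in step 3) can be sketched with the open-source jsoup HTML parser as follows. This is an assumption for illustration only, since the thesis prototype uses its own parsing helpers such as "oneParameter".

    import java.io.IOException;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    // Sketch of the described extraction, assuming the jsoup library.
    public class ServiceRepositoryWsdlLink {
        public static String getWsdlLink(String servicePageUrl) throws IOException {
            Document doc = Jsoup.connect(servicePageUrl).get();
            // Collect all nodes with the HTML tag name "b" ...
            for (Element b : doc.select("b")) {
                // ... and check whether the text value of the node is "WSDL".
                if ("WSDL".equals(b.text().trim())) {
                    Element sibling = b.nextElementSibling();
                    // The sibling "a" element carries the WSDL link in its href attribute.
                    if (sibling != null && "a".equals(sibling.tagName())) {
                        return sibling.attr("abs:href");
                    }
                }
            }
            return "NULL"; // no WSDL link found on this service page
        }
    }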
323 The Function of Property Grabber Component
The Property Grabber component is the module used to extract and gather all the Web service information hosted in the Web, which is in fact the information shown in the service list page and the service page. In the end, all the obtained Web service information is collected together as the service properties, which are delivered to the Storage component for storing. The detailed process flow of the Property Grabber component is illustrated in figure 3-12.
The inputs of the Property Grabber component are the same as those of the WSDL Grabber component. However, there is still a small difference between them with respect to the seeds: as already mentioned in section 322, for the WSDL Grabber component only one of the inputs is sufficient to get the WSDL link, whereas the Property Grabber component needs both of them as seeds.
Figure 3-12: Overview of the process flow of the Property Grabber component
Once the Property Grabber component receives the needed inputs, it starts to extract the service information of that single Web service. Generally speaking, the service information consists of four aspects, namely structured information, endpoint information, monitoring information and whois information.
(1) Structured Information
The structured information can be obtained by extracting the information hosted in the service page and the service list page. It is the basic descriptive information about the service, such as the name of the service, the URL address through which the WSDL document can be obtained, the description introducing the service, the provider who offers the service, its rating, the server that hosts the service, and so on. However, the elements constituting this structured information differ among the Web Service Registries. For example, the rating information of a Web service exists in the Service-Repository Web Service Registry, while the Xmethods Web Service Registry does not have this information. In addition, even for Web services within the same Web Service Registry, some elements of the structured information may not exist; for instance, one service in a Web Service Registry may have a description, while another service in the same Web Service Registry does not. Tables 3-1 to 3-5 show the structured information that should be extracted for these five Web Service Registries. Moreover, if the style of a service in the Biocatalogue Web Service Registry is SOAP, there is some additional information describing the SOAP operations of this service; if it is REST, the additional information concerns the REST operations instead. This operation information should also be considered a part of the structured information. Table 3-6 and table 3-7 illustrate the information for these two different kinds of operations.
Service Name | WSDL Link | WSDL Version
Provider | Server | Rating
Homepage | Owner Homepage | Description
Table 3-1: Structured Information of the Service-Repository Web Service Registry
Service Name | WSDL Link | Provider
Service Style | Homepage | Implementation Language
Description | User Description | Contributed Client Name
Type of this Client | Publisher of this Client | Used Toolkit of this Client
Used Language of this Client | Used Operation System of this Client
Table 3-2: Structured Information of the Xmethods Web Service Registry
Service Name | WSDL Link | Server
Provider | Provider's Country | Service Style
Rating | Description | User Description
Service Tags | Documentation (within WSDL)
Table 3-3: Structured Information of the Seekda Web Service Registry
Service Name | WSDL Link | Port Name
Service URL Address | Implementation Class
Table 3-4: Structured Information of the Ebi Web Service Registry
Service Name | WSDL Link | Style
Provider | Provider's Country | View Times
Favorite Times | Submitter | Service Tags
Total Annotation | Provider Annotation | Member Annotation
Registry Annotation | Base URL | SOAP Lab Server Base URL
Description | User Description | Category
Table 3-5: Structured Information of the Biocatalogue Web Service Registry
SOAP Operation Name | Inputs and Outputs | Operation Description
Operation Tags | Part of Which Service
Table 3-6: SOAP Operation Information of the Biocatalogue Web Service Registry
REST Operation Name | Service Tags | Used Template
Operation Description | Part of Which Service | Part of Which Endpoint Group
Table 3-7: REST Operation Information of the Biocatalogue Web Service Registry
(2) Endpoint Information
Endpoint information is the port information of the Web service, and it can be extracted only from the service page. Since different Web Service Registries structure the endpoint information of a Web service differently, some elements of the endpoint information can be very diverse. One thing deserves attention: the Ebi Web Service Registry has no endpoint information for any of the Web services published in it. Moreover, even though the Web services in the same Web Service Registry share the same structure of endpoint information, some elements may be missing or empty, and a Web Service Registry may even have no endpoint information at all for some of the Web services published by it. Nevertheless, whenever there is endpoint information for a Web service, there is at least one element, namely the URL address of the endpoint. Table 3-8 shows the endpoint information that should be extracted for these five Web Service Registries.
Web Service Registry Name | Elements of the Endpoint Information
Service Repository | Endpoint Name, Endpoint URL, Endpoint Critical, Endpoint Type, Bound Endpoint
Xmethods | Endpoint URL, Publisher of this Endpoint, Contact Email of this Publisher, Implementation Language of this Endpoint
Seekda | Endpoint URL
Biocatalogue | Endpoint Name, Endpoint URL
Table 3-8: Endpoint Information of these five Web Service Registries
Web Service Registry Name | Elements of the Monitoring Information
Service Repository | Service Availability, Number of Downs, Total Uptime, Total Downtime, MTTR, MTBF, RTT Max of Endpoint, RTT Min of Endpoint, RTT Average of Endpoint, Ping Count of Endpoint
Seekda | Service Availability, Begin Time of Monitoring
Biocatalogue | Monitored Status of Endpoint, Overall Status of Service
Table 3-9: Monitoring Information of these five Web Service Registries
(3) Monitoring Information
Monitoring information is the measured statistical information about the Web service. It is worth noting that the Ebi and Xmethods Web Service Registries have no monitoring information for any of the Web services published by them, while in the other three Web Service Registries only a few Web services may lack this information. Table 3-9 above displays the monitoring information for these three Web Service Registries.
(4) Whois Information
Whois information is not extracted from the information hosted in the service page and the service list page. It is the descriptive information about the service domain, which can be gained by means of the address of the WSDL link. Because of that, the process of getting the whois information starts with determining the service domain. The final value of the service domain must not contain strings like "http", "https" or "www"; it must be the domain directly below the top level domain. After that, the service domain database is queried by sending the value of the service domain to the whois client, which is simply a Web site on the Internet, for example "http://www.whois365.com/cn/domain". If information about that service domain exists, a list of its information is returned as the output. However, the structure of the returned information differs from service domain to service domain; therefore, the most challenging task is that the extracting process has to deal with each different form of the returned information. Table 3-10 gives the whois information that needs to be extracted for all these five Web Service Registries.
Service Domain URL | Domain Name | Domain Type
Domain Address | Domain Description | State
Postal Code | City | Country
Country Code | Phone | Fax
Email | Organization | Established Time
Table 3-10: Whois Information for these five Web Service Registries
Finally, the information of these four aspects is collected together and then delivered to the Storage component for further storage processing.
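A minimal sketch of the first whois step, deriving the service domain from the WSDL link, is given below. The class is hypothetical and simply keeps the last two labels of the host name, a simplification that ignores multi-label top level domains.

    import java.net.URI;
    import java.net.URISyntaxException;

    // Hypothetical sketch: "http://www.thomas-bayer.com/axis2/services/BLZService?wsdl"
    // becomes "thomas-bayer.com" before it is sent to the whois client.
    public class ServiceDomain {
        public static String fromWsdlLink(String wsdlLink) throws URISyntaxException {
            String host = new URI(wsdlLink).getHost();   // strips the scheme, path and query
            if (host.startsWith("www.")) {
                host = host.substring(4);                // strip the "www" label
            }
            // Keep only the last two labels, i.e. the domain below the top level domain.
            String[] labels = host.split("\\.");
            int n = labels.length;
            return n <= 2 ? host : labels[n - 2] + "." + labels[n - 1];
        }
    }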
3231 Features of the Property Grabber Component
The Property Grabber component has to provide the following features:
- Obtain basic information
Generally speaking, the more information is available about a Web service, the better you can judge how good this Web service is. Hence, the Property Grabber component has to extract all the basic information hosted in the service list page and the service page. This basic information comprises the structured information, the endpoint information and the monitoring information.
- Obtain whois information
For the same reason, namely that more information about a Web service allows a better judgment of its quality, it is necessary to extract as much information about the Web service as possible. Therefore, besides the basic information, the Property Grabber component also obtains some additional information called whois information, such as the type of the domain, the name of the person in charge, the postal code of the domain, the city, the phone and fax numbers, the detailed address, the email address, and so on.
3232 Input of the Property Grabber Component
This component requires the following input data:
- Service list page link
- Service page link
3233 Output of the Property Grabber Component
The component produces the following output data:
- Structured information of each service
- Endpoint information of each service, if it exists
- Monitoring information for the service and its endpoint, if it exists
- Whois information of the service domain
All these pieces of information are collected together as the properties of each service; thereafter, the collected properties are sent to the Storage component.
3234 Demonstration for Property Grabber Component
The pictures from figure 3-13 to figure 3-16 depict the fundamental procedure of the Property Grabber component. To simplify the explanation, the example shown in this section uses the same Web service as before.
1) The inputs of the Property Grabber component are the links of the service list page and the service page received from the Web Service Extractor component. These links are "http://www.service-repository.com" and "http://www.service-repository.com/service/overview-210897616".
2) When the "GetProperty" function is triggered, it calls its four sub functions shown in figure 3-12.
Figure 3-13: Structured properties of the service "BLZService" in the service list page
Figure 3-14: Structured properties of the service "BLZService" in the service page
3) First, the "getStructuredProperty" function tries to extract the structured information displayed in the red boxes of figure 3-13 and figure 3-14. Several elements of the structured information have the same content, like the description shown in the service page and in the service list page; hence, in order to save time during the extracting process and space during the storing process, elements with the same content are extracted only once. Moreover, the rating information requires a transformation from non-descriptive content into descriptive text, because it is presented as several star images (a sketch of this transformation follows this demonstration). The final results of the extracted structured information of this Web service are shown in table 3-11. Because there is no descriptive information for the provider, homepage and owner homepage, their values are assigned "NULL".
Service Name | BLZService
WSDL Link | http://www.thomas-bayer.com/axis2/services/BLZService?wsdl
WSDL Version | 0
Server | Apache-Coyote/1.1
Description | BLZService
Rating | Four Stars and a Half
Provider | NULL
Homepage | NULL
Owner Homepage | NULL
Table 3-11: Extracted Structured Information of the Web service "BLZService"
4) Secondly, the "getEndpointProperty" function extracts the endpoint information displayed in the red box of figure 3-15. Although there is more than one endpoint record in the list, only one of them is extracted as the endpoint information. That is because this master program is intended to extract as much information as possible, yet this information should not contain redundant entries; therefore, only one record is extracted as the endpoint information, even if there are several endpoint records. Table 3-12 shows the final results of the endpoint information of this Web service.
Figure 3-15: Endpoint information of the Web service "BLZService" in the service page
Endpoint Name | BLZServiceSOAP12port_http
Endpoint URL | http://www.thomas-bayer.com:80/axis2/services/BLZService
Endpoint Critical | True
Endpoint Type | production
Bound Endpoint | BLZServiceSOAP12Binding
Table 3-12: Extracted Endpoint Information of the Web service "BLZService"
5) Then it is time to extract the monitoring information by invoking the function "getMonitoringProperty". Figure 3-16 displays two kinds of monitoring properties: the red box above contains the monitoring information about the Web service, and the red box below lists the monitoring information of its endpoints. As mentioned before, only one endpoint statistic record is extracted. Besides, as can be seen in figure 3-16, there are two availability values; both represent the availability of this Web service, just like the availability shown in figure 3-14, so one of these availability values is sufficient. Table 3-13 shows the final results of this extracting process.
Figure 3-16: Monitoring information of the service "BLZService" in the service page
Service Availability | 100%
Number of Downs | 0
Total Uptime | 1 day, 19 hours, 19 minutes
Total Downtime | 0 seconds
MTBF | 1 day, 19 hours, 19 minutes
MTTR | 0 seconds
RTT Max of Endpoint | 141 ms
RTT Min of Endpoint | 0 ms
RTT Average of Endpoint | 57.7 ms
Ping Count of Endpoint | 112
Table 3-13: Extracted Monitoring Information of the Web service "BLZService"
6) Next, in order to extract the whois properties, the "getWhoisProperty" function first has to derive the service domain from the WSDL link. For this Web service, the gained service domain is "thomas-bayer.com". It then sends this service domain as input to the whois client for the querying process, which returns a list of information about that service domain, see figure 3-17. Table 3-14 shows the extracted whois information.
Figure 3-17: Whois information of the service domain "thomas-bayer.com"
Service Domain URL | thomas-bayer.com
Domain Name | Thomas Bayer
Domain Type | NULL
Domain Address | Moltkestr. 40
Domain Description | NULL
State | NULL
Postal Code | 54173
City | Bonn
Country | NULL
Country Code | DE
Phone | +4922855525760
Fax | NULL
Email | info@predic8.de
Organization | predic8 GmbH
Established Time | NULL
Table 3-14: Extracted Whois Information of the service domain "thomas-bayer.com"
7) Finally, the information of these four aspects is collected together as the service properties, and these service properties are forwarded to the Storage component.
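The transformation of the rating stars mentioned in step 3) could be sketched as follows. Counting the star images is registry specific, so only the hypothetical text generation step is shown here.

    // Hypothetical sketch of turning the counted rating stars into
    // descriptive text such as "Four Stars and a Half".
    public class RatingText {
        private static final String[] WORDS =
                { "Zero", "One", "Two", "Three", "Four", "Five" };

        public static String describe(int fullStars, boolean halfStar) {
            String text = WORDS[fullStars] + (fullStars == 1 ? " Star" : " Stars");
            return halfStar ? text + " and a Half" : text;
        }
        // describe(4, true) yields "Four Stars and a Half", the value in table 3-11
    }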
324 The Function of Storage Component
The Storage component uses the WSDL link from the WSDL Grabber component to download the WSDL document from the Web and store it on disk. In addition, the service properties from the Property Grabber component are stored directly on disk in three different manners by this Storage component. Figure 3-18 illustrates the process flow of the Storage component.
When the Storage component receives the needed inputs, the mediator function "Storager" of the Storage component is triggered. It transforms the service properties into three different output formats and stores them on disk: an XML file, an INI file and database records. Besides, it also tries to download the WSDL document from the URL address of the WSDL link and then stores the obtained WSDL document on disk, too. The "Storager" function is composed of four sub functions, namely the "getWSDL", "generateXML", "generateDatabase" and "generateINI" sub functions, each of which is in charge of one aspect of the storage tasks.
Figure 3-18: Overview of the process flow of the Storage component
(1) "getWSDL" sub function
The task of the "getWSDL" sub function is to download the WSDL document and then store it on disk. Above all, it has to get the content of the WSDL document, which works as follows. First, the "getWSDL" sub function checks whether the value of the received WSDL link equals "NULL". As already presented in section 322, if a Web service has no WSDL link, the value of its WSDL link is assigned "NULL"; in a case like that, the sub function creates a WSDL document whose name is the service name appended with the mark "No WSDL Document". Obviously this document contains no content, it is an empty document. If the service does have a WSDL link, the sub function tries to connect to the Internet using the URL address of the WSDL link. Once it succeeds, the content hosted in the Web is downloaded, stored on disk, and named with the name of the service only. Otherwise, it creates a WSDL document whose name is prefixed with "Bad" before the service name.
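A minimal Java sketch of this behaviour, with hypothetical file naming that mirrors the conventions just described, might look like this:

    import java.io.InputStream;
    import java.net.URL;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    // Hypothetical sketch of the "getWSDL" sub function.
    public class WsdlStorage {
        public static void getWSDL(String serviceName, String wsdlLink, String dir) {
            try {
                if ("NULL".equals(wsdlLink)) {
                    // Service without WSDL link: create an empty, marked document.
                    Files.createFile(Paths.get(dir, serviceName + "[No WSDL Document].wsdl"));
                    return;
                }
                try (InputStream in = new URL(wsdlLink).openStream()) {
                    // Download the hosted content and store it under the service name.
                    Files.copy(in, Paths.get(dir, serviceName + ".wsdl"));
                }
            } catch (Exception e) {
                // Download failed: create a document prefixed with "Bad".
                try {
                    Files.createFile(Paths.get(dir, "Bad" + serviceName + ".wsdl"));
                } catch (Exception ignored) { }
            }
        }
    }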
(2) "generateXML" sub function
The "generateXML" sub function takes the service properties as input, transforms them into an XML file, and stores it on disk under the name of the service plus the ending ".xml". XML stands for eXtensible Markup Language, a markup language designed to transport and store data. The first line of an XML file is the XML declaration, which defines the XML version and the encoding used. For example, "<?xml version="1.0" encoding="UTF-8"?>" means this is an XML file whose version is 1.0 and whose encoding is the 8-bit Unicode Transformation Format character set. Besides, an XML file contains XML elements, which span everything from an element's start tag to its end tag. Moreover, an XML element can contain other elements, simple text, or a mixture of both. However, an XML file must contain a root element as the parent of all other elements.
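A sketch of such an XML generation with the standard Java DOM and Transformer APIs is shown below. It hypothetically assumes that the property names are already valid XML element names and that "service" is the root element, as in the output shown later in figure 4-6.

    import java.io.File;
    import java.util.Map;
    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.transform.TransformerFactory;
    import javax.xml.transform.dom.DOMSource;
    import javax.xml.transform.stream.StreamResult;
    import org.w3c.dom.Document;
    import org.w3c.dom.Element;

    // Hypothetical sketch of the "generateXML" sub function.
    public class XmlWriter {
        public static void generateXML(String serviceName, Map<String, String> props,
                                       String dir) throws Exception {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element root = doc.createElement("service"); // the required root element
            doc.appendChild(root);
            for (Map.Entry<String, String> p : props.entrySet()) {
                Element e = doc.createElement(p.getKey()); // assumes valid element names
                e.setTextContent(p.getValue());
                root.appendChild(e);
            }
            // Serializing the DOM tree emits the XML declaration automatically.
            TransformerFactory.newInstance().newTransformer().transform(
                    new DOMSource(doc),
                    new StreamResult(new File(dir, serviceName + ".xml")));
        }
    }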
(3) "generateINI" sub function
The "generateINI" sub function also takes the service's properties as input, but it transforms them into an INI file and then stores it on disk under the name of the service plus the ending ".ini". "INI" stands for initialization. The INI file format is a de facto standard for configuration files; INI files are simple text files with a basic structure. Generally speaking, an INI file contains three different kinds of parts: sections, parameters and comments. The parameter is the basic element contained in an INI file. Its format is a key-value pair, also called a name-value pair; the pair is delimited by an equals sign "=", and the key or name always appears to the left of the equals sign. A section is like a room that groups its parameters together. It always appears on a single line within a pair of square brackets "[]", and sections may not be nested. A comment is descriptive text that begins with a semicolon ";"; anything between the semicolon and the end of the line is ignored.
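A minimal sketch of writing such an INI file in Java could look like this; the comment text and the section name are hypothetical stand-ins for the layout described above.

    import java.io.PrintWriter;
    import java.util.Map;

    // Hypothetical sketch of the "generateINI" sub function.
    public class IniWriter {
        public static void generateINI(String serviceName, Map<String, String> props,
                                       String dir) throws Exception {
            try (PrintWriter out = new PrintWriter(dir + "/" + serviceName + ".ini")) {
                out.println("; service properties of " + serviceName); // comment line
                out.println("[" + serviceName + "]");                  // section line
                for (Map.Entry<String, String> p : props.entrySet()) {
                    out.println(p.getKey() + "=" + p.getValue());      // key-value pair
                }
            }
        }
    }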
(4) "generateDatabase" sub function
The inputs of the "generateDatabase" sub function are the same as for the previous two sub functions. Instead of transforming them into a file like XML or INI, the "generateDatabase" sub function turns them into database data by means of SQL statements. SQL stands for Structured Query Language, a database language designed for accessing and manipulating data in a database. Most actions performed on a database are done with SQL statements, and the primary statements of SQL include "insert into", "delete", "update", "select", "create", "alter" and "drop". For the purpose of transforming the service properties into database data, this sub function first has to create a database with a "create database" statement. Then it creates a table to store the data; a table is a collection of related data entries and consists of columns and rows. Since the amount of data for all these five Web Service Registries is not very large, one database table is enough for storing the service properties. For that reason, the field names of the service properties in the columns have to be uniform and well-defined for all these five Web Service Registries. Afterwards, the service properties of each single service can be put into the table as one record with the "insert into" statement of SQL.
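A hypothetical JDBC sketch of this sub function is given below. The SQLite connection URL and the two example columns are assumptions; the actual table carries one "Text" column per unified property name.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.Statement;
    import java.util.Map;

    // Hypothetical sketch of the "generateDatabase" sub function.
    public class DatabaseWriter {
        public static void store(Map<String, String> props) throws Exception {
            // Assumed SQLite database; any JDBC-accessible database would do.
            try (Connection con = DriverManager.getConnection("jdbc:sqlite:services.db")) {
                try (Statement st = con.createStatement()) {
                    st.executeUpdate("CREATE TABLE IF NOT EXISTS services ("
                            + "id INTEGER PRIMARY KEY, "          // increasing primary key
                            + "ServiceName TEXT, WSDLLink TEXT)"); // one TEXT column per property
                }
                String sql = "INSERT INTO services (ServiceName, WSDLLink) VALUES (?, ?)";
                try (PreparedStatement ps = con.prepareStatement(sql)) {
                    ps.setString(1, props.get("Service Name"));
                    ps.setString(2, props.get("WSDL Link"));
                    ps.executeUpdate(); // one record per Web service
                }
            }
        }
    }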
3241 Features of the Storage Component
The Storage component has to provide the following features:
- Generate different output formats
The final result of this master program is to store the information of the services on disk for future work. This Storage component provides three different formats for storing the service properties of the services published in the Web Service Registries, which makes the storage of the services both flexible and long-lived.
- Obtain the WSDL document
The important task of this component is to check whether the WSDL document can be obtained from the WSDL link, because the WSDL document plays a decisive role in determining the quality of a service. This Storage component provides the ability to deal with the different situations that occur during the process of obtaining the WSDL document.
3242 Input of the Storage Component
This component requires the following input data:
- WSDL link of each service
- The property information of each service
3243 Output of the Storage Component
The component will produce the following output data:
- WSDL document of each service
- XML document, INI file and tables in the database
3244 Demonstration for Storage Component
The following figures describe the fundamental implementation code of this Storage component. The detailed depiction is as follows:
1) As can be seen from figures 3-19 to 3-21, there are several common places among the implementation codes. The first common place concerns the parameters defined in each of these sub functions, namely "path" and "SecurityInt". The parameter "path" is an absolute path on the computer disk; it is used for the storing procedure and has already been specified by the user at the beginning of the whole program. The parameter "SecurityInt" is an increasing Integer which is used as a part of the name of the service; this prevents services with the same name from overriding each other on disk. The content of the red marks in the code of these figures is the second common place; its function is to create a file or document for the service with the corresponding parameters.
2) Figure 3-19 displays the implementation code of the "getWSDL" sub function, which is designed to get the WSDL document of a service based on its WSDL link. The parameter "name" is the name of the service, which is used as the name of the WSDL document. The parameter "linkStr" is the actual WSDL link, the most important parameter of this sub function. The other two parameters, "statistic" and "log", are the objects of the text files called "Statistic Information" and "Log Information" respectively. The "Statistic Information" text file records the statistic data of the services in each Web Service Registry, such as the overall number of properties for that Web Service Registry, the overall number of services, the number of services without a WSDL link, the number of services whose WSDL document contains no content, the number of services whose WSDL links are not available, and so on. The "Log Information" text file records the results of the process steps and the problems encountered, for example which service is being crawled at the moment, which Web Service Registry it belongs to, the reason why the WSDL document of a service cannot be obtained, the reason why a service is unreachable, and so on.
Figure 3-19: Implementation code for getting the WSDL document
3) Figure 3-20 and figure 3-21 show the code for turning the service properties into the XML file and the INI file and for storing those two files on disk thereafter. The parameter "vec" is a Vector of the "PropertyStruct" data type; "PropertyStruct" is a class consisting of the two variables name and value.
Figure 3-20: Implementation code for generating the XML file
Figure 3-21: Implementation code for generating the INI file
4) The code in figure 3-22 and figure 3-23 shows the process that turns the service properties of all services in these five Web Service Registries into records of the database. A database has to be created first; its name can be arbitrary as long as it conforms to the naming rules of the database, and the same holds for the name of the table. Figure 3-22 displays the procedure of creating a table in the database. Because it is hard to decide on the length of each service property, the data types of all service properties are set to "Text". Figure 3-23 is the code for inserting the service properties into the table with the "insert into" statement.
Figure 3-22: Implementation code for creating the table in the database
Figure 3-23: Implementation code for generating the table records
33 Multithreaded Programming for DWSC
Multithreaded programming is a built-in characteristic of the Java language. A multithreaded program contains two or more separate parts that can execute concurrently; each such part is called a thread. The use of multithreading makes it possible to create programs that use the system resources efficiently, for example by making maximum use of the CPU, because the idle time of the CPU can be kept to a minimum.
In this master program there are five Web Service Registries whose services need to be crawled, and the number of services published in each Web Service Registry is quite different, so the running time spent on each Web Service Registry differs as well. Without multithreading, a Web Service Registry that owns fewer services would have to wait until another Web Service Registry with many more services finishes. Therefore, in order to reduce the waiting time for the other Web Service Registries and to maximize the use of the system resources, it is necessary to apply multithreaded programming to this master program. That is to say, this master program creates one thread for each Web Service Registry, and these threads are executed independently.
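A minimal sketch of this one-thread-per-registry design in Java follows; crawlRegistry is a hypothetical placeholder for the crawling procedure of one registry.

    // Hypothetical sketch: one thread per Web Service Registry, executed
    // independently, so a small registry does not wait for a large one.
    public class CrawlerMain {
        public static void main(String[] args) throws InterruptedException {
            String[] registries = { "Service Repository", "Ebi", "Xmethods",
                                    "Seekda", "Biocatalogue" };
            Thread[] threads = new Thread[registries.length];
            for (int i = 0; i < registries.length; i++) {
                final String registry = registries[i];
                threads[i] = new Thread(() -> crawlRegistry(registry));
                threads[i].start();   // all registries run concurrently
            }
            for (Thread t : threads) {
                t.join();             // wait until every registry has finished
            }
        }

        private static void crawlRegistry(String registry) {
            // placeholder for the crawling procedure of one registry
        }
    }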
34 Sleep Time Configuration for Web Service Registries
Because this master program is intended to download the WSDL documents and to extract the service information of the Web services published in a Web Service Registry, it inevitably affects the performance of that Web Service Registry. In addition, in order not to exceed their throughput capability, these Web Service Registries restrict the rate of access. Because of that, unknown errors sometimes occur while this master program is executing: for instance, the master program may halt at one point without getting any more WSDL documents and service information, the WSDL documents of some services of some Web Service Registries may not be obtainable, or some service information may be missing. Therefore, in order to obtain the largest possible number of the Web services published in these five Web Service Registries without affecting their throughput, the access rate has to be configured for each service of all Web Service Registries.
In consequence, before going into the essential procedure for each single service of these Web Service Registries, this master program calls the system's built-in function "sleep(long milliseconds)". It is a public static function which causes the currently executing thread to sleep for the specified number of milliseconds; in other words, it temporarily ceases execution for a while. The following table shows the time interval of the sleep function for each Web Service Registry.
Web Service Registry Name | Time Interval (milliseconds)
Service Repository | 8000
Ebi | 3000
Xmethods | 10000
Seekda | 20000
Biocatalogue | 10000
Table 3-15: Sleep Time of these five Web Service Registries
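A small sketch of how these intervals could be applied with Java's Thread.sleep; the map-based lookup is a hypothetical illustration, while the interval values are those of table 3-15.

    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical sketch: pause before processing each single service.
    public class SleepConfig {
        private static final Map<String, Long> SLEEP_MS = new HashMap<>();
        static {
            SLEEP_MS.put("Service Repository", 8000L);
            SLEEP_MS.put("Ebi", 3000L);
            SLEEP_MS.put("Xmethods", 10000L);
            SLEEP_MS.put("Seekda", 20000L);
            SLEEP_MS.put("Biocatalogue", 10000L);
        }

        public static void pauseBeforeService(String registry) throws InterruptedException {
            // Thread.sleep temporarily ceases execution of the current thread.
            Thread.sleep(SLEEP_MS.get(registry));
        }
    }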
4 Experimental Results and Analysis
This chapter presents the quantitative experimental results of the prototype described in chapter 3, together with an analysis and explanation of these results. In order to obtain reasonably accurate results, the experiments were carried out more than five times; all data displayed in the following tables and charts are the averages of these runs.
41 Statistic Information for Different Web Service Registries
This section discusses the amount statistics of the Web services published in these five Web Service Registries. This includes the overall number of Web services published in each Web Service Registry and the number of unavailable Web services, which have been archived because they may not be active anymore or are close to being inactive. Table 4-1 shows the service amount statistics of these five Web Service Registries.
Web Service Registry Name | Service Repository | Ebi | Xmethods | Seekda | Biocatalogue
Overall Services | 57 | 289 | 382 | 853 | 2567
Unavailable Services | 0 | 0 | 0 | 0 | 125
Table 4-1: Service amount statistics of these five Web Service Registries
In order to give an intuitive view of the service amount statistics in these five Web Service Registries, figure 4-1 contains a bar chart derived from table 4-1. As the bar chart shows, on the one hand there is an ascending increase in the overall number of Web services from the Service Repository Web Service Registry to the Biocatalogue Web Service Registry. The Biocatalogue Web Service Registry owns the largest number of Web services, which indicates that it has a much more powerful ability to provide Web services to users, because it contains far more services than the other four Web Service Registries. On the other hand, there are no unavailable services in any Web Service Registry except the Biocatalogue Web Service Registry; that is to say, only the Biocatalogue Web Service Registry contains Web services that cannot be used by users. To some degree this is useless, because such services cannot be used anymore and they waste network resources on the Web. Therefore, all these unavailable services should be eliminated in order to reduce the waste of network resources.
Figure 4-1: Service amount statistics of these five Web Service Registries
42 Statistic Information for WSDL Document
Web Service Registry Name | Service Repository | Ebi | Xmethods | Seekda | Biocatalogue
Failed WSDL Links | 1 | 0 | 23 | 145 | 32
Without WSDL Links | 0 | 0 | 0 | 0 | 16
Empty Content | 0 | 0 | 2 | 0 | 2
Table 4-2: Statistic information for WSDL documents
Table 4-2 and figure 4-2 show the error statistics of the WSDL documents of the Web services in these five Web Service Registries. There are three aspects to this. The first one concerns the "Failed WSDL Links" of the Web services in these Web Service Registries: the overall number of Web services whose WSDL links are invalid. In other words, it is impossible to get the WSDL documents of these Web services from the URL addresses of their WSDL links, and therefore no WSDL document is created. The second aspect is the number of Web services "Without WSDL Links" in these Web Service Registries: the overall number of Web services in each Web Service Registry that have no WSDL link at all. For such Web services there is no WSDL document on the Web, and the value of the WSDL link is "NULL"; a WSDL document is created nevertheless, but it has no content and its name contains the string "[No WSDL Document]". The third aspect is the "Empty Content", which represents the overall number of Web services that have WSDL links whose URL addresses are valid but whose WSDL documents contain no content. In this case a WSDL document whose name contains the string "(BAD)" is created.
Figure 4-2: Statistic information for WSDL documents
43 Comparison of Different Average Number of Service Properties
This section compares the average number of service properties in these five Web Service Registries. The average number of service properties is calculated by means of the following equation:

ASP = ONSP / ONS (1)

where
ASP is the average number of service properties for one Web Service Registry,
ONSP is the overall number of service properties in one Web Service Registry, and
ONS is the overall number of Web services that have already been crawled from the corresponding Web Service Registry.
Figure 4-3 shows the average number of service properties per Web service in these five Web Service Registries. As mentioned before, one of the measurements for assessing the quality of the Web services in a Web Service Registry is the service information: the more information about a Web service is available, the better you know that service, and consequently the corresponding Web Service Registry can offer Web services of better quality to its users. As can be seen from figure 4-3, the Service Repository and Biocatalogue Web Service Registries own a larger number of service properties than the other three Web Service Registries. This directly reflects that these two Web Service Registries provide more detailed information about the Web services published in them, so that users can choose the services they need more easily and are more likely to use the Web services published in these two Web Service Registries. By contrast, the Xmethods and Seekda Web Service Registries, which have less service information about their Web services, offer less quality for these Web services; therefore, users may be less inclined to use the Web services provided in these two Web Service Registries, not to mention the Web services published in the Ebi Web Service Registry.
Figure 4-3: Average number of service properties (Service Repository: 23, Ebi: 7, Xmethods: 17, Seekda: 17, Biocatalogue: 32)
From the description presented in section 323, the causes of the different numbers of service properties in these Web Service Registries may consist of the following points. First, the amount of structured information for the Web services differs among these five Web Service Registries, and part of the information for some Web services in a Web Service Registry may be missing or have an empty value; for example, the amount of structured information that is supposed to be extracted for the Web services in the Ebi Web Service Registry differs greatly from that in the Biocatalogue Web Service Registry. Secondly, there is a certain amount of endpoint information in almost all Web Service Registries except the Ebi Web Service Registry; its absence reduces the overall number of service properties. Thirdly, some Web Service Registries, such as Xmethods and Ebi, have no monitoring information, whereas the Service Repository Web Service Registry in particular has a large amount of monitoring information about its Web services that can be extracted from the Web. The last point, obviously, is the amount of whois information for these Web services. If the database of the whois client does not contain information about the service domain of a Web service, then no whois information can be extracted; and even when information about the service domain exists, its amount can be very diverse. Therefore, if many service domains of the Web services in a registry have no or only little whois information, the average number of service properties in that registry decreases greatly.
As a result, in order to help users get better acquainted with the Web services provided in a Web Service Registry and to distinguish these Web services, the Web Service Registry should do its best to offer more and more information for each of its published Web services.
44 Different Outputs of Web Services
The fundamental task of this master program is to obtain the WSDL documents of the Web services hosted in the Web Service Registries, to extract and gather the properties of these Web services, and thereafter to store them on disk. Therefore, this section describes the different outputs of this master program, which include the WSDL documents of the Web services, the generated XML and INI files, and the data records of the service properties in the database.
Figure 4-4: WSDL document format of one Web service
The WSDL document of a Web service is simply read from the Web according to the URL address of its WSDL link, and the data are then stored on disk as the WSDL document. The name of the WSDL document is the service name plus the ending ".wsdl", which marks it as a WSDL document. In order to distinguish WSDL documents whose names would be the same while their contents differ, the name of each obtained WSDL document in one Web Service Registry additionally contains a unique Integer in front of the name. Figure 4-4 shows the valid WSDL document format of the Web service; its name is "1BLZService.wsdl".
The obtained service properties are transformed into an XML file, an INI file and data records in the database. Figure 4-5, figure 4-6 and figure 4-7 show these three output formats respectively. As can be seen from figure 4-5, this is the INI file of the Web service, named "1BLZService.ini"; the Integer is the same as in the WSDL document, because both belong to the same Web service. The first three lines in the INI file are service comments, which run from the semicolon to the end of the line; they are basic information describing this INI file. The following line is the section, enclosed in a pair of brackets. It is important because it indicates that the lines behind it contain the information of this Web service. The rest of the lines are the actual service information, each given as a key-value pair with an equals sign between key and value, and each service property is displayed from the beginning of its line.
Figure 4-5: INI file format of one Web service
Figure 4-6: XML file format of one Web service
Figure 4-7: Database data format for all Web services
Furthermore, figure 4-6 shows the generated XML file format of the Web service "BLZService"; its name is "1BLZService.xml". Needless to say, this XML file is also part of the materials of the same Web service. Although the format of the XML file differs from that of the INI file, their essential contents are the same, that is to say, the values of the service properties do not differ, because both files are generated from the same collection of properties of the same Web service. The XML file also has some comments like those in the INI file, displayed between "<!--" and "-->", and the section of the INI file corresponds to the root of the XML file. Therefore, the values of the elements under the root "service" in this XML file are the values of the service properties of this Web service.
Eventually, as can be seen from figure 4-7, there is a database table used to store the service information of all Web services in these five Web Service Registries. The entire service information of one Web service is exactly one record in this table. Because of that, the column names of the table are the union of the names of the service information in each Web Service Registry; however, since the column names of a table must be unique, the redundant names in this union are eliminated. This is sensible and possible because the names of the service information are well-defined and uniform for all these five Web Service Registries. In addition, the first column of this table is the primary key, an increasing Integer, whose function resembles that of the Integer contained in the names of the XML and INI files. The remaining columns of the table are the corresponding service properties of the Web services. The value "NULL" or "Null" in this table indicates that this property of that Web service is empty or missing.
45 Comparison of Average Time Cost for Different Parts of Single Web Service
This section compares the average time cost of the different parts of getting one single Web service in all these five Web Service Registries. First, the average time cost of getting one single service in a Web Service Registry has to be calculated, which can be done with the following equation:

ATC = OTS / ONS (2)

where
ATC is the average time cost for one single Web service,
OTS is the overall time cost of all the Web services in one Web Service Registry, and
ONS is the overall number of Web services that have already been crawled from the corresponding Web Service Registry.

The average time cost of getting one single service is split into the following six parts: the average time cost of extracting the service properties, the average time cost of obtaining the WSDL document, the average time cost of generating the XML file, the average time cost of generating the INI file, the average time cost of inserting the service properties into the database table, and the average time cost of the other procedures, such as getting the service list page link, getting the service page link, and so on. The average time cost of extracting the service properties is obtained by means of the following equation:

ATCSI = OTSSI / ONS (3)

where
ATCSI is the average time cost for extracting the service properties of one single Web service,
OTSSI is the overall time cost for extracting the service properties of all the Web services in one Web Service Registry, and
ONS is the overall number of Web services that have already been crawled from the corresponding Web Service Registry.

The calculation of the other parts is analogous to the equation for the average time cost of extracting the service properties, while the average time cost of the other procedures equals the average time cost of one single Web service minus the sum of the average time costs of the other five parts.
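For example, applying this to the Service Repository row of table 4-3 below gives 10042 - (8801 + 918 + 2 + 1 + 53) = 267 milliseconds for the other procedures, which matches the "Others" column of that table.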
Registry | Service Property | WSDL Document | XML File | INI File | Database | Others | Overall
Service Repository | 8801 | 918 | 2 | 1 | 53 | 267 | 10042
Ebi | 699 | 82 | 2 | 1 | 28 | 11 | 823
Xmethods | 5801 | 1168 | 2 | 1 | 45 | 12 | 7029
Seekda | 5186 | 1013 | 2 | 1 | 41 | 23 | 6266
Biocatalogue | 39533 | 762 | 2 | 1 | 66 | 1636 | 42000
Table 4-3: Average time cost information (in milliseconds) for all Web Service Registries
Table 4-3 displays the average time cost of one Web service and of its different parts in all these five Web Service Registries. The first column of table 4-3 contains the names of these five Web Service Registries, the last column is the average time cost of one single service in the respective Web Service Registry, and the remaining columns are the average time costs of the six different parts. In order to give an intuitive view of the data in table 4-3, the data in each column of this table are illustrated in the corresponding figures 4-8 to 4-13.
Figure 4-8: Average time cost for extracting the service properties in all Web Service Registries
As can be seen from figure 4-8, the average time cost for extracting the service properties in the Biocatalogue Web Service Registry is 39533 milliseconds. This value is much larger than those of the other four Web Service Registries, which are 8801, 699, 5801 and 5186 milliseconds for Service Repository, Ebi, Xmethods and Seekda respectively. That is to say, it takes much longer to extract the service properties of the Web services published in the Biocatalogue Web Service Registry. This also indirectly indicates that the Biocatalogue Web Service Registry has the highest average number of service properties, as already discussed in section 43; on the contrary, the average number of service properties in the Ebi Web Service Registry is the smallest. Moreover, the average time cost in the Xmethods Web Service Registry is larger than that in the Seekda Web Service Registry although, as already known, the average number of service properties is the same for these two Web Service Registries. One cause might explain why the average time in Xmethods is higher than in Seekda: the process of extracting the service properties in the Xmethods Web Service Registry has to work on both the service page and the service list page, while for the Seekda Web Service Registry only the service page link is needed.
The average time cost for obtaining the WSDL document in all these five Web Service Registries is displayed in figure 4-9. Actually, this average time cost is the sum of the average time for extracting the WSDL link of a Web service and the average time for reading the data of the WSDL document from the Web and then storing it on disk. As seen from this figure, the average time cost for obtaining the WSDL document in the Xmethods Web Service Registry is the largest, namely 1168 milliseconds. Although the process of extracting the WSDL link costs a certain amount of time, it does not have a significant influence on the total average time spent on obtaining the WSDL document, because the WSDL link of a Web service is almost always gained in one step. This implies that the average size of the WSDL documents in the Xmethods Web Service Registry is larger than in the other four Web Service Registries, in particular the Ebi Web Service Registry, which takes just 82 milliseconds for the process of obtaining the WSDL document.
Figure 4-9: Average time cost for obtaining the WSDL document in all Web Service Registries
Figure 4-10, figure 4-11 and figure 4-12 show the average time costs of generating the three different outputs in all these five Web Service Registries. As can be seen from figures 4-10 and 4-11, the average time of generating the XML file for one Web service is the same for all these five Web Service Registries, namely only 2 milliseconds; likewise, the average time of generating the INI file for one Web service is the same everywhere, and its value is just 1 millisecond. Even the sum of these two average time costs is so small that it can be omitted when compared to the overall average time cost of getting one Web service in each corresponding Web Service Registry, shown in figure 4-13. This implies that the generation of the XML and INI files is finished at once after the service properties of a Web service have been received. Furthermore, as can be seen from figure 4-12, although the average time cost of creating the database record for each Web service in all these five Web Service Registries is larger than the time for generating the XML and INI files, the operation of creating a database record is still fast.
Figure 4-10: Average time cost for generating the XML file in all Web Service Registries
Figure 4-11: Average time cost for generating the INI file in all Web Service Registries
Figure 4-12: Average time cost for creating the database record in all Web Service Registries
Figure 4-13: Average time cost for getting one Web service in all Web Service Registries
Figure 4-13 gives the average time cost of getting one single Web service in all these five Web Service Registries. Without any doubt, the Biocatalogue Web Service Registry takes the longest time for this process: as the presentation of the five different parts above shows, almost every part needs the most time in the Biocatalogue Web Service Registry, with the exception of the process of obtaining the WSDL document, where Biocatalogue does not cost the most time. Moreover, there is a striking observation when looking at figures 4-8, 4-12 and 4-13: the shapes of these curves follow almost the same trend. This further indicates that a Web Service Registry that spends more time on getting the description information of a Web service also offers more information about that Web service.
5 Conclusion and Further Direction
This master thesis provides a schema which aims to explore the description information of the Web services hosted in different Web Service Registries. The description information of a Web service consists of its WSDL document and the service information about that Web service. Although there is an existing approach, the Pica-Pica Web Service Description Crawler, that can be used to obtain the WSDL document and service information of the Web services in these Web Service Registries, its functionality is restricted: it explores only a small subset of the Web services hosted in these Web Service Registries, the storage of these Web services is not flexible, and, most importantly, only a few items of service information are extracted per Web service; for some Web Service Registries no service information is extracted at all. The work presented in this master thesis, in contrast, is able to explore all Web services in these Web Service Registries. Moreover, with the approach presented here, as much service information as possible is extracted for each Web service, so that the final result is the largest annotated service catalogue ever produced. Furthermore, regarding the storage of the description information of the Web services, this master thesis provides three different methods that guarantee not only the completeness but also the longevity of the description information of each Web service.
However, in the implementation performed in this master thesis, the whois client used for querying the information of a service domain returns free text if the information exists, and this free text sometimes differs completely between domains. As a consequence, during the experiment stage each Web service in all Web Service Registries had to be crawled at least once, so that all variants of this free text could be foreseen and processed afterwards. This is a huge amount of work, because there are lots of Web services in these Web Service Registries. Therefore, in order to simplify the work, another whois client that eases this task needs to be found and used.
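To illustrate the problem, consider the following minimal Python sketch (the thesis implementation itself is written in Java; the field names queried at the end are only examples). It calls the system whois client and attempts a best-effort parse of the free-text reply, which fails for any registrar that formats its records differently:

    import subprocess

    def whois_lookup(domain):
        # The system whois client replies with free text whose layout
        # depends on the registrar -- exactly the problem described above.
        return subprocess.check_output(["whois", domain]).decode("utf-8", "replace")

    def parse_whois(text):
        # Best-effort parsing: collect "Key: value" lines into a dict.
        # Many replies will not follow this pattern at all.
        fields = {}
        for line in text.splitlines():
            key, sep, value = line.partition(":")
            if sep and key.strip() and value.strip() and not line.startswith("%"):
                fields.setdefault(key.strip(), value.strip())
        return fields

    info = parse_whois(whois_lookup("thomas-bayer.com"))
    print(info.get("Registrar"), info.get("Creation Date"))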
Moreover, in the experiment stage of this master thesis the time cost for getting one Web service is still large. For the purpose of reducing this time, multithreaded programming could be applied to some parts of the process of getting one Web service.
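As a rough illustration of this idea (the actual implementation is in Java; the function and the links below are placeholders), a small thread pool can crawl several services concurrently, overlapping the network waiting time that dominates the per-service cost:

    from concurrent.futures import ThreadPoolExecutor

    def crawl_service(service_page_link):
        # Placeholder for the per-service work: extract the properties,
        # download the WSDL document and store both.
        print("crawling", service_page_link)

    service_links = ["http://www.example.org/service/%d" % i for i in range(20)]

    # Network latency dominates the per-service time, so a small pool
    # of threads shortens the overall crawl considerably.
    with ThreadPoolExecutor(max_workers=8) as pool:
        pool.map(crawl_service, service_links)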
Although the work performed here is specialized for these five Web Service Registries only, the main principles used are adaptable to other Web Service Registries with only small changes in the implementation code or its structure.
6 Bibliography
[1] Nathalie Steinmetz, Holger Lausen, Dario Cerizza, Andrea Turati, Irene Celino, Adam Funk and Michael Erdmann: "D1.1 – Requirement Analysis and Architectural Plan". Available from http://www.service-finder.eu/deliverables/public/7-public/32-d11, Emanuele Della Valle (CEFRIEL), June 27, 2008.
[2] Nathalie Steinmetz, Holger Lausen, Dario Cerizza, Andrea Turati, Irene Celino, Adam Funk and Michael Erdmann: "D1.2 – First Design of Service-Finder as a Whole". Available from http://www.service-finder.eu/deliverables/public/7-public/37-d12-first-design-of-service-finder-as-a-whole, Emanuele Della Valle (CEFRIEL), July 1, 2008.
[3] Nathalie Steinmetz, Holger Lausen, Irene Celino, Dario Cerizza, Saartje Brockmans, Adam Funk: "D1.3 – Revised Requirement Analysis and Architectural Plan". Available from http://www.service-finder.eu/deliverables/public/7-public/57-d13-revised-requirement-analysis-and-architectural-plan, Emanuele Della Valle (CEFRIEL), April 1, 2009.
[4] Chia-Hui Chang, Mohammed Kayed, Moheb Ramzy Girgis, Khaled Shaalan: "A Survey of Web Information Extraction Systems". Volume 18, Issue 10, IEEE Computer Society, pp. 1411–1428, October 2006.
[5] Leonard Richardson: "Beautiful Soup Documentation". Available from http://www.crummy.com/software/BeautifulSoup/documentation.html, October 13, 2008.
[6] Hao He, Hugo Haas, David Orchard: "Web Services Architecture Usage Scenarios". Available from http://www.w3.org/TR/ws-arch-scenarios/, February 11, 2004.
[7] Stephen Soderland: "Learning Information Extraction Rules for Semi-Structured and Free Text". Machine Learning, Volume 34, Issue 1–3, pp. 233–272, Department of Computer Science and Engineering, University of Washington, Seattle, February 1999.
[8] Ian Hickson: "A Vocabulary and Associated APIs for HTML and XHTML". World Wide Web Consortium, Working Draft WD-html5-20100624, June 24, 2010.
[9] Holger Lausen, Jos de Bruijn, Axel Polleres, Dieter Fensel: "The Web Service Modeling Language WSML". WSML Deliverable D16.1 v0.2, March 20, 2005. Available from http://www.wsmo.org/TR/d16/d16.1/v0.2/
[10] Dumitru Roman, Holger Lausen, Uwe Keller: "Web Service Modeling Ontology – Standard (WSMO-Standard)". WSMO Deliverable D2, version 1.1, March 6, 2004. Available from http://www.wsmo.org/TR/d2/v1.1/
[11] Iris Braun, Anja Strunk, Gergana Stoyanova, Bastian Buder: "ConQo – A Context- and QoS-Aware Service Discovery". TU Dresden, Department of Computer Science, in Proceedings of WWW/Internet 2008.
7 Appendixes
There are two additional outputs of this master program: a log information file and a statistic report file. Figure 8-1 shows one of the basic output formats of the log information for these five Web Service Registries.
Figure 8-1: Log information of the "Service Repository" Web Service Registry
Figure 8-2: Statistic information of the "Service Repository" Web Service Registry
Figures 8-2 to 8-6 show the output format of the statistic information for the "Service Repository", "Ebi", "Xmethods", "Seekda" and "Biocatalogue" Web Service Registries respectively.
Figure 8-3: Statistic information of the "Ebi" Web Service Registry
Figure 8-4: Statistic information of the "Xmethods" Web Service Registry
Figure 8-5: Statistic information of the "Seekda" Web Service Registry
Figure 8-6: Statistic information of the "Biocatalogue" Web Service Registry
Table of Figures
Figure 2-1: Dataflow of Service-Finder and Its Components .......... 12
Figure 2-2: Left is the free text input type and right is its output .......... 16
Figure 2-3: A semi-structured page containing data records (in rectangular box) to be extracted .......... 16
Figure 2-4: Architecture of the Pica-Pica Web Service Description Crawler .......... 20
Figure 3-1: Overview of the Basic Architecture for the Deep Web Services Crawler .......... 25
Figure 3-2: Overview of the process flow of the Web Service Extractor Component .......... 27
Figure 3-3: Service list page of the Service-Repository .......... 29
Figure 3-4: Original source code of the internal link for Web service "BLZService" .......... 29
Figure 3-5: Code overview of getting service page link in Service Repository .......... 29
Figure 3-6: Service page of the Web service "BLZService" .......... 29
Figure 3-7: Overview of the process flow of the WSDL Grabber Component .......... 30
Figure 3-8: WSDL link of the Web service "BLZService" in the service page .......... 31
Figure 3-9: Original source code of the WSDL link for Web service "BLZService" .......... 32
Figure 3-10: Code overview of "getServiceRepositoryWSDLLink" function .......... 32
Figure 3-11: Code overview of "oneParameter" function .......... 32
Figure 3-12: Overview of the process flow of the Property Grabber Component .......... 33
Figure 3-13: Structured properties of the service "BLZService" in service list page .......... 37
Figure 3-14: Structured properties of the service "BLZService" in service page .......... 38
Figure 3-15: Endpoint information of the Web service "BLZService" in service page .......... 38
Figure 3-16: Monitoring information of the service "BLZService" in service page .......... 39
Figure 3-17: Whois information of the service domain "thomas-bayer.com" .......... 40
Figure 3-18: Overview of the process flow of the Storage Component .......... 41
Figure 3-19: Implementation code for getting WSDL document .......... 44
Figure 3-20: Implementation code for generating XML file .......... 44
Figure 3-21: Implementation code for generating INI file .......... 45
Figure 3-22: Implementation code for creating table in database .......... 45
Figure 3-23: Implementation code for generating table records .......... 46
Figure 4-1: Service amount statistic of these five Web Service Registries .......... 49
Figure 4-2: Statistic information for WSDL document .......... 50
Figure 4-3: Average number of service properties .......... 51
Figure 4-4: WSDL document format of one Web service .......... 52
Figure 4-5: INI file format of one Web service .......... 53
Figure 4-6: XML file format of one Web service .......... 53
Figure 4-7: Database data format for all Web services .......... 53
Figure 4-8: Average time cost for extracting service property in all Web Service Registries .......... 55
Figure 4-9: Average time cost for obtaining WSDL document in all Web Service Registries .......... 56
Figure 4-10: Average time cost for generating XML file in all Web Service Registries .......... 57
Figure 4-11: Average time cost for generating INI file in all Web Service Registries .......... 57
Figure 4-12: Average time cost for creating database record in all Web Service Registries .......... 58
Figure 4-13: Average time cost for getting one Web service in all Web Service Registries .......... 58
Table of Tables
Table 3-1: Structured Information of Service-Repository Web Service Registry .......... 34
Table 3-2: Structured Information of Xmethods Web Service Registry .......... 34
Table 3-3: Structured Information of Seekda Web Service Registry .......... 34
Table 3-4: Structured Information of Ebi Web Service Registry .......... 34
Table 3-5: Structured Information of Biocatalogue Web Service Registry .......... 34
Table 3-6: SOAP Operation Information of Biocatalogue Web Service Registry .......... 35
Table 3-7: REST Operation Information of Biocatalogue Web Service Registry .......... 35
Table 3-8: Endpoint Information of these five Web Service Registries .......... 35
Table 3-9: Monitoring Information of these five Web Service Registries .......... 35
Table 3-10: Whois Information for these five Web Service Registries .......... 36
Table 3-11: Extracted Structured Information of Web Service "BLZService" .......... 38
Table 3-12: Extracted Endpoint Information of the Web service "BLZService" .......... 39
Table 3-13: Extracted Monitoring Information of the Web service "BLZService" .......... 39
Table 3-14: Extracted Whois Information of service domain "thomas-bayer.com" .......... 40
Table 3-15: Sleep Time of these five Web Service Registries .......... 47
Table 4-1: Service amount statistic of these five Web Service Registries .......... 48
Table 4-2: Statistic information for WSDL Document .......... 49
Table 4-3: Average time cost information for all Web Service Registries .......... 55
Table of Abbreviations
DTD Document Type Definition
DWSC Deep Web Service Crawler
HTML HyperText Markup Language
MTBF Mean Time between Failures
MTTR Mean Time to Recovery
QoS Quality of Service
REST Representational State Transfer
RTT Round Trip Time
SOAP Simple Object Access Protocol
SQL Structured Query Language
URL Uniform Resource Locator
WHATWG Web Hypertext Application Technology Working Group
WSDL Web Service Description Language
WSML Web Service Modeling Language
WSMO Web Service Modeling Ontology
XML eXtensible Markup Language
Chapter 3 then explains the design details of this Deep Web Service Crawler approach. Section 3.1 gives a short description of the different requirements of the approach. Next, in section 3.2, the actual design of the Deep Web Service Crawler is presented. Sections 3.3 and 3.4 then introduce the multithreaded programming and the sleep time configuration used in this master program, respectively.
Chapter 4 presents the experiments performed with this Deep Web Service Crawler approach and gives an evaluation of it.
Finally, chapter 5 presents the conclusions and a discussion of the work already done, as well as the future work for this master task.
2 State of the Art
This chapter aims at presenting some existing techniques and strategies that are related to the work of this Deep Web Service Crawler approach. Section 2.1 discusses an existing service catalogue, the Service-Finder project. Section 2.2 then explains the Information Extraction technique. Finally, section 2.3 presents some details about an existing implemented crawler, the Pica-Pica Web Service Description Crawler.
2.1 Service Finder Project
The Service-Finder project aims at developing a platform for Web Service discovery, especially for the Web Services that are embedded in a Web 2.0 environment [1]. Hence it can provide efficient access to publicly available services. The goals of the Service-Finder project are as follows [1]:
• Automatically gather Web Services and their related information.
• Semi-automatically create semantic service descriptions based on the information that is available on the Web.
• Create and improve semantic annotations via user feedback.
• Describe the aggregated information in semantic models and allow reasoning and querying.
However, before describing the basic functionality of the Service-Finder project, one of its use cases and its requirements are presented first.
2.1.1 Use Cases for Service-Finder Project
The Service-Finder project employed the use case methodology of the W3C Use Case description [6] for its needs and applied this methodology to the use cases it enumerated.
2.1.1.1 Use Case Methodology
Three aspects need to be considered for the use case definitions of the Service-Finder project [1]:
(1) Description: used to describe the information of the use case.
(2) Actors, Roles and Goals: used to identify the actors, the roles they act in, and the goals they need to achieve in the scenario.
(3) Storyboard: used to describe the series of interactions among the actors and the Service-Finder Portal.
2.1.1.2 System Administrator
This section presents the use case that was applied to the Service-Finder portal and that illustrates the requirements on its functionality from a user's point of view. All the information in this use case is derived from [1]. In this use case there is a system administrator named "Sam Adams". He works for a bank; his job is to keep the online payment facilities up and running day and night. Therefore, if any system failure occurs, Sam Adams has to fix the problem as early as he can. That is why he wants to use an SMS messaging service which will alert him immediately by sending him an SMS message in the case of a system failure.
• Description
This use case describes the system administrator "Sam Adams", who is looking for an SMS messaging service that he wants to build into his application.
• Actors, Roles and Goals
The name of the actor is "Sam Adams". His role is that of a system administrator at a bank. His goals are immediate service delivery, reliability of the service, and a low base fee and transaction fee.
• Storyboard
Step 1: Sam Adams knows the Service-Finder portal, and he also knows that he can find many useful services through it; in particular, he knows what he is looking for. Hence he visits the Service-Finder portal and starts to search by entering the keywords "SMS Alert".
Requirement 1: Search functionality.
Step 2: The Service-Finder portal now returns a list of matching services. However, Sam wants to choose the number of matching services displayed on one page. He would also expect short information about the service functionality, the service provider and the service availability, so that he can decide which service to read about further.
Requirement 2: Enable configurable pagination of the matching results and provide short information for each service.
Step 3: When Sam looks through the short information about the services displayed on the first page, he expects to find the services most relevant to his request at the top. After that, he would like to read more detailed information about a service to see whether it can provide the needed functionality.
Requirement 3: Rank the returned matching services and provide the ability to read more details of a service.
Step 4: It may be the case that the returned matching services provide quite different functionalities or belong to different service categories; for example, some messaging services alert users not through SMS but through voice messaging. For this reason, Sam would like to see other categories that may contain the services he wants, or services of other categories he is also interested in (like "SMS Messaging"). Another possible way is for Sam to further refine his search by browsing through categories.
Requirement 4: Provide service categories and allow the user to see all services that belong to a specific category. If possible, also allow the user to browse through categories.
Step 5: When Sam has found all the services that could provide an SMS messaging service via the methods described in step 4, he now wants to look for services that are offered by an Austrian provider and, if possible, have no base fees.
Requirement 5: Faceted search.
Step 6: After Sam has found all these specific services, he would like to choose the services that provide a high reliability.
Requirement 6: Sort functionality based on the user's choices.
Step 7: Sam now wants to compare the service availability promised by the service provider with the availability actually delivered. This should be contained in the services' details. There also needs to be service coverage information, so that Sam can know whether a service covers the areas where he lives and works. Moreover, Sam would like to compare these services in other ways, for instance by putting some services into a structured table to compare the transaction fees.
Requirement 7: A side-by-side comparison table for services and a functionality that enables the user to select the services he wants to compare.
Step 8: Finally, Sam wants to know whether the service providers offer a free trial of their services, so that he can test the service functionality.
Requirement 8: If possible, display a note for services offering free trials.
2.1.2 Architecture Plan for the Service-Finder Project
The architecture plan of the Service-Finder project contains five basic components: Service Crawler, Automatic Annotator, Conceptual Indexer and Matcher, Cluster Engine, and Service-Finder Portal Interface. Figure 2-1 presents a high-level overview of the components and the data flow among them.
Figure 2-1: Dataflow of Service-Finder and Its Components [3]
2.1.2.1 The Principle of the Service Crawler Component
The Service Crawler is responsible for gathering available services and their related information from the Web. The overall cycle is as follows:
(1) A Web developer publishes a Web Service.
(2) The crawling component begins to harvest the Web in order to identify Web Services, e.g. via their WSDL (Web Service Description Language) documents.
(3) The crawler also searches for other related information as soon as a service is discovered.
(4) After each periodic interval, the crawler produces a consistent snapshot of the relevant part of the Web.
Finally, the output of the crawler is forwarded to the subsequent components for analyzing, indexing and displaying.
2.1.2.2 The Principle of the Automatic Annotator Component
The Automatic Annotator receives the relevant data from the previous component and generates semantic service descriptions of the WSDL documents and their related information, based on the Service-Finder Ontology and the Service Category Ontology.
First, the two compatible ontologies that are used throughout the whole process are briefly introduced [2]:
• Generic Service Ontology: an ontology which is used to describe the data objects, for example the services, the service providers, availability, payment modalities and so on.
• Service Category Ontology: an ontology which is used to categorize the functionalities or applications of the services, for instance data verification, messaging, data storage, weather, etc.
The function of this component, together with its input and output, is as follows:
• Input:
  - Crawled data from the Service Crawler
  - Service-Finder ontologies
  - Feedback on or corrections of previous annotations
• Function:
  - Enrich the information about the service and extract semantic statements according to the Service-Finder ontologies, for example categorize the service according to the Service Category Ontology
  - Determine whether a particular document is relevant or not through the Web link graph, and discard irrelevant documents
  - Classify the pages into their genres, for instance pricing, user comments, FAQ and so on
• Output:
  - Semantic annotations of the services
2.1.2.3 The Principle of the Conceptual Indexer and Matcher Component
The Conceptual Indexer and Matcher is essentially a data store center that aims at storing all the extracted information about the services and supplying users with retrieval and semantic query capabilities, for example the matchmaking between user requests and service offers, and the retrieval of user feedback on extracted annotations.
The function of this component and its input and output are as follows:
• Input:
  - Semantic annotation data and full text information obtained from the Automatic Annotator
  - Semantic annotation data and full text information that come from the user interface
  - Cluster data from the user and service clustering component
• Function:
  - Store the semantic annotations received from the Automatic Annotator component and from the user interface
  - Store the cluster data procured through the clustering component
  - Store and index the textual descriptions offered by the Automatic Annotator component and the textual comments offered by users
  - Ontological querying of the semantic data in the data store center
  - Combined keyword and ontological querying for user queries
  - Provide a list of similar services for a given service
• Output:
  - A list of matching services for a user query; in particular, these services should be sorted by ranking and should be iterable
  - All available data related to a particular entity must be retrievable at the user interface
2.1.2.4 The Principle of the Service-Finder Portal Interface Component
The Service-Finder Portal Interface is the main entry point provided for users of the Service-Finder system to search and browse the data which is managed by the Conceptual Indexer and Matcher component. In addition, users can contribute information by providing tags, comments, categorizations and ratings for the data they browse. Furthermore, developers can directly invoke the Service-Finder functionalities from their custom applications through an API.
The details of this component's function, input and output are as follows:
• Input:
  - A list of ordered services for a query
  - Detailed information about a service or a set of services and a service provider
  - Query access to the service category ontology and the most used tags provided by the users
  - Service availability information
• Function:
  - The Web interface allows the users to search for services by keyword, tag or concept in the categorization; to sort and filter query results by refining the query; to compare and bookmark services; and to try out the services that offer this functionality
  - The API allows developers to invoke Service-Finder functionalities
• Output:
  - Explicit user annotations such as tags, ratings, comments, descriptions and so on
  - Implicit user data, for example click streams of users, bookmarks, comparisons, links sent, etc.
  - Manual advertising of available new services
2.1.2.5 The Principle of the Cluster Engine Component
The Cluster Engine is in charge of collecting and analyzing information about user behavior from the Service-Finder Portal, e.g. the services queried and compared by the users. Moreover, it provides cluster data to the Conceptual Indexer and Matcher for service recommendations.
This component's function, input and output are as follows:
• Input:
  - Service annotation data, both extracted and from user feedback
  - Users' click streams, used for extracting user behavior
• Function:
  - Obtain user clusters from user behavior
  - Obtain service clusters from service annotation data, in order to be able to find similar services
• Output:
  - Clusters of users and services
2.2 Information Extraction
Due to the rapid development and use of the World-Wide Web, a huge number of information sources have appeared on the Internet, but access to them has been limited to browsing and searching because of the heterogeneity and the lack of structure of Web information sources. Therefore, Information Extraction, which transforms Web pages into program-friendly structures for post-processing, has become a great necessity. The task of Information Extraction is specified in terms of its inputs and its extraction targets, and the technique used in the process of Information Extraction is called an extractor.
2.2.1 Input Types of Information Extraction
Generally speaking, there are three different input types. The first input type is the unstructured document, for example the free text shown in figure 2-2. It is unstructured and written in natural language, so it requires substantial natural language processing. The second input type is the structured document, for instance XML documents, where the data can be described through an available DTD (Document Type Definition) or XML (eXtensible Markup Language) schema. The third input type is the semi-structured document, which is widespread on the Web, such as the large volume of HTML pages with tables, itemized lists and enumerated lists. This is because HTML tags are often used to render such embedded data in HTML pages. See figure 2-3.
Figure 2-2: Left is the free text input type and right is its output [4]
Figure 2-3: A semi-structured page containing data records (in rectangular box) to be extracted [4]
The inputs of the semi-structured type can therefore be seen as documents with a fairly regular structure, and the data of these documents can be displayed in an HTML format or a non-HTML format. Besides, since the Web pages of the Deep Web are dynamic and generated from structured databases by means of templates or layouts, the Deep Web can be considered one of the input sources that provide such semi-structured documents. For example, the author, price and comment fields of the book pages provided by Amazon have the same layout, because these Web pages are generated from the same database and with the same template or layout. Furthermore, there is another option: manually generated HTML pages of the semi-structured type. For example, although the publication lists provided on different researchers' homepages are produced by diverse users, they all have a title and a source property for every single paper. Finally, the inputs for an Information Extraction task can also be pages of the same class, or pages from various Web Service Registries.
2.2.2 Extraction Targets of Information Extraction
Regarding the task of Information Extraction, the extraction target also has to be considered. There are two different extraction targets. The first one is the relation of a k-tuple, where k is the number of attributes in a record. In some cases an attribute of a record may have no instantiation at all, while in other cases an attribute has multiple instantiations. The second extraction target is the complex object with hierarchically organized data. Although the ways of depicting extraction targets in a page are diverse, the most common structure is the hierarchical tree. Such a hierarchical tree may consist of only one leaf node, or of internal nodes containing one or more lists of leaf nodes. The structure of a data object may also be flat or nested. In brief, if the structure is flat, there is only one leaf node, which can also be called the root; if it is a nested structure, the internal nodes involved in this data object span more than two levels.
Furthermore, in order to make Web pages readable for human beings and easier to visualize, tables, tuples of the same list, and elements of a tuple are usually clearly isolated or demarcated. However, the display of a data object in a Web page is affected by the following conditions [4]:
• The attribute of a data object has zero or several values:
(1) If there is no value for an attribute of a data object, this attribute is called a "none" attribute. For example, a special offer that is only available for certain books might be a "none" attribute.
(2) If there is more than one value for an attribute of a data object, it is called a "multiValue" attribute. For instance, the name of the author of a book could be a "multiValue" attribute.
• The set of attributes (A1, A2, A3, ...) has multiple orderings:
That is to say, among this set of attributes the position of an attribute might change between the diverse instances of a data object. Such an attribute is called a "multiOrdering" attribute. For instance, for movies released before 1999 a movie site might list the release date in front of the movie's title, while for movies from 1999 onwards it lists the release date behind the movie's title.
• The attribute has different formats:
This means the display format of the data object can be completely different for different instances. Therefore, if the format of an attribute is free, a lot of rules are needed to deal with all possible cases. This kind of attribute is called a "multiFormat" attribute. For example, an e-commerce Web site might use a bold font to present regular prices, while using a red color to display sale prices. Nevertheless, there is also the opposite situation, where different attributes of a data object have the same format, for example various attributes presented using the <TD> tags of a table presentation. Such attributes can be differentiated by means of their order information. However, for cases where a "none" attribute occurs or "multiOrdering" attributes exist, the rules for extracting these attributes have to be revised.
• The attribute cannot be decomposed:
For easier processing, the input documents are sometimes treated as strings of tokens instead of strings of characters. However, some attributes cannot be decomposed into several individual tokens. These attributes are called "untokenized" attributes. For example, in college course catalogue entries like "COMP4016" or "GEOL2001", the department code and the course number cannot be separated into two different strings of characters such as "COMP" and "4016" or "GEOL" and "2001".
2.2.3 The Used Techniques in Information Extraction
The extractor used in the process of Information Extraction aims at providing a single uniform query interface to access information sources like database servers and Web servers. It consists of the following phases: collecting returned Web pages, labeling these Web pages, generalizing extraction rules, extracting the relevant data, and outputting the result in an appropriate format (XML format or relational database) for further information integration. For example, the extractor first queries the Web server through the HTTP protocol to gather the returned pages; after that it starts to extract the contents of these HTML documents and then integrates them with other data sources. The whole process of the extractor follows the steps below:
• Step 1:
At the beginning, the input has to be tokenized. There are two different granularities for tokenizing the input string: tag-level encoding and word-level encoding. Tag-level encoding transforms the tags of an HTML page into general tokens, while transforming each text string between two tags into a special token. Word-level encoding does this in another way: it treats each word in a document as a token.
• Step 2:
Next, the extraction rules are applied for every attribute of the data object in the Web pages. These extraction rules can be induced by means of top-down or bottom-up generalization, pattern mining, or logic programming. In addition, the type of the extraction rules may be expressed by means of regular grammars or logic rules. For example, some use path expressions over the HTML parse tree, like html.head.title or html->table[0]; some use syntactic or semantic constraints; and some use delimiter-based constraints such as HTML tags or literal words.
• Step 3:
After that, all the extracted data are assembled into records.
• Step 4:
Finally, this process is iterated until all the data objects in the input have been processed.
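As a small illustration of steps 1 to 3 (a sketch only; the HTML fragment and the extraction rule are invented for this example), the following Python snippet tokenizes a fragment at tag level and applies a simple delimiter-based rule:

    import re

    html = "<tr><td>Deep Web Crawling</td><td>12.99</td></tr>"

    # Step 1, tag-level encoding: tags become general tokens, the text
    # between two tags becomes a special TEXT token.
    tokens = [("TAG", t) if t.startswith("<") else ("TEXT", t)
              for t in re.split(r"(<[^>]+>)", html) if t.strip()]

    # Step 2, a delimiter-based extraction rule: the value of interest is
    # whatever TEXT token appears between a <td> tag and a </td> tag.
    record = []
    for i, (kind, value) in enumerate(tokens):
        if (kind == "TEXT"
                and tokens[i - 1] == ("TAG", "<td>")
                and tokens[i + 1] == ("TAG", "</td>")):
            record.append(value)

    # Step 3: assemble the extracted data into a record.
    print(record)  # ['Deep Web Crawling', '12.99']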
2.3 Pica-Pica Web Service Description Crawler
Pica pica is known as a bird species, which can also be called the magpie. Here, however, Pica-Pica is a Web Service Description Crawler which is designed to address the problem of the quality of Web Services, for example the evaluation of the descriptive quality of the Web Services on offer, and how well these Web Services are described in today's Web Service Registries.
2.3.1 Needed Libraries of the Pica-Pica Web Service Description Crawler
This version of the Pica-Pica Web Service Description Crawler was written by Anton Caceres and Josef Spillner and is programmed in the Python language. In order to run these scripts to parse the HTML pages, it needs two additional libraries: Beautiful Soup and html5lib.
• Beautiful Soup
It is an HTML/XML parser for the Python language, and it can even turn invalid markup into a parse tree [5]. Moreover, the following three features make it more powerful:
  - Bad markup does not choke Beautiful Soup. In fact, it generates a parse tree that makes approximately as much sense as the original document, so you can still obtain the data that you want.
  - Beautiful Soup provides a toolkit with simple, idiomatic methods for navigating, searching and modifying the parse tree. Hence you do not need to create a custom parser for every application.
  - Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8; if the document does not specify an encoding, all you have to do is specify the encoding of the original document.
Furthermore, the ways of including Beautiful Soup into an application are as follows [5]:
    from BeautifulSoup import BeautifulSoup           # for processing HTML
    from BeautifulSoup import BeautifulStoneSoup      # for processing XML
    import BeautifulSoup                              # to get everything
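For illustration, here is a minimal usage sketch (the HTML string is invented, and Beautiful Soup 3, the version matching the import style above, is assumed):

    from BeautifulSoup import BeautifulSoup

    # Even sloppy markup with unclosed tags is turned into a parse tree.
    html = "<html><body><a href='service.wsdl'>WSDL<p>broken"
    soup = BeautifulSoup(html)

    # Navigate and search the tree with idiomatic methods.
    for link in soup.findAll("a"):
        print(link["href"])   # prints: service.wsdl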
• html5lib
It is a Python package which implements the HTML5 [8] parsing algorithm. In order to gain maximum compatibility with the current major desktop Web browsers, this implementation is based on the WHATWG (Web Hypertext Application Technology Working Group) HTML5 specification.
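A minimal usage sketch, assuming only the documented html5lib entry point parse (the markup is invented):

    import html5lib

    # Parse exactly as a browser would; by default the result is an
    # xml.etree element tree.
    document = html5lib.parse("<p>unclosed <b>markup",
                              namespaceHTMLElements=False)
    for element in document.iter("b"):
        print(element.text)   # prints: markup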
2.3.2 Architecture of the Pica-Pica Web Service Description Crawler
Figure 2-4: Architecture of the Pica-Pica Web Service Description Crawler
Figure 2-4 illustrates the primary architecture of the Pica-Pica Web Service Description Crawler. It includes four fundamental components: the Service Page Grabber component, the WSDL Grabber component, the Property Grabber component and the WSML Register component.
(1) The Service Page Grabber component takes the URL seed as its input and outputs the links of the service pages to the following two components: the WSDL Grabber component and the Property Grabber component.
(2) The WSDL Grabber component is responsible for obtaining the WSDL document via the delivered service page link and then checking the validity of the obtained WSDL document. Only valid WSDL documents are passed on to the WSML Register component for further processing.
(3) The Property Grabber component tries to extract the service properties hosted in the service page, if any exist. All these service properties are then saved into an INI file as the information of that service.
(4) The functionality of the WSML Register component is to write appropriate WSML documents from the valid WSDL documents delivered by the WSDL Grabber component and the optional INI files delivered by the Property Grabber component, and afterwards to register them in ConQo.
• WSML [9]
WSML stands for Web Service Modeling Language. It provides a framework with different language variants and is used to describe the different aspects of semantic Web Services according to the conceptual model of WSMO.
• WSMO [10]
WSMO, whose full name is Web Service Modeling Ontology, is dedicated to describing various aspects of Semantic Web Services based on ontologies. An ontology is a formal, explicit specification of a shared conceptualization. These ontologies are the sticking points that establish the linkage between the agreement of the communities of users and the defined conceptual semantics of the real world.
• ConQo [11]
ConQo is a discovery framework that considers not only the Quality of Service (QoS) but also context information. It uses a Web Service repository to manage the service descriptions, which are based on WSML.
2.3.3 Implementation of the Pica-Pica Web Service Description Crawler
This section describes the implementation of the Pica-Pica Web Service Description Crawler in detail.
(1) To start the whole crawling process of the Pica-Pica Web Service Description Crawler, an input is needed as the initial seed. In this crawler there are five Web Service Registries, which are listed below. The URL addresses of these five Web Service Registries are used as the input seed for the Pica-Pica Web Service Description Crawler. Moreover, in this version of the Pica-Pica Web Service Description Crawler there is a single Python script for each Web Service Registry, and the crawling processes of these Web Service Registries' Python scripts are executed one after another.
Biocatalogue: http://www.biocatalogue.com
Ebi: http://www.ebi.ac.uk
Seekda: http://www.seekda.com
Service-Repository: http://www.service-repository.com
Xmethods: http://www.xmethods.net
(2) After being fed with the input seed, the crawler steps into the next component, the Service Page Grabber. At first this component tries to read the data from the Web based on the input seed. Then it establishes a parse tree of the read data by means of the functions of the Beautiful Soup library. After that, the Service Page Grabber component starts to look for the service page link of each service published in the Web Service Registry by means of the functions of the html5lib library. Whenever the service page link of a single service is found, the component first checks whether this service page link is valid. Once the service page link is valid, it is passed to the following two components for further processing: the WSDL Grabber component and the Property Grabber component.
(3) When the WSDL Grabber component receives a service page link from its previous component, it sets out to extract the WSDL link address of that service from the parse tree of the data in the service page. Next, this component starts to download the WSDL document of that service via the WSDL link address. Thereafter the obtained WSDL document is stored on disk. The process of the WSDL Grabber component is carried on continually until no more service links are passed to it. Certainly, not all grabbed WSDL documents are usable: they may contain bad definitions or a bad namespace URI, may be empty documents, or may not even be in XML format at all. Hence, in order to pick them out, this component further analyzes the involved WSDL documents. All valid documents are then put into a "validWSDLs" folder, whereas the invalid documents are put into a folder named "invalidWSDLs" in order to gather statistical information. Finally, only the WSDL documents in the "validWSDLs" folder are passed to the subsequent component.
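The exact validity rules of Pica-Pica are not spelled out here, but a check along the lines just described (non-empty, well-formed XML, a WSDL definitions root with a target namespace) could look like this minimal sketch:

    import xml.etree.ElementTree as ET

    WSDL_NS = "http://schemas.xmlsoap.org/wsdl/"

    def is_valid_wsdl(path):
        # Reject empty files, non-XML files, and XML files whose root
        # is not a WSDL <definitions> element with a targetNamespace.
        try:
            with open(path) as f:
                content = f.read()
            if not content.strip():
                return False          # empty document
            root = ET.fromstring(content)
        except (ET.ParseError, IOError):
            return False              # not XML at all, or unreadable
        return (root.tag == "{%s}definitions" % WSDL_NS
                and bool(root.get("targetNamespace")))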
(4) Moreover, since some Web Service Registries give additional information about the services, such as availability, service provider or version, the Property Grabber component sets out to extract this information as the service's properties and thereafter saves these properties into an INI file. If no additional information is available, there is no need to extract service properties, and thus there is no INI file for that service. However, in this implementation of the Pica-Pica Web Service Description Crawler, only the Python scripts for the Seekda and Service-Repository Web Service Registries have functions to extract the services' properties, while the other three Web Service Registries lack such functions.
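Writing the grabbed properties to an INI file can be sketched as follows (the section name, key names and values are examples, not the exact ones used by Pica-Pica):

    import configparser  # named ConfigParser in the Python 2 of the thesis era

    properties = {"provider": "Thomas Bayer",
                  "availability": "98%",
                  "version": "1.0"}

    config = configparser.ConfigParser()
    config["service"] = properties

    # One INI file per service, as in the Pica-Pica design.
    with open("BLZService.ini", "w") as ini_file:
        config.write(ini_file)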
(5) Furthermore, it is optional to create a report file which contains statistical information about this process, such as the total number of services of one Web Service Registry, the number of services whose WSDL document is invalid, etc.
(6) As stated above, at this point there is a folder with all valid WSDL documents, and there may also be some INI files. The task of the WSML Register component is therefore to generate the appropriate WSML documents from these valid WSDL documents and INI files, and then register them in ConQo.
2.4 Conclusions of the Existing Strategies
This chapter presented three aspects of the existing strategies: the Service-Finder project, the Information Extraction technique, and the Pica-Pica Web Service Description Crawler.
The task of this master program is to obtain the available Web services and their related information from the Web. This is essentially a procedure of extracting the needed information about each service, such as the service's WSDL document and its properties. Therefore, the Information Extraction technique, which is used to extract information hosted in the Web, can be applied in this master program.
Moreover, Service-Finder is a large project that is not only able to obtain the available Web services and their related information from the Web, but also enriches the crawled data with annotations by means of the Service-Finder ontology and the Service Category ontology, and integrates all the information into a coherent semantic model. Furthermore, it provides capabilities for searching and browsing the data through a user interface, and gives users service recommendations. However, for a master program the Service-Finder project far exceeds the requirements. Therefore, it is considered only as a reference for this master program.
Furthermore, since the Pica-Pica Web Service Description Crawler aims only at obtaining the available Web services and their related information, it fulfills the primary task of this master program. Nevertheless, regarding the information about a service, the Pica-Pica Web Service Description Crawler extracts only a few properties, and sometimes even no property at all. Consequently, in order to improve the quality of the extracted services, as many properties about each service as possible have to be extracted. Hence, chapter 3 presents an extension of this Pica-Pica Web Service Description Crawler.
3 Design and Implementation
In the previous chapter on the State of the Art, the already existing techniques and implementations were presented. In the following, the basic principle of the proposed approach, the Deep Web Services Crawler, is introduced. It is based on these previously existing techniques, especially the Pica-Pica Web Service Description Crawler.
3.1 Deep Web Services Crawler Requirements
This section discusses the goals of the Deep Web Service Crawler approach, the system requirements of the approach, and some non-functional requirements.
3.1.1 Basic Requirements for DWSC
The following are the basic requirements which should be achieved:
(1) Produce the largest annotated Service Catalogue. Here, the Service Catalogue is actually a list of the Web services published in a Web Service Registry. It contains not only the WSDL document of each service but also a file with the information about that service. Therefore, for the purpose of producing the largest annotated Service Catalogue, the proposed approach needs to extract as many properties about those Web services as possible. Moreover, it also has to download the WSDL document hosted along with each Web service. That is to say, these properties are not only the interesting structured properties, such as the service name and its WSDL link address, but also other non-functional properties, for example the endpoint and its monitoring information.
(2) Flexible storage. Each service has a Service Catalogue that contains all its interesting properties. How to deal with those service properties is a big question: which schemes should be used to store them? In order to store them in a flexible way, the proposed approach provides three methods for storage. The first stores them as an XML file, the second stores them in an INI file, and the third uses a database for the storage.
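The thesis implementation is written in Java (see section 3.1.2); purely as an illustration of the three storage variants, here is a compact Python sketch with invented property names:

    import configparser
    import sqlite3
    import xml.etree.ElementTree as ET

    service = {"name": "BLZService", "wsdl": "http://www.example.org/blz?wsdl"}

    # Variant 1: XML file, one element per property.
    root = ET.Element("service")
    for key, value in service.items():
        ET.SubElement(root, key).text = value
    ET.ElementTree(root).write("BLZService.xml")

    # Variant 2: INI file, one section holding all properties.
    ini = configparser.ConfigParser()
    ini["service"] = service
    with open("BLZService.ini", "w") as f:
        ini.write(f)

    # Variant 3: one record in a database table.
    db = sqlite3.connect("services.db")
    db.execute("CREATE TABLE IF NOT EXISTS services (name TEXT, wsdl TEXT)")
    db.execute("INSERT INTO services VALUES (?, ?)",
               (service["name"], service["wsdl"]))
    db.commit()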
3.1.2 System Requirements for DWSC
Generally speaking, the requirements for realizing a programming project comprise the following:
1) Operating system: Linux/Unix, Windows (XP, Vista, 2000, etc.)
2) Programming language: C++, Java, Python, C#, etc.
3) Programming tool: NetBeans, Eclipse, Visual Studio, and so on
In this master thesis, the scripts of the Deep Web Service Crawler approach are written in the Java programming language. These code scripts have been tested running on the Windows XP operating system and on the Linux operating system, but not on other operating systems.
3.1.3 Non-Functional Requirements for DWSC
In this part, several non-functional requirements for the Deep Web Service Crawler approach are presented:
1) Transparency: the process of data exploration and data storage should be done automatically, without the user's intervention. However, at first the user has to specify the path on the hard disk that will be used to store the outputs of this program.
2) Fault tolerance: during the execution of this program, errors can inevitably happen. Therefore, in order to keep the process from being interrupted, there must be the necessary error handling for process recovery.
3) Completeness: this approach should extract as many of the interesting properties of each Web service as possible, e.g. endpoint, monitoring information, etc.
In addition, since the Pica-Pica Web Service Crawler has already implemented strategies for the following five URLs, the proposed approach must support no fewer than these five Web Service Registries:
Biocatalogue: http://www.biocatalogue.com
Ebi: http://www.ebi.ac.uk
Seekda: http://www.seekda.com
Service-Repository: http://www.service-repository.com
Xmethods: http://www.xmethods.net
3.2 Deep Web Services Crawler Architecture
In this section, an overview of the high-level architecture of the Deep Web Services Crawler approach is introduced first. Thereafter, four subsections outline each single component and show how they play together.
The current components and the flows of data in the Deep Web Service Crawler can be summarized as depicted in Figure 3-1, using the continuous arrows. The crawler first tries to obtain the available service list page links and related service page links by crawling the Web (Web Service Extractor). Then those gathered links are processed in two separate steps: one gets the service's WSDL document (WSDL Grabber), and the other collects the properties of each service (Property Grabber). Finally, all these data are stored on the storage device (Storage). The whole process in figure 3-1 is illustrated in detail as follows:
• Step 1:
When the Deep Web Service Crawler starts to run, the File Chooser container requires the user to specify a path on the computer or on any other hard disk. The reason for specifying the path is that the Deep Web Service Crawler program needs a place to store all its outputs.
• Step 2:
After that, the Web Service Extractor is triggered. It is the main entry to the specific crawling process. Since the Deep Web Service Crawler program is a procedure that is supposed to crawl for Web Services in given Web Service Registries, the URL addresses of these Web Service Registries should be given as the initial seed for the Web Service Extractor process. However, since the page structures of these Web Service Registries are completely different, there is a dedicated process for each Web Service Registry.
Figure 3-1: Overview of the Basic Architecture for the Deep Web Services Crawler
• Step 3:
According to the given seed, two types of links are obtained by the Web Service Extractor component: the service list page link and the service page link. A service list page is a page that contains a list of Web Services, possibly together with some information about these Web Services, while a service page is a page that contains much more information about a single service. Finally, these two types of links are forwarded to the next two components: the Property Grabber and the WSDL Grabber.
• Step 4:
On the one hand, the Property Grabber component tries to gather the information about the service hosted in the service list page and the service page, such as the name of the service, its description, its ranking, etc. Finally, all the information about the service is collected together as the service properties, which are then delivered to the Storage component for further processing.
• Step 5:
On the other hand, the WSDL Grabber tries to obtain the WSDL link from the service list page or the service page. That is because for some Web Service Registries the WSDL link is hosted in the service list page, as in Biocatalogue, while for other Web Service Registries it is hosted in the service page, as in Xmethods. After the WSDL link has been obtained, it is also transmitted to the Storage component for further processing.
Step 6:
When the service properties and the WSDL link of a service are received by the Storage component, it stores them on disk. The service properties are stored in three different ways: as an XML file, as an INI file, and as one record in a database table. For the WSDL link, the Storage component first tries to download the page content from the URL address of the WSDL link. If this succeeds, the page content of the service is stored on disk as a WSDL document.
Step 7:
Steps 3 to 6 describe the crawling process for a single service. Hence, if there is more than one service, or more than one service list page, in a Web Service Registry, the crawling process from step 3 to step 6 is repeated again and again until no service or service list page remains in that Web Service Registry.
Step 8:
Furthermore, after the crawling process for one Web Service Registry finishes, a file is generated that contains statistical information about this crawling process, for example: the times when the crawling process of this Web Service Registry started and finished, the total number of Web services in this Web Service Registry, how many services have an empty WSDL document, the average number of service properties in this Web Service Registry, and the average time cost for extracting the service properties, getting the WSDL document, and generating the XML file, INI file, etc.
3.2.1 The Function of the Web Service Extractor Component
The Web Service Extractor component is responsible for gathering service list page links and service page links. It pursues a focused crawl of the Web and only forwards service list page and service page links to the subsequent components for analysis and collection purposes. It therefore identifies both service list page links and related service page links on these Web Service Registries.
As can be seen in Figure 3-2, a crawl for Web Services needs to start from a URL seed. The seed is almost as important as the Web Service Extractor itself, as it highly influences the part of the Web that needs to be crawled. The seed can or shall contain, e.g., Web pages where these Web Services are published or that talk about Web Services.
Figure 3-2: Overview of the process flow of the Web Service Extractor Component
After being fed with the URL seed, the Web Service Extractor component starts to get the link of the service list page from the initial page of this URL seed. However, this process differs between the five Web Service Registries, as the following overview shows:
- Service-Repository Web Service Registry:
In this Web Service Registry, the link of the first service list page is the URL address of its seed, which means some Web Services can already be found on the home page of the Service-Repository Web Service Registry. The word "first" here implies that there is more than one service list page link in this Web Service Registry. Therefore, the process of getting service list page links in this registry is carried on until no more service list page links exist.
- Xmethods Web Service Registry:
Although there are Web Services on the home page of the Xmethods Web Service Registry, they are only a small subset of the services in this Web Service Registry. Moreover, in the Xmethods Web Service Registry there is only one page containing all Web Services. Therefore, the extractor has to get the service list page link of that page.
- Ebi Web Service Registry:
The situation in the Ebi Web Service Registry is a little bit like that in the Xmethods Web Service Registry. That is to say, there is also one page that contains all Web Services of this Web Service Registry. However, this page is not the initial page of the input seed. Therefore, more than one operation step is needed to get the service list page link of that page.
- Seekda Web Service Registry:
In the Seekda Web Service Registry, the Web Services are not contained on the initial page of the input seed. The service list page link can be obtained after several additional operation steps. However, there is a problem with getting the service list page links in this registry: if there is more than one page containing Web Services, then, for some unknown reason, the links of the remaining service list pages cannot be obtained. In other words, only the link of the first service list page can be obtained.
- Biocatalogue Web Service Registry:
The process of getting service list pages in the Biocatalogue Web Service Registry is almost the same as in the Seekda Web Service Registry. The only difference is that all service list page links in the Biocatalogue Web Service Registry can be obtained if there is more than one service list page.
Then, after getting the link of a service list page, the Web Service Extractor begins to get the link of the service page for each service listed on that service list page. This is possible because there is an internal link for every service which leads to the service page. It is worth noting that once a service page link is obtained, this component immediately forwards the service page link and the service list page link to the subsequent two components for further processing. The process of obtaining service page links is carried out continuously until all services listed on that service list page are crawled. Analogously, the process of getting service list pages is carried out continuously until no more service list pages exist.
3.2.1.1 Features of the Web Service Extractor Component
The main features are described in the following paragraphs.
1) Obtain service list page links
A central task is to obtain the corresponding service list page links. A service list page link is a URL address that leads to a public list of Web Services together with some simple information about these Web services, such as the name of a service, an internal URL that links to another page containing the detailed information about that service, and sometimes the link address of the WSDL document.
2) Obtain service page links
Once service list page links are found, a crucial aspect is to extract the internal link of each service to aid the task of service information discovery. It is therefore the task of the Web Service Extractor to harvest the HTML page content of the service list page, so that the service page links, which lead to much more detailed information about the single Web service, can be obtained.
3.2.1.2 Input of the Web Service Extractor Component
This component depends on specific input seeds. The only input required for this component is a URL seed; this URL seed is one of the URLs listed in section 3.1.3.
3.2.1.3 Output of the Web Service Extractor Component
The component produces two kinds of service-related page links from the Web:
- Service list page links
- Service page links
3.2.1.4 Demonstration of the Web Service Extractor Component
In order to give a comprehensive understanding of the process of the Web Service Extractor component, the following figures serve as an explanation. Although there are five URL addresses in this section, only the URL address of the Service-Repository is shown as an example.
1) The input seed is the initial URL address of the Service-Repository, which is "http://www.service-repository.com".
2) As already said in section 3.2.1, the first service list page link of this Web Service Registry is its input seed, "http://www.service-repository.com". Figure 3-3 shows the corresponding service list page of that link.
Figure 3-3: Service list page of the Service-Repository
Figure 3-4: Original source code of the internal link of the Web service "BLZService"
Figure 3-5: Code overview of getting the service page link in the Service-Repository
Figure 3-6: Service page of the Web service "BLZService"
3) Now that the service list page link is known, the next step is to acquire the service page links of the services listed on the service list page. The text in the red box of Figure 3-4 shows the internal link of the Web service "BLZService". However, this is not the complete link of the service page: it has to be prefixed with the initial URL address of the Service-Repository Web Service Registry, "http://www.service-repository.com". The code for getting the service page link of a Web service in the Service-Repository is shown in Figure 3-5. The final link of this service page is therefore "http://www.service-repository.com/service/overview/-210897616", and Figure 3-6 is the corresponding service page of that link.
4) Afterwards, the two links gathered by the Web Service Extractor component, the service list page link and the service page link, are immediately forwarded to the next two components, the WSDL Grabber component and the Property Grabber component. A small sketch of the prefixing step from 3) follows below.
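Since the extraction code itself is only shown as a figure, the prefixing step can be sketched as follows. This is a minimal sketch using java.net.URI; the class and variable names are illustrative assumptions and are not taken from the original program.

```java
import java.net.URI;

public class ServicePageLinkResolver {
    // Resolve an internal (relative) service link against the registry's base URL,
    // as described for the Service-Repository in step 3 above.
    public static String resolve(String baseUrl, String internalLink) {
        URI base = URI.create(baseUrl);   // e.g. "http://www.service-repository.com"
        return base.resolve(internalLink).toString();
    }

    public static void main(String[] args) {
        // Hypothetical relative link copied from a service list page.
        System.out.println(resolve("http://www.service-repository.com",
                                   "/service/overview/-210897616"));
    }
}
```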
3.2.2 The Function of the WSDL Grabber Component
The WSDL Grabber component acquires the WSDL link hosted on the Web, based on the service list page link or the service page link. The whole process flow is illustrated in Figure 3-7.
Figure 3-7: Overview of the process flow of the WSDL Grabber Component
When the WSDL Grabber component receives the inputs delivered by the previous component, it starts to get the WSDL link of the service based on these inputs. Although the inputs of the WSDL Grabber component are the links of both the service page and the service list page, only one of them contains the WSDL link of the corresponding service; that is to say, the WSDL link exists either on the service page or on the service list page. The reason why both links need to be delivered to this component is that exactly one of the five Web Service Registries, namely Biocatalogue, hosts the WSDL link on the service list page, while for the other four Web Service Registries the WSDL link is hosted on the service page. Therefore, the WSDL links of these four Web Service Registries are obtained via the service page link, and for the Biocatalogue Web Service Registry the WSDL link is obtained via the service list page link. However, there is a problem with getting the WSDL links of the Biocatalogue Web Service Registry: in brief, some of the Web services listed on the service list pages of the Biocatalogue Web Service Registry have no WSDL link, in other words, these services have no WSDL document. In such a situation the WSDL link of these Web services is assigned the value "NULL". For the Web Services in the other four Web Service Registries, however, the WSDL link always exists on the service page. Eventually, whenever the WSDL Grabber component extracts the WSDL link of a single Web service, the link is immediately forwarded to the Storage component for downloading the WSDL document.
3.2.2.1 Features of the WSDL Grabber Component
The WSDL Grabber component has to provide the following feature:
- Obtain WSDL links
The WSDL link is the direct way to the page that contains the contents of the WSDL document. It is actually a URL address, but at the end of this URL address there is something like "wsdl" or "WSDL" to indicate that it leads to the page of a WSDL document.
3.2.2.2 Input of the WSDL Grabber Component
This component requires the following input data:
- Service list page link
- Service page link
3.2.2.3 Output of the WSDL Grabber Component
The component produces only the following output data:
- The URL address of the WSDL link of each service
3.2.2.4 Demonstration of the WSDL Grabber Component
This section presents a list of figures in order to give a comprehensive understanding of the process of the WSDL Grabber component. It uses the same Web service of the Service-Repository as an example.
1) The input of this WSDL Grabber component is the service page link obtained from the Web Service Extractor component. The address of this link is "http://www.service-repository.com/service/overview/-210897616".
Figure 3-8: WSDL link of the Web service "BLZService" on the service page
2) Figure 3-9 shows the corresponding original source code of the WSDL link highlighted in Figure 3-8.
Figure 3-9: Original source code of the WSDL link of the Web service "BLZService"
3) Figure 3-10 and Figure 3-11 show the code used to extract the WSDL link shown in Figure 3-9. Note that Figure 3-10 is the particular code for the Service-Repository Web Service Registry only; for the other four Web Service Registries it is different. The function "getServiceRepositoryWSDLLink" first gets a list of nodes that have the same HTML tag name, "b". Then it checks all of these nodes one by one to see whether the text value of a node is "WSDL". If a node fulfills this condition, the attribute value of its sibling, an "a" element in this case, is extracted as the value of the WSDL link of this Web service. A sketch of this extraction logic follows the list below.
Figure 3-10: Code overview of the "getServiceRepositoryWSDLLink" function
Figure 3-11: Code overview of the "oneParameter" function
4) Finally, after applying the functions shown in Figures 3-10 and 3-11, the WSDL link of the Web service "BLZService" can be obtained. Its value is "http://services.unitedplanet.de/blz/BlzService.asmx?WSDL".
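As the original functions are only reproduced as screenshots, the following is a minimal sketch of the logic just described. The use of the jsoup HTML parser is an assumption made for illustration; the thesis does not name its HTML library in this section.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ServiceRepositoryWsdlExtractor {
    // Sketch of the logic described for "getServiceRepositoryWSDLLink":
    // scan all <b> nodes, and when one carries the text "WSDL",
    // take the href of the neighbouring <a> element as the WSDL link.
    public static String getWsdlLink(String servicePageHtml) {
        Document doc = Jsoup.parse(servicePageHtml);
        for (Element b : doc.select("b")) {
            if ("WSDL".equals(b.text().trim())) {
                Element sibling = b.nextElementSibling();
                if (sibling != null && "a".equals(sibling.tagName())) {
                    return sibling.attr("href");
                }
            }
        }
        return null; // no WSDL link found on this service page
    }
}
```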
3.2.3 The Function of the Property Grabber Component
The Property Grabber component is a module used to extract and gather the Web service information hosted on the Web, which is in fact the information shown on the service list page and the service page. In the end, all the obtained Web service information is collected as the service properties, which are delivered to the Storage component for storing. The detailed process flow of the Property Grabber component is illustrated in Figure 3-12.
The inputs of the Property Grabber component are the same as the inputs of the WSDL Grabber component. However, there is still a small difference between them with respect to the seed. As already mentioned in section 3.2.2, for the WSDL Grabber component only one of the inputs is sufficient to get the WSDL link, whereas the Property Grabber component needs both of them as seeds. Once the Property Grabber component receives the needed inputs, it starts to extract the information of that single Web service.
Figure 3-12: Overview of the process flow of the Property Grabber Component
After the Property Grabber component receives the inputs, it starts to extract the service information of the Web service. Generally speaking, the service information consists of four aspects: structured information, endpoint information, monitoring information, and whois information.
(1) Structured Information
The structured information can be obtained by extracting the information hosted on the service page and the service list page. It is the basic descriptive information about the service, such as the name of the service, the URL address through which the WSDL document can be obtained, the description introducing the service, the provider of the service, its rating, and the server that owns the service, etc. However, the elements constituting this structured information differ among the Web Service Registries. For example, the rating information of a Web service exists in the Service-Repository Web Service Registry, while the Xmethods Web Service Registry does not have this information. In addition, even for Web services in the same Web Service Registry, some elements of the structured information may not exist. For instance, one service in a Web Service Registry may have a description, while another service in the same Web Service Registry does not. Tables 3-1 to 3-5 show the structured information that should be extracted for each of the five Web Service Registries. Moreover, if the style of a service in the Biocatalogue Web Service Registry is SOAP, there is some additional information describing the SOAP operations of this service, while if it is REST, this additional information describes the REST operations. This should also be considered part of the structured information. Table 3-6 and Table 3-7 list the information for these two kinds of operations.
Service Name | WSDL Link | WSDL Version
Provider | Server | Rating
Homepage | Owner Homepage | Description
Table 3-1: Structured Information of the Service-Repository Web Service Registry
Service Name | WSDL Link | Provider
Service Style | Homepage | Implementation Language
Description | User Description | Contributed Client Name
Type of this Client | Publisher for this Client | Used Toolkit of this Client
Used Language of this Client | Used Operation System of this Client
Table 3-2: Structured Information of the Xmethods Web Service Registry
Service Name | WSDL Link | Server
Provider | Provider's Country | Service Style
Rating | Description | User Description
Service Tags | Documentation (within WSDL)
Table 3-3: Structured Information of the Seekda Web Service Registry
Service Name | WSDL Link | Port Name
Service URL Address | Implementation Class
Table 3-4: Structured Information of the Ebi Web Service Registry
Service Name | WSDL Link | Style
Provider | Provider's Country | View Times
Favorite Times | Submitter | Service Tags
Total Annotation | Provider Annotation | Member Annotation
Registry Annotation | Base URL | SOAP Lab Server Base URL
Description | User Description | Category
Table 3-5: Structured Information of the Biocatalogue Web Service Registry
SOAP Operation Name | Inputs and Outputs | Operation Description
Operation Tags | Part of Which Service
Table 3-6: SOAP Operation Information of the Biocatalogue Web Service Registry
REST Operation Name | Service Tags | Used Template
Operation Description | Part of Which Service | Part of Which Endpoint Group
Table 3-7: REST Operation Information of the Biocatalogue Web Service Registry
(2) Endpoint Information
Endpoint information is the port information of the Web service, which can be extracted only from the service page. Since different Web Service Registries structure the endpoint information of a Web service differently, some elements of the endpoint information can be very diverse. One thing deserves attention: the Ebi Web Service Registry has no endpoint information at all for the Web services published in this registry. Moreover, although the Web services in the same Web Service Registry share the same structure of the endpoint information, some elements of the endpoint information may be missing or empty, and the Web Service Registries may even have no endpoint information for some of the Web services published by them. Nevertheless, whenever there is endpoint information for a Web service, there is at least one element, namely the URL address of the endpoint. Table 3-8 shows the endpoint information that should be extracted for the five Web Service Registries.
Web Service Registry Name | Elements of the Endpoint Information
Service Repository | Endpoint Name, Endpoint URL, Endpoint Critical, Endpoint Type, Bound Endpoint
Xmethods | Endpoint URL, Publisher of this Endpoint, Contact Email of this Publisher, Implementation Language of this Endpoint
Seekda | Endpoint URL
Biocatalogue | Endpoint Name, Endpoint URL
Table 3-8: Endpoint Information of the five Web Service Registries
Web Service Registry Name | Elements of the Monitoring Information
Service Repository | Service Availability, Number of Downs, Total Uptime, Total Downtime, MTTR, MTBF, RTT Max of Endpoint, RTT Min of Endpoint, RTT Average of Endpoint, Ping Count of Endpoint
Seekda | Service Availability, Begin Time of Monitoring
Biocatalogue | Monitored Status of Endpoint, Overall Status of Service
Table 3-9: Monitoring Information of the five Web Service Registries
(3) Monitoring Information
Monitoring information is measured statistical information about the Web service. It is worth noting that the Ebi and Xmethods Web Service Registries have no monitoring information for any of the Web services published by them, while in the other three Web Service Registries only a few Web services may lack this information. Table 3-9 displays the monitoring information of these three Web Service Registries.
(4) Whois Information
Whois information is not extracted from the information hosted on the service page and the service list page. It is the descriptive information about the service domain, which can be gained by means of the address of the WSDL link. Because of that, the process of getting the Whois information starts by gaining the service domain first. The final value of the service domain must not contain strings like "http", "https", "www", etc.; it must be the part directly under the top-level domain. After that, the service domain database is queried by sending the value of the service domain to a Whois client, which is simply a Web site on the Internet, for example "http://www.whois365.com/cn/domain". If information about that service domain exists, a list of its information is returned as the output. However, the structure of the returned information differs with respect to the service domain. The most challenging aspect is therefore that the extraction process has to deal with each different form of the returned information. Table 3-10 lists the Whois information that needs to be extracted for all five Web Service Registries; a sketch of the domain-derivation step follows the table.
Service Domain URL | Domain Name | Domain Type
Domain Address | Domain Description | State
Postal Code | City | Country
Country Code | Phone | Fax
Email | Organization | Established Time
Table 3-10: Whois Information for the five Web Service Registries
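The derivation of the service domain described above can be sketched as follows. The two-label heuristic is an illustrative assumption; the thesis does not spell out its exact rule, and this heuristic fails for multi-part top-level domains such as "co.uk".

```java
import java.net.URI;

public class ServiceDomainExtractor {
    // Derive the service domain from a WSDL link as described above:
    // strip the protocol and any "www." prefix, then keep only the
    // part directly under the top-level domain.
    // NOTE: the two-label heuristic is an illustrative assumption.
    public static String serviceDomain(String wsdlLink) {
        String host = URI.create(wsdlLink).getHost();   // e.g. "www.thomas-bayer.com"
        if (host.startsWith("www.")) {
            host = host.substring(4);
        }
        String[] labels = host.split("\\.");
        int n = labels.length;
        return n <= 2 ? host : labels[n - 2] + "." + labels[n - 1];
    }

    public static void main(String[] args) {
        // Yields "thomas-bayer.com", matching the example in section 3.2.3.4.
        System.out.println(serviceDomain(
            "http://www.thomas-bayer.com/axis2/services/BLZService?wsdl"));
    }
}
```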
Finally, all the information of these four aspects is collected together and then delivered to the Storage component for further storage processing.
3.2.3.1 Features of the Property Grabber Component
The Property Grabber component has to provide the following features:
- Obtain basic information
Generally speaking, the more information a Web service has, the better one can judge how good this Web service is. Hence, this Property Grabber component needs to extract all the basic information hosted on the service list page and the service page. This basic information comprises the structured information, the endpoint information, and the monitoring information.
- Obtain Whois information
Since more information about a Web service also gives a better picture of the quality of that Web service, it is necessary to extract as much information about the Web service as possible. Therefore, besides the basic information, the Property Grabber component can also obtain additional information called Whois information, such as the type of the domain, the name of the person in charge, the postal code of the domain, city, phone, fax, detailed address, email, etc.
3.2.3.2 Input of the Property Grabber Component
This component requires the following input data:
- Service list page link
- Service page link
3.2.3.3 Output of the Property Grabber Component
The component produces the following output data:
- Structured information of each service
- Endpoint information of each service, if it exists
- Monitoring information of the service and its endpoint, if it exists
- Whois information of the service domain
All this information is collected together as the properties of each service. Thereafter, the collected properties are sent to the Storage component.
3.2.3.4 Demonstration of the Property Grabber Component
Figures 3-13 to 3-16 depict the fundamental procedure of the Property Grabber component. To simplify the explanation, the example shown in this section uses the same Web service as before.
1) The inputs of the Property Grabber component are the links of the service list page and the service page received from the Web Service Extractor component. These links are "http://www.service-repository.com" and "http://www.service-repository.com/service/overview/-210897616".
2) When the "GetProperty" function is triggered, it calls its four sub functions shown in Figure 3-12.
Figure 3-13: Structured properties of the service "BLZService" on the service list page
Figure 3-14: Structured properties of the service "BLZService" on the service page
3) First, the "getStructuredProperty" function tries to extract the structured information displayed in the red boxes of Figure 3-13 and Figure 3-14. However, several elements of the structured information have the same content, like the descriptions shown on the service page and the service list page. Hence, in order to save extraction time and storage space, elements with the same content are extracted only once. Moreover, the rating information requires a transformation from non-descriptive to descriptive text, because its content consists of several star images. The final results of the extracted structured information of this Web service are shown in Table 3-11. Because there is no descriptive information for the Provider, Homepage, and Owner Homepage, their values are assigned as "NULL".
Service Name | BLZService
WSDL Link | http://www.thomas-bayer.com/axis2/services/BLZService?wsdl
WSDL Version | 0
Server | Apache-Coyote/1.1
Description | BLZService
Rating | Four stars and A Half
Provider | NULL
Homepage | NULL
Owner Homepage | NULL
Table 3-11: Extracted Structured Information of the Web Service "BLZService"
4) Secondly, the "getEndpointProperty" function extracts the endpoint information displayed in the red box of Figure 3-15. Although there is more than one endpoint record in the list, only one of them is extracted as the endpoint information. The reason is that this master program is intended to extract as much information as possible, but this information should not contain redundant entries. Therefore, only one record is extracted as the endpoint information, even if there is more than one endpoint record. Table 3-12 shows the final results of the endpoint information of this Web service.
Figure 3-15: Endpoint information of the Web service "BLZService" on the service page
Endpoint Name | BLZServiceSOAP12port_http
Endpoint URL | http://www.thomas-bayer.com:80/axis2/services/BLZService
Endpoint Critical | True
Endpoint Type | production
Bound Endpoint | BLZServiceSOAP12Binding
Table 3-12: Extracted Endpoint Information of the Web service "BLZService"
5) Then it is time to extract the monitoring information by invoking the function "getMonitoringProperty". Figure 3-16 displays two kinds of monitoring properties: the red box above contains the monitoring information about the Web service, and the red box below lists the monitoring information of its endpoints. As already mentioned, only one endpoint statistic record is extracted. Besides, as can be seen in Figure 3-16, there are two kinds of availability. Actually, both represent the availability of this Web service, just like the availability shown in Figure 3-14; therefore, one of these availability values is sufficient. Table 3-13 shows the final results of this extraction process.
Figure 3-16: Monitoring information of the service "BLZService" on the service page
Service Availability | 100%
Number of Downs | 0
Total Uptime | 1 day, 19 hours, 19 minutes
Total Downtime | 0 seconds
MTBF | 1 day, 19 hours, 19 minutes
MTTR | 0 seconds
RTT Max of Endpoint | 141 ms
RTT Min of Endpoint | 0 ms
RTT Average of Endpoint | 577 ms
Ping Count of Endpoint | 112
Table 3-13: Extracted Monitoring Information of the Web service "BLZService"
6) Next, in order to extract the Whois properties, the "getWhoisProperty" function first has to gain the service domain from the WSDL link. For this Web service, the gained service domain is "thomas-bayer.com". Then it sends this service domain as input to the Whois client for the querying process, which returns a list of information for that service domain, see Figure 3-17. Table 3-14 shows the extracted Whois information.
Figure 3-17: Whois information of the service domain "thomas-bayer.com"
Service Domain URL | thomas-bayer.com
Domain Name | Thomas Bayer
Domain Type | NULL
Domain Address | Moltkestr. 40
Domain Description | NULL
State | NULL
Postal Code | 54173
City | Bonn
Country | NULL
Country Code | DE
Phone | +4922855525760
Fax | NULL
Email | info@predic8.de
Organization | predic8 GmbH
Established Time | NULL
Table 3-14: Extracted Whois Information of the service domain "thomas-bayer.com"
7) Finally, all the information of these four aspects is collected together as the service properties, and these service properties are then forwarded to the Storage component.
3.2.4 The Function of the Storage Component
The WSDL link from the WSDL Grabber component is used by the Storage component to download the WSDL document from the Web and store it on disk thereafter. In addition, the service properties from the Property Grabber component are also directly stored on disk in three different manners by this Storage component. Figure 3-18 illustrates the process flow of the Storage component.
When the Storage component receives the needed inputs, the mediator function "Storager" of the Storage component is triggered. It then transforms the service properties into three different output formats and stores them on disk: an XML file, an INI file, and database records. Besides, it also tries to download the WSDL document from the URL address of the WSDL link and then stores the obtained WSDL document on disk, too. This "Storager" function is composed of four sub functions, namely the "getWSDL", "generateXML", "generateDatabase", and "generateINI" sub functions. Each sub function is in charge of one aspect of the storage tasks.
Figure 3-18: Overview of the process flow of the Storage Component
(1) "getWSDL" sub function
The task of the "getWSDL" sub function is to download the WSDL document and then store it on disk. Above all, it has to get the content of the WSDL document. This procedure works as follows. First, the "getWSDL" sub function checks whether the value of the received WSDL link equals "NULL" or not. As already presented in section 3.2.2, if a Web service does not have a WSDL link, the value of its WSDL link is assigned "NULL". In such a case, the sub function creates a WSDL document whose name is the service name appended with the mark "[No WSDL Document]"; obviously, this document does not contain any content, it is an empty document. If the service does have a WSDL link, this sub function tries to connect to the Internet via the URL address of the WSDL link. Once it succeeds, the contents hosted on the Web are downloaded, stored on disk, and named only with the name of the service. Otherwise, it creates a WSDL document whose name is prefixed with "Bad" before the service name. A minimal sketch of this behaviour follows.
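This is a sketch of the behaviour just described, not the original code from Figure 3-19; the class name and the use of java.net.URL with java.nio.file are illustrative assumptions, and the increasing "SecurityInt" naming prefix used by the real program is omitted for brevity.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class WsdlDownloader {
    // Behaviour described for the "getWSDL" sub function:
    // - WSDL link "NULL"  -> empty "<name>[No WSDL Document].wsdl" file
    // - download succeeds -> "<name>.wsdl" with the downloaded content
    // - download fails    -> empty "Bad<name>.wsdl" file
    public static void getWsdl(String name, String linkStr, String path) throws IOException {
        Path dir = Paths.get(path);
        if ("NULL".equals(linkStr)) {
            Files.write(dir.resolve(name + "[No WSDL Document].wsdl"), new byte[0]);
            return;
        }
        try (InputStream in = new URL(linkStr).openStream()) {
            Files.copy(in, dir.resolve(name + ".wsdl"));
        } catch (IOException e) {
            Files.write(dir.resolve("Bad" + name + ".wsdl"), new byte[0]);
        }
    }
}
```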
(2) "generateXML" sub function
The "generateXML" sub function takes the service properties as input, transforms them into an XML file, and stores it on disk under the name of the service plus the extension ".xml". XML stands for eXtensible Markup Language, a markup language designed to transport and store data. The first line of an XML file is the XML declaration, which defines the XML version and the encoding used. For example, "<?xml version="1.0" encoding="UTF-8"?>" means that this is an XML file whose version is 1.0 and whose encoding is the 8-bit Unicode Transformation Format character set. Besides, an XML file contains XML elements, each reaching from the element's start tag to the element's end tag. An XML element can also contain other elements, simple text, or a mixture of both. However, an XML file must contain a root element as the parent of all other elements. A sketch of such a generation step follows.
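A minimal sketch of such an XML generation step, using the standard javax.xml DOM and transformer APIs, might look as follows; the property map and the normalization of property names into element names are illustrative assumptions, not the original code.

```java
import java.io.File;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class XmlGenerator {
    // Write the name/value property pairs under a "service" root element,
    // mirroring the XML output format described above.
    public static void generateXml(Map<String, String> properties, File outFile) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element root = doc.createElement("service");   // root element of the file
        doc.appendChild(root);
        for (Map.Entry<String, String> p : properties.entrySet()) {
            // Property names are collapsed to valid element names here
            // (an illustrative simplification).
            Element e = doc.createElement(p.getKey().replaceAll("\\s+", ""));
            e.setTextContent(p.getValue());
            root.appendChild(e);
        }
        // The transformer emits the XML declaration (<?xml version="1.0" ...?>) automatically.
        TransformerFactory.newInstance().newTransformer()
                .transform(new DOMSource(doc), new StreamResult(outFile));
    }
}
```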
(3) "generateINI" sub function
The "generateINI" sub function also takes the service's properties as input, but it transforms them into an INI file and stores it on disk under the name of the service plus the extension ".ini". "ini" stands for initialization. The INI file format is a de facto standard for configuration files; INI files are just simple text files with a basic structure. Generally speaking, an INI file contains three different kinds of parts: sections, parameters, and comments. A parameter is the basic element contained in an INI file. Its format is a key-value pair, also called a name-value pair. This pair is delimited by an equals sign "=", and the key or name always appears to the left of the equals sign. A section is more like a room that groups all its parameters together. It always appears on a single line within a pair of square brackets "[]", and sections may not be nested. A comment is descriptive text that begins with a semicolon ";"; anything between the semicolon and the end of the line is ignored. A sketch of this output format follows.
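A minimal sketch of the INI output in the structure just described; the comment lines and the section name "[service]" are illustrative assumptions rather than the original code from Figure 3-21.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.util.Map;

public class IniGenerator {
    // Write the properties in the INI structure described above:
    // leading comments (";"), one section line, then key=value parameters.
    public static void generateIni(Map<String, String> properties, String filePath) throws IOException {
        try (PrintWriter out = new PrintWriter(filePath)) {
            out.println("; Generated by the Deep Web Service Crawler");
            out.println("; One section per Web service");
            out.println("[service]");
            for (Map.Entry<String, String> p : properties.entrySet()) {
                out.println(p.getKey() + "=" + p.getValue());
            }
        }
    }
}
```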
(4) "generateDatabase" sub function
The inputs of the "generateDatabase" sub function are the same as those of the previous two sub functions. Instead of transforming them into a file like XML or INI, the "generateDatabase" sub function turns them into data of the database by using SQL statements. SQL stands for Structured Query Language, a database language designed for accessing and manipulating data in a database. Most actions performed on a database are done with SQL statements, and the primary SQL statements include insert into, delete, update, select, create, alter, and drop. For the purpose of transforming the service properties into database data, this sub function first has to create a database, using the "create database" statement. Then it has to create a table to store the data. A table is a collection of related data entries; it consists of columns and rows. Since the data of these five Web Service Registries are not very large, one database table is enough for storing the service properties. Because of that, the field names of the service properties in the columns must be uniform and well-defined for all five Web Service Registries. Afterwards, the service properties of each single service can be put into the table as a record with the "insert into" statement of SQL, as the sketch below shows.
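A minimal JDBC sketch of this idea follows; the JDBC URL, the table name "services", and the "CREATE TABLE IF NOT EXISTS" dialect (e.g. MySQL or SQLite) are illustrative assumptions, as the thesis does not name the database system in this section.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Statement;

public class DatabaseGenerator {
    // Create one table whose columns are the unified property names (all TEXT,
    // since the property lengths are hard to predict, as noted above) and
    // insert one record per service.
    public static void storeService(String jdbcUrl, String[] columns, String[] values)
            throws SQLException {
        try (Connection con = DriverManager.getConnection(jdbcUrl)) {
            StringBuilder cols = new StringBuilder();
            StringBuilder marks = new StringBuilder();
            for (String c : columns) {
                if (cols.length() > 0) { cols.append(", "); marks.append(", "); }
                cols.append(c).append(" TEXT");
                marks.append("?");
            }
            try (Statement st = con.createStatement()) {
                st.execute("CREATE TABLE IF NOT EXISTS services (" + cols + ")");
            }
            String insert = "INSERT INTO services (" + String.join(", ", columns)
                    + ") VALUES (" + marks + ")";
            try (PreparedStatement ps = con.prepareStatement(insert)) {
                for (int i = 0; i < values.length; i++) {
                    ps.setString(i + 1, values[i]);
                }
                ps.executeUpdate();
            }
        }
    }
}
```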
3.2.4.1 Features of the Storage Component
The Storage component has to provide the following features:
- Generate different output formats
The final result of this master program is to store the information about the services on disk for future work. This Storage component provides three different formats for storing the service properties of the services published in the Web Service Registries. This makes the storage of the services very flexible and also long-lived.
- Obtain the WSDL document
The important task of this component is to check whether the WSDL document can be obtained from the WSDL link, because the WSDL document plays a decisive role in determining the quality of the service. This Storage component provides the ability to deal with the different situations in the process of obtaining the WSDL document.
3.2.4.2 Input of the Storage Component
This component requires the following input data:
- WSDL link of each service
- Property information of each service
3.2.4.3 Output of the Storage Component
The component produces the following output data:
- WSDL document of the service
- XML document, INI file, and tables in the database
3.2.4.4 Demonstration of the Storage Component
The following figures describe the fundamental implementation code of this Storage component. The detailed depiction is as follows:
1) As can be seen from Figure 3-19 to Figure 3-21, the implementation code has several things in common. The first common aspect is the parameters defined in each of these sub functions, namely "path" and "SecurityInt". The parameter "path" is the absolute path on the computer disk. It is used in the storing procedure and has already been specified by the user at the beginning of the whole program. The parameter "SecurityInt" is an increasing Integer that is used as part of the name of the service; this prevents services that have the same name from overriding each other on disk. The content marked in red in the code of these figures is the second common aspect: its function is to create a file or document for the service with the corresponding parameters.
2) Figure 3-19 displays the implementation code of the "getWSDL" sub function, which is designed to get the WSDL document of the service based on its WSDL link. The parameter "name" is the name of the service, which is used as the name of the WSDL document. The parameter "linkStr" is the actual WSDL link, which is the most important parameter for this sub function. The other two parameters, "statistic" and "log", are the objects of the text files called "Statistic Information" and "Log Information" respectively. The "Statistic Information" text file is used to record the statistical data of the services in each Web Service Registry, such as the overall number of properties for that Web Service Registry, the overall number of services, the number of services that have no WSDL link, the number of services that contain no contents, the number of services whose WSDL links are not available, etc. The "Log Information" text file records the results of the process steps and the problems encountered, for example which service is being crawled at the moment, to which Web Service Registry it belongs, the reason why the WSDL document of a service cannot be obtained, the reason why a service is unreachable, and so on.
Figure 3-19: Implementation code for getting the WSDL document
3) Figure 3-20 and Figure 3-21 show the code for turning the service properties into the XML file and the INI file and storing these two files on disk thereafter. The parameter "vec" is a Vector of the "PropertyStruct" data type. This "PropertyStruct" data type is an object of a class consisting of two variables: name and value.
Figure 3-20: Implementation code for generating the XML file
Figure 3-21: Implementation code for generating the INI file
4) The code in Figure 3-22 and Figure 3-23 shows the process that turns the service properties of all services in these five Web Service Registries into records of the database. To this end, a database has to be created first. The name of the database can be arbitrary as long as it conforms to the naming rules of the database; the same holds for the name of the table. Figure 3-22 displays the procedure of creating a table in the database. Since it is hard to predict the length of each service property, the data types of all service properties are set to "Text". Figure 3-23 is the code for inserting the service properties into the table with the "update" statement.
Figure 3-22: Implementation code for creating the table in the database
Figure 3-23: Implementation code for generating the table records
3.3 Multithreaded Programming for the DWSC
Multithreaded programming is a built-in characteristic of the Java language. A multithreaded program contains two or more separate parts that can execute concurrently. Each such part of a program is called a thread. The use of multithreading makes it possible to create programs that use the system resources efficiently, for example by making maximum use of the CPU, because the idle time of the CPU can be kept to a minimum.
In this master program there are five Web Service Registries that need to be crawled for the services published among them. Moreover, the number of services published in each Web Service Registry is quite different, which makes the running time cost by each Web Service Registry different as well. It can then happen that a Web Service Registry with fewer services has to wait until another Web Service Registry with many more services finishes. Therefore, in order to reduce the waiting time for the other Web Service Registries and to maximize the use of the system resources, it is necessary to apply multithreaded programming to this master program. That is to say, this master program creates a thread for each Web Service Registry, and these threads are executed independently, as the sketch below illustrates.
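A minimal sketch of the one-thread-per-registry idea; the registry names are those of the five Web Service Registries, while the launcher class and the crawlRegistry placeholder are illustrative assumptions, not the original code.

```java
import java.util.ArrayList;
import java.util.List;

public class RegistryCrawlerLauncher {
    public static void main(String[] args) throws InterruptedException {
        String[] registries = {
            "Service Repository", "Ebi", "Xmethods", "Seekda", "Biocatalogue"
        };
        List<Thread> threads = new ArrayList<>();
        for (String registry : registries) {
            // One independent thread per Web Service Registry, so a small
            // registry never has to wait for a large one to finish.
            Thread t = new Thread(() -> crawlRegistry(registry), registry);
            t.start();
            threads.add(t);
        }
        for (Thread t : threads) {
            t.join();   // wait until all registries are completely crawled
        }
    }

    private static void crawlRegistry(String registry) {
        // Placeholder for the registry-specific crawling process (steps 3 to 6).
        System.out.println("Crawling " + registry + " ...");
    }
}
```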
3.4 Sleep Time Configuration for the Web Service Registries
Since this master program is intended to download the WSDL documents and to extract the service information of the Web services published in the Web Service Registries, it inevitably affects the performance of these registries. In addition, in order not to exceed their throughput capability, the Web Service Registries restrict the rate of access. Because of that, unknown errors sometimes occur while this master program is executing: for instance, the program may halt at one point without getting any more WSDL documents and service information, the WSDL documents of some services of some Web Service Registries cannot be obtained, or some service information is missing. Therefore, in order to obtain the largest possible stock of Web services published in these five Web Service Registries without affecting their throughput, the access rate has to be configured for each service of all Web Service Registries.
Consequently, before going into the essential procedure for each single service of these Web Service Registries, this master program calls the system's built-in function "sleep(long milliseconds)". It is a public static function which causes the currently executing thread to sleep for the specified number of milliseconds; in other words, it temporarily ceases execution for a while. The following table shows the time interval of the sleep function for each Web Service Registry; a short sketch of its use follows the table.
Web Service Registry Name | Time Interval (milliseconds)
Service Repository | 8000
Ebi | 3000
Xmethods | 10000
Seekda | 20000
Biocatalogue | 10000
Table 3-15: Sleep times of the five Web Service Registries
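A minimal sketch of the throttling step, with the intervals from Table 3-15; Thread.sleep is the standard library method referred to as "sleep(long milliseconds)" above, while the helper class itself is an illustrative assumption.

```java
import java.util.Map;

public class CrawlThrottle {
    // Sleep intervals from Table 3-15, keyed by registry name.
    private static final Map<String, Long> SLEEP_MILLIS = Map.of(
        "Service Repository", 8000L,
        "Ebi", 3000L,
        "Xmethods", 10000L,
        "Seekda", 20000L,
        "Biocatalogue", 10000L
    );

    // Called before the essential procedure for each single service.
    public static void throttle(String registry) {
        try {
            Thread.sleep(SLEEP_MILLIS.get(registry));
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();   // preserve the interrupt status
        }
    }
}
```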
4 Experimental Results and Analysis
This chapter shows the quantitative experimental results of the prototype presented in chapter 3, and the analysis of these results is also described and explained. In order to gain a rather accurate result, the experiments were carried out more than five times; all the data displayed in the following tables and charts are their averages.
4.1 Statistic Information for the Different Web Service Registries
This section discusses the amount of Web services published in these five Web Service Registries. This includes the overall number of Web services published in each Web Service Registry and the number of unavailable Web services, which have been archived because they may not be active anymore or are close to being non-active. Table 4-1 shows the service amount statistics of these five Web Service Registries.
Web Service Registry Name | Service Repository | Ebi | Xmethods | Seekda | Biocatalogue
Overall Services | 57 | 289 | 382 | 853 | 2567
Unavailable Services | 0 | 0 | 0 | 0 | 125
Table 4-1: Service amount statistics of the five Web Service Registries
In order to give an intuitive view of the service amount statistics in these five Web Service Registries, Figure 4-1 shows a bar chart derived from Table 4-1. As the bar chart shows, on the one hand, the overall number of Web services increases from the Service Repository Web Service Registry to the Biocatalogue Web Service Registry. The Biocatalogue Web Service Registry in particular owns the largest number of Web services, which indicates that it has a much more powerful ability to provide Web services to the users, because it contains far more services than the other four Web Service Registries. On the other hand, there are no unavailable services in any Web Service Registry except the Biocatalogue Web Service Registry. That is to say, only the Biocatalogue Web Service Registry contains Web services that cannot be used by the users. To some degree this is useless, because these services cannot be used anymore and they waste network resources on the Web. Therefore, all these unavailable services should be eliminated in order to reduce the waste of network resources.
Figure 4-1: Service amount statistics of the five Web Service Registries
4.2 Statistic Information for the WSDL Documents
Web Service Registry Name | Service Repository | Ebi | Xmethods | Seekda | Biocatalogue
Failed WSDL Links | 1 | 0 | 23 | 145 | 32
Without WSDL Links | 0 | 0 | 0 | 0 | 16
Empty Content | 0 | 0 | 2 | 0 | 2
Table 4-2: Statistic information for the WSDL documents
Table 4-2 and Figure 4-2 show the error statistics of the WSDL documents of the Web services in these five Web Service Registries. There are three aspects to this. The first one is the number of "Failed WSDL Links" of the Web services in these Web Service Registries, i.e., the overall number of Web services whose WSDL links are invalid. In other words, it is impossible to get the WSDL documents of these Web services via the URL addresses of their WSDL links, and therefore no WSDL document is created for them. The second aspect is the number of Web services "Without WSDL Links", i.e., the overall number of Web services in each Web Service Registry that have no WSDL link at all. There is no real WSDL document for such Web services, and the value of their WSDL link is "NULL". However, a WSDL document is still created; it has no content, and its name contains the string "[No WSDL Document]". The third aspect is the "Empty Content", which represents the overall number of Web services that have WSDL links and whose URL addresses
are valid, but whose WSDL documents contain no content. In this case, a WSDL document whose name contains the string "(BAD)" is created.
Figure 4-2: Statistic information for the WSDL documents
4.3 Comparison of the Different Average Numbers of Service Properties
This section compares the average number of service properties in these five Web Service Registries. The average number of service properties is calculated by means of the following equation:

    ASP = ONSP / ONS    (1)

where
ASP is the average number of service properties for one Web Service Registry,
ONSP is the overall number of service properties in one Web Service Registry, and
ONS is the overall number of Web services that have already been crawled from the corresponding Web Service Registry.
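To make equation (1) concrete, here is a worked example with hypothetical numbers; the total of 6494 extracted properties is an illustrative assumption, not a measured value from this experiment:

```latex
% Hypothetical worked example for equation (1):
% 382 crawled services with 6494 extracted properties in total.
\[
  ASP = \frac{ONSP}{ONS} = \frac{6494}{382} = 17
\]
```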
Figure 4-3 shows the average number of service properties per Web service in these five Web Service Registries. As already mentioned, one of the measurements for testing the quality of the Web services in a Web Service Registry is the service information: the more information about a Web service, the better you know that service, and consequently the corresponding Web Service Registry can offer Web services of better quality to the users. As seen in Figure 4-3, the Service Repository and Biocatalogue Web Service Registries own a larger number of service properties than the other three Web Service Registries. This directly reflects that these two Web Service Registries provide more detailed information about the Web services published in them, so that users can more easily choose the services they need and would also prefer to use the Web services published in these two Web Service Registries. By contrast, the Xmethods and Seekda
Web Service Registries, which have less service information about their Web services, offer less quality for these Web services. Therefore, users may not like to use the Web services provided in these two Web Service Registries, not to mention the Web services published in the Ebi Web Service Registry.
Figure 4-3: Average number of service properties (Service Repository 23, Ebi 7, Xmethods 17, Seekda 17, Biocatalogue 32)
From the description presented in section 3.2.3, the causes of the differing numbers of service properties in these Web Service Registries may consist of the following points. First, the amount of structured information for the Web services differs among these five Web Service Registries, and part of the information for some Web services in a Web Service Registry may even be missing or empty. For example, the amount of structured information that is supposed to be extracted for the Web services in the Ebi Web Service Registry differs greatly from that in the Biocatalogue Web Service Registry. Secondly, there is a certain amount of endpoint information in almost all Web Service Registries except the Ebi Web Service Registry; this more or less reduces the overall number of service properties. Thirdly, some Web Service Registries, such as Xmethods and Ebi, have no monitoring information, while the Service Repository Web Service Registry in particular has a large amount of monitoring information of the Web services that can be extracted from the Web. The last point, obviously, is the amount of whois information for these Web services. If the database of the whois client does not contain information about the service domain of a Web service in a Web Service Registry, then no whois information can be extracted. Moreover, even if there is information about the service domain, its amount can be very diverse. Therefore, if a Web Service Registry undergoes the situation that many service domains of its Web services have no or only little whois information, the average number of service properties in that registry decreases greatly.
As a result, in order to help users get better acquainted with the Web services provided in a Web Service Registry and to distinguish these Web services, the Web Service Registry should do its best to offer more information for each of its published Web services.
4.4 Different Outputs for the Web Services
The fundamental task of this master program is to obtain the WSDL documents of the Web services hosted in the Web Service Registries, as well as to extract and gather the properties of these Web services and thereafter to store them on disk. Therefore, this section describes the different outputs of this master program, which include the WSDL documents of the Web services, the generated XML and INI files, and the data records of the service properties in the database.
Figure 4-4: WSDL document format of one Web service
The WSDL document of a Web service is simply read from the Web according to the URL address of its WSDL link, and these data are then stored on disk as the WSDL document. The name of the WSDL document is the service name plus the ending ".wsdl", which marks it as a WSDL document. In order to distinguish WSDL documents whose names would be the same although their contents differ, the name of each obtained WSDL document in one Web Service Registry contains a unique Integer in front of its name. Figure 4-4 shows the valid WSDL document format of the Web service; its name is "1BLZService.wsdl".
Besides, the obtained service properties are transformed into an XML file, an INI file, and data records in the database. Figure 4-5, Figure 4-6, and Figure 4-7 show these three different output formats respectively. As can be seen from Figure 4-5, this is an INI file of the Web service, named "1BLZService.ini". The Integer is the same as in the WSDL document, because both are materials belonging to the same Web service. The first three lines in that INI file are service comments, which run from the semicolon to the end of the line; they are basic information describing this INI file. The following line is the section, enclosed in a pair of brackets. It is important because it indicates that the lines behind it are the information of this Web service. The rest of the lines are the actual service information, each given as a key-value pair with an equals sign between the pair. Each service property is displayed from the beginning of the line.
Figure 4-5: INI file format of one Web service
Figure 4-6: XML file format of one Web service
Figure 4-7: Database data format for all Web services
Furthermore, Figure 4-6 shows the generated XML file format of the Web service "BLZService"; its name is "1BLZService.xml". Needless to say, this XML file is part of the materials of the same Web service. Though the format of the XML file differs from that of the INI file, their essential contents are the same; that is to say, the values of the service properties do not differ. This is because both files are generated from the collection of properties of the same Web service. The XML file also has some comments like those in the INI file, which are displayed between "<!--" and "-->", and the section in the INI file corresponds roughly to the root in the XML file. Therefore, all values of the elements under the root "service" in this XML file are the values of the service properties of this Web service.
Finally, as can be seen from Figure 4-7, this is a database table used to store the data of the service information for all Web services in these five Web Service Registries. The entire service information of one Web service is exactly one record in this table. Because of that, the column names of the table are the union of the names of the service information in each Web Service Registry. However, since the column names of the table have to be unique, the redundant names in this union must be eliminated. This is sensible and possible because the names of the service information are well-defined and uniform for all these five Web Service Registries. In addition, the first column in this table is the primary key, an increasing Integer, whose function resembles the Integer contained in the names of the XML and INI files. The remaining columns in the table are the corresponding service properties of the Web services. The value "NULL" or "Null" in this table means that this property of that Web service is empty or missing.
4.5 Comparison of the Average Time Cost for Different Parts of a Single Web Service
This section compares the average time cost of the different parts of getting one single Web service in all these five Web Service Registries. First, the average time cost of getting one single service in a Web Service Registry has to be calculated, which can be done with the following equation:

    ATC = OTS / ONS    (2)

where
ATC is the average time cost for one single Web service,
OTS is the overall time cost of all the Web services in one Web Service Registry, and
ONS is the overall number of Web services that have already been crawled from the corresponding Web Service Registry.
In addition, the average time cost for getting one single service consists of the following six parts: the average time cost for extracting the service properties, for obtaining the WSDL document, for generating the XML file, for generating the INI file, for inserting the service properties into the database table, and for the remaining procedures, such as getting the service list page link and the service page link. The average time cost for extracting the service properties is obtained by means of the following equation:

ATCSI = OTSSI / ONS (3)
Where:
ATCSI is the average time cost for extracting the service properties of one single Web service,
OTSSI is the overall time cost for extracting the service properties of all the Web services in one Web Service Registry,
ONS is the overall number of Web services that have already been crawled from the corresponding Web Service Registry.
The calculation of the other parts is analogous to the equation for the average time cost of extracting the service properties, while the average time cost of the remaining procedures equals the average time cost for one single Web service minus the sum of the average time costs of the other five parts.
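As a minimal sketch, assuming illustrative field names (the crawler would accumulate these totals while running), the averages of equations (2) and (3) and the remainder for the “Others” part can be computed in Java as follows:

// Minimal sketch with illustrative names: per-registry average time costs.
public class TimeCostStats {
    long overallTime;      // OTS: overall time cost of all crawled services (ms)
    long propertyTime;     // OTSSI: overall time for extracting service properties (ms)
    long wsdlTime, xmlTime, iniTime, dbTime; // the other measured parts (ms)
    int serviceCount;      // ONS: number of services crawled from this registry

    double avg(long total) { return (double) total / serviceCount; }

    double atc()   { return avg(overallTime); }  // equation (2): ATC = OTS / ONS
    double atcsi() { return avg(propertyTime); } // equation (3): ATCSI = OTSSI / ONS

    // "Others": overall average minus the sum of the five measured parts
    double others() {
        return atc() - (atcsi() + avg(wsdlTime) + avg(xmlTime)
                        + avg(iniTime) + avg(dbTime));
    }
}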
Registry             Service Property   WSDL Document   XML File   INI File   Database   Others   Overall
Service Repository   8801               918             2          1          53         267      10042
Ebi                  699                82              2          1          28         11       823
Xmethods             5801               1168            2          1          45         12       7029
Seekda               5186               1013            2          1          41         23       6266
Biocatalogue         39533              762             2          1          66         1636     42000

Table 4-3 Average time cost information for all Web Service Registries (all values in milliseconds)
Table 4-3 displays the average time cost of one Web service and of its different parts in all these five Web Service Registries. The first column of table 4-3 contains the names of the five Web Service Registries, and the last column holds the overall average time cost for a single service in the respective Web Service Registry, while the remaining columns list the average time costs of the six different parts. In order to give an intuitive view of the data in table 4-3, each column of this table is illustrated in a corresponding figure; see figure 4-8 to figure 4-13.
Figure 4-8 Average time cost for extracting service property in all Web Service Registries
(bar chart; y-axis: the time in milliseconds, x-axis: the name of the Web Service Registry; values as in Table 4-3)
As can be seen from figure 4-8, the average time cost for extracting the service properties in the Biocatalogue Web Service Registry is 39533 milliseconds. This value is much larger than those of the other four Web Service Registries, which are 8801, 699, 5801 and 5186 milliseconds for Service Repository, Ebi, Xmethods and Seekda respectively. That is to say, it takes much longer to extract the service properties of the Web services published in the Biocatalogue Web Service Registry. This also indirectly indicates that the Biocatalogue Web Service Registry has the highest average number of service properties, which has already been discussed in section 4.3. On the contrary, the average number of service properties in the Ebi Web Service Registry is the smallest. Moreover, the average time cost in the Xmethods Web Service Registry is larger than that in the Seekda Web Service Registry, even though, as already shown, the average number of service properties is the same for these two Web Service Registries. One cause that might explain why Xmethods costs more time than Seekda is that the extraction of service properties in the Xmethods Web Service Registry has to be carried out by means of both the service page and the service list page, while for the Seekda Web Service Registry only the service page link is needed.
The average time cost for obtaining the WSDL document in all these five Web Service Registries is displayed in figure 4-9. This average time cost is actually the sum of the average time for extracting the WSDL link of a Web service and the average time for reading the data of the WSDL document from the Web and storing it on disk. As seen from this figure, the average time cost for obtaining the WSDL document is largest in the Xmethods Web Service Registry, at 1168 milliseconds. Although the process of extracting the WSDL link costs a certain amount of time, it does not have a significant influence on the total average time spent obtaining the WSDL document, because the WSDL link of a Web service is almost always gained in one step. Therefore, this implies that the average size of the WSDL documents in the Xmethods Web Service Registry is larger than in the other four Web Service Registries, in particular the Ebi Web Service Registry, which takes just 82 milliseconds for obtaining the WSDL document.
Figure 4-9 Average time cost for obtaining WSDL document in all Web Service Registries
(bar chart; y-axis: the time in milliseconds, x-axis: the name of the Web Service Registry; values as in Table 4-3)
Figure 4-10, figure 4-11 and figure 4-12 show the average time cost of generating the three different outputs in all these five Web Service Registries. As can be seen from figures 4-10 and 4-11, the average time for generating the XML file of one Web service is the same for all five Web Service Registries, namely only 2 milliseconds; likewise, the average time for generating the INI file is the same everywhere, with a value of just 1 millisecond. Even the sum of these two average time costs is still so small that it can be neglected when compared to the overall average time cost of getting one Web service for each corresponding Web Service Registry, as shown in figure 4-13. This implies that the generation of the XML and INI files finishes almost immediately after receiving the service properties of a Web service as input. Furthermore, as can be seen from figure 4-12, although the average time costs for creating the database record of a Web service are larger in all five Web Service Registries than those for generating the XML and INI files, the operation of creating a database record is still fast.
Figure 4-10 Average time cost for generating XML file in all Web Service Registries
(bar chart; 2 milliseconds for every registry)
Figure 4-11 Average time cost for generating INI file in all Web Service Registries
(bar chart; 1 millisecond for every registry)
Figure 4-12 Average time cost for creating database record in all Web Service Registries
(bar chart; y-axis: the time in milliseconds, x-axis: the name of the Web Service Registry)
Figure 4-13 Average time cost for getting one Web service in all Web Service Registries
(bar chart; y-axis: the time in milliseconds, x-axis: the name of the Web Service Registry)
Figure 4-13 gives the average time cost for getting one single Web service in all these five Web Service Registries. Without any doubt, the Biocatalogue Web Service Registry takes the longest time for this process. This is because, as the presentation of the five different parts above shows, each part needs more time in the Biocatalogue Web Service Registry than in the other registries, except for the process of obtaining the WSDL document, for which the Biocatalogue Web Service Registry does not have the largest average time. Moreover, a striking observation can be made when looking at figures 4-8, 4-12 and 4-13: the shapes of these curves follow almost the same trend. This further indicates that a Web Service Registry that spends more time on getting the description information of a Web service also offers more information about that Web service.
5 Conclusion and Further Directions
This master thesis provides a schema which aims to explore the description information of the Web services hosted in different Web Service Registries. The description information of a Web service consists of its WSDL document and the service information about that Web service. Although there is an existing approach, the Pica-Pica Web Service Description Crawler, that can be used to obtain the WSDL document and service information of the Web services in these Web Service Registries, its functionality is restricted: it explores only a small subset of the Web services hosted in these Web Service Registries, the storage of these Web services is not flexible, and, most importantly, only little service information is extracted per Web service; for some Web Service Registries no service information is extracted at all. The work presented in this master thesis, however, is able to explore all Web services in these Web Service Registries. Moreover, with the approach presented in this master thesis, as much service information as possible is extracted for each Web service, so that the final result is the largest annotated service catalogue ever produced. Furthermore, regarding the storage of the description information of the Web services, this master thesis provides three different ways that guarantee not only the completeness but also the longevity of the description information.
However, in the implementation performed in this master thesis, the whois client used for querying the information of a service domain returns free text if the information exists, and sometimes this free text differs completely between domains. As a consequence, each Web service in all Web Service Registries had to be crawled at least once during the experiment stage, so that all variants of this free text could be foreseen and processed afterwards. Nevertheless, this is a huge amount of work, because there are lots of Web services in these Web Service Registries. Therefore, in order to simplify the work, another whois client that eases this task needs to be found and used.
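For illustration, the behavior can be reproduced with, for example, the WhoisClient of the Apache Commons Net library (not necessarily the client used in this implementation); the reply is plain free text whose layout depends on the registrar, so every field needs its own parsing rule:

import org.apache.commons.net.whois.WhoisClient;

public class WhoisExample {
    public static void main(String[] args) throws Exception {
        WhoisClient whois = new WhoisClient();
        whois.connect(WhoisClient.DEFAULT_HOST); // whois.internic.net
        String freeText = whois.query("thomas-bayer.com"); // unstructured reply
        whois.disconnect();
        // The layout of the reply differs between registrars, so even a single
        // field such as the expiration date has to be searched heuristically:
        for (String line : freeText.split("\\r?\\n")) {
            if (line.toLowerCase().contains("expir")) {
                System.out.println(line.trim());
            }
        }
    }
}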
Moreover, in the experiment stage of this master thesis, the time cost for getting a Web service is still large. For the purpose of reducing this time, multithreaded programming could be applied to some parts of the process of getting one Web service, as sketched below.
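A minimal sketch of such a parallelization, assuming a hypothetical crawlService method that performs all per-service steps (property extraction, WSDL download, file generation, database insert), could use a Java thread pool:

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelCrawl {
    // Hypothetical per-service routine covering all steps for one service page link.
    static void crawlService(String servicePageLink) { /* ... */ }

    public static void main(String[] args) throws InterruptedException {
        List<String> servicePageLinks = List.of(); // filled by the Web Service Extractor
        ExecutorService pool = Executors.newFixedThreadPool(8); // 8 concurrent workers
        for (String link : servicePageLinks) {
            pool.submit(() -> crawlService(link)); // crawl several services at once
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}

Note that the sleep times of the individual Web Service Registries (cf. table 3-15) would still have to be respected, which limits the possible degree of parallelism per registry.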
Although the work performed here is specialized for only these five Web Service Registries, the main principles used here are adaptable to other Web Service Registries with only small changes in the implementation code or structure.
6 Bibliography
[1] Nathalie Steinmetz, Holger Lausen, Dario Cerizza, Andrea Turati, Irene Celino, Adam Funk and Michael Erdmann, “D1.1 – Requirement Analysis and Architectural Plan”, Emanuele Della Valle (CEFRIEL), June 27, 2008. Available from http://www.service-finder.eu/deliverables/public/7-public/32-d11
[2] Nathalie Steinmetz, Holger Lausen, Dario Cerizza, Andrea Turati, Irene Celino, Adam Funk and Michael Erdmann, “D1.2 – First Design of Service-Finder as a Whole”, Emanuele Della Valle (CEFRIEL), July 1, 2008. Available from http://www.service-finder.eu/deliverables/public/7-public/37-d12-first-design-of-service-finder-as-a-whole
[3] Nathalie Steinmetz, Holger Lausen, Irene Celino, Dario Cerizza, Saartje Brockmans, Adam Funk, “D1.3 – Revised Requirement Analysis and Architectural Plan”, Emanuele Della Valle (CEFRIEL), April 1, 2009. Available from http://www.service-finder.eu/deliverables/public/7-public/57-d13-revised-requirement-analysis-and-architectural-plan
[4] Chia-Hui Chang, Mohammed Kayed, Moheb Ramzy Girgis, Khaled Shaalan, “A Survey of Web Information Extraction Systems”, IEEE Transactions on Knowledge and Data Engineering, Volume 18, Issue 10, IEEE Computer Society, pp. 1411-1428, October 2006.
[5] Leonard Richardson, “Beautiful Soup Documentation”, October 13, 2008. Available from http://www.crummy.com/software/BeautifulSoup/documentation.html
[6] Hao He, Hugo Haas, David Orchard, “Web Services Architecture Usage Scenarios”, February 11, 2004. Available from http://www.w3.org/TR/ws-arch-scenarios
[7] Stephen Soderland, “Learning Information Extraction Rules for Semi-Structured and Free Text”, Machine Learning, Volume 34, Issue 1-3, Department of Computer Science and Engineering, University of Washington, Seattle, pp. 233-272, February 1999.
[8] Ian Hickson, “A Vocabulary and Associated APIs for HTML and XHTML”, World Wide Web Consortium, Working Draft WD-html5-20100624, January 22, 2008.
[9] Holger Lausen, Jos de Bruijn, Axel Polleres, Dieter Fensel, “The Web Service Modeling Language WSML”, WSML Deliverable D16.1 v0.2, March 20, 2005. Available from http://www.wsmo.org/TR/d16/d16.1/v0.2/
[10] Dumitru Roman, Holger Lausen, Uwe Keller, “Web Service Modeling Ontology - Standard (WSMO-Standard)”, WSMO deliverable D2 version 1.1, 06 March 2004. Available from http://www.wsmo.org/TR/d2/v1.1/
[11] Iris Braun, Anja Strunk, Gergana Stoyanova, Bastian Buder, “ConQo – A Context- and QoS-Aware Service Discovery”, TU Dresden, Department of Computer Science, in Proceedings of WWW/Internet 2008.
7 Appendixes
There are two additional outputs of this master program: a log information file and a statistic report file. Figure 8-1 shows the basic output format of the log information for these five Web Service Registries.
Figure 8-1 Log information of the “Service Repository” Web Service Registry
Figure 8-2 Statistic information of the “Service Repository” Web Service Registry
Figures 8-2 to 8-6 show the output format of the statistic information for the “Service Repository”, “Ebi”, “Xmethods”, “Seekda” and “Biocatalogue” Web Service Registries, respectively.
Figure 8-3 Statistic information of the “Ebi” Web Service Registry
Figure 8-4 Statistic information of the “Xmethods” Web Service Registry
Figure 8-5 Statistic information of the “Seekda” Web Service Registry
Figure 8-6 Statistic information of the “Biocatalogue” Web Service Registry
Table of Figures
Figure 2-1 Dataflow of Service-Finder and Its Components
Figure 2-2 Left is the free text input type and right is its output
Figure 2-3 A Semi-structured page containing data records (in rectangular box) to be extracted
Figure 2-4 Architecture of the Pica-Pica Web Service Description Crawler
Figure 3-1 Overview of the Basic Architecture for the Deep Web Services Crawler
Figure 3-2 Overview of the process flow of the Web Service Extractor Component
Figure 3-3 Service list page of the Service-Repository
Figure 3-4 Original source code of the internal link for Web service “BLZService”
Figure 3-5 Code overview of getting service page link in Service Repository
Figure 3-6 Service page of the Web service “BLZService”
Figure 3-7 Overview of the process flow of the WSDL Grabber Component
Figure 3-8 WSDL link of the Web service “BLZService” in the service page
Figure 3-9 Original source code of the WSDL link for Web service “BLZService”
Figure 3-10 Code overview of the “getServiceRepositoryWSDLLink” function
Figure 3-11 Code overview of the “oneParameter” function
Figure 3-12 Overview of the process flow of the Property Grabber Component
Figure 3-13 Structured properties of the Service “BLZService” in the service list page
Figure 3-14 Structured properties of the Service “BLZService” in the service page
Figure 3-15 Endpoint information of the Web service “BLZService” in the service page
Figure 3-16 Monitoring information of the Service “BLZService” in the service page
Figure 3-17 Whois information of the service domain “thomas-bayer.com”
Figure 3-18 Overview of the process flow of the Storage Component
Figure 3-19 Implementation code for getting the WSDL document
Figure 3-20 Implementation code for generating the XML file
Figure 3-21 Implementation code for generating the INI file
Figure 3-22 Implementation code for creating the table in the database
Figure 3-23 Implementation code for generating table records
Figure 4-1 Service amount statistic of these five Web Service Registries
Figure 4-2 Statistic information for WSDL Document
Figure 4-3 Average Number of Service Properties
Figure 4-4 WSDL Document format of one Web service
Figure 4-5 INI File format of one Web service
Figure 4-6 XML File format of one Web service
Figure 4-7 Database data format for all Web services
Figure 4-8 Average time cost for extracting service property in all Web Service Registries
Figure 4-9 Average time cost for obtaining WSDL document in all Web Service Registries
Figure 4-10 Average time cost for generating XML file in all Web Service Registries
Figure 4-11 Average time cost for generating INI file in all Web Service Registries
Figure 4-12 Average time cost for creating database record in all Web Service Registries
Figure 4-13 Average time cost for getting one Web service in all Web Service Registries
Table of Tables
Table 3-1 Structured Information of Service-Repository Web Service Registry
Table 3-2 Structured Information of Xmethods Web Service Registry
Table 3-3 Structured Information of Seekda Web Service Registry
Table 3-4 Structured Information of Ebi Web Service Registry
Table 3-5 Structured Information of Biocatalogue Web Service Registry
Table 3-6 SOAP Operation Information of Biocatalogue Web Service Registry
Table 3-7 REST Operation Information of Biocatalogue Web Service Registry
Table 3-8 Endpoint Information of these five Web Service Registries
Table 3-9 Monitoring Information of these five Web Service Registries
Table 3-10 Whois Information for these five Web Service Registries
Table 3-11 Extracted Structured Information of Web Service “BLZService”
Table 3-12 Extracted Endpoint Information of the Web service “BLZService”
Table 3-13 Extracted Monitoring Information of the Web service “BLZService”
Table 3-14 Extracted Whois Information of service domain “thomas-bayer.com”
Table 3-15 Sleep Time of these five Web Service Registries
Table 4-1 Service amount statistic of these five Web Service Registries
Table 4-2 Statistic information for WSDL Document
Table 4-3 Average time cost information for all Web Service Registries
Table of Abbreviations
DTD Document Type Definition
DWSC Deep Web Service Crawler
HTML HyperText Markup Language
MTBF Mean Time between Failures
MTTR Mean Time to Recovery
QoS Quality of Service
REST Representational State Transfer
RTT Round Trip Time
SOAP Simple Object Access Protocol
SQL Structured Query Language
URL Uniform Resource Locator
WHATWG Web Hypertext Application Technology Working Group
WSDL Web Service Description Language
WSML Web Service Modeling Language
WSMO Web Service Modeling Ontology
XML eXtensible Markup Language
Deep Web Service Crawler
10
2 State of the Art
This chapter aims at presenting some existing techniques or Strategies that related to the work of
applying this Deep Web Service Extraction Crawler approach Section 21 plans to talk about the
existing catalogues Service-Finder project And then in section 22 it is going to explain the existing
implemented crawlers Pica-Pica Web Service Description Crawler Finally in the section 23 it is
supposed to present some details about the Information Extraction technique
21 Service Finder Project
Service-Finder Project aims at developing a platform for Web Service discovery especially for the Web
Services that are embedded in a Web 20 environment [1] Hence it can provide an efficient access to
publicly available services The goals of the Service-Finder project are depicted as follows [1]
n Automatically gather Web Services and their related information
n Semi-automatically create semantic service description based on the information that available
on the Web
n Create and improve semantic annotations via the user feedback
n Describe the aggregated information in semantic models and allow reasoning query
However before describing the basic functionality of the Service-Finder Project there is going to
present one of its use cases and requirements first
211 Use Cases for Service-Finder Project
The Service-Finder Project employed the use case methodology of the W3C Use Case description [6]
for its needs and then applied this methodology to the use cases that it enumerated
2111 Use Case Methodology
There are three aspects needed to be considered for use case definitions of Service-Finder Project [1]
(1) Description that used to describe information of the use case
(2) Actors Roles and Goals that used to identify the actors they would be the roles they act and the
goals they need to achieve in the scenario
(3) Storyboard that used to describe the serial of interactions among the actors and the
Service-Finder Portal
2112 System Administrator
This section is going to present the use case that applied to the Service-Finder portal and that
illustrated the requirements on its functionality from a user point of view However all these
Deep Web Service Crawler
11
information in this use case are derived from [1] In this use case there has a system administrator
whose name is ldquoSam Adamsrdquo He works for the bank His job is to keep the online payment facilities
online and working all day and night Therefore if there is any system failures Sam Adams should fix
the problems as early as he can Thatrsquos why he wants to use an SMS Messaging service which will
alert him immediately by sending him a SMS Message in the case of a system failure
n Description
This use case is dedicated to describe the system administrator ldquoSam Adamsrdquo who is looking for an
SMS Messaging Service that he wants to build it into his application
n Actors Roles and Goals
The name of the Actor is ldquoSam Adamsrdquo His Role is a system administrator at a bank The goals of him
are the immediate service delivery the reliability of the service and low base fee and transaction fee
n Storyboard
Step 1 Because of Sam Adams knows the Service-Finder portal and he also knows that he can find
many useful services from it especially he know what he is looking for Hence he visits the
Service-Finder portal and starts to search by entering the keywords ldquoSMS Alertrdquo
Requirement 1 Search functionality
Step 2 Now the Service-Finder returns a list of matching services However Sam wants to choose the
number of matching services that will be displayed on one page And he would also expect there has
short information about the service functionality the service provider and the service availability So
that he could decide which service he will choose to read further
Requirement 2 Enable configurable pagination of the matching results and have some short
information for each service
Step 3 When Sam looks through the short information about the services that displayed on the first
page he expects to find the most relevant services that related to his request After that he would
like to read more detailed information about that service to see whether this service can provide the
needed functionality
Requirement 3 Rank the returned matching services and must provide ability to read more details of
a service
Step 4 In the case that all the returned matching services Sam got provide quite different
functionalities or they belong to different service categories for example the SMS messaging services
alert users not through SMS but voice messaging For this reason Sam would like to see other
different categories that may be contain the services he wants Or the services of other categories
which he is also interested in (like ldquoSMS Messagingrdquo) Besides another possible way is that Sam can
further filter his search in terms of browsing through categories
Requirement 4 Service categories and allow the user to look all services that belonged to that specific
category If possible it should also allow the user to browse through categories
Step 5 When Sam got all the services that could provide a SMS messaging service via the methods
described in the Step 4 at present he wants to look for the services that offered by an Austrian
provider and have no base fees if possible
Requirement 5 Faceted search
Deep Web Service Crawler
12
Step 6 After Sam got all these specific services now he would like to choose the services that can
provide a high reliability
Requirement 6 Sort functionality based on usersrsquo chooses
Step 7 For now Sam expects to compare the service availability between the promised to the service
provider and the actually provided This should be contained in the servicesrsquo details And there needs
also have service coverage information so that Sam can know whether this service covers the areas
he lives and works Moreover Sam would also like to compare these services in other way For
instance put some services into a structured table to compare the transaction fees
Requirement 7 A side-by-side comparison table for services and a functionality that enable users to
select services he wants to compare
Step 8 At last Sam wants to know whether the service providers offer a free try out of the services
So that he can test the service functionality
Requirement 8 If possible display a note that offering free service trials
212 Architecture Plan for the Service-Finder Project
The architecture plan of the Service-Finder project contains five basic components Service Crawler
Automatic Annotator Conceptual Indexer and Matcher Cluster Engine and Service-Finder Portal
Interface Figure 2-1 presents a high level overview of the components and the data flow among
them
Figure2-1Dataflow of Service-Finder and Its Components [3]
Deep Web Service Crawler
13
2121 The Principle of the Service Crawler Component
The Service Crawler is responsible for gathering available services and their related information from
the Web The overall cycle is depicted as following
(1) A Web developer publishes a Web Service
(2) Then the Crawling component begins to harvest the Web in order to identify the Web Services
like WSDL (Web Service Description Language) documents
(3) The Crawler is also going to search for other related information as long as a service is discovered
(4) Later after each periodic interval the Crawler will produce a consistent snapshot of the relevant
part of the Web
At last the output of the crawler would be forwarded to the subsequent components for analyzing
indexing and displaying
2122 The Principle of the Automatic Annotator Component
The Automatic Annotator receives the relevant data from previous component and generates
semantic service descriptions about the WSDL documents and its related information based on the
Service-Finder Ontology and Service Category Ontology
Firstly it will simply introduce those two compatible ontologies that would be used throughout the
whole process [2]
n Generic Service Ontology it is an ontology which is functional to describe the data objects For
example the services the service providers availability payment modalities and so on
n Service Category Ontology it is an ontology which is used to categorize the functionalities or
applications of the services For instance data verification messaging data storage weather etc
Afterwards it is going to talk about the function of this component with its input output
Oslash Input
u Crawled data from Service Crawler
u Service-Finder Ontologies
u Feedback or Correction of before annotations
Oslash Function
u Enrich the information about the service and extract semantic statements according to the
Service-Finder Ontologies For example categorize the service according to the Service
Category Ontology
u Determine whether a particular document is relevant or not through the Web link graph If
not discard these irrelevant documents
u Classify the pages into their genres For instance pricing user comments FAQ and so on
Oslash Output
u Semantic annotation of the services
Deep Web Service Crawler
14
2123 The Principle of the Conceptual Indexer and Matcher
Component
The Conceptual Indexer and Matcher is more like a data store center that aims at storing all extracted
information of the services and supplying users the capability of retrieval and semantic query For
example the matchmaking between user requests and service offers and the act of retrieving user
feedback on extracted annotations
In addition letrsquos have a look of the function of this component and its input output
Oslash Input
u Semantic annotation data and full text information obtained from Automatic Annotation
u Semantic annotation data and full text information that come from user interfaces
u Cluster data from user and service clustering component
Oslash Function
u Store the semantic annotations received from the Automatic Annotation component and
from the user interface
u Store the cluster data that procured through the clustering component
u Store and index the textual description offered by the Automatic Annotation component
and the textual comments offered by users
u Ontological query the semantic data from the data store center
u Combined keyword and Ontological querying used for user queries
u Provide a list of similar services for a given service
Oslash Output
u A list of matching services that are queried by users In particular these services should be
sorted by ranking and can also be iterated
u All available data that related to a particular entity must be retrievable at the user interface
2124 The Principle of the Service-Finder Portal Interface
Component
The Service-Finder Portal Interface is the main entry point that provided for users of the
Service-Finder system to search and browse the data which is managed by the Conceptual Indexer
and Matcher component In addition the users can also contribute information by means of providing
tags comments categorizations and ratings to the data browsed Furthermore the developers can
still directly invoke the Service-Finder functionalities from their custom applications in terms of an API
Besides the details of this componentrsquos function input and output are represented as below
Oslash Input
u A list of ordered services for a query
u Detailed information about a service or a set of services and a service provider
u Query access to service category ontology and the most used tags provided by the users
Deep Web Service Crawler
15
u Service availability information
Oslash Function
u The Web Interface allows the users to search services by keyword tag or concept in the
categorization sort and filter query results by refining the query compare and bookmark
services try out the services that offer this functionality
u The API allows the developers to invoke Service-Finder functionalities
Oslash Output
u Explicit user annotations such as tags ratings comments decryptions and so on
u Implicit user data for example click stream of users bookmarks comparisons links sent
etc
u Manual advertising of available new services
2125 The Principle of the Cluster Engine Component
The Cluster Engine is in charge of collecting and analyzing information of user behaviors from the
Service-Finder Portal eg the queried services and the compared services of the users Moreover it
also provides cluster data to the Conceptual Indexer and Matcher for providing service
recommendations
Furthermore letrsquos detailed introduce this componentrsquos function input and output
Oslash Input
u Service annotation data of both extracted and user feedback
u Usersrsquo Click streams used for extracting user behaviors
Oslash Function
u Obtain user clusters from user behaviors
u Obtain service clusters from service annotation data to enable to find similar services
Oslash Output
u Clusters of users and services
22 Information Extraction
Due to the cause of the rapid development and use of the World-Wide Web there procedure a huge
amount of information sources on the Internet which has been limited the access to browsing and
searching for the reason of the heterogeneity and the lack of structure of Web information sources
Therefore the appearance of Information Extraction that transforms the Web pages into
program-friendly structures for post-processing would become a great necessity However the task of
Information Extraction is specified in terms of the inputs and the extraction targets And the
techniques used in the process of Information Extraction called extractor
221 Input Types of Information Extraction
Generally speaking there are three different input types The first input type is the unstructured
Deep Web Service Crawler
16
document For example the free text that showed in figure 2-2 It is unstructured and written in
natural language So that it will require substantial natural language processing While the second
input type is called the structured document For instance the XML documents based on the reason
that the data can be described through the available DTD (Document Type Definition) or XML
(eXtensible Markup Language) schema Finally but obviously the third input type is the
semi-structured document that are widespread on the Web Such as the large volume of HTML
pages like tables itemized lists and enumerated lists This is because HTML tags are often used to
render these embedded data in the HTML pages See figure 2-3
Figure2-2Left is the free text input type and right is its output [4]
Figure2-3A Semi-structured page containing data records
(in rectangular box) to be extracted [4]
Therefore in this way the inputs of semi-structured type can be seen as the documents with a fairly
regular structure And the data of these documents can be displayed in a format of HTML way or
non-HTML way Besides owing to the reason that the Web pages of the Deep Web are dynamic and
generated from structured databases in terms of some templates or layouts thus it would be
considered as one of the input sources which could provide some of these semi-structured documents
For example the authors price and comments of the book pages that provided by Amazon have the
Deep Web Service Crawler
17
same layout That is because these Web pages are generated from the same database and applied
with the same template or layout Furthermore there has another option which could manually
generate HTML pages of semi-structured type For example although the publication lists that
provided from different kinds of researchersrsquo homepages are produced by diverse uses they all have
title and source property for every single pager Eventually the inputs for some Information Extraction
can also be the pages with the same class or among various Web Service Registries
222 Extraction Targets of Information Extraction
Moreover regarding the task of the Information Extraction it has to consider the extraction target
There also have two different extraction targets The first one is the relation of k-tuple And the k in
there means the number of attributes in a record Nevertheless in some cases an attribute of one
record may have none instantiation Otherwise the attribute owns multiple instantiations In addition
the complex object with hierarchically organized data would be the second extraction target Though
the ways for depicting the extraction targets in a page are diverse the most common structure is the
hierarchical tree This hierarchical tree may contain only one leaf node or one or more lists of leaf
nodes which called internal nodes And the structure for a data object may also be flat or nested To
be brief if the structure is flat then there is only one leaf node that call also be called root Otherwise
if it is nested structure then the internal nodes that involved in this data object would be more than
two levels
Furthermore in order to make the Web pages readable for human being and having an easier
visualization these tables or tuples of the same list or elements of a tuple should be definitely isolated
or demarcated However the displaying for a data object in a Web page would be affected by
following conditions [4]
Oslash The attribute of a data object has zero or several values
(1) If there is no value for the attribute of a data object this attribute will be called the ldquononerdquo
attribute For example a special offer only available for certain books might be a ldquononerdquo
attribute
(2) If there are more than one values for the attribute of a data object it will be called the
ldquomultiValuerdquo attribute For instance the name of the author for a book could be a
ldquomultiValuerdquo attribute
Oslash The set of attributes (A1 A2 A3 hellip) has multiple ordering
That is to say among this set of attribute the position of the attribute might be changed
according to the diverse instances of a data object Thus this attribute will be called the
ldquomultiOrderingrdquo attribute For instance for the moives before year 1999 the move site would
enumerate the release data in front of the movesrsquo title while for the movies after year 1999
(including 1999) it will enumerate the release data behind the movesrsquo title
Oslash The attribute has different formats
This means the displaying format of the data object could be completely distinct with respect to
these different instances Therefore if the format of an attribute is free then a lot of rules will be
needed to deal with all kinds of possible cases This kind attribute will be called ldquomultiFormatrdquo
attribute For example an ecommerce Web site would use the bold font format to present the
general prices while use the red color format to display the sale prices Nevertheless there has
Deep Web Service Crawler
18
another situation that some different attributes for a data object have the same format For
example various attributes are presented in terms of using the ltTDgt tags in a table presentation
And the attributes like those could be differentiated by means of the order information of these
attributes However for cases that there occurs ldquononerdquo attribute or exists ldquomultiOrderingrdquo
attributes it must have to revise the rules for extracting these attributes
Oslash The attribute cannot be decomposed
Because of the easier processing sometimes the input documents would like to be treated as
strings of tokens instead of the strings of characters In addition some of the attribute cannot
even be decomposed into several individual tokens These attributes are called the ldquountokenizedrdquo
attributes For example the college course catalogue like ldquoCOMP4016rdquo or ldquoGEOL2001rdquo The
department code and the course number in them cannot be separated into two different strings
of characters like that ldquoCOMPrdquo and ldquo4016rdquo or ldquoGEOLrdquo and ldquo2001rdquo
223 The Used Techniques in Information Extraction
The extractor used in the process of Information Extraction aims at providing a single uniform query
interface to access information sources like database server and Web server It consists of following
phases collecting returned Web pages labeling these Web pages generalizing extraction rules
extracting the relevant data and outputting the result in an appropriate format (XML format or
relational database) for further information integration For example at first the extractor queries the
Web server to gather the returned pages through the HTTP protocols after that it starts to extract the
contents among these HTML documents and integrate with other data sources thereafter Actually
the whole process of the extractor follows below steps
Oslash Step 1
At the beginning it must have to tokenize the input However there are two different
granularities for the input string tokenization They are tag-level encoding and word-level
encoding The tag-level encoding will transform the tags of HTML page into the general tokens
while transform all text string between two tags into a special token Nevertheless the
word-level encoding does this in another way It treats each word in a document as a token
Oslash Step 2
Next it should apply the extraction rules for every attributes of the data object in the Web pages
These extraction rules could be induced in terms of a top-down or bottom-up generalization
pattern mining or logic programming In addition the type of extraction rules may be indicated
by means of regular grammars or logic rules For example some use path-expressions of the
HTML parse tree path like htmlheadtitle or html-gttable[0] some use syntactic or semantic
constraints and some use delimiter-based constraints such as HTML tags or literal words
Oslash Step 3
After that all these extracted data would be assembled into the records
Oslash Step 4
Finally iterate this process until all these data objects in the input
Deep Web Service Crawler
19
23 Pica-Pica Web Service Description Crawler
The Pica-Pica is knows as a kind of bird species it can also be called pie However at the moment the
Pica-Pica here is a Web Service Description Crawler which is designed to solve the quality of Web
Services problem For example the evaluation of the descriptive quality of Web Services that offered
and how well are these Web Services described in nowadaysrsquo Web Service Registries
231 Needed Libraries of the Pica-Pica Web Service
Description Crawler
This version of the Pica-Pica Web Service Description Crawler is written by Anton Caceres and Josef
Spillner and programmed in terms of the Python language Actually in order to run these scripts to
parse the HTML pages it needs two additional libraries Beautiful Soup and Html5lib
n Beautiful Soup
It is an HTMLXML parser for Python language And it can even turn these invalid markups into a
parse tree [5] Moreover the following three features can make it more powerful
u The bad markup doesnrsquot choke the Beautiful Soup In fact it will generate a parse tree that
makes approximately as much sense as the original document Therefore you can obtain
the data that you want
u The Beautiful Soup has a toolkit that can provide simple idiomatic methods for navigating
searching and modifying the parse tree Hence you donrsquot need to create a custom parse
for every application
u If the document has already specified an encoding then you can ignore it since the
Beautiful Soup can convert the documents from Unicode to UTF-8 in an automatic way
Otherwise what you have to do is just to specify the encoding of the original documents
Furthermore the ways of including Beautiful Soup into the application are displayed in the
following [5]
sup2 From BeautifulSoup import BeautifulSoup For processing HTML
sup2 From BeautifulSoup import BeautifulStoneSoup For processing XML
sup2 Import BeautifulSoup To get everything
n Html5lib
It is a Python package which can implement the HTML5 [8] parsing algorithm And in order to
gain maximum compatibility with the current major desktop web browsers this implementation
will be based on the WHATWG (Web Hypertext Application Technology Working Group) HTML5
specification
Deep Web Service Crawler
20
232 Architecture of the Pica-Pica Web Service
Description Crawler
Figure2-4 Architecture of the Pica-Pica Web Service Description Crawler
Figure 2-4 illustrates the primary architecture of Pica-Pica Web Service Description Crawler It includes
four fundamental components Service Page Grabber component WSDL Grabber component
Property Grabber component and WSML Register component
(1) The Service Page Grabber components is going to take the URL seed as the input and output the
link of the service page into following two components WSDL Grabber component and Property
Grabber component
(2) The WSDL Grabber component is responsible for obtaining the WSDL document based on the
delivered service pagersquos link And then check whether the validation for these obtained WSDL
document Finally only these valid WSDL document will be passed into the WSML Register
component for further processing
(3) The Property Grabber component will try to extract the servicersquos property the hosted in the
service page if there exists After that all these servicersquos properties would be saved into an INI
file as the information of that service
(4) The functionality of the WSML Register component is to write appropriate WSML document by
means of the valid WSDL documents that delivered from WSDL Grabber component and the
optionally INI files that delivered from Property Grabber component Afterwards register them in
Conqo
n WSML [9]
It stands for Web Service Modeling Language which provides a framework with different
language variants Hence it is often used to describe the different aspects of the semantic Web
Services according to the conceptual model of WSMO
n WSMO [10]
WSMO whose full name is Web Service Modeling Ontology is dedicated to describe various
aspects related to the Semantic Web Services based on the ontologies Ontology is a formal
explicit specification of a shared conceptualization In fact these ontologies are the sticking points
that can satisfy the linkage between the agreement of the communities of users and the defined
conceptual semantics of the real-world
n Conqo [11]
Deep Web Service Crawler
21
It is a discovery framework that considers not only the Quality of Service (QoS) but also the
context information It will use a Web Service Repository to manage these service descriptions
that based on WSML
233 Implementation of the Pica-Pica Web Service
Description Crawler
This section is going to describe the processes of the implementation of the Pica-Pica Web Service
Description Crawler in detail
(1) Firstly for starting the whole crawling process of the Pica-Pica Web Service Description Crawler it
needs an input as the initial seed In this crawler there are five Web Service Registries which are
listed in the below The URL address of these five Web Service Registries will be used as the input
seed for this Pica-Pica Web Service Description Crawler Moreover in this version of Pica-Pica Web
Service Description Crawler there has a single Python script for each Web Service Registry And
the crawling process for these Web Service Registriesrsquo Python script will be executed one after
another
Biocataloue httpwwwbiocataloguecom
Ebi httpwwwebiacuk
Seekda httpwwwseekdacom
Service-Repository httpwwwsrvice-repositorycom
Xmethods httpwwwxmethodsnet
(2) Then after feeding with the input seed it will step into the next component Service Page Grabber
At first this component will try to read the data from the Web based on the input seed Then it
will establish a parsing tree of the read data in terms of the functions of the Beautiful Soup library
After that this Service Page Grabber component starts to look for the service page link for each
service that published in the Web Service Registry by means of the functions in Html5lib library In
the case that the service page link of one single service is found it will firstly check whether this
service page link is valid or not Once the service page link is valid it will pass it into the following
two components for further processing which are WSDL Grabber component and Property
Grabber component
(3) When the WSDL Grabber component receives a service page link from its previous component it
sets out to extract the WSDL link address for that service through the parsing tree of the data in
this service page Next this component will start to download the WSDL document of that service
in terms of the WSDL link address Thereafter the obtained WSDL document would be stored into
the disk The process of this WSDL Grabber component will be continually carried on until there is
no more servicersquos link passed to it Certainly not all grabbed WSDL documents are effective They
may either contain bad definitions or bad namespaceURI or be an empty document What is
worse it is not even of XML format Hence in order to pick them out this component will further
analyze the involved WSDL documents Then put all these valid documents into a ldquovalidWSDLsrdquo
folder Whereas other invalid documents will be put into a folder named ldquoinvalidWSDLsrdquo in
order to gather statistic information Finally only these WSDL documents in the ldquovalidWSDLsrdquo
folder will be passed into the subsequent component
(4) Moreover since some Web Service Registries would give some addition information about the
Deep Web Service Crawler
22
services such as availability service provider version therefore the Property Grabber
component will set out to extract these information as this servicersquos properties And thereafter
save these servicersquos properties into an INI file However if there has no available additional
information then it is no need to extract the service property thus there is no INI file for that
service However for the implementation of this Pica-Pica Web Service Description Crawler only
the Python scripts for Seekda and Service-Repository Web Service Registries has the functions to
extract the servicesrsquo properties While for other three Web Service Registries there has such
function
(5) Furthermore this is an optional to create a report file which contains the statistic information of
this process Such as the total number of services for one Web Service Registry the number of
services whose WSDL document is invalid etc
(6) As has been stated at present there has a folder with all valid WSDL documents and might also
have some INI files Therefore at this time the task of WSML Register component is to generate
the appropriate WSML documents with these valid WSDL document and INI files And then
register them in Conqo
2.4 Conclusions of the Existing Strategies
This chapter presented three aspects of the existing strategies: the Service-Finder project, Information Extraction techniques, and the Pica-Pica Web Service Description Crawler.
The task of this master program is to obtain the available Web services and their related information from the Web. In essence, this is a procedure of extracting the needed information for each service, such as the service's WSDL document and its properties. Therefore, Information Extraction techniques, which are used to extract information hosted on the Web, can be applied in this master program.
Moreover, Service-Finder is a large project that is not only able to obtain the available Web services and their related information from the Web, but also enriches these crawled data with annotations by means of the Service-Finder ontology and the Service Category ontology, and integrates all the information into a coherent semantic model. Furthermore, it provides capabilities for searching and browsing the data through a user interface and gives users service recommendations. However, for a master program the Service-Finder project far exceeds the requirements; it is therefore only considered as a reference for this master program.
Furthermore, since the Pica-Pica Web Service Description Crawler aims only at obtaining the available Web services and their related information, it fulfills the primary task of this master program. Nevertheless, regarding the information about the services, the Pica-Pica Web Service Description Crawler extracts only a few properties, and sometimes none at all. Consequently, in order to improve the quality of the service descriptions, as many properties as possible have to be extracted for each service. Hence, chapter 3 presents an extension of this Pica-Pica Web Service Description Crawler.
3 Design and Implementation
In the previous chapter on the state of the art, the already existing techniques and implementations were presented. In the following, the basic principle of the proposed approach, the Deep Web Services Crawler, is introduced; it is based on these existing techniques, especially on the Pica-Pica Web Service Description Crawler.
3.1 Deep Web Services Crawler Requirements
This section mainly discusses the goals of the Deep Web Service Crawler approach, the system requirements of the approach, and some non-functional requirements.
3.1.1 Basic Requirements for DWSC
The following are the basic requirements that should be achieved:
(1) Produce the largest annotated Service Catalogue. Here, the Service Catalogue is a list of the Web services published in the Web Service Registries. It contains not only the WSDL document of each service but also a file with the information about that service. Therefore, for the purpose of producing the largest annotated Service Catalogue, the proposed approach needs to extract as many properties about these Web services as possible. Moreover, it also has to download the WSDL document hosted along with each Web service. That is to say, these properties comprise not only the interesting structured properties, such as the service name and its WSDL link address, but also other non-functional properties, for example the endpoint and its monitoring information.
(2) Flexible storage: each service has a Service Catalogue entry that contains all its interesting properties. A central question is how to deal with these service properties, that is, which schemes will be used to store them. In order to store them in a flexible way, the proposed approach provides three storage methods: the first stores them as an XML file, the second stores them in an INI file, and the third uses a database.
3.1.2 System Requirements for DWSC
Generally speaking, the requirements for realizing a programming project include the following:
1) Operating system: Linux/Unix, Windows (XP, Vista, 2000, etc.)
2) Programming language: C++, Java, Python, C, etc.
3) Programming tool: NetBeans, Eclipse, Visual Studio, and so on
In this master thesis, the scripts of the Deep Web Service Crawler approach are written in the Java programming language. These scripts have only been tested on the Windows XP and Linux operating systems; they have not been tested on other operating systems.
3.1.3 Non-Functional Requirements for DWSC
In this part, several non-functional requirements for the Deep Web Service Crawler approach are presented:
1) Transparency: the process of data exploration and data storage should be carried out automatically, without the user's intervention. However, at the beginning the user should specify the path on the hard disk that will be used to store the outputs of this program.
2) Fault tolerance: during the execution of this program, errors will inevitably happen. Therefore, in order to keep the process from being interrupted, the necessary error handling for process recovery must be in place.
3) Completeness: this approach should extract as many of the interesting properties of each Web service as possible, e.g., endpoint and monitoring information.
In addition, since the Pica-Pica Web Service Description Crawler has already implemented strategies for the following five URLs, the proposed approach must support no fewer than these five Web Service Registries:
Biocatalogue: http://www.biocatalogue.com
Ebi: http://www.ebi.ac.uk
Seekda: http://www.seekda.com
Service-Repository: http://www.service-repository.com
Xmethods: http://www.xmethods.net
3.2 Deep Web Services Crawler Architecture
In this section, an overview of the high-level architecture of the Deep Web Services Crawler approach is introduced first. Thereafter, four subsections are presented that outline each single component and how they play together.
The current components and data flows of the Deep Web Service Crawler can be summarized as depicted in Figure 3-1 (continuous arrows). The crawler first tries to obtain the available service list page links and the related service page links by crawling the Web (Web Service Extractor). Those gathered links are then processed in two separate steps: one obtains the service's WSDL document (WSDL Grabber), and the other collects the properties of each service (Property Grabber). Finally, all these data are stored on the storage device (Storage). The detailed process of figure 3-1 is illustrated as follows:
Step 1:
When the Deep Web Service Crawler starts to run, the File Chooser dialog requires the user to specify a path on the computer or on any other hard disk. The reason for specifying the path is that the Deep Web Service Crawler program needs a place to store all of its outputs.
Step 2:
After that, the Web Service Extractor is triggered. It is the main entry point to the actual crawling process. Since the Deep Web Service Crawler program is supposed to crawl for Web services in given Web Service Registries, the URL addresses of these Web Service Registries are given as initial seeds for this Web Service Extractor process. However, since the page structures of these Web Service Registries are completely different, there is a dedicated process for each Web Service Registry.
Figure 3-1: Overview of the basic architecture of the Deep Web Services Crawler
Step 3:
According to the given seed, two types of links are obtained by the Web Service Extractor component: service list page links and service page links. A service list page is a page that contains a list of Web services and possibly some information about these Web services, while a service page is a page that contains much more information about a single service. Finally, the extractor forwards these two types of links to the next two components, the Property Grabber and the WSDL Grabber.
Step 4:
On the one hand, the Property Grabber component tries to gather the information about the service that is hosted in the service list page and the service page, such as the name of the service, its description, the rating of the service, etc. Finally, all the information about the service is collected together as the service properties, which are then delivered to the Storage component for further processing.
Step 5:
On the other hand, the WSDL Grabber tries to obtain the WSDL link from the service list page or the service page. This is because for some Web Service Registries the WSDL link is hosted in the service list page (e.g., Biocatalogue), while for other Web Service Registries it is hosted in the service page (e.g., Xmethods). After the WSDL link is obtained, it is likewise transmitted to the Storage component for further processing.
Step 6:
When the service properties and the WSDL link of a service are received by the Storage component, it stores them on disk. The service properties are stored in one of three different ways: as an XML file, as an INI file, or as one record in a database table. For the WSDL link, the Storage component first tries to download the page content at the URL address of the WSDL link. If this succeeds, the page content of the service is stored on disk as a WSDL document.
Step 7:
Steps 3 to 6 describe the crawling process for a single service. Hence, if there is more than one service, or more than one service list page, in a Web Service Registry, the crawling process from step 3 to step 6 is repeated again and again until no service or service list page remains in that Web Service Registry.
Step 8:
Furthermore, after the crawling process for one Web Service Registry finishes, a file is generated that contains some statistical information about this crawling process: for example, the times when the crawling process of this Web Service Registry started and finished, the total number of Web services in this Web Service Registry, how many services have an empty WSDL document, the average number of service properties in this Web Service Registry, and the average time cost for extracting the service properties, getting the WSDL document, and generating the XML file, INI file, etc.
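To make the control flow of steps 3 to 7 concrete, the following Java sketch outlines one possible shape of the per-registry crawling loop. All type and method names in it are hypothetical; the sketch only mirrors the data flow described above, not the actual implementation.

    import java.util.List;
    import java.util.Map;

    // Hypothetical outline of the crawling loop of steps 3 to 7.
    public class CrawlLoopSketch {

        interface Extractor {
            List<String> serviceListPageLinks(String seed);       // step 3
            List<String> servicePageLinks(String listPageLink);   // step 3
        }

        interface Grabbers {
            Map<String, String> properties(String listPage, String servicePage); // step 4
            String wsdlLink(String listPage, String servicePage);                // step 5
        }

        interface Storage {
            void store(Map<String, String> properties, String wsdlLink);         // step 6
        }

        static void crawlRegistry(String seed, Extractor ex, Grabbers gr, Storage st) {
            // Step 7: repeat for every service list page and every service.
            for (String listPage : ex.serviceListPageLinks(seed)) {
                for (String servicePage : ex.servicePageLinks(listPage)) {
                    st.store(gr.properties(listPage, servicePage),
                             gr.wsdlLink(listPage, servicePage));
                }
            }
        }
    }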
3.2.1 The Function of the Web Service Extractor Component
The Web Service Extractor component is responsible for gathering service list page links and service page links. It pursues a focused crawl of the Web and only forwards service list page and service page links to the subsequent components for analysis and collection purposes. Therefore, it identifies both the service list page links and the related service page links on these Web Service Registries.
As can be seen from figure 3-2, a crawl for Web services needs to start from a URL seed. The seed is almost as important as the Web Service Extractor itself, as it highly influences which part of the Web needs to be crawled. The seed can or shall contain, e.g., Web pages where Web services are published or which talk about Web services.
Figure 3-2: Overview of the process flow of the Web Service Extractor component
After being fed with the URL seed, the Web Service Extractor component starts to get the service list page links from the initial page of this URL seed. However, this process differs among the five Web Service Registries. The following shows the different situations in these Web Service Registries:
Service-Repository Web Service Registry:
In this Web Service Registry, the link of the first service list page is the URL address of its seed, which means that some Web services can already be found on the home page of the Service-Repository Web Service Registry. The word "first" implies that there is more than one service list page link in this Web Service Registry. Therefore, the process of getting service list page links in this registry is carried on continually until no more service list page links exist.
Xmethods Web Service Registry:
Although there are Web services on the home page of the Xmethods Web Service Registry, they are only a small subset of the services in this Web Service Registry. Moreover, in the Xmethods Web Service Registry there is a single page containing all Web services. Therefore, the extractor has to get the service list page link of that page.
Ebi Web Service Registry:
The situation in the Ebi Web Service Registry is somewhat like that in the Xmethods Web Service Registry. That is to say, there is also one page that contains all Web services of this Web Service Registry. However, this page is not the initial page of the input seed. Therefore, more than one operation step is needed to get the service list page link of that page.
Seekda Web Service Registry:
In the Seekda Web Service Registry, the Web services are not contained in the initial page of the input seed. The service list page link can be obtained after several additional operation steps. However, there is a problem with getting the service list page links in this registry: simply put, if there is more than one page containing Web services, then for some unknown reason the links of the remaining service list pages cannot be obtained; in other words, only the link of the first service list page can be obtained.
Biocatalogue Web Service Registry:
The process of getting service list pages in the Biocatalogue Web Service Registry is almost the same as in the Seekda Web Service Registry. The only difference is that in the Biocatalogue Web Service Registry all service list page links can be obtained if there is more than one service list page.
Then, after getting the link of a service list page, the Web Service Extractor begins to get the link of the service page for each service listed in the service list page. The reason why it can do this is that there is an internal link for every service that leads to the service page. It is worth noting that once a service page link is obtained, this component immediately forwards the service page link and the service list page link to the subsequent two components for further processing. The process of obtaining service page links is carried out continuously until all services listed in the service list page have been crawled. Analogously, the process of getting service list pages is carried out continuously until no more service list pages exist.
3.2.1.1 Features of the Web Service Extractor Component
The main features are described in the following paragraphs.
1) Obtain service list page links
A central task is to obtain the corresponding service list page links. A service list page link is a URL address that leads to a public list of Web services together with some simple information about these Web services, like the name of the service, an internal URL that links to another page containing the detailed information about that service, and sometimes the link address of the WSDL document.
2) Obtain service page links
Once service list page links are found, a crucial aspect is to extract the internal link of each service to aid the task of service information discovery. Therefore, it is the task of the Web Service Extractor to harvest the HTML page content of the service list page, so that the service page links, which lead to much more detailed information about the single Web service, can be obtained.
3.2.1.2 Input of the Web Service Extractor Component
This component depends on specific input seeds. The only input required for this component is a URL seed. This URL seed is one of the URLs listed in section 3.1.3.
3.2.1.3 Output of the Web Service Extractor Component
The component produces two kinds of service-related page links from the Web:
- Service list page links
- Service page links
3.2.1.4 Demonstration of the Web Service Extractor
In order to give a comprehensive understanding of the process of the Web Service Extractor component, the following figures serve as an explanation. Though there are five URL addresses in this section, only the URL of the Service-Repository is shown as an example.
1) The input seed is the initial URL address of the Service-Repository, which is "http://www.service-repository.com".
2) As already stated in section 3.2.1, the first service list link of this Web Service Registry is its input seed, "http://www.service-repository.com". Figure 3-3 shows the corresponding service list page of that link.
Figure 3-3: Service list page of the Service-Repository
Figure 3-4: Original source code of the internal link for the Web service "BLZService"
Figure 3-5: Code overview of getting the service page link in the Service-Repository
Figure 3-6: Service page of the Web service "BLZService"
3) Now that the service list page link is known, the next step is to acquire the service page links of the services listed in the service list page. The text in the red box of figure 3-4 shows the internal link of the Web service "BLZService". However, this is not the complete link of the service page: it has to be prefixed with the initial URL address of the Service-Repository Web Service Registry, "http://www.service-repository.com". The code for getting the service page link of a Web service in the Service-Repository is shown in figure 3-5. Therefore, the final link of this service page is "http://www.service-repository.com/service/overview/-210897616". Figure 3-6 shows the corresponding service page of that link.
4) Afterwards, these two links, the service list page link and the service page link, which were gathered by the Web Service Extractor component, are immediately forwarded to the next two components, the WSDL Grabber component and the Property Grabber component.
3.2.2 The Function of the WSDL Grabber Component
The WSDL Grabber component acquires the WSDL link hosted on the Web, based on the service list page link or the service page link. The whole process flow is illustrated in figure 3-7.
Figure 3-7: Overview of the process flow of the WSDL Grabber component
When the WSDL Grabber component receives the inputs delivered by the previous component, it starts to get the WSDL link for the service based on these inputs. However, although the inputs of the WSDL Grabber component are the links of the service page and the service list page, only one of them contains the WSDL link for the corresponding service. That is to say, the WSDL link exists either in the service page or in the service list page. The reason why both links need to be delivered to this component is that only one of the five Web Service Registries, namely Biocatalogue, hosts the WSDL link in the service list page, while for the other four Web Service Registries the WSDL link is hosted in the service page. Therefore, the WSDL links of these four Web Service Registries are obtained via the service page link, and for the Biocatalogue Web Service Registry the WSDL link is obtained through the service list page link. However, there is a problem with getting the WSDL links of the Biocatalogue Web Service Registry: in brief, some of the Web services listed in the service list page of the Biocatalogue Web Service Registry have no WSDL link; in other words, these services have no WSDL document. In such a situation, the value of the WSDL link of these Web services is assigned a "NULL" value. For the Web services in the other four Web Service Registries, in contrast, the WSDL link always exists in the service page. Eventually, whenever the WSDL Grabber component has extracted the WSDL link of one single Web service, the link is immediately forwarded to the Storage component so that the WSDL document can be downloaded at once.
3.2.2.1 Features of the WSDL Grabber Component
The WSDL Grabber component has to provide the following feature:
- Obtain WSDL links
The WSDL link is the direct way to get to the page that contains the contents of the WSDL document. It is actually a URL address, but at the end of this URL address there is usually something like "wsdl" or "WSDL" indicating that this address leads to the page of a WSDL document.
3.2.2.2 Input of the WSDL Grabber Component
This component requires the following input data:
- Service list page link
- Service page link
3.2.2.3 Output of the WSDL Grabber Component
The component only produces the following output data:
- The URL address of the WSDL link of each service
3.2.2.4 Demonstration of the WSDL Grabber Component
This section demonstrates a list of figures in order to give a comprehensive understanding of the process of the WSDL Grabber component. It uses the same Web service of the Service-Repository as an example.
1) The input for this WSDL Grabber component is the link of the service page obtained by the Web Service Extractor component. The address of this link is "http://www.service-repository.com/service/overview/-210897616".
Figure 3-8: WSDL link of the Web service "BLZService" in the service page
2) Figure 3-9 shows the corresponding original source code of the WSDL link highlighted in figure 3-8.
Figure 3-9: Original source code of the WSDL link for the Web service "BLZService"
3) Figures 3-10 and 3-11 show the code used to extract the WSDL link displayed in figure 3-9. Note that figure 3-10 is the particular code for the Service-Repository Web Service Registry only; for the other four Web Service Registries this code differs. The function "getServiceRepositoryWSDLLink" first gets a list of nodes that have the same HTML tag name, "b". Then it checks all of these nodes one by one to see whether the text value of a node is "WSDL". If a node fulfills this condition, the attribute value of its sibling, which is an "a" element here, is extracted as the value of the WSDL link for this Web service.
Figure 3-10: Code overview of the "getServiceRepositoryWSDLLink" function
Figure 3-11: Code overview of the "oneParameter" function
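Since figures 3-10 and 3-11 are screenshots, the extraction logic is sketched below as a reconstruction. It uses the jsoup HTML parser for illustration, which is an assumption: the original code may rely on a different HTML library.

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class WsdlLinkExtractor {

        // Reconstruction of the described logic: find a <b>WSDL</b> label
        // and read the link from the neighbouring <a> element.
        public static String getServiceRepositoryWSDLLink(String servicePageUrl)
                throws java.io.IOException {
            Document doc = Jsoup.connect(servicePageUrl).get();
            for (Element b : doc.getElementsByTag("b")) {
                if ("WSDL".equals(b.text().trim())) {
                    Element sibling = b.nextElementSibling();
                    if (sibling != null && "a".equals(sibling.tagName())) {
                        return sibling.attr("href");   // the WSDL link
                    }
                }
            }
            return null; // no WSDL link found on this service page
        }
    }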
4) Finally, after applying the functions shown in figures 3-10 and 3-11, the WSDL link for the Web service "BLZService" can be obtained. Its value is "http://services.unitedplanet.de/blz/BlzService.asmx?WSDL".
3.2.3 The Function of the Property Grabber Component
The Property Grabber component is a module used to extract and gather all the Web service information hosted on the Web, that is, the information shown in the service list page and the service page. In the end, all the obtained Web service information is collected together as the service properties, which are delivered to the Storage component for storing. The detailed process flow of the Property Grabber component is illustrated in figure 3-12.
The inputs of the Property Grabber component are the same as the inputs of the WSDL Grabber component. However, there is still a small difference between them with respect to the seeds. As already mentioned in section 3.2.2, for the WSDL Grabber component one of the inputs is sufficient to get the WSDL link, while the Property Grabber component needs both of them as seeds. Once the Property Grabber component receives the needed inputs, it starts to extract the information of that single Web service.
Figure 3-12: Overview of the process flow of the Property Grabber component
After the Property Grabber component receives the inputs, it starts to extract the service information of the Web service. Generally speaking, the service information consists of four aspects: structured information, endpoint information, monitoring information, and Whois information.
(1) Structured Information
The structured information can be obtained by extracting the information hosted in the service page and the service list page. It is the basic descriptive information about the service, such as the name of the service, the URL address through which the WSDL document can be obtained, the description introducing the service, the provider who provides the service, its rating, the server that owns the service, etc. However, the elements constituting this structured information differ among the Web Service Registries. For example, the rating information of a Web service exists in the Service-Repository Web Service Registry, while the Xmethods Web Service Registry does not have this information. In addition, even for Web services within the same Web Service Registry, some elements of the structured information may be missing. For instance, one service in a Web Service Registry may have a description, while another service in the same Web Service Registry does not. Tables 3-1 to 3-5 show the structured information that should be extracted for these five Web Service Registries. Moreover, if the style of a service in the Biocatalogue Web Service Registry is SOAP, there is some additional information describing the SOAP operations of this service, while if it is REST, this additional information concerns the REST operations. These should also be considered a part of the structured information. Table 3-6 and table 3-7 list the information for these two different kinds of operations.
Service Name, WSDL Link, WSDL Version, Provider, Server, Rating, Homepage, Owner Homepage, Description
Table 3-1: Structured information of the Service-Repository Web Service Registry

Service Name, WSDL Link, Provider, Service Style, Homepage, Implementation Language, Description, User Description, Contributed Client Name, Type of this Client, Publisher for this Client, Used Toolkit of this Client, Used Language of this Client, Used Operating System of this Client
Table 3-2: Structured information of the Xmethods Web Service Registry

Service Name, WSDL Link, Server, Provider, Provider's Country, Service Style, Rating, Description, User Description, Service Tags, Documentation (within WSDL)
Table 3-3: Structured information of the Seekda Web Service Registry

Service Name, WSDL Link, Port Name, Service URL Address, Implementation Class
Table 3-4: Structured information of the Ebi Web Service Registry

Service Name, WSDL Link, Style, Provider, Provider's Country, View Times, Favorite Times, Submitter, Service Tags, Total Annotation, Provider Annotation, Member Annotation, Registry Annotation, Base URL, SOAP Lab Server Base URL, Description, User Description, Category
Table 3-5: Structured information of the Biocatalogue Web Service Registry

SOAP Operation Name, Inputs and Outputs, Operation Description, Operation Tags, Part of Which Service
Table 3-6: SOAP operation information of the Biocatalogue Web Service Registry

REST Operation Name, Service Tags, Used Template, Operation Description, Part of Which Service, Part of Which Endpoint Group
Table 3-7: REST operation information of the Biocatalogue Web Service Registry
(2) Endpoint Information
Endpoint information is the port information of the Web service, which can be extracted only from the service page. Since different Web Service Registries structure the endpoint information of their Web services differently, some elements of the endpoint information can be very diverse. One thing needs attention: the Ebi Web Service Registry has no endpoint information for any of the Web services published in it. Moreover, although the Web services within one Web Service Registry share the same structure of this endpoint information, some elements of the endpoint information can be missing or empty. Furthermore, a Web Service Registry may even have no endpoint information for some of the Web services published by it. Nevertheless, whenever there is endpoint information for a Web service, it contains at least one element, namely the URL address of the endpoint. Table 3-8 shows the endpoint information that should be extracted for these five Web Service Registries.
Web Service Registry Name    Elements of the Endpoint Information
Service Repository           Endpoint Name, Endpoint URL, Endpoint Critical, Endpoint Type, Bound Endpoint
Xmethods                     Endpoint URL, Publisher of this Endpoint, Contact Email of this Publisher, Implementation Language of this Endpoint
Seekda                       Endpoint URL
Biocatalogue                 Endpoint Name, Endpoint URL
Table 3-8: Endpoint information of these five Web Service Registries

Web Service Registry Name    Elements of the Monitoring Information
Service Repository           Service Availability, Number of Downs, Total Uptime, Total Downtime, MTTR, MTBF, RTT Max of Endpoint, RTT Min of Endpoint, RTT Average of Endpoint, Ping Count of Endpoint
Seekda                       Service Availability, Begin Time of Monitoring
Biocatalogue                 Monitored Status of Endpoint, Overall Status of Service
Table 3-9: Monitoring information of these five Web Service Registries
(3) Monitoring Information
Monitoring information is statistical information obtained by testing the Web service. It is worth noting that the Ebi and Xmethods Web Service Registries have no monitoring information for any of the Web services published by them, while in the other three Web Service Registries only a few Web services may lack this information. Table 3-9 lists the monitoring information for these three Web Service Registries.
(4) Whois Information
The Whois information is not extracted from the information hosted in the service page and the service list page. It is descriptive information about the service domain, which can be gained by means of the address of the WSDL link. Because of that, the process of getting the Whois information starts after the service domain has been obtained (a sketch of this normalization step follows at the end of this subsection). The final value of the service domain must not contain strings like "http", "https", "www", etc.; it must be the top-level part of the domain. After that, the service domain database is queried by sending the value of the service domain to a Whois client, which is simply a Web site on the Internet, for example "http://www.whois365.com/cn/domain". If information about that service domain exists, a list of its information is returned as the output. However, the structure of the returned information differs from service domain to service domain. Therefore, the most challenging thing is that the extraction process has to deal with each different form of the returned information. Table 3-10 lists the Whois information that needs to be extracted for all five Web Service Registries.
Service Domain URL, Domain Name, Domain Type, Domain Address, Domain Description, State, Postal Code, City, Country, Country Code, Phone, Fax, Email, Organization, Established Time
Table 3-10: Whois information for these five Web Service Registries
Finally, all the information of these four aspects is collected together and then delivered to the Storage component for further storage processing.
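The domain normalization step described under (4) can be sketched as follows; the two-label heuristic at the end is a simplifying assumption and would need refinement for multi-part suffixes such as "ac.uk".

    import java.net.URL;

    public class ServiceDomainExtractor {

        // Derives the service domain from a WSDL link, e.g.
        // "http://www.thomas-bayer.com/axis2/services/BLZService?wsdl"
        // yields "thomas-bayer.com".
        public static String serviceDomain(String wsdlLink) throws Exception {
            String host = new URL(wsdlLink).getHost(); // strips http(s):// and the path
            String[] labels = host.split("\\.");
            if (labels.length <= 2) {
                return host;
            }
            // Naive heuristic: keep the last two labels; this drops prefixes
            // like "www" but is wrong for suffixes like "ac.uk".
            return labels[labels.length - 2] + "." + labels[labels.length - 1];
        }
    }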
3.2.3.1 Features of the Property Grabber Component
The Property Grabber component has to provide the following features:
- Obtain basic information
Generally speaking, the more information is available about a Web service, the better one can judge how good this Web service is. Hence, the Property Grabber component needs to extract all the basic information hosted in the service list page and the service page. This basic information comprises the structured information, the endpoint information, and the monitoring information.
- Obtain Whois information
For the same reason, namely that more information about a Web service allows a better assessment of its quality, it is necessary to extract as much information about the Web service as possible. Therefore, besides the basic information, the Property Grabber component also obtains some additional information called Whois information, such as the type of the domain, the name of the person in charge, the postal code of the domain, the city, phone, fax, the detailed address, email, etc.
3.2.3.2 Input of the Property Grabber Component
This component requires the following input data:
- Service list page link
- Service page link
3.2.3.3 Output of the Property Grabber Component
The component produces the following output data:
- Structured information of each service
- Endpoint information of each service, if it exists
- Monitoring information of the service and its endpoint, if it exists
- Whois information of the service domain
All this information is collected together as the properties of each service. Thereafter, the collected properties are sent to the Storage component.
3.2.3.4 Demonstration of the Property Grabber Component
The pictures from figure 3-13 to figure 3-16 depict the fundamental procedure of the Property Grabber component. To simplify the explanation, the example shown in this section uses the same Web service as before.
1) The inputs of the Property Grabber component are the links of the service list page and the service page received from the Web Service Extractor component. These links are "http://www.service-repository.com" and "http://www.service-repository.com/service/overview/-210897616".
2) When the "GetProperty" function is triggered, it calls its four sub-functions shown in figure 3-12.
Figure 3-13: Structured properties of the service "BLZService" in the service list page
Figure 3-14: Structured properties of the service "BLZService" in the service page
3) Firstly, the "getStructuredProperty" function tries to extract the structured information displayed in the red boxes of figure 3-13 and figure 3-14. However, several elements of the structured information have the same content, like the description shown in the service page and in the service list page. Hence, in order to save extraction time and storage space, elements with the same content are extracted only once. Moreover, the rating information requires a transformation from non-descriptive to descriptive text, because its content consists of several star images. The final results of the extracted structured information of this Web service are shown in table 3-11. Since there is no descriptive information for the provider, homepage, and owner homepage, their values are assigned "NULL".
Service Name:       BLZService
WSDL Link:          http://www.thomas-bayer.com/axis2/services/BLZService?wsdl
WSDL Version:       0
Server:             Apache-Coyote/1.1
Description:        BLZService
Rating:             Four stars and a half
Provider:           NULL
Homepage:           NULL
Owner Homepage:     NULL
Table 3-11: Extracted structured information of the Web service "BLZService"
4) Secondly, the "getEndpointProperty" function extracts the endpoint information displayed in the red box of figure 3-15. Though there is more than one endpoint record in the list, only one of them is extracted as the endpoint information. The reason is that this master program is intended to extract as much information as possible, but this information should not contain redundant entries. Therefore, only one endpoint record is extracted as the endpoint information, even if there is more than one. Table 3-12 shows the final results of the endpoint information of this Web service.
Figure 3-15: Endpoint information of the Web service "BLZService" in the service page
Endpoint Name:      BLZServiceSOAP12port_http
Endpoint URL:       http://www.thomas-bayer.com:80/axis2/services/BLZService
Endpoint Critical:  True
Endpoint Type:      production
Bound Endpoint:     BLZServiceSOAP12Binding
Table 3-12: Extracted endpoint information of the Web service "BLZService"
5) Then it is time to extract the monitoring information by invoking the "getMonitoringProperty" function. Figure 3-16 displays two types of monitoring properties: the upper red box contains the monitoring information about the Web service, and the lower red box lists the monitoring information of its endpoints. As already mentioned before, only one endpoint statistic record is extracted. Besides, as can be seen in figure 3-16, there are two availability values. They both represent the availability of this Web service, just like the availability shown in figure 3-14; therefore one of these availability values is sufficient. Table 3-13 shows the final results of this extraction process.
Figure 3-16: Monitoring information of the service "BLZService" in the service page
Service Availability:       100%
Number of Downs:            0
Total Uptime:               1 day, 19 hours, 19 minutes
Total Downtime:             0 seconds
MTBF:                       1 day, 19 hours, 19 minutes
MTTR:                       0 seconds
RTT Max of Endpoint:        141 ms
RTT Min of Endpoint:        0 ms
RTT Average of Endpoint:    57.7 ms
Ping Count of Endpoint:     112
Table 3-13: Extracted monitoring information of the Web service "BLZService"
6) Next, in order to extract the Whois properties, the "getWhoisProperty" function first has to derive the service domain from the WSDL link. For this Web service, the obtained service domain is "thomas-bayer.com". The function then sends this service domain as input to the Whois client for the querying process, which returns a list of information for that service domain, see figure 3-17. Table 3-14 shows the extracted Whois information.
Figure 3-17: Whois information of the service domain "thomas-bayer.com"
Service Domain URL:     thomas-bayer.com
Domain Name:            Thomas Bayer
Domain Type:            NULL
Domain Address:         Moltkestr. 40
Domain Description:     NULL
State:                  NULL
Postal Code:            54173
City:                   Bonn
Country:                NULL
Country Code:           DE
Phone:                  +4922855525760
Fax:                    NULL
Email:                  info@predic8.de
Organization:           predic8 GmbH
Established Time:       NULL
Table 3-14: Extracted Whois information of the service domain "thomas-bayer.com"
7) Finally, all the information of these four aspects is collected together as the service properties, and these service properties are then forwarded to the Storage component.
3.2.4 The Function of the Storage Component
The WSDL link coming from the WSDL Grabber component is used by the Storage component to download the WSDL document from the Web and store it on disk thereafter. In addition, the service properties coming from the Property Grabber component are also stored directly on disk, in three different manners, by this Storage component. Figure 3-18 illustrates the process flow of the Storage component.
When the Storage component receives the needed inputs, the mediator function "Storager" of the Storage component is triggered. It then transforms the service properties into three different output formats and stores them on disk. These output formats are an XML file, an INI file, and database records. Besides, it also tries to download the WSDL document from the URL address of the WSDL link and then stores the obtained WSDL document on disk, too. This "Storager" function is composed of four sub-functions, namely the "getWSDL", "generateXML", "generateDatabase", and "generateINI" sub-functions. Each sub-function is in charge of one aspect of the storage tasks.
Figure 3-18: Overview of the process flow of the Storage component
(1) "getWSDL" sub-function
The task of the "getWSDL" sub-function is to download the WSDL document and then store it on disk. Above all, it has to get the content of the WSDL document. This procedure works as follows. First, the "getWSDL" sub-function checks whether the value of the received WSDL link equals "NULL". As already presented in section 3.2.2, if a Web service has no WSDL link, the value of its WSDL link is assigned "NULL". In this case, the sub-function creates a WSDL document whose name is the service name appended with the mark "No WSDL Document"; obviously, this document has no content, it is an empty document. If the service does have a WSDL link, this sub-function tries to connect to the Internet using the URL address of the WSDL link. Once it succeeds, all the content hosted on the Web is downloaded, stored on disk, and named only with the name of the service. Otherwise, a WSDL document is created whose name is prefixed with "Bad".
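A minimal sketch of this download logic, under the naming conventions just described, could look as follows; the use of java.net.URL streaming is an illustrative choice, not necessarily that of the actual implementation.

    import java.io.IOException;
    import java.io.InputStream;
    import java.net.URL;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.nio.file.StandardCopyOption;

    public class GetWsdl {

        // Downloads the WSDL document of a service, or creates a marker file
        // when the link is missing ("NULL") or cannot be read ("Bad" prefix).
        public static void getWSDL(String serviceName, String wsdlLink, String path)
                throws IOException {
            if (wsdlLink == null || wsdlLink.equals("NULL")) {
                // No WSDL link: create an empty marker document.
                Files.write(Paths.get(path, serviceName + "[No WSDL Document].wsdl"),
                        new byte[0]);
                return;
            }
            try (InputStream in = new URL(wsdlLink).openStream()) {
                Path file = Paths.get(path, serviceName + ".wsdl");
                Files.copy(in, file, StandardCopyOption.REPLACE_EXISTING); // store on disk
            } catch (IOException e) {
                // Link unreachable: create a document prefixed with "Bad".
                Files.write(Paths.get(path, "Bad" + serviceName + ".wsdl"), new byte[0]);
            }
        }
    }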
(2) "generateXML" sub-function
The "generateXML" sub-function takes the service properties as input, transforms them into an XML file, and stores it on disk under a name composed of the service name plus ".xml". XML stands for eXtensible Markup Language, a markup language designed to transport and store data. The first line of an XML file is the XML declaration, which defines the XML version and the encoding used. For example, '<?xml version="1.0" encoding="UTF-8"?>' states that this is an XML file whose version is 1.0 and whose encoding is the 8-bit Unicode Transformation Format character set. Besides, an XML file contains XML elements, each of which spans everything from the element's start tag to the element's end tag. Moreover, an XML element can contain other elements, simple text, or a mixture of both. However, an XML file must contain a root element as the parent of all other elements.
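The following sketch shows one way such a file could be produced with the standard Java DOM and transformer APIs; the element and attribute names ("service", "property", "name") are illustrative assumptions, since the actual tag names are not visible in the text.

    import java.io.File;
    import java.util.Map;
    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.transform.OutputKeys;
    import javax.xml.transform.Transformer;
    import javax.xml.transform.TransformerFactory;
    import javax.xml.transform.dom.DOMSource;
    import javax.xml.transform.stream.StreamResult;
    import org.w3c.dom.Document;
    import org.w3c.dom.Element;

    public class GenerateXml {

        // Writes the service properties as "<serviceName>.xml" into the given path.
        public static void generateXML(String serviceName,
                                       Map<String, String> properties,
                                       String path) throws Exception {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element root = doc.createElement("service");  // the mandatory root element
            doc.appendChild(root);
            for (Map.Entry<String, String> p : properties.entrySet()) {
                Element prop = doc.createElement("property");
                prop.setAttribute("name", p.getKey());
                prop.setTextContent(p.getValue());
                root.appendChild(prop);
            }
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.ENCODING, "UTF-8"); // XML declaration
            t.transform(new DOMSource(doc),
                        new StreamResult(new File(path, serviceName + ".xml")));
        }
    }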
(3) "generateINI" sub-function
The "generateINI" sub-function also takes the service's properties as input, but it transforms them into an INI file and stores it on disk under a name composed of the service's name plus ".ini". "ini" stands for initialization. The INI file format is a de facto standard for configuration files. INI files are simple text files with a basic structure. Generally speaking, an INI file contains three different kinds of parts: sections, parameters, and comments. The parameter is the basic element contained in an INI file. Its format is a key-value pair, also called a name-value pair. This pair is delimited by an equals sign "=", and the key or name always appears to the left of the equals sign. A section is like a room that groups all its parameters together. It always appears on a single line enclosed in a pair of square brackets "[]", and sections may not be nested. A comment is descriptive text that begins with a semicolon ";"; anything between the semicolon and the end of the line is ignored.
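A hypothetical excerpt of such an INI file for the service "BLZService" could look as follows; the section and key names are illustrative, since the exact layout of the generated files is not reproduced here.

    ; properties of the Web service BLZService (illustrative excerpt)
    [Structured Information]
    Service Name=BLZService
    WSDL Link=http://www.thomas-bayer.com/axis2/services/BLZService?wsdl
    Server=Apache-Coyote/1.1

    [Monitoring Information]
    Service Availability=100%
    Number of Downs=0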
(4) "generateDatabase" sub-function
The inputs of the "generateDatabase" sub-function are the same as those of the previous two sub-functions. Instead of transforming them into a file like XML or INI, the "generateDatabase" sub-function turns them into data in the database by using SQL statements. SQL stands for Structured Query Language, a database language designed for accessing and manipulating data in a database. Most of the actions performed on a database are done with SQL statements, and the primary statements of SQL include INSERT INTO, DELETE, UPDATE, SELECT, CREATE, ALTER, and DROP. Therefore, for the purpose of transforming these service properties into data in the database, this sub-function first has to create a database using the CREATE DATABASE statement. Then it should create a table to store the data. A table is a collection of related data entries and consists of columns and rows. Since the data of all five Web Service Registries are not very large, one database table is enough for storing these service properties. Because of that, the column names of the service properties have to be uniform and well-defined across all five Web Service Registries. Afterwards, the service properties of each single service can be put into the table as a record with the INSERT INTO statement of SQL.
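As a sketch of these two SQL steps, the following JDBC fragment creates such a table and inserts one record; the table name, the reduced column set, and the JDBC URL are illustrative assumptions (as described in section 3.2.4.4, the actual table uses the "Text" data type for every property column).

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.Statement;

    public class GenerateDatabase {

        public static void main(String[] args) throws Exception {
            // The JDBC URL and credentials are placeholders.
            try (Connection con = DriverManager.getConnection(
                    "jdbc:mysql://localhost/dwsc", "user", "password")) {
                try (Statement st = con.createStatement()) {
                    // One table for all registries; every property column is TEXT.
                    st.executeUpdate("CREATE TABLE IF NOT EXISTS services ("
                            + "service_name TEXT, wsdl_link TEXT, description TEXT)");
                }
                // Insert the properties of one service as a single record.
                try (PreparedStatement ps = con.prepareStatement(
                        "INSERT INTO services VALUES (?, ?, ?)")) {
                    ps.setString(1, "BLZService");
                    ps.setString(2, "http://www.thomas-bayer.com/axis2/services/BLZService?wsdl");
                    ps.setString(3, "BLZService");
                    ps.executeUpdate();
                }
            }
        }
    }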
3.2.4.1 Features of the Storage Component
The Storage component has to provide the following features:
- Generate different output formats
The final result of this master program is to store the information about the services on disk for future work. This Storage component provides three different formats for storing the service properties of the services published in the Web Service Registries. This makes the storage of the services very flexible and also durable.
- Obtain the WSDL document
The important task of this component is to check whether the WSDL document can be obtained from the WSDL link, because the WSDL document plays a decisive role in determining the quality of the service. This Storage component provides the ability to deal with the different situations that occur in the process of obtaining the WSDL document.
3.2.4.2 Input of the Storage Component
This component requires the following input data:
- WSDL link of each service
- The property information of each service
3.2.4.3 Output of the Storage Component
The component produces the following output data:
- WSDL document of the service
- XML document, INI file, and table records in the database
3.2.4.4 Demonstration of the Storage Component
The following figures describe the fundamental implementation code of this Storage component. The detailed depiction is as follows:
1) As can be seen from figure 3-19 to figure 3-21, there are several commonalities among the implementation codes. The first commonality concerns the parameters defined in each of these sub-functions, namely "path" and "SecurityInt". The parameter "path" is the absolute path on the computer disk; it is used in the storing procedure and has already been specified by the user at the beginning of the whole program. The parameter "SecurityInt" is an increasing integer that is used as part of the name of the service. The reason for this is that it prevents services with the same name from overwriting each other on disk. The content marked in red in the code of these figures is the second commonality: its function is to create a file or document for the service with the corresponding parameters.
2) Figure 3-19 displays the implementation code of the "getWSDL" sub-function, which is designed to get the WSDL document of a service based on its WSDL link. The parameter "name" is the name of the service, which is used as the name of the WSDL document. The parameter "linkStr" is the actual WSDL link, the most important parameter of this sub-function. The other two parameters, "statistic" and "log", are objects for the text files called "Statistic Information" and "Log Information", respectively. The "Statistic Information" text file is used to record the statistical data of the services of each Web Service Registry, such as the overall number of properties of that Web Service Registry, the overall number of services, the number of services that have no WSDL link, the number of services whose documents contain no content, the number of services whose WSDL links are not available, etc. The "Log Information" text file records the results of the process steps and the problems encountered, for example which service is being crawled at the moment, which Web Service Registry it belongs to, the reason why the WSDL document of a service cannot be obtained, the reason why a service is unreachable, and so on.
Figure 3-19: Implementation code for getting the WSDL document
3) Figure 3-20 and figure 3-21 show the code for turning the service properties into an XML file and an INI file, respectively, and for storing those two files on disk thereafter. The parameter "vec" is a Vector of the "PropertyStruct" data type. This "PropertyStruct" data type is an object of a class consisting of two variables: name and value.
Figure 3-20: Implementation code for generating the XML file
Figure 3-21: Implementation code for generating the INI file
4) The code in figure 3-22 and figure 3-23 shows the process that turns the service properties of all services in these five Web Service Registries into records of the database. To this end, a database has to be created first. The name of the database can be arbitrary, as long as it conforms to the naming rules of the database; the same holds for the name of the table. Figure 3-22 displays the procedure for creating a table in the database. Since it is hard to decide on the length of each service property, the data types of all service properties are set to "Text". Figure 3-23 shows the code for inserting the service properties into the table with the "insert into" statement.
Figure 3-22: Implementation code for creating a table in the database
Figure 3-23: Implementation code for generating table records
3.3 Multithreaded Programming for DWSC
Multithreaded programming is a built-in characteristic of the Java language. A multithreaded program contains two or more separate parts that can execute concurrently; each such part of a program is called a thread. The use of multithreading makes it possible to create programs that use the system resources efficiently, for example by making maximum use of the CPU, because the idle time of the CPU can be kept to a minimum.
In this master program, there are five Web Service Registries whose services need to be crawled. Moreover, the number of services published in each Web Service Registry differs considerably, so the running time spent on each Web Service Registry is different. Consequently, it can happen that a Web Service Registry with few services is kept waiting until another Web Service Registry with many more services finishes. Therefore, in order to reduce the waiting time for the other Web Service Registries and to maximize the use of the system resources, it is necessary to apply multithreaded programming to this master program. That is to say, this master program creates a thread for each Web Service Registry, and these threads are executed independently.
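A minimal sketch of this one-thread-per-registry scheme is given below; the crawlRegistry method is a placeholder for the complete per-registry crawling process.

    public class MultithreadedCrawl {

        // Placeholder for the complete crawling process of one registry.
        static void crawlRegistry(String seedUrl) {
            System.out.println("Crawling " + seedUrl
                    + " on " + Thread.currentThread().getName());
        }

        public static void main(String[] args) throws InterruptedException {
            String[] seeds = {
                "http://www.biocatalogue.com", "http://www.ebi.ac.uk",
                "http://www.seekda.com", "http://www.service-repository.com",
                "http://www.xmethods.net"
            };
            Thread[] threads = new Thread[seeds.length];
            for (int i = 0; i < seeds.length; i++) {
                final String seed = seeds[i];
                threads[i] = new Thread(() -> crawlRegistry(seed)); // one thread per registry
                threads[i].start();
            }
            for (Thread t : threads) {
                t.join();   // wait until all registries have finished
            }
        }
    }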
3.4 Sleep Time Configuration for the Web Service Registries
Since this master program is intended to download the WSDL documents and extract the service information of the Web services published in the Web Service Registries, it inevitably affects the performance of these Web Service Registries. In addition, in order not to exceed their throughput capacity, these Web Service Registries restrict the rate of access. Because of that, unknown errors can sometimes happen while this master program is executing: for instance, the program may continually halt at one point without getting any more WSDL documents and service information, the WSDL documents of some services of some Web Service Registries may not be obtainable, or some service information may be missing. Therefore, in order to obtain the largest number of Web services published in these five Web Service Registries without affecting their throughput, the access rate has to be configured for each service of all Web Service Registries.
Consequently, before going into the essential procedure for each single service of these Web Service Registries, this master program calls the system's built-in function "sleep(long milliseconds)". It is a public static function that causes the currently executing thread to sleep for the specified number of milliseconds; in other words, the thread temporarily ceases execution for a while. The following table shows the time interval of the sleep function for each Web Service Registry.
Web Service Registry Name    Time Interval (milliseconds)
Service Repository           8000
Ebi                          3000
Xmethods                     10000
Seekda                       20000
Biocatalogue                 10000
Table 3-15: Sleep time of these five Web Service Registries
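As a sketch, the throttling described above amounts to one sleep call at the beginning of each per-service iteration, with the interval taken from table 3-15.

    public class Throttle {

        // Pause before processing each single service of a registry.
        static void throttledCrawl(String servicePageLink, long sleepMillis) {
            try {
                Thread.sleep(sleepMillis);  // e.g. 8000 ms for the Service Repository
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // restore the interrupt flag
            }
            // ... continue with the essential per-service procedure here ...
        }
    }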
4 Experimental Results and Analysis
This chapter shows the quantitative experimental results of the prototype presented in chapter 3, and the analysis of these results is also described and explained. In order to obtain reasonably accurate results, the experiments were carried out more than five times; all the data displayed in the following tables and charts are averages over these runs.
4.1 Statistic Information for the Different Web Service Registries
This section discusses the amount statistics of the Web services published in these five Web Service Registries. This includes the overall number of Web services published in each Web Service Registry and the number of unavailable Web services, which have been archived because they may not be active anymore or are close to being inactive. Table 4-1 shows the service amount statistics of these five Web Service Registries.
Web Service Registry Name    Service Repository    Ebi    Xmethods    Seekda    Biocatalogue
Overall Services                             57     289         382       853            2567
Unavailable Services                          0       0           0         0             125
Table 4-1: Service amount statistics of these five Web Service Registries
In order to provide an intuitive view of the service amount statistics of these five Web Service Registries, figure 4-1 shows a bar chart derived from table 4-1. As the bar chart shows, on the one hand there is an ascending increase in the overall number of Web services from the Service Repository Web Service Registry to the Biocatalogue Web Service Registry. The Biocatalogue Web Service Registry in particular owns the largest number of Web services. This indicates that the Biocatalogue Web Service Registry has a much more powerful ability to provide Web services to users, because it contains far more services than the other four Web Service Registries. On the other hand, there are no unavailable services in any of the Web Service Registries except for the Biocatalogue Web Service Registry. That is to say, only the Biocatalogue Web Service Registry contains Web services that cannot be used by the users. To some degree this is useless, because these services cannot be used anymore, and they waste network resources on the Web. Therefore, all these unavailable services should be eliminated in order to reduce the waste of network resources.
Figure 4-1: Service amount statistics of these five Web Service Registries
4.2 Statistic Information for the WSDL Documents
Web Service Registry Name    Service Repository    Ebi    Xmethods    Seekda    Biocatalogue
Failed WSDL Links                             1       0          23       145              32
Without WSDL Links                            0       0           0         0              16
Empty Content                                 0       0           2         0               2
Table 4-2: Statistic information for the WSDL documents
Table 4-2 and figure 4-2 show the error statistics of the WSDL documents of the Web services in these five Web Service Registries. There are three aspects. The first one is the "Failed WSDL Links" of the Web services in these Web Service Registries: it is the overall number of Web services whose WSDL links are invalid. In other words, it is impossible to get the WSDL documents of these Web services via the URL addresses of their WSDL links; therefore, no WSDL document is created for them. The second aspect is the "Without WSDL Links" of the Web services of these Web Service Registries: it is the overall number of Web services in each Web Service Registry that have no WSDL link at all. For such Web services there can be no WSDL document, and the value of the WSDL link of such a Web service is "NULL". However, a WSDL document is still created; it has no content, and the name of this WSDL document contains the string "[No WSDL Document]". The third aspect is the "Empty Content", which represents the overall number of Web services that have WSDL links and whose URL addresses
are valid, but whose WSDL documents contain no content. In this case, a WSDL document whose name contains the string "(BAD)" is created.
Figure 4-2: Statistic information for the WSDL documents
4.3 Comparison of the Different Average Numbers of Service Properties
This section compares the average number of service properties in these five Web Service Registries. The average number of service properties is calculated by means of the following equation:

    ASP = ONSP / ONS    (1)

where
ASP is the average number of service properties of one Web Service Registry,
ONSP is the overall number of service properties in one Web Service Registry, and
ONS is the overall number of Web services that have already been crawled from the corresponding Web Service Registry.
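For illustration, with hypothetical numbers: a registry from which ONS = 57 services have been crawled and whose services carry ONSP = 1311 properties in total would have ASP = 1311 / 57 = 23 properties per service.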
Figure 4-3 shows the average number of service properties per Web service in these five Web Service Registries. As already mentioned, one of the measures for assessing the quality of the Web services in a Web Service Registry is the service information: the more information is available about a Web service, the better one knows that service, and consequently the corresponding Web Service Registry can offer Web services of better quality to its users. As can be seen from figure 4-3, the Service Repository and Biocatalogue Web Service Registries own a larger number of service properties than the other three Web Service Registries. This directly reflects that these two Web Service Registries provide more detailed information about the Web services published in them, so users can more easily choose the services they need, and they will also be more willing to use the Web services published in these two Web Service Registries. By contrast, the Xmethods and Seekda
Deep Web Service Crawler
51
Web Service Registries, which hold less service information about their Web services, offer less quality for these Web services. Users may therefore be less inclined to use the Web services provided in these two registries, not to mention the Web services published in the Ebi Web Service Registry.
Figure 4-3: Average number of service properties
Following the description presented in section 3.2.3, the causes of the differing numbers of service properties in these Web Service Registries may consist of the following points. First, the amount of structured information about the Web services differs across these five Web Service Registries, and part of the information for some Web services in a registry may even be missing or have an empty value. For example, the amount of structured information that is supposed to be extracted for the Web services in the Ebi Web Service Registry differs greatly from that in the Biocatalogue Web Service Registry. Secondly, there is a certain amount of endpoint information in almost all Web Service Registries except the Ebi Web Service Registry, which more or less reduces the overall number of service properties there. Thirdly, some Web Service Registries, such as Xmethods and Ebi, do not provide monitoring information, whereas the Service Repository Web Service Registry in particular publishes a large amount of monitoring information about its Web services that can be extracted from the Web. The last point, obviously, is the amount of whois information for these Web services. If the database of the whois client does not contain information about the service domain of a Web service, then no whois information can be extracted. Moreover, even if there is information about the service domain, the amount of that information can be very diverse. Therefore, if many service domains of the Web services in a registry have no or only little whois information, then the average number of service properties in that registry decreases greatly.

As a result, in order to help users get better acquainted with the Web services provided in a Web Service Registry and to distinguish these Web services, the Web Service Registry should do its best to offer as much information as possible for each of its published Web services.
[Figure 4-3 chart: average number of service properties per registry — Service Repository: 23, Ebi: 7, Xmethods: 17, Seekda: 17, Biocatalogue: 32]
4.4 Different Outputs of Web Services

The fundamental task of this master program is to obtain the WSDL documents of the Web services hosted in the Web Service Registries, as well as to extract and gather the properties of these Web services and store them on disk. Therefore, this section describes the different outputs of this master program, which include the WSDL documents of the Web services, the generated XML and INI files, and the data records of the service properties in the database.
Figure 4-4: WSDL document format of one Web service
The WSDL document of a Web service is simply read from the Web according to the URL address of its WSDL link, and the data is then stored on disk as the WSDL document. The name of the WSDL document is the service name plus the ending ".wsdl", which marks it as a WSDL document. In order to distinguish WSDL documents whose names would otherwise be identical although their contents differ, the name of each obtained WSDL document in one Web Service Registry is prefixed with a unique integer. Figure 4-4 shows the valid WSDL document format of a Web service; its name is "1BLZService.wsdl".
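The naming rules described here, together with the "[No WSDL Document]" and "(BAD)" markers from the statistics above, can be summarized in a small Python sketch. This is an illustration only, not the thesis code; fetch_wsdl is a hypothetical download function, and the exact placement of the marker strings within the file name is an assumption:

    import os

    def store_wsdl(index, service_name, wsdl_link, out_dir, fetch_wsdl):
        if wsdl_link is None:
            # No WSDL link at all: an empty marker file is still created.
            filename = "%d%s[No WSDL Document].wsdl" % (index, service_name)
            content = ""
        else:
            content = fetch_wsdl(wsdl_link) or ""
            if content.strip():
                # Normal case, e.g. "1BLZService.wsdl".
                filename = "%d%s.wsdl" % (index, service_name)
            else:
                # Valid link but empty content: the name marks the file as bad.
                filename = "%d%s(BAD).wsdl" % (index, service_name)
        with open(os.path.join(out_dir, filename), "w") as f:
            f.write(content)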
Besides, the obtained service properties are transformed into an XML file, an INI file, and data records in the database. Figure 4-5, Figure 4-6, and Figure 4-7 show these three output formats respectively. As can be seen from Figure 4-5, this is the INI file of the Web service; its name is "1BLZService.ini". The integer is the same as in the WSDL document, because both files belong to the same Web service. The first three lines in the INI file are service comments, which run from a semicolon to the end of the line; they provide the basic description of this INI file. The following line is the section, enclosed in a pair of brackets; it is important because it indicates that the lines behind it carry the information of this Web service. The remaining lines therefore contain the actual service information, as key-value pairs with an equals sign between key and value. Each service property starts at the beginning of a line.
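A minimal sketch of this INI layout, assuming illustrative comment texts and property names and values (the real file is shown in Figure 4-5):

    def write_ini(path, service_name, properties):
        with open(path, "w") as f:
            f.write("; service information file\n")        # comments start with ';'
            f.write("; generated by the crawler\n")
            f.write("; one section per Web service\n")
            f.write("[%s]\n" % service_name)               # section in brackets
            for key, value in properties.items():
                f.write("%s=%s\n" % (key, value))          # key=value pairs

    write_ini("1BLZService.ini", "BLZService",
              {"Provider": "thomas-bayer.com", "Availability": "high"})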
Figure 4-5: INI file format of one Web service
Figure 4-6: XML file format of one Web service
Figure 4-7: Database data format for all Web services
Furthermore, Figure 4-6 shows the generated XML file of the Web service "BLZService"; its name is "1BLZService.xml". Needless to say, this XML file belongs to the materials of the same Web service. Although the format of the XML file differs from that of the INI file, their essential contents are the same; that is, the values of the service properties do not differ, because both files are generated from the same collection of properties of the same Web service. The XML file also contains comments like those in the INI file, displayed between "<!--" and "-->". The section of the INI file corresponds roughly to the root of the XML file; accordingly, the values of the elements under the root "service" in this XML file are the values of the service properties of this Web service.
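A corresponding sketch of the XML layout; only the root element "service" is taken from the description above, while the comment text and the property names are illustrative assumptions:

    import xml.etree.ElementTree as ET

    def write_xml(path, properties):
        root = ET.Element("service")                      # the root element
        root.append(ET.Comment("service information"))    # comment between <!-- and -->
        for key, value in properties.items():
            child = ET.SubElement(root, key)              # one element per property
            child.text = str(value)
        ET.ElementTree(root).write(path, encoding="utf-8", xml_declaration=True)

    write_xml("1BLZService.xml", {"Provider": "thomas-bayer.com", "Availability": "high"})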
Eventually, as can be seen from Figure 4-7, this is the database table used to store the service information of all Web services in these five Web Service Registries. The entire service information of one Web service occupies exactly one record in this table. Because of that, the column names of the table have to be the union of the names of the service information in each Web Service Registry. However, since the column names of the table must be unique, the redundant names in this union have to be eliminated. This is sensible and possible because the names of the service information are well-defined and uniform across all five Web Service Registries. In addition, the first column of this table is the primary key, an increasing integer whose function resembles the integer contained in the names of the XML and INI files, while the remaining columns are the corresponding service properties of the Web services. The value "NULL" or "Null" in this table indicates that the respective property of that Web service is empty or missing.
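The column-union logic can be sketched as follows; SQLite is used here purely for illustration (the thesis does not name the database engine in this section), and the property names are examples:

    import sqlite3

    def create_service_table(conn, property_names_per_registry):
        columns = []
        for names in property_names_per_registry:
            for name in names:
                if name not in columns:       # eliminate redundant column names
                    columns.append(name)
        cols_sql = ", ".join('"%s" TEXT' % c for c in columns)
        # First column: an auto-incrementing integer primary key.
        conn.execute("CREATE TABLE IF NOT EXISTS services "
                     "(id INTEGER PRIMARY KEY AUTOINCREMENT, %s)" % cols_sql)
        return columns

    def insert_service(conn, columns, properties):
        values = [properties.get(c) for c in columns]   # missing properties become NULL
        sql = "INSERT INTO services (%s) VALUES (%s)" % (
            ", ".join('"%s"' % c for c in columns),
            ", ".join("?" for _ in columns))
        conn.execute(sql, values)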
4.5 Comparison of the Average Time Cost for Different Parts of a Single Web Service

This section describes the comparison of the average time cost of the different parts of getting one single Web service in all five Web Service Registries. At first, the average time cost of getting one single service in a Web Service Registry has to be calculated. It can be obtained through the following equation:

    ATC = OTS / ONS                                   (2)

where
ATC is the average time cost for one single Web service,
OTS is the overall time cost of all the Web services in one Web Service Registry, and
ONS is the overall number of Web services that have already been crawled from the corresponding Web Service Registry.
In addition, the average time cost of getting one single service splits into the following six parts: the average time cost for extracting the service properties, for obtaining the WSDL document, for generating the XML file, for generating the INI file, for inserting the service properties into the database table, and for some other procedures, such as getting the service list page link, getting the service page link, and so on. The average time cost for extracting the service properties is obtained by means of the following equation:

    ATCSI = OTSSI / ONS                               (3)

where
ATCSI is the average time cost for extracting the service properties of one single Web service,
OTSSI is the overall time cost for extracting the service properties of all the Web services in one Web Service Registry, and
ONS is the overall number of Web services that have already been crawled from the corresponding Web Service Registry.
The calculation of the other parts is similar to the equation for the average time cost for extracting the service properties, while the average time cost for the other procedures equals the average time cost for one single Web service minus the sum of the average time costs of the other five parts.
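Equations (2) and (3) and the remainder rule for the "Others" part can be put into a short sketch; the check at the end uses the Ebi row of Table 4-3 below:

    def average_time_cost(overall_time_ms, num_services):
        return float(overall_time_ms) / num_services   # equations (2) and (3)

    def others_cost(overall_avg, part_averages):
        # "Others" = overall average minus the sum of the five measured parts.
        return overall_avg - sum(part_averages)

    # Ebi row of Table 4-3 (milliseconds): property, WSDL, XML, INI, database.
    print(others_cost(823.0, [699, 82, 2, 1, 28]))     # 11.0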
                      Service    WSDL       XML    INI    Database   Others   Overall
                      property   Document   File   File
Service Repository    8801       918        2      1      53         267      10042
Ebi                   699        82         2      1      28         11       823
Xmethods              5801       1168       2      1      45         12       7029
Seekda                5186       1013       2      1      41         23       6266
Biocatalogue          39533      762        2      1      66         1636     42000

Table 4-3: Average time cost information (in milliseconds) for all Web Service Registries
Table 4-3 displays the average time cost of one Web service and of its different parts in all five Web Service Registries. The first column of Table 4-3 contains the names of the five Web Service Registries, and the last column is the average time cost for a single service in each Web Service Registry, while the remaining columns are the average time costs of the six different parts. In order to give an intuitive view of the data in Table 4-3, the data in each column are illustrated in the corresponding figures; see Figure 4-8 to Figure 4-13.
Figure 4-8: Average time cost for extracting the service properties in all Web Service Registries
As can be seen from Figure 4-8, the average time cost for extracting the service properties in the Biocatalogue Web Service Registry is 39533 milliseconds, which is much larger than in the other four Web Service Registries, namely 8801, 699, 5801, and 5186 milliseconds for the Service Repository, Ebi, Xmethods, and Seekda respectively. That is to say, it takes much longer to extract the service properties of the Web services published by the Biocatalogue Web Service Registry. This also indirectly indicates that the Biocatalogue Web Service Registry has the largest average number of service properties, which has already been discussed in section 4.3, whereas the average number of service properties in the Ebi Web Service Registry is the smallest. Moreover, the average time cost in the Xmethods Web Service Registry is larger than in the Seekda Web Service Registry although, as already shown, the average number of service properties is the same for these two registries. One cause that might explain why Xmethods costs more time than Seekda is that the extraction of the service properties in the Xmethods Web Service Registry has to go through both the service page and the service list page, while for the Seekda Web Service Registry only the service page link is needed.
The average time cost for obtaining the WSDL document in all five Web Service Registries is displayed in Figure 4-9. This average time cost is actually the sum of the average time for extracting the WSDL link of a Web service and the average time for reading the data of the WSDL document from the Web and storing it on disk. As seen from this figure, the average time cost for obtaining the WSDL document is largest in the Xmethods Web Service Registry, at 1168 milliseconds. Although the process of extracting the WSDL link costs a certain amount of time, it does not have a significant influence on the total average time spent for obtaining the WSDL document, because the WSDL link of a Web service is almost always gained in one step. This implies that the average size of the WSDL documents in the Xmethods Web Service Registry is larger than in the other four Web Service Registries, in particular the Ebi Web Service Registry, which takes just 82 milliseconds for obtaining the WSDL document.
Figure 4-9: Average time cost for obtaining the WSDL document in all Web Service Registries
Figure 4-10, Figure 4-11, and Figure 4-12 show the average time cost of generating the three different outputs in all five Web Service Registries. As can be seen from Figures 4-10 and 4-11, the average time for generating the XML file for one Web service is the same for all five Web Service Registries, namely only 2 milliseconds, and the average time for generating the INI file is likewise the same everywhere, with a value of just 1 millisecond. Even the sum of these two average time costs is still so small that it can be omitted when compared to the overall average time cost of getting one Web service in each registry, as shown in Figure 4-13. This implies that the generation of the XML and INI files finishes immediately after receiving the service properties of a Web service as input. Furthermore, as can be seen from Figure 4-12, although the average time cost of creating the database record for each Web service is larger in all five Web Service Registries than the time for generating the XML and INI files, the operation of creating a database record is still fast.
Figure 4-10: Average time cost for generating the XML file in all Web Service Registries
Figure 4-11: Average time cost for generating the INI file in all Web Service Registries
Figure 4-12: Average time cost for creating the database record in all Web Service Registries
Figure 4-13: Average time cost for getting one Web service in all Web Service Registries
Figure 4-13 gives the average time cost for getting one single Web service in all five Web Service Registries. Without any doubt, the Biocatalogue Web Service Registry takes the longest time for this process. This is because, as presented above, the average time cost of each part needs more time in the Biocatalogue Web Service Registry, except for the process of obtaining the WSDL document, where the Biocatalogue Web Service Registry does not cost the most. Moreover, a striking observation arises when looking at Figures 4-8, 4-12, and 4-13: the shapes of these curves show almost the same trend. This further indicates that a Web Service Registry that spends more time getting the description information of a Web service also offers more information about that Web service.
5 Conclusion and Further Directions

This master thesis provides a schema that aims to explore the description information of the Web services hosted in different Web Service Registries. The description information of a Web service consists of the WSDL document and the service information about that Web service. Although there is an existing approach, the Pica-Pica Web Service Description Crawler, that can be used to obtain the WSDL documents and service information of the Web services in these Web Service Registries, its functionality is restricted: it explores only a small subset of the Web services hosted in these Web Service Registries, the storage of these Web services is not flexible, and, most importantly, only a little of the service information of a Web service is extracted; for some Web Service Registries no service information is extracted at all. The work presented in this master thesis, however, is able to explore all Web services in these Web Service Registries. Moreover, with the approach presented in this master thesis, as much service information of the Web services as possible is extracted, so that the final result shall be the largest annotated service catalogue ever produced. Furthermore, regarding the storage of the description information of the Web services, this master thesis provides three different ways that guarantee not only the completeness but also the longevity of the description information of the Web services.
However, in the implementation performed in this master thesis, the whois client used to query the information of a service domain returns free text if the information exists, and this free text sometimes differs completely. As a consequence, each Web service in all Web Service Registries had to be crawled at least once in the experiment stage, so that all variants of this free text could be foreseen and processed afterwards. Nevertheless, this is a huge effort, because there are lots of Web services in these Web Service Registries. Therefore, in order to simplify the work, another whois client that eases this task needs to be found and used.
Moreover, in the experiment stage of this master thesis, the time cost for getting a Web service is still large. For the purpose of reducing this time, multithreaded programming could be applied to some parts of the process of getting one Web service, as sketched below.
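A minimal sketch of this idea, assuming a newer Python than the one used for the implementation and a hypothetical crawl_service function that performs all steps for one service:

    from concurrent.futures import ThreadPoolExecutor

    def crawl_all(service_links, crawl_service, workers=8):
        # Crawl several services concurrently instead of one after another.
        with ThreadPoolExecutor(max_workers=workers) as pool:
            return list(pool.map(crawl_service, service_links))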
Although the work performed here is specialized for only these five Web Service Registries, the main principles used here are adaptable to other Web Service Registries with only small changes in the implementation code or the structure.
6 Bibliography
[1] Nathalie Steinmetz, Holger Lausen, Dario Cerizza, Andrea Turati, Irene Celino, Adam Funk and Michael Erdmann: "D1.1 – Requirement Analysis and Architectural Plan". Available from http://www.service-finder.eu/deliverables/public/7-public/32-d11. Emanuele Della Valle (CEFRIEL), June 27, 2008.
[2] Nathalie Steinmetz, Holger Lausen, Dario Cerizza, Andrea Turati, Irene Celino, Adam Funk and Michael Erdmann: "D1.2 – First Design of Service-Finder as a Whole". Available from http://www.service-finder.eu/deliverables/public/7-public/37-d12-first-design-of-service-finder-as-a-whole. Emanuele Della Valle (CEFRIEL), July 1, 2008.
[3] Nathalie Steinmetz, Holger Lausen, Irene Celino, Dario Cerizza, Saartje Brockmans, Adam Funk: "D1.3 – Revised Requirement Analysis and Architectural Plan". Available from http://www.service-finder.eu/deliverables/public/7-public/57-d13-revised-requirement-analysis-and-architectural-plan. Emanuele Della Valle (CEFRIEL), April 1, 2009.
[4] Chia-Hui Chang, Mohammed Kayed, Moheb Ramzy Girgis, Khaled Shaalan: "A Survey of Web Information Extraction Systems". IEEE Transactions on Knowledge and Data Engineering, Volume 18, Issue 10, IEEE Computer Society, pp. 1411-1428, October 2006.
[5] Leonard Richardson: "Beautiful Soup Documentation". Available from http://www.crummy.com/software/BeautifulSoup/documentation.html, October 13, 2008.
[6] Hao He, Hugo Haas, David Orchard: "Web Services Architecture Usage Scenarios". Available from http://www.w3.org/TR/ws-arch-scenarios/, February 11, 2004.
[7] Stephen Soderland: "Learning Information Extraction Rules for Semi-Structured and Free Text". Machine Learning, Volume 34, Issue 1-3, Department of Computer Science and Engineering, University of Washington, Seattle, pp. 233-272, February 1999.
[8] Ian Hickson: "A Vocabulary and Associated APIs for HTML and XHTML". World Wide Web Consortium, Working Draft WD-html5-20100624, January 22, 2008.
[9] Holger Lausen, Jos de Bruijn, Axel Polleres, Dieter Fensel: "The Web Service Modeling Language WSML". WSML Deliverable D16.1 v0.2, March 20, 2005. Available from http://www.wsmo.org/TR/d16/d16.1/v0.2/.
[10] Dumitru Roman, Holger Lausen, Uwe Keller: "Web Service Modeling Ontology - Standard (WSMO-Standard)". WSMO Deliverable D2, version 1.1, March 6, 2004. Available from http://www.wsmo.org/TR/d2/v1.1/.
[11] Iris Braun, Anja Strunk, Gergana Stoyanova, Bastian Buder: "ConQo – A Context- and QoS-Aware Service Discovery". TU Dresden, Department of Computer Science, in Proceedings of WWW/Internet 2008.
7 Appendixes
There are two additional outputs of this master program: the log information file and the statistic report file. Figure 8-1 shows one of the basic output formats of the log information for these five Web Service Registries.
Figure 8-1: Log information of the "Service Repository" Web Service Registry
Figure 8-2: Statistic information of the "Service Repository" Web Service Registry
Figure 8-2 to Figure 8-6 show the output format of the statistic information for the "Service Repository", "Ebi", "Xmethods", "Seekda", and "Biocatalogue" Web Service Registries respectively.
Figure 8-3: Statistic information of the "Ebi" Web Service Registry
Figure 8-4: Statistic information of the "Xmethods" Web Service Registry
Figure 8-5: Statistic information of the "Seekda" Web Service Registry
Figure 8-6: Statistic information of the "Biocatalogue" Web Service Registry
Table of Figures
Figure 2-1: Dataflow of Service-Finder and Its Components ... 12
Figure 2-2: Left is the free text input type and right is its output ... 16
Figure 2-3: A semi-structured page containing data records (in rectangular box) to be extracted ... 16
Figure 2-4: Architecture of the Pica-Pica Web Service Description Crawler ... 20
Figure 3-1: Overview of the basic architecture of the Deep Web Services Crawler ... 25
Figure 3-2: Overview of the process flow of the Web Service Extractor Component ... 27
Figure 3-3: Service list page of the Service-Repository ... 29
Figure 3-4: Original source code of the internal link for the Web service "BLZService" ... 29
Figure 3-5: Code overview of getting the service page link in Service Repository ... 29
Figure 3-6: Service page of the Web service "BLZService" ... 29
Figure 3-7: Overview of the process flow of the WSDL Grabber Component ... 30
Figure 3-8: WSDL link of the Web service "BLZService" in the service page ... 31
Figure 3-9: Original source code of the WSDL link for the Web service "BLZService" ... 32
Figure 3-10: Code overview of the "getServiceRepositoryWSDLLink" function ... 32
Figure 3-11: Code overview of the "oneParameter" function ... 32
Figure 3-12: Overview of the process flow of the Property Grabber Component ... 33
Figure 3-13: Structure properties of the service "BLZService" in the service list page ... 37
Figure 3-14: Structure properties of the service "BLZService" in the service page ... 38
Figure 3-15: Endpoint information of the Web service "BLZService" in the service page ... 38
Figure 3-16: Monitoring information of the service "BLZService" in the service page ... 39
Figure 3-17: Whois information of the service domain "thomas-bayer.com" ... 40
Figure 3-18: Overview of the process flow of the Storage Component ... 41
Figure 3-19: Implementation code for getting the WSDL document ... 44
Figure 3-20: Implementation code for generating the XML file ... 44
Figure 3-21: Implementation code for generating the INI file ... 45
Figure 3-22: Implementation code for creating the table in the database ... 45
Figure 3-23: Implementation code for generating the table records ... 46
Figure 4-1: Service amount statistic of these five Web Service Registries ... 49
Figure 4-2: Statistic information for the WSDL documents ... 50
Figure 4-3: Average number of service properties ... 51
Figure 4-4: WSDL document format of one Web service ... 52
Figure 4-5: INI file format of one Web service ... 53
Figure 4-6: XML file format of one Web service ... 53
Figure 4-7: Database data format for all Web services ... 53
Figure 4-8: Average time cost for extracting the service properties in all Web Service Registries ... 55
Figure 4-9: Average time cost for obtaining the WSDL document in all Web Service Registries ... 56
Figure 4-10: Average time cost for generating the XML file in all Web Service Registries ... 57
Figure 4-11: Average time cost for generating the INI file in all Web Service Registries ... 57
Figure 4-12: Average time cost for creating the database record in all Web Service Registries ... 58
Figure 4-13: Average time cost for getting one Web service in all Web Service Registries ... 58
Table of Tables
Table 3-1: Structured Information of the Service-Repository Web Service Registry ... 34
Table 3-2: Structured Information of the Xmethods Web Service Registry ... 34
Table 3-3: Structured Information of the Seekda Web Service Registry ... 34
Table 3-4: Structured Information of the Ebi Web Service Registry ... 34
Table 3-5: Structured Information of the Biocatalogue Web Service Registry ... 34
Table 3-6: SOAP Operation Information of the Biocatalogue Web Service Registry ... 35
Table 3-7: REST Operation Information of the Biocatalogue Web Service Registry ... 35
Table 3-8: Endpoint Information of these five Web Service Registries ... 35
Table 3-9: Monitoring Information of these five Web Service Registries ... 35
Table 3-10: Whois Information for these five Web Service Registries ... 36
Table 3-11: Extracted Structured Information of the Web service "BLZService" ... 38
Table 3-12: Extracted Endpoint Information of the Web service "BLZService" ... 39
Table 3-13: Extracted Monitoring Information of the Web service "BLZService" ... 39
Table 3-14: Extracted Whois Information of the service domain "thomas-bayer.com" ... 40
Table 3-15: Sleep Time of these five Web Service Registries ... 47
Table 4-1: Service amount statistic of these five Web Service Registries ... 48
Table 4-2: Statistic information for the WSDL documents ... 49
Table 4-3: Average time cost information for all Web Service Registries ... 55
Table of Abbreviations
DTD Document Type Definition
DWSC Deep Web Service Crawler
HTML HyperText Markup Language
MTBF Mean Time between Failures
MTTR Mean Time to Recovery
QoS Quality of Service
REST Representational State Transfer
RTT Round Trip Time
SOAP Simple Object Access Protocol
SQL Structured Query Language
URL Uniform Resource Locator
WHATWG Web Hypertext Application Technology Working Group
XML eXtensible Markup Language
WSDL Web Service Description Language
WSML Web Service Modeling Language
WSMO Web Service Modeling Ontology
The information in this use case is derived from [1]. In this use case there is a system administrator named "Sam Adams". He works for a bank, and his job is to keep the online payment facilities online and working day and night. Therefore, if there are any system failures, Sam Adams has to fix the problems as early as he can. That is why he wants to use an SMS messaging service which will alert him immediately by sending him an SMS message in the case of a system failure.
• Description
This use case describes the system administrator "Sam Adams", who is looking for an SMS messaging service that he wants to build into his application.
• Actors, Roles and Goals
The name of the actor is "Sam Adams"; his role is that of a system administrator at a bank. His goals are immediate service delivery, reliability of the service, and a low base fee and transaction fee.
• Storyboard
Step 1: Sam Adams knows the Service-Finder portal and knows that he can find many useful services there; in particular, he knows what he is looking for. Hence, he visits the Service-Finder portal and starts to search by entering the keywords "SMS Alert".
Requirement 1: Search functionality.
Step 2: Service-Finder now returns a list of matching services. However, Sam wants to choose the number of matching services that are displayed on one page, and he would also expect short information about the service functionality, the service provider, and the service availability, so that he can decide which service to read further.
Requirement 2: Enable configurable pagination of the matching results and show short information for each service.
Step 3: When Sam looks through the short information about the services displayed on the first page, he expects to find the most relevant services related to his request first. After that, he would like to read more detailed information about a service to see whether it provides the needed functionality.
Requirement 3: Rank the returned matching services and provide the ability to read more details of a service.
Step 4: It may be the case that the returned matching services provide quite different functionalities or belong to different service categories; for example, some SMS messaging services alert users not through SMS but through voice messaging. For this reason, Sam would like to see other categories that may contain the services he wants, or services of other categories he is also interested in (like "SMS Messaging"). Another possible way is that Sam further refines his search by browsing through categories.
Requirement 4: Provide service categories and allow the user to see all services that belong to a specific category; if possible, also allow the user to browse through categories.
Step 5: When Sam has got all the services that could provide an SMS messaging service via the methods described in Step 4, he now wants to look for the services that are offered by an Austrian provider and, if possible, have no base fees.
Requirement 5: Faceted search.
Step 6: After Sam has got all these specific services, he would like to choose the services that provide high reliability.
Requirement 6: Sort functionality based on the user's choices.
Step 7: Sam now expects to compare the service availability promised by the service provider with the actually provided availability; this should be contained in the services' details. There also needs to be service coverage information, so that Sam can know whether a service covers the areas where he lives and works. Moreover, Sam would like to compare these services in other ways, for instance by putting some services into a structured table to compare the transaction fees.
Requirement 7: A side-by-side comparison table for services and a functionality that enables users to select the services they want to compare.
Step 8: At last, Sam wants to know whether the service providers offer a free trial of their services, so that he can test the service functionality.
Requirement 8: If possible, display a note for services offering free trials.
2.1.2 Architecture Plan for the Service-Finder Project
The architecture plan of the Service-Finder project contains five basic components: Service Crawler, Automatic Annotator, Conceptual Indexer and Matcher, Cluster Engine, and Service-Finder Portal Interface. Figure 2-1 presents a high-level overview of the components and the data flow among them.
Figure 2-1: Dataflow of Service-Finder and Its Components [3]
2.1.2.1 The Principle of the Service Crawler Component
The Service Crawler is responsible for gathering available services and their related information from the Web. The overall cycle is depicted as follows:
(1) A Web developer publishes a Web service.
(2) The Crawling component then begins to harvest the Web in order to identify Web services, e.g. by their WSDL (Web Service Description Language) documents.
(3) As soon as a service is discovered, the Crawler also searches for other related information.
(4) After each periodic interval, the Crawler produces a consistent snapshot of the relevant part of the Web.
Finally, the output of the Crawler is forwarded to the subsequent components for analyzing, indexing, and displaying.
2.1.2.2 The Principle of the Automatic Annotator Component
The Automatic Annotator receives the relevant data from the previous component and generates semantic service descriptions of the WSDL documents and their related information, based on the Service-Finder Ontology and the Service Category Ontology.
First, the two compatible ontologies that are used throughout the whole process are briefly introduced [2]:
• Generic Service Ontology: an ontology which describes the data objects, for example the services, the service providers, availability, payment modalities, and so on.
• Service Category Ontology: an ontology which is used to categorize the functionalities or applications of the services, for instance data verification, messaging, data storage, weather, etc.
Afterwards, the function of this component with its input and output:
• Input
  - Crawled data from the Service Crawler
  - Service-Finder Ontologies
  - Feedback on or corrections of previous annotations
• Function
  - Enrich the information about the service and extract semantic statements according to the Service-Finder Ontologies, for example categorize the service according to the Service Category Ontology
  - Determine whether a particular document is relevant or not through the Web link graph; if not, discard these irrelevant documents
  - Classify the pages into their genres, for instance pricing, user comments, FAQ, and so on
• Output
  - Semantic annotations of the services
2.1.2.3 The Principle of the Conceptual Indexer and Matcher Component
The Conceptual Indexer and Matcher is more like a data store center that aims at storing all extracted information about the services and supplying users with the capabilities of retrieval and semantic query, for example the matchmaking between user requests and service offers, and the retrieval of user feedback on extracted annotations.
In addition, a look at the function of this component and its input and output:
• Input
  - Semantic annotation data and full-text information obtained from the Automatic Annotator
  - Semantic annotation data and full-text information that come from the user interfaces
  - Cluster data from the user and service clustering component
• Function
  - Store the semantic annotations received from the Automatic Annotator component and from the user interface
  - Store the cluster data procured through the clustering component
  - Store and index the textual descriptions offered by the Automatic Annotator component and the textual comments offered by users
  - Ontological querying of the semantic data in the data store center
  - Combined keyword and ontological querying used for user queries
  - Provide a list of similar services for a given service
• Output
  - A list of matching services queried by users; in particular, these services should be sorted by ranking, and it should be possible to iterate over them
  - All available data related to a particular entity must be retrievable at the user interface
2.1.2.4 The Principle of the Service-Finder Portal Interface Component
The Service-Finder Portal Interface is the main entry point provided for users of the Service-Finder system to search and browse the data managed by the Conceptual Indexer and Matcher component. In addition, the users can also contribute information by providing tags, comments, categorizations, and ratings for the browsed data. Furthermore, developers can directly invoke the Service-Finder functionalities from their custom applications via an API.
The details of this component's function, input, and output are presented below:
• Input
  - A list of ordered services for a query
  - Detailed information about a service or a set of services and a service provider
  - Query access to the service category ontology and the most used tags provided by the users
  - Service availability information
• Function
  - The Web interface allows the users to search services by keyword, tag, or concept in the categorization; to sort and filter query results by refining the query; to compare and bookmark services; and to try out the services that offer this functionality
  - The API allows developers to invoke Service-Finder functionalities
• Output
  - Explicit user annotations such as tags, ratings, comments, descriptions, and so on
  - Implicit user data, for example click streams of users, bookmarks, comparisons, links sent, etc.
  - Manual advertising of available new services
2.1.2.5 The Principle of the Cluster Engine Component
The Cluster Engine is in charge of collecting and analyzing information about user behavior from the Service-Finder Portal, e.g. the queried services and the compared services of the users. Moreover, it also provides cluster data to the Conceptual Indexer and Matcher for providing service recommendations.
This component's function, input, and output in detail:
• Input
  - Service annotation data, both extracted and from user feedback
  - Users' click streams, used for extracting user behavior
• Function
  - Obtain user clusters from user behavior
  - Obtain service clusters from service annotation data to enable finding similar services
• Output
  - Clusters of users and services
2.2 Information Extraction
Due to the rapid development and use of the World-Wide Web, a huge number of information sources have been produced on the Internet, to which access has been limited to browsing and searching because of the heterogeneity and the lack of structure of Web information sources. Therefore, the emergence of Information Extraction, which transforms Web pages into program-friendly structures for post-processing, has become a great necessity. The task of Information Extraction is specified in terms of its inputs and its extraction targets, and the technique used in the process of Information Extraction is called the extractor.
2.2.1 Input Types of Information Extraction
Generally speaking, there are three different input types. The first input type is the unstructured document, for example the free text shown in Figure 2-2. It is unstructured and written in natural language, so it requires substantial natural language processing. The second input type is the structured document, for instance XML documents, since their data can be described through an available DTD (Document Type Definition) or XML (eXtensible Markup Language) schema. The third input type, finally, is the semi-structured document, which is widespread on the Web, such as the large volume of HTML pages with tables, itemized lists, and enumerated lists. This is because HTML tags are often used to render such embedded data in HTML pages; see Figure 2-3.
Figure 2-2: Left is the free text input type and right is its output [4]
Figure 2-3: A semi-structured page containing data records (in rectangular box) to be extracted [4]
In this way, the inputs of the semi-structured type can be seen as documents with a fairly regular structure, and the data of these documents can be displayed in an HTML or a non-HTML way. Besides, since the Web pages of the Deep Web are dynamic and generated from structured databases by means of templates or layouts, the Deep Web can be considered one of the input sources that provide such semi-structured documents. For example, the author, price, and comment parts of the book pages provided by Amazon have the
same layout, because these Web pages are generated from the same database and use the same template or layout. Furthermore, there is another option: HTML pages of the semi-structured type can also be generated manually. For example, although the publication lists provided on various researchers' homepages are produced by different people, they all have a title and a source property for every single paper. Finally, the inputs for Information Extraction can also be pages of the same class or pages from various Web Service Registries.
2.2.2 Extraction Targets of Information Extraction
Regarding the task of Information Extraction, the extraction target has to be considered as well. There are two different extraction targets. The first one is the relation of a k-tuple, where k is the number of attributes in a record. In some cases an attribute of a record may have no instantiation at all, while in other cases an attribute has multiple instantiations. The second extraction target is the complex object with hierarchically organized data. Though the ways of depicting extraction targets in a page are diverse, the most common structure is the hierarchical tree. Such a hierarchical tree may contain only one leaf node, or one or more lists of leaf nodes grouped under so-called internal nodes. The structure of a data object may also be flat or nested: in brief, if the structure is flat, then there is only one leaf node, which can also be called the root; if it is a nested structure, then the internal nodes involved in the data object span more than two levels.
Furthermore, in order to make Web pages readable for human beings and easier to visualize, tables, tuples of the same list, and elements of a tuple should be clearly isolated or demarcated. However, the display of a data object in a Web page is affected by the following conditions [4]:
• The attribute of a data object has zero or several values:
(1) If there is no value for an attribute of a data object, this attribute is called a "none" attribute. For example, a special offer that is only available for certain books might be a "none" attribute.
(2) If there is more than one value for an attribute of a data object, it is called a "multiValue" attribute. For instance, the name of the author of a book could be a "multiValue" attribute.
A small illustration of these two cases is given after this list.
• The set of attributes (A1, A2, A3, ...) has multiple orderings:
That is to say, the position of an attribute within this set might change across the diverse instances of a data object. Such an attribute is called a "multiOrdering" attribute. For instance, for movies before the year 1999 a movie site may list the release date in front of the movie's title, while for movies from 1999 onwards it lists the release date behind the movie's title.
• The attribute has different formats:
This means the display format of the data object can be completely distinct for different instances. Therefore, if the format of an attribute is free, a lot of rules are needed to deal with all possible cases. This kind of attribute is called a "multiFormat" attribute. For example, an e-commerce Web site might use a bold font to present regular prices, while using a red color to display sale prices. There is, however, also the situation that different attributes of a data object share the same format; for example, various attributes may all be presented using <TD> tags in a table presentation. Attributes like these can be differentiated by means of their order information. However, in cases where a "none" attribute occurs or "multiOrdering" attributes exist, the rules for extracting these attributes have to be revised.
• The attribute cannot be decomposed:
For easier processing, input documents are sometimes treated as strings of tokens instead of strings of characters. In addition, some attributes cannot even be decomposed into several individual tokens; these attributes are called "untokenized" attributes. An example is a college course catalogue entry like "COMP4016" or "GEOL2001": the department code and the course number cannot be separated into two different strings of characters such as "COMP" and "4016" or "GEOL" and "2001".
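The "none" and "multiValue" cases mentioned above can be pictured with two toy book records (all field names and values are invented):

    # Book A has a "none" special-offer attribute; Book B's author list is "multiValue".
    books = [
        {"title": "Book A", "authors": ["Smith"],         "special_offer": None},
        {"title": "Book B", "authors": ["Jones", "Lee"],  "special_offer": "10% off"},
    ]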
2.2.3 The Techniques Used in Information Extraction
The extractor used in the process of Information Extraction aims at providing a single uniform query interface to access information sources such as database servers and Web servers. It consists of the following phases: collecting the returned Web pages, labeling these Web pages, generalizing extraction rules, extracting the relevant data, and outputting the result in an appropriate format (XML format or a relational database) for further information integration. For example, the extractor first queries the Web server to gather the returned pages through the HTTP protocol, then starts to extract the contents of these HTML documents, and thereafter integrates them with other data sources. The whole process of the extractor follows the steps below (a condensed sketch is given after the list):
• Step 1:
At the beginning, the input has to be tokenized. There are two different granularities for the tokenization of the input string: tag-level encoding and word-level encoding. Tag-level encoding transforms the tags of an HTML page into general tokens, while transforming every text string between two tags into a special token. Word-level encoding does this in another way: it treats each word in a document as a token.
• Step 2:
Next, the extraction rules are applied for every attribute of the data object in the Web pages. These extraction rules can be induced by top-down or bottom-up generalization, pattern mining, or logic programming. In addition, the type of the extraction rules may be indicated by means of regular grammars or logic rules; for example, some use path expressions on the HTML parse tree, paths like html.head.title or html->table[0], some use syntactic or semantic constraints, and some use delimiter-based constraints such as HTML tags or literal words.
• Step 3:
After that, all the extracted data are assembled into records.
• Step 4:
Finally, this process is iterated until all the data objects in the input are covered.
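The four steps can be condensed into a toy delimiter-based extractor; the rule set below (HTML tags as delimiters) is a made-up example and not part of any system described here:

    import re

    RULES = {
        "title": re.compile(r'<td class="title">(.*?)</td>'),
        "price": re.compile(r'<td class="price">(.*?)</td>'),
    }

    def extract_records(page):
        rows = re.findall(r"<tr>(.*?)</tr>", page, re.S)   # step 1: coarse tokenization
        records = []
        for row in rows:                                   # step 4: iterate over all objects
            record = {}
            for attr, rule in RULES.items():               # step 2: apply extraction rules
                match = rule.search(row)
                record[attr] = match.group(1) if match else None
            records.append(record)                         # step 3: assemble the record
        return records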
2.3 Pica-Pica Web Service Description Crawler
Pica pica is known as a bird species, also called magpie. At the moment, however, the Pica-Pica here is a Web Service Description Crawler which is designed to address the problem of the quality of Web services, for example the evaluation of the descriptive quality of the offered Web services and how well these Web services are described in today's Web Service Registries.
2.3.1 Required Libraries of the Pica-Pica Web Service Description Crawler
This version of the Pica-Pica Web Service Description Crawler was written by Anton Caceres and Josef Spillner and is programmed in the Python language. In order to run its scripts to parse the HTML pages, it needs two additional libraries: Beautiful Soup and html5lib.
• Beautiful Soup
It is an HTML/XML parser for the Python language, and it can even turn invalid markup into a parse tree [5]. Moreover, the following three features make it more powerful:
  - Bad markup doesn't choke Beautiful Soup. In fact, it will generate a parse tree that makes approximately as much sense as the original document, so you can obtain the data that you want.
  - Beautiful Soup provides a toolkit with simple, idiomatic methods for navigating, searching, and modifying the parse tree. Hence you don't need to create a custom parser for every application.
  - If the document has already specified an encoding, you can ignore it, since Beautiful Soup can convert the documents from Unicode to UTF-8 automatically. Otherwise, all you have to do is to specify the encoding of the original documents.
Furthermore, the ways of including Beautiful Soup into an application are displayed in the following [5]:
  - from BeautifulSoup import BeautifulSoup            # for processing HTML
  - from BeautifulSoup import BeautifulStoneSoup       # for processing XML
  - import BeautifulSoup                               # to get everything
• Html5lib
It is a Python package which implements the HTML5 [8] parsing algorithm. In order to gain maximum compatibility with the current major desktop Web browsers, this implementation is based on the WHATWG (Web Hypertext Application Technology Working Group) HTML5 specification. A small usage illustration of the Beautiful Soup calls cited above follows below.
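As a small, hedged illustration of the Beautiful Soup 3 API cited above (the HTML snippet and the class name "serviceLink" are invented, not taken from any real registry page):

    from BeautifulSoup import BeautifulSoup   # Beautiful Soup 3, as cited above

    html = '<html><body><a class="serviceLink" href="/s/BLZService">BLZService</a></body></html>'
    soup = BeautifulSoup(html)                # builds a parse tree, even for bad markup
    for link in soup.findAll("a", {"class": "serviceLink"}):
        print(link["href"] + " " + link.string)   # -> /s/BLZService BLZService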
2.3.2 Architecture of the Pica-Pica Web Service Description Crawler
Figure 2-4: Architecture of the Pica-Pica Web Service Description Crawler
Figure 2-4 illustrates the primary architecture of the Pica-Pica Web Service Description Crawler. It includes four fundamental components: the Service Page Grabber component, the WSDL Grabber component, the Property Grabber component, and the WSML Register component.
(1) The Service Page Grabber component takes the URL seed as input and outputs the link of the service page to the following two components: the WSDL Grabber component and the Property Grabber component.
(2) The WSDL Grabber component is responsible for obtaining the WSDL document based on the delivered service page link and then checking the validity of the obtained WSDL documents. Only the valid WSDL documents are passed on to the WSML Register component for further processing.
(3) The Property Grabber component tries to extract the service properties hosted in the service page, if there are any. Afterwards, all these service properties are saved into an INI file as the information of that service.
(4) The functionality of the WSML Register component is to write an appropriate WSML document by means of the valid WSDL documents delivered from the WSDL Grabber component and the optional INI files delivered from the Property Grabber component, and afterwards to register them in ConQo.
• WSML [9]
It stands for Web Service Modeling Language, which provides a framework with different language variants. Hence it is often used to describe the different aspects of semantic Web services according to the conceptual model of WSMO.
• WSMO [10]
WSMO, whose full name is Web Service Modeling Ontology, is dedicated to describing various aspects related to Semantic Web Services based on ontologies. An ontology is a formal, explicit specification of a shared conceptualization. In fact, these ontologies are the sticking points that establish the linkage between the agreement of the communities of users and the defined conceptual semantics of the real world.
• ConQo [11]
It is a discovery framework that considers not only the Quality of Service (QoS) but also context information. It uses a Web Service Repository to manage service descriptions based on WSML.
233 Implementation of the Pica-Pica Web Service Description Crawler
This section describes the implementation of the Pica-Pica Web Service Description Crawler in detail.
(1) Firstly, to start the whole crawling process, the Pica-Pica Web Service Description Crawler needs an initial seed as input. For this crawler there are five Web Service Registries, listed below; the URL addresses of these five Web Service Registries are used as the input seeds. Moreover, in this version of the Pica-Pica Web Service Description Crawler there is a separate Python script for each Web Service Registry, and the crawling processes of these registry scripts are executed one after another.
Biocatalogue: http://www.biocatalogue.com
Ebi: http://www.ebi.ac.uk
Seekda: http://www.seekda.com
Service-Repository: http://www.service-repository.com
Xmethods: http://www.xmethods.net
(2) Then, after being fed with the input seed, the crawler steps into the next component, the Service Page Grabber. At first this component tries to read the data from the Web based on the input seed. Then it builds a parse tree of the read data by means of the functions of the Beautiful Soup library. After that, the Service Page Grabber component looks for the service page link of each service published in the Web Service Registry by means of the functions of the Html5lib library. Whenever the service page link of a single service is found, the component first checks whether this service page link is valid. Once the service page link is valid, it is passed to the following two components for further processing: the WSDL Grabber component and the Property Grabber component.
(3) When the WSDL Grabber component receives a service page link from the previous component, it sets out to extract the WSDL link address of that service from the parse tree of the data in this service page. Next, the component downloads the WSDL document of that service via the WSDL link address; thereafter the obtained WSDL document is stored on disk. The WSDL Grabber component keeps working until no more service links are passed to it. Certainly, not all grabbed WSDL documents are usable: they may contain bad definitions or a bad namespace URI, be empty documents, or, even worse, not be in XML format at all. Hence, in order to sort them out, this component further analyzes the involved WSDL documents and puts all valid documents into a "validWSDLs" folder, whereas the invalid documents are put into a folder named "invalidWSDLs" in order to gather statistical information. Finally, only the WSDL documents in the "validWSDLs" folder are passed to the subsequent component.
(4) Moreover, since some Web Service Registries provide additional information about the services, such as availability, service provider, and version, the Property Grabber component sets out to extract this information as the service's properties and thereafter saves these properties into an INI file. However, if no additional information is available, there is no need to extract service properties, and thus no INI file is created for that service. Note that in this implementation of the Pica-Pica Web Service Description Crawler, only the Python scripts for the Seekda and Service-Repository Web Service Registries have functions to extract the services' properties, while the scripts for the other three Web Service Registries lack such functions.
(5) Furthermore, it is optional to create a report file that contains statistical information about this process, such as the total number of services of one Web Service Registry, the number of services whose WSDL document is invalid, etc.
(6) As stated above, at this point there is a folder with all valid WSDL documents and possibly some INI files. The task of the WSML Register component is therefore to generate the appropriate WSML documents from these valid WSDL documents and INI files and then register them in ConQo.
24 Conclusions of the Existing Strategies
This chapter presented three aspects of the existing strategies: the Service-Finder project, Information Extraction techniques, and the Pica-Pica Web Service Description Crawler.
The task of this master thesis is to obtain the available Web services and their related information from the Web. This is in fact a procedure of extracting the needed information about each service, such as the service's WSDL document and its properties. Therefore, the Information Extraction techniques that are used to extract information hosted on the Web can be applied in this work.
Moreover, Service-Finder is a large project that is not only able to obtain the available Web services and their related information from the Web, but also enriches the crawled data with annotations by means of the Service-Finder ontology and the Service Category ontology, and integrates all the information into a coherent semantic model. Furthermore, it provides capabilities for searching and browsing the data through a user interface and gives users service recommendations. However, the Service-Finder project far exceeds the requirements of a master thesis; therefore it is only considered as a reference for this work.
Furthermore, since the Pica-Pica Web Service Description Crawler aims solely at obtaining the available Web services and their related information, it matches the primary task of this work. Nevertheless, regarding the information about a service, the Pica-Pica Web Service Description Crawler extracts only a few properties, sometimes even none. Consequently, in order to improve the quality of the service descriptions, as many properties of each service as possible have to be extracted. Hence, chapter 3 presents an extension of this Pica-Pica Web Service Description Crawler.
3 Design and Implementation
In the previous chapter on the state of the art, the already existing techniques and implementations were presented. In the following, the basic principle of the proposed approach, the Deep Web Services Crawler, is introduced; it builds on these existing techniques, especially on the Pica-Pica Web Service Description Crawler.
31 Deep Web Services Crawler Requirements
This section covers the goals of the Deep Web Service Crawler approach, the system requirements of the approach, and some non-functional requirements.
311 Basic Requirements for DWSC
The following basic requirements should be achieved:
(1) Produce the largest annotated Service Catalogue. Here, the Service Catalogue is a list of Web services published in the Web Service Registries. It contains not only the WSDL document of each service but also a file with the information about that service. Therefore, for the purpose of producing the largest annotated Service Catalogue, the proposed approach needs to extract as many properties of these Web services as possible. Moreover, it also has to download the WSDL document hosted along with each Web service. That is to say, these properties comprise not only the interesting structured properties, such as the service name and its WSDL link address, but also other non-functional properties, for example the endpoint and its monitoring information.
(2) Flexible storage: each service has a Service Catalogue entry that contains all its interesting properties. How to handle these service properties, i.e. which schemes are used to store them, is an important question. Hence, in order to store them in a flexible way, the proposed approach provides three storage methods: the first stores them as an XML file, the second stores them in an INI file, and the third uses a database.
312 System Requirements for DWSC
Generally speaking, the requirements for realizing a programming project comprise the following:
1) Operating system: Linux/Unix, Windows (XP, Vista, 2000, etc.)
2) Programming language: C++, Java, Python, C, etc.
3) Programming tool: NetBeans, Eclipse, Visual Studio, and so on
In this master thesis, the scripts of the Deep Web Service Crawler approach are written in the Java programming language. These scripts have only been tested on the Windows XP and Linux operating systems and have not been tested on other operating systems.
313 Non-Functional Requirements for DWSC
In this part, several non-functional requirements for the Deep Web Service Crawler approach are presented.
1) Transparency: the process of data exploration and data storage should be done automatically, without the user's intervention. However, at the start the user should specify the path on the hard disk that will be used to store the outputs of this program.
2) Fault tolerance: during the execution of this program, errors will inevitably happen. Therefore, in order to keep the process from being interrupted, the necessary error handling for process recovery must be in place.
3) Completeness: this approach should extract as many of the interesting properties of each Web service as possible, e.g. endpoint and monitoring information.
In addition, since the Pica-Pica Web Service Crawler has already implemented strategies for the following five URLs, the proposed approach must cover no fewer than these five Web Service Registries:
Biocatalogue: http://www.biocatalogue.com
Ebi: http://www.ebi.ac.uk
Seekda: http://www.seekda.com
Service-Repository: http://www.service-repository.com
Xmethods: http://www.xmethods.net
32 Deep Web Services Crawler Architecture
In this section, an overview of the high-level architecture of the Deep Web Services Crawler approach is given first. Thereafter, subsections outline each single component and describe how they play together.
The current components and data flows of the Deep Web Service Crawler are summarized in Figure 3-1 using the continuous arrows. The crawler first tries to obtain the available service list page links and the related service page links by crawling the Web (Web Service Extractor). The gathered links are then processed in two separate steps: one obtains the service's WSDL document (WSDL Grabber), the other collects the properties of each service (Property Grabber). Finally, all these data are stored on the storage device (Storage). The detailed process of figure 3-1 is illustrated as follows:
• Step 1
When the Deep Web Service Crawler starts to run, the File Chooser dialog requires the user to specify a path on the computer or on any other hard disk. The reason for specifying the path is that the Deep Web Service Crawler program needs a place to store all its outputs.
• Step 2
After that, the Web Service Extractor is triggered. It is the main entry point to the specific crawling process. Since the Deep Web Service Crawler is a program that is supposed to crawl for Web services in the given Web Service Registries, the URL addresses of these Web Service Registries are given as initial seeds for this Web Service Extractor process. However, since the page structures of these Web Service Registries are completely different, there is a dedicated process for each Web Service Registry.
Figure 3-1 Overview of the Basic Architecture for the Deep Web Services Crawler
• Step 3
According to the given seed, two types of links are obtained by the Web Service Extractor component: the service list page link and the service page link. A service list page is a page that contains a list of Web services and possibly some information about these Web services, while a service page is a page that contains much more information about a single service. Finally, the extractor forwards these two types of links to the next two components: Property Grabber and WSDL Grabber.
• Step 4
On the one hand, the Property Grabber component tries to gather the information about the service that is hosted in the service list page and the service page, such as the name of the service, its description, its ranking, etc. Finally, all the information about the service is collected together as the service properties, which are then delivered to the Storage component for further processing.
• Step 5
On the other hand, the WSDL Grabber tries to obtain the WSDL link from the service list page or the service page. This is because for some Web Service Registries the WSDL link is hosted in the service list page, like in Biocatalogue, while for the other Web Service Registries it is hosted in the service page, such as in Xmethods. After the WSDL link is obtained, it is likewise transmitted to the Storage component for further processing.
• Step 6
When the service properties and the WSDL link of the service are received by the Storage component, they are stored on disk. The service properties are stored in three different ways: as an XML file, as an INI file, or as a record in a database table. For the WSDL link, the Storage component first tries to download the page content from the URL address of the WSDL link. If this succeeds, the page content of the service is stored as a WSDL document on disk.
• Step 7
Steps 3 to 6 describe the crawling process for a single service. Hence, if there is more than one service, or more than one service list page, in these Web Service Registries, the crawling process from step 3 to step 6 is repeated again and again until no service or service list page remains in the respective Web Service Registry.
• Step 8
Furthermore, after the crawling process for one Web Service Registry finishes, a file is generated that contains some statistical information about this crawling process, for example: the times when the crawling process of this Web Service Registry started and finished, the total number of Web services in this Web Service Registry, how many services have an empty WSDL document, the average number of service properties in this Web Service Registry, and the average time spent on extracting the service properties, getting the WSDL document, and generating the XML file, INI file, etc.
321 The Function of Web Service Extractor Component
The Web Service Extractor component is responsible for gathering service list page links and service page links. It pursues a focused crawl of the Web and only forwards service list page and service page links to the subsequent components for analysis and collection purposes. Therefore, it identifies both the service list page links and the related service page links in these Web Service Registries.
As can be seen in figure 3-2, a crawl for Web services needs to start from a URL seed. The seed is almost as important as the Web Service Extractor itself, as it highly influences the part of the Web that needs to be crawled. The seed can or shall contain, e.g., Web pages where these Web services are published or that talk about Web services.
Figure 3-2 Overview of the process flow of the Web Service Extractor Component
After being fed with the URL seed, the Web Service Extractor component starts to get the service list page link from the initial page of this URL seed. However, this process differs among the five Web Service Registries. The following shows the different situations in these Web Service Registries.
• Service-Repository Web Service Registry
In this Web Service Registry, the link of the first service list page is the URL address of its seed, which means some Web services can already be found on the home page of the Service-Repository Web Service Registry. The word "first" here implies that there is more than one service list page link in this Web Service Registry. Therefore, the process of getting service list page links in this registry is carried on continually until no more service list page link exists.
• Xmethods Web Service Registry
Although there are Web services on the home page of the Xmethods Web Service Registry, they are only a small subset of the services in this registry. Moreover, in the Xmethods Web Service Registry there is a single page containing all Web services. Therefore, the service list page link of that page has to be obtained.
• Ebi Web Service Registry
The situation in the Ebi Web Service Registry is a little like that in the Xmethods Web Service Registry. That is to say, there is also one page that contains all Web services in this registry. However, this page is not the initial page of the input seed; therefore, more than one operation step is needed to get the service list page link of that page.
• Seekda Web Service Registry
In the Seekda Web Service Registry, the Web services are not contained in the initial page of the input seed. The service list page link can be obtained after several additional operation steps. However, there is a problem with getting the service list page links in this registry: simply put, if more than one page contains Web services, then, for some unknown reason, the links of the remaining service list pages cannot be obtained. In other words, only the link of the first service list page can be obtained.
• Biocatalogue Web Service Registry
The process of getting service list pages in the Biocatalogue Web Service Registry is almost the same as in the Seekda Web Service Registry. The only difference is that in the Biocatalogue Web Service Registry all service list page links can be obtained if there is more than one service list page.
Then, after getting the link of a service list page, the Web Service Extractor begins to get the link of the service page of each service listed in the service list page. The reason this is possible is that there is an internal link for every service that points to its service page. It is worth noting that once a service page link is obtained, this component immediately forwards the service page link and the service list page link to the subsequent two components for further processing. The process of obtaining service page links is carried out continuously until all services listed in the service list page have been crawled. Analogously, the process of getting service list pages is carried out continuously until no more service list page exists. A minimal sketch of this two-level crawl loop is given below.
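The following sketch only illustrates the loop structure described above; the method names "getServicePageLinks", "getNextServiceListPageLink", and "forward" are hypothetical stand-ins for the registry-specific parsing code, which differs for each of the five Web Service Registries.

import java.util.Collections;
import java.util.List;

// Sketch of the extractor's two-level crawl loop: iterate over service list
// pages, and over the services on each list page.
public class WebServiceExtractorSketch {

    public void crawl(String seedUrl) {
        // For Service-Repository the seed itself is already the first service list page.
        String listPageLink = seedUrl;
        while (listPageLink != null) {
            for (String servicePageLink : getServicePageLinks(listPageLink)) {
                // Forward both links at once; depending on the registry, the WSDL
                // Grabber needs one of them and the Property Grabber needs both.
                forward(listPageLink, servicePageLink);
            }
            // Returns null when no further service list page exists.
            listPageLink = getNextServiceListPageLink(listPageLink);
        }
    }

    private List<String> getServicePageLinks(String listPageLink) {
        return Collections.emptyList();   // registry-specific parsing omitted
    }

    private String getNextServiceListPageLink(String listPageLink) {
        return null;                      // registry-specific pagination omitted
    }

    private void forward(String listPageLink, String servicePageLink) {
        // hand over to the WSDL Grabber and Property Grabber components (omitted)
    }
}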
3211 Features of the Web Service Extractor Component
The main features are described in the following paragraphs.
1) Obtain service list page links
A central task is to obtain the corresponding service list page links. A service list page link is a URL address of a page that contains a public list of Web services together with some brief information about them, such as the name of each service, an internal URL that links to another page containing the detailed information about that service, and sometimes the link address of the WSDL document.
2) Obtain service page links
Once service list page links are found, a crucial aspect is to extract the internal link of each service in order to aid the task of service information discovery. Therefore, it is the task of the Web Service Extractor to harvest the HTML page content of the service list page, so that the service page links, which lead to much more detailed information about the single Web services, can be obtained.
3212 Input of the Web Service Extractor Component
This component depends on specific input seeds. The only input required for this component is a URL seed; this URL seed is one of the URLs listed in section 313.
3213 Output of the Web Service Extractor Component
The component produces two kinds of service-related page links from the Web:
• Service list page links
• Service page links
Deep Web Service Crawler
29
3214 Demonstration for Web Service Extractor
In order to give a comprehensive understanding of the process of the Web Service Extractor component, some figures are given below for explanation. Although there are five URL addresses in this section, only the URL of the Service-Repository is shown as an example.
1) The input seed is the initial URL address of the Service-Repository, which is "http://www.service-repository.com".
2) As already said in section 321, the first service list link of this Web Service Registry is its input seed, "http://www.service-repository.com". Figure 3-3 shows the corresponding service list page of that link.
Figure 3-3 Service list page of the Service-Repository
Figure 3-4 Original source code of the internal link for the Web service "BLZService"
Figure 3-5 Code overview of getting the service page link in Service-Repository
Figure 3-6 Service page of the Web service "BLZService"
3) Now that the service list page link is known, the next step is to acquire the service page links of the services listed in the service list page. The text in the red box of figure 3-4 shows the internal link of the Web service "BLZService". However, this is not the complete link of the service page: it has to be prefixed with the initial URL address of the Service-Repository Web Service Registry, "http://www.service-repository.com". The code for getting the service page link of a Web service in Service-Repository is shown in figure 3-5. The final link of this service page is therefore "http://www.service-repository.com/service/overview/-210897616"; figure 3-6 is the corresponding service page of that link.
4) Afterwards, these two links, the service list page link and the service page link, which were gathered by the Web Service Extractor component, are immediately forwarded to the next two components, the WSDL Grabber component and the Property Grabber component.
322 The Function of WSDL Grabber Component
The WSDL Grabber component acquires the WSDL link hosted on the Web, based on the service list page link or the service page link. The whole process flow is illustrated in figure 3-7.
Figure 3-7 Overview of the process flow of the WSDL Grabber Component
When the WSDL Grabber component receives the inputs delivered from the previous component, it starts to get the WSDL link of the service based on these inputs. However, although the inputs of the WSDL Grabber component are the links of the service page and the service list page, only one of them contains the WSDL link of the corresponding service; that is to say, the WSDL link exists either in the service page or in the service list page. The reason why both links need to be delivered to this component is that exactly one of the five Web Service Registries, namely Biocatalogue, hosts the WSDL link in the service list page, while the other four Web Service Registries host it in the service page. Therefore, the WSDL links of these four Web Service Registries are obtained via the service page link, and for the Biocatalogue Web Service Registry the WSDL link is obtained via the service list page link. However, there is a problem with getting the WSDL links of the Biocatalogue Web Service Registry: in brief, some of the Web services listed in the service list pages of the Biocatalogue Web Service Registry have no WSDL link, in other words, these services have no WSDL document. In such a situation, the value of the WSDL link of these Web services is assigned a "NULL" value. For the Web services of the other four Web Service Registries, the WSDL link always exists in the service page. Eventually, whenever the WSDL Grabber component has extracted the WSDL link of a single Web service, the link is immediately forwarded to the Storage component so that the WSDL document can be downloaded at once.
3221 Features of the WSDL Grabber Component
The WSDL Grabber component has to provide the following feature:
• Obtain WSDL links
The WSDL link is the direct way to get to the page that contains the contents of the WSDL document. It is actually a URL address, but the end of this URL address usually contains something like "wsdl" or "WSDL" to indicate that it addresses the page of a WSDL document.
3222 Input of the WSDL Grabber Component
This component requires the following input data:
• Service list page link
• Service page link
3223 Output of the WSDL Grabber Component
The component produces only the following output data:
• The URL address of the WSDL link of each service
3224 Demonstration for WSDL Grabber Component
This section demonstrates a list of figures in order to give a comprehensive understanding of the process of the WSDL Grabber component. It uses the same Web service of the Service-Repository as an example.
1) The input for this WSDL Grabber component is the service page link obtained from the Web Service Extractor component. The address of this link is "http://www.service-repository.com/service/overview/-210897616".
Figure 3-8 WSDL link of the Web service "BLZService" in the service page
2) Figure 3-9 shows the corresponding original source code of the WSDL link highlighted in figure 3-8.
Figure 3-9 Original source code of the WSDL link for the Web service "BLZService"
3) Figures 3-10 and 3-11 show the code used to extract the WSDL link shown in figure 3-9. However, figure 3-10 is the particular code for the Service-Repository Web Service Registry only; for the other four Web Service Registries this code differs. The function "getServiceRepositoryWSDLLink" first gets a list of nodes that have the HTML tag name "b". Then it checks all of these nodes one by one to see whether the text value of a node is "WSDL". If a node fulfills this condition, the attribute value of its sibling, which is an "a" element here, is extracted as the value of the WSDL link of this Web service. A hedged sketch reconstructing this logic is given after the figure captions.
Figure 3-10 Code overview of the "getServiceRepositoryWSDLLink" function
Figure 3-11 Code overview of the "oneParameter" function
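Since figures 3-10 and 3-11 are not reproduced here, the following sketch only reconstructs the described logic. The thesis does not state which HTML parsing library the Java implementation uses; the jsoup library is used here purely for illustration.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Sketch of the described extraction: find a <b> node whose text is "WSDL"
// and read the link target from its sibling <a> element.
public class WsdlLinkSketch {
    static String getServiceRepositoryWsdlLink(String servicePageUrl) throws Exception {
        Document doc = Jsoup.connect(servicePageUrl).get();
        for (Element b : doc.select("b")) {
            if ("WSDL".equals(b.text())) {
                Element a = b.nextElementSibling();        // the sibling holding the link
                if (a != null && "a".equals(a.tagName())) {
                    return a.attr("href");                 // the WSDL link address
                }
            }
        }
        return null; // no WSDL link found on this page
    }
}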
4) Finally, after applying the functions shown in figures 3-10 and 3-11, the WSDL link of the Web service "BLZService" is obtained. Its value is "http://services.unitedplanet.de/blz/BlzService.asmx?WSDL".
323 The Function of Property Grabber Component
The Property Grabber component is a module used to extract and gather the Web service information hosted on the Web, i.e. the information shown in the service list page and the service page. In the end, all the obtained Web service information is collected together as the service properties, which are delivered to the Storage component for storing. The detailed process flow of the Property Grabber component is illustrated in figure 3-12.
The inputs of the Property Grabber component are the same as the inputs of the WSDL Grabber component. However, there is still a small difference between them with respect to the seed: as already mentioned in section 322, for the WSDL Grabber component one of the inputs is sufficient to get the WSDL link, whereas the Property Grabber component needs both of them as seeds.
Figure 3-12 Overview of the process flow of the Property Grabber Component
Once the Property Grabber component receives the needed inputs, it starts to extract the service information of that single Web service. Generally speaking, the service information consists of four aspects: structured information, endpoint information, monitoring information, and Whois information.
(1) Structured Information
The structured information is obtained by extracting the information hosted in the service page and the service list page. It is the basic descriptive information about the service, such as the name of the service, the URL address through which the WSDL document can be obtained, the description introducing the service, the provider who provides the service, its rating, the server that hosts this service, etc. However, the elements constituting this structured information differ among the Web Service Registries. For example, the rating information of a Web service exists in the Service-Repository Web Service Registry, while the Xmethods Web Service Registry does not have this information. In addition, even for Web services in the same Web Service Registry, some elements of the structured information may not exist; for instance, one service in a Web Service Registry may have a description while another service in the same registry does not. Tables 3-1 to 3-5 list the structured information that should be extracted from these five Web Service Registries. Moreover, if the style of a Biocatalogue Web service is SOAP, there is some additional information describing the SOAP operations of this service, while if it is REST, this additional information concerns the REST operations. This should also be considered part of the structured information; table 3-6 and table 3-7 list the information for these two kinds of operations.
Service Name | WSDL Link | WSDL Version
Provider | Server | Rating
Homepage | Owner Homepage | Description
Table 3-1 Structured Information of the Service-Repository Web Service Registry
Service Name | WSDL Link | Provider
Service Style | Homepage | Implementation Language
Description | User Description | Contributed Client Name
Type of this Client | Publisher for this Client | Used Toolkit of this Client
Used Language of this Client | Used Operation System of this Client
Table 3-2 Structured Information of the Xmethods Web Service Registry
Service Name | WSDL Link | Server
Provider | Provider's Country | Service Style
Rating | Description | User Description
Service Tags | Documentation (within WSDL)
Table 3-3 Structured Information of the Seekda Web Service Registry
Service Name | WSDL Link | Port Name
Service URL Address | Implementation Class
Table 3-4 Structured Information of the Ebi Web Service Registry
Service Name | WSDL Link | Style
Provider | Provider's Country | View Times
Favorite Times | Submitter | Service Tags
Total Annotation | Provider Annotation | Member Annotation
Registry Annotation | Base URL | SOAP Lab Server Base URL
Description | User Description | Category
Table 3-5 Structured Information of the Biocatalogue Web Service Registry
SOAP Operation Name | Inputs and Outputs | Operation Description
Operation Tags | Part of Which Service
Table 3-6 SOAP Operation Information of the Biocatalogue Web Service Registry
REST Operation Name | Service Tags | Used Template
Operation Description | Part of Which Service | Part of Which Endpoint Group
Table 3-7 REST Operation Information of the Biocatalogue Web Service Registry
(2) Endpoint Information
Endpoint information is the port information of the Web service, which can be extracted only from the service page. However, since different Web Service Registries structure the endpoint information of a Web service differently, some elements of the endpoint information are very diverse. One thing deserves attention: the Ebi Web Service Registry has no endpoint information for any of the Web services published in it. Moreover, although the Web services within one Web Service Registry share the same structure of endpoint information, some elements of the endpoint information may be missing or empty; these Web Service Registries may even have no endpoint information at all for some of their Web services. Nevertheless, whenever there is endpoint information for a Web service, it contains at least one element, namely the URL address of the endpoint. Table 3-8 shows the endpoint information that should be extracted from these five Web Service Registries.
Web Service Registry Name | Elements of the Endpoint Information
Service Repository | Endpoint Name, Endpoint URL, Endpoint Critical, Endpoint Type, Bound Endpoint
Xmethods | Endpoint URL, Publisher of this Endpoint, Contact Email of this Publisher, Implementation Language of this Endpoint
Seekda | Endpoint URL
Biocatalogue | Endpoint Name, Endpoint URL
Table 3-8 Endpoint Information of these five Web Service Registries
Web Service Registry Name | Elements of the Monitoring Information
Service Repository | Service Availability, Number of Downs, Total Uptime, Total Downtime, MTTR, MTBF, RTT Max of Endpoint, RTT Min of Endpoint, RTT Average of Endpoint, Ping Count of Endpoint
Seekda | Service Availability, Begin Time of Monitoring
Biocatalogue | Monitored Status of Endpoint, Overall Status of Service
Table 3-9 Monitoring Information of these five Web Service Registries
(3) Monitoring Information
Monitoring information is statistical information gathered by testing the Web service. It is worth noting that the Ebi and Xmethods Web Service Registries have no monitoring information for any of the Web services published by them, while for the other three Web Service Registries only a few Web services may lack this information. Table 3-9 displays the monitoring information for these three Web Service Registries.
(4) Whois Information
Whois information is not extracted from the information hosted in the service page and the service list page. It is descriptive information about the service domain, which can be gained by means of the address of the WSDL link. Because of that, the process of getting the Whois information starts with extracting the service domain. The final value of the service domain must not contain strings like "http://", "https://", or "www."; it must be the domain directly under the top-level domain. After that, the service domain database is queried by sending the value of the service domain to a Whois client, which is simply a Web site on the Internet, for example "http://www.whois365.com/cn/domain/". If information about that service domain exists, a list of its information is returned as the output. However, the structure of the returned information differs from service domain to service domain; therefore the most challenging task is that the extraction process has to deal with each different form of the returned information. Table 3-10 lists the Whois information that needs to be extracted for all five Web Service Registries; a sketch of the domain extraction step follows the table.
Service Domain URL | Domain Name | Domain Type
Domain Address | Domain Description | State
Postal Code | City | Country
Country Code | Phone | Fax
Email | Organization | Established Time
Table 3-10 Whois Information for these five Web Service Registries
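The following minimal sketch illustrates the domain extraction step that precedes the Whois query; the method name "getServiceDomain" is hypothetical, and reducing hosts with further subdomains (e.g. "services.example.com") to the registrable domain would need additional handling that is omitted here.

import java.net.URI;

// Sketch: reduce a WSDL link to the bare service domain that is sent to the Whois client.
public class ServiceDomainSketch {
    static String getServiceDomain(String wsdlLink) throws Exception {
        String host = new URI(wsdlLink).getHost();   // e.g. "www.thomas-bayer.com"
        if (host.startsWith("www.")) {
            host = host.substring(4);                // strip the "www." prefix
        }
        return host;                                 // e.g. "thomas-bayer.com"
    }

    public static void main(String[] args) throws Exception {
        // prints "thomas-bayer.com" for the BLZService example used in this chapter
        System.out.println(getServiceDomain("http://www.thomas-bayer.com/axis2/services/BLZService?wsdl"));
    }
}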
Finally, all the information from these four aspects is collected together and then delivered to the Storage component for further storage processing.
3231 Features of the Property Grabber Component
The Property Grabber component has to provide the following features:
• Obtain basic information
Generally speaking, the more information is available about a Web service, the better one can judge how good this Web service is. Hence it is necessary for this Property Grabber component to extract all the basic information hosted in the service list page and the service page. This basic information comprises the structured information, the endpoint information, and the monitoring information.
• Obtain Whois information
For the same reason, namely that more information about a Web service allows a better assessment of its quality, it is necessary to extract as much information about the Web service as possible. Therefore, besides the basic information, the Property Grabber component also obtains some additional information called Whois information, such as the type of the domain, the name of the person in charge, the postal code of the domain, the city, phone, fax, detailed address, email, etc.
3232 Input of the Property Grabber Component
This component requires the following input data:
• Service list page link
• Service page link
3233 Output of the Property Grabber Component
The component produces the following output data:
• Structured information of each service
• Endpoint information of each service, if it exists
• Monitoring information for the service and its endpoint, if it exists
• Whois information of the service domain
All this information is collected together as the properties of each service; thereafter, the collected properties are sent to the Storage component.
3234 Demonstration for Property Grabber Component
The pictures from figure 3-13 to figure 3-16 depict the fundamental procedure of the Property Grabber component. To simplify the explanation, the example in this section uses the same Web service as before.
1) The inputs of the Property Grabber component are the links of the service list page and the service page received from the Web Service Extractor component. These links are "http://www.service-repository.com" and "http://www.service-repository.com/service/overview/-210897616".
2) When the "GetProperty" function is triggered, it calls its four sub functions shown in figure 3-12.
Figure 3-13 Structured properties of the service "BLZService" in the service list page
Figure 3-14 Structured properties of the service "BLZService" in the service page
3) Firstly, the "getStructuredProperty" function tries to extract the structured information displayed in the red boxes of figure 3-13 and figure 3-14. However, several elements of the structured information have identical content, like the description shown in the service page and in the service list page; hence, in order to save extraction time and storage space, elements with the same content are extracted only once. Moreover, the rating information requires a transformation from non-descriptive content into descriptive text, because it is rendered as a row of star images; a hedged sketch of this transformation is given after table 3-11. The final results of the extracted structured information of this Web service are shown in table 3-11. Because there is no descriptive information for the provider, homepage, and owner homepage, their values are assigned as "NULL".
Service Name | BLZService
WSDL Link | http://www.thomas-bayer.com/axis2/services/BLZService?wsdl
WSDL Version | 0
Server | Apache-Coyote/1.1
Description | BLZService
Rating | Four Stars and A Half
Provider | NULL
Homepage | NULL
Owner Homepage | NULL
Table 3-11 Extracted Structured Information of the Web Service "BLZService"
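As an illustration of the rating transformation mentioned in step 3, the following sketch maps a count of star images to descriptive text; the method name and the half-star handling are assumptions, not taken from the actual implementation.

// Sketch: turn the number of full and half star images found in the page
// into the descriptive rating text stored as a service property.
public class RatingSketch {
    static String toDescriptiveRating(int fullStars, boolean halfStar) {
        String[] words = {"Zero", "One", "Two", "Three", "Four", "Five"};
        String text = words[fullStars] + (fullStars == 1 ? " Star" : " Stars");
        return halfStar ? text + " and A Half" : text;
    }

    public static void main(String[] args) {
        // prints "Four Stars and A Half", matching the value in table 3-11
        System.out.println(toDescriptiveRating(4, true));
    }
}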
4) Secondly, the "getEndpointProperty" function extracts the endpoint information displayed in the red box of figure 3-15. Although there is more than one endpoint record in the list, only one of them is extracted as the endpoint information. The reason is that this program is intended to extract as much information as possible, but the information should not contain redundancy; therefore only one record is extracted as the endpoint information, even if there are several endpoint records. Table 3-12 shows the final results of the endpoint information of this Web service.
Figure 3-15 Endpoint information of the Web service "BLZService" in the service page
Endpoint Name | BLZServiceSOAP12port_http
Endpoint URL | http://www.thomas-bayer.com:80/axis2/services/BLZService
Endpoint Critical | True
Endpoint Type | production
Bound Endpoint | BLZServiceSOAP12Binding
Table 3-12 Extracted Endpoint Information of the Web service "BLZService"
5) Then the monitoring information is extracted by invoking the "getMonitoringProperty" function. Figure 3-16 displays two types of monitoring properties: the red box above contains the monitoring information about the Web service, and the red box below lists the monitoring information of its endpoints. As already mentioned, only one endpoint statistic record is extracted. Besides, as can be seen in figure 3-16, there are two types of availability; they both represent the availability of this Web service, just like the availability shown in figure 3-14, so one of these availability values is sufficient. Table 3-13 shows the final results of this extraction process.
Figure 3-16 Monitoring Information of the Service "BLZService" in the service page
Service Availability | 100%
Number of Downs | 0
Total Uptime | 1 day 19 hours 19 minutes
Total Downtime | 0 second
MTBF | 1 day 19 hours 19 minutes
MTTR | 0 second
RTT Max of Endpoint | 141 ms
RTT Min of Endpoint | 0 ms
RTT Average of Endpoint | 57.7 ms
Ping Count of Endpoint | 112
Table 3-13 Extracted Monitoring Information of the Web service "BLZService"
6) Next, in order to extract the Whois properties, the "getWhoisProperty" function first has to gain the service domain from the WSDL link. For this Web service, the gained service domain is "thomas-bayer.com". The function then sends this service domain as input to the Whois client for the querying process, which returns a list of information about that service domain; see figure 3-17. Table 3-14 shows the extracted Whois information.
Figure 3-17 Whois Information of the service domain "thomas-bayer.com"
Service Domain URL | thomas-bayer.com
Domain Name | Thomas Bayer
Domain Type | NULL
Domain Address | Moltkestr. 40
Domain Description | NULL
State | NULL
Postal Code | 54173
City | Bonn
Country | NULL
Country Code | DE
Phone | +4922855525760
Fax | NULL
Email | info@predic8.de
Organization | predic8 GmbH
Established Time | NULL
Table 3-14 Extracted Whois Information of the service domain "thomas-bayer.com"
7) Finally, all the information from these four aspects is collected together as the service properties, and these service properties are then forwarded to the Storage component.
324 The Function of Storage Component
The Storage component uses the WSDL link from the WSDL Grabber component to download the WSDL document from the Web and then stores it on disk. In addition, the service properties from the Property Grabber component are also stored directly on disk, in three different manners, by this Storage component. Figure 3-18 illustrates the process flow of the Storage component.
When the Storage component receives the needed inputs, the mediator function "Storager" of the Storage component is triggered. It then transforms the service properties into three different output formats and stores them on disk; these output formats are an XML file, an INI file, and database records. Besides, it also tries to download the WSDL document from the URL address of the WSDL link and then stores the obtained WSDL document on disk as well. This "Storager" function is composed of four sub functions, namely the "getWSDL", "generateXML", "generateDatabase", and "generateINI" sub functions; each sub function is in charge of one aspect of the storage tasks.
Figure 3-18 Overview of the process flow of the Storage Component
(1) "getWSDL" sub function
The task of the "getWSDL" sub function is to download the WSDL document and then store it on disk. Above all, it has to get the content of the WSDL document. This procedure is done as follows. First, the "getWSDL" sub function checks whether the value of the received WSDL link equals "NULL". As already presented in section 322, if a Web service has no WSDL link, the value of its WSDL link is assigned "NULL". In that case, the sub function creates a WSDL document whose name is the service name appended with the mark "No WSDL Document"; obviously this document contains no content, it is an empty document. If the service does have a WSDL link, this sub function tries to connect to the Internet via the URL address of the WSDL link. Once it succeeds, all the content hosted on the Web is downloaded, stored on disk, and named simply with the name of the service. Otherwise, it creates a WSDL document whose name is prefixed with "Bad" before the service name. A hedged sketch of this logic is given below.
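The following is a minimal sketch of the described logic, not the actual code of the thesis (which is only shown as figure 3-19); the file naming details and the error handling are simplified.

import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Paths;

// Sketch of the "getWSDL" logic: handle the NULL link, the successful
// download, and the unreachable-link case with the described name markers.
public class GetWsdlSketch {
    static void getWsdl(String path, String serviceName, String wsdlLink) throws Exception {
        if ("NULL".equals(wsdlLink)) {
            // no WSDL link: create an empty, specially marked document
            Files.createFile(Paths.get(path, serviceName + " [No WSDL Document]"));
            return;
        }
        try (InputStream in = new URL(wsdlLink).openStream()) {
            // download succeeded: store the content under the service name
            Files.copy(in, Paths.get(path, serviceName));
        } catch (Exception e) {
            // download failed: mark the document with a "Bad" prefix
            Files.createFile(Paths.get(path, "Bad" + serviceName));
        }
    }
}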
(2) "generateXML" sub function
The "generateXML" sub function takes the service properties as input, transforms them into an XML file, and stores the file on disk under the name of the service plus ".xml". XML stands for eXtensible Markup Language, a markup language designed to transport and store data. The first line of an XML file is the XML declaration, which defines the XML version and the encoding used. For example, <?xml version="1.0" encoding="UTF-8"?> means this is an XML file whose version is 1.0 and whose encoding is the 8-bit Unicode Transformation Format character set. Besides, an XML file contains XML elements, each spanning everything from the element's start tag to its end tag. Moreover, an XML element can also contain other elements, simple text, or a mixture of both. However, an XML file must contain a root element that is the parent of all other elements. A sketch of such a transformation is given below.
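The following hedged sketch shows one way to serialize name-value service properties into such an XML file; the root element name "service" and the helper method are assumptions and do not reproduce the thesis's actual code (shown only as figure 3-20). Escaping of XML special characters is omitted for brevity.

import java.io.PrintWriter;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: write the service properties as a simple XML document with a
// declaration, one root element, and one child element per property.
public class GenerateXmlSketch {
    static void generateXml(String fileName, Map<String, String> properties) throws Exception {
        try (PrintWriter out = new PrintWriter(fileName, "UTF-8")) {
            out.println("<?xml version=\"1.0\" encoding=\"UTF-8\"?>");
            out.println("<service>");                    // the mandatory root element
            for (Map.Entry<String, String> p : properties.entrySet()) {
                out.printf("  <%s>%s</%s>%n", p.getKey(), p.getValue(), p.getKey());
            }
            out.println("</service>");
        }
    }

    public static void main(String[] args) throws Exception {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("ServiceName", "BLZService");
        props.put("Rating", "Four Stars and A Half");
        generateXml("BLZService.xml", props);
    }
}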
(3) "generateINI" sub function
The "generateINI" sub function likewise takes the service's properties as input, but it transforms them into an INI file and stores it on disk under the name of the service plus ".ini". "ini" stands for initialization. The INI file format is a de facto standard for configuration files; INI files are just simple text files with a basic structure. Generally speaking, an INI file contains three different kinds of parts: sections, parameters, and comments. The parameter is the basic element contained in an INI file. Its format is a key-value pair, also called a name-value pair; the pair is delimited by an equals sign "=", and the key (or name) always appears to the left of the equals sign. A section is like a room that groups all its parameters together. It always appears on a single line within a pair of square brackets "[]", and sections may not be nested. A comment is descriptive text that begins with a semicolon ";"; anything between the semicolon and the end of the line is ignored. A small hypothetical example is shown below.
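To make this structure concrete, a small hypothetical INI fragment with a comment, one section, and two parameters could look as follows (the section and key names are illustrative only):

    ; properties extracted for one Web service
    [Service]
    ServiceName=BLZService
    Rating=Four Stars and A Half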
(4) "generateDatabase" sub function
The inputs of the "generateDatabase" sub function are the same as those of the previous two sub functions. Instead of transforming them into a file like XML or INI, the "generateDatabase" sub function turns them into database records by using SQL statements. SQL stands for Structured Query Language, a database language designed for accessing and manipulating data in a database. Most of the actions performed on a database are done with SQL statements; the primary SQL statements include insert into, delete, update, select, create, alter, and drop. Therefore, in order to transform the service properties into database records, this sub function first has to create a database using the "create database" statement. Then it creates a table to store the data. A table is a collection of related data entries and consists of columns and rows. Since the data of these five Web Service Registries are not very large, one database table is enough for storing the service properties. Because of that, the column names for the service properties must be uniform and well-defined across all five Web Service Registries. Afterwards, the service properties of each single service can be put into the table as a record with the "insert into" statement of SQL. A JDBC-based sketch of this flow is given below.
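The following hedged sketch shows this flow with JDBC; the thesis does not name the database system it uses, so the SQLite connection URL, the table name, and the reduced column set here are assumptions for illustration only.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.Statement;

// Sketch: create one table for the services of all registries (all columns
// of type TEXT, as described above) and insert one record per service.
public class GenerateDatabaseSketch {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection("jdbc:sqlite:dwsc.db")) {
            try (Statement st = con.createStatement()) {
                st.executeUpdate("CREATE TABLE IF NOT EXISTS services ("
                        + "service_name TEXT, wsdl_link TEXT, description TEXT)");
            }
            String sql = "INSERT INTO services (service_name, wsdl_link, description) VALUES (?, ?, ?)";
            try (PreparedStatement ps = con.prepareStatement(sql)) {
                ps.setString(1, "BLZService");
                ps.setString(2, "http://www.thomas-bayer.com/axis2/services/BLZService?wsdl");
                ps.setString(3, "BLZService");
                ps.executeUpdate();
            }
        }
    }
}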
3241 Features of the Storage Component
The Storage component has to provide the following features:
• Generate different output formats
The final result of this program is the information of the services stored on disk for future work. This Storage component provides three different formats for storing the service properties of the services published in the Web Service Registries, which makes the storage of the services very flexible and also durable.
• Obtain the WSDL document
The important task of this component is to check whether the WSDL document can be obtained from the WSDL link, because the WSDL document plays a decisive role in determining the quality of the service. This Storage component provides the ability to deal with the different situations occurring in the process of obtaining the WSDL document.
3242 Input of the Storage Component
This component requires the following input data:
• WSDL link of each service
• The property information of each service
3243 Output of the Storage Component
The component produces the following output data:
• WSDL document of each service
• XML document, INI file, and table records in the database
3244 Demonstration for Storage Component
The following figures describe the fundamental implementation code of this Storage component. The detailed depiction is illustrated as follows:
1) As can be seen from figure 3-19 to figure 3-21, there are several commonalities among the implementation codes. The first one concerns the parameters defined in each of these sub functions, "path" and "SecurityInt". The parameter "path" is the absolute path on the computer disk; it is used for the storing procedure and has already been specified by the user at the beginning of the whole program. The parameter "SecurityInt" is an increasing integer that is used as part of the name of the service; the reason for this is that it prevents services with the same name from overriding each other on disk. The content marked red in the code of these figures is the second commonality: its function is to create a file or document for the service with the corresponding parameters.
2) Figure 3-19 displays the implementation code of the "getWSDL" sub function, which is designed to get the WSDL document of the service based on its WSDL link. The parameter "name" is the name of the service, which is used as the name of the WSDL document. The parameter "linkStr" is the actual WSDL link, the most important parameter of this sub function. The other two parameters, "statistic" and "log", are the objects of the text files called "Statistic Information" and "Log Information" respectively. The "Statistic Information" text file is used to record the statistical data of the services in each Web Service Registry, such as the overall number of properties for that Web Service Registry, the overall number of services, the number of services that have no WSDL link, the number of services whose WSDL document contains no content, the number of services whose WSDL link is not available, etc. The "Log Information" text file records the results of the process steps and the problems encountered, for example which service is being crawled at the moment, which Web Service Registry it belongs to, the reason why the WSDL document of a service cannot be obtained, the reason why a service is unreachable, and so on.
Figure 3-19 Implementation code for getting the WSDL document
3) Figure 3-20 and figure 3-21 show the code for turning the service properties into the XML file and the INI file, and for storing those two files on disk thereafter. The parameter "vec" is a Vector of the "PropertyStruct" data type. This "PropertyStruct" data type is a class consisting of two variables: name and value.
Figure 3-20 Implementation code for generating the XML file
Figure 3-21 Implementation code for generating the INI file
4) The code in figure 3-22 and figure 3-23 shows the process that turns the service properties of all services in these five Web Service Registries into records of the database. To that end, a database has to be created first. The name of the database can be arbitrary as long as it conforms to the naming rules of the database system; the same holds for the name of the table. Figure 3-22 displays the procedure of creating a table in the database. Since it is hard to decide on the length of each service property, the data types of all service property columns are set to "Text". Figure 3-23 shows the code for inserting the service properties into the table, i.e. executing the "insert into" statement.
Figure 3-22 Implementation code for creating the table in the database
Figure 3-23 Implementation code for generating the table records
33 Multithreaded Programming for DWSC
Multithreaded programming is a built-in characteristic provided by the Java language. A multithreaded program contains two or more separate parts that can execute concurrently; each such part is called a thread. The use of multithreading makes it possible to create programs that use the system resources efficiently, for example making maximum use of the CPU, because the idle time of the CPU can be kept to a minimum.
In this program, there are five Web Service Registries that need to be crawled for services, and the number of services published in each Web Service Registry differs considerably; consequently, the running time spent on each Web Service Registry differs as well. With sequential execution, a Web Service Registry that owns fewer services would have to wait until another Web Service Registry with many more services has finished. Therefore, in order to reduce the waiting time for the other Web Service Registries and to maximize the use of the system resources, it is necessary to apply multithreaded programming in this program. That is to say, this program creates a thread for each Web Service Registry, and these threads execute independently, as sketched below.
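A minimal sketch of this one-thread-per-registry structure could look as follows; the method "crawlRegistry" is an illustrative stand-in for the actual registry-specific crawling code.

// Sketch: start one independent crawler thread per Web Service Registry.
public class CrawlerThreads {
    public static void main(String[] args) throws InterruptedException {
        String[] seeds = {
            "http://www.biocatalogue.com", "http://www.ebi.ac.uk",
            "http://www.seekda.com", "http://www.service-repository.com",
            "http://www.xmethods.net"
        };
        Thread[] threads = new Thread[seeds.length];
        for (int i = 0; i < seeds.length; i++) {
            final String seed = seeds[i];
            threads[i] = new Thread(() -> crawlRegistry(seed));   // one thread per registry
            threads[i].start();
        }
        for (Thread t : threads) {
            t.join();   // wait until every registry has been crawled
        }
    }

    static void crawlRegistry(String seed) {
        // registry-specific crawling process (omitted)
    }
}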
34 Sleep Time Configuration for Web Service Registries
Since this program is intended to download the WSDL documents and to extract the service information of the Web services published in the Web Service Registries, it inevitably affects the performance of the Web Service Registries. In addition, in order not to exceed their throughput capacity, these Web Service Registries will surely restrict the rate of access. Because of that, unknown errors sometimes happen while this program is executing: for instance, the program keeps halting at one point without getting any more WSDL documents and service information, the WSDL documents of some services of some Web Service Registries cannot be obtained, or some service information is missing. Therefore, in order to obtain the largest stock of Web services published in these five Web Service Registries without affecting their throughput, the access rate has to be configured for each service of all Web Service Registries.
In consequence, before going into the essential procedure for each single service of these Web Service Registries, this program calls the system's built-in function "sleep(long milliseconds)". It is a public static method that causes the currently executing thread to sleep for the specified number of milliseconds; in other words, it temporarily ceases execution for a while. The following table shows the time interval of the sleep function for each Web Service Registry; a usage sketch is given after the table.
Web Service Registry Name | Time Interval (milliseconds)
Service Repository | 8000
Ebi | 3000
Xmethods | 10000
Seekda | 20000
Biocatalogue | 10000
Table 3-15 Sleep Time of these five Web Service Registries
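As a sketch, this throttling reduces to a single call inside each registry's per-service loop; the interval values come from table 3-15, and the surrounding method name is illustrative.

// Sketch: throttle the crawl by sleeping before each service is processed.
// Thread.sleep is the built-in "sleep(long milliseconds)" method mentioned above.
public class ThrottleSketch {
    static void crawlService(String servicePageLink, long sleepMillis) throws InterruptedException {
        Thread.sleep(sleepMillis);   // e.g. 8000 ms for Service Repository, 20000 ms for Seekda
        // ... essential per-service procedure: get WSDL link, extract properties, store ...
    }
}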
4 Experimental Results and Analysis
This chapter shows the quantitative experimental results of the prototype presented in chapter 3; the analysis and explanation of these results are also given. In order to obtain rather accurate results, the experiments were carried out more than five times, and all the data displayed in the following tables and charts are the averages over these runs.
41 Statistic Information for Different Web Service Registries
This section discusses the amount statistics of the Web services published in these five Web Service Registries. This includes the overall number of Web services published in each Web Service Registry and the number of unavailable Web services, which have been archived because they may not be active anymore or are close to being inactive. Table 4-1 shows the service amount statistics of these five Web Service Registries.
Web Service Registry    Service Repository   Ebi   Xmethods   Seekda   Biocatalogue
Overall Services        57                   289   382        853      2567
Unavailable Services    0                    0     0          0        125

Table 4-1: Service amount statistics of these five Web Service Registries
In order to give an intuitive view of the service amount statistics of these five Web Service Registries, figure 4-1 presents the data of table 4-1 as a bar chart. As can be seen from the bar chart, on the one hand the overall number of Web services increases from the Service Repository Web Service Registry to the Biocatalogue Web Service Registry, with Biocatalogue owning by far the largest number of Web services. This indicates that the Biocatalogue Web Service Registry has a much more powerful ability to provide Web services to its users, because it contains far more services than the other four Web Service Registries. On the other hand, no unavailable services exist in any of the Web Service Registries except Biocatalogue; that is to say, only the Biocatalogue Web Service Registry contains Web services that cannot be used by the users anymore. To some degree this is wasteful, because these services cannot be used any longer while still consuming network resources on the Web. Therefore, all these unavailable services should be eliminated in order to reduce this waste of network resources.
Figure 4-1: Service amount statistics of these five Web Service Registries
4.2 Statistic Information for WSDL Documents
Web Service Registry    Service Repository   Ebi   Xmethods   Seekda   Biocatalogue
Failed WSDL Links       1                    0     23         145      32
Without WSDL Links      0                    0     0          0        16
Empty Content           0                    0     2          0        2

Table 4-2: Statistic information for WSDL Documents
Table 4-2 and figure 4-2 show the error statistics of the WSDL documents of the Web services in these five Web Service Registries. There are three aspects to this. The first one is the "Failed WSDL Links" of the Web services in these Web Service Registries: the overall number of Web services whose WSDL links are invalid. In other words, it is impossible to obtain the WSDL documents of these Web services via the URL addresses of their WSDL links, and therefore no WSDL document will be created for them. The second aspect is the "Without WSDL Links" of the Web services in these Web Service Registries: the overall number of Web services in each registry that have no WSDL link at all, so that no WSDL document exists for them and the value of the WSDL link of such a Web service is "NULL". A WSDL document will nevertheless be created, but it has no content, and its name will contain the string "[No WSDL Document]". The third aspect is the "Empty Content", which represents the overall number of Web services that have WSDL links whose URL addresses
are valid, but whose WSDL documents contain no content. In this case a WSDL document whose name contains the string "(BAD)" will be created.
Figure 4-2: Statistic information for WSDL Documents
4.3 Comparison of the Different Average Numbers of Service Properties
This section compares the average number of service properties across these five Web Service Registries. This average number of service properties is calculated by means of the following equation:

    ASP = ONSP / ONS    (1)

Where
ASP is the average number of service properties for one Web Service Registry,
ONSP is the overall number of service properties in one Web Service Registry,
ONS is the overall number of Web services that have already been crawled from the corresponding Web Service Registry.
Figure 4-3 shows the average number of service properties per Web service in these five Web Service Registries. As already mentioned, one of the measurements for assessing the quality of the Web services in a Web Service Registry is the amount of service information: the more information there is about a Web service, the better users know that service, and consequently the better the quality of the Web services the corresponding Web Service Registry can offer to its users. As can be seen from figure 4-3, the Service Repository and Biocatalogue Web Service Registries own a larger number of service properties than the other three Web Service Registries. This directly reflects that these two registries provide more detailed information about the Web services published in them, so that users can choose the services they need more easily and are more likely to use the Web services published there. By contrast, the Xmethods and Seekda
Web Service Registries, which provide less information about their Web services, offer a lower quality for these Web services. Therefore, users may be less inclined to use the Web services provided in these two Web Service Registries, not to mention the Web services published in the Ebi Web Service Registry.
Figure 4-3: Average Number of Service Properties
Following the description presented in section 3.2.3, the causes of the differing numbers of service properties in these Web Service Registries may consist of the following points. First, the amount of structured information differs between the Web services of these five Web Service Registries, and part of the information of some Web services in a registry may even be missing or have an empty value; for example, the amount of structured information that is supposed to be extracted for the Web services in the Ebi Web Service Registry differs greatly from that in the Biocatalogue Web Service Registry. Secondly, there is a certain amount of endpoint information in almost all Web Service Registries except the Ebi Web Service Registry, whose absence more or less reduces the overall number of service properties. Thirdly, some Web Service Registries, such as Xmethods and Ebi, have no monitoring information, whereas the Service Repository Web Service Registry in particular has a large amount of monitoring information about its Web services that can be extracted from the Web. The last point, obviously, is the amount of whois information for these Web services: if the database of the whois client does not contain information about the service domain of a Web service in a registry, then no whois information can be extracted. Moreover, even if information about a service domain exists, its amount can vary greatly. Therefore, if a Web Service Registry is in the situation that many service domains of its Web services have no or only little whois information, then the average number of service properties in that registry decreases greatly.
As a result, in order to help users become better acquainted with the Web services provided in a Web Service Registry and to distinguish between these Web services, the Web Service Registry should do its best to offer as much information as possible for each of its published Web services.
[Figure 4-3 data: average number of properties per service, Service Repository 23, Ebi 7, Xmethods 17, Seekda 17, Biocatalogue 32.]
4.4 Different Outputs of Web Services
The fundamental task of this master program is to obtain the WSDL documents of the Web services hosted in the Web Service Registries, to extract and gather the properties of these Web services, and thereafter to store them on disk. This section therefore describes the different outputs of this master program, which comprise the WSDL documents of the Web services, the generated XML and INI files, and the data records of the service properties in the database.
Figure 4-4: WSDL document format of one Web service
The WSDL document of a Web service is simply read from the Web according to the URL address of its WSDL link, and these data are then stored on disk as the WSDL document. The name of the WSDL document is the service name plus the ending ".wsdl", which marks it as a WSDL document. In addition, in order to distinguish WSDL documents whose names would be identical although their contents differ, the name of each obtained WSDL document within one Web Service Registry is prefixed with a unique integer. Figure 4-4 shows the valid WSDL document format of one Web service; its name is "1BLZService.wsdl". A small sketch of this download and naming scheme is given below.
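The following minimal Java sketch illustrates reading the WSDL data from the URL address of the WSDL link and storing it under a counter-prefixed name; the error handling and the counter scheme are illustrative assumptions.

```java
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class WsdlDownloader {
    private int counter = 1;

    // Read the WSDL document from the Web via the URL address of its WSDL
    // link and store it on disk under a unique, counter-prefixed name.
    public Path download(String wsdlUrl, String serviceName) throws Exception {
        Path target = Paths.get(counter++ + serviceName + ".wsdl");  // e.g. "1BLZService.wsdl"
        try (InputStream in = new URL(wsdlUrl).openStream()) {
            Files.copy(in, target);   // write the raw WSDL data to disk
        }
        return target;
    }
}
```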
Besides, the obtained service properties are transformed into an XML file, an INI file and data records in the database. Figure 4-5, figure 4-6 and figure 4-7 show these three different output formats respectively. As can be seen from figure 4-5, this is the INI file of the Web service, whose name is "1BLZService.ini"; the integer is the same as in the WSDL document, because both files belong to the same Web service. The first three lines of the INI file are service comments, which run from a semicolon to the end of the line; they give the basic information describing this INI file. The single line after them is the section, enclosed in a pair of brackets; it is important because it indicates that the lines following it carry the information of this Web service. The rest of the lines therefore hold the actual service information as key-value pairs, with an equals sign between key and value, and each service property starts at the beginning of a line. A hypothetical sample of this layout follows.
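A hypothetical sample of such an INI file is sketched below; the comment lines, section name and properties are illustrative assumptions, not actual crawler output.

```ini
; Service information of the Web service BLZService
; Crawled from the Service Repository Web Service Registry
; Generated by the Deep Web Service Crawler

[service]
Name=BLZService
Provider=thomas-bayer.com
WSDL Link=http://www.thomas-bayer.com/axis2/services/BLZService?wsdl
```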
Figure 4-5: INI file format of one Web service
Figure 4-6: XML file format of one Web service
Figure 4-7: Database data format for all Web services
Furthermore, figure 4-6 shows the generated XML file format of the Web service "BLZService"; its name is "1BLZService.xml". Needless to say, this XML file is part of the materials of the same Web service. Although the format of the XML file differs from that of the INI file, their essential contents are the same; that is, the values of the service properties do not differ, because both files are generated from the same collection of properties of the same Web service. The XML file also has comments like those in the INI file, which are displayed between "<!--" and "-->", and the section of the INI file corresponds to the root of the XML file. Therefore, all values of the elements under the root "service" in this XML file are the values of the service properties of this Web service. A hypothetical sample corresponding to the INI sketch above is given below.
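A hypothetical XML counterpart of the INI sketch above could look as follows; the element names and values are again illustrative assumptions, not actual crawler output.

```xml
<!-- Service information of the Web service BLZService -->
<!-- Generated by the Deep Web Service Crawler -->
<service>
  <Name>BLZService</Name>
  <Provider>thomas-bayer.com</Provider>
  <WSDLLink>http://www.thomas-bayer.com/axis2/services/BLZService?wsdl</WSDLLink>
</service>
```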
Finally, figure 4-7 shows the database table that is used to store the service information of all Web services in these five Web Service Registries; the entire service information of one Web service forms exactly one record in this table. Because of that, the column names of the table are the union of the names of the service information fields of each Web Service Registry. However, since column names must be unique, the redundant names in this union have to be eliminated. This is sensible and possible because the names of the service information fields are well defined and uniform across all five Web Service Registries. In addition, the first column of this table is the primary key, an increasing integer whose function resembles the integer contained in the names of the XML and INI files, while the remaining columns are the corresponding service properties of the Web services. The value "NULL" or "Null" in this table indicates that the respective property of that Web service is empty or missing. A hedged SQL sketch of such a table follows.
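The following SQL sketch illustrates the shape of that table under the assumptions just described; the concrete property columns are illustrative, since the real column set is the deduplicated union of all property names of the five registries.

```sql
-- Illustrative sketch only; the real column set is the deduplicated union of
-- the service property names of all five Web Service Registries.
CREATE TABLE web_services (
    id           INTEGER PRIMARY KEY AUTO_INCREMENT, -- counterpart of the integer in the file names
    Name         VARCHAR(255),
    Provider     VARCHAR(255),
    WSDL_Link    TEXT,
    Availability VARCHAR(64)
    -- ... one further column per remaining service property;
    -- NULL marks a property that is empty or missing for a service.
);
```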
4.5 Comparison of the Average Time Cost for Different Parts of a Single Web Service
This section compares the average time cost of the different parts of getting one single Web service in all these five Web Service Registries. First, the average time cost of getting one single service in a Web Service Registry has to be calculated, which can be done through the following equation:

    ATC = OTS / ONS    (2)

Where
ATC is the average time cost for one single Web service,
OTS is the overall time cost of all the Web services in one Web Service Registry,
ONS is the overall number of Web services that have already been crawled from the corresponding Web Service Registry.
In addition, the average time cost of getting one single service consists of the following six parts: the average time cost for extracting the service properties, the average time cost for obtaining the WSDL document, the average time cost for generating the XML file, the average time cost for generating the INI file, the average time cost for inserting the service properties into the database table, and the average time cost for the remaining procedures, such as getting the service list page link, getting the service page link, and so on. The average time cost for extracting the service properties is obtained by means of the following equation:
    ATCSI = OTSSI / ONS    (3)

Where
ATCSI is the average time cost for extracting the service properties of one single Web service,
OTSSI is the overall time cost for extracting the service properties of all the Web services in one Web Service Registry,
ONS is the overall number of Web services that have already been crawled from the corresponding Web Service Registry.
The calculation of the other parts is analogous to the equation for the average time cost of extracting the service properties, while the average time cost of the remaining procedures equals the average time cost of one single Web service minus the sum of the average time costs of the other five parts, as the relation below makes explicit.
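Written out, with ATC_prop, ATC_wsdl, ATC_xml, ATC_ini and ATC_db denoting the five measured parts, this relation reads:

\[
\mathrm{ATC}_{\mathrm{others}} = \mathrm{ATC} - \left( \mathrm{ATC}_{\mathrm{prop}} + \mathrm{ATC}_{\mathrm{wsdl}} + \mathrm{ATC}_{\mathrm{xml}} + \mathrm{ATC}_{\mathrm{ini}} + \mathrm{ATC}_{\mathrm{db}} \right)
\]

As a consistency check against the Service Repository row of table 4-3: 10042 - (8801 + 918 + 2 + 1 + 53) = 267 milliseconds, which is exactly its "Others" entry.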
Registry             Service property   WSDL Document   XML File   INI File   Database   Others   Overall
Service Repository   8801               918             2          1          53         267      10042
Ebi                  699                82              2          1          28         11       823
Xmethods             5801               1168            2          1          45         12       7029
Seekda               5186               1013            2          1          41         23       6266
Biocatalogue         39533              762             2          1          66         1636     42000

Table 4-3: Average time cost information (in milliseconds) for all Web Service Registries
Table 4-3 displays the average time cost of one Web service and of its different parts in all these five Web Service Registries. The first column of table 4-3 names the five Web Service Registries, and the last column contains the average time cost of a single service in each Web Service Registry, while the remaining columns hold the average time costs of the six different parts. In order to give an intuitive view of the data in table 4-3, each column of this table is illustrated by a corresponding figure; see figure 4-8 to figure 4-13.
Figure 4-8: Average time cost for extracting the service properties in all Web Service Registries
As can be seen from figure 4-8, the average time cost for extracting the service properties in the Biocatalogue Web Service Registry is 39533 milliseconds. This value is much larger than in the other four Web Service Registries, which take 8801, 699, 5801 and 5186 milliseconds for Service Repository, Ebi, Xmethods and Seekda respectively. That is to say, it takes much longer to extract the service properties of the Web services published by the Biocatalogue Web Service Registry. This also indirectly indicates that the Biocatalogue Web Service Registry has the highest average number of service properties, which was already discussed in section 4.3; conversely, the average number of service properties in the Ebi Web Service Registry is the smallest. Moreover, the average time cost in the Xmethods Web Service Registry is larger than in the Seekda Web Service Registry although, as already shown, the average number of service properties is the same for these two registries. One cause that might explain why Xmethods costs more time than Seekda is that the extraction of the service properties in the Xmethods Web Service Registry has to proceed via both the service page and the service list page, while for the Seekda Web Service Registry only the service page link is needed.
The average time cost for obtaining the WSDL document in all these five Web Service Registries is displayed in figure 4-9. This average time cost is in fact the sum of the average time for extracting the WSDL link of a Web service and the average time for reading the data of the WSDL document from the Web and storing it on disk. As seen from this figure, the average time cost for obtaining the WSDL document is largest in the Xmethods Web Service Registry, at 1168 milliseconds. Although the extraction of the WSDL link costs a certain amount of time, it has no significant influence on the total average time spent obtaining the WSDL document, because the WSDL link of a Web service is almost always gained in a single step. This implies that the average size of the WSDL documents in the Xmethods Web Service Registry is larger than in the other four Web Service Registries, in particular the Ebi Web Service Registry, which takes just 82 milliseconds for obtaining the WSDL document.
Figure 4-9: Average time cost for obtaining the WSDL document in all Web Service Registries
Figure 4-10, figure 4-11 and figure 4-12 show the average time cost of generating the three different outputs in all these five Web Service Registries. As can be seen from figures 4-10 and 4-11, the average time for generating the XML file of one Web service is the same in all five Web Service Registries, namely only 2 milliseconds, and the average time for generating the INI file of one Web service is likewise identical everywhere, at just 1 millisecond. Even the sum of these two average time costs is so small that it can be neglected when compared with the overall average time cost of getting one Web service in each corresponding Web Service Registry, as shown in figure 4-13. This implies that the generation of the XML and INI files is finished immediately after the service properties of a Web service have been received. Furthermore, figure 4-12 shows that, although the average time cost of creating the database record of a Web service is larger in all five Web Service Registries than that of generating the XML and INI files, the database insert operation is still fast.
Figure 4-10: Average time cost for generating the XML file in all Web Service Registries
Figure 4-11: Average time cost for generating the INI file in all Web Service Registries
Figure 4-12: Average time cost for creating the database record in all Web Service Registries
Figure 4-13: Average time cost for getting one Web service in all Web Service Registries
Figure 4-13 gives the average time cost for getting one single Web service in all these five Web Service Registries. Without any doubt, the Biocatalogue Web Service Registry takes the longest time for this process: as the presentation of the five different parts above shows, each part needs more time in the Biocatalogue Web Service Registry than elsewhere, except for the process of obtaining the WSDL document, where Biocatalogue is not the most expensive. Moreover, a striking observation emerges when looking at figures 4-8, 4-12 and 4-13: the shapes of these curves follow almost the same trend. This further indicates that a Web Service Registry that spends more time on getting the description information of a Web service also offers more information about that Web service.
5 Conclusion and Further Directions
This master thesis provides a schema that aims to explore the description information of the Web services hosted in different Web Service Registries. The description information of a Web service consists of its WSDL document and the service information about that Web service. Although there is an existing approach, the Pica-Pica Web Service Description Crawler, that can be used to obtain the WSDL documents and the service information of the Web services in these Web Service Registries, its functionality is restricted: it explores only a small subset of the Web services hosted in these Web Service Registries, the storage of these Web services is not flexible, and, most importantly, only a little service information is extracted per Web service, while for some Web Service Registries no service information is extracted at all. The work presented in this master thesis, by contrast, is able to explore all Web services in these Web Service Registries. Moreover, with the approach presented in this master thesis, as much service information as possible is extracted for each Web service, so that the final result is the largest annotated service catalogue ever produced. Furthermore, regarding the storage of the description information of the Web services, this master thesis provides three different formats that guarantee not only the completeness but also the longevity of the description information of each Web service.
However, in the implementation developed for this master thesis, the whois client used for querying the information of a service domain returns free text if the information exists, and sometimes this free text differs completely from domain to domain. As a consequence, every Web service in all Web Service Registries had to be crawled at least once during the experiment stage, so that all variants of this free text could be foreseen and processed afterwards. This is nevertheless a huge amount of work, because there are lots of Web services in these Web Service Registries. Therefore, in order to simplify this work, another whois client that eases this task needs to be found and used.
Moreover, in the experiment stage of this master thesis, the time cost for getting a Web service is still large. For the purpose of reducing this time, multithreaded programming could also be applied to some parts of the process of getting one single Web service.
Although the work performed here is specialized for only these five Web Service Registries, the main principles used here are adaptable to other Web Service Registries with only small changes in the implementation code or structure.
6 Bibliography
[1] Nathalie Steinmetz, Holger Lausen, Dario Cerizza, Andrea Turati, Irene Celino, Adam Funk and Michael Erdmann: "D1.1 - Requirement Analysis and Architectural Plan". Emanuele Della Valle (CEFRIEL), June 27, 2008. Available from http://www.service-finder.eu/deliverables/public/7-public/32-d11
[2] Nathalie Steinmetz, Holger Lausen, Dario Cerizza, Andrea Turati, Irene Celino, Adam Funk and Michael Erdmann: "D1.2 - First Design of Service-Finder as a Whole". Emanuele Della Valle (CEFRIEL), July 1, 2008. Available from http://www.service-finder.eu/deliverables/public/7-public/37-d12-first-design-of-service-finder-as-a-whole
[3] Nathalie Steinmetz, Holger Lausen, Irene Celino, Dario Cerizza, Saartje Brockmans, Adam Funk: "D1.3 - Revised Requirement Analysis and Architectural Plan". Emanuele Della Valle (CEFRIEL), April 1, 2009. Available from http://www.service-finder.eu/deliverables/public/7-public/57-d13-revised-requirement-analysis-and-architectural-plan
[4] Chia-Hui Chang, Mohammed Kayed, Moheb Ramzy Girgis, Khaled Shaalan: "A Survey of Web Information Extraction Systems". Volume 18, Issue 10, IEEE Computer Society, pp. 1411-1428, October 2006.
[5] Leonard Richardson: "Beautiful Soup Documentation". October 13, 2008. Available from http://www.crummy.com/software/BeautifulSoup/documentation.html
[6] Hao He, Hugo Haas, David Orchard: "Web Services Architecture Usage Scenarios". February 11, 2004. Available from http://www.w3.org/TR/ws-arch-scenarios/
[7] Stephen Soderland: "Learning Information Extraction Rules for Semi-Structured and Free Text". Machine Learning, Volume 34, Issue 1-3, pp. 233-272, February 1999. Department of Computer Science and Engineering, University of Washington, Seattle.
[8] Ian Hickson: "A Vocabulary and Associated APIs for HTML and XHTML". World Wide Web Consortium, Working Draft WD-html5-20100624, January 22, 2008.
[9] Holger Lausen, Jos de Bruijn, Axel Polleres, Dieter Fensel: "The Web Service Modeling Language WSML". WSML Deliverable D16.1 v0.2, March 20, 2005. Available from http://www.wsmo.org/TR/d16/d16.1/v0.2/
[10] Dumitru Roman, Holger Lausen, Uwe Keller: "Web Service Modeling Ontology - Standard (WSMO-Standard)". WSMO Deliverable D2 version 1.1, March 6, 2004. Available from http://www.wsmo.org/TR/d2/v1.1/
[11] Iris Braun, Anja Strunk, Gergana Stoyanova, Bastian Buder: "ConQo - A Context- and QoS-Aware Service Discovery". TU Dresden, Department of Computer Science, in Proceedings of WWW/Internet 2008.
7 Appendixes
There are two additional outputs of this master program: a log information file and a statistic report file. Figure 8-1 shows one of the basic output formats of the log information for these five Web Service Registries.
Figure 8-1: Log information of the "Service Repository" Web Service Registry
Figure 8-2: Statistic information of the "Service Repository" Web Service Registry
Figure 8-2 to figure 8-6 show the output format of the statistic information for the "Service Repository", "Ebi", "Xmethods", "Seekda" and "Biocatalogue" Web Service Registries respectively.
Figure 8-3: Statistic information of the "Ebi" Web Service Registry
Figure 8-4: Statistic information of the "Xmethods" Web Service Registry
Figure 8-5: Statistic information of the "Seekda" Web Service Registry
Figure 8-6: Statistic information of the "Biocatalogue" Web Service Registry
Table of Figures
Figure 2-1: Dataflow of Service-Finder and Its Components ... 12
Figure 2-2: Left is the free text input type and right is its output ... 16
Figure 2-3: A semi-structured page containing data records (in rectangular box) to be extracted ... 16
Figure 2-4: Architecture of the Pica-Pica Web Service Description Crawler ... 20
Figure 3-1: Overview of the basic architecture for the Deep Web Services Crawler ... 25
Figure 3-2: Overview of the process flow of the Web Service Extractor Component ... 27
Figure 3-3: Service list page of the Service-Repository ... 29
Figure 3-4: Original source code of the internal link for Web service "BLZService" ... 29
Figure 3-5: Code overview of getting the service page link in Service Repository ... 29
Figure 3-6: Service page of the Web service "BLZService" ... 29
Figure 3-7: Overview of the process flow of the WSDL Grabber Component ... 30
Figure 3-8: WSDL link of the Web service "BLZService" in the service page ... 31
Figure 3-9: Original source code of the WSDL link for Web service "BLZService" ... 32
Figure 3-10: Code overview of the "getServiceRepositoryWSDLLink" function ... 32
Figure 3-11: Code overview of the "oneParameter" function ... 32
Figure 3-12: Overview of the process flow of the Property Grabber Component ... 33
Figure 3-13: Structured properties of the service "BLZService" in the service list page ... 37
Figure 3-14: Structured properties of the service "BLZService" in the service page ... 38
Figure 3-15: Endpoint information of the Web service "BLZService" in the service page ... 38
Figure 3-16: Monitoring information of the service "BLZService" in the service page ... 39
Figure 3-17: Whois information of the service domain "thomas-bayer.com" ... 40
Figure 3-18: Overview of the process flow of the Storage Component ... 41
Figure 3-19: Implementation code for getting the WSDL document ... 44
Figure 3-20: Implementation code for generating the XML file ... 44
Figure 3-21: Implementation code for generating the INI file ... 45
Figure 3-22: Implementation code for creating the table in the database ... 45
Figure 3-23: Implementation code for generating table records ... 46
Figure 4-1: Service amount statistics of these five Web Service Registries ... 49
Figure 4-2: Statistic information for WSDL Documents ... 50
Figure 4-3: Average Number of Service Properties ... 51
Figure 4-4: WSDL document format of one Web service ... 52
Figure 4-5: INI file format of one Web service ... 53
Figure 4-6: XML file format of one Web service ... 53
Figure 4-7: Database data format for all Web services ... 53
Figure 4-8: Average time cost for extracting the service properties in all Web Service Registries ... 55
Figure 4-9: Average time cost for obtaining the WSDL document in all Web Service Registries ... 56
Figure 4-10: Average time cost for generating the XML file in all Web Service Registries ... 57
Figure 4-11: Average time cost for generating the INI file in all Web Service Registries ... 57
Figure 4-12: Average time cost for creating the database record in all Web Service Registries ... 58
Figure 4-13: Average time cost for getting one Web service in all Web Service Registries ... 58
Table of Tables
Table 3-1: Structured Information of the Service-Repository Web Service Registry ... 34
Table 3-2: Structured Information of the Xmethods Web Service Registry ... 34
Table 3-3: Structured Information of the Seekda Web Service Registry ... 34
Table 3-4: Structured Information of the Ebi Web Service Registry ... 34
Table 3-5: Structured Information of the Biocatalogue Web Service Registry ... 34
Table 3-6: SOAP Operation Information of the Biocatalogue Web Service Registry ... 35
Table 3-7: REST Operation Information of the Biocatalogue Web Service Registry ... 35
Table 3-8: Endpoint Information of these five Web Service Registries ... 35
Table 3-9: Monitoring Information of these five Web Service Registries ... 35
Table 3-10: Whois Information for these five Web Service Registries ... 36
Table 3-11: Extracted Structured Information of the Web service "BLZService" ... 38
Table 3-12: Extracted Endpoint Information of the Web service "BLZService" ... 39
Table 3-13: Extracted Monitoring Information of the Web service "BLZService" ... 39
Table 3-14: Extracted Whois Information of the service domain "thomas-bayer.com" ... 40
Table 3-15: Sleep Time of these five Web Service Registries ... 47
Table 4-1: Service amount statistics of these five Web Service Registries ... 48
Table 4-2: Statistic information for WSDL Documents ... 49
Table 4-3: Average time cost information for all Web Service Registries ... 55
Table of Abbreviations
DTD Document Type Definition
DWSC Deep Web Service Crawler
HTML HyperText Markup Language
MTBF Mean Time between Failures
MTTR Mean Time to Recovery
QoS Quality of Service
REST Representational State Transfer
RTT Round Trip Time
SOAP Simple Object Access Protocol
SQL Structured Query Language
URL Uniform Resource Locator
WHATWG Web Hypertext Application Technology Working Group
XML eXtensible Markup Language
WSDL Web Service Description Language
WSML Web Service Modeling Language
WSMO Web Service Modeling Ontology
Step 6: After Sam has got all these specific services, he would like to choose the services that can provide a high reliability.
Requirement 6: Sort functionality based on the users' choices.
Step 7: Sam now expects to compare the service availability that was promised by the service provider with the availability actually provided; this should be contained in the services' details. There also needs to be service coverage information, so that Sam can know whether a service covers the areas where he lives and works. Moreover, Sam would also like to compare these services in other ways, for instance by putting some services into a structured table to compare the transaction fees.
Requirement 7: A side-by-side comparison table for services and a functionality that enables users to select the services they want to compare.
Step 8: At last Sam wants to know whether the service providers offer a free trial of their services, so that he can test the service functionality.
Requirement 8: If possible, display a note for services offering free trials.
2.1.2 Architecture Plan for the Service-Finder Project
The architecture plan of the Service-Finder project contains five basic components: the Service Crawler, the Automatic Annotator, the Conceptual Indexer and Matcher, the Cluster Engine and the Service-Finder Portal Interface. Figure 2-1 presents a high-level overview of the components and the data flow among them.
Figure 2-1: Dataflow of Service-Finder and Its Components [3]
2.1.2.1 The Principle of the Service Crawler Component
The Service Crawler is responsible for gathering available services and their related information from the Web. The overall cycle is depicted as follows:
(1) A Web developer publishes a Web service.
(2) The Crawling component begins to harvest the Web in order to identify Web services, e.g. by means of WSDL (Web Service Description Language) documents.
(3) As soon as a service is discovered, the Crawler also searches for other related information.
(4) After each periodic interval, the Crawler produces a consistent snapshot of the relevant part of the Web.
Finally, the output of the Crawler is forwarded to the subsequent components for analyzing, indexing and displaying.
2.1.2.2 The Principle of the Automatic Annotator Component
The Automatic Annotator receives the relevant data from the previous component and generates semantic service descriptions of the WSDL documents and their related information, based on the Service-Finder Ontology and the Service Category Ontology.
First, the two compatible ontologies that are used throughout the whole process are briefly introduced [2]:
- Generic Service Ontology: an ontology that is functional for describing the data objects, for example the services, the service providers, availability, payment modalities and so on.
- Service Category Ontology: an ontology that is used to categorize the functionalities or applications of the services, for instance data verification, messaging, data storage, weather, etc.
Afterwards, the function of this component with its input and output:
Input:
- Crawled data from the Service Crawler
- Service-Finder Ontologies
- Feedback on or corrections of previous annotations
Function:
- Enrich the information about the service and extract semantic statements according to the Service-Finder Ontologies, for example categorizing the service according to the Service Category Ontology
- Determine whether a particular document is relevant or not by means of the Web link graph, and discard the irrelevant documents
- Classify the pages into their genres, for instance pricing, user comments, FAQ and so on
Output:
- Semantic annotations of the services
2.1.2.3 The Principle of the Conceptual Indexer and Matcher Component
The Conceptual Indexer and Matcher is essentially a data store center that aims at storing all extracted information about the services and at supplying users with retrieval and semantic query capabilities, for example the matchmaking between user requests and service offers, and the retrieval of user feedback on extracted annotations.
This component's function, input and output are as follows:
Input:
- Semantic annotation data and full-text information obtained from the Automatic Annotator
- Semantic annotation data and full-text information that come from the user interfaces
- Cluster data from the user and service clustering component
Function:
- Store the semantic annotations received from the Automatic Annotator component and from the user interface
- Store the cluster data procured through the clustering component
- Store and index the textual descriptions offered by the Automatic Annotator component and the textual comments offered by users
- Ontological querying of the semantic data in the data store center
- Combined keyword and ontological querying for user queries
- Provide a list of similar services for a given service
Output:
- A list of matching services queried by users; in particular, these services should be sorted by ranking and should be iterable
- All available data related to a particular entity must be retrievable at the user interface
A purely illustrative sketch of how these query capabilities could be exposed as an interface follows.
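Purely as an illustration of these query capabilities, such a component could expose an interface like the following Java sketch; the Service-Finder deliverables define no such programming interface, so every name here is an assumption.

```java
import java.util.List;

// Illustrative only: a possible shape for the query side of the
// Conceptual Indexer and Matcher. Not part of the Service-Finder project.
interface ConceptualIndexerAndMatcher {

    // Store a semantic annotation received from the Automatic Annotator
    // or from the user interface.
    void storeAnnotation(String serviceId, String annotation);

    // Combined keyword and ontological querying; the result is a ranked,
    // iterable list of matching services.
    List<String> query(String keywords, String categoryConcept);

    // Provide a list of similar services for a given service.
    List<String> findSimilarServices(String serviceId);
}
```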
2.1.2.4 The Principle of the Service-Finder Portal Interface Component
The Service-Finder Portal Interface is the main entry point for users of the Service-Finder system to search and browse the data managed by the Conceptual Indexer and Matcher component. In addition, the users can contribute information by providing tags, comments, categorizations and ratings for the data they browse. Furthermore, developers can directly invoke the Service-Finder functionalities from their custom applications by means of an API.
The details of this component's function, input and output are as follows:
Input:
- A list of ordered services for a query
- Detailed information about a service, a set of services, or a service provider
- Query access to the service category ontology and to the most used tags provided by the users
- Service availability information
Function:
- The Web interface allows the users to search for services by keyword, tag or concept in the categorization, to sort and filter query results by refining the query, to compare and bookmark services, and to try out the services that offer this functionality
- The API allows the developers to invoke the Service-Finder functionalities
Output:
- Explicit user annotations such as tags, ratings, comments, descriptions and so on
- Implicit user data, for example the click streams of the users, bookmarks, comparisons, links sent, etc.
- Manual advertising of available new services
2.1.2.5 The Principle of the Cluster Engine Component
The Cluster Engine is in charge of collecting and analyzing information about user behavior from the Service-Finder Portal, e.g. the services queried and compared by the users. Moreover, it provides cluster data to the Conceptual Indexer and Matcher for producing service recommendations.
This component's function, input and output in detail:
Input:
- Service annotation data, both extracted and from user feedback
- The users' click streams, used for extracting user behavior
Function:
- Obtain user clusters from the user behavior
- Obtain service clusters from the service annotation data, in order to enable finding similar services
Output:
- Clusters of users and services
2.2 Information Extraction
Due to the rapid development and use of the World-Wide Web, a huge number of information sources has been produced on the Internet, access to which via browsing and searching is limited by the heterogeneity and the lack of structure of Web information sources. Therefore, Information Extraction, which transforms Web pages into program-friendly structures for post-processing, becomes a great necessity. The task of Information Extraction is specified in terms of its inputs and its extraction targets, and the technique used in the process of Information Extraction is called the extractor.
2.2.1 Input Types of Information Extraction
Generally speaking, there are three different input types. The first input type is the unstructured document, for example the free text shown in figure 2-2; it is unstructured and written in natural language, so it requires substantial natural language processing. The second input type is the structured document, for instance XML documents, whose data can be described through an available DTD (Document Type Definition) or XML (eXtensible Markup Language) schema. The third input type, finally, is the semi-structured document, which is widespread on the Web, such as the large volume of HTML pages containing tables, itemized lists and enumerated lists; this is because HTML tags are often used to render such embedded data in HTML pages, see figure 2-3.
Figure 2-2: Left is the free text input type and right is its output [4]
Figure 2-3: A semi-structured page containing data records (in rectangular box) to be extracted [4]
Seen this way, inputs of the semi-structured type are documents with a fairly regular structure, and the data of these documents can be displayed in an HTML or a non-HTML format. Besides, since the Web pages of the Deep Web are dynamic and generated from structured databases by means of templates or layouts, the Deep Web can be considered one of the input sources that provide such semi-structured documents. For example, the author, price and comment parts of the book pages provided by Amazon have the same layout, because these Web pages are generated from the same database and rendered with the same template or layout. Furthermore, there is another option, namely manually generated HTML pages of the semi-structured type: although the publication lists provided on the homepages of different researchers are produced by diverse users, they all have a title and a source property for every single paper. Eventually, the inputs of some Information Extraction tasks can also be pages of the same class or pages from various Web Service Registries.
2.2.2 Extraction Targets of Information Extraction
Moreover, regarding the task of Information Extraction, the extraction target has to be considered, and there are two different extraction targets. The first one is the relation of k-tuples, where k is the number of attributes in a record; in some cases an attribute of a record may have no instantiation, while in other cases an attribute owns multiple instantiations. The second extraction target is the complex object with hierarchically organized data. Though the ways of depicting the extraction targets in a page are diverse, the most common structure is the hierarchical tree. Such a hierarchical tree may contain only one leaf node, or one or more lists of leaf nodes grouped under internal nodes, and the structure of a data object may be flat or nested. In brief, if the structure is flat, then there is only one leaf node, which can also be called the root; if it is a nested structure, then the internal nodes involved in this data object span more than two levels.
Furthermore, in order to make Web pages readable for human beings and easier to visualize, the tables or tuples of the same list, and the elements of a tuple, should be clearly isolated or demarcated. However, the display of a data object in a Web page is affected by the following conditions [4]:
- The attribute of a data object has zero or several values.
(1) If there is no value for an attribute of a data object, this attribute is called a "none" attribute. For example, a special offer only available for certain books might be a "none" attribute.
(2) If there are several values for an attribute of a data object, it is called a "multiValue" attribute. For instance, the name of the author of a book could be a "multiValue" attribute.
- The set of attributes (A1, A2, A3, ...) has multiple orderings.
That is to say, within this set of attributes the position of an attribute may change across the diverse instances of a data object; such an attribute is called a "multiOrdering" attribute. For instance, for movies before the year 1999 a movie site might list the release date in front of the movie's title, while for movies from 1999 onwards it lists the release date behind the movie's title.
- The attribute has different formats.
This means that the display format of the data object can be completely distinct for different instances. Therefore, if the format of an attribute is free, a lot of rules are needed to deal with all possible cases; this kind of attribute is called a "multiFormat" attribute. For example, an e-commerce Web site might use a bold font to present the regular prices while using a red color to display the sale prices. Nevertheless, there is also the situation that several different attributes of a data object have the same format, for example when various attributes are presented with <TD> tags in a table presentation; such attributes can then be differentiated by means of their order. However, in cases where a "none" attribute occurs or "multiOrdering" attributes exist, the rules for extracting these attributes have to be revised.
- The attribute cannot be decomposed.
For easier processing, the input documents are sometimes treated as strings of tokens instead of strings of characters, and some attributes cannot be decomposed into several individual tokens. These attributes are called "untokenized" attributes. For example, in college course catalogue entries like "COMP4016" or "GEOL2001", the department code and the course number cannot be separated into two different strings of characters such as "COMP" and "4016" or "GEOL" and "2001".
2.2.3 The Techniques Used in Information Extraction
The extractor used in the process of Information Extraction aims at providing a single uniform query interface for accessing information sources such as database servers and Web servers. It consists of the following phases: collecting the returned Web pages, labeling these Web pages, generalizing extraction rules, extracting the relevant data, and outputting the result in an appropriate format (XML format or a relational database) for further information integration. For example, the extractor first queries the Web server to gather the returned pages through the HTTP protocol, then extracts the contents of these HTML documents and thereafter integrates them with other data sources. The whole process of the extractor follows the steps below; a minimal code sketch of steps 1 and 2 is given after the list.
Step 1:
At the beginning, the input has to be tokenized. There are two different granularities for tokenizing the input string: tag-level encoding and word-level encoding. Tag-level encoding transforms the tags of an HTML page into general tokens, while transforming every text string between two tags into a special token. Word-level encoding does this differently: it treats each word in a document as a token.
Step 2:
Next, the extraction rules are applied for every attribute of the data objects in the Web pages. These extraction rules can be induced in terms of top-down or bottom-up generalization, pattern mining or logic programming. In addition, the extraction rules may be expressed by means of regular grammars or logic rules; for example, some systems use path expressions over the HTML parse tree, such as html.head.title or html->table[0], some use syntactic or semantic constraints, and some use delimiter-based constraints such as HTML tags or literal words.
Step 3:
After that, all the extracted data are assembled into records.
Step 4:
Finally, this process is iterated until all data objects in the input have been processed.
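The following self-contained Java sketch illustrates steps 1 and 2: a tag-level tokenizer and one delimiter-based extraction rule. It is an illustration of the technique only, not the implementation of any of the surveyed systems.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ExtractorSketch {

    // Step 1: tag-level encoding - every tag becomes a token, every text
    // string between two tags becomes one special token.
    static List<String> tokenize(String html) {
        List<String> tokens = new ArrayList<>();
        Matcher m = Pattern.compile("<[^>]+>|[^<]+").matcher(html);
        while (m.find()) {
            String t = m.group().trim();
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Step 2: a delimiter-based extraction rule - return the token that
    // stands between the two given delimiter tokens.
    static String extractBetween(List<String> tokens, String left, String right) {
        for (int i = 1; i < tokens.size() - 1; i++) {
            if (tokens.get(i - 1).equals(left) && tokens.get(i + 1).equals(right)) {
                return tokens.get(i);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String page = "<html><body><b>Price:</b><td>19.99</td></body></html>";
        List<String> tokens = tokenize(page);
        // Steps 3 and 4 would assemble such values into records and iterate.
        System.out.println(extractBetween(tokens, "<td>", "</td>"));  // prints 19.99
    }
}
```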
2.3 Pica-Pica Web Service Description Crawler
Pica pica is known as a bird species, the magpie. Here, however, Pica-Pica is a Web Service Description Crawler designed to address the quality of Web services, for example to evaluate the descriptive quality of the offered Web services and how well these Web services are described in today's Web Service Registries.
2.3.1 Needed Libraries of the Pica-Pica Web Service Description Crawler
This version of the Pica-Pica Web Service Description Crawler was written by Anton Caceres and Josef Spillner and is programmed in the Python language. In order to run its scripts and parse the HTML pages, two additional libraries are needed: Beautiful Soup and html5lib.
- Beautiful Soup
It is an HTML/XML parser for the Python language, and it can even turn invalid markup into a parse tree [5]. Moreover, the following three features make it more powerful:
  - Bad markup does not choke Beautiful Soup; in fact, it generates a parse tree that makes approximately as much sense as the original document, so that the desired data can still be obtained.
  - Beautiful Soup provides a toolkit of simple, idiomatic methods for navigating, searching and modifying the parse tree, so no custom parser has to be created for every application.
  - If the document has already specified an encoding, it can be ignored, since Beautiful Soup converts documents from Unicode to UTF-8 automatically; otherwise, only the encoding of the original document has to be specified.
Furthermore, the ways of including Beautiful Soup into an application are the following [5]:
  from BeautifulSoup import BeautifulSoup          # For processing HTML
  from BeautifulSoup import BeautifulStoneSoup     # For processing XML
  import BeautifulSoup                             # To get everything
- html5lib
It is a Python package that implements the HTML5 [8] parsing algorithm. In order to gain maximum compatibility with the current major desktop Web browsers, this implementation is based on the WHATWG (Web Hypertext Application Technology Working Group) HTML5 specification.
2.3.2 Architecture of the Pica-Pica Web Service Description Crawler
Figure 2-4: Architecture of the Pica-Pica Web Service Description Crawler
Figure 2-4 illustrates the primary architecture of the Pica-Pica Web Service Description Crawler. It includes four fundamental components: the Service Page Grabber component, the WSDL Grabber component, the Property Grabber component, and the WSML Register component.
(1) The Service Page Grabber component takes the URL seed as input and outputs the links of the service pages to the following two components: the WSDL Grabber component and the Property Grabber component.
(2) The WSDL Grabber component is responsible for obtaining the WSDL documents based on the delivered service page links. It then checks the validity of these obtained WSDL documents. Finally, only the valid WSDL documents are passed to the WSML Register component for further processing.
(3) The Property Grabber component tries to extract the service's properties hosted in the service page, if there are any. After that, all these properties are saved into an INI file as the information of that service.
(4) The functionality of the WSML Register component is to write appropriate WSML documents by means of the valid WSDL documents delivered from the WSDL Grabber component and the optional INI files delivered from the Property Grabber component, and afterwards to register them in ConQo.
- WSML [9]
It stands for Web Service Modeling Language, which provides a framework with different language variants. Hence it is often used to describe the different aspects of semantic Web Services according to the conceptual model of WSMO.
- WSMO [10]
WSMO, whose full name is Web Service Modeling Ontology, is dedicated to describing various aspects of Semantic Web Services based on ontologies. An ontology is a formal, explicit specification of a shared conceptualization. In fact, these ontologies are the sticking points that link the agreement of the communities of users with the defined conceptual semantics of the real world.
- ConQo [11]
It is a discovery framework that considers not only the Quality of Service (QoS) but also context information. It uses a Web Service Repository to manage the service descriptions, which are based on WSML.
2.3.3 Implementation of the Pica-Pica Web Service Description Crawler
This section describes the implementation of the Pica-Pica Web Service Description Crawler in detail.
(1) First, to start the whole crawling process, the Pica-Pica Web Service Description Crawler needs an input as the initial seed. For this crawler, the URL addresses of the five Web Service Registries listed below are used as the input seed. Moreover, in this version of the Pica-Pica Web Service Description Crawler there is a single Python script for each Web Service Registry, and the crawling processes of these Web Service Registries' Python scripts are executed one after another.
Biocatalogue: http://www.biocatalogue.com
Ebi: http://www.ebi.ac.uk
Seekda: http://www.seekda.com
Service-Repository: http://www.service-repository.com
Xmethods: http://www.xmethods.net
(2) Then, after being fed with the input seed, the crawler steps into the next component, the Service Page Grabber. At first, this component tries to read the data from the Web based on the input seed. Then it establishes a parse tree of the read data in terms of the functions of the Beautiful Soup library. After that, the Service Page Grabber component starts to look for the service page link of each service published in the Web Service Registry by means of the functions of the html5lib library. When the service page link of a single service is found, the component first checks whether this service page link is valid. Once the service page link is valid, it is passed to the following two components for further processing: the WSDL Grabber component and the Property Grabber component.
(3) When the WSDL Grabber component receives a service page link from its previous component, it sets out to extract the WSDL link address of that service through the parse tree of the data in this service page. Next, this component starts to download the WSDL document of that service in terms of the WSDL link address. Thereafter, the obtained WSDL document is stored on the disk. The process of this WSDL Grabber component is carried on continually until no more service links are passed to it. Certainly, not all grabbed WSDL documents are usable: they may contain bad definitions or bad namespace URIs, be empty documents, or, even worse, not be in XML format at all. Hence, in order to pick these out, this component further analyzes the involved WSDL documents and then puts all valid documents into a "validWSDLs" folder, whereas the invalid documents are put into a folder named "invalidWSDLs" in order to gather statistical information. Finally, only the WSDL documents in the "validWSDLs" folder are passed to the subsequent component.
(4) Moreover, since some Web Service Registries provide additional information about the services, such as availability, service provider, and version, the Property Grabber component sets out to extract this information as the service's properties and thereafter saves these properties into an INI file. However, if no additional information is available, then there is no need to extract service properties, and thus there is no INI file for that service. Note that in this implementation of the Pica-Pica Web Service Description Crawler, only the Python scripts for the Seekda and Service-Repository Web Service Registries have functions to extract the services' properties, while the scripts for the other three Web Service Registries have no such functions.
(5) Furthermore, it is optional to create a report file that contains statistical information about this process, such as the total number of services of one Web Service Registry, the number of services whose WSDL documents are invalid, etc.
(6) As has been stated, at this point there is a folder with all valid WSDL documents and possibly also some INI files. Therefore, the task of the WSML Register component is now to generate the appropriate WSML documents from these valid WSDL documents and INI files, and then to register them in ConQo.
2.4 Conclusions of the Existing Strategies
This chapter presented three aspects of the existing strategies: the Service-Finder project, Information Extraction techniques, and the Pica-Pica Web Service Description Crawler.
The task of this master program is to obtain the available Web services and their related information from the Web. This is essentially a procedure of extracting the needed information about each service, such as the service's WSDL document and its properties. Therefore, the Information Extraction techniques used to extract information hosted on the Web can be applied in this master program.
Moreover, Service-Finder is a large project that is not only able to obtain the available Web services and their related information from the Web, but also enriches the crawled data with annotations by means of the Service-Finder ontology and the Service Category ontology, and integrates all the information into a coherent semantic model. Furthermore, it also provides capabilities for searching and browsing the data with a user interface and gives users service recommendations. However, for a master program the Service-Finder project far exceeds the requirements. Therefore, it is considered only as a reference for this master program.
Furthermore, since the Pica-Pica Web Service Description Crawler aims only at obtaining the available Web services and their related information, it fulfills the primary task of this master program. Nevertheless, regarding the information about each service, the Pica-Pica Web Service Description Crawler extracts only a few properties, sometimes even none. Consequently, in order to improve the quality of the Service Catalogue, as many properties about each service as possible have to be extracted. Hence, chapter 3 presents an extension of this Pica-Pica Web Service Description Crawler.
3 Design and Implementation
In the previous chapter on the state of the art, the already existing techniques and implementations were presented. In the following, the basic principle of the proposed approach, the Deep Web Services Crawler, is introduced. It is based on these existing techniques, especially on the Pica-Pica Web Service Description Crawler.
3.1 Deep Web Services Crawler Requirements
This section covers the goals of the Deep Web Service Crawler approach, the system requirements of the approach, and some non-functional requirements.
3.1.1 Basic Requirements for DWSC
The following are the basic requirements which should be achieved:
(1) Produce the largest annotated Service Catalogue. Here, the Service Catalogue is a list of the Web services published in the Web Service Registries. It contains not only the WSDL document of each service but also a file with the information about that service. Therefore, for the purpose of producing the largest annotated Service Catalogue, the proposed approach needs to extract as many properties about these Web services as possible. Moreover, it also has to download the WSDL document hosted along with each Web service. That is to say, these properties comprise not only the interesting structured properties, such as the service name and its WSDL link address, but also other non-functional properties, for example the endpoint and its monitoring information.
(2) Flexible storage: each service has a Service Catalogue entry that contains all its interesting properties. A major question is how to deal with these service properties, that is, which schemes to use for storing them. Hence, in order to store them in a flexible way, the proposed approach provides three storage methods: the first stores them as an XML file, the second stores them in an INI file, and the third uses a database.
3.1.2 System Requirements for DWSC
Generally speaking, the requirements for realizing a programming project include the following:
1) Operating System: Linux/Unix, Windows (XP, Vista, 2000, etc.)
2) Programming language: C++, Java, Python, C, etc.
3) Programming Tool: NetBeans, Eclipse, Visual Studio, and so on
In this master thesis, the scripts of the Deep Web Service Crawler approach are written in the Java programming language. These scripts have only been tested on the Windows XP and Linux operating systems and have not been tested on other operating systems.
3.1.3 Non-Functional Requirements for DWSC
In this part, several non-functional requirements for the Deep Web Service Crawler approach are presented.
1) Transparency: the process of data exploration and data storage should be done automatically, without the user's intervention. However, at first the user should specify the path on the hard disk which will be used to store the outputs of this program.
2) Fault tolerance: during the execution of this program, errors inevitably happen. Therefore, in order to keep the process from being interrupted, there must be the necessary error handling for process recovery.
3) Completeness: this approach should extract as many of the interesting properties of each Web Service as possible, e.g., endpoint, monitoring information, etc.
In addition, since the Pica-Pica Web Service Crawler has already implemented strategies for the following five URLs, the proposed approach must cover no fewer than these five Web Service Registries:
Biocatalogue: http://www.biocatalogue.com
Ebi: http://www.ebi.ac.uk
Seekda: http://www.seekda.com
Service-Repository: http://www.service-repository.com
Xmethods: http://www.xmethods.net
3.2 Deep Web Services Crawler Architecture
In this section, an overview of the high-level architecture of the Deep Web Services Crawler approach is introduced first. Thereafter, four subsections outline each single component and how they play together.
The current components and flows of data in the Deep Web Service Crawler can be summarized as depicted in Figure 3-1, using the continuous arrows. The crawler first tries to obtain the available service list page links and related service page links by crawling the Web (Web Service Extractor). Then those gathered links are processed in two separate steps: one gets the service's WSDL document (WSDL Grabber), and the other collects the properties of each service (Property Grabber). Finally, all these data are stored on the storage device (Storage). The whole detailed process of figure 3-1 is illustrated in the following steps.
Step 1:
When the Deep Web Service Crawler starts to run, the File Chooser container requires the user to specify a path on the computer or on any other hard disk. The reason for specifying the path is that the Deep Web Service Crawler program needs a place to store all its outputs.
Step 2:
After that, the Web Service Extractor is triggered. It is the main entry point to the specific crawling process. Since the Deep Web Service Crawler program is supposed to crawl for Web Services in given Web Service Registries, the URL addresses of these Web Service Registries are given as the initial seed for this Web Service Extractor process. However, since the page structures of these Web Service Registries are completely different, there is a dedicated process for each Web Service Registry.
Figure 3-1: Overview of the Basic Architecture of the Deep Web Services Crawler
Step 3:
According to the given seed, two types of links are obtained by the Web Service Extractor component: the service list page link and the service page link. A service list page is a page that contains a list of Web Services and possibly some information about these Web Services, while a service page is a page that contains much more information about a single service. Finally, the extractor forwards these two types of links to the next two components: the Property Grabber and the WSDL Grabber.
Step 4:
Then, on the one hand, the Property Grabber component tries to gather the information about the service hosted in the service list page and the service page, such as the name of the service, its description, its ranking, etc. Finally, all the information about the service is collected together as the service properties, which are then delivered to the Storage component for further processing.
Step 5:
On the other hand, the WSDL Grabber tries to obtain the WSDL link from the service list page or the service page. This is because for some Web Service Registries the WSDL link is hosted in the service list page, as for Biocatalogue, while for the other Web Service Registries it is hosted in the service page, as for Xmethods. After the WSDL link has been obtained, it is also transmitted to the Storage component for further processing.
Step 6:
When the service properties and the WSDL link of a service are received by the Storage component, they are stored on the disk. The service properties are stored on the disk in three different ways: as an XML file, as an INI file, or as one record inside a table of a database. For the WSDL link, the Storage component first tries to download the page content according to the URL address of the WSDL link. If this succeeds, the page content of the service is stored as a WSDL document on the disk.
Step 7:
Steps 3 to 6 constitute the crawling process of a single service. Hence, if there is more than one service, or more than one service list page, in one of the Web Service Registries, the crawling process from step 3 to step 6 is repeated again and again until no service or service list page remains in that Web Service Registry.
Step 8:
Furthermore, after the crawling process of one Web Service Registry has finished, a file is generated that contains some statistical information about this crawling process, for example: the times when the crawling process of this Web Service Registry started and finished, the total number of Web services in this Web Service Registry, how many services have an empty WSDL document, the average number of service properties in this Web Service Registry, and the average time cost for extracting the service properties, getting the WSDL document, and generating the XML file, INI file, etc.
3.2.1 The Function of the Web Service Extractor Component
The Web Service Extractor component is responsible for gathering service list page links and service page links. It pursues a focused crawl of the Web and forwards only service list page and service page links to the subsequent components for analysis and collection purposes. Therefore, it identifies both the service list page links and the related service page links in these Web Service Registries.
As can be seen in figure 3-2, a crawl for Web Services needs to start from a seed of URLs. This seed is almost as important as the Web Service Extractor itself, as it highly influences the part of the Web that needs to be crawled. The seed can or shall contain, e.g., Web pages where Web Services are published or which talk about Web Services.
Figure 3-2: Overview of the process flow of the Web Service Extractor Component
After being fed with the URL seed, the Web Service Extractor component starts to get the links of the service list pages from the initial page of this URL seed. However, this process differs among the five Web Service Registries. The following shows the different situations in these Web Service Registries.
- Service-Repository Web Service Registry
In this Web Service Registry, the link of the first service list page is the URL address of its seed, which means that some Web Services can already be found on the home page of the Service-Repository Web Service Registry. The word "first" here implies that there is more than one service list page link in this Web Service Registry. Therefore, the process of getting service list page links in this registry is carried on continually until no more service list page links exist.
- Xmethods Web Service Registry
Although there are Web Services on the home page of the Xmethods Web Service Registry, they are only a small subset of the Web Services in this registry. Moreover, in the Xmethods Web Service Registry there is a single page containing all Web Services. Therefore, the service list page link of that page has to be obtained.
- Ebi Web Service Registry
The situation in the Ebi Web Service Registry is a little like that in the Xmethods Web Service Registry. That is to say, there is also one page that contains all Web Services of this Web Service Registry. However, this page is not the initial page of the input seed. Therefore, more than one operation step is needed to get the service list page link of that page.
- Seekda Web Service Registry
In the Seekda Web Service Registry, the Web Services are not contained in the initial page of the input seed. The service list page link can be obtained after several additional operation steps. However, there is a problem with getting the service list page links in this registry: simply put, if more than one page contains Web Services, then, for some unknown reason, the links of the remaining service list pages cannot be obtained. In other words, only the link of the first service list page can be obtained.
- Biocatalogue Web Service Registry
The process of getting the service list pages in the Biocatalogue Web Service Registry is almost the same as in the Seekda Web Service Registry. The only difference is that in the Biocatalogue Web Service Registry all service list page links can be obtained if there is more than one service list page.
After getting the link of a service list page, the Web Service Extractor begins to get the link of the service page of each service listed in the service list page. The reason why it can do this is that there is an internal link for every service, which leads to its service page. It is worth noting that once a service page link is obtained, this component immediately forwards the service page link and the service list page link to the subsequent two components for further processing. Nevertheless, the process of obtaining service page links is carried out continuously until all services listed in that service list page are crawled. Analogously, the process of getting service list pages is also carried out continuously until no more service list pages exist.
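This double loop, over all service list pages and over all services on each of them, can be sketched in Java as follows; fetchPage, extractServicePageLinks, nextServiceListPageLink, and forward are hypothetical helpers, since the real page parsing is registry-specific.

```java
import java.util.List;

public class WebServiceExtractorSketch {

    // Outer loop: all service list pages; inner loop: all services listed on
    // the current page. Each found pair of links is forwarded immediately.
    void crawlRegistry(String seedUrl) {
        String listPageLink = seedUrl;   // e.g., Service-Repository: the seed is the first list page
        while (listPageLink != null) {
            String html = fetchPage(listPageLink);
            for (String servicePageLink : extractServicePageLinks(html)) {
                forward(listPageLink, servicePageLink);   // to the WSDL and Property Grabbers
            }
            listPageLink = nextServiceListPageLink(html); // null when no further list page exists
        }
    }

    // Hypothetical helpers; their real implementations differ per registry.
    String fetchPage(String url) { return ""; }
    List<String> extractServicePageLinks(String html) { return List.of(); }
    String nextServiceListPageLink(String html) { return null; }
    void forward(String listPageLink, String servicePageLink) { }
}
```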
3.2.1.1 Features of the Web Service Extractor Component
The main features are described in the following paragraphs.
1) Obtain service list page links
A central task is to obtain the corresponding service list page links. A service list page link is a URL address pointing to a page with a public list of Web Services and some brief information about these Web services, such as the name of a service and an internal URL that links to another page containing the detailed information about that service; sometimes it may also carry the link address of the WSDL document.
2) Obtain service page links
Once the service list page links are found, a crucial aspect is to extract the internal link of each service to aid the task of service information discovery. Therefore, it is the task of the Web Service Extractor to harvest the HTML page content of the service list page, so that the service page links, which contain much more detailed information about the single Web services, can be obtained.
3.2.1.2 Input of the Web Service Extractor Component
This component depends on specific input seeds. The only input required for this component is a URL seed. This URL seed is one of the URLs listed in section 3.1.3.
3.2.1.3 Output of the Web Service Extractor Component
The component produces two kinds of service-related page links from the Web:
- Service list page links
- Service page links
3.2.1.4 Demonstration of the Web Service Extractor
In order to give a comprehensive understanding of the process of the Web Service Extractor component, the following figures serve as an explanation. Though there are five URL addresses in this section, only the URL of the Service-Repository is shown as an example.
1) The input seed is the initial URL address of the Service-Repository, which is "http://www.service-repository.com".
2) As already stated in section 3.2.1, the first service list link of this Web Service Registry is its input seed, "http://www.service-repository.com". Figure 3-3 shows the corresponding service list page of that link.
Figure 3-3: Service list page of the Service-Repository
Figure 3-4: Original source code of the internal link of the Web service "BLZService"
Figure 3-5: Code overview of getting the service page link in the Service-Repository
Figure 3-6: Service page of the Web service "BLZService"
3) Now that the service list page link is known, the next step is to acquire the service page links of the services listed in the service list page. The text in the red box of figure 3-4 shows the internal link of the Web service "BLZService". However, this is not the complete link of the service page: it has to be prefixed with the initial URL address of the Service-Repository Web Service Registry, "http://www.service-repository.com". The code for getting the service page link of a Web service in the Service-Repository is shown in figure 3-5. The final link of this service page is therefore "http://www.service-repository.com/service/overview-210897616". Figure 3-6 is the corresponding service page of that link.
4) Afterwards, the two links gathered by the Web Service Extractor component, the service list page link and the service page link, are immediately forwarded to the next two components: the WSDL Grabber component and the Property Grabber component.
3.2.2 The Function of the WSDL Grabber Component
The WSDL Grabber component acquires the WSDL link hosted on the Web, based on the service list page link or the service page link. The whole process flow is illustrated in figure 3-7.
Figure 3-7: Overview of the process flow of the WSDL Grabber Component
When the WSDL Grabber component receives the inputs delivered from the previous component, it starts to get the WSDL link of the service based on these inputs. However, although the inputs of the WSDL Grabber component are the links of the service page and the service list page, only one of them contains the WSDL link of the corresponding service. That is to say, the WSDL link exists either in the service page or in the service list page. The reason why both links need to be delivered to this component is that only one of the five Web Service Registries, namely Biocatalogue, hosts the WSDL link in the service list page, while for the other four Web Service Registries the WSDL link is hosted in the service page. Therefore, the WSDL links of these four Web Service Registries are obtained via the service page link, and for the Biocatalogue Web Service Registry the WSDL link is obtained through the service list page link. However, there is a problem with getting the WSDL links of the Biocatalogue Web Service Registry: in brief, some of the Web services listed in the service list page of the Biocatalogue Web Service Registry have no WSDL link; in other words, these services have no WSDL document. In a situation like this, the WSDL link of such a Web service is assigned the value "NULL". Nevertheless, for the Web Services in the other four Web Service Registries, the WSDL link always exists in the service page. Eventually, whenever the WSDL Grabber component extracts the WSDL link of a single Web service, that link is immediately forwarded to the Storage component for downloading the WSDL document.
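This registry-dependent choice can be condensed into the following Java sketch; extractWsdlLink is a hypothetical helper standing in for the per-registry parsing logic.

```java
public class WsdlGrabberSketch {

    // Biocatalogue hosts the WSDL link in the service list page; the other
    // four registries host it in the service page. A missing link (possible
    // on Biocatalogue) is represented by the value "NULL".
    String grabWsdlLink(String registry, String listPageLink, String servicePageLink) {
        String sourcePage = "Biocatalogue".equals(registry) ? listPageLink : servicePageLink;
        String wsdlLink = extractWsdlLink(sourcePage);
        return wsdlLink == null ? "NULL" : wsdlLink;   // forwarded to the Storage component
    }

    // Hypothetical helper: parse the page and return the WSDL link, or null.
    String extractWsdlLink(String pageLink) { return null; }
}
```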
3.2.2.1 Features of the WSDL Grabber Component
The WSDL Grabber component has to provide the following feature:
- Obtain WSDL links
The WSDL link is the direct way to get to the page that contains the contents of the WSDL document. It is actually a URL address, but at its end there is something like "wsdl" or "WSDL", indicating that it is an address pointing to the page of a WSDL document.
3.2.2.2 Input of the WSDL Grabber Component
This component requires the following input data:
- Service list page link
- Service page link
3.2.2.3 Output of the WSDL Grabber Component
The component produces only the following output data:
- The URL address of the WSDL link of each service
3.2.2.4 Demonstration of the WSDL Grabber Component
This section demonstrates a list of figures in order to give a comprehensive understanding of the process of the WSDL Grabber component. Naturally, it uses the same Web service of the Service-Repository as an example.
1) The input of the WSDL Grabber component is the link of the service page obtained by the Web Service Extractor component. The address of this link is "http://www.service-repository.com/service/overview-210897616".
Figure 3-8: WSDL link of the Web service "BLZService" in the service page
2) Figure 3-9 shows the corresponding original source code of the WSDL link highlighted in figure 3-8.
Figure 3-9: Original source code of the WSDL link of the Web service "BLZService"
3) Figures 3-10 and 3-11 show the code used to extract the WSDL link shown in figure 3-9. However, figure 3-10 is the particular code for the Service-Repository Web Service Registry only; for the other four Web Service Registries this code is different. The function "getServiceRepositoryWSDLLink" first gets a list of nodes that have the same HTML tag name "b". Then it checks all of the nodes one by one to see whether the text value of a node is "WSDL". If a node fulfills this condition, the attribute value of its sibling, which is an "a" element here, is extracted as the value of the WSDL link of this Web service.
Figure 3-10: Code overview of the "getServiceRepositoryWSDLLink" function
Figure 3-11: Code overview of the "oneParameter" function
4) Finally, after applying the functions shown in figures 3-10 and 3-11, the WSDL link of the Web service "BLZService" is obtained. Its value is "http://services.unitedplanet.de/blz/BlzService.asmx?WSDL".
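Since figures 3-10 and 3-11 are not reproduced here, the following sketch re-expresses the described logic in Java using the jsoup HTML parser, which is not the library used in the thesis: it walks all nodes with tag name "b", looks for the text value "WSDL", and reads the link from the neighbouring "a" element.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class WsdlLinkSketch {

    // Sketch of the logic described for "getServiceRepositoryWSDLLink":
    // find a <b> node whose text is "WSDL" and take the href of its <a> sibling.
    static String getServiceRepositoryWsdlLink(String html) {
        Document doc = Jsoup.parse(html);
        for (Element b : doc.select("b")) {            // all nodes with tag name "b"
            if ("WSDL".equals(b.text().trim())) {
                Element sibling = b.nextElementSibling();
                if (sibling != null && "a".equals(sibling.tagName())) {
                    return sibling.attr("href");       // the WSDL link of this service
                }
            }
        }
        return null;                                   // no WSDL link found on this page
    }

    public static void main(String[] args) {
        String html = "<div><b>WSDL</b><a href='http://example.org/service?wsdl'>link</a></div>";
        System.out.println(getServiceRepositoryWsdlLink(html));
    }
}
```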
3.2.3 The Function of the Property Grabber Component
The Property Grabber component is a module used to extract and gather all the Web service information hosted on the Web, which is, in fact, the information shown in the service list page and the service page. In the end, all the obtained Web service information is collected together as the service properties, which are delivered to the Storage component for storing. The detailed process flow of the Property Grabber component is illustrated in figure 3-12.
The inputs of the Property Grabber component are the same as the inputs of the WSDL Grabber component. However, there is still a little difference between them with respect to the seed. As already mentioned in section 3.2.2, for the WSDL Grabber component only one of the inputs is sufficient to get the WSDL link, whereas the Property Grabber component needs both of them as seeds. Once the Property Grabber component receives the needed inputs, it starts to extract the information of that single Web service.
Figure 3-12: Overview of the process flow of the Property Grabber Component
After the Property Grabber component has received the inputs, it starts to extract the service information of the Web service. Generally speaking, the service information consists of four aspects: structured information, endpoint information, monitoring information, and Whois information.
(1) Structured Information
The structured information can be obtained by extracting the information hosted in the service page and the service list page. It is the basic descriptive information about the service, such as the name of the service, the URL address through which the WSDL document can be obtained, the description introducing the service, the provider who provides the service, its rating, and the server that owns the service, etc. However, the elements constituting this structured information differ among the Web Service Registries. For example, the rating information of a Web service exists in the Service-Repository Web Service Registry, while the Xmethods Web Service Registry does not have this information. In addition, even for Web services in the same Web Service Registry, some elements of the structured information may be missing: for instance, one service in a Web Service Registry may have a description, while another service in the same Web Service Registry does not. Tables 3-1 to 3-5 show the structured information that should be extracted from these five Web Service Registries. Moreover, if the style of a service in the Biocatalogue Web Service Registry is SOAP, there is some additional information describing the SOAP operations of this service, while if it is REST, this additional information describes the REST operations. These should also be considered a part of the structured information. Table 3-6 and table 3-7 show the information of these two different kinds of operations.
Service Name | WSDL Link | WSDL Version
Provider | Server | Rating
Homepage | Owner Homepage | Description
Table 3-1: Structured Information of the Service-Repository Web Service Registry
Service Name | WSDL Link | Provider
Service Style | Homepage | Implementation Language
Description | User Description | Contributed Client Name
Type of this Client | Publisher of this Client | Used Toolkit of this Client
Used Language of this Client | Used Operating System of this Client
Table 3-2: Structured Information of the Xmethods Web Service Registry
Service Name | WSDL Link | Server
Provider | Provider's Country | Service Style
Rating | Description | User Description
Service Tags | Documentation (within WSDL)
Table 3-3: Structured Information of the Seekda Web Service Registry
Service Name | WSDL Link | Port Name
Service URL Address | Implementation Class
Table 3-4: Structured Information of the Ebi Web Service Registry
Service Name | WSDL Link | Style
Provider | Provider's Country | View Times
Favorite Times | Submitter | Service Tags
Total Annotation | Provider Annotation | Member Annotation
Registry Annotation | Base URL | SOAP Lab Server Base URL
Description | User Description | Category
Table 3-5: Structured Information of the Biocatalogue Web Service Registry
SOAP Operation Name | Inputs and Outputs | Operation Description
Operation Tags | Part of Which Service
Table 3-6: SOAP Operation Information of the Biocatalogue Web Service Registry
REST Operation Name | Service Tags | Used Template
Operation Description | Part of Which Service | Part of Which Endpoint Group
Table 3-7: REST Operation Information of the Biocatalogue Web Service Registry
(2) Endpoint Information
Endpoint information is the port information of the Web service, which can be extracted only from the service page. However, since different Web Service Registries structure the endpoint information of their Web services differently, some elements of the endpoint information can be very diverse. One thing needs attention: the Ebi Web Service Registry does not have endpoint information for any of the Web services published in it. Moreover, even though the Web services in the same Web Service Registry share the same structure of the endpoint information, some elements of the endpoint information may be missing or empty, and the Web Service Registries may even have no endpoint information at all for some of the Web services published by them. Nevertheless, whenever there is endpoint information for a Web service, there is at least one element, namely the URL address of the endpoint. Table 3-8 shows the endpoint information that should be extracted from these five Web Service Registries.
Web Service Registry Name | Elements of the Endpoint Information
Service-Repository | Endpoint Name, Endpoint URL, Endpoint Critical, Endpoint Type, Bound Endpoint
Xmethods | Endpoint URL, Publisher of this Endpoint, Contact Email of this Publisher, Implementation Language of this Endpoint
Seekda | Endpoint URL
Biocatalogue | Endpoint Name, Endpoint URL
Table 3-8: Endpoint Information of these five Web Service Registries
Web Service Registry Name | Elements of the Monitoring Information
Service-Repository | Service Availability, Number of Downs, Total Uptime, Total Downtime, MTTR, MTBF, RTT Max of Endpoint, RTT Min of Endpoint, RTT Average of Endpoint, Ping Count of Endpoint
Seekda | Service Availability, Begin Time of Monitoring
Biocatalogue | Monitored Status of Endpoint, Overall Status of Service
Table 3-9: Monitoring Information of these five Web Service Registries
(3) Monitoring Information
Monitoring information is the measured statistical information about a Web service. It is worth noting that the Ebi and Xmethods Web Service Registries do not have monitoring information for any of the Web services published by them, while in the other three Web Service Registries only a few Web services may lack this information. Table 3-9 displays the monitoring information of these three Web Service Registries.
(4) Whois Information
Whois information is not extracted from the information hosted in the service page and the service list page. It is the descriptive information about the service domain, which can be gained by means of the address of the WSDL link. Because of that, the process of getting the Whois information starts with obtaining the service domain first. The final value of the service domain must not contain strings like "http", "https", "www", etc.; it must be the name directly under the top-level domain. After that, the service domain database is queried by sending the value of the service domain to a Whois client, which is simply a Web site on the Internet, for example "http://www.whois365.com/cn/domain". Then, if information about that service domain exists, a list of its information is returned as the output. However, the structure of the returned information differs from service domain to service domain. Therefore, the most challenging thing is that the extraction process has to deal with each different form of the returned information. Table 3-10 lists the Whois information that needs to be extracted for all five Web Service Registries.
Service Domain URL | Domain Name | Domain Type
Domain Address | Domain Description | State
Postal Code | City | Country
Country Code | Phone | Fax
Email | Organization | Established Time
Table 3-10: Whois Information for these five Web Service Registries
Finally, all the information of these four aspects is collected together and then delivered to the Storage component for further storage processing.
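As a rough illustration of the domain-derivation step described under (4), the following Java sketch strips the protocol and a leading "www." from a WSDL link; multi-label suffixes such as ".ac.uk" would need extra care in a real implementation.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class ServiceDomainSketch {

    // Derive the service domain from a WSDL link: drop the protocol and
    // any leading "www." so that only the registrable domain remains.
    static String serviceDomain(String wsdlLink) throws URISyntaxException {
        String host = new URI(wsdlLink).getHost();   // e.g., "www.thomas-bayer.com"
        return host.startsWith("www.") ? host.substring(4) : host;
    }

    public static void main(String[] args) throws URISyntaxException {
        // Yields "thomas-bayer.com", the value later sent to the Whois client.
        System.out.println(serviceDomain("http://www.thomas-bayer.com/axis2/services/BLZService?wsdl"));
    }
}
```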
3.2.3.1 Features of the Property Grabber Component
The Property Grabber component has to provide the following features:
- Obtain basic information
Generally speaking, the more information is available about a Web service, the better one can judge how good this Web service is. Hence, it is necessary for the Property Grabber component to extract all the basic information hosted in the service list page and the service page. This basic information comprises the structured information, the endpoint information, and the monitoring information.
- Obtain Whois information
For the same reason, namely that more information about a Web service allows a better judgement of its quality, it is necessary to extract as much information about the Web service as possible. Therefore, besides the basic information, the Property Grabber component also obtains some additional information, called Whois information, such as the type of the domain, the name of the person in charge, the postal code of the domain, the city, phone, fax, the detailed address, the email, etc.
3.2.3.2 Input of the Property Grabber Component
This component requires the following input data:
- Service list page link
- Service page link
3.2.3.3 Output of the Property Grabber Component
The component produces the following output data:
- Structured information of each service
- Endpoint information of each service, if it exists
- Monitoring information of the service and its endpoint, if it exists
- Whois information of the service domain
All this information is collected together as the properties of each service. Thereafter, the collected properties are sent to the Storage component.
3.2.3.4 Demonstration of the Property Grabber Component
The pictures in figures 3-13 to 3-16 depict the fundamental and primary procedure of the Property Grabber component. To simplify the explanation, the example shown in this section uses the same Web service as before.
1) The inputs of the Property Grabber component are the links of the service list page and the service page received from the Web Service Extractor component. These links are "http://www.service-repository.com" and "http://www.service-repository.com/service/overview-210897616".
2) When the "GetProperty" function is triggered, it calls its four sub functions, shown in figure 3-12.
Figure 3-13: Structured properties of the service "BLZService" in the service list page
Figure 3-14: Structured properties of the service "BLZService" in the service page
3) First, the "getStructuredProperty" function tries to extract the structured information displayed in the red boxes of figure 3-13 and figure 3-14. However, several elements of the structured information have the same content, like the description shown in the service page and in the service list page. Hence, in order to save extraction time and storage space, elements with the same content are extracted only once. Moreover, the rating information needs a transformation from non-descriptive content into descriptive text, because its content consists of several star images. The final results of the extracted structured information of this Web service are shown in table 3-11. Because there is no descriptive information for the Provider, Homepage, and Owner Homepage, their values are assigned "NULL".
Service Name | BLZService
WSDL Link | http://www.thomas-bayer.com/axis2/services/BLZService?wsdl
WSDL Version | 0
Server | Apache-Coyote/1.1
Description | BLZService
Rating | Four Stars and a Half
Provider | NULL
Homepage | NULL
Owner Homepage | NULL
Table 3-11: Extracted Structured Information of the Web Service "BLZService"
4) Second, the "getEndpointProperty" function extracts the endpoint information displayed in the red box of figure 3-15. Though there is more than one endpoint record in the list, only one of them is extracted as the endpoint information. That is because this master program is intended to extract as much information as possible, but this information should not contain redundant entries. Therefore, only one record is extracted as the endpoint information, even if there is more than one endpoint record. Table 3-12 shows the final results of the endpoint information of this Web service.
Figure 3-15: Endpoint information of the Web service "BLZService" in the service page
Endpoint Name | BLZServiceSOAP12port_http
Endpoint URL | http://www.thomas-bayer.com:80/axis2/services/BLZService
Endpoint Critical | True
Endpoint Type | production
Bound Endpoint | BLZServiceSOAP12Binding
Table 3-12: Extracted Endpoint Information of the Web service "BLZService"
5) Then the monitoring information is extracted by invoking the function "getMonitoringProperty". Figure 3-16 displays two types of monitoring properties: the red box above contains the monitoring information about the Web service, and the red box below lists the monitoring information of its endpoints. As already mentioned, only one endpoint statistic record is extracted. Besides, as can be seen in figure 3-16, there are two kinds of availability. Actually, they both represent the availability of this Web service, just like the availability shown in figure 3-14. Therefore, one of these availability values is sufficient. Table 3-13 shows the final results of this extraction process.
Figure 3-16: Monitoring information of the service "BLZService" in the service page
Service Availability | 100
Number of Downs | 0
Total Uptime | 1 day, 19 hours, 19 minutes
Total Downtime | 0 seconds
MTBF | 1 day, 19 hours, 19 minutes
MTTR | 0 seconds
RTT Max of Endpoint | 141 ms
RTT Min of Endpoint | 0 ms
RTT Average of Endpoint | 57.7 ms
Ping Count of Endpoint | 112
Table 3-13: Extracted Monitoring Information of the Web service "BLZService"
6) Next, in order to extract the Whois properties, the "getWhoisProperty" function first has to gain the service domain from the WSDL link. For this Web service, the gained service domain is "thomas-bayer.com". Then it sends this service domain as input to the Whois client for the querying process. After that, a list of information about that service domain is returned; see figure 3-17. Table 3-14 shows the extracted Whois information.
Figure 3-17: Whois information of the service domain "thomas-bayer.com"
Service Domain URL | thomas-bayer.com
Domain Name | Thomas Bayer
Domain Type | NULL
Domain Address | Moltkestr. 40
Domain Description | NULL
State | NULL
Postal Code | 54173
City | Bonn
Country | NULL
Country Code | DE
Phone | +4922855525760
Fax | NULL
Email | info@predic8.de
Organization | predic8 GmbH
Established Time | NULL
Table 3-14: Extracted Whois Information of the service domain "thomas-bayer.com"
7) Finally, all the information of these four aspects is collected together as the service properties, and these service properties are then forwarded to the Storage component.
3.2.4 The Function of the Storage Component
The WSDL link from the WSDL Grabber component is used by this Storage component to download the WSDL document from the Web and store it on the disk thereafter. In addition, the service properties from the Property Grabber component are also stored directly on the disk, in three different manners, by this Storage component. Figure 3-18 illustrates the process flow of the Storage component.
When the Storage component receives the needed inputs, the mediator function "Storager" of the Storage component is triggered. After that, it transforms the service properties into three different output formats and stores them on the disk: an XML file, an INI file, and database records. Besides, it also tries to download the WSDL document at the URL address of the WSDL link and then stores the obtained WSDL document on the disk, too. This "Storager" function is composed of the four sub functions "getWSDL", "generateXML", "generateDatabase", and "generateINI"; each sub function is in charge of one aspect of the storage tasks.
Figure 3-18: Overview of the process flow of the Storage Component
(1) The "getWSDL" sub function
The task of the "getWSDL" sub function is to download the WSDL document and then store it on the disk. Above all, it has to get the content of the WSDL document. This procedure is done as follows. First, the "getWSDL" sub function checks whether the value of the received WSDL link equals "NULL". As already presented in section 3.2.2, if a Web service does not have a WSDL link, the value of its WSDL link is assigned "NULL". In a case like that, the sub function creates a WSDL document whose name is the service name appended with the mark "No WSDL Document"; obviously, this document does not contain any content, it is an empty document. Nevertheless, if the service does have a WSDL link, this sub function tries to connect to the Internet at the URL address of the WSDL link. Once it succeeds, the contents hosted on the Web are downloaded, stored on the disk, and named only with the name of the service. Otherwise, it creates a WSDL document whose name is prefixed with "Bad" before the service name.
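These three cases (no WSDL link, successful download, unreachable link) can be condensed into the following Java sketch; the file naming follows the description above, while the ".wsdl" extension and the helper structure are merely illustrative.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Paths;

public class GetWsdlSketch {

    // Sketch of "getWSDL": the NULL case, the successful download, and the
    // unreachable link, following the naming scheme described above. The
    // ".wsdl" extension is an assumption made for this illustration.
    static void getWsdl(String path, String serviceName, String wsdlLink) throws IOException {
        if ("NULL".equals(wsdlLink)) {
            // No WSDL link: create an empty, specially marked document.
            Files.createFile(Paths.get(path, serviceName + " No WSDL Document.wsdl"));
            return;
        }
        try (InputStream in = new URL(wsdlLink).openStream()) {
            // The link works: store the downloaded content under the service name.
            Files.copy(in, Paths.get(path, serviceName + ".wsdl"));
        } catch (IOException e) {
            // The link is unreachable: create a document prefixed with "Bad".
            Files.createFile(Paths.get(path, "Bad" + serviceName + ".wsdl"));
        }
    }
}
```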
(2) The "generateXML" sub function
The "generateXML" sub function takes the service properties as input, transforms them into an XML file, and stores that file on the disk under a name with the structure service name plus ".xml". XML stands for eXtensible Markup Language, a markup language designed to transport and store data. The first line of an XML file is the XML declaration, which defines the XML version and the encoding used. For example, <?xml version="1.0" encoding="UTF-8"?> means that this is an XML file whose version is 1.0 and whose encoding is the 8-bit Unicode Transformation Format character set. Besides, an XML file contains XML elements, where an element spans everything from its start tag to its end tag. Moreover, an XML element can also contain other elements, simple text, or a mixture of both. However, an XML file must contain a root element as the parent of all other elements.
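As a purely hypothetical example of such an output file, the service properties of a service could look like this; the element names are illustrative, not the ones actually generated by the program.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- hypothetical element names; "service" acts as the required root element -->
<service>
  <name>BLZService</name>
  <wsdlLink>http://www.thomas-bayer.com/axis2/services/BLZService?wsdl</wsdlLink>
  <rating>Four Stars and a Half</rating>
</service>
```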
(3) The "generateINI" sub function
The "generateINI" sub function also takes the service's properties as input, but it transforms them into an INI file and then stores that file on the disk under a name with the structure service name plus ".ini". "INI" stands for initialization. The INI file format is a de facto standard for configuration files; INI files are simple text files with a basic structure. Generally speaking, an INI file contains three different kinds of parts: sections, parameters, and comments. The parameter is the basic element contained in an INI file. Its format is a key-value pair, also called a name-value pair. This pair is delimited by an equals sign "=", and the key or name always appears to the left of the equals sign. A section is like a room that groups all its parameters together. It always appears on a single line within a pair of square brackets "[]", and sections may not be nested. A comment is some descriptive text which begins with a semicolon ";". Anything between the semicolon and the end of the line is ignored.
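A hypothetical INI file for one service, showing the three parts named above; the section and key names are illustrative only.

```ini
; properties of one crawled service (this line is a comment)
[Structured]
ServiceName=BLZService
Rating=Four Stars and a Half
[Endpoint]
EndpointType=production
```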
(4) The "generateDatabase" sub function
The inputs of the "generateDatabase" sub function are the same as those of the previous two sub functions. Instead of transforming them into a file like XML or INI, the "generateDatabase" sub function turns them into data in the database by using SQL statements. SQL stands for Structured Query Language, a database language designed for accessing and manipulating data in a database. Most of the actions performed on a database are done with SQL statements, and the primary statements of SQL include "insert into", "delete", "update", "select", "create", "alter", and "drop". Therefore, for the purpose of transforming the service properties into data in the database, this sub function first has to create a database, using the "create database" statement. Then it has to create a table to store the data. A table is a collection of related data entries, and it consists of columns and rows. Since the data of these five Web Service Registries are not very large, one database table is enough for storing the service properties. Because of that, the column names for the service properties must be uniform and well-defined across all five Web Service Registries. Afterwards, the service properties of each single service can be put into the table as a record with the "insert into" statement of SQL.
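The following SQL sketch illustrates the statements just named; the database, table, and column names are hypothetical, and the real table would carry one "Text" column per service property.

```sql
-- Hypothetical names; the real program keeps one table for all five registries.
CREATE DATABASE dwsc;

CREATE TABLE service_properties (
    service_name TEXT,
    wsdl_link    TEXT,
    rating       TEXT   -- ... one "Text" column per further service property
);

-- One record per crawled service ("insert into" statement).
INSERT INTO service_properties (service_name, wsdl_link, rating)
VALUES ('BLZService',
        'http://www.thomas-bayer.com/axis2/services/BLZService?wsdl',
        'Four Stars and a Half');
```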
3.2.4.1 Features of the Storage Component
The Storage component has to provide the following features:
- Generate different output formats
The final result of this master program is to store the information about the services on the disk for future work. This Storage component provides three different formats for storing the service properties of the services published in the Web Service Registries, which makes the storage of the services very flexible and durable.
- Obtain the WSDL document
The important task of this component is to check whether the WSDL document can be obtained from the WSDL link, because the WSDL document plays a decisive role in determining the quality of the service. This Storage component provides the ability to deal with the different situations in the process of obtaining the WSDL document.
3.2.4.2 Input of the Storage Component
This component requires the following input data:
- WSDL link of each service
- The property information of each service
3.2.4.3 Output of the Storage Component
The component produces the following output data:
- WSDL document of each service
- XML document, INI file, and tables in the database
3.2.4.4 Demonstration of the Storage Component
The following figures describe the fundamental implementation code of this Storage component. The detailed depiction is illustrated below.
1) As can be seen in figures 3-19 to 3-21, there are several common elements among the implementation codes. The first common element concerns the parameters "path" and "SecurityInt", which are defined in each of these sub functions. The parameter "path" is an absolute path on the computer's disk; it is used for the storing procedure and has already been specified by the user at the beginning of the whole program. The parameter "SecurityInt" is an increasing integer which is used as a part of the name of the service. The reason for this is that it prevents services that have the same name from overwriting each other on the disk. The content of the red marks in the code of these figures is the second common element; its function is to create a file or document for the service with the corresponding parameters.
2) Figure 3-19 displays the implementation code of the "getWSDL" sub function, which is designed to get the WSDL document of a service based on its WSDL link. The parameter "name" is the name of the service, which is used as the name of the WSDL document. The parameter "linkStr" is the actual WSDL link, the most important parameter of this sub function. The other two parameters, "statistic" and "log", are the objects of the text files called "Statistic Information" and "Log Information", respectively. The "Statistic Information" text file is used to record the statistical data of the services in each Web Service Registry, such as the overall number of properties of that Web Service Registry, the overall number of services, the number of services that have no WSDL link, the number of services whose WSDL documents contain no content, the number of services whose WSDL links are not available, etc. The "Log Information" text file records the results of the process steps and the problems encountered, for example: which service is being crawled at the moment, which Web Service Registry it belongs to, the reason why the WSDL document of a service could not be obtained, the reason why a service is unreachable, and so on.
Figure 3-19: Implementation code for getting the WSDL document
3) Figure 3-20 and figure 3-21 show the code for turning the service properties into the XML file and the INI file, and for storing those two files on the disk thereafter. The parameter "vec" is a Vector of the "PropertyStruct" data type. This "PropertyStruct" data type is an object of a class that consists of two variables: name and value.
Figure 3-20: Implementation code for generating the XML file
Figure 3-21: Implementation code for generating the INI file
4) The code in figure 3-22 and figure 3-23 shows the process that turns the service properties of all services in these five Web Service Registries into records of the database. Therefore, a database has to be created first. The name of the database can be arbitrary, as long as it conforms to the naming rules of the database; the same holds for the name of the table. Figure 3-22 displays the procedure of creating a table in the database. Due to the fact that it is hard to decide on the length of each service property, the data types of all service properties are set to "Text". Figure 3-23 shows the code for inserting the service properties into the table with the "update" statement.
Figure 3-22: Implementation code for creating the table in the database
Figure 3-23: Implementation code for generating the table records
33 Multithreaded Programming for DWSC
Multithreaded programming is a built-in characteristic of the Java language. A multithreaded program contains two or more separate parts that can execute concurrently; each such part is called a thread. The use of multithreading makes it possible to create programs that use the system resources efficiently, for example by keeping the idle time of the CPU to a minimum.
In this master program there are five Web Service Registries that need to be crawled for the services published among them. Moreover, the number of services published in each Web Service Registry differs considerably, so the running time spent on each Web Service Registry differs as well. As a consequence, a Web Service Registry that owns fewer services would have to wait until another Web Service Registry with many more services finishes. Therefore, in order to reduce the waiting time for the other Web Service Registries and to maximize the use of the system resources, it is necessary to apply multithreaded programming to this master program. That is to say, this master program creates a thread for each Web Service Registry, and these threads are executed independently, as sketched below.
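A minimal sketch of this thread-per-registry scheme is shown here; the class name RegistryCrawler and the body of run() are assumptions standing in for the actual per-registry crawling procedure.

    // One crawler thread per registry; run() holds the per-registry crawl loop.
    public class RegistryCrawler extends Thread {
        private final String registryName;

        public RegistryCrawler(String registryName) {
            this.registryName = registryName;
        }

        @Override
        public void run() {
            // Placeholder for the actual per-registry crawling procedure.
            System.out.println("Crawling " + registryName + " ...");
        }

        public static void main(String[] args) throws InterruptedException {
            String[] registries = {"Service Repository", "Ebi", "Xmethods",
                                   "Seekda", "Biocatalogue"};
            Thread[] threads = new Thread[registries.length];
            for (int i = 0; i < registries.length; i++) {
                threads[i] = new RegistryCrawler(registries[i]);
                threads[i].start();            // the threads run independently
            }
            for (Thread t : threads) {
                t.join();                      // wait until every registry is finished
            }
        }
    }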
34 Sleep Time Configuration for Web Service Registries
Because this master program downloads the WSDL documents and extracts the service information of the Web services published in a Web Service Registry, it inevitably affects the performance of that Web Service Registry. In addition, in order not to exceed their throughput capability, these Web Service Registries restrict the rate of access. Because of that, unknown errors sometimes occur while this master program is executing: for instance, the program may halt at one point without obtaining any more WSDL documents and service information, the WSDL documents of some services of some Web Service Registries may not be obtainable, or some service information may be missing. Therefore, in order to obtain the largest possible set of the Web services published in these five Web Service Registries without affecting their throughput, the access rate has to be configured for each service of all Web Service Registries.
In consequence, before entering the essential procedure for each single service of these Web Service Registries, this master program calls the built-in function "sleep(long milliseconds)". It is a public static method that causes the currently executing thread to sleep for the specified number of milliseconds; in other words, it temporarily ceases execution for a while. The following table shows the time interval of the sleep function for each Web Service Registry.
Web Service Registry Name    Time Interval (milliseconds)
Service Repository           8000
Ebi                          3000
Xmethods                     10000
Seekda                       20000
Biocatalogue                 10000
Table 3-15 Sleep Time of these five Web Service Registries
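Using the intervals from table 3-15, a small throttling helper built on "Thread.sleep(long milliseconds)" could look like the following sketch; the class and method names are assumptions.

    import java.util.HashMap;
    import java.util.Map;

    public class SleepConfig {
        // Time intervals from table 3-15, in milliseconds.
        private static final Map<String, Long> SLEEP = new HashMap<>();
        static {
            SLEEP.put("Service Repository", 8000L);
            SLEEP.put("Ebi", 3000L);
            SLEEP.put("Xmethods", 10000L);
            SLEEP.put("Seekda", 20000L);
            SLEEP.put("Biocatalogue", 10000L);
        }

        // Called before processing each single service of the given registry.
        static void throttle(String registry) {
            try {
                Thread.sleep(SLEEP.get(registry));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // restore the interrupt flag
            }
        }
    }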
4 Experimental Results and Analysis
This chapter presents the quantitative experimental results of the prototype described in chapter 3, together with an analysis and explanation of these results. In order to obtain reasonably accurate results, the experiments were carried out more than five times; all data displayed in the following tables and charts are the averages of these runs.
41 Statistic Information for Different Web Service Registries
This section discusses the service count statistics of the Web services published in these five Web Service Registries. They include the overall number of Web services published in each Web Service Registry and the number of unavailable Web services, which have been archived because they may not be active anymore or are close to being inactive. Table 4-1 shows the service amount statistic of these five Web Service Registries.
Web Service Registry Name   Service Repository   Ebi   Xmethods   Seekda   Biocatalogue
Overall Services            57                   289   382        853      2567
Unavailable Services        0                    0     0          0        125
Table 4-1 Service amount statistic of these five Web Service Registries
In order to give an intuitive view of the service amount statistic in these five Web Service Registries, the bar chart in figure 4-1 visualizes the data of table 4-1. As the bar chart shows, on the one hand the overall number of Web services increases steadily from the Service Repository Web Service Registry to the Biocatalogue Web Service Registry. The Biocatalogue Web Service Registry in particular owns the largest number of Web services, which indicates that it has a much greater ability to provide Web services to users, because it contains far more services than the other four Web Service Registries. On the other hand, there are no unavailable services in any Web Service Registry except for the Biocatalogue Web Service Registry; that is to say, only the Biocatalogue Web Service Registry contains Web services that cannot be used by the users. To some degree this is wasteful, since these services cannot be used anymore yet still consume network resources on the Web. Therefore, all these unavailable services should be eliminated in order to reduce the waste of network resources.
Figure 4-1 Service amount statistic of these five Web Service Registries
42 Statistic Information for WSDL Document
Web Service Registry Name   Service Repository   Ebi   Xmethods   Seekda   Biocatalogue
Failed WSDL Links           1                    0     23         145      32
Without WSDL Links          0                    0     0          0        16
Empty Content               0                    0     2          0        2
Table 4-2 Statistic information for WSDL Document
Table 4-2 and figure 4-2 show the error statistics of the WSDL documents of the Web services in these five Web Service Registries. There are three aspects. The first one is the number of "Failed WSDL Links" of the Web services in these Web Service Registries, i.e. the overall number of Web services whose WSDL links are invalid. In other words, it is impossible to get the WSDL documents of these Web services via the URL addresses of their WSDL links, and therefore no WSDL document is created for them. The second aspect is the number of Web services "Without WSDL Links" in each Web Service Registry, i.e. the overall number of Web services that have no WSDL link at all. For such Web services the value of the WSDL link is "NULL"; a WSDL document is still created, but it has no content, and its name contains the string "[No WSDL Document]". The third aspect is the "Empty Content" category, which represents the overall number of Web services that do have WSDL links whose URL addresses
are valid, but whose WSDL documents contain no content. In this case a WSDL document whose name contains the string "(BAD)" is created.
Figure 4-2 Statistic information for WSDL Document
43 Comparison of Different Average Number of Service Properties
This section compares the average number of service properties in these five Web Service Registries. This average number of service properties is calculated by means of the following equation:

\[ \text{ASP} = \frac{\text{ONSP}}{\text{ONS}} \tag{1} \]

where
ASP is the average number of service properties for one Web Service Registry,
ONSP is the overall number of service properties in one Web Service Registry,
ONS is the overall number of Web services that have already been crawled from the corresponding Web Service Registry.
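As a purely illustrative example with hypothetical numbers: a registry from which 100 services were crawled and 1700 service properties were extracted in total would have ASP = 1700 / 100 = 17 properties per service.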
Figure 4-3 shows the average number of service properties per Web service in these five Web Service Registries. As mentioned before, one of the measures for assessing the quality of the Web services in a Web Service Registry is the service information: the more information about a Web service is available, the better users know that service, and consequently the better the quality of the Web services the corresponding Web Service Registry can offer to its users. As can be seen from figure 4-3, the Service Repository and Biocatalogue Web Service Registries own a larger number of service properties than the other three Web Service Registries. This directly reflects that these two Web Service Registries provide more detailed information about the Web services published in them, so users can more easily choose the services they need and would also be more inclined to use the Web services published in these two Web Service Registries. By contrast, the Xmethods and Seekda
Web Service Registries, which have less service information about their Web services, offer lower quality for these Web services. Therefore users may be less inclined to use the Web services provided in these two Web Service Registries, not to mention the Web services published in the Ebi Web Service Registry.
Figure 4-3 Average Number of Service Properties
Following the description presented in section 323, the causes of the differing numbers of service properties in these Web Service Registries may consist of the following points. First, the amount of structured information for the Web services differs among these five Web Service Registries, and part of the information for some Web services in a Web Service Registry may even be missing or have an empty value; for example, the amount of structured information that is supposed to be extracted for the Web services in the Ebi Web Service Registry differs greatly from that in the Biocatalogue Web Service Registry. Secondly, there is a certain amount of endpoint information in almost all Web Service Registries except the Ebi Web Service Registry, which more or less reduces its overall number of service properties. Thirdly, some Web Service Registries, such as Xmethods and Ebi, do not have monitoring information, whereas the Service Repository Web Service Registry in particular has a large amount of monitoring information about its Web services that can be extracted from the Web. The last point, obviously, is the amount of whois information for these Web services: if the database of the whois client does not contain information about the service domain of a Web service in a Web Service Registry, then no whois information can be extracted, and even if information about the service domain exists, its amount can vary considerably. Therefore, if many service domains of the Web services in a registry have no or only little whois information, then the average number of service properties in that registry decreases greatly.
As a result, in order to help users become better acquainted with the Web services provided in a Web Service Registry and to distinguish these Web services, the Web Service Registry should do its best to offer as much information as possible for each of its published Web services.
[Figure 4-3 is a bar chart of the average number of service properties per registry; the recoverable values are approximately Service Repository: 23, Ebi: 7, Xmethods: 17, Seekda: 17, Biocatalogue: 32 properties.]
44 Different Outputs of Web Services
The fundamental task of this master program is to obtain the WSDL documents of the Web services hosted in the Web Service Registries, as well as to extract and gather the properties of these Web services and store them on disk thereafter. Therefore, this section describes the different outputs of this master program, which include the WSDL documents of the Web services, the generated XML and INI files, and the data records of these service properties in the database.
Figure 4-4 WSDL Document format of one Web service
The WSDL document of a Web service is simply read from the Web according to the URL address of its WSDL link, and these data are then stored on disk as the WSDL document. The name of the WSDL document is the service name plus the ending ".wsdl", which marks it as a WSDL document. In order to distinguish WSDL documents whose names would be the same while their contents differ, the name of each obtained WSDL document in one Web Service Registry is prefixed with a unique Integer. Figure 4-4 shows the valid WSDL document format of a Web service; its name is "1BLZService.wsdl".
Besides, the obtained service properties are transformed into an XML file, an INI file and
data records in the database. Figure 4-5, figure 4-6 and figure 4-7 show these three different output formats respectively. As can be seen from figure 4-5, this is the INI file of the Web service, named "1BLZService.ini". The Integer is the same as in the WSDL document, because both files belong to the same Web service. The first three lines in the INI file are service comments, which start with a semicolon and run to the end of the line; they are basic information describing this INI file. The following line is the section, which is enclosed in a pair of brackets. It is important because it indicates that the lines behind it contain the information of this Web service. The rest of the lines therefore hold the actual service information as key-value pairs with an equals sign between key and value, and each service property is displayed from the beginning of a line. A sketch of this layout is shown below.
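Based on this description, such an INI file might look roughly as follows; the concrete comment lines, the section name and the property names are illustrative assumptions.

    ; Generated by the Deep Web Service Crawler
    ; Web Service Registry: Service Repository
    ; Service: BLZService
    [service]
    Service Name=BLZService
    Provider=thomas-bayer.com
    WSDL Link=http://www.thomas-bayer.com/axis2/services/BLZService?wsdl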
Figure 4-5 INI File format of one Web service
Figure 4-6 XML File format of one Web service
Figure 4-7 Database data format for all Web services
Furthermore, figure 4-6 shows the generated XML file of the Web service "BLZService"; its name is "1BLZService.xml". Needless to say, this XML file is part of the materials of the same Web service. Although the format of the XML file is different from that of the INI file, their essential contents are the same; that is to say, the values of the service properties do not differ, because both files are generated from the same collection of properties of the same Web service. The XML file also has some comments like those in the INI file, which are displayed between "<!--" and "-->", and the section of the INI file corresponds roughly to the root element of the XML file. Therefore, all values of the elements under the root "service" in this XML file are the values of the service properties of this Web service. A sketch of such a file is shown below.
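An XML file following this description might look roughly as follows; apart from the root element "service", the element names are illustrative assumptions.

    <!-- Generated by the Deep Web Service Crawler -->
    <service>
      <name>BLZService</name>
      <provider>thomas-bayer.com</provider>
      <wsdl_link>http://www.thomas-bayer.com/axis2/services/BLZService?wsdl</wsdl_link>
    </service>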
Eventually, as can be seen from figure 4-7, there is a database table which is used to store the service information of all Web services in these five Web Service Registries, and the entire service information of one Web service forms exactly one record in this table. Because of that, the column names of the table are the union of the names of the service information fields of each Web Service Registry. However, since the column names of the table must be unique, redundant names in this union have to be eliminated. This is sensible and possible because the names of the service information fields are well-defined and uniform across all five Web Service Registries. In addition, the first column of this table is the primary key, an increasing Integer whose function resembles the Integer contained in the names of the XML and INI files, while the remaining columns are the corresponding service properties of the Web services. The value "NULL" or "Null" in this table indicates that this property of that Web service is empty or missing.
45 Comparison of Average Time Cost for Different Parts of Single Web Service
This section compares the average time cost of the different parts of getting one single Web service in all these five Web Service Registries. At first, the average time cost of getting one single service in a Web Service Registry has to be calculated, which is done with the following equation:

\[ \text{ATC} = \frac{\text{OTS}}{\text{ONS}} \tag{2} \]

where
ATC is the average time cost for one single Web service,
OTS is the overall time cost of all the Web services in one Web Service Registry,
ONS is the overall number of Web services that have already been crawled from the corresponding Web Service Registry.
In addition, the average time cost for getting one single service consists of the following six parts: the average time cost for extracting the service properties, the average time cost for obtaining the WSDL document, the average time cost for generating the XML file, the average time cost for generating the INI file, the average time cost for inserting the service properties into the table of the database, and the average time cost for the remaining procedures, such as getting the service list page link, getting the service page link, and so on. The average time cost for extracting the service properties is obtained by means of the following equation:
\[ \text{ATCSI} = \frac{\text{OTSSI}}{\text{ONS}} \tag{3} \]

where
ATCSI is the average time cost for extracting the service properties of one single Web service,
OTSSI is the overall time cost for extracting the service properties of all the Web services in one Web Service Registry,
ONS is the overall number of Web services that have already been crawled from the corresponding Web Service Registry.
The calculation of the other parts is similar to the equation for the average time cost of extracting the service properties, while the average time cost of the remaining procedures equals the average time cost of one single Web service minus the sum of the average time costs of the other five parts.
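Written out explicitly, with subscripts introduced here only for readability, this subtraction reads:

\[ \text{ATC}_{\text{Others}} = \text{ATC} - (\text{ATC}_{\text{Property}} + \text{ATC}_{\text{WSDL}} + \text{ATC}_{\text{XML}} + \text{ATC}_{\text{INI}} + \text{ATC}_{\text{Database}}) \]

For example, for the Ebi row of table 4-3 below: 823 - (699 + 82 + 2 + 1 + 28) = 11 milliseconds, which matches its "Others" column.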
Web Service Registry   Service property   WSDL Document   XML File   INI File   Database   Others   Overall
Service Repository     8801               918             2          1          53         267      10042
Ebi                    699                82              2          1          28         11       823
Xmethods               5801               1168            2          1          45         12       7029
Seekda                 5186               1013            2          1          41         23       6266
Biocatalogue           39533              762             2          1          66         1636     42000
Table 4-3 Average time cost information for all Web Service Registries (in milliseconds)
Table 4-3 displays the average time cost of one Web service and of its different parts in all these five Web Service Registries. The first column of table 4-3 contains the names of the five Web Service Registries, and the last column is the average time cost of a single service in the respective Web Service Registry, while the remaining columns hold the average time costs of the six different parts. In order to give an intuitive view of the data in table 4-3, the data in each column of this table are illustrated in the corresponding figures 4-8 to 4-13.
Figure 4-8 Average time cost for extracting service property in all Web Service Registries
As can be seen from figure 4-8, the average time cost for extracting the service properties in the Biocatalogue Web Service Registry is 39533 milliseconds, which is much larger than in the other four Web Service Registries (8801, 699, 5801 and 5186 milliseconds for Service Repository, Ebi, Xmethods and Seekda respectively). That is to say, it takes much longer to extract the service properties of the Web services published in the Biocatalogue Web Service Registry. This also indirectly indicates that the Biocatalogue Web Service Registry has the highest average number of service properties, as already discussed in section 43; conversely, the average number of service properties in the Ebi Web Service Registry is the smallest. Moreover, the average time cost in the Xmethods Web Service Registry is larger than in the Seekda Web Service Registry, although, as shown above, the average number of service properties is the same for these two Web Service Registries. One cause that might explain why Xmethods costs more time than Seekda is that the extraction of service properties in the Xmethods Web Service Registry has to go through both the service page and the service list page, while for the Seekda Web Service Registry only the service page link is needed.
The average time cost for obtaining the WSDL document in all these five Web Service Registries is displayed in figure 4-9. This average time cost is actually the sum of the average time for extracting the WSDL link of a Web service and the average time for reading the data of the WSDL document from the Web and storing it on disk. As seen from this figure, the average time cost for obtaining the WSDL document is largest in the Xmethods Web Service Registry, at 1168 milliseconds. Although the process of extracting the WSDL link costs a certain amount of time, it does not have a significant influence on the total average time spent obtaining the WSDL document, because the WSDL link of a Web service is almost always gained in one step. This implies that the average size of the WSDL documents in the Xmethods Web Service Registry is larger than in the other four Web Service Registries, and in particular larger than in the Ebi Web Service Registry, which takes just 82 milliseconds for obtaining the WSDL document.
Figure 4-9 Average time cost for obtaining WSDL document in all Web Service Registries
Figures 4-10, 4-11 and 4-12 show the average time cost of generating the three different outputs in all these five Web Service Registries. As can be seen from figures 4-10 and 4-11, the average time for generating the XML file of one Web service is the same for all five Web Service Registries, namely only 2 milliseconds, and the average time for generating the INI file of one Web service is likewise the same everywhere, at just 1 millisecond. Even the sum of these two average time costs is still so small that it can be neglected when compared with the overall average time cost of getting one Web service for each corresponding Web Service Registry shown in figure 4-13. This implies that the generation of the XML and INI files is finished practically immediately after receiving the service properties of a Web service as input. Furthermore, as figure 4-12 shows, although the average time cost of creating the database record for each Web service is larger in all five Web Service Registries than the time for generating the XML and INI files, the operation of creating a database record is still fast.
Figure 4-10 Average time cost for generating XML file in all Web Service Registries
Figure 4-11 Average time cost for generating INI file in all Web Service Registries
Figure 4-12 Average time cost for creating database record in all Web Service Registries
Figure 4-13 Average time cost for getting one Web service in all Web Service Registries
Figure 4-13 gives the average time cost for getting one single Web service in all these five Web Service Registries. Without any doubt, the Biocatalogue Web Service Registry takes the longest time for this process, because, as presented above, the average time cost of each part is highest in the Biocatalogue Web Service Registry, except for the process of obtaining the WSDL document, where Biocatalogue does not cost the most time. Moreover, a noteworthy observation can be made when looking at figures 4-8, 4-12 and 4-13: the shapes of these three charts follow almost the same trend. This further indicates that a Web Service Registry that spends more time getting the description information of a Web service also offers more information about that Web service.
5 Conclusion and Further Direction
This master thesis provides a schema which aims to explore the description information of the Web services hosted in different Web Service Registries. The description information of a Web service consists of the WSDL document and the service information about that Web service. Although there is an existing approach, the Pica-Pica Web Service Description Crawler, that can be used to obtain the WSDL documents and service information of the Web services in these Web Service Registries, its functionality is restricted: it explores only a small subset of the Web services hosted in these Web Service Registries, the storage of these Web services is not flexible, and, most importantly, only a little service information of a Web service is extracted, and for some Web Service Registries no service information is extracted at all. The work presented in this master thesis, however, is able to explore all Web services in these Web Service Registries. Moreover, with the approach presented in this master thesis, as much service information of a Web service as possible is extracted, so that the final result is the largest annotated service catalogue ever produced. Furthermore, regarding the storage of the description information of the Web services, this master thesis provides three different ways that guarantee not only the completeness but also the longevity of the description information of the Web services.
However, in the implementation performed in this master thesis, the whois client used for querying the information of a service domain returns free text if the information exists, and sometimes this free text differs completely from case to case. This makes it necessary to crawl each Web service in all Web Service Registries at least once in the experiment stage, so that all variants of this free text can be foreseen and processed afterwards. This is nevertheless a huge amount of work, because there are lots of Web services in these Web Service Registries. Therefore, in order to simplify this work, another whois client that eases this task needs to be found and used.
Moreover, in the experiment stage of this master thesis, the time cost for getting a Web service is still large. For the purpose of reducing this time, multithreaded programming could additionally be applied to some parts of the process of getting one Web service.
Although the work performed here is specialized for only these five Web Service Registries, the main parts of the principles used here are adaptable to other Web Service Registries with only small changes in the implementation code or structure.
6 Bibliography
[1] Nathalie Steinmetz, Holger Lausen, Dario Cerizza, Andrea Turati, Irene Celino, Adam Funk and Michael Erdmann, "D1.1 - Requirement Analysis and Architectural Plan", Emanuele Della Valle (CEFRIEL), June 27, 2008. Available from http://www.service-finder.eu/deliverables/public/7-public/32-d11
[2] Nathalie Steinmetz, Holger Lausen, Dario Cerizza, Andrea Turati, Irene Celino, Adam Funk and Michael Erdmann, "D1.2 - First Design of Service-Finder as a Whole", Emanuele Della Valle (CEFRIEL), July 1, 2008. Available from http://www.service-finder.eu/deliverables/public/7-public/37-d12-first-design-of-service-finder-as-a-whole
[3] Nathalie Steinmetz, Holger Lausen, Irene Celino, Dario Cerizza, Saartje Brockmans, Adam Funk, "D1.3 - Revised Requirement Analysis and Architectural Plan", Emanuele Della Valle (CEFRIEL), April 1, 2009. Available from http://www.service-finder.eu/deliverables/public/7-public/57-d13-revised-requirement-analysis-and-architectural-plan
[4] Chia-Hui Chang, Mohammed Kayed, Moheb Ramzy Girgis, Khaled Shaalan, "A Survey of Web Information Extraction Systems", IEEE Transactions on Knowledge and Data Engineering, Volume 18, Issue 10, IEEE Computer Society, pp. 1411-1428, October 2006.
[5] Leonard Richardson, "Beautiful Soup Documentation", October 13, 2008. Available from http://www.crummy.com/software/BeautifulSoup/documentation.html
[6] Hao He, Hugo Haas, David Orchard, "Web Services Architecture Usage Scenarios", February 11, 2004. Available from http://www.w3.org/TR/ws-arch-scenarios
[7] Stephen Soderland, "Learning Information Extraction Rules for Semi-Structured and Free Text", Machine Learning, Volume 34, Issue 1-3, pp. 233-272, Department of Computer Science and Engineering, University of Washington, Seattle, February 1999.
[8] Ian Hickson, "A Vocabulary and Associated APIs for HTML and XHTML", World Wide Web Consortium, Working Draft WD-html5-20100624, January 22, 2008.
[9] Holger Lausen, Jos de Bruijn, Axel Polleres, Dieter Fensel, "The Web Service Modeling Language WSML", WSML Deliverable D16.1v0.2, March 20, 2005. Available from http://www.wsmo.org/TR/d16/d16.1/v0.2/
[10] Dumitru Roman, Holger Lausen, Uwe Keller, "Web Service Modeling Ontology - Standard (WSMO-Standard)", WSMO deliverable D2 version 1.1, 06 March 2004. Available from http://www.wsmo.org/TR/d2/v1.1/
[11] Iris Braun, Anja Strunk, Gergana Stoyanova, Bastian Buder, "ConQo - A Context- and QoS-Aware Service Discovery", TU Dresden, Department of Computer Science, in Proceedings of WWW/Internet 2008.
7 Appendixes
There are two additional outputs of this master program: the log information file and the statistic report file. Figure 8-1 shows one of the basic output formats of the log information for these five Web Service Registries.
Figure 8-1 Log information of the "Service Repository" Web Service Registry
Figure 8-2 Statistic information of the "Service Repository" Web Service Registry
Figures 8-2 to 8-6 show the output format of the statistic information for the "Service Repository", "Ebi", "Xmethods", "Seekda" and "Biocatalogue" Web Service Registries respectively.
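Since these statistic reports are only available as screenshots, the following illustrates roughly what such a report might contain; the wording of the entries is an assumption, while the numbers correspond to the Service Repository values from tables 4-1 and 4-2.

    Statistic Information - Service Repository
    Overall number of services: 57
    Services without WSDL link: 0
    Services with failed WSDL links: 1
    Services with empty WSDL content: 0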
Figure 8-3 Statistic information of the "Ebi" Web Service Registry
Figure 8-4 Statistic information of the "Xmethods" Web Service Registry
Figure 8-5 Statistic information of the "Seekda" Web Service Registry
Figure 8-6 Statistic information of the "Biocatalogue" Web Service Registry
Table of Figures
Figure 2-1 Dataflow of Service-Finder and Its Components ... 12
Figure 2-2 Left is the free text input type and right is its output ... 16
Figure 2-3 A Semi-structured page containing data records (in rectangular box) to be extracted ... 16
Figure 2-4 Architecture of the Pica-Pica Web Service Description Crawler ... 20
Figure 3-1 Overview of the Basic Architecture for the Deep Web Services Crawler ... 25
Figure 3-2 Overview of the process flow of the Web Service Extractor Component ... 27
Figure 3-3 Service list page of the Service-Repository ... 29
Figure 3-4 Original source code of the internal link for Web service "BLZService" ... 29
Figure 3-5 Code overview of getting service page link in Service Repository ... 29
Figure 3-6 Service page of the Web service "BLZService" ... 29
Figure 3-7 Overview of the process flow of the WSDL Grabber Component ... 30
Figure 3-8 WSDL link of the Web service "BLZService" in the service page ... 31
Figure 3-9 Original source code of the WSDL link for Web service "BLZService" ... 32
Figure 3-10 Code overview of "getServiceRepositoryWSDLLink" function ... 32
Figure 3-11 Code overview of "oneParameter" function ... 32
Figure 3-12 Overview of the process flow of the Property Grabber Component ... 33
Figure 3-13 Structure properties of the Service "BLZService" in service list page ... 37
Figure 3-14 Structure properties of the Service "BLZService" in service page ... 38
Figure 3-15 Endpoint information of the Web service "BLZService" in service page ... 38
Figure 3-16 Monitoring Information of the Service "BLZService" in service page ... 39
Figure 3-17 Whois Information of the service domain "thomas-bayer.com" ... 40
Figure 3-18 Overview of the process flow of the Storage Component ... 41
Figure 3-19 Implementation code for getting WSDL document ... 44
Figure 3-20 Implementation code for generating XML file ... 44
Figure 3-21 Implementation code for generating INI file ... 45
Figure 3-22 Implementation code for creating table in database ... 45
Figure 3-23 Implementation code for generating table records ... 46
Figure 4-1 Service amount statistic of these five Web Service Registries ... 49
Figure 4-2 Statistic information for WSDL Document ... 50
Figure 4-3 Average Number of Service Properties ... 51
Figure 4-4 WSDL Document format of one Web service ... 52
Figure 4-5 INI File format of one Web service ... 53
Figure 4-6 XML File format of one Web service ... 53
Figure 4-7 Database data format for all Web services ... 53
Figure 4-8 Average time cost for extracting service property in all Web Service Registries ... 55
Figure 4-9 Average time cost for obtaining WSDL document in all Web Service Registries ... 56
Figure 4-10 Average time cost for generating XML file in all Web Service Registries ... 57
Figure 4-11 Average time cost for generating INI file in all Web Service Registries ... 57
Figure 4-12 Average time cost for creating database record in all Web Service Registries ... 58
Figure 4-13 Average time cost for getting one Web service in all Web Service Registries ... 58
Table of Tables
Table 3-1 Structured Information of Service-Repository Web Service Registry ... 34
Table 3-2 Structured Information of Xmethods Web Service Registry ... 34
Table 3-3 Structured Information of Seekda Web Service Registry ... 34
Table 3-4 Structured Information of Ebi Web Service Registry ... 34
Table 3-5 Structured Information of Biocatalogue Web Service Registry ... 34
Table 3-6 SOAP Operation Information of Biocatalogue Web Service Registry ... 35
Table 3-7 REST Operation Information of Biocatalogue Web Service Registry ... 35
Table 3-8 Endpoint Information of these five Web Service Registries ... 35
Table 3-9 Monitoring Information of these five Web Service Registries ... 35
Table 3-10 Whois Information for these five Web Service Registries ... 36
Table 3-11 Extracted Structured Information of Web Service "BLZService" ... 38
Table 3-12 Extracted Endpoint Information of the Web service "BLZService" ... 39
Table 3-13 Extracted Monitoring Information of the Web service "BLZService" ... 39
Table 3-14 Extracted Whois Information of service domain "thomas-bayer.com" ... 40
Table 3-15 Sleep Time of these five Web Service Registries ... 47
Table 4-1 Service amount statistic of these five Web Service Registries ... 48
Table 4-2 Statistic information for WSDL Document ... 49
Table 4-3 Average time cost information for all Web Service Registries ... 55
Table of Abbreviations
DTD Document Type Definition
DWSC Deep Web Service Crawler
HTML HyperText Markup Language
MTBF Mean Time between Failures
MTTR Mean Time to Recovery
QoS Quality of Service
REST Representational State Transfer
RTT Round Trip Time
SOAP Simple Object Access Protocol
SQL Structured Query Language
URL Uniform Resource Locator
WHATWG Web Hypertext Application Technology Working Group
XML eXtensible Markup Language
WSDL Web Service Description Language
WSML Web Service Modeling Language
WSMO Web Service Modeling Ontology
2121 The Principle of the Service Crawler Component
The Service Crawler is responsible for gathering available services and their related information from the Web. The overall cycle is depicted as follows:
(1) A Web developer publishes a Web Service.
(2) The Crawling component then begins to harvest the Web in order to identify Web Services, e.g. via WSDL (Web Service Description Language) documents.
(3) As soon as a service is discovered, the Crawler also searches for other related information.
(4) After each periodic interval, the Crawler produces a consistent snapshot of the relevant part of the Web.
Finally, the output of the crawler is forwarded to the subsequent components for analyzing, indexing and displaying.
2122 The Principle of the Automatic Annotator Component
The Automatic Annotator receives the relevant data from the previous component and generates semantic service descriptions about the WSDL documents and their related information, based on the Service-Finder Ontology and the Service Category Ontology.
First, the two compatible ontologies that are used throughout the whole process are briefly introduced [2]:
- Generic Service Ontology: an ontology which is used to describe the data objects, for example the services, the service providers, availability, payment modalities, and so on.
- Service Category Ontology: an ontology which is used to categorize the functionalities or applications of the services, for instance data verification, messaging, data storage, weather, etc.
Afterwards, the function of this component together with its input and output is described:
Input:
- Crawled data from the Service Crawler
- Service-Finder Ontologies
- Feedback or corrections of earlier annotations
Function:
- Enrich the information about the service and extract semantic statements according to the Service-Finder Ontologies, for example categorizing the service according to the Service Category Ontology
- Determine whether a particular document is relevant or not through the Web link graph, and discard irrelevant documents
- Classify the pages into their genres, for instance pricing, user comments, FAQ, and so on
Output:
- Semantic annotations of the services
2123 The Principle of the Conceptual Indexer and Matcher Component
The Conceptual Indexer and Matcher is more like a data store center that aims at storing all extracted information about the services and at supplying users with retrieval and semantic query capabilities, for example the matchmaking between user requests and service offers, and the retrieval of user feedback on extracted annotations.
In addition, the function of this component and its input and output are as follows:
Input:
- Semantic annotation data and full-text information obtained from the Automatic Annotator
- Semantic annotation data and full-text information that come from the user interfaces
- Cluster data from the user and service clustering component
Function:
- Store the semantic annotations received from the Automatic Annotator component and from the user interface
- Store the cluster data procured through the clustering component
- Store and index the textual descriptions offered by the Automatic Annotator component and the textual comments offered by users
- Ontological querying of the semantic data in the data store center
- Combined keyword and ontological querying for user queries
- Provide a list of similar services for a given service
Output:
- A list of matching services for a user query; in particular, these services should be sorted by ranking and be iterable
- All available data related to a particular entity must be retrievable at the user interface
2124 The Principle of the Service-Finder Portal Interface Component
The Service-Finder Portal Interface is the main entry point for users of the Service-Finder system to search and browse the data which is managed by the Conceptual Indexer and Matcher component. In addition, users can contribute information by providing tags, comments, categorizations and ratings for the browsed data. Furthermore, developers can directly invoke the Service-Finder functionalities from their custom applications through an API. The details of this component's function, input and output are as follows:
Input:
- A list of ordered services for a query
- Detailed information about a service or a set of services and a service provider
- Query access to the service category ontology and the most used tags provided by the users
- Service availability information
Function:
- The Web interface allows the users to search services by keyword, tag or concept in the categorization; to sort and filter query results by refining the query; to compare and bookmark services; and to try out the services that offer this functionality
- The API allows the developers to invoke Service-Finder functionalities
Output:
- Explicit user annotations such as tags, ratings, comments, descriptions, and so on
- Implicit user data, for example click streams of users, bookmarks, comparisons, links sent, etc.
- Manual advertising of available new services
2125 The Principle of the Cluster Engine Component
The Cluster Engine is in charge of collecting and analyzing information about user behavior from the Service-Finder Portal, e.g. the queried services and the compared services of the users. Moreover, it also provides cluster data to the Conceptual Indexer and Matcher for providing service recommendations. The component's function, input and output are as follows:
Input:
- Service annotation data, both extracted and from user feedback
- Users' click streams, used for extracting user behavior
Function:
- Obtain user clusters from user behavior
- Obtain service clusters from service annotation data in order to be able to find similar services
Output:
- Clusters of users and services
22 Information Extraction
Due to the rapid development and use of the World-Wide Web, a huge number of information sources have been produced on the Internet, but access to them has been limited to browsing and searching because of the heterogeneity and the lack of structure of Web information sources. Therefore, Information Extraction, which transforms Web pages into program-friendly structures for post-processing, has become a great necessity. The task of Information Extraction is specified in terms of its inputs and its extraction targets, and the technique used in the process of Information Extraction is called the extractor.
221 Input Types of Information Extraction
Generally speaking, there are three different input types. The first input type is the unstructured document, for example the free text shown in figure 2-2. It is unstructured and written in natural language, so it requires substantial natural language processing. The second input type is the structured document, for instance XML documents, whose data can be described through an available DTD (Document Type Definition) or XML (eXtensible Markup Language) schema. The third input type, finally, is the semi-structured document, which is widespread on the Web, such as the large volume of HTML pages containing tables, itemized lists and enumerated lists; HTML tags are often used to render such embedded data in HTML pages, see figure 2-3.
Figure 2-2 Left is the free text input type and right is its output [4]
Figure 2-3 A Semi-structured page containing data records (in rectangular box) to be extracted [4]
In this way, inputs of the semi-structured type can be seen as documents with a fairly regular structure, and the data of these documents can be displayed in an HTML or non-HTML format. Besides, since the Web pages of the Deep Web are dynamic and generated from structured databases according to templates or layouts, the Deep Web can be considered one of the input sources that provide such semi-structured documents. For example, the author, price and comment sections of the book pages provided by Amazon all have the same layout, because these Web pages are generated from the same database and rendered with the same template or layout. Furthermore, manually generated HTML pages can also be of the semi-structured type: for example, although the publication lists on different researchers' homepages are produced by diverse users, they all have a title and a source property for every single paper. Eventually, the inputs of an Information Extraction task can also be pages of the same class or pages from various Web Service Registries.
222 Extraction Targets of Information Extraction
Moreover, regarding the task of Information Extraction, the extraction target has to be considered. There are two different extraction targets. The first one is a relation of k-tuples, where k is the number of attributes in a record; in some cases an attribute of a record may have no instantiation, in others it has multiple instantiations. The second extraction target is the complex object with hierarchically organized data. Although the ways of depicting extraction targets in a page are diverse, the most common structure is the hierarchical tree. Such a hierarchical tree may contain only one leaf node, or one or more lists of leaf nodes grouped under so-called internal nodes. The structure of a data object may also be flat or nested: in brief, if the structure is flat, then there is only one leaf node, which can also be called the root; if it is a nested structure, then the internal nodes involved in this data object span more than two levels.
Furthermore, in order to make the Web pages readable for human beings and easier to visualize, tables or tuples of the same list, or elements of a tuple, should be clearly isolated or demarcated. However, the display of a data object in a Web page can be affected by the following conditions [4]:
- The attribute of a data object has zero or several values:
(1) If there is no value for an attribute of a data object, this attribute is called a "none" attribute. For example, a special offer that is only available for certain books might be a "none" attribute.
(2) If there is more than one value for an attribute of a data object, it is called a "multiValue" attribute. For instance, the name of the author of a book could be a "multiValue" attribute.
- The set of attributes (A1, A2, A3, ...) has multiple orderings:
That is to say, within this set of attributes the position of an attribute might change for different instances of a data object. Such an attribute is called a "multiOrdering" attribute. For instance, for movies before the year 1999 a movie site might list the release date in front of the movie's title, while for movies from 1999 onwards it lists the release date behind the movie's title.
- The attribute has different formats:
This means that the display format of the data object can be completely different for different instances. Therefore, if the format of an attribute is free, then a lot of rules are needed to deal with all possible cases. This kind of attribute is called a "multiFormat" attribute. For example, an e-commerce Web site might use a bold font to present the regular prices, while using a red color to display the sale prices. There is also the opposite situation, in which different attributes of a data object have the same format; for example, various attributes may all be presented using the <TD> tags of a table. Such attributes can be differentiated by means of their order information. However, in cases where a "none" attribute occurs or "multiOrdering" attributes exist, the rules for extracting these attributes have to be revised.
- The attribute cannot be decomposed:
For easier processing, input documents are sometimes treated as strings of tokens instead of strings of characters. However, some attributes cannot be decomposed into several individual tokens; these attributes are called "untokenized" attributes. Examples are college course codes like "COMP4016" or "GEOL2001": the department code and the course number in them cannot be separated into two different strings of characters such as "COMP" and "4016" or "GEOL" and "2001".
223 The Used Techniques in Information Extraction
The extractor used in the process of Information Extraction aims at providing a single uniform query interface to access information sources such as database servers and Web servers. It consists of the following phases: collecting the returned Web pages, labeling these Web pages, generalizing extraction rules, extracting the relevant data, and outputting the result in an appropriate format (XML format or a relational database) for further information integration. For example, the extractor first queries the Web server to gather the returned pages through the HTTP protocol, then extracts the contents of these HTML documents, and thereafter integrates them with other data sources. The whole process of the extractor follows the steps below:
Step 1:
At the beginning, the input has to be tokenized. There are two different granularities for the tokenization of the input string: tag-level encoding and word-level encoding. Tag-level encoding transforms the tags of the HTML page into general tokens, while transforming every text string between two tags into a special token. Word-level encoding does this in another way: it treats each word in a document as a token.
Step 2:
Next, the extraction rules are applied for every attribute of the data object in the Web pages. These extraction rules can be induced by means of top-down or bottom-up generalization, pattern mining, or logic programming. In addition, the type of the extraction rules may be expressed in terms of regular grammars or logic rules; for example, some use path expressions over the HTML parse tree, such as html.head.title or html->table[0], some use syntactic or semantic constraints, and some use delimiter-based constraints such as HTML tags or literal words. A sketch of such a delimiter-based rule is given after these steps.
Step 3:
After that, all the extracted data are assembled into records.
Step 4:
Finally, this process is iterated until all data objects in the input have been processed.
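As a small illustration of step 2, the following Java sketch applies a delimiter-based extraction rule in which the target value lies between two literal HTML tags; the rule, the input string and all names are hypothetical.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class DelimiterRuleExample {
        public static void main(String[] args) {
            // A delimiter-based rule: the attribute value lies between literal tags.
            String html = "<tr><td>Provider</td><td>thomas-bayer.com</td></tr>";
            Pattern rule = Pattern.compile("<td>Provider</td><td>(.*?)</td>");
            Matcher m = rule.matcher(html);
            List<String> record = new ArrayList<>();
            while (m.find()) {
                record.add(m.group(1)); // the extracted attribute value
            }
            System.out.println(record); // prints [thomas-bayer.com]
        }
    }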
23 Pica-Pica Web Service Description Crawler
Pica-Pica is known as a bird species, also called magpie. Here, however, Pica-Pica is a Web Service Description Crawler which is designed to address the quality of Web Services, for example the evaluation of the descriptive quality of the offered Web Services and of how well these Web Services are described in today's Web Service Registries.
231 Needed Libraries of the Pica-Pica Web Service Description Crawler
This version of the Pica-Pica Web Service Description Crawler was written by Anton Caceres and Josef Spillner and is programmed in the Python language. In order to run these scripts to parse the HTML pages, two additional libraries are needed: Beautiful Soup and html5lib.
- Beautiful Soup
It is an HTML/XML parser for the Python language, and it can even turn invalid markup into a parse tree [5]. Moreover, the following three features make it more powerful:
- Bad markup does not choke Beautiful Soup; in fact, it generates a parse tree that makes approximately as much sense as the original document, so that you can obtain the data you want.
- Beautiful Soup provides a toolkit with simple, idiomatic methods for navigating, searching and modifying the parse tree, so you do not need to create a custom parser for every application.
- If the document has already specified an encoding, you can ignore the encoding issue, since Beautiful Soup can convert the documents from Unicode to UTF-8 automatically; otherwise, you just have to specify the encoding of the original documents.
Furthermore, the ways of including Beautiful Soup into an application are the following [5]:
- from BeautifulSoup import BeautifulSoup (for processing HTML)
- from BeautifulSoup import BeautifulStoneSoup (for processing XML)
- import BeautifulSoup (to get everything)
• html5lib
It is a Python package that implements the HTML5 [8] parsing algorithm. In order to gain maximum compatibility with the current major desktop web browsers, this implementation is based on the WHATWG (Web Hypertext Application Technology Working Group) HTML5 specification.
2.3.2 Architecture of the Pica-Pica Web Service Description Crawler
Figure 2-4: Architecture of the Pica-Pica Web Service Description Crawler
Figure 2-4 illustrates the primary architecture of the Pica-Pica Web Service Description Crawler. It includes four fundamental components: the Service Page Grabber component, the WSDL Grabber component, the Property Grabber component, and the WSML Register component.
(1) The Service Page Grabber component takes the URL seed as input and outputs the links of the service pages to the following two components: the WSDL Grabber component and the Property Grabber component.
(2) The WSDL Grabber component is responsible for obtaining the WSDL documents based on the delivered service page links and then checking the validity of these obtained WSDL documents. Finally, only the valid WSDL documents are passed to the WSML Register component for further processing.
(3) The Property Grabber component tries to extract the service properties hosted in the service page, if there are any. After that, all these service properties are saved into an INI file as the information of that service.
(4) The functionality of the WSML Register component is to write appropriate WSML documents by means of the valid WSDL documents delivered from the WSDL Grabber component and the optional INI files delivered from the Property Grabber component, and afterwards to register them in Conqo.
• WSML [9]
WSML stands for Web Service Modeling Language. It provides a framework with different language variants and is therefore often used to describe the different aspects of semantic Web Services according to the conceptual model of WSMO.
• WSMO [10]
WSMO, whose full name is Web Service Modeling Ontology, is dedicated to describing various aspects related to Semantic Web Services based on ontologies. An ontology is a formal, explicit specification of a shared conceptualization. In fact, these ontologies are the linchpins that establish the linkage between the agreement of the communities of users and the defined conceptual semantics of the real world.
• Conqo [11]
Conqo is a discovery framework that considers not only the Quality of Service (QoS) but also context information. It uses a Web Service Repository to manage the service descriptions, which are based on WSML.
2.3.3 Implementation of the Pica-Pica Web Service Description Crawler
This section describes the implementation of the Pica-Pica Web Service Description Crawler in detail.
(1) Firstly, for starting the whole crawling process of the Pica-Pica Web Service Description Crawler, an input is needed as the initial seed. For this crawler, the five Web Service Registries listed below are used; their URL addresses serve as the input seeds for the Pica-Pica Web Service Description Crawler. Moreover, in this version of the Pica-Pica Web Service Description Crawler there is a separate Python script for each Web Service Registry, and the crawling processes of these Web Service Registries' Python scripts are executed one after another.
Biocatalogue: http://www.biocatalogue.com
Ebi: http://www.ebi.ac.uk
Seekda: http://www.seekda.com
Service-Repository: http://www.service-repository.com
Xmethods: http://www.xmethods.net
(2) Then, after being fed with the input seed, the process steps into the next component, the Service Page Grabber. At first, this component tries to read the data from the Web based on the input seed. Then it establishes a parse tree of the read data by means of the functions of the Beautiful Soup library. After that, the Service Page Grabber component starts to look for the service page link of each service published in the Web Service Registry by means of the functions of the html5lib library. Whenever the service page link of a single service is found, it first checks whether this service page link is valid or not. Once the service page link is valid, it passes it to the following two components for further processing: the WSDL Grabber component and the Property Grabber component.
(3) When the WSDL Grabber component receives a service page link from the previous component, it sets out to extract the WSDL link address of that service from the parse tree of the data of this service page. Next, this component starts to download the WSDL document of that service in terms of the WSDL link address; thereafter, the obtained WSDL document is stored on the disk. The process of the WSDL Grabber component is carried on continually until no more service links are passed to it. Certainly, not all grabbed WSDL documents are usable: they may contain bad definitions or a bad namespace URI, be empty documents, or, even worse, not be in XML format at all. Hence, in order to pick these out, this component further analyzes the involved WSDL documents and then puts all valid documents into a "validWSDLs" folder, whereas the invalid documents are put into a folder named "invalidWSDLs" in order to gather statistical information. Finally, only the WSDL documents in the "validWSDLs" folder are passed to the subsequent component.
(4) Moreover, since some Web Service Registries provide additional information about the services, such as availability, service provider, and version, the Property Grabber component sets out to extract this information as the service's properties and thereafter saves these properties into an INI file. However, if there is no additional information available, then there is no need to extract service properties, and thus no INI file is created for that service. Note that in the implementation of this Pica-Pica Web Service Description Crawler, only the Python scripts for the Seekda and Service-Repository Web Service Registries have functions to extract the services' properties, while for the other three Web Service Registries there is no such function.
(5) Furthermore, it is optional to create a report file that contains statistical information about this process, such as the total number of services of one Web Service Registry, the number of services whose WSDL document is invalid, etc.
(6) As has been stated, at this point there is a folder with all valid WSDL documents and possibly some INI files. Therefore, the task of the WSML Register component is now to generate the appropriate WSML documents from these valid WSDL documents and INI files and then register them in Conqo.
2.4 Conclusions of the Existing Strategies
This chapter presented three aspects of the existing strategies: the Service-Finder project, the Information Extraction technique, and the Pica-Pica Web Service Description Crawler.
The task of this master program is to obtain the available Web services and their related information from the Web. This is essentially a procedure of extracting the needed information for each service, such as the service's WSDL document and its properties. Therefore, the Information Extraction technique, which is used to extract information hosted on the Web, can be employed in this master program.
Moreover, Service-Finder is a large project that is not only able to obtain the available Web services and their related information from the Web, but also enriches the crawled data with annotations by means of the Service-Finder ontology and the Service Category ontology and integrates all the information into a coherent semantic model. Furthermore, it also provides capabilities for searching and browsing the data through a user interface and gives users service recommendations. However, for a master program, the Service-Finder project far exceeds the requirements; therefore, it is considered only as a reference for this master program.
Furthermore, since the Pica-Pica Web Service Description Crawler aims only at obtaining the available Web services and their related information, it fulfills the primary task of this master program. Nevertheless, regarding the information about a service, the Pica-Pica Web Service Description Crawler extracts only a few properties, and sometimes even none. Consequently, in order to improve the quality of the service information, as many properties about each service as possible have to be extracted. Hence, chapter 3 presents an extension of this Pica-Pica Web Service Description Crawler.
3 Design and Implementation
In the previous chapter on the state of the art, the already existing techniques and implementations were presented. In the following, the basic principle of the proposed approach, the Deep Web Services Crawler, is introduced; it is based on these previously existing techniques, especially the Pica-Pica Web Service Description Crawler.
3.1 Deep Web Services Crawler Requirements
This section mainly discusses the goals of the Deep Web Service Crawler approach, the system requirements with respect to the approach, and some non-functional requirements.
3.1.1 Basic Requirements for DWSC
The following list contains the basic requirements that should be achieved.
(1) Produce the largest annotated Service Catalogue. Here, the Service Catalogue is a list of the Web services published in the Web Service Registries. It contains not only the WSDL document of each service but also a file with the information about that service. Therefore, for the purpose of producing the largest annotated Service Catalogue, the proposed approach needs to extract as many properties about these Web services as possible. Moreover, it also has to download the WSDL document hosted along with each Web service. That is to say, these properties comprise not only the interesting structured properties, such as the service name and its WSDL link address, but also some other non-functional properties, for example the endpoint and its monitoring information.
(2) Flexible storage: each service has a Service Catalogue entry that contains all its interesting properties. A major question is how to deal with these services' properties, i.e., what kind of scheme will be used to store them. Hence, in order to store them in a flexible way, the proposed approach provides three methods for storage: the first stores them as an XML file, the second stores them in an INI file, and the third uses a database for storage.
3.1.2 System Requirements for DWSC
Generally speaking, the requirements needed for realizing a programming project comprise the following:
1) Operating system: Linux/Unix, Windows (XP, Vista, 2000, etc.)
2) Programming language: C++, Java, Python, C, etc.
3) Programming tool: NetBeans, Eclipse, Visual Studio, and so on
In this master thesis, the scripts of the Deep Web Service Crawler approach are written in the Java programming language. These code scripts have only been tested on the Windows XP and Linux operating systems and have not been tested on other operating systems.
3.1.3 Non-Functional Requirements for DWSC
In this part, several non-functional requirements for the Deep Web Service Crawler approach are presented.
1) Transparency: the process of data exploration and data storage should be done automatically, without the user's intervention. However, at the start, the user should specify the path on the hard disk that will be used to store the outputs of this program.
2) Fault tolerance: during the execution of this program, errors inevitably happen. Therefore, in order to keep the process from being interrupted, the necessary error handling for process recovery must be in place.
3) Completeness: this approach should extract as many of the interesting properties of each Web Service as possible, e.g., endpoint and monitoring information.
In addition, since the Pica-Pica Web Service Crawler has already implemented strategies for the following five URLs, the proposed approach must support no fewer than these five Web Service Registries:
Biocatalogue: http://www.biocatalogue.com
Ebi: http://www.ebi.ac.uk
Seekda: http://www.seekda.com
Service-Repository: http://www.service-repository.com
Xmethods: http://www.xmethods.net
3.2 Deep Web Services Crawler Architecture
In this section, an overview of the high-level architecture of the Deep Web Services Crawler approach is introduced first. Thereafter, four subsections outline each single component and how they play together.
The current components and data flows of the Deep Web Service Crawler can be summarized as depicted in figure 3-1, using the continuous arrows. The crawler first tries to obtain the available service list page links and related service page links by crawling the Web (Web Service Extractor). Then those gathered links are processed in two separate steps: one obtains the service's WSDL document (WSDL Grabber), and the other collects the properties of each service (Property Grabber). Finally, all these data are stored on the storage device (Storage). The whole detailed process of figure 3-1 is illustrated as follows:
Ø Step 1
When the Deep Web Service Crawler starts to run, the File Chooser container requires the user to specify a path on the computer or on any other hard disk. The reason for specifying the path is that the Deep Web Service Crawler program needs a place to store all its outputs.
Ø Step 2
After that, the Web Service Extractor is triggered. It is the main entry point to the specific crawling process. Since the Deep Web Service Crawler program is a procedure that is supposed to crawl for Web Services in given Web Service Registries, the URL addresses of these Web Service Registries are given as the initial seeds for this Web Service Extractor process. However, since the page structures of these Web Service Registries are completely different, there is a dedicated process for each Web Service Registry.
Figure 3-1: Overview of the basic architecture of the Deep Web Services Crawler
Ø Step 3
According to the given seed, two types of links are obtained by the Web Service Extractor component: one is the service list page link, and the other is the service page link. A service list page is a page that contains a list of Web Services and possibly some information about these Web Services, while a service page is a page that contains much more information about a single service. Finally, the extractor forwards these two types of links to the next two components: the Property Grabber and the WSDL Grabber.
Ø Step 4
Then, on the one hand, the Property Grabber component tries to gather the information about the service that is hosted in the service list page and the service page, such as the name of the service, its description, its ranking, etc. Finally, all the information about the service is collected together as the service properties, which are then delivered to the Storage component for further processing.
Ø Step 5
On the other hand, the WSDL Grabber tries to obtain the WSDL link from the service list page or the service page. This is because for some Web Service Registries the WSDL link is hosted in the service list page, as in Biocatalogue, while for other Web Service Registries it is hosted in the service page, as in Xmethods. After obtaining the WSDL link, it is likewise transmitted to the Storage component for further processing.
Ø Step 6
When the service properties and the WSDL link of the service are received by the Storage component, they are stored on the disk. The service properties are stored on the disk in three different ways: as an XML file, as an INI file, or as a record inside a database table. For the WSDL link, the Storage component first tries to download the page content according to the URL address of the WSDL link. If this succeeds, the page content of the service is stored on the disk as a WSDL document.
Ø Step 7
Nevertheless, steps 3 to 6 describe the crawling process for just a single service. Hence, if there is more than one service, or more than one service list page, in those Web Service Registries, the crawling process from step 3 to step 6 is repeated again and again until there are no services or service list pages left in the Web Service Registries.
Ø Step 8
Furthermore, after the crawling process for one Web Service Registry finishes, a file is generated that contains some statistical information about this crawling process, for example: the times when the crawling process of this Web Service Registry started and finished, the total number of Web services in this Web Service Registry, how many services have an empty WSDL document, the average number of service properties in this Web Service Registry, and the average time cost for extracting service properties, getting the WSDL document, and generating the XML file, INI file, etc.
3.2.1 The Function of the Web Service Extractor Component
The Web Service Extractor component is responsible for gathering service list page links and service page links. It pursues a focused crawl of the Web and forwards only service list page and service page links to the subsequent components for analyzing, collecting, and gathering purposes. Therefore, it identifies both the service list page links and the related service page links of these Web Service Registries.
As can be seen from figure 3-2, a crawl for Web Services needs to start from a seed URL. The seed is almost as important as the Web Service Extractor itself, as it strongly influences the part of the Web that needs to be crawled. The seed can or shall contain, e.g., Web pages where these Web Services are published or that talk about Web Services.
Figure 3-2: Overview of the process flow of the Web Service Extractor Component
After being fed with the seed URL, the Web Service Extractor component starts to get the link of the service list page from the initial page of this URL seed. However, this process differs among the five Web Service Registries. The following shows the different situations in these Web Service Registries.
Ø Service-Repository Web Service Registry
In this Web Service Registry, the link of the first service list page is the URL address of its seed, which means some Web Services can already be found on the home page of the Service-Repository Web Service Registry. The "first" here implies that there is more than one service list page link in this Web Service Registry. Therefore, the process of getting service list page links in this registry is carried on continually until no more service list page links exist.
Ø Xmethods Web Service Registry
Although there are Web Services on the home page of the Xmethods Web Service Registry, they are only a small subset of the services in this Web Service Registry. Moreover, in the Xmethods Web Service Registry there is a single page containing all Web Services; therefore, the service list page link of that page has to be obtained.
Ø Ebi Web Service Registry
The situation in the Ebi Web Service Registry is somewhat like that in the Xmethods Web Service Registry. That is to say, there is also one page that contains all the Web Services of this Web Service Registry. However, this page is not the initial page of the input seed; therefore, more than one operation step is needed to get the service list page link of that page.
Ø Seekda Web Service Registry
In the Seekda Web Service Registry, the Web Services are not contained in the initial page of the input seed. The service list page link can be obtained after several additional operation steps. However, there is a problem with getting the service list page links in this registry: simply put, if there is more than one page containing Web Services, then, for some unknown reason, the links of the remaining service list pages cannot be obtained. In other words, only the link of the first service list page can be obtained.
Ø Biocatalogue Web Service Registry
The process of getting service list pages in the Biocatalogue Web Service Registry is almost the same as in the Seekda Web Service Registry. The only difference is that in the Biocatalogue Web Service Registry all service list page links can be obtained if there is more than one service list page.
Then, after getting the link of a service list page, the Web Service Extractor begins to get the link of the service page for each service listed in the service list page. The reason why it can do this is that there is an internal link for every service that addresses the service page. It is worth noting that once a service page link is obtained, this component immediately forwards the service page link and the service list page link to the subsequent two components for further processing. Nevertheless, the process of obtaining service page links is carried out continuously until all services listed in that service list page are crawled. Analogously, the process of getting service list pages is also carried out continuously until no more service list pages exist.
3.2.1.1 Features of the Web Service Extractor Component
The main features are described in the following paragraphs.
1) Obtain service list page links
A central task is to obtain the corresponding service list page links. A service list page link is a URL address that leads to a public list of Web Services with just some basic information about these Web services, like the name of the service, an internal URL that links to another page containing the detailed information about that service, and sometimes the link address of the WSDL document.
2) Obtain service page links
Once service list page links are found, a crucial aspect is to extract the internal link of each service to aid the task of service information discovery. Therefore, it is the task of the Web Service Extractor to harvest the HTML page content of the service list page, so that the service page links, which contain much more detailed information about the single Web service, can be obtained.
3.2.1.2 Input of the Web Service Extractor Component
This component depends on specific input seeds. The only input required for this component is a seed URL. This URL seed will be one of the URLs listed in section 3.1.3.
3.2.1.3 Output of the Web Service Extractor Component
The component will produce two kinds of service-related page links from the Web:
• Service list page links
• Service page links
3.2.1.4 Demonstration for the Web Service Extractor
In order to give a comprehensive understanding of the process of the Web Service Extractor component, the following figures provide an explanation. Though there are five URL addresses in this section, only the URL address of the Service-Repository is shown as an example.
1) The input seed is the initial URL address of the Service-Repository, which is "http://www.service-repository.com".
2) As already stated in section 3.2.1, the first service list link of this Web Service Registry is its input seed, "http://www.service-repository.com". Figure 3-3 shows the corresponding service list page of that link.
Figure 3-3: Service list page of the Service-Repository
Figure 3-4: Original source code of the internal link for the Web service "BLZService"
Figure 3-5: Code overview of getting the service page link in the Service-Repository
Figure 3-6: Service page of the Web service "BLZService"
3) Now, because the service list page link is already known, the next step is acquiring the service page links of the services listed in the service list page. The text in the red box of figure 3-4 shows the internal link of the Web service "BLZService". However, this is not the complete link of the service page: it has to be prefixed with the initial URL address of the Service-Repository Web Service Registry, "http://www.service-repository.com". The code for getting the service page link of a Web service in the Service-Repository is shown in figure 3-5. Therefore, the final link of this service page is "http://www.service-repository.com/service/overview/-210897616". Figure 3-6 is the corresponding service page of that link.
4) Afterwards, the two links, the service list page link and the service page link, which are gathered by the Web Service Extractor component, are immediately forwarded to the next two components, the WSDL Grabber component and the Property Grabber component.
3.2.2 The Function of the WSDL Grabber Component
The WSDL Grabber component acquires the WSDL link hosted on the Web based on the service list page link or the service page link. The whole process flow is illustrated in figure 3-7.
Figure 3-7: Overview of the process flow of the WSDL Grabber Component
When the WSDL Grabber component receives the inputs delivered from the previous component, it starts to get the WSDL link of the service based on these inputs. However, although the inputs of the WSDL Grabber component are the links of the service page and the service list page, only one of them contains the WSDL link of the corresponding service; that is to say, the WSDL link exists either in the service page or in the service list page. The reason why both links need to be delivered to this component is that exactly one of the five Web Service Registries, namely Biocatalogue, hosts the WSDL link in the service list page, while for the other four Web Service Registries the WSDL link is hosted in the service page. Therefore, the WSDL links of these four Web Service Registries are obtained in terms of the service page link, and for the Biocatalogue Web Service Registry the WSDL link is obtained through the service list page link. However, there is a problem with getting the WSDL links of the Biocatalogue Web Service Registry: in short, some of the Web services listed in the service list pages of the Biocatalogue Web Service Registry have no WSDL link; in other words, these services have no WSDL document. In a situation like this, the value of the WSDL link of such a Web service is assigned a "NULL" value. For the Web Services in the other four Web Service Registries, however, the WSDL link always exists in the service page. Eventually, whenever the WSDL Grabber component extracts the WSDL link of a single Web service, the link is immediately forwarded to the Storage component for downloading the WSDL document at once.
3.2.2.1 Features of the WSDL Grabber Component
The WSDL Grabber component has to provide the following feature:
• Obtain WSDL links
The WSDL link is the direct way to get to the page that contains the contents of the WSDL document. It is actually a URL address, but at the end of this URL address there is something like "wsdl" or "WSDL" to indicate that this address leads to the page of a WSDL document.
3.2.2.2 Input of the WSDL Grabber Component
This component requires the following input data:
• Service list page link
• Service page link
3.2.2.3 Output of the WSDL Grabber Component
The component will only produce the following output data:
• The URL address of the WSDL link of each service
3.2.2.4 Demonstration for the WSDL Grabber Component
This section presents a list of figures in order to give a comprehensive understanding of the process of the WSDL Grabber component. It uses the same Web service of the Service-Repository as an example, too.
1) The input for this WSDL Grabber component is the link of the service page obtained from the Web Service Extractor component. The address of this link is "http://www.service-repository.com/service/overview/-210897616".
Figure 3-8: WSDL link of the Web service "BLZService" in the service page
2) Figure 3-9 shows the corresponding original source code of the WSDL link highlighted in figure 3-8.
Figure 3-9: Original source code of the WSDL link for the Web service "BLZService"
3) Figure 3-10 and figure 3-11 show the code used to extract the WSDL link shown in figure 3-9. However, figure 3-10 is the particular code for the Service-Repository Web Service Registry only; for the other four Web Service Registries it is different. The function "getServiceRepositoryWSDLLink" first gets a list of nodes that have the HTML tag name "b". Then it checks all of these nodes one by one to see whether the text value of a node is "WSDL". If a node fulfills this condition, then the attribute value of its sibling, which is an "a" element here, is extracted as the value of the WSDL link of this Web service. (A simplified sketch of this logic is given below the figure captions.)
Figure 3-10: Code overview of the "getServiceRepositoryWSDLLink" function
Figure 3-11: Code overview of the "oneParameter" function
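As an illustration only, the following Java sketch reproduces the logic just described using the jsoup HTML parser; the thesis code in figures 3-10 and 3-11 relies on its own helper functions instead, so the class and method names here are assumptions.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Find a <b> node whose text is "WSDL" and take the link from its <a> sibling.
public class ServiceRepositoryWsdlLink {
    public static String getWsdlLink(String servicePageUrl) throws Exception {
        Document doc = Jsoup.connect(servicePageUrl).get();
        for (Element b : doc.select("b")) {             // all nodes with tag name "b"
            if ("WSDL".equals(b.text())) {
                Element sibling = b.nextElementSibling();
                if (sibling != null && "a".equals(sibling.tagName())) {
                    return sibling.attr("abs:href");    // absolute WSDL link address
                }
            }
        }
        return null;                                    // no "WSDL" label found
    }
}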
4) Finally, after applying the functions shown in figures 3-10 and 3-11, the WSDL link of the Web service "BLZService" can be obtained. Its value is "http://services.unitedplanet.de/blz/BlzService.asmx?WSDL".
3.2.3 The Function of the Property Grabber Component
The Property Grabber component is a module used to extract and gather all the Web service information hosted on the Web, which is in fact the information shown in the service list page and the service page. In the end, all the obtained Web service information is collected together as the service properties, which are delivered to the Storage component for storing. The detailed process flow of the Property Grabber component is illustrated in figure 3-12.
The inputs of the Property Grabber component are the same as the inputs of the WSDL Grabber component. However, there is still a small difference between them with respect to the seeds. As already mentioned in section 3.2.2, for the WSDL Grabber component only one of the inputs is sufficient to get the WSDL link, while the Property Grabber component needs both of them as seeds. Once the Property Grabber component receives the needed inputs, it starts to extract the information of that single Web service.
Figure 3-12: Overview of the process flow of the Property Grabber Component
After the Property Grabber component receives the inputs, it starts to extract the service information of the Web service. Generally speaking, the service information consists of four aspects: structured information, endpoint information, monitoring information, and Whois information.
(1) Structured Information
The structured information can be obtained by extracting the information hosted in the service page and the service list page. It is the basic descriptive information about the service, such as the name of the service, the URL address through which the WSDL document can be obtained, the description introducing the service, the provider who offers the service, its rating, and the server that hosts this service. However, the elements constituting this structured information are diverse across the different Web Service Registries. For example, the rating information of a Web service exists in the Service-Repository Web Service Registry, while the Xmethods Web Service Registry does not have this information. In addition, even for Web services in the same Web Service Registry, some elements of the structured information may not exist. For instance, one service in a Web Service Registry may have a description, while another service in the same Web Service Registry does not. Tables 3-1 to 3-5 show the structured information that should be extracted for these five Web Service Registries. Moreover, if the style of a Biocatalogue Web service is SOAP, there is some additional information describing the SOAP operations of this service; if it is REST, this additional information describes the REST operations. They should also be considered part of the structured information. Table 3-6 and table 3-7 list the information for these two different kinds of operations.
Service Name | WSDL Link | WSDL Version
Provider | Server | Rating
Homepage | Owner Homepage | Description
Table 3-1: Structured Information of the Service-Repository Web Service Registry
Service Name | WSDL Link | Provider
Service Style | Homepage | Implementation Language
Description | User Description | Contributed Client Name
Type of this Client | Publisher for this Client | Used Toolkit of this Client
Used Language of this Client | Used Operating System of this Client
Table 3-2: Structured Information of the Xmethods Web Service Registry
Service Name | WSDL Link | Server
Provider | Provider's Country | Service Style
Rating | Description | User Description
Service Tags | Documentation (within WSDL)
Table 3-3: Structured Information of the Seekda Web Service Registry
Service Name | WSDL Link | Port Name
Service URL Address | Implementation Class
Table 3-4: Structured Information of the Ebi Web Service Registry
Service Name | WSDL Link | Style
Provider | Provider's Country | View Times
Favorite Times | Submitter | Service Tags
Total Annotation | Provider Annotation | Member Annotation
Registry Annotation | Base URL | SOAP Lab Server Base URL
Description | User Description | Category
Table 3-5: Structured Information of the Biocatalogue Web Service Registry
SOAP Operation Name | Inputs and Outputs | Operation Description
Operation Tags | Part of Which Service
Table 3-6: SOAP Operation Information of the Biocatalogue Web Service Registry
REST Operation Name | Service Tags | Used Template
Operation Description | Part of Which Service | Part of Which Endpoint Group
Table 3-7: REST Operation Information of the Biocatalogue Web Service Registry
(2) Endpoint Information
Endpoint information is the port information of a Web service, which can be extracted only from the service page. However, different Web Service Registries structure the endpoint information of a Web service differently; hence some elements of the endpoint information are very diverse. One thing needs attention: the Ebi Web Service Registry has no endpoint information for any of the Web services published in this registry. Moreover, although the Web services within one Web Service Registry share the same structure of the endpoint information, some elements of the endpoint information may be missing or empty; furthermore, the Web Service Registries may even have no endpoint information at all for some of the Web services they publish. Nevertheless, whenever there is endpoint information for a Web service, there is at least one element, namely the URL address of the endpoint. Table 3-8 shows the endpoint information that should be extracted for these five Web Service Registries.
Web Service Registry Name | Elements of the Endpoint Information
Service-Repository | Endpoint Name, Endpoint URL, Endpoint Critical, Endpoint Type, Bound Endpoint
Xmethods | Endpoint URL, Publisher of this Endpoint, Contact Email of this Publisher, Implementation Language of this Endpoint
Seekda | Endpoint URL
Biocatalogue | Endpoint Name, Endpoint URL
Table 3-8: Endpoint Information of these five Web Service Registries
Web Service Registry Name | Elements of the Monitoring Information
Service-Repository | Service Availability, Number of Downs, Total Uptime, Total Downtime, MTTR, MTBF, RTT Max of Endpoint, RTT Min of Endpoint, RTT Average of Endpoint, Ping Count of Endpoint
Seekda | Service Availability, Begin Time of Monitoring
Biocatalogue | Monitored Status of Endpoint, Overall Status of Service
Table 3-9: Monitoring Information of these five Web Service Registries
(3) Monitoring Information
Monitoring information is the measured statistical information about a Web service. It is worth noting that the Ebi and Xmethods Web Service Registries do not have monitoring information for any of the Web services they publish, while in the other three Web Service Registries only a few Web services may lack this information. Table 3-9 displays the monitoring information of these three Web Service Registries.
(4) Whois Information
Whois information is not extracted from the information hosted in the service page and the service list page. It is the descriptive information about the service domain, which can be gained by means of the address of the WSDL link. Because of that, the process of getting the Whois information starts by obtaining the service domain first. The final value of the service domain must not contain strings like "http", "https", "www", etc.; it must be the registrable part under the top-level domain (a small sketch of this domain derivation follows table 3-10). After that, the service domain database is queried by sending the value of the service domain to a Whois client, which is simply a Web site on the Internet, for example "http://www.whois365.com/cn/domain". If information about that service domain exists, a list of its information is returned as the output. However, the structure of the returned information differs from service domain to service domain; therefore, the most challenging part is that the extraction process has to deal with each different form of the returned information. Table 3-10 gives the Whois information that needs to be extracted for all five Web Service Registries.
Service Domain URL | Domain Name | Domain Type
Domain Address | Domain Description | State
Postal Code | City | Country
Country Code | Phone | Fax
Email | Organization | Established Time
Table 3-10: Whois Information for these five Web Service Registries
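The following minimal Java sketch shows one way to derive the service domain from a WSDL link as described above; it only strips the protocol, port, path, and a leading "www.", and a real implementation would need extra handling for deeper subdomains.

import java.net.URL;

// Derive the registrable service domain (e.g. "thomas-bayer.com") from a WSDL link.
public class ServiceDomain {
    public static String fromWsdlLink(String wsdlLink) throws Exception {
        String host = new URL(wsdlLink).getHost();  // drops protocol, port and path
        return host.startsWith("www.") ? host.substring(4) : host;
    }

    public static void main(String[] args) throws Exception {
        // Prints "thomas-bayer.com"
        System.out.println(fromWsdlLink(
                "http://www.thomas-bayer.com/axis2/services/BLZService?wsdl"));
    }
}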
Finally, the information of all four aspects is collected together and then delivered to the Storage component for further storage processing.
3.2.3.1 Features of the Property Grabber Component
The Property Grabber component has to provide the following features:
• Obtain basic information
Generally speaking, the more information one has about a Web service, the better one can judge how good this Web service is. Hence, the Property Grabber component has to extract all the basic information hosted in the service list page and the service page. This basic information comprises the structured information, the endpoint information, and the monitoring information.
• Obtain Whois information
For the same reason, namely that more information about a Web service allows a better assessment of its quality, it is necessary to extract as much information about each Web service as possible. Therefore, in addition to the basic information, the Property Grabber component also obtains some additional information called Whois information, such as the type of the domain, the name of the person in charge, the postal code of the domain, the city, phone, fax, detailed address, email, etc.
3.2.3.2 Input of the Property Grabber Component
This component requires the following input data:
• Service list page link
• Service page link
3.2.3.3 Output of the Property Grabber Component
The component will produce the following output data:
• Structured information of each service
• Endpoint information of each service, if it exists
• Monitoring information of the service and endpoint, if it exists
• Whois information of the service domain
All this information is collected together as the properties of each service. Thereafter, the collected properties are sent to the Storage component.
3.2.3.4 Demonstration for the Property Grabber Component
The pictures from figure 3-13 to figure 3-16 depict the fundamental procedure of the Property Grabber component. To simplify the explanation, the example shown in this section uses the same Web service as before.
1) The inputs of the Property Grabber component are the links of the service list page and the service page received from the Web Service Extractor component. These links are "http://www.service-repository.com" and "http://www.service-repository.com/service/overview/-210897616".
2) When the "GetProperty" function is triggered, it calls its four sub-functions shown in figure 3-12.
Figure 3-13: Structured properties of the service "BLZService" in the service list page
Figure 3-14: Structured properties of the service "BLZService" in the service page
3) Firstly, the "getStructuredProperty" function tries to extract the structured information displayed in the red boxes of figure 3-13 and figure 3-14. However, several elements of the structured information have the same content, like the contents of the description shown in the service page and the service list page. Hence, in order to save time in the extraction process and space in the storing process, elements with the same content are only extracted once. Moreover, the rating information requires a transformation from non-descriptive to descriptive text, because its content consists of several star images. The final results of the extracted structured information of this Web service are shown in table 3-11. Because there is no descriptive information for the provider, homepage, and owner homepage, their values are assigned as "NULL".
Service Name | BLZService
WSDL Link | http://www.thomas-bayer.com/axis2/services/BLZService?wsdl
WSDL Version | 0
Server | Apache-Coyote/1.1
Description | BLZService
Rating | Four Stars and a Half
Provider | NULL
Homepage | NULL
Owner Homepage | NULL
Table 3-11: Extracted Structured Information of the Web Service "BLZService"
4) Secondly, the "getEndpointProperty" function extracts the endpoint information displayed in the red box of figure 3-15. Though there is more than one endpoint record in the list, only one of them is extracted as the endpoint information. That is because this master program is intended to extract as much information as possible, but this information should not contain redundant entries. Therefore, only one record is extracted as the endpoint information, even if there is more than one endpoint record. Table 3-12 shows the final results of the endpoint information of this Web service.
Figure 3-15: Endpoint information of the Web service "BLZService" in the service page
Endpoint Name | BLZServiceSOAP12port_http
Endpoint URL | http://www.thomas-bayer.com:80/axis2/services/BLZService
Endpoint Critical | True
Endpoint Type | production
Bound Endpoint | BLZServiceSOAP12Binding
Table 3-12: Extracted Endpoint Information of the Web service "BLZService"
5) Then it is time to extract the monitoring information by invoking the function "getMonitoringProperty". Figure 3-16 displays two types of monitoring properties: the red box above contains the monitoring information about the Web service, and the red box below lists the monitoring information of its endpoints. As mentioned before, only one endpoint statistics record is extracted. Besides, as can be seen in figure 3-16, there are two availability values; they both represent the availability of this Web service, just like the availability shown in figure 3-14. Therefore, one of these availability values is sufficient. Table 3-13 shows the final results of this extraction process.
Figure 3-16: Monitoring information of the service "BLZService" in the service page
Service Availability | 100
Number of Downs | 0
Total Uptime | 1 day, 19 hours, 19 minutes
Total Downtime | 0 seconds
MTBF | 1 day, 19 hours, 19 minutes
MTTR | 0 seconds
RTT Max of Endpoint | 141 ms
RTT Min of Endpoint | 0 ms
RTT Average of Endpoint | 57.7 ms
Ping Count of Endpoint | 112
Table 3-13: Extracted Monitoring Information of the Web service "BLZService"
6) Next, in order to extract the Whois properties, the "getWhoisProperty" function first has to derive the service domain from the WSDL link. For this Web service, the derived service domain is "thomas-bayer.com". Then it sends this service domain as input to the Whois client for the querying process, which returns a list of information about that service domain; see figure 3-17. Table 3-14 shows the extracted Whois information.
Figure 3-17: Whois information of the service domain "thomas-bayer.com"
Service Domain URL | thomas-bayer.com
Domain Name | Thomas Bayer
Domain Type | NULL
Domain Address | Moltkestr. 40
Domain Description | NULL
State | NULL
Postal Code | 54173
City | Bonn
Country | NULL
Country Code | DE
Phone | +4922855525760
Fax | NULL
Email | info@predic8.de
Organization | predic8 GmbH
Established Time | NULL
Table 3-14: Extracted Whois Information of the service domain "thomas-bayer.com"
7) Finally, the information of all four aspects is collected together as the service properties, and these service properties are then forwarded to the Storage component.
3.2.4 The Function of the Storage Component
The WSDL link from the WSDL Grabber component is used by the Storage component to download the WSDL document from the Web and store it on the disk. In addition, the service properties from the Property Grabber component are also directly stored on the disk, in three different manners, by the Storage component. Figure 3-18 illustrates the process flow of the Storage component.
When the Storage component receives the needed inputs, the mediator function "Storager" of the Storage component is triggered. It then transforms the service properties into three different output formats and stores them on the disk; these output formats are an XML file, an INI file, and database records. Besides, it also tries to download the WSDL document from the URL address of the WSDL link and then stores the obtained WSDL document on the disk, too. This "Storager" function is composed of four sub-functions, "getWSDL", "generateXML", "generateDatabase", and "generateINI"; each sub-function is in charge of one aspect of the storage tasks.
Figure 3-18: Overview of the process flow of the Storage Component
(1) The "getWSDL" sub-function
The task of the "getWSDL" sub-function is to download the WSDL document and then store it on the disk. First of all, it has to get the content of the WSDL document. This procedure is done as follows. First, the "getWSDL" sub-function checks whether the value of the received WSDL link equals "NULL" or not. As already presented in section 3.2.2, if a Web service does not have a WSDL link, the value of its WSDL link is assigned "NULL". In a case like that, it creates a WSDL document whose name is the service name appended with the mark "No WSDL Document"; obviously, this document does not contain any content, it is an empty document. If the service does have a WSDL link, this sub-function tries to connect to the Internet based on the URL address of the WSDL link. Once it succeeds, all the contents hosted on the Web are downloaded, stored on the disk, and named with the name of the service only. Otherwise, it creates a WSDL document that prefixes "Bad" before the service name.
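A condensed sketch of this behaviour is given below; it is an assumption-based reconstruction, not the code of figure 3-19, and it omits the "SecurityInt", "statistic", and "log" parameters discussed in section 3.2.4.4.

import java.io.*;
import java.net.URL;

// Download the WSDL document, or create the marker documents described above.
public static void getWSDL(String name, String linkStr, String path) {
    try {
        if (linkStr == null || linkStr.equals("NULL")) {
            // Service without WSDL link: create an empty marker document.
            new File(path, name + " No WSDL Document").createNewFile();
            return;
        }
        BufferedReader in = new BufferedReader(
                new InputStreamReader(new URL(linkStr).openStream()));
        PrintWriter out = new PrintWriter(new FileWriter(new File(path, name)));
        String line;
        while ((line = in.readLine()) != null) {
            out.println(line);                    // store the downloaded content
        }
        in.close();
        out.close();
    } catch (Exception e) {
        try {
            // WSDL link unreachable: create a document prefixed with "Bad".
            new File(path, "Bad" + name).createNewFile();
        } catch (IOException ignored) { }
    }
}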
(2) The "generateXML" sub-function
The "generateXML" sub-function takes the service properties as input, transforms them into an XML file, and stores it on the disk with a name consisting of the service name plus ".xml". XML stands for eXtensible Markup Language, a markup language designed to transport and store data. The first line of an XML file is the XML declaration, which defines the XML version and the encoding used. For example, "<?xml version="1.0" encoding="UTF-8"?>" means this is an XML file whose version is 1.0 and whose encoding is the 8-bit Unicode Transformation Format character set. Besides, an XML file contains XML elements, each reaching from the element's start tag to the element's end tag. Moreover, an XML element can also contain other elements, simple text, or a mixture of both. However, an XML file must contain a root element as the parent of all other elements.
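The sketch below illustrates such an XML transformation with the standard Java DOM API; the element names and the use of a map of properties are assumptions for illustration (the thesis code in figure 3-20 works on a Vector of "PropertyStruct" objects), and the keys must be valid XML element names.

import java.io.File;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Wrap each property name/value pair in an element under one root element.
public static void generateXML(String name, Map<String, String> props,
                               String path) throws Exception {
    Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder().newDocument();
    Element root = doc.createElement("service");   // the mandatory root element
    doc.appendChild(root);
    for (Map.Entry<String, String> p : props.entrySet()) {
        Element e = doc.createElement(p.getKey()); // key must be a valid XML name
        e.setTextContent(p.getValue());
        root.appendChild(e);
    }
    Transformer t = TransformerFactory.newInstance().newTransformer();
    t.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
    t.transform(new DOMSource(doc),
            new StreamResult(new File(path, name + ".xml")));
}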
(3) The "generateINI" sub-function
The "generateINI" sub-function also takes the service's properties as input, but it transforms them into an INI file and then stores it on the disk with a name consisting of the service's name plus ".ini". "ini" stands for initialization; the INI file format is a de facto standard for configuration files. INI files are simple text files with a basic structure. Generally speaking, an INI file contains three different kinds of parts: sections, parameters, and comments. The parameter is the basic element contained in an INI file. Its format is a key-value pair, also called a name-value pair; this pair is delimited by an equals sign "=", and the key or name always appears to the left of the equals sign. A section is like a room that groups all its parameters together. It always appears on a single line within a pair of square brackets "[]", and sections may not be nested. A comment is descriptive text that begins with a semicolon ";"; everything between the semicolon and the end of the line is ignored.
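A minimal sketch of such an INI transformation follows; the section name and the comment line are illustrative choices, not necessarily those produced by the code in figure 3-21.

import java.io.File;
import java.io.FileWriter;
import java.io.PrintWriter;
import java.util.Map;

// One [section] groups all key=value parameters; the first line is a comment.
public static void generateINI(String name, Map<String, String> props,
                               String path) throws Exception {
    PrintWriter out = new PrintWriter(
            new FileWriter(new File(path, name + ".ini")));
    out.println("; properties of service " + name);   // comment
    out.println("[" + name + "]");                    // section header
    for (Map.Entry<String, String> p : props.entrySet()) {
        out.println(p.getKey() + "=" + p.getValue()); // parameter (key-value pair)
    }
    out.close();
}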
(4) The "generateDatabase" sub-function
The inputs of the "generateDatabase" sub-function are the same as those of the previous two sub-functions. Instead of transforming them into a file like XML or INI, the "generateDatabase" sub-function turns them into data in the database by using SQL statements. SQL stands for Structured Query Language, a database language designed for accessing and manipulating data in databases. Most of the actions performed on a database are done with SQL statements, and the primary statements of SQL include insert into, delete, update, select, create, alter, and drop. Therefore, for the purpose of transforming the service properties into database data, this sub-function first has to create a database using the "create database" statement. Then it creates a table to store the data. A table is a collection of related data entries and consists of columns and rows. Since the data of all five Web Service Registries is not very large, one database table is enough for storing these service properties. Because of that, the field names of the service properties in the columns must be uniform and well-defined for all five Web Service Registries. Afterwards, the service properties of each single service can be put into the table as one record with the "insert into" statement of SQL.
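The following JDBC sketch shows the shape of these SQL statements; the connection URL, table name, and the two example columns are placeholders, and the real table has one TEXT column per uniform property field.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.Statement;

// Create one table with uniform TEXT columns and insert one record per service.
public static void storeRecord(String serviceName, String wsdlLink)
        throws Exception {
    Connection con = DriverManager.getConnection(
            "jdbc:mysql://localhost/dwsc", "user", "password");
    Statement st = con.createStatement();
    st.executeUpdate("CREATE TABLE IF NOT EXISTS services ("
            + "service_name TEXT, wsdl_link TEXT)");
    st.close();
    PreparedStatement ps = con.prepareStatement(
            "INSERT INTO services (service_name, wsdl_link) VALUES (?, ?)");
    ps.setString(1, serviceName);
    ps.setString(2, wsdlLink);
    ps.executeUpdate();
    ps.close();
    con.close();
}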
3.2.4.1 Features of the Storage Component
The Storage component has to provide the following features:
• Generate different output formats
The final result of this master program is to store the information about the services on the disk for future work. This Storage component provides three different formats for storing the service properties of the services published in the Web Service Registries. This makes the storage of the services very flexible and also long-lived.
• Obtain the WSDL document
The important task of this component is to check whether the WSDL document can be obtained from the WSDL link, because the WSDL document plays a decisive role in determining the quality of the service. This Storage component provides the ability to deal with the different situations in the process of obtaining the WSDL document.
3.2.4.2 Input of the Storage Component
This component requires the following input data:
• The WSDL link of each service
• The property information of each service
3.2.4.3 Output of the Storage Component
The component will produce the following output data:
• The WSDL document of each service
• An XML document, an INI file, and tables in a database
3.2.4.4 Demonstration for the Storage Component
The following figures describe the fundamental implementation code of this Storage component. The detailed depiction is given below.
1) As can be seen from figures 3-19 to 3-21, there are several common elements among the implementation codes. The first common element concerns the parameters defined in each of these sub-functions, namely "path" and "SecurityInt". The parameter "path" is the absolute path on the computer disk; it is used for the storing procedure and has already been specified by the user at the beginning of the whole program. The parameter "SecurityInt" is an increasing integer that is used as part of the name of the service; this prevents services that have the same name from overriding each other on the disk. The content of the red marks in the code of these figures is the second common element; its function is to create a file or document for the service with the corresponding parameters.
2) Figure 3-19 displays the implementation code of the "getWSDL" sub-function, which is designed to get the WSDL document of a service based on its WSDL link. The parameter "name" is the name of the service, which is used as the name of the WSDL document. The parameter "linkStr" is the actual WSDL link, the most important parameter of this sub-function. The other two parameters, "statistic" and "log", are the objects of the text files called "Statistic Information" and "Log Information", respectively. The "Statistic Information" text file is used to record the statistical data of the services in each Web Service Registry, such as the overall number of properties for that Web Service Registry, the overall number of services, the number of services that have no WSDL link, the number of services whose WSDL document contains no content, the number of services whose WSDL link is not available, etc. The "Log Information" text file records the results of the process steps and the problems encountered, for example which service is being crawled now, which Web Service Registry it belongs to, the reason why the WSDL document of a service cannot be obtained, the reason why a service is unreachable, and so on.
Figure 3-19: Implementation code for getting the WSDL document
3) Figure 3-20 and figure 3-21 show the code that turns the service properties into the XML file and the INI file and stores those two files on the disk thereafter. The parameter "vec" is a Vector of the "PropertyStruct" data type. This "PropertyStruct" data type is an object of a class consisting of two variables: name and value.
Figure 3-20: Implementation code for generating the XML file
Figure 3-21: Implementation code for generating the INI file
4) The code in figure 3-22 and figure 3-23 shows the process that turns the service properties of all services in these five Web Service Registries into records of the database. To do so, a database has to be created first. The name of the database is arbitrary as long as it conforms to the naming rules of the database; the same holds for the name of the table. Figure 3-22 displays the procedure of creating a table in the database. Since it is hard to decide on the length of each service property, the data types of all service properties are set to "Text". Figure 3-23 is the code for inserting the service properties into the table with the "update" statement.
Figure 3-22: Implementation code for creating the table in the database
Figure 3-23: Implementation code for generating the table records
33 Multithreaded Programming for DWSC
Multithreaded programming is a built-in characteristic of the Java language. A
multithreaded program contains two or more separate parts that can execute concurrently; each
such part is called a thread. The use of multithreading makes it possible to create programs that
use the system resources efficiently, for example by making maximum use of the CPU, because
the idle time of the CPU can be kept to a minimum.
In this master program there are five Web Service Registries whose services need to be crawled.
Moreover, the number of services published in each Web Service Registry is quite
different, so that the running time per Web Service Registry also differs. Without multithreading,
a Web Service Registry that owns fewer services would have to wait until
another Web Service Registry with many more services finishes. Therefore, in order to reduce the
waiting time of the other Web Service Registries and to maximize the use of the system resources, it is
necessary to apply multithreaded programming to this master program. That is to say, this master
program creates a thread for each Web Service Registry, and these threads are executed
independently, as the following sketch illustrates.
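This thread-per-registry scheme could look roughly as follows in Java; the crawl method body is only a placeholder for the real per-registry crawling procedure, and all names are hypothetical.

public class RegistryCrawler {

    public static void main(String[] args) throws InterruptedException {
        String[] registries = {"Service Repository", "Ebi", "Xmethods",
                               "Seekda", "Biocatalogue"};
        Thread[] threads = new Thread[registries.length];
        for (int i = 0; i < registries.length; i++) {
            final String registry = registries[i];
            // One thread per Web Service Registry, all running independently.
            threads[i] = new Thread(() -> crawl(registry));
            threads[i].start();
        }
        for (Thread t : threads) {
            t.join(); // wait until every registry has been crawled
        }
    }

    static void crawl(String registry) {
        System.out.println("Crawling " + registry + " ..."); // placeholder
    }
}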
3.4 Sleep Time Configuration for Web Service Registries
Because this master program downloads the WSDL documents and extracts the service information
of the Web services published in a Web Service Registry, it inevitably affects the performance of
that Web Service Registry. In addition, in order not to exceed their throughput capability, these
Web Service Registries will surely restrict the rate of access. Because of that, unknown errors
sometimes occur while this master program is executing: for instance, the program keeps
halting at one point without getting any more WSDL documents and service information, the
WSDL documents of some services of some Web Service Registries cannot be obtained, or some
service information is missing. Therefore, in order to obtain as many as possible of the Web
services published in these five Web Service Registries without affecting their throughput, the
accessing rate has to be configured for each service of all Web Service Registries.
In consequence, before going into the essential procedure for each single service of these Web Service
Registries, this master program calls the system's built-in function "sleep(long
milliseconds)". It is a public static function which causes the currently executing thread to sleep for
the specified number of milliseconds; in other words, it temporarily ceases execution for a while.
The following table shows the time interval of the sleep function for each Web Service Registry.
Web Service Registry Name Time Interval (milliseconds)
Service Repository 8000
Ebi 3000
Xmethods 10000
Seekda 20000
Biocatalogue 10000
Table 3-15 Sleep Time of these five Web Service Registries
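A hedged sketch of how such a delay can be inserted before the essential procedure of each single service is given below; the helper names are hypothetical, while the interval values are those of table 3-15.

public class Throttle {

    // Sleep interval in milliseconds per Web Service Registry (table 3-15).
    static long intervalFor(String registry) {
        switch (registry) {
            case "Service Repository": return 8000;
            case "Ebi":                return 3000;
            case "Xmethods":           return 10000;
            case "Seekda":             return 20000;
            case "Biocatalogue":       return 10000;
            default:                   return 5000; // illustrative fallback
        }
    }

    // Called before the essential procedure for each single service.
    static void pause(String registry) {
        try {
            Thread.sleep(intervalFor(registry));
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}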
4 Experimental Results and Analysis
This chapter presents the quantitative experimental results of the prototype described in
chapter 3, together with the analysis and explanation of these results. In order
to obtain reasonably accurate results, the experiments were carried out more than five times; all
data displayed in the following tables and charts are averages over these runs.
4.1 Statistic Information for Different Web Service Registries
This section discusses the service amount statistics of the Web services published in
these five Web Service Registries. It covers the overall number of Web services published in
each Web Service Registry and the number of unavailable Web services, which have been
archived because they may not be active anymore or are close to being inactive. Table 4-1 shows
the service amount statistics of these five Web Service Registries.
Web Service Registry Name   Service Repository   Ebi   Xmethods   Seekda   Biocatalogue
Overall Services                    57           289      382       853        2567
Unavailable Services                 0             0        0         0         125
Table 4-1 Service amount statistic of these five Web Service Registries
Nevertheless, in order to give an intuitive view of the service amount statistics of these five Web
Service Registries, a bar chart derived from table 4-1 is shown in figure 4-1. As
can be seen from the bar chart, on the one hand there is an ascending trend in the overall number
of Web services from the Service Repository Web Service Registry to the Biocatalogue Web Service
Registry. The Biocatalogue Web Service Registry in particular owns the largest number of Web
services, which indicates that it has a much more powerful ability to
provide Web services to users, because it contains far more services than the other four Web
Service Registries. On the other hand, there are no unavailable services in any Web Service
Registry except for the Biocatalogue Web Service Registry; that is to say, only the Biocatalogue Web
Service Registry contains Web services that cannot be used by users. To some degree this is
wasteful, since these services cannot be used anymore yet still consume network
resources on the Web. Therefore all these unavailable services should be eliminated in order to
reduce the waste of network resources.
Figure 4-1 Service amount statistic of these five Web Service Registries
4.2 Statistic Information for WSDL Document
Web Service Registry Name   Service Repository   Ebi   Xmethods   Seekda   Biocatalogue
Failed WSDL Links                    1             0       23       145         32
Without WSDL Links                   0             0        0         0         16
Empty Content                        0             0        2         0          2
Table 4-2 Statistic information for WSDL Document
Table 4-2 and figure 4-2 show the error statistics of the WSDL documents of the Web services in
these five Web Service Registries. There are three aspects. The first one is the "Failed
WSDL Links" count of the Web services in these Web Service Registries: the overall number of
Web services whose WSDL links are invalid. In other words, it is impossible to get the
WSDL documents of these Web services via the URL addresses of their WSDL links, and therefore no
WSDL document is created. The second aspect is the "Without WSDL Links" count
for these Web Service Registries: the overall number of Web services in each Web Service
Registry that have no WSDL link at all, so that there can be no WSDL document for such Web services.
The value of the WSDL link for such a Web service is therefore "NULL". However, a WSDL
document is still created; it has no content, and its name
contains the string "[No WSDL Document]". The third aspect is the "Empty Content" count, which
represents the overall number of Web services that have WSDL links whose URL addresses
are valid but whose WSDL documents contain no content. In this case a WSDL document whose name
contains the string "(BAD)" is created.
Figure 4-2 Statistic information for WSDL Document
4.3 Comparison of the Average Number of Service Properties
This section compares the average number of service properties in these five Web Service
Registries. This average number of service properties is calculated by means of the following equation:

ASP = ONSP / ONS    (1)

Where
ASP is the average number of service properties for one Web Service Registry
ONSP is the overall number of service properties in one Web Service Registry
ONS is the overall number of Web services that have already been crawled from the
corresponding Web Service Registry
Figure 4-3 shows the average number of service properties per Web service in these five Web
Service Registries. As already mentioned, one of the measurements for judging the quality
of Web services in a Web Service Registry is the service information: the more information about a Web
service, the better you know that service, and consequently the better the quality of Web
services the corresponding Web Service Registry can offer to users. As seen in figure 4-3, the Service
Repository and Biocatalogue Web Service Registries own a larger number of service properties than
the other three Web Service Registries. This directly reflects that these two Web Service
Registries provide more detailed information about the Web services published in them, so
that users can more easily choose the service they need and are also more willing to use the Web
services published in these two Web Service Registries. By contrast, the Xmethods and Seekda
Web Service Registries, which have less service information about their Web services, offer lower quality
for these Web services. Therefore users may be less willing to use the Web services provided in these
two Web Service Registries, not to mention the Web services published in the Ebi Web Service
Registry.
Figure 4-3 Average Number of Service Properties
Based on the description presented in section 3.2.3, the causes of the different numbers of
service properties in these Web Service Registries may consist of the following points. First, the
amount of structured information for these Web services differs across these five
Web Service Registries; part of the information for some Web services in one Web Service Registry
may even be missing or have an empty value. For example, the amount of structured information that is
supposed to be extracted for the Web services in the Ebi Web Service Registry differs greatly from that
in the Biocatalogue Web Service Registry. Secondly, there is a certain amount of endpoint information for
almost all Web Service Registries except the Ebi Web Service Registry; this more or less reduces
the overall number of service properties. Thirdly, some Web Service Registries, such as Xmethods
and Ebi, do not have monitoring information. In particular, the Service Repository Web
Service Registry has a large amount of monitoring information about its Web services that can be
extracted from the Web. Obviously, the last point is the amount of whois information for these Web
services. If the database of the whois client does not contain information about the service
domain of a Web service in one Web Service Registry, then no whois information
can be extracted. Moreover, even if there is information about the service domain, the amount of
information can be very diverse. Therefore, if a Web Service Registry is in the situation
that many service domains of its Web services have no or only little whois information,
then the average number of service properties in that registry decreases greatly.
As a result, in order to help users become better acquainted with the Web services provided in a Web
Service Registry and to distinguish these Web services, the Web Service Registry should do its best to
offer as much information as possible for each Web service it publishes.
[Figure 4-3, a bar chart: the average number of service properties per Web service is 23 for Service Repository, 7 for Ebi, 17 for Xmethods, 17 for Seekda and 32 for Biocatalogue.]
4.4 Different Outputs of Web Services
The fundamental task of this master program is to obtain the WSDL documents of the Web services
hosted in the Web Service Registries, as well as to extract and gather the properties of these Web
services and thereafter store them on disk. Therefore this section describes the different
outputs of this master program, which include the WSDL documents of the Web services, the generated
XML and INI files, and the data records of these service properties in the database.
Figure 4-4 WSDL Document format of one Web service
The WSDL document of a Web service is simply read from the Web according to the URL address of its
WSDL link, and these data are then stored on disk as the WSDL document. The name of the
WSDL document is the service name plus the ending ".wsdl", which marks it as a WSDL document.
In order to distinguish WSDL documents whose names would be the same while their
contents differ, the name of each obtained WSDL document in one Web Service
Registry contains a unique Integer in front of the name. Figure 4-4 shows the valid WSDL
document format of a Web service; its name is "1BLZService.wsdl".
Besides, the obtained service properties are transformed into an XML file, an INI file and
data records in the database. Figure 4-5, figure 4-6 and figure 4-7 show these three different output
formats respectively. As you can see from figure 4-5, this is the INI file of the Web service; its name
is "1BLZService.ini". The Integer is the same as in the WSDL document, because both are
materials belonging to the same Web service. The first three lines in that INI file are service
comments, which start from the semicolon and run to the end of the line; they are the basic information
describing this INI file. The following line is the section, which is enclosed in a pair of
brackets. It is important because it indicates that the lines behind it are the information of
this Web service. The rest of the lines are the actual service information, each given as a
key-value pair with an equals sign between key and value. Each service property is displayed from the beginning
of the line.
Figure 4-5 INI File format of one Web service
Figure 4-6 XML File format of one Web service
Figure 4-7 Database data format for all Web services
Furthermore, figure 4-6 shows the generated XML file of the Web service "BLZService"; its
name is "1BLZService.xml". Needless to say, this XML file belongs to the materials of the same Web
service. Though the format of the XML file differs from that of the INI file, the essential contents
are the same; that is to say, the values of the service properties do not differ, because both
files are generated from the same collection of properties of the same Web service. The XML
file also has comments like those in the INI file, which are displayed between "<!--" and "-->", and the
section of the INI file corresponds to the root of the XML file. Therefore all values of the elements under
the root "service" in this XML file are the values of the service properties of this Web service.
Eventually, as can be seen from figure 4-7, this is the database table used to store the
service information of all Web services in these five Web Service Registries. The entire service
information of one Web service forms exactly one record in this table. Because of that, the column
names of the table are the union of the names of the service information in each Web Service
Registry. However, since the column names of the table have to be unique, the redundant
names in this union must be eliminated. This is sensible and possible because the names of the
service information are well defined and uniform across all five Web Service Registries. In addition,
the first column of this table is the primary key, an increasing Integer; its function resembles
that of the Integer contained in the names of the XML and INI files. The remaining columns of
the table are the corresponding service properties of the Web services. The value "NULL" or "Null" in
this table indicates that this property of that Web service is empty or missing.
4.5 Comparison of the Average Time Cost for Different Parts of a Single Web Service
This section compares the average time cost of the different parts of getting
one single Web service in all these five Web Service Registries. At first the average
time cost of getting one single service in a Web Service Registry has to be calculated, which is
obtained through the following equation:

ATC = OTS / ONS    (2)

Where
ATC is the average time cost for one single Web service
OTS is the overall time cost of all the Web services in one Web Service Registry
ONS is the overall number of Web services that have already been crawled from the
corresponding Web Service Registry
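For example, for the Ebi Web Service Registry, table 4-1 gives ONS = 289 crawled services and table 4-3 below gives ATC = 823 milliseconds, so equation (2) implies an overall time cost of roughly OTS = 289 × 823 ≈ 238 seconds for that whole registry.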
In addition, the different parts of the average time cost for getting one single service consist of
the following six aspects: the average time cost for extracting the service properties, the average time
cost for obtaining the WSDL document, the average time cost for generating the XML file, the average time
cost for generating the INI file, the average time cost for inserting the service properties into the table of
the database, and the average time cost for some other procedures, such as getting the service list page
link, getting the service page link, and so on. The average time cost for extracting the service properties
is obtained by means of the following equation:
ATCSI = OTSSI / ONS    (3)

Where
ATCSI is the average time cost for extracting the service properties of one single Web service
OTSSI is the overall time cost for extracting the service properties of all the Web services in one Web
Service Registry
ONS is the overall number of Web services that have already been crawled from the
corresponding Web Service Registry
The calculation of the other parts is similar to the equation for the average time
cost for extracting the service properties, while the average time cost for the other
procedures equals the average time cost for one single Web service minus the sum of
the average time costs of the other five parts.
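For example, for the Service Repository Web Service Registry in table 4-3, the average time cost for the other procedures is 10042 - (8801 + 918 + 2 + 1 + 53) = 267 milliseconds.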
Web Service Registry   Service property   WSDL Document   XML File   INI File   Database   Others   Overall
Service Repository           8801              918            2          1         53        267     10042
Ebi                           699               82            2          1         28         11       823
Xmethods                     5801             1168            2          1         45         12      7029
Seekda                       5186             1013            2          1         41         23      6266
Biocatalogue                39533              762            2          1         66       1636     42000
Table 4-3 Average time cost information for all Web Service Registries (in milliseconds)
Table 4-3 displays the average time cost of one Web service and of its different parts in all these five Web
Service Registries. The first column of table 4-3 gives the names of the five Web Service Registries,
and the last column is the average time cost of a single service in each Web Service Registry,
while the remaining columns are the average time costs of the six different parts. In order to
give an intuitive view of the data in table 4-3, the data in each column of this table are illustrated
in the corresponding figures 4-8 to 4-13.
Figure 4-8 Average time cost for extracting service property in all Web Service Registries
As you can see from figure 4-8, the average time cost for extracting the service properties in the Biocatalogue
Web Service Registry is 39533 milliseconds. This value is much larger than those of the other four Web Service
Registries, which are 8801, 699, 5801 and 5186 for Service Repository, Ebi, Xmethods and Seekda respectively.
That is to say, it takes much longer to extract the service properties of the Web services
published by the Biocatalogue Web Service Registry. This also indirectly indicates that the
Biocatalogue Web Service Registry has the largest average number of service properties, which has
already been discussed in section 4.3. On the contrary, the average number of service properties in the
Ebi Web Service Registry is the smallest. Moreover, the average time cost in the Xmethods Web Service
Registry is larger than that in the Seekda Web Service Registry, although, as already known, the average
number of service properties is the same for these two Web Service Registries. One cause
that might explain why the average time in Xmethods is higher than in Seekda
is that the process of extracting the service properties in the Xmethods Web Service Registry has to be
executed by means of both the service page and the service list page, while only the service page link is needed
for the Seekda Web Service Registry.
The average time cost for obtaining the WSDL document in all these five Web Service Registries is
displayed in figure 4-9. Actually, this average time cost is the sum of the
average time for extracting the WSDL link of a Web service and the average time for reading the data of
the WSDL document from the Web and storing it on disk. As seen from this figure, the average
time cost for obtaining the WSDL document in the Xmethods Web Service Registry is the largest, namely
1168 milliseconds. Although the process of extracting the WSDL link costs a certain amount of time, it does not have
a significant influence on the total average time spent obtaining the WSDL document, because
the WSDL link of one Web service is almost always gained in one step. Therefore this implies that the
average size of the WSDL documents of the Xmethods Web Service Registry is larger than that of the other
four Web Service Registries, in particular the Ebi Web Service Registry, which takes just 82 milliseconds
for the process of obtaining the WSDL document.
Figure 4-9 Average time cost for obtaining WSDL document in all Web Service Registries
Figure 4-10, figure 4-11 and figure 4-12 show the average time costs of generating the three different
outputs in all these five Web Service Registries. As can be seen from figures 4-10 and 4-11,
the average time for generating the XML file for one Web service is the same for all five Web
Service Registries, namely only 2 milliseconds; the average time for generating the INI file for
one Web service is likewise the same everywhere, with a value of just 1 millisecond. Even the sum
of these two average time costs is still so small that it can be omitted when
compared to the overall average time cost of getting one Web service for the corresponding Web
Service Registry, as shown in figure 4-13. This implies that the process of generating the XML and INI
files finishes at once after receiving the service properties of one Web service as input.
Furthermore, as can be seen from figure 4-12, although the average time costs for creating the database
record for each Web service in all five Web Service Registries are larger than the times for
generating the XML and INI files, the operation of creating a database record is still fast.
Figure 4-10 Average time cost for generating XML file in all Web Service Registries
Figure 4-11 Average time cost for generating INI file in all Web Service Registries
Figure 4-12 Average time cost for creating database record in all Web Service Registries
Figure 4-13 Average time cost for getting one Web service in all Web Service Registries
Figure 4-13 gives the average time cost for getting one single Web service in all these five
Web Service Registries. Without any doubt, the Biocatalogue Web Service Registry takes the
longest time for this process. This is because, as the presentation of the five different parts
above shows, the average time cost of each part in the Biocatalogue Web Service Registry is the
largest, except for the process of obtaining the WSDL document, where Biocatalogue
does not cost the most time. Moreover, there is a striking observation when looking at
figures 4-8, 4-12 and 4-13: the shapes of these curves follow almost the same
trend. This further indicates that a Web Service Registry that spends more time getting the
description information of one Web service also offers more information about that Web service.
5 Conclusion and Further Directions
This master thesis provides a schema that aims to explore the description information of the Web
services hosted in different Web Service Registries. The description information of
a Web service consists of the WSDL document and the service information about that Web service.
Although there is an existing approach, the Pica-Pica Web Service Description Crawler, that can
be used to obtain the WSDL document and service information of the Web services in these Web
Service Registries, its functionality is restricted: it explores only a small subset of the Web
services hosted in these Web Service Registries, the storage of these Web services
is not flexible, and, most importantly, only a little service information is extracted per Web service;
for some Web Service Registries no service information is extracted at all. The work presented in
this master thesis, however, is able to explore all Web services in
these Web Service Registries. Moreover, with the approach presented in this master thesis, as much
service information as possible is extracted for each Web service, so that the final result is
the largest annotated service catalogue ever produced. Furthermore, regarding the storage of
the description information of the Web services, this master thesis provides three different ways that
guarantee not only the completeness but also the longevity of the description information of the
Web services.
However, in the implementation performed in this master thesis, the whois client used for querying
the information of the service domain returns free text if the information exists, and sometimes this
free text differs completely. That makes it necessary to crawl each Web service in all Web Service
Registries at least once during the experiment stage, so that all the cases of this free text can be
foreseen and processed afterwards. Nevertheless this is a huge amount of work, because there are lots of Web
services in these Web Service Registries. Therefore, in order to simplify the work, another whois client
that eases the work here needs to be found and used.
Moreover, in the experiment stage of this master thesis, the time cost for getting a Web service is still
large. For the purpose of reducing this time, multithreaded programming could also be applied to
some parts of the process of getting one Web service.
Although the work performed here is specialized for only these five Web Service Registries, the
main parts of the principles used here are adaptable to other Web Service Registries with only
small changes in the implementation code or the structure.
6 Bibliography
[1] Nathalie Steinmetz, Holger Lausen, Dario Cerizza, Andrea Turati, Irene Celino, Adam Funk and
Michael Erdmann: "D1.1 – Requirement Analysis and Architectural Plan". Emanuele Della Valle (CEFRIEL),
June 27, 2008. Available from http://www.service-finder.eu/deliverables/public/7-public/32-d11
[2] Nathalie Steinmetz, Holger Lausen, Dario Cerizza, Andrea Turati, Irene Celino, Adam Funk and
Michael Erdmann: "D1.2 - First Design of Service-Finder as a Whole". Emanuele Della Valle (CEFRIEL),
July 1, 2008. Available from
http://www.service-finder.eu/deliverables/public/7-public/37-d12-first-design-of-service-finder-as-a-whole
[3] Nathalie Steinmetz, Holger Lausen, Irene Celino, Dario Cerizza, Saartje Brockmans, Adam Funk:
"D1.3 – Revised Requirement Analysis and Architectural Plan". Emanuele Della Valle (CEFRIEL),
April 1, 2009. Available from
http://www.service-finder.eu/deliverables/public/7-public/57-d13-revised-requirement-analysis-and-architectural-plan
[4] Chia-Hui Chang, Mohammed Kayed, Moheb Ramzy Girgis, Khaled Shaalan: "A Survey of Web
Information Extraction Systems". IEEE Transactions on Knowledge and Data Engineering, Volume 18,
Issue 10, IEEE Computer Society, pp. 1411-1428, October 2006.
[5] Leonard Richardson: "Beautiful Soup Documentation". October 13, 2008. Available from
http://www.crummy.com/software/BeautifulSoup/documentation.html
[6] Hao He, Hugo Haas, David Orchard: "Web Services Architecture Usage Scenarios". February 11, 2004.
Available from http://www.w3.org/TR/ws-arch-scenarios/
[7] Stephen Soderland: "Learning Information Extraction Rules for Semi-Structured and Free Text".
Machine Learning, Volume 34, Issue 1-3, Department of Computer Science and Engineering,
University of Washington, Seattle, pp. 233-272, February 1999.
[8] Ian Hickson: "A Vocabulary and Associated APIs for HTML and XHTML". World Wide Web
Consortium, Working Draft WD-html5-20100624, January 22, 2008.
[9] Holger Lausen, Jos de Bruijn, Axel Polleres, Dieter Fensel: "The Web Service Modeling Language
WSML". WSML Deliverable D16.1 v0.2, March 20, 2005. Available from
http://www.wsmo.org/TR/d16/d16.1/v0.2/
[10] Dumitru Roman, Holger Lausen, Uwe Keller: "Web Service Modeling Ontology - Standard (WSMO
-Standard)". WSMO Deliverable D2 version 1.1, 06 March 2004. Available from
http://www.wsmo.org/TR/d2/v1.1/
[11] Iris Braun, Anja Strunk, Gergana Stoyanova, Bastian Buder: "ConQo – A Context- and QoS-Aware
Service Discovery". TU Dresden, Department of Computer Science, in Proceedings of WWW/Internet
2008.
7 Appendixes
There are additional outputs of this master program, namely a log information file and a statistic report
file. Figure 8-1 shows one of the basic output formats of the log information for these five Web
Service Registries.
Figure 8-1 Log information of the "Service Repository" Web Service Registry
Figure 8-2 Statistic information of the "Service Repository" Web Service Registry
Figures 8-2 to 8-6 show the output format of the statistic information for the "Service Repository",
"Ebi", "Xmethods", "Seekda" and "Biocatalogue" Web Service Registries respectively.
Figure 8-3 Statistic information of the "Ebi" Web Service Registry
Figure 8-4 Statistic information of the "Xmethods" Web Service Registry
Figure 8-5 Statistic information of the "Seekda" Web Service Registry
Figure 8-6 Statistic information of the "Biocatalogue" Web Service Registry
Table of Figures
Figure 2-1 Dataflow of Service-Finder and Its Components 12
Figure 2-2 Left is the free text input type and right is its output 16
Figure 2-3 A semi-structured page containing data records (in rectangular box) to be extracted 16
Figure 2-4 Architecture of the Pica-Pica Web Service Description Crawler 20
Figure 3-1 Overview of the Basic Architecture for the Deep Web Services Crawler 25
Figure 3-2 Overview of the process flow of the Web Service Extractor Component 27
Figure 3-3 Service list page of the Service-Repository 29
Figure 3-4 Original source code of the internal link for Web service "BLZService" 29
Figure 3-5 Code overview of getting service page link in Service Repository 29
Figure 3-6 Service page of the Web service "BLZService" 29
Figure 3-7 Overview of the process flow of the WSDL Grabber Component 30
Figure 3-8 WSDL link of the Web service "BLZService" in the service page 31
Figure 3-9 Original source code of the WSDL link for Web service "BLZService" 32
Figure 3-10 Code overview of "getServiceRepositoryWSDLLink" function 32
Figure 3-11 Code overview of "oneParameter" function 32
Figure 3-12 Overview of the process flow of the Property Grabber Component 33
Figure 3-13 Structure properties of the Service "BLZService" in service list page 37
Figure 3-14 Structure properties of the Service "BLZService" in service page 38
Figure 3-15 Endpoint information of the Web service "BLZService" in service page 38
Figure 3-16 Monitoring Information of the Service "BLZService" in service page 39
Figure 3-17 Whois Information of the service domain "thomas-bayer.com" 40
Figure 3-18 Overview of the process flow of the Storage Component 41
Figure 3-19 Implementation code for getting WSDL document 44
Figure 3-20 Implementation code for generating XML file 44
Figure 3-21 Implementation code for generating INI file 45
Figure 3-22 Implementation code for creating table in database 45
Figure 3-23 Implementation code for generating table records 46
Figure 4-1 Service amount statistic of these five Web Service Registries 49
Figure 4-2 Statistic information for WSDL Document 50
Figure 4-3 Average Number of Service Properties 51
Figure 4-4 WSDL Document format of one Web service 52
Figure 4-5 INI File format of one Web service 53
Figure 4-6 XML File format of one Web service 53
Figure 4-7 Database data format for all Web services 53
Figure 4-8 Average time cost for extracting service property in all Web Service Registries 55
Figure 4-9 Average time cost for obtaining WSDL document in all Web Service Registries 56
Figure 4-10 Average time cost for generating XML file in all Web Service Registries 57
Figure 4-11 Average time cost for generating INI file in all Web Service Registries 57
Figure 4-12 Average time cost for creating database record in all Web Service Registries 58
Figure 4-13 Average time cost for getting one Web service in all Web Service Registries 58
Table of Tables
Table 3-1 Structured Information of Service-Repository Web Service Registry 34
Table 3-2 Structured Information of Xmethods Web Service Registry 34
Table 3-3 Structured Information of Seekda Web Service Registry 34
Table 3-4 Structured Information of Ebi Web Service Registry 34
Table 3-5 Structured Information of Biocatalogue Web Service Registry 34
Table 3-6 SOAP Operation Information of Biocatalogue Web Service Registry 35
Table 3-7 REST Operation Information of Biocatalogue Web Service Registry 35
Table 3-8 Endpoint Information of these five Web Service Registries 35
Table 3-9 Monitoring Information of these five Web Service Registries 35
Table 3-10 Whois Information for these five Web Service Registries 36
Table 3-11 Extracted Structured Information of Web Service "BLZService" 38
Table 3-12 Extracted Endpoint Information of the Web service "BLZService" 39
Table 3-13 Extracted Monitoring Information of the Web service "BLZService" 39
Table 3-14 Extracted Whois Information of service domain "thomas-bayer.com" 40
Table 3-15 Sleep Time of these five Web Service Registries 47
Table 4-1 Service amount statistic of these five Web Service Registries 48
Table 4-2 Statistic information for WSDL Document 49
Table 4-3 Average time cost information for all Web Service Registries 55
Table of Abbreviations
DTD Document Type Definition
DWSC Deep Web Service Crawler
HTML HyperText Markup Language
MTBF Mean Time between Failures
MTTR Mean Time to Recovery
QoS Quality of Service
REST Representational State Transfer
RTT Round Trip Time
SOAP Simple Object Access Protocol
SQL Structured Query Language
URL Uniform Resource Locator
WHATWG Web Hypertext Application Technology Working Group
XML eXtensible Markup Language
WSDL Web Service Description Language
WSML Web Service Modeling Language
WSMO Web Service Modeling Ontology
2.1.2.3 The Principle of the Conceptual Indexer and Matcher Component
The Conceptual Indexer and Matcher is more like a data store center that aims at storing all extracted
information about the services and at supplying users with the capability of retrieval and semantic query,
for example the matchmaking between user requests and service offers, and the retrieval of user
feedback on extracted annotations.
In addition, let's have a look at the function of this component and its input and output.
Input
- Semantic annotation data and full text information obtained from the Automatic Annotation component
- Semantic annotation data and full text information that come from the user interfaces
- Cluster data from the user and service clustering component
Function
- Store the semantic annotations received from the Automatic Annotation component and
from the user interface
- Store the cluster data procured through the clustering component
- Store and index the textual descriptions offered by the Automatic Annotation component
and the textual comments offered by users
- Ontological querying of the semantic data from the data store center
- Combined keyword and ontological querying for user queries
- Provide a list of similar services for a given service
Output
- A list of matching services queried by users; in particular, these services should be
sorted by ranking and should also be iterable
- All available data related to a particular entity must be retrievable at the user interface
2.1.2.4 The Principle of the Service-Finder Portal Interface Component
The Service-Finder Portal Interface is the main entry point provided for users of the
Service-Finder system to search and browse the data managed by the Conceptual Indexer
and Matcher component. In addition, users can contribute information by providing
tags, comments, categorizations and ratings for the data they browse. Furthermore, developers can
directly invoke the Service-Finder functionalities from their custom applications in terms of an API.
The details of this component's function, input and output are presented below.
Input
- A list of ordered services for a query
- Detailed information about a service or a set of services and a service provider
- Query access to the service category ontology and the most used tags provided by the users
- Service availability information
Function
- The Web interface allows the users to search services by keyword, tag, or concept in the
categorization; to sort and filter query results by refining the query; to compare and bookmark
services; and to try out the services that offer this functionality
- The API allows the developers to invoke Service-Finder functionalities
Output
- Explicit user annotations such as tags, ratings, comments, descriptions and so on
- Implicit user data, for example click streams of users, bookmarks, comparisons, links sent,
etc.
- Manual advertising of available new services
2.1.2.5 The Principle of the Cluster Engine Component
The Cluster Engine is in charge of collecting and analyzing information about user behavior from the
Service-Finder Portal, e.g. the queried services and the compared services of the users. Moreover, it
also provides cluster data to the Conceptual Indexer and Matcher for providing service
recommendations.
Furthermore, this component's function, input and output are introduced in detail below.
Input
- Service annotation data, both extracted and from user feedback
- Users' click streams, used for extracting user behavior
Function
- Obtain user clusters from user behaviors
- Obtain service clusters from service annotation data, making it possible to find similar services
Output
- Clusters of users and services
2.2 Information Extraction
Due to the rapid development and use of the World-Wide Web, a huge
amount of information sources has been produced on the Internet, to which access by browsing and
searching is limited because of the heterogeneity and the lack of structure of Web information sources.
Therefore the appearance of Information Extraction, which transforms the Web pages into
program-friendly structures for post-processing, has become a great necessity. The task of
Information Extraction is specified in terms of its inputs and its extraction targets, and the
technique used in the process of Information Extraction is called the extractor.
2.2.1 Input Types of Information Extraction
Generally speaking, there are three different input types. The first input type is the unstructured
document, for example the free text shown in figure 2-2; it is unstructured and written in
natural language, so it requires substantial natural language processing. The second
input type is the structured document, for instance XML documents, whose data
can be described through an available DTD (Document Type Definition) or XML
(eXtensible Markup Language) schema. Finally, the third input type is the
semi-structured document, which is widespread on the Web, such as the large volume of HTML
pages with tables, itemized lists and enumerated lists; HTML tags are often used to
render such embedded data in HTML pages, see figure 2-3.
Figure 2-2 Left is the free text input type and right is its output [4]
Figure 2-3 A semi-structured page containing data records (in rectangular box) to be extracted [4]
Seen this way, inputs of the semi-structured type are documents with a fairly
regular structure, and the data of these documents can be displayed in an HTML or a
non-HTML format. Besides, since the Web pages of the Deep Web are dynamic and
generated from structured databases by means of templates or layouts, the Deep Web can be
considered one of the input sources that provide such semi-structured documents.
For example, the author, price and comment parts of the book pages provided by Amazon have the
same layout, because these Web pages are generated from the same database and with
the same template or layout. Furthermore, there is another option, namely manually
generated HTML pages of the semi-structured type: although the publication lists
provided on the homepages of different researchers are produced by diverse users, they all have
a title and a source property for every single paper. Eventually, the inputs of some Information
Extraction systems can also be pages of the same class, or pages from various Web Service Registries.
2.2.2 Extraction Targets of Information Extraction
Moreover, regarding the task of Information Extraction, the extraction target
has to be considered. There are two different extraction targets. The first one is the relation of a k-tuple,
where k is the number of attributes in a record; in some cases an attribute of a
record may have no instantiation at all, while in other cases an attribute owns multiple instantiations.
The second extraction target is the complex object with hierarchically organized data. Though
the ways of depicting the extraction targets in a page are diverse, the most common structure is the
hierarchical tree. Such a hierarchical tree may contain a single leaf node, or one or more lists of leaf
nodes grouped under internal nodes. The structure of a data object may also be flat or nested: to
be brief, if the structure is flat then there is only one leaf node, which can also be called the root;
if it is a nested structure, then the internal nodes involved in this data object span more than
two levels.
Furthermore, in order to make the Web pages readable for human beings and easier to
visualize, the tables or tuples of the same list, and the elements of a tuple, should be clearly isolated
or demarcated. However, the display of a data object in a Web page can be affected by the
following conditions [4]:
- The attribute of a data object has zero or several values
(1) If there is no value for the attribute of a data object, this attribute is called a "none"
attribute. For example, a special offer only available for certain books might be a "none"
attribute.
(2) If there is more than one value for the attribute of a data object, it is called a
"multiValue" attribute. For instance, the name of the author of a book could be a
"multiValue" attribute.
- The set of attributes (A1, A2, A3, ...) has multiple orderings
That is to say, within this set of attributes, the position of an attribute might change
across the diverse instances of a data object; such an attribute is called a
"multiOrdering" attribute. For instance, for movies before the year 1999 a movie site would
enumerate the release date in front of the movie's title, while for movies from 1999 onwards
(including 1999) it enumerates the release date behind the movie's title.
- The attribute has different formats
This means that the display format of the data object can be completely distinct for
different instances. Therefore, if the format of an attribute is free, a lot of rules are
needed to deal with all kinds of possible cases; this kind of attribute is called a "multiFormat"
attribute. For example, an e-commerce Web site might use a bold font to present the
general prices while using a red color to display the sale prices. Nevertheless, there is
also the situation that several different attributes of a data object have the same format, for
example various attributes presented with <TD> tags in a table presentation.
Attributes like those can be differentiated by means of their order information. However,
in cases where a "none" attribute occurs or "multiOrdering"
attributes exist, the rules for extracting these attributes have to be revised.
- The attribute cannot be decomposed
For easier processing, the input documents are sometimes treated as
strings of tokens instead of strings of characters, and some attributes cannot
even be decomposed into several individual tokens. These attributes are called "untokenized"
attributes. For example, in college course catalogue entries like "COMP4016" or "GEOL2001", the
department code and the course number cannot be separated into two different strings
of characters such as "COMP" and "4016" or "GEOL" and "2001".
2.2.3 The Techniques Used in Information Extraction
The extractor used in the process of Information Extraction aims at providing a single uniform query
interface to access information sources like database servers and Web servers. It consists of the following
phases: collecting the returned Web pages, labeling these Web pages, generalizing extraction rules,
extracting the relevant data, and outputting the result in an appropriate format (XML format or a
relational database) for further information integration. For example, at first the extractor queries the
Web server to gather the returned pages through the HTTP protocol; after that it starts to extract the
contents from these HTML documents and integrates them with other data sources thereafter.
The whole process of the extractor follows the steps below.
- Step 1
At the beginning the input has to be tokenized. There are two different
granularities for tokenizing the input string: tag-level encoding and word-level
encoding. Tag-level encoding transforms the tags of an HTML page into general tokens
and every text string between two tags into a special token, whereas
word-level encoding treats each word in a document as a token.
- Step 2
Next, the extraction rules are applied for every attribute of the data object in the Web pages.
These extraction rules can be induced by means of top-down or bottom-up generalization,
pattern mining, or logic programming. In addition, the type of the extraction rules may be expressed
by means of regular grammars or logic rules. For example, some systems use path expressions over the
HTML parse tree, such as html.head.title or html->table[0]; some use syntactic or semantic
constraints; and some use delimiter-based constraints such as HTML tags or literal words.
- Step 3
After that, all the extracted data are assembled into records.
- Step 4
Finally, this process is iterated until all the data objects in the input have been processed.
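As an illustration of such a delimiter-based extraction rule, the following hedged Java sketch extracts the text between <td> tags of a small HTML fragment; the HTML snippet and the pattern are invented for this example and do not come from any of the cited systems.

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DelimiterRuleDemo {
    public static void main(String[] args) {
        // Invented sample input: one table row of a semi-structured page.
        String html = "<tr><td>BLZService</td><td>http://example.org/blz?wsdl</td></tr>";

        // Delimiter-based rule: everything between <td> and </td> is one attribute value.
        Pattern rule = Pattern.compile("<td>(.*?)</td>");
        Matcher m = rule.matcher(html);

        List<String> record = new ArrayList<>();
        while (m.find()) {
            record.add(m.group(1)); // step 3: assemble the extracted data into a record
        }
        System.out.println(record); // prints [BLZService, http://example.org/blz?wsdl]
    }
}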
2.3 Pica-Pica Web Service Description Crawler
Pica-Pica is known as a bird species, also called pie or magpie. Here, however, Pica-Pica is a
Web Service Description Crawler designed to address the quality of Web
Services, for example the evaluation of the descriptive quality of the Web Services that are offered
and of how well these Web Services are described in today's Web Service Registries.
2.3.1 Needed Libraries of the Pica-Pica Web Service Description Crawler
This version of the Pica-Pica Web Service Description Crawler was written by Anton Caceres and Josef
Spillner and is programmed in the Python language. In order to run its scripts for
parsing the HTML pages, it needs two additional libraries: Beautiful Soup and html5lib.
- Beautiful Soup
It is an HTML/XML parser for the Python language that can even turn invalid markup into a
parse tree [5]. Moreover, the following three features make it more powerful:
  - Bad markup does not choke Beautiful Soup; in fact it generates a parse tree that
makes approximately as much sense as the original document, so you can obtain
the data that you want.
  - Beautiful Soup provides a toolkit with simple idiomatic methods for navigating,
searching and modifying the parse tree, so you do not need to create a custom parser
for every application.
  - If the document has already specified an encoding you can ignore it, since
Beautiful Soup can convert documents from Unicode to UTF-8 automatically;
otherwise you just have to specify the encoding of the original documents.
Furthermore, the ways of including Beautiful Soup in an application are the following [5]:
from BeautifulSoup import BeautifulSoup          # For processing HTML
from BeautifulSoup import BeautifulStoneSoup     # For processing XML
import BeautifulSoup                             # To get everything
- html5lib
It is a Python package which implements the HTML5 [8] parsing algorithm. In order to
gain maximum compatibility with the current major desktop web browsers, this implementation
is based on the WHATWG (Web Hypertext Application Technology Working Group) HTML5
specification.
2.3.2 Architecture of the Pica-Pica Web Service Description Crawler
Figure 2-4 Architecture of the Pica-Pica Web Service Description Crawler
Figure 2-4 illustrates the primary architecture of the Pica-Pica Web Service Description Crawler. It includes four fundamental components: the Service Page Grabber component, the WSDL Grabber component, the Property Grabber component, and the WSML Register component.
(1) The Service Page Grabber component takes the URL seed as input and outputs the links of the service pages to the following two components: the WSDL Grabber component and the Property Grabber component.
(2) The WSDL Grabber component is responsible for obtaining the WSDL document based on the delivered service page link, and for checking the validity of the obtained WSDL documents. Only valid WSDL documents are passed on to the WSML Register component for further processing.
(3) The Property Grabber component tries to extract the service properties hosted in the service page, if any exist. After that, all these service properties are saved into an INI file as the information of that service.
(4) The functionality of the WSML Register component is to write appropriate WSML documents by means of the valid WSDL documents delivered from the WSDL Grabber component and the optional INI files delivered from the Property Grabber component, and afterwards to register them in Conqo.
■ WSML [9]
It stands for Web Service Modeling Language, which provides a framework with different language variants. Hence, it is often used to describe the different aspects of semantic Web Services according to the conceptual model of WSMO.
■ WSMO [10]
WSMO, whose full name is Web Service Modeling Ontology, is dedicated to describing various aspects related to Semantic Web Services based on ontologies. An ontology is a formal, explicit specification of a shared conceptualization. In fact, these ontologies are the sticking points that provide the linkage between the agreement of communities of users and the defined conceptual semantics of the real world.
■ Conqo [11]
It is a discovery framework that considers not only the Quality of Service (QoS) but also context information. It uses a Web Service Repository to manage the service descriptions, which are based on WSML.
2.3.3 Implementation of the Pica-Pica Web Service Description Crawler
This section describes the implementation of the Pica-Pica Web Service Description Crawler in detail.
(1) Firstly, to start the whole crawling process of the Pica-Pica Web Service Description Crawler, an input is needed as the initial seed. For this crawler there are five Web Service Registries, which are listed below. The URL addresses of these five Web Service Registries are used as the input seeds for the Pica-Pica Web Service Description Crawler. Moreover, in this version of the Pica-Pica Web Service Description Crawler there is a separate Python script for each Web Service Registry, and the Python scripts for these Web Service Registries are executed one after another.
Biocatalogue: http://www.biocatalogue.com
Ebi: http://www.ebi.ac.uk
Seekda: http://www.seekda.com
Service-Repository: http://www.service-repository.com
Xmethods: http://www.xmethods.net
(2) Then, after being fed with the input seed, the crawler steps into the next component, the Service Page Grabber. At first this component tries to read the data from the Web based on the input seed. Then it establishes a parse tree of the read data by means of the functions of the Beautiful Soup library. After that, the Service Page Grabber component starts to look for the service page link of each service published in the Web Service Registry by means of the functions of the Html5lib library. When the service page link of a single service is found, the component first checks whether this service page link is valid. Once the service page link is valid, it is passed into the following two components for further processing: the WSDL Grabber component and the Property Grabber component.
(3) When the WSDL Grabber component receives a service page link from the previous component, it sets out to extract the WSDL link address of that service from the parse tree of the data in this service page. Next, this component starts to download the WSDL document of that service via the WSDL link address. Thereafter, the obtained WSDL document is stored on disk. The process of the WSDL Grabber component is carried on continually until no more service links are passed to it. Certainly, not all grabbed WSDL documents are valid: they may contain bad definitions or a bad namespace URI, be empty documents, or, even worse, not be in XML format at all. Hence, in order to sort them out, this component further analyzes the involved WSDL documents and puts all valid documents into a "validWSDLs" folder, whereas the invalid documents are put into a folder named "invalidWSDLs" in order to gather statistical information. Finally, only the WSDL documents in the "validWSDLs" folder are passed to the subsequent component.
(4) Moreover, since some Web Service Registries provide additional information about the services, such as availability, service provider, and version, the Property Grabber component sets out to extract this information as the service's properties, and thereafter saves these properties into an INI file. However, if no additional information is available, there is no need to extract service properties, and thus there is no INI file for that service. Note that in this implementation of the Pica-Pica Web Service Description Crawler only the Python scripts for the Seekda and Service-Repository Web Service Registries have functions to extract the services' properties, while the other three Web Service Registries lack such a function.
(5) Furthermore, it is optional to create a report file which contains statistical information about this process, such as the total number of services in one Web Service Registry, the number of services whose WSDL document is invalid, etc.
(6) As has been stated, at this point there is a folder with all valid WSDL documents and possibly some INI files. The task of the WSML Register component is therefore to generate the appropriate WSML documents from these valid WSDL documents and INI files, and then to register them in Conqo.
2.4 Conclusions of the Existing Strategies
This chapter presented three aspects of the existing strategies: the Service-Finder project, the Information Extraction technique, and the Pica-Pica Web Service Description Crawler.
The task of this master program is to obtain the available Web services and their related information from the Web. In fact, this is a procedure of extracting the needed information for each service, such as the service's WSDL document and its properties. Therefore, the Information Extraction technique, which is used to extract information hosted in the Web, can be applied in this master program.
Moreover, Service-Finder is a large project that is not only able to obtain the available Web services and their related information from the Web, but also enriches the crawled data with annotations by means of the Service-Finder ontology and the Service Category ontology, and integrates all the information into a coherent semantic model. Furthermore, it also provides capabilities for searching and browsing the data through a user interface, and gives users service recommendations. However, for a master program the Service-Finder project far exceeds the requirements; therefore, it is only considered as a reference for this master program.
Furthermore, since the Pica-Pica Web Service Description Crawler aims only at obtaining the available Web services and their related information, it fulfills the primary task of this master program. Nevertheless, regarding the information about the services, the Pica-Pica Web Service Description Crawler extracts only a few properties, sometimes even none. Consequently, in order to improve the quality of the service descriptions, as many properties about each service as possible have to be extracted. Hence, chapter 3 presents an extension of this Pica-Pica Web Service Description Crawler.
3 Design and Implementation
In the previous chapter on the state of the art, already existing techniques and implementations were presented. In the following, the basic principle of the proposed approach, the Deep Web Services Crawler, is introduced. It is based on these existing techniques, especially on the Pica-Pica Web Service Description Crawler.
3.1 Deep Web Services Crawler Requirements
This section discusses the goals of the Deep Web Service Crawler approach, the system requirements with respect to the approach, and some non-functional requirements.
3.1.1 Basic Requirements for DWSC
The following list contains the basic requirements which should be achieved:
(1) Produce the largest annotated Service Catalogue. Here, the Service Catalogue is actually a list of Web services published in the Web Service Registries. It contains not only the WSDL document of each service but also a file with the information about that service. Therefore, for the purpose of producing the largest annotated Service Catalogue, the proposed approach needs to extract as many properties about those Web services as possible. Moreover, it also has to download the WSDL document hosted along with each Web service. That is to say, these properties comprise not only the interesting structured properties, such as the service name and its WSDL link address, but also other non-functional properties, for example the endpoint and its monitoring information.
(2) Flexible storage: each service has a Service Catalogue entry that contains all its interesting properties. How to deal with those service properties is a major problem, i.e. which schemes should be used to store them. Hence, in order to store them in a flexible way, the proposed approach provides three storage methods: the first stores them as an XML file, the second stores them in an INI file, and the third uses a database for storage.
3.1.2 System Requirements for DWSC
Generally speaking, the requirements for realizing a programming project comprise the following:
1) Operating system: Linux/Unix, Windows (XP, Vista, 2000, etc.)
2) Programming language: C++, Java, Python, C, etc.
3) Programming tool: NetBeans, Eclipse, Visual Studio, and so on
In this master thesis, however, the scripts of the Deep Web Service Crawler approach are written in the Java programming language. Besides, these scripts have only been tested on the Windows XP and Linux operating systems, but not on other operating systems.
3.1.3 Non-Functional Requirements for DWSC
In this part, several non-functional requirements for the Deep Web Service Crawler approach are presented:
1) Transparency: the process of data exploration and data storage should be done automatically, without the user's intervention. However, at first the user has to specify the path on the hard disk which will be used to store the outputs of this program.
2) Fault tolerance: during the execution of this program, errors can inevitably happen. Therefore, in order to keep the process from being interrupted, the necessary error handling for process recovery must be in place.
3) Completeness: this approach should extract as many of the interesting properties of each Web service as possible, e.g. endpoint, monitoring information, etc.
In addition, since the Pica-Pica Web Service Crawler has already implemented strategies for the following five URLs, the proposed approach must support no fewer than these five Web Service Registries:
Biocatalogue: http://www.biocatalogue.com
Ebi: http://www.ebi.ac.uk
Seekda: http://www.seekda.com
Service-Repository: http://www.service-repository.com
Xmethods: http://www.xmethods.net
3.2 Deep Web Services Crawler Architecture
In this section, an overview of the high-level architecture of the Deep Web Services Crawler approach is introduced first. Thereafter, four subsections are presented, each outlining a single component and how the components play together.
The current components and data flows of the Deep Web Service Crawler can be summarized as depicted in Figure 3-1, using the continuous arrows. The crawler first tries to obtain the available service list page links and related service page links by crawling the Web (Web Service Extractor). Then those gathered links are processed in two separate steps: one obtains the service's WSDL document (WSDL Grabber), and the other collects the properties of each service (Property Grabber). Finally, all these data are stored on the storage device (Storage). The whole detailed process of figure 3-1 is illustrated as follows:
Ø Step 1
When the Deep Web Service Crawler starts to run, the File Chooser container requires the user to specify a path on the computer or on any other hard disk. The reason for specifying the path is that the Deep Web Service Crawler program needs a place to store all its outputs.
Ø Step 2
After that, the Web Service Extractor is triggered. It is the main entry point to the specific crawling process. Since the Deep Web Service Crawler program is supposed to crawl for Web Services in some given Web Service Registries, the URL addresses of these Web Service Registries have to be given as initial seeds for the Web Service Extractor process. However, since the page structures of these Web Service Registries are completely different, there is a separate process for each Web Service Registry.
Figure 3-1: Overview of the Basic Architecture of the Deep Web Services Crawler
Ø Step 3
According to the given seed, two types of links are obtained by the Web Service Extractor component: the service list page link and the service page link. A service list page is a page that contains a list of Web Services and possibly some information about these Web Services, while a service page is a page that contains much more information about a single service. Finally, the component forwards these two types of links to the next two components: the Property Grabber and the WSDL Grabber.
Ø Step 4
Then, on the one hand, the Property Grabber component tries to gather the information about the service hosted in the service list page and the service page, such as the name of the service, its description, the rating of this service, etc. Finally, all the information about the service is collected together as the service properties, which are then delivered to the Storage component for further processing.
Ø Step 5
On the other hand, the WSDL Grabber tries to obtain the WSDL link from the service list page or the service page. That is because for some Web Service Registries the WSDL link is hosted in the service list page, as in Biocatalogue, while for the other Web Service Registries it is hosted in the service page, as in Xmethods. After the WSDL link is obtained, it is likewise transmitted to the Storage component for further processing.
Ø Step 6
When the service properties and the WSDL link of a service are received by the Storage component, it stores them on disk. The service properties are stored in three different ways: as an XML file, as an INI file, or as one record inside a database table. For the WSDL link, however, the Storage component first tries to download the page content according to the URL address of the WSDL link. If this succeeds, the page content of the service is stored as a WSDL document on disk.
Ø Step 7
Nevertheless, this is just the crawling process for a single service, from step 3 to step 6. Hence, if there is more than one service, or more than one service list page, in those Web Service Registries, the crawling process from step 3 to step 6 is repeated again and again until no service or service list page remains in those Web Service Registries.
Ø Step 8
Furthermore, after the crawling process for one Web Service Registry finishes, a file is generated that contains some statistical information about this crawling process, for example: the times when the crawling process of this Web Service Registry started and finished, the total number of Web services in this Web Service Registry, how many services have an empty WSDL document, the average number of service properties in this Web Service Registry, and the average time cost for extracting the service properties, getting the WSDL document, and generating the XML file, INI file, etc.
3.2.1 The Function of the Web Service Extractor Component
The Web Service Extractor component is responsible for gathering service list page links and service page links. It pursues a focused crawl of the Web and forwards only service list page and service page links to the subsequent components for analysis and collection purposes. Therefore, it identifies both service list page links and related service page links in these Web Service Registries.
As can be seen from figure 3-2, a crawl for Web Services needs to start from a URL seed. The seed is almost as important as the Web Service Extractor itself, as it highly influences the part of the Web which needs to be crawled. The seed can or shall contain, e.g., Web pages where Web Services are published or which talk about Web Services.
Figure 3-2: Overview of the process flow of the Web Service Extractor Component
After being fed with the URL seed, the Web Service Extractor component starts to get the link of the service list page from the initial page of this URL seed. However, this process differs among the five Web Service Registries. The following shows the different situations in these Web Service Registries.
Ø Service-Repository Web Service Registry
In this Web Service Registry, the link of the first service list page is the URL address of its seed, which means some Web Services can be found on the home page of the Service-Repository Web Service Registry. The word "first" here implies that there is more than one service list page link in this Web Service Registry. Therefore, the process of getting service list page links in this registry is carried on continually until no more service list page links exist.
Ø Xmethods Web Service Registry
Although there are Web Services on the home page of the Xmethods Web Service Registry, they are only a small subset of the Web Services in this registry. Moreover, in the Xmethods Web Service Registry there is only one page containing all Web Services. Therefore, the service list page link of that page has to be obtained.
Ø Ebi Web Service Registry
The situation in the Ebi Web Service Registry is a little like that in the Xmethods Web Service Registry. That is to say, there is also one page that contains all Web Services in this Web Service Registry. However, this page is not the initial page of the input seed. Therefore, more than one operation step is needed to get the service list page link of that page.
Ø Seekda Web Service Registry
In the Seekda Web Service Registry, the Web Services are not contained in the initial page of the input seed. The service list page link can be obtained after several additional operation steps. However, there is a problem with getting the service list page links in this registry: simply put, if more than one page contains Web Services, then for some unknown reason the links of the remaining service list pages cannot be obtained. In other words, only the link of the first service list page can be obtained.
Ø Biocatalogue Web Service Registry
The process of getting service list pages in the Biocatalogue Web Service Registry is almost the same as in the Seekda Web Service Registry. The only difference is that in the Biocatalogue Web Service Registry all service list page links can be obtained if there is more than one service list page.
Then, after getting the link of a service list page, the Web Service Extractor begins to get the link of the service page for each service listed in the service list page. The reason why it can do this is that there is an internal link for every service which leads to its service page. It is worth noting that once a service page link is obtained, this component immediately forwards the service page link and the service list page link to the subsequent two components for further processing. The process of obtaining service page links is carried out continuously until all services listed in that service list page are crawled. Analogously, the process of getting service list pages is also carried out continuously until no more service list pages exist.
3.2.1.1 Features of the Web Service Extractor Component
The main features are described in the following paragraphs.
1) Obtain service list page links
A central task is to obtain the corresponding service list page links. A service list page link is a URL address that leads to a public list of Web Services together with some simple information about these Web services, like the name of a service, an internal URL that links to another page containing the detailed information about that service, and sometimes the link address of the WSDL document.
2) Obtain service page links
Once service list page links are found, a crucial aspect is to extract the internal link of each service to aid the task of service information discovery. Therefore, it is the task of the Web Service Extractor to harvest the HTML page content of the service list page, so that the service page links, which contain much more detailed information about the single Web service, can be obtained.
3.2.1.2 Input of the Web Service Extractor Component
This component depends on specific input seeds. The only input required for this component is a URL seed. This URL seed is one of the URLs listed in section 3.1.3.
3.2.1.3 Output of the Web Service Extractor Component
The component produces two kinds of service-related page links from the Web:
● Service list page links
● Service page links
3.2.1.4 Demonstration of the Web Service Extractor
In order to provide a comprehensive understanding of the process of the Web Service Extractor component, the following figures give an explanation. Though there are five URL addresses in this section, only the URL of the Service-Repository is shown as an example.
1) The input seed is the initial URL address of the Service-Repository, which is "http://www.service-repository.com".
2) As already stated in section 3.2.1, the first service list link of this Web Service Registry is its input seed, "http://www.service-repository.com". Figure 3-3 shows the corresponding service list page of that link.
Figure 3-3: Service list page of the Service-Repository
Figure 3-4: Original source code of the internal link for the Web service "BLZService"
Figure 3-5: Code overview of getting the service page link in the Service-Repository
Figure 3-6: Service page of the Web service "BLZService"
3) Now that the service list page link is known, the next step is to acquire the service page links of the services listed in the service list page. The text in the red box of figure 3-4 shows the internal link of the Web service "BLZService". However, this is not the complete link of the service page: it has to be prefixed with the initial URL address of the Service-Repository Web Service Registry, "http://www.service-repository.com". The code for getting the service page link of a Web service in the Service-Repository is shown in figure 3-5. Therefore, the final link of this service page is "http://www.service-repository.com/service/overview/-210897616". Figure 3-6 shows the corresponding service page of that link.
4) Afterwards, the two links gathered by the Web Service Extractor component, the service list page link and the service page link, are immediately forwarded to the next two components: the WSDL Grabber component and the Property Grabber component.
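Since figure 3-5 itself is not reproduced here, the following minimal Java sketch illustrates the idea behind this step: each relative internal link is extracted from the anchor tags of the service list page and prefixed with the registry's base URL. The regular expression and class name are simplifying assumptions, not the actual implementation code.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Sketch: collect service page links from the HTML of a service list page
    // by prefixing each relative internal link with the registry's base URL.
    public class ServicePageLinkExample {
        public static List<String> getServicePageLinks(String listPageHtml) {
            String base = "http://www.service-repository.com";
            // Assumed pattern: internal links look like href="/service/overview/-210897616"
            Pattern p = Pattern.compile("href=\"(/service/overview/[^\"]+)\"");
            Matcher m = p.matcher(listPageHtml);
            List<String> links = new ArrayList<String>();
            while (m.find()) {
                links.add(base + m.group(1)); // complete the relative link
            }
            return links;
        }
    }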
3.2.2 The Function of the WSDL Grabber Component
The WSDL Grabber component acquires the WSDL link hosted in the Web based on the service list page link or the service page link. The whole process flow is illustrated in figure 3-7.
Figure 3-7: Overview of the process flow of the WSDL Grabber Component
When the WSDL Grabber component receives the inputs delivered from the previous component, it starts to get the WSDL link of the service based on these inputs. However, although the inputs of the WSDL Grabber component are the links of the service page and the service list page, only one of them contains the WSDL link of the corresponding service. That is to say, the WSDL link exists either in the service page or in the service list page. The reason why both links need to be delivered to this component is that exactly one of the five Web Service Registries, namely Biocatalogue, hosts the WSDL link in the service list page, while for the other four Web Service Registries the WSDL link is hosted in the service page. Therefore, the WSDL links of these four Web Service Registries are obtained by means of the service page link, and for the Biocatalogue Web Service Registry the WSDL link is obtained through the service list page link. However, there is a problem with getting the WSDL links of the Biocatalogue Web Service Registry: in brief, some of the Web services listed in the service list page of the Biocatalogue Web Service Registry have no WSDL link, in other words these services have no WSDL document. In such a situation, the value of the WSDL link of these Web services is assigned the value "NULL". Nevertheless, for the Web Services in the other four Web Service Registries the WSDL link always exists in the service page. Eventually, whenever the WSDL Grabber component extracts the WSDL link of a single Web service, the link is immediately forwarded to the Storage component for downloading the WSDL document at once.
3.2.2.1 Features of the WSDL Grabber Component
The WSDL Grabber component has to provide the following feature:
● Obtain WSDL links
The WSDL link is the direct way to get to the page that contains the contents of the WSDL document. It is actually a URL address, but the end of this URL address carries something like "?wsdl" or "?WSDL" to indicate that it leads to the page of a WSDL document.
3.2.2.2 Input of the WSDL Grabber Component
This component requires the following input data:
● Service list page link
● Service page link
3.2.2.3 Output of the WSDL Grabber Component
The component produces only the following output data:
● The URL address of the WSDL link of each service
3.2.2.4 Demonstration of the WSDL Grabber Component
This section presents a list of figures in order to give a comprehensive understanding of the process of the WSDL Grabber component. It uses the same Web service of the Service-Repository as an example.
1) The input of the WSDL Grabber component is the link of the service page obtained from the Web Service Extractor component. The address of this link is "http://www.service-repository.com/service/overview/-210897616".
Figure 3-8: WSDL link of the Web service "BLZService" in the service page
2) Figure 3-9 shows the corresponding original source code of the WSDL link highlighted in figure 3-8.
Figure 3-9: Original source code of the WSDL link for the Web service "BLZService"
3) Figure 3-10 and figure 3-11 show the code used to extract the WSDL link displayed in figure 3-9. However, figure 3-10 shows the particular code for the Service-Repository Web Service Registry only; for the other four Web Service Registries this code is different. The function "getServiceRepositoryWSDLLink" first gets a list of nodes that have the same HTML tag name "b". Then it checks all of the nodes one by one to see whether the text value of a node is "WSDL". If a node fulfills this condition, then the attribute value of its sibling, which is an "a" element here, is extracted as the value of the WSDL link for this Web service.
Figure 3-10: Code overview of the "getServiceRepositoryWSDLLink" function
Figure 3-11: Code overview of the "oneParameter" function
4) Finally, after applying the functions shown in figures 3-10 and 3-11, the WSDL link for the Web service "BLZService" can be obtained. Its value is "http://services.unitedplanet.de/blz/BlzService.asmx?WSDL".
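As figures 3-10 and 3-11 are not reproduced here, the following Java sketch mirrors the logic just described: it walks over all "b" nodes, looks for the text value "WSDL", and takes the "href" attribute of the neighbouring "a" element. It is a simplified reconstruction over a generic DOM, not the original code; the class, helper, and variable names are assumptions.

    import org.w3c.dom.Document;
    import org.w3c.dom.Element;
    import org.w3c.dom.Node;
    import org.w3c.dom.NodeList;

    // Simplified reconstruction of the described extraction logic:
    // find a <b> node whose text is "WSDL" and read the link from the
    // adjacent <a> element.
    public class WsdlLinkExample {
        public static String getServiceRepositoryWSDLLink(Document page) {
            NodeList boldNodes = page.getElementsByTagName("b");
            for (int i = 0; i < boldNodes.getLength(); i++) {
                Node b = boldNodes.item(i);
                if ("WSDL".equals(b.getTextContent().trim())) {
                    // Skip over whitespace text nodes to the next element sibling
                    Node sibling = b.getNextSibling();
                    while (sibling != null && sibling.getNodeType() != Node.ELEMENT_NODE) {
                        sibling = sibling.getNextSibling();
                    }
                    if (sibling != null && "a".equals(sibling.getNodeName())) {
                        return ((Element) sibling).getAttribute("href");
                    }
                }
            }
            return "NULL"; // convention used when no WSDL link exists
        }
    }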
3.2.3 The Function of the Property Grabber Component
The Property Grabber component is a module which is used to extract and gather all the Web service information hosted in the Web, which is in fact the information shown in the service list page and the service page. In the end, all the obtained Web service information is collected together as the service properties, which are delivered to the Storage component for storing. The detailed process flow of the Property Grabber component is illustrated in figure 3-12.
The inputs of the Property Grabber component are the same as the inputs of the WSDL Grabber component. However, there is still a small difference between them with respect to the seeds. As already mentioned in section 3.2.2, for the WSDL Grabber component only one of the inputs is sufficient to get the WSDL link, while the Property Grabber component needs both of them as seeds. Once the Property Grabber component receives the needed inputs, it starts to extract the information of that single Web service.
Figure 3-12: Overview of the process flow of the Property Grabber Component
After the Property Grabber component receives the inputs, it starts to extract the service information for the Web service. Generally speaking, the service information consists of four aspects: structured information, endpoint information, monitoring information, and Whois information.
(1) Structured Information
The structured information can be obtained by extracting the information hosted in the service page and the service list page. It is the basic descriptive information about the service, such as the name of the service, the URL address through which the WSDL document can be obtained, the description introducing the service, the provider who provides that service, its rating, and the server which hosts this service, etc. However, the elements constituting this structured information vary among the different Web Service Registries. For example, the rating information of a Web service exists in the Service-Repository Web Service Registry, while the Xmethods Web Service Registry does not have this information. In addition, even for Web services in the same Web Service Registry, some elements of the structured information may not exist. For instance, one service in a Web Service Registry may have a description, while another service in the same Web Service Registry does not. Tables 3-1 to 3-5 show the structured information that should be extracted from these five Web Service Registries. Moreover, if the style of a service in the Biocatalogue Web Service Registry is SOAP, there is some additional information describing the SOAP operations of this service; if it is REST, this additional information describes the REST operations. These should also be considered as part of the structured information. Table 3-6 and table 3-7 show the information for these two kinds of operations.
Service Name | WSDL Link | WSDL Version
Provider | Server | Rating
Homepage | Owner Homepage | Description
Table 3-1: Structured information of the Service-Repository Web Service Registry
Service Name | WSDL Link | Provider
Service Style | Homepage | Implementation Language
Description | User Description | Contributed Client Name
Type of this Client | Publisher for this Client | Used Toolkit of this Client
Used Language of this Client | Used Operating System of this Client
Table 3-2: Structured information of the Xmethods Web Service Registry
Service Name | WSDL Link | Server
Provider | Provider's Country | Service Style
Rating | Description | User Description
Service Tags | Documentation (within WSDL)
Table 3-3: Structured information of the Seekda Web Service Registry
Service Name | WSDL Link | Port Name
Service URL Address | Implementation Class
Table 3-4: Structured information of the Ebi Web Service Registry
Service Name | WSDL Link | Style
Provider | Provider's Country | View Times
Favorite Times | Submitter | Service Tags
Total Annotation | Provider Annotation | Member Annotation
Registry Annotation | Base URL | SOAP Lab Server Base URL
Description | User Description | Category
Table 3-5: Structured information of the Biocatalogue Web Service Registry
SOAP Operation Name | Inputs and Outputs | Operation Description
Operation Tags | Part of Which Service
Table 3-6: SOAP operation information of the Biocatalogue Web Service Registry
REST Operation Name | Service Tags | Used Template
Operation Description | Part of Which Service | Part of Which Endpoint Group
Table 3-7: REST operation information of the Biocatalogue Web Service Registry
(2) Endpoint Information
Endpoint information is the port information of the Web service, which can be extracted only through the service page. However, different Web Service Registries have different structures for the endpoint information of a Web service; hence, some elements of the endpoint information can be very diverse. One thing needs attention: the Ebi Web Service Registry does not have endpoint information for any of the Web services published in this registry. Moreover, though the Web services in the same Web Service Registry share the same structure of this endpoint information, some elements of the endpoint information may be missing or empty. Furthermore, these Web Service Registries may even have no endpoint information at all for some Web services published by them. Nevertheless, whenever there is endpoint information for a Web service, there is at least one element, which is the URL address of the endpoint. Table 3-8 shows the endpoint information that should be extracted from these five Web Service Registries.
Web Service Registry Name | Elements of the Endpoint Information
Service Repository | Endpoint Name, Endpoint URL, Endpoint Critical, Endpoint Type, Bound Endpoint
Xmethods | Endpoint URL, Publisher of this Endpoint, Contact Email of this Publisher, Implementation Language of this Endpoint
Seekda | Endpoint URL
Biocatalogue | Endpoint Name, Endpoint URL
Table 3-8: Endpoint information of these five Web Service Registries
Web Service Registry Name | Elements of the Monitoring Information
Service Repository | Service Availability, Number of Downs, Total Uptime, Total Downtime, MTTR, MTBF, RTT Max of Endpoint, RTT Min of Endpoint, RTT Average of Endpoint, Ping Count of Endpoint
Seekda | Service Availability, Begin Time of Monitoring
Biocatalogue | Monitored Status of Endpoint, Overall Status of Service
Table 3-9: Monitoring information of these five Web Service Registries
(3) Monitoring Information
Monitoring information is the measured statistical information about the Web service. It is worth noting that the Ebi and Xmethods Web Service Registries do not have monitoring information for any of the Web services published by them, while in the other three Web Service Registries only a few Web services may lack this information. Table 3-9 displays the monitoring information for these three Web Service Registries.
(4) Whois Information
Whois information is not extracted from the information hosted in the service page and service list page. It is the descriptive information for the service domain, which can be gained by means of the address of the WSDL link. Because of that, the process of getting the Whois information starts by obtaining the service domain first. The final value of the service domain must not contain strings like "http://", "https://", "www.", etc.; it must be the registrable domain itself (a sketch of this domain extraction follows table 3-10). After that, the service domain database is queried by sending the value of the service domain to a Whois client, which is simply a Web site on the Internet, for example "http://www.whois365.com/cn/domain". Then, if information about that service domain exists, a list of its information is returned as the output. However, the structure of the returned information varies with the service domain; therefore, the most challenging thing is that the extraction process has to deal with each different form of the returned information. Table 3-10 gives the Whois information that needs to be extracted for all these five Web Service Registries.
Service Domain URL | Domain Name | Domain Type
Domain Address | Domain Description | State
Postal Code | City | Country
Country Code | Phone | Fax
Email | Organization | Established Time
Table 3-10: Whois information for these five Web Service Registries
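The following Java sketch shows one way the service domain could be derived from a WSDL link as described above. The use of java.net.URL and the simple two-label heuristic are assumptions for illustration only, since registrable domains with country-code suffixes (e.g. .ac.uk) would need a more elaborate rule.

    import java.net.MalformedURLException;
    import java.net.URL;

    // Sketch: reduce a WSDL link to its service domain, e.g.
    // "http://www.thomas-bayer.com/axis2/services/BLZService?wsdl"
    // becomes "thomas-bayer.com".
    public class ServiceDomainExample {
        public static String getServiceDomain(String wsdlLink) throws MalformedURLException {
            String host = new URL(wsdlLink).getHost(); // strips protocol, path, query
            String[] labels = host.split("\\.");
            int n = labels.length;
            if (n <= 2) {
                return host; // already a bare domain
            }
            // Naive heuristic: keep only the last two labels
            return labels[n - 2] + "." + labels[n - 1];
        }

        public static void main(String[] args) throws MalformedURLException {
            System.out.println(getServiceDomain(
                "http://www.thomas-bayer.com/axis2/services/BLZService?wsdl"));
        }
    }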
Finally, all the information of these four aspects is collected together and then delivered to the Storage component for further storage processing.
3.2.3.1 Features of the Property Grabber Component
The Property Grabber component has to provide the following features:
● Obtain basic information
Generally speaking, the more information a Web service has, the better one can judge how good this Web service is. Hence, it is necessary for this Property Grabber component to extract all the basic information hosted in the service list page and the service page. This basic information comprises the structured information, the endpoint information, and the monitoring information.
● Obtain Whois information
For the same reason, namely that more information about a Web service allows a better assessment of its quality, it is necessary to extract as much information about the Web service as possible. Therefore, besides the basic information, the Property Grabber component also obtains some additional information called Whois information, such as the type of the domain, the name of the person in charge, the postal code of the domain, city, phone, fax, detailed address, email, etc.
3.2.3.2 Input of the Property Grabber Component
This component requires the following input data:
● Service list page link
● Service page link
3.2.3.3 Output of the Property Grabber Component
The component produces the following output data:
● Structured information of each service
● Endpoint information of each service, if it exists
● Monitoring information for the service and endpoint, if it exists
● Whois information of the service domain
All this information is collected together as the properties of each service. Thereafter, the collected properties are sent to the Storage component.
3.2.3.4 Demonstration of the Property Grabber Component
The pictures from figure 3-13 to figure 3-16 depict the fundamental procedure of the Property Grabber component. To simplify the explanation, the example shown in this section uses the same Web service as before.
1) The inputs of the Property Grabber component are the links of the service list page and the service page received from the Web Service Extractor component. These links are "http://www.service-repository.com" and "http://www.service-repository.com/service/overview/-210897616".
2) When the "GetProperty" function is triggered, it calls its four sub-functions shown in figure 3-12.
Figure 3-13: Structured properties of the service "BLZService" in the service list page
Figure 3-14: Structured properties of the service "BLZService" in the service page
3) Firstly, the "getStructuredProperty" function tries to extract the structured information displayed in the red boxes of figure 3-13 and figure 3-14. However, several elements of the structured information have the same content, like the descriptions shown in the service page and the service list page. Hence, in order to save time in the extraction process and space in the storing process, elements with the same content are only extracted once. Moreover, the rating information requires a transformation from non-descriptive to descriptive text, because its content consists of several star images. The final results of the extracted structured information for this Web service are shown in table 3-11. Because there is no descriptive information for the provider, homepage, and owner homepage, their values are assigned "NULL".
Service Name | BLZService
WSDL Link | http://www.thomas-bayer.com/axis2/services/BLZService?wsdl
WSDL Version | 0
Server | Apache-Coyote/1.1
Description | BLZService
Rating | Four Stars and a Half
Provider | NULL
Homepage | NULL
Owner Homepage | NULL
Table 3-11: Extracted structured information of the Web service "BLZService"
4) Secondly, the "getEndpointProperty" function extracts the endpoint information displayed in the red box of figure 3-15. Though there is more than one endpoint record in the list, only one of them is extracted as the endpoint information. That is because this master program is intended to extract as much information as possible, but this information should not contain redundant entries. For that purpose, only one record is extracted as the endpoint information, even if there is more than one endpoint record. Table 3-12 shows the final results of the endpoint information of this Web service.
Figure 3-15: Endpoint information of the Web service "BLZService" in the service page
Endpoint Name | BLZServiceSOAP12port_http
Endpoint URL | http://www.thomas-bayer.com:80/axis2/services/BLZService
Endpoint Critical | True
Endpoint Type | production
Bound Endpoint | BLZServiceSOAP12Binding
Table 3-12: Extracted endpoint information of the Web service "BLZService"
5) Then it is time to extract the monitoring information by invoking the function "getMonitoringProperty". Figure 3-16 displays two types of monitoring properties: the red box above contains the monitoring information about the Web service, and the red box below lists the monitoring information for its endpoints. As already mentioned, only one endpoint statistic record is extracted. Besides, as can be seen from figure 3-16, there are two availability figures. They both represent the availability of this Web service, just like the availability shown in figure 3-14; therefore, one of these availability values is sufficient. Table 3-13 shows the final results of this extraction process.
Figure 3-16: Monitoring information of the service "BLZService" in the service page
Service Availability | 100 %
Number of Downs | 0
Total Uptime | 1 day, 19 hours, 19 minutes
Total Downtime | 0 seconds
MTBF | 1 day, 19 hours, 19 minutes
MTTR | 0 seconds
RTT Max of Endpoint | 141 ms
RTT Min of Endpoint | 0 ms
RTT Average of Endpoint | 57.7 ms
Ping Count of Endpoint | 112
Table 3-13: Extracted monitoring information of the Web service "BLZService"
6) Next, in order to extract the Whois properties, the "getWhoisProperty" function first has to obtain the service domain from the WSDL link. For this Web service, the obtained service domain is "thomas-bayer.com". It then sends this service domain as input to the Whois client for the querying process, which returns a list of information for that service domain; see figure 3-17. Table 3-14 shows the extracted Whois information.
Figure 3-17: Whois information of the service domain "thomas-bayer.com"
Service Domain URL | thomas-bayer.com
Domain Name | Thomas Bayer
Domain Type | NULL
Domain Address | Moltkestr. 40
Domain Description | NULL
State | NULL
Postal Code | 54173
City | Bonn
Country | NULL
Country Code | DE
Phone | +4922855525760
Fax | NULL
Email | info@predic8.de
Organization | predic8 GmbH
Established Time | NULL
Table 3-14: Extracted Whois information of the service domain "thomas-bayer.com"
7) Finally, all the information of these four aspects is collected together as the service properties, and these service properties are forwarded to the Storage component.
3.2.4 The Function of the Storage Component
The Storage component uses the WSDL link from the WSDL Grabber component to download the WSDL document from the Web and store it on disk thereafter. In addition, the service properties from the Property Grabber component are also directly stored on disk, in three different manners, by this Storage component. Figure 3-18 illustrates the process flow of the Storage component.
When the Storage component receives the needed inputs, the mediator function "Storager" of the Storage component is triggered. After that, it transforms the service properties into three different output formats and stores them on disk. These output formats are an XML file, an INI file, and database records. Besides, it also tries to download the WSDL document from the URL address of the WSDL link and then stores the obtained WSDL document on disk too. This "Storager" function is composed of four sub-functions, namely the "getWSDL", "generateXML", "generateDatabase", and "generateINI" sub-functions. Each sub-function is in charge of one aspect of the storage tasks.
Figure 3-18: Overview of the process flow of the Storage Component
(1) The "getWSDL" sub-function
The task of the "getWSDL" sub-function is to download the WSDL document and then store it on disk. Above all, it has to get the content of the WSDL document. This procedure is done as follows. First, the "getWSDL" sub-function checks whether the value of the received WSDL link equals "NULL". As already presented in section 3.2.2, if a Web service has no WSDL link, the value of its WSDL link is assigned "NULL". In such a case, the sub-function creates a WSDL document whose name is the service name appended with the mark "No WSDL Document"; obviously, this document does not contain any content, it is an empty document. If the service does have a WSDL link, this sub-function tries to connect to the Internet using the URL address of the WSDL link. If it succeeds, the content hosted on the Web is downloaded, stored on disk, and named with the name of the service only. Otherwise, it creates a WSDL document whose name is prefixed with "Bad" before the service name.
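A minimal Java sketch of this naming and download logic is given below. It is a reconstruction of the behaviour just described, not the code of figure 3-19; error handling and the statistics/log bookkeeping are omitted, and the exact file-name suffixes are taken from the text above.

    import java.io.InputStream;
    import java.net.URL;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    // Sketch of the described getWSDL behaviour: empty marker file for a
    // "NULL" link, downloaded document on success, "Bad" prefix on failure.
    public class GetWsdlExample {
        public static void getWSDL(String path, String name, String linkStr) {
            try {
                if ("NULL".equals(linkStr)) {
                    // Service has no WSDL link: create an empty marker document
                    Files.createFile(Paths.get(path, name + " No WSDL Document.wsdl"));
                    return;
                }
                try (InputStream in = new URL(linkStr).openStream()) {
                    Files.copy(in, Paths.get(path, name + ".wsdl"));
                }
            } catch (Exception e) {
                try {
                    // Download failed: mark the service with a "Bad" prefix
                    Files.createFile(Paths.get(path, "Bad" + name + ".wsdl"));
                } catch (Exception ignored) { }
            }
        }
    }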
(2) The "generateXML" sub-function
The "generateXML" sub-function takes the service properties as input, transforms them into an XML file, and stores it on disk under the name of the service plus ".xml". XML stands for eXtensible Markup Language, a markup language designed to transport and store data. The first line of an XML file is the XML declaration, which defines the XML version and the encoding used. For example, "<?xml version="1.0" encoding="UTF-8"?>" means this is an XML file whose version is 1.0 and whose encoding is the 8-bit Unicode Transformation Format character set. Besides, an XML file contains XML elements, where an element is everything from the element's start tag to the element's end tag. Moreover, an XML element can also contain other elements, simple text, or a mixture of both. However, an XML file must contain a root element as the parent of all other elements.
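For the "BLZService" example of table 3-11, the generated XML file could thus look roughly as follows; the element names are illustrative assumptions, since the actual file layout is not reproduced here.

    <?xml version="1.0" encoding="UTF-8"?>
    <service>
        <name>BLZService</name>
        <wsdlLink>http://www.thomas-bayer.com/axis2/services/BLZService?wsdl</wsdlLink>
        <server>Apache-Coyote/1.1</server>
        <rating>Four Stars and a Half</rating>
    </service>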
(3) The "generateINI" sub-function
The "generateINI" sub-function also takes the service's properties as input, but it transforms them into an INI file and then stores it on disk under the name of the service plus ".ini". "INI" stands for initialization. The INI file format is a de facto standard for configuration files; INI files are just simple text files with a basic structure. Generally speaking, an INI file contains three different parts: sections, parameters, and comments. The parameter is the basic element contained in an INI file. Its format is a key-value pair, which can also be called a name-value pair. This pair is delimited by an equals sign "=", and the key or name always appears to the left of the equals sign. A section is like a room that groups all its parameters together. It always appears on a single line within a pair of square brackets "[]", and sections may not be nested. A comment is descriptive text which begins with a semicolon ";"; anything between the semicolon and the end of the line is ignored.
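Again for the "BLZService" example, a hypothetical excerpt of such an INI file might look as follows; the section and key names are assumptions:

    ; Properties of the service BLZService
    [StructuredInformation]
    ServiceName=BLZService
    WSDLLink=http://www.thomas-bayer.com/axis2/services/BLZService?wsdl
    Rating=Four Stars and a Half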
(4) The "generateDatabase" sub-function
The inputs of the "generateDatabase" sub-function are the same as those of the previous two sub-functions. Instead of transforming them into a file like XML or INI, the "generateDatabase" sub-function turns them into data in the database by using SQL statements. SQL stands for Structured Query Language, a database language designed for accessing and manipulating data in a database. Most of the actions performed on a database are done with SQL statements, and the primary statements of SQL include INSERT INTO, DELETE, UPDATE, SELECT, CREATE, ALTER, and DROP. Therefore, for the purpose of transforming the service properties into data in the database, this sub-function first has to create a database, using the "CREATE DATABASE" statement. Then it creates a table to store the data. A table is a collection of related data entries and consists of columns and rows. Since the data for all these five Web Service Registries are not very large, one database table is enough for storing the service properties. Because of that, the field names of the service properties in the columns must be uniform and well-defined for all five Web Service Registries. Afterwards, the service properties of each single service can be put into the table as a record with the "INSERT INTO" statement of SQL.
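A minimal JDBC sketch of these two steps is shown below. The table and column names, the JDBC URL, and the restriction to three columns are illustrative assumptions; the real implementation (figures 3-22 and 3-23) uses one TEXT column per service property.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.Statement;

    // Sketch: create one table for all registries and insert one service
    // record; every property column uses the TEXT data type, as described.
    public class GenerateDatabaseExample {
        public static void main(String[] args) throws Exception {
            Connection con = DriverManager.getConnection(
                "jdbc:mysql://localhost/dwsc", "user", "password"); // assumed URL
            Statement create = con.createStatement();
            create.executeUpdate("CREATE TABLE IF NOT EXISTS serviceproperties ("
                + "servicename TEXT, wsdllink TEXT, rating TEXT)");
            PreparedStatement insert = con.prepareStatement(
                "INSERT INTO serviceproperties VALUES (?, ?, ?)");
            insert.setString(1, "BLZService");
            insert.setString(2, "http://www.thomas-bayer.com/axis2/services/BLZService?wsdl");
            insert.setString(3, "Four Stars and a Half");
            insert.executeUpdate();
            con.close();
        }
    }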
3.2.4.1 Features of the Storage Component
The Storage component has to provide the following features:
● Generate different output formats
The final result of this master program is to store the information about the services on disk for future work. This Storage component provides three different formats for storing the service properties of the services published in the Web Service Registries. This makes the storage of the services very flexible and also long-lived.
● Obtain the WSDL document
An important task of this component is to check whether the WSDL document can be obtained from the WSDL link, because the WSDL document plays a decisive role in determining the quality of the service. This Storage component provides the ability to deal with the different situations in the process of obtaining the WSDL document.
3.2.4.2 Input of the Storage Component
This component requires the following input data:
● WSDL link of each service
● Property information of each service
3.2.4.3 Output of the Storage Component
The component produces the following output data:
● WSDL document of each service
● XML document, INI file, and table records in the database
3.2.4.4 Demonstration of the Storage Component
The following figures describe the fundamental implementation code of this Storage component. The detailed depiction is as follows:
1) As can be seen from figures 3-19 to 3-23, there are several common places among the implementation codes. The first common place concerns the parameters defined in each of these sub-functions, namely "path" and "SecurityInt". The parameter "path" is the absolute path on the computer disk. It is used for the storing procedure and has already been specified by the user at the beginning of the whole program. The parameter "SecurityInt" is an increasing integer which is used as part of the name of the service; this prevents services that have the same name from overriding each other on disk. The content of the red marks in the code of these figures is the second common place; its function is to create a file or document for the service with the corresponding parameters.
2) Figure 3-19 displays the implementation code of the "getWSDL" sub-function, which is designed to get the WSDL document of a service based on its WSDL link. The parameter "name" is the name of the service, which is used as the name of the WSDL document. The parameter "linkStr" is the actual WSDL link, the most important parameter of this sub-function. The other two parameters, "statistic" and "log", are the objects of the text files called "Statistic Information" and "Log Information" respectively. The "Statistic Information" text file is used to record the statistical data of the services in each Web Service Registry, such as the overall number of properties for that Web Service Registry, the overall number of services, the number of services that have no WSDL link, the number of services containing no content, the number of services whose WSDL link is not available, etc. The "Log Information" text file records the results of the process steps and the problems encountered, for example which service is being crawled now, which Web Service Registry it belongs to, the reason why the WSDL document for a service cannot be obtained, the reason why a service is unreachable, and so on.
Figure 3-19: Implementation code for getting the WSDL document
3) Figure 3-20 and figure 3-21 show the code for turning the service properties into the XML file and the INI file, and for storing those two files on disk thereafter. The parameter "vec" is a Vector of the "PropertyStruct" data type. This "PropertyStruct" data type is an object of a class consisting of two variables: name and value.
Figure 3-20: Implementation code for generating the XML file
Figure 3-21: Implementation code for generating the INI file
4) The code in figure 3-22 and figure 3-23 shows the process that turns the service properties of all services in these five Web Service Registries into records in the database. To this end, a database has to be created first. The name of the database can be arbitrary as long as it conforms to the naming rules of the database; the same holds for the name of the table. Figure 3-22 displays the procedure of creating a table in the database. Since it is hard to decide on the length of each service property, the data types of all service properties are set to "Text". Figure 3-23 shows the code for inserting the service properties into the table with the "update" statement.
Figure 3-22: Implementation code for creating the table in the database
Figure 3-23: Implementation code for generating the table records
3.3 Multithreaded Programming for DWSC
Multithreaded programming is a built-in characteristic provided by the Java language. A multithreaded program contains two or more separate parts that can execute concurrently; each such part of a program is called a thread. The use of multithreading makes it possible to create programs that use the system resources efficiently, for example making maximum use of the CPU, because the idle time of the CPU can be kept to a minimum.
In this master program, there are five Web Service Registries that need to be crawled for the services published in them. Moreover, the number of services published in each Web Service Registry is quite different, which makes the running time of each Web Service Registry different. It can then happen that a Web Service Registry which owns fewer services has to wait for execution until another Web Service Registry which has many more services finishes. Therefore, in order to reduce the waiting time for the other Web Service Registries and to maximize the use of the system resources, it is necessary to apply multithreaded programming in this master program. That is to say, this master program creates a thread for each Web Service Registry, and these threads are executed independently, as sketched below.
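The following Java sketch shows the one-thread-per-registry scheme just described; the class name and the per-registry crawl method are assumptions, since the actual thread code is not shown in this thesis.

    // Sketch: start one independent crawler thread per Web Service Registry.
    public class RegistryThreadsExample {
        public static void main(String[] args) {
            String[] registries = { "Service Repository", "Ebi", "Xmethods",
                                    "Seekda", "Biocatalogue" };
            for (final String registry : registries) {
                new Thread(new Runnable() {
                    public void run() {
                        crawl(registry); // registry-specific crawling process
                    }
                }).start();
            }
        }

        static void crawl(String registry) {
            // placeholder for the registry-specific crawling logic
            System.out.println("Crawling " + registry);
        }
    }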
3.4 Sleep Time Configuration for Web Service Registries
Since this master program downloads the WSDL documents and extracts the service information of the Web services published in the Web Service Registries, it inevitably affects the performance of these registries. In addition, in order not to be overloaded beyond their throughput capacity, these Web Service Registries restrict the rate of access. Because of that, unknown errors sometimes occur while this master program is executing: for instance, the program may halt at one point without obtaining any further WSDL documents and service information, the WSDL documents of some services of some Web
Service Registries cannot be obtained, or some service information is missing. Therefore, in order to obtain as many as possible of the Web services published in these five Web Service Registries without affecting their throughput, the access rate has to be configured for each service of all Web Service Registries.
Consequently, before entering the essential procedure for each single service of these Web Service Registries, this master program calls the built-in method "sleep(long milliseconds)". It is a public static method of the Thread class that causes the currently executing thread to sleep, i.e. to temporarily cease execution, for the specified number of milliseconds. The following table shows the time interval of the sleep function for each Web Service Registry; a small sketch of how this throttling can be applied follows the table.
Web Service Registry Name Time Interval (milliseconds)
Service Repository 8000
Ebi 3000
Xmethods 10000
Seekda 20000
Biocatalogue 10000
Table 3-15: Sleep Time of these five Web Service Registries
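The following fragment is a minimal sketch of this throttling, assuming a sleepMillis value taken from Table 3-15 (e.g. 20000 ms for Seekda); the crawlService() method is hypothetical.

```java
public class Throttle {
    // Hypothetical per-service crawl step with the sleep-based throttling
    // described above; sleepMillis comes from Table 3-15 (e.g. 20000 for Seekda).
    static void crawlService(String serviceUrl, long sleepMillis) {
        try {
            Thread.sleep(sleepMillis); // pause before each request to spare the registry
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // restore the interrupt flag and stop
            return;
        }
        // ... download the service page and WSDL document of serviceUrl ...
    }
}
```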
4 Experimental Results and Analysis
This chapter presents the quantitative experimental results of the prototype described in chapter 3, together with an analysis and explanation of these results. In order to obtain reasonably accurate results, the experiments were carried out more than five times; all data displayed in the following tables and charts are the averages of these runs.
41 Statistic Information for Different Web Service Registries
This section discusses the statistics on the number of Web services published in these five Web Service Registries: the overall number of Web services published in each Web Service Registry, and the number of unavailable Web services, i.e. services that have been archived because they may not be active anymore or are close to becoming inactive. Table 4-1 shows the service amount statistics of these five Web Service Registries.
Web Service Registry Name   Service Repository   Ebi   Xmethods   Seekda   Biocatalogue
Overall Services            57                   289   382        853      2567
Unavailable Services        0                    0     0          0        125
Table 4-1: Service amount statistic of these five Web Service Registries
In order to give an intuitive view of the service amount statistics of these five Web Service Registries, figure 4-1 shows the data of table 4-1 as a bar chart. As can be seen from the bar chart, on the one hand, the overall number of Web services increases steadily from the Service Repository Web Service Registry to the Biocatalogue Web Service Registry. In particular, the Biocatalogue Web Service Registry owns the largest number of Web services; this indicates that the Biocatalogue Web Service Registry has a much greater capacity to provide Web services to the users, because it contains far more services than the other four Web Service Registries. On the other hand, there are no unavailable services in any Web Service Registry except the Biocatalogue Web Service Registry; that is to say, only the Biocatalogue Web Service Registry contains some Web services that cannot be used by the users. To some degree this is wasteful, since these services cannot be used anymore yet still consume network resources on the Web. Therefore, all these unavailable services should be eliminated in order to reduce this waste of network resources.
Figure 4-1: Service amount statistic of these five Web Service Registries
42 Statistic Information for WSDL Document
Web Service Registry Name   Service Repository   Ebi   Xmethods   Seekda   Biocatalogue
Failed WSDL Links           1                    0     23         145      32
Without WSDL Links          0                    0     0          0        16
Empty Content               0                    0     2          0        2
Table 4-2: Statistic information for WSDL Document
Table 4-2 and figure 4-2 show the error statistics of the WSDL documents of the Web services in these five Web Service Registries. There are three aspects to this. The first one is the "Failed WSDL Links" of the Web services in these Web Service Registries, i.e. the overall number of Web services whose WSDL links are invalid. In other words, it is impossible to obtain the WSDL documents of these Web services via the URL addresses of their WSDL links, and therefore no WSDL document is created for them. The second aspect is the "Without WSDL Links" count, i.e. the overall number of Web services in each Web Service Registry that have no WSDL link at all. The value of the WSDL link of such a Web service is "NULL"; a WSDL document is nevertheless created, but it has no content and its name contains the string "[No WSDL Document]". The third aspect is the "Empty Content" count, which represents the overall number of Web services that do have WSDL links whose URL addresses
are valid, but whose WSDL documents contain no content. In this case a WSDL document whose name contains the string "(BAD)" is created.
Figure 4-2: Statistic information for WSDL Document
43 Comparison of Different Average Number of Service Properties
This section compares the average number of service properties in these five Web Service Registries. This average number of service properties is calculated by means of the following equation:

ASP = ONSP / ONS        (1)

where
ASP is the average number of service properties for one Web Service Registry,
ONSP is the overall number of service properties in one Web Service Registry, and
ONS is the overall number of Web services that have already been crawled from the corresponding Web Service Registry.
Figure 4-3 shows the average number of service properties per Web service in these five Web Service Registries. As already mentioned, one of the measurements of the quality of the Web services in a Web Service Registry is the service information: the more information is available about a Web service, the better you know that service, and consequently the better the quality of the Web services the corresponding Web Service Registry can offer to its users. As can be seen from figure 4-3, the Service Repository and Biocatalogue Web Service Registries own a larger number of service properties than the other three Web Service Registries. This directly reflects that these two Web Service Registries provide more detailed information about the Web services published in them, so users can more easily choose the services they need and will be more inclined to use the Web services published in these two Web Service Registries. By contrast, the Xmethods and Seekda
Web Service Registries, which offer less service information about their Web services, provide lower quality for these Web services. Therefore, users may be less inclined to use the Web services provided in these two Web Service Registries, not to mention the Web services published in the Ebi Web Service Registry.
Figure 4-3: Average Number of Service Properties
From the description presented in section 323, the causes of the different numbers of service properties in these Web Service Registries may consist of the following points. First, the amount of structured information about the Web services differs among these five Web Service Registries, and part of the information about some Web services in one Web Service Registry may even be missing or have an empty value; for example, the amount of structured information that is supposed to be extracted for the Web services in the Ebi Web Service Registry differs greatly from that in the Biocatalogue Web Service Registry. Secondly, there is a certain amount of endpoint information in almost all Web Service Registries except the Ebi Web Service Registry, and its absence more or less reduces the overall number of service properties there. Thirdly, some Web Service Registries, such as Xmethods and Ebi, have no monitoring information, whereas the Service Repository Web Service Registry in particular has a large amount of monitoring information about its Web services that can be extracted from the Web. The last cause, obviously, is the amount of whois information for these Web services: if the database of the whois client does not contain information about the service domain of a Web service, then no whois information can be extracted, and even if there is information about the service domain, its amount can be very diverse. Therefore, if a Web Service Registry is in the situation that many service domains of its Web services have no or only little whois information, then the average number of service properties in that registry decreases greatly.
As a result, in order to help users get better acquainted with the Web services provided in a Web Service Registry and to distinguish these Web services, the Web Service Registry should do its best to offer as much information as possible about each of its published Web services.
44 Different Outputs of Web Services
The fundamental task of this master program is to obtain the WSDL documents of the Web services hosted in the Web Service Registries, as well as to extract and gather the properties of these Web services and thereafter store them on disk. Therefore, this section describes the different outputs of this master program, which include the WSDL documents of the Web services, the generated XML and INI files, and the data records of the service properties in the database.
Figure 4-4: WSDL Document format of one Web service
The WSDL document of a Web service is simply read from the Web according to the URL address of its WSDL link, and these data are then stored on disk as the WSDL document. The name of the WSDL document is the service name plus the ending "wsdl", which marks it as a WSDL document. In order to distinguish WSDL documents whose names would be the same although their contents differ, the name of each obtained WSDL document within one Web Service Registry is prefixed with a unique Integer. Figure 4-4 shows the valid WSDL document format of a Web service; its name is "1BLZServicewsdl".
Besides, the obtained service properties are transformed into an XML file, an INI file, and data records in the database. Figure 4-5, figure 4-6, and figure 4-7 show these three different output formats respectively. As can be seen from figure 4-5, it is the INI file of the Web service and its name is "1BLZServiceini"; the Integer is the same as in the WSDL document, because both files belong to the same Web service. The first three lines of the INI file are service comments, which run from the semicolon to the end of the line; they provide the basic information describing this INI file. The following line is the section, enclosed in a pair of brackets; it is important because it indicates that the lines after it contain the information of this Web service. The rest of the lines hold the actual service information as key-value pairs with an equals sign between key and value, and each service property is displayed from the beginning of a line, as in the hypothetical example below.
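For illustration, a file following this layout might look as follows; the property names and values here are invented, not actual crawler output:

```ini
; Service: BLZService
; Registry: Service Repository
; generated by the Deep Web Service Crawler
[service]
name=BLZService
provider=thomas-bayer.com
availability=available
```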
Figure 4-5: INI File format of one Web service
Figure 4-6: XML File format of one Web service
Figure 4-7: Database data format for all Web services
Furthermore, figure 4-6 shows the generated XML file of the Web service "BLZService"; its name is "1BLZServicexml". Needless to say, this XML file belongs to the materials of the same Web service. Though the format of the XML file differs from that of the INI file, their essential contents are the same; that is to say, the values of the service properties do not differ, because both files are generated from the same collection of properties of the same Web service. The XML file also has some comments like those in the INI file, which are enclosed between "<!--" and "-->", and the section of the INI file corresponds to the root element of the XML file. Therefore, all values of the elements under the root "service" in this XML file are the values of the service properties of this Web service, as in the hypothetical sketch below.
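For illustration, an XML counterpart of the INI sketch above might look like this (element names and values again invented for this example):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- service properties extracted from the Service Repository registry -->
<service>
  <name>BLZService</name>
  <provider>thomas-bayer.com</provider>
  <availability>available</availability>
</service>
```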
Finally, as can be seen from figure 4-7, this is the database table used to store the service information of all Web services in these five Web Service Registries. The entire service information of one Web service occupies exactly one record in this table. Because of that, the column names of the table are the union of the names of the service information fields of all Web Service Registries. However, since the column names of a table must be unique, the redundant names in this union have to be eliminated. This is sensible and possible because the names of the service information fields are well-defined and uniform across all five Web Service Registries. In addition, the first column of this table is the primary key, an increasing Integer; its function resembles that of the Integer contained in the names of the XML and INI files. The remaining columns of the table are the corresponding service properties of the Web services. The value "NULL" or "Null" in this table indicates that the respective property of that Web service is empty or missing.
45 Comparison of Average Time Cost for Different Parts of Single Web Service
This section describes the comparison of the average time cost of the different parts of getting one single Web service in all these five Web Service Registries. First, the average time cost of getting one single service in a Web Service Registry has to be calculated, which can be done with the following equation:

ATC = OTS / ONS        (2)

where
ATC is the average time cost for one single Web service,
OTS is the overall time cost of all the Web services in one Web Service Registry, and
ONS is the overall number of Web services that have already been crawled from the corresponding Web Service Registry.
In addition, the average time cost of getting one single service splits into the following six parts: the average time cost of extracting the service properties, of obtaining the WSDL document, of generating the XML file, of generating the INI file, of inserting the service properties into the database table, and of some other procedures, such as getting the service list page link and the service page link. The average time cost of extracting the service properties is obtained by means of the following equation:
ATCSI = OTSSI / ONS        (3)

where
ATCSI is the average time cost of extracting the service properties of one single Web service,
OTSSI is the overall time cost of extracting the service properties of all the Web services in one Web Service Registry, and
ONS is the overall number of Web services that have already been crawled from the corresponding Web Service Registry.
The calculation of the other parts is analogous to the equation for the average time cost of extracting the service properties, while the average time cost of the other procedures equals the average time cost of one single Web service minus the sum of the average time costs of the other five parts. For example, for the Service Repository Web Service Registry in table 4-3 below, the "Others" value is 10042 - (8801 + 918 + 2 + 1 + 53) = 267 milliseconds.
Registry             Service property   WSDL Document   XML File   INI File   Database   Others   Overall
Service Repository   8801               918             2          1          53         267      10042
Ebi                  699                82              2          1          28         11       823
Xmethods             5801               1168            2          1          45         12       7029
Seekda               5186               1013            2          1          41         23       6266
Biocatalogue         39533              762             2          1          66         1636     42000
Table 4-3: Average time cost information for all Web Service Registries (in milliseconds)
Table 4-3 displays the average time cost of one Web service and of its different parts in all these five Web Service Registries. The first column of table 4-3 contains the names of the five Web Service Registries, and the last column is the overall average time cost of one single service in each Web Service Registry, while the remaining columns are the average time costs of the six different parts. In order to give an intuitive view of the data in table 4-3, the data in each column of this table are illustrated in the corresponding figures 4-8 to 4-13.
Figure 4-8: Average time cost for extracting service property in all Web Service Registries
As can be seen from figure 4-8, the average time cost of extracting the service properties in the Biocatalogue Web Service Registry is 39533 milliseconds, which is much larger than in the other four Web Service Registries (8801, 699, 5801, and 5186 milliseconds for Service Repository, Ebi, Xmethods, and Seekda respectively). That is to say, it takes much longer to extract the service properties of the Web services published by the Biocatalogue Web Service Registry. This also indirectly indicates that the Biocatalogue Web Service Registry has the highest average number of service properties, as already discussed in section 43; conversely, the average number of service properties in the Ebi Web Service Registry is the smallest. Moreover, the average time cost in the Xmethods Web Service Registry is larger than in the Seekda Web Service Registry, although, as already known, the average number of service properties is the same for these two Web Service Registries. One cause that might explain why Xmethods costs more time than Seekda is that the process of extracting the service properties in the Xmethods Web Service Registry has to use both the service page and the service list page, whereas only the service page link is needed for the Seekda Web Service Registry.
The average time cost of obtaining the WSDL document in all these five Web Service Registries is displayed in figure 4-9. This average time cost is actually the sum of the average time for extracting the WSDL link of a Web service and the average time for reading the data of the WSDL document from the Web and storing it on disk. As seen from this figure, the average time cost of obtaining the WSDL document is largest in the Xmethods Web Service Registry, at 1168 milliseconds. Although the process of extracting the WSDL link costs a certain amount of time, it does not have a significant influence on the total average time spent obtaining the WSDL document, because the WSDL link of a Web service is almost always gained in one step. Therefore, this implies that the average size of the WSDL documents in the Xmethods Web Service Registry is larger than in the other four Web Service Registries, in particular the Ebi Web Service Registry, which takes just 82 milliseconds for obtaining the WSDL document.
Figure 4-9: Average time cost for obtaining WSDL document in all Web Service Registries
Figures 4-10, 4-11, and 4-12 show the average time costs of generating the three different outputs in all these five Web Service Registries. As can be seen from figures 4-10 and 4-11, the average time for generating the XML file for one Web service is the same for all five Web Service Registries, namely 2 milliseconds, and the average time for generating the INI file is likewise constant, at just 1 millisecond. Even the sum of these two average time costs is so small that it can be neglected when compared to the overall average time cost of getting one Web service in each Web Service Registry, shown in figure 4-13. This implies that the generation of the XML and INI files is finished practically immediately after receiving the service properties of a Web service as input. Furthermore, as can be seen from figure 4-12, although the average time cost of creating the database record for each Web service is larger in all five Web Service Registries than the time for generating the XML and INI files, the operation of creating a database record is still fast.
Figure 4-10: Average time cost for generating XML file in all Web Service Registries
Figure 4-11: Average time cost for generating INI file in all Web Service Registries
Figure 4-12: Average time cost for creating database record in all Web Service Registries
Figure 4-13: Average time cost for getting one Web service in all Web Service Registries
Figure 4-13 gives the average time cost of getting one single Web service in all these five Web Service Registries. Without any doubt, the Biocatalogue Web Service Registry takes the longest time for this process: as the presentation of the different parts above shows, each part needs more time in the Biocatalogue Web Service Registry than elsewhere, except for the process of obtaining the WSDL document, where Biocatalogue does not cost the most. Moreover, a striking observation when looking at figures 4-8, 4-12, and 4-13 is that the shapes of these curves follow almost the same trend. This further indicates that a Web Service Registry that spends more time getting the description information of a Web service also offers more information about that Web service.
5 Conclusion and Further Direction
This master thesis provides a schema that aims to explore the description information of the Web services hosted in different Web Service Registries. The description information of a Web service consists of the WSDL document and the service information about that Web service. Although there is an existing approach, the Pica-Pica Web Service Description Crawler, that can be used to obtain the WSDL documents and the service information of the Web services in these Web Service Registries, its functionality is restricted: it explores only a small subset of the Web services hosted in these Web Service Registries, the storage of these Web services is not flexible, and, most importantly, only a little service information is extracted per Web service; for some Web Service Registries no service information is extracted at all. The work presented in this master thesis, however, is able to explore all Web services in these Web Service Registries. Moreover, with the approach presented in this master thesis the service information of each Web service is extracted as completely as possible, so that the final result is the largest annotated service catalogue ever produced. Furthermore, regarding the storage of the description information of the Web services, this master thesis provides three different output formats, which guarantee not only the completeness but also the longevity of the description information of the Web services.
However, in the implementation performed in this master thesis, the whois client used for querying the information about a service domain returns free text if the information exists, and sometimes this free text differs completely from case to case. As a result, every Web service in all Web Service Registries had to be crawled at least once in the experiment stage, so that all the variations of this free text could be foreseen and processed afterwards. This is a huge amount of work, because there are many Web services in these Web Service Registries. Therefore, in order to simplify this work, another whois client that eases this task needs to be found and used.
Moreover, in the experiment stage of this master thesis, the time cost of getting one Web service is still large. For the purpose of reducing this time, multithreaded programming could also be applied to the individual parts of the process of getting one Web service.
Although the work performed here is specialized to only these five Web Service Registries, the main principles used here are adaptable to other Web Service Registries with only small changes in the implementation code or structure.
6 Bibliography
[1] Nathalie Steinmetz, Holger Lausen, Dario Cerizza, Andrea Turati, Irene Celino, Adam Funk and Michael Erdmann: "D1.1 – Requirement Analysis and Architectural Plan". Emanuele Della Valle (CEFRIEL), June 27, 2008. Available from http://www.service-finder.eu/deliverables/public/7-public/32-d11
[2] Nathalie Steinmetz, Holger Lausen, Dario Cerizza, Andrea Turati, Irene Celino, Adam Funk and Michael Erdmann: "D1.2 – First Design of Service-Finder as a Whole". Emanuele Della Valle (CEFRIEL), July 1, 2008. Available from http://www.service-finder.eu/deliverables/public/7-public/37-d12-first-design-of-service-finder-as-a-whole
[3] Nathalie Steinmetz, Holger Lausen, Irene Celino, Dario Cerizza, Saartje Brockmans, Adam Funk: "D1.3 – Revised Requirement Analysis and Architectural Plan". Emanuele Della Valle (CEFRIEL), April 1, 2009. Available from http://www.service-finder.eu/deliverables/public/7-public/57-d13-revised-requirement-analysis-and-architectural-plan
[4] Chia-Hui Chang, Mohammed Kayed, Moheb Ramzy Girgis, Khaled Shaalan: "A Survey of Web Information Extraction Systems". Volume 18, Issue 10, IEEE Computer Society, pp. 1411–1428, October 2006.
[5] Leonard Richardson: "Beautiful Soup Documentation". October 13, 2008. Available from http://www.crummy.com/software/BeautifulSoup/documentation.html
[6] Hao He, Hugo Haas, David Orchard: "Web Services Architecture Usage Scenarios". February 11, 2004. Available from http://www.w3.org/TR/ws-arch-scenarios
[7] Stephen Soderland: "Learning Information Extraction Rules for Semi-Structured and Free Text". Machine Learning, Volume 34, Issue 1–3, pp. 233–272, Department of Computer Science and Engineering, University of Washington, Seattle, February 1999.
[8] Ian Hickson: "A Vocabulary and Associated APIs for HTML and XHTML". World Wide Web Consortium, Working Draft WD-html5-20100624, January 22, 2008.
[9] Holger Lausen, Jos de Bruijn, Axel Polleres, Dieter Fensel: "The Web Service Modeling Language WSML". WSML Deliverable D16.1 v0.2, March 20, 2005. Available from http://www.wsmo.org/TR/d16/d16.1/v0.2
[10] Dumitru Roman, Holger Lausen, Uwe Keller: "Web Service Modeling Ontology – Standard (WSMO-Standard)". WSMO Deliverable D2, version 1.1, March 06, 2004. Available from http://www.wsmo.org/TR/d2/v1.1
[11] Iris Braum, Anja Strunk, Gergana Stoyanova, Bastian Buder: "ConQo – A Context- and QoS-Aware Service Discovery". TU Dresden, Department of Computer Science, in Proceedings of WWW/Internet 2008.
7 Appendixes
There are two additional outputs of this master program: a log information file and a statistic report file. Figure 8-1 shows one of the basic output formats of the log information for these five Web Service Registries.
Figure 8-1: Log information of the "Service Repository" Web Service Registry
Figure 8-2: Statistic information of the "Service Repository" Web Service Registry
Figures 8-2 to 8-6 show the output format of the statistic information for the "Service Repository", "Ebi", "Xmethods", "Seekda", and "Biocatalogue" Web Service Registries respectively.
Figure 8-3: Statistic information of the "Ebi" Web Service Registry
Figure 8-4: Statistic information of the "Xmethods" Web Service Registry
Figure 8-5: Statistic information of the "Seekda" Web Service Registry
Figure 8-6: Statistic information of the "Biocatalogue" Web Service Registry
Table of Figures
Figure 2-1: Dataflow of Service-Finder and Its Components ... 12
Figure 2-2: Left is the free text input type and right is its output ... 16
Figure 2-3: A semi-structured page containing data records (in rectangular box) to be extracted ... 16
Figure 2-4: Architecture of the Pica-Pica Web Service Description Crawler ... 20
Figure 3-1: Overview of the Basic Architecture for the Deep Web Services Crawler ... 25
Figure 3-2: Overview of the process flow of the Web Service Extractor Component ... 27
Figure 3-3: Service list page of the Service-Repository ... 29
Figure 3-4: Original source code of the internal link for Web service "BLZService" ... 29
Figure 3-5: Code overview of getting service page link in Service Repository ... 29
Figure 3-6: Service page of the Web service "BLZService" ... 29
Figure 3-7: Overview of the process flow of the WSDL Grabber Component ... 30
Figure 3-8: WSDL link of the Web service "BLZService" in the service page ... 31
Figure 3-9: Original source code of the WSDL link for Web service "BLZService" ... 32
Figure 3-10: Code overview of "getServiceRepositoryWSDLLink" function ... 32
Figure 3-11: Code overview of "oneParameter" function ... 32
Figure 3-12: Overview of the process flow of the Property Grabber Component ... 33
Figure 3-13: Structure properties of the Service "BLZService" in service list page ... 37
Figure 3-14: Structure properties of the Service "BLZService" in service page ... 38
Figure 3-15: Endpoint information of the Web service "BLZService" in service page ... 38
Figure 3-16: Monitoring Information of the Service "BLZService" in service page ... 39
Figure 3-17: Whois Information of the service domain "thomas-bayer.com" ... 40
Figure 3-18: Overview of the process flow of the Storage Component ... 41
Figure 3-19: Implementation code for getting WSDL document ... 44
Figure 3-20: Implementation code for generating XML file ... 44
Figure 3-21: Implementation code for generating INI file ... 45
Figure 3-22: Implementation code for creating table in database ... 45
Figure 3-23: Implementation code for generating table records ... 46
Figure 4-1: Service amount statistic of these five Web Service Registries ... 49
Figure 4-2: Statistic information for WSDL Document ... 50
Figure 4-3: Average Number of Service Properties ... 51
Figure 4-4: WSDL Document format of one Web service ... 52
Figure 4-5: INI File format of one Web service ... 53
Figure 4-6: XML File format of one Web service ... 53
Figure 4-7: Database data format for all Web services ... 53
Figure 4-8: Average time cost for extracting service property in all Web Service Registries ... 55
Figure 4-9: Average time cost for obtaining WSDL document in all Web Service Registries ... 56
Figure 4-10: Average time cost for generating XML file in all Web Service Registries ... 57
Figure 4-11: Average time cost for generating INI file in all Web Service Registries ... 57
Figure 4-12: Average time cost for creating database record in all Web Service Registries ... 58
Figure 4-13: Average time cost for getting one Web service in all Web Service Registries ... 58
Table of Tables
Table 3-1: Structured Information of Service-Repository Web Service Registry ... 34
Table 3-2: Structured Information of Xmethods Web Service Registry ... 34
Table 3-3: Structured Information of Seekda Web Service Registry ... 34
Table 3-4: Structured Information of Ebi Web Service Registry ... 34
Table 3-5: Structured Information of Biocatalogue Web Service Registry ... 34
Table 3-6: SOAP Operation Information of Biocatalogue Web Service Registry ... 35
Table 3-7: REST Operation Information of Biocatalogue Web Service Registry ... 35
Table 3-8: Endpoint Information of these five Web Service Registries ... 35
Table 3-9: Monitoring Information of these five Web Service Registries ... 35
Table 3-10: Whois Information for these five Web Service Registries ... 36
Table 3-11: Extracted Structured Information of Web Service "BLZService" ... 38
Table 3-12: Extracted Endpoint Information of the Web service "BLZService" ... 39
Table 3-13: Extracted Monitoring Information of the Web service "BLZService" ... 39
Table 3-14: Extracted Whois Information of service domain "thomas-bayer.com" ... 40
Table 3-15: Sleep Time of these five Web Service Registries ... 47
Table 4-1: Service amount statistic of these five Web Service Registries ... 48
Table 4-2: Statistic information for WSDL Document ... 49
Table 4-3: Average time cost information for all Web Service Registries ... 55
Table of Abbreviations
DTD Document Type Definition
DWSC Deep Web Service Crawler
HTML HyperText Markup Language
MTBF Mean Time between Failures
MTTR Mean Time to Recovery
QoS Quality of Service
REST Representational State Transfer
RTT Round Trip Time
SOAP Simple Object Access Protocol
SQL Structured Query Language
URL Uniform Resource Locator
WHATWG Web Hypertext Application Technology Working Group
XML eXtensible Markup Language
WSDL Web Service Description Language
WSML Web Service Modeling Language
WSMO Web Service Modeling Ontology
- Service availability information
Function:
- The Web Interface allows the users to search services by keyword, tag, or concept in the categorization, to sort and filter query results by refining the query, to compare and bookmark services, and to try out the services that offer this functionality.
- The API allows the developers to invoke Service-Finder functionalities.
Output:
- Explicit user annotations such as tags, ratings, comments, descriptions, and so on
- Implicit user data, for example click streams of users, bookmarks, comparisons, links sent, etc.
- Manual advertising of available new services
2125 The Principle of the Cluster Engine Component
The Cluster Engine is in charge of collecting and analyzing information about user behavior from the Service-Finder Portal, e.g. the queried services and the compared services of the users. Moreover, it also provides cluster data to the Conceptual Indexer and Matcher for providing service recommendations.
The input, function, and output of this component in detail:
Input:
- Service annotation data, both extracted and from user feedback
- Users' click streams, used for extracting user behavior
Function:
- Obtain user clusters from user behavior
- Obtain service clusters from service annotation data, in order to be able to find similar services
Output:
- Clusters of users and services
22 Information Extraction
Due to the rapid development and use of the World-Wide Web, a huge number of information sources have been produced on the Internet, but access to them has been limited to browsing and searching because of the heterogeneity and the lack of structure of Web information sources. Therefore, Information Extraction, which transforms Web pages into program-friendly structures for post-processing, has become a great necessity. The task of Information Extraction is specified in terms of its inputs and its extraction targets, and the technique used in the process of Information Extraction is called the extractor.
221 Input Types of Information Extraction
Generally speaking, there are three different input types. The first input type is the unstructured
document, for example the free text shown in figure 2-2: it is unstructured and written in natural language, so it requires substantial natural language processing. The second input type is the structured document, for instance XML documents, whose data can be described through an available DTD (Document Type Definition) or XML (eXtensible Markup Language) schema. Finally, the third input type is the semi-structured document, which is widespread on the Web, such as the large volume of HTML pages with tables, itemized lists, and enumerated lists; HTML tags are often used to render these embedded data in the HTML pages, see figure 2-3.
Figure 2-2: Left is the free text input type and right is its output [4]
Figure 2-3: A semi-structured page containing data records (in rectangular box) to be extracted [4]
In this way, the inputs of the semi-structured type can be seen as documents with a fairly regular structure, and the data of these documents can be displayed in an HTML or a non-HTML format. Besides, owing to the fact that the Web pages of the Deep Web are dynamic and generated from structured databases in terms of templates or layouts, the Deep Web can be considered one of the input sources that provide such semi-structured documents. For example, the authors, price, and comments of the book pages provided by Amazon have the
same layout, because these Web pages are generated from the same database with the same template or layout. Furthermore, there is another option: HTML pages of the semi-structured type can also be generated manually. For example, although the publication lists provided on various researchers' homepages are produced by diverse users, they all have a title and a source property for every single paper. Eventually, the inputs of an Information Extraction task can also be pages of the same class or pages from various Web Service Registries.
222 Extraction Targets of Information Extraction
Moreover, regarding the task of Information Extraction, the extraction target has to be considered. There are two different extraction targets. The first one is the relation of a k-tuple, where k is the number of attributes in a record; in some cases an attribute of a record may have no instantiation, while in other cases an attribute owns multiple instantiations. The second extraction target is the complex object with hierarchically organized data. Though the ways of depicting the extraction targets in a page are diverse, the most common structure is the hierarchical tree; such a tree may contain only one leaf node, or one or more lists of leaf nodes grouped under internal nodes. Furthermore, the structure of a data object may be flat or nested: in brief, if the structure is flat, then there is only one leaf node, which can also be called the root; if it is a nested structure, then the internal nodes involved in this data object span more than two levels.
Furthermore, in order to make the Web pages readable for human beings and easier to visualize, the tables or the tuples of the same list or the elements of a tuple have to be clearly isolated or demarcated. However, the display of a data object in a Web page can be affected by the following conditions [4]:
- The attribute of a data object has zero or several values.
(1) If there is no value for an attribute of a data object, this attribute is called a "none" attribute. For example, a special offer that is only available for certain books might be a "none" attribute.
(2) If there is more than one value for an attribute of a data object, it is called a "multiValue" attribute. For instance, the name of the author of a book could be a "multiValue" attribute.
- The set of attributes (A1, A2, A3, ...) has multiple orderings.
That is to say, the position of an attribute within this set may change across the diverse instances of a data object; such an attribute is called a "multiOrdering" attribute. For instance, for movies before the year 1999 a movie site would enumerate the release date in front of the movie's title, while for movies from 1999 onwards (including 1999) it enumerates the release date behind the movie's title.
- The attribute has different formats.
This means that the display format of the data object can be completely distinct for different instances. Therefore, if the format of an attribute is free, a lot of rules are needed to deal with all kinds of possible cases; this kind of attribute is called a "multiFormat" attribute. For example, an e-commerce Web site might use a bold font to present the regular prices while using a red color to display the sale prices. Nevertheless, there is also the opposite situation, in which some different attributes of a data object have the same format; for example, various attributes are all presented with <TD> tags in a table presentation. Attributes like these can be differentiated by means of their order information. However, in cases where a "none" attribute occurs or "multiOrdering" attributes exist, the rules for extracting these attributes have to be revised.
- The attribute cannot be decomposed.
For easier processing, the input documents are sometimes treated as strings of tokens instead of strings of characters, but some attributes cannot be decomposed into several individual tokens. These attributes are called "untokenized" attributes. For example, in college course codes like "COMP4016" or "GEOL2001", the department code and the course number cannot be separated into two different strings of characters such as "COMP" and "4016" or "GEOL" and "2001".
223 The Used Techniques in Information Extraction
The extractor used in the process of Information Extraction aims at providing a single uniform query interface for accessing information sources like database servers and Web servers. It consists of the following phases: collecting the returned Web pages, labeling these Web pages, generalizing extraction rules, extracting the relevant data, and outputting the result in an appropriate format (XML format or a relational database) for further information integration. For example, the extractor first queries the Web server to gather the returned pages through the HTTP protocol; after that, it starts to extract the contents of these HTML documents and to integrate them with other data sources. In fact, the whole process of the extractor follows the steps below:
Step 1: At the beginning, the input has to be tokenized. There are two different granularities for tokenizing the input string: tag-level encoding and word-level encoding. Tag-level encoding transforms the tags of an HTML page into general tokens and every text string between two tags into a special token, whereas word-level encoding treats each word of the document as a token.
Step 2: Next, the extraction rules are applied for every attribute of the data object in the Web pages. These extraction rules can be induced in terms of top-down or bottom-up generalization, pattern mining, or logic programming, and they may be expressed by means of regular grammars or logic rules. For example, some systems use path expressions over the HTML parse tree, such as html->head->title or html->table[0], some use syntactic or semantic constraints, and some use delimiter-based constraints such as HTML tags or literal words.
Step 3: After that, all the extracted data are assembled into records.
Step 4: Finally, this process is iterated until all data objects in the input have been processed.
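As an illustration of a delimiter-based extraction rule, the following minimal Java sketch (a hypothetical example, not code from any of the systems discussed here) extracts the text between <td> and </td> delimiters with a regular expression:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TdExtractor {
    // Delimiter-based rule: everything between <td> and </td> is one attribute value.
    private static final Pattern TD = Pattern.compile("<td>(.*?)</td>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    public static List<String> extract(String html) {
        List<String> values = new ArrayList<>();
        Matcher m = TD.matcher(html);
        while (m.find()) {
            values.add(m.group(1).trim());   // collect the values into a record
        }
        return values;
    }

    public static void main(String[] args) {
        String row = "<tr><td>BLZService</td><td>available</td></tr>";
        System.out.println(extract(row));    // prints [BLZService, available]
    }
}
```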
23 Pica-Pica Web Service Description Crawler
Pica pica is known as a bird species, the magpie. Here, however, Pica-Pica is the name of a Web Service Description Crawler that is designed to investigate the quality of Web services, for example the descriptive quality of the offered Web services and how well these Web services are described in today's Web Service Registries.
231 Needed Libraries of the Pica-Pica Web Service Description Crawler
This version of the Pica-Pica Web Service Description Crawler was written by Anton Caceres and Josef Spillner and is programmed in the Python language. In order to run these scripts and parse the HTML pages, two additional libraries are needed: Beautiful Soup and Html5lib.
- Beautiful Soup
It is an HTML/XML parser for the Python language that can turn even invalid markup into a parse tree [5]. Moreover, the following three features make it especially powerful:
  - Bad markup does not choke Beautiful Soup: it generates a parse tree that makes approximately as much sense as the original document, so you can still obtain the data you want.
  - Beautiful Soup provides a toolkit with simple, idiomatic methods for navigating, searching, and modifying the parse tree, so you do not need to create a custom parser for every application.
  - If the document has already specified an encoding, you can ignore it, since Beautiful Soup converts incoming documents to Unicode automatically; otherwise you just have to specify the encoding of the original document.
Furthermore, the ways of including Beautiful Soup in an application are the following [5]:
  from BeautifulSoup import BeautifulSoup        # for processing HTML
  from BeautifulSoup import BeautifulStoneSoup   # for processing XML
  import BeautifulSoup                           # to get everything
- Html5lib
It is a Python package that implements the HTML5 [8] parsing algorithm. In order to gain maximum compatibility with the current major desktop web browsers, this implementation is based on the WHATWG (Web Hypertext Application Technology Working Group) HTML5 specification.
232 Architecture of the Pica-Pica Web Service Description Crawler
Figure 2-4: Architecture of the Pica-Pica Web Service Description Crawler
Figure 2-4 illustrates the primary architecture of the Pica-Pica Web Service Description Crawler. It includes four fundamental components: the Service Page Grabber component, the WSDL Grabber component, the Property Grabber component, and the WSML Register component.
(1) The Service Page Grabber component takes the URL seed as input and outputs the links of the service pages to the following two components, the WSDL Grabber component and the Property Grabber component.
(2) The WSDL Grabber component is responsible for obtaining the WSDL document via the delivered service page link and then checking the validity of the obtained WSDL document. Only valid WSDL documents are passed on to the WSML Register component for further processing.
(3) The Property Grabber component tries to extract the service properties hosted in the service page, if any exist. All these service properties are then saved into an INI file as the information of that service.
(4) The task of the WSML Register component is to write appropriate WSML documents by means of the valid WSDL documents delivered from the WSDL Grabber component and, optionally, the INI files delivered from the Property Grabber component, and afterwards to register them in ConQo.
- WSML [9]
It stands for Web Service Modeling Language, which provides a framework with different language variants. It is used to describe the different aspects of semantic Web services according to the conceptual model of WSMO.
- WSMO [10]
WSMO, whose full name is Web Service Modeling Ontology, is dedicated to describing the various aspects of Semantic Web Services based on ontologies. An ontology is a formal, explicit specification of a shared conceptualization; such ontologies are the sticking points that link the agreement of the communities of users with the defined conceptual semantics of the real world.
- ConQo [11]
It is a discovery framework that considers not only the Quality of Service (QoS) but also context information. It uses a Web Service Repository to manage the service descriptions, which are based on WSML.
233 Implementation of the Pica-Pica Web Service Description Crawler
This section describes the implementation of the Pica-Pica Web Service Description Crawler in detail.
(1) Firstly, to start the whole crawling process of the Pica-Pica Web Service Description Crawler, an input is needed as the initial seed. In this crawler there are five Web Service Registries, which are listed below; the URL addresses of these five Web Service Registries are used as the input seeds for the Pica-Pica Web Service Description Crawler. Moreover, in this version of the Pica-Pica Web Service Description Crawler there is a single Python script for each Web Service Registry, and the crawling scripts of these Web Service Registries are executed one after another.
Biocatalogue: http://www.biocatalogue.com
Ebi: http://www.ebi.ac.uk
Seekda: http://www.seekda.com
Service-Repository: http://www.service-repository.com
Xmethods: http://www.xmethods.net
(2) After being fed with the input seed, the crawler steps into the next component, the Service Page Grabber. This component first reads the data from the Web based on the input seed and then builds a parse tree of the read data by means of the functions of the Beautiful Soup library. After that, the Service Page Grabber component looks for the service page link of each service published in the Web Service Registry by means of the functions in the Html5lib library. Whenever the service page link of a single service is found, the component first checks whether this service page link is valid; only valid service page links are passed on to the following two components for further processing, the WSDL Grabber component and the Property Grabber component.
(3) When the WSDL Grabber component receives a service page link from the previous component, it sets out to extract the WSDL link address of that service from the parse tree of the service page data. Next, the component downloads the WSDL document of that service via the WSDL link address, and the obtained WSDL document is stored on disk. This process is carried on continually until no more service links are passed to the component. Certainly, not all grabbed WSDL documents are usable: they may contain bad definitions or bad namespace URIs, be empty documents, or, even worse, not be in XML format at all. Hence, in order to pick these out, the component further analyzes the obtained WSDL documents and puts all valid documents into a "validWSDLs" folder, whereas invalid documents are put into a folder named "invalidWSDLs" in order to gather statistical information. Finally, only the WSDL documents in the "validWSDLs" folder are passed to the subsequent component.
(4) Moreover, since some Web Service Registries give additional information about the services, such as availability, service provider, or version, the Property Grabber component sets out to extract this information as the service's properties and thereafter saves these properties into an INI file. However, if no additional information is available, there is no need to extract service properties, and thus no INI file is created for that service. Note that in this implementation of the Pica-Pica Web Service Description Crawler only the Python scripts for the Seekda and Service-Repository Web Service Registries have functions for extracting the services' properties, while the scripts for the other three Web Service Registries have no such function.
(5) Furthermore, it is optional to create a report file that contains statistical information about this process, such as the total number of services of one Web Service Registry, the number of services whose WSDL document is invalid, etc.
(6) As stated above, at this point there is a folder with all valid WSDL documents and possibly also some INI files. The task of the WSML Register component is therefore to generate the appropriate WSML documents from these valid WSDL documents and INI files, and then register them in ConQo.
24 Conclusions of the Existing Strategies
This chapter presented three aspects of the existing strategies: the Service-Finder project, Information Extraction techniques, and the Pica-Pica Web Service Description Crawler.
The task of this master program is to obtain the available Web services and their related information from the Web. This is essentially a procedure of extracting the needed information for each service, such as the service's WSDL document and its properties. Therefore, Information Extraction techniques, which are used to extract information hosted on the Web, can be applied in this master program.
Moreover, Service-Finder is a large project that is not only able to obtain the available Web services and their related information from the Web, but also enriches the crawled data with annotations by means of the Service-Finder ontology and the Service Category ontology and integrates all the information into a coherent semantic model. Furthermore, it provides capabilities for searching and browsing the data through a user interface and gives users service recommendations. However, the Service-Finder project far exceeds the requirements of a master program. Therefore, it is considered only as a reference for this master program.
Furthermore, since the Pica-Pica Web Service Description Crawler aims only at obtaining the available Web services and their related information, it fulfills the primary task of this master program. Nevertheless, regarding the information about each service, the Pica-Pica Web Service Description Crawler extracts only a few properties, and sometimes no properties at all. Consequently, in order to improve the quality of the service descriptions, as many properties as possible have to be extracted for each service. Hence, chapter 3 presents an extension of the Pica-Pica Web Service Description Crawler.
3 Design and Implementation
The previous chapter on the state of the art presented already existing techniques and implementations. The following introduces the basic principle of the proposed approach, the Deep Web Services Crawler, which builds on these existing techniques, especially the Pica-Pica Web Service Description Crawler.
31 Deep Web Services Crawler Requirements
This section discusses the goals of the Deep Web Service Crawler approach, the system requirements of the approach, and some non-functional requirements.
311 Basic Requirements for DWSC
The following are the basic requirements that should be achieved:
(1) Produce the largest annotated Service Catalogue. Here, the Service Catalogue is a list of the Web services published in the Web Service Registries. It contains not only the WSDL document of each service but also a file with information about that service. Therefore, for the purpose of producing the largest annotated Service Catalogue, the proposed approach needs to extract as many properties about those Web services as possible. Moreover, it also has to download the WSDL document hosted along with each Web service. That is to say, these properties include not only the interesting structured properties, such as the service name and its WSDL link address, but also other non-functional properties, for example the endpoint and its monitoring information.
(2) Flexible storage. Each service has a Service Catalogue entry that contains all its interesting properties. A key question is which schemes should be used to store those service properties. Hence, in order to store them in a flexible way, the proposed approach provides three storage methods: the first stores them as an XML file, the second stores them in an INI file, and the third uses a database for storage.
312 System Requirements for DWSC
Generally speaking, the requirements for realizing a programming project include the following:
1) Operating system: Linux/Unix, Windows (XP, Vista, 2000, etc.)
2) Programming language: C++, Java, Python, C, etc.
3) Programming tools: NetBeans, Eclipse, Visual Studio, and so on
In this master thesis, the Deep Web Service Crawler approach is written in the Java programming language. The code has been tested on the Windows XP and Linux operating systems, but not on other operating systems.
313 Non-Functional Requirements for DWSC
In this part several non-functional requirements for the Deep Web Service Crawler approach are
presented
1) Transparency: the process of data exploration and data storage should be done automatically, without the user's intervention. However, the user should first specify the path on the hard disk that will be used to store the outputs of this program.
2) Fault tolerance: during the execution of this program, errors will inevitably happen. Therefore, in order to keep the process from being interrupted, there must be the necessary error handling for process recovery.
3) Completeness: this approach should extract as many of the interesting properties of each Web service as possible, e.g. endpoint information, monitoring information, etc.
In addition, since the Pica-Pica Web Service Description Crawler has already implemented strategies for the following five URLs, the proposed approach must support no fewer than these five Web Service Registries:
Biocatalogue: http://www.biocatalogue.com
Ebi: http://www.ebi.ac.uk
Seekda: http://www.seekda.com
Service-Repository: http://www.service-repository.com
Xmethods: http://www.xmethods.net
32 Deep Web Services Crawler Architecture
This section first introduces an overview of the high-level architecture of the Deep Web Services Crawler approach. Thereafter, four subsections outline each single component and how they play together.
The current components and data flows of the Deep Web Service Crawler are summarized in Figure 3-1, depicted with continuous arrows. The crawler first tries to obtain the available service list page links and related service page links by crawling the Web (Web Service Extractor). Those gathered links are then processed in two separate steps: one obtains the service's WSDL document (WSDL Grabber), and the other collects the properties of each service (Property Grabber). Finally, all these data are stored on the storage device (Storage). The detailed process of figure 3-1 is illustrated in the following:
• Step 1
When the Deep Web Service Crawler starts to run, the File Chooser container requires the user to specify a path on the computer or on any other hard disk. The reason for specifying the path is that the Deep Web Service Crawler program needs a place to store all its outputs.
• Step 2
After that, the Web Service Extractor is triggered. It is the main entry to the actual crawling process. Since the Deep Web Service Crawler program is supposed to crawl for Web services in given Web Service Registries, the URL addresses of these Web Service Registries have to be given as the initial seed for the Web Service Extractor process. However, since the page structures of these Web Service Registries are completely different, there is a registry-dependent process for each Web Service Registry.
Figure3-1 Overview of the Basic Architecture for the Deep Web Services crawler
• Step 3
According to the given seed, two types of links are obtained by the Web Service Extractor component: the service list page links and the service page links. A service list page is a page that contains a list of Web services and possibly some information about these Web services, while a service page is a page that contains much more information about a single service. Finally, the component forwards these two types of links to the next two components, the Property Grabber and the WSDL Grabber.
• Step 4
On the one hand, the Property Grabber component tries to gather the information about the service hosted in the service list page and the service page, such as the name of the service, its description, its rating, etc. All the information about the service is collected together as the service properties, which are then delivered to the Storage component for further processing.
• Step 5
On the other hand, the WSDL Grabber tries to obtain the WSDL link from the service list page or the service page. For some Web Service Registries the WSDL link is hosted in the service list page, as in Biocatalogue, while for the other Web Service Registries it is hosted in the service page, as in Xmethods. After the WSDL link is obtained, it is also transmitted to the Storage component for further processing.
• Step 6
When the service properties and the WSDL link of a service are received by the Storage component, it stores them on disk. The service properties are stored on disk in three different ways: as an XML file, as an INI file, or as one record inside a database table. For the WSDL link, the Storage component first tries to download the page content via the URL address of the WSDL link. If this succeeds, the page content of the service is stored as a WSDL document on disk.
• Step 7
Steps 3 to 6 describe the crawling process for a single service. Hence, if there is more than one service or more than one service list page in a Web Service Registry, the crawling process from step 3 to step 6 is repeated again and again until there are no services or service list pages left in that Web Service Registry.
• Step 8
Furthermore, after the crawling process for one Web Service Registry finishes, a file is generated that contains statistical information about this crawling process: for example, when the crawling process of this Web Service Registry started and finished, the total number of Web services in this Web Service Registry, how many services have an empty WSDL document, the average number of service properties in this Web Service Registry, and the average time cost for extracting service properties, getting the WSDL document, and generating the XML file, INI file, etc.
321 The Function of Web Service Extractor Component
The Web Service Extractor component is responsible for gathering service list page links and service page links. It pursues a focused crawl of the Web and forwards only service list page and service page links to the subsequent components for analysis and collection purposes. It therefore identifies both the service list page links and the related service page links in these Web Service Registries.
As can be seen from figure 3-2, a crawl for Web services needs to start from a URL seed. The seed is almost as important as the Web Service Extractor itself, as it highly influences the part of the Web that needs to be crawled. The seed can or shall contain, e.g., Web pages where these Web services are published or that talk about Web services.
Figure3-2 Overview of the process flow of the Web Service Extractor Component
After being fed with the URL seed, the Web Service Extractor component starts to get the service list page links from the initial page of this URL seed. However, this process differs among the five Web Service Registries. The following shows the different situations in these Web Service Registries.
• Service-Repository Web Service Registry
In this Web Service Registry, the link of the first service list page is the URL address of its seed, which means that some Web services can already be found on the home page of the Service-Repository Web Service Registry. The word "first" here implies that there is more than one service list page link in this Web Service Registry. Therefore, the process of getting service list page links in this registry is carried on continually until no more service list page links exist.
• Xmethods Web Service Registry
Although there are Web services on the home page of the Xmethods Web Service Registry, they are only a small subset of the Web services in this registry. Moreover, in the Xmethods Web Service Registry there is one single page containing all Web services. Therefore, the crawler has to get the service list page link of that page.
• Ebi Web Service Registry
The situation in the Ebi Web Service Registry is a little like that in the Xmethods Web Service Registry. That is to say, there is also one page that contains all Web services in this Web Service Registry. However, this page is not the initial page of the input seed. Therefore, more than one operation step is needed to get the service list page link of that page.
• Seekda Web Service Registry
In the Seekda Web Service Registry, the Web services are not contained in the initial page of the input seed. The service list page link can be obtained after several additional operation steps. However, there is a problem with getting the service list page links in this registry: simply put, if more than one page contains Web services, then for some unknown reason the links of the remaining service list pages cannot be obtained. In other words, only the link of the first service list page can be obtained.
• Biocatalogue Web Service Registry
The process of getting service list pages in the Biocatalogue Web Service Registry is almost the same as in the Seekda Web Service Registry. The only difference is that in the Biocatalogue Web Service Registry all service list page links can be obtained, even if there is more than one service list page.
After getting the link of a service list page, the Web Service Extractor begins to get the service page link of each service listed in that service list page. This is possible because there is an internal link for every service that addresses its service page. It is worth noting that, once a service page link is obtained, this component immediately forwards the service page link and the service list page link to the two subsequent components for further processing. The process of obtaining service page links is carried out continuously until all services listed in that service list page are crawled. Analogously, the process of getting service list pages is also carried out continuously until no more service list pages exist.
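The following is a minimal Java sketch of such a paginated crawling loop. It assumes the jsoup library for HTML parsing (the thesis does not name the HTML library it uses), and the CSS selectors "a.service-link" and "a.next-page" are hypothetical placeholders for the registry-dependent link patterns.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ServiceRepositoryExtractor {

    // Crawls service list pages one after another until no further list page
    // exists, immediately forwarding every service page link it discovers.
    public static void extract(String seed) throws Exception {
        String listPageLink = seed;  // for Service-Repository, the first list page is the seed itself
        while (listPageLink != null) {
            Document listPage = Jsoup.connect(listPageLink).get();
            for (Element link : listPage.select("a.service-link")) {   // hypothetical selector
                String servicePageLink = link.attr("abs:href");
                // here both links would be forwarded to the WSDL Grabber and Property Grabber
                System.out.println(listPageLink + " -> " + servicePageLink);
            }
            Element next = listPage.selectFirst("a.next-page");        // hypothetical selector
            listPageLink = (next != null) ? next.attr("abs:href") : null;
        }
    }
}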
3211 Features of the Web Service Extractor Component
The main features are described in the following paragraphs
1) Obtain service list page links
A central task is to obtain the corresponding service list page links. A service list page link is a URL address of a page that includes a public list of Web services with some simple information about them, such as the name of each service and an internal URL that links to another page containing detailed information about that service; sometimes it may also include the link address of the WSDL document.
2) Obtain service page links
Once service list page links are found, a crucial aspect is to extract the internal link of each service to aid the task of service information discovery. It is therefore the task of the Web Service Extractor to harvest the HTML page content of the service list page so that the service page links, which lead to much more detailed information about each single Web service, can be obtained.
3212 Input of the Web Service Extractor Component
This component depends on specific input seeds. The only input required for this component is a URL seed. This URL seed is one of the URLs listed in section 313.
3213 Output of the Web Service Extractor Component
The component produces two kinds of service-related page links from the Web:
• Service list page links
• Service page links
3214 Demonstration for Web Service Extractor
In order to give a comprehensive understanding of the process of the Web Service Extractor component, the following figures serve as explanation. Though there are five URL addresses, this section shows only the URL of the Service-Repository as an example.
1) The input seed is the initial URL address of the Service-Repository, which is "http://www.service-repository.com".
2) As already said in section 321, the first service list page link of this Web Service Registry is its input seed, "http://www.service-repository.com". Figure 3-3 shows the corresponding service list page of that link.
Figure3-3 Service list page of the Service-Repository
Figure3-4 Original source code of the internal link of the Web service "BLZService"
Figure3-5 Code overview of getting the service page link in the Service-Repository
Figure3-6 Service page of the Web service "BLZService"
3) Now that the service list page link is known, the next step is to acquire the service page links of the services listed in the service list page. The text in the red box of figure 3-4 shows the internal link of the Web service "BLZService". However, this is not the complete link of the service page: it has to be prefixed with the initial URL address of the Service-Repository Web Service Registry, "http://www.service-repository.com". The code for getting the service page link of a Web service in the Service-Repository is shown in figure 3-5. The final link of this service page is therefore "http://www.service-repository.com/service/overview/-210897616". Figure 3-6 is the corresponding service page of that link.
4) Afterwards, the two links gathered by the Web Service Extractor component, the service list page link and the service page link, are immediately forwarded to the next two components, the WSDL Grabber component and the Property Grabber component.
322 The Function of WSDL Grabber Component
The WSDL Grabber component acquires the WSDL link hosted on the Web based on the service list page link or the service page link. The whole process flow is illustrated in figure 3-7.
Figure3-7 Overview of the process flow of the WSDL Grabber Component
When the WSDL Grabber component receives the inputs delivered from the previous component, it starts to get the WSDL link of the service based on these inputs. Although the inputs of the WSDL Grabber component are the links of both the service page and the service list page, only one of them contains the WSDL link of the corresponding service. That is to say, the WSDL link exists either in the service page or in the service list page. The reason why both links need to be delivered to this component is that exactly one of the five Web Service Registries, namely Biocatalogue, hosts the WSDL link in the service list page, while for the other four Web Service Registries the WSDL link is hosted in the service page. Therefore, the WSDL links for these four Web Service Registries are obtained via the service page link, and for the Biocatalogue Web Service Registry the WSDL link is obtained via the service list page link. However, there is a problem with getting the WSDL links in the Biocatalogue Web Service Registry: in brief, some of the Web services listed in the service list pages of the Biocatalogue Web Service Registry have no WSDL link, in other words, these services have no WSDL document. In such a situation, the WSDL link of the respective Web service is assigned the value "NULL". For the Web services in the other four Web Service Registries, the WSDL link always exists in the service page. Eventually, whenever the WSDL Grabber component extracts the WSDL link of a single Web service, the link is immediately forwarded to the Storage component for downloading the WSDL document.
3221 Features of the WSDL Grabber Component
The WSDL Grabber component has to provide the following feature:
• Obtain WSDL links
The WSDL link is the direct way to get to the page that contains the contents of the WSDL document. It is a URL address that typically ends with something like "wsdl" or "WSDL" to indicate that it addresses the page of a WSDL document.
3222 Input of the WSDL Grabber Component
This component requires the following input data
l Service list page link
l Service page link
3223 Output of the WSDL Grabber Component
The component will only produce the following output data
l The URL address of WSDL link for each service
3224 Demonstration for WSDL Grabber Component
This section demonstrates a list of figures in order to give a comprehensive understanding of the process of the WSDL Grabber component. It uses the same Web service of the Service-Repository as an example.
1) The input of the WSDL Grabber component is the service page link obtained from the Web Service Extractor component. The address of this link is "http://www.service-repository.com/service/overview/-210897616".
Figure3-8 WSDL link of the Web service "BLZService" in the service page
2) Figure 3-9 shows the corresponding original source code of the WSDL link highlighted in figure 3-8.
Figure3-9 Original source code of the WSDL link of the Web service "BLZService"
3) Figures 3-10 and 3-11 show the code used to extract the WSDL link shown in figure 3-9. Note that figure 3-10 shows the particular code for the Service-Repository Web Service Registry only; for the other four Web Service Registries this code differs. The function "getServiceRepositoryWSDLLink" first gets a list of nodes that have the HTML tag name "b". Then it checks all of these nodes one by one to see whether the text value of a node is "WSDL". If a node fulfills this condition, the attribute value of its sibling, which is an "a" element here, is extracted as the WSDL link of this Web service. A reconstruction of this logic is sketched after step 4 below.
Figure3-10 Code overview of the "getServiceRepositoryWSDLLink" function
Figure3-11 Code overview of the "oneParameter" function
4) Finally, after applying the functions shown in figures 3-10 and 3-11, the WSDL link of the Web service "BLZService" is obtained. Its value is "http://services.unitedplanet.de/blz/BlzService.asmx?WSDL".
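The following minimal Java sketch reconstructs the "b"-tag logic described in step 3, again assuming jsoup; the actual code of figures 3-10 and 3-11 is not reproduced here, and the assumption that the "a" element is the direct next sibling of the "b" node may differ from the real page structure.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class WsdlLinkExtractor {

    // Scans all <b> nodes of the service page; when one carries the text
    // "WSDL", the href of the neighbouring <a> element is taken as the WSDL link.
    public static String getServiceRepositoryWsdlLink(String servicePageUrl) throws Exception {
        Document page = Jsoup.connect(servicePageUrl).get();
        for (Element bold : page.select("b")) {
            if ("WSDL".equals(bold.text().trim())) {
                Element sibling = bold.nextElementSibling();
                if (sibling != null && "a".equals(sibling.tagName())) {
                    return sibling.attr("abs:href");   // the WSDL link of this service
                }
            }
        }
        return null;   // treated as "no WSDL link" by the downstream components
    }
}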
323 The Function of Property Grabber Component
The Property Grabber component is a module used to extract and gather the Web service information hosted on the Web, namely the information shown in the service list page and the service page. In the end, all the obtained Web service information is collected together as the service properties, which are delivered to the Storage component for storing. The detailed process flow of the Property Grabber component is illustrated in figure 3-12.
The inputs of the Property Grabber component are the same as the inputs of the WSDL Grabber component. However, there is still a small difference between them with respect to the seed. As already mentioned in section 322, for the WSDL Grabber component only one of the inputs is sufficient to get the WSDL link, whereas the Property Grabber component needs both of them as seeds. Once the Property Grabber component receives the needed inputs, it starts to extract the information of a single Web service.
Figure3-12 Overview of the process flow of the Property Grabber Component
After the Property Grabber component receives the inputs, it starts to extract the service information of the Web service. Generally speaking, the service information consists of four aspects: structured information, endpoint information, monitoring information, and Whois information.
(1) Structured Information
The structured information is obtained by extracting the information hosted in the service page and the service list page. It is the basic descriptive information about the service, such as the name of the service, the URL address through which the WSDL document can be obtained, the description introducing the service, the provider who offers the service, its rating, and the server that hosts this service. However, the elements constituting this structured information differ among the Web Service Registries. For example, rating information exists in the Service-Repository Web Service Registry, while the Xmethods Web Service Registry does not have this information. In addition, even for Web services in the same Web Service Registry, some elements of the structured information may be missing. For instance, one service in a Web Service Registry may have a description, while another service in the same registry does not. Tables 3-1 to 3-5 show the structured information that should be extracted from these five Web Service Registries. Moreover, if the style of a Biocatalogue Web service is SOAP, there is additional information describing the SOAP operations of this service, while if it is REST, this additional information concerns the REST operations. This should also be considered part of the structured information. Table 3-6 and table 3-7 illustrate the information for these two kinds of operations.
Service Name | WSDL Link | WSDL Version
Provider | Server | Rating
Homepage | Owner Homepage | Description
Table 3-1 Structured Information of the Service-Repository Web Service Registry
Service Name | WSDL Link | Provider
Service Style | Homepage | Implementation Language
Description | User Description | Contributed Client Name
Type of this Client | Publisher for this Client | Used Toolkit of this Client
Used Language of this Client | Used Operation System of this Client
Table 3-2 Structured Information of the Xmethods Web Service Registry
Service Name | WSDL Link | Server
Provider | Provider's Country | Service Style
Rating | Description | User Description
Service Tags | Documentation (within WSDL)
Table 3-3 Structured Information of the Seekda Web Service Registry
Service Name | WSDL Link | Port Name
Service URL Address | Implementation Class
Table 3-4 Structured Information of the Ebi Web Service Registry
Service Name | WSDL Link | Style
Provider | Provider's Country | View Times
Favorite Times | Submitter | Service Tags
Total Annotation | Provider Annotation | Member Annotation
Registry Annotation | Base URL | SOAP Lab Server Base URL
Description | User Description | Category
Table 3-5 Structured Information of the Biocatalogue Web Service Registry
SOAP Operation Name | Inputs and Outputs | Operation Description
Operation Tags | Part of Which Service
Table 3-6 SOAP Operation Information of the Biocatalogue Web Service Registry
REST Operation Name | Service Tags | Used Template
Operation Description | Part of Which Service | Part of Which Endpoint Group
Table 3-7 REST Operation Information of the Biocatalogue Web Service Registry
(2) Endpoint Information
Endpoint information is the port information of the Web service, which can be extracted only from the service page. Since different Web Service Registries structure the endpoint information differently, some elements of the endpoint information are very diverse. One thing needs attention: the Ebi Web Service Registry has no endpoint information for any of the Web services published in it. Moreover, although the Web services in the same Web Service Registry share the same structure of endpoint information, some elements of the endpoint information may be missing or empty. Furthermore, the Web Service Registries may even have no endpoint information at all for some of the Web services published by them. Nevertheless, whenever there is endpoint information for a Web service, there is at least one element, namely the URL address of the endpoint. Table 3-8 shows the endpoint information that should be extracted from these five Web Service Registries.
Web Service Registry Name | Elements of the Endpoint Information
Service-Repository | Endpoint Name, Endpoint URL, Endpoint Critical, Endpoint Type, Bound Endpoint
Xmethods | Endpoint URL, Publisher of this Endpoint, Contact Email of this Publisher, Implementation Language of this Endpoint
Seekda | Endpoint URL
Biocatalogue | Endpoint Name, Endpoint URL
Table 3-8 Endpoint Information of these five Web Service Registries
Web Service Registry Name | Elements of the Monitoring Information
Service-Repository | Service Availability, Number of Downs, Total Uptime, Total Downtime, MTTR, MTBF, RTT Max of Endpoint, RTT Min of Endpoint, RTT Average of Endpoint, Ping Count of Endpoint
Seekda | Service Availability, Begin Time of Monitoring
Biocatalogue | Monitored Status of Endpoint, Overall Status of Service
Table 3-9 Monitoring Information of these five Web Service Registries
(3) Monitoring Information
Monitoring information is the measured statistical information about the Web service. It is worth noting that the Ebi and Xmethods Web Service Registries have no monitoring information for any of the Web services published by them, while in the other three Web Service Registries only a few Web services may lack this information. Table 3-9 above displays the monitoring information for these three Web Service Registries.
(4) Whois Information
Whois information is not extracted from the information hosted in the service page and the service list page. It is descriptive information about the service domain, which can be gained by means of the address of the WSDL link. Because of that, the process of getting the Whois information starts by obtaining the service domain first. The final value of the service domain must not contain strings such as "http://", "https://", or "www."; it must be reduced to the top-level service domain, e.g. "thomas-bayer.com". After that, the service domain database is queried by sending the value of the service domain to a Whois client, which is simply a Web site on the Internet, for example "http://www.whois365.com/cn/domain". If information about that service domain exists, a list of its information is returned as output. However, the structure of the returned information differs for different service domains. The most challenging aspect is therefore that the extraction process has to deal with each different form of the returned information. Table 3-10 gives the Whois information that needs to be extracted for all five Web Service Registries; a sketch of the domain-extraction step follows the table.
Service Domain URL | Domain Name | Domain Type
Domain Address | Domain Description | State
Postal Code | City | Country
Country Code | Phone | Fax
Email | Organization | Established Time
Table 3-10 Whois Information for these five Web Service Registries
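As an illustration of the domain-extraction step described in (4), the following is a minimal Java sketch. The naive reduction to the last two host labels is an assumption for illustration; it would fail for public suffixes such as "ac.uk", which a real implementation would have to treat specially.

import java.net.URI;

public class WhoisHelper {

    // Reduces a WSDL link such as
    // "http://www.thomas-bayer.com/axis2/services/BLZService?wsdl"
    // to the service domain "thomas-bayer.com" that is sent to the Whois client.
    public static String getServiceDomain(String wsdlLink) throws Exception {
        String host = new URI(wsdlLink).getHost();        // e.g. www.thomas-bayer.com
        if (host == null) {
            return null;                                  // malformed WSDL link
        }
        if (host.startsWith("www.")) {
            host = host.substring(4);                     // strip the "www." prefix
        }
        String[] labels = host.split("\\.");
        int n = labels.length;
        return (n <= 2) ? host : labels[n - 2] + "." + labels[n - 1];
    }
}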
Finally, the information of these four aspects is collected together and then delivered to the Storage component for further storage processing.
3231 Features of the Property Grabber Component
The Property Grabber component has to provide the following features:
• Obtain basic information
Generally speaking, the more information is available about a Web service, the better one can judge how good this Web service is. Hence, the Property Grabber component needs to extract all the basic information hosted in the service list page and the service page. This basic information comprises the structured information, the endpoint information, and the monitoring information.
• Obtain Whois information
Since more information about a Web service allows a better assessment of its quality, it is necessary to extract as much information about the Web service as possible. Therefore, in addition to the basic information, the Property Grabber component also obtains additional information called Whois information, such as the type of the domain, the name of the person in charge, the postal code of the domain, city, phone, fax, detailed address, email, etc.
3232 Input of the Property Grabber Component
This component requires the following input data
l Service list page link
l Service page link
3233 Output of the Property Grabber Component
The component will produce the following output data
l Structured information of each service
l Endpoint information about each service if exists
l Monitoring information for the service and endpoint if exists
l Whois information of the service domain
However all these information will be collected together as the properties for each service Thereafter
the collected properties will be sent to the Storage component
3234 Demonstration for Property Grabber Component
The pictures in figures 3-13 to 3-16 depict the fundamental procedure of the Property Grabber component. To simplify the explanation, the example shown in this section uses the same Web service as before.
1) The inputs of the Property Grabber component are the links of the service list page and the service page received from the Web Service Extractor component. These links are "http://www.service-repository.com" and "http://www.service-repository.com/service/overview/-210897616".
2) When the "GetProperty" function is triggered, it calls its four sub functions shown in figure 3-12.
Figure3-13 Structured properties of the service "BLZService" in the service list page
Figure3-14 Structured properties of the service "BLZService" in the service page
3) First, the "getStructuredProperty" function tries to extract the structured information displayed in the red boxes of figures 3-13 and 3-14. Since several elements of the structured information have the same content, such as the description shown in the service page and in the service list page, elements with the same content are extracted only once, in order to save extraction time and storage space. Moreover, the rating information needs a transformation from non-descriptive to descriptive text, because its content consists of several star images. The final results of the extracted structured information of this Web service are shown in table 3-11. Since there is no descriptive information for the provider, homepage, and owner homepage, their values are assigned "NULL".
Service Name | BLZService
WSDL Link | http://www.thomas-bayer.com/axis2/services/BLZService?wsdl
WSDL Version | 0
Server | Apache-Coyote/1.1
Description | BLZService
Rating | Four stars and a half
Provider | NULL
Homepage | NULL
Owner Homepage | NULL
Table 3-11 Extracted Structured Information of the Web service "BLZService"
4) Second, the "getEndpointProperty" function extracts the endpoint information displayed in the red box of figure 3-15. Though there is more than one endpoint record in the list, only one of them is extracted as the endpoint information. This is because this master program is intended to extract as much information as possible, but without redundant information. Therefore, only one record is extracted as the endpoint information, even if there is more than one endpoint record. Table 3-12 shows the final results of the endpoint information of this Web service.
Figure3-15 Endpoint information of the Web service "BLZService" in the service page
Endpoint Name | BLZServiceSOAP12port_http
Endpoint URL | http://www.thomas-bayer.com:80/axis2/services/BLZService
Endpoint Critical | True
Endpoint Type | production
Bound Endpoint | BLZServiceSOAP12Binding
Table 3-12 Extracted Endpoint Information of the Web service "BLZService"
5) Then the monitoring information is extracted by invoking the function "getMonitoringProperty". Figure 3-16 displays two types of monitoring properties: the upper red box contains the monitoring information about the Web service, and the lower red box lists the monitoring information of the endpoints. As mentioned before, only one endpoint statistic record is extracted. Besides, as can be seen in figure 3-16, there are two availability values. They both represent the availability of this Web service, just like the availability shown in figure 3-14. Therefore, only one of these availability values is sufficient. Table 3-13 shows the final results of this extraction process.
Figure3-16 Monitoring information of the service "BLZService" in the service page
Service Availability | 100 %
Number of Downs | 0
Total Uptime | 1 day 19 hours 19 minutes
Total Downtime | 0 seconds
MTBF | 1 day 19 hours 19 minutes
MTTR | 0 seconds
RTT Max of Endpoint | 141 ms
RTT Min of Endpoint | 0 ms
RTT Average of Endpoint | 57.7 ms
Ping Count of Endpoint | 112
Table 3-13 Extracted Monitoring Information of the Web service "BLZService"
6) Next, in order to extract the Whois properties, the "getWhoisProperty" function first has to obtain the service domain from the WSDL link. For this Web service, the obtained service domain is "thomas-bayer.com". It then sends this service domain as input to the Whois client for the querying process, which returns a list of information about that service domain; see figure 3-17. Table 3-14 shows the extracted Whois information.
Figure3-17 Whois information of the service domain "thomas-bayer.com"
Service Domain URL | thomas-bayer.com
Domain Name | Thomas Bayer
Domain Type | NULL
Domain Address | Moltkestr. 40
Domain Description | NULL
State | NULL
Postal Code | 54173
City | Bonn
Country | NULL
Country Code | DE
Phone | +4922855525760
Fax | NULL
Email | info@predic8.de
Organization | predic8 GmbH
Established Time | NULL
Table 3-14 Extracted Whois Information of the service domain "thomas-bayer.com"
7) Finally, all the information of these four aspects is collected together as the service properties, and these service properties are then forwarded to the Storage component.
324 The Function of Storage Component
The WSDL link from the WSDL Grabber component is used by the Storage component to download the WSDL document from the Web and store it on disk. In addition, the service properties from the Property Grabber component are also directly stored on disk in three different manners by the Storage component. Figure 3-18 illustrates the process flow of the Storage component.
When the Storage component receives the needed inputs, the mediator function "Storager" of the Storage component is triggered. It transforms the service properties into three different output formats, XML file, INI file, and database records, and stores them on disk. Besides, it also tries to download the WSDL document via the URL address of the WSDL link and then stores the obtained WSDL document on disk as well. The "Storager" function is composed of four sub functions, "getWSDL", "generateXML", "generateDatabase", and "generateINI". Each sub function is in charge of one aspect of the storage tasks.
Figure3-18 Overview of the process flow of the Storage Component
(1) "getWSDL" sub function
The task of the "getWSDL" sub function is to download the WSDL document and then store it on disk. Above all, it has to get the content of the WSDL document. This is done as follows. First, the "getWSDL" sub function checks whether the value of the received WSDL link equals "NULL". As already presented in section 322, if a Web service has no WSDL link, the value of its WSDL link is assigned "NULL". In that case, a WSDL document is created whose name is the service name appended with the mark "No WSDL Document"; obviously this document contains no content and is an empty document. If the service does have a WSDL link, this sub function tries to connect to the Internet via the URL address of the WSDL link. If it succeeds, the contents hosted on the Web are downloaded, stored on disk, and named with just the name of the service. Otherwise, a WSDL document is created whose name is prefixed with "Bad" before the service name.
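A minimal Java sketch of this behavior could look as follows; the ".wsdl" file suffix and the exact file naming are assumptions, since the real naming scheme additionally involves the "SecurityInt" counter described in section 3244.

import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Paths;

public class WsdlDownloader {

    // Stores the WSDL document of one service: an empty "[No WSDL Document]"
    // file for a NULL link, the plain service name on success, and a "Bad"
    // prefix when the link cannot be fetched.
    public static void getWsdl(String path, String serviceName, String wsdlLink) throws IOException {
        if (wsdlLink == null || "NULL".equals(wsdlLink)) {
            Files.createFile(Paths.get(path, serviceName + "[No WSDL Document].wsdl"));
            return;
        }
        try (InputStream in = new URL(wsdlLink).openStream()) {
            Files.copy(in, Paths.get(path, serviceName + ".wsdl"));
        } catch (IOException e) {
            Files.createFile(Paths.get(path, "Bad" + serviceName + ".wsdl"));
        }
    }
}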
(2) "generateXML" sub function
The "generateXML" sub function takes the service properties as input, transforms them into an XML file, and stores it on disk with a name of the form service name plus ".xml". XML stands for eXtensible Markup Language, a markup language designed to transport and store data. The first line of an XML file is the XML declaration, which defines the XML version and the encoding used. For example, "<?xml version="1.0" encoding="UTF-8"?>" means this is an XML file whose version is 1.0 and whose encoding is the 8-bit Unicode Transformation Format character set. Besides, an XML file contains XML elements, which span everything from an element's start tag to its end tag. An XML element can also contain other elements, simple text, or a mixture of both. However, an XML file must contain a root element as the parent of all other elements.
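The following is a minimal Java sketch of this transformation using the standard DOM and Transformer APIs; it assumes that the property names are valid XML element names, which the real implementation would have to ensure (e.g. by replacing spaces).

import java.io.File;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class XmlGenerator {

    // Writes the name-value service properties as "<service name>.xml" with
    // one child element per property under a single root element.
    public static void generateXml(String path, String serviceName,
                                   Map<String, String> properties) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Element root = doc.createElement("service");   // the mandatory root element
        doc.appendChild(root);
        for (Map.Entry<String, String> p : properties.entrySet()) {
            Element e = doc.createElement(p.getKey()); // assumes a valid element name
            e.setTextContent(p.getValue());
            root.appendChild(e);
        }
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.ENCODING, "UTF-8");  // matches the XML declaration
        t.transform(new DOMSource(doc), new StreamResult(new File(path, serviceName + ".xml")));
    }
}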
(3) "generateINI" sub function
The "generateINI" sub function also takes the service's properties as input, but transforms them into an INI file and then stores it on disk with a name of the form service name plus ".ini". "ini" stands for initialization. The INI file format is a de facto standard for configuration files; INI files are simple text files with a basic structure. Generally speaking, an INI file contains three different parts: sections, parameters, and comments. The parameter is the basic element contained in an INI file. Its format is a key-value pair, also called a name-value pair. The pair is delimited by an equals sign "=", and the key or name always appears to the left of the equals sign. A section is like a room that groups its parameters together. It always appears on a single line within a pair of square brackets "[]", and sections may not be nested. A comment is descriptive text that begins with a semicolon ";"; anything between the semicolon and the end of the line is ignored.
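A minimal Java sketch of this sub function could look as follows; grouping all properties under one section named after the service is an assumption about the layout.

import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.Map;

public class IniGenerator {

    // Writes the service properties as "<service name>.ini": a comment line,
    // one [section] grouping the parameters, and one key=value pair per property.
    public static void generateIni(String path, String serviceName,
                                   Map<String, String> properties) throws IOException {
        try (PrintWriter out = new PrintWriter(new FileWriter(new File(path, serviceName + ".ini")))) {
            out.println("; properties of service " + serviceName);  // comment
            out.println("[" + serviceName + "]");                   // section
            for (Map.Entry<String, String> p : properties.entrySet()) {
                out.println(p.getKey() + "=" + p.getValue());       // parameter
            }
        }
    }
}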
(4) "generateDatabase" sub function
The inputs of the "generateDatabase" sub function are the same as for the previous two sub functions. Instead of transforming them into a file like XML or INI, the "generateDatabase" sub function turns them into database records by using SQL statements. SQL stands for Structured Query Language, a database language designed for accessing and manipulating data in a database. Most of the actions performed on a database are done with SQL statements, and the primary SQL statements include INSERT INTO, DELETE, UPDATE, SELECT, CREATE, ALTER, and DROP. Therefore, for the purpose of transforming the service properties into database records, this sub function first has to create a database, using the CREATE DATABASE statement. Then it creates a table to store the data. A table is a collection of related data entries and consists of columns and rows. Since the data volume for these five Web Service Registries is not very large, one database table is sufficient for storing the service properties. Because of that, the field names of the service properties in the columns have to be uniform and well-defined across all five Web Service Registries. Afterwards, the service properties of each single service can be put into the table as one record with the INSERT INTO statement of SQL.
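The following minimal Java sketch illustrates this with JDBC; the JDBC URL, the table name "services", and the use of a prepared statement are assumptions, since the actual code of figures 3-22 and 3-23 is not reproduced here.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.Statement;

public class DatabaseGenerator {

    // Creates one table whose columns are the uniform property field names
    // (all typed TEXT, since property lengths vary) and inserts one row per service.
    public static void store(String jdbcUrl, String[] fields, String[] values) throws Exception {
        StringBuilder cols = new StringBuilder();   // "FieldA TEXT, FieldB TEXT, ..."
        StringBuilder names = new StringBuilder();  // "FieldA, FieldB, ..."
        StringBuilder marks = new StringBuilder();  // "?, ?, ..."
        for (int i = 0; i < fields.length; i++) {
            if (i > 0) { cols.append(", "); names.append(", "); marks.append(", "); }
            cols.append(fields[i]).append(" TEXT");
            names.append(fields[i]);
            marks.append("?");
        }
        try (Connection con = DriverManager.getConnection(jdbcUrl)) {
            try (Statement st = con.createStatement()) {
                st.executeUpdate("CREATE TABLE IF NOT EXISTS services (" + cols + ")");
            }
            try (PreparedStatement ps = con.prepareStatement(
                    "INSERT INTO services (" + names + ") VALUES (" + marks + ")")) {
                for (int i = 0; i < values.length; i++) {
                    ps.setString(i + 1, values[i]);   // one record per service
                }
                ps.executeUpdate();
            }
        }
    }
}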
3241 Features of the Storage Component
The Storage component has to provide the following features:
• Generate different output formats
The final result of this master program is to store the information about the services on disk for future work. The Storage component provides three different formats for storing the service properties of the services published in the Web Service Registries. This makes the storage of the services very flexible and also long-lived.
• Obtain the WSDL document
An important task of this component is to check whether the WSDL document can be obtained from the WSDL link, because the WSDL document plays a decisive role in determining the quality of the service. The Storage component provides the ability to deal with the different situations in the process of obtaining the WSDL document.
3242 Input of the Storage Component
This component requires the following input data
l WSDL link of each service
l Each service property information
3243 Output of the Storage Component
The component will produce the following output data
l WSDL document of the service
l XML document INI file and tables in database
3244 Demonstration for Storage Component
The following figures describe the fundamental implementation code of this Storage component. The detailed depiction is as follows:
1) As can be seen from figures 3-19 to 3-21, there are several places the implementation code has in common. The first is the parameters defined in each of these sub functions, "path" and "SecurityInt". The parameter "path" is an absolute path on the computer disk; it is used for the storing procedure and has already been specified by the user at the beginning of the whole program. The parameter "SecurityInt" is an increasing integer that is used as part of the name of the service. The reason for this is that it prevents services with the same name from overwriting each other on disk. The content of the red marks in the code of these figures is the second common place; its function is to create a file or document for the service with the corresponding parameters.
2) Figure 3-19 displays the implementation code of the "getWSDL" sub function, which is designed to get the WSDL document of a service based on its WSDL link. The parameter "name" is the name of the service, which is used as the name of the WSDL document. The parameter "linkStr" is the actual WSDL link, the most important parameter of this sub function. The other two parameters, "statistic" and "log", are objects for the text files called "Statistic Information" and "Log Information", respectively. The "Statistic Information" text file records the statistical data of the services in each Web Service Registry, such as the overall number of properties for that Web Service Registry, the overall number of services, the number of services that have no WSDL link, the number of services whose WSDL document contains no content, the number of services whose WSDL link is not available, etc. The "Log Information" text file records the results of the process steps and the problems encountered: for example, which service is being crawled at the moment, which Web Service Registry it belongs to, the reason why the WSDL document of a service cannot be obtained, the reason why a service is unreachable, and so on.
Figure3-19 Implementation code for getting WSDL document
3) Figures 3-20 and 3-21 show the code for turning the service properties into the XML file and the INI file and then storing those two files on disk. The parameter "vec" is a Vector of the "PropertyStruct" data type. "PropertyStruct" is a class consisting of two variables, name and value.
Figure3-20 Implementation code for generating XML file
Figure3-21 Implementation code for generating INI file
4) The code in figures 3-22 and 3-23 shows the process that turns the service properties of all services in these five Web Service Registries into database records. A database has to be created first; its name can be arbitrary as long as it conforms to the database naming rules, and the same holds for the name of the table. Figure 3-22 displays the procedure of creating a table in the database. Since it is hard to predict the length of each service property, the data types of all service property columns are set to "Text". Figure 3-23 shows the code for inserting the service properties into the table with the "update" statement.
Figure3-22 Implementation code for creating table in database
Figure3-23 Implementation code for generating table records
33 Multithreaded Programming for DWSC
Multithreaded programming is a built-in feature of the Java language. A multithreaded program contains two or more separate parts that can execute concurrently; each such part is called a thread. The use of multithreading makes it possible to create programs that use the system resources efficiently, for example by making maximum use of the CPU, because the idle time of the CPU can be kept to a minimum.
In this master program there are five Web Service Registries that need to be crawled for services. Moreover, the number of services published in each Web Service Registry differs considerably, which makes the running time for each Web Service Registry different. Without multithreading, a Web Service Registry with fewer services would have to wait until another Web Service Registry with many more services finishes. Therefore, in order to reduce the waiting time for the other Web Service Registries and maximize the use of the system resources, it is necessary to apply multithreaded programming to this master program. That is to say, this master program creates one thread per Web Service Registry, and these threads are executed independently.
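A minimal Java sketch of this thread-per-registry design could look as follows; crawlRegistry is a hypothetical placeholder for the registry-dependent crawling process.

public class CrawlerMain {

    // Starts one independent crawling thread per Web Service Registry so that
    // a small registry never has to wait for a large one to finish.
    public static void main(String[] args) throws InterruptedException {
        String[] seeds = {
                "http://www.biocatalogue.com", "http://www.ebi.ac.uk",
                "http://www.seekda.com", "http://www.service-repository.com",
                "http://www.xmethods.net"
        };
        Thread[] crawlers = new Thread[seeds.length];
        for (int i = 0; i < seeds.length; i++) {
            final String seed = seeds[i];
            crawlers[i] = new Thread(new Runnable() {
                public void run() {
                    crawlRegistry(seed);        // executed concurrently
                }
            });
            crawlers[i].start();
        }
        for (Thread t : crawlers) {
            t.join();                           // wait until every registry is done
        }
    }

    private static void crawlRegistry(String seed) {
        // registry-dependent crawling process (Web Service Extractor etc.)
    }
}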
34 Sleep Time Configuration for Web Service Registries
Since this master program is intended to download the WSDL documents and extract the service information of the Web services published in the Web Service Registries, it inevitably affects the performance of the Web Service Registries. In addition, in order not to exceed their throughput capacity, these Web Service Registries restrict the access rate. Because of that, unknown errors can sometimes happen while this master program is executing: for instance, the program may halt at one point without obtaining any more WSDL documents and service information, the WSDL documents of some services in some Web Service Registries cannot be obtained, or some service information is missing. Therefore, in order to obtain the largest possible set of Web services published in these five Web Service Registries without affecting their throughput, the access rate has to be configured for each service of all Web Service Registries.
In consequence, before going into the essential procedure for each single service of these Web Service Registries, this master program calls the built-in function "sleep(long milliseconds)". It is a public static function that causes the currently executing thread to sleep for the specified number of milliseconds; in other words, it temporarily ceases execution for a while. The following table shows the time interval of the sleep function for each Web Service Registry; a sketch follows the table.
Web Service Registry Name | Time Interval (milliseconds)
Service Repository | 8000
Ebi | 3000
Xmethods | 10000
Seekda | 20000
Biocatalogue | 10000
Table 3-15 Sleep Time of these five Web Service Registries
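A minimal Java sketch of this rate limiting, wrapping Thread.sleep as described above, could look as follows; the interval values come from Table 3-15.

public class RateLimiter {

    // Pauses the current registry thread before each service is processed.
    public static void pause(long milliseconds) {
        try {
            Thread.sleep(milliseconds);          // temporarily cease execution
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();  // restore the interrupt flag
        }
    }

    public static void main(String[] args) {
        pause(20000);   // e.g. Seekda: wait 20 seconds before the next service
    }
}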
4 Experimental Results and Analysis
This chapter shows the quantitative experimental results of the prototype presented in chapter 3, and also describes and explains the analysis of these results. In order to gain rather accurate results, the experiments were carried out more than five times; all the data displayed in the following tables and charts are their averages.
41 Statistic Information for Different Web Service Registries
This section discusses the amount statistics of the Web services published in these five Web Service Registries. It includes the overall number of Web services published in each Web Service Registry and the number of unavailable Web services, which have been archived because they may not be active anymore or are close to being inactive. Table 4-1 shows the service amount statistics of these five Web Service Registries.
Web Service Registry Name | Service Repository | Ebi | Xmethods | Seekda | Biocatalogue
Overall Services | 57 | 289 | 382 | 853 | 2567
Unavailable Services | 0 | 0 | 0 | 0 | 125
Table4-1 Service amount statistic of these five Web Service Registries
In order to give an intuitive view of the service amount statistics in these five Web Service Registries, figure 4-1 shows a bar chart derived from table 4-1. As can be seen from the bar chart, on the one hand, the overall number of Web services increases from the Service Repository Web Service Registry to the Biocatalogue Web Service Registry. The Biocatalogue Web Service Registry in particular owns the largest number of Web services. This indicates that the Biocatalogue Web Service Registry has a much more powerful ability to provide Web services to users, because it contains far more services than the other four Web Service Registries. On the other hand, there are no unavailable services in any Web Service Registry except the Biocatalogue Web Service Registry. That is to say, only the Biocatalogue Web Service Registry contains Web services that cannot be used by users. To some degree this is useless, because these services cannot be used anymore and they waste network resources on the Web. Therefore, all these unavailable services should be eliminated in order to reduce the waste of network resources.
Figure 4-1 Service amount statistics of these five Web Service Registries
4.2 Statistic Information for WSDL Document
Web Service Registry Name   Service Repository   Ebi   Xmethods   Seekda   Biocatalogue
Failed WSDL Links           1                    0     23         145      32
Without WSDL Links          0                    0     0          0        16
Empty Content               0                    0     2          0        2
Table 4-2 Statistic information for WSDL Document
Table 4-2 and figure 4-2 show the error statistics for the WSDL documents of the Web services in these five Web Service Registries. There are three aspects to this. The first is the "Failed WSDL Links" count of the Web services in these Web Service Registries: the overall number of Web services whose WSDL links are invalid. In other words, it is impossible to fetch the WSDL documents of these Web services via the URL addresses of their WSDL links, and therefore no WSDL document is created for them. The second aspect is the "Without WSDL Links" count for these Web Service Registries: the overall number of Web services in each Web Service Registry that have no WSDL link at all. For such Web services the value of the WSDL link is "NULL"; a WSDL document is still created, but it has no content and its name contains the string "[No WSDL Document]". The third aspect is the "Empty Content" count, which represents the overall number of Web services that do have WSDL links whose URL addresses
are valid, but whose WSDL documents contain no content. In this case a WSDL document whose name contains the string "(BAD)" is created.
Figure 4-2 Statistic information for WSDL Document
4.3 Comparison of Different Average Numbers of Service Properties
This section compares the average number of service properties across these five Web Service Registries. The average number of service properties is calculated by means of the following equation:

ASP = ONSP / ONS (1)
Where:
ASP is the average number of service properties for one Web Service Registry,
ONSP is the overall number of service properties in one Web Service Registry,
ONS is the overall number of Web services that have already been crawled from the corresponding Web Service Registry.
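As a quick worked illustration with made-up numbers (not taken from the experiments): for a registry from which ONSP = 1150 properties were extracted over ONS = 50 crawled services, the average would be ASP = 1150 / 50 = 23 properties per service.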
Figure 4-3 shows the average number of service properties per Web service in these five Web Service Registries. As mentioned before, one measurement for testing the quality of Web services in a Web Service Registry is the service information: the more information about a Web service is available, the better users know that service, and consequently the better the quality of the Web services the corresponding Web Service Registry can offer to its users. As seen from figure 4-3, the Service Repository and Biocatalogue Web Service Registries own a larger number of service properties than the other three Web Service Registries. This directly reflects that these two Web Service Registries provide more detailed information about the Web services published in them, so that users can more easily choose the services they need and would also be more willing to use the Web services published in these two Web Service Registries. By contrast, the Xmethods and Seekda
Web Service Registries, which have less service information about their Web services, would offer lower quality for these Web services. Therefore users may be less willing to use the Web services provided in these two Web Service Registries, not to mention the Web services published in the Ebi Web Service Registry.
Figure 4-3 Average Number of Service Properties (Service Repository 23, Ebi 7, Xmethods 17, Seekda 17, Biocatalogue 32)
From the description presented in section 3.2.3, the causes of the differing numbers of service properties in these Web Service Registries may consist of the following points. First, the amount of structured information for these Web services differs across the five Web Service Registries, and part of the information for some Web services in one Web Service Registry may even be missing or have an empty value. For example, the amount of structured information that is supposed to be extracted for Web services in the Ebi Web Service Registry differs greatly from that in the Biocatalogue Web Service Registry. Second, there is a certain amount of endpoint information for almost all Web Service Registries except the Ebi Web Service Registry; its absence more or less reduces the overall number of service properties. Third, some Web Service Registries, such as Xmethods and Ebi, do not have monitoring information, whereas the Service Repository Web Service Registry in particular has a large amount of monitoring information about its Web services that can be extracted from the Web. Obviously, the last point is the amount of whois information for these Web services. If the database of the whois client does not contain information about the service domain of a Web service in one Web Service Registry, then no whois information can be extracted. Moreover, even when information about the service domain exists, its amount can be very diverse. Therefore, if a Web Service Registry is in a situation where many service domains of its Web services have no or only little whois information, the average number of service properties in that registry decreases greatly.
As a result, in order to help users become better acquainted with the Web services provided in a Web Service Registry and to distinguish these Web services, the Web Service Registry should do its best to offer more information for each of its published Web services.
4.4 Different Outputs of Web Services
The fundamental task of this master program is to obtain the WSDL documents of the Web services hosted in the Web Service Registries, as well as to extract and gather the properties of these Web services and thereafter store them to disk. Therefore this section describes the different outputs of this master program, which comprise the WSDL documents of the Web services, the generated XML and INI files, and the data records of the service properties in the database.
Figure 4-4 WSDL Document format of one Web service
The WSDL document of a Web service is simply read from the Web according to the URL address of its WSDL link, and the data are then stored to disk as the WSDL document. The name of the WSDL document is the service name plus the ending ".wsdl", which marks it as a WSDL document. In order to distinguish WSDL documents whose names would otherwise be the same although their contents differ, the name of each obtained WSDL document in one Web Service Registry additionally carries a unique integer in front of the name. Figure 4-4 shows the valid WSDL document format of a Web service; its name is "1BLZService.wsdl". A minimal sketch of this download-and-store step is given below.
The obtained service properties, in turn, are transformed into an XML file, an INI file and data records in the database; figure 4-5, figure 4-6 and figure 4-7 show these three output formats respectively. As can be seen from figure 4-5, the INI file of the Web service is named "1BLZService.ini"; the integer is the same as in the WSDL document, because both materials belong to the same Web service. The first three lines of the INI file are service comments, which run from the semicolon to the end of the line and give basic information describing the INI file. The line following them is the section, enclosed in a pair of brackets; it is important because it marks the lines after it as the information of this Web service. The remaining lines are the actual service information, stored as key-value pairs with an equals sign between key and value; each service property starts at the beginning of a line. A hypothetical example of this layout is sketched below.
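The layout can be illustrated with the following hedged sample (the comment text and property names are hypothetical, not copied from the real output):

; INI file generated by the Deep Web Service Crawler
; Registry: Service Repository
; Service: BLZService
[service]
Service Name=BLZService
WSDL Link=http://www.thomas-bayer.com/axis2/services/BLZService?wsdl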
Figure 4-5 INI File format of one Web service
Figure 4-6 XML File format of one Web service
Figure 4-7 Database data format for all Web services
Furthermore, figure 4-6 shows the generated XML file of the Web service "BLZService"; its name is "1BLZService.xml", and needless to say this XML file belongs to the materials of the same Web service. Although the format of the XML file differs from that of the INI file, their essential contents are the same; that is, the values of the service properties do not differ, because both files are generated from the same collection of properties of the same Web service. The XML file also has comments like those in the INI file, displayed between "<!--" and "-->", and the section of the INI file corresponds to the root element of the XML file. Accordingly, the values of all elements under the root "service" in this XML file are the values of the service properties of this Web service. A matching hypothetical example follows.
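A hedged XML counterpart of the INI sample above, again with hypothetical element names:

<!-- XML file generated by the Deep Web Service Crawler -->
<!-- Registry: Service Repository -->
<service>
  <name>BLZService</name>
  <wsdlLink>http://www.thomas-bayer.com/axis2/services/BLZService?wsdl</wsdlLink>
</service>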
Eventually, as can be seen from figure 4-7, this is the database table used to store the service information of all Web services in these five Web Service Registries. The entire service information of one Web service occupies exactly one record in this table. Because of that, the column names of the table are the union of the names of the service information items in each Web Service Registry; since column names must be unique, redundant names in this union are eliminated. This is sensible and possible because the names of the service information items are well defined and uniform across all five Web Service Registries. In addition, the first column of the table is the primary key, an increasing integer whose function resembles that of the integer contained in the names of the XML and INI files, while the remaining columns are the corresponding service properties of the Web services. The value "NULL" or "Null" in this table indicates that the respective property of that Web service is empty or missing. A hedged sketch of such a table definition is given below.
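A minimal SQL sketch of such a table, assuming a MySQL-like database and hypothetical column names (the real column set is the de-duplicated union of the property names of all five registries):

-- One record per Web service; columns beyond the key are service properties.
CREATE TABLE web_services (
    id INT AUTO_INCREMENT PRIMARY KEY, -- plays the role of the file-name integer
    service_name VARCHAR(255),
    wsdl_link VARCHAR(1024),
    provider VARCHAR(255), -- hypothetical property column
    availability VARCHAR(255) -- hypothetical property column
    -- ... one column per remaining service property; NULL when missing
);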
4.5 Comparison of Average Time Cost for Different Parts of a Single Web Service
This section compares the average time cost of the different parts of getting one single Web service in all these five Web Service Registries. First the average time cost of getting one single service in a Web Service Registry has to be calculated, which is done with the following equation:

ATC = OTS / ONS (2)
Where:
ATC is the average time cost for one single Web service,
OTS is the overall time cost of all the Web services in one Web Service Registry,
ONS is the overall number of Web services that have already been crawled from the corresponding Web Service Registry.
In addition, the average time cost of getting one single service is split into the following six parts: the average time cost for extracting the service properties, the average time cost for obtaining the WSDL document, the average time cost for generating the XML file, the average time cost for generating the INI file, the average time cost for inserting the service properties into the database table, and the average time cost for some other procedures, such as getting the service list page link and the service page link. The average time cost for extracting the service properties is obtained by means of the following equation:
ATCSI = OTSSI / ONS (3)
Where:
ATCSI is the average time cost for extracting the service properties of one single Web service,
OTSSI is the overall time cost for extracting the service properties of all the Web services in one Web Service Registry,
ONS is the overall number of Web services that have already been crawled from the corresponding Web Service Registry.
In addition, the calculation of the other parts is analogous to the equation for the average time cost of extracting the service properties, while the average time cost of the other procedures equals the average time cost of one single Web service minus the sum of the average time costs of the five other parts.
                     Service    WSDL       XML    INI    Database   Others   Overall
                     property   Document   File   File
Service Repository   8801       918        2      1      53         267      10042
Ebi                  699        82         2      1      28         11       823
Xmethods             5801       1168       2      1      45         12       7029
Seekda               5186       1013       2      1      41         23       6266
Biocatalogue         39533      762        2      1      66         1636     42000
Table 4-3 Average time cost information (in milliseconds) for all Web Service Registries
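For example, applying this subtraction to the Service Repository row of table 4-3 reproduces its "Others" value: 10042 - (8801 + 918 + 2 + 1 + 53) = 267 milliseconds.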
Table 4-3 displays the average time cost of one Web service and of its different parts in all these five Web Service Registries. The first column of table 4-3 gives the names of the five Web Service Registries, the last column is the average time cost of a single service in one Web Service Registry, and the remaining columns are the average time costs of the six different parts. In order to give an intuitive view of the data in table 4-3, the data in each column of this table are illustrated in the corresponding figures 4-8 to 4-13.
Figure 4-8 Average time cost for extracting service property in all Web Service Registries
As can be seen from figure 4-8, the average time cost for extracting the service properties in the Biocatalogue Web Service Registry is 39533 milliseconds, much larger than in the other four Web Service Registries, which take 8801, 699, 5801 and 5186 milliseconds for Service Repository, Ebi, Xmethods and Seekda respectively. That is to say, it takes much longer to extract the service properties of the Web services published by the Biocatalogue Web Service Registry. This also indirectly indicates that the Biocatalogue Web Service Registry has the highest average number of service properties, as already discussed in section 4.3, whereas the average number of service properties in the Ebi Web Service Registry is the smallest. Moreover, the average time cost in the Xmethods Web Service Registry is larger than in the Seekda Web Service Registry, although, as already noted, the average number of service properties is the same for these two Web Service Registries. One cause that might explain why Xmethods costs more time than Seekda is that the process of extracting the service properties in the Xmethods Web Service Registry has to work on both the service page and the service list page, while only the service page link is needed for the Seekda Web Service Registry.
The average time cost for obtaining the WSDL document in all these five Web Service Registries is displayed in figure 4-9. Actually, this average time cost is the sum of the average time for extracting the WSDL link of a Web service and the average time for reading the data of the WSDL document from the Web and storing it to disk. As seen from this figure, the average time cost for obtaining the WSDL document in the Xmethods Web Service Registry is the largest, at 1168 milliseconds. Although the process of extracting the WSDL link costs a certain amount of time, it does not have a significant influence on the total average time spent obtaining the WSDL document, because the WSDL link of a Web service is almost always gained in one step. Therefore this implies that the average size of the WSDL documents in the Xmethods Web Service Registry is larger than in the other four Web Service Registries, in particular the Ebi Web Service Registry, which takes just 82 milliseconds for obtaining the WSDL document.
Figure 4-9 Average time cost for obtaining WSDL document in all Web Service Registries
Figures 4-10, 4-11 and 4-12 show the average time costs of generating the three different outputs in all these five Web Service Registries. As can be seen from figures 4-10 and 4-11, the average time for generating the XML file of one Web service is the same in all five Web Service Registries, namely only 2 milliseconds, and the average time for generating the INI file is likewise the same everywhere, with a value of just 1 millisecond. Even the sum of these two average time costs is still so small that it can be omitted when compared with the overall average time cost of getting one Web service in each corresponding Web Service Registry, shown in figure 4-13. This implies that the process of generating the XML and INI files is finished almost instantly after the service properties of a Web service have been received. Furthermore, figure 4-12 shows that although the average time costs for creating the database record of a Web service in all five Web Service Registries are larger than the times for generating the XML and INI files, creating the database record is still a fast operation.
Figure 4-10 Average time cost for generating XML file in all Web Service Registries
Figure 4-11 Average time cost for generating INI file in all Web Service Registries
Figure 4-12 Average time cost for creating database record in all Web Service Registries
Figure 4-13 Average time cost for getting one Web service in all Web Service Registries
Figure 4-13 gives the average time cost of getting one single Web service in all these five Web Service Registries. Without any doubt, the Biocatalogue Web Service Registry takes the longest time for this process; as the presentation of the different parts above shows, each part needs more time in the Biocatalogue Web Service Registry than elsewhere, with the exception of obtaining the WSDL document, for which Biocatalogue does not cost the longest time. Moreover, a striking observation when looking at figures 4-8, 4-12 and 4-13 is that the shapes of these curves follow almost the same trend. This further indicates that a Web Service Registry that spends more time getting the description information of a Web service also offers more information about that Web service.
5 Conclusion and Further Direction
This master thesis provides a schema that aims to explore the description information of the Web services hosted in different Web Service Registries; this description information consists of the WSDL document and the service information of each Web service. Although there is an existing approach, the Pica-Pica Web Service Description Crawler, that can be used to obtain the WSDL documents and service information of the Web services in these Web Service Registries, its functionality is restricted: it explores only a small subset of the Web services hosted in these Web Service Registries, the storage of these Web services is not flexible, and, most importantly, only a little of the service information of each Web service is extracted, while for some Web Service Registries no service information is extracted at all. The work presented in this master thesis, in contrast, is able to explore all Web services in these Web Service Registries. Moreover, with the approach presented in this master thesis, as much service information of each Web service as possible is extracted, so that the final result is the largest annotated service catalogue ever produced. Furthermore, regarding the storage of the description information of the Web services, this master thesis provides three different ways that guarantee not only the completeness but also the longevity of the description information of the Web services.
However, in the implementation performed in this master thesis, the whois client used for querying the information of a service domain returns free text if the information exists, and sometimes this free text differs completely from case to case. As a consequence, each Web service in all Web Service Registries had to be crawled at least once during the experiment stage, so that all variants of this free text could be foreseen and processed afterwards. Nevertheless this is a huge amount of work, because there are lots of Web services in these Web Service Registries. Therefore, in order to simplify the work, another whois client that eases this task needs to be found and used.
Moreover, in the experiment stage of this master thesis the time cost of getting a Web service is still large. For the purpose of reducing this time, multithreaded programming could be applied to some parts of the process of getting one Web service, as sketched below.
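One possible shape of this further direction is the following sketch, assuming Java; the crawlOneService method stands for the per-service procedure of the crawler and is hypothetical, not part of the thesis implementation:

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hedged sketch: crawl the services of one registry with a small thread
// pool instead of strictly sequentially.
public class ParallelCrawl {

    public static void crawlAll(List<String> servicePageLinks, int threads)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (String link : servicePageLinks) {
            pool.submit(() -> crawlOneService(link));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }

    private static void crawlOneService(String servicePageLink) {
        // placeholder for: extracting the service properties, obtaining the
        // WSDL document, and generating the XML file, INI file and database
        // record of this service
    }
}

Note that such a pool would still have to respect the per-registry sleep intervals of table 3-15, for example by keeping the pool for each registry small.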
Although the work performed here is specialized for only these five Web Service Registries, the main principles used here are adaptable to other Web Service Registries, with only small changes in the implementation code or its structure.
6 Bibliography
[1] Nathalie Steinmetz, Holger Lausen, Dario Cerizza, Andrea Turati, Irene Celino, Adam Funk and Michael Erdmann, "D1.1 – Requirement Analysis and Architectural Plan", Emanuele Della Valle (CEFRIEL), June 27, 2008. Available from http://www.service-finder.eu/deliverables/public/7-public/32-d11
[2] Nathalie Steinmetz, Holger Lausen, Dario Cerizza, Andrea Turati, Irene Celino, Adam Funk and Michael Erdmann, "D1.2 – First Design of Service-Finder as a Whole", Emanuele Della Valle (CEFRIEL), July 1, 2008. Available from http://www.service-finder.eu/deliverables/public/7-public/37-d12-first-design-of-service-finder-as-a-whole
[3] Nathalie Steinmetz, Holger Lausen, Irene Celino, Dario Cerizza, Saartje Brockmans and Adam Funk, "D1.3 – Revised Requirement Analysis and Architectural Plan", Emanuele Della Valle (CEFRIEL), April 1, 2009. Available from http://www.service-finder.eu/deliverables/public/7-public/57-d13-revised-requirement-analysis-and-architectural-plan
[4] Chia-Hui Chang, Mohammed Kayed, Moheb Ramzy Girgis and Khaled Shaalan, "A Survey of Web Information Extraction Systems", Volume 18, Issue 10, IEEE Computer Society, pp. 1411-1428, October 2006.
[5] Leonard Richardson, "Beautiful Soup Documentation", October 13, 2008. Available from http://www.crummy.com/software/BeautifulSoup/documentation.html
[6] Hao He, Hugo Haas and David Orchard, "Web Services Architecture Usage Scenarios", February 11, 2004. Available from http://www.w3.org/TR/ws-arch-scenarios/
[7] Stephen Soderland, "Learning Information Extraction Rules for Semi-Structured and Free Text", Machine Learning, Volume 34, Issue 1-3, pp. 233-272, Department of Computer Science and Engineering, University of Washington, Seattle, February 1999.
[8] Ian Hickson, "A Vocabulary and Associated APIs for HTML and XHTML", World Wide Web Consortium, Working Draft WD-html5-20100624, January 22, 2008.
[9] Holger Lausen, Jos de Bruijn, Axel Polleres and Dieter Fensel, "The Web Service Modeling Language WSML", WSML Deliverable D16.1 v0.2, March 20, 2005. Available from http://www.wsmo.org/TR/d16/d16.1/v0.2/
[10] Dumitru Roman, Holger Lausen and Uwe Keller, "Web Service Modeling Ontology - Standard (WSMO-Standard)", WSMO Deliverable D2 version 1.1, March 06, 2004. Available from http://www.wsmo.org/TR/d2/v1.1/
[11] Iris Braun, Anja Strunk, Gergana Stoyanova and Bastian Buder, "ConQo – A Context- and QoS-Aware Service Discovery", TU Dresden, Department of Computer Science, in Proceedings of WWW/Internet 2008.
7 Appendixes
There are two additional outputs of this master program: a log information file and a statistic report file. Figure 8-1 shows one of the basic output formats of the log information for these five Web Service Registries.
Figure 8-1 Log information of the "Service Repository" Web Service Registry
Figure 8-2 Statistic information of the "Service Repository" Web Service Registry
Figures 8-2 to 8-6 show the output format of the statistic information for the "Service Repository", "Ebi", "Xmethods", "Seekda" and "Biocatalogue" Web Service Registries respectively.
Figure 8-3 Statistic information of the "Ebi" Web Service Registry
Figure 8-4 Statistic information of the "Xmethods" Web Service Registry
Figure 8-5 Statistic information of the "Seekda" Web Service Registry
Figure 8-6 Statistic information of the "Biocatalogue" Web Service Registry
Table of Figures
Figure 2-1 Dataflow of Service-Finder and Its Components ... 12
Figure 2-2 Left is the free text input type and right is its output ... 16
Figure 2-3 A semi-structured page containing data records (in rectangular box) to be extracted ... 16
Figure 2-4 Architecture of the Pica-Pica Web Service Description Crawler ... 20
Figure 3-1 Overview of the Basic Architecture for the Deep Web Services Crawler ... 25
Figure 3-2 Overview of the process flow of the Web Service Extractor Component ... 27
Figure 3-3 Service list page of the Service-Repository ... 29
Figure 3-4 Original source code of the internal link for Web service "BLZService" ... 29
Figure 3-5 Code overview of getting the service page link in Service Repository ... 29
Figure 3-6 Service page of the Web service "BLZService" ... 29
Figure 3-7 Overview of the process flow of the WSDL Grabber Component ... 30
Figure 3-8 WSDL link of the Web service "BLZService" in the service page ... 31
Figure 3-9 Original source code of the WSDL link for Web service "BLZService" ... 32
Figure 3-10 Code overview of the "getServiceRepositoryWSDLLink" function ... 32
Figure 3-11 Code overview of the "oneParameter" function ... 32
Figure 3-12 Overview of the process flow of the Property Grabber Component ... 33
Figure 3-13 Structured properties of the service "BLZService" in the service list page ... 37
Figure 3-14 Structured properties of the service "BLZService" in the service page ... 38
Figure 3-15 Endpoint information of the Web service "BLZService" in the service page ... 38
Figure 3-16 Monitoring information of the service "BLZService" in the service page ... 39
Figure 3-17 Whois information of the service domain "thomas-bayer.com" ... 40
Figure 3-18 Overview of the process flow of the Storage Component ... 41
Figure 3-19 Implementation code for getting the WSDL document ... 44
Figure 3-20 Implementation code for generating the XML file ... 44
Figure 3-21 Implementation code for generating the INI file ... 45
Figure 3-22 Implementation code for creating the table in the database ... 45
Figure 3-23 Implementation code for generating table records ... 46
Figure 4-1 Service amount statistics of these five Web Service Registries ... 49
Figure 4-2 Statistic information for WSDL Document ... 50
Figure 4-3 Average Number of Service Properties ... 51
Figure 4-4 WSDL Document format of one Web service ... 52
Figure 4-5 INI File format of one Web service ... 53
Figure 4-6 XML File format of one Web service ... 53
Figure 4-7 Database data format for all Web services ... 53
Figure 4-8 Average time cost for extracting service property in all Web Service Registries ... 55
Figure 4-9 Average time cost for obtaining WSDL document in all Web Service Registries ... 56
Figure 4-10 Average time cost for generating XML file in all Web Service Registries ... 57
Figure 4-11 Average time cost for generating INI file in all Web Service Registries ... 57
Figure 4-12 Average time cost for creating database record in all Web Service Registries ... 58
Figure 4-13 Average time cost for getting one Web service in all Web Service Registries ... 58
Table of Tables
Table 3-1 Structured Information of Service-Repository Web Service Registry ... 34
Table 3-2 Structured Information of Xmethods Web Service Registry ... 34
Table 3-3 Structured Information of Seekda Web Service Registry ... 34
Table 3-4 Structured Information of Ebi Web Service Registry ... 34
Table 3-5 Structured Information of Biocatalogue Web Service Registry ... 34
Table 3-6 SOAP Operation Information of Biocatalogue Web Service Registry ... 35
Table 3-7 REST Operation Information of Biocatalogue Web Service Registry ... 35
Table 3-8 Endpoint Information of these five Web Service Registries ... 35
Table 3-9 Monitoring Information of these five Web Service Registries ... 35
Table 3-10 Whois Information for these five Web Service Registries ... 36
Table 3-11 Extracted Structured Information of Web Service "BLZService" ... 38
Table 3-12 Extracted Endpoint Information of the Web service "BLZService" ... 39
Table 3-13 Extracted Monitoring Information of the Web service "BLZService" ... 39
Table 3-14 Extracted Whois Information of service domain "thomas-bayer.com" ... 40
Table 3-15 Sleep Time of these five Web Service Registries ... 47
Table 4-1 Service amount statistics of these five Web Service Registries ... 48
Table 4-2 Statistic information for WSDL Document ... 49
Table 4-3 Average time cost information for all Web Service Registries ... 55
Table of Abbreviations
DTD Document Type Definition
DWSC Deep Web Service Crawler
HTML HyperText Markup Language
MTBF Mean Time between Failures
MTTR Mean Time to Recovery
QoS Quality of Service
REST Representational State Transfer
RTT Round Trip Time
SOAP Simple Object Access Protocol
SQL Structured Query Language
URL Uniform Resource Locator
WHATWG Web Hypertext Application Technology Working Group
WSDL Web Service Description Language
WSML Web Service Modeling Language
WSMO Web Service Modeling Ontology
XML eXtensible Markup Language