Peer to Peer English/Chinese Cross-Language Information
Retrieval
By
CHENGYE LU
A dissertation submitted for the degree of
Doctor of Philosophy
Faculty of Science and Technology
Queensland University of Technology
September 2008
Keywords
Peer to peer system, distributed information retrieval, security, cross-language
information retrieval, query translation, out of vocabulary problem, translation
disambiguation, collection fusion, collection profiling
Abstract
Peer to peer systems are widely used on the Internet. However, most peer to peer information systems still lack some important features, for example cross-language Information Retrieval (IR) and collection selection/fusion.
Cross-language IR is a state-of-the-art research area in the IR community and has not yet been adopted in real-world IR systems. It enables a user to issue a query in one language and retrieve documents in other languages. In a typical peer to peer environment, users come from multiple countries and their collections are in multiple languages, so cross-language IR can help users find documents more easily. For example, many Chinese researchers search for research papers in both Chinese and English; with cross-language IR they can issue one query in Chinese and retrieve documents in both languages.
The Out Of Vocabulary (OOV) problem is one of the key research areas in cross-language information retrieval. In recent years, web mining has been shown to be an effective approach to this problem. However, how to extract Multiword Lexical Units (MLUs) from web content and how to select the correct translations from the extracted candidate MLUs remain two difficult problems in web-mining-based automated translation approaches.
Discovering resource descriptions and merging results obtained from remote search engines are two key issues in distributed information retrieval. In uncooperative environments, query-based sampling and normalized-score merging are well-known approaches to these problems. However, they consider only the content of the remote database, not the retrieval performance of the remote search engine.
This thesis presents research on building a peer to peer IR system with cross-language IR and an advanced collection profiling technique for fusion. In particular, the thesis first presents a new Chinese term measurement and a new Chinese MLU extraction process that work well on small corpora, together with an approach for selecting MLUs more accurately. The thesis then proposes a collection profiling strategy that can discover not only the content of a collection but also the retrieval performance of its remote search engine. Based on collection profiling, a web-based query classification method and two collection fusion approaches are developed and presented. Our experiments show that the proposed strategies are effective for merging results in uncooperative peer to peer environments. Here, an uncooperative environment is one in which each peer is autonomous: peers are willing to share documents but do not share collection statistics. This is a typical peer to peer IR environment. Finally, all these approaches are combined to build a secure peer to peer multilingual IR system that communicates through X.509 certificates and the email system.
Table of Contents
Keywords ................................................................................................................ i
Abstract ................................................................................................................ iii
Table of Contents .................................................................................................. vi
List of Figures ...................................................................................................... xiii
List of Tables ........................................................................................................ xv
Acknowledgement ............................................................................................. xvii
List of peer reviewed publications ...................................................... xix
Statement of Original Authorship ........................................................................ xxi
Chapter 1 Introduction ........................................................................................... 1
1.1 Background................................................................................................... 3
1.2 Contributions ................................................................................................ 8
1.3 Thesis outline.............................................................................................. 11
Chapter 2 Literature review ................................................................................. 13
2.1 Evaluation of Peer to Peer Information Retrieval System ............................ 14
2.2 Peer to peer information retrieval system architecture .............................. 17
2.2.1 Centralized architecture ....................................................................... 18
2.2.2 Unstructured architecture .............................................. 19
2.2.3 Structured architecture ................................................... 20
2.3 Firewall traversal ........................................................................................ 21
2.4 Distributed computing and Peer to peer system ......................................... 22
2.5 Collection selection and fusion ................................................................... 24
2.5.1 Collection rank ..................................................................................... 26
2.5.2 Getting resource description ................................................................ 28
2.5.3 Collection fusion approaches................................................................ 31
2.6 Translation .................................................................................................. 35
2.6.1 Transliteration ...................................................................................... 37
2.6.2 Parallel Text Mining .............................................................................. 39
2.6.3 Web mining for query translation ......................................................... 41
2.7 Term extraction .......................................................................................... 43
2.7.1 Mutual information and its variations .................................................. 45
2.7.2 Local Maxima based approaches .......................................................... 47
2.8 Summary .................................................................................................... 48
Chapter 3 Web based query translation ............................................................... 50
3.1 Collecting Web Document summaries ........................................................ 53
3.2 Term extraction .......................................................................................... 56
3.2.1 Frequency Change Measurement ......................................................... 56
3.2.2 A Bottom-up Term Extraction Strategy ................................................. 61
3.3 Translation selection ................................................................................... 65
3.3.1 The algorithm ....................................................................................... 66
3.3.2 Time Complexities ................................................................................ 70
Chapter 4 Multilingual Experiments ..................................................................... 73
4.1 Test set ....................................................................................................... 73
4.2 Term extraction experiments ...................................................................... 75
4.3 Discussion ................................................................................................... 76
4.3.1 Mutual information based approaches ................................................. 77
4.3.2 Local Maxima based approaches .......................................................... 79
4.3.3 SQUT Approach .................................................................................... 80
4.4 Translation selection Experiments .............................................................. 82
4.5 Discussion ................................................................................................... 83
4.5.1 IgnoreOOV ........................................................................................... 83
4.5.2 SimpleSelect ......................................................................................... 85
4.5.3 TQUT .................................................................................................... 87
Chapter 5 Web based collection profiling for collection fusion ............................. 90
5.1 A simple example........................................................................................ 91
5.2 Collection profiling ...................................................................................... 92
5.3 Query classification ..................................................................................... 94
5.4 Collection fusion ......................................................................................... 98
Chapter 6 Profiling and fusion evaluation ........................................................... 102
6.1 Query classification ................................................................................... 103
6.1.1 Discussion .......................................................................................... 103
6.2 Collection Rank Experiment ...................................................................... 105
6.2.1 Discussion .......................................................................................... 105
6.3 Collection fusion experiment .................................................................... 108
6.4 Discussion ................................................................................................. 109
Chapter 7 The P2PIR System architecture .......................................................... 115
7.1 System Design .......................................................................................... 120
7.2 The P2PIR system Search lifecycle ............................................................. 121
7.3 Communication protocol .......................................................................... 124
7.4 Security..................................................................................................... 129
7.4.1 Group management ........................................................................... 129
7.4.2 Security protocol ................................................................................ 133
7.4.3 Encryption methods ........................................................................... 135
7.4.4 XML signature .................................................................................... 136
7.5 The search engine ..................................................................................... 137
Chapter 8 Peer to peer system evaluation .......................................................... 140
8.1.1 Test environment ............................................................................... 140
8.1.2 Test queries and collection ................................................................. 141
8.1.3 Test results ......................................................................................... 141
8.1.4 Results and raw score fusion .............................................................. 144
Chapter 9 Conclusion and future work ............................................................... 147
9.1 Summary .................................................................................................. 147
9.2 Future work .............................................................................................. 150
Appendix ............................................................................................................ 153
Bibliography ....................................................................................................... 161
List of Figures
Figure 2–1 Search in a semantic space .................................................... 21
Figure 2–2 The process of collection selection ......................... 24
Figure 3–1 Three sample document summaries for “Stealth Fighter” returned from Google ............................................................................. 54
Figure 3–2 Sample output of Chinese string collection ............................ 55
Figure 6–1 P-R curves of 4 runs under System1 ..................................... 110
Figure 6–2 P-R curves of 4 runs under M-GPX ....................................... 111
Figure 7–1 System overview ................................................................. 115
Figure 7–2 The P2PIR system network structure ................................... 117
Figure 7–3 Dataflow of Making Search Request .................................... 122
Figure 7–4 Dataflow of Response to Search Request............................. 123
Figure 7–5 Dataflow of Receiving Results .............................................. 124
Figure 7–6 Schema for plain text communication ................................. 128
Figure 7–7 Member validation process ................................................. 132
Figure 7–8 XML encryption structure(Imamura, Dillaway et al. 2002) ... 135
Figure 8–1 PR curves ............................................................................. 146
Figure 10–1 Encryption and Decryption using Symmetric and Asymmetric together ............................................................................... 154
List of Tables
Table 3-1 Chinese strings and R .............................................................. 60
Table 3-2 Sample Combination of Translations ....................................... 70
Table 4-1 Test document collections ....................................................... 74
Table 4-2 OOV translation accuracy ........................................................ 77
Table 4-3 Some Extracted terms by MI ................................................... 77
Table 4-4 NTCIR retrieval performance ................................................... 83
Table 4-5 Retrieval performance on queries that contain OOV terms only ........ 85
Table 4-6 OOV translation accuracy for NTCIR5 collection ...................... 87
Table 6-1 Collection Rank Top 5 for topic “Financial news (財經新聞)” ........ 106
Table 6-2 Collection Rank Top 5 – International (國際新聞) ................. 106
Table 6-3 Collection Rank Top 5 – Science/Tech (科技新聞) ................. 106
Table 6-4 Collection Rank Top 5 – Sports (運動新聞) ............................ 107
Table 6-5 Average Precision – System1 ................................................. 110
Table 6-6 Average Precision – M-GPX .................................... 111
Table 6-7 Query classification VS No classification ................................ 113
Table 8-1 Collection size and search time.............................................. 142
Table 8-2 Comparison of Centralize search and Distributed search ....... 143
Acknowledgement
Firstly, I would like to express my immense gratitude to Professor Shlomo Geva, my principal supervisor, for all his guidance and encouragement throughout this research work. He has always been there, providing support with his excellent expertise in the information retrieval area. Many thanks also go to my associate supervisor, Dr. Yue Xu, for her generous support and comments on my work, drawing on her knowledge of data mining and English–Chinese translation.
I would also like to thank my examiners for their precious comments and
suggestions.
Special thanks must go to the Faculty of Information Technology, QUT, which has provided me with a comfortable research environment, the needed facilities, and financial support, including my scholarship and travel allowances, over the period of my candidature. I would especially like to thank all the members of our research group for offering invaluable advice and comments regarding my research work.
This work would not have been accomplished without the constant support of my family. I would like to dedicate this thesis to my parents for their never-ending encouragement over these years. Last but certainly not least, I would like to thank my wife Ning and my parents-in-law for their tremendous support.
List of peer reviewed publications
Journal paper:
1. Chengye Lu, Yue Xu, Shlomo Geva: Web-based Query Translation
for English-Chinese CLIR. International Journal of Computational
Linguistics and Chinese Language Processing, 61-90.
Conference papers:
1. Chengye Lu, Yue Xu, Shlomo Geva: A Bottom-Up Term Extraction
Approach for Web-Based Translation in Chinese-English IR Systems.
Proceedings of the 2007 Australasian Document Computing
Symposium.
2. Chengye Lu, Yue Xu, Shlomo Geva: Collection Profiling for
Collection Fusion in Distributed Information Retrieval Systems.
Proceedings of the 2007 Knowledge Science, Engineering and
Management: 279-288
3. Chengye Lu, Yue Xu, Shlomo Geva: Translation disambiguation in
web-based translation extraction for English-Chinese CLIR.
Proceedings of the 2007 ACM symposium on Applied computing:
819-823
4. Chengye Lu, Shlomo Geva: Secure email-based peer to peer
information retrieval. Proceedings of the 2005 International
Conference on Cyberworlds: 531-538
Statement of Original Authorship
The work contained in this thesis has not been previously submitted to
meet requirements for an award at this or any other higher education
institution. To the best of my knowledge and belief, the thesis contains no
material previously published or written by another person except where
due reference is made.
Name:__________________________________________________
Signed:_________________________________________________
Date:___________________________________________________
Chapter 1
Introduction
Traditional web search engines use special software to navigate and browse the
web, automatically following hyperlinks from one document to another, and
extracting textual information. This information is used to build huge index
structures correlating keywords, hyperlinks, and other document features to web
pages (Sun and Chen 2001). Obviously such tasks require powerful servers,
massive amounts of storage and bandwidth to monitor the changing web and to
index all the web pages. It is estimated that the volume of information on the web is about 66.8 to 91.9 Petabytes (10^15 bytes) (Charles, Good et al. 2003), over three times the amount of information available in 2000. One research paper reports that by 2008 the size of the average web page had more than tripled since 2003 (King 2008). By that estimate, even with the same number of web pages as in 2003, there would be more than 200 Petabytes of data on the web.
Given such a large scale, it is impossible to index the whole web with classic centralized models and algorithms. As a result, decentralized models have become an alternative for web information retrieval, and Peer to Peer (P2P) information systems appear to be the most promising decentralized model.
Peer to Peer systems are decentralized, large scale computer networks in which peers operate as clients and servers at the same time (Aberer, Klemm et al. 2001). As personal computers become more powerful and disk space grows larger and cheaper, more and more people join P2P networks to share their personal documents. Existing P2P systems already form very large networks, such as the BitTorrent network and the eDonkey network, with millions of personal computers participating. Obviously, exhaustively indexing the whole web is impossible even with this many computers in P2P networks. Nevertheless, it is interesting to investigate how P2P information sharing systems might provide collaborative P2P information retrieval (P2PIR) on networks of PCs. At the same time, more and more people can read more than one language and would like to search for documents in their second or third language; for example, many Chinese researchers need to find English documents for reference. However, none of the popular P2P systems provides multilingual features: the search engine searches documents in only one language. As a result, users have to translate queries manually from one language to another and run both the original and the translated queries to find documents in different languages. This is inconvenient for users.
1.1 Background
There are a number of problems in P2PIR systems. Some of them are introduced here and discussed in more detail later in the thesis. The first problem is that current P2P systems lack IR features.
Public domain P2P systems are widely used for Internet file sharing and instant messaging (Ng, Ooi et al. 2003); Napster, Kazaa and ICQ are familiar examples. However, such systems focus on robust connectivity and scalability. They do not meet the very demanding requirements of information retrieval operations, which are of primary interest in group collaboration for information sharing. Compared with traditional centralized search engines, P2P systems are relatively simple: they provide only file level search, which means, for instance, that one can search by document title but not by document content. Therefore, basic IR operations, such as text searching and result ranking, are not possible. This limitation makes it difficult, if not impossible, for users to find the documents they really need.
The second problem is that current applications lack security features. Most popular P2P systems assume complete trust between peers. Therefore, no access control mechanisms are provided to ensure security, privacy, and confidentiality (Sun and Chen 2001). But access control is very important for
P2PIR in order to support secure collaborative information sharing. For example,
a large medical research team may be distributed around the globe, working
jointly on a specific problem. The team might have to share patient clinical data
that is held by team members on distributed private and secure PCs. Such
information must be kept private and centralization of records may be completely
out of the question. It cannot be placed in the public domain and so there is no
scope for exploiting public domain P2P file sharing software, or the available, but
very intrusive, public domain search engines. At the same time, without going into
the expense of establishing and maintaining a private network and a centralized
search service, it is not possible to provide a comprehensive solution that is as
easy to join and use, as the existing public domain systems. Furthermore, public
domain P2P systems cannot be used in environments where users are locked
behind firewalls. It is very common for different user groups to exist behind
different organizations’ firewalls, sometimes even being compartmentalized
within a single organization. This greatly restricts the use of direct P2P systems in
collaborative information sharing.
Some recent distributed systems (McDaniel, Prakash et al. 1999; Saxena, Tsudik et al. 2003) do focus on security, but they are designed specifically for particular group collaborations. Such systems do not have the appealing general usability and scalability of open P2P systems. They also require adherence to standards, centralization, and coordination, none of which are enforceable, or even desirable, in a voluntary Special Interest Group (SIG) environment with truly independent members. The cost of membership in terms
of effort of participation must be close to zero for such groups to prosper. A
fundamental lesson that can be learned from the Internet phenomena is that
decentralization and minimal control are critical success factors in voluntary
collaboration.
The third problem with current systems is the lack of multilingual features: current P2P-IR systems allow only monolingual information retrieval. As more and more documents written in various languages become available on the Internet, users increasingly wish to explore documents written either in their native language or in some other language. Cross-language information retrieval (CLIR) systems enable users to retrieve documents written in more than one language through a single query, which is a helpful end user feature. For example, a researcher from China may need to find documents in both Chinese and English; CLIR could help him search with a single query, in either English or Chinese, and find articles in both languages.

Moreover, the use of CLIR reduces both the communication cost of the whole peer to peer system and the CPU load. In a monolingual environment, a user who wants documents in two languages must send out two queries, and two sets of documents will be sent back. If the cut-off is 100 documents and 10 peers return results, the user receives 2000 documents. In a cross-language IR environment, the user sends out only one query, so only one set of results is sent back from the remote peers. With the same cut-off of 100 documents and 10 responding peers, the user receives 1000 documents. That is a saving of 50% in communication cost, and likewise a saving of 50% in CPU time during collection fusion. The more languages the user is searching in, the more CLIR saves; for example, if the user is looking for documents in 5 languages, the saving is 80%.

In addition, CLIR can provide better retrieval results. Chinese is used in different regions, and the translation of the same English term can differ considerably from region to region. For example, the movie "Planet of the Apes" is translated as "决战猩球" in Taiwan but "猿人争霸战" in Hong Kong. No Chinese speaker can be expected to know the translations used in every region; a user who chooses only the translation he knows will miss all the related documents that use other translations. CLIR can find all possible translations, so no documents are missed. Finally, CLIR can be used in tasks that humans cannot do by hand, such as query expansion and relevance feedback, where a huge number of terms need to be translated. Most end users would want such a process to be automated and would not want to be involved in the translation.
The fourth problem with current systems is the lack of resource discovery and result merging. Much research in distributed information retrieval pays attention to resource discovery, yet most P2P-IR systems use only a very simple approach. Unstructured P2P systems, like Gnutella and Kazaa, typically broadcast search queries to neighbouring peers to locate information. While such approaches are effective for finding highly replicated items, their performance is very poor for finding rare items. In contrast, structured P2P systems, such as the BitTorrent and eDonkey networks, use central servers as a directory service and use a distributed hash table abstraction to identify remote resources. Such networks are good at finding rare items, but tend to incur higher costs than unstructured systems.
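As a concrete illustration of the distributed hash table abstraction, here is a minimal Chord-style lookup sketch. The SHA-1 hash, the 16-bit identifier ring, and the node identifiers are illustrative assumptions, not the design of any particular network:

```python
import hashlib

def key_to_id(key: str, bits: int = 16) -> int:
    """Map a resource key onto the identifier ring (SHA-1, truncated)."""
    return int(hashlib.sha1(key.encode()).hexdigest(), 16) % (2 ** bits)

def responsible_node(key: str, node_ids: list) -> int:
    """Chord-style rule: the resource lives on the first node whose id
    is >= the key's id, wrapping around the ring if necessary."""
    kid = key_to_id(key)
    for nid in sorted(node_ids):
        if nid >= kid:
            return nid
    return min(node_ids)  # wrap around to the smallest id

nodes = [1024, 20000, 45000, 61000]
owner = responsible_node("rare-document.pdf", nodes)
# Any peer can compute `owner` locally, which is why even a rare item
# can be located without broadcasting the query to all neighbours.
```

The extra cost of structured systems comes from maintaining this mapping as peers join and leave, which the sketch deliberately omits.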
Result merging in most popular P2P systems is very simple because those systems lack IR features: they simply concatenate results, and users must scan through the whole list to find the relevant documents. One basic improvement is to sort the merged list by document score. However, in most retrieval models, such as the Boolean model, the probability model and the vector space model, collection statistics (e.g., the size of the collection and inverse document/term frequencies) are used to calculate document scores. The use of collection statistics makes document scores differ between collections; even the same document will receive different scores in different collections. Therefore, document scores from different clients are not comparable, and sorting the merged list by document score does not provide good enough results.
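The incomparability of raw scores can be seen with a toy TF-IDF scorer (a minimal sketch; real retrieval models are more elaborate, but they use collection statistics in the same way):

```python
import math

def tfidf_score(query_terms, doc_terms, collection_docfreq, collection_size):
    """Toy TF-IDF score: sum over query terms of tf * idf.

    The idf component depends on collection statistics (document
    frequency and collection size), which is exactly why the same
    document scores differently in different collections.
    """
    score = 0.0
    for t in query_terms:
        tf = doc_terms.count(t)
        df = collection_docfreq.get(t, 0)
        idf = math.log((collection_size + 1) / (df + 1))
        score += tf * idf
    return score

doc = ["peer", "to", "peer", "retrieval"]
query = ["peer", "retrieval"]

# The same document scored against two collections whose statistics differ:
score_a = tfidf_score(query, doc, {"peer": 5, "retrieval": 2}, 1000)
score_b = tfidf_score(query, doc, {"peer": 400, "retrieval": 300}, 1000)
# score_a > score_b: an identical document gets different scores, so raw
# scores from different peers cannot simply be sorted into one list.
```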
1.2 Contributions
The principal observation is that it is generally considered too expensive to provide a dedicated centralized information sharing mechanism to large groups of independent users from various countries. Many real world IR systems host documents in multiple languages (e.g. www.Wikipedia.org) but allow only monolingual queries: a query typed in one language returns documents in that language, and documents in other languages require further queries in those languages. In addition, an information retrieval system for a large scale, secured Special Interest Group (SIG) environment should combine the features of peer to peer information retrieval systems and cross-language information retrieval systems: the former solve the communication, security and collection profiling problems, and the latter provide the multilingual functionality.
Firewalls are another serious obstacle for conventional P2PIR systems. Under the current Internet infrastructure, peers behind different firewalls cannot make direct connections to each other when port forwarding is unavailable. For security and management reasons, large organizations usually do not allow port forwarding to enable P2P connections. Such policies make P2PIR systems virtually unusable for researchers, as most researchers' computers sit behind corporate firewalls.
The objective of this research is to address three major issues in peer to peer systems: communication, CLIR and collection fusion. The system proposed in this thesis is a workable P2PIR system that supports secure collaborative information retrieval over private collections in multiple languages. It is intended to allow special interest groups to securely and effectively search and share private collections over the Internet, and to support state of the art cross-language information retrieval functionality without the excessive costs of centralized solutions. This is achieved by exploiting the distributed resources of SIG members, each with their own document collection, storage capacity, processing capacity and local search engine. Although we used our own search engine in our experiments, any search engine (e.g. Google Desktop or the MS search engine) can be used in the system. The major contribution of this thesis is to add a cross-language IR feature to the peer to peer IR system.
In order to evaluate the proposed approach, a P2PIR distributed information retrieval prototype has been developed that addresses all the issues discussed above. Not only can it be used over the public domain like popular P2P systems, but it supports powerful generic text searches while also providing the security and multilingual information retrieval features necessary for private group collaboration.
The P2PIR system offers some unique advantages over conventional P2P systems:
• Offline processing: a user can be offline when a query is issued. The query is stored in a mailbox, so when the user’s machine goes online the system can act on and respond to the query.
• High-level encryption is used at the email application layer without impeding the operation of the system over the public email infrastructure.
• Advanced searching functionality.
• Tunnelling through firewalls.
• Server-less operation.
• Complete access control by end users.
• Multilingual search engine.
• Advanced collection profiling and result merging.
To achieve the goals of the P2PIR system, several new approaches are proposed in this thesis. The main contributions are summarized as follows:
• An XML based communication protocol with advanced security features.
• Web based query translation, including English–Chinese translation extraction, term extraction and translation disambiguation.
• Web based collection profiling, including Chinese query classification, a new remote system performance measurement, and result fusion based on the new measurement.
1.3 Thesis outline
The rest of this thesis is organized as follows:
Chapter 2: this chapter is a literature review of related technologies and disciplines in the areas of distributed systems and multilingual information retrieval systems. It focuses on the latest work on P2P system architecture, collection selection, collection fusion and query translation.
Chapter 3: this chapter presents a proposed algorithm for Chinese keyword extraction as well as an algorithm for translation selection. These two algorithms make up the web based query translation approach. To make the algorithms easier to understand, this chapter also provides a worked example of translating English terms into Chinese using the proposed approach.
Chapter 5: a web based collection profiling strategy is introduced in this chapter.
Some new technologies are also proposed to support the strategy including a web
based Chinese query term classification method, a remote system performance
measurement, a collection selection strategy, and a collection fusion strategy.
Chapter 7: this chapter demonstrates a peer to peer system that implements all the proposed algorithms and strategies introduced in this thesis. An email based peer to peer communication protocol is presented in this chapter as well.
Chapters 4, 6 and 8 are evaluations of all the proposed approaches. Detailed analysis and comparison results are also included in those chapters.
Chapter 2
Literature review
The purpose of this literature review is to analyse existing information retrieval
techniques for distributed systems, peer to peer systems and multilingual
systems. In particular, this review is focused on the problems described below:
• Peer to peer information retrieval system architecture: how to manage remote peers, including how to discover remote peers and how peers communicate with each other.
• Resource Description: how to learn and describe the topics covered by different data collections.
• Resource Selection: given an information need (a query) and a set of resource descriptions, how to select a set of resources (text databases or collections) to search.
• Query Translation: given an information need in some base representation, how to map it automatically to representations (query languages and vocabularies) that are appropriate for the selected databases.
• Result Merging: after result lists have been returned from the selected databases, how to integrate them into a single ranked list.
2.1 Evaluation of Peer to Peer Information Retrieval System
An information system is a system that can deliver information from a collection (documents) relevant to a given search criterion (query). Evaluating such a system means investigating how well the system works. According to the objective of the investigation, the evaluation of an IR system is usually divided into six levels (Saracevic 1995):
1. Engineering level: the objective of this level is to inspect the performance of hardware and software, such as reliability, speed, flexibility, etc.
2. Input level: the objective of this level is to question the coverage of the designated area.
3. Processing level: the objective of this level is to examine the performance of the algorithms, techniques and approaches, such as computational effectiveness, efficiency of the algorithms, etc.
4. Output level: the objective of this level is to investigate the effectiveness of outputs, e.g. the quality of results.
5. User level: the objective of this level is to question user satisfaction.
6. Social level: this level studies the social impact of the system.
Nowadays, most researchers in the IR community use output-level evaluation to measure the quality of IR systems. Recall and precision are the two major properties that have been accepted as measurements of search effectiveness (Singhal 2001).
Precision is the fraction of the documents retrieved that are relevant to the given
query. It is represented by equation:
Precision = |{relevant documents} ∩ {retrieved documents}| / |{retrieved documents}|
Usually precision considers all documents retrieved. It is also possible to evaluate
by a cut-off number of documents, represented as P@n. For example, P@100
represents the precision at the top 100 documents.
Recall is the fraction of the documents that are relevant to the query that are
successfully retrieved. It is represented by equation:
Recall = |{relevant documents} ∩ {retrieved documents}| / |{relevant documents}|
A perfect IR system should retrieve as many relevant documents as possible, which means high recall, and it should retrieve as few non-relevant documents as possible, which means high precision. Unfortunately, these two goals have proven to be conflicting over years of research: techniques that improve precision tend to hurt recall, and vice versa. In recent years, researchers have preferred average precision to measure the effectiveness of IR systems. Average precision combines information from both precision and recall: it is computed by measuring precision at different recall points (e.g. 10%, 20%, etc.) and averaging (Salton and McGill 1996).
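The measures above can be sketched in a few lines of code. The following minimal illustration uses the common per-relevant-document variant of average precision rather than fixed recall points; the document IDs are invented for the example:

```python
def precision(relevant, retrieved):
    """Fraction of retrieved documents that are relevant."""
    retrieved = list(retrieved)
    if not retrieved:
        return 0.0
    return len(set(relevant) & set(retrieved)) / len(retrieved)

def recall(relevant, retrieved):
    """Fraction of relevant documents that were retrieved."""
    if not relevant:
        return 0.0
    return len(set(relevant) & set(retrieved)) / len(set(relevant))

def precision_at_n(relevant, ranked, n):
    """P@n: precision over the top n documents of a ranked list."""
    return precision(relevant, ranked[:n])

def average_precision(relevant, ranked):
    """Mean of the precision values taken at each rank where a
    relevant document appears (a common variant of average precision)."""
    hits, total = 0, 0.0
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant) if relevant else 0.0

# Tiny illustration: 4 relevant documents, a ranked result list of 5.
rel = {"d1", "d2", "d3", "d4"}
run = ["d1", "d9", "d2", "d8", "d3"]
print(precision(rel, run))          # 3 relevant of 5 retrieved -> 0.6
print(recall(rel, run))             # 3 of 4 relevant found -> 0.75
print(precision_at_n(rel, run, 2))  # only d1 relevant in top 2 -> 0.5
```

The tension between the two measures is visible even here: retrieving more documents can only raise recall, while precision falls whenever the extra documents are non-relevant.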
Evaluations of peer to peer systems mostly focus on the engineering level (Ehrig, Schmitz et al. 2004). The common measurements used are:
• Reliability: this measures the degree of failure tolerated when the system breaks down, or the percentage of actual breakdowns of the system.
• Response time: this measures the time from when a query is sent to when results are received.
• Network load: this figure shows the network traffic of the system.
As peer to peer systems are designed for different needs, there is no standard measurement for all peer to peer systems. For example, for IM (instant messaging) systems, response time is the most critical measurement, while reliability is more important in file sharing systems.
2.2 Peer to peer information retrieval system architecture
According to the relationship between peers, peer to peer information retrieval
systems can be divided into cooperative systems and uncooperative systems. In a
cooperative environment, various kinds of information such as resource
description, collection index and collection statistics, etc., are usually held in a
central place. Peers can use such information to help their search. In an
uncooperative environment, each peer is independent and knows nothing about
others. Peers can answer queries and return documents, but they do not provide
other information such as collection statistics, collection description or retrieval
model.
According to the network structure, peer to peer systems can be categorized as centralized, structured and unstructured networks (Lv, Cao et al. 2002). Structured and unstructured networks are both decentralized architectures.
2.2.1 Centralized architecture
Centralized peer to peer architecture is a mix of the traditional client-server architecture and pure peer to peer architecture. In this architecture, some nodes work as servers. Unlike client-server architecture, the servers in a centralized peer to peer architecture only provide a directory service to other nodes; they do not hold any other information resources. All information resources are located on the other peers. Centralized architecture faces the same problems as client-server architecture: single point of failure and scalability are the main issues. One directory server can only handle a certain number of requests from peers, so the response time increases when peers submit more requests, and a server failure results in the whole network ceasing to work. An easy solution is adding more servers to the network. Centralized systems have become more practical in the real world: most real world peer to peer systems, such as the BitTorrent and eDonkey networks, are based on this architecture. One reason is that it is similar to the existing client-server architecture, so migrating to this architecture from an existing system is easy. Another benefit of the centralized architecture is that it is easy to manage. In addition, a centralized architecture can reduce broadcasting and improve network usage, so it can provide higher scalability. Security might be another important reason: “Identity management, authentication and authorization cannot be done in a global scale, they have to be domain or realm based one way or another” (Sankar 2003).
2.2.2 Unstructured architecture
Under the unstructured architecture, there is no directory server in the network. All the peers in the system are equal: each can issue requests, respond to requests from others, or route requests to other nodes. One of the best known peer to peer systems using this architecture is Gnutella (http://www.gnutella.com/). Peers typically flood (broadcast) search queries by forwarding them to their neighbours to locate information. Flooding-based approaches are effective for finding popular items, but their performance is quite poor for rare items. In addition, in a badly designed system a large number of peers can easily flood the entire network.
Improving the scalability of the unstructured architecture is a major research topic. Several techniques have been developed to reduce the number of peers that are visited while broadcasting queries. The architecture proposed by Zeinalipour-Yazti (Zeinalipour-Yazti, Kalogeraki et al. 2003) tries to reduce the time needed to find the clients that contain the documents. The key idea is that each client keeps a log of past queries and uses it to propagate query messages to only a subset of other clients. At the beginning, a query is sent to all other clients. The system automatically records the query and the clients that returned “good” results. After the system has run for some time, it holds a record showing which client contains “good” results for which kind of query. The system then sends a query only to those clients, reducing the network traffic. In other words, the metadata for collection selection is gathered at run time, while the traditional method gathers it before the system runs.
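The routing idea described above can be sketched roughly as follows. This is a simplified illustration, not the authors’ actual implementation; the class and method names are invented for the example:

```python
class RoutingLog:
    """Minimal sketch of log-based query routing: remember which peers
    returned good results for past query terms, and route new queries
    only to peers that previously did well on overlapping terms."""

    def __init__(self, all_peers):
        self.all_peers = set(all_peers)
        self.good = {}  # term -> set of peers that returned "good" results

    def record(self, query_terms, peer):
        """Log that `peer` returned good results for these query terms."""
        for t in query_terms:
            self.good.setdefault(t, set()).add(peer)

    def route(self, query_terms):
        """Union of peers known to be good for any query term; fall back
        to flooding every peer when the query terms are all unseen."""
        candidates = set()
        for t in query_terms:
            candidates |= self.good.get(t, set())
        return candidates or set(self.all_peers)

log = RoutingLog(["p1", "p2", "p3"])
log.record(["jaguar", "car"], "p2")
print(sorted(log.route(["jaguar"])))  # routed only to p2
print(sorted(log.route(["opera"])))   # unseen term: flood all peers
```

The fall-back to flooding mirrors the described behaviour at system start-up, before any “good result” records exist.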
2.2.3 Structured architecture
The structured architecture was proposed (Callan, Lu et al. 1995; Bawa, Manku et al. 2003; Ng, Ooi et al. 2003; Tang, Xu et al. 2003) to solve the fundamental limitation of the unstructured architecture. Under this architecture, peers are grouped or clustered. The peer to peer topology is tightly controlled, and files are placed not at random nodes but at specified locations that make subsequent queries easier to satisfy (Lv, Cao et al. 2002). For example, a distributed hash table (DHT) abstraction can be used to answer queries efficiently. The remaining question is how to organize the peers.
In (Bawa, Manku et al. 2003; Tang, Xu et al. 2003), this problem is directly addressed by reassigning documents to peers so that semantically related documents are clustered. The peers are organized in Content Addressable Networks (CANs), which are used to partition a logical document space into zones. Using Latent Semantic Indexing (LSI) to generate semantic vectors as keys, documents are then mapped into the DHT. With this approach, documents relevant to a given query tend to cluster on a small number of neighbours, as shown in figure 2-1.
Figure 2–1 Search in a semantic space
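The mapping from semantic vectors to CAN zones can be illustrated with a toy two-dimensional example. This is only a sketch of the general idea, not the cited papers’ actual scheme: the real systems use high-dimensional LSI vectors and dynamic zone splitting.

```python
def can_zone(vec, splits=4):
    """Toy CAN mapping: partition a 2-D logical semantic space into a
    splits x splits grid of zones and map a document's (normalised)
    semantic vector to the zone -- and hence the peer -- that owns
    that region. Similar vectors land in the same or adjacent zones."""
    x, y = vec
    # Clamp coordinates into [0, 1) so each vector falls in exactly one zone.
    x = min(max(x, 0.0), 0.999999)
    y = min(max(y, 0.0), 0.999999)
    return (int(x * splits), int(y * splits))

# Two documents with similar semantic vectors map to the same zone,
# so a query near them only needs to visit that neighbourhood.
print(can_zone((0.10, 0.80)))  # (0, 3)
print(can_zone((0.12, 0.78)))  # (0, 3)
print(can_zone((0.90, 0.10)))  # (3, 0)
```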
In (Bawa, Manku et al. 2003), the authors created a dynamic peer to peer network, while in (Tang, Xu et al. 2003) the peer to peer network is static. When a new node joins the network, it gets a segment from its neighbours and joins an appropriate segment.
2.3 Firewall traversal
As discussed in 1.2, firewall traversal is one of the major issues in peer to peer systems. Some approaches used in real world products (e.g. Skype) can provide firewall traversal, such as STUN (Rosenberg, Weinberger et al. 2003) and TURN (Rosenberg, Mahy et al. 2005). Such protocols require a server on the public internet. Before two peers connect to each other directly, they first connect to the server. The server then analyses the network packets to discover the IP addresses and ports of the clients. After that, the two peers can make a direct connection based on those IP addresses and ports. The authors already note that providing the server in the network results in high cost, and such protocols can only provide one-to-one connections. Therefore, such protocols are not suitable for unstructured peer to peer environments.
2.4 Distributed computing and Peer to peer system
“Distributed computing system combines multiple computers connected by a network to solve a single problem” (Brown 1999). Some studies (Brown 1999; Sun and Chen 2001) of distributed computing systems come from the study of parallel computing systems, because a distributed computing system can be treated as a parallel computing system with relatively slow communication channels between processors. Each computer in a distributed computing system can be treated as a processor in a parallel computing system, but with more freedom. Moreover, each computer in a distributed computing system may have a different CPU, a different OS and different communication software. The only thing those computers must share is the same communication protocol.
“Peer to peer systems are distributed systems in which nodes of equal roles and capabilities exchange information and services directly with each other” (Beverly and Hector 2001). The main benefits of P2P systems include improving scalability by avoiding dependency on centralized points, enabling flexible information sharing between peers, and low cost (Sankar 2003). According to de Kretser’s study (de Kretser, Moffat et al. 1998), a distributed computer system built on a set of low-end PCs can provide performance similar to high-end servers. Peer to peer data systems with high speed connections can be used as an alternative to database clustering.
As we can see, peer to peer systems can be treated as a subcategory of distributed systems, and they share the same difficulties in distributed processing. However, a peer to peer system has its own issues (Bawa, Manku et al. 2003; Ng, Ooi et al. 2003). First, peer to peer systems are usually dynamic while distributed systems are always static: in a peer to peer environment, a node can dynamically join and leave the system, whereas in distributed systems the components are fixed. Second, in a peer to peer environment each node is equal and performs the same task, while in distributed systems each component usually performs a different task. Third, in distributed systems global information is available in a central place; in peer to peer systems, global information is not always available. Therefore, some techniques used in distributed systems do not fit peer to peer systems.
2.5 Collection selection and fusion
Collection selection is defined as the process of determining which of the distributed document collections are most likely to contain relevant documents for the current query and therefore should receive the query for processing (Brown 1999). Collection fusion (or merging) refers to integrating the results from each individual distributed client. The final merged result list should include as many relevant documents as possible, and the relevant documents should have higher ranks (Powell and French 2003).
Figure 2–2 The process of collection selection
The process of collection selection and collection fusion is shown in figure 2-2. A query q comes to the collection selection model, which ranks the entire set of collections based on the similarity between each resource and the query. The top ranked collections are then selected, and the query is routed to and run on the selected collections, e.g. collections c1 and c3 in the figure. Finally, the results are merged into a single list. The aim of collection selection is to use resources, including bandwidth and processor power, efficiently. Collection selection not only decreases the time taken to send out a query and return results, but also decreases the time needed to merge results from different collections. Collection selection was first used to retrieve from multiple collections in a centralized environment, e.g. multiple collections stored on a single machine or on different machines with a high speed connection. With the rise of distributed information systems, it is now used in peer to peer systems to reduce network traffic.
Merging results into a single result list is a difficult task because the document scores returned by distributed collections are usually not comparable. In most retrieval models, such as the Boolean model, the probability model and the vector space model, collection statistics (e.g., size of the collection, inverse document/term frequency, etc.) are used to calculate document scores. The use of collection statistics makes the document scores calculated by different search engines incompatible: even the same document may have different scores if they are calculated in different collections. Therefore, the document scores are incomparable, and merging results from different collections becomes a very complex task.
2.5.1 Collection rank
From the section above we can see that collection ranking has a significant impact on collection selection. In a cooperative environment, a global index is the most common technique for collection ranking. In the global index architecture, there is usually a directory server that holds all the information about the distributed collections. Peers can get global collection statistics via the directory server and then merge the distributed results together. With the help of global statistics, the whole peer to peer network can be treated as a single centralized collection. Queries can be run on the local machine or on the central server without sending them to the remote peers; the remote peers only need to provide the documents requested. STARTS (Gravano, Chang et al. 1997) is one of the best known protocols for peer to peer communication in cooperative environments. Clients can exchange their collection statistics via the STARTS protocol. The global index architecture can achieve nearly 100% of centralized information retrieval performance because a client has all the information needed to calculate a document score as if the documents were in a centralized place. This architecture requires deep collaboration between clients and fits well when all the clients are happy to share their entire collections. However, it is not practical in real world large scale distributed networks because not all clients want to share their collection information.
GlOSS (Gravano and García-Molina 1995), CORI (Callan, Lu et al. 1995) and CVV (Yuwono and Lee 1997) are three of the best known collection ranking approaches that require far less cooperation between peers. All three approaches are based on calculating a collection’s similarity to a query, and all require a resource description (see 2.5.2 for details) before the calculation process.
In GlOSS, the similarity is calculated from the number of documents in the collection that are relevant to the query. This algorithm works well with large collections of heterogeneous data. Gravano and García-Molina (Gravano and García-Molina 1995) also suggest a variation of GlOSS known as gGlOSS, which is based on the vector space model: the similarity is calculated from the vector sum of the relevant documents instead of the number of relevant documents.
CORI is based on a probabilistic inference network, which was originally used for document selection. Callan (Callan, Lu et al. 1995) introduced this algorithm for collection selection. It uses document frequency (df, the number of documents containing the query term in a collection) and inverse collection frequency (icf, based on the number of collections not containing the query term) to calculate collection ranks. One of the advantages of CORI is that its resource description uses only 0.4% of the space of the original collection.
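A simplified version of the commonly cited CORI belief formula can be sketched as follows. The constants (0.4, 50, 150) follow the usual published formulation and may differ from the exact variant used elsewhere; the toy collection statistics are invented for illustration:

```python
import math

def cori_score(query_terms, coll, all_colls, b=0.4):
    """Simplified CORI belief score for one collection (a sketch of the
    commonly cited formulation, not necessarily the exact thesis variant).
    coll["df"]: term -> document frequency; coll["cw"]: collection word count."""
    n = len(all_colls)
    avg_cw = sum(c["cw"] for c in all_colls) / n
    score = 0.0
    for t in query_terms:
        df = coll["df"].get(t, 0)
        # cf: number of collections containing the term (the icf component).
        cf = sum(1 for c in all_colls if c["df"].get(t, 0) > 0)
        T = df / (df + 50 + 150 * coll["cw"] / avg_cw)
        I = math.log((n + 0.5) / (cf + 1e-9)) / math.log(n + 1.0)
        score += b + (1 - b) * T * I
    return score / len(query_terms)

colls = [
    {"df": {"sars": 40, "china": 10}, "cw": 1000},  # health-focused
    {"df": {"china": 50},             "cw": 1000},  # news-focused
    {"df": {},                        "cw": 1000},  # unrelated
]
ranked = sorted(colls, key=lambda c: cori_score(["sars"], c, colls), reverse=True)
print(ranked[0]["df"])  # the collection with many 'sars' documents ranks first
```

Note how little information the score needs: per-term document frequencies and collection sizes, which is why CORI’s resource descriptions can stay so small relative to the collections themselves.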
The CVV ranking algorithm uses a combination of document frequency and cue validity variance information (Yuwono and Lee 1997). The cue validity variance characterizes the variability of the fraction of documents in a database that contain a specific word.
2.5.2 Getting resource description
Although GlOSS, CORI and CVV do not require deep cooperation between peers, they still need an accurate representation of remote peers (a so-called resource description). In practice, such information is unlikely to be available, especially in a peer to peer environment. Automatic generation of resource descriptions has been studied since the early 1980s (Callan and Connell 2001). Query-based sampling (Hawking and Thistlewaite 1999; Callan and Connell 2001; Si and Callan 2003) is the most popular technique for discovering resource descriptions in uncooperative environments. The basic idea is to send queries to remote peers and examine term frequencies in the returned documents.
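The query-based sampling loop can be sketched as follows, assuming the remote peer exposes only an opaque `search` interface. The seed terms, document set and parameters are all invented for illustration:

```python
import collections
import random

def query_based_sample(search, seed_terms, n_queries=5, top_k=4, rng=None):
    """Sketch of query-based sampling: repeatedly pick a query term
    (seeded first from a start list, then from already-sampled documents),
    ask the remote peer for its top documents, and build a term-frequency
    resource description from whatever comes back."""
    rng = rng or random.Random(0)
    vocab = collections.Counter()
    sampled_docs = set()
    pool = list(seed_terms)
    for _ in range(n_queries):
        term = rng.choice(pool)
        for doc in search(term)[:top_k]:
            if doc in sampled_docs:
                continue
            sampled_docs.add(doc)
            words = doc.split()
            vocab.update(words)
            pool.extend(words)  # future query terms drawn from the samples
    return vocab

# A toy "remote peer": three documents, searched by substring match.
DOCS = ["p2p network search", "network protocol design", "query translation chinese"]
def search(term):
    return [d for d in DOCS if term in d]

desc = query_based_sample(search, ["network"])
print(desc.most_common(2))  # terms from the sampled documents dominate
```

The learned description only covers what the sampling queries happen to reach (the third document is never sampled here), which is exactly why the literature studies how many queries are enough.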
D. Hawking’s approach requires specially designed queries in the sampling process; it also incurs significant communication costs in wide area networks. J. Callan (Callan and Connell 2001) proposed a more flexible approach that does not require special queries and whose communication cost is reasonable. It has been shown that running 300 queries in the process can provide a good enough resource description (Callan and Connell 2001; Rasolofo, Abbaci et al. 2001).
Si and Callan extended this study in 2003 (Si and Callan 2003). They found that most of the older resource selection approaches do not work well when the remote collections are a mix of small and very large databases. They then proposed an approach that acquires database size estimates in uncooperative environments as an extension of the query-based sampling used to acquire resource descriptions. Based on the assumption that the documents sampled from a database are a good representation of the whole database, the size of a remote collection can be easily estimated. Then, used together with the standard query sampling technique, relevant document distributions can be estimated.
Instead of getting sample documents from collections directly, some researchers (Ipeirotis, Gravano et al. 2001) suggested classifying remote collections to help collection selection. To classify a database, their algorithm does not retrieve or inspect any documents or pages from the database; rather, it exploits the number of matches that each query probe generates at the database in question. In their later research (Ipeirotis and Gravano 2001), they use this collection classification algorithm to help collection selection: remote collections are only selected when they are in the same topic category as the input query.
Relevance feedback (Choi and Yoo 2001; Liu, Yu et al. 2002) is another popular approach for discovering resource descriptions in uncooperative environments. The basic idea is to acquire IR knowledge from relevance feedback so that the system can discover the text databases that contain documents relevant to the user’s interests.
Choi’s approach is a typical relevance feedback system. In the training process, humans are involved to determine the usefulness of documents returned from remote systems. The authors also note that the major problem of such a system is that re-training is required when new remote systems are added. One possible solution is to use a hierarchical multi-agent structure to break the knowledge of remote systems into small hierarchical parts; as a result, when a new system is added, only a small part of the system needs re-training.
In Liu’s (Liu, Yu et al. 2002) approach, the feedback is given not by a human but by the machine. The quality of a remote system is measured by two factors: the number of documents returned by the remote system and the average similarity of the returned documents.
2.5.3 Collection fusion approaches
Collection fusion is the last step in the peer to peer information retrieval process: it produces a single list of document results. In a cooperative environment, a global index is the most common technique for result merging. Since the problem of result merging comes from the lack of collection statistics and information about the retrieval model, if the distributed collections could be treated as a logically centralized collection whose documents are merely physically distributed, there would be no merging problem.
A global index is not available in uncooperative environments. An alternative approach to recalculating document scores in uncooperative environments is to download either the entire remote document set (Kirsch 1997) or part of it (Craswell, Hawking et al. 1999). Obviously, the approach in (Kirsch 1997) significantly increases network traffic because every search query requires downloading an entire collection. The approach in (Craswell, Hawking et al. 1999) reduces the traffic to 10% because it only downloads the top 10% of the documents, but this is still a large amount of network traffic if the system runs for a long time.
Normalized score merging is a solution to collection fusion in real world large scale uncooperative distributed information retrieval (Liu, Yu et al. 2001; Rasolofo, Abbaci et al. 2001; Si and Callan 2002). It is considered the most accurate solution (Viles and French 1995). When merging documents returned from peer collections, the client normalizes the document scores based on the ranks of their corresponding collections. For example, a document score is increased if it comes from an important collection and decreased if it comes from a less important collection.
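The effect of weighting document scores by collection rank can be shown with a small sketch; the data structures and weights here are invented for illustration:

```python
def normalized_merge(result_lists, coll_ranks, top_n=10):
    """Sketch of normalized-score merging: weight each document score by
    the rank (weight) of the collection it came from, then sort globally.
    result_lists: {collection: [(doc, local_score), ...]}
    coll_ranks:   {collection: weight in (0, 1]}"""
    merged = []
    for coll, results in result_lists.items():
        w = coll_ranks[coll]
        for doc, score in results:
            merged.append((doc, score * w, coll))
    merged.sort(key=lambda x: x[1], reverse=True)
    return merged[:top_n]

results = {
    "A": [("a1", 0.9), ("a2", 0.5)],
    "B": [("b1", 0.9), ("b2", 0.8)],
}
ranks = {"A": 1.0, "B": 0.5}  # collection A judged far more relevant
print([doc for doc, score, coll in normalized_merge(results, ranks)])
# a1 keeps its score, while b1's 0.9 is discounted to 0.45 by its collection rank
```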
Some researchers (Liu, Yu et al. 2001) use a sampling technique to normalize document scores. It is based on the observation that, for many combinations of global and local term-weighting formulas, documents that have relatively large local weights for a term tend to have relatively large global weights for that term. In the training stage, they first randomly select some queries and retrieve document scores from the remote collections. From the local and remote document scores they derive the ratio between the two. Then, in the merge stage, they normalize document scores based on this ratio.
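One simple way to realize the training stage described above is a least-squares fit of a single scaling ratio between the two score scales. This is only a sketch of the general idea, with invented training scores, not the cited authors’ exact procedure:

```python
def score_ratio(local_scores, remote_scores):
    """Learn one scaling ratio k such that local ~= k * remote, via a
    least-squares fit through the origin over the training pairs."""
    num = sum(l * r for l, r in zip(local_scores, remote_scores))
    den = sum(r * r for r in remote_scores)
    return num / den

# Training pairs: this remote engine consistently scores ~2x higher
# than the local one for the same (sampled) documents.
local  = [0.30, 0.45, 0.60]
remote = [0.60, 0.90, 1.20]
k = score_ratio(local, remote)
print(k)  # ~0.5: a remote score of 1.0 maps to ~0.5 on the local scale
```

At merge time, each remote score would be multiplied by its collection’s learned ratio before the global sort.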
The merging strategies based on CORI and GlOSS linearly combine the document scores returned from peer collections with the collection ranks to determine the resulting document ranks. This still requires that the peer collections use the same indexing and the same retrieval model so that the document scores can be normalized. In today’s P2P network environments, it is impossible to require peers to use the same software to manage their data. For example, in the popular BitTorrent and eDonkey networks, people use hundreds of different client software systems to share their files. It is reasonable to assume that the trend in P2P networks is towards varied client software under the same communication protocol. Therefore, the document scores returned from peers may be based on different retrieval models, and thus the scores cannot be normalized because they are not comparable.
The round robin merging strategy and its variations (Steidinger 2000) were proposed to deal with the case where the document scores are not comparable. The round robin merging strategy interleaves all the result lists returned from remote peers: the first document in each result list is repeatedly removed and put into the final merged result list. It has been shown to be a simple and efficient strategy for distributed collections when the remote peers have similar statistics and retrieval performance (Savoy, Calv et al. 1998). However, the round robin merging strategy fails significantly when the distributed collections have quite different collection statistics, for example when the distributed collections focus on different domains.
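The interleaving described above fits in a few lines; the peer result lists are invented for illustration:

```python
def round_robin_merge(result_lists, top_n=None):
    """Round robin merging: interleave the ranked lists from remote peers,
    repeatedly taking the head document of each list in turn. No document
    scores are needed, so it works across incompatible retrieval models."""
    merged = []
    lists = [list(r) for r in result_lists]  # copy so inputs are untouched
    while any(lists):
        for lst in lists:
            if lst:
                merged.append(lst.pop(0))
    return merged[:top_n] if top_n else merged

peer1 = ["a1", "a2", "a3"]
peer2 = ["b1", "b2"]
print(round_robin_merge([peer1, peer2]))  # ['a1', 'b1', 'a2', 'b2', 'a3']
```

The failure mode is also visible here: if peer2’s collection is off-topic, its documents are still promoted to ranks 2 and 4 regardless of relevance.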
In 1995, some researchers (Towell, Voorhees et al. 1995) suggested relevance feedback and query clustering as merging strategies that do not require document scores. Query clustering learns a measure of the quality of the search for a particular topic area on a collection. The number of documents retrieved from a collection for a new query is proportional to the value of the quality measure for that query. Query clustering uses query vectors to represent the queries; topic areas are represented as centroids of query clusters. The collection weight is the average number of relevant documents retrieved in the past. In the fusion process, each selected collection’s percentage of the total collection weight is calculated first; then the number of documents selected from each collection is calculated based on that percentage. For example, suppose four collections are used and the total number of documents required is 100. If the collection weight is 4 for collection A, 3 for collection B, 2 for collection C and 1 for collection D, then the top 40 documents should be selected from collection A, 30 from collection B, 20 from collection C and 10 from collection D. However, the authors did not mention how to sort the documents into a single result list.
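The proportional allocation in the worked example above can be sketched directly (documents lost to rounding down would still need a tie-breaking rule, which the cited work does not specify either):

```python
def docs_per_collection(weights, total_docs):
    """Allocate the requested number of documents across collections in
    proportion to their weights. Rounds down; any remainder would need a
    separate tie-breaking rule."""
    total_w = sum(weights.values())
    return {c: total_docs * w // total_w for c, w in weights.items()}

print(docs_per_collection({"A": 4, "B": 3, "C": 2, "D": 1}, 100))
# {'A': 40, 'B': 30, 'C': 20, 'D': 10}
```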
In summary, most result merging approaches require collection information, called a resource description, such as collection content and statistics. When processing a query, the collections are assigned ranks based on the similarity between the query and the resource descriptions. Then, in the merging stage, the document scores are adjusted according to the collection ranks.
2.6 Translation
Dictionary based query translation is one of the conventional approaches in CLIR.
The dictionary based translation has been adopted in cross-language information
retrieval because bilingual dictionaries are widely available, dictionary based
approaches are easy to implement, and the efficiency of word translation with a
dictionary is high. On the other hand, because of the vocabulary limitation of
dictionaries, very often the translations of some words in a query cannot be found
in a dictionary. This problem is called the Out of Vocabulary (OOV) problem. The
appearance of the OOV terms is one of the main difficulties that arise with this
approach. In the very early years, an OOV term would not be translated at all,
leaving the original term in the translated query. However, very often the OOV
terms are proper names or newly created words. Even using the best dictionary,
the OOV problem is unavoidable. As input queries are usually short queries, query
expansion does not provide enough information to help recover the missing
words. Furthermore, in many cases it is exactly the OOV terms that are crucial
words in a query. For example, a query "SARS, CHINA" may be entered by a user
in order to find information about SARS in China. SARS is a newly coined term
and may not be included in a dictionary published only a few years earlier. If the
word SARS is left out of the translated query, the user will most likely be unable
to find any relevant documents at all. As a result, the performance of a
multilingual search engine will be significantly reduced if OOV terms are not
translated.
Another problem with the dictionary based translation approach is the translation
disambiguation problem. The problem is more serious for a language which does
not have word boundaries such as Chinese. Translation disambiguation refers to
finding the most appropriate translation from several choices in the dictionary.
For example, the English word STRING has over 20 different translations in
Chinese, according to the Kingsoft online dictionary (www.kingsoft.com). One
approach is to select the most likely translation [6], usually the first one offered
by a dictionary. But even if the choices are ordered based on some criteria and
the most likely a priori translation is picked, such an approach in general has a
less than optimal probability of success. Another solution is to use all possible
translations in the query with the OR operator. However, while this approach is
likely to include the correct translation, it also introduces noise into the query.
This can lead to the retrieval of many irrelevant documents, which is of course
undesirable. Researchers (Jang, Myaeng et al. 1999; Gao, Nie et al. 2001) reported
that with this approach the precision is 50% lower than the precision obtained by
human translation.
In the following subsections, several existing translation-related approaches are
reviewed.
2.6.1 Transliteration
Proper names, such as personal names and place names, are a major source of
OOV terms because many dictionaries do not include such terms. It is common
for foreign names to be translated based on their pronunciation, so that a name
in one language and its translation in another language are pronounced similarly;
this is transliteration. Such a translation is usually made when a new proper
name is introduced from one language into another.
Some researchers (Paola and Sanjeev 2003; Yan, Gregory et al. 2003) applied the
rules of transliteration to automatically translate proper names. Basically,
transliteration first converts words in one language into phonetic symbols, and
then converts the phonetic symbols into the other language. Transliteration has
been found quite useful for proper name translation (Paola and Sanjeev 2003;
Yan, Gregory et al. 2003). However, transliteration is useful only
with a few language pairs. When dealing with the language pairs for which there
are many phonemes in one language that are not present in the other one, such
as Chinese and English, the problem is exacerbated. There are even more
problems when translating English to Chinese. Firstly, as there is no standard for
name translation in Chinese, different communities may translate a name in
different ways. For example, "Disney" is translated as "迪斯尼" in mainland
China but as "迪士尼" in Taiwan. Both translations are pronounced similarly in
Chinese but use different Chinese characters. Even a human interpreter will have
trouble determining which characters should be used.
Secondly, sometimes the Chinese translation only uses some of the phonemes of
the English name. For example, the translation of "America" is "美国", which is
based on only the second syllable of "America". Finally, the translation of a name
is not limited to transliteration alone but may also use transcription. Sometimes
the translation of a proper name even uses a mixed form of transcription and
transliteration. For example, the translation of "New Zealand" in mainland China
is "新西兰": "新" is the transcription of "New" and "西兰" is the transliteration
of "Zealand".
2.6.2 Parallel Text Mining
Parallel text is a text in one language together with its translation in another
language. The typical way to use parallel texts is to generate translation
equivalence automatically, without looking up a dictionary. It has been used in
several studies (Eijk 1993; Kupiec 1993; Smadja, McKeown et al. 1996; Nie, Simard
et al. 1999) on multilingual related tasks such as machine translation or CLIR.
The idea of parallel text mining is straightforward. Since parallel texts are texts in
two languages, it should be possible to identify corresponding sentences in two
languages. When the corresponding sentences have been correctly identified, it is
possible to learn the translation of each term in the sentences using statistical
information. It is straightforward that a term's translation will normally appear in
the corresponding sentences. Therefore, an OOV term can be translated by
mining parallel corpora. Many studies have also reported that parallel text
mining based translation can significantly improve CLIR performance (Eijk 1993;
Kupiec 1993; Smadja, McKeown et al. 1996; Nie, Simard et al. 1999).
In the early stages, parallel text based translation approaches worked on a
word-by-word basis and only domain-specific noun terms were translated. In
general, those approaches (Eijk 1993; Kupiec 1993) first align the sentences in
each corpus. Then noun phrases are identified by a part-of-speech tagger. Finally,
noun terms are mapped using a simple frequency calculation. In such translation
models, phrases, especially verb phrases, are very hard to translate. As phrases in
one language may have a different word order in another language, they cannot
be translated on a word-by-word basis. This problem in parallel text based
translation is called the collocation problem.
Some later approaches (Smadja, McKeown et al. 1996; Nie, Simard et al. 1999)
started to use more complex strategies, such as statistical association measures
or probabilistic translation, to solve the collocation problem. Smadja et al.
(Smadja, McKeown et al. 1996) proposed an approach that can translate word
pairs and phrases. In particular, they used a statistical association measure, the
Dice coefficient, to deal with the problem of collocation translation. Nie et al.
(Nie, Simard et al. 1999) proposed an approach based on a probabilistic model
that demonstrates another way to solve the collocation problem. By using parallel
texts, their translation model can return p(t|S), the probability of having the
term t of the target language in the translation of the source sentence S. Because
the probability model does not consider the order and position of words, the
collocation problem is overcome in their approach.
Some of the advantages of parallel text based approaches include very high
translation accuracy without a bilingual dictionary, and the extraction of multiple
translations with equivalent meanings that can be used for query expansion.
However, the sources of parallel corpora tend to be limited to particular domains
and language pairs. Currently, large scale parallel corpora are available only in
the form of government proceedings, e.g. Canadian parliamentary proceedings in
English and French, or Hong Kong government proceedings in Chinese and
English. Obviously, such corpora are not suitable for translating newly created
terms or domain-specific terms outside the corpus domain. As a result, current
studies of parallel text based translation focus on constructing large scale parallel
corpora in various domains from the web.
2.6.3 Web mining for query translation
Web mining for automated translation is based on the observation that a large
number of web pages on the Internet contain parallel text in several languages.
Investigation has found that when a new English term, such as a new technical
term or proper name, is introduced into Chinese, the Chinese translation of the
term and the original English term very often appear together in publications in
an attempt to avoid misunderstanding.
Some studies (Yan, Gregory et al. 2003; Lu, Chein et al. 2004) have already
addressed the problem of extracting useful information from the Internet by
using web search engines such as Google and Yahoo. Popular search engines
allow us to
search English terms for pages in a certain language, e.g., Chinese or Japanese.
The search results of web search engines are normally a long ordered list of
document titles and summaries to help users locate information. Mining the result
lists is necessary to help find translations to the unknown query terms. Some
studies (Cheng, Teng et al. 2004; Zhang and Vines 2004) have shown that such
approaches are rather effective for proper name translation.
In general, web based translation extraction approaches consist of three steps:
• Web document retrieval: use a web search engine to find documents
in the target language that contain the OOV term in the original
language, and collect the text (i.e. the summaries) in the result pages
returned by the search engine.
• Term extraction: extract the meaningful terms in the summaries where
the OOV term appears and record the terms and their frequency in the
summaries. As a term in one language could be translated to a phrase
or even a sentence, the major difficulty in term extraction is how to
extract correct MLUs from summaries (refer to Section 2.7 for the
definition of MLUs).
• Translation selection: select the appropriate translation from the
extracted words. As the previous steps may produce a long list of
terms, translation selection has to find the correct translation among
them.
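The three steps above can be sketched as a pipeline skeleton. This is a hypothetical outline, not a real API: `search_web`, `extract_terms` and `select_translation` are placeholder names for the retrieval, extraction and selection stages.

```python
# Skeleton of the three-step web based translation extraction
# pipeline. All three stage functions are stubs to be filled in.
def search_web(oov_term, lang="zh", max_results=200):
    """Step 1: return result summaries in the target language
    that contain the OOV term (stub for a search-engine API)."""
    raise NotImplementedError("plug in a real search-engine API")

def extract_terms(summaries):
    """Step 2: extract candidate MLUs and their frequencies."""
    raise NotImplementedError

def select_translation(oov_term, candidates):
    """Step 3: choose the most likely translation candidate."""
    raise NotImplementedError

def translate_oov(oov_term):
    summaries = search_web(oov_term)
    candidates = extract_terms(summaries)
    return select_translation(oov_term, candidates)
```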
The term extraction in the second step falls into two main categories: approaches
that are based on lexical analysis or dictionary based word segmentation, and
approaches that are based on co-occurrence statistics. To mine translations from
Chinese text, the Chinese terms must be correctly detected first. As there are no
word boundaries in Chinese text, the mining system has to perform segmentation
of the Chinese sentences to find the candidate words. The quality of the
segmentation greatly influences the quality of the keyword extraction because
incorrect segmentation of the Chinese text may break the correct translation of an
English term into two or more words so that the correct keyword is lost. The
translation selection in the third step also has a problem: selecting the
highest-frequency word or the longest word does not always produce a correct
translation. The term extraction and translation selection problems will be
further addressed in the following sections.
2.7 Term extraction
Term extraction is mainly the task of finding MLUs in the corpus. The concept of
MLU is important for applications that exploit language properties, such as
Natural Language Processing (NLP), information retrieval and machine translation.
An MLU is a group of words that always occur together to express a specific
meaning. For example, compound nouns like Disney Land, compound verbs like
take into account, adverbial locutions like as soon as possible, and idioms like
cutting edge are MLUs. In most cases, it is necessary to extract MLUs rather than
words from a corpus because the meaning of an MLU is not always the
combination of the meanings of the individual words in it. For example, the MLU
'cutting edge' cannot be interpreted by combining the meaning of 'cutting' with
the meaning of 'edge'.
Finding MLUs from the summaries returned by a search engine is important in
web mining for automated translation because a word in one language may be
translated into a phrase or even a sentence. If only words are extracted from the
summaries, the later process may not be able to find the correct translation
because the translation might be a phrase rather than a word. For Chinese text, a
word consisting of several characters is not explicitly delimited since Chinese text
contains sequences of Chinese characters without spaces between them. Chinese
word segmentation is the process of marking word boundaries. The Chinese word
segmentation is actually similar to the extraction of MLUs in English documents
since the MLU extraction in English documents also needs to mark the lexicon
boundaries between MLUs. Therefore, term extraction in Chinese documents can
be considered as Chinese word segmentation. Many existing systems use lexical
based or dictionary based segmenters to determine word boundaries in Chinese
text. However, in the case of web mining for automated translation, an OOV
term is unknown to the system, so these kinds of segmenters usually cannot
correctly identify the OOV term in a sentence. Therefore, the translation of an
OOV term cannot be found in a later process. Some researchers have suggested
approaches based on co-occurrence statistics for Chinese word segmentation to
avoid this problem (Chen, Jiang et al. 2000; Maeda, Sadat et al. 2000; Gao, Nie et
al. 2001; Pirkola, Hedlund et al. 2001).
2.7.1 Mutual information and its variations
One of the most popular statistics based extraction approaches is to use mutual
information (Chien 1997; Silva, Dias et al. 1999). Mutual information is defined as:
MI(x, y) = log2( p(x, y) / (p(x) p(y)) ) = log2( N f(x, y) / (f(x) f(y)) )    (1)

The mutual information measurement quantifies the distance between the joint
distribution of terms x and y and the product of their marginal distributions.
When using mutual information in Chinese segmentation, x and y are two Chinese
characters; f(x), f(y) and f(x, y) are the frequencies with which x appears, y
appears, and x and
y appear together, respectively; N is the size of the corpus. A string xy will be
judged as a term if the MI value is greater than a predefined threshold.
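As a toy illustration of Equation (1), the MI of a character pair can be computed directly from raw frequencies; the numbers below are invented for illustration.

```python
import math

# Mutual information of a character pair xy (Equation (1)):
# MI(x, y) = log2(N * f(x, y) / (f(x) * f(y))).
def mutual_information(f_x, f_y, f_xy, n):
    return math.log2((n * f_xy) / (f_x * f_y))

# e.g. x and y each occur 8 times, and together 6 times,
# in a corpus of 1000 characters:
print(round(mutual_information(8, 8, 6, 1000), 2))  # 6.55
```

A string xy would then be accepted as a term when this value exceeds the chosen threshold.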
Chien (Chien 1997) suggests a variation of the mutual information measurement,
called significance estimation, to extract Chinese keywords from corpora. The
significance estimation of a Chinese string is defined as:

SE(c) = f(c) / (f(a) + f(b) - f(c))    (2)

where c is a Chinese string with n characters; a and b are the two longest
composed substrings of c, each with length n-1; and f is the function that
calculates the frequency of a string. Two thresholds are predefined: THF and
THSE. This approach identifies a Chinese string as an MLU by the following steps.
For the whole string c, if f(c) > THF, c is considered a Chinese term. For the two
(n-1)-substrings a and b of c, if SE(c) >= THSE, then neither a nor b is a Chinese
term. If SE(c) < THSE and f(a) >> f(b) or f(b) >> f(a), then a or b, respectively, is a
Chinese term. The method is then applied recursively to a and b to determine
whether their substrings are terms.
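The SE test can be sketched as follows. The thresholds `thf` and `thse` and the `margin` standing in for the ">>" comparison are illustrative choices, not values from the literature.

```python
# Sketch of the significance-estimation test (Equation (2)) for a
# string c and its two longest (n-1)-substrings a and b, given
# their frequencies. Threshold values are illustrative.
def significance(f_c, f_a, f_b):
    return f_c / (f_a + f_b - f_c)

def classify(f_c, f_a, f_b, thf=5, thse=0.5, margin=3.0):
    """Return which of c, a, b look like terms under the SE test."""
    terms = []
    if f_c > thf:
        terms.append("c")                  # c itself is frequent enough
    elif significance(f_c, f_a, f_b) < thse:
        if f_a >= margin * f_b:
            terms.append("a")              # f(a) >> f(b)
        elif f_b >= margin * f_a:
            terms.append("b")              # f(b) >> f(a)
    return terms

print(classify(f_c=10, f_a=11, f_b=12))  # ['c']
```

In the full method, the test is then applied recursively to the substrings of whichever strings are accepted.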
2.7.2 Local Maxima based approaches
However, all mutual information based approaches have the problem of tuning
the thresholds for generic use. Silva and Lopes suggest an approach called Local
Maxima to extract MLUs from corpora without using any predefined threshold
(Silva and Lopes 1999). The equation used in Local Maxima, known as SCP, is
defined as follows:
SCP(S) = f(S)^2 / [ (1/(n-1)) * Σ_{i=1}^{n-1} f(w1...wi) f(w(i+1)...wn) ]    (3)
where S is an n-gram string and w1...wi is a substring of S. A string is judged to be
an MLU if its SCP value is greater than or equal to the SCP values of all the
substrings of S, and also greater than or equal to the SCP values of its antecedent
and successor. The antecedent of S is an (n-1)-gram substring of S; a successor of
S is a string that has S as its antecedent.
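A minimal sketch of the SCP computation for an n-gram, using a toy frequency function over an invented corpus (`freq` simply counts substring occurrences):

```python
# SCP (Equation (3)): squared n-gram frequency divided by the
# average product of the frequencies of its two-part splits.
# Assumes len(s) >= 2.
def scp(s, freq):
    n = len(s)
    denom = sum(freq(s[:i]) * freq(s[i:]) for i in range(1, n)) / (n - 1)
    return freq(s) ** 2 / denom

corpus = "abcd abcd abcd abce"
freq = lambda sub: corpus.count(sub)
print(scp("abcd", freq), scp("abce", freq))  # 0.75 0.25
```

The frequent string "abcd" scores higher than the rare "abce"; Local Maxima would then keep a string whose SCP is not exceeded by its substrings, antecedent or successors.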
Although Local Maxima should be a language independent approach, Jenq-Haur
Wang et al. (Cheng, Teng et al. 2004) found that it does not work well for Chinese
word extraction. They introduced context dependency (CD), used together with
Local Maxima. The new approach is called SCPCD. The rank of a string S is
calculated using the following function:
SCPCD(S) = LC(S) RC(S) / [ (1/(n-1)) * Σ_{i=1}^{n-1} freq(w1...wi) freq(w(i+1)...wn) ]    (4)
where S is the input string, w1...wi is a substring of S, and LC() and RC() are
functions that count the number of unique left (right) adjacent characters of S. A
string is judged to be a Chinese term if its SCPCD value is greater than or equal to
the SCPCD values of all the substrings of S.
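SCPCD replaces the squared frequency in the SCP numerator with the left/right context counts. A sketch, again over an invented toy corpus where ASCII letters stand in for Chinese characters:

```python
import re

# SCPCD (Equation (4)): LC(S) * RC(S) over the same denominator as
# SCP. LC/RC count the unique characters immediately to the left/right
# of occurrences of S (non-overlapping matches; adequate for a sketch).
def scpcd(s, corpus):
    freq = lambda sub: corpus.count(sub)
    lc = len({m.group(1) for m in re.finditer("(.)" + re.escape(s), corpus)})
    rc = len({m.group(1) for m in re.finditer(re.escape(s) + "(.)", corpus)})
    n = len(s)
    denom = sum(freq(s[:i]) * freq(s[i:]) for i in range(1, n)) / (n - 1)
    return lc * rc / denom

print(scpcd("abc", "xabcy zabcw vabcy"))  # 3 left contexts * 2 right contexts / 9
```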
2.8 Summary
In this chapter, several existing techniques for distributed systems, peer to peer
systems and multilingual systems were reviewed. Peer to peer systems are special
dynamic distributed computing systems. In order to improve scalability and
efficiency, collection selection and collection fusion are two key components in
distributed and peer to peer systems. Collection selection can reduce the number
of query requests sent out while maintaining retrieval performance. Collection
fusion can improve the precision of the merged results. When adding multilingual
information retrieval features to peer to peer systems, query translation is the
most important part. Transliteration, parallel text mining and web mining are
three major approaches to translating OOV terms. Due to the special
characteristics of the Chinese language, transliteration is not as effective as the
other two approaches. However, parallel text mining and web mining approaches
require term extraction techniques to extract quality Chinese terms from text,
which makes these approaches more complex. Statistics based extraction
approaches are popular in parallel text mining and web mining because their
performance on new terms is better than that of dictionary based approaches.
Mutual information based approaches are widely used as they are simple and
easy to apply, but they need predefined thresholds; as a result, they only work
well on static collections. Local Maxima based approaches are more complex, but
they do not need any predefined threshold, so they can work well in dynamic
environments.
Chapter 3
Web based query translation
Enabling multilingual search is one of the key features of the P2PIR system
proposed in this thesis. Translation is clearly needed in the CLIR process: either
the query is translated into the document language, or the documents are
translated into the query language. Query translation is much quicker because it
only needs to translate a few query terms instead of whole documents. As a
result, query translation is more common than document translation in CLIR. This
thesis focuses only on improving query translation quality.
Similar to previous work, the approach proposed in this thesis adopts the idea of
finding an OOV term's translation by searching the web with a web search
engine. Web mining based approaches submit English queries (usually a single
English term) to a web search engine, and the top returned results (i.e.,
summaries in Chinese) are segmented into a word list. Each word in the list is
then assigned a rank calculated from term frequency, and the word with the
highest rank is selected as the translation of the English term. However,
observations show that there are two weaknesses in existing approaches.
The term extraction approaches used in existing web mining based approaches
are not designed for small amounts of text. According to our initial experiments,
the performance of those term extraction approaches is not always satisfactory
in web search engine based OOV term translation. The pages returned from a
web search engine are used for search based OOV term translation, and in most
cases only a few hundred top results from the result pages are used for
translation extraction. Consequently, the corpus size for search based approaches
is quite small. In a small collection, string frequencies are very often too low to be
used by the approaches reviewed in Chapter 2. Moreover, search engine results
are usually incomplete sentences, which makes traditional Chinese word
segmentation hard to apply in this situation. Many researchers (Gao, Nie et al.
2001; Chen and Gey 2003; Cheng, Teng et al. 2004; Zhang and Vines 2004; Lu, Xu
et al. 2007) apply statistics based approaches to term extraction in search based
translation to avoid the incorrect segmentation produced by dictionary based
word segmentation approaches.
The second weakness of current web based query translation approaches is that
they use a relatively simple translation selection strategy. As discussed before,
term extraction provides a list of translation candidate words. Each word in the
list is then assigned a rank calculated from term frequency, and the word with
the highest rank is selected as the translation of the English term. Researchers
(Lu, Chein et al. 2004; Zhang and Vines 2004) usually use the frequency of a word
as its rank. Their experiments showed that such a strategy is effective, but also
showed that the correct translation does not always have the highest frequency,
even though it very often has a high frequency. Therefore, we argue that the
correct translation is not necessarily the term with the highest rank.
The new query translation approach discussed in this section still follows the
common web search engine based query translation approach, but differs in
term extraction, term ranking and translation selection strategy. The aim of the
approach is to resolve the two weaknesses discussed above. The contributions of
the new approach are a new term extraction strategy and the application of
translation disambiguation technology as the translation selection strategy.
In the following sections, a bottom-up term extraction strategy is introduced
together with a new term measurement. The term extraction approach is
specifically designed for search based translation extraction: it uses term
frequency change as an indicator to determine term boundaries, and uses
similarity comparison between individual character frequencies, instead of term
frequencies, to reduce the impact of low term frequency in small collections.
The basic idea of translation selection in this approach is to combine translation
disambiguation technology with web search based translation extraction
technology. The web based translation extraction process usually returns a list of
words in the target language. As those words are all extracted from the results
returned by the web search engine, it is reasonable to assume that they are
relevant to the English terms that were submitted to the search engine. If we
assume all those words are potential translations of the English terms, we can
apply a translation disambiguation technique to select the most appropriate
word as the translation of the English terms.
The proposed query translation approach contains three major modules:
collecting web document summaries, term extraction, and translation selection.
For ease of understanding, in the following subsections the process of finding the
translation of the term "Stealth Fighter" is used as a running example to facilitate
the description of the proposed approach.
3.1 Collecting Web Document Summaries
Firstly, we collect the top 200 document summaries returned from Google that
contain both English and Chinese words. Sample document summaries are shown
below.
Figure 3–1 Three sample document summaries for “Stealth Fighter” returned
from Google
Figure 3-1 clearly shows that Stealth Fighter and its Chinese translation 隱形戰機
always appear together. The Chinese translation of Stealth Fighter appears either
before or after the English words. In the sample summaries in Figure 3-1, the
translation and the English term "Stealth Fighter" are highlighted in red.
Although the query submitted to Google asks for Chinese documents, Google
may still return some documents purely in English. Therefore, we need to filter
out the documents written in English only; only the documents that contain both
the English terms and Chinese characters are kept. All HTML tags also need to be
removed so that only the plain text is kept.
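The filtering step can be sketched as below. The CJK character range test and the invented sample summaries are ours; a real implementation would work on actual search-result pages.

```python
import re

# Keep only summaries that contain both the English query term and
# at least one CJK character, stripping HTML tags first.
CJK = re.compile(r"[\u4e00-\u9fff]")
TAG = re.compile(r"<[^>]+>")

def filter_summaries(term, summaries):
    kept = []
    for s in summaries:
        plain = TAG.sub("", s)          # drop HTML tags, keep plain text
        if term.lower() in plain.lower() and CJK.search(plain):
            kept.append(plain)
    return kept

samples = [
    "<b>Stealth Fighter</b> 隱形戰機 ...",
    "Stealth Fighter page in English only ...",
]
print(len(filter_summaries("Stealth Fighter", samples)))  # 1
```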
Secondly, from the document summaries returned by the search engine, we
collect the sentences in the target language; for example, we can collect three
Chinese sentences from the three sample document summaries in Figure 3-1.
Each sentence must contain the English term and the characters before or after
the term. From the summaries given in Figure 3-1, the Chinese strings shown in
Figure 3-2 will be extracted:
Figure 3–2 Sample output of Chinese string collection
3.2 Term extraction
In this step, meaningful terms are extracted from the Chinese string collection
obtained in the previous step. Term extraction here is similar to Chinese word
segmentation. However, Chinese word segmentation identifies word boundaries,
while term extraction aims to find all possible meaningful terms. For example,
隱形戰機 might be segmented as 隱形 (stealth) and 戰機 (fighter). In term
extraction, all three terms 隱形, 戰機 and 隱形戰機 should be extracted as
translation candidates, otherwise the correct translation could be missed. The
following sections describe the new term extraction strategy in detail.
3.2.1 Frequency Change Measurement
The approaches mentioned in Section 2.7 use a top-down approach that starts by
examining the whole sentence and then examines substrings of the sentence to
extract MLUs, until the substring becomes empty. We propose a bottom-up
approach that starts by examining the first characters and then examines
successively longer strings. The approach is based on the following observations
for small document collections:
Observation 1: For a small collection of Chinese text, such as the sentences
collected from the summaries returned by a search engine, a sequence of Chinese
characters is most likely an MLU if the frequency with which the characters occur
together is close to the frequency of each individual character in the sequence.
This is because in a small document collection, such as Google search result
summaries, the number of unique Chinese characters is quite small, which
reduces the number of possible terms in the collection. Moreover, in the search
result summaries all the text is related to a specific topic, which reduces the
number of possible terms even further. As a result, one Chinese character will
usually appear in only one or two terms in the collection, which makes the term
frequency close to the character frequency.
According to Observation 1, the frequencies of a term and of each character in
the term should be similar. As the standard deviation is a common measure of
the dispersion of a set of values, we propose to use the sample standard
deviation given in Equation (5) to measure the similarity between the character
frequencies.

σ = sqrt( (1/(n-1)) * Σ_{i=1}^{n} (xi - x̄)^2 )    (5)
For a given Chinese character sequence with n characters, xi is the frequency of
character i in the sequence, and x̄ is the average frequency of all the characters
in the sequence. Although the frequency of a string is low in a small corpus, the
frequencies of individual Chinese characters may still be relatively high. According
to Observation 1, if the characters in a sequence have similar frequencies, i.e., σ
is small, then the given sequence is most likely an MLU. When the frequencies of
all the characters in a Chinese sequence are equal, σ = 0. Because σ represents
the average frequency error of the individual characters in a sequence, according
to Observation 1, within an MLU a longer substring of that MLU will have a
smaller average frequency error.
Observation 2: When a correct Chinese term is extended with an additional
character, the frequency of the new string very often drops significantly. When a
Chinese term is extended with a random additional character, it is no longer a
Chinese term; as a result, the new string is unlikely to appear often in the
documents, and its frequency should drop significantly. In the case where the
additional character makes the new string a correct term, the new term is
unlikely to have the same meaning as the old term. Because we are looking at
frequencies in a small document collection in a particular domain related to the
old term, the new term should not appear often, so its frequency should also
drop significantly. In summary, when a Chinese term is extended with an
additional character, whether or not the new string is a real term, its frequency
should drop in a small document collection.
Equation (5) measures the frequency similarity between individual characters
without comparing each individual character's frequency with the sequence's
frequency. Combining the frequency of the sequence with the standard deviation
measurement, we designed the following equation to measure the possibility of s
being a term:

R(s) = f(s) / (σ + 1) = f(s) / ( sqrt( (1/(n-1)) * Σ_{i=1}^{n} (xi - x̄)^2 ) + 1 )    (6)

where s is a Chinese sequence and f(s) is the frequency of s in the corpus. We use
σ + 1 as the denominator instead of σ to avoid a zero denominator.
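Equation (6) can be computed directly from a frequency function. A sketch over a toy ASCII "collection" (assuming sequences of at least two characters, since the sample standard deviation is undefined for a single value):

```python
import math

# R(s) = f(s) / (sigma + 1), where sigma is the sample standard
# deviation of the individual character frequencies in s.
def r_value(s, freq):
    xs = [freq(c) for c in s]
    mean = sum(xs) / len(xs)
    sigma = math.sqrt(sum((x - mean) ** 2 for x in xs) / (len(xs) - 1))
    return freq(s) / (sigma + 1)

text = "abab abab abax"
freq = lambda sub: text.count(sub)
# "ab" occurs often and its characters have similar frequencies,
# so it scores far higher than the rare, unbalanced "ax":
print(r_value("ab", freq) > r_value("ax", freq))  # True
```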
Let S be a Chinese sequence with n characters, and let S' be a substring of S with
length n-1. According to Observation 1, we should have: if S is an MLU, then
f(S) ≈ f(S'), and vice versa. If S is an MLU, then the longer S is, the smaller σ is.
Therefore, in the case where S' is a substring of S, we have σ < σ' and, as a result,
R(S) > R(S'). In the other case, where S' is a substring of S and S' is an MLU while S
is not, that is, S adds an additional character to an MLU, we have f(S) < f(S'), and
the additional character gives S a larger standard deviation value, so σ > σ'.
Therefore, R(S) < R(S').
In summary, for a string and its substrings, the one with the higher R value is
most likely to be an MLU. Table 3-1 gives the R value of each possible term in the
Chinese sentence "隱形戰機/是/一種/靈活度/極差/的/戰機" ("/" indicates the
lexicon boundaries given by a human), chosen from the small collection of
Chinese strings given in Figure 3-2.
Table 3-1 Chinese strings and R
String R
隱形 26.00
隱形戰 0.94
戰機 2.89
戰機是 0.08
一種 0.44
一種靈 0.21
靈活 2.00
靈活度 2.00
靈活度極 1.07
極差 0.8
極差的 0.07
戰機 2.89
This example clearly shows that if a Chinese MLU has an additional character, its R
value will be significantly smaller than the R value of the MLU. For example, the
Chinese terms "一種", "靈活" and "靈活度" are valid Chinese MLUs, but
"一種靈" and "靈活度極" are not. From their R values, we find that
R(一種)=0.44 > R(一種靈)=0.21, and R(靈活)=R(靈活度)=2.00 > R(靈活度極)=1.07.
Based on this analysis, we conclude that it is reasonable to segment a Chinese
sentence at the positions where a Chinese character string's R value drops greatly,
the string before the drop being a potential MLU. Using this method, the example
sentence is segmented as "隱形/戰機/是/一種/靈活度/極差/的/戰機". The only
difference between the human segmentation and the automatic segmentation is
that "隱形戰機" (Stealth Fighter) is segmented into the two words "隱形"
(Stealth) and "戰機" (Fighter). However, this is still an acceptable segmentation
because both words are meaningful.
3.2.2 A Bottom-up Term Extraction Strategy
The traditional top-down strategy first checks whether the whole sentence is an
MLU, then reduces the sentence size by 1 and recursively checks the
subsequences. It has been reported that over 90% of meaningful Chinese terms
consist of fewer than 4 characters (Wu 2004), and on average the number of
characters in a sentence is much larger than 4. Obviously, a whole sentence is
unlikely to be an MLU, so checking the whole sentence for an MLU is unnecessary.
In this section, we describe a bottom-up strategy that extracts terms starting
from the first character in the sentence. The basic idea is to determine the
boundary of a term in a sentence by examining the frequency change, i.e., the
change of the R value defined in Equation (6), as the size of the term increases. If
the R value of a term of size n+1 drops significantly compared with that of its
largest subterm of size n, the subterm of size n is extracted as an MLU. For
example, in Table 3-1 there is a big drop between the R value of the term
"靈活度" (2.00) and that of its super term "靈活度極" (1.07). Therefore, "靈活度"
is considered an MLU. The following algorithm describes the bottom-up term
extraction strategy:
Algorithm BUTE(s)
Input: s = a1a2…an, a Chinese sentence of n Chinese characters
Output: M, a set of MLUs
Check each character in s; if it is a stop character such as 是, 了, 的…, remove it
from s. After removing all stop characters, s becomes a1a2…am, m ≤ n.
1. Let b=2, e=2, and M=Ø.
2. Let t1 = ab…ae, t2 = ab…ae+1.
3. If R(t1) >> R(t2), then M = M ∪ {t1} and b = e+1.
4. e = e+1; if e+1 > m, return M, otherwise go to step 2.
Here, a stop character plays a role similar to that of stop words in English, e.g.
“the”, “an”. The algorithm makes a sub-sequence uncheckable once it has been
identified as an MLU (i.e., b=e+1 in step 3 ensures that the next valid checkable
sequence does not contain the t1 just extracted as an MLU). However, with the
bottom-up strategy, some longer MLUs might be missed, since a longer term may
contain shorter terms that have already been extracted as MLUs. As shown in our
example, “隱形戰機” (Stealth Fighter) consists of the two terms “隱形” and
“戰機”. Using the bottom-up strategy, “隱形戰機” would not be extracted because
the composite term has already been segmented into two terms. To avoid this
problem, we set a fixed number ω which specifies the maximum number of
characters to be examined before reducing the size of the checkable sequence.
The modified algorithm is given below:
Algorithm BUTEM(s)
Input: s = a1a2…an, a Chinese sentence of n Chinese characters
Output: M, a set of MLUs
Check each character in s; if it is a stop character such as 是, 了, 的…, remove it
from s. After removing all stop characters, s becomes a1a2…am, m ≤ n.
1. Let b=2, e=2, first-term = true, and M = Ø.
2. Let t1 = ab…ae, t2 = ab…ae+1.
3. If R(t1) >> R(t2), then M = M ∪ {t1}.
4. If an MLU was extracted in step 3 and first-term = true,
then first-position = e and first-term = false.
5. If e − b + 1 ≥ ω,
then e = first-position, b = e + 1, first-term = true.
6. e = e + 1; if e + 1 > m, return M,
otherwise go to step 2.
In algorithm BUTEM, the variable first-position records the ending position of the
first identified MLU. Only after ω characters have been examined is the first
identified MLU removed from the next valid checkable sequence; otherwise the
current sequence is checked for a possible longer MLU even if it contains an
already extracted MLU. Therefore, not only the terms “隱形” and “戰機” but also
the longer term “隱形戰機” (Stealth Fighter) will be extracted.
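As a concrete illustration, the modified algorithm can be sketched in Python. This is a sketch under stated assumptions, not the thesis implementation: indexing is 0-based, R is passed in as a function (Equation (6) is not reproduced here), the qualitative test R(t1) >> R(t2) is approximated by a ratio threshold `drop`, and the fallback R value for unseen strings is purely illustrative.

```python
def butem(chars, R, omega=4, drop=2.0):
    """Sketch of Algorithm BUTEM. 'chars' is the stop-character-filtered
    character list a1..am; R maps a string to its R value (Equation (6));
    'R(t1) >> R(t2)' is approximated by R(t1) >= drop * R(t2)."""
    m = len(chars)
    M = set()
    b, e = 0, 1                      # t1 = chars[b..e], initially two characters
    first_term, first_pos = True, None
    while e + 1 < m:                 # t2 = chars[b..e+1] must stay inside s
        t1 = "".join(chars[b:e + 1])
        t2 = "".join(chars[b:e + 2])
        if R(t1) >= drop * R(t2):    # sharp drop in R: take t1 as an MLU
            M.add(t1)
            if first_term:           # remember where the first MLU ends
                first_pos, first_term = e, False
        if e - b + 1 >= omega and first_pos is not None:
            # omega characters examined: restart just after the first MLU
            e, b = first_pos, first_pos + 1
            first_term, first_pos = True, None
        e += 1
    return M
```

With a synthetic R function in which both a short prefix and a longer super-string show a sharp drop, the sketch extracts both, which is the behaviour the window ω is designed to achieve.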
3.3 Translation selection
From the term extraction step discussed in Section 3.2, we can generate a list of
translation candidates for each query term. The next step is to find the correct
translation for each query term from its candidate list. Traditional translation
selection approaches select the translation on the basis of word frequency and
word length (Chen and Gey 2003; Zhang and Vines 2004). The approach suggested
here can find the most appropriate translation from the extracted word list
regardless of term frequency by using translation disambiguation techniques, so
even a low-frequency word has a chance to be selected.
In most cases, a query represents the user’s need for information about some
specific topic; therefore, all query terms should relate to a single topic. The
translations of the query terms must also belong to the same topic for the
translated query to be meaningful. In other words, if the translations of the query
terms do not belong to the same topic as the original query terms, they are
unlikely to be correct translations. Consequently, translation disambiguation can
be applied to help select the correct query translation. If all the candidate terms
are assumed to be correct translations, the problem of selecting the most
appropriate translation becomes the problem of word translation disambiguation.
Researchers have observed that “the problem of word translation disambiguation
(in general, word sense disambiguation) can be viewed as that of classification”
(Li and Li 2001). Therefore, the problem of selecting an appropriate query
translation becomes a query classification problem: if the original query terms
can be classified into class C, the translated query terms should also belong to
class C. However, query classification is expensive, and it is hard to classify
query terms in two languages. To simplify this problem, it is reasonable to
assume that if all query terms belong to one topic class, the corresponding
translations of these terms should be strongly correlated with each other. Based
on this consideration, we can determine the best translation of the query by
examining the correlation of each possible combination of the translation terms
and choosing the combination with the highest correlation.
3.3.1 The algorithm
A simple way to measure the correlation between items is to use mutual
information (Church and Hanks 1990). There are several variations of mutual
information based approaches for measuring the co-occurrence of multiple items;
total correlation is one of the popular ones. Let {Qi} be a set of query terms
and Ti = {ti,j} be the candidate translations of term Qi. The correlation between
the translation terms can be calculated as:

C(t1t2t3…tn) = log2 [ N^(n−1) · f(t1t2t3…tn) / ( f(t1)·f(t2)·f(t3)·…·f(tn) ) ]    (7)
where ti is one of the candidate translations for the ith query term Qi, f(ti) is the
frequency with which the translation word ti appears in the corpus, t1t2…tn is a
combination of the candidate translations, f(t1t2t3…tn) is the frequency with which
t1t2…tn appears in the corpus, and N is the size of the corpus. The corpus is
constructed from the relevant documents retrieved from the document collection
using all candidate translation terms. Assuming that a list of translation candidates
for each query term of a given query has been generated in the term extraction
phase, the process of determining the best translation of the query is as follows:
Step 1: Using the candidate translation terms, retrieve documents from the
document collection and calculate the frequency of each candidate translation in
the collection formed by all the retrieved documents. For instance, if the original
English query has three terms A, B, C and A1, A2, …, B1, B2, …, C1, C2, … are the
candidate translations for A, B and C respectively, then we compute the frequencies
f(A1), f(A2), …, f(B1), f(B2), …, f(C1), f(C2), … in that collection.
Step 2: Calculate the frequencies of all the possible combinations of the candidate
translations in the collection of all the retrieved documents. For example, the
frequency of combination A1B1C1 is f(A1B1C1), that of A1B2C1 is f(A1B2C1), that
of A1B2C3 is f(A1B2C3), and so on.
Step 3: Calculate the correlation of all the possible combinations using Equation
(7). For example, the correlation of the candidate translation combination A1B1C1
is calculated as:

C(A1B1C1) = log2 [ N^2 · f(A1B1C1) / ( f(A1)·f(B1)·f(C1) ) ]
The terms in the translation combination with the highest correlation value are
considered strongly related and thus the translation combination should be
selected as the correct translation for that query.
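The scoring and selection steps above can be sketched in a few lines of Python. This is an illustrative sketch, not the thesis implementation; it applies the add-one smoothing of Equation (8) to avoid zero frequencies, and all frequencies in the usage example are hypothetical.

```python
import math

def correlation(N, combo_freq, term_freqs):
    """Total correlation of a translation combination (Equation (7)),
    with the add-one smoothing of Equation (8) to avoid zero frequencies."""
    n = len(term_freqs)
    numerator = (N ** (n - 1)) * (combo_freq + 1)
    denominator = 1.0
    for f in term_freqs:
        denominator *= f + 1
    return math.log2(numerator / denominator)

def best_combination(N, combos):
    """combos maps (t1, ..., tn) -> (f(t1..tn), [f(t1), ..., f(tn)]);
    select the combination with the highest correlation."""
    return max(combos, key=lambda c: correlation(N, combos[c][0], combos[c][1]))
```

For example, with a corpus of N = 1000 documents, a combination that co-occurs 30 times scores far higher than one that co-occurs twice, even when the individual term frequencies are comparable.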
It is quite possible that the frequency of a term is zero in the corpus, which
would make Equation (7) invalid. To avoid zero frequencies in the calculation,
Equation (7) is modified as below:

C(t1t2t3…tn) = log2 [ N^(n−1) · ( f(t1t2t3…tn) + 1 ) / ( (f(t1)+1)(f(t2)+1)(f(t3)+1)…(f(tn)+1) ) ]    (8)

In practice, the translation combination with the highest correlation might still
not be the correct query translation. For example, for “隱形戰機” (Stealth
Fighter), 隱形 and 戰機 are both translation candidates for the term Stealth Fighter. In
fact, using 隱形 as the translation will yield a higher C value than using 隱形戰機,
because a shorter string usually has a higher frequency than a longer string. A
simple strategy, called term merging in this thesis, is used to solve this problem.
Among the top 10 translation combinations ranked by C value, if a translation is a
substring of another translation in the top 10 list, replace it with the longer
term. Repeat this process until the top-1 translation is not a substring of any
other translation; the top-1 translation is then chosen as the correct translation.
The merge algorithm is described below.
1. List the top 10 translations by C value (calculated using Equation (8)).
Let W={w1,w2,…,w10}, i=2.
2. If w1 is a substring of wi, 1 < i ≤ 10, then w1=wi and i=2.
3. i=i+1.
4. Repeat steps 2 and 3 until i=10.
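The merge loop above can be sketched as follows; this is a minimal illustration in which the substring test is Python's `in` operator on strings.

```python
def merge_top_translation(ranked):
    """Term-merging sketch: 'ranked' lists candidate translations by
    descending C value. While the top candidate is a substring of a
    longer candidate in the list, promote the longer candidate."""
    terms = list(ranked)
    changed = True
    while changed:
        changed = False
        for t in terms[1:]:
            if terms[0] != t and terms[0] in t:  # substring of a longer term
                terms[0] = t                     # replace with the longer term
                changed = True
                break                            # rescan from the start
    return terms[0]
```

Applied to the seven candidates of Table 3-2, the sketch promotes 隱形 to 隱形戰機 and stops there, since 隱形戰機 is not a substring of 隱形戰鬥機.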
We again use “隱形戰機” (Stealth Fighter) as an example to explain the term
merging strategy. Only the top 7 combination translations are shown in the table
below, as only 7 terms are available in the whole list.
Table 3-2 Sample Combination of Translations
Rank Term
1 隱形
2 戰機
3 戰鬥機
4 隱形戰機
5 美國
6 雷達
7 隱形戰鬥機
The top-1 term is 隱形, while the true translation is at rank 4. Following the
merge algorithm, 隱形 is replaced by 隱形戰機, and we then look for a
super-string of 隱形戰機 in the list. As none is found, 隱形戰機 is
selected as the correct translation of Stealth Fighter.
In summary, this chapter has introduced a new bottom-up term extraction strategy
and a new translation selection strategy that applies translation disambiguation
techniques. Using these two strategies in web search based query translation
helps to increase translation accuracy.
3.3.2 Time Complexities
Although complexity analysis is not the major focus of IR researchers, it is still
necessary to know how much overhead the disambiguation process adds to the
standard monolingual IR process. In this section, a brief analysis of time
complexity is given.
To explain the time complexity of the algorithm easily, I break the disambiguation
process into several parts.
1. Time complexity of finding the possible combinations of translations.
If there are n query terms and each query term Qi has mi possible translations,
the number of combinations to examine is ∏(i=1..n) mi.
If we assume a number c > 1 such that mi = ai·c, the computation complexity
becomes ∏(i=1..n) ai·c = c^n · ∏(i=1..n) ai, i.e., exponential in the number
of query terms.
2. Time complexity of calculating a correlation.
Equation (8) is used to calculate correlations. It requires a loop of n
iterations to compute f(ti)+1, 1 ≤ i ≤ n, plus the calculation of
N^(n−1)·(f(t1…tn)+1). Therefore, the time complexity is O(n)+O(1)=O(n).
3. Complexity of term merging:
For each term, the time complexity is O(1); therefore, the time complexity for
n query terms is O(n).
4. Summary of computational complexity
Chapter 4
Multilingual Experiments
In this chapter, several experiments are conducted to evaluate the proposed
query translation approach. The web search engine used in the experiments is
Google. Two sets of experiments were designed to evaluate the translation
related approaches described in Chapter 3: the first set evaluates the
effectiveness of term extraction for OOV translation, and the second set
evaluates the effectiveness of translation selection for OOV translation.
4.1 Test set
Queries, document collections and relevance judgments provided by NTCIR
(http://research.nii.ac.jp/ntcir/) are used in the experiments. The NTCIR6 Chinese
test document collection was used as our test collection. The articles in the
collection are news articles published in 2000-2001. Detailed information about
the test set is shown in Table 4-1 below.
Table 4-1 Test document collections
Document collection          Year 2000  Year 2001  No. of articles
United Daily News (udn)      244038     222526     466564
United Express (ude)         40445      51851      92296
Ming Hseng News (mhn)        84437      85302      169739
Economic Daily News (edn)    79380      93467      172847
Total                        448300     453146     901446
Each document is in XML format with the following tags:
• <DOC> </DOC> The tag for each document
• <DOCNO> </DOCNO> Document identifier
• <LANG> </LANG> Language code: CH, EN, JA, KR
• <HEADLINE> </HEADLINE> Title of this news article
• <DATE> </DATE> Issue date
• <TEXT> </TEXT> Text of news article
• <P> </P> Paragraph marker
The queries used in the experiments are from the NTCIR5 and NTCIR6 CLIR tasks.
There are altogether 100 queries, created by researchers from Taiwan, Japan and
Korea. NTCIR provides both English queries and corresponding Chinese queries;
the Chinese queries were produced by human translators and are thus correct
translations of the corresponding English queries.
Yahoo’s online English-Chinese dictionary (http://tw.dictionary.yahoo.com/) is
used in the experiments. The English queries were first translated using this
dictionary; the terms that could not be translated by the online dictionary were
then used as the input queries to evaluate the performance of our proposed web
based query translation approach. There are 108 such OOV terms, and these were
used in the experiments.
4.2 Term extraction experiments
The existing term extraction approaches reviewed in Section 2.6 were used in the
experiments for comparison purposes. The abbreviations for the approaches are:
• MI for mutual information.
• SE for the approach introduced by Chien (Chien 1997).
• SCP for the Local Maxima approach introduced by Silva and Lopes (Silva and
Lopes 1999).
• SCPCD for the approach introduced by Cheng et al. (Cheng, Teng et al. 2004).
• SQUT for the extraction approach introduced in Section 3.2.
Each OOV term is translated via the following steps:
1. Send the OOV term as a query to Google; from the result pages returned
by Google, use the 5 different term extraction approaches mentioned
above to produce 5 Chinese term lists.
2. If a Chinese word in a term list can be translated to an English word using a
dictionary, the English word cannot be an OOV word; hence that Chinese
word cannot be a translation of the queried English OOV term. Therefore,
for each term list obtained in step 1, remove the terms that can be
translated to English by Yahoo’s online dictionary. This leaves only OOV
terms.
3. Select the top 20 terms from each of the term lists produced in step 2
as translation candidates, and select the final translation from each
candidate list using the translation selection approach described in
Section 3.3.
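The three steps above can be sketched as a small pipeline. The `search`, `extract` and `in_dictionary` callables are hypothetical stand-ins for the Google query, one of the five term extraction approaches, and the Yahoo dictionary lookup; none of these components are reproduced here, so the sketch only shows the control flow.

```python
def oov_candidates(oov_term, search, extract, in_dictionary, top_k=20):
    """Sketch of the OOV candidate pipeline (steps 1-3). All three
    callables are assumed stand-ins for external components."""
    summaries = search(oov_term)       # step 1: query the web search engine
    terms = extract(summaries)         # step 1: extract a Chinese term list
    # step 2: drop terms the dictionary can already translate -- they
    # cannot be translations of an English OOV term
    oov_only = [t for t in terms if not in_dictionary(t)]
    return oov_only[:top_k]            # step 3: keep the top-k candidates
```

In use, the dictionary filter removes common words such as 電腦 while keeping dictionary-unknown strings as candidates.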
Finally we have 5 sets of OOV translations; a sample of the translations is shown
in Appendix 3, Translation of OOV terms.
As the same corpus and the same translation selection approach were used in the
evaluation, any difference in the resulting translation accuracy is the result of
using different term extraction approaches. Thus we can claim that the approach
with the higher translation accuracy has the higher extraction accuracy.
4.3 Discussion
For the 108 OOV terms, using the 5 different term extraction approaches, we
obtained the translation results shown in Table 4-2. SQUT has the highest
translation accuracy, SCP and SCPCD provide similar performance, and the
approaches based on mutual information give the lowest performance.
Table 4-2 OOV translation accuracy
Correct Accuracy (%)
MI 48 44.4
SE 58 53.7
SCP 73 67.6
SCPCD 74 68.5
SQUT 84 77.8
4.3.1 Mutual information based approaches
In the experiments, MI based approaches such as MI and SE cannot determine
Chinese term boundaries well. The term lists produced by the MI based
approaches contain a huge number of partial Chinese terms, and partial Chinese
terms were quite often chosen as the translation of OOV terms. Some partial
Chinese terms selected by MI are listed in Table 4-3.
Table 4-3 Some Extracted terms by MI
OOV Terms Extracted terms Correct terms
Embryonic Stem Cell 胚胎幹細 胚胎幹細胞
consumption tax 費稅 消費稅
Promoting Academic Excellence 卓越發 卓越發展計畫
The performance of the mutual information based term extraction approaches is
affected by many factors. These approaches rely on predefined thresholds to
determine lexicon boundaries, and those thresholds can only be adjusted
experimentally, so they can be optimized only for fixed corpora. However, in
OOV term translation, the corpus is a dynamic set of web search engine results.
The predefined thresholds might work perfectly in some situations but poorly in
others; it is almost impossible to optimize the thresholds for generic use. As a
result, the output quality is not guaranteed.
In addition, mutual information based approaches seem unsuitable for Chinese
term extraction. As there are no word boundaries between Chinese words, the
calculation of MI values in Chinese is based on Chinese characters rather than on
words as it is in English. On average, a high school graduate in the U.S. has a
vocabulary of 27,600 words (Salovesh 1996), while the cardinality of the
commonly used Chinese character set is under 3,000 [61]. Because of this small
character set, Chinese characters have much higher frequencies than English
words: one Chinese character can be used in many MLUs, while an English word
has a lower chance of being used in multiple MLUs. As a result, an English MLU
has a much higher MI value than a Chinese MLU. The subtle difference between
the MI values of Chinese MLUs and non-MLUs makes the thresholds hard to tune
for generic use.
SE uses some filtering techniques to minimize the effect of the thresholds. In our
experiment, it yields a 17.2% improvement in translation accuracy over MI;
obviously the improvement comes from the higher quality of the extracted terms.
However, the limitation imposed by the thresholds is still unavoidable.
4.3.2 Local Maxima based approaches
Without using thresholds, local maxima based approaches have much better
flexibility than the MI based approaches across various corpora, achieving higher
translation accuracy in our experiments. In comparison, the SCP approach tries to
extract longer MLUs while the SCPCD approach tries to extract shorter ones. The
translations of “Autumn Struggle”, “Wang Dan”, “Masako” and “Renault” are all
2-character Chinese terms; SCPCD can extract these translations with no problem,
while SCP always has trouble with them. As over 90% of Chinese terms are short
terms, this is a problem for SCP in Chinese term extraction. Meanwhile, SCPCD
has trouble extracting long terms. Overall, the two local maxima based
approaches have similar performance; however, since most of the translations of
OOV terms in our experiment are long terms, SCP’s performance is a little better
than SCPCD’s.
Local maxima based approaches use string frequencies in the calculation of
(1/(n−1)) · Σ(i=1..n−1) f(w1…wi) · f(wi+1…wn). In a small corpus, string
frequencies become very low, which makes this calculation less meaningful;
local maxima based approaches are therefore not effective in a small corpus. In
comparison, our approach calculates the difference between character
frequencies, and in a small corpus characters still have relatively high
frequencies. As a result, our approach performs better than local maxima based
approaches in small corpora. For example, the local maxima based approaches
were unable to extract the translation of “Nissan Motor Company” because the
corpus is too small: Google returns only 73 results for the query “Nissan Motor
Company”.
4.3.3 SQUT Approach
Most of the translations can be extracted by the SQUT algorithm. As the approach
monitors the change in the R value (see Section 3.2.1) to determine whether a
string is an MLU, instead of using the absolute value of R, it does not suffer from
the difficulty of setting predefined thresholds. In addition, the use of single
character frequencies in the standard deviation calculations makes our approach
usable in small corpora. Therefore, we obtain much higher translation accuracy
than the MI based approaches and about a 10% improvement over the local
maxima based approaches.
However, the SQUT algorithm has difficulty in extracting the translation of “Wang
Dan”. In analysing the result summaries, we found that the Chinese character “王”
(“Wang”) is a very high frequency character in the summaries; it is also used in
other terms such as “霸王” (the Conqueror), “帝王” (regal), “國王” (king), “女王”
(queen) and “王朝” (dynasty), which also appear frequently in the result
summaries. In our approach, which uses the counts of individual characters, the
very high frequency of “王” breaks observation 2, so the translation of “Wang
Dan” cannot be extracted. In most cases, however, our observations hold in small
corpora, as demonstrated by the high translation accuracy of our approach on
Chinese/English web search summaries.
There are 23 OOV terms for which SQUT cannot find the correct translation. In
fact, none of the algorithms used in the experiments can find the correct
translations for those 23 OOV terms. For some terms, such as “Chiutou”, “Viagra”
and “capital tie up”, no algorithm can find a translation at all. Looking at the
Google search results in detail, we can clearly see that there is very little
Chinese text in the result pages; most of the result summaries are still in
English. For some other OOV terms, such as “Florence Griffith Joyner”, “F117”,
“ST1” and “FloJo”, none of the algorithms can find ideal translations. In fact,
“F117” and “ST1” are used directly in Chinese, and none of the result summaries
from Google use translations of them; therefore, none of the algorithms can find
a correct translation. We can consider this a limitation of the web based
translation approach.
4.4 Translation selection experiments
The following runs were conducted in the English-Chinese translation selection
experiments:
• Mono: in this run, we use the original Chinese queries from NTCIR5, with the
Chinese terms in the queries segmented by humans. This run, called the
monolingual retrieval run, provides the baseline result against which all
other runs are compared.
• IgnoreOOV: in this run, the English queries are translated using the online
Yahoo English-Chinese dictionary with the disambiguation technology
proposed in Section 3.3. If a translation is not found in the dictionary, the
query keeps the original English word.
• SimpleSelect: similar to IgnoreOOV, the English queries are translated using
the online Yahoo English-Chinese dictionary with disambiguation
technology. If a term cannot be translated by the dictionary, it is
translated by the proposed web mining based approach; however, in the
translation selection step, the longest and highest frequency string is
selected as its translation. This run simulates previous web translation
selection approaches (Lu, Chein et al. 2004).
• TQUT: like SimpleSelect, except that in the translation selection stage the
translation for the OOV term is selected with the disambiguation
technology proposed in Section 3.3.
4.5 Discussion
Table 4-4 below gives the results of retrieval performance from the four runs
defined in Section 4.4.
Table 4-4 NTCIR retrieval performance
Average precision Percentage of Mono run
Mono 0.3713
IgnoreOOV 0.1312 35.3%
SimpleSelect 0.2482 66.8%
TQUT 0.2978 79.3%
4.5.1 IgnoreOOV
The average precision of IgnoreOOV is 0.1312, which is only 35.3% of the
monolingual retrieval performance. This result shows the extent to which an OOV
term can affect a query. Looking at the translated queries, we found that 31
queries out of 50 contain OOV terms. After removing those 31 queries, Mono’s
average precision becomes 0.3026 and IgnoreOOV’s average precision becomes
0.2581, which is about 85.3% of Mono’s precision. This is a reasonable result and
indicates that our disambiguation technique works well in finding correct
translations. The reason we cannot reach the same precision as the monolingual
retrieval is that the limited coverage of the dictionary introduces inappropriate
translations. An inappropriate translation is defined as a translation that is
valid in some other context but not in the current query context. For example, in
query 24, for the term “space station, Mir”, 儲存信息暫存器 (Memory
Information Register) is the only translation returned by our dictionary, which is
a correct translation in some other context; for query 24, however, it should be
translated as 和平號太空站. When a dictionary returns only one translation, it is
hard to tell whether it suits the context, and since the dictionary gives only one
translation, we have no opportunity to correct the error by using a
disambiguation technique. Some translations from a dictionary are also incorrect
because translations differ between distinct Chinese-speaking regions. For
example, the query “mad cow disease” is translated to 瘋牛病 by TQUT. This
translation is used in mainland China and Hong Kong, but since the documents in
our collection are from Taiwan, it should be translated as 狂牛症 or 狂牛病. We
find the same problem for the term “syndrome” in query 24: its translation in
Taiwan is 症候群, but the dictionary gives 併發症狀 and 綜合症狀, which are
used in Hong Kong and mainland China. With these inappropriate translations,
the affected queries have very low precision, so we cannot possibly match the
Mono performance.
Table 4-5 Retrieval performance on queries that contain OOV terms only
Average precision Percentage of MonoRun
Mono 0.4134
SimpleSelect 0.2149 52.0%
TQUT 0.2946 71.3%
4.5.2 SimpleSelect
SimpleSelect, which achieved 0.2482 in average precision, performed much better
than IgnoreOOV, reaching 66.8% of the Mono performance. Clearly, some of the
English OOV terms were found and translated into Chinese correctly.
The results for the 31 queries that contain OOV terms are given in Table 4-5.
From Table 4-5, we can see that the precision of Mono is 0.4134 and the precision
of SimpleSelect is 0.2149, which is 52.0% of Mono’s precision. This indicates that
just choosing the longest and highest frequency term as the translation of an
OOV term actually performs worse than looking up the dictionary; the
performance is quite close to that of dictionary lookup without translation
disambiguation technology, as reported by other researchers. However, some of
our results show that this approach is quite useful for proper names. Because
there is no standard for name translation in Chinese, it is quite common for a
person’s name to be transliterated into different forms with similar
pronunciation, and different people may choose different translations according
to their conventions. As our test collection contains articles from four different
news agencies, if we choose only one of the translations, we may not retrieve all
the relevant documents.
For example, in query 12, the precision of SimpleSelect is 0.3528 while the
precision of Mono is 0.0508; SimpleSelect’s performance is vastly superior to
Mono’s. The English OOV term in query 12 is Jennifer Capriati (the name of a
tennis player). The translation given by the human expert is 卡普莉雅蒂; the
translations from our approach are 卡普裏亞蒂, 卡普莉雅蒂, 卡普裏雅蒂 and
雅蒂, all of which are correct translations. Clearly, many relevant documents are
missed when only the translation 卡普莉雅蒂 is used. Taking a closer look at the
collection, three of the four news agencies have sports news, and those three
agencies use three different translations for Jennifer Capriati: 卡普莉雅蒂 in the
mhn, 凱普莉雅蒂 in the ude and 卡普莉亞蒂 in the udn. Obviously, our
translated query takes advantage of adding 雅蒂: because we use a character
based index for our collection, the documents containing 雅蒂 include the
documents that contain either 卡普莉雅蒂 or 凱普莉雅蒂. Therefore, although
we cannot find the translation 凱普莉雅蒂, we can still retrieve the documents
that contain it by using 雅蒂.
4.5.3 TQUT
Table 4-6 OOV translation accuracy for NTCIR5 collection
Correct Accuracy (%)
TQUT 25 65
SimpleSelect 20 51
Table 4-6 shows that by using translation disambiguation technology in web
translation extraction, we obtain more accurate translations than previous
approaches: our translation accuracy is 65%, while the simulation of the previous
approach achieves only 51%. The IR performance of the disambiguated queries is
0.2978, which is 79.3% of Mono. If we look only at the results of the 31 queries
that contain OOV terms, the precision is 0.2946, which is 71.3% of Mono’s
precision; this is much higher than the 52% of Mono achieved by SimpleSelect.
There are 39 OOV terms over the 50 queries. Translations for 31 of the OOV
terms can be found using our proposed approach, and 20 of the translations are
exactly the same as or equivalent to the human translations, which is about 65%.
There are several reasons for not reaching 100% precision. The first is the
regional translation differences described earlier: since we cannot control where
the web search engine gets its documents or for whom they were written, we
cannot guarantee that the translation will suit the collection. For example, we
may be able to find a translation for an OOV term on the Internet, but that
translation may be one used in Hong Kong and not suitable for a collection from
Taiwan. The translation of Kursk is a good example: our web translation
extraction system returns only one translation, 庫爾斯克, which shows that
most documents on the Internet use 庫爾斯克 as the translation of Kursk; the
NTCIR5 collection, however, uses 科斯克. This kind of mismatch is very hard to
avoid, even for human interpreters. A good example is the translation of National
Council of Timorese Resistance. We believe that 帝汶抵抗全國委員會 (from our
web translation extraction system) and 東帝汶人抗爭國家委員會 (from the
NTCIR human translation) are both correct; the difference between the two
comes from different translation conventions. However, when using the two
translations as two queries, our IR system cannot return any documents, which
means that the documents in the NTCIR5 collection use yet another translation
for National Council of Timorese Resistance; the translation actually used in the
NTCIR5 collection is 東帝汶全國反抗會議.
Another reason we cannot reach 100% precision is that our web translation
extraction system does not consider the query context. As described before, we
only submit the OOV terms themselves to the web search engine, which may
yield a translation suitable for a different context. For instance, in query 36, we
are looking for articles about the use of a robot for remote operation in a
medical setting; “remote operation” is an OOV term in this query. Our web
translation extraction system returns the term 遠程操作服務 as its translation.
Disregarding the query context, this is a correct translation, but it is only correct
when used in computer science. If we disregard the query context, 27 of the
translations are correct, which is about 87%. This result is close to the 85%
achieved by the disambiguated dictionary translations.
Chapter 5
Web based collection profiling for collection fusion
Most of the current collection selection and fusion approaches are based on the
assumption that the similarity measures between different peers are comparable.
For example, the raw score merging approach assumes that the search engines
used across the network are the same and the collection statistics are similar
between collections. The query based sampling approach also assumes that the
performance of different search engines is similar for all remote collections.
However, "the incorporation of collection-dependent frequency counts in the
document or query weights (such as idf weights) invalidates this
assumption" (Voorhees, Gupta et al. 1994). In addition, in a p2p environment,
peer collections are managed by various IR systems. The retrieval performance of
different IR systems could be quite different for a given query. As a result, most
of the existing collection selection and fusion approaches are not suitable in the
peer to peer environment. The retrieval performance should be taken into
consideration when selecting remote IR systems and merging the results from
different IR systems. In order to obtain both content quality and search engine
retrieval quality of remote IR systems, user feedback can be used together with
collection ranking approaches such as CORI. This section proposes a method that
obtains resource descriptions and retrieval performance based on users'
feedback.
5.1 A simple example
Before describing the selection and merging approach, let us look at a simple
example of retrieving documents from a distributed information system. For the
purpose of illustration, suppose there are three remote collections, CA, CB and CC,
and users have no prior knowledge of the content of the collections; each
collection can be treated as a "black box". A user sends a query about art and
computer science to these collections. Suppose that the user marks 10 returned
documents from each collection as relevant. If we can obtain the user's feedback
about the retrieved document topics (for instance, the 10 documents from CA are
related to arts, the 10 documents from CC are related to computer science, and 5
each from CB are related to arts and computer science), it is reasonable to
estimate that CA contains documents about arts but not computer science, CC
contains documents about computer science but not arts, and CB contains both.
Therefore, according to query topics and user feedback, we can construct
collection content profiles.
This simple example also tells us that remote IR systems will have different
retrieval performance on different topics. An IR system that mainly contains
computer science documents would not have good retrieval performance on art
topics. User feedback can provide information about how good a remote IR
system's performance is on particular topics. The idea behind our approach is that
profiles of remote IR systems can be constructed based on user feedback. Based
on the profile, collection fusion can be improved by considering not only the
content description but also the retrieval quality.
5.2 Collection profiling
The collection profiling technique described here uses a matrix {p_i,j} to represent
the historical performance of collections, where p_i,j represents the average
retrieval performance of remote collection i on topic class c_j.
The performance of a search engine (i.e., a collection) is usually measured by
precision and recall. As most IR systems only return top N results, the precision of
top N results, denoted as P@N, is a reasonable measurement for evaluating the
search engine performance. The average P@N of a collection can measure how
well a remote search engine performed in the past. However, the absolute
average value cannot tell how good a collection is compared to other collections.
Precisions for different queries are not comparable: a precision of 0.3 might be a
good result for one query but a bad result for another. For example, if the
average precision of all the collections for a query is 0.1 and one collection
achieves 0.3, this collection achieves a good result, even though 0.3 by itself
looks ordinary. But if the average is 0.5, then 0.3 might be a bad result. From
this simple example, we can see that, for a query, the difference between the
precision achieved by a collection and the average precision over all collections
can indicate how well the collection performs for that query. Suppose that we
have sent n queries to a remote IR system; let p_i be the precision for the i-th
query achieved by this remote system and p̄_i be the average precision for the
i-th query achieved by all remote systems or collections. We can calculate the
performance of the remote system by the following equation:
P = Σ_{i=1..n} (p_i − p̄_i) / n    (8)

Because p_i and p̄_i are precisions, we have 0 ≤ p_i ≤ 1 and 0 ≤ p̄_i ≤ 1.
Therefore −1 ≤ P ≤ 1.
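The behaviour of equation (8) can be illustrated with a short sketch (the function name and the list-based interface are illustrative, not part of the thesis):

```python
def collection_performance(precisions, averages):
    """Equation (8): P = sum_i (p_i - avg_p_i) / n, where precisions[i]
    is this collection's precision for query i and averages[i] is the
    mean precision over all collections for that query.
    The result lies in [-1, 1]."""
    assert len(precisions) == len(averages)
    n = len(precisions)
    return sum(p - a for p, a in zip(precisions, averages)) / n
```

A collection that always achieves the average scores 0; one that always beats the average scores positively.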
It is clear that the performance of a remote collection represents not only the
quality of remote collection content but also the quality of the search engine.
When two remote peers use the same search engine, the peer with better
quality content will have better performance. When two peers have the same
document collection, the peer with a better search engine will have better
performance.
For real world peer to peer systems, there are no standard relevance
judgements, so the average precision needed in equation (8) is not directly
available. However, user feedback can be used as the judgement for determining
the relevance of documents. For example, if a user downloads a document from
the result list or spends some time reading it, the document can be considered
relevant to the query. In addition, users are likely to examine only the results
at the top of the result list; therefore, it is reasonable to use the precision of
the top documents to represent the performance of remote search engines.
Suppose that the size of the result list is m, N_i is the number of documents in
the list that are read or downloaded by the user for the i-th query, and N̄_i is
the average of N_i over all remote systems. The performance of a remote system
can then be calculated by the following equation:
P = Σ_{i=1..n} (N_i − N̄_i) / (m × n)    (9)
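Equation (9) is equation (8) with the precision p_i replaced by the click-through ratio N_i / m; a minimal sketch with illustrative names:

```python
def feedback_performance(clicks, avg_clicks, m):
    """Equation (9): performance estimated from user feedback.
    clicks[i] is the number of documents read or downloaded from this
    collection's top-m result list for query i; avg_clicks[i] is the
    average of that count over all collections. Equivalent to
    equation (8) with p_i = clicks[i] / m."""
    n = len(clicks)
    return sum(N - aN for N, aN in zip(clicks, avg_clicks)) / (m * n)
```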
5.3 Query classification
The profiling system described in thesis does not use the average performance of
all past queries. It is necessary to measure the past search performance in several
topic classes unlike digital libraries that cover all topics, personal document
collections in a peer to peer environment usually focus on some specific topics. As
described in section 5.1, a profiling system that uses uniform query performance is
95
not suitable to such environment. Some researchers have studied extracting topic
classification rules from the internet(King and Li 2003). However, their studies are
based on western languages and their works are trying to build a complex tree
structure for describing the relationship between various topics. Their works are
effective in homogeneous collections however, in a p2p environment, the
documents are usually heterogeneous. Also, there are limited resources on text
classification for Asian languages.
According to initial investigation, many news websites classify their news into
groups based on topics. For example, Yahoo news Taiwan
(http://tw.news.yahoo.com/) groups their Chinese news into 12 topics. Google
news Taiwan (http://news.google.com/news?ned=tw) groups their news into 9
topics. Those websites also provide very powerful search features on news. When
searching in Yahoo news Taiwan, a list of news that contains the query term will
be returned together with the source of the news, the catalogue and the news
summary. Google news Taiwan does not return catalogue information for the
news, but it enables the user to search news within a specific catalogue. Mining
such information can reveal the topics that a query term is related to. For example,
when searching for the term "雅虎" (Yahoo) in Yahoo news, most of the returned
news is about computer science. Therefore, we can determine that the term "雅虎"
(Yahoo) has a strong relationship to computer science but little relationship to,
for example, arts. As a result, with the help of such news sites, we can discover
the relationship between a query term and a topic. Such information then helps
identify query topics. For example, when searching for "Linux" in Yahoo news
Taiwan, out of 28 returned news items, 20 are under topic "SCI/TECH", 1 under
topic "world news", 3 under topic "financial" and 4 under topic "education". This
result indicates that the term "Linux" has a 72% chance of being in topic
"SCI/TECH", 4% in topic "world", 10% in topic "financial" and 14% in topic
"education". By searching the term in all the catalogues in Google news Taiwan
and counting the number of results returned from each catalogue, we can also
get the percentage that a term belongs to a particular catalogue.
Let C = {c_1, c_2, …, c_|C|} be a set of predefined classes, i.e. c_j ∈ C is a class
or catalogue, and let N(w_i, c_j) be the number of news items in class c_j returned
from querying term w_i in a news web site. The probability that the term w_i
belongs to a topic class c_j can be calculated by the following equation:

P(c_j | w_i) = N(w_i, c_j) / Σ_{c_k ∈ C} N(w_i, c_k)    (10)
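Equation (10) amounts to normalising per-class result counts; a minimal sketch using the "Linux" counts quoted above (function name illustrative):

```python
def term_topic_prob(counts):
    """Equation (10): P(c_j | w_i) = N(w_i, c_j) / sum_k N(w_i, c_k),
    where counts maps each topic class to the number of news items
    returned for the term in that class."""
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}

# The "Linux" example from the text: 28 news items in four classes.
linux = term_topic_prob({"SCI/TECH": 20, "world": 1, "financial": 3, "education": 4})
```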
Let W = {w_1, w_2, w_3, …, w_|W|} be a set of Chinese terms and Q = {w_1, w_2, …, w_m}
a query that contains m terms. The probability that Q belongs to a topic c_j can
be calculated by the equation:

P(c_j | Q) = P(c_j | w_1 w_2 … w_m)    (11)

Suppose that the occurrences of the terms w_1, w_2, …, w_m are independent
(Liu, Yu et al. 2002; Zhao, Shen et al. 2006). Then for all w_i ∈ W, i = 1, …, m,
and c_j ∈ C, we have:

P(c_j | Q) = P(c_j | w_1 w_2 … w_m) = Π_{i=1..m} P(c_j | w_i)    (12)

Equation (12) can be used to calculate the probability that a query Q belongs to a
topic c_j. The topic of a query Q = {w_1, w_2, …, w_m} is then the class c ∈ C
determined by the following equation:

P(c | Q) = max_{c_j ∈ C} P(c_j | Q)    (13)
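Equations (12) and (13) together form a naive-Bayes-style classifier; a minimal sketch, assuming the per-term probabilities of equation (10) are already available (all names illustrative):

```python
import math

def classify_query(query_terms, term_probs, classes):
    """Equations (12)-(13): under term independence,
    P(c_j | Q) = prod_i P(c_j | w_i); the query topic is the class
    maximising this product. term_probs[w][c] is P(c | w) from
    equation (10); unseen term/class pairs contribute probability 0."""
    def score(c):
        return math.prod(term_probs.get(w, {}).get(c, 0.0)
                         for w in query_terms)
    return max(classes, key=score)
```

For example, a query whose terms mostly co-occur with sports news will be assigned to the sports class even if one term leans elsewhere.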
5.4 Collection fusion
In an uncooperative p2p environment, the collection statistics may not be
available and the scores of the documents returned from different collections are
not comparable, so it is difficult to merge the results returned from different
collections. In this section, we propose some new merging strategies to solve this
problem. Generally, the retrieval process is conducted in the following steps.
Firstly the user's query Q is classified using equation (13) and then broadcast to
all the remote peer IR systems. When results are returned from the peer IR systems,
they are merged according to the query catalogue and the collection profile.
We propose a merging method called the sorted round robin strategy, which
incorporates the collection profile into the standard round robin method to
enhance the quality of result merging.
The basic idea of round robin merging is to interleave the result lists returned
from the remote peers. In each pass, the top remaining document in each peer's
list is popped and appended to the final result list. The order in which the
peers are visited is usually the order of the peer collection IDs. Obviously, the
basic round robin approach does not consider the performance of remote peers.
In the worst case, the results from the worst peer are popped first and the
results from the best peer are popped last, so irrelevant documents appear at the
top of the merged result list and the distributed retrieval performance is harmed
significantly. Furthermore, remote systems will have different retrieval
performance on different topics. However, traditional round robin always sorts
the results from remote systems in the order of collection IDs. Even if the order
of the documents returned from each remote system is optimized, the quality of
the merged result still cannot be guaranteed due to the fixed order of the remote
IR systems.
For the above reasons, we propose a modified round robin approach called the
sorted round robin merging strategy. Instead of using a fixed order of remote
systems to merge results, we dynamically change the order of remote systems
based on the query class and the previous performance of the remote systems. In
other words, the order in which remote systems are visited is determined by the
matrix {p_i,j}. The merging strategy can be described as the following steps:
1. Determine query class cj using equation (13).
2. Sort Collections by {pi,j}.
3. Merge results using the round robin strategy, based on the collection order
generated in step 2.
4. Repeat steps 1-3 for all input queries.
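The steps above can be sketched as follows (the data structures and names are illustrative; the thesis does not prescribe an implementation):

```python
def sorted_round_robin(results, perf, topic):
    """Sorted round robin merging (section 5.4): interleave the
    per-collection result lists, visiting collections in descending
    order of their historical performance p_{i,j} for the query's
    topic class.
    results: {collection_id: [doc, ...]} ordered best-first
    perf:    {(collection_id, topic): p} from the collection profile"""
    order = sorted(results, key=lambda i: perf.get((i, topic), 0.0),
                   reverse=True)
    merged, pos = [], 0
    # Pass over the lists rank by rank, best collection first.
    while any(pos < len(results[i]) for i in order):
        for i in order:
            if pos < len(results[i]):
                merged.append(results[i][pos])
        pos += 1
    return merged
```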
By using such a merging strategy, we attempt to optimize the order of the remote
IR systems no matter what type of query is issued. The collection that previously
performed best for the query class will always be visited first, and the quality
of the merged results can be guaranteed.
We also propose another merging strategy called Sorted Rank. Round robin
merging is a one-by-one merging strategy: if we have n remote systems, the
second document from the first visited remote system will be at position n+1 in
the merged result list. Intuitively, however, the more important systems should
have more documents at the top of the merged result list than the less important
ones. The basic idea of the Sorted Rank strategy is to re-calculate the scores of
the documents returned from remote collections, based on the historical
performance of those collections and the scores calculated by them, and then
merge the returned documents based on the newly calculated scores. The
simplest way to calculate a document score that takes this intuition into
consideration is to make the score depend on both the collection performance
p_i,j and the original rank r_i given by collection i. We designed the following
equation to modify the document score for a document and topic catalogue c_j:
score(r_i, c_j) = r_i × a × (1 + p_i,j)    (14)
here r_i is the document rank returned from collection i, c_j is the catalogue
that the query belongs to, p_i,j is the historical performance of collection i for
topic catalogue j, and a is a threshold. According to our experiments, a = 11
gives the best results. Because −1 ≤ p_i,j ≤ 1, we use (1 + p_i,j) as the
historical performance to avoid negative values. Finally, the documents are
sorted by the calculated scores.
Chapter 6
Profiling and fusion evaluation
Several experiment sets were designed to evaluate the performance of the
collection fusion approach. The first experiment set evaluates the performance of
the web-based query classification, the second evaluates the accuracy of the
collection rank, and the last evaluates the performance of the collection fusion.
We conducted the experiments with 30 databases drawn from the NTCIR6 CLIR
track document collections. The articles were evenly separated into 30 databases,
each with around 30048 documents. In order to make the databases cover
different topics, relevant documents on different topics were, according to the
relevance judgments, manually put into different databases. 50 queries from the
NTCIR5 CLIR task were used as the training set; that is, the collection profiles
were created based on those 50 queries. P@20 was used in profiling. 50 queries
from the NTCIR6 CLIR task were used in the evaluation.
Two search engines were used in the experiments. One is the search engine
introduced in section 7.5; it is called M-GPX in the experiments. The other uses
the same indexing strategy as M-GPX but a simple Boolean model with a tf-idf
weighting scheme; it is called System1 in the experiments. The document score in
System1 is calculated by the equation below:

D_score = Σ_i tf_i × idf_i    (15)
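A minimal sketch of equation (15); the exact idf formulation used by System1 is not given in the text, so log(N/df) is assumed here:

```python
import math

def boolean_tfidf_score(query_terms, doc_tf, df, num_docs):
    """Equation (15): D_score = sum_i tf_i * idf_i over the query terms
    present in the document (simple Boolean model with tf-idf weights).
    doc_tf maps terms to their frequency in the document, df maps terms
    to their document frequency in the collection of num_docs documents.
    idf = log(N / df) is an assumed variant."""
    score = 0.0
    for t in query_terms:
        if t in doc_tf and df.get(t, 0) > 0:
            score += doc_tf[t] * math.log(num_docs / df[t])
    return score
```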
6.1 Query classification
Web based query classification is the first step and one of the key components in
our collection fusion approach. If this part cannot produce good enough results,
the rest of our approach is meaningless. Therefore, in the first experiment, we
would like to evaluate the performance of query classification.
100 queries from NTCIR5 and NTCIR6 were used in this experiment. The
classification results produced by our web based classification method were
compared to the results classified by 5 human experts.
6.1.1 Discussion
In this experiment, the accuracy of the query classification made by our
classification method is 89%. It is almost impossible to reach 100% accuracy
because even humans may have different opinions when classifying some queries.
If we only look at the Title field of the query, different people may get completely
different results. For example, NTCIR6 query #20: Y2K problem. This query can
belong to topic class 'Science/Tech' if we are looking for the definition of the Y2K
problem, but it can also belong to topic class 'Financial' if we are looking for
something about the impact of the Y2K problem.
Term ambiguity also impacts the accuracy, especially in Chinese. For example,
NTCIR6 query #75: birth, cloned, calf. From the English query we would most
likely classify this query into topic class 'Science/Tech'. However, the Chinese
translation of this query is: 誕生 (birth), 複製 (cloned), 小牛 (calf). The term
'小牛' often refers to the NBA team 'Dallas Mavericks' in Chinese, and we can
expect much more news referring to 'Dallas Mavericks' than to 'calf' because
sports news is generally hotter than news about science. In the experiment, when
searching for the term '小牛' in Yahoo news, 91% of the returned news belongs to
topic class 'Sports'. The term '誕生' is a general term: 12% of the returned news
belongs to 'Science/Tech' and 9% to 'Sports'. Although the other key term '複製'
is obviously a 'Science/Tech' term, only 66% of the returned news belongs to
'Science/Tech'. As a result, the 'Dallas Mavericks' sense dominates the query and
causes it to be classified as 'Sports'.
6.2 Collection Rank Experiment
Collection rank is the most important part for collection selection and collection
fusion. It is also the key in the collection profiling technique described in this
thesis. In the experiment, we used two information retrieval systems to create
two different profiles and compared the two profiles against the contents of the
collections. In the evaluation, we sort the collections by the number of relevant
documents in each topic class and compare this ranking with the collection ranks
calculated by the different retrieval systems.
6.2.1 Discussion
In these experiments, we compare collection rank with collection quality. The
quality of a collection is determined by the number of relevant documents under
each topic class. As all 30 systems use the same search engine, if the search
engine used is good enough, the collection rank given by the collection profile
should match the collection quality.
Table 6-1 Collection Rank Top 5 for topic "Financial news (財經新聞)"
Collection #   No. of Relevant docs   Rank by # Relevant Doc   Rank by M-GPX   Rank by System1
6              113                    1                        1               13
3              94                     2                        2               11
5              84                     3                        3               4
2              71                     5                        4               9
1              76                     4                        5               3
Table 6-2 Collection Rank Top 5 – International (國際新聞)
Collection #   No. of Relevant docs   Rank by # Relevant Doc   Rank by M-GPX   Rank by System1
25             337                    1                        1               1
24             230                    2                        2               2
21             184                    6                        3               3
26             209                    3                        4               13
20             149                    8                        5               9
Table 6-3 Collection Rank Top 5 – Science/Tech (科技新聞)
Collection #   No. of Relevant docs   Rank by # Relevant Doc   Rank by M-GPX   Rank by System1
2              240                    1                        1               1
6              127                    5                        2               2
7              83                     7                        3               5
3              193                    3                        4               3
1              227                    2                        5               14
Table 6-4 Collection Rank Top 5 – Sports (運動新聞)
Collection #   No. of Relevant docs   Rank by # Relevant Doc   Rank by M-GPX   Rank by System1
10             55                     1                        1               1
11             36                     3                        2               3
8              25                     4                        3               2
12             24                     5                        4               7
From the results above we can see that the collection rank given by M-GPX nearly
matches the collection quality, while System1 does not provide such a good
collection rank on some topic classes. This result indicates several things:
1. M-GPX is good enough to find relevant documents across collections.
2. Our collection profiling algorithm can produce good quality collection ranks,
which will help collection selection and collection fusion.
3. The performance of System1 is not good enough to find financial news, as the
collection rank given by System1 under 'Financial' does not match the collection
quality.
Although System1 is not good enough, we still cannot say the collection rank given
by System1 is incorrect. The nature of our profiling system determines that the
top ranked collections are not necessarily the collections that contain the most
relevant documents; they are the collections that can return the largest number
of relevant documents. As we emphasized in Chapter 3, we should consider
collection content and retrieval system together. If the remote system cannot
retrieve any relevant documents, we should not consider it a good source even if
the collection content is good.
6.3 Collection fusion experiment
Several runs were conducted in the experiments, defined as follows:
• Centralized: all documents are located in a central database. This is the
baseline for all other runs.
• Round robin (RR): results are merged using the standard round robin
method. The order of visiting result lists is the order of collection id.
• Sorted round robin (SRR): results are merged using the sorted round robin
method described in section 5.4. The order of visiting result lists is the
order of the corresponding collection rank (OP).
• No classification Sorted round robin (NSRR): this run is a comparison to
SRR. In this run, we use the same sorted round robin approach as SRR but
we do not consider query classification. That is, the collection rank is
calculated based on the collection's previous performance and all queries
are under the same topic class.
• Sorted rank (SR): results are merged using the sorted rank method described
in section 5.4. Document scores are calculated by equation (14) and sorted
in ascending order.
• No classification Sorted rank (NSR): this run is a comparison to SR. In this
run, we use the same sorted rank approach as SR above but we do not
consider query classification. That is, the collection rank is calculated based
on the collections' previous performance and all queries are under the same
topic class.
NSRR and NSR were only run under M-GPX.
6.4 Discussion
From Table 6-5 and Table 6-6 we can see that M-GPX is a better retrieval system
than the simple System1. In the centralized environment, the average precision of
M-GPX is 0.2653 while System1 only reaches 0.2202. The average precisions in
Table 6-5 and Table 6-6 also show that SR is the most effective way of merging
distributed results, while the standard round robin method produced the lowest
precision. If we use the centralized system's precision as the baseline, the
precision produced by SR is around 9% higher than SRR's and around 13% higher
than RR's, and the precision of the SRR approach is about 4% higher than RR's. As
the only difference between SRR and RR is the order in which the collections are
visited, it is easy to conclude that sorting the returned results according to the
importance of the collections can improve the precision.
Table 6-5 Average Precision – System1
       RR      RR/Central   SRR     SRR/Central   SR      SR/Central   Central
P@5    0.2960  67.89%       0.2960  67.89%        0.3010  69.04%       0.4360
P@10   0.2860  70.79%       0.2860  70.79%        0.2840  70.30%       0.4040
P@15   0.2667  70.43%       0.2693  71.11%        0.2773  73.22%       0.3787
P@20   0.2500  71.43%       0.2510  71.71%        0.2720  77.71%       0.3500
P@30   0.2047  64.64%       0.2047  64.64%        0.2500  78.94%       0.3167
All    0.0900  40.87%       0.1001  45.46%        0.1237  56.18%       0.2202
Figure 6–1 P-R curves of 4 runs under System1
Table 6-6 Average Precision – M-GPX
       RR      RR/Central   SRR     SRR/Central   SR      SR/Central   Central
P@5    0.1640  31.54%       0.2360  57.70%        0.3680  70.80%       0.5200
P@10   0.1820  39.74%       0.2780  66.40%        0.3340  72.90%       0.4580
P@15   0.2107  49.23%       0.2880  68.90%        0.3160  73.80%       0.4280
P@20   0.2420  60.50%       0.2930  71.80%        0.3050  76.30%       0.4000
P@30   0.2760  77.20%       0.2760  77.20%        0.2847  79.70%       0.3573
All    0.1378  51.94%       0.1447  56.50%        0.1715  64.60%       0.2653
Figure 6–2 P-R curves of 4 runs under M-GPX
Under M-GPX, if we only look at P@5, SRR is over 13% better than RR. The P-R
curves in Figure 6–2 clearly indicate that the precision of SRR is much higher
than RR at the top of the result list; the performance gain of SRR mostly comes
from this high precision at the top. SR shows a nearly 10% improvement over SRR.
This is because the SRR method sorts the collections simply according to their
importance; the results returned from the distributed collections are still evenly
distributed in the merged result list. In SR, the more important a collection is,
the more of its documents appear at the top of the merged result list. The P-R
curves clearly show that SR is much better than SRR at placing relevant
documents at the top of the result list. Table 6-5 and Table 6-6 show the same
result: the precisions get closer from P@5 to P@30. At P@5, SR is about 10%
better than SRR, but at P@30 it is only about 2% better.
Experiments under System1 show similar results: SR performs best, followed by
SRR, then RR; however, SRR and RR perform quite similarly. If we look at Table
6-5, we can clearly see that SRR and RR have the same precision at P@5 and
P@30, while SRR performed a little better than RR at P@15 and P@20. That is why
SRR performed 5% better than RR overall. Because we have 30 collections and the
only difference between RR and SRR is the order in which the results are
interleaved, they should have the same precision at P@30. The SRR algorithm
tries to put results from 'good' quality collections at the top, so P@15 and P@20
in SRR are better than in RR. The reason the SRR and RR results are so close is
that System1 is not good enough to distinguish the collections, whereas under
M-GPX we can clearly see the difference between RR and SRR at the top of the
results.
In this experiment set, we also compared the performance of systems
considering query classification with systems not considering classification. The
results are shown in Table 6-7.
Table 6-7 Query classification VS No classification
With Classification   Without Classification
0.1499 (SRR)          0.1409 (N-SRR)
0.1715 (SR)           0.1506 (N-SR)
Table 6-7 clearly shows that query classification can help improve collection
fusion performance. Tables 6-1 to 6-4 have already shown that different test
collections focus on different topics, e.g. collection #10 is best for sports news
while collection #5 is best for financial news. As discussed in Chapter 5,
collection fusion should consider not only the overall performance but also the
performance under different query topics.
In theory, N-SRR should provide similar performance to RR. In our experiment,
however, it is about 4% better than the RR run. That is because about 40% of our
queries are classified as 'International'. As a result, the merging order in N-SRR
benefits all the 'International' queries, and N-SRR gets a 4% improvement over RR.
Table 6-7 also proves again that SR is a better collection fusion strategy than
SRR: even without query classification, SR still provides better retrieval
performance than SRR with query classification.
Although the centralized collection produces the highest precision, SRR and SR
still provide reasonably better performance than the standard round robin method.
Chapter 7
The P2PIR System architecture
The P2PIR system is quite distinctive in comparison with conventional P2P IR
systems. Figure 7–1 below demonstrates a sample scenario of a new user joining
the P2PIR system and completing a search within the network.
Figure 7–1 System overview
In Figure 7–1, SIG member A (i.e., client A) registers herself on the directory
web server and downloads the email addresses of all other SIG members. Some
time later, A broadcasts a query to all other SIG members. Members B and D do
not have any content that matches the query, so only client C responds. The
architecture of the P2PIR system is very much like that of existing P2P systems,
but the communication channel is email. This is a unique feature of our system.
The P2PIR system is designed to exchange data via a third party agent that can
get through firewalls; this is of paramount importance since the inability to do so
renders collaborative IR almost impossible in today's enterprise network
structure. For security reasons, most companies and large organizations only
provide private IP address space for internal computers, with proxy servers
providing internet access. Therefore, it is almost impossible to make a direct
connection between internal and external computers. Port forwarding and UPnP
can solve this problem, but they can cause serious security issues, e.g. revealing
sensitive data, and are a nightmare for network administrators. As a result, an
agent is chosen for data exchange between internal and external computers; the
structure of the network is shown below.
Figure 7–2 The P2PIR system network structure
The P2PIR system chooses email as the third party agent. There are a few good
reasons for the choice of the email infrastructure as our communication medium,
and they are explained in the following.
Firstly, the P2PIR system is intended to support offline information retrieval
applications which cannot be performed securely by existing distributed IR
systems over a public domain network. For example, a user wishes to find some
private documents that could possibly be available from an external partner.
This is commonly done through phone enquiry, fax, or email: the user first asks
the partner to search their private collection using a local search engine;
having found some relevant material, the partner then sends back the documents
via mail, fax or email, using some secure mechanism (e.g. encryption). This is an
expensive operation, and it is even more expensive if it simultaneously involves
many independent partners. The P2PIR system is designed to facilitate the
automation of the entire operation, without surrendering privacy and security.
Secondly, by choosing email as our communication medium, we trade off real
time response for features that are more important for the intended IR
applications. Importantly, email is delivered asynchronously. Many users do not
have a permanent direct connection to the internet, but email is eventually
delivered: even when a user is offline, the email agent can guarantee the delivery
of search requests and search results. This feature is quite helpful when the
communication is between users in different time zones; the two parties are
unlikely to be connected to the internet at the same time if they only turn on
their computers during business hours. For most synchronous P2PIR systems, a
distributed search can only be performed when the other group members are
online; otherwise nothing can be found. When using email as the communication
agent, however, search requests sent while the remote group partners are offline
are saved in their mail boxes. When they connect to the internet, the requests are
processed and the results are sent back to the requester's mail box.
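The email transport described above can be sketched with standard library facilities; the subject-line convention and message format below are illustrative assumptions, as the thesis does not specify them:

```python
from email.message import EmailMessage

def build_query_message(query, sender, recipients):
    """Package a search query as an ordinary email for broadcast to
    SIG members. The '[P2PIR-QUERY]' subject tag is an assumed
    convention for letting a peer's mail agent recognize requests."""
    msg = EmailMessage()
    msg["Subject"] = "[P2PIR-QUERY] " + query
    msg["From"] = sender
    msg["To"] = ", ".join(recipients)
    msg.set_content(query)
    return msg

# Sending is plain SMTP; the message is queued by the mail server and
# delivered even if the receiving peers are currently offline:
#   import smtplib
#   with smtplib.SMTP("localhost") as s:
#       s.send_message(build_query_message(...))
```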
Thirdly, email is universally accessible, reliable and stable. Free email storage
and communications services are readily available, which makes email
particularly attractive for supporting a free public domain P2PIR system.
Moreover, email is by design allowed to get through firewalls; no network setting
changes are required from network administrators, so the P2PIR system imposes
no extra security load. Additionally, an email address is universal and unique
over the internet, which makes it easy to identify users, since email addresses
are static and stable. IP addresses are unique, but not all users can get a static
IP address. Locating email users is straightforward and much more reliable than
locating a volatile IP address.
Last, email is potentially faster than a direct peer to peer connection. This is
true when people use their ISP's email addresses. As the ISP has the fastest
connection to the external network, its mail servers can send out the same
amount of data in less time than an end user, and there is no congestion between
the user and the ISP. As a result, when both parties use ISP email, it can be
faster to send data via email than via a direct connection, especially when
sending large files across continents, e.g. from Australia to South America.
Many people have not noticed how fast email is nowadays; there is virtually no
delay. The experiments in Chapter 8 show that when an email is sent from one
end, the other party receives it almost immediately.
7.1 System Design
The P2PIR system is designed to present users with a customary search engine
GUI. Each user, as a member of a SIG that collaborates on the P2PIR system, can
issue a query request to another member of the group, to a subset of members in
the SIG, or to the entire SIG. The query is then distributed to the targeted SIG
members. Upon (asynchronous) receipt, each group member's system acts upon
the request and searches the local, possibly private, collection, subject to security
settings and constraints. Results, if any exist, are then returned to the query
originator. Again subject to security constraints and settings, the result may be
returned automatically, if the security classification of the documents permits, or
may be assembled but require further authorization by the owner before being
sent. It is also possible to enforce the establishment of a separate secure
communication channel (e.g. using SSL) in order to exchange the data.
The system scales well because each new SIG member brings with them the data
collection, the storage and the processing capacity, and there is no need to crawl
and globally index the collection. All search operations are local. A search can only
find existing documents (no "dangling pointers"), and indexes can be up to date
immediately after new documents are added, without the typical update lag for
documents on the WWW. More details about the system are given in a later
section of this thesis.
7.2 The P2PIR system search lifecycle
Searching for information within the P2PIR system consists of several steps.
Figure 7–3 shows the dataflow of making a search request. The user interface is
similar to that of a conventional centralized WWW search engine: users can input
search queries and view results. When a user searches for information, the
request is sent to the UI (user interface) module and then passed to the security
module. Depending on the security requirements, the security module sends
either plain text, for a public domain search, or encrypted text, for a secure SIG
search, to the collection selection module. Details of the security module are
discussed in Section 7.4. The collection selection module selects the suitable
remote peers and passes their email addresses to the communication module,
which creates a request mail and sends it to the email server. At the same time as
the query is sent to the security module, it is also sent to the local multilingual
search engine, whose results are displayed in the UI immediately.
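As an illustration, the request dataflow just described can be sketched as a chain of module calls. The function and field names below are illustrative stand-ins, not the system's actual code:

```python
# Illustrative sketch of the search-request dataflow; all module
# functions here are hypothetical stand-ins, not the real system's API.

def security_module(query, secure=False):
    """Return the query as plain text (public domain search) or as
    'encrypted' text (secure SIG search); encryption is a placeholder."""
    return f"ENC({query})" if secure else query

def collection_selection(payload, peers):
    """Select suitable remote peers; here we naively pick them all."""
    return [peer["email"] for peer in peers]

def communication_module(payload, addresses):
    """Create one request mail per selected peer (simulated)."""
    return [{"to": addr, "body": payload} for addr in addresses]

def make_search_request(query, peers, secure=False):
    payload = security_module(query, secure)
    addresses = collection_selection(payload, peers)
    return communication_module(payload, addresses)

mails = make_search_request("distributed retrieval",
                            [{"email": "a@example.org"},
                             {"email": "b@example.org"}])
```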
Figure 7–3 Dataflow of Making Search Request
Figure 7–4 shows the dataflow of receiving a search request. When a remote
search request mail arrives, it is received by the communication module and the
message is passed to the security module. The security module checks whether
the request meets the security requirements according to the type of request:
public domain or secure SIG search. Valid requests are passed to the search
engine module. If the request sender is not in the secure group, the search engine
searches only the public documents and returns the query results; otherwise,
private results may be returned. When the search result is ready, it is passed to
the security module, a response mail is created, and the mail is sent to the email
server. These results contain summary information only. The response can be
more or less informative depending on security needs; it may be as little as a
notification that a result exists.
Figure 7–4 Dataflow of Response to Search Request
Figure 7–5 shows the dataflow of receiving results. When a response mail arrives,
it is received and sent to the security module, which verifies the results. Valid
results are passed to the collection fusion module, which merges all the remote
results into a single list. The list is then displayed in the UI with document names
and summaries.
Figure 7–5 Dataflow of Receiving Results
When a client wishes to retrieve a result document, she sends a specific
document request. The document owner's process then checks whether the
client has the right to view the document. A document response is returned to a
valid client, who can then read the document. To keep private information
secure, the query, the results and the document request/response are all
encrypted, preserving confidentiality between the clients.
7.3 Communication protocol
The communication protocol is one of the core components of a peer to peer
system. As discussed before, the physical communication between P2PIR system
clients is via email. POP3 (Post Office Protocol Version 3) (IETF 1996) and IMAP4
(Internet Message Access Protocol Version 4) (IETF 2003) are two of the most
popular protocols for retrieving email from a mail server, and SMTP (Simple Mail
Transfer Protocol) is the de facto standard for sending email to a mail server
(IETF 2001). As these protocols are so widely used, the details of how the system
communicates with mail servers through them are omitted.
The protocol used in the P2PIR system is carried in XML-coded messages,
delivered as plain text within an email message. XML was chosen as the
communication protocol because it is designed for storing self-describing data
and has been widely used for data exchange across all platforms; it is powerful
and flexible. Because XML data is self-describing, it does not require fixed
relational schemata or data type definitions designed in advance, so the protocol
can be extended without large changes to existing applications. An existing P2PIR
client can work with a newer version of the system as long as the new protocol
still carries all the information of the old protocol. This thesis exploits this
flexibility of XML: in this section, an XML protocol for plain text (public domain)
communication is described in detail, and later, in Section 7.4, an extension of this
protocol is introduced for encrypted (secure SIG) communication.
The XML Schema of the protocol is shown in Figure 7–6. The root element of the
XML document is 'DSearch', which has the attributes 'ID' and 'Sender' and an
optional attribute 'Encrypt'. 'Sender' identifies the sender of the message by
email address. 'ID' is the unique identification of the search request; the P2PIR
system uses a timestamp as the search ID. The 'Encrypt' attribute indicates
whether the message is encrypted; its default value is false.
Within the 'DSearch' element there are three possible child elements: 'Query',
'ResultList' and 'Document'. 'DSearch' may contain only one of the three.
• The 'Query' element contains the search query in plain text; its presence
indicates that the message is a search query request.
• The 'ResultList' element represents a search result table, containing two
compulsory elements, 'DocId' and 'Title', and two optional elements,
'Summary' and 'Score'. Documents in the 'ResultList' are assumed to be
sorted by relevance to the query: the first document in the list is the most
relevant and the last is the least relevant. The 'Title', 'Summary' and
'Score' elements help the user to determine whether a document is
relevant to the search query.
• The 'Document' element contains one compulsory element, 'DocId', and
one optional element, 'Content', which carries the document itself. The
document is encoded as base64Binary because binary data cannot be sent
directly by email. Base64 is a binary-to-text encoding scheme: it converts
any binary data into standard ASCII text, which can then be sent via email.
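For instance, the encoding round trip can be sketched in a few lines (Python is used here purely for illustration):

```python
import base64

# Any binary document can be converted to ASCII text for transport
# inside an email body, and decoded back on receipt.
document_bytes = b"\x00\x01binary PDF or DOC content\xff"
encoded = base64.b64encode(document_bytes).decode("ascii")  # email-safe text
decoded = base64.b64decode(encoded)
assert decoded == document_bytes
```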
When 'DSearch' contains only a 'Query' element, the message is a 'search
request' and the search query is in the 'Query' element. When 'ResultList' is the
only element in 'DSearch', the XML file represents a 'search result' message;
because it is in XML, the P2PIR system can easily convert the 'ResultList' element
into a table. When 'Document' is inside the 'DSearch' element, the message is
either a 'document request' or a 'document response'. If 'Content' is absent from
the 'Document' element, the message is a 'document request', asking the
recipient to send the document with the id given in the 'DocId' element. If
'Content' is not empty, the message is a 'document response'; the P2PIR system
only needs to decode the 'Content' element and save it to disk, after which the
user can open the document and read it.
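To make the message formats concrete, the following sketch builds and classifies 'DSearch' messages. The element and attribute names follow the description above, but the exact layout is an assumption, not the system's actual schema:

```python
import xml.etree.ElementTree as ET

def make_query_message(sender, query):
    # 'DSearch' root with ID (a timestamp in the real system) and Sender;
    # a single 'Query' child marks a search-request message.
    root = ET.Element("DSearch", ID="20080901120000", Sender=sender)
    ET.SubElement(root, "Query").text = query
    return ET.tostring(root, encoding="unicode")

def message_kind(xml_text):
    # A DSearch message contains exactly one of Query/ResultList/Document.
    root = ET.fromstring(xml_text)
    child = root[0].tag
    if child == "Query":
        return "search request"
    if child == "ResultList":
        return "search result"
    # 'Document' with no 'Content' is a request; with 'Content' a response.
    has_content = root.find("Document/Content") is not None
    return "document response" if has_content else "document request"

msg = make_query_message("alice@example.org", "cross-language retrieval")
```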
7.4 Security
7.4.1 Group management
One of the key features of the P2PIR system is secure group communication. The
security mechanisms do not prevent public access to general, non-sensitive
information; however, only authorized peers can access the respective sensitive
data. The identification of SIG members is therefore the most important security
issue for the P2PIR system. In a server-less environment the authentication of a
peer differs from that in a centralized environment: there is no server for user
authentication, so peers must authenticate each other directly, without recourse
to a central authority.
The simplest form of group control is an Access Control List (ACL). This list,
typically manipulated only by the group administrators, names all group members
who have permission to access the sensitive data. However, an ACL is too
restrictive for the P2PIR system. The main reason is that an ACL is a static entity:
it works only when all group members are known in advance, whereas in a peer
to peer environment peers can join a SIG dynamically, so a new SIG member
would not gain full access to members' collections until every member had
updated their ACL. Downloading the ACL from a directory server and then
verifying the user might be a solution, but it introduces more network traffic and
suffers from the single point of failure problem.
To solve this problem, we introduce a certificate for group members based on
the X.509 public key infrastructure (PKI) (ITU 2000). When a new member joins a
SIG, the SIG administrator issues him a group member certificate signed by the
SIG, together with a unique private/public key pair and the certificate of the
group itself. The group member certificate is similar to a standard PKI certificate:
it includes the member identity (email address), issuance time, validity interval
and a public key. The member can then prove his membership to other members
by sending out the member certificate together with a signed message.
The process of validating a member is demonstrated in Figure 7–7. The first step
of authentication is to identify the peer, which is relatively easy because each
peer uses a unique email address. The authentication process checks not only the
certificate but also the sender's email address; a message is accepted only when
its sender address matches the certificate. The next step is to verify the
certificate. The P2PIR system uses the X.509 validation method, following the
certificate path, to validate the certificate. Several checks are performed in this
step: first, that the group member's certificate was issued by the group
administrator; second, that the certificate has not expired and has not been
revoked; and third, that the message was signed by the sender. If the message
passes all checks, B is a valid client in group G. In this way, A and B need not know
each other beforehand, yet can trust each other and communicate securely.
Authentication is done on the client side and does not require any authentication
server. Additionally, a block list can be used at the client side, allowing a user to
block unwanted clients even when they hold a valid certificate within the group.
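The validation steps above can be sketched as follows. The certificate fields and helper names are simplified stand-ins, and real X.509 signature verification is reduced to a placeholder flag:

```python
from dataclasses import dataclass
from datetime import datetime

# Simplified stand-in for the X.509-based group member certificate;
# real signature verification is replaced by a placeholder flag.
@dataclass
class MemberCert:
    email: str
    issuer: str
    not_before: datetime
    not_after: datetime
    signed_by_admin: bool   # placeholder for a real signature check

def validate_member(cert, sender_email, group_admin, now):
    checks = [
        sender_email == cert.email,                 # mail address matches cert
        cert.issuer == group_admin,                 # issued by the group admin
        cert.not_before <= now <= cert.not_after,   # within validity interval
        cert.signed_by_admin,                       # certificate signature valid
    ]
    return all(checks)

cert = MemberCert("bob@example.org", "admin@sig.example.org",
                  datetime(2008, 1, 1), datetime(2009, 1, 1),
                  signed_by_admin=True)
ok = validate_member(cert, "bob@example.org", "admin@sig.example.org",
                     now=datetime(2008, 6, 1))
```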
Once issued, a group member's certificate becomes valid when its validity time is
reached and remains valid until its expiration date. However, various
circumstances may invalidate a certificate before the validity period expires: for
example, a change of email address, a change of employment status (when an
employee leaves an organization), or exposure or loss of the private key. Under
such circumstances the certificate must be revoked. RFC 3280 (IETF 2002) defines
one way to represent revocation information: each CA (in our case, the group
administrators) periodically issues a signed data structure called a certificate
revocation list (CRL), a list identifying revoked certificates, signed by the group
administrators. The CRL is sent to all group members whenever the revocation list
is updated. By using X.509 CRLs, we can ensure that certificates presented by
peers have not been revoked by the group administrator.
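Conceptually, the CRL check each peer performs is a simple membership test; the certificate identifiers below are invented for illustration:

```python
# Minimal sketch of a CRL check: the group administrator periodically
# distributes a signed list of revoked certificate identifiers, and
# every peer consults it before trusting a presented certificate.
crl = {"cert-017", "cert-042"}   # identifiers of revoked certificates

def is_revoked(cert_id, revocation_list):
    return cert_id in revocation_list

assert is_revoked("cert-042", crl)
assert not is_revoked("cert-100", crl)
```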
7.4.2 Security protocol
There are several internet encryption and security protocols, such as VPN, the
Secure Sockets Layer (SSL) and its successor Transport Layer Security (TLS), that
ensure the privacy and integrity of communication between peers. However,
such protocols are designed for direct communication and are not suitable for
our system, because the communication between peers is via email and we
cannot create a low-level secure communication channel over the public email
system. Since our communication protocol is based on XML, it is natural to base
our security protocols on XML Encryption and XML Signatures.
XML Encryption (Imamura, Dillaway et al. 2002) is designed for general secure
data exchange. The structure of XML Encryption is described in Figure 7–8 and
the detail of its XML schema is listed in Appendix 2. One of the most powerful
features of XML Encryption is that it is fully extensible: it defines all the elements
relevant to encryption, such as the encryption method, the encrypted key and
the cipher data, so any encryption algorithm can be applied. When necessary, a
user can apply a stronger encryption algorithm to protect sensitive data, and a
faster (usually weaker) algorithm to reduce CPU usage for items at a lower
security level. In addition, in contrast to real-time encryption protocols such as
SSL and TLS, XML Encryption can encrypt only the sensitive data rather than the
entire communication, so peers can exchange messages efficiently: non-sensitive
data is left unencrypted, reducing both CPU load and communication cost.
Figure 7–8 XML encryption structure(Imamura, Dillaway et al. 2002)
Where "?" denotes zero or one occurrence; "+" denotes one or more occurrences;
"*" denotes zero or more occurrences; and the empty element tag means the
element must be empty.
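As an illustration of this extensibility, the sketch below assembles the core of an EncryptedData element, with the algorithm named only by URI. It is a minimal sketch, not a complete XML Encryption implementation:

```python
import xml.etree.ElementTree as ET

XENC = "http://www.w3.org/2001/04/xmlenc#"

def encrypted_data_skeleton(algorithm_uri, cipher_b64):
    # Builds the core of an <EncryptedData> element: because the
    # encryption method is identified by URI, any algorithm can be
    # plugged in without changing the surrounding structure.
    root = ET.Element(f"{{{XENC}}}EncryptedData")
    ET.SubElement(root, f"{{{XENC}}}EncryptionMethod",
                  Algorithm=algorithm_uri)
    cipher_data = ET.SubElement(root, f"{{{XENC}}}CipherData")
    ET.SubElement(cipher_data, f"{{{XENC}}}CipherValue").text = cipher_b64
    return root

elem = encrypted_data_skeleton(
    "http://www.w3.org/2001/04/xmlenc#aes128-cbc", "A1B2...")
```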
7.4.3 Encryption methods
As PKI certificates are used in the P2PIR system, it is natural to use asymmetric
cryptography for encryption. However, symmetric algorithms are generally much
less computationally intensive than asymmetric ones; in practice, asymmetric key
algorithms are typically hundreds to thousands of times slower than symmetric
key algorithms. On the other hand, one difficulty with symmetric encryption is
the key exchange problem: peers have to share a key before communication can
start, and secure plain text key exchange is not possible on the internet. In
addition, the more users there are in the system, the greater the potential
damage if a key is compromised, so cryptographic keys should be changed
regularly. Combining asymmetric and symmetric encryption offers a solution.
There are numerous published articles that describe how to use asymmetric and
symmetric encryption together. For the sake of completeness, a brief explanation
is provided in Appendix 1.
7.4.4 XML signature
Data integrity is also important in the encryption/decryption process: it means
that data consistency is assured and the data is tamperproof. Only when data
integrity is guaranteed can we authenticate the message owner. The P2PIR
system uses X.509 certificates and XML Signatures (Bartel, Boyer et al. 2002) to
achieve data integrity and authentication.
XML Signature is defined by the W3C to "provide integrity, message
authentication, and/or signer authentication services for data of any type,
whether located within the XML that includes the signature or elsewhere"
(Bartel, Boyer et al. 2002). Like conventional digital signatures, XML Signatures
provide three guarantees: integrity assures that the data has not been tampered
with or corrupted since it was signed; authentication assures that the data
originates from the signer; and non-repudiation assures that the signer is
committed to the document contents. The advantage of XML Signatures is that
they can be applied to arbitrary data elements located within an XML document.
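The integrity and authentication properties can be illustrated with a detached signature over an XML fragment. A real XML Signature canonicalizes the XML and signs it with the member's private key; the HMAC below is only a conceptual stand-in for that step:

```python
import hashlib
import hmac

# Conceptual stand-in for a detached signature over an XML fragment:
# an HMAC over the raw bytes plays the role of the real public-key
# signature used by XML Signature.
def sign(xml_fragment: bytes, key: bytes) -> str:
    return hmac.new(key, xml_fragment, hashlib.sha256).hexdigest()

def verify(xml_fragment: bytes, key: bytes, signature: str) -> bool:
    return hmac.compare_digest(sign(xml_fragment, key), signature)

key = b"stand-in-for-signing-key"
fragment = b"<Query>peer to peer</Query>"
sig = sign(fragment, key)
assert verify(fragment, key, sig)                        # integrity holds
assert not verify(b"<Query>tampered</Query>", key, sig)  # tampering detected
```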
7.5 The search engine
The search engine used in this peer to peer system is based on GPX (Geva 2006),
with small modifications to enable document-level search for both English and
Chinese documents. The documents are indexed using a character-based
inverted file index, with Microsoft SQL Server 2005 Developer Edition as the
backbone of the search engine. The document index is stored as follows:
• The Terms table stores the unique terms in the whole collection, recording
the term id, the term, and the start and end positions of the term in the
inverted list.
• The Contents table stores the term id, document id, xpath id and the
position of the term in the xpath.
• The XPath table stores all possible xpaths in the collection.
• The Documents table stores all document names in the collection.
The Terms table stores only single English words and single Chinese characters.
The main reason for storing single words and characters instead of full terms is to
avoid the problem of Chinese term segmentation: as described before,
incorrectly segmented Chinese documents significantly decrease search engine
performance. With single-character indexing, a Chinese term can be recovered
from documents, xpaths and positions: a character sequence is considered a
phrase or a term in a document only when the character positions are
consecutive within the same document and xpath. As a result, this strategy
maximises the recall of the search engine.
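A minimal sketch of this single-character indexing and consecutive-position phrase matching (omitting the xpath dimension for brevity, and with invented example documents) might look like this:

```python
from collections import defaultdict

# Each Chinese character is posted with (doc_id, position); a
# multi-character term matches only where its characters appear at
# consecutive positions in the same document.
def build_index(docs):
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for pos, ch in enumerate(text):
            index[ch].append((doc_id, pos))
    return index

def phrase_docs(index, term):
    # Start from postings of the first character and require each
    # following character at the next position in the same document.
    hits = set(index.get(term[0], []))
    for offset, ch in enumerate(term[1:], start=1):
        shifted = {(d, p - offset) for d, p in index.get(ch, [])}
        hits &= shifted
    return {d for d, _ in hits}

docs = {1: "北京大学", 2: "大学生活", 3: "北大"}
index = build_index(docs)
```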
The document score for a query Q is calculated by the equation below:

D_score = n^7 × Σ_i (tf_i × idf_i)    (16)
Here, n is the number of unique query terms in the document, tf_i is the
frequency of the i-th term in the document, and idf_i is the inverse document
frequency of the i-th term in the collection.
The equation ensures two things. First, the n^7 factor strongly rewards
documents that contain more of the query terms: the more unique query terms a
document matches, the higher its rank. For example, a document containing five
unique query terms will always rank above a document containing four,
regardless of the term frequencies in the documents. Second, when documents
contain the same number of unique terms, the score is determined by the sum of
the query terms' tf×idf values, as in traditional information retrieval. According to
our experiments, an exponent of 7 gives the best performance for both English
and Chinese information retrieval.
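A sketch of this scoring rule, with invented idf values, shows how the n^7 factor dominates for typical term frequencies:

```python
# Sketch of the scoring rule D_score = n^7 * sum(tf_i * idf_i): the
# n^7 factor makes the number of matched unique terms dominate, so
# (for typical term frequencies) a document matching more unique
# query terms outranks one matching fewer; ties are broken by tf*idf.
def doc_score(query_terms, doc_tf, idf):
    matched = [t for t in query_terms if doc_tf.get(t, 0) > 0]
    n = len(matched)
    return n ** 7 * sum(doc_tf[t] * idf[t] for t in matched)

idf = {"peer": 1.2, "retrieval": 2.0, "email": 1.5}
# Document A matches three unique terms once each; document B matches
# only two, each with a higher frequency:
score_a = doc_score(["peer", "retrieval", "email"],
                    {"peer": 1, "retrieval": 1, "email": 1}, idf)
score_b = doc_score(["peer", "retrieval", "email"],
                    {"peer": 3, "retrieval": 3}, idf)
assert score_a > score_b   # the extra unique term wins
```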
Chapter 8
Peer to peer system evaluation
The purpose of the experiments in this chapter is to evaluate the performance of
the peer to peer system that uses email as its communication medium. All
multilingual components and the collection profiling components were therefore
removed from the P2PIR system during the experiments, to avoid any impact
from query translation and collection merging. As a result, raw score merging
was used in the P2PIR system.
8.1.1 Test environment
To simulate a real world situation we have used two different email servers: the
Yahoo mail server (Yahoo server) and the Queensland University of Technology’s
mail server (QUT server). Those two servers can represent two types of mail
server. The QUT server represents an organization’s internal mail server: it is
located behind the organization’s firewall and will block access from outside. The
Yahoo server represents a general public mail server: it is heavily loaded, can be
accessed by anyone, and is physically remote from QUT.
141
The machines used to test the system were also varied: 10 PCs with various
hardware and operating system configurations. Seven PCs were located within
QUT's internal network and firewall; one laptop and two PCs were located
outside QUT. The external machines used 256K ADSL internet connections and
the Yahoo server.
8.1.2 Test queries and collection
We used the INEX document collection in the experiments. The INEX04
document collection consists of IEEE Computer Society journal and magazine
articles from 1995 to 2002, approximately 500MB in size, with over 12,000
documents. The collection was arbitrarily split into 10 partitions and distributed
as independent sub-collections on 10 PCs of different types.
The queries were a mix of 26 XPath-like CAS (content-and-structure) queries,
specifying keywords as well as XML structural constraints, and 34 CO
(content-only) queries, specifying keywords only.
8.1.3 Test results
We ran the first test during a weekend. The workload of both servers appeared
low and the email messages were delivered without delay. The check-mail
interval was set to 0, meaning that the software kept checking for new mail
without imposing the usual wait. In this case, we can regard our system as
running the equivalent of a direct peer to peer system, or a parallel system. The
overall system performance depends on the slowest part of the system (Brown
1999). Table 8-1 shows the search time on each of the 10 partitions of the
collection, when running on the fastest and the slowest computers. The average
search time on our system was 26.4 minutes (for 60 queries).
Table 8-1 Collection size and search time
ID   Size (MB)   Search time, fastest PC (min)   Search time, slowest PC (min)
1    193         12                              23
2    150         10                              22
3    195          9                              18
4    250          8                              15
5    198         10                              21
6    187          6                              11
7    237         10                              22
8    225          9                              19
9    172         13                              26
10   169         12                              26
Table 8-2 compares the search times in the different environments. In the
distributed environment, the search is approximately 6.5 times faster than the
centralized search on the best PC. We can achieve shorter response times when
the data is split into smaller collections.
Table 8-2 Comparison of Centralize search and Distributed search
Centralized Search time A1 Centralized Search time B2 Distributed Search Time
168min >4 hours 26.4min
The same test was run mid-week to simulate a real world situation. Somewhat
surprisingly, all mails were again delivered within seconds, which still makes the
system behave like a set of virtually direct peer to peer connections.
In a real application we anticipate that the check-mail frequency will be of the
order of 5 minutes or more, so response times will be longer. We may also find
that some users are not permanently connected, so response times may be
longer still, hours or days, for some user collections. However, results are
delivered incrementally, so the query originator receives results as soon as
(1) Search A: 1.7GB database search on an AMD 2800+ with 1GB RAM.
(2) Search B: 1.7GB database search on a Pentium 3 800 with 256MB RAM; the search did not finish within 4 hours and was aborted.
they become available. The system therefore offers a trade-off between
immediate response (the ideal situation) and delayed asynchronous response,
which increases search coverage to collections that are offline at the time the
search request is issued.
8.1.4 Results and raw score fusion
As the collection profiling component was turned off, raw score merging was
demonstrated. To achieve effective result fusion we opted to change the ranking
strategy of our search engine so as to eliminate the use of collection-specific
information. This approach may at first seem questionable because it implies, for
instance, that ranking schemes based on collection-wide term frequencies (e.g.
tf-idf) cannot be used. When a collection is partitioned, each partition may
exhibit distinctly different term distribution frequencies; this is particularly to be
expected when the partitioning is thematic and different collections cover
different topics.
Merging the results on the basis of local rank leads to poor results. We therefore
implemented a result element ranking scheme based only on the number of
distinct search terms that appear in a result element and on the number of times
each term appears in that element, ignoring collection-wide term statistics
altogether in computing the score of a result element. In this manner, the
ranking in the distributed environment is identical to the ranking over a single
collection. There is of course a trade-off: on one hand, results fusion is
straightforward and there is no need to keep or exchange any global collection
statistics; on the other hand, the quality of the results is degraded. We tested the
approach by running our system with an unaltered ranking scheme (i.e. using
tf-idf and collection term statistics) and with an altered ranking scheme that uses
no collection statistics. The results are depicted in Figure 8–1; the baseline for
comparison is the precision-recall curves of the official submissions to INEX 2004.
Merging the results of the original search engine leads to poor results: our
submission to the CAS track in INEX 2004 was ranked first, but the result
obtained by distributing the search engine without modification would have
ranked 36th against all 51 official submissions, with much lower average
precision than the centralized system (0.03 vs. 0.15). On the other hand, by
changing the ranking scheme to eliminate all references to global information,
we obtained a result that is still very good: it would have ranked 4th at INEX
2004, with an average precision of 0.12.
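Because every peer scores documents with the same collection-independent formula, the fusion step reduces to merging lists that are already sorted by score. A minimal sketch, with invented document ids and scores:

```python
import heapq

# Raw-score merging: as all peers use the same collection-independent
# scoring formula, their (already sorted) result lists can be merged
# directly by score, with no normalization step.
def merge_results(*peer_results):
    # Each peer returns (doc_id, score) pairs sorted by descending score.
    return list(heapq.merge(*peer_results,
                            key=lambda r: r[1], reverse=True))

peer_a = [("doc3", 9.0), ("doc1", 4.0)]
peer_b = [("doc7", 6.5), ("doc2", 1.0)]
merged = merge_results(peer_a, peer_b)
```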
Chapter 9
Conclusion and future work
9.1 Summary
The goal of this thesis was to develop and evaluate a peer to peer multilingual IR
system that can easily pass through firewalls and can be used in both public
domains and private secure domains. This peer to peer IR system can search for
both English and Chinese documents using a single query in English or Chinese,
satisfying the information needs of users who know both languages. The P2PIR
system includes many features that are not offered by any current peer to peer
IR system. The contributions of this thesis include:
• Web-based English - Chinese query translation approach.
In this area, the thesis describes an approach to tackling the OOV problem in
English-Chinese information retrieval. First, a bottom-up term extraction method
for use in small corpora is proposed to generate candidate translations for OOV
query terms. The method introduces a new measurement of a Chinese string
based on frequency and standard deviation, together with a Chinese MLU
extraction process based on the change in this string measurement that does not
rely on any predefined thresholds. The method considers a Chinese string to be a
term based on the change in the value of R as the string grows, rather than on
the absolute value of R. Our experiments show that this approach is effective for
extracting translations of unknown query terms. The related information can be
found in Section 3.2, Chapter 4, and in conference paper [1] and journal
paper [1].
A simple translation selection approach to improve translation accuracy is also
proposed in the thesis. The experimental results show that OOV terms can
significantly affect the performance of CLIR systems. By using web translation
extraction based on a co-occurrence model, the overall performance reaches
almost 174% of that obtained when OOV terms are not processed. With our
proposed translation selection approach, the accuracy of OOV term translation
can be improved by up to 85%; the overall performance is about 200% relative to
not processing OOV terms, and about 120% relative to a simulation of previous
approaches. The related information can be found in Section 3.3, Chapter 4, and
in conference paper [3] and journal paper [1].
• Web-based collection profiling strategy
In this area, the thesis proposed and evaluated an approach to result merging
that can be applied in uncooperative distributed information environments such
as P2P systems. A web-based query classification method is proposed. By
learning user behaviour and using query classification, collection profiles can be
created that contain information not only about collection content but also
about the performance of the remote information retrieval systems; this profile
information assists result merging. Our experiments showed that the proposed
SRR and SR approaches provide much better results than the standard round
robin method. The related information can be found in Chapter 5, Chapter 6 and
in conference paper [2].
• Email-based peer to peer IR system architecture
An email-based secure distributed information retrieval system is described in
the thesis. The system is based on a store-and-forward paradigm, utilizing the
public email system to facilitate search distribution and collaborative
information retrieval as a federated system of local private collections. The tests
show that the distributed system can even offer speed advantages over an
equivalent centralized system, arising from parallel processing.
An important and central feature of the P2PIR system is that it ensures
information security and privacy in the search for, and exchange of, sensitive
data over an open network. The system also provides other features such as
storing and forwarding messages, passing through firewalls, and communicating
trustfully with unknown clients. The work shows that it is possible to use the
X.509 certificate framework in a peer to peer IR system. The related information
can be found in Chapter 7, Chapter 8 and in conference paper [4].
9.2 Future work
First of all, the query translation research in this thesis is limited to translating
English queries into Chinese. As the ideas of term extraction and translation
selection are statistics-based, they should be language independent, so they
might also work for Chinese-to-English translation and other language pairs;
however, more experiments are required to evaluate their actual performance.
In the query translation area, although the proposed approach shows impressive
accuracy for OOV term translation, some work remains for the future. First, our
experiments were conducted on a test set from the NTCIR5 and NTCIR6 CLIR
tasks, which contains only 108 OOV terms; it would be worthwhile to test our
approach on a larger test set, such as one with over 1000 OOV terms. Second,
inappropriate translation is still a problem in
query translation. The main reasons include the limited size of dictionary,
different customs of translation, and ignoring query context. Some work should be
done to minimize these problems. Our experiments provide hints for some
possible approaches. If we have large enough resources, we may find all the
possible translations. For translation selection, if some of the translations hit a
similar number of documents, we may keep all of them as correct translations. It
may be useful to include more results from the Google search for instance or
combining different translation result together. We will validate these ideas in the
future.
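The "keep translations with a similar number of document hits" idea above can be sketched as follows. This is only an illustration of the selection rule, not the thesis implementation: the class name, the `ratio` threshold, and the sample hit counts are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

public class TranslationSelection {

    // Keep every candidate translation whose document hit count is within
    // `ratio` of the best candidate's count. With ratio = 0.5, candidates
    // that hit a similar number of documents are all kept as correct
    // translations instead of forcing a single winner.
    public static List<String> select(Map<String, Long> hitCounts, double ratio) {
        long best = Collections.max(hitCounts.values());
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, Long> e : hitCounts.entrySet()) {
            if (e.getValue() >= best * ratio) {
                kept.add(e.getKey());
            }
        }
        return kept;
    }
}
```

A candidate whose hit count falls far below the leaders is discarded, while near-ties are all retained.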
In the collection profiling area, future work will focus on collection
selection. Collection selection is necessary in a large scale environment: a
query should only be sent to a small number of peers in the network, because a
client may not be able to handle a large number of search requests. In
addition, selecting a small number of remote peers helps to reduce the number
of irrelevant documents. Although the collection profiling strategy provides
impressive accuracy in collection ranking, which is usually the basis for
collection selection strategies, collection selection strategies themselves
are not covered in this thesis. How to select a small number of remote peers,
e.g. how many peers should be selected so as to maximize recall and precision,
is still an open issue in the P2PIR system.
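Given the collection ranking that the profiling strategy already produces, one simple selection policy is to forward the query to only the k best-scoring peers. The sketch below is a hypothetical illustration of that policy (the class name and score map are ours); choosing k itself remains the open question noted above.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class CollectionSelection {

    // Rank remote peers by their collection-profile score and return only
    // the top k, so that a query is sent to a small number of peers rather
    // than broadcast to the whole network.
    public static List<String> topKPeers(Map<String, Double> scores, int k) {
        return scores.entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                .limit(k)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}
```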
In the peer to peer IR system area, as our system is based on email, any attack
through the email system is a potential risk to our system, especially the “spam
attack”. More work is required to address this issue.
Appendix
Appendix 1 Asymmetric and symmetric encryption / decryption
The encryption and decryption process is shown in Figure 10–1. When user A
requests some sensitive information from user B, A sends her public key to B
along with the request. User B then creates a symmetric Content encryption key
and encrypts that key with A's public key. Next, B encrypts the data using the
symmetric Content key and sends the encrypted data and the encrypted key back
to A. User A decrypts the Content key using her private key and then decrypts
the data using the Content key. With asymmetric cryptography, only the private
key can decrypt data encrypted by the corresponding public key, so the Content
key, and therefore the data, stays private. The data itself is encrypted with
symmetric encryption because symmetric ciphers are much faster than asymmetric
ones for bulk data.
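The exchange described above can be sketched with the standard Java cryptography API. This is a minimal illustration only: it uses ECB block modes and PKCS1 padding for brevity, whereas a real deployment would use an authenticated mode with an IV (and OAEP padding), and the class name is ours, not part of the system.

```java
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.security.KeyPair;
import java.security.KeyPairGenerator;

public class HybridEncryptionSketch {

    // Round-trips `message` through the protocol of Appendix 1 and
    // returns the decrypted text.
    public static String roundTrip(String message) throws Exception {
        // User A generates a key pair and sends the public key with her request.
        KeyPairGenerator kpg = KeyPairGenerator.getInstance("RSA");
        kpg.initialize(2048);
        KeyPair userA = kpg.generateKeyPair();

        // User B creates a symmetric Content encryption key ...
        KeyGenerator kg = KeyGenerator.getInstance("AES");
        kg.init(128);
        SecretKey contentKey = kg.generateKey();

        // ... encrypts the Content key with A's public key ...
        Cipher rsa = Cipher.getInstance("RSA/ECB/PKCS1Padding");
        rsa.init(Cipher.ENCRYPT_MODE, userA.getPublic());
        byte[] encryptedKey = rsa.doFinal(contentKey.getEncoded());

        // ... and encrypts the data with the Content key.
        Cipher aes = Cipher.getInstance("AES/ECB/PKCS5Padding");
        aes.init(Cipher.ENCRYPT_MODE, contentKey);
        byte[] encryptedData = aes.doFinal(message.getBytes(StandardCharsets.UTF_8));

        // User A recovers the Content key with her private key ...
        rsa.init(Cipher.DECRYPT_MODE, userA.getPrivate());
        SecretKey recoveredKey = new SecretKeySpec(rsa.doFinal(encryptedKey), "AES");

        // ... and finally decrypts the data with the recovered Content key.
        aes.init(Cipher.DECRYPT_MODE, recoveredKey);
        return new String(aes.doFinal(encryptedData), StandardCharsets.UTF_8);
    }
}
```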
Appendix 2 XML Schema for XML Encryption
<?xml version="1.0" encoding="utf-8"?>
<schema xmlns="http://www.w3.org/2001/XMLSchema" version="1.0"
        xmlns:xenc="http://www.w3.org/2001/04/xmlenc#"
        xmlns:ds="http://www.w3.org/2000/09/xmldsig#"
        targetNamespace="http://www.w3.org/2001/04/xmlenc#"
        elementFormDefault="qualified">

  <import namespace="http://www.w3.org/2000/09/xmldsig#"
          schemaLocation="http://www.w3.org/TR/2002/REC-xmldsig-core-20020212/xmldsig-core-schema.xsd"/>

  <complexType name="EncryptedType" abstract="true">
    <sequence>
      <element name="EncryptionMethod" type="xenc:EncryptionMethodType" minOccurs="0"/>
      <element ref="ds:KeyInfo" minOccurs="0"/>
      <element ref="xenc:CipherData"/>
      <element ref="xenc:EncryptionProperties" minOccurs="0"/>
    </sequence>
    <attribute name="Id" type="ID" use="optional"/>
    <attribute name="Type" type="anyURI" use="optional"/>
    <attribute name="MimeType" type="string" use="optional"/>
    <attribute name="Encoding" type="anyURI" use="optional"/>
  </complexType>

  <complexType name="EncryptionMethodType" mixed="true">
    <sequence>
      <element name="KeySize" minOccurs="0" type="xenc:KeySizeType"/>
      <element name="OAEPparams" minOccurs="0" type="base64Binary"/>
      <any namespace="##other" minOccurs="0" maxOccurs="unbounded"/>
    </sequence>
    <attribute name="Algorithm" type="anyURI" use="required"/>
  </complexType>

  <simpleType name="KeySizeType">
    <restriction base="integer"/>
  </simpleType>

  <element name="CipherData" type="xenc:CipherDataType"/>
  <complexType name="CipherDataType">
    <choice>
      <element name="CipherValue" type="base64Binary"/>
      <element ref="xenc:CipherReference"/>
    </choice>
  </complexType>

  <element name="CipherReference" type="xenc:CipherReferenceType"/>
  <complexType name="CipherReferenceType">
    <choice>
      <element name="Transforms" type="xenc:TransformsType" minOccurs="0"/>
    </choice>
    <attribute name="URI" type="anyURI" use="required"/>
  </complexType>

  <complexType name="TransformsType">
    <sequence>
      <element ref="ds:Transform" maxOccurs="unbounded"/>
    </sequence>
  </complexType>

  <element name="EncryptedData" type="xenc:EncryptedDataType"/>
  <complexType name="EncryptedDataType">
    <complexContent>
      <extension base="xenc:EncryptedType"/>
    </complexContent>
  </complexType>

  <!-- Children of ds:KeyInfo -->
  <element name="EncryptedKey" type="xenc:EncryptedKeyType"/>
  <complexType name="EncryptedKeyType">
    <complexContent>
      <extension base="xenc:EncryptedType">
        <sequence>
          <element ref="xenc:ReferenceList" minOccurs="0"/>
          <element name="CarriedKeyName" type="string" minOccurs="0"/>
        </sequence>
        <attribute name="Recipient" type="string" use="optional"/>
      </extension>
    </complexContent>
  </complexType>

  <element name="AgreementMethod" type="xenc:AgreementMethodType"/>
  <complexType name="AgreementMethodType" mixed="true">
    <sequence>
      <element name="KA-Nonce" minOccurs="0" type="base64Binary"/>
      <!-- <element ref="ds:DigestMethod" minOccurs="0"/> -->
      <any namespace="##other" minOccurs="0" maxOccurs="unbounded"/>
      <element name="OriginatorKeyInfo" minOccurs="0" type="ds:KeyInfoType"/>
      <element name="RecipientKeyInfo" minOccurs="0" type="ds:KeyInfoType"/>
    </sequence>
    <attribute name="Algorithm" type="anyURI" use="required"/>
  </complexType>
  <!-- End Children of ds:KeyInfo -->

  <element name="ReferenceList">
    <complexType>
      <choice minOccurs="1" maxOccurs="unbounded">
        <element name="DataReference" type="xenc:ReferenceType"/>
        <element name="KeyReference" type="xenc:ReferenceType"/>
      </choice>
    </complexType>
  </element>

  <complexType name="ReferenceType">
    <sequence>
      <any namespace="##other" minOccurs="0" maxOccurs="unbounded"/>
    </sequence>
    <attribute name="URI" type="anyURI" use="required"/>
  </complexType>

  <element name="EncryptionProperties" type="xenc:EncryptionPropertiesType"/>
  <complexType name="EncryptionPropertiesType">
    <sequence>
      <element ref="xenc:EncryptionProperty" maxOccurs="unbounded"/>
    </sequence>
    <attribute name="Id" type="ID" use="optional"/>
  </complexType>

  <element name="EncryptionProperty" type="xenc:EncryptionPropertyType"/>
  <complexType name="EncryptionPropertyType" mixed="true">
    <choice maxOccurs="unbounded">
      <any namespace="##other" processContents="lax"/>
    </choice>
    <attribute name="Target" type="anyURI" use="optional"/>
    <attribute name="Id" type="ID" use="optional"/>
    <anyAttribute namespace="http://www.w3.org/XML/1998/namespace"/>
  </complexType>
</schema>
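For illustration, a minimal hypothetical instance document conforming to this schema is shown below. It carries the symmetric Content key (encrypted with the recipient's public key) inside ds:KeyInfo, as in the protocol of Appendix 1; the CipherValue contents are placeholders, not real base64 data.

```xml
<EncryptedData xmlns="http://www.w3.org/2001/04/xmlenc#"
               Type="http://www.w3.org/2001/04/xmlenc#Element">
  <EncryptionMethod Algorithm="http://www.w3.org/2001/04/xmlenc#aes128-cbc"/>
  <ds:KeyInfo xmlns:ds="http://www.w3.org/2000/09/xmldsig#">
    <EncryptedKey>
      <EncryptionMethod Algorithm="http://www.w3.org/2001/04/xmlenc#rsa-1_5"/>
      <CipherData>
        <CipherValue><!-- Content key encrypted with the recipient's public key --></CipherValue>
      </CipherData>
    </EncryptedKey>
  </ds:KeyInfo>
  <CipherData>
    <CipherValue><!-- data encrypted with the Content key --></CipherValue>
  </CipherData>
</EncryptedData>
```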
Appendix 3 Translation of OOV terms
OOV term: SQUT SCP SCPCD SE MI
Chiutou:
Autumn Struggle: 秋鬥大遊 從秋鬥 秋鬥 秋鬥 秋鬥
Jonnie Walker: 約翰走路 約翰走路 黑次元 高雄演唱 高雄演唱
Charity Golf Tournament: 慈善高爾夫球賽 慈善高爾夫球賽 慈善高 慈善高
Embryonic Stem Cell: 胚胎幹細胞 胚胎幹細胞 胚胎幹細胞
Florence Griffith Joyner: 花蝴蝶 葛瑞菲絲 葛瑞菲絲 花蝴蝶 花蝴蝶
FloJo: 佛羅倫薩格里菲斯 花蝴蝶 花蝴蝶 花蝴蝶 花蝴蝶
Michael Jordan: 麥可喬丹 麥可喬丹 喬丹 喬丹 喬丹
Torrijos Carter Treaty:
Viagra:
Hu Jin tao: 胡錦濤 胡錦濤 胡錦濤 胡錦濤 胡錦濤
Wang Dan: 天安門 王丹 王丹 王丹
Tiananmen: 天安門廣場 天安門 天安門 天安門 天安門
Akira Kurosawa: 黑澤明 黑澤明 黑澤明 黑澤明 黑澤明
Keizo Obuchi: 小淵惠三 小淵惠三 小淵惠三 小淵惠三 小淵惠三
Environmental Hormone: 環境荷爾蒙 環境荷爾蒙 環境荷爾蒙 環境荷爾蒙
Acquired Immune Deficiency Syndrome: 後天免疫缺乏症候群 愛滋病 愛滋病 愛滋病 愛滋
Social Problem: 社會問題 社會問題 社會問題
Kia Motors: 起亞汽車 起亞汽車 起亞汽車 起亞 起亞
Self Defense Force: 自衛隊 自衛隊 自衛隊 自衛隊 自衛隊
Animal Cloning Technique: 動物克隆技術 動物克隆技術
Political Crisis: 政治危機 政治危機 政治危機
Public Officer: 公職人員 公職人員 公職人員 公職人員
Research Trend: 研究趨勢 研究趨勢 研究趨勢 研究趨勢
Foreign Worker: 外籍勞工 外籍勞工 外籍勞工 外籍勞工
World Cup: 世界盃 世界盃 世界盃 世界盃 世界盃
Apple Computer: 蘋果公司 蘋果電腦 蘋果電腦 蘋果電腦 蘋果電腦
Weapon of Mass Destruction: 大規模毀滅性武器 大規模毀滅性武器 性武器
Energy Consumption: 能源消費 能源消費 能源消費
International Space Station: 國際太空站 國際太空站 國際太空站
President Habibie: 哈比比總統 哈比比總統 哈比比總統 哈比比
Underground Nuclear Test: 地下核試驗 地下核試驗 地下核試
F117: 戰鬥機 隱形戰鬥機 隱形戰 隱形戰 隱形戰
Stealth Fighter: 隱形戰機 隱形戰機 形戰鬥機 形戰鬥機 形戰鬥機
Masako: 雅子 太子妃 雅子 雅子 雅子
Copyright Protection: 版權保護 版權保護 版權保護 版權保護 版權保護
Daepodong: 大浦洞 大浦洞 大浦洞 大浦洞 大浦洞
Contactless SMART Card: 智慧卡 非接觸式智慧卡 非接觸式智慧卡 非接觸式 非接觸式
Han Dynasty: 漢朝 大漢風 漢朝 漢朝 漢朝
Promoting Academic Excellence: 學術追求卓越發展計畫 卓越計畫 卓越發展計畫 卓越發展計畫 卓越發
China Airlines: 中華航空 中華航空 中華航空 中華航空 長榮
ST1:
El Nino: 聖嬰 聖嬰現象 聖嬰現象 聖嬰 聖嬰
Mount Ali: 阿里山 阿里山 阿里山 阿里山 阿里山
Kazuhiro Sasaki: 佐佐木主浩 佐佐木主浩 佐佐木 佐佐木 佐佐木
Seattle Mariners: 西雅圖水手 西雅圖水手 西雅圖水手
Takeshi Kitano: 北野武 北野武 北野武 北野武 北野武
European monetary union: 歐洲貨幣聯盟 歐洲貨幣聯盟 歐洲貨幣 歐洲貨幣 歐洲貨幣
capital tie up:
Nissan Motor Company: 日產汽車公司 汽車公司 汽車公司 處經濟 處經濟
Renault: 雷諾 休旅車 雷諾 雷諾 雷諾
Pol Pot: 波布 紅高棉 紅高棉 紅高棉 紅高棉
war crime: 戰爭罪 戰爭罪 戰爭罪 戰爭罪
Kim Dae Jung: 金大中 金大中 金大中 金大中 金大中
Clinton: 克林顿 克林顿 克林顿
New Year Holiday: 新年假期 新年假期 新年假期
Drunken Driving: 醉後駕車 醉後駕車 醉後駕車 醉後駕車 後駕車
Science Camp: 科學營 科學營 科學營 科學營
Nelson Mandela: 曼德拉 曼德拉 曼德拉 曼德拉 曼德拉
Kim Il Sung: 金日成 金日成 金日成 金日成 金日成
anticancer drug: 抗癌藥物
consumption tax: 消費稅 消費稅 消費稅 消費稅 費稅
Uruguay Round: 烏拉圭回合 烏拉圭回合 烏拉圭回合
Kim Jong Il: 金正日 金正日 金正日 金正日 金正日
Time Warner: 時代華納 時代華納 時代華納 時代華納 時代華納
American Online: 美國線上 美國線上 美國線上 美國線上 美國線上
Alberto Fujimori: 藤森 藤森 藤森 藤森 藤森
Taliban: 塔利班 塔利班 塔利班 塔利班 塔利班
Tiger Woods: 老虎伍茲 老虎伍茲 老虎伍茲 老虎伍茲 伍茲
Harry Potter: 哈利波特 哈利波特 哈利波特 哈利波特 哈利波特
Greenspan: 葛林斯班 葛林斯班 葛林斯班 葛林斯
monetary policy: 貨幣政策 貨幣政策 貨幣政策 貨幣政策
abnormal weather: 天氣異常 天氣異常 天氣異常 天氣異常 天氣
National Council of Timorese Resistance: 帝汶抵抗全國委員會 帝汶抵抗全國委員會 帝汶抵抗全國委員會 帝汶抵抗全國委員會 帝汶抵抗全國委員會
Bibliography
The List of Commonly Used Chinese Characters, Ministry of Education of the People's Republic of China.
Aberer, K., F. Klemm, et al. (2001). An Architecture for Peer-to-Peer Information Retrieval. The Ninth International Conference on Information and Knowledge Management.
Bartel, M., J. Boyer , et al. (2002). XML-Signature Syntax and Processing D. Eastlake, J. Reagle and D. Solo, The World Wide Web Consortium (W3C).
Bawa, M., G. S. Manku, et al. (2003). SETS: search enhanced by topic segmentation. Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval, ACM Press.
Beverly, Y. and G.-M. Hector. (2001). "Improving Search in Peer-to-Peer Systems." from http://dbpubs.stanford.edu:8090/pub/2001-47.
Brown, E. (1999). Parallel and Distributed IR. Modern Information Retrieval. R. Baeza-Yates and B. Ribeiro-Neto, Addison Wesley: 229-256.
Callan, J. P. and M. E. Connell (2001). "Query-Based Sampling of Text Databases." ACM Transactions on Information Systems 19(2): 97-130.
Callan, J. P., Z. Lu, et al. (1995). Searching distributed collections with inference networks. Proceedings of the 18th annual international ACM SIGIR conference on Research and development in information retrieval, Seattle, Washington, United States, ACM Press.
Charles, P., N. Good, et al. (2003). "How Much Information?" Retrieved 19 Aug. 2007, from http://www2.sims.berkeley.edu/research/projects/how-much-info-2003/execsum.htm#summary.
Chen, A. and F. Gey (2003). Experiments on Cross-language and Patent retrieval at the NTCIR-3 Workshop. Proceedings of the 3rd NTCIR Workshop, Japan.
Chen, A., H. Jiang, et al. (2000). Combining multiple sources for short query translation in Chinese-English cross-language information retrieval. Proceedings of the fifth international workshop on Information retrieval with Asian languages. Hong Kong, China, ACM Press.
Cheng, P.-J., J.-W. Teng, et al. (2004). Translating unknown queries with web corpora for cross-language information retrieval. Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval. Sheffield, United Kingdom, ACM Press.
Chien, L.-F. (1997). PAT-tree-based keyword extraction for Chinese information retrieval. Proceedings of the 20th annual international ACM SIGIR conference on Research and development in information retrieval. Philadelphia, Pennsylvania, United States, ACM Press.
Choi, Y. S. and S. I. Yoo (2001). "Text database discovery on the Web: Neural net based approach." Journal of Intelligent Information Systems 15(3).
Church, K. W. and P. Hanks (1990). "Word association norms, mutual information and lexicography." Computational Linguistics 16(1).
Craswell, N., D. Hawking, et al. (1999). Merging Results From Isolated Search Engines. Australasian Database Conference.
de Kretser, O., A. Moffat, et al. (1998). Methodologies for distributed information retrieval. Distributed Computing Systems, 1998. Proceedings. 18th International Conference on.
Ehrig, M., C. Schmitz, et al. (2004). "Towards Evaluation of Peer-to-Peer-based Distributed Information Management Systems." Lecture Notes in Computer Science (2926): 73-88.
Eijk, P. v. d. (1993). Automating the acquisition of bilingual terminology. Proceedings of the sixth conference of the European chapter of the Association for Computational Linguistics, Utrecht, The Netherlands.
Gao, J., J.-Y. Nie, et al. (2001). Improving query translation for cross-language information retrieval using statistical models. Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval. New Orleans, Louisiana, United States, ACM Press.
Geva, S. (2006). Gardens Point XML IR at INEX 2005. Comparative Evaluation of XML Information Retrieval Systems: 4th International Workshop of the Initiative for the Evaluation of XML Retrieval, Springer.
Gravano, L., C.-C. K. Chang, et al. (1997). STARTS: Stanford proposal for Internet meta-searching. Proceedings of the 1997 ACM SIGMOD Conference.
Gravano, L. and H. García-Molina (1995). Generalizing GlOSS to Vector-Space databases and Broker Hierarchies. Proceedings of the 21st International Conference on Very Large Databases.
Hawking, D. and P. Thistlewaite (1999). "Methods for information server selection." ACM Transactions on Information Systems 17(1): 40-76.
IETF (1996). "RFC-1939 Post Office Protocol - Version 3."
IETF (2001). "RFC-2821 Simple Mail Transfer Protocol."
IETF (2002). "RFC-3280 Internet X.509 Public Key Infrastructure - Certificate and Certificate Revocation List (CRL) Profile."
IETF (2003). RFC-3501 INTERNET MESSAGE ACCESS PROTOCOL - VERSION 4rev1.
Imamura, T., B. Dillaway, et al. (2002). XML Encryption Syntax and Processing. D. Eastlake and J. Reagle, The World Wide Web Consortium (W3C).
Ipeirotis, P. G. and L. Gravano (2001). "Distributed Search over the Hidden Web: Hierarchical Database Sampling and Selection." Journal of Intelligent Information Systems 15(3).
Ipeirotis, P. G., L. Gravano, et al. (2001). Probe, Count, and Classify:Categorizing Hidden-Web Databases. The 2001 ACM SIGMOD International Conference on Management of Data, ACM.
ITU (2000). X.509 Information technology – Open systems interconnection – The Directory: Public-key and attribute certificate frameworks. International Telecommunication Union.
Jang, M.-G., S. H. Myaeng, et al. (1999). Using mutual information to resolve query translation ambiguities and query term weighting. Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics. College Park, Maryland, Association for Computational Linguistics.
King, A. (2008). "Average Web Page Triples Since 2003." from http://www.websiteoptimization.com/speed/tweak/average-web-page/.
King, J. and Y. Li (2003). Web based collection selection using singular value decomposition. WIC International Conference on Web Intelligence.
Kirsch, S. T. (1997). Document retrieval over networks wherein ranking and relevance scores are computed at the client for multiple database documents. United States Patent I. C. (US). United States.
Kupiec, J. M. (1993). An Algorithm for Finding Noun Phrase Correspondences in Bilingual Corpora. Proceedings of the 31st Annual Meeting of the ACL. Columbus, Ohio.
Li, C. and H. Li (2001). Word translation disambiguation using bilingual bootstrapping. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics.
Liu, K.-L., C. Yu, et al. (2001). Discovering the representative of a search engine. The 10th ACM International Conference on Information and Knowledge Management.
Liu, K.-L., C. Yu, et al. (2002). "A Statistical Method for Estimating the Usefulness of Text Databases." IEEE Transactions on Knowledge and Data Engineering 14(6).
Lu, C., Y. Xu, et al. (2007). Translation disambiguation in web-based translation extraction for English-Chinese CLIR. Proceeding of The 22nd Annual ACM Symposium on Applied Computing.
Lu, W.-H., L.-F. Chien, et al. (2004). "Anchor text mining for translation of Web queries: A transitive translation approach." ACM Transactions on Information Systems (TOIS) 22(2): 242-269.
Lv, Q., P. Cao, et al. (2002). Search and replication in unstructured peer-to-peer networks. The 16th international conference on Supercomputing, ACM Press.
Maeda, A., F. Sadat, et al. (2000). Query term disambiguation for Web cross-language information retrieval using a search engine. Proceedings of the fifth international workshop on on Information retrieval with Asian languages. Hong Kong, China, ACM Press.
McDaniel, P., A. Prakash, et al. (1999). Antigone: A Flexible Framework for Secure Group Communication. The 8th USENIX UNIX Security Symposium.
Ng, W. S., B. C. Ooi, et al. (2003). PeerDB: A P2P-based System for Distributed Data Sharing. The 19th International Conference on Data Engineering.
Nie, J.-Y., M. Simard, et al. (1999). Cross-language information retrieval based on parallel texts and automatic mining of parallel texts from the Web. Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval. Berkeley, California, United States, ACM Press.
Paola, V. and K. Sanjeev (2003). Transliteration of proper names in cross-lingual information retrieval. Proceedings of the ACL 2003 workshop on Multilingual and mixed-language named entity recognition - Volume 15, Association for Computational Linguistics.
Pirkola, A., T. Hedlund, et al. (2001). "Dictionary-Based Cross-Language Information Retrieval: Problems, Methods, and Research Findings." Information Retrieval 4(3-4): 209 - 230.
Powell, A. L. and J. C. French (2003). "Comparing the performance of collection selection algorithms." ACM Transactions on Information Systems (TOIS) 21(4): 412-456.
Rasolofo, Y., F. Abbaci, et al. (2001). Approaches to collection selection and results merging for distributed information retrieval. Proceedings of the tenth international conference on Information and knowledge management, Atlanta, Georgia, USA.
Rosenberg, J., R. Mahy, et al. (2005). TURN: traversal using relay NAT. IETF.
Rosenberg, J., J. Weinberger, et al. (2003). STUN: Simple Traversal of User Datagram Protocol (UDP) Through Network Address Translators (NATs). RFC 3489. IETF.
Salovesh, M. (1996). "How many words in an 'average' person's vocabulary?" from http://unauthorised.org/anthropology/anthro-l/august-1996/0436.html.
Salton, G. and M. J. McGill (1996). Introduction to Modern Information Retrieval, McGraw-Hill, Inc.
Sankar, K. (2003). "What is Peer to Peer." from http://p2p.internet2.edu/documents/What%20is%20peer%20to%20peer-5.pdf.
Saracevic, T. (1995). Evaluation of evaluation in information retrieval. Proceedings of the 18th annual international ACM SIGIR conference on Research and development in information retrieval, Seattle, Washington, United States, ACM.
Savoy, J., A. L. Calv, et al. (1998). Report on the TREC-5 Experiment: Data Fusion and Collection Fusion. The Fifth Text REtrieval Conference (TREC-5).
Saxena, N., G. Tsudik, et al. (2003). Admission Control in Peer-to-Peer: Design and Performance Evaluation. The 1st ACM workshop on Security of ad hoc and sensor networks, ACM Press.
Si, L. and J. Callan (2002). Using sampled data and regression to merge search engine results. Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval. Tampere, Finland.
Si, L. and J. Callan (2003). Relevant Document Distribution Estimation Method for Resource Selection. The 26th annual international ACM SIGIR conference on Research and development in information retrieval.
Silva, J. F. d., G. Dias, et al. (1999). Using LocalMaxs Algorithm for the Extraction of Contiguous and Non-contiguous Multiword Lexical Units. Progress in Artificial Intelligence: 9th Portuguese Conference on Artificial Intelligence.
Silva, J. F. d. and G. P. Lopes (1999). A Local Maxima Method and a Fair Dispersion Normalization for Extracting Multiword Units. International Conference on Mathematics of Language.
Singhal, A. (2001). "Modern Information Retrieval: A Brief Overview." Bulletin of the IEEE Computer Society Technical Committee on Data Engineering 24(4): 35-43.
Smadja, F., K. R. McKeown, et al. (1996). "Translating collocations for bilingual lexicons: a statistical approach." Computational Linguistics 22(1): 38.
Steidinger, A. (2000). Comparison of different Collection Fusion Models in Distributed Information Retrieval. DELOS Workshop on Information Seeking, Searching and Querying in Digital Libraries.
Sun, L. and G. C. Chen (2001). Implementation of large-scale distributed information retrieval system. Proceedings of the 2001 International Conferences on Info-tech and Info-net (ICII 2001), Beijing.
Tang, C., Z. Xu, et al. (2003). Peer-to-peer information retrieval using self-organizing semantic overlay networks. Proceedings of the 2003 conference on Applications, technologies, architectures, and protocols for computer communications, Karlsruhe, Germany, ACM Press New York, NY, USA.
Towell, G., E. M. Voorhees, et al. (1995). Learning Collection Fusion Strategies for Information Retrieval. The Twelfth Annual Machine Learning Conference. Lake Tahoe.
Viles, C. L. and J. C. French (1995). Dissemination of Collection Wide Information in a Distributed Information Retrieval System. Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.
Voorhees, E. M., N. K. Gupta, et al. (1994). The Collection Fusion Problem. The Third Text REtrieval Conference (TREC-3). Gaithersburg, M.D., National Institute of Standards and Technology.
Wu, G. (2004). Research and Application on Statistical Language Model. Computer science and technology. Beijing, Tsinghua University, China.
Yan, Q., G. Gregory, et al. (2003). Automatic transliteration for Japanese-to-English text retrieval. Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval. Toronto, Canada, ACM Press.
Yuwono, B. and D. L. Lee (1997). Server Ranking for Distributed Text Retrieval Systems on Internet. Database System for Advance Applications.
Zeinalipour-Yazti, D., V. Kalogeraki, et al. (2003). "Exploiting locality for scalable information retrieval in peer-to-peer networks." Information Systems, in press.
Zhang, Y. and P. Vines (2004). Using the web for automated translation extraction in cross-language information retrieval. Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval. Sheffield, United Kingdom, ACM Press.
Zhao, M.-Y., J. Shen, et al. (2006). "A New Algorithm for Automatic Text Classification." Journal of Yangzhou University (Natural Science Edition) 9(1).