
Peer to Peer English/Chinese Cross-Language Information

Retrieval

By

CHENGYE LU

A dissertation submitted for the degree of

Doctor of Philosophy

Faculty of Science and Technology

Queensland University of Technology

September 2008


Keywords

Peer to peer system, distributed information retrieval, security, cross-language

information retrieval, query translation, out of vocabulary problem, translation

disambiguation, collection fusion, collection profiling


Abstract

Peer to peer systems are widely used on the Internet. However, most peer to peer information systems still lack some important features, for example cross-language IR (Information Retrieval) and collection selection/fusion.

Cross-language IR is a state-of-the-art research area in the IR community; it has not yet been used in any real-world IR system. Cross-language IR lets a user issue a query in one language and receive documents in other languages. In a typical peer to peer environment, users come from multiple countries and their collections are in multiple languages, so cross-language IR can help users find documents more easily. For example, many Chinese researchers search for research papers in both Chinese and English; with cross-language IR, they can issue one query in Chinese and get documents in both languages.

The Out Of Vocabulary (OOV) problem is one of the key research areas in cross-language information retrieval. In recent years, web mining has been shown to be an effective approach to this problem. However, how to extract Multiword Lexical Units (MLUs) from web content and how to select the correct translations from the extracted candidate MLUs remain two difficult problems in web mining based automated translation approaches.

Discovering resource descriptions and merging results obtained from remote search engines are two key issues in distributed information retrieval. In uncooperative environments, query-based sampling and normalized-score based merging are well-known approaches to these problems. However, such approaches consider only the content of the remote database, not the retrieval performance of the remote search engine.

This thesis presents research on building a peer to peer IR system with cross-language IR and an advanced collection profiling technique for fusion. In particular, the thesis first presents a new Chinese term measurement and a new Chinese MLU extraction process that work well on small corpora, together with an approach for selecting MLUs more accurately. The thesis then proposes a collection profiling strategy that can discover not only the content of a remote collection but also the retrieval performance of the remote search engine. Based on collection profiling, a web-based query classification method and two collection fusion approaches are developed and presented. Our experiments show that the proposed strategies are effective for merging results in uncooperative peer to peer environments. Here, an uncooperative environment is one in which each peer is autonomous: peers are willing to share documents but do not share collection statistics. This is a typical peer to peer IR environment. Finally, all of these approaches are combined to build a secure peer to peer multilingual IR system that cooperates through X.509 certificates and the email system.


Table of Contents

Keywords ................................................................................................................ i

Abstract ................................................................................................................ iii

Table of Contents .................................................................................................. vi

List of Figures ...................................................................................................... xiii

List of Tables ........................................................................................................ xv

Acknowledgement ............................................................................................. xvii

List of Peer Reviewed Publications ........................................................................ xix

Statement of Original Authorship ........................................................................ xxi

Chapter 1 Introduction ........................................................................................... 1

1.1 Background................................................................................................... 3

1.2 Contributions ................................................................................................ 8

1.3 Thesis outline.............................................................................................. 11


Chapter 2 Literature review ................................................................................. 13

2.1 Evaluation of Peer to Peer Information Retrieval System ............................ 14

2.2 Peer to peer information retrieval system architecture .............................. 17

2.2.1 Centralized architecture ....................................................................... 18

2.2.2 Unstructured architecture .............................................................. 19

2.2.3 Structured architecture ................................................................... 20

2.3 Firewall traversal ........................................................................................ 21

2.4 Distributed computing and Peer to peer system ......................................... 22

2.5 Collection selection and fusion ................................................................... 24

2.5.1 Collection rank ..................................................................................... 26

2.5.2 Getting resource description ................................................................ 28

2.5.3 Collection fusion approaches................................................................ 31

2.6 Translation .................................................................................................. 35


2.6.1 Transliteration ...................................................................................... 37

2.6.2 Parallel Text Mining .............................................................................. 39

2.6.3 Web mining for query translation ......................................................... 41

2.7 Term extraction .......................................................................................... 43

2.7.1 Mutual information and its variations .................................................. 45

2.7.2 Local Maxima based approaches .......................................................... 47

2.8 Summary .................................................................................................... 48

Chapter 3 Web based query translation ............................................................... 50

3.1 Collecting Web Document summaries ........................................................ 53

3.2 Term extraction .......................................................................................... 56

3.2.1 Frequency Change Measurement ......................................................... 56

3.2.2 A Bottom-up Term Extraction Strategy ................................................. 61

3.3 Translation selection ................................................................................... 65


3.3.1 The algorithm ....................................................................................... 66

3.3.2 Time Complexities ................................................................................ 70

Chapter 4 Multilingual Experiments ..................................................................... 73

4.1 Test set ....................................................................................................... 73

4.2 Term extraction experiments ...................................................................... 75

4.3 Discussion ................................................................................................... 76

4.3.1 Mutual information based approaches ................................................. 77

4.3.2 Local Maxima based approaches .......................................................... 79

4.3.3 SQUT Approach .................................................................................... 80

4.4 Translation selection Experiments .............................................................. 82

4.5 Discussion ................................................................................................... 83

4.5.1 IgnoreOOV ........................................................................................... 83

4.5.2 SimpleSelect ......................................................................................... 85


4.5.3 TQUT .................................................................................................... 87

Chapter 5 Web based collection profiling for collection fusion ............................. 90

5.1 A simple example........................................................................................ 91

5.2 Collection profiling ...................................................................................... 92

5.3 Query classification ..................................................................................... 94

5.4 Collection fusion ......................................................................................... 98

Chapter 6 Profiling and fusion evaluation ........................................................... 102

6.1 Query classification ................................................................................... 103

6.1.1 Discussion .......................................................................................... 103

6.2 Collection Rank Experiment ...................................................................... 105

6.2.1 Discussion .......................................................................................... 105

6.3 Collection fusion experiment .................................................................... 108

6.4 Discussion ................................................................................................. 109


Chapter 7 The P2PIR System architecture .......................................................... 115

7.1 System Design .......................................................................................... 120

7.2 The P2PIR system Search lifecycle ............................................................. 121

7.3 Communication protocol .......................................................................... 124

7.4 Security..................................................................................................... 129

7.4.1 Group management ........................................................................... 129

7.4.2 Security protocol ................................................................................ 133

7.4.3 Encryption methods ........................................................................... 135

7.4.4 XML signature .................................................................................... 136

7.5 The search engine ..................................................................................... 137

Chapter 8 Peer to peer system evaluation .......................................................... 140

8.1.1 Test environment ............................................................................... 140

8.1.2 Test queries and collection ................................................................. 141


8.1.3 Test results ......................................................................................... 141

8.1.4 Results and raw score fusion .............................................................. 144

Chapter 9 Conclusion and future work ............................................................... 147

9.1 Summary .................................................................................................. 147

9.2 Future work .............................................................................................. 150

Appendix ............................................................................................................ 153

Bibliography ....................................................................................................... 161


List of Figures

Figure 2–1 Search in a semantic space .................................................... 21

Figure 2–2 The process of collection selection ......................................... 24

Figure 3–1 Three sample document summaries for “Stealth Fighter” returned from Google ............................................................................. 54

Figure 3–2 Sample output of Chinese string collection ............................ 55

Figure 6–1 P-R curves of 4 runs under System1 ..................................... 110

Figure 6–2 P-R curves of 4 runs under M-GPX ....................................... 111

Figure 7–1 System overview ................................................................. 115

Figure 7–2 The P2PIR system network structure ................................... 117

Figure 7–3 Dataflow of Making Search Request .................................... 122

Figure 7–4 Dataflow of Response to Search Request............................. 123

Figure 7–5 Dataflow of Receiving Results .............................................. 124


Figure 7–6 Schema for plain text communication ................................. 128

Figure 7–7 Member validation process ................................................. 132

Figure 7–8 XML encryption structure (Imamura, Dillaway et al. 2002) ... 135

Figure 8–1 PR curves ............................................................................. 146

Figure 10–1 Encryption and Decryption using Symmetric and Asymmetric together ............................................................................................... 154


List of Tables

Table 3-1 Chinese strings and R .............................................................. 60

Table 3-2 Sample Combination of Translations ....................................... 70

Table 4-1 Test document collections ....................................................... 74

Table 4-2 OOV translation accuracy ........................................................ 77

Table 4-3 Some Extracted terms by MI ................................................... 77

Table 4-4 NTCIR retrieval performance ................................................... 83

Table 4-5 Retrieval performance on queries that contain OOV terms only ...............................................................................................

Table 4-6 OOV translation accuracy for NTCIR5 collection ...................... 87

Table 6-1 Collection Rank Top 5 for topic “Financial news (財經新聞)” 106

Table 6-2 Collection Rank Top 5 – International (國際新聞) ................. 106

Table 6-3 Collection Rank Top 5 – Science/Tech (科技新聞) ................. 106

xvi

Table 6-4 Collection Rank Top 5 – Sports (運動新聞) ............................ 107

Table 6-5 Average Precision – System1 ................................................. 110

Table 6-6 Average Precision – M-GPX .................................................... 111

Table 6-7 Query classification VS No classification ................................ 113

Table 8-1 Collection size and search time.............................................. 142

Table 8-2 Comparison of Centralized search and Distributed search ....... 143


Acknowledgement

Firstly, I would like to express my immense gratitude to Professor Shlomo Geva, my principal supervisor, for all his guidance and encouragement throughout this research work. He has always been there, providing unfailing support with his excellent expertise in the information retrieval area. Many thanks also go to my associate supervisor, Dr. Yue Xu, for her generous support and comments on my work, drawing on her knowledge of data mining and English–Chinese translation.

I would also like to thank my examiners for their valuable comments and suggestions.

Special thanks must go to the Faculty of Information Technology, QUT, which provided me with a comfortable research environment, the needed facilities, and financial support, including my scholarship and travel allowances, over the period of my candidature. I would especially like to thank all the members of our research group for offering invaluable advice and comments regarding my research work.


This work would not have been accomplished without the constant support of my family. I would like to dedicate this thesis to my parents for their never-ending encouragement over these years. Last but certainly not least, I would like to thank my wife Ning and my parents-in-law for their tremendous support.


List of Peer Reviewed Publications

Journal paper:

1. Chengye Lu, Yue Xu, Shlomo Geva: Web-based Query Translation

for English-Chinese CLIR. International Journal of Computational

Linguistics and Chinese Language Processing, 61-90.

Conference papers:

1. Chengye Lu, Yue Xu, Shlomo Geva: A Bottom-Up Term Extraction

Approach for Web-Based Translation in Chinese-English IR Systems.

Proceedings of the 2007 Australasian Document Computing

Symposium.

2. Chengye Lu, Yue Xu, Shlomo Geva: Collection Profiling for

Collection Fusion in Distributed Information Retrieval Systems.

Proceedings of the 2007 Knowledge Science, Engineering and

Management: 279-288

3. Chengye Lu, Yue Xu, Shlomo Geva: Translation disambiguation in web-based translation extraction for English-Chinese CLIR. Proceedings of the 2007 ACM symposium on Applied computing: 819-823

4. Chengye Lu, Shlomo Geva: Secure email-based peer to peer

information retrieval. Proceedings of the 2005 International

Conference on Cyberworlds: 531-538


Statement of Original Authorship

The work contained in this thesis has not been previously submitted to

meet requirements for an award at this or any other higher education

institution. To the best of my knowledge and belief, the thesis contains no

material previously published or written by another person except where

due reference is made.

Name:__________________________________________________

Signed:_________________________________________________

Date:___________________________________________________


Chapter 1

Introduction

Traditional web search engines use special software to navigate and browse the web, automatically following hyperlinks from one document to another and extracting textual information. This information is used to build huge index structures correlating keywords, hyperlinks, and other document features to web pages (Sun and Chen 2001). Obviously such tasks require powerful servers and massive amounts of storage and bandwidth to monitor the changing web and to index all of its pages. It is estimated that the volume of information on the web is about 66.8 to 91.9 Petabytes (10^15 bytes) (Charles, Good et al. 2003), more than three times the amount of information in 2000. One research paper reports that by 2008 the size of the average web page had more than tripled since 2003 (King 2008). According to that research, even with the same number of web pages as in 2003, there would be more than 200 Petabytes of data on the web.

Due to the sheer scale of the web, indexing the whole web with classic centralized models and algorithms is an impossible task. As a result, decentralized models have become an alternative for web information retrieval, and Peer to Peer (P2P) information systems appear to be the most promising decentralized models.

Peer to Peer systems are decentralized, large-scale computer networks in which peers operate as clients and servers at the same time (Aberer, Klemm et al. 2001). As personal computers become more powerful and disk space becomes larger and cheaper, more and more people join P2P networks to share their personal documents. Existing P2P systems already form very large networks, such as the BitTorrent network and the eDonkey network, with millions of personal computers participating. Even with such a large number of computers, exhaustively indexing the whole web through P2P networks remains impractical. It is, however, interesting to investigate how P2P information sharing systems might provide collaborative P2P information retrieval (P2PIR) on networks of PCs. At the same time, more and more people can read more than one language and would like to search for documents in their second or third language; for example, many Chinese researchers need to find English documents for reference. However, none of the popular P2P systems provide multilingual features: the search engine searches documents in only one language. As a result, users have to translate queries manually from one language to another and run both the original and the translated queries to find documents in different languages. This is inconvenient for users.


1.1 Background

There are a number of problems in P2PIR systems. Some of them are introduced here and discussed in more detail later in the thesis. The first problem is that current P2P systems lack IR features.

Public domain P2P systems are widely used in Internet file sharing and instant messaging systems (Ng, Ooi et al. 2003), such as the Napster, Kazaa and ICQ systems with which the reader may be familiar. However, such systems focus on robust connectivity and scalability. They do not meet the very demanding requirements of Information Retrieval operations, which are of primary interest in group collaboration for information sharing. Compared with traditional centralized search engines, P2P systems are relatively simple. They provide only file-level search, which means, for instance, that one can search by document title but not by document content. Therefore, basic IR operations, such as text searching and result ranking, are not possible. This limitation makes it difficult, if not impossible, for users to find the documents they really need.

The second problem is that current applications lack security features. Most popular P2P systems assume complete trust between peers; therefore, no access control mechanisms are provided to ensure security, privacy, and confidentiality (Sun and Chen 2001). But access control is very important for P2PIR in order to support secure collaborative information sharing. For example, a large medical research team may be distributed around the globe, working jointly on a specific problem. The team might have to share patient clinical data held by team members on distributed private and secure PCs. Such information must be kept private, and centralization of records may be completely out of the question. It cannot be placed in the public domain, so there is no scope for exploiting public domain P2P file sharing software or the available, but very intrusive, public domain search engines. At the same time, without the expense of establishing and maintaining a private network and a centralized search service, it is not possible to provide a comprehensive solution that is as easy to join and use as the existing public domain systems. Furthermore, public domain P2P systems cannot be used in environments where users are locked behind firewalls. It is very common for different user groups to exist behind different organizations' firewalls, sometimes even compartmentalized within a single organization. This greatly restricts the use of direct P2P systems in collaborative information sharing.

Some recent distributed systems (McDaniel, Prakash et al. 1999; Saxena, Tsudik et al. 2003) do focus on security, but such systems are designed specifically for particular group collaborations. They do not have the appealing general usability and scalability of open P2P systems. They also require adherence to standards, centralization, and coordination, none of which is enforceable, or even desirable, in a voluntary Special Interest Group (SIG) environment with truly independent members. The cost of membership, in terms of the effort of participation, must be close to zero for such groups to prosper. A fundamental lesson of the Internet phenomenon is that decentralization and minimal control are critical success factors in voluntary collaboration.

The third problem with current systems is the lack of multilingual features: current P2P-IR systems allow only monolingual information retrieval. As more and more documents written in various languages become available on the Internet, users increasingly wish to explore documents written in either their native language or some other language. Cross-language information retrieval (CLIR) systems enable users to retrieve documents written in more than one language through a single query. This is a helpful end user feature. For example, a researcher from China may need to find documents in both Chinese and English; CLIR could help him search with a single query, in either English or Chinese, and find articles in both languages.

Moreover, the use of CLIR reduces both the communication cost of the whole peer to peer system and the CPU load. In a monolingual environment, a user who wants documents in two languages must send out two queries, and two sets of documents will be sent back. If the cut-off is 100 documents and 10 peers return results, he will receive 2000 documents. In a cross-language IR environment the user sends out only one query, so only one set of results comes back from the remote peers; with the same cut-off of 100 documents and 10 responding peers, he receives 1000 documents. That is a saving of 50% in communication cost, and likewise a saving of 50% in CPU power during collection fusion processing. The more languages the user is looking for, the more CLIR saves: if the user is looking for documents in 5 languages, the saving is 80%.

In addition, CLIR can provide better retrieval results. Chinese is used in different regions, and the translation of the same English term can differ considerably between regions. For example, the movie "Planet of the Apes" is translated as "决战猩球" in Taiwan but "猿人争霸战" in Hong Kong. No Chinese speaker can be expected to know the translations used in all regions, and a user who chooses only the translation he knows will miss all the related documents that use other translations. CLIR helps the user find all possible translations, so no documents are missed.

Finally, CLIR can be used in tasks that humans cannot perform, such as query expansion and relevance feedback, where a huge number of terms need to be translated. Most end users would want such a process to be automated and would not want to be involved in the translation.
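The arithmetic behind this saving can be sketched in a few lines. This is an illustrative calculation only, using the peer count and cut-off from the example above:

```python
def documents_returned(languages: int, peers: int, cutoff: int, clir: bool) -> int:
    """Total documents sent back to the querying peer.

    Without CLIR the user issues one query per language and every peer
    answers each query with up to `cutoff` documents; with CLIR a single
    query covers all languages, so each peer answers only once."""
    queries = 1 if clir else languages
    return queries * peers * cutoff

# The two-language example from the text: 10 peers, cut-off of 100.
mono = documents_returned(languages=2, peers=10, cutoff=100, clir=False)      # 2000
with_clir = documents_returned(languages=2, peers=10, cutoff=100, clir=True)  # 1000
saving = 1 - with_clir / mono  # 0.5, the 50% saving cited above

# Five languages: the saving grows to 80%.
saving5 = 1 - (documents_returned(5, 10, 100, True)
               / documents_returned(5, 10, 100, False))
```

In general, one CLIR query in place of one query per language saves a fraction 1 − 1/k of traffic and fusion work when k languages are searched, which is where the 50% and 80% figures come from.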

The fourth problem with current systems is the lack of resource discovery and result merging. Much research in distributed information retrieval does pay attention to resource discovery, yet most P2P-IR systems use only a very simple approach. Some P2P systems, called unstructured networks, such as Gnutella and Kazaa, typically broadcast search queries to neighbouring peers to locate information. While such approaches are effective for finding highly replicated items, their performance is very poor for finding rare items. In contrast, structured P2P systems, such as BitTorrent networks and eDonkey networks, use central servers as a directory service and use a distributed hash table abstraction to identify remote resources. Such networks are good at finding rare items but tend to introduce higher costs than unstructured systems.
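The contrast between the two lookup styles can be sketched as follows. This is a deliberately simplified illustration of the distributed-hash-table idea, not the protocol of any particular network: the peer identifiers are invented, and real DHTs route in O(log N) hops over a dynamic peer set rather than indexing into a fixed list.

```python
import hashlib

PEERS = ["peer-a", "peer-b", "peer-c", "peer-d"]  # hypothetical peer ids

def responsible_peer(key: str) -> str:
    """Structured lookup: hash the resource key onto the peer list so
    exactly one peer is responsible for it, and a query for that key can
    be routed straight there instead of being broadcast."""
    digest = int(hashlib.sha1(key.encode("utf-8")).hexdigest(), 16)
    return PEERS[digest % len(PEERS)]

owner = responsible_peer("rare-document.pdf")  # one deterministic peer

# An unstructured (flooding) search would instead message every
# reachable peer, so its cost grows with the network size:
flood_messages = len(PEERS)
dht_messages = 1
```

The deterministic mapping is what makes rare items easy to find in structured networks, at the price of maintaining the mapping as peers join and leave.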

The result merging approaches used in most popular P2P systems are very simple because those systems lack IR features: they simply concatenate results, and users must scan the whole list to find the relevant documents. One basic solution is to sort the merged list by document score. In most retrieval models, such as the Boolean model, the probability model and the vector space model, collection statistics (e.g., the size of the collection, inverse document/term frequency, etc.) are used to calculate document scores. The use of collection statistics makes document scores quite different in different collections; even the same document will have different scores in different collections. Therefore, document scores from different clients are not comparable, and sorting the merged list by document score cannot provide good enough results.
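A minimal sketch of normalized-score merging illustrates why raw scores must be rescaled before merging. This is a generic min-max normalization, not the specific fusion method proposed later in the thesis; the document identifiers and scores are invented:

```python
def normalize(scores: dict) -> dict:
    """Min-max normalize one peer's document scores into [0, 1] so that
    scores computed with different collection statistics become comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def merge(result_lists: list) -> list:
    """Merge per-peer result lists into one ranking by normalized score.
    (Assumes document ids are unique across peers, as with private collections.)"""
    merged = {}
    for scores in result_lists:
        merged.update(normalize(scores))
    return sorted(merged, key=merged.get, reverse=True)

peer1 = {"doc-a": 12.0, "doc-b": 3.0}   # raw scores on one peer's scale
peer2 = {"doc-c": 0.9, "doc-d": 0.1}    # a very different scale
ranking = merge([peer1, peer2])         # top documents from both peers now interleave
```

Without the normalization step, every document from peer1 would outrank every document from peer2 purely because of the scale of its scores, which is exactly the incomparability problem described above.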

1.2 Contributions

The principal observation is that it is generally considered too expensive to provide a dedicated centralized information sharing mechanism for large groups of independent users from various countries. Many real world IR systems have documents in multiple languages (e.g. www.Wikipedia.org) but allow only monolingual queries: you type a query in one language and get documents in that same language, and to get documents in other languages you must type queries in those languages. In addition, an information retrieval system for a large-scale secure SIG (Special Interest Group) environment should combine the features of peer to peer information retrieval systems with those of cross-language information retrieval systems: the former solve the communication, security and collection profiling problems, and the latter provide the multilingual functionality.


Firewalls are another serious obstacle for conventional P2PIR systems. Under the current Internet infrastructure, peers behind different firewalls cannot connect to each other directly when port forwarding is unavailable, and for security and management reasons large organizations usually do not allow port forwarding to enable P2P connections. Such policies make P2PIR systems virtually unusable for researchers, as most researchers' computers sit behind corporate-level firewalls.

The objective of this research is to address three major issues in peer to peer systems: communication, CLIR, and collection fusion. The system proposed in this thesis is a workable P2PIR system that supports secure collaborative information retrieval over private collections in multiple languages. It is intended to allow special interest groups to securely and effectively search and share private collections over the Internet, and to support state-of-the-art cross-language information retrieval functionality without the excessive cost of centralized solutions. This is done by exploiting the distributed resources of SIG members, each with their own document collection, storage capacity, processing capacity, and local search engine. Although we used our own search engine in our experiments, any search engine (e.g. Google Desktop or the MS search engine) can be used in the system. The major contribution of this thesis is adding a cross-language IR feature to the peer to peer IR system.

In order to evaluate the proposed system, a P2PIR distributed information retrieval prototype system has been developed to address all the issues discussed above. Not only can it be used over the public domain like popular P2P systems, but it also supports powerful generic text searches while providing the necessary security and multilingual information retrieval features for private group collaboration.

The P2PIR system offers some unique advantages over conventional P2P systems.

• Offline processing: a user can be offline when a query is issued. The query is stored in a mailbox, so when the user's machine goes online the system can act on and respond to the query.

• High level encryption is used at the email application layer without

impeding the operation of the system over the public email

infrastructure.

• Advanced searching functionality.

• Tunnelling through firewalls.

• Server-less operation.

• Complete access control by end users.

• Multilingual search engine.

• Advanced collection profiling and result merging.

To achieve the goal of the P2PIR system, several new approaches are proposed in this thesis. The contributions are summarized as follows:

• An XML based communication protocol with advanced security features.

• Web based query translation, including English–Chinese translation extraction, term extraction and translation disambiguation.

• Web based collection profiling, including Chinese query classification, a new remote system performance measurement, and result fusion based on the new measurement.

1.3 Thesis outline

The rest of this thesis is organized as follows:

Chapter 2: this chapter is a literature review of related technologies and disciplines in the areas of distributed systems and multilingual information retrieval systems. It focuses on the latest work on P2P system architecture, collection selection, collection fusion and query translation.


Chapter 3: this chapter presents a proposed algorithm for Chinese keyword extraction as well as an algorithm for translation selection. Together these two algorithms make up the web based query translation approach. To make the algorithms easier to follow, this chapter also provides a worked example of translating English terms into Chinese using the proposed approach.

Chapter 5: a web based collection profiling strategy is introduced in this chapter.

Some new technologies are also proposed to support the strategy including a web

based Chinese query term classification method, a remote system performance

measurement, a collection selection strategy, and a collection fusion strategy.

Chapter 7: this chapter demonstrates a peer to peer system that implements all the proposed algorithms and strategies introduced in this thesis. An email based peer to peer communication protocol is presented in this chapter as well.

Chapters 4, 6 and 8 are evaluations of all the proposed approaches. Detailed analyses and comparison results are also included in those chapters.


Chapter 2

Literature review

The purpose of this literature review is to analyse existing information retrieval

techniques for distributed systems, peer to peer systems and multilingual

systems. In particular, this review is focused on the problems described below:

• Peer to peer information retrieval system architecture: how to manage

remote peers, including how to discover remote peers and how peers communicate with each other.

• Resource Description: how to learn and describe the topics covered by

different data collections.

• Resource Selection: given an information need (a query) and a set of

resource descriptions, how to select a set of resources (text databases

or collections) to search.

• Query Translation: given an information need in some base representation, how to map it automatically to representations (query languages and vocabularies) that are appropriate for the selected databases.


• Result Merging: after result lists have been returned from the selected

databases, how to integrate them into a single ranked list.

2.1 Evaluation of Peer to Peer Information Retrieval System

An information system is a system that can deliver the information in a collection (documents) that is relevant to a given search criterion (query). Evaluating such a system means investigating how well it works. According to the objective of the investigation, the evaluation of IR systems is usually divided into six levels (Saracevic 1995):

1. Engineering level: the objective of this level is to inspect the performance

of hardware and software, such as reliability, speed, flexibility, etc.

2. Input level: the objective of this level is to examine the coverage of the designated area.

3. Processing level: the objective of this level is to examine the performance of the algorithms, techniques and approaches, such as computational effectiveness, efficiency of the algorithms, etc.

4. Output level: the objective of this level is to investigate the effectiveness

of outputs, e.g. the quality of results.


5. User level: the objective of this level is to assess the satisfaction of the user.

6. Social level: this level studies the social impact of the system.

Nowadays, most researchers in the IR community evaluate at the output level to measure the quality of IR systems. Recall and precision are the two major properties that have been accepted as measurements of search effectiveness (Singhal 2001).

Precision is the fraction of the retrieved documents that are relevant to the given query. It is given by:

precision = |{relevant documents} ∩ {retrieved documents}| / |{retrieved documents}|

Usually precision considers all documents retrieved. It is also possible to evaluate

by a cut-off number of documents, represented as P@n. For example, P@100

represents the precision at the top 100 documents.

Recall is the fraction of the documents relevant to the query that are successfully retrieved. It is given by:

recall = |{relevant documents} ∩ {retrieved documents}| / |{relevant documents}|


An ideal IR system should retrieve as many relevant documents as possible (high recall) and as few non-relevant documents as possible (high precision). Unfortunately, these two goals have proven to be in conflict over years of research: techniques that improve precision tend to hurt recall, and vice versa. In recent years, researchers have favoured average precision to measure the effectiveness of IR systems. Average precision combines information from both precision and recall: it is computed by measuring precision at different recall points (e.g. 10%, 20%, etc.) and averaging (Salton and McGill 1996).
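The measures above can be sketched as follows. The toy ranked list and relevance judgements are illustrative only, and average precision is computed in the common non-interpolated form (averaging precision at the rank of each relevant document) rather than at fixed recall points:

```python
# Toy evaluation of a ranked result list against binary relevance judgements.

def precision(retrieved, relevant):
    """Fraction of the retrieved documents that are relevant."""
    if not retrieved:
        return 0.0
    return len(set(retrieved) & set(relevant)) / len(retrieved)

def recall(retrieved, relevant):
    """Fraction of the relevant documents that were retrieved."""
    if not relevant:
        return 0.0
    return len(set(retrieved) & set(relevant)) / len(relevant)

def precision_at(n, ranked, relevant):
    """P@n: precision over the top n documents of a ranked list."""
    return precision(ranked[:n], relevant)

def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant doc appears."""
    hits, total = 0, 0.0
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant) if relevant else 0.0

ranked = ["d1", "d7", "d3", "d9", "d4"]
relevant = {"d1", "d3", "d4", "d8"}
```

For this toy list, precision is 3/5, recall is 3/4, and P@2 is 1/2.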

Evaluations of peer to peer systems mostly focus on the engineering level (Ehrig, Schmitz et al. 2004). The common measurements used are:

• Reliability: this measures the degree of failure tolerated when the system breaks down, or the percentage of actual breakdowns of the system.

• Response time: this measures the time from when a query is sent until results are received.

• Network load: this figure shows the network traffic of the system.

As peer to peer systems are designed for different needs, there is no standard measurement for all of them. For example, for IM (instant messaging) systems response time is the most critical measurement, while reliability is more important in file sharing systems.

2.2 Peer to peer information retrieval system architecture

According to the relationship between peers, peer to peer information retrieval

systems can be divided into cooperative systems and uncooperative systems. In a

cooperative environment, various kinds of information such as resource

description, collection index and collection statistics, etc., are usually held in a

central place. Peers can use such information to help their search. In an

uncooperative environment, each peer is independent and knows nothing about

others. Peers can answer queries and return documents, but they do not provide

other information such as collection statistics, collection description or retrieval

model.

According to the network structure, peer to peer systems can be categorized as centralized peer to peer networks, structured peer to peer networks and unstructured networks (Lv, Cao et al. 2002). Structured and unstructured networks are both decentralized architectures.


2.2.1 Centralized architecture

Centralized peer to peer architecture is a mix of traditional client-server architecture and pure peer to peer architecture. In this architecture, some nodes work as servers. Unlike client-server architecture, the servers in a centralized peer to peer architecture only provide a directory service to other nodes and do not hold any other information resources; all information resources are located on the other peers. Centralized architecture faces the same problems as client-server architecture: single point of failure and scalability are the main issues. Clearly one directory server can only handle a certain volume of requests from peers, and response time increases as peers submit more requests. A server failure stops the whole network from working. An easy solution is to add more servers to the network. Centralized systems have become more practical in the real world, and most real world peer to peer systems, such as the BitTorrent and eDonkey networks, are based on this architecture. One reason is its similarity to the existing client-server architecture, which makes migration from existing systems easy. Another benefit is that a centralized architecture is easy to manage. In addition, it can reduce broadcasting and improve network usage, so it provides higher scalability. Security might be another important reason. "Identity management, authentication and authorization cannot be done in a global scale, they have to be domain or realm based one way or another. (Sankar 2003)"

2.2.2 Unstructured architecture

Under the unstructured architecture, there is no directory server in the network; all peers in the system are equal. Each can issue requests, respond to requests from others, or route requests to other nodes. One of the best known peer to peer systems with this architecture is Gnutella (http://www.gnutella.com/). Peers typically flood (broadcast) search queries by forwarding them to their neighbours to locate information. Flooding-based approaches are effective for finding popular items, but their performance is quite poor for rare items. In addition, in a badly designed system a large number of peers can easily flood the entire network.

Improving the scalability of unstructured architecture is the major topic for

researchers. Several techniques were developed to reduce the number of peers

that are visited while broadcasting queries. The architecture proposed by

Zeinalipour-Yazti (Zeinalipour-Yazti, Kalogeraki et al. 2003) tried to reduce the

time to find the clients that contain the documents. The key idea is that each

client keeps a log for the past queries to propagate the query messages to only a

subset of other clients. At the beginning, a query will be sent to all other clients.


The system will automatically record the query and the clients that returned "good" results. After the system has run for some time, it builds up a record of which client returns "good" results for which kind of query. The system then sends the query only to those clients, reducing network traffic. In other words, metadata for collection selection is gathered at run time, whereas the traditional method gathers it before the system runs.
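The run-time routing idea can be sketched as follows. The class name, the use of shared query terms as the similarity criterion, and the fall-back to broadcasting are illustrative assumptions, not details from the cited work:

```python
# Sketch of query-log based routing: a client remembers which peers returned
# "good" results for which query terms, and later routes similar queries only
# to those peers, broadcasting only when it knows nothing relevant.

from collections import defaultdict

class QueryRouter:
    def __init__(self, peers):
        self.peers = list(peers)
        self.log = defaultdict(set)   # term -> peers that answered it well

    def record(self, query_terms, peer):
        """Remember that `peer` returned good results for these query terms."""
        for term in query_terms:
            self.log[term].add(peer)

    def route(self, query_terms):
        """Route to peers seen for any of the terms; broadcast if none known."""
        known = set()
        for term in query_terms:
            known |= self.log.get(term, set())
        return sorted(known) if known else list(self.peers)

router = QueryRouter(["peerA", "peerB", "peerC"])
router.record({"music", "jazz"}, "peerB")
```

A later query containing "jazz" would thus go only to peerB, while a query on an unseen topic still reaches all peers.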

2.2.3 Structured architecture

Structured architectures were proposed (Callan, Lu et al. 1995; Bawa, Manku et al. 2003; Ng, Ooi et al. 2003; Tang, Xu et al. 2003) to solve the fundamental limitation of unstructured architectures. Under this architecture, peers are grouped or clustered. The peer to peer topology is tightly controlled, and files are placed not at random nodes but at specified locations that make subsequent queries easier to satisfy (Lv, Cao et al. 2002). For example, a distributed hash table (DHT) abstraction can be used to answer queries efficiently. The only question is how to organize the peers.
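A minimal sketch of the DHT idea follows. The small 16-bit identifier ring and hash-derived peer identifiers are illustrative simplifications of real DHTs such as Chord or CAN:

```python
# DHT-style placement sketch: each key is hashed into an identifier space and
# assigned to the peer whose id is its closest successor on the ring, so any
# node can compute where a key lives without flooding the network.

import hashlib
from bisect import bisect_left

RING = 2 ** 16   # tiny identifier space, for illustration only

def node_id(name):
    return int(hashlib.sha1(name.encode()).hexdigest(), 16) % RING

class SimpleDHT:
    def __init__(self, peers):
        # Sorted (id, peer) pairs form the logical ring.
        self.ring = sorted((node_id(p), p) for p in peers)

    def locate(self, key):
        """Return the peer responsible for `key` (its successor on the ring)."""
        k = node_id(key)
        ids = [i for i, _ in self.ring]
        pos = bisect_left(ids, k) % len(self.ring)
        return self.ring[pos][1]

dht = SimpleDHT(["peer1", "peer2", "peer3", "peer4"])
```

Because placement is a pure function of the key, lookups are deterministic: every node computes the same owner for the same key.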

In (Bawa, Manku et al. 2003; Tang, Xu et al. 2003), this problem is directly addressed by reassigning documents to peers so that semantically related documents are clustered. The peers are organized in Content Addressable Networks (CANs), which are used to partition a logical document space into zones. Using Latent Semantic Indexing (LSI) to generate semantic vectors as keys, documents are then mapped into the DHT. With this approach, documents relevant to a given query tend to cluster on a small number of neighbours, as shown in figure 2-1.

Figure 2–1 Search in a semantic space

In (Bawa, Manku et al. 2003), the authors created a dynamic peer to peer network, while in (Tang, Xu et al. 2003) the peer to peer network is static. When a new node joins the network, it obtains segment information from its neighbours and joins the appropriate segment.

2.3 Firewall traversal

As discussed in section 1.2, firewall traversal is one of the major issues in peer to peer systems. Some approaches used in real world products (e.g. Skype) can provide firewall traversal, such as STUN (Rosenberg, Weinberger et al. 2003) and TURN (Rosenberg, Mahy et al. 2005). Such protocols require a server on the public internet. Before two peers connect directly to each other, each peer connects to the server first. The server analyses the network packets to discover the IP address and port of each client, after which the two peers can make a direct connection based on those addresses and ports. The authors acknowledge that providing such a server results in high costs, and these protocols only provide one to one connections. Therefore, they are not suitable for unstructured peer to peer environments.
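The discovery step of such rendezvous protocols can be illustrated with a toy simulation. The class, peer names and addresses below are invented for illustration, and real STUN/TURN involves considerably more (wire formats, NAT type handling, keep-alives):

```python
# Toy simulation of the rendezvous idea: a public server records the
# (public_ip, port) pair it *observes* for each peer's packets, which may
# differ from the address the peer believes it has due to NAT. Peers then
# exchange these observed endpoints and attempt a direct connection.

class RendezvousServer:
    def __init__(self):
        self.registered = {}   # peer name -> observed public endpoint

    def binding_request(self, name, observed_endpoint):
        """Record the source address the server observed and echo it back."""
        self.registered[name] = observed_endpoint
        return observed_endpoint

    def lookup(self, name):
        """Let one peer discover another peer's public endpoint."""
        return self.registered.get(name)

server = RendezvousServer()
# Peer A is privately 192.168.0.5, but appears publicly as 203.0.113.7:40001.
a_public = server.binding_request("peerA", ("203.0.113.7", 40001))
b_public = server.binding_request("peerB", ("198.51.100.2", 55123))
# Each peer can now look up the other's public endpoint and try to
# connect to it directly, bypassing the server for the data path.
```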

2.4 Distributed computing and Peer to peer system

“Distributed computing system combines multiple computers connected by a network to solve a single problem(Brown 1999)”. Some studies (Brown 1999; Sun and Chen 2001) of distributed computing systems come from the study of parallel computing systems, because a distributed computing system can be treated as a parallel computing system with a relatively slow communication channel between processors. Each computer in a distributed computing system can be treated as a processor in a parallel computing system, but with more freedom. Moreover, each computer in a distributed computing system may have a different CPU, a different OS and different communication software. The only requirement is that all the computers use the same communication protocol.


“Peer to peer systems are distributed systems in which nodes of equal roles and capabilities exchange information and services directly with each other(Beverly and Hector 2001)”. The main benefits of P2P systems include improved scalability by avoiding dependency on centralized points, flexible information sharing between peers, and low cost (Sankar 2003). According to de Kretser's study (de Kretser, Moffat et al. 1998), distributed computing systems built on a set of low-end PCs can provide performance similar to high-end servers. Peer to peer data systems with high speed connections can therefore be used as an alternative to database clustering.

As we can see, peer to peer systems can be treated as a sub-category of distributed systems, and they share the same difficulties in distributed processing. However, a peer to peer system has its own issues (Bawa, Manku et al. 2003; Ng, Ooi et al. 2003). First, peer to peer systems are usually dynamic while distributed systems are typically static: in a peer to peer environment a node can dynamically join and leave the system, whereas in distributed systems the components are fixed. Second, in a peer to peer environment each node is equal and performs the same task, while in distributed systems each component usually performs a different task. Third, in distributed systems global information is available in a central place, but in peer to peer systems global information is not always available. Therefore, some techniques used in distributed systems do not fit peer to peer systems.


2.5 Collection selection and fusion

Collection selection is defined as the process of determining which of the distributed document collections are most likely to contain relevant documents for the current query and therefore should receive the query for processing (Brown 1999). Collection fusion (or merging) refers to integrating the results from each individual distributed client. The final merged result list should include as many relevant documents as possible, and the relevant documents should have higher ranks (Powell and French 2003).

Figure 2–2 The process of collection selection


The process of collection selection and collection fusion is shown in figure 2-2. A query q comes to the collection selection model, which ranks the entire set of collections based on the similarity between each resource and the query. The top ranked collections are selected and the query is routed to and run on them, e.g. collections c1 and c3 in the figure. The results are then merged into a single list. The aim of collection selection is to use resources, including bandwidth and processing power, efficiently. Collection selection not only decreases the time taken to send out a query and return results, but also decreases the time needed to merge results from different collections. Collection selection was first used to retrieve from multiple collections in a centralized environment, e.g. multiple collections stored on a single machine or on different machines with a high speed connection. With the advent of distributed information systems, it is now used in peer to peer systems to reduce network traffic.
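The selection step can be sketched as follows. The overlap-based similarity function and the toy resource descriptions are illustrative assumptions, not the ranking functions discussed later in this section:

```python
# Sketch of the selection flow in figure 2-2: score each collection's
# resource description against the query, then route the query only to the
# top-k collections.

def collection_score(query_terms, description_terms):
    """Toy similarity: number of query terms found in the description."""
    return len(set(query_terms) & set(description_terms))

def select_collections(query_terms, descriptions, k=2):
    """Rank collections by similarity to the query and keep the top k."""
    ranked = sorted(descriptions,
                    key=lambda c: collection_score(query_terms, descriptions[c]),
                    reverse=True)
    return ranked[:k]

descriptions = {
    "c1": {"retrieval", "ranking", "index"},
    "c2": {"genome", "protein"},
    "c3": {"retrieval", "translation"},
}
selected = select_collections({"retrieval", "ranking"}, descriptions)
```

Here the query is routed to c1 and c3 only; c2, whose description shares nothing with the query, never sees it.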

Merging results into a single result list is a difficult task because document scores returned by distributed collections are usually not comparable. In most retrieval models, such as the Boolean model, the probability model and the vector space model, collection statistics (e.g. size of the collection, inverse document/term frequency, etc.) are used to calculate document scores. The use of collection statistics makes the document scores calculated by different search engines incompatible: even the same document may receive different scores when calculated in different collections. Therefore, the document scores are incomparable and merging results from different collections becomes a very complex task.

2.5.1 Collection rank

From the section above we can see that collection ranking has a significant impact on collection selection. In a cooperative environment, a global index is the most common technique for collection ranking. In the global index architecture, there is usually a directory server that holds all the information about the distributed collections. Peers can get global collection statistics via the directory server and then merge the distributed results together. With the help of global statistics, the whole peer to peer network can be treated as a single centralized collection. Queries can be run on the local machine or on the central server without sending them to the remote peers; the remote peers only need to provide the documents requested. STARTS (Gravano, Chang et al. 1997) is one of the best known protocols for peer to peer communication in a cooperative environment. Clients can exchange their collection statistics via the STARTS protocol. The global index architecture can achieve nearly 100% of centralized information retrieval performance, because a client has all the information needed to calculate a document score as if the documents were in a centralized place. This architecture requires deep collaboration between clients and fits well when all the clients are happy to share their entire collections. However, it is not practical in real world large scale distributed networks because not all clients want to share their collection information.

GlOSS (Gravano and García-Molina 1995), CORI (Callan, Lu et al. 1995) and CVV (Yuwono and Lee 1997) are three of the best known collection ranking approaches that require far less cooperation between peers. All three approaches are based on calculating the similarity of a collection to a query and require resource descriptions (see 2.5.2 for details) before the calculation process.

In GlOSS, the similarity is calculated from the number of documents in the collection that are relevant to the query. This algorithm works well with large collections of heterogeneous data. Gravano and García-Molina (Gravano and García-Molina 1995) also suggest a variation of GlOSS known as gGlOSS, which is based on the vector space model; the similarity is calculated from the vector sum of the relevant documents instead of their number.

CORI is based on a probabilistic inference network originally used for document selection; Callan (Callan, Lu et al. 1995) introduced this algorithm for collection selection. It uses document frequency (df, the number of documents in a collection containing the query term) and inverse collection frequency (icf, derived from the number of collections containing the query term) to calculate collection ranks. One advantage of CORI is that its resource descriptions require only about 0.4% of the space of the original collection.
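A simplified sketch of CORI-style scoring is given below. The constants follow commonly cited CORI defaults, but the full inference network details are omitted and the toy collections are invented for illustration:

```python
# Simplified CORI-style collection ranking: a collection's belief for a query
# term combines a df component (how many of its documents contain the term)
# with an icf component (how rare the term is across all collections).

import math

def cori_score(query_terms, coll, all_colls, b=0.4):
    """Average per-term belief b + (1-b)*T*I for one collection (simplified)."""
    C = len(all_colls)
    score = 0.0
    for t in query_terms:
        df = coll.get(t, 0)
        cf = sum(1 for c in all_colls if c.get(t, 0) > 0)  # colls containing t
        T = df / (df + 50.0)                               # simplified df part
        I = math.log((C + 0.5) / max(cf, 1)) / math.log(C + 1.0)  # icf part
        score += b + (1.0 - b) * T * I
    return score / len(query_terms)

# Toy collections as term -> document frequency maps.
colls = [
    {"retrieval": 120, "music": 2},   # large IR-focused collection
    {"music": 300},                   # no "retrieval" documents at all
    {"retrieval": 5},                 # small IR-related collection
]
scores = [cori_score({"retrieval"}, c, colls) for c in colls]
```

As expected, the collection richest in "retrieval" documents ranks first and the one with none ranks last.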


The CVV ranking algorithm uses a combination of document frequency and cue validity variance information (Yuwono and Lee 1997). The cue validity variance characterizes the variability of the fraction of documents in a database that contain a specific word.

2.5.2 Getting resource description

Although GlOSS, CORI and CVV do not require deep cooperation between peers, they still need an accurate representation of the remote peers (a so called resource description). In practice, such information is unlikely to be available, especially in a peer to peer environment. Automatic generation of resource descriptions has been studied since the early 1980s (Callan and Connell 2001). Query-based sampling (Hawking and Thistlewaite 1999; Callan and Connell 2001; Si and Callan 2003) is the most popular technique for discovering resource descriptions in uncooperative environments. The basic idea is to send queries to remote peers and examine the term frequencies in the returned documents.

D. Hawking's approach requires specially designed queries in the sampling process and incurs significant communication costs in wide area networks. J. Callan (Callan and Connell 2001) proposed a more flexible approach that does not require specially designed queries and has a reasonable communication cost. It has been shown that running 300 queries in the process provides a good enough resource description (Callan and Connell 2001; Rasolofo, Abbaci et al. 2001).
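The sampling loop can be sketched as follows. The probe selection rule, the fake remote search interface and the tiny corpus are illustrative assumptions:

```python
# Sketch of query-based sampling: repeatedly send single-term probe queries
# to an uncooperative peer, collect the returned documents, and build a
# term-frequency profile (the resource description) from their text. New
# terms found in returned documents feed back into the probe vocabulary.

from collections import Counter
import random

def query_based_sampling(search, seed_terms, n_queries=300, docs_per_query=4):
    rng = random.Random(0)            # fixed seed so the sketch is repeatable
    description = Counter()
    terms = list(seed_terms)
    for _ in range(n_queries):
        probe = rng.choice(terms)
        for doc in search(probe)[:docs_per_query]:
            description.update(doc.split())
            terms.extend(doc.split())  # grow the probe vocabulary
    return description

# A fake remote collection standing in for an uncooperative peer.
corpus = ["jazz music history", "jazz piano", "music theory", "piano lessons"]
def search(term):
    return [d for d in corpus if term in d.split()]

profile = query_based_sampling(search, ["music"], n_queries=20)
```

Starting from a single seed term, the profile quickly picks up terms (such as "jazz") that were never probed directly, which is exactly why the technique works without cooperation.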

Si and Callan extended their study in 2003 (Si and Callan 2003). They found that most of the older resource selection approaches do not work well when the remote collections are a mix of small and very large databases. They then proposed an approach to acquire database size estimates in uncooperative environments as an extension of the query-based sampling used to acquire resource descriptions. Based on the assumption that the documents sampled from a database are a good representation of the whole database, an estimate of the remote collection size can easily be calculated. Combined with the standard query-based sampling technique, relevant document distributions can then be estimated.

Instead of sampling documents from the collections directly, some researchers (Ipeirotis, Gravano et al. 2001) suggested classifying remote collections to help collection selection. To classify a database, their algorithm does not retrieve or inspect any documents or pages from the database, but rather exploits the number of matches that each query probe generates at the database in question. In their later research (Ipeirotis and Gravano 2001), they use this collection classification algorithm to support collection selection: remote collections are only selected when they are in the same topic category as the input query.

Relevance feedback (CHOI and YOO 2001; Liu, Yu et al. 2002) is another popular approach for discovering resource descriptions in uncooperative environments. The basic idea is to acquire IR knowledge from relevance feedback so the system can discover the text databases that contain documents relevant to the user's interests.

Choi's approach is a typical relevance feedback system. In the training process, humans are asked to judge the usefulness of documents returned from remote systems. The authors note that the major problem of such a system is that re-training is required when new remote systems are added. One possible solution is to use a hierarchical multi-agent structure to break the knowledge of remote systems into small hierarchical parts; as a result, when a new system is added, only a small part of the system needs re-training.

In Liu's (Liu, Yu et al. 2002) approach, the feedback is given not by a human but by the machine. The quality of a remote system is measured by two quantities: the number of documents returned by the remote system and the average similarity of the returned documents.


2.5.3 Collection fusion approaches

Collection fusion is the last step in the peer to peer information retrieval process; it produces a single list of document results. In a cooperative environment, a global index is the most common technique for result merging. Since the merging problem stems from the lack of collection statistics and retrieval model information, if the distributed collections can be treated as a logically centralized collection whose documents merely happen to be physically distributed, there is no merging problem.

The global index is not available in uncooperative environments. An alternative approach for recalculating document scores in uncooperative environments is to download either the entire remote document set (Kirsch 1997) or part of it (Craswell, Hawking et al. 1999). Obviously the approach in (Kirsch 1997) significantly increases network traffic, because every search query requires downloading an entire collection. The approach in (Craswell, Hawking et al. 1999) reduces the traffic to 10% because it only downloads the top 10% of the documents, but this is still a large amount of network traffic if the system runs for a long time.

Normalized score merging is a solution to collection fusion in real world large scale uncooperative distributed information retrieval (Liu, Yu et al. 2001; Rasolofo, Abbaci et al. 2001; Si and Callan 2002), and is considered the most accurate solution (Viles and French 1995). When merging documents returned from peer collections, the client normalizes the document scores based on the ranks of their corresponding collections: a document score is increased if it comes from an important collection and decreased if it comes from a less important one.
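A minimal sketch of this normalization, assuming a simple linear weighting by collection importance (the weighting scheme and toy scores are illustrative, not a specific published formula):

```python
# Normalized score merging sketch: each returned document score is scaled by
# the weight of the collection it came from, so documents from highly ranked
# collections are promoted in the merged list.

def merge_normalized(results_by_collection, collection_weights, top_n=5):
    merged = []
    for coll, results in results_by_collection.items():
        w = collection_weights[coll]
        for doc, score in results:
            merged.append((doc, score * w))   # promote or demote by collection
    merged.sort(key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in merged[:top_n]]

results = {
    "c1": [("d1", 0.9), ("d2", 0.5)],
    "c2": [("d3", 0.8), ("d4", 0.7)],
}
weights = {"c1": 1.0, "c2": 0.5}   # c1 was judged the more important collection
merged = merge_normalized(results, weights, top_n=4)
```

Note how d2 (raw score 0.5 from the important collection) outranks d3 (raw score 0.8 from the weaker one) after normalization.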

Some researchers (Liu, Yu et al. 2001) use a sampling technique to normalize document scores. It is based on the observation that, for many combinations of global and local term-weighting formulas, documents that have relatively large local weights for a term tend to also have relatively large global weights for that term. In the training stage, they first randomly select some queries and retrieve document scores from the remote collections; from the local and remote document scores they estimate the ratio between the two. In the merging stage, they then normalize the document scores based on this ratio.

The merging strategies based on CORI and GlOSS linearly combine the document scores returned from peer collections with the collection ranks to determine the resulting document ranks. This still requires that the peer collections use the same indexing and the same retrieval model so that the document scores can be normalized. In today's p2p network environments, it is impossible to require peers to use the same software to manage their data; for example, in the popular BitTorrent and eDonkey networks, people use hundreds of different client software systems to share their files. It is reasonable to assume that the trend in p2p networks is towards varied client software under the same communication protocol. Therefore, the document scores returned from peers may be based on different retrieval models, and the scores cannot be normalized because they are not comparable.

The round robin merging strategy and its variations (Steidinger 2000) were proposed to deal with the case where document scores are not comparable. Round robin merging interleaves the result lists returned from remote peers: in turn, the first document of each result list is removed and placed in the final merged result list. It has been shown to be a simple and efficient strategy for distributed collections when the remote peers have similar statistics and retrieval performance (Savoy, Calvé et al. 1998). However, round robin merging fails significantly when the distributed collections have quite different collection statistics, for example when they focus on different domains.
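The interleaving itself is simple enough to state directly in code (the toy result lists are illustrative):

```python
# Round robin merging sketch: repeatedly take the head of each peer's result
# list in turn, skipping exhausted lists, so no document scores are needed.

def round_robin_merge(result_lists):
    merged, i = [], 0
    lists = [list(r) for r in result_lists]   # copy so inputs are untouched
    while any(lists):
        current = lists[i % len(lists)]
        if current:
            merged.append(current.pop(0))
        i += 1
    return merged

merged = round_robin_merge([["a1", "a2"], ["b1"], ["c1", "c2", "c3"]])
```

Every peer's top document appears before any peer's second document, which is exactly why the strategy degrades when one collection is far better than the others.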


In 1995, some researchers (Towell, Voorhees et al. 1995) suggested relevance feedback and query clustering as merging strategies that do not require document scores. Query clustering learns a measure of the quality of the search for a particular topic area on each collection; the number of documents retrieved from a collection for a new query is proportional to the value of the quality measure for that query. Query clustering uses query vectors to represent the queries, and topic areas are represented as centroids of query clusters. The collection weight is the average number of relevant documents retrieved in the past. In the fusion process, each collection's percentage of the total selected collection weight is calculated first, and the number of documents selected from each collection is then calculated from that percentage. For example, suppose four collections are used and the total number of documents required is 100, with collection weights of 4 for collection A, 3 for collection B, 2 for collection C and 1 for collection D. Then the top 40 documents should be selected from collection A, 30 from collection B, 20 from collection C and 10 from collection D. However, the authors did not mention how to sort the documents into a single result list.
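The weight-proportional allocation in such an example can be computed as follows (a direct restatement of the arithmetic, with hypothetical weights of 4, 3, 2 and 1):

```python
# Weight-proportional allocation sketch: the number of documents taken from
# each collection is proportional to its learned quality weight.

def allocate(weights, total_docs):
    total_w = sum(weights.values())
    return {c: int(total_docs * w / total_w) for c, w in weights.items()}

weights = {"A": 4, "B": 3, "C": 2, "D": 1}   # total weight = 10
allocation = allocate(weights, 100)
```

As the text notes, this fixes how many documents come from each collection but says nothing about how to order them in the final list.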

In summary, most results merging approaches require collection information, called a resource description, such as collection content and statistics.

When processing a query, the collections will be assigned ranks based on


similarity between the query and the resource descriptions. Then in the merging

stage, the document scores will be adjusted according to the collection ranks.

2.6 Translation

Dictionary based query translation is one of the conventional approaches in CLIR.

The dictionary based translation has been adopted in cross-language information

retrieval because bilingual dictionaries are widely available, dictionary based

approaches are easy to implement, and the efficiency of word translation with a

dictionary is high. On the other hand, because of the vocabulary limitation of

dictionaries, very often the translations of some words in a query cannot be found

in a dictionary. This problem is called the Out of Vocabulary (OOV) problem. The

appearance of the OOV terms is one of the main difficulties that arise with this

approach. In the very early years, an OOV term would not be translated at all,

leaving the original term in the translated query. However, very often the OOV

terms are proper names or newly created words. Even using the best dictionary,

the OOV problem is unavoidable. As input queries are usually short queries, query

expansion does not provide enough information to help recover the missing

words. Furthermore, in many cases it is exactly the OOV terms that are crucial

words in a query. For example, a query “SARS, CHINA” may be entered by a user

in order to find information about SARS in China. As everyone knows, SARS is a


new term created a few years ago and may not be included in a dictionary that was published a long time ago. If the word SARS is left out of the translated query, it is most likely that the user will be practically unable to find any relevant documents at all. As a result, the performance of a multilingual search engine will be significantly reduced if the OOV terms are not translated.

Another problem with the dictionary based translation approach is the translation

disambiguation problem. The problem is more serious for a language which does

not have word boundaries such as Chinese. Translation disambiguation refers to

finding the most appropriate translation from several choices in the dictionary.

For example, the English word STRING has over 20 different translations in

Chinese, according to the Kingsoft online dictionary (www.kingsoft.com). One

approach is to select the most likely translation [6], usually the first one offered by a dictionary. But even if the choices are ordered based on some criteria and the most likely a priori translation is picked, in general such an approach has a less than optimal probability of success. Another solution is to use all possible translations in the query with the OR operator. However, while this approach is likely to include the correct translation, it also introduces noise into the query. This can lead to the retrieval of many irrelevant documents, which is of course undesirable. Researchers (Jang, Myaeng et al. 1999; Gao, Nie et al. 2001) reported


that with this approach the precision is 50% lower than the precision that is

obtained by human translation.

In this section, several existing translation related approaches are reviewed

below.

2.6.1 Transliteration

Proper names, such as personal names and place names, are two of the major sources of OOV terms because many dictionaries do not include such terms. It is common that foreign names are translated word by word based on their pronunciations, so that the pronunciation of a name in one language and the pronunciation of its translation in another language sound similar; this is transliteration. Such a translation is usually made when a new proper name term is introduced from one language to another.

Some researchers (Paola and Sanjeev 2003; Yan, Gregory et al. 2003) applied the rule of transliteration to automatically translate proper names. Basically, transliteration first converts words in one language into phonetic symbols, and then converts the phonetic symbols into the other language. Some researchers found that transliteration is quite useful in proper name translation (Paola and Sanjeev 2003; Yan, Gregory et al. 2003). However, transliteration is useful only


with a few language pairs. When dealing with the language pairs for which there

are many phonemes in one language that are not present in the other one, such

as Chinese and English, the problem is exacerbated. There are even more

problems when translating English to Chinese. Firstly, as there is no standard for

name translation in Chinese, different communities may translate a name in different ways. For example, “Disney” is translated as “迪斯尼” in mainland China but as “迪士尼” in Taiwan. Both translations are pronounced similarly in Chinese but use different Chinese characters. Even a human interpreter will have trouble determining which characters should be used. Secondly, sometimes the Chinese translation only uses part of the phonemes of the English name. For example, the translation of “America” is “美国”, which only uses the second syllable of “America”. Finally, the translation of a name is not limited to transliteration but may also use transcription, and sometimes the translation of a proper name even uses a mixed form of transcription and transliteration. For example, the translation of “New Zealand” in mainland China is “新西兰”: “新” is the transcription of “New” and “西兰” is the transliteration of “Zealand”.


2.6.2 Parallel Text Mining

Parallel text is a text in one language together with its translation in another

language. The typical way to use parallel texts is to generate translation

equivalence automatically, without looking up a dictionary. It has been used in

several studies (Eijk 1993; Kupiec 1993; Smadja, McKeown et al. 1996; Nie, Simard

et al. 1999) on multilingual related tasks such as machine translation or CLIR.

The idea of parallel text mining is straightforward. Since parallel texts are texts in

two languages, it should be possible to identify corresponding sentences in two

languages. When the corresponding sentences have been correctly identified, it is

possible to learn the translation of each term in the sentences using statistical

information. It is straightforward that a term's translation will always appear in the corresponding sentences. Therefore, an OOV term can be translated by mining parallel corpora. Many researchers have also reported that parallel text mining based translation can significantly improve CLIR performance (Eijk 1993; Kupiec 1993; Smadja, McKeown et al. 1996; Nie, Simard et al. 1999).

In the very early stage, parallel text based translation approaches worked on a word-by-word basis and only domain specific noun terms were translated. In general, those approaches (Eijk 1993; Kupiec 1993) first align the sentences in each corpus. Then noun phrases are identified by a part-of-speech tagger. Finally,


noun terms are mapped by using simple frequency calculation. In such translation

models, phrases, especially verb phrases, are very hard to translate. As phrases in

one language may have different word order in another language, they cannot be

translated on word-by-word basis. This problem in parallel based translation is

called the collocation problem.

Some later approaches(Smadja, McKeown et al. 1996; Nie, Simard et al. 1999)

started to use more complex strategies such as statistical association

measurement or probabilistic translation to solve the collocation problem. Smadja

et al. (Smadja, McKeown et al. 1996) proposed an approach that can translate

word pairs and phrases. In particular, they used a statistical association measure

of the Dice coefficient to deal with the problem of collocation translation. Nie et

al. (Nie, Simard et al. 1999) proposed an approach based on a probabilistic model

that demonstrates another way to solve the collocation problem. By using parallel

texts, their translation model can return p(t|S) which is the probability of having

the term t of the target language in the translation of the source sentence S.

Because the probability model does not consider the order and position of words, the collocation problem is overcome in their approach.

Some of the advantages of the parallel text based approaches include the very high accuracy of translation without a bilingual dictionary, and the extraction of multiple translations with equivalent meaning that can be used for query expansion.

However, the source of parallel corpora tends to be limited in some particular

domain and language pairs. Currently large scale parallel corpora are available

only in forms of government proceedings, e.g. Canadian parliamentary

proceedings in English and French, or Hong Kong government proceedings in

Chinese and English. Obviously, such corpora are not suitable for translating newly

created terms or domain specific terms that are outside of the corpora domain. As

a result, the current studies of parallel text based translation are focusing on

constructing large scale parallel corpora in various domains from the web.

2.6.3 Web mining for query translation

Web mining for automated translation is based on the observation that there are

a large number of web pages on the Internet that contain parallel text in several

languages. Investigation has found that, when a new English term, such as a technical term or a proper name, is introduced into Chinese, the Chinese

translation of this term and the original English term very often appear together in

literature publications in an attempt to avoid misunderstanding.

Some studies(Yan, Gregory et al. 2003; Lu, Chein et al. 2004) have already

addressed the problem of extracting useful information from the Internet by using

Web search engines such as Google and Yahoo. Popular search engines allow us to


search English terms for pages in a certain language, e.g., Chinese or Japanese.

The search results of web search engines are normally a long ordered list of

document titles and summaries to help users locate information. Mining the result

lists is necessary to help find translations to the unknown query terms. Some

studies (Cheng, Teng et al. 2004; Zhang and Vines 2004) have shown that such

approaches are rather effective for proper name translation.

In general, web based translation extraction approaches consist of three steps:

• Web document retrieval: use a web search engine to find the

documents in target language that contain the OOV term in original

language and collect the text (i.e. the summaries) in the result pages

returned from the web search engine.

• Term extraction: extract the meaningful terms in the summaries where

the OOV term appears and record the terms and their frequency in the

summaries. As a term in one language could be translated to a phrase

or even a sentence, the major difficulty in term extraction is how to

extract correct MLUs from summaries (refer to Section 2.7 for the

definition of MLUs).

• Translation selection: select the appropriate translation from the

extracted words. As the previous steps may produce a long list of


terms, translation selection has to find the correct translation from the

terms.
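The three steps above can be sketched as a pipeline in which each stage is a pluggable component; the parameter names (search, extract_terms, select_translation) are hypothetical stand-ins for the components just described, not an actual system interface:

```python
def translate_oov(term, search, extract_terms, select_translation, top_n=200):
    """Web based OOV translation: retrieve summaries, extract candidate
    terms, then select one candidate as the translation."""
    summaries = search(term, top_n)              # 1. web document retrieval
    candidates = extract_terms(summaries, term)  # 2. term extraction
    return select_translation(candidates, term)  # 3. translation selection
```

Any concrete system then supplies the three stages, e.g. a search engine wrapper, one of the term extraction methods of Section 2.7, and a frequency based or disambiguation based selector.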

The term extraction in the second step falls into two main categories: approaches

that are based on lexical analysis or dictionary based word segmentation, and

approaches that are based on co-occurrence statistics. When translating Chinese

text into English, Chinese terms should be correctly detected first. As there are no

word boundaries in Chinese text, the mining system has to perform segmentation

of the Chinese sentences to find the candidate words. The quality of the

segmentation greatly influences the quality of the keyword extraction because

incorrect segmentation of the Chinese text may break the correct translation of an

English term into two or more words so that the correct keyword is lost. The

translation selection in the third step also has a problem: selecting the highest frequency word or the longest word does not always produce a correct

translation. The term extraction and translation selection problems will be further

addressed in following sections.

2.7 Term extraction

Term extraction is mainly the task of finding MLUs in the corpus. The concept of

MLU is important for applications that exploit language properties, such as

Natural Language Processing (NLP), information retrieval and machine translation.


An MLU is a group of words that always occur together to express a specific

meaning. For example, compound nouns like Disney Land, compound verbs like

take into account, adverbial locutions like as soon as possible, and idioms like

cutting edge are MLUs. In most cases, it is necessary to extract MLUs rather than

words from a corpus because the meaning of an MLU is not always the

combination of individual words in the MLU. For example, you cannot interpret

the MLU ‘cutting edge’ by combining the meaning of ‘cutting’ and the meaning of

‘edge’.

Finding MLUs from the summaries returned by a search engine is important in

web mining for automated translation because a word in one language may be

translated into a phrase or even a sentence. If only words are extracted from the

summaries, the later process may not be able to find the correct translation

because the translation might be a phrase rather than a word. For Chinese text, a

word consisting of several characters is not explicitly delimited since Chinese text

contains sequences of Chinese characters without spaces between them. Chinese

word segmentation is the process of marking word boundaries. The Chinese word

segmentation is actually similar to the extraction of MLUs in English documents

since the MLU extraction in English documents also needs to mark the lexicon

boundaries between MLUs. Therefore, term extraction in Chinese documents can

be considered as Chinese word segmentation. Many existing systems use lexical


based or dictionary based segmenters to determine word boundaries in Chinese

text. However, in the case of web mining for automated translation, as an OOV

term is an unknown term to the system, these kinds of segmenters usually cannot

correctly identify the OOV terms in the sentence. Therefore, the translation of an

OOV term cannot be found in a later process. Some researchers suggested

approaches that are based on a co-occurrence statistics model for Chinese word

segmentation to avoid this problem (Chen, Jiang et al. 2000; Maeda, Sadat et al.

2000; Gao, Nie et al. 2001; Pirkola, Hedlund et al. 2001).

2.7.1 Mutual information and its variations

One of the most popular statistics based extraction approaches is to use mutual

information (Chien 1997; Silva, Dias et al. 1999). Mutual information is defined as:

MI(x, y) = log2( p(x, y) / ( p(x) p(y) ) ) = log2( N f(x, y) / ( f(x) f(y) ) )    (1)

The mutual information measurement quantifies the distance between the joint distribution of the terms x and y and the product of their marginal distributions. When using mutual information in Chinese segmentation, x and y are two Chinese characters; f(x), f(y) and f(x, y) are the frequencies that x appears, y appears, and x and


y appear together, respectively; N is the size of the corpus. A string xy will be

judged as a term if the MI value is greater than a predefined threshold.
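Equation (1) translates directly into code (a sketch over raw frequency counts):

```python
import math

def mutual_information(f_x, f_y, f_xy, n):
    """MI(x, y) = log2(N * f(x,y) / (f(x) * f(y))), Equation (1): the
    joint frequency of the pair against the product of the individual
    character frequencies, in a corpus of size N."""
    return math.log2(n * f_xy / (f_x * f_y))
```

For instance, with f(x) = f(y) = f(x, y) = 2 in a corpus of size 4, MI = log2(2) = 1, and a pair whose characters always co-occur scores higher than one whose characters mostly occur apart.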

Chien (Chien 1997) suggests a variation of the mutual information measurement

called significance estimation to extract Chinese keywords from corpora. The

significance estimation of a Chinese string is defined as:

SE(c) = f(c) / ( f(a) + f(b) − f(c) )    (2)

where c is a Chinese string with n characters; a and b are the two longest composed substrings of c, each with length n−1; and f is the function that calculates the frequency of a string. Two thresholds are predefined: THF and THSE. This approach identifies a Chinese string as an MLU by the following steps. For the whole string c, if f(c) > THF, c is considered a Chinese term. For the two (n−1)-character substrings a and b of c: if SE(c) >= THSE, neither a nor b is a Chinese term; if SE(c) < THSE and f(a) >> f(b) or f(b) >> f(a), then a or b, respectively, is a Chinese term. Then for each of a and b, the method is recursively applied to determine whether their substrings are terms.


2.7.2 Local Maxima based approaches

However, all mutual information based approaches have the problem of tuning

the thresholds for generic use. Silva and Lopes suggest an approach called Local

Maxima to extract MLUs from corpora without using any predefined threshold (Silva

and Lopes 1999). The equation used in Local Maxima is known as SCP defined as

follows:

SCP(s) = f(s)^2 / ( (1/(n−1)) Σ_{i=1}^{n−1} f(w_1…w_i) f(w_{i+1}…w_n) )    (3)

S is an n-gram string and w_1…w_i is a substring of S. A string is judged as an MLU if its SCP value is greater than or equal to the SCP values of all the substrings of S, and also greater than or equal to the SCP values of its antecedent and successor. The antecedent of S is the (n−1)-gram substring of S; the successor of S is an (n+1)-gram string whose antecedent is S.
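Equation (3) can be sketched as follows; `freq` is an assumed mapping from a tuple of tokens to its corpus frequency, not part of the original formulation:

```python
def scp(freq, s):
    """SCP(s), Equation (3): f(s)^2 divided by the average product of
    the frequencies of every two-part split of the n-gram s."""
    n = len(s)
    avg = sum(freq[s[:i]] * freq[s[i:]] for i in range(1, n)) / (n - 1)
    return freq[s] ** 2 / avg
```

A "glued" pair whose parts rarely occur apart gets an SCP near 1, while a chance co-occurrence of two frequent tokens gets a value near 0.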

Although Local Maxima should be a language independent approach, Jenq-Haur Wang et al. (Cheng, Teng et al. 2004) found that it does not work well in Chinese

word extraction. They introduced context dependency (CD) used together with the


Local Maxima. The new approach is called SCPCD. The rank for a string S is

calculated using the following function:

SCPCD(s) = ( LC(s) RC(s) ) / ( (1/(n−1)) Σ_{i=1}^{n−1} freq(w_1…w_i) freq(w_{i+1}…w_n) )    (4)

S is the input string; w_1…w_i is a substring of S; LC() and RC() are functions that calculate the number of unique left (right) adjacent characters of S. A string is judged as a Chinese term if its SCPCD value is greater than or equal to the SCPCD values of all the substrings of S.
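Equation (4) differs from SCP only in the numerator, which uses context counts instead of f(s)^2. A sketch, with `left_ctx` and `right_ctx` assumed to map each string to the set of its unique adjacent characters:

```python
def scpcd(freq, left_ctx, right_ctx, s):
    """SCPCD(s), Equation (4): LC(s) * RC(s) over the average product of
    the frequencies of the two-part splits of s."""
    n = len(s)
    avg = sum(freq[s[:i]] * freq[s[i:]] for i in range(1, n)) / (n - 1)
    return len(left_ctx[s]) * len(right_ctx[s]) / avg
```

Using the number of distinct neighbouring characters rewards strings that occur in varied contexts, which is the context dependency idea added on top of Local Maxima.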

2.8 Summary

In this chapter, several existing techniques for distributed systems, peer to peer

systems and multilingual systems are reviewed. Peer to peer systems are special dynamic distributed computing systems. In order to improve scalability and efficiency, collection selection and collection fusion are two key components in distributed and peer to peer systems. Collection selection can reduce the number of query requests sent out while still keeping the retrieval performance. Collection

fusion can improve the precision of the merged results. When applying

multilingual information retrieval features to peer to peer systems, translation of

query is the most important part. Transliteration, parallel text mining and web


mining are three major approaches to translating OOV terms. Due to the special characteristics of the Chinese language, transliteration is not as effective as the other two approaches. However, parallel text mining and web mining approaches require term extraction techniques to extract quality Chinese terms from text, which makes these approaches more complex. Statistics based extraction approaches are popular in parallel text mining and web mining because their performance on new terms is better than that of dictionary based approaches. Mutual information based approaches are widely used as they are simple and easy to apply, but they need predefined thresholds; as a result, they only work well on static collections. Local Maxima based approaches are more complex, but they do not need any predefined threshold, so they can work well in dynamic environments.


Chapter 3

Web based query translation

Enabling multilingual search is one of the key features of the proposed P2PIR

system in this thesis. Obviously translation is needed in the CLIR process; either

translating the query into the document language, or translating the documents

into the query language. It is quite clear that query translation is much quicker

because it only needs to translate a few query terms instead of whole documents.

As a result, query translation is more common than document translation in CLIR.

This thesis only focuses on improving query translation quality.

Similar to the previous work, the approach proposed in this thesis adopted the

idea of finding the OOV term’s translation through searching the web by using a

web search engine. Web mining based approaches submit English queries (usually

an English term) to a web search engine and the top returned results (i.e.,

summaries in Chinese) are segmented into a word list. Each of the words in the list

is then assigned a rank calculated based on term frequency. The word with the

highest rank in the word list is selected as the translation of the English term.

However, observations showed that there are two weaknesses in existing

approaches.


The term extraction approaches used in the existing web mining based

approaches are not designed for a small amount of text. According to our initial

experiments, the performance of those term extraction approaches is not always

satisfactory in web search engine-based OOV term translation. The pages

returned from a web search engine are used for search based OOV term

translation. In most cases, only a few hundred top results from the result

pages are used for translation extraction. Consequently, the corpus size for search

based approaches is quite small. In a small collection, the frequencies of strings

very often are too low to be used in the approaches reviewed in Chapter 2.

Moreover, the search engine results are usually incomplete sentences, which

make traditional Chinese word segmentation hard to apply in this situation. Many

researchers (Gao, Nie et al. 2001; Chen and Gey 2003; Cheng, Teng et al. 2004;

Zhang and Vines 2004; Lu, Xu et al. 2007) apply statistical based approaches to

search based translation for term extraction to avoid the incorrect segmentation

by dictionary based word segmentation approaches.

The second weakness of current web-based query translation approaches is that

they use a relatively simple translation selection strategy. As discussed before, term extraction will provide a list of translation candidate words. Each of the words in the list is then assigned a rank calculated based on term frequency. The

word with the highest rank in the word list is selected as the translation of the


English term. Researchers (Lu, Chein et al. 2004; Zhang and Vines 2004) usually use

the frequency of a word as the rank of the word. Their experiments showed that

such strategy is effective but also showed that the correct translation does not

always have the highest frequency even though it very often has a higher

frequency. Therefore an argument is made that the correct translation is not

necessarily the term with the highest rank.

The new query translation approach discussed in this section still follows the

common web search engine-based query translation approaches but differs in

term extraction, term ranking and translation selection strategy. The aim of the

approach is to resolve the two weaknesses of current approaches discussed above.

The contributions of the new approach are introducing a new term extraction strategy and applying translation disambiguation technology as the translation selection strategy.

In the following sections, a bottom-up term extraction strategy is introduced together with a new term measurement. The term extraction approach is specifically designed for search based translation extraction: it uses term frequency change as an indicator to determine term boundaries, and it compares the frequencies of individual characters, instead of terms, to reduce the impact of low term frequency in small collections.


The basic idea of translation selection in the approach is to combine the

translation disambiguation technology and the web search based translation

extraction technology. The web based translation extraction process usually

returns a list of words in the target language. As those words are all extracted

from the results returned by the web search engine, it is reasonable to assume

that those words are relevant to the English terms that were submitted to the

web search engine. If we assume all those words are potential translations of the

English terms, we can apply the translation disambiguation technique to select the

most appropriate word as the translation of the English terms.

The proposed query translation approach contains three major modules:

collecting web document summaries, term extraction, and translation selection.

For easier understanding, in the following sub sections, an example of finding the

translation of the term “Stealth Fighter” will be demonstrated to facilitate the

description of the proposed approach.

3.1 Collecting Web Document summaries

Firstly, collect the top 200 document summaries returned from Google that

contain both English and Chinese words. Sample document summaries are shown

below.


Figure 3–1 Three sample document summaries for “Stealth Fighter” returned

from Google

Figure 3-1 clearly shows that Stealth Fighter and its Chinese translation 隱形戰機 always appear together. The Chinese translation of Stealth Fighter appears either before or after the English words. In the sample summaries in Figure 3-1, the translation and the English term “Stealth Fighter” are highlighted in red.

Although the query submitted to Google is asking for Chinese documents, Google

may still return some documents purely in English. Therefore we need to filter out


the documents which are written in English only. The documents that contain

both the English terms and Chinese characters are kept. Also, all the HTML tags need to be removed so that only the plain text is kept.

Secondly, from the document summaries returned by the search engine we

collect the sentences in target language, for example, we can collect three

Chinese sentences from the three sample document summaries in Figure 3–1.

Each sentence must contain the English term and the characters before or after

the term. From the summaries given in Figure 3–1, the following Chinese strings

will be extracted as shown in Figure 3-2 below:

Figure 3–2 Sample output of Chinese string collection
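The collection procedure above (filtering, tag stripping and adjacent Chinese string extraction) might be sketched as follows; the regular expressions, the CJK character range and the function name are our own assumptions, not code from the thesis:

```python
import re

def collect_chinese_strings(summaries, english_term):
    """Keep summaries containing both the English term and Chinese
    characters, strip HTML tags, and return the Chinese character runs
    that appear immediately before or after the English term."""
    collected = []
    for summary in summaries:
        text = re.sub(r"<[^>]+>", "", summary)  # drop HTML tags
        if english_term.lower() not in text.lower():
            continue  # skip English-only summaries
        # Chinese runs directly before and after the English term
        before = re.findall(r"([\u4e00-\u9fff]+)\s*" + re.escape(english_term),
                            text, re.IGNORECASE)
        after = re.findall(re.escape(english_term) + r"\s*([\u4e00-\u9fff]+)",
                           text, re.IGNORECASE)
        collected.extend(before + after)
    return collected
```

Applied to the sample summaries of Figure 3-1, this yields Chinese strings of the kind shown in Figure 3-2.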


3.2 Term extraction

In this step, meaningful terms should be extracted from the Chinese string

collection obtained from the previous step. Term extraction here is similar

to Chinese word segmentation. However, Chinese word segmentation is to

identify word boundaries while term extraction here is to find out all the possible

meaningful terms. For example, 隱形戰機 might be segmented as 隱形 (stealth) and 戰機 (fighter). In term extraction, all three terms 隱形, 戰機 and 隱形戰機

should be extracted as translation candidates otherwise the correct translation

could be missed. The upcoming sections will describe the new term extraction

strategy in detail.

3.2.1 Frequency Change Measurement

The approaches mentioned in Section 2.7 use a top-down approach that starts

with examining the whole sentence and then examining substrings of the

sentence to extract MLUs until the substring becomes empty. We propose using a

bottom-up approach that starts with examining the first character and then

examines super strings. The approach is based on the following observations for

small document collections:


Observation 1: For a small collection of Chinese text such as the sentences

collected from the summaries returned by a search engine, a sequence of Chinese

characters is most likely an MLU if the frequency of the characters occurring

together is close to the frequency of each individual character in the sequence.

This is because in small document collections, such as Google search result summaries, the number of unique Chinese characters is quite small, which reduces the number of possible terms in the collection. Moreover, in the search result summaries all the texts are related to a specific topic, which reduces the possible terms even further. As a result, one Chinese character will usually appear in only one or two terms in the collection, which makes the term frequency close to the character frequency.

According to Observation 1, the frequencies of a term and each character in the

term should be similar. As the standard deviation is a common measure of the

dispersion of a set of values, we propose to use the sample standard deviation

given in Equation (5) to measure the similarity between the character frequencies.

σ = sqrt( (1/(n−1)) Σ_{i=1}^{n} (x_i − x̄)^2 )    (5)

For a given Chinese character sequence with n characters, x_i is the frequency of character i in the sequence, and x̄ is the average frequency of all the characters in the

sequence. Although the frequency of a string is low in a small corpus, the


frequencies of Chinese characters may still have relatively high values. According

to Observation 1, if the characters in a sequence have similar frequencies, i.e., σ is small, then the given sequence is most likely an MLU. When the frequencies of all the characters in a Chinese sequence are equal, σ = 0. Because σ represents the average frequency error of the individual characters in the sequence, according to Observation 1, within an MLU, a longer substring of that MLU will have a smaller average frequency error.

Observation 2: When a correct Chinese term is extended with an additional character, the frequency of the new string very often drops significantly. When a Chinese term is extended with a random additional character, the result is no longer a Chinese term; the new string will rarely be found in the documents, so its frequency should drop significantly. In the case that the additional character turns the string into another correct term, the new term is unlikely to have the same meaning as the old term. Because the frequencies are drawn from a small document collection in a particular domain related to the old term, the new term should not appear very often, so its frequency should drop significantly as well. In summary, when a Chinese term is extended with an additional character, whether or not the new string is a real term, the frequency of the new string should drop in a small document collection.


Equation (5) measures the frequency similarity between individual characters

without comparing each individual character’s frequency with the sequence’s

frequency. Combining the frequency of the sequence and the standard deviation

measurement together, we designed the following equation to measure the

possibility of s being a term:

R(s) = f(s) / (σ + 1) = f(s) / ( sqrt( (1/(n−1)) Σ_{i=1}^{n} (x_i − x̄)^2 ) + 1 )    (6)

where s is a Chinese sequence and f(s) is the frequency of s in the corpus. We use σ + 1 as the denominator instead of σ to avoid a zero denominator.

Let S be a Chinese sequence with n characters, and let S′ be a substring of S with length n−1. According to Observation 1, we should have: if S is an MLU, then f(S) ≈ f(S′), and vice versa. If S is an MLU, then the longer S is, the smaller σ is. Therefore, in the case that S′ is a substring of S, we would have σ < σ′, and as a result R(S) > R(S′). In another case, where S′ is a substring of S and S′ is an MLU while S is not, that is, S has an additional character appended to an MLU, we will have f(S) < f(S′), and the additional character gives S a larger standard deviation value, so σ > σ′. Therefore, R(S) < R(S′).
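Equations (5) and (6) can be combined into one small function; the frequency values in the example below are illustrative only:

```python
import math

def r_value(freq, s):
    """R(s) = f(s) / (sigma + 1), Equation (6), where sigma is the sample
    standard deviation (Equation (5)) of the frequencies of the individual
    characters of s. `freq` maps both the full sequence and each single
    character to its frequency; assumes len(s) >= 2."""
    chars = list(s)
    n = len(chars)
    mean = sum(freq[c] for c in chars) / n
    sigma = math.sqrt(sum((freq[c] - mean) ** 2 for c in chars) / (n - 1))
    return freq[s] / (sigma + 1)
```

When all character frequencies equal the sequence frequency, sigma = 0 and R(s) equals f(s), which is the ideal MLU case of Observation 1.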


In summary, for a string and its substrings, the one with higher R value would

most likely be an MLU. Table 3-1 gives the R value of each possible term in the

Chinese sentence “隱形戰機/是/一種/靈活度/極差/的/戰機” (“/” indicates the

lexicon boundary given by a human), chosen from the small collection of Chinese

strings given in Figure 3-2.

Table 3-1 Chinese strings and R

String R

隱形 26.00

隱形戰 0.94

戰機 2.89

戰機是 0.08

一種 0.44

一種靈 0.21

靈活 2.00

靈活度 2.00

靈活度極 1.07

極差 0.8

極差的 0.07

戰機 2.89

This example clearly shows that if a Chinese MLU has an additional character, its R

value will be significantly smaller than the R value of the MLU. For example,

Chinese terms “一種”, “靈活” and “靈活度” are valid Chinese MLUs, but “一種靈” and “靈活度極” are not. From their R values, we find that R(一種)=0.44 > R(一種靈)=0.21 and R(靈活)=R(靈活度)=2.00 > R(靈活度極)=1.07. Based on

this analysis, we conclude that it is reasonable to segment a Chinese sentence at

the positions where a Chinese character string’s R value drops greatly and the

Chinese character string is a potential MLU. For the example sentence, it will be

segmented as: “隱形/戰機/是/一種/靈活度/極差/的/戰機” by using this

method. The only difference between the human-segmented sentence and the automatically segmented sentence is that “隱形戰機” (Stealth Fighter) is segmented

into two words “隱形” (Stealth) and “戰機” (Fighter). However, this is still an

acceptable segmentation because those two words are meaningful.

3.2.2 A Bottom-up Term Extraction Strategy

The traditional top-down strategy first checks whether the whole sentence is an MLU, then reduces the sentence size by 1 and recursively checks subsequences. It is reported that over 90% of meaningful Chinese terms consist of fewer than 4 characters (Wu 2004), and on average, the number of characters in a sentence is much larger than 4. Obviously, a whole sentence is unlikely to be an MLU. Therefore, checking the whole sentence for an MLU is unnecessary. In this

section, we describe a bottom-up strategy that extracts terms starting from the

first character in the sentence. The basic idea is to determine the boundary of a


term in a sentence by examining the frequency change, i.e., the change of the R

value defined in Equation (6) when the size of the term is increasing. If the R value

of a term with size n+1 drops significantly compared with its largest sub term with

size n, the sub term with size n is extracted as an MLU. For example, in Table 3-1,

there is a big drop between the R value of the third term “靈活度” (2.00) and its

super term “靈活度極” (1.07). Therefore, “靈活度” is considered as an MLU. The

following algorithm describes the bottom-up term extraction strategy:

Algorithm BUTE(s)

Input: s = a1a2…an, a Chinese sentence with n Chinese characters

Output: M, a set of MLUs

Check each character in s; if it is a stop character such as 是, 了, 的…, remove it from s. After removing all stop characters, s becomes a1a2…am, m ≤ n.

1. Let b=2, e=2, and M=∅

2. Let t1 = ab…ae, t2 = ab…ae+1.

3. If R(t1) >> R(t2), then M = M ∪ {t1}, b = e+1.

4. e = e+1; if e+1 > m, return M; otherwise go to step 2.


Here, the meaning of a stop character is similar to stop words in English, e.g. “the”, “an”. The algorithm makes the subsequence uncheckable once it is

identified as an MLU (i.e., b=e+1 in step 3 ensures that the next valid checkable

sequence doesn’t contain t1 which was just extracted as an MLU). However, when

using the bottom-up strategy, some longer MLU terms might be missed since the

longer terms may contain some shorter terms which have been extracted as

MLUs. As shown in our example, “隱形戰機” (Stealth Fighter) consists of two

terms “隱形” and “戰機”. When using the bottom-up strategy, “隱形戰機” would not be extracted because the composite term has been segmented into two terms.

To avoid this problem, we set up a fixed number ω which specifies the maximum

number of characters to be examined before reducing the size of the checkable

sequence. The modified algorithm is given below:

Algorithm BUTEM(s)

Input: s = a1a2…an, a Chinese sentence with n Chinese characters

Output: M, a set of MLUs

Check each character in s; if it is a stop character such as 是, 了, 的…, remove it from s. After removing all stop characters, s becomes a1a2…am, m ≤ n.

1. Let b=2, e=2, First-term = true, and M=∅

2. Let t1 = ab…ae, t2 = ab…ae+1.

3. If R(t1) >> R(t2), then M := M ∪ {t1}

4. If First-term = true, then first-position := e and First-term := false

5. If e − b + 1 ≥ ω, then e := first-position, b := e+1, First-term := true.

6. e = e+1; if e+1 > m, return M; otherwise go to step 2.

In algorithm BUTEM, the variable first-position gives the ending position of the

first identified MLU. Only when ω characters have been examined, the first

identified MLU will be removed from the next valid checkable sequence,

otherwise the current sequence will be checked for a possible longer MLU even it

contains an extracted MLU. Therefore, not only the term “隱形” and “戰機” will

be extracted but also the longer term “隱形戰機” (Stealth Fighter) will be

extracted.
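The bottom-up strategy can be sketched in Python as follows. This is a simplified illustration of BUTE, without the ω window of BUTEM; the drop_ratio threshold standing in for the “>>” test, and the toy frequency table in the example, are assumptions for illustration only:

```python
import math

def r_value(s, freq):
    # R from Equation (6): f(s) over (standard deviation of the
    # individual character frequencies + 1).
    chars = [freq(c) for c in s]
    mean = sum(chars) / len(chars)
    sigma = math.sqrt(sum((x - mean) ** 2 for x in chars) / len(chars))
    return freq(s) / (sigma + 1)

def bute(sentence, freq, stop_chars="是了的", drop_ratio=2.0):
    # Simplified bottom-up extraction: grow a window one character at a
    # time; when extending the window makes R drop sharply, emit the
    # current window as an MLU and start a new window after it.
    s = [c for c in sentence if c not in stop_chars]
    terms = []
    b, e = 0, 1
    while e + 1 < len(s):
        t1 = "".join(s[b:e + 1])
        t2 = "".join(s[b:e + 2])
        if r_value(t1, freq) > drop_ratio * r_value(t2, freq):
            terms.append(t1)
            b = e + 1
        e += 1
    if b < len(s):                      # flush the trailing window
        terms.append("".join(s[b:]))
    return terms

# Toy example with made-up frequencies ('ab' and 'cd' behave like MLUs).
table = {"a": 10, "b": 10, "ab": 10, "abc": 1, "c": 30, "d": 30, "cd": 20}
print(bute("abcd", lambda x: table.get(x, 0)))  # → ['ab', 'cd']
```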


3.3 Translation selection

From the term extraction step discussed in Section 3.2, we can generate a list of

translation candidates for each query term. The next step is to find the correct

translation for each query term from its candidate list. The traditional translation

selection approaches select the translation on the basis of word frequency and

word length (Chen and Gey 2003; Zhang and Vines 2004). The approach suggested

here can find the most appropriate translation from the extracted word list

regardless of term frequency by using translation disambiguation techniques, so

even a low frequency word will have a chance to be selected.

In most cases, a query represents the user’s need for information about some

specific topic. Therefore, all query terms should relate to one single topic. The

translation of query terms must also belong to the same topic to make the query

meaningful. In other words, if the translation of the query terms does not belong

to the same topic as the original query terms, the translations are not likely to be

correct translations. Consequently, translation disambiguation can be applied to help select the correct query translation. If all the candidate terms are assumed

correct translations, the problem of selecting the most appropriate translation

becomes the problem of word translation disambiguation.


Researchers have summarised that “the problem of word translation disambiguation (in general, word sense disambiguation) can be viewed as that of classification” (Li and Li 2001). Therefore, the problem of selecting an appropriate query translation

becomes the problem of query classification. If the original query terms can be

classified as class C, the translated query terms should also belong to class C.

However, the process of query classification is expensive and it is hard to classify

query terms in two languages. To make this problem simple, it is reasonable to

assume that if all query terms belong to one topic class, the corresponding

translations of these terms should be correlated with each other strongly. Based

on this consideration, we can determine the best translation of the query by

examining the correlation of each possible combination of the translation terms

and choosing the combination which has the highest correlation.

3.3.1 The algorithm

A simple way to measure the correlation between items is to use mutual information (Church and Hanks 1990). There are several variations of mutual

information based approaches to measure the co-occurrence of multiple items.

Total correlation is one of the popular approaches. Let {Qi} be a set of query terms and Ti = {ti,j} be the candidate translations of term Qi. The correlation between the

translation terms can be calculated as:


C(t_1 t_2 t_3 \ldots t_n) = \log_2 \frac{N^{n-1} f(t_1 t_2 t_3 \ldots t_n)}{f(t_1) f(t_2) f(t_3) \ldots f(t_n)}    (7)

where ti is one of the candidate translations for the ith query term Qi, f(ti) is the frequency with which the translation word ti appears in the corpus, t1t2…tn is a combination of the candidate translations, and f(t1t2t3…tn) is the frequency with which t1t2…tn appears in the corpus. N is the size of the corpus. The corpus is

constructed by the relevant documents retrieved from the document collection

using all candidate translation terms. Assuming that a list of translation candidates

for each query term of a given query has been generated from the term extraction

phase, the process of determining the best translation of the query is described as

follows:

Step 1: use the candidate translation terms to retrieve documents from the document collection, and calculate the frequency of each candidate translation in the collection that contains all the retrieved documents. For instance, if the original

English query has three terms A,B,C and A1,A2…, B1,B2…, and C1,C2….. are the

candidate translations for A, B, and C, respectively, then the frequency of A1, A2,

….., B1, B2, ….., C1,C2….. in the collection is f(A1), f(A2),… f(B1), f(B2)…., and so on.

Step 2: calculate the frequencies of all the possible combinations of the candidate

translations in the collection of all the retrieved documents. For example, the


frequency of combination A1B1C1 is f(A1B1C1), A1B2C1 is f(A1B2C1), and A1B2C3

is f(A1B2C3)…. and so on.

Step 3: calculate the correlation of all the possible combinations using Equation

(7). For example, the correlation of the candidate translation combination A1B1C1 is calculated by:

C(A_1 B_1 C_1) = \log_2 \frac{N^2 f(A_1 B_1 C_1)}{f(A_1) f(B_1) f(C_1)}

The terms in the translation combination with the highest correlation value are

considered strongly related and thus the translation combination should be

selected as the correct translation for that query.
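The three-step procedure can be sketched in Python as follows. Here freq and joint_freq are assumed frequency lookups over the retrieved document collection, and every count is smoothed with +1 to avoid zeros (matching the smoothed form of Equation (8) below); all names and the toy frequencies are illustrative:

```python
import math
from itertools import product

def correlation(combo, freq, joint_freq, N):
    # Total correlation with +1 smoothing:
    # log2( N^(n-1) * (f(t1..tn)+1) / prod_i (f(ti)+1) ).
    n = len(combo)
    denom = 1
    for t in combo:
        denom *= freq(t) + 1
    return math.log2(N ** (n - 1) * (joint_freq(combo) + 1) / denom)

def select_translation(candidate_lists, freq, joint_freq, N):
    # Steps 1-3: enumerate every combination of candidate translations
    # and keep the one with the highest total correlation.
    return max(product(*candidate_lists),
               key=lambda combo: correlation(combo, freq, joint_freq, N))

# Toy example: A1 and B1 co-occur often, so they are chosen even though
# A2 and B2 are individually more frequent.
f = lambda t: {"A1": 5, "A2": 50, "B1": 5, "B2": 50}[t]
jf = lambda c: 4 if c == ("A1", "B1") else 0
print(select_translation([["A1", "A2"], ["B1", "B2"]], f, jf, 100))
```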

It is quite possible that the frequency of one term is zero in the corpus, which would make the calculation of Equation (7) invalid. To avoid zero frequency in the

calculation, Equation (7) is modified as below:

C(t_1 t_2 t_3 \ldots t_n) = \log_2 \frac{N^{n-1} (f(t_1 t_2 \ldots t_n) + 1)}{(f(t_1)+1)(f(t_2)+1)(f(t_3)+1) \ldots (f(t_n)+1)}    (8)

In practice, sometimes the translation combination with the highest correlation might still not be the correct query translation. For example, for “隱形戰機” (Stealth Fighter), 隱形 and 戰機 are all translation candidates for the term Stealth Fighter. In


fact, using 隱形 as the translation will have a higher C value than using 隱形戰機

because a shorter string usually has a higher frequency than a longer string. A

simple strategy, called term merging in this thesis, is used to solve this problem.

Among the top 10 translation combinations based on their C value, if a translation

is a substring of other translations in the top 10 list, replace it with the longer

term. Repeat this process until the top 1 translation is not a substring of any other

translations. Then the top 1 translation is chosen as the correct translation. The

merge algorithm is described below.

1. List the top 10 translations based on the value C (calculated using Equation 8).

Let W = {w1, w2, …, w10}, i = 2

2. If w1 is a substring of wi, 1 < i ≤ 10, then w1 = wi and i = 2

3. i = i + 1

4. Repeat steps 2 and 3 until i > 10

We still use “隱形戰機” (Stealth Fighter) as an example to explain the term merging strategy. The table below shows the top combination translations; only the top 7 are listed because only 7 terms are available in the whole list.


Table 3-2 Sample Combination of Translations

Rank  Term
1     隱形
2     戰機
3     戰鬥機
4     隱形戰機
5     美國
6     雷達
7     隱形戰鬥機

The top 1 term is 隱形. The true translation is at rank 4. Following the merge algorithm, 隱形 will be replaced by 隱形戰機. We then try to find a super-string of 隱形戰機 in the list. As no super-string is found, 隱形戰機 will be selected as the correct translation of Stealth Fighter.
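The term merging strategy can be sketched as follows; the function name is illustrative, and the ranked list in the example is the one from Table 3-2:

```python
def merge_top_translations(ranked):
    # If the top translation is a (contiguous) substring of another
    # candidate in the top list, replace it with that longer term;
    # repeat until the top term is not a substring of any candidate.
    top = ranked[0]
    changed = True
    while changed:
        changed = False
        for cand in ranked[1:]:
            if top != cand and top in cand:
                top = cand
                changed = True
                break
    return top

ranked = ["隱形", "戰機", "戰鬥機", "隱形戰機", "美國", "雷達", "隱形戰鬥機"]
print(merge_top_translations(ranked))  # → 隱形戰機
```

Note that 隱形戰機 is not a substring of 隱形戰鬥機 (the character 鬥 interrupts it), so the loop stops there, matching the example above.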

In summary, this chapter has introduced a new bottom-up term extraction strategy and a new translation selection strategy that applies translation disambiguation techniques. Using these two strategies in web search based query translation helps increase translation accuracy.

3.3.2 Time Complexities

Although complexity analysis is not the major focus of IR researchers, it is still necessary to know how much overhead the disambiguation process adds to the standard monolingual IR process. In this section, a brief analysis of time complexity is given.


To explain the time complexity of the algorithm easily, I break the disambiguation

process into several parts.

1. Time complexity of finding the possible combinations of correlations

between terms.

If there are n query terms and each query term has m_i possible translations, the number of candidate combinations, and hence the computation complexity, is \prod_{i=1}^{n} m_i. If we define a number c > 1 such that m_i = a_i \cdot c, then the computation complexity becomes \prod_{i=1}^{n} m_i = \left(\prod_{i=1}^{n} a_i\right) c^n = O(c^n).

2. Time complexity of calculating correlations

Equation (8) is used to calculate correlations. It requires a loop of n iterations to calculate f(t_i)+1, 1 ≤ i ≤ n, plus the calculation of N^{n-1}(f(t_1 \ldots t_n)+1). Therefore, the time complexity is O(n) + O(1) = O(n).

3. Complexity of term merging:

For each term, the time complexity is O(1). Therefore, the time complexity for n query terms is O(n).

4. Summary of computational complexity


In summary, the time complexity of the whole process is O(c^n) · O(n) + O(n) = O(n · c^n).


Chapter 4

Multilingual Experiments

In this chapter, several experiments were conducted to evaluate the proposed

query translation approach. The web search engine we used in the experiments is

Google. Two sets of experiments were designed to evaluate the translation

related approaches described in Chapter 3. The first set of experiments is

designed to evaluate the effectiveness of term extraction for OOV translation and

the second set of experiments is designed to evaluate the effectiveness of

translation selection for OOV translation.

4.1 Test set

Queries, document collection and relevance judgments provided by NTCIR

(http://research.nii.ac.jp/ntcir/) are used in the experiments. The NTCIR6 Chinese

test document collection was used as our test collection. The articles in the

collection are news articles published in 2000-2001. Detailed information about the test set is shown in Table 4-1 below.


Table 4-1 Test document collections

Document collection          Year 2000  Year 2001  No. of articles
United Daily News (udn)      244038     222526     466564
United Express (ude)         40445      51851      92296
Ming Hseng News (mhn)        84437      85302      169739
Economic Daily News (edn)    79380      93467      172847
Total                        448300     453146     901446

The document itself is in XML format with the following tags:

• <DOC> </DOC> The tag for each document

• <DOCNO> </DOCNO> Document identifier

• <LANG> </LANG> Language code: CH, EN, JA, KR

• <HEADLINE> </HEADLINE> Title of this news article

• <DATE> </DATE> Issue date

• <TEXT> </TEXT> Text of news article

• <P> </P> Paragraph marker
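Assuming each article follows the tag layout above, a minimal parsing sketch with Python's standard ElementTree might look like this; the sample document content is invented for illustration:

```python
import xml.etree.ElementTree as ET

sample = """<DOC>
<DOCNO>sample-0001</DOCNO>
<LANG>CH</LANG>
<HEADLINE>sample headline</HEADLINE>
<DATE>2000-10-01</DATE>
<TEXT><P>first paragraph</P><P>second paragraph</P></TEXT>
</DOC>"""

doc = ET.fromstring(sample)
record = {
    "id": doc.findtext("DOCNO"),        # document identifier
    "lang": doc.findtext("LANG"),       # CH, EN, JA or KR
    "title": doc.findtext("HEADLINE"),
    "date": doc.findtext("DATE"),
    # join the <P> paragraphs inside <TEXT> into one text field
    "text": " ".join(p.text for p in doc.find("TEXT").findall("P")),
}
print(record["id"], record["lang"])
```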

Queries used in the experiments are from the NTCIR5 and NTCIR6 CLIR tasks. There are altogether 100 queries created by researchers from Taiwan, Japan and Korea.

NTCIR provided both English queries and corresponding Chinese queries. The

Chinese queries are translated by human translators and thus are correct

translations of the corresponding English queries.

Yahoo’s online English-Chinese dictionary (http://tw.dictionary.yahoo.com/) is

used in the experiments. The English queries were first translated using Yahoo's online English-Chinese dictionary. The terms that could not be translated


by the online dictionary were used as the input queries to evaluate the

performance of our proposed web based query translation approach. There are

108 OOV terms that cannot be translated by the online dictionary, and these are therefore used in the experiments.

4.2 Term extraction experiments

The existing term extraction approaches reviewed in section 2.6 were used in the

experiments for comparison purposes. The abbreviations for the approaches are:

• MI for mutual information.

• SE for the approach introduced by Chien (Chien 1997).

• SCP for the Local Maxima approach introduced by Silva and Lopes (Silva and Lopes 1999).

• SCPCD for the approach introduced by Cheng et al. (Cheng, Teng et al. 2004).

• SQUT for the extraction approach introduced in Section 3.2.

The OOV term is translated via the following steps:

1. Send the OOV term as a query to Google; from the result pages returned from Google, use the 5 different term extraction approaches mentioned

above to produce 5 Chinese term lists.


2. If a Chinese word in a term list can be translated to an English word using a

dictionary, that English word must not be an OOV word. This means the

Chinese word must not be a translation of the queried English OOV word.

Therefore, for each term list obtained in step 1, remove the terms if they

can be translated to English by the Yahoo’s online dictionary. This leaves

only OOV terms.

3. Select the top 20 terms from each of the term lists produced from step 2

as translation candidates. Select the final translation from the candidate

list using the translation selection approach described in 3.3.

Finally we have 5 sets of OOV translations; a sample of the translations is shown in Appendix 3, Translation of OOV terms.
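Step 2 and the start of step 3 can be sketched as a simple filter; the dictionary here is a toy stand-in for the Yahoo online dictionary, and all names are illustrative:

```python
def filter_oov_candidates(extracted_terms, cn_to_en_dictionary, top_n=20):
    # Drop any extracted Chinese term that the dictionary can translate
    # back to English (it then cannot be the translation of the queried
    # OOV word), and keep the top-n of the rest as candidates.
    candidates = [t for t in extracted_terms if t not in cn_to_en_dictionary]
    return candidates[:top_n]

# Toy example: the dictionary knows 電腦, so it is removed.
terms = ["胚胎幹細胞", "電腦", "幹細胞研究"]
print(filter_oov_candidates(terms, {"電腦": "computer"}))
# → ['胚胎幹細胞', '幹細胞研究']
```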

As the same corpus and the same translation selection approach were used in the

evaluation, the difference in the resulting translation accuracy is the result of

using different term extraction approaches. Thus we can claim that the approach

with the higher translation accuracy has higher extraction accuracy.

4.3 Discussion

For the 108 OOV terms, by using the 5 different term extraction approaches, we

obtained the translation results shown in Table 4-2. SQUT has the highest


translation accuracy. SCP and SCPCD provided similar performance. The

approaches based on mutual information provided the lowest performance.

Table 4-2 OOV translation accuracy

Correct Accuracy (%)

MI 48 44.4

SE 58 53.7

SCP 73 67.6

SCPCD 74 68.5

SQUT 84 77.8

4.3.1 Mutual information based approaches

In the experiment, MI based approaches such as MI and SE cannot determine Chinese term boundaries well. The term lists produced by the MI based approaches contain a huge number of partial Chinese terms, and partial Chinese terms were quite often chosen as the translation of OOV terms. Some partial Chinese terms selected by MI are listed in Table 4-3.

Table 4-3 Some Extracted terms by MI

OOV Terms Extracted terms Correct terms

Embryonic Stem Cell 胚胎幹細 胚胎幹細胞

consumption tax 費稅 消費稅

Promoting Academic Excellence 卓越發 卓越發展計畫


The performance of the mutual information based term extraction approaches is

affected by many factors. These approaches rely on the predefined thresholds to

determine the lexicon boundaries. Those thresholds can only be adjusted

experimentally. Therefore, they can only be optimized for fixed corpora. However, in OOV term translation, the corpus is a dynamic set of web search engine results. The

predefined thresholds might work perfectly in some situations but might work

poorly in other situations. It is almost impossible to optimize thresholds for

generic use. As a result, the output quality is not guaranteed.

In addition, mutual information based approaches seem unsuitable for Chinese term extraction. As there are no word boundaries between Chinese words, the calculation of MI values in Chinese is based on Chinese characters rather than on words as it is in English. On average, a high school graduate in the U.S. has a vocabulary of 27,600 words (Salovesh 1996), while the cardinality of the commonly used Chinese character set is under 3000. Due to the small set of Chinese characters, Chinese characters have much higher frequencies than English words. This means that one Chinese character can be used in many MLUs while an English word has a lower chance of being used in multiple MLUs.

As a result, an English MLU will have much higher MI value than a Chinese MLU.

The subtle difference in MI values in Chinese between MLUs and non-MLUs makes

the thresholds hard to tune for generic use.


SE uses some filtering techniques to minimize the effect of thresholds. In our experiment, it gives a 17.2% improvement in translation accuracy over MI. Obviously the improvement comes from the higher quality of the extracted terms. However, the limitation of thresholds is still unavoidable.

4.3.2 Local Maxima based approaches

Without using thresholds, local maxima based approaches have much better

flexibility than the MI based approaches in various corpora, achieving higher

translation accuracy in our experiment. In comparison, the SCP approach tries to

extract longer MLUs while the SCPCD approach tries to extract shorter ones. The

translation of “Autumn Struggle”, “Wang Dan”, “Masako” and “Renault” are all 2

character Chinese terms. SCPCD can extract the translation with no problem while

SCP always has trouble with them. As over 90% of the Chinese terms are short

terms, this is a problem for SCP in Chinese term extraction. Meanwhile, SCPCD has trouble extracting long terms. Overall, the two local maxima based

approaches have similar performance. However, since in our experiment, most of

the translations of OOV terms are long terms, SCP’s performance is a little better

than that of SCPCD.


Local maxima based approaches use string frequencies in the calculation of

\frac{1}{n-1} \sum_{i=1}^{n-1} f(w_1 \ldots w_i) \, f(w_{i+1} \ldots w_n).

In a small corpus, the frequency of a string

becomes very low, which makes the calculation of string frequencies less

meaningful. Local Maxima based approaches are not effective in a small corpus. In

comparison, our approach calculates the difference between character

frequencies. In a small corpus, characters still have relatively high frequencies. As a

result, our approach performs better than Local Maxima based approaches in

small corpora. For example, local maxima based approaches were unable to

extract the translation of “Nissan Motor Company” because the corpus is too small: Google only returns 73 results for the query “Nissan Motor Company”.

4.3.3 SQUT Approach

Most of the translations can be extracted by the SQUT algorithm. As the approach

monitors the change in R value (see section 3.2.1) to determine if a string is an

MLU instead of using the absolute value of R, it does not have the difficulty of

using predefined thresholds. In addition, the use of single character frequencies in

standard deviation calculations makes our approach usable in small corpora.

Therefore, we have much higher translation accuracy than the MI based

approaches and also about 10% improvement over the Local Maxima based

approaches.


However, the SQUT algorithm has difficulty in extracting the translation of “Wang

Dan”. In analysing the result summaries, we found that the Chinese character “王”

(“Wang”) is a very high frequency character in the summaries. It is also used in

other terms such as “霸王” (the Conqueror), “帝王”(regal); “國王”(king); “女王”

(queen) and “王朝” (dynasty). Those terms also appear frequently in the result

summaries. In our approach, where we are using the count of individual

characters, the very high frequency of “王” breaks observation 2. Thus the

translation of “Wang Dan” cannot be extracted. However, in most cases, our

observations are true in small corpora as demonstrated by the high translation

accuracy of our approach in query expansion from Chinese/English web search

summaries.

There are 23 OOV terms for which SQUT cannot find correct translations. In fact, none of the algorithms used in the experiments can find the correct translations for those 23 OOV terms; none of them can find the translation of terms such as “Chiutou”, “Viagra” and “capital tie up”. When

really not many Chinese texts in the result pages; most of the result summaries

are still in English. And for some of the OOV terms, such as “Florence Griffith

Joyner”, ”F117” and ”ST1” and “FloJo”, none of the algorithms can find ideal

translations. Actually, ”F117” and ”ST1” are directly used in Chinese and none of

82

the result summaries from Google use their translations. Therefore, none of the

algorithms can find out the correct translation. We can consider this as the

limitation of the web based translation approach.

4.4 Translation selection Experiments

The following runs were conducted in the English-Chinese translation selection

experiments:

• Mono: in this run, we use the original Chinese queries from NTCIR5 and

the Chinese terms in the queries are segmented by humans. This run, called

the monolingual retrieval, provides the baseline result for comparing with

all other runs.

• IgnoreOOV: in this run, the English queries are translated using the online

Yahoo English-Chinese dictionary with the disambiguation technology

proposed in 3.3. If a translation is not found in the dictionary, the query

keeps the original English word.

• SimpleSelect: similar to IgnoreOOV, English queries are translated using

the online Yahoo English-Chinese dictionary with disambiguation

technology. If a term cannot be translated by the dictionary, it will be

translated by the proposed web mining based approach. However, in the

translation selection step, the longest and highest-frequency string is selected as its translation. This run simulates previous web translation selection approaches (Lu, Chien et al. 2004).

• TQUT: like SimpleSelect, except that in the translation selection stage, the

translation for the OOV term is selected with the disambiguation technology

proposed in 3.3.

4.5 Discussion

Table 4-4 below gives the results of retrieval performance from the four runs

defined in Section 4.4.

Table 4-4 NTCIR retrieval performance

Average precision  Percentage of Mono run

Mono 0.3713

IgnoreOOV 0.1312 35.3%

SimpleSelect 0.2482 66.8%

TQUT 0.2978 79.3%

4.5.1 IgnoreOOV

The performance of IgnoreOOV is 0.1312, which is only 35.3% of the monolingual retrieval performance. This result shows the extent to which an OOV

term can affect a query. By looking at the translated queries, we found that 31


queries out of 50 have OOV terms. By removing all those 31 queries, the Mono’s

average precision becomes 0.3026 and the IgnoreOOV’s average precision

becomes 0.2581, which is about 85.3% of Mono's precision. This is a reasonable result and indicates that our disambiguation technique works well in finding the correct translations. The reason that we cannot get the same precision

as the monolingual retrieval is that the limited coverage of the dictionary

introduces inappropriate translations. An inappropriate translation is defined as a

valid translation in some other context but not in the current query context. For

example, in query 24, for the term “space station, Mir”, 儲存信息暫存器 (Memory Information Register) is the only translation returned from our dictionary, which is a correct translation in some other context. But for query 24, it should be translated to 和平號太空站. In this case, when a dictionary only returns one translation, it is

hard to tell if it will be suitable in the context. As the dictionary only gives one

translation, we have no opportunity to correct any errors by using a

disambiguation technique. Some translations from a dictionary are incorrect

because the translations in various distinct Chinese cultures are different. For

example, in the query “mad cow disease”, the query is translated to 瘋牛病 by

TQUT. This translation is in use in mainland China and Hong Kong, but in our

document collection, as they are from Taiwan, it should be translated to 狂牛症

or to 狂牛病. We also find the same problem for the term “syndrome” in query 24. Its translation is 症候群 in Taiwan, but the dictionary gives 併發症狀 and 綜合症狀, which are used in Hong Kong and mainland China. With those inappropriate

translations, those queries have very low precision; thus we cannot possibly match the Mono performance.

Table 4-5 Retrieval performance on queries that contains OOV terms only

Average precision  Percentage of Mono run

Mono 0.4134

SimpleSelect 0.2149 52.0%

TQUT 0.2946 71.3%

4.5.2 SimpleSelect

The performance of SimpleSelect, which achieved a precision of 0.2482, was much better than IgnoreOOV, reaching 66.8% of the Mono performance. It is quite clear that some of the English OOV terms are found and translated into Chinese correctly.

The results of the 31 queries that have OOV terms are given in Table 4-5. From

Table 4-5, we can see that the precision of Mono is 0.4134 and the precision of

SimpleSelect is 0.2149, which is 52.0% of Mono's precision. This indicates that by just choosing the longest and highest-frequency terms as the translations of OOV terms, the performance is actually lower than dictionary lookup. The


performance is quite close to the performance of looking up a dictionary without

translation disambiguation technology reported by other researchers. However,

some of our results show that this approach is quite useful in looking up proper

names. Because there is no standard for name translation in Chinese, it is quite

common that a person’s name might be translated into different forms with

similar pronunciations (akin to phonetic forms). Different people may choose different translations due to their customs. As our test collection contains articles

from four different news agents, if we only choose one of the translations, we

may not retrieve all the relevant documents.

For example, in query 12, the precision of SimpleSelect is 0.3528 and the precision

of Mono is 0.0508 which means SimpleSelect’s performance is vastly superior to

Mono. This is a notable performance boost. The English OOV term in query 12 is

Jennifer Capriati (the name of a tennis player). The translation given by the human expert is 卡普莉雅蒂. The translations from our approach are 卡普裏亞蒂, 卡普莉雅蒂,

卡普裏雅蒂 and 雅蒂. They are all correct translations. It is clear that we miss

many relevant documents when we only use the translation 卡普莉雅蒂. When

we take a deeper look into the collection, actually three out of the four news agents have sports news. And those three news agents use three different translations

for Jennifer Capriati. These translations are 卡普莉雅蒂 in the mhn, 凱普莉雅蒂 in the ude and 卡普莉亞蒂 in the udn. Obviously, our translated query takes advantage of adding 雅蒂. Because we use a character-based index for our collection, the documents containing 雅蒂 will include the documents that contain both 卡普莉雅蒂 and 凱普莉雅蒂. Therefore, although we cannot find the correct translation 凱普莉雅蒂, we can still retrieve the documents that contain 凱普莉雅蒂 by using 雅蒂.

4.5.3 TQUT

Table 4-6 OOV translation accuracy for NTCIR5 collection

Correct Accuracy (%)

TQUT 25 65

SimpleSelect 20 51

Table 4-6 shows that by using translation disambiguation technology in web translation extraction, we can get more accurate translations than previous approaches. We achieve 65% translation accuracy while the simulation of the previous approach only achieves 51%. The IR performance of the disambiguated queries achieved 79.3% of Mono, which is 0.2978. If we only look at the results of the 31 queries that contain OOV terms, the precision is 0.2946, which is 71.3% of Mono's precision. This result is much higher than that of SimpleSelect, which is only 52% of Mono. There are 39 OOV terms over the 50 queries. Translations for 31 of the OOV terms can be found using our proposed approach, and 20 of the translations are exactly the same as the human translations, an accuracy of about 65%.

There are several reasons why we cannot reach 100% precision. The first is the difference in translation conventions described earlier. Since we can control neither where the web search engine collects its documents nor for whom it returns them, we cannot guarantee that the extracted translation will suit the target collection. For example, we may find a translation for an OOV term on the Internet, but that translation may be the one used in Hong Kong and thus unsuitable for a collection from Taiwan. The translation of Kursk is a good example. Our web translation extraction system returns only one translation, 庫爾斯克, for Kursk. This shows that most documents on the Internet use 庫爾斯克 as the translation of Kursk; however, the NTCIR5 collection uses 科斯克. This kind of mismatch is very hard to avoid even for human interpreters. A good example is the translation of National Council of Timorese Resistance. We believe 帝汶抵抗全國委員會 (from our web translation extraction system) and 東帝汶人抗爭國家委員會 (from the NTCIR human translation) are both correct; the difference between the two comes from different translation conventions. However, when the two translations are used as queries, our IR system returns no documents. This means the NTCIR5 collection uses yet another translation for National Council of Timorese Resistance, namely 東帝汶全國反抗會議.

Another reason we cannot reach 100% precision is that our web translation extraction system does not consider the query context. As described before, we only submit the OOV terms themselves to the web search engine, which may yield a translation suited to a different context. For instance, in query 36 we are looking for articles about the use of a robot for remote operation in a medical setting. "Remote operation" is an OOV term in this query. Our web translation extraction system returns 遠程操作服務 as its translation. Disregarding the query context, this is a correct translation, but it is correct only when used in computer science. If we do not consider query context, 27 of the translations are correct, about 87% accuracy. This result is close to the 85% achieved by the disambiguated dictionary translations.


Chapter 5

Web based collection profiling for collection fusion

Most current collection selection and fusion approaches are based on the assumption that the similarity measures between different peers are comparable. For example, the raw score merging approach assumes that the search engines used across the network are the same and that collection statistics are similar between collections. The query based sampling approach also assumes that the performance of different search engines is similar for all remote collections. However, "the incorporation of collection-dependent frequency counts in the document or query weights (such as idf weights) invalidates this assumption" (Voorhees, Gupta et al. 1994). In addition, in a p2p environment, peer collections are managed by various IR systems, whose retrieval performance can differ considerably for a given query. As a result, most existing collection selection and fusion approaches are not suitable for the peer to peer environment. Retrieval performance should be taken into consideration when selecting remote IR systems and merging their results. In order to capture both the content quality and the search engine retrieval quality of remote IR systems, user feedback can be used together with collection ranking approaches such as CORI. This section proposes a method that builds resource descriptions and retrieval performance estimates from users' feedback.

5.1 A simple example

Before describing the selection and merging approach, let us look at a simple example of retrieving documents from a distributed information system. For the purpose of illustration, suppose there are three remote collections, CA, CB and CC, and users have no prior knowledge of their content; each collection can be treated as a "black box". A user sends a query about art and computer science to these collections, and marks 10 returned documents from each collection as relevant. If we can obtain the user's feedback about the retrieved document topics (for instance, the 10 documents from CA are related to art, the 10 from CC are related to computer science, and 5 each from CB are related to art and computer science), it is reasonable to estimate that CA contains documents about art but not computer science, CC contains documents about computer science but not art, and CB contains both. Therefore, according to query topics and user feedback, we can construct collection content profiles.

This simple example also tells us that remote IR systems will have different retrieval performance on different topics. An IR system that mainly contains computer science documents will not perform well on art topics. User feedback can provide information about how well a remote IR system performs on particular topics. The idea behind our approach is that profiles of remote IR systems can be constructed from user feedback. Based on these profiles, collection fusion can be improved by considering not only the content description but also the retrieval quality.

5.2 Collection profiling

The collection profiling technique described here uses a matrix {p_i,j} to represent the historical performance of collections, where p_i,j is the average retrieval performance of remote collection i on topic class c_j.

The performance of a search engine (i.e., a collection) is usually measured by precision and recall. As most IR systems only return the top N results, the precision of the top N results, denoted P@N, is a reasonable measure for evaluating search engine performance. The average P@N of a collection measures how well a remote search engine performed in the past. However, the absolute average value cannot tell how good a collection is compared to other collections, because precisions for different queries are not comparable: a precision of 0.3 might be a good result for one query but a bad result for another. For example, if the average precision of all collections for a query is 0.1 and one collection achieves 0.3, that collection achieves a good result even though 0.3 by itself looks ordinary; but if the average is 0.5, then 0.3 is a poor result. From this simple example we can see that, for a query, the difference between the precision achieved by a collection and the average precision over all collections indicates how well the collection performs for that query. Suppose that we have sent n queries to a remote IR system, p_i is the precision for the i-th query achieved by this remote system, and p̄_i is the average precision for the i-th query achieved by all remote systems or collections. We can calculate the performance of the remote system by the following equation:

    P = ( Σ_i (p_i − p̄_i) ) / n        (8)

Because p_i and p̄_i are precisions, we have 0 <= p_i <= 1 and 0 <= p̄_i <= 1, and therefore -1 <= P <= 1.

It is clear that the performance of a remote collection reflects not only the quality of the remote collection's content but also the quality of its search engine. When two remote peers use the same search engine, the peer with better quality content will have better performance; when two peers hold the same document collection, the peer with the better search engine will have better performance.


For real world peer to peer systems there is no standard relevance judgement, so the average precision required by equation (8) is not directly available. However, user feedback can be used as the judgement for determining the relevance of documents. For example, if a user downloads a document from the result list or spends some time reading it, the document can be considered relevant to the query. In addition, users are likely to examine only the results at the top of the result list; therefore, it is reasonable to use the precision of the top-ranked documents to represent the performance of remote search engines. Suppose that the size of the result list is m, N_i is the number of documents in the list for the i-th query that are read or downloaded by the user, and N̄_i is the corresponding average over all remote systems. The performance of a remote system can then be calculated by the following equation:

    P = ( Σ_i (N_i − N̄_i) ) / (m × n)        (9)
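Both performance measures amount to averaging how far a collection's per-query result sits above or below the cross-collection mean. The sketch below illustrates this; the precision and click counts are illustrative, and reading N̄_i as the cross-collection average of N_i is an assumption of this sketch:

```python
def performance_from_precision(p, p_bar):
    """Equation (8): average difference between a collection's per-query
    precision p[i] and the all-collection average precision p_bar[i]."""
    n = len(p)
    return sum(pi - pbi for pi, pbi in zip(p, p_bar)) / n

def performance_from_feedback(N, N_bar, m):
    """Equation (9): the same idea, with precision inferred from user
    feedback: N[i] documents read or downloaded out of a result list of
    size m, compared with the assumed all-collection average N_bar[i]."""
    n = len(N)
    return sum(Ni - Nbi for Ni, Nbi in zip(N, N_bar)) / (m * n)

# A collection that consistently beats the average gets a positive score.
P = performance_from_precision([0.3, 0.4], [0.1, 0.2])   # 0.2
```

A collection performing exactly at the average over all queries scores 0; scores fall in [-1, 1] as stated above.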

5.3 Query classification

The profiling system described in this thesis does not use the average performance over all past queries; it measures past search performance separately for several topic classes. Unlike digital libraries that cover all topics, personal document collections in a peer to peer environment usually focus on specific topics. As described in section 5.1, a profiling system that uses a uniform query performance is not suitable for such an environment. Some researchers have studied extracting topic classification rules from the internet (King and Li 2003). However, their studies are based on western languages, and their work builds a complex tree structure to describe the relationships between topics. Such approaches are effective on homogeneous collections; in a p2p environment, however, the documents are usually heterogeneous. In addition, there are limited text classification resources for Asian languages.

According to our initial investigation, many news websites classify their news into groups based on topics. For example, Yahoo news Taiwan (http://tw.news.yahoo.com/) groups its Chinese news into 12 topics, and Google news Taiwan (http://news.google.com/news?ned=tw) groups its news into 9 topics. These websites also provide powerful news search features. When searching Yahoo news Taiwan, a list of news items containing the query term is returned together with the source of each item, its catalogue and a summary. Google news Taiwan does not return catalogue information, but it lets the user search within a specific catalogue. Mining such information can reveal the topics to which a query term is related. For example, when searching for the term "雅虎" (Yahoo) in Yahoo news, most of the returned news is about computer science. We can therefore determine that the term "雅虎" (Yahoo) has a strong relationship to computer science but little relationship to, for example, arts. As a result, with the help of such news sites, we can discover the relationship between a query term and a topic, and this information then helps identify query topics. For example, searching for "Linux" in Yahoo news Taiwan returns 28 news items: 20 under topic "SCI/TECH", 1 under "world news", 3 under "financial" and 4 under "education". This indicates that the term "Linux" has a 72% chance of belonging to topic "SCI/TECH", 4% to "world", 10% to "financial" and 14% to "education". By searching for the term in all the catalogues of Google news Taiwan and counting the number of results returned from each catalogue, we can likewise estimate the percentage that a term belongs to a particular catalogue.

Let C = {c_1, c_2, …, c_|C|} be a set of predefined classes, i.e. each c_j ∈ C is a class or catalogue, and let N(w_i, c_j) be the number of news items in class c_j returned when querying term w_i on a news web site. The probability that the term w_i belongs to topic class c_j can be calculated by the following equation:

    P(c_j | w_i) = N(w_i, c_j) / Σ_{c_k ∈ C} N(w_i, c_k)        (10)

Let W = {w_1, w_2, w_3, …, w_|W|} be a set of Chinese terms, and let Q = {w_1, w_2, …, w_m} represent a query that contains m terms. The probability that Q belongs to a topic c_j can be calculated by:

    P(c_j | Q) = P(c_j | w_1 w_2 … w_m)        (11)

Suppose that the occurrences of the terms w_1, w_2, …, w_m are independent (Liu, Yu et al. 2002; Zhao, Shen et al. 2006). Then for all w_i ∈ W, i = 1, …, m, and c_j ∈ C:

    P(c_j | Q) = P(c_j | w_1 w_2 … w_m) = Π_{i=1}^{m} P(c_j | w_i)        (12)

Equation (12) can be used to calculate the probability that a query Q belongs to a topic c_j. The topic c ∈ C of a query Q = {w_1, w_2, …, w_m} is then determined by the following equation:

    P(c | Q) = max_{c_j ∈ C} P(c_j | Q)        (13)
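Equations (10), (12) and (13) together amount to a naive-Bayes-style classifier over per-class news hit counts. A minimal sketch, with illustrative counts in the style of the "Linux" example above (the second term's counts are invented for illustration):

```python
from math import prod

def term_topic_probs(counts):
    """Equation (10): P(c_j | w_i) from the number of news items
    N(w_i, c_j) returned for term w_i under each class c_j."""
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}

def classify(query_counts):
    """Equations (12)-(13): assuming term independence, multiply the
    per-term class probabilities and pick the highest-scoring class."""
    per_term = [term_topic_probs(c) for c in query_counts]
    classes = per_term[0].keys()
    scores = {c: prod(p.get(c, 0.0) for p in per_term) for c in classes}
    return max(scores, key=scores.get)

# Per-class hit counts for two query terms; the first row uses the
# "Linux" counts reported above, the second is hypothetical.
q = [
    {"SCI/TECH": 20, "world": 1, "financial": 3, "education": 4},
    {"SCI/TECH": 15, "world": 2, "financial": 1, "education": 2},
]
topic = classify(q)   # "SCI/TECH"
```

The dominant class of each term's hit counts drives the product, which is exactly how the 小牛 (Dallas Mavericks) sense can dominate a query, as discussed in section 6.1.1.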


5.4 Collection fusion

In an uncooperative p2p environment, the collection statistics may not be available and the scores of the documents returned from different collections are not comparable, so it is difficult to merge the results returned from different collections. In this section, we propose new merging strategies to solve this problem. Generally, the retrieval process proceeds as follows. First, the user's query Q is classified using equation (13) and then broadcast to all the remote peer IR systems. When results are returned from the peer IR systems, they are merged according to the query catalogue and the collection profiles. We propose a merging method called the sorted round robin strategy, which incorporates the collection profile into the standard round robin method to enhance the quality of result merging.

The basic idea of round robin merging is to interleave the result lists returned from the remote peers. In each pass, the top remaining document in each peer's list is popped and appended to the final result list. The order in which peers are visited is usually the order of the peer collection IDs. Obviously, this basic round robin approach does not consider the performance of remote peers. In the worst case, the results from the worst peer are popped first and the results from the best peer last, so irrelevant documents appear at the top of the merged result list and the distributed retrieval performance is harmed significantly. Furthermore, remote systems have different retrieval performance on different topics, yet traditional round robin always visits the remote systems in the order of their collection IDs. Even if the order of documents returned from each remote system were optimal, the quality of the merged result still could not be guaranteed because of the fixed order of the remote IR systems.

For the above reasons, we propose a modified round robin approach called the sorted round robin merging strategy. Instead of merging results in a fixed order of remote systems, we dynamically change the order of remote systems based on the query class and the previous performance of the remote systems. In other words, the order in which remote systems are visited is determined by the matrix {p_i,j}. The merging strategy can be described by the following steps:

1. Determine the query class c_j using equation (13).
2. Sort the collections by {p_i,j}.
3. Use the round robin strategy to merge the results, based on the collection order generated in step 2.
4. Repeat steps 1-3 for each input query.
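Steps 2 and 3 above can be sketched as follows; the profile matrix and result lists are illustrative:

```python
def sorted_round_robin(results, profile, topic):
    """Merge per-collection result lists by round robin, visiting
    collections in descending order of their historical performance
    profile[collection][topic] for the query's topic class."""
    order = sorted(results, key=lambda cid: profile[cid][topic], reverse=True)
    merged = []
    depth = max(len(r) for r in results.values())
    for rank in range(depth):                 # one pass per rank position
        for cid in order:                     # best historical performer first
            if rank < len(results[cid]):
                merged.append(results[cid][rank])
    return merged

results = {"A": ["a1", "a2"], "B": ["b1", "b2"], "C": ["c1", "c2"]}
profile = {"A": {"sports": 0.1}, "B": {"sports": 0.6}, "C": {"sports": -0.2}}
merged = sorted_round_robin(results, profile, "sports")
# ["b1", "a1", "c1", "b2", "a2", "c2"]
```

Collection B, the historically best performer on this topic, contributes the first document of the merged list, unlike plain round robin, where collection ID order decides.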


With this merging strategy, we attempt to optimize the order of the remote IR systems regardless of the type of query. The collection that previously performed best for the query class is always visited first, so the quality of the merged results can be guaranteed.

We also propose another merging strategy called Sorted Rank. Round robin merging is a one-by-one strategy: with n remote systems, the second document of the first visited remote system lands at position n+1 in the merged result list. Intuitively, however, a more important system should contribute more documents to the top of the merged result list than a less important one. The basic idea of the Sorted Rank strategy is to re-calculate the scores of the documents returned from the remote collections, based on the historical performance of those collections and their original rankings, and then merge the documents by the newly calculated scores. A simple way to compute a document score that follows this intuition is to combine the collection performance p_i,j with the original rank r_i given by collection i. We designed the following equation to compute the score of a document from collection i for topic catalogue c_j:

    score(r_i, c_j) = (a − r_i) × (1 + p_i,j)        (14)

where r_i is the rank of the document as returned by collection i, c_j is the catalogue that the query belongs to, p_i,j is the historical performance of collection i for topic catalogue c_j, and a is a threshold. According to our experiments, a = 11 gives the best results. Because -1 <= p_i,j <= 1, we use (1 + p_i,j) as the historical performance factor to avoid negative values. Finally, the documents are sorted by the calculated scores.


Chapter 6

Profiling and fusion evaluation

Several experiment sets were designed to evaluate the performance of the collection fusion approach. The first set evaluates the performance of the web-based query classification, the second evaluates the accuracy of the collection rank, and the last evaluates the performance of the collection fusion.

We conducted the experiments with 30 databases built from the NTCIR6 CLIR track document collections. The articles were evenly divided into 30 databases, each holding around 30,048 documents. In order to make the databases cover different topics, relevant documents on different topics were, according to the relevance judgments, manually placed in different databases. The 50 queries from the NTCIR5 CLIR task were used as the training set; that is, the collection profiles were created from those 50 queries, with P@20 used in profiling. The 50 queries from the NTCIR6 CLIR task were used in the evaluation.

Two search engines are used in the experiments. One is the search engine introduced in section 7.5; it is called M-GPX in the experiments. The other uses the same indexing strategy as M-GPX but a simple Boolean model with a tf-idf weighting scheme; it is called System1 in the experiments. The document score in System1 is calculated by the equation below:

    D_score = Σ_i tf_i × idf_i        (15)
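System1's scoring amounts to a plain tf-idf sum over the matched query terms. A minimal sketch, where the idf definition idf_i = log(N / df_i) is an assumption (the thesis does not spell out its exact form) and the counts are illustrative:

```python
from math import log

def tfidf_score(query_terms, doc_tf, df, n_docs):
    """Equation (15): D_score = sum over matched query terms of
    tf_i * idf_i, taking idf_i = log(N / df_i) as an assumed form."""
    score = 0.0
    for t in query_terms:
        if t in doc_tf and df.get(t):
            # Frequent-in-document, rare-in-collection terms score highest.
            score += doc_tf[t] * log(n_docs / df[t])
    return score

score = tfidf_score(["news", "sports"],
                    {"news": 3, "sports": 1},     # term frequencies in the doc
                    {"news": 100, "sports": 10},  # document frequencies
                    1000)                          # collection size
```

Such a scheme uses only local collection statistics, which is precisely why scores from different System1 peers are not directly comparable, as discussed in section 5.4.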

6.1 Query classification

Web based query classification is the first step and one of the key components of our collection fusion approach; if it cannot produce good enough results, the rest of the approach is meaningless. Therefore, in the first experiment we evaluate the performance of the query classification.

The 100 queries from NTCIR5 and NTCIR6 were used in this experiment. The classification results produced by our web based classification method were compared with classifications made by 5 human experts.

6.1.1 Discussion

In this experiment, the accuracy of the query classification made by our classification method is 89%. It is almost impossible to reach 100% accuracy because even humans may disagree when classifying some queries. If we only look at the Title field of a query, different people may reach completely different results. Take NTCIR6 query #20, "Y2K problem": this query can belong to topic class 'Science/Tech' if we are looking for the definition of the Y2K problem, but it can also belong to topic class 'Financial' if we are looking for something about the impact of the Y2K problem.

Term ambiguity also affects accuracy, especially in Chinese. For example, NTCIR6 query #75 is "birth, cloned, calf". From the English query we would most likely classify it under 'Science/Tech'. However, the Chinese translation of this query is 誕生 (birth), 複製 (cloned), 小牛 (calf). The term 小牛 often refers to the NBA team Dallas Mavericks in Chinese, and we can expect many more news items about the Dallas Mavericks than about calves, because sports news is generally hotter than science news. In the experiment, when searching for the term 小牛 in Yahoo news, 91% of the returned news belongs to topic class 'Sports'. The term 誕生 is a general term: 12% of its returned news belongs to 'Science/Tech' and 9% to 'Sports'. Although the other key term, 複製, is clearly a 'Science/Tech' term, only 66% of its returned news belongs to 'Science/Tech'. As a result, the 'Dallas Mavericks' sense dominates the query and causes it to be classified as 'Sports'.


6.2 Collection Rank Experiment

Collection rank is the most important part for collection selection and collection

fusion. It is also the key in the collection profiling technique described in this

thesis. In the experiment, we used two information retrieval systems to create

two different profiles and compare the two profile files together with the

contents of the collections. In the evaluation, we sort the collections by the

number relevant documents in each topic class and compare the collection ranks

calculated by different retrieval systems.

6.2.1 Discussion

In these experiments, we try to compare between collection rank and collection

quality. The quality of collections is determined by the number of relevant

documents under each topic class. As all the 30 systems use the same search

engine, if the search engines used are good enough, the collection rank given by

the collection profile should match the collection quality.


Table 6-1 Collection Rank Top 5 – Financial news (財經新聞)

Collection #   Relevant docs   Rank by # relevant docs   Rank by M-GPX   Rank by System1
6              113             1                         1               13
3               94             2                         2               11
5               84             3                         3                4
2               71             5                         4                9
1               76             4                         5                3

Table 6-2 Collection Rank Top 5 – International (國際新聞)

Collection #   Relevant docs   Rank by # relevant docs   Rank by M-GPX   Rank by System1
25             337             1                         1                1
24             230             2                         2                2
21             184             6                         3                3
26             209             3                         4               13
20             149             8                         5                9

Table 6-3 Collection Rank Top 5 – Science/Tech (科技新聞)

Collection #   Relevant docs   Rank by # relevant docs   Rank by M-GPX   Rank by System1
2              240             1                         1                1
6              127             5                         2                2
7               83             7                         3                5
3              193             3                         4                3
1              227             2                         5               14

Table 6-4 Collection Rank Top 5 – Sports (運動新聞)

Collection #   Relevant docs   Rank by # relevant docs   Rank by M-GPX   Rank by System1
10              55             1                         1                1
11              36             3                         2                3
8               25             4                         3                2
12              24             5                         4                7

From the results above we can see that the collection rank given by M-GPX closely matches collection quality, while System1 does not provide such good collection ranks on some topic classes. This result indicates several things:

1. M-GPX is good enough to find relevant documents across collections.
2. Our collection profiling algorithm can produce good quality collection ranks, which will help collection selection and collection fusion.
3. The performance of System1 is not good enough to find financial news, as the collection rank given by System1 under 'Financial' does not match collection quality.

Although System1 is not good enough, we still cannot say the collection rank given by System1 is incorrect. The nature of our profiling system means that the top ranked collections are not necessarily the collections that contain the most relevant documents; they are the collections whose retrieval systems return the largest number of relevant documents. As we emphasized in Chapter 3, we should consider collection content and retrieval system together: if a remote system cannot retrieve any relevant documents, we should not consider it a good source even if the collection content is good.

6.3 Collection fusion experiment

Several runs were conducted in the experiments which are defined as follows:

• Centralized: all documents are located in a central database. This is the

baseline to all other runs.

• Round robin (RR): results are merged using the standard round robin

method. The order of visiting result lists is the order of collection id.

• Sorted round robin (SRR): results are merged using sorted round robin

method described in section 5.4. The order of visiting result lists is the

order of corresponding collection rank (OP).

• No classification Sorted round robin (NSRR): this run is a comparison to

SRR. In this run, we use the same sorted round robin approach as SRR but

we do not consider query classification. That is, the collection rank is

calculated based on the collection’s pervious performance and all queries

are under the same topic class.

109

• Sorted rank (SR): results are merged using sorted rank method described

in section 5.4. Document scores are calculated by equation (14 and

ascending sorted.

• No classification Sorted rank (NSR): this run is a comparison to SR. In this

run, we use the same sorted rank approach as SR above but we do not

consider query classification. That is, the collection rank is calculated based

on collections’ pervious performance and all queries are under same topic

class.

NSRR and NSR only run under M-GPX.

6.4 Discussion

From Table 6-5 and Table 6-6 we can see that M-GPX is a better retrieval system than the simple System1. In the centralized environment, the average precision of M-GPX is 0.2653 while System1 achieves only 0.2202. The average precision figures in Table 6-5 and Table 6-6 also show that SR is the most effective way of merging distributed results, while the standard round robin method produces the lowest precision. Taking the centralized system's precision as the baseline, the precision produced by SR is around 9% higher than SRR's and around 13% higher than RR's, and the precision of the SRR approach is about 4% higher than RR's. As the only difference between SRR and RR is the order in which the collections are visited, it is easy to conclude that sorting the returned results according to the importance of the collections improves precision.

Table 6-5 Average Precision – System1

RR RR/Central SRR SRR/Central SR SR/Central Central P5 0.2960 67.89% 0.2960 67.89% 0.3010 69.04% 0.4360 P10 0.2860 70.79% 0.2860 70.79% 0.2840 70.30% 0.4040 P15 0.2667 70.43% 0.2693 71.11% 0.2773 73.22% 0.3787 P20 0.2500 71.43% 0.2510 71.71% 0.2720 77.71% 0.3500 P30 0.2047 64.64% 0.2047 64.64% 0.2500 78.94% 0.3167 All 0.0900 40.87% 0.1001 45.46% 0.1237 56.18% 0.2202

Figure 6–1 P-R curves of 4 runs under System1


Table 6-6 Average Precision – M-GPX

        RR      RR/Central  SRR     SRR/Central  SR      SR/Central  Central
P@5     0.1640  31.54%      0.2360  57.70%       0.3680  70.80%      0.5200
P@10    0.1820  39.74%      0.2780  66.40%       0.3340  72.90%      0.4580
P@15    0.2107  49.23%      0.2880  68.90%       0.3160  73.80%      0.4280
P@20    0.2420  60.50%      0.2930  71.80%       0.3050  76.30%      0.4000
P@30    0.2760  77.20%      0.2760  77.20%       0.2847  79.70%      0.3573
All     0.1378  51.94%      0.1447  56.50%       0.1715  64.60%      0.2653

Figure 6–2 P-R curves of 4 runs under M-GPX


Under M-GPX, looking only at P@5, SRR is over 13% better than RR. The P-R curves in Figure 6–2 clearly show that the precision of SRR is much higher than RR at the top of the result list; the performance gain of SRR mostly comes from this high early precision. SR gives nearly a 10% improvement over SRR. This is because the SRR method sorts the collections simply according to their importance, so the results returned from the distributed collections are still evenly interleaved in the merged result list. In SR, the more important a collection is, the more of its documents appear at the top of the merged result list. The P-R curves clearly show that SR is much better than SRR at pulling relevant documents to the top of the result list. Table 6-5 and Table 6-6 show the same pattern: the precisions get closer from P@5 to P@30. At P@5, SR is about 10% better than SRR, but at P@30 it is only about 2% better.

The experiments under System1 show similar results: SR performs best, followed by SRR, then RR; however, SRR and RR perform quite similarly. Table 6-5 shows that SRR and RR have the same precision at P@5 and P@30, with SRR performing slightly better at P@15 and P@20; that is why SRR performs about 5% better than RR overall. Because we have 30 collections and the only difference between RR and SRR is the order in which the results are interleaved, they necessarily have the same precision at P@30. The SRR algorithm tries to put results from good quality collections at the top, so P@15 and P@20 in SRR are better than in RR. The reason the SRR and RR results are so close is that System1 is not good enough to distinguish the collections, whereas under M-GPX we can clearly see the difference between RR and SRR at the top of the results.

In this experiment set, we also compared the performance of the system with query classification against the system without it. The results are shown in Table 6-7.

Table 6-7 Query classification vs. no classification

With Classification   Without Classification
0.1499 (SRR)          0.1409 (N-SRR)
0.1715 (SR)           0.1506 (N-SR)

Table 6-7 clearly shows that query classification helps improve collection fusion performance. Tables 6-1 to 6-4 have already shown that the different test collections focus on different topics; e.g. collection #10 is best for sports news while collection #5 is best for financial news. As discussed in Chapter 5, collection fusion should consider not only the overall performance but also the performance under different query topics.

In theory, N-SRR should perform similarly to RR, but in our experiment it is about 4% better. This is because about 40% of our queries are classified as 'International'; the merging order in N-SRR therefore benefits all the 'International' queries, giving N-SRR a 4% improvement over RR.


Table 6-7 also confirms that SR is a better collection fusion strategy than SRR: even without query classification, SR still provides better retrieval performance than SRR with query classification.

Although the centralized collection produces the highest precision, SRR and SR still provide considerably better performance than the standard round robin method.


Chapter 7

The P2PIR System architecture

The P2PIR system is quite distinctive in comparison with conventional P2P IR systems. Figure 7–1 below demonstrates a sample scenario of a new user joining the P2PIR system and completing a search within the network.

Figure 7–1 System overview

In Figure 7–1, SIG member A (i.e., client A) registers herself on the directory web server and downloads the email addresses of all other SIG members. Some time later, A broadcasts a query to all other SIG members.


Members B and D do not have any content matching the query, so only client C responds. The architecture of the P2PIR system is very much like that of existing P2P systems, but the communication channel is email, which is a unique feature of our system.
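The query broadcast in Figure 7–1 could be sketched as below. The addresses, subject line and custom header name are hypothetical, and actual delivery (e.g. via smtplib) is omitted; this only shows how per-member query messages might be constructed:

```python
from email.message import EmailMessage

def build_query_messages(sender, members, query, query_id):
    """Construct one query email per SIG member. The custom header
    X-P2PIR-Query-Id (a hypothetical name) would let responders tie
    their result messages back to the originating query."""
    messages = []
    for member in members:
        msg = EmailMessage()
        msg["From"] = sender
        msg["To"] = member
        msg["Subject"] = "P2PIR query"
        msg["X-P2PIR-Query-Id"] = query_id
        msg.set_content(query)      # the query text travels as the body
        messages.append(msg)
    return messages

# Client A broadcasts a query to members B, C and D.
msgs = build_query_messages("a@example.org",
                            ["b@example.org", "c@example.org", "d@example.org"],
                            "Jennifer Capriati", "q-001")
```

Because each message is an ordinary email, it traverses firewalls and waits in offline members' mail boxes, which is exactly the asynchronous behaviour discussed below.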

The P2PIR system is designed to exchange data via a third party agent that can get through firewalls; this is of paramount importance, since the inability to do so renders collaborative IR almost impossible in today's enterprise network structures. For security reasons, most companies and large organizations only provide private IP address space for internal computers, with proxy servers providing internet access. It is therefore almost impossible to make a direct connection between internal and external computers. Port forwarding and UPnP can solve this problem, but they can cause serious security issues, e.g. revealing sensitive data, and they are a nightmare for network administrators. As a result, an agent is chosen for data exchange between internal and external computers; the structure of the network is shown below.


Figure 7–2 The P2PIR system network structure

The P2PIR system chooses email as the third party agent. There are a few good reasons for choosing the email infrastructure as our communication medium, explained below.

Firstly, the P2PIR system is intended to support offline information retrieval

applications which cannot be performed securely by existing distributed IR

systems over a public domain network. For example, a user wishes to find some private documents that may be available from an external partner. This is commonly done by phone enquiry, fax, or email: the user first asks the partner to search their private collection using a local search engine. Having found some relevant material, the partner then sends back the documents via mail, fax, or email, using some secure mechanism (e.g. encryption). This is an expensive operation, but it is even more expensive if you consider that it may


simultaneously involve many independent partners. The P2PIR system is designed

to facilitate the automation of the entire operation, without surrendering privacy

and security.

Secondly, by choosing email as the communication medium, we trade off real-time response for features that are more important to the intended IR applications. Importantly, email is delivered asynchronously. Many users do not have a permanent direct connection to the internet, but email is eventually delivered. Even when a user is offline, the email agent can guarantee the delivery of search requests and search results. This feature is particularly helpful when the communication is between users in different time zones: the two parties are unlikely to be connected to the internet at the same time if they only turn on their computers during business hours. In most synchronous P2PIR systems, a distributed search can only be performed while the other group members are online; otherwise nothing can be found. With email as the communication agent, when the remote group partners are not online the search requests are saved in their mailboxes; when they connect to the internet, the requests are processed and the results are sent back to the requester’s mailbox.

Thirdly, email is universally accessible, reliable and stable. Free email storage and

communications services are readily available. This makes email particularly


attractive in support of a free public domain P2PIR system. Moreover, email is by design allowed through firewalls: network administrators need make no changes to their network settings, so the use of the P2PIR system imposes no additional security burden. Additionally, an email address is universal and unique across the internet, which makes it easy to identify users, since email addresses are static and stable. IP addresses are also unique, but not all users can obtain a static IP address. Locating email users is straightforward and much more reliable than locating a volatile IP address.

Lastly, email is potentially faster than a direct peer to peer connection. This is true when people use their ISP’s email addresses. An ISP has a much faster connection to the external network than an individual user, so it can send the same amount of data in less time, and there is little congestion between a user and their ISP. As a result, when both parties use ISP email accounts, it can be faster to send data via email than over a direct connection, especially when sending large files across continents, e.g. from Australia to South America. Few people notice how fast email is nowadays; there is virtually no delay. The experiments in Chapter 8 show that when an email is sent from one end, the other party receives it almost immediately.


7.1 System Design

The P2PIR system is designed to present the users with a customary search engine

GUI. Each user, who is a member of a SIG that collaborates on the P2PIR system,

is able to issue a query request to another member of the group, to a subset of

members in the SIG, or to the entire SIG. The query is then distributed to the

targeted SIG members. Upon (asynchronous) receipt each group member’s

system acts upon the request and searches the local and possibly private

collection, subject to security settings and constraints. Results, if any exist, are

then returned to the query originator – again, subject to security constraints and

settings the result may be returned automatically, if the security classification of

the documents permits, or may be assembled, but require further authorization

by the owner before being sent. It is also possible to enforce the establishment of

a separate secure communication channel (e.g. using SSL) in order to exchange

the data.

The system scales well because each new SIG member brings with them their data collection and their storage and processing capacity, and there is no need to crawl and

globally index the collection. All search operations are local. A search can only

find existing documents (no “dangling pointers”), and indexes can be up to date

immediately following the addition of new documents, without the typical update


lag for documents on the WWW. More details about the system will be given in a

later section in this thesis.

7.2 The P2PIR system search lifecycle

Searching for information within the P2PIR system consists of several steps. Figure 7–3 shows the dataflow of making a search request. The user interface is similar to a conventional WWW centralized search

engine. Users can input search queries and view results. When a user searches for information, the request goes to the UI (user interface) module and is then passed to the security module. Depending on the security requirements, the security module sends either plain text (for a public domain search) or encrypted text (for a secure SIG search) to the collection selection module. Details of the security module are discussed in 7.4. The collection selection module selects the suitable remote peers and passes their email addresses to the communication module. A request mail is then created and sent to the email server. At the same time as the query is sent to the security module, it is also sent to the local multilingual search engine, and local results are displayed in the UI immediately.
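The request dataflow described above can be sketched as a simple pipeline. This is an illustrative sketch only: the module and function names (security_module, collection_selection, and so on) are hypothetical stand-ins for the P2PIR modules, the "encryption" is a placeholder, and peer selection uses a toy profile-overlap criterion.

```python
# Illustrative sketch of the request dataflow of Figure 7-3.
# All names are hypothetical; this is not the P2PIR implementation.

def security_module(query: str, secure_sig: bool) -> str:
    # Plain text for public domain search; the "encryption" is a placeholder.
    return f"<encrypted>{query}</encrypted>" if secure_sig else query

def collection_selection(peers: dict, query: str) -> list:
    # Select remote peers whose toy profile shares a term with the query.
    terms = set(query.lower().split())
    return [addr for addr, profile in peers.items() if terms & profile]

def communication_module(addresses: list, payload: str) -> list:
    # Build one request mail per selected peer (actual sending is omitted).
    return [{"to": addr, "body": payload} for addr in addresses]

def make_search_request(query: str, peers: dict, secure_sig: bool = False):
    payload = security_module(query, secure_sig)
    selected = collection_selection(peers, query)
    mails = communication_module(selected, payload)
    # The local multilingual engine runs in parallel with the remote search.
    local_results = [f"local hit for '{query}'"]
    return mails, local_results

peers = {"b@example.org": {"xml"},
         "c@example.org": {"retrieval", "xml"},
         "d@example.org": {"music"}}
mails, local = make_search_request("xml retrieval", peers)
```

Here only peers B and C are selected, mirroring the Figure 7–1 scenario in which peer D holds no matching content.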


Figure 7–3 Dataflow of Making Search Request

Figure 7–4 shows the dataflow of receiving a search request. When a remote search request mail arrives, it is received by the communication module and then passed to the security module. The security module checks whether the request meets the security requirements according to the type of request: public domain or secure SIG search. Valid requests are passed to the search engine module. If the request sender is not in the secure group, the search engine searches within the public documents and returns the query results; otherwise, private results may be returned. When the search result is ready, it is passed to the security module, a response mail is created, and the mail is sent to the email server. These results contain summary information only. They can be more or less informative depending on security needs; a result may be as little as a notification that a result exists.

Figure 7–4 Dataflow of Response to Search Request

Figure 7–5 shows the dataflow of receiving results. When the response mail

arrives, the mail is received and sent to the security module. The security module

verifies the results. Valid results are passed to the collection fusion module, which merges all the remote results into a single list. The list is then displayed in the UI with document names and summaries.


Figure 7–5 Dataflow of Receiving Results

When a client wishes to retrieve a result document, she sends the specific

document request. The document owner process will then check if the client has

the right to view the document. A document response is returned to the valid

client, and then the client can read the document. In order to keep private information secure, the query, the results and the document request/response are encrypted, preserving confidentiality between the clients.

7.3 Communication protocol

The communication protocol is one of the core components of a peer to peer system. As discussed before, the physical communication between P2PIR system clients is via email. POP3 (Post Office Protocol Version 3) (IETF 1996) and IMAP4 (Internet Message Access Protocol Version 4) (IETF 2003) are two of the most popular protocols for retrieving email from a mail server. SMTP (Simple Mail Transfer Protocol) is the de facto standard for sending email to a mail server (IETF 2001). As these protocols are so widely known, the details of how to use them to communicate with mail servers are omitted.
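For illustration, the sketch below shows how a search-request mail could be assembled, sent via SMTP and fetched via POP3 with standard client libraries. It is not the P2PIR implementation: the server names, account details and XML payload are placeholder assumptions.

```python
# Sketch of the email transport layer; hosts and credentials are placeholders.
import smtplib
import poplib
from email.message import EmailMessage

def build_request_mail(sender: str, recipient: str, xml_payload: str) -> EmailMessage:
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = "P2PIR search request"
    msg.set_content(xml_payload)  # the XML protocol message travels as plain text
    return msg

def send_request(msg: EmailMessage, smtp_host: str = "smtp.example.org"):
    # SMTP delivers the request to the recipient's mail server.
    with smtplib.SMTP(smtp_host) as smtp:
        smtp.send_message(msg)

def fetch_requests(pop_host: str, user: str, password: str) -> list:
    # POP3 retrieves waiting request/response mails when a peer comes online.
    conn = poplib.POP3(pop_host)
    conn.user(user)
    conn.pass_(password)
    bodies = [b"\n".join(conn.retr(i + 1)[1]) for i in range(len(conn.list()[1]))]
    conn.quit()
    return bodies

mail = build_request_mail(
    "a@example.org", "c@example.org",
    "<DSearch ID='t1' Sender='a@example.org'><Query>xml retrieval</Query></DSearch>")
```

The asynchronous behaviour described above comes for free: `send_request` succeeds even when the recipient is offline, and the request simply waits in the mailbox until `fetch_requests` is called.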

The protocol used in the P2PIR system is handled through XML-coded messages, delivered as plain text within an email message. XML is chosen as the communication protocol because it is designed for storing self-describing data. It has been widely used for data exchange across all platforms, and it is powerful and flexible. As XML data is self-describing, it does not require fixed relational schemata or data type definitions designed in advance. It is easy to extend the protocol without making large changes to existing applications: the existing P2PIR system can work with a new version of the P2PIR system as long as the new protocol still carries all the information of the old protocol. This thesis shows the flexibility of the XML protocol. In this section, an XML protocol for plain text (public domain) communication is described in detail. Later, in 7.4, an extension of this protocol is introduced for encrypted data (secure SIG) communication.

The XML Schema of the protocol is shown in Figure 7–6. The root element of the XML document is ‘DSearch’, which has attributes ‘ID’ and ‘Sender’ and an optional attribute ‘Encrypt’. The ‘Sender’ attribute identifies the sender of the message by the sender’s email address. The ‘ID’ attribute is the unique identification of the search request; the P2PIR system uses a time stamp as the search ID. The ‘Encrypt’ attribute indicates whether the message is encrypted, and its default value is false.

Within the ‘DSearch’ element, there are three possible child elements: ‘Query’, ‘ResultList’ and ‘Document’. ‘DSearch’ may contain only one of the three.

• The ‘Query’ element contains the search query in plain text. The presence of the ‘Query’ element indicates that the message is a search query request.

• The ‘ResultList’ element represents a search result table, which contains two compulsory elements, ‘DocId’ and ‘Title’, and two optional elements, ‘Summary’ and ‘Score’. It is assumed that documents in the ‘ResultList’ are sorted by relevance to the query; that is, the first document in the ‘ResultList’ is the most relevant and the last one in the list is the least relevant. The ‘Title’, ‘Summary’ and ‘Score’ elements are supporting elements that help the user determine whether a document is relevant to the search query.

• The ‘Document’ element contains one compulsory element, ‘DocId’, and one optional element, ‘Content’. ‘Content’ carries the document itself. The document is encoded as base64Binary because binary data cannot be sent directly by email. Base64 encoding is one of the binary-to-text encoding schemes; by using it, any binary data can be converted into standard ASCII text and then sent via email.

When ‘DSearch’ contains only the ‘Query’ element, it represents a ‘search request’ message; the search query is in the ‘Query’ element. When ‘ResultList’ is the only element in ‘DSearch’, the XML file represents a ‘search result’ message. The P2PIR system can easily convert the ‘ResultList’ element into a table because it is in XML. When ‘Document’ is in the ‘DSearch’ element, the message is either a ‘document request’ or a ‘document result’ message. If ‘Content’ does not exist in the ‘Document’ element, the message is a ‘document request’ message: it asks the user to send the document that has the document id given in the ‘DocId’ element. When ‘Content’ is not empty, the message is a ‘document result’ message; the P2PIR system only needs to decode the ‘Content’ element and save it to disk, and then the user can open the document and read it.
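The three message types above can be illustrated with a small sketch that builds and classifies ‘DSearch’ messages. The element and attribute names follow the schema described here, but the helper functions themselves are hypothetical.

```python
# Illustrative construction and classification of 'DSearch' messages.
import base64
import xml.etree.ElementTree as ET

def make_query(sender: str, msg_id: str, query: str) -> str:
    root = ET.Element("DSearch", ID=msg_id, Sender=sender)
    ET.SubElement(root, "Query").text = query
    return ET.tostring(root, encoding="unicode")

def make_document_result(sender: str, msg_id: str, doc_id: str, data: bytes) -> str:
    root = ET.Element("DSearch", ID=msg_id, Sender=sender)
    doc = ET.SubElement(root, "Document")
    ET.SubElement(doc, "DocId").text = doc_id
    # Binary content must be base64-encoded before it can travel by email.
    ET.SubElement(doc, "Content").text = base64.b64encode(data).decode("ascii")
    return ET.tostring(root, encoding="unicode")

def classify(message: str) -> str:
    # A 'DSearch' element contains exactly one of Query/ResultList/Document.
    root = ET.fromstring(message)
    if root.find("Query") is not None:
        return "search request"
    if root.find("ResultList") is not None:
        return "search result"
    doc = root.find("Document")
    if doc is not None:
        return "document result" if doc.find("Content") is not None else "document request"
    return "unknown"
```

A ‘Document’ element without ‘Content’ classifies as a request, with ‘Content’ as a result, matching the rules above.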


Figure 7–6 Schema for plain text communication


7.4 Security

7.4.1 Group management

One of the key features of the P2PIR system is secure group communications. The

security mechanisms do not prevent public access to general/non-sensitive

information. However, only the authorized peers can access the respective

sensitive data. Thus, identification of SIG members becomes the most important security issue for the P2PIR system. In a server-less environment the authentication of a peer differs from that in a centralized environment: there is no server for user authentication, so each user must authenticate the others directly, offline.

The simplest form of group control is using an Access Control List (ACL). This list,

typically manipulated only by the group administrators, lists all group members

that should have permission to access the sensitive data. However, the use of an ACL is very restrictive, so it is not suitable for the P2PIR system. The main reason is that an ACL is a static entity: it can only be used when all group members are known. But in a peer to peer environment, peers can join a SIG dynamically, so a new SIG member will not get full access to members’ collections until all members update their ACLs. Downloading the ACL from a directory server and then verifying the user might be a solution, but it introduces more network traffic and suffers from the single point of failure problem.

In order to solve this problem, we introduce a certificate for group members

which is based on X.509 public key infrastructure (PKI) (ITU 2000). When a new

member joins a SIG, the SIG administrator will issue him a group member

certificate which is signed by the SIG group. A unique private/public key pair and

the certificate of group itself will also be sent to the new member at the same

time. The group member certificate is similar to the standard PKI certificate that

includes member identity (email address), issuance time, validity interval and a

public key. Then the member can prove his membership to other members by

sending out the member certificate together with a signed message.

The process of validating a member is demonstrated in Figure 7–7. The first step

of authentication is to identify the peer. It’s relatively easy to verify a peer

because each peer uses a unique email address. The authentication process checks not only the certificate but also the sender’s email address: a message is accepted only when the sender’s email address matches the certificate. The next step is to verify the certificate. The X.509 validation method is used in the P2PIR system, following the certificate path, to validate the certificate. Several checks are performed in this step. Firstly, check that the group member’s certificate was issued by the group administrator. Secondly, check that the certificate has not expired and has not been revoked. Then, check that the message is signed by the sender. If the message passes all the checks, B is a valid client in group G. In this

way, A and B do not have to know each other but can trust each other and

communicate in a secure way. Authentication is done on the client side and does

not require any authentication server. Additionally, a block list can be used at the

client side. With a block list a user can block any unwanted clients even if the

client has a valid certificate within the group.
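The validation steps can be summarised in a simplified sketch. Real X.509 path validation and signature verification are replaced here by toy field comparisons, and all names are illustrative.

```python
# Simplified sketch of the member validation steps of Figure 7-7.
from dataclasses import dataclass
import time

@dataclass
class MemberCertificate:
    email: str           # member identity bound to the certificate
    issuer: str          # the group administrator who issued it
    not_after: float     # expiry as a Unix timestamp
    revoked: bool = False  # would come from the group's CRL

def validate_member(sender_email: str, cert: MemberCertificate,
                    signature_valid: bool, group_admin: str,
                    block_list: frozenset = frozenset()) -> bool:
    if sender_email in block_list:          # client-side block list
        return False
    if sender_email != cert.email:          # sender address must match certificate
        return False
    if cert.issuer != group_admin:          # issued by the group administrator?
        return False
    if cert.revoked or time.time() > cert.not_after:  # CRL and expiry checks
        return False
    return signature_valid                  # message signed by the sender?

cert = MemberCertificate("b@example.org", "admin@g.org", time.time() + 3600)
```

The block list is applied first, so even a peer holding a valid group certificate can be rejected locally, as described above.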


Figure 7–7 Member validation process


Once issued, a group member’s certificate becomes valid when its validity time

has been reached, and it is considered valid until its expiration date. However,

various circumstances may cause a certificate to become invalid prior to the

expiration of the validity period. For example: a change of email address, a change of employment status (when an employee terminates employment with an organization), exposure of the private key to the public, or loss of the private key. Under such circumstances, the certificate needs to be revoked. RFC 3280 (IETF 2002) defines one way to represent revocation information. This method involves each CA (in our circumstance, the group administrators) periodically issuing a signed data structure called a certificate revocation list (CRL). A CRL is a list identifying revoked certificates, signed by the group administrators. The CRL is sent to all group members whenever the revocation list is updated. By using X.509 CRLs, we can ensure that certificates presented by peers have not been revoked by the group administrator.

7.4.2 Security protocol

There are several internet encryption and security protocols, such as VPN, Secure Sockets Layer (SSL) and its updated version Transport Layer Security (TLS), which ensure the privacy and integrity of communication between peers. But such protocols are designed for direct communication and are not suitable for our system, because the communication between peers in our system is via email: we cannot create a low-layer secure communication channel on the public email system. Since our communication protocol is based on XML, it is natural to base our security protocols on XML encryption and XML Signatures.

XML encryption (Imamura, Dillaway et al. 2002) is designed for general secure data exchange. The structure of XML encryption is described in Figure 7–8, and the detail of the XML schema for XML encryption is listed in Appendix 2. One of the most powerful features of XML encryption is that it is fully extensible. As we can see, XML encryption defines all the elements needed to describe an encryption, such as the encryption method, encrypted key, cipher data, etc. Thus, any encryption algorithm can be applied with XML encryption. When necessary, a user can apply a stronger encryption algorithm to protect sensitive data and a faster (usually weaker) encryption algorithm to reduce CPU usage for lower security level items. In addition, compared with real-time encryption protocols such as SSL and TLS, XML encryption can encrypt only the sensitive data rather than the entire communication, so the peers can exchange messages efficiently. Non-sensitive data is left unencrypted to reduce CPU load and communication cost.


Figure 7–8 XML encryption structure(Imamura, Dillaway et al. 2002)

Where "?" denotes zero or one occurrence; "+" denotes one or more occurrences;

"*" denotes zero or more occurrences; and the empty element tag means the

element must be empty.

7.4.3 Encryption methods

As PKI certificates are used in the P2PIR system, it is natural to use asymmetric

cryptography techniques in the encryption. However, symmetric algorithms are

generally much less computationally intensive than asymmetric algorithms. In

practice, asymmetric key algorithms are typically hundreds to thousands of times slower than symmetric key algorithms. One of the difficulties arising with symmetric encryption is the key exchange problem: peers have to share a key before communication can start, and secure plain text key exchange is not possible on the internet. In addition, the more users there are in the system, the higher the risk that a cryptographic key is compromised; therefore, cryptographic keys should be changed regularly. Using asymmetric and symmetric encryption together offers a solution.

There are numerous published articles that describe how to use asymmetric and symmetric encryption together. For the sake of completeness, a brief explanation is provided in Appendix 1.
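As an illustration of the hybrid pattern, the sketch below seals a message with a fresh symmetric session key and encrypts only that small key "asymmetrically". XOR stands in for a real symmetric cipher and the asymmetric step is a labelled placeholder, not real cryptography; with XOR, the toy "public" and "private" keys are identical.

```python
# Toy sketch of hybrid (envelope) encryption. XOR is NOT a secure cipher;
# it only stands in for the symmetric and asymmetric algorithms.
import os

def xor_cipher(data: bytes, key: bytes) -> bytes:
    # XOR keystream: encryption and decryption are the same operation.
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

def asymmetric_encrypt(session_key: bytes, public_key: bytes) -> bytes:
    return xor_cipher(session_key, public_key)   # placeholder for e.g. RSA

def asymmetric_decrypt(blob: bytes, private_key: bytes) -> bytes:
    return xor_cipher(blob, private_key)         # placeholder for e.g. RSA

def seal(message: bytes, recipient_public_key: bytes):
    session_key = os.urandom(16)                 # fresh symmetric key per message
    encrypted_key = asymmetric_encrypt(session_key, recipient_public_key)
    ciphertext = xor_cipher(message, session_key)
    return encrypted_key, ciphertext

def open_sealed(sealed, recipient_private_key: bytes) -> bytes:
    encrypted_key, ciphertext = sealed
    session_key = asymmetric_decrypt(encrypted_key, recipient_private_key)
    return xor_cipher(ciphertext, session_key)

# With the XOR placeholder, the toy public and private keys coincide.
key = b"toy-key-pair-bytes"
```

Only the 16-byte session key pays the asymmetric cost; the bulk payload is handled by the fast symmetric cipher, which is the point of the hybrid scheme.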

7.4.4 XML signature

Data integrity is also important in the encryption / decryption process: it means that data consistency is assured and that the data is tamperproof. Only when data integrity is guaranteed can we authenticate the message owner. The P2PIR system uses X.509 certificates and XML signatures (Bartel, Boyer et al. 2002) to achieve data integrity and authentication.

XML signature is defined by W3C and “provide integrity, message authentication,

and/or signer authentication services for data of any type, whether located within

the XML that includes the signature or elsewhere” (Bartel, Boyer et al. 2002). Like normal digital signatures, XML signatures can provide the following capabilities for

data: integrity assures that the data has not been tampered with or corrupted

since it was signed; authentication assures that the data originates from the signer

and non-repudiation assures that the signer is committed to the document

contents. The advantage of XML signatures is that XML signatures can be applied

to arbitrary data elements that may be located within the XML document.

7.5 The search engine

The search engine used in this peer to peer system is based on GPX (Geva 2006). Small modifications were made to enable document-level search for both English and Chinese documents. Basically, the documents were indexed using a character-based inverted file index. Microsoft SQL Server 2005 Developer Edition is used as the backbone of the search engine. The document index is stored as follows:

• The Terms table stores the unique terms in the whole collection. It records the term id, the term, and the start and end positions of the term in the inverted list.

• The Contents table stores the term id, document id, xpath id and the position of the term in the xpath.

• The XPath table stores all possible xpaths in the collection.

• The Documents table stores all document names in the collection.


The Terms table only stores single English words and single Chinese characters. The main reason for storing only single words and characters instead of full terms is to avoid the problem of Chinese term segmentation. As described before, using incorrectly segmented Chinese documents will significantly decrease the performance of the search engine. With single-character-based indexing, a Chinese term can be easily determined from documents, xpaths and positions: only when the character positions are consecutive, within the same document and xpath, is the character sequence considered a phrase or a term in the document. As a result, this strategy can maximise the recall of the search engine.
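The consecutive-position rule can be sketched as follows. The index layout is simplified to in-memory postings of (document id, xpath id, position), and the helper names are hypothetical.

```python
# Sketch of a single-character inverted index: a phrase matches only where
# its characters occupy consecutive positions in the same document and xpath.
from collections import defaultdict

def build_index(docs: dict) -> dict:
    # docs maps doc_id -> (xpath_id, text); every character gets a posting.
    index = defaultdict(list)
    for doc_id, (xpath_id, text) in docs.items():
        for pos, ch in enumerate(text):
            index[ch].append((doc_id, xpath_id, pos))
    return index

def phrase_match(index: dict, phrase: str) -> set:
    # Start from the postings of the first character, then require each
    # following character at position + offset in the same doc and xpath.
    hits = set(index.get(phrase[0], []))
    for offset, ch in enumerate(phrase[1:], start=1):
        nxt = set(index.get(ch, []))
        hits = {(d, x, p) for (d, x, p) in hits if (d, x, p + offset) in nxt}
    return {d for (d, x, p) in hits}

docs = {1: (0, "信息检索系统"), 2: (0, "检索信息")}
index = build_index(docs)
```

Because no segmenter is involved, any query string that occurs contiguously in a document is found, which is how the scheme maximises recall.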

Document score for a query Q is calculated by the equation below:

D_score = n^7 · Σ_i (tf_i · idf_i)    (16)

Here, n is the number of unique query terms in the document, tf_i is the frequency of the ith term in the document, and idf_i is the inverse document frequency of the ith term in the collection.

The equation ensures two things. First, the n^7 factor strongly rewards documents that contain more query terms: the more unique query terms match in a document, the higher the document is ranked. For example, a document that contains five unique query terms will always rank higher than a document that contains four query terms, regardless of the query term frequencies in the documents. Second, when documents contain the same number of unique terms, the score of a document is determined by the sum of the query terms’ tf·idf, as in traditional information retrieval. According to our experiments, n^7 gives the best results for both English and Chinese information retrieval.


Chapter 8

Peer to peer system evaluation

The purpose of the experiments in this chapter is to evaluate the performance of

the peer to peer system that uses email as the communication medium. Therefore, both the multilingual components and the collection profiling components were removed from the P2PIR system during the experiments, to avoid the impact of query translation and collection merging. As a result, raw score merging was used in the P2PIR system.

8.1.1 Test environment

To simulate a real world situation we have used two different email servers: the

Yahoo mail server (Yahoo server) and the Queensland University of Technology’s

mail server (QUT server). Those two servers can represent two types of mail

server. The QUT server represents an organization’s internal mail server: it is

located behind the organization’s firewall and will block access from outside. The

Yahoo server can represent a general public mail server: it is heavily loaded, it can

be accessed by anyone and it is physically remote to QUT.


The machines that we used to test the system were also varied. We used 10 PCs with various hardware and operating system configurations: seven PCs were located within QUT’s internal network and firewall, and one laptop and two PCs were located outside QUT. The external PCs used a 256K ADSL internet connection and the Yahoo server.

8.1.2 Test queries and collection

We used the INEX document collection in the experiments. The INEX04 document

collection consists of the IEEE computer society journal and magazine articles

from 1995 to 2002, approximately 500MB in size and with over 12,000

documents. The collection was arbitrarily split into 10 partitions and distributed as

independent sub-collections on 10 PCs of different types.

The queries that we used were a mix of 26 XPath-like CAS (content-and-structure) queries, specifying keywords as well as XML structural constraints, and 34 CO (content-only) queries, specifying keywords only.

8.1.3 Test results

We have run the first test during the weekend. The work load of both servers

seemed low and the email messages were delivered without delay. The check-mail interval was set to 0, which means that the software kept checking for new mail without imposing the usual wait. In this case, we can assume that our system was running as the equivalent of a direct peer to peer system, or a parallel system. The overall system performance is dependent on the slowest part in the system (Brown 1999). Table 8-1 shows the search

time on each of the 10 partitions of the collection, when running on the fastest

and the slowest computers. The average search time on our system is 26.4

minutes (for 60 queries).

Table 8-1 Collection size and search time

ID   Size (MB)   Search time on fastest PC (minutes)   Search time on slowest PC (minutes)
1    193         12                                    23
2    150         10                                    22
3    195         9                                     18
4    250         8                                     15
5    198         10                                    21
6    187         6                                     11
7    237         10                                    22
8    225         9                                     19
9    172         13                                    26
10   169         12                                    26

Table 8-2 shows the search times in different environments. In the distributed environment, the search time is approximately 6.5 times faster than the centralized search time on the best PC. We can achieve shorter response times when the data is split into smaller collections.

Table 8-2 Comparison of centralized search and distributed search

Centralized search time A¹   Centralized search time B²   Distributed search time
168 min                      >4 hours                     26.4 min

The same test was run mid-week to simulate a real world situation. It was quite surprising to find that all mails were delivered within seconds, which still makes the system behave virtually as direct peer to peer connections.

In a real application we anticipate that the check-mail frequency will be in the

order of 5 minutes or more, and so response times will be longer. We may also find that some users are not permanently connected, so response times may be even longer – hours or days – for some user collections. However, results are

delivered incrementally and so the query originator will receive results as soon as

¹ Search A: 1.7GB database search on an AMD 2800+ with 1GB RAM. ² Search B: 1.7GB database search on a Pentium 3 800 with 256MB RAM. The search did not finish within 4 hours and was aborted.


they become available. The system therefore offers a trade-off between

immediate response (ideal situation) and delayed asynchronous response which

will increase the search coverage to collections that are offline at the time that

the search request is issued.

8.1.4 Results and raw score fusion

As the collection profiling component was turned off, raw score merging was

demonstrated. In order to achieve effective results fusion we have opted to

change the ranking strategy of our search engine so as to eliminate the use of

collection specific information. This approach may be initially questionable

because it implies, for instance, that ranking schemes that are based on term

frequencies (e.g. tfidf) cannot be used. When a collection is partitioned each

partition might exhibit distinctly different term distribution frequencies. This is

particularly to be expected when the partition is thematic and different

collections cover different topics.

Merging the results on the basis of local rank leads to poor results. We have

implemented a result element ranking scheme that is only based on the number

of distinct search terms that appear in a result element and on the number of

times that each term appears in the element. We ignore collection-level term statistics altogether in computing the score of a result element. In such a manner, the


ranking in the distributed environment is identical to the ranking in a single

collection. There is of course a trade-off. On one hand, results fusion is straightforward and there is no need to keep or exchange any global collection statistics.

On the other hand, the quality of the results is degraded. We have tested the

approach by running our system with an unaltered ranking scheme (i.e., using tfidf

and collection term statistics) and with an altered ranking scheme that uses no

collection statistics. The results are depicted in Figure 8–1. The baseline for

comparison is the Precision Recall curves of the official submissions to INEX 2004.

Merging the results of the original search engine leads to poor results. Our

submission to the CAS track in INEX 2004 was ranked first, but the result obtained

by distributing the search engine without modification would have been ranked 36th among all 51 official submissions to INEX 2004, with a much lower average precision than the centralized system – 0.03 vs. 0.15 respectively. On the other hand, by changing the ranking scheme to eliminate all references to global information we obtained a result that is still very good: it would have ranked 4th at INEX 2004 with an average precision of 0.12.
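The collection-independent scoring and raw score merging can be sketched as follows. The scoring here (distinct matching terms first, with their within-element counts as a tie-breaker) is a simplified reading of the scheme described above, and all names are hypothetical.

```python
# Sketch of collection-independent scoring for raw score fusion: each score
# depends only on the result element itself, never on collection statistics,
# so scores from different peers are directly comparable and can be merged.

def element_score(element_terms: dict, query_terms: list) -> tuple:
    # (number of distinct matching terms, total occurrences of those terms);
    # tuple comparison makes the distinct-term count dominate.
    matched = [t for t in set(query_terms) if element_terms.get(t, 0) > 0]
    return (len(matched), sum(element_terms[t] for t in matched))

def raw_score_merge(result_lists: list, query_terms: list) -> list:
    # result_lists: one list of (doc_id, term_counts) per remote peer.
    merged = [(doc_id, element_score(counts, query_terms))
              for results in result_lists for doc_id, counts in results]
    merged.sort(key=lambda item: item[1], reverse=True)
    return [doc_id for doc_id, score in merged]

peer1 = [("p1/doc3", {"xml": 2, "retrieval": 1}), ("p1/doc9", {"xml": 1})]
peer2 = [("p2/doc4", {"retrieval": 5})]
ranking = raw_score_merge([peer1, peer2], ["xml", "retrieval"])
```

No idf or other global statistic appears anywhere, so the merged ranking is identical to what a single collection holding all the documents would produce under the same scheme.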


Figure 8–1 PR curves


Chapter 9

Conclusion and future work

9.1 Summary

The goal of this thesis was to develop and evaluate a peer to peer multilingual IR

system that can easily pass through firewalls and can be used in either public domains or private secure domains. This peer to peer IR system can search for both English and Chinese documents using a single query in English or Chinese, satisfying the information needs of users who know both languages. The P2PIR system includes many features that are not offered by any of the current peer to peer IR systems. The contributions of this thesis include:

• Web-based English - Chinese query translation approach.

In this area, the thesis describes an approach to tackle the OOV problem in

English-Chinese information retrieval. Firstly, a bottom-up term extraction

method to be used in small corpora for generating candidate translations for

query OOV terms is proposed. The method introduces a new measurement of

a Chinese string based on frequency and standard deviation, together with a

Chinese MLU extraction process based on the change of the new string


measurement that does not rely on any predefined thresholds. The method

considers a Chinese string as a term based on the change of R’s value when

the size of the string increases rather than the absolute value of R. Our

experiments show that this approach is effective for translation extraction of

unknown query terms. The related information can be found in section 3.2,

Chapter 4 and also in conference paper [1], journal paper [1].

A simple translation selection approach to improve translation accuracy is also

proposed in the thesis. The experimental results show that OOV terms can

significantly affect the performance of CLIR systems. By using web translation

extraction based on a co-occurrence model, the overall performance can be

boosted to almost 174% of that achieved when OOV terms are not processed.

With our proposed translation selection approach, the accuracy of OOV term

translation can be improved by up to 85%; the overall performance is then

about 200% relative to not processing OOV terms, and about 120% by

comparison with the simulation of previous approaches. The related

information can be found in Section 3.3, Chapter 4 and also in conference

paper [3] and journal paper [1].
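The co-occurrence-based selection can be sketched as below. In the thesis the statistics come from web search results; here a handful of in-memory snippets stands in for them, and the scoring function is a simplified co-occurrence measure rather than the exact model evaluated in Chapter 4.

```python
def select_translation(candidates, context_translations, snippets):
    """Choose the candidate translation that best co-occurs with the
    (already translated) remaining query terms in retrieved text."""
    def score(cand):
        hits = [s for s in snippets if cand in s]
        if not hits:
            return 0.0
        # Fraction of the candidate's hits that also mention a context
        # term, weighted by the amount of supporting evidence.
        co = sum(1 for s in hits if any(t in s for t in context_translations))
        return (co / len(hits)) * len(hits) ** 0.5
    return max(candidates, key=score)
```

For example, with context term 日本 the candidate 雅子 would be preferred over 太子妃 when its snippets co-occur with the context more often.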


• Web-based collection profiling strategy

In this area, the thesis proposed and evaluated an approach to result

merging that can be applied in uncooperative distributed information

environments such as p2p systems. A Web based query classification method

is proposed. By learning user behaviour and classifying queries, collection

profiles can be created that contain information not only about collection

content but also about the performance of the remote information retrieval

systems. Using the information in the profiles helps in merging results. Our

experiments showed that our proposed SRR and SR approaches can provide

much better results than the standard round robin method. The related

information can be found in Chapter 5, Chapter 6 and also in conference paper

[2].
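The contrast with plain round robin can be illustrated with a profile-weighted variant. The peer names, weights, and quota rule below are illustrative assumptions; the actual SRR and SR definitions are given in Chapter 6.

```python
def weighted_round_robin(result_lists, weights):
    """Merge ranked lists from remote peers. Instead of taking one result
    per peer per round (plain round robin), each peer contributes a number
    of results per round proportional to its profile weight."""
    quotas = {p: max(1, round(w)) for p, w in weights.items()}
    iters = {p: iter(lst) for p, lst in result_lists.items()}
    merged, active = [], set(result_lists)
    while active:
        for peer in list(result_lists):          # fixed peer order
            if peer not in active:
                continue
            for _ in range(quotas[peer]):
                try:
                    doc = next(iters[peer])
                except StopIteration:
                    active.discard(peer)         # this peer is exhausted
                    break
                if doc not in merged:            # drop duplicate documents
                    merged.append(doc)
    return merged
```

A peer whose profile indicates better past performance (weight 2 below) places two results per round ahead of a weaker peer's single result.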

• Email-based peer to peer IR system architecture

An email based secure distributed information retrieval system is described in

the thesis. The system is based on a store-and-forward paradigm, utilizing the

public email system, to facilitate search distribution and collaborative

information retrieval - a federated system of local private collections. The tests

show that the distributed system can even offer speed advantages, arising

from parallel processing, in comparison with an equivalent centralized system.

An important and central feature of the P2PIR system is that it can ensure

information security and privacy in the search and exchange of sensitive data

in an open network; the system also provides other features such as storing

and forwarding messages, passing through firewalls, and communicating

trustfully with unknown clients. The work shows that it is possible to use the

X.509 certificate framework in a peer to peer IR system. The related

information can be found in Chapter 7, Chapter 8 and also in conference

paper [4].
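A store-and-forward exchange of this kind can be sketched with Python's standard email tooling. The header names, subject convention, and payload layout below are illustrative assumptions, not the actual wire format of the P2PIR system (which additionally signs and encrypts the payload as described in Chapter 7).

```python
from email.message import EmailMessage

def build_query_message(sender, recipient, query_id, query_text):
    """Wrap a search request in an ordinary email so it can be relayed by
    any SMTP server and fetched later via POP3/IMAP (store-and-forward)."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = f"P2PIR-QUERY {query_id}"   # lets peers filter requests
    msg["X-P2PIR-Type"] = "query"                # hypothetical custom header
    msg.set_content(query_text)
    return msg

# A real peer would hand this message to smtplib.SMTP(...).send_message()
# and poll its own mailbox with imaplib or poplib for reply messages.
```

Because the message is plain RFC-compliant mail, it passes through firewalls and NATs exactly as ordinary email does, which is the point of the design.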

9.2 Future work

First of all, the research work on query translation in this thesis is limited to

translating English queries into Chinese queries. As the ideas of term extraction

and translation selection are statistics based, they should be language

independent. As a result, these ideas might work for Chinese to English

translation and other language pairs; however, more experiments are required to

evaluate the actual performance. In the query translation area, although the

proposed approach shows impressive accuracy for OOV term translation, there is

still some work to be conducted in the future. Firstly, our experiments were

conducted using a test set from the NTCIR5 and NTCIR6 CLIR tasks which has

only 108 OOV terms. It might be necessary to test our approach on a larger scale

test set, such as one with over 1000 OOV terms. Secondly, inappropriate

translation is still a problem in query translation. The main reasons include the

limited size of the dictionary, different customs of translation, and ignoring query

context. Some work should be done to minimize these problems. Our

experiments provide hints for some possible approaches. If we have large enough

resources, we may find all the possible translations. For translation selection, if

some of the translations hit a similar number of documents, we may keep all of

them as correct translations. It may also be useful to include more results from

the Google search, for instance, or to combine different translation results

together. We will validate these ideas in the future.

In the collection profiling area, future work will focus on collection selection.

Collection selection is necessary in a large scale environment. A query should only

be sent to a small number of peers in the network, because a client may not be

able to handle a large number of search requests. In addition, selecting a small

number of remote peers can help to reduce the number of irrelevant documents.

Although the collection profiling strategy can provide impressive accuracy of

collection ranking, which is usually the basis for collection selection strategies,

collection selection strategies themselves are not covered in this thesis. How to

select a small number of remote peers, e.g. how many peers should be selected,

to maximize recall and precision is still an open issue in the P2PIR system.


In the peer to peer IR system area, as our system is based on email, any attack

through the email system is a potential risk to our system, especially spam

attacks. More work is required to address this issue.


Appendix

Appendix 1 Asymmetric and symmetric encryption / decryption

The encryption and decryption process is shown in Figure 10–1. When user A

requests some sensitive information from user B, A sends her public key to user

B with the request. User B then creates a symmetric content encryption key and

encrypts this key with A's public key. B then encrypts the data using the

symmetric content key and sends the encrypted data and the encrypted key

back to A. User A decrypts the content key using her private key and then

decrypts the data using the content key. With asymmetric cryptography, only the

private key can decrypt data encrypted with the corresponding public key;

therefore, privacy can be ensured. Symmetric encryption is used for the data

itself because it is much faster for bulk data.


Figure 10–1 Encryption and Decryption using Symmetric and Asymmetric together
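The exchange in Figure 10–1 can be traced in code. The ciphers below are deliberately toy stand-ins (a SHA-256 keystream for the symmetric part, and a shared-secret placeholder in which the "public" and "private" keys coincide, for the asymmetric part) so that the message flow is runnable without a crypto library; the real system uses X.509/RSA key pairs and a standard symmetric cipher, as described in Chapter 7.

```python
import hashlib
import secrets

def keystream_xor(key: bytes, data: bytes) -> bytes:
    """Toy symmetric cipher: XOR with a SHA-256 counter keystream.
    Illustrative only; a real system would use a cipher such as AES."""
    out = bytearray()
    for off in range(0, len(data), 32):
        ks = hashlib.sha256(key + off.to_bytes(8, "big")).digest()
        out.extend(b ^ k for b, k in zip(data[off:off + 32], ks))
    return bytes(out)

# Toy "asymmetric" operations: both reuse the keystream cipher, so the
# key pair is just the same secret twice. Placeholder for RSA/X.509.
def pub_encrypt(pub: bytes, data: bytes) -> bytes:
    return keystream_xor(pub, data)

def priv_decrypt(priv: bytes, data: bytes) -> bytes:
    return keystream_xor(priv, data)

# --- A requests data from B, sending her public key with the request ---
a_pub = a_priv = secrets.token_bytes(32)      # toy pair: pub == priv

# --- B: create a symmetric content key, encrypt it with A's public key,
#        then encrypt the data with the content key -------------------
content_key = secrets.token_bytes(32)
enc_key = pub_encrypt(a_pub, content_key)
enc_data = keystream_xor(content_key, b"sensitive report")

# --- A: recover the content key with her private key, then the data ---
recovered_key = priv_decrypt(a_priv, enc_key)
plaintext = keystream_xor(recovered_key, enc_data)
```

Only the holder of the private key can recover the content key, and only the content key unlocks the data, which is exactly the hybrid pattern the appendix describes.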


Appendix 2 XML Schema for XML Encryption

<?xml version="1.0" encoding="utf-8"?>
<schema xmlns="http://www.w3.org/2001/XMLSchema" version="1.0"
        xmlns:xenc="http://www.w3.org/2001/04/xmlenc#"
        xmlns:ds="http://www.w3.org/2000/09/xmldsig#"
        targetNamespace="http://www.w3.org/2001/04/xmlenc#"
        elementFormDefault="qualified">

  <import namespace="http://www.w3.org/2000/09/xmldsig#"
          schemaLocation="http://www.w3.org/TR/2002/REC-xmldsig-core-20020212/xmldsig-core-schema.xsd"/>

  <complexType name="EncryptedType" abstract="true">
    <sequence>
      <element name="EncryptionMethod" type="xenc:EncryptionMethodType" minOccurs="0"/>
      <element ref="ds:KeyInfo" minOccurs="0"/>
      <element ref="xenc:CipherData"/>
      <element ref="xenc:EncryptionProperties" minOccurs="0"/>
    </sequence>
    <attribute name="Id" type="ID" use="optional"/>
    <attribute name="Type" type="anyURI" use="optional"/>
    <attribute name="MimeType" type="string" use="optional"/>
    <attribute name="Encoding" type="anyURI" use="optional"/>
  </complexType>

  <complexType name="EncryptionMethodType" mixed="true">
    <sequence>
      <element name="KeySize" minOccurs="0" type="xenc:KeySizeType"/>
      <element name="OAEPparams" minOccurs="0" type="base64Binary"/>
      <any namespace="##other" minOccurs="0" maxOccurs="unbounded"/>
    </sequence>
    <attribute name="Algorithm" type="anyURI" use="required"/>
  </complexType>

  <simpleType name="KeySizeType">
    <restriction base="integer"/>
  </simpleType>

  <element name="CipherData" type="xenc:CipherDataType"/>
  <complexType name="CipherDataType">
    <choice>
      <element name="CipherValue" type="base64Binary"/>
      <element ref="xenc:CipherReference"/>
    </choice>
  </complexType>

  <element name="CipherReference" type="xenc:CipherReferenceType"/>
  <complexType name="CipherReferenceType">
    <choice>
      <element name="Transforms" type="xenc:TransformsType" minOccurs="0"/>
    </choice>
    <attribute name="URI" type="anyURI" use="required"/>
  </complexType>

  <complexType name="TransformsType">
    <sequence>
      <element ref="ds:Transform" maxOccurs="unbounded"/>
    </sequence>
  </complexType>

  <element name="EncryptedData" type="xenc:EncryptedDataType"/>
  <complexType name="EncryptedDataType">
    <complexContent>
      <extension base="xenc:EncryptedType"/>
    </complexContent>
  </complexType>

  <!-- Children of ds:KeyInfo -->
  <element name="EncryptedKey" type="xenc:EncryptedKeyType"/>
  <complexType name="EncryptedKeyType">
    <complexContent>
      <extension base="xenc:EncryptedType">
        <sequence>
          <element ref="xenc:ReferenceList" minOccurs="0"/>
          <element name="CarriedKeyName" type="string" minOccurs="0"/>
        </sequence>
        <attribute name="Recipient" type="string" use="optional"/>
      </extension>
    </complexContent>
  </complexType>

  <element name="AgreementMethod" type="xenc:AgreementMethodType"/>
  <complexType name="AgreementMethodType" mixed="true">
    <sequence>
      <element name="KA-Nonce" minOccurs="0" type="base64Binary"/>
      <!-- <element ref="ds:DigestMethod" minOccurs="0"/> -->
      <any namespace="##other" minOccurs="0" maxOccurs="unbounded"/>
      <element name="OriginatorKeyInfo" minOccurs="0" type="ds:KeyInfoType"/>
      <element name="RecipientKeyInfo" minOccurs="0" type="ds:KeyInfoType"/>
    </sequence>
    <attribute name="Algorithm" type="anyURI" use="required"/>
  </complexType>
  <!-- End Children of ds:KeyInfo -->

  <element name="ReferenceList">
    <complexType>
      <choice minOccurs="1" maxOccurs="unbounded">
        <element name="DataReference" type="xenc:ReferenceType"/>
        <element name="KeyReference" type="xenc:ReferenceType"/>
      </choice>
    </complexType>
  </element>

  <complexType name="ReferenceType">
    <sequence>
      <any namespace="##other" minOccurs="0" maxOccurs="unbounded"/>
    </sequence>
    <attribute name="URI" type="anyURI" use="required"/>
  </complexType>

  <element name="EncryptionProperties" type="xenc:EncryptionPropertiesType"/>
  <complexType name="EncryptionPropertiesType">
    <sequence>
      <element ref="xenc:EncryptionProperty" maxOccurs="unbounded"/>
    </sequence>
    <attribute name="Id" type="ID" use="optional"/>
  </complexType>

  <element name="EncryptionProperty" type="xenc:EncryptionPropertyType"/>
  <complexType name="EncryptionPropertyType" mixed="true">
    <choice maxOccurs="unbounded">
      <any namespace="##other" processContents="lax"/>
    </choice>
    <attribute name="Target" type="anyURI" use="optional"/>
    <attribute name="Id" type="ID" use="optional"/>
    <anyAttribute namespace="http://www.w3.org/XML/1998/namespace"/>
  </complexType>
</schema>

Appendix 3 Translation of OOV terms

OOV term SQUT SCP SCPCD SE MI

Chiutou: Autumn Struggle: 秋鬥大遊 從秋鬥 秋鬥 秋鬥 秋鬥

Jonnie Walker: 約翰走路 約翰走路 黑次元 高雄演唱 高雄演唱

Charity Golf Tournament: 慈善高爾夫球賽 慈善高爾夫球賽 慈善高 慈善高

Embryonic Stem Cell: 胚胎幹細胞 胚胎幹細胞 胚胎幹細胞

Florence Griffith Joyner: 花蝴蝶 葛瑞菲絲 葛瑞菲絲 花蝴蝶 花蝴蝶

FloJo: 佛羅倫薩格里菲斯 花蝴蝶 花蝴蝶 花蝴蝶 花蝴蝶

Michael Jordan: 麥可喬丹 麥可喬丹 喬丹 喬丹 喬丹

Torrijos Carter Treaty:

Viagra:

Hu Jin tao: 胡錦濤 胡錦濤 胡錦濤 胡錦濤 胡錦濤

Wang Dan: 天安門 王丹 王丹 王丹

Tiananmen: 天安門廣場 天安門 天安門 天安門 天安門

Akira Kurosawa: 黑澤明 黑澤明 黑澤明 黑澤明 黑澤明

Keizo Obuchi: 小淵惠三 小淵惠三 小淵惠三 小淵惠三 小淵惠三

Environmental Hormone: 環境荷爾蒙 環境荷爾蒙 環境荷爾蒙 環境荷爾蒙

Acquired Immune Deficiency Syndrome: 後天免疫缺乏症候群 愛滋病 愛滋病 愛滋病 愛滋

Social Problem: 社會問題 社會問題 社會問題

Kia Motors: 起亞汽車 起亞汽車 起亞汽車 起亞 起亞

Self Defense Force: 自衛隊 自衛隊 自衛隊 自衛隊 自衛隊

Animal Cloning Technique: 動物克隆技 動物克隆技

Political Crisis: 政治危機 政治危機 政治危機

Public Officer: 公職人員 公職人員 公職人員 公職人員

Research Trend: 研究趨勢 研究趨勢 研究趨勢 研究趨勢

Foreign Worker: 外籍勞工 外籍勞工 外籍勞工 外籍勞工

World Cup: 世界盃 世界盃 世界盃 世界盃 世界盃

Apple Computer: 蘋果公司 蘋果電腦 蘋果電腦 蘋果電腦 蘋果電腦

Weapon of Mass Destruction: 大規模毀滅性武器 大規模毀滅性武器 性武器

Energy Consumption: 能源消費 能源消費 能源消費

International Space Station: 國際太空站 國際太空站 國際太空站

President Habibie: 哈比比總統 哈比比總統 哈比比總統 哈比比

Underground Nuclear Test: 地下核試驗 地下核試驗 地下核試

F117: 戰鬥機 隱形戰鬥機 隱形戰 隱形戰 隱形戰

Stealth Fighter: 隱形戰機 隱形戰機 形戰鬥機 形戰鬥機 形戰鬥機

Masako: 雅子 太子妃 雅子 雅子 雅子

Copyright Protection: 版權保護 版權保護 版權保護 版權保護 版權保護

Daepodong: 大浦洞 大浦洞 大浦洞 大浦洞 大浦洞

Contactless SMART Card: 智慧卡 非接觸式智慧卡 非接觸式智慧卡 非接觸式 非接觸式

Han Dynasty: 漢朝 大漢風 漢朝 漢朝 漢朝

Promoting Academic Excellence: 學術追求卓越發展計畫 卓越計畫 卓越發展計畫 卓越發展計畫 卓越發

China Airlines: 中華航空 中華航空 中華航空 中華航空 長榮

ST1:

El Nino: 聖嬰 聖嬰現象 聖嬰現象 聖嬰 聖嬰

Mount Ali: 阿里山 阿里山 阿里山 阿里山 阿里山

Kazuhiro Sasaki: 佐佐木主浩 佐佐木主浩 佐佐木 佐佐木 佐佐木

Seattle Mariners: 西雅圖水手 西雅圖水手 西雅圖水手

Takeshi Kitano: 北野武 北野武 北野武 北野武 北野武

European monetary union: 歐洲貨幣聯 歐洲貨幣聯盟 歐洲貨幣 歐洲貨幣 歐洲貨幣

capital tie up:

Nissan Motor Company: 日產汽車公司 汽車公司 汽車公司 處經濟 處經濟

Renault: 雷諾 休旅車 雷諾 雷諾 雷諾

Pol Pot: 波布 紅高棉 紅高棉 紅高棉 紅高棉

war crime: 戰爭罪 戰爭罪 戰爭罪 戰爭罪

Kim Dae Jung: 金大中 金大中 金大中 金大中 金大中

Clinton: 克林顿 克林顿 克林顿

New Year Holiday: 新年假期 新年假期 新年假期

Drunken Driving: 醉後駕車 醉後駕車 醉後駕車 醉後駕車 後駕車

Science Camp: 科學營 科學營 科學營 科學營

Nelson Mandela: 曼德拉 曼德拉 曼德拉 曼德拉 曼德拉

Kim Il Sung: 金日成 金日成 金日成 金日成 金日成

anticancer drug: 抗癌藥物

consumption tax: 消費稅 消費稅 消費稅 消費稅 費稅

Uruguay Round: 烏拉圭回合 烏拉圭回合 烏拉圭回合

Kim Jong Il: 金正日 金正日 金正日 金正日 金正日

Time Warner: 時代華納 時代華納 時代華納 時代華納 時代華納

American Online: 美國線上 美國線上 美國線上 美國線上 美國線上

Alberto Fujimori: 藤森 藤森 藤森 藤森 藤森

Taliban: 塔利班 塔利班 塔利班 塔利班 塔利班

Tiger Woods: 老虎伍茲 老虎伍茲 老虎伍茲 老虎伍茲 伍茲

Harry Potter: 哈利波特 哈利波特 哈利波特 哈利波特 哈利波特

Greenspan: 葛林斯班 葛林斯班 葛林斯班 葛林斯

monetary policy: 貨幣政策 貨幣政策 貨幣政策 貨幣政策

abnormal weather: 天氣異常 天氣異常 天氣異常 天氣異常 天氣

National Council of Timorese Resistance: 帝汶抵抗全國委員會 帝汶抵抗全國委員會 帝汶抵抗全國委員會 帝汶抵抗全國委員會 帝汶抵抗全國委員

Bibliography

The List of Commonly Used Chinese Characters, Ministry of Education of the People's Republic of China.

Aberer, K., F. Klemm, et al. (2001). An Architecture for Peer-to-Peer Information Retrieval. The Ninth International Conference on Information and Knowledge Management.

Bartel, M., J. Boyer, et al. (2002). XML-Signature Syntax and Processing. D. Eastlake, J. Reagle and D. Solo, The World Wide Web Consortium (W3C).

Bawa, M., G. S. Manku, et al. (2003). SETS: search enhanced by topic segmentation. Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval, ACM Press.

Beverly, Y. and G.-M. Hector (2001). "Improving Search in Peer-to-Peer Systems." From http://dbpubs.stanford.edu:8090/pub/2001-47.

Brown, E. (1999). Parallel and Distributed IR. Modern Information Retrieval. R. Baeza-Yates and B. Ribeiro-Neto, Addison Wesley: 229-256.


Callan, J. P. and M. E. Connell (2001). "Query-Based Sampling of Text Databases." Information Systems 19(2): 97-130.


Callan, J. P., Z. Lu, et al. (1995). Searching distributed collections with inference networks. Proceedings of the 18th annual international ACM SIGIR conference on Research and development in information retrieval, Seattle, Washington, United States, ACM Press.

Charles, P., N. Good, et al. (2003). "How Much Information?" Retrieved 19 Aug. 2007, from http://www2.sims.berkeley.edu/research/projects/how-much-info-2003/execsum.htm#summary.

Chen, A. and F. Gey (2003). Experiments on Cross-language and Patent Retrieval at the NTCIR-3 Workshop. Proceedings of the 3rd NTCIR Workshop, Japan.

Chen, A., H. Jiang, et al. (2000). Combining multiple sources for short query translation in Chinese-English cross-language information retrieval. Proceedings of the fifth international workshop on Information retrieval with Asian languages. Hong Kong, China, ACM Press.

Cheng, P.-J., J.-W. Teng, et al. (2004). Translating unknown queries with web corpora for cross-language information retrieval. Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval. Sheffield, United Kingdom, ACM Press.

Chien, L.-F. (1997). PAT-tree-based keyword extraction for Chinese information retrieval. Proceedings of the 20th annual international ACM SIGIR conference on Research and development in information retrieval. Philadelphia, Pennsylvania, United States, ACM Press.

Choi, Y. S. and S. I. Yoo (2001). "Text database discovery on the Web: Neural net based approach." Journal of Intelligent Information Systems 15(3).

Church, K. W. and P. Hanks (1990). "Word association norms, mutual information and lexicography." Computational Linguistics 16(1).

Craswell, N., D. Hawking, et al. (1999). Merging Results From Isolated Search Engines. Australasian Database Conference.

de Kretser, O., A. Moffat, et al. (1998). Methodologies for distributed information retrieval. Proceedings of the 18th International Conference on Distributed Computing Systems.

Ehrig, M., C. Schmitz, et al. (2004). "Towards Evaluation of Peer-to-Peer-based Distributed Information Management Systems." Lecture Notes in Computer Science (2926): 73-88.

Eijk, P. v. d. (1993). Automating the acquisition of bilingual terminology. Proceedings of the Sixth Conference of the European Chapter of the Association for Computational Linguistics. Utrecht, The Netherlands.


Gao, J., J.-Y. Nie, et al. (2001). Improving query translation for cross-language information retrieval using statistical models. Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval. New Orleans, Louisiana, United States, ACM Press.

Geva, S. (2006). Gardens Point XML IR at INEX 2005. Comparative Evaluation of XML Information Retrieval Systems: 4th International Workshop of the Initiative for the Evaluation of XML Retrieval, Springer.

Gravano, L., C.-C. K. Chang, et al. (1997). STARTS: Stanford proposal for Internet meta-searching. Proceedings of the 1997 ACM SIGMOD Conference.

Gravano, L. and H. García-Molina (1995). Generalizing GlOSS to Vector-Space databases and Broker Hierarchies. Proceedings of the 21st International Conference on Very Large Databases.

Hawking, D. and P. Thistlewaite (1999). "Methods for information server selection." ACM Transactions on Information Systems 17(1): 40-76.

IETF (1996). "RFC-1939 Post Office Protocol - Version 3."

IETF (2001). "Simple Mail Transfer Protocol."

IETF (2002). "Internet X.509 Public Key Infrastructure - Certificate and Certificate Revocation List (CRL) Profile."

IETF (2003). "RFC-3501 Internet Message Access Protocol - Version 4rev1."

Imamura, T., B. Dillaway, et al. (2002). XML Encryption Syntax and Processing. D. Eastlake and J. Reagle, The World Wide Web Consortium (W3C).

Ipeirotis, P. G. and L. Gravano (2001). "Distributed Search over the Hidden Web: Hierarchical Database Sampling and Selection." Journal of Intelligent Information Systems 15(3).

Ipeirotis, P. G., L. Gravano, et al. (2001). Probe, Count, and Classify: Categorizing Hidden-Web Databases. The 2001 ACM SIGMOD International Conference on Management of Data, ACM.

ITU (2000). X.509: Information technology – Open systems interconnection – The Directory: Public-key and attribute certificate frameworks. International Telecommunication Union.


Jang, M.-G., S. H. Myaeng, et al. (1999). Using mutual information to resolve query translation ambiguities and query term weighting. Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics. College Park, Maryland, Association for Computational Linguistics.

King, A. (2008). "Average Web Page Triples Since 2003." from http://www.websiteoptimization.com/speed/tweak/average-web-page/.

King, J. and Y. Li (2003). Web based collection selection using singular value decomposition. WIC International Conference on Web Intelligence.

Kirsch, S. T. (1997). Document retrieval over networks wherein ranking and relevance scores are computed at the client for multiple database documents. United States Patent.

Kupiec, J. M. (1993). An Algorithm for Finding Noun Phrase Correspondences in Bilingual Corpora. Proceedings of the 31st Annual Meeting of the ACL. Columbus, Ohio.

Li, C. and H. Li (2001). Word translation disambiguation using Bilingual Bootstrapping. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics.

Liu, K.-L., C. Yu, et al. (2001). Discovering the representative of a search engine. The 10th ACM International Conference on Information and Knowledge Management.

Liu, K.-L., C. Yu, et al. (2002). "A Statistical Method for Estimating the Usefulness of Text Databases." IEEE Transactions on Knowledge and Data Engineering 14(6).

Lu, C., Y. Xu, et al. (2007). Translation disambiguation in web-based translation extraction for English-Chinese CLIR. Proceedings of the 22nd Annual ACM Symposium on Applied Computing.

Lu, W.-H., L.-F. Chien, et al. (2004). "Anchor text mining for translation of Web queries: A transitive translation approach." ACM Transactions on Information Systems (TOIS) 22(2): 242-269.


Lv, Q., P. Cao, et al. (2002). Search and replication in unstructured peer-to-peer networks. The 16th international conference on Supercomputing, ACM Press.

Maeda, A., F. Sadat, et al. (2000). Query term disambiguation for Web cross-language information retrieval using a search engine. Proceedings of the fifth international workshop on Information retrieval with Asian languages. Hong Kong, China, ACM Press.

McDaniel, P., A. Prakash, et al. (1999). Antigone: A Flexible Framework for Secure Group Communication. The 8th USENIX UNIX Security Symposium.


Ng, W. S., B. C. Ooi, et al. (2003). PeerDB: A P2P-based System for Distributed Data Sharing. The 19th International Conference on Data Engineering.

Nie, J.-Y., M. Simard, et al. (1999). Cross-language information retrieval based on parallel texts and automatic mining of parallel texts from the Web. Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval. Berkeley, California, United States, ACM Press.

Paola, V. and K. Sanjeev (2003). Transliteration of proper names in cross-lingual information retrieval. Proceedings of the ACL 2003 workshop on Multilingual and mixed-language named entity recognition - Volume 15, Association for Computational Linguistics.

Pirkola, A., T. Hedlund, et al. (2001). "Dictionary-Based Cross-Language Information Retrieval: Problems, Methods, and Research Findings." Information Retrieval 4(3-4): 209 - 230.

Powell, A. L. and J. C. French (2003). "Comparing the performance of collection selection algorithms." ACM Transactions on Information Systems (TOIS) 21(4): 412-456.

Rasolofo, Y., F. Abbaci, et al. (2001). Approaches to collection selection and results merging for distributed information retrieval. Proceedings of the tenth international conference on Information and knowledge management. Atlanta, Georgia, USA.


Rosenberg, J., R. Mahy, et al. (2005). TURN: traversal using relay NAT. IETF.

Rosenberg, J., J. Weinberger, et al. (2003). STUN: Simple Traversal of User Datagram Protocol (UDP) Through Network Address Translators (NATs). RFC 3489. IETF.

Salovesh, M. (1996). "How many words in an "average" person's vocabulary?" From http://unauthorised.org/anthropology/anthro-l/august-1996/0436.html.

Salton, G. and M. J. McGill (1996). Introduction to Modern Information Retrieval, McGraw-Hill, Inc.

Sankar, K. (2003). "What is Peer to Peer." From http://p2p.internet2.edu/documents/What%20is%20peer%20to%20peer-5.pdf.

Saracevic, T. (1995). Evaluation of evaluation in information retrieval. Proceedings of the 18th annual international ACM SIGIR conference on Research and development in information retrieval, Seattle, Washington, United States, ACM.

Savoy, J., A. Le Calvé, et al. (1998). Report on the TREC-5 Experiment: Data Fusion and Collection Fusion. The Fifth Text REtrieval Conference (TREC-5).

Saxena, N., G. Tsudik, et al. (2003). Admission Control in Peer-to-Peer: Design and Performance Evaluation. The 1st ACM Workshop on Security of Ad Hoc and Sensor Networks, ACM Press.

Si, L. and J. Callan (2002). Using sampled data and regression to merge search engine results. Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval. Tampere, Finland.

Si, L. and J. Callan (2003). Relevant Document Distribution Estimation Method for Resource Selection. The 26th annual international ACM SIGIR conference on Research and development in information retrieval.

Silva, J. F. d., G. Dias, et al. (1999). Using LocalMaxs Algorithm for the Extraction of Contiguous and Non-contiguous Multiword Lexical Units. Progress in Artificial Intelligence: 9th Portuguese Conference on Artificial Intelligence.


Silva, J. F. d. and G. P. Lopes (1999). A Local Maxima Method and a Fair Dispersion Normalization for Extracting Multiword Units. International Conference on Mathematics of Language.

Singhal, A. (2001). "Modern Information Retrieval: A Brief Overview." Bulletin of the IEEE Computer Society Technical Committee on Data Engineering 24(4): 35-43.

Smadja, F., K. R. McKeown, et al. (1996). "Translating collocations for bilingual lexicons: a statistical approach." Computational Linguistics 22(1): 1-38.

Steidinger, A. (2000). Comparison of different Collection Fusion Models in Distributed Information Retrieval. DELOS Workshop on Information Seeking, Searching and Querying in Digital Libraries.

Sun, L. and G. C. Chen (2001). Implementation of a large-scale distributed information retrieval system. Proceedings of the 2001 International Conferences on Info-tech and Info-net (ICII 2001), Beijing.


Tang, C., Z. Xu, et al. (2003). Peer-to-peer information retrieval using self-organizing semantic overlay networks. Proceedings of the 2003 conference on Applications, technologies, architectures, and protocols for computer communications, Karlsruhe, Germany, ACM Press New York, NY, USA.

Towell, G., E. M. Voorhees, et al. (1995). Learning Collection Fusion Strategies for Information Retrieval. The Twelfth Annual Machine Learning Conference. Lake Tahoe.

Viles, C. L. and J. C. French (1995). Dissemination of Collection Wide Information in a Distributed Information Retrieval System. Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.

Voorhees, E. M., N. K. Gupta, et al. (1994). The Collection Fusion Problem. The Third Text REtrieval Conference (TREC-3). Gaithersburg, M.D., National Institute of Standards and Technology.


Wu, G. (2004). Research and Application on Statistical Language Model. Computer science and technology. Beijing, Tsinghua University, China.

Yan, Q., G. Gregory, et al. (2003). Automatic transliteration for Japanese-to-English text retrieval. Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval. Toronto, Canada, ACM Press.

Yuwono, B. and D. L. Lee (1997). Server Ranking for Distributed Text Retrieval Systems on the Internet. Database Systems for Advanced Applications.

Zeinalipour-Yazti, D., V. Kalogeraki, et al. (2003). "Exploiting locality for scalable information retrieval in peer-to-peer networks." Information Systems, in press.

Zhang, Y. and P. Vines (2004). Using the web for automated translation extraction in cross-language information retrieval. Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval. Sheffield, United Kingdom, ACM Press.

Zhao, M.-Y., J. Shen, et al. (2006). "A New Algorithm for Automatic Text Classification." Journal of Yangzhou University (Natural Science Edition) 9(1).

