ISCSLP’02 L. F. Chien
Information Retrieval Techniques for Spoken Language Processing
Lee-Feng Chien (簡立峰)
Institute of Information Science, Academia Sinica, Taiwan
ISCA Archive
Outline
• IR vs. SLP
• Conventional IR Techniques
• Web IR Techniques
• Web Mining Techniques
• Term Clustering through Web Mining
• Anchor Text Mining
I. IR vs. SLP
Information Retrieval
A research field whose long-term goal is to explore techniques for storing, classifying, extracting, indexing, and browsing information, for retrieval from unstructured collections such as textual documents.
Different Research Aspects
• Text IR: text indexing, searching, presentation, user study
• Web IR: crawling, page ranking, distributed search, scalability, multilingual and multicultural issues
• Multimedia IR: retrieving multimedia content such as speech, audio, music, images, and video
• Intelligent IR: advanced language and information processing topics such as question answering, cross-language retrieval, information tracking, information extraction, summarization, and speech interaction
Chinese Information Retrieval
• Language issues: word segmentation, term extraction, parsing, linguistic resources, font display, conversion between simplified and traditional Chinese, etc.
• Culture/geographical issues: Chinese pages come from all over the world, language identification is required, preferred topics differ, etc.
• Chinese people issues: user behaviors
Web Users and Pages (3 years ago)
Area         Users   Web Pages   Time
World-wide   150M    800M        7/99
China        4M      2.5~3M      7/99
Taiwan       4M      3M          7/99
Number of Chinese Web Pages
• 328,000,000 pages
• Scalability problem!
Number of Web Pages
The world’s largest search engine?
• 2,073,418,204 pages (Google)
• 2,095,568,809 pages (FAST)
Why Is IR Useful for SLP?
• ASR or dictation machine: lexicon, corpus, and language model
• Voice portal: search via spoken queries
• Speech retrieval: indexing and searching
• Topic detection and tracking: document classification and clustering
IR vs. SLP
[Figure: diagram relating Natural Language Processing, Speech Recognition, and Information Retrieval, with the labels Search Engine, Dictation Machine, Parser, IR via Voice, Q&A, and Speech IR.]
Research Paradigm
[Figure: diagram with the labels Search Engine, Dictation Machine, Parser, IR via Voice, and Web Mining.]
Anchor Text Mining for Query Translation (Lu, ICDM’01)

English       Traditional Chinese   Simplified Chinese   Japanese
Sony          新力                  索尼                 ソニー
Nike          耐吉                  耐克                 ナイキ
Stanford      史丹佛                斯坦福               スタンフォード
Sydney        雪梨                  悉尼                 シドニー
internet      網際網路              互联网               インターネット
network       網路                  网络                 ネットワーク
homepage      首頁                  主页                 ホームページ
computer      電腦                  计算机               コンピューター
database      資料庫                数据库               データベース
information   資訊                  信息                 インフォメーション

Cross-Language Web Search
A Web search service that allows users to query in one language and search documents that are written or indexed in another language.
II. Conventional IR Techniques
The Vector Space Model
• Measures closeness between a query and a document.
• Queries and documents are represented as n-dimensional vectors.
• Each dimension corresponds to a word/term.
• Advantages: conceptual simplicity, and the use of spatial proximity as a stand-in for semantic proximity.
Vector Similarity
d = The man said that a space age man appeared
d’ = Those men appeared to say their age
Vector Similarity (Cont.)
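Cosine measure (normalized correlation coefficient):
\[ \cos(\vec{q}, \vec{d}) = \frac{\sum_{i=1}^{n} q_i d_i}{\sqrt{\sum_{i=1}^{n} q_i^2}\,\sqrt{\sum_{i=1}^{n} d_i^2}} \]
Euclidean distance:
\[ |\vec{q} - \vec{d}| = \sqrt{\sum_{i=1}^{n} (q_i - d_i)^2} \]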
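The two measures can be tried on the example sentences d and d’ from the Vector Similarity slide. This is a minimal sketch, assuming raw lowercase word counts with no stemming or stopword removal (so "man" and "men" stay distinct); the function names are illustrative and not from the talk.

```python
import math
from collections import Counter

def to_vector(text):
    # Bag-of-words vector: one dimension per distinct lowercase token.
    return Counter(text.lower().split())

def cosine(u, v):
    # Dot product over shared terms, divided by the two vector lengths.
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def euclidean(u, v):
    # A Counter returns 0 for missing terms, so the union of keys is enough.
    return math.sqrt(sum((u[t] - v[t]) ** 2 for t in set(u) | set(v)))

d1 = to_vector("The man said that a space age man appeared")
d2 = to_vector("Those men appeared to say their age")
print(cosine(d1, d2), euclidean(d1, d2))
```

Only "age" and "appeared" are shared, so the cosine is low; with stemming the two sentences would look considerably closer.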
Term Weighting
Quantities used:
• tf_{i,j} (term frequency): number of occurrences of w_i in d_j
• df_i (document frequency): number of documents in which w_i occurs
• cf_i (collection frequency): total number of occurrences of w_i in the collection
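The three quantities can be computed in one pass over a collection. The sketch below assumes whitespace tokenization and a toy three-document collection; both are illustrative choices, not data from the talk.

```python
from collections import Counter

def term_stats(docs):
    # tf[j][w]: occurrences of term w in document j
    tf = [Counter(doc.lower().split()) for doc in docs]
    # df[w]: number of documents that contain w at least once
    df = Counter(term for counts in tf for term in counts)
    # cf[w]: total occurrences of w across the whole collection
    cf = Counter()
    for counts in tf:
        cf.update(counts)
    return tf, df, cf

docs = ["web search engine", "web mining and web search", "speech retrieval"]
tf, df, cf = term_stats(docs)
print(tf[1]["web"], df["web"], cf["web"])
```

Note that df counts a document once however often the term repeats in it, while cf counts every occurrence.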
Inverted File for Keyword Matching
Google’s Index File Structure
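As a rough illustration of keyword matching with an inverted file, the sketch below maps each term to the (document, position) pairs where it occurs and answers a query by intersecting document sets. It is a toy in-memory structure under assumed whitespace tokenization, and is not a description of Google's actual index layout.

```python
from collections import defaultdict

def build_inverted_file(docs):
    # term -> list of (doc_id, position) postings
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for pos, term in enumerate(text.lower().split()):
            index[term].append((doc_id, pos))
    return index

def search(index, query):
    # Boolean AND: keep documents that contain every query term.
    doc_sets = [{doc_id for doc_id, _ in index.get(term, [])}
                for term in query.lower().split()]
    return sorted(set.intersection(*doc_sets)) if doc_sets else []

docs = ["web search engine", "web mining and web search", "speech retrieval"]
index = build_inverted_file(docs)
print(search(index, "web search"))
```

Keeping positions in the postings is what later allows phrase or proximity matching without rereading the documents.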
Chinese IR & Indexing Unit Section
• Character-based indexing and search
  • speed/space problem
  • incorrect matching due to free combination of characters, e.g., 電腦科學 (computer science)
• Word-based indexing and search
  • a lexicon is both a prerequisite and a limitation
  • unknown word identification, disambiguation of word segmentation
• Csmart’s approach (Chien, SIGIR’95)
  • signature-based, feature grouping (unigram and bigram characters)
  • two-stage search, fuzzy search; suited to files that are not too large and to fuzzy-search needs
Chinese Track in TREC’5
• Berkeley (A. Chen, SIGIR’97)
  • Indexing units
    • use a dictionary to segment texts
    • dictionary obtained from the public domain, with 91,000 words and phrases
    • stopword list with 444 entries
  • Searching algorithm, segmentation method
    • 0.461 average precision for manual queries, 0.32 for the automatic run
• CUNY (Kwok, SIGIR’97)
  • Indexing units
    • use 2,000 words to segment texts initially
    • use a learning strategy to extend the word entries to 15,000 finally
  • Searching algorithm, segmentation method
    • 0.40 in word-based; 0.42 in word- and character-based
Text Indexing
• Indexing Chinese texts
  • Indexing units: single character, bi-character, word, term, string
  • New word identification and word segmentation problems
• Indexing English texts
  • Indexing units: word, term, string (few)
  • Problems: stemming, capitalization, hyphens, word sense disambiguation, typing errors
• Structure for indexing: inverted file, PAT array
• Term extraction and term clustering are the key techniques required
Term Extraction
• A term is a meaningful and representative unit for information retrieval, e.g., a name, location, proper noun, or topic
• Terms are derived units, and most are missing from common dictionaries
• Term extraction can reduce word sense ambiguity in text retrieval and remedy weaknesses of word-based approaches, e.g.:
  • computer network (linking, net, mesh)
  • Bank America, current theory
• Term extraction is the first step toward concept-based IR
Chinese Term Extraction
• PAT-tree-based approach
• Poster Presentation Award by ACM SIGIR’98
Context Dependency
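[Figure: a lexical pattern is judged by its association and by its freedom of usage, measured as Left Context Dependency (LCD) and Right Context Dependency (RCD). Example strings shown: 新加坡 (Singapore), 國大 (National University), 政府 (government), 到達 (arrive at), 訪問 (visit).]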
Example of the PAT tree (Chien, SIGIR’97)
Data stream: 個人電腦, 人腦
Data positions: 0 2 4 6 9 11

Semi-infinite strings (bit positions 1, 9, 17, 25, ...):
  個人電腦   10101101 11010011 10100100 ...
  人電腦     10100100 01001000 10111001 ...
  電腦       10111001 01110001 00000000 ...
  腦         10111000 00000000 00000000 ...
  人腦       10100100 01001000 00000000 ...
  腦         10111000 00000000 00000000 ...

Node labels in the tree, given as (comparison bit, # external nodes, frequency): (24,2,1), (0,6,1), (4,6,1), (5,3,1), (8,3,2)
Data positions shown: 0, 4, 2, 9, 6

English analogue for the data stream "abcd, ed":
  Suffix   Prefixes
  abcd     (a, ab, abc, abcd)
  bcd      (b, bc, bcd)
  cd       (c, cd)
  d        (d)
  ed       (e, ed)
  d        (d)
Strings shown in the corresponding tree: abcd, cd, bcd, ed, d
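The suffix/prefix table above can be reproduced with a few lines of code. This sketch enumerates the semi-infinite strings (suffixes) of each segment and counts every prefix of every suffix, which yields the frequency of each substring. It uses a plain counter as a stand-in and is an assumption-laden simplification; a real PAT tree stores the same information compactly as a binary trie over the bit strings.

```python
from collections import Counter

def semi_infinite_strings(segment):
    # Every suffix of the segment, starting at each character position.
    return [segment[i:] for i in range(len(segment))]

def prefixes(s):
    return [s[:k] for k in range(1, len(s) + 1)]

def substring_frequencies(segments):
    # Each substring is a prefix of exactly one suffix per occurrence.
    freq = Counter()
    for segment in segments:
        for suffix in semi_infinite_strings(segment):
            freq.update(prefixes(suffix))
    return freq

english = substring_frequencies(["abcd", "ed"])
chinese = substring_frequencies(["個人電腦", "人腦"])
print(english["d"], chinese["腦"], chinese["電腦"])
```

Substrings that occur often, such as 腦 here, are the raw candidates that the term-extraction step then filters.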
Incremental Term Extraction
Term length   New terms   Documents with   Avg. document inputs     Avg. frequency of   Avg. frequency at which
(N-gram)      extracted   new terms        to find new terms (A)    extracted terms     a term can be extracted
2             776         515              3.93                     34.22               9.37
3             416         325              6.04                     24.60               9.09
4             171         157              12.16                    19.22               8.97
5             51          49               37.28                    20.35               9.18
6             17          17               109.81                   27.00               8.65
7             15          15               123.60                   27.40               11.20
8             6           6                274.67                   13.00               9.83
9             3           3                205.67                   18.00               11.33
Total         1,455       814              2.41                     28.95               9.25

Table 4. Detailed results for incremental term extraction when the threshold value was larger than 2 in the significance analysis; the results were obtained from a total of 1,872 political news abstracts published in July 1997.
Incremental Term Extraction (Cont.)
S(Y)    Total extracted   Correct terms   Correct terms outside   Precision   Recall
        terms (A)         extracted (B)   dictionary (C)          (B/A)
>1.5    2,291             1,374           297                     0.60        0.53
>2      1,455             1,135           258                     0.78        0.44
>2.5    723               593             172                     0.82        0.23
>3      214               184             66                      0.86        0.07

Table 3. Testing results for incremental term extraction using different threshold values in the significance analysis; the results were obtained from a total of 1,872 political news abstracts published in July 1997.
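The precision column of Table 3 follows directly from columns A and B. The short check below recomputes it from the counts in the table; the variable names are illustrative. Recall is not recomputed because the slide does not give the number of reference terms it was measured against.

```python
# (threshold on S(Y), total extracted terms A, correct terms extracted B)
ROWS = [(">1.5", 2291, 1374), (">2", 1455, 1135), (">2.5", 723, 593), (">3", 214, 184)]

def precision(extracted, correct):
    # Precision = B / A: the share of extracted terms that are correct.
    return correct / extracted

table = {threshold: round(precision(a, b), 2) for threshold, a, b in ROWS}
print(table)
```

Raising the threshold trades recall for precision: fewer terms pass the significance test, but a larger share of them are correct.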
How to deal with low-frequency terms ?
Characteristics of the Chinese Language
[Figure: three aspects of the language: Semantics, Form, Phonetics]
• Word segmentation: 一台大電腦 reads as "a large computer" (一台 / 大 / 電腦), but the same string also contains 台大 (National Taiwan University)
• Word segmentation: 參加一台大電腦會議 (Attend a computer science meeting at National Taiwan University)
• New word identification: 台大電腦公司 (Taida Computer Inc.)
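The ambiguity in 一台大電腦 can be made concrete by enumerating every way to split the string into lexicon words. The sketch below is a plain exhaustive search over a tiny hand-picked lexicon, which is an assumption for illustration; real segmenters add statistics to choose among the candidates.

```python
def segmentations(text, lexicon):
    # Return every way to split text into a sequence of lexicon words.
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        word = text[:end]
        if word in lexicon:
            for rest in segmentations(text[end:], lexicon):
                results.append([word] + rest)
    return results

# Hypothetical mini-lexicon: 一 (one), 一台 (one unit of), 台大 (NTU), 大 (large), 電腦 (computer)
LEXICON = {"一", "一台", "台大", "大", "電腦"}
candidates = segmentations("一台大電腦", LEXICON)
for words in candidates:
    print(" / ".join(words))
```

Both readings are valid according to the lexicon, so the choice has to come from context. A new word such as 台大電腦 (Taida Computer) is not in the lexicon at all, which is the separate new-word identification problem.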
III. Web IR Techniques
Spectrums of Web Search
• Types of content
  • Text, e.g., Web text, documents, news
  • Audio, e.g., music, speech, sounds, broadcast news
  • Image, e.g., pictures, photos, graphics
  • Video, e.g., films, clips
  • Formatted data, e.g., products
• Scopes of content: general or specific
• Languages
• Scalability
  • Personal, content site, intranet, Internet
  • Thousands, millions, or billions of documents, users, queries
• Interface: Web-based, WAP-based, voice-based
An Analytical Model
[Figure: three spaces (User Space, Index Space, Document Space) between Users, who have an information need and seek, and Authors, whose information is used. Labels: Short Query, Subject Terms, Real Names; X, Y; X1, X2, ...; Y1, Y2, ...]
Different Web Search Models
• Yahoo: manual recommendation in index space
• AltaVista, Inktomi: full-text pattern matching in document space
• Google: citation information in document space
• RealNames: manual real-name retrieval in user space
• Direct Hit: collaborative analysis in user space
• Ask Jeeves: Q&A (or FAQ search) in specific domains
Hypertext on the Web
[Figure: information available around a Web page. Information sources labeled: local content, hyperlink reference, sibling information, Web usage information (query and click stream). Example page: Institute of Information Science (http://www.iis.sinica.edu.tw), Academia Sinica. Surrounding labels: Internal Affairs, People, IIS, CS&IE NTU, Research Institutions, SE.]
Basic Architecture of a Spider-based Web Search Engine
[Figure: a spider collects pages from the Web into an index; search engine (SE) nodes answer browser queries and keep logs. Annotations: 50M queries/day; quality results; spam; freshness; 3B pages; scalable, e.g., 20K PCs at Google.]
Crawling
[Figure labels: Indexed Page, Out Links, Out-link Traverse, Duplication, Authority, Authorized Pages.]
Indexed Features & Page Ranking
[Figure: features indexed for a page. Page title: Academia Sinica. Anchor texts: "Highest Government Research Institution in Taiwan", "Chien’s Lab". Other labels: Indexed Page, abstract, Popularity, Authority.]
Distributed Search
[Figure labels: Query, Query Processor, several search engines (SE), Document Delivery.]
Facts and Problems I
• Query
  • short query problem
  • 50% are personal and company names
  • Boolean or natural language queries are few
• Browsing
  • users do not go on to the 2nd page
  • precision is more important than recall
• Robot
  • low coverage, dead links, garbage sites and pages
Facts and Problems II- Relevancy
Who judges retrieval relevancy?
• Users
  • What do users want? What do they input?
  • Short query or NLQ?
  • HFQ, LFQ, or MFQ?
• Search engines
  • Technology limitations?
  • How many indexed pages? Millions or billions of pages?
  • Ranking algorithm?
Quality vs. Quantity
Source: The Direct Hit Technology: A White Paper (http://system.directhit.com/whitepaper.html)
Facts and Problems III - Speed
What determines retrieval speed?
• Users
  • Where are you, and how is the bandwidth?
  • Dial-up or T1? Cache or proxy?
  • What is the query?
  • HFQ, LFQ, or MFQ?
• Search engines
  • How far away and how good is the infrastructure?
  • Document delivery speed is the key
Important Issues
• Web user study
  • User behavior analysis
  • Query log mining
• Content indexing
  • Language identification
  • Information conversion
  • Unified indexing
  • Term extraction
  • Term clustering
• Content searching
  • English-Chinese bilingual search
  • Concept-based search
  • Personalized search
  • Cross-language search
• Content presentation
  • Concept-based term suggestion
  • Concept-based search result clustering
Language Distribution (Pu’2000)
Table 3. Statistics on the language used in each search term

          All Chinese   All English   Other
Dreamer   78.20%        19.18%        2.62%
GAIS      78.22%        16.90%        4.88%
Term Length
Table 4. Statistics on the number of terms per query

          In Chinese        In English   All
Dreamer   3.18 characters   1.10 words   6.31 bytes
GAIS      3.55 characters   1.22 words   7.26 bytes
Information Needs by Subject Categories
[Figure: charts of query distribution by subject category. First chart: Adult, Computer, Entertainment, Chat, Life, Education, Travel, Game, Business, Society, Media, Humanities, Health, Science, Other. Second chart: Software, Graph, Search engine, Network, Company, Hardware, BBS, Other.]
Query categorization approach (Pu, JASIS’02); obtained from analyzing about 2M queries.
Term Conversion
IV. Web Mining
Web Search
[Figure: millions of users reach the Web (logs, texts, images, ...) through a search engine; the goal is information seeking.]
Web Mining
[Figure: the same setting of millions of users, a search engine, and the Web (logs, texts, images, ...); the goal is knowledge discovery.]
Taxonomy of Web Mining (R. Cooley)
• Web Mining (DM)
  • Web Content Mining
  • Web Structure Mining
  • Web Usage Mining
Web Content Mining
• Most work focuses on extracting knowledge from the text of Web pages
  • Web page classification (Chuang & Chien, IRWK’02)
• Text mining
  • Web information extraction
  • XML/Semantic Web mining
  • Message understanding (NLP viewpoint)
• Multimedia content mining
  • Web image classification (Tseng, IRWK’02)
  • Speech archive mining (Chien, ISCSLP’02)
Web Page Classification Applications
• CMU Web KB Project (1998-2000) [Craven98]
• Classifying Web pages is an essential step in constructing a Web knowledge base
Web Usage Mining
• Data gathering: Web server logs, site description data, concept hierarchies
• Data preparation: distinguish among users, build sessions
• Data mining: pattern discovery and analysis
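The data preparation step (distinguish among users, build sessions) might look like the sketch below. It assumes the log has already been parsed into (user, timestamp in seconds, URL) records and uses a 30-minute inactivity timeout to split sessions; the record layout and the timeout are assumptions for illustration, not details from the talk.

```python
from collections import defaultdict

def build_sessions(log, timeout=1800):
    # Group requests by user, then start a new session whenever the gap
    # to that user's previous request exceeds `timeout` seconds.
    by_user = defaultdict(list)
    for user, ts, url in sorted(log, key=lambda rec: (rec[0], rec[1])):
        sessions = by_user[user]
        if sessions and ts - sessions[-1][-1][0] <= timeout:
            sessions[-1].append((ts, url))
        else:
            sessions.append([(ts, url)])
    return dict(by_user)

log = [("a", 0, "/"), ("b", 100, "/"), ("a", 600, "/papers"), ("a", 5000, "/people")]
sessions = build_sessions(log)
print({user: len(s) for user, s in sessions.items()})
```

In practice the hard part is the "distinguish among users" half, since proxies and shared machines blur the user key; the sketch simply trusts it.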
Web Structure Mining
• Google’s PageRank
• Document citation (CiteSeer)
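For readers unfamiliar with PageRank, here is a minimal power-iteration sketch on a three-page graph. The damping factor 0.85, the fixed iteration count, and the uniform treatment of pages without out-links are common textbook choices assumed here; this is not a description of Google's production system.

```python
def pagerank(links, damping=0.85, iterations=50):
    # links: page -> list of pages it links to
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                share = damping * rank[p] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Page without out-links: spread its rank over all pages.
                for t in pages:
                    new_rank[t] += damping * rank[p] / n
        rank = new_rank
    return rank

ranks = pagerank({"A": ["C"], "B": ["C"], "C": ["A"]})
print(ranks)
```

Page C is cited by both A and B and ends up with the highest rank, while B, which nobody cites, keeps only the baseline share.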
Semantic Web Mining
• Current Web: most Web content is designed for humans to read, not for machines to manipulate meaningfully
• Semantic Web: XML + RDF + Ontology + Agent
• Semantic Web mining: automatic construction of ontologies, case-based reasoning/inference
[Figure labels: RDF1, RDF2]
V. Term Clustering Through Web Mining
Term Clustering (Chuang’02)
Hierarchical clustering
[Figure: dendrogram over the query terms below, cut into five clusters]
• 武俠, 金庸, 武俠小說, 黃易, 作家 (roughly: martial-arts fiction and its authors)
• 補帖, 大補帖, 泡麵, dbt (roughly: slang for pirated software)
• eva, 長榮航空 (EVA airline), 長榮 (EVA), 航空公司 (airline), 航空 (airway), 華航 (China airline), 中華航空 (China airline)
• 占卜, 塔羅牌, 算命, 紫微斗數, 命理, 姓名學, 心理測驗, 星座, 愛情 (roughly: fortune-telling, astrology, and related topics)
• 勞委會, 職訓局, 就業, 青輔會, 自傳, 徵才, 人力資源, 104人力銀行, 人力銀行, 找工作, 履歷表, 求職, 求才 (roughly: employment and job search)
Hierarchical Query Clustering
[Screenshots of example clusters, titled: Virus; Virus synonyms; Security; Security, Screen, Microsoft.]
Search Result Clustering
[Figure labels: Google, ConceptSearch, Query Taxonomy; example query: Taiwan University.]
Personalized Search
[Figure labels: SE, Document Space, Document Taxonomy, Query Space, Query Taxonomy, Personal Directory Trees, Users (Information Need), Authors (Information Use).]
Hierarchical Query Clustering
The Steps
• Feature extraction: use co-occurring seed terms extracted from the top retrieved pages
• Term vector: each query term is assigned a term vector that records the co-occurring feature terms and their frequency values in the retrieved documents
• Term similarity: tf*idf-based cosine measure
• Hierarchical term clustering: cluster popular query terms in the log into initial categories; query terms with similar features are grouped into clusters
Feature Extraction
Use co-occurring seed terms extracted from the top retrieved pages.

Example: top retrieved pages for the query term "nude"
• Creative Nude Photography Network -- Fine Art Nude and ... The Creative Nude and Erotic Photography Network is the number one net portal to the best in fine art nude and erotic photography! Over 100 CNPN Member Sites ...
• Nude Places ... to be naked. Walking in the forest, cruising the lake in open boats, swimming, picnicking and nude photography are all enjoyed in the nude. 60 minutes $39.95. ...
• A Brave Nude World ... A Brave Nude World! Warning: This site contains links to fine art nude & erotic photography. If you are under 18 or do not wish to view this material, You can ...

Co-occurring feature terms:
term                 tf/df
photography          2/2
art                  3/2
...
naked                1/1
erotic photography   3/2
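The tf/df values on the Feature Extraction slide can be reproduced from the three snippets shown for the query "nude". The sketch below assumes that feature terms are matched longest-first and that a matched span is not counted again for a shorter term (this is what makes "photography" come out as 2/2 rather than higher); that matching rule is inferred from the numbers, not stated in the talk, and the snippets are abbreviated.

```python
import re
from collections import Counter

def tf_df(snippets, features):
    tf, df = Counter(), Counter()
    longest_first = sorted(features, key=len, reverse=True)
    for text in snippets:
        text = text.lower()
        for term in longest_first:
            pattern = r"\b" + re.escape(term) + r"\b"
            hits = len(re.findall(pattern, text))
            if hits:
                tf[term] += hits
                df[term] += 1
                # Mask matched spans so shorter terms do not recount them.
                text = re.sub(pattern, " ", text)
    return {term: (tf[term], df[term]) for term in features}

SNIPPETS = [
    "Creative Nude Photography Network -- Fine Art Nude and ... The Creative Nude "
    "and Erotic Photography Network is the number one net portal to the best in "
    "fine art nude and erotic photography!",
    "Nude Places ... to be naked. Walking in the forest ... picnicking and nude "
    "photography are all enjoyed in the nude.",
    "A Brave Nude World! Warning: This site contains links to fine art nude & "
    "erotic photography.",
]
FEATURES = ["erotic photography", "naked", "art", "photography"]
stats = tf_df(SNIPPETS, FEATURES)
print(stats)
```

The resulting (tf, df) pairs become the components of the term vector for the query term.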
Term Weighting
Extraction of Basic Feature Terms
• Performance of different features: randomly selected, high-frequency, and seed terms
• Popular queries not affected by ephemeral trends, e.g., "movie", "basketball", "mutual fund", etc.
• More expressive and distinguishable in describing a particular category
• Two logs were compared, and 9,709 overlapping top query terms were extracted as feature terms

G-1999 / D-1998    Top 1,000 terms   Top 20,000 terms   All
Top 1,000 terms    583 / 58.30%      977 / 97.70%       992 / 99.20%
Top 20,000 terms   914 / 91.40%      9,709 / 50.71%     14,721 / 76.89%
Term Similarity
Hierarchical Term Clustering
Agglomerative hierarchical clustering (AHC)
1. Compute the similarity between all pairs of clusters
   • Estimate the similarity between all pairs of their member terms
   • Use the lowest term similarity value as the cluster similarity value
2. Merge the two most similar (closest) clusters
   • Complete-linkage method
3. Update the cluster vector of the new cluster
4. Repeat steps 2 and 3 until only a single cluster remains
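The AHC steps above can be sketched in a few lines. This version assumes term vectors are given as plain dictionaries of feature weights and uses cosine similarity between terms; instead of maintaining a cluster vector, it recomputes the complete-linkage similarity from the member terms on every pass, which is simpler but slower. The four toy vectors are made up for illustration.

```python
import math

def cosine(u, v):
    dot = sum(w * v[f] for f, w in u.items() if f in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def ahc_complete_linkage(vectors):
    # Start with one cluster per term; record each merge and its similarity.
    clusters = [[term] for term in vectors]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Complete linkage: cluster similarity is the LOWEST
                # similarity over all cross-cluster term pairs.
                sim = min(cosine(vectors[a], vectors[b])
                          for a in clusters[i] for b in clusters[j])
                if best is None or sim > best[0]:
                    best = (sim, i, j)
        sim, i, j = best
        merged = clusters[i] + clusters[j]
        merges.append((merged, sim))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

VECTORS = {
    "airline": {"flight": 3, "ticket": 2},
    "airway": {"flight": 2, "ticket": 1},
    "tarot": {"fortune": 3, "card": 2},
    "horoscope": {"fortune": 2, "star": 2},
}
merges = ahc_complete_linkage(VECTORS)
for members, sim in merges:
    print(round(sim, 3), members)
```

The merge list is the dendrogram in flat form: cutting it at a similarity threshold yields the cluster partition.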
The Clustering Algorithm
Clustering Results
[Figure: the dendrogram and five clusters shown earlier on the "Term Clustering (Chuang’02)" slide.]
Cluster Partition
Quality Function
The Clustering Partition Algorithm
Preliminary Experiment
• Test queries
  • Two sets: the top 1k queries and 1k random queries
  • Each test query was manually assigned to its corresponding classes
• Evaluation metric
  • F-measure
Evaluation: F-Measure
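The formula on this slide did not survive extraction. As a reminder of the usual definition, the sketch below computes the F-measure of one generated cluster against one manually assigned class as the harmonic mean of precision and recall. Whether the talk used exactly this per-pair form, or a weighted aggregate over all classes, is an assumption; the example terms are made up.

```python
def f_measure(cluster, target_class):
    cluster, target_class = set(cluster), set(target_class)
    overlap = len(cluster & target_class)
    if overlap == 0:
        return 0.0
    precision = overlap / len(cluster)      # share of the cluster that is correct
    recall = overlap / len(target_class)    # share of the class that was found
    return 2 * precision * recall / (precision + recall)

score = f_measure({"a", "b", "c", "d"}, {"a", "b", "e"})
print(score)
```

Here precision is 2/4 and recall is 2/3, giving F = 4/7.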
Obtained F-Measures
Results of Hierarchical Structure Generation
VI. Anchor Text Mining
Observation of Anchor Text
[Figure, built up over three slides. Source query: "Yahoo". Pages: www.yahoo.com and www.yahoo.com.tw. Anchor texts shown on links pointing to these pages: "Yahoo", "雅虎" (the Chinese name of Yahoo), "- in USA", "Taiwan - 台灣 (Taiwan) - 搜尋引擎 (search engine)". The terms that co-occur with the source query in the anchor-text set are the translation candidates. In-link counts shown: 187 and 21. Labels: Source Query, Translation Candidates, Anchor-Text Set, Page Authority, Co-occurrence.]
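Probabilistic Inference Model

Asymmetric model:
\[ P(T_t \mid T_s) = \frac{P(T_s \cap T_t)}{P(T_s)} \]

Symmetric model with link information:
\[
\begin{aligned}
P(T_s \leftrightarrow T_t) &= \frac{P(T_s \cap T_t)}{P(T_s \cup T_t)} \\
&= \frac{\sum_{U_i} P(T_s \cap T_t \mid U_i)\, P(U_i)}{\sum_{U_i} P(T_s \cup T_t \mid U_i)\, P(U_i)} \\
&\approx \frac{\sum_{U_i} P(T_s \mid U_i)\, P(T_t \mid U_i)\, P(U_i)}{\sum_{U_i} \left[ P(T_s \mid U_i) + P(T_t \mid U_i) - P(T_s \mid U_i)\, P(T_t \mid U_i) \right] P(U_i)}
\end{aligned}
\]
where
\[ P(U_i) = \frac{\#\text{ of in-links of } U_i}{\text{total } \#\text{ of in-links}} \]

[Figure labels: Page Authority, Co-occurrence, PageRank]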
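The symmetric model with link information can be sketched as follows. Each page U_i is represented by the list of anchor texts on its in-links, so the list length is its in-link count. The estimate P(T | U_i) = (in-links of U_i whose anchor text contains T) / (in-links of U_i) is an assumption made for this sketch, as are the toy anchor texts and the substring test used for "contains".

```python
def symmetric_link_score(source, target, pages):
    # pages: one list of anchor-text strings per page U_i (one string per in-link)
    total_in_links = sum(len(anchors) for anchors in pages)
    numerator = denominator = 0.0
    for anchors in pages:
        if not anchors:
            continue
        p_u = len(anchors) / total_in_links                    # P(U_i): page authority
        p_s = sum(source in a for a in anchors) / len(anchors)  # P(T_s | U_i)
        p_t = sum(target in a for a in anchors) / len(anchors)  # P(T_t | U_i)
        numerator += p_s * p_t * p_u
        denominator += (p_s + p_t - p_s * p_t) * p_u
    return numerator / denominator if denominator else 0.0

PAGES = [
    ["Yahoo", "Yahoo", "雅虎", "Yahoo 雅虎"],   # hypothetical in-links of one page
    ["雅虎", "Yahoo"],                          # hypothetical in-links of a second page
    ["搜尋引擎", "Google"],                     # an unrelated page
]
good = symmetric_link_score("Yahoo", "雅虎", PAGES)
bad = symmetric_link_score("Yahoo", "搜尋引擎", PAGES)
print(good, bad)
```

Candidates that tend to appear on links to the same pages as the source term score high, and pages with more in-links weigh more in the sum.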
Effects of Different Models

Models compared:
• MA: asymmetric model
• MAL: asymmetric model with link information
• MS: symmetric model
• MSL: symmetric model with link information

Type of model   Top-1   Top-10
MA              41%     81%
MAL             44%     83%
MS              51%     84%
MSL             53%     85%

* Training data: 109,416 anchor-text sets from 1,980,816 pages
* Test query set: 622 English terms from the 1,230 most popular English queries
Performance
Top-n inclusion rates obtained with three different approaches.

Approach                                           Top-1   Top-2   Top-3   Top-4   Top-5
Anchor text mining                                 57.0%   68.6%   74.3%   77.9%   80.1%
Dictionary lookup                                  30.5%   30.5%   30.5%   30.5%   30.5%
Combined anchor text mining and dictionary lookup  74.0%   82.9%   86.7%   88.9%   90.2%
References
Overview
• Chien, L. F., Pu, H. T., “Important Issues on Chinese Information Retrieval”, Computational Linguistics and Chinese Language Processing, August 1996, pp. 205-221.
• Lua, K. T., “Chinese Information Processing – Past, Present and Future”, IRAL’98.

Text Retrieval
• Chien, L. F., “Fast and Quasi-Natural Language Search for Gigabytes of Chinese Texts”, ACM SIGIR’95.
• Huang, X. and Robertson, “Okapi Chinese Text Retrieval Experiments at TREC-6”.
• Nie, J., “On Chinese Text Retrieval”, SIGIR’96.
• Kwok, K. L., “Comparing Representations in Chinese Information Retrieval”, SIGIR’97.
• Chen, A., “Chinese Text Retrieval without Using a Dictionary”, SIGIR’97.

Text Classification
• Lam, W., “Using a Generalized Instance Set for Automatic Text Categorization”, SIGIR’98, pp. 81-89.

Natural Language Processing
• Chen, K. J. et al., “Word Identification for Mandarin Chinese Sentences”, COLING’92.
• Sproat, R. and Shih, C., “A Statistical Method for Finding Word Boundaries in Chinese Text”, CPCOL, 1989, pp. 240-271.
• Wu, Zimin and Tseng, G., “Chinese Text Segmentation for Text Retrieval: Achievements and Problems”, JASIS, 1994, pp. 532-542.

Term Extraction
• Gao, J., et al., “Toward a Unified Approach to Statistical Language Modeling for Chinese”, ACM Trans. on Asian Language Information Processing, 2002, pp. 3-33.
• Chang, J. S. and Su, K. Y., “An Unsupervised Iterative Method for Chinese New Lexicon Extraction”, International Journal of Computational Linguistics & Chinese Language Processing, August 1997, pp. 97-147.
• Chien, L. F., “PAT-Tree-Based Keyword Extraction for Chinese Information Retrieval”, SIGIR’97, pp. 50-58.
• Lai, Y. S. and Wu, C. H., “Meaningful Term Extraction and Discriminative Term Selection in Text Categorization via Unknown-Word Methodology”, ACM Trans. on Asian Language Information Processing, 2002, pp. 34-64.

Web IR
• Pu, H. T., “Understanding Chinese Users' Information Behaviors through Analysis of Web Search Term Logs”, Journal of Computers (電腦學刊), December 2000, pp. 75-82.

Speech Retrieval
• Wang, H. M., “Experiments in Syllable-based Retrieval of Broadcast News Speech in Mandarin Chinese”, Speech Communication, 2000, pp. 49-60.
• Li, X. and Croft, B., “Evaluating Question-Answering Techniques in Chinese”, 2001.

Information Tracking and Detection
• Chen, H. H. and Ku, L. W., “An NLP & IR Approach to Topic Detection”, Topic Detection and Tracking: Event-based Information Organization, Chapter 12, ed. by Allan, J., Kluwer Academic Publishers, 2002, pp. 243-264.

Cross-Language IR
• Kwok, K. L., “Evaluation of an English-Chinese Cross-lingual Retrieval Experiment”, TREC’97.

Web Mining
• Kosala, R., & Blockeel, H. (2000). Web Mining Research: A Survey. SIGKDD Explorations, 2(1), 1-15.
• Srivastava, J., Cooley, R., Deshpande, M., & Tan, P.-N. (2000). Web usage mining: discovery and application of usage patterns from web data. SIGKDD Explorations, 1, 12-23.
• J. Srivastava & R. Cooley, Mining web data for e-commerce: concepts & applications, PKDD’01.
• Conferences & workshops: KDD 2001, PKDD 2001, WebKDD 1999, WebKDD 2000, WebKDD 2001.

Web Content Mining
• D. Mladenic et al., Text Mining: What if your data is made of words, PKDD’01.
• M. Hearst, Untangling Text Data Mining, ACL’99.
• (Chang et al., 2001), Chapter 6.
• Chakrabarti, S. (2000). Data mining for hypertext: A tutorial survey. SIGKDD Explorations 1(2), 1-11.

Web Structure Mining
• (Chang et al., 2001), Chapter 7.3.
• (Chakrabarti, 2000), see above.
• Page, L., Brin, S., Motwani, R., & Winograd, T. (1998). The PageRank Citation Ranking: Bringing Order to the Web.

Web Usage Mining
• (Srivastava et al., 2000), see above.
• Spiliopoulou, M. (2000). Web usage mining for site evaluation: Making a site better fit its users. Special Section of the Communications of the ACM on "Personalization Technologies with Data Mining", 43(8), 127-134.
• Cooley, R. (2000). Web Usage Mining: Discovery and Application of Interesting Patterns from Web Data. University of Minnesota.
• Borges, J. L. (2000). A Data Mining Model to Capture User Web Navigation Patterns. Department of Computer Science, University College London, London University.
• More references can be found at http://www.wiwi.hu-berlin.de/~berendt/lehre/2001w/wmi/literature.html

Text and Web page categorization
• S. Chakrabarti, B. Dom, and P. Indyk. Enhanced hypertext categorization using hyperlinks. SIGMOD’98, pp. 307-318, 1998.
• J. M. Pierre. Practical issues for automated categorization of Web sites. ECDL 2000 Workshop on the Semantic Web, 2000.
• C. Y. Quek. Classification of World Wide Web Documents. Senior Honors Thesis, School of Computer Science, CMU, May 1997.
• Y. Yang and X. Liu. A re-examination of text categorization methods. SIGIR’99, pp. 42-49, 1999.

Web page classification applications
• C. Chekuri, M. H. Goldwasser, P. Raghavan, and E. Upfal. Web search using automatic classification. WWW’97.
• M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam, and S. Slattery. Learning to extract symbolic knowledge from the World Wide Web. AAAI’98, pp. 509-516, 1998.
• M. Diligenti, F. M. Coetzee, S. Lawrence, C. L. Giles, and M. Gori. Focused crawling using context graphs. VLDB 2000, pp. 527-534, 2000.

Link and context analysis
• G. Attardi, A. Gulli, and F. Sebastiani. Automatic web page categorization by link and context analysis. Proceedings of THAI’99, European Symposium on Telematics, Hypermedia and Artificial Intelligence, pp. 105-119, 1999.
• S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. WWW’98.
• J. Dean and M. R. Henzinger. Finding related pages in the world wide web. WWW’99, pp. 389-401, 1999.
• J. Kleinberg. Authoritative sources in a hyperlinked environment. Proceedings of the 9th Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 668-677, 1998.

Works in Academia Sinica
1. S. L. Chuang, L. F. Chien, “Automatic Subject Categorization of Query Terms for Web Information Retrieval”, accepted by Decision Support Systems, 2002.
2. Lee-Feng Chien, et al., “Incremental Extraction of Domain-Specific Terms from Online Text Collections”, Recent Advances in Computational Terminology, ed. by D. Bourigault et al., 2001.
3. Lee-Feng Chien, “PAT-Tree-Based Adaptive Keyphrase Extraction for Intelligent Chinese Information Retrieval”, special issue on “Information Retrieval with Asian Languages”, Information Processing and Management, Elsevier Press, 1999.
4. W. H. Lu, L. F. Chien, H. J. Lee, “Mining Anchor Texts for Translation of Web Queries”, accepted by ACM Trans. on Asian Language Information Processing, 2002.
5. W. H. Lu, L. F. Chien, S. J. Lee, “Web Anchor Text Mining for Translation of Web Queries”, IEEE Conference on Data Mining, Nov., San Jose, 2001.
6. C. K. Huang, L. F. Chien, Y. J. Oyang, “Interactive Web Multimedia Search Using Query-Session-Based Query Expansion”, The 2001 Pacific Conference on Multimedia (PCM2001), Oct., Beijing.
7. C. K. Huang, Y. J. Oyang, L. F. Chien, “A Contextual Term Suggestion Mechanism for Interactive Search”, The First Web Intelligence Conference (WI’2001), Japan.
8. Lee-Feng Chien, “PAT-Tree-Based Keyword Extraction for Chinese Information Retrieval”, The 1997 ACM SIGIR Conference, Philadelphia, USA, pp. 50-58 (SIGIR’97).
Q&A
Thanks!