Post on 31-Dec-2015
A search-based Chinese Word Segmentation Method
——WWW 2007
Xin-Jing Wang: IBM China
Wen Liu: Huazhong Univ. China
Yong Qin: IBM China
Introduction
• Challenges in CWS
  • Ambiguity
  • Unknown words
• Web and search technology
  • Free from the OOV problem
  • Adaptive to different segmentation standards
  • Entirely unsupervised
The proposed approach
• Segment collecting
  • Split the query sentence into sub-sentences (by punctuation)
  • Submit each sub-sentence to a search engine
  • Collect the highlighted terms from the returned snippets

Query: “我明天要去止锚湾玩” (“I am going to 止锚湾 to have fun tomorrow”)
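The collecting steps above can be sketched as follows. This is only an illustration: the helper names are ours, and the `<em>` highlight tag is an assumption, since the slides do not specify how a given engine marks query-term hits in its snippets.

```python
import re

def split_by_punct(query):
    """Split a query sentence into sub-sentences at punctuation marks."""
    parts = re.split(r"[，。！？,.!?;；]+", query)
    return [p for p in parts if p]

def collect_segments(snippets):
    """Collect the highlighted terms from returned search-result snippets.

    Assumes the engine wraps hits in <em>...</em> tags (an assumption;
    real engines vary).
    """
    segments = []
    for s in snippets:
        segments.extend(re.findall(r"<em>(.*?)</em>", s))
    return segments
```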
The proposed approach
• Segment scoring: select a subset of segments as the final segmentation
  • Frequency-based: term frequency
    • a segment's occurrences divided by the total number of occurrences
  • SVM-based
    • an SVM classifier with an RBF kernel whose outputs are mapped into probabilities as the scores
• Reconstruct the query using the segmentation with the highest score
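A minimal sketch of the frequency-based score, i.e. each segment's share of all collected highlight occurrences; the function name is ours:

```python
from collections import Counter

def frequency_scores(segments):
    """Score each distinct segment as (its occurrences) / (total occurrences)."""
    counts = Counter(segments)
    total = sum(counts.values())
    return {seg: c / total for seg, c in counts.items()}
```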
The proposed approach
• Segment selection
  • Valid subset: its member segments can exactly reconstruct the query
  • Score of a valid subset: the average score of its member segments
  • Greedy search to find valid subsets, for efficiency
  • Select the valid subset with the highest score as the final segmentation
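The slides do not spell out the greedy search, so the following is one plausible left-to-right sketch under our own assumptions: at each position pick the highest-scoring known segment that matches, falling back to a single character, so the result always tiles the query exactly.

```python
def greedy_segment(query, scores):
    """Greedily pick high-scoring segments that reconstruct the query exactly.

    A simplified sketch, not the paper's exact procedure: scan left to right,
    at each position take the best-scoring matching segment (ties broken by
    length), and fall back to one character when no segment matches.
    """
    result, i = [], 0
    while i < len(query):
        candidates = [s for s in scores if query.startswith(s, i)]
        best = max(candidates, key=lambda s: (scores[s], len(s)), default=query[i])
        result.append(best)
        i += len(best)
    return result
```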
Evaluations
• Experiment setting
  • SVM-based score
    • Training set: 3,000 randomly selected sentences
    • Feature space: three-dimensional (TF, DF, LEN)
      TF: term frequency
      DF: number of documents indexed by a segment
      LEN: number of characters in a segment
  • Frequency-based score
    • needs no training set
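The three-dimensional feature space can be written as a plain function; the TF and DF values would come from search-result statistics, and the function name is an assumption of ours:

```python
def feature_vector(segment, tf, df):
    """Build the 3-d feature vector the SVM scorer uses:
    TF (term frequency), DF (documents indexed by the segment),
    LEN (number of characters in the segment)."""
    return [tf, df, len(segment)]
```

In the slides' setting, vectors like these would be fed to an RBF-kernel SVM whose outputs are mapped into probabilities.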
Evaluations
• Worse than previously reported results
  • Why is the SVM worse? The feature space is too simple.
• Advantages:
  • only 3,000 training sentences, or no training set at all
  • avoids the OOV problem
  • better performance can be achieved when more search results are provided (Google + Yahoo!)
Conclusion
• It is good at discovering new words (no OOV problem) and adapting to different segmentation standards
• Entirely unsupervised, which saves the labor of labeling training data
• Future work:
  • finding more effective scoring methods
  • combining the current approach with other types of segmentation methods for better performance
My ongoing work: Discriminative Reranking
——ACL 07 & 03
1. Michael Collins and Terry Koo
2. Zhongqiang Huang: Purdue Univ.
Background
• Reranking has been applied to many NLP applications
  • NER, parsing, sentence boundary detection
• It has not yet been tried on POS tagging
• Motivation
  1. Rerank the output of an existing probabilistic tagger.
  2. The base tagger produces a set of candidate tag sequences for each sentence.
  3. A second model attempts to improve upon this initial ranking using additional features.
Collins’ Reranking Algorithm
• Training the reranker
  • n sentences {S_i : i = 1, ..., n}, each with n_i candidates {X_{i,j} : j = 1, ..., n_i}
  • along with the log-probability L(X_{i,j}) produced by the HMM tagger
• "Goodness" score Score(X_{i,j}): measures the similarity between the candidate and the gold reference
Collins’ Reranking Algorithm
• Training data consists of a set of examples {X_{i,j} : i = 1, ..., n; j = 1, ..., n_i}, each along with a "goodness" score Score(X_{i,j}) and a log-probability L(X_{i,j})
Collins’ Reranking Algorithm
• A set of indicator functions {h_k : k = 1, ..., m} extracts binary features h_k(X_{i,j}) on each example X_{i,j}.
• Each indicator function h_k is associated with a real-valued weight parameter α_k.
• α_0 is associated with the log-probability L(X_{i,j}).
Collins’ Reranking Algorithm
• The ranking function:
  F(x_{i,j}) = α_0 L(x_{i,j}) + Σ_{k=1}^{m} α_k h_k(x_{i,j})
• The objective of training: set the parameters ᾱ = {α_0, α_1, ..., α_m} to minimize a loss over the training examples; in Collins' work this is the exponential loss
  ExpLoss(ᾱ) = Σ_i Σ_{j=2}^{n_i} exp(−(F(x_{i,1}) − F(x_{i,j}))),
  where x_{i,1} is the candidate with the highest "goodness" score.
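The ranking function and the exponential loss can be sketched directly; the function names are ours, and the loss form follows Collins' reranking work rather than anything stated explicitly on this slide:

```python
import math

def rank_score(log_prob, feats, alpha):
    """F(x) = alpha[0] * L(x) + sum_k alpha[k] * h_k(x).

    `feats` holds the binary indicator values h_1..h_m for one candidate.
    """
    return alpha[0] * log_prob + sum(a * h for a, h in zip(alpha[1:], feats))

def exp_loss(scores_per_sentence):
    """Collins-style exponential loss: for each sentence, index 0 holds the
    score of the best ("gold-closest") candidate, and every rival candidate
    j contributes exp(F_j - F_0)."""
    loss = 0.0
    for fs in scores_per_sentence:
        loss += sum(math.exp(fj - fs[0]) for fj in fs[1:])
    return loss
```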
Experiments
• Using HMM as the base model
• Data set
  • The most recently released Penn Chinese Treebank 5.2 (denoted CTB, released by LDC)
——33 POS tags
——500K words, 800K characters, 18K sentences
Experiments
• Divide the data into 20 chunks; each chunk is N-best tagged by the HMM model trained on the combination of the other 19 chunks
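This 20-chunk jackknife split can be sketched as follows; the round-robin chunking and the function name are our own assumptions, since the slides do not say how the chunks are formed:

```python
def jackknife_chunks(sentences, k=20):
    """Split the data into k chunks; for each chunk, yield (train, held_out)
    where `train` is the other k-1 chunks combined.

    This lets the HMM tagger produce N-best candidates for every sentence
    without ever tagging data it was trained on.
    """
    chunks = [sentences[i::k] for i in range(k)]
    for i in range(k):
        train = [s for j, c in enumerate(chunks) if j != i for s in c]
        yield train, chunks[i]
```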