Mining Reference Tables for Automatic Text Segmentation
E. Agichtein (Columbia Univ.), V. Ganti (Microsoft Research)
KDD’04
Shui-Lung Chuang
Oct 27, 2004
Text Segmentation
• Input: a (short) text string
• Output: values for N attributes
• Conventional approaches
  – Rule-based: a human creates the rules
  – Supervised model-based: a human labels training data
Mining Ref. Table for Auto Text Segmentation E. Agichtein, V. Ganti, SIGKDD
Example: segment an input citation string into [ Authors, Title, Conference, Year ]; a missing attribute value is assigned Null.
The Approach
• Utilize existing (large, clean) reference data – e.g., DBLP papers, US addresses, …
  Author                        | Title                                   | Conference | Year
  ------------------------------+-----------------------------------------+------------+-----
  Mark Steyvers, Padhraic Smyth | Probabilistic Author-Topic Models for … | SIGKDD     | 2004
  Lotlikar, S. Roy              | A Hierarchical Document Clustering …    | WWW        | 2004
  Cimiano, S. Handschuh         | Towards the Self-Annotating Web …       | WWW        | 2003
  …                             | …                                       | …          | …
[Figure: each attribute column of the reference table trains one Attribute Recognition Model (ARM1, ARM2, ARM3, …). Given a substring s, ARM_i(s) gives the probability that attribute i generates s; together the ARMs form the segmentation model.]
ARM: an Attribute Recognition Model gives, for a substring s, the probability that s is generated by that attribute. To segment an input into substrings s1, s2, s3, s4 (matched on the slide to ARM1, ARM2, ARM3, ARM3), find

  {s_i}* = argmax over {s_i} of ∏_i ARM_i(s_i)
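The argmax above can be sketched as a brute-force enumeration of split points. The toy ARM scoring functions below are illustrative stand-ins, not the paper's trained models:

```python
from itertools import combinations

def segment(tokens, arms):
    """Split `tokens` into len(arms) consecutive substrings, maximizing
    the product of per-attribute scores ARM_i(s_i).
    `arms` is a list of functions mapping a token tuple to a probability."""
    n, k = len(tokens), len(arms)
    best, best_segs = 0.0, None
    # choose k-1 cut points among the n-1 gaps between tokens
    for cuts in combinations(range(1, n), k - 1):
        bounds = (0,) + cuts + (n,)
        segs = [tuple(tokens[bounds[i]:bounds[i + 1]]) for i in range(k)]
        score = 1.0
        for arm, seg in zip(arms, segs):
            score *= arm(seg)
        if score > best:
            best, best_segs = score, segs
    return best_segs

# toy ARMs: one prefers all-numeric substrings, one prefers the rest
num_arm = lambda s: 0.9 if all(t.isdigit() for t in s) else 0.01
txt_arm = lambda s: 0.9 if not any(t.isdigit() for t in s) else 0.01
print(segment(["1010", "fifth", "st"], [num_arm, txt_arm]))
# -> [('1010',), ('fifth', 'st')]
```

The real system replaces this exponential enumeration with per-attribute models learned from the reference table, but the objective being maximized is the same product of ARM scores.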
Challenges
• Robust to input errors
  – The reference data may be clean, but the input may contain various errors:
    missing values, spelling errors, extraneous or unknown tokens, etc.
• Adaptive to varied attribute orders
  – Reference data carry no information about the attribute order in the input
• Efficient in training
  – The reference data are large
Design responses:
• Engineer features
• Adjust the model topology
• Determine the attribute order from early input strings
• Fix the model topology
• Avoid advanced learning procedures (e.g., EM)
Feature Hierarchy
High-level features considered: Token classes (words, numbers, mixed, delimiters) + Token length
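A minimal sketch of such a hierarchy, generalizing a literal token to a class-plus-length pattern and then to a bare token class (the regex form and class names here are assumptions for illustration):

```python
import re

def feature_levels(token):
    """Successive generalizations of a token, from most specific to most
    general, mirroring a token-class + token-length feature hierarchy."""
    levels = [token]                                 # the literal token itself
    if re.fullmatch(r"[a-z0-9]+", token):
        levels.append(f"[a-z0-9]{{{len(token)}}}")   # class pattern + exact length
    if token.isdigit():
        cls = "number"
    elif token.isalpha():
        cls = "word"
    elif token.isalnum():
        cls = "mixed"
    else:
        cls = "delimiter"
    levels.append(cls)                               # token class only
    return levels

print(feature_levels("57th"))   # -> ['57th', '[a-z0-9]{4}', 'mixed']
```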
Attribute Recognition Model
• Training examples (street-address values):
  – 57th n sixth st
  – 1010 s fifth st
  – 201 n goodwin ave
Model Training
• Feature hierarchy per token, from most specific to most general, e.g. for "57th":
  57th → [a-z0-9]{1,4} → [a-z0-9]{1,5} → … → Mixed ([a-z0-9]{1,∞})
• Emission: p(x | e) = 1 if x = e, else 0 (exact match)
• Transitions: B → {M, T, END}; M → {M, T, END}; T → {T, END}
  (B = Beginning, M = Middle, T = Trailing state)
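Under these definitions, a toy ARM with the Beginning/Middle/Trailing topology and exact-match (delta) emissions might look like the sketch below; the state vocabularies and scores are illustrative, not the paper's trained parameters:

```python
def position_states(n):
    """Map n token positions to states under the B M* T topology."""
    if n == 1:
        return ["B"]
    return ["B"] + ["M"] * (n - 2) + ["T"]

def arm_score(tokens, vocab):
    """Exact-match emission: probability 1 if the token was observed in
    that state during training, else 0; the whole substring scores the
    product over positions. `vocab` maps state -> set of seen tokens."""
    score = 1.0
    for state, tok in zip(position_states(len(tokens)), tokens):
        score *= 1.0 if tok in vocab[state] else 0.0
    return score

# hypothetical vocabulary learned from street values like "sixth st", "goodwin ave"
vocab = {"B": {"sixth", "fifth", "goodwin"}, "M": set(), "T": {"st", "ave"}}
print(arm_score(["sixth", "st"], vocab))    # -> 1.0
print(arm_score(["sixth", "blvd"], vocab))  # -> 0.0
```

The all-or-nothing emission is exactly what the specificity relaxation below is designed to soften.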
Sequential Specificity Relaxation
• Relaxations applied sequentially to tolerate:
  – Token insertion, e.g., "57th" vs. "57th n sixth st"
  – Token deletion, e.g., "n sixth"
  – Missing attribute values, e.g., <null>
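One way to sketch such a relaxation is backing off from exact-token emission to token-class emission for unseen tokens; the back-off weight `alpha` and the class test are hypothetical, not parameters from the paper:

```python
def relaxed_emit(token, seen_tokens, seen_classes, alpha=0.1):
    """Back off from exact-match emission to token-class emission,
    a toy version of relaxing a model's specificity for unseen input."""
    if token in seen_tokens:
        return 1.0                      # exact match, as before relaxation
    cls = "number" if token.isdigit() else "word" if token.isalpha() else "mixed"
    return alpha if cls in seen_classes else 0.0   # class-level back-off

print(relaxed_emit("blvd", {"st", "ave"}, {"word"}))  # -> 0.1
```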
Determining Attribute Value Order
• Attribute order is usually preserved in the same batch of input strings
s = walmart 20205 s. randall ave madison 53715 wi.

  pos:            1     2     3     4     5     6     7     8
  v(s, street): 0.10  0.70  0.80  0.70  0.90  0.50  0.40  0.10
  v(s, city):   0.05  0.01  0.02  0.10  0.01  0.80  0.01  0.07

• Search all attribute permutations for the best total order (turning the per-attribute position evidence, a partial order, into a total order)
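Using the slide's own scores, a simple way to turn per-position evidence into an attribute order is to sort attributes by where their score peaks; this is a toy stand-in for the full permutation search:

```python
def infer_attribute_order(v):
    """Order attributes by the position of their peak score.
    `v` maps an attribute name to its per-position scores v(s, A_i)."""
    peak = {a: scores.index(max(scores)) for a, scores in v.items()}
    return sorted(v, key=lambda a: peak[a])

# per-position scores from the slide's example string
v = {
    "street": [0.10, 0.70, 0.80, 0.70, 0.90, 0.50, 0.40, 0.10],
    "city":   [0.05, 0.01, 0.02, 0.10, 0.01, 0.80, 0.01, 0.07],
}
print(infer_attribute_order(v))  # -> ['street', 'city']
```

The street scores peak at position 5 and the city scores at position 6, so street precedes city, matching the example string.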
Experiment Data
• Reference relations
  – Addresses: 1,000,000 tuples
    • Schema: [ Name, Number1, Number2, Address, City, State, Zip ]
  – Media: 280,000 music tracks
    • Schema: [ ArtistName, AlbumName, TrackName ]
  – Bibliography: 100,000 records from DBLP
    • Schema: [ Title, Author, Journal, Volume, Month, Year ]
• Test datasets – naturally concatenated test sets
  – Addresses: from the RISE repository
  – Media: from Microsoft
  – Papers: 100 most cited papers from CiteSeer
Experiment Data (cont.)
• Test datasets – controlled test sets
  – Randomly chosen attribute order
  – Error injection
Experiment Results
• Models compared: 1-Pos vs. BMT vs. BMT-robust
Comments
• The idea of using reference tables is good
• The approach is well engineered to address robustness and efficiency
• The experiments are thorough
• Still, the approach is somewhat ad hoc, and every component seems replaceable