Mining Reference Tables for Automatic Text Segmentation
E. Agichtein (Columbia Univ.), V. Ganti (Microsoft Research)
KDD’04
Shui-Lung Chuang
Oct 27, 2004
Text Segmentation
• Input: a (short) text string
• Output: values for N attributes
• Conventional approaches
  – Rule-based: a human creates the rules
  – Supervised model-based: a human labels training data
Mining Ref. Table for Auto Text Segmentation E. Agichtein, V. Ganti, SIGKDD
Example: segment an input citation string into [ Authors, Title, Conference, Year ]; a missing attribute value is assigned Null.
The Approach
• Utilize existing (large, clean) reference data – e.g., DBLP papers, US addresses, …
  Author                        | Title                                   | Conference | Year
  ------------------------------+-----------------------------------------+------------+-----
  Mark Steyvers, Padhraic Smyth | Probabilistic Author-Topic Models for … | SIGKDD     | 2004
  Lotlikar, S. Roy              | A Hierarchical Document Clustering …    | WWW        | 2004
  Cimiano, S. Handschuh         | Towards the Self-Annotating Web …       | WWW        | 2003
  …                             | …                                       | …          | …
[Figure: each attribute column of the reference table trains one Attribute Recognition Model (ARM1, ARM2, ARM3, …). Given a substring s, ARM_i(s) gives the probability that attribute i generates s; together the ARMs form the segmentation model.]
ARM: an Attribute Recognition Model gives, for a substring s, the probability that s is generated by that attribute. To segment an input into substrings s1, s2, s3, s4 (matched on the slide to ARM1, ARM2, ARM3, ARM3), find

  {s_i}* = argmax over {s_i} of ∏_i ARM_i(s_i)
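The argmax above can be sketched as a brute-force enumeration of split points. The toy ARM scoring functions below are illustrative stand-ins, not the paper's trained models:

```python
from itertools import combinations

def segment(tokens, arms):
    """Split `tokens` into len(arms) consecutive substrings, maximizing
    the product of per-attribute scores ARM_i(s_i).
    `arms` is a list of functions mapping a token tuple to a probability."""
    n, k = len(tokens), len(arms)
    best, best_segs = 0.0, None
    # choose k-1 cut points among the n-1 gaps between tokens
    for cuts in combinations(range(1, n), k - 1):
        bounds = (0,) + cuts + (n,)
        segs = [tuple(tokens[bounds[i]:bounds[i + 1]]) for i in range(k)]
        score = 1.0
        for arm, seg in zip(arms, segs):
            score *= arm(seg)
        if score > best:
            best, best_segs = score, segs
    return best_segs

# toy ARMs: one prefers all-numeric substrings, one prefers the rest
num_arm = lambda s: 0.9 if all(t.isdigit() for t in s) else 0.01
txt_arm = lambda s: 0.9 if not any(t.isdigit() for t in s) else 0.01
print(segment(["1010", "fifth", "st"], [num_arm, txt_arm]))
# -> [('1010',), ('fifth', 'st')]
```

The real system replaces this exponential enumeration with per-attribute models learned from the reference table, but the objective being maximized is the same product of ARM scores.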
Challenges
• Robust to input errors
  – The reference data may be clean, but the input may contain various errors:
    missing values, spelling errors, extraneous or unknown tokens, etc.
• Adaptive to varied attribute orders
  – Reference data carry no information about the attribute order in the input
• Efficient in training
  – The reference data are large
Design responses:
• Engineer features
• Adjust the model topology
• Determine the attribute order from early input strings
• Fix the model topology
• Avoid advanced learning procedures (e.g., EM)
Feature Hierarchy
High-level features considered: Token classes (words, numbers, mixed, delimiters) + Token length
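A minimal sketch of such a hierarchy, generalizing a literal token to a class-plus-length pattern and then to a bare token class (the regex form and class names here are assumptions for illustration):

```python
import re

def feature_levels(token):
    """Successive generalizations of a token, from most specific to most
    general, mirroring a token-class + token-length feature hierarchy."""
    levels = [token]                                 # the literal token itself
    if re.fullmatch(r"[a-z0-9]+", token):
        levels.append(f"[a-z0-9]{{{len(token)}}}")   # class pattern + exact length
    if token.isdigit():
        cls = "number"
    elif token.isalpha():
        cls = "word"
    elif token.isalnum():
        cls = "mixed"
    else:
        cls = "delimiter"
    levels.append(cls)                               # token class only
    return levels

print(feature_levels("57th"))   # -> ['57th', '[a-z0-9]{4}', 'mixed']
```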
Attribute Recognition Model
• Training examples (street-address values):
  – 57th n sixth st
  – 1010 s fifth st
  – 201 n goodwin ave
Model Training
• Feature hierarchy per token, from most specific to most general, e.g. for "57th":
  57th → [a-z0-9]{1,4} → [a-z0-9]{1,5} → … → Mixed ([a-z0-9]{1,∞})
• Emission: p(x | e) = 1 if x = e, else 0 (exact match)
• Transitions: B → {M, T, END}; M → {M, T, END}; T → {T, END}
  (B = Beginning, M = Middle, T = Trailing state)
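Under these definitions, a toy ARM with the Beginning/Middle/Trailing topology and exact-match (delta) emissions might look like the sketch below; the state vocabularies and scores are illustrative, not the paper's trained parameters:

```python
def position_states(n):
    """Map n token positions to states under the B M* T topology."""
    if n == 1:
        return ["B"]
    return ["B"] + ["M"] * (n - 2) + ["T"]

def arm_score(tokens, vocab):
    """Exact-match emission: probability 1 if the token was observed in
    that state during training, else 0; the whole substring scores the
    product over positions. `vocab` maps state -> set of seen tokens."""
    score = 1.0
    for state, tok in zip(position_states(len(tokens)), tokens):
        score *= 1.0 if tok in vocab[state] else 0.0
    return score

# hypothetical vocabulary learned from street values like "sixth st", "goodwin ave"
vocab = {"B": {"sixth", "fifth", "goodwin"}, "M": set(), "T": {"st", "ave"}}
print(arm_score(["sixth", "st"], vocab))    # -> 1.0
print(arm_score(["sixth", "blvd"], vocab))  # -> 0.0
```

The all-or-nothing emission is exactly what the specificity relaxation below is designed to soften.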
Sequential Specificity Relaxation
• Relaxations applied sequentially to tolerate:
  – Token insertion, e.g., "57th" vs. "57th n sixth st"
  – Token deletion, e.g., "n sixth"
  – Missing attribute values, e.g., <null>
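One way to sketch such a relaxation is backing off from exact-token emission to token-class emission for unseen tokens; the back-off weight `alpha` and the class test are hypothetical, not parameters from the paper:

```python
def relaxed_emit(token, seen_tokens, seen_classes, alpha=0.1):
    """Back off from exact-match emission to token-class emission,
    a toy version of relaxing a model's specificity for unseen input."""
    if token in seen_tokens:
        return 1.0                      # exact match, as before relaxation
    cls = "number" if token.isdigit() else "word" if token.isalpha() else "mixed"
    return alpha if cls in seen_classes else 0.0   # class-level back-off

print(relaxed_emit("blvd", {"st", "ave"}, {"word"}))  # -> 0.1
```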
Determining Attribute Value Order
• Attribute order is usually preserved in the same batch of input strings
s = walmart 20205 s. randall ave madison 53715 wi.

  pos:            1     2     3     4     5     6     7     8
  v(s, street): 0.10  0.70  0.80  0.70  0.90  0.50  0.40  0.10
  v(s, city):   0.05  0.01  0.02  0.10  0.01  0.80  0.01  0.07

• Search all attribute permutations for the best total order (turning the per-attribute position evidence, a partial order, into a total order)
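Using the slide's own scores, a simple way to turn per-position evidence into an attribute order is to sort attributes by where their score peaks; this is a toy stand-in for the full permutation search:

```python
def infer_attribute_order(v):
    """Order attributes by the position of their peak score.
    `v` maps an attribute name to its per-position scores v(s, A_i)."""
    peak = {a: scores.index(max(scores)) for a, scores in v.items()}
    return sorted(v, key=lambda a: peak[a])

# per-position scores from the slide's example string
v = {
    "street": [0.10, 0.70, 0.80, 0.70, 0.90, 0.50, 0.40, 0.10],
    "city":   [0.05, 0.01, 0.02, 0.10, 0.01, 0.80, 0.01, 0.07],
}
print(infer_attribute_order(v))  # -> ['street', 'city']
```

The street scores peak at position 5 and the city scores at position 6, so street precedes city, matching the example string.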
Experiment Data
• Reference relations
  – Addresses: 1,000,000 tuples
    • Schema: [ Name, Number1, Number2, Address, City, State, Zip ]
  – Media: 280,000 music tracks
    • Schema: [ ArtistName, AlbumName, TrackName ]
  – Bibliography: 100,000 records from DBLP
    • Schema: [ Title, Author, Journal, Volume, Month, Year ]
• Test datasets – naturally concatenated test sets
  – Addresses: from the RISE repository
  – Media: from Microsoft
  – Papers: 100 most cited papers from CiteSeer
Experiment Data (cont.)
• Test datasets – controlled test sets
  – Randomly chosen attribute order
  – Error injection
Experiment Results
• Models compared: 1-Pos vs. BMT vs. BMT-robust
Comments
• The idea of using reference tables is good
• The approach is well engineered to address robustness and efficiency
• The experiments are thorough
• Still, the approach is somewhat ad hoc, and every component seems replaceable