Integration and representation of unstructured text in relational databases
Sunita Sarawagi
IIT Bombay
Database vs. unstructured data:
- HR database (resumes: skills, experience, references from emails) vs. a text resume in an email.
- Company database (products with features) vs. product reviews on the web, customer emails.
- Citeseer/Google Scholar (structured records from publishers) vs. publications on homepages.
- Personal databases (bibtex, address book): extract bibtex entries when I download a paper; enter missing contacts via web search.
Database: imprecise. Probabilistic variant links to canonical entries. Three top-level entities (Articles, Journals, Authors), plus the Writes link table.

Articles
| Id | Title            | Year | Journal | Canonical |
| 2  | Update Semantics | 1983 | 10      |           |

Journals
| Id | Name                 | Canonical |
| 10 | ACM TODS             |           |
| 17 | AI                   | 17        |
| 16 | ACM Trans. Databases | 10        |

Writes
| Article | Author |
| 2       | 11     |
| 2       | 2      |
| 2       | 3      |

Authors
| Id | Name           | Canonical |
| 11 | M Y Vardi      |           |
| 2  | J. Ullman      | 4         |
| 3  | Ron Fagin      | 3         |
| 4  | Jeffrey Ullman | 4         |
Incoming citation: "R. Fagin and J. Helpern, Belief, awareness, reasoning. In AI 1988 [10] also see"

Extraction: Author: R. Fagin | Author: J. Helpern | Title: Belief,..reasoning | Journal: AI | Year: 1998

Integration: match with existing linked entities while respecting all constraints.

Articles
| Id | Title                        | Year | Journal | Canonical |
| 2  | Update Semantics             | 1983 | 10      |           |
| 7  | Belief, awareness, reasoning | 1988 | 17      |           |

Journals
| Id | Name                 | Canonical |
| 10 | ACM TODS             |           |
| 17 | AI                   | 17        |
| 16 | ACM Trans. Databases | 10        |

Writes
| Article | Author |
| 2       | 11     |
| 2       | 2      |
| 2       | 3      |
| 7       | 8      |
| 7       | 9      |

Authors
| Id | Name           | Canonical |
| 11 | M Y Vardi      |           |
| 2  | J. Ullman      | 4         |
| 3  | Ron Fagin      | 3         |
| 4  | Jeffrey Ullman | 4         |
| 8  | R Fagin        | 3         |
| 9  | J Helpern      | 8         |
Outline
- Statistical models for integration
  - Extraction while fully exploiting the existing database: entity match, entity pattern, link/relationship constraints
  - Integrate extracted entities; resolve whether an entity is already in the database
- Performance challenges
  - Efficient graphical-model inference algorithms
  - Indexing support
- Representing uncertainty of integration in the DB
  - Imprecise databases and queries
Extraction using chain CRFs: "R. Fagin and J. Helpern, Belief, awareness, reasoning"

x (positions 1-8): R. | Fagin | and | J. | Helpern | Belief | Awareness | Reasoning
y (labels y1..y8): Author | Author | Other | Author | Author | Title | Title | Title

Flexible overlapping features:
- identity of the word
- ends in "-ski"
- is capitalized
- is part of a noun phrase?
- is under node X in WordNet
- is in bold font
- is indented
- next two words are "and Associates"
- previous label is "Other"

Difficult to effectively combine features from labeled unstructured data and a structured DB.
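The overlapping token-level features above can be sketched as a small feature extractor. This is a minimal illustration, not the talk's actual feature set; the feature names (`is_initial`, `next_two=...`, etc.) are hypothetical.

```python
import re

def token_features(tokens, i):
    """Overlapping features for position i, in the spirit of the list
    above. Feature names are illustrative, not from the original system."""
    w = tokens[i]
    feats = {
        "word=" + w.lower(): 1.0,
        "is_capitalized": 1.0 if w[:1].isupper() else 0.0,
        "ends_in_ski": 1.0 if w.lower().endswith("ski") else 0.0,
        # an initial like "R." is a strong author clue
        "is_initial": 1.0 if re.fullmatch(r"[A-Z]\.", w) else 0.0,
    }
    if i + 2 < len(tokens):
        feats["next_two=" + tokens[i + 1] + "_" + tokens[i + 2]] = 1.0
    return feats

tokens = "R. Fagin and J. Helpern".split()
f = token_features(tokens, 0)
```

In a chain CRF each such feature fires jointly with the label at that position (and the previous label), which is what makes the feature set freely overlapping.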
In the chain CRF, the features at each position t describe only the single word there, e.g. "Fagin".
CRFs for segmentation

Segments: (l1=1, u1=2) "R. Fagin" -> Author; (l2=u2=3) "and" -> Other; (l3=4, u3=5) "J. Helpern" -> Author; (l4=6, u4=8) "Belief Awareness Reasoning" -> Title.

Features describe the whole segment from l to u, e.g. similarity to the author column in the database.

Features from the database:
- Similarity to a dictionary entry (JaroWinkler, TF-IDF).
- Similarity to a pattern-level dictionary: a regex-based pattern index for database entities.
- Entity classifier: a multi-class regression model giving the likelihood of a segment being a particular entity type; its features are all standard entity-level extraction features.
Segmentation models

Input: sequence x = x1, x2, ..., xn; label set Y.
Output: segmentation S = s1, s2, ..., sp, with sj = (start position, end position, label) = (tj, uj, yj).

Score: F(x, S) = sum over segments of
- transition potentials: segment starting at i has label y and the previous label is y'
- segment potentials: segment starting at i', ending at i, with label y; all positions from i' to i get the same label.

Probability of a segmentation: P(S | x) = exp(F(x, S)) / Z(x). Inference is O(nL²).
Queries: most likely segmentation; marginals around segments.
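The most-likely-segmentation computation can be sketched as a semi-CRF Viterbi pass. This is a minimal sketch, assuming arbitrary `seg_score` and `trans` potential functions supplied by the caller; it is not the talk's implementation.

```python
def semi_crf_viterbi(n, labels, seg_score, trans, max_len):
    """Most likely segmentation of positions [0, n).
    seg_score(l, u, y): potential of segment x[l:u] with label y.
    trans(y_prev, y): transition potential (y_prev is None at the start).
    Runs in O(n * max_len * |labels|^2)."""
    V = [{} for _ in range(n + 1)]   # V[i][y] = (best score, backpointer)
    V[0][None] = (0.0, None)         # empty prefix
    for i in range(1, n + 1):
        for y in labels:
            best = (float("-inf"), None)
            for d in range(1, min(max_len, i) + 1):
                for yp, (score, _) in V[i - d].items():
                    s = score + trans(yp, y) + seg_score(i - d, i, y)
                    if s > best[0]:
                        best = (s, (i - d, yp))
            V[i][y] = best
    # backtrack from the best final label
    y = max(V[n], key=lambda lab: V[n][lab][0])
    i, segs = n, []
    while i > 0:
        _, (start, yp) = V[i][y]
        segs.append((start, i, y))
        i, y = start, yp
    return list(reversed(segs))

# Toy run with a hypothetical scorer: reward [0,2) as Author, [2,4) as Title.
toy_labels = ["Author", "Other", "Title"]
toy_score = lambda l, u, y: 5.0 if (l, u, y) in {(0, 2, "Author"), (2, 4, "Title")} else 0.0
best_segs = semi_crf_viterbi(4, toy_labels, toy_score, lambda a, b: 0.0, max_len=4)
```

The extra inner loop over segment lengths d is what distinguishes this from token-level Viterbi and lets segment-level (e.g. dictionary-similarity) features score whole spans.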
(Database tables as before: Articles, Journals, Writes, Authors.)

First citation: "R. Fagin and J. Helpern, Belief, awareness, reasoning. In AI 1988 [10] also see"
Extraction: Author: R. Fagin | Author: J. Helpern | Title: Belief,..reasoning | Journal: AI | Year: 1998
Integration: match with existing linked entities while respecting all constraints.

Second citation: "CACM 2000, R. Fagin and J. Helpern, Belief, awareness, reasoning in AI"
Only extraction: Author: R. Fagin | Author: J. Helpern | Title: Belief,..reasoning | Journal: AI | Year: 2000.
Year mismatch with the stored article!

Articles
| Id | Title                        | Year | Journal | Canonical |
| 2  | Update Semantics             | 1983 | 10      |           |
| 7  | Belief, awareness, reasoning | 1988 | 17      |           |

Combined extraction + integration: Author: R. Fagin | Author: J. Helpern | Title: Belief,..reasoning in AI | Journal: CACM | Year: 2000
Combined extraction + matching
Convert the predicted label into a pair y = (a, r), where r is the id of the matching entity and r = 0 means none-of-the-above (a new entry).

Example: "CACM. 2000 Fagin Belief Awareness Reasoning In AI"
Segments and labels: "CACM." -> Journal (r = 0, a new journal); "2000" -> Year (r = 7); "Fagin" -> Author (r = 3); "Belief Awareness Reasoning In AI" -> Title (r = 7).

Constraints exist on the ids that can be assigned to two segments.
Constrained models
Two kinds of constraints between arbitrary segments:
- foreign-key constraints across their canonical ids
- cardinality constraints

Training: ignore constraints, or use max-margin methods that require only MAP estimates.
Application (inference):
- formulate as a constrained integer programming problem (expensive), or
- use a general A* search to find the most likely constrained assignment.
Effect of the database on extraction performance (F1)

| Dataset     | Field      | L    | L+DB | %Δ    |
| PersonalBib | author     | 75.7 | 79.5 | 4.9   |
| PersonalBib | journal    | 33.9 | 50.3 | 48.6  |
| PersonalBib | title      | 61.0 | 70.3 | 15.1  |
| Address     | city_name  | 72.4 | 76.7 | 6.0   |
| Address     | state_name | 13.9 | 33.2 | 138.5 |
| Address     | zipcode    | 91.6 | 94.3 | 3.0   |

L = only labeled (training) data; L+DB adds similarity to database entities and other DB features.
(Mansuri and Sarawagi, ICDE 2006)
Effect of various features (figure): F1 (range roughly 55-85) for feature ablations only_L (no DB), +cardinality, +db_similarity, +db_classifier, +db_regex, +db_link, -L_entity, -L_context, -L_edge, shown for Train=5% and Train=10%.
Full integration performance (F1)

| Dataset     | Field      | L    | L+DB | %Δ    |
| PersonalBib | author     | 70.8 | 74.0 | 4.5   |
| PersonalBib | journal    | 29.6 | 45.5 | 53.6  |
| PersonalBib | title      | 51.6 | 65.0 | 25.9  |
| Address     | city_name  | 70.1 | 74.6 | 6.4   |
| Address     | state_name | 9.0  | 28.3 | 213.8 |
| Address     | pincode    | 87.8 | 90.7 | 3.3   |

L = conventional extraction + matching; L+DB = the technology presented here.
Much higher accuracies are possible with more training data.
(Mansuri and Sarawagi, ICDE 2006)
Outline
- Statistical models for integration
  - Extraction while fully exploiting the existing database: entity match, entity pattern, link/relationship constraints
  - Integrate extracted entities; resolve whether an entity is already in the database
- Performance challenges
  - Efficient graphical-model inference algorithms
  - Indexing support
- Representing uncertainty of integration in the DB
  - Imprecise databases and queries
Inference in segmentation models

"R. Fagin and J. Helpern, Belief, awareness, reasoning, In AI 1998"
Surface features are cheap; database lookup features are expensive!

Authors table (Name): S Chakrabarti, Jay Shan, Jackie Chan, Bill Gates, Thorsten, J Kleinberg, J. Gherke, Claire Cardie, Jeffrey Ullman, Ron Fagin, J. Ullman, M Y Vardi, ...

An inverted index supports efficient search for the top-k most similar entities. Two questions:
1. Can we batch lookups to do better than individual top-k queries?
2. Can we find the top segmentation without computing top-k matches for all segments?

Many large tables make this costly.
Top-k similarity search

Q: a query segment; E: an entry in the database D. Goal: given a similarity score, get the k highest-scoring entries E in D.

Tidlists (pointers to DB tuples, on disk) t1, t2, t3, ..., tU are processed against cached bounds on normalized idf values:
1. fetch/merge tidlist subsets
2. point queries
This yields upper and lower score bounds per tuple id on the dictionary match scores, and hence candidate matches.
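A much-simplified version of such a dictionary lookup can be sketched with an in-memory inverted index. This sketch uses plain summed idf as the similarity and skips the upper/lower-bound machinery; all names here are illustrative.

```python
import heapq
import math
from collections import defaultdict

def build_index(entries):
    """Inverted index: token -> postings list of (entry id, term frequency)."""
    index = defaultdict(list)
    for eid, text in enumerate(entries):
        toks = text.lower().split()
        for t in set(toks):
            index[t].append((eid, toks.count(t)))
    return index

def topk(query, entries, index, k):
    """Score candidates by summed idf of shared tokens (a simplified
    dictionary similarity; the engine described above additionally keeps
    upper/lower score bounds so disk-resident tidlists are fetched lazily)."""
    n = len(entries)
    scores = defaultdict(float)
    for t in set(query.lower().split()):
        postings = index.get(t, [])
        if not postings:
            continue
        idf = math.log(1.0 + n / len(postings))  # rarer token => higher weight
        for eid, _tf in postings:
            scores[eid] += idf
    return heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])

authors = ["Jeffrey Ullman", "Ron Fagin", "J. Ullman", "M Y Vardi"]
idx = build_index(authors)
top = topk("Jeffrey Ullman", authors, idx, k=2)
```

The real system's refinement is to stop merging postings once the score bounds already separate the top-k from the rest, which this exact-scoring sketch does not attempt.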
Best segmentation with inexact, bounded features

Normal Viterbi: a forward pass over data positions, maintaining at each position the best segmentation ending there.
Modify to: best-first search with selective feature refinement over segment states s(l, u), e.g. s(0,0), s(1,1), s(1,2), s(1,3), s(3,3), s(3,4), s(3,5), s(4,4), s(5,5), ..., end state.
A suffix upper/lower bound, obtained from a backward Viterbi pass with bounded features, guides the search.
(Chandel, Nagesh and Sarawagi, ICDE 2006)
Performance results
DBLP authors and titles, 100 citations.
(Chandel, Nagesh and Sarawagi, ICDE 2006)
Inference in segmentation models

"R. Fagin and J. Helpern, Belief, awareness, reasoning, In AI 1998"
Surface features are cheap. But not quite: semi-CRFs are 3-8 times slower than chain CRFs.

Key insight:
- Applications have a mix of token-level and segment-level features.
- Many features apply to several overlapping segments.
- Compactly represent the overlap through new forms of potentials.
- Redesign inference algorithms to work on the compact features; the cost is then independent of the number of segments a feature applies to.
(Sarawagi, ICML 2006)
Compact potentials: four kinds of potentials.
Running time and accuracy (figures): training time (sec) vs. training % on Address and Cora, comparing Sequence-BCEU, SegmentOpt, and Segment; F1 accuracy (range roughly 78-92) on Address, Cora, and Articles for Sequence-BCEU vs. Segment.
Outline
- Statistical models for integration
  - Extraction while fully exploiting the existing database: entity match, entity pattern, link/relationship constraints
  - Integrate extracted entities; resolve whether an entity is already in the database
- Performance challenges
  - Efficient graphical-model inference algorithms
  - Indexing support
- Representing uncertainty of integration in the DB
  - Imprecise databases and queries
Probabilistic querying systems
- Integration systems, while improving, cannot be perfect, particularly for domains like the web.
- User supervision of each integration result is impossible.
- So: create uncertainty-aware storage and querying engines.

Two enablers:
- probabilistic database querying engines over generic uncertainty models
- conditional graphical models produce well-calibrated probabilities

Probabilities in CRFs are well-calibrated: the probability assigned to a segmentation tracks the probability that it is correct. E.g., predictions made with probability 0.5 are correct about 50% of the time (shown for Cora citations and Cora headers against the ideal diagonal).
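The calibration claim can be checked with a standard reliability computation. This is a generic sketch (the binning scheme and function name are my own), not the evaluation code behind the Cora plots.

```python
def reliability(preds):
    """preds: (predicted probability, was the prediction correct?) pairs.
    Returns, per probability bin, (mean confidence, accuracy); for a
    well-calibrated model the two track each other, as on the slide's
    'Ideal' line."""
    bins = {}
    for p, ok in preds:
        bins.setdefault(min(int(p * 10), 9), []).append((p, ok))
    out = {}
    for b, items in sorted(bins.items()):
        conf = sum(p for p, _ in items) / len(items)
        acc = sum(1 for _, ok in items if ok) / len(items)
        out[b] = (round(conf, 3), round(acc, 3))
    return out

# Hypothetical predictions: two made at 0.5 (one right), one at 0.9 (right).
table = reliability([(0.5, True), (0.5, False), (0.9, True)])
```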
Uncertainty in integration systems

Unstructured text -> model -> candidate entity sets with probabilities p1, p2, ..., pk -> probabilistic database system. If the output is very uncertain, request additional training data. Other, more compact models?

Example queries over the probabilistic database:
- Select the conference name of article RJ03:
  IEEE Intl. Conf. On Data Mining (0.8); Conf. On Data Mining (0.2)
- Find the most cited author:
  D Johnson, 16000 (0.6); J Ullman, 13000 (0.4)
Segmentation-per-row model (rows: uncertain; columns: exact)

| HNO  | AREA        | CITY        | PINCODE | PROB |
| 52   | Bandra West | Bombay      | 400 062 | 0.1  |
| 52-A | Bandra      | West Bombay | 400 062 | 0.2  |
| 52-A | Bandra West | Bombay      | 400 062 | 0.5  |
| 52   | Bandra      | West Bombay | 400 062 | 0.2  |

Exact but impractical: there can be too many segmentations!
One-row model (row: exact; columns: independent, uncertain)
Each column is a multinomial distribution.

| HNO        | AREA              | CITY              | PINCODE       |
| 52 (0.3)   | Bandra West (0.6) | Bombay (0.6)      | 400 062 (1.0) |
| 52-A (0.7) | Bandra (0.4)      | West Bombay (0.4) |               |

e.g. P(52-A, Bandra West, Bombay, 400 062) = 0.7 x 0.6 x 0.6 x 1.0 = 0.252
Simple model with a closed-form solution, but a poor approximation.
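The column-independence computation above can be sketched directly (the function name and dictionary layout are illustrative; the numbers are the slide's):

```python
def one_row_prob(columns, values):
    """columns: col -> {string value: prob} (a multinomial per column).
    Under column independence, a full segmentation's probability is the
    product of its per-column choices."""
    p = 1.0
    for col, val in values.items():
        p *= columns[col].get(val, 0.0)
    return p

# The one-row table above:
cols = {
    "HNO": {"52": 0.3, "52-A": 0.7},
    "AREA": {"Bandra West": 0.6, "Bandra": 0.4},
    "CITY": {"Bombay": 0.6, "West Bombay": 0.4},
    "PINCODE": {"400 062": 1.0},
}
p = one_row_prob(cols, {"HNO": "52-A", "AREA": "Bandra West",
                        "CITY": "Bombay", "PINCODE": "400 062"})
# matches the slide: 0.7 x 0.6 x 0.6 x 1.0 = 0.252
```

The poor approximation is visible here: the model also assigns mass to mixed segmentations such as (52-A, Bandra, Bombay) that never occur together.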
Multi-row model (rows: uncertain; columns: independent, uncertain)
A segmentation is generated by a 'mixture' of rows.

| HNO                      | AREA              | CITY              | PINCODE       | Prob |
| 52 (0.167), 52-A (0.833) | Bandra West (1.0) | Bombay (1.0)      | 400 062 (1.0) | 0.6  |
| 52 (0.5), 52-A (0.5)     | Bandra (1.0)      | West Bombay (1.0) | 400 062 (1.0) | 0.4  |

Excellent storage/accuracy tradeoff; populating the probabilities is challenging.
(Gupta and Sarawagi, VLDB 2006)
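Evaluating a segmentation's probability under this mixture representation can be sketched as follows. The row weights and per-column distributions are taken from the table above; the function itself is illustrative, not the VLDB 2006 system.

```python
def multi_row_prob(rows, values):
    """rows: list of (mixture weight, {col: {value: prob}}).
    A segmentation's probability is a weighted sum over rows, each row
    column-independent internally."""
    total = 0.0
    for w, cols in rows:
        p = w
        for col, val in values.items():
            p *= cols[col].get(val, 0.0)
        total += p
    return total

# The two-row table above:
rows = [
    (0.6, {"HNO": {"52": 0.167, "52-A": 0.833},
           "AREA": {"Bandra West": 1.0}, "CITY": {"Bombay": 1.0},
           "PINCODE": {"400 062": 1.0}}),
    (0.4, {"HNO": {"52": 0.5, "52-A": 0.5},
           "AREA": {"Bandra": 1.0}, "CITY": {"West Bombay": 1.0},
           "PINCODE": {"400 062": 1.0}}),
]
p = multi_row_prob(rows, {"HNO": "52-A", "AREA": "Bandra West",
                          "CITY": "Bombay", "PINCODE": "400 062"})
# only the first row supports this segmentation: 0.6 x 0.833
```

Unlike the one-row model, the mixture keeps AREA and CITY correlated (Bandra West goes with Bombay, Bandra with West Bombay), which is why even two rows approximate the true distribution well.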
Populating a multi-row model
Challenge: learning the parameters of a mixture model that approximates the semi-CRF without enumerating instances from the model.
Solution:
- find disjoint partitions of the string
- operate directly on marginal probability vectors (efficiently computable for semi-CRFs)
- each partition becomes a row

Experiments: need for the multi-row model
- KL divergence is very high at m=1: the one-row model is clearly inadequate.
- Even a two-row model is sufficient in many cases.
What next in data integration?
Lots to be done in building large-scale, viable data integration systems:
- Online collective inference: cannot freeze the database, cannot batch too many inferences; need theoretically sound, practical alternatives to exact, batch inference.
- Queries and mining over imprecise databases.
- Models of imprecision for the results of deduplication.

Thank you.
Summary
Data integration with statistical models is an exciting research direction and a useful problem. Four take-home messages:
- Segmentation models (semi-CRFs) provide a more elegant way to exploit entity features and build integrated models (NIPS 2004, ICDE 2006a).
- A* search is adequate for link and cardinality constraints (ICDE 2006a).
- A recipe for combining two top-k searches so that expensive DB lookup features are refined gradually (ICDE 2006b).
- An efficient segmentation model with a succinct representation of overlapping features and message passing over partial potentials (NIPS 2005 workshop).

Software: http://crf.sourceforge.net
Outline
- Problem statement and goals
- Models for data integration
  - Information extraction: state of the art
  - Overview: conditional random fields
  - Our extensions to incorporate a database of entity names
- Entity matching
- Combined model for extraction and matching
- Extending to multi-relational data
Entity resolution
- Labeled data: record pairs with labels 0 (red edges) or 1 (black edges).
- Input features: various kinds of similarity functions between attributes, e.g. edit distance, Soundex, n-grams on text attributes; Jaccard, Jaro-Winkler, subset match.
- Classifier: any binary classifier; a CRF for extensibility.

Authors and their variants:
- Jeffrey Ullman: J. Ullmann, Jefry Ulman, Prof. J. Ullman
- Jeffrey Smith: J Smith
- M, Stonebraker: Mike Stonebraker, Michael Stonebraker
- Pedro Domingos: Domingos, P.?

CRFs for predicting matches: given a record pair (x1, x2), predict y = 1 or 0.
Efficiency in training: filter, and only include pairs that satisfy conditions like having at least one common n-gram.
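A minimal sketch of this pipeline: an n-gram blocking filter plus a linear pair classifier over similarity features. The feature set and weights are hypothetical stand-ins for the trained model; the real system uses richer similarities (edit distance, Soundex, Jaro-Winkler, subset match).

```python
import math

def ngrams(s, n=3):
    s = s.lower()
    return {s[i:i + n] for i in range(max(len(s) - n + 1, 1))}

def shares_ngram(a, b):
    """The blocking filter from the slide: only score record pairs with
    at least one common character n-gram."""
    return bool(ngrams(a) & ngrams(b))

def pair_features(a, b):
    """Two illustrative similarity features for a record pair."""
    ga, gb = ngrams(a), ngrams(b)
    return {
        "jaccard_3gram": len(ga & gb) / len(ga | gb),
        "same_first_token": float(a.split()[0].lower() == b.split()[0].lower()),
    }

def match_score(a, b, weights, bias=0.0):
    """A linear classifier over the pair features (weights are
    hypothetical, standing in for trained parameters); the sigmoid
    gives P(match)."""
    z = bias + sum(weights.get(k, 0.0) * v
                   for k, v in pair_features(a, b).items())
    return 1.0 / (1.0 + math.exp(-z))
```

Blocking keeps training and application tractable: pairs like ("Bill Gates", "M Y Vardi") never reach the classifier, while spelling variants like ("Jeffrey Ullman", "Jefry Ulman") do.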
Link constraints in multi-relational data
Any pair of segments in the previous output must satisfy two conditions:
- foreign-key constraints across their canonical ids
- cardinality constraints

Our solution: constrained Viterbi (a branch-and-bound search), a modified search that retains the best path labels along each path and backtracks when constraints are violated.
(Compared configurations: normal CRF, normal CRF with compound labels, semi-CRF, constrained Viterbi.)
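The backtracking idea can be sketched as a small branch-and-bound over per-segment label assignments. This is a simplified illustration (flat label scores, a generic `valid` predicate), not the constrained Viterbi of the papers above.

```python
def constrained_best(scores, valid):
    """scores: per-position {label: potential}; valid(prefix) checks the
    cross-segment constraints (e.g. cardinality: at most one Title).
    Depth-first branch and bound: backtracks as soon as a constraint is
    violated, and prunes branches whose optimistic bound cannot beat
    the incumbent."""
    n = len(scores)
    best = [float("-inf"), None]

    def bound(score, i):
        # optimistic completion: best unconstrained label at each
        # remaining position
        return score + sum(max(scores[j].values()) for j in range(i, n))

    def dfs(i, prefix, score):
        if not valid(prefix):
            return                      # constraint violated: backtrack
        if bound(score, i) <= best[0]:
            return                      # cannot beat incumbent: prune
        if i == n:
            best[0], best[1] = score, list(prefix)
            return
        for lab, s in sorted(scores[i].items(), key=lambda kv: -kv[1]):
            prefix.append(lab)
            dfs(i + 1, prefix, score + s)
            prefix.pop()

    dfs(0, [], 0.0)
    return best[1], best[0]

# Toy run: both segments prefer Title, but at most one Title is allowed.
toy = [{"Title": 2.0, "Author": 1.0}, {"Title": 2.0, "Author": 1.0}]
labels, score = constrained_best(toy, lambda p: p.count("Title") <= 1)
```

The admissible bound (sum of the best remaining unconstrained scores) is what turns plain backtracking into branch and bound, mirroring the role of the heuristic in the A* formulation mentioned earlier.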
Available clues:
- Entity column names in the database.
- Surface patterns, regular expressions: e.g. the pattern "X. [X.] Xx*" suggests an author name.
- Commonly occurring words: "Journal", "IEEE" suggest a journal name.
- Ordering of words: the part after "In" is the journal name.
- Similarity-based features.
- Labeled data: order of attributes, e.g. title before journal name.
- Canonical links.
- Schema-level: cardinality of attributes.
- Links between entities: which entity is allowed to go with which.

The final picture.
Summary
- Exploiting existing large databases to bridge to unstructured data is an exciting research problem with many applications.
- Conditional graphical models combine all possible clues for extraction/matching in a simple framework.
- Probabilistic: robust to noise, with soft predictions.
- Ongoing work: probabilistic output for imprecise query processing.
Available clues:
- Entity column names in the database.
- Surface patterns, regular expressions: e.g. the pattern "X. [X.] Xx*" suggests an author name.
- Commonly occurring words: "Journal", "IEEE" suggest a journal name.
- Ordering of words: the part after "In" is the journal name.
- TF-IDF similarity with stored entities.
- Labeled data: order of attributes, e.g. title before journal name.
- Schema-level: cardinality of attributes.
- Links between entities: which entity is allowed to go with which.
Adding structure to unstructured data
- Extensive research in the web, NLP, machine learning, data mining, and database communities.
- Most current research ignores existing structured databases: the database is just a store at the last step of data integration.
- Our goal: extend statistical models to exploit a database of entities and relationships. The models are persistent, part of the database: stored, indexed, evolving and improving along with the data.