Unsupervised Information Extraction by Text Segmentation Eli Cortez Advisor: Altigran Soares da Silva Universidade Federal do Amazonas
Transcript
Page 1: Unsupervised Information Extraction by Text Segmentation · Introduction 7 Contributions In this work we tackle the Information Extraction by Text Segmentation Problem (IETS) Important

Unsupervised Information Extraction by Text Segmentation

Eli Cortez

Advisor: Altigran Soares da Silva

Universidade Federal do Amazonas


Agenda

Introduction

Information Extraction by Text Segmentation (IETS)

Contributions

Related Work

Web Extraction Methods and Tools

Probabilistic Graph-Based Methods

Our Proposed Approach for IETS

Ondux

Judie

iForm

Conclusions and Future Work


Introduction

Steady increase in the number and types of sources of textual information available on the World Wide Web

Introduction

These sources constitute large repositories of valuable data on a variety of domains.

Data referring to different “things” such as:

Personal Information;

Products;

Publications;

Companies;

Cities.

Introduction

These sources impose important restrictions on the way the data they contain can be manipulated.

Text snippets (product descriptions, movie reviews) can hardly be subjected to automated processing.

It is difficult to automatically identify the data of interest.

Introduction

The Information Extraction (IE) Problem

Automatically extract structured information, such as entities, relationships between entities, and attributes describing entities, from noisy unstructured sources.

Named Entity Recognition;

Open Information Extraction;

Relationship Extraction;

Information Extraction by Text Segmentation (IETS)

Introduction

Information Extraction by Text Segmentation (IETS)

The problem of extracting attribute values occurring in implicit semi-structured data records in the form of continuous text.

Why is it important to extract information?

Query structured data; Data Mining; Record Linkage.

Introduction

Contributions

In this work we tackle the Information Extraction by Text Segmentation Problem (IETS)

Important and practical problem frequently addressed in the recent literature.

Borkar@SIGMOD'01, McCallum@ICML'01, Agichtein@SIGKDD'04, Mansuri@ICDE'06, Zhao@SICDM'08, Cortez@JASIST'09

We propose and implement an unsupervised approach to this problem.

Relies on information available in pre-existing data.

Learns content-based features (i.e., domain knowledge).

Exploits content-based features to directly learn structure-based features (i.e., source knowledge) from test data.

Eliminates the need for a user to be involved in any source-specific training process.

Introduction

Contributions

Based on our approach we produced a number of results.

ONDUX – On-Demand Unsupervised Learning for IE

SIGMOD’10, IDAR’10, SBBD’11

JUDIE – Joint Unsupervised Structure Discovery and IE

SIGMOD’11

iForm – A Probabilistic Approach for Automatically Filling Form-Based Web Interfaces

WWW’09, PVLDB’10

Related Work

Languages for Wrapper Development

Alternative to general-purpose languages such as Perl and Java.

Minerva, WebOQL

Wrapper Induction Methods

Use of machine learning to semi-automatically induce wrappers.

WIEN, STALKER

NLP-based Methods

Use of Natural Language Processing techniques (semantic class, POS)

WHISK, TextRunner

Ontology-based Methods

Use of an ontology and a conceptual description of the data of interest

HTML-aware Methods

Exploit the HTML structure (tags) and its representation (DOM)

RoadRunner, WebTables

Related Work

Methods and their disadvantages:

Languages for Wrapper Development: rely on the regularity of the HTML format.

Wrapper Induction Methods: rely on the regularity of the HTML format.

NLP-based Methods: require linguistic and grammatical elements.

Ontology-based Methods: require a huge human effort to manually create ontologies.

HTML-aware Methods: rely on the regularity of the HTML format.

These disadvantages preclude their use on a large number of the textual sources that are available on the Web.

Related Work

Probabilistic Graph-Based Methods

Deal with the limitations of the extraction methods that are based on the HTML structure.

Based on probabilistic frameworks such as: Conditional Random Fields (CRF) and Hidden Markov Models (HMM)

Supervised Methods

Rely on human-created training sets to generate graphical models able to extract information

Require training data from each source

Unsupervised Methods

Rely on pre-existing datasets to ease the training process of probabilistic methods.

Dictionaries, Knowledge Bases


Probabilistic Graph-Based Methods

Supervised Methods

Regent Square $228,900 1028 Mifflin Ave.; 6 Bedrooms; 2 Bathrooms. 412-638-7273

Related Work

1. <Neighborhood>Regent Square</Neighborhood>

2. <Price>$228,900</Price>

3. <Number>1028</Number>

4. <Street>Mifflin Ave.;</Street>

5. <Bedroom>6 Bedrooms</Bedroom>

6. <Bathroom>2 Bathrooms</Bathroom>

7. <Phone>412-638-7273</Phone>

CRF and HMM methods learn, from given examples, lexical and style (content) features and positioning and sequencing (structure) features

Examples are source-dependent

Related Work

Supervised Methods

[Figure: supervised methods. Learning: labeled segments (training) yield features f1, f2, f3, ..., fk and g1, g2, g3, ..., gl and a model. Extraction: the model receives unlabeled input strings from the input texts and outputs labeled segments.]

Related Work

Supervised Methods

[Figure: supervised methods with Text Source 1, Text Source 2 and Text Source 3.]

Related Work

Unsupervised Methods

[Figure: unsupervised methods. Learning: a model is built from a dataset. Extraction: the model outputs labeled segments.]

Related Work

Unsupervised Methods

[Figure: a dataset, content features f1, f2, f3, ..., fk, and Source 1, Source 2, Source 3.]

Related Work

Probabilistic Graph-Based Methods

Supervised vs. Unsupervised

Hand-labeled examples

Source Dependent

Scalability Problem

Reusability

Pre-existing information

Source Independent

Easily adaptable

Related Work

Probabilistic Graph-Based Methods

Unsupervised Methods

[Agichtein et al. @ SIGKDD 2004]

Usage of reference tables to create an unsupervised model using Hidden Markov Models (HMM)

[Zhao et al. @ SIAM ICDM 2008]

Usage of reference tables to create unsupervised CRF models (U-CRF)

[Sarawagi et al. @ ICDE 2006]

Usage of pre-existing data and hand-labeled training sets to create a semi-supervised model using CRF

Both unsupervised models (the HMM-based one and U-CRF) assume a single positioning and ordering of attributes in all test instances.

Our Proposed Approach for IETS

Knowledge Bases

◦ Set of pairs $KB = \{(m_1, O_1), \ldots, (m_n, O_n)\}$

◦ Easily built from pre-existing sources

◦ Bibliographic DBs, Freebase, Wikipedia, etc.
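
A knowledge base of this form can be sketched as a mapping from attribute names to sets of known values. This is only an illustration: it assumes that each m_i is an attribute name and each O_i is the set of values seen for that attribute, and the records and the helper `build_kb` are hypothetical.

```python
# Minimal sketch of KB = {(m_1, O_1), ..., (m_n, O_n)}.
# Assumption: m_i is an attribute name, O_i the set of its known values.
# The records below are invented for illustration only.
from typing import Dict, Iterable, Set

KnowledgeBase = Dict[str, Set[str]]


def build_kb(records: Iterable[Dict[str, str]]) -> KnowledgeBase:
    """Collect the values of each attribute from pre-existing records."""
    kb: KnowledgeBase = {}
    for record in records:
        for attribute, value in record.items():
            kb.setdefault(attribute, set()).add(value)
    return kb


records = [
    {"Neighborhood": "Regent Square", "Street": "Forbes Ave."},
    {"Neighborhood": "Shadyside", "Street": "Walnut St."},
]
kb = build_kb(records)
```

Any structured source (a bibliographic database, a product catalog) could feed `build_kb` in the same way.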


Our Proposed Approach for IETS

Our approach relies on two types of features:

State or content-based features;

[Figure: content-based features obtained from the KB: Attribute Vocabulary, Value Range, Value Format.]

Our Proposed Approach for IETS

Our approach relies on two types of features:

Transition or Structure-based Features;

[Figure: structure-based features obtained from the input records: attribute transition probabilities.]

Our Proposed Approach for IETS

Knowledge bases implicitly encode domain knowledge.

Very suitable source for learning content-based features

Attribute Vocabulary

Exploits the common vocabulary often shared by values of textual attributes
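
One simple way to picture a vocabulary feature is the fraction of a segment's terms that also occur in the known values of an attribute. The function `vocabulary_score` below is a hypothetical sketch under that assumption; the actual feature may weight terms differently (for example by frequency).

```python
# Hypothetical vocabulary feature: share of segment terms that appear
# in the vocabulary of an attribute (all terms of its known KB values).
def vocabulary_score(segment, attribute_values):
    vocabulary = {
        term.lower() for value in attribute_values for term in value.split()
    }
    terms = [term.lower() for term in segment.split()]
    if not terms:
        return 0.0
    return sum(term in vocabulary for term in terms) / len(terms)


ingredients = {"brown sugar", "white flour"}
score_full = vocabulary_score("White sugar", ingredients)
score_half = vocabulary_score("white rum", ingredients)
score_none = vocabulary_score("dark rum", ingredients)
```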


Our Proposed Approach for IETS

Knowledge bases implicitly encode domain knowledge.

Very suitable source for learning content-based features

Attribute Value Range

For the case of numeric candidate values, it measures the similarity between a numeric value and the set of values of a numeric attribute
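
As an illustration, the similarity between a number and a set of known numeric values can be sketched with a Gaussian centred on their mean. The Gaussian form and the name `value_range_score` are assumptions made for this sketch, not necessarily the function used in the work.

```python
# Hypothetical value-range feature: Gaussian similarity between a
# candidate number and the known values of a numeric attribute.
import math
from statistics import mean, pstdev


def value_range_score(candidate, known_values):
    mu = mean(known_values)
    sigma = pstdev(known_values)
    if sigma == 0:
        # All known values are identical: only an exact match scores.
        return 1.0 if candidate == mu else 0.0
    return math.exp(-((candidate - mu) ** 2) / (2 * sigma ** 2))


bedrooms = [1, 2, 3, 4, 5, 6]
close = value_range_score(4, bedrooms)
far = value_range_score(40, bedrooms)
```

Values near the mean of the attribute score close to 1, and values far from it score close to 0.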


Our Proposed Approach for IETS

Knowledge bases implicitly encode domain knowledge.

Very suitable source for learning content-based features

Attribute Value Format

Exploits the common format often used to represent values of some attributes.

Attribute Value Format (Style)

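First, a Markov model is generated for each attribute.

Then the probability that the input mask sequence represents a path in the Markov model of each attribute is computed.

[Figure: Markov model of an attribute, with Start and End states and the mask states [A-Z][a-z]+, [A-Z]. and [a-z][a-z]+; the edges carry the transition probabilities 1.0, 0.2, 0.8, 1.0 and 1.0. Example: “White sugar” yields the mask sequence [A-Z][a-z]+ [a-z][a-z]+.]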
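
The two steps above can be sketched as follows. The masking rule, the `START`/`END` state names and the functions `mask`, `train_format_model` and `format_score` are simplifying assumptions for illustration; in particular, this rule writes a run of lowercase letters as `[a-z]+` rather than `[a-z][a-z]+`.

```python
# Hypothetical sketch of the value-format feature: one Markov model per
# attribute over token masks, scored by the probability of a mask path.
from collections import defaultdict


def mask(token):
    """Map a token to a character-class mask, e.g. 'White' -> '[A-Z][a-z]+'."""
    classes = []
    for ch in token:
        if ch.isupper():
            c = "[A-Z]"
        elif ch.islower():
            c = "[a-z]"
        elif ch.isdigit():
            c = "[0-9]"
        else:
            c = ch
        if classes and classes[-1].rstrip("+") == c:
            classes[-1] = c + "+"  # collapse a run of the same class
        else:
            classes.append(c)
    return "".join(classes)


def train_format_model(values):
    """Estimate transition probabilities between masks from known values."""
    counts = defaultdict(lambda: defaultdict(int))
    for value in values:
        path = ["START"] + [mask(t) for t in value.split()] + ["END"]
        for a, b in zip(path, path[1:]):
            counts[a][b] += 1
    return {
        a: {b: n / sum(nxt.values()) for b, n in nxt.items()}
        for a, nxt in counts.items()
    }


def format_score(segment, model):
    """Probability that the segment's mask sequence is a path in the model."""
    path = ["START"] + [mask(t) for t in segment.split()] + ["END"]
    p = 1.0
    for a, b in zip(path, path[1:]):
        p *= model.get(a, {}).get(b, 0.0)
    return p


model = train_format_model(["White sugar", "Brown sugar", "Butter"])
```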


Our Proposed Approach for IETS

Content-based Features

[Figure: a text segment is scored by the Value Format, Value Range and Attribute Vocabulary features obtained from the KB; a Noisy-OR combines the scores to produce an attribute label.]

Our Proposed Approach for IETS

Structure-based features are automatically induced from content-based features

HMM-like graph called Positioning and Sequencing Model (PSM)

Positioning and Sequencing Model

Automatically learned on demand from test instances

No a priori training required

Structure-based features

Depend on the placement of attribute values in the input

Thus, they are input-dependent
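
A rough picture of how such a model could be induced from the labels produced by content-based features alone is given below. The function `build_psm` and the way counts are normalized are assumptions of this sketch; the slides do not state the exact estimation used.

```python
# Hypothetical sketch of a Positioning and Sequencing Model (PSM):
# transition and position probabilities estimated from label sequences
# that were assigned to the test records by content-based features.
from collections import Counter, defaultdict


def build_psm(labeled_records):
    transitions = defaultdict(Counter)
    positions = defaultdict(Counter)
    for labels in labeled_records:
        for index, label in enumerate(labels):
            positions[label][index] += 1
        for current, following in zip(labels, labels[1:]):
            transitions[current][following] += 1

    def normalize(table):
        return {
            key: {k: n / sum(row.values()) for k, n in row.items()}
            for key, row in table.items()
        }

    return normalize(transitions), normalize(positions)


labeled = [
    ["Neighborhood", "Price", "Street"],
    ["Neighborhood", "Street", "Price"],
]
transition_prob, position_prob = build_psm(labeled)
```

Because the model is estimated from the input itself, no separate training set is involved in this sketch.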

Our Proposed Approach for IETS

[Figure: content-related features of a text segment are combined by a Noisy-OR to produce an attribute label.]

Our Proposed Approach for IETS

Combination Strategy

Bayesian Noisy-OR Gate

We assume that the features we use exploit different properties of the attributes of the KB, i.e., they are independent.

Probabilistic methods such as CRF and HMM employ an optimization process to combine their features.

Although not using optimization can, in theory, lead to sub-optimal results, our experiments demonstrate that our combination works very well in practice.

$or(p_1, \ldots, p_n) = 1 - ((1 - p_1) \times \cdots \times (1 - p_n))$
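
The Noisy-OR combination can be written directly from the formula. The function name `noisy_or` and the sample scores are illustrative.

```python
# Noisy-OR combination of independent feature probabilities:
# or(p_1, ..., p_n) = 1 - (1 - p_1) * ... * (1 - p_n)
from math import prod


def noisy_or(probabilities):
    return 1.0 - prod(1.0 - p for p in probabilities)


# Illustrative scores for one segment and one attribute.
combined = noisy_or([0.5, 0.5])
```

A single confident feature is enough to produce a high combined score, since any p_i equal to 1 makes the product vanish.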


Our Proposed Approach for IETS

Based on our approach

We developed unsupervised information extraction by text segmentation methods

ONDUX

On Demand Unsupervised Information Extraction

JUDIE

Joint Unsupervised Structure Discovery and Information Extraction

iForm

A Probabilistic Approach for Automatically Filling Form-Based Web Interfaces


ONDUX On-Demand Unsupervised Learning for Information Extraction

Cortez et al. - SIGMOD 2010, Cortez and Silva – IDAR 2010


ONDUX

Deals with text documents containing implicit semi-structured data records

Addresses

Bibliographic References

Classified Ads

Product Descriptions

Postal Address: Dr. Robert A. Jacobson, 8109 Harford Road, Baltimore, MD 21214

Bibliographic Reference: Pável Calado, Marco Cristo, Marcos André Gonçalves, Edleno S. de Moura, Berthier Ribeiro-Neto, Nivio Ziviani. Link-based similarity measures for the classification of Web documents. JASIST, v. 57 n.2, p. 208-221, January 2006

ONDUX

[Figure: general view of ONDUX.]


ONDUX

Blocking

Split the input text into substrings called blocks;

Consider the co-occurrence of consecutive terms based on the KB

Regent Square $228,900 1028 Mifflin Ave.; 6 Bedrooms; 2 Bathrooms. 412-638-7273

[Figure annotations: “Co-occur in the KB (Neighborhood)”; “Left separated (no presence in the KB)”.]
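
Blocking can be pictured as merging consecutive terms whenever they co-occur in some value of the KB. The function `split_into_blocks`, the punctuation handling and the tiny KB below are assumptions made for this sketch.

```python
# Hypothetical blocking sketch: consecutive terms stay in the same block
# when they co-occur in at least one KB value; otherwise a new block starts.
from itertools import permutations


def normalize(term):
    return term.lower().strip(".,;")


def split_into_blocks(text, kb):
    pairs = set()
    for values in kb.values():
        for value in values:
            terms = [normalize(t) for t in value.split()]
            pairs.update(permutations(terms, 2))
    blocks = []
    for term in text.split():
        if blocks and (normalize(blocks[-1][-1]), normalize(term)) in pairs:
            blocks[-1].append(term)
        else:
            blocks.append([term])
    return [" ".join(block) for block in blocks]


# Invented KB for illustration only.
kb = {"Neighborhood": {"Regent Square"}, "Bedroom": {"6 Bedrooms"}}
blocks = split_into_blocks(
    "Regent Square $228,900 1028 Mifflin Ave.; 6 Bedrooms;", kb
)
```

With this KB, "Regent" and "Square" end up in one block, while "Mifflin" and "Ave.;" are left separated because they do not appear in the KB.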


ONDUX

Matching

Associate each block with an attribute according to content-based features.

Attribute Vocabulary

Value Range

Regent Square $228,900 1028 Mifflin Ave.; 6 Bedrooms; 2 Bathrooms. 412-638-7273

Labels assigned to the blocks, in order: Street, Price, No., ???, Street, Bed., Bath., Phone

ONDUX

Reinforcement – PSM

Ordering and positioning features are learned on demand from the test instances through the Matching phase

ONDUX

Reinforcement

Once the PSM is built, we combine the content-based and the structure-based features using the Bayesian Noisy-OR operator.

Regent Square $228,900 1028 Mifflin Ave.; 6 Bedrooms; 2 Bathrooms. 412-638-7273

Labels before reinforcement, in order: Street, Price, No., ???, Street, Bed., Bath., Phone

Labels after reinforcement, in order: Neighborhood, Price, No., Street, Bed., Bath., Phone

ONDUX - Experiments

Setup

We tested our proposed method with several sources from 3 distinct domains:

Addresses

Bibliographic Data

Classified Ads

Metrics

Precision, Recall and F-Measure

T-Test for the statistical validation of the results
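
For reference, precision, recall and F-measure over extracted attribute values can be computed as below. Treating each extraction as an (attribute, value) pair and the name `precision_recall_f1` are assumptions of this sketch, not a description of the evaluation scripts used in the experiments.

```python
# Precision, recall and F-measure over sets of (attribute, value) pairs.
def precision_recall_f1(extracted, expected):
    extracted, expected = set(extracted), set(expected)
    correct = len(extracted & expected)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(expected) if expected else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


extracted = {("Price", "$228,900"), ("Street", "Regent Square")}
expected = {
    ("Price", "$228,900"),
    ("Neighborhood", "Regent Square"),
    ("Phone", "412-638-7273"),
}
precision, recall, f1 = precision_recall_f1(extracted, expected)
```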

Baselines

U-CRF and S-CRF


ONDUX - Experiments

Extraction Quality

[Figure: F-measure (0 to 1) per attribute (Name, Street, City, State, Phone, Average) for S-CRF, U-CRF, ONDUX-M and ONDUX-R. Dataset: BigBook | Source: BigBook]

U-CRF results similar to Zhao@SICDM (validation)

Dataset follows the single order assumption

After Reinforcement, ONDUX achieved similar quality

ONDUX - Experiments

Extraction Quality

[Figure: F-measure (0 to 1) per attribute for S-CRF, U-CRF, ONDUX-M and ONDUX-R. Dataset: CORA | Source: CORA]

S-CRF achieved results higher than U-CRF due to the hand-labeled training

CORA includes a variety of citation styles (conference, journal, books, etc.)

In general, ONDUX outperformed CRF models

ONDUX - Experiments

Extraction Quality

[Figure: F-measure (0 to 1) per attribute for U-CRF, ONDUX-M and ONDUX-R. Dataset: Web Ads | Source: Folha On-line]

Due to the Matching Phase and the PSM that is learned on demand, ONDUX achieves very high quality results

U-CRF presented a poor performance (very heterogeneous dataset)

JUDIE

Joint Unsupervised Structure Discovery and Information Extraction

Cortez et al. - SIGMOD 2011


JUDIE

Chocolate Cake Recipe

1/2 cup butter 2 eggs 4 cups white sugar ground cinnamon 2 tablespoons dark rum 6 chopped pecans 1/2 cup milk 1 1/2 cups applesauce 2 cups all-purpose flour 1/4 cup cocoa powder 2 teaspoons baking soda 1/8 teaspoon salt 1 cup raisins 1/4 cup dark rum

Quantity | Unit | Ingredient
1/2 | cup | butter
2 | | eggs
4 | cups | white sugar
| | ground cinnamon
2 | tablespoons | dark rum
6 | | chopped pecans

JUDIE

Joint Unsupervised Structure Discovery and Information Extraction

Detects the structure of each individual record being extracted without any user intervention

Looks for frequent patterns of label repetitions or cycles

Integrates this algorithm in the IE process

Accomplished by successive refinement steps that alternate information extraction and structure discovery.

JUDIE

[Figure: JUDIE pipeline. Phase 1: Structure-free Labeling, Structure Sketching. Phase 2: Structure-aware Labeling, Structure Refinement.]

JUDIE – Structure-free Labeling

What is the best label for each segment? No information on the structure of the data records

Resort only to content-based features

[Figure: the segment “White sugar” is scored by the Value Format, Value Range and Attribute Vocabulary features obtained from the KB; a Noisy-OR combines the scores and outputs the label Ingredient.]

JUDIE – Structure-free Labeling

Initially labels potential values with attribute names.

No information on the structure of the data records

Resort only to content-based features

Learned from the pre-existing KB

1/2 cup raising flour 2 level Tbsp Cocoa pinch Salt 1/4 cup Melted butter 1 Egg a little Vanilla

Limitations:

Unmatching: “Tbsp”

Mismatching: “a little”

The SD Algorithm (I)

Uncovers the structure of implicit records from the input text.

Used in Structure Sketching and Structure Refinement.

Takes as input a sequence of labels and generates the structure of each record.

Assumption: It is possible to identify patterns of sequences by looking for cycles in a graph (Adjacency Graph) that models the ordering of labels.
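
The idea of an adjacency graph over labels, and of cycles in it, can be sketched as below. This only builds the graph and enumerates its simple cycles; how the SD algorithm selects among cycles is not covered here, and the names `adjacency_graph` and `simple_cycles` are hypothetical.

```python
# Hypothetical sketch: adjacency graph of a label sequence and its
# simple cycles, each reported once in a canonical rotation.
from collections import defaultdict


def adjacency_graph(labels):
    """Add an edge a -> b for every pair of consecutive labels."""
    graph = defaultdict(set)
    for a, b in zip(labels, labels[1:]):
        graph[a].add(b)
    return graph


def simple_cycles(graph):
    cycles = set()

    def visit(start, node, path):
        for following in graph.get(node, ()):
            if following == start:
                # Rotate so the smallest label comes first (deduplication).
                i = path.index(min(path))
                cycles.add(tuple(path[i:] + path[:i]))
            elif following not in path:
                visit(start, following, path + [following])

    for start in list(graph):
        visit(start, start, [start])
    return cycles


labels = ["Author", "Title", "Year", "Author", "Title", "Year"]
graph = adjacency_graph(labels)
cycles = simple_cycles(graph)
```

In this toy sequence the only cycle is Author, Title, Year, which corresponds to the repeating record structure.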


The SD Algorithm (II)

Consider a sequence of labels from a bibliographic reference input text.

Title Conference Year Author Author Title Conference Year Author Title Conference Year … Author Title Journal Issue Year Author Title Journal Issue Year Author Author Journal Issue Year Title Year … Author Title Conference Year Author Author Author Title Journal Issue Year

[Figure: adjacency graph with the nodes Author, Title, Journal, Issue, Conference and Year.]

The SD Algorithm (V)

Dominant Cycles

Given the set of coincident cycles that are also viable, the dominant cycles are the most frequent ones in the input.

Finally, the algorithm works by first identifying all dominant cycles in the adjacency graph and then processing each of these cycles, the largest cycles being processed first.

In our example, the dominant cycles are:

1. [Author, Title, Journal, Issue, Year]

2. [Author, Title, Conference, Year]

3. [Author, Journal, Issue, Year]

4. [Title, Conference, Year]

5. [Title, Year]

JUDIE – Structure Sketching

Organizes the labeled candidate values into records

Induces a structure on the unstructured text input.

Outputs labeled values grouped into records

Uses a novel algorithm called Structure Discovery (SD)

1/2 cup raising flour 2 level Tbsp Cocoa pinch Salt 1/4 cup Melted butter 1 Egg a little Vanilla

JUDIE – Structure-aware Labeling

Now, what is the best label for each segment?

We already know some structural information

Re-labels segments considering content-based features and structure-based features

Structure-based features learned using a graphical model (PSM)

1/2 cup raising flour 2 level Tbsp Cocoa pinch Salt 1/4 cup Melted butter 1 Egg a little Vanilla

[Figure: content-related features are combined by a Noisy-OR; the segment “A little” is labeled Unit.]

JUDIE – Structure-aware Labeling

Labels textual values considering content-related features and structure-based features

Uses a graphical model representing the likelihood of attribute transitions within the input text

1/2 cup raising flour 2 level Tbsp Cocoa pinch Salt 1/4 cup Melted butter 1 Egg a little Vanilla

JUDIE – Structure Refinement

Applies the SD algorithm again

Considers the output of the structure-aware labeling

Fixes structural problems

Structure-aware labeling produces more precise results

1/2 cup raising flour 2 level Tbsp Cocoa pinch Salt 1/4 cup Melted butter 1 Egg a little Vanilla

Experiments

Metrics

Precision, Recall and F-Measure

T-Test for the statistical validation of the results

Baselines

ONDUX and U-CRF


Evaluation – Record Level

Phase 1: acceptable. F ≈ 0.7

Phase 2: positive impact. Gains > 9%

In CORA, gains higher than 19%

Structural information led to significant improvements.

Dataset    Phase 1    Phase 2    Gain (%)
Recipes    0.79       0.90       13.2
CORA       0.69       0.83       19.3
Web Ads    0.70       0.77       9.7


Comparison with baselines – Attribute Level

Results are very close to ONDUX and better than U-CRF on every attribute except Bathroom (Web Ads)

Keep in mind: JUDIE faces a harder task.

CORA

Attribute    JUDIE    ONDUX    U-CRF
Author       0.88     0.922    0.87
Title        0.70     0.79     0.69
Booktitle    0.86     0.89     0.56
Journal      0.84     0.90     0.55
Volume       0.90     0.96     0.43
Pages        0.86     0.84     0.50
Date         0.87     0.89     0.49
Average      0.86     0.88     0.58

Web Ads

Attribute    JUDIE    ONDUX    U-CRF
Bedroom      0.82     0.86     0.79
Living       0.89     0.90     0.72
Phone        0.87     0.92     0.75
Price        0.92     0.93     0.78
Kitchen      0.83     0.84     0.78
Bathroom     0.77     0.79     0.81
Others       0.73     0.79     0.71
Average      0.84     0.85     0.76


iForm: A Probabilistic Approach for Automatically Filling Form-Based Web Interfaces

Toda et al. – WWW 2009; Toda et al. – PVLDB 2010


The Form Filling Problem

Goal:

To automatically fill out the fields of a given form-based interface with values extracted from a data-rich free text document.

1. Extracting values from the input text;

2. Filling out the fields of the target form using them.


Example

[Figure: a form-based interface with check-box, text box and selection list fields]


Example

Data-rich free text document


Example

Form Filling

[Figure: the form filled with the extracted values 2005, Honda, Accord, low, Automatic and Alloy Wheels; selected check boxes are marked with an x]


Common usage of Web Forms

A user manually fills each form field

Text-box, selection list, check-box and radio button

Tedious, error-prone and repetitive process


iForm

Information Extraction + Form Filling

Automatic form filling

[Figure: data-rich text document → values → verify values]


iForm - Scenario

Movie Review - Data-rich text:

Shutter Island is a 2010 American psychological thriller film directed by Martin Scorsese. The film is based on Dennis Lehane's 2003 novel of the same name. Starring Leonardo DiCaprio, Mark Ruffalo and Ben Kingsley.

[Figure: Web Form]


iForm – Selecting plausible segments

Is this text segment a suitable value of a given field of the form?

Shutter Island is a 2010 American psychological thriller film directed by Martin Scorsese. The film is based on Dennis Lehane's 2003 novel of the same name. Starring Leonardo DiCaprio, Mark Ruffalo and Ben Kingsley.

Candidate segments: "Shutter", "Shutter Island", "Shutter Island is", "Shutter Island is a", ..., "Leonardo", "Leonardo DiCaprio", ..., "Kingsley."

Redundant computation of several features can be avoided by using dynamic programming.
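The candidate segments can be enumerated as contiguous token runs, as in the minimal sketch below. The function name `candidate_segments` and the `max_len` cap are assumptions for illustration; the slide does not state a length limit, and the dynamic-programming reuse of feature values is not shown here.

```python
def candidate_segments(text, max_len=4):
    """Yield (start, end, segment) for every run of up to max_len tokens.

    start and end are token positions, with end exclusive.
    """
    tokens = text.split()
    for start in range(len(tokens)):
        last = min(start + max_len, len(tokens))
        for end in range(start + 1, last + 1):
            yield start, end, " ".join(tokens[start:end])


text = "Shutter Island is a 2010 American psychological thriller"
segments = [segment for _, _, segment in candidate_segments(text)]
```

Because each segment extends the previous one by a single token, a feature computed for tokens `start..end` can be reused when scoring `start..end+1`, which is where dynamic programming saves work.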


iForm - Features

Features considered:

Value Format

Attribute Value

Attribute Vocabulary

[Figure: for the segment "Shutter Island" and the field Title, the features are computed from previous submissions and combined through a Noisy-OR]
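A Noisy-OR combination in its standard form can be sketched as follows. The slide names the operator but gives no formula, so treating each feature as an independent probability is an assumption, and the three feature values are hypothetical.

```python
from math import prod


def noisy_or(probabilities):
    """Standard noisy-OR: 1 - product of (1 - p_i)."""
    return 1.0 - prod(1.0 - p for p in probabilities)


# Hypothetical feature scores for the segment "Shutter Island" and the
# field Title: value format, attribute value, attribute vocabulary.
score = noisy_or([0.6, 0.5, 0.2])
```

With this form a single strong feature is enough to yield a high score, and any additional evidence can only raise it.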


iForm – Mapping Segments to Fields

Given the set of text segments whose scores are above a threshold, iForm aims at finding a mapping between candidate values and form fields with a maximum aggregate score

Selects non-overlapping segments.

Accomplished by means of a two-phase procedure
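To make the constraints concrete, the sketch below picks non-overlapping segments greedily by score. It is not the two-phase procedure mentioned on the slide, which is not detailed here, and a greedy choice does not guarantee the maximum aggregate score. The function name, the one-value-per-field restriction and the sample scores are assumptions.

```python
def greedy_mapping(candidates):
    """candidates: list of (score, start, end, field), end exclusive.

    Takes candidates by descending score, skipping any that overlap an
    already chosen segment or target an already filled field.
    """
    chosen, used_fields = [], set()
    for score, start, end, field in sorted(candidates, reverse=True):
        overlaps = any(start < e and s < end for _, s, e, _ in chosen)
        if field not in used_fields and not overlaps:
            chosen.append((score, start, end, field))
            used_fields.add(field)
    return chosen


# Hypothetical scored candidates over token positions of the movie review.
candidates = [
    (0.9, 0, 2, "title"),       # "Shutter Island"
    (0.4, 0, 1, "title"),       # "Shutter"
    (0.8, 11, 13, "director"),  # "Martin Scorsese."
    (0.3, 1, 3, "director"),    # "Island is"
]
mapping = greedy_mapping(candidates)
```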


iForm – Filling Form-based interfaces

Uses the final mapping to fill out the form fields

Text boxes: mapped text segments are used as field values.

Check boxes: set to true for mapped fields.

Selection List:

[Figure: example values "Movie", "Shutter Island" (mapped to the field title) and "psychological thriller"]


iForm - Overview

[Figure: overview of iForm. The data-rich text (the Shutter Island movie review) and previous submissions are used to fill the Web form with the values Shutter Island, Martin Scorsese, Leonardo DiCaprio, Mark Ruffalo, Ben Kingsley and Thriller, plus a marked check box]


Experiments

Baseline

iCRF - a method for interactive form filling based on CRF

The Jobs dataset was used for an experimental comparison between iForm and iCRF.

Dataset       Test Data   Previous Data   # Fields   Source (Test Data)   Source (Previous Data)
Jobs          50          100             13         RISE                 RISE
Movies        50          10000           4          IMDb                 FreeBase / Wikipedia
Cars          50          10000           35         TodaOferta.com       TodaOferta.com
Cellphones    50          10000           37         TodaOferta.com       TodaOferta.com
Books 1       50          10000           5          Submarino.com        TodaOferta.com
Books 2       50          10000           4          Submarino.com        Ingenta
Books 3       50          10000           2          Submarino.com        Ourpress.com
Books 4       50          10000           3          Submarino.com        NetLibrary


Evaluation – Multi-typed web forms

Movies

Type of Field       # Fields   P      R      F
Text Box            4          0.74   0.69   0.71
Submission-Level               0.73   0.67   0.69

Cellphones

Type of Field       # Fields   P      R      F
Text Box            2          0.89   0.69   0.78
Check Box           35         0.94   0.94   0.94
Average                        0.94   0.93   0.93
Submission-Level               0.96   0.94   0.95

Filling quality above 0.90. In fact, more than 90% of each submission was correctly entered in the web form interface.

iForm achieved high quality results in all datasets


Evaluation – Comparison with iCRF

Jobs

Field             iForm   iCRF
Application       0.82    0.37
Area              0.18    0.23
City              0.70    0.65
Company           0.41    0.17
Country           0.77    0.87
Desired Degree    0.57    0.37
Language          0.84    0.69
Platform          0.47    0.38
Recruiter         0.44    0.22
Req. Degree       0.31    0.59
Salary            0.22    0.25
State             0.85    0.81
Title             0.72    0.49

iForm had superior F-measure levels in nine of the thirteen fields.

The lower quality obtained by iCRF is explained by the fact that segments to be extracted from typical free text inputs, such as job postings, may not appear in a regular context.

iForm was designed to conveniently exploit field-related features from previous submissions.


Conclusions

This work proposes an unsupervised approach to the IETS problem.

Relies on information available in pre-existing data.

Exploits content-based features to learn structure-based features directly from the test data.

Shows that pre-existing datasets allow for the unsupervised learning of both content-based and structure-based features.

Eliminates the need for user involvement in any source-specific training process.

Information Extraction Methods: ONDUX, JUDIE and iForm


Publications

Thesis Core

1. Joint Unsupervised Structure Discovery and Information Extraction. SIGMOD Conference – 2011

2. Unsupervised Information Extraction with the ONDUX Tool. Brazilian Symposium on Databases (SBBD) – 2011

3. On Using Wikipedia to Build Knowledge Bases for Information Extraction by Text Segmentation. Journal of Information and Data Management (JIDM) – 2011

4. ONDUX: on-demand unsupervised learning for information extraction. SIGMOD Conference – 2010

5. Unsupervised strategies for information extraction by text segmentation. SIGMOD PhD Workshop on Innovative Database Research (IDAR) – 2010

6. A Probabilistic Approach for Automatically Filling Form-Based Web Interfaces. Proceedings of the VLDB Endowment (PVLDB) – 2010

7. Automatically filling form-based web interfaces with free text inputs. International Conference on World Wide Web (WWW) – 2009


Publications

Related to the Information Extraction Problem

8. Building a research social network from individual perspective. Joint Conference on Digital Libraries (JCDL) – 2011

9. CiênciaBrasil – The Brazilian Portal of Science and Technology. Integrated Seminar of Software and Hardware (SEMISH) – 2011

10. A flexible approach for extracting metadata from bibliographic citations. Journal of the American Society for Information Science and Technology (JASIST) – 2009


Publications

Other Publications

11. Lightweight methods for large-scale product categorization. Journal of the American Society for Information Science and Technology (JASIST) – 2011

12. Adaptive and Flexible blocking for record linkage tasks. Journal of Information and Data Management (JIDM) – 2010

13. Blocagem adaptativa e flexível para o pareamento aproximado de registros. Brazilian Symposium on Databases (SBBD) – 2009

Tutorials

14. Methods and techniques for information extraction by text segmentation. Alberto Mendelzon International Workshop on Foundations of Data Management (AMW) – 2012

15. Methods and techniques for information extraction by text segmentation. Brazilian Symposium on Databases (SBBD) – 2011


Future Work

Generating transductive methods using domain knowledge

Use our approach to extract information from HTML

Query Extraction using our unsupervised approach

Extraction Improvement Through User Feedback


Acknowledgments



