Text Mining -- Extraction Web-Based Information Architectures MSEC 20-760 Mini II Jaime Carbonell.

Date post: 14-Dec-2015
Upload: brayan-crain
Text Mining -- Extraction
Web-Based Information Architectures

MSEC 20-760
Mini II

Jaime Carbonell

General Topic: Text Extraction

• Motivation: Text Mining

• Context-Free Entity Extraction

• Role-based Entity Extraction

• Relational Extraction

• eBusiness Applications

Text Mining (1)

The Need to Process Text Automatically

• Text is meant to be read by humans, not programs.
• Most useful information is stored as text (100 times as much online text as online DBs).
• HTML web pages are text (with structuring tags).
• Data Mining (covered later) operates on data tables (i.e. numbers, fixed fields, adherence to data models).

Text Mining (2)

The Need to Process Text Automatically

• We need text => data table transducers.
• General Natural Language Understanding is still too hard.
• But can we solve simpler but useful sub-problems?
• Yes: categorization of text by topic and extraction of certain kinds of information from free text or HTML-structured text is possible.

Text Mining (3)

Components of Text Mining

• Categorization by topic or Genre

Introduced here; see Prof. Yang’s lecture

• Fact extraction from text

Topic of this class

• Data Mining from DBs or extracted facts

Later lecture on Data Mining

Text Categorization (1)

Definition

• Assign labels to each document or web page
• Labels may be topics such as Yahoo categories
  e.g. "finance," "sports," "news>world>asia>business"
• Labels may be genres
  e.g. "editorials," "movie-reviews," "news"
• Labels may be binary
  e.g. "interesting-to-me," "not-interesting-to-me"

Text Categorization (2)

Methods

• Manual assignment (as in Yahoo)

• Hand-coded rule based (as in Reuters)

(Usually: if the document contains a given Boolean combination of words, then assign it a specified category.)
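
A hand-coded rule of this kind can be sketched in a few lines of Python. The categories, keywords, and the helper names `RULES`, `tokenize`, and `categorize` below are hypothetical illustrations, not the rules used at Reuters.

```python
import re

# Hypothetical rule set: each category maps to a Boolean test over the
# set of words in the document (AND / OR / NOT combinations).
RULES = {
    "finance": lambda w: ("stock" in w or "shares" in w) and "game" not in w,
    "sports": lambda w: "game" in w and ("score" in w or "team" in w),
}

def tokenize(text):
    """Lowercase the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))

def categorize(text):
    """Return every category whose Boolean rule fires, in sorted order."""
    words = tokenize(text)
    return sorted(cat for cat, rule in RULES.items() if rule(words))

print(categorize("Flexicon shares rose sharply today"))
print(categorize("The team won the game"))
```

A document may receive several labels or none; what to do with unlabeled documents is left open here.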

Text Categorization (3)

Methods

• Learning of document-label assignment function
  – Most new applications rely on machine learning
  – k-Nearest Neighbors (simple, powerful); see Prof. Yang’s lecture
  – Decision-tree induction (most common method)
  – Support-vector machines (newest method)
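
As a rough illustration of the k-Nearest Neighbors idea, the sketch below labels a document by majority vote among its k most similar training documents. It assumes raw word counts and cosine similarity; the training set and the names `vectorize`, `cosine`, and `knn_label` are made up for this example, and practical systems typically add term weighting and stop-word removal.

```python
from collections import Counter
import math

def vectorize(text):
    """Bag-of-words vector: word -> count."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def knn_label(text, training, k=3):
    """Majority label among the k training documents most similar to text."""
    query = vectorize(text)
    ranked = sorted(training,
                    key=lambda ex: cosine(query, vectorize(ex[0])),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical labeled examples.
TRAINING = [
    ("stocks fell as the bank raised rates", "finance"),
    ("the bank reported higher quarterly profit", "finance"),
    ("investors sold shares after the merger", "finance"),
    ("the team won the final match", "sports"),
    ("the striker scored twice in the match", "sports"),
    ("the coach praised the team after the game", "sports"),
]

print(knn_label("the bank sold shares", TRAINING))
```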

Named Entity Identification I (1)

Purpose

To answer questions such as:

• Who is mentioned in these 100 Society articles?
• What locations are listed in these 2000 web pages?
• What companies are mentioned in these patent forms?
• What products were evaluated by Consumer Reports this year?

Named Entity Identification I (2)

Example

President Clinton decided to send special trade envoy Mickey Kantor to the special Asian economic meeting in Singapore this week. Ms. Xuemei Peng, trade minister from China, and Mr. Hideto Suzuki from Japan’s Ministry of Trade and Industry will also attend. Singapore, who is hosting the meeting, will probably be represented by its foreign and economic ministers. The Australian representative, Mr. Langford, will not attend, though no reason has been given. The parties hope to reach a framework for currency stabilization.

Named Entity Identification I (3)

Extracted Named Entities (NEs)

PEOPLE               PLACES
__________________________________________
President Clinton    Singapore
Mickey Kantor        Japan
Ms. Xuemei Peng      China
Mr. Hideto Suzuki    Australia
Mr. Langford

Named Entity Identification II: Finite-State Machines (1)

Definition of Finite State Acceptor (FSA)

• An FSA is a directed graph

• With a "start" node

• With one or more "accepting" nodes

Named Entity Identification II: Finite-State Machines (2)

Definition of Finite State Acceptor (FSA)

• With link labels matching input items
  – exact-match link labels
    e.g. "China" matching only "China"
  – wildcard (?) match
    e.g. "?" matches "100" or "China" or ...
  – feature match
    e.g. CAP matches any capitalized word
  – list-membership match
    e.g. if HON-LIST := (Mr, Ms, Dr, President, ...)
    it would match any of those words in the input

Named Entity Identification II: Finite-State Machines (3)

Definition of Finite State Acceptor (FSA)

• With an input source (e.g. string of words)

• Outputs "YES" or "NO"
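
The FSA definition above can be sketched as a small Python program: a dictionary of labeled links, a start node, a set of accepting nodes, and a function that consumes a list of words and answers "YES" or "NO". The topology used here (an optional honorific followed by one or more capitalized words) and all identifiers are assumptions made for illustration; the slides do not spell out the exact graph.

```python
HON_LIST = {"Mr", "Ms", "Dr", "President"}

# The four kinds of link label from the definition.
def exact(word):
    """Exact match: the link fires only on this word."""
    return lambda tok: tok == word

def wildcard(tok):
    """Wildcard (?): the link fires on any token."""
    return True

def cap(tok):
    """Feature match: the link fires on any capitalized word."""
    return tok[:1].isupper()

def in_list(words):
    """List-membership match (a trailing period is ignored)."""
    return lambda tok: tok.rstrip(".") in words

# Assumed graph: node -> list of (link label, destination node).
LINKS = {
    "start": [(in_list(HON_LIST), "hon"), (cap, "name")],
    "hon": [(cap, "name")],
    "name": [(cap, "name")],
}
ACCEPTING = {"name"}

def accepts(tokens, links=LINKS, start="start", accepting=ACCEPTING):
    """Follow every matching link in parallel; YES if an accepting node is reached."""
    states = {start}
    for tok in tokens:
        states = {dst for s in states
                  for match, dst in links.get(s, []) if match(tok)}
        if not states:
            return "NO"
    return "YES" if states & accepting else "NO"

print(accepts("Mr. Hideto Suzuki".split()))
print(accepts("trade minister".split()))
```

Tracking a set of current nodes keeps the sketch correct even when several links match the same word, as happens for "Mr.", which is both an honorific and capitalized.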

Named Entity Identification III: Finite-State Machines

Definition of a Finite State Transducer (FST)

• An FSA with variable binding

• Outputs "NO" or "YES"+variable-bindings

• Variable bindings encode recognized entity

e.g. "YES <firstname Hideto> <lastname Suzuki>"
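
A minimal sketch of an FST, assuming a simple linear chain of three links (honorific, capitalized word, capitalized word), each of which binds the word it matches to a variable. The chain, the variable names, and the function `transduce` are illustrative assumptions rather than the machine used in the lecture.

```python
HON_LIST = {"Mr", "Ms", "Dr", "President"}

def is_hon(tok):
    """List-membership match against HON_LIST (trailing period ignored)."""
    return tok.rstrip(".") in HON_LIST

def is_cap(tok):
    """Feature match: any capitalized word."""
    return tok[:1].isupper()

# Assumed chain of links: (link label, variable bound to the matched word).
CHAIN = [(is_hon, "HON"), (is_cap, "FirstName"), (is_cap, "LastName")]

def transduce(tokens, chain=CHAIN):
    """Return ("YES", bindings) if every link matches in order, else ("NO", {})."""
    if len(tokens) != len(chain):
        return "NO", {}
    bindings = {}
    for tok, (match, var) in zip(tokens, chain):
        if not match(tok):
            return "NO", {}
        bindings[var] = tok
    return "YES", bindings

print(transduce("Mr. Hideto Suzuki".split()))
```

A fixed-length chain rejects shorter names such as "Mr. Langford"; handling those would need optional links, which this sketch leaves out.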

Finite State Acceptor (FSA)

[Figure: FSA diagram with a Start State and an Accept State; links are labeled HON-LIST, CAP, and CAP.]

Finite State Transducer (FST)

[Figure: FST diagram with links labeled HON-LIST, CAP, and CAP, and variable bindings HON :=, FirstName :=, LastName :=.]

Role-Situated Named Entities (1)

Motivation

• It is useful to know roles of NEs, e.g.:
  – Who participated in the economic meeting?
  – Who hosted the economic meeting?
  – Who was discussed in the economic meeting?
  – Who was absent from the economic meeting?

Role-Situated Named Entities (2)

How do we Assign Roles to Entities?

• Instead of one FSM, use a trio of FSMs
  – <left-context-FSA> <entity-FSM> <right-context-FSA>

• Where left and right context help assign role

Role-Situated Named Entities (3)

Example

If <right-context> =

<? "not" ("attend" | "participate")>

Then entity.role = ABSENT

If <left-context> =

<("meet" | "meeting") ("in" | "at")>

Then entity.role = HOST
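
The two context rules above can be sketched as follows, assuming the entity has already been located by an entity FSM and is given as a token span. The function names and the punctuation handling are assumptions for illustration only.

```python
def right_is_absent(tokens):
    """<? "not" ("attend" | "participate")>: any word, then "not", then the verb."""
    return (len(tokens) >= 3 and tokens[1] == "not"
            and tokens[2] in {"attend", "participate"})

def left_is_host(tokens):
    """<("meet" | "meeting") ("in" | "at")> immediately before the entity."""
    return (len(tokens) >= 2 and tokens[-2] in {"meet", "meeting"}
            and tokens[-1] in {"in", "at"})

def assign_role(tokens, start, end):
    """Assign a role to the entity at tokens[start:end] from its two contexts."""
    left = [t.lower().strip(",.") for t in tokens[:start]]
    right = [t.lower().strip(",.") for t in tokens[end:]]
    if right_is_absent(right):
        return "ABSENT"
    if left_is_host(left):
        return "HOST"
    return None

sentence = "The Australian representative, Mr. Langford, will not attend".split()
print(assign_role(sentence, 3, 5))   # entity span covers "Mr. Langford,"
```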

Relational Information Extraction (1)

Motivation

It is useful to know who is doing what to whom

Relational Information Extraction (2)

Example

"John Snell reporting for Wall Street. Today Flexicon Inc. announced a tender offer for Supplyhouse Ltd. for $30 per share, representing a 30% premium over Friday’s closing price. Flexicon expects to acquire Supplyhouse by Q4 2001 without problems from federal regulators"

Relational Information Extraction (3)

Extraction System is Template of FSMs

[Corporate-acquisition

[acquirer <company-FSM> <r-acquirer-FSM>]

[acquiree <l-acquiree-FSM> <company-FSM>]

[share-price <money-FSM> <r-stock-FSM>]

[date <l-event-date-FSM> <date-FSM>]

]

Relational Information Extraction (4)

Output is Instantiated Template

[Corporate-acquisition

[acquirer "Flexicon Inc."]

[acquiree "Supplyhouse Ltd."]

[share-price "30 USD"]

[date "Q4 2001"]

]
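
A rough sketch of how such a template could be filled, using regular expressions as stand-ins for the entity and context FSMs. Every pattern below is an assumption tuned to the example text, not a general-purpose extractor, and `fill_template` is a hypothetical helper name.

```python
import re

# Stand-ins for <company-FSM>, <money-FSM>, and <date-FSM>.
COMPANY = r"[A-Z][a-z]+ (?:Inc|Ltd)\."
MONEY = r"\$(\d+)"
DATE = r"Q[1-4] \d{4}"

# Each slot pairs an entity pattern with a left or right context pattern.
TEMPLATE = {
    "acquirer": rf"({COMPANY}) announced a tender offer",      # entity + right context
    "acquiree": rf"tender offer for ({COMPANY})",              # left context + entity
    "share-price": rf"{MONEY} per share",                      # entity + right context
    "date": rf"expects to acquire \w+ by ({DATE})",            # left context + entity
}

def fill_template(text):
    """Return slot -> extracted string, or None when a slot's pattern does not match."""
    filled = {}
    for slot, pattern in TEMPLATE.items():
        m = re.search(pattern, text)
        filled[slot] = m.group(1) if m else None
    if filled["share-price"] is not None:
        filled["share-price"] += " USD"   # assumes "$" means US dollars
    return filled

TEXT = ("John Snell reporting for Wall Street. Today Flexicon Inc. announced a "
        "tender offer for Supplyhouse Ltd. for $30 per share, representing a 30% "
        "premium over Friday's closing price. Flexicon expects to acquire "
        "Supplyhouse by Q4 2001 without problems from federal regulators")

print(fill_template(TEXT))
```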

Fact Extraction: State of the Art (1)

Observations

• Entity => entity+roles => relation templates

Increasing richness of information extracted

• But not equivalent to language understanding

Only pre-determined info types extracted

Fact Extraction: State of the Art (2)

Observations

• Useful for relational DB filling

Acquirer    Acquiree       Sh.price    Year
__________________________________
Flexicon    Logi-truck     18          1999
Flexicon    Supplyhouse    30          2001
buy.com     reel.com       10          2000
...         ...            ...         ...
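
To make the DB-filling step concrete, the sketch below loads the three rows of the table into an in-memory SQLite database and queries it. The table name, column names, and the helper `acquirees_of` are assumptions chosen for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE acquisitions "
    "(acquirer TEXT, acquiree TEXT, share_price REAL, year INTEGER)"
)

# Rows as they might come out of instantiated extraction templates.
rows = [
    ("Flexicon", "Logi-truck", 18, 1999),
    ("Flexicon", "Supplyhouse", 30, 2001),
    ("buy.com", "reel.com", 10, 2000),
]
conn.executemany("INSERT INTO acquisitions VALUES (?, ?, ?, ?)", rows)

def acquirees_of(company):
    """Companies acquired by the given acquirer, oldest first."""
    cur = conn.execute(
        "SELECT acquiree FROM acquisitions WHERE acquirer = ? ORDER BY year",
        (company,),
    )
    return [r[0] for r in cur]

print(acquirees_of("Flexicon"))
```

Once the facts sit in a table, ordinary SQL queries and the data mining methods mentioned earlier can operate on them.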

Fact Extraction: State of the Art (3)

Technical Approaches

• Manually-built ad-hoc extraction "rules"
• Manually-built FSTs
• Feature-based training from labeled instances (Naive Bayes, Decision Trees)
• Hidden Markov Models
• FSTs with feedback-driven tuning

Applications of Text Extraction I (1)

Financial

• Email auto-response
  – e.g. "What is the balance of account N007623013?"
  – First categorize as balance-request
  – Then extract account number
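
The categorize-then-extract pipeline can be sketched as below. The keyword test and the account-number format (one capital letter followed by nine digits, inferred from the example) are assumptions; a deployed system would use a trained categorizer.

```python
import re

# Assumed account format, inferred from the example "N007623013".
ACCOUNT = re.compile(r"\b[A-Z]\d{9}\b")

def categorize(message):
    """Very rough stand-in for a text categorizer."""
    return "balance-request" if "balance" in message.lower() else "other"

def handle(message):
    """Categorize first; extract the account number only for balance requests."""
    category = categorize(message)
    if category != "balance-request":
        return category, None
    m = ACCOUNT.search(message)
    return category, (m.group(0) if m else None)

print(handle("What is the balance of account N007623013?"))
```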

Applications of Text Extraction I (2)

Financial

• Template filling from bank order
  – e.g. "Please transfer 100,000 USD from N007623013 to checking account A011129081 tomorrow"
  – First categorize as transfer

Applications of Text Extraction I (3)

Financial

• Template filling from bank order
  – Then extract:
    [account-transfer <from N007623013> <to A011129081> <amount 100,000> <date ??>]
  – Then employee checks template and adds/corrects information such as missing date (e.g. if the system cannot interpret "tomorrow")
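
A sketch of this template-filling step, under the assumption that the extractor only understands explicit ISO dates and leaves anything else (such as "tomorrow") as "??" for the employee to correct. The patterns and the function `fill_transfer` are hypothetical.

```python
import re

ACCOUNT = r"[A-Z]\d{9}"   # assumed account format, as in the example order
ORDER = re.compile(
    rf"transfer (?P<amount>[\d,]+) USD from (?P<src>{ACCOUNT}) "
    rf"to (?:\w+ )*?(?P<dst>{ACCOUNT})"
)
ISO_DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")

def fill_transfer(order):
    """Fill an account-transfer template; unknown dates are left as '??'."""
    m = ORDER.search(order)
    if not m:
        return None
    d = ISO_DATE.search(order)
    return {
        "from": m.group("src"),
        "to": m.group("dst"),
        "amount": m.group("amount"),
        "date": d.group(0) if d else "??",   # left for the employee to fill in
    }

print(fill_transfer("Please transfer 100,000 USD from N007623013 "
                    "to checking account A011129081 tomorrow"))
```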

Applications of Text Extraction II (1)

Informational

• For all seminar announcements in BB

extract time/title/speaker/location

• From email messages about proposed meetings

extract time/participants/location

Applications of Text Extraction II (2)

Large-scale Web applications

• Build DB of all job openings
  – Categorize web pages as job descriptions
  – Extract company/date/salary/level/...
  – Fill in relational DB with extracted info
• Whizbang! (a Pittsburgh eCompany) is doing just this via its flipdog.com site
• Build DB of all web-posted resumes, first categorizing pages as resumes, then extracting key fields name/expertise/...

Applications of Text Extraction II (3)

Corporate Intelligence

• Extract key facts from competitors' web sites
  – New products offered
  – Any changes to prices, sales, etc.
• Extract key facts about customers of competitors

