April 22, 2004 1 Text Mining: Finding Nuggets in Mountains of Textual Data Jochen Doerre, Peter Gerstl, Roland Seiffert IBM Germany, August 1999 Presenter: Tyler Carr

April 22, 2004

Text Mining: Finding Nuggets in Mountains of Textual Data

Jochen Doerre, Peter Gerstl, Roland Seiffert

IBM Germany, August 1999

Presenter: Tyler Carr

Outline
- Motivation
- Methodology
- Feature Extraction
- Clustering and Categorizing
- Applications
- Exam Questions

Motivation

90% of a company’s data cannot be examined with standard data mining:
- Customer letters
- E-mail correspondence
- Phone call recordings
- Contracts
- Technical documentation
- Patents
- News articles
- Web pages

Value of Text Mining
- Rapid digestion of large document collections
- Faster than human knowledge brokers
- Objective and customizable analysis
- Automation of tasks

Typical Applications
- Summarizing documents
- Monitoring relations among people, places, and organizations
- Organizing documents by content
- Organizing indices for search and retrieval (keyword finding)
- Retrieving documents by content


Challenges in Text Mining
- Information is in unstructured textual form
- Natural Language (NL) interpretation is years away for computers
- Text Mining deals with huge collections of documents

Two Text Mining Approaches
- Knowledge Discovery
  - Extraction of codified information (features)
- Information Distillation
  - Analysis of the feature distribution

Comparison with Data Mining

Data Mining
- Identify data sets
- Select features manually
- Prepare data
- Analyze distribution

Text Mining
- Identify documents
- Extract features
- Select features by algorithm
- Prepare data
- Analyze distribution
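
The text mining steps listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Intelligent Miner implementation: the word tokenizer, the document-frequency thresholds (`min_df`, `max_df_ratio`), and the sample documents are all invented for the example.

```python
import re
from collections import Counter

def extract_features(text):
    """Extract features: here simply lowercase word tokens (an assumption)."""
    return re.findall(r"[a-z]+", text.lower())

def select_features(docs_tokens, min_df=2, max_df_ratio=0.8):
    """Select features by algorithm: keep terms found in at least min_df
    documents, but in no more than max_df_ratio of all documents."""
    df = Counter()
    for tokens in docs_tokens:
        df.update(set(tokens))  # count each term once per document
    n = len(docs_tokens)
    return sorted(t for t, c in df.items() if c >= min_df and c / n <= max_df_ratio)

def prepare(docs_tokens, vocab):
    """Prepare data: one term-count vector per document."""
    rows = []
    for tokens in docs_tokens:
        counts = Counter(tokens)
        rows.append([counts[t] for t in vocab])
    return rows

# Identify documents (hypothetical sample collection).
documents = [
    "The printer jams when printing labels.",
    "Printer driver crashes on startup.",
    "Invoice total was wrong on my bill.",
    "Wrong invoice sent to customer.",
]

tokens = [extract_features(d) for d in documents]
vocab = select_features(tokens)
vectors = prepare(tokens, vocab)

# Analyze distribution: how often each selected feature occurs overall.
totals = [sum(column) for column in zip(*vectors)]
print(vocab)
print(vectors)
print(totals)
```

The only step with no counterpart in the data mining column is `extract_features`; `select_features` replaces the manual selection step.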


Feature Extraction
- “To recognize and classify significant vocabulary items in unrestricted natural language texts.”
- Classes of Vocabulary
  - Proper names
  - Technical phrases
  - Abbreviations and acronyms
  - …

Canonical Forms
- Numbers convert to normal form
  - Four ==> 4
- Dates convert to normal form
- Inflected forms convert to common form
  - Sings, Sang, Sung ==> Sing
- Alternative names convert to explicit form
  - Mr. Carr, Tyler, Presenter ==> Tyler Carr
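
A minimal sketch of canonicalization, assuming small hand-written lookup tables. The tables, the date formats, and the ISO output format are illustrative assumptions. A real system would derive name variants from the document rather than hard-code them.

```python
from datetime import datetime

# Hypothetical lookup tables, for illustration only.
NUMBER_WORDS = {"one": "1", "two": "2", "three": "3", "four": "4", "five": "5"}
LEMMAS = {"sings": "sing", "sang": "sing", "sung": "sing"}
ALIASES = {"mr. carr": "Tyler Carr", "tyler": "Tyler Carr", "presenter": "Tyler Carr"}
DATE_FORMATS = ("%B %d, %Y", "%d-%b-%Y", "%Y-%m-%d")  # assumed input formats

def canonical_date(text):
    """Return the date as YYYY-MM-DD, or None if no assumed format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None

def canonical(term):
    """Map a vocabulary item to its canonical form; unknown terms pass through."""
    stripped = term.strip()
    as_date = canonical_date(stripped)
    if as_date is not None:
        return as_date
    key = stripped.lower()
    for table in (NUMBER_WORDS, LEMMAS, ALIASES):
        if key in table:
            return table[key]
    return stripped

for item in ["Four", "April 22, 2004", "Sang", "Mr. Carr", "clustering"]:
    print(item, "==>", canonical(item))
```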

Feature Extraction Tools
- Linguistically motivated heuristics
- Pattern matching
- Limited amounts of lexical information
  - Part-of-speech information (subject, verb)
- Avoid analyzing too deep (for speed)
  - Does not use huge amounts of lexical info.
  - No in-depth syntactic and semantic analysis

Feature Extraction Example
- Disambiguating Proper Names (Nominator Program)
  - Apply heuristics to strings, instead of interpreting semantics.
  - The unit of context for extraction is a document.
  - The heuristics represent English naming conventions.
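
The idea of string heuristics scoped to one document can be illustrated as follows. This is not the Nominator algorithm. It is a sketch assuming two invented rules: a name candidate is a run of capitalized words (optionally preceded by a title), and a short mention is linked to a longer name only if exactly one full name in the same document contains all of its words.

```python
import re

TITLES = {"Mr.", "Mrs.", "Ms.", "Dr."}  # assumed list of English titles
NAME_PATTERN = re.compile(
    r"(?:(?:Mr|Mrs|Ms|Dr)\.\s+)?[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*"
)

def name_tokens(mention):
    """Split a mention into words and drop any title."""
    return [t for t in mention.split() if t not in TITLES]

def canonical_names(document):
    """Map each name mention in one document to its fullest form."""
    mentions = [m.group(0) for m in NAME_PATTERN.finditer(document)]
    # Multi-word names act as the canonical forms for this document.
    full_names = []
    for mention in mentions:
        tokens = name_tokens(mention)
        name = " ".join(tokens)
        if len(tokens) >= 2 and name not in full_names:
            full_names.append(name)
    mapping = {}
    for mention in mentions:
        tokens = name_tokens(mention)
        if not tokens:
            continue
        name = " ".join(tokens)
        if name in full_names:
            mapping[mention] = name
            continue
        # Link a short mention only when it is unambiguous in this document.
        owners = [f for f in full_names if set(tokens) <= set(f.split())]
        if len(owners) == 1:
            mapping[mention] = owners[0]
    return mapping

text = ("Tyler Carr presented the paper. Mr. Carr answered questions. "
        "Later, Carr thanked Roland Seiffert.")
names = canonical_names(text)
print(names)
```

Capitalized words that match no full name in the document, such as the sentence-initial "Later", are dropped. No semantic analysis is involved, which keeps the pass fast but means sentence-initial words adjacent to a name would be mistaken for part of it.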

Feature Extraction Goals
- Very fast processing to deal with huge amounts of data
- Domain independence for general applicability


Clustering
- Also called Knowledge Discovery
- Fully automatic process
- Partitions a given collection into groups of documents similar in contents
- Clusters identifiable by feature vectors
  - Provides a set of keywords for each cluster
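
To make the notions of feature vectors and cluster keywords concrete, here is a small sketch. It uses a simple single-pass "leader" clustering with cosine similarity, which is a stand-in chosen for brevity and not the method used by the IBM tools. The stop-word list, the threshold, and the sample documents are assumptions.

```python
import math
import re
from collections import Counter

STOPWORDS = {"the", "when", "on", "to", "was", "a", "my"}  # assumed, tiny list

def vectorize(text):
    """Feature vector: term counts, ignoring stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS)

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def cluster(docs, threshold=0.2):
    """Assign each document to the most similar cluster, or start a new one."""
    clusters = []  # each cluster: summed feature vector plus member indices
    for index, doc in enumerate(docs):
        vec = vectorize(doc)
        best, best_sim = None, 0.0
        for candidate in clusters:
            sim = cosine(vec, candidate["centroid"])
            if sim > best_sim:
                best, best_sim = candidate, sim
        if best is not None and best_sim >= threshold:
            best["members"].append(index)
            best["centroid"].update(vec)  # adds the counts
        else:
            clusters.append({"centroid": Counter(vec), "members": [index]})
    return clusters

def keywords(one_cluster, k=2):
    """Describe a cluster by its k most frequent features."""
    return [term for term, _ in one_cluster["centroid"].most_common(k)]

docs = [
    "Printer jams when printing labels",
    "Printer driver crashes on startup",
    "Invoice total was wrong",
    "Wrong invoice sent to customer",
]
clusters = cluster(docs)
for c in clusters:
    print(c["members"], keywords(c))
```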

Two Clustering Engines
- Hierarchical Clustering tool
  - Orders the clusters into a tree reflecting various levels of similarity.
- Binary Relational Clustering tool
  - Produces a flat clustering together with relationships of different strength between the clusters
  - Relationships reflect inter-cluster similarities
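
The tree produced by a hierarchical engine can be illustrated with generic agglomerative clustering: repeatedly merge the two most similar groups until one tree remains. This is a textbook sketch and makes no claim about how either IBM engine works internally. The document labels and texts are invented.

```python
import math
import re
from collections import Counter

def vectorize(text):
    """Feature vector: term counts of lowercase words."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def build_tree(docs):
    """Merge the most similar pair until a single nested tuple remains."""
    nodes = [(label, vectorize(text)) for label, text in docs.items()]
    if not nodes:
        return None
    while len(nodes) > 1:
        best_pair, best_sim = (0, 1), -1.0
        for i in range(len(nodes)):
            for j in range(i + 1, len(nodes)):
                sim = cosine(nodes[i][1], nodes[j][1])
                if sim > best_sim:
                    best_pair, best_sim = (i, j), sim
        i, j = best_pair
        right = nodes.pop(j)  # pop the larger index first
        left = nodes.pop(i)
        # The merged node keeps both subtrees and the summed feature vector.
        nodes.append(((left[0], right[0]), left[1] + right[1]))
    return nodes[0][0]

docs = {
    "a": "printer jams labels",
    "b": "printer driver crashes",
    "c": "invoice total wrong",
    "d": "wrong invoice sent",
}
tree = build_tree(docs)
print(tree)
```

Documents merged early sit deep in the tree and are the most similar. The final merge near the root joins groups with little in common.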


Clustering Model

Categorization
- Also called Information Distillation
- Topic Categorization Tool
- Assigns documents to pre-existing categories (“topics” or “themes”)
- Categories are chosen to match the intended use of the collection

Categorization
- Categories defined by providing a set of sample documents for each category
- Training phase produces a special index, called the categorization schema
- Categorization tool returns set of category names and confidence levels for each document
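
The train-then-categorize flow can be sketched as follows. The format of the real categorization schema is not described here, so this example assumes the simplest possible stand-in: one summed term-count vector per category, with cosine similarity used as the confidence level. Category names and sample documents are hypothetical.

```python
import math
import re
from collections import Counter

def vectorize(text):
    """Feature vector: term counts of lowercase words."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def train(samples):
    """Training phase: build the 'schema' (assumed: one vector per category)."""
    schema = {}
    for category, texts in samples.items():
        centroid = Counter()
        for text in texts:
            centroid.update(vectorize(text))
        schema[category] = centroid
    return schema

def categorize(schema, text):
    """Return (category, confidence) pairs, highest confidence first."""
    vec = vectorize(text)
    scored = [(category, cosine(vec, centroid)) for category, centroid in schema.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

samples = {
    "hardware": ["printer jams labels", "printer driver crashes"],
    "billing": ["invoice total wrong", "wrong invoice sent"],
}
schema = train(samples)
result = categorize(schema, "my printer crashes again")
print(result)
```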

Categorization
- If confidence is below some threshold, document is set aside for human categorizer
- Tests have shown the Topic Categorization Tool agrees with human categorizers to the same degree as human categorizers agree with one another.
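
The threshold rule amounts to a small routing function. In this sketch the threshold value of 0.3 and the `"auto"` / `"human_review"` labels are assumptions chosen for illustration. The slides do not state what threshold the tool uses.

```python
def route(confidences, threshold=0.3):
    """Accept the top category, or set the document aside for a human.

    confidences: dict mapping category name to confidence level.
    Returns ("auto", category) or ("human_review", None).
    """
    if not confidences:
        return ("human_review", None)
    best = max(confidences, key=confidences.get)
    if confidences[best] < threshold:
        return ("human_review", None)
    return ("auto", best)

print(route({"hardware": 0.53, "billing": 0.0}))
print(route({"hardware": 0.10, "billing": 0.05}))
```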


Categorization Model


IBM Intelligent Miner for Text
- Software Development Kit (not full application)
- Contains necessary components for “real text mining”
- Also contains more traditional components:
  - IBM Text Search Engine
  - IBM Web Crawler
  - Drop-in Intranet search solutions

Applications
- Customer Relationship Management application provided by IBM Intelligent Miner for Text, called Customer Relationship Intelligence (CRI)
  - “Help companies better understand what their customers want and what they think about the company itself.”

Customer Intelligence Process
- Take body of communications with customer as input.
- Cluster the documents to identify issues.
- Characterize the clusters to identify the conditions for problems.
- Assign new messages to appropriate clusters.
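
The last step of the process, assigning a new message to an existing cluster, can be sketched by comparing the message with each cluster's characterization. Here the characterization is assumed to be a keyword set, and keyword overlap is an assumed similarity measure. The cluster names and keywords are hypothetical.

```python
import re

def assign(message, cluster_keywords):
    """Return the cluster sharing the most keywords with the message,
    or None when the message matches no cluster."""
    words = set(re.findall(r"[a-z]+", message.lower()))
    overlaps = {name: len(words & kws) for name, kws in cluster_keywords.items()}
    if not overlaps:
        return None
    best = max(overlaps, key=overlaps.get)
    return best if overlaps[best] > 0 else None

# Hypothetical clusters, already characterized by keywords.
cluster_keywords = {
    "printer problems": {"printer", "jams", "driver"},
    "billing errors": {"invoice", "wrong", "bill"},
}

print(assign("The invoice I received is wrong", cluster_keywords))
print(assign("Thanks for the quick delivery", cluster_keywords))
```

A message that matches nothing is a candidate for a new issue, which would send the analyst back to the clustering step.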

Customer Intelligence Usage
- Knowledge Discovery
  - Clustering used to create a structure that can be interpreted
- Information Distillation
  - Refinement and extension of clustering results
    - Interpreting the results
    - Tuning of the clustering process
    - Selecting meaningful clusters


Exam Question #1
Name an example of each of the two main classes of applications of text mining.
- Knowledge Discovery: Discovering a common customer complaint in a large body of feedback.
- Information Distillation: Filtering future comments into pre-defined categories.

Exam Question #2
How does the procedure for text mining differ from the procedure for data mining?
- Adds feature extraction function
- Not feasible to have humans select features
- Highly dimensional, sparsely populated feature vectors
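
A tiny example shows why the feature vectors are high-dimensional and sparse. Every distinct term adds a dimension, but each document uses only a few of them. The documents are invented, and storing only the non-zero counts in a `Counter` is one common representation, assumed here for illustration.

```python
import re
from collections import Counter

documents = [
    "printer jams labels",
    "printer driver crashes",
    "invoice total wrong",
    "wrong invoice sent",
]

# Sparse representation: each document stores only its non-zero features.
sparse = [Counter(re.findall(r"[a-z]+", d.lower())) for d in documents]

# The dimensionality is the number of distinct terms in the collection.
vocab = sorted(set().union(*sparse))

# Fraction of cells that would be non-zero in a dense document-term matrix.
non_zero = sum(len(vec) for vec in sparse)
density = non_zero / (len(documents) * len(vocab))

print(len(vocab), "dimensions for", len(documents), "documents")
print(f"density: {density:.2f}")
```

On a real collection the vocabulary grows to many thousands of terms while each document still contains only a handful, so the density falls far below this toy value.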

Exam Question #3
In the Nominator program of IBM’s Intelligent Miner for Text, an objective of the design is to enable rapid extraction of names from large amounts of text. How does this decision affect the ability of the program to interpret the semantics of text?
- Does not perform in-depth syntactic or semantic analysis of texts


Thank You

Any Questions?
