
Data Mining with Unstructured Data

A Study and Implementation of Industry Product(s)

Samrat Sen

04/09/23 UB - CS 711, Data Mining with Unstructured Data

Goals

Issues in Text Mining with Unstructured Data

Analysis of Data Mining products

Study of a Real Life Classification Problem

Strategy for solving the problem


Issues in Text Mining

Text mining differs from KDD and DM techniques for structured databases

Problems with the structured-data techniques:

1. They are concerned with predefined fields

2. They are based on learning from an attribute-value database

e.g. the tables and induced rules that follow


Potential Customer Table

Person   Age  Sex  Income  Customer
Ann S    32   F    10,000  yes
Jane G   53   F    20,000  no
Sri S    35   M    65,000  yes
Egor     25   M    10,000  yes

Married to Table

Husband  Wife
Egor     Ann S
Sri H    Jane

Induced Rules

If Married(Person, Spouse) and Income(Person) >= 25,000
Then Potential-Customer(Spouse)

If Married(Person, Spouse) and Potential-Customer(Person)
Then Potential-Customer(Spouse)
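The induced rules above can be applied mechanically to the two tables. The following is a minimal Python sketch, not taken from the slides: the function name `potential_customers` and the data layout are illustrative assumptions, Married is treated as symmetric, a "yes" in the Customer column is treated as Potential-Customer, and names are matched exactly as printed (so "Sri H" in the Married table is not linked to "Sri S" in the customer table).

```python
# Rows from the Potential Customer Table: person -> (income, is_customer).
CUSTOMERS = {
    "Ann S": (10_000, True),
    "Jane G": (20_000, False),
    "Sri S": (65_000, True),
    "Egor": (10_000, True),
}

# Rows from the Married to Table, as (husband, wife) pairs.
MARRIED = [("Egor", "Ann S"), ("Sri H", "Jane")]


def potential_customers(customers, married, income_threshold=25_000):
    """Apply both induced rules once and return the spouses they flag."""
    flagged = set()
    # Assumption: Married(Person, Spouse) holds in both directions.
    pairs = married + [(wife, husband) for husband, wife in married]
    for person, spouse in pairs:
        if person not in customers:
            continue  # no income or customer data recorded for this person
        income, is_customer = customers[person]
        # Rule 1: income at or above the threshold.
        # Rule 2: the person is already a (potential) customer.
        if income >= income_threshold or is_customer:
            flagged.add(spouse)
    return flagged


RESULT = potential_customers(CUSTOMERS, MARRIED)
print(sorted(RESULT))
```

The rules join two relations (Married and the customer table), which is exactly what a single attribute-value table cannot express.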



• Text mining relies on algorithmic techniques such as Association Extraction from indexed data and Prototypical Document Extraction from full text

• Industry standard data mining tools cannot be used directly

e.g. a typical process needs a Text Transformer, a Text Analyzer and a Summary Generator



• The input and output interfaces and the file formats may cost time and money.

• Exhaustive domains have to be set up for classification.

• Costs and benefits have to be weighed before model selection:

1. Gain from a correct positive prediction

2. Loss from an incorrect positive prediction (false positive)

3. Benefit from a correct negative prediction

4. Cost of an incorrect negative prediction (false negative)

5. Cost of project time (a better product/algorithm may come up)
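The five cost and benefit items can be combined into a single figure per candidate model. The sketch below is only an illustration under assumed numbers: the outcome counts, the per-outcome values and the project cost are all hypothetical, and `net_benefit` is not a function of any product discussed here.

```python
def net_benefit(counts, values, project_cost=0.0):
    """Sum (count x value) over the four outcomes, then subtract project cost."""
    total = sum(counts[outcome] * values[outcome] for outcome in counts)
    return total - project_cost


# Hypothetical confusion-matrix counts for two candidate models.
MODEL_A = {"tp": 80, "fp": 30, "tn": 870, "fn": 20}
MODEL_B = {"tp": 60, "fp": 5, "tn": 895, "fn": 40}

# Hypothetical value of each outcome (items 1 to 4 of the list above).
VALUES = {"tp": 50.0, "fp": -10.0, "tn": 0.5, "fn": -25.0}

# Item 5: the same hypothetical project cost is charged to both models.
BENEFIT_A = net_benefit(MODEL_A, VALUES, project_cost=1000.0)
BENEFIT_B = net_benefit(MODEL_B, VALUES, project_cost=1000.0)
print(BENEFIT_A, BENEFIT_B)
```

With these made-up figures, model A wins despite its larger number of false positives, because missed positives are priced higher than false alarms.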



Data Mining Products/Tools

DARWIN – from Oracle

Intelligent Miner – from IBM

Intermedia Text with Oracle Database, with context query feature (theme based document retrieval)

FOR MORE INFO...

http://www.oracle.com/ip/analyze/warehouse/datamining/

http://www-4.ibm.com/software/data/iminer/


Data Mining Products/Tools

• New specification being proposed by Sun for a Data Mining API *

• SQL Server 2000 – data mining and English Query features

• Verity Knowledge Organizer

FOR MORE INFO...

* http://java.sun.com/aboutJava/communityprocess/jsr/jsr_073_dmapi.html#3

Additional Text Mining sites:

1. http://textmining.krdl.org.sg/resourves.html

2. www.intext.de/TEXTANAE.htm

3. www.cs.uku.fi/~kuikka/systems.html


DARWIN

Functions

1. Prediction (from known values)

2. Classification (into categories)

3. Forecasting (future predictions)

Approach

1. Plan

2. Prepare Dataset

3. Build and Use models


DARWIN

The problem is defined in terms of data fields and data records

The fields are classified as follows:

- Categorical and Ordered Fields

- Predictive Fields

- Target Fields

• A DARWIN dataset file containing all the records in the problem domain has to be created (using a descriptor file)


DARWIN - Models

Tree model – Based on classification and regression tree algorithm

Net model – A feed forward multilayer neural network

Match Model – Memory based reasoning model, using a K-nearest neighbor algorithm
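To make the memory-based reasoning idea concrete, here is a minimal k-nearest-neighbor sketch in Python. It is not DARWIN's implementation: the Euclidean distance, the unweighted majority vote, the function name `knn_predict` and the training records are all assumptions chosen for illustration.

```python
from collections import Counter
import math


def knn_predict(training, query, k=3):
    """Return the majority label among the k training records nearest to query."""
    # training is a list of (feature_vector, label) pairs kept in memory.
    nearest = sorted(training, key=lambda rec: math.dist(rec[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical records: (age, income in thousands) -> customer label.
TRAINING = [
    ((25, 10), "no"),
    ((30, 12), "no"),
    ((35, 60), "yes"),
    ((40, 65), "yes"),
    ((50, 70), "yes"),
]

print(knn_predict(TRAINING, (38, 62)))
print(knn_predict(TRAINING, (27, 11)))
```

No model is built ahead of time; the stored records themselves are the model, which is why the approach is called memory based.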


DARWIN – Tree Model

1. Create Tree (input: Training Data)

2. Test/Evaluate Tree (gives information on error rates of pruned sub-trees)

3. Predict with Tree, using the selected sub-tree (input: I/P Prediction Dataset)

4. Analyze Results (input: merged I/P & O/P prediction dataset)


DARWIN – Net Model

1. Create Net (input: Training Dataset; output: Neural Network Model)

2. Train Net (information on error rates of pruned sub-trees; output: Trained Neural Network)

3. Apply the trained network to the Prediction Dataset

4. Analyze Results


DARWIN – Match Model

1. Create Match Model (input: Training Data)

2. Optimize match weights

3. Predict with Match

4. Analyze Results




DARWIN – Analyzing

Evaluate

Evaluates the performance of a given model on a given dataset, when working on known data for test or evaluation purposes.

Summarize Data

Provides a statistical summary of the values taken by the data in the specified fields of a dataset

Frequency Count

Provides information on the frequency with which particular data values appear in a dataset


DARWIN – Analyzing

Performance Matrix

Can be used to compare simple fields or simple functions of fields

Sensitivity

Provides a model showing the relative importance of the attributes used in building a model


DARWIN – Code Generation

• Darwin can generate C, C++ or Java code for a Tree or Net model so that a prediction function can be called from an application program

• Java code can also be generated to embed a model in a Web applet

FOR MORE INFO...

http://technet.oracle.com/docs/products/datamining/doc_index.htm


DARWIN

For more info

http://technet.oracle.com/software/products/intermedia/software_index.html

1. Oracle Data Mining Data sheet

2. Oracle Data Mining Solutions

http://www.oracle.com/ip/analyze/warehouse/datamining/

http://www.oracle.com/oramag/oracle/98-Jan/fast.html

1. Managing Unstructured Data with Oracle8

http://technet.oracle.com/products/datamining/

1. Product manuals


DARWIN

Oracle Personalization

Real-Time Recommendations

New Offering Available with Oracle9i

Hello! We have recommendations for you.


Oracle – Intermedia Text

A ranking technique called theme proving is used

Documents are grouped into categories and subcategories

Integrated with the Oracle8 database

No training or tuning is required


Lexical Knowledge Base

- 200,000 concepts from very broad domains

- 2000 major categories

- Concepts are mapped into one or more words/phrases in canonical form

- Each of these has alternate inflectional variations, acronyms and synonyms stored

- Total vocabulary of 450,000 terms

- Each entry has other parameters such as parts of speech


Theme Extraction

- Themes are assigned initial ranks based on the structure of the document and the frequency of the theme.

- All the ancestor themes are also included in the result

- Theme proving is done before final ranking

Queries

Direct match, phrase search (‘contains’), case-sensitive query, misspellings and fuzzy match, inflections (‘about’), compound queries, Boolean operators, natural language query


Oracle at TREC-8 (Eighth Text REtrieval Conference, http://otn.oracle.com/products/intermedia/htdocs/imt_trec8pap.htm)

Recall at 1000: 71.57% (3384/4728)

Average precision: 41.30%

Initial precision (at recall 0.0): 92.79%

Final precision (at recall 1.0): 7.91%


Intermedia Text-Model

Interface Options

Language Selection

Java for the robot

PL/SQL for data retrieval

Code Execution


Overview of the System

[Figure: system architecture. Components shown: Customer Browser, Client Browser, Web Server, server process (tag stripper, listening at port 80), JDBC, Oracle 8i with Intermedia Text]


Intermedia Text

Steps for building an application:

1. Load the documents

2. Index the documents

3. Issue queries

4. Present the documents that satisfy the query


Loading Methods

– Insert statements

– SQL*Loader

– Ctxsrv: a server daemon process which builds the index at regular intervals

– Ctxload utility, used for:

  Thesaurus import/export

  Text loading

  Document updating/exporting


Create and Populate a Simple Table

CREATE TABLE quick (
  quick_id NUMBER CONSTRAINT quick_pk PRIMARY KEY,
  text     VARCHAR2(80)
);

INSERT INTO quick VALUES ( 1, 'The cat sat on the mat' );

INSERT INTO quick VALUES ( 2, 'The fox jumped over the dog' );

INSERT INTO quick VALUES ( 3, 'The dog barked like a dog' );

COMMIT;


Run a Text Query

SELECT text FROM quick
WHERE CONTAINS ( text, 'sat on the mat' ) > 0;

DRG-10599: column is not indexed

You must have a Text index on a column before you can do a “contains” query on it


Create the Text Index

CREATE INDEX quick_text
ON quick ( text )
INDEXTYPE IS CTXSYS.CONTEXT;

CTXSYS is the system user for interMedia Text

The INDEXTYPE keyword is a feature of the Extensible Indexing Framework


Once the Text index exists, the “contains” query returns the matching row:

TEXT
-----------------------
The cat sat on the mat

You should regard the CONTAINS function as boolean in meaning

It is implemented as a number since SQL does not have a boolean datatype

The only sensible way to use it is with > 0


SELECT SCORE(42) s, text FROM quick
WHERE CONTAINS ( text, 'dog', 42 ) >= 0 /* just for teaching purposes! */
ORDER BY s;

 S TEXT
-- ---------------------------
 7 The dog barked like a dog
 4 The fox jumped over the dog

The better the match, the higher the score

The value can be used in ORDER BY but has no absolute significance

The score is zero when the query is not matched


Intermedia Text - Indexing Pipeline

[Figure: indexing pipeline. Database (column data) → Datastore (doc data) → Filter (filtered doc text) → Sectioner (plain text) → Lexer (tokens) → Engine (index data); the Sectioner also passes section offsets to the Engine]

• First step is creating an index

• Datastore: Reads the data out of the table (for a URL datastore, performs a ‘GET’)


• Filter: The data is transformed to some text type. This is needed because some formats may be binary, as when storing doc, pdf or HTML types.

• Sectioner: Converts to plain text, removes tags and invisible info.

• Lexer: Splits the text into discrete tokens.

• Engine: Takes the tokens from the lexer, the offsets from the sectioner and a list of stop words (the stoplist) to build an index.
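The sectioner, lexer and stoplist stages can be imitated in a few lines. This is a rough Python sketch, not interMedia Text code: it assumes the filter has already produced HTML text, uses a simple regular expression in place of a real tag parser, and the stop words and function names are hypothetical.

```python
import re

STOPLIST = {"the", "a", "on", "of"}  # hypothetical stop words


def sectioner(filtered_text):
    """Strip markup tags, leaving plain text."""
    return re.sub(r"<[^>]+>", " ", filtered_text)


def lexer(plain_text):
    """Split plain text into lower-cased word tokens with 1-based positions."""
    words = re.findall(r"[A-Za-z0-9]+", plain_text)
    return [(pos, word.lower()) for pos, word in enumerate(words, start=1)]


def index_terms(tokens, stoplist=STOPLIST):
    """Drop stop words; positions still count the dropped words."""
    return [(pos, tok) for pos, tok in tokens if tok not in stoplist]


HTML = "<p>The cat sat on the <b>mat</b></p>"
TERMS = index_terms(lexer(sectioner(HTML)))
print(TERMS)
```

Keeping the original word positions after stop word removal is a design choice of this sketch; it lets phrase and proximity queries still measure distances in the original text.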


Example of index creation

Statements

• INSERT INTO docs VALUES (1, 'first document');

• INSERT INTO docs VALUES (2, 'second document');

Produces an index

DOCUMENT  doc 1 position 2, doc 2 position 2

FIRST     doc 1 position 1

SECOND    doc 2 position 1
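The index in this example is an inverted index: each token points to the documents and word positions where it occurs. A minimal Python sketch of how the two rows lead to those entries follows; the whitespace tokenizer and the name `build_index` are assumptions, and the real engine stores its index data quite differently.

```python
from collections import defaultdict


def build_index(docs):
    """Map each upper-cased token to a list of (doc_id, position) postings."""
    index = defaultdict(list)
    for doc_id, text in docs:
        for position, token in enumerate(text.split(), start=1):
            index[token.upper()].append((doc_id, position))
    return dict(index)


DOCS = [(1, "first document"), (2, "second document")]
INDEX = build_index(DOCS)
for token in sorted(INDEX):
    print(token, INDEX[token])
```

A query for a single word then becomes one dictionary lookup instead of a scan over every document.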


Testing procedure

Document set from newsgroups

• 122 documents from a text mining site

• Loaded using insert statements

• File datastore used

Documents (HTML) from browsing

• 20 documents

• Loaded from server process

• URL datastore used


Newsgroup Results

1. Religion, Atheism – 15
   on bible, islam, religious beliefs

2. Comp-os-ms-windows-misc – 17
   about operating systems, protocols, installation

3. Comp.graphics – 27
   on hardware and software for computer graphics

4. Ice Hockey – 18

5. Computer hardware – 12
   on installation of different peripheral devices

6. Mideast.politics – 14
   on political development in the Mideast

7. Science.space – 19
   on various space programs, devices, theories


Newsgroup Results

Group                       Retrieved  Wrong  Not Retrieved  Recall  Precision
Science and technology      120        16     1              99%     78%
Computer Hardware Industry  12         0      5              71%     100%
Government                  103        26     8              90%     74%
Politics                    17         3      0              100%    82%
Military                    5          1      0              80%     80%
Social Environment          48         2      14             77%     96%
Religion                    22         3      2              90%     86%
Islam                       4          0      0              100%    100%
Leisure recreation          22         4      5              78%     82%
Sports                      21         1      0              90%     90%
Hockey                      18         0      0              100%    100%

Recall = (# of correct positive predictions) / (# of positive examples)

Precision = (# of correct positive predictions) / (# of positive predictions)
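As a worked example of the two formulas, the sketch below recomputes the Government row of the results table (103 retrieved, 26 wrong, 8 not retrieved). It assumes that "Retrieved" counts every document returned for the group, "Wrong" the returned documents that do not belong to it, and "Not Retrieved" the documents that were missed; the function name is illustrative.

```python
def recall_precision(retrieved, wrong, not_retrieved):
    """Return (recall, precision) from the three counts in the results table."""
    correct = retrieved - wrong                   # correct positive predictions
    recall = correct / (correct + not_retrieved)  # over all positive examples
    precision = correct / retrieved               # over all positive predictions
    return recall, precision


RECALL, PRECISION = recall_precision(retrieved=103, wrong=26, not_retrieved=8)
print(f"recall={RECALL:.1%} precision={PRECISION:.1%}")
```

Under that reading the row gives 77 correct documents, so recall is 77/85 and precision is 77/103, which agree with the tabulated 90% and 74% once the fractions are truncated to whole percentages.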


Query

Syntax: Binary Operators

Operator  Symbol  Example
AND       &       cat & dog
OR        |       cat | dog
EQUIV     =       cat = dog
MINUS     -       cat - dog
NOT       ~       cat ~ dog
ACCUM     ,       cat , dog


Semantics: Binary Operators

The semantics of all the binary operators is defined in terms of SCORE

However, the score for even the simplest query expression - a single word - is calculated by a subtle rule

– the score is higher for a document where the query word occurs more frequently than for one where it occurs less frequently

– but when “word1” occurs N times in document D, its score is lower than when “word2” occurs N times in document D if “word1” occurs more often in the whole document set than “word2”


The Salton Algorithm

• interMedia Text uses an algorithm which is similar to the Salton Algorithm - widely used in Text Retrieval products

• The score for a word is proportional to f ( 1 + log ( N/n ) ), where

– f is the frequency of the search term in the document

– N is the total number of documents

– n is the number of documents which contain the search term

• The score is converted into an integer in the range 0 - 100.
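The scoring rule can be tried out on the three-row `quick` table used earlier. The Python sketch below is an illustration only: the slides give the score up to proportionality, so the scale factor of 3, the base-10 logarithm and the rounding step are assumptions. With those choices the sketch happens to reproduce the scores 7 and 4 shown for the 'dog' query, where 'dog' occurs in 2 of the 3 rows.

```python
import math


def salton_score(f, total_docs, docs_with_term, scale=3.0):
    """Score a single-word query for one document, clipped to 0..100."""
    if f == 0 or docs_with_term == 0:
        return 0  # the term does not occur, so the query is not matched
    # f * (1 + log(N/n)) from the slide; scale and log base are assumed.
    raw = scale * f * (1 + math.log10(total_docs / docs_with_term))
    return min(100, round(raw))


# 'dog' appears twice in row 3, once in row 2, and not at all in row 1.
SCORES = [salton_score(f, total_docs=3, docs_with_term=2) for f in (2, 1, 0)]
print(SCORES)
```

The same word frequency earns a lower score when the term is common across the document set, because the log ( N/n ) factor shrinks as n approaches N.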


Assumption

Inverse frequency scoring assumes that frequently occurring terms in a document set are noise terms, and so these terms are scored lower. For a document to score high, the query term must occur frequently in the document but infrequently in the document set as a whole.


This table assumes that only one document in the set contains the query term.

# of Documents in Document Set   Occurrences of Term in Document Needed to Score 100
1                                34
5                                20
10                               17
50                               13
100                              12
500                              10
1,000                            9
10,000                           7
100,000                          5
1,000,000                        4


Summary of operators

• Binary operators...

& | = - ~ ,

• Built-in expansion...

? $ !

• Thesaurus...

BT, BTG, BTP, BTI, NT, NTG, NTP, NTI, PT, RT, SYN, TR, TRSYN, TT


Summary of operators

• Stored query expression...

SQE

• Grouping and escaping...

() {} \

• Special...

NEAR, WITHIN, ABOUT


Application Details - Customer Profile Analyzer

The http server (for user web page caching) is started. The Oracle web server is also started.


Log In Screen - Customer & User

The log in screen is used both by the customer and by the users.

The Oracle web server takes care of the secure connections, while for the http server the user id is common for the session: no user can invoke a document from the server without a user id.


Customer Interface – Http Server

The user uses the interface provided by the custom http server.


Main User Screen

The user can choose the type of data to be analyzed. Two types of data exist:

1. Newsgroups

2. User-browsed URLs


Selection of Category and Options

The user chooses a category and other options, such as:

• Generating theme

• Generating gist

• Generating marked-up text

• Date range


Results Page – Gist Generation

This page can be used for drilling down to the actual document, which opens up in the browser (generated by the filter option).

Theme and gist can also be generated from this screen.


Search Screen

The search screen has advanced options such as fuzzy search, about search etc.

A chain of expressions can be used along with conjunctions (such as ‘not’, ‘or’, ‘and’ etc.) for joining the statements.


Conclusion

New estimation methods are trying to find more meaning from text.

Industry has great text mining products and is constantly improving the technology.

Unstructured data mining still has a long way to go.
