Date post: 30-Nov-2014
Data Mining with Unstructured Data

A Study and Implementation of Industry Product(s)

Samrat Sen

Page 2: Samrat.ppt

04/09/23 UB - CS 711, Data Mining with Unstructured Data

Goals

• Issues in text mining with unstructured data
• Analysis of data mining products
• Study of a real-life classification problem
• Strategy for solving the problem

Page 3: Samrat.ppt

Issues in Text Mining

Text mining differs from KDD and DM techniques for structured databases.

Problems with those techniques:

1. They are concerned with predefined fields
2. They are based on learning from an attribute-value database

(An example follows on the next slide.)

Page 4: Samrat.ppt

Issues in Text Mining

Potential Customer Table

Person   Age   Sex   Income   Customer
Ann S    32    F     10,000   yes
Jane G   53    F     20,000   no
Sri S    35    M     65,000   yes
Egor     25    M     10,000   yes

Married To Table

Husband   Wife
Egor      Ann S
Sri H     Jane

Induced Rules

If Married(Person, Spouse) and Income(Person) >= 25,000
Then Potential-Customer(Spouse)

If Married(Person, Spouse) and Potential-Customer(Person)
Then Potential-Customer(Spouse)

Page 5: Samrat.ppt

Issues in Text Mining

• Text mining relies on algorithmic techniques such as association extraction from indexed data and prototypical document extraction from full text.
• Industry-standard data mining tools cannot be used directly; e.g. a usual process needs a text transformer, a text analyzer, and a summary generator.

Page 6: Samrat.ppt

Issues in Text Mining

• The input and output interfaces and the file formats may cost time and money.
• Exhaustive domains have to be set up for classification.
• Costs and benefits have to be weighed before model selection:

1. Gain from a positive prediction
2. Loss from an incorrect positive prediction (false positive)
3. Benefit from a correct negative prediction
4. Cost of an incorrect negative prediction (false negative)
5. Cost of project time (a better product/algorithm may come up)
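The cost/benefit weighing listed above can be sketched as a simple net-benefit calculation. This is only an illustration: the payoff values, the confusion counts for the two candidate models, and the names `Payoffs` and `net_benefit` are all made up for the example and do not come from the slides.

```python
from dataclasses import dataclass


@dataclass
class Payoffs:
    """Hypothetical per-prediction payoffs (same currency unit for all)."""
    gain_tp: float     # 1. gain from a correct positive prediction
    loss_fp: float     # 2. loss from a false positive
    benefit_tn: float  # 3. benefit from a correct negative prediction
    cost_fn: float     # 4. cost of a false negative


def net_benefit(tp, fp, tn, fn, payoffs, project_cost=0.0):
    """Total benefit of a model given its confusion counts, minus project cost (item 5)."""
    return (tp * payoffs.gain_tp
            - fp * payoffs.loss_fp
            + tn * payoffs.benefit_tn
            - fn * payoffs.cost_fn
            - project_cost)


# Made-up payoffs and confusion counts for two candidate models.
payoffs = Payoffs(gain_tp=10.0, loss_fp=2.0, benefit_tn=0.5, cost_fn=5.0)
model_a = net_benefit(tp=80, fp=30, tn=870, fn=20, payoffs=payoffs, project_cost=100.0)
model_b = net_benefit(tp=90, fp=120, tn=780, fn=10, payoffs=payoffs, project_cost=100.0)

print("model A:", model_a)
print("model B:", model_b)
```

With these invented numbers, the model that finds more positives is not the better choice, because its extra false positives cost more than the additional hits gain.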

Page 7: Samrat.ppt

Data Mining Products/Tools

• DARWIN – from Oracle
• Intelligent Data Miner – from IBM
• Intermedia Text with Oracle Database, with context query feature (theme-based document retrieval)

FOR MORE INFO...

http://www.oracle.com/ip/analyze/warehouse/datamining/
http://www-4.ibm.com/software/data/iminer/

Page 8: Samrat.ppt

Data Mining Products/Tools

• New specification being proposed by Sun for a Data Mining API *
• SQL Server 2000 – data mining and English query writing features
• Verity Knowledge Organizer

FOR MORE INFO...

* http://java.sun.com/aboutJava/communityprocess/jsr/jsr_073_dmapi.html#3

Additional text mining sites:

1. http://textmining.krdl.org.sg/resourves.html
2. www.intext.de/TEXTANAE.htm
3. www.cs.uku.fi/~kuikka/systems.html

Page 9: Samrat.ppt

DARWIN

Functions
1. Prediction (from known values)
2. Classification (into categories)
3. Forecasting (future predictions)

Approach
1. Plan
2. Prepare dataset
3. Build and use models

Page 10: Samrat.ppt

DARWIN

• The problem is defined in terms of data fields and data records.
• The fields are classified as follows:
- Categorical and ordered fields
- Predictive fields
- Target fields
• A DARWIN dataset file has to be created containing all the records in the problem domain (using a descriptor file).

Page 11: Samrat.ppt

DARWIN – Models

• Tree model – based on the classification and regression tree algorithm
• Net model – a feed-forward multilayer neural network
• Match model – a memory-based reasoning model, using a k-nearest-neighbor algorithm
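To make the memory-based reasoning idea behind the Match model concrete, here is a minimal k-nearest-neighbor sketch. It is not DARWIN's implementation: the function `knn_predict`, the unweighted Euclidean distance, and the training rows (age, income in thousands) are assumptions chosen only for illustration.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training rows.

    train: list of (features, label) pairs; features are numeric tuples.
    """
    by_distance = sorted(train, key=lambda row: math.dist(row[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical training data: (age, income in thousands) -> is a customer?
training = [
    ((25, 10), "no"),
    ((32, 12), "no"),
    ((35, 65), "yes"),
    ((53, 60), "yes"),
    ((40, 70), "yes"),
]

prediction = knn_predict(training, (38, 62), k=3)
print(prediction)
```

In a real tool the features would normally be scaled or weighted before distances are compared, so that a wide-ranging field such as income does not dominate the vote.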

Page 12: Samrat.ppt

DARWIN – Tree Model

Flow (from the slide diagram):

1. Create Tree (input: training data)
2. Test/Evaluate Tree (information on error rates of pruned sub-trees)
3. Predict with Tree, using the selected sub-tree (input: I/P prediction dataset; output: merged I/P & O/P prediction dataset)
4. Analyze Results

Page 13: Samrat.ppt

DARWIN – Net Model

Flow (from the slide diagram):

1. Create Net (input: training dataset; output: neural network model)
2. Train Net (information on error rates of pruned sub-trees; output: trained neural network)
3. Prediction Dataset (input: I/P prediction dataset; output: merged I/P & O/P prediction dataset)
4. Analyze Results

Page 14: Samrat.ppt

DARWIN – Match Model

Flow (from the slide diagram):

1. Create Match Model (input: training data)
2. Optimize match weights
3. Predict with Match (input: I/P prediction dataset; output: merged I/P & O/P prediction dataset)
4. Analyze Results

Page 15: Samrat.ppt

DARWIN – Analyzing

• Evaluate: evaluates the performance of a given model on a given dataset, when working on known data for test or evaluation purposes.
• Summarize Data: provides a statistical summary of the values taken by the data in the specified fields of a dataset.
• Frequency Count: provides information on the frequency with which particular data values appear in a dataset.

Page 16: Samrat.ppt

DARWIN – Analyzing

• Performance Matrix: can be used to compare simple fields or simple functions of fields.
• Sensitivity: provides a model showing the relative importance of the attributes used in building a model.

Page 17: Samrat.ppt

DARWIN – Code Generation

• Darwin can generate C, C++, or Java code for a Tree or Net model so that a prediction function can be called from an application program.
• Java code can also be generated to embed a model in a Web applet.

FOR MORE INFO...

http://technet.oracle.com/docs/products/datamining/doc_index.htm

Page 18: Samrat.ppt

DARWIN

For more info:

http://technet.oracle.com/software/products/intermedia/software_index.html

http://www.oracle.com/ip/analyze/warehouse/datamining/
1. Oracle Data Mining data sheet
2. Oracle Data Mining Solutions

http://www.oracle.com/oramag/oracle/98-Jan/fast.html
1. Managing Unstructured Data with Oracle8

http://technet.oracle.com/products/datamining/
1. Product manuals

Page 19: Samrat.ppt

DARWIN

Oracle Personalization

Real-Time Recommendations

New Offering Available with Oracle9i

"Hello! We have recommendations for you."

Page 20: Samrat.ppt

Oracle – Intermedia Text

• A ranking technique called theme proving is used.
• Documents are grouped into categories and subcategories.
• Integrated with the Oracle 8 database.
• No training or tuning is required.

Page 21: Samrat.ppt

Oracle – Intermedia Text

Lexical Knowledge Base
- 200,000 concepts from very broad domains
- 2,000 major categories
- Concepts are mapped into one or more words/phrases in canonical form
- Each of these has alternate inflectional variations, acronyms, and synonyms stored
- Total vocabulary of 450,000 terms
- Each entry has other parameters, such as part of speech

Page 22: Samrat.ppt

Oracle – Intermedia Text

Theme Extraction
- Themes are assigned initial ranks based on the structure of the document and the frequency of the theme.
- All the ancestor themes are also included in the result.
- Theme proving is done before final ranking.

Queries
Direct match, phrase search ('contains'), case-sensitive query, misspellings and fuzzy match, inflections ('about'), compound queries, Boolean operators, natural language query

Page 23: Samrat.ppt

Oracle – Intermedia Text

Oracle at TREC 8 (Eighth Text Retrieval Conference, http://otn.oracle.com/products/intermedia/htdocs/imt_trec8pap.htm)

Recall at 1000                        71.57% (3384/4728)
Average precision                     41.30%
Initial precision (at recall 0.0)     92.79%
Final precision (at recall 1.0)       07.91%

Page 24: Samrat.ppt

Intermedia Text-Model

Page 25: Samrat.ppt

Interface Options

Page 26: Samrat.ppt

Language Selection

• Java for the robot
• PL/SQL for data retrieval

Page 27: Samrat.ppt

Code Execution

Page 28: Samrat.ppt

Overview of the System

Diagram components: customer browser; client browser; web server; server process (tag stripper, listening at port 80); JDBC; Oracle 8i with Intermedia Text.

Page 29: Samrat.ppt

Intermedia Text

Steps for building an application:

1. Load the documents
2. Index the documents
3. Issue queries
4. Present the documents that satisfy the query

Page 30: Samrat.ppt

Loading Methods

– Insert statements
– SQL*Loader
– Ctxsrv – a server daemon process which builds the index at regular intervals
– Ctxload utility, used for:
    thesaurus import/export
    text loading
    document updating/exporting

Page 31: Samrat.ppt

Create and Populate a Simple Table

CREATE TABLE quick (
  quick_id NUMBER CONSTRAINT quick_pk PRIMARY KEY,
  text     VARCHAR2(80)
);

INSERT INTO quick VALUES ( 1, 'The cat sat on the mat' );
INSERT INTO quick VALUES ( 2, 'The fox jumped over the dog' );
INSERT INTO quick VALUES ( 3, 'The dog barked like a dog' );

COMMIT;

Page 32: Samrat.ppt

Run a Text Query

SELECT text FROM quick
 WHERE CONTAINS ( text, 'sat on the mat' ) > 0;

DRG-10599: column is not indexed

You must have a Text index on a column before you can do a "contains" query on it.

Page 33: Samrat.ppt

Create the Text Index

CREATE INDEX quick_text
    ON quick ( text )
    INDEXTYPE IS CTXSYS.CONTEXT;

• CTXSYS is the system user for interMedia Text
• The INDEXTYPE keyword is a feature of the Extensible Indexing Framework

Page 34: Samrat.ppt

Run a Text Query

SELECT text FROM quick
 WHERE CONTAINS ( text, 'sat on the mat' ) > 0;

TEXT
-----------------------
The cat sat on the mat

• You should regard the CONTAINS function as boolean in meaning
• It is implemented as a number since SQL does not have a boolean datatype
• The only sensible way to use it is with > 0

Page 35: Samrat.ppt

Run a Text Query

SELECT SCORE(42) s, text FROM quick
 WHERE CONTAINS ( text, 'dog', 42 ) >= 0 /* just for teaching purposes! */
 ORDER BY s;

 S TEXT
-- ---------------------------
 7 The dog barked like a dog
 4 The fox jumped over the dog

• The better the match, the higher the score
• The value can be used in ORDER BY but has no absolute significance
• The score is zero when the query is not matched

Page 36: Samrat.ppt

Intermedia Text - Indexing Pipeline

Pipeline (from the slide diagram): Database (column data) → Datastore (doc data) → Filter (filtered doc text) → Sectioner (plain text, section offsets) → Lexer (tokens) → Engine (index data)

• The first step is creating an index.
• Datastore: reads the data out of the table (a URL datastore performs a 'GET').

Page 37: Samrat.ppt

Intermedia Text - Indexing Pipeline

• Filter: transforms the data into a text format. This is needed because documents may be stored as DOC, PDF, or HTML, and some of these formats are binary.
• Sectioner: converts to plain text, removing tags and invisible information.
• Lexer: splits the text into discrete tokens.
• Engine: takes the tokens from the lexer, the offsets from the sectioner, and a stoplist to build an index.
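The sectioner, lexer, and stoplist stages described above can be imitated in a few lines. This is a rough sketch, not how interMedia Text is implemented: the regular expressions, the tiny `STOPLIST`, and the function names are assumptions made for the example.

```python
import re

# Hypothetical stoplist; a real one would be much longer.
STOPLIST = {"the", "a", "on", "of"}


def strip_tags(html):
    """Sectioner stand-in: replace markup tags with spaces, leaving plain text."""
    return re.sub(r"<[^>]+>", " ", html)


def lex(text):
    """Lexer stand-in: split text into lowercase tokens with 1-based positions."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return list(enumerate(words, start=1))


def index_terms(html):
    """Return (position, token) pairs that would be handed to the index engine."""
    return [(pos, tok) for pos, tok in lex(strip_tags(html)) if tok not in STOPLIST]


tokens = index_terms("<p>The cat sat on the <b>mat</b></p>")
print(tokens)
```

Positions are assigned before stopwords are dropped, so the surviving tokens keep their original word offsets.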

Page 38: Samrat.ppt

Intermedia Text - Indexing Pipeline

Example of index creation

Statements:

INSERT INTO docs VALUES (1, 'first document');
INSERT INTO docs VALUES (2, 'second document');

Produces an index:

DOCUMENT   doc 1 position 2, doc 2 position 2
FIRST      doc 1 position 1
SECOND     doc 2 position 1
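The index shown above is a positional inverted index: each token maps to the documents and word positions where it occurs. A minimal sketch that reproduces it for the two example rows follows; the function name `build_index` and the whitespace tokenization are assumptions for illustration, not the product's internals.

```python
from collections import defaultdict


def build_index(docs):
    """Map each uppercased token to a list of (doc_id, position) pairs.

    docs: dict of doc_id -> text; positions are 1-based word offsets.
    """
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for position, word in enumerate(text.split(), start=1):
            index[word.upper()].append((doc_id, position))
    return dict(index)


index = build_index({1: "first document", 2: "second document"})
for token in sorted(index):
    print(token, index[token])
```

Keeping positions, and not only document ids, is what lets an engine answer phrase and proximity queries from the index alone.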

Page 39: Samrat.ppt

Testing Procedure

Document set from newsgroups
• 122 documents from a text mining site
• Loaded using insert statements
• File datastore used

Documents (HTML) from browsing
• 20 documents
• Loaded from the server process
• URL datastore used

Page 40: Samrat.ppt

Newsgroup Results

1. Religion, Atheism – 15
   on bible, islam, religious beliefs
2. Comp-os-ms-windows-misc – 17
   about operating systems, protocols, installation
3. Comp.graphics – 27
   on hardware and software for computer graphics
4. Ice Hockey – 18
5. Computer hardware – 12
   on installation of different peripheral devices
6. Mideast.politics – 14
   on political development in the Mideast
7. Science.space – 19
   on various space programs, devices, theories

Page 41: Samrat.ppt

Newsgroup Results

Group                        Retrieved   Wrong   Not Retrieved   Recall   Precision
Science and technology       120         16      1               99%      78%
Computer Hardware Industry   12          0       5               71%      100%
Government                   103         26      8               90%      74%

Page 42: Samrat.ppt

Newsgroup Results

Group                        Retrieved   Wrong   Not Retrieved   Recall   Precision
Politics                     17          3       0               100%     82%
Military                     5           1       0               80%      80%
Social Environment           48          2       14              77%      96%
Religion                     22          3       2               90%      86%
Islam                        4           0       0               100%     100%
Leisure recreation           22          4       5               78%      82%

Page 43: Samrat.ppt

Newsgroup Results

Group                        Retrieved   Wrong   Not Retrieved   Recall   Precision
Sports                       21          1       0               90%      90%
Hockey                       18          0       0               100%     100%

Recall = (# of correct positive predictions) / (# of positive examples)

Precision = (# of correct positive predictions) / (# of positive predictions)
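The two definitions can be checked against a row of the results table. The sketch below assumes that "Retrieved" counts every positive prediction (including the wrong ones) and that "Not Retrieved" counts the positive examples that were missed; that reading of the column names, and the function name `recall_precision`, are assumptions.

```python
def recall_precision(retrieved, wrong, not_retrieved):
    """Compute (recall, precision) from the three counts in the results table.

    Assumes retrieved > 0 and at least one positive example exists.
    """
    correct = retrieved - wrong                    # correct positive predictions
    recall = correct / (correct + not_retrieved)   # / number of positive examples
    precision = correct / retrieved                # / number of positive predictions
    return recall, precision


# Government row: retrieved 103, wrong 26, not retrieved 8.
recall, precision = recall_precision(retrieved=103, wrong=26, not_retrieved=8)
print(f"recall={recall:.3f} precision={precision:.3f}")
```

Under this reading the Government row works out to 77/85 recall and 77/103 precision, which agrees with the 90% and 74% shown once the decimals are truncated.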

Page 44: Samrat.ppt

Query

Syntax: Binary Operators

Operator   Symbol   Example
AND        &        cat & dog
OR         |        cat | dog
EQUIV      =        cat = dog
MINUS      -        cat - dog
NOT        ~        cat ~ dog
ACCUM      ,        cat , dog

Page 45: Samrat.ppt

Semantics: Binary Operators

• The semantics of all the binary operators are defined in terms of SCORE.
• However, the score for even the simplest query expression, a single word, is calculated by a subtle rule:
– the score is higher for a document where the query word occurs more frequently than for one where it occurs less frequently
– but when "word1" occurs N times in document D, its score is lower than when "word2" occurs N times in document D if "word1" occurs more often in the whole document set than "word2"

Page 46: Samrat.ppt

The Salton Algorithm

• interMedia Text uses an algorithm similar to the Salton algorithm, which is widely used in text retrieval products.
• The score for a word is proportional to f (1 + log(N/n)), where
– f is the frequency of the search term in the document
– N is the total number of documents
– n is the number of documents which contain the search term
• The score is converted into an integer in the range 0 - 100.
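The weighting term can be tried out numerically. The sketch below computes only the quantity the score is said to be proportional to; the base-10 logarithm and the function name `salton_weight` are assumptions, and the conversion to a 0 - 100 integer is left out because the slide does not specify it.

```python
import math


def salton_weight(f, N, n):
    """Return f * (1 + log(N / n)) for one search term in one document.

    f: occurrences of the term in the document
    N: total number of documents
    n: number of documents containing the term
    The log base (10 here) is an assumption.
    """
    return f * (1 + math.log10(N / n))


# Same in-document frequency, different rarity across a 1000-document set.
rare = salton_weight(f=3, N=1000, n=10)      # term appears in 10 documents
common = salton_weight(f=3, N=1000, n=1000)  # term appears in every document

print(rare, common)
```

A term that appears in every document contributes only its raw frequency, while the same frequency of a rarer term is weighted up, which is the inverse-frequency behaviour the slides describe.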

Page 47: Samrat.ppt

The Salton Algorithm

Assumption

Inverse frequency scoring assumes that frequently occurring terms in a document set are noise terms, and so these terms are scored lower. For a document to score high, the query term must occur frequently in the document but infrequently in the document set as a whole.

Page 48: Samrat.ppt

The Salton Algorithm

This table assumes that only one document in the set contains the query term.

# of Documents in Document Set   Occurrences of Term in Document Needed to Score 100
1                                34
5                                20
10                               17
50                               13
100                              12
500                              10
1,000                            9
10,000                           7
100,000                          5
1,000,000                        4

Page 49: Samrat.ppt

Summary of operators

• Binary operators: & | = - ~ ,
• Built-in expansion: ? $ !
• Thesaurus: BT, BTG, BTP, BTI, NT, NTG, NTP, NTI, PT, RT, SYN, TR, TRSYN, TT

Page 50: Samrat.ppt

Summary of operators

• Stored query expression: SQE
• Grouping and escaping: () {} \
• Special: NEAR, WITHIN, ABOUT

Page 51: Samrat.ppt

Application Details - Customer Profile Analyzer

The HTTP server (for user web page caching) is started. The Oracle web server is also started.

Page 52: Samrat.ppt

Log In Screen - Customer & User

The log-in screen is used both by the customer and by the users.

The Oracle web server takes care of the secure connections, while for the HTTP server the user id is common for the session: no user can invoke a document from the server without a user id.

Page 53: Samrat.ppt

Customer Interface – HTTP Server

The user uses the interface provided by the custom HTTP server.

Page 54: Samrat.ppt

Main User Screen

The user can choose the type of data to be analyzed. Two types of data exist:

1. Newsgroups
2. User-browsed URLs

Page 55: Samrat.ppt

Selection of Category and Options

The user chooses a category and other options, such as:

• Generating theme
• Generating gist
• Generating marked-up text
• Date range

Page 56: Samrat.ppt

Results Page – Gist Generation

• This page can be used for drilling down to the actual document, which opens in the browser (generated by the filter option).
• Theme and gist can be generated from this screen.

Page 57: Samrat.ppt

Search Screen

• The search screen has advanced options such as fuzzy search and 'about' search.
• A chain of expressions can be joined with conjunctions (such as 'not', 'or', 'and').

Page 58: Samrat.ppt

Conclusion

• New estimation methods are trying to find more meaning in text.
• Industry has capable text mining products and is constantly improving the technology.
• Unstructured data mining still has a long way to go.

