Data Mining with Unstructured Data
A Study and Implementation of Industry Product(s)
Samrat Sen
04/09/23 UB - CS 711, Data Mining with Unstructured Data
Goals
• Issues in Text Mining with Unstructured Data
• Analysis of Data Mining products
• Study of a Real Life Classification Problem
• Strategy for solving the problem
Issues in Text Mining
Text mining is different from KDD and DM techniques in structured databases.
Problems with those techniques:
1. They are concerned with predefined fields
2. They are based on learning from an attribute-value database
   (e.g. the tables and induced rules that follow)
Potential Customer Table
Person   Age  Sex  Income  Customer
Ann S    32   F    10,000  yes
Jane G   53   F    20,000  no
Sri S    35   M    65,000  yes
Egor     25   M    10,000  yes

Married To Table
Husband Wife
Egor Ann S
Sri H Jane

Induced Rules
If Married(Person, Spouse) and Income(Person) >= 25,000 Then Potential-Customer(Spouse)
If Married(Person, Spouse) and Potential-Customer(Person) Then Potential-Customer(Spouse)
• Algorithmic techniques include association extraction from indexed data and prototypical document extraction from full text.
• Industry-standard data mining tools cannot be used directly; e.g. a usual process has to have a text transformer, a text analyzer and a summary generator.
• The input and output interfaces and the file formats may cost time and money.
• Exhaustive domains have to be set up for classification.
• Costs and benefits have to be weighed before model selection:
  1. Gain from a correct positive prediction
  2. Loss from an incorrect positive prediction (false positive)
  3. Benefit from a correct negative prediction
  4. Cost of an incorrect negative prediction (false negative)
  5. Cost of project time (a better product/algorithm may come up)
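The cost/benefit weighing above can be sketched as a small calculation. This is only an illustration: the outcome counts, the monetary values and the names `Outcomes` and `net_benefit` are hypothetical and do not come from any product discussed in these slides.

```python
from dataclasses import dataclass


@dataclass
class Outcomes:
    """Counts of each prediction outcome on a test set (hypothetical numbers)."""
    true_pos: int
    false_pos: int
    true_neg: int
    false_neg: int


def net_benefit(o, gain_tp, loss_fp, benefit_tn, cost_fn, project_cost=0.0):
    """Total gains minus total losses, minus the fixed cost of project time."""
    return (o.true_pos * gain_tp
            - o.false_pos * loss_fp
            + o.true_neg * benefit_tn
            - o.false_neg * cost_fn
            - project_cost)


# Two hypothetical candidate models evaluated on the same 1000 examples.
model_a = Outcomes(true_pos=80, false_pos=30, true_neg=870, false_neg=20)
model_b = Outcomes(true_pos=60, false_pos=5, true_neg=895, false_neg=40)

# Hypothetical per-outcome values (items 1-5 of the list above).
values = dict(gain_tp=50.0, loss_fp=10.0, benefit_tn=0.0,
              cost_fn=50.0, project_cost=500.0)

benefit_a = net_benefit(model_a, **values)
benefit_b = net_benefit(model_b, **values)
print(benefit_a, benefit_b)
```

With these made-up values the model with more false positives still comes out ahead, because missed positives are priced higher than false alarms. Changing the values can reverse the ranking, which is why the weighing has to happen before model selection.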
Data Mining Products/Tools
• DARWIN – from Oracle
• Intelligent Data Miner – from IBM
• Intermedia Text with Oracle Database, with context query feature (theme based document retrieval)
For more info:
http://www.oracle.com/ip/analyze/warehouse/datamining/
http://www-4.ibm.com/software/data/iminer/
• New specification being proposed by Sun for a Data Mining API *
• SQL Server 2000 – data mining and English query writing features
• Verity Knowledge Organizer
For more info:
* http://java.sun.com/aboutJava/communityprocess/jsr/jsr_073_dmapi.html#3
Additional text mining sites:
1. http://textmining.krdl.org.sg/resourves.html
2. www.intext.de/TEXTANAE.htm
3. www.cs.uku.fi/~kuikka/systems.html
DARWIN
Functions
1. Prediction (from known values)
2. Classification (into categories)
3. Forecasting (future predictions)
Approach
1. Plan
2. Prepare dataset
3. Build and use models
The problem is defined in terms of data fields and data records.
The fields are classified as follows:
- Categorical and ordered fields
- Predictive fields
- Target fields
• A DARWIN dataset file has to be created containing all the records in the problem domain (using a descriptor file).
DARWIN - Models
Tree model – Based on classification and regression tree algorithm
Net model – A feed forward multilayer neural network
Match Model – Memory based reasoning model, using a K-nearest neighbor algorithm
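As a rough illustration of the K-nearest-neighbor idea behind the Match model, here is a minimal sketch. It is a generic textbook version, not Darwin's implementation. The records (age, income in thousands), the labels and the optional per-field `weights` are all hypothetical.

```python
import math
from collections import Counter


def knn_predict(training, query, k=3, weights=None):
    """Classify `query` by majority vote among its k nearest training records.

    training: list of (feature_tuple, label) pairs.
    weights:  optional per-field weights applied inside the distance
              (hypothetical stand-in for tunable field weights).
    """
    if weights is None:
        weights = [1.0] * len(query)

    def distance(a, b):
        # Weighted Euclidean distance between two feature tuples.
        return math.sqrt(sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)))

    nearest = sorted(training, key=lambda rec: distance(rec[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical records: (age, income in thousands) -> is a customer?
training = [
    ((25, 10), "no"),
    ((32, 12), "no"),
    ((35, 65), "yes"),
    ((53, 70), "yes"),
    ((41, 58), "yes"),
]

prediction = knn_predict(training, (38, 60), k=3)
print(prediction)
```

The model is "memory based" in the sense that no explicit model is fitted: the training records themselves are kept and consulted at prediction time.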
DARWIN – Tree Model
Workflow (from the slide diagram):
1. Create Tree (input: training data)
2. Test/Evaluate Tree (information on error rates of pruned sub-trees)
3. Predict with Tree, using the selected sub-tree (input: I/P prediction dataset)
4. Analyze Results (on the merged I/P & O/P prediction dataset)
DARWIN – Net Model
Workflow (from the slide diagram):
1. Create Net (input: training dataset; output: neural network model)
2. Train Net (information on error rates of pruned sub-trees; output: trained neural network)
3. Predict (input: I/P prediction dataset)
4. Analyze Results (on the merged I/P & O/P prediction dataset)
DARWIN – Match Model
Workflow (from the slide diagram):
1. Create Match Model (input: training data)
2. Optimize match weights
3. Predict with Match (input: I/P prediction dataset)
4. Analyze Results (on the merged I/P & O/P prediction dataset)
DARWIN – Analyzing
Evaluate
Evaluates the performance of a given model on a given dataset, when working on known data for test or evaluation purposes.
Summarize Data
Provides a statistical summary of the values taken by the data in the specified fields of a dataset.
Frequency Count
Provides information on the frequency with which particular data values appear in a dataset.
Performance Matrix
Can be used to compare simple fields or simple functions of fields
Sensitivity
Provides a model showing the relative importance of attributes used in building a model
DARWIN – Code Generation
• Darwin can generate C, C++ or Java code for a Tree or Net model so that a prediction function can be called from an application program.
• Java code can also be generated to embed a model in a Web applet.
For more info:
http://technet.oracle.com/docs/products/datamining/doc_index.htm
http://technet.oracle.com/software/products/intermedia/software_index.html
http://www.oracle.com/ip/analyze/warehouse/datamining/
  1. Oracle Data Mining data sheet
  2. Oracle Data Mining Solutions
http://www.oracle.com/oramag/oracle/98-Jan/fast.html
  1. Managing Unstructured Data with Oracle8
http://technet.oracle.com/products/datamining/
  1. Product manuals
Oracle Personalization
Real-Time Recommendations
New offering available with Oracle9i
"Hello! We have recommendations for you."
Oracle – Intermedia Text
• A ranking technique called theme proving is used.
• Documents are grouped into categories and subcategories.
• Integrated with the Oracle 8 database.
• No training or tuning is required.
Lexical Knowledge Base
- 200,000 concepts from very broad domains
- 2000 major categories
- Concepts are mapped into one or more words/phrases in canonical form
- Each of these has alternate inflectional variations, acronyms and synonyms stored
- Total vocabulary of 450,000 terms
- Each entry has other parameters, such as part of speech
Theme Extraction
- Themes are assigned initial ranks based on the structure of the document and the frequency of the theme.
- All the ancestor themes are also included in the result.
- Theme proving is done before final ranking.
Queries
Direct match, phrase search ('contains'), case-sensitive query, misspellings and fuzzy match, inflections ('about'), compound queries, Boolean operators, natural language query
Oracle at TREC-8 (Eighth Text Retrieval Conference, http://otn.oracle.com/products/intermedia/htdocs/imt_trec8pap.htm)
Recall at 1000: 71.57% (3384/4728)
Average precision: 41.30%
Initial precision (at recall 0.0): 92.79%
Final precision (at recall 1.0): 7.91%
Intermedia Text-Model
Interface Options
Language Selection
Java for robot
PL/SQL for data retrieval
Code Execution
Overview of the System
[Diagram components: customer browser; client browser; web server; server process with tag stripper, listening at port 80; JDBC; Oracle 8i with Intermedia Text]
Intermedia Text
Steps for building an application:
1. Load the documents
2. Index the documents
3. Issue queries
4. Present the documents that satisfy the query
Loading Methods
– Insert statements
– SQL*Loader
– Ctxsrv: a server daemon process which builds the index at regular intervals
– Ctxload utility, used for:
   • Thesaurus import/export
   • Text loading
   • Document updating/exporting
Create and Populate a Simple Table
CREATE TABLE quick (
quick_id NUMBER CONSTRAINT
quick_pk PRIMARY KEY,
text VARCHAR2(80) );
INSERT INTO quick
VALUES ( 1, 'The cat sat on the mat' );
INSERT INTO quick
VALUES ( 2, 'The fox jumped over the dog' );
INSERT INTO quick
VALUES ( 3, 'The dog barked like a dog' );
COMMIT;
Run a Text Query
SELECT text FROM quick
WHERE CONTAINS ( text,
'sat on the mat' ) > 0;
DRG-10599: column is not indexed
You must have a Text index on a column before you can do a "contains" query on it.
Create the Text Index
CREATE INDEX quick_text
on quick ( text )
INDEXTYPE IS CTXSYS.CONTEXT;
• CTXSYS is the system user for interMedia Text.
• The INDEXTYPE keyword is a feature of the Extensible Indexing Framework.
Run a Text Query
SELECT text FROM quick
WHERE CONTAINS ( text,
'sat on the mat' ) > 0;
TEXT
-----------------------
The cat sat on the mat
• You should regard the CONTAINS function as boolean in meaning.
• It is implemented as a number since SQL does not have a boolean datatype.
• The only sensible way to use it is with > 0.
Run a Text Query
SELECT SCORE(42) s, text FROM quick
WHERE CONTAINS ( text, 'dog', 42 )
>= 0 /* just for teaching purposes! */
ORDER BY s DESC;
S TEXT
-- ---------------------------
7 The dog barked like a dog
4 The fox jumped over the dog
• The better the match, the higher the score.
• The value can be used in ORDER BY but has no absolute significance.
• The score is zero when the query is not matched.
Intermedia Text - Indexing Pipeline
[Diagram: Datastore -> Filter -> Sectioner -> Lexer -> Engine -> Database. Labels on the arrows: column data, doc data, filtered doc text, plain text, section offsets, tokens, index data.]
• The first step is creating an index.
• Datastore: reads the data out of the table (for a URL datastore, performs a 'GET').
• Filter: transforms the data into some text type; this is needed because some of the formats may be binary, as when storing doc, pdf or HTML types.
• Sectioner: converts to plain text, removing tags and invisible info.
• Lexer: splits the text into discrete tokens.
• Engine: takes the tokens from the lexer, the offsets from the sectioner and a stoplist to build an index.
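The sectioner, lexer and engine stages can be mimicked in a few lines. This is a toy sketch, not how interMedia Text is implemented: the function names, the regular expressions and the stoplist contents are all assumptions made for illustration.

```python
import re

# Hypothetical stoplist; a real stoplist is much larger and configurable.
STOPLIST = {"the", "a", "on", "of"}


def section(html):
    """Sectioner stand-in: replace tags with spaces, keeping visible text."""
    return re.sub(r"<[^>]+>", " ", html)


def lex(text):
    """Lexer stand-in: split into lowercase word tokens with 1-based positions."""
    words = re.findall(r"[A-Za-z0-9]+", text)
    return [(pos, word.lower()) for pos, word in enumerate(words, start=1)]


def index_tokens(doc_id, tokens, index):
    """Engine stand-in: record (doc_id, position) for every non-stoplist token."""
    for pos, tok in tokens:
        if tok not in STOPLIST:
            index.setdefault(tok, []).append((doc_id, pos))


index = {}
index_tokens(1, lex(section("<p>The <b>cat</b> sat on the mat</p>")), index)
print(index)
```

In this sketch stoplist words are left out of the index but still consume a position, so the remaining tokens keep their original word offsets.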
Example of index creation
Statements:
• INSERT INTO docs VALUES (1, 'first document');
• INSERT INTO docs VALUES (2, 'second document');
Produces an index:
DOCUMENT  doc 1 position 2, doc 2 position 2
FIRST     doc 1 position 1
SECOND    doc 2 position 1
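The index in this example is a positional inverted index: each token maps to the documents and word positions where it occurs. A minimal sketch using the same two documents is given below. The helper names `build_index` and `phrase_docs` are hypothetical, and the phrase lookup is only one simple way to use the stored positions.

```python
from collections import defaultdict


def build_index(docs):
    """Map each upper-cased token to a list of (doc_id, position) pairs."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for position, token in enumerate(text.upper().split(), start=1):
            index[token].append((doc_id, position))
    return dict(index)


def phrase_docs(index, phrase):
    """Return ids of documents where the words of `phrase` occur consecutively."""
    words = phrase.upper().split()
    hits = set()
    for doc_id, pos in index.get(words[0], []):
        if all((doc_id, pos + i) in index.get(w, []) for i, w in enumerate(words)):
            hits.add(doc_id)
    return sorted(hits)


docs = {1: "first document", 2: "second document"}
index = build_index(docs)
for token in sorted(index):
    print(token, index[token])
```

Because positions are stored, a phrase such as "second document" can be answered from the index alone, without rereading the documents.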
Testing procedure
Document set from newsgroups
• 122 documents from a text mining site
• Loaded using insert statements
• File datastore used
Documents (HTML) from browsing
• 20 documents
• Loaded from server process
• URL datastore used
Newsgroup Results
1. Religion, Atheism – 15
   on bible, islam, religious beliefs
2. Comp-os-ms-windows-misc – 17
   about operating systems, protocols, installation
3. Comp.graphics – 27
   on hardware and software for computer graphics
4. Ice Hockey – 18
5. Computer hardware – 12
   on installation of different peripheral devices
6. Mideast.politics – 14
   on political development in the Mideast
7. Science.space – 19
   on various space programs, devices, theories
Newsgroup Results
Group                        Retrieved  Wrong  Not Retrieved  Recall  Precision
Science and technology       120        16     1              99%     78%
Computer Hardware Industry   12         0      5              71%     100%
Government                   103        26     8              90%     74%
Politics                     17         3      0              100%    82%
Military                     5          1      0              80%     80%
Social Environment           48         2      14             77%     96%
Religion                     22         3      2              90%     86%
Islam                        4          0      0              100%    100%
Leisure recreation           22         4      5              78%     82%
Sports                       21         1      0              90%     90%
Hockey                       18         0      0              100%    100%
Recall = (# of correct positive predictions) / (# of positive examples)
Precision = (# of correct positive predictions) / (# of positive predictions)
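The two definitions can be checked against one row of the results table. The sketch below assumes that "Retrieved" counts every document returned for a group, that "Wrong" is the subset of those that did not belong, and that "Not Retrieved" counts relevant documents that were missed. This reading is an interpretation of the column names, not something the slides state.

```python
def recall_precision(retrieved, wrong, not_retrieved):
    """Compute (recall, precision) from the three counts in a results row."""
    correct = retrieved - wrong                      # correct positive predictions
    recall = correct / (correct + not_retrieved)     # divided by positive examples
    precision = correct / retrieved                  # divided by positive predictions
    return recall, precision


# Counts taken from the "Government" row: 103 retrieved, 26 wrong, 8 not retrieved.
recall, precision = recall_precision(retrieved=103, wrong=26, not_retrieved=8)
print(f"recall={recall:.1%} precision={precision:.1%}")
```

Under this reading the Government row works out to roughly 90.6% recall and 74.8% precision, close to the 90% and 74% reported in the table.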
Query
Syntax: Binary Operators
Operator  Symbol  Example
AND       &       cat & dog
OR        |       cat | dog
EQUIV     =       cat = dog
MINUS     -       cat - dog
NOT       ~       cat ~ dog
ACCUM     ,       cat , dog
Semantics: Binary Operators
• The semantics of all the binary operators are defined in terms of SCORE.
• However, the score for even the simplest query expression, a single word, is calculated by a subtle rule:
  – the score is higher for a document where the query word occurs more frequently than for one where it occurs less frequently
  – but when "word1" occurs N times in document D, its score is lower than when "word2" occurs N times in document D if "word1" occurs more often in the whole document set than "word2"
The Salton Algorithm
• interMedia Text uses an algorithm similar to the Salton algorithm, which is widely used in text retrieval products.
• The score for a word is proportional to f (1 + log (N/n)), where
  – f is the frequency of the search term in the document
  – N is the total number of documents
  – n is the number of documents which contain the search term
• The score is converted into an integer in the range 0 - 100.
Assumption
Inverse frequency scoring assumes that frequently occurring terms in a document set are noise terms, and so these terms are scored lower. For a document to score high, the query term must occur frequently in the document but infrequently in the document set as a whole.
This table assumes that only one document in the set contains the query term.
# of Documents in Document Set   Occurrences of Term in Document Needed to Score 100
1                                34
5                                20
10                               17
50                               13
100                              12
500                              10
1,000                            9
10,000                           7
100,000                          5
1,000,000                        4
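The scoring rule can be tried out numerically. The slides only say the score is proportional to f (1 + log (N/n)), so the sketch below makes three assumptions of its own: a proportionality constant of 3, a base-10 logarithm, and rounding to the nearest integer with a cap at 100. The function names are hypothetical.

```python
import math


def salton_score(f, N, n, scale=3.0):
    """Score for a term occurring f times in a document.

    N is the total number of documents, n the number containing the term.
    The scale of 3, base-10 log and rounding are assumptions.
    """
    raw = scale * f * (1 + math.log10(N / n))
    return min(100, round(raw))


def occurrences_for_full_score(N, n=1, scale=3.0):
    """Smallest f for which the unrounded score reaches 100."""
    per_occurrence = scale * (1 + math.log10(N / n))
    return math.ceil(100 / per_occurrence)


# The 'dog' query on the three-row table: N = 3 documents, n = 2 contain 'dog'.
print(salton_score(2, 3, 2), salton_score(1, 3, 2))

# Occurrences needed to score 100 when only one document contains the term.
for N in (1, 5, 10, 50, 100, 500, 1_000, 10_000):
    print(N, occurrences_for_full_score(N))
```

Under these assumptions the sketch gives 7 and 4 for the 'dog' query, as in the SCORE output shown earlier, and reproduces the table rows from 1 to 10,000 documents. For 100,000 and 1,000,000 documents it yields 6 and 5 rather than the tabulated 5 and 4, so the assumed constants should not be taken as the product's exact rule.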
Summary of operators
• Binary operators: & | = - ~ ,
• Built-in expansion: ? $ !
• Thesaurus: BT, BTG, BTP, BTI, NT, NTG, NTP, NTI, PT, RT, SYN, TR, TRSYN, TT
• Stored query expression: SQE
• Grouping and escaping: () {} \
• Special: NEAR, WITHIN, ABOUT
Application Details - Customer Profile Analyzer
The HTTP server (for user web page caching) is started. The Oracle web server is also started.
Log In Screen - Customer & User
The log-in screen is used by both the customer and the users.
The Oracle web server takes care of the secure connections, while for the HTTP server the user id is common for the session: no user can invoke a document from the server without a user id.
Customer Interface – Http Server
The user uses the interface provided by the custom HTTP server.
Main User Screen
The user can choose the type of data to be analyzed. Two types of data exist:
1. Newsgroups
2. User-browsed URLs
Selection of Category and Options
The user chooses a category and other options, such as:
• Generating theme
• Generating gist
• Generating marked-up text
• Date range
Results Page – Gist Generation
This page can be used for drilling down to the actual document, which opens in the browser (generated by the filter option). Theme and gist can also be generated from this screen.
Search Screen
The search screen has advanced options such as fuzzy search and about search. A chain of expressions can be used, with operators (such as 'not', 'or' and 'and') joining the statements.
Conclusion
New estimation methods are trying to find more meaning in text.
Industry has capable text mining products and is constantly improving the technology.
Unstructured data mining still has a long way to go.