:: DIAsDEM ::
Seminar: Web Mining
WS 2003/2004
Ingo Kampe
Heiko Scharff
Content
Introduction and data mining context
DIAsDEM - how it works
New extensions
Introduction
:: problems ::
Introduction
known: data in databases (DB2, Oracle, ...) is straightforward to analyse, for example with SQL, custom programs or data mining tools
but in enterprises: 80% of data is held in text documents (MS Word, plain text files, text archives, ...)
the knowledge is there, but "useless" for automated analysis
Introduction
example (same meaning, different structure):
Mr. Schröder earns EUR 20.000 per month.
Mister Schröder earns 20000,- €/month.
What do the sentences mean?
How can they be compared? How can they be analysed?
Do they mean the same?
Introduction
:: data mining context ::
Introduction
the knowledge must be made analysable
desirable:
– semantically structured knowledge
– queryable knowledge
possible solution: XML
– semantic tagging
– analysable (XPath, XQuery, Tamino, ...)
Introduction
for humans:
Mr. Schröder earns EUR 20.000 per month.
=
Mister Schröder earns 20000,- €/month.
"useless" for computational analysis; the only useful information:
– Mister Schröder
– 20000 Euro
– month
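A minimal sketch of how the two sentences could be reduced to the same three pieces of information. The regular expressions and the function name extract are assumptions made for this example only (German number format with "." as thousands separator, amounts in EUR); this is not the DIAsDEM method.

```python
import re

# Hypothetical patterns, tuned only to the two example sentences.
PERSON = re.compile(r"\b(?:Mr\.|Mister)\s+(\w+)")
AMOUNT = re.compile(r"\d[\d.]*")          # e.g. "20.000" or "20000"
PERIOD = re.compile(r"(?:per\s+|/)(\w+)")  # e.g. "per month" or "/month"


def extract(sentence):
    """Return person, amount in EUR and period found in one sentence."""
    person = PERSON.search(sentence)
    amount = AMOUNT.search(sentence)
    period = PERIOD.search(sentence)
    return {
        "person": person.group(1) if person else None,
        # Assumption: "." is a thousands separator, so it is dropped.
        "amount_eur": int(amount.group(0).replace(".", "")) if amount else None,
        "period": period.group(1) if period else None,
    }


first = extract("Mr. Schröder earns EUR 20.000 per month.")
second = extract("Mister Schröder earns 20000,- €/month.")
same_meaning = first == second
```

Hand-written patterns like these only cover the surface forms they were written for, which is why a more general approach is needed.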
Introduction
need to
– "find" important information
– mark important information
<person>Mr. Schröder</person>
<capital amount="20000 EUR">earns EUR 20.000</capital>
<period>per month</period>.
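Once a sentence is tagged like this, it can be queried with standard XML tools. A small sketch using Python's xml.etree.ElementTree, which supports a limited XPath subset; the wrapping <sentence> element is an assumption, since the slide shows only the inner tags.

```python
import xml.etree.ElementTree as ET

# Assumption: a <sentence> root element wraps the tagged text.
TAGGED = (
    "<sentence>"
    "<person>Mr. Schröder</person> "
    '<capital amount="20000 EUR">earns EUR 20.000</capital> '
    "<period>per month</period>."
    "</sentence>"
)

root = ET.fromstring(TAGGED)

person = root.findtext("person")             # text of the <person> element
amount = root.find("capital").get("amount")  # normalised value from the attribute
period = root.findtext("period")

# Attribute predicate: all <capital> elements with exactly this amount.
with_amount = root.findall(".//capital[@amount='20000 EUR']")
```

The amount attribute carries the normalised value, so a query does not depend on how the amount was written in the original text.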
DIAsDEM
:: DIAsDEM ::
DIAsDEM
DIAsDEM: Datenintegration von Altlastdaten und semistrukturierten Dokumenten mit Mining-Verfahren (integration of legacy data and semi-structured documents with data mining techniques)
project of the Deutsche Forschungsgemeinschaft (German Research Foundation)
necessary: domain-specific knowledge (!!!)
DIAsDEM
:: how it works ::
DIAsDEM
two-phase model
1. knowledge discovery
– iterative process (with expert knowledge)
– training phase with a training text archive
– finding segments (clusters) and semi-automatic annotation
– derivation of an unstructured XML DTD
2. semantic tagging
– applying the discovered clusters to new archives
– "intelligent" tagging of new, unseen texts from the same domain
DIAsDEM
Fig.: Winkler 2003b, page 6
DIAsDEM
expert knowledge is necessary to achieve "good" semantic tagging
What is needed?
<person>Mr. Schröder</person>
or
<title>Mr.</title>
<name>Schröder</name>
DIAsDEM
steps in DIAsDEM:
1. finding segments (for example sentences) in training texts, using thesauri and knowledge of named entities (persons, ...)
2. building an unstructured XML DTD
3. clustering similar text elements (cluster name = the descriptors that dominate the cluster)
4. renaming of clusters by experts
5. annotation of training texts
6. building a final XML DTD (for querying, XML-based databases such as Tamino, data mining tools, ...)
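The steps above can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the mini-thesaurus, the function names, and above all the "clustering", which here simply groups segments that share the same set of descriptors and stands in for a real clustering algorithm. It is not the DIAsDEM implementation.

```python
import re
from collections import defaultdict
from xml.sax.saxutils import escape

# Hypothetical mini-thesaurus: surface word -> descriptor.
THESAURUS = {
    "earns": "Income", "salary": "Income", "month": "Period",
    "founded": "Foundation", "established": "Foundation",
}


def segments(text):
    """Step 1: naive sentence split (breaks on abbreviations such as "Mr.")."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def descriptors(segment):
    """Map the words of a segment to thesaurus descriptors."""
    words = re.findall(r"\w+", segment.lower())
    return frozenset(THESAURUS[w] for w in words if w in THESAURUS)


def cluster(segs):
    """Step 3 stand-in: group segments by identical descriptor set.

    The default cluster name is built from the descriptors in the group.
    """
    groups = defaultdict(list)
    for seg in segs:
        found = descriptors(seg)
        if found:  # segments without any descriptor stay untagged
            groups["".join(sorted(found))].append(seg)
    return dict(groups)


def annotate(clusters, renames):
    """Steps 4 and 5: apply expert names, then wrap segments in tags."""
    tagged = []
    for name, segs in clusters.items():
        tag = renames.get(name, name)
        tagged.extend(f"<{tag}>{escape(seg)}</{tag}>" for seg in segs)
    return tagged


def flat_dtd(tags, root="Document"):
    """Steps 2 and 6: a flat DTD in which every tag contains only text."""
    names = sorted(tags)
    lines = [f"<!ELEMENT {root} (#PCDATA | {' | '.join(names)})*>"]
    lines += [f"<!ELEMENT {name} (#PCDATA)>" for name in names]
    return "\n".join(lines)


training_text = (
    "Mister Schröder earns 20000 EUR per month. "
    "The company was founded in 1999. "
    "Her salary is paid every month."
)
clusters = cluster(segments(training_text))
tagged = annotate(clusters, renames={"IncomePeriod": "Salary"})
dtd = flat_dtd({"Salary", "Foundation"})
```

The renames dictionary plays the role of the expert in step 4: the automatically derived name IncomePeriod is replaced by the more meaningful tag Salary.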
Extensions
:: new extensions ::
Extensions
main goal:
– searching the internet for documents that match a user specification
– downloading hypertext documents
– extracting plain text from hypertext documents
– importing plain text into the DIAsDEM collection
Extensions
:: querying Google ::
Extensions - Google
1. the user enters the search words (panel)
2. Google is queried for the search words via the Google API
3. result: a list of URLs (currently only 10, limited by Google), automatically exported to a text file
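A rough sketch of the export step. The actual Google API call is not shown on the slides, so the search is passed in as a function; export_urls, its parameters and the one-URL-per-line file format are assumptions for this example.

```python
from pathlib import Path
from typing import Callable, Iterable, List

MAX_RESULTS = 10  # result limit mentioned on the slide


def export_urls(
    search: Callable[[str], Iterable[str]],  # hypothetical search backend
    search_words: List[str],
    out_file: str,
) -> List[str]:
    """Run the search for the given words and write the URLs, one per line."""
    query = " ".join(search_words)
    urls = list(search(query))[:MAX_RESULTS]
    text = "\n".join(urls) + ("\n" if urls else "")
    Path(out_file).write_text(text, encoding="utf-8")
    return urls
```

Passing the search in as a function keeps the export independent of any particular search service.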
Extensions
:: processing and import ::
Extensions - Processing and Import
1. reading the URL list (exported text file)
2. downloading the hypertext files into a directory and renaming the files (enumeration)
3. detagging the files
– cleaning hypertext documents
– deleting comments and tags
– replacing special characters (not yet implemented)
4. importing the files into the DIAsDEM collection
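Steps 1 to 3 could look roughly like the following sketch, which uses only the Python standard library. The function names, the file naming scheme and the UTF-8 decoding are assumptions; the import into the DIAsDEM collection (step 4) is left out because its interface is not described here.

```python
from html.parser import HTMLParser
from pathlib import Path
from urllib.request import urlopen


class _TextExtractor(HTMLParser):
    """Keeps text nodes only; tags and comments are never collected."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &ouml; into characters.
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip = 0  # nesting depth inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self._chunks.append(data)

    def text(self):
        # Join the chunks and collapse runs of whitespace.
        return " ".join(" ".join(self._chunks).split())


def detag(html_source):
    """Step 3: strip tags, comments, scripts and styles from one document."""
    parser = _TextExtractor()
    parser.feed(html_source)
    parser.close()
    return parser.text()


def download_and_detag(url_list_file, target_dir):
    """Steps 1 to 3: read the URL list, download, enumerate, detag."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    lines = Path(url_list_file).read_text(encoding="utf-8").splitlines()
    urls = [line.strip() for line in lines if line.strip()]
    for number, url in enumerate(urls, start=1):
        with urlopen(url) as response:
            raw = response.read()
        # Assumption: pages are UTF-8; undecodable bytes are replaced.
        plain = detag(raw.decode("utf-8", errors="replace"))
        (target / f"{number:04d}.txt").write_text(plain, encoding="utf-8")
```

Because the parser converts character references itself, special characters such as &ouml; or &euro; come out as plain characters without a separate replacement pass.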
Questions?
Literature
Graubitz, H., Spiliopoulou, M. & Winkler, K. (2001). "The DIAsDEM Framework for Converting Domain-Specific Texts into XML Documents with Data Mining Techniques". In Proceedings of the First IEEE International Conference on Data Mining, pages 171-178, San Jose, CA, USA, November / December 2001. IEEE Computer Society, Los Alamitos.
Winkler, K. & Spiliopoulou, M. (2003a). "Text Mining in der Wettbewerberanalyse: Konvertierung von Textarchiven in XML-Dokumente". In Proceedings der 6. Konferenz der SAS Anwender in Forschung und Entwicklung, pages 347-363, Shaker Verlag, Aachen, Germany.
Winkler, K. (2003b). "Technical Report - Getting Started with DIAsDEM Workbench 2.1". A Case-Based Approach Technical Report, 121 pages. HHL - Leipzig Graduate School of Management.