Course on Data Mining: Seminar Meetings
Course on Data Mining (581550-4): Seminar Meetings
[Schedule diagram: topics Ass. Rules, Episodes, Text Mining, Clustering, KDD Process, Home Exam; dates 02.11., 09.11., 16.11., 23.11., 30.11.; legend: M = Seminar by Mika, P = Seminar by Pirjo]
Today 16.11.2001
• R. Feldman, M. Fresko, H. Hirsh, et al.: "Knowledge Management: A Text Mining Approach", Proc. of the 2nd Int'l Conf. on Practical Aspects of Knowledge Management (PAKM98), 1998
• B. Lent, R. Agrawal, R. Srikant: "Discovering Trends in Text Databases", Proc. of the 3rd Int'l Conference on Knowledge Discovery in Databases and Data Mining, 1997.
Good to Read as Background
• Both papers refer to the Agrawal and Srikant paper we had last week:
  Rakesh Agrawal and Ramakrishnan Srikant: Mining Sequential Patterns. Int'l Conference on Data Engineering, 1995.
Knowledge Management: A Text Mining Approach
R. Feldman, M. Fresko, H. Hirsh, et al.
Bar-Ilan University and Instinct Software, Israel; Rutgers University, USA; LIA-EPFL, Switzerland
Published in PAKM'98 (Int'l Conf. on Practical Aspects of Knowledge
Management)
Data Mining course Autumn 2001/University of Helsinki
Summary by Mika Klemettinen
KM: A Text Mining Approach
• Basic idea (see selected phases on the next slides):
  1. Get input data in SGML (or XML) format. Select only the contents of desired elements! (title, abstract, etc.)
  2. Do linguistic preprocessing:
     2.1 Term extraction (use linguistic software for this)
     2.2 Term generation (combine adjacent terms to morpho-syntactic patterns like "noun-noun", "adj.-noun", etc. by calculating association coefficients)
     2.3 Term filtering (select only the top M most frequent ones)
  3. Create taxonomies (there is a tool for this)
  4. Generate associations (you may constrain the creation)
  5. Visualize/explore the results
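Steps 2.2 and 2.3 can be illustrated with a minimal Python sketch. It assumes the input is already POS-tagged (step 2.1) and uses the Dice coefficient as the association coefficient; the slides do not say which coefficient the paper uses, so that choice, the tag names, the threshold, and the function names `candidate_terms` and `top_m` are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical POS-tagged input: one list of (word, tag) pairs per document.
docs = [
    [("data", "NOUN"), ("mining", "NOUN"), ("finds", "VERB"),
     ("frequent", "ADJ"), ("patterns", "NOUN")],
    [("text", "NOUN"), ("mining", "NOUN"), ("uses", "VERB"),
     ("data", "NOUN"), ("mining", "NOUN")],
    [("frequent", "ADJ"), ("patterns", "NOUN"), ("in", "ADP"), ("text", "NOUN")],
]

# Morpho-syntactic patterns named on the slide: "noun-noun" and "adj.-noun".
ALLOWED_PATTERNS = {("NOUN", "NOUN"), ("ADJ", "NOUN")}


def candidate_terms(docs, min_dice=0.5):
    """Term generation: combine adjacent words that match an allowed pattern
    and whose Dice coefficient (an assumed choice) reaches min_dice.
    Returns a dict mapping each two-word term to its frequency."""
    word_freq = Counter()
    pair_freq = Counter()
    for doc in docs:
        word_freq.update(word for word, _ in doc)
        for (w1, t1), (w2, t2) in zip(doc, doc[1:]):
            if (t1, t2) in ALLOWED_PATTERNS:
                pair_freq[(w1, w2)] += 1
    terms = {}
    for (w1, w2), freq in pair_freq.items():
        dice = 2 * freq / (word_freq[w1] + word_freq[w2])
        if dice >= min_dice:
            terms[f"{w1} {w2}"] = freq
    return terms


def top_m(terms, m):
    """Term filtering: keep only the m most frequent terms."""
    return [term for term, _ in Counter(terms).most_common(m)]


terms = candidate_terms(docs)
print(terms)
print(top_m(terms, 1))
```

On this toy input "text mining" matches the noun-noun pattern but is dropped because its Dice coefficient (0.4) falls below the assumed threshold.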
2.1: Term Extraction
3: Taxonomy Construction
4: Association Rule Generation
4: Association Rule Generation
5.1: Visualization/Exploration
5.2: Visualization/Exploration
Discovering Trends in Text Databases
Brian Lent, Rakesh Agrawal and Ramakrishnan Srikant
IBM Almaden Research Center, USA
Published in KDD'97
Data Mining course Autumn 2001/University of Helsinki
Summary by Mika Klemettinen
Discovering Trends in Text Databases
• Basic ideas:
  • Identify frequent phrases using sequential pattern mining (see the slides & summaries from the Agrawal et al. paper "Mining Sequential Patterns" (MSP))
  • Generate histories of phrases
  • Find phrases that satisfy a specified trend
• Definitions:
  • Phrase: phrase p is (w1)(w2) … (wn), where each w is a word
  • 1-phrase: (IBM) (data)(mining)
  • 2-phrase: (IBM) (data)(mining) (Anderson)(Consulting) (decision)(support)
  • Itemset, sequence, is contained, etc.: as in MSP paper
Discovering Trends in Text Databases
• Gaps: Minimum and maximum gaps between adjacent words: identify relations of words/phrases inside sentences/paragraphs, between words/phrases in different paragraphs, between words/phrases in different sections, etc.
  • Sentence boundary: 1000
  • Paragraph boundary: 100.000
  • Section boundary: 10.000.000
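One way to read the boundary values above is as offsets added to word positions, so that a single maximum-gap threshold decides whether two words may lie in the same sentence, paragraph, or section. The Python sketch below shows that reading. The numbering scheme and the helper names `assign_positions` and `within_gap` are assumptions for illustration and are not taken from the paper.

```python
# Boundary offsets as listed on the slide.
SENTENCE_BOUNDARY = 1_000
PARAGRAPH_BOUNDARY = 100_000
SECTION_BOUNDARY = 10_000_000


def assign_positions(document):
    """Give each word a position; crossing a sentence, paragraph, or section
    boundary adds the corresponding offset (assumed scheme).
    document: list of sections -> paragraphs -> sentences -> words."""
    positions = []
    pos = 0
    for s, section in enumerate(document):
        if s > 0:
            pos += SECTION_BOUNDARY
        for p, paragraph in enumerate(section):
            if p > 0:
                pos += PARAGRAPH_BOUNDARY
            for t, sentence in enumerate(paragraph):
                if t > 0:
                    pos += SENTENCE_BOUNDARY
                for word in sentence:
                    pos += 1
                    positions.append((word, pos))
    return positions


def within_gap(positions, first, second, min_gap, max_gap):
    """True if some occurrence of `second` follows some occurrence of `first`
    with a position difference between min_gap and max_gap (inclusive)."""
    firsts = [p for w, p in positions if w == first]
    seconds = [p for w, p in positions if w == second]
    return any(min_gap <= b - a <= max_gap for a in firsts for b in seconds)


# Hypothetical document: two sections; the first has two paragraphs.
document = [
    [[["data", "mining", "works"], ["it", "scales"]],
     [["text", "mining", "too"]]],
    [[["trends", "emerge"]]],
]
positions = assign_positions(document)

# A maximum gap below 1000 keeps both words inside one sentence.
print(within_gap(positions, "data", "mining", 1, 999))
print(within_gap(positions, "works", "it", 1, 999))
```

With these offsets, a maximum gap below 1000 restricts matches to one sentence, below 100.000 to one paragraph, and below 10.000.000 to one section, provided each unit has fewer words than the next boundary value.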
• Phases:
  • Partition data/documents based on their time stamps, create phrases for each partition (Lent et al. have patent data documents)
  • Select the frequent phrases and save their frequencies
  • Define shape queries using SDL (Shape Definition Language)
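The phases above can be sketched in Python as follows. The sketch partitions documents by year, records each phrase's relative frequency per partition as its history, and then applies a trend test. It does not implement SDL: the plain "strictly rising" check `is_rising` stands in for a shape query, and the input data, the `min_support` threshold, and all function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical input: (year, set of phrases found in the document).
documents = (
    [(1990, {"data mining"})] * 1 + [(1990, {"expert system"})] * 3
    + [(1991, {"data mining"})] * 2 + [(1991, {"expert system"})] * 2
    + [(1992, {"data mining"})] * 3 + [(1992, {"expert system"})] * 1
)


def phrase_histories(documents, min_support=0.0):
    """Partition documents by time stamp and return (periods, histories),
    where histories maps each phrase to its support per period.
    Phrases that never reach min_support in any period are dropped."""
    docs_per_period = defaultdict(int)
    counts = defaultdict(lambda: defaultdict(int))
    for period, phrases in documents:
        docs_per_period[period] += 1
        for phrase in phrases:
            counts[phrase][period] += 1
    periods = sorted(docs_per_period)
    histories = {}
    for phrase, per_period in counts.items():
        history = [per_period.get(p, 0) / docs_per_period[p] for p in periods]
        if max(history) >= min_support:
            histories[phrase] = history
    return periods, histories


def is_rising(history):
    """Stand-in for an SDL shape query: support strictly increases."""
    return all(a < b for a, b in zip(history, history[1:]))


periods, histories = phrase_histories(documents, min_support=0.3)
rising = sorted(p for p, h in histories.items() if is_rising(h))
print(periods)
print(rising)
```

Storing the history as a list aligned with the sorted periods keeps the trend test independent of how the partitions were formed, so other shapes (falling, spike) could be checked in the same way.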
Discovering Trends in Text Databases
Discovering Trends in Text Databases
Discovering Trends in Text Databases