Data Mining and Knowledge Discovery
Part of “New Media and eScience” MSc Programme
and “Statistics” MSc Programme
Fall semester, 2004/05
Nada Lavrač
Jožef Stefan Institute
Ljubljana, Slovenia
Thanks to Blaz Zupan, Saso Dzeroski and Peter Flach for contributing some slides to this course material
Course participants
I. NMeS MPSJS students
– Robert Blatnik
– Joel Plisson
– Jadran Prodan
– Viljem Tisnikar
II. Statistics students
– Borut Kodrič
– Borut Rajer
– Maja Sever
III. Other participants
• Dept. of Knowledge Technologies members, students, scholars
– Matjaz Depolli, Borut Lužar, Primož Lukšič, …
• Faculty of mechanical engineering MSc students
– Jože Jenkole, Viktor Zaletelj, Damir Husejnagič, Andrej Jermol
Courses in Knowledge Technologies: Fall 2004/05
10 Nov. 15h-19h
Data Mining and Knowledge Discovery
prof. dr. Nada Lavrač
11 Nov. 12h-13h ???????
Concept of Sustainable Development
prof. dr. Ivo Šlaus
11 Nov. 15h-19h
Decision Support
prof. dr. Marko Bohanec
17 Nov. 15h-19h
Selected topics in New Media and eScience
prof. dr. Sašo Džeroski
Courses in Knowledge Technologies: Fall 2004/05
25 Nov. 15h-19h ???????
Data Mining and Knowledge Discovery
prof. dr. Nada Lavrač
15 Dec. 15h-19h
New Media and Knowledge Management
Nada Lavrač, Mitja Jermol, Tanja Urbančič, Sašo Džeroski, Tomaž Erjavec
13 Jan. 15h-19h
Language Technologies
to be defined
Text and Web Mining, Active Learning, Relational Data Mining, Equation Discovery, ...
Mladenić, Grobelnik, Todorovski, ...
Advanced Course on Knowledge Technologies: ACAI-05
Ljubljana, June 27–July 8, 2005
[Schedule table; row labels: Activity Track, Data Mining Track, Student section, Intelligent data analysis in Semantic Web, Intelligent data analysis in Collaborative Networked Organisations, A PASCAL Network Workshop]
Intelligent data analysis
(Fundamental concepts, Statistics, Statistical methods, Neural Networks, Fuzzy Logic, Time series, Bayesian Networks, Visualization tools, Rule Induction, Evolutionary systems, Kernel Methods…)
Data mining and decision support
(Text mining, Data mining, Web mining, Decision Support, Link Analysis,…)
Hands-on section: Data mining with Weka
Hands-on section: Data mining with Orange
Hands-on section: Text mining with TextGarden
Hands-on section: Data mining with R
The 1st ECOLEAD Summer School on Collaborative Networked Organisations
A Summer School on Semantic Web
Days: MON 27.6.05, TUE 28.6.05, WED 29.6.05, THU 30.6.05, FRI 1.7.05, SAT 2.7.05, MON 4.7.05, TUE 5.7.05, WED 6.7.05, THU 7.7.05, FRI 8.7.05
Credits and coursework
“New Media and eScience” MSc Programme
• 6 credits
• 30 hours
– 10 hours: lectures
– 10 hours: hands-on
– 10 hours: seminar
• Individual workload distribution and/or consultations: to be agreed by mail/phone
“Statistics” MSc Programme
• 12 credits
• 36 hours
– 24 hours: lectures
– 12 hours: seminar
• Individual workload distribution and/or consultations: to be agreed by mail/phone
Credits and coursework: Sample individual programmes
“New Media and eScience” MSc Programme
• 6 credits, 30 hours
– Lectures (with/without ACAI lectures)
– e.g., ACAI hands-on (1x, 2x or 3x4 hours)
– Seminar based on the results of ACAI hands-on work
“Statistics” MSc Programme
• 12 credits, 36 hours
– Lectures (e.g., with ACAI lectures)
– e.g., WEKA ACAI hands-on (1x4 hours)
– Individual seminar work, using your own data (e.g., using WEKA for survey data analysis)
Outline of 10 Nov. and 25 Nov. lectures on DM and KDD
I. Introduction
– Data Mining and KDD process
– Why DM: Examples of discovered patterns and applications
– Classification of DM tasks and techniques
– Visualization and overview of DM tools
(Ch. 1,2,11,12,13 of DM&DS book)
II. DM Techniques
– Classification of DM tasks and techniques
– Predictive DM
• Decision Tree induction (Ch. 3 of Mitchell’s book)
• Learning sets of rules (Ch. 7 of IDA book, Ch. 10 of Mitchell’s book)
– Descriptive DM
• Association rule induction
• Subgroup discovery
• Hierarchical clustering
III. Evaluation
– Evaluation methodology
– Evaluation measures
IV. Relational Data Mining
– What is RDM?
– Propositionalization
– Inductive Logic Programming
(Ch. 3,4,11 of RDM book)
V. Concluding Remarks
Introduction to data mining
• Data Mining (DM) and related areas
• Why DM: Examples of discovered patterns and applications
• Classification of DM tasks and techniques
• Visualization and overview of DM tools
What is data mining
• Extraction of useful information from data: discovering relationships that have not been previously known
• The viewpoint in this course: DM is the application of machine learning techniques to “hard” real-life problems
Related areas
[Diagram: DM surrounded by its related areas: statistics, machine learning, visualization, text and Web mining, soft computing, pattern recognition, databases]
Database technology and data warehouses
• efficient storage, access and manipulation of data
Related areas
Statistics, machine learning, pattern recognition and soft computing*
• techniques for classification and knowledge extraction from data
* neural networks, fuzzy logic, genetic algorithms, probabilistic reasoning
Related areas
Text and Web mining
• Web page analysis
• text categorization
• acquisition, filtering and structuring of textual information
• natural language processing
Related areas
Visualization
• visualization of data and discovered knowledge
Point of view in this tutorial
Data mining with machine learning methods
Emphasis on relation with statistics
Machine learning and statistics
• Both have a long tradition of developing inductive techniques for data analysis
– reasoning from properties of data samples to properties of a population
• DM = statistics + marketing? No!
DM = statistics + ... + machine learning
• Statistics is particularly appropriate for hypothesis testing and data analysis under certain theoretical expectations about data distribution, independence, random sampling, sample size, …
• Machine learning is particularly appropriate for inducing generalizations that consist of easily understandable patterns, induced from both large and small samples
DM and KDD
• DM is a way of doing data analysis, aimed at finding patterns, revealing hidden regularities and relationships
• Knowledge Discovery in Databases (KDD) provides a broader view:
– KDD is defined as “the process of identifying valid, novel, potentially useful and ultimately understandable patterns in data” *
– KDD provides tools to automate the entire process of data analysis, including the statistician’s art of hypothesis selection
• DM is the key element in this much more elaborate KDD process
* Usama M. Fayyad et al, The KDD Process for Extracting Useful Knowledge from Volumes of Data. Comm ACM, Nov. 1996
The KDD process
• KDD involves several phases:
– data preparation (selection, pre-processing, transformation)
– data mining
– interpretation and evaluation of discovered patterns
• Data mining is the key phase, but takes only 15-25% of the effort of the overall KDD process
Part I. Introduction
• Data Mining and the KDD process
• Why DM: Examples of discovered patterns and applications
• Classification of DM tasks and techniques
• Visualization and overview of DM tools
The SolEuNet Project
• European 5FP project “Data Mining and Decision Support for Business Competitiveness: A European Virtual Enterprise”, 2000-2003
• Scientific coordinator: Jozef Stefan Institute; administrative coordinator: Fraunhofer Gesellschaft
• 3 M €, 12 partners (8 academic and 4 business) from 7 countries
• Main project objectives:
– development of prototype solutions for end-users
– foundation of a virtual enterprise for marketing data mining and decision support expertise, involving business and academia
Data mining application prototypes
• Mediana – analysis of media research data
• Kline & Kline – improved brand name recognition
• Australian financial house – customer quality evaluation, stock market prediction
• Czech health farm – predict the use of resources
• UK County Council – analysis of traffic accident data
• Portuguese statistical bureau – Web page access analysis for better page organization
• Detection of coronary heart disease risk groups
• Analysis of online dating
• EC Harris, UK – analysis of building construction projects
• European Commission – analysis of 5FP IST projects: better understanding of large amounts of text documents, “clique” identification
Mediana case study
• Questionnaires about journal/magazine reading, watching TV programs and listening to radio programs, published annually since 1992, about 1200 questions/attributes (frequency of reading/listening/watching, distribution w.r.t. sex, age, education, buying power, interests, ...)
• Data for 1998, about 8000 questionnaires
• Good quality, “clean” data
• Table of n-tuples (rows: individuals, columns: attributes)
Mediana case study
• Target patterns:
– Which other journals/magazines are read by readers of a particular journal/magazine?
– What are the properties of individuals that are consumers of a particular medium?
– Which properties are distinctive for readers of various journals?
• Induced models: description (association rules, clusters) and classification (decision trees, classification rules)
Decision trees
Finding reader profiles: decision tree for classifying people into readers and non-readers of a teenage magazine
Classification rules
Set of Rules: if Cond then Class
Interpretation: if-then ruleset, or if-then-else decision list
Class: Reading of daily newspaper EN (Evening News)
if a person does not read MM (Maribor Magazine) and rarely reads the weekly magazine “7Days”
then the person does not read EN (Evening News)
else if a person rarely reads MM and does not read the weekly magazine SN (Sunday News)
then the person reads EN
else if a person rarely reads MM
then the person does not read EN
else the person reads EN.
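The decision-list interpretation means that rules are tried in order and the first matching rule decides the class. A minimal Python sketch of the list above, assuming hypothetical reading-frequency values "never", "rarely" and "often" (the slide does not give the actual attribute encoding):

```python
def reads_evening_news(mm: str, seven_days: str, sn: str) -> bool:
    """Decision list for the class 'reads EN (Evening News)'.

    mm, seven_days, sn are reading frequencies for MM, "7Days" and SN.
    The values 'never', 'rarely', 'often' are an assumed encoding.
    """
    if mm == "never" and seven_days == "rarely":
        return False   # rule 1: does not read EN
    elif mm == "rarely" and sn == "never":
        return True    # rule 2: reads EN
    elif mm == "rarely":
        return False   # rule 3: does not read EN
    else:
        return True    # default rule: reads EN


# Rule order matters: this person matches both rule 2 and rule 3,
# but rule 2 fires first.
example = reads_evening_news(mm="rarely", seven_days="often", sn="never")
```

Read as an unordered ruleset instead, rules 2 and 3 would conflict for this example, which is why the two interpretations need to be kept apart.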
Association rules
Rules X => Y, where X and Y are conjunctions of binary attributes
• Support: Sup(X,Y) = #XY/#D = p(XY)
• Confidence: Conf(X,Y) = #XY/#X = p(XY)/p(X) = p(Y|X)
Task: Find all association rules that satisfy minimum support and minimum confidence constraints.
Example association rule about readers of yellow press daily newspaper SloN (Slovenian News):
read_Love_Stories_Magazine => read_SloN
sup = 3.5% (3.5% of the whole dataset population reads both LSM and SloN)
conf = 61% (61% of those reading LSM also read SloN)
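The two measures can be computed directly from their definitions. The sketch below uses a tiny made-up set of readers (not the Mediana data); each individual is represented as the set of titles they read, and the function names are illustrative only.

```python
def support(transactions, items):
    """Sup = fraction of all transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= t) / len(transactions)


def confidence(transactions, x, y):
    """Conf(X => Y) = p(XY) / p(X); assumes X occurs at least once."""
    return support(transactions, set(x) | set(y)) / support(transactions, x)


# Hypothetical toy data: five individuals and the titles they read.
readers = [
    {"LSM", "SloN"},
    {"LSM", "SloN"},
    {"LSM"},
    {"SloN"},
    {"Delo"},
]

sup = support(readers, {"LSM", "SloN"})        # 2 of 5 read both
conf = confidence(readers, {"LSM"}, {"SloN"})  # 2 of the 3 LSM readers
```

An association rule miner keeps only the rules whose `sup` and `conf` both exceed user-chosen minimum thresholds.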
Association rules
Finding profiles of readers of the Delo daily newspaper
1. read_Marketing_magazine 116 => read_Delo 95 (0.82)
2. read_Financial_News 223 => read_Delo 180 (0.81)
3. read_Views 201 => read_Delo 157 (0.78)
4. read_Money 197 => read_Delo 150 (0.76)
5. read_Vip 181 => read_Delo 134 (0.74)
Interpretation: Most readers of Marketing magazine, Financial News, Views, Money and Vip read also Delo.
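The bracketed values are consistent with reading each rule as "number of readers of the magazine => number of those who also read Delo (confidence)". That reading is an assumption about the tool's output format; under it, a short check reproduces the confidences:

```python
# (readers of the magazine, readers of the magazine who also read Delo)
rule_counts = {
    "Marketing_magazine": (116, 95),
    "Financial_News": (223, 180),
    "Views": (201, 157),
    "Money": (197, 150),
    "Vip": (181, 134),
}

# Confidence = #XY / #X, rounded to two decimals as on the slide.
confidences = {
    name: round(n_xy / n_x, 2) for name, (n_x, n_xy) in rule_counts.items()
}

for name, conf in confidences.items():
    print(f"read_{name} => read_Delo ({conf:.2f})")
```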
Analysis of UK traffic accidents
• End-user: Hampshire County Council (HCC, UK)
– Can records of road traffic accidents be analysed to produce road safety information valuable to county surveyors?
– HCC is sponsored to carry out a research project Road Surface Characteristics and Safety
– Research includes an analysis of the STATS19 Accident Report Form Database to identify trends over time in the relationships between recorded road-user type/injury, vehicle position/damage, and road surface characteristics
STATS19 Data Base
• Over 5 million accidents recorded in 1979-1999
• 3 data tables
Accident ACC7999 (~5 mil. accidents, 30 variables): Where? When? How many?
Vehicle VEH7999 (~9 mil. vehicles, 24 variables): Which vehicles? What movement? Which consequences?
Casualty CAS7999 (~7 mil. injuries, 16 variables): Who was injured? What injuries? ...
Data understanding
[Chart: number of accidents per year, 1979-1999; x-axis: Year of Accident; y-axis ticks from 220000 to 270000]
Data quality: Accident location
Data preparation
• There are 51 police force areas in the UK
• For each area we count the number of accidents in each:
– Year
– Month
– Day of Week
– Hour of Day
Data preparation
YEAR
pfc   1979  1980  1981  1982  1983  1984  1985  1986  1987  1988  1989  1990  1991  1992  1993  1994  1995  1996  1997  1998  1999
a    10023  9431  9314  8965  8655  9014  9481  9069  8705  8829  9399  9229  8738  8199  7453  7613  7602  7042  7381  7362  6905
b     6827  6895  6952  7032  6778  6944  6387  6440  6141  5924  6331  6233  5950  6185  5910  6161  5814  6263  5881  5855  5780
c     2409  2315  2258  2286  2022  2169  2212  2096  1989  1917  2137  2072  2032  1961  1653  1526  1552  1448  1521  1408  1234
MONTH
pfc    jan    feb    mar    apr    may    jun    jul    aug    sep    oct    nov    dec
a    72493  67250  77434  73841  78813  78597  80349  74226  79362  85675  84800  76282
b     2941   2771   3145   3317   3557   3668   3988   4048   3822   3794   3603   3481
c     9261   8574   9651   9887  10649  10590  10813  11299  10810  11614  10884  10306
DAY OF WEEK
pfc  Sunday  Monday  Tuesday  Wednesday  Thursday  Friday  Saturday
a     96666  132845   137102     138197    142662  155752    125898
b      5526    5741     5502       5679      6103    7074      6510
c     15350   17131    16915      17116     18282   21000     18544
HOUR
pfc     0     1     2    3    4    5     6     7     8  …     16     17    18    19    20    21    22    23
a     794   626   494  242  166  292   501  1451  2284  …   3851   3538  2557  2375  1786  1394  1302  1415
b    2186  1567  1477  649  370  521  1004  4099  7655  …  11500  11140  7720  7129  5445  4396  3946  4777
c    2468  1540  1714  811  401  399   888  3577  8304  …  12112  12259  8701  7825  6216  4809  4027  4821
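Count tables of this kind can be produced by grouping accident records by police force code and by one time attribute. The following is a minimal sketch with a few invented records; the record layout (police force code, timestamp) is an assumption and not the STATS19 schema.

```python
from collections import Counter
from datetime import datetime

DAYS = ("Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday")

# Hypothetical accident records: (police force code, date and time).
accidents = [
    ("a", datetime(1979, 1, 5, 8, 30)),
    ("a", datetime(1979, 1, 5, 17, 10)),
    ("b", datetime(1980, 7, 12, 17, 45)),
    ("a", datetime(1980, 1, 6, 2, 0)),
]


def count_by(records, key):
    """Count accidents per (police force code, key(timestamp)) pair."""
    return Counter((pfc, key(ts)) for pfc, ts in records)


by_year = count_by(accidents, lambda ts: ts.year)
by_month = count_by(accidents, lambda ts: ts.month)
by_weekday = count_by(accidents, lambda ts: DAYS[ts.weekday()])
by_hour = count_by(accidents, lambda ts: ts.hour)
```

Each counter corresponds to one of the four tables: a row per police force code and a column per year, month, day of week or hour.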
Simple visualization of short time series
• Used for data understanding
• Very informative and easy to understand format
• UK traffic accident analysis: Distributions of number of accidents over different time periods (year, month, day of week, and hour)
Year/Month distribution
[Figure: number of accidents by year and month (Jan to Dec)]
Darker color - MORE accidents
Day of Week/Month distribution
[Figure: number of accidents by day of week and month (Jan to Dec)]
All weekdays (Mon – Fri) are worse in deep winter, Friday the worst
Hour/Month distribution
[Figure: number of accidents by hour of day and month (Jan to Dec)]
1. More accidents at “rush hour”; the afternoon rush hour is the worst
2. More holiday traffic (less rush hour) in August
Day of Week/Hour distribution
[Figure: number of accidents by day of week and hour of day]
1. More accidents at “rush hour”; the afternoon rush hour is the worst and lasts longer, with an “early finish” on Fridays
2. More leisure traffic on Saturday/Sunday
Traffic: different modeling approaches
• association rule learning
• static subgroup discovery
• dynamic subgroup discovery
• clustering of short time series
• text mining
• multi-relational approaches
• …
Some discovered association rules
• Association rules: Road number and Severity of accident
– The probability of a fatal or serious accident on the “K8” road is 2.2 times greater than the probability of fatal or serious accidents in the county generally.
– The probability of fatal accidents on the “K7” road is 2.8 times greater than the probability of fatal accidents in the county generally (when the road is dry and the speed limit = 70).
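Statements of the form "N times greater than in the county generally" compare a conditional probability with the overall probability of the same outcome. This ratio is commonly called lift; the slides do not name the measure, and the counts below are invented purely to illustrate the arithmetic.

```python
def lift(n_xy: int, n_x: int, n_y: int, n_total: int) -> float:
    """Ratio p(Y|X) / p(Y): how much more likely Y is when X holds."""
    return (n_xy / n_x) / (n_y / n_total)


# Hypothetical counts: 1000 accidents in the county, 100 of them fatal
# or serious; 50 accidents on one road, 11 of them fatal or serious.
ratio = lift(n_xy=11, n_x=50, n_y=100, n_total=1000)
print(f"{ratio:.1f} times the county-wide probability")
```

A ratio of 1 would mean the road is no different from the county as a whole; values well above 1 flag roads worth a closer look.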
Analysis of documents of European IST projects
Data source:
• List of IST project descriptions as 1-2 page text summaries from the Web (database www.cordis.lu/)
• IST 5FP has 2786 projects with 7886 participating organizations
Analysis tasks:
• Visualization of project topics
• Analysis of collaboration
• Connectedness between organizations
• Community/clique identification
• Thematic consortia identification
• Simulation of 6FP IST
Analysis of documents of European IST projects
Visualization into 25 project groups
[Figure: map of project groups; visible labels include Health, Data analysis, Knowledge Management, Mobile computing, Telecommunication, Transport, Electronics]
Institutional Backbone of IST
[Figure: collaboration graph; label: No. of joint projects]
Collaboration between countries (top 12)
[Figure: labels: Most active country, Number of collaborations]
Types of DM tasks
• Predictive DM:
– Classification (learning of rulesets, decision trees, ...)
– Prediction and estimation (regression)
– Predictive relational DM (RDM, ILP)
• Descriptive DM:
– description and summarization
– dependency analysis (association rule learning)
– discovery of properties and constraints
– segmentation (clustering)
– subgroup discovery
• Text, Web and image analysis
Predictive vs. descriptive induction
[Figure: two panels, “Predictive induction” and “Descriptive induction”, showing examples marked +, - and x together with an induced hypothesis H]
Predictive vs. descriptive induction
• Predictive induction: Inducing classifiers for solving classification and prediction tasks
– Classification rule learning, Decision tree learning, ...
– Bayesian classifier, ANN, SVM, ...
– Data analysis through hypothesis generation and testing
• Descriptive induction: Discovering interesting regularities in the data, uncovering patterns, ... for solving KDD tasks
– Symbolic clustering, Association rule learning, Subgroup discovery, ...
– Exploratory data analysis
Predictive vs. descriptive induction: A rule learning perspective
• Predictive induction: Induces rulesets acting as classifiers for solving classification and prediction tasks
• Descriptive induction: Discovers individual rules describing interesting regularities in the data
• Therefore: Different goals, different heuristics, different evaluation criteria
Supervised vs. unsupervised learning: A rule learning perspective
• Supervised learning: Rules are induced from labeled instances (training examples with class assignment) - usually used in predictive induction
• Unsupervised learning: Rules are induced from unlabeled instances (training examples with no class assignment) - usually used in descriptive induction
• Exception: Subgroup discovery discovers individual rules describing interesting regularities in the data from labeled examples
Subgroups vs. classifiers
• Classifiers:
– Classification rules aim at pure subgroups
– A set of rules forms a domain model
• Subgroups:
– Rules describing subgroups aim at a significantly higher proportion of positives
– Each rule is an independent chunk of knowledge
• Link:
– Subgroup discovery (SD) can be viewed as a form of cost-sensitive classification
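The contrast can be made concrete by looking at what each kind of rule covers. The sketch below uses invented counts and reports, for a rule covering `tp` positives and `fp` negatives, its proportion of positives, the proportion of positives in the whole dataset, and its coverage. The names are illustrative, and no significance test is included.

```python
def rule_quality(tp: int, fp: int, n_pos: int, n_neg: int) -> dict:
    """Simple descriptive statistics for one rule."""
    covered = tp + fp
    total = n_pos + n_neg
    return {
        "precision": tp / covered,   # proportion of positives among covered
        "base_rate": n_pos / total,  # proportion of positives overall
        "coverage": covered / total, # share of all examples covered
    }


# Hypothetical dataset: 100 positives, 400 negatives.
# A classification rule aims at a pure subgroup, often a small one.
pure_rule = rule_quality(tp=20, fp=0, n_pos=100, n_neg=400)
# A subgroup rule tolerates negatives if positives are clearly
# over-represented relative to the base rate.
subgroup_rule = rule_quality(tp=60, fp=40, n_pos=100, n_neg=400)
```

Here the subgroup rule is far from pure (0.6 versus a base rate of 0.2), yet it describes three times as many positives as the pure rule, which is the trade-off the slide points to.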
Visualization
• can be used on its own (usually for description and summarization tasks)
• can be used in combination with other DM techniques, for example
– visualization of decision trees
– cluster visualization
– visualization of association rules
– subgroup visualization
Data visualization: Scatter plot
Daisy Graph
Visualization by B. Zupan et al.
Daisy Graph
Patients were mostly female
Daisy Graph
The older the patient, the higher the difference of HHS between two follow-ups
Data visualization: time dependency
Cumulative ineffectiveness of antibiotics gentamycin, clyndamycin, cefpiramide, and cefotaxim
[Bohanec et al., “PTAH: A system for supporting nosocomial infection therapy”, IDAMAP book, 1997]
Subgroup visualization
Subgroups of patients with CHD risk
[Gamberger, Lavrac & Wettschereck, IDAMAP2002]
Subgroup visualization
Subgroups of patients with CHD risk
[Gamberger & Lavrac, ICML2002]
DB Miner: Association rule visualization
MineSet: Association rule visualization
MineSet: Decision tree visualization
DM tools
Clementine
S-Plus
Part I: Summary
• KDD is the overall process of discovering useful knowledge in data
– many steps including data preparation, cleaning, transformation, pre-processing
• Data Mining is the data analysis phase in KDD
– DM takes only 15%-25% of the effort of the overall KDD process
– employing techniques from machine learning and statistics
• Predictive and descriptive induction have different goals: classifier vs. pattern discovery
• Many application areas
• Many powerful tools available
Part I: Introduction Questions