Searching for Credible Relations in Machine Learning
Doctoral Dissertation
Vedrana Vidulin
Supervisor: prof. dr. Matjaž Gams
Co-supervisor: prof. dr. Bogdan Filipič
Ljubljana, 3 February 2012
2 of 20 Searching for Credible Relations in Machine Learning
Introduction
• Task: domain analysis of complex domains
• Problem:
  – When DM methods construct models on complex domains, the models often contain parts (relations) that are less credible from the perspective of a human analyst.
  – Less-credible parts can:
    • Lead to wrong conclusions about the most important relations in the domain
    • Undermine the user’s trust in DM methods (Stumpf et al., 2009).
• Proposed solution: a new method that algorithmically combines human understanding and raw computer power in order to extract credible relations, that is, relations supported by data and meaningful to the human.
An Example
• A decision-tree model is constructed:
  – With the J48 algorithm in Weka,
  – From a data set that represents the impact of the R&D sector on the economic welfare of a country
Country   GERD per capita (PPP$)   Researchers per million inhabitants (HC)   …   Sector investing the most in R&D   GNI per capita
Armenia   7.6                      1,660                                      …   Government                         low
Latvia    37.1                     2,455                                      …   Government                         middle
Japan     813.7                    6,227                                      …   Business enterprise                high
…         …                        …                                          …   …                                  …
37 attributes: R&D sector
167 examples: Countries
Class: Economic welfare
An Example (2)
[Figure: J48 decision tree for the R&D data set. Internal nodes test GERD per capita (PPP$) (thresholds 105.5 and 10.8), Sector employing the most researchers, and Sector investing the most in R&D (branch values: Abroad, Business enterprise, Government, Higher education, Private non-profit, N/A). Leaves: middle (49.0/20.92), middle (42.87/13.15), middle (5.0), low (12.58/0.39), middle (10.29/4.29), middle (16.7/8.77), high (6.57/1.28), high (24.0/1.0), and three leaves high (0.0).]
Outline
• Definition of credible relation
• Human-Machine Data Mining (HMDM) method
• Experimental evaluation
• Conclusions and contributions
Credible Relation
• Relation – a pattern that connects a set of attributes, which describe the properties of a concept underlying the data, with a class/target attribute that represents the concept.
• Credible relation – of great meaning and of high quality:
  – Meaning – a subjective criterion attributed by the human, based on common sense, informal knowledge about the domain, and the observed frequency and stability of the relation.
  – Quality – an objective criterion indicating that the relation is supported by the selected quality measures.
• Credible model – composed only of credible relations.
How to Establish Credible Relations?
The relation is composed of attributes A1 and A2.
Re-examine the relation’s credibility by:
1) Removing attributes A1 and A2 from the data set
2) Adding attributes A1 and A2 to the data set
If the relation is supported by evidence, add it to the list of candidates for credible relations.
The HMDM Algorithm
Repeat
  Create several models (e.g., trees)
  Choose most interesting models
  For each interesting model
    Examine credibility of relations in the model by adding and removing attributes from the data set
  Merge candidate relations with the output list of credible relations
Until no new interesting relations
The HMDM Algorithm (2)
HMDM (data set)
  REPEAT
    Select DM method
    Select parameters and their ranges, define constraints
    Perform INITIAL_DM creating a list of models LM
    FOR each interesting model M from LM, reexamine M:
      REPEAT
        Perform any of the following: {
          ADD_ATTRIBUTES
          REMOVE_ATTRIBUTES
          Expand credibility indicator
        }
        Evaluate the results with several quality measures and for meaning
      UNTIL no more interesting relations are found in the search space near the initial model
      Store credible relations and integrate conclusions
    END FOR
  UNTIL no more new interesting relations are found anywhere in the data set
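The outer loop of the pseudocode above can be sketched in Python. This is only an illustrative skeleton, not the program developed for the dissertation: the names `hmdm`, `build_models`, `is_interesting`, `examine` and `max_rounds` are hypothetical, and the human decisions (which models are interesting, which relations are meaningful) are assumed to arrive through the callbacks.

```python
from typing import Callable, FrozenSet, Iterable, Set

# Assumption: a relation is identified by the set of attribute names it connects.
Relation = FrozenSet[str]


def hmdm(
    build_models: Callable[[], Iterable[object]],
    is_interesting: Callable[[object], bool],
    examine: Callable[[object], Set[Relation]],
    max_rounds: int = 10,
) -> Set[Relation]:
    """Repeat model construction and re-examination until nothing new is found."""
    credible: Set[Relation] = set()
    for _ in range(max_rounds):
        candidates: Set[Relation] = set()
        # INITIAL_DM: create a list of models, then keep the interesting ones.
        for model in build_models():
            if is_interesting(model):
                # Re-examination (adding/removing attributes, judging meaning)
                # is delegated to the caller-supplied `examine` callback.
                candidates |= examine(model)
        new_relations = candidates - credible
        if not new_relations:
            break  # UNTIL no more new interesting relations are found
        credible |= new_relations  # merge with the output list
    return credible


# Toy usage: each "model" is just a list of relations it contains.
supported = {frozenset({"A1", "A2"})}
models = [[frozenset({"A1", "A2"}), frozenset({"A3"})], []]
result = hmdm(
    build_models=lambda: models,
    is_interesting=lambda m: len(m) > 0,
    examine=lambda m: {r for r in m if r in supported},
)
print(result)
```

The `max_rounds` cap is a safety limit added for the sketch; the slide's loop terminates only on the "no new interesting relations" condition.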
HMDM: ADD_ATTRIBUTES
ATTRIBUTES
A1  A2  A3  C
1   1   0   1
1   1   0   1
1   0   1   0
0   1   1   0
1   1   0   1
0   0   1   0
1   0   0   0
Model: J48 trees
Quality: Accuracy (%)
Added attributes graph:
NO ATTRIBUTES
  A1 | 71.43
    A2 | 100
  A2 | 85.71
Candidates for credible relations
A1 & A2 – combination
…
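The accuracies in the ADD_ATTRIBUTES example can be checked with a small stand-in for the tree learner. The sketch below does not call J48; it assumes a majority-vote lookup over the chosen attributes, scored on the training data, which gives the same numbers on this seven-row data set. The names `ATTRS`, `ROWS` and `accuracy` are illustrative.

```python
from collections import Counter, defaultdict

ATTRS = ("A1", "A2", "A3")
# Rows are (A1, A2, A3, C); here C equals A1 AND A2.
ROWS = [
    (1, 1, 0, 1),
    (1, 1, 0, 1),
    (1, 0, 1, 0),
    (0, 1, 1, 0),
    (1, 1, 0, 1),
    (0, 0, 1, 0),
    (1, 0, 0, 0),
]


def accuracy(rows, attrs, subset):
    """Training-set accuracy (%) of a majority-vote lookup on `subset`.

    Assumption: this lookup replaces the J48 tree for illustration only.
    """
    idx = [attrs.index(a) for a in subset]
    groups = defaultdict(Counter)
    for row in rows:
        key = tuple(row[i] for i in idx)
        groups[key][row[-1]] += 1  # count class values per attribute pattern
    correct = sum(max(counts.values()) for counts in groups.values())
    return round(100.0 * correct / len(rows), 2)


for subset in (["A1"], ["A2"], ["A1", "A2"]):
    print(subset, accuracy(ROWS, ATTRS, subset))
# ['A1'] 71.43
# ['A2'] 85.71
# ['A1', 'A2'] 100.0
```

Neither attribute alone reaches 100%, but the pair does, which is why A1 & A2 is listed as a combination.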
HMDM: REMOVE_ATTRIBUTES
ATTRIBUTES
A1  A2  A3  C
1   0   1   1
0   1   0   0
0   1   0   0
1   0   1   1
1   0   1   1
1   1   1   1
1   1   1   1
Model: J48 trees
Quality: Accuracy (%)
Removed attributes graph:
ALL ATTRIBUTES | 100
  A3 | 100
    A1 | 71.43
Candidates for credible relations
A1 || A3 – redundancy
…
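The redundancy in the REMOVE_ATTRIBUTES example can be reproduced the same way. As before, this is a sketch under an assumption: a majority-vote lookup scored on the training data stands in for J48. The helper names `accuracy` and `without` are hypothetical.

```python
from collections import Counter, defaultdict

ATTRS = ("A1", "A2", "A3")
# Rows are (A1, A2, A3, C); here A1, A3 and C are identical columns.
ROWS = [
    (1, 0, 1, 1),
    (0, 1, 0, 0),
    (0, 1, 0, 0),
    (1, 0, 1, 1),
    (1, 0, 1, 1),
    (1, 1, 1, 1),
    (1, 1, 1, 1),
]


def accuracy(rows, attrs, subset):
    """Training-set accuracy (%) of a majority-vote lookup (stand-in for J48)."""
    idx = [attrs.index(a) for a in subset]
    groups = defaultdict(Counter)
    for row in rows:
        groups[tuple(row[i] for i in idx)][row[-1]] += 1
    correct = sum(max(counts.values()) for counts in groups.values())
    return round(100.0 * correct / len(rows), 2)


def without(attrs, dropped):
    """Attributes that remain after removing the ones in `dropped`."""
    return [a for a in attrs if a not in dropped]


full = accuracy(ROWS, ATTRS, ATTRS)
no_a3 = accuracy(ROWS, ATTRS, without(ATTRS, {"A3"}))
no_a1 = accuracy(ROWS, ATTRS, without(ATTRS, {"A1"}))
no_a1_a3 = accuracy(ROWS, ATTRS, without(ATTRS, {"A1", "A3"}))

# Removing either attribute alone costs nothing; removing both does.
redundant = full == no_a3 == no_a1 and no_a1_a3 < full
print(full, no_a3, no_a1_a3, redundant)
# 100.0 100.0 71.43 True
```

Quality drops only once both A1 and A3 are gone, which matches the A1 || A3 redundancy candidate on the slide.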
Type-Credibility Scheme
• Three levels of credibility:
  1. Frequent and stable relations
    • Often appear in models
    • When added improve quality
    • When removed reduce quality
  2. Frequent and less-stable relations
    • Often appear in models
    • When added sometimes improve quality and sometimes not
    • When removed sometimes reduce quality and sometimes not
  3. Not supported by evidence
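One way to read the three levels is as a rule over recorded evidence. The sketch below is an interpretation, not the dissertation's procedure: the function name, the `min_share` threshold for "often appear in models", and the encoding of evidence as quality changes are all assumptions, and in HMDM the human analyst makes the final judgment.

```python
def credibility_level(model_share, add_deltas, remove_deltas, min_share=0.5):
    """Return 1, 2 or 3 following the three-level scheme.

    model_share   -- fraction of models in which the relation appears
    add_deltas    -- quality changes observed when its attributes were added
    remove_deltas -- quality changes observed when its attributes were removed
    min_share     -- hypothetical cut-off for "often appear in models"
    """
    frequent = model_share >= min_share
    improved = [d > 0 for d in add_deltas]
    reduced = [d < 0 for d in remove_deltas]
    # Level 1: adding always improved and removing always reduced quality.
    if frequent and improved and reduced and all(improved) and all(reduced):
        return 1
    # Level 2: the effect showed up only some of the time.
    if frequent and (any(improved) or any(reduced)):
        return 2
    return 3  # not supported by evidence


print(credibility_level(0.8, [5.0, 3.0], [-4.0, -2.0]))  # 1
print(credibility_level(0.8, [5.0, 0.0], [-4.0, 0.0]))   # 2
print(credibility_level(0.1, [5.0], [-4.0]))             # 3
```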
Quality Measures
• The decision trees are evaluated according to:
  – Accuracy
  – Corrected class probability estimate (CCPE)
  – Kappa
• The regression trees are evaluated according to:
  – Correlation coefficient
  – Relative absolute accuracy (RAA)
• In addition, trees are evaluated according to the total change in quality caused by adding and removing attributes.
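Two of the listed decision-tree measures have standard definitions that are easy to state in code. The sketch below computes accuracy and Cohen's kappa from actual and predicted class labels; it assumes the usual kappa formula (observed agreement corrected for chance agreement) and does not attempt CCPE or RAA, whose exact definitions are not given on the slide. The example labels are made up for illustration.

```python
from collections import Counter


def accuracy(actual, predicted):
    """Fraction of examples whose predicted class equals the actual class."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


def cohen_kappa(actual, predicted):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    n = len(actual)
    p_o = accuracy(actual, predicted)  # observed agreement
    actual_counts = Counter(actual)
    predicted_counts = Counter(predicted)
    # Chance agreement from the marginal class frequencies.
    p_e = sum(actual_counts[c] * predicted_counts[c] for c in actual_counts) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_o - p_e) / (1.0 - p_e)


# Made-up labels, not results from the dissertation.
actual = ["low", "low", "middle", "middle", "high", "high"]
predicted = ["low", "middle", "middle", "middle", "high", "high"]
print(accuracy(actual, predicted))
print(cohen_kappa(actual, predicted))
```

Here five of six predictions are correct and chance agreement is 1/3, so kappa is about 0.75.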
Experimental Evaluation
• Performed on three domains:
  1. Research and development (R&D)
  2. Higher education
  3. Automatic web genre identification
R&D Domain: Remove Attributes Graph
[Figure: removed attributes graph for the R&D domain. Labels: GERD-PC || GERD-GDP; RES-HC || RES-FTE; APP-NON-RES.]
Domains
• Higher education
  – Goal: An analysis of the impact of the higher education sector on the economic welfare of a country
  – DM methods: J48 and M5P trees
  – Data: 60 attributes; 167 examples: countries; class: GNI per capita
• Automatic web genre identification
  – Goal: Improve predictive performance by eliminating less-credible relations from J48 decision-tree models
  – Data: 500 attributes: words; 1,539 examples: web pages; class: 20 genres
R&D and Higher Education Domains – Credible Relations
R&D
• First level: increase the level of investment in the R&D sector
• Second level:
  – Increase the number of patents
  – Increase the number of researchers
  – Develop the business enterprise sector as the key leader in R&D activities
Higher education
• First level: stimulate participation in higher education and improve student exchange programs
• Second level:
  – Increase the level of investment in all levels of education (“low”)
  – Increase the number of graduates in science programs (“middle”)
  – Attract more foreign students (“middle”)
Evaluation
• User study with 22 participants:
  – 64% of participants did not recognize less-credible relations in a single model
  – When presented with credible models, all participants accepted them as better
Accuracy (%)
Data J48 HMDM
HI-EDU 71.86
R&D 63.47
Correlation coefficient
Data M5P HMDM
HI-EDU 0.681
R&D 0.722 0.787
F-Measure (Data: Genres)
Average J48 HMDM
Micro-AVG 0.280 0.370
Macro-AVG 0.284 0.377
Conclusions
• A novel method, Human-Machine Data Mining (HMDM), was designed that combines human understanding and raw computer power to extract credible relations from data.
• The HMDM method was evaluated on three complex domains, showing that:
  – the method is able to find important relations in data
  – credible models are better in quality than the models constructed by automatic DM methods
  – humans accept credible models
Contributions
• The main contributions:
  – A new method, Human-Machine Data Mining (HMDM), was designed for extracting credible relations from data
  – The CCPE statistical measure, originally conceived for classification rules, was extended to decision trees
  – Interactive explanation structures in the form of added and removed attributes graphs were designed to facilitate the extraction of credible relations
• Additional contributions:
  – A computer program was developed to support the HMDM method
  – Three real-life domains were analyzed