Idea Engineering


[email protected]’13

Oct’13

0. algorithm mining (yesterday)

1. landscape mining (today)

2. decision mining (tomorrow)

3. discussion mining (future)

The Premises of PROMISE (2005)

– Wanted: predictions
  • Nope. Users want decisions, or engagement



– Data mining will reveal “the truth” about SE
  • [Dejaeger: TSE’11], [Hall: TSE’12], [Shepperd: COW’13]
  • Not(Better learners = better conclusions)




– Sooner or later: enough data for general conclusions
  • Found more differences than generalities
  • Special issues: [IST’13], [ESEj’13]
  • Best papers, ASE’11, MSR’12
  • Menzies, Zimmermann et al. [TSE’13]
  • Lots of local models

Landscape mining: look before you leap

• Report what is true about the data
  – Not trivia on how algorithms walk that data
• Map the landscape
  – Reason on each part of the map
• E.g. landscape mining
  – Unsupervised iterative dichotomization
  – Cluster, prune
  – Then generate rules


• Different to “leap before you look”
  – i.e. skew learning by class variable
  – then study the results
• E.g. C4.5, CART, Fayyad-Irani, etc.
  – Supervised iterative dichotomization
• E.g. 61% of 300+ effort estimation papers
  – Algorithm tinkering, without end
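The difference between the two styles shows up in how a single split is chosen. The sketch below is illustrative only: `supervised_cut` is a simplified entropy-based cut (the full Fayyad-Irani method also recurses and applies an MDL stopping rule, both omitted here), and `unsupervised_cut` is a plain median split standing in for an unsupervised dichotomization step. The function names and the toy data are assumptions, not taken from the slides.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def supervised_cut(values, labels):
    """'Leap before you look': pick the cut that minimizes the
    size-weighted class entropy of the two halves."""
    pairs = sorted(zip(values, labels))
    best_cut, best_score = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot cut between two equal values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if score < best_score:
            best_cut = (pairs[i - 1][0] + pairs[i][0]) / 2
            best_score = score
    return best_cut

def unsupervised_cut(values):
    """'Look before you leap': cut at the median, never reading the class."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2 == 0:
        return (ordered[mid - 1] + ordered[mid]) / 2
    return ordered[mid]

# Hypothetical toy data: module size and whether a defect was found.
kloc = [1, 2, 3, 4, 10, 20, 30, 40]
buggy = ["n", "n", "n", "n", "n", "n", "y", "y"]
print(supervised_cut(kloc, buggy))  # 25.0: the cut chases the class boundary
print(unsupervised_cut(kloc))       # 7.0: the cut reflects only the data's shape
```

The supervised cut moves wherever the class variable pulls it; the unsupervised cut describes the data first and leaves the class for later inspection.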

Find landscape = cluster data, assign “heights”

Find decisions = report delta highs to lows

Monitor discussions = watch and help communities explore deltas

IDEA Engineering = <landscape, decisions, discussion>

Spectral Landscape Mining

• Spectrum = condition that is not limited to a specific set of values but varies in a continuum.
• Groups together a broad range of conditions or behaviors under one single title
• In mathematics, the spectrum of a (finite-dimensional) matrix is the set of its eigenvalues.
• Nyström algorithms: approximations to eigenvalues
  – FASTMAP: linear time
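A minimal sketch of a FASTMAP-style projection, assuming Euclidean distance and a median split at each level. Each projection pass needs only a linear number of distance calculations: pick any row, find the row furthest from it, find the row furthest from that one, then place every row on the line between those two pivots using the cosine rule. The names `fastmap_project` and `dichotomize`, and the `min_size` parameter, are illustrative choices rather than anything defined in the slides.

```python
import math
import random

def dist(a, b):
    """Euclidean distance between two numeric rows."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def furthest(p, rows):
    """The row furthest from p (one linear pass)."""
    return max(rows, key=lambda r: dist(p, r))

def fastmap_project(rows, rng):
    """Project every row onto the line between two distant pivots."""
    anyone = rng.choice(rows)
    east = furthest(anyone, rows)
    west = furthest(east, rows)
    c = dist(east, west)
    if c == 0:
        return [0.0 for _ in rows]  # all rows are identical
    xs = []
    for r in rows:
        a, b = dist(r, east), dist(r, west)
        xs.append((a * a + c * c - b * b) / (2 * c))  # cosine rule
    return xs

def dichotomize(rows, min_size, rng):
    """Unsupervised iterative dichotomization: split at the median
    projection, recurse on each half, return the leaf clusters."""
    if len(rows) < 2 * min_size:
        return [rows]
    xs = fastmap_project(rows, rng)
    order = sorted(range(len(rows)), key=lambda i: xs[i])
    mid = len(rows) // 2
    left = [rows[i] for i in order[:mid]]
    right = [rows[i] for i in order[mid:]]
    return dichotomize(left, min_size, rng) + dichotomize(right, min_size, rng)

# Two well-separated clumps of hypothetical 2-D rows.
rows = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
clusters = dichotomize(rows, 4, random.Random(1))
```

No class variable is consulted anywhere above, which is the point of "look before you leap".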

Project data on the first 2 PCA components; grid that data

e.g. Nasa93dem

1) Project 23 dimensions into 2
2a) Cluster
2b) Replace clusters with centroids
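Steps 2a and 2b can be sketched as follows, assuming the rows have already been projected to two dimensions (the projection itself is not shown) and that "cluster" means binning into an equal-width grid. The function name `grid_centroids` and the `bins` parameter are hypothetical.

```python
from collections import defaultdict

def grid_centroids(points, bins):
    """Bin 2-D points into a bins x bins grid and replace each
    occupied cell with the centroid of its members."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x_lo, x_hi, y_lo, y_hi = min(xs), max(xs), min(ys), max(ys)

    def cell(v, lo, hi):
        if hi == lo:
            return 0
        # Clamp so the maximum value lands in the last bin.
        return min(int((v - lo) / (hi - lo) * bins), bins - 1)

    cells = defaultdict(list)
    for x, y in points:
        cells[(cell(x, x_lo, x_hi), cell(y, y_lo, y_hi))].append((x, y))
    return {
        key: (sum(p[0] for p in members) / len(members),
              sum(p[1] for p in members) / len(members))
        for key, members in cells.items()
    }

points = [(0, 0), (1, 1), (9, 9), (10, 10)]
print(grid_centroids(points, 2))
# {(0, 0): (0.5, 0.5), (1, 1): (9.5, 9.5)}
```

Four points collapse to two centroids here; on real data the compression ratio depends on the grid size and on how the data clumps.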

MOEA: score = effort + defects + months
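One way to turn that score into a "height" per centroid is sketched below. The slide gives only the sum; the min-max normalization (so that objectives on different scales contribute equally) and the convention that lower is better are assumptions added for this sketch, as are the function names.

```python
def normalize(values):
    """Min-max normalize a list of numbers to the range 0..1."""
    lo, hi = min(values), max(values)
    return [0.0 if hi == lo else (v - lo) / (hi - lo) for v in values]

def heights(centroids):
    """Score each centroid as the sum of its normalized objectives.
    Assumption: all three objectives are minimized, so lower is better."""
    objectives = ("effort", "defects", "months")
    cols = [normalize([c[k] for c in centroids]) for k in objectives]
    return [sum(vals) for vals in zip(*cols)]

# Hypothetical centroids; the numbers are made up for illustration.
centroids = [
    {"effort": 100, "defects": 10, "months": 5},
    {"effort": 300, "defects": 30, "months": 15},
    {"effort": 200, "defects": 20, "months": 10},
]
print(heights(centroids))  # [0.0, 3.0, 1.5]
```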

Sanity check: What information loss?

• E.g. POI-3
  – 400+ examples
  – 20 centroids
• Prediction via:
  – Extrapolation between two nearest centroids
• Works as well as
  – Random forest, Naïve Bayes
    • For defect prediction (10 data sets)
  – Linear regression, M5’
    • For effort estimation (10 data sets)
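Prediction from centroids can be sketched as below. The slides do not give the exact formula, so this assumes a linear estimate along the line joining the two nearest centroids: the new row is placed on that line with the cosine rule, and the target value is read off at that position (positions outside the segment extrapolate). The `predict` name and the `(features, target)` centroid layout are hypothetical.

```python
import math

def dist(a, b):
    """Euclidean distance between two numeric rows."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def predict(row, centroids):
    """centroids: list of (features, target) pairs.
    Estimate the target of `row` from its two nearest centroids."""
    (f1, t1), (f2, t2) = sorted(centroids, key=lambda c: dist(row, c[0]))[:2]
    c = dist(f1, f2)
    if c == 0:
        return (t1 + t2) / 2  # coincident centroids: average their targets
    a, b = dist(row, f1), dist(row, f2)
    x = (a * a + c * c - b * b) / (2 * c)  # position of row along f1 -> f2
    return t1 + (t2 - t1) * x / c

# Hypothetical centroids: (features, effort).
centroids = [((0, 0), 10), ((10, 0), 30), ((100, 100), 500)]
print(predict((5, 0), centroids))    # 20.0, halfway between 10 and 30
print(predict((2.5, 0), centroids))  # 15.0, a quarter of the way along
```

Only the centroids are consulted at prediction time, so the 400+ original examples are no longer needed once the 20 centroids exist.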

Planning = Inter-cluster contrast sets

• Find the delta between neighbors that goes from worse to better
• Very small rules, found in log-linear time
• Menzies et al. [TSE’13]
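A contrast set between two neighboring clusters can be sketched as the attribute values that differ when moving from the worse cluster to the better one. The sketch assumes each cluster is summarized by discretized attribute values plus a score where lower is better; the data layout, the attribute names, and the functions `contrast` and `plans` are all hypothetical.

```python
def contrast(worse, better):
    """Attributes whose values differ: {name: (value_now, value_wanted)}."""
    return {k: (worse[k], better[k])
            for k in worse
            if k in better and worse[k] != better[k]}

def plans(clusters, neighbors):
    """clusters: {name: {"score": float, "attrs": dict}}, lower score is better.
    neighbors: pairs of adjacent cluster names.
    Returns (worse, better, delta) for every pair with a score difference."""
    out = []
    for a, b in neighbors:
        worse, better = sorted((a, b), key=lambda n: clusters[n]["score"],
                               reverse=True)
        if clusters[worse]["score"] > clusters[better]["score"]:
            delta = contrast(clusters[worse]["attrs"], clusters[better]["attrs"])
            out.append((worse, better, delta))
    return out

# Hypothetical cluster summaries.
clusters = {
    "c1": {"score": 0.9,
           "attrs": {"team": "large", "reuse": "low", "tools": "basic"}},
    "c2": {"score": 0.4,
           "attrs": {"team": "large", "reuse": "high", "tools": "basic"}},
}
print(plans(clusters, [("c1", "c2")]))
# [('c1', 'c2', {'reuse': ('low', 'high')})]
```

The resulting rule is small because neighboring clusters share most attribute values; only the differences are reported.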

Applications

• Prediction
• Planning
• Monitoring
• Multi-objective optimization
  – Cluster first on N objectives
• Anomaly detection
• Incremental theory revision
• Compression
• Privacy
• etc.


Beyond Data Mining, T. Menzies, IEEE Software, 2013, to appear


Q: why call it mining?

• A1: because all the primitives for the above are in the data mining literature
  – So we know how to get from here to there

• A2: because data mining scales
