8/10/2019 Kdd2014 Hamalainen Webb Discovery
1/116
Tutorial KDD14 New York
STATISTICALLY SOUND PATTERN DISCOVERY
Wilhelmiina Hämäläinen, University of Eastern Finland
Geoff Webb, Monash University, Australia
[email protected]
http://www.cs.joensuu.fi/pages/whamalai/kdd14/sspdtutorial.html
SSPD tutorial KDD14 p. 1
Statistically sound pattern discovery: Problem
[Figure: ideal world vs. real world. The POPULATION (clean and accurate, usually infinite) contains the REAL PATTERNS; the SAMPLE (may contain noise) yields the PATTERNS FOUND FROM THE SAMPLE (with some tool). How do the two relate?]
SSPD tutorial KDD14 p. 2
Statistically Sound vs. Unsound DM?
Pattern-type-first: Given a desired classical pattern, invent a search method.
Method-first: Invent a new pattern type which has an easy search method,
e.g., an antimonotonic interestingness property. Tricks to sell it:
overload statistical terms; don't specify exactly
SSPD tutorial KDD14 p. 4
Statistically Sound vs. Unsound DM?
Pattern-type-first: Given a desired classical pattern, invent a search method.
+ easy to interpret correctly
+ informative
+ likely to hold in the future
− computationally demanding
Method-first: Invent a new pattern type which has an easy search method.
− difficult to interpret
− misleading information
− no guarantees on validity
+ computationally easy
SSPD tutorial KDD14 p. 5
Statistically sound pattern discovery: Scope
[Scope diagram: statistical dependency patterns. PATTERNS: dependency rules (Part I, Wilhelmiina) and correlated itemsets (Part II, Geoff). MODELS: log-linear models, classifiers. Other patterns (time series? graphs?): Discussion.]
SSPD tutorial KDD14 p. 6
Contents
Overview (statistical dependency patterns)
Part I
  Dependency rules
  Statistical significance testing
  Significance of improvement
Coffee break (10:00-10:30)
Part II
  Correlated itemsets (self-sufficient itemsets)
  Significance tests for genuine set dependencies
Discussion
SSPD tutorial KDD14 p. 7
Statistical dependence: Many interpretations!
Events (X = x) and (Y = y) are statistically independent if P(X = x, Y = y) = P(X = x)P(Y = y).
When are variables (or variable-value combinations) statistically dependent?
When is the dependency genuine? → measures for the strength and significance of dependence
How to define mutual dependence between three or more variables?
SSPD tutorial KDD14 p. 8
Statistical dependence: 3 main interpretations
Let A, B, C be binary variables. Notate ¬A ≡ (A = 0) and A ≡ (A = 1).
1. Dependency rule AB → C: it must be that δ = P(ABC) − P(AB)P(C) > 0 (positive dependence).
2. Full probability model:
δ1 = P(ABC) − P(AB)P(C),
δ2 = P(¬ABC) − P(¬AB)P(C),
δ3 = P(A¬BC) − P(A¬B)P(C) and
δ4 = P(¬A¬BC) − P(¬A¬B)P(C).
If δ1 = δ2 = δ3 = δ4 = 0, no dependence.
Otherwise decide from the δi (i = 1, ..., 4) (with some equation).
SSPD tutorial KDD14 p. 9
Statistical dependence: 3 interpretations
3. Correlated set ABC. Starting point, mutual independence:
P(A = a, B = b, C = c) = P(A = a)P(B = b)P(C = c) for all a, b, c ∈ {0, 1}.
Different variations (and names)! E.g.
(i) P(ABC) > P(A)P(B)P(C) (positive dependence), or
(ii) P(A = a, B = b, C = c) ≠ P(A = a)P(B = b)P(C = c) for some a, b, c ∈ {0, 1}
+ extra criteria.
In addition, conditional independence is sometimes useful:
P(B = b, C = c | A = a) = P(B = b | A = a)P(C = c | A = a)
SSPD tutorial KDD14 p. 10
Statistical dependence: no single correct definition
"One of the most important problems in the philosophy of natural sciences is (in addition to the well-known one regarding the essence of the concept of probability itself) to make precise the premises which would make it possible to regard any given real events as independent."
A.N. Kolmogorov
SSPD tutorial KDD14 p. 11
Part I Contents
1. Statistical dependency rules
2. Variable- and value-based interpretations
3. Statistical significance testing
  3.1 Approaches
  3.2 Sampling models
  3.3 Multiple testing problem
4. Redundancy and significance of improvement
5. Search strategies
SSPD tutorial KDD14 p. 12
1. Statistical dependency rules
Requirements for a genuine statistical dependency rule X → A:
(i) Statistical dependence
(ii) Statistically significant → likely not due to chance
(iii) Non-redundant → not a side-product of another dependency; added value
Why?
SSPD tutorial KDD14 p. 13
Example: Dependency rules on atherosclerosis
1. Statistical dependencies:
smoking → atherosclerosis
sports → atherosclerosis
ABCA1-R219K → atherosclerosis?
2. Statistical significance?
spruce sprout extract → atherosclerosis?
dark chocolate → atherosclerosis
3. Redundancy?
stress, smoking → atherosclerosis
smoking, coffee → atherosclerosis?
high cholesterol, sports → atherosclerosis?
male, male pattern baldness → atherosclerosis?
SSPD tutorial KDD14 p. 14
Part I Contents
1. Statistical dependency rules
2. Variable- and value-based interpretations
3. Statistical significance testing
  3.1 Approaches
  3.2 Sampling models
  3.3 Multiple testing problem
4. Redundancy and significance of improvement
5. Search strategies
SSPD tutorial KDD14 p. 15
2. Variable-based vs. Value-based interpretation
Meaning of the dependency rule X → A:
1. Variable-based: dependency between binary variables X and A.
Positive dependency X → A is the same as ¬X → ¬A.
Equally strong as the negative dependency between X and ¬A (or ¬X and A).
2. Value-based: positive dependency between values X = 1 and A = 1;
different from ¬X → ¬A, which may be weak!
SSPD tutorial KDD14 p. 16
Strength of statistical dependence
The most common measures:
1. Variable-based: leverage δ(X, A) = P(XA) − P(X)P(A)
2. Value-based: lift γ(X, A) = P(XA) / (P(X)P(A)) = P(A|X) / P(A) = P(X|A) / P(X)
P(A|X) = confidence of the rule.
Remember: X ≡ (X = 1) and A ≡ (A = 1).
SSPD tutorial KDD14 p. 17
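The measures above can be sketched directly from the contingency counts. This is an illustrative sketch with our own function and variable names, not code from the tutorial:

```python
def strength_measures(n, fr_x, fr_a, fr_xa):
    """Leverage, lift and confidence for a rule X -> A, from the counts
    n, fr(X), fr(A) and fr(XA), following the slide's definitions."""
    p_x, p_a, p_xa = fr_x / n, fr_a / n, fr_xa / n
    leverage = p_xa - p_x * p_a      # delta(X, A) = P(XA) - P(X)P(A)
    lift = p_xa / (p_x * p_a)        # gamma(X, A) = P(XA) / (P(X)P(A))
    confidence = p_xa / p_x          # P(A | X)
    return leverage, lift, confidence
```

On the apple example of slide 20 (n = 100, fr(Y) = 60, fr(A) = 55, fr(YA) = 55), this gives leverage 0.22 and lift ≈ 1.67, matching the slide.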
Contingency table
     | A                          | ¬A                           | All
X    | fr(XA) = n[P(X)P(A) + δ]   | fr(X¬A) = n[P(X)P(¬A) − δ]   | fr(X)
¬X   | fr(¬XA) = n[P(¬X)P(A) − δ] | fr(¬X¬A) = n[P(¬X)P(¬A) + δ] | fr(¬X)
All  | fr(A)                      | fr(¬A)                       | n

All value combinations have the same |δ|!
γ depends on the value combination.
fr(X) = absolute frequency of X; P(X) = relative frequency of X.
SSPD tutorial KDD14 p. 18
Example: The Apple problem
Variables: Taste, smell, colour, size, weight, variety, grower,. . .
100 apples (55 sweet + 45 bitter)
SSPD tutorial KDD14 p. 19
Rule RED → SWEET (Y → A): P(A|Y) = 0.92, P(¬A|¬Y) = 1.0; δ = 0.22, γ = 1.67.
(A = sweet, ¬A = bitter; Y = red, ¬Y = green)
Basket 1: 60 red apples (55 sweet). Basket 2: 40 green apples (all bitter).
SSPD tutorial KDD14 p. 20
Rule RED and BIG → SWEET (X → A): P(A|X) = 1.0, P(¬A|¬X) = 0.75; δ = 0.18, γ = 1.82.
(X = red ∧ big, ¬X = green ∨ small)
Basket 1: 40 large red apples (all sweet). Basket 2: 40 green + 20 small red apples (45 bitter).
SSPD tutorial KDD14 p. 21
When could the value-based interpretation be useful? An example
D = disease, X = allele combination. P(X) is small and P(D|X) = 1.0.
γ(X, D) = P(D|X) / P(D) = 1 / P(D) can be large, but
δ(X, D) = P(XD) − P(X)P(D) = P(X)(1 − P(D)) is small.
Now the dependency is strong in the value-based but weak in the variable-based interpretation!
(Usually, variable-based dependencies tend to be more reliable.)
SSPD tutorial KDD14 p. 22
Part I Contents
1. Statistical dependency rules
2. Variable- and value-based interpretations
3. Statistical significance testing
  3.1 Approaches
  3.2 Sampling models
  3.3 Multiple testing problem
4. Redundancy and significance of improvement
5. Search strategies
SSPD tutorial KDD14 p. 23
3. Statistical significance of X → A
What is the probability of the observed or a stronger dependency, if X and A were independent? If the probability is small, then X → A is likely genuine (not due to chance).
A significant X → A is likely to hold in the future (in similar data sets).
How to estimate the probability? How small should the probability be?
→ Fisherian vs. Neyman-Pearsonian schools; the multiple testing problem
SSPD tutorial KDD14 p. 24
3.1 Main approaches
[Diagram: SIGNIFICANCE TESTING splits into EMPIRICAL and ANALYTIC approaches, and into FREQUENTIST and BAYESIAN (different schools, different sampling models).]
SSPD tutorial KDD14 p. 25
Analytic approaches
H0: X and A independent (null hypothesis)
H1: X and A positively dependent (research hypothesis)
Frequentist: Calculate p = P(observed or stronger dependency | H0).
Bayesian:
(i) Set P(H0) and P(H1)
(ii) Calculate P(observed or stronger dependency | H0) and P(observed or stronger dependency | H1)
(iii) Derive (with Bayes' rule) P(H0 | observed or stronger dependency) and P(H1 | observed or stronger dependency)
SSPD tutorial KDD14 p. 26
Analytic approaches: pros and cons
+ p-values relatively fast to calculate
+ can be used as search criteria
− How to define the distribution under H0? (assumptions)
− If the data is not representative, the discoveries cannot be generalized to the whole population
  → describe only the sample data or other similar samples
  → random samples are not always possible (infinite population)
SSPD tutorial KDD14 p. 27
Note: Differences between the Fisherian vs. Neyman-Pearsonian schools
significance testing vs. hypothesis testing
the role of nominal p-values (thresholds 0.05, 0.01)
many textbooks present a hybrid approach
see Hubbard & Bayarri
SSPD tutorial KDD14 p. 28
Empirical approach (randomization testing)
Generate random data sets according to H0 and test how many of them contain the observed or a stronger dependency X → A.
(i) Fix a permutation scheme (how to express H0 + which properties of the original data should hold)
(ii) Generate a random subset {d1, ..., db} of all possible permutations
(iii) p = |{di | di contains the observed or a stronger dependency}| / b
SSPD tutorial KDD14 p. 29
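Steps (i)-(iii) can be sketched as follows. This is our own minimal illustration, not the tutorial's code: the permutation scheme shuffles the A column (fixing n, fr(X) and fr(A)), and leverage is used as the strength measure.

```python
import random

def permutation_p(x, a, b=1000, seed=0):
    """Empirical p-value of the observed or a stronger positive
    dependency X -> A, measured by leverage, over b random permutations."""
    n = len(x)

    def leverage(xs, ys):
        fr_xa = sum(1 for xi, yi in zip(xs, ys) if xi and yi)
        return fr_xa / n - (sum(xs) / n) * (sum(ys) / n)

    observed = leverage(x, a)
    rng = random.Random(seed)
    perm = list(a)
    hits = 0
    for _ in range(b):
        rng.shuffle(perm)                  # one random data set d_i
        if leverage(x, perm) >= observed:  # observed or stronger dependency
            hits += 1
    return hits / b
```

With perfectly correlated 0/1 columns the empirical p-value is (nearly always) very small, as expected.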
Empirical approach: pros and cons
+ no assumptions about any underlying parametric distribution
+ can test null hypotheses for which no closed-form test exists
+ offers an approach to the multiple testing problem → later
+ the data doesn't have to be a random sample → discoveries hold for the whole population ...
− ... defined by the permutation scheme
− often not clear (but critical) how to permute the data!
− computationally heavy (b: efficiency vs. quality trade-off)
− How to apply during search??
SSPD tutorial KDD14 p. 30
Note: Randomization test vs. Fisher's exact test
When testing the significance of X → A:
a natural permutation scheme fixes N = n, N_X = fr(X), N_A = fr(A)
a randomization test generates some random contingency tables with these constraints
a full permutation test = Fisher's exact test, which studies all contingency tables
→ faster to compute (analytically), produces more reliable results
No need for randomization tests here!
SSPD tutorial KDD14 p. 31
Part I Contents
1. Statistical dependency rules
2. Variable- and value-based interpretations
3. Statistical significance testing
  3.1 Approaches
  3.2 Sampling models
    variable-based
    value-based
  3.3 Multiple testing problem
4. Redundancy and significance of improvement
5. Search strategies
SSPD tutorial KDD14 p. 32
3.2 Sampling models
= defining the distribution under H0
What do we assume fixed?
Variable-based dependencies: classical sampling models (statistics)
Value-based dependencies: several suggestions (data mining)
SSPD tutorial KDD14 p. 33
Basic idea
Given a sampling model M, let T = the set of all possible contingency tables.
1. Define the probability P(Ti | M) for contingency tables Ti ∈ T.
2. Define an extremeness relation Ti ⪰ Tj: Ti contains at least as strong a dependency X → A as Tj does; depends on the strength measure, e.g. δ (variable-based) or γ (value-based).
3. Calculate p = Σ_{Ti ⪰ T0} P(Ti | M)   (T0 = our table).
SSPD tutorial KDD14 p. 34
Sampling models for variable-based dependencies
3 basic models:
1. Multinomial (N = n fixed)
2. Double binomial (N = n, N_X = fr(X) fixed)
3. Hypergeometric (→ Fisher's exact test) (N = n, N_A = fr(A), N_X = fr(X) fixed)
+ asymptotic measures (like χ²)
SSPD tutorial KDD14 p. 35
Multinomial model
Independence assumption: In the infinite urn, p_XA = p_X · p_A.
(p_XA = probability of red sweet apples)
[Figure: a sample of n apples drawn from an INFINITE URN]
SSPD tutorial KDD14 p. 36
Multinomial model
Ti is defined by the random variables N_XA, N_X¬A, N_¬XA, N_¬X¬A:
P(N_XA, N_X¬A, N_¬XA, N_¬X¬A | n, p_X, p_A) = C(n; N_XA, N_X¬A, N_¬XA, N_¬X¬A) · p_X^{N_X} (1 − p_X)^{n − N_X} · p_A^{N_A} (1 − p_A)^{n − N_A},
where C(n; ...) is the multinomial coefficient.
p = Σ_{Ti ⪰ T0} P(N_XA, N_X¬A, N_¬XA, N_¬X¬A | n, p_X, p_A)
p_X and p_A can be estimated from the data.
SSPD tutorial KDD14 p. 37
Double binomial model
Independence assumption: p_{A|X} = p_A = p_{A|¬X}.
[Figure: TWO INFINITE URNS: a sample of fr(X) red apples from one urn and a sample of fr(¬X) green apples from the other]
SSPD tutorial KDD14 p. 38
Double binomial model
Probability of red sweet apples:
P(N_XA | fr(X), p_A) = C(fr(X), N_XA) · p_A^{N_XA} (1 − p_A)^{fr(X) − N_XA}
Probability of green sweet apples:
P(N_¬XA | fr(¬X), p_A) = C(fr(¬X), N_¬XA) · p_A^{N_¬XA} (1 − p_A)^{fr(¬X) − N_¬XA}
SSPD tutorial KDD14 p. 39
Double binomial model
Ti is defined by the variables N_XA and N_¬XA:
P(N_XA, N_¬XA | n, fr(X), fr(¬X), p_A) = C(fr(X), N_XA) · C(fr(¬X), N_¬XA) · p_A^{N_A} (1 − p_A)^{n − N_A}
p = Σ_{Ti ⪰ T0} P(N_XA, N_¬XA | n, fr(X), fr(¬X), p_A)
SSPD tutorial KDD14 p. 40
Hypergeometric model (Fishers exact test)
How many other similar urns have at least as strong a dependency as ours?
[Figure: OUR URN with n apples (fr(A) sweet + fr(¬A) bitter, fr(X) red + fr(¬X) green) among ALL C(n, fr(A)) SIMILAR URNS]
SSPD tutorial KDD14 p. 41
Like in a full permutation test
[Figure: a full permutation test with n = 10 transactions, fr(A) = 3 and fr(X) = 6: the C(10, 3) = 120 similar urns (urn 1, urn 2, ..., urn 120) correspond to all possible placements of the three A's among the ten transactions.]
SSPD tutorial KDD14 p. 42
Hypergeometric model (Fishers exact test)
The number of all possible similar urns (fixed N = n, N_X = fr(X) and N_A = fr(A)) is
Σ_{i=0}^{fr(A)} C(fr(X), i) · C(fr(¬X), fr(A) − i) = C(n, fr(A)).
Now (Ti ⪰ T0) ⇔ (N_XA ≥ fr(XA)). Easy!
p_F = Σ_{i≥0} C(fr(X), fr(XA) + i) · C(fr(¬X), fr(¬X¬A) + i) / C(n, fr(A))
SSPD tutorial KDD14 p. 43
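The one-sided Fisher p-value p_F has a direct sketch in pure Python; function and variable names are our own, not the tutorial's:

```python
from math import comb

def fisher_p(n, fr_x, fr_a, fr_xa):
    """p_F = P(N_XA >= fr(XA)) in the hypergeometric model with the
    margins n, fr(X) and fr(A) fixed (Fisher's exact test, one-sided)."""
    total = comb(n, fr_a)  # number of all similar urns
    return sum(comb(fr_x, i) * comb(n - fr_x, fr_a - i)
               for i in range(fr_xa, min(fr_x, fr_a) + 1)) / total
```

Summing from fr(XA) = 0 covers every table, so the p-value is then 1 (a useful sanity check); `math.comb` returns 0 when a cell count would be infeasible, which silently drops impossible tables.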
Example: Comparison of p-values
[Plot: p as a function of fr(XA) in the range 15-29, for fr(X) = 50, fr(A) = 30, n = 100; curves for Fisher, double binomial and multinomial.]
SSPD tutorial KDD14 p. 44
Example: Comparison of p-values
[Plot: p as a function of fr(XA) in the range 200-250, for fr(X) = 300, fr(A) = 500, n = 1000; curves for Fisher, double binomial and multinomial.]
SSPD tutorial KDD14 p. 45
Example: Comparison of p-values
fr(XA) | multinomial | double binomial | Fisher (hypergeom.)
180    | 1.7e-05     | 1.8e-05         | 2.2e-05
200    | 2.3e-12     | 2.2e-12         | 3.0e-12
220    | 1.4e-22     | 7.3e-23         | 1.1e-22
240    | 2.9e-36     | 3.0e-37         | 4.4e-37
260    | 1.5e-53     | 4.2e-56         | 3.5e-56
280    | 1.3e-74     | 2.9e-80         | 1.6e-81
300    | 9.3e-100    | 3.5e-111        | 2.5e-119
SSPD tutorial KDD14 p. 46
Asymptotic measures
Idea: p-values are estimated indirectly.
1. Select some nicely behaving measure M, e.g. one that asymptotically follows the normal or the χ² distribution.
2. Estimate P(M ≥ val), where M = val in our data.
Easy! (look it up in statistical tables)
But the accuracy can be poor.
SSPD tutorial KDD14 p. 47
The χ²-measure
χ² = Σ_{i=0}^{1} Σ_{j=0}^{1} n (P(X = i, A = j) − P(X = i)P(A = j))² / (P(X = i)P(A = j))
   = n (P(X, A) − P(X)P(A))² / (P(X)P(¬X)P(A)P(¬A))
   = n δ² / (P(X)P(¬X)P(A)P(¬A))
− very sensitive to underlying assumptions!
− all P(X = i)P(A = j) should be sufficiently large
− the corresponding hypergeometric distribution shouldn't be too skewed
SSPD tutorial KDD14 p. 48
Mutual information
MI = log [ P(XA)^{P(XA)} P(X¬A)^{P(X¬A)} P(¬XA)^{P(¬XA)} P(¬X¬A)^{P(¬X¬A)} / ( P(X)^{P(X)} P(¬X)^{P(¬X)} P(A)^{P(A)} P(¬A)^{P(¬A)} ) ]
2n · MI = the log-likelihood ratio
→ follows asymptotically the χ²-distribution
→ usually gives more reliable results than the χ²-measure
SSPD tutorial KDD14 p. 49
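Both asymptotic measures can be sketched from the four cell counts of the 2x2 table. This is our own illustration (names included), using the n·δ² form of χ² and the equivalent sum form MI = Σ P(cell)·log(P(cell)/(P(row)P(col))):

```python
from math import log

def chi2_and_g(fr_xa, fr_xna, fr_nxa, fr_nxna):
    """Return (chi^2, G) for a 2x2 table, where G = 2n * MI is the
    log-likelihood ratio statistic; empty cells contribute 0 to MI."""
    n = fr_xa + fr_xna + fr_nxa + fr_nxna
    fr_x, fr_a = fr_xa + fr_xna, fr_xa + fr_nxa
    p_x, p_a = fr_x / n, fr_a / n
    delta = fr_xa / n - p_x * p_a
    chi2 = n * delta ** 2 / (p_x * (1 - p_x) * p_a * (1 - p_a))
    mi = sum((obs / n) * log(obs * n / (row * col))
             for obs, row, col in [(fr_xa, fr_x, fr_a),
                                   (fr_xna, fr_x, n - fr_a),
                                   (fr_nxa, n - fr_x, fr_a),
                                   (fr_nxna, n - fr_x, n - fr_a)]
             if obs > 0)
    return chi2, 2 * n * mi
```

For a perfectly independent table both statistics are 0; for the slide-21 apple table (40, 0, 15, 45) both are large.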
Comparison: Sampling models for variable-based dependencies
Multinomial: impractical, but useful for theoretical results.
Double binomial: not exchangeable, p(X → A) ≠ p(A → X) (in general).
Hypergeometric (Fisher's exact test): recommended; enables efficient search, reliable results.
Asymptotic: often sensitive to underlying assumptions.
  χ²: very sensitive, not recommended.
  MI: reliable, enables efficient search, approximates p_F.
SSPD tutorial KDD14 p. 50
Sampling models for value-based dependencies
Main choices:
1. Classical sampling models, but with a different extremeness relation
  → use lift γ to define a stronger dependency
  Multinomial and double binomial: can differ much from the variable-based versions
  Hypergeometric: leads to Fisher's exact test, again!
2. Binomial models + corresponding asymptotic measures
SSPD tutorial KDD14 p. 51
Binomial model 1 (classical binomial test)
The probability of getting exactly N_XA sweet red apples and n − N_XA green or bitter apples is
p(N_XA | n, p_XA) = C(n, N_XA) (p_XA)^{N_XA} (1 − p_XA)^{n − N_XA}
p(N_XA ≥ fr(XA) | n, p_XA) = Σ_{i=fr(XA)}^{n} C(n, i) (p_XA)^i (1 − p_XA)^{n−i}
(or i = fr(XA), ..., min{fr(X), fr(A)})
Use the estimate p_XA = P(X)P(A).
Note: N_X and N_A are unfixed.
SSPD tutorial KDD14 p. 53
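The classical binomial test above can be sketched directly; this is our own illustration, using the independence estimate p_XA = P(X)P(A) and the upper limit i = n:

```python
from math import comb

def binomial_p(n, fr_x, fr_a, fr_xa):
    """p(N_XA >= fr(XA) | n, p_XA) with p_XA estimated as P(X)P(A);
    unlike Fisher's test, N_X and N_A are left unfixed."""
    p_xa = (fr_x / n) * (fr_a / n)
    return sum(comb(n, i) * p_xa ** i * (1 - p_xa) ** (n - i)
               for i in range(fr_xa, n + 1))
```

The tail sum over the whole support is 1 (sanity check), and it shrinks strictly as the threshold fr(XA) grows.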
Corresponding asymptotic measure
z-score:
z1(X → A) = (fr(X, A) − nP(X)P(A)) / √(nP(X)P(A)(1 − P(X)P(A)))
= √n · δ(X, A) / √(P(X)P(A)(1 − P(X)P(A)))
= √(nP(XA)) · (γ(X, A) − 1) / √(γ(X, A) − P(X, A))
→ follows asymptotically the normal distribution
SSPD tutorial KDD14 p. 54
Binomial model 2 (suggested in data mining)
Like the double binomial model, but forget the other urn!
[Figure: CONSIDER ONE OF THE TWO INFINITE URNS: the sample of fr(X) red apples; the sample of fr(¬X) green apples is ignored]
SSPD tutorial KDD14 p. 55
Binomial model 2
p(N_XA ≥ fr(XA) | fr(X), P(A)) = Σ_{i=fr(XA)}^{fr(X)} C(fr(X), i) P(A)^i P(¬A)^{fr(X) − i}
Corresponding z-score:
z2 = (fr(XA) − fr(X)P(A)) / √(fr(X)P(A)P(¬A))
= √n · δ(X, A) / √(P(X)P(A)P(¬A))
= √(fr(X)) · (P(A|X) − P(A)) / √(P(A)P(¬A))
SSPD tutorial KDD14 p. 56
J-measure
A one-urn version of MI:
J = P(XA) · log[ P(XA) / (P(X)P(A)) ] + P(X¬A) · log[ P(X¬A) / (P(X)P(¬A)) ]
SSPD tutorial KDD14 p. 57
Example: Comparison of p-values
[Two plots: p as a function of fr(XA) in the range 19-25; left: fr(X) = 25, fr(A) = 75, n = 100; right: fr(X) = 75, fr(A) = 25, n = 100; curves for bin1, bin2, Fisher, double binomial and multinomial.]
SSPD tutorial KDD14 p. 58
Comparison: Sampling models for value-based dependencies
Multinomial, hypergeometric, classical binomial + its z-score: p(X → A) = p(A → X).
Double binomial, alternative binomial + its z-score: p(X → A) ≠ p(A → X) (in general).
The alternative binomial, its z-score and J can disagree with the other measures (only the X-urn vs. the whole data).
z-scores are easy to integrate into search, but may be unreliable for infrequent patterns.
The (classical) binomial test in post-pruning improves quality!
SSPD tutorial KDD14 p. 59
Part I Contents
1. Statistical dependency rules
2. Variable- and value-based interpretations
3. Statistical significance testing
  3.1 Approaches
  3.2 Sampling models
  3.3 Multiple testing problem
4. Redundancy and significance of improvement
5. Search strategies
SSPD tutorial KDD14 p. 60
3.3 Multiple testing problem
The more patterns we test, the more spurious patterns we are likely to accept.
If the threshold is α = 0.05, there is a 5% probability that a spurious dependency passes the test.
If we test 10,000 rules, we are likely to accept 500 spurious rules!
SSPD tutorial KDD14 p. 61
Solutions to Multiple testing problem
1. Direct adjustment approach: adjust α (stricter thresholds)
  → easiest to integrate into the search
2. Holdout approach: save part of the data for testing → Webb
3. Randomization test approaches: estimate the overall significance of all discoveries, or adjust the individual p-values empirically
  → e.g. Gionis et al., Hanhijärvi et al.
SSPD tutorial KDD14 p. 62
Contingency table for m significance tests
                       | spurious rule (H0 true) | genuine rule (H1 true) | All
declared significant   | V (false positives)     | S (true positives)     | R
declared insignificant | U (true negatives)      | T (false negatives)    | m − R
All                    | m0                      | m − m0                 | m
SSPD tutorial KDD14 p. 63
Direct adjustment: Two approaches
(i) Control the familywise error rate = the probability of accepting at least one false discovery: FWER = P(V ≥ 1)
(ii) Control the false discovery rate = the expected proportion of false discoveries: FDR = E[V/R]
(V, S, R, U, T, m0 and m as in the contingency table for m significance tests)
SSPD tutorial KDD14 p. 64
(i) Control familywise error rate FWER
Decide α = FWER and calculate a new, stricter threshold α*.
If the tests are mutually independent: α = 1 − (1 − α*)^m
→ Šidák correction: α* = 1 − (1 − α)^{1/m}
If they are not independent: α ≤ m·α*
→ Bonferroni correction: α* = α/m
− conservative (may lose genuine discoveries)
− How to estimate m? (there may be explicit and implicit testing during the search)
The Holm-Bonferroni method is more powerful, but less suitable for the search (all p-values should be known first).
SSPD tutorial KDD14 p. 65
(ii) Control false discovery rate FDR
Benjamini-Hochberg-Yekutieli procedure:
1. Decide q = FDR.
2. Order the patterns ri by their p-values → result r1, ..., rm such that p1 ≤ ... ≤ pm.
3. Search for the largest k such that pk ≤ k·q / (m·c(m)):
  if the tests are mutually independent or positively dependent, c(m) = 1;
  otherwise c(m) = Σ_{i=1}^{m} 1/i ≈ ln(m) + 0.58.
4. Save the patterns r1, ..., rk (as significant) and reject r_{k+1}, ..., rm.
SSPD tutorial KDD14 p. 66
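Steps 1-4 of the procedure can be sketched as follows; this is our own illustration, returning only the cutoff k rather than the pattern list:

```python
def bh_accept(p_values, q, independent=True):
    """Number k of p-values declared significant by the step-up
    procedure: the largest k with p_k <= k*q/(m*c(m))."""
    m = len(p_values)
    c = 1.0 if independent else sum(1 / i for i in range(1, m + 1))
    k = 0
    for i, p in enumerate(sorted(p_values), start=1):
        if p <= i * q / (m * c):  # threshold grows linearly with the rank
            k = i
    return k
```

Note how the Yekutieli factor c(m) makes the dependent-case thresholds harmonically smaller, so fewer patterns survive.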
Hold-out approach
Powerful because m is quite small!
[Diagram: the data is split into an exploratory set and a holdout set. Pattern discovery on the exploratory set produces candidate patterns; statistical evaluation on the holdout set (any hypothesis test + multiple testing correction, limited type-2 error) yields the sound patterns.]
SSPD tutorial KDD14 p. 67
Randomization test approaches
1. Estimate the overall significance of the discoveries at once.
E.g.: What is the probability of finding K ≥ K0 dependency rules whose strength is at least min_M?
Empirical p-value:
p_emp = (|{di | Ki ≥ K0}| + 1) / (b + 1)
d0 = the original set; d1, ..., db = random sets; K1, ..., Kb = the numbers of patterns discovered from set di
→ Gionis et al.
SSPD tutorial KDD14 p. 68
Randomization test approaches (cont.)
2. Use randomization tests to correct individual p-values.
E.g.: How many sets contained better rules than X → A?
p = (|{di | min_{(Y → B) ∈ Si} p(Y → B | di) ≤ p(X → A | d0)}| + 1) / (b + 1)
d0 = the original set; d1, ..., db = random sets; Si = the set of patterns returned from set di
→ Hanhijärvi
SSPD tutorial KDD14 p. 69
Randomization test approaches
+ dependencies between the patterns are not a problem → more powerful control over FWER
+ one can impose extra constraints (e.g. that a certain pattern holds with a given frequency and confidence)
− most techniques assume subset pivotality: the complete null hypothesis and all subsets of true null hypotheses have the same distribution of the test statistic
Remember also the points mentioned for single hypothesis testing.
SSPD tutorial KDD14 p. 70
Part I Contents
1. Statistical dependency rules
2. Variable- and value-based interpretations
3. Statistical significance testing
  3.1 Approaches
  3.2 Sampling models
  3.3 Multiple testing problem
4. Redundancy and significance of improvement
5. Search strategies
SSPD tutorial KDD14 p. 71
4. Redundancy and significance of improvement
When is X → A redundant with respect to Y → A (Y ⊊ X)? When does it improve on it significantly?
Examples of redundant dependency rules:
smoking, coffee → atherosclerosis (coffee has no effect on smoking → atherosclerosis)
high cholesterol, sports → atherosclerosis (sports makes the dependency only weaker)
male, male pattern baldness → atherosclerosis (adding male gives hardly any significant improvement)
SSPD tutorial KDD14 p. 72
Redundancy and significance of improvement
Value-based: X → A is productive if P(A|X) > P(A|Y) for all Y ⊊ X.
Variable-based: X → A is redundant if there is Y ⊊ X such that M(Y → A) is better than M(X → A) with the given goodness measure M.
X → A is non-redundant if for all Y ⊊ X, M(X → A) is better than M(Y → A).
When is the improvement significant?
SSPD tutorial KDD14 p. 73
Value-based: Significance of productivity
Hypergeometric model:
Hypergeometric model:
p(YQ → A | Y → A) = Σ_i C(fr(YQ), fr(YQA) + i) · C(fr(Y¬Q), fr(Y¬QA) − i) / C(fr(Y), fr(YA))
= the probability of the observed or a stronger conditional dependency Q → A, given Y, in the value-based model
→ also asymptotic measures (χ², MI)
SSPD tutorial KDD14 p. 74
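This conditional p-value is Fisher's exact test restricted to the rows covered by Y, and can be sketched accordingly. Names are our own illustration:

```python
from math import comb

def productivity_p(fr_y, fr_ya, fr_yq, fr_yqa):
    """p(YQ -> A | Y -> A): one-sided hypergeometric tail within the
    fr(Y) rows covered by Y, with fr(YA) and fr(YQ) fixed."""
    total = comb(fr_y, fr_ya)
    return sum(comb(fr_yq, i) * comb(fr_y - fr_yq, fr_ya - i)
               for i in range(fr_yqa, min(fr_yq, fr_ya) + 1)) / total
```

On the apple example of the next slide (fr(Y) = 60 red apples, fr(YA) = 55 sweet, fr(YQ) = 40 large red, fr(YQA) = 40 all sweet), this reproduces p ≈ 0.0029.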
Apple problem: value-based
p(YQ → A | Y → A) = 0.0029   (Y = red, Q = large)
Basket 1: 40 large red apples (all sweet) + 20 small red apples (15 sweet). Basket 2: 40 green apples (all bitter).
SSPD tutorial KDD14 p. 75
Apple problem: variable-based?
p(Y → A | (YQ) → A) = 2.9e-10
Part I Contents
1. Statistical dependency rules
2. Variable- and value-based interpretations
3. Statistical significance testing
  3.1 Approaches
  3.2 Sampling models
  3.3 Multiple testing problem
4. Redundancy and significance of improvement
5. Search strategies
SSPD tutorial KDD14 p. 78
5. Search strategies
1. Search for the strongest rules (with γ, δ, etc.) that pass the significance test for productivity
  → MagnumOpus (Webb 2005)
2. Search for the most significant non-redundant rules (with Fisher's p_F etc.)
  → Kingfisher (Hämäläinen 2012)
3. Search for frequent sets, construct association rules, prune with statistical measures, and filter non-redundant rules??
  → No way!
  closed sets? → redundancy problem
  their minimal generators?
SSPD tutorial KDD14 p. 79
Main problem: non-monotonicity of statistical dependence
AB → C can express a significant dependency even if A and C, as well as B and C, are mutually independent.
In the worst case, the only significant dependency involves all attributes A1, ..., Ak (e.g. A1 ... A_{k−1} → Ak).
1) A greedy heuristic does not work!
2) Studying only the simplest dependency rules does not reveal everything!
ABCA1-R219K → alzheimer vs. ABCA1-R219K, female → alzheimer
SSPD tutorial KDD14 p. 80
End of Part I
Questions?
SSPD tutorial KDD14 p. 81
Statistically Sound Pattern Discovery
Part II: Itemsets
Wilhelmiina Hämäläinen
Geoff Webb
http://www.cs.joensuu.fi/~whamalai/ecmlpkdd13/sspdtutorial.html
Overview
intro itemsets productivity redundancy independent productivity multiple testing randomisation examples conclusion
Most association discovery techniques find rules.
Association is often conceived as a relationship between two parts, so rules provide an intuitive representation.
However, when many items are all mutually interdependent, a plethora of rules results.
Itemsets can provide a more intuitive representation in many contexts.
However, it is less obvious how to identify potentially interesting itemsets than potentially interesting rules.
Rules
bruises?=true → ring-type=pendant [Coverage=3376; Support=3184; Lift=1.93; p
stalk-surface-above-ring=smooth & ring-type=pendant → stalk-surface-below-ring=smooth [Coverage=3664; Support=3328; Lift=1.49; p=3.05E-072]
bruises?=true → stalk-surface-above-ring=smooth [Coverage=3376; Support=3232; Lift=1.50; p
bruises?=true, stalk-surface-above-ring=smooth, stalk-surface-below-ring=smooth, ring-type=pendant [Coverage=2776; Leverage=0.1143; p
An association between two items will be represented by two rules;
three items → nine rules; four items → twenty-eight rules; ...
It may not be apparent that all the resulting rules represent a single multi-item association.
Reference: Webb 2011
But how to find itemsets?
Association is conceived as deviation from independence between two parts.
Itemsets may have many parts
Main approaches
Consider all partitions
Randomisation testing
Incremental mining (models)
Information theoretic (mainly models, not statistical)
All partitions
Most rule-based measures of interest relate to the difference between the joint frequency and the expected frequency under independence between antecedent and consequent.
However, itemsets do not have an antecedent and consequent.
It does not work to consider deviation from expectation under the assumption that all items are independent of each other: if P(x,y) ≠ P(x)P(y) and P(x,y,z) = P(x,y)P(z), then P(x,y,z) ≠ P(x)P(y)P(z).
E.g. {AttendingKDD14, InNewYork, AgeIsEven}.
References: Webb, 2010; Webb & Vreeken, 2014
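A quick numeric illustration of the point above (made-up probabilities, not from the tutorial): when z is independent of an associated pair {x, y}, the triple still deviates from the all-items-independent baseline, even though it adds nothing over the pair.

```python
# Hypothetical probabilities: x and y are associated, z is independent of {x, y}.
p_x, p_y, p_z = 0.5, 0.5, 0.5
p_xy = 0.4                  # > p_x * p_y = 0.25, so x and y are associated
p_xyz = p_xy * p_z          # z adds nothing: P(x,y,z) = P(x,y) * P(z)

print(p_xyz)                # 0.2
print(p_x * p_y * p_z)      # 0.125 -- naive "all independent" baseline
# The three-item itemset deviates from the baseline yet is fully
# explained by the pair {x, y}.
```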
Productive itemsets
An itemset is unlikely to be interesting if its frequency can be predicted by assuming independence between any partition thereof.
{Pregnant, Oedema, AgeIsEven}: predicted by {Pregnant, Oedema} × {AgeIsEven}
{Male, PoorEyesight, ProstateCancer, Glasses}: predicted by {Male, ProstateCancer} × {PoorEyesight, Glasses}
References: Webb, 2010; Webb & Vreeken, 2014
Measuring degree of positive association
Measure the degree of positive association as deviation from the maximum of the expected frequency under an assumption of independence between any partition of the itemset, e.g.
leverage(I) = P(I) − max { P(X)P(Y) : X ∪ Y = I, X ∩ Y = ∅, X ≠ ∅, Y ≠ ∅ }
References: Webb, 2010; Webb & Vreeken, 2014
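The measure above can be sketched in a few lines (my own illustrative code, not the OPUS Miner implementation): enumerate every binary partition of the itemset and subtract the largest product of partition supports from the itemset's support.

```python
from itertools import combinations

def support(pattern, transactions):
    """Fraction of transactions containing every item of the pattern."""
    return sum(1 for t in transactions if pattern <= t) / len(transactions)

def binary_partitions(itemset):
    """Yield each unordered binary partition {X, Y} of the itemset exactly once."""
    items = sorted(itemset)
    first, rest = items[0], items[1:]
    # Fixing the first item inside X avoids generating each partition twice.
    for r in range(len(rest)):
        for extra in combinations(rest, r):
            x = frozenset((first,) + extra)
            yield x, frozenset(items) - x

def itemset_leverage(itemset, transactions):
    """P(I) minus the best independence-based prediction from any partition."""
    best = max(support(x, transactions) * support(y, transactions)
               for x, y in binary_partitions(itemset))
    return support(frozenset(itemset), transactions) - best

# Toy data, not from the tutorial:
transactions = [frozenset(t) for t in
                ({'a', 'b', 'c'}, {'a', 'b'}, {'a', 'b', 'c'}, {'c'})]
print(itemset_leverage({'a', 'b', 'c'}, transactions))   # -0.0625
```

A negative value means some partition already over-predicts the itemset's frequency, so the itemset shows no positive association beyond its parts.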
Statistical test for productivity
Null hypothesis: for some partition {X, Y} of I, P(I) ≤ P(X)P(Y).
Use a Fisher exact test on every partition.
Equivalent to testing that every rule X → Y formed from the itemset is significant.
No correction for multiple testing: the null hypothesis for I is only rejected if the corresponding null hypothesis is rejected for every partition.
This increases the risk of Type 2 rather than Type 1 error.
References: Webb, 2010; Webb & Vreeken, 2014
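The per-partition test can be sketched with a one-sided Fisher exact test built from the hypergeometric distribution (an illustrative stdlib-only version; the 2×2 table layout is the standard one, not copied from the tutorial):

```python
from math import comb

def fisher_one_sided(a, b, c, d):
    """One-sided Fisher exact p-value for positive association in the 2x2 table
    [[a, b], [c, d]], where a = supp(X and Y), b = supp(X and not Y),
    c = supp(not X and Y), d = supp(not X and not Y)."""
    row1, col1, n = a + b, a + c, a + b + c + d
    # P(at least a co-occurrences given the margins): hypergeometric upper tail.
    return sum(comb(col1, k) * comb(n - col1, row1 - k)
               for k in range(a, min(row1, col1) + 1)) / comb(n, row1)

def productive(tables, alpha):
    """Reject the null for itemset I only if every partition's test is significant."""
    return all(fisher_one_sided(*t) <= alpha for t in tables)

print(round(fisher_one_sided(8, 2, 2, 8), 4))   # 0.0115
```

Because all partition tests must succeed, a weaker-than-alpha result on any single partition keeps the itemset out, which is why the procedure trades Type 1 risk for Type 2 risk.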
Redundancy
If item x is a necessary consequence of another set of items Y, then {x} ∪ Y should be associated with everything with which Y is associated.
E.g. pregnant → female; the itemset {female, pregnant, oedema} is not likely to be interesting if {pregnant, oedema} is known.
Discard itemsets I for which there exist X ⊆ I and Y ⊂ X with P(Y) = P(X).
Here I = {female, pregnant, oedema}, X = {female, pregnant}, Y = {pregnant}.
Note: no statistical test.
References: Webb, 2010; Webb & Vreeken, 2014
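A brute-force sketch of this filter (illustrative code with toy data; by monotonicity of support it suffices to test one-item extensions, since supp(Y) = supp(X) for Y ⊂ X forces equality along any chain between them):

```python
from itertools import combinations

def count(pattern, transactions):
    return sum(1 for t in transactions if pattern <= t)

def is_redundant(itemset, transactions):
    """True if some item of I is a necessary consequence of other items of I."""
    items = sorted(itemset)
    for x in items:
        rest = [v for v in items if v != x]
        for r in range(1, len(rest) + 1):
            for y in map(frozenset, combinations(rest, r)):
                if count(y, transactions) == count(y | {x}, transactions):
                    return True      # x always co-occurs with y: discard I
    return False

# Toy data in which pregnant always implies female:
transactions = [frozenset(t) for t in
                ({'pregnant', 'female', 'oedema'}, {'pregnant', 'female'},
                 {'female'}, {'oedema'})]
print(is_redundant({'female', 'pregnant', 'oedema'}, transactions))  # True
print(is_redundant({'female', 'oedema'}, transactions))              # False
```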
Independent productivity
Suppose that heat, oxygen and fuel are all required for fire.
Then heat, oxygen and fuel are associated with fire.
So too are {heat, oxygen}, {heat, fuel}, {oxygen, fuel}, {heat}, {oxygen} and {fuel}.
But these six are potentially misleading given the full association.
References: Webb, 2010; Webb & Vreeken, 2014
Independent productivity
An itemset is unlikely to be interesting if its frequency can be predicted from the frequency of its specialisations.
If both X and X ∪ Y are non-redundant and productive, then X is only likely to be interesting if it also holds with respect to the data not covered by Y.
P is the set of all non-redundant and productive patterns.
References: Webb, 2010; Webb & Vreeken, 2014
Assessing independent productivity
Given {fuel, oxygen, heat, fire}:
to assess {fuel, oxygen, fire}, check whether the association holds in data without heat.
Reference: Webb, 2010
Assessing independent productivity
Given {fuel, oxygen, heat, fire}:
to assess {fuel, oxygen}, check whether the association holds in data without heat or fire.
Reference: Webb, 2010
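A sketch of this check for the two-item case, on a made-up toy dataset following the fire example (the "leverage > 0" criterion is my simplification of the significance test):

```python
def pair_leverage(x, y, transactions):
    """P(x, y) - P(x)P(y) on the given records."""
    n = len(transactions)
    px = sum(1 for t in transactions if x in t) / n
    py = sum(1 for t in transactions if y in t) / n
    pxy = sum(1 for t in transactions if x in t and y in t) / n
    return pxy - px * py

def survives_without(x, y, excluded, transactions):
    """Does the x-y association still hold once records containing
    any excluded item (e.g. heat, fire) are removed?"""
    reduced = [t for t in transactions if not (t & excluded)]
    return bool(reduced) and pair_leverage(x, y, reduced) > 0

# Toy data: fuel and oxygen co-occur only in the full fire records.
data = [frozenset(t) for t in
        [{'fuel', 'oxygen', 'heat', 'fire'}] * 4 +
        [{'fuel'}, {'oxygen'}, set(), set()]]

print(pair_leverage('fuel', 'oxygen', data))                       # 0.109375
print(survives_without('fuel', 'oxygen', {'heat', 'fire'}, data))  # False
```

On the full data {fuel, oxygen} looks associated, but the association vanishes once records containing heat or fire are removed, so the pair is not independently productive here.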
Mushroom
118 items, 8124 examples.
9676 non-redundant productive itemsets (α = 0.05).
3164 are not independently productive.
edible=e, odor=n [support=3408, leverage=0.194559, p …]
We typically want to discover associations that generalise beyond the given data.
The massive search involved in association discovery entails a massive risk of false discoveries: associations that appear to hold in the sample but do not hold in the generating process.
Reference: Webb, 2007
Bonferroni correction for multiple testing
Divide the critical value by the size of the search space.
E.g. retail, 16470 items: critical value = 0.05 / 2^16470 < 5E-4000.
Use layered critical values:
the sum of all critical values cannot exceed the familywise critical value;
allocate a separate familywise critical value to each itemset size, divided among all itemsets of size |I|;
the critical value for itemset I is thus (α / s) / C(m, |I|), for m items and s itemset sizes considered.
E.g. retail, 16470 items: critical value for itemsets of size 2 = (0.05 / 2) / C(16470, 2) ≈ 1.84E-10.
Reference: Webb 2007, 2008
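The layered allocation above is a few lines of arithmetic (a sketch of the slide's example; `n_layers` denotes the number of itemset-size layers among which α is split, here 2):

```python
from math import comb

def layered_critical_value(alpha, n_items, size, n_layers):
    """Equal share of alpha per itemset-size layer, split over
    all C(n_items, size) itemsets of that size."""
    return (alpha / n_layers) / comb(n_items, size)

# retail: 16470 items, itemset sizes 1..2 considered
cv = layered_critical_value(0.05, 16470, 2, 2)
print(f"{cv:.2e}")   # 1.84e-10
```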
Randomization testing
Randomization testing can be used to find significant itemsets.
All the advantages and disadvantages enumerated for dependency rules apply.
It is not possible to efficiently test for productivity or independent productivity using randomisation testing.
References: Megiddo & Srikant, 1998; Gionis et al., 2007
Incremental and interactive mining
Iteratively find the most informative itemset relative to those found so far.
May have a human in the loop.
Aim to model the full joint distribution:
will tend to develop more succinct collections of itemsets than self-sufficient itemsets;
will necessarily choose between one of many potential such collections.
References: Hanhijarvi et al., 2009; Lijffijt et al., 2014
Belgian lottery
{43, 44, 45}: 902 frequent itemsets (min sup = 1%).
All are closed and all are non-derivable.
KRIMP selects 232 itemsets.
MTV selects no itemsets.
DOCWORD.NIPS Top-25 leverage itemsets
kaufmann,morgan
top,bottom
trained,training
report,technical
san,mateo
mit,cambridge
descent,gradient
mateo,morgan
image,images
san,mateo,morgan
mit,press
grant,supported
morgan,advances
springer,verlag
san,morgan
kaufmann,mateo
san,kaufmann,mateo
distribution,probability
conference,international
conference,proceeding
hidden,trained
kaufmann,mateo,morgan
learn,learned
san,kaufmann,mateo,morgan
hidden,training
Reference: Webb & Vreeken, 2014
DOCWORD.NIPS Top-25 leverage rules
kaufmann → morgan
abstract,neural,morgan → kaufmann
morgan → kaufmann
abstract,morgan → kaufmann
abstract,kaufmann → morgan
references,morgan → kaufmann
references,kaufmann → morgan
abstract,references,morgan → kaufmann
abstract,references,kaufmann → morgan
system,morgan → kaufmann
system,kaufmann → morgan
neural,kaufmann → morgan
neural,morgan → kaufmann
abstract,system,kaufmann → morgan
abstract,system,morgan → kaufmann
abstract,neural,kaufmann → morgan
result,kaufmann → morgan
result,morgan → kaufmann
references,system,morgan → kaufmann
neural,references,kaufmann → morgan
neural,references,morgan → kaufmann
abstract,references,system,morgan → kaufmann
abstract,references,system,kaufmann → morgan
abstract,result,kaufmann → morgan
abstract,neural,references,kaufmann → morgan
Reference: Webb & Vreeken, 2014
DOCWORD.NIPS Top-25 frequent (closed) itemsets
abstract,references
references,system
abstract,result
references,result
abstract,function
abstract,references,result
abstract,neural
abstract,system
function,references
abstract,set
abstract,function,references
neural,references
function,result
abstract,neural,references
references,set
neural,result
abstract,function,result
abstract,introduction
abstract,references,system
abstract,references,set
result,system
result,set
abstract,neural,result
abstract,network
abstract,number
Reference: Webb & Vreeken, 2014
DOCWORD.NIPS Top-25 lift self-sufficient itemsets
duane,leapfrog
ekman,hager
americana,periplaneta
alessandro,sperduti
crippa,ghiselli
chorale,harmonization
iiiiiiii,iiiiiiiiiii
artery,coronary
kerszberg,linster
nuno,vasconcelos
brasher,krug
mizumori,postsubiculum
implantable,pickard
zag,zig
lpnn,petek
petek,schmidbauer
chorale,harmonet
deerwester,dumais
harmonet,harmonization
fodor,pylyshyn
jeremy,bonet
ornstein,uhlenbeck
nakashima,satoshi
taube,postsubiculum
iceg,implantable
Closure of duane,leapfrog (all words in all 4 documents)
abstract, according, algorithm, approach, approximation, bayesian, carlo, case, cases, computation, computer, defined, department, discarded, distribution, duane, dynamic, dynamical, energy, equation, error, estimate, exp, form, found, framework, function, gaussian, general, gradient, hamiltonian, hidden, hybrid, input, integral, iteration, keeping, kinetic, large, leapfrog, learning, letter, level, linear, log, low, mackay, marginal, mean, method, metropolis, model, momentum, monte, neal, network, neural, noise, non, number, obtained, output, parameter, performance, phase, physic, point, posterior, prediction, prior, probability, problem, references, rejection, required, result, run, sample, sampling, science, set, simulating, simulation, small, space, squared, step, system, task, term, test, training, uniformly, unit, university, values, vol, weight, zero
Itemsets Summary
More attention has been paid to finding associations efficiently than to which ones to find.
While we cannot be certain what will be interesting, the following probably won't be:
frequency explained by independence between a partition;
frequency explained by specialisations.
Statistical testing is essential.
Itemsets often provide a much more succinct summary of association than rules:
rules provide more fine-grained detail;
rules are useful if there is a specific item of interest.
Self-Sufficient Itemsets
capture all of these principles;
support comprehensible explanations for why itemsets are rejected;
can be discovered efficiently;
often find small sets of patterns (mushroom: 9,676; retail: 13,663).
Reference: Novak et al., 2009
Software
OPUS Miner can be downloaded from:
http://www.csse.monash.edu.au/~webb/Software/opus_miner.tgz
Statistically sound pattern discovery: current state and future challenges
Efficient and reliable algorithms for binary and categorical data:
branch-and-bound style;
no minimum frequencies (or harmless, like 5/n).
Numeric variables:
impact rules allow numerical consequences (Webb);
main challenge: numerical variables in the condition part of a rule and in itemsets;
how to integrate an optimal discretization into the search?
How to detect all redundant patterns?
Long patterns.
End!
Questions?
All material: http://cs.joensuu.fi/pages/whamalai/kdd14/sspdtutorial.html