Date posted: 13-Feb-2017
Category: Data & Analytics
Upload: bigml-inc
Introducing Association Discovery
BigML 2015 Fall Release
BigMLInc Fall2015Release 2
Today's Webinar
• Speaker:
  • Poul Petersen, CIO
• Moderator:
  • Atakan Cetinsoy, VP Predictive Applications
• Enter questions into the chat box: we'll answer some via text, others at the end of the session
• email: [email protected]
• Twitter: @bigmlcom
Association Discovery
Algorithm: "Magnum Opus" from Geoff Webb
Unsupervised Learning: unlabelled data
Learning Task: Find "interesting" relations between variables.
BigML Workflow
[Diagram: SOURCE → DATASET → MODEL (Decision Trees, Bagging, Decision Forest) | CLUSTER (K-Means, G-Means) | ANOMALY (Isolation Forest) | ASSOCIATION (Magnum Opus)]
Unsupervised Learning

date  customer  account  auth  class    zip    amount
Mon   Bob       3421     pin   clothes  46140  135
Tue   Bob       3421     sign  food     46140  401
Tue   Alice     2456     pin   food     12222  234
Wed   Sally     6788     pin   gas      26339  94
Wed   Bob       3421     pin   tech     21350  2459
Wed   Bob       3421     pin   gas      46140  83
Thu   Sally     6788     sign  food     26339  51

Clustering: finds similar instances
Anomaly Detection: finds unusual instances
Association Rules
Rules found in the transaction table:
{customer = Bob, account = 3421} → zip = 46140
{class = gas} → amount > 80
Association Rules
Each rule has the form Antecedent → Consequent:
{customer = Bob, account = 3421} → zip = 46140
{class = gas} → amount > 80
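The two rules above can be checked against the slide's transaction table with a short sketch. The representation of a rule as a pair of predicate functions is an assumption made for illustration, not how Magnum Opus or BigML stores rules, and the date column is omitted because neither rule uses it.

```python
# Rows copied from the transaction table on the slides (date column omitted).
FIELDS = ["customer", "account", "auth", "class", "zip", "amount"]
ROWS = [
    ("Bob", 3421, "pin", "clothes", 46140, 135),
    ("Bob", 3421, "sign", "food", 46140, 401),
    ("Alice", 2456, "pin", "food", 12222, 234),
    ("Sally", 6788, "pin", "gas", 26339, 94),
    ("Bob", 3421, "pin", "tech", 21350, 2459),
    ("Bob", 3421, "pin", "gas", 46140, 83),
    ("Sally", 6788, "sign", "food", 26339, 51),
]
TRANSACTIONS = [dict(zip(FIELDS, row)) for row in ROWS]

# Hypothetical representation: rule name -> (antecedent predicate, consequent predicate).
RULES = {
    "{customer = Bob, account = 3421} -> zip = 46140": (
        lambda t: t["customer"] == "Bob" and t["account"] == 3421,
        lambda t: t["zip"] == 46140,
    ),
    "{class = gas} -> amount > 80": (
        lambda t: t["class"] == "gas",
        lambda t: t["amount"] > 80,
    ),
}


def evaluate(rule, transactions):
    """Return (rows matching the antecedent, matching rows that break the consequent)."""
    antecedent, consequent = rule
    matched = [t for t in transactions if antecedent(t)]
    exceptions = [t for t in matched if not consequent(t)]
    return matched, exceptions


for name, rule in RULES.items():
    matched, exceptions = evaluate(rule, TRANSACTIONS)
    print(f"{name}: {len(matched)} matching rows, {len(exceptions)} exceptions")
```

The first rule has one exception (Bob's tech purchase in zip 21350), which shows that a rule can be interesting without holding for every row.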
Use Cases
• Market Basket Analysis
• Web usage patterns
• Intrusion detection
• Fraud detection
• Bioinformatics
• Medical risk factors
Market Basket Analysis
• Dataset of 9,834 grocery cart transactions
• Each row is a list of all items in a cart at checkout
GOAL: Discover "interesting" rules about what store items are typically purchased together.
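As a rough illustration of the market basket setting, the sketch below counts how often pairs of items share a cart. The carts are invented for the example (they are not the 9,834-transaction dataset), and plain co-occurrence counting is only a starting point; it is not the Magnum Opus algorithm.

```python
from collections import Counter
from itertools import combinations

# Hypothetical carts, invented for illustration only.
carts = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["milk", "eggs"],
    ["bread", "milk"],
    ["butter", "eggs", "bread"],
]

pair_counts = Counter()
for cart in carts:
    # sorted(set(...)) ensures each unordered pair is counted once per cart.
    for pair in combinations(sorted(set(cart)), 2):
        pair_counts[pair] += 1

# Report the most frequent pairs with the fraction of carts containing both items.
for pair, count in pair_counts.most_common(3):
    print(pair, f"{count / len(carts):.2f}")
```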
Association Metrics: Coverage
Percentage of instances which match antecedent "A".
Association Metrics: Support
Percentage of instances which match antecedent "A" and consequent "C".
Association Metrics: Confidence
Percentage of instances in the antecedent which also contain the consequent.
Confidence = Support / Coverage
Association Metrics: Confidence
Confidence ranges from 0% to 100%:
• 0%: A never implies C
• Between 0% and 100%: A sometimes implies C
• 100%: A always implies C
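Coverage, support and confidence can be computed directly from their definitions. The sketch below does so for the rule {customer = Bob, account = 3421} → zip = 46140 using the rows of the slide's transaction table; the function and variable names are illustrative choices, not BigML API names.

```python
from fractions import Fraction

# (customer, account, zip) for each row of the transaction table on the slides.
rows = [
    ("Bob", 3421, 46140),
    ("Bob", 3421, 46140),
    ("Alice", 2456, 12222),
    ("Sally", 6788, 26339),
    ("Bob", 3421, 21350),
    ("Bob", 3421, 46140),
    ("Sally", 6788, 26339),
]


def antecedent(row):
    """A: {customer = Bob, account = 3421}"""
    return row[0] == "Bob" and row[1] == 3421


def consequent(row):
    """C: zip = 46140"""
    return row[2] == 46140


n = len(rows)
coverage = Fraction(sum(antecedent(r) for r in rows), n)  # share of rows matching A
support = Fraction(sum(antecedent(r) and consequent(r) for r in rows), n)  # A and C
confidence = support / coverage  # share of A rows that also match C

print(f"coverage={coverage} support={support} confidence={confidence}")
```

Exact fractions are used so the relationship Confidence = Support / Coverage is visible without rounding noise.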
Association Metrics: Lift
Ratio of observed support to support if A and C were statistically independent.
Lift = Support / [p(A) * p(C)] = Confidence / p(C)
Association Metrics: Lift
• Lift < 1: negative correlation
• Lift = 1: no association
• Lift > 1: positive correlation
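A minimal sketch of the lift calculation from raw counts follows. The counts come from the slide's transaction table for the rule {customer = Bob, account = 3421} → zip = 46140 (7 rows, 4 match A, 3 match C, 3 match both); the helper names are hypothetical.

```python
from fractions import Fraction


def lift(n_total, n_a, n_c, n_ac):
    """Observed support divided by the support expected under independence."""
    support = Fraction(n_ac, n_total)
    p_a = Fraction(n_a, n_total)
    p_c = Fraction(n_c, n_total)
    return support / (p_a * p_c)


def interpret(value):
    """Map a lift value to the three cases described on the slide."""
    if value > 1:
        return "positive correlation"
    if value < 1:
        return "negative correlation"
    return "no association"


rule_lift = lift(n_total=7, n_a=4, n_c=3, n_ac=3)
print(rule_lift, interpret(rule_lift))
```

Here the same value also follows from Confidence / p(C) = (3/4) / (3/7), which matches the second form of the formula.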
Association Metrics: Leverage
Difference of observed support and support if A and C were statistically independent.
Leverage = Support - [p(A) * p(C)]
Association Metrics: Leverage
• Leverage < 0: negative correlation
• Leverage = 0: no association
• Leverage > 0: positive correlation
Leverage ranges from -1 to 1.
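Leverage can be sketched the same way from raw counts. The example uses the rule {class = gas} → amount > 80 from the slide's transaction table (7 rows, 2 match A, 6 match C, 2 match both); the function name is an illustrative choice.

```python
from fractions import Fraction


def leverage(n_total, n_a, n_c, n_ac):
    """Observed support minus the support expected if A and C were independent."""
    support = Fraction(n_ac, n_total)
    p_a = Fraction(n_a, n_total)
    p_c = Fraction(n_c, n_total)
    return support - p_a * p_c


gas_rule_leverage = leverage(n_total=7, n_a=2, n_c=6, n_ac=2)
print(gas_rule_leverage, float(gas_rule_leverage))
```

The value is positive but small: the rule always holds, yet because 6 of the 7 rows already have amount > 80, the antecedent adds little beyond what independence would predict.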
Medical Risk
• Dataset of diagnostic measurements of 768 patients.
• Each patient labelled True/False for diabetes.
GOAL: Find general rules that indicate diabetes.
Medical Risk
Association Rule:
If plasma glucose > 146 then diabetes = TRUE
Decision Tree:
If plasma glucose > 155 and bmi > 29.32 and diabetes pedigree > 0.32 and insulin <= 629 and age <= 44 then diabetes = TRUE
Partial Dependence Plots
Visualize Ensembles
Flatline Editor
https://github.com/bigmlcom/flatline
BigML Workflow
[Diagram: SOURCE → DATASET → MODEL (Decision Trees, Bagging, Decision Forest) | CLUSTER (K-Means, G-Means) | ANOMALY (Isolation Forest) | ASSOCIATION (Magnum Opus); DATASET → DATASET (Flatline, Flatline Editor)]
Logistic Regression
DATASET → LOGISTIC REGRESSION
• Classification algorithm
• Categorical: one-hot encoded
• Text: mapped to token frequency
• Bindings support local model
• L1/L2 regularization
• Currently API only
https://bigml.com/developers/logisticregressions
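Two of the bullets above, one-hot encoding and L1/L2 regularization, can be illustrated with a few lines of plain Python. This is a sketch of the general technique under assumed names and made-up weights; it does not call the BigML API and says nothing about how BigML fits its coefficients.

```python
import math


def one_hot(value, categories):
    """Encode a categorical value as a 0/1 vector, one slot per category."""
    return [1.0 if value == c else 0.0 for c in categories]


def predict_probability(features, weights, bias):
    """Logistic function applied to a weighted sum of the features."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def l1_penalty(weights, strength):
    """L1 term added to the loss: strength times the sum of absolute weights."""
    return strength * sum(abs(w) for w in weights)


def l2_penalty(weights, strength):
    """L2 term added to the loss: strength times the sum of squared weights."""
    return strength * sum(w * w for w in weights)


AUTH_CATEGORIES = ["pin", "sign"]  # values of the "auth" field in the slides' table
weights = [0.5, -2.0]              # hypothetical coefficients, not fitted to any data
bias = 0.0

features = one_hot("sign", AUTH_CATEGORIES)
probability = predict_probability(features, weights, bias)
print(features, round(probability, 4))
print(l1_penalty(weights, 1.0), l2_penalty(weights, 1.0))
```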
BigML Workflow
[Diagram: SOURCE → DATASET → MODEL (Decision Trees, Bagging, Decision Forest, Logistic Regression) | CLUSTER (K-Means, G-Means) | ANOMALY (Isolation Forest) | ASSOCIATION (Magnum Opus); DATASET → DATASET (Flatline, Flatline Editor)]
BigML Classifiers

                     Advantages                                  Disadvantages
Single Tree          easy to interpret; robust to missing data   overfitting
Ensemble             top performer; robust to missing data       hard to interpret
Logistic Regression  robust to noise; outputs probability        no missing data; hard to interpret
BigML Workflow
[Diagram: SOURCE → DATASET → MODEL (Decision Trees, Bagging, Decision Forest, Logistic Regression) | CLUSTER (K-Means, G-Means) | ANOMALY (Isolation Forest) | ASSOCIATION (Magnum Opus) | STATS (Statistical Tests, Correlations); DATASET → DATASET (Flatline, Flatline Editor)]
Correlations
DATASET → CORRELATION
• Pearson Coefficient
• Spearman Coefficient
• Chi-Square
• Cramér's V
• Tschuprow's T
• One-way ANOVA
https://bigml.com/developers/correlations
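To show how the first two measures in the list differ, here is a small sketch of the Pearson and Spearman coefficients in plain Python. The data is invented, the ranking helper assumes no tied values, and none of this reflects BigML's implementation.

```python
import math


def pearson(xs, ys):
    """Pearson coefficient: covariance scaled by both standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """Rank values from 1 to n (simplification: assumes there are no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(xs, ys):
    """Spearman coefficient: Pearson coefficient of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Hypothetical data: y grows monotonically but not linearly with x.
xs = [1, 2, 3, 4, 5]
ys = [1, 4, 9, 16, 25]
print(round(pearson(xs, ys), 4), round(spearman(xs, ys), 4))
```

Spearman reaches 1 for any strictly increasing relationship, while Pearson stays below 1 here because the relationship is not linear.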
Statistical Tests
DATASET → STATISTICAL TESTS
• Benford's Law
• Anderson-Darling
• Jarque-Bera
• Z-score
• Grubbs
https://bigml.com/developers/statisticaltests
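As one example from the list, a z-score check can be sketched with the standard library. The amounts are the amount column of the transaction table shown earlier in the slides; the threshold of 2.0 is an arbitrary assumption for this tiny sample, and this is a plain z-score computation rather than BigML's statistical test resource.

```python
import statistics

# "amount" column from the transaction table on the slides.
amounts = [135, 401, 234, 94, 2459, 83, 51]


def z_scores(values):
    """Distance of each value from the mean, in sample standard deviations."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)
    return [(v - mean) / stdev for v in values]


THRESHOLD = 2.0  # hypothetical cut-off chosen for illustration

outliers = [v for v, z in zip(amounts, z_scores(amounts)) if abs(z) > THRESHOLD]
print(outliers)
```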
Q&A
• Ask questions and get a free BigML T-shirt!
• All demonstrated features are immediately available to all users, including:
  • All subscription plans
  • Virtual Private Cloud (VPC) customers
  • On-premise implementations
• Documentation @ https://bigml.com/releases
FEEDBACK
@bigmlcom TWITTER
Get Started Today!
RESOURCES Join us for future webinars & hangouts
OFFICE HOURS
Every Wednesday 9:30am Pacific Time