Pierre Lebrun, Arlenda SA / Pharmalex...

Statistics, Big Data … and small data

Pierre Lebrun, Arlenda SA / Pharmalex, [email protected]

QPRC 2017, UCONN, Storrs

Disclaimer

- I am not a “big data analyst”, merely a statistician working as a consultant for companies with, sometimes, large data sets
  - More often, very small data sets
- My “big data” problems are more a collection of an awful lot of small data problems
- As a consultant, the customer's big data hardware
  - is central to defining a working solution
  - is often fixed and sometimes not adapted
  - was probably the best when they installed it (10 years ago)
- As a consultant, the customer's software…
  - is often fixed (protected environment… e.g. inability to install an R package…)

Bayesian statistics

- Why Bayesian statistics?
  - Focus on prediction instead of model parameters (not saying that parameters aren’t valuable!)
  - Integrate parameter uncertainty and measurement error into the predictive distribution (works for all types of models → unified framework)
  - Don’t stop at binary answers (go/no go or pass/fail)
  - Allow combination of knowledge through the prior distribution, in a natural scheme of updating prior knowledge using Bayes' theorem
  - Probability is a coherent measure of the plausibility of an event occurring, given the model hypothesis and data, instead of the frequency of the event
- All points above are independent of the size of the data set

Bayesian statistics

Simulations: the “new observations” are drawn from a distribution “centered” on the estimated location and dispersion parameters (wrongly treated as “true values”).

Posterior predictions: first, draw a mean and a variance from the posteriors and, second, draw an observation from the resulting distribution.
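A minimal sketch of the two approaches contrasted above, assuming a Normal model with the standard noninformative prior p(mu, sigma2) ∝ 1/sigma2. The model, the prior and the simulated data are assumptions made here for illustration; the slides do not state them.

```r
set.seed(1)
y <- rnorm(10, mean = 100, sd = 5)   # hypothetical small data set
n <- length(y); ybar <- mean(y); s2 <- var(y)
ndraw <- 20000

# Plug-in simulation: estimates are treated as if they were the true values
plugin <- rnorm(ndraw, mean = ybar, sd = sqrt(s2))

# Posterior predictive: first draw (variance, mean) from their posteriors ...
sigma2 <- (n - 1) * s2 / rchisq(ndraw, df = n - 1)        # scaled inverse chi-square
mu     <- rnorm(ndraw, mean = ybar, sd = sqrt(sigma2 / n))
# ... then draw one new observation from the resulting distribution
postpred <- rnorm(ndraw, mean = mu, sd = sqrt(sigma2))

q_plugin   <- unname(quantile(plugin,   c(0.025, 0.975)))
q_postpred <- unname(quantile(postpred, c(0.025, 0.975)))
width_plugin   <- diff(q_plugin)
width_postpred <- diff(q_postpred)
```

With so few observations, the posterior predictive interval should come out wider than the plug-in one, because it also carries the uncertainty on the mean and the variance.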

Case 1: Time series analysis

- Identify outliers in time series made of count data
  - Compliance: authorities said to the banks “if you have the ability to detect weird patterns in your customer data and raise an alert, please do it”
  - One pattern that can be found easily is when a number of aggregated transactions between two entities is not “Normal”
  - Then a customer can identify a root cause or a non-compliance issue
    - Strong link with statistical process control (SPC)

[Figure: “Monthly aggregates”; x-axis: months, y-axis: # transactions]

Time series analysis

- Identify outliers in time series made of count data
  - Poisson or negative binomial regression
  - Derive a prediction interval for the next number of aggregated transactions with large coverage (e.g. 99%)
  - If a point is outside, it means it does not belong to the same population (it would occur in only 100% − 99% = 1% of cases if the time series were behaving normally)

[Figures: two “Monthly aggregates” plots; x-axis: months, y-axis: # transactions]

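Time series analysis

- Some example of R code (Normal approximation, no AR structure)

    library(MASS)  # provides glm.nb

    ndata <- nrow(datas)  # number of observed months (not defined on the slide)
    mod <- glm.nb(formula = y ~ month, data = datas)  # log link is implicit
    # Fitted values on the link (log) scale for the observed months plus the next one
    pred <- predict(mod, data.frame(month = 1:(ndata + 1)), type = "link", se.fit = TRUE)

    X <- model.matrix(mod)
    xprimex_1 <- solve(t(X) %*% X)
    Xpred <- rbind(X, c(1, ndata + 1))  # add the next month (24 on the slide)
    # Residual standard deviation on the link scale (working residuals)
    S <- sqrt(sum(mod$residuals^2) / (ndata - 2))
    Root <- sqrt(1 + diag(Xpred %*% xprimex_1 %*% t(Xpred)))

    meanpred <- exp(pred$fit)
    # 95% two-sided prediction interval, back-transformed to the count scale
    PIpredLL <- exp(pred$fit - qt(0.975, df = ndata - 2) * S * Root)
    PIpredUU <- exp(pred$fit + qt(0.975, df = ndata - 2) * S * Root)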

Example with Stan code (same model)

    library(rstan)  # stan(), extract()
    library(MASS)   # rnegbin()

    model <- "
    data {
      int<lower=1> N;      // rows of data
      vector[N] x;         // predictor
      int<lower=0> y[N];   // response (newer Stan releases write: array[N] int<lower=0> y;)
    }
    parameters {
      real<lower=0> phi;   // neg. binomial dispersion parameter
      real b0;             // intercept
      real b1;             // slope
    }
    model {
      // priors:
      phi ~ cauchy(0, 20);
      b0 ~ normal(0, 20);
      b1 ~ normal(0, 20);
      // data model:
      y ~ neg_binomial_2_log(b0 + b1 * x, phi);
    }
    "

    stanmod <- stan(model_code = model,
                    data = list(N = nrow(datas), x = datas$month, y = datas$y),
                    iter = 10000, chains = 4)

    chains <- rstan::extract(stanmod, pars = c("b0", "b1", "phi"))

    x <- 1:24
    # One row per posterior draw, one column per month
    predictive <- matrix(0, ncol = length(x), nrow = length(chains$b0))
    for (i in seq_along(x)) {
      # draw from the predictive at every month
      predictive[, i] <- rnegbin(length(chains$b0),
                                 mu = exp(chains$b0 + chains$b1 * x[i]),
                                 theta = chains$phi)
    }

    # 95% predictive interval for each month (columns of predictive)
    PIPpred <- apply(predictive, 2, function(l) {
      quantile(l, probs = c((1 - 0.95) / 2, (1 + 0.95) / 2))
    })

[Figure: datas$TRN_SENT against datas$month, from 201301 to 201512]

Time series analysis

- Why a “non-Bayesian” version?
  - On millions of subsets of data, running an MCMC sampler can be hopeless (depending on the customer architecture)
  - Generally, it is worth checking whether an analytical solution can be identified
  - Unfortunately, this is not the case for the negative binomial model
  - As shown, a “Normal” approximation can be developed, and its coverage verified in various simulations
    - But note that a real Bayesian prediction is more powerful, especially with small counts
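The coverage verification mentioned above could be sketched as follows: simulate many negative binomial series, apply the same Normal approximation as in the R code of the slides, and count how often the next month falls inside the 95% interval. The “true” parameter values, the number of simulations and the series length are hypothetical choices made for this illustration.

```r
library(MASS)  # glm.nb

set.seed(42)
nsim  <- 200                         # hypothetical number of simulated series
ndata <- 23                          # months used for fitting; month 24 is predicted
b0 <- 10; b1 <- 0.01; theta <- 50    # hypothetical "true" values

covered <- rep(NA, nsim)
for (s in seq_len(nsim)) {
  mu_all <- exp(b0 + b1 * 1:(ndata + 1))
  y_all  <- rnbinom(ndata + 1, size = theta, mu = mu_all)
  datas  <- data.frame(month = 1:ndata, y = y_all[1:ndata])

  # A failed fit must not stop the whole simulation
  mod <- tryCatch(suppressWarnings(glm.nb(y ~ month, data = datas)),
                  error = function(e) NULL)
  if (is.null(mod)) next

  X    <- model.matrix(mod)
  xnew <- c(1, ndata + 1)
  S    <- sqrt(sum(mod$residuals^2) / (ndata - 2))
  Root <- sqrt(1 + drop(t(xnew) %*% solve(t(X) %*% X) %*% xnew))
  fit_new <- sum(coef(mod) * xnew)             # prediction on the log scale
  half    <- qt(0.975, df = ndata - 2) * S * Root

  y_new <- y_all[ndata + 1]
  covered[s] <- y_new >= exp(fit_new - half) && y_new <= exp(fit_new + half)
}

# Share of simulated series whose next month fell inside the interval
coverage <- mean(covered, na.rm = TRUE)
```

The empirical `coverage` can then be compared with the nominal 95%, and the exercise repeated with smaller counts, where the approximation is expected to be weaker.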

[Figure legend] Green: Bayesian interval (in Stan); red: Normal approx. (R code)

Time series analysis

- So far, the proposed solution answers yes/no
- Customers may have thousands of alerts, all of which need to be verified
- How to improve the solution: by ranking these alerts
  - Provide a posterior probability that the data point is out-of-trend (OOT)
- Example following the Stan implementation

    > ecdf(predictive[, 24])(35000)  # 24 is the last time point (not included in the model)
    [1] 0.98915  # probability to be OOT

- Instead of reporting yes/no, it is better to report P(OOT)
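A minimal sketch of how such probabilities could be used to rank alerts across several series. The predictive draws are simulated here with `rnbinom` rather than taken from a Stan fit, and the series names, means and observed counts are all hypothetical. As on the slide, P(OOT) is taken as the share of predictive draws at or below the observed count (a one-sided, upper-tail reading).

```r
set.seed(123)
ndraw <- 4000

# Hypothetical posterior predictive draws for the next month of three series
predictive <- list(
  A = rnbinom(ndraw, size = 20, mu = 20000),
  B = rnbinom(ndraw, size = 20, mu = 500),
  C = rnbinom(ndraw, size = 20, mu = 50)
)
observed <- c(A = 35000, B = 520, C = 95)   # hypothetical new counts

# P(OOT) per series: empirical CDF of the predictive draws at the observed count
p_oot <- mapply(function(draws, obs) ecdf(draws)(obs), predictive, observed)

# Alerts ranked from most to least suspicious
alerts <- sort(p_oot, decreasing = TRUE)
```

The analyst can then review alerts from the top of `alerts` downwards instead of treating every flagged series alike.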

Time series analysis

- Difficulties specific to big data
  - Customer hardware and software poorly adapted
  - Structure of the data
    Data are structured… but still, plenty of problems, leading to additional data consolidations
  - Finding robust simple models and handling error cases
    E.g. in some cases, one model will not converge and R would crash if the error is not handled (try… catch mechanism)
- Take time to build appropriate models, on which you make predictive inference!
  - If the models are not good, inference is questionable… (sensitivity ↘↘)
  - Easy in small data… Close to impossible in big data
  - How to develop robust models without having seen (all) the data?
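The error-handling point above can be sketched with base R's `tryCatch`, so that one failing subset does not stop the whole run. For brevity this sketch fits a Poisson `glm` instead of the negative binomial model of the slides, and the helper `safe_fit` and both data subsets are hypothetical.

```r
# Fit one subset; return NULL instead of stopping when the fit fails
safe_fit <- function(d) {
  tryCatch(
    glm(y ~ month, family = poisson, data = d),
    error = function(e) {
      message("fit failed: ", conditionMessage(e))
      NULL
    }
  )
}

set.seed(7)
good <- data.frame(month = 1:12, y = rpois(12, lambda = 100))
bad  <- data.frame(month = 1:12, y = c(-5, rpois(11, lambda = 100)))  # corrupted count

subsets <- list(good = good, bad = bad)
fits <- lapply(subsets, safe_fit)

# Number of subsets that need a closer look
n_failed <- sum(vapply(fits, is.null, logical(1)))
```

Failed subsets stay in `fits` as `NULL` entries, so they can be counted and inspected afterwards rather than silently lost.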

Case 2: Accelerometer data

- Pre-clinical data are gathered on animals to obtain an early idea of drug efficacy
  - 32 or 64 mice
  - Followed online for 3 to 6 weeks
  - Videos (like parking security videos)
  - 3-dimensional accelerometer data (a device is attached to their back)
  - Sampling frequency ~100 Hz
- Epilepsy is induced in the mice, and the different treatment groups (control, placebo, dose 1, dose 2, etc.) should show different (significantly?) numbers of seizures
- The problem is that it is not possible to analyze every video
  - Instead, use the signal from the accelerometers to automatically detect seizures

Accelerometer data

- The goal is to detect epileptic seizures from accelerometer signals
- But sometimes, mice are scratching, running, dancing…

Accelerometer data

- A first algorithm has been developed to be very sensitive (HMM on simple features)
  - Finds signal patterns (“no movement – movement – no movement”)
  - Can be run “online” during the experiment
  - ~100% detection
- But the false positive rate was ~97%
  - Highly skilled scientists need to confirm seizures visually (takes about 20 sec. per suspicion)
  - Total number of seizures found is about 15k!!
  - About 100h+ of video analysis (don’t do that), for a small experiment
  - …this is not accounting for coffee breaks… such a rate is not acceptable
- Let’s call the seizures detected so far ‘suspicions’

Accelerometer data

- To improve:
  - Add a filter on top of the suspicions, based on more specific features
  - As nobody knows what a good feature is:
    1) Extract many features, as orthogonal as possible (frequencies, amplitude, rolling SD)
    2) Train a support vector machine (the skilled scientists already did the 100h+ analysis on some experiments, so we have training data…)
    3) Tune the SVM (mainly, weighted SVM because the classes are highly unbalanced)
    4) Verify/optimize using stratified cross-validation
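Steps 3 and 4 rely on two ingredients that can be sketched in base R: class weights that compensate for the imbalance, and cross-validation folds that keep the class proportions. The labels below are simulated, and the number of folds and the inverse-frequency weighting are assumptions made for illustration, not the settings used in the study.

```r
set.seed(2017)
# Hypothetical labels: 0 = false suspicion, 1 = confirmed seizure (5% of cases)
y <- factor(c(rep(0, 950), rep(1, 50)))
k <- 5   # hypothetical number of folds

# Stratified folds: assign fold numbers separately within each class
fold <- integer(length(y))
for (cl in levels(y)) {
  idx <- which(y == cl)
  fold[idx] <- sample(rep_len(seq_len(k), length(idx)))
}
fold_table <- table(fold, y)

# Class weights inversely proportional to class frequency
tab <- table(y)
class_weights <- setNames(length(y) / (nlevels(y) * as.numeric(tab)), names(tab))
```

Each fold then holds the same share of seizures as the full data set. A named weight vector of this kind is what, for example, the `class.weights` argument of `e1071::svm` expects; whether that package was used here is not stated in the slides.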

                 predicted
actual           0        1    total
0           178998    24102   203100
1              103     5304     5407
total       179101    29406   208507

missed seizures = 2%; reduction in working time = 86%
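The two summary figures can be recovered from the table. The reading used here is an interpretation: missed seizures are the actual seizures predicted as 0, and the time saved is the share of suspicions predicted as 0, which no longer need visual confirmation.

```r
cm <- matrix(c(178998, 24102,
                  103,  5304),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

# Share of real seizures that the filter discards
missed <- cm["1", "0"] / sum(cm["1", ])
sensitivity <- 1 - missed

# Share of all suspicions that no longer need to be reviewed on video
time_saved <- sum(cm[, "0"]) / sum(cm)

# Share of the remaining suspicions that are still false alarms
fp_share <- cm["0", "1"] / sum(cm[, "1"])

round(100 * c(missed = missed, time_saved = time_saved, fp_share = fp_share), 1)
```

Under this reading, `missed` rounds to 2% and `time_saved` to 86%, matching the slide.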

Accelerometer data

- Implementation
  - No SVM available in “big data” R packages such as sparklyr or rsparkling
  - Sad, but not such an issue as we work only on the predefined suspicions
    Only about 15k chunks of accelerometer data…
  - Suspicion accelerometer data can be accessed one by one
    - Features can be extracted
    - This summary is easily handled by one machine
    - Still embarrassingly parallel…
- “Bayesian” SVM not readily available
  - The answer will remain a simple “yes/no” decision
  - No ranking of the suspicions by how likely they are to be real seizures
  - Please implement it in BoomSpikeSlab :)
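The per-suspicion feature extraction described under Implementation could look like the sketch below, on one axis of a 100 Hz signal. The function `extract_features`, the window length, the three features and the simulated chunks are hypothetical; the slides only name frequencies, amplitude and rolling SD as feature families.

```r
fs <- 100   # sampling frequency in Hz, as on the slides

# Summarise one chunk of signal into a few features
extract_features <- function(sig, fs, win = 50) {
  # Rolling standard deviation over windows of `win` samples
  roll_sd <- apply(embed(sig, win), 1, sd)
  # Dominant frequency from the periodogram (zero frequency excluded)
  n    <- length(sig)
  spec <- Mod(fft(sig - mean(sig)))^2
  half <- 2:floor(n / 2)
  dom_freq <- (half[which.max(spec[half])] - 1) * fs / n
  c(amplitude = diff(range(sig)), max_roll_sd = max(roll_sd), dom_freq = dom_freq)
}

set.seed(3)
time_s <- (0:499) / fs   # 5 seconds of signal
chunks <- list(
  rest  = rnorm(length(time_s), sd = 0.05),                            # no movement
  shake = sin(2 * pi * 8 * time_s) + rnorm(length(time_s), sd = 0.05)  # 8 Hz movement
)

# One row of features per chunk; each chunk is processed independently
features <- t(sapply(chunks, extract_features, fs = fs))
```

Because each chunk is handled on its own, the `sapply` call could be swapped for a parallel equivalent such as `parallel::mclapply` without changing `extract_features`.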

Accelerometer data

- The statistical question
  - There is, and will always be, a trade-off between sensitivity and false positive rate
- What is the impact of missing a few seizures?
  - Clinical relevance
- Simulation study

Accelerometer data

- See the impact under some seizure-decrease assumptions and subject-to-subject variability
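One possible shape for such a simulation study, as a sketch only: seizure counts per mouse are drawn from a negative binomial distribution (subject-to-subject variability), the treated group has a lower mean (seizure-decrease assumption), and each true seizure is detected with a given sensitivity. The group size, means, dispersion, effect size and the choice of a Wilcoxon test are all hypothetical and are not taken from the slides.

```r
set.seed(99)

# Estimated power to detect a treatment effect for a given detection sensitivity
simulate_power <- function(sensitivity, nsim = 500, n_per_group = 16,
                           mu_control = 30, reduction = 0.4, size = 3) {
  pvals <- replicate(nsim, {
    # True seizure counts per mouse, with between-subject variability
    control <- rnbinom(n_per_group, size = size, mu = mu_control)
    treated <- rnbinom(n_per_group, size = size, mu = mu_control * (1 - reduction))
    # Each true seizure is detected with probability `sensitivity`
    control_obs <- rbinom(n_per_group, size = control, prob = sensitivity)
    treated_obs <- rbinom(n_per_group, size = treated, prob = sensitivity)
    suppressWarnings(wilcox.test(control_obs, treated_obs)$p.value)
  })
  mean(pvals < 0.05, na.rm = TRUE)
}

power_all    <- simulate_power(sensitivity = 1.00)   # no seizure missed
power_missed <- simulate_power(sensitivity = 0.98)   # 2% of seizures missed
```

Comparing `power_all` with `power_missed` over a range of assumptions shows whether missing a small share of seizures changes the conclusion of the experiment.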

Conclusions

- Bayesian statistics allow providing better answers through the use of the complete predictive distribution
  - Use it when feasible
  - Approximation is not a crime
- Making millions of models on small data is hard
  - Take time to adjust your model
  - Inference otherwise is questionable
- Be sure to answer the very question of the customer/scientist
  - Optimizing specificity and sensitivity or RMSECV is not sufficient
  - The results are generally used by others (we are just a small piece in a big process)
    Look at the impact of errors on the final outcome, instead of taking extra time to continue trying to optimize your classifier

Thanks

- Acknowledgement
  - Marco Munda (Arlenda)

