Acknowledgements
This work would not have been possible without the help of Dennis Wall, Kelley Paskov, and the other members of the Wall Lab, as well as the funding and computing resources of the Wall Lab and the Stanford University School of Medicine.
Thanks also to the Machine Learning instructors and TAs.
1. Phillips, R. D. et al. Enrichment Procedures for Soft Clusters: A Statistical Test and its Applications. (2010).
2. Unpublished work from the Wall Lab, Stanford University.
Clustering and Classifying Autism
The iHart Consortium has helped to collect one of the largest Autism Spectrum Disorder (ASD) datasets ever, including genetic and behavioral data for several thousand ASD cases and controls.
This offers us an unprecedented opportunity to apply machine learning approaches to two major autism research problems:
Aim 2: An Autism Genetic Risk Score
Goal: Build a genetic risk predictor for ASD
The Problem: Autism is a complex disease: it is determined about 50% by genetics and 50% by a person's environment.
As a result, it is impossible to perfectly predict autism from genetics.
However, an imperfect classifier can:
• give us a measure of a person's genetic risk of autism
• provide intuition about which genetic features are most predictive of disease.
Genotype + Environment = Phenotype
The Feature Set:
Each genome is represented as a 1109 × 1 binary vector (entries such as 0 0 1) marking the genes in which that person has a loss-of-function variant.
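As a minimal sketch of this encoding, the snippet below turns a set of genes with loss-of-function calls into a binary vector. The gene names and the three-gene panel are hypothetical placeholders; the actual feature set has 1109 genes, and the function name `encode_genome` is our own.

```python
from typing import Iterable, List

def encode_genome(lof_genes: Iterable[str], gene_panel: List[str]) -> List[int]:
    """Return one entry per panel gene: 1 if the person has a
    loss-of-function in that gene, otherwise 0."""
    hit = set(lof_genes)
    return [1 if gene in hit else 0 for gene in gene_panel]

# Hypothetical three-gene panel (the poster's feature set has 1109 genes).
gene_panel = ["GENE_A", "GENE_B", "GENE_C"]

# A person with a loss-of-function only in GENE_C.
vector = encode_genome({"GENE_C"}, gene_panel)
print(vector)
```

Stacking one such vector per individual gives the input matrix for the classifiers.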
A Logistic Regression Classifier:
A Gradient Boosted Classifier:
Conclusions and Future Work: Our best performance is achieved by averaging the predictions from the two classifiers above (see right). This classifier outperforms previous methods (best AU-ROC = 0.54 [2]), showing promise as a genetic risk score predictor for ASD.
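The averaging step can be sketched as follows. The probability values are made-up placeholders, and since the poster does not say whether any weighting was applied, an unweighted mean is assumed here.

```python
from typing import List, Sequence

def average_predictions(p_first: Sequence[float],
                        p_second: Sequence[float]) -> List[float]:
    """Unweighted mean of two classifiers' predicted probabilities,
    computed per individual."""
    if len(p_first) != len(p_second):
        raise ValueError("both classifiers must score the same individuals")
    return [(a + b) / 2.0 for a, b in zip(p_first, p_second)]

# Hypothetical predicted ASD probabilities for three individuals.
p_logistic = [0.25, 0.5, 0.75]
p_boosted = [0.75, 0.0, 1.0]

ensemble = average_predictions(p_logistic, p_boosted)
print(ensemble)
```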
We first trained a Logistic Regression Classifier because these models are often simple to interpret.
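To illustrate why such a model is easy to interpret, here is a small pure-Python sketch of logistic regression fitted by batch gradient descent. The two-gene toy data, learning rate, and epoch count are assumptions chosen for illustration, not the settings used in this work.

```python
import math
from typing import List, Sequence, Tuple

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x: Sequence[float], w: Sequence[float], b: float) -> float:
    """Predicted probability of the positive class for one individual."""
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))

def fit_logistic(X: Sequence[Sequence[float]], y: Sequence[int],
                 lr: float = 0.5, epochs: int = 500) -> Tuple[List[float], float]:
    """Fit weights and bias by batch gradient descent on the mean log-loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(xi, w, b) - yi  # gradient of log-loss w.r.t. the score
            grad_b += err
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

# Toy data: gene 0 tracks the label exactly, gene 1 is uninformative.
X = [[1, 0], [1, 1], [0, 0], [0, 1]]
y = [1, 1, 0, 0]
w, b = fit_logistic(X, y)
print(w, b)
```

Each fitted weight is the change in log-odds associated with a loss-of-function in that gene, so the informative gene ends up with the larger weight.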
We also trained a gradient boosted tree classifier to capture non-linear gene-gene relationships.
F1 score: 0.634
Area under ROC: 0.565
F1 score: 0.647
Area under ROC: 0.580
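For readers unfamiliar with the two metrics reported here, the sketch below computes both from scratch on made-up labels and scores. The 0.5 decision threshold is an assumption; the poster does not state the threshold behind its F1 scores.

```python
from typing import Sequence

def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def auroc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve: the fraction of (case, control) pairs in
    which the case receives the higher score; ties count as half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one control")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels (1 = case, 0 = control) and classifier scores.
y_true = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

print(f1_score(y_true, y_pred), auroc(y_true, scores))
```

An area under ROC of 0.5 corresponds to random ranking, which is the baseline the reported values should be read against.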
Rachael Aikens (raiken) and Brianna Kozemzak (kozemzak@stanford.edu)
Stanford University Department of Biomedical Informatics, Wall Lab
F1 score: 0.642
Area under ROC: 0.602
Future work will:
• Continue to optimize ensemble and non-linear classification models
• Analyze feature importance to infer which genetic variants are most predictive
Aim 1: Clustering Autism Subtypes
Goal: Develop a cluster validation toolkit and use it to analyze clustering results
Feature Heat Maps:
Label Pie Charts:
Features on the x-axis and centroids on the y-axis. Lighter feature values usually indicate more neurotypical behavior. We see separation of neurotypical individuals from atypical individuals, and then a mixed cluster.
Cluster 1 (3980), Cluster 2 (2683), Cluster 3 (6830)
ADOS Diagnosis
ADI-R Diagnosis
Pie charts were generated for 29 different labels including diagnostic, demographic, and computed ADOS/ADI-R labels. The control group appears to separate from the ASD individuals.
Data:
• 13,493 individuals
• 123 features from ADOS and ADI-R instruments
• Diagnostic, medical, demographic, etc. labels
Individual Movement:
Cluster 1 (3980), Cluster 2 (2683), Cluster 3 (6830)
Cluster Moved To
Movement between clusters was not random. This indicates some common underlying features driving cluster formation for all k values.
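One simple way to examine this kind of movement is to tabulate, for each cluster at one value of k, where its members land at the next value. The sketch below does this on made-up assignments; the label lists and function names are hypothetical and are not taken from the toolkit itself.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple

def movement_counts(labels_before: Sequence[int],
                    labels_after: Sequence[int]) -> Counter:
    """Count individuals for each (cluster before, cluster after) pair."""
    if len(labels_before) != len(labels_after):
        raise ValueError("both clusterings must cover the same individuals")
    return Counter(zip(labels_before, labels_after))

def movement_fractions(counts: Counter) -> Dict[Tuple[int, int], float]:
    """Convert counts to the fraction of each source cluster that moved
    to each destination cluster."""
    row_totals: Counter = Counter()
    for (src, _), n in counts.items():
        row_totals[src] += n
    return {(src, dst): n / row_totals[src] for (src, dst), n in counts.items()}

# Hypothetical cluster assignments for eight individuals at k = 3 and k = 4.
labels_k3 = [1, 1, 1, 1, 2, 2, 3, 3]
labels_k4 = [1, 1, 1, 4, 2, 2, 3, 4]

counts = movement_counts(labels_k3, labels_k4)
fractions = movement_fractions(counts)
print(counts)
print(fractions)
```

If movement were random, each source cluster would spread roughly evenly over the destinations; rows dominated by one or two destinations suggest stable underlying structure.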
ASD can manifest over a broad spectrum of symptoms, from great intellectual and communication disability to near-normal 'high-functioning' forms. As a result, it is often asked whether ASD is in fact composed of some number of autism 'sub-types' that are best diagnosed, studied, and treated in different ways.
Features on the x-axis and examples on the y-axis, sorted by cluster. This was too complex to be useful, so we looked only at the centroids of the clusters (a low rank representation of the examples) instead.
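Reducing the full example-by-feature matrix to one row per cluster can be sketched as below. This version assumes crisp cluster labels and takes a plain mean of each feature; for soft clusterings a membership-weighted mean would be the natural counterpart. The data values are invented.

```python
from typing import Dict, List, Sequence

def cluster_centroids(X: Sequence[Sequence[float]],
                      labels: Sequence[int]) -> Dict[int, List[float]]:
    """Mean feature vector of the examples assigned to each cluster."""
    sums: Dict[int, List[float]] = {}
    sizes: Dict[int, int] = {}
    for row, label in zip(X, labels):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
        sizes[label] = sizes.get(label, 0) + 1
    return {label: [s / sizes[label] for s in acc] for label, acc in sums.items()}

# Three hypothetical individuals with two features each.
X = [[0.0, 2.0], [2.0, 4.0], [10.0, 10.0]]
labels = [1, 1, 2]

centroids = cluster_centroids(X, labels)
print(centroids)
```

Each centroid then becomes one row of the heat map, with features along the x-axis.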
The Problem:
Prior Work in Wall Lab:
• Imputed missing values and clustered data using a generalized low rank model with logistic loss
• Crisp and soft k-means clusterings were created for k = 1, 2, ..., 6.
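To make the crisp versus soft distinction concrete, the sketch below computes partial memberships for one point under one common soft k-means formulation, where membership is a softmax over negative squared distances scaled by a stiffness parameter. The poster does not specify which soft k-means variant was used, so the formulation, the `beta` parameter, and the data are all assumptions.

```python
import math
from typing import List, Sequence

def soft_memberships(point: Sequence[float],
                     centroids: Sequence[Sequence[float]],
                     beta: float = 1.0) -> List[float]:
    """Partial membership of one point in each cluster:
    proportional to exp(-beta * squared distance to the centroid)."""
    d2 = [sum((p - c) ** 2 for p, c in zip(point, centroid))
          for centroid in centroids]
    nearest = min(d2)  # subtract the minimum for numerical stability
    weights = [math.exp(-beta * (d - nearest)) for d in d2]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical 2-D centroids and one point lying on the first centroid.
centroids = [[0.0, 0.0], [2.0, 0.0]]
memberships = soft_memberships([0.0, 0.0], centroids)
print(memberships)
```

A crisp clustering keeps only the largest of these memberships, while a soft clustering keeps the whole vector.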
Conclusions and Future Work:
Conclusions
• "Best" clustering result was soft k-means with k = 3, where each individual is assigned to a single cluster based on maximum partial membership
• Why? Clusters are separated by diagnosis, medical history, and computed ADOS/ADI-R labels without creating indistinguishable extra clusters
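Assignment by maximum partial membership amounts to an argmax over each individual's membership vector. A minimal sketch, with invented membership values and 1-based cluster numbers to match the cluster naming used here:

```python
from typing import List, Sequence

def assign_by_max_membership(memberships: Sequence[Sequence[float]]) -> List[int]:
    """Return, for each individual, the 1-based index of the cluster with
    the largest partial membership. Ties go to the lowest-numbered cluster."""
    return [max(range(len(row)), key=row.__getitem__) + 1 for row in memberships]

# Hypothetical soft memberships of three individuals in k = 3 clusters.
memberships = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.2, 0.5, 0.3],
]

assignments = assign_by_max_membership(memberships)
print(assignments)
```

Note that this step discards how confident each assignment is: an individual with memberships 0.34, 0.33, 0.33 is treated the same as one with 0.98, 0.01, 0.01.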
Future work will:
• Employ methods to work directly with the soft clustering results by using enrichment tests developed for soft clustering [1] and implementing weighted membership for pie charts
• Apply other clustering methods to dataset and compare with k-means and soft k-means results
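One possible reading of "weighted membership for pie charts" is that each individual contributes to every cluster's chart in proportion to their partial membership, rather than counting fully toward a single cluster. The sketch below illustrates that interpretation; the label names, membership values, and function name are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence

def weighted_label_shares(memberships: Sequence[Sequence[float]],
                          labels: Sequence[str]) -> List[Dict[str, float]]:
    """For each cluster, the share of total membership weight carried by
    each label value (the slice sizes of a weighted pie chart)."""
    k = len(memberships[0])
    totals = [defaultdict(float) for _ in range(k)]
    for row, label in zip(memberships, labels):
        for cluster, weight in enumerate(row):
            totals[cluster][label] += weight
    shares = []
    for cluster_totals in totals:
        mass = sum(cluster_totals.values())
        shares.append({lab: w / mass for lab, w in cluster_totals.items()}
                      if mass > 0 else {})
    return shares

# Hypothetical memberships of three individuals in two clusters.
memberships = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
labels = ["ASD", "ASD", "Control"]

shares = weighted_label_shares(memberships, labels)
print(shares)
```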