  • Abstract

    Polycystic ovary syndrome (PCOS) is a common endocrine disorder that affects up to 20% of women; however, diagnosis is commonly unreliable and non-quantitative. Here we use supervised machine learning and measurements of 51 cytokines from a large cohort of patients to identify a low-dimensional set of potential biomarkers for diagnosis of PCOS. Both whole blood and individual follicular fluid (FF) aspirates were collected from women during pre-intracytoplasmic sperm injection with in vitro fertilization (ICSI/IVF) oocyte retrieval and linked with patients' PCOS status as diagnosed by the Rotterdam criteria (n = 69 PCOS, n = 222 non-PCOS). We trained a binary support vector machine (SVM) using a random subset of patient data to determine the cytokine profile associated with PCOS. Our resultant model includes 3 variables and is 76% accurate. This provides insight into the immunological basis of PCOS and may define a potential non-invasive quantitative strategy for diagnosis.

    Introduction

    Definitions & Equations

    Statistical Analysis

    Using Support Vector Machines

    Other Analyses

    Future Directions

    The dataset described in this study is a small subset of a much larger collection of patient data. In the future we plan to incorporate more of these data to determine what more can be said about the classification of PCOS and the prediction of fertility treatment success. Another open question is whether follicular fluid data are better suited to predicting pregnancy results while blood plasma data are better suited to predicting the presence of PCOS. Other methods of analysis (e.g. different machine learning classifiers) may have different strengths than our current models.

    References

    1. http://www.med.nyu.edu/chibi/sites/default/files/chibi/Final.pdf
    2. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3872139/
    3. http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf
    4. Latchman, David S. Gene Regulation - Fifth Edition. Taylor & Francis Group 2005. Print.
    5. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2785020/

    Acknowledgements

    Thank you to the members of the Gunawardena lab including but not limited to John, Sieu, Dan, David, Deepesh, Mohan, Felix, Javi. This work was supported by the Gwill York and Paul Maeder Research Award for Systems Biology and the FAS Center for Systems Biology. This work was supported by the National Science Foundation, Award id. 1462629.

    Identification of Non-invasive Cytokine Biomarkers for Polycystic Ovary Syndrome Using Supervised Machine Learning

    Sensitivity = $\frac{TP}{TP + FN}$

    Specificity = $\frac{TN}{TN + FP}$

    Accuracy = $\frac{TP + TN}{TP + TN + FP + FN}$
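
    As a worked check of these three definitions, here is a minimal Python sketch. The confusion-matrix counts are an assumption: the poster does not report them, and they were chosen only because their ratios reproduce the "FF All" row of the results table (specificity 0.9207921, sensitivity 0.21875, accuracy 0.7518797). Accuracy is taken as all correct calls over all samples.

```python
def classification_metrics(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)               # share of PCOS samples correctly flagged
    specificity = tn / (tn + fp)               # share of controls correctly cleared
    accuracy = (tp + tn) / (tp + fn + tn + fp) # share of all samples classified correctly
    return sensitivity, specificity, accuracy


# Hypothetical counts (not reported on the poster), picked to match the FF All row.
sens, spec, acc = classification_metrics(tp=7, fn=25, tn=93, fp=8)
print(sens, spec, acc)
```

    With counts like these, a model can reach roughly 75% accuracy while detecting only a small share of the PCOS samples, which is why the three metrics are reported side by side.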

    PCOS is an endocrine disorder that affects up to 20% of women. It is diagnosed using the Rotterdam criteria, which require at least two of the following to be present: (1) hyperandrogenism, which could include clinical hirsutism (excessive hairiness) or biochemically raised testosterone levels; (2) anovulation, or difficulties getting pregnant; (3) the presence of cysts in the ovaries as seen by ultrasound.

    With these diagnostic criteria in mind, one aim of this project is to provide a different measurement of PCOS based on concrete levels of cytokines. Samples were taken from 291 women who first underwent long-protocol ovarian hyperstimulation, a technique used to induce ovulation by multiple ovarian follicles. Samples were then taken from each woman's blood plasma and follicular fluid. There are approximately twice as many follicular fluid samples because each ovary was sampled. Fifty-one whole blood and FF cytokines were then measured by fluid-phase multiplex cytometric immunoassay (the resultant dataset is pictured below). Each cytokine is detected with a different antibody, which makes the assay expensive and lengthy. Another goal of this project is therefore to reduce the number of cytokines, or features, needed to predict PCOS in patients.

    Cytokines: Cytokines are small secreted proteins released by cells that have a specific effect on the interactions and communications between cells.

    [Figure: two panels, "P Dataset" and "FF Dataset", with a scale running from low levels to high levels.]

    Dataset visualization. These two figures represent the plasma (P) and follicular fluid (FF) datasets, which are 291 x 51 and 530 x 51 respectively.

    A comprehensive list of the 48 cytokines measured is: IL.1a, IL.1b, IL.1ra, IL.2, IL.2ra, IL.3, IL.4, IL.5, IL.6, IL.7, IL.8, IL.9, IL.10, IL.12..p40., IL.12..p70., IL.13, IL.15, IL.16, IL.17, IL.18, CTACK, Eotaxin, FGF, G.CSF, GM.CSF, GRO.a, IFN.a, IFN.g, IP.10, LIF, MCP.1, MCP.3, M.CSF, MIF, MIG, MIP.1a, MIP.1b, b.NGF, PDGF, RANTES, SCF, SDF.1a, TGF.b, TGF.b, TNF.a, TNF.b, TRAIL, VEGF, CRP

    Logistic Regression: $\ln\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1 X$
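
    The log-odds form of logistic regression can be inverted to give the predicted probability p for a sample. The sketch below illustrates that relationship for a single predictor; the coefficient values are hypothetical and are not taken from the fitted models on the poster.

```python
import math


def log_odds(p):
    """Logit: the left-hand side of the logistic regression equation."""
    return math.log(p / (1.0 - p))


def predicted_probability(x, beta0, beta1):
    """Invert the logit: p = 1 / (1 + exp(-(beta0 + beta1 * x)))."""
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))


# Hypothetical coefficients, for illustration only.
p = predicted_probability(x=3.0, beta0=-1.0, beta1=0.5)
print(p, log_odds(p))  # the log-odds recovers beta0 + beta1 * x
```

    A sample would then be called PCOS when p exceeds a chosen cutoff; the cutoff itself is a modelling decision that the poster does not state.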

    Linear Kernel: $K(x_i, x_j) = x_i \cdot x_j$

    $f(x) = \operatorname{sign}(w \cdot x + b)$
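
    The linear kernel and the decision function can be sketched in a few lines of Python. The weight vector w and offset b below are hypothetical values, not fitted parameters from this study, and treating a score of exactly zero as +1 is an assumption of this sketch.

```python
def linear_kernel(x_i, x_j):
    """Linear kernel: the dot product of two feature vectors."""
    return sum(a * b for a, b in zip(x_i, x_j))


def decision(x, w, b):
    """f(x) = sign(w . x + b), returning +1 or -1 (zero is mapped to +1 here)."""
    score = linear_kernel(w, x) + b
    return 1 if score >= 0 else -1


# Hypothetical separating hyperplane for two features.
w, b = [1.0, -1.0], -0.5
print(decision([2.0, 1.0], w, b), decision([0.0, 1.0], w, b))
```

    The two classes (for example PCOS and control) correspond to the two sides of the hyperplane defined by w and b.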

    SVM Optimization Problem:

    Daniela Perry 1,2, Tathagata Dasgupta 2, Joseph Dexter 2, Sarah Field 3, Michele Cummings 3, Vinay Sharma 3, Nadia Gopichandran 3, Ellis Baskind 3, Nicholas Orsi 3, Jeremy Gunawardena 2

    1 Cornell University, Department of Biological Statistics and Computational Biology
    2 Harvard Medical School, Department of Systems Biology
    3 University of Leeds Institute of Cancer and Pathology

    First, we put all 48 variables in a logistic regression model and predicted the probability of each sample being PCOS or Control (ALL). Both datasets had comparably poor performance. Then we conducted stepwise regression, which uses the AIC criterion to eliminate the least significant variable at each step (STEP). Once this process was complete, we analyzed the performance of the model using the FF and P datasets once again. Finally, to get a sense of how the models would perform with a more clinically feasible number of variables, we removed the variable with the highest p-value, one at a time, until we were left with just four cytokines (FOUR). ROC curves highlighting the performance of all six models are displayed below. In addition, a visual of accuracy, specificity, and sensitivity is also displayed.
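
    To make the AIC-based elimination step concrete, here is a minimal sketch of how candidate models could be compared once each has been fitted. It assumes the standard definition AIC = 2k - 2 ln(L), where k is the number of fitted parameters and L the maximized likelihood; the candidate names and log-likelihood values are hypothetical and do not come from the poster.

```python
import math


def log_likelihood(y_true, p_pred):
    """Bernoulli log-likelihood of 0/1 labels given predicted probabilities."""
    return sum(math.log(p) if y == 1 else math.log(1.0 - p)
               for y, p in zip(y_true, p_pred))


def aic(log_lik, n_params):
    """Akaike information criterion: lower is better."""
    return 2 * n_params - 2 * log_lik


def best_by_aic(candidates):
    """candidates maps a model name to (log_lik, n_params); return the lowest-AIC name."""
    return min(candidates, key=lambda name: aic(*candidates[name]))


# One hypothetical backward step: the current model and two one-variable deletions.
candidates = {
    "current": (-50.0, 5),
    "drop_A": (-50.2, 4),   # barely worse fit, one parameter fewer
    "drop_B": (-55.0, 4),   # much worse fit
}
print(best_by_aic(candidates))
```

    In a stepwise procedure this comparison is repeated, refitting after each deletion, until no deletion lowers the AIC.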

    Principal Component Analysis (PCA):

    [Figure (ff.pca): line graph of variances for principal components 1 to 9.]

    [Figure: samples plotted on PC1 (36.2% explained var.) against PC2 (19.1% explained var.), grouped as Control and PCOS, with variable labels for the cytokines listed above plus HGF, ICAM.1, SCGF.b and VCAM.1.]

    K-Means Clustering: K-means clustering is a stochastic process that groups data points based on their distance from each other. Points are randomly assigned to a cluster, then cluster placement is optimized by repeatedly finding the center of each cluster. Our clustering analysis resulted in the graph to the right, which is highly condensed because of the presence of so many unique outliers. This is consistent with the results of one-class classification.
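
    The procedure described above (random initial assignment, then alternating between computing cluster centers and reassigning points) can be sketched in plain Python. This is an illustrative sketch, not the implementation used for the poster; the seed, the iteration cap, the rule for re-seeding an empty cluster, and the toy points are all assumptions.

```python
import random


def k_means(points, k, seed=0, max_iter=100):
    """Cluster points (sequences of numbers) into k groups; return (labels, centers)."""
    rng = random.Random(seed)
    dim = len(points[0])
    labels = [rng.randrange(k) for _ in points]  # random initial assignment
    centers = []
    for _ in range(max_iter):
        # Step 1: each center is the mean of the points currently in that cluster.
        centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers.append([sum(p[d] for p in members) / len(members)
                                for d in range(dim)])
            else:
                centers.append(list(rng.choice(points)))  # re-seed an empty cluster
        # Step 2: each point moves to the cluster with the nearest center.
        new_labels = [
            min(range(k),
                key=lambda c: sum((p[d] - centers[c][d]) ** 2 for d in range(dim)))
            for p in points
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
    return labels, centers


# Toy data: two well-separated groups.
labels, centers = k_means([(0, 0), (1, 0), (10, 0), (11, 0)], k=2, seed=1)
print(labels, centers)
```

    Because the starting assignment is random, different seeds can give different labelings on real data, and a few extreme outliers can pull a center far from the bulk of the points.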

    PCA uses an orthogonal transformation to convert a set of variables into linearly uncorrelated variables called principal components. The above line graph represents how much variance in the data is accounted for by each principal component. The graph to the left represents the groupings based on the two most significant principal components.
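
    The "explained variance" percentages on the PCA axes are the share of total variance captured by each component. The sketch below computes the first principal component and its explained-variance share with power iteration on the sample covariance matrix. It is a simplified stand-in for a full PCA routine, assumes the leading eigenvalue is distinct, and uses toy data rather than the cytokine measurements.

```python
import math


def first_principal_component(rows, n_iter=200):
    """Return (unit direction, explained-variance share) of the first principal component."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Power iteration: repeated multiplication converges to the top eigenvector.
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(n_iter):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[a] * sum(cov[a][b] * v[b] for b in range(d)) for a in range(d))
    total_variance = sum(cov[a][a] for a in range(d))
    return v, eigenvalue / total_variance


# Toy data lying exactly on the line y = 2x, so one component explains everything.
direction, explained = first_principal_component([(1, 2), (2, 4), (3, 6), (4, 8)])
print(direction, explained)
```

    On real data the share is below 1 and the remaining components are found the same way after removing the variance already explained.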

    Full Model -> (Stepwise Regression) -> "Stepwise Model" -> (P-value elimination) -> Four Variable Model

    Model         # Variables   Training Set     Testing Set      Specificity   Sensitivity   Accuracy
    FF All        48            75% of full FF   25% of full FF   0.9207921     0.21875       0.7518797
    FF Stepwise   21            75% of full FF   25% of full FF   0.950495      0.25          0.7819549
    FF Four       4             75% of full FF   25% of full FF   0.990099      0.0625        0.7669173
    P All         48            75% of full P    25% of full P    0.8571429     0.2222222     0.7027027
    P Stepwise    12            75% of full P    25% of full P    0.8392857     0.1111111     0.6621622
    P Four        4             75% of full P    25% of full P    0.9821429     0.05555556    0.7567568

    [Figure: eight panels plotting IL.10 by Group (Control, PCOS): Group vs. IL.10 (P), Group vs. IL.10 (FF), Group vs. IL.10 (FF All), Group vs. IL.10 (FF Step), Group vs. IL.10 (FF Four), Group vs. IL.10 (P All), Group vs. IL.10 (P Four), Group vs. IL.10 (P Step).]

    First we performed a grid search to optimize the parameters of our SVM model using a linear kernel. Then we trained a binary classifier and tested its performance using 5-fold cross validation (results in the table below). The results led us to conduct one-class classification for outlier detection. We then retested our model excluding the points we found to be outliers.
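
    The grid search and 5-fold cross validation loop described here can be sketched independently of any particular SVM library. In the sketch below the model is deliberately a stand-in, not an SVM: a one-dimensional threshold classifier whose `offset` plays the role of the tuned parameter. The round-robin fold assignment, the helper names, and the toy data are all assumptions; in practice the `fit` and `predict` callables would wrap an SVM implementation, and the folds would usually be shuffled or stratified.

```python
def k_fold_indices(n_samples, k):
    """Split indices 0..n_samples-1 into k folds by round-robin assignment."""
    return [[i for i in range(n_samples) if i % k == fold] for fold in range(k)]


def cross_val_accuracy(fit, predict, X, y, param, k=5):
    """Mean held-out accuracy over k folds for one parameter value."""
    scores = []
    for test_idx in k_fold_indices(len(X), k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx], param)
        correct = sum(predict(model, X[i]) == y[i] for i in test_idx)
        scores.append(correct / len(test_idx))
    return sum(scores) / k


def grid_search(fit, predict, X, y, grid, k=5):
    """Evaluate every parameter in the grid; return (best parameter, all scores)."""
    results = {param: cross_val_accuracy(fit, predict, X, y, param, k) for param in grid}
    return max(results, key=results.get), results


# Stand-in model (NOT an SVM): threshold at the midpoint of the class means plus an offset.
def fit_threshold(X_train, y_train, offset):
    mean0 = sum(x for x, lab in zip(X_train, y_train) if lab == 0) / y_train.count(0)
    mean1 = sum(x for x, lab in zip(X_train, y_train) if lab == 1) / y_train.count(1)
    return (mean0 + mean1) / 2 + offset


def predict_threshold(threshold, x):
    return 1 if x > threshold else 0


# Toy one-feature data: class 0 on the left, class 1 on the right.
X = [0, 1, 2, 3, 4, 6, 7, 8, 9, 10]
y = [0] * 5 + [1] * 5
best_offset, cv_scores = grid_search(fit_threshold, predict_threshold, X, y, grid=[-3, 0, 3])
print(best_offset, cv_scores)
```

    Each parameter value is scored only on samples that were held out during fitting, so the selected value reflects out-of-sample performance rather than fit to the training data.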

    Grid Search -> Apply best parameters -> Train a binary classification SVM -> Test model performance (5-fold cross validation) -> Low model strength -> One-class classification -> Test model without outliers (5-fold cross validation)

    How SVM Works [1]

    [Figure: two panels, each showing Group 1 and Group 2.]

    Model    # Variables   Training Set              Testing Set               Specificity   Sensitivity   Accuracy
    FF SVM   7             75% of full FF            25% of full FF            1             0.0625        0.7744361
    FF SVM   7             5-fold CV                 5-fold CV                 0.9876543     0.03846154    0.7570093
    FF SVM   7             5-fold CV (no outliers)   5-fold CV (no outliers)   1             0             0.7583333
    P SVM    3             75% of full P             25% of full P             0.9821429     0.05555556    0.7567568
    P SVM    3             5-fold CV                 5-fold CV                 1             0             0.7627119
    P SVM    3             5-fold CV (no outliers)   5-fold CV (no outliers)   1             0.06666667    0.7878788

