HST 190: Introduction to Biostatistics
Lecture 1: Basic principles of statistical data analysis
HST 190: Intro to Biostatistics
Welcome!
• Statistical reasoning is the process of drawing scientific conclusions from data in a rational, consistent way
• Goals for the course:
  § develop an intuition for the key concepts that underpin the statistical analysis of data
  § read the “Methods” section of an article, and understand/critique the approach taken
  § learn to analyze and draw scientific conclusions from your own data
Outline
Lecture  Topic(s)
1        Basic principles of statistical data analysis
2        Principles of probability & Estimation of parameters
3        Two-sample comparisons, hypothesis testing and power/sample size calculations
4        Clinical trials & Simple linear regression
5        Multiple linear regression
6        Methods for binary outcomes
7        Logistic regression
8        Analysis of time-to-event data
9        Project presentations
10       Review before the exam
Course Logistics
• Eight lectures
  § each 2-2.5 hours long
• Reading will be assigned prior to each lecture
  § given the pace of the course, this is strongly encouraged
• Problem sets following each lecture
  § include Matlab exercises
  § due at 9am on the day of the following lecture (unless specified otherwise)
• During breaks in the middle we will:
  § complete group exercises
  § learn Matlab
  § discuss course projects
• You will also work on a group project and present results during one of the class meetings
• In-class exam will take place during last meeting
  § 28th August
  § open-book
Suggestions
• Ask questions during the lecture as well as on Piazza
  § take notes!
• Material presented in different sequence from Rosner
  § consult Rosner for a different approach
• Lots of material in a short time
  § feel free to ask for help!
• There will be many formulae
  § goal is not to memorize them
  § even though we have access to software, hand calculations can help cultivate intuition
How to Prioritize
• The course is pass/fail.
• Exam is open-book, so don’t spend time memorizing formulas. Learn when and why to use each procedure; you can always refer to your notes to see how.
• To get the most out of this course, you should:
  § attend lectures
  § submit solutions to all the problem sets
  § participate in class discussions, group exercises, and Piazza
  § complete a project
  § take the final exam
Resources
• Lecture Notes (Canvas->Files)
  § Get bonus points for finding typos!
• Introduction to Matlab (Canvas->Files)
• Rosner textbook, 7th ed. (required; a lifelong reference)
• Piazza
• Pagano & Gauvreau textbook
• See Syllabus for additional references.
Basic steps of data analysis
• To set the stage, let’s consider two motivating questions:
  1) is there an association between time spent in the operating room and post-surgical outcomes for lung cancer resection?
  2) can we develop an enhanced breast cancer risk model?
• The questions have been left deliberately vague! It’s often the case that scientific questions are initially imprecisely posed
• Integral to the process of research is translating science into statistics, and back again
  § as you read papers, it is important to consider how the authors thought through this process
• There are many (possibly infinite!) ways in which one could characterize ‘basic steps’ but a reasonable outline might be:
  I. Understand the context of the analysis
  II. Establish the scientific goals
  III. Translate the scientific goals into statistical language
  IV. Choose statistical methods to employ
  V. Implementation and running the analysis
  VI. Interpretation
• Sometimes, the way forward is clear and, in that sense, the process is prescriptive
  § features/issues that are common to all analyses
• In many instances, however, the way forward isn’t clear
  § aspects of the analysis don’t fit in with what you currently know
  § these may relate to the science, data and/or statistics aspects
• Solutions include:
  § appealing to the published literature (scientific and statistical)
  § adopting or adapting existing methods
  § developing new methods
• Regardless, dealing with these issues will require some creativity, and there is seldom, if ever, one ‘correct’ data analysis
  § different data analyses correspond to different scientific questions
  § which scientific question is ‘right’?
I. Understanding the Context
• From the perspective of a biostatistician, the purpose of data analysis is to learn about some population using information in a sample
• Learn about covariates in terms of association with or prediction of an outcome
  § notationally we often think in terms of $X$ and $Y$
  § possibly within or across certain sub-populations denoted, say, by $Z$
• Context usually involves three things:
  1) the background science
  2) the nature of the available data
  3) the population of interest, often called the ‘target population’
Lung cancer surgery
Q: Is there an association between time spent in the operating room and post-surgical outcomes?
• Background science:
  § longer operating time -> greater exposure to anesthesia
  § shortening operating time might reduce adverse post-surgical outcomes
    o complications during the hospital stay
    o recurrence of lung cancer
    o mortality
  § may also lead to decreased costs/increased efficiency
    o increased capacity for the operating room
    o shorter post-surgical hospital stay
• Available data:
  § ≈400 surgeries at Brigham and Women’s Hospital
  § performed between 1997 and 2008
  § demographic, clinical, tumor and follow-up information
• Target population:
  § patients who undergo elective surgery for early-stage non-small cell lung cancer
  § need to be aware of different surgery sub-types
    o lobectomy, segmentectomy, wedge resection
    o thoracotomy, video-assisted thoracic surgery
  § what do we think about the (relatively) long time frame?
  § generalizability beyond BWH?
Breast cancer risk
Q: Can we develop an enhanced breast cancer risk model?
• Background science:
  § the ‘Gail model’ for breast cancer risk was developed in the late 1980s
    o age, race
    o age at menarche, age at birth of first child
    o family history, number of prior biopsy examinations and atypical hyperplasia
  § the model was validated in a number of subsequent studies
  § subsequent research identified a number of additional risk factors for breast cancer
    o breast density, use of hormone replacement therapy and body mass index
• Available data:
  § 2,392,998 screening mammograms from the Breast Cancer Surveillance Consortium
    o NCI-funded nationwide network of mammography registries
  § mammograms performed between 1996 and 2002
  § outcomes are ascertained via linkages with cancer registries
• Target population:
  § screening mammograms performed on women aged 35-84 years
    o unit of analysis is the mammogram, not the woman
  § who undergoes screening? who doesn’t?
    o how might this impact the interpretation of the study?
Nature of the available data
• What were the data collection procedures?
  § convenience sample or part of a designed study?
  § what was the setting/time frame?
  § observational study or randomized design?
  § cross-sectional, prospective, or retrospective?
  § stratification and/or matching?
• How were the procedures followed?
  § any systematic deviations from the ‘ideal’ data collection process?
  § may be due to patients?
    o refusal to participate/respond
    o inaccurate responses
  § may be due to researchers?
    o were uniform procedures applied to all (potential) participants?
    o are we actually measuring what we think we are measuring?
• Have there been any interim data cleaning/manipulation efforts?
  § cleaning of ‘strange’ values
    o set to some threshold value or to missing
    o exclusion from the dataset
  § construction of derived variables
Populations
• In practice, the ‘population’ can be
  § an actual, potentially observable population
  § a hypothetical (sometimes infinite) population
• Might refer to the ‘target population’ to emphasize that there is a specific population in mind
• Defining the target population is crucial in that it provides the context for the scientific question of interest
  § who would we like our results to generalize to?
• Narrow vs. broad definitions of the target population
  § heterogeneity vs. homogeneity
  § what are the trade-offs?
• What comes first... the data or the population?
  § depends on when you get involved
• If the data has already been collected:
  § for which population could we consider the sample as being ‘representative’?
  § may need to focus the dataset by excluding certain folks
    o implicitly changes the population to which one can generalize
    o sample size vs. mixing of effects
  § is there scope for additional data collection efforts?
• If the data has not been collected:
  § much greater flexibility for choosing/defining the population of interest
Learning from data
• Recall, the goal is to learn about the relationships between a subset of covariates
• Achieved by collecting and analyzing a sample from the population
  § an important aspect of ‘context’ is that this is indeed what we are doing
    o or, at least, hoping to do!
• Suppose we could enumerate the entire population
  § that is, the sample is the population
• In this case observed data characterizes relationships completely
• Note when we have a complete enumeration, there is no sampling variability
  § we don’t have to worry about making statements about the population on the basis of information in the sample
  § the sample is the population
• We don’t have to consider or quantify uncertainty associated with only observing a sub-sample
  § no need for standard errors, confidence intervals or p-values
  § maybe no need for statistical methods!
• Most of the time we can’t enumerate the entire population
  § typically, this isn’t logistically and/or financially feasible
• So…
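The contrast between a complete enumeration and a sample can be illustrated with a small simulation. The sketch below is illustrative only: the population is simulated (10,000 hypothetical values), and the function name `sample_means` and all numeric settings are assumptions rather than course material.

```python
import random
import statistics


def sample_means(population, sample_size, n_samples, seed=0):
    """Draw repeated simple random samples (without replacement)
    and return the mean of each sample."""
    rng = random.Random(seed)
    return [
        statistics.mean(rng.sample(population, sample_size))
        for _ in range(n_samples)
    ]


# Hypothetical finite population (simulated, not real data).
_rng = random.Random(1)
population = [_rng.gauss(120, 15) for _ in range(10_000)]

# Complete enumeration: one fixed answer, no sampling variability.
population_mean = statistics.mean(population)

# Repeated samples of size 50: the estimate changes from sample to sample.
means_n50 = sample_means(population, sample_size=50, n_samples=200)
spread_n50 = statistics.stdev(means_n50)

# "Sampling" the entire population reproduces the population mean every time.
means_full = sample_means(population, sample_size=len(population), n_samples=5)
spread_full = statistics.stdev(means_full)

print(f"population mean:               {population_mean:.2f}")
print(f"spread of sample means (n=50): {spread_n50:.2f}")
print(f"spread of sample means (n=N):  {spread_full:.2f}")
```

The spread of the sample means at n=50 is the quantity that standard errors and confidence intervals try to describe; with a complete enumeration that spread is zero.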
II. Establish the scientific goals
• Broadly speaking one can classify scientific goals as:
  § description or exploration of a population
  § evaluation of some hypothesis
  § prediction of future outcomes
• A single analysis may have several goals
  § depends on scientific setting and background
Lung cancer surgery
Q: Is there an association between time spent in the operating room and post-surgical outcomes?
• Description/exploration:
  § what is the nature of the association?
  § does the association differ across surgery types?
• Hypothesis testing:
  § a priori hypothesis among the collaborators that shorter times are associated with better post-surgical outcomes
Breast cancer risk
Q: Can we develop an enhanced breast cancer risk model?
• Prediction:
  § use all the available information in the best possible way to predict the risk of breast cancer
  § build prediction models that cater to specific settings with varying amounts/types of information?
    o at home/online
    o in the physician’s office
• Why might description/exploration and hypothesis testing be of less interest?
Description/exploration
• Goal is to characterize the relationships among a set of covariates in the population of interest
• An important issue is whether or not the goal is to establish causation
  § typically requires a greater understanding of the science
• Typically, although not always, viewed as hypothesis generating
  § we have a cool dataset, let’s see what we can find...
  § there is a fine, often blurry line between exploration and hypothesis testing
    o what came first... the data or the question?
Hypothesis testing
• Goal is to make some confirmatory statement
• Typically framed in the context of making a ‘decision’ between two competing hypotheses
  $H_0$: null hypothesis
  $H_1$: alternative hypothesis
• Assume the null hypothesis holds and look for evidence to the contrary
• Standard hypothesis testing reduces the potential decisions to:
  1. fail to reject $H_0$
  2. reject $H_0$ (implicitly in favor of $H_1$)
  § decision should be accompanied by some measure of uncertainty
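The logic of assuming the null hypothesis and looking for evidence to the contrary can be sketched with a permutation test. This is one possible test, chosen here because it needs no distributional formulas; it is not presented as the method used in the course. The hospital-stay values, the 0.05 threshold, and the function names are all hypothetical.

```python
import random


def mean(values):
    return sum(values) / len(values)


def permutation_p_value(group_a, group_b, n_permutations=5000, seed=0):
    """Two-sided permutation p-value for a difference in group means.

    Under H0 the group labels are exchangeable, so we shuffle the pooled
    values and count how often a relabelled difference is at least as
    large as the one actually observed."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (extreme + 1) / (n_permutations + 1)


def decide(p_value, alpha=0.05):
    """Reduce the analysis to one of the two standard decisions."""
    return "reject H0" if p_value < alpha else "fail to reject H0"


# Hypothetical hospital stays in days (illustrative values only).
short_ops = [4, 5, 5, 6, 4, 5, 6, 5, 4, 6]
long_ops = [7, 8, 6, 9, 7, 8, 10, 7, 9, 8]
same_as_short = [5, 4, 6, 5, 5, 6, 4, 5, 6, 4]

p_diff = permutation_p_value(short_ops, long_ops)
p_same = permutation_p_value(short_ops, same_as_short)
print(decide(p_diff), p_diff)
print(decide(p_same), p_same)
```

Reporting the p-value alongside the decision is one way of attaching a measure of uncertainty to it.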
Prediction
• Goal is to estimate future outcomes or risk
  § typically framed in terms of building the best possible model
• What do we mean by ‘best’?
  § need some means of judging accuracy and penalizing poor predictions
  § ideally based on real world consequences
    o e.g. false-positive vs. false-negative for breast cancer
• Sometimes a single best model is inappropriate
  § a model may work well in one population and not others
  § inputs may not always be available (e.g. genetic information)
• To what extent do we need to care about causation?
  § do we need to understand the ‘true’ underlying mechanisms?
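One way to make ‘best’ concrete is to attach a cost to each kind of error and compare candidate decision rules by their average cost. The sketch below assumes a model that outputs a predicted risk and a rule that flags anyone at or above a threshold; the risks, outcomes, thresholds, and costs are all hypothetical.

```python
def expected_cost(risks, outcomes, threshold, cost_fp, cost_fn):
    """Average cost per person when flagging predicted risk >= threshold.

    outcomes: 1 if the event occurred, 0 otherwise.
    cost_fp / cost_fn: assumed costs of a false positive / false negative."""
    total = 0.0
    for risk, outcome in zip(risks, outcomes):
        flagged = risk >= threshold
        if flagged and outcome == 0:
            total += cost_fp  # false positive
        elif not flagged and outcome == 1:
            total += cost_fn  # false negative
    return total / len(risks)


def best_threshold(risks, outcomes, candidates, cost_fp, cost_fn):
    """Return the candidate threshold with the lowest average cost."""
    return min(
        candidates,
        key=lambda t: expected_cost(risks, outcomes, t, cost_fp, cost_fn),
    )


# Hypothetical predicted risks and observed outcomes.
risks = [0.02, 0.05, 0.10, 0.20, 0.30, 0.40, 0.60, 0.80]
outcomes = [0, 0, 1, 0, 0, 0, 1, 1]
candidates = [0.1, 0.25, 0.5, 0.7]

# Equal costs favour a stricter rule; a costly false negative favours a looser one.
equal_costs = best_threshold(risks, outcomes, candidates, cost_fp=1, cost_fn=1)
costly_miss = best_threshold(risks, outcomes, candidates, cost_fp=1, cost_fn=5)
print(equal_costs, costly_miss)
```

The same predictions lead to different ‘best’ rules once the consequences of each error type change, which is why the costs should reflect the real-world setting.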
The real world
• Unfortunately, the scientific goals are not always clear at the outset
• Typically, it is the case that:
  § there are many scientific goals that are of interest, and/or
  § the goal can be interpreted in a number of ways
• Primarily a problem because investigators need precise statements to be able to proceed
  § to translate the scientific goals into statistical ones
• Towards refining study goals, a couple of useful questions are:
  1) who is the intended (primary) audience?
  2) what will be actionable from the results?
• Consider the question: What is Mrs. Jones’ risk of breast cancer?
• How one proceeds depends, at least in part, on how this information will be used:
  Researchers
    o determine eligibility for a randomized study of some novel preventative agent
  Patients
    o decision as to whether or not she should get in touch with her physician
  Physicians
    o planning for future screening schedule
  Policy-makers
    o monitor the public health burden of breast cancer
• Related questions include:
  § is interest in all breast cancers or some specific tumor type?
  § risk over which time frame?
    o 1 year?
    o 5 years?
    o lifetime?
  § how much information will the interested ‘user’ have access to?
    o will detailed family history information be available?
    o will genetic information be available?
• Different answers to all these questions define different scientific goals
III. Translating scientific goals into statistical terms/tasks
• Once the scientific goals are ‘established’ we need to translate them into the language of statistics
• Moving forward requires:
  § precise and clear definitions of all relevant covariates
  § specification of key relationships of interest
Scientific goal            Statistical task
Description/exploration    Estimation
Hypothesis testing         Inference
Prediction                 Estimation
Precisely defining covariates
• Each of the potential goals is trying to say something about the relationships among a set of covariates
• Prior to any analysis we need clear definitions for all relevant covariates:
  § response variables
  § exposure(s) of interest
  § interaction terms and/or effect modifiers
  § predictors of the response
  § predictors of the exposure(s) of interest
• There will be overlap across these various types of variables
  § e.g., a covariate may be a predictor of both the response and of the exposure of interest
• Often not as straightforward as one might think, mainly because there is often choice involved
• Suppose the response of interest is ‘diagnosis of breast cancer’
  § over which time frame?
  § for which sub-types?
• Suppose the exposure of interest is ‘operating time’
  § when does time start?
  § when does time stop?
• Define (and perhaps re-define) until everything is clear!
Lung cancer surgery
Q: Is there an association between time spent in the operating room and post-surgical outcomes?
• Responses:
  § hospital stay of >7 days (binary)
  § number of major complications during hospital stay (count)
    o need a list of ‘major’ complications
  § time to death (continuous, right-censored)
• Exposure of interest:
  § operating time defined as the time from the first incision to the time of the first stitch to close up (continuous)
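Writing the definitions down as code is one way to check that nothing has been left ambiguous. The sketch below encodes the responses and exposure listed above for a single surgery; the class name, field names, and example values are hypothetical and do not describe the actual study dataset.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class SurgeryRecord:
    first_incision: datetime
    first_closing_stitch: datetime
    hospital_stay_days: int
    major_complications: int  # count, based on an agreed list of 'major' complications
    followup_days: float      # time from surgery to death or to last contact
    died: bool                # False means right-censored at followup_days

    def operating_time_minutes(self) -> float:
        """Exposure: first incision to first closing stitch (continuous)."""
        delta = self.first_closing_stitch - self.first_incision
        return delta.total_seconds() / 60

    def long_stay(self) -> int:
        """Binary response: hospital stay of more than 7 days."""
        return int(self.hospital_stay_days > 7)

    def time_to_death(self) -> tuple:
        """Right-censored response as (time, event indicator)."""
        return (self.followup_days, int(self.died))


# Hypothetical record for illustration.
record = SurgeryRecord(
    first_incision=datetime(2005, 3, 1, 8, 30),
    first_closing_stitch=datetime(2005, 3, 1, 11, 15),
    hospital_stay_days=9,
    major_complications=1,
    followup_days=730.0,
    died=False,
)
print(record.operating_time_minutes(), record.long_stay(), record.time_to_death())
```

Each method forces a choice that the slide raises as a question, such as when operating time starts and stops, and how a patient who is still alive at last contact is represented.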
Breast cancer risk
Q: Can we develop an enhanced breast cancer risk model?
• Response:
  § diagnosis of breast cancer within 1 year of the screening mammogram (binary)
• Exposure of interest:
  § age, race, education, breast density, HRT use...
  § a total of 13 potential predictors
  § all categorical
    o at least in the available dataset
IV. Choosing statistical methods
• One way of viewing all the statistical methods available is as a collection of tools
  § different statistical tools for different statistical tasks
  § develop understanding of a collection of tools over the course of your career
• A toolbox of statistical tools/methods
  § basic methods, that everyone should be able to use
  § specialized methods
    o sophisticated tools that require ‘training’
    o constantly being developed and published in the literature
  § sometimes new questions require new methods
• For the most part, the tools that researchers employ are determined by the issues we’ve considered so far
  § scientific goals
  § nature of the available data
  § population of interest
• Even given all this information, there are often several choices of statistical tools/methods
• How to choose between all the available approaches?
  § interpretation (to be discussed later)
  § operating characteristics
    o e.g. bias and statistical efficiency
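Operating characteristics such as bias and efficiency can be studied by simulation: generate many datasets from a known truth, apply each candidate method, and compare how the estimates behave. The sketch below compares the sample mean and the sample median as estimators of the centre of a normal distribution. The choice of estimators, the normal model, and all settings are assumptions made for illustration.

```python
import random
import statistics


def operating_characteristics(estimator, true_value, n, n_sims=2000, seed=0):
    """Estimate the bias and variance of `estimator` by simulation.

    Each simulated dataset has n draws from Normal(true_value, 1)."""
    rng = random.Random(seed)
    estimates = [
        estimator([rng.gauss(true_value, 1.0) for _ in range(n)])
        for _ in range(n_sims)
    ]
    bias = statistics.fmean(estimates) - true_value
    variance = statistics.variance(estimates)
    return bias, variance


# Same seed, so both estimators are applied to the same simulated datasets.
mean_bias, mean_var = operating_characteristics(statistics.fmean, 10.0, n=25)
median_bias, median_var = operating_characteristics(statistics.median, 10.0, n=25)

print(f"mean:   bias={mean_bias:+.4f}  variance={mean_var:.4f}")
print(f"median: bias={median_bias:+.4f}  variance={median_var:.4f}")
```

Under these assumptions both estimators are close to unbiased, but the mean has the smaller variance, so it is the more efficient choice here. With heavy-tailed or contaminated data the comparison can go the other way, which is why the nature of the data matters when choosing a tool.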
V. Implementation and running the analysis
• Seemingly the most ‘prescriptive’ of the steps
  § in a perfect world, turn the handle... and you’re done!
• Unfortunately, actually performing the analysis is not always straightforward
• Many choices for statistical software
  § R, Matlab, SAS, Stata, WinBUGS, ...
  § each has numerous resources, including already-written code available online
  § not all methods have been implemented in all software packages
• Performing the analyses can also highlight all sorts of problems
  § EDA might highlight data issues
    o missing data
    o unusual values
    o unusual observed relationships
• Issues like this may require a re-think of the scientific goals
  § if you can’t answer this question, which question can you answer?
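A first exploratory pass often amounts to counting missing entries and flagging values outside a plausible range. The sketch below does this for one numeric variable; the operating times, the plausible range of 30 to 720 minutes, and the function name are hypothetical.

```python
import statistics


def screen_variable(values, lower, upper):
    """Summarize missing and out-of-range entries for one numeric variable.

    None marks a missing entry; lower/upper are assumed plausibility bounds."""
    missing = [i for i, v in enumerate(values) if v is None]
    observed = [v for v in values if v is not None]
    unusual = [
        i for i, v in enumerate(values)
        if v is not None and not (lower <= v <= upper)
    ]
    return {
        "n": len(values),
        "n_missing": len(missing),
        "missing_rows": missing,
        "unusual_rows": unusual,
        "median_observed": statistics.median(observed) if observed else None,
    }


# Hypothetical operating times in minutes.
operating_minutes = [165, 142, None, 201, 12, 188, None, 1750, 156]
report = screen_variable(operating_minutes, lower=30, upper=720)
print(report)
```

The flagged rows are prompts for discussion with collaborators, not automatic exclusions: whether a value is an error or a genuine extreme is a scientific question.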
VI. Interpretation
• It’s important to distinguish interpretation of the model from interpretation of the results
• Specification of the model is something that we have control over
  § it should be straightforward to provide a precise interpretation of its components
    o you cannot be pedantic enough on this point
  § should be able to do this before you even see the data
• Consider the linear regression model: $E[Y \mid X] = \beta_0 + \beta_1 X$
  § How do we interpret $\beta_1$?
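One way to make the interpretation concrete, assuming the mean model holds as written, is to compare two sub-populations whose values of X differ by one unit:

```latex
\begin{aligned}
E[Y \mid X = x + 1] - E[Y \mid X = x]
  &= \bigl(\beta_0 + \beta_1 (x + 1)\bigr) - \bigl(\beta_0 + \beta_1 x\bigr) \\
  &= \beta_1 .
\end{aligned}
```

So $\beta_1$ is the difference in the mean response between two groups that differ by one unit in $X$, and $\beta_0$ is the mean response when $X = 0$. This is a comparison of means across groups; on its own it is not a statement about what would happen to an individual if their $X$ were changed.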
Interpretation of the results
• Here are some results... what does it all mean?!?
  § translation of statistics back to science
• Interpreting the results requires a detailed understanding of both the scientific and statistical context
  § usually requires discussion with collaborators
• Sometimes the results don’t support the initial hypotheses!
  § e.g., Breitner et al (2008) Neurology
  § Risk of dementia and AD with prior exposure to NSAIDs in an elderly community-based cohort
  § see the next slide
• These can be particularly challenging situations
• Are these results ‘right’?
  § are we misinterpreting our assumptions/models?
  § are there data issues that we aren’t aware of?
  § is the code wrong?
  § are the results sensitive to particular analysis choices?
• It may be that the results are ‘right’
  § perhaps a new understanding of the mechanism of interest
  § perhaps the results pertain to a population that hasn’t been studied before
Learning about populations
• It is seldom possible to specify one, single target population
  § often the case there are many interesting target populations
• Flexibility to consider different populations depends on whether or not the sample has been collected
• If the sample has not been collected, one might consider
  § a range of scientific questions
  § the feasibility of collecting data across different populations
• If the sample has been collected, flexibility depends on the nature and scope of the available data
Breast cancer screening
• Broad goal of screening is to detect cancer as early as possible
  § balance between public health goals and costs
  § cannot screen everyone all of the time
  § there are also ‘harms’ associated with screening
  § mammography is not perfect
  § real consequences associated with false-positives
• Current recommendations are (broadly):
  § all women aged 50 or older get screened every two years
  § also, women in their 40s who are at ‘high risk’
Q: How good is mammography as a screening modality?
  § answer depends, in part, on the population of interest
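One reason the answer depends on the population is that the chance a positive screen reflects true disease changes with how common the disease is, even when the test itself is unchanged. The sketch below applies Bayes' rule to compute the positive predictive value. The sensitivity, specificity, and prevalence values are hypothetical and are not estimates for mammography.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """P(disease | positive test), by Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)


# Hypothetical test characteristics, held fixed across populations.
sens, spec = 0.85, 0.90

# Two hypothetical populations that differ only in disease prevalence.
ppv_low = positive_predictive_value(sens, spec, prevalence=0.005)
ppv_high = positive_predictive_value(sens, spec, prevalence=0.05)

print(f"PPV at 0.5% prevalence: {ppv_low:.3f}")
print(f"PPV at 5% prevalence:   {ppv_high:.3f}")
```

With the same test, most positives in the low-prevalence group are false positives, while a much larger share are true positives in the higher-prevalence group.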
• Rosenberg et al (2006) Radiology.
  § all women who undergo screening mammography
• Yankaskas et al (2010) JNCI.
• Miglioretti et al (2004) JAMA.
• Goldman et al (2008) Medical Care.
Remarks
• Except in the most trivial of settings, the data analysis process is collaborative and iterative
• How you proceed will depend on many things:
  § the nature of the data
  § your philosophy
  § the philosophy of your collaborators
• Getting the science ‘right’ is often the hardest part
  § goals are seldom precise at the outset
  § going back-and-forth between the science and statistics is typically a very instructive process
  § to do a good job usually requires knowledge of the science
• More often than not, there is scope for prescription as well as for creativity
  § sometimes there is an obvious way forward
  § other times there isn’t
• What came first... the question or the data?
• There is seldom one ‘right’ scientific question or data analysis
  § Box and Draper (1987):
    “Essentially, all models are wrong, but some are useful.”