Statistical Thinking
Based on C. J. Wild and M. Pfannkuch (1999). Statistical Thinking in Empirical Enquiry, International Statistical Review, 67(3): 223-265.
+ Professor Matt Waite’s notes
Basic Ideas
• Thought processes involved in statistical problem solving
  • From problem formulation to conclusions
• A four-dimensional framework for statistical thinking in empirical enquiry
  • Investigative cycle
  • Interrogative cycle
  • Types of thinking
  • Dispositions
• Central element: “variation”
Four-Dimensional Framework
Dimension 1: The Investigative Cycle
• Concerned with abstracting and solving a statistical problem grounded in a larger “real” problem
• Based on the PPDAC model (Problem, Plan, Data, Analysis, Conclusions)
Dimension 2: Types of Thinking
• Variation
  • Statistical thinking is concerned with learning and decision making under uncertainty, for the purposes of explanation, prediction, or control
Dimension 2: More on Variation | Sources
Dimension 2: More on Variation | Predict, Explain, Control
Dimension 2: Summary on Variation
• Special-cause vs. common-cause variation
  • Useful when looking for causes
• Explained vs. unexplained variation
  • Useful when exploring data & building a model for them
• Suppositions
  • Variation is an observable reality
  • Some variation can be explained; other variation cannot be explained on current knowledge
  • Random variation is the way in which statisticians model unexplained variation
  • This unexplained variation may in part or in whole be produced by the process of observation through random sampling
  • Randomness is a convenient human construct which is used to deal with variation in which patterns cannot be detected
Correlation is NOT causation
Dimension 3: The Interrogative Cycle
• Applies at macro levels
• Applies also at very detailed levels of thinking
• Recursive
  • Subcycles are initiated within major cycles
Dimension 4: Dispositions
• When the authors become intensely interested in a problem or area, a heightened sensitivity and awareness develops towards information on the peripheries of our experience that might be related to the problem
  • People are most observant in areas they find most interesting
• Engagement intensifies each dispositional element
Types of Analytics
• Descriptive
  • Describing characteristics or properties in the data
• Predictive
  • Predicting the types of outcomes given new sets of data, usually based on a classifier trained using labelled, existing datasets
• Prescriptive
  • Deciding on the best route, option, or decision to make given the data
Types of Data
• Categorical (cf. Wikipedia)
  • A variable that can take on one of a limited, and usually fixed, number of possible values, assigning each individual or other unit of observation to a particular group or nominal category on the basis of some qualitative property
  • The blood type of a person: A, B, AB or O
  • The state that a person lives in
  • The political party that a voter might vote for
  • The type of a rock: igneous, sedimentary or metamorphic
  • Ordinal data?
• Numerical
  • Can be subdivided into discrete data (things that can be counted) and continuous data (any value within a range)
  • Number of children, age, scores, temperatures, etc.
Descriptive Statistics
• There are three main groups of descriptives
• The distribution
  • Works well with categorical data. How many of each thing are there?
• The central tendency
  • Mean and median only work with numerical data (the mode can also be found for categorical data). What is the mean, median and mode?
• The dispersion
  • Only works with numerical data. How spread out is the data?
Descriptive Statistics: Distribution
• Grouping and counting by categorical data: group and count by town, or zip code, or something like that
  • Often called a frequency distribution
  • Histogram
• With numerical data, minimum and maximum values are useful
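A minimal sketch of a frequency distribution, assuming a small made-up list of town names and ages (both are hypothetical illustration data, not from the slides). It groups and counts the categorical values, then reports the minimum and maximum of the numerical ones.

```python
from collections import Counter

# Hypothetical records: the town each respondent lives in, and their age.
towns = ["Lincoln", "Omaha", "Lincoln", "Kearney", "Omaha", "Lincoln"]
ages = [34, 51, 27, 45, 62, 38]

# Frequency distribution: how many of each category are there?
frequency = Counter(towns)
for town, count in frequency.most_common():
    print(f"{town}: {count}")

# For numerical data, the minimum and maximum give a first view of the spread.
youngest, oldest = min(ages), max(ages)
print(f"ages run from {youngest} to {oldest}")
```

`Counter.most_common()` lists the categories from most to least frequent, which is the usual order for reading a frequency table.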
Descriptive Statistics: Central Tendency
• Mean
  • Average or norm: add up all values to find a total, and then divide the total by the number of values
• Median
  • Middle value: sort all values into order, and the median is the middle value; if there are 2 values in the middle, find the mean of these two
• Mode
  • Most frequent value: count how many times each value appears; the mode is the value that appears the most
  • Can have more than one mode
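The three measures can be sketched with Python's standard `statistics` module. The score lists are hypothetical; `multimode` is used because, as noted above, a data set can have more than one mode.

```python
import statistics

scores = [2, 3, 3, 5, 7, 10]   # hypothetical scores, already sorted for readability

mean = statistics.mean(scores)        # (2+3+3+5+7+10) / 6 = 5
median = statistics.median(scores)    # two middle values 3 and 5, so (3+5)/2 = 4.0
modes = statistics.multimode(scores)  # 3 appears twice, more than any other value

# A data set with two equally frequent values has two modes.
two_modes = statistics.multimode([1, 1, 2, 2, 3])

print(mean, median, modes, two_modes)
```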
Descriptive Statistics: Dispersion
• Range
  • Difference between the lowest and highest values
  • Subject to extremes (e.g., outliers)
• Standard deviation
  • Describes how a set of scores relates to the mean: roughly, the typical distance of a score from the mean
  • Subject to skewness in distribution
• For a Gaussian/normal distribution
  • About 68% of all values will be within 1 standard deviation of the mean
  • About 95% will be within 2 standard deviations (about 99.7% within 3)
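A minimal sketch of both dispersion measures, followed by a simulated check of the normal-distribution percentages. The score list, the helper name `share_within`, the seed, and the sample size are all assumptions chosen for illustration.

```python
import random
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical scores

value_range = max(values) - min(values)   # 9 - 2 = 7
sd = statistics.pstdev(values)            # population standard deviation = 2.0

def share_within(data, k):
    """Fraction of data lying within k standard deviations of its own mean."""
    m = statistics.fmean(data)
    s = statistics.pstdev(data)
    return sum(abs(x - m) <= k * s for x in data) / len(data)

# Simulate draws from a standard normal distribution (fixed seed for repeatability).
rng = random.Random(0)
draws = [rng.gauss(0, 1) for _ in range(20_000)]

within_1 = share_within(draws, 1)   # expected to be close to 0.68
within_2 = share_within(draws, 2)   # expected to be close to 0.95
within_3 = share_within(draws, 3)   # expected to be close to 0.997

print(value_range, sd)
print(round(within_1, 3), round(within_2, 3), round(within_3, 3))
```

The percentages only hold approximately, and only for data that are roughly normal; a small or skewed sample such as `values` need not follow them.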
Dirty Data
• Missing data
  • Blanks in the database or spreadsheet.
  • Data missing from a period of time.
  • Missing states, counties, zip codes.
• Wrong data
  • Wrong type: numbers where they should be text and vice versa
  • Sharp curves: trends that continue normally and then suddenly jump in one year
  • Conflicting data within a dataset or across datasets (race, percentages, etc.)
• Unusable data
  • Non-standardized data
  • Inconsistent data
  • Abbreviations
  • Unit consistency
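A few of these checks can be sketched in plain Python. The rows, the column names (`state`, `year`, `count`), and the `STATE_ABBREVIATIONS` lookup are hypothetical; real data would need checks tailored to its own columns.

```python
# Hypothetical rows with three kinds of problems: a blank, a wrong type,
# and an inconsistent state name.
rows = [
    {"state": "NE", "year": 2019, "count": 120},
    {"state": "Nebraska", "year": 2020, "count": "131"},   # count stored as text
    {"state": "", "year": 2021, "count": None},            # blanks
]

# Missing data: rows where any field is blank or None.
missing = [i for i, row in enumerate(rows)
           if any(value is None or value == "" for value in row.values())]

# Wrong type: counts that are present but not stored as integers.
wrong_type = [i for i, row in enumerate(rows)
              if row["count"] is not None and not isinstance(row["count"], int)]

# Non-standardized data: map full names onto one abbreviation scheme.
STATE_ABBREVIATIONS = {"Nebraska": "NE"}   # assumed lookup, extend as needed
standardized = [STATE_ABBREVIATIONS.get(row["state"], row["state"]) for row in rows]

print(missing, wrong_type, standardized)
```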
Correlation
• Pearson correlation coefficient (or Pearson product-moment correlation coefficient)
  • It is a measure of how LINEARLY related two entities are.
  • How often is a change in A related to a change in B? And is that positive or negative?
Correlation: For a population
https://en.wikipedia.org/wiki/Pearson_correlation_coefficient
Standard deviation of X; standard deviation of Y
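For reference, the population coefficient is conventionally written as the covariance of X and Y divided by the product of the two standard deviations named above. The symbols (ρ, σ, μ, cov) are the usual textbook notation and are assumed here.

```latex
\rho_{X,Y} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \, \sigma_Y}
           = \frac{\mathbb{E}\left[(X - \mu_X)(Y - \mu_Y)\right]}{\sigma_X \, \sigma_Y}
```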
Correlation: For a sample
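For a sample of n paired observations, the coefficient is conventionally estimated by replacing the population quantities with sums over the data. The notation below (r, sample means written with bars) is the standard form and is assumed here.

```latex
r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
              {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \, \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```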
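Correlation: What does it mean?
• It is based on a range from -1 to 1.
• 1 = perfect positive correlation
  • Every time A goes up, B goes up by a proportional amount (e.g., A goes up 1, B goes up 1)
  • In the real world, almost never happens outside of a mistake
• 0 = no linear correlation at all
  • Exactly 0 rarely ever happens
  • NEAR zero happens all the time
• -1 = perfect negative correlation
  • Every time A goes up, B goes down by a proportional amount (e.g., A goes up 1, B goes down 1)
  • It is just like 1: rare, probably a mistake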
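A minimal sketch of the sample Pearson coefficient in plain Python, showing the three cases above. The function name `pearson_r` and the data lists are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 values each")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (numerator of the coefficient).
    co_deviation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (their product's square root is the denominator).
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return co_deviation / math.sqrt(ss_x * ss_y)

a = [1, 2, 3, 4, 5]
r_up = pearson_r(a, [2, 4, 6, 8, 10])     # B rises in exact proportion: 1.0
r_down = pearson_r(a, [10, 8, 6, 4, 2])   # B falls in exact proportion: -1.0
r_noisy = pearson_r(a, [2, 1, 4, 3, 5])   # mostly rising, with noise: 0.8

print(r_up, r_down, r_noisy)
```

The function divides by zero if either sequence is constant, because a constant has no variation to correlate with.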
Significance: t-test
• The t-test is any statistical hypothesis test in which the test statistic follows a Student's t-distribution under the null hypothesis.
• A t-test is most commonly applied when the test statistic would follow a normal distribution if the value of a scaling term in the test statistic were known
• When the scaling term is unknown and is replaced by an estimate based on the data, the test statistic (under certain conditions) follows a Student's t-distribution
• The t-test can be used, for example, to determine if two sets of data are significantly different from each other
https://en.wikipedia.org/wiki/Student%27s_t-test
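A minimal sketch of one common two-sample form, Welch's unequal-variance t statistic. The slides do not say which variant is meant, so the choice of Welch's form, the function name `welch_t`, and the sample values are assumptions. Here the "scaling term" is the standard error, estimated from the sample variances.

```python
import math
import statistics

def welch_t(sample_a, sample_b):
    """Welch's two-sample t statistic (does not assume equal variances)."""
    mean_a = statistics.fmean(sample_a)
    mean_b = statistics.fmean(sample_b)
    # Sample variances (n - 1 in the denominator) estimate the unknown scale.
    var_a = statistics.variance(sample_a)
    var_b = statistics.variance(sample_b)
    standard_error = math.sqrt(var_a / len(sample_a) + var_b / len(sample_b))
    return (mean_a - mean_b) / standard_error

# Hypothetical measurements from two groups.
t_small = welch_t([1, 2, 3], [2, 3, 4])               # means differ by 1
t_large = welch_t([1, 2, 3], [11, 12, 13])            # means differ by 10
t_none = welch_t([1, 2, 3, 4], [4, 3, 2, 1])          # identical means, so 0.0

print(t_small, t_large, t_none)
```

The further the statistic is from zero, the harder the observed difference is to explain by chance alone; turning it into a p-value additionally requires the t-distribution with the appropriate degrees of freedom, which this sketch does not compute.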
Significance: p-value & null hypothesis
• In the context of null hypothesis testing, the p-value is used to quantify the statistical significance of evidence
  • In essence, a claim is assumed valid if its counter-claim is improbable
• The only hypothesis that needs to be specified in this test, and which embodies the counter-claim, is referred to as the null hypothesis
  • i.e., the hypothesis to be nullified
• A result is said to be statistically significant if it allows us to reject the null hypothesis
  • The statistically significant result should be highly improbable if the null hypothesis is assumed to be true
• The rejection of the null hypothesis implies that the correct hypothesis lies in the logical complement of the null hypothesis
• Caveat: Unless there is a single alternative to the null hypothesis, the rejection of the null hypothesis does not tell us which of the alternatives might be the correct one
https://en.wikipedia.org/wiki/Student%27s_t-test
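The idea that a significant result "should be highly improbable if the null hypothesis is assumed to be true" can be illustrated with a permutation test. This is a different technique from the t-test, swapped in because it needs no distribution tables. The null hypothesis here is that group labels do not matter; the function name, seed, and data are assumptions for illustration.

```python
import random
import statistics

def permutation_p_value(group_a, group_b, n_permutations=5_000, seed=0):
    """Estimate how often shuffled labels give a mean difference at least
    as large as the observed one (two-sided)."""
    rng = random.Random(seed)
    observed = abs(statistics.fmean(group_a) - statistics.fmean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)   # relabel at random, as the null hypothesis allows
        diff = abs(statistics.fmean(pooled[:n_a]) - statistics.fmean(pooled[n_a:]))
        if diff >= observed - 1e-12:   # small tolerance for floating-point error
            at_least_as_extreme += 1
    # Add 1 to both counts so the estimate is never exactly zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)

# Hypothetical groups: clearly separated, and with identical means.
p_separated = permutation_p_value([10, 11, 12, 13, 14, 15], [1, 2, 3, 4, 5, 6])
p_same = permutation_p_value([1, 2, 3, 4], [4, 3, 2, 1])

print(p_separated, p_same)
```

A small p-value for the separated groups says the observed difference would be rare if labels were arbitrary, so the null hypothesis is rejected. As the caveat above notes, it does not say which alternative explanation is correct.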