Date post: 21-Jul-2020
Data Analysis, Statistics, Machine Learning

Leland Wilkinson
Adjunct Professor, UIC Computer Science
Chief Scientist, H2O.ai
[email protected]

Exploring
o Exploratory Data Analysis (John W. Tukey, EDA)
  o Summaries
  o Transformations
  o Smoothing
  o Robustness
  o Interactivity
  o What EDA is not …
    o Letting the data speak for itself
    o Fishing expeditions
    o Null hypothesis testing
o Qualitative Data Analysis
  o Mixed methods
  o Old wine in new bottles

Copyright  ©  2016  Leland  Wilkinson  

Exploring  

“Probability modelers seem to want to believe that their models are entirely correct. Data analysts regard their models as a basis from which to measure deviation, as a convenient benchmark in the wilderness, expecting little truth and relying on less.”

Tukey  (1979)  

Exploring
o Summaries
  o Letter values
    o M (median): sort and split batch
    o H (hinges): split each half as if it were a new batch
    o E (eighths): split again, and so on …
  o Medians and hinges yield a 5-number summary
    o 1. lower extreme
    o 2. lower hinge
    o 3. median
    o 4. upper hinge
    o 5. upper extreme
  o H-spread is (upper hinge – lower hinge)
  o Range is (upper extreme – lower extreme)

[Figure: diagram labeling the extreme, hinge, median, hinge, and extreme of a batch]
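
The letter-value recipe above can be sketched in a few lines of Python. This is a minimal illustration, not the slide author's code: it assumes Tukey's depth convention (median depth (n + 1) / 2, hinge depth (floor(median depth) + 1) / 2), and the function names are hypothetical.

```python
import math


def value_at_depth(sorted_batch, depth):
    """Value at a 1-based depth from the low end; half-integer depths average two neighbors."""
    lo = sorted_batch[math.floor(depth) - 1]
    hi = sorted_batch[math.ceil(depth) - 1]
    return (lo + hi) / 2


def five_number_summary(batch):
    """Return (lower extreme, lower hinge, median, upper hinge, upper extreme)."""
    xs = sorted(batch)
    n = len(xs)
    median_depth = (n + 1) / 2
    # Assumed convention: hinges sit halfway into each half, counting the median position.
    hinge_depth = (math.floor(median_depth) + 1) / 2
    median = value_at_depth(xs, median_depth)
    lower_hinge = value_at_depth(xs, hinge_depth)
    upper_hinge = value_at_depth(xs, n + 1 - hinge_depth)
    return xs[0], lower_hinge, median, upper_hinge, xs[-1]


def h_spread(batch):
    """H-spread is upper hinge minus lower hinge."""
    _, lower_hinge, _, upper_hinge, _ = five_number_summary(batch)
    return upper_hinge - lower_hinge
```

Hinges computed this way can differ slightly from quartiles returned by statistical packages, which use a variety of interpolation rules.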

Exploring
o Summaries
  o What letter values reveal
    o Symmetry
    o Outliers
      o A step is 1.5 times H-spread
      o Inner fences are 1 step outside hinges
      o Outer fences are 2 steps outside hinges
      o Adjacent values are those at each end closest to, but still inside, inner fences
      o Outside values are between inner fence and neighboring outer fence
      o Far out values are beyond outer fences toward extremes

[Figure: plot with outside and far out values labeled]
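
The fence rules above translate directly into code. The following Python sketch takes the hinges as given inputs; the function name and the returned dictionary layout are assumptions made for illustration.

```python
def classify_by_fences(batch, lower_hinge, upper_hinge):
    """Apply the step and fence rules to a batch, given its two hinges."""
    step = 1.5 * (upper_hinge - lower_hinge)
    inner = (lower_hinge - step, upper_hinge + step)
    outer = (lower_hinge - 2 * step, upper_hinge + 2 * step)

    # Values that lie on or inside the inner fences.
    inside = [x for x in batch if inner[0] <= x <= inner[1]]
    return {
        "inner_fences": inner,
        "outer_fences": outer,
        # Adjacent values: the most extreme data values still inside the inner fences.
        "adjacent": (min(inside), max(inside)),
        # Outside values: past an inner fence but not past the neighboring outer fence.
        "outside": sorted(
            x for x in batch
            if outer[0] <= x < inner[0] or inner[1] < x <= outer[1]
        ),
        # Far out values: past an outer fence.
        "far_out": sorted(x for x in batch if x < outer[0] or x > outer[1]),
    }
```

For the batch 1, 2, …, 9, 18, 30 with hinges 3.5 and 8.5, the step is 7.5, so the inner fences fall at -4 and 16 and the outer fences at -11.5 and 23.5. The value 18 is then outside and 30 is far out.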

Exploring
o Transformations
  o Tukey Ladder of Powers (re-expressions)
    o Assume data are positive, or use X + 1 if non-negative
    o Tukey formula
      o $X \mapsto X^p$
    o Box & Cox formula (derived from Tukey’s idea)
      o $X \mapsto (X^p - 1) / p$
    o Values of p
      o p = 2 yields $X^2$
      o p = 1 yields $X$
      o p = .5 yields $\sqrt{X}$
      o p = 0 yields $\log(X)$
      o p = -1 yields $1 / X$
    o For Box & Cox formula
      o p = 0 yields $\log(X)$ because $\lim_{p \to 0} (X^p - 1) / p = \log(X)$
      o Also, dividing by p in Box & Cox formula preserves polarity of X
    o Ascending the ladder (p > 1) spreads out large values and compresses small values. Descending the ladder (p < 1) compresses large values and spreads out small values.
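
As a rough illustration of the two formulas, the Python sketch below implements both re-expressions for positive values. The function names are hypothetical, and treating p = 0 as the natural log in the Tukey version follows the ladder convention listed above.

```python
import math


def tukey_power(x, p):
    """Tukey ladder re-expression X -> X^p, with log(X) standing in at p = 0."""
    if x <= 0:
        raise ValueError("the ladder of powers assumes positive data")
    return math.log(x) if p == 0 else x ** p


def box_cox(x, p):
    """Box & Cox re-expression X -> (X^p - 1) / p, with its limit log(X) at p = 0."""
    if x <= 0:
        raise ValueError("the Box & Cox formula assumes positive data")
    return math.log(x) if p == 0 else (x ** p - 1) / p
```

With p = -1, `tukey_power` reverses the order of the data (2 maps to 0.5 and 4 maps to 0.25), whereas `box_cox` keeps it (2 maps to 0.5 and 4 maps to 0.75). That is the sense in which dividing by p preserves polarity.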

Exploring
o Transformations
  o Dealing with skewness
    o Positive skew: descend the ladder (p < 1)
    o Negative skew: ascend the ladder (p > 1)

[Figure: panels for p = 2, p = 1, p = .5, p = 0, p = -.5, p = -1]

[Figure: BRAINWEIGHT on the raw scale (0 to 6000) and as LOG10(BRAINWEIGHT) (-1 to 4)]

Exploring
o Transformations

Wilkinson,  Blank,  &  Gruber  (1996)  

Exploring
o Transformations
  o Spread-level plot
    o Divide batch into quintiles and plot H-spread against median (both on log scales)
    o 1 – slope of line is estimate of p
    o In this case, p = 0 is best choice
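
The "1 – slope" rule can be sketched as follows. This Python example assumes the usual spread-level convention of regressing log H-spread on log median, and it fits the line by ordinary least squares, a choice made here for brevity rather than one taken from the slide. The function name is hypothetical.

```python
import math


def estimate_power(medians, h_spreads):
    """Estimate the ladder power p as 1 minus the slope of log(H-spread) on log(median)."""
    xs = [math.log10(m) for m in medians]
    ys = [math.log10(h) for h in h_spreads]
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Ordinary least-squares slope (an assumption; a resistant line is another option).
    slope = (
        sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
        / sum((x - x_bar) ** 2 for x in xs)
    )
    return 1 - slope
```

When spread grows in proportion to level, for example medians 10, 100, 1000 with H-spreads 5, 50, 500, the slope is 1 and the estimate is p = 0, pointing to the log. When spread does not change with level, the slope is 0 and p = 1 leaves the data as they are.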

Exploring
o Smoothing
  o See any pattern here?

[Figure: Temperature (30 to 100) plotted against Day (0 to 80)]

Exploring
o Smoothing
  o Give yourself a medal if you saw this

[Figure: Temperature (30 to 100) plotted against Day (0 to 80)]

Velleman  &  Hoaglin  (1981)  

Exploring
o Smoothing
  o Data = smooth + rough
  o Data = fit + residuals
    o Fit a model
    o Compute residuals
    o Examine residuals for systematic variation
    o If residuals look nonrandom, fit a model to the residuals
    o Iterate
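
One way to make the smooth + rough decomposition concrete is with running medians of 3, a simple resistant smoother. The slide does not name a particular smoother, so that choice, the copy-the-endpoints rule, and the function names below are all assumptions.

```python
from statistics import median


def running_median3(ys):
    """Smooth a sequence with running medians of 3, copying the two end values."""
    smooth = list(ys)
    for i in range(1, len(ys) - 1):
        smooth[i] = median(ys[i - 1:i + 2])
    return smooth


def decompose(ys):
    """Split data into (smooth, rough) so that data = smooth + rough at every point."""
    smooth = running_median3(ys)
    rough = [y - s for y, s in zip(ys, smooth)]
    return smooth, rough
```

For the sequence 1, 2, 30, 4, 5 the smooth is 1, 2, 4, 5, 5 and the rough is 0, 0, 26, -1, 0, so the spike at the third point ends up in the rough. To iterate, the same smoother can be applied to the rough and the result added back to the smooth.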

 

Exploring
o Robustness
  o Tukey was skeptical regarding Gaussian assumption
  o Inspired a search for statistical estimators that are robust against outliers and other forms of contamination
  o Simple location estimators involved trimming outliers
    o Median
    o Winsorizing
    o Trimmed mean
  o Others (Tukey, Hampel, …) involved weighting functions
  o Peter Huber developed maximum-likelihood-like methods
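
The simple location estimators listed above can be compared on a small contaminated batch. In this Python sketch, the parameter `k` (the number of values trimmed or pulled in at each end) and the function names are illustrative assumptions.

```python
from statistics import mean, median


def trimmed_mean(batch, k):
    """Mean after dropping the k smallest and k largest values."""
    xs = sorted(batch)
    if 2 * k >= len(xs):
        raise ValueError("k trims away the whole batch")
    return mean(xs[k:len(xs) - k])


def winsorized_mean(batch, k):
    """Mean after pulling the k most extreme values at each end in to their nearest kept neighbor."""
    xs = sorted(batch)
    if 2 * k >= len(xs):
        raise ValueError("k leaves no values to Winsorize toward")
    lo, hi = xs[k], xs[-k - 1]
    return mean([min(max(x, lo), hi) for x in xs])
```

For the batch 1, 2, 3, 4, 100 the ordinary mean is 22, while `median`, `trimmed_mean` with k = 1, and `winsorized_mean` with k = 1 all give 3. The single wild value dominates the mean and barely touches the other three.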

Exploring
o Interactivity
  o Linking
  o Brushing
  o Projecting
  o Tukey, Friedman, Fisherkeller: Prim9
    o https://www.youtube.com/watch?v=B7XoW2qiFUA
  o Tukey and Friedman: Projection pursuit
    o https://www.youtube.com/watch?v=n5i9RLCe1rQ

Exploring
o Qualitative Data Analysis (QDA)
  o The QDA movement is a reaction against…
    o Quantitative analysis (mathematics in general, statistics in particular)
    o Scientific objectivism, realism, and positivism
    o Peer review (controversial within QDA community)
    o Educational testing
  o Subjectivity
    o Hermeneutics
      o Translational
      o Postmodernist
    o The researcher constructs own reality that others may not share
    o Reliance on “trustworthiness” instead of formal measures of validity
      o credibility, dependability, auditability, confirmability, corroboration
    o Focus on symbolic interpretations of icons (text, videos, …) leads to “mixed methods”
  o Fluidity
    o No predefined measures or hypotheses
    o Progressive data collection and coding leads to “grounded theory”
  o Politics
    o Peculiar QDA journals
    o Activism in academic departments

Exploring
o Qualitative Data Analysis
  o Nothing new here
    o Introspection (Wundt, …)
    o Clinical observation (Freud, Piaget, …)
    o Personal knowledge (Polanyi, …)
    o Participant observation (Malinowski, Mead, …)
    o Community psychology interviewing (Sarason, Levine, Kelly, …)
    o Group dynamics (Lewin, Bales, Slater, …)
  o Bottom line:
    o If you can’t quantify or qualify something, you don’t understand it
      o In science, understanding means being able to communicate to a rational person
      o In religion, understanding is a non-cognitive experience of the transcendent
      o In aesthetics, understanding is a judgment of taste (Kant)
      o But you can’t build a science on subjective or non-cognitive foundations
    o Quantification doesn’t mean simply assigning a number
      o It can mean “these two things are not comparable”
      o Or, “this is greater than that”
      o Or, “these two things are related”

Exploring
o Qualitative Data Analysis Alternatives
  o Text analysis (Shepard, Rosenberg, …)
    o Collect the data through simple comparisons (no numbers)
    o Scale them by exploiting distance and ordering constraints

Exploring
o Qualitative Data Analysis Alternatives
  o Sequence analysis (Agrawal’s Apriori algorithm)
  o Association rules

Exploring
o Qualitative Data Analysis Alternatives
  o Network analysis
    o No numbers here

Andris C, Lee D, Hamilton MJ, Martino M, Gunning CE, Selden JA (2015) The Rise of Partisanship and Super-Cooperators in the U.S. House of Representatives. PLoS ONE 10(4): e0123507.

Exploring
o Qualitative Data Analysis Alternatives
  o Innovative experimental paradigms
    o No numbers used here

Color vision and hue categorization in young human infants. Bornstein, Marc H.; Kessen, William; Weiskopf, Sally. Journal of Experimental Psychology: Human Perception and Performance, Vol 2(1), Feb 1976, 115-129.

"Infant looking at shiny object" by Mehregan Javanmard, Wikipedia

Exploring
o References
  o Andrews, D., P. Bickel, F. Hampel, P. Huber, W. Rogers, and J. W. Tukey (1972). Robust Estimates of Location: Survey and Advances. Princeton University Press.
  o Box, G. E. P. and Cox, D. R. (1964). An Analysis of Transformations. Journal of the Royal Statistical Society, pp. 211-243, discussion pp. 244-252.
  o Friedman, J. H. and Stuetzle, W. (2002). John W. Tukey’s Work on Interactive Graphics. Annals of Statistics, 30(6), 1629-1639.
  o Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J. and Stahel, W. A. (1986). Robust Statistics: The Approach Based on Influence Functions. Wiley.
  o Huber, P. J. and Ronchetti, E. M. (2009). Robust Statistics, 2nd ed. Wiley.
  o Mosteller, F. and Tukey, J. W. (1977). Data Analysis and Regression. Addison-Wesley.
  o Stigler, S. M. (2010). “The Changing History of Robustness”. The American Statistician, 64, 277-281.
  o Tukey, J. W. (1977). Exploratory Data Analysis. Addison-Wesley.
  o Tukey, J. W. (1979). Comment on Emanuel Parzen [Nonparametric statistical data modeling]. Journal of the American Statistical Association, 74, 121-122.
  o Velleman, P. and Hoaglin, D. (1981). The ABC’s of EDA: Applications, Basics, and Computing of Exploratory Data Analysis. Duxbury.
