
MULTIVARIATE ANALYSIS

WEST NILE VIRUS | CHICAGO

FARZAD ESKANDANIAN, MAX LI, JOYCE ROSE, NASIM SONBOLI

CSC 424 | ADVANCED DATA ANALYSIS

6|14|2015

The purpose of this paper is to discuss the models used to predict the presence or absence of the West Nile virus [WNV]. What distinguishes this multivariate analysis is its use of weather, temporal and spatial factors on the premise of time-based effects; that is, the models take into account the developmental stages of a mosquito. Four individual classifiers were built: 1) logistic regression using a generalized additive model (GAM), 2) linear discriminant analysis (LDA), 3) random forests, and 4) support vector machines (SVM). The best combination of parameters from each model was included in the ensemble model. Species, week number, location, moving temperature averages, moving precipitation averages and growing degree days played an important role in predicting WNV. The best overall ensemble classifier was a weighted average of GAM and SVM with weights of 0.6 and 0.4, respectively, and an AUC of 0.8361962.

INTRODUCTION

 

The West Nile virus (WNV) is “a mosquito borne disease-causing infectious agent” (Theophilides et al, 2006, para. 1) that affects birds, humans, and other animals. WNV was first reported in the United States in 1999. Since that initial occurrence, seasonal epidemics caused by WNV have been recorded, leading to a series of studies focused on understanding the features and characteristics of the virus. The research available on WNV indicates that “the infections caused by pathogens by way of a mosquito vector often cluster in space and time given the habitat requirements of the vectors and the vertebrate involved in the transmission.” (Ruiz et al, 2007, para 8).

In other words, West Nile virus transmission is attributed to patterns of climate, landscape, hydrology and types of human settlement. Ruiz et al (2010) argue that the statistical models built thus far by researchers are mere reports that only characterize associations between the virus and weather, landscape, human density, etc. Though they offer insights about WNV, the associations themselves are not enough to develop and implement preventive measures for future epidemics. The interesting aspect of the WNV challenge arises from the need to build a better model that takes into account the life cycle of the mosquitoes in relation to the variability in weather and its impact “on growth or activity of an organism.” Such a model can take a step beyond associations and indicate the best time and location for early intervention. The importance of building a robust model with predictive capabilities lies in the need to prevent future outbreaks. Therefore, the goal of this project is to build a model that uses weather, temporal and spatial factors to predict the West Nile virus.

DATA DESCRIPTION

Kaggle’s West Nile Virus challenge consists of the following datasets¹:

        Train    Weather   Spray    Test
Obs     10506    2944      14835    116293
Var     12       22        4        11

¹ The fields for the datasets can be found in Table 1 in the appendix titled “Data Fields”.

The datasets contain a combination of string and numeric variables.

“In many cases, some predictors have no values for a given sample. These missing data could be structurally missing” (Kuhn & Johnson, p.41). For instance, station 2 does not collect information on depart, depth, water1, snowfall, sunset and sunrise. These structurally missing values are denoted by “M”, “T”, or “-”. “In other cases, the value cannot or was not determined at the time of the model building” (Kuhn & Johnson, p.41). Examples of variables with such missing values are tavg, wetbulb, heat, cool, preciptotal, stnpressure, sea level, time [584 values] and average speed. Hence, both the spray data and the weather data contain missing values.

The missing values in the spray data are “concentrated in a subset of predictors” (Kuhn & Johnson, p.41). In other words, the 584 missing values all pertain to the time variable on 09/07/2011, where time was not recorded after 7:44:32 PM and before 7:46:30 PM. The non-structurally missing values in the weather dataset, however, appear to occur randomly across all the predictors. The counts of missing values for each of the predictor variables have been tabulated below.
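As a minimal sketch of how such counts could be produced, the snippet below tallies, per column, the cells that hold one of the codes the text lists as missing-value markers (“M”, “T”, “-”). The sample rows and the helper name `count_missing` are hypothetical and serve only as an illustration; they are not the project's data or code.

```python
from collections import Counter

# Codes the text lists as markers of missing weather values.
MISSING_CODES = {"M", "T", "-"}


def count_missing(rows):
    """Count, per column, how many cells hold one of the missing codes."""
    counts = Counter()
    for row in rows:
        for column, value in row.items():
            if value.strip() in MISSING_CODES:
                counts[column] += 1
    return dict(counts)


# Hypothetical weather records; column names follow fields named in the text.
sample_rows = [
    {"station": "1", "tavg": "67", "depart": "14", "preciptotal": "0.00"},
    {"station": "2", "tavg": "68", "depart": "M", "preciptotal": "0.00"},
    {"station": "1", "tavg": "M", "depart": "-3", "preciptotal": "T"},
    {"station": "2", "tavg": "52", "depart": "M", "preciptotal": "-"},
]

missing_counts = count_missing(sample_rows)
print(missing_counts)
```

Columns without any missing code, such as station in this sample, simply do not appear in the result.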

 

The response variable takes one of the two classes that the model aims to predict, namely the presence or absence of the West Nile virus [1, 0].

The explanatory variables are: maximum temperature, minimum temperature, average temperature, precipitation, result wind speed, result wind direction, species, trap, longitude, latitude, number of mosquitoes and address.

EXTERNAL DATASETS

Although Kaggle already provides a number of explanatory variables for the West Nile Virus challenge, there are ample opportunities to include external datasets containing other variables that could improve a predictive model’s performance. For example, Ruiz et al (2010) found that the amount of vegetation and the degree to which water would flow or remain in an area mediated the effect of weather in predicting the infection rate of West Nile virus. Socioeconomic factors that measured poverty also seemed to correlate with the presence of West Nile virus. Bringing in additional data from reliable government sources that reflect the aforementioned factors will help us fine-tune our predictive models.

MULTIVARIATE ANALYSIS

The main objective of a multivariate analysis is to use multiple data mining techniques to study how variables relate to one another. This method of analysis is most often used when the dataset contains more than one explanatory variable, more than one response variable, or both. Kaggle’s West Nile Virus dataset contains one response variable and 12 explanatory variables.

Using a multivariate analysis for such a dataset is desirable because the final outcome of accurately predicting the presence or absence of WNV might be influenced by more than one attribute. For instance, principal component analysis (PCA) can be used to “decompose a data table with correlated measurements into a new set of uncorrelated (i.e., orthogonal) variables” (Abdi, p.1). Performing PCA will determine the dominant trends in the dataset, upon which, for example, a logistic regression model can be applied.

Conducting a logistic regression alone with 12 explanatory variables may not produce a stable model if there is a strong dependence between predictors. PCA addresses the issue of multicollinearity, which can yield a regression model that estimates the response variable more reliably. Therefore, weighing the advantages and disadvantages of using one technique in conjunction with another, in light of the number of explanatory variables, gives a purpose to using multivariate analysis.

DATA COLLECTION

The dataset provided by the Chicago Department of Public Health and NOAA [National Oceanic and Atmospheric Administration] comprises weather data², GIS data³, dates on which traps were set [spanning 3 days each week for approximately 5 months], location of traps and species for the years between 2007 and 2014. The main dataset is split into a training and a testing dataset. The training dataset reflects data points collected for the odd years: 2007, 2009, 2011 and 2013, whereas the testing dataset consists of data points gathered for the even years: 2008, 2010, 2012 and 2014.

² Weather data has been collected only for dates on which the traps were set.
³ GIS data for spraying is only available from 2011 to 2013.

There are two central factors that serve as the premise for when and why the WNV data was collected. The first factor is weather. “It is believed that hot and dry conditions are more favorable for West Nile virus than cold and wet.” (Kaggle, information description, para. 9) Therefore, the dataset captures information about weather [from station 1, Chicago O’Hare International Airport, and station 2, Chicago Midway International Airport] only for the months of late May through early October. The second factor is the availability of data for the number of mosquitos trapped, location, species identified and the test results for the presence or absence of the West Nile virus. “Every year from late-May to early-October, public health workers in Chicago setup mosquito traps scattered across the city. Every week from Monday through Wednesday, these traps collect mosquitos, and the mosquitos are tested for the presence of West Nile virus before the end of the week.” (Kaggle, information description, para. 3)

It is no coincidence that traps are only set out in late spring through early fall, when the weather is conducive to population growth in mosquitos. Identifying the location of the traps, the number of mosquitos trapped, the species, and the frequencies of each species infected or not infected with the virus, in conjunction with weather, is crucial in understanding where the next sporadic growth of mosquitos will occur. After all, the goal of the predictive model is to identify the presence or absence of WNV by predicting the occurrence and the rate of mosquito growth in one particular location over another given a set of weather conditions. Such predictions can be used by the City of Chicago and CDPH “to efficiently and effectively allocate resources” to control the population growth of mosquitos, which in turn prevents the transmission of the “potentially deadly virus.”

DATA MERGING

The West Nile training dataset does not contain the weather variables required for a robust analysis. Therefore, the weather dataset has been merged with the train file, resulting in a merged file titled “wnv.train.weather.” The keys used to merge the two files are date and station.

Since the NOAA weather dataset provides weather data from two weather stations located in the Greater Chicago Area, the distance from each trap site to each of the two weather stations was calculated and used to select the weather information from the nearer station for each training record. Two distance metrics were considered: 1) the Euclidean distance formula,

$$D = \sqrt{(lat_{station} - lat_{trap})^2 + (long_{station} - long_{trap})^2}$$

as well as 2) the Haversine formula (http://en.wikipedia.org/wiki/Haversine_formula), which takes into account the curvature of the Earth.
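The two distance metrics and the nearest-station selection can be sketched as follows. This is an illustrative Python sketch, not the R code used in the project: the station coordinates are approximate values assumed for illustration, the Earth radius is the usual mean-radius approximation, and `nearest_station` and `attach_weather` are hypothetical helper names. The weather lookup is keyed on date and station, mirroring the merge keys described above.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, a common approximation

# Approximate station coordinates (assumed for illustration).
STATIONS = {
    1: (41.995, -87.933),  # Chicago O'Hare International Airport
    2: (41.786, -87.752),  # Chicago Midway International Airport
}


def euclidean_degrees(lat1, lon1, lat2, lon2):
    """Straight-line distance in degrees, ignoring the Earth's curvature."""
    return math.hypot(lat1 - lat2, lon1 - lon2)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres using the Haversine formula."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_station(trap_lat, trap_lon, distance=haversine_km):
    """Return the id of the station closest to the trap."""
    return min(STATIONS,
               key=lambda s: distance(trap_lat, trap_lon, *STATIONS[s]))


def attach_weather(trap_row, weather_by_key):
    """Join a trap record to the weather of its nearest station by (date, station)."""
    station = nearest_station(trap_row["latitude"], trap_row["longitude"])
    weather = weather_by_key[(trap_row["date"], station)]
    return {**trap_row, "station": station, **weather}


# Hypothetical records for illustration only.
weather_by_key = {
    ("2007-05-29", 1): {"tavg": 74},
    ("2007-05-29", 2): {"tavg": 77},
}
trap = {"date": "2007-05-29", "latitude": 41.98, "longitude": -87.92}
merged = attach_weather(trap, weather_by_key)
print(merged["station"], merged["tavg"])
```

Over an area as small as Chicago the two metrics will usually agree on which station is nearer; they differ mainly in the units and accuracy of the distance itself.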

The “geosphere” R package was used to calculate the Haversine distance.

NEW FEATURES

Ruiz et al. (2010) reported the importance of temporal characteristics of weather in predicting infection rates of WNV in Northern Illinois. For example, they found a positive correlation at 1 to 3 week lags between precipitation and infection rates. Based on this research, new features were created to capture this information in the weather dataset, namely a 2 week moving average of precipitation as well as a 2 week moving sum of accumulated rainfall.

Also, time-based effects of temperature were explored. This entailed the use of a metric known as growing degree days (GDD), a measure of heat accumulation used to predict mosquito development rates. GDD was calculated as

$$GDD = \begin{cases} T_{mean} - T_{base}, & \text{if } T_{mean} > T_{base} \\ 0, & \text{if } T_{mean} \le T_{base} \end{cases}$$
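The temporal features of this section can be sketched in a few lines of Python. The piecewise GDD rule is implemented directly from the formula; the moving average and moving sum use a trailing window that is shorter at the start of the series, which is an assumption since the text does not say how the first days were handled. The temperatures, precipitation values and base temperature below are made up for illustration, and a window of 14 would correspond to two weeks of daily records.

```python
def growing_degree_days(t_mean, t_base):
    """Daily GDD: heat above the base temperature, floored at zero."""
    return max(t_mean - t_base, 0.0)


def trailing_sum(values, window):
    """Sum of the current value and up to window - 1 preceding values."""
    return [sum(values[max(0, i - window + 1): i + 1])
            for i in range(len(values))]


def trailing_mean(values, window):
    """Mean of the current value and up to window - 1 preceding values."""
    means = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        means.append(sum(chunk) / len(chunk))
    return means


# Made-up daily mean temperatures in degrees Celsius and a base of 13.
daily_mean_temp_c = [10.0, 15.0, 20.0, 25.0]
daily_gdd = [growing_degree_days(t, t_base=13.0) for t in daily_mean_temp_c]

# Accumulated degree days: running total over the whole series.
accumulated_gdd = trailing_sum(daily_gdd, window=len(daily_gdd))

# Made-up daily precipitation with a 2-day window to keep the example small.
daily_precip = [0.0, 2.0, 4.0, 6.0]
precip_moving_avg = trailing_mean(daily_precip, window=2)
precip_moving_sum = trailing_sum(daily_precip, window=2)

print(daily_gdd, accumulated_gdd, precip_moving_avg, precip_moving_sum)
```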

 

Here, T_base represents a threshold temperature at which an organism’s growth rate is near zero. According to the literature reviewed, T_base can range between 13°C and 33°C. We will vary T_base and observe the threshold value that yields the best performing model.

Other features created from the base training data include the week number of the year. The abundance of mosquitos, and consequently the presence of WNV, is expected to be more prevalent during certain times of the year. Therefore, it is surmised that the week number will be important in predicting the timing of WNV.

CATEGORICAL VARIABLES

Dealing with categorical variables can pose certain limitations. For example, if a variable in a given data set contains many categories, there arises a need to re-categorize the classes into fewer groups for the sake of simplicity and the robustness of the predictive model. In addition, depending on the data mining technique used, the need for numerical rather than categorical data becomes evident.

The categorical variables found in the WNV dataset have undergone transformations in the form of re-categorization. For instance, the variable species is categorical with seven classes, as indicated in the table below:

Table 1: Species

However, Table 1 indicates that only 3 species have tested positive for WNV. Re-categorization highlights the importance of the three classes associated with WNV, leaving the other four classes to be grouped in a category of their own, indicative of their lack of attribution to the spread of WNV⁴. It is also important to note that the training set has a class titled “uncategorized.” By creating the fourth category, called “Culex Other”, the issue of the unidentified species is addressed effectively.

⁴ Table 2 titled Species 2 contains the new groupings.
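The regrouping described above amounts to a simple mapping, sketched below in Python. The text does not name the three WNV-positive species, so the labels in `WNV_SPECIES` are an assumption based on the species labels commonly seen in the Kaggle data; the helper name `regroup_species` is likewise hypothetical.

```python
# Assumed labels of the three species that tested positive for WNV.
WNV_SPECIES = {
    "CULEX PIPIENS",
    "CULEX RESTUANS",
    "CULEX PIPIENS/RESTUANS",
}


def regroup_species(label):
    """Keep the three WNV-associated species; map everything else to one group."""
    normalized = label.strip().upper()
    return normalized if normalized in WNV_SPECIES else "CULEX OTHER"


# Hypothetical raw labels, including an unidentified one.
raw_labels = ["Culex Pipiens", "CULEX TERRITANS", "uncategorized", "CULEX RESTUANS"]
regrouped = [regroup_species(label) for label in raw_labels]
print(regrouped)
```

Because any label outside the three retained species falls into the fourth group, previously unseen or unidentified species need no special handling.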

The re-categorization approach has been applied to the variable date as well.

EXPLORATORY DATA ANALYSIS

One of the primary aims of an exploratory data analysis is to check whether the characteristics of a data set meet the requirements of the modeling techniques to be used, as some models may be sensitive to certain types of data. That is, how is the data set distributed?

Skewness of a distribution, whether positive or negative, is often the result of a “subset of observations that appear to be inconsistent with the remaining observations that follow a hypothesized distribution.” (Sim et al, 2005, pg.642). Histograms and box plots are graphical tools widely used to inspect data for the presence of outliers. There are two important questions to address after visually inspecting a boxplot: first, is it possible for the boxplot to incorrectly declare certain points as outliers? Second, does the presence of outliers imply the need for a transformation?

The box plots⁵ for the West Nile dataset identify certain variables as skewed, with outliers present. For instance, the distribution of the number of mosquitos is right skewed, pulled to the right by the largest values in that column. The IQR⁶ rule for outliers indicates that values lying below -20 and above 39.5 are potential outliers. On examining the number of mosquitos trapped for each species, it is apparent that class imbalance plays an important role in the skewness of the data, as shown in Table 2.

⁵ All histograms and box plots, with a short description of shape, center and spread for the WNV data set, can be found in the appendix.
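For reference, the IQR rule places its bounds 1.5 interquartile ranges below the first quartile and above the third. The Python sketch below uses quartiles of 2 and 17, which are assumed purely for illustration (the paper does not report them); they happen to reproduce the upper bound of 39.5 and give a lower bound of -20.5, close to the reported -20.

```python
def iqr_fences(q1, q3, k=1.5):
    """Lower and upper outlier bounds from the first and third quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def flag_outliers(values, lower, upper):
    """Return the values that fall outside the closed interval [lower, upper]."""
    return [v for v in values if v < lower or v > upper]


# Assumed quartiles for the number of mosquitos per record.
lower, upper = iqr_fences(q1=2, q3=17)

# Hypothetical counts of mosquitos trapped.
counts = [1, 3, 5, 12, 17, 39, 40, 50]
outliers = flag_outliers(counts, lower, upper)
print(lower, upper, outliers)
```

Since a count can never be negative, only the upper bound matters in practice for this variable.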

Table 2: Number of Mosquitos Trapped

⁶ The appendix contains a table titled “Lower and Upper Bound Outliers”.

All numbers above 39.5 represent the species attributed to WNV and the locations where they abound. There exists a pattern between the type of species, the location and the number of mosquitos trapped that is beyond the scope of the boxplot.

Similarly, the boxplots for most of the weather variables in the WNV dataset show the presence of outliers. However, yearly, monthly, weekly and daily variations in weather are effectively unlimited, and the differences in data points between stations 1 and 2 can be due to the geographical locations of the stations and/or the way in which the instruments record the temperatures.

The Natural Resources Management and Environment Department furthers this argument by stating that “weather data collected at a given weather station during a period of several years may be not homogeneous, i.e., the data set representing a particular weather variable may present a sudden change [from one weather station to another]. This phenomenon may occur due to several causes, some of which are related to changes in instrumentation and observation practices, and others, which relate to modification of the environmental conditions of the site” or even “change in the time of the observations.” (para.14)

Thus, the skewness of the distribution is not necessarily a consequence of extreme data points. Rather, it is a result of class imbalance. For instance, the histogram for the accumulated degree days shows that the distribution is skewed to the right. But when the histogram is constructed taking into consideration the presence or absence of WNV, it becomes clear that class imbalance is the root of the skewness, as seen in the histograms below:

 

The histograms show that no WNV-positive records (wnvpresent = 1) occur at the lowest and highest degree days, whereas the histograms of acc.deg.day for wnvpresent = 0, and for both classes combined, appear flatter. To reduce the skewness of the distribution, each data point was replaced by its square root, resulting in data that are better behaved than in their original units.
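The effect of the square root transformation on a right-skewed variable can be checked with a moment-based skewness coefficient. The Python sketch below uses made-up values, not the accumulated degree days from the project, and the population form of the coefficient is one of several conventions.

```python
import math


def skewness(values):
    """Moment-based skewness: third central moment over the 1.5 power of the second."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Made-up right-skewed sample with one very large value.
raw = [1, 1, 2, 2, 3, 4, 6, 9, 16, 49]
transformed = [math.sqrt(v) for v in raw]

print(round(skewness(raw), 2), round(skewness(transformed), 2))
```

In this sample the transformed values remain right skewed, but less so than the raw ones, which is the behaviour the transformation is meant to achieve.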

   


In addition to skewness, another factor that affects the predictive capability of a model is the presence of outliers. As noted earlier, the weather data contains outliers. It also contains missing values. “For a large dataset, removal of samples based on missing values is not a problem, assuming the missingness is not informative” (Kuhn & Johnson, 2013, p.41). However, a more robust way of handling missing information is imputation. “Imputation is [a] layer of modelling where missing values are estimated based on other predictor variables. This amounts to a predictive model within a predictive model” (Kuhn & Johnson, 2013, p.42).

Missing values in the weather data set have been addressed by hot deck imputation, where each missing value is replaced with an observed value from a similar unit. “An attractive feature of the hot deck imputation is that only plausible values can be imputed since values come from observed responses in the donor pool” (Andridge & Little, 2011, para. 3), which means that the imputed weather values are more likely to resemble the other data points than imputed averages would. The second advantage of hot deck imputation is that the “method does not rely on model fitting for the variable to be imputed and thus is potentially less sensitive to model misspecification than an imputation method based on a parametric method such as regression imputation” (Andridge & Little, 2011, para. 3).

CORRELATION ANALYSIS

There are specific variables in the dataset that reveal interesting patterns, such as the number of mosquitos, temperature and precipitation.

The goal of the correlation analysis was to plot or capture a trend that would explain the relationship between these variables and the presence of the West Nile virus. Since the variables are on different scales, they were normalized using the z-score formula. In addition to normalizing the data, average values of the said variables were used in building the plots.

The plots pertain to weekly records captured for 4 years (2007, 2009, 2011 and 2013) for the months between late May and early October. Individual plots have been drawn for each year.

The blue line shows the average precipitation, the red line shows the average number of mosquitos, the green line shows the average temperature and the purple line shows the presence of the virus.

Figure 1: 2007

According to the line graph for the year 2007, a sudden decrease in temperature is followed by a decrease in mosquitos after week 35. Consequently, the average number of detected virus cases decreases.

It was also noted that the higher the temperature and the precipitation, the higher the number of mosquitos and, subsequently, the higher the probability of the presence of the West Nile virus.

An interesting pattern was found between precipitation and the increase in the number of mosquitos. The number of mosquitos increases rapidly not during the week of high precipitation but in the week after. It appears that once the number of mosquitos increases, the virus then infects the mosquitos.

The number of mosquitos in week 35 is low. However, the graph shows that the presence of the virus is more prominent than before, suggesting that the remaining mosquitos carry the virus even though the mosquito population is small.

Not surprisingly, as the temperature declines rapidly [even with high precipitation], the number of mosquitos and the presence of WNV drop. All plots have captured similar trends.

Figure 2: 2009

Figure 3: 2011

Figure 4: 2013

The scatterplots below show that the number of mosquitos and the presence of WNV have a positive relationship with dmonth, dweek, dewpoint, cool, tmax, tmin, tavg and spray. Therefore, the model is expected to rely on these features more than the others to predict WNV.

Though the relationships are positive, their strength appears to be weak. A closer look at the scatterplots shows some evidence of multicollinearity. For instance, in the plot titled temp and weather there are blocks of strong positive correlations that indicate collinearity, an issue to consider in the modeling process.

MODELS

Accurately predicting the presence of WNV essentially amounts to selecting the best spatial, temporal and weather features along with a specifically tuned classification algorithm. It is evident from the exploratory analysis as well as from the literature that certain individual features are crucial in predicting WNV.

Therefore, the modeling process for this data set will be broken into two parts. Part I will focus on determining how best to incorporate the available features into a classification model. Part II will focus on investigating and fine-tuning the specific classification algorithms to yield the best possible prediction.

Part I

Weather Data and Principal Component Analysis

Due to the number of weather attributes available in the dataset, it becomes quite difficult to ascertain the combination that will result in the best model. Moreover, the nature of weather is such that most individual features will be correlated with one another, resulting in multicollinearity. For example, the amount of precipitation will be correlated with atmospheric pressure and, in turn, with temperature. Therefore, to combat multicollinearity, principal component analysis (PCA) was used to extract features that highlight the similarities and differences of the original weather data while eliminating the detrimental effects that can result from the linear dependency of predictor variables.

Figure 5 summarizes the results of PCA conducted on the weather attributes. The first five components capture 97% of the variation in the weather data. The loadings of component 1 suggest it is highly related to temperature, humidity and pressure; a large value for component 1 seems to represent a sunny but chilly day. Component 2 appears to capture wind information, while component 3 summarizes precipitation. The first 5 components from PCA will be used to reflect the weather conditions of a specific day in the data.
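To illustrate how a principal component and its share of the variation are obtained, the Python sketch below extracts only the first component of a small made-up weather table using power iteration on the covariance matrix. It is not the procedure used in the project, whose implementation is not shown; the three columns (assumed to be tmax, tmin and dewpoint) and their values are hypothetical, and standardizing the variables beforehand, which is common practice, is omitted for brevity.

```python
import math


def covariance_matrix(rows):
    """Sample covariance matrix of a table given as a list of equal-length rows."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
            for i in range(p)]


def first_component(cov, iterations=200):
    """Leading eigenvalue and unit eigenvector of cov via power iteration."""
    p = len(cov)
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    cov_v = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
    eigenvalue = sum(v[i] * cov_v[i] for i in range(p))
    return eigenvalue, v


# Made-up daily records: (tmax, tmin, dewpoint), strongly correlated by design.
weather_rows = [
    (80.0, 60.0, 55.0),
    (85.0, 65.0, 60.0),
    (70.0, 52.0, 48.0),
    (90.0, 70.0, 66.0),
    (75.0, 55.0, 50.0),
]

cov = covariance_matrix(weather_rows)
eigenvalue, loadings = first_component(cov)

# Share of total variance captured by the first component.
total_variance = sum(cov[i][i] for i in range(len(cov)))
explained_ratio = eigenvalue / total_variance
print(round(explained_ratio, 3), [round(x, 3) for x in loadings])
```

When the columns are as strongly correlated as in this sample, a single component captures nearly all of the variation, which is why a handful of components can stand in for a much larger set of weather attributes.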

Figure 5: PCA

Figure 6: Clustering

                             

   


Figure 7: Model Summary

Temporally based weather variables and week number

While the weather conditions of a specific day can affect the activity level of mosquitos for that day, they do not take into account a mosquito's life cycle or the timing of weather conditions and its effect on mosquito populations. Hence, engineered features such as growing degree days, moving temperature averages/sums and moving precipitation averages/sums (all mentioned in previous sections) are included in the model.

Also, the week number of the year is incorporated to capture the seasonal (within-year) timing of mosquito populations.

Clustering Location Data

Determining a good way to represent location will most likely improve the predictive power of the models. Although the WNV challenge provides raw longitude and latitude values to represent location, this form is believed not to be conducive to predictive modeling due to the non-linear nature of spatial data.

Thus the k-means algorithm (k = 20) was used to translate the location data, represented by longitude/latitude pairs, into clustered locations. Figure 6 shows the location of the clusters using a normalized scale.
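The clustering step can be sketched as follows. This is a plain implementation of Lloyd's k-means algorithm on min-max normalized coordinates, offered as an illustration rather than the procedure actually run for this paper. The trap coordinates are synthetic (drawn inside the latitude and longitude ranges reported in the appendix), k is set to 5 only to keep the example small, and the names `min_max_scale`, `nearest`, `kmeans` and `location_feature` are hypothetical.

```python
import random

def min_max_scale(values):
    """Rescale a list of numbers to the [0, 1] range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda c: (point[0] - centroids[c][0]) ** 2
                           + (point[1] - centroids[c][1]) ** 2)

def kmeans(points, k, iters=100, seed=0):
    """Lloyd's algorithm: alternate assignment and centroid update."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        labels = [nearest(p, centroids) for p in points]
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                updated.append((sum(m[0] for m in members) / len(members),
                                sum(m[1] for m in members) / len(members)))
            else:
                updated.append(centroids[c])  # keep an empty cluster in place
        if updated == centroids:
            break
        centroids = updated
    # Final assignment against the centroids that are returned.
    return [nearest(p, centroids) for p in points], centroids

# Synthetic trap locations inside the reported coordinate ranges.
rng = random.Random(42)
lons = [-87.93 + 0.40 * rng.random() for _ in range(60)]
lats = [41.64 + 0.37 * rng.random() for _ in range(60)]
points = list(zip(min_max_scale(lons), min_max_scale(lats)))

labels, centroids = kmeans(points, k=5)
# Cluster membership expressed as a categorical value for modeling.
location_feature = ["cluster_%02d" % lab for lab in labels]
print(location_feature[:5])
```

Encoding the cluster index as a string label keeps a downstream model from treating cluster 12 as "larger" than cluster 3, which is the point of using the clusters as a categorical variable.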

As one can observe, the clustered locations outline the Chicago area quite accurately. These clustered locations are used as a categorical variable in our models.

Part II

With the necessary data pre-processing and variable transformations completed, the focus moved to the construction of models to predict WNV. The overall approach was to build an ensemble: a model that takes a weighted average of a set of classifiers and generally outperforms the individual classifiers on which it is built. The strategy was to consider four individual algorithms and build the best possible classifier out of each to include in the final ensemble model: 1) logistic regression using a generalized additive model (GAM), 2) linear discriminant analysis (LDA), 3) random forests, and 4) support vector machines (SVM). Kaggle's train dataset was split with 70% and 30% probabilities: the 70% portion was used as the training set and the remaining 30% served as the hold-out test set.

Figure 7 is a summary of the best set-ups for each algorithm. Of all the individual models, GAM was clearly the best performing, with an AUC of 0.8253717. The best overall ensemble classifier was a weighted average of GAM and SVM with weights of 0.6 and 0.4, respectively, and an AUC of 0.8361962.
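The weighted-average ensemble and its evaluation can be illustrated with a short sketch. The AUC here is computed through the rank-sum (Mann-Whitney) formulation with average ranks for ties; the probability vectors `gam_probs` and `svm_probs` are invented for the example and are not outputs of the models in this study. Only the 0.6 and 0.4 weights come from the text above.

```python
def auc(labels, scores):
    """Area under the ROC curve via the rank-sum formulation.
    labels are 0/1; tied scores receive their average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # ranks are 1-based
        for t in range(i, j + 1):
            ranks[order[t]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def weighted_ensemble(prob_a, prob_b, w_a=0.6, w_b=0.4):
    """Element-wise weighted average of two classifiers' probabilities."""
    return [w_a * a + w_b * b for a, b in zip(prob_a, prob_b)]

# Hypothetical hold-out labels and predicted probabilities.
y_true = [0, 0, 1, 0, 1, 0, 0, 1]
gam_probs = [0.10, 0.30, 0.60, 0.20, 0.40, 0.50, 0.05, 0.70]
svm_probs = [0.20, 0.10, 0.40, 0.45, 0.65, 0.30, 0.15, 0.60]

blend = weighted_ensemble(gam_probs, svm_probs, 0.6, 0.4)
print(auc(y_true, gam_probs), auc(y_true, svm_probs), auc(y_true, blend))
```

Since AUC depends only on the ordering of the scores, the blend helps when the two classifiers misrank different cases; it cannot help when both misrank the same ones.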

   


CONCLUSION

Although the ensemble model had the highest AUC achieved on the training dataset, it only reached an AUC of 0.6220 on the Kaggle leaderboard.

In fact, over 50 models were submitted to Kaggle and the results were rarely as expected. The two best models on the leaderboard were an ensemble of GAM logistic regression and GLM logistic regression, and a slightly modified Poisson GLM. Neither had a notable training AUC, but both performed well on Kaggle.

Other validation techniques were investigated in an attempt to obtain better feedback from the training process and, in turn, to build a better model. Instead of using a 70/30 training and testing split, a modified version of n-fold cross validation was used in which one year's data was held out for testing and the remaining years were used for training. This process was repeated four times, once for each year, and the model's performance was averaged over the four runs. The best models obtained with this validation technique did not seem any different from the models built on a traditional 70/30 split.
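The leave-one-year-out scheme described above can be sketched as a generator of train/test index splits. The function names and the `fit_and_score` callback are hypothetical; in practice the callback would fit a classifier on the training rows and return its AUC on the held-out year.

```python
def leave_one_year_out(years):
    """Yield (held_out_year, train_idx, test_idx), one fold per distinct year."""
    for held_out in sorted(set(years)):
        train_idx = [i for i, y in enumerate(years) if y != held_out]
        test_idx = [i for i, y in enumerate(years) if y == held_out]
        yield held_out, train_idx, test_idx

def cross_validate(years, fit_and_score):
    """Average the score returned by fit_and_score(train_idx, test_idx)."""
    scores = [fit_and_score(tr, te) for _, tr, te in leave_one_year_out(years)]
    return sum(scores) / len(scores)

# Year of each record in a toy training set with four distinct years.
record_years = [2007, 2007, 2009, 2011, 2013, 2013]
folds = list(leave_one_year_out(record_years))
for year, train_idx, test_idx in folds:
    print(year, train_idx, test_idx)
```

Splitting by year rather than at random keeps records from the same season out of both sides of a fold, which is closer to the way the Kaggle test set is separated from the training set.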

Figure 8: Models & Imbalance

Because there is a gross imbalance of positive and negative cases in the WNV data, further examination was conducted to see whether the imbalance had any influence on the effectiveness of training and validation. Figure 8 shows the performance of several models and their relationship with data imbalance. Except for one model, none displayed a drastic sensitivity to data balance.

If using an appropriate validation technique does not account for the disparity between the training AUC and the Kaggle leaderboard AUC, it is surmised that there may be a fundamental difference between the characteristics of the training data and the testing data.

Specifically, it is possible that there are idiosyncratic intra-annual variations in weather that cannot be captured in the training set due to how the WNV problem is set up. Ezanno et al. (2015) report that populations of certain mosquito species do in fact have inter-annual variations due to specific weather events in a year.

It is therefore suspected that the best algorithms discussed above are overfitting the training data. While the best models in this study capture the variations in weather in the training data well, they are unable to replicate this in the testing data.

This makes intuitive sense, as most of the models that performed better on Kaggle tend to be simple models that include variables such as location, week number and mosquito species, which generalize across all years of the data.
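Returning to the class-imbalance check summarized in Figure 8: the text does not state how the balance of the data was varied, so the sketch below shows one common option, randomly downsampling the negative cases until positives make up a chosen share of the rows. The function name and the `positive_share` parameter are assumptions for illustration.

```python
import random

def downsample_negatives(labels, positive_share, seed=0):
    """Return sorted row indices that keep every positive case and a random
    subset of negatives so positives are about positive_share of the result."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    wanted = round(len(pos) * (1 - positive_share) / positive_share)
    return sorted(pos + rng.sample(neg, min(len(neg), wanted)))

# Toy labels with 5% positives, rebalanced to several target shares.
toy_labels = [1] * 5 + [0] * 95
for share in (0.05, 0.25, 0.50):
    kept = downsample_negatives(toy_labels, share)
    print(share, len(kept), sum(toy_labels[i] for i in kept))
```

Refitting each model on the rebalanced rows and scoring it on an untouched hold-out would give one point per share, which is the kind of curve a figure relating performance to imbalance would plot.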

   


Another matter to consider for future model building is the importance of the spray data. Though the spray data is not part of the testing dataset, which would ordinarily warrant its immediate dismissal from the predictor selection process, the following heat map implies otherwise. Upon close inspection of the heat map, one speculates that spraying in one year does indeed alter the mosquito population the next year, which might explain why mosquito populations appear in different locations each year.

Also, feature engineering of the predictor variable depart [departure from normal] might help in creating a deeper understanding of the problem at hand. A possible means of engineering this predictor would be to categorize the deviation from normal temperature as hotter than normal or colder than normal.
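The proposed engineering of depart could look like the following sketch. The category names, the optional `tolerance` band that treats small departures as normal, and the handling of missing values are assumptions added for illustration; the text above only proposes the hotter/colder distinction.

```python
def categorize_departure(depart, tolerance=0.0):
    """Map a departure-from-normal temperature to a category.
    depart: signed difference from the normal temperature, or None if missing.
    tolerance: half-width of the band treated as 'normal' (assumed, default 0)."""
    if depart is None:
        return "unknown"
    if depart > tolerance:
        return "hotter_than_normal"
    if depart < -tolerance:
        return "colder_than_normal"
    return "normal"

# Example departures for a handful of days.
departures = [4, -3, 0, None, 1]
print([categorize_departure(d, tolerance=1.0) for d in departures])
```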

                                                                     

           

   

   


 Appendix  

 

Table 3: Data Fields

FIELDS

Number  Train             Weather                Spray      Test
1       Date              Station                Date       ID
2       Address           Date                   Time       Date
3       Species           Max Temperature        Latitude   Address
4       Block             Min Temperature        Longitude  Species
5       Street            Avg Temperature                   Block
6       Trap              Departure from Normal             Street
7       Address Number    Dew Point                         Trap
8       Latitude          Wet Bulb                          Address Number
9       Longitude         Heat                              Latitude
10      Address Accuracy  Cool                              Longitude
11      # of Mosquitoes   Sunrise                           Address Accuracy
12      Wnvpresent        Sunset
13                        Code Sum
14                        Depth
15                        Water1
16                        Snowfall
17                        Total Precipitation
18                        Station Pressure
19                        Sea Level
20                        Wind Speed
21                        Wind Direction
22                        Average Speed

   


TABLE 2 | SPECIES 2

   

     

     

 

   


SKEWNESS OF VARIABLES & OUTLIERS

DATE PATTERN

   

 

The data is skewed to the left. There are more records for 2007 than for other years, but not by a significant amount. If this becomes problematic, we may sample an equal number of records for each year.

       

   


LATITUDE PATTERN

   

 

Shape: Latitude is very slightly skewed to the left; the mean is less than the median.
Center: 41.84628
Spread: 41.64461 to 42.01743

       

   


LONGITUDE PATTERN

   

 

Shape: Longitude is symmetric.
Center: -87.69499
Spread: -87.93099 to -87.53163

       

   


NUMBER OF MOSQUITOS PATTERN

 

 

 

Shape: The distribution is right skewed, as the mean (12.85351) is pulled to the right, away from the median (5).
Center: 5
Spread: 1 to 50
Outlier: The boxplot confirms the skewness seen in the histogram: there are large values pulling the distribution to the right. The outlier function indicates that the largest number of mosquitos in the data is 50.
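The shape and outlier summaries in this appendix follow a simple recipe: compare the mean with the median to judge the direction of skew, then look for points far from the bulk of the data. A minimal sketch of that recipe is given below. The outlier function used for this paper is not specified, so the boxplot rule shown here (fences at 1.5 times the interquartile range) is an assumption, as are the function names and the toy counts.

```python
import statistics

def skew_direction(values):
    """Direction of skew judged by comparing the mean with the median."""
    mean, median = statistics.mean(values), statistics.median(values)
    if mean > median:
        return "right"
    if mean < median:
        return "left"
    return "symmetric"

def boxplot_fences(values, k=1.5):
    """Lower and upper fences: quartiles extended by k times the IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def outliers(values, k=1.5):
    """Values falling outside the boxplot fences."""
    lo, hi = boxplot_fences(values, k)
    return [v for v in values if v < lo or v > hi]

# Toy mosquito counts with one large value.
counts = [1, 2, 3, 4, 5, 5, 6, 7, 50]
print(skew_direction(counts), boxplot_fences(counts), outliers(counts))
```

The mean-versus-median comparison is a quick heuristic rather than a formal skewness statistic, which is why the summaries in this appendix pair it with a histogram and a boxplot.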

         

   


DISTANCE FROM O’HARE PATTERN

   

 

Shape: The distribution is symmetric.
Center: 0.2943334
Spread: 0.0372549 to 0.5179756

                         

   


DISTANCE FROM MIDWAY PATTERN

   

 

Shape: The distribution is slightly skewed to the left as the mean 0.1548598 is pulled away from the median 0.1616137.
Center: 0.1616137
Spread: 0.0077139 to 0.2481943

                       

   


MAXIMUM TEMPERATURE PATTERN

     

 

 

Shape: The distribution is skewed to the left as the mean 81.94765 is pulled away to the left from the median 83.
Center: 83
Spread: 57 to 97
Outlier: The box plot shows the presence of some points influencing the movement of the distribution to the left. The outlier function indicates that 57 is the point that is distant from the other values in the dataset.

                       

   


MINIMUM TEMPERATURE PATTERN

     

 

Shape: The distribution is skewed to the left as the mean 64.16533 is pulled away to the left from the median 66.
Center: 66
Spread: 41 to 79
Outlier: The box plot shows the presence of some points influencing the movement of the distribution to the left. The outlier function indicates that 41 is the point that is distant from the other values in the dataset.

                     

   


AVERAGE TEMPERATURE PATTERN

     

 

 

Shape: The distribution is skewed to the left as the mean 38.28412 is pulled away to the left from the median 40.
Center: 40
Spread: 15 to 52
Outlier: The box plot shows the presence of some points influencing the movement of the distribution to the left. The outlier function indicates that 15 is the point that is distant from the other values in the dataset.

                     

   


TOTAL PRECIPITATION PATTERN

 

 

 

Shape: The distribution is skewed to the right as the mean 0.1274281 is pulled away to the right from the median 0.
Center: 0
Spread: 0.00 to 3.97
Outlier: The box plot shows the presence of some points influencing the movement of the distribution to the right. The outlier function indicates that 3.97 is the point that is distant from the other values in the dataset.

                       

   


 

RESULT OF WIND SPEED PATTERN

   

 

 

Shape: The distribution is skewed to the right as the mean 5.911003 is pulled away to the right from the median 5.5.
Center: 5.5
Spread: 0.1 to 15.4
Outlier: The box plot shows the presence of some points influencing the movement of the distribution to the right. The outlier function indicates that 15.4 is the point that is distant from the other values in the dataset.

                     

   


 

RESULT OF WIND DIRECTION PATTERN

     

 

Shape: The distribution is skewed to the left as the mean 17.72016 is pulled away to the left from the median 19.
Center: 19
Spread: 1 to 36

                     

   


AVERAGE WIND SPEED PATTERN

     

 

 

Shape: The distribution is skewed to the left as the mean 123.4147 is pulled away to the left from the median 139.
Center: 139
Spread: 3 to 177
Outlier: The box plot shows the presence of some points influencing the movement of the distribution to the left. The outlier function indicates that 3 is the point that is distant from the other values in the dataset.

                   

   


TEMPERATURE MOVING AVERAGES - 1 WEEK PATTERN

     

 

     

Shape: The distribution is skewed to the left as the mean 72.5431 is pulled away to the left from the median 73.14286.
Center: 73.14286
Spread: 53.14286 to 83.85714
Outlier: The box plot shows the presence of some points influencing the movement of the distribution to the left. The outlier function indicates that 53.14286 is the point that is distant from the other values in the dataset.

                   

   


TEMPERATURE MOVING AVERAGES – 2 WEEK PATTERN

     

 

 

Shape: The distribution is skewed to the left as the mean 72.41439 is pulled away to the left from the median 73.
Center: 73
Spread: 55.07143 to 82.76923
Outlier: The box plot shows the presence of some points influencing the movement of the distribution to the left. The outlier function indicates that 55.07143 is the point that is distant from the other values in the dataset.

                   

   


MOVING AVGS OF PRECIPITATION – 1 WEEK PATTERN

     

 

 

Shape: The distribution is skewed to the right as the mean 0.1333564 is pulled away to the right from the median 0.07.
Center: 0.07
Spread: -0.0000 to 1.42857
Outlier: The box plot shows the presence of some points influencing the movement of the distribution to the right. The outlier function indicates that 1.42857 is the point that is distant from the other values in the dataset.

                   

   


MOVING AVGS OF PRECIPITATION – 2 WEEK PATTERN

     

   

Shape: The distribution is skewed to the right as the mean 0.130 is pulled away to the right from the median 0.085.
Center: 0.085
Spread: 0.0007 to 0.76714
Outlier: The box plot shows the presence of some points influencing the movement of the distribution to the right. The outlier function indicates that 0.76714 is the point that is distant from the other values in the dataset.

                     

   


MOVING SUM OF PRECIPITATION – 1 WEEK PATTERN

     

 

 

Shape: The distribution is skewed to the right as the mean 0.9432334 is pulled away to the right from the median 0.53.
Center: 0.53
Spread: -0.000 to 9.149
Outlier: The box plot shows the presence of some points influencing the movement of the distribution to the right. The outlier function indicates that 9.15 is the point that is distant from the other values in the dataset.

   

                 

   


MOVING SUM OF PRECIPITATION – 2 WEEK PATTERN

     

   

Shape: The distribution is skewed to the right as the mean 1.74216 is pulled away to the right from the median 1.1.
Center: 1.1
Spread: -0.000 to 10.74999
Outlier: The box plot shows the presence of some points influencing the movement of the distribution to the right. The outlier function indicates that 10.75 is the point that is distant from the other values in the dataset.
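The moving averages and moving sums summarized in the preceding panels can be computed with a trailing window, as in the sketch below. The window definition is an assumption: it ends on the current day, covers 7 days for the 1-week features and 14 days for the 2-week features, and uses whatever days are available at the start of the series. The reported extremes are at least consistent with a 7-day mean (for example, 53.14286 equals 372/7), but the exact construction used for this paper is not stated. The function name and the toy readings are hypothetical.

```python
def trailing_window(values, window, how="mean"):
    """Trailing moving mean or sum over the last `window` observations,
    including the current one; early positions use a shorter window."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        total = sum(chunk)
        out.append(total if how == "sum" else total / len(chunk))
    return out

# Toy daily readings.
daily_precip = [0.0, 0.5, 0.0, 1.2, 0.0, 0.0, 0.3, 0.0]
daily_tavg = [70, 72, 75, 71, 68, 74, 77, 73]

precip_sum_1wk = trailing_window(daily_precip, 7, how="sum")
tavg_mean_1wk = trailing_window(daily_tavg, 7, how="mean")
print(precip_sum_1wk)
print(tavg_mean_1wk)
```

A trailing window only looks backward in time, so the feature for a given day never uses weather that had not yet happened when the trap was checked.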

                     

   


DEGREE DAY PATTERN

     

 

Shape: The distribution is skewed to the right as the mean 3.824472 is pulled away to the right from the median 3.4.
Center: 3.4
Spread: 0.0 to 14.9

             

 

                     

   


ACCUMULATED DEGREE DAY FOR EACH YEAR PATTERN

     

 

Shape: The distribution is skewed to the right as the mean 241.0934 is pulled away to the right from the median 239.6.
Center: 239.6
Spread: 1.3 to 521.1
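For readers unfamiliar with the degree-day features summarized above, the sketch below shows one common way to compute them: the daily value is the amount by which the average of the maximum and minimum temperature exceeds a base temperature, floored at zero, and the accumulated value is a running total that restarts each year. The averaging formula and the base temperature of 50 used in the example are assumptions; the definition actually used in this study is given in an earlier section and may differ.

```python
def degree_day(t_max, t_min, base_temp):
    """Daily degree days: mean of max and min temperature above base_temp,
    floored at zero (averaging method, assumed)."""
    return max(0.0, (t_max + t_min) / 2.0 - base_temp)

def accumulated_degree_days(records, base_temp):
    """Running degree-day total that restarts at the start of each year.
    records: iterable of (year, t_max, t_min) tuples in date order."""
    totals, running, current_year = [], 0.0, None
    for year, t_max, t_min in records:
        if year != current_year:
            running, current_year = 0.0, year
        running += degree_day(t_max, t_min, base_temp)
        totals.append(running)
    return totals

# Toy records; base_temp=50 is an illustrative choice, not the paper's value.
toy_records = [(2007, 80, 60), (2007, 70, 50), (2009, 60, 50)]
print(accumulated_degree_days(toy_records, base_temp=50))
```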

 

                         

   


LOWER & UPPER BOUND OUTLIERS

       

                 

                         

   


GROUPED LINE GRAPH | YEAR 2007

   Blue line: The average precipitation. Red line: The average number of mosquitos

Green line: The average temperature. Purple line: The presence of virus

 

             

   


GROUPED LINE GRAPH | YEAR 2009

     Blue line: The average precipitation. Red line: The average number of mosquitos

Green line: The average temperature. Purple line: The presence of virus.

   

         

   


           

GROUPED LINE GRAPH | YEAR 2011

   Blue line: The average precipitation. Red line: The average number of mosquitos

Green line: The average temperature. Purple line: The presence of virus.

 

   


         

GROUPED LINE GRAPH | YEAR 2013

   Blue line: The average precipitation. Red line: The average number of mosquitos

Green line: The average temperature. Purple line: The presence of virus.

   

   


Works Cited

Abdi, Herve. Multivariate analysis. Retrieved from www.utdallas.edu/~herve/Abdi-MultivariateAnalysis-pretty.pdf

Andridge & Little. (2011). A review of hot deck imputation for survey non-response. Int Stat Rev. 78(1): 40-64. Retrieved from http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3130338/

Ezanno, P., Aubry-Kientz, M. et al. (2015). A generic weather driven model to predict mosquito population dynamics applied to species of Anopheles, Culex and Aedes genera of southern France. 120(1): 39-50. Retrieved from http://www.ncbi.nlm.nih.gov/pubmed/25623972

Kaggle. West Nile Prediction. Retrieved from: https://www.kaggle.com/c/predict-west-nile-virus/data

Kuhn & Johnson (2013). Applied Predictive Modeling. New York, Springer.

Natural Resources Management and Environmental Departments. Annex 4: Statistical Analysis of Weather Data Sets 1. Retrieved from: http://www.fao.org/docrep/x0490e/x0490e0l.htm#TopOfPage

Ruiz, Marilyn O., F Chavez Luis et al. (2010). Local impact of temperature and precipitation on West Nile virus infection in Culex species mosquitoes in northeast Illinois, USA. Parasites & Vectors. Retrieved from http://www.parasitesandvectors.com/content/3/1/19.

Ruiz, Marilyn O., Edward D. Walker et al. (2007). Association of West Nile virus illness and urban landscapes in Chicago and Detroit. International Journal of Health Geographics.

Theophilides, C.N., S.C. Ahearn et al. (2006). First evidence of West Nile virus amplification and relationship to human infections. International Journal of Geographical Information Science, 20, 103-115.

Sim, C.H., Gan, F.F. et al. (2005). Outlier labeling with boxplot procedures. Journal of the American Statistical Association, 100(470). Retrieved from: http://www.jstor.org/stable/27590584