MULTIVARIATE ANALYSIS
FARZAD ESKANDANIAN, MAX LI, JOYCE ROSE, NASIM SONBOLI
CSC 424 | ADVANCED DATA ANALYSIS
6|14|2015
The purpose of this paper is to discuss the models used in predicting the presence or absence of the West Nile virus [WNV]. The analysis is distinctive in its use of weather, temporal and spatial factors based on the premise of time-based effects. That is, the models take into account the developmental stages of a mosquito. Four individual classifiers were built: 1) logistic regression using a generalized additive model (GAM), 2) linear discriminant analysis (LDA), 3) random forests, and 4) support vector machines (SVM). The best combination of parameters from each model was included in the ensemble model. Species, week number, location, moving temperature averages, precipitation moving averages and growing degree days played an important role in predicting WNV. The best overall ensemble classifier was a weighted average of GAM and SVM with weights of 0.6 and 0.4, respectively, and an AUC of 0.8361962.
INTRODUCTION
The West Nile virus (WNV) is “a mosquito borne disease-causing infectious agent” (Theophilides et al, 2006, para. 1) that affects birds, humans, and other animals. WNV was first reported in the United States in 1999. Since then, seasonal epidemics caused by WNV have been recorded, leading to a series of studies focused on understanding the features and characteristics of the virus. The available research indicates that “the infections caused by pathogens by way of a mosquito vector often cluster in space and time given the habitat requirements of the vectors and the vertebrate involved in the transmission.” (Ruiz et al, 2007, para 8).
In other words, West Nile viral transmission is attributed to patterns of climate, landscape, hydrology and types of human settlement. Ruiz et al (2010) argue that the statistical models built thus far by researchers only characterize associations between the virus and weather, landscape, human density, etc. Though these models offer insights about WNV, the associations themselves are not enough to develop and implement preventive measures for future epidemics. The interesting aspect of the WNV challenge arises from the need to build a better model that takes into account the life cycle of the mosquitoes in relation to the variability in weather and its impact “on
WEST NILE VIRUS | CHICAGO
growth or activity of an organism.” Such a model can go a step beyond associations and indicate the best time and location for early intervention. The importance of building a robust model with predictive capabilities lies in the need to prevent future outbreaks. Therefore, the goal of this project is to build a model that uses weather, temporal and spatial factors to predict the presence of the West Nile virus.
DATA DESCRIPTION
Kaggle’s West Nile Virus challenge consists of the following datasets¹:

        Train    Weather    Spray    Test
Obs     10506    2944       14835    116293
Var     12       22         4        11
The datasets contain a combination of string and numeric variables. “In many cases, some predictors have no values for a given sample. These missing data could be structurally missing” (Kuhn & Johnson, p.41). For instance, station 2 does not collect information on depart, depth, water1, snowfall, sunset and sunrise. These structurally missing values are denoted by “M,” “T”, or “-”. “In other cases, the value cannot or was not determined at the time of the model building” (Kuhn & Johnson, p.41). Examples of such missing values are tavg, wetbulb, heat, cool, preciptotal, stnpressure, sea level, time [584 values] and average speed. Hence, both the spray data and the weather data contain missing values. The missing values for time in the spray dataset are “concentrated in a subset of predictors” (Kuhn & Johnson, p.41). In other words, the 584 missing values in the spray data relate to 09/07/2011, where time was not recorded after 7:44:32 PM and before 7:46:30 PM. The non-structurally missing values in the weather dataset, however, appear to occur randomly across all the predictors. The counts of missing values for each of the predictor variables have been tabulated below.

¹ The fields for the datasets can be found in Table 3 in the appendix titled “Data Fields”.
The response variable has two classes that the model aims to predict, namely the presence or absence of the West Nile virus [1, 0]. The explanatory variables are: maximum temperature, minimum temperature, average temperature, precipitation, result wind speed, result wind direction, species, trap, longitude, latitude, number of mosquitoes and address.
EXTERNAL DATASETS
Although Kaggle already provides a number of explanatory variables for the West Nile Virus challenge, there are ample opportunities to include external datasets that may contain other variables that can improve a predictive model’s performance. For example, Ruiz et al (2010) found that the amount of vegetation and the degree to which water would flow or remain in an area mediated the effect of weather in predicting the infection rate of West Nile Virus. Socioeconomic factors that measured poverty also seemed to correlate with the presence of West Nile Virus. Bringing in additional data from reliable government sources that reflect the aforementioned
factors will help us fine-tune our predictive models.
MULTIVARIATE ANALYSIS
The main objective of a multivariate analysis is to use multiple data mining techniques to study how variables relate to one another. This method of analysis is most often used when the dataset contains more than one explanatory variable, more than one response variable, or both. Kaggle’s West Nile Virus dataset contains one response variable and 12 explanatory variables. Using a multivariate analysis for such a dataset is desirable because the presence or absence of WNV might be influenced by more than one attribute. For instance, principal component analysis (PCA) can be used to “decompose a data table with correlated measurements into a new set of uncorrelated (i.e., orthogonal) variables” (Abdi, p.1). Performing PCA will determine the dominant trends in the dataset, to which, for example, a logistic regression model can then be applied. A logistic regression on the 12 explanatory variables alone may not produce a stable model if there is a strong dependence between predictors. PCA addresses the issue of multicollinearity, which can yield a regression model that estimates the response variable more reliably. Therefore, weighing the advantages and disadvantages of using one technique in conjunction with another, given the number of explanatory variables, motivates the use of multivariate analysis.
DATA COLLECTION
The dataset provided by the Chicago Department of Public Health and NOAA [National Oceanic and Atmospheric Administration] comprises weather data², GIS data³, the dates on which traps were set [spanning 3 days each week for approximately 5 months], trap locations and species for the years 2007 through 2014. The main dataset is split into a training dataset and a testing dataset. The training dataset contains data points collected in the odd years: 2007, 2009, 2011 and 2013. The testing dataset contains data points gathered in the even years: 2008, 2010, 2012 and 2014. Two central factors serve as the premise for when and why the WNV data was collected. The first factor is weather. “It is believed that hot and dry conditions are more favorable for West Nile virus than cold and wet.” (Kaggle, information description, para. 9) Therefore, the dataset captures information about weather [from station 1 (Chicago O’Hare International Airport) and station 2 (Chicago Midway International Airport)] only for late May through early October. The second factor is the availability of data on the number of mosquitos trapped, location, species identified and the test results for the presence or absence of the West Nile virus. “Every year from late-May to early-October, public health workers in Chicago setup mosquito traps scattered across the city. Every week from Monday through Wednesday, these traps collect mosquitos, and the mosquitos are tested for the presence of West Nile virus before the end of the week.” (Kaggle, information description, para. 3) It is no coincidence that traps are only set out in late spring through early fall, when the weather is conducive to mosquito population growth. Identifying the location of the traps, the number of mosquitos trapped, the species, and the frequencies of each species infected or not infected with the virus, in conjunction with weather, is crucial in understanding where the next sporadic growth of mosquitos will occur. After all, the goal of the predictive model is to identify the presence or absence of WNV by predicting the occurrence and the rate of mosquito growth in one particular location over another given a set of weather conditions. Such predictions can be used by the City of Chicago and CDPH “to efficiently and effectively allocate resources” to control the population growth of mosquitos, which in turn prevents the transmission of the “potentially deadly virus.”

² Weather data has been collected only for dates on which the traps were set.
³ GIS data for spraying is only available from 2011 to 2013.
DATA MERGING
The West Nile training dataset does not contain the weather variables required for a robust analysis. Therefore, the weather dataset was merged with the train file, resulting in a merged file titled “wnv.train.weather.” The keys used to merge the two files are date and station. Since the NOAA weather dataset provides data from two weather stations located in the Greater Chicago Area, the distance from the site of each trap to each of the two weather stations was calculated and used to select the weather information from the nearer station for each training record. Two distance metrics were considered: 1) Euclidean distance formula,
$$D = \sqrt{(lat_{station} - lat_{trap})^2 + (long_{station} - long_{trap})^2}$$
as well as 2) Haversine formula (http://en.wikipedia.org/wiki/Haversine_formula) when taking into account the curvature of the Earth.
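As a rough illustration of the two distance metrics and of picking the nearer weather station for a trap, the sketch below may help. It is written in Python rather than with the R package used in the study, the station coordinates are approximate values assumed for illustration (they are not taken from this paper), and the function names are hypothetical.

```python
import math

# Approximate station coordinates, assumed for illustration only.
STATIONS = {
    1: (41.995, -87.933),  # Chicago O'Hare International Airport (assumed)
    2: (41.786, -87.752),  # Chicago Midway International Airport (assumed)
}

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def euclidean_deg(lat1, lon1, lat2, lon2):
    """Straight-line distance in degrees, ignoring the Earth's curvature."""
    return math.sqrt((lat1 - lat2) ** 2 + (lon1 - lon2) ** 2)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres using the Haversine formula."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_station(trap_lat, trap_lon, distance=haversine_km):
    """Return the id of the station closest to the trap under the given metric."""
    return min(STATIONS, key=lambda s: distance(trap_lat, trap_lon, *STATIONS[s]))
```

Passing `distance=euclidean_deg` to `nearest_station` switches the metric, which makes it easy to check whether the two metrics ever disagree about the nearer station for a given trap.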
The “geosphere” R package was used to calculate the Haversine distance.
NEW FEATURES
Ruiz et al. (2010) reported the importance of temporal characteristics of weather in predicting infection rates of WNV in Northern Illinois. For example, they found a positive correlation at 1 to 3 week lags between precipitation and infection rates. Based on this research, new features were created to capture this information in the weather dataset, namely a 2 week moving average of precipitation as well as a 2 week moving sum of accumulated rainfall. Time-based effects of temperature were also explored using a metric known as growing degree days (GDD), a measure of heat accumulation used to predict mosquito development rates. GDD was calculated as
$$GDD = \begin{cases} T_{mean} - T_{base}, & \text{if } T_{mean} > T_{base} \\ 0, & \text{if } T_{mean} \le T_{base} \end{cases}$$
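The GDD rule and the trailing moving averages and sums described in this section can be sketched as follows. This is only an illustration: the base temperature, the sample temperatures and the helper names are assumptions rather than values or code from the study, and temperatures are assumed to be in °C.

```python
from itertools import accumulate


def growing_degree_days(t_mean, t_base):
    """Daily GDD: heat above the base temperature, never negative."""
    return max(t_mean - t_base, 0.0)


def trailing_sum(values, window):
    """Sum of the current value and up to window - 1 preceding values."""
    return [sum(values[max(0, i - window + 1): i + 1]) for i in range(len(values))]


def trailing_mean(values, window):
    """Mean of the current value and up to window - 1 preceding values."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out


T_BASE = 22.0  # hypothetical threshold, chosen only for the example
daily_mean_temp = [20.0, 23.5, 26.0, 21.0, 24.0]  # made-up daily means
daily_gdd = [growing_degree_days(t, T_BASE) for t in daily_mean_temp]
accumulated_gdd = list(accumulate(daily_gdd))  # running heat accumulation
```

With daily records, a 2 week moving sum of rainfall would correspond to `trailing_sum(rain, 14)` under this sketch.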
Here, Tbase represents a threshold temperature at which an organism’s growth rate is near zero. According to the literature, Tbase can range between 13°C and 33°C. We vary Tbase and observe the threshold value that yields the best performing model. Other features created from the base training data include the week number of the year. The abundance of mosquitos, and consequently the presence of WNV, is expected to be greater during certain times of the year. Therefore, it is surmised that the week number will be important in predicting the timing of WNV.
CATEGORICAL VARIABLES
Dealing with categorical variables can pose certain limitations. For example, if a variable contains many categories, it may need to be re-categorized into a smaller number of groups for the sake of simplicity and the robustness of the predictive model. In addition, depending on the data mining technique used, it may become necessary to use numerical rather than categorical data. The categorical variables in the WNV dataset have been transformed by re-categorization. For instance, the variable species is categorical with seven classes, as indicated in the table below:
Table 1: Species
However, Table 1 indicates that only three species have tested positive for WNV. Re-categorization highlights the importance of the three classes associated with WNV, leaving the other four classes to be grouped into a category of their own, indicative of their lack of contribution to the spread of WNV⁴. It is also important to note that the training set has a class titled “uncategorized.” Creating the fourth category, called “Culex Other,” addresses the issue of the unidentified species.

⁴ Table 2 titled Species 2 contains the new groupings.
The re-categorization approach has been applied to the variable date as well.
EXPLORATORY DATA ANALYSIS
One prime focus of an exploratory data analysis is to check whether the characteristics of a data set meet the requirements of the modeling techniques to be used, as some models may be sensitive to certain types of data. That is, how is the data set distributed? Skewness of a distribution, whether positive or negative, is often a result of a “subset of observations that appear to be inconsistent with the remaining observations that follow a hypothesized distribution.” (Sim et al, 2005, pg.642). Histograms and box plots are graphical tools widely used to inspect the data for the presence of outliers. There are two important questions to address after visually inspecting a boxplot. First, is it possible for the boxplot to incorrectly declare certain points as outliers? Second, does the presence of outliers imply the need for a transformation?
The box plots⁵ for the West Nile dataset identify certain variables as skewed, with outliers present. For instance, the distribution of the number of mosquitos is right skewed, being pulled to the right by the largest values in that column. The IQR⁶ rule for outliers indicates that values lying below -20 and above 39.5 are potential outliers. On examining the number of mosquitos trapped for each species, it is apparent that class imbalance plays an important role in the skewness of the data, as shown in Table 2.

⁵ All histograms and box plots with short description of shape, center and spread for the WNV data set can be found in the appendix.
Table 2: Number of Mosquitos Trapped
All numbers above 39.5 represent the species attributed to WNV and the locations where they abound. There exists a pattern between the type of species, the location and the number of mosquitos trapped that is beyond the scope of the boxplot. Similarly, the boxplots for most of the weather variables in the WNV dataset show the presence of outliers. However, weather varies from year to year, month to month, week to week and day to day, and the differences in data points between stations 1 and 2 can be due to the geographical locations of the stations and/or the way in which the instruments record the temperatures. The Natural Resources Management and Environment Department furthers this argument by stating that “weather data collected at a given weather station during a period of several years may be not homogeneous, i.e., the data set representing a particular weather variable may present a sudden change [from one weather station to another]. This phenomenon may occur due to several causes, some of which are related to changes in instrumentation and observation practices, and others, which relate to modification of the environmental conditions of the site” or even “change in the time of the observations.” (para.14)

⁶ The appendix contains a table titled “Lower and Upper Bound Outliers”.

Thus, the skewness of the distribution is not necessarily a consequence of extreme data points; rather, it is a result of class imbalance. For instance, the histogram for the accumulated degree day shows that the distribution is skewed to the right. But when the histogram is constructed separately for the presence and absence of WNV, it becomes clear that class imbalance is the root of the skewness, as seen in the histograms below:
The histograms show that WNV is not present at very low or very high degree days. However, the histograms for acc.deg.day when wnvpresent = 0, or when both classes (1 and 0) are combined, appear flatter. To reduce the skewness of the distribution, the data points were replaced by their square roots, resulting in data that is better behaved than in its original units.
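To illustrate why a square root transformation helps, the sketch below computes a moment-based skewness before and after the transformation. The sample values are made up for the example and the function names are hypothetical; the study's own skewness measure is not stated, so this particular formula is an assumption.

```python
import math


def skewness(values):
    """Moment-based skewness: third central moment over variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def sqrt_transform(values):
    """Replace each non-negative value by its square root."""
    return [math.sqrt(v) for v in values]


counts = [1, 1, 2, 3, 5, 8, 13, 50]  # made-up, right-skewed sample
before = skewness(counts)
after = skewness(sqrt_transform(counts))
```

For this sample `after` is smaller than `before` but still positive, which matches the idea that the transformation compresses the long right tail without removing it entirely.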
In addition to skewness, another factor that affects the predictive capability of a model is the presence of outliers. As noted earlier, the weather data contains outliers as well as missing values. “For a large dataset, removal of samples based on missing values is not a problem, assuming the missingness is not informative” (Kuhn & Johnson, 2013, p.41). However, a more robust way of handling missing information is imputation. “Imputation is layer of modelling where missing values are estimated based on other predictor variables. This amounts to a predictive model within a predictive model” (Kuhn & Johnson, 2013, p.42). Missing values in the weather data set have been addressed using hot deck imputation, where each missing value is replaced with an observed value from a similar unit. “An attractive feature of the hot deck imputation is that only plausible values can be imputed since values come from observed responses in the donor pool” (Andridge & Little, 2011, para. 3), which means that the imputed weather values are more likely to resemble the other data points than imputed averages would. The second advantage of hot deck imputation is that the “method does not rely on model fitting for the variable to be imputed and thus is potentially less sensitive to model misspecification than an imputation method based on a parametric method such as regression imputation” (Andridge & Little, 2011, para. 3).
CORRELATION ANALYSIS
Specific variables in the dataset, such as the number of mosquitos, temperature and precipitation, reveal interesting patterns. The goal of the correlation analysis was to plot a trend that would explain the relationship between these variables and the presence of the West Nile virus. Since the variables are on different scales, they were normalized using the Z score formula. In addition, average values of these variables were used in building the plots. The plots show weekly records for 4 years (2007, 2009, 2011 and 2013) for the months between late May and early October. Individual plots have been drawn for each year. The blue line shows the average precipitation, the red line the average number of mosquitos, the green line the average temperature, and the purple line the presence of the virus.
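The Z score normalization that puts the weekly series on a common scale can be sketched as below. The weekly values are invented for the example, and the use of the population standard deviation is an assumption, since the paper does not say which variant was used.

```python
import statistics


def z_scores(values):
    """Center a series on its mean and scale it by its standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # population sd; an assumed choice
    return [(v - mean) / sd for v in values]


weekly_avg_temp = [68.0, 72.0, 75.0, 79.0, 74.0, 66.0]    # made-up values
weekly_avg_precip = [0.05, 0.30, 0.10, 0.00, 0.45, 0.20]  # made-up values

temp_z = z_scores(weekly_avg_temp)
precip_z = z_scores(weekly_avg_precip)
```

After this step both series have mean 0 and standard deviation 1, so they can share one vertical axis in a line plot.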
Figure 1: 2007
According to the line graph for 2007, a sudden decrease in temperature causes the number of mosquitos to decrease after week 35. Consequently, the average number of virus detections decreases. It was also noted that the higher the temperature and the precipitation, the higher the number of mosquitos and, subsequently, the higher the probability of the presence of the West Nile virus. An interesting pattern was found between precipitation and the increase in the number of mosquitos. The number of mosquitos increases rapidly not during the week of high precipitation but in the week after. It appears that once the number of mosquitos increases, the virus then infects the mosquitos. The number of mosquitos in week 35 is low. However, the graph shows that the presence of the virus is more prominent than before, indicating that the remaining mosquitos carry the virus even though the mosquito population is small. Not surprisingly, as the temperature declines rapidly [even with high precipitation], the number of mosquitos and the presence of WNV drop. All plots capture similar trends.
Figure 2: 2009
Figure 3: 2011
Figure 4: 2013
The scatterplots below show that the number of mosquitos and the presence of WNV have a positive relationship with dmonth, dweek, dewpoint, cool, tmax, tmin, tavg and spray. Therefore, the model is expected to rely on these features more than the others to predict WNV.
Though the relationships are positive, their strength appears to be weak. A closer look at the scatterplots shows some evidence of multicollinearity. For instance, in the plot titled temp and weather there are blocks of strong positive correlations that indicate collinearity, an issue to consider in the modeling process.
MODELS
Accurately predicting the presence of WNV essentially amounts to selecting the best spatial, temporal and weather features along with a specifically tuned classification algorithm. It is evident from the exploratory analysis as well as from the literature that certain individual features are crucial in predicting WNV. Therefore, the modeling process for this data set is broken into two parts. Part I focuses on determining how best to incorporate the available features into a classification model. Part II focuses on investigating and fine-tuning the specific classification algorithms to yield the best possible prediction.
Part I
Weather Data and Principal Component Analysis
Due to the number of weather attributes available in the dataset, it is difficult to ascertain the combination that will result in the best model. Moreover, the nature of weather is such that most individual features will be correlated with one another, resulting in multicollinearity. For example, the amount of precipitation will be correlated with atmospheric pressure, which in turn will be correlated with temperature. Therefore, to combat multicollinearity, principal component analysis (PCA) was used to extract features that highlight the similarities and differences of the original weather data while eliminating the detrimental effects that can result from the linear dependency of predictor variables. Figure 5 summarizes the results of PCA conducted on the weather attributes. The first five components capture 97% of the variation in the weather data. The loadings of component 1 suggest it is highly related to temperature, humidity and pressure; a large value for component 1 seems to represent a sunny but chilly day. Component 2 appears to capture wind information, while component 3 summarizes precipitation. The first five components from PCA are used to represent the weather conditions of a specific day in the data.
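The idea that correlated weather variables collapse onto a few components can be seen in the smallest possible case, two standardized variables. Their correlation matrix [[1, r], [r, 1]] has eigenvalues 1 + |r| and 1 - |r|, so the first component explains (1 + |r|) / 2 of the variance. The sketch below is only this two-variable illustration with made-up temperatures, not the PCA run on the full weather data, and the function names are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def explained_variance_two_vars(x, y):
    """Variance shares of the two principal components of standardized x and y."""
    r = abs(pearson_r(x, y))
    return [(1 + r) / 2, (1 - r) / 2]


tmax = [88, 79, 91, 70, 84, 95]  # made-up daily maxima
tmin = [66, 60, 70, 52, 63, 73]  # made-up daily minima
shares = explained_variance_two_vars(tmax, tmin)
```

The more strongly the two variables are correlated, the closer the first share gets to 1, which is the reason a handful of components can stand in for many weather attributes.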
Figure 5: PCA
Figure 6: Clustering
Figure 7: Model Summary
Temporally based weather variables and week number
While the weather conditions of a specific day can affect the activity level of mosquitos on that day, they do not account for a mosquito’s life cycle or for the timing of weather conditions and its effect on mosquito populations. Hence, engineered features such as growing degree days, moving temperature averages/sums and moving precipitation averages/sums (all mentioned in previous sections) are included in the model. Week numbers of the year are also incorporated to capture the seasonal (within-year) timing of mosquito populations.
Clustering Location Data
Determining a good way to represent location will most likely improve the predictive power of the models. Although the WNV challenge provides raw longitude and latitude values to represent location, these are not believed to be in a form conducive to predictive modeling, due to the non-linear nature of spatial data. Thus, the k-means algorithm (k = 20) was used to translate the location data, represented by longitude/latitude pairs, into clustered locations. Figure 6 shows the location of the clusters using a normalized scale.
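A minimal sketch of how k-means turns coordinate pairs into cluster labels is given below. It uses plain Lloyd iterations with a deterministic initialization, a handful of invented trap coordinates and k = 2 to stay readable; the study used k = 20, and its initialization and software are not stated, so those details here are assumptions.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two coordinate tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=50):
    """Plain Lloyd's algorithm; returns (centroids, labels)."""
    # Deterministic start: k points evenly spaced through the input (assumed choice).
    step = len(points) / k
    centroids = [points[int(i * step)] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return centroids, labels


# Made-up (latitude, longitude) pairs forming two obvious groups.
traps = [(41.95, -87.80), (41.96, -87.79), (41.94, -87.81),
         (41.70, -87.60), (41.71, -87.61), (41.69, -87.59)]
centroids, labels = kmeans(traps, k=2)
```

The resulting `labels` list is what would enter a model as a categorical location variable, one label per trap record.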
As one can observe, the clustered locations outline the Chicago area quite accurately. These clustered locations are used as a categorical variable in our models.
Part II
With the necessary data pre-processing and variable transformations completed, the focus moved to the construction of models to predict WNV. The overall approach was to build an ensemble, a model that takes a weighted average of a set of classifiers and generally outperforms the individual classifiers from which it is built. The strategy was to consider four individual algorithms and build the best possible classifier from each to include in the final ensemble model: 1) logistic regression using a generalized additive model (GAM), 2) linear discriminant analysis (LDA), 3) random forests, and 4) support vector machines (SVM). Kaggle’s train dataset was randomly split into 70% and 30% portions, where the 70% was used as the training set and the remaining 30% served as the hold-out test set. Figure 7 summarizes the best set-up for each algorithm. Of all the individual models, GAM was the best performing, with an AUC value of 0.8253717. The best overall ensemble classifier was a weighted average of GAM and SVM with weights of 0.6 and 0.4, respectively, and an AUC of 0.8361962.
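The weighted-average ensemble and its AUC can be sketched as follows. The predicted probabilities are invented for the example and the function names are hypothetical; only the 0.6 and 0.4 weights come from the text. AUC is computed here by direct pairwise comparison, which is practical only for small samples.

```python
def auc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def weighted_ensemble(p_a, p_b, w_a=0.6, w_b=0.4):
    """Weighted average of two classifiers' predicted probabilities."""
    return [w_a * a + w_b * b for a, b in zip(p_a, p_b)]


y = [0, 0, 1, 0, 1, 1]                          # made-up WNV labels
p_gam = [0.10, 0.40, 0.35, 0.20, 0.80, 0.70]    # made-up GAM probabilities
p_svm = [0.20, 0.10, 0.60, 0.30, 0.25, 0.90]    # made-up SVM probabilities

p_ens = weighted_ensemble(p_gam, p_svm)
auc_gam, auc_svm, auc_ens = auc(y, p_gam), auc(y, p_svm), auc(y, p_ens)
```

In this toy data each classifier misorders a different pair of records, so the averaged scores rank the records better than either model alone, which is the behaviour an ensemble hopes for but does not guarantee.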
CONCLUSION
Although the ensemble model had the highest AUC value achieved on the training dataset, it only reached an AUC of 0.6220 on the Kaggle leaderboard. In fact, over 50 models were submitted to Kaggle and the results were rarely as expected. The two best models on the leaderboard were an ensemble of GAM logistic regression and GLM logistic regression, and a slightly modified Poisson GLM model. Neither had a notable training AUC, but both performed well on Kaggle. Other validation techniques were investigated in an attempt to obtain better feedback from the training process and thereby build a better model. Instead of a 70/30 training and testing split, a modified version of n-fold cross validation was used, in which one year’s data was held out for testing and the remaining years were used for training. This process was repeated four times, once for each year, and the model’s performance was averaged over the four runs. The best models obtained with this validation technique did not seem any different from the models built on a traditional 70/30 split.
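The leave-one-year-out validation described above can be sketched as below. The record layout, the `fit` and `score` callables and the function names are placeholders assumed for the example; any model and metric could be plugged in.

```python
def leave_one_year_out(records, year_of):
    """Yield (held_out_year, train, test), holding out one year at a time."""
    years = sorted({year_of(r) for r in records})
    for held_out in years:
        train = [r for r in records if year_of(r) != held_out]
        test = [r for r in records if year_of(r) == held_out]
        yield held_out, train, test


def cross_validated_score(records, year_of, fit, score):
    """Average a score over all leave-one-year-out folds."""
    fold_scores = []
    for _, train, test in leave_one_year_out(records, year_of):
        model = fit(train)                       # train on the remaining years
        fold_scores.append(score(model, test))   # evaluate on the held-out year
    return sum(fold_scores) / len(fold_scores)


# Made-up records of the form (year, wnv_present).
records = [(2007, 1), (2007, 0), (2009, 0), (2011, 1), (2013, 0)]
folds = list(leave_one_year_out(records, year_of=lambda r: r[0]))
```

Splitting by year rather than at random keeps records from the same season out of both sides of a fold, which is closer to how the Kaggle test years relate to the training years.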
Figure 8: Models & Imbalance
Because there is a gross imbalance of positive and negative cases in the WNV data, further examination was conducted to see whether the imbalance had any influence on the effectiveness of training and validation. Figure 8 shows the performance of several models and their relationship with data imbalance. Except for one model, none displayed a drastic sensitivity to data balance. If the choice of validation technique does not account for the disparity between the training AUC and the Kaggle leaderboard AUC, it is surmised that there may be a fundamental difference between the characteristics of the training data and the testing data. Specifically, it is possible that there are idiosyncratic intra-annual variations in weather that cannot be captured in the training set due to how the WNV problem is set up. Ezanno et al (2014) report that populations of certain mosquito species do in fact have inter-annual variations due to specific weather events in a year. It is therefore suspected that the best algorithms discussed above are overfitting the training data. While the best models in this study capture the variations in weather in the training data well, they are unable to replicate this in the testing data. This makes intuitive sense, as most of the models that performed better on Kaggle tend to be simple models that included variables like location, week number and mosquito species, which generalize across all years of the data.
Another matter to consider for future model building is the importance of the spray data. Though the spray data is not a part of the testing dataset, which would warrant its immediate dismissal from the predictor selection process, the following heat map implies otherwise. Upon close inspection of the heat map, one speculates that spraying in one year does indeed alter the mosquito population the next year, which might explain why mosquito populations appear in different locations each year. Also, feature engineering of the predictor variable depart [departure from normal] might help create a deeper understanding of the problem at hand. A possible means of engineering this predictor would be to categorize the deviation from normal temperature as hotter than normal or colder than normal.
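One way the proposed depart categories might be coded is sketched below. The handling of a zero departure as a separate "normal" label and of the "M" marker as missing are assumptions made for the example, as is the function name.

```python
def categorize_depart(depart):
    """Map a departure-from-normal temperature to a coarse category.

    Returns None for missing readings, assumed here to be None, an empty
    string, or the "M" marker mentioned in the data description.
    """
    if depart is None or str(depart).strip() in {"M", ""}:
        return None
    value = float(depart)
    if value > 0:
        return "hotter than normal"
    if value < 0:
        return "colder than normal"
    return "normal"  # assumed third category for a zero departure


examples = [categorize_depart(v) for v in ["5", -3, 0, "M"]]
```

Whether a zero departure deserves its own level or should be folded into one of the two categories is a modeling choice the text leaves open.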
Appendix
Table 3: Data Fields
FIELDS
Number  Train             Weather                Spray      Test
1       Date              Station                Date       ID
2       Address           Date                   Time       Date
3       Species           Max Temperature        Latitude   Address
4       Block             Min Temperature        Longitude  Species
5       Street            Avg Temperature        -          Block
6       Trap              Departure from Normal  -          Street
7       Address Number    Dew Point              -          Trap
8       Latitude          Wet Bulb               -          Address Number
9       Longitude         Heat                   -          Latitude
10      Address Accuracy  Cool                   -          Longitude
11      # of Mosquitoes   Sunrise                -          Address Accuracy
12      Wnvpresent        Sunset                 -          -
13      -                 Code Sum               -          -
14      -                 Depth                  -          -
15      -                 Water1                 -          -
16      -                 Snowfall               -          -
17      -                 Total Precipitation    -          -
18      -                 Station Pressure       -          -
19      -                 Sea Level              -          -
20      -                 Wind Speed             -          -
21      -                 Wind Direction         -          -
22      -                 Average Speed          -          -
SKEWNESS OF VARIABLES & OUTLIERS
DATE PATTERN
The data is skewed to the left. There are more records for 2007 than for other years, but not by a significant amount. If this becomes problematic, we may sample an equal number of records for each year.
LATITUDE PATTERN
Shape: Latitude is very slightly skewed to the left; the mean is less than the median.
Center: 41.84628
Spread: 41.64461 to 42.01743
LONGITUDE PATTERN
Shape: Longitude is symmetric.
Center: -87.69499
Spread: -87.93099 to -87.53163
NUMBER OF MOSQUITOS PATTERN
Shape: The distribution is right skewed, as the mean of 12.85351 is pulled to the right, away from the median of 5.
Center: 5
Spread: 1 to 50
Outlier: The boxplot confirms the skewness of the histogram in that there are large numbers causing the distribution to be pulled to the right. The outlier function indicates that the largest number in the data for number of mosquitos is 50.
DISTANCE FROM O’HARE PATTERN
Shape: The distribution is symmetric.
Center: 0.2943334
Spread: 0.0372549 to 0.5179756
DISTANCE FROM MIDWAY PATTERN
Shape: The distribution is slightly skewed to the left, as the mean 0.1548598 is pulled away from the median 0.1616137.
Center: 0.1616137
Spread: 0.0077139 to 0.2481943
MAXIMUM TEMPERATURE PATTERN
Shape: The distribution is skewed to the left, as the mean 81.94765 is pulled away to the left from the median 83.
Center: 83
Spread: 57 to 97
Outlier: The box plot shows the presence of some points influencing the movement of the distribution to the left. The outlier function indicates that 57 is the point that is distant from the other values in the dataset.
MINIMUM TEMPERATURE PATTERN
Shape: The distribution is skewed to the left, as the mean 64.16533 is pulled away to the left from the median 66.
Center: 66
Spread: 41 to 79
Outlier: The box plot shows the presence of some points influencing the movement of the distribution to the left. The outlier function indicates that 41 is the point that is distant from the other values in the dataset.
AVERAGE TEMPERATURE PATTERN
Shape: The distribution is skewed to the left as the mean 38.28412 is pulled away to the left from the median 40 Center: 40 Spread: 15 to 52 Outlier: The box plot shows the presence of some points influencing the movement of the distribution to the left. The outlier function indicates that 15 is the point that is distant from the other values in the dataset.
TOTAL PRECIPITATION PATTERN
Shape: The distribution is skewed to the right: the mean (0.1274281) is pulled to the right of the median (0).
Center: 0
Spread: 0.00 to 3.97
Outlier: The box plot shows some points pulling the distribution to the right. The outlier function indicates that 3.97 is the point most distant from the other values in the dataset.
RESULT OF WIND SPEED PATTERN
Shape: The distribution is skewed to the right: the mean (5.911003) is pulled to the right of the median (5.5).
Center: 5.5
Spread: 0.1 to 15.4
Outlier: The box plot shows some points pulling the distribution to the right. The outlier function indicates that 15.4 is the point most distant from the other values in the dataset.
RESULT OF WIND DIRECTION PATTERN
Shape: The distribution is skewed to the left: the mean (17.72016) is pulled to the left of the median (19).
Center: 19
Spread: 1 to 36
AVERAGE WIND SPEED PATTERN
Shape: The distribution is skewed to the left: the mean (123.4147) is pulled to the left of the median (139).
Center: 139
Spread: 3 to 177
Outlier: The box plot shows some points pulling the distribution to the left. The outlier function indicates that 3 is the point most distant from the other values in the dataset.
TEMPERATURE MOVING AVERAGES – 1 WEEK PATTERN
Shape: The distribution is skewed to the left: the mean (72.5431) is pulled to the left of the median (73.14286).
Center: 73.14286
Spread: 53.14286 to 83.85714
Outlier: The box plot shows some points pulling the distribution to the left. The outlier function indicates that 53.14286 is the point most distant from the other values in the dataset.
TEMPERATURE MOVING AVERAGES – 2 WEEK PATTERN
Shape: The distribution is skewed to the left: the mean (72.41439) is pulled to the left of the median (73).
Center: 73
Spread: 55.07143 to 82.76923
Outlier: The box plot shows some points pulling the distribution to the left. The outlier function indicates that 55.07143 is the point most distant from the other values in the dataset.
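The 1-week and 2-week temperature variables are moving averages of the daily temperature. As a rough illustration, the Python sketch below computes a trailing moving average over a fixed window of days. It assumes that each value averages the current day and the days immediately before it, and that early days use whatever shorter history is available; the paper does not specify either choice. The name `trailing_mean` and the sample temperatures are hypothetical.

```python
def trailing_mean(values, window):
    """Mean of each value and up to window - 1 values immediately before it."""
    out = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            # Drop the value that has just left the window.
            running -= values[i - window]
        # Early positions have fewer than `window` values available.
        out.append(running / min(i + 1, window))
    return out


# Hypothetical daily average temperatures, not taken from the dataset.
daily_avg_temp = [70, 72, 74, 76, 78, 80, 82, 84]
one_week = trailing_mean(daily_avg_temp, 7)
print(one_week)
```

A 2-week average follows from the same function with `window=14`. A trailing window uses only past weather, which suits the paper's premise that conditions in earlier weeks affect mosquito development.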
MOVING AVGS OF PRECIPITATION – 1 WEEK PATTERN
Shape: The distribution is skewed to the right: the mean (0.1333564) is pulled to the right of the median (0.07).
Center: 0.07
Spread: 0.0000 to 1.42857
Outlier: The box plot shows some points pulling the distribution to the right. The outlier function indicates that 1.42857 is the point most distant from the other values in the dataset.
MOVING AVGS OF PRECIPITATION – 2 WEEK PATTERN
Shape: The distribution is skewed to the right: the mean (0.130) is pulled to the right of the median (0.085).
Center: 0.085
Spread: 0.0007 to 0.76714
Outlier: The box plot shows some points pulling the distribution to the right. The outlier function indicates that 0.76714 is the point most distant from the other values in the dataset.
MOVING SUM OF PRECIPITATION – 1 WEEK PATTERN
Shape: The distribution is skewed to the right: the mean (0.9432334) is pulled to the right of the median (0.53).
Center: 0.53
Spread: 0.000 to 9.149
Outlier: The box plot shows some points pulling the distribution to the right. The outlier function indicates that 9.15 is the point most distant from the other values in the dataset.
MOVING SUM OF PRECIPITATION – 2 WEEK PATTERN
Shape: The distribution is skewed to the right: the mean (1.74216) is pulled to the right of the median (1.1).
Center: 1.1
Spread: 0.000 to 10.74999
Outlier: The box plot shows some points pulling the distribution to the right. The outlier function indicates that 10.75 is the point most distant from the other values in the dataset.
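The moving sums of precipitation total the rainfall over the preceding one or two weeks. A minimal Python sketch of one way to compute such a sum is shown here, assuming a trailing window that includes the current day. Each window is summed afresh with `math.fsum` rather than by updating a running total, which avoids the small floating-point drift a running total can accumulate. The name `trailing_sum` and the sample rainfall values are hypothetical.

```python
from math import fsum


def trailing_sum(values, window):
    """Sum of each value and up to window - 1 values immediately before it."""
    return [
        fsum(values[max(0, i - window + 1): i + 1])
        for i in range(len(values))
    ]


# Hypothetical daily precipitation totals, not taken from the dataset.
daily_precip = [0.0, 0.5, 0.0, 0.0, 1.2, 0.0, 0.0, 0.3]
one_week_sum = trailing_sum(daily_precip, 7)
print(one_week_sum)
```

Because precipitation is never negative, a sum computed this way cannot fall below zero.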
DEGREE DAY PATTERN
Shape: The distribution is skewed to the right: the mean (3.824472) is pulled to the right of the median (3.4).
Center: 3.4
Spread: 0.0 to 14.9
ACCUMULATED DEGREE DAY FOR EACH YEAR PATTERN
Shape: The distribution is skewed to the right: the mean (241.0934) is pulled to the right of the median (239.6).
Center: 239.6
Spread: 1.3 to 521.1
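A degree day is commonly defined as the amount by which the daily mean temperature exceeds a base temperature, floored at zero, and the accumulated value restarts at the beginning of each year. The Python sketch below illustrates that common definition. The base temperature of 22.0, the use of (max + min) / 2 as the daily mean, and the names `degree_day` and `accumulate_by_year` are assumptions for illustration; this section does not give the formula the authors used.

```python
def degree_day(t_max, t_min, base):
    """Degrees by which the daily mean temperature exceeds the base, never below 0."""
    return max(0.0, (t_max + t_min) / 2 - base)


def accumulate_by_year(records, base):
    """Running degree-day total that restarts for every year.

    `records` is an iterable of (year, t_max, t_min) in chronological order.
    """
    totals = {}
    out = []
    for year, t_max, t_min in records:
        totals[year] = totals.get(year, 0.0) + degree_day(t_max, t_min, base)
        out.append((year, totals[year]))
    return out


BASE = 22.0  # hypothetical base temperature

# Hypothetical daily records, not taken from the dataset.
records = [
    (2007, 30, 20),
    (2007, 32, 24),
    (2008, 20, 10),
    (2008, 31, 21),
]
accumulated = accumulate_by_year(records, BASE)
print(accumulated)
```

The accumulated series can only stay level or rise within a year, so it behaves as a measure of how far the season has progressed.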
GROUPED LINE GRAPH | YEAR 2007
Blue line: The average precipitation. Red line: The average number of mosquitos.
Green line: The average temperature. Purple line: The presence of virus.
GROUPED LINE GRAPH | YEAR 2009
Blue line: The average precipitation. Red line: The average number of mosquitos.
Green line: The average temperature. Purple line: The presence of virus.
GROUPED LINE GRAPH | YEAR 2011
Blue line: The average precipitation. Red line: The average number of mosquitos.
Green line: The average temperature. Purple line: The presence of virus.
GROUPED LINE GRAPH | YEAR 2013
Blue line: The average precipitation. Red line: The average number of mosquitos.
Green line: The average temperature. Purple line: The presence of virus.
Works Cited
Abdi, Herve. Multivariate analysis. Retrieved from www.utdallas.edu/~herve/Abdi-MultivariateAnalysis-pretty.pdf
Andridge & Little. (2011). A review of hot deck imputation for survey non-response. Int Stat Rev. 78(1): 40-64. Retrieved from http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3130338/
Ezanno, P., Aubry-Kientz, M. et al. (2015). A generic weather driven model to predict mosquito population dynamics applied to species of Anopheles, Culex and Aedes genera of southern France. 120(1): 39-50. Retrieved from http://www.ncbi.nlm.nih.gov/pubmed/25623972
Kaggle. West Nile Prediction. Retrieved from https://www.kaggle.com/c/predict-west-nile-virus/data
Kuhn & Johnson. (2013). Applied Predictive Modeling. New York, Springer.
Natural Resources Management and Environmental Departments. Annex 4: Statistical Analysis of Weather Data Sets 1. Retrieved from http://www.fao.org/docrep/x0490e/x0490e0l.htm#TopOfPage
Ruiz, Marilyn O., F Chavez Luis et al. (2010). Local impact of temperature and precipitation on West Nile virus infection in Culex species mosquitoes in northeast Illinois, USA. Parasites & Vectors. Retrieved from http://www.parasitesandvectors.com/content/3/1/19
Ruiz, Marilyn O., Edward D. Walker et al. (2007). Association of West Nile virus illness and urban landscapes in Chicago and Detroit. International Journal of Health Geographics.
Theophilides, C.N., S.C. Ahearni et al. (2006). First evidence of West Nile virus amplification and relationship to human infections. International Journal of Geographical Information Science, 20, 103-115.
Sim, C.H., Gan, F.F. et al. (2005). Outlier labeling with boxplot procedures. Journal of the American Statistical Association, 100(470). Retrieved from http://www.jstor.org/stable/27590584