+ All Categories
Home > Documents > Web viewLab3 Algorithms and Statistical Models for Routine Data Analytics November 20, 2014 ......

Web viewLab3 Algorithms and Statistical Models for Routine Data Analytics November 20, 2014 ......

Date post: 07-Feb-2018
Category:
Upload: phamtu
View: 219 times
Download: 4 times
Share this document with a friend
14
Lab3 Algorithms and Statistical Models for Routine Data Analytics November 20, 2014/Dec 4 Overview of Lab3: We begin the lab with another clustering method called Partitioning around Medians (PAM) that can handle categorical data (such {male, female}, {country code} etc.). Then we will work on a complete example for Bayesian using real congressional voting records and the issues. Logistic regression will be illustrated using a “brand recognition” synthetic data. Last exercise is a demo of data sharing web application based on R package called Shiny. Note: We called Lab1 R project Lab1EDA, Lab2 R project Lab2Alg; We will call this R project Lab3Bay (for Bayesian). File New Project Lab3Bay Unzip the files for today from Session3Dec4.zip. Copy the “data” folder into the Lab3Bay folder. Exercise 1: PAM: Partitioning around medoids: #Extract information from WDI package that has all the world bank information on countries. indicators<-c("BX.KLT.DINV.WD.GD.ZS", "NY.GDP.DEFL.KD.ZG","NY.GDP.MKTP.CD", "NY.GDP.MKTP.KD.ZG", "NY.GDP.PCAP.CD","NY.GDP.PCAP.KD.ZG","TG.VAL.TOTL.GD.ZS") install.packages("WDI") require(WDI) wbInfo<- WDI(country="all",indicator=indicators,start=2013,end=2013,extra=TRUE) wbInfo<-wbInfo[wbInfo$region != "Aggregates",] wbInfo<-wbInfo[which(rowSums(!is.na(wbInfo[,indicators]))>0),] wbInfo<-wbInfo[!is.na(wbInfo$iso2c),] # make the row names same as the country code : iso2c : ISO country code 2 character rownames(wbInfo)<-wbInfo$iso2C #make sure categorical data are factorized/considered as categories: see the catergories wbInfo$region <-factor(wbInfo$region) wbInfo$income<-factor(wbInfo$income) wbInfo$lending<-factor(wbInfo$lending)
Transcript
Page 1: Web viewLab3 Algorithms and Statistical Models for Routine Data Analytics November 20, 2014 ... ("BX.KLT.DINV.WD.GD.ZS", ... prob.spam.word = prob.word.spam

Lab3 Algorithms and Statistical Models for Routine Data Analytics November 20, 2014/Dec 4

Overview of Lab3: We begin the lab with another clustering method called Partitioning around Medians (PAM) that can handle categorical data (such {male, female}, {country code} etc.). Then we will work on a complete example for Bayesian using real congressional voting records and the issues. Logistic regression will be illustrated using a “brand recognition” synthetic data. Last exercise is a demo of data sharing web application based on R package called Shiny.

Note: We called Lab1 R project Lab1EDA, Lab2 R project Lab2Alg; We will call this R project Lab3Bay (for Bayesian). File New Project Lab3Bay

Unzip the files for today from Session3Dec4.zip. Copy the “data” folder into the Lab3Bay folder.

Exercise 1: PAM: Partitioning around medoids:

#Extract information from WDI package that has all the world bank information on countries.

indicators<-c("BX.KLT.DINV.WD.GD.ZS", "NY.GDP.DEFL.KD.ZG","NY.GDP.MKTP.CD", "NY.GDP.MKTP.KD.ZG", "NY.GDP.PCAP.CD","NY.GDP.PCAP.KD.ZG","TG.VAL.TOTL.GD.ZS")install.packages("WDI")require(WDI)wbInfo<-WDI(country="all",indicator=indicators,start=2013,end=2013,extra=TRUE)wbInfo<-wbInfo[wbInfo$region != "Aggregates",]wbInfo<-wbInfo[which(rowSums(!is.na(wbInfo[,indicators]))>0),]wbInfo<-wbInfo[!is.na(wbInfo$iso2c),]

# make the row names same as the country code : iso2c : ISO country code 2 characterrownames(wbInfo)<-wbInfo$iso2C

#make sure categorical data are factorized/considered as categories: see the catergorieswbInfo$region <-factor(wbInfo$region)wbInfo$income<-factor(wbInfo$income)wbInfo$lending<-factor(wbInfo$lending)

levels(wbInfo$region)levels(wbInfo$income)levels(wbInfo$lending)

# keep the information only for counties whose iso is known# use pamr package and cluster them into 12 clustersinstall.packages("pamr")require(pamr)keep.cols<- which(!names(wbInfo) %in% c("iso2c","country","year","capital","iso3c"))wbPam <- pam(x=wbInfo[,keep.cols], k=12, keep.diss=TRUE, keep.data=TRUE)wbPam$medoidsplot(wbPam,which.plots=2, main="")head(wbPam) # study the data and the clusters… not so clear

Page 2: Web viewLab3 Algorithms and Statistical Models for Routine Data Analytics November 20, 2014 ... ("BX.KLT.DINV.WD.GD.ZS", ... prob.spam.word = prob.word.spam

require(maptools)world <- readShapeSpatial("data/world_country_admin_boundary_shapefile_with_fips_codes.shp")head(world@data)world@data

#read the world shape file with boundaries + data information, extract datarequire(plyr)world@data$FipsCntry <- as.character (revalue(world@data$FipsCntry, replace=c(AU="AT",AS="AU",VM="VN",BM="MM",SP="ES",PO="PT",IC="IL",SF="ZA",TU="TR",IZ="IQ", UK="GB",EI="IE",SU="SD",MA="MG",MO="MA",JA="JP",SW="SE",SN="BG")))world@data

# form the world data frameworld@data$id <-rownames(world@data)require(ggplot2)install.packages("rgeos")require(rgeos)

world.df <- fortify(world,region ="id")head(world.df)

world.df <- join(world.df, world@data[,c("id","CntryName","FipsCntry")],by="id")head(world.df)

#Using the PAM cluster we createdclusterMembership <-data.frame(FipsCntry=wbInfo$iso2c,Cluster=wbPam$clustering,stringsAsFactors=FALSE)

head(clusterMembership)

#Prepare data for display and joining the cluster and Federal Info. Processing code (FipsCntry)world.df <- join(world.df, clusterMembership, by="FipsCntry")

world.df$Cluster <- as.character(world.df$Cluster)world.df$Cluster <- factor(world.df$Cluster,levels=1:12)

ggplot() + geom_polygon (data=world.df,aes(x=long,y=lat, group=group, fill=Cluster, color=Cluster))

Page 3: Web viewLab3 Algorithms and Statistical Models for Routine Data Analytics November 20, 2014 ... ("BX.KLT.DINV.WD.GD.ZS", ... prob.spam.word = prob.word.spam

Exercise 2: Naïve Bayes based classification.nspam= 1500ngood = 3762prob.spam = 1500/(1500+3762)prob.good = 3761/(1500+3762)names(count.words.spam)=c("meeting","budget","enron","lottery")names(count.words.good)=c("meeting","budget", "enron","lottery")count.words.spam<-c(16,10,1,400)count.words.good<-c(153,200,3000,4)

#classify "lottery" as spam with a percent probabilitynword.spam = count.words.spam["lottery"]nword.good = count.words.good["lottery"]prob.word.spam = nword.spam/nspamprob.word.good = nword.good/ngoodprob.word = prob.word.spam*prob.spam + prob.word.good*prob.goodprob.spam.word = prob.word.spam*prob.spam/prob.wordprob.spam.word*100

Exercise 3: Naïve Bayes is a classification algorithm that is generally a simple but accurate choice for classifying multi-dimensional data. (about 85% accuracy, more with some heuristics)

We try to tailor these labs to current events. At this time of writing the United States midterm elections just took place; so we will investigate a Weka1dataset from the University of California, Irvine. The data includes voting patterns for US House of Representatives Congressmen2. Given the voting records of 16 key issues, we will classify each Representative as Republican or Democrat. Fortunately, this dataset is already included in an R package called mlbench. This is a historical data from long ago. But you will see that things have not changed much in politics. (Our congress is still dealing with immigration and alternative energy companies!)

1 http://www.cs.waikato.ac.nz/ml/weka/datasets.html 2 http://prdownloads.sourceforge.net/weka/datasets-UCI.jar

Page 4: Web viewLab3 Algorithms and Statistical Models for Routine Data Analytics November 20, 2014 ... ("BX.KLT.DINV.WD.GD.ZS", ... prob.spam.word = prob.word.spam

install.packages("mlbench")library(mlbench)

#load the house votes datadata (HouseVotes84, package="mlbench")HouseVotes84head(HouseVotes84)

Note that the <NA> positions means that data was not gathered.

A brief description of V1, v2 etc. for the sake of understanding the various issues is given below:

V1 # 1. Class Name: (democrat, republican)V2 # 2. handicapped-infants:V3 # 3. water-project-cost-sharing:V4 # 4. adoption-of-the-budget-resolution:V5 # 5. physician-fee-freeze:V6 # 6. el-salvador-aid:V7 # 7. religious-groups-in-schools:V8 # 8. anti-satellite-test-ban:V9 # 9. aid-to-nicaraguan-contras:V10 # 10. mx-missile:V11 # 11. immigration:V12 # 12. synfuels-corporation-cutback:V13 # 13. education-spending:V14 # 14. superfund-right-to-sue:V15 # 15. crime:V16 # 16. duty-free-exports:V17 # 17. export-administration-act-south-africa:

1.) Now, we will use R to run Naïve Bayes. The naiveBayes() function takes in two parameters: the columns to model, and the data to model. In this case, we are specifying to model every column

Page 5: Web viewLab3 Algorithms and Statistical Models for Routine Data Analytics November 20, 2014 ... ("BX.KLT.DINV.WD.GD.ZS", ... prob.spam.word = prob.word.spam

except Class (Class is the model we are trying to predict); label /predict “Class” column as Democrat or Republican based on the voting “prior” knowledge.# learn the behavior using machine learning probabilistic approach using Naïve Bayesmodel <- naiveBayes(Class~., data=HouseVotes84)

2.) We now have a model to predict whether or not a voting pattern aligns with the Democrat or Republican party. Let’s run a prediction (we’ll do a prediction on our test data first):pre <- predict(model, HouseVotes84)

3.) Now, let’s take a look at some predictions of the test data. Here, we look at the first 5 voting records in our training data set, and then look at the predictions. Note that the 3 rd vote was categorized incorrectly. Classification models aren’t perfect! The more training data you have, the more accurate your model! Note too that we are using Naïve Bayes, so it looks at each voting issue independently, while in reality they might have a relationship.

4.) Now, let’s read in some test data. We have a small CSV file with sample data we’ve generated for you; feel free to edit it and reimport it into R for your own analysis. This is in your zip file for today’s lecture. sample <- read.csv("sampleVotes.csv")

Page 6: Web viewLab3 Algorithms and Statistical Models for Routine Data Analytics November 20, 2014 ... ("BX.KLT.DINV.WD.GD.ZS", ... prob.spam.word = prob.word.spam

5.) Now, let’s run the prediction on our sample data :

A few interesting points: a voter with no record is classified as a Democrats by default – this is the result of our model and how many NAs were Democrats vs Republicans! In addition, all “y” or all “n” results in a Democrat classification – again the result of our model. Scroll through the data to find additional information.

6.) Let’s do some additional analysis. Let’s compare the Democrat and Republican counts for the actual data and what the model predicts. To do this, we’ll copy the original dataset into a new dataframe. Then we’ll replace the Class there with what our model predicted. After that, we’ll get the running counts of Republicans and Democrats in each model. Looks like our model is a bit biased; it reclassified several Democrats as Republicans! Remember that Naïve Bayes is not a perfect model, and to expect some inaccuracies.

7.) We can also plot some of our results.8.) Here, plot two issues, arbitrarily chosen, Missile and Immigration (v10, V11) using a histogram:

ggplot(HouseVotes84, aes(x=V10:V11,fill=Class)) + geom_histogram(binwidth=1)

Page 7: Web viewLab3 Algorithms and Statistical Models for Routine Data Analytics November 20, 2014 ... ("BX.KLT.DINV.WD.GD.ZS", ... prob.spam.word = prob.word.spam

9.) Here we can see along the bottom those that voted no to both issues, those that voted no to the first but yes to the second, those that votes yes to the first and no to the second, those that voted yes to both, and vice versa.

10.)Now, let’s do the same plot for our predicted data:ggplot(HouseVotes84model, aes(x=V10:V11,fill=Class)) + geom_histogram(binwidth=1)

As you can see from the similarity of the charts, our model is fairly accurate at predicting!

Page 8: Web viewLab3 Algorithms and Statistical Models for Routine Data Analytics November 20, 2014 ... ("BX.KLT.DINV.WD.GD.ZS", ... prob.spam.word = prob.word.spam

Here is the complete code without any breaks for discussion:

install.packages("mlbench") # Step1: install all needed packages and required librarieslibrary(mlbench)install.packages("ggplot2")library(ggplot2)install.packages("e1071")library(e1071)

data (HouseVotes84, package="mlbench") #Step 2: get the data and understand it, view itHouseVotes84head(HouseVotes84)

model <- naiveBayes(Class~., data=HouseVotes84) #Step 3: model it using Naïve Bayespre <- predict(model, HouseVotes84)

HouseVotes84[1:5,] #Step 4: compare actual and predictedpre[1:5]

sample <- read.csv("sampleVotes.csv") # Step 5: run model on a different data setsampleclassifyMe<-predict(model, sample)classifyMe

HouseVotesModel<-HouseVotes84 #step 6: check the prediction accuracyHouseVotesModel[,1] <-pretable(HouseVotes84[,1])table(HouseVotesModel[,1])# +/- 6% error rate in prediction

#Step 7: plot it and visualize it; compare actual and predictedggplot(HouseVotes84, aes(x=V10:V11,fill=Class)) + geom_histogram(binwidth=1) ggplot(HouseVotesModel, aes(x=V10:V11,fill=Class)) + geom_histogram(binwidth=1)

Page 9: Web viewLab3 Algorithms and Statistical Models for Routine Data Analytics November 20, 2014 ... ("BX.KLT.DINV.WD.GD.ZS", ... prob.spam.word = prob.word.spam

Exercise 4: Logistic regression with synthetic data; Brand recognition focus group or survey or field data. Two sets of data, pre and post of an advertisement campaign. You can judge the success of the campaign by the two logistic regression plots. Use the file MarketResearch.csv for this exercise.

data6<-read.csv(file.choose(), header=T)age <- data6$Agerecog <- data6$R1total <- data6$Totaldf <- cbind(recog, nrecog=total-recog)model<- glm(df~age, family=binomial)plot(recog/total~age)points(age,model$fitted,pch=17,col="blue")lines(age, model$fitted, col="red", lwd=2)title(main="Brand Recognition Data: Logistic Regression Line")grid(nx=NULL, ny=NULL, lwd=2)

recog1 <- data6$R2df2 <- cbind(recog1, nrecog=total-recog1)model1<- glm(df2~age, family=binomial(logit))#plot(recog1/total ~ age, data = data6)points(age,model1$fitted,pch=15,col="brown")p2<-lines(age, model1$fitted, col="green", lwd=2)legend("topright", c("Pre", "Post"), fill=c("red","green"))

# you can also predict the outcome for a subject of a# What are the odds of this 22 year recognizing this brand? It gives the probability; derive the odds# given age for the two models

newdata <- data.frame(age=c(22))PredictMe<-predict(model,newdata, type="response" )PredictMesummary(model)

newdata <- data.frame(age=c(22))PredictMe<-predict(model1,newdata, type="response" )PredictMesummary(model1)

Page 10: Web viewLab3 Algorithms and Statistical Models for Routine Data Analytics November 20, 2014 ... ("BX.KLT.DINV.WD.GD.ZS", ... prob.spam.word = prob.word.spam

Exercise 5: You will need R version 3.1.2 for this exercise.Shiny package of R: taking it to the next step. Step 7 of data science, making a tool out of R code, sort of. It can also be thought of as developing web application out of R code. A web application has minimally (i) an user interface (UI) interacting with the user and (ii) a server performing the computation (server). These are indeed the components in Shiny-R application: ui.R, server.R.

We will write a simple Shiny application and understand its foundations. We will make use of Latitude/Longtitude problem with google API access to work on this exercise.

Open a New Projectcall it Lab3Shnycreate a script file save it as ui.R and add the content below to it and save.

library(shiny)

# Define UI for application that input lat&long or City# it also displays two maps one for displaying teh lat/long input# another map for city input, lat/long is obtained by accessing google api

shinyUI(fluidPage( # Application title titlePanel("Geography: Guess Lat/Long/City"), sidebarLayout( sidebarPanel( numericInput("lon", label = h3("Longtitude"), value = 1) , numericInput("lat", label = h3("Latitude"), value = 1),

textInput("city", label = h3("Enter City Name"), value="BUF")), mainPanel( plotOutput("map1"), plotOutput("map2") ))))

Create another script file called server.R, enter the code given below.This file will contain the R code for (i) obtaining the input from the UI, (ii) access the Google API to get the coordinates if a City is typed, (iii) plot the lat/long in case user chooses to enter the lat/long. To make this exercise interesting, we have presented as a game: Guess the lat-long? Or locate your city on the world map?

Page 11: Web viewLab3 Algorithms and Statistical Models for Routine Data Analytics November 20, 2014 ... ("BX.KLT.DINV.WD.GD.ZS", ... prob.spam.word = prob.word.spam

library(shiny)library("ggmap")library("maptools")library(maps)

shinyServer( function(input, output) { #plot on map1 output$map1 <- renderPlot({ map("world", fill=TRUE, col="white", bg="lightgreen", ylim=c(-60, 90), mar=c(0,0,0,0)) points(input$lon,input$lat, col="blue", cex=2, pch=17) }) #plot on map 2 output$map2 <- renderPlot({ visited <- input$city ll.visited <- geocode(visited) visit.x <- ll.visited$lon visit.y <- ll.visited$lat map("world", fill=TRUE, col="white", bg="lightblue", ylim=c(-60, 90), mar=c(0,0,0,0)) points(visit.x,visit.y, col="red", cex=2, pch=19) }) })

Save the two scripts in ui.R and server.R. And run the application using RunApp icon at the top-right corner of the script window.


Recommended