Date post: | 14-Jul-2015 |
Category: | Science |
Upload: | wesley-goi |
Exploratory Data Analysis
Wesley GOI
In today’s session
• Principles behind exploratory analyses
• Plotting data on popular exploratory graphs
• Plotting systems in R
  • Base (Week 1)
  • Lattice (Week 2)
  • ggplot2 (Week 2)
• Choosing and using graphics devices (the output formats)
Scripts can be downloaded at:
https://www.dropbox.com/s/ii1yj8f650d4l1q/lesson1.r?dl=0
https://www.dropbox.com/s/eme44h6lrhn775l/final.r?dl=0
Principles behind exploratory analyses
• Show comparisons
• Show causality, mechanism, explanation
• Show multivariate data
• Integrate multiple modes of evidence
• Describe and document the evidence
• Content is king
• SPEED
Dimensionality
• Five-number summary
• Boxplots
• Histograms
• Density plot
• Barplot
Multiple overlaid 1D plots
Scatter plots
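As a minimal sketch of the one-dimensional summaries listed above, base R's fivenum() returns the five-number summary (minimum, lower hinge, median, upper hinge, maximum), which is what a boxplot draws. The vector rain_mm is made-up illustration data, not part of the session's dataset.

```r
# Hypothetical rainfall readings, invented for illustration only
rain_mm <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)

# Five-number summary: minimum, lower hinge, median, upper hinge, maximum
five <- fivenum(rain_mm)
print(five)

# boxplot.stats() returns the values boxplot() draws;
# when there are no outliers they match fivenum()
box <- boxplot.stats(rain_mm)$stats
print(box)
```

The same numbers can then be read off the plot produced by boxplot(rain_mm).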
Downloading our dataset
R code
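# Download the example datasets and unpack them into exploring_data
download.file("http://www.bio.ic.ac.uk/research/mjcraw/therbook/data/therbook.zip", destfile = "data.zip")
dir.create("exploring_data")
unzip("data.zip", exdir = "exploring_data")
setwd("exploring_data")
# If the archive unpacks into a subfolder, setwd() into that subfolder instead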
Boxplots
R code
# header = TRUE: the first line of the file holds the column names
weather = read.table("SilwoodWeather.txt", header = TRUE)
onemonth = subset(weather, month == 1 & yr == 2004)
boxplot(onemonth$rain)
Histograms
R code
hist(weather$upper)
rug(weather$upper)  # adds a tick on the axis for each value
Barplot
R code
barplot(table(weather$month), col = "wheat", main = "Number of Observations in Months")
Graphics devices (grDevices)
Type | Raster | Vector | Vector
Format | PNG | PDF | SVG
File size | small | medium | medium
Scalable | No | Yes | Yes
Web friendly | Yes | No | Yes
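A minimal sketch of how a graphics device is used: open the device, draw, then close it with dev.off() so the file is written. The file names and the values vector are hypothetical; the PNG step is skipped when the R build has no PNG support.

```r
# File names are hypothetical and placed in a temporary directory
out_dir  <- tempdir()
png_file <- file.path(out_dir, "example_hist.png")
pdf_file <- file.path(out_dir, "example_hist.pdf")

values <- c(2, 3, 3, 4, 4, 4, 5, 5, 6)  # made-up data

# Raster output: size is given in pixels
if (capabilities("png")) {
  png(png_file, width = 480, height = 480)
  hist(values, main = "PNG device")
  invisible(dev.off())  # closing the device finishes the file
}

# Vector output: size is given in inches
pdf(pdf_file, width = 7, height = 7)
hist(values, main = "PDF device")
invisible(dev.off())
```

Nothing appears on screen while a file device is open; the plot goes to the file instead.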
Plotting Systems
Feature | Base | Lattice | Grid
Libraries | (none listed) | lattice | grid, gridExtra, ggplot2
Example functions | hist ✔, barplot ✔, boxplot ✔, plot | xyplot (scatterplots), bwplot (boxplots), levelplot | qplot, ggplot, geom
Faceted plots | Yes | Yes | Yes
Grammar of graphics | No | No | Yes
Interface with statistical functions | Yes | Partial | Partial + workarounds
Base and Grid graphics cannot be mixed.
Base plots: Scatterplot
R code
data1 = read.table("scatter1.txt", header = TRUE)
data2 = read.table("scatter2.txt", header = TRUE)
Base plots: Scatterplot
R code
data1 = read.table("scatter1.txt", header = TRUE)
data2 = read.table("scatter2.txt", header = TRUE)
# Color
with(data1, plot(xv, ys, col = "red"))
# Regression line
with(data1, abline(lm(ys ~ xv)))
Color
Base plots: Scatterplot
Set symbol to represent data point
Base plots: Scatterplot
R code
data1 = read.table("scatter1.txt", header = TRUE)
data2 = read.table("scatter2.txt", header = TRUE)
# Color
with(data1, plot(xv, ys, col = "red"))
with(data1, abline(lm(ys ~ xv)))
# Shape: pch selects the plotting symbol
with(data2, points(xv2, ys2, col = "blue", pch = 11))
Symbol shape
Base plots: Using par for multiple plots
R code
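# Set the layout before plotting: 1 row, 2 columns, with room for an outer title
par(mfrow = c(1, 2), oma = c(0, 0, 2, 0))
# Plot 1
with(data1, plot(xv, ys, col = "red"))
with(data1, abline(lm(ys ~ xv)))
# Plot 2
with(data2, plot(xv2, ys2, col = "blue", pch = 11))
# The outer title goes in the outer margin, above both plots
title("My Title", outer = TRUE)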
par: to set global settings
R code
# mar = inner margins, oma = outer margins (bottom, left, top, right)
par(mar = c(5.1, 4.1, 4.1, 2.1), oma = c(2, 2, 2, 2))
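Because par() changes global settings, it can help to save the old values and restore them afterwards. The sketch below assumes made-up x and y data and draws to a throwaway device so it also runs non-interactively.

```r
pdf(NULL)  # throwaway device: no file is written

# par() returns the previous values of the settings it changes
old_par <- par(mfrow = c(1, 2), oma = c(0, 0, 2, 0))

x <- 1:10  # made-up data
y <- x^2
plot(x, y, col = "red")
plot(x, sqrt(y), col = "blue", pch = 11)
title("My Title", outer = TRUE)

par(old_par)              # restore the previous layout and margins
restored <- par("mfrow")  # back to a single plot per page
invisible(dev.off())
```

In an interactive session the pdf(NULL) and dev.off() lines can be dropped so the plots appear on screen.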
Lattice
R code
productivity = read.table("productivity.txt", header = TRUE)
# Number of mammal species in forests of differing productivity
library(lattice)
# Plotting: the first argument is the formula (vertical ~ horizontal),
# the second is the data frame
xyplot(y ~ x, productivity,
xlab = list(label = "Productivity"),
ylab = list(label = "Mammal Species"))
Lattice
R code
# Formula with conditioning: "| f" reads "given f" and draws one panel per level of f
xyplot(y ~ x | f, productivity,
xlab = list(label = "Productivity"),
ylab = list(label = "Mammal Species"))
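For a version that runs without the session's files, here is a minimal sketch of the same formula-plus-conditioning idea on the built-in mtcars dataset. The dataset and axis labels are stand-ins, not the session's data.

```r
library(lattice)

# mtcars ships with R; it stands in for the session's data here
cars <- mtcars
cars$cyl <- factor(cars$cyl)  # the conditioning variable as a factor

# mpg on the vertical axis, wt on the horizontal axis, one panel per cyl level
p <- xyplot(mpg ~ wt | cyl, data = cars,
            xlab = list(label = "Weight (1000 lbs)"),
            ylab = list(label = "Miles per gallon"))
print(p)  # lattice plots are objects; print() draws them
```

Inside scripts and functions the explicit print() is needed, because lattice plots are only drawn automatically at the interactive prompt.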
ggplot2
• Grammar of graphics (gg)
• Based on the grid plotting system; cannot be mixed with base graphics
ggplot2.org
ggplot
Components
• Data & relationship
• GEOMetric Object
• Statistical transformation
• Scales
• Coordinate system
• Facetting
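A minimal sketch of how the six components can map onto a single ggplot call. It uses the built-in mtcars dataset as a stand-in for the session's data, and the particular geom, stat, scale and coordinate choices are only illustrative.

```r
library(ggplot2)

# mtcars ships with R; it stands in for the session's data here
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +                   # data and relationship
  geom_point() +                                              # geometric object
  stat_smooth(method = "lm", formula = y ~ x, se = FALSE) +   # statistical transformation
  scale_y_log10() +                                           # scale
  coord_cartesian(xlim = c(1, 6)) +                           # coordinate system
  facet_wrap(~ cyl)                                           # faceting: one panel per cyl value
```

Each component is added with +, so a plot can be built up and inspected one layer at a time; print(p) draws it.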
ggplot: layers of a plot
• Data
• Mapping
• Geometric objects (aka geoms)
• Coordinate system with respect to scales (log scale / sqrt / log ratio)
• Title, plot theme, etc.
ggplot
Components
• Data & relationship ✔
• GEOMetric Object
• Statistical transformation
• Scales
• Coordinate system
• Facetting
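R code
library(ggplot2)
# weather is the data.frame; aes() is the aesthetics function that maps the relationships
# Remember to change month into a factor so that each month gets its own box
ggplot(weather, aes(x = factor(month), y = upper)) +
geom_boxplot()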
ggplot
Components
• Data & relationship ✔
• GEOMetric Object ✔
• Statistical transformation ✔
• Scales
• Coordinate system
• Facetting
R code
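library(dplyr)  # provides %>%, group_by() and summarise()
# Statistical transformation: mean of upper for each month
weather2 = weather %>%
group_by(month) %>%
summarise(average.upper = mean(upper))
ggplot(weather2, aes(month, average.upper)) +
geom_bar(stat = "identity")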
ggplot
Components
• Data & relationship ✔
• GEOMetric Object ✔
• Statistical transformation ✔
• Scales ✔
• Coordinate system
• Facetting
R code
# fill must be a factor: scale_fill_brewer() is a discrete scale
plot2 = ggplot(weather2, aes(month, average.upper)) +
geom_bar(aes(fill = factor(month)), stat = "identity") +
scale_fill_brewer(palette = "Set3") +
xlab("Months") +
ylab("Upper Quantile") +
theme_bw()
ggplot
qplot
A separate function that wraps ggplot() for simpler syntax
R code
qplot(month, upper, fill=month, data=weather, facets = ~yr, geom="bar", stat="identity")
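Recent ggplot2 releases deprecate qplot(), so the same idea can also be written with ggplot() directly. The sketch below is one possible equivalent, not the session's own code; the weather_demo data frame is made up for illustration.

```r
library(ggplot2)

# Made-up monthly values for two years, invented for illustration only
weather_demo <- data.frame(
  yr    = rep(c(2003, 2004), each = 3),
  month = factor(rep(1:3, times = 2)),
  upper = c(7.1, 8.4, 11.0, 6.5, 9.2, 12.3)
)

p <- ggplot(weather_demo, aes(x = month, y = upper, fill = month)) +
  geom_col() +      # bars whose heights are the values themselves
  facet_wrap(~ yr)  # one panel per year, like facets = ~yr
```

geom_col() is shorthand for geom_bar(stat = "identity"); print(p) draws the plot.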
Ethos behind visualization
http://keylines.com/network-visualization
Final Challenge
R code
library(ggplot2)
library(dplyr)  # provides %>%, group_by() and summarise()
# Read in the data
data = read.csv("final.csv")
# Prepare the rectangle background: one rectangle per planning region
areas = unique(subset(data, select = c(Planning_Area, Planning_Region)))
areas = areas[order(areas$Planning_Region), ]
areas$rectid = 1:nrow(areas)
rectdata = areas %>%
group_by(Planning_Region) %>%
summarise(xstart = min(rectid) - 0.5, xend = max(rectid) + 0.5)
# Order the levels so that areas in the same region sit next to each other
data$Planning_Area = factor(data$Planning_Area, levels = as.character(areas[order(areas$Planning_Region), ]$Planning_Area))
Final Challenge
R code
# Plot
p0 =
ggplot(data, aes(Planning_Area, Unit_Price____psm_)) +
geom_boxplot(outlier.colour = NA) +
geom_rect(data = rectdata, aes(xmin = xstart, xmax = xend, ymin = -Inf, ymax = Inf,
fill = Planning_Region, group = Planning_Region), alpha = 0.4, inherit.aes = FALSE) +
geom_jitter(alpha = 0.40, aes(color = as.factor(Year))) +
scale_color_brewer("Year", palette = "RdBu") +
scale_fill_brewer(palette = "Set1", name = "Region") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1)) +
xlab("Planning Area") + ylab("Unit Price (PSM)")
# Save plot
ggsave("areaboxplots.pdf", plot = p0, width = 20, height = 10, units = "in", dpi = 300)
“Above all else show the data.” ― Edward R. Tufte, The Visual Display of Quantitative Information
Thank you for your time