Exploratory Analysis Part1 Coursera DataScience Specialisation

Date post: 14-Jul-2015
Upload: wesley-goi
Page 1: Exploratory Analysis Part1 Coursera DataScience Specialisation

Exploratory Data Analysis

Wesley GOI

Page 2: Exploratory Analysis Part1 Coursera DataScience Specialisation

In today’s session

• Principles behind exploratory analyses

• Plotting data out on to popular exploratory graphs

• Plotting Systems in R

• Base (Week1)

• Lattice (Week2)

• GGPLOT2 (Week2)

• Choosing and using Graphic Devices aka the output formats

Scripts can be downloaded at:
https://www.dropbox.com/s/ii1yj8f650d4l1q/lesson1.r?dl=0
https://www.dropbox.com/s/eme44h6lrhn775l/final.r?dl=0

Page 3: Exploratory Analysis Part1 Coursera DataScience Specialisation

Principles behind exploratory analyses

• Show comparisons

• Show causality, mechanism, explanation

• Show multivariate data

• Integrate multiple modes of evidence

• Describe and document the evidence

• Content is king

• SPEED

Page 4: Exploratory Analysis Part1 Coursera DataScience Specialisation

Dimensionality

• Five-number summary

• Boxplots

• Histograms

• Density plot

• Barplot

Multiple-overlayed 1D plots

Scatter plots
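
The summaries and plots listed above can be tried on any numeric vector. A minimal sketch, assuming R's built-in airquality dataset as a stand-in (it is not the dataset used in this session):

```r
# Stand-in data: daily temperatures from R's built-in airquality dataset
temp <- airquality$Temp

# Five-number summary: minimum, lower hinge, median, upper hinge, maximum
five <- fivenum(temp)
print(five)

# One-dimensional plots of the same variable
boxplot(temp, main = "Boxplot")
hist(temp, main = "Histogram")
plot(density(temp), main = "Density plot")
barplot(table(airquality$Month), main = "Observations per month")

# Two dimensions: scatter plot of one variable against another
plot(airquality$Wind, airquality$Temp, xlab = "Wind", ylab = "Temp")
```

Each call draws a new plot, so in an interactive session they appear one after another.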

Page 5: Exploratory Analysis Part1 Coursera DataScience Specialisation

Downloading our dataset

R code

download.file("http://www.bio.ic.ac.uk/research/mjcraw/therbook/data/therbook.zip", destfile="data.zip")

unzip("data.zip")

dir.create("exploring_data")

setwd("exploring_data")

Page 6: Exploratory Analysis Part1 Coursera DataScience Specialisation

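Boxplots

R code

#h=T is short for header=TRUE: the first line of the file holds the column names
weather = read.table("SilwoodWeather.txt", header=TRUE)

#Keep only January 2004
onemonth = subset(weather, month==1 & yr==2004)

boxplot(onemonth$rain)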

Page 7: Exploratory Analysis Part1 Coursera DataScience Specialisation

Histograms

R code

hist(weather$upper)

rug(weather$upper) #ticks for each value

Page 8: Exploratory Analysis Part1 Coursera DataScience Specialisation

Barplot

R code

barplot(table(weather$month), col = "wheat", main = "Number of Observations in Months")

Page 9: Exploratory Analysis Part1 Coursera DataScience Specialisation

grDevices

Type           Raster   Vector   Vector
Format         PNG      PDF      SVG
Filesize       small    medium   medium
Scalable       No       Yes      Yes
Web friendly   Yes      No       Yes
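
Choosing a device comes down to opening it, plotting, and closing it again. A minimal sketch using the pdf() device from grDevices; the temporary file name and the airquality data are stand-ins chosen for illustration:

```r
# Write a plot to a vector device (PDF); tempfile() is used so the
# example does not clutter the working directory
out <- tempfile(fileext = ".pdf")

pdf(out, width = 7, height = 5)   # open the device (size in inches)
hist(airquality$Temp, main = "Temperature")
dev.off()                         # close the device to finish writing the file

file.exists(out)
```

png() and svg() are opened and closed the same way, although whether they are available depends on how R was built.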

Page 10: Exploratory Analysis Part1 Coursera DataScience Specialisation

Plotting Systems

                            Base                 Lattice                  Grid
Libraries                   -                    lattice                  grid, gridExtra, ggplot2
Example functions           hist ✔               xyplot (scatterplots)    qplot
                            barplot ✔            bwplot (boxplots)        ggplot
                            boxplot ✔            levelplot                geom
                            plot
Facetted plots              Yes                  Yes                      Yes
Grammar of graphics         No                   No                       Yes
Interface with statistical  Yes                  Partial                  Partial + Workarounds
functions

Cannot be mixed

Page 11: Exploratory Analysis Part1 Coursera DataScience Specialisation

Base plots: Scatterplot

R code

data1 = read.table("scatter1.txt", h=T)
data2 = read.table("scatter2.txt", h=T)

Page 12: Exploratory Analysis Part1 Coursera DataScience Specialisation

Base plots: Scatterplot

R code

data1 = read.table("scatter1.txt", h=T)
data2 = read.table("scatter2.txt", h=T)

#Color
with(data1, plot(xv, ys, col="red"))

#Regression Line
with(data1, abline(lm(ys~xv)))

Color

Page 13: Exploratory Analysis Part1 Coursera DataScience Specialisation

Base plots: Scatterplot

Set symbol to represent data point

Page 14: Exploratory Analysis Part1 Coursera DataScience Specialisation

Base plots: Scatterplot

R code

data1 = read.table("scatter1.txt", h=T)
data2 = read.table("scatter2.txt", h=T)

#Color
with(data1, plot(xv, ys, col="red"))
with(data1, abline(lm(ys~xv)))

#shape
with(data2, points(xv2, ys2, col="blue", pch=11))

Symbol shape
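
The scatter1.txt and scatter2.txt files are needed to run the code above. As a self-contained sketch of the same steps (colour, regression line, a second series with its own symbol), the example below uses made-up data; the data frames d1 and d2 and their values are assumptions for illustration only:

```r
set.seed(1)  # reproducible made-up data
d1 <- data.frame(xv = 1:30)
d1$ys <- 2 + 0.5 * d1$xv + rnorm(30)
d2 <- data.frame(xv2 = 1:30)
d2$ys2 <- 20 - 0.3 * d2$xv2 + rnorm(30)

# First dataset in red; ylim is widened so the second dataset fits as well
with(d1, plot(xv, ys, col = "red", ylim = range(c(d1$ys, d2$ys2))))

# Least-squares regression line for the first dataset
fit <- lm(ys ~ xv, data = d1)
abline(fit)

# Second dataset added to the existing plot with another colour and symbol (pch)
with(d2, points(xv2, ys2, col = "blue", pch = 11))
```

plot() starts a new figure, while abline() and points() only add to the figure that is already open.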

Page 16: Exploratory Analysis Part1 Coursera DataScience Specialisation

Base plots: Using par for multiple plots

R code

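#Two plots side by side; oma reserves outer-margin space for the overall title
par(mfrow=c(1,2), oma=c(0,0,2,0))

#Plot1
with(data1, plot(xv, ys, col="red"))
with(data1, abline(lm(ys~xv)))

#Plot2
with(data2, plot(xv2, ys2, col="blue", pch=11))

#Overall title, drawn in the outer margin once both plots exist
title("My Title", outer=TRUE)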

Page 17: Exploratory Analysis Part1 Coursera DataScience Specialisation

Par: To set global settings

R code

par(mar=c(5.1,4.1,4.1,2.1), oma=c(2,2,2,2))
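
A minimal sketch of how these global settings combine for a two-panel figure with a shared title. The airquality data and the variable names old and layout_used are stand-ins introduced here for illustration:

```r
# par() returns the previous values of the settings it changes,
# so they can be restored afterwards
old <- par(mfrow = c(1, 2),              # 1 row, 2 columns of plots
           mar = c(5.1, 4.1, 4.1, 2.1),  # margins around each plot (in lines)
           oma = c(0, 0, 2, 0))          # outer margin: 2 lines at the top
layout_used <- par("mfrow")

plot(airquality$Wind, airquality$Temp, col = "red",
     xlab = "Wind", ylab = "Temp")
plot(airquality$Solar.R, airquality$Temp, col = "blue", pch = 11,
     xlab = "Solar.R", ylab = "Temp")

# outer = TRUE places the title in the outer margin, above both panels
title("My Title", outer = TRUE)

par(old)  # restore the previous settings
```

Both mar and oma are given in the order bottom, left, top, right.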

Page 18: Exploratory Analysis Part1 Coursera DataScience Specialisation

Lattice

productivity = read.table("productivity.txt",h=T)

# of species in forest against differing productivity

library(lattice)

#plotting

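#The formula reads vertical ~ horizontal; the axis labels put Productivity (x) on the horizontal axis
xyplot( y~x, productivity,

xlab=list(label="Productivity"),

ylab=list(label="Mammal Species"))

R code

Formula: y~x

Data frame: productivity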

Page 19: Exploratory Analysis Part1 Coursera DataScience Specialisation
Page 20: Exploratory Analysis Part1 Coursera DataScience Specialisation

Lattice

productivity = read.table("productivity.txt",h=T)

# of species in forest against differing productivity

library(lattice)

#plotting

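#The formula reads vertical ~ horizontal; the axis labels put Productivity (x) on the horizontal axis
xyplot( y~x, productivity,

xlab=list(label="Productivity"),

ylab=list(label="Mammal Species"))

#"| f" means "given f": one panel per level of f
xyplot( y~x | f, productivity,

xlab=list(label="Productivity"),

ylab=list(label="Mammal Species"))

R code

Formula: y~x | f

Data frame: productivity

| reads as "given"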
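
A self-contained sketch of the "given" formula without productivity.txt. The data frame demo and its columns are made up for illustration:

```r
library(lattice)

set.seed(2)  # reproducible made-up data
# Three groups, each with its own relationship between x and y
demo <- data.frame(
  f = factor(rep(c("A", "B", "C"), each = 20)),
  x = rep(1:20, times = 3)
)
demo$y <- demo$x * rep(c(1, 0.5, -0.5), each = 20) + rnorm(60)

# y ~ x puts y on the vertical axis; "| f" draws one panel per level of f
p <- xyplot(y ~ x | f, data = demo,
            xlab = list(label = "x"),
            ylab = list(label = "y"))

print(p)  # lattice plots are objects; print() draws them (needed in scripts)
```

Unlike base graphics, the whole figure is described in a single call, and nothing is drawn until the object is printed.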

Page 21: Exploratory Analysis Part1 Coursera DataScience Specialisation
Page 22: Exploratory Analysis Part1 Coursera DataScience Specialisation

ggplot2

• Grammar of graphics (gg)

• Based on GRID plotting system, cannot be mixed with base

ggplot2.org

Page 23: Exploratory Analysis Part1 Coursera DataScience Specialisation

ggplot

Components

• Data & relationship

• GEOMetric Object

• Statistical transformation

• Scales

• Coordinate system

• Facetting

Page 24: Exploratory Analysis Part1 Coursera DataScience Specialisation

ggplot

Data

Page 25: Exploratory Analysis Part1 Coursera DataScience Specialisation

ggplot

Mapping

Page 26: Exploratory Analysis Part1 Coursera DataScience Specialisation

ggplot

Geometric objects aka Geoms

Coordinate system wrt scales

Log scale / sqrt / log ratio

Title

Plot Theme etc

Page 27: Exploratory Analysis Part1 Coursera DataScience Specialisation

ggplot

Geometric objects aka Geoms

Page 28: Exploratory Analysis Part1 Coursera DataScience Specialisation

ggplot

Components

• Data & relationship ✔

• GEOMetric Object

• Statistical transformation

• Scales

• Coordinate system

• Facetting

R code

#Remember to change month into a factor
weather$month = factor(weather$month)

#weather is the data.frame; aes() is the aesthetics function which maps the relationships
ggplot(weather, aes(x=month, y=upper))+

geom_boxplot()

Page 29: Exploratory Analysis Part1 Coursera DataScience Specialisation

ggplot

Components

• Data & relationship ✔

• GEOMetric Object ✔

• Statistical transformation ✔

• Scales

• Coordinate system

• Facetting

R code

library(dplyr) #provides %>%, group_by and summarise

weather2 = weather %>%
  group_by(month) %>%
  summarise(average.upper = mean(upper))

ggplot(weather2, aes(month, average.upper))+
  geom_bar(stat="identity")

Page 31: Exploratory Analysis Part1 Coursera DataScience Specialisation

ggplot

Components

• Data & relationship ✔

• GEOMetric Object ✔

• Statistical transformation ✔

• Scales ✔

• Coordinate system

• Facetting

R code

plot2 = ggplot(weather2, aes(month, average.upper))+
  geom_bar(aes(fill=month),stat="identity")+
  scale_fill_brewer(palette="Set3")+
  xlab("Months")+
  ylab("Upper Quantile")+
  theme_bw()
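
A self-contained sketch of the same layering (data and mapping, geom, identity stat, fill scale, labels, theme) that runs without SilwoodWeather.txt. The built-in airquality dataset and the name monthly are stand-ins, and base aggregate() is used here in place of the dplyr pipeline:

```r
library(ggplot2)

# Stand-in data: mean temperature per month from the built-in airquality dataset
monthly <- aggregate(Temp ~ Month, data = airquality, FUN = mean)
monthly$Month <- factor(monthly$Month)  # a brewer fill scale needs a discrete variable

p <- ggplot(monthly, aes(Month, Temp)) +            # data and mapping
  geom_bar(aes(fill = Month), stat = "identity") +  # bar height is the value itself
  scale_fill_brewer(palette = "Set3") +             # scale controlling the fill colours
  xlab("Month") +
  ylab("Mean temperature") +
  theme_bw()

print(p)  # the plot is an object; print() draws it
```

Each + adds one component, so a layer can be swapped or dropped without touching the rest of the plot.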

Page 33: Exploratory Analysis Part1 Coursera DataScience Specialisation

ggplot

Page 34: Exploratory Analysis Part1 Coursera DataScience Specialisation

qplot

A separate function which wraps ggplot, for simpler syntax

R code

qplot(month, upper, fill=month, data=weather, facets = ~yr, geom="bar", stat="identity")

Page 35: Exploratory Analysis Part1 Coursera DataScience Specialisation

Ethos behind visualization

http://keylines.com/network-visualization

Page 36: Exploratory Analysis Part1 Coursera DataScience Specialisation

Final Challenge

Page 37: Exploratory Analysis Part1 Coursera DataScience Specialisation

Final Challenge

R code

library(ggplot2)
library(dplyr) #provides %>%, group_by and summarise

#Reads in data
data = read.csv("final.csv")

#Preparing for the rectangle background
areas = unique(subset(data, select=c(Planning_Area,Planning_Region)))
areas = areas[order(areas$Planning_Region),]
areas$rectid = 1:nrow(areas)
rectdata = areas %>%
  group_by(Planning_Region) %>%
  summarise(xstart=min(rectid)-0.5, xend=max(rectid)+0.5)

#Order the levels
data$Planning_Area = factor(data$Planning_Area, levels=as.character(areas[order(areas$Planning_Region),]$Planning_Area))

Page 38: Exploratory Analysis Part1 Coursera DataScience Specialisation

Final challenge

R code

#Plot

p0 =

ggplot(data, aes(Planning_Area, Unit_Price____psm_))+

geom_boxplot(outlier.colour=NA)+

geom_rect(data=rectdata,aes(xmin=xstart,xmax=xend,ymin = -Inf, ymax = Inf, fill =

Planning_Region,group=Planning_Region), alpha = 0.4,inherit.aes=F)+

geom_jitter(alpha=0.40, aes(color=as.factor(Year)))+

scale_color_brewer("Year", palette='RdBu')+

scale_fill_brewer(palette="Set1",name='Region')+

theme_minimal()+

theme(axis.text.x = element_text(angle=45, hjust=1, vjust=1))+

xlab("Planning Area")+ylab("Unit Price (PSM)")

#Save plot

ggsave("areaboxplots.pdf", plot=p0, width=20, height=10, units="in", dpi=300)

Page 39: Exploratory Analysis Part1 Coursera DataScience Specialisation

“Above all else show the data.” ― Edward R. Tufte, The Visual Display of Quantitative Information

Thank you for your time

Page 40: Exploratory Analysis Part1 Coursera DataScience Specialisation

gridExtra

