05 subsetting

Hadley Wickham

Stat405

Subsetting & shortcuts

Tuesday, 7 September 2010

• Lectures 1-3: basic graphics

• Lectures 4-6: basic data handling

• Lectures 7-9: basic functions

• These are the absolutely most essential tools. The rest of the course is building your vocab and learning how to use them all together.

Roadmap


1. Character subsetting

2. Sorting

3. Shortcuts

4. Iteration

5. (Optional extra: command line tips)


Subsetting


Your turn

In pairs, try to recall the five types of subsetting we talked about last week.

You have one minute!


Type        Behaviour
blank       include all
integer     +ve: include; -ve: exclude
logical     include TRUEs
character   lookup by name
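As a quick refresher, the types above can be sketched on a small named vector. The vector `x` and its names are made up for illustration; they are not from the course data.

```r
# Hypothetical named vector, used only for illustration
x <- c(a = 10, b = 20, c = 30, d = 40)

all_x   <- x[]                              # blank: include all
pos_int <- x[c(1, 3)]                       # +ve integers: include positions 1 and 3
neg_int <- x[-c(1, 3)]                      # -ve integers: exclude positions 1 and 3
lgl     <- x[c(TRUE, FALSE, FALSE, TRUE)]   # logical: keep where the index is TRUE
chr     <- x[c("b", "d")]                   # character: look up by name
```

Positive and negative integers count as two separate types here, which gives five in total.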


# Matches by names
diamonds[1:5, c("carat", "cut", "color")]

# Useful technique: change labelling
c("Fair" = "C", "Good" = "B", "Very Good" = "B+", "Premium" = "A", "Ideal" = "A+")[diamonds$cut]

# Can also be used to collapse levels
table(c("Fair" = "C", "Good" = "B", "Very Good" = "B", "Premium" = "A", "Ideal" = "A")[diamonds$cut])

# (see ?cut for continuous to discrete equivalent)
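One caveat with the relabelling technique: R treats a factor index as its integer codes, not its labels, so the lookup is only truly "by name" when the index is a character vector. The sketch below uses a made-up character vector `cut_chr` as a stand-in for diamonds$cut and shows how the two can differ when the lookup table is not in the same order as the factor levels.

```r
# Hypothetical stand-in for diamonds$cut, deliberately not in level order
cut_chr <- c("Ideal", "Fair", "Good", "Premium", "Very Good", "Good")

grades <- c("Fair" = "C", "Good" = "B", "Very Good" = "B+",
            "Premium" = "A", "Ideal" = "A+")

# Character subsetting: each value is looked up by name
relabelled <- unname(grades[cut_chr])

# factor() sorts levels alphabetically here, so the integer codes no
# longer line up with the order of the lookup table
cut_fct <- factor(cut_chr)
by_code <- unname(grades[cut_fct])                 # indexes by integer code
by_name <- unname(grades[as.character(cut_fct)])   # indexes by label
```

Converting with as.character() first is the safer habit whenever the order of the lookup table might not match levels().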


x <- c(2, 4, 3, 1)
order(x)
# means: to get x in order, put 4th in 1st,
# 1st in 2nd, 3rd in 3rd and 2nd in 4th

x[order(x)]

# What does this do?
diamonds[order(diamonds$price), ]

Sorting a data frame


# Order by x, then y, then z
order(diamonds$x, diamonds$y, diamonds$z)

# Put in order of quality
order(diamonds$color, desc(diamonds$cut), desc(diamonds$clarity))

# desc sorts in descending order
# also found in the plyr package
x[order(x)]
x[order(desc(x))]
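If plyr is not available, a mixed ascending/descending sort can be sketched in base R alone. The data frame `df` below is a made-up stand-in for diamonds, and negating via xtfrm() is one possible base-R substitute for desc(), not necessarily how the course does it.

```r
# Hypothetical data frame standing in for diamonds
df <- data.frame(
  price = c(300, 500, 300, 400),
  carat = c(0.3, 0.7, 0.4, 0.5),
  cut   = factor(c("Good", "Ideal", "Fair", "Good"),
                 levels = c("Fair", "Good", "Ideal"))
)

# Ascending price, ties broken by descending carat (numeric: negate it)
by_price <- df[order(df$price, -df$carat), ]

# A factor cannot be negated directly; xtfrm() gives its integer ranks,
# which can then be negated to sort from the last level to the first
by_cut_desc <- df[order(-xtfrm(df$cut), df$price), ]
```

The row names of the result keep the original row numbers, which makes it easy to check where each row ended up.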


Reorder the mpg dataset from most to least efficient.

The fl variable gives the type of fuel (r = regular, d = diesel, p = premium, c = cng, e = ethanol). Modify fl to spell out the fuel type explicitly, collapsing c, d, and e into a single "other" category.

Your turn


Short cuts



You’ve been typing diamonds many, many times. The following shortcuts save typing, but may be a little harder to understand, and will not work in some situations. (Don’t forget the basics!)

Four specific to data frames, one more generic.


Function    Package
subset      base
summarise   plyr
transform   base
arrange     plyr

plyr is loaded automatically with ggplot2, or load it explicitly with library(plyr). base is always loaded automatically.


# subset: short cut for subsetting
zero_dim <- diamonds$x == 0 | diamonds$y == 0 | diamonds$z == 0
diamonds[zero_dim & !is.na(zero_dim), ]

subset(diamonds, x == 0 | y == 0 | z == 0)

# summarise/summarize: short cut for creating summary
biggest <- data.frame(
  price.max = max(diamonds$price),
  carat.max = max(diamonds$carat))

biggest <- summarise(diamonds,
  price.max = max(price),
  carat.max = max(carat))
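The `& !is.na(zero_dim)` guard in the long form matters whenever the condition can be NA: plain logical subsetting returns an all-NA row for each NA in the index, while subset() drops those rows. A minimal sketch on a made-up data frame `df` with one missing value:

```r
# Hypothetical data with a missing x value in row 3
df <- data.frame(x = c(0, 4.1, NA, 3.9), y = c(3.8, 0, 4.0, 4.2))

zero_dim <- df$x == 0 | df$y == 0    # TRUE, TRUE, NA, FALSE

unguarded  <- df[zero_dim, ]                       # includes an all-NA row
guarded    <- df[zero_dim & !is.na(zero_dim), ]    # NA treated as FALSE
via_subset <- subset(df, x == 0 | y == 0)          # same rows as guarded
```

Here unguarded has three rows (one of them entirely NA), while guarded and via_subset both contain only rows 1 and 2.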


# transform: short cut for adding new variables
diamonds$volume <- diamonds$x * diamonds$y * diamonds$z
diamonds$density <- diamonds$volume / diamonds$carat

diamonds <- transform(diamonds, volume = x * y * z)
diamonds <- transform(diamonds, density = volume / carat)

# arrange: short cut for reordering
diamonds <- diamonds[order(diamonds$price, desc(diamonds$carat)), ]

diamonds <- arrange(diamonds, price, desc(carat))
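The transform() example uses two separate calls for a reason: base transform() evaluates all of its arguments against the original data frame, so a variable created in a call cannot be used by another argument of the same call. A small sketch on a made-up data frame `df` (it assumes no object called `volume` exists in your workspace):

```r
# Hypothetical data frame standing in for diamonds
df <- data.frame(x = c(1, 2), y = c(3, 4), z = c(5, 6), carat = c(0.5, 1))

# Works: density is computed in a second call, after volume exists
step1 <- transform(df, volume = x * y * z)
step2 <- transform(step1, density = volume / carat)

# Fails: within a single call, `volume` is not yet a column of df
one_call <- try(
  transform(df, volume = x * y * z, density = volume / carat),
  silent = TRUE
)
failed <- inherits(one_call, "try-error")
```

If a variable named `volume` did happen to exist in the calling environment, the single call would silently use that instead, which is arguably worse than the error.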


# They all have similar syntax. The first argument
# is a data frame, and all other arguments are
# interpreted in the context of that data frame
# (so you don't need to use data$ all the time)

subset(df, subset)
transform(df, var1 = expr1, ...)
summarise(df, var1 = expr1, ...)
arrange(df, var1, ...)

# They all return a modified data frame. You still
# have to save that to a variable if you want to
# keep it


Use summarise, transform, subset and arrange to:

Find all diamonds bigger than 3 carats and order from most expensive to cheapest.

Add a new variable that estimates the diameter of the diamond (average of x and y).

Compute depth (z / diameter * 100) yourself. How does it compare to the depth in the data?

Your turn


Aside: never use attach!

Non-local effects; not symmetric; implicit, not explicit.

Makes it very easy to make mistakes.

Use with() instead:
with(bnames, table(year, length))


# with is more general. Use in concert with other
# functions, particularly those that don't have a data
# argument

diamonds$volume <- with(diamonds, x * y * z)

# This won't work:
with(diamonds, volume <- x * y * z)
# with only changes lookup, not assignment
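The point about assignment can be checked directly. In this sketch on a made-up data frame `df`, the assignment inside with() happens in a temporary environment built from the data frame, so `df` itself is left unchanged.

```r
# Hypothetical data frame standing in for diamonds
df <- data.frame(x = c(1, 2), y = c(3, 4), z = c(5, 6))

# Assignment inside with(): evaluated and then thrown away
with(df, volume <- x * y * z)
changed_by_with <- "volume" %in% names(df)   # df has no new column

# Assign the result of with() instead
df$volume <- with(df, x * y * z)
changed_by_assign <- "volume" %in% names(df)
```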


Iteration


The best data analyses tell a story, with a natural flow from beginning to end.

For homework, try to come up with three plots that tell a story.

Stories about a small sample of the data can work well.

Stories


qplot(x, y, data = diamonds)
qplot(x, z, data = diamonds)

# Start by fixing incorrect values

y_big <- diamonds$y > 10
z_big <- diamonds$z > 6

x_zero <- diamonds$x == 0
y_zero <- diamonds$y == 0
z_zero <- diamonds$z == 0

diamonds$x[x_zero] <- NA
diamonds$y[y_zero | y_big] <- NA
diamonds$z[z_zero | z_big] <- NA
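The same pattern (build a logical vector, then assign NA through it) can be tried on a tiny made-up data frame before running it on diamonds. The values and the thresholds of 10 and 6 below simply mirror the code above; the data frame `df` is hypothetical.

```r
# Hypothetical measurements with impossible zeros and implausibly big values
df <- data.frame(y = c(0, 4.2, 31.8, 5.1), z = c(2.5, 0, 3.1, 8.0))

y_bad <- df$y == 0 | df$y > 10
z_bad <- df$z == 0 | df$z > 6

# Logical subsetting on the left of <- replaces only the flagged values
df$y[y_bad] <- NA
df$z[z_bad] <- NA

n_missing <- colSums(is.na(df))   # how many values were blanked per column
```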


qplot(x, y, data = diamonds)
# How can I get rid of those outliers?

qplot(x, x - y, data = diamonds)
qplot(x - y, data = diamonds, binwidth = 0.01)
last_plot() + xlim(-0.5, 0.5)
last_plot() + xlim(-0.2, 0.2)

asym <- abs(diamonds$x - diamonds$y) > 0.2
diamonds_sym <- diamonds[!asym, ]

# Did it work?
qplot(x, y, data = diamonds_sym)
qplot(x, x - y, data = diamonds_sym)
# Something interesting is going on there!
qplot(x, x - y, data = diamonds_sym, geom = "bin2d", binwidth = c(0.1, 0.01))


# What about x and z?
qplot(x, z, data = diamonds_sym)
qplot(x, x - z, data = diamonds_sym)
# Subtracting doesn't work - z smaller than x and y
qplot(x, x / z, data = diamonds_sym)
# But better to log transform to make symmetrical
qplot(x, log10(x / z), data = diamonds_sym)

# and so on...


# How does symmetry relate to price?
qplot(abs(x - y), price, data = diamonds_sym) + geom_smooth()

qplot(abs(x - y), price, data = diamonds_sym, geom = "boxplot", group = round(abs(x - y) * 10))

diamonds_sym$sym <- zapsmall(abs(diamonds_sym$x - diamonds_sym$y))
qplot(sym, price, data = diamonds_sym, geom = "boxplot", group = sym)
# Are asymmetric diamonds worth more?

qplot(carat, price, data = diamonds_sym, colour = sym)
qplot(log10(carat), log10(price), data = diamonds_sym, colour = sym, group = sym) + geom_smooth(method = lm, se = F)


# Modelling

summary(lm(log10(price) ~ log10(carat) + sym, data = diamonds_sym))
# But statistical significance != practical
# significance

sd(diamonds_sym$sym, na.rm = T)
# [1] 0.02368828

# So a 1 sd increase in sym changes log10(price)
# by about -0.01 (= 0.024 * -0.44)
# 10 ^ -0.0104 = 0.976
# So a 1 sd increase in sym decreases price by ~2%
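The back-of-the-envelope calculation can be reproduced in a few lines. The two inputs are the numbers quoted above (the printed standard deviation and the coefficient of -0.44); the variable names are made up for this sketch, and the model itself is not refitted here.

```r
sd_sym   <- 0.02368828   # sd(diamonds_sym$sym), as printed above
coef_sym <- -0.44        # coefficient of sym, as quoted above

# Effect of a 1 sd increase in sym on log10(price)
delta_log10 <- sd_sym * coef_sym

# Back-transform: a change on the log10 scale is a ratio on the price scale
ratio      <- 10 ^ delta_log10
pct_change <- (ratio - 1) * 100
```

Because the response is log10(price), the effect is multiplicative: the ratio comes out at roughly 0.976, which is a decrease of a little over 2%.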


Command line


Provenance & reproducibility.

Working with remote servers.

Automation & scripting.

Common tools.

Why?


Basics

pwd: the location of the current directory

ls: the files in the current directory

cd: change to another directory

cd ..: change to parent directory

cd ~: change to home directory

mkdir: create a new directory


Your turn

Create a directory for stat405.

Inside that directory, create a directory for homework 2.

Confirm that there are no files in that directory.

Navigate back to your home directory. What other files are there?

