Date post: 13-May-2015
Hadley Wickham Stat405 Subsetting & shortcuts Tuesday, 7 September 2010

• Lectures 1-3: basic graphics

• Lectures 4-6: basic data handling

• Lectures 7-9: basic functions

• These are the most essential tools. The rest of the course is about building your vocabulary and learning how to use the tools together.



1. Character subsetting

2. Sorting

3. Shortcuts

4. Iteration

5. (Optional extra: command line tips)


Your turn

In pairs, try and recall the five types of subsetting we talked about last week.

You have one minute!


Positive integers: include

Negative integers: exclude

Character: look up by name

Logical: include TRUEs

Blank: include all
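
The five types can be sketched on a small named vector. The vector `x` and its values are made up for illustration; only base R is assumed.

```r
# Toy named vector (hypothetical data, for illustration only)
x <- c(a = 10, b = 20, c = 30, d = 40)

pos        <- x[c(1, 3)]      # positive integers: keep elements 1 and 3
neg        <- x[-c(1, 3)]     # negative integers: drop elements 1 and 3
by_name    <- x[c("b", "d")]  # character: look up by name
by_test    <- x[x > 15]       # logical: keep elements where the test is TRUE
everything <- x[]             # blank: keep all elements
```

The same five forms work in each position of `df[rows, cols]` for a data frame.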


# Matches by names
diamonds[1:5, c("carat", "cut", "color")]

# Useful technique: change labelling
c("Fair" = "C", "Good" = "B", "Very Good" = "B+", "Premium" = "A", "Ideal" = "A+")[diamonds$cut]

# Can also be used to collapse levels
table(c("Fair" = "C", "Good" = "B", "Very Good" = "B", "Premium" = "A", "Ideal" = "A")[diamonds$cut])

# (see ?cut for continuous to discrete equivalent)
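
A minimal sketch of the lookup technique on toy data (the `grade` values and the `lookup` vector are hypothetical). It also shows one caveat: a factor used as an index is treated as its integer codes, so indexing by a factor only gives the intended labels when the lookup vector lists its entries in the same order as the factor's levels. Converting with `as.character()` forces a true lookup by name.

```r
# Hypothetical labels stored as a character vector
grade  <- c("Fair", "Ideal", "Good", "Ideal")
lookup <- c("Fair" = "C", "Good" = "B", "Ideal" = "A")

# Character subsetting: each element of grade is looked up by name
relabelled <- unname(lookup[grade])

# The same data as a factor whose levels are in a different order
grade_f <- factor(grade, levels = c("Ideal", "Good", "Fair"))

by_code <- unname(lookup[grade_f])                # uses integer codes 3, 1, 2, 1
by_name <- unname(lookup[as.character(grade_f)])  # uses the level names
```

Here `by_name` matches `relabelled`, while `by_code` does not, because position 1 of `lookup` is "Fair" but level 1 of `grade_f` is "Ideal".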


Sorting a data frame

x <- c(2, 4, 3, 1)
order(x)
# means: to get x in order, put 4th in 1st,
# 1st in 2nd, 3rd in 3rd and 2nd in 4th

# What does this do?
diamonds[order(diamonds$price), ]
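
To make the connection between `order()` and sorting concrete, here is a small sketch. The data frame `df` and its columns are invented for illustration.

```r
x <- c(2, 4, 3, 1)
idx <- order(x)     # positions that put x in increasing order
sorted <- x[idx]    # integer subsetting with those positions sorts x

# The same idea applied to the rows of a (hypothetical) data frame
df <- data.frame(
  id    = c("a", "b", "c", "d"),
  price = c(300, 100, 400, 200),
  stringsAsFactors = FALSE
)
df_sorted <- df[order(df$price), ]  # rows rearranged from cheapest to dearest
```

Because `order()` returns row positions rather than sorted values, every column of the data frame is rearranged consistently.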


# Order by x, then y, then z
order(diamonds$x, diamonds$y, diamonds$z)

# Put in order of quality
order(diamonds$color, desc(diamonds$cut), desc(diamonds$clarity))

# desc sorts in descending order
# also found in the plyr package
x[order(x)]
x[order(desc(x))]
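
If plyr is not available, a mixed ascending/descending sort can be sketched in base R. This is an alternative to `desc()`, not a description of how plyr implements it, and the data frame below is hypothetical.

```r
# Hypothetical data: sort by colour ascending, then carat descending
df <- data.frame(
  colour = c("E", "D", "E", "D"),
  carat  = c(0.5, 0.3, 0.9, 0.7),
  stringsAsFactors = FALSE
)

# Negating a numeric key reverses its sort direction
idx <- order(df$colour, -df$carat)
df_sorted <- df[idx, ]

# For a single key, decreasing = TRUE does the same job
carat_desc <- df$carat[order(df$carat, decreasing = TRUE)]
```

For non-numeric keys, `-xtfrm(key)` can play the role of the negated numeric key.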


Your turn

Reorder the mpg dataset from most to least efficient.

The fl variable gives the type of fuel (r = regular, d = diesel, p = premium, c = cng, e = ethanol). Modify fl to spell out the fuel type explicitly, collapsing c, d, and e into a single "other" category.


Short cuts


You’ve been typing diamonds many, many times. The following shortcuts save typing, but they may be a little harder to understand and will not work in some situations. (Don’t forget the basics!)

Four specific to data frames, one more generic.


Function    Package
subset      base
summarise   plyr
transform   base
arrange     plyr

plyr is loaded automatically with ggplot2, or load it explicitly with library(plyr). base is always loaded automatically.


# subset: short cut for subsetting
zero_dim <- diamonds$x == 0 | diamonds$y == 0 | diamonds$z == 0
diamonds[zero_dim & !is.na(zero_dim), ]

subset(diamonds, x == 0 | y == 0 | z == 0)

# summarise/summarize: short cut for creating summary
biggest <- data.frame(
  price.max = max(diamonds$price),
  carat.max = max(diamonds$carat))

biggest <- summarise(diamonds,
  price.max = max(price),
  carat.max = max(carat))
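
The `!is.na(zero_dim)` term matters because a logical index containing NA produces a row of NAs. A small sketch on made-up data (the data frame `df` is hypothetical) shows the difference and that `subset()` drops those rows for you.

```r
# Hypothetical data with one missing x value
df <- data.frame(x = c(0, 1.2, NA, 3.4), y = c(2, 0, 5, 6))

zero_dim <- df$x == 0 | df$y == 0   # TRUE, TRUE, NA, FALSE

raw_rows    <- df[zero_dim, ]                      # NA in the index adds a row of NAs
clean_rows  <- df[zero_dim & !is.na(zero_dim), ]   # keep only rows that are definitely TRUE
subset_rows <- subset(df, x == 0 | y == 0)         # subset() treats NA as "do not keep"
```

So the one-line `subset()` call corresponds to the two-line bracket version, NA handling included.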


# transform: short cut for adding new variables
diamonds$volume <- diamonds$x * diamonds$y * diamonds$z
diamonds$density <- diamonds$volume / diamonds$carat

diamonds <- transform(diamonds, volume = x * y * z)
diamonds <- transform(diamonds, density = volume / carat)

# arrange: short cut for reordering
diamonds <- diamonds[order(diamonds$price, desc(diamonds$carat)), ]

diamonds <- arrange(diamonds, price, desc(carat))
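
The two separate `transform()` calls are deliberate: `transform()` evaluates all of its expressions against the data frame as it was passed in, so a column created in a call is not visible to other expressions in that same call. A sketch on hypothetical data:

```r
# Hypothetical measurements
df <- data.frame(x = c(1, 2), y = c(3, 4), z = c(5, 6), carat = c(0.5, 1))

# Step 1: add volume; step 2 can then refer to it
df <- transform(df, volume = x * y * z)
df <- transform(df, density = volume / carat)

# Combining both in one call, e.g.
#   transform(df, volume = x * y * z, density = volume / carat)
# would look for `volume` before it has been added to the data frame.
```

The result is the same as the `$` version, with less repetition of the data frame's name.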


# They all have similar syntax. The first argument
# is a data frame, and all other arguments are
# interpreted in the context of that data frame
# (so you don't need to use data$ all the time)

subset(df, subset)
transform(df, var1 = expr1, ...)
summarise(df, var1 = expr1, ...)
arrange(df, var1, ...)

# They all return a modified data frame. You still
# have to save that to a variable if you want to
# keep it


Your turn

Use summarise, transform, subset and arrange to:

Find all diamonds bigger than 3 carats and order from most expensive to cheapest.

Add a new variable that estimates the diameter of the diamond (average of x and y).

Compute depth (z / diameter * 100) yourself. How does it compare to the depth in the data?


Aside: never use attach!

Non-local effects; not symmetric; implicit, not explicit.

Makes it very easy to make mistakes.

Use with() instead:
with(bnames, table(year, length))


# with is more general. Use in concert with other
# functions, particularly those that don't have a data
# argument

diamonds$volume <- with(diamonds, x * y * z)

# This won't work:
with(diamonds, volume <- x * y * z)
# with only changes lookup, not assignment
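
A short sketch on hypothetical data makes the lookup-versus-assignment point checkable: the assignment inside `with()` happens in a temporary environment, so the data frame itself is unchanged.

```r
# Hypothetical measurements
df <- data.frame(x = c(1, 2), y = c(3, 4), z = c(5, 6))

# Assignment inside with() does not add a column to df
with(df, volume <- x * y * z)
has_volume <- "volume" %in% names(df)   # still FALSE at this point

# Assign the value that with() returns instead
df$volume <- with(df, x * y * z)
```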


The best data analyses tell a story, with a natural flow from beginning to end.

For homework, try to come up with three plots that tell a story.

Stories about a small sample of the data can work well.



qplot(x, y, data = diamonds)
qplot(x, z, data = diamonds)

# Start by fixing incorrect values

y_big <- diamonds$y > 10
z_big <- diamonds$z > 6

x_zero <- diamonds$x == 0
y_zero <- diamonds$y == 0
z_zero <- diamonds$z == 0

diamonds$x[x_zero] <- NA
diamonds$y[y_zero | y_big] <- NA
diamonds$z[z_zero | z_big] <- NA
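
The pattern here is logical subsetting on the left-hand side of an assignment: flag the suspect values, then overwrite only those. A sketch on made-up measurements (the data frame `dims` and its values are hypothetical; the threshold of 10 follows the slide):

```r
# Hypothetical measurements with one impossible zero and one implausible value
dims <- data.frame(x = c(4.1, 0, 5.2), y = c(4.0, 3.9, 58.9))

x_zero <- dims$x == 0    # flag zero-length dimensions
y_big  <- dims$y > 10    # flag implausibly large values

# Replace only the flagged values; every row is kept
dims$x[x_zero] <- NA
dims$y[y_big]  <- NA
```

Setting values to NA rather than deleting rows keeps the other, valid measurements for those rows.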


qplot(x, y, data = diamonds)
# How can I get rid of those outliers?

qplot(x, x - y, data = diamonds)
qplot(x - y, data = diamonds, binwidth = 0.01)
last_plot() + xlim(-0.5, 0.5)
last_plot() + xlim(-0.2, 0.2)

asym <- abs(diamonds$x - diamonds$y) > 0.2
diamonds_sym <- diamonds[!asym, ]

# Did it work?
qplot(x, y, data = diamonds_sym)
qplot(x, x - y, data = diamonds_sym)
# Something interesting is going on there!
qplot(x, x - y, data = diamonds_sym, geom = "bin2d", binwidth = c(0.1, 0.01))


# What about x and z?
qplot(x, z, data = diamonds_sym)
qplot(x, x - z, data = diamonds_sym)
# Subtracting doesn't work - z smaller than x and y
qplot(x, x / z, data = diamonds_sym)
# But better to log transform to make symmetrical
qplot(x, log10(x / z), data = diamonds_sym)

# and so on...


# How does symmetry relate to price?
qplot(abs(x - y), price, data = diamonds_sym) + geom_smooth()

qplot(abs(x - y), price, data = diamonds_sym, geom = "boxplot", group = round(abs(x - y) * 10))

diamonds_sym$sym <- zapsmall(abs(diamonds_sym$x - diamonds_sym$y))
qplot(sym, price, data = diamonds_sym, geom = "boxplot", group = sym)
# Are asymmetric diamonds worth more?

qplot(carat, price, data = diamonds_sym, colour = sym)
qplot(log10(carat), log10(price), data = diamonds_sym, colour = sym, group = sym) + geom_smooth(method = lm, se = FALSE)


# Modelling

summary(lm(log10(price) ~ log10(carat) + sym, data = diamonds_sym))
# But statistical significance != practical
# significance

sd(diamonds_sym$sym, na.rm = TRUE)
# [1] 0.02368828

# So a 1 sd increase in sym changes log10(price)
# by about -0.0104 (= 0.0237 * -0.44)
# 10 ^ -0.0104 = 0.976
# So 1 sd increase in sym decreases price by ~2%
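
The back-of-the-envelope calculation can be reproduced directly. The two inputs are the numbers quoted on the slide (the standard deviation of sym and the sym coefficient of -0.44); the variable names are made up for this sketch.

```r
sd_sym   <- 0.02368828   # sd of sym, as printed on the slide
coef_sym <- -0.44        # coefficient of sym, as quoted on the slide

# Effect of a one-sd increase in sym on the log10(price) scale
delta_log10 <- sd_sym * coef_sym

# Back-transform: a difference in log10 is a ratio on the original scale
price_ratio <- 10 ^ delta_log10
pct_change  <- (price_ratio - 1) * 100   # percentage change in price
```

Because the model is fitted to log10(price), the coefficient translates to a multiplicative effect on price, which is why the result is read as a percentage change of roughly 2%.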


Command line

Provenance & reproducibility.

Working with remote servers.

Automation & scripting.

Common tools.



Basics

pwd: the location of the current directory

ls: the files in the current directory

cd: change to another directory

cd ..: change to parent directory

cd ~: change to home directory

mkdir: create a new directory


Your turn

Create a directory for stat405.

Inside that directory, create a directory for homework 2.

Confirm that there are no files in that directory.

Navigate back to your home directory. What other files are there?

