Hadley Wickham
Stat405: Subsetting & shortcuts
Tuesday, 7 September 2010
Roadmap
• Lectures 1-3: basic graphics
• Lectures 4-6: basic data handling
• Lectures 7-9: basic functions
• These are the most essential tools. The rest of the course is about building your vocabulary and learning how to use the tools together.
1. Character subsetting
2. Sorting
3. Shortcuts
4. Iteration
5. (Optional extra: command line tips)
Subsetting
Your turn
In pairs, try and recall the five types of subsetting we talked about last week.
You have one minute!
blank: include all
integer: +ve: include; -ve: exclude
logical: include TRUEs
character: lookup by name
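A minimal sketch of these index types on a small named vector. The vector `x` and its names are invented for illustration.

```r
# A small named vector, invented for illustration
x <- c(a = 10, b = 20, c = 30, d = 40)

all_x    <- x[]                             # blank: include all
included <- x[c(1, 3)]                      # positive integers: include positions 1 and 3
excluded <- x[-c(1, 3)]                     # negative integers: exclude positions 1 and 3
by_lgl   <- x[c(TRUE, FALSE, TRUE, FALSE)]  # logical: include where TRUE
by_test  <- x[x > 15]                       # logical vector built from a comparison
by_name  <- x[c("b", "d")]                  # character: look up by name
```

Positive and negative integers cannot be mixed in a single index, which is why they count as two separate types.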
# Matches by names
diamonds[1:5, c("carat", "cut", "color")]

# Useful technique: change labelling
# (cut is a factor, so convert it to character to look up by name)
c("Fair" = "C", "Good" = "B", "Very Good" = "B+", "Premium" = "A", "Ideal" = "A+")[as.character(diamonds$cut)]

# Can also be used to collapse levels
table(c("Fair" = "C", "Good" = "B", "Very Good" = "B", "Premium" = "A", "Ideal" = "A")[as.character(diamonds$cut)])

# (see ?cut for continuous to discrete equivalent)
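The lookup technique can be tried without the diamonds data. This is a small sketch on made-up values (`quality` and `lookup` are hypothetical names); it also shows why a factor should be converted to character before it is used as a name index.

```r
# Hypothetical quality labels, invented for illustration
quality <- c("Good", "Ideal", "Fair", "Ideal")
lookup  <- c("Fair" = "C", "Good" = "B", "Ideal" = "A")

# Character index: one output per input, matched by name
relabelled <- unname(lookup[quality])

# A factor index is treated as its integer codes, not its labels,
# so the result depends on the level order
f <- factor(quality, levels = c("Ideal", "Good", "Fair"))
by_code <- unname(lookup[f])                # uses codes 2, 1, 3, 1
by_name <- unname(lookup[as.character(f)])  # matches by name
```

When the names of the lookup vector happen to be in the same order as the factor levels, both forms give the same answer; converting to character removes that dependence.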
x <- c(2, 4, 3, 1)
order(x)
# means: to get x in order, put 4th in 1st,
# 1st in 2nd, 3rd in 3rd and 2nd in 4th
x[order(x)]

# What does this do?
diamonds[order(diamonds$price), ]
Sorting a data frame
# Order by x, then y, then z
order(diamonds$x, diamonds$y, diamonds$z)

# Put in order of quality
order(diamonds$color, desc(diamonds$cut), desc(diamonds$clarity))

# desc sorts in descending order
# also found in the plyr package
x[order(x)]
x[order(desc(x))]
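A minimal base-R sketch of sorting a data frame with order(), on a toy data frame invented for illustration. Negating a numeric column is used here as a stand-in for desc(); that substitution is an assumption of this sketch and only works for numeric columns.

```r
# Toy data frame, invented for illustration
df <- data.frame(
  g = c("b", "a", "b", "a"),
  v = c(1, 4, 3, 2)
)

positions   <- order(df$v)               # row positions in increasing order of v
by_v        <- df[order(df$v), ]         # rows sorted by v
by_g_then_v <- df[order(df$g, -df$v), ]  # by g, ties broken by descending v
```

Later arguments to order() are only consulted to break ties in the earlier ones.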
Your turn
Reorder the mpg dataset from most to least efficient.
The fl variable gives the type of fuel (r = regular, d = diesel, p = premium, c = cng, e = ethanol). Modify fl to spell out the fuel type explicitly, collapsing c, d, and e into a single "other" category.
Short cuts
You’ve been typing diamonds many, many times. The following shortcuts save typing, but may be a little harder to understand and will not work in some situations. (Don’t forget the basics!)
Four are specific to data frames; one is more generic.
Function Package
subset base
summarise plyr
transform base
arrange plyr
plyr is loaded automatically with ggplot2, or you can load it explicitly with library(plyr). base is always loaded automatically.
# subset: short cut for subsetting
zero_dim <- diamonds$x == 0 | diamonds$y == 0 | diamonds$z == 0
diamonds[zero_dim & !is.na(zero_dim), ]

subset(diamonds, x == 0 | y == 0 | z == 0)

# summarise/summarize: short cut for creating summary
biggest <- data.frame(
  price.max = max(diamonds$price),
  carat.max = max(diamonds$carat))

biggest <- summarise(diamonds,
  price.max = max(price),
  carat.max = max(carat))
# transform: short cut for adding new variables
diamonds$volume <- diamonds$x * diamonds$y * diamonds$z
diamonds$density <- diamonds$volume / diamonds$carat

diamonds <- transform(diamonds, volume = x * y * z)
diamonds <- transform(diamonds, density = volume / carat)

# arrange: short cut for reordering
diamonds <- diamonds[order(diamonds$price, desc(diamonds$carat)), ]

diamonds <- arrange(diamonds, price, desc(carat))
# They all have similar syntax. The first argument
# is a data frame, and all other arguments are
# interpreted in the context of that data frame
# (so you don't need to use data$ all the time)

subset(df, subset)
transform(df, var1 = expr1, ...)
summarise(df, var1 = expr1, ...)
arrange(df, var1, ...)

# They all return a modified data frame. You still
# have to save that to a variable if you want to
# keep it
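Both points (arguments are evaluated inside the data frame, and the result must be saved) can be checked on a toy data frame. This sketch uses only the two base functions, subset() and transform(), and invented data.

```r
# Toy data frame, invented for illustration
df <- data.frame(x = c(0, 2, 3), y = c(1, 0, 4), z = c(1, 2, 5))

# subset(): the condition is evaluated inside df, so no df$ is needed
manual   <- df[df$x == 0 | df$y == 0, ]
shortcut <- subset(df, x == 0 | y == 0)
same     <- isTRUE(all.equal(manual, shortcut))

# transform(): returns a modified copy; df itself is not changed
df2 <- transform(df, volume = x * y * z)
names(df)    # still x, y, z
names(df2)   # x, y, z, volume
```

Writing `df <- transform(df, ...)` is what makes the change stick.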
Your turn
Use summarise, transform, subset and arrange to:
Find all diamonds bigger than 3 carats and order from most expensive to cheapest.
Add a new variable that estimates the diameter of the diamond (average of x and y).
Compute depth (z / diameter * 100) yourself. How does it compare to the depth in the data?
Aside: never use attach!
Non-local effects; not symmetric; implicit, not explicit.
Makes it very easy to make mistakes.
Use with() instead:
with(bnames, table(year, length))
# with is more general. Use in concert with other
# functions, particularly those that don't have a data
# argument

diamonds$volume <- with(diamonds, x * y * z)

# This won't work:
with(diamonds, volume <- x * y * z)
# with only changes lookup, not assignment
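The difference between lookup and assignment can be seen on a toy data frame (invented for illustration):

```r
# Toy data frame, invented for illustration
df <- data.frame(x = c(1, 2), y = c(3, 4), z = c(5, 6))

# Assignment inside with() creates a temporary variable, not a column
with(df, volume <- x * y * z)
has_volume_before <- "volume" %in% names(df)   # FALSE: df is unchanged

# Assign the result of with() instead
df$volume <- with(df, x * y * z)
has_volume_after <- "volume" %in% names(df)    # TRUE
df$volume                                      # 15 48
```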
Iteration
Stories
The best data analyses tell a story, with a natural flow from beginning to end.
For homework, try to come up with three plots that tell a story.
Stories about a small sample of the data can work well.
qplot(x, y, data = diamonds)
qplot(x, z, data = diamonds)

# Start by fixing incorrect values

y_big <- diamonds$y > 10
z_big <- diamonds$z > 6

x_zero <- diamonds$x == 0
y_zero <- diamonds$y == 0
z_zero <- diamonds$z == 0

diamonds$x[x_zero] <- NA
diamonds$y[y_zero | y_big] <- NA
diamonds$z[z_zero | z_big] <- NA
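The same pattern, a logical vector marking implausible values followed by assignment of NA, can be sketched on a toy column. The values and the cutoff of 10 are made up to mirror the code above.

```r
# Toy measurements, invented for illustration; 0 and 58.9 are implausible
df <- data.frame(y = c(4.1, 0, 58.9, 5.2))

y_zero <- df$y == 0
y_big  <- df$y > 10
df$y[y_zero | y_big] <- NA

df$y                                # 4.1 NA NA 5.2
y_mean <- mean(df$y, na.rm = TRUE)  # summaries must now skip the NAs explicitly
```

Using NA rather than dropping the rows keeps the other variables of those observations available.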
qplot(x, y, data = diamonds)
# How can I get rid of those outliers?

qplot(x, x - y, data = diamonds)
qplot(x - y, data = diamonds, binwidth = 0.01)
last_plot() + xlim(-0.5, 0.5)
last_plot() + xlim(-0.2, 0.2)

asym <- abs(diamonds$x - diamonds$y) > 0.2
diamonds_sym <- diamonds[!asym, ]

# Did it work?
qplot(x, y, data = diamonds_sym)
qplot(x, x - y, data = diamonds_sym)
# Something interesting is going on there!
qplot(x, x - y, data = diamonds_sym, geom = "bin2d", binwidth = c(0.1, 0.01))
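One detail worth knowing when subsetting with a comparison like `asym`: if x or y is NA, the comparison is NA too, and an NA in a logical index produces a row of NAs rather than dropping the row. A small sketch on made-up data, using the same `& !is.na()` guard as the earlier subset example:

```r
# Toy data with one missing x, invented for illustration
df <- data.frame(x = c(4.0, NA, 5.0), y = c(4.05, 4.2, 5.5))

asym <- abs(df$x - df$y) > 0.2             # FALSE, NA, TRUE

kept_na   <- df[!asym, ]                   # the NA index yields a row of NAs
kept_safe <- df[!asym & !is.na(asym), ]    # rows with unknown asymmetry are dropped

nrow(kept_na)     # 2
nrow(kept_safe)   # 1
```

Whether the all-NA rows matter depends on what comes next; plotting functions typically drop them with a warning.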
# What about x and z?
qplot(x, z, data = diamonds_sym)
qplot(x, x - z, data = diamonds_sym)
# Subtracting doesn't work - z smaller than x and y
qplot(x, x / z, data = diamonds_sym)
# But better to log transform to make symmetrical
qplot(x, log10(x / z), data = diamonds_sym)

# and so on...
# How does symmetry relate to price?
qplot(abs(x - y), price, data = diamonds_sym) + geom_smooth()

qplot(abs(x - y), price, data = diamonds_sym, geom = "boxplot", group = round(abs(x - y) * 10))

diamonds_sym$sym <- zapsmall(abs(diamonds_sym$x - diamonds_sym$y))
qplot(sym, price, data = diamonds_sym, geom = "boxplot", group = sym)
# Are asymmetric diamonds worth more?

qplot(carat, price, data = diamonds_sym, colour = sym)
qplot(log10(carat), log10(price), data = diamonds_sym, colour = sym, group = sym) + geom_smooth(method = lm, se = FALSE)
# Modelling
summary(lm(log10(price) ~ log10(carat) + sym, data = diamonds_sym))
# But statistical significance != practical
# significance

sd(diamonds_sym$sym, na.rm = TRUE)
# [1] 0.02368828

# So a 1 sd increase in sym changes log10(price)
# by about -0.0104 (= 0.0237 * -0.44)
# 10 ^ -0.0104 = 0.976
# So a 1 sd increase in sym decreases price by ~2%
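The arithmetic behind that conclusion, as a short sketch. The standard deviation is the value printed above and -0.44 is the coefficient quoted on the slide; the variable names are invented here.

```r
sd_sym   <- 0.02368828   # sd of sym, as printed above
coef_sym <- -0.44        # coefficient of sym, as quoted on the slide

delta_log10 <- sd_sym * coef_sym   # change in log10(price) per 1 sd of sym
ratio       <- 10 ^ delta_log10    # multiplicative change in price

round(delta_log10, 4)         # -0.0104
round(ratio, 3)               # 0.976
round((1 - ratio) * 100, 1)   # 2.4 (percent decrease)
```

Because the model is fitted on log10(price), an additive change in the response corresponds to a multiplicative change in price, hence the power of 10.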
Command line
Why?
Provenance & reproducibility.
Working with remote servers.
Automation & scripting.
Common tools.
Basics
pwd: the location of the current directory
ls: the files in the current directory
cd: change to another directory
cd ..: change to parent directory
cd ~: change to home directory
mkdir: create a new directory
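For comparison, R has rough equivalents of these commands that work from the R console. This is a sketch that runs inside a temporary directory so it leaves nothing behind; the directory name `stat405-demo` is made up.

```r
old <- getwd()                                # like pwd
setwd(tempdir())                              # like cd <directory>
dir.create("stat405-demo")                    # like mkdir
created <- "stat405-demo" %in% list.files()   # list.files() is like ls
setwd("..")                                   # like cd ..
setwd(old)                                    # return to the starting directory
```

path.expand("~") gives the home directory, so setwd("~") plays the role of cd ~.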
Your turn
Create a directory for stat405.
Inside that directory, create a directory for homework 2.
Confirm that there are no files in that directory.
Navigate back to your home directory. What other files are there?