
Importing and visualization data

Importing data

Currently, one of the most common ways of storing and sharing data for analysis is through electronic spreadsheets. A spreadsheet stores data in rows and columns. It is basically a file version of a data frame. When saving such a table to a computer file, one needs a way to define when a new row or column ends and the other begins. This in turn defines the cells in which single values are stored.

When creating spreadsheets with text files, like the ones created with a simple text editor, a new row is defined with return and columns are separated with some predefined special character. The most common characters are comma (,), semicolon (;), space ( ), and tab (a preset number of spaces or \t).
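To make the role of the delimiter concrete, here is a minimal sketch using only base R. The small table and the temporary file names are made up for illustration; they are not part of any package.

```r
# A tiny, made-up table used only for this illustration
tab <- data.frame(state = c("Alabama", "Alaska"),
                  total = c(135, 19))

# Write the same table twice, once per delimiter
csv_file <- tempfile(fileext = ".csv")
tsv_file <- tempfile(fileext = ".tsv")
write.table(tab, csv_file, sep = ",",  row.names = FALSE, quote = FALSE)
write.table(tab, tsv_file, sep = "\t", row.names = FALSE, quote = FALSE)

# The raw lines differ only in the character separating the columns
csv_lines <- readLines(csv_file)
tsv_lines <- readLines(tsv_file)
csv_lines
tsv_lines

# Reading each file with the matching delimiter recovers the same table
from_csv <- read.table(csv_file, sep = ",",  header = TRUE)
from_tsv <- read.table(tsv_file, sep = "\t", header = TRUE)
identical(from_csv, from_tsv)
```

Reading a file with the wrong delimiter usually does not fail; it typically produces a single column holding whole lines, which is why it helps to look at the raw lines first.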

When we read in data from a spreadsheet, it is important to know whether the file has a header. Most reading functions assume there is a header. In RStudio, we can check whether the file has a header by either opening the file in the editor or navigating to the file location, double clicking on the file, and hitting View File.

However, not all spreadsheet files are in a text format. Google Sheets, which are rendered on a browser, are an example. Another example is the proprietary format used by Microsoft Excel.

This section describes the difference between text (ASCII), Unicode, and binary files and how this affects how to import them. It then explains the concepts of file paths and working directories, which are essential to understand how to import data effectively. It also introduces the readr and readxl packages and the functions that are available to import spreadsheets into R. Finally, it provides some recommendations on how to store and organize data in files. More complex challenges, such as extracting data from web pages or PDF documents, are left for the Data Wrangling part of the book.

Paths and the working directory

The first step when importing data from a spreadsheet is to locate the file containing the data. Although we do not recommend it, you can use an approach similar to what you do to open files in Microsoft Excel by clicking on the RStudio “File” menu, clicking “Import Dataset”, then clicking through folders until you find the file.

A spreadsheet containing the US murders data is included as part of the dslabs package. Finding this file is not straightforward, but the following lines of code copy the file to the folder in which R looks by default. We explain how these lines work below.

filename <- "murders.csv"
dir <- system.file("extdata", package = "dslabs")
fullpath <- file.path(dir, filename)
file.copy(fullpath, "murders.csv")

This code does not read the data into R; it just copies a file. But once the file is copied, we can import the data with a simple line of code. Here, we use the read_csv function from the readr package, which is part of the tidyverse.

library(tidyverse)
dat <- read_csv(filename)

The data is imported and stored in dat.

The filesystem

You can think of your computer’s filesystem as a series of nested folders, each containing other folders and files. Data scientists refer to folders as directories. We refer to the folder that contains all other folders as the root directory. We refer to the directory in which we are currently located as the working directory. The working directory therefore changes as you move through folders: think of it as your current location.

Relative and full paths

The path of a file is a list of directory names that can be thought of as instructions on what folders to click on, and in what order, to find the file. If these instructions are for finding the file from the root directory we refer to it as the full path. If the instructions are for finding the file starting in the working directory we refer to it as a relative path.

To see an example of a full path on your system type the following:

system.file(package = "dslabs")

The strings separated by slashes are the directory names. The first slash represents the root directory and we know this is a full path because it starts with a slash. If the first directory name appears without a slash in front, then the path is assumed to be relative. We can use the function list.files to see examples of relative paths.

dir <- system.file(package = "dslabs")
list.files(path = dir)
#> [1] "data"        "DESCRIPTION" "extdata"     "help"
#> [5] "html"        "INDEX"       "Meta"        "NAMESPACE"
#> [9] "R"           "script"

These relative paths give us the location of the files or directories if we start in the directory with the full path. For example, the full path to the help directory in the example above is /Library/Frameworks/R.framework/Versions/3.5/Resources/library/dslabs/help.

Note: You will probably not make much use of the system.file function in your day-to-day data analysis work. We introduce it in this section because it facilitates the sharing of spreadsheets by including them in the dslabs package.

The working directory

We highly recommend only writing relative paths in your code. You can get the full path of your working directory without writing it out explicitly by using the getwd function.

wd <- getwd()
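As a rough illustration of how a relative path is resolved against the working directory, the sketch below works inside a scratch folder under the session's temporary directory. The folder name path_demo and the file name example.csv are made up for this example.

```r
# Remember where we started so we can restore it at the end
old_wd <- getwd()

# Work inside a scratch directory (a hypothetical location for this example)
scratch <- file.path(tempdir(), "path_demo")
dir.create(scratch, showWarnings = FALSE)
setwd(scratch)

# Create an empty file, referring to it by a relative path
file.create("example.csv")

# The full path is the working directory followed by the relative path
full <- file.path(getwd(), "example.csv")

# Both paths point to the same file
same_file <- normalizePath("example.csv") == normalizePath(full)
same_file

# Go back to the original working directory
setwd(old_wd)
```

Changing the working directory inside a script, as done here for demonstration, is usually best avoided in real analyses; the point is only that the relative path means something different depending on where you are.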

Generating path names

Another example of obtaining a full path without writing it out explicitly was given above when we created the object fullpath like this:

filename <- "murders.csv"
dir <- system.file("extdata", package = "dslabs")
fullpath <- file.path(dir, filename)

The function system.file provides the full path of the folder containing all the files and directories relevant to the package specified by the package argument. By exploring the directories in dir we find that the extdata directory contains the file we want:

dir <- system.file(package = "dslabs")
filename %in% list.files(file.path(dir, "extdata"))
#> [1] TRUE

The system.file function permits us to provide a subdirectory as a first argument, so we can obtain the full path of the extdata directory like this:

dir <- system.file("extdata", package = "dslabs")

The function file.path is used to combine directory names to produce the full path of the file we want to import.

fullpath <- file.path(dir, filename)

Copying files using paths

The final line of code we used to copy the file into our working directory used the function file.copy. This function takes two arguments: the file to copy and the name to give it in the new directory.

file.copy(fullpath, "murders.csv")
#> [1] TRUE

If a file is copied successfully, the file.copy function returns TRUE. Note that we are giving the file the same name, murders.csv, but we could have named it anything. Also note that by not starting the string with a slash, R assumes this is a relative path and copies the file to the working directory.

You should be able to see the file in your working directory and can check by using:

list.files()
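One detail worth knowing is what file.copy returns when the destination already exists. The sketch below uses two made-up temporary files rather than murders.csv, so it can be run without touching your working directory.

```r
# Two temporary paths; tempfile() only generates names, it creates no files
src  <- tempfile(fileext = ".csv")
dest <- tempfile(fileext = ".csv")
writeLines("state,total", src)

first_copy  <- file.copy(src, dest)                    # destination does not exist yet
second_copy <- file.copy(src, dest)                    # destination already exists
forced_copy <- file.copy(src, dest, overwrite = TRUE)  # explicitly allow replacing it

c(first_copy, second_copy, forced_copy)
```

By default file.copy does not overwrite, so a return value of FALSE when rerunning the copying code often just means the file was already there.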

The readr and readxl packages

In this section we introduce the main tidyverse data importing functions. We will use the murders.csv file provided by the dslabs package as an example. To simplify the illustration we will copy the file to our working directory using the following code:

filename <- "murders.csv"
dir <- system.file("extdata", package = "dslabs")
fullpath <- file.path(dir, filename)
file.copy(fullpath, "murders.csv")

readr

The readr library includes functions for reading data stored in text file spreadsheets into R. readr is loaded as part of the tidyverse, or you can load it directly:

library(readr)

The following functions are available to read-in spreadsheets:

Function     Format                                            Typical suffix
read_table   white space separated values                      txt
read_csv     comma separated values                            csv
read_csv2    semicolon separated values                        csv
read_tsv     tab separated values                              tsv
read_delim   general text file format, must define delimiter   txt

Although the suffix usually tells us what type of file it is, there is no guarantee that these always match. We can open the file to take a look or use the function read_lines to look at a few lines:

read_lines("murders.csv", n_max = 3)
#> [1] "state,abb,region,population,total"
#> [2] "Alabama,AL,South,4779736,135"
#> [3] "Alaska,AK,West,710231,19"

This also shows that there is a header. Now we are ready to read the data into R. From the .csv suffix and the peek at the file, we know to use read_csv:

dat <- read_csv(filename)
#> Parsed with column specification:
#> cols(
#>   state = col_character(),
#>   abb = col_character(),
#>   region = col_character(),
#>   population = col_double(),
#>   total = col_double()
#> )

Note that we receive a message letting us know what data types were used for each column. Also note that dat is a tibble, not just a data frame. This is because read_csv is a tidyverse parser. We can confirm that the data has in fact been read-in with:

View(dat)

Finally, note that we can also use the full path for the file:

dat <- read_csv(fullpath)
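The column types that read_csv reports are guesses. As a minimal sketch, assuming the readr package is installed, the types can instead be declared through the col_types argument. The three-line file below is written to a temporary location so the example does not depend on dslabs; the object names tmp and small are made up.

```r
library(readr)

# A small file in the style of murders.csv, written to a temporary location
tmp <- tempfile(fileext = ".csv")
writeLines(c("state,abb,population",
             "Alabama,AL,4779736",
             "Alaska,AK,710231"), tmp)

# Declare the column types instead of letting read_csv guess them
small <- read_csv(tmp, col_types = cols(
  state      = col_character(),
  abb        = col_character(),
  population = col_double()
))

class(small$population)
```

Declaring types makes the import fail loudly, or at least warn, when a file does not look the way you expect, rather than silently producing a column of a different type.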

readxl

You can load the readxl package using

library(readxl)

The package provides functions to read in Microsoft Excel formats:

Function     Format                   Typical suffix
read_excel   auto detect the format   xls, xlsx
read_xls     original format          xls
read_xlsx    new format               xlsx

The Microsoft Excel formats permit you to have more than one spreadsheet in one file. These are referred to as sheets. The functions listed above read the first sheet by default, but we can also read the others. The excel_sheets function gives us the names of all the sheets in an Excel file. These names can then be passed to the sheet argument in the three functions above to read sheets other than the first.
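A short sketch of reading a sheet other than the first follows. It assumes the readxl package is installed and relies on the example workbook datasets.xlsx that readxl ships and exposes through readxl_example; that workbook is not part of dslabs.

```r
library(readxl)

# Example workbook bundled with readxl (an assumption of this sketch)
xlsx_path <- readxl_example("datasets.xlsx")

# List the sheet names, then read the second sheet by name
sheets <- excel_sheets(xlsx_path)
sheets
second <- read_excel(xlsx_path, sheet = sheets[2])
head(second)
```

The sheet argument also accepts a position, so sheet = 2 would read the same sheet here.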

Exercises

1. Use the read_csv function to read each of the files that the following code saves in the files object:

path <- system.file("extdata", package = "dslabs")
files <- list.files(path)
files

2. Note that the last one, the olive file, gives us a warning. This is because the first line of the file is missing the header for the first column.

Read the help file for read_csv to figure out how to read in the file without reading this header. If you skip the header, you should not get this warning. Save the result to an object called dat.

3. A problem with the previous approach is that we don’t know what the columns represent. Type:

names(dat)

to see that the names are not informative.

Use the readLines function to read in just the first line (we later learn how to extract values from the output).

Downloading files

Another common place for data to reside is on the internet. When these data are in files, we can download them and then import them or even read them directly from the web. For example, we note that because our dslabs package is on GitHub, the file we downloaded with the package has a url:

url <- "https://raw.githubusercontent.com/rafalab/dslabs/master/inst/extdata/murders.csv"

The read_csv function can read these files directly:

dat <- read_csv(url)

If you want to have a local copy of the file, you can use the download.file function:

download.file(url, "murders.csv")

This will download the file and save it on your system with the name murders.csv. You can use any name here, not necessarily murders.csv. Note that when using download.file you should be careful as it will overwrite existing files without warning.

Two functions that are sometimes useful when downloading data from the internet are tempdir and tempfile. The first returns the path of a temporary directory, created for the current R session, with a random name that is very likely to be unique. Similarly, tempfile creates a character string, not a file, that is likely to be a unique filename. So you can run commands like these, which erase the temporary file once the data has been imported:

tmp_filename <- tempfile()
download.file(url, tmp_filename)
dat <- read_csv(tmp_filename)
file.remove(tmp_filename)
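Because download.file overwrites existing files without warning, one possible safeguard is a small wrapper that checks for the destination first. The helper name download_if_absent is hypothetical and is not part of base R or of any package used here.

```r
# Hypothetical helper: refuse to overwrite an existing local file
download_if_absent <- function(url, destfile) {
  if (file.exists(destfile)) {
    message("Keeping existing file: ", destfile)
    return(invisible(FALSE))
  }
  download.file(url, destfile)
  invisible(TRUE)
}

# Usage, with the url object defined in the text:
# download_if_absent(url, "murders.csv")
```

The function returns FALSE, invisibly, when it leaves an existing file alone and TRUE after a download, so a script can tell which of the two happened.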

R-base importing functions

R-base also provides import functions. These have similar names to those in the tidyverse, for example read.table, read.csv and read.delim. However, there are a couple of important differences. To show this we read in the data with an R-base function:

dat2 <- read.csv(filename)

An important difference is that, in versions of R before 4.0.0, the characters are converted to factors by default:

class(dat2$abb)
#> [1] "factor"
class(dat2$region)
#> [1] "factor"

This can be avoided by setting the argument stringsAsFactors to FALSE, which is the default from R 4.0.0 onward.

dat <- read.csv("murders.csv", stringsAsFactors = FALSE)
class(dat$state)
#> [1] "character"

In our experience the conversion can be a cause for confusion, since a variable that was saved as characters in the file is converted to factors regardless of what the variable represents. In fact, we highly recommend setting stringsAsFactors = FALSE to be your default approach when using the R-base parsers with older versions of R. You can easily convert the desired columns to factors after importing data.
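Converting selected columns to factors after import can be sketched with base R alone. The few rows below are written to a temporary file in the style of murders.csv; the object names tmp and d are made up for the example.

```r
# A few rows in the style of murders.csv, written to a temporary file
tmp <- tempfile(fileext = ".csv")
writeLines(c("state,region,total",
             "Alabama,South,135",
             "Alaska,West,19",
             "Arizona,West,232"), tmp)

# Keep strings as characters on import, whatever the R version
d <- read.csv(tmp, stringsAsFactors = FALSE)

# Convert only the column that is truly categorical
d$region <- factor(d$region)

class(d$state)
levels(d$region)
```

Here state stays a character vector, since each value is a unique name, while region becomes a factor with one level per distinct region.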

scan

When reading in spreadsheets many things can go wrong. The file might have a multiline header, be missing cells, or it might use an unexpected encoding. We recommend you read this post about common issues: https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/ .

With experience you will learn how to deal with different challenges. Carefully reading the help files for the functions discussed here will be useful. Another function that is helpful is scan. With scan you can read in each cell of a file. Here is an example:

path <- system.file("extdata", package = "dslabs")
filename <- "murders.csv"
x <- scan(file.path(path, filename), sep = ",", what = "c")
x[1:10]
#>  [1] "state"      "abb"        "region"     "population" "total"
#>  [6] "Alabama"    "AL"         "South"      "4779736"    "135"
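Since scan returns the cells as one long vector, a natural follow-up is to rebuild the rectangle by hand. The sketch below passes the first three lines of murders.csv as text, so it runs without the dslabs package; the object names cells, m, and body are made up.

```r
# The first lines of murders.csv, supplied as text instead of a file
cells <- scan(text = "state,abb,region,population,total
Alabama,AL,South,4779736,135
Alaska,AK,West,710231,19",
              sep = ",", what = "character", quiet = TRUE)

# Five cells per row: rebuild the rectangle row by row
m <- matrix(cells, ncol = 5, byrow = TRUE)

# The first row holds the header, the rest holds the data
body <- as.data.frame(m[-1, , drop = FALSE], stringsAsFactors = FALSE)
names(body) <- m[1, ]

# Every cell was read as text, so numeric columns must be converted
body$population <- as.numeric(body$population)
body
```

This is more work than read_csv, which is the point: scan is a fallback for files that are too irregular for the standard parsers.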

Text versus binary files

For data science purposes, files can generally be classified into two categories: text files (also known as ASCII files) and binary files. You have already worked with text files. All your R scripts are text files and so are the R markdown files used to create this book. The csv tables you have read are also text files. One big advantage of these files is that we can easily “look” at them without having to purchase any kind of special software or follow complicated instructions. Any text editor can be used to examine a text file, including freely available editors such as RStudio, Notepad, textEdit, vi, emacs, nano, and pico. To see this, try opening a csv file using the “Open file” RStudio tool. You should be able to see the content right in your editor. However, if you try to open, say, an Excel xls file, jpg or png file, you will not be able to see anything immediately useful. These are binary files. Excel files in the newer xlsx format are actually compressed folders with several text files inside. But the main distinction here is that text files can be easily examined.

Although R includes tools for reading widely used binary files, such as xls files, in general you will want to find data sets stored in text files. Similarly, when sharing data you want to make it available as text files as long as storage is not an issue (binary files are much more efficient at saving space on your disk). In general, plain-text formats make it easier to share data since commercial software is not required for working with the data.

Extracting data from a spreadsheet stored as a text file is perhaps the easiest way to bring data from a file to an R session. Unfortunately, spreadsheets are not always available and the fact that you can look at text files does not necessarily imply that extracting data from them will be straightforward. In the Data Wrangling part of the book we learn to extract data from more complex text files such as html files.

Unicode versus ASCII

A pitfall in data science is assuming a file is an ASCII text file when, in fact, it is something else that can look a lot like an ASCII text file: a Unicode text file.

To understand the difference between these, remember that everything on a computer needs to eventually be converted to 0s and 1s. ASCII is an encoding that maps characters to numbers. ASCII uses 7 bits (0s and 1s), which results in $2^7 = 128$ unique items, enough to encode all the characters on an English language keyboard. However, other languages use characters not included in this encoding. For example, the é in México is not encoded by ASCII. For this reason, a new encoding, using more than 7 bits, was defined: Unicode. When using Unicode, one can choose between 8, 16, and 32 bits, abbreviated UTF-8, UTF-16, and UTF-32 respectively. RStudio actually defaults to UTF-8 encoding.

Although we do not go into the details of how to deal with the different encodings here, it is important that you know these different encodings exist so that you can better diagnose a problem if you encounter it. One way problems manifest themselves is when you see “weird looking” characters you were not expecting. This StackOverflow discussion is an example: https://stackoverflow.com/questions/18789330/r-on-windows-character-encoding-hell .
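To see the byte-level difference for the é example, the following base R sketch inspects the bytes directly. Latin-1 is brought in here only as an illustration of a second encoding; it is not discussed in the text. The letter is written as the escape "\u00e9" so the example does not depend on how this file itself is encoded.

```r
# "\u00e9" is the letter e with an acute accent, written as an escape
e_acute <- "\u00e9"

# An ASCII character maps to a number below 128
utf8ToInt("M")

# In UTF-8 the accented letter is stored as two bytes
utf8_bytes <- charToRaw(enc2utf8(e_acute))
utf8_bytes

# In Latin-1, another common encoding, the same letter is a single byte
latin1_bytes <- charToRaw(iconv(e_acute, from = "UTF-8", to = "latin1"))
latin1_bytes
```

The same character is therefore stored as the bytes c3 a9 in one encoding and as e9 in the other. Software that reads bytes under the wrong assumption shows the “weird looking” characters mentioned above.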


Organizing data with spreadsheets

Although there are R packages designed to read Microsoft Excel files, if you are choosing a file format to save your own data, you generally want to avoid Microsoft Excel. We recommend Google Sheets as a free software tool for organizing data. We provide more recommendations in the section Data Organization with Spreadsheets.

This book focuses on data analysis. Yet often a data scientist needs to collect data or work with others collecting data. Filling out a spreadsheet by hand is a practice we highly discourage; instead, we recommend that the process be automated as much as possible. But sometimes you just have to do it. In this section, we provide recommendations on how to store data in a spreadsheet. We summarize a paper by Karl Broman and Kara Woo [14]. Below are their general recommendations. Please read the paper for important details.

Be Consistent - Before you commence entering data, have a plan. Once you have a plan, be consistent and stick to it.

Choose Good Names for Things - You want the names you pick for objects, files, and directories to be memorable, easy to spell, and descriptive. This is actually a hard balance to achieve and it does require time and thought. One important rule to follow is do not use spaces; use underscores _ or dashes - instead. Also, avoid symbols; stick to letters and numbers.

Write Dates as YYYY-MM-DD - To avoid confusion, we strongly recommend using this global ISO 8601 standard.

No Empty Cells - Fill in all cells and use some common code for missing data.

Put Just One Thing in a Cell - It is better to add columns to store the extra information rather than having more than one piece of information in one cell.

Make It a Rectangle - The spreadsheet should be a rectangle.

Create a Data Dictionary - If you need to explain things, such as what the columns are or what the labels used for categorical variables are, do this in a separate file.

No Calculations in the Raw Data Files - Excel permits you to perform calculations. Do not make this part of your spreadsheet. Code for calculations should be in a script.

Do Not Use Font Color or Highlighting as Data - Most import functions are not able to import this information. Encode this information as a variable instead.

Make Backups - Make regular backups of your data.

Use Data Validation to Avoid Errors - Leverage the tools in your spreadsheet software so that the process is as error-free and repetitive-stress-injury-free as possible.

Save the Data as Text Files - Save files for sharing in comma or tab delimited format.
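A few of these recommendations can be illustrated with a small base R sketch: ISO 8601 dates, a single code for missing data, and a comma delimited text file. The measurements and the names measurements, out, and back are made up for the example.

```r
# Made-up measurements following the recommendations above
measurements <- data.frame(
  date      = c("2024-01-01", "2024-01-02", "2024-01-03"),
  weight_kg = c(70.2, NA, 69.8)
)

# Dates written as YYYY-MM-DD parse without specifying a format
measurements$date <- as.Date(measurements$date)

# Save as a comma delimited text file, with NA as the code for missing data
out <- tempfile(fileext = ".csv")
write.csv(measurements, out, row.names = FALSE, na = "NA")

# Reading the file back recovers the missing value
back <- read.csv(out, na.strings = "NA")
is.na(back$weight_kg)
```

Note the column name weight_kg: no spaces, and the unit is part of the name, so less needs to be explained in the data dictionary.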

[14] https://rafalab.github.io/dsbook/importing-data.html#fn14

Exercises

1. Pick a measurement you can take on a regular basis. For example, your daily weight or how long it takes you to run 5 miles. Keep a spreadsheet that includes the date, the hour, the measurement, and any other informative variable you think is worth keeping. Do this for 2 weeks. Then make a plot.

Introduction to data visualization

Looking at the numbers and character strings that define a dataset is rarely useful. To convince yourself, print and stare at the US murders data table:

library(dslabs)
data(murders)
head(murders)

We are reminded of the saying “a picture is worth a thousand words”. Data visualization provides a powerful way to communicate a data-driven finding. In some cases, the visualization is so convincing that no follow-up analysis is required.

The growing availability of informative datasets and software tools has led to increased reliance on data visualizations across many industries, academia, and government. A salient example is news organizations, which are increasingly embracing data journalism and including effective infographics as part of their reporting.

Data visualization is the strongest tool of what we call exploratory data analysis (EDA). John W. Tukey, considered the father of EDA, once said,

“The greatest value of a picture is when it forces us to notice what we never expected to see.”

Many widely used data analysis tools were initiated by discoveries made via EDA. EDA is perhaps the most important part of data analysis, yet it is one that is often overlooked.

Data visualization is also now pervasive in philanthropic and educational organizations. In the talks New Insights on Poverty and The Best Stats You’ve Ever Seen, Hans Rosling forces us to notice the unexpected with a series of plots related to world health and economics. In his videos, he uses animated graphs to show us how the world is changing and how old narratives are no longer true.

In this part of the book, we will learn the basics of data visualization and exploratory data analysis by using three motivating examples. We will use the ggplot2 package to code. To learn the very basics, we will start with a somewhat artificial example: heights reported by students. Then we will cover the two examples mentioned above: 1) world health and economics and 2) infectious disease trends in the United States.

ggplot2

Exploratory data visualization is perhaps the greatest strength of R. One can quickly go from idea to data to plot with a unique balance of flexibility and ease. For example, Excel may be easier than R for some plots, but it is nowhere near as flexible. D3.js may be more flexible and powerful than R, but it takes much longer to generate a plot.

Throughout the book, we will be creating plots using the ggplot2 package.

library(dplyr)
library(ggplot2)

There are also other packages for creating graphics such as grid and lattice.

The components of a graph

In the plot we construct in this section, a scatterplot of total murders versus population for each US state with both axes on a log scale, we can clearly see how much states vary across population size and the total number of murders. Not surprisingly, we also see a clear relationship between murder totals and population size. A state falling on the dashed grey line has the same murder rate as the US average. The four geographic regions are denoted with color, which depicts how most southern states have murder rates above the average.

The first step in learning ggplot2 is to be able to break a graph apart into components. Let’s break down the plot above and introduce some of the ggplot2 terminology. The main three components to note are:

Data: The US murders data table is being summarized. We refer to this as the data component.

Geometry: The plot above is a scatterplot. This is referred to as the geometry component. Other possible geometries are barplot, histogram, smooth densities, qqplot, and boxplot. We will learn more about these in the Data Visualization part of the book.

Aesthetic mapping: The plot uses several visual cues to represent the information provided by the dataset. The two most important cues in this plot are the point positions on the x-axis and y-axis, which represent population size and the total number of murders, respectively. Each point represents a different observation, and we map data about these observations to visual cues like x- and y-scale. Color is another visual cue that we map to region. We refer to this as the aesthetic mapping component. How we define the mapping depends on what geometry we are using.

ggplot objects

The first step in creating a ggplot2 graph is to define a ggplot object. We do this with the function ggplot, which initializes the graph. If we read the help file for this function, we see that the first argument is used to specify what data is associated with this object:

ggplot(data = murders)

We can also pipe the data in as the first argument. So this line of code is equivalent to the one above:

murders %>% ggplot()

What has happened above is that the object was created and, because it was not assigned, it was automatically evaluated. But we can assign our plot to an object, for example like this:

p <- ggplot(data = murders)
class(p)
#> [1] "gg"     "ggplot"

To render the plot associated with this object, we simply print the object p. The following two lines of code each produce the same plot we see above:

print(p)
p

Geometries

In ggplot2 we create graphs by adding layers. Layers can define geometries, compute summary statistics, define what scales to use, or even change styles. To add layers, we use the symbol +. In general, a line of code will look like this:

DATA %>% ggplot() + LAYER 1 + LAYER 2 + … + LAYER N

Usually, the first added layer defines the geometry. We want to make a scatterplot. What geometry do we use?

Taking a quick look at the cheat sheet, we see that the function used to create plots with this geometry is geom_point.

Geometry function names follow the pattern: geom_X where X is the name of the geometry. Some examples include geom_point, geom_bar, and geom_histogram.

For geom_point to run properly we need to provide data and a mapping. We have already connected the object p with the murders data table, and if we add the layer geom_point it defaults to using this data. To find out what mappings are expected, we read the Aesthetics section of the geom_point help file, which lists x and y as required.

Aesthetic mappings

Aesthetic mappings describe how properties of the data connect with features of the graph, such as distance along an axis, size, or color. The aes function connects data with what we see on the graph by defining aesthetic mappings and will be one of the functions you use most often when plotting. The outcome of the aes function is often used as the argument of a geometry function. This example produces a scatterplot of total murders versus population in millions:

murders %>% ggplot() +
  geom_point(aes(x = population/10^6, y = total))

We can drop the x = and y = if we want to, since these are the first and second expected arguments, as seen in the help page.

Instead of defining our plot from scratch, we can also add a layer to the p object that was defined above as p <- ggplot(data = murders):

p + geom_point(aes(population/10^6, total))

Layers

A second layer in the plot we wish to make involves adding a label to each point to identify the state. The geom_label and geom_text functions permit us to add text to the plot with and without a rectangle behind the text, respectively.

Because each point (each state in this case) has a label, we need an aesthetic mapping to make the connection between points and labels. By reading the help file, we learn that we supply the mapping between point and label through the label argument of aes. So the code looks like this:

p + geom_point(aes(population/10^6, total)) +
  geom_text(aes(population/10^6, total, label = abb))

The aes function has a special behavior: inside it, we can refer to columns of the data by name. As an example, note that this call:

p_test <- p + geom_text(aes(population/10^6, total, label = abb))

is fine, whereas this call:

p_test <- p + geom_text(aes(population/10^6, total), label = abb)

will give you an error since abb is not found because it is outside of the aes function. The layer geom_text does not know where to find abb since it is a column name and not a global variable.
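The difference between the two calls can be reproduced on a small scale. The sketch below assumes ggplot2 is installed and uses a made-up data frame named toy in place of the murders table; tryCatch is used only to capture the error message instead of stopping.

```r
library(ggplot2)

# A made-up data frame standing in for the murders table
toy <- data.frame(population = c(4.8, 0.7, 6.4),
                  total      = c(135, 19, 232),
                  abb        = c("AL", "AK", "AZ"))

p_toy <- ggplot(data = toy)

# Inside aes(), abb is looked up as a column of toy
ok <- p_toy + geom_text(aes(population, total, label = abb))

# Outside aes(), abb is looked up as an ordinary R object and is not found
bad <- tryCatch(
  p_toy + geom_text(aes(population, total), label = abb),
  error = function(e) conditionMessage(e)
)
bad
```

Here ok is a regular plot object, while bad holds the text of the error raised by the second call.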

Tinkering with arguments

Each geometry function has many arguments other than aes and data. They tend to be specific to the function. For example, in the plot we wish to make, the points are larger than the default size. In the help file we see that size is an aesthetic and we can change it like this:

p + geom_point(aes(population/10^6, total), size = 3) +
  geom_text(aes(population/10^6, total, label = abb))

size is not a mapping: whereas mappings use data from specific observations and need to be inside aes(), operations we want to affect all the points the same way do not need to be included inside aes.

Now because the points are larger it is hard to see the labels. If we read the help file for geom_text, we see the nudge_x argument, which moves the text slightly to the right or to the left:

p + geom_point(aes(population/10^6, total), size = 3) +
  geom_text(aes(population/10^6, total, label = abb), nudge_x = 1.5)

Global versus local aesthetic mappings

In the previous line of code, we define the mapping aes(population/10^6, total) twice, once in each geometry. We can avoid this by using a global aesthetic mapping. We can do this when we define the blank slate ggplot object. Remember that the function ggplot contains an argument that permits us to define aesthetic mappings:

args(ggplot)

If we define a mapping in ggplot, all the geometries that are added as layers will default to this mapping. We redefine p:

p <- murders %>% ggplot(aes(population/10^6, total, label = abb))

and then we can simply write the following code to produce the previous plot:

p + geom_point(size = 3) +
  geom_text(nudge_x = 1.5)

We keep the size and nudge_x arguments in geom_point and geom_text, respectively, because we want to only increase the size of points and only nudge the labels. If we put those arguments in the global aes then they would apply to both layers. Also note that the geom_point function does not need a label argument and therefore ignores that aesthetic.

If necessary, we can override the global mapping by defining a new mapping within each layer. These local definitions override the global. Here is an example:

p + geom_point(size = 3) +
  geom_text(aes(x = 10, y = 800, label = "Hello there!"))

Scales

First, our desired scales are in log-scale. This is not the default, so this change needs to be added through a scales layer. A quick look at the cheat sheet reveals the scale_x_continuous function lets us control the behavior of scales. We use them like this:

p + geom_point(size = 3) +
  geom_text(nudge_x = 0.05) +
  scale_x_continuous(trans = "log10") +
  scale_y_continuous(trans = "log10")

Because we are in the log-scale now, the nudge must be made smaller.

This particular transformation is so common that ggplot2 provides the specialized functions scale_x_log10 and scale_y_log10, which we can use to rewrite the code like this:

p + geom_point(size = 3) +
  geom_text(nudge_x = 0.05) +
  scale_x_log10() +
  scale_y_log10()
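One way to see why the nudge has to shrink is to look at the positions ggplot2 computes after the scale transformation. The sketch below assumes ggplot2 is installed, uses a made-up data frame named toy, and relies on ggplot2's layer_data function to extract the values drawn for a layer.

```r
library(ggplot2)

# Made-up populations (in millions) and murder totals
toy <- data.frame(population = c(0.7, 4.8, 37.3),
                  total      = c(19, 135, 1257))

p_log <- ggplot(toy, aes(population, total)) +
  geom_point() +
  scale_x_log10() +
  scale_y_log10()

# layer_data() returns the values ggplot2 computes for the first layer
drawn <- layer_data(p_log, 1)

# On the transformed axis, x positions are log10 of the original values
cbind(original = toy$population, drawn = drawn$x)
```

The drawn positions are in log10 units, so a shift of 1 corresponds to a factor of 10 in population. This is consistent with the much smaller nudge_x of 0.05 used in the code above.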

Labels and titles

Similarly, the cheat sheet quickly reveals that to change labels and add a title, we use the following functions:

p + geom_point(size = 3) +
  geom_text(nudge_x = 0.05) +
  scale_x_log10() +
  scale_y_log10() +
  xlab("Populations in millions (log scale)") +
  ylab("Total number of murders (log scale)") +
  ggtitle("US Gun Murders in 2010")

Categories as colorsWe can change the color of the points using the col argument in the geom_point function. To facilitate demonstration of new features, we will redefine p to be everything except the points layer:p <- murders %>% ggplot(aes(population/10^6, total, label = abb)) + geom_text(nudge_x = 0.05) + scale_x_log10() + scale_y_log10() + xlab("Populations in millions (log scale)") +

ylab("Total number of murders (log scale)") + ggtitle("US Gun Murders in 2010")and then test out what happens by adding different calls to geom_point. We can make all the points blue by adding the color argument:p + geom_point(size = 3, color ="blue")This, of course, is not what we want. We want to assign color depending on the geographical region. A nice default behavior of ggplot2 is that if we assign a categorical variable to color, it automatically assigns a different color to each category and also adds a legend.

Since the choice of color is determined by a feature of each observation, this is an aesthetic mapping. To map each point to a color, we need to use aes. We use the following code:

p + geom_point(aes(col=region), size = 3)
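The difference between a fixed color and a mapped color can be sketched as follows. The toy data frame below is hypothetical and only stands in for murders; layer_data is used here to inspect the colors ggplot2 ends up assigning to each point.

```r
library(ggplot2)

# Hypothetical toy data standing in for the murders data frame
toy <- data.frame(
  population = c(1, 5, 20),
  total = c(10, 120, 600),
  region = factor(c("South", "West", "South"))
)

base <- ggplot(toy, aes(population, total))

# Fixed color: set outside aes(), identical for every point, no legend
p_fixed <- base + geom_point(size = 3, color = "blue")

# Mapped color: set inside aes(), one color per region, legend added
p_mapped <- base + geom_point(aes(col = region), size = 3)

# layer_data() returns the computed aesthetics for the first layer
fixed_cols <- unique(layer_data(p_fixed)$colour)
mapped_cols <- unique(layer_data(p_mapped)$colour)
```

With this toy data, fixed_cols holds a single value while mapped_cols holds one color per region, which is the behavior the text describes.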

Annotation, shapes, and adjustments
We often want to add shapes or annotation to figures that are not derived directly from the aesthetic mapping; examples include labels, boxes, shaded areas, and lines.

Here we want to add a line that represents the average murder rate for the entire country. Once we determine the per million rate to be $r$, this line is defined by the formula $y = r x$, with $y$ and $x$ our axes: total murders and population in millions, respectively. In the log-scale this line turns into $\log(y) = \log(r) + \log(x)$. So in our plot it’s a line with slope 1 and intercept $\log(r)$. To compute this value, we use our dplyr skills:

r <- murders %>% summarize(rate = sum(total) / sum(population) * 10^6) %>% pull(rate)

To add a line we use the geom_abline function. ggplot2 uses ab in the name to remind us we are supplying the intercept (a) and slope (b). The default line has slope 1 and intercept 0 so we only have to define the intercept:

p + geom_point(aes(col=region), size = 3) + geom_abline(intercept = log10(r))

Here geom_abline does not use any information from the data object.

We can change the line type and color of the lines using arguments. Also, we draw it first so it doesn’t go over our points.

p <- p + geom_abline(intercept = log10(r), lty = 2, color = "darkgrey") + geom_point(aes(col=region), size = 3)

Note that we have redefined p and used this new p below and in the next section.
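The claim that the average-rate line has slope 1 and intercept log10(r) on log10 axes can be checked numerically. This is a minimal sketch in base R; the population and total values are hypothetical and are not the real murders data.

```r
# Hypothetical toy values (illustration only, not the murders data)
population <- c(2e6, 5e6, 1e7)
total <- c(20, 100, 80)

# National rate per million, computed the same way as in the text
r <- sum(total) / sum(population) * 10^6

# A hypothetical state whose rate equals the national rate exactly
x <- 3        # population in millions
y <- r * x    # total murders implied by y = r * x

# On log10 axes the point satisfies log10(y) = log10(r) + 1 * log10(x),
# that is, a line with intercept log10(r) and slope 1
lhs <- log10(y)
rhs <- log10(r) + 1 * log10(x)
all.equal(lhs, rhs)
```

States above the dashed line therefore have a murder rate higher than the national average, and states below it have a lower one.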

The default plots created by ggplot2 are already very useful. However, we frequently need to make minor tweaks to the default behavior. Although it is not always obvious how to make these even with the cheat sheet, ggplot2 is very flexible.

For example, we can make changes to the legend via the scale_color_discrete function. In our desired plot the word Region is capitalized, so we change the legend title like this:

p <- p + scale_color_discrete(name = "Region")

Add-on packages
The power of ggplot2 is augmented further due to the availability of add-on packages. The remaining changes needed to put the finishing touches on our plot require the ggthemes and ggrepel packages.

The style of a ggplot2 graph can be changed using the theme functions. Several themes are included as part of the ggplot2 package. In fact, for most of the plots in this book, we use a function in the dslabs package that automatically sets a default theme:

ds_theme_set()

Many other themes are added by the package ggthemes. Among those is the theme_economist theme that we used. After installing the package, you can change the style by adding a layer like this:

library(ggthemes)
p + theme_economist()

You can see how some of the other themes look by simply changing the function. For instance, you might try the theme_fivethirtyeight() theme instead.

The final difference has to do with the position of the labels. In our plot, some of the labels fall on top of each other. The add-on package ggrepel includes a geometry that adds labels while ensuring that they don’t fall on top of each other. We simply replace geom_text with geom_text_repel.

Putting it all together
Now that we are done testing, we can write one piece of code that produces our desired plot from scratch.

library(ggthemes)
library(ggrepel)

r <- murders %>% summarize(rate = sum(total) / sum(population) * 10^6) %>% pull(rate)

murders %>% ggplot(aes(population/10^6, total, label = abb)) +
  geom_abline(intercept = log10(r), lty = 2, color = "darkgrey") +
  geom_point(aes(col=region), size = 3) +
  geom_text_repel() +
  scale_x_log10() +
  scale_y_log10() +
  xlab("Populations in millions (log scale)") +
  ylab("Total number of murders (log scale)") +
  ggtitle("US Gun Murders in 2010") +
  scale_color_discrete(name = "Region") +
  theme_economist()

Quick plots with qplot
We have learned the powerful approach to generating visualizations with ggplot. However, there are instances in which all we want is to make a quick plot of, for example, a histogram of the values in a vector, a scatterplot of the values in two vectors, or a boxplot using categorical and numeric vectors. We demonstrated how to generate these plots with hist, plot, and boxplot. However, if we want to keep consistent with the ggplot style, we can use the function qplot.

If we have values in two vectors, say:

data(murders)
x <- log10(murders$population)
y <- murders$total

and we want to make a scatterplot with ggplot, we would have to type something like:

data.frame(x = x, y = y) %>% ggplot(aes(x, y)) + geom_point()

This seems like too much code for such a simple plot. The qplot function sacrifices the flexibility provided by the ggplot approach, but allows us to generate a plot quickly.

qplot(x, y)

Grids of plots
There are often reasons to graph plots next to each other. The gridExtra package, which must be installed first, permits us to do that:

library(gridExtra)
p1 <- qplot(x)
p2 <- qplot(x, y)
grid.arrange(p1, p2, ncol = 2)

Exercises
Start by loading the dplyr and ggplot2 libraries as well as the murders and heights data.

library(dplyr)
library(ggplot2)
library(dslabs)
data(heights)
data(murders)

1. With ggplot2, plots can be saved as objects. For example, we can associate a dataset with a plot object like this:

p <- ggplot(data = murders)

Because data is the first argument, we don’t need to spell it out:

p <- ggplot(murders)

and we can also use the pipe:

p <- murders %>% ggplot()

What is the class of the object p?

2. Remember that to print an object you can use the command print or simply type the object. Print the object p defined in exercise one and describe what you see.

a. Nothing happens.
b. A blank slate plot.
c. A scatterplot.
d. A histogram.

3. Using the pipe %>%, create an object p but this time associated with the heights dataset instead of the murders dataset.

4. What is the class of the object p you have just created?

5. Now we are going to add a layer and the corresponding aesthetic mappings. For the murders data we plotted total murders versus population sizes. Explore the murders data frame to remind yourself what the names for these two variables are and select the correct answer. Hint: Look at ?murders.

a. state and abb.
b. total_murders and population_size.
c. total and population.
d. murders and size.

6. To create the scatterplot we add a layer with geom_point. The aesthetic mappings require us to define the x-axis and y-axis variables, respectively. So the code looks like this:

murders %>% ggplot(aes(x = , y = )) + geom_point()

except we have to define the two variables x and y. Fill this out with the correct variable names.

7. Note that if we don’t use argument names, we can obtain the same plot by making sure we enter the variable names in the right order like this:

murders %>% ggplot(aes(population, total)) + geom_point()

Remake the plot but now with total in the x-axis and population in the y-axis.

8. If instead of points we want to add text, we can use the geom_text() or geom_label() geometries. The following code

murders %>% ggplot(aes(population, total)) + geom_label()

will give us the error message: Error: geom_label requires the following missing aesthetics: label

Why is this?

a. We need to map a character to each point through the label argument in aes.
b. We need to let geom_label know what character to use in the plot.
c. The geom_label geometry does not require x-axis and y-axis values.
d. geom_label is not a ggplot2 command.

9. Rewrite the code above to use abbreviation as the label through aes.

10. Change the color of the labels to blue. How will we do this?

a. Adding a column called blue to murders.
b. Because each label needs a different color we map the colors through aes.
c. Use the color argument in ggplot.
d. Because we want all colors to be blue, we do not need to map colors, just use the color argument in geom_label.

11. Rewrite the code above to make the labels blue.

12. Now suppose we want to use color to represent the different regions. In this case which of the following is most appropriate:

a. Adding a column called color to murders with the color we want to use.
b. Because each label needs a different color we map the colors through the color argument of aes.
c. Use the color argument in ggplot.
d. Because we want all colors to be blue, we do not need to map colors, just use the color argument in geom_label.

13. Rewrite the code above to make the labels’ color be determined by the state’s region.

14. Now we are going to change the x-axis to a log scale to account for the fact that the distribution of population is skewed. Let’s start by defining an object p holding the plot we have made up to now:

p <- murders %>% ggplot(aes(population, total, label = abb, color = region)) + geom_label()

To change the x-axis to a log scale we learned about the scale_x_log10() function. Add this layer to the object p to change the scale and render the plot.

15. Repeat the previous exercise but now change both axes to be in the log scale.

16. Now edit the code above to add the title “Gun murder data” to the plot. Hint: use the ggtitle function.

Date post:	23-Aug-2020