
Remote Sensing Applications Center | Exercise 1: Getting Started and Exploring Data in R

Last Updated: May, 2016 Version: R version 2.11.1 (or higher)

Exercise 1: Getting Started and Exploring Data in R
“If you want to create predictive models using data, you need to use statistical tools...”

The RStudio workflow outlined in this tutorial is designed to streamline the process of building statistical models with lidar data or any similar dataset. To create models, you need to learn some of the basics of statistical computing. In these exercises, we will use R, which is both a statistics package and a programming language. R is free, open source, and widely used. Extensive documentation, a large user base, and many tutorials and online resources make it easy to get help and start coding.

The tutorial will guide you through the process of building simple statistical models and introduce tools to evaluate the statistical validity of the models created. Some suggestions are made to help you use appropriate statistical-analysis techniques; however, you are responsible for interpreting the validity of your results. It is recommended that you have some background and experience in statistics and basic modeling before accepting or relying on any results produced from this example workflow.

Objectives
• Learn about data exploration in the R software
• Investigate lidar and forest inventory data properties and relationships

Required Data
LidarDataAnalysis.zip: The zip file includes
• BA_lidarData.csv - modeling table with field plot data and correlating lidar plot metrics
• Lidar_gridmetrics directory - containing a selection of landscape level lidar metrics in ASCII format
• Shapefile directory - contains a forest polygon shapefile and the lidar tile extents
• Scripts directory - contains a few extra template scripts corresponding to exercise appendices
• Outputs folder - contains the scripts and data that you will produce (backups are provided for each)

Prerequisites
• To create useful predictive models using these tools, it is assumed that you have enough statistical modeling experience to interpret model tests and evaluate the results.


Table of Contents
Part 1: Install Software
Part 2: Review the Course Material
Part 3: Explore lidar plot data in the R Environment
Part 4: Explore Lidar Plot Data in the R Environment
Part 5: Explore variables and subsetting data
Part 6: Explore distribution of variables
Part 7: Explore relationships between basal area and lidar predictor variables (one at a time)
Part 8: Explore relationships between basal area and lidar predictor variables (many at a time)
Part 9: Putting it all together
Part 10: Optional: Bonus code with best predictor variables
Part 11: Data exploration wrap up


Part 1: Install Software
A. Download and install R and RStudio (IDE)

1. RStudio requires that R version 2.11.1 (or higher) is also installed. If you do not yet have R, go to the RStudio CRAN mirror (http://cran.rstudio.com/) and follow the instructions to download and install it.

2. Next, navigate to the RStudio download page: https://www.rstudio.com/products/rstudio/download/.

3. Click on the link recommended for your operating system to download the appropriate version of RStudio (e.g., RStudio 0.99.887 - Windows Vista/7/8/10). This should automatically start the RStudio download.

4. When the download is complete, navigate to your Downloads folder and run the executable (e.g., RStudio-0.99.887.exe) to install RStudio.

i. Double click to run the executable and start the install (see graphic below).
(a) Note: if your computer indicates that you need administrator access to install, try right clicking on the executable and selecting Run Elevated. When prompted, in the Powerbroker for Windows Authorization pop up window, indicate the software installation Justification (e.g., installation of approved software).

ii. Follow the subsequent prompts to complete this installation.

Part 2: Review the Course Material
A. Download and inspect the contents of the course data

1. Download and unzip the course data contained in the LidarDataAnalysis.zip file.
2. The zip file contains:

i. BA_lidarData_QualitativeSet.csv - modeling table with field plot data and correlating lidar plot metrics

ii. Lidar_gridmetrics directory - containing a selection of landscape level lidar metrics in ASCII format

iii. Shapefile directory - contains a forest polygon shapefile and the lidar tile extents

iv. Scripts directory - contains a few extra template scripts corresponding to exercise appendices

v. Outputs folder - contains the scripts and data that you will produce (backups are provided for each)

Course data is from the Pinaleño Mountain Environmental Management Area in the Coronado National Forest in southeastern Arizona. The project study area covers 85,500 acres in the mixed-conifer zone above 7,000 feet. You can read about the project and data collection methods online at http://www.fs.fed.us/eng/rsac/lidar_training/pdf/0118-RPT1.pdf and http://www.fs.fed.us/rm/pubs_other/rmrs_2012_mitchell_b001.pdf.

B. Examine the landscape level lidar metrics in a GIS
1. Navigate to the LidarDataAnalysis/lidar_gridmetrics directory and open elev_P70_2plus_25METERS.asc in your GIS.
i. The file lacks a coordinate reference system; we will set this in Exercise 3.

2. Explore the raster layer. There are a few things to note:
i. This is a landscape level lidar metric and was created in FUSION using the gridmetrics command.
ii. The naming convention is generic and can be deciphered as follows:
(a) elev tells us it is a height statistic,
(b) P70 tells us it is the height of the 70th percentile,
(c) 2plus tells us a 2 meter height cutoff was used, and
(d) 25METERS tells us that the statistic is calculated within each 25 meter pixel in the raster layer.
iii. If you go back to BA_lidarData.csv (the plot-level table that you also examine in Excel in this part), you will find a column with the heading P70, which is the same statistic summarized at the plot level.

Note: The beauty of lidar data is that it covers the entire landscape, so we can generate canopy metrics for each plot (to correlate with field plot measurements) and also at the landscape level. This allows us to identify relationships at the plot level and use them to make predictions across the landscape. For more information on creating plot and landscape level lidar metrics please refer to our online tutorial on lidar inventory modeling at the following link: http://www.fs.fed.us/eng/rsac/lidar_training/Forest_Inventory_Modeling/story.html.

C. Examine BA_lidarData.csv in Excel
1. Open BA_lidarData_QualitativeSet.csv in Excel, inspect the format and variables, and note the variable names and number of observations.

Note: This dataset has been organized and prepared for analysis following the basic principles of ‘tidy data’. The variables are stored in columns and the plot observations are stored in rows. The tidy-data framework follows a consistent and standard format, making the data easy to manipulate, model, and visualize. For more information on data formatting, refer to Hadley Wickham’s article on tidy data (http://vita.had.co.nz/papers/tidy-data.pdf).

2. The column, BAtotal, is the response variable. This is the total basal area (square meters per hectare), the aggregation of basal area of all live and dead trees within each plot. The data in this column were collected in the field. Read more about the data here: http://www.fs.fed.us/rm/pubs_other/rmrs_2012_mitchell_b001.pdf.

3. The Series category is for data exploration, but will not be included in the models. This variable represents the dominant plant type. There are four plant associations:

i. spruce-fir: dominant late seral species are subalpine fir and Engelmann spruce
ii. wet mixed-conifer: dominant late seral species is white fir
iii. dry mixed conifer: dominant late seral species is Douglas fir
iv. upper pine oak: dominant late seral species is ponderosa pine

4. There are also columns derived from topographic information, including aspect and slope. These too will be used for data exploration, but will not be included in the models.

5. The columns following the Series column are all derived from the lidar data, using the cloudmetrics function in the FUSION software.

i. Columns HMin to P99 are height metrics.
ii. Columns from 1st_cover_3 to all_first_cover_mode are measures of vegetation density.

6. Close Excel.

Part 3: Explore lidar plot data in the R Environment
A. Open RStudio

1. From the Start menu, select All Programs. Locate the RStudio folder, then select RStudio to open the IDE.

2. This opens the RStudio Graphical User Interface. Use the graphic below for reference and take a moment to orient yourself.

RStudio is a free, open-source integrated development environment (IDE) for R. R comes with its own text editor and RStudio is not required to work in R. However, RStudio offers several convenient features for managing the R environment, making it easier to use R interactively.

The Console: the command prompt within RStudio. This is where you execute your script.

The Workspace: displays and catalogs the elements in your workspace. This helps us remember the ‘what’ and ‘how’ of our code in our scripts.

The lower right hand GUI panel: access files, plot graphics, R packages, and R help. In addition, it allows you to export your graphics to the clipboard.


3. Take a moment to go to the RStudio website (https://www.rstudio.com/products/rstudio/), where there is a wealth of information about RStudio and its functionality. Browse through the available resources from the drop down menu in the upper right hand corner, such as the compilation of available Cheat Sheets.

Part 4: Explore Lidar Plot Data in the R Environment
A. Create a new script for exploring the lidar dataset.

In R and RStudio, users can type R code directly in the console, but these R statements will not be saved for later use. Instead, we will create an R script where we can develop, execute, and save our work for documentation or future use.

Note: there are other types of files that you can create in RStudio, such as dynamic documents, presentations, and reports written with R Markdown. For this session, we will just be working with R scripts. If you would like to learn more about the other options, visit the webinars and videos resources available on the RStudio website: https://www.rstudio.com/resources/webinars/.

1. From the menu bar at the top left of the main RStudio window, select File, then New File, then R Script. Alternatively, you can use the New File icon in the upper left hand corner to select a new R Script (see following graphic). This will open a script window above the R Console window.



2. Save your script by selecting File, then Save As… or select the save icon in the upper left hand corner of the script panel.
3. Navigate to your course directory and name the script:

01_GettingStartedAndExploringData.R

What’s the .R at the end of the file name? This .R file extension allows the R software to recognize this is an R script. If you forget to add this, RStudio will add it by default.

B. Set your working directory. You can read tips and tricks for defining your working directory here: https://support.rstudio.com/hc/en-us/articles/200711843-Working-Directories-and-Workspaces.

1. From the top menu, choose Session, then Set Working Directory, then Choose Directory…
2. In the Choose Working Directory window that appears, navigate to the location of your LidarDataAnalysis folder. Select this folder.

Take note of the line of R code that appears in the Console window (your path will reflect the folder you selected):

> setwd("C:/LidarDataAnalysis")

> (symbol): the IDE command prompt,

setwd(): the function for setting the working directory, and

"C:/LidarDataAnalysis": the character string argument passed to this function.

Note that R only accepts forward slashes (/) or double backslashes (\\) in directory names! Both of these statements will work:

setwd("C:/LidarDataAnalysis")

setwd("C:\\LidarDataAnalysis")

The statement below will not work. If you try executing it, you will get an error ('\L' is an unrecognized escape in character string starting ""C:\L"). This is because the single backslash is used to escape a character inside character constants, such as \n to indicate a new line or \t to indicate a tab. You can read more about the uses of quotes and escaping quotes here: https://stat.ethz.ch/R-manual/R-devel/library/base/html/Quotes.html.

setwd("C:\LidarDataAnalysis")

3. For the purposes of learning and for documenting our workflow, copy the line of code above and paste it into the R script window above—Don’t copy the > symbol as this isn’t part of the code and will cause an error! Refer to following graphic.



4. Optional: below the line you just copied to set the working directory, type in the following statement:

getwd()

i. Next, make sure your cursor is still on the line that you’ve just typed (in the graphic below, this is on line 3).

ii. Press the Ctrl and R keys at the same time (Ctrl+Enter also works). This will execute this line of code.

iii. What appears in the Console window? It should be populated with the location of your working directory. Refer to graphic below.
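The working-directory steps in this section can be tried safely on a throwaway folder. The sketch below is illustrative only: demo_dir is a hypothetical temporary folder standing in for your LidarDataAnalysis folder, and the C: paths are simply the example paths used above.

```r
# Hypothetical stand-in for the course folder, created in R's temporary directory
demo_dir <- file.path(tempdir(), "LidarDataAnalysis_demo")
dir.create(demo_dir, showWarnings = FALSE)

# setwd() invisibly returns the previous working directory, so we can restore it later
old_wd <- setwd(demo_dir)
current_wd <- getwd()
print(current_wd)

# A doubled backslash inside quotes is stored as a single backslash,
# so both spellings describe the same folder
path_forward <- "C:/LidarDataAnalysis"
path_escaped <- "C:\\LidarDataAnalysis"
same_path <- identical(path_forward, gsub("\\", "/", path_escaped, fixed = TRUE))
print(same_path)

# Put the working directory back the way it was
setwd(old_wd)
```

Saving the value returned by setwd() is a convenient habit in scripts that change directories temporarily.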

C. Load lidar plot data
1. Select the Environment tab from the window in the upper right hand of the RStudio screen.
2. Choose the dropdown next to Import Dataset and select From Local File… (see following graphic).


3. In the Select File to Import window, choose the plot data file called BA_lidarData_QualitativeSet.csv. Select Open.

i. In the window that appears, change the Name of the dataset by typing lidarData in this dialog (this will be the name of the variable representing our data frame in R).

ii. Select the yes button next to Heading to make sure our variable names are included.

iii. Verify that the Separator selected is Comma.

iv. Accept the other defaults for now and then click the Import button.



Two new lines of R code below should appear in the Console and the data table will be displayed in a new tab in the panel where we are accessing our script (refer to the red box in the following image… note, the topography and series columns are missing in the graphic below).

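You will also notice that the Environment panel now has been populated with a Data record called lidarData. The graphic shows 80 observations and 35 variables (purple box in graphic below); your table will list more variables because it also includes the topography and Series columns.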

What does the <- symbol mean in R?

The symbols <- and = are both used to assign values to variables in R, and both appear in this tutorial. In this case, the <- symbol assigns the lidar plot data, which is a data.frame object in R, to the variable lidarData.
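As a small illustration of the two assignment operators, here is a sketch using made-up values (plot_count, plot_label, and average_height are hypothetical names, not part of the course data):

```r
# Both operators store a value under a name
plot_count <- 80
plot_label = "lidar plots"

# Inside a function call, = names an argument rather than creating a variable
average_height <- mean(x = c(2, 4, 6))

print(plot_count)
print(plot_label)
print(average_height)
```

Many R users reserve <- for assignment and = for function arguments so the two roles are easy to tell apart when reading a script.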

4. Copy the line of code for importing the dataset from the console into your R script. It should be something like the text below:

lidarData<-read.csv("C:/LidarDataAnalysis/BA_lidarData_QualitativeSet.csv")

i. Note, you can switch back to your script by closing the data view tab or simply clicking on the script tab in the top left.

In this one line of code, we are doing three things:

1) create a variable, called lidarData. A variable can be thought of as a container to hold information. Variables can hold many different types of information, such as data tables, single numbers, character strings, or vectors of character strings and numbers. In this example we will assign a data frame object to this variable.

2) the assignment operator (<-) sets the value of the variable, lidarData, to store the output of a function, read.csv.

3) load the lidar dataset into R using the function read.csv(). read.csv() needs some arguments before we can execute it. The one required argument is the file location as a character string (enclosed in quotes). There are additional arguments that we can pass to the function, including whether or not our data has headers, how columns are separated in the table, etc.
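To see what those optional arguments do, the sketch below writes a tiny stand-in CSV to a temporary file and reads it back. The values, the PlotID column name, and the file itself are invented for illustration; only the BAtotal, Series, and HMean names come from the course table. One assumption worth flagging: since R 4.0.0, read.csv() keeps text columns as character strings by default, so stringsAsFactors = TRUE is set here to get a factor, which the later color-coding steps rely on.

```r
# Write a small, made-up table to a temporary CSV file
demo_csv <- tempfile(fileext = ".csv")
writeLines(c("PlotID,Series,BAtotal,HMean",
             "1,PIPO,20.5,8.1",
             "2,ABCO,35.2,12.4",
             "3,PIPO,28.9,10.3"),
           demo_csv)

# header = TRUE: the first row holds the column names
# sep = ",": columns are separated by commas
# stringsAsFactors = TRUE: text columns such as Series become factors
demoData <- read.csv(demo_csv, header = TRUE, sep = ",", stringsAsFactors = TRUE)

# str() summarizes the type of every column
str(demoData)
```

If you import the course file with a recent version of R, passing stringsAsFactors = TRUE in the same way should keep Series usable as a grouping factor.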

5. To read more about the arguments that any function takes, you can type the function name preceded by a question mark into the Console. Type in and execute the following statement:



?read.csv

i. After this has run, you will notice that the GUI panel in the lower right hand portion of the R Studio screen has changed over to the Help tab (see following graphic).

ii. Additionally, the help information about Data Input functions has loaded. Scroll down to read about the function, read.csv. This is a complete list of the input parameters that can be specified as well as the proper syntax. If you keep scrolling, details are provided for each of the input parameters. At the very bottom of the help window you will find example statements you can use to see how to execute these functions. These examples call public data sets that are pre-loaded into the R installation. This allows you to copy and paste these statements into your workspace and have them execute successfully. This is just one of the great learning resources available in R Studio (and R). See following graphic.



Note: Up until this point, we have been relying on the RStudio GUI to write our R code. Now we will have to start writing our own R code. Don’t worry if you don’t have programming experience. The exercises will provide everything you need to build some template scripts for simple data analysis in R. For more information about writing R scripts, refer to the resources available at the RStudio (https://www.rstudio.com/) and R websites (https://www.r-project.org/).

Part 5: Explore variables and subsetting data
A. Use the names() function to print the list of column headers

1. Type the following line of code into your R script. Make sure to type it exactly as it appears below and note that R is case sensitive.

names(lidarData)

i. With your cursor on the line that you just typed, use your mouse to click on the Run button in the upper right of the script window (shown in the following graphic). Alternatively, use Ctrl+R or Ctrl+Enter on your keyboard to interactively run the line of code.

ii. Remember, if you want to read about the function, names, you can write and execute the line below to open its associated help information.

?names

iii. Note, R commands and variable names are case sensitive. Try executing this same line, but refer to the variable holding our data as lidardata, using all lower case letters (see code example below).

names(lidardata)

Q: What is returned?

A: Error: object ‘lidardata’ not found.

This error message occurs because we have not created a variable object called lidardata; we have a variable object called lidarData.

2. Run the names function again, this time with the correct variable name (lidarData). After you execute the line you should see something similar to the graphic below appear in the console. This function returns the variable (column) names as a character vector.


3. Examine this list of variable names (shown in the following graphic).

4. Which of these is the response variable (what we are trying to model) and which are predictor variables?

i. Answer: we are going to be using the canopy height and density metrics (see following graphic) to model, or predict, total basal area (BAtotal).

5. Based on the literature review, which do you think will be the best predictors of basal area?

We don’t spend time on the literature review here, but it is one of the most important pieces of data analysis. It’s important to learn from others before you tackle a modeling project. Search for other projects or publications that have modeled basal area using lidar data, and try to find studies that were conducted in an ecosystem similar to the one you are working in. This will help you set your expectations and also save time and money by not re-inventing the wheel. Model structure varies somewhat between projects, but you will generally find a vegetation density metric and a height metric in the models.

For example, a lidar based remote sensing forest inventory in Washington state uses p80, 80th percentile of height of the vegetation points, and the density of all first returns above 6 m to model basal area (Gould and Strunk).
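The inspection steps in this part can be rehearsed on a small stand-in data frame. Everything below is a sketch with invented values; demoData and PlotID are hypothetical, and only the BAtotal, HMean, and P70 names are borrowed from the course table.

```r
# A made-up table with the same kind of layout as the plot data
demoData <- data.frame(PlotID  = 1:4,
                       BAtotal = c(21.4, 35.0, 28.7, 40.2),
                       HMean   = c(8.1, 12.4, 10.3, 14.0),
                       P70     = c(10.2, 15.8, 13.1, 17.5))

# Column names, and the number of rows and columns
column_names <- names(demoData)
print(column_names)
print(dim(demoData))

# R is case sensitive: demodata (all lower case) does not exist,
# so we capture the error message instead of stopping the script
lookup_result <- tryCatch(names(demodata),
                          error = function(e) conditionMessage(e))
print(lookup_result)
```

tryCatch() is used here only so the example runs to the end; in interactive work you would simply see the error in the Console.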

Part 6: Explore distribution of variables
A. Make a variable to hold the response variable column, BAtotal

1. Use the $ operator to access the variables by name. Enter the code below (exactly) into your script to create a new variable called ResponseV representing the total basal area, BAtotal column in the lidarData data frame. Then run this line of code using the Run button or Ctrl+R.

ResponseV<-lidarData$BAtotal



What does the $ symbol mean in R?

This symbol is shorthand to access named columns (list elements) within the data frame object. This is equivalent to

lidarData[,'BAtotal']
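The equivalence can be checked directly. This is a minimal sketch on an invented data frame (demoData and its values are hypothetical); it also shows the double-bracket form, which is a third common way to pull out one column.

```r
# Made-up stand-in for the plot table
demoData <- data.frame(PlotID  = 1:3,
                       BAtotal = c(21.4, 35.0, 28.7))

by_dollar  <- demoData$BAtotal        # $ shorthand
by_bracket <- demoData[, "BAtotal"]   # row/column indexing with a column name
by_double  <- demoData[["BAtotal"]]   # list-style extraction

# All three return the same numeric vector
print(identical(by_dollar, by_bracket))
print(identical(by_dollar, by_double))
```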

B. Examine the distribution of the basal area in the plot data
1. Use the hist() function to plot the distribution of the response variable (total basal area). Copy, paste, and execute the code below to generate a frequency histogram of total basal area.

i. Recall that the $ operator allows us to access variables in a data frame by name.

ii. Learn more about the hist() function by typing ?hist into the console and hitting enter, or by entering hist into the Search dialog (lower right in RStudio under the Help tab). This should bring up the documentation for this function and tell you how you can change graphical parameters, add labels, etc.

hist(lidarData$BAtotal)

# same thing, but replaced the input with the variable representing this column in the table

hist(ResponseV)

What’s that pound sign doing in the code?

If you want to add comments to your code, you can! The pound sign, #, indicates that the characters that follow it on the same line are not to be executed as code. R skips the rest of that line and proceeds to the code on the following line, unless that too is commented out (with another pound sign).

2. Examine the histogram that appears in the graphics window. How would you describe the distribution of basal area?

Note: Plotting histograms is just one example of a useful tool for exploring data in R. Frequency or density histograms can easily be plotted in order to visualize data distributions. You can challenge yourself by using the help documentation for the hist() function to experiment with the histogram options, such as change the number of breaks in the histogram, add a title, x-axis labels, and change the display from frequency to density.


hist(lidarData$BAtotal,
     col = 'gray',
     main = 'Distribution of Plot Values',
     xlab = 'Total Basal Area (sq m/ha)')
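For the challenge in the note above, here is one possible sketch of the other hist() options. It runs on simulated, right-skewed values rather than the course data (ba_demo and its distribution are assumptions made purely so the example is self-contained).

```r
# Simulated stand-in for 80 plot-level basal area values (not real data)
set.seed(123)
ba_demo <- rgamma(80, shape = 4, scale = 8)

# breaks suggests the number of bins; freq = FALSE switches the y axis
# from counts to density, so the bar areas sum to 1
demo_hist <- hist(ba_demo,
                  breaks = 15,
                  freq = FALSE,
                  col = "gray",
                  main = "Distribution of simulated plot values",
                  xlab = "Total Basal Area (sq m/ha), simulated")

# Overlay a smoothed density curve on the density-scaled histogram
lines(density(ba_demo), lwd = 2)
```

Note that breaks is treated as a suggestion; hist() may choose a slightly different number of bins to get tidy break points.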

3. Use the mean() function to calculate the average value for the total basal area per plot. Run this line of code using the Run button or Ctrl+R. Remember you can call this column two different ways, both examples are included below. They should return the same value.

mean(lidarData$BAtotal);mean(ResponseV)

Compare the result to the histogram. Is this mean value what you expected after looking at the histogram? It should be! However, because it doesn’t line up exactly with the highest peak in our histogram, we have an indication that our distribution is slightly skewed (as also indicated by the long right hand tail).

4. Calculate the standard deviation and variance of the total basal area per plot, BAtotal, using the var() and sd() functions. Remember you can call this column two different ways, both examples are included below. They should return the same value.

var(lidarData$BAtotal); var(ResponseV)

sd(lidarData$BAtotal);sd(ResponseV)

Compare the standard deviation value to the spread in your histogram. Does it appear that about 68% of the data lie within one standard deviation of the mean? If the data are normally distributed, that will be approximately true (refer to image below).
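Rather than judging the 68% rule by eye, you can count the observations directly. The sketch below uses simulated, normally distributed values (ba_demo is a hypothetical stand-in for lidarData$BAtotal), so the share it prints should land somewhere near 0.68 without matching it exactly.

```r
# Simulated stand-in for the response variable (not real data)
set.seed(42)
ba_demo <- rnorm(80, mean = 30, sd = 8)

ba_mean <- mean(ba_demo)
ba_sd   <- sd(ba_demo)

# TRUE for every value within one standard deviation of the mean
within_one_sd <- abs(ba_demo - ba_mean) <= ba_sd

# The mean of a logical vector is the proportion of TRUE values
share_within <- mean(within_one_sd)
print(share_within)

# The variance is the square of the standard deviation
print(all.equal(ba_sd^2, var(ba_demo)))
```

Replacing ba_demo with lidarData$BAtotal would apply the same check to the course data.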



What happens if my variables are not distributed normally?

We’ll get to this in the next section, but briefly: sometimes it matters, other times it doesn’t. Usually though, we want our response variable (total basal area per plot) to be normally distributed if we are running a linear model. If it isn’t, we can consider a transformation or a different type of statistical model (a non-linear alternative).

Part 7: Explore relationships between basal area and lidar predictor variables (one at a time)

1. Create a scatterplot to visualize relationships in the data.
i. Use the plot() function to generate a scatterplot with the lidar predictor variable called HMean on the x axis and the response variable (ResponseV) on the y axis. To do this, enter the code below into your R script.

plot(lidarData$HMean, ResponseV)



2. If you want to add in titles and change the labels along the axis you can insert additional parameters into the plot function. Modify the plot call to look like the statement below, then run the line again. Note the changes in the plot.

plot(lidarData$HMean, ResponseV, main = "Total basal area vs. mean height", xlab = "mean (lidar) tree height", ylab = "total basal area")

3. To swap the y and x values, so that the mean height is displayed along the y axis, you can simply change the order that these two variables appear in the function (see line below). Note, don’t forget to re-code your axis labels!

plot(ResponseV, lidarData$HMean, main = "Mean height vs. total basal area", ylab = "mean (lidar) tree height", xlab = "total basal area")

Remember you can learn more about the plot function by typing ?plot into the R console and hitting enter. This brings up the documentation for the function. There is a lot here if you want to experiment with changing the plotting parameters, colors, symbols, etc.

4. Do you think that mean height could be used to predict total basal area?

B. Optional: Color the points based on dominant plant type
1. If I have data with thematic attributes, such as dominant plant type, I like to insert a color coding system so that each point is coded according to its thematic group. First let’s use the unique function to determine how many groups are in the Series column, and what they are named.



unique(lidarData$Series)

The names use the CPP four letter code, which represents the first two letters of the genus and species: PIPO is ponderosa pine (Pinus ponderosa). Read more about this plant and an example of the naming convention here: https://www.usanpn.org/cpp/sites/www.usanpn.org.cpp/files/pdfs/PIPOv6.pdf.

2. Next create a plot that represents each of these categories. Copy, paste, and execute the code below to plot the points according to dominant plant types.

i. bg sets the background color of the point (according to the Series column)
ii. pch represents the point shape

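# factor() keeps this working when Series was read in as text (the default since R 4.0)
plot(lidarData$HMean, ResponseV, bg = as.numeric(factor(lidarData$Series)), pch = 21 + as.numeric(factor(lidarData$Series)), main = "Lidar data by mean height", ylab = "total basal area", xlab = "mean height")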

3. This looks good. There is no clustering of points based on the dominant plant type.
4. Want to add a legend? Explore the legend help file to code this information into the plot.

?legend
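One way the legend could be coded is sketched below on simulated data. The data frame, its values, and the three Series codes are invented for illustration; the idea is simply to reuse the same numeric group codes for the points and for the legend so the two always agree.

```r
# Simulated stand-in for the plot data, with three made-up plant type codes
set.seed(1)
demoData <- data.frame(Series = factor(rep(c("ABCO", "PIPO", "PSME"), each = 5)),
                       HMean  = runif(15, 5, 20))
demoData$BAtotal <- 2 * demoData$HMean + rnorm(15, sd = 3)

# Integer code (1, 2, 3, ...) for each plant type
series_codes <- as.numeric(demoData$Series)
series_names <- levels(demoData$Series)

# Symbols 22 to 25 are shapes that can be filled, so bg sets the fill color
plot(demoData$HMean, demoData$BAtotal,
     bg = series_codes, pch = 21 + series_codes,
     main = "Simulated data by mean height",
     xlab = "mean height", ylab = "total basal area")

# The legend uses the same codes, in the order of the factor levels
legend("topleft", legend = series_names, title = "Series",
       pt.bg = seq_along(series_names),
       pch = 21 + seq_along(series_names))
```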


C. Calculate Correlations
1. Calculate the correlation between these two variables using the code below.

cor(lidarData$HMean, lidarData$BAtotal)
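By default cor() returns the Pearson correlation coefficient. The sketch below, run on simulated values (height_demo and ba_demo are hypothetical stand-ins for the two columns), shows the rank-based Spearman alternative and cor.test(), which adds a confidence interval and p-value to the Pearson estimate.

```r
# Simulated stand-ins for mean height and total basal area (not real data)
set.seed(7)
height_demo <- runif(80, 5, 25)
ba_demo <- 1.5 * height_demo + rnorm(80, sd = 6)

# Pearson (the default) measures linear association
r_pearson <- cor(height_demo, ba_demo)

# Spearman works on ranks, so it is less sensitive to skew and outliers
r_spearman <- cor(height_demo, ba_demo, method = "spearman")

print(c(pearson = r_pearson, spearman = r_spearman))

# cor.test() reports the same Pearson estimate with a 95% confidence interval
r_test <- cor.test(height_demo, ba_demo)
print(r_test$conf.int)
```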

Part 8: Explore relationships between basal area and lidar predictor variables (many at a time)
A. Generate a correlation matrix

1. Use the code below to create a correlation matrix to explore relationships among variables simultaneously.

i. Use the cor function on the numeric columns of the data frame (columns 5 to 38) to create a new correlation matrix object.
ii. Assign the variable name corrMat to this correlation matrix.

corrMat <- cor(lidarData[,5:38])

What does the [,5:38] mean?

This is R syntax for subsetting a data frame:

- the first half of the input within the brackets, before the comma, indicates which rows to include. Here we leave it empty to include all the rows within the data frame.

- the second half of the input within the brackets, after the comma, is where we specify the columns included in the correlation matrix. We want columns 5 to 38, so we include these and use the colon to represent the ‘to’. You could also use negative numbers (-1) to exclude just one column; e.g., lidarData[,-1] excludes the first column. The first column is the plot ID; assessing a correlation between the unique plot ID number and the other variables is nonsensical.
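The bracket syntax can be practiced on a small invented data frame before applying it to the full table. In this sketch demoData, its values, and the PlotID column name are hypothetical.

```r
# Made-up table: an ID column followed by three numeric columns
demoData <- data.frame(PlotID  = 1:4,
                       BAtotal = c(21.4, 35.0, 28.7, 40.2),
                       HMean   = c(8.1, 12.4, 10.3, 14.0),
                       P70     = c(10.2, 15.8, 13.1, 17.5))

cols_2_to_4    <- demoData[, 2:4]       # all rows, columns 2 to 4
without_id     <- demoData[, -1]        # all rows, every column except the first
picked_columns <- demoData[, c(2, 4)]   # all rows, columns 2 and 4 only
first_two_rows <- demoData[1:2, ]       # rows 1 and 2, all columns

print(names(cols_2_to_4))
print(names(picked_columns))
print(nrow(first_two_rows))
```

Here demoData[, 2:4] and demoData[, -1] give the same result because the table has only four columns.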

2. Now that we have saved the correlation matrix as a variable, we can view it by typing the name of the variable and executing it.

corrMat

3. Which variables are strongly correlated with Basal Area? Which predictor variables are highly collinear?

4. Yikes – this is really hard to read! I might subset it a bit further and run it on the subset of height predictors and then on the density predictors. I am going to first create a variable that holds my subset (first line below). The c() sets up which columns to include; I set this one to include the basal area (#5) and the density metrics (#30-38). Then we create a correlation matrix on this variable (second line below). Copy, paste, and execute the following lines:

DensityMat<- lidarData[,c(5,30:38)]

corrDensityMat <- cor(DensityMat)

corrDensityMat



That’s a little better, but still kind of hard to read! We will see a way to better visualize these in the next steps using plotting functions…
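Two small tricks can make a printed correlation matrix easier to scan: rounding it, and pulling out only the row for the response variable. The sketch below uses simulated columns (the values and the cover_demo name are invented; BAtotal, HMean, and P70 are borrowed from the course table only as labels).

```r
# Simulated stand-in for a few numeric columns of the plot data
set.seed(11)
demoData <- data.frame(HMean = runif(80, 5, 25))
demoData$P70 <- 1.2 * demoData$HMean + rnorm(80, sd = 1)
demoData$cover_demo <- runif(80, 20, 90)
demoData$BAtotal <- 1.5 * demoData$HMean + 0.2 * demoData$cover_demo + rnorm(80, sd = 5)

demoCorr <- cor(demoData)

# Two decimal places are usually enough for a first look
print(round(demoCorr, 2))

# Correlations with the response only, strongest positive first
ba_corr <- sort(demoCorr["BAtotal", ], decreasing = TRUE)
print(round(ba_corr, 2))
```

The first entry of ba_corr is always BAtotal itself, with a correlation of 1.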

B. Use pairs to create scatterplots
1. Creating individual scatterplots for each possible pair of data is tedious work, but there is an alternative. We can use pairs to create multiple scatterplots. Let’s start with the subset of the data we just created, called DensityMat. Copy, paste, and execute the line below.

pairs(DensityMat)

That’s looking better! Although it’s still quite a lot of information here to look at all at once. Let’s use another function that makes this a bit more readable and replaces the duplicate plots in upper half of the matrix with correlation coefficients.
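The same idea can also be sketched in base R by giving pairs() a custom function for the upper panels. This is only an illustration on simulated data: panel_cor is a hypothetical helper written for this example, modeled on the approach shown in the pairs() help page.

```r
# Simulated stand-in for a few columns of the plot data
set.seed(3)
demoData <- data.frame(HMean = runif(40, 5, 25))
demoData$P70 <- 1.2 * demoData$HMean + rnorm(40, sd = 1)
demoData$BAtotal <- 1.5 * demoData$HMean + rnorm(40, sd = 5)

# Hypothetical helper: print the correlation in the middle of a panel
panel_cor <- function(x, y, digits = 2, ...) {
  # Temporarily rescale the panel to a 0-1 coordinate system
  old_par <- par(usr = c(0, 1, 0, 1))
  on.exit(par(old_par))
  text(0.5, 0.5, format(cor(x, y), digits = digits), cex = 1.5)
}

# Scatterplots below the diagonal, correlation coefficients above it
pairs(demoData, upper.panel = panel_cor)
```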

C. Use pairs.panels to improve on the scatterplots
1. First we need to install and load a package, called psych. I like to keep all the package calls at the top of my script. Copy the lines below and paste them on line one in the R script. Then execute them.

install.packages("psych")



library(psych)

2. Now return to the bottom of your script and type the help command in for the pairs.panels function.

?pairs.panels

3. There are quite a lot of parameters that you can adjust here. Let’s just investigate something basic. Copy, paste, and execute the code below.

pairs.panels(DensityMat,bg="black",main="total basal area by lidar density metrics",hist.col="red", scale = T)

There’s a lot of information now added to our plot window. In the upper half of the matrix, the correlation coefficient is printed; the text size is scaled by the strength of the correlation. A histogram is plotted along the diagonal (this helps show how the data are distributed). Finally, the scatterplots are in the bottom half of the matrix, each with a nonparametric trend line (a loess-smoothed fit, the red line) and a correlation ellipse.

4. You can also color code your plots by thematic attributes. I have included an example statement below (subsetting the DensityMat data set a bit more, since the output gets really busy!). If you have time later, you can explore additional input parameters. Copy, paste, and execute the code below.



pairs.panels(DensityMat[,1:4],bg=c("black","yellow","green","blue")[lidarData$Series], pch=21+as.numeric(lidarData$Series),main="Lidar data by Plant Association",hist.col="red", scale = T)

Below is the output. I like to examine the differences in metrics just to be familiar with the datasets. I think it’s interesting to note that the plant type color coded yellow, Douglas fir, tends to fall below the trendline in the plot of basal area vs. the first cover returns.
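The color-coded call above works by indexing a vector of colors with a grouping variable. Here is a minimal sketch of that mechanism, assuming the grouping column is stored as a factor. The group labels "A", "B", and "C" are made up, and how Series is actually stored in lidarData is not shown here. Since R 4.0.0, read.csv() reads text columns as character by default, and as.numeric() on character labels returns NA, so wrapping the column in factor() first may be a safer habit.

```r
# Made-up group labels standing in for lidarData$Series
grp  <- c("B", "A", "C", "A")
grpF <- factor(grp)

# Indexing by a factor uses its integer codes (A = 1, B = 2, C = 3)
cols <- c("black", "yellow", "green")[grpF]

# Point symbols 21 to 25 are filled shapes that accept a bg color
syms <- 20 + as.numeric(grpF)

cols
syms
```

Each observation receives the color and symbol of its group, which is what lets pairs.panels draw the groups differently.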

D. Optional: PerformanceAnalytics package and plotting function

1. The package called PerformanceAnalytics also makes very nice plots.

i. First install and load the package. Copy these lines and paste them at the top of your script. Execute them.

install.packages("PerformanceAnalytics")

library(PerformanceAnalytics)

2. Now examine the plot output using chart.Correlation. Copy the line below, paste it at the bottom of your R script, and execute it. Notice that asterisks are included to indicate the significance of each correlation.

chart.Correlation(DensityMat, histogram=TRUE, pch=19)
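The asterisks summarize a significance test on each correlation. For a single pair of variables, the same kind of test can be run directly with cor.test() from base R. The data below are simulated and the variable names are placeholders, so treat this as a sketch rather than part of the exercise workflow.

```r
set.seed(42)
x <- rnorm(30)
y <- 2 * x + rnorm(30)   # y depends on x, so a clear correlation is expected

ct <- cor.test(x, y)     # Pearson correlation test by default

ct$estimate   # the correlation coefficient
ct$p.value    # small values suggest the correlation is unlikely to be zero
```

A significant correlation is not necessarily a strong one, so it is worth reading the coefficient and the p-value together.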



E. Optional: corrplot package and plotting function

1. One final plotting package I like is corrplot.

i. First install and load the package. Copy these lines and paste them at the top of your script. Execute them.

install.packages("corrplot")

library(corrplot)

2. Now examine the correlation plots you can make using the corrplot function. We’ll turn the correlation matrix we created earlier, corrDensityMat, into a series of color-coded plots. Copy the lines below and paste them at the bottom of your R script. Execute each line one by one and notice what happens in the plot (the line creating the variable called col1 will not plot anything).

i. The parameter add=TRUE in the last line overlays the new plot on the existing plot graphic.

## create color series variable called col1

col1 <- colorRampPalette(c("#7F0000","red","#FF7F00","yellow","white","cyan", "#007FFF", "blue","#00007F"))



## use color series variable in plotting function

corrplot(corrDensityMat, method="color", col=col1(20), cl.length=21,order = "AOE", addCoef.col="grey")

corrplot(corrDensityMat, method="square", col=col1(200))

corrplot(corrDensityMat, method="ellipse", col=col1(20), add=TRUE, type = "upper")

The shade and the size of the box indicate the strength of the correlation. The darker blue, larger boxes indicate the stronger relationships. We don’t have any negative correlations in this data set, but if we did, the corresponding cell in the matrix would be colored in hot colors (yellow to red).

Part 9: Putting it all together

A. Start thinking about the model inputs



1. Based on the data exploration, which density metrics do you think will best predict the total basal area? Hint: look at the graphic on page 20. Which variables had the highest correlation with the total basal area?

i. A: all_1st_cover_mean and x1st_cover_mean had the highest correlation values. Also note that these two have a correlation of 0.99 with each other, indicating they are highly collinear. Both will likely be good predictors of total basal area, but we don’t want to include both in our model because of their collinearity.
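Collinear pairs can also be listed programmatically rather than read off the plot. The sketch below is one possible approach on simulated data; the 0.9 cutoff and all variable names are assumptions chosen for illustration.

```r
set.seed(7)
x1 <- rnorm(40)
toy <- data.frame(x1 = x1,
                  x2 = x1 + rnorm(40, sd = 0.1),  # nearly a copy of x1
                  x3 = rnorm(40))                 # unrelated variable

cm <- cor(toy)

# Keep each pair once by blanking the diagonal and the lower triangle
cm[lower.tri(cm, diag = TRUE)] <- NA

# Row and column positions of the pairs above the cutoff
hits <- which(abs(cm) > 0.9, arr.ind = TRUE)

collinear <- data.frame(var1 = rownames(cm)[hits[, "row"]],
                        var2 = colnames(cm)[hits[, "col"]],
                        r    = cm[hits])
collinear
```

From each pair reported this way, only one variable would normally be carried forward as a model input.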

B. Repeat the steps in Part 8 for the other categories of variables.

Recall that usually three variables are important predictors of biomass-related forest inventory metrics, such as total basal area: one related to height, one related to canopy cover, and one that describes the variation in the data. Now that we’ve examined the ones related to canopy cover (the density measures), we need to explore the ones related to height.

1. There are quite a few more height measures than density (or canopy cover) measures; refer to the following graphic.

2. I recommend you break these up into two tables to plot. Use the code below to create new variables holding the response and a subset of the predictor variables, then work through the steps in Part 8 and Part 9 A with these two new subsets.

Height1Mat<- lidarData[,c(5,6:14)]

Height2Mat<- lidarData[,c(5,15:29)]

I found that HMean, P60 and P75 had the strongest correlations with the total basal area per plot. These relationships also appeared to be linear with total basal area (and the three metrics are collinear with each other).



Plot made with this call:

pairs.panels(Height1Mat,bg="black",main="total basal area by lidar height metrics",hist.col="red", scale = T)

Part 10: Optional: Bonus code with best predictor variables

A. Subset data to include best predictors

1. Copy, paste, and execute the code below.

names(lidarData)

#'HMean',"P60",'P75','x1st_cover_mean','all_1st_cover_mean'

BestPredictorsMat<- lidarData[,c(5,8,23,25,33,37)]

# make sure these are the right column numbers

head(BestPredictorsMat)
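Selecting columns by name is less fragile than selecting by position, because it does not depend on the column order. A sketch on a toy data frame follows; the column names are placeholders, so check names(lidarData) for the real spelling before adapting it.

```r
# Toy data frame; names are hypothetical stand-ins for lidarData columns
toy <- data.frame(PlotID = 1:3,
                  BA     = c(12, 30, 25),
                  HMean  = c(6, 14, 11),
                  P60    = c(7, 16, 12),
                  HMax   = c(10, 22, 18))

wanted <- c("BA", "HMean", "P60")

# Stop early if a name is misspelled or missing
stopifnot(all(wanted %in% names(toy)))

# Column numbers that correspond to the names
colNums <- match(wanted, names(toy))
colNums

# Subset by name instead of by number
bestToy <- toy[, wanted]
head(bestToy)
```

The match() call is also a quick way to confirm that a set of column numbers really points at the intended variables.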



B. Create new plots with this data subset

1. Copy, paste, and execute the code below to make new plots.

pairs.panels(BestPredictorsMat,bg="black",main="total basal area by most correlated lidar metrics",hist.col="red", scale = T)

pairs.panels(BestPredictorsMat,bg=c("black","yellow","green","blue")[lidarData$Series], pch=21+as.numeric(lidarData$Series),main="Lidar data by Plant Association",hist.col="red", scale = T)

Part 11: Data exploration wrap up

A. Explore the range of variability of lidar metrics captured in the sample.

1. Use the code below to call the range() function on the variable called HMax within the lidarData data frame.

i. Copy and paste this into your script and run it.

range(lidarData$HMax)

ii. This will return a vector of two numbers describing the range of observed values: the first number is the minimum and the second is the maximum.
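The two values returned by range() can be picked out by position, and the same function can be applied to every column at once. The sketch below uses a toy data frame with invented values; only the column names echo the exercise data.

```r
toy <- data.frame(HMax  = c(12.5, 30.1, 22.4),
                  HMean = c(6.0, 14.2, 11.3))

r <- range(toy$HMax)
r[1]  # minimum
r[2]  # maximum

# Ranges for every column at once: one column per variable, rows are min and max
allRanges <- sapply(toy, range)
rownames(allRanges) <- c("min", "max")
allRanges
```

This assumes every column is numeric; a text column such as a plant association label would need to be left out first.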

B. Use the summary function to simultaneously look at the statistics for all of the variables in the dataset.

1. Use the code below to call the summary() function on the entire lidarData data frame. Copy and paste the code below into your script and run it.

summary(lidarData)

2. This will return a table containing the minimum, first quartile, median, mean, third quartile and maximum value for each numeric variable in the dataset.

3. This could be saved and used to compare the ranges within the wall-to-wall lidar gridmetrics. Modify the summary line to store it as a variable, then use the write.csv() function to save it in your working directory.

i. After you run the code, check the outputs folder in your working directory. You should find a new csv file called summaryData.csv.

summaryData<-summary(lidarData)

write.csv(summaryData,file = 'outputs/summaryData.csv')
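The object returned by summary() is a table of text formatted for printing, so the saved csv can be awkward to compare against other numbers. As an alternative sketch (not the method used in this exercise), a small numeric table can be built with sapply(). The toy data and the temporary output path below are assumptions made only for illustration.

```r
toy <- data.frame(HMax  = c(12.5, 30.1, 22.4),
                  HMean = c(6.0, 14.2, 11.3))

# One row per variable, with numeric columns for min, mean, and max
stats <- data.frame(variable = names(toy),
                    min  = sapply(toy, min),
                    mean = sapply(toy, mean),
                    max  = sapply(toy, max),
                    row.names = NULL)

outFile <- tempfile(fileext = ".csv")   # temporary path, used only for this sketch
write.csv(stats, file = outFile, row.names = FALSE)

# Read the file back to confirm what was written
readBack <- read.csv(outFile)
readBack
```

In the exercise itself, the file argument would point at the outputs folder instead of a temporary file.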

Note: the range of variability captured in the sample is important since these observations are used to train and build models for predicting structural metrics across the landscape. If the samples don’t adequately capture the range in lidar metric variability across the landscape, models will extrapolate (predict beyond the range of the observations or sample data) when applied to the wall-to-wall gridmetrics. This is something that should be considered in the sample design for inventory modeling plots (White et al. 2013). Stratified sampling based on the lidar structural metrics can lead to better models and predictive power (Hawbaker et al. 2009).
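A simple check for extrapolation is to flag landscape values that fall outside the range observed at the plots. The sketch below uses two short made-up vectors; plotHMax and gridHMax are hypothetical names, and reading the real gridmetrics is outside the scope of this example.

```r
plotHMax <- c(12.5, 30.1, 22.4, 18.0)        # values observed at field plots
gridHMax <- c(5.2, 14.8, 29.9, 35.6, 21.0)   # values from wall-to-wall cells

sampleRange <- range(plotHMax)

# TRUE where a grid cell lies outside the sampled range,
# meaning a model would have to extrapolate there
outside <- gridHMax < sampleRange[1] | gridHMax > sampleRange[2]

outside
mean(outside)   # fraction of cells outside the sampled range
```

A large fraction of flagged cells would suggest that the plots do not cover the variability present across the landscape.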

C. Review your script, add comments, and save it for future reference.

1. Review the code that you have entered in your script 01_GettingStartedAndExploringData.R and, above each line of code, add a comment explaining what the code does. Use the comment character # at the beginning of the line to add a note that will not be run in R as code. Follow the example below:

# This is a comment. Text following the # symbol will not run as code.

# The code below creates a correlation matrix for all the variables EXCEPT the first, which is the plot ID

corrMat<-cor(lidarData[,-1])

D. When you have finished, save your script for future reference.

1. Save by choosing File, then Save, or by pressing Ctrl+S on your keyboard.

Congratulations! You have successfully completed the first exercise and gained some familiarity with the R programming language using the RStudio IDE. R has a steep learning curve, and this exercise was not designed to teach you everything; it is meant to introduce you to a powerful set of tools for statistical analysis.

Date post:	21-May-2020