Amazon PIRE Data Processing Tutorial

A guide to file management, data formatting, visualization, and analysis

S. C. Wofsy, January 2010, version -1.0

Introduction: Goals and scope of the tutorial

Scientific data

Scientific data are typically created as, or converted to, electronic files providing information in terms of numerical data complemented by metadata in terms of data descriptors (time, location, units, etc.). Analysis of these data usually proceeds in a series of steps:

1. Acquisition of the data in electronic file format
2. Formatting of the data set to enable it to be read using a data analysis application
3. QA/QC of the data, using visualization tools, statistical tools, etc.
4. Assessment of the data
5. Analysis of the data to provide quantitative information and data products.


This tutorial is intended to help students prepare for analysis of the data sets that they will obtain during the PIRE summer field course, and in their studies and careers afterward.

A key principle is that all of the steps 1-5 above must be traceable and reproducible.

Often students may wish to explore and assess data files using graphical user interfaces (GUIs), and the tutorial will help students develop their skills with GUIs.

However, our key principle translates into the following requirement: the entire process must be repeatable starting from the most raw version of the data.

We must therefore eschew commonly used spreadsheet programs in favor of much more capable object-oriented data analysis applications. These programs may be applied using both powerful GUIs, which record for future application each command that you execute, and scripts that are essentially sets of command line instructions.

Also, analysis of environmental data will lead us from simple statistical tests and figures to sophisticated, rigorous procedures and carefully customized graphics, providing additional impetus to develop the expertise to use data manipulation and analysis programs.

Finally, colleagues do not use the same computer systems or have the same licenses for software. The applications we will use, and our other data products, will be independent of platform and operating system, and will be open source and free of licensing fees.


Telecons/Webcasts: Professors Wofsy, Saleska, and/or the proctor will lead section-type discussions where students can ask questions and receive assistance. Dates and times to be announced. The proctor will offer assistance via email throughout.

All students in the 2010 Amazon PIRE field course are expected to complete the Basic tutorials (R or Octave/Matlab). You are strongly encouraged to attend the Telecon/Webcasts, which will be scheduled in the evening hours to facilitate attendance.

Session 1. Preparing your computer.

Session 2a. Basic R-tutorial, part 1.
Session 3a. Basic R-tutorial, part 2.
Session 4a. Intermediate R-tutorial.

Session 2b. Basic Octave/Matlab tutorial, part 1.
Session 3b. Basic Octave/Matlab tutorial, part 2.
Session 4b. Intermediate Octave/Matlab tutorial.


This tutorial provides training for students to analyze data using the following applications, free for downloading (with proprietary equivalents):

R (Splus)
GNU-Octave (Matlab)

Students who know how to use IDL, and have licenses for this application, can use IDL for data analysis.

Important note:

Experience has shown that Excel and similar spreadsheet programs cannot be successfully applied to analysis of PIRE data sets. Students can readily ingest data into the spreadsheets, but then find it extremely difficult to clean and assess data, and their manipulations are not traceable. Therefore the use of Excel for data analysis will not be permitted in the PIRE summer study.

The tutorial provides specific help for students using the following operating systems on their computers:

Microsoft Windows (XP)
Apple/Mac Leopard and Snow Leopard
Linux (Ubuntu)

Adjustments may need to be made for other versions of these operating systems.

Preparing your computer

Download and install an easy-to-use text editor with syntax highlighting. This is required so that you can edit data files and scripts without any changes in file format. These applications also provide color-coded "syntax highlighting" that greatly facilitates writing and editing program scripts.

Win: notepad++
Mac: Jedit.app (use the Java installation procedure)
Linux: gedit

"Office"-type applications (MS-Word/WordPad, Pages, etc.) cannot be used. Vanilla Notepad (Win) and TextEdit (Mac) are inadequate.

Download and install your data analysis application; we suggest R unless you are already familiar with Matlab/Octave.

R (http://www.r-project.org/) [Ubuntu users download from the Repository, r-cran]
Octave (http:// ) or Matlab (installation disk)

Win only: Download and install the required file management tools from Gnuwin32 (http://gnuwin32.sourceforge.net/packages/packages.html): "coreutils", "which", "gzip", "tar", and "grep"

Important: Matlab or IDL installations requiring a license server will not work in Manaus!


Win only: Make adjustments to your path environment variable.

Start => control panel => system => advanced tab, click "Environment Variables".
Select "path" under system variables, and add the following to the end of path:

;c:\program files\gnuwin32\bin;c:\program files\R\<name R version>\bin;c:\program files\notepad++\bin

or similarly if you are using Octave or Matlab

To find the full path to your R version, use Windows Explorer to navigate to the application file "R.exe". You can copy the path from the address bar (do not include "R.exe" itself in the path).

Install a shortcut to cmd.exe on your desktop or quicklaunch (in C:\WINDOWS\system32)

In Windows Explorer, click tools => folder options, uncheck "Hide extensions" and check "Show hidden files and folders".

Some participants who are borrowing an institutional computer may lack the permissions to undertake these changes. Have your system administrator give you the permissions, or if they cannot, have them make these changes.


Linux, Mac: Put the Terminal application on your application bar.

Mac only:

Install X-code from your Mac installation CD (to install packages). Install MacPorts/Darwinports from the Internet (http:// ). Install "gfortran".

Open the Terminal application, and in your home folder (/Users/your_username) add the following lines to the file .profile, using the editing program you have installed.

defaults write com.apple.finder AppleShowAllFiles TRUE
killall Finder

Ubuntu only: Install c++, gcc, and gfortran from the repository (may be needed to install some R packages).


Learning to use your computer (as a computer)

Hands-on activities:

• find out how to use a command
• create your data file structure
• list files; locate files (GUI ok)
• find out how big a file is, how many lines it has, etc.
• create and edit a simple file: a data file; an R script
• copy, move and delete files
• search for strings within files


Find information on how to use a command

Win: From the Desktop: Look up the help information for cmd; all the commands are listed.
From within the cmd window, type "<command> /?" or "<command> -h"
Examples: mkdir /?
          pwd -h

Linux, Mac: From the Terminal: "man <command>" e.g. man mkdir


Data file structure

You will need a convenient place to put your data files and the scripts that will analyze them. Since the path to this folder will have to be specified in your scripts, keep the name short and the location easy to find. Do not include any symbols other than letters, numbers, and "_"; no spaces should be used.

Good locations might be c:\pire (Win) or $HOME/pire for Linux/Mac. You will need subfolders for data, scripts, etc.

You may use the GUI to do this (Windows Explorer (Win), Finder (Mac), or Nautilus (Ubuntu)), but this is a good place to start using the command window (Win) or Terminal application (Mac/Linux).

Using the command/terminal window:

Win:                      Linux, Mac:
mkdir c:\pire             mkdir $HOME/pire
mkdir c:\pire\scripts     mkdir $HOME/pire/scripts
etc.                      etc.

Note: $HOME refers to your home directory on Linux/Mac (type "echo $HOME" from the terminal).

Typing "mkdir pire" has the same effect as the above if you are working in the home folder ("cd c:\" or "cd $HOME"; "cd" = change directory).
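The folder layout described here can be built in one pass from a Linux/Mac terminal. The following is only a minimal sketch: the variable name PIRE_ROOT is hypothetical, and it points at a scratch location so that trying the commands does not touch your real home folder. For real work you would set PIRE_ROOT="$HOME/pire".

```shell
#!/bin/sh
# Sketch: build the tutorial folder layout under a scratch location.
# PIRE_ROOT is a hypothetical variable; for real work use PIRE_ROOT="$HOME/pire".
PIRE_ROOT="$(mktemp -d)/pire"

# -p creates any missing parent folders and does not complain
# if a folder already exists, so the command can be re-run safely
mkdir -p "$PIRE_ROOT/data" "$PIRE_ROOT/scripts"

# Confirm what was created (one entry per line)
ls -1 "$PIRE_ROOT"
```

Because the commands can be re-run without error, they can be kept in a script and repeated on another computer, in keeping with the requirement that every step be reproducible.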


Find properties of the files in a folder

Before leaving c:\ or your home directory, try finding out about the files in the folder.

ls -al (lists files and their properties; ls -1: short list; ls -alt: in time order, …)
wc (gives number of lines, number of words, and number of bytes in a file; wc <filename> reports on only the named file; "*" is a wildcard)
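A short session shows how these two commands behave on a small file. This is a sketch under stated assumptions: the scratch folder and the file name sample.txt are hypothetical and exist only for the demonstration.

```shell
#!/bin/sh
# Sketch: inspect a small file with ls and wc.
# The scratch folder and the file name sample.txt are hypothetical.
workdir="$(mktemp -d)"
cd "$workdir" || exit 1

# Write a three-line file to look at
printf 'X Y\n1 0.53\n2 4.75\n' > sample.txt

ls -al          # long listing of the folder, including hidden entries
wc sample.txt   # lines, words, and bytes for the named file

# Reading from standard input gives the bare line count (no file name);
# tr removes the padding that some versions of wc add
nlines=$(wc -l < sample.txt | tr -d ' ')
echo "sample.txt has $nlines lines"
```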

Some notable anomalies:

Linux treats upper and lower case commands, filenames etc as different. Windows ignores upper/lowercase. Mac sometimes ignores case and sometimes does not. To make your work transportable, assume upper and lower case filenames are different, but do not give different files the same name with different case.

Folder names in a path are distinguished by a forward slash "/" in Linux and Mac, and a backward slash "\" in Windows. Windows also recognizes the "/" but inconsistently, and all three recognize the "\" as an "escape character" that affects the treatment of the following character (Windows inconsistently).

Spaces ("<space>") are used to separate parts of a command. To reference a file or folder with a <space> in its name, the name should be surrounded by quotes. Avoid putting spaces in file names.
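The effect of a space in a name is easy to see from a Linux/Mac terminal. This is a minimal sketch run in a scratch folder; all folder names in it are hypothetical.

```shell
#!/bin/sh
# Sketch: how the shell treats spaces in names. All names are hypothetical.
workdir="$(mktemp -d)"
cd "$workdir" || exit 1

# Quoted: one folder whose name contains a space
mkdir "site notes"

# Unquoted: the space separates two arguments, so two folders are made
mkdir site notes

# Recommended instead: no space at all
mkdir site_notes

ls -1
```

The listing contains four folders, which is why names built only from letters, numbers, and "_" are easier to work with in scripts.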


Create and edit simple files from the cmd or Terminal window

Change directory to pire\data (Win) or pire/data (Linux/Mac)

Open your editing application:

Win: notepad++.exe
Mac: open /Applications/Jedit.app
Linux: gedit

Create a file with the following content, and save it into the folder pire/data with file name "testfile.txt":

X Y
1 0.53
2 4.75
3 9.37
4 16.38
5 24.67
6 37.34
7 48.44
8 64.41
9 81.93
10 99.83
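As an alternative to the editor on Linux/Mac, the same content can be written from the terminal and then checked. This is only a sketch: the scratch folder stands in for pire/data, and the variable names are hypothetical.

```shell
#!/bin/sh
# Sketch: write the test file with a here-document instead of an editor,
# then confirm its shape. The scratch folder stands in for pire/data.
datadir="$(mktemp -d)"

cat > "$datadir/testfile.txt" <<'EOF'
X Y
1 0.53
2 4.75
3 9.37
4 16.38
5 24.67
6 37.34
7 48.44
8 64.41
9 81.93
10 99.83
EOF

# Count lines: 1 header line plus 10 data rows
nlines=$(wc -l < "$datadir/testfile.txt" | tr -d ' ')
echo "lines: $nlines"

# Report any line that does not have exactly two columns
# (prints nothing when every line is well formed)
awk 'NF != 2 {print "bad line " NR ": " $0}' "$datadir/testfile.txt"
```

A quick check like this right after creating a file catches missing or merged columns before the file is read into an analysis program.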


Also, create the following file with name "testfile0.txt"

x.1 x.2
0 0.9053910
1 -0.4245758
2 2.1638530
3 3.2013392
4 1.0568681
5 2.7682038
6 2.7272512
7 2.8435819
8 6.3021333
9 6.0850179
0 5.9820819
11 6.8738404
12 5.6178844
13 5.9797397
14 6.8131798
15 8.2127355
16 8.5939752
18 8.9382829
19 8.6808897
11 8.9008140
20 10.0755642


Exercise 1. "Learning to use your computer."

1. Make copies of testfile.txt called testfile_copy.txt and dummy.txt using the command cp (in Windows, "copy" will also work). Check the result using your installed special editor (not the default editor). Then remove dummy.txt using the command rm (del will also work in Win). Then rename/move file testfile_copy.txt to testfile_newcopy.txt using the command mv (move will also work in Win). Make a listing of the contents of your folder using ls -al > filelist.txt. Hand in electronic files testfile.txt and filelist.txt.

2. The command grep " 6" filename > newfile selects every line in "filename" that contains a space followed by the number 6, and puts the output of the command grep into a file called newfile. Apply this command to find the lines that have a <space>5 in testfile0.txt and put them into a file called result.txt. Hints: First execute this command without the "> newfile" part and check that the output on the screen is what you expect; then add the "> newfile" part and inspect "newfile" to see if it contains the expected results. The symbol ">" directs the output of the command grep into file "newfile". Hand in result.txt.

3. The command awk '{print $n}' filename > newfile extracts the nth column from file "filename" and places the output into "newfile". Extract the 2nd column of testfile.txt and put it into a file called testfile_col2.txt. Note: In Windows use " rather than ' in this command. Hand in testfile_col2.txt.

To submit answers, create a zip file (zip myname_ex1.zip <list of files>) and email to the proctor (email: xxxxxxxx0).
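Before starting the exercise, it may help to see the same commands at work on a throwaway file. The sketch below is run in a scratch folder on Linux/Mac; the file names (demo.txt, matches.txt, col2.txt) and the search pattern are hypothetical and deliberately differ from those in the exercise.

```shell
#!/bin/sh
# Sketch of the commands used in Exercise 1, applied to a hypothetical
# scratch file (demo.txt) rather than to the exercise files themselves.
workdir="$(mktemp -d)"
cd "$workdir" || exit 1
printf 'id value\n1 2.5\n2 7.1\n3 2.9\n' > demo.txt

cp demo.txt demo_copy.txt           # copy
mv demo_copy.txt demo_renamed.txt   # rename/move
rm demo_renamed.txt                 # delete

# ">" sends the output of a command to a file instead of the screen
ls -al > filelist.txt

# Keep only the lines that contain a space followed by the digit 2
grep " 2" demo.txt > matches.txt

# Keep only the second column
awk '{print $2}' demo.txt > col2.txt

cat matches.txt
cat col2.txt
```

Note that the line "2 7.1" is not selected by grep: its "2" is the first character on the line, so it is not preceded by a space.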


R-tutorial (Octave/Matlab users skip to "Octave Tutorial")

The basic R-tutorial covers the first two chapters (11 pages) of the document R-intro.pdf "An introduction to R" by W. N. Venables, D. M. Smith and the R Development Core Team, plus items from some of the other sections listed below.

Getting started: Read Chapters 1 and 2 of "An introduction to R", being sure to type into your computer each R command shown in the chapter. Take careful note of the results. Learn about, and try out, the command setwd("foldername") . When you complete this reading, save the result in the file pire/scripts/tutorial.r using the savehistory("filename") command. After closing R, open this file with your editor and note the syntax highlighting.

Some notable anomalies:

Due to the conflict involving the Windows "\" symbol, folder separators in filenames referenced within R are designated with two backslashes ("\\") or one forward slash ("/"). Don't mix these in one path/file name.

When a data frame is created by reading a file into R using "read.table()", columns of alphabetic data are by default made into "factors". This should be prevented using the argument "as.is=T" in the invocation of read.table().

Example:
Win: read.table("c:/pire/data/testfile0.txt", as.is=T) or read.table("c:\\pire\\data\\testfile0.txt", as.is=T)
Linux, Mac: read.table("~/pire/data/testfile0.txt", as.is=T) (within R, write the home folder as "~"; R does not expand "$HOME" inside a quoted file name)


*Basic Tutorial components:

What is "object-oriented programming"? What are "attributes"?
Matrix and data frame: creating and manipulating
Plotting data, exploring data
Fitting data to a straight line; to a curved line; ordinary regressions and RMA regressions
Simple statistics on data: means, medians, quantiles, t-test, confidence intervals
Outliers; time series of data
Saving your work: objects, commands, functions, graphs

*Intermediate tutorial

Scripting: what, why, how.

*Data sets

Tree diameter data
Soil flux chamber data
Temperature data


Exercise 2 *Basic Tutorial

1. Create data frames from the files testfile.txt and testfile0.txt that you made earlier in the tutorial. Hint: use the header=T argument to ensure that the columns will have the colnames attribute.

2. Make graphs of Y vs X and x.2 vs x.1 using the names of the columns in the plotting command. Save the figures as "png" or "jpg" graphics files (use dev.copy( ) followed by dev.off( )).

3. Fit the data to polynomials, e.g. Y = a1 + a2*X + a3*X^2 + … , selecting the order of the polynomial by looking at the graphs you have made. Hint: You will create an object with the command

<fitted object name> = lm ( Y ~ X + I(X^2) + ... , <maybe other arguments>)

Note: inside an R model formula, a power term must be wrapped in I( ), as in I(X^2). A bare X^2 is read as formula syntax and does not add a squared term to the fit.

4. Plot your best fit curve on the graph of Y vs X. Hint: look at what is accomplished by the function predict().

5. Use summary() to examine the parameters of the fit and their uncertainties.

6. Read in the file T-test-file.txt downloaded from the website. Read about the t-test (http:// …). Examine the paired variables A and B from the file as to whether their respective means are different in a statistically significant way.


Octave/Matlab-tutorial

Functionally equivalent to the R-tutorial


Selected sections of "An introduction to R":

1 Introduction and preliminaries
  1.1 The R environment
  1.2 Related software and documentation
  1.3 R and statistics
  1.4 R and the window system
  1.5 Using R interactively
  1.6 An introductory session
  1.7 Getting help with functions and features
  1.8 R commands, case sensitivity, etc.
  1.9 Recall and correction of previous commands
  1.10 Executing commands from or diverting output to a file
  1.11 Data permanency and removing objects

2 Simple manipulations; numbers and vectors
  2.1 Vectors and assignment
  2.2 Vector arithmetic
  2.3 Generating regular sequences
  2.4 Logical vectors
  2.5 Missing values
  2.6 Character vectors
  2.7 Index vectors; selecting and modifying subsets of a data set
  2.8 Other types of objects

3 Objects, their modes and attributes
  3.1 Intrinsic attributes: mode and length
  3.2 Changing the length of an object
  3.3 Getting and setting attributes

5 Arrays and matrices
  5.1 Arrays
  5.2 Array indexing. Subsections of an array

6 Lists and data frames
  6.1 Lists
  6.2 Constructing and modifying lists
    6.2.1 Concatenating lists
  6.3 Data frames
    6.3.1 Making data frames
    6.3.2 attach() and detach()
    6.3.3 Working with data frames
    6.3.4 Attaching arbitrary lists
    6.3.5 Managing the search path

7 Reading data from files
  7.1 The read.table() function
  7.2 The scan() function
  7.3 Accessing builtin datasets
    7.3.1 Loading data from other R packages
  7.4 Editing data

9 Grouping, loops and conditional execution
  9.1 Grouped expressions
  9.2 Control statements
    9.2.1 Conditional execution: if statements
    9.2.2 Repetitive execution: for loops, repeat and while

10 Writing your own functions
  10.1 Simple examples

12 Graphical procedures
  12.1 High-level plotting commands
    12.1.1 The plot() function
    12.1.2 Displaying multivariate data
    12.1.3 Display graphics
    12.1.4 Arguments to high-level plotting
  12.2 Low-level plotting commands

Appendix A A sample session
Appendix B Invoking R
  B.1 Invoking R from the command line
  B.2 Invoking R under Windows
  B.3 Invoking R under Mac OS X
  B.4 Scripting with R
Appendix C The command-line editor
  C.1 Preliminaries
  C.2 Editing actions
  C.3 Command-line editor summary

Optional reading

    12.2.1 Mathematical annotation
  12.3 Interacting with graphics
  12.4 Using graphics parameters
    12.4.1 Permanent changes: The par() function
    12.4.2 Arguments to graphics functions
  12.5 Graphics parameters list
    12.5.1 Graphical elements
    12.5.2 Axes and tick marks
    12.5.3 Figure margins
    12.5.4 Multiple figure environment
  12.6 Device drivers
    12.6.1 PostScript diagrams for typeset documents
    12.6.2 Multiple graphics devices
  12.7 Dynamic graphics
13 Packages
  13.1 Standard packages
  13.2 Contributed packages and CRAN
  13.3 Namespaces


Default installations of R should have the following packages: base, stats, stats4, graphics, grDevices, and a few others (type "library()" to list what you have).

Using "install.packages()", try adding the following packages (some may not install… don't be concerned). Type "help(install.packages)" or "help.search("install packages")" to see how to use the function "install.packages()":

akima      Interpolation of irregularly spaced data
datasets   The R Datasets Package
fields     Tools for spatial data
foreign    Read Data Stored by Minitab, S, SAS, SPSS, Stata, Systat, dBase, ...
gstat      Geostatistical package
lattice    Lattice Graphics
mapdata    Extra Map Databases
mapproj    Map Projections
maps       Draw Geographical Maps
matlab     MATLAB emulation package
Matrix     Sparse and Dense Matrix Classes and Methods
sp         classes and methods for spatial data
spatial    Functions for Kriging and Point Pattern Analysis
splines    Regression Spline Functions and Classes
splus2R    S-PLUS functionality missing from R
tseries    Time series analysis and computational finance
utils      The R Utils Package

