Date post: 24-Jul-2020
BIOSTATS 640 – Spring 2018
Introduction to R and R-Studio
Data Description
Source file: R Data Description II Spring 2018.docx (13 pages)

1. Start of Session
   a. Preliminaries
   b. Install Packages
   c. Attach Packages
2. Load R Data
   a. Load R data frames
   b. Check
3. SINGLE VARIABLE - Discrete
   a. Numerical
   b. Graphical
4. SINGLE VARIABLE - Continuous
   a. Numerical
   b. Graphical
5. TWO VARIABLES – One Continuous, One Discrete
   a. Numerical
   b. Graphical
6. TWO VARIABLES – BOTH Continuous
   a. XY Scatter Plot
1. Start of Session

# 1. START OF SESSION

# 1a. Preliminaries
# setwd( ) to set working directory
# Tips: 1) R wants forward slashes and 2) the path must be enclosed in quotes
setwd("/Users/cbigelow/Desktop/")
# Clear workspace
rm(list=ls())
# Turn OFF scientific notation
options(scipen=1000)

# 1b. Install packages (one time)
# Tip - package name MUST be enclosed in quotes
# Note - You'll get lots of stuff appearing in your console window.
# Dear Reader - I have commented out these installations because I already have them.
# install.packages("mosaic")
# install.packages("DescTools")
# install.packages("psych")
# install.packages("ggplot2")

# 1c. Attach packages to be used in this session (1x per session)
library(mosaic)
library(DescTools)
library(psych)
library(ggplot2)
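
The one-time install step in 1b can also be written so that only packages not yet present get installed. What follows is a minimal sketch in base R, not part of the handout: the helper name missing_packages is hypothetical, and the actual install.packages() call is left commented out.

```r
# Hypothetical helper (not from the handout): return the names of the
# packages in `pkgs` that are not currently installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

needed <- c("mosaic", "DescTools", "psych", "ggplot2")
to_get <- missing_packages(needed)
print(to_get)   # empty when every package is already installed

# Uncomment to install only what is missing:
# if (length(to_get) > 0) install.packages(to_get)
```

Checking with requireNamespace() does not attach the package, so library() is still needed once per session, as in 1c.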

2. Load R Data

# 2a. LOAD R dataframes
# Tip - Be sure you have downloaded data and placed it on desktop
# At right, click on ENVIRONMENT tab to confirm all is well.
setwd("/Users/cbigelow/Desktop/")
load(file="larvae.Rdata")
load(file="ivf.Rdata")

# 2b. Check
# str(DATAFRAME) to check structure of dataframe
str(ivf)

## 'data.frame':    641 obs. of  6 variables:
##  $ id     : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ matage : int  33 34 34 30 35 37 31 31 33 33 ...
##  $ hyp    : int  0 0 0 0 0 0 0 1 1 0 ...
##  $ gestwks: num  37.7 39.2 35.7 39.3 38.4 ...
##  $ sex    : Factor w/ 2 levels "male","female": 2 2 2 1 2 1 1 2 1 2 ...
##  $ bweight: int  2410 2977 2100 3270 2620 3260 3750 1450 3200 3675 ...
##  - attr(*, "datalabel")= chr "In Vitro Fertilization data"
##  - attr(*, "time.stamp")= chr "14 Feb 2017 09:55"
##  - attr(*, "formats")= chr  "%9.0g" "%8.0g" "%8.0g" "%9.0g" ...
##  - attr(*, "types")= int  254 251 251 254 251 252
##  - attr(*, "val.labels")= chr  "" "" "" "" ...
##  - attr(*, "var.labels")= chr  "identity number" "maternal age (years)" "hypertension (1=yes, 0=no)" "gestational age (weeks)" ...
##  - attr(*, "version")= int 12
##  - attr(*, "label.table")=List of 1
##   ..$ sex: Named int  1 2
##   .. ..- attr(*, "names")= chr  "male" "female"

# head(DATAFRAME) to display first 6 rows
head(ivf)

##   id matage hyp gestwks    sex bweight
## 1  1     33   0   37.74 female    2410
## 2  2     34   0   39.15 female    2977
## 3  3     34   0   35.72 female    2100
## 4  4     30   0   39.29   male    3270
## 5  5     35   0   38.38 female    2620
## 6  6     37   0   37.86   male    3260

# tail(DATAFRAME) to display last 6 rows
tail(ivf)

##      id matage hyp gestwks    sex bweight
## 636 636     34   0   41.15   male    2972
## 637 637     28   0   38.58 female    2850
## 638 638     38   1   38.44   male    3182
## 639 639     26   0   38.94 female    3048
## 640 640     31   0   40.43 female    3183
## 641 641     31   0   38.15   male    2920

# summary(DATAFRAME) to get quick descriptives on every variable
summary(ivf)

##        id          matage           hyp            gestwks
##  Min.   :  1   Min.   :23.00   Min.   :0.0000   Min.   :24.69
##  1st Qu.:161   1st Qu.:31.00   1st Qu.:0.0000   1st Qu.:38.01
##  Median :321   Median :34.00   Median :0.0000   Median :39.15
##  Mean   :321   Mean   :33.97   Mean   :0.1393   Mean   :38.69
##  3rd Qu.:481   3rd Qu.:37.00   3rd Qu.:0.0000   3rd Qu.:40.15
##  Max.   :641   Max.   :43.00   Max.   :1.0000   Max.   :42.35
##                                NA's   :2
##      sex         bweight
##  male  :326   Min.   : 630
##  female:315   1st Qu.:2850
##               Median :3200
##               Mean   :3129
##               3rd Qu.:3550
##               Max.   :4650
##
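
The NA's :2 entry in the summary output flags missing values. One quick way to count missing values column by column is sketched below in base R. The small data frame toy is made-up illustration data, not the ivf data, and its values are assumptions chosen only to make the example run.

```r
# Made-up data frame for illustration (NOT the ivf data)
toy <- data.frame(
  matage  = c(33, 34, 30, 35),
  hyp     = c(0, NA, 1, NA),
  bweight = c(2410, 2977, 3270, 2620)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Rows with at least one missing value
print(toy[!complete.cases(toy), ])
```

The same two calls should work on any data frame, for example colSums(is.na(ivf)) once ivf.Rdata is loaded.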

3. SINGLE VARIABLE - DISCRETE

# 3. SINGLE VARIABLE - DISCRETE

# 3a. Numerical
# 3a.i. Frequency and relative frequency table - brute force
# Step 1: Create columns of table
n <- length(ivf$sex)
sex_freq <- table(ivf$sex)
sex_relfreq <- sex_freq/n
sex_cum <- cumsum(sex_freq)
sex_cumrel <- cumsum(sex_relfreq)
# Step 2: cbind(, , ,) to combine the columns into a table
sextable <- cbind(sex_freq, sex_relfreq, sex_cum, sex_cumrel)
# Step 3: colnames(" ", " ", " ") to label columns of table
colnames(sextable) <- c("Freq", "Rel Freq", "Cum Freq", "Cum Rel Freq")
# Step 4: Display table with just 4 digits after the decimal
round(sextable,digits=4)

##        Freq Rel Freq Cum Freq Cum Rel Freq
## male    326   0.5086      326       0.5086
## female  315   0.4914      641       1.0000

# 3a.ii Frequency and relative frequency table
# Command Freq() in package=DescTools
# Tip - Turn off scientific notation first
options(scipen=1000)
Freq(ivf$sex)

##     level  freq   perc  cumfreq  cumperc
## 1    male   326  50.9%      326    50.9%
## 2  female   315  49.1%      641   100.0%
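
As a cross-check on the two tables in 3a, the same four columns can be produced with base R's table(), prop.table() and cumsum(). The sketch below rebuilds a sex vector from the counts reported in the output (326 male, 315 female) so that it runs without ivf.Rdata. The object names are illustrative, not the handout's.

```r
# Rebuild a sex factor from the counts reported in the handout output
sex <- factor(rep(c("male", "female"), times = c(326, 315)),
              levels = c("male", "female"))

freq    <- table(sex)         # counts per level
relfreq <- prop.table(freq)   # counts divided by the total

sextable <- cbind("Freq"         = freq,
                  "Rel Freq"     = relfreq,
                  "Cum Freq"     = cumsum(freq),
                  "Cum Rel Freq" = cumsum(relfreq))
print(round(sextable, digits = 4))
```

prop.table() divides by the total of the table itself, so no separate n is needed.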

# 3b. Graphical
# 3b.1. Bar Graph of counts
# Command ggplot() + geom_bar() in package=ggplot2
# Basic
ggplot(data=ivf, aes(x=factor(hyp))) + geom_bar()

# With aesthetics: I recommend building step by step, then displaying!
p <- ggplot(data=ivf, aes(x=factor(hyp)))
p1 <- p + geom_bar(color="black", fill="blue")
p2 <- p1 + xlab("Hypertension")
p3 <- p2 + ylab("Frequency")
p4 <- p3 + ggtitle("Bar Graph of Hypertension")
p5 <- p4 + theme_bw()
p5

# 3b.2 Bar Graph of percents
# Note - in ggplot2 3.3.0 and later, after_stat(count) is preferred over ..count..
p6 <- ggplot(data=ivf, aes(x=hyp))
p7 <- p6 + geom_bar(aes(y = (..count..)/sum(..count..)), color="black", fill="blue")
p8 <- p7 + scale_y_continuous(labels=scales::percent)
p9 <- p8 + xlab("Hypertension")
p10 <- p9 + ylab("Relative Frequency")
p11 <- p10 + ggtitle("Bar Graph of Hypertension")
p12 <- p11 + theme_bw()
p12

## Warning: Removed 2 rows containing non-finite values (stat_count).
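
This warning is what ggplot2 prints when rows are dropped because the plotted variable is missing. Two dropped rows is consistent with the NA's :2 entry in the summary(ivf) output. The sketch below shows one way to inspect and exclude missing values before tabulating percents. It uses a made-up vector, not ivf$hyp, so the percents shown are not the handout's.

```r
# Made-up 0/1 indicator with two missing values (NOT ivf$hyp)
hyp <- c(0, 0, 1, NA, 0, 1, NA, 0)

# Counts, including the missing category
print(table(hyp, useNA = "ifany"))

# Percents among the non-missing values only
hyp_complete <- hyp[!is.na(hyp)]
pct <- 100 * prop.table(table(hyp_complete))
print(round(pct, 1))
```

Dropping the missing values first makes the denominator explicit, which is the same choice ggplot2 makes silently when it removes the rows.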

4. SINGLE VARIABLE - CONTINUOUS

# 4. SINGLE VARIABLE - CONTINUOUS

# 4a. Numerical
# 4a.1. Five number summary
# Command favstats() in package=mosaic
favstats(~bweight, data=ivf)

##  min   Q1 median   Q3  max     mean       sd   n missing
##  630 2850   3200 3550 4650 3129.137 652.7827 641       0

# 4a.2. Quantiles
# Command quantile() in package=mosaic
quantile(~bweight, data=ivf)

##   0%  25%  50%  75% 100%
##  630 2850 3200 3550 4650

# 4a.3. Basic descriptives
summary(ivf$bweight)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##     630    2850    3200    3129    3550    4650

# 4a.4. Detailed descriptives
# Command describe( ) in package=psych
describe(ivf$bweight)

##    vars   n    mean     sd median trimmed    mad min  max range  skew
## X1    1 641 3129.14 652.78   3200 3181.91 518.91 630 4650  4020 -0.96
##    kurtosis    se
## X1     1.78 25.78
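
Two columns of the describe() output can be verified by hand from the other values: range is max minus min, and se is the standard error of the mean, that is, sd divided by the square root of n. The sketch below plugs in the numbers reported for bweight. The variable names are illustrative.

```r
# Values copied from the favstats() and describe() output for bweight
n      <- 641
sd_bw  <- 652.7827
min_bw <- 630
max_bw <- 4650

range_bw <- max_bw - min_bw    # range = max - min
se_bw    <- sd_bw / sqrt(n)    # standard error of the mean

print(range_bw)
print(round(se_bw, 2))
```

Both results agree with the range and se columns printed by describe().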

# 4b. Graphical
# 4b.1. Box plot
# Command ggplot() + geom_boxplot() in package=ggplot2
# Basic
ggplot(data=ivf, aes(x=1, y=matage)) + geom_boxplot()

# With aesthetics - step by step
p13 <- ggplot(data=ivf, aes(x=1, y=matage)) + geom_boxplot(color="black", fill="blue")
p14 <- p13 + xlab(".")
p15 <- p14 + ylab("Maternal Age (years)")
p16 <- p15 + ggtitle("Box Plot of Maternal Age")
p17 <- p16 + theme_bw()
p17

# 4b.2 Histogram
# Command ggplot() + geom_histogram() in package=ggplot2
# Basic
ggplot(data=ivf, aes(x=matage)) + geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
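
This message appears because geom_histogram() falls back to 30 bins when neither bins nor binwidth is supplied. As a rough guide to choosing binwidth, the number of bins is about the data range divided by the bin width. The sketch below uses the minimum and maximum of matage from the summary(ivf) output (23 and 43). The variable names are illustrative, and the exact number of bars ggplot2 draws can differ by one depending on where its bin boundaries fall.

```r
# Minimum and maximum of matage, copied from the summary(ivf) output
min_age <- 23
max_age <- 43

# Approximate number of bins for a few candidate bin widths
binwidths   <- c(1, 2, 5)
approx_bins <- ceiling((max_age - min_age) / binwidths)
names(approx_bins) <- paste0("binwidth=", binwidths)
print(approx_bins)
```

A bin width of 2 years gives about 10 bins over this range, far fewer than the default of 30.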

# With aesthetics - step by step
p18 <- ggplot(data=ivf, aes(x=matage)) + geom_histogram(color="black", fill="blue", binwidth=2)
p19 <- p18 + xlab("Maternal Age (years)")
p20 <- p19 + ylab("Frequency")
p21 <- p20 + ggtitle("Histogram of Maternal Age")
p22 <- p21 + theme_bw()
p22

5. TWO VARIABLES - One Continuous, One Discrete

# 5. TWO VARIABLES - ONE CONTINUOUS, ONE DISCRETE

# 5a. Numerical
# 5a.1. Detailed descriptives, by group
# Command describeBy() in package=psych
# Tip - Declare your discrete group variable to be factor
library(psych)
ivf$sex <- as.factor(ivf$sex)
describeBy(ivf$bweight, group = ivf$sex,digits= 4)

## $male
##    vars   n    mean     sd median trimmed    mad min  max range  skew
## X1    1 326 3211.28 665.98   3290 3256.11 526.32 700 4650  3950 -0.88
##    kurtosis    se
## X1     1.59 36.89
##
## $female
##    vars   n    mean     sd median trimmed    mad min  max range  skew
## X1    1 315 3044.13 628.66   3120 3107.97 444.78 630 4416  3786 -1.15
##    kurtosis    se
## X1     2.04 35.42
##
## attr(,"call")
## by.default(data = x, INDICES = group, FUN = describe, type = type)
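
For comparison, the core of a by-group summary (n, mean and sd within each level of the grouping factor) can be sketched in base R with tapply(). The data frame toy below is made up for illustration and is not the ivf data, so its numbers will not match the describeBy() output.

```r
# Made-up data for illustration (NOT the ivf data)
toy <- data.frame(
  sex     = factor(c("male", "female", "male", "female", "male", "female"),
                   levels = c("male", "female")),
  bweight = c(3200, 3000, 3400, 2900, 3300, 3100)
)

# n, mean and sd of bweight within each level of sex
group_n    <- tapply(toy$bweight, toy$sex, length)
group_mean <- tapply(toy$bweight, toy$sex, mean)
group_sd   <- tapply(toy$bweight, toy$sex, sd)

print(cbind(n = group_n, mean = group_mean, sd = group_sd))
```

tapply() applies one function at a time, whereas describeBy() returns the full set of descriptives per group in a single call.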

# 5b. Graphical
# 5b.1. Side-by-side box plot
# Command ggplot() + geom_boxplot() in package=ggplot2
# Basic
ggplot(data=ivf, aes(x=as.factor(sex), y=bweight)) + geom_boxplot()

# With aesthetics - step by step
p13 <- ggplot(data=ivf, aes(x=as.factor(sex), y=bweight)) + geom_boxplot(color="black", fill="blue")
p14 <- p13 + xlab("Sex")
p15 <- p14 + ylab("Birthweight (g)")
p16 <- p15 + ggtitle("Side-by-Side Box Plot of Birthweight, by Sex")
p17 <- p16 + theme_bw()
p17

# 5b.2. Histogram, by group - Stacked
# Command ggplot() + geom_histogram() + facet_grid() or facet_wrap() in package=ggplot2
# Basic (uses facet_grid)
ivf$sex <- as.factor(ivf$sex)
ggplot(data=ivf, aes(x=bweight)) + geom_histogram() + facet_grid(sex ~ .)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# With aesthetics - step by step
p18 <- ggplot(data=ivf, aes(x=bweight, fill=sex)) + geom_histogram(color="black", fill="blue", binwidth=200) + facet_wrap(~sex, ncol=1)
p19 <- p18 + xlab("Birthweight (g)")
p20 <- p19 + ylab("Frequency")
p21 <- p20 + ggtitle("Histogram of Birthweight (g), by Sex")
p22 <- p21 + theme_bw()
p22

# 5b.3. Histogram, by group - Overlay
# Command ggplot() + geom_histogram() in package=ggplot2
# With aesthetics - step by step
p23 <- ggplot(data=ivf, aes(x=bweight, fill=sex)) + geom_histogram(position="identity", alpha=0.4, binwidth=200)
p24 <- p23 + xlab("Birthweight (g)")
p25 <- p24 + ylab("Frequency")
p26 <- p25 + ggtitle("Histogram of Birthweight (g), over Sex")
p27 <- p26 + theme_bw()
p27

6. TWO VARIABLES - Both Continuous

# 6. TWO VARIABLES - BOTH CONTINUOUS

# 6a. XY Scatterplot
# Command ggplot() + geom_point() in package=ggplot2
# Basic
ggplot(data=ivf, aes(x=matage, y=bweight)) + geom_point()

# With aesthetics - step by step
p28 <- ggplot(data=ivf, aes(x=matage, y=bweight)) + geom_point(size=0.5, color="blue")
p29 <- p28 + xlab("Maternal Age (years)")
p30 <- p29 + ylab("Birthweight (g)")
p31 <- p30 + ggtitle("Scatterplot of Birthweight in Relationship to Maternal Age")
p32 <- p31 + theme_bw()
p32

