Advanced Data Analytics: Basic Graphics in R
Jeffrey Stanton
School of Information Studies
Syracuse University
Movie Data Set from JSE
• From McClaren and DePaolo’s article in the Journal of Statistics Education
• Daily per theater box office receipts in dollars for 49 movies
• A variable number of entries for each movie depending upon how long it ran
• About 2500 observations altogether
• DAILY_PER_THEATER: Amount in Dollars
• DATE: mm/dd/yyyy of when the observation was made
• DAY_NUM: Which day in the run, number from 1 up
• MOVIE: The title of the movie
• NUMBER: Index number of the movie
Movie Dataset from JSE
http://www.amstat.org/publications/jse/datasets/moviedaily.dat
http://www.amstat.org/publications/jse/datasets/moviedaily.txt
> moviedaily <- read.delim("Z:/DataScience/AdvancedAnalytics/moviedaily.dat")
> View(moviedaily) # Display data in R-Studio in a separate pane
> attach(moviedaily) # Make the new data the active data frame
> class(moviedaily) # Make sure the dataset is a dataframe
[1] "data.frame"
> ls(moviedaily) # Show the variable names in the dataframe
[1] "DAILY_PER_THEATER" "DATE"              "DAY_NUM"
[4] "MOVIE"             "NUMBER"
> hist(DAY_NUM)
Histogram of DAY_NUM
[Figure: Histogram of DAY_NUM; x-axis DAY_NUM from 0 to 200, y-axis Frequency from 0 to 600]
About Histograms
• Basic type of diagnostic display shows how frequently each value occurs in the data set
• In R, works on numeric data only; getting counts on other modes of data requires another approach
• Works fine with continuous data (e.g., 3.1, 3.2, 3.25, etc.) because it can cluster together nearby values and count them in a single frequency category (representing a range)
• Try hist(NUMBER) and hist(DAILY_PER_THEATER)
• Even though these look like numeric variables, in the data importing process, R has made them into “factors” – factors are stored as integers with “category labels” and are used in various procedures to divide the data into groups
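A minimal sketch of why hist() rejects such columns, using a small made-up vector rather than the movie data (the values and the names raw and f are assumptions for illustration). Note that R 4.0.0 and later no longer convert strings to factors by default on import, so on a current installation the column may arrive as character instead; hist() refuses it either way.

```r
# Hypothetical values standing in for a column that contains a stray text entry
raw <- c("120", "85", "No daily data", "85", "430")
f <- factor(raw)

is.numeric(f)   # a factor is not numeric, even though it is stored as integers
levels(f)       # the category labels
counts <- table(f)   # counts per label: one way to get frequencies for non-numeric data
counts

# hist() stops with an error on a factor; try() lets the script keep going
result <- try(hist(f), silent = TRUE)
inherits(result, "try-error")
```

Here table() plays the role of the "other approach" for getting counts on non-numeric data.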
Convert a factor into numbers
• Recall that a factor is stored as integers with character labels: It is the labels that we want to convert into numbers (we can’t control how R assigned the integers, so we don’t know exactly what they contain, only that they are unique)
• Try as.character(DAILY_PER_THEATER) – See how we get lots of numbers in quotes, plus some occasional other stuff that is not numbers
• Then try: as.numeric(as.character(DAILY_PER_THEATER))
• Note the warning messages: “Warning message: NAs introduced by coercion” – This is exactly what we want: “NA” is R’s way of coding missing data; all of the unusable string values (like: "No daily data") have been turned into NAs because they are missing values
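> detach(moviedaily)
> # Add a new numeric variable converted from the factor
> # (the column is referenced through moviedaily$ because the data frame is detached here)
> moviedaily$dailyper <- as.numeric(as.character(moviedaily$DAILY_PER_THEATER))
> attach(moviedaily)
> class(dailyper)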
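To see why the as.character() step matters, here is a small sketch on made-up values (not the movie data; the names f, codes, and amounts are hypothetical). Calling as.numeric() directly on a factor returns the internal integer codes, while going through as.character() first recovers the numbers written in the labels.

```r
# Hypothetical values; "No daily data" mimics the unusable strings in the file
f <- factor(c("120", "85", "No daily data", "85", "430"))

codes <- as.numeric(f)   # the internal integer codes, not the amounts

# as.character() recovers the labels; as.numeric() then parses them.
# suppressWarnings() hides the expected "NAs introduced by coercion" warning.
amounts <- suppressWarnings(as.numeric(as.character(f)))

codes
amounts                      # 120 85 NA 85 430
sum(amounts, na.rm = TRUE)   # na.rm = TRUE skips the missing value; gives 720
```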
On Most Days, Movies Make a Few Hundred Dollars per Theater
[Figure: Histogram of dailyper; x-axis dailyper from 0 to 20000, y-axis Frequency from 0 to 2000]
Which Movie Made the Most $$$
• First, we need to aggregate the data, by summing the daily takes for each movie:
aggdata <- aggregate(dailyper, by=list(MOVIE), FUN=sum, na.rm=TRUE)
# Aggregates by MOVIE, which is a factor with the movie names
# Uses the sum function on the variable dailyper
• Next, let’s organize the data in descending order:
sortdata <- aggdata[order(-aggdata$x),]
# The minus sign means decreasing order
• Remove the items that had no data (the sums ended up as zero):
sortdata <- sortdata[sortdata$x>1,]
# Takes the subset of rows where the agg $ value > 1
• Finally, create a barplot showing the totals for each movie:
barplot(sortdata$x, names.arg=as.character(sortdata$Group.1), las=2)
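The four steps above can be tried end to end on a tiny made-up data frame. This is a sketch under stated assumptions: the data frame toy and its values are hypothetical, the columns are referenced with $ instead of attach(), the grouping column is given the name MOVIE (rather than the default Group.1), and the filter uses > 0 where the slide uses > 1.

```r
# Made-up daily takes for three hypothetical movies; "C" has no usable data
toy <- data.frame(
  MOVIE    = c("A", "A", "B", "B", "B", "C"),
  dailyper = c(100, 250, 400, NA, 50, NA)
)

# Sum dailyper within each MOVIE; na.rm = TRUE is passed through to sum()
agg <- aggregate(toy$dailyper, by = list(MOVIE = toy$MOVIE), FUN = sum, na.rm = TRUE)

# Sort in decreasing order of the total, then drop movies whose total is zero
agg <- agg[order(-agg$x), ]
agg <- agg[agg$x > 0, ]
agg

# One bar per movie, height equal to the total; las = 2 turns the labels sideways
barplot(agg$x, names.arg = as.character(agg$MOVIE), las = 2)
```

Movie "C" sums to zero because every one of its values is NA and na.rm = TRUE leaves nothing to add, which is why the zero-total rows need to be filtered out.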
Barplot of Movie Total Daily Take
Let’s Do the Same Thing With Rcmdr
• The input data file has some anomalies that we had to clear up: Rcmdr data loader is not as forgiving as R-Studio
[3] ERROR: line 423 did not have 5 elements
[4] ERROR: line 1990 did not have 5 elements
moviedaily <- read.table("Z:/DataScience/AdvancedAnalytics/moviedaily.dat",
  header=TRUE, sep="\t", na.strings="NA", dec=".", strip.white=TRUE)
[5] NOTE: The dataset moviedaily has 2378 rows and 5 columns.
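One possible way to track down lines like these before loading is count.fields(), which reports the number of fields on each line of a delimited file. The sketch below runs on a small temporary file with one deliberately short line; the file contents are made up, and what was actually wrong with lines 423 and 1990 of the real file is not known here.

```r
# Write a small tab-delimited file with one deliberately short line (hypothetical data)
path <- tempfile(fileext = ".dat")
writeLines(c(
  "DAILY_PER_THEATER\tDATE\tDAY_NUM\tMOVIE\tNUMBER",
  "3508\t7/7/2006\t1\tMovie A\t29",
  "No daily data\t7/8/2006",
  "1200\t7/9/2006\t3\tMovie A\t29"
), path)

# count.fields() reports how many fields each line has
n <- count.fields(path, sep = "\t", quote = "")
bad <- which(n != 5)
bad   # line numbers in the file (the header is line 1)

# fill = TRUE pads short lines instead of stopping with an error
d <- read.table(path, header = TRUE, sep = "\t", quote = "",
                na.strings = "NA", strip.white = TRUE, fill = TRUE)
nrow(d)
```

Using fill=TRUE is a design choice with a cost: the short line is kept with padded values rather than rejected, so it is worth inspecting the flagged lines before deciding whether to repair or drop them.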
Obviously We Need to Tweak It
[Figure: Rcmdr plot of DAILY_PER_THEATER; x-axis category labels including 121, 16, 212, 29, 379, 509, 65, 81, 99; y-axis Frequency from 0 to 10]
We Still Need to Coerce
as.numeric(as.character(DAILY_PER_THEATER))
Aggregate is Under the Menu: Data -> Active Data Set
Remove Cases with Missing Data
Subset the Data for Nonzero Values
Rcmdr has no Sort Function… And the Barplot is Troubled
• We can use the sorting capability we learned before:
aggdata<-aggdata[order(-aggdata$dailyper),]
• The Barplot menu choice in Rcmdr produces this code:
barplot(table(aggdata$MOVIE), xlab="MOVIE", ylab="Frequency")
– This creates a frequency table based on MOVIE, which is not really what we want
– The resulting chart plots how many rows each movie has, much like a histogram, rather than a barchart with heights based on dailyper
• We can run our own barchart command using the Rcmdr data:
barplot(aggdata$dailyper,names.arg=as.character(aggdata$MOVIE),las=2)
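The difference between the two barplot() calls can be seen on a three-row made-up data frame (the name agg and its values are hypothetical). table() counts rows per movie, so after aggregation every count is 1; passing the totals themselves gives bars of the intended heights.

```r
# Hypothetical aggregated data: one row per movie with its total
agg <- data.frame(MOVIE = c("A", "B", "C"), dailyper = c(350, 450, 90))

counts <- table(agg$MOVIE)   # how many rows each movie has: 1 apiece here
counts

# Bars of equal height, because each movie appears once in the aggregated data
barplot(counts, xlab = "MOVIE", ylab = "Frequency")

# Bars whose heights are the totals themselves; the vector names become the labels
heights <- setNames(agg$dailyper, agg$MOVIE)
barplot(heights, las = 2)
```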
But Some Things are Still Messed Up!
• It has not discarded the zeroes as we asked
• There are too many entries – there should be 49 or fewer – looks like the aggregation did not work correctly
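One possible way to investigate too many groups is to count the distinct labels in the grouping variable and look for labels that do not resemble titles. The sketch below uses made-up labels, and the pattern it searches for (something shaped like a date) is only an assumption about what a malformed label might contain.

```r
# Hypothetical titles; the last one mimics a label with other fields run into it
movie <- c("Movie A", "Movie A", "Movie B", "Movie B 3508 7/7/2006 29")

n_labels <- length(unique(movie))   # more distinct labels than real movies
n_labels

# Flag labels that contain something shaped like a date (m/d/yyyy)
suspect <- unique(movie[grepl("[0-9]+/[0-9]+/[0-9]{4}", movie)])
suspect
```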
[Figure: barplot of aggdata$dailyper by MOVIE from the Rcmdr data; y-axis from 0 to 200000. The x-axis labels begin with movie titles (Titanic, Star Wars: Phantom Menace, Chicago, Batman, A Beautiful Mind, Spider-Man, and others), but many labels for Pirates 2: Dead Mans Chest, Pirates 3: At Worlds End, and Harry Potter 1: Sorcerers Stone appear to have numbers and a date run together with the title]
Demonstrating Mastery
• Locate a data set in a CSV or Tab-Delimited file and read it into R
• Check the data to ensure that the process of reading in the data worked properly
• Run a histogram on any numeric variable
• Aggregate the data based on any grouping variable; use a sum function, a mean function, or some other function as appropriate
• Display the aggregated data in a barchart or another type of graph as appropriate
• Describe the difference between a histogram and a barchart