Post on 13-Jul-2015
transcript
What is R?
• Statistical Programming Language
• Open source
• > 6000 available packages
• Widely used in academia and research
Vector
• c(1, 2, 3, 4)
## [1] 1 2 3 4
• 1:4
## [1] 1 2 3 4
• c("a", "b", "c")
## [1] "a" "b" "c"
• c(T, F, T)
## [1] TRUE FALSE TRUE
Matrix
• matrix(c(1, 2, 3, 4), ncol=2)
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
• matrix(c(1, 2, 3, 4), ncol=2, byrow=T)
##      [,1] [,2]
## [1,]    1    2
## [2,]    3    4
Data frame
name <- c("A", "B", "C")
age <- c(30, 17, 42)
male <- c(T, F, F)
data.frame(name, age, male)
##   name age  male
## 1    A  30  TRUE
## 2    B  17 FALSE
## 3    C  42 FALSE
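A data frame holds one vector per column, and each column may have its own type. The following is a minimal sketch of how such a frame is typically queried in base R; the names people, mean_age and adults are illustrative and not from the slides.

```r
# Rebuild the data frame from the slide.
name <- c("A", "B", "C")
age <- c(30, 17, 42)
male <- c(TRUE, FALSE, FALSE)
people <- data.frame(name, age, male)

# Each column is a vector; `$` extracts one by name.
mean_age <- mean(people$age)

# A logical vector in the row position keeps only the matching rows.
adults <- people[people$age >= 18, ]

# str() lists every column with its type, showing that types differ per column.
str(people)
```

The trailing comma in people[people$age >= 18, ] matters: rows are selected before the comma, columns after it.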
SparkR
• An R package that provides a lightweight front-end for using Apache Spark from R
• Exposes the RDD API of Spark as distributed lists in R
• Allows users to run jobs interactively on a cluster from the R shell
Spark + R
• count, countByKey, countByValue, flatMap, map (lapply), …
• broadcast, includePackage, …
• Filter, reduce, reduceByKey, distinct, union, …
Word count
lines <- textFile(sc, "/path/to/file")
words <- flatMap(lines,
                 function(line) {
                   strsplit(line, " ")[[1]]
                 })
wordCount <- lapply(words, function(word) { list(word, 1L) })
counts <- reduceByKey(wordCount, "+", 2L)
output <- collect(counts)
for (wordcount in output) {
  cat(wordcount[[1]], ": ", wordcount[[2]], "\n")
}
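The same steps can be followed on a single machine without a Spark cluster. This is a rough local analogue in base R, not SparkR code: the two input lines are made-up sample data, and split/vapply stand in for reduceByKey.

```r
# Local stand-in for the distributed pipeline; `lines` is made-up sample input.
lines <- c("a b b", "b c")

# flatMap step: split every line on spaces and flatten into one vector of words.
words <- unlist(strsplit(lines, " ", fixed = TRUE))

# map step: pair every word with the count 1L.
ones <- rep(1L, length(words))

# reduceByKey step: group the 1s by word and add them up within each group.
counts <- vapply(split(ones, words), sum, integer(1))

# collect/print step: same output format as the loop on the slide.
for (word in names(counts)) {
  cat(word, ": ", counts[[word]], "\n")
}
```

In the distributed version each of these steps runs per partition, and reduceByKey additionally shuffles pairs so that equal words end up together before they are summed.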
Machine Learning
• Arthur Samuel (1959): Field of study that gives computers the ability to learn without being explicitly programmed.
Machine Learning
• Supervised
  ◦ Labels, features
  ◦ Mapping of features to labels
  ◦ Estimate a concept (model) that is closest to the true mapping
• Unsupervised
  ◦ No labels
  ◦ Clustering of data
Machine Learning
• Supervised
Naive Bayes, nearest neighbour, decision tree, linear regression, support vector machine…
• Unsupervised
K-means, DBSCAN, one-class SVM…
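The difference between the two families can be seen with one algorithm from each list. This is a small sketch using only base R and the built-in iris data set; the choice of data set, columns and three clusters is an assumption made for illustration.

```r
# Supervised: a label (Petal.Width) and a feature (Petal.Length) are both given,
# and linear regression estimates the mapping from feature to label.
fit <- lm(Petal.Width ~ Petal.Length, data = iris)
predicted <- predict(fit, newdata = data.frame(Petal.Length = c(1.5, 5.0)))

# Unsupervised: only features, no labels; k-means groups similar rows.
set.seed(1)  # k-means starts from randomly chosen centres
features <- iris[, c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")]
clusters <- kmeans(features, centers = 3)

# Compare the discovered clusters with the species column k-means never saw.
table(cluster = clusters$cluster, species = iris$Species)
```

The regression can be scored against known answers, while the clustering can only be inspected after the fact, here by cross-tabulating against a column that was withheld.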
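Naive Bayes
$$P(\text{class} \mid \text{doc}) \propto P(\text{class}) \prod_{\text{word} \in \text{doc}} P(\text{word} \mid \text{class})$$
$$\arg\max_{\text{class}} P(\text{class} \mid \text{doc}) = \arg\max_{\text{class}} P(\text{class}) \prod_{\text{word} \in \text{doc}} P(\text{word} \mid \text{class})$$
$$\arg\max_{\text{class}} \log P(\text{class} \mid \text{doc}) = \arg\max_{\text{class}} \left[ \log P(\text{class}) + \sum_{\text{word} \in \text{doc}} \log P(\text{word} \mid \text{class}) \right]$$
$$P(c) = \frac{\text{number of class } c \text{ documents in training set}}{\text{total number of documents in training set}}$$
$$P(w \mid c) = \frac{\text{no. of occurrences of word } w \text{ in documents of type } c + 1}{\text{total no. of words in documents of type } c + \text{size of vocab}}$$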
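One practical reason for the log form, which the slides do not spell out, is floating-point underflow: a product of many probabilities below 1 can round to exactly zero, while the sum of their logs stays finite. A minimal sketch with made-up likelihoods (400 words, two hypothetical classes "x" and "y"):

```r
# Hypothetical per-word likelihoods: 400 words, each with probability 0.01
# under class "x" and 0.02 under class "y" (made-up numbers).
p_x <- rep(0.01, 400)
p_y <- rep(0.02, 400)
prior <- 0.5

# Direct products underflow: both scores round to 0, so argmax cannot tell them apart.
prod_x <- prior * prod(p_x)
prod_y <- prior * prod(p_y)

# Sums of logs stay finite and keep the ordering of the two classes.
log_x <- log(prior) + sum(log(p_x))
log_y <- log(prior) + sum(log(p_y))

best <- c("x", "y")[which.max(c(log_x, log_y))]
```

Because log is strictly increasing, the class that maximises the log score is the same one that would maximise the product if it could be represented.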
class  "a"  "b"  "c"
1       1    1    0
2       0    2    1
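$$P(1) = P(2) = \frac{1}{2}$$
$$P(a \mid 1) = \frac{1+1}{1+1+3} = \frac{2}{5}$$
$$P(b \mid 1) = \frac{2}{5}$$
$$P(c \mid 1) = \frac{1}{5}$$
$$P(a \mid 2) = \frac{0+1}{0+2+1+3} = \frac{1}{6}$$
$$P(b \mid 2) = \frac{3}{6}$$
$$P(c \mid 2) = \frac{2}{6}$$
$$P(1 \mid \text{"a b b"}) \propto \frac{1}{2} \times \frac{2}{5} \times \frac{2}{5} \times \frac{2}{5} = 0.032$$
$$P(2 \mid \text{"a b b"}) \propto \frac{1}{2} \times \frac{1}{6} \times \frac{3}{6} \times \frac{3}{6} \approx 0.021$$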
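The worked example can be recomputed from the count table as a check. This is a minimal sketch in base R; the names counts, prior, likelihood, doc and score are illustrative. Applying the add-one formula as stated, class 2 holds 0 + 2 + 1 = 3 words, so its denominator is 3 + 3 = 6.

```r
# Word counts per class, copied from the table on the slide.
counts <- matrix(c(1, 1, 0,
                   0, 2, 1),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("1", "2"), c("a", "b", "c")))

# Class priors as given on the slide.
prior <- c("1" = 0.5, "2" = 0.5)

# Add-one smoothing: (count + 1) / (words in class + vocabulary size).
# The divisor has one entry per row, so each row is divided by its own total.
likelihood <- (counts + 1) / (rowSums(counts) + ncol(counts))

# Score "a b b": the prior times the likelihood of every word in the document.
doc <- c("a", "b", "b")
score <- prior * apply(likelihood[, doc], 1, prod)

# Predicted class is the one with the highest score.
predicted <- names(which.max(score))
```

Each row of likelihood sums to 1, which is a quick sanity check that the smoothed estimates form a proper distribution over the vocabulary.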
MLlib
• Spark’s scalable machine learning library consisting of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, dimensionality reduction, as well as underlying optimization primitives.
MLlib and SparkR
• Access to MLlib from SparkR is still in development. Until MLlib is officially integrated into SparkR, the method below can be used to run MLlib from R.
MLlib’s Naive Bayes in R
• R RDD of list(label, features)
• Java RDD of serialised R objects
• Scala RDD of LabeledPoint
• rJava: J("org.apache.spark.mllib.classification.NaiveBayes", "train", labeled.point.rdd, lambda)