Date posted: 11-Aug-2014
Scalable Machine Learning
Alton Alexander (@10altoids)
R
Motivation to use Spark
• http://spark.apache.org/
  – Speed
  – Ease of use
  – Generality
  – Integrated with Hadoop
• Scalability
Performance
Architecture
• New Amazon memory-optimized options
  – https://aws.amazon.com/about-aws/whats-new/2014/04/10/r3-announcing-the-next-generation-of-amazon-ec2-memory-optimized-instances/
Learn more about Spark
• http://spark.apache.org/documentation.html
  – Great documentation (with video tutorials)
  – June 2014 conference: http://spark-summit.org
• Keynote talk at STRATA 2014
  – Use cases by Yahoo and other companies
  – http://youtu.be/KspReT2JjeE
  – Matei Zaharia, core developer, now at Databricks
  – 30-minute detailed talk: http://youtu.be/nU6vO2EJAb4?t=20m42s
Motivation to use R
• Great community
  – "R: The most powerful and most widely used statistical software"
  – https://www.youtube.com/watch?v=TR2bHSJ_eck
• Statistics
• Packages
  – There's an R package for that
  – Roger Peng, Johns Hopkins
  – https://www.youtube.com/watch?v=yhTerzNFLbo
• Plots
Example: Word Count

library(SparkR)
sc <- sparkR.init(master="local")

# Read the text file and split each line into words
lines <- textFile(sc, "hdfs://my_text_file")
words <- flatMap(lines, function(line) { strsplit(line, " ")[[1]] })

# Emit (word, 1) pairs, then sum the counts per word
wordCount <- lapply(words, function(word) { list(word, 1L) })
counts <- reduceByKey(wordCount, "+", 2L)
output <- collect(counts)
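The flatMap → map → reduceByKey pipeline above can be sketched in plain Python, without Spark, to show what each step produces. The input lines here are hypothetical stand-ins for the HDFS file:

```python
from collections import Counter

# Hypothetical input lines standing in for the HDFS text file
lines = ["to be or not to be", "to be is to do"]

# flatMap: split every line into individual words
words = [word for line in lines for word in line.split(" ")]

# map + reduceByKey: count occurrences of each word
counts = Counter(words)

print(counts["to"])  # "to" appears 4 times across both lines
```

In the SparkR version, the same counting happens in parallel across partitions; `reduceByKey` merges the per-partition `(word, 1)` pairs with `+`.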
Learn more about SparkR
• GitHub repository
  – https://github.com/amplab-extras/SparkR-pkg
  – How to install
  – Examples
• An old but still good talk introducing SparkR
  – http://www.youtube.com/watch?v=MY0NkZY_tJw&list=PL-x35fyliRwiP3YteXbnhk0QGOtYLBT3a
  – Shows MNIST demo
Backup Slides
Hands on Exercises
• http://spark-summit.org/2013/exercises/index.html
  – Walk through the tutorial
  – Set up a cluster on EC2
  – Data exploration
  – Stream processing with Spark Streaming
  – Machine learning
Local Box

• Start with a micro dev box using the latest public build on Amazon EC2
  – spark.ami.pvm.v9 - ami-5bb18832
• Or start by just installing it on your laptop
  – wget http://d3kbcqa49mib13.cloudfront.net/spark-0.9.1-bin-hadoop1.tgz
• Add AWS keys as environment variables
  – AWS_ACCESS_KEY_ID=
  – AWS_SECRET_ACCESS_KEY=
Run the examples
• Load pyspark and work interactively
  – /root/spark-0.9.1-bin-hadoop1/bin/pyspark
  – >>> help(sc)
• Estimate pi
  – ./bin/pyspark python/examples/pi.py local[4] 20
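The pi.py example estimates π by Monte Carlo sampling. The same idea can be sketched in plain Python without Spark (a local analogue, not the actual pi.py source):

```python
import random

random.seed(0)  # fixed seed so the estimate is reproducible

def estimate_pi(n):
    """Sample n random points in the unit square; the fraction that
    lands inside the quarter circle approximates pi / 4."""
    inside = sum(
        1 for _ in range(n)
        if random.random() ** 2 + random.random() ** 2 < 1.0
    )
    return 4.0 * inside / n

pi_estimate = estimate_pi(100_000)
```

In the Spark version the samples are split across partitions (`local[4]` runs 4 worker threads) and the per-partition hit counts are combined with a reduce.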
Start Cluster
• Configure the cluster and start it
  – spark-0.9.1-bin-hadoop1/ec2/spark-ec2 -k spark-key -i ~/spark-key.pem -s 1 launch spark-test-cluster
• Log onto the master
  – spark-0.9.1-bin-hadoop1/ec2/spark-ec2 login spark-test-cluster
Ganglia: The Cluster Dashboard
Run these Demos
• http://spark.apache.org/docs/latest/mllib-guide.html
  – Talks about each of the algorithms
  – Gives some demos in Scala
  – More demos in Python
Clustering

from pyspark.mllib.clustering import KMeans
from numpy import array
from math import sqrt

# Load and parse the data
data = sc.textFile("data/kmeans_data.txt")
parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')]))

# Build the model (cluster the data)
clusters = KMeans.train(parsedData, 2, maxIterations=10, runs=30,
                        initializationMode="random")

# Evaluate clustering by computing Within Set Sum of Squared Errors
def error(point):
    center = clusters.centers[clusters.predict(point)]
    return sqrt(sum([x**2 for x in (point - center)]))

WSSSE = parsedData.map(lambda point: error(point)).reduce(lambda x, y: x + y)
print("Within Set Sum of Squared Error = " + str(WSSSE))
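The error/WSSSE computation above can be checked locally with a few hand-picked points and fixed centers standing in for the trained model. This is a pure-Python sketch with hypothetical data:

```python
from math import sqrt

# Hypothetical 1-D points forming two tight groups, plus the two
# centers a converged k-means run would plausibly find
points = [[0.0], [0.1], [9.0], [9.1]]
centers = [[0.05], [9.05]]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def error(point):
    # Same role as clusters.centers[clusters.predict(point)] above:
    # distance from a point to its nearest center
    nearest = min(centers, key=lambda c: sq_dist(point, c))
    return sqrt(sq_dist(point, nearest))

wssse = sum(error(p) for p in points)  # 4 points, each 0.05 from its center
```

A lower WSSSE means points sit closer to their assigned centers; comparing it across values of k is a common way to pick the number of clusters.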
Python Code
• http://spark.incubator.apache.org/docs/latest/api/pyspark/index.html
• Python API for Spark
• Package MLlib
  – Classification
  – Clustering
  – Recommendation
  – Regression
Clustering Skullcandy Followers

from pyspark.mllib.clustering import KMeans
from numpy import array
from math import sqrt

# Load and parse the data (comma-delimited CSV)
data = sc.textFile("../skullcandy.csv")
parsedData = data.map(lambda line: array([float(x) for x in line.split(',')]))

# Build the model (cluster the data)
clusters = KMeans.train(parsedData, 2, maxIterations=10, runs=30,
                        initializationMode="random")

# Evaluate clustering by computing Within Set Sum of Squared Errors
def error(point):
    center = clusters.centers[clusters.predict(point)]
    return sqrt(sum([x**2 for x in (point - center)]))

WSSSE = parsedData.map(lambda point: error(point)).reduce(lambda x, y: x + y)
print("Within Set Sum of Squared Error = " + str(WSSSE))
Clustering Skullcandy Followers
Apply model to all followers
• predictions = parsedData.map(lambda follower: clusters.predict(follower))
• Save this out for visualization
  – predictions.saveAsTextFile("predictions.csv")
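Locally, the predict-per-follower step is just a map over the parsed rows. A pure-Python sketch with hypothetical follower features and fixed centers (stand-ins for the trained MLlib model):

```python
# Hypothetical 2-D feature rows for three followers
followers = [[0.1, 0.2], [8.9, 9.0], [0.0, 0.1]]
# Fixed centers standing in for clusters.centers after training
centers = [[0.0, 0.0], [9.0, 9.0]]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def predict(row):
    # Index of the nearest center, mirroring clusters.predict(follower)
    return min(range(len(centers)), key=lambda i: sq_dist(row, centers[i]))

predictions = [predict(f) for f in followers]  # one cluster id per follower
```

The saved file then holds one cluster id per follower, which the dashboard can join back to the follower records for plotting.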
Predicted Groups
Skullcandy Dashboard
Backup
• Upgrade to Python 2.7
• https://spark-project.atlassian.net/browse/SPARK-922
Correlation Matrix