SPARK MEETS TELEMETRY
Mozlandia 2014
Roberto Agostino Vitillo
TELEMETRY PINGS
• If Telemetry is enabled, a ping is generated for each session
• Pings are sent to our backend infrastructure as JSON blobs
• The backend validates pings and stores them on S3
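The deck doesn't show the ping format itself. As a rough, hypothetical sketch (only the `info.OS` field is implied by the map-reduce example later in the deck; the other fields here are made up for illustration), a ping is a JSON blob along these lines:

```python
import json

# Hypothetical, heavily simplified ping; the real Telemetry schema
# has many more fields than shown here.
ping = json.dumps({
    "info": {"OS": "Linux", "appVersion": "36.0a1"},
    "simpleMeasurements": {"uptime": 312},
})

# The backend validates pings before storing them on S3; a minimal
# check might just require a parseable blob with an "info" section.
blob = json.loads(ping)
assert "info" in blob
print(blob["info"]["OS"])   # Linux
```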
TELEMETRY MAP-REDUCE
• Processes pings from S3 using a map-reduce framework written in Python
• https://github.com/mozilla/telemetry-server
import json

def map(k, d, v, cx):
    j = json.loads(v)
    os = j['info']['OS']
    cx.write(os, 1)

def reduce(k, v, cx):
    cx.write(k, sum(v))
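To make the data flow concrete, here is a hypothetical plain-Python driver for these two functions. The `Context` class and the calling convention are stand-ins for illustration, not the framework's real API:

```python
import json
from collections import defaultdict

class Context:
    """Hypothetical stand-in for the framework's context object."""
    def __init__(self):
        self.out = defaultdict(list)
    def write(self, key, value):
        self.out[key].append(value)

# The map/reduce functions as defined above.
def map(k, d, v, cx):
    j = json.loads(v)
    os = j['info']['OS']
    cx.write(os, 1)

def reduce(k, v, cx):
    cx.write(k, sum(v))

# Three fake pings; each map call emits (OS, 1).
pings = [json.dumps({"info": {"OS": os}}) for os in ("Linux", "Darwin", "Linux")]

mapper = Context()
for i, ping in enumerate(pings):
    map(i, None, ping, mapper)

# Group values by key, then reduce each group to a count per OS.
reducer = Context()
for key, values in mapper.out.items():
    reduce(key, values, reducer)

print(dict(reducer.out))   # {'Linux': [2], 'Darwin': [1]}
```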
SHORTCOMINGS
• Not distributed; limited to a single machine
• Doesn’t support chains of map/reduce operations
• Doesn’t support SQL-like queries
• Batch-oriented
source: http://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
WHAT IS SPARK?
• In-memory data analytics cluster computing framework (up to 100x faster than Hadoop)
• Comes with over 80 distributed operations for grouping, filtering etc.
• Runs standalone or on Hadoop and Mesos, and on TaskCluster in the future (right, Jonas?)
WHY DO WE CARE?
• In-memory caching
• Interactive command line interface for EDA (think R command line)
• Comes with higher level libraries for machine learning and graph processing
• Works beautifully on a single machine without tedious setup; doesn’t depend on Hadoop/HDFS
• Scala, Python, Clojure and R APIs are available
WHY DO WE REALLY CARE?
The easier we make it to get answers, the more questions we will ask
MASHUP DEMO
HOW DOES IT WORK?
• User creates Resilient Distributed Datasets (RDDs), transforms and executes them
• RDD operations are compiled to a DAG of operators
• DAG is compiled into stages
• A stage is executed in parallel as a series of tasks
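The laziness behind this model can be caricatured in plain Python. This is an illustrative toy, not Spark's implementation: transformations only record operators (a linear DAG), and nothing runs until an action is called:

```python
class ToyRDD:
    """Toy caricature of an RDD: transformations are recorded lazily,
    and nothing executes until an action (collect) is called."""
    def __init__(self, data, ops=()):
        self.data = list(data)
        self.ops = list(ops)   # recorded chain of operators (a linear DAG)

    def map(self, f):
        # Record the operator; do not execute it yet.
        return ToyRDD(self.data, self.ops + [("map", f)])

    def filter(self, f):
        return ToyRDD(self.data, self.ops + [("filter", f)])

    def collect(self):
        # The action: walk the recorded operator chain and execute it.
        out = self.data
        for kind, f in self.ops:
            if kind == "map":
                out = [f(x) for x in out]
            else:
                out = [x for x in out if f(x)]
        return out

rdd = ToyRDD(range(6)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(rdd.collect())   # [0, 4, 16]
```

Real Spark additionally splits the data into partitions and the DAG into stages, as the next slides show.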
RDD
A parallel dataset with partitions
[Figure: a table of observations with columns Var A, Var B, Var C, split into partitions]
DAG
Logical graph of RDD operations

sc.textFile("input")
  .map(line => line.split(","))
  .map(line => (line(0), line(1).toInt))
  .reduceByKey(_ + _, 3)
[Figure: the resulting DAG over partitions P1..P4: read, map, map, reduceByKey, with types RDD[String] → RDD[Array[String]] → RDD[(String, Int)] → RDD[(String, Int)]]
STAGE
[Figure: the DAG split into stages at the shuffle boundary: Stage 1 covers read, map, map and writes shuffle output; Stage 2 reads the shuffle input and runs reduceByKey]
STAGE
Set of tasks that can run in parallel
[Figure: Stage 1 executes tasks T1..T4, one per partition P1..P4, each covering read, map, map up to the shuffle]
STAGE
• Tasks are the fundamental unit of work
• Tasks are serialised and shipped to workers
• Task execution
1. Fetch input
2. Execute
3. Output result
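A minimal plain-Python sketch of this execution model (not Spark's scheduler): one task per partition, all runnable in parallel, each fetching its input, executing, and outputting a result:

```python
from concurrent.futures import ThreadPoolExecutor

# Four partitions, P1..P4 (made-up data for illustration).
partitions = [[1, 2], [3, 4], [5, 6], [7, 8]]

def task(partition):
    # 1. fetch input (the partition), 2. execute (a map plus a local
    # aggregation), 3. output the result.
    return sum(x * 10 for x in partition)

# One task per partition; tasks T1..T4 can run concurrently.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(task, partitions))

print(results)   # [30, 70, 110, 150]
```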
HANDS-ON
1. Visit telemetry-dash.mozilla.org and sign in using Persona.
2. Click “Launch an ad-hoc analysis worker”.
3. Upload your SSH public key (this allows you to log in to the server once it’s started up).
4. Click “Submit”.
5. An Ubuntu machine will be started up on Amazon’s EC2 infrastructure.
HANDS-ON
• Connect to the machine through SSH
• Clone the starter template:
1. git clone https://github.com/vitillo/mozilla-telemetry-spark.git
2. cd mozilla-telemetry-spark && source aws/setup.sh
3. sbt console
• Open http://bit.ly/1wBHHDH
TUTORIAL