
Spark meets Telemetry

Date post: 08-Jul-2015
Upload: roberto-agostino-vitillo
Description: A talk about Spark and Mozilla Telemetry.
Transcript
Page 1: Spark meets Telemetry

SPARK MEETS TELEMETRY

Mozlandia 2014Roberto Agostino Vitillo

Page 2: Spark meets Telemetry

TELEMETRY PINGS

Page 3: Spark meets Telemetry

• If Telemetry is enabled, a ping is generated for each session

• Pings are sent to our backend infrastructure as JSON blobs

• The backend validates pings and stores them on S3

TELEMETRY PINGS
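As an illustration of the pipeline above, a ping might look roughly like the following (the field names shown here are a simplified, hypothetical subset; real pings carry many more fields):

```python
import json

# A hypothetical, simplified ping; real pings are much richer.
ping = {
    "info": {"OS": "Linux", "appVersion": "34.0"},
    "simpleMeasurements": {"uptime": 1234},
}

# Pings travel to the backend as JSON blobs.
blob = json.dumps(ping)

# The backend can then parse and validate the blob before storing it.
parsed = json.loads(blob)
print(parsed["info"]["OS"])  # Linux
```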

Page 4: Spark meets Telemetry

TELEMETRY PINGS

Page 5: Spark meets Telemetry

TELEMETRY MAP-REDUCE

• Processes pings from S3 using a map reduce framework written in Python

• https://github.com/mozilla/telemetry-server

import json

def map(k, d, v, cx):
    j = json.loads(v)
    os = j['info']['OS']
    cx.write(os, 1)

def reduce(k, v, cx):
    cx.write(k, sum(v))

Page 6: Spark meets Telemetry

SHORTCOMINGS

• Not distributed, limited to a single machine

• Doesn’t support chains of map/reduce ops

• Doesn’t support SQL-like queries

• Batch oriented

Page 7: Spark meets Telemetry
Page 8: Spark meets Telemetry

source: http://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html

Page 9: Spark meets Telemetry

WHAT IS SPARK?

• In-memory data analytics cluster computing framework (up to 100x faster than Hadoop)

• Comes with over 80 distributed operations for grouping, filtering etc.

• Runs standalone or on Hadoop and Mesos, and on TaskCluster in the future (right, Jonas?)

Page 10: Spark meets Telemetry

WHY DO WE CARE?

• In-memory caching

• Interactive command line interface for EDA (think R command line)

• Comes with higher level libraries for machine learning and graph processing

• Works beautifully on a single machine without tedious setup; doesn’t depend on Hadoop/HDFS

• Scala, Python, Clojure and R APIs are available

Page 11: Spark meets Telemetry

WHY DO WE REALLY CARE?

The easier we make it to get answers, the more questions we will ask

Page 12: Spark meets Telemetry

MASHUP DEMO

Page 13: Spark meets Telemetry

HOW DOES IT WORK?

• User creates Resilient Distributed Datasets (RDDs), transforms and executes them

• RDD operations are compiled to a DAG of operators

• DAG is compiled into stages

• A stage is executed in parallel as a series of tasks
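The laziness behind this model can be sketched with a toy RDD-like class (an illustration of the idea, not Spark's API): transformations only record a plan, and an action walks the resulting chain of operators and executes it.

```python
# A toy illustration (not Spark's API): transformations build a plan
# lazily; nothing runs until an action forces execution.
class ToyRDD:
    def __init__(self, data=None, parent=None, op=None):
        self.data = data
        self.parent = parent
        self.op = op          # deferred function applied at collect time

    def map(self, f):
        # Record a map step; do not execute it yet.
        return ToyRDD(parent=self, op=lambda rows: [f(r) for r in rows])

    def filter(self, pred):
        # Record a filter step; do not execute it yet.
        return ToyRDD(parent=self, op=lambda rows: [r for r in rows if pred(r)])

    def collect(self):
        # The "action": walk the operator chain back to the source
        # and execute each recorded step in order.
        rows = self.parent.collect() if self.parent else self.data
        return self.op(rows) if self.op else rows

rdd = ToyRDD(data=[1, 2, 3, 4]).map(lambda x: x * 2).filter(lambda x: x > 4)
print(rdd.collect())  # [6, 8]
```

In real Spark the recorded plan is the DAG of operators described above, which the scheduler then splits into stages and tasks.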

Page 14: Spark meets Telemetry

RDD: a parallel dataset with partitions

[Figure: a dataset of observations over variables Var A, Var B, Var C, split into partitions]

Page 15: Spark meets Telemetry

DAG: logical graph of RDD operations

sc.textFile("input")
  .map(line => line.split(","))
  .map(line => (line(0), line(1).toInt))
  .reduceByKey(_ + _, 3)
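The reduceByKey step in the snippet above can be mimicked in plain Python to show its semantics (an illustration only, not Spark's API):

```python
# Mimic the Scala snippet: parse "key,count" lines, then sum the
# counts per key -- which is what reduceByKey(_ + _) does.
lines = ["a,1", "b,2", "a,3"]
pairs = [(f[0], int(f[1])) for f in (line.split(",") for line in lines)]

counts = {}
for key, value in pairs:
    # Combine all values that share a key with the reduce function (+).
    counts[key] = counts.get(key, 0) + value

print(counts)  # {'a': 4, 'b': 2}
```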

[Figure: the resulting DAG — read → map → map → reduceByKey, with RDD types RDD[String] → RDD[Array[String]] → RDD[(String, Int)] → RDD[(String, Int)], over partitions P1–P4]

Page 16: Spark meets Telemetry

STAGE

[Figure: the same DAG split into stages — Stage 1 (read, map, map) and Stage 2 (reduceByKey), over partitions P1–P4]

Page 17: Spark meets Telemetry

STAGE

[Figure: Stage 1 — read, map, map run as tasks T1–T4, one per partition P1–P4, each writing shuffle output]

A stage is a set of tasks that can run in parallel.

Page 18: Spark meets Telemetry

STAGE

[Figure: Stage 2 consumes the shuffle output of Stage 1]

A stage is a set of tasks that can run in parallel.

Page 19: Spark meets Telemetry

STAGE

• Tasks are the fundamental unit of work

• Tasks are serialised and shipped to workers

• Task execution

1. Fetch input

2. Execute

3. Output result

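The three steps above can be sketched in plain Python (a toy illustration using pickle for serialisation; this is not Spark's actual task machinery):

```python
import pickle

# A toy view of task execution: the scheduler serialises a task,
# ships it to a worker, and the worker runs it.
def run_task(payload):
    func, partition = pickle.loads(payload)  # 1. fetch input
    result = func(partition)                 # 2. execute
    return pickle.dumps(result)              # 3. output result

# "Ship" a task that sums one partition of data.
payload = pickle.dumps((sum, [1, 2, 3, 4]))
print(pickle.loads(run_task(payload)))  # 10
```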

Page 20: Spark meets Telemetry

HANDS-ON

Page 21: Spark meets Telemetry

1. Visit telemetry-dash.mozilla.org and sign in using Persona.

2. Click “Launch an ad-hoc analysis worker”.

3. Upload your SSH public key (this allows you to log in to the server once it’s started up).

4. Click “Submit”.

5. An Ubuntu machine will be started on Amazon’s EC2 infrastructure.

HANDS-ON

Page 22: Spark meets Telemetry

HANDS-ON

• Connect to the machine through SSH

• Clone the starter template:

1. git clone https://github.com/vitillo/mozilla-telemetry-spark.git

2. cd mozilla-telemetry-spark && source aws/setup.sh

3. sbt console

• Open http://bit.ly/1wBHHDH

Page 23: Spark meets Telemetry

TUTORIAL

Page 24: Spark meets Telemetry
