Jump Start into Apache® Spark™ and Databricks

Page 1: Jump Start into Apache® Spark™ and Databricks

Jump Start into Apache® Spark™ and Databricks

Denny Lee, Technology Evangelist
[email protected], @dennylee

Page 2: Jump Start into Apache® Spark™ and Databricks

Technology Evangelist, Databricks (working with Spark since v0.5)

Formerly:
• Senior Director of Data Sciences Engineering at Concur (now part of SAP)
• Principal Program Manager at Microsoft

Hands-on data engineer and architect for more than 15 years, developing internet-scale infrastructure both on-premises and in the cloud, including Bing's Audience Insights, Yahoo!'s 24 TB SSAS cube, and the Isotope Incubation Team (HDInsight).

About Me: Denny Lee

Page 3: Jump Start into Apache® Spark™ and Databricks

Founded by the creators of Apache Spark in 2013

Share of Spark code contributed by Databricks in 2014: 75%

Created Databricks on top of Spark to make big data simple.

We are Databricks, the company behind Spark.

Page 4: Jump Start into Apache® Spark™ and Databricks

Apache Spark Engine

Spark Core

Spark Streaming | Spark SQL | MLlib | GraphX

Unified engine across diverse workloads & environments

Scale out, fault tolerant

Python, Java, Scala, and R APIs

Standard libraries

Page 5: Jump Start into Apache® Spark™ and Databricks

Open Source Ecosystem

Page 6: Jump Start into Apache® Spark™ and Databricks

Large-Scale Usage

Largest cluster: 8,000 nodes (Tencent)

Largest single job: 1 PB (Alibaba, Databricks)

Top streaming intake: 1 TB/hour (HHMI Janelia Farm)

2014 on-disk sort record: fastest open source engine for sorting a PB

Page 7: Jump Start into Apache® Spark™ and Databricks

Notable Users

Source: Slide 5 of Spark Community Update

Companies That Presented at Spark Summit 2015 in San Francisco

Page 8: Jump Start into Apache® Spark™ and Databricks

Quick Start
Quick Start Using Python | Quick Start Using Scala

Page 9: Jump Start into Apache® Spark™ and Databricks

Quick Start with Python

textFile = sc.textFile("/mnt/tardis6/docs/README.md")
textFile.count()
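The quick start usually continues with a transformation before another action. A minimal sketch in the same notebook (assuming the same mounted README.md), showing that filter() stays lazy until count() runs:

# Transformation (lazy): keep only the lines that mention Spark
linesWithSpark = textFile.filter(lambda line: "Spark" in line)

# Action: triggers the actual computation
linesWithSpark.count()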

Page 10: Jump Start into Apache® Spark™ and Databricks

Quick Start with Scala

val textFile = sc.textFile("/mnt/tardis6/docs/README.md")
textFile.count()

Page 11: Jump Start into Apache® Spark™ and Databricks

RDDs

• RDDs have actions, which return values, and transformations, which return pointers to new RDDs.

• Transformations are lazy and are executed only when an action is run.

• Transformations: map(), flatMap(), filter(), mapPartitions(), mapPartitionsWithIndex(), sample(), union(), distinct(), groupByKey(), reduceByKey(), sortByKey(), join(), cogroup(), pipe(), coalesce(), repartition(), partitionBy(), ...

• Actions: reduce(), collect(), count(), first(), take(), takeSample(), takeOrdered(), saveAsTextFile(), saveAsSequenceFile(), saveAsObjectFile(), countByKey(), foreach(), ...

• Persist (cache) distributed data in memory or on disk.
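A minimal PySpark sketch of these ideas, assuming an existing SparkContext sc and the textFile RDD from the quick start; the transformations only build a lineage, and nothing executes until count() or take() is called:

# Transformations (lazy): build up a word-count lineage
words = textFile.flatMap(lambda line: line.split(" "))
pairs = words.map(lambda word: (word, 1))
counts = pairs.reduceByKey(lambda a, b: a + b)

# Persist (cache) the result in memory so later actions can reuse it
counts.cache()

# Actions: trigger execution of the whole lineage
counts.count()
counts.take(5)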

Page 12: Jump Start into Apache® Spark™ and Databricks

Spark API Performance

Page 13: Jump Start into Apache® Spark™ and Databricks

History of Spark APIs

RDD (2011)

• Distributed collection of JVM objects

• Functional operators (map, filter, etc.)

DataFrame (2013)

• Distributed collection of Row objects

• Expression-based operations and UDFs

• Logical plans and optimizer

• Fast/efficient internal representations

Dataset (2015)

• Internally rows, externally JVM objects

• “Best of both worlds”: type safe + fast
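To make the contrast concrete, a hedged PySpark sketch of the same aggregation written against both APIs; the sample data and column names are made up for illustration, and sc / sqlContext are assumed to exist:

# RDD: functional operators over plain Python objects
rdd = sc.parallelize([("Alice", 34), ("Bob", 28), ("Alice", 30)])
rdd.filter(lambda kv: kv[1] > 21) \
   .map(lambda kv: (kv[0], (kv[1], 1))) \
   .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))

# DataFrame: expression-based operations that flow through the logical plan and optimizer
df = sqlContext.createDataFrame([("Alice", 34), ("Bob", 28), ("Alice", 30)], ["name", "age"])
df.filter(df.age > 21).groupBy("name").avg("age")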

Page 14: Jump Start into Apache® Spark™ and Databricks

Benefit of Logical Plan: Performance Parity Across Languages

Chart: runtime for an example aggregation workload (seconds), comparing the RDD API (Java/Scala, Python) with the DataFrame API (Java/Scala, Python, R, SQL); DataFrame runtimes are roughly the same across languages.

Page 15: Jump Start into Apache® Spark™ and Databricks

Chart: Spark 1.3.1, 1.4, and 1.5 compared across 9 queries (Q1-Q9) on the NYC Taxi Dataset, with multiple runs per version.

Page 16: Jump Start into Apache® Spark™ and Databricks

Dataset API in Spark 1.6

Typed interface over DataFrames / Tungsten

case class Person(name: String, age: Long)

val dataframe = sqlContext.read.json("people.json")
val ds: Dataset[Person] = dataframe.as[Person]

ds.filter(p => p.name.startsWith("M"))
  .toDF().groupBy($"name").avg("age")

Page 17: Jump Start into Apache® Spark™ and Databricks

Dataset

“Encoder” converts from JVM Object into a Dataset Row

Check out [SPARK-9999]

Diagram: JVM object → (encoder) → Dataset row

Page 18: Jump Start into Apache® Spark™ and Databricks

Tungsten Execution

Diagram: SQL, Python, R, and Streaming interfaces, along with advanced analytics, all run on the DataFrame (& Dataset) API, which executes on Tungsten.

Page 19: Jump Start into Apache® Spark™ and Databricks

Ad Tech Example
AdTech Sample Notebook (Part 1)

Page 20: Jump Start into Apache® Spark™ and Databricks

Create External Table with RegEx

CREATE EXTERNAL TABLE accesslog (
  ipaddress STRING,
  ...
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = '^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] \\"(\\S+) (\\S+) (\\S+)\\" (\\d{3}) (\\d+) \\"(.*)\\" \\"(.*)\\" (\\S+) \\"(\\S+), (\\S+), (\\S+), (\\S+)\\"'
)
LOCATION "/mnt/mdl/accesslogs/"

Page 21: Jump Start into Apache® Spark™ and Databricks

External Web Service Call via Mapper

import urllib2

# Obtain the distinct IP addresses from the accesslog table
ipaddresses = sqlContext.sql("select distinct ip1 from \
  accesslog where ip1 is not null").rdd

# getCCA2: obtains the two-letter country code for an IP address
def getCCA2(ip):
    url = 'http://freegeoip.net/csv/' + ip
    str = urllib2.urlopen(url).read()
    return str.split(",")[1]

# Loop through the distinct IP addresses and obtain the two-letter country codes
mappedIPs = ipaddresses.map(lambda x: (x[0], getCCA2(x[0])))
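The next slide joins a DataFrame named mappedIP2, which is not shown in this transcript. A plausible bridging sketch, assuming the (ip, cca2) pairs above are simply converted to a DataFrame (the column names ip and cca2 are inferred from the join that follows):

# Hypothetical step: turn the (ip, country code) pairs into a DataFrame for joining
mappedIP2 = sqlContext.createDataFrame(mappedIPs, ["ip", "cca2"])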

Page 22: Jump Start into Apache® Spark™ and Databricks

Join DataFrames and Register Temp Table

# Join mappedIP2 with countryCodesDF so we have the IP address and the
# three-letter ISO country code
mappedIP3 = mappedIP2 \
  .join(countryCodesDF, mappedIP2.cca2 == countryCodesDF.cca2, "left_outer") \
  .select(mappedIP2.ip, mappedIP2.cca2, countryCodesDF.cca3, countryCodesDF.cn)

# Register the mapping table
mappedIP3.registerTempTable("mappedIP3")
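With the temp table registered, the mapping can be queried directly with SQL. A minimal usage sketch:

# Top countries by number of distinct IP addresses in the mapping table
sqlContext.sql("select cca3, cn, count(*) as ips from mappedIP3 group by cca3, cn order by ips desc").show()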

Page 23: Jump Start into Apache® Spark™ and Databricks

Add Columns to DataFrames with UDFs

from user_agents import parse
from pyspark.sql.types import StringType
from pyspark.sql.functions import udf

# Create a UDF to extract the browser family information
def browserFamily(ua_string):
    return xstr(parse(xstr(ua_string)).browser.family)
udfBrowserFamily = udf(browserFamily, StringType())

# Obtain the unique agents from the accesslog table
userAgentTbl = sqlContext.sql("select distinct agent from accesslog")

# Add a new column to the UserAgentInfo DataFrame containing browser information
userAgentInfo = userAgentTbl.withColumn('browserFamily', \
  udfBrowserFamily(userAgentTbl.agent))
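The xstr helper used above is not defined anywhere in the deck; a common null-safe version that would keep parse() from choking on missing agents might look like this (an assumption, not part of the original notebook):

# Hypothetical helper: return an empty string for None so parse() never sees a null
def xstr(s):
    return '' if s is None else str(s)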

Page 24: Jump Start into Apache® Spark™ and Databricks

Use Python UDFs with Spark SQL

# Define function (converts Apache web log time)
def weblog2Time(weblog_timestr):
    ...

# Define and register the UDF
udfWeblog2Time = udf(weblog2Time, DateType())
sqlContext.registerFunction("udfWeblog2Time", lambda x: weblog2Time(x))

# Create DataFrame
accessLogsPrime = sqlContext.sql("select hash(a.ip1, a.agent) as UserId, \
  m.cca3, udfWeblog2Time(a.datetime), ...")
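The body of weblog2Time is elided on the slide. A hedged sketch of what the conversion might look like, assuming the common Apache log timestamp format (e.g. 21/Jun/2014:10:00:00 -0700); the format string is an assumption:

from datetime import datetime

# Hypothetical implementation: drop the timezone offset and parse the rest into a date
def weblog2Time(weblog_timestr):
    return datetime.strptime(weblog_timestr.split(' ')[0], '%d/%b/%Y:%H:%M:%S').date()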

Page 25: Jump Start into Apache® Spark™ and Databricks
Page 26: Jump Start into Apache® Spark™ and Databricks
Page 27: Jump Start into Apache® Spark™ and Databricks
Page 28: Jump Start into Apache® Spark™ and Databricks
Page 29: Jump Start into Apache® Spark™ and Databricks

References

Spark DataFrames: Simple and Fast Analysis on Structured Data [Michael Armbrust]

Apache Spark 1.6 presented by Databricks co-founder Patrick Wendell

Announcing Spark 1.6

Introducing Spark Datasets

Spark SQL Data Sources API: Unified Data Access for the Spark Platform

Page 30: Jump Start into Apache® Spark™ and Databricks

Join us at Spark Summit East
February 16-18, 2016 | New York City

Page 31: Jump Start into Apache® Spark™ and Databricks

Thanks!

Page 32: Jump Start into Apache® Spark™ and Databricks

Appendix

Page 33: Jump Start into Apache® Spark™ and Databricks

Spark Survey 2015 Highlights

Page 34: Jump Start into Apache® Spark™ and Databricks

Spark adoption is growing rapidly

Spark use is growing beyond Hadoop

Spark is increasing access to big data

Spark Survey Report 2015 Highlights

TOP 3 APACHE SPARK TAKEAWAYS

Page 35: Jump Start into Apache® Spark™ and Databricks
Page 36: Jump Start into Apache® Spark™ and Databricks
Page 37: Jump Start into Apache® Spark™ and Databricks
Page 38: Jump Start into Apache® Spark™ and Databricks
Page 39: Jump Start into Apache® Spark™ and Databricks

NOTABLE USERS THAT PRESENTED AT SPARK SUMMIT 2015 SAN FRANCISCO

Source: Slide 5 of Spark Community Update

