Post on 20-May-2020

What’s new in Spark 2.0?

©2015 Couchbase Inc.

©2016 Couchbase Inc.

Spark Overview

Apache Spark is a fast and general engine for large-scale data processing.

Spark 2.0

§ Largely compatible with 1.x
§ Simplifies API
§ 2000 patches from 280 contributors

http://www.slideshare.net/SparkSummit/simplifying-big-data-applications-with-apache-spark-20

Spark 2.0

§ Structured API Improvements
§ Whole-stage code generation
§ Structured Streaming
§ Simpler Setup
§ SQL 2003 Support
§ MLlib enhancements
§ Enhanced R support
§ …

Structured API Improvements

§ Dataset (typed) and DataFrame (untyped) are now unified
  § DataFrame == Dataset<Row>
§ Also used for Structured Streaming

https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html

Whole-Stage Codegen

§ Second-generation Tungsten engine
§ Departing from the Volcano Iterator Model
§ Also: Vectorization for more efficient batch processing

https://databricks.com/blog/2016/05/23/apache-spark-as-a-compiler-joining-a-billion-rows-per-second-on-a-laptop.html
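
The move away from the Volcano iterator model can be illustrated with a toy comparison. This is a plain-Python sketch, not Spark code: `Scan`, `Filter`, `volcano_sum` and `fused_sum` are hypothetical names, and Spark emits JVM code rather than Python. It only shows the shape of the idea, that collapsing scan, filter and aggregate into one loop removes the per-row hand-off between operators.

```python
from typing import Iterable, Iterator, List


class Scan:
    """Volcano-style leaf operator: hands out one row at a time."""

    def __init__(self, rows: Iterable[int]) -> None:
        self.rows = rows

    def __iter__(self) -> Iterator[int]:
        return iter(self.rows)


class Filter:
    """Volcano-style operator: pulls each row from its child operator."""

    def __init__(self, child: Iterable[int], threshold: int) -> None:
        self.child = child
        self.threshold = threshold

    def __iter__(self) -> Iterator[int]:
        for row in self.child:
            if row > self.threshold:
                yield row


def volcano_sum(rows: List[int], threshold: int) -> int:
    """Every row crosses two operator boundaries before it is aggregated."""
    total = 0
    for row in Filter(Scan(rows), threshold):
        total += row
    return total


def fused_sum(rows: List[int], threshold: int) -> int:
    """The kind of single tight loop a whole-stage code generator aims for."""
    total = 0
    for row in rows:
        if row > threshold:
            total += row
    return total
```

Both functions compute the same result; the difference is only in how many calls sit between reading a row and adding it to the total.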


Structured Streaming (Experimental)

§ Tackling “continuous applications”
§ Integrated API with batch jobs
§ Better interaction with storage systems
§ Rich integration into the rest of Spark

https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html

Simpler Setup

§ SparkSession subsumes SQLContext, HiveContext, …
§ One common entry point

SQL 2003 Support

§ Supports SQL 2003 Standard
§ Reworked native SQL parser
§ Native DDL command implementations
§ All kinds of subqueries now supported
§ Can now run all 99 TPC-DS queries

MLlib Enhancements

§ DataFrame as primary ML API
§ Model Persistence
  § Support for all language APIs in Spark: Scala, Java, Python & R
  § Support for nearly all ML algorithms in the DataFrame-based API
  § Support for single models and full Pipelines, both unfitted (a “recipe”) and fitted (a result)
  § Distributed storage using an exchangeable format

Q&A. Thanks!

Couchbase & Spark

Ecommerce runs on Couchbase

6 of the top 10 ecommerce companies in the United States

Online shopping

Omni channel services

Travel runs on Couchbase

3 of the top 3 global distribution systems worldwide

3 of the top 10 airlines

Online Video Streaming runs on Couchbase

6 of the top 10 North American and European broadcast television companies

Sports & Casino Gaming runs on Couchbase

6 of the top 10 online sports and casino gaming companies

Financial Services run on Couchbase

3 of the top 3 credit reporting companies

DaHorvath, http://up.picr.de/23770402by.jpg

Why Spark and Couchbase
Overview & Use-Cases

Use Cases

Operations
§ Catalog
§ Personalization
§ Mobile applications

Analytics
§ Recommendations
§ Predictive analytics
§ Fraud detection

Use Case: Operationalize Analytics/ML

[Diagram labels: Hadoop, Data Warehouse, Training Data, ML Model, CB, Model, Online Data, Serving, Predictions]

Adapted from: Databricks – Not Your Father’s Database, https://www.brighttalk.com/webcast/12891/196891

Use Case: Data Integration

[Diagram labels: RDBMS, S3, HDFS, ES, NoSQL]

Standalone Deployment

Side-By-Side Deployment

Access Patterns
From Spark to Couchbase and Back Again

§ Key-Value: Fetch/Store by Document ID
§ N1QL Query: Fetch by Criteria (“SQL”)
§ Map-Reduce Views: Materialized Indexes (Aggregation)
§ Streaming: Mutation Streams For Processing
§ Full Text: Search on Freeform Text

Couchbase Data Partitioning

Data Locality

§ RDD Location Hints based on the Cluster Map
§ Not available for N1QL or Views
  § Round robin - can’t give location hints
  § Backend is scatter gather with 1 node responding
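
How a cluster map can turn into location hints is sketched below in plain Python. Everything here is an illustrative assumption rather than the connector's code: the partition count of 1024, the simplified CRC32 hashing (the real SDK hash differs in detail), the `cluster_map` list (index = partition, value = owning node) and the node names are all hypothetical.

```python
import zlib
from typing import Dict, List

NUM_PARTITIONS = 1024  # assumption: a typical Couchbase partition (vBucket) count


def partition_for_key(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Hash a document ID to a partition (simplified, not the exact SDK hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def location_hints(keys: List[str], cluster_map: List[str]) -> Dict[str, List[str]]:
    """Group document IDs by the node that owns their partition."""
    hints: Dict[str, List[str]] = {}
    for key in keys:
        node = cluster_map[partition_for_key(key, len(cluster_map))]
        hints.setdefault(node, []).append(key)
    return hints


# Hypothetical two-node cluster that alternates partition ownership.
example_map = ["node-a", "node-b"] * (NUM_PARTITIONS // 2)
example_hints = location_hints(["airline_10", "airline_20", "airline_30"], example_map)
```

Each group of keys could then be scheduled on the Spark worker closest to the owning node. A N1QL or view query has no such per-key owner, which is why no hint can be given there.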

N1QL Query

§ N1QL is a SQL service with JSON extensions
§ Uses Couchbase’s Global Secondary Indexes
§ Can run on any nodes within the cluster
§ Nodes with differing services can be added and removed as needed on the fly

Couchbase Query Architecture

[Diagram labels: Data Service (Projector & Router, Bucket #1, Bucket #2, DCP Stream); Query Service (Query Processor, cbq-engine); Index Service (Supervisor: Index maintenance & Scan coordinator, Index #1, Index #2, Index #3, Index #4, ForestDB Storage Engine)]

Spark SQL Sources

§ TableScan: Scan all of the data and return it.
§ PrunedScan: Scan only the columns that the query at hand requires.
§ PrunedFilteredScan: Scan only the required columns and push the query's filters down, so an index can match only the data relevant to the query at hand.

Predicate Conversion
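
Predicate conversion means turning the filters Spark pushes down into a N1QL `WHERE` clause. The sketch below models that in plain Python under stated assumptions: the `EqualTo` and `GreaterThan` classes only mimic the shape of Spark's source filters, `to_where_clause` is a hypothetical helper, and quote escaping is deliberately left out.

```python
from dataclasses import dataclass
from typing import Any, List, Union


@dataclass
class EqualTo:
    attribute: str
    value: Any


@dataclass
class GreaterThan:
    attribute: str
    value: Any


Filter = Union[EqualTo, GreaterThan]


def literal(value: Any) -> str:
    """Render a value as a literal; strings are single-quoted."""
    if isinstance(value, str):
        if "'" in value:
            raise ValueError("quote escaping is not handled in this sketch")
        return f"'{value}'"
    return str(value)


def to_where_clause(filters: List[Filter]) -> str:
    """AND together the two filter kinds this sketch knows about."""
    parts = []
    for f in filters:
        op = "=" if isinstance(f, EqualTo) else ">"
        parts.append(f"`{f.attribute}` {op} {literal(f.value)}")
    return " AND ".join(parts)
```

A filter the converter cannot translate would have to be left for Spark to evaluate after the rows come back.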

Schema Inference

N1QLRelation:28 - Inferring schema from bucket travel-sample with query 'SELECT META(`travel-sample`).id as `META_ID`, `travel-sample`.* FROM `travel-sample` WHERE `type` = 'airline' LIMIT 1000'

N1QLRelation:28 - Executing generated query: 'SELECT `name`,`callsign` FROM `travel-sample` WHERE `type` = 'airline''
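
The log lines above show a sample of up to 1000 documents being fetched before the real query runs. One way such a sample could become a schema is sketched below in plain Python. The type names, the fall-back to string on conflicting types and the sample documents are assumptions made for illustration, not the connector's actual rules.

```python
from typing import Any, Dict, List


def type_name(value: Any) -> str:
    """Map a JSON value to a coarse column type."""
    if isinstance(value, bool):  # checked before int, since bool is an int subclass
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    return "unknown"  # nested objects, arrays and nulls are out of scope here


def infer_schema(sample: List[Dict[str, Any]]) -> Dict[str, str]:
    """Union the fields of all sampled documents into one flat schema."""
    schema: Dict[str, str] = {}
    for doc in sample:
        for field, value in doc.items():
            current = type_name(value)
            seen = schema.get(field)
            if seen is None:
                schema[field] = current
            elif seen != current:
                schema[field] = "string"  # assumption: conflicts fall back to string
    return schema


# Made-up sample documents, shaped like the rows of the inference query.
sample_docs = [
    {"META_ID": "airline_1", "name": "Example Air", "id": 1},
    {"META_ID": "airline_2", "name": "Sample Jet", "callsign": "SAMPLE", "id": "2"},
]
inferred = infer_schema(sample_docs)
```

Fields that appear in only some documents, like `callsign` here, still end up in the schema, which is what a schemaless JSON bucket needs.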

DCP and Spark Streaming

[Diagram labels: Replica, Indexing]

Structured Streaming Source

Adapted from https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html

[Diagram labels: DCP Stream, Unbounded Table]
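
Treating a DCP stream as an unbounded table means each mutation is a newly appended row, and a query over the table is kept up to date incrementally. The plain-Python sketch below illustrates that with a running count per document type. `RunningCount` and the row layout are hypothetical and no Spark or Couchbase API is involved.

```python
from collections import Counter
from typing import Dict, Iterable


class RunningCount:
    """Incrementally maintained count of rows per document type."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def process_batch(self, new_rows: Iterable[Dict[str, str]]) -> Dict[str, int]:
        # Only newly arrived rows are read; earlier rows are never rescanned.
        for row in new_rows:
            self.counts[row["type"]] += 1
        return dict(self.counts)


query = RunningCount()
first = query.process_batch([
    {"id": "a", "type": "airline"},
    {"id": "b", "type": "airport"},
])
second = query.process_batch([{"id": "c", "type": "airline"}])
```

The result after the second batch reflects all rows seen so far, as if the query had been run over the whole table.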

(Un)Structured Streaming?

Structured Streaming Source

Structured Streaming Sink

Couchbase Spark Connector 1.2.1

§ Spark 1.6.x support, including Datasets
§ DCP Flow Control
§ Enhanced Java APIs

Couchbase Spark Connector 2.0.0

• Spark 2.0.x Support
• Enhanced DCP Client
• Experimental Structured Streaming

Resources

§ Spark Packages: https://spark-packages.org/package/couchbase/couchbase-spark-connector
§ Docs: http://docs.couchbase.com
§ Source: https://github.com/couchbase/couchbase-spark-connector
§ Bugs: https://issues.couchbase.com/browse/SPARKC

Q&A. Thanks!