Date post: | 16-Apr-2017 |
Category: | Data & Analytics |
Upload: | mapr-technologies |
© 2016 MapR Technologies
How Spark is Enabling the New Wave of Converged Cloud Applications
Ankur Desai & Carol McDonald
December, 2016
Today’s Presenters
Carol McDonald, Solutions Architect
Ankur Desai, Sr Mgr, Platform & Products
Agenda
• Market Trends
• What’s Needed for Converged Streaming Applications
• Use Cases
• Demo of MapR Streams with Spark Streaming
A Single Platform: On-Prem, In the Cloud, or InterCloud
• Flexible processing where change is the norm
• Distributed processing across clusters, data centers, public & private cloud environments
• Supports global apps that can scale arbitrarily
MapR on Microsoft Azure Marketplace
MapR and Microsoft enable enterprise grade big data applications in the Azure cloud
• Simplified Deployment: Azure Marketplace’s automated deployment capabilities make big data easy
• Unlimited Scale: Azure’s infrastructure can scale up to match any requirement and scale down for value
• Seamless Interoperability: MapR integrates with other Azure services to enable customers to analyze any type of data to unlock the biggest insights
Product Alignment
Leading optical retail chain
Digital transformation for better customer experience
Deliver self-service insights across the business
OBJECTIVES
• Modernize analytics & improve speed of marketing campaigns
• Reduce cost of existing systems
CHALLENGES
• Existing technologies prohibiting effective & timely reporting and analysis
• Very long time to extract value from the data, leading to lots of Excel
SOLUTION
• MapR platform on the Azure cloud to modernize their infrastructure and sunset legacy systems
• Faster exploration of data with Apache Drill, mitigating the need for schema development
• Support for use cases such as customer 360, supply chain & image analysis
The Need For Streaming
Decreasing Job Latencies
Hours Mins Secs Milli Secs
Data persistence on-disk
Data persistence in-memory
Big Data is Continuously Generated One Event at a Time
{ "time" : "6:01.103", "event" : "RETWEET", "location" : { "lat" : 40.712784, "lon" : -74.005941 } }
{ "time" : "5:04.120", "severity" : "CRITICAL", "msg" : "Service down" }
{ "card_num" : 1234, "merchant" : "MERCH1", "amount" : 50 }
Why Stream Processing?
6:01 P.M.: 72°, 6:02 P.M.: 75°, 6:03 P.M.: 77°, 6:04 P.M.: 85°, 6:05 P.M.: 90°, 6:06 P.M.: 85°, 6:07 P.M.: 77°, 6:08 P.M.: 75°
Analyze
It was hot at 6:05 yesterday!
Batch processing may be too late for some events
Why Stream Processing?
Stream
Topic: Temperature
6:05 P.M.: 90°
Turn on the air conditioning!
It’s becoming important to process events as they arrive
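The contrast between analyzing a stored batch later and reacting to each event on arrival can be sketched in plain Java. This is only an illustration: the class name `TemperatureAlertSketch`, the method names, and the threshold of 90 are assumptions chosen to match the readings on the slide, and no streaming library is involved.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: reacts to each reading as it arrives
// instead of analyzing a stored batch afterwards.
final class TemperatureAlertSketch {
    private final double threshold;                 // hypothetical alert threshold
    private final List<String> actions = new ArrayList<>();

    TemperatureAlertSketch(double threshold) {
        this.threshold = threshold;
    }

    // Called once per event; records an action as soon as a reading reaches the threshold.
    void onReading(String time, double temperature) {
        if (temperature >= threshold) {
            actions.add(time + ": turn on the air conditioning");
        }
    }

    List<String> actions() {
        return actions;
    }

    public static void main(String[] args) {
        TemperatureAlertSketch alert = new TemperatureAlertSketch(90.0);
        alert.onReading("6:04 P.M.", 85.0);
        alert.onReading("6:05 P.M.", 90.0);
        alert.onReading("6:06 P.M.", 85.0);
        System.out.println(alert.actions()); // prints [6:05 P.M.: turn on the air conditioning]
    }
}
```

In a real pipeline the readings would come from a topic rather than from direct method calls; the point is only that the decision is made per event.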
Anatomy of Converged Streaming Applications
The Trinity of Real-time
Real Time Producers
Topic 1
Topic 2
Global Messaging System
Stream Processing
Persistence (Databases and Files)
Real Time Operational Analytics
Creating the Streaming Pipeline
Data Sources → Stream Data (Topic) → Process Data → Store Data → Serve Data
MapR Converged Data Platform
(Architecture diagram)
• Open Source Engines & Tools; Commercial Engines & Applications; Custom Apps; Cloud and Managed Services; Search and Others
• Data Processing
• Unified Management and Monitoring
• APIs: HDFS API; POSIX, NFS; HBase API; JSON API; Kafka API
• Web-Scale Storage: MapR-FS; Database: MapR-DB; Event Streaming: MapR Streams
• Enterprise-Grade Platform Services: High Availability, Real Time, Unified Security, Multi-tenancy, Disaster Recovery, Global Namespace
MapR Streams: Global Pub-sub Event Streaming System for Big Data
• Producers publish billions of messages/sec to a topic in a stream.
• Guaranteed, immediate delivery to all consumers.
• Tie together geo-dispersed clusters. Worldwide.
• Standard real-time API (Kafka). Integrates with Spark Streaming, Storm, Apex, and Flink.
• Direct data access (OJAI API) from analytics frameworks.
(Diagram labels: Producers, Stream, Topic, Replication, Consumers, Remote sites and consumers, Batch analytics)
Scalable Event Streaming with MapR Streams
Topics are partitioned for throughput and scalability
Partition 1: Topic - Pressure
Partition 1: Topic - Temperature
Partition 1: Topic - Warning
Partition 2: Topic - Pressure
Partition 2: Topic - Temperature
Partition 2: Topic - Warning
Partition 3: Topic - Pressure
Partition 3: Topic - Temperature
Partition 3: Topic - Warning
Consumers
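One common way to spread keyed messages across the partitions of a topic is to hash the message key. The sketch below shows that idea in plain Java. It is an assumption for illustration, not the partitioner that MapR Streams or Kafka actually uses; the class name `PartitionSketch` and the pump ids `PUMP_B` and `PUMP_C` are made up.

```java
// Illustrative only: a hash-modulo partitioner,
// not the partitioner used by MapR Streams or Kafka.
final class PartitionSketch {
    // Map a message key to a partition index in [0, numPartitions).
    static int partitionFor(String key, int numPartitions) {
        if (numPartitions <= 0) {
            throw new IllegalArgumentException("numPartitions must be positive");
        }
        // floorMod keeps the result non-negative even when hashCode() is negative.
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        String[] pumpIds = {"COHUTTA", "PUMP_B", "PUMP_C"}; // PUMP_B and PUMP_C are made-up ids
        for (String id : pumpIds) {
            System.out.println(id + " -> partition " + partitionFor(id, 3));
        }
    }
}
```

Because the same key always maps to the same partition, all messages for one pump stay in order within that partition, while different pumps can be read in parallel by different consumers.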
MapR-DB is Designed to Scale
Fast Reads and Writes by Key
Data is automatically partitioned by Key Range
(Diagram: three key-range partitions, each holding rows with the columns Key, Col B, Col C)
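Partitioning by key range means that a row key is routed to the partition whose range contains it. A minimal sketch of that lookup, using a sorted map from range start keys to partition names, is shown below. The class name `KeyRangeSketch`, the split keys, and the partition names are hypothetical, and this is not how MapR-DB is implemented internally.

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: models key-range partitioning with a sorted map.
// Partition names and split keys are hypothetical.
final class KeyRangeSketch {
    // Maps the first row key of each range to the partition that holds the range.
    private final TreeMap<String, String> rangeStarts = new TreeMap<>();

    void addRange(String startKey, String partitionName) {
        rangeStarts.put(startKey, partitionName);
    }

    // Find the partition whose start key is the greatest one <= rowKey.
    String partitionFor(String rowKey) {
        Map.Entry<String, String> entry = rangeStarts.floorEntry(rowKey);
        if (entry == null) {
            throw new IllegalArgumentException("no range covers key: " + rowKey);
        }
        return entry.getValue();
    }

    public static void main(String[] args) {
        KeyRangeSketch table = new KeyRangeSketch();
        table.addRange("", "partition-1");  // keys below "H"
        table.addRange("H", "partition-2"); // keys from "H" up to "P"
        table.addRange("P", "partition-3"); // keys from "P" upward
        System.out.println(table.partitionFor("COHUTTA_3/10/14_1:01")); // prints partition-1
    }
}
```

A lookup by key touches only one partition, which is what makes reads and writes by key fast as the table grows.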
Use Cases
Customer 360 & Behavior Prediction
Website Click-Stream
Real Time/Offline ClickStream Analysis
Internal Data Sources
External Data Sources
• Prediction Modelling
• Attribution Modelling
• Cohort Analysis
• Customer Lifetime Value Analysis
• Attrition Modelling
• Response Modelling
• Churn Modelling
Eliminate latency due to data movement between clusters
Eliminate redundant storage with MapR Streams and lower the TCO
360 Degree Customer View
Customer Behavior Prediction
Better Conversion Rate and Lower Attrition $$$
Offline, Real Time
HA, DR, NFS, Snapshots, Data Protection
EDH/EDL
Topics
Support Tickets, DBMS, Email, CRM
Prescriptive Analytics: IoT & Auto Manufacturing
GPS
Telematic Data
Telephone
Truck Fleet
Topics
Data generated from cars are stored locally
Data Modelling/Secondary ETL: Data is converted from proprietary to Parquet format
• Identify emission patterns
• Route optimization
• Customer service requests
• How does throttling affect other factors such as fuel consumption, emissions, etc.
• Image and video analysis
• Time series analysis for threshold breach
Demo
What if BP had detected problems before the oil hit the water?
1M samples/sec
High performance at scale is necessary!
Use Case: Time Series Data
Sensor time-stamped data → Stream (Topic) → read by Spark Streaming → Spark processing → Data for real-time monitoring
Use Case: Time Series Data
Sensor time-stamped data
Stream
Topic
COHUTTA,3/10/14,1:01,10.27,1.73,881,1.56,85,1.94
COHUTTA,3/10/14,1:03,10.47,1.732,882,1.7,92,0.66
COHUTTA,3/10/14,1:02,9.67,1.731,882,0.52,87,1.79
Data: PumpId, Date, Time, followed by pressure and flow measurements
Schema
• All events stored, CF data could be set to expire data
• Filtered alerts put in CF alerts
• Daily summaries put in CF stats
Row key | CF data: hz … psi | CF alerts: psi … | CF stats: hz_avg … psi_min
COHUTTA_3/10/14_1:01 | 10.37 … 84 | 0 … |
COHUTTA_3/10/14 | | | 10 … 0
Row Key contains oil pump name, date, and a time stamp
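The two row-key shapes on the schema slide (one per event, one per daily summary) can be composed as in the sketch below. The class and method names are hypothetical; only the key layout, with fields joined by underscores, comes from the slide.

```java
// Illustrative only: composes row keys in the shape shown on the schema slide.
final class RowKeySketch {
    // Key for a single sensor event: pump name, date and time joined by underscores.
    static String eventKey(String pumpId, String date, String time) {
        return pumpId + "_" + date + "_" + time;
    }

    // Key for a daily summary row: pump name and date only.
    static String dailyKey(String pumpId, String date) {
        return pumpId + "_" + date;
    }

    public static void main(String[] args) {
        System.out.println(eventKey("COHUTTA", "3/10/14", "1:01")); // prints COHUTTA_3/10/14_1:01
        System.out.println(dailyKey("COHUTTA", "3/10/14"));         // prints COHUTTA_3/10/14
    }
}
```

Since every event key for a pump and day starts with the corresponding daily key, rows for the same pump and day share a common prefix and sort next to each other.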
What Do We Need to Do?
Data Sources → Collect Data (Stream, Topic) → Process Data → Store Data → Serve Data
Use Case Example Code
Sensor time-stamped data → Stream (Topic) → read by Spark Streaming → Spark processing → Data for real-time monitoring
KafkaProducer
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

String topic = "/streams/pump:warning";
// 1 configure KafkaProducer properties (key and value are both sent as strings)
Properties properties = new Properties();
properties.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
properties.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
// 2 create KafkaProducer with properties
KafkaProducer<String, String> producer = new KafkaProducer<String, String>(properties);
String txt = "msg text";
// 3 create producer record with topic and message
ProducerRecord<String, String> record = new ProducerRecord<String, String>(topic, txt);
// 4 use the Kafka producer to send the record
producer.send(record);
Create a DStream
DStream: a sequence of RDDs representing a stream of data
val ssc = new StreamingContext(sparkConf, Seconds(5))
// create an input stream for the set of topics
val dStream = KafkaUtils.createDirectStream[String, String](ssc, kafkaParams, topicsSet)
batch time 0 to 1
batch time 1 to 2
batch time 2 to 3
dStream
Stored in memory as an RDD
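The way a stream is cut into one batch per interval can be mimicked without Spark. The sketch below groups timestamped events into fixed-width batches. It is an assumption-laden illustration: `MicroBatchSketch` and its method are made-up names, the timestamps stand in for arrival time, and they are assumed to be non-negative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: groups timestamped events into fixed-width batches,
// mimicking how a DStream is cut into one RDD per batch interval.
final class MicroBatchSketch {
    // Returns batches keyed by batch start time in milliseconds, in time order.
    // timestampsMs and events are parallel arrays of the same length.
    static Map<Long, List<String>> batch(long[] timestampsMs, String[] events, long intervalMs) {
        Map<Long, List<String>> batches = new TreeMap<>();
        for (int i = 0; i < events.length; i++) {
            // Round the timestamp down to the start of its interval.
            long batchStart = (timestampsMs[i] / intervalMs) * intervalMs;
            batches.computeIfAbsent(batchStart, k -> new ArrayList<>()).add(events[i]);
        }
        return batches;
    }

    public static void main(String[] args) {
        long[] ts = {100L, 4900L, 5000L, 12000L};
        String[] ev = {"a", "b", "c", "d"};
        // With a 5000 ms interval this prints {0=[a, b], 5000=[c], 10000=[d]}
        System.out.println(batch(ts, ev, 5000L));
    }
}
```

The 5000 ms interval corresponds to the `Seconds(5)` batch interval passed to the StreamingContext on the slide.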
Message Data to Sensor Object
case class Sensor(resid: String, date: String, time: String, hz: Double, disp: Double,
                  flo: Double, sedPPM: Double, psi: Double, chlPPM: Double)

// Parse CSV Strings into Sensor objects
def parseSensor(str: String): Sensor = {
  val p = str.split(",")
  Sensor(p(0), p(1), p(2), p(3).toDouble, p(4).toDouble, p(5).toDouble,
         p(6).toDouble, p(7).toDouble, p(8).toDouble)
}
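For readers working from the Java side, the same parsing step might look like the sketch below. It is a hypothetical counterpart of the Scala function, not code from the demo: the class name `SensorParseSketch` is made up, and the check for exactly nine fields is an added assumption that the Scala version does not perform.

```java
// Illustrative only: a Java counterpart of the Scala parseSensor function.
final class SensorParseSketch {
    static final class Sensor {
        final String resid, date, time;
        final double hz, disp, flo, sedPPM, psi, chlPPM;

        Sensor(String resid, String date, String time, double hz, double disp,
               double flo, double sedPPM, double psi, double chlPPM) {
            this.resid = resid; this.date = date; this.time = time;
            this.hz = hz; this.disp = disp; this.flo = flo;
            this.sedPPM = sedPPM; this.psi = psi; this.chlPPM = chlPPM;
        }
    }

    // Parse one CSV line; rejects lines that do not have exactly nine fields.
    static Sensor parse(String line) {
        String[] p = line.split(",");
        if (p.length != 9) {
            throw new IllegalArgumentException("expected 9 fields, got " + p.length);
        }
        return new Sensor(p[0], p[1], p[2],
                Double.parseDouble(p[3]), Double.parseDouble(p[4]), Double.parseDouble(p[5]),
                Double.parseDouble(p[6]), Double.parseDouble(p[7]), Double.parseDouble(p[8]));
    }

    public static void main(String[] args) {
        Sensor s = parse("COHUTTA,3/10/14,1:01,10.27,1.73,881,1.56,85,1.94");
        System.out.println(s.resid + " psi=" + s.psi); // prints COHUTTA psi=85.0
    }
}
```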
Process DStream
// Parse message values into Sensor objects
val sensorDStream = dStream.map(_._2).map(parseSensor)
(Diagram: each dStream RDD, for batch time 0 to 1, 1 to 2 and 2 to 3, is mapped to a sensorDStream RDD. New RDDs are created for every batch.)
DataFrame and SQL Operations
// for each RDD
sensorDStream.foreachRDD { rdd =>
  val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
  // needed for the implicit RDD to DataFrame conversion (toDF)
  import sqlContext.implicits._
  // convert RDD to DataFrame
  rdd.toDF().registerTempTable("sensor")
  // get the avg, max and min for pump values
  val res = sqlContext.sql(
    """SELECT resid, date,
         max(hz) as maxhz, min(hz) as minhz, avg(hz) as avghz,
         max(disp) as maxdisp, min(disp) as mindisp, avg(disp) as avgdisp,
         max(flo) as maxflo, min(flo) as minflo, avg(flo) as avgflo,
         max(psi) as maxpsi, min(psi) as minpsi, avg(psi) as avgpsi
       FROM sensor GROUP BY resid, date""")
  res.show()
}
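What the GROUP BY query computes can be checked by hand on a few rows. The sketch below reproduces the aggregation for a single column (psi) in plain Java, without Spark. The class name `PumpStatsSketch` is hypothetical; the input rows are the three sample lines from the time series slide, and psi is taken to be the eighth CSV field, following the Sensor field order.

```java
import java.util.DoubleSummaryStatistics;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: the GROUP BY resid, date aggregation for one column (psi), without Spark.
final class PumpStatsSketch {
    // Input lines are CSV in the slide's layout; psi is the eighth field (index 7).
    static Map<String, DoubleSummaryStatistics> psiStats(String[] csvLines) {
        Map<String, DoubleSummaryStatistics> stats = new TreeMap<>();
        for (String line : csvLines) {
            String[] p = line.split(",");
            String group = p[0] + "," + p[1]; // resid and date form the grouping key
            stats.computeIfAbsent(group, k -> new DoubleSummaryStatistics())
                 .accept(Double.parseDouble(p[7]));
        }
        return stats;
    }

    public static void main(String[] args) {
        String[] lines = {
            "COHUTTA,3/10/14,1:01,10.27,1.73,881,1.56,85,1.94",
            "COHUTTA,3/10/14,1:03,10.47,1.732,882,1.7,92,0.66",
            "COHUTTA,3/10/14,1:02,9.67,1.731,882,0.52,87,1.79"
        };
        DoubleSummaryStatistics s = psiStats(lines).get("COHUTTA,3/10/14");
        // For the three sample rows this prints 85.0 92.0 88.0 (min, max, average)
        System.out.println(s.getMin() + " " + s.getMax() + " " + s.getAverage());
    }
}
```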
Streaming Application Output
Save to HBase
rdd.map(Sensor.convertToPut).saveAsHadoopDataset(jobConfig)
Output operation: persist data to external storage
(Diagram: for each batch, batch time 0 to 1, 1 to 2 and 2 to 3, the linesRDD is mapped to a sensorRDD, and the resulting Put objects are saved to HBase.)
Start Receiving Data
sensorDStream.foreachRDD { rdd =>
  . . .
}
// Start the computation
ssc.start()
// Wait for the computation to terminate
ssc.awaitTermination()
Building a Complete Data Architecture
Sources/Apps
Stream Processing
Bulk Processing
MapR Converged Data Platform: MapR File System (MapR-FS), MapR Database (MapR-DB), MapR Streams
Azure and MapR Resources – 3 steps to get started
• Azure Overview: https://www.mapr.com/partners/partner/microsoft-azure-microsofts-cloud-computing-platform-moving-faster-achieving-more
• 7 Steps to Deploy the MapR Sandbox on Azure: https://www.mapr.com/blog/7-steps-deploy-mapr-sandbox-microsoft-azure
• Azure Test Drive: http://mapr.testdrivelabs.com/ (subject to change)
Q & A
Engage with us!
1. Read the explanation and download the code:
– https://www.mapr.com/blog/fast-scalable-streaming-applications-mapr-streams-spark-streaming-and-mapr-db
– https://www.mapr.com/blog/spark-streaming-hbase
2. Get Started: MapR Converged Data Platform https://www.mapr.com/get-started-with-mapr
3. Get Answers: MapR Converge Community https://community.mapr.com/community/answers
4. Get Trained: MapR On-Demand Training https://learn.mapr.com