Data Science Company
Real Time Big Data
InfoFarm Seminar, 18/11/2015
Agenda
• Typical Big Data Landscape
• The need for Real Time Big Data
• Real Time Data Ingestion
• Tools for Real Time Big Data
  – Apache Spark
  – Apache Storm
  – Search
• Q&A
• Lunch
A Typical Big Data Landscape
• Data Silo
• Batch environment
• Periodical Analytics/statistics
• Data Source for new systems
The need for Real Time Big Data
• Obtaining analytical results faster
  – Processing more often than once a day
• Load evens out over the day
• Past/Present/Future
  – Alerting on certain events
  – Updating prediction models on the fly
• Faster feedback to end users
  – See the results of your actions right away
Perfect fits for Real Time Processing
• Anomaly detection
  – Abnormal sensor readings
  – Abnormal volumes of log files
  – Fraud detection
• Real time updates to recommender models
  – Fast new recommendations in e-commerce
  – Support for trending items
  – Fast responses to events happening right now
• Real time updates of clustering models
• Improving classification based on current events
• Can be run side by side with traditional historical models
Apache Kafka
• Fast
• Scalable
• Durable
• Distributed
Apache Kafka - Overview
• Producers write messages to Kafka topics
• Consumers process messages from a topic
• Kafka runs on a cluster of servers; each server is called a broker
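To make this concrete, a minimal producer sketch in Scala using the Kafka Java client; the broker address (localhost:9092), topic name ("events"), and message contents are assumptions for illustration.

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")   // assumed broker address
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

val producer = new KafkaProducer[String, String](props)
// records with the same key are routed to the same partition of the topic
producer.send(new ProducerRecord[String, String]("events", "sensor-42", "temperature=21.5"))
producer.close()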
Apache Kafka - Topics
• Topics are split up into different partitions
• Partitions are replicated across the cluster
• The order of messages is guaranteed within a partition
• Messages are stored for a configurable period of time
• Producers decide which partition they write to
• Consumers keep the offset of the messages they have read (see the consumer sketch below)
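A matching consumer sketch, under the same assumptions plus a hypothetical consumer group "demo-group"; each record exposes the partition it came from and its offset within that partition. This uses the newer KafkaConsumer API; older clients expose the same concepts through a different interface.

import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer
import scala.collection.JavaConverters._

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("group.id", "demo-group")   // offsets are tracked per consumer group
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(Collections.singletonList("events"))

while (true) {
  // poll returns the records published since the offset this group last read
  for (record <- consumer.poll(1000).asScala)
    println(s"partition=${record.partition} offset=${record.offset} value=${record.value}")
}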
The Hadoop Ecosystem
[Diagram: HDFS (distributed file system; alternatively Amazon S3 or a local FS) at the base, YARN (resource management) on top, MapReduce as the processing engine, and above it HBase (NoSQL), Hive (data mart), Pig (scripting), Sqoop (SQL import/export), Mahout (machine learning), …]
The Hadoop Ecosystem
[Diagram: the same stack, extended with Spark and Storm as additional processing engines alongside MapReduce, and Spark SQL and Spark MLlib on top of Spark]
Spouts
• Source of streams into the topology
• Can be reliable or unreliable
• Support for:
  – Kafka
  – Kestrel
  – RabbitMQ
  – JMS
  – Amazon Kinesis
  – Build your own, e.g. Twitter (see the sketch below)
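A sketch of the build-your-own case: a minimal spout emitting random sentences. Package names follow Storm 1.x and later; releases from the time of this talk used backtype.storm instead of org.apache.storm.

import org.apache.storm.spout.SpoutOutputCollector
import org.apache.storm.task.TopologyContext
import org.apache.storm.topology.OutputFieldsDeclarer
import org.apache.storm.topology.base.BaseRichSpout
import org.apache.storm.tuple.{Fields, Values}
import scala.util.Random

class RandomSentenceSpout extends BaseRichSpout {
  private var collector: SpoutOutputCollector = _
  private val sentences = Array("the cow jumped over the moon", "an apple a day keeps the doctor away")

  override def open(conf: java.util.Map[_, _], ctx: TopologyContext,
                    out: SpoutOutputCollector): Unit =
    collector = out

  override def nextTuple(): Unit = {
    Thread.sleep(100)
    // emitting without a message id makes this spout unreliable; passing a message id
    // to emit(...) would enable ack/fail tracking and make it reliable
    collector.emit(new Values(sentences(Random.nextInt(sentences.length))))
  }

  override def declareOutputFields(d: OutputFieldsDeclarer): Unit =
    d.declare(new Fields("sentence"))
}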
Bolts
• Where all the processing happens
• Filtering, functions, aggregations, joins, database updates, …
• You subscribe to the streams of other components (bolts or spouts)
• Must ack every tuple they process (see the sketch below)
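A minimal bolt sketch, under the same version assumptions as the spout above: it splits incoming sentences into words, anchors each emitted tuple to its input, and acks every tuple it processes.

import org.apache.storm.task.{OutputCollector, TopologyContext}
import org.apache.storm.topology.OutputFieldsDeclarer
import org.apache.storm.topology.base.BaseRichBolt
import org.apache.storm.tuple.{Fields, Tuple, Values}

class SplitSentenceBolt extends BaseRichBolt {
  private var collector: OutputCollector = _

  override def prepare(conf: java.util.Map[_, _], ctx: TopologyContext,
                       out: OutputCollector): Unit =
    collector = out

  override def execute(tuple: Tuple): Unit = {
    // anchoring the output tuples to the input lets Storm track the tuple tree
    tuple.getStringByField("sentence").split(" ").foreach { word =>
      collector.emit(tuple, new Values(word))
    }
    collector.ack(tuple)   // every processed tuple must be acked (or failed)
  }

  override def declareOutputFields(d: OutputFieldsDeclarer): Unit =
    d.declare(new Fields("word"))
}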
Parallelism
• Spouts & Bolts actually run as multiple instances on different machines
• Making sure that the correct messages go to the correct instance is up to the developer
Stream Groupings
• Defines how a stream should be partitioned among the bolt's tasks
• Some examples:
  – Round robin (shuffle grouping)
  – Based on a key (fields grouping)
  – Every task receives every tuple (all grouping)
  – A specific instance
  – …
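A wiring sketch showing two of these groupings, reusing the spout and bolt sketched above; WordCountBolt is a hypothetical counting bolt standing in for your own.

import org.apache.storm.topology.TopologyBuilder
import org.apache.storm.tuple.Fields

val builder = new TopologyBuilder()
builder.setSpout("sentences", new RandomSentenceSpout(), 2)

// shuffle grouping: tuples are spread evenly across the four bolt tasks
builder.setBolt("split", new SplitSentenceBolt(), 4)
  .shuffleGrouping("sentences")

// fields grouping: tuples with the same "word" value always reach the same task,
// so each task can keep a consistent partial count
builder.setBolt("count", new WordCountBolt(), 4)
  .fieldsGrouping("split", new Fields("word"))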
Storm Ups and Downs
• Really real time
• Very powerful
• Built for performance
• Very low level (comparable to MapReduce)
• Trivial tasks can become hard (sorting, joins, …)
Spark Streaming Input
• Kafka
• Flume
• Kinesis
• Twitter
• ZeroMQ
• HDFS
• TCP sockets
Windowing
• You can group multiple batches together into a sliding window.
• E.g. all the events from the last 60 seconds
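A minimal sketch, assuming lines of text arrive on a local TCP socket (a hypothetical source, e.g. started with nc -lk 9999): word counts over a 60-second window, recomputed every 10 seconds.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("WindowSketch").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(10))   // batch interval: 10 seconds

val lines = ssc.socketTextStream("localhost", 9999)

// window length 60 seconds, sliding every 10 seconds
val counts = lines.flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(60), Seconds(10))

counts.print()
ssc.start()
ssc.awaitTermination()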
Spark Streaming Strengths
• Works just like regular Spark processing, just replace SparkContext with StreamingContext
• Full integration with other Spark libraries (Spark SQL, Spark MLlib, …)
• Ease of development
• Scalable, fault-tolerant, …
Spark Streaming Example
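A minimal sketch of what such an example can look like, reading from Kafka with the spark-streaming-kafka connector of that era; the broker address and topic name ("events") are assumptions.

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

// same pattern as a regular Spark job: a StreamingContext instead of a SparkContext
val conf = new SparkConf().setAppName("KafkaStreamSketch").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(5))   // one micro-batch every 5 seconds

val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("events"))

// each record is a (key, value) pair; print the first values of every batch
stream.map(_._2).print()

ssc.start()
ssc.awaitTermination()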
Data output bottlenecks
• Pig & Hive are quite slow
• No visual feedback from results
• Specific calculations (cubing) of metrics
  – Reporting tools cannot handle the dimensions of the data
Elasticsearch
• Document store (ideal for denormalized data)
• Distributed
• Highly available
• Open Source
• Real Time (Inserts & Searches)
Hive Integration
• Writing to Elasticsearch from Hive
CREATE EXTERNAL TABLE artists (
  id BIGINT,
  name STRING,
  links STRUCT<url:STRING, picture:STRING>)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'radio/artists');

-- insert data into Elasticsearch from another table called 'source'
INSERT OVERWRITE TABLE artists
  SELECT NULL, s.name, named_struct('url', s.url, 'picture', s.picture)
  FROM source s;
Hive Integration
• Reading from Elasticsearch in Hive
CREATE EXTERNAL TABLE artists (
  id BIGINT,
  name STRING,
  links STRUCT<url:STRING, picture:STRING>)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'radio/artists', 'es.query' = '?q=me*');

-- stream data from Elasticsearch
SELECT * FROM artists;
Pig Integration
• Writing to Elasticsearch from Pig
-- load data from HDFS into Pig using a schema
A = LOAD 'src/test/resources/artists.dat' USING PigStorage()
    AS (id:long, name, url:chararray, picture:chararray);
-- transform the data
B = FOREACH A GENERATE name, TOTUPLE(url, picture) AS links;
-- save the result to Elasticsearch
STORE B INTO 'radio/artists' USING org.elasticsearch.hadoop.pig.EsStorage();
Pig Integration
• Reading from Elasticsearch in Pig
-- execute an Elasticsearch query and load the results into Pig
A = LOAD 'radio/artists'
    USING org.elasticsearch.hadoop.pig.EsStorage('es.query=?q=me*');
DUMP A;
Spark Integration
• Writing to Elasticsearch from Spark
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.elasticsearch.spark._

val conf = ...
val sc = new SparkContext(conf)
// create an RDD here
rdd.saveToEs("spark/docs")
Spark Integration
• Reading from Elasticsearch in Spark
...
import org.elasticsearch.spark._
...

val conf = ...
val sc = new SparkContext(conf)
sc.esRDD("radio/artists", "?q=me*")
Storm Integration
• Writing to Elasticsearch from Storm
import org.elasticsearch.storm.EsBolt;

TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("spout", new RandomSentenceSpout(), 10);
builder.setBolt("es-bolt", new EsBolt("storm/docs"), 5)
       .shuffleGrouping("spout");
Storm Integration
• Reading from Elasticsearch in Storm
import org.elasticsearch.storm.EsSpout;

TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("es-spout", new EsSpout("storm/docs", "?q=me*"), 5);
builder.setBolt("bolt", new PrinterBolt()).shuffleGrouping("es-spout");
Kibana
• Visualization tool on top of Elasticsearch
• Allows ad-hoc querying & graphing
• Support for real time updates
• Create your own dashboards
Data Science Company
Real Time Big Data
InfoFarm Seminar, 18/11/2015