
Big Data Journey

Posted: 09-Jan-2017
Uploaded by: tugdual-grall

© 2015 MapR Technologies

Big Data Journey with Hadoop & MapR

Tug Grall [email protected] @tgrall



Big Data Journey

Tug Grall [email protected] @tgrall

David Pilato [email protected] @dadoonet


WHY?

https://www.domo.com/

Building new applications

Can I use my existing tools?

(Big) Data Platform
(Big) Data Project

Ingest

Store

Process

Consume

Ingest Data

Copy files in HDFS

hadoop fs -put dailylogs-log.zip /logs/2015/09/10/

Import RDBMS data

sqoop import --connect jdbc:mysql://db.foo.com/somedb \
  --table customers --target-dir /incremental_dataset --append

Files / HBase / Hive

Import RDBMS data with Logstash

input {
  jdbc {
    jdbc_connection_string => "jdbc:postgresql://localhost:5432/mydb"
    jdbc_user => "postgres"
    jdbc_driver_library => "/path/to/postgresql-9.4-1201.jdbc41.jar"
    jdbc_driver_class => "org.postgresql.Driver"
    statement => "SELECT * from contacts"
  }
}

What’s “wrong”?

Batch?

Streaming

Flume, Kafka, Logstash to the rescue

Sources: Log, App Events, Twitter, Sensors
Sinks: HDFS, MapR-FS, Alerts, Elasticsearch, DB

Producers: Log, App Events, Twitter, Sensors
Broker
Consumers: HDFS, MapR-FS, Alerts, Elasticsearch, DB

Stream data into Hadoop using Flume

Servers → Files / HBase / Hive
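A Flume flow like the one above is defined as an agent with a source, a channel, and a sink. A minimal sketch, assuming an agent named `agent` that tails a web server log into HDFS; agent, component names, and paths are illustrative, not from the deck:

```properties
# Hypothetical Flume agent: tail a web access log and write it to HDFS.
agent.sources = weblog
agent.channels = mem
agent.sinks = hdfs-out

# Source: follow the access log as it grows.
agent.sources.weblog.type = exec
agent.sources.weblog.command = tail -F /var/log/httpd/access.log
agent.sources.weblog.channels = mem

# Channel: buffer events in memory between source and sink.
agent.channels.mem.type = memory

# Sink: write events into date-partitioned HDFS directories.
agent.sinks.hdfs-out.type = hdfs
agent.sinks.hdfs-out.hdfs.path = /logs/%Y/%m/%d
agent.sinks.hdfs-out.channel = mem
```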

Streams using Kafka

Producers → Broker → Consumers → Files / HBase / Hive, Alert
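The decoupling a broker like Kafka provides can be illustrated with a toy in-memory stand-in (this is not Kafka itself, just a sketch of the pattern): producers append to a named topic log, and each consumer group keeps its own read offset, so independent consumers process the same stream at their own pace.

```python
from collections import defaultdict

# Toy in-memory broker illustrating the producer/consumer pattern.
class Broker:
    def __init__(self):
        self.topics = defaultdict(list)   # topic -> ordered log of messages
        self.offsets = defaultdict(int)   # (topic, group) -> next read position

    def publish(self, topic, message):
        self.topics[topic].append(message)

    def poll(self, topic, group):
        """Return the messages this consumer group has not read yet."""
        pos = self.offsets[(topic, group)]
        log = self.topics[topic]
        self.offsets[(topic, group)] = len(log)
        return log[pos:]

broker = Broker()
broker.publish("logs", "GET /index.html 200")
broker.publish("logs", "GET /missing 404")

# Two independent consumer groups each see the full stream.
print(broker.poll("logs", "hdfs-writer"))  # both messages
print(broker.poll("logs", "alerting"))     # both messages
broker.publish("logs", "POST /login 500")
print(broker.poll("logs", "alerting"))     # only the new message
```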

Stream data using Logstash
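A typical streaming Logstash pipeline reads a file, parses each event, and indexes it into Elasticsearch. A minimal sketch, assuming a recent Logstash; the log path and index name are illustrative:

```
input {
  file {
    path => "/var/log/httpd/access.log"
  }
}
filter {
  # Parse each line with the standard Apache combined log pattern.
  grok {
    match => { "message" => "%{COMBINEDAPACHELOG}" }
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "weblogs"
  }
}
```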

Data Storage
Data Format

How to store your data?

• Files in a distributed file system
• Rows in NoSQL Table
• Index in Search Engine

Process Data

Data Processing

• Transform the data
• Enrich the data

• Examples:
  • Store data in multiple formats
  • Aggregate data
  • Build Recommendations
  • …

MapReduce Processing Model

• Define mappers
• Shuffling is automatic
• Define reducers
• For complex work, chain jobs together

– Use a higher level language or DSL that does this for you
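The model above can be sketched in plain Python (an in-memory simulation, not Hadoop): you define the map and reduce functions, and the framework handles the shuffle, grouping map output by key before it reaches the reducers.

```python
from collections import defaultdict
from itertools import chain

def mapper(line):
    # Map step: emit (word, 1) for every word in the line.
    for word in line.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle step (automatic in Hadoop): group values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    # Reduce step: sum the counts for each word.
    return (key, sum(values))

lines = ["the quick brown fox", "the lazy dog", "the fox"]
mapped = chain.from_iterable(mapper(line) for line in lines)
counts = dict(reducer(k, v) for k, v in shuffle(mapped).items())
print(counts["the"])  # 3
```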

Apache Spark: Fast Big Data

• Easy to Develop
  – Rich APIs in Java, Scala, Python
  – Interactive shell

• Fast to Run
  – General execution graphs
  – In-memory storage

Spark: Unified Platform

Spark SQL | Spark Streaming (streaming) | MLlib (machine learning) | GraphX (graph computation)

Spark (general execution engine)

Hadoop YARN | Mesos

Distributed File System (HDFS, MapR-FS, S3, …)

Elasticsearch / Watcher

Query the data

Files / HBase / Hive / Index

Discovery/Analytics

SQL strikes back!

Files / HBase / Hive

SQL on Hadoop

• SQL Shell
• JDBC / ODBC
• BI Tools
• Reporting
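With a SQL-on-Hadoop engine such as Apache Drill (mentioned in the conclusion), files on the cluster can be queried in place with ANSI SQL, without declaring a schema first. A hedged sketch, where the path and field names are hypothetical:

```sql
-- Query raw JSON log files directly via Drill's dfs storage plugin.
SELECT t.status, COUNT(*) AS hits
FROM dfs.`/logs/2015/09/10/access.json` AS t
GROUP BY t.status
ORDER BY hits DESC;
```

The same query can come from the SQL shell, a JDBC/ODBC connection, or a BI tool.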

Elasticsearch

Kibana as a frontend

Example: Recommendation Platform

Machine Learning

MapR Cluster

HBase / MapR-DB

MapR-FS

Add recommendations to movies

Capture Ratings
Movies & Recommendations

Movie Database

Conclusion

• If possible, use streams: Kafka, Logstash

• Advanced data processing and machine learning: Spark

• Expose your data using SQL for your “BI folks”: Drill

• Aggregation and full-text search: Elasticsearch

• Data visualisation: Kibana
