The State of BigData - meetup bigdata @ovh

@PennAnData

The State of Big Data 2016

Summary

1 Data Facts

2 Hadoop Basics

3 Beyond Batch : Streaming

4 Columnar Storage

5 Ecosystem

Big Data Facts

PART 1

3V's

Volume Velocity Variety

Volume ...

Data Production constantly growing

Data Retention widely increasing

Extract Value from your data

Storage Cost decreasing

Velocity ...

Data produced Faster

Get Real Time insight

Move from capture to Analysis

Get Actionable insight

Variety ...

Not only Structured Data

Toward mostly Unstructured

Text (articles, comments, tweets,...)

Images (id cards, bills,...)

Logs, metrics,...

Seek Time

• 5-10 ms

• about 100-200 seeks/s

Data Transfer Rate

Link speed        100 Mbps      1000 Mbps     10000 Mbps
Throughput        12.5 MB/s     125 MB/s      1250 MB/s

1 MB              80 ms         8 ms          0.8 ms
1 CD (700 MB)     56 s          5.6 s         0.56 s
1 GB (1000 MB)    1 min 20 s    8 s           0.8 s
1 DVD (4700 MB)   6 min 16 s    37.6 s        3.76 s
1 TB (1000 GB)    22 h 13 min   2 h 13 min    13 min
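The transfer times above follow from dividing the payload size by the link throughput. A minimal sketch of that arithmetic, assuming an ideal link with no protocol overhead (the function name `transfer_seconds` is illustrative, not from the talk):

```python
def transfer_seconds(size_mb: float, link_mbps: float) -> float:
    """Seconds needed to move size_mb megabytes over a link of link_mbps megabits/s."""
    throughput_mb_per_s = link_mbps / 8  # 8 bits per byte, overhead ignored
    return size_mb / throughput_mb_per_s


# Payload sizes in megabytes, using decimal units as in the table.
SIZES_MB = {"1 MB": 1, "1 CD": 700, "1 GB": 1000, "1 DVD": 4700, "1 TB": 1_000_000}

for label, size_mb in SIZES_MB.items():
    times = [transfer_seconds(size_mb, mbps) for mbps in (100, 1000, 10000)]
    print(label, times)
```

Moving 1 TB over a 100 Mbps link takes 80,000 seconds, which is the 22 h 13 min entry in the table.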

Data Transfer Rate

Link speed   100 Mbps     1000 Mbps    10000 Mbps
Throughput   12.5 MB/s    125 MB/s     1250 MB/s

1 min        750 MB       7.5 GB       75 GB
15 min       11 GB        112 GB       1 TB
1 hour       45 GB        450 GB       4.5 TB
1 day        1 TB         10.8 TB      108 TB
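Read the other way round, the same throughput figures give the volume a saturated link can carry in a given time. A small sketch under the same ideal-link assumption (`volume_mb` is an illustrative name):

```python
def volume_mb(duration_s: float, link_mbps: float) -> float:
    """Megabytes moved in duration_s seconds over a saturated link of link_mbps megabits/s."""
    return duration_s * link_mbps / 8  # 8 bits per byte, overhead ignored


DURATIONS_S = {"1 min": 60, "15 min": 900, "1 hour": 3600, "1 day": 86_400}

for label, seconds in DURATIONS_S.items():
    volumes = [volume_mb(seconds, mbps) for mbps in (100, 1000, 10000)]
    print(label, volumes)
```

The table rounds these values: one day at 100 Mbps is 1,080,000 MB, shown as 1 TB.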

Payload

Definition

“Big Data really is about having insights and making an impact on your business. If you aren’t taking advantage of the data you’re collecting, then you just have a pile of data, you don’t have Big Data.”

#BigData

Introducing Hadoop

PART 2

#DougCutting

#Tools

Timeline

#HDFS

#Blocks

HDFS

/ HDFS

File

Blocks

DataNodes

/ HDFS / Replication

DataNodes
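The File / Blocks / DataNodes / Replication slides can be summarized in a toy model: a file is cut into fixed-size blocks and each block is copied to several DataNodes. The sketch below assumes the Hadoop 2.x default block size of 128 MB and a replication factor of 3; the round-robin placement is a deliberate simplification (real HDFS placement is rack-aware), and all function names are hypothetical.

```python
BLOCK_SIZE_MB = 128  # assumption: Hadoop 2.x default block size
REPLICATION = 3      # replication factor shown later in the deck


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the list of block sizes for a file; only the last block may be smaller."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    return [block_size_mb] * full_blocks + ([remainder] if remainder else [])


def place_replicas(num_blocks, datanodes, replication=REPLICATION):
    """Toy round-robin placement: each block gets `replication` distinct DataNodes."""
    if replication > len(datanodes):
        raise ValueError("not enough DataNodes for the requested replication")
    return {
        block: [datanodes[(block + r) % len(datanodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


blocks = split_into_blocks(300)
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
print(blocks)
print(placement)
```

Losing any single DataNode in this layout still leaves two replicas of every block, which is what allows the NameNode to schedule re-replication.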

NameNode

/ HDFS / NameNode

DataNodes

NameNodes

/ HDFS / Namespace #Federation

/ HDFS / HA

#HighAvailability

NN1 NN2

/ HDFS / HA

Failover Controller

NameNode side
● Health monitor
● Manage HA state

ZooKeeper side
● Monitor state
● Maintain or try to get the active lock
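The failover controller's two roles can be illustrated with a toy simulation: each controller watches its NameNode's health and competes for a single active lock. This is only a sketch; `ActiveLock` is an in-memory stand-in for the ZooKeeper lock, and both class names are hypothetical rather than Hadoop APIs.

```python
class ActiveLock:
    """In-memory stand-in for the lock kept in ZooKeeper (assumption for illustration)."""

    def __init__(self):
        self.holder = None

    def try_acquire(self, name):
        if self.holder is None:
            self.holder = name
        return self.holder == name

    def release(self, name):
        if self.holder == name:
            self.holder = None


class FailoverController:
    """Monitors one NameNode and keeps, or tries to get, the active lock."""

    def __init__(self, namenode, lock):
        self.namenode = namenode
        self.lock = lock
        self.healthy = True
        self.state = "standby"

    def tick(self):
        if not self.healthy:
            # Unhealthy NameNode: give up the lock so the peer can take over.
            self.lock.release(self.namenode)
            self.state = "standby"
        elif self.lock.try_acquire(self.namenode):
            self.state = "active"
        else:
            self.state = "standby"
        return self.state


lock = ActiveLock()
nn1 = FailoverController("NN1", lock)
nn2 = FailoverController("NN2", lock)
print(nn1.tick(), nn2.tick())  # NN1 gets the lock first
nn1.healthy = False            # simulate a failed health check on NN1
print(nn1.tick(), nn2.tick())  # NN1 releases the lock, NN2 takes over
```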

#Five9rulez

/ HDFS / Client #Read

#DataLocality
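Data locality on the read path means the client is directed to the closest replica of each block. A minimal sketch, assuming a simplified two-level topology (rack, host) with the usual distance convention of 0 for the same node, 2 for the same rack and 4 for another rack; `pick_replica` is a hypothetical helper, not an HDFS API.

```python
def network_distance(client, replica):
    """client and replica are (rack, host) tuples."""
    if client == replica:
        return 0  # same node
    if client[0] == replica[0]:
        return 2  # same rack, different node
    return 4      # different rack


def pick_replica(client, replicas):
    """Return the replica location closest to the client."""
    return min(replicas, key=lambda replica: network_distance(client, replica))


replicas = [("rack2", "dn7"), ("rack1", "dn3"), ("rack1", "dn1")]
print(pick_replica(("rack1", "dn1"), replicas))
print(pick_replica(("rack3", "dn9"), replicas))
```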

/ HDFS / Client #Write

#ReplicationFactor3

#MapReduce

MapReduce

#MAP

MapReduce

#SHUFFLE

MapReduce

#REDUCE

MapReduce

Input → Input Pairs → map → Intermediate Pairs → reduce → Output Pairs → Output

Input Pairs: <key1, val1>, <key2, val2>, <key3, val3>, ... <key500, val500>, <key501, val501>, <key502, val502>, each handed to a map task

Intermediate Pairs: <ikey1, ival1>, <ikey2, ival2>, <ikey1, ival3>, <ikey2, ival4>, ... <ikey150, ival520>, <ikey2, ival521>, <ikey150, ival522>, routed by key to the reduce tasks

Output Pairs: <okey1, oval1>, <okey2, oval2>, ... <okey150, oval150>, emitted by the reduce tasks

Step 1: Split

Step 2: Map

Step 3: Shuffle / Sort

Step 4: Reduce

Step 5: Store
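The five steps can be followed end to end on the classic word-count example. This is a single-process sketch in plain Python, not Hadoop code; the function names are illustrative.

```python
from collections import defaultdict


def split(text):
    """Step 1: cut the input into independent splits (here, one per line)."""
    return text.splitlines()


def map_fn(line):
    """Step 2: emit one intermediate <word, 1> pair per word."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Step 3: group intermediate values by key and sort the keys."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_fn(key, values):
    """Step 4: fold all values of one key into a single output pair."""
    return key, sum(values)


def word_count(text):
    intermediate = [pair for line in split(text) for pair in map_fn(line)]
    grouped = shuffle(intermediate)
    # Step 5: store the output pairs (here, simply return them as a dict).
    return dict(reduce_fn(key, values) for key, values in grouped.items())


print(word_count("big data\nbig insight"))
```

In a real cluster the map calls run in parallel on the nodes holding the input blocks, and the shuffle moves intermediate pairs across the network so that all values of a key reach the same reducer.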

#Pig & #Hive

Hive

● Tez

● Impala

● Presto.io

#HBase

HBase

#Model

HBase

#PhysicalStorage

HBase

#Scale

HBase

#Meta

#HBaseArch

HBase #SQL

HBase

#Features

● Coprocessor
● Auto-sharding
● Scan (full, range)
● Schemaless
● Cell versioning
● Battle tested
● Compactions
● Replication
● Custom filters
● Transactional
● Low latency
● Active community
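Three of these features (schemaless columns, cell versioning and range scans) follow from the data model: rows kept sorted by key, each cell holding timestamped versions. The toy class below sketches that model in memory; `ToyTable` and its methods are hypothetical and unrelated to the HBase client API.

```python
from bisect import bisect_left, insort


class ToyTable:
    def __init__(self, max_versions=3):
        self.rows = {}         # row key -> {column -> [(timestamp, value), ...]}
        self.sorted_keys = []  # row keys kept sorted, which makes range scans cheap
        self.max_versions = max_versions

    def put(self, row, column, value, ts):
        if row not in self.rows:
            insort(self.sorted_keys, row)
            self.rows[row] = {}
        # Schemaless: any column name can appear on any row.
        versions = self.rows[row].setdefault(column, [])
        versions.append((ts, value))
        versions.sort(key=lambda version: version[0], reverse=True)
        del versions[self.max_versions:]  # keep only the newest versions

    def get(self, row, column):
        """Return the most recent value of a cell, or None."""
        versions = self.rows.get(row, {}).get(column, [])
        return versions[0][1] if versions else None

    def scan(self, start=None, stop=None):
        """Yield rows with start <= key < stop, in key order (full scan if both are None)."""
        lo = 0 if start is None else bisect_left(self.sorted_keys, start)
        hi = len(self.sorted_keys) if stop is None else bisect_left(self.sorted_keys, stop)
        for key in self.sorted_keys[lo:hi]:
            yield key, self.rows[key]


table = ToyTable()
table.put("user#2", "info:name", "Bob", ts=1)
table.put("user#1", "info:name", "Alice", ts=1)
table.put("user#1", "info:name", "Alicia", ts=2)
table.put("video#1", "meta:title", "Intro", ts=1)
print(table.get("user#1", "info:name"))
print([key for key, _ in table.scan("user#", "user$")])
```

Because keys are sorted, the choice of row key decides which rows sit next to each other and therefore which range scans are efficient.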

Beyond Batch : Streaming

PART 3

/ Streaming / Data Platform #Transport

/ Streaming / Data Platform / Kafka

/ Streaming / Frameworks

/ Streaming / Storm / Topology #Storm

/ Streaming / Storm / Topology #Parallelism

/ Streaming / Flink #Job

/ Streaming / Flink

#DataSet API #DataStream API

OK Steven, but a new DSL for each new hype tool? Come on...

Apache Beam

#Features

● Open-sourced Google Dataflow
● Unifies big data development
● Beam Model (from the Dataflow model)
● Parallel data processing pipelines
● Pluggable runners: Flink or Google Cloud Dataflow
● Portability
● SDKs: Java / Python
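The idea behind pluggable runners and portability is that a pipeline is described once and executed by interchangeable engines. The sketch below illustrates only that separation in plain Python. It is not the Beam SDK: `Pipeline`, `SequentialRunner` and `ThreadedRunner` are hypothetical names invented for this example.

```python
from concurrent.futures import ThreadPoolExecutor


class Pipeline:
    """Engine-independent description of a chain of element-wise transforms."""

    def __init__(self):
        self.steps = []

    def map(self, fn):
        self.steps.append(fn)
        return self

    def apply(self, element):
        for fn in self.steps:
            element = fn(element)
        return element


class SequentialRunner:
    """Runs the pipeline one element at a time."""

    def run(self, pipeline, elements):
        return [pipeline.apply(element) for element in elements]


class ThreadedRunner:
    """Runs the same pipeline on a thread pool; results keep the input order."""

    def __init__(self, workers=4):
        self.workers = workers

    def run(self, pipeline, elements):
        with ThreadPoolExecutor(max_workers=self.workers) as pool:
            return list(pool.map(pipeline.apply, elements))


pipeline = Pipeline().map(str.strip).map(str.upper)
data = [" flink ", " dataflow "]
print(SequentialRunner().run(pipeline, data))
print(ThreadedRunner().run(pipeline, data))
```

Swapping the runner changes how the work is executed, not the pipeline definition, which is the portability argument made on the slide.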

#Architecture

Lambda Architecture

Drawbacks

• Hard to merge for the serving layer

• Hard to maintain and operate both realtime and batch code in sync
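Both drawbacks show up even in a tiny model of the Lambda Architecture: the same count is implemented twice, and the serving layer has to merge the two views without double counting. A minimal sketch, assuming events are (timestamp, key) pairs and that the batch view covers everything before a cutoff; all names are illustrative.

```python
from collections import Counter


def batch_counts(events, cutoff):
    """Batch layer: recomputed from scratch over all events before the cutoff."""
    return Counter(key for ts, key in events if ts < cutoff)


def realtime_counts(events, cutoff):
    """Speed layer: incremental counts for events at or after the cutoff.

    Same logic as batch_counts written a second time, which is the
    'keep both code paths in sync' burden.
    """
    counts = Counter()
    for ts, key in events:
        if ts >= cutoff:
            counts[key] += 1
    return counts


def serve(batch_view, realtime_view, key):
    """Serving layer: the merge is correct only if both views agree on the cutoff."""
    return batch_view[key] + realtime_view[key]


events = [(1, "page_a"), (2, "page_b"), (3, "page_a"), (4, "page_a")]
cutoff = 3
print(serve(batch_counts(events, cutoff), realtime_counts(events, cutoff), "page_a"))
```

If the two layers disagree on the cutoff, or if one implementation drifts from the other, the merged result silently double counts or drops events.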

Kappa Architecture

From Storm to Flink

#Yarn

Yarn

#MapReduce

Yarn

#MessagePassing

Yarn

#StreamProcessing

Yarn

#DistributedLoadTest

Yarn

#ResourceManagement

Yarn Frameworks

#Mesos

Columnar Storage

PART 4

Columnar Storage

#ORC #Parquet
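The benefit of a columnar layout can be seen on a handful of records: an aggregate touches a single column, and values of the same type sitting together compress well. The sketch below uses plain Python lists and a simple run-length encoding to illustrate the idea; it does not reproduce the ORC or Parquet file formats, and the sample records are made up.

```python
rows = [
    {"user": "a", "country": "FR", "amount": 10},
    {"user": "b", "country": "FR", "amount": 5},
    {"user": "c", "country": "DE", "amount": 7},
]


def to_columns(records):
    """Pivot row-oriented records into one list per column."""
    return {name: [record[name] for record in records] for name in records[0]}


def run_length_encode(values):
    """Collapse consecutive repeated values into [value, count] pairs."""
    encoded = []
    for value in values:
        if encoded and encoded[-1][0] == value:
            encoded[-1][1] += 1
        else:
            encoded.append([value, 1])
    return encoded


columns = to_columns(rows)
print(sum(columns["amount"]))                 # reads only the 'amount' column
print(run_length_encode(columns["country"]))  # repeated values stored once
```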

Ecosystem

PART 5

Vendors

Integration

? @StevenLeRoux

2016