@PennAnData
The State of Big Data 2016
Summary
1 Data Facts
2 Hadoop Basics
3 Beyond Batch : Streaming
4 Columnar Storage
5 Ecosystem
Big Data Facts
PART 1
3V's
Volume Velocity Variety
Volume ...
Data Production constantly growing
Data Retention widely increasing
Extract Value from your data
Storage Cost decreasing
Velocity ...
Data produced Faster
Get Real Time insight
Move from capture to Analysis
Get Actionable insight
Variety ...
Not only Structured Data
Toward mostly Unstructured
Text (articles, comments, tweets,...)
Images (id cards, bills,...)
Logs, metrics,...
Seek Time
• 5-10 ms
• up to 200 seeks/s
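The seek figures above follow from simple arithmetic: at 5 ms per seek, a disk manages about 200 random seeks per second. A minimal sketch of that calculation; the function names and the one-million-record example are illustrative assumptions, not taken from the slides.

```python
def seeks_per_second(seek_time_ms: float) -> float:
    """Random seeks a single disk can perform per second."""
    return 1000.0 / seek_time_ms


def random_read_seconds(n_records: int, seek_time_ms: float) -> float:
    """Time spent on seeking alone when every record costs one seek."""
    return n_records * seek_time_ms / 1000.0


print(seeks_per_second(5))                 # best case from the slide: 5 ms
print(seeks_per_second(10))                # worst case from the slide: 10 ms
print(random_read_seconds(1_000_000, 5))   # hypothetical workload: 1M random reads
```

Random access to many small records is dominated by seek time, which is why the tools that follow favour large sequential reads.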
Data Transfer Rate
Mbps               100           1000          10000
MB/s               12.5          125           1250
1 MB               80 ms         8 ms          0.8 ms
1 CD (700 MB)      56 s          5.6 s         0.56 s
1 GB (1000 MB)     1 min 20 s    8 s           0.8 s
1 DVD (4700 MB)    6 min 16 s    37.6 s        3.76 s
1 TB (1000 GB)     22 h 13 min   2 h 13 min    13 min
Data Transfer Rate
Mbps               100           1000          10000
MB/s               12.5          125           1250
1 min              750 MB        7.5 GB        75 GB
15 min             11 GB         112 GB        1 TB
1 hour             45 GB         450 GB        4.5 TB
1 day              1 TB          10.8 TB       108 TB
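Both transfer-rate tables come from the same conversion: divide the link speed in Mbps by 8 to get MB/s, then either divide a size by it or multiply a duration by it. A small sketch, assuming decimal units (1 TB = 1,000,000 MB) as the tables do; the function names are illustrative.

```python
def mb_per_second(link_mbps: float) -> float:
    """Convert a link speed in megabits/s to megabytes/s."""
    return link_mbps / 8.0


def transfer_seconds(size_mb: float, link_mbps: float) -> float:
    """Time needed to move size_mb over the link."""
    return size_mb / mb_per_second(link_mbps)


def volume_mb(duration_s: float, link_mbps: float) -> float:
    """Data moved over the link during duration_s."""
    return duration_s * mb_per_second(link_mbps)


for link in (100, 1000, 10000):
    one_tb = transfer_seconds(1_000_000, link)        # seconds to move 1 TB
    per_day = volume_mb(24 * 3600, link) / 1_000_000  # TB moved in one day
    print(f"{link} Mbps: 1 TB in {one_tb:.0f} s, {per_day:.2f} TB per day")
```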
Payload
Definition
“Big Data really is about having insights and making an impact on your business. If you aren’t taking advantage of the data you’re collecting, then you just have a pile of data,
you don’t have Big Data.”
#BigData
Introducing Hadoop
PART 2
#DougCutting
#Tools
Timeline
#HDFS
#Blocks
HDFS
/ HDFS
File
Blocks
DataNodes
File
Blocks
DataNodes
/ HDFS / Replication
DataNodes
NameNode
/ HDFS / NameNode
DataNodes
NameNodes
/ HDFS / Namespace #Federation
/ HDFS / HA
#HighAvailability
NN1 NN2
/ HDFS / HA
Failover Controller
● NameNode Side
  ● Health monitor
  ● Manage HA State
● ZooKeeper Side
  ● Monitor State
  ● Maintain or Try to get Active Lock
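The failover controller's two roles can be illustrated with a toy simulation: each controller watches the health of its NameNode and tries to hold a single shared "active" lock. This is only a sketch under simplifying assumptions; `ActiveLock` is an in-memory stand-in for the ZooKeeper lock, and none of the class or method names are Hadoop or ZooKeeper APIs.

```python
class ActiveLock:
    """In-memory stand-in for the ZooKeeper lock: at most one holder."""

    def __init__(self):
        self.holder = None

    def try_acquire(self, name):
        if self.holder is None:
            self.holder = name
        return self.holder == name

    def release(self, name):
        if self.holder == name:
            self.holder = None


class FailoverController:
    """One controller per NameNode (hypothetical, simplified)."""

    def __init__(self, namenode, lock):
        self.namenode = namenode
        self.lock = lock
        self.healthy = True      # result of the health monitor
        self.state = "standby"   # HA state managed by the controller

    def tick(self):
        if not self.healthy:
            # Unhealthy NameNode: give up the lock so the peer can take over.
            self.lock.release(self.namenode)
            self.state = "standby"
        elif self.lock.try_acquire(self.namenode):
            self.state = "active"    # lock held (or just obtained)
        else:
            self.state = "standby"   # peer holds the lock


lock = ActiveLock()
nn1 = FailoverController("NN1", lock)
nn2 = FailoverController("NN2", lock)

for controller in (nn1, nn2):
    controller.tick()
print(nn1.state, nn2.state)   # NN1 took the lock first

nn1.healthy = False           # simulate a failure of NN1
for controller in (nn1, nn2):
    controller.tick()
print(nn1.state, nn2.state)   # NN2 takes over
```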
#Five9rulez
/ HDFS / Client #Read
#DataLocality
/ HDFS / Client #Write
#ReplicationFactor3
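To make blocks and the replication factor concrete, here is a small sketch that splits a file into blocks and assigns each block to three distinct DataNodes. The 128 MB block size is an assumption (a common HDFS default, not stated on the slides), and the round-robin placement is a deliberate simplification, not the real HDFS placement policy.

```python
import math

BLOCK_SIZE_MB = 128   # assumption: common HDFS default
REPLICATION = 3       # from the slide: #ReplicationFactor3


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the size of each block; only the last one may be partial."""
    if file_size_mb <= 0:
        return []
    n_blocks = math.ceil(file_size_mb / block_size_mb)
    last = file_size_mb - block_size_mb * (n_blocks - 1)
    return [block_size_mb] * (n_blocks - 1) + [last]


def place_replicas(n_blocks, datanodes, replication=REPLICATION):
    """Simplified round-robin placement on distinct DataNodes."""
    if len(datanodes) < replication:
        raise ValueError("need at least as many DataNodes as replicas")
    return {
        block: [datanodes[(block + r) % len(datanodes)] for r in range(replication)]
        for block in range(n_blocks)
    }


blocks = split_into_blocks(300)
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
raw_storage_mb = sum(blocks) * REPLICATION

print(blocks)
print(placement)
print(raw_storage_mb)
```

With three replicas, a 300 MB file consumes 900 MB of raw cluster storage, and any single DataNode can be lost without losing a block.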
#MapReduce
MapReduce
#MAP
MapReduce
#SHUFFLE
MapReduce
#REDUCE
MapReduce
[Figure: MapReduce data flow. Input pairs <key, val> go through map, which emits Intermediate Pairs <ikey, ival>; these are shuffled and sorted by key, then reduce emits Output Pairs <okey, oval>.]
Step 1: Split
Step 2: Map
Step 3: Shuffle / Sort
Step 4: Reduce
Step 5: Store
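The five steps above can be sketched in plain Python with a word count. This is an in-memory illustration of the programming model only, not Hadoop code; all function names are illustrative, and the final "store" step is reduced to returning a dictionary.

```python
from collections import defaultdict


def split(text, n_splits):
    """Step 1: cut the input into n_splits groups of lines."""
    lines = text.splitlines()
    return [lines[i::n_splits] for i in range(n_splits)]


def map_fn(line):
    """Step 2: emit one intermediate <word, 1> pair per word."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Step 3: group intermediate values by key, sorted by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_fn(key, values):
    """Step 4: fold all values of one key into an output pair."""
    return key, sum(values)


def run_job(text, n_splits=2):
    splits = split(text, n_splits)
    intermediate = [pair for s in splits for line in s for pair in map_fn(line)]
    grouped = shuffle(intermediate)
    # Step 5: "store" the output pairs (here: just a dict).
    return dict(reduce_fn(key, values) for key, values in grouped)


result = run_job("big data\nbig insight\ndata data")
print(result)
```

In a real cluster each split is mapped on the node that holds the data, and the shuffle moves intermediate pairs across the network to the reducers.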
#Pig &
#Hive
Hive
● Tez
● Impala
● Presto.io
#HBase
HBase
#Model
HBase
#PhysicalStorage
HBase
#Scale
HBase
#Meta
#HBaseArch
HBase #SQL
HBase
#Features
● Coprocessor
● Auto-sharding
● Scan (full, range)
● Schemaless
● Cell versioning
● Battle tested
● Compactions
● Replications
● Custom filters
● Transactional
● Low Latency
● Active Community
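Two of the features listed above, range scans and cell versioning, follow directly from the data model: rows are kept sorted by key, and each cell holds several timestamped versions. The toy table below illustrates the idea in memory; `ToyTable` and its methods are hypothetical and are not the HBase client API, and the row keys are made-up examples.

```python
import bisect


class ToyTable:
    """In-memory sketch of a sorted, versioned, schemaless table."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        self.row_keys = []   # kept sorted to allow range scans
        self.rows = {}       # row -> {column -> [(ts, value), ...], newest first}

    def put(self, row, column, value, ts):
        if row not in self.rows:
            bisect.insort(self.row_keys, row)
            self.rows[row] = {}
        versions = self.rows[row].setdefault(column, [])
        versions.append((ts, value))
        versions.sort(reverse=True)       # newest version first
        del versions[self.max_versions:]  # drop versions beyond the limit

    def get(self, row, column, versions=1):
        return self.rows.get(row, {}).get(column, [])[:versions]

    def scan(self, start_row, stop_row):
        """Yield the latest cells of rows with start_row <= key < stop_row."""
        lo = bisect.bisect_left(self.row_keys, start_row)
        hi = bisect.bisect_left(self.row_keys, stop_row)
        for key in self.row_keys[lo:hi]:
            yield key, {col: vers[0][1] for col, vers in self.rows[key].items()}


table = ToyTable()
table.put("user#001", "info:name", "Ada", ts=1)
table.put("user#001", "info:name", "Ada L.", ts=2)
table.put("user#002", "info:name", "Bob", ts=1)
table.put("video#001", "info:title", "Intro", ts=1)

latest = table.get("user#001", "info:name")
history = table.get("user#001", "info:name", versions=2)
users = list(table.scan("user#", "user$"))
print(latest, history, users)
```

Because rows are sorted, a well-chosen key prefix turns "all rows of one kind" into a cheap range scan instead of a full scan.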
Beyond Batch : Streaming
PART 3
/ Streaming / Data Platform #Transport
/ Streaming / Data Platform / Kafka
+ =
/ Streaming / Frameworks
/ Streaming / Storm / Topology #Storm
/ Streaming / Storm / Topology #Parallelism
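A topology with parallelism can be pictured as a source of tuples feeding several parallel tasks, with a grouping rule deciding which task receives which tuple. The sketch below mimics that shape in plain Python: a spout emits sentences, one bolt splits them, and a counting bolt runs as several tasks selected by hashing the word (what Storm calls a fields grouping). It is not Storm code; every name here is illustrative.

```python
import zlib
from collections import Counter


def sentence_spout():
    """Source of the stream (hypothetical sample data)."""
    for sentence in ["big data", "fast data", "big insight"]:
        yield sentence


def split_bolt(sentence):
    """First bolt: one tuple in, one tuple per word out."""
    return sentence.split()


def task_for(word, parallelism):
    """Grouping by field: the same word always lands on the same task."""
    return zlib.crc32(word.encode("utf-8")) % parallelism


def run_topology(parallelism=2):
    # One Counter per parallel task of the counting bolt.
    tasks = [Counter() for _ in range(parallelism)]
    for sentence in sentence_spout():
        for word in split_bolt(sentence):
            tasks[task_for(word, parallelism)][word] += 1
    return tasks


tasks = run_topology(parallelism=2)
for index, counts in enumerate(tasks):
    print(index, dict(counts))
```

Routing by key is what makes the per-task counts correct: no word is ever split across two tasks, so the tasks never need to coordinate.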
/ Streaming / Flink #Job
/ Streaming / Flink
#DataSet API #DataStream API
OK Steven, but a new DSL for each new hype tool? Come on...
Apache Beam
#Features
● Open-sourced Google Dataflow
● Unifies big data development
● Beam Model (from the Dataflow model)
● Parallel data processing pipelines
● Pluggable runners: Flink or Google Cloud Dataflow
● Portability
● SDKs: Java / Python
#Architecture
Lambda Architecture
Drawbacks
• Hard to merge for serving layer
• Hard to maintain and operate both realtime and batch code in sync
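Both drawbacks show up even in a tiny model of the Lambda architecture. In the sketch below, a batch view is recomputed from historical events, a realtime view is updated per event, and the serving layer merges the two at query time. The page-view events and all names are illustrative assumptions.

```python
from collections import Counter


def batch_view(events):
    """Batch layer: recompute page counts from the full history."""
    return Counter(event["page"] for event in events)


class RealtimeView:
    """Speed layer: update counts one event at a time."""

    def __init__(self):
        self.counts = Counter()

    def on_event(self, event):
        self.counts[event["page"]] += 1


def serve(page, batch, realtime):
    """Serving layer: merge both views at query time."""
    return batch[page] + realtime.counts[page]


history = [{"page": "/home"}, {"page": "/home"}, {"page": "/docs"}]
recent = [{"page": "/home"}]   # events not yet covered by the batch run

batch = batch_view(history)
realtime = RealtimeView()
for event in recent:
    realtime.on_event(event)

print(serve("/home", batch, realtime))
print(serve("/docs", batch, realtime))
```

The counting logic exists twice (`batch_view` and `on_event`) and must be kept in sync, and the merge is only correct if no event is counted by both layers.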
Kappa Architecture
From Storm to Flink
#Yarn
Yarn
#MapReduce
Yarn
#MessagePassing
Yarn
#StreamProcessing
Yarn
#DistributedLoadTest
Yarn
#ResourceManagement
Yarn Frameworks
#Mesos
Columnar Storage
PART 4
Columnar Storage
#ORC #Parquet
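The idea behind columnar formats can be shown without any library: store each column contiguously so that a query touching one column reads only that column, and so that repeated values compress well. The sketch below is a plain-Python illustration with made-up sample rows; it does not reproduce the actual ORC or Parquet file layouts, and run-length encoding is shown only as one example of a column-friendly encoding.

```python
rows = [
    {"id": 1, "country": "FR", "amount": 10.0},
    {"id": 2, "country": "FR", "amount": 5.5},
    {"id": 3, "country": "US", "amount": 7.0},
    {"id": 4, "country": "US", "amount": 2.5},
]


def to_columns(records):
    """Pivot row-oriented records into one list per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def run_length_encode(values):
    """Collapse runs of equal values into [value, count] pairs."""
    encoded = []
    for value in values:
        if encoded and encoded[-1][0] == value:
            encoded[-1][1] += 1
        else:
            encoded.append([value, 1])
    return encoded


columns = to_columns(rows)
total = sum(columns["amount"])                  # touches one column only
countries = run_length_encode(columns["country"])

print(total)
print(countries)
```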
Ecosystem
PART 5
Vendors
Integration
? @StevenLeRoux
2016