Compressed Introduction To Hadoop, SQL-on-Hadoop, NoSQL
@arsenyspb
Singapore University of Technology & Design 2016-11-09
Thank You For Inviting! My special kind regards to:
Professor Meihui Zhang
Associate Director Hou Liang Seah
Industry Outreach Manager Robin Soo
🤔🤔 What am I supposed to do?.. Please raise hand if you…
…want to learn about modern data analytics ?..
…are OK if I use words like “Java” or “Command Line” or “Port”?..
…got enough kopi / teh / red bull for the next hour?..
…have hands-on experience with Hadoop, Spark, Hive?..
Shameless Self-Intro
Hi, My Name Is Arseny, And I’m…
Hadoop In A 🌰 Nutshell
It All Started At Google
2003: Google publishes the GFS paper
2004: Google publishes the MapReduce paper
2006: Hadoop launches as an open-source implementation
Hadoop is Google’s Tech in Open Source
Hadoop Originates From Hyperscale Approach
However, in 2016 big data & Hadoop don’t need a hyperscale datacenter
© Hortonworks Inc. 2011 – 2014. All Rights Reserved
Closer Look, i.e. Hortonworks Data Platform (HDP)
Hortonworks Data Platform 2.3 stack (scales from 1 to N nodes):
– DATA MANAGEMENT: HDFS (Hadoop Distributed File System); YARN: Data Operating System
– DATA ACCESS (engines run on YARN, via Tez / Slider): Batch (MapReduce), Script (Pig), Search (Solr), SQL (Hive), NoSQL (HBase, Accumulo, Phoenix), Stream (Storm), In-memory (Spark), Others (ISV Engines)
– SECURITY: Administration, Authentication, Authorization, Auditing, Data Protection: Ranger, Knox, Atlas, HDFS Encryption
– GOVERNANCE & INTEGRATION: Data Workflow (Sqoop, Flume, Kafka, NFS, WebHDFS); Data Lifecycle & Governance (Falcon, Atlas)
– OPERATIONS: Provisioning, Managing & Monitoring (Ambari, Cloudbreak, Zookeeper); Scheduling (Oozie)
We will “compress” all these topics during the next hour
Quick demo
HDFS In A 🌰 Nutshell Hadoop Distributed File System
© 2015 Pivotal Software, Inc. All rights reserved.
Reading Data From HDFS
(The HDFS Client, DistributedFileSystem and FSDataInputStream run in the client JVM; the NameNode and each DataNode run in their own JVMs.)
1. open: the client opens the file through DistributedFileSystem
2. request file block locations: DistributedFileSystem asks the NameNode
3. read: the client reads through FSDataInputStream
4. read from block: FSDataInputStream streams the first block from its DataNode
5. read from block: then the next block from its DataNode
6. close: the client closes the stream
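The six steps above can be sketched as a toy model in Python. This is not the real Hadoop client (which is Java and speaks RPC); all class and function names here are illustrative. The key idea it shows: the NameNode serves only block locations, while the actual bytes come from DataNodes.

```python
# Toy model of the HDFS read path (steps 1-6 above).
# Illustrative names only; not the real Hadoop API.

class NameNode:
    def __init__(self):
        # file path -> ordered list of (block_id, [replica datanode ids])
        self.block_map = {}

    def get_block_locations(self, path):          # step 2
        return self.block_map[path]

class DataNode:
    def __init__(self):
        self.blocks = {}                          # block_id -> bytes

    def read_block(self, block_id):               # steps 4-5
        return self.blocks[block_id]

def hdfs_open_and_read(namenode, datanodes, path):
    data = b""
    for block_id, replicas in namenode.get_block_locations(path):
        # read each block from the first (e.g. closest) replica
        data += datanodes[replicas[0]].read_block(block_id)
    return data

# wire up a tiny "cluster": one NameNode, three DataNodes, 2x replication
nn = NameNode()
dns = {i: DataNode() for i in range(3)}
nn.block_map["/file.csv"] = [("blk_1", [0, 1]), ("blk_2", [2, 0])]
dns[0].blocks["blk_1"] = b"hello "
dns[1].blocks["blk_1"] = b"hello "
dns[2].blocks["blk_2"] = b"world"
dns[0].blocks["blk_2"] = b"world"

print(hdfs_open_and_read(nn, dns, "/file.csv"))   # b'hello world'
```

Note how the client reassembles the file block-by-block; losing one DataNode only matters if it held the last replica of a block.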
Writing Data to HDFS
(The HDFS Client, DistributedFileSystem, FSDataOutputStream and DataStreamer run in the client JVM; the diagram shows 3x replication.)
1. create: the client creates the file through DistributedFileSystem
2. create: DistributedFileSystem asks the NameNode to create the file
3a. write: the client writes through FSDataOutputStream
3b. request allocation: the DataStreamer asks the NameNode for new blocks as required
3c. the NameNode returns three data-node, data-block pairs
4a–4c. write packet: the DataStreamer pipelines each packet through the first, second and third DataNode
5a–5c. ack packet: acknowledgements travel back along the pipeline to the client
6. close
7. complete: the client tells the NameNode the file is complete
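The pipelined replication in steps 4a–5c can be sketched as a few lines of Python. This is a deliberately simplified, synchronous toy (the real pipeline is asynchronous and packet-based); the names are illustrative, not Hadoop's.

```python
# Toy sketch of the 3x-replication write pipeline (steps 4a-5c above):
# each packet is forwarded DataNode -> DataNode, acks travel back in reverse.

def pipeline_write(packet, datanodes):
    """Write `packet` through the DataNode chain; return the ack chain."""
    for dn in datanodes:                 # 4a, 4b, 4c: forward along the chain
        dn.append(packet)
    # 5a, 5b, 5c: acks come back from the far end of the pipeline
    return [f"ack:{i}" for i in reversed(range(len(datanodes)))]

dn1, dn2, dn3 = [], [], []               # three DataNodes' block storage
acks = pipeline_write(b"packet-0", [dn1, dn2, dn3])
assert dn1 == dn2 == dn3 == [b"packet-0"]   # all three replicas hold it
print(acks)                                  # ['ack:2', 'ack:1', 'ack:0']
```

The write is only considered durable once the ack chain makes it back to the client, which is why a slow or dead node mid-pipeline stalls the writer.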
Quick demo
YARN In A 🌰 Nutshell Yet Another Resource Negotiator
Traditional SQL Databases Are Structured: Schema-on-Write

row key | color | shape  | timestamp
first   | red   | square | HH:MM:SS
second  | blue  | round  | HH:MM:SS
...     | ...   | ...    | ...

1. create schema on file or block storage
2. load data
3. query data: select ROW_KEY, COLOR from … where …

You can’t add data before the schema is created. To change the schema, the entire table must be dropped and re-loaded; dropping a TB-size table with foreign keys can take days.
Unstructured, Schema-on-Read Query: MapReduce In Color
(input: file.csv & other.txt)
1. load data straight from HDFS
2. query data: map → shuffle → reduce
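The map → shuffle → reduce phases above can be sketched in plain Python on a word count, the canonical MapReduce example. This is a conceptual sketch, not Hadoop's actual (Java) API: `map_phase` emits key-value pairs, `shuffle_phase` groups values by key as the framework does between the two user-supplied phases, and `reduce_phase` aggregates each group.

```python
# Word count as map -> shuffle -> reduce, in plain Python (illustrative).
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield (word, 1)                       # emit (key, value)

def shuffle_phase(pairs):
    # group all values by key, as the framework does between map and reduce
    pairs = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(pairs, key=itemgetter(0)):
        yield key, [v for _, v in group]

def reduce_phase(grouped):
    for key, values in grouped:
        yield key, sum(values)

lines = ["red square", "blue round", "red round"]
counts = dict(reduce_phase(shuffle_phase(map_phase(lines))))
print(counts)   # {'blue': 1, 'red': 2, 'round': 2, 'square': 1}
```

In real Hadoop the three phases run distributed across the cluster, with the shuffle moving data over the network between mapper and reducer nodes.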
MapReduce In Process Diagram
Starting a Job – MapReduce v2.0
(The client JVM runs the Job / MapReduce program; the ResourceManager, each NodeManager, the MRAppMaster and each YARN child run in their own JVMs.)
1. initiate job
2. request new application: the Job asks the ResourceManager for a new application ID
3. copy job jars and config to the shared file system (e.g. HDFS)
4. submit job to the ResourceManager
5a. start container: the ResourceManager directs a NodeManager
5b. launch: the NodeManager launches the MRAppMaster
5c. the MRAppMaster initializes the job
6. the MRAppMaster determines input splits from the shared file system
7a. allocate task resources: the MRAppMaster asks the ResourceManager
7b. start container: on the chosen NodeManager
8. launch: the NodeManager launches a YARN child JVM
9. the YARN child retrieves job jars and data from the shared file system
10. run: the YARN child runs the Mapper or Reducer
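The division of labour above (ResourceManager hands out containers, NodeManagers launch them) can be sketched as a toy model. Everything here is illustrative, including the least-loaded scheduling policy; it is not the YARN API.

```python
# Toy model of YARN job start: RM allocates containers, NMs launch tasks.
# Illustrative names and scheduling policy only.

class ResourceManager:
    def __init__(self, node_managers):
        self.node_managers = node_managers
        self.next_app_id = 0

    def new_application(self):                    # step 2
        self.next_app_id += 1
        return f"application_{self.next_app_id}"

    def allocate(self):                           # steps 5a / 7a
        # pick the least-loaded node (toy scheduling policy)
        return min(self.node_managers, key=lambda nm: len(nm.containers))

class NodeManager:
    def __init__(self, host):
        self.host = host
        self.containers = []

    def launch(self, task):                       # steps 5b / 8
        self.containers.append(task)
        return f"{self.host}:{task}"

nms = [NodeManager("node1"), NodeManager("node2")]
rm = ResourceManager(nms)

app_id = rm.new_application()
rm.allocate().launch("MRAppMaster")               # 5a-5c: start the app master
ran = [rm.allocate().launch(f"map_{i}") for i in range(3)]  # 7-10: task containers
print(app_id, ran)
```

The point of the MRv2 split: the ResourceManager only arbitrates cluster resources, while per-job logic (splits, task retries) lives in the per-application MRAppMaster, so the RM is no longer a JobTracker-style bottleneck.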
Quick demo
Hive In A 🌰 Nutshell SQL interface to MapReduce Jobs
Relational DB
Relational DB and SQL were conceived to:
– Remove repeated data, replacing it with tabular structure & relationships (DRY: Don’t Repeat Yourself)
– Provide an efficient & robust structure for data storage
– Exploit the regular structure with a declarative query language: Structured Query Language
What Hive Is… A SQL-like processing capability based on Hadoop
– Enables easy data summarisation, ad-hoc reporting and querying, and analysis of large volumes of data
– Built on HQL, a SQL-like query language
  ▪ Statements run as MapReduce jobs
  ▪ Also allows MapReduce programmers to plug in custom mappers and reducers
– Works with plain text, HBase, ORC, Parquet and other formats
– Metadata is stored in MySQL
Hive Schemas
Hive is schema-on-read
– Schema is only enforced when the data is read (at query time)
– Allows greater flexibility: the same data can be read using multiple schemas

Contrast with RDBMSes, which are schema-on-write
– Schema is enforced when the data is loaded
– Speeds up queries at the expense of load times
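The contrast can be shown in a few lines: sqlite3 stands in for a schema-on-write RDBMS, while plain strings plus a parse function stand in for Hive's schema-on-read (illustrative only; `schema_a` / `schema_b` are made-up names, not Hive features).

```python
# Schema-on-write vs schema-on-read, sketched with stdlib tools.
import sqlite3

# Schema-on-write: schema must exist first and is enforced at load time.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE t (row_key TEXT, color TEXT)")
try:
    db.execute("INSERT INTO t VALUES (?, ?, ?)", ("first", "red", "extra"))
except sqlite3.Error:
    print("write rejected: row does not match the 2-column schema")

# Schema-on-read: raw lines are stored as-is; a "schema" is just the way
# we choose to parse them at query time, and several can coexist.
raw = ["first,red,square", "second,blue,round"]
schema_a = lambda line: dict(zip(["key", "color", "shape"], line.split(",")))
schema_b = lambda line: line.split(",")          # same data, another schema
print(schema_a(raw[0])["color"])                  # red
print(schema_b(raw[0]))                           # ['first', 'red', 'square']
```

This is exactly the trade the slide describes: the RDBMS pays at load time to make queries cheap; Hive loads anything and pays at query time.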
Hive Architecture
Hive Metastore + MySQL
What Hive Is Not…
– Hive, like Hadoop, is designed for batch processing of large datasets
– Not a real-time system, not fully SQL-92 compliant
  ▪ “Sibling” solutions like Tez, Impala and HAWQ offer more compliance
– Latency and throughput are both high compared to a traditional RDBMS, even when dealing with relatively small data (<100 MB)
Quick demo
HBase In A 🌰 Nutshell Non-relational distributed database
ACID is a Business Requirement for RDBMSs
Traditional DBs have excellent support for ACID transactions:
–Atomic: all write operations succeed, or nothing is written
–Consistent: integrity rules are guaranteed at commit
–Isolated: it appears to the user as if only one process executes at a time (two concurrent transactions will not see one another’s changes while “in flight”)
–Durable: updates made in a committed transaction will be visible to future transactions (the effects of a process do not get lost if the system crashes)
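The Atomic property is easy to demonstrate with sqlite3 from the standard library: a "crash" in the middle of a two-step transfer rolls back the whole transaction, so no partial write survives (the table and the simulated failure are illustrative).

```python
# Atomicity demo: both updates commit, or neither does.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE accounts (name TEXT, balance INTEGER)")
db.execute("INSERT INTO accounts VALUES ('a', 100), ('b', 0)")
db.commit()

try:
    with db:  # one transaction: transfer 50 from 'a' to 'b'
        db.execute("UPDATE accounts SET balance = balance - 50 WHERE name='a'")
        raise RuntimeError("crash mid-transaction")   # simulated failure
        db.execute("UPDATE accounts SET balance = balance + 50 WHERE name='b'")
except RuntimeError:
    pass

# the partial debit was rolled back: nothing was written
balances = dict(db.execute("SELECT name, balance FROM accounts"))
print(balances)   # {'a': 100, 'b': 0}
```

Guarantees like this are precisely what become expensive to preserve once the database is sharded across many machines, which motivates the next slides.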
Scale RDBMS?..
RDBMS is a bad fit for huge-scale, online applications. Sharding?.. Scaling up?.. No joins?.. Master-slave?..
“Big Data” describes the problem; “Not Only SQL” defines the general approach to the solution:
– Emphasis on scale, distributed processing, and use of commodity hardware
Business Needs for “Not Only SQL”
Not Only SQL DBs evolved from web-scale use-cases
– Google, Amazon, Facebook, Twitter, Yahoo, …
  ▪ “Google Cache” = an entire page saved into a cell of a BigTable database
  ▪ Columnar layout preferred
  ▪ Filters that skip disk lookups for non-existent rows or columns increase the performance of query operations
– Requirement for massive scale, which relational fits badly
  ▪ Queries are relatively simple
  ▪ Direct interaction with online customers
– Cost-effective, dynamic horizontal scaling required
  ▪ Many nodes based on inexpensive (commodity) hardware
  ▪ Must manage frequent node failures & addition of nodes at any time
🤔🤔 But how to build such DB?..
Reminder: The CAP Theorem (pick 2, not 3)
– Consistency: “Once a writer has written, all readers will see that write.” A single version of truth?
– Availability: “The system is available to serve 100% of requests and complete them successfully.” No SPOF?..
– Partition tolerance: “The system can continue to operate in the presence of network partitions.” Replicas?..
Eventually Consistent vs. ACID
An artificial acronym you may see is BASE:
–Basically Available: the system seems to work all the time
–Soft State: not wholly consistent all the time, but…
–Eventual Consistency: after a period with no updates, a given dataset will be consistent

Resulting systems are characterized as “eventually consistent”
– e.g. overbooking an airline or hotel and passing the risk to the customer
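A toy model of eventual consistency: two replicas accept conflicting writes independently (soft state), then converge when they exchange updates. This sketch uses a naive last-write-wins merge on timestamps; real systems use vector clocks, gossip, or quorums, and the class below is purely illustrative.

```python
# Eventual consistency sketch: replicas diverge, then converge on sync.

class Replica:
    def __init__(self):
        self.data = {}      # key -> (timestamp, value)

    def write(self, key, value, ts):
        self.data[key] = (ts, value)

    def read(self, key):
        return self.data.get(key, (0, None))[1]

    def sync(self, other):
        # last-write-wins merge, applied in both directions
        for key in set(self.data) | set(other.data):
            winner = max(self.data.get(key, (0, None)),
                         other.data.get(key, (0, None)))
            self.data[key] = winner
            other.data[key] = winner

r1, r2 = Replica(), Replica()
r1.write("seat", "booked-by-alice", ts=1)
r2.write("seat", "booked-by-bob", ts=2)       # concurrent conflicting write

print(r1.read("seat"), r2.read("seat"))       # replicas disagree: soft state
r1.sync(r2)                                    # after a period with no updates…
print(r1.read("seat"), r2.read("seat"))       # …both converge on the later write
```

Note the business consequence from the slide: during the divergent window both Alice and Bob believe they hold the seat, and the system resolves the overbooking after the fact.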
Non-relational distributed database
• HBase is a database: it has a schema, but it’s non-relational

row keys | column family “color”                       | column family “shape”
first    | “red”: #F00, “blue”: #00F, “yellow”: #F0F   | “square”: …
second   |                                             | “round”: …, “size”: XXL

1.) Create column families
2.) Load data; multiples of rows form region files on HDFS
3.) Query data:

hbase> get “first”, “color:yellow”
COLUMN   CELL
yellow   timestamp=1295774833226, value=“#F0F”
hbase> get “second”, “shape:size”
COLUMN   CELL
size     timestamp=1295723467122, value=“XXL”
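The data model above is essentially a sorted map of maps, which a few lines of Python can mimic: fixed column families, free-form qualifiers per row, and timestamped cell versions. `TinyHBase` and its methods are invented for illustration and are not the real HBase client API.

```python
# Dict-of-dicts sketch of the HBase data model (illustrative only).
import time

class TinyHBase:
    def __init__(self, column_families):
        self.cf = set(column_families)
        self.rows = {}   # row_key -> {"family:qualifier": [(ts, value), ...]}

    def put(self, row_key, column, value, ts=None):
        family = column.split(":")[0]
        assert family in self.cf, "column family must be created up front"
        cell = self.rows.setdefault(row_key, {}).setdefault(column, [])
        cell.append((ts or time.time(), value))

    def get(self, row_key, column):
        # newest version wins, like HBase's default single-version get
        return max(self.rows[row_key][column])[1]

t = TinyHBase(["color", "shape"])                 # 1) create column families
t.put("first", "color:yellow", "#F0F")            # 2) load data
t.put("second", "shape:size", "XXL")
print(t.get("first", "color:yellow"))             # 3) query data -> #F0F
print(t.get("second", "shape:size"))              # XXL
```

The schema-light feel is the point: only the families are fixed, so row “first” and row “second” can hold entirely different columns, unlike the rigid table on the schema-on-write slide.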
Column Oriented Storage
HBase Architecture, Read & Write

Clients reach the Region Server through several paths, each ultimately an HBase client:
– SQL ODBC Client → Pivotal HAWQ PXF → HBase Client
– SQL JDBC Client → Apache Phoenix → HBase Client
– HBase API Client
Zookeeper coordinates clients and Region Servers.

Write path:
(1) Put/Delete arrives at the Region Server
(2.0) Write to the Write-Ahead Log (WAL): sequential HDFS write
(2.1) Write to the MemStore (Memstore = eventual consistency)
(3) Flush the MemStore to HDFS as an HFile: sequential writes

Read path:
(4) Get/Scan read requests are served with client RAM pre-fetch, adaptive pre-fetch & L2 reads
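The write path above (WAL first, then MemStore, then flush to an immutable HFile) can be modelled in a short sketch. The names and the tiny flush threshold are illustrative; this is not HBase's internal API, just the shape of the algorithm.

```python
# Toy model of the HBase write path: (2.0) WAL, (2.1) MemStore, (3) flush.

class RegionServer:
    def __init__(self, flush_threshold=2):
        self.wal = []            # write-ahead log: sequential, durable
        self.memstore = {}       # in-memory buffer of recent writes
        self.hfiles = []         # flushed, immutable "files on HDFS"
        self.flush_threshold = flush_threshold

    def put(self, key, value):                       # (1) Put
        self.wal.append((key, value))                # (2.0) WAL first
        self.memstore[key] = value                   # (2.1) then MemStore
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):                                 # (3) flush to HDFS
        self.hfiles.append(dict(sorted(self.memstore.items())))
        self.memstore = {}

    def get(self, key):                              # (4) Get: MemStore, then HFiles
        if key in self.memstore:
            return self.memstore[key]
        for hfile in reversed(self.hfiles):          # newest file first
            if key in hfile:
                return hfile[key]
        return None

rs = RegionServer()
rs.put("first", "#F0F")
rs.put("second", "XXL")      # hits the threshold -> flushed to an HFile
rs.put("third", "red")
print(rs.get("first"), rs.get("third"))   # one served from HFile, one from MemStore
```

Writing the WAL before the MemStore is what makes crashes survivable: on restart, the MemStore's lost contents can be replayed from the log.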
HBase namespace layout
From “HBase: The Definitive Guide”
http://www.slideshare.net/Hadoop_Summit/kamat-singh-june27425pmroom210cv2
Compression (HBase and others)
Q&A?.. http://bit.ly/isilonhbase
@arsenyspb