Post on 11-Apr-2017
Cmprssd Intrduction To Hadoop, SQL-on-Hadoop, NoSQL
@arsenyspb
Arseny.Chernov@Dell.com
Singapore University of Technology & Design
2016-11-09
Thank You For Inviting!
My special kind regards to:
Professor Meihui Zhang
Associate Director Hou Liang Seah
Industry Outreach Manager Robin Soo
🤔 What am I supposed to do?..
Please raise hand if you…
…want to learn about modern data analytics?..
…are OK if I use words like “Java” or “Command Line” or “Port”?..
…got enough kopi / teh / red bull for the next hour?..
…have hands-on experience with Hadoop, Spark, Hive?..
Shameless Self-Intro
Hi, My Name Is Arseny, And I’m…
2011
Hadoop In A 🌰 Nutshell
1998
2016
It All Started At Google
2003
2004
2006
Hadoop is Google’s Tech in Open Source
2006
Hadoop Originates From Hyperscale Approach
However, in 2016 big data & Hadoop don’t need a hyperscale datacenter
© Hortonworks Inc. 2011 – 2014. All Rights Reserved
Closer Look, e.g. Hortonworks Data Platform (HDP)
Hortonworks Data Platform 2.3
GOVERNANCE & INTEGRATION
– Data Lifecycle & Governance: Falcon, Atlas
– Data Workflow: Sqoop, Flume, Kafka, NFS, WebHDFS
DATA ACCESS
– Batch: MapReduce
– Script: Pig
– Search: Solr
– SQL: Hive
– NoSQL: HBase, Accumulo, Phoenix
– Stream: Storm
– In-memory: Spark
– Others: ISV Engines
– Tez, Slider
YARN: Data Operating System (nodes 1 … N)
DATA MANAGEMENT
– HDFS: Hadoop Distributed File System
SECURITY
– Administration, Authentication, Authorization, Auditing, Data Protection: Ranger, Knox, Atlas, HDFS Encryption
OPERATIONS
– Provisioning, Managing, & Monitoring: Ambari, Cloudbreak, Zookeeper
– Scheduling: Oozie
We will “compress” all these topics during the next hour
Quick demo
HDFS In A 🌰 Nutshell
Hadoop Distributed File System
© 2015 Pivotal Software, Inc. All rights reserved.
Reading Data From HDFS
Components:
– Client Node (Client JVM): HDFS Client, DistributedFileSystem, FSDataInputStream
– NameNode (namenode JVM)
– Three DataNodes (one datanode JVM each)
Steps:
1: open
2: Request file block locations
3: read
4: read from block
5: read from block
6: close
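The read steps above can be sketched as a toy model in Python. This is only an illustration of the flow, not the Hadoop API: the class names mirror the diagram, while the block ids, DataNode names, and file path are made up for the example.

```python
# Toy model of the HDFS read path; NOT the real Hadoop client API.
class NameNode:
    def __init__(self):
        # path -> ordered list of (block_id, [names of DataNodes holding a replica])
        self.block_map = {}

    def get_block_locations(self, path):
        return self.block_map[path]


class DataNode:
    def __init__(self):
        self.blocks = {}  # block_id -> bytes

    def read_block(self, block_id):
        return self.blocks[block_id]


def read_file(path, namenode, datanodes):
    """Steps 1-6: open, ask for block locations, read block by block, close."""
    locations = namenode.get_block_locations(path)  # step 2
    data = b""
    for block_id, holders in locations:             # steps 3-5
        # Assumption: read from the first listed replica.
        data += datanodes[holders[0]].read_block(block_id)
    return data                                     # step 6: nothing left to read


# Hypothetical cluster: one file, two blocks, three DataNodes.
nn = NameNode()
dns = {"dn1": DataNode(), "dn2": DataNode(), "dn3": DataNode()}
dns["dn1"].blocks["blk_1"] = b"hello "
dns["dn2"].blocks["blk_1"] = b"hello "
dns["dn3"].blocks["blk_2"] = b"hdfs"
nn.block_map["/demo.txt"] = [("blk_1", ["dn1", "dn2"]), ("blk_2", ["dn3"])]

content = read_file("/demo.txt", nn, dns)
```

The point of the sketch: the NameNode only hands out locations, while the bytes themselves come straight from the DataNodes.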
Writing Data to HDFS
Components:
– Client Node (Client JVM): HDFS Client, DistributedFileSystem, FSDataOutputStream, DataStreamer
– NameNode (namenode JVM)
– Three DataNodes (one datanode JVM each)
Steps:
1: create
2: create
3a: write
3b: Request allocation (as new blocks required)
3c: Three data-node, data-block pairs returned
4a: write packet
4b: write packet
4c: write packet
5a: ack packet
5b: ack packet
5c: ack packet
6: close
7: complete
Diagram shows 3x replication
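A rough Python sketch of the write pipeline with 3x replication follows. It is a toy model, not the Hadoop API: the round-robin placement, the tiny block and packet sizes, and the DataNode names are assumptions chosen to keep the example small.

```python
# Toy model of the HDFS write pipeline; NOT the real Hadoop client API.
REPLICATION = 3


def allocate_block(datanode_names, block_index):
    """Steps 3b/3c: return REPLICATION DataNodes for a new block.
    Assumption: simple round-robin placement."""
    n = len(datanode_names)
    return [datanode_names[(block_index + i) % n] for i in range(REPLICATION)]


def write_packet(pipeline, storage, block_id, packet):
    """Steps 4a-4c: the packet travels down the pipeline.
    Steps 5a-5c: acks travel back in the reverse order."""
    for name in pipeline:
        storage[name][block_id] = storage[name].get(block_id, b"") + packet
    return list(reversed(pipeline))


def write_file(data, datanode_names, storage, block_size=8, packet_size=4):
    acks = []
    for block_index, start in enumerate(range(0, len(data), block_size)):
        block = data[start:start + block_size]
        pipeline = allocate_block(datanode_names, block_index)
        block_id = f"blk_{block_index}"
        for offset in range(0, len(block), packet_size):
            packet = block[offset:offset + packet_size]
            acks.append(write_packet(pipeline, storage, block_id, packet))
    return acks


names = ["dn1", "dn2", "dn3", "dn4"]          # hypothetical DataNodes
storage = {name: {} for name in names}
acks = write_file(b"0123456789ab", names, storage)
replica_count = sum(1 for name in names if "blk_0" in storage[name])
```

Every block ends up on three DataNodes, and each packet is acknowledged from the last node in the pipeline back towards the client.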
Quick demo
YARN In A 🌰 Nutshell
Yet Another Resource Negotiator
Legacy SQL Is All Structured
Traditional SQL databases: structured Schema-on-Write

row keys   color   shape    timestamp
first      red     square   HH:MM:SS
second     blue    round    HH:MM:SS
...        ...     ...      ...

1 create schema on file or block storage
2 load data
3 query data: select ROW KEY, COLOR from … where

Can’t add data before the schema is created. To change the schema, drop and reload the entire table. Dropping a TB-size table with foreign keys could take days.
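The three schema-on-write steps can be tried with Python's built-in sqlite3 module. This is a minimal sketch: the table name `shapes`, the column names, and the timestamp values are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Loading before the schema exists fails: there is no table to write into.
try:
    conn.execute("INSERT INTO shapes VALUES ('first', 'red', 'square', '10:00:00')")
    loaded_without_schema = True
except sqlite3.OperationalError:
    loaded_without_schema = False

# 1: create schema
conn.execute(
    "CREATE TABLE shapes (row_key TEXT PRIMARY KEY, color TEXT, shape TEXT, ts TEXT)"
)

# 2: load data (timestamps are placeholder values)
conn.executemany(
    "INSERT INTO shapes VALUES (?, ?, ?, ?)",
    [("first", "red", "square", "10:00:00"), ("second", "blue", "round", "10:00:01")],
)

# 3: query data
rows = conn.execute(
    "SELECT row_key, color FROM shapes WHERE shape = 'round'"
).fetchall()
```

The failed first insert is the "can't add data before the schema is created" rule in action.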
MapReduce In Color
Unstructured Schema-on-Read Query
file.csv & other.txt
1 load data straight from HDFS
2 query data: map, shuffle, reduce
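Map, shuffle, and reduce can be imitated in a few lines of plain Python. This sketch counts rows per color; the input lines stand in for the contents of file.csv, and the third row is made up to make the count more interesting.

```python
from collections import defaultdict

# Stand-in for raw lines read from file.csv (no schema declared up front).
lines = ["first,red,square", "second,blue,round", "third,red,round"]


def map_phase(line):
    """Map: parse one raw line and emit (key, value) pairs."""
    _, color, _ = line.split(",")
    yield color, 1


def shuffle(pairs):
    """Shuffle: group all values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values of one key into a single result."""
    return key, sum(values)


mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
```

In a real cluster the map and reduce calls run on different nodes and the shuffle moves data over the network; here everything happens in one process.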
MapReduce In Process Diagram
Starting Job – MapReduce v2.0
Components:
– Client Node (Client JVM): MapReduce program, Job
– ResourceManager Node (JVM): ResourceManager
– Node Manager Node (JVM): Node Manager, MRAppMaster
– Node Manager Node (JVM): Node Manager; Child JVM: YarnChild running a Mapper or Reducer
– Shared File-System (e.g. HDFS)
Steps:
1: initiate job
2: request new application
3: copy job jars, config
4: submit job
5a: start container
5b: launch
5c: initialize job
6: determine input splits
7a: allocate task resources
7b: start container
8: launch
9: retrieve job jars, data
10: run
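Step 6, determining input splits, is mostly arithmetic. The sketch below assumes the split size equals a 128 MB block size and ignores the finer rules a real input format applies; the function name `input_splits` is invented for the example.

```python
import math


def input_splits(file_size, split_size):
    """Return (offset, length) pairs covering the file.
    Assumption: fixed-size splits, one map task per split."""
    count = math.ceil(file_size / split_size)
    return [
        (i * split_size, min(split_size, file_size - i * split_size))
        for i in range(count)
    ]


MB = 1024 * 1024
splits = input_splits(300 * MB, 128 * MB)
```

Under these assumptions a 300 MB file yields three splits, so three map tasks would be requested in step 7a.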
Quick demo
Hive In A 🌰 Nutshell
SQL interface to MapReduce Jobs
Relational DB
Relational DB and SQL were conceived to:
– Remove repeated data, replacing it with tabular structure & relationships
 ▪ Provide efficient & robust structure for data storage
– Exploit regular structure with a declarative query language
 ▪ Structured Query Language
DRY – Don’t Repeat Yourself
What Hive Is…
• A SQL-like processing capability based on Hadoop
• Enables easy data summarisation, ad-hoc reporting and querying, and analysis of large volumes of data
• Built on HQL, a SQL-like query language
 – Statements run as MapReduce jobs
 – Also allows MapReduce programmers to plug in custom mappers and reducers
• Works with plain text, HBase, ORC, Parquet and other formats
• Metadata is stored in the Hive Metastore, backed by MySQL
Hive Schemas
• Hive is schema-on-read
 – Schema is only enforced when the data is read (at query time)
 – Allows greater flexibility: same data can be read using multiple schemas
• Contrast with RDBMSes, which are schema-on-write
 – Schema is enforced when the data is loaded
 – Speeds up queries at the expense of load times
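Schema-on-read can be illustrated without Hive at all. In this Python sketch the raw text is stored once and two different "schemas" are applied only at read time; the helper `read_with_schema` and the column names are invented for the example.

```python
import csv
import io

# Raw data as it might sit in a file: no schema was declared when it was written.
raw = "first,red,square\nsecond,blue,round\n"


def read_with_schema(text, columns):
    """Apply a list of column names to raw CSV text at read time.
    Fields beyond the named columns are simply ignored."""
    reader = csv.reader(io.StringIO(text))
    return [dict(zip(columns, row)) for row in reader]


# The same bytes, read through two different schemas.
full = read_with_schema(raw, ["row_key", "color", "shape"])
keys_only = read_with_schema(raw, ["id"])
```

Loading was free (nothing was validated), but every read pays the parsing cost, which is the trade-off named on the slide.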
Hive Architecture
Hive Metastore + MySQL
What Hive Is Not…
• Hive, like Hadoop, is designed for batch processing of large datasets
• Not a real-time system, not fully SQL-92 compliant
 – “Sibling” solutions like Tez, Impala and HAWQ offer more compliance
• Latency and throughput are both high compared to a traditional RDBMS
 – Even when dealing with relatively small data (<100 MB)
Quick demo
HBase In A 🌰 Nutshell
Non-relational distributed database
ACID Is a Business Requirement for RDBMSs
Traditional DBs have excellent support for ACID transactions
– Atomic: All write operations succeed, or nothing is written
– Consistent: Integrity rules guaranteed at commit
– Isolated: It appears to the user as if only one process executes at a time. (Two concurrent transactions will not see one another’s changes while “in flight”.)
– Durable: The updates made to the database in a committed transaction will be visible to future transactions. (Effects of a process do not get lost if the system crashes.)
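Atomicity and consistency are easy to see with Python's built-in sqlite3 module. This is a small sketch: the `accounts` table, the names, and the `transfer` helper are made up, and the CHECK constraint plays the role of an integrity rule.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move money in one transaction: both updates apply, or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


# Alice only has 100, so the second update violates the CHECK constraint.
ok = transfer(conn, "alice", "bob", 150)
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

The credit to bob had already been applied inside the transaction, yet it disappears on rollback: nothing is written unless everything is.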
Scale RDBMS?..
RDBMS is a bad fit for huge-scale, online applications
– How to do sharding?..
– Unlimited, but scaling up?..
– Maybe give up on joins for latency and do master-slave?..
Big Data describes the problem; Not Only SQL defines the general approach to the solution:
– Emphasis on scale, distributed processing, use of commodity hardware
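One common answer to "how to do sharding" is to hash the key and take the remainder by the number of shards. The Python sketch below shows only that idea; the shard count, the key format, and the helper names are assumptions, and plain dicts stand in for separate database servers.

```python
import hashlib


def shard_for(key, shard_count):
    """Pick a shard by hashing the key (stable across processes, unlike hash())."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % shard_count


# Four dicts stand in for four independent database servers.
shards = [dict() for _ in range(4)]


def put(key, value):
    shards[shard_for(key, len(shards))][key] = value


def get(key):
    return shards[shard_for(key, len(shards))].get(key)


put("user:42", "Arseny")
put("user:43", "Robin")
value = get("user:42")
```

A lookup by key touches exactly one shard, which is why it scales; a join across keys may have to visit every shard, which is why joins are often the first thing given up.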
Business Needs for “Not Only SQL”
Not Only SQL DBs evolved from web-scale use-cases
– Google, Amazon, Facebook, Twitter, Yahoo, …
 ▪ “Google Cache” = entire page saved into a cell of a BigTable database
 ▪ Columnar layout preferred
 ▪ Filters that reduce disk lookups for non-existent rows or columns increase the performance of a database query operation
– Requirement for massive scale, relational fits badly
 ▪ Queries relatively simple
 ▪ Direct interaction with online customers
– Cost-effective, dynamic horizontal scaling required
 ▪ Many nodes based on inexpensive (commodity) hardware
 ▪ Must manage frequent node failures & addition of nodes at any time
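The slide does not name the filters it mentions; a Bloom filter is one well-known structure that fits the description, so the sketch below assumes that is what is meant. It is a toy version in Python: the sizes, the hashing scheme, and the `lookup` helper are choices made for the example.

```python
import hashlib


class BloomFilter:
    """Answers 'definitely absent' or 'possibly present' without touching disk."""

    def __init__(self, size_bits=1024, hash_count=3):
        self.size = size_bits
        self.hash_count = hash_count
        self.bits = bytearray(size_bits)  # one byte per bit, for readability

    def _positions(self, key):
        for i in range(self.hash_count):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))


def lookup(key, bloom, disk):
    """Return (value, touched_disk); skip the disk when the filter says 'absent'."""
    if not bloom.might_contain(key):
        return None, False
    return disk.get(key), True


disk = {"first": "red", "second": "blue"}  # stand-in for rows stored on disk
bf = BloomFilter()
for row_key in disk:
    bf.add(row_key)

found, touched_disk = lookup("first", bf, disk)
```

A key that was added is never reported absent; a key that was never added is usually rejected without a disk read, with a small chance of a false positive.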
🤔 But how to build such DB?..
Reminder: The CAP Theorem (2, not 3)
Consistency
– “Once a writer has written, all readers will see that write”
– Single Version of Truth?
Availability
– “System is available to serve 100% of requests and complete them successfully.”
– No SPOF?..
Partition tolerance
– “A system can continue to operate in the presence of network partitions”
– Replicas?..
Eventually Consistent vs. ACID
An artificial acronym you may see is BASE
– Basically Available
 ▪ System seems to work all the time
– Soft State
 ▪ Not wholly consistent all the time, but…
– Eventual Consistency
 ▪ After a period with no updates, a given dataset will be consistent
Resulting systems characterized as “eventually consistent”
– Overbooking an airline or hotel and passing risk to customer
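Eventual consistency can be shown with a toy replicated store in Python. Everything here is invented for the illustration: writes land on one replica first, and an explicit `sync` call stands in for background replication.

```python
class EventuallyConsistentStore:
    """Toy store: a write hits replica 0 immediately and the others only on sync()."""

    def __init__(self, replica_count=3):
        self.replicas = [dict() for _ in range(replica_count)]
        self.pending = []  # updates not yet propagated to the other replicas

    def write(self, key, value):
        self.replicas[0][key] = value
        self.pending.append((key, value))

    def read(self, key, replica_index):
        return self.replicas[replica_index].get(key)

    def sync(self):
        """Stand-in for background replication: apply pending updates everywhere."""
        for key, value in self.pending:
            for replica in self.replicas[1:]:
                replica[key] = value
        self.pending.clear()


store = EventuallyConsistentStore()
store.write("seat_12A", "booked")
stale = store.read("seat_12A", replica_index=2)  # replica 2 has not heard yet
store.sync()
fresh = store.read("seat_12A", replica_index=2)  # after a quiet period, it has
```

Between the write and the sync, a reader on replica 2 could sell seat 12A a second time, which is exactly the overbooking risk from the slide.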
Non-relational distributed database
HBase is a database: has a schema, but it’s non-relational

row keys   column family “color”                        column family “shape”
first      “red”: #F00, “blue”: #00F, “yellow”: #F0F    “square”:
second                                                  “round”: , “size”: XXL

1.) Create column families
2.) Load data, multiples of rows form region files on HDFS
3.) Query data

hbase> get “first”, “color”:“yellow”
COLUMN   CELL
yellow   timestamp=1295774833226, value=“#F0F”

hbase> get “second”, “shape”:“size”
COLUMN   CELL
size     timestamp=1295723467122, value=“XXL”
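The table above maps naturally onto nested dictionaries: row key, then column family, then column, then a timestamped value. The Python sketch below is only a model of that layout, not the HBase client API. The two timestamps shown in the shell output are reused; the remaining timestamps and the empty values for "square" and "round" are placeholders.

```python
# row key -> column family -> column qualifier -> (timestamp, value)
PLACEHOLDER_TS = 0  # the slide does not show timestamps for these cells

table = {
    "first": {
        "color": {
            "red": (PLACEHOLDER_TS, "#F00"),
            "blue": (PLACEHOLDER_TS, "#00F"),
            "yellow": (1295774833226, "#F0F"),
        },
        "shape": {"square": (PLACEHOLDER_TS, "")},
    },
    "second": {
        # No "color" family data at all for this row: nothing is stored for it.
        "shape": {
            "round": (PLACEHOLDER_TS, ""),
            "size": (1295723467122, "XXL"),
        },
    },
}


def get(table, row_key, family, qualifier):
    """Return (timestamp, value) for one cell, or None if the cell does not exist."""
    return table.get(row_key, {}).get(family, {}).get(qualifier)


yellow = get(table, "first", "color", "yellow")
size = get(table, "second", "shape", "size")
missing = get(table, "second", "color", "red")
```

Rows do not have to share columns: "second" has a "size" that "first" lacks, and a missing cell costs no storage.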
Column Oriented Storage
HBase Architecture, Read & Write
Clients:
– SQL ODBC Client
– SQL JDBC Client
– Pivotal HAWQ PXF HBase Client
– Apache Phoenix HBase Client
– HBase API Client
– HBase Client
Zookeeper
Region Server:
– Write-Ahead Log (WAL)
– MemStore
– HFile
Steps:
(1) Put/Delete
(2.0) Write to WAL
(2.1) Write to MemStore
(3) Flush to HDFS
(4) Get/Scan Read Request, Client RAM Pre-Fetch
Notes on the diagram:
– Sequential HDFS Write & L2 Read
– Adaptive Pre Fetch & L2 Reads, Sequential Writes
– Memstore = Eventual Consistency
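The write and read steps of the HBase architecture diagram can be sketched as a toy region server in Python. This is a simplification, not HBase code: the class, the flush threshold of two rows, and the dict-based "HFiles" are assumptions made for the example.

```python
class RegionServer:
    """Toy write path: WAL first, then MemStore, flushed to immutable HFiles."""

    def __init__(self, flush_threshold=2):
        self.wal = []        # (2.0) append-only log, written before anything else
        self.memstore = {}   # (2.1) in-memory buffer of recent writes
        self.hfiles = []     # (3) flushed, sorted, immutable files (on HDFS)
        self.flush_threshold = flush_threshold

    def put(self, row_key, value):            # (1) Put
        self.wal.append((row_key, value))     # (2.0) Write to WAL
        self.memstore[row_key] = value        # (2.1) Write to MemStore
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):                          # (3) Flush to HDFS
        self.hfiles.append(dict(sorted(self.memstore.items())))
        self.memstore = {}

    def get(self, row_key):                   # (4) Get
        # Newest data first: MemStore, then HFiles from newest to oldest.
        if row_key in self.memstore:
            return self.memstore[row_key]
        for hfile in reversed(self.hfiles):
            if row_key in hfile:
                return hfile[row_key]
        return None


rs = RegionServer()
rs.put("a", 1)
rs.put("b", 2)   # threshold reached: MemStore is flushed into the first HFile
rs.put("a", 3)   # newer value for "a" lives in the MemStore only
latest_a = rs.get("a")
```

Because the WAL is written first, a crash before the flush loses nothing: the MemStore could be rebuilt by replaying the log.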
HBase namespace layout
From “HBase: The Definitive Guide”
http://www.slideshare.net/Hadoop_Summit/kamat-singh-june27425pmroom210cv2
Compression (HBase and others)
Q&A?..
http://bit.ly/isilonhbase
@arsenyspb