Date posted: 26-Sep-2015
Category: Documents
Uploaded by: dharmendard
www.edureka.in/hadoop
Course Topics
Week 1 Understanding Big Data
Introduction to HDFS
Week 2 Playing around with Cluster
Data Loading Techniques
Week 3 Map-Reduce Basics, types and formats
Use-cases for Map-Reduce
Week 4 Analytics using Pig
Understanding Pig Latin
Week 5 Analytics using Hive
Understanding HIVE QL
Week 6 NoSQL Databases
Understanding HBASE
Week 7 Data loading Techniques in Hbase
Zookeeper
Week 8 Real world Datasets and Analysis
Hadoop Project Environment
Data Loading Techniques used in HBASE
Using HBASE SHELL
Using PIG
Using Sqoop
Using Client API
Loading data into HBASE using Pig
Load data into HDFS
Retrieve the data using Pig
Store into HBase
Example:
raw_data = LOAD 'input.csv' USING PigStorage(',') AS (listing_id: chararray, fname: chararray, lname: chararray);
dump raw_data;
STORE raw_data INTO 'hbase://sample_names' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('info:fname info:lname');
input.csv
1, fname1, lname1
2, fname2, lname2
3, fname3, lname3
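The mapping done by the STORE step can be pictured with a small stand-alone sketch. It assumes, as HBaseStorage does, that the first field of each tuple becomes the row key and the remaining fields go to the listed columns in order. The function name `rows_to_cells` is hypothetical, whitespace is stripped here for readability, and nothing in it talks to a real cluster.

```python
import csv
import io


def rows_to_cells(csv_text, columns):
    """Map CSV rows to {row_key: {column: value}}; the first field is the row key."""
    table = {}
    for fields in csv.reader(io.StringIO(csv_text)):
        fields = [f.strip() for f in fields]
        if not fields or not fields[0]:
            continue  # skip blank lines and rows without a key
        row_key, values = fields[0], fields[1:]
        # Remaining fields are paired with the column names in order.
        table[row_key] = dict(zip(columns, values))
    return table


# Sample rows shaped like input.csv from the slide.
SAMPLE = "1, fname1, lname1\n2, fname2, lname2\n3, fname3, lname3\n"
cells = rows_to_cells(SAMPLE, ["info:fname", "info:lname"])
print(cells["2"])
```

Each row key ends up with one value per column in the `info` family, which is the shape a later Get or Scan on `sample_names` would return.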
Loading data into HBASE using Sqoop
Sqoop can be used to import data directly from an RDBMS into HBase:
Example:
sqoop import --connect jdbc:mysql://<host>/<database> \
  --username <user> --password <password> --table <table> \
  --hbase-table <hbase-table> --column-family <family> \
  --hbase-row-key <key-column> --hbase-create-table
MySQL -> Sqoop -> HBase
HBase Client API
Package
org.apache.hadoop.hbase.client.HTable
It is recommended that you create HTable instances only once, one per thread, and reuse that instance for the rest of the lifetime of
your client application.
HBASE JAVA Client interfaces
Configuration: where to find the cluster, plus tunable settings. Similar to a JDBC connection string.
HBaseAdmin: helps to manage administrative tasks.
HTableDescriptor: has the details of the table.
HTable: a handle on a single table. Used to issue Put, Get, and Scan commands to the table.
Steps to create Table and Column family
Create an HBaseAdmin instance
Code the table schema
Table schema is represented by HTableDescriptor
Introduce column families to the table description (HColumnDescriptor)
Create the table through the HBaseAdmin instance
Problem
Problem:
In a distributed environment, getting processes to act in any kind of synchrony is an extremely hard problem.
For example, simply having a set of processes wait until they've all reached the same point in their execution (a kind of distributed barrier) is surprisingly difficult to do correctly.
Solution: ZooKeeper offers an API to facilitate this sort of distributed coordination. For example, it is often used to serve locks to client processes; locks are just another kind of coordination primitive.
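To make the barrier idea concrete, here is a single-machine analogy only: threads stand in for distributed processes and Python's `threading.Barrier` stands in for the coordination service. It does not use ZooKeeper, and the helper name `run_barrier_demo` is hypothetical.

```python
import threading


def run_barrier_demo(n_workers=4):
    """Start n_workers threads that all wait at one barrier; return the event log."""
    barrier = threading.Barrier(n_workers)
    log = []  # list.append is thread-safe in CPython, so no extra lock is used

    def worker(worker_id):
        log.append(("arrived", worker_id))
        barrier.wait()  # blocks until all n_workers threads have called wait()
        log.append(("passed", worker_id))

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log


log = run_barrier_demo(4)
print([kind for kind, _ in log])
```

No thread records "passed" until every thread has recorded "arrived". Getting the same guarantee across machines, with crashes and dropped connections, is the hard part that a coordination service takes on.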
Why Use ZooKeeper?
To solve complex distributed algorithms
To avoid race conditions and deadlocks
To avoid management complexities
Easy-to-use programming model
To apply reusable code libraries in common use cases
What Is A ZooKeeper?
ZooKeeper is a sort of central nervous system for distributed systems, where the role of the brain is played by the coordination service, axons are the network, processes are the monitored and controlled body parts, and events are the hormones and neurotransmitters used for messaging.
A Reliable, Scalable Distributed Coordination System
Apache ZooKeeper is a software project of the Apache Software Foundation, providing an open source distributed configuration service, synchronization service and naming registry for large distributed systems.
Every complex distributed application needs a coordination and orchestration system of some sort, so the ZooKeeper folks at Yahoo decided to build a good one and open-source it for everyone to use!
Target Market for ZooKeeper
Multi-host, multi-process, C- and Java-based systems
Who Uses Zookeeper?
Future Users -
Working Model
ZooKeeper lets distributed processes coordinate with each other through a shared hierarchical namespace.
Data is kept in memory and is backed up to a log for reliability. By using memory, ZooKeeper is very fast and can handle high loads.
Distributed Coordination System
Why would you ever need a distributed coordination system?
Because you can't get these guarantees from an event system plopped on top of a database, and
these are the sort of guarantees you need in a complex distributed system where connections drop,
nodes fail, retransmits happen, and chaos rules the day.
Example
For Example:
Assume the system is an ad system for serving advertisements to web sites. Ad systems are
complex beasts that require a fair bit of coordination. Imagine all the subsystems that need to run
on, say, 100 nodes: database, monitoring, fraud detectors, beacon servers, web server event log
processors, failover servers, customer dashboards, targeting engines, campaign planners,
campaign scenario testers, upgrades, installs, media managers, and so on.
There's a lot going on.
Coordination Service
Now imagine the power in the data center flips and all the machines power on!
How do all the processes across all the hosts know what to do?
Now imagine everything is up and a few machines go down.
How do all the processes know what to do in this situation?
This is where a coordination service comes in!
A coordination service acts as the backplane over which all these subsystems figure out what they
should do relative to all the other subsystems in a product.
In this scenario, ZooKeeper acts as the Service Locator. Each process goes to ZooKeeper and finds
out which is the primary database. If a new primary is elected, say because a host fails, then
ZooKeeper sends an event that allows everyone dependent on the database to react by getting the
new primary database.
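The service-locator pattern described above can be sketched with a toy in-memory registry. This is an illustration under assumptions, not ZooKeeper client code: the class `ServiceLocator`, the key `/services/primary-db`, and the host names are all hypothetical. The one behaviour borrowed from ZooKeeper is that a watch fires once and must then be set again.

```python
class ServiceLocator:
    """Toy in-memory registry: one value per key, plus one-shot change callbacks."""

    def __init__(self):
        self._values = {}
        self._watchers = {}

    def get(self, key, watcher=None):
        """Return the current value; optionally register a callback for the next change."""
        if watcher is not None:
            self._watchers.setdefault(key, []).append(watcher)
        return self._values.get(key)

    def set(self, key, value):
        """Store a new value and notify each registered watcher once."""
        self._values[key] = value
        # Watchers are removed before being called, so each fires at most once.
        for callback in self._watchers.pop(key, []):
            callback(key, value)


locator = ServiceLocator()
locator.set("/services/primary-db", "db-host-1")

seen = []
current = locator.get("/services/primary-db", watcher=lambda k, v: seen.append(v))
locator.set("/services/primary-db", "db-host-2")  # failover: a new primary is elected
locator.set("/services/primary-db", "db-host-3")  # the watch already fired, no event
print(current, seen)
```

A dependent process would normally re-read the key and set a fresh watch inside its callback, so that it keeps following later failovers.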
ZooKeeper Data Model
Hierarchical namespace (like a file system)
Each znode has data and children
Data is read and written in its entirety
Example tree (figure): /, Services, apps, users, locks, Stupidname, YaView, Servers, morestupidity, read-1
Hierarchical Namespace
ZooKeeper has a hierarchical namespace, much like a distributed file system. The only difference is
that each node in the namespace can have data associated with it as well as children. It is like having
a file system that allows a file to also be a directory. Paths to nodes are always expressed as
canonical, absolute, slash-separated paths; there are no relative references.
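A minimal in-memory model of such a namespace is sketched below. The class names `ZNode` and `ZNodeTree` and the example paths are hypothetical; the point is only that one node can hold data and have children at the same time, and that only absolute paths are accepted.

```python
class ZNode:
    """A node that carries its own data and may also have children."""

    def __init__(self, data=b""):
        self.data = data
        self.children = {}


class ZNodeTree:
    def __init__(self):
        self.root = ZNode()

    def _find(self, path):
        if not path.startswith("/"):
            raise ValueError("paths must be absolute: %r" % path)
        node = self.root
        for name in path.split("/"):
            if name:
                node = node.children[name]  # KeyError if a component is missing
        return node

    def create(self, path, data=b""):
        parent_path, _, name = path.rpartition("/")
        if not path.startswith("/") or not name:
            raise ValueError("need an absolute path with a node name: %r" % path)
        parent = self._find(parent_path or "/")
        if name in parent.children:
            raise KeyError("node already exists: %r" % path)
        parent.children[name] = ZNode(data)
        return path

    def get_data(self, path):
        return self._find(path).data

    def get_children(self, path):
        return sorted(self._find(path).children)


tree = ZNodeTree()
tree.create("/apps", b"settings for all apps")
tree.create("/apps/reports", b"v1")
# "/apps" behaves like a file (it has data) and a directory (it has children).
print(tree.get_data("/apps"), tree.get_children("/apps"))
```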
Any Unicode character can be used in a path, subject to the following constraints:
The null character (\u0000) cannot be part of a path name.
These characters can't be used: \u0001 - \u0019 and \u007F - \u009F.
These characters are not allowed: \ud800 - \uF8FF, \uFFF0 - \uFFFF, \uXFFFE - \uXFFFF
(where X is a digit 1 - E), \uF0000 - \uFFFFF.
The "." character can be used as part of another name, but "." and ".." cannot alone be used to
indicate a node along a path.
The token "zookeeper" is reserved.
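The constraints above can be turned into a rough checker. This is a sketch, not ZooKeeper's own validation: the function name `validate_path` is hypothetical, the character ranges are copied from the list above, the supplementary-plane ranges are left out, and rejecting empty names and treating "zookeeper" as reserved at every level are assumptions made for the example.

```python
# Inclusive character ranges copied from the list above (Basic Multilingual Plane only).
FORBIDDEN_RANGES = [
    ("\u0001", "\u0019"),
    ("\u007f", "\u009f"),
    ("\ud800", "\uf8ff"),
    ("\ufff0", "\uffff"),
]


def validate_path(path):
    """Return None if the path passes the checks, otherwise a short reason string."""
    if not path.startswith("/"):
        return "path must be absolute"
    if path == "/":
        return None
    for name in path[1:].split("/"):
        if name == "":
            return "empty node name"  # assumption: not stated in the list above
        if name in (".", ".."):
            return "relative names are not allowed"
        if name == "zookeeper":
            return "'zookeeper' is reserved"  # assumption: applied at every level
        for ch in name:
            if ch == "\u0000":
                return "null character"
            if any(low <= ch <= high for low, high in FORBIDDEN_RANGES):
                return "forbidden character U+%04X" % ord(ch)
    return None


for candidate in ["/apps/reports", "/apps/.hidden", "apps", "/apps/../users", "/zookeeper"]:
    print(candidate, "->", validate_path(candidate) or "ok")
```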
ZNodes
Every node in a ZooKeeper tree is referred to as a znode.
Znodes maintain a stat structure that includes version numbers for data changes and ACL changes.
The stat structure also has timestamps.
The version number, together with the timestamp, allows ZooKeeper to validate the cache and to
coordinate updates. Each time a znode's data changes, the version number increases.
For instance, whenever a client retrieves data, it also receives the version of the data. And when
a client performs an update or a delete, it must supply the version of the data of the znode it is
changing. If the version it supplies doesn't match the actual version of the data, the update will
fail.
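The version check described above is a form of optimistic concurrency control. The sketch below models it on a single in-memory node; `VersionedZNode` and `BadVersionError` are hypothetical names chosen for the example, and no ZooKeeper client is involved.

```python
class BadVersionError(Exception):
    """Raised when the caller's expected version does not match the node's version."""


class VersionedZNode:
    def __init__(self, data):
        self.data = data
        self.version = 0

    def get_data(self):
        """Return the data together with the version it was read at."""
        return self.data, self.version

    def set_data(self, data, expected_version):
        """Update only if nobody else changed the node since the caller read it."""
        if expected_version != self.version:
            raise BadVersionError(
                "expected version %d, actual %d" % (expected_version, self.version)
            )
        self.data = data
        self.version += 1
        return self.version


node = VersionedZNode(b"primary=db-host-1")
_, version_a = node.get_data()  # client A reads version 0
_, version_b = node.get_data()  # client B reads version 0 as well

node.set_data(b"primary=db-host-2", version_a)  # A wins, version becomes 1
try:
    node.set_data(b"primary=db-host-3", version_b)  # B's version is now stale
    outcome = "updated"
except BadVersionError:
    outcome = "rejected"
print(outcome, node.get_data())
```

Client B is expected to re-read the node, look at what A wrote, and retry with the new version if its update still makes sense.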
ZooKeeper Service
(Figure: a ZooKeeper service made up of five servers, one of them the leader, serving multiple clients.)
Facts
All servers store a copy of the data in memory.
The leader is elected at startup.
Followers respond to clients; all updates go through the leader.
Responses are sent when a majority of servers have persisted the change.
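The majority rule in the last point comes down to a little arithmetic, shown here as a sketch. The helper names `majority` and `is_committed` are hypothetical and only illustrate the counting, not the actual replication protocol.

```python
def majority(ensemble_size):
    """Smallest number of servers that is more than half of the ensemble."""
    return ensemble_size // 2 + 1


def is_committed(acks, ensemble_size):
    """True once enough servers have persisted the change."""
    return acks >= majority(ensemble_size)


for size in (3, 4, 5):
    needed = majority(size)
    print("ensemble of %d: needs %d acks, tolerates %d failures"
          % (size, needed, size - needed))
```

An ensemble of four tolerates no more failures than an ensemble of three, which is why odd ensemble sizes are the usual choice.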
Observations
Distributed systems always need some form of coordination.
Programmers cannot use locks correctly.
Message based coordination can be hard to use in some applications.
Wishes
Simple, Robust, Good Performance
Tuned for read-dominant workloads
Familiar models and interfaces
Wait Free
Need to be able to wait efficiently
ZooKeeper API
String create(path, data, acl, flags)
void delete(path, expectedVersion)
Stat setData(path, data, expectedVersion)
(data, Stat) getData(path, watch)
Stat exists(path, watch)
String[] getChildren(path, watch)
void sync(path)
ZooKeeper and HBase
Master Failover
Region Servers and Master discovery via ZooKeeper
HBase clients connect to ZooKeeper to find configuration data
Region Servers and Master failure detection
HBase and ZooKeeper As of Now!
master: if there is more than one master, they fight.
root-region-server: this znode holds the location of the server hosting the root of all tables in HBase.
rs: a directory in which there is a znode per HBase region server. Region servers register themselves with ZooKeeper when they come online.
On region server failure (detected via ephemeral znodes and notification via ZooKeeper), the master splits the edits out per region.
Znode layout (figure): / with children master, root-region-server, rs, shutdown
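Failure detection through ephemeral znodes can be pictured with a toy registry: an entry lives only as long as the session that created it, and a watcher is told when it disappears. `EphemeralRegistry`, the session ids, and the server names are hypothetical, and session expiry is triggered by hand here rather than by missed heartbeats.

```python
class EphemeralRegistry:
    """Toy model of the 'rs' directory: each entry is owned by one session."""

    def __init__(self, on_removed):
        self._owners = {}  # znode name -> session id that created it
        self._on_removed = on_removed

    def register(self, session_id, name):
        """Called by a region server when it comes online."""
        self._owners[name] = session_id

    def live_servers(self):
        return sorted(self._owners)

    def expire_session(self, session_id):
        """Drop every entry owned by the session and notify the watcher."""
        gone = sorted(n for n, s in self._owners.items() if s == session_id)
        for name in gone:
            del self._owners[name]
            self._on_removed(name)


failed = []  # the "master" records which servers it has to recover
rs = EphemeralRegistry(on_removed=failed.append)
rs.register("session-1", "regionserver-a")
rs.register("session-2", "regionserver-b")

rs.expire_session("session-2")  # regionserver-b crashed or lost its connection
print(rs.live_servers(), failed)
```

In the HBase case the callback is where the master would start splitting the failed server's edits out per region.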
Release 3.3.0: What's in It for HBase?
Key Features
Allow configuration of session timeout min/max bounds
Improved logging information to detect issues
Improved debugging tools
Queue implementation available
Improved performance and robustness
Improved documentation
Upcoming 3.4 Release
Key Features
No ConnectionLoss
Testing - Mockito
More backwards-compatibility testing
Use Netty - allow encryption
More ZooKeeper in HBase?
Table Schema and state in ZooKeeper
read only, online
Region Server state transitions via ZooKeeper
Store region assignment in ZooKeeper for each Region Server
http://wiki.apache.org/hadoop/ZooKeeper/HBaseUseCases
Thank You! See You in Class Next Week.