Hadoop-1

www.edureka.in/hadoop

Course Topics

Week 1: Understanding Big Data; Introduction to HDFS

Week 2: Playing around with Cluster; Data Loading Techniques

Week 3: Map-Reduce Basics, types and formats; Use-cases for Map-Reduce

Week 4: Analytics using Pig; Understanding Pig Latin

Week 5: Analytics using Hive; Understanding HiveQL

Week 6: NoSQL Databases; Understanding HBase

Week 7: Data Loading Techniques in HBase; ZooKeeper

Week 8: Real world Datasets and Analysis; Hadoop Project Environment


Data Loading Techniques used in HBase

Using the HBase shell

Using Pig

Using Sqoop

Using the Client API


Loading data into HBase using Pig

Load data into HDFS

Retrieve the data using Pig

Store into HBase

Example:

input.csv

1, fname1, lname1
2, fname2, lname2
3, fname3, lname3

-- Read the CSV from HDFS; the first field (listing_id) becomes the HBase row key.
raw_data = LOAD 'input.csv' USING PigStorage(',') AS (listing_id: chararray, fname: chararray, lname: chararray);

-- Optional check: print the loaded tuples.
dump raw_data;

-- The remaining fields map, in order, to the space-separated column list.
STORE raw_data INTO 'hbase://sample_names' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('info:fname info:lname');


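Loading data into HBase using Sqoop

Sqoop can be used to import data directly from an RDBMS into HBase:

MySQL -> Sqoop -> HBase

Example (the option values are missing from the slide; the angle-bracket placeholders below only mark where they go):

sqoop import \
  --connect jdbc:mysql://<host>/<database> \
  --username <user> \
  --password <password> \
  --table <source-table> \
  --hbase-table <hbase-table> \
  --column-family <column-family> \
  --hbase-row-key <row-key-column> \
  --hbase-create-table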


HBase Client API

Package

org.apache.hadoop.hbase.client.HTable

It is recommended that you create HTable instances only once, one per thread, and reuse each instance for the rest of the lifetime of your client application.


HBase Java Client Interfaces

Configuration: where to find the cluster, plus tunable settings. Similar to a JDBC connection string.

HBaseAdmin: helps to manage administrative tasks.

HTableDescriptor: has the details of the table.

HTable: a handle on a single table. It is used to issue Put, Get, and Scan commands to the table.


Steps to create Table and Column family

Create an HBaseAdmin instance

Code the table schema (the table schema is represented by HTableDescriptor)

Introduce column families to the table description (HColumnDescriptor)

Create the table through the HBaseAdmin instance


Problem

In a distributed environment, getting processes to act in any kind of synchrony is an extremely hard problem.

For example, simply having a set of processes wait until they've all reached the same point in their execution (a kind of distributed barrier) is surprisingly difficult to do correctly.

Solution: ZooKeeper offers an API to facilitate this sort of distributed coordination. For example, it is often used to serve locks to client processes; locks are just another kind of coordination primitive.


Why Use ZooKeeper?

To solve complex distributed algorithms

To avoid race conditions and deadlocks

To avoid management complexities

Easy-to-use programming model

To apply reusable code libraries in common use cases


What Is A ZooKeeper?

ZooKeeper is a sort of central nervous system for distributed systems, where the role of the brain is played by the coordination service, axons are the network, processes are the monitored and controlled body parts, and events are the hormones and neurotransmitters used for messaging.

A Reliable, Scalable Distributed Coordination System

Apache ZooKeeper is a software project of the Apache Software Foundation, providing an open source distributed configuration service, synchronization service and naming registry for large distributed systems.

Every complex distributed application needs a coordination and orchestration system of some sort, so the ZooKeeper folks at Yahoo decided to build a good one and open-source it for everyone to use!


The Target Market

Target market for ZooKeeper: multi-host, multi-process, C- and Java-based systems


Who Uses ZooKeeper?

Future Users


Working Model

ZooKeeper lets distributed processes coordinate with each other through a shared hierarchical namespace.

Data is kept in memory and is backed up to a log for reliability. Because it serves data from memory, ZooKeeper is very fast and can handle high loads.


Distributed Coordination System

Why would you ever need a Distributed Coordination System?

Because you can't get these guarantees from an event system plopped on top of a database, and these are the sort of guarantees you need in a complex distributed system where connections drop, nodes fail, retransmits happen, and chaos rules the day.


Example

Assume the system is an ad system for serving advertisements to web sites. Ad systems are complex beasts that require a fair bit of coordination. Imagine all the subsystems needing to run on a cluster of 100 nodes: database, monitoring, fraud detectors, beacon servers, web server event log processors, failover servers, customer dashboards, targeting engines, campaign planners, campaign scenario testers, upgrades, installs, media managers, and so on.

There's a lot going on.


Coordination Service

Now imagine the power in the data center flips and all the machines power on! How do all the processes across all the hosts know what to do?

Now imagine everything is up and a few machines go down. How do all the processes know what to do in this situation?

This is where a coordination service comes in! A coordination service acts as the backplane over which all these subsystems figure out what they should do relative to all the other subsystems in a product.


In this scenario, ZooKeeper acts as the Service Locator. Each process goes to ZooKeeper and finds out which is the primary database. If a new primary is elected, say because a host fails, then ZooKeeper sends an event that allows everyone dependent on the database to react by getting the new primary database.
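The service-locator pattern described above can be sketched in a few lines of Python. This is an in-memory toy, not the ZooKeeper client API: `TinyRegistry`, `DbClient`, and the path `/services/db/primary` are hypothetical names made up for the illustration. The one behavior it borrows from ZooKeeper is that a watch fires once and must be re-registered by the client.

```python
from collections import defaultdict


class TinyRegistry:
    """In-memory stand-in for a coordination service (illustration only)."""

    def __init__(self):
        self._data = {}
        self._watches = defaultdict(list)  # path -> one-shot callbacks

    def get_data(self, path, watch=None):
        # Reading can optionally leave a one-time watch on the path.
        if watch is not None:
            self._watches[path].append(watch)
        return self._data.get(path)

    def set_data(self, path, value):
        self._data[path] = value
        # Watches are removed before they fire, so each one fires once.
        for callback in self._watches.pop(path, []):
            callback(path)


class DbClient:
    """A process that depends on the primary database."""

    def __init__(self, registry, path="/services/db/primary"):
        self.registry = registry
        self.path = path
        self.primary = None
        self._refresh(path)

    def _refresh(self, _path):
        # Re-read the current primary and re-register the watch.
        self.primary = self.registry.get_data(self.path, watch=self._refresh)


if __name__ == "__main__":
    registry = TinyRegistry()
    registry.set_data("/services/db/primary", "db-host-1")
    client = DbClient(registry)
    registry.set_data("/services/db/primary", "db-host-2")  # failover
    print(client.primary)
```

The client never polls: it learns about the new primary because the registry calls it back, and it keeps learning about later changes only because `_refresh` registers a fresh watch each time.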


ZooKeeper Data Model

Hierarchical namespace (like a file system)

Each znode has data and children

Data is read and written in its entirety

Znode names shown in the slide's tree diagram: /, Services, apps, users, locks, Stupidname, YaView, Servers, morestupidity, read-1


Hierarchical Namespace

ZooKeeper has a hierarchical namespace, much like a distributed file system. The only difference is that each node in the namespace can have data associated with it as well as children. It is like having a file system that allows a file to also be a directory. Paths to nodes are always expressed as canonical, absolute, slash-separated paths; there are no relative references.


Any Unicode character can be used in a path, subject to the following constraints:

The null character (\u0000) cannot be part of a path name.

These characters can't be used: \u0001 - \u0019 and \u007F - \u009F.

These characters are not allowed: \uD800 - \uF8FF, \uFFF0 - \uFFFF, \uXFFFE - \uXFFFF (where X is a digit 1 - E), \uF0000 - \uFFFFF.

The "." character can be used as part of another name, but "." and ".." cannot be used alone to indicate a node along a path.

The token "zookeeper" is reserved.
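The rules above translate fairly directly into a checker. The sketch below is a hypothetical helper written for these notes, not ZooKeeper's own validation code. It makes three assumptions: the garbled first range of disallowed characters is read as \uD800 to \uF8FF, "canonical" is taken to mean no empty node names, and the reserved token "zookeeper" is simply rejected as a node name.

```python
# Forbidden code point ranges, taken from the constraints listed above.
FORBIDDEN_RANGES = [
    (0x0001, 0x0019),
    (0x007F, 0x009F),
    (0xD800, 0xF8FF),    # assumption: the slide's garbled range read as D800-F8FF
    (0xFFF0, 0xFFFF),
    (0xF0000, 0xFFFFF),
]


def path_problem(path):
    """Return a description of the first rule `path` breaks, or None."""
    if not path.startswith("/"):
        return "path must be absolute"
    for ch in path:
        cp = ord(ch)
        if cp == 0:
            return "null character in path"
        if any(lo <= cp <= hi for lo, hi in FORBIDDEN_RANGES):
            return f"forbidden character U+{cp:04X}"
        # \uXFFFE - \uXFFFF where X is a digit 1 - E
        if 0x10000 <= cp <= 0xEFFFF and (cp & 0xFFFF) >= 0xFFFE:
            return f"forbidden character U+{cp:04X}"
    if path == "/":
        return None
    for name in path[1:].split("/"):
        if name == "":
            return "empty node name"
        if name in (".", ".."):
            return "relative path element"
        if name == "zookeeper":
            return "reserved token"
    return None


def is_valid_path(path):
    return path_problem(path) is None
```

For example, `is_valid_path("/apps/v1.2")` passes because "." appears inside a longer name, while `is_valid_path("/apps/../users")` fails because ".." stands alone as a path element.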


ZNodes

Every node in a ZooKeeper tree is referred to as a Znode.

Znodes maintain a stat structure that includes version numbers for data changes and ACL changes. The stat structure also has timestamps.

The version number, together with the timestamp, allows ZooKeeper to validate the cache and to coordinate updates. Each time a znode's data changes, the version number increases.

For instance, whenever a client retrieves data, it also receives the version of the data. And when a client performs an update or a delete, it must supply the version of the data of the znode it is changing. If the version it supplies doesn't match the actual version of the data, the update will fail.
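The version check described above is a form of optimistic concurrency control. The sketch below models it with a small in-memory store. `TinyZnodeStore` and `BadVersionError` are hypothetical names for this illustration and are not the ZooKeeper client API, although the method names loosely follow the API listed later in these slides.

```python
from dataclasses import dataclass


class BadVersionError(Exception):
    """Raised when the caller's expected version is stale."""


@dataclass
class Znode:
    data: bytes = b""
    version: int = 0


class TinyZnodeStore:
    """In-memory model of versioned znode updates (illustration only)."""

    def __init__(self):
        self._nodes = {}

    def create(self, path, data):
        if path in self._nodes:
            raise KeyError(f"node already exists: {path}")
        self._nodes[path] = Znode(data=data)
        return path

    def get_data(self, path):
        # A read returns the data together with its current version.
        node = self._nodes[path]
        return node.data, node.version

    def set_data(self, path, data, expected_version):
        node = self._nodes[path]
        if expected_version != node.version:
            raise BadVersionError(f"expected {expected_version}, found {node.version}")
        node.data = data
        node.version += 1  # every data change bumps the version
        return node.version

    def delete(self, path, expected_version):
        node = self._nodes[path]
        if expected_version != node.version:
            raise BadVersionError(f"expected {expected_version}, found {node.version}")
        del self._nodes[path]
```

If two clients both read version 0 and then both try to write, the first `set_data(..., expected_version=0)` succeeds and moves the znode to version 1, and the second raises `BadVersionError`. The losing client has to re-read and retry, so no update is silently overwritten.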


ZooKeeper Service

Diagram: the ZooKeeper service as five servers, one of them the leader, serving multiple clients.


Facts

All servers store a copy of the data in memory.

The leader is elected at startup.

Followers respond to clients; all updates go through the leader.

Responses are sent when a majority of servers have persisted the change.
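The majority rule in the last fact is plain arithmetic, which the sketch below spells out. The function names are made up for this illustration.

```python
def majority(ensemble_size):
    """Smallest number of servers that forms a majority."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size):
    """How many servers can be down while a majority is still reachable."""
    return ensemble_size - majority(ensemble_size)


def is_committed(acks, ensemble_size):
    """True once enough servers have persisted the change."""
    return acks >= majority(ensemble_size)


if __name__ == "__main__":
    for n in (3, 4, 5):
        print(n, majority(n), tolerated_failures(n))
```

With five servers a change needs three acknowledgements, so two servers can fail. Four servers also need three acknowledgements and tolerate only one failure, which is one reason ensembles are commonly sized with an odd number of servers.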


Observations

Distributed systems always need some form of coordination.

Programmers cannot use locks correctly.

Message based coordination can be hard to use in some applications.


Wishes

Simple, Robust, Good Performance

Tuned for Read dominant workloads

Familiar models and interfaces

Wait Free

Need to be able to wait efficiently


ZooKeeper API

String create(path, data, acl, flags)

void delete(path, expectedVersion)

Stat setData(path, data, expectedVersion)

(data, Stat) getData(path, watch)

Stat exists(path, watch)

String[] getChildren(path, watch)

void sync(path)


ZooKeeper and HBase

Master Failover

Region Servers and Master discovery via ZooKeeper

HBase clients connect to ZooKeeper to find configuration data

Region Servers and Master failure detection


HBase and ZooKeeper As of Now!

master: If more than one master, they fight.

root-region-server: This znode holds the location of the server hosting the root of all tables in HBase.

rs: A directory in which there is a znode per HBase region server. Region Servers register themselves with ZooKeeper when they come online.

On Region Server failure (detected via ephemeral znodes and notification via ZooKeeper), the master splits the edits out per region.

Znode layout shown in the diagram:

/
    master
    root-region-server
    rs
    shutdown
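Failure detection through ephemeral znodes can be modelled in memory as follows. This is a toy, not the ZooKeeper or HBase API: `TinyEnsemble` and `Master` are hypothetical classes, the `/hbase` path prefix is an assumption, and `expire_session` stands in for the session timeout that a real ensemble would detect on its own.

```python
from collections import defaultdict


class TinyEnsemble:
    """In-memory model of ephemeral znodes and child watches (illustration only)."""

    def __init__(self):
        self._ephemerals = {}                    # path -> owning session id
        self._child_watches = defaultdict(list)  # parent path -> one-shot callbacks

    def create_ephemeral(self, path, session_id):
        if path in self._ephemerals:
            raise KeyError(f"node already exists: {path}")
        self._ephemerals[path] = session_id
        self._fire(path.rsplit("/", 1)[0])

    def get_children(self, parent, watch=None):
        if watch is not None:
            self._child_watches[parent].append(watch)
        prefix = parent.rstrip("/") + "/"
        return sorted(
            p[len(prefix):]
            for p in self._ephemerals
            if p.startswith(prefix) and "/" not in p[len(prefix):]
        )

    def expire_session(self, session_id):
        # An ephemeral znode disappears together with the session that created it.
        dead = [p for p, s in self._ephemerals.items() if s == session_id]
        for path in dead:
            del self._ephemerals[path]
            self._fire(path.rsplit("/", 1)[0])

    def _fire(self, parent):
        for callback in self._child_watches.pop(parent, []):
            callback(parent)


class Master:
    """Tracks live region servers by watching the children of the rs directory."""

    def __init__(self, zk, rs_dir="/hbase/rs"):
        self.zk = zk
        self.rs_dir = rs_dir
        self.live = set()
        self.failed = []
        self._on_change(rs_dir)

    def _on_change(self, _parent):
        # Re-read the children and re-register the one-shot watch.
        current = set(self.zk.get_children(self.rs_dir, watch=self._on_change))
        self.failed.extend(sorted(self.live - current))
        self.live = current
```

When a region server's session expires, its znode under `rs` vanishes, the master's watch fires, and the server's name lands in `master.failed`. In HBase that is the point at which the master would start splitting the failed server's edits out per region.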


Release 3.3.0, What's In It For HBase?

Key Features

Allow configuration of session timeout min/max bounds

Improved logging information to detect issues

Improved debugging tools

Queue implementation available

Improved performance and robustness

Improved documentation


Upcoming 3.4 Release

Key Features

No ConnectionLoss

Testing - Mockito

More backwards-compatibility testing

Use Netty - allow encryption


More ZooKeeper in HBase?

Table Schema and state in ZooKeeper

read only, online

Region Server state transitions via ZooKeeper

Store region assignment in ZooKeeper for each Region Server

http://wiki.apache.org/hadoop/ZooKeeper/HBaseUseCases


Thank You. See you in class next week.

Date post:	26-Sep-2015
Upload:	dharmendard