Hadoop-1

Date post: 26-Sep-2015
Upload: dharmendard
  • www.edureka.in/hadoop

  • Course Topics

    Week 1: Understanding Big Data; Introduction to HDFS

    Week 2: Playing around with Cluster; Data Loading Techniques

    Week 3: Map-Reduce Basics, Types and Formats; Use Cases for Map-Reduce

    Week 4: Analytics using Pig; Understanding Pig Latin

    Week 5: Analytics using Hive; Understanding Hive QL

    Week 6: NoSQL Databases; Understanding HBase

    Week 7: Data Loading Techniques in HBase; ZooKeeper

    Week 8: Real-world Datasets and Analysis; Hadoop Project Environment

  • Data Loading Techniques used in HBase

    Using the HBase shell

    Using Pig

    Using Sqoop

    Using the client API

  • Loading data into HBase using Pig

    Load the data into HDFS

    Retrieve the data using Pig

    Store it into HBase

    Example:

    raw_data = LOAD 'input.csv' USING PigStorage(',') AS (listing_id: chararray, fname: chararray, lname: chararray);

    DUMP raw_data;

    STORE raw_data INTO 'hbase://sample_names' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('info:fname info:lname');

    input.csv

    1, fname1, lname1
    2, fname2, lname2
    3, fname3, lname3
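    As a rough illustration of what the STORE statement does with each tuple, the Python sketch below mimics the mapping: HBaseStorage takes the first field as the row key and writes the remaining fields to the listed columns in order. The function name `to_hbase_rows` is hypothetical and nothing here talks to HBase. The sketch also strips the space after each comma, which is a convenience of this example only.

```python
import csv
import io

# Same column order as the HBaseStorage argument in the Pig script.
COLUMNS = ["info:fname", "info:lname"]


def to_hbase_rows(csv_text, columns=COLUMNS):
    """Map CSV lines to {row_key: {column: value}}.

    The first field of each record becomes the row key; the remaining
    fields are paired with `columns` in order.
    """
    rows = {}
    for record in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if not record:
            continue  # skip blank lines
        row_key, values = record[0], record[1:]
        rows[row_key] = dict(zip(columns, values))
    return rows


SAMPLE = "1, fname1, lname1\n2, fname2, lname2\n3, fname3, lname3\n"
rows = to_hbase_rows(SAMPLE)
print(rows["2"])
```

    Note that listing_id never appears as a column: it is consumed as the row key, which is why the HBaseStorage argument names only two columns for a three-field tuple.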

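  • Loading data into HBase using Sqoop

    Sqoop can import data directly from an RDBMS into HBase (MySQL -> Sqoop -> HBase).

    Example:

    sqoop import --connect jdbc:mysql://\ --username --password --table --hbase-table --column-family --hbase-row-key --hbase-create-table

    Every option except the --hbase-create-table flag expects a value: the JDBC URL, the credentials, the source table, the target HBase table, the column family and the row-key column.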

  • HBase Client API

    Class: org.apache.hadoop.hbase.client.HTable

    It is recommended that you create HTable instances only once, one per thread, and reuse each instance for the rest of the lifetime of your client application.

  • HBase Java Client Interfaces

    Configuration: where to find the cluster, plus tunable settings. Similar to a JDBC connection string.

    HBaseAdmin: helps to manage administrative tasks.

    HTableDescriptor: holds the details of the table.

    HTable: a handle on a single table. Used to issue Put, Get and Scan commands to the table.

  • Steps to create a Table and Column Family

    1. Create an HBaseAdmin instance.

    2. Code the table schema. The schema is represented by an HTableDescriptor.

    3. Add column families to the table description (HColumnDescriptor).

    4. Create the table through the HBaseAdmin instance.

  • Problem

    In a distributed environment, getting processes to act in any kind of synchrony is an extremely hard problem.

    For example, simply having a set of processes wait until they've all reached the same point in their execution (a kind of distributed barrier) is surprisingly difficult to do correctly.

    Solution:

    ZooKeeper offers an API to facilitate this sort of distributed coordination. For example, it is often used to serve locks to client processes; locks are just another kind of coordination primitive.
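    To make the barrier idea concrete, here is a single-process analogue using Python threads. It only shows what "wait until everyone has reached the same point" means; it does not use ZooKeeper, and a barrier across hosts additionally has to cope with failures and network delays. The names `run_barrier_demo` and `worker` are made up for this sketch.

```python
import threading


def run_barrier_demo(n_workers=4):
    """Run n_workers threads that all wait at one barrier; return the event log."""
    barrier = threading.Barrier(n_workers)
    log_lock = threading.Lock()
    events = []

    def worker(worker_id):
        with log_lock:
            events.append(("arrived", worker_id))
        # Blocks until all n_workers threads have called wait().
        barrier.wait()
        with log_lock:
            events.append(("passed", worker_id))

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return events


events = run_barrier_demo()
for kind, worker_id in events:
    print(kind, worker_id)
```

    Whatever order the threads are scheduled in, every "arrived" entry is logged before the first "passed" entry, which is exactly the guarantee a barrier provides.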

  • Why Use ZooKeeper?

    To solve complex distributed algorithms

    To avoid race conditions and deadlocks

    To avoid management complexities

    Easy-to-use programming model

    To apply reusable code libraries in common use cases

  • What Is ZooKeeper?

    A Reliable, Scalable Distributed Coordination System

    ZooKeeper is a sort of central nervous system for distributed systems, where the role of the brain is played by the coordination service, axons are the network, processes are the monitored and controlled body parts, and events are the hormones and neurotransmitters used for messaging.

    Apache ZooKeeper is a software project of the Apache Software Foundation, providing an open source distributed configuration service, synchronization service and naming registry for large distributed systems.

    Every complex distributed application needs a coordination and orchestration system of some sort, so the ZooKeeper folks at Yahoo decided to build a good one and open source it for everyone to use.

  • Target Market for ZooKeeper

    Multi-host, multi-process, C/Java-based systems

  • Who Uses ZooKeeper?

    Future Users

  • Working Model

    ZooKeeper lets distributed processes coordinate with each other through a shared hierarchical namespace.

    Data is kept in memory and is backed up to a log for reliability. Because it serves data from memory, ZooKeeper is very fast and can handle high loads.

  • Distributed Coordination System

    Why would you ever need a distributed coordination system?

    Because you can't get these guarantees from an event system plopped on top of a database, and these are the sort of guarantees you need in a complex distributed system where connections drop, nodes fail, retransmits happen, and chaos rules the day.

  • Example

    Assume the system is an ad system for serving advertisements to web sites. Ad systems are complex beasts that require a fair bit of coordination. Imagine all the subsystems needing to run on 100 nodes: database, monitoring, fraud detectors, beacon servers, web server event log processors, failover servers, customer dashboards, targeting engines, campaign planners, campaign scenario testers, upgrades, installs, media managers, and so on.

    There's a lot going on.

  • Coordination Service

    Now imagine the power in the data center flips and all the machines power on. How do all the processes across all the hosts know what to do?

    Now imagine everything is up and a few machines go down. How do all the processes know what to do in this situation?

    This is where a coordination service comes in. A coordination service acts as the backplane over which all these subsystems figure out what they should do relative to all the other subsystems in a product.

  • Coordination Service

    In this scenario, ZooKeeper acts as the service locator. Each process goes to ZooKeeper and finds out which is the primary database. If a new primary is elected, say because a host fails, then ZooKeeper sends an event that allows everyone dependent on the database to react by getting the new primary database.
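    The service-locator pattern described above can be sketched with a small in-memory stand-in. `ToyLocator` and `AppServer` are hypothetical names, the path and host names are invented, and none of this is the ZooKeeper client API. The one detail borrowed from ZooKeeper is that a watch fires once, so the client re-registers it each time it reads.

```python
class ToyLocator:
    """In-memory stand-in for a coordination service (not a ZooKeeper client)."""

    def __init__(self):
        self._values = {}
        self._watchers = {}

    def get(self, key, watcher=None):
        """Return the current value; optionally register a one-shot watcher."""
        if watcher is not None:
            self._watchers.setdefault(key, []).append(watcher)
        return self._values.get(key)

    def set(self, key, value):
        """Store a value and notify, then forget, every watcher on that key."""
        self._values[key] = value
        for watcher in self._watchers.pop(key, []):
            watcher(key, value)


class AppServer:
    """A process that needs to know which database is currently primary."""

    PRIMARY_KEY = "/services/db/primary"  # hypothetical path

    def __init__(self, locator):
        self.locator = locator
        self.primary = None
        self.refresh()

    def refresh(self, *_event):
        # Re-read the value and re-arm the watch in one step.
        self.primary = self.locator.get(self.PRIMARY_KEY, watcher=self.refresh)


locator = ToyLocator()
locator.set(AppServer.PRIMARY_KEY, "db-host-1")
app = AppServer(locator)
locator.set(AppServer.PRIMARY_KEY, "db-host-2")  # simulated failover
print(app.primary)
```

    The application never polls: it learns about the new primary because the locator calls it back, which is the "event" the slide refers to.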

  • ZooKeeper Data Model

    Hierarchical namespace (like a file system)

    Each znode has data and children

    Data is read and written in its entirety

    Example tree: the root (/) with znodes such as services, apps, users, locks, YaView, servers, stupidname, morestupidity and read-1

  • Hierarchical Namespace

    ZooKeeper has a hierarchical namespace, much like a distributed file system. The only difference is that each node in the namespace can have data associated with it as well as children. It is like having a file system that allows a file to also be a directory. Paths to nodes are always expressed as canonical, absolute, slash-separated paths; there are no relative references.
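    A toy in-memory tree makes the "file that is also a directory" idea concrete. The class and method names below (`ZNode`, `ToyTree`, `create`, `get_data`, `get_children`) are chosen for this sketch and are not ZooKeeper's client API; the only behaviour modelled is that every node carries both data and children and that paths must be absolute.

```python
class ZNode:
    """A node that holds data and may also have children."""

    def __init__(self, data=b""):
        self.data = data
        self.children = {}


class ToyTree:
    """Minimal hierarchical namespace addressed by absolute paths."""

    def __init__(self):
        self.root = ZNode()

    def _walk(self, path):
        if not path.startswith("/"):
            raise ValueError("paths must be absolute: %r" % path)
        node = self.root
        for name in [part for part in path.split("/") if part]:
            node = node.children[name]  # KeyError if the node does not exist
        return node

    def create(self, path, data=b""):
        if not path.startswith("/"):
            raise ValueError("paths must be absolute: %r" % path)
        parent_path, _, name = path.rpartition("/")
        if not name:
            raise ValueError("node name must not be empty")
        parent = self._walk(parent_path or "/")  # the parent must already exist
        if name in parent.children:
            raise KeyError("node already exists: %r" % path)
        parent.children[name] = ZNode(data)
        return path

    def get_data(self, path):
        return self._walk(path).data

    def get_children(self, path):
        return sorted(self._walk(path).children)


tree = ToyTree()
tree.create("/apps", b"owner=team-a")   # /apps has data ...
tree.create("/apps/web", b"port=8080")  # ... and a child
print(tree.get_data("/apps"), tree.get_children("/apps"))
```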

  • Any Unicode character can be used in a path, subject to the following constraints:

    The null character (\u0000) cannot be part of a path name.

    These characters can't be used: \u0001 - \u0019 and \u007F - \u009F.

    These characters are not allowed: \ud800 - \uF8FF, \uFFF0 - \uFFFF, \uXFFFE - \uXFFFF (where X is a digit 1 - E), \uF0000 - \uFFFFF.

    The "." character can be used as part of another name, but "." and ".." cannot alone be used to indicate a node along a path.

    The token "zookeeper" is reserved.
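    The constraints above translate fairly directly into a checker. The sketch below is an assumption-laden reading of the slide, not ZooKeeper's own validation code: the function name `invalid_reason` is invented, the second character range is taken to run from \ud800 to \uF8FF, and the reserved "zookeeper" token is not enforced.

```python
def invalid_reason(path):
    """Return None if `path` passes the listed rules, else a short reason."""
    if not path.startswith("/"):
        return "path must be absolute"
    if path == "/":
        return None  # the root itself
    for segment in path[1:].split("/"):
        if segment == "":
            return "empty node name"
        if segment in (".", ".."):
            return "relative segment"
        for ch in segment:
            cp = ord(ch)
            if cp == 0x0000:
                return "null character"
            if 0x0001 <= cp <= 0x0019 or 0x007F <= cp <= 0x009F:
                return "control character"
            if 0xD800 <= cp <= 0xF8FF or 0xFFF0 <= cp <= 0xFFFF:
                return "disallowed character"
            # \uXFFFE - \uXFFFF for X in 1..E
            if 0x10000 <= cp <= 0xEFFFF and (cp & 0xFFFF) >= 0xFFFE:
                return "disallowed character"
            if 0xF0000 <= cp <= 0xFFFFF:
                return "disallowed character"
    return None


for candidate in ["/apps/web", "/apps/.hidden", "apps", "/apps/../web", "/a//b"]:
    print(repr(candidate), invalid_reason(candidate))
```

    Returning a reason instead of a boolean keeps the example easy to extend, for instance by adding a check for the reserved token.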

  • ZNodes

    Every node in a ZooKeeper tree is referred to as a znode.

    Znodes maintain a stat structure that includes version numbers for data changes and ACL changes. The stat structure also has timestamps.

    The version number, together with the timestamp, allows ZooKeeper to validate the cache and to coordinate updates. Each time a znode's data changes, the version number increases.

    For instance, whenever a client retrieves data, it also receives the version of the data. When a client performs an update or a delete, it must supply the version of the data of the znode it is changing. If the version it supplies doesn't match the actual version of the data, the update will fail.
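    The version check described above is a form of optimistic concurrency control. A minimal in-memory sketch follows; `VersionedNode` and `BadVersionError` are names made up for the example, and only the compare-then-increment behaviour is modelled.

```python
class BadVersionError(Exception):
    """Raised when the caller's expected version is stale."""


class VersionedNode:
    """Holds data plus a version that increases on every successful write."""

    def __init__(self, data):
        self.data = data
        self.version = 0

    def get(self):
        # A read returns the data together with its current version.
        return self.data, self.version

    def set(self, data, expected_version):
        # The write succeeds only if nobody else has written in between.
        if expected_version != self.version:
            raise BadVersionError(
                "expected %d, actual %d" % (expected_version, self.version)
            )
        self.data = data
        self.version += 1
        return self.version


node = VersionedNode(b"primary=db-host-1")

# Two clients read the same version.
_, seen_by_a = node.get()
_, seen_by_b = node.get()

node.set(b"primary=db-host-2", seen_by_a)  # client A wins
try:
    node.set(b"primary=db-host-3", seen_by_b)  # client B is stale
except BadVersionError as error:
    print("update rejected:", error)
```

    Client B has to read again and decide whether its change still makes sense, so a write based on stale data is rejected instead of silently overwriting client A's update.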

  • ZooKeeper Service

    Figure: an ensemble of five servers, one of which is the leader. Each client connects to one of the servers.

  • Facts

    All servers store a copy of the data in memory.

    The leader is elected at startup.

    Followers respond to clients; all updates go through the leader.

    Responses are sent when a majority of servers have persisted the change.
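    The majority rule in the last fact is simple arithmetic, sketched below. The function names are invented for the example; the only assumption is that "majority" means strictly more than half of the servers.

```python
def quorum_size(n_servers):
    """Smallest number of servers that is strictly more than half."""
    return n_servers // 2 + 1


def tolerated_failures(n_servers):
    """How many servers can be down while a majority is still reachable."""
    return n_servers - quorum_size(n_servers)


def is_committed(acks, n_servers):
    """True once enough servers have persisted the change."""
    return acks >= quorum_size(n_servers)


for n in (3, 4, 5, 6):
    print(n, "servers: quorum", quorum_size(n), "tolerates", tolerated_failures(n))
```

    Under this rule a sixth server raises the quorum from 3 to 4 without raising the number of tolerated failures above 2, which is one reason ensembles are usually sized with an odd number of servers.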

  • Observations

    Distributed systems always need some form of coordination.

    Programmers cannot use locks correctly.

    Message based coordination can be hard to use in some applications.

  • Wishes

    Simple, robust, good performance

    Tuned for read-dominant workloads

    Familiar models and interfaces

    Wait-free

    Need to be able to wait efficiently

  • ZooKeeper API

    String create(path, data, acl, flags)

    void delete(path, expectedVersion)

    Stat setData(path, data, expectedVersion)

    (data, Stat) getData(path, watch)

    Stat exists(path, watch)

    String[] getChildren(path, watch)

    void sync(path)

  • ZooKeeper and HBase

    Master Failover

    Region Servers and Master discovery via ZooKeeper

    HBase clients connect to ZooKeeper to find configuration data

    Region Servers and Master failure detection

  • HBase and ZooKeeper As of Now

    Znodes under the root (/): master, Root.region.server, rs, shutdown

    master: if there is more than one master, they fight.

    Root.region.server: this znode holds the location of the server hosting the root of all tables in HBase.

    rs: a directory in which there is a znode per HBase region server. Region servers register themselves with ZooKeeper when they come online.

    On region server failure (detected via ephemeral znodes and notification via ZooKeeper), the master splits the edits out per region.
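    Failure detection through ephemeral znodes can be sketched with a toy model: an ephemeral node belongs to a client session and disappears when that session ends, at which point interested parties are notified. `ToyEnsemble`, its methods, and the server names are hypothetical. As a simplification, the callbacks here stay registered, whereas real ZooKeeper watches fire once.

```python
class ToyEnsemble:
    """Tracks ephemeral nodes per session (in-memory, not a ZooKeeper client)."""

    def __init__(self):
        self._owner = {}      # path -> session id that created it
        self._listeners = []  # callbacks invoked with each removed path

    def register(self, session_id, path):
        """Create an ephemeral node owned by the given session."""
        self._owner[path] = session_id

    def on_removed(self, callback):
        self._listeners.append(callback)

    def children(self, parent):
        return sorted(p for p in self._owner if p.startswith(parent + "/"))

    def expire_session(self, session_id):
        """Simulate a crashed client: drop its nodes and notify listeners."""
        gone = [p for p, owner in self._owner.items() if owner == session_id]
        for path in gone:
            del self._owner[path]
        for callback in self._listeners:
            for path in gone:
                callback(path)


ensemble = ToyEnsemble()
failed_servers = []  # what the master would act on
ensemble.on_removed(failed_servers.append)

ensemble.register(session_id=1, path="/rs/server-a")
ensemble.register(session_id=2, path="/rs/server-b")

ensemble.expire_session(1)  # server-a stops responding
print(ensemble.children("/rs"), failed_servers)
```

    In the model, the master never has to ping region servers itself; it reacts to the disappearance of a node under rs.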

  • Release 3.3.0: What's in It for HBase?

    Key features:

    Allow configuration of session timeout min/max bounds

    Improved logging information to detect issues

    Improved debugging tools

    Queue implementation available

    Improved performance and robustness

    Improved documentation

  • Upcoming 3.4 Release

    Key features:

    No ConnectionLoss

    Testing with Mockito

    More backwards-compatibility testing

    Use Netty to allow encryption

  • More ZooKeeper in HBase?

    Table schema and state (read-only, online) in ZooKeeper

    Region server state transitions via ZooKeeper

    Store region assignment in ZooKeeper for each region server

    http://wiki.apache.org/hadoop/ZooKeeper/HBaseUseCases

  • Thank you. See you in class next week.

