Date posted: 21-Jul-2015
Category: Data & Analytics
Uploaded by: lokesh-ramaswamy
Hadoop Fundamentals
Agenda
• What is Hadoop
• Revisiting Big Data
• Examples of Hadoop in Action
• Limitations of Hadoop
• Big Data and Cloud
• Hadoop Architecture
What is Hadoop?
• Open Source Project
• Written in Java
• Uses Google’s MapReduce and Google File system as its foundation
• Optimized to handle:
• Massive amounts of data through parallelism
• A variety of data (structured, unstructured and semi structured)
• Using inexpensive commodity hardware
• Great Performance
• Reliability provided through replication
• Not suited for OLTP or OLAP workloads
• Hadoop is used for Big Data; it complements OLTP and OLAP systems
What is Big Data?
• With all the devices available to collect data viz. RFID readers, microphones, cameras and sensors, there is an explosion of data worldwide
• Big data is large collections of data (also known as datasets) that may be unstructured, and grow so large and quickly that it is difficult to manage with regular database or statistical tools
Important statistics
• Over 2 billion internet users and 7.3 billion active cellphones
• Twitter processes 7 TB of data every day; Facebook processes 500 TB
• Approximately 80% of this data is unstructured
There is a need for fast, reliable, deep data insight; hence the relevance of Hadoop
Some of the Open Source Projects related to Hadoop
• Eclipse - popular IDE donated by IBM to the open source community
• Lucene – Text search engine developed in Java
• Hbase – Hadoop database
• Hive – provides data warehousing tools to extract, transform and load data, and then query the stored data
• Pig – High level language that generates MapReduce code to analyze large datasets
Examples of Hadoop in Action
• 2011 – Watson, a supercomputer developed by IBM, competed in the quiz show Jeopardy!
• Approximately 2 million pages of text were loaded, using Hadoop to distribute the workload of storing this information in memory
• Used advanced search and analysis
China Mobile
• Uses a Hadoop cluster for data mining on Call Data Records
• China Mobile was producing 5–8 TB of this data
• Used Hadoop to process 10 times the data at 1/5th the cost
Examples of Hadoop in Action
New York Times
• Hosted all public domain articles from 1851 to 1922
• 11 million files converted into 1.5 TB of PDF
• One employee ran the job for 24 hours on a 100-instance Amazon EC2 Hadoop cluster
Yahoo
• Largest production user with an application running a Hadoop cluster consisting of approximately 10,000 Linux machines
• Also the largest contributor to the Hadoop open source project
Limitations of Hadoop
• Not good for processing transactions (random access)
• Not good when work cannot be parallelized
• Not good for low latency data access
• Not good for processing lots of small files
• Not good for intensive calculations with little data
Big Data Solutions and the Cloud
• Big Data solutions are more than just Hadoop:
• Add business intelligence / analytics functionality
• Derive information from data in motion
• Big Data solutions and the cloud are a perfect fit:
• The cloud allows you to set up a cluster of systems in minutes, and it is relatively inexpensive
Hadoop Architecture
Terminology Review
• Node – a computer, typically non-enterprise commodity hardware
• Rack – a collection of 30–40 nodes that are physically stored close to each other and connected to the same network switch
• Network bandwidth between two nodes in the same rack is greater than bandwidth between two nodes on different racks
• A Hadoop cluster is a collection of racks
Pre Hadoop 2.2 Architecture
• Distributed File System:
• Hadoop Distributed File System (HDFS)
• GPFS-FPO
• MapReduce Engine:
• Framework for performing calculations on the data in the file system
• Has a built-in resource manager and scheduler
Pre-Hadoop 2.2 MapReduce is called MapReduce 1.0; it has its own resource management and scheduling
Hadoop Distributed File System
• A distributed file system that provides high-performance access to data across Hadoop clusters.
• Key tool for managing pools of big data and supporting big data analytics applications.
Hadoop Distributed File System
• HDFS runs on the existing file system:
• Not POSIX compliant
• Designed to tolerate a high component failure rate:
• Reliability through replication
• Designed to handle very large files:
• Large streaming data access patterns
• No random access
• Uses blocks to store a file or parts of a file
Basic Features of HDFS
• Large: an HDFS instance may consist of thousands of server machines, each storing part of the file system's data
• Replication – each data block is replicated many times (typically 3)
• Failure – failure is the norm rather than the exception
• Highly fault-tolerant:
• The Name Node constantly checks the Data Nodes
• Detection of faults and quick automatic recovery from them is the goal
• High throughput
• Streaming access to file system data
• Can be built out of commodity hardware
Types of Nodes
[Diagram: a Client talks to the Name Node and the Job Tracker; the Job Tracker coordinates several Data Nodes, each running a Task Tracker.]
Name Nodes & Data Nodes
• Name Node maintains metadata about the files
• Data Nodes:
• Store the actual data
• Files are divided into blocks
• Each block is replicated N times (default N=3)
Master/slave architecture
An HDFS cluster consists of a single Namenode, a master server that manages the file system namespace and regulates access to files by clients.
There are a number of DataNodes, usually one per node in the cluster.
The DataNodes manage storage attached to the nodes they run on.
HDFS exposes a file system namespace and allows user data to be stored in files.
A file is split into one or more blocks, and these blocks are stored in DataNodes.
DataNodes serve read and write requests and perform block creation, deletion, and replication upon instruction from the Namenode.
Job Tracker and Task Tracker
• Job Tracker is the master node:
• Receives the user's job
• Decides how many tasks will run (number of mappers)
• Decides where to run each mapper (concept of locality)
• Task Tracker is the slave node:
• Receives a task from the Job Tracker
• Runs the task until completion (either a map or a reduce task)
• Always in communication with the Job Tracker, reporting progress
Fault Tolerance
• Failure is the norm rather than exception
• An HDFS instance may consist of thousands of server machines, each storing part of the file system's data.
• With this huge number of components, each with a non-trivial probability of failure, some component is always non-functional.
• Detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS
Data Characteristics
Streaming data access
Applications need streaming access to data
Batch processing rather than interactive user access.
Large data sets and files: gigabytes to terabytes in size
High aggregate data bandwidth
Scale to hundreds of nodes in a cluster
Tens of millions of files in a single instance
Write-once-read-many: a file once created, written and closed need not be changed – this assumption simplifies coherency
A map-reduce application or web-crawler application fits perfectly with this model
Data Replication
HDFS is designed to store very large files across machines in a large cluster.
Each file is a sequence of blocks.
All blocks in the file except the last are of the same size.
Blocks are replicated for fault tolerance.
Block size and replicas are configurable per file.
The Namenode receives a Heartbeat and a BlockReport from each DataNode in the cluster.
A BlockReport lists all the blocks on a DataNode.
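The Heartbeat/BlockReport bookkeeping can be sketched in plain Python (Hadoop itself is written in Java; all names here are illustrative, not Hadoop APIs):

```python
# Illustrative sketch, not Hadoop code: a NameNode aggregating
# BlockReports and flagging blocks with fewer than 3 replicas.
from collections import defaultdict

REPLICATION_FACTOR = 3  # HDFS default

def under_replicated(block_reports):
    """block_reports maps DataNode name -> list of block ids it stores."""
    replicas = defaultdict(int)
    for node, blocks in block_reports.items():
        for b in blocks:
            replicas[b] += 1
    return {b for b, n in replicas.items() if n < REPLICATION_FACTOR}

reports = {
    "dn1": ["blk_1", "blk_2"],
    "dn2": ["blk_1", "blk_2"],
    "dn3": ["blk_1"],  # dn3 has lost its copy of blk_2
}
print(under_replicated(reports))  # {'blk_2'}
```

This is exactly why the Namenode wants periodic BlockReports: once a block falls below its replication factor, it can instruct some DataNode to re-replicate it.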
HDFS file blocks
• Not the same as the operating system's file blocks:
• An HDFS block is made up of multiple operating system blocks
• Default for Hadoop is 64 MB:
• Recommended is 128 MB (this is BigInsights' default block size)
• A file can be larger than any single disk in the cluster:
• Blocks for a single file are spread across multiple nodes in the cluster
• If a chunk of a file is smaller than the HDFS block size:
• Only the needed space is used
• Blocks work well with replication:
• Fault tolerance and availability with commodity hardware
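The block arithmetic above can be made concrete with a small sketch (plain Python, illustrative only):

```python
# Illustrative arithmetic: how a file maps onto HDFS blocks.
def split_into_blocks(file_size_mb, block_size_mb=128):
    """Return the sizes of the HDFS blocks holding a file.
    Only the last block may be partial, and a partial block
    uses only the space it needs."""
    full, rest = divmod(file_size_mb, block_size_mb)
    return [block_size_mb] * full + ([rest] if rest else [])

# A 300 MB file with 128 MB blocks -> two full blocks plus a 44 MB tail.
print(split_into_blocks(300))  # [128, 128, 44]
```

The three blocks of that 300 MB file can then live on three different nodes, which is how a file can exceed the capacity of any single disk.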
HDFS Replication
• Blocks of data are replicated to multiple nodes
• Allows for node failure without data loss
• Replication can be done on many more nodes by:
• Changing the Hadoop configuration
• Setting the replication factor for each file
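A much simplified placement sketch shows why replication tolerates node failure. (Real HDFS placement is rack-aware; this toy version just spreads each block's replicas over distinct nodes, and all names are made up.)

```python
# Simplified sketch: spread each block's replicas over distinct nodes,
# so losing any one node never loses a block entirely.
import itertools

def place_replicas(blocks, nodes, replication=3):
    placement = {}
    rotation = itertools.cycle(range(len(nodes)))
    for block in blocks:
        start = next(rotation)
        placement[block] = [nodes[(start + i) % len(nodes)]
                            for i in range(replication)]
    return placement

nodes = ["dn1", "dn2", "dn3", "dn4"]
p = place_replicas(["blk_1", "blk_2"], nodes)
# Every block lives on 3 distinct nodes:
assert all(len(set(v)) == 3 for v in p.values())
```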
MapReduce Framework
• Based on technology from Google
• Processes huge datasets for certain kinds of distributable problems using a large number of nodes
• A MapReduce program consists of map and reduce functions:
• Map phase – divides data into smaller subsets that are distributed over different nodes
• Reduce phase – the master node collects all the returned data and combines it into some sort of output that can be used again
• Allows for distributed processing of map and reduce operations:
• Tasks run in parallel
Dataflow in MapReduce
Features of MapReduce
• Fine-grained Map and Reduce tasks:
• Improved load balancing
• Faster recovery from failed tasks
• Automatic re-execution on failure:
• In a large cluster, some nodes are always slow or flaky
• The framework re-executes failed tasks
• Locality optimizations:
• With large data, bandwidth to data is a problem
• MapReduce + HDFS is a very effective solution
• MapReduce queries HDFS for the locations of input data
• Map tasks are scheduled close to the inputs when possible
Key Value Pairs
• Mappers and Reducers are user-provided code (functions)
• They just need to obey the key-value pair interface
• Mappers:
• Consume <key, value> pairs
• Produce <key, value> pairs
• Reducers:
• Consume <key, <list of values>>
• Produce <key, value>
• Shuffling and sorting:
• Hidden phase between the mappers and reducers
• Groups all identical keys from all mappers, sorts them, and passes them to a certain reducer in the form <key, <list of values>>
Word Count Example
• Mapper:
• Input: value: lines of input text
• Output: key: word, value: 1
• Reducer:
• Input: key: word, value: set of counts
• Output: key: word, value: sum
• Launching program:
• Defines the job
• Submits the job to the cluster
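The word-count job above can be simulated end to end in plain Python. (Hadoop jobs are normally written in Java against the MapReduce API; this sketch only mirrors the map, shuffle/sort, and reduce phases.)

```python
# Plain-Python simulation of the word-count MapReduce dataflow.
from collections import defaultdict

def mapper(line):
    # Map phase: consume a line of text, produce (word, 1) pairs.
    for word in line.split():
        yield word.lower(), 1

def shuffle(pairs):
    # Shuffle/sort phase: group all values for the same key,
    # as the framework does between the map and reduce phases.
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups

def reducer(word, counts):
    # Reduce phase: consume (word, [1, 1, ...]), produce (word, sum).
    return word, sum(counts)

lines = ["the quick brown fox", "the lazy dog", "the fox"]
mapped = [kv for line in lines for kv in mapper(line)]
result = dict(reducer(k, v) for k, v in shuffle(mapped).items())
print(result["the"], result["fox"])  # 3 2
```

On a real cluster the mapped pairs never sit in one list: each Task Tracker runs mappers on its local blocks, and the shuffle moves grouped keys across the network to the reducers.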
Example - Word Count Dataflow
Example – Color Count
[Diagram: job – count the number of each color in a data set. Input blocks on HDFS feed four Map tasks; each parses and hashes its records, producing (k, v) pairs such as (color, 1). A shuffle-and-sort phase groups the pairs by key, and three Reduce tasks consume (k, [1, 1, 1, ...]) and produce (k', v') pairs such as (color, 100). The output file has three parts – Part0001, Part0002, Part0003 – probably on three different machines.]
How Many Maps and Reduces
• Maps:
• Usually as many as the number of HDFS blocks being processed; this is the default
• Otherwise the number of maps can be specified as a hint
• The number of maps can also be controlled by specifying the minimum split size
• The actual sizes of the map inputs are computed by:
• max(min(block_size, data_size / #maps), min_split_size)
• Reduces (unless the amount of data being processed is small):
• 0.95 * num_nodes * mapred.tasktracker.tasks.maximum
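These sizing rules are easy to work through numerically (units in MB; the value used for mapred.tasktracker.tasks.maximum is just an assumed example, as is every function name):

```python
# Worked example of the map/reduce sizing rules above.
import math

def default_num_maps(data_size_mb, block_size_mb=64):
    # Default: one map per HDFS block being processed.
    return math.ceil(data_size_mb / block_size_mb)

def map_input_size(block_size_mb, data_size_mb, n_maps, min_split_size_mb):
    # The slide's rule: max(min(block_size, data/#maps), min_split_size)
    return max(min(block_size_mb, data_size_mb / n_maps), min_split_size_mb)

def rule_of_thumb_reduces(num_nodes, tasks_maximum=2):
    # 0.95 * num_nodes * mapred.tasktracker.tasks.maximum
    # (tasks_maximum=2 is an assumed example value)
    return int(0.95 * num_nodes * tasks_maximum)

print(default_num_maps(1024))     # 16 maps for 1 GB of 64 MB blocks
print(rule_of_thumb_reduces(10))  # 19 reduces on a 10-node cluster
```

The 0.95 factor leaves a little headroom so all reduces can launch in one wave even if a node is slow or fails.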
Types of Nodes
• Name Node:
• Only one Name Node in a cluster
• Stores metadata for the data nodes
• Manages the file system namespace and metadata:
• Data does not go through the Name Node
• Data is not stored in the Name Node
• Single point of failure:
• Good idea to mirror the Name Node
• Do not use inexpensive commodity hardware
• Has large memory requirements:
• File system metadata is maintained in RAM to serve read requests
Types of Nodes
Data Node
• Many per Hadoop Cluster
• Manages the blocks with data and serves them to clients
• Blocks from different files can be stored on the same DataNode
• Periodically reports to the NameNode the list of blocks it stores
• Suitable for inexpensive commodity hardware – replication provided at software level
Types of Nodes
Job Tracker
• Manages the MapReduce jobs in the cluster
• One per Hadoop Cluster
• Receives job requests submitted by the client
• Schedules and monitors MapReduce jobs on TaskTrackers:
• Attempts to direct a task to the TaskTracker where the data resides
• Monitors any failing tasks that need to be rescheduled
Types of Nodes
TaskTracker
• Many per Hadoop Cluster
• Executes the MapReduce operations:
• Runs the MapReduce tasks in JVMs
• Has a set of slots used to run tasks
• Communicates with JobTracker
• Reads blocks from DataNodes
Hadoop 2.2 Architecture
Provides YARN
• Referred to as MapReduce V2
• Resource manager and scheduler are external to any framework
• DataNodes still exist
• JobTracker and TaskTrackers no longer exist
• Not required to run YARN with Hadoop 2.2:
• Still supports MapReduce V1
YARN
Two main ideas:
• Provide generic scheduling and resource management:
• Support more than just MapReduce
• Support more than just batch processing
• More efficient scheduling and workload management:
• No more balancing between map slots and reduce slots
Hadoop High Availability
[Diagram: three JournalNodes coordinate an active and a standby NameNode, which manage a set of five DataNodes.]