Post on 06-May-2015
An Introduction to Cloudera’s Administrator Training for Apache Hadoop
Ian Wrigley
Sr. Curriculum Manager
ian@cloudera.com
© Copyright 2010-2013 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
Topics
Why Take Cloudera Training?
Administrator Course Contents
A Deeper Dive: An overview of HDFS High Availability
A Deeper Dive: Some of Hadoop’s advanced configuration options
Question time
Why Cloudera Training?
1 Broadest Range of Courses
  Developer, Admin, Analyst, HBase, Data Science
2 Most Experienced Instructors
  Over 15,000 students trained since 2009
3 Leader in Certification
  Over 5,000 accredited Cloudera professionals
4 State of the Art Curriculum
  Classes updated regularly as Hadoop evolves
5 Widest Geographic Coverage
  Most classes offered: 50 cities worldwide plus online
6 Most Relevant Platform & Community
  CDH deployed more than all other distributions combined
7 Depth of Training Material
  Hands-on exercises and VMs support live instruction
8 Ongoing Learning
  Video tutorials and e-learning complement training
Learning Path: System Administrators
Data Analyst Training
  Vertically integrate basic analytics into data management
  Transform and manipulate data to drive high-value utilization
HBase Training
  Implement massively distributed, columnar storage at scale
  Enable random, real-time read/write access to all data
Enterprise Training
  Use Cloudera Manager to speed deployment and scale the cluster
  Learn which tools and techniques improve cluster performance
Administrator Training
  Configure, install, and monitor clusters for optimal performance
  Implement security measures and multi-user functionality
Administrator Course Objectives
During the Administrator course, you learn:
The core technologies of Hadoop
How to populate HDFS from external sources
How to plan your Hadoop cluster hardware and software
How to deploy a Hadoop cluster
What issues to consider when installing Pig, Hive, and Impala
What issues to consider when deploying Hadoop clients
How Cloudera Manager can simplify Hadoop administration
How to configure HDFS for high availability
What issues to consider when implementing Hadoop security
How to schedule jobs on the cluster
How to maintain your cluster
How to monitor, troubleshoot, and optimize the cluster
Hands-On Exercises
The course features many Hands-On Exercises, including:
  – Deploying Hadoop in pseudo-distributed mode
  – Deploying a complete, multi-node Hadoop cluster
  – Importing data into HDFS using Sqoop and Flume
  – Installing Hive and Impala
  – Using Hue to control user access
  – Configuring HDFS High Availability
  – Configuring the FairScheduler
  – Troubleshooting problems on the cluster
  – … and more
Course Chapters
Course Introduction
  – Introduction
Introduction to Apache Hadoop
  – The Case for Apache Hadoop
  – HDFS
  – Getting Data Into HDFS
  – MapReduce
Planning, Installing, and Configuring a Hadoop Cluster
  – Planning Your Hadoop Cluster
  – Hadoop Installation and Initial Configuration
  – Installing and Configuring Hive, Impala, and Pig
  – Hadoop Clients
  – Cloudera Manager
  – Advanced Cluster Configuration
  – Hadoop Security
Cluster Operations and Maintenance
  – Managing and Scheduling Jobs
  – Cluster Maintenance
  – Cluster Monitoring and Troubleshooting
Course Conclusion and Appendices
  – Conclusion
  – Kerberos Configuration
  – Configuring HDFS Federation
HDFS High Availability Overview
A single NameNode is a single point of failure (SPOF)
Two ways a NameNode can result in HDFS downtime
  – Unexpected NameNode crash (rare)
  – Planned maintenance of NameNode (more common)
HDFS High Availability (HA) eliminates this SPOF
  – Available in CDH4 (or related Apache Hadoop 0.23.x and 2.x)
HDFS High Availability Architecture
HDFS High Availability uses a pair of NameNodes
  – One Active and one Standby
  – Clients only contact the Active NameNode
  – DataNodes heartbeat in to both NameNodes
  – Active NameNode writes its metadata to a quorum of JournalNodes
  – Standby NameNode reads from the JournalNodes to remain in sync with the Active NameNode
HDFS High Availability Architecture (cont’d)
Active NameNode writes edits to the JournalNodes
  – Software to do this is the Quorum Journal Manager (QJM)
  – Built in to the NameNode
  – Waits for a success acknowledgment from the majority of JournalNodes
  – Majority commit means a single crashed or lagging JournalNode will not impact NameNode latency
  – Uses the Paxos algorithm to ensure reliability even if edits are being written as a JournalNode fails
Note that there is no Secondary NameNode when implementing HDFS High Availability
  – The Standby NameNode periodically performs checkpointing
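The majority-commit rule on this slide can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the Quorum Journal Manager's actual code: `JournalNode`, `write_edit`, and `quorum_write` are hypothetical names invented for the example, writes are sequential rather than parallel RPCs, and the Paxos-based recovery is left out entirely.

```python
from dataclasses import dataclass, field


@dataclass
class JournalNode:
    """Hypothetical stand-in for a JournalNode; not a Hadoop API."""
    name: str
    healthy: bool = True
    edits: list = field(default_factory=list)

    def write_edit(self, txid: int, edit: str) -> bool:
        # A crashed or unreachable node never acknowledges the write.
        if not self.healthy:
            return False
        self.edits.append((txid, edit))
        return True


def quorum_write(nodes: list, txid: int, edit: str) -> bool:
    """Return True if a strict majority of nodes acknowledged the edit."""
    acks = sum(1 for node in nodes if node.write_edit(txid, edit))
    majority = len(nodes) // 2 + 1
    return acks >= majority


if __name__ == "__main__":
    nodes = [JournalNode("jn1"), JournalNode("jn2"), JournalNode("jn3", healthy=False)]
    print(quorum_write(nodes, 1, "mkdir /data"))  # True: 2 of 3 acknowledged
    nodes[1].healthy = False
    print(quorum_write(nodes, 2, "mkdir /logs"))  # False: only 1 of 3 acknowledged
```

With three JournalNodes, one failure still leaves a majority, so the write succeeds; a second failure does not.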
Failover
Only one NameNode may be active at any given time
  – The other is in standby mode
The standby maintains a copy of the active NameNode’s state
  – So it can take over when the active NameNode goes down
Two types of failover
  – Manual (detected and initiated by a user)
  – Automatic (detected and initiated by HDFS itself)
Automatic Failover
Automatic failover is based on Apache ZooKeeper
  – A coordination service also used by HBase
  – An open source Apache project
  – One of the components in CDH
A daemon called the ZooKeeper Failover Controller (ZKFC) runs on each NameNode machine
ZooKeeper needs a quorum of nodes
  – Typical installations use three or five nodes
  – Low resource usage
  – Can install alongside existing master daemons
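The quorum requirement suggests why three or five nodes are the typical choice. The sketch below is plain arithmetic rather than a ZooKeeper API (the function name `tolerated_failures` is made up for illustration), and it assumes the ensemble stays available only while a strict majority of its nodes is up.

```python
def tolerated_failures(ensemble_size: int) -> int:
    """Number of nodes that can fail while a strict majority remains."""
    if ensemble_size < 1:
        raise ValueError("ensemble_size must be at least 1")
    return (ensemble_size - 1) // 2


if __name__ == "__main__":
    for size in (3, 4, 5):
        print(size, "nodes tolerate", tolerated_failures(size), "failure(s)")
    # 3 nodes tolerate 1 failure(s)
    # 4 nodes tolerate 1 failure(s)
    # 5 nodes tolerate 2 failure(s)
```

Under this assumption a fourth node adds no fault tolerance over three, which is one reason odd ensemble sizes are common.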
HDFS HA With Automatic Failover – Deployment
hdfs-site.xml
dfs.namenode.handler.count
The number of threads the NameNode uses to handle RPC requests from DataNodes. Default: 10. Recommended: ln(number of cluster nodes) * 20. Symptoms of this being set too low: ‘connection refused’ messages in DataNode logs as they try to transmit block reports to the NameNode. Used by the NameNode.
dfs.datanode.failed.volumes.tolerated
The number of volumes allowed to fail before the DataNode takes itself offline, ultimately resulting in all of its blocks being re-replicated. Default: 0, but often increased on machines with several disks. Used by DataNodes.
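The rule of thumb for dfs.namenode.handler.count is easy to evaluate for a given cluster size. The helper below is a hypothetical sketch: the function name is invented, and both the rounding to the nearest integer and the floor at the default of 10 are assumptions, since the slide specifies neither.

```python
import math


def recommended_handler_count(cluster_nodes: int) -> int:
    """Apply the slide's rule of thumb: ln(number of cluster nodes) * 20."""
    if cluster_nodes < 1:
        raise ValueError("cluster_nodes must be at least 1")
    # Assumption: round to the nearest integer and never go below the default of 10.
    return max(10, round(math.log(cluster_nodes) * 20))


if __name__ == "__main__":
    for nodes in (50, 200):
        print(nodes, "nodes ->", recommended_handler_count(nodes), "handler threads")
    # 50 nodes -> 78 handler threads
    # 200 nodes -> 106 handler threads
```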
core-site.xml
fs.trash.interval
When a file is deleted, it is placed in a .Trash directory in the user’s home directory, rather than being immediately deleted. It is purged from HDFS after the number of minutes specified. Default: 0 (disabled). Recommended: 1440 (one day). Used by clients and the NameNode.
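As an illustration of how such a setting appears in a Hadoop configuration file, the sketch below renders the recommended value as a `<property>` entry inside a `<configuration>` element. The helper names `hadoop_property` and `build_configuration` are hypothetical, and a real core-site.xml would normally contain other properties as well.

```python
import xml.etree.ElementTree as ET


def hadoop_property(name: str, value: str) -> ET.Element:
    """Build one <property> element with <name> and <value> children."""
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = name
    ET.SubElement(prop, "value").text = value
    return prop


def build_configuration(settings: dict) -> str:
    """Render a dict of settings as a <configuration> document string."""
    root = ET.Element("configuration")
    for name, value in settings.items():
        root.append(hadoop_property(name, str(value)))
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    one_day_in_minutes = 24 * 60  # 1440, the value recommended on the slide
    print(build_configuration({"fs.trash.interval": one_day_in_minutes}))
```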
mapred-site.xml
mapred.job.tracker.handler.count
Number of threads used by the JobTracker to respond to heartbeats from the TaskTrackers. Default: 10. Recommendation: ln(number of cluster nodes) * 20. Used by the JobTracker.
mapred.reduce.parallel.copies
Number of TaskTrackers a Reducer can connect to in parallel to transfer its data. Default: 5. Recommendation: ln(number of cluster nodes) * 4 with a floor of 10. Used by TaskTrackers.
tasktracker.http.threads
The number of HTTP threads in the TaskTracker which the Reducers use to retrieve data. Default: 40. Recommendation: 80. Used by TaskTrackers.
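The recommendation for mapred.reduce.parallel.copies combines a logarithmic formula with a floor, which matters on small clusters. The following is a minimal sketch; the function name is hypothetical and rounding to the nearest integer is an assumption not stated on the slide.

```python
import math


def recommended_parallel_copies(cluster_nodes: int) -> int:
    """Slide's rule of thumb: ln(number of cluster nodes) * 4, floor of 10."""
    if cluster_nodes < 1:
        raise ValueError("cluster_nodes must be at least 1")
    # Assumption: round the formula's result before applying the floor.
    return max(10, round(math.log(cluster_nodes) * 4))


if __name__ == "__main__":
    for nodes in (10, 100, 1000):
        print(nodes, "nodes ->", recommended_parallel_copies(nodes))
    # 10 nodes -> 10   (the formula gives about 9, so the floor applies)
    # 100 nodes -> 18
    # 1000 nodes -> 28
```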
mapred-site.xml (cont’d)
mapred.reduce.slowstart.completed.maps
The percentage of Map tasks which must be completed before the JobTracker will schedule Reducers on the cluster. Default: 0.05 (5 percent). Recommendation: 0.8 (80 percent). Used by the JobTracker.
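To make the effect of the slow-start setting concrete, the sketch below estimates how many Map tasks must finish before Reducers can be scheduled. The function name is hypothetical, and rounding the threshold up to a whole task is an assumption made for this example rather than something the slide states.

```python
import math


def maps_required_before_reducers(total_maps: int, slowstart: float) -> int:
    """Estimate the Map tasks that must complete before Reducers are scheduled."""
    if total_maps < 0:
        raise ValueError("total_maps must not be negative")
    if not 0.0 <= slowstart <= 1.0:
        raise ValueError("slowstart must be between 0.0 and 1.0")
    # Assumption: a fractional task count is rounded up to the next whole task.
    return math.ceil(total_maps * slowstart)


if __name__ == "__main__":
    total = 1000
    print(maps_required_before_reducers(total, 0.05))  # default: 50 of 1000
    print(maps_required_before_reducers(total, 0.8))   # recommended: 800 of 1000
```

With the default, Reducers may occupy slots while most Map tasks are still pending; the higher value delays them until most Map output exists.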
• Submit questions in the Q&A panel
• Watch on-demand video of this webinar and many more at http://cloudera.com
• Follow Ian on Twitter @iwrigley
• Follow Cloudera University @ClouderaU
• Learn more at Strata + Hadoop World: http://tinyurl.com/hadoopworld
• Thank you for attending!
Register now for Cloudera training at http://university.cloudera.com
Use discount code Admin_10 to save 10% on new enrollments in Administrator Training classes delivered by Cloudera until December 1, 2013*
Use discount code 15off2 to save 15% on enrollments in two or more training classes delivered by Cloudera until December 1, 2013*
* Excludes classes sold or delivered by Cloudera partners