TRAINING OFFERING | DEV-303 HORTONWORKS DATA PLATFORM (HDP®) ANALYST: HBASE ESSENTIALS

2 DAYS This course is designed for big data analysts who want to use the HBase NoSQL database, which runs on top of HDFS, to provide real-time read/write access to sparse datasets. Topics include HBase architecture, services, installation, and schema design.

PREREQUISITES Students must have basic familiarity with data management systems. Familiarity with Hadoop or databases is helpful but not required. Students new to Hadoop are encouraged to attend the HDP Overview: Apache Hadoop Essentials course.

TARGET AUDIENCE Architects, software developers, and analysts responsible for implementing non-SQL databases in order to handle sparse datasets commonly found in big data use cases.

FORMAT 50% Lecture/Discussion 50% Hands-on Labs

AGENDA SUMMARY

Day 1: A Hadoop Primer and HBase Overview
Day 2: HBase Command Line Basics, HBase Installation and Configuration, and HBase Schema Design

About Hortonworks

Hortonworks is a leading innovator in creating, distributing and supporting enterprise-ready open data platforms. Our mission is to manage the world’s data. We have a single-minded focus on driving innovation in open source communities such as Apache Hadoop, NiFi, and Spark. Our open Connected Data Platforms power Modern Data Applications that deliver actionable intelligence from all data: data in-motion and data-at-rest. Along with our 1600+ partners, we provide the expertise, training and services that allow our customers to unlock the transformational value of data across any line of business. We are Powering the Future of Data™.

About Cerulium

Cerulium was founded for the purpose of providing BI consulting, data warehousing, big data, and education services within both the USA and internationally. Since its inception, the company has provided these services using only the most qualified and experienced people. Cerulium is committed to providing its services at the highest standards. This is accomplished by establishing strong, positive relationships with its clients and delivering exceptional value. We are proud to continue these standards as an Authorized Training Partner of Hortonworks.

For further information visit www.cerulium.com or www.hortonworks.com © 2011-2016 Hortonworks Inc. All Rights Reserved.

Privacy Policy | Terms of Service

DAY 1 OBJECTIVES

• Distinguish between Hadoop and the Hortonworks Data Platform
• Identify that Hadoop is Comprised of Multiple Apache Projects
• Describe How Hadoop Stores Files and Processes Data
• Describe the Hadoop Distributed File System (HDFS)
• Describe the Reason HBase was Created
• List HBase Features
• List the Components of the HBase Architecture
• Describe an HBase Table as a Set of Value Mappings
• Identify HBase as Either a Row- or Column-Oriented Database
• Describe the Features Available in HBase 1.0
• Describe a High-Level View of the Overall HBase Architectural Design
• Discuss HBase Region Design and Implementation
• List HBase Services
• Use HBase Data Operations
• Describe HBase High Availability (HA) Options
• List HBase Operational Commands
• Outline an HBase Query Operation

DAY 1 LABS

• Running a MapReduce Job
• Using HBase
• Importing Tables from MySQL Into HBase
• Working with Apache ZooKeeper
• Examining HBase Configuration Files

http://hortonworks.com/privacy-policy

http://hortonworks.com/terms-of-service

DAY 2 OBJECTIVES

• List the Shell Command Line Categories
• Use General Shell Commands
• Use Data Manipulation Shell Commands
• Use Surgery Tools
• Use Cluster Replication Tools
• Describe General Considerations for Installation and Configuration
• Describe HBase Configuration Requirements
• Describe Apache ZooKeeper Configuration Requirements
• Back Up HBase Tables and Metadata
• Identify How to Choose an Appropriate Rowkey
• Describe the HBase Data Model
• Discuss Sample Schema Use Cases
• Describe How to Optimize Block Size
• Describe How to Adjust Cache Size
• Use Bloom Filters
• Discuss General Optimization Methods

DAY 2 LABS AND DEMONSTRATIONS

• Using HBase Shell Commands
• Performing a Backup and Using Snapshot
• Exporting with Apache Pig and Importing with ImportTsv
• Setting Block Size and Enabling Bloom Filters
• Demonstration: Using Java Data Access Object

Revised 10/06/2017



TRAINING OFFERING | ADM-203 HORTONWORKS DATA PLATFORM (HDP®) OPERATIONS: APACHE HBASE ADVANCED MANAGEMENT

4 DAYS This course is designed for administrators who will be installing, configuring and managing HBase clusters. It covers installation with Ambari, configuration, security and troubleshooting HBase implementations. The course includes an end-of-course project in which students work together to design and implement an HBase schema.

PREREQUISITES Students must have basic familiarity with data management systems. Familiarity with Hadoop or databases is helpful but not required. Students new to Hadoop are encouraged to take the HDP Overview: Apache Hadoop Essentials course.

TARGET AUDIENCE Architects, software developers, and analysts responsible for implementing non-SQL databases in order to handle sparse datasets commonly found in big data use cases.

FORMAT 50% Lecture/Discussion 50% Hands-on Labs

AGENDA SUMMARY

Day 1: An Apache HBase Overview and Installing HBase
Day 2: Using the HBase Shell and Ingest/ImportTSV
Day 3: Managing HA Clusters and Log Files, Backup Recovery and Security
Day 4: Monitoring HBase, Maintenance, Troubleshooting and Class Project

DAY 1 OBJECTIVES

• Describe the Characteristics and Operation of HDFS
• Describe the Responsibilities of the NameNode and DataNode
• Describe the Purpose of YARN, including the:
  o ResourceManager
  o NodeManager
  o ApplicationMaster
• Describe the Primary Differences Between Hadoop 1.x and 2.x
• Describe the Function and Purpose of HBase
• List HBase Features and Components
• Describe an HBase Table as a Set of Key-Value Mappings
• Identify HBase as Either a Row- or Column-Oriented Database
• Describe HBase Operations
• List the Options for HBase Installation
• List the HBase Minimum System Requirements
• Describe the Process for Installing HBase Using Ambari
• Describe the Process for Confirming a Successful Installation

DAY 1 LABS

• Installing and Configuring HBase with Ambari
• Manually Installing an HBase Cluster



DAY 2 OBJECTIVES

• Work with Basic HBase Shell Commands
• List the Categories of Shell Commands Including:
  o General
  o Table Management
  o Data Manipulation
  o Surgery Tools
  o Cluster Replication Tools
  o Security Tools
• Work with Cluster Administration Commands
• Describe the Function and Purpose of the Regionserver
• Identify the Purpose of Key-Value Pairs
• Identify the Purpose of Row Keys
• Identify the Purpose of Column Families
• Describe How to Read and Write Data in HBase
• Describe the Flush Process
• Describe the Compaction Process
• Perform a Bulk Ingest Using ImportTSV
• Describe the Function and Purpose of a CopyTable

DAY 2 LABS

• Using HBase Shell Commands
• Ingesting Data with ImportTSV



DAY 3 OBJECTIVES

• List the Steps Required to Upgrade HBase
• Configure HBase for High Availability
• View Log Files
• Describe the Function and Purpose of HBase Coprocessors
• Describe the Function and Purpose of HBase Filters
• Describe the Process for Using Filters for Scans
• Describe the Process for Protecting HBase Data with Backups
• Describe the Function and Benefits of Snapshots in HBase
• Describe the Process for Performing Snapshots in HBase
• Describe the Process for HBase Replication
• Configure HBase Cluster Replication
• Describe the Purpose of HBase Authentication
• Describe the Purpose and Benefits of HBase Authorization Via ACLs
• Describe the Benefits of Ranger and Knox for HBase Security
• Describe the Process Used to Configure Simple Authentication
• Describe the Secure Bulk Load Process

DAY 3 LABS

• Enabling HBase High Availability
• Viewing Log Files
• Configuring and Enabling Snapshots
• Configuring Cluster Replication
• Enabling Authentication and Authorization in HBase Tables



DAY 4 OBJECTIVES

• List Important Metrics to Monitor for an HBase Cluster
• Monitor an HBase Cluster Using Ambari
• Describe the Benefits of OpenTSDB as a Tool for Monitoring
• Describe How to Identify a Region Hot Spot
• Design a Row-Key Schema to Avoid Hot Spotting
• Configure an HBase Table Using Pre-Splitting
• Describe the Region Splitting Process
• Describe the Function of the Load Balancer
• Define Region Sizing
• Describe the Process of Manual Splitting and Merging
• Describe the Process of Resolving Region Overlap Issues
• Use the ZooKeeper Command Line Tool to Check ZooKeeper Status and State
• Monitor JVM Garbage Collection Metrics on Regionservers
• Resolve Startup Errors for Masterserver and Regionservers
• Tune HBase for Better Performance
• Tune HDFS for Better HBase Performance

DAY 4 LABS

• Diagnosing and Resolving Hot Spotting
• Region Splitting
• Monitoring JVM Garbage Collection
• End of Course Lab Project – Designing an HBase Schema

Revised 10/06/2017



TRAINING OFFERING | SCI-201 HORTONWORKS DATA PLATFORM (HDP®) ANALYST: DATA SCIENCE

3 DAYS This course provides instruction on the processes and practice of data science, including machine learning and natural language processing. It covers tools and programming languages (Python, IPython, Mahout, Pig, NumPy, pandas, SciPy, Scikit-learn), the Natural Language Toolkit (NLTK), and Spark MLlib.

PREREQUISITES Students must have experience with at least one programming or scripting language, knowledge in statistics and/or mathematics, and a basic understanding of big data and Hadoop principles. Students new to Hadoop are encouraged to attend the HDP Overview: Apache Hadoop Essentials course.

TARGET AUDIENCE Architects, software developers, analysts and data scientists who need to apply data science and machine learning on Hadoop.


AGENDA SUMMARY

Day 1: Introduction to HDP, the Data Science Life Cycle and Pig
Day 2: Python Programming and Machine Learning Algorithms
Day 3: Python Programming (Continued), Natural Language Processing, and Spark

DAY 1 OBJECTIVES

• List Data Science Use Cases
• Define Data Science and What a Data Scientist Does
• List Reasons to Use Hadoop for Data Science
• Describe the Hadoop Distributed File System (HDFS)
• Describe Block Storage
• Describe the Function and Purpose of NameNodes and DataNodes
• List Common HDFS Commands
• Describe MapReduce, the Map Phase and the Reduce Phase
• Describe Hadoop Streaming and MapReduce
• Define HDFS Federation
• Explain How NameNode High Availability is Implemented
• Define YARN
• Define Apache Slider
• Describe Machine Learning and How Machines Learn
• List Examples of Machine Learning Tasks
• Describe Hadoop Machine Learning Capabilities
• Describe the Data Science Life Cycle Process Flow
• Describe Apache Pig
• Describe Pig Latin
• Define a Schema
• Use Common Pig Operators


DAY 1 LABS AND DEMONSTRATIONS

• Setting Up a Development Environment
• Using HDFS Commands
• Demonstration: Understanding Map Reduce
• Using Mahout for Machine Learning



DAY 2 OBJECTIVES

• List and Describe Python Programming Concepts
• Import Python Modules
• Develop Python Code
• List the Components of the Scientific Python Ecosystem:
  o NumPy
  o Pandas
  o SciPy Library
  o matplotlib
• List Options for Running Python on Hadoop
• Invoke Python Using Hadoop Streaming
• Invoke Python Using Pig User Defined Functions (UDFs)
• Invoke Python Using the Pig STREAM Command
• Describe Hadoop Machine Learning Tools
• Describe the Scikit-Learn Library
• Describe Machine Learning Algorithms:
  o Recommender Systems
  o Support Vector Machines
  o Naïve Bayes
  o Nearest Neighbor
• Deploy Python Machine Learning Algorithms on Hadoop




DAY 2 LABS AND DEMONSTRATIONS

• Getting Started with Apache Pig
• Exploring Data with Apache Pig
• Using the IPython Notebook
• Demonstration: Understanding the NumPy Package
• Demonstration: The Pandas Library
• Data Analysis with Python
• Interpolating Data Points
• Defining a Pig UDF in Python
• Streaming Python with Apache Pig
• Demonstration: Classification with Scikit-Learn
• Computing K-Nearest Neighbor
• Generating a K-Means Clustering



DAY 3 OBJECTIVES

• Define Natural Language Processing
• Describe Common NLP Tasks
• Use the Natural Language Toolkit
• Describe Apache Spark
• List Spark Components and their Functionalities
• Use Resilient Distributed Datasets
• Describe the Spark MLlib
• Perform Spark Operations
• Describe the Process to Implement a Data Science Production Deployment for:
  o A Recommender System Architecture
  o Data Product Design


DAY 3 LABS AND DEMONSTRATIONS

• Demonstration: POS Tagging Using a Decision Tree
• Using the Python Natural Language Toolkit
• Classifying Text Using Naïve Bayes
• Using Spark Transformations and Actions
• Using Spark MLlib
• Creating a Spam Classifier Using Spark MLlib

Revised 10/06/2017



TRAINING OFFERING | DEV-343 HORTONWORKS DATA PLATFORM (HDP®) DEVELOPER: ENTERPRISE APACHE SPARK 2.x

4 DAYS This course introduces the Apache Spark distributed computing engine and is suitable for developers, data analysts, architects, technical managers, and anyone who needs to use Spark in a hands-on manner. It is based on the Spark 2.x release. The course provides a solid technical introduction to the Spark architecture and how Spark works. It covers the basic building blocks of Spark (e.g. RDDs and the distributed compute engine), as well as higher-level constructs that provide a simpler and more capable interface. It includes in-depth coverage of Spark SQL, DataFrames, and DataSets, which are now the preferred programming API, and explores possible performance issues and strategies for optimization. The course also covers more advanced capabilities such as using Spark Streaming to process streaming data and integrating with the Kafka server.

PREREQUISITES Students should be familiar with programming principles and have previous experience in software development using Scala. Previous experience with data streaming, SQL, and HDP is also helpful, but not required.

TARGET AUDIENCE Software engineers who are looking to develop in-memory applications for time-sensitive and highly iterative applications in an Enterprise HDP environment.


AGENDA SUMMARY

Day 1: Scala Ramp Up, Introduction to Spark
Day 2: RDDs and Spark Architecture, Spark SQL, DataFrames and DataSets
Day 3: Shuffling, Transformations and Performance, Performance Tuning
Day 4: Creating Standalone Applications and Spark Streaming

DAY 1 OBJECTIVES

• Scala Introduction
• Working with:
  o Variables
  o Data Types
  o Control Flow
• The Scala Interpreter
• Collections and their Standard Methods (e.g. map())
• Working with:
  o Functions
  o Methods
  o Function Literals
• Define the Following as they Relate to Scala:
  o Class
  o Object
  o Case Class
• Overview, Motivations, Spark Systems
• Spark Ecosystem
• Spark vs. Hadoop
• Acquiring and Installing Spark
• The Spark Shell, SparkContext

DAY 1 LABS

• Setting Up the Lab Environment
• Starting the Scala Interpreter
• A First Look at Spark
• A First Look at the Spark Shell



DAY 2 OBJECTIVES

• RDD Concepts, Lifecycle, Lazy Evaluation
• RDD Partitioning and Transformations
• Working with RDDs Including:
  o Creating and Transforming (map, filter, etc.)
• An Overview of RDDs
• SparkSession, Loading/Saving Data, Data Formats (JSON, CSV, Parquet, text ...)
• Introducing DataFrames and DataSets (Creation and Schema Inference)
• Identify Supported Data Formats, Including:
  o JSON
  o Text
  o CSV
  o Parquet
• Working with the DataFrame (untyped) Query DSL, including:
  o Column
  o Filtering
  o Grouping
  o Aggregation
• SQL-based Queries
• Working with the DataSet (typed) API
• Mapping and Splitting (flatMap(), explode(), and split())
• DataSets vs. DataFrames vs. RDDs

DAY 2 LABS

• RDD Basics
• Operations on Multiple RDDs
• Data Formats
• Spark SQL Basics
• DataFrame Transformations
• The DataSet Typed API
• Splitting Up Data



DAY 3 OBJECTIVES

• Working with:
  o Grouping
  o Reducing
  o Joining
• Shuffling, Narrow vs. Wide Dependencies, and Performance Implications
• Exploring the Catalyst Query Optimizer (explain(), Query Plans, Issues with lambdas)
• The Tungsten Optimizer (Binary Format, Cache Awareness, Whole-Stage Code Gen)
• Discuss Caching, Including:
  o Concepts
  o Storage Type
  o Guidelines
• Minimizing Shuffling for Increased Performance
• Using Broadcast Variables and Accumulators
• General Performance Guidelines
  o Using the Spark UI
  o Efficient Transformations
  o Data Storage
  o Monitoring

DAY 3 LABS

• Exploring Group Shuffling
• Seeing Catalyst at Work
• Seeing Tungsten at Work
• Working with Caching, Joins, Shuffles, Broadcasts, Accumulators
• Broadcast General Guidelines



DAY 4 OBJECTIVES

• Core API, SparkSession.Builder
• Configuring and Creating a SparkSession
• Building and Running Applications - sbt/build.sbt and spark-submit
• Application Lifecycle (Driver, Executors, and Tasks)
• Cluster Managers (Standalone, YARN, Mesos)
• Logging and Debugging
• Introduction and Streaming Basics
• Spark Streaming (Spark 1.0+)
  o DStreams, Receivers, Batching
  o Stateless Transformation
  o Windowed Transformation
  o Stateful Transformation
• Structured Streaming (Spark 2+)
  o Continuous Applications
  o Table Paradigm, Result Table
  o Steps for Structured Streaming
  o Sources and Sinks
• Consuming Kafka Data
  o Kafka Overview
  o Structured Streaming - "kafka" Format
  o Processing the Stream

DAY 4 LABS

• Spark Job Submission
• Additional Spark Capabilities
• Spark Streaming
• Spark Structured Streaming
• Spark Structured Streaming with Kafka

Revised 10/18/2017



TRAINING OFFERING | ADM-221 HORTONWORKS DATA PLATFORM (HDP®) OPERATIONS: ADMINISTRATION FOUNDATION

4 DAYS This course is intended for systems administrators who will be responsible for the design, installation, configuration, and management of the Hortonworks Data Platform (HDP). The course provides in-depth knowledge and experience in using Apache Ambari as the operational management platform for HDP. This course presumes no prior knowledge or experience with Hadoop.

PREREQUISITES Students must have experience working in a Linux environment with standard Linux system commands. Students should be able to read and execute basic Linux shell scripts. Basic knowledge of SQL statements is recommended, but not a requirement. In addition, it is recommended for students to have some operational experience in data center practices, such as change management, release management, incident management, and problem management.

TARGET AUDIENCE Linux administrators and system operators responsible for installing, configuring and managing an HDP cluster.

FORMAT 50% Lecture 50% Hands-on Labs

AGENDA SUMMARY

Day 1: Introduction to Big Data, Hadoop and the Hortonworks Data Platform
Day 2: Managing HDFS Storage, Rack Awareness, HDFS Snapshots and HDFS Centralized Cache
Day 3: Introduction to YARN
Day 4: High Availability with HDP, Deploying HDP with Blueprints, and the HDP Upgrade Process

DAY 1 OBJECTIVES

• Describe Apache Hadoop
• Summarize the Purpose of the Hortonworks Data Platform Software Frameworks
• List Hadoop Cluster Management Choices
• Describe Apache Ambari
• Identify Hadoop Cluster Deployment Options
• Plan for a Hadoop Cluster Deployment
• Perform an Interactive HDP Installation Using Apache Ambari
• Install Apache Ambari
• Describe the Differences Between Hadoop Users, Hadoop Service Owners, and Apache Ambari Users
• Manage Users, Groups and Permissions
• Identify Hadoop Configuration Files
• Summarize Operations of the Web UI Tool
• Manage Hadoop Service Configuration Properties Using the Apache Ambari Web UI
• Describe the Hadoop Distributed File System (HDFS)
• Perform HDFS Shell Operations
• Use WebHDFS
• Protect Data Using HDFS Access Control Lists (ACLs)

DAY 1 LABS

• Setting Up the Environment
• Installing HDP
• Managing Ambari Users and Groups
• Managing Hadoop Services
• Using HDFS Storage
• Using WebHDFS
• Using HDFS Access Control Lists



DAY 2 OBJECTIVES

• Describe HDFS Architecture and Operation
• Manage HDFS Using Ambari Web, NameNode and DataNode UIs
• Manage HDFS Using Command-line Tools
• Summarize the Purpose and Benefits of Rack Awareness
• Configure Rack Awareness
• Summarize Hadoop Backup Considerations
• Enable and Manage HDFS Snapshots
• Copy Data Using DistCP
• Use Snapshots and DistCP Together
• Identify the Purpose and Operation of Heterogeneous HDFS Storage
• Summarize the Purpose and Operation of HDFS Centralized Caching
• Configure HDFS Centralized Cache
• Define and Manage Cache Pools and Cache Directives
• Identify HDFS NFS Gateway Use Cases
• Recall HDFS NFS Gateway Architecture and Operation
• Install and Configure an HDFS NFS Gateway
• Configure an HDFS NFS Gateway Client

DAY 2 LABS

• Managing HDFS Storage
• Managing HDFS Quotas
• Configuring Rack Awareness
• Managing HDFS Snapshots
• Using DistCP
• Configuring HDFS Storage Policies
• Configuring HDFS Centralized Cache
• Configuring an NFS Gateway



DAY 3 OBJECTIVES

• Describe YARN Resource Management
• Summarize YARN Architecture and Operation
• Identify and Use YARN Management Options
• Summarize YARN Response to Component Failure
• Understand the Basics of Running Simple YARN Applications
• Summarize the Purpose and Operation of the YARN Capacity Scheduler
• Configure and Manage YARN Queues
• Control Access to YARN Queues
• Summarize the Purpose and Operation of YARN Node Labels
• Describe the Process Used to Create Node Labels
• Describe the Process Used to Add, Modify and Remove Node Labels
• Configure Queues to Access Node Label Resources
• Run Test Jobs to Confirm Node Label Behavior

DAY 3 LABS

• Managing YARN Using Ambari
• Managing YARN Using CLI
• Running Sample YARN Applications
• Setting Up for Capacity Scheduler
• Managing YARN Containers and Queues
• Managing YARN ACLs and User Limits
• Working with YARN Node Labels



DAY 4 OBJECTIVES

• Summarize the Purpose of NameNode HA
• Configure NameNode HA Using Ambari
• Summarize the Purpose of ResourceManager HA
• Configure ResourceManager HA Using Apache Ambari
• Identify Reasons to Add, Replace and Delete Worker Nodes
• Demonstrate How to Add a Worker Node
• Configure and Run the HDFS Balancer
• Decommission and Re-commission a Worker Node
• Describe the Process of Moving a Master Component
• Summarize the Purpose and Operation of Apache Ambari Metrics
• Describe the Features and Benefits of the Apache Ambari Dashboard
• Summarize the Purpose and Benefits of Apache Ambari Blueprints
• Recall the Process Used to Deploy a Cluster Using Ambari Blueprints
• Recall the Definition of an HDP Stack and Interpret its Version Number
• View the Current Stack and Identify Compatible Apache Ambari Software Versions
• Recall the Types of Methods and Upgrades Available in HDP
• Describe the Upgrade Process, Restrictions and Pre-upgrade Checklist
• Perform an Upgrade Using the Apache Ambari Web UI

DAY 4 LABS

• Configuring NameNode HA
• Configuring ResourceManager HA
• Adding, Decommissioning and Re-commissioning a Worker Node
• Configuring Ambari Alerts
• Deploying an HDP Cluster Using Ambari Blueprints
• Performing an HDP Upgrade – Express

Revised 10/04/2017


