10/10/2018
Big Data with Cloudera, Talend, Tableau | Santanu B
QUICKITDOTNET, KHARADI, PUNE
HADOOP AND ITS ECOSYSTEM
HADOOP
With
Cloudera, Talend (ETL tool), Tableau (Data
Visualization), Informatica, Teradata
About the trainer: Santanu has over 20 years of experience in the industry. He
has worked with HCL Tech, Amdocs, and Cognizant, where he stayed until Apr '17 as
a Senior Manager (Emp Id: 274292) in the Data Warehouse team. He has worked in the
DWBI domain with the Informatica, Teradata, Hadoop, Talend, and Tableau tools, and has
long experience with big clients such as AT&T, Novertise, AstraZeneca, CS,
etc. He was attached to the recruitment teams at Amdocs and Cognizant for a long time,
which will definitely help students with interview preparation!
(Santanu Bhattacharjee:9823569371)
Java
(Basic understanding
before starting Big Data)
Basic Concepts, Class, Objects, Methods, Loops,
Decision, Arrays, Variables, OOP concepts
(Abstraction, Polymorphism, Encapsulation,
etc.), Command line input, Exception
handling, etc.
SQL
(Basic understanding
before starting Big Data)
Introduction to RDBMS concept and
architecture, Introduction to SQL, DDL, DML,
SELECT STATEMENT - Where Clause, Order
By / Distinct Clause, SQL Function - Scalar,
Aggregate Functions, Group By/Having
Clause, Self Join, Inner Join, Outer Join, LEFT
JOIN, RIGHT JOIN, FULL JOIN, Union, Sub
Queries, Views, Indexes.
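As a quick hands-on illustration of joins with GROUP BY/HAVING (the table names and data below are made up purely for practice), an in-memory SQLite database is enough to try these statements:

```python
import sqlite3

# In-memory database with two illustrative tables (hypothetical data)
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dept (dept_id INTEGER PRIMARY KEY, dept_name TEXT);
CREATE TABLE emp  (emp_id INTEGER PRIMARY KEY, name TEXT, salary INTEGER, dept_id INTEGER);
INSERT INTO dept VALUES (1, 'DWBI'), (2, 'Testing');
INSERT INTO emp  VALUES (10, 'Asha', 900, 1), (11, 'Ravi', 700, 1), (12, 'Meena', 500, 2);
""")

# Inner join + aggregate + HAVING: departments whose average salary exceeds 600
rows = con.execute("""
SELECT d.dept_name, AVG(e.salary) AS avg_sal
FROM emp e
INNER JOIN dept d ON e.dept_id = d.dept_id
GROUP BY d.dept_name
HAVING AVG(e.salary) > 600
ORDER BY d.dept_name
""").fetchall()
print(rows)  # [('DWBI', 800.0)]
```

Note the difference this exercises: WHERE filters individual rows before aggregation, while HAVING filters whole groups after the aggregate is computed.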
Linux
(Basic understanding
before starting Big Data)
Overview of the Linux OS and its importance,
various commands, vim editor, shell scripts –
arithmetic operators, File Test Operators,
Command line parameter, Conditions, Loops,
Executing Scripts.
Introduction to Big
Data
Overview, history, and today's challenges
which can be handled by Big Data.
Goals of HDFS system.
Hadoop ecosystem, its different components,
and their usages.
Architecture of Hadoop system.
MapReduce and how it Works.
Discussion on different installation modes
(Standalone, Pseudo, Fully Distributed).
Installation of Hadoop (in Ubuntu 14) in
Pseudo-Distributed mode.
Discussion on the important daemons running in the
background of the Hadoop system.
Discussion on different important HDFS
commands.
Hadoop Eco-system
What is Hadoop?
Hadoop's Key Characteristics
Hadoop Eco-system & Core Components
Where Hadoop Fits?
Traditional vs. Hadoop’s Data Analytics
Architecture
When to Use & Not Use Hadoop?
Apache Hadoop & Distributions
Hadoop Job Trends
HDFS Architecture
Introduction to Hadoop Distributed File
System
HDFS Architecture and Features
Files and Data Blocks
Anatomy of a File Read/ Write on HDFS
Replication & Rack Awareness
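With the default replication factor of 3, HDFS places the first replica on the writer's node, the second on a node in a different rack, and the third on a different node in that same remote rack. A simplified sketch of that decision (node and rack names here are invented for illustration, and real HDFS picks nodes with more care):

```python
def place_replicas(writer_node, topology):
    """Simplified HDFS default placement for replication factor 3.
    topology: dict mapping rack name -> list of node names."""
    # Replica 1: the node the client is writing from (no network hop)
    first = writer_node
    local_rack = next(r for r, nodes in topology.items() if writer_node in nodes)
    # Replica 2: a node on a different rack (survives a whole-rack failure)
    remote_rack = next(r for r in topology if r != local_rack)
    second = topology[remote_rack][0]
    # Replica 3: a different node on that same remote rack (cheap intra-rack copy)
    third = next(n for n in topology[remote_rack] if n != second)
    return [first, second, third]

topo = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]}
print(place_replicas("n1", topo))  # ['n1', 'n3', 'n4']
```

The trade-off this policy encodes: one off-rack copy gives fault tolerance, while keeping two copies on one rack limits cross-rack write traffic.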
Hadoop Setup
Hadoop Deployment Modes
Setting up a Pseudo-distributed Cluster
Cloudera Sandbox Installation & Configuration
Linux Terminal Commands
Configuration Parameters and Values
MapReduce Basics
What is MapReduce?
MapReduce Framework, Architecture and Use
Cases
Input Splits
Hands on with MapReduce Programming
Packaging MapReduce Jobs in a JAR
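Before writing Java MapReduce jobs, the map → shuffle/sort → reduce flow itself can be sketched in plain Python. This is a toy model of the framework's phases (word count), not Hadoop's actual API:

```python
from collections import defaultdict

def map_phase(lines):
    # Mapper: emit a (word, 1) pair for every word in every input line
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle/sort: group all values by key, as the framework
    # does between the map and reduce phases
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reducer: sum the counts for each word
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["hadoop stores data", "hadoop processes data"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts)  # {'hadoop': 2, 'stores': 1, 'data': 2, 'processes': 1}
```

In real Hadoop each phase runs distributed across the cluster and the shuffle moves data over the network; the logic per record, however, is exactly this shape.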
Using Pig
Background
Pig Architecture
Understanding and installation of ‘Pig’
Pig Latin Basics
Pig Execution Modes
Pig Processing – Loading and Transforming
Data
Pig Built-in Functions
Filtering, Grouping, Sorting Data
Relational Join Operators
Pig User Defined Functions
Sample exercise on ‘Pig’ with data visualization
tool ‘Zeppelin’.
Create Talend jobs to execute Pig tasks.
Web Log Report Analytics by ‘Pig’.
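The web-log exercise boils down to GROUP BY and SUM/COUNT over parsed log records. A rough Python equivalent of what the Pig script computes (the pre-parsed field layout below is assumed purely for illustration; real logs need a LOAD and parse step first):

```python
from collections import Counter

# Hypothetical pre-parsed log records: (ip, url, bytes)
records = [
    ("10.0.0.1", "/home",  300),
    ("10.0.0.2", "/home",  200),
    ("10.0.0.1", "/about", 150),
    ("10.0.0.1", "/home",  100),
]

# Total bytes consumed by each URL (Pig: GROUP logs BY url; SUM(bytes))
bytes_per_url = Counter()
for ip, url, nbytes in records:
    bytes_per_url[url] += nbytes

# Most visited URLs (Pig: GROUP BY url; COUNT(*); ORDER ... DESC)
visits = Counter(url for _, url, _ in records)

# Rank of the IPs based on total visits
ip_rank = [ip for ip, _ in Counter(ip for ip, _, _ in records).most_common()]

print(dict(bytes_per_url))    # {'/home': 600, '/about': 150}
print(visits.most_common(1))  # [('/home', 3)]
print(ip_rank)                # ['10.0.0.1', '10.0.0.2']
```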
Using Hive
Background of Hive
Hive Architecture
Warehouse Directory & Metastore
Data Processing – Loading Data into Tables
Using Hive Built-in Functions, UDF
Using Joins in Hive
Partitioning Data using Hive - Static &
Dynamic
Bucketing in Hive
ETL by Talend and visualization by Tableau
Application: Store Data
Analytics
Case Study: Store data analytics and Reporting using
Hive & Zeppelin (Data Visualization)
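Hive's bucketing assigns each row to one of N buckets by hashing the bucketing column. The principle (not Hive's exact hash function — CRC32 stands in for it here) can be sketched as:

```python
import zlib

NUM_BUCKETS = 4

def bucket_for(key: str) -> int:
    # Hive-style idea: bucket = hash(column value) mod number_of_buckets.
    # zlib.crc32 is a stand-in for Hive's own hash; the principle is the same.
    return zlib.crc32(key.encode("utf-8")) % NUM_BUCKETS

# Rows with equal keys always land in the same bucket, so a bucketed
# join on this column only has to compare matching bucket pairs.
customers = ["C001", "C002", "C003", "C001"]
buckets = [bucket_for(c) for c in customers]
print(buckets)
assert buckets[0] == buckets[3]  # same key -> same bucket
```

This is why bucketing (unlike partitioning, which creates one directory per value) keeps the file count fixed at N regardless of how many distinct keys arrive.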
Working with HBase
HBase Overview
HBase Data Model
Row-Oriented vs. Column-Oriented
Storage
HBase Architecture
HBase Shell Commands
Bulk Load Data into HBase
Loading data by Talend into HBASE
Impala
Overview and Environment
Impala Architecture
Database creation, deletion
Table- and View-specific statements like Create,
Insert, Describe, Alter, Drop, etc.
Impala clauses – Order By, Group By, Having,
Limit, Offset, Union, Distinct.
ETL by Talend and visualization by Tableau
Sqoop
Why Sqoop?
Setup MySQL RDBMS & Sqoop
Sqoop Connectors, Commands
Sqoop Options File
Importing Data – to HDFS & Hive
Exporting Data to MySQL
Data Ingestion using Flume
Flume Architecture
Ingesting Weblog Data into HDFS using
Flume
ZooKeeper, Kafka & Flume
Overview on ZooKeeper and how it helps a
cluster for coordination activity.
Installation of ZooKeeper in the Ubuntu VM.
Discussion on Kafka with real life scenario.
Discussion on Point-to-Point and Publish-
Subscribe messaging systems.
Discussion on Kafka's architecture: Producer,
Consumer, Topic category, Broker.
Installation of Kafka in the Ubuntu VM.
Discussion on the architecture of Flume.
The Flume Agent and its components.
Various types of Sources, Channels, and Sinks
supported by Flume.
Discussion on how to configure a Flume Agent
in the configuration file.
Run a Consumer (Subscriber) and publish a
Topic from the Producer!
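The publish-subscribe flow covered in this module — producers append messages to a topic and the broker fans them out to every subscribed consumer — can be modelled in a few lines. This is an in-memory toy to make the idea concrete, nothing like Kafka's real protocol or delivery guarantees:

```python
from collections import defaultdict

class ToyBroker:
    """In-memory stand-in for a broker: a topic fans out to all subscribers."""
    def __init__(self):
        self.subscribers = defaultdict(list)  # topic -> list of consumer inboxes

    def subscribe(self, topic):
        inbox = []  # each consumer gets its own inbox
        self.subscribers[topic].append(inbox)
        return inbox

    def publish(self, topic, message):
        # Publish-Subscribe: every subscriber of the topic receives every message
        # (in Point-to-Point messaging, only one consumer would get it)
        for inbox in self.subscribers[topic]:
            inbox.append(message)

broker = ToyBroker()
c1 = broker.subscribe("weblogs")
c2 = broker.subscribe("weblogs")
broker.publish("weblogs", "GET /home 200")
print(c1, c2)  # both consumers see the same message
```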
MongoDB:
MongoDB
Concept of the NoSQL database ‘MongoDB’ and where to use it.
Installation of ‘MongoDB’ in the VM. Mapping ‘SQL’ to ‘MongoDB’ queries – sample
examples. Create, Insert, Update, and Delete operations. Complex aggregations; equal, less-than, greater-than
operators. Join, Group By, Having, etc. Case Study on ‘MongoDB’.
Spark and Scala
Overview of Apache Spark and its importance.
Resilient Distributed Datasets (RDDs)
Components of Spark
Configuration to access tables in MySQL, Hive.
Spark SQL – DataFrames, SQLContext,
HiveContext.
Loading data into HDFS, Hive, MySQL.
Create a report in Zeppelin.
Overview of Scala and its importance.
Compile a Scala program in Spark and
execute it.
Scala Data Types, Variables, Access Modifiers,
Operators, Logical statements, Loops,
Functions, Closures, Collections, Classes &
Objects, Exception Handling.
Singleton object.
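The RDD model — transformations chained on an immutable dataset, with an action triggering the actual computation — can be approximated locally. A toy sketch in plain Python (PySpark's real API differs; this only mirrors the shape of map/filter/reduce, and the data is invented):

```python
from functools import reduce

# A stand-in for an RDD: a list of raw records (hypothetical figures)
data = [3, 8, 1, 9, 4]

# Transformations — lazy in Spark, lazy iterators here too
doubled  = map(lambda x: x * 2, data)        # like rdd.map(...)
filtered = filter(lambda x: x > 5, doubled)  # like rdd.filter(...)

# Action: reduce materialises a single result, like rdd.reduce(...)
total = reduce(lambda a, b: a + b, filtered)
print(total)  # 6 + 16 + 18 + 8 = 48
```

Note that `doubled` and `filtered` compute nothing until `reduce` consumes them — the same laziness that lets Spark plan a whole chain of transformations before any work runs on the cluster.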
Using Oozie
Overview, Features and Challenges of Oozie.
Setting up Database & Oozie Configuration.
Creating Workflows.
Submitting, Monitoring and Managing Oozie
Jobs.
ETL tool ‘Talend’ with Cloudera:
Talend &
Cloudera
Configuration of Talend with the Cloudera VM. How to automate HDFS jobs in Talend. Insert data into HBase by a Talend job. Load data from various source systems into Cloudera
Hive by Talend. Create jobs in Talend to check aggregation, filters, and
left/right/outer joins in Pig. Create an ETL job, load it into Hive & Impala, and
visualize it in Tableau.
Reporting by ‘Tableau’:
Tableau
Case Study: We have a census dataset available as a .CSV file. We need to load it into Cloudera’s Hive or Impala by creating Talend jobs and analyse it with a Tableau report.
Informatica & Teradata (Additional):
Informatica
Concept of the Warehouse and ETL layers. Architecture. How to configure the Source Analyzer and Target Designer. Mappings. Workflow Monitor. Different Transformations.
Verification points while doing ETL testing. Common defects seen in ETL projects.
Teradata
Teradata architecture. SQL in Teradata. Utilities like SQL Assistant, BTEQ, FastLoad, MultiLoad.
Note: Although the intention of this course is to learn Big Data with Talend & Tableau, basic
Informatica & Teradata will additionally be discussed, as these are widely used for data acquisition with Big
Data in the industry. A VM (Virtual Machine) will be provided for practice purposes!
Sample Case Study
Census data analysis:
Following is an example of a job in Talend which will create a table in Impala before
loading Census data. It will take the data in .csv format from the host system and
will load it into Cloudera’s Impala table.
(Load data from Census.csv into Impala by Talend)
(Load data from MySql to Hive by Talend)
Check in Cloudera’s Hue that Table should be created and data should be loaded in Hive:
(Check the data loaded into the Hive/Impala table)
We need to configure Tableau with Cloudera by selecting the IP of the Cloudera VM and
the type of the connection, such as HiveServer2 or Impala. Reports designed in
Tableau will point to Cloudera’s Hive/Impala table, and we can start analysing the
Census data:
(Connect Tableau with HiveServer2 of Cloudera server)
Whenever required, we can use formulas to extract data and create reports in
Tableau. Here we are computing the density of Health Care Centres w.r.t. population, taking
states as the Dimension.
Density of Health Care Centre:
(Horizontal Bar graph in Tableau)
AVG Literacy rate in treemaps report in Tableau:
(Treemaps report)
Web Log Report Analytics by Pig:
Log files are very important for analysis purposes. We can get lots of information,
such as the following, by analysing them with a Pig script:
Total bytes consumed by each URL
Most visited URLs
Rank of the IPs based on total visits.
(Architecture of Log File analysis)
Two tables are joined in Talend’s mapping, creating a target table in
Hive/Impala:
(Talend Mapper)