Date post: | 10-May-2015 |
Category: | Technology |
Upload: | srikanth-sundarrajan |
Hadoop First ETL On Apache Falcon
Srikanth Sundarrajan Naresh Agarwal
About Authors
! Srikanth Sundarrajan
  ! Principal Architect, InMobi Technology Services
! Naresh Agarwal
  ! Director – Engineering, InMobi Technology Services
Agenda
! ETL & Challenges with Big Data
! Apache Falcon – Background
! Pipeline Designer – Overview
! Pipeline Designer – Internals
ETL (Extract Transform Load)
Intelligence
Information
Data
Value
ETL Use cases
Data Warehouse
Data Migration
Data Consolidation
Master Data Management
Data Synchronization
Data Archiving
ETL Authoring
Hand coded
In-house tools
Off-the-shelf tools
ETL & Big Data – Challenges
Volume
Variety
Velocity
Big Data ETL
! Mostly hand coded (high cost: implementation + maintenance)
  ! MapReduce
  ! Hive (i.e. SQL)
  ! Pig
  ! Crunch / Cascading
  ! Spark
! Off-the-shelf tools (scale / performance)
  ! Mostly retrofitted
Apache Falcon
! Off the shelf, Falcon provides standard data management functions through declarative constructs
  ! Data movement recipes
    ! Cross data center replication
    ! Cross cluster data synchronization
  ! Data retention recipes
    ! Eviction
    ! Archival
Apache Falcon
! However, ETL-related functions are still largely left to the developer to implement. Falcon today manages only
  ! Orchestration
  ! Late data handling / Change data capture
  ! Retries
  ! Monitoring
Pipeline Designer – Basics
! Feed
  ! A data entity that Falcon manages and that is physically present in a cluster.
  ! Data in a feed conforms to a schema, and its partitions are registered with HCatalog.
  ! Data management functions such as eviction and archival are declaratively specified through Falcon feed definitions.
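To make the idea of a declaratively specified retention rule concrete, here is a minimal Python sketch. It is not Falcon's feed format or API: the `Feed` class, its fields, and `evictable_partitions` are hypothetical names chosen for illustration, and partitions are modelled simply as timestamps.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Feed:
    """Hypothetical in-memory stand-in for a feed definition (not Falcon's format)."""
    name: str
    schema: dict          # column name -> type, as it might be registered in a catalog
    retention: timedelta  # declarative limit: partitions older than this may be evicted


def evictable_partitions(feed, partitions, now):
    """Return the partitions that are older than the feed's retention limit."""
    cutoff = now - feed.retention
    return [p for p in partitions if p < cutoff]


# Assumed example data: a feed with daily partitions and a two-day retention.
clicks = Feed("clicks", {"user_id": "string", "ts": "bigint"}, timedelta(days=2))
now = datetime(2015, 5, 10)
partitions = [datetime(2015, 5, day) for day in (6, 7, 8, 9)]
expired = evictable_partitions(clicks, partitions, now)
```

The point of the sketch is that the feed only states the limit; the decision about which partitions to drop is derived from it rather than hand coded per dataset.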
Pipeline Designer – Basics
! Process
  ! A workflow that defines the actions that need to be performed, along with their control flow
  ! Executes at a specified frequency on one or more clusters
! Pipelines
  ! Logical grouping of Falcon processes that are owned and operated together
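The relationship between a process, its frequency, and its pipeline can be sketched roughly as follows. This is an illustrative Python model only; `Process`, `instance_times`, and `group_by_pipeline` are hypothetical names, not part of Falcon.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Process:
    """Hypothetical model of a process: a named workflow run on a schedule."""
    name: str
    pipeline: str         # logical group this process belongs to
    frequency: timedelta  # how often an instance is scheduled
    clusters: tuple       # clusters the process executes on


def instance_times(process, start, end):
    """Nominal run times in [start, end), spaced by the process frequency."""
    times = []
    current = start
    while current < end:
        times.append(current)
        current += process.frequency
    return times


def group_by_pipeline(processes):
    """Group process names under the pipeline they are operated with."""
    groups = defaultdict(list)
    for process in processes:
        groups[process.pipeline].append(process.name)
    return dict(groups)


# Assumed example: two processes in one pipeline, one in another.
processes = [
    Process("clean-clicks", "clickstream", timedelta(hours=1), ("primary",)),
    Process("sessionize", "clickstream", timedelta(hours=6), ("primary", "backup")),
    Process("billing-rollup", "billing", timedelta(days=1), ("primary",)),
]
runs = instance_times(processes[1], datetime(2015, 5, 10), datetime(2015, 5, 11))
pipelines = group_by_pipeline(processes)
```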
Pipeline Designer – Basics
! Actions
  ! Actions in the designer are the building blocks for process workflows.
  ! Actions have access to output variables emitted earlier in the flow and can emit output variables of their own
  ! Actions can transition to other actions
    ! Default / Success Transition
    ! Failure Transition
    ! Conditional Transition
  ! A transformation action is a special action that is itself a collection of transforms
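The three transition types and the passing of output variables can be sketched as a small flow runner. This is a simplified illustration in Python, not the designer's implementation; `Action`, `run_flow`, and the precedence of conditional transitions over the default one are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Action:
    """Hypothetical action: runs with earlier variables and emits new ones."""
    name: str
    run: Callable[[dict], dict]
    on_success: Optional[str] = None   # default / success transition
    on_failure: Optional[str] = None   # failure transition
    conditional: list = field(default_factory=list)  # [(predicate, target), ...]


def run_flow(actions, start):
    """Walk the flow from `start`, returning visited action names and variables."""
    variables = {}
    visited = []
    current = start
    while current is not None:
        action = actions[current]
        visited.append(current)
        try:
            variables.update(action.run(variables))
        except Exception:
            current = action.on_failure
            continue
        # Assumption: the first matching conditional wins, else the default applies.
        current = next(
            (target for predicate, target in action.conditional if predicate(variables)),
            action.on_success,
        )
    return visited, variables


# Assumed example: skip the load step when the count step reports zero rows.
actions = {
    "count": Action(
        "count",
        lambda v: {"rows": 0},
        on_success="load",
        conditional=[(lambda v: v["rows"] == 0, "skip")],
    ),
    "load": Action("load", lambda v: {"loaded": True}),
    "skip": Action("skip", lambda v: {"loaded": False}),
}
visited, variables = run_flow(actions, "count")
```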
Pipeline Designer – Basics
! Transforms
  ! A data manipulation function that accepts one or more inputs with a well-defined schema and produces one or more outputs
  ! Multiple transform elements can be stitched together to compose a single transformation action, which can in turn be used to build a flow
  ! Composite Transformations
    ! Transforms that are built through a combination of multiple primitive transforms
  ! Possible to add more transforms and extend the system
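Stitching primitive transforms into a composite one can be illustrated with plain function composition. The sketch below is a simplification: it handles a single input and a single output, rows are dictionaries, and `filter_rows`, `project`, and `compose` are hypothetical helpers rather than designer built-ins.

```python
def filter_rows(predicate):
    """Primitive transform: keep only the rows that satisfy the predicate."""
    def transform(rows):
        return [row for row in rows if predicate(row)]
    return transform


def project(*columns):
    """Primitive transform: keep only the named columns of each row."""
    def transform(rows):
        return [{column: row[column] for column in columns} for row in rows]
    return transform


def compose(*transforms):
    """Composite transform: apply the given transforms in order."""
    def composite(rows):
        for transform in transforms:
            rows = transform(rows)
        return rows
    return composite


# Assumed example data and a composite built from two primitives.
rows = [
    {"user_id": "u1", "clicks": 3, "country": "IN"},
    {"user_id": "u2", "clicks": 0, "country": "US"},
]
active_users = compose(filter_rows(lambda r: r["clicks"] > 0), project("user_id"))
result = active_users(rows)
```

Because a composite has the same shape as a primitive (rows in, rows out), it can be passed to `compose` again, which is one way to read the extensibility point above.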
Pipeline Designer – Basics
! Deployment & Monitoring
  ! Once a process and the pipeline are composed, they are deployed in Falcon as a standard process
Pipeline Designer Service
[Architecture diagram. Components shown: Designer UI, Falcon Dashboard, Pipeline Designer Service, REST API, Versioned Storage, Flow / Action / Transforms Compiler + Optimizer, Falcon Server, HCatalog Service. Other labels: Process, Feed, Schema.]
Pipeline Designer – Internals
! Transformation actions are compiled into Pig scripts
! Actions and Flows are compiled into Falcon Process definitions
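As a rough illustration of what compiling a transformation action into a Pig script could look like, the Python sketch below turns a linear chain of steps into Pig Latin text. It is not the designer's compiler or optimizer: `compile_to_pig`, the `(kind, argument)` step format, the `stepN` aliases, and the example paths are all assumptions made for this example.

```python
def compile_to_pig(source, schema, steps, target):
    """Emit Pig Latin for a linear chain of transform steps (illustrative only)."""
    fields = ", ".join(f"{name}:{pig_type}" for name, pig_type in schema)
    lines = [f"step0 = LOAD '{source}' AS ({fields});"]
    for index, (kind, argument) in enumerate(steps, start=1):
        previous, alias = f"step{index - 1}", f"step{index}"
        if kind == "filter":
            # argument is a Pig boolean expression over the input fields
            lines.append(f"{alias} = FILTER {previous} BY {argument};")
        elif kind == "project":
            # argument is a list of field names to keep
            lines.append(f"{alias} = FOREACH {previous} GENERATE {', '.join(argument)};")
        else:
            raise ValueError(f"unknown transform: {kind}")
    lines.append(f"STORE step{len(steps)} INTO '{target}';")
    return "\n".join(lines)


# Assumed example: keep users with at least one click and store their ids.
script = compile_to_pig(
    "/data/clicks",
    [("user_id", "chararray"), ("clicks", "int")],
    [("filter", "clicks > 0"), ("project", ["user_id"])],
    "/data/active_users",
)
print(script)
```

Each transform maps to one Pig relation, so the generated script stays readable and the chain order is preserved; a real compiler would also need to handle multiple inputs and outputs.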
Q & A