March 2016
Data Movement & Management Meet-up
Agenda
• Networking
• Brief introduction – Venkat Ranganathan
• Falcon Use Case Discussion
• Falcon 0.9 Release and Demo
• New Features coming in 0.10
  • Hive DR – Balu Vellanki
  • Server side extensions – Sowmya Ramesh
  • ADF and instance search – Ying Zheng
  • Hive based ingestion and export – Venkatesan Ramachandran
  • Spark integration – Peeyush
• Sqoop 2 Features – Abraham Fine
Falcon At a Glance
> Falcon offers a high-level abstraction of key services for Hadoop data processing needs.
> Complex data processing logic such as late data handling and retries is handled by Falcon instead of being hard-coded in data processing apps.
> Falcon maximizes reuse and consistency, enabling faster development of data processing apps.
(Diagram) The Falcon framework sits beneath data processing applications and provides data ingest and replication, scheduling and coordination, data lifecycle policies, multi-cluster management, and SLA management.
Usage Scenarios
• Dataset Replication
  • Replicate datasets (whether HDFS files or Hive tables) as part of your disaster recovery, backup, and archival plans.
  • Falcon triggers processes for retries and handles late data arrival.
• Dataset Lifecycle Management
  • Establish the retention policies for datasets.
  • Falcon schedules and handles eviction.
• Dataset Lineage + Traceability
  • View coarse-grained dependencies between clusters, datasets, and processes.
Dataset Replication + Retention
(Diagram) A Weblog dataset and a Recommendations dataset, each stored as HDFS files and Hive tables, with a retention policy attached to every copy.
Datasets Across Environments
• Disaster recovery and backup between environments
• Publishing data between environments for discovery
(Diagram) Site to site and site to cloud replication
Falcon Example: Replication
> Falcon manages process workflow and replication at different stages (feed sketch below).
> Enables data continuity without requiring full data representation.
(Diagram) Pipeline stages Staged Data, Cleansed Data, Conformed Data, Processed Data and Presented Data, with replication between clusters at selected stages.
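In Falcon, a replication like the one pictured is expressed as a single feed scheduled on a source and a target cluster; a minimal sketch, with cluster names, paths, frequencies and retention values that are illustrative rather than taken from the deck:

<feed name="stagedData" description="Staged data replicated to the backup site" xmlns="uri:falcon:feed:0.1">
  <frequency>hours(1)</frequency>
  <timezone>UTC</timezone>
  <clusters>
    <!-- Source cluster: where the data is produced -->
    <cluster name="primaryCluster" type="source">
      <validity start="2016-03-01T00:00Z" end="2099-12-31T00:00Z"/>
      <retention limit="days(90)" action="delete"/>
    </cluster>
    <!-- Target cluster: Falcon schedules the copy here and applies a separate retention -->
    <cluster name="backupCluster" type="target">
      <validity start="2016-03-01T00:00Z" end="2099-12-31T00:00Z"/>
      <retention limit="months(60)" action="delete"/>
    </cluster>
  </clusters>
  <locations>
    <location type="data" path="/data/staged/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
  </locations>
  <ACL owner="etl" group="users" permission="0755"/>
  <schema location="/none" provider="none"/>
</feed>

Scheduling one feed on both clusters is what lets only the staged data cross clusters while the downstream stages are produced locally.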
Falcon Example: Retention
> Sophisticated retention policies expressed in one place (illustrative snippet after the list).
> Simplify data retention for audit, compliance, or data re-processing.
• Staged Data – retain 5 years
• Cleansed Data – retain 3 years
• Conformed Data – retain 3 years
• Presented Data – retain last copy only
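The time based tiers above map onto per-cluster retention elements in the corresponding feed definitions; illustrative values:

<retention limit="months(60)" action="delete"/>   <!-- Staged Data: keep 5 years -->
<retention limit="months(36)" action="delete"/>   <!-- Cleansed / Conformed Data: keep 3 years -->

"Retain last copy only" does not reduce to a simple time limit and is listed here only as it appears on the slide.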
Falcon Example: Late Data Handling
> Processing waits until all required input data is available.
> Checks for late data arrivals and re-triggers processing as necessary (see the sketch below).
> Eliminates writing complex late-data handling rules within applications.
(Diagram) Online transaction data (via Sqoop) and web log data (via FTP) are staged and combined into one dataset; the pipeline waits up to 4 hours for the FTP data to arrive.
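The 4-hour wait corresponds roughly to a late-arrival cut-off on the FTP-delivered feed plus a late-process policy on the consuming process; a sketch based on the feed and process entity specs, with illustrative names and delays:

<!-- In the web log feed definition -->
<late-arrival cut-off="hours(4)"/>

<!-- In the consuming process definition -->
<late-process policy="exp-backoff" delay="hours(1)">
  <late-input input="weblog" workflow-path="/apps/pipeline/late-handling-workflow"/>
</late-process>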
Learn Falcon Using Tutorials
• http://hortonworks.com/hadoop-tutorial/defining-processing-data-end-end-data-pipeline-apache-falcon/
• More to come…
• Questions – please reach out to [email protected]
Falcon Usage in a Pharma Company
Ivo Lašek
03/24/2016 – Hadoop Data Management and Data Movement
(Diagram) Focus areas: Search, Data Integration, Data Analytics, Open Data
Public Brazilian Data
(Diagram, built up over several slides) Data Lake architecture: source data is merged and cleaned into the lake, a Data Catalog sits on top, and Security and Data Governance span the whole stack.
Falcon Usage
Datasets and Feeds
• HDFS and Hive based datasets
• An HDFS folder or a Hive table is a single feed in Falcon
• Our dataset represents an HDFS folder or a collection of Hive tables
• Our dataset therefore corresponds to one or more Falcon feeds
Dataset Level Properties
• Need to set dataset-level properties, not table-level
  • Retention policy, frequency, etc.
• Currently we use a middleware layer that translates datasets to feeds
  • Need to keep the primary information in the middleware layer
  • Potential synchronization issues
  • Falcon can't be accessed directly
Parametrized Scripts
INSERT INTO TABLE ${falcon_output_database}.${falcon_output_table} PARTITION (${falcon_output_partitions_hive})
select *
from ${falcon_input1_database}.${falcon_input1_table} table1
join ${falcon_input2_database}.${falcon_input2_table} table2
on table1.common_id = table2.common_id
WHERE ${falcon_input1_partition_filter_hive} AND
${falcon_input2_partition_filter_hive}

-- At run time the partition filters resolve to concrete values, e.g.:
-- WHERE ds = '2015-08-14-09-00' AND ds = '2015-08-14-10-00'
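These ${falcon_*} variables are supplied by Falcon from the process definition that runs the script: for the Hive engine, an input named input1 yields ${falcon_input1_database}, ${falcon_input1_table} and ${falcon_input1_partition_filter_hive}, and the output name drives the ${falcon_output_*} set (that is my reading of the parameter passing; verify against your Falcon version). A hedged sketch of such a process, with illustrative entity and feed names:

<process name="joinDatasets" xmlns="uri:falcon:process:0.1">
  <clusters>
    <cluster name="primaryCluster">
      <validity start="2015-08-14T00:00Z" end="2099-12-31T00:00Z"/>
    </cluster>
  </clusters>
  <parallel>1</parallel>
  <order>FIFO</order>
  <frequency>hours(1)</frequency>
  <timezone>UTC</timezone>
  <inputs>
    <!-- The names "input1" and "input2" determine the ${falcon_input1_*} and ${falcon_input2_*} variables -->
    <input name="input1" feed="transactionsTableFeed" start="now(0,0)" end="now(0,0)"/>
    <input name="input2" feed="weblogTableFeed" start="now(0,0)" end="now(0,0)"/>
  </inputs>
  <outputs>
    <!-- Likewise "output" drives ${falcon_output_*} -->
    <output name="output" feed="combinedTableFeed" instance="now(0,0)"/>
  </outputs>
  <workflow engine="hive" path="/apps/pipeline/join.hql"/>
</process>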
Processes
• Process chaining
• We used to use a pull model, but it was constrained to Sqoop and Oozie based ingests
• Need to support external ingestion tools (e.g. ETL)
• Push model enabled by the feed availability flag (see the sketch below)
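With the push model, the external ingestion tool writes a marker file once an instance is complete, and the feed's availability flag points at that file; a minimal sketch, assuming the conventional _SUCCESS marker and omitting the other required feed elements:

<feed name="externallyIngestedData" xmlns="uri:falcon:feed:0.1">
  <!-- Falcon treats an instance as available only once this file exists in its directory -->
  <availabilityFlag>_SUCCESS</availabilityFlag>
  <frequency>hours(1)</frequency>
  <!-- clusters, locations, ACL and schema as in any other feed -->
</feed>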
Collaboration on Falcon (Done)
• Falcon REST API trusted user support
  • Impersonation is possible
  • Necessary for our middleware layer
  • FALCON-1027
  • Available in Falcon 0.8
Wish List
• Retention policy
  • For hundreds of tables, hundreds of Oozie jobs are launched at the same time to check retention
• Kerberos
  • The Kerberos ticket for the Falcon principal expires after 1 day and Falcon needs to renew it
  • Workaround: Falcon restarts twice a day
• Explicitly triggered run of a process (off schedule)
• Version based retention policy (not only time based)
• Support for streaming
• Additional storage backends (e.g. HBase)
Contacts
• Ivo Lasek ([email protected])
• Twitter: @ilasek
• http://www.merck.com/
• http://www.msdit.cz/
Falcon Features: What’s New in 0.9?
$whoami
• Pallavi Rao
• Architect, InMobi
• Committer, Apache Falcon
• Contributor, Apache PIG (on Spark)
New Features
• Import from a DB and export to a DB
• Native Scheduler
• Enhanced Falcon Unit API
• Hive DR replication metrics via CLI
Data Import / Export
Data Management Actions
The Missing Piece
Data Import
(Diagram) RDBMS → Falcon feed → HDFS
• Different modes of extraction: full or incremental
• Different modes of output (merge): snapshot, append
• Include/exclude columns (sketched below)
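In the feed spec this surfaces as an import block (and a mirrored export block) inside the feed's source cluster; a sketch shaped after the documented elements, with illustrative datasource and table names:

<cluster name="primaryCluster" type="source">
  <validity start="2016-03-01T00:00Z" end="2099-12-31T00:00Z"/>
  <retention limit="days(90)" action="delete"/>
  <import>
    <source name="customer-db" tableName="customers">
      <!-- extract type: full or incremental; merge policy: snapshot or append -->
      <extract type="full">
        <mergepolicy>snapshot</mergepolicy>
      </extract>
      <fields>
        <includes>
          <field>id</field>
          <field>name</field>
        </includes>
      </fields>
    </source>
  </import>
  <export>
    <!-- load type: insert or updateonly -->
    <target name="customer-db" tableName="customer_summary">
      <load type="insert"/>
    </target>
  </export>
</cluster>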
Data Export
(Diagram) HDFS → Falcon feed → RDBMS
• Different modes of load: insert, update-only
• Include/exclude columns
Native Scheduler
Why Build a Native Scheduler?
• Falcon uses Oozie for:
  • DAG execution
  • Scheduling – gaps exist:
    • Simple periodic scheduling without data gating
    • Cron + calendar based scheduling with/without data gating
    • Flexible data gating
    • Support for aperiodic datasets and triggers based on data availability
    • Support for external triggers
Scheduler - Before
(Diagram) Before: the Falcon Server delegates both scheduling and execution to the workflow engine (Oozie).
Scheduler - The Plan
• Time based scheduling – available in 0.9
• Data based gating – will be available in 0.10
• Complete parity with Oozie and additional features – the release after
Scheduler - After
(Diagram) After: the Falcon Server contains its own scheduler and can hand execution to any DAG executor.
Additional Benefits
• Understands the notion of a pipeline
• Better throttling primitives
• Prioritization and backlog catch-up
Falcon Unit
Motivation for Falcon Unit
• User errors caught only at deployment time
  • Input/output feeds and paths not getting resolved
  • Errors in specification
• Integration tests require environment setup/teardown
  • Messy deployment scripts
  • Time consuming
• Debugging was cumbersome
  • Logs scattered
Falcon Unit
(Diagram) A test suite runs against Falcon Unit, which can target either an in-process execution environment (local Oozie, local file system, local job runner, local message queue) or an actual cluster (Oozie, HDFS, YARN, ActiveMQ).
What You Can Test
Data Management: data creation, data injection, retention, replication
Data Governance: lineage, data availability for verification
Process Management: validation of the definition, entity scheduling and status verification, correctness of the data window being picked up, reruns, missing dependencies/properties
(Slide legend marks each capability as available in 0.8, available in 0.9, or future.)
For More Information
• https://cwiki.apache.org/confluence/display/FALCON/Release+Notes
• https://blogs.apache.org/falcon/entry/what_s_new_in_falcon
• http://falcon.apache.org
Demo
Demo Pipeline
(Diagram) RDBMS → Falcon feed "In" (HDFS) → Falcon process "Copy Cat" → Falcon feed "Out" (HDFS) → RDBMS
Hive Disaster Recovery
• Hive event based replication
  • Hive should set the hive.metastore.event.listeners property to org.apache.hive.hcatalog.listener.DbNotificationListener (see the hive-site.xml fragment below)
  • Requires Hive version 1.2.0 or above
• Uses the Falcon Recipe framework to support Hive DR
• Requires a bootstrap operation from the user
• Will replicate: DB, table, partition
  – add/drop partition, update, delete, alter
• Won't replicate: views, roles, direct HDFS writes without registering metadata
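The listener setting from the first bullet as a hive-site.xml fragment (the property name and class come from the slide; the surrounding XML is the usual Hadoop configuration boilerplate):

<property>
  <name>hive.metastore.event.listeners</name>
  <value>org.apache.hive.hcatalog.listener.DbNotificationListener</value>
</property>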
New Features in 0.10
Server Side Extensions
• Provide the capability to add Falcon extensions that deliver a specific data management function
  – Data anonymization, masking, etc.
• Managed and accessed like other standard Falcon entities
  – UI, CLI and REST API access
• Better manageability than client-side recipes
• Types of extensions
  – Trusted/provided extensions: OOTB extensions that run in the Falcon context
  – Custom extensions: user defined recipes; extension cooking is done outside the Falcon context, in a new process
Server side Extensions (Cont.)
• Extension Repository Management
  – Templatized entities and parameterized workflows used during extension cooking to produce well-constructed Falcon entities are referred to as extension artifacts
  – The extension artifact store is an HDFS based store that Falcon maintains to hold the extension artifacts
  – Configured via the "*.extension.store.uri" property in the Falcon startup properties (see the snippet below)
  – REST API/CLI support is provided for extension store management
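What that entry can look like in Falcon's startup.properties (the property key is from the slide; the HDFS path is a made-up example):

# Location of the extension artifact store
*.extension.store.uri=hdfs://nameservice1/apps/falcon/extensions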
Spark Integration
• Supports Spark as a processing option alongside Hive, Pig and Oozie workflows
• Enables users to implement data management functions in Spark (see the sketch below)
• Both Java and Python applications are supported
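A rough sketch of how a Spark based process might be declared; the spark-attributes element names follow my reading of the 0.10 process schema and should be checked against the released XSD, and the paths and class names are illustrative:

<process name="sparkCleanse" xmlns="uri:falcon:process:0.1">
  <!-- clusters, frequency, inputs and outputs as in any other process -->
  <workflow engine="spark" path="/apps/pipeline/spark"/>
  <spark-attributes>
    <master>yarn-cluster</master>
    <name>Cleanse Staged Data</name>
    <class>com.example.pipeline.CleanseJob</class>
    <jar>/apps/pipeline/lib/pipeline-jobs.jar</jar>
    <spark-opts>--num-executors 4 --executor-memory 2g</spark-opts>
  </spark-attributes>
</process>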
Data Ingestion and Export
• Relational database ingestion and export
  – Falcon supports defining and scheduling import and export jobs
  – Supports Datasource as a top-level abstraction (sketched below)
  – Leverages Sqoop 1 internally
  – Now supports HCatalog tables as source and target for export and import
  – Support for jceks based password aliases
  – WIP to support resource throttling on data sources
• Support for other types of data sources
  – WIP to support data sources other than relational databases
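A rough sketch of a Datasource entity for a MySQL source with a jceks password alias; the element names are from memory of the 0.10 datasource spec and should be checked against the XSD, and the endpoint, names and alias are illustrative:

<datasource colo="primary" name="customer-db" type="mysql" xmlns="uri:falcon:datasource:0.1">
  <interfaces>
    <interface type="readonly" endpoint="jdbc:mysql://dbhost:3306/customers">
      <credential type="password-alias">
        <userName>falcon</userName>
        <passwordAlias>
          <alias>customer.db.password</alias>
          <providerPath>jceks://hdfs/user/falcon/credentials.jceks</providerPath>
        </passwordAlias>
      </credential>
    </interface>
  </interfaces>
  <driver>
    <clazz>com.mysql.jdbc.Driver</clazz>
    <jar>/apps/falcon/lib/mysql-connector-java.jar</jar>
  </driver>
  <ACL owner="falcon" group="users" permission="0755"/>
</datasource>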
HDFS Snapshot Management and Replication
• Use case
  – Cost-effective replication that only copies modified blocks
  – Provides the ability to roll back in case of data corruption
• Falcon will use server side extensions to implement this feature
• The extension will do the following (command sketch below)
  – Create a snapshot on the source directory
  – Replicate the directory using the current and previous snapshot (if one exists)
  – Create a snapshot on the target directory
• Snapshot retention policy
  – Users can specify an age limit and a number N of snapshots to retain
  – Falcon will delete snapshots on source and target that are older than the age limit while retaining at least N snapshots
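The underlying HDFS mechanics look roughly like the standard snapshot-diff distcp pattern (paths and snapshot names are illustrative; these are plain HDFS/distcp commands, not Falcon's exact internal calls):

# One-time: make the directories snapshottable (on both clusters)
hdfs dfsadmin -allowSnapshot /data/warehouse

# Each cycle: snapshot the source
hdfs dfs -createSnapshot /data/warehouse s-2016-03-24

# Copy only the blocks that changed since the previous snapshot
hadoop distcp -update -diff s-2016-03-17 s-2016-03-24 \
  hdfs://source-nn:8020/data/warehouse hdfs://target-nn:8020/data/warehouse

# Snapshot the target at the same point so the next cycle has a baseline
hdfs dfs -createSnapshot hdfs://target-nn:8020/data/warehouse s-2016-03-24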
New Feature: Cluster Entity Update
• Falcon will provide the ability to update a cluster entity without having to delete and re-submit the entity
• Use cases
  – Update a Hadoop installation from unsecure to secure
  – Update from non-HA to high availability
• Cluster entity update expects the underlying HDFS and Oozie installations to remain the same
• Cluster update is only allowed by the super user while Falcon is in safe mode (CLI sketch below)
• Falcon will do the following
  – Update the cluster entity while the server is in safe mode
  – When Falcon starts in normal mode, the coordinator/bundle jobs for all dependent feed/process entities are updated in the workflow engine
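How that interaction might look from the CLI; this is a sketch, and in particular the safe-mode flag and the update syntax are my reading of the 0.10 client rather than something stated on the slide:

# Put the server into safe mode (super user only)
falcon admin -setsafemode true

# Update the cluster entity in place
falcon entity -type cluster -name primaryCluster -file updated-cluster.xml -update

# Back to normal mode; dependent feed/process coordinators get refreshed
falcon admin -setsafemode false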
Falcon Safe Mode
• The Falcon server will support starting in safe mode
• Use cases
  – Supports rolling upgrades
  – Useful while updating cluster entities
• When in safe mode, users can do the following
  – Read operations on all entities/instances
  – Suspend or kill feed/process instances
  – Update cluster entities
• When in safe mode, users cannot do the following
  – Submit entity operations
  – Schedule operations on feeds/processes
  – Validate, touch, dry-run operations
  – Delete entities
  – Instance rerun/resume operations
Hybrid Hadoop Data Pipeline: HDP + Azure Data Factory
• On-premises and cloud hybrid Hadoop data pipeline
  – Build a pipeline for HDP data processing on Azure (e.g. Hive)
  – Copy data to Azure blobs (e.g. aggregation results from Hive)
  – Use the Azure Machine Learning platform for predictive analysis
• Keep sensitive data (e.g. PII) on-premises for privacy and compliance reasons
• Share non-sensitive data in the cloud for cross-region replication, recovery, data prediction, etc.
Hybrid Hadoop Data Pipeline: HDP + Azure Data Factory
• Pipeline building and job tracking
Search and Lineage - Current
• Entity search: filter by name subsequence, tags, …
• Instance search within one entity: filter by time and status
• Lineage for succeeded instances
Search and Lineage - New
• Global instance search
  – Provides an instance status summary
  – Improves search performance
• Lineage for instances in all statuses
Apache Sqoop 2
Abraham Fine
Who am I?
• Software Engineer at Cloudera
• Previously:
  • Software Engineer at Yahoo!
  • Software Engineer at BrightRoll
  • Student at the University of Illinois at Urbana-Champaign
Committer – Apache Sqoop
What is Apache Sqoop?
“Apache Sqoop™ is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.”
So much more than that… (e.g. (S)FTP)
Sqoop 1: A Brief Overview
Sqoop 1
• Based on connectors
  • Responsible for metadata lookups and data transfer
  • The majority of connectors are JDBC based
  • Non-JDBC (direct) connectors exist for optimized data transfer
• Connectors are responsible for all supported functionality
  • HBase import, Avro support, …
Sqoop 1 Architecture
(Diagram) The Sqoop 1 client handles metadata lookups against the database and job submission to the cluster.
Sqoop 1 Shortcomings
• Client needs…
  • direct access to the database
  • access to the Hadoop configuration
• Connectors strictly coupled with MapReduce
• No way to manage database passwords for users
• Resource management is difficult
• Client needs the JDK
• Very long, complicated command line scripts (example below)
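For context, a typical Sqoop 1 invocation; the connection string, table and option values are made up for illustration, but the flags are standard Sqoop 1 options:

sqoop import \
  --connect jdbc:mysql://dbhost:3306/sales \
  --username etl_user \
  --password-file /user/etl/.db_password \
  --table orders \
  --target-dir /data/staged/orders \
  --num-mappers 8 \
  --fields-terminated-by '\t' \
  --incremental append \
  --check-column order_id \
  --last-value 1000000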
Sqoop 2: Sqoop as a Service
Sqoop 2 Architecture
(Diagram) Clients talk to a central Sqoop 2 server backed by a repository; the server submits jobs to the cluster.
Sqoop 2 Internals
Connectors
• Connectors implement an interface that allows Sqoop to retrieve and write data
• JDBC and HDFS are implemented with connectors
• They define the configuration needed to work with a type of data source
• Anyone can write connectors for Sqoop 2
Links
• If connectors are classes, then links are instances
• Links define connections to individual data sources
• Links contain inputs which are values assigned to the configuration specified in the link’s connector
• “Sensitive values” are hidden from the user and encrypted in the repository
Sqoop 2 Internals
(Diagram) A connector can have many links (Link A, Link B); each link holds its own inputs (Input A.A and Input A.B for Link A, Input B.A for Link B).
Jobs
• A "from" link
• A "to" link
• Some extra configuration (for resource management, etc.)
(Diagram) Job = Link A (from) + Link B (to) + FromJobConf + ToJobConf (shell sketch below)
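A hedged Sqoop 2 shell sketch of the link/job model; link and job names are made up, and the exact option names vary across 1.99.x releases, so treat the flags as indicative:

sqoop2-shell

# Point the shell at the Sqoop 2 server
sqoop:000> set server --host sqoop2-host --port 12000 --webapp sqoop

# One link per data source (links are instances of a connector)
sqoop:000> create link -connector generic-jdbc-connector   # prompts for name, JDBC URL, credentials
sqoop:000> create link -connector hdfs-connector           # prompts for name and target directory

# A job wires a "from" link to a "to" link
sqoop:000> create job -f "sales-db" -t "staging-hdfs"

# Run it
sqoop:000> start job -name "sales-to-hdfs"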
2 Classes of Sqoop User
• Admin/DBA
  • Sets up links and manages passwords to databases
• User
  • Sets up and runs jobs
Demo!
Questions?
Abraham Fine
[email protected]
https://www.linkedin.com/in/abrahamfine
@abrahamfine
Thank You