
Falcon Meetup

Date post: 13-Feb-2017
Upload: hortonworks
March 2016 Data Movement & Management Meet-up
Transcript
Page 1: Falcon Meetup

March 2016

Data Movement & Management Meet-up

Page 2: Falcon Meetup

© Hortonworks Inc. 2011 – 2016. All Rights Reserved

Agenda

• Networking
• Brief introduction - Venkat Ranganathan
• Falcon Use Case Discussion
• Falcon 0.9 Release and Demo
• New Features coming in 0.10
  • Hive DR - Balu Vellanki
  • Server side extensions - Sowmya Ramesh
  • ADF and Instance search - Ying Zheng
  • Hive based ingestion and export - Venkatesan Ramachandran
  • Spark integration - Peeyush
• Sqoop 2 Features - Abraham Fine

Page 3: Falcon Meetup

Falcon At a Glance

> Falcon offers a high-level abstraction of key services for Hadoop data processing needs.
> Complex data processing logic such as late data handling and retries is handled by Falcon instead of being hard-coded in data processing apps.
> Falcon maximizes reuse and consistency, enabling faster development of data processing apps.

[Diagram: Data Processing Applications; Falcon Framework services: Data Ingest and Replication, Scheduling and Coordination, Data Lifecycle Policies, Multi-Cluster Management, SLA Management]

Page 4: Falcon Meetup

Usage Scenarios

• Dataset Replication
  • Replicate datasets (whether HDFS files or Hive Tables) as part of your Disaster Recovery, Backup and Archival plans.
  • Falcon triggers processes for retries and handles late data arrival.
• Dataset Lifecycle Management
  • Establish the retention policies for datasets.
  • Falcon schedules and handles eviction.
• Dataset Lineage + Traceability
  • View coarse-grained dependencies between clusters, datasets and processes.

Page 5: Falcon Meetup

Dataset Replication + Retention

[Diagram: Weblog Dataset and Recommendations Dataset, each shown as HDFS and Hive Tables, with a retention policy attached to each]

Page 6: Falcon Meetup

Datasets Across Environments

• Disaster Recovery and Backup between environments
• Publishing data between environments for Discovery

[Diagram: Site to Site; Site to Cloud]

Page 7: Falcon Meetup

Falcon Example: Replication

> Falcon manages process workflow and replication at different stages.
> Enables data continuity without requiring full data representation.

[Diagram: pipeline stages Staged Data, Cleansed Data, Conformed Data, Presented Data and Processed Data, with two Replication links; Staged Data appears twice]

Page 8: Falcon Meetup

Falcon Example: Retention

> Sophisticated retention policies expressed in one place.
> Simplify data retention for audit, compliance, or for data re-processing.

• Staged Data: Retain 5 Years
• Cleansed Data: Retain 3 Years
• Conformed Data: Retain 3 Years
• Presented Data: Retain Last Copy Only
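
As a rough illustration of the per-stage policies on this slide, the eviction decision can be sketched as a lookup plus a filter. This is a minimal sketch, not Falcon's implementation: the stage keys, the `instances_to_evict` helper and the 365-day year are assumptions made for the example.

```python
from datetime import datetime, timedelta

# Retention limits taken from the slide; a year is approximated as 365 days.
# "Presented Data" (retain last copy only) is handled separately below.
RETENTION = {
    "staged": timedelta(days=5 * 365),
    "cleansed": timedelta(days=3 * 365),
    "conformed": timedelta(days=3 * 365),
}


def instances_to_evict(stage, instance_times, now):
    """Return the instance timestamps that fall outside the stage's retention."""
    if stage == "presented":
        # Keep only the most recent copy.
        return sorted(instance_times)[:-1]
    limit = RETENTION[stage]
    return [t for t in instance_times if now - t > limit]


now = datetime(2016, 3, 24)
times = [datetime(2010, 1, 1), datetime(2014, 1, 1), datetime(2016, 1, 1)]
print(instances_to_evict("staged", times, now))
print(instances_to_evict("presented", times, now))
```

Keeping the limits in one table mirrors the slide's point that retention is expressed in one place rather than inside each application.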

Page 9: Falcon Meetup

Falcon Example: Late Data Handling

> Processing waits until all required input data is available.
> Checks for late data arrivals and retriggers processing as necessary.
> Eliminates writing complex data handling rules within applications.

[Diagram: Online Transaction Data (via Sqoop); Web Log Data (via FTP); Staged Data; Combined Dataset. Annotation: wait up to 4 hours for FTP data to arrive]
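
The waiting and retriggering behaviour described on this slide can be sketched as a small decision function. This is an illustrative model only: `next_action`, its parameters and the returned labels are hypothetical, and what happens once the 4-hour window closes ("timed-out" here) is an assumption, since the slide does not say.

```python
from datetime import datetime, timedelta

LATE_CUTOFF = timedelta(hours=4)  # "wait up to 4 hours for FTP data to arrive"


def next_action(nominal_time, now, inputs_ready,
                already_processed=False, late_data_seen=False):
    """Decide what to do with one pipeline instance (hypothetical helper)."""
    within_window = now - nominal_time <= LATE_CUTOFF
    if already_processed:
        # Late data inside the window retriggers processing.
        return "rerun" if (late_data_seen and within_window) else "done"
    if inputs_ready:
        return "run"
    # Assumption: after the cutoff the instance is reported as timed out.
    return "wait" if within_window else "timed-out"


t0 = datetime(2016, 3, 24, 9, 0)
print(next_action(t0, t0 + timedelta(hours=1), inputs_ready=False))
print(next_action(t0, t0 + timedelta(hours=5), inputs_ready=False))
```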

Page 10: Falcon Meetup

Learn Falcon Using Tutorials

• http://hortonworks.com/hadoop-tutorial/defining-processing-data-end-end-data-pipeline-apache-falcon/
• More to come…
• Questions: please reach out to [email protected]

Page 11: Falcon Meetup


Falcon Usage in a Pharma Company

Ivo Lašek

03/24 2016 Hadoop Data Management and Data Movement

Page 13: Falcon Meetup

[Slide labels: Search, Data Integration, Data Analytics, Open Data]

Page 14: Falcon Meetup


Public Brazilian Data

Page 17: Falcon Meetup


Data Lake

Page 19: Falcon Meetup


Data Lake

Merge

Clean

Page 21: Falcon Meetup


Data Lake

Merge

Clean

Security and Data Governance

Page 22: Falcon Meetup


Data Lake

Merge

Clean

Data Catalog

Security and Data Governance

Page 24: Falcon Meetup


Falcon Usage

Page 25: Falcon Meetup

Datasets and Feeds

• HDFS and Hive based datasets
• An HDFS folder or a Hive table is a single feed in Falcon
• Our Dataset represents an HDFS folder or a collection of Hive tables
• Our Dataset corresponds to 1 or more Falcon feeds

Page 26: Falcon Meetup

Dataset Level Properties

• Need to set dataset level properties, not table level
  • Retention policy, frequency etc.
• Currently we use a middleware layer that translates datasets to feeds
  • Need to keep the primary information in the middleware layer
  • Potential synchronization issues
  • Falcon can't be accessed directly

Page 27: Falcon Meetup

Parametrized Scripts

INSERT INTO TABLE ${falcon_output_database}.${falcon_output_table}
PARTITION (${falcon_output_partitions_hive})
SELECT *
FROM ${falcon_input1_database}.${falcon_input1_table} i1
JOIN ${falcon_input2_database}.${falcon_input2_table} i2
  ON i1.common_id = i2.common_id
WHERE ${falcon_input1_partition_filter_hive} AND
      ${falcon_input2_partition_filter_hive}

-- Example of the partition filters after substitution:
-- WHERE ds = '2015-08-14-09-00' AND ds = '2015-08-14-10-00'
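
To make the placeholder expansion concrete, the sketch below mimics it with Python's `string.Template`, which happens to use the same `${name}` syntax. It only illustrates the idea; it is not how Falcon or Hive perform the substitution, the query is shortened to one input, and every value in `params` is made up for the example.

```python
from string import Template

# Shortened, single-input version of the script on the slide.
hql = Template(
    "INSERT INTO TABLE ${falcon_output_database}.${falcon_output_table} "
    "PARTITION (${falcon_output_partitions_hive}) "
    "SELECT * FROM ${falcon_input1_database}.${falcon_input1_table} i1 "
    "WHERE ${falcon_input1_partition_filter_hive}"
)

# Hypothetical values; in a real deployment these are supplied per instance.
params = {
    "falcon_output_database": "analytics",
    "falcon_output_table": "joined",
    "falcon_output_partitions_hive": "ds='2015-08-14-09-00'",
    "falcon_input1_database": "raw",
    "falcon_input1_table": "weblogs",
    "falcon_input1_partition_filter_hive": "ds = '2015-08-14-09-00'",
}

# substitute() raises KeyError if any placeholder has no value.
rendered = hql.substitute(params)
print(rendered)
```

Because the script never names a concrete database, table or partition, the same file can be reused for every instance of the process.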

Page 28: Falcon Meetup

Processes

• Process chaining
• Previously used a pull model, but it was limited to Sqoop and Oozie based ingests
• Need to support external ingestion tools (e.g. ETL)
• Push model enabled by availability flag
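
The push model can be pictured as follows: an external tool writes its data and then drops a flag file, and downstream processing treats the instance as available only once the flag exists. The sketch below is a minimal illustration under stated assumptions: the local file system stands in for HDFS, and the flag name `_SUCCESS`, the directory name and the `is_available` helper are hypothetical.

```python
import tempfile
from pathlib import Path

AVAILABILITY_FLAG = "_SUCCESS"  # assumed flag file name


def is_available(instance_dir: Path, flag: str = AVAILABILITY_FLAG) -> bool:
    """An instance counts as available only once the flag file exists."""
    return (instance_dir / flag).is_file()


with tempfile.TemporaryDirectory() as tmp:
    instance = Path(tmp) / "2015-08-14-09-00"
    instance.mkdir()
    (instance / "part-00000").write_text("row\n")
    before = is_available(instance)         # data written, flag not yet present
    (instance / AVAILABILITY_FLAG).touch()  # external tool signals completion
    after = is_available(instance)

print(before, after)
```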

Page 29: Falcon Meetup

Collaboration on Falcon (Done)

• Falcon REST API trusted user support
  • Impersonation is possible
  • Necessary for our Middleware layer
  • FALCON-1027
  • Available in Falcon 0.8

Page 30: Falcon Meetup

Wish List

• Retention policy
  • For hundreds of tables there are hundreds of Oozie jobs launched at the same time to check the retention
• Kerberos
  • Kerberos ticket for Falcon principal expires after 1 day and Falcon needs to renew it
  • Workaround: Falcon restarts twice a day
• Explicitly triggered run of a process (off schedule)
• Version based retention policy (not only time based)
• Support for streaming
• Additional storages (e.g. HBase)

Page 31: Falcon Meetup


Contacts

• Ivo Lasek ([email protected])

• Twitter: @ilasek

• http://www.merck.com/

• http://www.msdit.cz/

Page 32: Falcon Meetup


Falcon Features: What’s New in 0.9?

Page 33: Falcon Meetup

$whoami

• Pallavi Rao
• Architect, InMobi
• Committer, Apache Falcon
• Contributor, Apache PIG (on Spark)

Page 34: Falcon Meetup

New Features

• Import from DB and export to a DB.
• Native Scheduler
• Enhanced Falcon Unit API
• Hive DR replication metrics via CLI

Page 35: Falcon Meetup

Data Import/Export

Page 36: Falcon Meetup

Data Management Actions

Page 37: Falcon Meetup

The Missing Piece

Page 38: Falcon Meetup

Data Import

[Diagram: Falcon Feed between RDBMS and HDFS]

• Different modes of extraction: full or incremental
• Different modes of output (merge): snapshot, append
• Include/exclude columns
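
One way to read the extraction and merge modes on this slide is sketched below. The semantics are an interpretation, not taken from the Falcon source: "incremental" is modeled as "rows past a checkpoint value", "snapshot" as "replace the target" and "append" as "add to the target". The `extract` and `merge` helpers and the `id` checkpoint column are hypothetical.

```python
def extract(rows, mode, last_value=None, key="id"):
    """Full extraction returns every row; incremental only rows past the checkpoint."""
    if mode == "full":
        return list(rows)
    if mode == "incremental":
        return [r for r in rows if last_value is None or r[key] > last_value]
    raise ValueError(f"unknown extraction mode: {mode}")


def merge(existing, extracted, mode):
    """Snapshot replaces the target contents; append adds to them."""
    if mode == "snapshot":
        return list(extracted)
    if mode == "append":
        return list(existing) + list(extracted)
    raise ValueError(f"unknown merge mode: {mode}")


source = [{"id": 1}, {"id": 2}, {"id": 3}]
target = [{"id": 1}]  # rows imported by an earlier run

# Incremental + append: only rows newer than the checkpoint are added.
target = merge(target, extract(source, "incremental", last_value=1), "append")
print(target)
```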

Page 39: Falcon Meetup

Data Export

[Diagram: Falcon Feed between HDFS and RDBMS]

• Different modes of load: insert, update-only
• Include/exclude columns

Page 40: Falcon Meetup


Native Scheduler

Page 41: Falcon Meetup

Why Build a Native Scheduler?

• Falcon uses Oozie for:
  • DAG Execution
  • Scheduling - Gaps Exist
    • Simple periodic scheduling without data gating
    • Cron + calendar based scheduling with/without data gating.
    • Flexible data gating
    • Support for a-periodic datasets and triggers based on data availability.
    • Support for external triggers.

Page 42: Falcon Meetup


Scheduler - Before

Falcon Server Scheduler

Execution

Page 43: Falcon Meetup

Scheduler - The Plan

• Time based scheduling - Available in 0.9
• Data based gating - Will be available in 0.10
• Complete parity with Oozie and additional features - The release after.

Page 44: Falcon Meetup

Scheduler - After

[Diagram: Falcon Server with its own Scheduler; Execution handed to a DAG Executor (ANY DAG Executor)]

Page 45: Falcon Meetup

Additional Benefits

• Understands the notion of a pipeline
• Better throttling primitives
• Prioritization and backlog catch up

Page 46: Falcon Meetup


Falcon Unit

Page 47: Falcon Meetup

Motivation for Falcon Unit

• User errors caught only at deployment time
  • Input/output feeds and paths not getting resolved
  • Errors in specification
• Integration tests require environment setup/teardown.
  • Messy deployment scripts
  • Time consuming
• Debugging was cumbersome.
  • Logs scattered

Page 48: Falcon Meetup

Falcon Unit

[Diagram: a test suite runs through Falcon Unit against one of two environments]

In-process execution environment
• Local Oozie
• Local File System
• Local Job Runner
• Local Message Queue

Actual cluster
• Oozie
• HDFS
• YARN
• ActiveMQ

Page 49: Falcon Meetup

What You Can Test

Data Management
● Data creation
● Data injection
● Retention
● Replication

Data Governance
● Lineage
● Data availability for verification

Process Management
● Validation of definition
● Entity scheduling and status verification
● Correctness of data window being picked up
● Reruns
● Missing dependencies/properties

[Legend, color coding lost in extraction: Future; Available in 0.8; Available in 0.9]

Page 50: Falcon Meetup


For More Information

• https://cwiki.apache.org/confluence/display/FALCON/Release+Notes

• https://blogs.apache.org/falcon/entry/what_s_new_in_falcon

• http://falcon.apache.org

Page 51: Falcon Meetup


Demo

Page 52: Falcon Meetup

Demo Pipeline

[Diagram: Falcon Feed "In" (RDBMS, HDFS); Falcon Process "Copy Cat"; "Out" (HDFS, RDBMS)]

Page 53: Falcon Meetup

Hive Disaster Recovery

• Hive event based replication
  • Hive should set the hive.metastore.event.listeners property to org.apache.hive.hcatalog.listener.DbNotificationListener
  • Requires Hive version 1.2.0 or above.
  • Uses Falcon Recipe framework to support Hive DR.
• Requires Bootstrap operation from user.
• Will replicate: DB, Table, Partition
  – Add/drop partition, update, delete, alter
• Won't replicate: Views, roles, direct HDFS writes without registering metadata

Page 54: Falcon Meetup


New Features in 0.10

Page 55: Falcon Meetup

Server Side Extensions

• Provide capability to add Falcon extensions that can be used to provide a specific data management function
  – Data anonymization, masking etc.
• Managed and accessed like other standard Falcon entities
  – UI, CLI and REST API access
• Better manageability than client side recipes
• Types of extensions
  – Trusted/provided extensions: OOTB extensions that run in the Falcon context
  – Custom extensions: user-defined recipes. Extension cooking will be done outside the Falcon context in a new process.

Page 56: Falcon Meetup

Server Side Extensions (Cont.)

• Extension Repository Management
  – Templatized entities and parameterized workflows used during extension cooking to realize well constructed Falcon entities are referred to as extension artifacts
  – The extension artifact store is an HDFS based store which the Falcon system maintains to store the extension artifacts
  – The store should be configured using the "*.extension.store.uri" property in Falcon startup properties
  – REST API/CLI support should be provided for extension store management

Page 57: Falcon Meetup

Spark Integration

• Support Spark as a processing option with Hive, Pig and Oozie workflows

• Enables users to easily do data management functions using Spark

• Both Java and Python applications are supported

Page 58: Falcon Meetup

Data Ingestion and Export

• Relational Database Ingestion and Export
  – Falcon supports defining and scheduling import and export jobs
  – Supports Datasource as a top level abstraction
  – Leverages Sqoop 1 internally
  – Now supports HCatalog tables as Source and Target for Export and Import
  – Support for jceks based password alias
  – WIP to support resource throttling on Data Sources
• Support for other types of Data Sources
  – WIP to support data sources other than Relational databases

Page 59: Falcon Meetup

HDFS Snapshot Management and Replication

• Use Case
  – Cost effective replication that only copies modified blocks
  – Provides ability to rollback in case of data corruption
• Falcon will use server side extensions to implement this feature
• Extension will do the following
  – Create the snapshot on source directory
  – Replicate the directory using current and previous snapshot (if it exists)
  – Create snapshot on target directory
• Snapshot retention policy
  – Users can specify age limit and N number of snapshots to retain.
  – Falcon will delete snapshots on source and target that are older than the age limit while retaining at least N snapshots.
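
The snapshot retention rule (delete snapshots older than the age limit, but always keep at least N) can be sketched as below. This is a minimal model of the rule as stated on the slide, not the extension's code; `snapshots_to_delete` and its parameter names are hypothetical.

```python
from datetime import datetime, timedelta


def snapshots_to_delete(snapshot_times, now, age_limit, keep_at_least):
    """Pick snapshots older than age_limit without dropping below keep_at_least."""
    newest_first = sorted(snapshot_times, reverse=True)
    # The N newest snapshots are always kept, whatever their age.
    candidates = newest_first[keep_at_least:]
    return [t for t in candidates if now - t > age_limit]


now = datetime(2016, 3, 24)
snaps = [now - timedelta(days=d) for d in (1, 5, 10, 20)]
doomed = snapshots_to_delete(snaps, now, age_limit=timedelta(days=7), keep_at_least=3)
print(doomed)
```

With `keep_at_least=3`, the 10-day-old snapshot survives even though it exceeds the 7-day age limit, because deleting it would leave fewer than three snapshots.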

Page 60: Falcon Meetup

New Feature: Cluster Entity Update

• Falcon will provide the ability to update a cluster entity without having to delete and re-submit the entity
• Use Cases
  – Update Hadoop installation from unsecure to secure.
  – Update from non-HA to high availability
• Cluster entity update expects the underlying HDFS and Oozie installations to remain the same
• Cluster update is only allowed by the super user in Falcon safe mode.
• Falcon will do the following
  – Update cluster entity when server is in safe mode
  – When Falcon starts in normal mode, the coordinator/bundle jobs for all dependent Feed/Process entities will be updated in workflow engine

Page 61: Falcon Meetup

Falcon Safe Mode

• Falcon server will support starting in safe-mode
• Use cases
  – Supports rolling upgrades
  – Useful while updating cluster entities
• When in safe-mode, users can do the following
  – Read operations on all entity/instances
  – Suspend or Kill feed/process instances
  – Update cluster entity.
• When in safe-mode, users cannot do the following
  – Submit entity operations.
  – Schedule operations on feed/process
  – Validate, touch, dry-run operations
  – Delete entity
  – Instance rerun/resume operations
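
The two safe-mode lists above amount to an allow/deny table. The sketch below encodes them as sets; the operation labels and the `allowed_in_safe_mode` helper are hypothetical names chosen for the example, not Falcon CLI or REST identifiers.

```python
# Operation labels are illustrative; they are not Falcon CLI or REST names.
ALLOWED_IN_SAFE_MODE = {
    "read", "suspend-instance", "kill-instance", "update-cluster",
}
BLOCKED_IN_SAFE_MODE = {
    "submit", "schedule", "validate", "touch", "dry-run",
    "delete", "instance-rerun", "instance-resume",
}


def allowed_in_safe_mode(operation: str) -> bool:
    """Answer only for operations that the slide explicitly lists."""
    if operation in ALLOWED_IN_SAFE_MODE:
        return True
    if operation in BLOCKED_IN_SAFE_MODE:
        return False
    raise ValueError(f"operation not covered by the slide: {operation}")


print(allowed_in_safe_mode("read"))
print(allowed_in_safe_mode("submit"))
```

Raising on unlisted operations keeps the sketch from guessing about behaviour the slide does not describe.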

Page 62: Falcon Meetup

Hybrid Hadoop Data Pipeline: HDP + Azure Data Factory

• On-premise and cloud hybrid Hadoop data pipeline
  – Build pipeline for HDP data processing on Azure (e.g. Hive)
  – Copy data to Azure blobs (e.g. aggregation result from Hive)
  – Use Azure Machine Learning platform for predictive analysis
• Keep sensitive data (e.g. PII) on-premises for privacy, compliance reasons
• Share non-sensitive data on cloud for cross-region replication, recovery, data prediction, etc.

Page 63: Falcon Meetup

Hybrid Hadoop Data Pipeline: HDP + Azure Data Factory

• Pipeline building and job tracking

Page 64: Falcon Meetup

Search and Lineage - Current

• Entity search: Filter by name subsequence, tags, …
• Instance search of one entity: Filter by time and status
• Lineage for succeeded instances
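
"Filter by name subsequence" means the typed characters must appear in the entity name in order, though not necessarily next to each other. A minimal sketch of such a filter follows; the helper names and entity names are made up, and the case-insensitive comparison is an assumption.

```python
def is_subsequence(pattern: str, name: str) -> bool:
    """True if pattern's characters appear in name in order, gaps allowed."""
    remaining = iter(name.lower())
    # `ch in remaining` advances the iterator, so matches must occur in order.
    return all(ch in remaining for ch in pattern.lower())


def search(entity_names, pattern):
    return [n for n in entity_names if is_subsequence(pattern, n)]


entities = ["weblog-feed", "recommendations", "copy-cat"]
print(search(entities, "wlf"))
```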

Page 65: Falcon Meetup

Search and Lineage - New

• Global instance search
  – Provide instance status summary
  – Improve search performance
• Lineage for instances in all statuses

Page 66: Falcon Meetup

Abraham Fine

Apache Sqoop 2

Page 67: Falcon Meetup

Who am I?

• Software Engineer at Cloudera
• Previously:
  • Software Engineer at Yahoo!
  • Software Engineer at BrightRoll
  • Student at The University of Illinois at Urbana-Champaign

Page 68: Falcon Meetup


Committer – Apache Sqoop

Page 69: Falcon Meetup


What is Apache Sqoop?

Page 70: Falcon Meetup

"Apache Sqoop(TM) is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases."

Page 71: Falcon Meetup


So much more than that…

(S)FTP

Page 72: Falcon Meetup

Sqoop 1

A brief overview

Page 73: Falcon Meetup

Sqoop 1

• Based on Connectors
  • Responsible for metadata lookups and data transfer
  • Majority of connectors are JDBC based
  • Non-JDBC (direct) connectors for optimized data transfer
• Connectors responsible for all supported functionality
  • HBase Import, Avro Support, ...

Page 74: Falcon Meetup


Sqoop 1 Architecture

Metadata

Job Submission

Page 75: Falcon Meetup

Sqoop 1 Shortcomings

• Client needs…
  • Direct access to the database
  • Access to Hadoop configuration
• Connectors strictly coupled with MapReduce
• No way to manage database passwords for users
• Resource management is difficult
• Client needs the JDK
• Very long complicated command line scripts

Page 76: Falcon Meetup

Sqoop 2

Sqoop as a service

Page 77: Falcon Meetup


Sqoop 2 Architecture

Repository

Page 78: Falcon Meetup


Sqoop 2 Internals

Page 79: Falcon Meetup


Connectors

• Connectors implement an interface that allows Sqoop to retrieve and write data

• JDBC and HDFS are implemented with connectors

• They define the configuration needed to work with a type of data source

• Anyone can write connectors for Sqoop 2

Page 80: Falcon Meetup

Links

• If connectors are classes, then links are instances

• Links define connections to individual data sources

• Links contain inputs which are values assigned to the configuration specified in the link’s connector

• “Sensitive values” are hidden from the user and encrypted in the repository

Page 81: Falcon Meetup


Sqoop 2 Internals

Connector

Link A

Input A.A Input A.B

Link B

Input B.A

Page 82: Falcon Meetup

Jobs

• From link
• To link
• Some extra configuration (for resource management, etc…)

[Diagram: a Job combines Link A, Link B, FromJobConf and ToJobConf]
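
The relationship between connectors, links and jobs described on the preceding slides can be sketched with plain data classes. This models the concepts only and is not the Sqoop 2 client API; every class, field, connector name and connection value below is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Connector:
    """Defines which configuration a link of this type must supply."""
    name: str
    config_keys: tuple


@dataclass
class Link:
    """An 'instance' of a connector: inputs are values for its configuration."""
    name: str
    connector: Connector
    inputs: dict

    def __post_init__(self):
        missing = set(self.connector.config_keys) - set(self.inputs)
        if missing:
            raise ValueError(f"link {self.name!r} is missing inputs: {sorted(missing)}")


@dataclass
class Job:
    """A from-link, a to-link, and extra per-direction configuration."""
    from_link: Link
    to_link: Link
    from_job_conf: dict = field(default_factory=dict)
    to_job_conf: dict = field(default_factory=dict)


jdbc = Connector("jdbc", ("connection_string", "username", "password"))
hdfs = Connector("hdfs", ("uri",))

link_a = Link("orders-db", jdbc, {
    "connection_string": "jdbc:mysql://db.example.com/shop",
    "username": "etl",
    "password": "secret",  # a "sensitive value" in the slides' terms
})
link_b = Link("cluster", hdfs, {"uri": "hdfs://namenode.example.com:8020"})

job = Job(link_a, link_b, {"table": "orders"}, {"output_dir": "/data/orders"})
print(job.from_link.connector.name, "->", job.to_link.connector.name)
```

Splitting links from jobs matches the two user classes in the deck: an admin can own the links and their passwords, while users only assemble jobs from existing links.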

Page 83: Falcon Meetup

2 Classes of Sqoop User

• Admin/DBA
  • Sets up links and manages passwords to databases
• User
  • Sets up and runs jobs

Page 84: Falcon Meetup


Demo!

Page 85: Falcon Meetup

Questions?

Abraham [email protected]://www.linkedin.com/in/abrahamfine
@abrahamfine

Page 86: Falcon Meetup


Thank You

