March 2016
Data Movement & Management Meet-up
Agenda
• Networking
• Brief introduction – Venkat Ranganathan
• Falcon Use Case Discussion
• Falcon 0.9 Release and Demo
• New Features coming in 0.10
  • Hive DR – Balu Vellanki
  • Server side extensions – Sowmya Ramesh
  • ADF and instance search – Ying Zheng
  • Hive based ingestion and export – Venkatesan Ramachandran
  • Spark integration – Peeyush
• Sqoop 2 Features – Abraham Fine
Falcon At a Glance
> Falcon offers a high-level abstraction of key services for Hadoop data processing needs.
> Complex data processing logic such as late data handling and retries is handled by Falcon instead of being hard-coded in data processing apps.
> Falcon maximizes reuse and consistency, enabling faster development of data processing apps.
(Diagram) The Falcon framework sits beneath data processing applications and provides data ingest and replication, scheduling and coordination, data lifecycle policies, multi-cluster management, and SLA management.
Usage Scenarios
• Dataset Replication
  • Replicate datasets (whether HDFS files or Hive tables) as part of your disaster recovery, backup, and archival plans.
  • Falcon triggers processes for retries and handles late data arrival.
• Dataset Lifecycle Management
  • Establish the retention policies for datasets.
  • Falcon schedules and handles eviction.
• Dataset Lineage + Traceability
  • View coarse-grained dependencies between clusters, datasets, and processes.
Dataset Replication + Retention
(Diagram) A Weblog dataset and a Recommendations dataset, each stored as HDFS files and Hive tables, with a retention policy attached to every copy.
Datasets Across Environments
• Disaster recovery and backup between environments
• Publishing data between environments for discovery
(Diagram) Site to site and site to cloud replication
Falcon Example: Replication
> Falcon manages process workflow and replication at different stages (feed sketch below).
> Enables data continuity without requiring full data representation.
(Diagram) Pipeline stages Staged Data, Cleansed Data, Conformed Data, Processed Data and Presented Data, with replication between clusters at selected stages.
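In Falcon, a replication like the one pictured is expressed as a single feed scheduled on a source and a target cluster; a minimal sketch, with cluster names, paths, frequencies and retention values that are illustrative rather than taken from the deck:

<feed name="stagedData" description="Staged data replicated to the backup site" xmlns="uri:falcon:feed:0.1">
  <frequency>hours(1)</frequency>
  <timezone>UTC</timezone>
  <clusters>
    <!-- Source cluster: where the data is produced -->
    <cluster name="primaryCluster" type="source">
      <validity start="2016-03-01T00:00Z" end="2099-12-31T00:00Z"/>
      <retention limit="days(90)" action="delete"/>
    </cluster>
    <!-- Target cluster: Falcon schedules the copy here and applies a separate retention -->
    <cluster name="backupCluster" type="target">
      <validity start="2016-03-01T00:00Z" end="2099-12-31T00:00Z"/>
      <retention limit="months(60)" action="delete"/>
    </cluster>
  </clusters>
  <locations>
    <location type="data" path="/data/staged/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
  </locations>
  <ACL owner="etl" group="users" permission="0755"/>
  <schema location="/none" provider="none"/>
</feed>

Scheduling one feed on both clusters is what lets only the staged data cross clusters while the downstream stages are produced locally.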
Falcon Example: Retention
> Sophisticated retention policies expressed in one place (illustrative snippet after the list).
> Simplify data retention for audit, compliance, or data re-processing.
• Staged Data – retain 5 years
• Cleansed Data – retain 3 years
• Conformed Data – retain 3 years
• Presented Data – retain last copy only
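The time based tiers above map onto per-cluster retention elements in the corresponding feed definitions; illustrative values:

<retention limit="months(60)" action="delete"/>   <!-- Staged Data: keep 5 years -->
<retention limit="months(36)" action="delete"/>   <!-- Cleansed / Conformed Data: keep 3 years -->

"Retain last copy only" does not reduce to a simple time limit and is listed here only as it appears on the slide.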
Falcon Example: Late Data Handling
> Processing waits until all required input data is available.
> Checks for late data arrivals and re-triggers processing as necessary (see the sketch below).
> Eliminates writing complex late-data handling rules within applications.
(Diagram) Online transaction data (via Sqoop) and web log data (via FTP) are staged and combined into one dataset; the pipeline waits up to 4 hours for the FTP data to arrive.
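The 4-hour wait corresponds roughly to a late-arrival cut-off on the FTP-delivered feed plus a late-process policy on the consuming process; a sketch based on the feed and process entity specs, with illustrative names and delays:

<!-- In the web log feed definition -->
<late-arrival cut-off="hours(4)"/>

<!-- In the consuming process definition -->
<late-process policy="exp-backoff" delay="hours(1)">
  <late-input input="weblog" workflow-path="/apps/pipeline/late-handling-workflow"/>
</late-process>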
Learn Falcon Using Tutorials
• http://hortonworks.com/hadoop-tutorial/defining-processing-data-end-end-data-pipeline-apache-falcon/
• More to come…
• Questions – please reach out to [email protected]
Falcon Usage in a Pharma Company
Ivo Lašek
03/24/2016 – Hadoop Data Management and Data Movement
(Diagram) Focus areas: Search, Data Integration, Data Analytics, Open Data
Public Brazilian Data
(Diagram, built up over several slides) Data Lake architecture: source data is merged and cleaned into the lake, a Data Catalog sits on top, and Security and Data Governance span the whole stack.
Falcon Usage
Datasets and Feeds
• HDFS and Hive based datasets
• An HDFS folder or a Hive table is a single feed in Falcon
• Our dataset represents an HDFS folder or a collection of Hive tables
• Our dataset therefore corresponds to one or more Falcon feeds
Dataset Level Properties
• Need to set dataset-level properties, not table-level
  • Retention policy, frequency, etc.
• Currently we use a middleware layer that translates datasets to feeds
  • Need to keep the primary information in the middleware layer
  • Potential synchronization issues
  • Falcon can't be accessed directly
Parametrized Scripts
INSERT INTO TABLE ${falcon_output_database}.${falcon_output_table} PARTITION (${falcon_output_partitions_hive})
select *
from ${falcon_input1_database}.${falcon_input1_table} table1
join ${falcon_input2_database}.${falcon_input2_table} table2
on table1.common_id = table2.common_id
WHERE ${falcon_input1_partition_filter_hive} AND
${falcon_input2_partition_filter_hive}

-- At run time the partition filters resolve to concrete values, e.g.:
-- WHERE ds = '2015-08-14-09-00' AND ds = '2015-08-14-10-00'
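These ${falcon_*} variables are supplied by Falcon from the process definition that runs the script: for the Hive engine, an input named input1 yields ${falcon_input1_database}, ${falcon_input1_table} and ${falcon_input1_partition_filter_hive}, and the output name drives the ${falcon_output_*} set (that is my reading of the parameter passing; verify against your Falcon version). A hedged sketch of such a process, with illustrative entity and feed names:

<process name="joinDatasets" xmlns="uri:falcon:process:0.1">
  <clusters>
    <cluster name="primaryCluster">
      <validity start="2015-08-14T00:00Z" end="2099-12-31T00:00Z"/>
    </cluster>
  </clusters>
  <parallel>1</parallel>
  <order>FIFO</order>
  <frequency>hours(1)</frequency>
  <timezone>UTC</timezone>
  <inputs>
    <!-- The names "input1" and "input2" determine the ${falcon_input1_*} and ${falcon_input2_*} variables -->
    <input name="input1" feed="transactionsTableFeed" start="now(0,0)" end="now(0,0)"/>
    <input name="input2" feed="weblogTableFeed" start="now(0,0)" end="now(0,0)"/>
  </inputs>
  <outputs>
    <!-- Likewise "output" drives ${falcon_output_*} -->
    <output name="output" feed="combinedTableFeed" instance="now(0,0)"/>
  </outputs>
  <workflow engine="hive" path="/apps/pipeline/join.hql"/>
</process>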
Processes
• Process chaining
• We used to use a pull model, but it was constrained to Sqoop and Oozie based ingests
• Need to support external ingestion tools (e.g. ETL)
• Push model enabled by the feed availability flag (see the sketch below)
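With the push model, the external ingestion tool writes a marker file once an instance is complete, and the feed's availability flag points at that file; a minimal sketch, assuming the conventional _SUCCESS marker and omitting the other required feed elements:

<feed name="externallyIngestedData" xmlns="uri:falcon:feed:0.1">
  <!-- Falcon treats an instance as available only once this file exists in its directory -->
  <availabilityFlag>_SUCCESS</availabilityFlag>
  <frequency>hours(1)</frequency>
  <!-- clusters, locations, ACL and schema as in any other feed -->
</feed>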
Collaboration on Falcon (Done)
• Falcon REST API trusted user support
  • Impersonation is possible
  • Necessary for our middleware layer
  • FALCON-1027
  • Available in Falcon 0.8
Wish List
• Retention policy
  • For hundreds of tables, hundreds of Oozie jobs are launched at the same time to check retention
• Kerberos
  • The Kerberos ticket for the Falcon principal expires after 1 day and Falcon needs to renew it
  • Workaround: Falcon restarts twice a day
• Explicitly triggered run of a process (off schedule)
• Version based retention policy (not only time based)
• Support for streaming
• Additional storage backends (e.g. HBase)
Contacts
• Ivo Lasek ([email protected])
• Twitter: @ilasek
• http://www.merck.com/
• http://www.msdit.cz/
Falcon Features: What’s New in 0.9?
$whoami
• Pallavi Rao
• Architect, InMobi
• Committer, Apache Falcon
• Contributor, Apache PIG (on Spark)
New Features
• Import from a DB and export to a DB
• Native Scheduler
• Enhanced Falcon Unit API
• Hive DR replication metrics via CLI
Data Import / Export
Data Management Actions
The Missing Piece
Data Import
(Diagram) RDBMS → Falcon feed → HDFS
• Different modes of extraction: full or incremental
• Different modes of output (merge): snapshot, append
• Include/exclude columns (sketched below)
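In the feed spec this surfaces as an import block (and a mirrored export block) inside the feed's source cluster; a sketch shaped after the documented elements, with illustrative datasource and table names:

<cluster name="primaryCluster" type="source">
  <validity start="2016-03-01T00:00Z" end="2099-12-31T00:00Z"/>
  <retention limit="days(90)" action="delete"/>
  <import>
    <source name="customer-db" tableName="customers">
      <!-- extract type: full or incremental; merge policy: snapshot or append -->
      <extract type="full">
        <mergepolicy>snapshot</mergepolicy>
      </extract>
      <fields>
        <includes>
          <field>id</field>
          <field>name</field>
        </includes>
      </fields>
    </source>
  </import>
  <export>
    <!-- load type: insert or updateonly -->
    <target name="customer-db" tableName="customer_summary">
      <load type="insert"/>
    </target>
  </export>
</cluster>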
Data Export
(Diagram) HDFS → Falcon feed → RDBMS
• Different modes of load: insert, update-only
• Include/exclude columns
Native Scheduler
Why Build a Native Scheduler?
• Falcon uses Oozie for:
  • DAG execution
  • Scheduling – gaps exist:
    • Simple periodic scheduling without data gating
    • Cron + calendar based scheduling with/without data gating
    • Flexible data gating
    • Support for aperiodic datasets and triggers based on data availability
    • Support for external triggers
Scheduler - Before
(Diagram) Before: the Falcon Server delegates both scheduling and execution to the workflow engine (Oozie).
Scheduler - The Plan
• Time based scheduling – available in 0.9
• Data based gating – will be available in 0.10
• Complete parity with Oozie and additional features – the release after
Scheduler - After
(Diagram) After: the Falcon Server contains its own scheduler and can hand execution to any DAG executor.
Additional Benefits
• Understands the notion of a pipeline
• Better throttling primitives
• Prioritization and backlog catch-up
Falcon Unit
Motivation for Falcon Unit
• User errors caught only at deployment time
  • Input/output feeds and paths not getting resolved
  • Errors in specification
• Integration tests require environment setup/teardown
  • Messy deployment scripts
  • Time consuming
• Debugging was cumbersome
  • Logs scattered
Falcon Unit
(Diagram) A test suite runs against Falcon Unit, which can target either an in-process execution environment (local Oozie, local file system, local job runner, local message queue) or an actual cluster (Oozie, HDFS, YARN, ActiveMQ).
What You Can Test
Data Management: data creation, data injection, retention, replication
Data Governance: lineage, data availability for verification
Process Management: validation of the definition, entity scheduling and status verification, correctness of the data window being picked up, reruns, missing dependencies/properties
(Slide legend marks each capability as available in 0.8, available in 0.9, or future.)
For More Information
• https://cwiki.apache.org/confluence/display/FALCON/Release+Notes
• https://blogs.apache.org/falcon/entry/what_s_new_in_falcon
• http://falcon.apache.org
Demo
Demo Pipeline
(Diagram) RDBMS → Falcon feed "In" (HDFS) → Falcon process "Copy Cat" → Falcon feed "Out" (HDFS) → RDBMS
Hive Disaster Recovery
• Hive event based replication
  • Hive should set the hive.metastore.event.listeners property to org.apache.hive.hcatalog.listener.DbNotificationListener (see the hive-site.xml fragment below)
  • Requires Hive version 1.2.0 or above
• Uses the Falcon Recipe framework to support Hive DR
• Requires a bootstrap operation from the user
• Will replicate: DB, table, partition
  – add/drop partition, update, delete, alter
• Won't replicate: views, roles, direct HDFS writes without registering metadata
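The listener setting from the first bullet as a hive-site.xml fragment (the property name and class come from the slide; the surrounding XML is the usual Hadoop configuration boilerplate):

<property>
  <name>hive.metastore.event.listeners</name>
  <value>org.apache.hive.hcatalog.listener.DbNotificationListener</value>
</property>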
New Features in 0.10
Server Side Extensions
• Provide the capability to add Falcon extensions that deliver a specific data management function
  – Data anonymization, masking, etc.
• Managed and accessed like other standard Falcon entities
  – UI, CLI and REST API access
• Better manageability than client-side recipes
• Types of extensions
  – Trusted/provided extensions: OOTB extensions that run in the Falcon context
  – Custom extensions: user defined recipes; extension cooking is done outside the Falcon context, in a new process
Server side Extensions (Cont.)
• Extension Repository Management
  – Templatized entities and parameterized workflows used during extension cooking to produce well-constructed Falcon entities are referred to as extension artifacts
  – The extension artifact store is an HDFS based store that Falcon maintains to hold the extension artifacts
  – Configured via the "*.extension.store.uri" property in the Falcon startup properties (see the snippet below)
  – REST API/CLI support is provided for extension store management
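What that entry can look like in Falcon's startup.properties (the property key is from the slide; the HDFS path is a made-up example):

# Location of the extension artifact store
*.extension.store.uri=hdfs://nameservice1/apps/falcon/extensions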
Spark Integration
• Supports Spark as a processing option alongside Hive, Pig and Oozie workflows
• Enables users to implement data management functions in Spark (see the sketch below)
• Both Java and Python applications are supported
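A rough sketch of how a Spark based process might be declared; the spark-attributes element names follow my reading of the 0.10 process schema and should be checked against the released XSD, and the paths and class names are illustrative:

<process name="sparkCleanse" xmlns="uri:falcon:process:0.1">
  <!-- clusters, frequency, inputs and outputs as in any other process -->
  <workflow engine="spark" path="/apps/pipeline/spark"/>
  <spark-attributes>
    <master>yarn-cluster</master>
    <name>Cleanse Staged Data</name>
    <class>com.example.pipeline.CleanseJob</class>
    <jar>/apps/pipeline/lib/pipeline-jobs.jar</jar>
    <spark-opts>--num-executors 4 --executor-memory 2g</spark-opts>
  </spark-attributes>
</process>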
Data Ingestion and Export
• Relational database ingestion and export
  – Falcon supports defining and scheduling import and export jobs
  – Supports Datasource as a top-level abstraction (sketched below)
  – Leverages Sqoop 1 internally
  – Now supports HCatalog tables as source and target for export and import
  – Support for jceks based password aliases
  – WIP to support resource throttling on data sources
• Support for other types of data sources
  – WIP to support data sources other than relational databases
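A rough sketch of a Datasource entity for a MySQL source with a jceks password alias; the element names are from memory of the 0.10 datasource spec and should be checked against the XSD, and the endpoint, names and alias are illustrative:

<datasource colo="primary" name="customer-db" type="mysql" xmlns="uri:falcon:datasource:0.1">
  <interfaces>
    <interface type="readonly" endpoint="jdbc:mysql://dbhost:3306/customers">
      <credential type="password-alias">
        <userName>falcon</userName>
        <passwordAlias>
          <alias>customer.db.password</alias>
          <providerPath>jceks://hdfs/user/falcon/credentials.jceks</providerPath>
        </passwordAlias>
      </credential>
    </interface>
  </interfaces>
  <driver>
    <clazz>com.mysql.jdbc.Driver</clazz>
    <jar>/apps/falcon/lib/mysql-connector-java.jar</jar>
  </driver>
  <ACL owner="falcon" group="users" permission="0755"/>
</datasource>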
HDFS Snapshot Management and Replication
• Use case
  – Cost-effective replication that only copies modified blocks
  – Provides the ability to roll back in case of data corruption
• Falcon will use server side extensions to implement this feature
• The extension will do the following (command sketch below)
  – Create a snapshot on the source directory
  – Replicate the directory using the current and previous snapshot (if one exists)
  – Create a snapshot on the target directory
• Snapshot retention policy
  – Users can specify an age limit and a number N of snapshots to retain
  – Falcon will delete snapshots on source and target that are older than the age limit while retaining at least N snapshots
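The underlying HDFS mechanics look roughly like the standard snapshot-diff distcp pattern (paths and snapshot names are illustrative; these are plain HDFS/distcp commands, not Falcon's exact internal calls):

# One-time: make the directories snapshottable (on both clusters)
hdfs dfsadmin -allowSnapshot /data/warehouse

# Each cycle: snapshot the source
hdfs dfs -createSnapshot /data/warehouse s-2016-03-24

# Copy only the blocks that changed since the previous snapshot
hadoop distcp -update -diff s-2016-03-17 s-2016-03-24 \
  hdfs://source-nn:8020/data/warehouse hdfs://target-nn:8020/data/warehouse

# Snapshot the target at the same point so the next cycle has a baseline
hdfs dfs -createSnapshot hdfs://target-nn:8020/data/warehouse s-2016-03-24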
New Feature: Cluster Entity Update
• Falcon will provide the ability to update a cluster entity without having to delete and re-submit the entity
• Use cases
  – Update a Hadoop installation from unsecure to secure
  – Update from non-HA to high availability
• Cluster entity update expects the underlying HDFS and Oozie installations to remain the same
• Cluster update is only allowed by the super user while Falcon is in safe mode (CLI sketch below)
• Falcon will do the following
  – Update the cluster entity while the server is in safe mode
  – When Falcon starts in normal mode, the coordinator/bundle jobs for all dependent feed/process entities are updated in the workflow engine
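How that interaction might look from the CLI; this is a sketch, and in particular the safe-mode flag and the update syntax are my reading of the 0.10 client rather than something stated on the slide:

# Put the server into safe mode (super user only)
falcon admin -setsafemode true

# Update the cluster entity in place
falcon entity -type cluster -name primaryCluster -file updated-cluster.xml -update

# Back to normal mode; dependent feed/process coordinators get refreshed
falcon admin -setsafemode false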
Falcon Safe Mode
• The Falcon server will support starting in safe mode
• Use cases
  – Supports rolling upgrades
  – Useful while updating cluster entities
• When in safe mode, users can do the following
  – Read operations on all entities/instances
  – Suspend or kill feed/process instances
  – Update cluster entities
• When in safe mode, users cannot do the following
  – Submit entity operations
  – Schedule operations on feeds/processes
  – Validate, touch, dry-run operations
  – Delete entities
  – Instance rerun/resume operations
Hybrid Hadoop Data Pipeline: HDP + Azure Data Factory
• On-premises and cloud hybrid Hadoop data pipeline
  – Build a pipeline for HDP data processing on Azure (e.g. Hive)
  – Copy data to Azure blobs (e.g. aggregation results from Hive)
  – Use the Azure Machine Learning platform for predictive analysis
• Keep sensitive data (e.g. PII) on-premises for privacy and compliance reasons
• Share non-sensitive data in the cloud for cross-region replication, recovery, data prediction, etc.
Hybrid Hadoop Data Pipeline: HDP + Azure Data Factory
• Pipeline building and job tracking
Search and Lineage - Current
• Entity search: filter by name subsequence, tags, …
• Instance search within one entity: filter by time and status
• Lineage for succeeded instances
Search and Lineage - New
• Global instance search
  – Provides an instance status summary
  – Improves search performance
• Lineage for instances in all statuses
Apache Sqoop 2
Abraham Fine
Who am I?
• Software Engineer at Cloudera
• Previously:
  • Software Engineer at Yahoo!
  • Software Engineer at BrightRoll
  • Student at the University of Illinois at Urbana-Champaign
Committer – Apache Sqoop
What is Apache Sqoop?
“Apache Sqoop™ is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.”
So much more than that… (e.g. (S)FTP)
Sqoop 1: A Brief Overview
Sqoop 1
• Based on connectors
  • Responsible for metadata lookups and data transfer
  • The majority of connectors are JDBC based
  • Non-JDBC (direct) connectors exist for optimized data transfer
• Connectors are responsible for all supported functionality
  • HBase import, Avro support, …
Sqoop 1 Architecture
(Diagram) The Sqoop 1 client handles metadata lookups against the database and job submission to the cluster.
Sqoop 1 Shortcomings
• Client needs…
  • direct access to the database
  • access to the Hadoop configuration
• Connectors strictly coupled with MapReduce
• No way to manage database passwords for users
• Resource management is difficult
• Client needs the JDK
• Very long, complicated command line scripts (example below)
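For context, a typical Sqoop 1 invocation; the connection string, table and option values are made up for illustration, but the flags are standard Sqoop 1 options:

sqoop import \
  --connect jdbc:mysql://dbhost:3306/sales \
  --username etl_user \
  --password-file /user/etl/.db_password \
  --table orders \
  --target-dir /data/staged/orders \
  --num-mappers 8 \
  --fields-terminated-by '\t' \
  --incremental append \
  --check-column order_id \
  --last-value 1000000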
Sqoop 2: Sqoop as a Service
Sqoop 2 Architecture
(Diagram) Clients talk to a central Sqoop 2 server backed by a repository; the server submits jobs to the cluster.
Sqoop 2 Internals
Connectors
• Connectors implement an interface that allows Sqoop to retrieve and write data
• JDBC and HDFS are implemented with connectors
• They define the configuration needed to work with a type of data source
• Anyone can write connectors for Sqoop 2
Links
• If connectors are classes, then links are instances
• Links define connections to individual data sources
• Links contain inputs which are values assigned to the configuration specified in the link’s connector
• “Sensitive values” are hidden from the user and encrypted in the repository
Sqoop 2 Internals
(Diagram) A connector can have many links (Link A, Link B); each link holds its own inputs (Input A.A and Input A.B for Link A, Input B.A for Link B).
Jobs
• A "from" link
• A "to" link
• Some extra configuration (for resource management, etc.)
(Diagram) Job = Link A (from) + Link B (to) + FromJobConf + ToJobConf (shell sketch below)
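A hedged Sqoop 2 shell sketch of the link/job model; link and job names are made up, and the exact option names vary across 1.99.x releases, so treat the flags as indicative:

sqoop2-shell

# Point the shell at the Sqoop 2 server
sqoop:000> set server --host sqoop2-host --port 12000 --webapp sqoop

# One link per data source (links are instances of a connector)
sqoop:000> create link -connector generic-jdbc-connector   # prompts for name, JDBC URL, credentials
sqoop:000> create link -connector hdfs-connector           # prompts for name and target directory

# A job wires a "from" link to a "to" link
sqoop:000> create job -f "sales-db" -t "staging-hdfs"

# Run it
sqoop:000> start job -name "sales-to-hdfs"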
2 Classes of Sqoop User
• Admin/DBA
  • Sets up links and manages passwords to databases
• User
  • Sets up and runs jobs
Demo!
Questions?
Abraham Fine
[email protected]
https://www.linkedin.com/in/abrahamfine
@abrahamfine
Thank You