
A Reference Architecture for ETL 2.0

Description:
More and more organizations are moving their ETL workloads to a Hadoop-based ELT grid architecture. Hadoop's inherent capabilities, especially its ability to do late binding, address some of the key challenges of traditional ETL platforms. In this presentation, attendees will learn the key factors, considerations and lessons around ETL on Hadoop, including the pros and cons of different extract and load strategies, the best ways to batch data, buffering and compression considerations, leveraging HCatalog, data transformation, integration with existing data transformation tools, the advantages of different ways of exchanging data, and using Hadoop as a data integration layer. This is a very popular presentation on ETL and Hadoop.
Transcript
Page 1: A Reference Architecture for ETL 2.0

© Hortonworks Inc. 2013

ETL 2.0 Reference Architecture


George Vetticaden – Solutions Engineer, Hortonworks
George Trujillo – Master Principal Big Data Specialist, Hortonworks

Page 2: A Reference Architecture for ETL 2.0


George Vetticaden

• Solutions Engineer – Big Data at Hortonworks
• Chief Architect and Co-Founder of eScreeningz
§ Enterprise Architect, vFabric Cloud App Platform – VMware
§ Specialties:

§ Big Data and Cloud Computing
§ Hadoop
§ Cloud Application Platforms (PaaS) – Cloud Foundry, Heroku
§ Infrastructure as a Service platforms – vCloud Director, AWS
§ Virtualization – vSphere, vCenter
§ J2EE
§ Hibernate, Spring
§ ESB and Middleware Integration
§ SOA Architecture

Page 3: A Reference Architecture for ETL 2.0


George Trujillo

• Master Principal Big Data Specialist – Hortonworks
• Tier One Big Data, Oracle and BCA Specialist – VMware
• 20+ years as an Oracle DBA: DW, BI, RAC, Streams, Data Guard, Performance, Backup/Recovery
§ Oracle Double ACE
§ Sun Microsystems Ambassador for Application Middleware
§ Oracle Fusion Council & Oracle Beta Leadership Council
§ Two terms on the Independent Oracle Users Group Board of Directors
§ Recognized as one of the "Oracles of Oracle" by IOUG
§ MySQL Certified DBA
§ VMware Certified Instructor (VCI)


Page 4: A Reference Architecture for ETL 2.0


Challenges with a Traditional ETL Platform


• Incapable of handling loosely structured data, or handles it only with high complexity
• Data discarded due to cost and/or performance
• A lot of time spent understanding source systems and defining destination data structures
• High latency between data generation and data availability
• No visibility into transactional data
• Doesn't scale linearly
• High license costs
• EDW used as an ETL tool, with hundreds of transient staging tables

Page 5: A Reference Architecture for ETL 2.0


Hadoop Based ETL Platform


• Support for any type of data: structured or unstructured
• Linearly scalable on commodity hardware
• Massively parallel storage and compute
• Store raw transactional data
• Store 7+ years of data with no archiving
• Data lineage: store intermediate stages of data
• Becomes a powerful analytics platform
• Provides data for use with minimal delay and latency
• Enables real-time capture of source data
• The data warehouse can focus less on storage and transformation and more on analytics

Page 6: A Reference Architecture for ETL 2.0


Key Capability in Hadoop: Late binding


[Diagram: side-by-side comparison. Traditional ETL: web logs, click streams, machine-generated data and OLTP sources flow through an ETL server that stores transformed data before loading the data mart/EDW and client apps. Hadoop: the same sources land raw in the Hortonworks Data Platform (Hadoop core plus data and operational services), and transformations are applied dynamically when the data is read.]

With traditional ETL, structure must be agreed upon far in advance and is difficult to change.

With Hadoop, capture all data and structure it as business needs evolve.
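To make the late-binding idea concrete, here is a small HiveQL/HCatalog sketch, assuming raw tweet JSON already sits in one HDFS directory and using the JsonSerDe that appears later in the deck; the table names and path are illustrative, not from the slides:

-- The raw files are written once; structure is bound only when the data is read.
-- Two consumers can bind different schemas to the same raw directory.
create external table tweets_minimal (
  tweetmessage string,
  createddate string
)
row format serde 'org.apache.hcatalog.data.JsonSerDe'
location '/data/raw/tweets';

create external table tweets_geo (
  geolocation struct<latitude:float, longitude:float>,
  createddate string
)
row format serde 'org.apache.hcatalog.data.JsonSerDe'
location '/data/raw/tweets';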

Page 7: A Reference Architecture for ETL 2.0


Organize Tiers and Process with Metadata


• Raw Tier – Extract & Load: WebHDFS, Flume, Sqoop
• Work Tier – Standardize, Cleanse, Transform: MapReduce, Pig
• Gold/Storage Tier – Transform, Integrate, Store: MapReduce, Pig
• Access Tier – Conform, Summarize, Access: HiveQL, Pig

HCatalog provides unified metadata access to Pig, Hive & MapReduce.

• Organize data based on source/derived relationships
• Allows for fault recovery and rebuild processing (a database-per-tier sketch follows below)
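One way to make these tiers concrete is as separate Hive/HCatalog databases, one per tier; a minimal HiveQL sketch, where the database and table names are illustrative assumptions:

-- One database per logical tier keeps source/derived relationships explicit.
create database if not exists raw_tier;
create database if not exists work_tier;
create database if not exists gold_tier;
create database if not exists access_tier;

-- Landed data is registered in the raw tier as an external table,
-- so downstream tiers can always be rebuilt from it after a failure.
create external table if not exists raw_tier.weblogs_raw (line string)
location '/data/raw/weblogs';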

Page 8: A Reference Architecture for ETL 2.0


ETL Reference Architecture


[Diagram: ETL reference architecture. Flow: Extract & Load → Model/Apply Metadata → Transform & Aggregate → Explore, Visualize, Report, Analyze → Publish/Exchange and Publish Event Signal Data.]

Page 9: A Reference Architecture for ETL 2.0


ETL Reference Architecture


[Diagram: ETL reference architecture, repeated as a roadmap. Flow: Extract & Load → Organize/Model, Create Metadata → Transform & Aggregate → Explore, Visualize, Report, Analyze → Publish/Exchange and Publish Event Signal Data.]

Page 10: A Reference Architecture for ETL 2.0

HCatalog

Page 11: A Reference Architecture for ETL 2.0


HCatalog

Metadata Services with HCatalog

Without HCatalog: raw Hadoop data, inconsistent and unknown metadata, tool-specific access.
With HCatalog: table access, aligned metadata, and a REST API.

Apache HCatalog provides flexible metadata services across tools and external access.

• Consistency of metadata and data models across tools (MapReduce, Pig, HBase, and Hive)
• Accessibility: share data as tables in and out of HDFS
• Availability: enables flexible, thin-client access via a REST API

Shared table and schema management opens the platform


Page 12: A Reference Architecture for ETL 2.0


Step 2 – HCatalog, Metadata

• Best practice: use HCatalog to manage metadata
  – Schema/structure applied when needed, via tables and partitions
  – Late binding at work: multiple and changing bindings are supported
  – Abstracts the location of the data, so it can scale and be maintained easily over time
  – Abstracts the format of the data files (e.g. compression type, HL7 v2, HL7 v3)
• Cope with changes in source data seamlessly
  – Heterogeneous schemas across partitions within HCatalog mean consumers of the data are unaffected as the source system evolves
  – E.g. partition '2012-01-01' of table X has a schema with 30 fields in HL7 v2 format, while partition '2013-01-01' has 35 fields in HL7 v3 format (see the partition sketch below)
• RESTful API via WebHCat
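A minimal HiveQL/HCatalog sketch of this kind of partition management; the table name, column and paths are illustrative assumptions, not from the deck:

-- Consumers address only the table and partition key, never the underlying path.
create external table if not exists table_x (payload string)
partitioned by (ds string)
location '/data/raw/table_x';

-- Each load adds a partition; its location (and, via per-partition SerDe or
-- file-format settings, even its physical format) can differ from older partitions.
alter table table_x add partition (ds = '2012-01-01') location '/data/raw/table_x/2012-01-01';
alter table table_x add partition (ds = '2013-01-01') location '/data/raw/table_x/2013-01-01';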


Page 13: A Reference Architecture for ETL 2.0


Sample Tweet Data as JSON

{
  "user": {
    "name": "George Vetticaden - Name",
    "id": 10000000,
    "userlocation": "Chicago",
    "screenname": "gvetticadenScreenName",
    "geoenabled": false
  },
  "tweetmessage": "hello world",
  "createddate": "2013-06-18T11:47:10",
  "geolocation": {
    "latitude": 1000.0,
    "longitude": 10000.0
  }
}

Page 14: A Reference Architecture for ETL 2.0


Hive/HCat Schema for the Twitter Data

create external table tweet (
  user struct<userlocation:string, id:bigint, name:string,
              screenname:string, geoenabled:string>,
  geolocation struct<latitude:float, longitude:float>,
  tweetmessage string,
  createddate string
)
row format serde 'org.apache.hcatalog.data.JsonSerDe'
location '/user/kibana/twitter/landing';
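A quick illustrative HiveQL query against this schema (not from the deck), showing how the nested struct fields are addressed:

-- Pull a few fields out of the nested user and geolocation structs.
select user.screenname, user.userlocation, tweetmessage
from tweet
where geolocation.latitude is not null;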

Page 15: A Reference Architecture for ETL 2.0


Pig Example


Count how many times users tweeted a URL.

Without HCatalog:

-- assumes the myudfs.NotABot UDF has been registered
raw = load '/user/kibana/twitter/landing' as (user, url, tweetmessage);
botless = filter raw by myudfs.NotABot(user);
grpd = group botless by (url, user);
cntd = foreach grpd generate flatten(group), COUNT(botless);
store cntd into '/data/counted/20120530';

Using HCatalog:

raw = load 'tweet' using HCatLoader();    -- no need to know the file location or declare a schema
botless = filter raw by myudfs.NotABot(user) and ds == '20120530';    -- partition filter
grpd = group botless by (url, user);
cntd = foreach grpd generate flatten(group), COUNT(botless);
store cntd into 'counted' using HCatStorer();

Page 16: A Reference Architecture for ETL 2.0


ETL Reference Architecture


[Diagram: ETL reference architecture, repeated as a roadmap. Flow: Extract & Load → Organize/Model, Create Metadata → Transform & Aggregate → Explore, Visualize, Report, Analyze → Publish/Exchange and Publish Event Signal Data.]

Page 17: A Reference Architecture for ETL 2.0


Step 3&4 – Transform, Aggregate, Explore

• MapReduce – for programmers, when control matters
• Hive – HiveQL (SQL-like) for ad-hoc query and data exploration
• Pig – declarative data crunching and preprocessing (the "T" in ELT)
  – User Defined Functions (UDFs) for extensibility and portability, e.g. custom UDFs that call industry-specific data format parsers (SWIFT, X12, NACHA, HL7, HIPAA, etc.)
• HCatalog – consistent metadata and consistent data sharing across all tools
(A short HiveQL transform-and-aggregate sketch follows below.)
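A minimal HiveQL sketch of the transform-and-aggregate step, reusing the tweet table from the earlier slides; the summary table name and ORC storage format are assumptions, not from the deck:

-- Aggregate raw tweets into a per-location daily summary (the "T" in ELT)
-- and write it to an access-tier table.
create table if not exists tweet_summary (
  userlocation string,
  tweet_day string,
  tweet_count bigint
) stored as orc;

insert overwrite table tweet_summary
select user.userlocation,
       substr(createddate, 1, 10) as tweet_day,
       count(*) as tweet_count
from tweet
group by user.userlocation, substr(createddate, 1, 10);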


Page 18: A Reference Architecture for ETL 2.0

Common Processing Patterns

Page 19: A Reference Architecture for ETL 2.0


Common ETL Processing Patterns

• Long-term data retention

• Staging for Data Exploration

• Data Cleansing

• Data Enrichment


Page 20: A Reference Architecture for ETL 2.0


Important Dimensions to Consider

• Compression
• Buffering
• Data Format Containers
• Logical Processing Tiers (Raw, Work, Gold, Access)


Page 21: A Reference Architecture for ETL 2.0


Compression in Hadoop is Important

• The biggest performance bottleneck in Hadoop is read/write I/O
• Compression formats supported in HDP include gzip, bzip2, LZO, LZ4 and Snappy
• The right compression format depends on a number of factors, such as:
  – The size of the data
  – Whether faster compression/decompression or better compression effectiveness matters more (the space/time trade-off); faster compression and decompression usually come at the expense of smaller space savings
  – Whether compressed files need to be splittable for parallel MapReduce processing of a large file
(A Hive configuration sketch follows below.)
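One way to exercise these choices from Hive; the Snappy codec and block-level compression here are illustrative assumptions, not a recommendation from the deck:

-- Compress intermediate and final job output; use block compression for SequenceFiles.
set hive.exec.compress.intermediate=true;
set hive.exec.compress.output=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
set mapred.output.compression.type=BLOCK;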


Page 22: A Reference Architecture for ETL 2.0


Suitcase Pattern: Buffering and Compression

• Suitcase pattern
  – Before we travel, we take our clothes off the rack and pack them (easier to store)
  – We then unpack them when we arrive and put them back on the rack (easier to process)
  – Consider event data "traveling" over the network to Hadoop: we want to compress it before it makes the trip, but in a way that facilitates how we intend to process it once it arrives
• Suitcase pattern implementation
  – In Hadoop, generally speaking, record batches of several thousand to several hundred thousand bytes are the size range that matters
  – Buffering records during collection lets us compress a whole block of records and send it as a single record over the network to Hadoop, resulting in lower network and file I/O
  – Buffering records during collection also helps us handle bursts

Page 23: A Reference Architecture for ETL 2.0


Time Series: The Key to MapReduce

• Event data has a natural temporal ordering
  – Observations close together in time are more closely related than observations further apart
  – Time series analysis of events often makes use of the one-way ordering of time
• Batching by time is a composite pattern
  – Batches of records from a single event source (compressed and written as a single physical record in HDFS) are organized by time
  – Physical records in HDFS are organized into files by time
  – Metadata can be associated with both to support queries with time-range predicates
  – A sequence of files can be indexed by its highest timestamp inside HCatalog, so MapReduce does not have to open files outside the requested range
  – A sequence of physical records within a file can be partitioned by the highest timestamp (record-level metadata inside a SequenceFile), so mappers do not have to decompress batches outside the requested range
(A partitioning sketch follows below.)
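A minimal HiveQL sketch of this time-range idea at the partition level; the table name, hourly granularity and column layout are illustrative assumptions:

-- Batches land in hourly partitions; the partition keys act as the coarse
-- time index, so a time-range predicate only opens the matching partitions.
create external table if not exists event_batches (
  batch binary,
  max_event_ts string comment 'highest timestamp inside the batch'
)
partitioned by (dt string, hr string)
stored as sequencefile
location '/data/raw/event_batches';

-- Only partitions inside the requested window are read.
select count(*)
from event_batches
where dt = '2013-06-18' and hr between '00' and '05';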

Page 24: A Reference Architecture for ETL 2.0


Different Data Format Containers


• Sequence File
  – Description: a persistent data structure for binary key-value pairs. Row-oriented: the fields of each row are stored together as the contents of a single sequence-file record.
  – Key advantages: splittable; compressible at block and record level; works well as a container for small files (HDFS and MapReduce are optimized for large files, so packing small files into a sequence file makes storing and processing them more efficient).
• Avro File
  – Description: similar to sequence files (splittable, compressible, row-oriented), except that Avro files support schema evolution and bindings in multiple languages; the schema is stored in the file itself.
  – Key advantages: splittable; compressible at block and record level; well suited for data sets whose attributes/schema change constantly.
• RC File
  – Description: similar to sequence and Avro files, but column-oriented.
  – Key advantages: faster access to a subset of columns without a full scan across all columns.
• Optimized RC File (ORC)
  – Description: an optimized RCFile format supporting SQL-like types, with more efficient serialization/deserialization.
  – Key advantages: faster access in next-generation MapReduce; see HIVE-3874.

(A Hive DDL sketch follows below.)
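In Hive/HCatalog these containers are simply the storage format declared on a table; a small illustrative sketch (table names assumed; STORED AS ORC requires a Hive release that includes ORC, e.g. 0.11+):

-- Same logical schema, three different physical containers.
create table events_seq (id bigint, payload string) stored as sequencefile;
create table events_rc  (id bigint, payload string) stored as rcfile;
create table events_orc (id bigint, payload string) stored as orc;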

Page 25: A Reference Architecture for ETL 2.0


Best Practices for Processing Patterns


• Long-term data retention
  – Tier path: Raw → Gold
  – Data format: Avro or Sequence; compression: gzip/bzip2
  – Description: conversion of all raw data into sequence/Avro files with block compression, a usable but compressed data format. This can also involve aggregating smaller ingested files into larger sequence or Avro files. (A sketch follows below.)
• Staging for data exploration
  – Tier path: Raw → Access
  – Data format: RC, ORC; compression: LZO
  – Description: conversion of a subset of raw input (normalized tables) into an access-optimized data structure such as RCFile.
• Data cleansing
  – Tier path: Raw → Work
  – Data format: text (raw format); compression: none
  – Description: common ETL cleansing operations (e.g. discarding bad data, scrubbing, sanitizing).
• Data enrichment
  – Tier path: Raw → Work
  – Data format: Sequence; compression: LZO or none
  – Description: aggregations or calculation of fields based on analysis of data within Hadoop, or on information pulled from other sources ingested into Hadoop.
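A minimal HiveQL sketch of the long-term retention pattern, converting the raw tweet table into a block-compressed sequence-file copy; the gold table name and gzip codec are illustrative assumptions:

-- Write the gold-tier copy as block-compressed SequenceFiles.
set hive.exec.compress.output=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
set mapred.output.compression.type=BLOCK;

create table gold_tweets
stored as sequencefile
as select * from tweet;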

Page 26: A Reference Architecture for ETL 2.0


The Question That You Are Dying to Ask

What tooling do I have to orchestrate these ETL flows?


Page 27: A Reference Architecture for ETL 2.0


Falcon: One-stop Shop for Data Lifecycle

Apache Falcon provides (data management needs):
• Multi-cluster management
• Replication
• Scheduling
• Data reprocessing
• Dependency management

Falcon orchestrates (tools):
• Oozie
• Sqoop
• DistCp
• Flume
• MapReduce
• Hive and Pig jobs

Falcon provides a single interface to orchestrate the data lifecycle, so sophisticated data lifecycle management (DLM) can easily be added to Hadoop applications.

Page 28: A Reference Architecture for ETL 2.0


Falcon Usage at a Glance

> Falcon provides the key services that data processing applications need.
> Complex data processing logic is handled by Falcon instead of being hard-coded in apps.
> Faster development and higher quality for ETL, reporting and other data processing apps on Hadoop.

[Diagram: data management products or data processing applications (e.g. Herd, Continuum, customer management software) drive the Falcon Data Lifecycle Management Service through spec files or REST APIs; Falcon in turn provides data import and replication, scheduling and coordination, data lifecycle policies, multi-cluster management, and SLA management.]

Page 29: A Reference Architecture for ETL 2.0


Falcon Example: Multi-Cluster Failover

> Falcon manages workflow, replication, or both.
> Enables business continuity without requiring full data reprocessing.
> Failover clusters require less storage and CPU.

[Diagram: the primary Hadoop cluster holds staged, cleansed, conformed and presented data feeding BI and analytics; Falcon replicates the staged and presented data to a failover Hadoop cluster.]

Page 30: A Reference Architecture for ETL 2.0


Example – Data Lifecycle Management

• User creates entities using the DSL
  – A cluster entity for the primary cluster and one for the secondary (BCP) cluster
  – A data set (feed) entity
  – Submits them to Falcon via the RESTful API
• Falcon orchestrates these into scheduled workflows
  – Maintains the dependencies and relationships between entities
  – Instruments workflows for dependencies, retry logic, table/partition registration, notifications, etc.
  – Creates a scheduled recurring workflow for:
    – Copying data from source to target(s)
    – Purging expired data on source and target(s)

<cluster colo="colo-1" description="test cluster" name="cluster-primary" xmlns="uri:ivory:cluster:0.1">
  <interfaces>
    <interface type="readonly" endpoint="hftp://localhost:50070" version="1.1.1"/>
    <interface type="write" endpoint="hdfs://localhost:54310" version="1.1.1"/>
    <interface type="execute" endpoint="localhost:54311" version="1.1.1"/>
    <interface type="workflow" endpoint="http://localhost:11000/oozie/" version="3.3.0"/>
    <interface type="messaging" endpoint="tcp://localhost:61616?daemon=true" version="5.1.6"/>
  </interfaces>
</cluster>

<feed description="TestHourlySummary" name="TestHourlySummary" xmlns="uri:ivory:feed:0.1">
  <partitions/>
  <groups>bi</groups>
  <frequency>hours(1)</frequency>
  <late-arrival cut-off="hours(4)"/>
  <clusters>
    <cluster name="cluster-primary" type="source">
      <validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <retention limit="days(2)" action="delete"/>
    </cluster>
    <cluster name="cluster-BCP" type="target">
      <validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <retention limit="days(2)" action="delete"/>
    </cluster>
  </clusters>
  <locations>
    <location type="data" path="/projects/test/TestHourlySummary/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
    <location type="stats" path="/none"/>
    <location type="meta" path="/none"/>
  </locations>
  <ACL owner="venkatesh" group="users" permission="0755"/>
  <schema location="/none" provider="none"/>
</feed>

Page 31: A Reference Architecture for ETL 2.0


Thanks/Questions…


