Luo june27 1150am_room230_a_v2

Date post: 03-Jul-2015
Upload: hadoop-summit
Description:
Near real-time big data analytics is a reality via a new data pattern that avoids the latency and overhead of legacy ETL: the 3 Ts of Hadoop (Transfer, Transform, and Translate).

Transfer: Once a Hadoop infrastructure is in place, a mandate is needed to immediately and continuously transfer all enterprise data, from external and internal sources and through the various existing systems, into Hadoop. Previously, enterprise data was isolated, disconnected, and monolithically segmented. Through this T, source data is consolidated and centralized in Hadoop in near real time, almost as it is generated.

Transform: Most enterprise data flowing into Hadoop is transactional in nature. Analytics requires that the data be transformed from record-based OLTP form to column-based OLAP form. This T is not the same as the T in ETL, because the granularity of the data feeds must be retained. The key is to transform in place within Hadoop, without further data movement from Hadoop to other legacy systems.

Translate: Analytical data is exposed for consumption through pre-computed or on-the-fly views. Analysis and reporting, for both scheduled and ad hoc needs, become interactive for analysts and end users, integrated in and on top of Hadoop.
Transcript
Page 1: Luo june27 1150am_room230_a_v2

MetaScale is a subsidiary of Sears Holdings Corporation

The 3 Ts of Hadoop

Wuheng Luo

Ankur Gupta

06.2013

Page 2: Luo june27 1150am_room230_a_v2

The 3 Ts of Hadoop

3-Stage Circular Process of Enterprise Big Data

Page 3: Luo june27 1150am_room230_a_v2

What are the 3Ts?

3Ts = Transfer, Transform, and Translate

A new enterprise big data pattern to bring disruptive change to conventional ETL

To leverage Hadoop for streamlining data processes

To move toward real-time analytics

Page 4: Luo june27 1150am_room230_a_v2

The 3Ts Goal

To simplify enterprise data processing and reduce latency, turning enterprise data from raw form into products of discovery so as to better support business decisions.

Page 5: Luo june27 1150am_room230_a_v2

The 3Ts One Liners

Transfer

Once the Hadoop system is in place, a mandate is needed to immediately and continuously capture and deliver all enterprise data, from all data sources, through all data systems, to Hadoop, and store the data in HDFS.

Transform

When source data is in, clean, standardize, and convert the data through dimensional modeling. Data transformation should be performed in place within Hadoop, without moving the data out again for integration reasons.

Translate

Finish the data flow cycle by turning analytical data aggregated in Hadoop into data products of business wisdom. Use batch and streaming tools built on top of Hadoop to interact with data scientists and end users.

Page 6: Luo june27 1150am_room230_a_v2

Hadoop as Enterprise Data Hub

“Data Hub” is not a new concept, but:

| Conventional Data Hub | Hadoop Enterprise Data Hub |
| --- | --- |
| RDBMS or EDW based | Hadoop ecosystem based |
| No consistent architectural style: ODS, MDM, messaging or publish-subscribe, etc. | 3-phased architecture covering the full enterprise data flow cycle, from data source to data products |
| Relies heavily on ETL | 3Ts-driven |
| Intermediate, partial, siloed | True center of enterprise data |
| … | … |

Page 7: Luo june27 1150am_room230_a_v2

TRANSFER

Sourcing Data into Hadoop

Intent

Continuously capture all enterprise data at the earliest possible touch points, deliver the data from all sources, through all source data systems, to Hadoop, and store the data in HDFS.

Page 8: Luo june27 1150am_room230_a_v2

TRANSFER

Motivation

To gain a distinctive competitive capability, enterprises need to build an integrated data infrastructure as the foundation for big data analytics. Use Hadoop as THE centralized enterprise data repository, and make it the grand destination for all enterprise source data.

Page 9: Luo june27 1150am_room230_a_v2

TRANSFER

(3 Ts’) Transfer vs. (ETL’s) Extract

| Traditional ETL: Extract | Hadoop: Transfer |
| --- | --- |
| Bottom-up | Top-down |
| Task/project specific | Enterprise-wide mandate |
| Passive | Proactive |
| Data is not available when needed | Data is ready when needed |
| Same datasets are moved around again and again, with no value added | Move the data once and use it many times, each time with value increased |

Page 10: Luo june27 1150am_room230_a_v2

TRANSFER

Consequences

| Before | After |
| --- | --- |
| Isolated, disconnected in various siloed data/file systems | Consolidated and centralized in Hadoop |
| Monolithically segmented | Heterogeneous, diverse, huge |
| Separated and partial | Federated and holistic |

Page 11: Luo june27 1150am_room230_a_v2

TRANSFER

Implementation

Always do a data gap analysis first

Fork the ingestion in both batch and streaming if needed

Have a delivery plan for the data feed

Synchronize data changes between source system and Hadoop
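
The last point, keeping Hadoop in step with source changes, can be illustrated with a minimal Python sketch. It is not a Hadoop API: `ChangeEvent`, `apply_changes`, and `live_rows` are hypothetical names, an in-memory dict stands in for data stored in HDFS, and the sketch assumes that every source row carries a per-key version number. Under those assumptions the newest version wins, so the same events can arrive from both the batch and the streaming fork without harm.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class ChangeEvent:
    """One change captured at the source (hypothetical record shape)."""
    key: str                        # primary key of the source row
    op: str                         # "upsert" or "delete"
    version: int                    # assumed to increase per key at the source
    payload: Optional[dict] = None  # column values for an upsert


def apply_changes(snapshot: Dict[str, dict],
                  events: List[ChangeEvent]) -> Dict[str, dict]:
    """Merge change events into a copy of the last snapshot.

    Rows remember the version that produced them, and deletes are kept as
    tombstones, so replaying old or duplicate events changes nothing.
    """
    merged = {key: dict(row) for key, row in snapshot.items()}
    for ev in sorted(events, key=lambda e: e.version):
        current = merged.get(ev.key)
        if current is not None and current["_version"] >= ev.version:
            continue  # stale or duplicate event
        if ev.op == "delete":
            merged[ev.key] = {"_version": ev.version, "_deleted": True}
        elif ev.op == "upsert":
            merged[ev.key] = {**(ev.payload or {}), "_version": ev.version}
        else:
            raise ValueError(f"unknown operation: {ev.op!r}")
    return merged


def live_rows(merged: Dict[str, dict]) -> Dict[str, dict]:
    """Hide tombstones from readers."""
    return {k: row for k, row in merged.items() if not row.get("_deleted")}
```

Keeping tombstones instead of dropping deleted keys is a design choice made for this sketch: it is what lets a late, out-of-date upsert be recognized as stale.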

Page 12: Luo june27 1150am_room230_a_v2

TRANSFORM

Integrating Data within Hadoop

Intent

Keep the data flowing beyond the ingest phase by transforming the data from dirty to clean, from raw to standardized, and from transactional to analytical, all within Hadoop.

Page 13: Luo june27 1150am_room230_a_v2

TRANSFORM

Motivation

As the latency, or speed, from raw data to business insight becomes the focal point of enterprise data analytics, use Hadoop as the data integration platform to perform in-place data transformation.

Page 14: Luo june27 1150am_room230_a_v2

TRANSFORM

Implementation

Partition enterprise-wide standardized data and job-specific analytical data in HDFS, and retain history.

Use dimensional modeling to transform and standardize, and make dimensional data the atomic unit of enterprise data.

Identify all enterprise data entities, and add the finest-grain attributes to each entity as dimensional data.

Take a bottom-up approach, and consider data usage across the enterprise rather than for one specific task.
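
As a rough illustration of the dimensional-modeling step, the Python sketch below splits transactional order records into dimension rows and fact rows while keeping one fact per source record, so the finest grain is retained. Everything in it is an assumption made for the example: the function name `to_star`, the column names (`sku`, `store`, `region`, `qty`, `amount`), and the use of plain dicts in place of data stored in Hadoop.

```python
from decimal import Decimal
from typing import Dict, List, Tuple


def to_star(orders: List[dict]) -> Tuple[Dict[str, dict], Dict[str, dict], List[dict]]:
    """Split raw order records into product and store dimensions plus facts.

    Cleaning and standardizing happen on the way in: keys are trimmed and
    upper-cased, numbers are parsed, and amounts are stored as integer cents.
    """
    products: Dict[str, dict] = {}
    stores: Dict[str, dict] = {}
    facts: List[dict] = []
    for order in orders:
        sku = order["sku"].strip().upper()
        store = order["store"].strip().upper()
        # Dimension rows are created once per entity and then reused.
        products.setdefault(sku, {"sku": sku, "name": order["product_name"].strip()})
        stores.setdefault(store, {"store": store, "region": order["region"].strip()})
        # One fact row per source record: no aggregation, so granularity is kept.
        facts.append({
            "order_id": order["order_id"],
            "sku": sku,
            "store": store,
            "qty": int(order["qty"]),
            "amount_cents": int(Decimal(order["amount"]) * 100),
        })
    return products, stores, facts
```

Because the fact rows reference dimensions only by key, any later job can join or aggregate them for its own purpose without going back to the source systems.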

Page 15: Luo june27 1150am_room230_a_v2

TRANSFORM

(3 Ts’) Transform vs. (ETL’s) Transform

| Transform in ETL / ELT | Transform in 3 Ts |
| --- | --- |
| In vitro, outside Hadoop | In vivo, within Hadoop |
| Use Hadoop as rental space | Use Hadoop as integration platform |
| Non-value-adding data movement between data storage and transformation | Data is transformed while flowing from one partition to another in HDFS |
| High latency | Low latency |
| Network bottleneck | Data locality |
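
The idea of data being transformed while it flows from one partition to another can be sketched in Python as follows. This is only an illustration under stated assumptions: a local directory stands in for HDFS, the zone names (`raw`, `standardized`, `analytical`), the `dt=` directory naming, the `part-0.csv` file name, and the `sku`/`qty` columns are all hypothetical, and a real deployment would run this step as a job inside the cluster.

```python
import csv
from pathlib import Path

ZONES = ("raw", "standardized", "analytical")  # hypothetical zone names


def partition_dir(root: Path, zone: str, dataset: str, day: str) -> Path:
    """Directory of one daily partition, e.g. <root>/raw/sales/dt=2013-06-27."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone!r}")
    return root / zone / dataset / f"dt={day}"


def standardize_partition(root: Path, dataset: str, day: str) -> int:
    """Read one raw partition and write its cleaned form to the next zone.

    The raw partition is left untouched, so history is retained and the
    step can be rerun. Returns the number of rows written.
    """
    src = partition_dir(root, "raw", dataset, day) / "part-0.csv"
    dst_dir = partition_dir(root, "standardized", dataset, day)
    dst_dir.mkdir(parents=True, exist_ok=True)
    written = 0
    with src.open(newline="") as fin, \
            (dst_dir / "part-0.csv").open("w", newline="") as fout:
        reader = csv.DictReader(fin)
        writer = csv.DictWriter(fout, fieldnames=["sku", "qty"])
        writer.writeheader()
        for row in reader:
            sku = row["sku"].strip().upper()
            if not sku:
                continue  # drop rows without a usable key
            writer.writerow({"sku": sku, "qty": int(row["qty"])})
            written += 1
    return written
```

Source and target live under the same root, which mirrors the point of the comparison above: the data changes form without leaving the storage platform.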

Page 16: Luo june27 1150am_room230_a_v2

TRANSLATE

Making Data Products out of Hadoop

Intent

Turn analytical data into data products of business wisdom using home-made or commercial analytics tools built on top of Hadoop. Business decisions supported by data products will help generate more new data, and thus a new round of the enterprise data flow cycle…

Page 17: Luo june27 1150am_room230_a_v2

TRANSLATE

Motivation

Low-latency big data analytics requires the right platform and tools

Use Hadoop as the platform of choice for enterprise data analytics because of its openness and flexibility

Choose analytical tools that are flexible, agile, interactive, and user friendly

Page 18: Luo june27 1150am_room230_a_v2

TRANSLATE

Implementation

Big data analytics takes a team effort

Include statisticians, data scientists and developers

Utilize both generic and Hadoop-specific technologies

Consider both batch-based and streaming-based approaches

Provide access to pre-computed views and on-the-fly queries

Use both home-made and Hadoop-based commercial tools

Use web-based, mobile friendly UI

Visualize
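
The difference between a pre-computed view and an on-the-fly query can be shown with a small Python sketch. It runs on in-memory rows rather than on Hadoop, and the names `precompute_view` and `query_on_the_fly` as well as the fact columns (`region`, `day`, `sku`, `amount_cents`) are assumptions made for the example.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, Iterable, Tuple


def precompute_view(facts: Iterable[dict]) -> Dict[Tuple[str, str], int]:
    """Batch step: total sales per (region, day), built once for scheduled reports."""
    view: Dict[Tuple[str, str], int] = defaultdict(int)
    for fact in facts:
        view[(fact["region"], fact["day"])] += fact["amount_cents"]
    return dict(view)


def query_on_the_fly(facts: Iterable[dict],
                     predicate: Callable[[dict], bool],
                     key: Callable[[dict], Hashable]) -> Dict[Hashable, int]:
    """Ad hoc step: scan the atomic facts with a filter and grouping chosen at query time."""
    totals: Dict[Hashable, int] = defaultdict(int)
    for fact in facts:
        if predicate(fact):
            totals[key(fact)] += fact["amount_cents"]
    return dict(totals)
```

The view answers a fixed question cheaply, while the ad hoc query can answer questions nobody planned for, which is only possible because the fine-grained facts were kept.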

Page 19: Luo june27 1150am_room230_a_v2

The 3 Ts of Hadoop

Continuous Iteration of Enterprise Data Flow

Page 20: Luo june27 1150am_room230_a_v2

Thank You!

For further information

email: [email protected]

visit: www.metascale.com

