Post on 03-Jul-2015
MetaScale is a subsidiary of Sears Holdings Corporation
The 3 Ts of Hadoop
Wuheng Luo
Ankur Gupta
06.2013
3-Stage Circular Process of Enterprise Big Data
What are the 3Ts?
3Ts = Transfer, Transform, and Translate
A new enterprise big data pattern to bring disruptive change to conventional ETL
To leverage Hadoop for streamlining data processes
To move toward real-time analytics
The 3Ts Goal
To simplify enterprise data processing and reduce the latency of
turning enterprise data from raw form into products of discovery,
so as to better support business decisions.
The 3Ts One Liners
Transfer
Once the Hadoop system is in place, a mandate is needed to
immediately and continuously capture and deliver all enterprise data,
from all data sources, through all data systems, to Hadoop, and store the
data under HDFS.
Transform
When source data is in, clean, standardize, and convert the data through
dimensional modeling. Data transformation should be performed in-place
within Hadoop, without moving the data out again for integration reasons.
Translate
Finish the data flow cycle by turning analytical data aggregated in
Hadoop into data products of business wisdom. Use batch and streaming
tools built on top of Hadoop to interact with data scientists and end users.
Hadoop as Enterprise Data Hub
“Data Hub” is not a new concept, but:
Conventional Data Hub | Hadoop Enterprise Data Hub
RDBMS or EDW based | Hadoop ecosystem based
No consistent architectural style: ODS, MDM, messaging or publish-subscribe, etc. | 3-phased architecture to cover the full enterprise data flow cycle from data source to data products
Relies heavily on ETL | 3Ts-driven
Intermediate, partial, siloed | True center of enterprise data
… | …
TRANSFER
Sourcing Data into Hadoop
Intent
Continuously capture all enterprise data at the earliest possible
touch points, deliver the data from all sources, through
all source data systems, to Hadoop, and store the data
under HDFS.
Motivation
To gain a distinctive competitive capability, enterprises need to
build an integrated data infrastructure as the foundation
for big data analytics. Use Hadoop as THE centralized
enterprise data repository, and make it the grand
destination for all enterprise source data.
(3 Ts’) Transfer vs. (ETL’s) Extract
Traditional ETL - Extract | Hadoop - Transfer
Bottom-up | Top-down
Task/project specific | Enterprise-wide mandate
Passive | Proactive
Data is not available when needed | Data is ready when needed
Same datasets are moved around again and again, with no value added | Move the data once, and use it many times, each time with value increased
Consequences
Before | After
Isolated, disconnected in various siloed data/file systems | Consolidated and centralized in Hadoop
Monolithically segmented | Heterogeneous, diverse, huge
Separated and partial | Federated and holistic
Implementation
Always do a data gap analysis first
Fork the ingestion in both batch and streaming if needed
Have a delivery plan for the data feed
Synchronize data changes between source system and Hadoop
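The last step, synchronizing data changes between a source system and Hadoop, can be sketched as a key-based comparison of two snapshots. This is only an illustration under assumptions: the function name `diff_snapshots`, the record layout, and the in-memory dictionaries are hypothetical stand-ins, and a real feed would read from the source system and from HDFS with whatever ingestion tools are in place.

```python
from typing import Dict, List, Tuple

# Hypothetical record layout: primary key -> row contents.
Snapshot = Dict[str, Dict[str, object]]


def diff_snapshots(source: Snapshot, hadoop: Snapshot) -> Tuple[List[str], List[str], List[str]]:
    """Return the keys to insert, update, and delete so Hadoop matches the source."""
    inserts = sorted(k for k in source if k not in hadoop)
    updates = sorted(k for k in source if k in hadoop and source[k] != hadoop[k])
    deletes = sorted(k for k in hadoop if k not in source)
    return inserts, updates, deletes


# Illustrative data only; not taken from any real system.
source_rows = {
    "c1": {"name": "Ann", "city": "Chicago"},
    "c2": {"name": "Bob", "city": "Dallas"},
    "c4": {"name": "Dee", "city": "Austin"},
}
hadoop_rows = {
    "c1": {"name": "Ann", "city": "Chicago"},
    "c2": {"name": "Bob", "city": "Denver"},
    "c3": {"name": "Cal", "city": "Boston"},
}

inserts, updates, deletes = diff_snapshots(source_rows, hadoop_rows)
```

Comparing full snapshots is the simplest approach and suits small reference tables; for large or fast-changing sources, a change log or timestamp-based feed would likely be a better fit.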
TRANSFORM
Integrating Data within Hadoop
Intent
Keep the data flowing beyond the ingest phase by transforming
the data from dirty to clean, from raw to standardized, and
from transactional to analytical, all within Hadoop.
Motivation
As the latency or speed from raw data to business insight
becomes the focal point of enterprise data analytics, use
Hadoop as the data integration platform to perform in-place
data transformation.
Implementation
Partition enterprise-wide standardized data and job-specific analytical
data in HDFS, and retain history.
Use dimensional modeling to transform and standardize, and make
dimensional data the atomic unit of enterprise data.
Identify all enterprise data entities, and add finest grain attributes to
each entity as dimensional data.
Take a bottom-up approach, and think about data usage across the
enterprise rather than for one specific task.
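The dimensional modeling step above can be illustrated with a small sketch that cleans flat transactional rows and splits them into a dimension and a fact list. Everything named here (`to_star_schema`, the field names, the sample rows) is a hypothetical example rather than part of the original deck, and plain Python stands in for whatever Hadoop-side tooling would run the job in place.

```python
from typing import Dict, List, Tuple


def to_star_schema(transactions: List[Dict[str, str]]) -> Tuple[Dict[Tuple[str, str], int], List[Dict[str, object]]]:
    """Clean raw rows, then split them into a product dimension and sales facts."""
    product_dim: Dict[Tuple[str, str], int] = {}
    facts: List[Dict[str, object]] = []
    for row in transactions:
        # Clean and standardize: trim whitespace and normalize case.
        natural_key = (row["product_name"].strip().lower(), row["category"].strip().lower())
        # Assign a surrogate key the first time a product is seen.
        product_key = product_dim.setdefault(natural_key, len(product_dim) + 1)
        # Convert transactional data to an analytical fact row.
        facts.append({"product_key": product_key, "date": row["date"], "amount": float(row["amount"])})
    return product_dim, facts


# Illustrative raw rows with inconsistent spacing and casing.
raw_rows = [
    {"product_name": " Drill ", "category": "Tools", "date": "2013-06-01", "amount": "59.5"},
    {"product_name": "drill", "category": "tools", "date": "2013-06-02", "amount": "59.5"},
    {"product_name": "Sofa", "category": "Furniture", "date": "2013-06-02", "amount": "499"},
]

product_dim, sales_facts = to_star_schema(raw_rows)
```

Both spellings of the drill resolve to the same dimension entry, which is the point of standardizing before building analytical data.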
(3 Ts’) Transform vs. (ETL’s) Transform
Transform in ETL / ELT | Transform in 3 Ts
in vitro, outside Hadoop | in vivo, within Hadoop
Use Hadoop as rental space | Use Hadoop as integration platform
Non-value adding data movement in between data storage and transformation | Data is transformed while flowing from one partition to another under HDFS
High latency | Low latency
Network bottleneck | Data locality
TRANSLATE
Making Data Products out of Hadoop
Intent
Turn analytical data into data products of business wisdom
using home-made or commercial analytics tools built on
top of Hadoop. Business decisions supported by data
products will help generate more new data, thus starting a new
round of the enterprise data flow cycle…
Motivation
Low-latency big data analytics requires the right platform and tools
Use Hadoop as the platform of choice for enterprise data
analytics because of its openness and flexibility
Choose analytical tools that are flexible, agile, interactive
and user friendly
Implementation
Big data analytics takes a team effort
Include statisticians, data scientists and developers
Utilize both generic and Hadoop-specific technologies
Consider both batch-based and streaming-based approaches
Provide access to pre-computed views and on-the-fly queries
Use both home-made and Hadoop-based commercial tools
Use web-based, mobile friendly UI
Visualize
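One way to picture the combination of batch and streaming approaches with pre-computed views and on-the-fly queries is the sketch below. It is a simplified assumption, not the deck's own design: `build_batch_view`, `query_total`, and the `(store_id, amount)` event layout are hypothetical, and in-memory lists stand in for data held in Hadoop and for a stream of recent events.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Event = Tuple[str, float]  # hypothetical layout: (store_id, sale_amount)


def build_batch_view(events: Iterable[Event]) -> Dict[str, float]:
    """Pre-compute total sales per store from historical events (the batch side)."""
    totals: Dict[str, float] = defaultdict(float)
    for store_id, amount in events:
        totals[store_id] += amount
    return dict(totals)


def query_total(store_id: str, batch_view: Dict[str, float], recent: List[Event]) -> float:
    """Answer a query on the fly: batch total plus events not yet in the view."""
    recent_total = sum(amount for sid, amount in recent if sid == store_id)
    return batch_view.get(store_id, 0.0) + recent_total


# Illustrative data only.
historical_events = [("s1", 100.0), ("s2", 40.0), ("s1", 50.0)]
recent_events = [("s1", 25.0), ("s3", 10.0)]

batch_view = build_batch_view(historical_events)
s1_total = query_total("s1", batch_view, recent_events)
```

The pre-computed view keeps common queries fast, while the on-the-fly part covers data that arrived after the last batch run.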
The 3 Ts of Hadoop
Continuous Iteration of Enterprise Data Flow
Thank You!
For further information
email: contact@metascale.com
visit: www.metascale.com