www.luxoft.com
DWH & Big Data
Odessa
Vladimir Slobodianiuk
Date: 2014
Agenda
1 Big Data – what is it
2 Hadoop vs RDBMS – pros and cons
3 Hadoop & Enterprise architecture
4 Hadoop as ETL engine
5 Case Studies
Big Data – what is it
Current state
Big data is an all-encompassing term for any collection of data sets so large and complex that it becomes difficult to process using traditional data processing applications.
Limitations & Problems
Big data is difficult to work with using most relational databases, requiring instead massively parallel software running on tens, hundreds, or even thousands of servers
eBay.com uses two data warehouses at 7.5 petabytes
Walmart handles more than 1 million customer transactions every hour
Facebook handles 50 billion photos from its user base
In 2012, the Obama administration announced the Big Data Research and Development Initiative
Hadoop vs RDBMS
CORE HADOOP - MapReduce
In 2004, Google published a paper on a process called MapReduce
DISTRIBUTED COMPUTING FRAMEWORK
Process large jobs in parallel across many nodes and combine the results
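The idea of processing a job in parallel and combining the results can be illustrated with a minimal word-count sketch. This is not Hadoop code: a local thread pool stands in for cluster nodes, and the names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are hypothetical helpers chosen for this illustration.

```python
from collections import defaultdict
from itertools import chain
from multiprocessing.dummy import Pool  # thread pool standing in for cluster nodes


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(chunks, workers=4):
    # Each split is mapped independently, so the map step can run in parallel.
    with Pool(workers) as pool:
        mapped = pool.map(map_phase, chunks)
    grouped = shuffle(chain.from_iterable(mapped))
    return dict(reduce_phase(key, values) for key, values in grouped.items())


splits = ["big data big jobs", "data nodes", "big nodes"]
counts = word_count(splits)
print(counts)
```

In a real cluster the splits would be blocks of a distributed file and the shuffle would move data between machines; the three-phase structure is the part this sketch is meant to show.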
Hadoop Structure
HDFS is a distributed file system designed to run on commodity hardware
HBase stores data rows in labelled tables (a sortable key and an arbitrary number of columns)
Hive provides data summarization, query, and analysis (SQL-like interface)
Pig is a platform for analyzing large data sets that consists of a high-level language (Pig Latin) for expressing data analysis programs
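The HBase data model mentioned above (a sortable row key plus an arbitrary number of columns per row) can be sketched as a toy in-memory structure. This is an illustration only, not the HBase API: `SketchTable` and its `put`, `get`, and `scan` methods are hypothetical names, and the `family:qualifier` column naming is used here just as a convention.

```python
from bisect import bisect_left, insort


class SketchTable:
    """Toy in-memory table: rows sorted by key, each row a free-form column map."""

    def __init__(self):
        self._keys = []   # row keys kept in sorted order
        self._rows = {}   # row key -> {column name: value}

    def put(self, row_key, column, value):
        # A new row key is inserted at its sorted position.
        if row_key not in self._rows:
            insort(self._keys, row_key)
            self._rows[row_key] = {}
        self._rows[row_key][column] = value

    def get(self, row_key):
        return dict(self._rows.get(row_key, {}))

    def scan(self, start, stop):
        """Yield (key, columns) for start <= key < stop, in key order."""
        i = bisect_left(self._keys, start)
        while i < len(self._keys) and self._keys[i] < stop:
            key = self._keys[i]
            yield key, dict(self._rows[key])
            i += 1


table = SketchTable()
table.put("user#002", "info:name", "Bob")
table.put("user#001", "info:name", "Alice")
table.put("user#001", "info:city", "Odessa")   # rows may have different columns
table.put("video#001", "meta:title", "Intro")

user_rows = list(table.scan("user#", "user#~"))
print(user_rows)
```

Because keys are kept sorted, a range scan over a key prefix touches only the matching rows, which is why row key design matters so much in this kind of store.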
Hadoop vs RDBMS
RDBMS strengths:
Performance for relational data
Machine query optimization
Mature workload management
High-concurrency interactive query processing
Hadoop strengths:
Schema-less model
Human query optimization
Ability to create complex dataflows with multiple inputs and outputs
Parallelization of many analytic functions
How might this change in the future:
Query optimization improvements in Hive
– Statistics, better join ordering, more join types, etc.
Startup time improvements
– Simpler query plans to pass out
Runtime performance improvements
Hadoop & Enterprise architecture
Classic architecture approach
Hadoop & Enterprise architecture
Case Study 1
Hadoop as ETL Data Quality tool
PROBLEM
Traditional tools show poor performance in exception handling and data cleansing.
SOLUTION
Hadoop transforms the data into a single format and processes it using data cleansing workflows.
BENEFITS
Reduced TCO (commodity hardware usage)
Traceability of all data quality issues
Hadoop becomes a tool for producing clean data.
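A cleansing step of this kind, one that brings records into a single format and keeps every rejected record traceable, might look roughly like the sketch below. The field names (`customer`, `date`, `amount`), the accepted date formats, and the validation rules are all assumptions made for illustration; the case study does not specify them. In a Hadoop job the same per-record logic would typically sit inside a map step.

```python
from datetime import datetime

# Assumed input date formats; the target format is ISO (YYYY-MM-DD).
DATE_FORMATS = ("%Y-%m-%d", "%d.%m.%Y", "%m/%d/%Y")


def normalize_date(raw):
    """Return the date as YYYY-MM-DD, or raise ValueError if no format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {raw!r}")


def cleanse(records):
    """Split records into normalized rows and a traceable list of issues."""
    clean, issues = [], []
    for line_no, record in enumerate(records, start=1):
        try:
            amount = float(record["amount"])
            if amount < 0:
                raise ValueError(f"negative amount: {amount}")
            clean.append({
                "customer": record["customer"].strip().upper(),
                "date": normalize_date(record["date"]),
                "amount": amount,
            })
        except (KeyError, ValueError) as exc:
            # Keep the original record and its position so the issue can be traced.
            issues.append({"line": line_no, "record": record, "issue": str(exc)})
    return clean, issues


records = [
    {"customer": " acme ", "date": "15.03.2014", "amount": "100.5"},
    {"customer": "Globex", "date": "2014/03/15", "amount": "20"},
    {"customer": "Initech", "date": "03/16/2014", "amount": "-5"},
]
clean, issues = cleanse(records)
print(clean)
print(issues)
```

Rejected records are never dropped silently: each one is stored with its line number and the reason, which is the traceability property listed among the benefits.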
Case Study 2
Know Your Customer PoC
Business Challenge
• Knowing the actual customer reaction to products is essential for business growth, but it’s difficult to get valuable insights. Social media is the place where customers really share their opinions
SOLUTION
Hadoop-based analysis tool that provides the ability to:
• Find events in the client streams and identify the needed reaction
• Propose a product to a client, based on the client's interests
Case Study 3
Enterprise ETL & Hadoop Integration
Goals:
MapReduce ETL job development without coding
Build, re-use, and check impact analysis with enhanced metadata capabilities
A Windows-based graphical development environment
Comprehensive built-in transformations
A library of Use Case Accelerators to fast-track Hadoop productivity
Summary
Big Data:
Cutting edge of DI technologies
State-of-the-art design approaches
A bit more than simple development: it is something of an art, the art of data management
THANK YOU