Date post: | 27-Jan-2015 |
Category: | Technology |
Upload: | hadoopsummit |
Deutsche Telekom Perspective on HADOOP and Big Data Technologies
Gregory Smith
VP Solution Design and Emerging Technologies and Architectures
T-Systems North [email protected]
Deutsche Telekom and T-Systems Key Stats
Deutsche Telekom is Europe’s largest telecom service provider
– Revenue: $75 billion
– Employees: 232,342
T-Systems is the enterprise division of Deutsche Telekom
– Revenue: $13 billion
– Employees: 52,742
– Services: data center, end user computing, networking, systems integration, cloud and big data
Overwhelmed by new data types?
Sentiment data
Call detail records (CDRs)
Sensor- / machine-based data
Big Data: Transactions, Interactions, Observations
Clickstream data
80% of new data in 2015 will land on Hadoop!
Hadoop is like a data warehouse, but it can store more data, more kinds of data, and perform more flexible analyses
Hadoop is open source and runs on industry standard hardware, so it's 1-2 orders of magnitude more economical than conventional data warehouse solutions
Hadoop provides more cost effective storage, processing, and analysis. Some existing workloads run faster, cheaper, better
Hadoop can deliver a foundation for profitable growth: Gain value from all your data by asking bigger questions
Reference architecture view of Hadoop
Figure: reference architecture (diagram labels)
Security
– Data Isolation
– Access Management
– Data Encryption
Operations
– Workflow and Scheduling
– Management and Monitoring
Presentation
– Data Visualization and Reporting Clients
Application
– Analytics Apps, Transactional Apps, Analytics Middleware
Data Processing
– Batch Processing, Real Time/Stream Processing, Search and Indexing
Data Integration
– Real Time Ingestion, Batch Ingestion, Data Connectors
Data Management
– Metadata Services, Distributed Processing (MapReduce), Non-relational DB, Structured In Memory, Distributed Storage (HDFS)
Infrastructure
– Virtualization
– Compute / Storage / Network
Legend: Hadoop Core, Hadoop Projects, Adjacent Categories
Example application landscape
Figure: example application landscape (diagram labels)
– ETL (Informatica, Talend, Spring Integration)
– Real Time Streams (Social, sensors)
– Structured and Unstructured Data (HDFS, MapR)
– Real Time Database (Shark, Gemfire, HBase, Cassandra)
– Interactive Analytics (Impala, Greenplum, AsterData, Netezza…)
– Batch Processing (Map-Reduce)
– Real-Time Processing (S4, Storm, Spark)
– HIVE
– Machine Learning (Mahout, etc…)
– Data Visualization (Excel, Tableau)
– Cloud Infrastructure: Compute, Storage, Networking
Source: VMware
Disruptive innovations in Big Data
Technologies shown: Traditional Database, HADOOP, NoSQL Database, MPP Analytics, Data Warehouse
Schema
– Traditional: pre-defined, fixed; required on write
– Disruptive: required on read; store first, ask questions later
Processing
– Traditional: no or limited data processing
– Disruptive: processing coupled with data; parallel processing / scale out
Data types
– Traditional: structured
– Disruptive: any, including unstructured
Physical infrastructure
– Traditional: enterprise grade; mission critical
– Disruptive: commodity is an option; much cheaper storage
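The "required on read" idea can be illustrated with a minimal Python sketch. This is not Hadoop code: the sample records, the field names, and the helper `read_with_schema` are all assumptions made for illustration. It only shows raw data being stored untouched and a schema being applied when a question is asked.

```python
import json

# Raw events are stored exactly as they arrive; no schema is enforced on write.
# The records and field names below are hypothetical sample data.
raw_store = [
    '{"caller": "A", "callee": "B", "seconds": 120}',
    '{"user": "A", "page": "/tariffs", "ts": "2015-01-27T10:00:00"}',
    '{"caller": "B", "callee": "C", "seconds": 45, "cell": "X17"}',
]


def read_with_schema(lines, required_fields):
    """Apply a schema at read time: keep records that have all required fields."""
    for line in lines:
        record = json.loads(line)
        if all(field in record for field in required_fields):
            yield {field: record[field] for field in required_fields}


# Question 1, defined after the data was stored: total call seconds per caller.
call_seconds = {}
for rec in read_with_schema(raw_store, ["caller", "seconds"]):
    call_seconds[rec["caller"]] = call_seconds.get(rec["caller"], 0) + rec["seconds"]

# Question 2, asked of the same store with a different schema: pages viewed.
pages = [rec["page"] for rec in read_with_schema(raw_store, ["user", "page"])]
```

With a schema required on write, the clickstream record and the extra `cell` field would have needed a table change before they could be stored at all.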
Columns: Technology Solution, Business problem, Selected Vendors, Data Type/Scalability
Legacy BI
– Business problem: backward-looking analysis, using data out of business applications
– Selected vendors: SAP Business Objects, IBM Cognos, MicroStrategy
– Data type / scalability: structured; limited (2 – 3 TB in RAM)
High Performance BI
– Business problem: quasi-real-time analysis, using data out of business applications
– Selected vendors: Oracle Exadata, SAP HANA
– Data type / scalability: structured; limited (2 – 8 TB in RAM)
“Hadoop” Ecosystem
– Business problem: forward-looking predictive analysis; questions defined in the moment, using data from many sources
– Selected vendors: Hadoop distributions; no ACID transactions; limited SQL set (joins)
– Data type / scalability: structured or unstructured; unlimited (20 – 30 PB)
Annotations: “True” big data; legacy vendor definition of big data
Innovations: Hadoop is 100x cheaper per TB than in-memory appliances like HANA and handles unstructured data as well
Innovations: Store first, ask questions later
Much cheaper storage, but not just storage…
Illustrative acquisition cost
– SAN Storage: 3-5 €/GB (based on HDS SAN Storage)
– NAS Filers: 1-3 €/GB (based on NetApp FAS-Series)
– Enterprise Class Hadoop Storage: ??? €/GB (based on NetApp E-Series (NOSH))
– White Box DAS 1): 0.50-1.00 €/GB (hardware can be self-assembled)
– Data Cloud 1): 0.10-0.30 €/GB (based on large scale object storage interfaces)
1) Hadoop offers Storage + Compute (incl. search). Data Cloud offers Amazon S3 and native storage functions
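To see what the per-GB figures mean at scale, the arithmetic can be sketched in a few lines of Python. The price ranges are the illustrative ones from the slide; the 100 TB capacity, the decimal TB-to-GB conversion, and the function name `acquisition_cost_range` are assumptions chosen only for this example.

```python
# Illustrative acquisition cost ranges in EUR per GB, taken from the slide.
# "Enterprise Class Hadoop Storage" is omitted because the slide gives no figure.
COST_EUR_PER_GB = {
    "SAN Storage": (3.00, 5.00),
    "NAS Filers": (1.00, 3.00),
    "White Box DAS": (0.50, 1.00),
    "Data Cloud": (0.10, 0.30),
}

GB_PER_TB = 1000  # assumption: decimal terabytes


def acquisition_cost_range(tier, terabytes):
    """Return the (low, high) acquisition cost in EUR for a capacity in TB."""
    low, high = COST_EUR_PER_GB[tier]
    gigabytes = terabytes * GB_PER_TB
    return (low * gigabytes, high * gigabytes)


if __name__ == "__main__":
    capacity_tb = 100  # hypothetical capacity, chosen only for illustration
    for tier in COST_EUR_PER_GB:
        low, high = acquisition_cost_range(tier, capacity_tb)
        print(f"{tier}: {low:,.0f} to {high:,.0f} EUR for {capacity_tb} TB")
```

These are acquisition costs only; operating cost, replication overhead, and the compute that comes bundled with Hadoop nodes are outside this sketch.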
Target use cases
Figure: use cases plotted against time to value and potential value
Axes: Time to value (Longer / Shorter); Potential value (Lower / Higher)
Audiences: IT Infrastructure & Operations; Business Intelligence & Data Warehousing; Line of Business & Business Analysts; CXO
Use cases
– Lower Cost Storage
– Enterprise Data Lake
– Enterprise Data Warehouse Offload
– Enterprise Data Warehouse Archive
– ETL Offload
– Capacity Planning & Utilization
– Customer Profiling & Revenue Analytics
– Targeted Advertising Analytics
– Service Renewal Implementation
– CDR based Data Analytics
– Fraud Management
– New Business Models
Groupings: Cost effective storage, processing, and analysis; Foundation for profitable growth
Enterprise data warehouse offload use case
The Challenge
– Many EDWs are at capacity
– Running out of budget before running out of relevant data
– Older data archived “in the dark”, not available for exploration
The Solution
– Hadoop for data storage and processing: parse, cleanse, apply structure and transform
– Free EDW for valuable queries
– Retain all data for analysis!
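The four processing steps named in the solution (parse, cleanse, apply structure, transform) can be sketched as a small single-process Python pipeline. It stands in for work that would run on the cluster; the record layout, the sample numbers, and every function name are hypothetical.

```python
from collections import defaultdict

# Hypothetical raw call detail records: caller;callee;duration_seconds
RAW_CDRS = [
    "4915100001;4915100002;120",
    "4915100001;4915100003; 45 ",
    "malformed line",
    "4915100002;4915100001;-5",
    "4915100002;4915100003;300",
]


def parse(line):
    """Split a raw line into fields; return None if the shape is wrong."""
    parts = [p.strip() for p in line.split(";")]
    return parts if len(parts) == 3 else None


def cleanse(parts):
    """Reject records whose duration is not a non-negative integer."""
    return parts if parts[2].isdigit() else None


def apply_structure(parts):
    """Turn positional fields into a typed record."""
    return {"caller": parts[0], "callee": parts[1], "seconds": int(parts[2])}


def transform(records):
    """Aggregate to a summary the EDW might load: total seconds per caller."""
    totals = defaultdict(int)
    for rec in records:
        totals[rec["caller"]] += rec["seconds"]
    return dict(totals)


def offload_pipeline(lines):
    """Run parse, cleanse, apply structure and transform over raw lines."""
    structured = []
    for line in lines:
        parts = parse(line)
        if parts is None:
            continue
        parts = cleanse(parts)
        if parts is None:
            continue
        structured.append(apply_structure(parts))
    return transform(structured)


summary = offload_pipeline(RAW_CDRS)
```

Only the small `summary` would move to the warehouse, while the raw lines, including the rejected ones, stay available in cheaper storage for later analysis.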
Before: DATA WAREHOUSE running Operational (44%), ETL Processing (42%), Analytics (11%)
After: HADOOP for Storage & Processing; DATA WAREHOUSE running Operational (50%), Analytics (50%)
Cost is 1/10th
GOAL: Platform that natively supports mixed workloads as shared service
AVOID: Systems separated by workload type due to contention
From data puddles and ponds to lakes and oceans
Big Data BU1, Big Data BU2, Big Data BU3
Big Data: Transactions, Interactions, Observations
Refine, Explore, Enrich
Batch, Interactive, Online
Questions to ask in designing a solution for a particular business use case
Which distribution is right for your needs today vs. tomorrow?
Which distribution will ensure you stay on the main path of open source innovation, vs. trap you in proprietary forks?
Figure: reference architecture layers (Security, Operations, Infrastructure, Data Integration, Data Processing, Application, Presentation, Data Management)
Note: Distributions include more than just the Data Management layer but are discussed at this point in the presentation.
Not shown: Intel, Fujitsu and other distributions
Distributions compared
– Widely adopted, mature distribution; GTM partners include Oracle, HP, Dell, IBM
– Fully open source distribution (incl. management tools); reputation for cost-effective licensing; strong developer ecosystem momentum; GTM partners include Microsoft, Teradata, Informatica, Talend
– More proprietary distribution with features that appeal to some business critical use cases; GTM partner AWS (M3 and M5 versions only)
– Just announced by EMC, very early stage; differentiator is HAWQ, which claims manifold query speed improvement and a full SQL instruction set
Common objections to Hadoop
We don’t have big data problems
We don’t have petabytes of data
We can’t justify the budget for a new project
We don’t have the skills
We’re not sure Hadoop is mature/secure/enterprise-ready
We already have a scale-out strategy for our EDW/ETL
MYTH: Big Data means “Petabytes”
Not just Volume
Remember Variety, Velocity
Plenty of issues at smaller scales
– Data processing
– Unstructured data
Often warehouse volumes are small because the technology is expensive, not because there is no relevant data
Scalability is about growing with the business, affordably and predictably
Every organization has data problems! Hadoop can help…
MYTH: Big Data means Data Science
Hadoop solves existing problems faster, better, cheaper than conventional technology, e.g.
– Landing zone: capturing and refining multi-structured data types with unknown future value
– Cost effective platform for retaining lots of data for long periods of time
Walk before you run
Big Data Is a State of Mind
Waves of adoption – crossing the chasm
Wave 1: Batch Orientation
– Adoption today*: mainstream, 70% of organizations
– Example use cases: Refine (archival and transformation)
– Response time: hour(s)
– Architectural characteristic: EDW / RDBMS talk to Hadoop
– Example technologies: MapReduce, Pig, Hive
Wave 2: Interactive Orientation
– Adoption today*: early adopters, 20% of organizations
– Example use cases: Explore (query and visualization)
– Response time: minutes
– Architectural characteristic: analytic apps talk directly to Hadoop
– Example technologies: ODBC/JDBC, Hive
Wave 3: Real-Time Orientation
– Adoption today*: bleeding edge, 10% of organizations
– Example use cases: Enrich (real-time decisions)
– Response time: seconds
– Architectural characteristic: derived data also stored in Hadoop
– Example technologies: HBase, NoSQL, SQL
Data characteristic across the waves: Volume, Velocity
* Among organizations using Hadoop
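For readers new to the batch technologies in Wave 1, the MapReduce pattern can be sketched in plain Python. This is a single-process illustration of the map, shuffle and reduce phases, not the Hadoop API; the clickstream records and all function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical clickstream records: (user, page)
CLICKS = [("u1", "/home"), ("u2", "/home"), ("u1", "/tariffs"), ("u3", "/home")]


def map_phase(record):
    """Emit (key, value) pairs; here one count per page view."""
    _user, page = record
    yield (page, 1)


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


mapped = [pair for record in CLICKS for pair in map_phase(record)]
page_views = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
```

On a cluster, the map and reduce functions run in parallel on the nodes that hold the data, which is what "processing coupled with data" refers to earlier in the deck.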
Hadoop in a nutshell
The Hadoop open source ecosystem delivers powerful innovation in storage, databases and business intelligence, promising unprecedented price / performance compared to existing technologies.
Hadoop is becoming an enterprise-wide landing zone for large amounts of data. Increasingly it is also used to transform data.
Large enterprises have realized substantial cost reductions by offloading some enterprise data warehouse, ETL and archiving workloads to a Hadoop cluster.
Challenges in the Enterprise
Use-case identification and cost justification
Cooperation and coordination from independent business units
As Hadoop increases its footprint in business-critical areas, the business will demand mature enterprise capabilities, e.g. DR, snapshots, etc.
Hadoop’s disruptive approach is challenging strong legacy EDW people, processes and technologies.
Data harmonization is often a significant challenge.
Fear of forking (think UNIX)
Proprietary absorption (Borged)
Audience: Hadoop addresses business problems, not IT problems
Fear of data complexity (“I hated statistics class!”)