Date post: | 26-Jan-2015 |
Category: |
Technology |
Upload: | reedshea |
PROFIT FROM ALL OF YOUR DATA
February 2012
Hadoop in the Enterprise
Adam Smieszny | Systems Engineer
©2011 Cloudera, Inc. All Rights Reserved.
Agenda
• Hadoop Overview
• History of Hadoop
• What is Hadoop
• Hadoop in the Enterprise
Existing Data Management
[Chart: gigabytes of data created (in billions), 0 to 10,000, for 2005 to 2015; series: structured data, unstructured data; callout: 10%]
Current database solutions are designed for structured data.
Optimized to answer known questions quickly
Schemas dictate form/context
Difficult to adapt to new data types and new questions
Expensive at petabyte scale
Why the Need for Hadoop?
[Chart: gigabytes of data created (in billions), 0 to 10,000, for 2005 to 2015; series: structured data, unstructured data]
1.8 trillion gigabytes of data was created in 2011…
More than 90% is unstructured data
Approx. 500 quadrillion files
Quantity doubles every 2 years
Source: IDC 2011
More Devices
New Sources
More Content
New & Better Info
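The "quantity doubles every 2 years" figure on this slide can be turned into a rough projection. The sketch below is illustrative only: it assumes the quoted 2011 value of 1.8 trillion gigabytes (1,800 billion gigabytes) as the baseline and a constant two-year doubling period. The function name `projected_gigabytes_billions` and its parameters are hypothetical and do not come from the deck or from IDC.

```python
def projected_gigabytes_billions(year, base_year=2011, base_billions=1800.0,
                                 doubling_years=2.0):
    """Project data created in a given year, in billions of gigabytes.

    Assumption: volume doubles every `doubling_years` years, starting
    from the 1,800 billion GB quoted for 2011.
    """
    return base_billions * 2 ** ((year - base_year) / doubling_years)


if __name__ == "__main__":
    for year in (2011, 2013, 2015):
        print(year, projected_gigabytes_billions(year))
```

Under these assumptions the 2015 value is four times the 2011 baseline, because four years span two doubling periods.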
The Origins of Hadoop
Open source web crawler project created by Doug Cutting
Google publishes MapReduce and GFS papers
Open source MapReduce and HDFS project created by Doug Cutting
Yahoo! runs 4,000-node Hadoop cluster
Hadoop wins Terabyte sort benchmark
Facebook launches SQL support for Hadoop
Cloudera releases CDH and Cloudera Enterprise
Timeline: 2002, 2007, 2012
What is Apache Hadoop?
Hadoop Distributed File System (HDFS)
File Sharing & Data Protection Across Physical Servers
MapReduce
Distributed Computing Across Physical Servers
Flexibility
A single repository for storing, processing & analyzing any type of data
Not bound by a single schema
Scalability
Scale-out architecture divides workloads across multiple nodes
Flexible file system eliminates ETL bottlenecks
Low Cost
Can be deployed on commodity hardware
Open source platform guards against vendor lock-in
Hadoop is a platform for data storage and processing that is…
Scalable, fault tolerant, open source
CORE HADOOP COMPONENTS
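To make the MapReduce component on this slide more concrete, here is a minimal single-process sketch of the map, shuffle, and reduce phases, using a word count as the job. It is not Hadoop code and does not use any Hadoop API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) are hypothetical, and a real cluster would run the map and reduce steps in parallel across many servers over data stored in HDFS.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle, and reduce in sequence on an in-memory input."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())


if __name__ == "__main__":
    print(run_job(["hadoop stores data", "hadoop processes data"]))
```

Because each record is mapped independently and each key is reduced independently, both phases can be split across nodes, which is the property the scale-out claim relies on.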
What is CDH?
Fastest Path to Success
No need to write your own scripts or do integration testing on different components
Works with a wide range of operating systems, hardware, databases and data warehouses
Stable and Reliable
Extensive Cloudera QA systems, software & processes
Tested & run in production at scale
Proven at scale in dozens of enterprise environments
Community Driven
Incorporates only main-line components from the Apache Hadoop ecosystem – no forks or proprietary underpinnings
FREE
Cloudera’s Distribution Including Apache Hadoop (CDH) is an enterprise-ready distribution of Hadoop that is…
100% Apache open source
Contains all components needed for deployment
Fully documented and supported
Released on a reliable schedule
CDH & Enterprise Ecosystem
Coordination: Apache ZooKeeper
Data Integration: Apache Flume, Apache Sqoop
Fast Read/Write Access: Apache HBase
Languages / Compilers: Apache Pig, Apache Hive
Workflow: Apache Oozie
Scheduling: Apache Oozie
Metadata: Apache Hive
File System Mount: FUSE-DFS
UI Framework: Hue
SDK: Hue SDK
Ecosystem contributions: Packaging, testing; Sqoop framework, adapters; Drivers, language enhancements, testing; More coming…
Hadoop / RDBMS Use Cases
[Diagram labels: unstructured data; semi-structured data; structured data; Create context (classification, text mining); Parse, aggregate; Analyze; Analyze, report; Active archival; Long running queries; EDW]
Slide borrowed from Krishnan Parasuraman presentation at Enzee’11
Hadoop in Production
How Apache Hadoop fits into your existing infrastructure.
Logs, Files, Web Data, Relational Data
IDEs, BI / Analytics, Enterprise Reporting
Enterprise Data Warehouse
Low-Latency Serving Systems
Web Application
Management Tools
Operators, Engineers, Analysts, Business Users, Customers
Hadoop Use Cases
Industry | Advanced Analytics | Data Processing
Web | Social Network Analysis | Clickstream Sessionization
Media | Content Optimization | Clickstream Sessionization
Telco | Network Analytics | Mediation
Retail | Loyalty & Promotions Analysis | Data Factory
Financial | Fraud Analysis | Trade Reconciliation
Federal | Entity Analysis | SIGINT
Bioinformatics | Sequencing Analysis | Genome Mapping
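Clickstream sessionization, listed among the data processing use cases, means grouping each user's page events into visits. A minimal sketch follows, assuming events arrive as `(user_id, timestamp_in_seconds)` pairs and that a gap of more than 30 minutes starts a new session. Both the input shape and the 30-minute threshold are assumptions chosen for illustration, and the function name `sessionize` is hypothetical.

```python
from collections import defaultdict


def sessionize(events, gap_seconds=1800):
    """Split each user's event timestamps into sessions.

    events: iterable of (user_id, timestamp_seconds) pairs, in any order.
    A new session starts when the gap since the user's previous event
    exceeds gap_seconds (assumed default: 30 minutes).
    """
    by_user = defaultdict(list)
    for user, ts in events:
        by_user[user].append(ts)

    sessions = {}
    for user, stamps in by_user.items():
        stamps.sort()
        user_sessions = [[stamps[0]]]
        for ts in stamps[1:]:
            if ts - user_sessions[-1][-1] > gap_seconds:
                user_sessions.append([ts])    # gap too long: start a new session
            else:
                user_sessions[-1].append(ts)  # continue the current session
        sessions[user] = user_sessions
    return sessions


if __name__ == "__main__":
    clicks = [("u1", 0), ("u1", 600), ("u1", 4000), ("u2", 100)]
    print(sessionize(clicks))
```

Grouping by user first mirrors how such a job would typically be keyed in a MapReduce setting, with the per-user splitting done in the reduce step.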
Use Case: Customer Risk
Build a comprehensive data picture of customer-side risk
Publish a consolidated set of attributes for analysis
Map ratings across products
Parse and aggregate data from different sources
Credit and debit cards, product payments, deposits and savings
Banking activity, browsing behavior, call logs, e-mails and chats
Merge data into a single view
A “fuzzy join” among data sources
Structure and normalize attributes
Sentiment analysis, pattern recognition
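The "fuzzy join" mentioned on this slide merges records whose keys are similar but not identical, for example customer names spelled differently in two systems. The sketch below is one simple way to illustrate the idea, using Python's standard `difflib.SequenceMatcher` for string similarity. The record layout, the `name` field, the 0.85 threshold, and the function names are all assumptions made for the example, not the method used in the deck.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a 0..1 similarity score between two strings, ignoring case."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def fuzzy_join(left, right, field, threshold=0.85):
    """Merge each left record with its most similar right record.

    Records are dicts. A pair is merged only if the similarity of their
    `field` values reaches `threshold`; unmatched left records are dropped.
    On conflicting keys, the left record's value wins.
    """
    merged = []
    for l_rec in left:
        best = max(right, key=lambda r: similarity(l_rec[field], r[field]),
                   default=None)
        if best is not None and similarity(l_rec[field], best[field]) >= threshold:
            merged.append({**best, **l_rec})
    return merged


if __name__ == "__main__":
    cards = [{"name": "Jon Smith", "card_spend": 120}]
    calls = [{"name": "John Smith", "calls": 3}, {"name": "Alice Wong", "calls": 1}]
    print(fuzzy_join(cards, calls, "name"))
```

This version compares every left record against every right record, which is fine for a small illustration but would need blocking or key bucketing before it could run at the scale the slide has in mind.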
Copyright 2010 Cloudera Inc. All rights reserved
Use Case: Sentiment Analysis
Internet generates a lot of chatter about brands
Understanding what’s being said is crucial to protecting brand value
Facebook, Twitter generate a lot of data for a global top brand
Capturing and processing direct feedback
Better engagement and alerting via sentiment analysis
Not yet ready for fully automated customer service
Hadoop handles the diverse data types and processing
Sources of data changing and semantics continuously evolving
Sophistication of algorithms is improving daily
Journey of CDH Users
Discover the Benefits of Apache Hadoop
• Gain the flexibility to store and mine all types of data
• Leverage the scale-out architecture for complex data analysis
• Easily scale to meet growing data requirements
• Avoid vendor lock-in with an open source technology
Deploy CDH
• The fastest, surest path to success with Apache Hadoop
• Stable, reliable version of Apache Hadoop without the vendor lock-in imposed by proprietary vendors
• Integrates with your other technology platforms, ensuring investment protection
Subscribe to Cloudera Enterprise
• Simplify and accelerate Apache Hadoop deployment
• Reduce adoption costs and risks
• More effectively manage cluster resources
• Leverage the experience of our experts
Get Hadoop: http://www.cloudera.com/hadoop/
cloudera.com
twitter.com/cloudera
facebook.com/cloudera