Date posted: 20-Aug-2015
Category: Technology
Uploaded by: cloudera-inc
Apache Hadoop in the Enterprise
Cloudera, Inc.
Amr Awadallah, Founder, CTO, VP of Engineering.
twitter: @awadallah
MicroStrategy World – January 2011 – Las Vegas
Unstructured Data Explosion
• 2,500 exabytes of new information in 2012 with Internet as primary driver
• Digital universe grew by 62% last year to 800K petabytes and will grow to 1.2 “zettabytes” this year
Source: IDC White Paper, sponsored by EMC: “As the Economy Contracts, the Digital Universe Expands,” May 2009.
2 Copyright © 2011, Cloudera, Inc. All Rights Reserved.
Relational
Complex, Unstructured
Dramatic Changes in Enterprise Data Needs
Data Explosion
• Any Type of Data
• From Many Sources
• Instrument Everything
Hard Problems
• Complex Analysis
• At Lowest Granularity
• Data Beats Algorithm
What is Hadoop?
• A scalable fault-tolerant distributed system for data storage and processing (open source under the Apache license)
• Core Hadoop has two main components:
  • Hadoop Distributed File System (HDFS): self-healing, high-bandwidth clustered storage
  • MapReduce: fault-tolerant distributed processing
• Key business values:
  • Flexible -> Store any data, run any analysis (Mine First, Govern Later)
  • Affordable -> Cost per TB at a fraction of traditional options
  • Broadly adopted -> A large and active ecosystem
  • Proven at scale -> Several petabyte deployments in production today
  • Open Source -> No lock-in, low cost, large developer community
Cloudera’s Data Operating System (CDH)
• Open Source – 100% Apache licensed
• Simplified – Component versions & dependencies managed for you
• Reliable – Predictable release schedules, patched with fixes to improve stability
• Many Form Factors – Public Cloud, Private Cloud, Ubuntu, RHEL, 32 or 64bit, etc.
• Integrated – All components & functions interoperate through standard APIs
• Supported – Founders, committers, contributors across all projects
Diagram: CDH components (Hue, Hue SDK, Oozie, HBase, Flume, Sqoop, ZooKeeper, Avro, Pig, Hive)
Benefit #1: Agility
Schema-on-Write (RDBMS):
• Schema must be created before data is loaded
• An explicit load operation has to take place, which transforms the data into the database's internal structure
• New columns must be added explicitly before data for those columns can be loaded into the database
Benefits:
• Read is Fast
• Standards/Governance
Schema-on-Read (Hadoop):
• Data is simply copied to the file store; no special transformation is needed
• A SerDe (Serializer/Deserializer) is applied at read time to extract the required columns
• New data can start flowing at any time and will appear retroactively once the SerDe is updated to parse it
Benefits:
• Load is Fast
• Evolving Schemas/Agility
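The schema-on-read idea can be sketched in a few lines of Python. This is a toy illustration, not Hive's SerDe interface: the sample records, the `read_with_schema` helper, and the column names are all hypothetical. Raw lines are stored untouched, and a parser applied at read time decides which columns exist.

```python
import json

# Raw events are stored exactly as they arrived (the "load" is just a copy).
RAW_LINES = [
    '{"user": "alice", "amount": 12.5}',
    '{"user": "bob", "amount": 3.0, "country": "US"}',
]


def read_with_schema(raw_lines, columns):
    """Parse each raw line at read time and project the requested columns.

    Missing fields come back as None, which is how a column added to the
    reader later "appears retroactively" for old and new rows alike.
    """
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        rows.append(tuple(record.get(col) for col in columns))
    return rows


# Version 1 of the reader only knows about two columns.
v1_rows = read_with_schema(RAW_LINES, ["user", "amount"])

# Version 2 adds "country" without rewriting or reloading the stored data.
v2_rows = read_with_schema(RAW_LINES, ["user", "amount", "country"])
```

The stored lines never change between the two reads; only the reader does, which is the trade the slide describes: loading is cheap, and parsing cost is paid on every read.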
Benefit #2: Data Consolidation
A single data system to enable processing across the universe of data types.
Complex Data: Documents, Web feeds, System logs, Online forums, SharePoint, Sensor data, EMB archives, Images/Video
Structured Data (“relational”): CRM, Financials, Logistics, Data Marts, Inventory, Sales records, HR records, Web Profiles
Benefit #3: Any Programming Language (Not Only SQL)
1. Java MapReduce: Gives the most flexibility and performance, but potentially a long development cycle (the “assembly language” of Hadoop).
2. Streaming MapReduce (also Pipes): Lets you develop in any programming language of your choice, at the cost of slightly lower performance and less flexibility.
3. Cascading: A thin Java library that sits on top of MapReduce and lets developers assemble complex processes.
4. Pig: A high-level language out of Yahoo, suitable for batch data flow workloads.
5. Hive: A SQL interpreter out of Facebook that also includes a metastore mapping files to their schemas and associated SerDes.
6. Oozie: A PDL XML workflow server engine that enables creating a workflow of jobs composed of any of the above.
Benefit #4: Balancing Return on Investment (or Byte!)
• Return on Byte (ROB) = value to be extracted from that byte divided by the cost of storing that byte
• If ROB is < 1, the byte will be buried in the tape wasteland; thus we need more economical active storage.
Diagram labels: Low ROB, High ROB
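The Return on Byte ratio is simple enough to compute directly. The sketch below is only an illustration: the function names and the value and cost figures are hypothetical, chosen to show how cheaper storage moves the same data from below 1 to above 1.

```python
def return_on_byte(value_extracted, storage_cost):
    """Return on Byte = value extracted from the data / cost of storing it."""
    if storage_cost <= 0:
        raise ValueError("storage_cost must be positive")
    return value_extracted / storage_cost


def worth_keeping_active(value_extracted, storage_cost):
    """True when ROB >= 1, i.e. the data pays for its own active storage."""
    return return_on_byte(value_extracted, storage_cost) >= 1.0


# Hypothetical figures: the same data set valued at 50 units, stored on an
# expensive system (cost 200) versus a cheaper one (cost 20).
rob_expensive = return_on_byte(50.0, 200.0)
rob_cheap = return_on_byte(50.0, 20.0)
```

With these made-up numbers the expensive system yields an ROB of 0.25 and the cheaper one 2.5, so only the second keeps the data in active storage.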
Use The Right Tool For The Right Job
Hadoop:
Use when:
• Structured or Not (Agility)
• Scalable Storage/Compute
• Complex Data Processing
Relational Databases:
Use when:
• Interactive OLAP Analytics (<1sec)
• Multistep ACID Transactions
• SQL Compliance
Where Does Hadoop Fit in the Enterprise Data Stack?
Diagram: Hadoop within the enterprise data stack.
Systems: Logs, Files, Web Data; Relational Databases; Enterprise Data Warehouse; Low-Latency Serving Systems; Web Application; Enterprise Reporting; BI, Analytics; IDEs; Cloudera Mgmt Apps
People: Users; Analysts; Business Users; Data Scientists; System Administrators; Data Architects
Apache Hive Features
• A subset of SQL covering the most common statements
• JDBC/ODBC support
• Agile data types: Array, Map, Struct, and JSON objects
• Pluggable SerDe system to work on unstructured files directly.
• User Defined Functions and Aggregates
• Regular Expression support
• MapReduce support
• Partitions and Buckets (for performance optimization)
• In The Works: Indices, Columnar Storage, Views, MicroStrategy compatibility, Explode/Collect
• More details: http://wiki.apache.org/hadoop/Hive
Broad Adoption in Key Verticals
Example Applications
• Financial Services, Risk management: “Examine purchase behavior across debit and credit properties to better identify high-risk customers.”
• Telecom, BSS: “Analyze calling patterns among users and current capacity to forecast traffic growth and locate new towers.”
• Retail, Brand Equity: “Monitor customer and product data recorded across internal & external sources to trend brand valuation.”
• Government, Traffic Analysis: “Use multimedia data from various sources to build an actionable graph of relationships among targets.”
Stakeholders: Risk Analysts Intelligence; Research Insight Team; IT: Operations; IT: Data Engineering
Customers
How are Customers Using Cloudera?
Analyze search terms and subsequent user purchase decisions to tune search results, increase conversion rates
Digest long-term historical trade data to identify fraudulent activity and build real-time fraud prevention
Model site visitor behavior with analytics that deliver better recommendations for new purchases
Continually refine predictive models for advertising response rates to deliver more precisely targeted advertisements
Replace expensive legacy ETL system with more flexible, cheaper infrastructure that is 20 times faster
Correlate educational outcomes with programs and student histories to improve results
Examine customer behavior to improve loan risk scoring
Answering Questions that Were Impossible to Ask Before
Big Bank
More: http://www.cloudera.com/company/press-center/hadoop-world-nyc/
Cloudera Offerings
Software Services Training
Cloudera Enterprise: Enterprise Support and Management Applications
Facilitating enterprise adoption of Hadoop
• Improves conformance to important IT SLAs, policies and procedures
• Lowers the cost of management and administration
• Increases reliability and consistency of the platform
• Certified integration with RDBMS, ETL, BI, Server, and Cloud Systems
Integrating with Existing IT Infrastructure
RDBMS, ETL, Cloud/OS, Hardware, BI/Analytics
MicroStrategy (for interactive Dashboards)
Informatica (for Extract-Transform-Load, aka ETL)
Summary
• Cloudera’s Data OS (CDH) enables:
  • Data Agility (Evolving Schemas)
  • Consolidation (Structured or Not)
  • Complex Data Processing (Any Language)
  • Economical Storage (Enable Return-on-Byte > 1)
• Cloudera Enterprise enables:
  • Conformance to important IT SLAs, policies and procedures
  • Lower cost of management and administration
  • Increased reliability and consistency
  • Certified integration with existing IT infrastructure
Contact Information and Free Hadoop Book
Amr Awadallah
CTO, Cloudera, Inc.
650-644-3921
twitter.com/awadallah
twitter.com/cloudera
Appendix
Cloudera Overview
Hadoop…
Jeff Hammerbacher, Chief Scientist
Amr Awadallah, CTO, VP Engineering
Doug Cutting, Chief Architect
… meets enterprise
Mike Olson, CEO
Omer Trajman, VP, Customer Solutions
John Kreisa, VP, Marketing
Charles Zedlewski, VP, Product Management
Ed Albanese, Head of Business Development
Investors: Accel Partners, Greylock Partners, Meritech Capital Partners
Product category: Data Management
Business model: Cloudera offers Software, Support, Training, and Professional Services
Employees: 70+
Customers: 75+
Headquarters: Palo Alto, California
Elevator pitch: The leading provider of Apache Hadoop-based software and services for the enterprise
Vision: We enable organizations to profit from all of their data
Why CDH (Cloudera Distribution for Hadoop)?
Features and Benefits
• It’s packaged: Much easier for users to install CDH than any other form of Hadoop.
• It’s patched: This makes CDH more stable and secure than just downloading an Apache branch.
• It’s proven: Thousands of organizations already use CDH today, so risk is lower.
• It’s highly functional: CDH will cover more use cases and users will be more productive than if they were just using core Hadoop.
• It’s integrated: Save time (of piecing a system together yourself) and lower risk (of choosing the wrong combination of versions or patches).
• It’s the accepted standard: More of your preexisting investments in RDBMS, ETL and BI work best with CDH.
• It’s supported: CDH is one of only two distributions that has a commercial entity standing behind it.
• It’s 100% Apache licensed: Investment in this technology is insured.
Hadoop Timeline
2002 2003 2004 2005 2006 2007 2008 2009
Doug Cutting & Mike Cafarella started working on Nutch
Google publishes GFS & MapReduce papers
Cutting adds DFS & MapReduce support to Nutch
Yahoo! hires Cutting, Hadoop spins out of Nutch
Web-scale deployments at Y!, Facebook, Last.fm
Fastest sort of a TB, 3.5 mins over 910 nodes
NY Times converts 4TB of image archives over 100 EC2s
Fastest sort of a TB, 62 secs over 1,460 nodes
Sorted a PB in 16.25 hours over 3,658 nodes
Hadoop Summit 2009, 750 attendees
Cloudera Founded
Cloudera hires Cutting
10 Common Hadoop-able Problems
1. Modeling true risk
2. Customer churn analysis
3. Recommendation engine
4. Ad targeting
5. PoS transaction analysis
6. Analyzing network data to predict failure
7. Threat analysis
8. Trade surveillance
9. Search quality
10. Data “sandbox”
Case Studies: Hadoop World 2009
• VISA: Large Scale Transaction Analysis
• JP Morgan Chase: Data Processing for Financial Services
• China Mobile: Data Mining Platform for Telecom Industry
• Rackspace: Cross Data Center Log Processing
• Booz Allen Hamilton: Protein Alignment using Hadoop
• eHarmony: Matchmaking in the Hadoop Cloud
• General Sentiment: Understanding Natural Language
• Yahoo!: Social Graph Analysis
• Visible Technologies: Real-Time Business Intelligence
Slides and Videos: http://www.cloudera.com/hadoop-world-nyc
Case Studies: Hadoop World 2010
• eBay: Hadoop at eBay
• Twitter: The Hadoop Ecosystem at Twitter
• Yale University: MapReduce and Parallel Database Systems
• General Electric: Sentiment Analysis powered by Hadoop
• Facebook: HBase in Production
• AOL: AOL’s Data Layer
• Raytheon: SHARD: Storing and Querying Large-Scale Data
• StumbleUpon: Mixing Real-Time and Batch Processing
More Info: http://www.cloudera.com/company/press-center/hadoop-world-nyc/
Hadoop Design Axioms
1. System Shall Manage and Heal Itself
2. Performance Shall Scale Linearly
3. Compute Should Move to Data
4. Simple Core, Modular and Extensible
HDFS: Hadoop Distributed File System
• Block Size = 64MB
• Replication Factor = 3
• Cost/GB is a few ¢/month vs $/month
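To make the block size and replication factor concrete, here is a small back-of-the-envelope sketch. The two constants come from the slide; the `hdfs_footprint` helper and the 1,000 MB file are hypothetical.

```python
import math

BLOCK_SIZE_MB = 64       # block size from the slide
REPLICATION_FACTOR = 3   # replication factor from the slide


def hdfs_footprint(file_size_mb):
    """Return (blocks, replicas, raw_mb) for one file.

    The last block of a file only occupies the bytes it actually holds, so
    raw storage is file size times the replication factor, not whole blocks.
    """
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    replicas = blocks * REPLICATION_FACTOR
    raw_mb = file_size_mb * REPLICATION_FACTOR
    return blocks, replicas, raw_mb


# A hypothetical 1,000 MB file.
blocks, replicas, raw_mb = hdfs_footprint(1000)
```

For the 1,000 MB example this gives 16 blocks, 48 block replicas spread over the data nodes, and 3,000 MB of raw disk, which is why the low cost per GB matters.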
MapReduce: Distributed Processing
MapReduce Example for Word Count
Diagram: word count data flow.
• Input: Split 1 … Split i … Split N
• Map: Map 1 … Map i … Map M, each reading (docid, text) and emitting (words, counts)
• Shuffle: delivers (sorted words, counts) to the reducers
• Reduce: Reduce 1 … Reduce i … Reduce R, each writing Output File 1 … i … R containing (sorted words, sum of counts)
• Example: for the input “To Be Or Not To Be?”, the partial counts Be, 5; Be, 12; Be, 7; Be, 6 are summed to Be, 30
Map(in_key, in_value) => list of (out_key, intermediate_value)
Reduce(out_key, list of intermediate_values) => out_value(s)
cat *.txt | mapper.pl | sort | reducer.pl > out.txt
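The same map, sort, and reduce stages as the shell pipeline above can be sketched in a single Python process. This is a minimal illustration of the data flow, not Hadoop's API: the function names are hypothetical, and a real job would run many mappers and reducers in parallel across the cluster.

```python
import re
from collections import defaultdict


def map_words(doc_id, text):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in re.findall(r"[a-z']+", text.lower()):
        yield word, 1


def shuffle(pairs):
    """Shuffle: group intermediate values by key, keys in sorted order."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())


def reduce_counts(word, counts):
    """Reduce: sum the counts collected for one word."""
    return word, sum(counts)


def word_count(documents):
    """Run map, shuffle, and reduce in a single process."""
    intermediate = []
    for doc_id, text in documents.items():
        intermediate.extend(map_words(doc_id, text))
    return [reduce_counts(word, counts) for word, counts in shuffle(intermediate)]


result = word_count({"doc1": "To Be Or Not To Be?"})
```

Here `shuffle` plays the role of `sort` in the pipeline: it is the step that guarantees each reducer call sees every count for its word together.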
Hadoop High-Level Architecture
• Name Node: Maintains mapping of file blocks to data node slaves
• Job Tracker: Schedules jobs across task tracker slaves
• Data Node: Stores and serves blocks of data
• Task Tracker: Runs tasks (work units) within a job
• Hadoop Client: Contacts Name Node for data or Job Tracker to submit jobs
Data Node and Task Tracker share a physical node.
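Because the Name Node knows which nodes hold each block, work can be sent to where the data already lives. The sketch below illustrates that idea only: the block map, node names, and `pick_task_tracker` function are hypothetical, and the actual Job Tracker scheduling logic is more involved than this.

```python
# Hypothetical block map, as a Name Node might hold it: block id -> data nodes
# that store a replica (three replicas per block).
BLOCK_LOCATIONS = {
    "blk_1": ["node-a", "node-b", "node-c"],
    "blk_2": ["node-b", "node-c", "node-d"],
}


def pick_task_tracker(block_id, free_slots):
    """Choose where to run the task that processes one block.

    Prefer a node that already stores the block (data-local); otherwise fall
    back to any node with a free slot. Returns (node, is_data_local).
    """
    for node in BLOCK_LOCATIONS[block_id]:
        if free_slots.get(node, 0) > 0:
            return node, True
    for node, slots in free_slots.items():
        if slots > 0:
            return node, False
    raise RuntimeError("no free task slots in the cluster")


free = {"node-a": 0, "node-b": 1, "node-d": 2}
local_choice = pick_task_tracker("blk_1", free)    # node-b holds blk_1
remote_choice = pick_task_tracker("blk_1", {"node-a": 0, "node-d": 2})
```

In the first call a replica holder has a free slot, so the task runs next to its data; in the second none does, so the block would have to be read over the network.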
Hive vs Pig Example (count distinct values > 0)
• Hive syntax:
SELECT COUNT(DISTINCT col1)
FROM mytable
WHERE col1 > 0;
• Pig syntax:
mytable = LOAD 'myfile' AS (col1, col2, col3);
mytable = FOREACH mytable GENERATE col1;
mytable = FILTER mytable BY col1 > 0;
mytable = DISTINCT mytable;
mytable = GROUP mytable ALL;
mytable = FOREACH mytable GENERATE COUNT(mytable);
DUMP mytable;
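For readers unfamiliar with either language, the intended result of both queries (the number of distinct values of col1 that are greater than zero) can be checked against a plain Python version. The sample rows and the `count_distinct_positive` name are hypothetical.

```python
def count_distinct_positive(rows, column=0):
    """Count distinct values greater than zero in one column of the rows."""
    projected = (row[column] for row in rows)               # FOREACH ... GENERATE col1
    positive = (value for value in projected if value > 0)  # FILTER ... BY col1 > 0
    return len(set(positive))                               # DISTINCT, then COUNT


rows = [(3, "a", "x"), (0, "b", "y"), (3, "c", "z"), (7, "d", "w"), (-1, "e", "v")]
distinct_positive = count_distinct_positive(rows)
```

Only the values 3 and 7 survive the filter and de-duplication here, so the count is 2.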
Hive Agile Data Types
• STRUCTS: SELECT mytable.mycolumn.myfield FROM …
• MAPS (Hashes): SELECT mytable.mycolumn[mykey] FROM …
• ARRAYS: SELECT mytable.mycolumn[5] FROM …
• JSON: SELECT get_json_object(mycolumn, objpath)