Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011

Page 1: Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011

Apache Hadoop in the Enterprise

Cloudera, Inc.

Amr Awadallah, Founder, CTO, VP of Engineering.

[email protected], twitter: @awadallah

Microstrategy World – January 2011 – Las Vegas

Page 2: Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011

Source: IDC White Paper (sponsored by EMC), “As the Economy Contracts, the Digital Universe Expands,” May 2009.

Unstructured Data Explosion

• 2,500 exabytes of new information in 2012 with Internet as primary driver

• Digital universe grew by 62% last year to 800K petabytes and will grow to 1.2 “zettabytes” this year

2 Copyright © 2011, Cloudera, Inc. All Rights Reserved.

Relational

Complex, Unstructured

Dramatic Changes in Enterprise Data Needs

Data Explosion

• Any Type of Data

• From Many Sources

• Instrument Everything

Hard Problems

• Complex Analysis

• At Lowest Granularity

• Data Beats Algorithm

What is Hadoop?

• A scalable fault-tolerant distributed system for data storage and processing (open source under the Apache license)

• Core Hadoop has two main components:
  • Hadoop Distributed File System (HDFS): self-healing, high-bandwidth clustered storage
  • MapReduce: fault-tolerant distributed processing

• Key business values:
  • Flexible -> Store any data, run any analysis (Mine First, Govern Later)
  • Affordable -> Cost per TB at a fraction of traditional options
  • Broadly adopted -> A large and active ecosystem
  • Proven at scale -> Several petabyte deployments in production today
  • Open Source -> No lock-in, low cost, large developer community

Cloudera’s Data Operating System (CDH)

• Open Source – 100% Apache licensed

• Simplified – Component versions & dependencies managed for you

• Reliable – Predictable release schedules, Patched with fixes to improve stability

• Many Form Factors – Public Cloud, Private Cloud, Ubuntu, RHEL, 32 or 64bit, etc.

• Integrated – All components & functions interoperate through standard API’s

• Supported – Founders, committers, contributors across all projects

[Diagram: CDH components: Hue, Hue SDK, Oozie, HBase, Flume, Sqoop, Zookeeper, Avro, Pig, Hive]

Benefit #1: Agility

Schema-on-Write (RDBMS):

• Schema must be created before data is loaded

• An explicit load operation has to take place, which transforms the data into the database’s internal structure

• New columns must be added explicitly before data for such columns can be loaded into the database

Benefits:

• Read is Fast

• Standards/Governance

Schema-on-Read (Hadoop):

• Data is simply copied to the file store; no special transformation is needed

• A SerDe (Serializer/Deserializer) is applied at read time to extract the required columns

• New data can start flowing at any time and will appear retroactively once the SerDe is updated to parse it

Benefits:

• Load is Fast

• Evolving Schemas/Agility
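
As a minimal sketch of the schema-on-read idea on this slide, the Python below keeps raw JSON lines untouched and applies a column list only when reading. The `read_rows` helper, the sample records, and the `country` field are hypothetical stand-ins for a real SerDe; this illustrates the concept and is not how Hive implements it.

```python
import json

# Raw records are stored exactly as they arrived; nothing is parsed at load time.
RAW_LINES = [
    '{"user": "alice", "amount": 30}',
    '{"user": "bob", "amount": 12, "country": "US"}',  # a new field starts flowing
]


def read_rows(lines, columns):
    """Apply the schema at read time: extract only the requested columns.

    Plays the role of a SerDe in this sketch; missing fields become None.
    """
    for line in lines:
        record = json.loads(line)
        yield tuple(record.get(col) for col in columns)


# The original reader only knows about two columns.
rows_v1 = list(read_rows(RAW_LINES, ["user", "amount"]))

# "Updating the SerDe" here just means asking for one more column.
# The stored data is not reloaded or rewritten.
rows_v2 = list(read_rows(RAW_LINES, ["user", "amount", "country"]))

print(rows_v1)
print(rows_v2)
```

The second read shows the new column retroactively for every stored record, with `None` where older records never carried it.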

Benefit #2: Data Consolidation

A single data system to enable processing across the universe of data types.

Complex Data: Documents, Web feeds, System logs, Online forums, SharePoint, Sensor data, EMB archives, Images/Video

Structured Data (“relational”): CRM, Financials, Logistics, Data Marts, Inventory, Sales records, HR records, Web Profiles

Benefit #3: Any Programming Language (Not Only SQL)

1. Java MapReduce: Gives the most flexibility and performance, but potentially long development cycle (the “assembly language” of Hadoop).

2. Streaming MapReduce (also Pipes): Allows you to develop in any programming language of your choice, but slightly lower performance and less flexibility.

3. Cascading: A thin Java library that sits on top of MapReduce and lets developers assemble complex processes.

4. Pig: A high-level language out of Yahoo, suitable for batch data flow workloads.

5. Hive: A SQL interpreter out of Facebook, also includes a meta-store mapping files to their schemas and associated SerDe.

6. Oozie: A workflow server engine that uses an XML process definition language (PDL) to create workflows of jobs composed of any of the above.

Benefit #4: Balancing Return on Investment (or Byte!)

• Return on Byte (ROB) = value to be extracted from that byte divided by the cost of storing that byte

• If ROB is < 1, the data will be buried in the tape wasteland; thus we need more economical active storage.

[Chart labels: Low ROB, High ROB]
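
The Return on Byte ratio defined on this slide can be worked through with a few lines of Python. All dollar figures below are invented purely for illustration, and `return_on_byte` is a hypothetical helper name, not something from the talk.

```python
def return_on_byte(value_extracted: float, storage_cost: float) -> float:
    """ROB = value extracted from the data / cost of storing that data."""
    if storage_cost <= 0:
        raise ValueError("storage_cost must be positive")
    return value_extracted / storage_cost


# Hypothetical figures for one terabyte of log data kept for a year.
value_per_tb = 500.0            # assumed business value, in dollars
expensive_cost_per_tb = 2000.0  # assumed cost on a premium storage tier
cheap_cost_per_tb = 200.0       # assumed cost on commodity storage

rob_expensive = return_on_byte(value_per_tb, expensive_cost_per_tb)
rob_cheap = return_on_byte(value_per_tb, cheap_cost_per_tb)

print(rob_expensive)  # below 1: not worth keeping on this tier
print(rob_cheap)      # above 1: worth keeping online
```

Under these assumed numbers the same data moves from ROB < 1 to ROB > 1 only because the storage cost drops, which is the point the slide makes about economical active storage.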

Use The Right Tool For The Right Job

Relational Databases:

Use when:

• Interactive OLAP Analytics (<1sec)

• Multistep ACID Transactions

• SQL Compliance

Hadoop:

Use when:

• Structured or Not (Agility)

• Scalable Storage/Compute

• Complex Data Processing

Where Does Hadoop Fit in the Enterprise Data Stack?

[Diagram labels: Logs, Files, Web Data; Enterprise Data Warehouse; Web Application; Enterprise Reporting; BI, Analytics; Analysts; Business Users; Users; IDEs; Data Scientists; Relational Databases; Low-Latency Serving Systems; Cloudera Mgmt Apps; System Administrators; Data Architects]

Apache Hive Features

• A subset of SQL covering the most common statements

• JDBC/ODBC support

• Agile data types: Array, Map, Struct, and JSON objects

• Pluggable SerDe system to work on unstructured files directly.

• User Defined Functions and Aggregates

• Regular Expression support

• MapReduce support

• Partitions and Buckets (for performance optimization)

• In The Works: Indices, Columnar Storage, Views, Microstrategy compatibility, Explode/Collect

• More details: http://wiki.apache.org/hadoop/Hive

Broad Adoption in Key Verticals

Verticals: Financial Services, Telecom, Retail, Government

Stakeholders: Risk Analysts, Intelligence, Research Insight Team, IT: Operations, IT: Data Engineering

Example Applications:

• Risk management (Financial Services): “Examine purchase behavior across debit and credit properties to better identify high-risk customers.”

• BSS (Telecom): “Analyze calling patterns among users and current capacity to forecast traffic growth and locate new towers.”

• Brand Equity (Retail): “Monitor customer and product data recorded across internal & external sources to trend brand valuation.”

• Traffic Analysis (Government): “Use multimedia data from various sources to build an actionable graph of relationships among targets.”

Customers

How are Customers Using Cloudera?

Analyze search terms and subsequent user purchase decisions to tune search results, increase conversion rates

Digest long-term historical trade data to identify fraudulent activity and build real-time fraud prevention

Model site visitor behavior with analytics that deliver better recommendations for new purchases

Continually refine predictive models for advertising response rates to deliver more precisely targeted advertisements

Replace expensive legacy ETL system with more flexible, cheaper infrastructure that is 20 times faster

Correlate educational outcomes with programs and student histories to improve results

Examine customer behavior to improve loan risk scoring

Answering Questions that Were Impossible to Ask Before

Big Bank

More: http://www.cloudera.com/company/press-center/hadoop-world-nyc/

Cloudera Offerings

Software Services Training

Facilitating enterprise adoption of Hadoop

Cloudera Enterprise: Enterprise Support and Management Applications

• Improves conformance to important IT SLAs, policies and procedures

• Lowers the cost of management and administration

• Increases reliability and consistency of the platform

• Certified integration with RDBMS, ETL, BI, Server, and Cloud Systems

Integrating with Existing IT Infrastructure

Integration categories: RDBMS, Cloud/OS, Hardware, BI/Analytics, ETL

MicroStrategy (for interactive Dashboards)

Informatica (for Extract-Transform-Load, aka ETL)

Summary

• Cloudera’s Data OS (CDH) enables:

• Data Agility (Evolving Schemas)

• Consolidation (Structured or Not)

• Complex Data Processing (Any Language)

• Economical Storage (Enable Return-on-Byte > 1)

• Cloudera Enterprise enables:

• Conformance to important IT SLAs, policies and procedures

• Lower cost of management and administration

• Increased reliability and consistency

• Certified integration with existing IT infrastructure

Contact Information and Free Hadoop Book

Amr Awadallah

CTO, Cloudera, Inc.

[email protected]

650-644-3921

twitter.com/awadallah

twitter.com/cloudera

Appendix

Cloudera Overview

Hadoop…

Jeff Hammerbacher, Chief Scientist

Amr Awadallah, CTO, VP Engineering

Doug Cutting, Chief Architect

… meets enterprise

Mike Olson – CEO

Omer Trajman – VP, Customer Solutions

John Kreisa – VP, Marketing

Charles Zedlewski – VP, Product Management

Ed Albanese – Head of Business Development

Investors: Accel Partners, Greylock Partners, Meritech Capital Partners

Product category: Data Management

Business model: Cloudera offers Software, Support, Training, and Professional Services

Employees: 70+

Customers: 75+

Headquarters: Palo Alto, California

Elevator pitch: The leading provider of Apache Hadoop-based software and services for the enterprise

Vision: We enable organizations to profit from all of their data

Why CDH (Cloudera Distribution for Hadoop)?

Feature: Benefit

It’s packaged: Much easier for users to install CDH than any other form of Hadoop.

It’s patched: This makes CDH more stable and secure than just downloading an Apache branch.

It’s proven: Thousands of organizations already use CDH today, so risk is lower.

It’s highly functional: CDH will cover more use cases and users will be more productive than if they were just using core Hadoop.

It’s integrated: Save time (of piecing a system together yourself) and lower risk (of choosing the wrong combination of versions or patches).

It’s the accepted standard: More of your preexisting investments in RDBMS, ETL and BI work best with CDH.

It’s supported: CDH is one of only two distributions that has a commercial entity standing behind it.

It’s 100% Apache licensed: Investment in this technology is insured.

Hadoop Timeline

2002 2003 2004 2005 2006 2007 2008 2009

Doug Cutting & Mike Cafarella started working on Nutch

Google publishes GFS & MapReduce papers

Cutting adds DFS & MapReduce support to Nutch

Yahoo! hires Cutting, Hadoop spins out of Nutch

Web-scale deployments at Y!, Facebook, Last.fm

Fastest sort of a TB, 3.5mins over 910 nodes

NY Times converts 4TB of image archives over 100 EC2s

Fastest sort of a TB, 62secs over 1,460 nodes

Sorted a PB in 16.25hours over 3,658 nodes

Hadoop Summit 2009, 750 attendees

Cloudera Founded

Cloudera hires Cutting

10 Common Hadoop-able Problems

1. Modeling true risk

2. Customer churn analysis

3. Recommendation engine

4. Ad targeting

5. PoS transaction analysis

6. Analyzing network data to predict failure

7. Threat analysis

8. Trade surveillance

9. Search quality

10. Data “sandbox”

Case Studies: Hadoop World 2009

• VISA: Large Scale Transaction Analysis
• JP Morgan Chase: Data Processing for Financial Services
• China Mobile: Data Mining Platform for Telecom Industry
• Rackspace: Cross Data Center Log Processing
• Booz Allen Hamilton: Protein Alignment using Hadoop
• eHarmony: Matchmaking in the Hadoop Cloud
• General Sentiment: Understanding Natural Language
• Yahoo!: Social Graph Analysis
• Visible Technologies: Real-Time Business Intelligence

Slides and Videos: http://www.cloudera.com/hadoop-world-nyc

Case Studies: Hadoop World 2010

• eBay: Hadoop at eBay
• Twitter: The Hadoop Ecosystem at Twitter
• Yale University: MapReduce and Parallel Database Systems
• General Electric: Sentiment Analysis powered by Hadoop
• Facebook: HBase in Production
• AOL: AOL’s Data Layer
• Raytheon: SHARD: Storing and Querying Large-Scale Data
• StumbleUpon: Mixing Real-Time and Batch Processing

More Info: http://www.cloudera.com/company/press-center/hadoop-world-nyc/

Hadoop Design Axioms

1. System Shall Manage and Heal Itself

2. Performance Shall Scale Linearly

3. Compute Should Move to Data

4. Simple Core, Modular and Extensible

HDFS: Hadoop Distributed File System

Block Size = 64MB

Replication Factor = 3

Cost/GB is a few ¢/month vs $/month
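
To make the block size and replication figures on this slide concrete, here is a small back-of-the-envelope calculation in Python. The 1,000 MB file and the `hdfs_footprint` helper are hypothetical; the sketch assumes the slide's defaults and that a partially filled final block only consumes the bytes it actually holds.

```python
import math

BLOCK_SIZE_MB = 64       # block size from the slide
REPLICATION_FACTOR = 3   # replication factor from the slide


def hdfs_footprint(file_size_mb: float):
    """Return (blocks, block replicas, raw MB consumed) for a single file."""
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    replicas = blocks * REPLICATION_FACTOR
    raw_mb = file_size_mb * REPLICATION_FACTOR
    return blocks, replicas, raw_mb


# A hypothetical 1,000 MB file.
blocks, replicas, raw_mb = hdfs_footprint(1000)
print(blocks, replicas, raw_mb)
```

With three copies of every block, raw capacity needs are roughly three times the logical data size, which is why the per-GB cost of the underlying disks matters so much.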

MapReduce: Distributed Processing

MapReduce Example for Word Count

[Diagram: input Splits 1..N feed Map tasks 1..M, each reading (docid, text) and emitting (words, counts); the Shuffle delivers (sorted words, counts) to Reduce tasks 1..R, each writing an Output File of (sorted words, sum of counts).]

Map(in_key, in_value) => list of (out_key, intermediate_value)

Reduce(out_key, list of intermediate_values) => out_value(s)

Example: input text “To Be Or Not To Be?”; map outputs (Be, 5), (Be, 12), (Be, 7), (Be, 6) are summed by a reducer into (Be, 30).

Unix pipeline analogy:

cat *.txt | mapper.pl | sort | reducer.pl > out.txt
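
The map, shuffle, and reduce stages of this word count can be sketched in a single Python process. This is only an illustration of the data flow under simplifying assumptions (one machine, one tiny document); `map_fn`, `shuffle`, and `reduce_fn` are hypothetical names and not part of any Hadoop API.

```python
from collections import defaultdict


def map_fn(doc_id, text):
    """Map(in_key, in_value) => list of (out_key, intermediate_value)."""
    return [(word.strip('?.,!"').lower(), 1) for word in text.split()]


def shuffle(pairs):
    """Group intermediate values by key and return the keys in sorted order."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())


def reduce_fn(word, counts):
    """Reduce(out_key, list of intermediate_values) => out_value."""
    return word, sum(counts)


documents = {1: "To Be Or Not To Be?"}

# Map phase: every document yields (word, 1) pairs.
intermediate = [
    pair for doc_id, text in documents.items() for pair in map_fn(doc_id, text)
]

# Shuffle groups the pairs by word; reduce sums each group.
word_counts = [reduce_fn(word, counts) for word, counts in shuffle(intermediate)]
print(word_counts)
```

In a real cluster the map calls run in parallel across splits and the shuffle moves data between machines, but the contract between the three stages is the same as in this sketch.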

Hadoop High-Level Architecture

Name Node: Maintains mapping of file blocks to data node slaves

Job Tracker: Schedules jobs across task tracker slaves

Data Node: Stores and serves blocks of data

Task Tracker: Runs tasks (work units) within a job

Hadoop Client: Contacts Name Node for data or Job Tracker to submit jobs

Data Node and Task Tracker share a physical node.

Hive vs Pig Example (count distinct values > 0)

• Hive syntax:

SELECT COUNT(DISTINCT col1)

FROM mytable

WHERE col1 > 0;

• Pig syntax:

mytable = LOAD 'myfile' AS (col1, col2, col3);

mytable = FOREACH mytable GENERATE col1;

mytable = FILTER mytable BY col1 > 0;

mytable = DISTINCT mytable;

grouped = GROUP mytable ALL;

result = FOREACH grouped GENERATE COUNT(mytable);

DUMP result;
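
For readers unfamiliar with either language, the same count-distinct logic can be traced step by step in plain Python. The sample rows are invented for illustration, and each line is annotated with the Pig-style step it loosely corresponds to; this is a sketch of the data flow, not of how Hive or Pig executes the query.

```python
# Hypothetical rows of (col1, col2, col3).
rows = [
    (3, "a", "x"),
    (0, "b", "y"),
    (3, "c", "z"),
    (7, "d", "w"),
    (-1, "e", "v"),
]

col1_values = [row[0] for row in rows]         # project col1 only
positive = [v for v in col1_values if v > 0]   # keep values greater than 0
distinct = set(positive)                       # drop duplicates
count = len(distinct)                          # count what remains

print(count)
```

The Hive query states the result it wants in one statement, while the Pig script, like this Python version, spells out each transformation in order.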

Hive Agile Data Types

• STRUCTS:
  • SELECT mytable.mycolumn.myfield FROM …

• MAPS (Hashes):
  • SELECT mytable.mycolumn[mykey] FROM …

• ARRAYS:
  • SELECT mytable.mycolumn[5] FROM …

• JSON:
  • SELECT get_json_object(mycolumn, objpath)
