
Apache Hadoop Today & Tomorrow · 2019-12-21

Date post: 22-May-2020
© Hortonworks, Inc. All Rights Reserved. Apache Hadoop Today & Tomorrow Eric Baldeschwieler, CEO Hortonworks, Inc. twitter: @jeric14 (@hortonworks) www.hortonworks.com

Apache Hadoop Today & Tomorrow

Eric Baldeschwieler, CEO Hortonworks, Inc.

twitter: @jeric14 (@hortonworks) www.hortonworks.com


Agenda

Brief Overview of Apache Hadoop

Where Apache Hadoop is Used

Apache Hadoop Core: Hadoop Distributed File System (HDFS), Map/Reduce

Where Apache Hadoop Is Going

Q&A


What is Apache Hadoop?

A set of open source projects owned by the Apache Foundation that transforms commodity computers and networks into a distributed service
• HDFS: stores petabytes of data reliably
• Map-Reduce: allows huge distributed computations

Key attributes
• Reliable and redundant: doesn't slow down or lose data even as hardware fails
• Simple and flexible APIs: our rocket scientists use it directly!
• Very powerful: harnesses huge clusters, supports best-of-breed analytics
• Batch-processing centric: hence its great simplicity and speed; not a fit for all use cases


What is it used for?

Internet scale data
• Web logs: years of logs at many TB/day
• Web search: all the web pages on earth
• Social data: all message traffic on Facebook

Cutting edge analytics
• Machine learning, data mining…

Enterprise apps
• Network instrumentation, mobile logs
• Video and audio processing
• Text mining

And lots more!


Apache Hadoop Projects

[Diagram: project stack]

Layers: Programming Languages, Computation, Table Storage, Object Storage

Core Apache Hadoop
• HDFS (Hadoop Distributed File System)
• MapReduce (Distributed Programming Framework)

Related Apache Projects
• Hive (SQL)
• Pig (Data Flow)
• HBase (Columnar Storage)
• HCatalog (Meta Data)

Alongside all layers
• Zookeeper (Coordination)
• HMS (Management)


Where Hadoop is Used


HADOOP @ YAHOO!

• 40K+ servers
• 170 PB storage
• 5M+ monthly jobs
• 1000+ active users


CASE STUDY YAHOO! HOMEPAGE

Personalized for each visitor
Result: twice the engagement

Recommended links, News Interests, Top Searches
• +160% clicks vs. one size fits all
• +79% clicks vs. randomly selected
• +43% clicks vs. editor selected


CASE STUDY YAHOO! HOMEPAGE

Science Hadoop cluster
» Machine learning to build ever better categorization models (weekly)

Production Hadoop cluster
» Identify user interests using categorization models
» Serving maps (users - interests) produced every five minutes

Serving systems
» Build customized home pages with latest data (thousands / second)

Engaged users generate user behavior data, which feeds back into the Hadoop clusters


CASE STUDY YAHOO! MAIL
Enabling quick response in the spam arms race

• 450M mailboxes
• 5B+ deliveries/day
• Antispam models retrained every few hours on Hadoop

"40% less spam than Hotmail and 55% less spam than Gmail"

[Diagram labels: SCIENCE, PRODUCTION]


Apache Hadoop: A Brief History

• 2006 – present: early adopters scale and productize Hadoop
• 2008 – present: other Internet companies add tools / frameworks, enhance Hadoop
• 2010 – present: service providers provide training, support, hosting
• Nascent / 2011: wide enterprise adoption funds further development, enhancements


Traditional Enterprise Architecture
Data Silos + ETL

• Traditional data warehouses, BI & analytics: EDW, data marts, BI / analytics
• Serving applications: web serving, NoSQL, RDBMS
• Unstructured systems: serving logs, social media, sensor data, text systems
• Traditional ETL & message buses connect the silos


Hadoop Enterprise Architecture
Connecting All of Your Big Data

• Traditional data warehouses, BI & analytics: EDW, data marts, BI / analytics
• Serving applications: web serving, NoSQL, RDBMS
• Unstructured systems: serving logs, social media, sensor data, text systems
• Traditional ETL & message buses
• Apache Hadoop: EsTsL (s = Store), custom analytics


Hadoop Enterprise Architecture
Connecting All of Your Big Data

[Same diagram as the previous slide, with two callouts]
• 80-90% of data produced today is unstructured
• Gartner predicts 800% data growth over next 5 years


What is Driving Adoption?

Business drivers
• Identified high value projects that require use of more data
• Belief that there is great ROI in mastering big data

Financial drivers
• Growing cost of data systems as proportion of IT spend
• Cost advantage of commodity hardware + open source
• Enables departmental-level big data strategies

Technical drivers
• Existing solutions failing under growing requirements
• 3Vs: volume, velocity, variety
• Proliferation of unstructured data


Big Data Platforms Cost per TB, Adoption

Source:

Size of bubble = cost effectiveness of solution


Apache Hadoop Core


Overview

Frameworks share commodity hardware
• Storage: HDFS
• Processing: MapReduce

Typical cluster hardware
• 1-2U servers, 20-40 nodes / rack
• 16 cores
• 48 GB RAM
• 6-12 * 2TB disk
• 1-2 GigE to node
• 2 * 10GigE between each rack switch and the network core


Map/Reduce

Map/Reduce is a distributed computing programming model. It works like a Unix pipeline:

    cat input | grep | sort | uniq -c > output
    Input | Map | Shuffle & Sort | Reduce | Output

Strengths:
• Easy to use! Developer just writes a couple of functions
• Moves compute to data: schedules work on the HDFS node with the data if possible
• Scans through data, reducing seeks
• Automatic reliability and re-execution on failure
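
The pipeline analogy can be made concrete with a small in-memory sketch. This is not Hadoop's API: `run_job`, `grep_mapper` and `count_reducer` are hypothetical names chosen for illustration, and everything runs in one process with no distribution, data locality or re-execution. It only shows the shape of the model, in which the developer supplies a map function and a reduce function and the framework handles the shuffle and sort between them.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the user's map function to every input record."""
    for record in records:
        yield from mapper(record)


def shuffle_and_sort(pairs):
    """Group values by key and return the groups in sorted key order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(groups, reducer):
    """Apply the user's reduce function to each (key, values) group."""
    for key, values in groups:
        yield from reducer(key, values)


def run_job(records, mapper, reducer):
    """Input | Map | Shuffle & Sort | Reduce | Output, all in memory."""
    return list(reduce_phase(shuffle_and_sort(map_phase(records, mapper)), reducer))


# The "couple of functions" a developer writes, mirroring grep | sort | uniq -c.
def grep_mapper(line):
    if "error" in line:
        yield line, 1


def count_reducer(key, values):
    yield key, sum(values)


if __name__ == "__main__":
    log = ["error disk", "ok", "error net", "error disk"]
    print(run_job(log, grep_mapper, count_reducer))
```

Only `grep_mapper` and `count_reducer` are specific to the job; the other functions stand in for what the framework provides.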


HDFS: Scalable, Reliable, Manageable

Scale IO, storage, CPU
• Add commodity servers & JBODs
• 4K nodes in cluster, 80

Fault tolerant & easy management
• Built-in redundancy
• Tolerates disk and node failures
• Automatically manages addition/removal of nodes
• One operator per 8K nodes!!

Storage server used for computation
• Move computation to data

Not a SAN
• But high-bandwidth network access to data via Ethernet

Immutable file system
• Read, write, sync/flush
• No random writes

[Diagram: servers connected through switches to core switches]


HDFS Use Cases

Petabytes of unstructured data for parallel, distributed analytics processing using commodity hardware

Solve problems that cannot be solved using traditional systems, at a cheaper price
• Large storage capacity (>100PB raw)
• Large IO/computation bandwidth (>4K servers): > 4 Terabit bandwidth to disk! (conservatively)
• Scale by adding commodity hardware
• Cost per GB ~= $1.5, includes MapReduce cluster


HDFS Architecture

Namenode
• Namespace state: hierarchical namespace, File Name => BlockIDs
• Block map: Block ID => Block Locations
• Persistent namespace metadata & journal (NFS)

Datanodes
• Block ID => Data, on JBOD block storage
• Heartbeats & block reports sent to the namenode

Horizontally scale IO and storage

[Diagram: four datanodes (JBOD) holding block replicas b1 b5 b3, b2 b3 b1, b3 b5 b2, and b1 b5 b2]
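
The two namenode structures on this slide can be sketched as a pair of dictionaries. This is a minimal illustration, not HDFS code: the `NameNode` class, its method names, the path `/logs/a` and the datanode names `dn1` to `dn4` are all assumptions made for the example. The block placement mirrors the replicas drawn in the diagram.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class NameNode:
    # Namespace state: file name -> ordered list of block IDs.
    namespace: Dict[str, List[str]] = field(default_factory=dict)
    # Block map: block ID -> datanodes currently reporting a replica.
    block_map: Dict[str, Set[str]] = field(default_factory=dict)

    def add_file(self, path: str, block_ids: List[str]) -> None:
        """Record a file and the blocks that make it up."""
        self.namespace[path] = list(block_ids)
        for block_id in block_ids:
            self.block_map.setdefault(block_id, set())

    def block_report(self, datanode: str, block_ids: List[str]) -> None:
        """Record a datanode's report of the blocks it stores."""
        for block_id in block_ids:
            self.block_map.setdefault(block_id, set()).add(datanode)

    def locate(self, path: str) -> List[Tuple[str, List[str]]]:
        """Return (block ID, sorted replica locations) for each block of a file."""
        return [(b, sorted(self.block_map[b])) for b in self.namespace[path]]


if __name__ == "__main__":
    nn = NameNode()
    nn.add_file("/logs/a", ["b1", "b3"])
    nn.block_report("dn1", ["b1", "b5", "b3"])
    nn.block_report("dn2", ["b2", "b3", "b1"])
    nn.block_report("dn3", ["b3", "b5", "b2"])
    nn.block_report("dn4", ["b1", "b5", "b2"])
    print(nn.locate("/logs/a"))
```

In this sketch the namespace is what has to be persisted, while the block map is rebuilt from whatever the datanodes report.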


Client Read & Write Directly from Closest Server

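Namenode: namespace state, block map

• Read: the client (1) opens the file at the namenode, then (2) reads from the datanodes
• Write: the client (1) creates the file at the namenode, then (2) writes to the datanodes
• End-to-end checksum

Horizontally scale IO and storage

[Diagram: two clients, the namenode, and four datanodes (JBOD) holding block replicas b1 b5 b3, b2 b3 b1, b3 b5 b2, and b1 b5 b2]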
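
A rough sketch of the read path with checksum verification follows. It assumes a CRC32 checksum stored next to each block and a caller-supplied `distance` function for choosing the closest replica; the slide specifies neither, and `DataNode` and `read_block` are hypothetical names rather than HDFS classes.

```python
import zlib


class DataNode:
    """Toy datanode: stores each block together with a checksum taken at write time."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block ID -> (data, stored checksum)

    def write(self, block_id, data):
        self.blocks[block_id] = (data, zlib.crc32(data))

    def read(self, block_id):
        return self.blocks[block_id]


def read_block(block_id, replicas, distance):
    """Read from the closest replica, falling back to the next on a checksum mismatch."""
    for node in sorted(replicas, key=distance):
        data, stored = node.read(block_id)
        if zlib.crc32(data) == stored:  # the client re-verifies what it received
            return data, node.name
    raise IOError(f"no valid replica of {block_id}")


if __name__ == "__main__":
    near, far = DataNode("near"), DataNode("far")
    for node in (near, far):
        node.write("b1", b"hello")
    hops = {"near": 0, "far": 2}  # assumed network distances
    print(read_block("b1", [far, near], lambda n: hops[n.name]))
```

Because the check happens on the client, corruption on disk or in transit is caught before the data is used, and the client can move on to another replica.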


Actively maintain data reliability

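Namenode: namespace state, block map

• Periodically check block checksums
• On a bad/lost block replica: (1) the namenode asks a datanode to replicate the block, (2) the block is copied to another datanode, (3) the receiving datanode reports blockReceived

[Diagram: four datanodes (JBOD) holding block replicas b1 b5 b3, b2 b3 b4, b3 b5 b2, and b1 b5 b2]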
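
The replicate / copy / blockReceived cycle can be sketched as a planning function over the block map. The target of three replicas is an assumption (the slide does not state a replication factor), and `plan_replication`, `block_received` and the datanode names are hypothetical.

```python
def plan_replication(block_map, live_nodes, target=3):
    """Return (block, source, destination) copy orders for under-replicated blocks."""
    orders = []
    for block, holders in sorted(block_map.items()):
        live = sorted(h for h in holders if h in live_nodes)
        missing = target - len(live)
        if not live or missing <= 0:
            continue  # nothing to copy from, or already fully replicated
        candidates = sorted(set(live_nodes) - set(live))
        for dest in candidates[:missing]:
            orders.append((block, live[0], dest))  # step 1: replicate order
    return orders


def block_received(block_map, block, datanode):
    """Step 3: the destination reports the new replica back to the namenode."""
    block_map[block].add(datanode)


if __name__ == "__main__":
    block_map = {"b1": {"dn1", "dn2", "dn4"}, "b2": {"dn2", "dn3", "dn4"}}
    live = {"dn1", "dn3", "dn4"}  # dn2 has failed
    orders = plan_replication(block_map, live)
    print(orders)
    for block, _source, dest in orders:  # step 2 (the copy itself) is elided here
        block_received(block_map, block, dest)
    print(plan_replication(block_map, live))
```

After the reports come back, a second planning pass finds nothing left to do, which is the steady state the slide describes.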


HBase

Hadoop ecosystem database, based on Google BigTable

Goal: hosting of very large tables (billions of rows X millions of columns) on commodity hardware
• Multidimensional sorted map: Table => Row => Column => Version => Value
• Distributed column-oriented store
• Scale: sharding etc. done automatically
• No SQL, CRUD etc.
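
The "Table => Row => Column => Version => Value" mapping can be illustrated with nested dictionaries. This is a single-process sketch of the data model only, not the HBase client API: `SortedTable` and its methods are hypothetical, the column name `info:name` is just an example, and sorting happens at scan time rather than being maintained on disk.

```python
from collections import defaultdict


class SortedTable:
    """Toy table: row key -> column -> {version: value}, scanned in row-key order."""

    def __init__(self):
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, version, value):
        self._rows[row][column][version] = value

    def get(self, row, column, version=None):
        """Return the value at `version`, or the newest version if none is given."""
        versions = self._rows.get(row, {}).get(column, {})
        if not versions:
            return None
        if version is None:
            version = max(versions)
        return versions.get(version)

    def scan(self, start, stop):
        """Yield row keys with start <= key < stop, in sorted order."""
        for row in sorted(self._rows):
            if start <= row < stop:
                yield row


if __name__ == "__main__":
    table = SortedTable()
    table.put("user2", "info:name", 1, "Bob")
    table.put("user1", "info:name", 1, "Ann")
    table.put("user1", "info:name", 2, "Anna")
    print(table.get("user1", "info:name"))     # newest version
    print(table.get("user1", "info:name", 1))  # explicit older version
    print(list(table.scan("user1", "user3")))
```

Keeping rows in key order is what makes range scans cheap and lets a table be split into contiguous key ranges for sharding.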


What’s Next


About Hortonworks – Basics

Founded: July 1st, 2011
• 22 architects & committers from Yahoo!

Mission: Architect the future of Big Data
• Revolutionize and commoditize the storage and processing of Big Data via open source

Vision: Half of the world's data will be stored in Hadoop within five years


Game Plan

Support the growth of a huge Apache Hadoop ecosystem
• Invest in ease of use, management, and other enterprise features
• Define APIs for ISVs, OEMs and others to integrate with Apache Hadoop
• Continue to invest in advancing the Hadoop core, remain the experts
• Contribute all of our work to Apache

Profit by providing training & support to the Hadoop community


Lines of Code Contributed to Apache Hadoop


Apache Hadoop Roadmap

Phase 1: Making Apache Hadoop Accessible (2011)
• Release the most stable version of Hadoop ever (Hadoop 0.20.205)
• Frequent sustaining releases
• Release directly usable code via Apache (RPMs, .debs…)
• Improve project integration (HBase support)

Phase 2: Next-Generation Apache Hadoop (2012; alphas in Q4 2011)
• Address key product gaps (HA, Management…)
• Enable partner innovation via open APIs
• Enable community innovation via modular architecture


Next-Generation Hadoop

Core
• HDFS Federation: scale out and innovation via new APIs
  Will run on 6000 node clusters with 24TB disk / node = 144PB in next release
• Next Gen MapReduce: support for MPI and many other programming models
• HA (no SPOF) and wire compatibility

Data: HCatalog 0.3
• Pig, Hive, MapReduce and Streaming as clients
• HDFS and HBase as storage systems
• Performance and storage improvements

Management & ease of use
• Ambari: an Apache Hadoop management & monitoring system
• Stack installation and centralized config management
• REST and GUI for user & administrator tasks
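
The 144PB figure quoted for HDFS Federation follows directly from the node count and per-node disk. The short check below uses decimal units (1 PB = 1000 TB). The usable-capacity function assumes three-way replication, which the slide does not state, so treat that second number as illustrative only.

```python
def raw_capacity_pb(nodes, tb_per_node):
    """Raw cluster capacity in PB, using decimal units (1 PB = 1000 TB)."""
    return nodes * tb_per_node / 1000


def usable_capacity_pb(nodes, tb_per_node, replication=3):
    """Capacity left for distinct data if every block is stored `replication` times.

    The default of 3 is an assumption made for this example.
    """
    return raw_capacity_pb(nodes, tb_per_node) / replication


if __name__ == "__main__":
    print(raw_capacity_pb(6000, 24))     # 144.0, matching the slide
    print(usable_capacity_pb(6000, 24))  # 48.0 under the assumed replication factor
```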


Thank You!


Questions

Twitter: @jeric14 (@hortonworks) www.hortonworks.com

