Post on 24-Dec-2015

Under the covers of HDP for Windows

DBI-B387

Speaker

Rohit Bakhshi, Product Manager, Hortonworks

Agenda

Modern Data Architecture

Hadoop for Windows

Hortonworks Data Platform under the covers

Q&A

Modern Data Architecture

What Makes Up Big Data?

Megabytes

Gigabytes

Terabytes

Petabytes

Purchase detail

Purchase record

Payment record

ERP

CRM

WEB

BIG DATA

Offer details

Support Contacts

Customer Touches

Segmentation

Web logs

Offer history

A/B testing

Dynamic Pricing

Affiliate Networks

Search Marketing

Behavioral Targeting

Dynamic Funnels

User Generated Content

Mobile Web

SMS/MMS

Sentiment

External Demographics

HD Video, Audio, Images

Speech to Text

Product/Service Logs

Social Interactions & Feeds

Business Data Feeds

User Click Stream

Sensors / RFID / Devices

Spatial & GPS Coordinates

Increasing Data Variety and Complexity

Transactions + Interactions + Observations = BIG DATA

A data architecture under pressure from new data

APPLICATIONS

DATA SYSTEM

REPOSITORIES

SOURCES

Existing Sources

(CRM, ERP, Clickstream, Logs)

RDBMS EDW MPP

Business Analytics

Custom Applications

Packaged Applications

Source: IDC

2.8 ZB in 2012

85% from New Data Types

15x Machine Data by 2020

40 ZB by 2020

OLTP, ERP, CRM Systems

Unstructured documents, emails

Clickstream

Server logs

Sentiment, Web Data

Sensor, Machine Data

Geolocation

Hadoop within an emerging Modern Data Architecture

OPERATIONS TOOLS

Provision, Manage & Monitor

DEV & DATA TOOLS

Build & Test

DATA SYSTEM

REPOSITORIES

SOURCES

RDBMS EDW MPP

OLTP, ERP, CRM Systems

Documents, Emails

Web Logs, Click Streams

Social Networks

Machine Generated

Sensor Data

Geolocation Data

Governance & Integration

Security

Operations

Data Access

Data Management

APPLICATIONS

Business Analytics

Custom Applications

Packaged Applications

OLTP, ERP, CRM Systems

Unstructured documents, emails

Clickstream

Server logs

Sentiment, Web Data

Sensor, Machine Data

Geolocation

Hadoop: typically used for new analytic applications…

SCALE

SCOPE

New Analytic Apps

New types of data

LOB-driven

… and incrementally delivers a ‘Data Lake’

SCALE

SCOPE

A Modern Data Architecture/Data Lake

New Analytic Apps

New types of data

LOB-driven

RDBMS

MPP

EDW

Governance & Integration

Security

Operations

Data Access

Data Management

Data Lake

An architectural shift in the data center that uses Hadoop to deliver deeper insight across a large, broad, diverse set of data at efficient scale

Hadoop for Windows

HDP for Windows

HDP 2.1: Hortonworks Data Platform

GOVERNANCE & INTEGRATION

Data Workflow, Lifecycle & Governance: Falcon, Sqoop, Flume, WebHDFS

DATA ACCESS

Batch: Map Reduce

Script: Pig

SQL: Hive/Tez, HCatalog

NoSQL: HBase

Stream: Storm

Search: Solr

Others: In-Memory Analytics, ISV engines

YARN: Data Operating System

DATA MANAGEMENT

HDFS (Hadoop Distributed File System), nodes 1 … N

SECURITY

Authentication, Authorization, Accounting, Data Protection

Storage: HDFS

Resources: YARN

Access: Hive, …

Pipeline: Falcon

Cluster: Knox

OPERATIONS

Provision, Manage & Monitor: Ambari (SCOM), Zookeeper

Scheduling: Oozie

Deployment Choice: Linux, Windows, On-Premise, Cloud

Hortonworks Data Platform (HDP)

The Only Completely Open Distribution for Apache Hadoop

Fundamentally Versatile and Comprehensive enterprise capabilities

Wholly Integrated for deep ecosystem interoperability

HDP: Enterprise Data Platform

HDP certifies most recent & stable community innovation

Hortonworks Data Platform releases: HDP 1.3 (May 2013), HDP 2.0 (October 2013), HDP 2.1 (April 2014)

Component versions shown across these releases, oldest to newest:

Hadoop & YARN: 1.1.2, 2.2.0, 2.4.0

Pig: 0.11.0, 0.12.0, 0.12.1

Tez: 0.4.0

Hive & HCatalog: 0.11.0, 0.12.0, 0.13.0

HBase: 0.94.6, 0.96.1, 0.98.0

Phoenix: 4.0.0

Storm: 0.9.1

Solr: 4.7.2

Mahout: 0.7.0, 0.8.0, 0.9.0

Sqoop: 1.4.3, 1.4.4

Flume: 1.3.1, 1.4.0

Falcon: 0.5.0

Knox: 0.4.0

Ambari: 1.2.5, 1.4.4, 1.5.1

Oozie: 3.3.2, 4.0.0

Zookeeper: 3.4.5

Component categories: Data Management, Data Access, Governance & Integration, Security, Operations

Seamless Interoperability

Integrations with Microsoft tools for native big data analysis

SOURCES

APPLICATIONS

OPERATIONAL TOOLS

DEV & DATA TOOLS

INFRASTRUCTURE

DATA SYSTEM

HDInsight

Azure

New! Power BI

Right Tool for the Right Usage

Platforms placed along SCALE (storage & processing): Traditional Database, EDW, MPP Analytics, NoSQL, Hadoop Platform

Comparison (Traditional Database | Hadoop Platform):

schema: Required on write | Required on read

speed: Reads are fast | Writes are fast

governance: Standards and structured | Loosely structured

processing: Limited, no data processing | Processing coupled with data

data types: Structured | Multi and unstructured

best fit use: Interactive OLAP Analytics, Complex ACID Transactions, Operational Data Store | Data Discovery, Processing unstructured data, Massive Storage/Processing
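
The "required on write" versus "required on read" contrast can be illustrated with a small sketch. It is not taken from the talk: the sample events, the field names and the helper functions are all hypothetical, and plain Python lists stand in for a database table and for files in Hadoop.

```python
import json

# Hypothetical raw events; the field names are invented for illustration.
RAW_EVENTS = [
    '{"user": "u1", "amount": "19.99", "channel": "web"}',
    '{"user": "u2", "amount": "5.00"}',
    '{"user": "u3", "note": "free-text with no amount"}',
]

REQUIRED_COLUMNS = ("user", "amount", "channel")


def load_schema_on_write(lines):
    """Validate against a fixed schema before storing; reject non-conforming rows."""
    table, rejected = [], []
    for line in lines:
        record = json.loads(line)
        if all(col in record for col in REQUIRED_COLUMNS):
            table.append({col: record[col] for col in REQUIRED_COLUMNS})
        else:
            rejected.append(line)
    return table, rejected


def load_schema_on_read(lines):
    """Store every line untouched; no structure is imposed at write time."""
    return list(lines)


def total_amount_on_read(stored_lines):
    """Apply a schema only when a question is asked, skipping rows that lack the field."""
    total = 0.0
    for line in stored_lines:
        record = json.loads(line)
        if "amount" in record:
            total += float(record["amount"])
    return total


table, rejected = load_schema_on_write(RAW_EVENTS)
lake = load_schema_on_read(RAW_EVENTS)
print(len(table), "row stored,", len(rejected), "rejected on write")
print(len(lake), "rows stored on read; total amount =", round(total_amount_on_read(lake), 2))
```

In this sketch the schema-on-write loader keeps only the one fully conforming row, while the schema-on-read path keeps all three and defers interpretation to query time, which is the trade-off the comparison above describes.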

Maximize Hadoop Deployment Choice

Hortonworks Data Platform (HDP) for Windows: 100% Apache open source Hadoop software for Windows Server

Microsoft Azure HDInsight: Hadoop-based managed service in the cloud via Microsoft Azure

Microsoft Analytics Platform System (APS): Scale-out appliance with data warehousing and Hadoop in one box

All offerings co-engineered by Hortonworks and Microsoft

Enjoy seamless interoperability across on-premises and cloud

HDP under the covers

Data Operating System of Hadoop

Single Cluster, Shared Data Set, Multiple Workloads

Support a range of access patterns

Shared operational services

HDP 2.1: Core Platform

DATA ACCESS

Batch: Map Reduce

Script: Pig

SQL: Hive/Tez, HCatalog

NoSQL: HBase, Accumulo

Stream: Storm

Search: Solr

Others: In-Memory Analytics, ISV engines

YARN: Data Operating System

DATA MANAGEMENT

HDFS (Hadoop Distributed File System), nodes 1 … N

YARN: Next Generation Hadoop

1st Gen of Hadoop: Single Use System (Batch Apps)

MapReduce (cluster resource management & data processing)

HDFS (redundant, reliable storage)

2nd Gen of Hadoop: Multi Use Data Platform (Batch, Interactive, Online, Streaming, …)

Classic Hadoop Apps

Flexible Data Processing: Hive, Pig, others…

Batch: MapReduce

Batch & Interactive: Tez

Online Data Processing: HBase, Accumulo

Stream Processing: Storm

others

Efficient Cluster Resource Management & Shared Services (YARN)

Redundant, Reliable Storage (HDFS)

YARN: Data Operating System

[Diagram: a ResourceManager with its Scheduler places work from three workloads on a shared grid of NodeManagers. Batch: map1.1, map1.2, reduce1.1. Interactive SQL: vertex1.1.1, vertex1.1.2, vertex1.2.1, vertex1.2.2. Real-Time: nimbus0, nimbus1, nimbus2.]

HDP 2.1 SQL Access: Stinger Initiative

Next generation SQL based interactive query in Hadoop

Speed: Interactive Hive query response

Scale: Queries that scale from TB to PB

SQL: Broadest range of SQL semantics for analytic applications

[Diagram: Business Analytics and Custom Apps issue SQL to Apache Hive, which runs on Apache Tez or Apache MapReduce over Apache YARN and HDFS (Hadoop Distributed File System), nodes 1 … N.]

Apache Hive Contribution… an Open Community at its finest

1,672 Jira Tickets Closed

145 Developers

44 Companies

~390,000 Lines Of Code Added… (2x)

13 Months

Apache Tez (“Speed”)

Replaces MapReduce as primitive for Hive, Pig, etc

Task with pluggable Input, Processor and Output

Tez Task - <Input, Processor, Output>
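
The idea of a task assembled from a pluggable Input, Processor and Output can be sketched as follows. This is a conceptual illustration only, not the Tez API (Tez is a Java library); every name here (`Task`, `make_list_input`, `uppercase`, `make_list_output`) is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

# Hypothetical names throughout: this is not the Tez API.
Input = Callable[[], Iterable[str]]
Processor = Callable[[Iterable[str]], Iterable[str]]
Output = Callable[[Iterable[str]], None]


@dataclass
class Task:
    """A unit of work assembled from three independently swappable parts."""
    read: Input
    process: Processor
    write: Output

    def run(self) -> None:
        self.write(self.process(self.read()))


def make_list_input(rows: List[str]) -> Input:
    return lambda: iter(rows)


def uppercase(rows: Iterable[str]) -> Iterable[str]:
    return (row.upper() for row in rows)


def make_list_output(sink: List[str]) -> Output:
    return lambda rows: sink.extend(rows)


# Chain two tasks in memory: the first task's output list is the second task's input.
intermediate: List[str] = []
final: List[str] = []
first = Task(make_list_input(["ca", "ny", "ca"]), uppercase, make_list_output(intermediate))
second = Task(lambda: iter(intermediate), lambda rows: sorted(set(rows)), make_list_output(final))
first.run()
second.run()
print(final)
```

Because each part is just a callable, swapping the in-memory output for one that writes to a file, or the processor for a different transformation, would not change the `Task` itself.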

Hive with Tez as execution engine

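Hive – MR: [Diagram: the query runs as a chain of separate MapReduce jobs (map stages M, reduce stages R), and each job writes its intermediate result to HDFS before the next job reads it.]

Hive – Tez: [Diagram: the same query runs as one graph of map and reduce stages, with reduce stages feeding later reduce stages directly and no intermediate HDFS writes.]

Stage labels in both plans: SELECT a.state, c.itemId; SELECT c.price; JOIN (a, c); SELECT b.id; JOIN (a, b); GROUP BY a.state; COUNT(*) and the average of c.price.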

SELECT a.state, COUNT(*), AVG(c.price)

FROM a

JOIN b ON (a.id = b.id)

JOIN c ON (a.itemId = c.itemId)

GROUP BY a.state

Tez avoids unneeded writes to HDFS
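
To see what the query on this slide computes, here is a minimal sketch that runs the same join-and-aggregate shape on tiny tables. Python's built-in `sqlite3` is used purely as a stand-in for Hive so the example is self-contained; the table contents are invented, and an `ORDER BY` is added only to make the output order deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE a (id INTEGER, itemId INTEGER, state TEXT);
    CREATE TABLE b (id INTEGER);
    CREATE TABLE c (itemId INTEGER, price REAL);
    -- Invented sample rows; id 4 has no match in b and drops out of the join.
    INSERT INTO a VALUES (1, 10, 'CA'), (2, 20, 'CA'), (3, 10, 'NY'), (4, 20, 'TX');
    INSERT INTO b VALUES (1), (2), (3);
    INSERT INTO c VALUES (10, 4.0), (20, 6.0);
    """
)
rows = conn.execute(
    """
    SELECT a.state, COUNT(*), AVG(c.price)
    FROM a
    JOIN b ON (a.id = b.id)
    JOIN c ON (a.itemId = c.itemId)
    GROUP BY a.state
    ORDER BY a.state
    """
).fetchall()
conn.close()
for state, n, avg_price in rows:
    print(state, n, avg_price)
```

The query has two joins followed by a grouped aggregation. Under MapReduce each of those steps would be a separate job with a write to HDFS in between, whereas the slide shows Tez running them as one graph.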

Hive: Enhanced SQL Semantics

Hive SQL Datatypes: INT; TINYINT/SMALLINT/BIGINT; BOOLEAN; FLOAT; DOUBLE; STRING; TIMESTAMP; BINARY; DECIMAL; ARRAY, MAP, STRUCT, UNION; DATE; VARCHAR; CHAR

Hive SQL Semantics: SELECT, INSERT; GROUP BY, ORDER BY, SORT BY; JOIN on explicit join key; Inner, outer, cross and semi joins; Sub-queries in FROM clause; ROLLUP and CUBE; UNION; Windowing Functions (OVER, RANK, etc); Custom Java UDFs; Standard Aggregation (SUM, AVG, etc.); Advanced UDFs (ngram, Xpath, URL); Sub-queries for IN/NOT IN, HAVING; Expanded JOIN Syntax; INTERSECT / EXCEPT

Legend: Hive 0.11, Hive 0.12 (HDP 2.0), Hive 0.13 (HDP 2.1)

SQL Compliance

Hive provides a wide array of SQL datatypes and semantics so your existing tools integrate more seamlessly with Hadoop

HDP 2.1: Data Governance & Integration

Apache Falcon

Simplified Data Governance for Enterprise Hadoop

Provides key governance framework for:

Acquisition & processing of data sets

Replication & retention of data sets

Redirect data sets to non-Hadoop extensions

Provides audit trail & lineage

Apache Falcon: Replication

Disaster Recovery and Backup between environments

Publishing data between environments for Discovery

Site to Site

Site to Cloud

Apache Falcon: Retention

Define sophisticated retention policies

Simplify data retention for audit, compliance, or for data re-processing

Staged Data: Retain 5 Years

Cleansed Data: Retain 3 Years

Conformed Data: Retain 3 Years

Presented Data: Retain Last Copy Only
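
The retention rules on this slide can be read as a small policy table. The sketch below evaluates such a table in plain Python. It does not use Falcon's own entity definitions or API; the stage keys, the helper names and the sample copy dates are all hypothetical.

```python
from datetime import date

# Retention limits from the slide, in years; None means "keep only the latest copy".
RETENTION_YEARS = {"staged": 5, "cleansed": 3, "conformed": 3, "presented": None}


def years_before(day: date, years: int) -> date:
    """Return the same calendar day `years` earlier (Feb 29 falls back to Feb 28)."""
    try:
        return day.replace(year=day.year - years)
    except ValueError:
        return day.replace(year=day.year - years, day=28)


def expired(stage: str, copies: list, today: date) -> list:
    """Return the copy dates that the policy for `stage` no longer needs to keep."""
    limit = RETENTION_YEARS[stage]
    if limit is None:
        return sorted(copies)[:-1]  # everything except the newest copy
    cutoff = years_before(today, limit)
    return sorted(c for c in copies if c < cutoff)


# Invented sample data: four copies of a data set and a fixed evaluation date.
today = date(2014, 6, 1)
copies = [date(2008, 1, 1), date(2010, 7, 1), date(2012, 1, 1), date(2014, 5, 1)]
for stage in RETENTION_YEARS:
    print(stage, [c.isoformat() for c in expired(stage, copies, today)])
```

With these sample dates, the staged policy expires only the 2008 copy, the three-year policies also expire the 2010 copy, and the presented policy keeps just the newest copy.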

HDP 2.1: Search

Apache Solr

Open source enterprise search for Hadoop

Simple, powerful UI for advanced search applications

High performance indexing & sub-second search times over billions of documents

Search Architecture

[Diagram: raw files (HTML, PDF, Word, XML, logs) stored in HDFS (Hadoop Distributed File System) are turned into indexed documents by a MapReduce indexing job. Solr instances, built on Lucene, serve the index. A search web app sends each query to Solr and receives the response.]
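
The indexing flow in this architecture (raw files in, indexed documents out, queries answered from the index) can be illustrated with a toy map/reduce-style inverted index. This is a sketch of the general technique, not how Solr or Lucene are implemented; the documents and function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical documents standing in for raw files in HDFS.
DOCS = {
    "log1": "error disk full",
    "log2": "disk replaced",
    "page1": "hadoop on windows",
}


def map_phase(doc_id, text):
    """Emit one (term, doc_id) pair per distinct term in a document."""
    return [(term, doc_id) for term in set(text.lower().split())]


def reduce_phase(pairs):
    """Group doc ids by term to form an inverted index."""
    index = defaultdict(set)
    for term, doc_id in pairs:
        index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every query term."""
    terms = query.lower().split()
    if not terms:
        return []
    hits = set.intersection(*(index.get(t, set()) for t in terms))
    return sorted(hits)


pairs = [pair for doc_id, text in DOCS.items() for pair in map_phase(doc_id, text)]
index = reduce_phase(pairs)
print(search(index, "disk"))
print(search(index, "disk full"))
```

The map step can run independently per document and the reduce step per term, which is why this kind of indexing suits a batch job over many files.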

HDP 2.1: Stream Processing

Apache Storm

Real-time event processing for sensor and business activity monitoring

Unlocks new business cases for Hadoop

Scale: Ingest millions of events per second. Fast query on petabytes of data

http://storm.incubator.apache.org/

HDP 2.1: Perimeter Security

Apache Knox

A common place to perform authentication across Hadoop and all related projects

Integrated with LDAP and AD

Secure interfaces for: WebHDFS, WebHCat, Oozie, Hive & HBase

Broad community effort, incubated with Microsoft, broad set of developers involved

Apache Knox: Perimeter Security

[Diagram: REST clients, JDBC clients and browsers reach the Knox Gateway (GW) through a firewall. The gateway is a stateless reverse proxy instance deployed in the DMZ and authenticates against enterprise identity providers (LDAP/AD). Behind a second firewall sit HDP Cluster 1 and HDP Hadoop Cluster 2, each with masters (JT, NN, WebHCat, Oozie, YARN, HBase, Hive) and DN/TT nodes.]

Requests streamed through GW to Hadoop services after auth.

URLs rewritten to refer to gateway
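
The "URLs rewritten to refer to gateway" step can be sketched as a simple prefix mapping. This is an illustration of the reverse-proxy idea only, not Knox's rewrite engine; the gateway address, the internal host names, the ports and the helper names below are all assumptions made for the example.

```python
from urllib.parse import urlsplit

# Assumed values for illustration only; real deployments define their own hosts and paths.
GATEWAY = "https://knox.example.com:8443/gateway/cluster1"
INTERNAL_PREFIXES = {
    "http://namenode.internal:50070/webhdfs": "/webhdfs",
    "http://webhcat.internal:50111/templeton": "/templeton",
}


def rewrite_to_gateway(internal_url: str) -> str:
    """Replace an internal service prefix with the gateway address, keeping path and query."""
    for prefix, service_path in INTERNAL_PREFIXES.items():
        if internal_url.startswith(prefix):
            remainder = internal_url[len(prefix):]
            return GATEWAY + service_path + remainder
    raise ValueError("no gateway mapping for " + internal_url)


def host_of(url: str) -> str:
    """Return only the host name of a URL."""
    return urlsplit(url).hostname


internal = "http://namenode.internal:50070/webhdfs/v1/tmp/data.csv?op=OPEN"
external = rewrite_to_gateway(internal)
print(external)
print(host_of(internal), "->", host_of(external))
```

After rewriting, the client only ever sees the gateway host, so the internal cluster host names stay behind the firewall.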

Operating Enterprise Hadoop

Ambari: Deploy, Manage, Monitor

[Diagram: AMBARI WEB calls the REST APIs of the AMBARI SERVER (PROVISION | MANAGE | MONITOR), which provisions, manages and monitors the cluster's compute & storage nodes.]

Ambari SCOM

Enables Microsoft System Center Operations Manager (SCOM) to monitor Hadoop

Ambari SCOM Management Pack gives insight into the performance and health of Hadoop

Ambari SCOM leverages the Ambari framework to aggregate and expose Hadoop metrics

Ambari SCOM Mgmt Pack

Ambari SCOM Server

HADOOP: Storage & Process at Scale

Ambari SCOM Server aggregates + exposes Hadoop metrics

Ambari SCOM monitors health + alerts in case of problems

HDP - Reference Architecture

For More Information

Web: hortonworks.com/products/hdp-windows/, hortonworks.com/labs/microsoft/, microsoft.com/bigdata

Training: hortonworks.com/hadoop-training/hadoop-on-windows/

Online documentation: docs.hortonworks.com

Forums: hortonworks.com/community/forums/

Questions?

Track resources

Download Microsoft SQL Server 2014 http://www.trySQLSever.com

Try out Power BI for Office 365! http://www.powerbi.com

Sign up for Microsoft HDInsight today! http://microsoft.com/bigdata

Resources

Learning

Microsoft Certification & Training Resources

www.microsoft.com/learning

msdn

Resources for Developers

http://microsoft.com/msdn

TechNet

Resources for IT Professionals

http://microsoft.com/technet

Sessions on Demand

http://channel9.msdn.com/Events/TechEd

© 2014 Microsoft Corporation. All rights reserved. Microsoft, Windows, and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.