Date post: | 24-Dec-2015 |
Author: | kelley-ross |
Under the covers of HDP for Windows
Rohit Bakhshi
DBI-B387
Speaker
Rohit Bakhshi, Product Manager, Hortonworks
Agenda
Modern Data Architecture
Hadoop for Windows
Hortonworks Data Platform under the covers
Q&A
Modern Data Architecture
What Makes Up Big Data?
Megabytes
Gigabytes
Terabytes
Petabytes
Purchase detail
Purchase record
Payment record
ERP
CRM
WEB
BIG DATA
Offer details
Support Contacts
Customer Touches
Segmentation
Web logs
Offer history
A/B testing
Dynamic Pricing
Affiliate Networks
Search Marketing
Behavioral Targeting
Dynamic Funnels
User Generated Content
Mobile Web
SMS/MMS
Sentiment
External Demographics
HD Video, Audio, Images
Speech to Text
Product/Service Logs
Social Interactions & Feeds
Business Data Feeds
User Click Stream
Sensors / RFID / Devices
Spatial & GPS Coordinates
Increasing Data Variety and Complexity
Transactions + Interactions + Observations = BIG DATA
A data architecture under pressure from new data
APPLICATIONS
DATA SYSTEM
REPOSITORIES
SOURCES
Existing Sources (CRM, ERP, Clickstream, Logs)
RDBMS EDW MPP
Business Analytics
Custom Applications
Packaged Applications
Source: IDC
2.8 ZB in 2012
85% from New Data Types
15x Machine Data by 2020
40 ZB by 2020
OLTP, ERP, CRM Systems
Unstructured documents, emails
Clickstream
Server logs
Sentiment, Web Data
Sensor, Machine Data
Geolocation
Hadoop within an emerging Modern Data Architecture
OPERATIONS TOOLS
Provision, Manage & Monitor
DEV & DATA TOOLS
Build & Test
DATA SYSTEM
REPOSITORIES
SOURCES
RDBMS EDW MPP
OLTP, ERP, CRM Systems
Documents, Emails
Web Logs, Click Streams
Social Networks
Machine Generated
Sensor Data
Geolocation Data
Governance & Integration
Security
Operations
Data Access
Data Management
APPLICATIONS
Business Analytics
Custom Applications
Packaged Applications
Hadoop: typically used for new analytic applications…
SCALE
SCOPE
New Analytic Apps
New types of data
LOB-driven
… and incrementally delivers a ‘Data Lake’
SCALE
SCOPE
A Modern Data Architecture/Data Lake
New Analytic Apps
New types of data
LOB-driven
RDBMS
MPP
EDW
Governance & Integration
Security
Operations
Data Access
Data Management
Data Lake: An architectural shift in the data center that uses Hadoop to deliver deeper insight across a large, broad, diverse set of data at efficient scale
Hadoop for Windows
HDP for Windows
HDP 2.1: Hortonworks Data Platform
OPERATIONS
Provision, Manage & Monitor: Ambari (SCOM), ZooKeeper
Scheduling: Oozie
GOVERNANCE & INTEGRATION
Data Workflow, Lifecycle & Governance: Falcon, Sqoop, Flume, WebHDFS
SECURITY
Authentication, Authorization, Accounting, Data Protection
Storage: HDFS
Resources: YARN
Access: Hive, …
Pipeline: Falcon
Cluster: Knox
DATA ACCESS
Batch: Map Reduce
Script: Pig
Search: Solr
SQL: Hive/Tez, HCatalog
NoSQL: HBase
Stream: Storm
Others: In-Memory Analytics, ISV engines
DATA MANAGEMENT
YARN: Data Operating System
HDFS (Hadoop Distributed File System) across nodes 1 … N
Deployment Choice: Linux, Windows, On-Premise, Cloud
Hortonworks Data Platform (HDP)
The Only Completely Open Distribution for Apache Hadoop
Fundamentally Versatile and Comprehensive enterprise capabilities
Wholly Integrated for deep ecosystem interoperability
HDP: Enterprise Data Platform
HDP certifies most recent & stable community innovation
Hortonworks Data Platform
[Figure: certified component versions by HDP release. Releases: HDP 1.3 (May 2013), HDP 2.0 (October 2013), HDP 2.1 (April 2014).]
Components: Solr, Hadoop & YARN, Pig, Tez, Hive & HCatalog, HBase, Sqoop, Oozie, ZooKeeper, Mahout, Ambari, Storm, Flume, Knox, Phoenix, Falcon
Categories: Data Management, Data Access, Governance & Integration, Security, Operations
Version labels shown (in extraction order): 2.2.0, 1.1.2, 0.11.0, 0.11.0, 0.12.0, 0.12.0, 2.4.0, 0.12.1, 0.13.0, 0.94.6, 0.96.1, 0.98.0, 0.9.1, 0.7.0, 0.8.0, 0.9.0, 4.7.2, 1.4.3, 1.4.4, 1.3.1, 1.4.0, 1.2.5, 1.4.4, 1.5.1, 3.3.2, 4.0.0, 3.4.5, 0.4.0, 0.4.0, 4.0.0, 0.5.0
Seamless Interoperability
Integrations with Microsoft tools for native big data analysis
SOURCES
APPLICATIONS
OPERATIONAL TOOLS
DEV & DATA TOOLS
INFRASTRUCTURE
DATA SYSTEM
HDInsight
Azure
New! Power BI
Right Tool for the Right Usage
SCALE (storage & processing): Traditional Database, EDW, MPP Analytics, NoSQL, Hadoop Platform
Attribute | Traditional Database | Hadoop Platform
schema | Required on write | Required on read
speed | Reads are fast | Writes are fast
governance | Standards and structured | Loosely structured
processing | Limited, no data processing | Processing coupled with data
data types | Structured | Multi and unstructured
best fit use | Interactive OLAP Analytics, Complex ACID Transactions, Operational Data Store | Data Discovery, Processing unstructured data, Massive Storage/Processing
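The "required on write" versus "required on read" distinction can be illustrated with a minimal Python sketch. The schema, field names, and records below are hypothetical and exist only for illustration; neither function mirrors a real database or Hadoop API.

```python
import json

# Hypothetical schema used only for this illustration.
SCHEMA = {"id": int, "price": float}


def write_with_schema(table, record):
    """Schema on write: validate and coerce before the record is stored."""
    row = {name: cast(record[name]) for name, cast in SCHEMA.items()}
    table.append(row)


def read_with_schema(raw_lines):
    """Schema on read: raw lines were stored untouched; structure is applied at query time."""
    for line in raw_lines:
        data = json.loads(line)
        try:
            yield {name: cast(data[name]) for name, cast in SCHEMA.items()}
        except (KeyError, ValueError, TypeError):
            continue  # skip records that do not fit this particular reading


# Traditional database style: the write fails fast if the record does not fit.
table = []
write_with_schema(table, {"id": "1", "price": "9.5"})

# Hadoop style: anything can be stored; a schema is chosen when the data is read.
raw = ['{"id": 2, "price": "3.25", "note": "extra field kept"}', '{"unexpected": true}']
rows = list(read_with_schema(raw))
```

With schema on write the cost is paid once at load time, which is why reads are fast; with schema on read the store accepts loosely structured data quickly and each query decides how to interpret it.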
Maximize Hadoop Deployment Choice
Hortonworks Data Platform (HDP) for Windows: 100% Apache open source Hadoop software for Windows Server
Microsoft Azure HDInsight: Hadoop-based managed service in the cloud via Microsoft Azure
Microsoft Analytics Platform System (APS): Scale-out appliance with data warehousing and Hadoop in one box
All offerings co-engineered by Hortonworks and Microsoft
Enjoy seamless interoperability across on-premises and cloud
HDP under the covers
Data Operating System of Hadoop
Single Cluster, Shared Data Set, Multiple Workloads
Support a range of access patterns
Shared operational services
HDP 2.1: Core Platform
DATA ACCESS
Batch: Map Reduce
Script: Pig
Search: Solr
SQL: Hive/Tez, HCatalog
NoSQL: HBase, Accumulo
Stream: Storm
Others: In-Memory Analytics, ISV engines
DATA MANAGEMENT
YARN: Data Operating System
HDFS (Hadoop Distributed File System) across nodes 1 … N
YARN: Next Generation Hadoop
Single Use System: Batch Apps
Multi Use Data Platform: Batch, Interactive, Online, Streaming, …
1st Gen of Hadoop
HDFS (redundant, reliable storage)
MapReduce (cluster resource management & data processing)
2nd Gen of Hadoop
Redundant, Reliable Storage (HDFS)
Efficient Cluster Resource Management & Shared Services (YARN)
Flexible Data Processing: Hive, Pig, others…
Batch: MapReduce
Batch & Interactive: Tez
Online Data Processing: HBase, Accumulo
Stream Processing: Storm
others …
Classic Hadoop Apps
YARN: Data Operating System
[Diagram: a ResourceManager with its Scheduler allocates containers across a grid of NodeManagers shared by three workloads. Batch: map 1.1, map 1.2, reduce 1.1. Interactive SQL: vertex 1.1.1, vertex 1.1.2, vertex 1.2.1, vertex 1.2.2. Real-Time: nimbus0, nimbus1, nimbus2.]
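How one cluster can host batch, interactive, and real-time containers side by side can be sketched with a toy allocator. This is not the YARN API, and the "most free memory" policy is an assumption chosen for brevity (real YARN schedulers are considerably more elaborate). The node names and memory sizes are hypothetical; the container names are taken from the slide.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class NodeManager:
    name: str
    free_mb: int
    containers: List[str] = field(default_factory=list)


def allocate(nodes: List[NodeManager], container: str, mb: int) -> Optional[str]:
    """Place a container on the node with the most free memory (toy policy)."""
    best = max(nodes, key=lambda n: n.free_mb)
    if best.free_mb < mb:
        return None  # no node can host the request right now
    best.free_mb -= mb
    best.containers.append(container)
    return best.name


def demo() -> Dict[str, List[str]]:
    """Submit requests from three workload types to one shared set of nodes."""
    nodes = [NodeManager(f"nm{i}", free_mb=4096) for i in range(3)]
    requests = [
        ("map1.1", 1024), ("map1.2", 1024), ("reduce1.1", 2048),  # batch
        ("vertex1.1.1", 1024), ("vertex1.1.2", 1024),             # interactive SQL
        ("nimbus0", 2048),                                        # real-time
    ]
    for name, mb in requests:
        allocate(nodes, name, mb)
    return {n.name: n.containers for n in nodes}
```

The point of the sketch is only that the allocator knows nothing about what a container runs, which is what lets several engines share the same nodes.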
HDP 2.1 SQL Access: Stinger Initiative
Next generation SQL based interactive query in Hadoop
Speed: interactive Hive query response
Scale: queries that scale from TB to PB
SQL: broadest range of SQL semantics for analytic applications
[Diagram: Business Analytics and Custom Apps issue SQL to Apache Hive, which runs on Apache Tez and Apache MapReduce over Apache YARN and HDFS (Hadoop Distributed File System) across nodes 1 … N.]
Apache Hive Contribution… an Open Community at its finest
1,672 Jira Tickets Closed
145 Developers
44 Companies
~390,000 Lines Of Code Added… (2x)
13 Months
Apache Tez (“Speed”)
Replaces MapReduce as primitive for Hive, Pig, etc
Task with pluggable Input, Processor and Output
Tez Task - <Input, Processor, Output>
[Diagram: a Task box in which an Input feeds a Processor, which feeds an Output]
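The <Input, Processor, Output> composition can be pictured with a small Python sketch. This is not the Tez API (Tez is a Java framework); the class, field, and function names below are hypothetical and only show how a task stays the same while its three pluggable pieces are swapped.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Task:
    """Toy model of a task built from three pluggable parts (hypothetical names)."""
    read_input: Callable[[], Iterable[str]]             # Input: where records come from
    process: Callable[[Iterable[str]], Iterable[str]]   # Processor: the logic applied
    write_output: Callable[[Iterable[str]], None]       # Output: where results go

    def run(self) -> None:
        self.write_output(self.process(self.read_input()))


def run_upper_task(lines: List[str]) -> List[str]:
    """Wire an in-memory input and output around an upper-casing processor."""
    sink: List[str] = []
    task = Task(
        read_input=lambda: iter(lines),
        process=lambda records: (r.upper() for r in records),
        write_output=sink.extend,
    )
    task.run()
    return sink
```

Replacing `write_output` with something that hands records to the next task, rather than to storage, is the kind of substitution that pluggable inputs and outputs make possible.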
Hive with Tez as execution engine
Hive – MR vs. Hive – Tez
[Diagram: the same query plan run two ways. Plan stages: SELECT a.state, c.itemId; SELECT b.id; SELECT c.price; JOIN (a, b); JOIN (a, c); GROUP BY a.state with COUNT(*) and the average of c.price. Under Hive – MR the stages run as a chain of separate map (M) and reduce (R) jobs with HDFS between jobs. Under Hive – Tez the map and reduce tasks are connected in a single graph.]
SELECT a.state, COUNT(*), AVG(c.price)
FROM a
JOIN b ON (a.id = b.id)
JOIN c ON (a.itemId = c.itemId)
GROUP BY a.state
Tez avoids unneeded writes to HDFS
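To make clear what the query on this slide computes, here is a minimal plain-Python evaluation over small in-memory tables. It is a sketch of the query's semantics only, not of how Hive, MapReduce, or Tez execute it; the column names follow the query and the sample rows are made up.

```python
from collections import defaultdict


def state_summary(a_rows, b_rows, c_rows):
    """Return {state: (row count, average price)} for the two-join query.

    a_rows: dicts with id, itemId, state; b_rows: dicts with id;
    c_rows: dicts with itemId, price.
    """
    b_ids = defaultdict(int)
    for b in b_rows:
        b_ids[b["id"]] += 1                      # JOIN b ON (a.id = b.id)
    prices = defaultdict(list)
    for c in c_rows:
        prices[c["itemId"]].append(c["price"])   # JOIN c ON (a.itemId = c.itemId)

    groups = defaultdict(list)                   # GROUP BY a.state
    for a in a_rows:
        for _ in range(b_ids.get(a["id"], 0)):
            for price in prices.get(a["itemId"], []):
                groups[a["state"]].append(price)

    return {state: (len(p), sum(p) / len(p)) for state, p in groups.items()}


# Made-up sample rows for illustration.
a = [
    {"id": 1, "itemId": 10, "state": "CA"},
    {"id": 2, "itemId": 11, "state": "CA"},
    {"id": 3, "itemId": 10, "state": "NY"},
]
b = [{"id": 1}, {"id": 2}]
c = [{"itemId": 10, "price": 4.0}, {"itemId": 11, "price": 6.0}]
summary = state_summary(a, b, c)
```

The NY row drops out because its id has no match in b, which is the inner-join behaviour the query asks for.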
Hive: Enhanced SQL Semantics
Hive SQL Datatypes: INT, TINYINT/SMALLINT/BIGINT, BOOLEAN, FLOAT, DOUBLE, STRING, TIMESTAMP, BINARY, DECIMAL, ARRAY, MAP, STRUCT, UNION, DATE, VARCHAR, CHAR
Hive SQL Semantics:
SELECT, INSERT
GROUP BY, ORDER BY, SORT BY
JOIN on explicit join key
Inner, outer, cross and semi joins
Sub-queries in FROM clause
ROLLUP and CUBE
UNION
Windowing Functions (OVER, RANK, etc)
Custom Java UDFs
Standard Aggregation (SUM, AVG, etc.)
Advanced UDFs (ngram, Xpath, URL)
Sub-queries for IN/NOT IN, HAVING
Expanded JOIN Syntax
INTERSECT / EXCEPT
Releases shown: Hive 0.11, Hive 0.12 (HDP 2.0), Hive 0.13 (HDP 2.1)
SQL Compliance: Hive provides a wide array of SQL datatypes and semantics so your existing tools integrate more seamlessly with Hadoop
HDP 2.1: Data Governance & Integration
Apache Falcon
Simplified Data Governance for Enterprise Hadoop
Provides key governance framework for:
Acquisition & processing of data sets
Replication & retention of datasets
Redirect datasets to non-Hadoop extensions
Provides audit trail & lineage
Apache Falcon: Replication
Disaster Recovery and Backup between environments
Publishing data between environments for Discovery
Site to Site
Site to Cloud
Apache Falcon: Retention
Define sophisticated retention policies
Simplify data retention for audit, compliance, or for data re-processing
Staged Data
Retain 5 Years
Cleansed Data
Retain 3 Years
Conformed Data
Retain 3 Years
Presented Data
Retain Last Copy Only
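The retention rules listed above can be written down as a small eviction check. This is a sketch in Python, not Falcon syntax (Falcon expresses retention in its own entity definitions); the dictionary layout, function name, and the 365-day approximation of a year are assumptions made for illustration.

```python
from datetime import date, timedelta

# Retention periods from the slide; the encoding as a dict is hypothetical.
RETENTION_YEARS = {"staged": 5, "cleansed": 3, "conformed": 3}


def expired(stage, created, today, is_latest_copy=False):
    """Return True when a dataset instance may be evicted."""
    if stage == "presented":
        return not is_latest_copy            # retain last copy only
    # A year is approximated as 365 days to keep the sketch short.
    limit = timedelta(days=365 * RETENTION_YEARS[stage])
    return today - created > limit
```

For example, a staged dataset created on 2010-01-01 is still retained on 2014-06-01, while a cleansed dataset of the same age is past its three-year limit.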
HDP 2.1: Search
Apache Solr
Open source enterprise search for Hadoop
Simple, powerful UI for advanced search applications
High performance indexing & sub-second search times over billions of documents
Search Architecture
[Diagram: raw files (HTML, PDF, Word, XML, logs, …) in HDFS (Hadoop Distributed File System) are turned into indexed documents by a MapReduce indexing job; Solr instances built on Lucene serve the index; a search web app sends queries and receives responses.]
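The index-then-query flow in this architecture can be illustrated with a toy inverted index. This is a minimal Python sketch of the general idea behind an indexing job and a search request; it does not use Solr or Lucene, and the function names and sample documents are hypothetical.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map each lowercase token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in re.findall(r"\w+", text.lower()):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every query term (AND semantics)."""
    terms = re.findall(r"\w+", query.lower())
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Made-up documents standing in for raw files.
docs = {
    "log1": "disk error on node 7",
    "log2": "node 7 rebooted",
    "doc3": "Quarterly report",
}
idx = build_index(docs)
```

Building the index is the expensive batch step; answering a query is then a few set lookups, which is why the index is built ahead of time.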
HDP 2.1: Stream Processing
Apache Storm
Real-time event processing for sensor and business activity monitoring
Unlocks new business cases for Hadoop
Scale: Ingest millions of events per second. Fast query on petabytes of data
http://storm.incubator.apache.org/
HDP 2.1: Perimeter Security
Apache Knox
A common place to perform authentication across Hadoop and all related projects
Integrated with LDAP and AD
Secure interfaces for: WebHDFS, WebHCat, Oozie, Hive & HBase
Broad community effort, incubated with Microsoft, with a broad set of developers involved
Apache Knox: Perimeter Security
[Diagram: REST clients, JDBC clients, and browsers reach the Knox Gateway (GW) through a firewall. The gateway is a stateless reverse proxy instance deployed in the DMZ and uses the enterprise identity provider (LDAP/AD). Behind a second firewall sit HDP Cluster 1 and HDP Hadoop Cluster 2, each with masters (JT, NN, WebHCat, Oozie, YARN, HBase, Hive) and DN/TT nodes.]
Requests are streamed through the gateway to Hadoop services after authentication.
URLs are rewritten to refer to the gateway.
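The URL rewriting step can be sketched as a small Python function. The layout assumed here is https://{gateway-host}:{gateway-port}/{gateway-path}/{cluster-name}/{service path}; the host names, port 8443, the "gateway" path, and the cluster name are placeholders for illustration rather than values from this deck, and the function is not part of Knox itself.

```python
from urllib.parse import urlsplit, urlunsplit


def to_gateway_url(service_url, gateway_host, cluster,
                   gateway_port=8443, gateway_path="gateway"):
    """Rewrite a direct service URL so that it points at the gateway instead.

    The internal host and port are dropped, so clients never need to know
    where the service actually runs inside the cluster.
    """
    parts = urlsplit(service_url)
    new_path = f"/{gateway_path}/{cluster}{parts.path}"
    netloc = f"{gateway_host}:{gateway_port}"
    return urlunsplit(("https", netloc, new_path, parts.query, parts.fragment))


# Hypothetical internal WebHDFS URL and gateway host.
rewritten = to_gateway_url(
    "http://namenode.internal:50070/webhdfs/v1/tmp?op=LISTSTATUS",
    gateway_host="knox.example.com",
    cluster="cluster1",
)
```

Because every service is addressed through the same gateway host and port, only that one endpoint has to be opened in the firewall.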
Operating Enterprise Hadoop
Ambari: Deploy, Manage, Monitor
[Diagram: Ambari Web and REST APIs sit on top of the Ambari Server (Provision | Manage | Monitor), which manages the cluster's compute & storage nodes.]
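A client of the REST APIs might prepare a request along these lines. This is a minimal sketch that only builds the request object and does not send it; the /api/v1/clusters/{name}/services path and HTTP basic authentication are assumptions about the Ambari REST interface, and the host, port, cluster name, and credentials are placeholders.

```python
import base64
import urllib.request


def ambari_services_request(host, cluster, user, password, port=8080):
    """Build (but do not send) a GET request listing a cluster's services."""
    url = f"http://{host}:{port}/api/v1/clusters/{cluster}/services"
    request = urllib.request.Request(url)
    # HTTP basic authentication: base64 of "user:password".
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request.add_header("Authorization", f"Basic {token}")
    return request


# Placeholder values; nothing is contacted until the request is opened.
req = ambari_services_request("ambari.example.com", "cluster1", "admin", "secret")
```

Passing `req` to `urllib.request.urlopen` would perform the call against a reachable server and return a JSON document to parse.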
Enables Microsoft System Center Operations Manager (SCOM) to monitor Hadoop
Ambari SCOM Management Pack gives insight into the performance and health of Hadoop
Ambari SCOM leverages the Ambari framework to aggregate and expose Hadoop metrics
Ambari SCOM
[Diagram: the Ambari SCOM Management Pack connects to the Ambari SCOM Server, which sits on Hadoop (storage & process at scale).]
Ambari SCOM Server aggregates and exposes Hadoop metrics
Ambari SCOM monitors health and alerts in case of problems
HDP - Reference Architecture
For More Information
Web: hortonworks.com/products/hdp-windows/, hortonworks.com/labs/microsoft/, microsoft.com/bigdata
Training: hortonworks.com/hadoop-training/hadoop-on-windows/
Online documentation: docs.hortonworks.com
Forums: hortonworks.com/community/forums/
Questions?
Track resources
Download Microsoft SQL Server 2014 http://www.trySQLSever.com
Try out Power BI for Office 365! http://www.powerbi.com
Sign up for Microsoft HDInsight today! http://microsoft.com/bigdata
Resources
Learning
Microsoft Certification & Training Resources
www.microsoft.com/learning
msdn
Resources for Developers
http://microsoft.com/msdn
TechNet
Resources for IT Professionals
http://microsoft.com/technet
Sessions on Demand
http://channel9.msdn.com/Events/TechEd
© 2014 Microsoft Corporation. All rights reserved. Microsoft, Windows, and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.