Page 1 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Apache Hadoop, current status & future projects….
Spring 2014 Version 1.4
Simon Gregory, Director, Strategic Alliances & Business Development, Hortonworks EMEA. [email protected]
No apologies…
• A recap on Apache Hadoop.
No excuses…
• Current Hadoop eco-system.
• Customer Adoption Methods.
• What’s being worked on.
• What that means to you.
Hadoop: A quick re-cap.
Hadoop stores and processes the data you currently do not or cannot…
Cost Profile. Data Structure.
OLTP, ERP, CRM Systems
Unstructured documents, emails
Clickstream
Server logs
Sentiment, Web Data
Sensor, Machine Data
Geolocation
Hadoop enables scalable compute & storage with a compelling cost profile….
[Chart: Fully-loaded Cost Per Raw TB of Data (Min–Max Cost), on an axis from $0 to $180,000, comparing MPP, SAN, Engineered System, NAS, Hadoop and Cloud Storage.]
Hadoop enables scalable compute & storage for all data structures….
Current Reality: Apply schema on write
• Dependent on IT
• Repeatable Process: SQL
• Determine list of questions
• Design solutions
• Collect structured data
• Ask questions from list
• Detect additional questions

Augment w/ Hadoop: Apply schema on read
• Support range of access patterns to data stored in HDFS: polymorphic access
• Iterate over structure
• Transform and Analyze
• Right Engine, Right Job: Batch, Interactive, Real-time, In-memory
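The difference between schema on write and schema on read can be sketched in a few lines of Python. This is a minimal illustration only, not Hadoop code: the event records, the field names and the `read_with_schema` helper are all made up for the example. The raw lines are kept exactly as they arrived, and each question brings its own schema when it reads them.

```python
import json

# Raw events stored exactly as they arrived (hypothetical sample data).
# No table definition was needed before loading them.
RAW_EVENTS = [
    '{"user": "alice", "page": "/home", "ms": 120}',
    '{"user": "bob", "page": "/cart", "ms": 340, "referrer": "ad-17"}',
    '{"user": "alice", "page": "/cart", "ms": 95}',
]


def read_with_schema(lines, schema):
    """Parse each raw line and project it onto `schema` at read time.

    `schema` maps a field name to a (cast, default) pair. A field that is
    missing from a record takes the default instead of failing the load.
    """
    for line in lines:
        record = json.loads(line)
        yield {
            name: cast(record[name]) if name in record else default
            for name, (cast, default) in schema.items()
        }


# Two questions, two schemas, one copy of the data.
latency_schema = {"page": (str, ""), "ms": (int, 0)}
referrer_schema = {"user": (str, ""), "referrer": (str, "none")}

latency_rows = list(read_with_schema(RAW_EVENTS, latency_schema))
referrer_rows = list(read_with_schema(RAW_EVENTS, referrer_schema))

# Average latency on the cart page, using the first schema.
cart_ms = [row["ms"] for row in latency_rows if row["page"] == "/cart"]
print(sum(cart_ms) / len(cart_ms))

# Referrer per event, using the second schema on the same raw lines.
print([row["referrer"] for row in referrer_rows])
```

A new question only needs a new schema dictionary; the stored data is not reloaded or restructured, which is the iteration loop the slide describes.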
Hadoop alone is not the answer!
APPLICATIONS: Business Analytics, Custom Applications, Packaged Applications
DEV & DATA TOOLS: Build & Test
OPERATIONS TOOLS: Provision, Manage & Monitor
DATA SYSTEM
• REPOSITORIES: RDBMS, EDW, MPP
• Hadoop: Governance & Integration, Security, Operations, Data Access, Data Management
SOURCES
• OLTP, ERP, CRM Systems
• Documents, Emails
• Web Logs, Click Streams
• Social Networks
• Machine Generated
• Sensor Data
• Geolocation Data
OLTP, ERP, CRM Systems
Unstructured documents, emails
Clickstream
Server logs
Sentiment, Web Data
Sensor, Machine Data
Geolocation
Hadoop: Current Eco-system
[Chart: component versions by HDP release]
Releases: HDP 1.3 (May 2013), HDP 2.0 (October 2013), HDP 2.1 (April 2014)
Categories: Data Management, Data Access, Governance & Integration, Security, Operations
Versions shown per component, oldest to newest:
Hadoop & YARN: 1.1.2, 2.2.0, 2.4.0
Pig: 0.11.0, 0.12.0, 0.12.1
Hive & HCatalog: 0.11.0, 0.12.0, 0.13.0
HBase: 0.94.6, 0.96.1, 0.98.0
Mahout: 0.7.0, 0.8.0, 0.9.0
Ambari: 1.2.5, 1.4.4, 1.5.1
Sqoop: 1.4.3, 1.4.4
Flume: 1.3.1, 1.4.0
Oozie: 3.3.2, 4.0.0
Zookeeper: 3.4.5
Tez: 0.4.0
Storm: 0.9.1
Solr: 4.7.2
Knox: 0.4.0
Phoenix: 4.0.0
Accumulo: 1.5.1
Falcon: 0.5.0
Apache Hadoop: Driven by the community
PRESENTATION & APPLICATION: Enable both existing and new applications to provide value to the organization
ENTERPRISE MGMT & SECURITY: Empower existing operations and security tools to manage Hadoop
GOVERNANCE & INTEGRATION: Load data and manage according to policy
DATA ACCESS: Access your data simultaneously in multiple ways (batch, interactive, real-time)
DATA MANAGEMENT: Store and process all of your Corporate Data Assets
YARN: Data Operating System
SECURITY: Provide layered approach to security through Authentication, Authorization, Accounting, and Data Protection
OPERATIONS: Deploy and effectively manage the platform
DEPLOYMENT OPTIONS: Provide deployment choice across physical, virtual, cloud
The days of Hadoop = MR & HDFS are over.
Virtuous cycle when development & fixed issues done upstream & stable project releases flow downstream

Upstream Community Projects (Apache Hadoop, Apache Hive, Apache HBase, Apache Pig, Apache Falcon, Apache Knox, Apache Storm)
• Design & Develop
• Test & Patch
• Release
• Fixed Issues

Downstream Enterprise Product
• Integrate & Test
• Package & Certify
• Distribute
• Stable Project Releases

Certified at scale using the most advanced Hadoop test bed on the planet
• 1000’s of production nodes at Yahoo!
• Over 1500 unit & system tests
YARN is the central point of innovation to tie the Hadoop stack together
Architectural consistency
• Platform capabilities leverage YARN as the common data operating system
• The common integration point for all data processing engines
  – Community (e.g. Storm, Spark, etc)
  – Commercial (e.g. SAS)

[Diagram: the HDP stack]
GOVERNANCE & INTEGRATION
• Data Workflow, Lifecycle & Governance: Falcon, Sqoop, Flume, NFS, WebHDFS
DATA ACCESS
• Batch: Map Reduce
• Script: Pig
• SQL: Hive/Tez, HCatalog
• NoSQL: HBase, Accumulo
• Stream: Storm
• Search: Solr
• Others: In-Memory Analytics, ISV engines
DATA MANAGEMENT
• YARN: Data Operating System
• HDFS (Hadoop Distributed File System), nodes 1 to N
SECURITY
• Authentication, Authorization, Accounting, Data Protection
• Storage: HDFS; Resources: YARN; Access: Hive, …; Pipeline: Falcon; Cluster: Knox
OPERATIONS
• Provision, Manage & Monitor: Ambari, Zookeeper
• Scheduling: Oozie
YARN: Data Operating System
Yet Another Resource Negotiator. The data operating system of Hadoop that allows multiple processing engines to access data stored in Hadoop with predictable levels of service.
[Diagram: the HDP stack (Governance & Integration, Data Access, Data Management, Security, Operations), with YARN sitting between the data access engines and HDFS.]
Avoid Hadoop Silos & Reduce TCO
• Single Cluster, Shared Data Set, Multiple Workloads
• Support a range of access patterns: batch, interactive, online, streaming, real-time and more
Supported with KEY Enterprise Capabilities
Data Center Requirements: Apache Hadoop has expanded to deliver the core requirements of any data platform across Governance, Security and Operations
• Data Governance: Govern and apply policy to the data lifecycle. Integrate and move data in and out of Hadoop.
• Data Security: Authenticate, authorize and provide accountability of data access. Protect data at rest and in motion.
• Operations: Provision, manage and monitor cluster resources. Maintain and improve performance.
[Diagram: the HDP stack (Governance & Integration, Data Access, Data Management, Security, Operations).]
Customer Adoption Methods.
Getting started…
On-Premise:
• Hortonworks Sandbox: http://hortonworks.com/products/hortonworks-sandbox/
• Hortonworks HDP: http://hortonworks.com/hdp/downloads/
Cloud:
• Microsoft HDInsight: http://azure.microsoft.com/en-us/pricing/free-trial/
• Rackspace: http://www.rackspace.com/big-data/
• Amazon: http://hortonworks.com/blog/deploying-hadoop-cluster-amazon-ec2-hortonworks/
[Chart: adoption stages across phases 1 to 4, plotted against Potential Value]
Stages: Awareness, Interest, Education; Evaluation – Technical; Evaluation – Business Value; Point Deployment; Point Production; Enterprise Deployment; Enterprise Production; Industry Leadership
Value milestones: Operational Value, Strategic Value, Data-Driven Organization
Typical elapsed time* from start of phase 1 in months: 2-6, 9-15, 18-24
* Timeline varies by company size. Often smaller or focused online businesses achieve milestones at the shorter end of the range.
New Analytic Applications from New Types of Data
INDUSTRY | USE CASE | Sentiment & Web | Clickstream & Behavior | Machine & Sensor | Geographic | Server Logs | Structured & Unstructured
Financial Services New Account Risk Screens ✔ ✔
Trading Risk ✔
Insurance Underwriting ✔ ✔ ✔
Telecom Call Detail Records (CDR) ✔ ✔
Infrastructure Investment ✔ ✔
Real-time Bandwidth Allocation ✔ ✔ ✔
Retail 360° View of the Customer ✔ ✔ ✔
Localized, Personalized Promotions ✔
Website Optimization ✔
Manufacturing Supply Chain and Logistics ✔
Assembly Line Quality Assurance ✔
Crowd-sourced Quality Assurance ✔
Healthcare Use Genomic Data in Medical Trials ✔ ✔ ✔
Monitor Patient Vitals in Real-Time
Pharmaceuticals Recruit and Retain Patients for Drug Trials ✔ ✔
Improve Prescription Adherence ✔ ✔ ✔ ✔
Oil & Gas Unify Exploration & Production Data ✔ ✔ ✔ ✔
Monitor Rig Safety in Real-Time ✔ ✔ ✔
Government ETL Offload/Federal Budgetary Pressures ✔ ✔
Sentiment Analysis for Government Programs ✔
What’s being worked on? The future for Hadoop….
New Projects:
[Diagram: the HDP stack (Governance & Integration, Data Access, Data Management, Security, Operations).]
Deployment Choice: Linux, Windows, On-Premise, Cloud
http://hortonworks.com/labs/
• Storm…
• Spark (ml)
• Slider
• Falcon
• XASecure
• Cascading
What that means to you.
Open Source: Drive innovation in the open exclusively via the Apache community-driven open source process.
Enterprise Rigor: Engineer, test and certify Apache Hadoop with the enterprise in mind.
Ecosystem Endorsement: Focus on deep integration with existing data center technologies and skills.
Pace of innovation…. Fit for purpose…. Integrated….
Questions….