Hortonworks Data Platform for Systems Integrators Webinar 9-5-2012.pptx


Hortonworks & Systems Integrators

Mitch Ferguson, VP, Business Development
Rikin Shah, Director, Field Engineering

September 5, 2012

Big data changes the game

Megabytes

Gigabytes

Terabytes

Petabytes

Purchase detail Purchase record Payment record

BIG DATA

Offer details

Support Contacts

Customer Touches

Segmentation

Web logs

Offer history

A/B testing

Dynamic Pricing

Affiliate Networks

Search Marketing

Behavioral Targeting

Dynamic Funnels

User Generated Content

Mobile Web

SMS/MMS Sentiment

External Demographics

HD Video, Audio, Images

Speech to Text

Product/Service Logs

Social Interactions & Feeds

Business Data Feeds

User Click Stream

Sensors / RFID / Devices

Spatial & GPS Coordinates

Increasing Data Variety and Complexity

Transactions + Interactions + Observations = BIG DATA

Hortonworks Snapshot

The industry-leading and only 100% open source Apache Hadoop distribution

Most experienced open source leadership team
–  Rob Bearden – CEO (JBoss, SpringSource, i2, Oracle)
–  Shaun Connolly – VP Strategy (VMware, SpringSource, Red Hat, JBoss)
–  Mitch Ferguson – VP BD (SpringSource, VMware)
–  John Kreisa – VP Marketing (Red Hat, Cloudera, MarkLogic, Business Objects)
–  Ari Zilka – CPO (Terracotta, Accenture, Walmart.com)
–  Greg Pavlik – VP Eng. (Oracle SOA & Integration platform)

Business model focused on customer success: Hadoop support, services & training
–  Subscription support for Hortonworks Data Platform
–  Training business: private and public classes available for developers & administrators

•  Headquarters Sunnyvale, CA

•  100+ Employees

•  Formed with core Apache Hadoop engineering team from Yahoo!

•  40+ engineers and architects including 25+ Hadoop committers

Hortonworks Business Strategy

Enable the next gen data management platform

• Accelerate the adoption of Apache Hadoop

• Create a vibrant ecosystem – ISVs, IHVs, Systems Integrators

• Provide world-class enterprise Support & Training

We believe that by the end of 2015, more than half the world's data will be processed by Apache Hadoop.

Hortonworks Vision & Role

1. Be diligent stewards of the open source core
2. Be tireless innovators beyond the core
3. Provide robust data platform services & open APIs
4. Enable the ecosystem at each layer of the stack
5. Make the platform enterprise-ready & easy to use

Enabling Hadoop as Enterprise Big Data Platform

DEVELOPER Data Platform Services & Open APIs

Hortonworks Data Platform

Applications, Business Tools, Development Tools, Open APIs and access Data Movement & Integration, Data Management Systems, Systems Management

Installation & Configuration, Administration, Monitoring, High Availability, Replication, Multi-tenancy, ..

Metadata, Indexing, Search, Security, Management, Data Extract & Load, APIs

Hortonworks Partner Eco-System

Hortonworks & SIs

Our business models are 100% complementary
• Systems Integrators are a cornerstone of our business model
• Enable high-value & repeatable solutions
• Leverage multi-party relationships to accelerate business

Systems Integrator

Customer

Why Hortonworks?

•  The most Apache Hadoop experience and expertise
–  Reliable Hadoop from the experts, project leaders, architects and builders
–  Collectively over 90 years of operational Hadoop experience (at least double that of the closest competitor)
•  Influence community direction
–  Provides a direct connection to drive innovation in the community
•  Focus on the ecosystem
–  Roadmap and vision to provide access to the wide ecosystem of enterprise applications, such as Teradata
•  Industry momentum
–  Collaborate across partners (ISVs/IHVs/SIs) to enable high-value solutions

Hortonworks Apache Hadoop Leadership

Hortonworkers… the builders, operators and core architects of Apache Hadoop

•  Most experienced team running Hadoop in production at scale (> 5 years, 42000 nodes)

•  All “stable” releases of Apache Hadoop have been shipped by Hortonworkers

Leadership

• VP and PMC of Hadoop: Arun Murthy
• Core Architect of YARN: Arun Murthy
• Core Architect of MapReduce2: Arun Murthy
• VP & PMC of Pig: Daniel Dai
• VP of ZooKeeper: Mahadev Konar
• Inventor of HCatalog: Alan Gates
• Project Lead for Ambari: Mahadev Konar
• Original Project Lead: Eric Baldeschwieler

“We have noticed more activity over the last year from Hortonworks’ engineers on building out Apache Hadoop’s more innovative features. These include YARN, Ambari and HCatalog..”

- Jeff Kelly, Wikibon

Operate Integrate

Develop Interact

Distributed Storage (HDFS)

Distributed Processing (MapReduce)

Query (Hive)

Scripting (Pig)

Metadata Services (HCatalog)


Apache Hadoop Release Management

Hadoop 1

1.0.1 1.0.2 1.0.3

1.1.1 1.1.2

HDP 1.0

•  Apache Hadoop release management is run by Hortonworks
–  Matt Foley, Release Manager for Hadoop 1
–  Arun Murthy, Release Manager for Hadoop 2
–  Ashutosh Chauhan, Release Manager for Hive
–  Daniel Dai, Release Manager for Pig
–  Alan Gates, Release Manager for HCatalog

•  Hadoop Core releases validated (and fixed) by Hortonworks
–  ~1300 end-to-end system tests run in house using our IP before any release can be made

•  Hortonworks Data Platform is released directly from Apache Hadoop branches

Full Stack High Availability

HA Pairs

Core Switch

Rack Switch Rack Switch

Namenode HA Manager

Job Tracker HA Manager

Etc. daemon HA Manager

Namenode HA Manager

Job Tracker HA Manager

Etc. daemon HA Manager

HA Cluster

Full Stack High Availability

•  Failover and restart for
–  NameNode
–  JobTracker
–  HBase and other services to come…
•  Open API allows use of proven HA from multiple vendors (Red Hat & VMware)
•  Minimized changes to clients and configuration
•  Complementary to 2.0 HA efforts
•  Server & operating system failure detection and VM restart
•  Smart resource management ensures sufficient resources are available to restart VMs

Addresses HA needs on stable Apache Hadoop 1.0
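
The failure detection, restart, and resource check described on this slide can be illustrated with a toy model. This is a minimal sketch only, not how the Red Hat or VMware HA products work internally: the `Host` and `Service` classes, the `place` and `fail_over` functions, and the memory figures are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Host:
    name: str
    free_mem_gb: int
    alive: bool = True


@dataclass
class Service:
    name: str
    mem_gb: int
    host: Optional[str] = None


def place(service: Service, hosts: List[Host]) -> Optional[str]:
    """Start the service on the first live host with enough free memory.

    Returns the chosen host name, or None when no host has room.
    """
    for h in hosts:
        if h.alive and h.free_mem_gb >= service.mem_gb:
            h.free_mem_gb -= service.mem_gb
            service.host = h.name
            return h.name
    service.host = None
    return None


def fail_over(failed_host: str, services: List[Service],
              hosts: List[Host]) -> Dict[str, Optional[str]]:
    """Mark a host as failed and restart its services on surviving hosts."""
    for h in hosts:
        if h.name == failed_host:
            h.alive = False
    moved = {}
    for s in services:
        if s.host == failed_host:
            moved[s.name] = place(s, hosts)  # None means it could not restart
    return moved
```

In this model a service restarts only if a surviving host has enough free memory, which is the point of keeping spare capacity in reserve for failover.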

Capacity Scheduler Delivers Multi-tenancy

• Queue definition
– % of total system memory
– % CPU utilization (not slot count)
• Queues per team
– Soft and hard limits, so you can use the entire cluster if available
– Ownership and security built in
• Proactive resource management
– Lots of rules and observation points
– Don’t start another task if it will blow up the node
– Don’t start another task if other workloads are spinning up
• Better than Fair + Preemption (HDP supports all)
– Utilization not measured by slot count (can blow up a node / cluster)
– Doesn’t start all tasks automatically (proactive vs. reactive)
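
The soft and hard queue limits above can be sketched as a small admission check. This is an illustrative model under stated assumptions, not the Capacity Scheduler's actual code or configuration: `Queue`, `next_queue`, `schedule`, and the field names are hypothetical, memory is the only resource tracked, and the node-level checks mentioned on the slide are left out.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Queue:
    name: str
    guaranteed_pct: float  # soft limit: share of cluster memory the queue is entitled to
    max_pct: float         # hard limit: ceiling even when the cluster is idle
    used_mb: int = 0
    pending: int = 0       # tasks waiting to start


def next_queue(queues: List[Queue], total_mb: int, task_mb: int) -> Optional[Queue]:
    """Pick the queue furthest below its guaranteed share that may start a task."""
    free_mb = total_mb - sum(q.used_mb for q in queues)
    if task_mb > free_mb:
        return None  # proactive: do not start a task the cluster cannot hold
    eligible = [
        q for q in queues
        if q.pending > 0 and q.used_mb + task_mb <= total_mb * q.max_pct / 100
    ]
    if not eligible:
        return None
    # Lowest used/guaranteed ratio first, so under-served queues catch up.
    return min(eligible, key=lambda q: q.used_mb / (total_mb * q.guaranteed_pct / 100))


def schedule(queues: List[Queue], total_mb: int, task_mb: int) -> List[str]:
    """Start tasks one at a time until nothing more may start."""
    started = []
    while True:
        q = next_queue(queues, total_mb, task_mb)
        if q is None:
            return started
        q.used_mb += task_mb
        q.pending -= 1
        started.append(q.name)
```

With a 60% guaranteed share and a 100% hard limit, a queue in this model can borrow the whole cluster while other queues are idle. A queue capped at 50% stops there even if memory is free.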

HCatalog METADATA

HCatalog

Table access Aligned metadata REST API

•  Raw Hadoop data
•  Inconsistent, unknown
•  Tool-specific access

Apache HCatalog provides flexible metadata services across tools and external access

Metadata Services

•  Consistency of metadata and data models across tools (MapReduce, Pig, HBase and Hive)

•  Accessibility: share data as tables in and out of HDFS
•  Availability: enables flexible, thin-client access via REST API

Shared table and schema management opens the platform

Options Lead to Complexity

| Feature | MapReduce | Pig | Hive |
|---|---|---|---|
| Record format | Key value pairs | Tuple | Record |
| Data model | User defined | int, float, string, bytes, maps, tuples, bags | int, float, string, maps, structs, lists |
| Schema | Encoded in app | Declared in script or read by loader | Read from metadata |
| Data location | Encoded in app | Declared in script | Read from metadata |
| Data format | Encoded in app | Declared in script | Read from metadata |

•  Pig and MR users need to know a lot to write their apps
•  When data schema, location, or format change, Pig and MR apps must be rewritten, retested, and redeployed
•  Hive users have to load data from Pig/MR users to have access to it

Hadoop Ecosystem

metastore: tables, partitions, files, types

Pig (scripting)

Hive (SQL)

MapReduce (Java)

Input/OutputFormat

Interface: Load/Store

Interface: SerDe

Input/OutputFormat

Interface: SQL

Opening up Metadata to MR & Pig

HCat Metadata layer

metastore: tables, partitions, files, types

Pig (scripting)

Hive (SQL)

MapReduce (Java)

HCatInput/OutputFormat

Interface: HCatLoad/Store

Interface: SerDe

Interface: SQL

Tools With HCatalog

| Feature | MapReduce + HCatalog | Pig + HCatalog | Hive |
|---|---|---|---|
| Record format | Record | Tuple | Record |
| Data model | int, float, string, maps, structs, lists | int, float, string, bytes, maps, tuples, bags | int, float, string, maps, structs, lists |
| Schema | Read from metadata | Read from metadata | Read from metadata |
| Data location | Read from metadata | Read from metadata | Read from metadata |
| Data format | Read from metadata | Read from metadata | Read from metadata |

•  Pig/MR users can read schema from metadata
•  Pig/MR users are insulated from schema, location, and format changes
•  All users have access to other users’ data as soon as it is committed
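
The insulation described above can be sketched with a toy reader that takes schema, location, and format from a metadata lookup instead of hard-coding them. This is a minimal illustration, not the HCatalog API: the `read_table` function and the `metastore` and `storage` dictionaries are hypothetical stand-ins for a real metastore and for HDFS.

```python
import csv
import io
import json


def read_table(table, metastore, storage):
    """Return typed records for `table`, using only its metadata entry.

    `metastore` maps table name -> {"location", "format", "schema"}.
    `storage` maps a location string -> raw file text (stand-in for HDFS).
    """
    meta = metastore[table]
    raw_text = storage[meta["location"]]
    names = [name for name, _ in meta["schema"]]

    if meta["format"] == "csv":
        rows = csv.reader(io.StringIO(raw_text))
        records = [dict(zip(names, row)) for row in rows if row]
    elif meta["format"] == "json":
        records = [json.loads(line) for line in raw_text.splitlines() if line.strip()]
    else:
        raise ValueError("unsupported format: " + meta["format"])

    # Apply the declared column types so callers see the same records
    # regardless of how the data happens to be stored.
    return [{name: cast(rec[name]) for name, cast in meta["schema"]} for rec in records]
```

The caller names only the table. If the data moves to a new path or switches from CSV to JSON, only the metadata entry changes and `read_table("web_logs", ...)` returns the same records.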

Hadoop Cluster

Existing Infrastructure

Metadata Services

metastore

Hive applications

data stores

visualization

REST •  ddl •  dml

HCatalog

create describe

HDFS HBase External Store

Existing & New Applications

MapReduce Pig Hive

HCatalog

HCatalog RESTful Web Services

Services Integration

Provides RESTful API as “front door” for Hadoop

•  Opens the door to languages other than Java

•  Thin clients via web services vs. fat-clients in gateway

•  Insulation from interface changes release to release

Opens Hadoop to integration with existing and new applications

WebHDFS
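
To make the "thin client" idea concrete, the sketch below only builds the request URLs a client would send. It assumes the URL layouts publicly documented for WebHDFS (`/webhdfs/v1/<path>?op=...`) and for the HCatalog REST service (`/templeton/v1/ddl/...`). The host names, the user name, and the default ports 50070 and 50111 are assumptions that vary by deployment and release, and the helper function names are hypothetical.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host, path, op, port=50070, params=None):
    """Build a WebHDFS request URL for an absolute HDFS path and operation."""
    query = urlencode({"op": op, **(params or {})})
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"


def hcatalog_table_url(host, database, table, user, port=50111):
    """Build an HCatalog REST URL that describes one table."""
    query = urlencode({"user.name": user})
    return (
        f"http://{host}:{port}/templeton/v1/ddl/database/"
        f"{quote(database)}/table/{quote(table)}?{query}"
    )


# Hypothetical hosts and user, for illustration only.
print(webhdfs_url("namenode.example.com", "/data/web_logs", "LISTSTATUS"))
print(hcatalog_table_url("hcat.example.com", "default", "web_logs", "alice"))
```

Because these are plain HTTP GET requests, any language with an HTTP client can issue them without Hadoop's Java libraries installed.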

Data Integration Services

•  Intuitive graphical data integration tools for HDFS, Hive, HBase, HCatalog and Pig

•  Oozie scheduling allows you to manage and stage jobs

•  Connectors for any database, business application or system

•  Integrated HCatalog storage

Bridge the gap between legacy data & Hadoop

Simplify and speed development

Teradata and Hortonworks Partner to Provide the First Enterprise Reference Architecture for Hadoop and Big Data

Partnership provides clear path to enterprise for Hadoop

•  Reference architecture that provides guidance on best applications for Teradata, Teradata Aster, and Hadoop

•  Clear partnership between industry and community leaders

•  Deeper integration to ease data movement in/out of Hadoop

•  Joint R&D and go-to-market

Ambari: Cluster Provisioning, Configuration Management, Monitoring

Ambari Architecture

•  Installs your cluster onto target HW for you

•  Manage, reconfigure from one place

•  Monitor key and meaningful Hadoop metrics, not just OS / HW

•  Scalable in line w/ Hadoop itself

[Diagram: Ambari architecture. Labels: operator; Ambari PHP portal and controller; Puppet, Ganglia, Nagios; data and task sink; Hadoop worker nodes]

Ambari Live Demonstration

Why HDP?

ONLY Hortonworks Data Platform provides…

•  Tightly aligned to core Apache Hadoop development line - Reduces risk for customers who may add custom coding or projects

•  Enterprise Integration - HCatalog provides scalable, extensible integration point to Hadoop data

•  Most reliable Hadoop distribution - Full stack high availability on v1 delivers the strongest SLA guarantees

•  Multi-tenant scheduling and resource management - Capacity and fair scheduling optimize cluster resources

•  Integration with operations, eases cluster management - Ambari is the most open/complete operations platform for Hadoop clusters

Hortonworks Support Subscriptions

Objective: help organizations to successfully develop and deploy solutions based upon Apache Hadoop

• Full-lifecycle technical support available
– Developer support for design, development and POCs
– Production support for staging and production environments
– Up to 24x7 with 1-hour response times
• Delivered by the Apache Hadoop experts
– Backed by development team that has released every major version of Apache Hadoop since 0.1
• Forward-compatibility
– Hortonworks’ leadership role helps ensure bug fixes and patches can be included in future versions of Hadoop projects

Hortonworks Training

Objective: help organizations overcome Hadoop knowledge gaps

• Expert role-based training for developers, administrators & data analysts
– Heavy emphasis on hands-on labs
– Extensive schedule of public training courses available (hortonworks.com/training)

• Comprehensive certification programs

• Customized, on-site courses available

Thank You! Questions & Answers
