© Hortonworks Inc. 2014
Quick Housekeeping
• Q&A box is available for your questions
• Webinar will be recorded
• Thank you for joining!
Hadoop 2.0: YARN to Further Optimize Data Processing
Your Speakers
John Kreisa, VP Strategic Marketing, Hortonworks
Imad Birouty, Director, Technical Product Marketing, Teradata
John Haddad, Senior Director, Product Marketing, Informatica
John Kreisa, VP Strategic Marketing, Hortonworks @marked_man
Big Data Market Trends and Predictions
Big Data Explosion
% by which organizations leveraging modern information management systems will outperform peers by 2015
Hadoop-enabled DBMSs
85% from new data types
50x data growth 2010 to 2020
1 Zettabyte (ZB) = 1 Billion TBs
15x growth rate of machine-generated data by 2020
The US has 1/3 of the world’s data
Big Data is 1 of 5 US GDP Game Changers: $325 billion incremental annual GDP from big data analytics in retail and manufacturing by 2020
Existing systems under pressure
APPLICATIONS: Business Analytics, Custom Applications, Packaged Applications
DATA SYSTEM: Repositories (RDBMS, EDW, NoSQL)
SOURCES: Existing Sources (CRM, ERP, Clickstream, Logs)
Source: IDC
2.8 ZB in 2012
85% from New Data Types
15x Machine Data by 2020
40 ZB by 2020
OLTP, ERP, CRM Systems
Unstructured documents, emails
Clickstream
Server logs
Sentiment, Web Data
Sensor, Machine Data
Geolocation
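As a quick sanity check on the IDC figures quoted in this deck (2.8 ZB in 2012, 40 ZB by 2020, and 1 ZB = 1 billion TB), the short Python sketch below converts zettabytes to terabytes and derives the implied compound annual growth rate. It assumes decimal (SI) units and smooth year-over-year growth, neither of which the slides state; the function names are illustrative.

```python
# Assumption: decimal (SI) units, i.e. 1 TB = 10**12 bytes and 1 ZB = 10**21 bytes.
BYTES_PER_TB = 10**12
BYTES_PER_ZB = 10**21


def zb_to_tb(zettabytes: float) -> float:
    """Convert zettabytes to terabytes."""
    return zettabytes * BYTES_PER_ZB / BYTES_PER_TB


def implied_annual_growth(start: float, end: float, years: int) -> float:
    """Compound annual growth rate that turns `start` into `end` over `years`."""
    return (end / start) ** (1 / years) - 1


# Figures quoted on the slide (source: IDC).
zb_2012, zb_2020 = 2.8, 40.0

tb_per_zb = zb_to_tb(1)
growth_factor = zb_2020 / zb_2012
cagr = implied_annual_growth(zb_2012, zb_2020, years=2020 - 2012)

print(f"1 ZB = {tb_per_zb:,.0f} TB")
print(f"2012 to 2020 growth factor: {growth_factor:.1f}x")
print(f"Implied annual growth: {cagr:.1%}")
```

Under these assumptions the 2012 and 2020 figures correspond to roughly a 14x increase over eight years, which is a different baseline from the "50x data growth 2010 to 2020" headline because the starting year differs.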
Hadoop with YARN Complements Existing Architecture
OPERATIONS TOOLS
Provision, Manage & Monitor
DEV & DATA TOOLS
Build & Test
DATA SYSTEM
Repositories: RDBMS, EDW, NoSQL
SOURCES
OLTP, ERP, CRM Systems
Documents, Emails
Web Logs, Click Streams
Social Networks
Machine Generated
Sensor Data
Geolocation Data
APPLICATIONS
Business Analytics
Custom Applications
Packaged Applications
YARN: Data Operating System
Nodes 1 to N
HDFS (Hadoop Distributed File System)
Interactive, Real-Time, Batch
Hadoop: Typically used for new analytic apps
Axes: SCALE, SCOPE
New Analytic Apps: New types of data, LOB-driven
Unlock Value in New Types of Data
1. Social: Understand how people are feeling and interacting, right now
2. Clickstream: Capture and analyze website visitors’ data trails and optimize your website
3. Sensor/Machine: Discover patterns in data streaming from remote sensors and machines
4. Geographic: Analyze location-based data to manage operations where they occur
5. Server Logs: Diagnose process failures and prevent security breaches
6. Unstructured (text, video, pictures, etc.): Understand patterns in files across millions of web pages, emails, and documents
+ Online archive: Data that was once purged or moved to tape can be stored in Hadoop to discover long-term trends and previously hidden value
New Analytic Applications on Hadoop
Industry | Use Case | Type of Data
Financial Services | New Account Risk Screens | Text, Server Logs
Financial Services | Trading Risk | Server Logs
Financial Services | Insurance Underwriting | Geographic, Sensor, Text
Telecom | Call Detail Records (CDRs) | Machine, Geographic
Telecom | Infrastructure Investment | Machine, Server Logs
Telecom | Real-time Bandwidth Allocation | Server Logs, Text, Social
Retail | 360° View of the Customer | Clickstream, Text
Retail | Localized, Personalized Promotions | Geographic
Retail | Website Optimization | Clickstream
Manufacturing | Supply Chain and Logistics | Sensor
Manufacturing | Assembly Line Quality Assurance | Sensor
Manufacturing | Crowdsourced Quality Assurance | Social
Healthcare | Use Genomic Data in Medical Trials | Structured
Healthcare | Monitor Patient Vitals in Real-Time | Sensor
Pharmaceuticals | Recruit and Retain Patients for Drug Trials | Social, Clickstream
Pharmaceuticals | Improve Prescription Adherence | Social, Unstructured, Geographic
Oil & Gas | Unify Exploration & Production Data | Sensor, Geographic & Unstructured
Oil & Gas | Monitor Rig Safety in Real-Time | Sensor, Unstructured
Government | ETL Offload in Response to Federal Budgetary Pressures | Structured
Government | Sentiment Analysis for Government Programs | Social
Hadoop: YARN Driven MDA Leads to a Data Lake
Axes: SCALE, SCOPE
A Modern Data Architecture/Data Lake
New Analytic Apps: New types of data, LOB-driven
RDBMS
MPP
EDW
Data Lake: An architectural shift in the data center that uses Hadoop to deliver deeper insight across a large, broad, diverse set of data at efficient scale
YARN: Data Operating System
Nodes 1 to N
HDFS (Hadoop Distributed File System)
Interactive, Real-Time, Batch
Integrating with Existing Investments
APPLICATIONS: BusinessObjects BI
DATA SYSTEM: RDBMS, EDW, MPP
SOURCES: Existing Sources (CRM, ERP, Clickstream, Logs); Emerging Sources (Sensor, Sentiment, Geo, Unstructured)
OPERATIONAL TOOLS
DEV & DATA TOOLS
INFRASTRUCTURE
YARN: Data Operating System
Nodes 1 to N
HDFS (Hadoop Distributed File System)
Interactive, Real-Time, Batch
SOURCES
OLTP, ERP, CRM Systems
Documents, Emails
Web Logs, Click Streams
Social Networks
Machine Generated
Sensor Data
Geolocation Data
Viewpoint
Imad Birouty, Director, Technical Product Marketing, Teradata
Analysts Recommend: Shift from a Single Platform to an Ecosystem
“We will abandon the old models based on the desire to implement for high-value analytic applications.”
"Logical" Data Warehouse
Math and Stats
Data Mining
Business Intelligence
Applications
Languages
Marketing
ANALYTIC TOOLS & APPS
USERS
DISCOVERY PLATFORM
DATA WAREHOUSE
ERP
SCM
CRM
Images
Audio and Video
Machine Logs
Text
Web and Social
SOURCES
DATA PLATFORM
ACCESS MANAGE MOVE
UNIFIED DATA ARCHITECTURE
Marketing Executives
Operational Systems
Frontline Workers
Customers Partners
Engineers
Data Scientists
Business Analysts
Fast Loading
Filtering and Processing
Online Archival
Business Intelligence
Predictive Analytics
Operational Intelligence
Data Discovery
Path, graph, time-series analysis
Pattern Detection
Data Lake Overview
• The single source of raw, historical, and real-time operational data
• The ability to cost-effectively explore data sets of unknown, under-appreciated, or unrecognized value
• The reduction of LOB-specific big data environments, which reduces costs and analytical discrepancies
• The co-location of data sets to enable light, on-the-fly integration
Approaches to Data Integration
Schema on Write (Data Warehouse)
• Well understood data
• Relational integrity
• Storage efficiency
Schema on Read (Data Lake)
• Dynamic data
• Reduced coordination
• Human readable
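The contrast between the two approaches can be sketched in a few lines of Python. This is an illustration only, not how Teradata or Hadoop implement either model: the field names, the sample records, and the helper functions are all hypothetical. Schema on write validates and coerces a record before it is stored, while schema on read stores the raw line and applies whatever schema the query needs later.

```python
import json

# Hypothetical fixed schema for the "schema on write" path.
WAREHOUSE_SCHEMA = {"vin": str, "battery_temp_c": float}


def write_with_schema(table: list, record: dict) -> None:
    """Schema on write: validate and coerce before storing; reject bad records."""
    if set(record) != set(WAREHOUSE_SCHEMA):
        raise ValueError(f"unexpected fields: {sorted(record)}")
    table.append({name: cast(record[name]) for name, cast in WAREHOUSE_SCHEMA.items()})


def write_raw(lake: list, line: str) -> None:
    """Schema on read: store the raw line untouched."""
    lake.append(line)


def read_with_schema(lake: list, fields: dict) -> list:
    """Apply a schema only at query time, skipping records that lack a field."""
    rows = []
    for line in lake:
        record = json.loads(line)
        if all(name in record for name in fields):
            rows.append({name: cast(record[name]) for name, cast in fields.items()})
    return rows


warehouse, lake = [], []
write_with_schema(warehouse, {"vin": "VIN001", "battery_temp_c": "41.5"})

write_raw(lake, '{"vin": "VIN001", "battery_temp_c": 41.5, "air_temp_c": 30}')
write_raw(lake, '{"vin": "VIN002", "note": "sensor offline"}')

# The same raw data can be read with different schemas later.
temps = read_with_schema(lake, {"vin": str, "battery_temp_c": float})
notes = read_with_schema(lake, {"vin": str, "note": str})
print(warehouse)
print(temps)
print(notes)
```

The trade-off mirrors the bullets above: the warehouse path rejects anything that does not fit, which protects integrity, while the lake path accepts everything and defers the cost of interpretation to each reader.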
The “Capture Everything” Approach
Data Warehouse: “Capture only what’s needed”
Classic Method: Structured & Repeatable Analysis
• Business determines what questions to ask
• IT structures the data to answer those questions
Data Lake: “Capture in case it’s needed”
Big Data Method: Multi-structured & Iterative Analysis
• IT delivers a platform for storing, refining, and analyzing all data sources
• Business explores data for questions worth answering
Value from combining business data with detail data
• Determine which cars to recall for bad battery lot
  > Business data held in data warehouse
  > Detailed sensor data held in data lake
  > Query combines data
  > Determine which cars to repair
Automobile Sensor Data Use Case
TERADATA: PRODUCTION DATA
• VINs
• Service records
• Warranty data
• DTC descriptions
HADOOP: RAW MULTI-STRUCTURED DATA
• Battery Temperature Sensor data
Chart: Battery Temperature vs. Air Temperature
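The recall scenario boils down to a join on VIN between business data and detailed sensor readings. A minimal Python sketch follows, using in-memory lists to stand in for the Teradata tables and the Hadoop files. All VINs, lot numbers, readings, and the 20 °C threshold are made up for illustration and do not come from the webinar.

```python
from collections import defaultdict

# Stand-in for warehouse production data (hypothetical values).
vehicles = [
    {"vin": "VIN001", "battery_lot": "LOT-A"},
    {"vin": "VIN002", "battery_lot": "LOT-A"},
    {"vin": "VIN003", "battery_lot": "LOT-B"},
]

# Stand-in for raw sensor readings in the data lake (hypothetical values).
sensor_readings = [
    {"vin": "VIN001", "battery_temp_c": 62.0, "air_temp_c": 25.0},
    {"vin": "VIN001", "battery_temp_c": 58.0, "air_temp_c": 24.0},
    {"vin": "VIN002", "battery_temp_c": 31.0, "air_temp_c": 26.0},
    {"vin": "VIN003", "battery_temp_c": 70.0, "air_temp_c": 27.0},
]


def cars_to_repair(vehicles, readings, bad_lot, max_delta_c):
    """Return VINs from the bad lot whose battery ran too hot relative to air."""
    # Largest battery-minus-air temperature gap seen per VIN.
    worst_delta = defaultdict(float)
    for r in readings:
        delta = r["battery_temp_c"] - r["air_temp_c"]
        worst_delta[r["vin"]] = max(worst_delta[r["vin"]], delta)

    # Join: business data narrows to the suspect lot, sensor data picks affected cars.
    return [
        v["vin"]
        for v in vehicles
        if v["battery_lot"] == bad_lot and worst_delta.get(v["vin"], 0.0) > max_delta_c
    ]


print(cars_to_repair(vehicles, sensor_readings, bad_lot="LOT-A", max_delta_c=20.0))
```

Only the cars that are both in the suspect lot and show abnormal sensor behavior are flagged, which is the point of combining the two stores rather than recalling the whole lot.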
Customer Value Based on Social Influence Use Case
Platforms: HADOOP, TERADATA ASTER DATABASE, TERADATA DATABASE
• Determine high value customers based on history
• Determine customer value based on social influence
• Determine customer sentiment
• Determine customer sphere of influence
Data Optimization for the Modern Data Architecture
John Haddad, Senior Director, Product Marketing, Informatica
The Big Data Journey
Big Data business initiatives, from IT driven to Business driven:
• Data Warehouse Optimization: Optimize infrastructure for performance, cost, & scalability
• Managed Data Lake: A single place to manage the supply and demand of data
• Real-Time Customer Analytics: Real-time proactive customer engagement
Proactive Customer Engagement
Web Logs Clickstream Data
Big Data Integration / Analytics
Streaming
Master Data Mgmt
Financial Advisors
Integration & Quality
Customer / Product Master
Customer
Customer Smartphone
Real-Time Event Processing
Visualization
Social Data / Signals
Social Data Connector
FIX, SWIFT, Market Data
Customer Portal
DATA PLATFORM
DISCOVERY PLATFORM
DATA WAREHOUSE
Proactive Patient Member Engagement
Web Logs Clickstream Data
Big Data Integration / Analytics
Streaming
Care Providers
Integration & Quality
Patient Member
Patient Member Smartphone
Real-Time Event Processing
Visualization
Social Data / Signals
Social Data Connector
RFID, Patient Monitoring
Healthcare & Patient Forums
Master Data Mgmt
Member / Provider Master
DATA PLATFORM
DISCOVERY PLATFORM
DATA WAREHOUSE
Unified Data Architecture
DATA PLATFORM
DISCOVERY PLATFORM
DATA WAREHOUSE
The Intelligent Data Platform
Role-Based Data Management Tools
Infrastructure Services
Data Intelligence: Metadata Meets Machine Learning
Data Infrastructure
Vibe™ Virtual Data Machine
New
Industry-Leading
Data Lake Infrastructure
Data Lake Architecture: Informatica Developers Are Now Hadoop Developers
Visual Development Environment
Enterprise Repositories
MDM
SOURCE DATA: Servers & Mainframe, Files, Databases, JMS Queues, Sensor data, Social
LOAD: Batch, Replicate, Stream, Archive
DATA REFINEMENT: Profile, Parse, ETL, Cleanse, Match
Hadoop stack: Apache Hive (SQL), Apache MapReduce, Apache Tez, Apache YARN, HDFS (Hadoop Distributed File System), nodes 1 to N
DELIVER: Batch, Services, Events, Topics
DATA WAREHOUSE
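To make the data refinement stages on this slide concrete, here is a toy Python sketch of parse, cleanse, and match steps chained into a pipeline. It is not Informatica's tooling or API. The record layout, function names, sample values, and the email-based matching rule are hypothetical, chosen only to illustrate the flow from raw source lines to de-duplicated records ready for delivery.

```python
def parse(line: str) -> dict:
    """Parse a raw 'name|email' line into a record."""
    name, email = line.split("|")
    return {"name": name, "email": email}


def cleanse(record: dict) -> dict:
    """Normalize whitespace and letter case."""
    return {
        "name": " ".join(record["name"].split()).title(),
        "email": record["email"].strip().lower(),
    }


def match(records: list) -> list:
    """Collapse records that share an email, keeping the first one seen."""
    seen = {}
    for record in records:
        seen.setdefault(record["email"], record)
    return list(seen.values())


def refine(lines: list) -> list:
    """Run parse -> cleanse -> match over raw source lines."""
    return match([cleanse(parse(line)) for line in lines])


raw = [
    "ada  lovelace|Ada@Example.com ",
    "Ada Lovelace|ada@example.com",
    "grace hopper|grace@example.com",
]
for row in refine(raw):
    print(row)
```

Keeping each stage as a small function with a plain record in and out is one way to let the same logic run unchanged whether it is executed locally or pushed down to a cluster, which is the portability argument the slide is making.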
How do you plan to staff your Big Data projects?
Hadoop hand-coders vs. Informatica developers: 4 weeks vs. 4 days, and 2X performance!
Choose tools that leverage existing skills so you can quickly staff Big Data projects
How do you adopt and minimize the impact of new and rapidly changing technologies?
Choose a platform and tools that minimize the need to rebuild your data pipeline as technologies change
Hadoop
Cloud DI Servers Data Warehouse
Development Deployment
Time to Deploy
How long does it take you to deploy Big Data projects to production?
Maximize Reuse
Available 24x7 Scale Performance
Flexible to Change Easy to Maintain
Automatically Deploy
Time to Deploy
Everything you build in the sandbox should be immediately deployed as enterprise-ready production
Next Steps
Try the free Informatica Big Data Edition 60-Day Trial: http://marketplace.informatica.com/bdehortonworks
Download the Hortonworks Sandbox: http://hortonworks.com/products/hortonworks-sandbox/
Download Teradata Express: http://downloads.teradata.com/download/database
Download Aster Express: http://downloads.teradata.com/download/aster/aster-express