Date posted: 14-Jul-2015
Category: Technology
Uploaded by: ameet-paranjape
Page 1 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Data Access with Hadoop
@ameetp512
Ameet Paranjape
Interactive and real-time data analysis in Hadoop!
2013 digital universe: 2.3 Zettabytes
2020 digital universe: 40 Zettabytes
85% of growth from new types of data, with machine-generated data increasing 15x
Analysts' consensus estimates enterprise data growth of
year over year through 2020
50x
1 Zettabyte (ZB) = 1 million Petabytes (PB); Sources: IDC and IDG Enterprise
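As a rough check on the two headline figures above, the sketch below computes the constant yearly growth rate that would take 2.3 ZB in 2013 to 40 ZB in 2020. It comes out at roughly 50% per year. The function name and the assumption of smooth compound growth are illustrative and not taken from the slide.

```python
# Figures quoted on the slide (zettabytes).
SIZE_2013_ZB = 2.3
SIZE_2020_ZB = 40.0
PB_PER_ZB = 1_000_000  # 1 ZB = 1 million PB, as stated on the slide


def implied_annual_growth(start: float, end: float, years: int) -> float:
    """Constant yearly growth rate that turns `start` into `end` after `years` years.

    Assumes smooth compound growth, which is a simplification.
    """
    return (end / start) ** (1 / years) - 1


growth = implied_annual_growth(SIZE_2013_ZB, SIZE_2020_ZB, 2020 - 2013)
size_2013_pb = SIZE_2013_ZB * PB_PER_ZB

print(f"2013 universe: {size_2013_pb:,.0f} PB")
print(f"Implied growth 2013-2020: {growth:.0%} per year")
```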
Traditional systems under pressure
• Silos of Data
• Costly to Scale
• Constrained Schemas
…and difficult to manage new data
[Diagram: traditional data architecture]
APPLICATIONS: Business Analytics; Custom Applications; Packaged Applications
DATA SYSTEM: RDBMS, EDW, MPP
SOURCES: Existing Sources (CRM, ERP,…)
New Data Types: Clickstream; Geolocation; Sentiment, Web Data; Sensor, Machine Data; Unstructured docs, emails; Server logs
Virtualization
Slicing your servers into pieces so you can parcel out computing resources
Hadoop
Tying your servers together to make them act like one big computer
Cost of storage is going down
According to StatisticBrain, the average cost per gigabyte of storage was
$437,500 in 1980, $11 in 2000, and just five cents in 2013.
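To put those per-gigabyte prices in big-data terms, the following sketch scales them up to one petabyte. The prices are the ones quoted above. The decimal conversion (1 PB = 1,000,000 GB) and the helper name are assumptions made for illustration.

```python
# Average cost per gigabyte quoted on the slide (US dollars).
COST_PER_GB = {1980: 437_500.00, 2000: 11.00, 2013: 0.05}

# Assumption: decimal units, so 1 PB = 1,000,000 GB.
GB_PER_PB = 1_000_000


def cost_of_petabyte(year: int) -> float:
    """Cost in US dollars of one petabyte at the quoted per-gigabyte price."""
    return COST_PER_GB[year] * GB_PER_PB


for year in sorted(COST_PER_GB):
    print(f"{year}: ${cost_of_petabyte(year):,.0f} per PB")
```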
Hadoop 101
The basics
1. Hadoop ties your servers together, and makes them act like one big computer
• So you can use inexpensive servers to do your big data processing
2. Hadoop works well with structured, semi-structured, and unstructured information
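The "one big computer" idea can be sketched in a few lines: split the input across servers, let each server work on its own piece, then merge the partial results. The code below is a single-process toy that only simulates the servers. It is not Hadoop's API, and every name in it is hypothetical.

```python
from collections import Counter
from functools import reduce


def split_into_blocks(lines, num_servers):
    """Deal the input lines out to `num_servers` simulated servers."""
    return [lines[i::num_servers] for i in range(num_servers)]


def map_block(block):
    """Work done locally on one server: count the words in its block."""
    counts = Counter()
    for line in block:
        counts.update(line.lower().split())
    return counts


def merge_counts(left, right):
    """Combine two partial results into one."""
    return left + right


def word_count(lines, num_servers=3):
    blocks = split_into_blocks(lines, num_servers)
    # On a real cluster these per-block jobs would run in parallel on separate machines.
    partials = [map_block(block) for block in blocks]
    return reduce(merge_counts, partials, Counter())


lines = [
    "hadoop ties servers together",
    "servers act like one big computer",
    "hadoop stores structured and unstructured data",
]
totals = word_count(lines)
print(totals.most_common(2))
```

The answer does not depend on how many servers the work is split across, which is what lets a cluster of inexpensive machines stand in for one large one.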
Hadoop and the Modern Data Architecture (MDA)
[Diagram: Modern Data Architecture]
SOURCES: Existing Systems; Clickstream; Web & Social; Geolocation; Sensor & Machine; Server Logs; Unstructured
DATA SYSTEM: RDBMS, EDW, MPP; YARN: Data Operating System (nodes 1 … N) on HDFS (Hadoop Distributed File System), serving Batch, Interactive, and Real-Time workloads
APPLICATIONS: Business Analytics; Custom Applications; Packaged Applications
Recommended Reading
The Forrester Wave Report – Big Data Hadoop Solutions, Q1 2014
Hadoop Comparison Tips
1. Is the solution open or closed source?
2. If code is open, who owns the IP?
3. What’s available for free and what do you pay for?
4. Is the solution substrate agnostic?
5. OS support options?
6. Partnerships
7. What’s the pricing model?
8. Local resources to help?
A Blueprint for Enterprise Hadoop
PRESENTATION & APPLICATION: Enable both existing and new applications to provide value to the organization
ENTERPRISE MGMT & SECURITY: Empower existing operations and security tools to manage Hadoop
GOVERNANCE & INTEGRATION: Load data and manage according to policy
DATA ACCESS: Access your data simultaneously in multiple ways (batch, interactive, real-time)
DATA MANAGEMENT: Store and process all of your Corporate Data Assets (YARN Data Operating System)
SECURITY: Provide layered approach to security through Authentication, Authorization, Accounting, and Data Protection
OPERATIONS: Deploy and effectively manage the platform
DEPLOYMENT OPTIONS: Provide deployment choice across physical, virtual, cloud
Apache Hadoop & A Hadoop “Distribution”
Apache Hadoop is a project
–Governed by the Apache Software Foundation (ASF)
–Comprises YARN and HDFS
A Hadoop distribution is a package of projects (e.g. HDP)
–Packages Apache Hadoop and related Apache projects
–Tested for consistency across the entire package
–Hardened for the enterprise
It extends Hadoop with:
–Data access services to manipulate the data
–Data governance and integration services
–Security services
–Operational services to manage the cluster
YARN has transformed Hadoop
[Diagram: DATA MANAGEMENT: YARN: Data Operating System (nodes 1 … N) on HDFS (Hadoop Distributed File System); DATA ACCESS: Batch, Interactive & Real-Time]
Apache Projects for Data Access
Apache Pig
Apache Hive
Apache HBase
Apache Storm
Apache Solr
Apache Spark
Traditional Tools
[Diagram: DATA ACCESS (Batch, Interactive & Real-Time) engines running on DATA MANAGEMENT: YARN: Data Operating System (nodes 1 … N) on HDFS (Hadoop Distributed File System)]
Script: Pig
SQL: Hive, HCatalog
NoSQL: HBase, Accumulo
Stream: Storm
Search: Solr
In-Memory: Spark
Others: ISV Engines
Tez (appears twice in the diagram)
Apache Projects for Governance
[Diagram: the YARN and HDFS data access stack (Pig, Solr, Hive, HCatalog, HBase, Accumulo, Storm, Spark, Tez, ISV Engines) with a GOVERNANCE & INTEGRATION column added]
Data Workflow, Lifecycle & Governance: Falcon, Sqoop, Flume, NFS, WebHDFS
Apache Falcon
Apache Sqoop
Apache Flume
Hadoop NFS & WebHDFS
Apache Projects for Security
Apache Knox
Apache Argus
Entire Stack (HDFS, Hive, YARN)
[Diagram: the YARN and HDFS data access stack with GOVERNANCE & INTEGRATION (Falcon, Sqoop, Flume, NFS, WebHDFS) and a SECURITY column added]
SECURITY: Authentication, Authorization, Accounting, Data Protection
Storage: HDFS
Resources: YARN
Access: Hive, …
Pipeline: Falcon
Cluster: Knox
Apache Projects for Operations
Apache Ambari
Apache Zookeeper
Apache Oozie
[Diagram: the YARN and HDFS data access stack with GOVERNANCE & INTEGRATION, SECURITY, and an OPERATIONS column added]
OPERATIONS
Provision, Manage & Monitor: Ambari, Zookeeper
Scheduling: Oozie
Remember the MDA
[Diagram: Modern Data Architecture]
SOURCES: Existing Systems; Clickstream; Web & Social; Geolocation; Sensor & Machine; Server Logs; Unstructured
DATA SYSTEM: RDBMS, EDW, MPP; YARN: Data Operating System (nodes 1 … N) on HDFS (Hadoop Distributed File System), serving Batch, Interactive, and Real-Time workloads
APPLICATIONS: Business Analytics; Custom Applications; Packaged Applications
What is Data Access?
Data Access defines ALL the channels
through which data can be accessed,
analyzed, cleansed and consumed within
Hadoop. Each channel can be categorized
into THREE core patterns: Batch, Interactive
and Real-time. Multiple engines provide
optimized access to your mission-critical
data.
Access patterns enabled by YARN
Batch: Needs to happen, but with no timeframe limitations
Interactive: Needs to happen at human time
Real-Time: Needs to happen at machine execution time
[Diagram: YARN: Data Operating System (nodes 1 … N) on HDFS (Hadoop Distributed File System), serving Batch, Interactive, and Real-Time workloads]
HBase
• Apache™ HBase is a non-relational (NoSQL)
database that runs on top of the Hadoop®
Distributed File System (HDFS).
• It is columnar and provides fault-tolerant
storage and quick access to large quantities
of sparse data.
• It also adds transactional capabilities to
Hadoop, allowing users to conduct updates,
inserts and deletes.
• HBase was created for hosting very large
tables with billions of rows and millions of
columns.
Developers use it to:
• Provide low latency access to massive amounts of data (e.g. recommendation engine results)
• Serve as a document store
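The data model described above (row keys, sparse columns, and support for inserts, updates, and deletes) can be illustrated with a small in-memory toy. This is only a sketch of the idea and not HBase's client API. The class, method names, and sample values are all hypothetical.

```python
from collections import defaultdict


class SparseTable:
    """Toy in-memory table: row key -> {"family:qualifier" -> value}.

    Only cells that were actually written take up space, which is the
    point of a sparse, column-oriented layout.
    """

    def __init__(self):
        self._rows = defaultdict(dict)

    def put(self, row_key, column, value):
        """Insert a new cell or overwrite an existing one."""
        self._rows[row_key][column] = value

    def get(self, row_key, column=None):
        """Return one cell, or a copy of the whole row when no column is given."""
        row = self._rows.get(row_key, {})
        return dict(row) if column is None else row.get(column)

    def delete(self, row_key, column):
        """Remove a single cell; drop the row once it has no cells left."""
        row = self._rows.get(row_key)
        if row is None:
            return
        row.pop(column, None)
        if not row:
            del self._rows[row_key]

    def cell_count(self):
        return sum(len(row) for row in self._rows.values())


table = SparseTable()
table.put("user1", "recs:item42", 0.93)     # insert
table.put("user1", "profile:city", "Chicago")
table.put("user2", "recs:item7", 0.81)      # user2 has no profile columns at all
table.put("user1", "recs:item42", 0.95)     # update
table.delete("user1", "profile:city")       # delete
print(table.get("user1"))
```

Looking up a row by its key is a single dictionary access here, a loose stand-in for the low-latency, key-based reads that make this kind of store a fit for serving recommendation results.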
Spark
• Spark is a general-purpose engine for ad-hoc
interactive analytics, iterative machine learning,
and other use cases well-suited to
interactive, in-memory data processing of GB
to TB sized datasets.
• Spark loads data into memory so it can be
queried repeatedly. It can create a “shadow”
of data that can be used in the next iteration
of a query
• Spark provides simple APIs for data scientists
and engineers familiar with Scala
(programming language) to build applications
• Spark is YARN-ready – another engine on
YARN!
Developers use it for:
• Data science: machine learning and iterative analytics
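The benefit of loading data into memory once and querying it repeatedly shows up most clearly in iterative algorithms. The sketch below is a plain-Python toy of that pattern and does not use Spark's API. The `CachedDataset` class, the loader, and the sample values are hypothetical and exist only to count how often the "storage" is read.

```python
class CachedDataset:
    """Loads records once, keeps them in memory, and counts how often the loader ran."""

    def __init__(self, loader):
        self._loader = loader
        self._records = None
        self.load_count = 0

    def records(self):
        if self._records is None:      # first access: read from "storage"
            self._records = list(self._loader())
            self.load_count += 1
        return self._records           # later accesses: served from memory


def load_from_storage():
    """Stand-in for an expensive read; the values are made up for illustration."""
    yield from [2.0, 4.0, 4.0, 5.0, 7.0, 9.0]


def estimate_mean(dataset, iterations=50, step=0.5):
    """Iteratively move a guess toward the mean, scanning the data on every pass."""
    guess = 0.0
    for _ in range(iterations):
        data = dataset.records()
        gradient = sum(guess - x for x in data) / len(data)
        guess -= step * gradient
    return guess


dataset = CachedDataset(load_from_storage)
mean = estimate_mean(dataset)
print(f"estimate={mean:.3f}, loads={dataset.load_count}")
```

Fifty passes over the data trigger a single load. Without the cache, each iteration would pay the storage read again, which is the cost an in-memory engine is meant to avoid.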
Stream Processing in Hadoop
How do I deal with this continuous stream of data coming in from sensors, etc.?
Apache Storm: Real-time event processing for sensor and business activity monitoring
• Unlocks new business cases for Hadoop
• Key component of a data lake architecture
• Scale: Ingest millions of events per second. Fast query on petabytes of data
• Integrated with Ambari to manage
• Predictive Analytics
Sentiment, Clickstream, Machine/Sensor, Server Logs, Geo-location
Monitor real-time data to…
Finance: Prevent securities fraud, compliance violations; Optimize order routing, pricing
Telco: Prevent security breaches, network outages; Optimize bandwidth allocation, customer service
Retail: Optimize offers, pricing
Manufacturing: Prevent machine failures; Optimize supply chain
Transportation: Prevent driver & fleet issues; Optimize routes, pricing
Web: Prevent application failures, operational issues; Optimize site content
[Diagram: YARN: Data Operating System; Batch, Interactive, Real-Time]
Trucking company w/ large fleet of trucks in Midwest
A truck generates millions of events for a given route; an event could be:
• 'Normal' events: starting / stopping of the vehicle
• 'Violation' events: speeding, excessive acceleration and braking, unsafe tail distance
Company uses an application that monitors truck locations and violations from the truck/driver in real-time
Route? Truck? Driver?
Analysts query a broad history to understand if today's violations are part of a larger problem with specific routes, trucks, or drivers
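The two needs in this scenario, spotting violations as events arrive and summarizing history by route, truck, or driver, can be sketched with a generator and a counter. This is a single-process illustration rather than Storm, HBase, or Hive code. The event fields, identifiers, and violation type names are assumptions invented for the example.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Iterator

# Hypothetical event type names for the violations listed on the slide.
VIOLATION_TYPES = {
    "speeding",
    "excessive_acceleration",
    "excessive_braking",
    "unsafe_tail_distance",
}


@dataclass(frozen=True)
class TruckEvent:
    truck_id: str
    driver_id: str
    route_id: str
    event_type: str  # e.g. "start", "stop", or one of VIOLATION_TYPES


def violations(events: Iterable[TruckEvent]) -> Iterator[TruckEvent]:
    """Real-time side: pass through only the events that should raise an alert."""
    for event in events:
        if event.event_type in VIOLATION_TYPES:
            yield event


def violations_by(events: Iterable[TruckEvent], field: str) -> Counter:
    """Analyst side: count historical violations per route, truck, or driver."""
    return Counter(getattr(event, field) for event in violations(events))


history = [
    TruckEvent("truck-1", "driver-a", "route-10", "start"),
    TruckEvent("truck-1", "driver-a", "route-10", "speeding"),
    TruckEvent("truck-2", "driver-b", "route-10", "excessive_braking"),
    TruckEvent("truck-2", "driver-b", "route-20", "stop"),
    TruckEvent("truck-1", "driver-a", "route-20", "unsafe_tail_distance"),
]

for field in ("route_id", "truck_id", "driver_id"):
    print(field, dict(violations_by(history, field)))
```

Because `violations` is a generator, the same filter works on an unbounded live feed as well as on a stored history, which mirrors the split between real-time monitoring and later analysis.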
Solutions on Hadoop Require All!
[Diagram: components of the trucking solution]
Truck Sensors
Inbound Messaging (Kafka)
Stream Processing (Storm)
Real-time Serving (HBase)
Alerts & Events (ActiveMQ)
Real-Time User Interface
Interactive Query (Hive on Tez)
Microsoft Excel
Distributed Storage: HDFS
Many Workloads: YARN
Query Executes Blazingly Fast with Hive 13 on Tez
YARN Has Fundamentally Changed Hadoop
YARN enables:
• More Workloads: From batch to interactive & real-time
• More Data: Multiple data sets of varying types and structures
• More Value: Hosting multiple business cases in a single Hadoop cluster