Date post: | 28-Nov-2014 |
Category: | Technology |
Upload: | insidehpc |
The Analytics Frontier of the Hadoop Eco-system
Ted Willke, Senior Principal Engineer and GM
• Scalable commodity processing established with Hadoop MapReduce, with good libraries for machine learning and data mining
• Twitter libraries like Scalding improve upon MapReduce, providing a more generalized dataflow model
• YARN opened the door to in-memory iterative processing with Apache Spark, which has its own libraries, with others being ported
Today | Hadoop Analytics
• Variety - Expansion of data primitives in commercial use
• Speed - Data processing models evolving (batch -> streaming)
• Complexity – Monolithic analytics -> analytics pipelines
• Intelligence – Prescriptive ML -> applying ML to ML itself
• Ease of Use – Gap between skills and needs growing
Trends | Hadoop Eco-System Analytics
Life Sciences: Personalized medicine, drug repurposing predictions, integration of heterogeneous data
Education: Personalized instruction, outcomes measurement and intervention
Network Security: Data fusion, threat assessment and identification
Retail: Inventory management, product display management, demand forecasting
Trends | Areas of Application (to name a few)
Variety
Variety | Primitives Usage Patterns
Model: SQL, Column, Key-Value, Document, Graph
Access: Off-line (Queue), Async (Bus), Sync (I/O)
Implementation: API (Remote), LIB (Local)
• When the problem is an information network
• When a graph is a natural way of expressing the algorithm
• When you want to study specific relationships
• When you want faster machine learning or solvers on sparse data
shortest path
central
influence
sub networks
triangle count
Variety | Graphical Model
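Two of the graph measures named on this slide, triangle count and centrality, can be sketched in a few lines. This is a minimal single-machine illustration, not the tooling used in the talk: the helper names (`build_adjacency`, `triangle_count`, `degree_centrality`) and the toy edge list are assumptions, and degree centrality stands in for whichever centrality measure the slide had in mind.

```python
from itertools import combinations


def build_adjacency(edges):
    """Build undirected adjacency sets from an iterable of (u, v) pairs."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def triangle_count(adj):
    """Count triangles by testing every pair of neighbors of every vertex."""
    closed = 0
    for u, nbrs in adj.items():
        for v, w in combinations(nbrs, 2):
            if w in adj[v]:
                closed += 1
    return closed // 3  # each triangle is found once from each of its corners


def degree_centrality(adj):
    """Fraction of the other vertices that each vertex is connected to."""
    n = len(adj)
    if n <= 1:
        return {}
    return {u: len(nbrs) / (n - 1) for u, nbrs in adj.items()}


# Hypothetical toy graph: one triangle (a, b, c) plus a pendant vertex d.
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
adj = build_adjacency(edges)
triangles = triangle_count(adj)
centrality = degree_centrality(adj)
```

On graphs of the size quoted later in the deck, the same logic would have to be partitioned across a cluster; the sketch only shows what is being computed.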
High
Program Importance (Centrality)
Low
Graph of channel viewing behavior
Current popular surfing patterns
SH002463130000 EP005544723744
Changes in surfing behavior may predict
customer churn.
Variety | Graph Statistics
Preference and Similarity Recommendations
User
Movie
1.7MM Nodes 23.9MM Edges
similar cast
prefers
similar topic
userId: A0A22A5
title: The Godfather genre: Crime drama cast: [M. Brando, Al Pacino]
title: Scarface genre: Crime drama cast: [Al Pacino, M. Pfeiffer]
title: The Departed genre: Crime drama cast: [L. DiCaprio, M. Damon]
weight=11.8
weight=0.67
weight=0.03
weight=14.98
Variety | Graph Search
URL Ground-Truth Data
IP/Domain Reputations
420MM Records
74.5MM Nodes 185MM Edges
URL
Domain
IP Address
Calculation of priors
LBP Messaging 84.231.82.93
86.39.155.137
forum.vsichko.com
hermansonskok.se
euskzzbz.nonetheups.com
keesenbep.spaces.live.com
Variety | Graphical Machine Learning
Variety | Loopy Belief Propagation on the (semantic) web
Reputations
Neutral
Good
Bad
Suspect
Variety | Unification with Apache Spark
Image Source: Databricks
• In-memory structures (RDDs) support both table and graph abstractions
• Batch processing and Spark streaming
Spark core: RDDs, Transformations, and Actions
Spark Streaming (real-time): DStreams, i.e. streams of RDDs
Spark SQL: SchemaRDDs
MLlib (machine learning): RDD-based matrices
GraphX (graph processing/machine learning): RDD-based graphs
Source: Marcus Paradies, GRAph Data-management & ExperienceS Workshop (GRADES 2014)
Variety | Unification within the In-Memory Database (IMDB)
• Index data
structure for
graph traversal
• Prototyped in
SAP HANA
distributed
columnar IMDB
• Lays foundation
for complex
graph query and
algorithms
Variety | Graph Traversal
Source: Marcus Paradies, GRAph Data-management & ExperienceS Workshop (GRADES 2014)
Variety | Graph Indexing
Source: Marcus Paradies, GRAph Data-management & ExperienceS Workshop (GRADES 2014)
Variety | Graph Traversal Results
Source: Marcus Paradies, GRAph Data-management & ExperienceS Workshop (GRADES 2014)
Speed
Cloud Infrastructure
UI
Data Platform
Analytics Platform
Datacenter Network Gateway Thing
Services
Speed | Hadoop Meets The Internet of Things
Source: http://spark.apache.org/docs/1.0.0/streaming-programming-guide.html, accessed on 9/26/2014
Data Stream
Feature Processing
Model Updates
Learning
Distributed Messaging System
(e.g., Kafka)
Speed | Stream Processing Pipeline
• Data replay (e.g., a bug is found or application improved)
• Getting faster and more efficient than “fast batch”
• Time-evolving models and computation
Speed | Challenges
Source: Jay Kreps, http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html, accessed on 9/26/2014
Lambda
• Implement transform logic twice
• Federate information at query time
• Retains input data unchanged
Kappa
• Retain full replay window
• 2nd instance can re-process
• Query against latest table
The thinking continues to evolve....
Speed | Cluster-Scale Stream Processing
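The Kappa-style replay idea (retain the input log, let a second instance of the job re-process it, then query the newest output table) can be illustrated with a toy in-memory log. Everything here is an assumption made for illustration: the event shape, the two transform versions, and the `run_job` helper. A real deployment would replay from a retained log in a messaging system such as Kafka.

```python
def transform_v1(event):
    """Original transform logic: use the raw value."""
    return event["value"]


def transform_v2(event):
    """Improved logic (for example after a bug fix): clamp negative readings."""
    return max(event["value"], 0)


def run_job(log, transform):
    """Replay the full retained log through one transform into a fresh table."""
    table = {}
    for event in log:
        table[event["key"]] = table.get(event["key"], 0) + transform(event)
    return table


# The retained, unchanged input log (hypothetical sensor events).
log = [
    {"key": "sensor-1", "value": 5},
    {"key": "sensor-1", "value": -2},
    {"key": "sensor-2", "value": 7},
]

tables = {"v1": run_job(log, transform_v1)}
serving = "v1"

# A second job instance re-processes the same log with the new logic.
tables["v2"] = run_job(log, transform_v2)
serving = "v2"  # queries now go against the latest table

latest = tables[serving]
```

Only one code path exists at any time, which is the contrast with writing the transform twice in a Lambda-style design.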
Source: http://spark.apache.org/docs/1.0.0/streaming-programming-guide.html, accessed on 9/26/2014
Apply DStreams to built-in:
• Machine Learning
• Graph Processing
Speed | Spark Streaming
Source: http://spark.apache.org/docs/1.0.0/streaming-programming-guide.html, accessed on 9/26/2014
• Mini-batch +/- windowing
• Analytics can be run on
any of the resultant RDDs
• No provisions for merging
RDDs
Speed | Spark (Discretized) Streaming
(Mini-) batch Streaming (Mini-) batch Analytics
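The mini-batch and windowing model described above can be sketched without a cluster. This is a plain-Python illustration of the idea, not Spark's API: `mini_batches`, `sliding_windows`, the batch interval, and the sample events are all assumptions, and timestamps are assumed to be non-negative.

```python
def mini_batches(events, batch_interval):
    """Group (timestamp, value) events into consecutive fixed-length batches."""
    batches = {}
    for ts, value in events:
        batches.setdefault(int(ts // batch_interval), []).append(value)
    if not batches:
        return []
    # Emit empty batches for intervals in which nothing arrived.
    return [batches.get(i, []) for i in range(max(batches) + 1)]


def sliding_windows(batches, window_len, slide):
    """Yield the union of `window_len` batches, advancing `slide` batches at a time."""
    for end in range(window_len, len(batches) + 1, slide):
        window = []
        for batch in batches[end - window_len:end]:
            window.extend(batch)
        yield window


# Hypothetical stream of (seconds, value) events.
events = [(0.2, 1), (0.9, 2), (1.1, 3), (2.5, 4), (3.0, 5), (3.7, 6)]
batches = mini_batches(events, batch_interval=1.0)

# Any analytic can be run independently on each window; here, a sum.
window_sums = [sum(w) for w in sliding_windows(batches, window_len=2, slide=1)]
```

Each window is computed from scratch, which mirrors the point that the model has no built-in notion of merging results from one batch into the next.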
Image Source: GraphX project
• Graph processing engine on Spark
• Supports Pregel-style vertex programming
• View same data as either graphs or collections
Speed | GraphX API for Spark
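Pregel-style vertex programming can be sketched as a loop of supersteps in which vertices exchange messages along edges until no messages remain. The following is a single-process illustration under stated assumptions: the function and parameter names (`pregel`, `vprog`, `send_msg`, `merge_msg`) are hypothetical and do not reproduce the GraphX signature. The example computes connected components by propagating the smallest vertex id.

```python
def pregel(vertices, edges, initial_msg, vprog, send_msg, merge_msg,
           max_supersteps=20):
    """Run supersteps until no vertex receives a message.

    vertices: dict of vertex id -> value; edges: list of (src, dst) pairs.
    """
    values = {vid: vprog(vid, val, initial_msg) for vid, val in vertices.items()}
    active = set(values)
    for _ in range(max_supersteps):
        inbox = {}
        for src, dst in edges:
            if src not in active and dst not in active:
                continue  # only edges touching a recently updated vertex send
            for target, msg in send_msg(src, values[src], dst, values[dst]):
                inbox[target] = merge_msg(inbox[target], msg) if target in inbox else msg
        if not inbox:
            break
        for vid, msg in inbox.items():
            values[vid] = vprog(vid, values[vid], msg)
        active = set(inbox)
    return values


def min_label_messages(src, src_val, dst, dst_val):
    """Send the smaller label across the edge, in whichever direction helps."""
    if src_val < dst_val:
        return [(dst, src_val)]
    if dst_val < src_val:
        return [(src, dst_val)]
    return []


# Hypothetical graph with two components: {1, 2, 3} and {4, 5}.
vertices = {v: v for v in (1, 2, 3, 4, 5)}
edges = [(1, 2), (2, 3), (4, 5)]
components = pregel(
    vertices, edges, float("inf"),
    vprog=lambda vid, val, msg: min(val, msg),
    send_msg=min_label_messages,
    merge_msg=min,
)
```

The vertex program sees only its own value and the merged incoming message, which is what makes this style straightforward to distribute.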
• Current Spark streaming provides mini-batch streaming
• No concept of data (model) merging
• GraphX is currently designed for static graphs:
1. Merge table data prior to graph pipeline
2. Re-generate entire (accumulated) graph
3. Re-run machine learning at each window
Speed | Spark Streaming for GraphX?
Straightforward, but wastes computation and time. Can we do better?
• Merge information directly into data model used by algorithms
• Static algorithms -> Online algorithms
• Incremental re-computation triggered by changes in data values or data structure
• Possible with many machine learning algos (PageRank as example)
• Evolve IM data stores to maximize performance and freshness
• Better partitioning algorithms -> reduced data replication
• Dynamic indexing -> fast retrieval
Speed | Online Version of GraphX
Static PageRank (delta method)
Online PageRank
Speed | Online PageRank
Good for algos with abelian accumulators (commutative, associative, with inverse)
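The role of the inverse in an abelian accumulator can be made concrete with a residual-based (delta) PageRank. This is a sketch under stated assumptions, not the implementation behind the talk's results: the class `OnlinePageRank`, its methods, and the default tolerance are hypothetical, and ranks are unnormalized with reset probability 0.15. When an edge is added, the source vertex retracts part of the share it previously sent to its old neighbors (a negative delta, which needs subtraction) and sends a share to the new neighbor; only the affected residuals are then propagated.

```python
from collections import defaultdict


class OnlinePageRank:
    """Unnormalized PageRank maintained incrementally as edges arrive."""

    def __init__(self, reset=0.15, tol=1e-9):
        self.reset, self.tol = reset, tol
        self.out = defaultdict(set)  # vertex -> set of out-neighbors
        self.rank = {}               # current rank estimate
        self.residual = {}           # rank mass not yet applied or forwarded

    def _ensure(self, v):
        if v not in self.rank:
            self.rank[v] = 0.0
            self.residual[v] = self.reset

    def add_edge(self, u, v):
        self._ensure(u)
        self._ensure(v)
        if v in self.out[u]:
            return
        d = len(self.out[u])
        damped = (1.0 - self.reset) * self.rank[u]
        # Retract the difference between the old share (1/d) and the new
        # share (1/(d+1)) from existing neighbors: this is the inverse step.
        for w in self.out[u]:
            self.residual[w] += damped * (1.0 / (d + 1) - 1.0 / d)
        self.out[u].add(v)
        self.residual[v] += damped / (d + 1)

    def converge(self):
        """Push residuals until all are within tolerance; return push count."""
        pending = [v for v in self.rank if abs(self.residual[v]) > self.tol]
        pushes = 0
        while pending:
            v = pending.pop()
            r = self.residual[v]
            if abs(r) <= self.tol:
                continue
            self.residual[v] = 0.0
            self.rank[v] += r
            pushes += 1
            nbrs = self.out[v]
            for w in nbrs:
                self.residual[w] += (1.0 - self.reset) * r / len(nbrs)
                if abs(self.residual[w]) > self.tol:
                    pending.append(w)
        return pushes


# Online: converge on the initial graph, then absorb one streamed edge.
online = OnlinePageRank()
for edge in [("a", "b"), ("b", "c"), ("c", "a")]:
    online.add_edge(*edge)
online.converge()
online.add_edge("a", "c")
online.converge()

# Static baseline: recompute from scratch on the accumulated graph.
static = OnlinePageRank()
for edge in [("a", "b"), ("b", "c"), ("c", "a"), ("a", "c")]:
    static.add_edge(*edge)
static.converge()

max_diff = max(abs(online.rank[v] - static.rank[v]) for v in static.rank)
```

Both paths reach the same ranks to within the tolerance; the online path starts from the previous solution instead of from zero, which is where the saved computation would come from.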
[Figure: Convergence Rate. x-axis: Throughput (Edges/Second), 50K to 1M; y-axis: Convergence Rate (Normalized Execution Time), 0.0 to 1.2; series: naive, incremental]
• Algorithm: Page Rank
• Reset probability: 0.15
• Convergence Threshold: 0.001
• 1 Master + 3 workers
• Distributed Messaging System: Kafka
• Spark 1.1.0 + our graph streaming
[Figure: Communication Overhead. x-axis: Throughput (Edges/Second), 50K to 1M; y-axis: Normalized Messages Sent, 0% to 120%; series: naive, incremental]
Speed | (Really) Early Results for Online PageRank
Complexity
Complexity | Challenges
• Feature Engineering for Data Science
• Monolithic Analytics -> Complex Pipeline Analytics
Complexity | Directed Acyclic Graphs of Actions
• Common in Data Science
“feature engineering”
• Developed iteratively
• Becomes a new tool in the toolbox
[Diagram nodes: A, A, B, C]
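A feature-engineering pipeline expressed as a directed acyclic graph of actions can be sketched with the standard-library `graphlib` module (Python 3.9+). The step names, the toy functions, and the `run_pipeline` helper below are assumptions made for illustration; the only library call used is `TopologicalSorter.static_order()`, which yields each step after all of its upstream steps.

```python
from graphlib import TopologicalSorter


def run_pipeline(steps, deps, source):
    """Run each step after its upstream steps.

    steps: name -> function taking a list of upstream results.
    deps:  name -> set of upstream step names (empty set for roots).
    """
    results = {}
    for name in TopologicalSorter(deps).static_order():
        # Upstream results are passed in sorted name order; roots get the source.
        inputs = [results[d] for d in sorted(deps.get(name, ()))] or [source]
        results[name] = steps[name](inputs)
    return results


# Hypothetical pipeline: clean text, derive two features, join them.
steps = {
    "clean": lambda inputs: [s.strip().lower() for s in inputs[0]],
    "lengths": lambda inputs: [len(s) for s in inputs[0]],
    "vowels": lambda inputs: [sum(c in "aeiou" for c in s) for s in inputs[0]],
    "features": lambda inputs: list(zip(*inputs)),
}
deps = {
    "clean": set(),
    "lengths": {"clean"},
    "vowels": {"clean"},
    "features": {"lengths", "vowels"},
}

results = run_pipeline(steps, deps, ["  Spark ", "Hadoop"])
```

Once a pipeline like this has been developed iteratively, the whole graph can be reused as a single step in a larger one, which is the "new tool in the toolbox" point.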
Source: ISTC-Pervasive Computing
Discriminative structures come at multiple scales and varying deformations
Complexity | Hierarchical Matching Pursuit for Image Classification
• Feature learning
• Multiple layers to learn
• Multipath sparse
coding
Source: ISTC-Pervasive Computing
• Robustness
– Local deformations such as translation, rotation and scaling
– Lighting condition changes
– Viewpoint and pose changes
– Large intra-class variations
• Hierarchy
– Sparse data: The total number of possible image patches grows
exponentially with their sizes
– Shared structure: Large patches could share similar or even same small
patches
Complexity | Robust Hierarchical Representations?
Source: Bo, Ren, & Fox, “Multipath Sparse Coding Using Hierarchical Matching Pursuit,” IEEE CVPR 2013
Complexity | Object Recognition on Caltech 256 Benchmark
#Training Images   15     30     45     60
Local NBB [1]      33.5   40.1   -      -
LLC [2]            34.4   41.2   45.3   47.7
CRBM [3]           35.1   42.1   45.7   47.9
LASERC [4]         35.2   43.6   -      -
LP-beta [5]        -      45.8   -      -
Our Work           41.1   48.7   52.8   56.2
[1] S. McCann and D. Lowe, CVPR 12
[2] J. Wang et al., CVPR 10
[3] K. Sohn et al., ICCV 11
[4] K. Nguyen et al., ECCV 12
[5] P. Gehler and S. Nowozin, ICCV 09
Much better than the state of the art
(especially when given more data)
Source: Intel Collaborative Research Institute for Computational Intelligence (ICRI-CI)
Distributed Deep Learning Library
Spark
Hadoop
IA/MIC IA/MIC IA/MIC IA/MIC IA/MIC IA/MIC
Complexity | Deep Learning Library for Spark
• Open source
• Spark MLlib
contribution
• Optimized for IA
Complete POC in 2015
Intelligence
• Selecting the right data to process
• Selecting the right features to engineer
• Selecting the right algorithm to run
Intelligence | Challenges
Image Source: University of Nebraska-Lincoln
Intelligence | Ensemble Learning (Wisdom of the Crowd)
Trade computational power for automated experimentation
• Tackles the data and algorithm
selection problem
• Diversification methods vary
• Bagging
• Boosting
• Combining techniques vary
• Majority vote on label
• Bucket of models
• N bagged predictors -> N times the computation
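Bagging with a majority vote can be sketched on a toy one-dimensional problem. The base learner (a threshold "stump"), the data set, and the helper names are assumptions chosen to keep the example small; the point is the structure: N bootstrap samples, N independently trained predictors, and one vote.

```python
import random
from collections import Counter


def train_stump(sample):
    """Fit a threshold t so that the prediction is 1 when x >= t."""
    candidates = sorted({x for x, _ in sample})
    candidates.append(candidates[-1] + 1)  # option: predict 0 for every sample point
    best_t, best_err = None, None
    for t in candidates:
        err = sum(int(x >= t) != y for x, y in sample)
        if best_err is None or err < best_err:
            best_t, best_err = t, err
    return best_t


def bagging(data, n_models, rng):
    """Train one stump per bootstrap sample (drawn with replacement)."""
    models = []
    for _ in range(n_models):
        sample = [rng.choice(data) for _ in data]
        models.append(train_stump(sample))
    return models


def majority_vote(models, x):
    """Combine the predictors by majority vote on the label."""
    votes = Counter(int(x >= t) for t in models)
    return votes.most_common(1)[0][0]


# Hypothetical data: label is 1 when x >= 5.
data = [(x, int(x >= 5)) for x in range(10)]
models = bagging(data, n_models=25, rng=random.Random(0))

low_label = majority_vote(models, 0)
high_label = majority_vote(models, 9)
```

Training cost grows linearly with the number of predictors, which is the trade of computational power for automated experimentation described above.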
Intelligence | Beyond Ensemble Learning
• Downsides of ensemble learning include the number of:
• Tunable parameters
• Selection criteria
• Companies claim that non-parametric methods that require
no selection of criteria are in development
For now, it’s the Wisdom of the Crowd. Stay tuned!
Ease of Use
Ingest &
Clean
Engineer
Features Structure
Model
Train
Model Query &
Analyze
Learn
Visualize
Skills shortage at intersection of
systems engineering
and data analysis
Painful data ingestion
and preparation
Tools that are not designed
with loopbacks in mind
Pipeline state not
easy to manage, especially for collaboration
Composing
pipeline is DIY
Ease of Use | Data Science Workflow
Congratulations! You are a
data scientist!
Decomposing the “data scientist”
Source: 2013 Report from Accenture Institute for High Performance
Source: http://www.cloudera.com/content/cloudera-content/cloudera-docs/HadoopTutorial/CDH4/Hadoop-Tutorial/ht_wordcount1_source.html,
accessed on 9/30/2014
Ease of Use | Programming Languages
WordCount: The “Hello World” for Big Data In Java MapReduce
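For comparison with the Java MapReduce version referenced on this slide, the same map, shuffle, and reduce phases can be sketched in a few lines of plain Python. This runs in a single process and uses hypothetical helper names; it shows the shape of the computation rather than any Hadoop API.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in the line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Sum the counts collected for one word."""
    return key, sum(values)


lines = ["hello big data", "hello hadoop"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```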
Python
R
Dataflow
GUI
...
Datacenter / Cloud Network Client
“Data Science”
API
Connect
Manage
Secure
Analyze: distributed and parallel
Manage Secure
Connect
Analyze: local
Query
Big Data Java/Scala/C++ Computational Frameworks
Big Data Algorithms
Cluster Workload Mgmt
Cluster Storage
Machine Learning & Statistics
Data Wrangling Analyst Skills
The Other
Skills
Ease of Use | Making Big Data Familiar
• One consistent API for:
– ETL & feature engineering
– Including Spark and whatever comes next
– Graph construction, databases, analytics, query
– Same API for Titan, Neo4j, etc.
– Same API for Giraph, GraphX, GraphLab, etc.
– Machine learning & statistical analytics
• Programming language integration
• Extensibility at the core
Ease of Use | API Functionality
POST https://site.com/joe/graphs/29/transforms
{
  "operation": "ml.cgd",
  "arguments": [ {
    "edge_properties": ["rating"],
    "output_property_prefix": "cgd_",
    "vertex_type": "vertex_type",
    "edge_type": "splits",
    "max_supersteps": 20,
    "feature_dimension": 3,
    "convergence_threshold": 0,
    "cgd_lambda": 0.65,
    "learning_output_interval": 1,
    "bias_on": true,
    "num_iters": 3
  } ]
}
Ease of Use | Run CGD (REST Call)
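A client could assemble this call with the Python standard library as sketched below. The helper `build_transform_request` is hypothetical, the `Content-Type` header is an assumption (the slide does not show headers), and `site.com` is the placeholder host from the slide, so the request is only constructed here and never sent.

```python
import json
from urllib.request import Request


def build_transform_request(base_url, user, graph_id, operation, arguments):
    """Build (but do not send) a POST that queues a transform on a graph."""
    url = f"{base_url}/{user}/graphs/{graph_id}/transforms"
    body = json.dumps({"operation": operation, "arguments": [arguments]})
    return Request(
        url,
        data=body.encode("utf-8"),
        method="POST",
        headers={"Content-Type": "application/json"},  # assumed, not on the slide
    )


# Argument names and values are taken from the request shown on the slide.
request = build_transform_request(
    "https://site.com", "joe", 29, "ml.cgd",
    {
        "edge_properties": ["rating"],
        "output_property_prefix": "cgd_",
        "vertex_type": "vertex_type",
        "edge_type": "splits",
        "max_supersteps": 20,
        "feature_dimension": 3,
        "convergence_threshold": 0,
        "cgd_lambda": 0.65,
        "learning_output_interval": 1,
        "bias_on": True,
        "num_iters": 3,
    },
)
payload = json.loads(request.data)
```

Sending it would be a matter of passing `request` to `urllib.request.urlopen` against a real endpoint.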
201 Created
{
  "operation": "ml.cgd",
  "arguments": (same as request),
  "id": 2,
  "created": "2014-01-31 10:51:05.1234",
  "depends_on": [ {
    "link": { "method": "GET", "uri": "https://site.com/v1/graphs/29/transforms/1" },
    "type": "graphbuilder",
    "started": "2014-01-31 10:51:02.8899",
    "eta": null,
    "status": "pending"
  } ],
  "links": [
    { "rel": "self", "method": "GET", "uri": "https://site.com/v1/graphs/29/transforms/2" },
    { "rel": "intel:idpat-progress", "method": "GET", "uri": "https://site.com/v1/graphs/29/transforms/2/progress" },
    { "rel": "intel:idpat-cancel", "method": "DELETE", "uri": "https://site.com/user/joe/graphs/29/transforms/2" }
  ]
}
Ease of Use | Run CGD (REST Response)
FILESYSTEMS AND NOSQL STORAGE
HW PLATFORM
APACHE HADOOP APACHE SPARK
DATA WRANGLING
MACHINE LEARNING AND STATISTICS
Graphical Algorithms
Classical Algorithms
Graph Construction Tools
Useful String Manipulation
Useful Math Operators
“DATA SCIENCE” API
Intel Analytics Toolkit Ease of Use | Delivering It
Unified UI’s
across the workflow
Easier feature & model creation
End-to-end graph
pipeline
Fully scalable throughout
Multiple data
primitives
Optimized for IA
Cloud & On-Prem
Python
Libraries
3rd Party
GUIs/SDKs
Viz
Tools
Future
Libraries BI
Connectors
Query Interfaces
...
Algorithm                                      Category                  Applications/Use Cases
Loopy Belief Propagation (LBP)                 Structured Prediction     Personalized recs, image de-noising
Label Propagation                              Structured Prediction     Personalized recommendations
Alternating Least Squares (ALS)                Collaborative Filtering   Recommenders
Conjugate Gradient Descent (CGD)               Collaborative Filtering   Recommenders
Connected Components                           Graph Analytics           Network manipulation, image analysis
Latent Dirichlet Allocation (LDA)              Topic Modeling            Document Clustering
Structure Attribute                            Clustering                Network analysis, consumer seg
K-Truss                                        Clustering                Social network analysis
KNN*                                           Clustering                Recommenders
Logistic Regression*                           Classification            Fraud detection
Random Forest*                                 Classification            Fraud detection, consumer seg
Generalized Linear Model (Binomial, Poisson)   Non-linear Curve Fitting  Forecasting, pricing, market mix models
Association Rule Mining                        Data Mining               Market basket analysis, recommenders
Frequent Pattern Mining*                       Data Mining               Pattern Recognition
Approach (row-group label on the slide): Graph
Ease of Use | A Full Spectrum of Analytics
Real Time Database
BQL – BigDAWG Query Language &
Compiler
Analytics Libraries
Hardware Platforms
Applications, Visualization, Languages
“Narrow waist”
provides portability
Historical / Analytics Databases
Spill Stream
Ease of Use | Future Vision – BigDAWG
Ease of Use | Future Vision – BigDAWG
Real Time DBMSs
BQL – BigDAWG Query Language &
Compiler
Visualization & Presentation, e.g., ScalaR, imMens, TweetMap, Prefetching
Languages, e.g, Julia, R, MLbase, GraphLab
SciDB
Analytics, e.g., ScaLAPACK, ML algos, plsh, other analytics packages
TupleWare
Hardware Platforms, e.g., NVM simulator, Xeon Phi, Xeon
Applications, e.g., medical data, astronomy, Twitter, urban sensing, IoT
TileDB S-Store
“Narrow waist”
provides portability
MyriaX
Historical / Analytics DBMSs Spill
Stream
Ease of Use | BigDAWG Deliverables ‘15-’16
• Complete prototype “big data” stack and reference
implementation
• Battle-tested on multiple use cases
• Standard federation language (BQL)
• Next-generation interface for analytics
• Next-generation stream processing system
Stay Tuned!
http://istc-bigdata.org/
1. Big Data Visualization, especially graph*
2. Big Data DB that supports relational and graph equally*
3. A better workflow manager (Like Oozie for Hadoop, Spark, etc.)*
4. UI partners (R, Julia, etc.)
5. Better portable machine learning models (like PMML) that also capture feature engineering (not just algos)
6. Cluster monitoring (GUI, etc.) that works across many big data tools
7. Distributed debugger (for Spark clusters, etc.) for profiling and troubleshooting
8. Cluster auto-configuration tools
• * - Open source STRONGLY preferred
Technology Wish List
• Intel Analytics Toolkit Beta program (now-January ’15) Have a POC, particularly graph? [email protected]
• GRADES 2015, Melbourne Australia, May 31, 2015 Papers due March 15. HTTP://EVENT.CWI.NL/GRADES2015/
Call to Action