Date post: | 02-Apr-2018 |
Category: |
Documents |
Upload: | muhammad-rizwan |
7/27/2019 Analytics DistributedBigData Shankar
1/27
Arjun Shankar, Ph.D.
Thanks also to: James Horey and Arvind Ramanathan
ORNL Computational Sciences and Engineering Division
SOS, Jekyll Island, Georgia
March 2013
Analytics and Fusion for Distributed Big Data
Managed by UT-Battelle for the U.S. Department of Energy
Outline
Context
Examples of Distributed Big Data Analytics
Emerging needs as Big Compute and Big Data analytics converge
Context
Volume and rates
Twitter Updates 400 M/d
Facebook Likes/Comments: 2.7 B/d
Shared Contents: 30 B/m
World Emails 419 B/d
YouTube Storage: 76 PB/yr
Traffic: 16.2 EB/yr
World Social Media 1.8 ZB (x2 every 2 years)
Volume and rates
2.5 m Telescope 200 GB/d
Ion Mobility Spectroscopy 10 TB/d
3D X-ray Diffraction Microscopy 24 TB/d
Boeing 737 cross-country flight 240 TB
Personal Location Data 1 PB/yr
Astrophysics Data 10 PB (2014)
Square Kilometer Array 480 PB/d
Big Data = Volume, Variety, Velocity
Adam Jacobs, "The Pathologies of Big Data," 2009:
"Data is typically acquired in a transactional fashion... The trouble comes when we want to take that accumulated data, collected over months or years, and learn something from it."
Learning from data is a major problem
How do we harness our infrastructure to do data management and data analytics?
Data Systems and Analytics
Pre-History (1900-1970s)
Databases (1970)
Networking and Analytics (1980): Machine learning, Large-scale cluster computing
Infrastructure (2000): Large scale commodity processing (Google, Hadoop), NoSQL systems, Virtualization
Web 3.0 (2005): Mobile Apps, social-media; Semantics, linked-data (IBM Watson)
Big Computing Developments
Prehistory: ENIAC, ILLIAC, CDC, Cray (1930-1970)
Vector, Pipelined, Compilers, Interconnects, FLOPS
Shared/Distributed Memory Hierarchies, Storage
Multicore, Heterogeneity
Big Data Programming Models
Flexible Data Access
Examples of Distributed Big Data Analytics
Large Scale Infrastructures for Analytics
Centralized (aka Big Compute) HPC
Centralized Big Compute/Data platforms
Distributed Big Compute/Data platforms
Wide-Area Distributed Big Data platforms
Big Data Analytics
1. Centralized (aka Big Compute) HPC
First Principles, Floating Point Solvers, Prospective Data Generation
2. Centralized Big Compute/Data platforms
Discrete/Combinatorial, Retrieval, Indexing Lookup, Query, Graph-lookups, Data Ingest (ETL)
3. Distributed Big Compute/Data platforms
Virtualization and Utility Computing, Machine Learning Stack
4. Wide-Area Distributed Big Data platforms
Distributed Sensing, Correlate, and Actuate Real-time, HPC Resource Use
(#2) eBay Big Data Analytics Example
Killer app: mining web logs
Slides from: Tom Fastener, eBay Principal Architect, @ High-Performance Transaction Systems, Asilomar, 2011
c. 2011
(#3) Healthcare Overpayments Analysis (@ORNL)
Inaccurate payments (preliminary estimates):
$10M
$7M
Deceased beneficiaries
Omitted from consolidated billing
Workflow:
Data is loaded into Hadoop filesystem (HDFS)
User submits SQL to Hive
Hive transforms SQL to MapReduce
MapReduce is executed in Hadoop cluster
Diagram labels: NAS; SQL; Hive; MapReduce job (x2); Hadoop infrastructure (compute nodes, name node and jobtracker); Fraud patterns
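The map, shuffle, and reduce phases that a SQL aggregation is turned into can be sketched in plain Python. This is only an illustration under stated assumptions: the claim records, the `DEATHS` lookup, and every function name are hypothetical, and the code mimics the phases in memory rather than reproducing what Hive or Hadoop actually generate.

```python
from collections import defaultdict
from datetime import date

# Hypothetical claim records: (beneficiary_id, service_date, amount_paid).
CLAIMS = [
    ("B1", date(2012, 1, 10), 120.0),
    ("B1", date(2012, 6, 2), 300.0),
    ("B2", date(2012, 3, 15), 75.0),
    ("B3", date(2012, 8, 1), 50.0),
]

# Hypothetical lookup of recorded dates of death.
DEATHS = {"B1": date(2012, 3, 1), "B3": date(2012, 7, 4)}


def map_claim(claim):
    """Map phase: emit (beneficiary_id, amount) for claims dated after death."""
    beneficiary, service_date, amount = claim
    died = DEATHS.get(beneficiary)
    if died is not None and service_date > died:
        yield beneficiary, amount


def shuffle(pairs):
    """Shuffle phase: group emitted values by key, as a framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_amounts(key, values):
    """Reduce phase: total the suspect payments for one beneficiary."""
    return key, sum(values)


def run_job(claims):
    """Chain the three phases over an in-memory list of claims."""
    emitted = (pair for claim in claims for pair in map_claim(claim))
    return dict(reduce_amounts(k, v) for k, v in shuffle(emitted).items())


suspect_totals = run_job(CLAIMS)
print(suspect_totals)
```

On a real cluster the map and reduce functions would run on separate compute nodes over HDFS blocks; the single-process version here only shows how a filter plus a grouped sum decomposes into the two phases.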
(#4) Real-Time Distributed Analytics
Centralized data analysis across infrastructure and data
modalities is prohibitive (and often too late)!
Distributed Pattern Storage and Correlation
If (Sensor-noticed-X && Report-said-Y && Within-Last-Day)
Notify!
Message flow up a hierarchy
Middleware Optimizes In-Network Storage
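The correlation rule on this slide (sensor noticed X, report said Y, both within the last day) can be written as a small runnable check. This is a minimal sketch: the `Event` type, the `"sensor"`/`"report"` labels, and `should_notify` are names introduced here for illustration, and it evaluates the rule at a single node rather than across a message hierarchy.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Event:
    source: str     # "sensor" or "report" (hypothetical labels)
    pattern: str    # what was noticed or said, e.g. "X" or "Y"
    when: datetime  # time the event was observed


def should_notify(events, now, window=timedelta(days=1)):
    """Return True if a sensor noticed X and a report said Y within the window."""
    recent = [e for e in events if timedelta(0) <= now - e.when <= window]
    sensor_x = any(e.source == "sensor" and e.pattern == "X" for e in recent)
    report_y = any(e.source == "report" and e.pattern == "Y" for e in recent)
    return sensor_x and report_y


now = datetime(2013, 3, 1, 12, 0)
fresh = [
    Event("sensor", "X", now - timedelta(hours=3)),
    Event("report", "Y", now - timedelta(hours=20)),
]
stale = [
    Event("sensor", "X", now - timedelta(hours=3)),
    Event("report", "Y", now - timedelta(days=2)),  # outside the one-day window
]
print(should_notify(fresh, now), should_notify(stale, now))
```

In a distributed setting, each level of the hierarchy would hold only the partial matches it has seen, so where the two conditions are finally joined becomes a placement decision.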
Processing in the Network
Application specific open questions
Value of Information
What to keep and what to drop (or send later)?
When to take notice?
Infrastructure
Where in the hierarchy to compute?
How to set up the infrastructure to enable the forward joins?
Big Data Analytics on Big Compute - Emerging Applications
1. Graph analytics
2. Hypothesis driven to data driven analytics
3. Compute-analyze-compute paradigm
(1) Data Analytics Beyond Data-Parallelism
Data-Parallel (Map Reduce):
Cross Validation
Feature Extraction
Computing Sufficient Statistics
Graph-Parallel:
Graphical Models: Gibbs Sampling, Belief Propagation, Variational Opt.
Semi-Supervised Learning: Label Propagation, CoEM
Graph Analysis: PageRank, Triangle Counting
Collaborative Filtering: Tensor Factorization
Slide courtesy: Prof. Carlos Guestrin's GraphLab Workshop, July 2012
PageRank
What's the rank of this user?
Rank?
Depends on rank of who follows her
Depends on rank of who follows them
Loops in graph → Must iterate!
Slide courtesy: Prof. Carlos Guestrin's GraphLab Workshop, July 2012
PageRank Iteration
Iterate until convergence:
$$R[i] = \alpha + (1 - \alpha) \sum_{(j,i) \in E} w_{ji} \, R[j]$$
$\alpha$ is the random reset probability
$w_{ji}$ is the prob. transitioning (similarity) from j to i
"My rank is weighted average of my friends' ranks"
Slide courtesy: Prof. Carlos Guestrin's GraphLab Workshop, July 2012
Properties of Graph Parallel Algorithms
Dependency Graph
Iterative Computation (My Rank, Friends' Rank)
Local Updates
Slide courtesy: Prof. Carlos Guestrin's GraphLab Workshop, July 2012
(2) Scaling Analytics: Let a Million Hypotheses Bloom ...
Classic scientific discovery scenario:
Choose relevant D dimensions, evaluate on N samples
Modern data-driven analysis:
Measure everything, hope to discover relevant D using N samples
The progression to ultrascale:
Characterized by interdependency (not necessarily redundancy); many inter-related subsystems
[Figure: error versus dimensionality D and number of samples N (N fixed), contrasting regimes of D relative to N. Legend: Error; D: Dimensionality; N: #Samples.]
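One hazard of "measure everything" with a fixed number of samples can be illustrated with a small simulation: when every feature is pure noise, the strongest correlation found with a target can only grow as more features are searched. This is a hedged sketch, not an analysis from the talk; the sample size, the dimension counts, and the function names are all assumptions made for the example.

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def max_spurious_correlation(n_samples, dims, seed=0):
    """Best |correlation| with a random target among the first D noise features.

    Features are generated one at a time from the same seeded stream, so the
    feature set for a larger D always contains the set for a smaller D.
    """
    rng = random.Random(seed)
    target = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    best = 0.0
    result = {}
    for d in range(1, max(dims) + 1):
        feature = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
        best = max(best, abs(pearson(feature, target)))
        if d in dims:
            result[d] = best
    return result


# N is held fixed at 20 samples while D grows.
found = max_spurious_correlation(n_samples=20, dims=(5, 50, 500))
for d, r in sorted(found.items()):
    print(f"D={d:4d}  max |r| = {r:.3f}")
```

Because none of the features is related to the target, every correlation reported here is spurious, which is why data-driven discovery with large D needs either more samples or explicit correction for the number of hypotheses tested.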
(3) Big Data Analytics Integrated with Big Compute: In Situ Machine Learning
Biological Data
High resolution all-atom simulations (Jaguar/Titan)
Infer biological function?
> 1 petabyte
Save state (Big Data)
Analyze state (Big Data Analytics)
Run MD (Big Compute)
Save?
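One way to read the run / analyze / save loop is that analysis happens while the simulation runs, and only states judged unusual are written out. The sketch below illustrates that idea with a toy stand-in for an MD step and a running-statistics outlier test. Everything in it (`simulate_step`, the threshold, the warm-up length) is a hypothetical choice for illustration, not the method used at ORNL.

```python
import math
import random


class RunningStats:
    """Welford's online algorithm for mean and sample standard deviation."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def std(self):
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0


def simulate_step(state, rng):
    """Toy stand-in for one MD step: a mean-reverting scalar observable."""
    return 0.9 * state + rng.gauss(0.0, 1.0)


def run_in_situ(n_steps, threshold=2.5, warmup=50, seed=0):
    """Run the toy simulation and keep only states far from the running mean."""
    rng = random.Random(seed)
    stats = RunningStats()
    state = 0.0
    saved = []
    for step in range(n_steps):
        state = simulate_step(state, rng)  # run (Big Compute)
        stats.update(state)                # analyze in situ, no full trace kept
        unusual = abs(state - stats.mean) > threshold * stats.std
        if step >= warmup and unusual:     # save? only when the state stands out
            saved.append((step, state))
    return saved, stats


saved, stats = run_in_situ(n_steps=2000)
print(f"kept {len(saved)} of {stats.n} states")
```

The memory footprint of the analysis is constant in the number of steps, which is the property that matters when the full trajectory would exceed a petabyte.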
Big-Data Analytic Needs from Big-Compute
Better programming abstractions and environments for Machine learning suites
Automatic memory hierarchies management libraries
Automatic storage management libraries
Simplify communication and synchronization abstractions
DARPA is asking for programming techniques to simplify big data analytics (19th March 2013)
Thank you!
Discussion, Questions?