Analytics DistributedBigData Shankar

  • 7/27/2019 Analytics DistributedBigData Shankar

    Arjun Shankar, Ph.D.

    Thanks also to: James Horey and Arvind Ramanathan

    ORNL Computational Sciences and Engineering Division

    SOS, Jekyll Island, Georgia

    March 2013

    Analytics and Fusion for Distributed Big Data

  • 7/27/2019 Analytics DistributedBigData Shankar

    Managed by UT-Battelle for the U.S. Department of Energy

    Outline

    Context

    Examples of Distributed Big Data Analytics

    Emerging needs as Big Compute and Big Data analytics converge

  • 7/27/2019 Analytics DistributedBigData Shankar

    Context

  • 7/27/2019 Analytics DistributedBigData Shankar

    Volume and rates

    Twitter Updates 400 M/d

    Facebook Likes/Comments: 2.7 B/d

    Shared Contents: 30 B/m

    World Emails 419 B/d

    YouTube Storage: 76 PB/yr

    Traffic: 16.2 EB/yr

    World Social Media 1.8 ZB (x2 every 2 years)

  • 7/27/2019 Analytics DistributedBigData Shankar

    Volume and rates

    2.5 m Telescope 200 GB/d

    Ion Mobility Spectroscopy 10 TB/d

    3D X-ray Diffraction Microscopy 24 TB/d

    Boeing 737 cross-country flight 240 TB

    Personal Location Data 1 PB/yr

    Astrophysics Data 10 PB (2014)

    Square Kilometer Array 480 PB/d
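
The per-day figures above are easier to compare once converted to a sustained rate. The sketch below does that arithmetic for the four sources quoted per day; it assumes decimal (SI) prefixes, which the slide does not specify, and the helper name `bytes_per_second` is introduced here for illustration only.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Assumption: decimal (SI) prefixes. The slide does not say which convention it uses.
UNIT_BYTES = {"GB": 10**9, "TB": 10**12, "PB": 10**15}

# Daily volumes quoted on the slide, as (value, unit).
DAILY_RATES = {
    "2.5 m telescope": (200, "GB"),
    "Ion mobility spectroscopy": (10, "TB"),
    "3D X-ray diffraction microscopy": (24, "TB"),
    "Square Kilometer Array": (480, "PB"),
}


def bytes_per_second(value, unit):
    """Convert a per-day data volume into a sustained bytes-per-second rate."""
    return value * UNIT_BYTES[unit] / SECONDS_PER_DAY


for source, (value, unit) in DAILY_RATES.items():
    rate = bytes_per_second(value, unit)
    print(f"{source}: {rate / 10**9:,.3f} GB/s sustained")
```

Under these assumptions the Square Kilometer Array figure works out to roughly 5.56 TB/s, about six orders of magnitude above the telescope's rate.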

  • 7/27/2019 Analytics DistributedBigData Shankar

    Big Data = Volume, Variety, Velocity

  • 7/27/2019 Analytics DistributedBigData Shankar

    Adam Jacobs, "The Pathologies of Big Data," 2009:

    "Data is typically acquired in a transactional fashion... The trouble comes when we want to take that accumulated data, collected over months or years, and learn something from it."

    Learning from data is a major problem

    How do we harness our infrastructure to do data management and data analytics?

  • 7/27/2019 Analytics DistributedBigData Shankar

    Data Systems and Analytics

    Pre-History (1900-1970s)

    Databases (1970)

    Networking and Analytics (1980): machine learning; large-scale cluster computing

    Infrastructure (2000): large-scale commodity processing (Google, Hadoop); NoSQL systems; virtualization

    Web 3.0 (2005): mobile; semantics, linked data (IBM Watson); apps, social media

  • 7/27/2019 Analytics DistributedBigData Shankar

    Big Computing Developments

    Prehistory (1930-1970): ENIAC, ILLIAC, CDC, Cray

    Vector, Pipelined, Compilers, Interconnects, FLOPS

    Shared/Distributed Memory Hierarchies, Storage

    Multicore, Heterogeneity

    Big Data Programming Models

    Flexible Data Access

  • 7/27/2019 Analytics DistributedBigData Shankar

    Examples of Distributed Big Data Analytics

  • 7/27/2019 Analytics DistributedBigData Shankar

    Large Scale Infrastructures for Analytics

    Centralized (aka Big Compute) HPC

    Centralized Big Compute/Data platforms

    Distributed Big Compute/Data platforms

    Wide-Area Distributed Big Data platforms

  • 7/27/2019 Analytics DistributedBigData Shankar

    Big Data Analytics

    1. Centralized (aka Big Compute) HPC

    First Principles, Floating Point Solvers, Prospective Data Generation

    2. Centralized Big Compute/Data platforms

    Discrete/Combinatorial, Retrieval, Indexing Lookup, Query, Graph-lookups, Data Ingest (ETL)

    3. Distributed Big Compute/Data platforms

    Virtualization and Utility Computing, Machine Learning Stack

    4. Wide-Area Distributed Big Data platforms

    Distributed Sensing, Correlate, and Actuate Real-time, HPC Resource Use

  • 7/27/2019 Analytics DistributedBigData Shankar

    (#2) eBay Big Data Analytics Example

    Killer app: mining web logs

    Slides from: Tom Fastner, eBay Principal Architect, at High-Performance Transaction Systems, Asilomar, c. 2011

  • 7/27/2019 Analytics DistributedBigData Shankar

    (#3) Healthcare Overpayments Analysis (@ORNL)

    $10M

    $7M

    Deceased beneficiaries

    Omitted from consolidated billing

    Inaccurate payments (preliminary estimates)

    User submits SQL to Hive

    Hive transforms SQL to MapReduce

    MapReduce is executed in Hadoop cluster

    SQL, Hive, MapReduce jobs

    NAS

    Data is loaded into Hadoop filesystem (HDFS)

    Compute nodes

    Name node and jobtracker

    Hadoop infrastructure

    Fraud patterns
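
The SQL-to-MapReduce step on this slide can be illustrated without a cluster. The following is a minimal pure-Python sketch of how an aggregate query (roughly: sum the claim amounts per provider where the service date falls after the beneficiary's death) decomposes into map, shuffle, and reduce phases. The records, field names, and query are hypothetical and invented for illustration; this is not the ORNL analysis, and Hive's real query plans are more elaborate.

```python
from collections import defaultdict
from datetime import date

# Hypothetical claim records; every field name and value here is invented.
CLAIMS = [
    {"provider": "P1", "amount": 120.0,
     "service_date": date(2012, 3, 1), "death_date": date(2012, 1, 15)},
    {"provider": "P1", "amount": 80.0,
     "service_date": date(2012, 2, 1), "death_date": None},
    {"provider": "P2", "amount": 300.0,
     "service_date": date(2012, 5, 9), "death_date": date(2011, 12, 31)},
    {"provider": "P1", "amount": 50.0,
     "service_date": date(2012, 4, 2), "death_date": date(2012, 1, 15)},
]


def map_phase(record):
    """Emit (provider, amount) for claims dated after death (the WHERE clause)."""
    death = record["death_date"]
    if death is not None and record["service_date"] > death:
        yield record["provider"], record["amount"]


def shuffle(pairs):
    """Group mapped values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Sum the amounts for one provider (the SUM ... GROUP BY part)."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle, and reduce in sequence on an in-memory list."""
    mapped = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


totals = run_job(CLAIMS)
print(totals)
```

On a real cluster the map and reduce functions run in parallel on the compute nodes over blocks stored in HDFS; the structure of the computation is the same.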

  • 7/27/2019 Analytics DistributedBigData Shankar

    (#4) Real-Time Distributed Analytics

    Centralized data analysis across infrastructure and data modalities is prohibitive (and often too late)!

  • 7/27/2019 Analytics DistributedBigData Shankar

    Distributed Pattern Storage and Correlation

    If (Sensor-noticed-X && Report-said-Y && Within-Last-Day)

    Notify!

    Message flow up a hierarchy

    Middleware optimizes in-network storage
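
The correlation rule on this slide can be written as a small predicate. The sketch below is one possible reading, assuming "Within-Last-Day" means that both the sensor observation and the report fall inside the most recent 24 hours; the `Event` type and the `should_notify` function are hypothetical names, not part of any middleware described in the talk.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Event:
    source: str          # "sensor" or "report" (hypothetical source labels)
    pattern: str         # e.g. "X" or "Y"
    timestamp: datetime


def should_notify(events, now, window=timedelta(days=1)):
    """Return True if a sensor saw X and a report said Y, both within the window."""
    recent = [e for e in events if timedelta(0) <= now - e.timestamp <= window]
    sensor_x = any(e.source == "sensor" and e.pattern == "X" for e in recent)
    report_y = any(e.source == "report" and e.pattern == "Y" for e in recent)
    return sensor_x and report_y


now = datetime(2013, 3, 1, 12, 0)
events = [
    Event("sensor", "X", now - timedelta(hours=3)),
    Event("report", "Y", now - timedelta(hours=20)),
    Event("report", "Y", now - timedelta(days=3)),
]

if should_notify(events, now):
    print("Notify!")
```

In a hierarchical deployment, a node would evaluate a predicate like this over the events stored at or below it and forward only matches (or the partial matches it cannot resolve locally) up the hierarchy.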

  • 7/27/2019 Analytics DistributedBigData Shankar

    Processing in the Network

    Application specific open questions

    Value of Information

    What to keep and what to drop (or send later)?

    When to take notice?

    Infrastructure

    Where in the hierarchy to compute?

    How to set up the infrastructure to enable the forward joins?

  • 7/27/2019 Analytics DistributedBigData Shankar

    Big Data Analytics on Big Compute - Emerging Applications

    1. Graph analytics

    2. Hypothesis driven to data driven analytics

    3. Compute-analyze-compute paradigm

  • 7/27/2019 Analytics DistributedBigData Shankar

    (1) Data Analytics Beyond Data-Parallelism

    Data-Parallel (Map Reduce): Feature Extraction, Cross Validation, Computing Sufficient Statistics

    Graph-Parallel:

    Graphical Models: Gibbs Sampling, Belief Propagation, Variational Opt.

    Semi-Supervised Learning: Label Propagation, CoEM

    Graph Analysis: PageRank, Triangle Counting

    Collaborative Filtering: Tensor Factorization

    Slide courtesy: Prof. Carlos Guestrin's GraphLab Workshop, July 2012

  • 7/27/2019 Analytics DistributedBigData Shankar

    PageRank

    What's the rank of this user?

    Rank?

    Depends on rank of who follows her

    Depends on rank of who follows them

    Loops in graph: must iterate!

    Slide courtesy: Prof. Carlos Guestrin's GraphLab Workshop, July 2012

  • 7/27/2019 Analytics DistributedBigData Shankar

    PageRank Iteration

    Iterate until convergence:

    $$R[i] = \alpha + (1 - \alpha) \sum_{(j,i) \in E} w_{ji} R[j]$$

    $\alpha$ is the random reset probability

    $w_{ji}$ is the prob. of transitioning (similarity) from j to i

    My rank is a weighted average of my friends' ranks

    Slide courtesy: Prof. Carlos Guestrin's GraphLab Workshop, July 2012
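
The iteration can be sketched in a few lines. The code below assumes the update has the form R[i] = alpha + (1 - alpha) * sum over in-neighbors j of w_ji * R[j], with alpha the reset probability and w_ji the transition weight from j to i. The value alpha = 0.15, the convergence tolerance, and the three-vertex example graph are assumptions chosen for illustration; this is a plain synchronous loop, not GraphLab's implementation.

```python
def pagerank(in_edges, alpha=0.15, tol=1e-10, max_iters=1000):
    """Iterate R[i] = alpha + (1 - alpha) * sum_j w_ji * R[j] until ranks settle.

    in_edges maps each vertex i to a list of (j, w_ji) pairs for edges j -> i.
    """
    ranks = {v: 1.0 for v in in_edges}
    for _ in range(max_iters):
        # Each vertex recomputes its rank from its in-neighbors' current ranks.
        new_ranks = {
            i: alpha + (1 - alpha) * sum(w * ranks[j] for j, w in nbrs)
            for i, nbrs in in_edges.items()
        }
        delta = max(abs(new_ranks[v] - ranks[v]) for v in ranks)
        ranks = new_ranks
        if delta < tol:
            break
    return ranks


# Hypothetical graph: a -> b, a -> c, b -> c, c -> a,
# with w_ji = 1 / out_degree(j).
in_edges = {
    "a": [("c", 1.0)],
    "b": [("a", 0.5)],
    "c": [("a", 0.5), ("b", 1.0)],
}

ranks = pagerank(in_edges)
for vertex, rank in sorted(ranks.items()):
    print(f"R[{vertex}] = {rank:.4f}")
```

Because each vertex reads only its neighbors' values, the update is local, which is the property graph-parallel frameworks exploit. In this unnormalized form the ranks sum to the number of vertices rather than to 1.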

  • 7/27/2019 Analytics DistributedBigData Shankar

    Properties of Graph Parallel Algorithms

    Dependency Graph

    Iterative Computation (My Rank, Friends' Rank)

    Local Updates

    Slide courtesy: Prof. Carlos Guestrin's GraphLab Workshop, July 2012

  • 7/27/2019 Analytics DistributedBigData Shankar

    (2) Scaling Analytics: Let a Million Hypotheses Bloom ...

    Classic scientific discovery scenario:

    Choose relevant D dimensions, evaluate on N samples

    Modern data-driven analysis:

    Measure everything, hope to discover relevant D using N samples

    The progression to ultrascale:

    Characterized by interdependency (not necessarily redundancy); many inter-related subsystems

    [Figure: regimes of dimensionality D versus number of samples N (one with N fixed), with interdependent subsystems drawn as nodes labeled i1, i2, i3, ... Legend: Error; D: dimensionality; N: number of samples.]
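
One risk of the "measure everything" scenario can be shown numerically: with N fixed and D growing, some feature will correlate with the target by chance alone. The sketch below is an illustration under stated assumptions, not an analysis from the talk. It draws a pure-noise target and pure-noise features and reports the strongest absolute Pearson correlation found among the first D features; the sample size, dimension list, and seed are arbitrary choices.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def strongest_spurious_correlation(n_samples, dims, seed=0):
    """Max |correlation| between a noise target and the first d noise features."""
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_samples)]
    best, generated, results = 0.0, 0, {}
    for d in sorted(dims):
        # Extend the same feature set, so each d reuses the earlier features.
        while generated < d:
            feature = [rng.gauss(0, 1) for _ in range(n_samples)]
            best = max(best, abs(pearson(feature, target)))
            generated += 1
        results[d] = best
    return results


results = strongest_spurious_correlation(n_samples=20, dims=[1, 10, 100, 1000])
for d, r in results.items():
    print(f"D={d:>4}, N=20: strongest |correlation| with a noise target = {r:.2f}")
```

Since every feature here is noise, any strong correlation is spurious, and the maximum can only grow as D increases. This is why screening many hypotheses against a fixed N calls for corrections or held-out validation.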

  • 7/27/2019 Analytics DistributedBigData Shankar

    (3) Big Data Analytics Integrated with Big Compute: In Situ Machine Learning

    Biological Data

    High resolution all-atom simulations (Jaguar/Titan)

    Infer biological function?

    > 1 petabyte

    Save state (Big Data)

    Analyze state (Big Data Analytics)

    Run MD (Big Compute)

    Save?
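
The compute-analyze-compute loop on this slide can be sketched as follows: run a simulation step, analyze the new state in place with a streaming statistic, and save only the states the analysis flags. Everything concrete here is an assumption for illustration: the random-walk stand-in for an MD step, the jump probability, the threshold, and the use of Welford's online variance as the "analytics". A real in situ pipeline would use domain-specific analysis.

```python
import math
import random


class RunningStats:
    """Welford's online mean and variance, so no trajectory history is stored."""

    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def std(self):
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0


def simulate_step(state, rng):
    """Hypothetical stand-in for one MD step: a random walk with rare large jumps."""
    jump = 8.0 if rng.random() < 0.02 else 0.0
    return state + rng.gauss(0, 1) + jump


def run_in_situ(n_steps=2000, threshold=4.0, warmup=50, seed=1):
    """Run the loop and return only the (step, state) pairs judged worth saving."""
    rng = random.Random(seed)
    stats, state, saved = RunningStats(), 0.0, []
    for step in range(n_steps):
        new_state = simulate_step(state, rng)           # Run MD (Big Compute)
        change = new_state - state
        # Analyze state (Big Data Analytics): is this change unusual so far?
        unusual = (stats.n >= warmup
                   and abs(change - stats.mean) > threshold * stats.std())
        if unusual:
            saved.append((step, new_state))             # Save state, selectively
        stats.update(change)
        state = new_state
    return saved


saved = run_in_situ()
print(f"saved {len(saved)} of 2000 states")
```

The design point is that the decision to save is made while the state is still in memory, so the petabyte-scale trajectory never has to be written out in full.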

  • 7/27/2019 Analytics DistributedBigData Shankar

    Big-Data Analytic Needs from Big-Compute

    Better programming abstractions and environments for machine learning suites

    Automatic memory hierarchy management libraries

    Automatic storage management libraries

    Simplify communication and synchronization abstractions

    DARPA is asking for programming techniques to simplify big data analytics (19th March 2013)

  • 7/27/2019 Analytics DistributedBigData Shankar

    Thank you! Discussion, Questions?

