Modern Big Data Analytics Tools: An Overview

  • Modern Big Data Analytics Tools: An Overview

    7/24/2019 1/43 S. Saranya / IT6006 / Modern Big Data Analytics Tools

  • Hadoop Midwife :-)

  • Once upon a time, in a land far, far away…

  • Fast forward 15 years...

  • What Happened?

  • In a blink of an eye…

    [Diagram: the Hadoop ecosystem. Legend: ASF Projects, FLOSS Projects, Pivotal Products]

    Components shown: HDFS, YARN, MapReduce, Tez, Pig, Hive, HBase, Phoenix, Sqoop, Flume, Zookeeper and Oozie (coordination and workflow management), Hue (Hadoop UI), SolrCloud, Crunch, Mahout, Giraph, Spark (Shark, Streaming, MLlib, GraphX), Impala, HAWQ, MADlib, PivotalR, Spring XD, GemFire XD, Hamster, Command Center

  • Google Papers

  • Yahoo! Search

  • W-1-W

    • WebMap: Graph processing for WWW
    • Dreadnaught: Infrastructure for WebMap
    • W-1-W: WebMap In One Week
    • Juggernaut: Infrastructure for W-1-W
    • JFS, JMR, Condor: Abandoned for Hadoop

  • Lucene, Nutch

  • MapReduce is the Revenge of System Programmers on the Database community.

    - Anonymous at XLDB, Stanford, 2010

  • O’Reilly Books 2013

  • Who Uses Hadoop? (From Hadoop Summit 2010)

  • Big Data Landscape - July 2012

    http://www.forbes.com/sites/davefeinleib/2012/06/19/the-big-data-landscape/

  • Hadoop Maturity

    • ETL Offload: Accommodate massive data growth with existing EDW investments
    • Data Lakes: Unify Unstructured and Structured Data Access
    • Big Data Apps: Build analytic-led applications impacting top line revenue
    • Data-Driven Enterprise: App Dev and Operational Management on HDFS Data Architecture

  • 70% of data generated by customers

    80% of data being stored

    3% being prepared for analysis

    0.5% being analyzed

  • Storage Options

    • HDFS, MapR, Quantcast QFS
    • EMC Isilon, NetApp, IBM GPFS, PanFS, PVFS, Lustre
    • Amazon S3, EMC Atmos, OpenStack Swift
    • GlusterFS, Ceph
    • EMC ViPR

  • SQL-on-Hadoop

    • Pivotal HAWQ
    • Cloudera Impala, Facebook Presto, Apache Drill, Cascading Lingual, Optiq, Hortonworks Stinger
    • Hadapt, Jethrodata, IBM BigSQL, Microsoft PolyBase
    • More to come...

  • HAWQ & HDFS

    • Master Servers: Planning & dispatch
    • Network Interconnect
    • Segment Servers: Query execution
    • Storage: HDFS, HBase …

  • [Diagram: HAWQ on an HDFS cluster. Labels: Namenode, Meta Ops, Replication, Rack1, Rack2, Datanode, Read/Write, Master host, HAWQ Interconnect, Segment host, Segment]

  • HAWQ vs Hive

    Lower is Better

  • In-Database Analytics

    Provides data-parallel implementations of mathematical, statistical and machine-learning methods for structured and unstructured data.
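
To make "data-parallel" concrete, here is a minimal Python sketch of one common pattern: each data partition computes small partial sums independently, and a final step merges them to fit a simple linear regression. It is an illustration only and not how MADlib is implemented; the names `PartialSums`, `summarize` and `fit`, and the sample data, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PartialSums:
    """Sufficient statistics for fitting y = a + b*x on one data partition."""
    n: int = 0
    sx: float = 0.0
    sy: float = 0.0
    sxx: float = 0.0
    sxy: float = 0.0

    def merge(self, other: "PartialSums") -> "PartialSums":
        # Merging is plain addition, so partitions can be combined in any order.
        return PartialSums(
            self.n + other.n,
            self.sx + other.sx,
            self.sy + other.sy,
            self.sxx + other.sxx,
            self.sxy + other.sxy,
        )


def summarize(rows):
    """Per-partition step: needs only the rows stored on that partition."""
    s = PartialSums()
    for x, y in rows:
        s.n += 1
        s.sx += x
        s.sy += y
        s.sxx += x * x
        s.sxy += x * y
    return s


def fit(partitions):
    """Merge the per-partition sums and solve the normal equations."""
    total = PartialSums()
    for part in partitions:
        total = total.merge(summarize(part))
    denom = total.n * total.sxx - total.sx ** 2
    if denom == 0:
        raise ValueError("x has no variance; slope is undefined")
    slope = (total.n * total.sxy - total.sx * total.sy) / denom
    intercept = (total.sy - slope * total.sx) / total.n
    return intercept, slope


# Hypothetical data generated from y = 1 + 2x, split across three partitions.
partitions = [
    [(0.0, 1.0), (1.0, 3.0)],
    [(2.0, 5.0)],
    [(3.0, 7.0), (4.0, 9.0)],
]
intercept, slope = fit(partitions)
print(intercept, slope)
```

Because only five numbers per partition travel to the merge step, the amount of data moved does not grow with the number of rows, which is the appeal of running the computation where the data lives.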

  • MADlib Algorithms

  • MADlib Functions

    • Linear Regression
    • Logistic Regression
    • Multinomial Logistic Regression
    • K-Means
    • Association Rules
    • Latent Dirichlet Allocation
    • Naïve Bayes
    • Elastic Net Regression
    • Decision Trees / Random Forest
    • Support Vector Machines
    • Cox Proportional Hazards Regression
    • Descriptive Statistics
    • ARIMA

  • k-Means Usage

    SELECT * FROM madlib.kmeanspp (
        'customers',  -- name of the input table
        'features',   -- name of the feature array column
        2             -- k : number of clusters
    );

    centroids | objective_fn | frac_reassigned | …
    ------------------------------------------------------------------------+------------------+-----------------+ …
    {{68.01668579784,48.9667382972952},{28.1452167573446,84.5992507653263}} | 586729.010675982 | 0.001 | …
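
To show what the three output columns describe, here is a small pure-Python sketch of k-means with k-means++ seeding. It is not MADlib's implementation: the function names, the parameters `max_iter` and `min_frac_reassigned`, and the sample points are hypothetical, and the objective is assumed to be the sum of squared Euclidean distances to the nearest centroid.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest(point, centroids):
    """Return (index, squared distance) of the closest centroid."""
    dists = [sq_dist(point, c) for c in centroids]
    idx = min(range(len(centroids)), key=dists.__getitem__)
    return idx, dists[idx]


def seed_kmeanspp(points, k, rng):
    """k-means++ seeding: each later seed is drawn with probability
    proportional to its squared distance from the nearest chosen seed."""
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        weights = [nearest(p, centroids)[1] for p in points]
        if sum(weights) == 0:
            # Every point coincides with a seed; fall back to a uniform draw.
            centroids.append(rng.choice(points))
        else:
            centroids.append(rng.choices(points, weights=weights, k=1)[0])
    return centroids


def kmeans(points, k, max_iter=20, min_frac_reassigned=0.001, seed=0):
    rng = random.Random(seed)
    centroids = seed_kmeanspp(points, k, rng)
    assignment = [None] * len(points)
    frac_reassigned = 1.0
    for _ in range(max_iter):
        # Assignment step: attach every point to its closest centroid.
        new_assignment = [nearest(p, centroids)[0] for p in points]
        changed = sum(a != b for a, b in zip(assignment, new_assignment))
        frac_reassigned = changed / len(points)
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
        if frac_reassigned <= min_frac_reassigned:
            break
    objective = sum(nearest(p, centroids)[1] for p in points)
    return centroids, objective, frac_reassigned


# Hypothetical feature vectors forming two well-separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, objective, frac_reassigned = kmeans(points, k=2)
print(centroids, objective, frac_reassigned)
```

In this sketch `centroids` holds one coordinate tuple per cluster, `objective` measures how tightly points sit around their centroids (lower is tighter), and `frac_reassigned` is the share of points that changed cluster in the final pass.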

  • PivotalR

    • Interface is R client
    • Execution is in database
    • Parallelism handled by PivotalR
    • Supports a portion of R

    R> x = db.data.frame("t1")

    R> l = madlib.lm(interlocks ~ assets + nation, data = x)

  • MapReduce 1.0 (Image Courtesy Arun Murthy, Hortonworks)

  • Hadoop 2.0 (Image Courtesy Arun Murthy, Hortonworks)

    HADOOP 1.0

    • Pig (dataflow), Hive (sql), Others (cascading)
    • MapReduce (cluster resource management & data processing)
    • HDFS (redundant, reliable storage)

    HADOOP 2.0

    • Pig (dataflow), Hive (sql), Others (cascading)
    • MR (batch)
    • RT Stream, Graph (Storm, Giraph)
    • Services (HBase)
    • Tez (execution engine)
    • YARN (cluster resource management)
    • HDFS2 (redundant, reliable storage)

  • Applications Run Natively IN Hadoop

    • BATCH (MapReduce)
    • INTERACTIVE (Tez)
    • STREAMING (Storm, S4, …)
    • GRAPH (Giraph)
    • IN-MEMORY (Spark)
    • HPC MPI (OpenMPI)
    • ONLINE (HBase)
    • OTHER (Search) (Weave…)

    YARN (Cluster Resource Management)

    HDFS2 (Redundant, Reliable Storage)

    YARN Platform (Image Courtesy Arun Murthy, Hortonworks)

  • YARN Architecture (Image Courtesy Arun Murthy, Hortonworks)

    [Diagram: a ResourceManager with its Scheduler, Client2, and twelve NodeManagers hosting AM 1 with Containers 1.1 to 1.3 and AM2 with Containers 2.1 to 2.4]
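
The roles named in the diagram can be mimicked with a toy Python model. This is a heavily simplified sketch and not the YARN API: the classes, the slot-based capacity and the "most free slots" placement rule are all assumptions made for illustration. In real YARN the ApplicationMaster negotiates containers with the ResourceManager itself, a step this sketch folds into `submit`.

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    """One worker node; capacity is modeled as a count of free slots."""
    name: str
    free_slots: int
    containers: list = field(default_factory=list)


class ResourceManager:
    """Toy stand-in for the ResourceManager and its Scheduler."""

    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, label):
        """Place one container on the node with the most free slots."""
        node = max(self.nodes, key=lambda n: n.free_slots)
        if node.free_slots == 0:
            return None  # the cluster is full
        node.free_slots -= 1
        node.containers.append(label)
        return node.name

    def submit(self, app_id, num_tasks):
        """Start an ApplicationMaster (AM), then grant it task containers."""
        placements = {}
        am_label = f"AM {app_id}"
        am_node = self.allocate(am_label)
        if am_node is None:
            return placements
        placements[am_label] = am_node
        for i in range(1, num_tasks + 1):
            label = f"Container {app_id}.{i}"
            node = self.allocate(label)
            if node is None:
                break  # no capacity left for the remaining tasks
            placements[label] = node
        return placements


# Hypothetical cluster: four NodeManagers with two slots each.
nodes = [NodeManager(f"nm{i}", free_slots=2) for i in range(1, 5)]
rm = ResourceManager(nodes)
app1 = rm.submit(1, 3)  # AM 1 plus Containers 1.1 to 1.3
app2 = rm.submit(2, 4)  # AM 2 plus as many containers as still fit
for label, node in {**app1, **app2}.items():
    print(label, "->", node)
```

The point of the model is the separation of concerns: the ResourceManager only hands out capacity, while each application's own master decides what runs in the containers it receives, which is what lets batch, streaming and graph engines share one cluster.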

  • GraphLab + Hamster on Hadoop

  • Data Platform of the Future?

    [Diagram labels: Analytic Data Marts, Operational Intelligence, SQL Services, In-Memory Database, Run-Time Applications, Data Staging Platform, Stream Ingestion, Streaming Services, Data Mgmt. Services, In-Memory Grid, New Data-fabrics, ...ETC, Software-Defined Datacenter]

  • Questions?

