Modern Big Data Analytics Tools: An Overview
S. Saranya / IT6006 / Modern Big Data Analytics Tools, 7/24/2019
Hadoop Midwife :-)
Once upon a time, in a land far, far away…
Fast forward 15 years…
What Happened?
In a blink of an eye…
[Figure: the Hadoop ecosystem. Legend: ASF Projects, FLOSS Projects, Pivotal Products]
Components shown: HDFS, YARN, Sqoop, Flume, ZooKeeper and Oozie (coordination and workflow management), Command Center, GemFire XD, MapReduce, Pig, Hive, Tez, Giraph, Hue (Hadoop UI), SolrCloud, Phoenix, HBase, Crunch, Mahout, Spark (Shark, Streaming, MLlib, GraphX), Impala, HAWQ, Spring XD, MADlib, Hamster, PivotalR
Google Papers
Yahoo! Search
+
=
W-1-W
• WebMap: Graph processing for WWW
• Dreadnaught: Infrastructure for WebMap
• W-1-W: WebMap In One Week
• Juggernaut: Infrastructure for W-1-W
• JFS, JMR, Condor: Abandoned for Hadoop
Lucene, Nutch
MapReduce is the Revenge of System Programmers on Database community.
- Anonymous at XLDB, Stanford, 2010
O’Reilly Books 2013
Who Uses Hadoop? (From Hadoop Summit 2010)
Big Data Landscape - July 2012
http://www.forbes.com/sites/davefeinleib/2012/06/19/the-big-data-landscape/
Hadoop Maturity
• ETL Offload: Accommodate massive data growth with existing EDW investments
• Data Lakes: Unify Unstructured and Structured Data Access
• Big Data Apps: Build analytic-led applications impacting top line revenue
• Data-Driven Enterprise: App Dev and Operational Management on HDFS Data Architecture
• 70% of data generated by customers
• 80% of data being stored
• 3% being prepared for analysis
• 0.5% being analyzed
Storage Options
• HDFS, MapR, Quantcast QFS
• EMC Isilon, NetApp, IBM GPFS, PanFS, PVFS, Lustre
• Amazon S3, EMC Atmos, OpenStack Swift
• GlusterFS, Ceph
• EMC ViPR
SQL-on-Hadoop
• Pivotal HAWQ
• Cloudera Impala, Facebook Presto, Apache Drill, Cascading Lingual, Optiq, Hortonworks Stinger
• Hadapt, Jethrodata, IBM BigSQL, Microsoft PolyBase
• More to come...
HAWQ & HDFS
• Master Servers: Planning & dispatch
• Network Interconnect
• Segment Servers: Query execution
• Storage: HDFS, HBase …
[Figure: HAWQ deployed on HDFS. Labels: Namenode, Datanodes on Rack1 and Rack2, replication, Read/Write, Meta Ops, Master host, Segment hosts each running several Segments, HAWQ Interconnect]
HAWQ vs Hive
Lower is Better
In-Database Analytics
Provides data-parallel implementations of mathematical, statistical and machine-learning methods for structured and unstructured data.
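One way to picture a data-parallel implementation of a statistical method is simple least squares computed from per-partition summaries. The sketch below is an illustration only, not MADlib's actual code: the `Stats` class, the `local_stats` and `fit` helpers, and the idea of treating each list as one database segment are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class Stats:
    """Sufficient statistics for the line y = intercept + slope * x."""
    n: int = 0
    sx: float = 0.0
    sy: float = 0.0
    sxx: float = 0.0
    sxy: float = 0.0

    def merge(self, other: "Stats") -> "Stats":
        # Merging is plain addition, so partitions can be combined in any order.
        return Stats(self.n + other.n, self.sx + other.sx, self.sy + other.sy,
                     self.sxx + other.sxx, self.sxy + other.sxy)


def local_stats(rows: Iterable[Tuple[float, float]]) -> Stats:
    """Summarize one partition (hypothetically, one segment's rows)."""
    s = Stats()
    for x, y in rows:
        s = s.merge(Stats(1, x, y, x * x, x * y))
    return s


def fit(partitions: List[List[Tuple[float, float]]]) -> Tuple[float, float]:
    """Combine per-partition summaries and solve for (intercept, slope)."""
    total = Stats()
    for part in partitions:
        total = total.merge(local_stats(part))
    denom = total.n * total.sxx - total.sx ** 2
    if denom == 0:
        raise ValueError("x values are constant; slope is undefined")
    slope = (total.n * total.sxy - total.sx * total.sy) / denom
    intercept = (total.sy - slope * total.sx) / total.n
    return intercept, slope


if __name__ == "__main__":
    # Two partitions of points on the line y = 2x + 1.
    parts = [[(0.0, 1.0), (1.0, 3.0)], [(2.0, 5.0), (3.0, 7.0), (4.0, 9.0)]]
    print(fit(parts))
```

Only the five running sums travel between partitions, never the rows themselves, which is what makes this style of computation a natural fit for execution inside a distributed database.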
MADlib Algorithms
MADlib Functions
• Linear Regression
• Logistic Regression
• Multinomial Logistic Regression
• K-Means
• Association Rules
• Latent Dirichlet Allocation
• Naïve Bayes
• Elastic Net Regression
• Decision Trees / Random Forest
• Support Vector Machines
• Cox Proportional Hazards Regression
• Descriptive Statistics
• ARIMA
k-Means Usage
SELECT * FROM madlib.kmeanspp (
    'customers',  -- name of the input table
    'features',   -- name of the feature array column
    2             -- k : number of clusters
);
centroids | objective_fn | frac_reassigned | …
------------------------------------------------------------------------+------------------+-----------------+ …
{{68.01668579784,48.9667382972952},{28.1452167573446,84.5992507653263}} | 586729.010675982 | 0.001 | …
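To make the output columns above more concrete, here is a minimal single-machine sketch of k-means with k-means++ seeding. It is not MADlib's implementation: the function names, the `max_iter` and `seed` parameters, and the toy data are assumptions for illustration. It returns the final centroids and the objective (sum of squared distances to the assigned centroid), which loosely correspond to the `centroids` and `objective_fn` columns.

```python
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def seed_kmeanspp(points: Sequence[Point], k: int, rng: random.Random) -> List[Point]:
    """Pick k starting centroids, favoring points far from those already chosen."""
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        weights = [min(sq_dist(p, c) for c in centroids) for p in points]
        if sum(weights) == 0:
            # Every point coincides with a chosen centroid; fall back to uniform choice.
            centroids.append(rng.choice(points))
        else:
            centroids.append(rng.choices(points, weights=weights, k=1)[0])
    return centroids


def kmeans(points: Sequence[Point], k: int, max_iter: int = 20,
           seed: int = 0) -> Tuple[List[Point], float]:
    """Run Lloyd's iterations from k-means++ seeds; return (centroids, objective)."""
    rng = random.Random(seed)
    centroids = seed_kmeanspp(points, k, rng)
    assignment = [-1] * len(points)
    for _ in range(max_iter):
        new_assignment = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the centroids are stable
        assignment = new_assignment
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    objective = sum(sq_dist(p, centroids[a]) for p, a in zip(points, assignment))
    return centroids, objective


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
            (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
    print(kmeans(data, k=2))
```

In the in-database version the assignment and centroid-update steps would run over table rows in parallel; this sketch keeps everything in one Python process for readability.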
PivotalR
• Interface is R client
• Execution is in database
• Parallelism handled by PivotalR
• Supports a portion of R
R> x = db.data.frame("t1")
R> l = madlib.lm(interlocks ~ assets + nation, data = x)
MapReduce 1.0 (Image Courtesy Arun Murthy, Hortonworks)
Hadoop 2.0 (Image Courtesy Arun Murthy, Hortonworks)
HADOOP 1.0
• HDFS (redundant, reliable storage)
• MapReduce (cluster resource management & data processing)
• Pig (data flow), Hive (sql), Others (cascading)
HADOOP 2.0
• HDFS2 (redundant, reliable storage)
• YARN (cluster resource management)
• Tez (execution engine)
• MR (batch)
• RT Stream, Graph (Storm, Giraph)
• Services (HBase)
• Pig (data flow), Hive (sql), Others (cascading)
YARN Platform (Image Courtesy Arun Murthy, Hortonworks)
Applications Run Natively IN Hadoop
• BATCH (MapReduce)
• INTERACTIVE (Tez)
• STREAMING (Storm, S4, …)
• GRAPH (Giraph)
• IN-MEMORY (Spark)
• HPC MPI (OpenMPI)
• ONLINE (HBase)
• OTHER (Search) (Weave…)
YARN (Cluster Resource Management)
HDFS2 (Redundant, Reliable Storage)
YARN Architecture (Image Courtesy Arun Murthy, Hortonworks)
[Figure: a ResourceManager with its Scheduler, a client (Client2), and a grid of twelve NodeManagers hosting AM 1 with Containers 1.1 to 1.3 and AM2 with Containers 2.1 to 2.4]
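The roles labeled in the figure can be mimicked with a toy model. The sketch below is not YARN's API: the class names only echo the figure's labels, and the slot-based capacity, the "most free slots" placement policy, and the `submit` method are all assumptions made for illustration. In real YARN the application master negotiates containers with the ResourceManager itself; here that exchange is collapsed into a single call.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class NodeManager:
    """A worker node with a fixed number of container slots (toy capacity model)."""
    name: str
    free_slots: int
    containers: List[str] = field(default_factory=list)


class ResourceManager:
    """Toy cluster-wide allocator standing in for the ResourceManager and Scheduler."""

    def __init__(self, nodes: List[NodeManager]) -> None:
        self.nodes = nodes
        self.next_app = 1

    def _allocate(self, container_id: str) -> Optional[str]:
        # Hypothetical policy: place on the node with the most free slots.
        node = max(self.nodes, key=lambda n: n.free_slots)
        if node.free_slots == 0:
            return None  # cluster is full
        node.free_slots -= 1
        node.containers.append(container_id)
        return node.name

    def submit(self, num_containers: int) -> Dict[str, Optional[str]]:
        """Start an application master, then its worker containers."""
        app = self.next_app
        self.next_app += 1
        placement = {f"AM{app}": self._allocate(f"AM{app}")}
        for i in range(1, num_containers + 1):
            cid = f"Container {app}.{i}"
            placement[cid] = self._allocate(cid)
        return placement


if __name__ == "__main__":
    rm = ResourceManager([NodeManager(f"nm{i}", free_slots=2) for i in range(1, 5)])
    print(rm.submit(3))  # application 1: AM1 plus Containers 1.1 to 1.3
    print(rm.submit(4))  # application 2: AM2 plus Containers 2.1 to 2.4
```

Separating per-application masters from the cluster-wide allocator is the point of the sketch: the allocator knows nothing about what runs inside a container.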
GraphLab + Hamster on Hadoop
Data Platform of the Future?
• Analytic Data Marts
• Operational Intelligence
• SQL Services
• In-Memory Database
• Run-Time Applications
• Data Staging Platform
• Stream Ingestion
• Streaming Services
• Data Mgmt. Services
• In-Memory Grid
• New Data-fabrics
• ...ETC
• Software-Defined Datacenter
Questions?