The Big Data Ecosystem: Data Scientist's Paradise
Tuesday, August 26, 2014 · 6 E Bay Street, Jacksonville
(Ravi Nair, Jax Big Data)
www.meetup.com/jaxbigdata
Rules
• There are twelve horses
• There are five fences
• The race is announced
• The starting gate holds the first six to arrive
• A gun is fired to start the race
• The time taken between fences is between 2-6 seconds (sleep)
• No two horses jump the same fence at the same time
• No two horses cross the finish line at the same time
• A commentator announces:
  – Each horse jumping a fence
  – The first three horses to cross the finish line
  – When all horses have finished the race
Let's solve this problem. Let me implement and run it in front of you; it will form the basis for our future discussion.
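The rules map naturally onto threads, locks, and an event. Below is a minimal Python sketch (one possible implementation, not the demo's actual code): each horse is a thread, each fence and the finish line are locks, and the starting gun is an event. The six-horse starting gate and the commentator's full script are simplified to prints, and the timing is parameterized so a quick run is possible.

```python
import random
import threading
import time

class Race:
    """Steeplechase sketch; the talk's rules use 12 horses, 5 fences, 2-6 s jumps."""

    def __init__(self, horses=12, fences=5, jump_time=(2.0, 6.0)):
        self.jump_time = jump_time
        self.fence_locks = [threading.Lock() for _ in range(fences)]
        self.finish_lock = threading.Lock()   # serializes finish-line crossings
        self.start_gun = threading.Event()    # fired once to release all horses
        self.results = []
        self.horses = [threading.Thread(target=self._run, args=(n,))
                       for n in range(1, horses + 1)]

    def _run(self, n):
        self.start_gun.wait()                 # held in the gate until the gun
        for f, fence in enumerate(self.fence_locks, start=1):
            time.sleep(random.uniform(*self.jump_time))
            with fence:                       # no two horses on the same fence
                print(f"Horse {n} jumps fence {f}")
        with self.finish_lock:                # no two horses finish together
            self.results.append(n)
            if len(self.results) == 3:
                print(f"First three home: {self.results}")

    def start(self):
        for h in self.horses:
            h.start()
        print("Bang!")                        # the gun is fired
        self.start_gun.set()
        for h in self.horses:
            h.join()
        print("All horses have finished the race")
        return self.results
```

For a fast dry run, shrink the sleep range: `Race(jump_time=(0.0, 0.05)).start()`.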
Steeplechase Demo
Challenges
• Heterogeneity
• Latency
• Remote memory vs. local memory
• Synchronization
• Partial failure

Applications need to adapt gracefully in the face of partial failure. Lamport once defined a distributed system as "one on which I cannot get any work done because some machine I have never heard of has crashed."
Elephant to the Rescue
Thanks to Doug Cutting, Tom White, Arun Murthy, et al.
• Simple data-parallel programming model designed for scalability and fault-tolerance
• Pioneered by Google • Processes 20 petabytes of data per day
• Popularized by open-source Hadoop project
• Used at Yahoo!, Facebook, Amazon, …
MapReduce
• At Google:
  – Index construction for Google Search
  – Article clustering for Google News
  – Statistical machine translation
• At Yahoo!:
  – "Web map" powering Yahoo! Search
  – Spam detection for Yahoo! Mail
• At Facebook:
  – Data mining
  – Ad optimization
  – Spam detection
MapReduce Use
• Scalability to large data volumes: 1000’s of machines, 10,000’s of disks
• Cost-efficiency:
  – Commodity machines (cheap, but unreliable)
  – Commodity network
  – Automatic fault-tolerance (fewer admins)
  – Easy to use (fewer programmers)
MapReduce Design Goals
Commodity Hardware
Typically a 2-level architecture:
– Nodes are commodity PCs
– 30-40 nodes/rack
– Uplink from rack is 3-4 gigabit
– Rack-internal is 1 gigabit
Hadoop Cluster
Courtesy: http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/aw-apachecon-eu-2009.pdf
MapReduce In Action
Input (three lines): "the quick brown fox" · "the fox ate the mouse" · "how now brown cow"

Map (one task per line):
– "the quick brown fox" → (the, 1) (quick, 1) (brown, 1) (fox, 1)
– "the fox ate the mouse" → (the, 1) (fox, 1) (ate, 1) (the, 1) (mouse, 1)
– "how now brown cow" → (how, 1) (now, 1) (brown, 1) (cow, 1)

Shuffle & Sort: all pairs with the same key are routed to the same Reduce task

Reduce (two tasks sum the counts per word):
– Reduce 1 → (brown, 2) (fox, 2) (how, 1) (now, 1) (the, 3)
– Reduce 2 → (ate, 1) (cow, 1) (mouse, 1) (quick, 1)

Pipeline: Input → Map → Shuffle & Sort → Reduce → Output
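The word-count flow above can be simulated in a few lines of plain Python (a sketch for intuition only, not Hadoop code; the function names are illustrative): the map phase emits (word, 1) pairs, an in-memory sort stands in for the shuffle, and the reduce phase sums the counts per key.

```python
from itertools import groupby
from operator import itemgetter

def map_fn(line):
    # Map: emit a (word, 1) pair for every word in the line
    for word in line.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # Reduce: sum all counts seen for one word
    yield (word, sum(counts))

def map_reduce(lines):
    # Map phase over every input line
    pairs = [kv for line in lines for kv in map_fn(line)]
    # Shuffle & sort: bring equal keys together
    pairs.sort(key=itemgetter(0))
    out = []
    for key, group in groupby(pairs, key=itemgetter(0)):
        out.extend(reduce_fn(key, (v for _, v in group)))
    return out

lines = ["the quick brown fox", "the fox ate the mouse", "how now brown cow"]
print(dict(map_reduce(lines)))
```

Real MapReduce additionally partitions the sorted pairs across many reducers, which is why "the, 3" and "quick, 1" land on different Reduce tasks in the diagram.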
One More Example – Inverted Index

Input: (filename, text) records
Output: list of files containing each word
Map: foreach word in text.split(): output(word, filename)
Combine: uniquify filenames for each word
Reduce: def reduce(word, filenames): output(word, sort(filenames))
Inverted Index - Flow
hamlet.txt: "to be or not to be"
12th.txt: "be not afraid of greatness"

Map output:
– hamlet.txt → (to, hamlet.txt) (be, hamlet.txt) (or, hamlet.txt) (not, hamlet.txt)
– 12th.txt → (be, 12th.txt) (not, 12th.txt) (afraid, 12th.txt) (of, 12th.txt) (greatness, 12th.txt)

Reduce output:
afraid, (12th.txt)
be, (12th.txt, hamlet.txt)
greatness, (12th.txt)
not, (12th.txt, hamlet.txt)
of, (12th.txt)
or, (hamlet.txt)
to, (hamlet.txt)
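The inverted-index pipeline fits in a short plain-Python sketch (for intuition only; the function name and input shape are assumptions, not Hadoop code). The set plays the role of the combiner (uniquifying filenames) and the final sort matches the reduce step.

```python
from collections import defaultdict

def inverted_index(docs):
    """docs: mapping of filename -> text. Returns word -> sorted list of files."""
    index = defaultdict(set)
    for filename, text in docs.items():
        for word in text.split():          # Map: emit (word, filename)
            index[word].add(filename)      # Combine: unique filenames per word
    return {w: sorted(fs) for w, fs in index.items()}  # Reduce: sort file lists

docs = {"hamlet.txt": "to be or not to be",
        "12th.txt": "be not afraid of greatness"}
print(inverted_index(docs)["be"])  # ['12th.txt', 'hamlet.txt']
```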
I want more bread
• Many parallel algorithms can be expressed as a series of MapReduce jobs
• But MapReduce is fairly low-level: you must think about keys, values, partitioning, etc.
• Can we capture common "job building blocks"?
Pig
Started at Yahoo! Research; runs about 30% of Yahoo!'s jobs.
Features:
• Expresses sequences of MapReduce jobs
• Data model: nested "bags" of items
• Provides relational (SQL) operators (JOIN, GROUP BY, etc.)
• Easy to plug in Java functions
• Pig Pen development environment for Eclipse
Example - Pig
Problem:
Suppose you have logged-in user data in one file and tweet data in another, and you need to find the top 25 trending topics among users aged 18-35.
Load Users Load tweet data/topics
Filter by age
Join on name
Group on topics
Count topics
Order by topics
Take top 25
Pig Translation
Each step of the flow maps to one Pig Latin statement:

Load Users → Users = load …
Load tweet data/topics → Topics = load …
Filter by age → Filtered = filter …
Join on name → Joined = join …
Group on topics → Grouped = group …
Count topics → Summed = … count() …
Order by topics → Sorted = order …
Take top 25 → Top25 = limit …
Pig Latin
Users = load 'users' as (name, age);
Filtered = filter Users by age >= 18 and age <= 35;
Topics = load 'topics' as (user, topic);
Joined = join Filtered by name, Topics by user;
Grouped = group Joined by topic;
Summed = foreach Grouped generate group, COUNT(Joined) as frequencies;
Sorted = order Summed by frequencies desc;
Top25 = limit Sorted 25;
store Top25 into 'top25trends';
Hive | Introduction
• Structured data on HDFS
• Developed at Facebook
• Used for the majority of Facebook jobs
• "Relational database" built on Hadoop
• Maintains list of table schemas • SQL-like query language (HQL) • Can call Hadoop Streaming scripts from HQL • Supports table partitioning, clustering, complex data types, some optimizations
Hive | Sample Hive Queries
• Find top 5 pages visited by users aged 18-25:
SELECT p.url, COUNT(1) as clicks
FROM users u JOIN page_views p ON (u.name = p.user)
WHERE u.age >= 18 AND u.age <= 25
GROUP BY p.url
ORDER BY clicks DESC
LIMIT 5;
• Filter page views through a Python script:
SELECT TRANSFORM(p.user, p.date)
USING 'map_script.py'
AS dt, uid
CLUSTER BY dt
FROM page_views p;
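For intuition, here is a hypothetical sketch of what a script like 'map_script.py' might contain (the actual script is not shown in these slides). Hive's TRANSFORM streams tab-separated rows (p.user, p.date) to the script's stdin and parses its tab-separated stdout back into the declared columns (dt, uid).

```python
# Hypothetical map_script.py body: reorder (user, date) rows into (dt, uid).
def transform(rows):
    for row in rows:
        user, date = row.rstrip("\n").split("\t")   # Hive sends tab-separated fields
        yield f"{date}\t{user}"                     # emit in the order the AS clause declares

# In the real script this would wrap sys.stdin and print(); simulated here:
print(list(transform(["alice\t2014-08-26\n", "bob\t2014-08-25\n"])))
```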
Hadoop 1 and Hadoop 2
Hadoop Daemons
Hadoop 1
• Limited to about 4,000 nodes per cluster
• Scalability bounded by O(# of tasks in a cluster)
• JobTracker bottleneck – resource management, job scheduling, and monitoring
• Only one namespace for managing HDFS
• Map and Reduce slots are static
• The only job type that can run is MapReduce
© Hortonworks Inc. 2011
Hadoop MapReduce Classic
• JobTracker
–Manages cluster resources and job scheduling
• TaskTracker
–Per-node agent
–Manages tasks
Hadoop 1 - Reading Files
[Diagram] The Hadoop Client asks the NameNode to read a file; the NameNode (whose fsimage/edit log the Secondary NameNode periodically checkpoints) returns DataNodes, block ids, etc., and the client then reads the blocks directly from DataNode/TaskTracker (DN | TT) nodes spread across Rack1 … RackN. DataNodes send heartbeats and block reports back to the NameNode.
Hadoop 1 - Writing Files
[Diagram] The Hadoop Client requests a write from the NameNode, which returns the target DataNodes. The client writes blocks to the first DataNode, and replication is pipelined from DataNode to DataNode across the racks; DataNodes send block reports back to the NameNode, while the Secondary NameNode continues checkpointing the fsimage/edit log.
Hadoop 1 - Running Jobs
[Diagram] The Hadoop Client submits a job to the JobTracker, which deploys map and reduce tasks to TaskTrackers (DN | TT) across Rack1 … RackN; map output is shuffled to the reduce tasks, which write the output partitions (part 0, …).
Hadoop 1 - APIs
org.apache.hadoop.mapreduce.Partitioner
org.apache.hadoop.mapreduce.Mapper
org.apache.hadoop.mapreduce.Reducer
org.apache.hadoop.mapreduce.Job
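The Mapper/Reducer contract behind these Java APIs is easiest to see through Hadoop Streaming, which exchanges "key\tvalue" text lines. Below is a hedged Python sketch of hypothetical mapper.py/reducer.py script bodies for word count; the in-process sorted() stands in for the framework's shuffle & sort, and in production the two functions would run as separate scripts reading sys.stdin.

```python
def mapper(lines):
    # mapper.py: emit "word\t1" for every word in the input
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(lines):
    # reducer.py: input arrives sorted by key, so equal keys are adjacent
    current, total = None, 0
    for line in lines:
        word, count = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                yield f"{current}\t{total}"   # flush the previous key
            current, total = word, 0
        total += int(count)
    if current is not None:
        yield f"{current}\t{total}"           # flush the last key

# Simulate map -> sort (shuffle) -> reduce in one process:
mapped = sorted(mapper(["the quick brown fox", "the fox ate the mouse"]))
print(list(reducer(mapped)))
```

The streaming reducer only works because the framework guarantees its input is sorted by key, which is exactly what the shuffle & sort phase in the earlier diagram provides.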
Hadoop 2
• Potentially up to 10,000 nodes per cluster
• Scalability bounded by O(cluster size)
• Supports multiple namespaces for managing HDFS
• Efficient cluster utilization (YARN)
• Backward and forward compatible with MRv1
• Any app can integrate with Hadoop – beyond Java
YARN: A new abstraction layer
HADOOP 1.0 – single-use system (batch apps):
– HDFS (redundant, reliable storage)
– MapReduce (cluster resource management & data processing)

HADOOP 2.0 – multi-purpose platform (batch, interactive, online, streaming, …):
– HDFS2 (redundant, reliable storage)
– YARN (cluster resource management)
– MapReduce and others (data processing)
YARN Architecture
Requirements
• Reliability
• Availability
• Utilization
• Wire Compatibility
• Agility & Evolution – ability for customers to control upgrades to the grid software stack
• Scalability – clusters of 6,000-10,000 machines
  – Each machine with 16 cores, 48G/96G RAM, 24TB/36TB disks
  – 100,000+ concurrent tasks
  – 10,000 concurrent jobs
Architecture
• ResourceManager
  – Global resource scheduler
  – Hierarchical queues
• NodeManager
  – Per-machine agent
  – Manages the life-cycle of containers
  – Container resource monitoring
• ApplicationMaster
  – Per-application
  – Manages application scheduling and task execution
  – E.g. the MapReduce ApplicationMaster
Hadoop 2 - Reading Files (w/ NN Federation)
[Diagram] With NameNode federation, several NameNodes (NN1/ns1 … NN4/ns4) each manage an independent namespace. DataNodes, now paired with NodeManagers (DN | NM) across Rack1 … RackN, register and send heartbeats/block reports to every NameNode and hold a block pool per namespace (e.g. ns1 → dn1, dn2; ns2 → dn1, dn3; ns3, ns4 → dn4, dn5). The Hadoop Client reads a file by asking the owning NameNode for DataNodes, block ids, etc., then reads the blocks directly. Each NameNode is checkpointed either by its own Secondary NameNode (fsimage/edit copy) or by a Backup NameNode kept current via fs sync.
Hadoop 2 - Writing Files
[Diagram] The Hadoop Client requests a write from the NameNode that owns the namespace (NN1/ns1 … NN4/ns4), receives the target DataNodes, and writes the blocks with replication pipelined between DataNodes (DN | NM) across the racks. DataNodes send block reports back, and per-NameNode Secondary or Backup NameNodes handle checkpoints and fs sync as in the read path.
Hadoop 2 - Running Jobs
[Diagram] Hadoop Clients 1 and 2 submit applications to the ResourceManager, whose ApplicationsManager (ASM) creates a per-application ApplicationMaster (AM1, AM2) inside a container on some NodeManager. Each ApplicationMaster then negotiates containers (C1.1 … C1.4, C2.1 … C2.3) from the Scheduler, which partitions cluster resources across hierarchical queues; NodeManagers on Rack1 … RackN launch the containers and send status reports back to the ResourceManager.
Hadoop 2 - Security
[Diagram] REST clients, JDBC clients, and browsers (e.g. HUE) reach the Hadoop cluster through a Knox Gateway cluster sitting in the DMZ between two firewalls. The gateway integrates with LDAP/AD, a KDC, and an enterprise/cloud SSO provider, while native Hive/HBase encryption protects data inside the cluster.
Hadoop 2 - APIs
org.apache.hadoop.yarn.api.ApplicationClientProtocol
org.apache.hadoop.yarn.api.ApplicationMasterProtocol
org.apache.hadoop.yarn.api.ContainerManagementProtocol
• MapReduce is the processing model within any Hadoop ecosystem.
• Whenever you execute actions against Hadoop/Hive, MapReduce is invoked.
• This is (1) complex, (2) time-consuming, and (3) difficult to learn and debug.
• We often need tools/engines that can efficiently reduce this complexity.
Live Demo Objectives:
1) A Hive table is queried using the standard Hive query language
2) We see a MapReduce job executed in the background
3) We use Presto to avoid MapReduce and get the data faster
4) We connect R to Hive through Presto to retrieve data from Hive without MapReduce
My favourites – R and Presto

R (http://www.r-project.org/)
1) R is a language and environment for statistical computing and graphics (similar to SAS, but open source).
2) R is supplied with a variety of packages for various analytical purposes.
3) By writing code in R, you can create customized actions, one of which is to connect to big data.

Presto (http://prestodb.io)
1) Open-source query engine against GBs to PBs of data.
2) Developed by Facebook, who also developed Hive.
3) Multiple data sources will be supported in the near future.
4) Currently used at Facebook for querying their 300 PB Hive data warehouse.
Please submit your feedback at www.meetup.com/jaxbigdata
Thank you for the opportunity!