Thus Far
• Locality is important!!!
  – Need to get processing closer to storage
  – Need to get tasks close to data
• Rack locality: Hadoop
• Kill a running task when a local slot is available: Quincy
• Why?
  – The network is slow: remote reads give horrible performance
  – Why is the network slow?
    • Over-subscription of the network
What Has Changed?
• The network is no longer over-subscribed
  – Fat-tree, VL2
• The network has fewer congestion points
  – Helios, c-Through, Hedera, MicroTE
• Server uplinks are much faster
• Implication: network transfers are much faster
  – The network is now roughly as fast as disk I/O
  – The difference between local and rack-local reads is only about 8%
• Storage practices have also changed
  – Compression is used, so less data needs to be transferred
  – De-replication is practiced: with only one copy, locality is really hard to achieve anyway
So What Now?
• No need to worry about locality when doing placement
  – Placement can happen faster
  – Scheduling algorithms can be smaller and simpler
• The network is as fast as a SATA disk, but still much slower than an SSD
  – If SSDs are used, disk locality is a problem AGAIN!
  – However, SSDs are too costly to use for all storage
Caching with Memory/SSD
• 94% of all jobs have inputs that fit in memory
• So the new problem is memory locality
  – Want to place a task where the data it needs is already in memory
• Interesting challenges:
  – 46% of tasks use data that is never re-used
    • So we need to pre-fetch for these tasks (see the sketch below)
  – Current caching schemes are ineffective for them
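A minimal sketch of the pre-fetching idea (all names hypothetical, not from any paper): inputs that are never re-used get no benefit from a populate-on-miss policy like LRU, so the cache has to be warmed before the task runs, when the scheduler already knows the task's input blocks.

class MemoryCache:
    """Toy in-memory block cache (no eviction policy shown)."""
    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self.blocks = {}                 # block_id -> bytes

    def put(self, block_id, data):
        if self.used + len(data) > self.capacity:
            return False                 # a real cache would evict here
        self.blocks[block_id] = data
        self.used += len(data)
        return True

def prefetch_inputs(cache, task_inputs, read_block):
    """Warm the cache with a task's input blocks ahead of execution.
    read_block pulls a block from the distributed file system (hypothetical)."""
    for block_id in task_inputs:
        if block_id not in cache.blocks:
            cache.put(block_id, read_block(block_id))

# Usage: right after placement, the scheduler calls
# prefetch_inputs(cache, task_inputs, dfs_read) so the data is
# memory-local by the time the task starts.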
How do you build a FS that ignores locality?
• FDS from MSR ignores locality
• Eliminates the network bottleneck, so locality no longer matters
• Eliminates metadata-server bottlenecks to improve the throughput of the whole system
Meta-data Server
• The current metadata server (the name-node)
  – Stores the mapping of chunks to servers
  – Central point of failure
  – Central bottleneck
    • Processing issue: every read/write must first consult the metadata server
    • Storage issue: must store the location and size of EVERY chunk
FDS’s Meta-data Server
• Only stores the list of servers
  – Smaller memory footprint: # servers <<< # chunks
• Clients only interact with it at startup, not every time they read/write
  – # client boots <<<< # reads/writes
• To determine where to read/write: consistent hashing
  – Read/write data at the server at this index in the server list: Hash(GUID) mod #servers (see the sketch below)
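A minimal sketch of the lookup rule above, assuming the client fetched the server list once from the metadata server at startup; the tract-offset term is an illustrative assumption, not necessarily FDS's exact formula.

import hashlib

def locate_server(servers, blob_guid, tract_no=0):
    """Index into the cached server list by hashing the blob's GUID,
    so clients never touch the metadata server on the data path."""
    h = int(hashlib.md5(blob_guid.encode()).hexdigest(), 16)
    return servers[(h + tract_no) % len(servers)]

# Client startup: fetch the server list once, then locate data locally.
servers = ["10.0.0.%d:9000" % i for i in range(1, 9)]   # hypothetical addresses
print(locate_server(servers, "blob-42", tract_no=7))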
Network Changes
• Uses a VL2-style Clos network
  – Eliminates over-subscription and congestion
• One TCP connection doesn't saturate a server's 10-gig NIC
  – Use 5 TCP connections to saturate the link
• With VL2 there is no congestion in the core, but there may be at the receiver
  – The receiver controls each sender's sending rate
  – The receiver sends rate-limiting messages to the senders (see the sketch below)
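A toy sketch of receiver-driven rate control; the message format and the even split are assumptions for illustration, not FDS's actual protocol.

NIC_CAPACITY_GBPS = 10.0

def compute_grants(active_senders):
    """Receiver side: divide the 10 Gb/s uplink evenly among current senders."""
    if not active_senders:
        return {}
    share = NIC_CAPACITY_GBPS / len(active_senders)
    return {s: share for s in active_senders}

def on_sender_joined(senders, new_sender, send_ctrl_msg):
    """When a new transfer starts, re-grant rates so the receiver's link
    never becomes the new congestion point."""
    senders.add(new_sender)
    for sender, rate in compute_grants(senders).items():
        send_ctrl_msg(sender, {"type": "RATE_LIMIT", "gbps": rate})

senders = set()
on_sender_joined(senders, "sender-1", lambda s, m: print(s, m))   # 10 Gb/s
on_sender_joined(senders, "sender-2", lambda s, m: print(s, m))   # 5 Gb/s each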
Disk locality is (almost) a problem of the past
• Advances in networking
  – Eliminate over-subscription and congestion
• We have a prototype, FDS, that doesn't need locality
  – Uses VL2
  – Eliminates the metadata-server bottleneck
• New problem, new challenges: memory locality
  – Need new cache-replacement techniques
  – Need new pre-caching schemes
Class Wrap-Up
• What have we covered and learned?
• The big-data stack
  – How do we optimize each layer?
  – What are the challenges in each layer?
  – Are there opportunities to optimize across layers?
Big-Data Stack: App Paradigms
• Commodity devices impact the design of application paradigms
  – Hadoop: dealing with failures
    • Addresses n/w over-subscription with rack-aware placement
    • Straggler detection and mitigation: restart slow tasks
  – Dryad: Hadoop for smarter programmers
    • Can create more expressive (acyclic) task DAGs
    • Can determine which tasks should run locally on the same devices
    • Dryad adds optimizations, e.g., extra nodes that do intermediate aggregation
[Stack diagram: the big-data stack layers (App, Sharing, Virt Drawbacks, N/W Paradigm, N/W Sharing, Tail Latency, SDN, Storage), with Hadoop and Dryad added to the App layer.]
Big-Data Stack: App Paradigms Revisited
• User-visible services are complex and composed of multiple M-R jobs
  – FlumeJava & DryadLINQ
    • Delay execution until the output is required (see the sketch below)
    • Allows for various optimizations
    • Storing output to HDFS between M-R jobs adds time
      – So eliminate HDFS between jobs
    • Programmers aren't perfect and often include unnecessary steps
      – Knowing what is required for the output, you can eliminate the unnecessary ones
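A minimal sketch of deferred execution (hypothetical names, not the FlumeJava or DryadLINQ API): each operation just records a node in a graph, and nothing runs until the output is demanded, which is what lets a planner fuse steps instead of writing intermediates to HDFS between jobs.

class Deferred:
    """Record operations instead of running them; run only when asked."""
    def __init__(self, data=None, op=None, parent=None):
        self.data, self.op, self.parent = data, op, parent

    def map(self, fn):
        return Deferred(op=("map", fn), parent=self)

    def filter(self, pred):
        return Deferred(op=("filter", pred), parent=self)

    def collect(self):
        # Walk back to the source, then apply the recorded ops in order.
        chain, node = [], self
        while node.op is not None:
            chain.append(node.op)
            node = node.parent
        result = node.data
        for kind, fn in reversed(chain):
            # A real planner would fuse these steps into one pass and skip
            # materializing intermediate results.
            if kind == "map":
                result = [fn(x) for x in result]
            else:
                result = [x for x in result if fn(x)]
        return result

words = Deferred(data=["a", "bb", "ccc"])
print(words.map(len).filter(lambda n: n > 1).collect())   # [2, 3]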
[Stack diagram: FlumeJava and DryadLINQ added to the App layer.]
Big-Data Stack: App Paradigms Revisited-yet-again
• User-visible services require interactivity, so jobs need to be fast; jobs should return results before processing completes
  – Hadoop Online:
    • Pipeline results from map to reduce before the map is done
    • Pipeline too early and the reduce has to do the sorting
      – Increases processing overhead on the reduce side: BAD!!!
  – RDDs: Spark
    • Store data in memory: much faster than disk
    • Instead of processing eagerly, build an abstract graph of the processing and run it when the output is required
      – Allows for optimizations
    • Failure recovery is the challenge (see the lineage sketch below)
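A toy sketch of lineage-based recovery in the Spark spirit (not Spark's API): a partition remembers how it was derived, so a lost in-memory copy can be recomputed from its parent rather than replicated up front.

class Partition:
    def __init__(self, compute, parent=None):
        self.compute = compute        # function: parent data -> this data
        self.parent = parent
        self.cached = None            # in-memory copy; may be lost on failure

    def get(self):
        if self.cached is None:       # lost or never computed: walk the lineage
            parent_data = self.parent.get() if self.parent else None
            self.cached = self.compute(parent_data)
        return self.cached

source = Partition(compute=lambda _: list(range(5)))
squares = Partition(compute=lambda xs: [x * x for x in xs], parent=source)

print(squares.get())      # [0, 1, 4, 9, 16]
squares.cached = None     # simulate losing the in-memory partition
print(squares.get())      # recomputed from lineage, same result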
[Stack diagram: Hadoop Online and Spark added to the App layer.]
Big-Data Stack: Sharing is Caring
How to share a non-virtualized cluster
• Sharing is good: you have too much data, and it costs too much to build many clusters for the same data
• Need dynamic sharing: with static partitioning you waste resources
• Mesos:
  – Resource offers: give each app a choice of resources and let it pick
  – The app knows best
• Omega:
  – Optimistic allocation: each scheduler picks resources; if there's a conflict, Omega detects it and gives the resources to only one scheduler, and the others pick new resources (see the sketch below)
  – Even with conflicts this is much better than a single centralized scheduler
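A toy sketch of optimistic allocation (the conflict-resolution details are simplified assumptions): schedulers claim machines against a shared cell state, and a commit is rejected if an earlier commit already took those machines.

class CellState:
    def __init__(self, machines):
        self.free = set(machines)

    def try_commit(self, scheduler, claim):
        """Atomically grant the claim, or report which machines conflict."""
        conflict = claim - self.free
        if conflict:
            return False, conflict          # someone else won these machines
        self.free -= claim
        return True, set()

cell = CellState({"m1", "m2", "m3", "m4"})
ok, _ = cell.try_commit("batch-scheduler", {"m1", "m2"})
print(ok)                                    # True
ok, conflict = cell.try_commit("service-scheduler", {"m2", "m3"})
print(ok, conflict)                          # False {'m2'} -> retry elsewhere
ok, _ = cell.try_commit("service-scheduler", {"m3", "m4"})
print(ok)                                    # True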
[Stack diagram: Mesos and Omega added to the Sharing layer.]
Big-Data Stack: Sharing is Caring
Cloud Sharing
• Clouds give the illusion of equality
  – H/W differences lead to different performance
  – Poor isolation: tenants can impact each other
    • I/O-bound and CPU-bound jobs can conflict
[Stack diagram: BobTail, RFA, and cloud gaming added to the Virt Drawbacks layer.]
Big-Data Stack: Better Networks
• Networks give bad performance
  – Cause: congestion + over-subscription
• VL2 / PortLand
  – Eliminate over-subscription and congestion with commodity devices + ECMP
• Helios / c-Through
  – Mitigate congestion by carefully adding new capacity
[Stack diagram: VL2, PortLand, Helios, c-Through, Hedera, and MicroTE added to the N/W Paradigm layer.]
Big-Data Stack: Tail Latency
• When you need multiple servers to service a request
  – If each server is fast 99% of the time, P(all 100 are fast) = 0.99^100 ≈ 0.37 (HORRIBLE)
  – Duplicate requests: send the same request to 2 servers
    • At least one will finish within an acceptable time (see the back-of-the-envelope sketch below)
  – Dolly: be smart when selecting the 2 servers
    • You don't want I/O contention, because that leads to bad performance
    • Avoid maps using the same replicas
    • Avoid reducers reading the same intermediate output
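The arithmetic behind the bullets above, as a back-of-the-envelope sketch; the 1%-slow-server figure is illustrative.

# If each server is slow 1% of the time and a request fans out to 100 servers,
# the request is fast only if ALL of them are fast.
p_fast = 0.99
fanout = 100
print(p_fast ** fanout)                  # ~0.37: almost 2/3 of requests hit a straggler

# Send each sub-request to 2 servers (cloning): a sub-request is slow only if
# BOTH copies are slow, assuming the copies fail independently -- which is why
# Dolly avoids clones that share replicas or intermediate outputs.
p_slow_clone = (1 - p_fast) ** 2
print((1 - p_slow_clone) ** fanout)      # ~0.99: the tail is mostly gone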
[Stack diagram: Dolly (clones) and Mantri added to the Tail Latency layer.]
Big-Data Stack: Network Sharing
• How to share efficiently while making guarantees
• ElasticSwitch
  – Two-level bandwidth-allocation system
• Orchestra
  – M/R has barriers, and completion depends on a set of flows, not individual flows
  – Make optimizations over the set of flows
• HULL: trade bandwidth for latency
  – Want zero buffering, but TCP needs buffering
  – Limit traffic to 90% of the link and use the remaining 10% as headroom (see the pacing sketch below)
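A toy pacer illustrating the HULL idea (parameters assumed, not HULL's actual mechanism): cap the sending rate at 90% of line rate so switch queues stay near-empty, trading a little bandwidth for low latency.

import time

LINE_RATE_BPS = 10e9
CAP = 0.9 * LINE_RATE_BPS     # sacrifice 10% of bandwidth as headroom

def paced_send(packets, send):
    """Space packets out so the aggregate rate stays under the cap,
    keeping real switch buffers near-empty."""
    for pkt in packets:
        send(pkt)
        time.sleep(len(pkt) * 8 / CAP)

paced_send([b"x" * 1500] * 3, send=lambda p: None)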
[Stack diagram: ElasticSwitch, Orchestra, and Hull added to the N/W Sharing layer.]
Big-Data Stack: Enter SDN
• Remove the Control plane from the switches and centralize it
• Centralization == scalability challenges
• NOX: how does it scale to data centers?
  – How many controllers do you need?
• How should you design these controllers?
  – Kandoo: a hierarchy (many local controllers and one global controller; the locals communicate with the global) (see the sketch below)
  – ONIX: a mesh (communication through a DHT or a DB)
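A toy sketch of a Kandoo-style hierarchy (event names and scopes are assumptions): local controllers handle events that only need local state and escalate anything needing a network-wide view to the single root controller.

class RootController:
    def handle(self, event):
        return "root handled %s with the global view" % event["type"]

class LocalController:
    def __init__(self, name, root):
        self.name, self.root = name, root

    def handle(self, event):
        if event["scope"] == "local":
            return "%s handled %s locally" % (self.name, event["type"])
        return self.root.handle(event)       # escalate to the global controller

root = RootController()
local = LocalController("ctrl-rack1", root)
print(local.handle({"type": "flow-stats", "scope": "local"}))
print(local.handle({"type": "elephant-flow-reroute", "scope": "global"}))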
[Stack diagram: Kandoo and ONIX added to the SDN layer.]
Big Data Stack: SDN+Big-Data
• FlowComb:
  – Detect application patterns and have the SDN controller assign paths based on its knowledge of traffic patterns and contention
• Sinbad:
  – HDFS writes are important
  – Let the SDN controller tell HDFS the best place to write data, based on its knowledge of n/w congestion
[Stack diagram: FlowComb and Sinbad added to the SDN layer.]
Big Data Stack: Distributed Storage
• Ideal: nice API, low latency, scalable
• Problem: H/W fails a lot, sits in limited locations, and has limited resources
• Partition: gives good performance
  – Cassandra: use consistent hashing (see the ring sketch below)
  – Megastore: each partition == an RDBMS with good consistency guarantees
• Replicate: multiple copies tolerate failures
  – Megastore: replicas allow for low latency
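A minimal consistent-hashing ring in the Cassandra style (simplified: no virtual nodes or replication): each key hashes to a point on the ring and is owned by the first node clockwise from it, so adding a node only remaps one arc of keys.

import bisect
import hashlib

def _h(s):
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    """Simplified consistent-hashing ring."""
    def __init__(self, nodes):
        self.points = sorted((_h(n), n) for n in nodes)

    def owner(self, key):
        keys = [p for p, _ in self.points]
        i = bisect.bisect(keys, _h(key)) % len(self.points)
        return self.points[i][1]

ring = Ring(["node-a", "node-b", "node-c"])
print(ring.owner("user:42"))
# Adding "node-d" only remaps the keys on the arc it takes over,
# unlike hash(key) % N, which reshuffles nearly everything.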
[Stack diagram: Megastore and Cassandra added to the Storage layer.]
Big Data Stack: Disk locality Irrelevant
• Disk locality is becoming irrelevant
  – Data is getting smaller (compressed), so transfer times are smaller
  – Networks are getting much faster (rack-local is only 8% slower than local)
• Memory locality is the new challenge
  – Inputs for 94% of jobs fit in memory
  – Need new caching + prefetching schemes
[Stack diagram: FDS added to the Storage layer, annotated "disk-locality irrelevant".]
[Final slide: the complete big-data stack]
• App: Hadoop, Dryad, FlumeJava, DryadLINQ, Hadoop Online, Spark
• Sharing: Mesos, Omega
• Virt Drawbacks: BobTail, RFA, cloud gaming
• N/W Paradigm: VL2, PortLand, Helios, c-Through, Hedera, MicroTE
• N/W Sharing: ElasticSwitch, Orchestra, Hull
• Tail Latency: Dolly (clones), Mantri
• SDN: Kandoo, ONIX, FlowComb, Sinbad
• Storage: Megastore, Cassandra, FDS (disk locality irrelevant)