HBase Low Latency Nick Dimiduk, Hortonworks (@xefyr)
Nicolas Liochon, Scaled Risk (@nkeywal)
HBaseCon May 5, 2014
Agenda
• Latency, what is it, how to measure it
• Write path
• Read path
• Next steps
What’s low latency
• Latency is about percentiles
• Average != 50th percentile
• There is often an order of magnitude between the "average" and the "95th percentile"
• Post-99% = the "magical 1%". Work in progress here.
• "Low latency" means anything from microseconds (high-frequency trading) to seconds (interactive queries)
• In this talk: milliseconds
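A tiny, illustrative sketch of why the average is a poor summary: one slow outlier drags the mean far above the median, while the 50th and even 99th percentiles stay low. The class and percentile method here are hypothetical helpers, not from HBase.

```java
import java.util.Arrays;

public class LatencyPercentiles {
    // Return the value at percentile p (0-100) from a sorted copy of samples.
    static double percentile(double[] samples, double p) {
        double[] sorted = samples.clone();
        Arrays.sort(sorted);
        int idx = (int) Math.ceil(p / 100.0 * sorted.length) - 1;
        return sorted[Math.max(idx, 0)];
    }

    public static void main(String[] args) {
        // 99 fast requests and one slow outlier: the mean is dragged up,
        // while the median (50th percentile) stays at 1 ms.
        double[] latencies = new double[100];
        Arrays.fill(latencies, 1.0);   // 1 ms requests
        latencies[99] = 500.0;         // one 500 ms outlier
        double mean = Arrays.stream(latencies).average().orElse(0);
        System.out.printf("mean=%.2f p50=%.2f p99=%.2f%n",
            mean, percentile(latencies, 50), percentile(latencies, 99));
    }
}
```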
Measure latency
bin/hbase org.apache.hadoop.hbase.PerformanceEvaluation
• More HBase-specific options: autoflush, replicas, …
• Latency measured in microseconds
• Easier for internal analysis
YCSB: Yahoo! Cloud Serving Benchmark
• Useful for comparisons between databases
• Comes with a set of predefined workloads
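As a sketch, invocations of the two tools might look like the following. Exact flags and workload names vary by HBase and YCSB version (check each tool's usage output), and the YCSB line assumes its HBase binding is installed; treat both as illustrative, not exact.

```shell
# Bundled PerformanceEvaluation from a single client (no MapReduce),
# e.g. random reads from 10 client threads.
bin/hbase org.apache.hadoop.hbase.PerformanceEvaluation --nomapred randomRead 10

# A YCSB comparison run against one of its predefined workloads.
bin/ycsb run hbase -P workloads/workloada -p columnfamily=f1
```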
Write path
• Two parts
• Single put (WAL): the client just sends the put
• Multiple puts from the client (new behavior since 0.96): the client is much smarter
• Four stages to look at for latency
• Start (establish TCP connections, etc.)
• Steady: when expected conditions are met
• Machine failure: expected as well
• Overloaded system
Single put: communication & scheduling
• Client: one TCP connection to the server
• Shared: multiple threads on the same client use the same TCP connection
• Pooling is possible and does improve performance in some circumstances
• hbase.client.ipc.pool.size
• Server: multiple calls from multiple threads on multiple machines
• Can become thousands of simultaneous queries
• Scheduling is required
Single put: real work
• The server must
• Write into the WAL queue
• Sync the WAL queue (HDFS flush)
• Write into the memstore
• The WAL queue is shared between all the regions/handlers
• The sync is skipped if another handler already did the work
• You may flush more than expected
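The "sync is skipped if another handler already did the work" point is a group-commit pattern. A minimal sketch of the idea, with illustrative names (this is not HBase's actual FSHLog code): each handler takes a sequence number on append, and only pays for the expensive sync if no other handler's sync has already covered it.

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch of the group-commit idea behind the shared WAL queue.
public class GroupCommitWal {
    private final AtomicLong appendSeq = new AtomicLong();  // last appended edit
    private final AtomicLong syncedSeq = new AtomicLong();  // last durable edit
    private final Object syncLock = new Object();
    int syncs;  // how many real syncs happened (for illustration)

    long append(byte[] edit) {
        // enqueue the edit (elided) and take a sequence number
        return appendSeq.incrementAndGet();
    }

    void syncUpTo(long seq) {
        if (syncedSeq.get() >= seq) {
            return;  // another handler's sync already made our edit durable
        }
        synchronized (syncLock) {
            if (syncedSeq.get() >= seq) {
                return;  // covered while we waited for the lock
            }
            long target = appendSeq.get();  // sync everything appended so far
            // hdfsFlush(target);  // the expensive HDFS hflush would go here
            syncs++;
            syncedSeq.set(target);
        }
    }
}
```

One sync can make many handlers' edits durable at once, which is exactly why the shared queue helps latency under concurrency.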
Single put: a small run
Percentile    Time in ms
Mean          1.21
50%           0.95
95%           1.50
99%           2.12
Latency sources
• Candidate one: the network
• 0.5ms within a datacenter
• Much less between nodes in the same rack

Percentile    Time in ms
Mean          0.13
50%           0.12
95%           0.15
99%           0.47
Latency sources
• Candidate two: HDFS Flush
• We can still do better: HADOOP-7714 & sons.

Percentile    Time in ms
Mean          0.33
50%           0.26
95%           0.59
99%           1.24
Latency sources
• Millisecond world: everything can go wrong
• JVM
• Network
• OS scheduler
• File system
• All of this goes into the post-99% percentile
• Requires monitoring
• Usually, using the latest version helps.
Latency sources
• Splits (and presplits)
• Autosharding is great!
• Puts have to wait
• Impact: seconds
• Balancing
• Regions move
• Triggers a retry on the client
• hbase.client.pause = 100ms since HBase 0.96
• Garbage collection
• Impact: tens of ms, even with a good config
• Covered in the read-path part of this talk
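When a region moves, the client's retry waits are derived from hbase.client.pause multiplied by a backoff table. A sketch of that behavior, assuming multipliers along the lines of HBase's (the exact table varies by version, so treat these numbers as illustrative):

```java
public class RetryBackoff {
    // Multipliers similar in spirit to HBase's client retry backoff table;
    // the exact values are version-dependent.
    static final int[] BACKOFF = {1, 2, 3, 5, 10, 20, 40, 100};
    static final long PAUSE_MS = 100;  // hbase.client.pause default since 0.96

    // Sleep time before the given retry attempt (0-based), capped at the
    // last multiplier.
    static long sleepBeforeRetry(int attempt) {
        int idx = Math.min(attempt, BACKOFF.length - 1);
        return PAUSE_MS * BACKOFF[idx];
    }
}
```

Early retries are cheap (100-200ms), which is why a simple region move only costs the client a short pause rather than a timeout.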
From steady to loaded and overloaded
• The number of concurrent tasks is a function of
• the number of cores
• the number of disks
• the number of remote machines used
• Difficult to estimate
• Queues are doomed to happen
• hbase.regionserver.handler.count
• So, for low latency
• Pluggable scheduler since HBase 0.98 (HBASE-8884); requires specific code
• RPC priorities: work in progress (HBASE-11048)
From loaded to overloaded
• The MemStore takes too much room: flush, then block, quite quickly
• hbase.regionserver.global.memstore.size.lower.limit
• hbase.regionserver.global.memstore.size
• hbase.hregion.memstore.block.multiplier
• Too many HFiles: block until compactions keep up
• hbase.hstore.blockingStoreFiles
• Too many WAL files: flush and block
• hbase.regionserver.maxlogs
Machine failure
• Failure = detect + reallocate + replay WAL
• Replaying the WAL is NOT required for puts
• hbase.master.distributed.log.replay (default true in 1.0)
• Failure = detect + reallocate + retry
• That's in the range of ~1s for simple failures
• Silent failures put you in the 10s range if the hardware does not help
• zookeeper.session.timeout
Single puts
• Millisecond range
• Spikes do happen in steady mode
• ~100ms
• Causes: GC, load, splits
Streaming puts
HTable#setAutoFlushTo(false)
HTable#put
HTable#flushCommits
• Same as single puts, but
• Puts are grouped and sent in the background
• Load is taken into account
• Does not block
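The non-blocking behavior can be sketched in plain Java, without the HBase client: put() only appends to a buffer, and a background thread sends grouped batches, so the caller never waits on a region server. All names here (BufferedPutter, send) are illustrative stand-ins, not HBase internals.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Sketch of the streaming-put idea: buffer in the caller, flush in background.
public class BufferedPutter {
    private final List<String> buffer = new ArrayList<>();
    private final List<String> sent = new ArrayList<>();
    private final ExecutorService flusher = Executors.newSingleThreadExecutor();

    synchronized void put(String row) {
        buffer.add(row);                // returns immediately, no RPC
    }

    synchronized void flushCommits() {  // explicit flush, like HTable#flushCommits
        List<String> batch = new ArrayList<>(buffer);
        buffer.clear();
        flusher.submit(() -> send(batch));
    }

    private synchronized void send(List<String> batch) {
        sent.addAll(batch);             // the grouped multi-put RPC would go here
    }

    synchronized int sentCount() { return sent.size(); }

    void close() {
        flusher.shutdown();
        try {
            flusher.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```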
Mul>ple puts
hbase.client.max.total.tasks (default 100)
hbase.client.max.perserver.tasks (default 5)
hbase.client.max.perregion.tasks (default 1)
• Decouples the client from a latency spike on a single region server
• Increases throughput by 50% compared to the old multiput
• Makes splits and GC more transparent
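The per-server limit is what provides the decoupling: a slow server only exhausts its own task budget, so puts destined for healthy servers keep flowing. A minimal sketch with plain semaphores (illustrative class, only total and per-server limits modeled; the real client also limits per region):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

// Sketch of the client-side task limits quoted above.
public class ClientTaskLimiter {
    private final Semaphore total = new Semaphore(100);  // max.total.tasks
    private final ConcurrentHashMap<String, Semaphore> perServer =
        new ConcurrentHashMap<>();

    boolean trySubmit(String server) {
        Semaphore s = perServer.computeIfAbsent(server, k -> new Semaphore(5));
        if (!total.tryAcquire()) return false;            // global budget spent
        if (!s.tryAcquire()) { total.release(); return false; }  // server busy
        return true;  // caller may now send a task to this server
    }

    void done(String server) {
        perServer.get(server).release();
        total.release();
    }
}
```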
Conclusion on write path
• Single puts can be very fast
• It's not a "hard real-time" system: there are spikes
• Most latency spikes can be hidden by streaming puts
• Failures are NOT that difficult for the write path
• No WAL to replay
And now for the read path
Read path
• Gets/short scans are assumed for low-latency operations
• Again, two APIs
• Single get: HTable#get(Get)
• Multi-get: HTable#get(List<Get>)
• Four stages, same as the write path
• Start (TCP connection, …)
• Steady: when expected conditions are met
• Machine failure: expected as well
• Overloaded system: you may need to add machines or tune your workload
Multi get / Client
Group Gets by RegionServer
Execute them one by one
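The grouping step above can be sketched in plain Java: bucket each row key by the server hosting its region, then issue one batch RPC per server. The locator function stands in for the client's region-location cache; the class and method names are illustrative, not HBase's.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Sketch of the client-side "group Gets by RegionServer" step.
public class MultiGetGrouper {
    static Map<String, List<String>> groupByServer(
            List<String> rows, Function<String, String> locator) {
        Map<String, List<String>> byServer = new HashMap<>();
        for (String row : rows) {
            byServer.computeIfAbsent(locator.apply(row), k -> new ArrayList<>())
                    .add(row);
        }
        return byServer;  // one batch RPC per key of this map
    }
}
```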
Multi get / Server
Access latency magnitudes: the storage hierarchy, a different view
A bumpy ride that has been getting bumpier over time
Dean/2009
Memory is 100000x faster than disk!
Disk seek = 10ms
Known unknowns
• For each candidate HFile
• Exclude by file metadata
• Timestamp
• Rowkey range
• Exclude by bloom filter
StoreFileScanner#shouldUseScanner()
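The value of the bloom-filter check is that a negative answer is definitive: the row is certainly not in that HFile, so the file can be excluded without a disk seek. A toy filter illustrating the contract (not HBase's implementation, which lives in its bloom classes):

```java
import java.util.BitSet;

// Toy Bloom filter: "no" means definitely absent, "yes" means maybe present.
public class BloomSketch {
    private final BitSet bits = new BitSet(1024);

    void add(String rowKey) {
        for (int seed = 0; seed < 3; seed++) bits.set(index(rowKey, seed));
    }

    boolean mightContain(String rowKey) {
        for (int seed = 0; seed < 3; seed++) {
            if (!bits.get(index(rowKey, seed))) return false;  // skip this HFile
        }
        return true;  // maybe present: must actually read the file
    }

    private int index(String key, int seed) {
        return Math.floorMod(key.hashCode() * 31 + seed * 0x9E3779B9, 1024);
    }
}
```

False positives only cost an unnecessary read; false negatives never happen, which is what makes the exclusion safe.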
Unknown knowns
• Merge-sort results polled from the Stores
• Seek each scanner to a reference KeyValue
• Retrieve candidate data from disk
• Multiple HFiles => multiple seeks
• hbase.storescanner.parallel.seek.enable=true
• Short Circuit Reads • dfs.client.read.shortcircuit=true
• Block locality • Happy clusters compact!
HFileBlock#readBlockData()
BlockCache
• Reuse previously read data • Maximize cache hit rate
• Larger cache
• Temporal access locality
• Physical access locality
BlockCache#getBlock()
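The on-heap cache's behavior can be sketched with a LinkedHashMap in access order: recently read blocks stay, and the least recently used block is evicted when the cache is over capacity. This is only a minimal illustration; the real LruBlockCache is size-based with single/multi/in-memory priority tiers.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal LRU sketch of what an on-heap block cache does.
public class LruCacheSketch<K, V> extends LinkedHashMap<K, V> {
    private final int maxBlocks;

    LruCacheSketch(int maxBlocks) {
        super(16, 0.75f, true);        // access-order: gets count as "use"
        this.maxBlocks = maxBlocks;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxBlocks;     // evict the LRU block when over capacity
    }
}
```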
BlockCache Showdown
• LruBlockCache
• Default, on-heap
• Quite good most of the time
• Evictions impact GC
• BucketCache
• Off-heap alternative
• Serialization overhead
• Large memory configurations
http://www.n10k.com/blog/blockcache-showdown/
L2 off-heap BucketCache makes a strong showing
Latency enemies: Garbage Collection
• Use the heap. Not too much. With CMS.
• Max heap
• 30GB (compressed pointers)
• 8-16GB if you care about 9's
• Healthy cluster load
• regular, reliable collections
• 25-100ms pauses at regular intervals
• An overloaded RegionServer suffers GC overmuch
Off-heap to the rescue?
• BucketCache (0.96, HBASE-7404)
• Network interfaces (HBASE-9535)
• MemStore et al (HBASE-10191)
Latency enemies: Compactions
• Fewer HFiles => fewer seeks
• Evict data blocks!
• Evict index blocks!!
• hfile.block.index.cacheonwrite
• Evict bloom blocks!!!
• hfile.block.bloom.cacheonwrite
• OS buffer cache to the rescue
• Compacted data is still fresh
• Better than going all the way back to disk
Failure
• Detect + Reassign + Replay • Strong consistency requires replay
• Locality drops to 0 • Cache starts from scratch
Hedging our bets
• HDFS hedged reads (2.4, HDFS-5776)
• Reads on secondary DataNodes
• Strongly consistent
• Works at the HDFS level
• Timeline consistency (HBASE-10070)
• Reads on "region replicas"
• Not strongly consistent
Read latency in summary
• Steady mode
• Cache hit: < 1 ms
• Cache miss: +10 ms per seek
• Writing while reading => cache churn
• GC: 25-100ms pauses at regular intervals
Expected latency = network request + (1 - P(cache hit)) * (10 ms * seeks)
• Same long-tail issues as the write path
• Overloaded: same scheduling issues as the write path
• Partial failures hurt a lot
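The formula above is easy to turn into a back-of-the-envelope calculator; the class and parameter names here are only illustrative:

```java
// Expected read latency = network round trip
//                       + miss probability * (per-seek cost * number of seeks).
public class ReadLatencyModel {
    static double expectedMs(double networkMs, double cacheHitRate,
                             int seeks, double seekMs) {
        return networkMs + (1 - cacheHitRate) * (seekMs * seeks);
    }
}
```

For example, with a 0.5ms network round trip, a 95% cache hit rate, and one 10ms seek on a miss, the expected read latency comes to about 1ms, which matches the "cache hit: < 1 ms, cache miss: +10 ms per seek" summary.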
HBase ranges for 99% latency
           Put           Streamed multiput   Get           Timeline get
Steady     milliseconds  milliseconds        milliseconds  milliseconds
Failure    seconds       seconds             seconds       milliseconds
GC         10's of ms    milliseconds        10's of ms    milliseconds
What’s next
• Less GC
• Use fewer objects
• Off-heap
• Compressed BlockCache (HBASE-8894)
• Preferred location (HBASE-4755)
• The "magical 1%"
• Most tools stop at the 99% latency
• What happens after that is much more complex
Thanks! Nick Dimiduk, Hortonworks (@xefyr)
Nicolas Liochon, Scaled Risk (@nkeywal)
HBaseCon May 5, 2014