
HBase Low Latency, StrataNYC 2014

Description:
We start by looking at distributed database features that impact latency. Then we take a deeper look at the HBase read and write paths with a focus on request latency. We examine the sources of latency and how to minimize them.
Transcript
1. HBase Low Latency: Nick Dimiduk, Hortonworks (@xefyr); Nicolas Liochon, Scaled Risk (@nkeywal). Strata New York, October 17, 2014.

2. Agenda: latency, what it is and how to measure it; the write path; the read path; next steps.

3. What's low latency? "Low latency" can mean anything from microseconds (high-frequency trading) to seconds (interactive queries); in this talk it means milliseconds. Latency is about percentiles: the average is not the 50th percentile, and there are often orders of magnitude between the average and the 95th percentile. Post-99% is the magical 1%, and that is still work in progress.

4. Measuring latency: YCSB, the Yahoo! Cloud Serving Benchmark, is useful for comparisons between databases and ships with a set of predefined workloads. bin/hbase org.apache.hadoop.hbase.PerformanceEvaluation offers more HBase-specific options (autoflush, replicas, ...), measures latency in microseconds, and is easier for internal analysis.

5. Why is it important: durability, availability, consistency.

6. Durability (diagram): a spectrum of durability levels, from 0 (client buffer) through 1 (server buffer: HBase, BigTable) and 2 (OS buffer, on GFS) to 3 (disk: traditional DB engine).

7. Durability.

8. Consistency: two processes, P1 and P2; a counter is updated by P1 to v1, then v2, then v3. Eventual consistency allows P1 and P2 to see these events in any order; strong consistency allows only one order. From the Google F1 paper, VLDB (2013): "We store financial data and have hard requirements on data integrity and consistency. We also have a lot of experience with eventual consistency systems at Google. In all such systems, we find developers spend a significant fraction of their time building extremely complex and error-prone mechanisms to cope with eventual consistency and handle data that may be out of date. We think this is an unacceptable burden to place on developers and that consistency problems should be solved at the database level."

9. Consistency: the BigTable design allows consistency by partitioning the data; each machine serves a subset of the data.

10. Availability: the contract is that a client outside the cluster sees HBase as available even if there are partitions or failures within the HBase cluster. There is a lot more to say, but it is outside the scope of this talk (unfortunately).

11. Availability: a partition or a machine failure appears to the client as a latency spike.

12. Trade-offs: maximizing the benefits while minimizing the cost. Implementation details count. Configuration counts.

13. Write path: two parts. Single put (WAL), where the client just sends the put, and multiple puts from the client (new behavior since 0.96), where the client is much smarter. Four stages to look at for latency: start (establishing TCP connections, etc.); steady, when expected conditions are met; machine failure, which is expected as well; and an overloaded system. (A client-side sketch of the single-put call follows slide 18 below.)

14. Single put, communication and scheduling: on the client, a TCP connection to the server, shared by the multiple threads of the same client; pooling is possible and does improve performance in some circumstances (hbase.client.ipc.pool.size). On the server, multiple calls from multiple threads on multiple machines can become thousands of simultaneous queries, so scheduling is required.

15. Single put, real work: the server must write into the WAL queue, sync the WAL queue (HDFS flush), and write into the memstore. The WAL queue is shared between all the regions/handlers; the sync is avoided if another handler already did the work; you may flush more than expected.

16. Simple put, a small run (time in ms): mean 1.21, 50th percentile 0.95, 95th percentile 1.50, 99th percentile 2.12.

17. Latency sources, candidate one, the network: about 0.5 ms within a datacenter, and much less between nodes in the same rack. Measured (time in ms): mean 0.13, 50th percentile 0.12, 95th percentile 0.15, 99th percentile 0.47.

18. Latency sources, candidate two, the HDFS flush (time in ms): mean 0.33, 50th percentile 0.26, 95th percentile 0.59, 99th percentile 1.24. We can still do better: HADOOP-7714 & sons.
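To make the single-put path above concrete, here is a minimal sketch using the HBase 1.x Java client (the deck targets 0.96-0.98, whose HTable API is equivalent). The table name "t", column family "f", qualifier "q", and pool size are placeholder assumptions, not values from the slides; SYNC_WAL is the durability level whose WAL append and HDFS sync are measured on slides 16-18.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.util.Bytes;

    public class SinglePutSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Optional client-side connection pooling from slide 14 (placeholder value).
        conf.setInt("hbase.client.ipc.pool.size", 3);
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("t"))) {
          Put put = new Put(Bytes.toBytes("row-1"));
          put.addColumn(Bytes.toBytes("f"), Bytes.toBytes("q"), Bytes.toBytes("v"));
          // SYNC_WAL (the default): the server appends to the shared WAL queue,
          // syncs it to HDFS, then writes to the memstore (slide 15).
          put.setDurability(Durability.SYNC_WAL);
          table.put(put); // one RPC; roughly 1-2 ms in the steady case of slide 16
        }
      }
    }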
19. Latency sources: in the millisecond world, everything can go wrong: the JVM, the network, the OS scheduler, the file system. All of this goes into the post-99% percentile and requires monitoring. Usually, running the latest versions helps.

20. Latency sources: splits (and presplits): autosharding is great, but puts have to wait; the impact is seconds. Balancing: regions move, which triggers a retry in the client; hbase.client.pause = 100 ms since HBase 0.96. Garbage collection: the impact is tens of milliseconds even with a good configuration; covered with the read path in this talk.

21. From steady to loaded and overloaded: the number of concurrent tasks is a factor of the number of cores, the number of disks, and the number of remote machines used, and is difficult to estimate. Queues are doomed to happen (hbase.regionserver.handler.count). So, for low latency: a pluggable scheduler since HBase 0.98 (HBASE-8884), which requires specific code, and RPC priorities, also since 0.98 (HBASE-11048).

22. From loaded to overloaded: the MemStore takes too much room: flush, then block quite quickly (hbase.regionserver.global.memstore.size.lower.limit, hbase.regionserver.global.memstore.size, hbase.hregion.memstore.block.multiplier). Too many HFiles: block until compactions keep up (hbase.hstore.blockingStoreFiles). Too many WAL files: flush and block (hbase.regionserver.maxlogs).

23. Machine failure: failure = detect + reallocate + replay the WAL. Replaying the WAL is NOT required for puts: with hbase.master.distributed.log.replay (default true in 1.0), failure = detect + reallocate + retry. That is in the range of ~1 s for simple failures; silent failures put you in the 10 s range if the hardware does not help (zookeeper.session.timeout).

24. Single puts: millisecond range. Spikes do happen in steady mode, around 100 ms; the causes are GC, load, and splits.

25. Streaming puts: HTable#setAutoFlushTo(false), HTable#put, HTable#flushCommits. Like simple puts, but the puts are grouped and sent in the background, the load is taken into account, and the call does not block.

26. Multiple puts: hbase.client.max.total.tasks (default 100), hbase.client.max.perserver.tasks (default 5), hbase.client.max.perregion.tasks (default 1). This decouples the client from a latency spike on a region server, increases throughput by 50% compared to the old multiput, and makes splits and GC more transparent.

27. Conclusion on the write path: single puts can be very fast, but it is not a hard real-time system: there are spikes. Most latency spikes can be hidden when streaming puts. Failures are NOT that difficult for the write path: there is no WAL to replay.

28. And now for the read path.

29. Read path: gets and short scans are assumed for low-latency operations. Again, two APIs: single get, HTable#get(Get), and multi-get, HTable#get(List) (see the sketch after slide 35 below). Four stages, the same as the write path: start (TCP connection, ...); steady, when expected conditions are met; machine failure, expected as well; and an overloaded system, where you may need to add machines or tune your workload.

30. Multi-get, client side (diagram): group the Gets by RegionServer, then execute them one by one.

31. Multi-get, server side (diagram).

32. Multi-get, server side (diagram).

33. Access latency magnitudes, the storage hierarchy from a different view (Dean, 2009): memory is 100,000x faster than disk! A disk seek is 10 ms.

34. Known unknowns: for each candidate HFile, exclude by file metadata (timestamp, rowkey range) and exclude by bloom filter. StoreFileScanner#shouldUseScanner().

35. Unknown knowns: merge-sort results polled from the Stores; seek each scanner to a reference KeyValue; retrieve candidate data from disk. Multiple HFiles mean multiple seeks (hbase.storescanner.parallel.seek.enable=true). Short-circuit reads: dfs.client.read.shortcircuit=true. Block locality: happy clusters compact! HFileBlock#readBlockData().
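Here is the multi-get sketch referenced on slide 29, again a minimal example assuming the HBase 1.x client API; the table name "t", column family "f", qualifier "q", and the batch of 100 rows are placeholders. The client groups the Gets by region server (slide 30), and each server resolves its batch against the memstore and the candidate HFiles as described on slides 34-35.

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.util.Bytes;

    public class MultiGetSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("t"))) {
          List<Get> gets = new ArrayList<>();
          for (int i = 0; i < 100; i++) {
            gets.add(new Get(Bytes.toBytes("row-" + i)));
          }
          // One call: the client batches the Gets by region server; the results
          // come back in the same order as the requests.
          Result[] results = table.get(gets);
          for (Result r : results) {
            byte[] value = r.getValue(Bytes.toBytes("f"), Bytes.toBytes("q"));
            // ... use value
          }
        }
      }
    }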
36. BlockCache: reuse previously read data and maximize the cache hit rate: a larger cache, temporal access locality, physical access locality. BlockCache#getBlock().

37. BlockCache showdown: LruBlockCache is the default, on-heap, quite good most of the time, but evictions impact GC. BucketCache is the off-heap alternative, with serialization overhead, for large memory configurations. See http://www.n10k.com/blog/blockcache-showdown/ : the L2 off-heap BucketCache makes a strong showing.

38. Latency enemies, garbage collection: use heap, but not too much, and with CMS. Max heap 30 GB (compressed pointers), 8-16 GB if you care about the 9s. A healthy cluster load gives regular, reliable collections with 25-100 ms pauses at regular intervals; an overloaded RegionServer suffers GC overmuch.

39. Off-heap to the rescue? BucketCache (0.96, HBASE-7404), network interfaces (HBASE-9535), MemStore et al. (HBASE-10191).

40. Latency enemies, compactions: fewer HFiles mean fewer seeks, but compactions evict data blocks! They evict index blocks!! (hfile.block.index.cacheonwrite) They evict bloom blocks!!! (hfile.block.bloom.cacheonwrite) The OS buffer cache comes to the rescue: compacted data is still fresh there, which is better than going all the way back to disk.

41. Failure: detect + reassign + replay. Strong consistency requires replay; locality drops to 0 and the cache starts from scratch.

42. Hedging our bets: HDFS hedged reads (2.4, HDFS-5776) are reads on secondary DataNodes; strongly consistent, working at the HDFS level. Timeline consistency (HBASE-10070) is reads on replica regions; not strongly consistent (see the sketch after slide 47 below).

43. Read latency in summary: in steady mode, a cache hit is < 1 ms and a cache miss adds about 10 ms per seek; writing while reading causes cache churn; GC gives 25-100 ms pauses at regular intervals. Roughly: latency = network request + (1 - P(cache hit)) * (10 ms * seeks). The same long-tail issues as the write path apply; when overloaded, the same scheduling issues as the write path; partial failures hurt a lot.

44. HBase ranges for 99% latency:
             Put                  Streamed multiput    Get                  Timeline get
    Steady   milliseconds         milliseconds         milliseconds         milliseconds
    Failure  seconds              seconds              seconds              milliseconds
    GC       10s of milliseconds  milliseconds         10s of milliseconds  milliseconds

45. What's next: less GC (use fewer objects, go off-heap), a compressed BlockCache (HBASE-11331), preferred locations (HBASE-4755), and the magical 1%: most tools stop at the 99% latency, and what happens after that is much more complex.

46. Performance with Compressed BlockCache (chart): total RAM 24G, LruBlockCache size 12G, data size 45G, compressed size 11G, compression SNAPPY. The chart shows the times improvement, enabled vs. disabled, for throughput (ops/sec), latency (ms, p95), latency (ms, p99), and CPU load.

47. Thanks! Nick Dimiduk, Hortonworks (@xefyr); Nicolas Liochon, Scaled Risk (@nkeywal). Strata New York, October 17, 2014.
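Finally, a minimal sketch of the timeline-consistent read from slide 42 (HBASE-10070), assuming an HBase 1.x client and a table that was created with region replication greater than 1; the table name "t", column family "f", and qualifier "q" are placeholders.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.util.Bytes;

    public class TimelineGetSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("t"))) {
          Get get = new Get(Bytes.toBytes("row-1"));
          // TIMELINE lets a secondary replica answer when the primary is slow or
          // down: lower tail latency during failures, possibly stale data.
          get.setConsistency(Consistency.TIMELINE);
          Result result = table.get(get);
          if (result.isStale()) {
            // Served by a replica region; the value may lag the primary.
          }
          byte[] value = result.getValue(Bytes.toBytes("f"), Bytes.toBytes("q"));
        }
      }
    }

Whether stale answers are acceptable is an application decision; reads that must be strongly consistent simply keep the default, Consistency.STRONG.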

