Post on 13-Apr-2017
Juggling with Bits and Bytes: How Apache Flink operates on binary data
Fabian Hueske
fhueske@apache.org @fhueske
Big Data frameworks on JVMs
• Many (open source) Big Data frameworks run on JVMs
– Hadoop, Drill, Spark, Hive, Pig, and ...
– Flink as well
• Common challenge: How to organize data in-memory?
– In-memory processing (sorting, joining, aggregating)
– In-memory caching of intermediate results
• Memory management of a system influences
– Reliability
– Resource efficiency, performance & performance predictability
– Ease of configuration
The straight-forward approach
Store and process data as objects on the heap
• Put objects in an array and sort it
A few notable drawbacks
• Predicting memory consumption is hard
– If you fail, an OutOfMemoryError will kill you!
• High garbage collection overhead
– Easily 50% of time spent on GC
• Objects have considerable space overhead
– At least 8 bytes for each (nested) object! (Depends on arch)
FLINK’S APPROACH
Flink adopts DBMS technology
• Allocates fixed number of memory segments upfront
• Data objects are serialized into memory segments
• DBMS-style algorithms work on binary representation
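A minimal sketch of the upfront-allocation idea, assuming plain byte arrays as segments. The class `SegmentPool` and its methods are hypothetical names for illustration, not Flink's API.

```java
import java.util.ArrayDeque;

/** Hypothetical pool of fixed-size segments; illustrates the idea only. */
final class SegmentPool {
    private final ArrayDeque<byte[]> free = new ArrayDeque<>();
    private final int total;

    SegmentPool(int numSegments, int segmentSize) {
        this.total = numSegments;
        // All memory is allocated once, upfront.
        for (int i = 0; i < numSegments; i++) {
            free.push(new byte[segmentSize]);
        }
    }

    /** Returns a free segment, or null if the pool is exhausted. */
    byte[] request() {
        return free.poll();
    }

    /** Hands a segment back for reuse instead of leaving it to the GC. */
    void release(byte[] segment) {
        free.push(segment);
    }

    int available() {
        return free.size();
    }

    int used() {
        return total - free.size();
    }
}
```

Because the pool never grows, used and available memory can be read off at any time, and an exhausted pool is an explicit signal rather than an OutOfMemoryError.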
Why is that good?
• Memory-safe execution
– Used and available memory segments are easy to count
– No parameter tuning for reliable operations!
• Efficient out-of-core algorithms
– Memory segments can be efficiently written to disk
• Reduced GC pressure
– Memory segments are off-heap or never deallocated
– Data objects are short-lived or reused
• Space-efficient data representation
• Efficient operations on binary data
What does it cost?
• Significant implementation investment
– Using java.util.HashMap vs.
– Implementing a spillable hash table backed by byte arrays and a custom serialization stack
• Other systems use similar techniques
– Apache Drill, Apache AsterixDB (incubating)
• Apache Spark evolves in a similar direction
MEMORY ALLOCATION
Memory segments
• Unit of memory distribution in Flink
– Fixed number allocated when worker starts
• Backed by a regular byte array (default 32 KB)
• On-heap or off-heap allocation
• R/W access through Java’s efficient unsafe methods
• Multiple memory segments can be logically concatenated to a larger chunk of memory
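Logical concatenation can be sketched as address arithmetic: a global position maps to a segment index and an offset inside that segment. The class `SegmentedView` is a hypothetical illustration; it uses plain array indexing rather than the unsafe-based access mentioned above.

```java
/** Hypothetical view over several equally sized segments as one address space. */
final class SegmentedView {
    private final byte[][] segments;
    private final int segmentSize;

    SegmentedView(int numSegments, int segmentSize) {
        this.segments = new byte[numSegments][segmentSize];
        this.segmentSize = segmentSize;
    }

    void put(long pos, byte value) {
        segments[(int) (pos / segmentSize)][(int) (pos % segmentSize)] = value;
    }

    byte get(long pos) {
        return segments[(int) (pos / segmentSize)][(int) (pos % segmentSize)];
    }

    /** Writes a big-endian int that may straddle a segment boundary. */
    void putInt(long pos, int value) {
        for (int i = 0; i < 4; i++) {
            put(pos + i, (byte) (value >>> (24 - 8 * i)));
        }
    }

    /** Reads a big-endian int that may straddle a segment boundary. */
    int getInt(long pos) {
        int v = 0;
        for (int i = 0; i < 4; i++) {
            v = (v << 8) | (get(pos + i) & 0xFF);
        }
        return v;
    }

    long capacity() {
        return (long) segments.length * segmentSize;
    }
}
```

With two 8-byte segments, an int written at position 6 occupies the last two bytes of the first segment and the first two bytes of the second, yet reads back as one value.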
On-heap memory allocation
Off-heap memory allocation
On-heap vs. Off-heap
• No significant performance difference in micro-benchmarks
• Garbage Collection
– Smaller heap -> faster GC
• Faster start-up time
– A multi-GB JVM heap takes time to allocate
DATA SERIALIZATION
Custom de/serialization stack
• Many alternatives for Java object serialization
– Dynamic: Kryo
– Schema-dependent: Apache Avro, Apache Thrift, Protobufs
• But Flink has its own serialization stack
– Operating on serialized data requires knowledge of layout
– Control over layout can improve efficiency of operations
– Data types are known before execution
Rich & extensible type system
• Serialization framework requires knowledge of types
• Flink analyzes return types of functions
– Java: Reflection based type analyzer
– Scala: Compiler information + CodeGen via Macros
• Rich type system
– Atomics: Primitives, Writables, Generic types, …
– Composites: Tuples, Pojos, CaseClasses
– Extensible by custom types
Serializing a Tuple3<Integer, Double, Person>
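The figure for this slide is not part of the transcript. As a rough sketch of what a field-by-field binary layout can look like, assume `Person` is a POJO with an `int id` and a `String name` (both fields are assumptions), and assume strings are written as a 4-byte length followed by UTF-8 bytes. This is not Flink's actual wire format.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class Tuple3Layout {
    /** Hypothetical POJO; the slide does not show Person's fields. */
    static final class Person {
        final int id;
        final String name;

        Person(int id, String name) {
            this.id = id;
            this.name = name;
        }
    }

    /** Writes the three tuple fields back to back, nested fields inline. */
    static byte[] serialize(int f0, double f1, Person f2) {
        byte[] name = f2.name.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(4 + 8 + 4 + 4 + name.length);
        buf.putInt(f0);          // 4 bytes, fixed length
        buf.putDouble(f1);       // 8 bytes, fixed length
        buf.putInt(f2.id);       // 4 bytes, nested POJO field
        buf.putInt(name.length); // 4 bytes, length prefix (assumed format)
        buf.put(name);           // variable length
        return buf.array();
    }

    /** Reads only the Double field; its offset is known from the layout. */
    static double readSecondField(byte[] record) {
        return ByteBuffer.wrap(record).getDouble(4);
    }
}
```

Knowing the layout means a single field can be read at a fixed offset without deserializing the whole record.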
OPERATING ON BINARY DATA
Data processing algorithms
• Flink’s algorithms are based on RDBMS technology
– External Merge Sort, Hybrid Hash Join, Sort Merge Join, …
• Algorithms receive a budget of memory segments
– Automatic decision about budget size
– No fine-tuning of operator memory!
• Operate in-memory as long as data fits into budget
– And gracefully spill to disk if data exceeds memory
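The budget-then-spill behaviour can be sketched with a fixed byte array and a temporary file. The class `SpillingBuffer` is hypothetical and only shows the budget check; a real external sort or hash join would also sort or partition the spilled data and merge it later.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;

/** Hypothetical buffer with a fixed memory budget that spills when full. */
final class SpillingBuffer {
    private final byte[] budget;   // fixed memory budget, allocated once
    private final Path spillFile;
    private int used = 0;
    private int spills = 0;

    SpillingBuffer(int budgetBytes) {
        this.budget = new byte[budgetBytes];
        try {
            this.spillFile = Files.createTempFile("spill", ".bin");
            this.spillFile.toFile().deleteOnExit();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Adds a serialized record; spills the buffer first if it would not fit. */
    void add(byte[] record) {
        if (record.length > budget.length) {
            throw new IllegalArgumentException("record larger than budget");
        }
        if (used + record.length > budget.length) {
            spill();
        }
        System.arraycopy(record, 0, budget, used, record.length);
        used += record.length;
    }

    /** Appends the in-memory bytes to the spill file and resets the buffer. */
    private void spill() {
        try {
            Files.write(spillFile, Arrays.copyOf(budget, used),
                    StandardOpenOption.WRITE, StandardOpenOption.APPEND);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        used = 0;
        spills++;
    }

    int spillCount() {
        return spills;
    }

    int bytesInMemory() {
        return used;
    }

    long bytesOnDisk() {
        try {
            return Files.size(spillFile);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The memory footprint never exceeds the budget, however much data arrives; only the spill file grows.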
In-memory sort – Fill the sort buffer
In-memory sort – Sort the buffer
In-memory sort – Read sorted buffer
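The figures for these three steps are not in the transcript. A minimal sketch of the fill, sort, and read phases follows, assuming a data region holding full serialized records and an index region holding one fixed-length entry (4-byte key plus 4-byte pointer) per record. The class `BinarySortBuffer` and its record format are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sort buffer: records stay in place, only index entries move. */
final class BinarySortBuffer {
    private final ByteBuffer data;  // region 1: full serialized records
    private final long[] index;     // region 2: key (high 32 bits), pointer (low 32 bits)
    private int count = 0;

    BinarySortBuffer(int dataBytes, int maxRecords) {
        this.data = ByteBuffer.allocate(dataBytes);
        this.index = new long[maxRecords];
    }

    /** Fill: appends the record and its key/pointer entry; false if the buffer is full. */
    boolean write(int key, String value) {
        byte[] v = value.getBytes(StandardCharsets.UTF_8);
        if (count == index.length || data.remaining() < 8 + v.length) {
            return false;
        }
        int pointer = data.position();
        data.putInt(key).putInt(v.length).put(v);
        // A signed long comparison of the entry orders by the signed int key first.
        index[count++] = ((long) key << 32) | (pointer & 0xFFFFFFFFL);
        return true;
    }

    /** Sort: compares and moves 8-byte index entries, never the records. */
    void sort() {
        Arrays.sort(index, 0, count);
    }

    /** Read: follows the pointers in sorted order and deserializes each record. */
    List<String> readSorted() {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            int pointer = (int) index[i];
            int key = data.getInt(pointer);
            int len = data.getInt(pointer + 4);
            byte[] v = new byte[len];
            for (int j = 0; j < len; j++) {
                v[j] = data.get(pointer + 8 + j);
            }
            out.add(key + ":" + new String(v, StandardCharsets.UTF_8));
        }
        return out;
    }
}
```

Sorting touches only primitive index entries, so no objects are created or compared during the sort itself.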
SHOW ME NUMBERS!
Sort benchmark
• Task: Sort 10 million Tuple2<Integer, String> records
– String length 12 chars
• Tuple has 16 Bytes of raw data
• ~152 MB raw data
– Integers uniformly, Strings long-tail distributed
– Sort on Integer field and on String field
• Generated input provided as mutable object iterator
• Use JVM with 900 MB heap size
– Minimum size to reliably run the benchmark
Sorting methods
1. Objects-on-Heap:
– Put cloned data objects in ArrayList and use Java’s Collection sort.
– ArrayList is initialized with right size.
2. Flink-serialized (on-heap):
– Using Flink’s custom serializers.
– Integer with full binary sorting key, String with 8 byte prefix key.
3. Kryo-serialized (on-heap):
– Serialize fields with Kryo.
– No binary sorting keys, objects are deserialized for comparison.
• All implementations use a single thread
• Average execution time of 10 runs reported
• GC triggered between runs (does not go into reported time)
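The 8 byte prefix key for Strings can be sketched as follows: pack the first eight bytes into a long, compare prefixes as unsigned numbers, and fall back to a full comparison only on a tie. The class `PrefixKey` is hypothetical, and the sketch assumes ASCII strings, for which byte order and `String.compareTo` agree.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical 8-byte prefix key; assumes ASCII strings. */
final class PrefixKey {
    /** Packs the first 8 bytes of the string into a long, zero-padded. */
    static long prefix(String s) {
        byte[] b = s.getBytes(StandardCharsets.UTF_8);
        long p = 0;
        for (int i = 0; i < 8; i++) {
            p = (p << 8) | (i < b.length ? (b[i] & 0xFF) : 0);
        }
        return p;
    }

    /**
     * Compares by prefix first. The full strings stand in for the records
     * that would have to be deserialized when the prefixes are equal.
     */
    static int compare(String a, long prefixA, String b, long prefixB) {
        int c = Long.compareUnsigned(prefixA, prefixB);
        return c != 0 ? c : a.compareTo(b);
    }
}
```

Most comparisons are decided by a single long comparison; only strings sharing their first eight bytes need the slower path.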
Execution time
Garbage collection and heap usage
[Figure panels: Objects-on-heap, Flink-serialized]
Memory usage
• Breakdown: Flink serialized - Sort Integer
– 4 bytes Integer
– 12 bytes String
– 4 bytes String length
– 4 bytes pointer
– 4 bytes Integer sorting key
– 28 bytes * 10M records = 267 MB
              Object-on-heap   Flink-serialized   Kryo-serialized
Sort Integer  Approx. 700 MB   277 MB             266 MB
Sort String   Approx. 700 MB   315 MB             266 MB
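The per-record arithmetic in the breakdown can be checked with a few lines. The helper `RecordSize` is hypothetical; it shows that the 267 MB figure, and the ~152 MB of raw data in the benchmark setup, correspond to megabytes of 2^20 bytes.

```java
/** Hypothetical helper for the per-record size arithmetic on the slides. */
final class RecordSize {
    static final long RECORDS = 10_000_000L;

    /** Total size in MB (2^20 bytes) for the given bytes per record. */
    static double megabytes(long bytesPerRecord) {
        return bytesPerRecord * RECORDS / (1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        long raw = 4 + 12;                // Integer + 12-char String
        long flink = 4 + 12 + 4 + 4 + 4;  // plus length, pointer, sorting key
        System.out.printf("raw: %.1f MB%n", megabytes(raw));
        System.out.printf("Flink-serialized, sort Integer: %.1f MB%n", megabytes(flink));
    }
}
```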
Going out-of-core
• Single thread HashJoin with 4 GB memory budget
• Build side varies, Probe side 64 GB
WHAT’S NEXT?
We’re not done yet!
• Serialization layouts tailored towards operations
– More efficient operations on binary data
• Table API provides full semantics for execution
– Use code generation to operate fully on binary data
• …
Summary
• Active memory management avoids OOMErrors
• Highly efficient data serialization stack
– Facilitates operations on binary data
– Makes more data fit into memory
• DBMS-style operators operate on binary data
– High performance in-memory processing
– Graceful destaging to disk if necessary
• Read Flink’s blog:
– http://flink.apache.org/news/2015/05/11/Juggling-with-Bits-and-Bytes.html
– http://flink.apache.org/news/2015/03/13/peeking-into-Apache-Flinks-Engine-Room.html
– http://flink.apache.org/news/2015/09/16/off-heap-memory.html
http://flink.apache.org @ApacheFlink
Apache Flink