Date post: 12-Jan-2016
Tachyon: memory-speed data sharing

Haoyuan (HY) Li, Ali Ghodsi, Matei Zaharia, Scott Shenker, Ion Stoica

UC Berkeley

Memory trumps everything else

• RAM throughput increasing exponentially
• Disk throughput increasing slowly

Memory-locality key to interactive response time

Realized by many…
• Frameworks already leverage memory
  – e.g. Spark, Shark, GraphX, …

Example: Spark

• Fast in-memory data processing within a job
  – Keep only one in-memory copy, inside the JVM
  – Track lineage of operations used to derive data
  – Upon failure, use lineage to re-compute data

[Figure: lineage tracking across a chain of map, filter, join, and reduce operations]
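
The three bullets on this slide can be illustrated with a small sketch. This is not Spark's implementation; the `Dataset` class and its methods are hypothetical names chosen for illustration. Each dataset keeps a single cached copy plus a pointer to its parent and the operation that derived it, so a lost cache can be rebuilt by replaying the lineage from the original source.

```python
from typing import Callable, List, Optional


class Dataset:
    """Hypothetical dataset that remembers how it was derived (its lineage)."""

    def __init__(self,
                 source: Optional[List[int]] = None,
                 parent: Optional["Dataset"] = None,
                 op: Optional[Callable[[List[int]], List[int]]] = None):
        self.source = source    # durable input; set only for root datasets
        self.parent = parent    # dataset this one was derived from
        self.op = op            # operation applied to the parent's data
        self.cache: Optional[List[int]] = None  # the single in-memory copy

    def map(self, f: Callable[[int], int]) -> "Dataset":
        return Dataset(parent=self, op=lambda xs: [f(x) for x in xs])

    def filter(self, pred: Callable[[int], bool]) -> "Dataset":
        return Dataset(parent=self, op=lambda xs: [x for x in xs if pred(x)])

    def collect(self) -> List[int]:
        # Serve from the cache if present; otherwise replay the lineage.
        if self.cache is None:
            if self.parent is None:
                self.cache = list(self.source)
            else:
                self.cache = self.op(self.parent.collect())
        return self.cache

    def lose_cache(self) -> None:
        # Simulate a failure that wipes every cached copy in the chain.
        self.cache = None
        if self.parent is not None:
            self.parent.lose_cache()


base = Dataset(source=[1, 2, 3, 4, 5, 6])
result = base.map(lambda x: x * 10).filter(lambda x: x > 20)

first = result.collect()    # computed and cached
result.lose_cache()         # failure: all caches are gone
second = result.collect()   # recomputed from lineage, no replica needed
```

Only the lineage (parent plus operation) has to survive the failure; the data itself is never replicated.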

Challenge 1

[Figure: a Spark task and the Spark in-memory block manager (holding blocks 1 and 3) run in the same JVM process, so the execution engine and the storage engine share one process; HDFS on disk holds blocks 1 to 4.]

Challenge 1

[Figure: the Spark task crashes; the Spark in-memory block manager (blocks 1 and 3) lives in the same JVM process as the execution engine; HDFS on disk holds blocks 1 to 4.]

Challenge 1

JVM crash: lose all cache

[Figure: after the crash, the in-memory blocks are gone because the execution engine and storage engine shared one JVM process; only the HDFS copies of blocks 1 to 4 on disk remain.]

Challenge 2

JVM heap overhead: GC & duplicate memory per job

[Figure: two Spark tasks, each with its own in-memory block manager holding blocks 1 and 3; HDFS on disk holds blocks 1 to 4. Execution engine and storage engine share a JVM process (GC & duplication).]

Challenge 3

Different jobs share data: slow writes to disk

[Figure: two Spark tasks, each with its own in-memory block manager holding blocks 1 and 3; HDFS on disk holds blocks 1 to 4. Storage engine and execution engine share a JVM process (slow writes).]

Challenge 3

Different frameworks share data: slow writes to disk

[Figure: a Spark task (in-memory block manager holding blocks 1 and 3) and Hadoop MR on YARN; HDFS on disk holds blocks 1 to 4. Storage engine and execution engine share a JVM process (slow writes).]

Tachyon

Reliable data sharing at memory-speed within and across cluster frameworks/jobs

Challenge 1 revisited

[Figure: a Spark task whose in-memory block manager holds block 1 (execution engine and storage engine in the same JVM process); Tachyon, in memory, holds blocks 1, 3, and 4; HDFS on disk holds blocks 1 to 4.]

Challenge 1 revisited

[Figure: the Spark task crashes; its in-memory block manager held block 1; Tachyon, in memory, still holds blocks 1, 3, and 4; HDFS on disk holds blocks 1 to 4.]

Challenge 1 revisited

JVM crash: keep memory-cache

[Figure: after the crash, Tachyon's in-memory copies of blocks 1, 3, and 4 remain available; HDFS on disk holds blocks 1 to 4.]
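
The contrast between the two versions of Challenge 1 can be sketched as a toy simulation. Nothing here is Tachyon or Spark code: `Job`, `ExternalStore`, and `restart_and_find` are hypothetical names, and a "crash" is modelled simply by discarding the job's own cache. The point is only that a cache held outside the compute process is still there when the process restarts.

```python
class ExternalStore:
    """Stand-in for an in-memory store running outside the job's process."""

    def __init__(self):
        self.blocks = {}


class Job:
    """Stand-in for a compute process (for example, one JVM running tasks)."""

    def __init__(self, store=None):
        self.heap_cache = {}   # cache held inside the job's own process
        self.store = store     # optional external store shared across jobs

    def cache(self, block_id, data):
        if self.store is not None:
            self.store.blocks[block_id] = data
        else:
            self.heap_cache[block_id] = data

    def crash(self):
        # Everything held by the process disappears with the process.
        self.heap_cache = {}


def restart_and_find(job, block_id):
    """Simulate a crash and restart; return the block if it survived."""
    store = job.store
    job.crash()
    restarted = Job(store)
    if store is not None:
        return store.blocks.get(block_id)
    return restarted.heap_cache.get(block_id)


in_process = Job()
in_process.cache("block 1", b"data")
lost = restart_and_find(in_process, "block 1")    # cache died with the job

external = Job(ExternalStore())
external.cache("block 1", b"data")
kept = restart_and_find(external, "block 1")      # cache outlived the job
```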

Challenge 2 revisited

Off-heap memory storage: no GC & one memory copy

[Figure: two Spark tasks, one with block 1 and one with block 4 in Spark memory; Tachyon, in memory, holds blocks 1, 3, and 4; HDFS on disk holds blocks 1 to 4. Execution engine and storage engine share a JVM process (no GC or duplication).]

Challenge 3 revisited

Different frameworks share at memory-speed

[Figure: a Spark task (block 1 in Spark memory) and Hadoop MR on YARN; Tachyon, in memory, holds blocks 1, 3, and 4; HDFS on disk holds blocks 1 to 4. Execution engine and storage engine share a JVM process (fast writes).]

Tachyon and Spark

Spark’s off-JVM-heap RDD store
• In-memory RDDs (serialized)
• Fault-tolerant cache

Enables
• avoiding GC overhead
• fine-grained executors
• fast RDD sharing

Tachyon research vision

Vision
• Push lineage down to storage layer
• Use memory aggressively

Approach
• One in-memory copy
• Rely on recomputation for fault-tolerance
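
One way to picture lineage living in the storage layer rather than in the framework is the sketch below. It is an illustration under stated assumptions, not Tachyon's design: `LineageStore` and its methods are hypothetical, source files are assumed to sit in a durable under filesystem (a plain dict here), and derived files keep a single in-memory copy plus a record of the function and inputs that produced them.

```python
from typing import Callable, Dict, List, Tuple


class LineageStore:
    """Hypothetical store that keeps one in-memory copy plus lineage per file."""

    def __init__(self) -> None:
        self._durable: Dict[str, bytes] = {}   # stand-in for durable source data
        self._memory: Dict[str, bytes] = {}    # the single in-memory copy
        self._lineage: Dict[str, Tuple[Callable[..., bytes], List[str]]] = {}
        self.recomputations = 0

    def put_source(self, name: str, data: bytes) -> None:
        self._durable[name] = data

    def put_derived(self, name: str, fn: Callable[..., bytes],
                    inputs: List[str]) -> None:
        # Record how the file is produced, then materialize it in memory.
        self._lineage[name] = (fn, inputs)
        self._memory[name] = fn(*[self.read(i) for i in inputs])

    def evict(self, name: str) -> None:
        # Simulate losing the in-memory copy (failure or memory pressure).
        self._memory.pop(name, None)

    def read(self, name: str) -> bytes:
        if name in self._memory:
            return self._memory[name]
        if name in self._durable:
            return self._durable[name]
        # Not in memory and not a source: recompute from recorded lineage.
        fn, inputs = self._lineage[name]
        self.recomputations += 1
        data = fn(*[self.read(i) for i in inputs])
        self._memory[name] = data
        return data


store = LineageStore()
store.put_source("raw", b"a b a")
store.put_derived("upper", lambda d: d.upper(), ["raw"])
store.put_derived("words", lambda d: b"\n".join(d.split()), ["upper"])

store.evict("upper")
store.evict("words")
words = store.read("words")   # rebuilt by replaying "upper", then "words"
```

Because the store itself holds the lineage, any framework reading `words` gets the recomputed file without knowing how it was derived.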

Architecture

Comparison with in-memory HDFS

Further Improve Spark’s Performance

Grep

Master Faster Recovery

Open Source Status

• New release
  – V0.4.0 (Feb 2014)
  – 20 Developers (7 from Berkeley, 13 from outside)
  – 11 Companies
  – Writes go synchronously to under filesystem (No lineage information in Developer Preview release)
  – MapReduce and Spark can run without any code change (ser/de becomes the new bottleneck)
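
The "writes go synchronously to under filesystem" behaviour amounts to a write-through cache. The sketch below shows that pattern under simple assumptions: `WriteThroughStore` is a hypothetical class, and a local temporary directory stands in for the under filesystem. A write returns only after the data has been persisted, so losing the in-memory copy costs a slower read rather than the data.

```python
import os
import tempfile


class WriteThroughStore:
    """Hypothetical in-memory store that persists every write before caching."""

    def __init__(self, under_dir):
        self.under_dir = under_dir   # stand-in for the under filesystem
        self.memory = {}

    def _path(self, name):
        return os.path.join(self.under_dir, name)

    def write(self, name, data):
        # Persist first, synchronously; cache in memory only after that.
        with open(self._path(name), "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        self.memory[name] = data

    def read(self, name):
        # Fast path: memory. Slow path: reload from the under filesystem.
        if name not in self.memory:
            with open(self._path(name), "rb") as f:
                self.memory[name] = f.read()
        return self.memory[name]

    def drop_memory(self):
        # Simulate losing every in-memory copy.
        self.memory.clear()


under_dir = tempfile.mkdtemp()
store = WriteThroughStore(under_dir)
store.write("part-0", b"hello")
store.drop_memory()
recovered = store.read("part-0")   # served from the persisted copy
```

The trade-off is that every write pays the latency of the under filesystem, which is the cost that lineage-based recovery aims to avoid.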

Using HDFS vs Tachyon

• Spark
  val file = sc.textFile("hdfs://ip:port/path")

• Shark
  CREATE TABLE orders_cached AS SELECT * FROM orders;

• Hadoop MapReduce
  hadoop jar examples.jar wordcount hdfs://localhost/input hdfs://localhost/output

Using HDFS vs Tachyon

• Spark
  val file = sc.textFile("tachyon://ip:port/path")

• Shark
  CREATE TABLE orders_tachyon AS SELECT * FROM orders;

• Hadoop MapReduce
  hadoop jar examples.jar wordcount tachyon://localhost/input tachyon://localhost/output
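
For Spark and MapReduce, the two versions of this slide differ only in the URI scheme. A small helper can make that swap explicit when job paths live in configuration. This is a sketch: `with_scheme` is a hypothetical helper, and the `master:19998` authority in the second call is a made-up placeholder, included on the assumption that the Tachyon master's address may differ from the HDFS one.

```python
from urllib.parse import urlsplit, urlunsplit


def with_scheme(uri, scheme, authority=None):
    """Return `uri` with its scheme (and optionally host:port) replaced."""
    parts = urlsplit(uri)
    return urlunsplit(
        parts._replace(scheme=scheme, netloc=authority or parts.netloc)
    )


hdfs_input = "hdfs://localhost/input"
tachyon_input = with_scheme(hdfs_input, "tachyon")

# Hypothetical authority, used only to show the optional parameter.
tachyon_output = with_scheme("hdfs://localhost/output", "tachyon", "master:19998")
```

The path component is left untouched, which mirrors the slide's claim that jobs run without code changes.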

Thanks to Red Hat!

Future Research Focus

• Integration with HDFS caching

• Memory Fair Sharing

• Random Access Abstraction

• Mutable Data Support

Acknowledgments

Calvin Jia, Nick Lanham, Grace Huang, Mark Hamstra, Bill Zhao, Rong Gu, Hobin Yoon, Vamsi Chitters, Joseph Jin-Chuan Tang, Xi Liu, Qifan Pu, Aslan Bekirov, Reynold Xin, Xiaomin Zhang, Achal Soni, Xiang Zhong, Dilip Joseph, Srinivas Parayya, Tim St. Clair, Shivaram Venkataraman, Andrew Ash

Tachyon Summary

• High-throughput, fault-tolerant in-memory storage
• Interface compatible with HDFS
• Further improve performance for Spark, Shark, and Hadoop
• Growing community with 10+ organizations contributing

Thanks!
• More: https://github.com/amplab/tachyon

