Date posted: 15-Jan-2015
Category: Technology
Uploaded by: narihiro-nakamura
Parallel worlds of CRuby's GC Powered by Rabbit 0.9.3
Parallel worlds of CRuby's GC
nari/Narihiro Nakamura/@nari_en
Network Applied Communication Laboratory Ltd.
I'm very happy now.
Today is my first presentation in English.
My English is not good.
But I'll do my best. Please bear with me :)
Self introduction
Ice-cream factory
I worked on an assembly line.
For example, I made many cardboard boxes.
I was a professional cardboard box maker :)
I made 150 boxes per hour (ZOMG)
http://www.flickr.com/photos/kevincollins123/5887984753/
I was like a machine!!
Working with Java
I worked in a big company.
This work was similar to assembly line work.
I made a part of a product. I didn't understand the whole product.
http://www.flickr.com/photos/kevincollins123/5887984753/
I was still like a machine!!
My current work
Currently, I work at NaCl.
matz, shyouhei, and takaokouji are my co-workers.
shugo is my boss.
They are CRuby committers.
When I started Ruby programming
I felt free.
This work wasn't similar to assembly line work.
I could make the whole product.
http://www.flickr.com/photos/danzden/121379782/
I was no longer a machine!!
Garbage Collection for me
GC technology is very interesting to me.
GC is a garbage collecting machine.
I've been creating it since then. It's a lot of fun!!
I'm making a machine!!
My relationship to GC
I'm a CRuby Committer
I work on GC.
And I wrote a book about GC.
But it's only in Japanese :(
And I've been creating GC with RDD.
What is RDD?
RDD = RubyKaigi Driven Development
My RDD history
LazySweepGC - RubyKaigi2008
LonglifeGC - 2009
LazySweepGC - 2010
ParallelMarkingGC - 2011
LonglifeGC
It treats long-life objects as a special case.
It is similar to generational GC.
LonglifeGC was rejected for CRuby 1.9.2 for some reason. :'(
http://www.flickr.com/photos/conifer/2389654222/
But LonglifeGC has been used in Kiji :-)
Kiji
Kiji is an optimized version of REE by Twitter developers.
The Twitter team substantially extended LonglifeGC.
It's cool!!
But Kiji will be rejected too... :'(
LazySweepGC
Traditional M&S GC executes mark and sweep atomically.
The Ruby application stops during GC (stop-the-world).
With lazy sweeping, the sweep phase is deferred and performed a little at a time.
Each object allocation sweeps Ruby's heap
only until it finds an appropriate free object.
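As a rough illustration of the idea (not CRuby's actual implementation), lazy sweeping can be sketched with a toy heap. The `Slot` struct, the `LazySweepHeap` class and the method names are all hypothetical; the point is only that `allocate` advances a sweep cursor and stops as soon as it finds one reusable slot.

```ruby
# Toy heap slot: a value plus a mark bit. All names here are hypothetical.
Slot = Struct.new(:value, :marked)

class LazySweepHeap
  attr_reader :slots, :sweep_index

  def initialize(size)
    @slots = Array.new(size) { Slot.new(nil, false) }
    @sweep_index = size # nothing is waiting to be swept yet
  end

  # Called when the mark phase ends: instead of sweeping the whole heap
  # now, only reset the cursor so that allocation sweeps on demand.
  def start_lazy_sweep
    @sweep_index = 0
  end

  # Allocation sweeps forward only until it finds one reusable slot.
  def allocate(value)
    while @sweep_index < @slots.size
      slot = @slots[@sweep_index]
      @sweep_index += 1
      if slot.marked
        slot.marked = false # live object: clear the mark for the next cycle
      else
        slot.value = value  # unmarked (dead) slot: reuse it immediately
        return slot
      end
    end
    nil # heap exhausted; a real collector would start a new GC cycle here
  end
end

heap = LazySweepHeap.new(4)
heap.slots.each_with_index { |slot, i| slot.value = "obj#{i}" }
heap.slots[0].marked = true # pretend marking found obj0 and obj2 alive
heap.slots[2].marked = true
heap.start_lazy_sweep
slot = heap.allocate("new") # sweeps slots 0 and 1 only, then stops
```

The pause per allocation is bounded by how far the cursor has to move, which is why the worst-case pause shrinks even though the total sweep work stays the same.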
Improvements
This improves the response time of GC.
I.e. the worst-case pause time of GC decreases.
LazySweepGC
LazySweepGC has been available since Ruby 1.9.3.
Today's topics
Why do we need Parallel Marking?
What to consider?
How to implement?
How much did performance improve?
Why do we need Parallel Marking?
This is CRuby's current GC.
CRuby's current GC
GC operates on only 1 core.
In a multi-core environment, the other cores don't help GC.
http://www.flickr.com/photos/hortont/2698261070/
GC: "I'm alone, it's so hard."
http://www.flickr.com/photos/knallaerbse/2863161933/
We should run GC in parallel!!
First, let me explain a few GC-related concepts.
What is GC?
GC collects all dead objects.
What is a dead object?
A dead object is an object that the program can never reference again.
In GC terms, we say that a dead object is unreachable from Roots.
What are Roots?
Roots are the set of pointers that directly reference objects in the program,
e.g. Ruby's local variables.
For example (figure)
Please remember that
GC collects objects that are unreachable from Roots.
Next, let me explain the current CRuby GC algorithm.
CRuby's GC algorithm summary
CRuby adopts the Mark & Sweep algorithm.
The collector works in separate Mark and Sweep phases.
In the Mark phase
the collector marks live objects that are reachable from Roots.
For example (figure)
Mark phase with GC.start (figure)
Ruby Heap after marking (figure)
In the Sweep phase
the collector sweeps "dead" objects.
"dead" means unmarked
"dead" means unreachable from Roots
Sweep phase (figure)
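The two phases can be sketched on a toy object graph. This is a minimal illustration only, not CRuby's collector: the `Obj` class, the `mark` and `sweep` helpers and the object names are all assumptions made for the example.

```ruby
require "set"

# Hypothetical toy object: a name plus references to other objects.
class Obj
  attr_reader :name, :refs

  def initialize(name, refs = [])
    @name = name
    @refs = refs
  end
end

# Mark phase: visit everything reachable from the roots.
def mark(roots)
  marked = Set.new
  stack = roots.dup
  until stack.empty?
    obj = stack.pop
    next unless marked.add?(obj) # add? returns nil if already marked
    stack.concat(obj.refs)
  end
  marked
end

# Sweep phase: every heap object that was not marked is dead.
def sweep(heap, marked)
  heap.reject { |obj| marked.include?(obj) }
end

d = Obj.new("D")
c = Obj.new("C", [d])
b = Obj.new("B")
a = Obj.new("A", [b, c])
x = Obj.new("X")
y = Obj.new("Y", [x]) # Y references X, but nothing references Y

heap   = [a, b, c, d, x, y]
roots  = [a]
marked = mark(roots)
dead   = sweep(heap, marked)
puts "dead objects: #{dead.map(&:name).join(', ')}"
```

Note that X is collected even though Y still points to it: what matters is reachability from Roots, not whether any reference exists.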
Characteristics of CRuby's GC
The stop-the-world algorithm
Single thread execution
Recent PCs have multi-core processors. But
GC executes on a single thread.
Other cores don't work during GC.
What a waste!!
How can we fix this?
Use Parallel Marking, Luke
What is Parallel Marking?
The collector runs several marking processes in parallel
by using native threads.
We will be happy on multi-core machines.
Flow diagram for Parallel Marking (figure)
BTW: Why not perform sweeping in parallel?
Sweeping is much faster than marking.
You can see ko1's research
<URL:http://www.atdot.net/~ko1/diary/201011.html#d4>
So, Mark phase improvement = GC improvement
And we already have lazy sweeping.
What to consider when implementing Parallel Marking?
We should consider two problems
Workload balancing
Wait-free algorithm
Workload balancing
How can we divide the marking task into sub-tasks?
I tried to think of a simple approach.
1 branch of Roots is marked by 1 thread.
This means..
Tasks are distributed to multiple threads.
The task of marking the entire heap is divided into several tasks, each marking a single branch.
This seems to be no problem.
But actually, this solution suffers from a workload balancing problem.
Each thread doesn't know what the other threads are doing.
For instance, if A and B finish work early,
then they will stop doing anything :(
I think "machines should work forever" :D
So, I think A and B should ...
http://www.flickr.com/photos/ryanr/157458385/
Parallel Marking with Task Stealing.
If A and B finish work early, they steal tasks from the threads that are still busy.
This is called "Task Stealing"
Wait-free algorithm
What does "wait-free" mean?
A wait-free program executes without blocking.
It guarantees per-thread progress.
Why is wait-free important?
Amdahl's law
Amdahl's law is used to find the maximum expected improvement to an overall system when only part of the system is improved.
[cited from `Amdahl's law - Wikipedia']
Amdahl's law is used in parallel computing
If the parallel portion of the system is X%
and the number of processors is Y,
how much speedup can we expect?
It's worse than expected, right?
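To put numbers on this, here is a small sketch of the usual statement of Amdahl's law, speedup = 1 / ((1 - p) + p / n), where p is the parallel fraction and n the number of processors. The function name and the sample values are chosen only for illustration.

```ruby
# Amdahl's law: speedup = 1 / ((1 - p) + p / n)
#   p = fraction of the work that can run in parallel (0.0..1.0)
#   n = number of processors
def amdahl_speedup(parallel_fraction, processors)
  serial_fraction = 1.0 - parallel_fraction
  1.0 / (serial_fraction + parallel_fraction / processors)
end

[0.5, 0.9, 0.99].each do |p|
  [2, 8, 16].each do |n|
    puts format("parallel %.0f%%, %2d processors -> %.2fx",
                p * 100, n, amdahl_speedup(p, n))
  end
end
```

Even with unlimited processors the speedup can never exceed 1 / (1 - p), so a program that is 90% parallel tops out below 10x. This is why the serial parts matter so much.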
The conclusion so far
We should consider how we can efficiently balance workloads.
So, we use Task Stealing.
We should eliminate non-parallel parts
by using a wait-free algorithm.
How to implement Parallel Marking?
Task Stealing
In Task Stealing, threads steal tasks from each other.
Task Stealing is achieved with Arora's Deque.
Arora's Deque
Deque stands for Double-Ended Queue.
In Arora's Deque, the deque contains tasks as elements.
It's a wait-free data structure.
Arora's Deque has only three operations: push(), pop(), and shift().
Each mark worker has a single deque.
Only the owner can call pop() and push().
A worker can call shift() to steal from another worker's deque.
"Hey wait a minute, doesn't shift() have contention problems?"
In what ways could shift() cause contention problems?
e.g...
Multiple threads (workers) may call shift() on the same deque at the same time.
shift() and pop() could be called at the same time
when the deque has only one element.
But, Arora's Deque avoids these contention problems.
Serialization
shift() is serialized by using CAS.
CAS = Compare And Swap
And this serialization doesn't use a lock.
It's wait-free!!
I omit the details of how the serialization is implemented.
For the sake of this presentation, let's assume that Arora's Deque avoids contention problems.
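For readers who want a rough feel for what CAS-based serialization looks like, here is a heavily simplified sketch of the thief side only. It is not the code from this talk: the `CasCell` class emulates CAS with a `Mutex` purely for illustration (a real implementation would use a hardware atomic instruction and no lock), and the `steal` helper with its `top`/`bottom` indices is a hypothetical simplification that ignores the owner's pop().

```ruby
# Emulated CAS cell. The Mutex only imitates the atomicity of a real
# compare-and-swap instruction; it is an assumption made for this sketch.
class CasCell
  def initialize(value)
    @value = value
    @guard = Mutex.new
  end

  def get
    @guard.synchronize { @value }
  end

  # Store new_value only if the current value still equals expected.
  def compare_and_swap(expected, new_value)
    @guard.synchronize do
      if @value == expected
        @value = new_value
        true
      else
        false
      end
    end
  end
end

# Thief side: claim the task at index `top` by advancing `top` with CAS.
# If another thief advanced it first, the CAS fails and we get nil.
def steal(tasks, top, bottom)
  old_top = top.get
  return nil if old_top >= bottom.get # deque looks empty
  task = tasks[old_top]
  top.compare_and_swap(old_top, old_top + 1) ? task : nil
end

tasks  = %w[B C]
top    = CasCell.new(0)
bottom = CasCell.new(tasks.size)

first  = steal(tasks, top, bottom)
second = steal(tasks, top, bottom)
third  = steal(tasks, top, bottom)      # nothing left to steal
stale  = top.compare_and_swap(0, 1)     # a thief holding a stale top loses
```

The key property is that a thief never waits: it either wins the CAS and owns the task, or it fails immediately and can try another deque.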
Summary for Arora's Deque
A simple data structure for Task Stealing.
Each worker has a single deque.
Stealing (shift operation) is wait-free!
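The interface summarized above can be sketched as follows. This is only an illustration of who calls what: the `TaskDeque` class is hypothetical, and it guards a plain Array with a `Mutex`, so unlike Arora's Deque it is not wait-free.

```ruby
# Sketch of the deque interface. A Mutex stands in for the CAS-based
# serialization, so this version is NOT wait-free.
class TaskDeque
  def initialize
    @tasks = []
    @lock = Mutex.new
  end

  # Owner only: add a task at the bottom.
  def push(task)
    @lock.synchronize { @tasks.push(task) }
  end

  # Owner only: take the most recently pushed task.
  def pop
    @lock.synchronize { @tasks.pop }
  end

  # Other workers: steal the oldest task from the opposite end.
  def shift
    @lock.synchronize { @tasks.shift }
  end

  def size
    @lock.synchronize { @tasks.size }
  end
end

deque = TaskDeque.new
%w[B C].each { |task| deque.push(task) }
stolen = deque.shift # a thief takes the oldest task
own    = deque.pop   # the owner takes the newest task
```

Because the owner and the thieves work at opposite ends, they only collide when the deque is nearly empty, which is the single-element case mentioned earlier.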
How to use Arora's Deque in Parallel Marking?
First try: A task is an object.
Let's say that worker A has a branch that is composed of 4 objects.
We start by marking A and pushing it to the deque.
pop A, mark B and C, push B and C.
pop C, mark D, push D
pop D, pop B
This is a branch marking.
How do you steal?
Suppose that Worker1 has tasks B and C. Worker2 has no task.
Worker2 steals task B from Worker1 by using shift().
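The walk-through above can be replayed with a plain Ruby Array standing in for Arora's Deque. The `CHILDREN` table (A references B and C, C references D) is an assumed shape for the 4-object branch, since the original figure is not part of this text.

```ruby
# Assumed 4-object branch: A -> B, C and C -> D.
CHILDREN = { "A" => %w[B C], "B" => [], "C" => %w[D], "D" => [] }

marked = []
deque  = [] # a plain Array stands in for Arora's Deque
trace  = []

# Start by marking A and pushing it onto the deque.
marked << "A"
deque.push("A")
trace << "mark A, push A"

# One task = one object: pop it, then mark and push its children.
until deque.empty?
  obj = deque.pop
  step = "pop #{obj}"
  CHILDREN[obj].each do |child|
    next if marked.include?(child)
    marked << child
    deque.push(child)
    step += ", mark #{child}, push #{child}"
  end
  trace << step
end
puts trace

# Stealing: Worker1 holds B and C, Worker2 is idle and takes B via shift.
worker1 = %w[B C]
worker2 = []
worker2.push(worker1.shift)
```

Every single object costs at least one push() and one pop(), which is worth keeping in mind when looking at the performance of this approach.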
Summary
The marker uses Arora's Deque as a marking stack.
A "task" means an object.
The granularity of the task is very fine.
This is a naive implementation.
I implemented this approach.
But..
It's slower than the original GC.
http://www.flickr.com/photos/emariephotos/4958245676/
OMG...
I fell into the Pitfalls of Parallel Processing (PPP!!!)
Why slow?
pop(), push(), and shift() are called frequently,
because the deque has fine-grained tasks.
Their overhead is too big.
How to fix this?
We can make the tasks less fine-grained.
A task is a branch
All branches in Roots are divided roughly among the deques.
Each Worker marks a branch in its deque.
When the deque is empty, the worker steals a branch from another worker.
like this!!
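A minimal sketch of the branch-as-task scheme, using Ruby threads. Everything here is an assumption for illustration: the `GRAPH` table, the two workers, and the `Mutex` objects that stand in for the wait-free deque operations and for atomic mark bits. The real implementation is in C with native threads.

```ruby
require "set"

# Hypothetical heap: each object name maps to the names it references.
GRAPH = {
  "r1" => %w[a b], "a" => [], "b" => %w[c], "c" => [],
  "r2" => %w[d],   "d" => [],
  "r3" => %w[e f], "e" => [], "f" => [],
  "r4" => []
}

# Mark one whole branch. This is a single coarse-grained task.
def mark_branch(root, marked, lock)
  stack = [root]
  until stack.empty?
    obj = stack.pop
    newly_marked = lock.synchronize { marked.add?(obj) }
    stack.concat(GRAPH[obj]) if newly_marked
  end
end

roots  = %w[r1 r2 r3 r4]
deques = [[], []] # one deque per worker
roots.each_with_index { |root, i| deques[i % deques.size].push(root) }

deque_lock = Mutex.new # stand-in for the wait-free deque operations
mark_lock  = Mutex.new # stand-in for atomic mark bits
marked     = Set.new

workers = deques.size.times.map do |me|
  Thread.new do
    loop do
      task = deque_lock.synchronize do
        own = deques[me].pop
        if own
          own # work from our own deque first
        else
          # Our deque is empty: steal a branch from another worker.
          victim = deques.find { |d| !d.equal?(deques[me]) && !d.empty? }
          victim && victim.shift
        end
      end
      break unless task # no branches left anywhere
      mark_branch(task, marked, mark_lock)
    end
  end
end
workers.each(&:join)
```

Here the deques are touched once per branch instead of once per object, so the deque overhead drops sharply compared with the naive version.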
Good point & Bad point
The number of calls to the deque's operations was reduced.
The marking speed of each worker is improved.
However, coarse-grained tasks decrease parallelism.
Why do coarse-grained tasks decrease parallelism?
Tasks may involve a large branch.
If an object in B's branch has many child objects..
.. then A can't steal it while B is marking the large branch.
So, the worker needs to treat large branches as special cases.
Almost all large branches hold large Array objects and/or large Hash objects.
Treatment for large Array objects and Hash objects
Each marker has a special deque to manage them.
A marker divides them into fixed-size tasks,
e.g. elements 0-9 of an Array, elements 10-19 of the Array...
By doing this, other workers can steal the divided tasks.
This improves parallelism.
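The splitting step might look like the sketch below. The chunk size of 10 follows the "0-9, 10-19" example above, but the constant, the helper name and the use of index ranges as tasks are assumptions, not details taken from the implementation.

```ruby
CHUNK_SIZE = 10 # assumed; the slides only say "fixed size"

# Split a large Array into index ranges that can be marked, or stolen,
# independently of each other.
def split_into_tasks(array, chunk_size = CHUNK_SIZE)
  (0...array.size).step(chunk_size).map do |start|
    start...[start + chunk_size, array.size].min
  end
end

big   = Array.new(25) { |i| "elem#{i}" }
tasks = split_into_tasks(big)
tasks.each { |range| puts "task: elements #{range.first}-#{range.last}" }
```

Each range is small enough to be marked quickly, so an idle worker never has to wait for one worker to finish a huge Array on its own.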
Summary
The naive implementation was slow.
The grain of the task was too fine.
Now a "task" means a branch in Roots.
The grain of the task is coarse.
It's faster!!
How much did performance improve?
These are my machine specs
My machine has only 2 cores
Memory: 8GB
OS: Linux
Parallel marking uses 4 marking threads.
The first benchmark program is
make benchmark
This is the benchmark used in CRuby development.
Why does this seem so slow?
I think it's affected by Parallel Marking's preparation,
e.g. creating marking threads and allocating deques.
In most of the benchmarks, there are few objects to mark.
In this case, the fixed cost of Parallel Marking is high relative to the marking work.
The next benchmark program is
make rdoc
make rdoc generates the Ruby documentation.
This benchmark measures the total execution time and the GC execution time of make rdoc.
make rdoc
It takes about 80 seconds on my machine.
In fact, 30% of that time is spent on GC!!
How much did performance improve?
Total GC time is improved by 40%!
So fast!!
In a many-core environment
I expect we would get a large improvement,
e.g. 8 cores, 16 cores...
But my machine has just 2 cores.
I can't see it :(
Best case for Parallel GC
If there are many objects.
In this case, there are also many mark targets.
If the objects are long-lived.
Server-side applications?
Demo
Demonstration
I want to show the performance improvement with Parallel GC.
This demonstration is in the style of a video game.
Let me explain this game.
The character has HP.
When GC runs,
the character loses HP while waiting for the GC to finish.
We must reach the goal before the HP runs out.
Other characteristics of SUPER NARIO GC
GC runs at fixed intervals.
A lot of objects are generated to increase GC's burden.
Burden = Game Level
Let's compare the original GC and Parallel GC
The original GC's pause time is long.
This game will be difficult.
Parallel GC's pause time is short.
This game will be easy.
OK, let's try!
DEMO: Original GC version
Oops.. so difficult!!!
DEMO: Parallel GC version
Wow!! Easy!!!!
Let's compare the average GC times
Fast!!
Remaining Problems
Windows OS is not supported
The mark worker uses pthreads as native threads.
And it uses some gcc built-in functions.
But I'll support Windows eventually.
Increased memory usage.
The size of 1 deque is roughly 32KB.
But generally, multi-core machines have plenty of memory.
So, I think it's OK :P
Conclusion
I implemented Parallel Marking GC.
GC was improved!
I'll report to ruby-core soon.
But Parallel Marking has some problems.
I'll fix these.
source code
Parallel Marking GC
<URL:https://github.com/authorNari/ruby/tree/pmark_div_root2>
SUPER NARIO GC
<URL:https://github.com/authorNari/nario/>
Acknowledgments
The following people helped me make this presentation!!
Tor-san!!
matz, shugo, yhara, sada, takaokouji, other co-workers!!
Thank you!!!
Do you have any questions?
Please keep questions short and simple :)
Sorry
It's too difficult for me to understand/answer the question.
Could you send the question on Twitter (@nari_en)?