+ All Categories
Home > Documents > Understanding Performance with Flame Graphs · Profilers I Linux: perf,SystemTap,ktap I Solaris:...

Understanding Performance with Flame Graphs · Profilers I Linux: perf,SystemTap,ktap I Solaris:...

Date post: 20-Aug-2018
Category:
Upload: duongthien
View: 225 times
Download: 0 times
Share this document with a friend
27
Understanding Performance with Flame Graphs Vinzent Steinberg C++ User Group, FIAS November 11, 2015 1 / 27
Transcript

Understanding Performance with Flame Graphs

Vinzent Steinberg

C++ User Group, FIASNovember 11, 2015

1 / 27

Profiling Stack Traces

I Which code-paths are hot?I Timed interrupt collecting current function address, or entire

stack traceI Human readable summary reportI Problem: long output, difficult to comprehend

2 / 27

Flame Graphs

I Visualization of sampled stack tracesI Identify hot code-paths quicklyI Works for any profiler

3 / 27

Credits

Figure 1: Brendan Gregg

I Talk based on Brendan Gregg’s workI Performance architect at NetflixI Pioneered flame graphs for performance analysis during work

All information presented here is from his website:http://www.brendangregg.com/flamegraphs.html

4 / 27

Audience

I DevelopersI SysadminsI Support staffI Performance engineers

5 / 27

Languages

I C, C++I Rust, Haskell, GoI Javascript, Ruby, Lua, Python, Erlang, Java

6 / 27

Motivating Problem

I Slow MySQL data baseI Fix it or lose customerI CPU-limited

→ Sample stack traces with profiler

7 / 27

Profilers

I Linux: perf, SystemTap, ktapI Solaris: Dtrace, illumosI FreeBSD: DTraceI Max OS X: DTrace, InstrumentsI Windows: Xperf.exe

8 / 27

Background: Stack Frame

I Stack frame shows location in codeI Usual profiler output:

<module>`<function>+<instruction offset>

I For example:

libc.so.1`mutex_trylock_adaptive+0x112

9 / 27

Background: Stack TraceI List of stack framesI Current frame firstI Example:

libc.so.1`mutex_trylock_adaptive+0x112libc.so.1`mutex_lock_impl+0x165libc.so.1`mutex_lock+0xcmysqld`key_cache_read+0x741mysqld`_mi_fetch_keypage+0x48mysqld`w_search+0x84mysqld`_mi_ck_write_btree+0xa5mysqld`mi_write+0x344mysqld`ha_myisam::write_row+0x43mysqld`handler::ha_write_row+0x8dmysqld`end_write+0x1a3mysqld`evaluate_join_record+0x11e[...]libc.so.1`_lwp_start+0

10 / 27

Using perf_events$ perf record -F 99 -p 13204 -g -- sleep 30$ perf report -n --stdio[...]# Overhead Samples Command Shared Object Symbol# ........ ............ ....... ................. ...................................#

20.42% 605 bash [kernel.kallsyms] [k] xen_hypercall_xen_version|--- xen_hypercall_xen_version

check_events||--44.13%-- syscall_trace_enter| tracesys| || |--35.58%-- __GI___libc_fcntl| | || | |--65.26%-- do_redirection_internal| | | do_redirections| | | execute_builtin_or_function| | | execute_simple_command| | | execute_command_internal| | | execute_command| | | execute_while_or_until| | | execute_while_command| | | execute_command_internal| | | execute_command| | | reader_loop| | | main| | | __libc_start_main| | || | --34.74%-- do_redirections| | || | |--54.55%-- execute_builtin_or_function| | | execute_simple_command| | | execute_command_internal| | | execute_command| | | execute_while_or_until| | | execute_while_command| | | execute_command_internal| | | execute_command| | | reader_loop| | | main| | | __libc_start_main| | |

[...]

11 / 27

Problem

I Only 20.4% · 44.1% · 35.6% · 65.3% ≈ 2% of the samplesaccounted for

I This is already the condensed output

12 / 27

Too Much Data

Figure 2: Full profiling output.

13 / 27

Flame Graph

Figure 3: Visualization of profiling output.

14 / 27

Flame Graph - Explained

I Each box represents a function in the stack(merged stack frame)

I y -axis: stack depthI x -axis: spans sample population (ordering meaningless)I Width of boxes: total time spent on-CPU (including ancestry)I Colors for readability onlyI InteractiveI Works well for multithreaded software

15 / 27

Creating Flame Graphs

1. Capture stacks (profiler of your choice)2. Fold stacks, converting

unix`i86_mwait+0xdunix`cpu_idle_mwait+0xf1unix`idle+0x114unix`thread_start+0x8

19486

to

unix`i86_mwait+0xd;...;unix`thread_start+0x8 19486

3. Generate plot

16 / 27

Creating Flame Graphs: Basics

I Sample all processes for 60 s with 99 Hz(avoiding lockstep)

$ git clone https://github.com/brendangregg/FlameGraph$ cd FlameGraph$ perf record -F 99 -a -g -- sleep 60$ perf script

| ./stackcollapse-perf.pl > out.perf-folded$ ./flamegraph.pl out.perf-folded > perf-kernel.svg

17 / 27

Creating Flame Graphs: Tricks

I Filtered flame graphs

[...]$ perf script

| ./stackcollapse-perf.pl > out.perf-folded$ grep -v cpu_idle out.perf-folded

| ./flamegraph.pl > nonidle.svg$ grep ext4 out.perf-folded

| ./flamegraph.pl > ext4internals.svg$ egrep 'system_call.*sys_(read|write)'

out.perf-folded | ./flamegraph.pl > rw.svg

I Demangle C++ names

$ perf script | c++filt| stackcollapse-perf.pl > out.perf-folded

18 / 27

Other Uses

I Other eventsI Device I/OI CPU events (for example cache misses)I Off-CPU eventsI Specific function callsI Memory allocation

I Other values (instead of sample count)I BytesI Latency

19 / 27

Off-CPU Time

Figure 4: Visualization of off-CPU time.20 / 27

Off-CPU Time Problems

I Harder to interpretI On-CPU time limited, off-CPU unlimitedI Interpretation depends on what the code is doing

I Uses tracing scheduler eventsI Higher overhead

I Distorts resultsI Hurts the target

I More accurate than sampling

21 / 27

On-CPU and Off-CPU Time

Figure 5: Visualization of on-CPU and off-CPU time.

22 / 27

On-CPU and Off-CPU Time

Figure 6: Visualization of on-CPU and off-CPU time.

23 / 27

Memory

Figure 7: Visualization of memory allocation.

24 / 27

Differential Flame Graph

Figure 8: Difference to previous Flame Graph: Blue is faster, red is slower.

25 / 27

More Information

I http://www.brendangregg.com/flamegraphs.html(links to everything)

I Talk by Brendan Gregg at LISA 13

26 / 27

Summary

1. Collect stack traces with profiler2. Transform stack traces into flame graph3. Understand performance

27 / 27


Recommended