Date post: | 12-Jan-2015 |
Category: | Technology |
Upload: | linaro |
Thu 6 March, 10:05am, Santosh Shukla, Mike Holmes
LCA14-401: BoF, Networking - Debug/tracing/counter
• Introduction
• Use case
• libperf and in-kernel perf API
• Test analysis: direct user access vs syscall-based perf counter access
• Design issues and next steps
• Q&A
Fast access to perf Counters
• Access to perf counters is not fast enough in the embedded networking space.
• We think we need:
  • The fastest possible access from user space (see use case).
  • Sharing when read-only (no locking overhead).
  • A stable API (based on libperf).
  • An easy way to access SoC-specific counters.
Introduction
• In the fast path (which could be ODP in future), there will be a method to analyze ODP crash dumps based on statistics.
• Because crash dump statistics are based on the perf HW counters, very low-overhead counter access is needed. It should provide near-accurate CPU or bus clock cycle precision.
• For example, if the per-packet budget in the fast path is 1000 CPU cycles, then a measurement cannot take 3000 CPU cycles, as it does today with syscall-based perf counter access in Linux.
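The budget argument above can be made concrete with a little arithmetic. The following is a minimal sketch, not part of the original slides: the helper names and the 5% ceiling used in the usage note are assumptions chosen for illustration only.

```c
#include <assert.h>
#include <stdint.h>

/* Cost of one measurement as a percentage of the per-packet cycle budget.
 * Hypothetical helper; budget_cycles must be non-zero. */
static inline uint32_t overhead_percent(uint32_t measure_cycles,
                                        uint32_t budget_cycles)
{
    assert(budget_cycles != 0);
    return (uint32_t)(((uint64_t)measure_cycles * 100u) / budget_cycles);
}

/* Largest measurement cost, in cycles, that stays within max_percent
 * of the per-packet budget. Hypothetical helper. */
static inline uint32_t max_measure_cycles(uint32_t budget_cycles,
                                          uint32_t max_percent)
{
    return (uint32_t)(((uint64_t)budget_cycles * max_percent) / 100u);
}
```

With the figures from the slide, overhead_percent(3000, 1000) gives 300, i.e. one syscall-based read costs three times the whole packet budget. If, as an assumed target, measurement should stay under 5% of the budget, max_measure_cycles(1000, 5) gives 50 cycles.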
Use Case
Perf provides a syscall that opens a perf file descriptor, so that user-space applications can access the counters and attach events to them.
sys_perf_event_open (formerly sys_perf_counter_open) - the syscall. Arguments:
- event type attributes for monitoring/sampling
- target pid
- target cpu
- group_fd
- flags
Event type:
- PERF_TYPE_HARDWARE
- PERF_TYPE_SOFTWARE
- PERF_TYPE_TRACEPOINT
- PERF_TYPE_HW_CACHE
- PERF_TYPE_RAW (for raw, implementation-specific hardware event codes)
Perf
attr.sample_type (bitmask)
{
PERF_SAMPLE_IP
PERF_SAMPLE_TID
PERF_SAMPLE_TIME
PERF_SAMPLE_CALLCHAIN
PERF_SAMPLE_ID
PERF_SAMPLE_CPU
}
attr config bitfield
{
disabled: off by default
inherit: children inherit it
exclude_{user,kernel,hv,idle}: don’t count these
mmap: include mmap data
comm: include comm data
inherit_stat: per task counts
enable_on_exec: next exec enables
}
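As an illustration of how these attribute bits are typically combined, here is a minimal sketch that opens a CPU-cycle counter for the calling thread, enables it around a piece of work, and reads it back. The function names (open_cycle_counter, count_cycles, spin_work) are hypothetical, the choice of exclude_kernel/exclude_hv is an assumption, and the call may fail where perf is restricted, in which case the sketch simply returns -1.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Open a CPU-cycle counter for the calling thread on any CPU.
 * Returns a file descriptor, or -1 (errno set) if perf is unavailable. */
static int open_cycle_counter(void)
{
    struct perf_event_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.config = PERF_COUNT_HW_CPU_CYCLES;
    attr.disabled = 1;        /* start stopped; enabled explicitly later */
    attr.exclude_kernel = 1;  /* assumption: count user-space cycles only */
    attr.exclude_hv = 1;

    /* pid = 0 (this thread), cpu = -1 (any CPU), group_fd = -1, flags = 0 */
    return (int)syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0UL);
}

/* Placeholder workload so the sketch has something to measure. */
static void spin_work(void)
{
    volatile uint32_t i;

    for (i = 0; i < 100000u; i++)
        ;
}

/* Count cycles spent in fn(). Returns 0 on success, -1 on failure. */
static int count_cycles(void (*fn)(void), uint64_t *cycles)
{
    int fd, rc = -1;

    assert(fn != NULL && cycles != NULL);
    fd = open_cycle_counter();
    if (fd < 0)
        return -1;

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    fn();
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    /* Without attr.read_format set, read() returns a single u64 count. */
    if (read(fd, cycles, sizeof(*cycles)) == (ssize_t)sizeof(*cycles))
        rc = 0;
    close(fd);
    return rc;
}
```

Every step here (open, enable, disable, read) is a separate syscall, which is the overhead the rest of this talk is concerned with.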
perf continued..
• Libperf creates a set of file descriptors for a group of perf events by calling the sys_perf_event_open() API, and performs enable/disable/read operations on them.
The current API has:
• libperf_initialize: sets up a set of fds for profiling code to read from
• libperf_finalize: reads from the fds, prints, and closes all perf fds
• libperf_readcounter: reads a perf counter
• libperf_enablecounter: enables a perf counter
• libperf_disablecounter: disables a perf counter
• libperf_close: closes the fds
Libperf
• Raw proposal:
  • Mmapping HW counters to user space could be a way forward for fast access, removing the overhead of the current kernel implementation.
  • Adding a scalable framework in user space (this could be libperf) to read CPU-specific counters, counters on offload blocks, and other variants of counters.
• Current mmap-based perf support in the kernel:
  • In-kernel perf supports an mmap-based persistent ring-buffer implementation for user space.
  • This implementation is limited in performance for the following reason: the HW counters are mappable and stored into a ring buffer, but user space pays a lot of synchronisation overhead to access them, i.e. an rmb for every perf counter read, locking, and async wake-up events for user space to read statistics.
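To show where the read barrier mentioned above comes from, the following is a minimal sketch of the consumer side of the perf mmap ring buffer, using the data_head and data_tail fields of struct perf_event_mmap_page. The function names are hypothetical, record parsing is omitted, and __sync_synchronize() (a GCC/Clang full barrier) stands in for the rmb()/mb() a real consumer would use.

```c
#include <assert.h>
#include <linux/perf_event.h>
#include <stdint.h>

/* Number of unread bytes currently in the ring buffer.
 * data_head is advanced by the kernel, data_tail by user space. */
static uint64_t ring_unread_bytes(volatile struct perf_event_mmap_page *pc)
{
    uint64_t head;

    assert(pc != 0);
    head = pc->data_head;
    __sync_synchronize();   /* rmb: read head before reading record data */
    return head - pc->data_tail;
}

/* Tell the kernel that everything up to 'head' has been consumed. */
static void ring_consume(volatile struct perf_event_mmap_page *pc,
                         uint64_t head)
{
    assert(pc != 0);
    __sync_synchronize();   /* mb: finish reading records before release */
    pc->data_tail = head;
}
```

Each pass over the buffer pays for at least two barriers in addition to any wake-up handling, which is cheap for bulk sampling but significant when the goal is a single counter read inside a per-packet cycle budget.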
design issues, next step investigation
• But:
  • The current kernel mappable events are exclusive and not shareable, and they won't fall back to sysfs perf event mode. Therefore the approach is not scalable.
  • The current kernel counter overhead is still significant, so the current implementation won't meet the 1000-cycle requirement of the fast-path model, for example the ODP crash dump statistics requirement mentioned in the previous slide [4].
Next Step continued..
• Effort to investigate and evaluate these issues:
  • Focus on an exclusive fast-access approach.
  • HW counter pinned to a specific core and a specific task.
  • Avoid sync primitives in kernel space while reading HW counters; let the user-space application handle this job.
  • Teach libperf to handle sync primitives and decide on a locking policy.
  • The design should be flexible enough to fall back to syscall-based perf mode.
  • Respect SMP policy as much as possible.
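One way the fallback requirement could look in user space is sketched below. This is not libperf code: the struct, the enum, and the function names are hypothetical, and how the application learns that user-mode PMU access has been enabled (the user_access_enabled flag) is left as an assumption.

```c
#include <assert.h>
#include <stdint.h>
#include <unistd.h>

enum counter_mode { MODE_DIRECT, MODE_SYSCALL };

/* Hypothetical handle: direct PMU reads when possible, perf fd otherwise. */
struct fast_counter {
    enum counter_mode mode;
    int fd;   /* perf event fd, used only in MODE_SYSCALL */
};

/* Direct read of the ARMv7 cycle counter; returns 0 on other targets. */
static inline uint64_t direct_read(void)
{
#if defined(__GNUC__) && defined(__ARM_ARCH_7A__)
    uint32_t r;

    asm volatile("mrc p15, 0, %0, c9, c13, 0" : "=r"(r));
    return r;
#else
    return 0;
#endif
}

/* True if this build contains a direct-read path at all. */
static inline int have_direct_read(void)
{
#if defined(__GNUC__) && defined(__ARM_ARCH_7A__)
    return 1;
#else
    return 0;
#endif
}

/* Pick the fast path only if the caller confirms user access is enabled;
 * otherwise fall back to reading the perf fd via the read() syscall. */
static void fast_counter_init(struct fast_counter *c,
                              int user_access_enabled, int perf_fd)
{
    assert(c != 0);
    c->fd = perf_fd;
    c->mode = (user_access_enabled && have_direct_read())
                  ? MODE_DIRECT : MODE_SYSCALL;
}

/* Returns 0 on success and stores the count in *value, -1 on failure. */
static int fast_counter_read(const struct fast_counter *c, uint64_t *value)
{
    assert(c != 0 && value != 0);
    if (c->mode == MODE_DIRECT) {
        *value = direct_read();
        return 0;
    }
    if (c->fd < 0)
        return -1;
    return read(c->fd, value, sizeof(*value)) == (ssize_t)sizeof(*value)
               ? 0 : -1;
}
```

Keeping the mode decision in one place means callers use the same read function in both cases, and locking policy can be layered on top in user space without touching the kernel path.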
Next Step continued..
[Figure: user-space fast-access flow control. Labels: Application; ARM processor core and event extensions, both inside the SoC.]
Custom user-space application detail:
• Ran a test application on Arndale to demonstrate the delta between user-space and kernel-space perf counter access. The result shows close to a 9x improvement.
• A tiny test kernel module enables and disables the perf counter for user mode.
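/* enable: allow user-mode access to the PMU (PMUSERENR) */
asm ("MCR p15, 0, %0, C9, C14, 0\n\t" :: "r"(1));
/* disable: clear the counter overflow interrupt enables (PMINTENCLR) */
asm ("MCR p15, 0, %0, C9, C14, 2\n\t" :: "r"(0x8000000f));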
• The user app uses an x86-style timer API (rdtsc-like) to read the perf cycle counter.
#include <stdint.h>

static inline uint32_t rdtsc32(void)
{
#if defined(__GNUC__) && defined(__ARM_ARCH_7A__)
    uint32_t r = 0;
    /* Read PMCCNTR, the ARMv7 PMU cycle counter. */
    asm volatile("mrc p15, 0, %0, c9, c13, 0" : "=r"(r));
    return r;
#else
#error Unsupported architecture/compiler!
#endif
}
Benchmarking current & proposed access
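Libperf application using the perf syscall:
• Creates a perf event fd using the perf_event_open syscall.
• Reads the perf counter event from the file descriptor.
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <unistd.h>

static int fddev = -1;

static void init(void)
{
    static struct perf_event_attr attr;   /* static, so zero-initialized */

    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.config = PERF_COUNT_HW_CPU_CYCLES;
    /* pid = 0 (calling thread), cpu = -1 (any CPU), group_fd = -1, flags = 0 */
    fddev = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
}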
• Both applications run in a tight loop for some duration, and their deltas are recorded for comparison.
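The tight-loop comparison described above could be structured roughly as follows. This is a sketch under stated assumptions, not the benchmark used for the talk: the function names are hypothetical, dummy_read is a placeholder for either the direct PMU read or the fd-based read, and wall-clock time from clock_gettime() is used instead of cycles so that the harness itself stays portable.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Any counter-read routine under test, e.g. a direct PMU read or a
 * read() on a perf fd wrapped to this signature. */
typedef uint64_t (*counter_read_fn)(void);

/* Placeholder reader so the harness can be exercised on any machine. */
static uint64_t dummy_read(void)
{
    return 1;
}

/* Average wall-clock cost, in nanoseconds, of one call to read_counter
 * when called 'iterations' times in a tight loop. */
static double avg_ns_per_read(counter_read_fn read_counter,
                              uint32_t iterations)
{
    struct timespec t0, t1;
    volatile uint64_t sink = 0;   /* keeps the loop from being optimized out */
    uint32_t i;
    double ns;

    assert(read_counter != 0 && iterations != 0);
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < iterations; i++)
        sink += read_counter();
    clock_gettime(CLOCK_MONOTONIC, &t1);
    (void)sink;

    ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
       + (double)(t1.tv_nsec - t0.tv_nsec);
    return ns / iterations;
}

/* Ratio of the slower access method to the faster one. */
static double speedup(double slow_ns, double fast_ns)
{
    assert(fast_ns > 0.0);
    return slow_ns / fast_ns;
}
```

Running avg_ns_per_read once per access method and passing the two averages to speedup gives the kind of ratio quoted on the slides.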
Benchmarking cont..
• Comparison: application using direct user-space PMU access (enabled) vs application using the perf syscall.
Benchmarking cont..
[1] ARM A15 performance counter registers: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0438c/BIIFDEEJ.html
[2] LNG card: https://cards.linaro.org/browse/LNG-260
[3] Perf on A15: https://perf.wiki.kernel.org/index.php/Tutorial
[4] http://neocontra.blogspot.com/2013/05/user-mode-performance-counters-for.html
[5] https://github.com/thoughtpolice/enable_arm_pmu
[6] libperf: https://github.com/theonewolf/libperf
[7] http://www.linux-kongress.org/2010/slides/lk2010-perf-acme.pdf
Reference links
QA
More about Linaro Connect: http://connect.linaro.org
More about Linaro: http://www.linaro.org/about/
More about Linaro engineering: http://www.linaro.org/engineering/
Linaro members: www.linaro.org/members