author | Andrii Nakryiko <andrii@kernel.org> | 2024-03-26 09:21:47 -0700 |
---|---|---|
committer | Alexei Starovoitov <ast@kernel.org> | 2024-03-28 18:31:40 -0700 |
commit | 7df4e597ea2cfd677e65730948153d5544986a10 (patch) | |
tree | d1e7e35d4fd5d276b2a207a6eca1b39ae11a6a8f /tools/testing/selftests/bpf/bench.c | |
parent | 1175f8dea349e5999d99727346db24f38306a793 (diff) | |
selftests/bpf: add batched, mostly in-kernel BPF triggering benchmarks
Existing kprobe/fentry triggering benchmarks have a 1-to-1 mapping between
one syscall execution and one BPF program run. While we use a fast
getpgid() syscall, syscall overhead can still be non-trivial.
This patch adds a set of kprobe/fentry benchmarks that significantly
amortize the cost of the syscall relative to the actual BPF triggering
overhead. We do this by using the BPF_PROG_TEST_RUN command to trigger
a "driver" raw_tp program, which runs a tight, parameterized loop calling
a cheap BPF helper (bpf_get_numa_node_id()), to which the kprobe/fentry
programs are attached for benchmarking.
This way, one bpf() syscall causes N executions of the BPF program being
benchmarked. N defaults to 100, but can be adjusted with the
--trig-batch-iters CLI argument.
For comparison, we also implement a new baseline program that, instead of
triggering another BPF program, just does N atomic per-CPU counter
increments, establishing the upper limit for all other types of programs
within this batched benchmarking setup.
Taking the final set of benchmarks added in this patch set (including
tp/raw_tp/fmodret, added in a later patch), and keeping the "legacy"
syscall-driven benchmarks for now, we can capture all triggering
benchmarks in one place for comparison, before we remove the legacy ones
(and rename xxx-batch into just xxx).
$ benchs/run_bench_trigger.sh
usermode-count : 79.500 ± 0.024M/s
kernel-count : 49.949 ± 0.081M/s
syscall-count : 9.009 ± 0.007M/s
fentry-batch : 31.002 ± 0.015M/s
fexit-batch : 20.372 ± 0.028M/s
fmodret-batch : 21.651 ± 0.659M/s
rawtp-batch : 36.775 ± 0.264M/s
tp-batch : 19.411 ± 0.248M/s
kprobe-batch : 12.949 ± 0.220M/s
kprobe-multi-batch : 15.400 ± 0.007M/s
kretprobe-batch : 5.559 ± 0.011M/s
kretprobe-multi-batch: 5.861 ± 0.003M/s
fentry-legacy : 8.329 ± 0.004M/s
fexit-legacy : 6.239 ± 0.003M/s
fmodret-legacy : 6.595 ± 0.001M/s
rawtp-legacy : 8.305 ± 0.004M/s
tp-legacy : 6.382 ± 0.001M/s
kprobe-legacy : 5.528 ± 0.003M/s
kprobe-multi-legacy : 5.864 ± 0.022M/s
kretprobe-legacy : 3.081 ± 0.001M/s
kretprobe-multi-legacy: 3.193 ± 0.001M/s
Note how the xxx-batch variants show significantly higher throughput
(roughly 3.7x for fentry and 2.6x for kprobe-multi in the pairs below),
even though the in-kernel overhead is exactly the same. As such, results
can be compared only between benchmarks of the same kind (syscall-driven
or batched):
fentry-legacy : 8.329 ± 0.004M/s
fentry-batch : 31.002 ± 0.015M/s
kprobe-multi-legacy : 5.864 ± 0.022M/s
kprobe-multi-batch : 15.400 ± 0.007M/s
Note also that syscall-count sets a theoretical limit for
syscall-triggered benchmarks, while kernel-count sets a similar limit for
the batched variants. usermode-count is a happy and unachievable case of
user space counting without doing any syscalls, and is mostly a measure
of CPU speed for such a trivial benchmark.
As mentioned, tp/raw_tp/fmodret require a kernel-side kfunc to produce
a similar benchmark, which we address in a separate patch.
Note that run_bench_trigger.sh allows overriding the list of benchmarks
to run, which is very useful for performance work.
Cc: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240326162151.3981687-3-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Diffstat (limited to 'tools/testing/selftests/bpf/bench.c')
-rw-r--r-- | tools/testing/selftests/bpf/bench.c | 21 |
1 files changed, 20 insertions, 1 deletions
diff --git a/tools/testing/selftests/bpf/bench.c b/tools/testing/selftests/bpf/bench.c
index 7ca1e1eb5c30..484bcbeaa819 100644
--- a/tools/testing/selftests/bpf/bench.c
+++ b/tools/testing/selftests/bpf/bench.c
@@ -280,6 +280,7 @@ extern struct argp bench_strncmp_argp;
 extern struct argp bench_hashmap_lookup_argp;
 extern struct argp bench_local_storage_create_argp;
 extern struct argp bench_htab_mem_argp;
+extern struct argp bench_trigger_batch_argp;
 
 static const struct argp_child bench_parsers[] = {
 	{ &bench_ringbufs_argp, 0, "Ring buffers benchmark", 0 },
@@ -292,6 +293,7 @@ static const struct argp_child bench_parsers[] = {
 	{ &bench_hashmap_lookup_argp, 0, "Hashmap lookup benchmark", 0 },
 	{ &bench_local_storage_create_argp, 0, "local-storage-create benchmark", 0 },
 	{ &bench_htab_mem_argp, 0, "hash map memory benchmark", 0 },
+	{ &bench_trigger_batch_argp, 0, "BPF triggering benchmark", 0 },
 	{},
 };
 
@@ -508,6 +510,15 @@ extern const struct bench bench_trig_fexit;
 extern const struct bench bench_trig_fentry_sleep;
 extern const struct bench bench_trig_fmodret;
 
+/* batched, staying mostly in-kernel benchmarks */
+extern const struct bench bench_trig_kernel_count;
+extern const struct bench bench_trig_kprobe_batch;
+extern const struct bench bench_trig_kretprobe_batch;
+extern const struct bench bench_trig_kprobe_multi_batch;
+extern const struct bench bench_trig_kretprobe_multi_batch;
+extern const struct bench bench_trig_fentry_batch;
+extern const struct bench bench_trig_fexit_batch;
+
 /* uprobe/uretprobe benchmarks */
 extern const struct bench bench_trig_uprobe_nop;
 extern const struct bench bench_trig_uretprobe_nop;
@@ -548,7 +559,7 @@ static const struct bench *benchs[] = {
 	&bench_rename_fexit,
 	/* pure counting benchmarks for establishing theoretical limits */
 	&bench_trig_usermode_count,
-	&bench_trig_base,
+	&bench_trig_kernel_count,
 	/* syscall-driven triggering benchmarks */
 	&bench_trig_tp,
 	&bench_trig_rawtp,
@@ -560,6 +571,13 @@ static const struct bench *benchs[] = {
 	&bench_trig_fexit,
 	&bench_trig_fentry_sleep,
 	&bench_trig_fmodret,
+	/* batched, staying mostly in-kernel triggers */
+	&bench_trig_kprobe_batch,
+	&bench_trig_kretprobe_batch,
+	&bench_trig_kprobe_multi_batch,
+	&bench_trig_kretprobe_multi_batch,
+	&bench_trig_fentry_batch,
+	&bench_trig_fexit_batch,
 	/* uprobes */
 	&bench_trig_uprobe_nop,
 	&bench_trig_uretprobe_nop,
@@ -567,6 +585,7 @@ static const struct bench *benchs[] = {
 	&bench_trig_uretprobe_push,
 	&bench_trig_uprobe_ret,
 	&bench_trig_uretprobe_ret,
+	/* ringbuf/perfbuf benchmarks */
 	&bench_rb_libbpf,
 	&bench_rb_custom,
 	&bench_pb_libbpf,