Benchmarks and instrumentation #395
Mainly just using dtrace; these assume you've started a workload elsewhere for now...
That was a... slightly frustrating exercise in struct and scope drop order.
I looked into flamegraph-rs / inferno, but the former didn't seem to provide any mechanism for actually filtering the stacks down the way I need? Unsure.
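That said, one workable approach seems to be sitting between inferno's collapse and render stages and filtering the folded lines yourself. A rough sketch of that shape (the file names and the `xde_rx` filter string are illustrative assumptions, not code from this PR):

```rust
// Sketch only: filter folded stacks between inferno's collapse and
// render stages. `stacks.out`, `filtered.svg`, and `xde_rx` are
// illustrative names.
use std::io::BufReader;

use inferno::collapse::dtrace::Folder;
use inferno::collapse::Collapse;
use inferno::flamegraph::{self, Options as FlameOptions};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // 1. Collapse raw dtrace stack output into folded "a;b;c count" lines.
    let raw = std::fs::File::open("stacks.out")?;
    let mut folded = Vec::new();
    let mut folder = Folder::default();
    folder.collapse(BufReader::new(raw), &mut folded)?;

    // 2. Keep only the stacks passing through the function of interest.
    let folded = String::from_utf8(folded)?;
    let keep: Vec<&str> = folded
        .lines()
        .filter(|line| line.contains("xde_rx"))
        .collect();

    // 3. Render just those stacks as an SVG flamegraph.
    let mut svg = std::fs::File::create("filtered.svg")?;
    flamegraph::from_lines(&mut FlameOptions::default(), keep, &mut svg)?;
    Ok(())
}
```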
Now just need to fit in lockstat...
We're in a better place with this now that I've had the time to dive back in! I have box-to-box iperf instrumentation working over the out-ports of a pair of Intel I350-T2 NICs (2 x 1GbE), so we can meaningfully separate Rx and Tx processing for each direction of iperf traffic. Thanks to #381, I'm getting (typically) 1.6–1.75 Gbps, though I do have to set around […]

So, we get nice graphs like this of aggregate packet processing times (receive of large TCP segments):

[graph image]

...and nice demangled flamegraphs like these:

[flamegraph images]

(GitHub seems to have broken the interactivity when viewed on private-user-images.githubusercontent.com, but zooming into actual functions works if you save them to disk and reopen 🤷) Since these are actually box-to-box, they should be higher fidelity than the initial prototype. Still to do: […]
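For context on where those histograms come from: the collection is dtrace-based, roughly an fbt entry/return pair feeding a `quantize()` aggregation, which the tooling shells out to. A minimal sketch of that flavor (the `xde_rx` probe point and the shape of the script are illustrative assumptions, not the actual kbench scripts):

```rust
// Sketch only: the rough flavor of dtrace-driven rx timing collection.
// `xde_rx` is an illustrative probe point; the real scripts are more
// involved and split timings per direction/port.
use std::process::Command;

const RX_TIMING: &str = r#"
fbt::xde_rx:entry { self->ts = timestamp; }
fbt::xde_rx:return
/self->ts/
{
    @rx_ns = quantize(timestamp - self->ts);
    self->ts = 0;
}
"#;

fn main() -> std::io::Result<()> {
    // dtrace prints the aggregated histogram when interrupted.
    let status = Command::new("dtrace").args(["-n", RX_TIMING]).status()?;
    println!("dtrace exited with {status}");
    Ok(())
}
```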
An initial high-level observation is that we should run all the benchmarks we can in CI to maintain history. The […] Update: we should be able to run […]
There are many great tests in […]
Toward the tail end of […]:

```
Warning: It is not recommended to reduce nresamples below 1000.
Benchmarking parse/Hairpin-DHCPv4/alloc_ct/Discover: Warming up for 1.0000 ns
Warning: Unable to complete 10 samples in 10.0µs. You may wish to increase target time to 155.1µs.
Benchmarking parse/Hairpin-DHCPv4/alloc_ct/Discover: Analyzing
Criterion.rs ERROR: At least one measurement of benchmark parse/Hairpin-DHCPv4/alloc_ct/Discover took zero time per iteration. This should not be possible. If using iter_custom, please verify that your routine is correctly measured.
```
Thanks for giving this a test.
I think that should be doable; it would be really nice to host the results on buildomat. I'll need to dust off my gnuplot-fu for the candlestick graphs.
I've tweaked the config to the point that I can silence all but the 'zero time per iteration' warnings. I'll dig into that. EDIT: it's specifically the […]
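For reference, the kind of tuning involved is per-group `criterion` settings along these lines (a sketch; the group/bench names and the bench body are placeholders):

```rust
// Sketch only: per-group criterion tuning for very fast routines.
use std::time::Duration;

use criterion::{criterion_group, criterion_main, Criterion};

fn parse_benches(c: &mut Criterion) {
    let mut group = c.benchmark_group("parse");
    // Criterion warns if nresamples drops below 1000, so stay at/above it.
    group.nresamples(1000);
    // Sub-microsecond routines need a meaningful warm-up/measurement
    // window, or criterion can't fit its requested sample count.
    group.warm_up_time(Duration::from_millis(200));
    group.measurement_time(Duration::from_secs(1));
    group.bench_function("Hairpin-DHCPv4/Discover", |b| {
        b.iter(|| {
            // ... parse/process one packet here ...
            std::hint::black_box(0u8)
        });
    });
    group.finish();
}

criterion_group!(benches, parse_benches);
criterion_main!(benches);
```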
Still working my way through this, but wanted to post my comments on taking the `ubench` and `kbench` machinery for a spin.
The CI report archive looks great. Thanks for the really great starting point with benchmarks and instrumentation here!
Thanks; I'm going to open some issues around the remaining nice-to-haves/bugbears so we don't lose track:
* a4c956e chore(deps): lock file maintenance (oxidecomputer/opte#474)
* 6847b04 chore(deps): update rust crate nix to 0.28 (oxidecomputer/opte#473)
* c30aa36 Benchmarks and instrumentation (oxidecomputer/opte#395)
* 53201de Fix xtask non-package install for fresh systems. (oxidecomputer/opte#470)
* 7e56634 Correctly preserve inbound UFID on outbound TCP close (oxidecomputer/opte#469)
This PR provides two new benchmarking tools locally scoped to the repository for measuring how XDE/OPTE performs on different test setups. These produce a mixture of flame graphs and histograms/stats of rx/tx timing information.
- `cargo ubench`. This is a standard set of microbenchmarks using `criterion` in the expected way, which measures individual packet parse and processing times for various classes of packet: modified ULP traffic and hairpin packets. In addition, it measures the number and total size of memory allocations made for each packet class. This uses the standard `criterion` CLI, so it offers full support for saving/comparing baselines and filtering down to specific benchmarks. These are designed to run on any compatible development machine.
- `cargo kbench`. This is explicitly used for measuring the XDE kernel module in real Helios deployments using dtrace, and is something of a swiss-army knife for setting up zone-based deployments. Outputs are fed into `criterion` for more robust comparison of statistics on rx/tx timing, and flamegraphs are built in parallel. It operates in a few modes:
  - `cargo kbench local`. This spins up an identical topology to the existing XDE tests, runs iperf between the two nodes, and collects statistics.
  - `cargo kbench remote`/`cargo kbench server`. These commands allow two physical machines on the same lab network to run iperf over a physical underlay, allowing better measurement of rx/tx timings on one node.
  - `cargo kbench in-situ`. This produces identical statistics/flamegraphs to the above without building any topology or spawning a workload. This is intended for, e.g., taking live measurements from a gimlet.

There are some caveats to `ubench` today that would need extra engineering work to be spent on `criterion`. From what I can tell, saving of reports (and generating plots) for elements with zero variance, such as allocation counts/sizes, is not an intended use case, so there are some deep `NaN` propagation issues preventing those results from being persisted and compared/saved as baselines. This only affects the memory statistics, which can be rerun very quickly using a filter such as `cargo ubench alloc_ct` or `cargo ubench alloc_sz` (a sketch of the general shape of that measurement is below).

Will close #392.
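For illustration, the general shape of this kind of allocation measurement is a counting `GlobalAlloc` wrapper reporting totals through `iter_custom`. A minimal sketch, with all names (`CountingAlloc`, `ALLOC_CT`, the packet stub) hypothetical rather than opte's actual implementation:

```rust
// Sketch only: a counting global allocator whose per-iteration totals are
// reported to criterion via `iter_custom` as Durations ("1 ns per
// allocation"). A measurement of zero allocations then surfaces as the
// "zero time per iteration" error quoted above.
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::Duration;

static ALLOC_CT: AtomicU64 = AtomicU64::new(0);
static ALLOC_SZ: AtomicU64 = AtomicU64::new(0);

struct CountingAlloc;

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOC_CT.fetch_add(1, Ordering::Relaxed);
        ALLOC_SZ.fetch_add(layout.size() as u64, Ordering::Relaxed);
        System.alloc(layout)
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        System.dealloc(ptr, layout)
    }
}

#[global_allocator]
static ALLOC: CountingAlloc = CountingAlloc;

// In a benchmark body: count allocations across the requested iterations
// and hand the total back to criterion dressed up as a Duration.
fn measure_alloc_ct(b: &mut criterion::Bencher<'_>) {
    b.iter_custom(|iters| {
        let before = ALLOC_CT.load(Ordering::Relaxed);
        for _ in 0..iters {
            // ... parse/process one packet here ...
            std::hint::black_box(Box::new([0u8; 64]));
        }
        let after = ALLOC_CT.load(Ordering::Relaxed);
        Duration::from_nanos(after - before)
    });
}
```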