Profiling tail latency events #1738

RainM · 2025-02-06T20:34:59Z

Hello,

I think many people use Aeron in latency-sensitive applications, and it is a constant problem to profile rare events that contribute a lot to tail latency but do not significantly affect the average profile.
I have an idea for how to address this. Imagine that Aeron's event loop writes the current timestamp (a heartbeat) to a memory-mapped file (the heartbeat file) at the beginning of every poll. A profiler can then be modified to read this timestamp (also via the memory-mapped file) and record a sample only if the sample's timestamp is significantly greater than the timestamp of the poll start. If the poll started recently, the sample is skipped. In general, this is similar to 'time-to-safepoint' profiling.
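To make the heartbeat idea concrete, here is a minimal sketch of the writer side using only standard Java NIO. The class name `HeartbeatFile`, its methods, and the single-long file layout are assumptions for illustration; they are not part of Aeron, Agrona, or async-profiler.

```java
import java.io.IOException;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper: a memory-mapped file holding one long, the start time of the current poll.
final class HeartbeatFile {
    // View of a ByteBuffer as longs, so the timestamp can be written with release semantics.
    private static final VarHandle LONG_VIEW =
        MethodHandles.byteBufferViewVarHandle(long[].class, ByteOrder.nativeOrder());

    private final ByteBuffer buffer;

    HeartbeatFile(final Path path) throws IOException {
        try (FileChannel channel = FileChannel.open(
            path, StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            // The mapping stays valid after the channel is closed.
            buffer = channel.map(FileChannel.MapMode.READ_WRITE, 0, Long.BYTES);
        }
    }

    // Called by the event loop at the beginning of every poll.
    void onPollStart(final long nowNs) {
        LONG_VIEW.setRelease(buffer, 0, nowNs);
    }

    // What a reader of the same mapping would observe.
    long lastPollStartNs() {
        return (long) LONG_VIEW.getAcquire(buffer, 0);
    }
}
```

An event loop would call `onPollStart(System.nanoTime())` before each poll. This assumes the profiler reads a clock that is comparable with the one the application writes, which needs to be checked for the actual setup.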

I've implemented this for async-profiler (not upstreamed yet).

Do you think this approach would help with profiling tail latency issues? What would you think if I contributed such a thing (a heartbeat agent) to Agrona/Aeron? I'm finishing the PR to async-profiler, and feedback from the Aeron side would be highly valuable as well.

vyazelenko · 2025-02-06T20:54:36Z

@RainM Are you aware of the duty cycle stall tracking that is implemented for all Aeron components? There are two counters per component: one tracks the max duty cycle time ever recorded, whereas the other counts the number of times a threshold has been breached (e.g. MediaDriver.Context#senderCycleThresholdNs()).

For example, have a look at the Sender#doWork and DutyCycleStallTracker.

vad0 · 2025-02-08T06:20:02Z

When DutyCycleStallTracker reports a slow cycle it is already too late: the cycle has ended, and we have no way to find out why it was slow. The idea is to store DutyCycleStallTracker.timeOfLastUpdateNs in an off-heap variable. Async-profiler will read this variable before collecting a stack trace and compute deltaNs = asyncProfilerNowNs - timeOfLastUpdateNs. If deltaNs < thresholdNs, the stack trace is not collected. If deltaNs > thresholdNs, the stack trace is collected. If we set thresholdNs to the 0.99 latency percentile value, we will completely ignore fast cycles and collect stack traces only for the slow ones.
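The decision rule described above can be sketched in a few lines. This is only an illustration of the logic in Java; `SlowCycleFilter` and its method are hypothetical names, and a real check would live inside the profiler itself. The case deltaNs == thresholdNs is not specified in the description, so this sketch assumes such a sample is skipped.

```java
// Hypothetical filter: decides whether a profiler sample belongs to a slow duty cycle.
final class SlowCycleFilter {
    private final long thresholdNs;

    SlowCycleFilter(final long thresholdNs) {
        this.thresholdNs = thresholdNs;
    }

    // profilerNowNs: the time at which the profiler is about to take a sample.
    // timeOfLastUpdateNs: the cycle start time published by the application.
    boolean shouldRecord(final long profilerNowNs, final long timeOfLastUpdateNs) {
        final long deltaNs = profilerNowNs - timeOfLastUpdateNs;
        // Assumption: a sample exactly at the threshold is treated as belonging to a fast cycle.
        return deltaNs > thresholdNs;
    }
}
```

For example, with a threshold of 1,000 ns, a sample taken 500 ns into a cycle is dropped, while one taken 2,000 ns into a cycle is kept.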

RainM · 2025-02-09T23:52:47Z

Yep, the key thing is to share knowledge between the application and the profiler about whether a poll is slow or not.

One step back. If we want to get a profile of tail events, we need to either:

- gather the full profile, mark it up with the starts and ends of polls, and filter out all the fast polls; or
- tell the profiler whether the poll is slow or fast at the moment it wants to record a sample. The target state here is to keep all samples from the poll start and then either discard them (if the poll was fast) or record them (if the poll was slow). But this requires significant support on the profiler side. At the very least, the profiler has to hold on to recorded samples while it is not yet known whether the poll is fast or slow, which requires some temporary storage for those samples. It is a big change and a lot of work on the profiler side.
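To show why the second option needs temporary storage, here is a rough sketch of the buffer-then-decide flow. `BufferedPollSampler` and its callbacks are hypothetical, and stack traces are represented as plain strings; an actual profiler would need a far more careful, allocation-free design.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sampler: holds samples until the poll ends, then keeps or drops them.
final class BufferedPollSampler {
    private final long thresholdNs;
    private final List<String> pending = new ArrayList<>();  // samples of the poll in progress
    private final List<String> recorded = new ArrayList<>(); // samples kept from slow polls
    private long pollStartNs;

    BufferedPollSampler(final long thresholdNs) {
        this.thresholdNs = thresholdNs;
    }

    void onPollStart(final long nowNs) {
        pollStartNs = nowNs;
        pending.clear();
    }

    // Every sample is stored, because it is not yet known whether the poll will be slow.
    void onSample(final String stackTrace) {
        pending.add(stackTrace);
    }

    // Only at the end of the poll can the samples be kept (slow poll) or discarded (fast poll).
    void onPollEnd(final long nowNs) {
        if (nowNs - pollStartNs > thresholdNs) {
            recorded.addAll(pending);
        }
        pending.clear();
    }

    List<String> recorded() {
        return recorded;
    }
}
```

The `pending` list is the temporary storage mentioned above. Keeping it safe and cheap inside a profiler's sampling path is the part that makes this option expensive to implement.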

What I'd like is to simplify the application-profiler interface: it is just one long integer variable, and it is pretty straightforward for the profiler to work out whether a poll is slow or fast.

Back to your answer. Yep, I know DutyCycleStallTracker. But knowing that the current poll was slow does not give you its profile.
