
FEX: Implements new sampling-based stats #4291

Merged: 9 commits, Feb 11, 2025

Conversation

Sonicadvance1 (Member)

The ftrace/gpuviz path has some usability issues: it can't be used in high-frequency locations like the SIGBUS handler, and it is hard to correlate stutters in the game with a gpuviz trace after the fact.

To address both problems, I have written this small SHM implementation of statistics that is /virtually/ free on the FEX side and requires the reader to accumulate the data correctly.

Some design decisions (see the layout sketch below):

  • All atomics are weakly memory-ordered, requiring memory barriers and mutexes in the appropriate locations
    • This improves performance since it avoids atomic overheads entirely (which FEX already has issues with)
    • This means all stats are per-thread and the accumulation has to happen on the reader side
  • The stats SHM region can never shrink, since shrinking would cause problems for readers
  • The stats are in a singly linked list because the expectation is low thread counts
  • The stats are in a single allocation region, which allows us to expand the stats if the number of threads grows
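To make the design concrete, here is a minimal layout sketch under stated assumptions: the type names ThreadStatsHeader and ThreadStats appear in the real code, but the fields shown here are hypothetical. Slots reference each other by offset rather than pointer, since writer and reader map the SHM region at different addresses:

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical layout; the real FEXCore::Profiler structs differ.
struct ThreadStatsHeader {
  std::atomic<uint32_t> Version;    // Bumped whenever the layout changes
  std::atomic<uint32_t> HeadOffset; // Offset of the first slot; 0 if the list is empty
};

struct ThreadStats {
  std::atomic<uint32_t> NextOffset; // Singly linked list by offset; 0 terminates
  std::atomic<uint32_t> TID;        // Owning thread; 0 marks a free slot
  // Per-thread counters, written only by the owning thread with
  // std::memory_order_relaxed; the reader accumulates across slots.
  std::atomic<uint64_t> JITCycles;
  std::atomic<uint64_t> SignalHandlingCycles;
  std::atomic<uint64_t> SIGBUSCount;
  std::atomic<uint64_t> SMCCount;
};
```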

This is implemented for Linux and WINE, so we can use it everywhere.

This gets us some nice stat graphs like the following:
[Screenshot 2025-01-21: MangoHud overlay showing the new FEX stats]

We only have four stats currently, but more can definitely be added as time progresses.
The MangoHud patches are forthcoming once I clean them up.

@bylaws (Collaborator)

bylaws commented Jan 22, 2025

I'll make a fixup to replace the wine support with something simpler in a bit.

@neobrain (Member) left a comment

Gave this a first look - heavy stuff, but gathering and visualizing these metrics is indeed useful functionality to have.

Will mangohud need to maintain a fixed list of FEX-specific metrics (and update them if we add any) or will it automatically pick up new metrics added in FEX?

Is this designed to support accumulation of metrics across processes as well or is it solely for the process running mangohud?

The current implementation leaves me uncomfortable, since there's only so much faith one can have in reliability when interleaving weakly-ordered atomics, shared memory, and a custom linked-list implementation. Well-placed, tested abstractions and more explanatory docs would help support that faith.

More generally, can I ask what's the benefit of a MangoHud-specific approach compared to an established nanosecond-resolution live-profiler like Tracy? Obviously MangoHud is great for general stats like FPS and CPU load, but FEX's metrics are very specific and I'm not sure how valuable having a short window of these is. Most metrics could probably even be reported via ftrace despite its overhead (i.e. less rapidly updated metrics, or those that relate to costly SIGBUS/SMC events anyway), which could give us free multi-process awareness without bespoke accumulation logic in external projects.

#endif

namespace FEX::Profiler {
class StatAllocBase {
Member:

There's some heavy logic in here, so this should explain what it's doing.

Comment on lines +49 to +53
FEXCore::Profiler::ThreadStatsHeader* Head {};
FEXCore::Profiler::ThreadStats* Stats;
FEXCore::Profiler::ThreadStats* StatTail {};
Member:

Why are you preferring a custom-implemented linked list over flat memory? This seems neither more convenient for the implementation nor like it would improve performance.

Member Author:

Sparsity in the allocation buffer is a concern, so this is how I'm keeping the implementation.

Member:

Could you elaborate on what exactly the problem is?

Member Author:

The reader side would need to walk the entire allocated buffer space to ensure it doesn't miss any slots, instead of the average case of a hundred or so.
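For illustration, a reader-side walk under the hypothetical layout sketched in the PR description visits only live slots by following the offset links, whereas a flat buffer would have to scan every allocated slot to find the active ones. AccumulateSIGBUS is an invented name:

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// Reuses the hypothetical ThreadStatsHeader/ThreadStats layout sketched
// earlier; Base is the start of the mapped SHM region.
uint64_t AccumulateSIGBUS(const std::byte* Base) {
  const auto* Header = reinterpret_cast<const ThreadStatsHeader*>(Base);
  uint64_t Total = 0;
  // Follow the offset links: only live slots are visited.
  uint32_t Offset = Header->HeadOffset.load(std::memory_order_acquire);
  while (Offset != 0) {
    const auto* Slot = reinterpret_cast<const ThreadStats*>(Base + Offset);
    Total += Slot->SIGBUSCount.load(std::memory_order_relaxed);
    Offset = Slot->NextOffset.load(std::memory_order_acquire);
  }
  return Total;
}
```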

Member:

What scenario exactly would trigger this? I don't know what your reader side looks like, but it sounds like a bitmask and/or active slot count would easily resolve this.

Member Author:

I'm just going to leave it as a linked list. We can rev the version number and change it to a bitmask later if someone feels up to writing that code.

@neobrain (Member) Jan 22, 2025:

If the choice between linked lists and flat memory doesn't seem to be so important, why is the focus on weakly-ordered atomics considered critical?

(I don't agree that using a custom data structure here is the way forward, since it makes up a good chunk of the complexity in this PR; but on top of that, the performance targets don't seem consistent here as-is.)

Member Author:

For the atomics, I don't want the writer side to be affected by cacheline sharing problems at all. Semantically these are two independent things: the contained data (FEXCore::Profiler::ThreadStats) and the mechanism for walking the list of data.

Member:

I understand they are different things, but it seems arbitrary to want this (with no explicit motivation) while using a data structure with poor performance characteristics in the same module.

Member Author:

Indeed, which is why we leave some of the "poor performance" problems on the reader side, which has something like 500 ms to sample the data, even though the walk is going to take hundreds of nanoseconds regardless.

}

// Find a free slot
memory_barrier();
Member:

Compared to the cost of the profiled events (SIGBUS/SMC), is it really worth dealing with the added complexity here instead of just using stronger atomics? Also, isn't thread creation/destruction in particular rare enough that we could use strong memory ordering on slot management variables while leaving the stats themselves weakly ordered?

Member Author:

False sharing in the hot path is a significant concern, so I'm keeping them as weak atomics.
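A sketch of that concern and of the hot-path write, under assumptions: PaddedSlot and CountSIGBUS are invented names, and alignas(64) assumes a 64-byte cacheline:

```cpp
#include <atomic>
#include <cstdint>

// Pad each thread's slot out to a cacheline so one thread's updates never
// share a line with another thread's slot or with the links the reader
// walks. Illustrative only.
struct alignas(64) PaddedSlot {
  std::atomic<uint64_t> SIGBUSCount;
  // ... remaining per-thread counters ...
};

// Hot path: each slot has exactly one writer (its owning thread), so a
// relaxed load/store pair suffices; no read-modify-write, no fences.
inline void CountSIGBUS(PaddedSlot& Slot) {
  Slot.SIGBUSCount.store(Slot.SIGBUSCount.load(std::memory_order_relaxed) + 1,
                         std::memory_order_relaxed);
}
```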

@neobrain (Member) Jan 22, 2025:

Where's the hot path? The currently traced events are orders of magnitude more expensive to process than potential cache behavior effects.

This applies even more to thread creation/destruction, which is a rare event on top of that.

Member Author:

Yes, we have some leniency on thread creation and teardown, which is already slow and causes global locks to be taken, but this is the way I have decided to implement it.

@Sonicadvance1 (Member Author)

https://github.com/Sonicadvance1/MangoHud/tree/fex
For the WIP FEX support in mangohud.

Sonicadvance1 added a commit to Sonicadvance1/FEX that referenced this pull request Jan 23, 2025
When running on a system with a 48-bit VA, if FEX does any allocations
between us reserving the upper 128TB and the application running, then
/technically/ we are intersecting with the application's memory region
in the lower 47-bits.

This didn't typically result in any problems due to how ASLR works, but
if we do any large allocations (like the 128MB VMA region FEX-Emu#4291
wants) then these typically get pushed higher in the VA space.

Again not usually a problem, but if you happen to be running an
application that is using MAP_FIXED with hardcoded addresses then this
can stomp over FEX-Emu memory causing problems.

This is what happens with Wine: it reserves the upper 32MB of its 47-bit
VA space, which is /highly/ likely to stomp on FEX memory. In fact, it
likely occurs all the time; we just got lucky that whatever it was
clobbering wasn't in use at the time.

On 39-bit VA systems this isn't a problem because the mmap fails
outright with a warning message from WINE.

Because we are already reserving the upper 128TB of VA space, instead
just always enable our allocator and use the regions that were reserved.
We need to be a little bit careful to ensure we don't accidentally
allocate more memory post-reservation, but that just requires a small
adjustment to our unique_ptr and constructor for the 64BitAllocator.

This means /all/ FEX-Emu allocations will be in the upper 128TB VA space
when running 64-bit applications on a 48-bit VA system, which is kind of
nice.

Fixes WINE in FEX-Emu#4291 when the allocator stats are bumped to 128MB per
process.
Sonicadvance1 added a commit to Sonicadvance1/FEX that referenced this pull request Feb 7, 2025
Based on FEX-Emu#4291 and FEX-Emu#4324. Ideally this gets merged at the same time so
we can have Mangohud be on version 2 before giving them an upstream
patch.

Performance-wise this change falls within noise of my x87 microbench.

This just lets us track the number of float fallbacks FEX does, showing
things like x87 fallbacks and how frequent they are, so we can detect
whether a game might be slow or stuttering because of them.
@bylaws (Collaborator)

bylaws commented Feb 7, 2025

I would err towards only supporting the wine side with the fex patch without a solid usecase for the questionable path

@Sonicadvance1 (Member Author)

> I would err towards only supporting the wine side with the fex patch without a solid usecase for the questionable path

Sure, ripped it out.

Not wired up, just the definitions so it lives in the
InternalThreadState.

We want this accessible from both FEXCore and the frontends so it needs
to live there.

Two types of events are supported: scoped cycle counts and instant
increments.

This gives us JIT time and signal handling time, plus counts of SIGBUS
and SMC events.

All useful statistics for seeing stutter live.
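A hedged sketch of what those two event types could look like; ScopedCycleCounter and ReadCycleCounter are invented names, CNTVCT_EL0 is the AArch64 virtual counter register, and FEX's actual helpers may differ:

```cpp
#include <atomic>
#include <cstdint>

// Reading the cycle counter is platform-specific; this helper is
// illustrative, not FEX's actual one.
static inline uint64_t ReadCycleCounter() {
#if defined(__aarch64__)
  uint64_t Value;
  __asm__ volatile("mrs %0, CNTVCT_EL0" : "=r"(Value));
  return Value;
#else
  return __builtin_ia32_rdtsc();
#endif
}

// Scoped cycle count: accumulates the cycles spent in a region (JIT
// compilation, a signal handler, ...) into a per-thread counter on exit.
class ScopedCycleCounter final {
public:
  explicit ScopedCycleCounter(std::atomic<uint64_t>& Counter)
    : Counter {Counter}, Start {ReadCycleCounter()} {}
  ~ScopedCycleCounter() {
    // Single-writer slot, so a relaxed load/store is sufficient.
    Counter.store(Counter.load(std::memory_order_relaxed) +
                      (ReadCycleCounter() - Start),
                  std::memory_order_relaxed);
  }
private:
  std::atomic<uint64_t>& Counter;
  uint64_t Start;
};

// An instant increment is just the relaxed counter bump shown earlier.
```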
For the four things we care about
Not wired up to anything. Requires the frontends to allocate shared
memory in the expected way.
This is fairly straightforward. It creates the shared memory region in
/dev/shm/fex-<pid>-stats so that Mangohud can sample it.
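As a sketch of the reader side under the assumption of a plain POSIX shared-memory object (MapFEXStats is an invented name; only the /dev/shm/fex-<pid>-stats path comes from the commit above):

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#include <cstddef>
#include <cstdio>

// Map the per-process stats region read-only. shm_open() resolves the
// name under /dev/shm, giving /dev/shm/fex-<pid>-stats.
void* MapFEXStats(pid_t PID, size_t& SizeOut) {
  char Name[64];
  std::snprintf(Name, sizeof(Name), "/fex-%d-stats", PID);
  int FD = shm_open(Name, O_RDONLY, 0);
  if (FD == -1) {
    return nullptr;
  }

  struct stat Info {};
  if (fstat(FD, &Info) == -1) {
    close(FD);
    return nullptr;
  }

  // The region can grow but never shrinks; a real reader would fstat()
  // periodically and remap if the file grew.
  void* Base = mmap(nullptr, Info.st_size, PROT_READ, MAP_SHARED, FD, 0);
  close(FD); // The mapping remains valid after the descriptor is closed
  if (Base == MAP_FAILED) {
    return nullptr;
  }
  SizeOut = Info.st_size;
  return Base;
}
```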
This is a little trickier: we open the
`/dev/shm/fex-<pid>-stats` file directly using Windows APIs so that
Mangohud (which is going to be on the Linux side, or potentially even
embedded into Gamescope) can safely pick up the stats.

A little quirky, and it doesn't support expanding its size since WINE
doesn't support NtExtendSection, but that's fine.
@lioncash (Contributor) left a comment

Just one super small nit

Fall back to the previous path if it doesn't exist.
@lioncash lioncash merged commit 6a39a8d into FEX-Emu:main Feb 11, 2025
12 checks passed
@Sonicadvance1 Sonicadvance1 deleted the profile_stats branch February 11, 2025 21:51
Sonicadvance1 added a commit to Sonicadvance1/FEX that referenced this pull request Feb 11, 2025
Based on FEX-Emu#4291 and FEX-Emu#4324. Ideally this gets merged at the same time so
we can have Mangohud be on version 2 before giving them an upstream
patch.

Performance-wise this change falls within noise of my x87 microbench.

This just lets us track the number of float fallbacks FEX does, letting
us detect things like x87 fallbacks and how frequent they are, so we can
detect if a game might be slow or stuttering because of these fallbacks.
Sonicadvance1 added a commit to Sonicadvance1/FEX that referenced this pull request Feb 11, 2025
Based on FEX-Emu#4291 and FEX-Emu#4324. Ideally this gets merged at the same time so
we can have Mangohud be on version 2 before giving them an upstream
patch.

Performance-wise this change falls within noise of my x87 microbench.

This just lets us track the number of float fallbacks FEX does, letting
us detect things like x87 fallbacks and how frequent they are, so we can
detect if a game might be slow or stuttering because of these fallbacks.
Sonicadvance1 added a commit to Sonicadvance1/MangoHud that referenced this pull request Feb 16, 2025
This implements support for FEX-Emu's profiling stats interface behind a
command-line option that isn't enabled by default.

The purpose of these statistics in FEX-Emu is to expose statistics that
are high frequency and can't easily be collected in other ways due to
overhead. The reason they are exposed in MangoHud rather than another
path is to have that information readily available while running a game.
It's sometimes difficult to understand why a game has stuttered under FEX
and to attribute the stutter to FEX-Emu overheads; this lets us directly
relate these stats to frame time drops in the game.

The attached example is a good indicator of why a game is having low
performance and how the stats look. We can see in the image that the
game (Celeste) is only running at 55 FPS with a CPU core pegged at 100%,
and the FEX stats let us know that 33.7 million soft-float operations are
happening per second, ~600k per frame.

The implementation is already in FEX-Emu upstream
(FEX-Emu/FEX#4291) and we aren't expecting it to change heavily,
potentially just adding additional sample events. We version these stats
so that if they change in the future, the interface doesn't break on one
side versus the other.
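As a sketch of the accumulation a reader does: the counters are cumulative, so MangoHud-style tooling would sample deltas per frame or per period. At 55 FPS, 33.7 million ops per second works out to roughly 33.7e6 / 55 ≈ 612k ops per frame. CounterSampler is an invented name:

```cpp
#include <cstdint>

// Turn a cumulative counter into per-interval figures by sampling it
// each frame (or at a fixed period) and reporting the delta.
struct CounterSampler {
  uint64_t Last {};

  // Returns how much the cumulative counter advanced since the last call.
  uint64_t Sample(uint64_t Current) {
    const uint64_t Delta = Current - Last;
    Last = Current;
    return Delta;
  }
};
```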
flightlessmango pushed a commit to flightlessmango/MangoHud that referenced this pull request Feb 24, 2025