
FEX: Implements new sampling-based stats #4291

Merged: 9 commits, Feb 11, 2025

Conversation

Sonicadvance1 (Member)

The ftrace/gpuviz path has some usability issues: it can't be used in high-frequency locations like the SIGBUS handler, and it is hard to correlate stutters in the game with a gpuviz trace after the fact.

To address both problems, I have written this small SHM implementation of statistics that is /virtually/ free on the FEX side and requires the reader to accumulate the data correctly.

Some design decisions (see the layout sketch below):

  • All atomics are weakly memory-ordered, requiring memory barriers and mutexes in the appropriate locations
    • This improves performance since it avoids atomic overheads entirely (which FEX already has issues with)
    • This means all stats are per-thread and the accumulation has to happen on the reader side
  • The stats SHM region can never shrink, since shrinking would cause problems for readers
  • The stats are in a singly linked list because the expectation is low thread counts
  • The stats are in a single allocation region, which allows us to expand the stats if the number of threads grows
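To make the design concrete, here is a minimal layout sketch under stated assumptions: the type names ThreadStatsHeader and ThreadStats appear in the real code, but the fields shown here are hypothetical. Slots reference each other by offset rather than pointer, since writer and reader map the SHM region at different addresses:

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical layout; the real FEXCore::Profiler structs differ.
struct ThreadStatsHeader {
  std::atomic<uint32_t> Version;    // Bumped whenever the layout changes
  std::atomic<uint32_t> HeadOffset; // Offset of the first slot; 0 if the list is empty
};

struct ThreadStats {
  std::atomic<uint32_t> NextOffset; // Singly linked list by offset; 0 terminates
  std::atomic<uint32_t> TID;        // Owning thread; 0 marks a free slot
  // Per-thread counters, written only by the owning thread with
  // std::memory_order_relaxed; the reader accumulates across slots.
  std::atomic<uint64_t> JITCycles;
  std::atomic<uint64_t> SignalHandlingCycles;
  std::atomic<uint64_t> SIGBUSCount;
  std::atomic<uint64_t> SMCCount;
};
```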

This is implemented for Linux and WINE, so we can use it everywhere.

This gets us some nice stat graphs like the following:
[Screenshot 2025-01-21: MangoHud overlay showing the new FEX stats]

We only have four stats currently, but more can definitely be added as time progresses.
The MangoHud patches are forthcoming once I clean them up.

@bylaws (Collaborator)

bylaws commented Jan 22, 2025

I'll make a fixup to replace the wine support with something simpler in a bit.

@neobrain (Member) left a comment

Gave this a first look - heavy stuff, but gathering and visualizing these metrics is indeed useful functionality to have.

Will mangohud need to maintain a fixed list of FEX-specific metrics (and update them if we add any) or will it automatically pick up new metrics added in FEX?

Is this designed to support accumulation of metrics across processes as well or is it solely for the process running mangohud?

The current implementation leaves me uncomfortable, since there's only so much faith one can have in reliability when interleaving weakly-ordered atomics, shared memory, and a custom linked-list implementation. Well-placed, tested abstractions and more explanatory docs would help support that faith.

More generally, can I ask what's the benefit of a MangoHud-specific approach compared to an established nanosecond-resolution live-profiler like Tracy? Obviously MangoHud is great for general stats like FPS and CPU load, but FEX's metrics are very specific and I'm not sure how valuable having a short window of these is. Most metrics could probably even be reported via ftrace despite its overhead (i.e. less rapidly updated metrics, or those that relate to costly SIGBUS/SMC events anyway), which could give us free multi-process awareness without bespoke accumulation logic in external projects.

#endif

namespace FEX::Profiler {
class StatAllocBase {
Member:

There's some heavy logic in here, so this should explain what it's doing.

Comment on lines +49 to +53
FEXCore::Profiler::ThreadStatsHeader* Head {};
FEXCore::Profiler::ThreadStats* Stats;
FEXCore::Profiler::ThreadStats* StatTail {};
Member:

Why are you preferring a custom-implemented linked list over flat memory? This seems neither more convenient for the implementation nor like it would improve performance.

Member Author:

Sparsity in the allocation buffer is a concern, so this is how I'm keeping the implementation.

Member:

Could you elaborate on what exactly the problem is?

Member Author:

The reader side would need to walk the entire allocated buffer space to ensure it doesn't miss any slots, instead of the average case of a hundred or so.
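For illustration, a reader-side walk under the hypothetical layout sketched in the PR description visits only live slots by following the offset links, whereas a flat buffer would have to scan every allocated slot to find the active ones. AccumulateSIGBUS is an invented name:

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// Reuses the hypothetical ThreadStatsHeader/ThreadStats layout sketched
// earlier; Base is the start of the mapped SHM region.
uint64_t AccumulateSIGBUS(const std::byte* Base) {
  const auto* Header = reinterpret_cast<const ThreadStatsHeader*>(Base);
  uint64_t Total = 0;
  // Follow the offset links: only live slots are visited.
  uint32_t Offset = Header->HeadOffset.load(std::memory_order_acquire);
  while (Offset != 0) {
    const auto* Slot = reinterpret_cast<const ThreadStats*>(Base + Offset);
    Total += Slot->SIGBUSCount.load(std::memory_order_relaxed);
    Offset = Slot->NextOffset.load(std::memory_order_acquire);
  }
  return Total;
}
```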

Member:

What scenario exactly would trigger this? I don't know what your reader side looks like, but it sounds like a bitmask and/or active slot count would easily resolve this.

Member Author:

I'm just going to leave it as a linked list. We can rev the version number and change it to a bitmask later if someone feels up to writing that code.

@neobrain (Member) Jan 22, 2025:

If the choice between linked lists and flat memory doesn't seem to be so important, why is the focus on weakly-ordered atomics considered critical?

(I don't agree that using a custom data structure here is the way forward, since it makes up a good chunk of the complexity in this PR; but on top of that, the performance targets don't seem consistent here as-is.)

Member Author:

For the atomics, I don't want the writer side to be affected by cacheline sharing problems at all. Semantically these are two independent things: the contained data (FEXCore::Profiler::ThreadStats) and the mechanism for walking the list of data.

Member:

I understand they are different things, but it seems arbitrary to want this (with no explicit motivation) while using a data structure with poor performance characteristics in the same module.

Member Author:

Indeed, which is why we leave some of the "poor performance" problems on the reader side, which has something like 500 ms to sample the data, even though the walk is going to take hundreds of nanoseconds regardless.

}

// Find a free slot
memory_barrier();
Member:

Compared to the cost of the profiled events (SIGBUS/SMC), is it really worth dealing with the added complexity here instead of just using stronger atomics? Also, isn't thread creation/destruction in particular rare enough that we could use strong memory ordering on slot management variables while leaving the stats themselves weakly ordered?

Member Author:

False sharing in the hot path is a significant concern, so I'm keeping them as weak atomics.
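A sketch of that concern and of the hot-path write, under assumptions: PaddedSlot and CountSIGBUS are invented names, and alignas(64) assumes a 64-byte cacheline:

```cpp
#include <atomic>
#include <cstdint>

// Pad each thread's slot out to a cacheline so one thread's updates never
// share a line with another thread's slot or with the links the reader
// walks. Illustrative only.
struct alignas(64) PaddedSlot {
  std::atomic<uint64_t> SIGBUSCount;
  // ... remaining per-thread counters ...
};

// Hot path: each slot has exactly one writer (its owning thread), so a
// relaxed load/store pair suffices; no read-modify-write, no fences.
inline void CountSIGBUS(PaddedSlot& Slot) {
  Slot.SIGBUSCount.store(Slot.SIGBUSCount.load(std::memory_order_relaxed) + 1,
                         std::memory_order_relaxed);
}
```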

@neobrain (Member) Jan 22, 2025:

Where's the hot path? The currently traced events are orders of magnitude more expensive to process than potential cache behavior effects.

This applies even more to thread creation/destruction, which is a rare event on top of that.

Member Author:

Yes, we have some leniency on thread creation and teardown, which is already slow and causes global locks to be taken, but this is the way I have decided to implement it.

@Sonicadvance1 (Member Author)

https://github.com/Sonicadvance1/MangoHud/tree/fex
For the WIP FEX support in mangohud.

Sonicadvance1 added a commit to Sonicadvance1/FEX that referenced this pull request Jan 23, 2025
When running on a system with a 48-bit VA, if FEX does any allocations
between us reserving the upper 128TB and the application running, then
/technically/ we are intersecting with the application's memory region
in the lower 47-bits.

This didn't typically result in any problems due to how ASLR works, but
if we do any large allocations (like the 128MB VMA region FEX-Emu#4291
wants) then these typically get pushed higher in the VA space.

Again not usually a problem, but if you happen to be running an
application that is using MAP_FIXED with hardcoded addresses then this
can stomp over FEX-Emu memory causing problems.

This is what happens with Wine: it reserves the upper 32MB of its 47-bit
VA space, which is /highly/ likely to stomp on FEX memory. In fact, it
likely occurs all the time; we just got lucky that whatever it was
clobbering wasn't in use at the time.

On 39-bit VA systems this isn't a problem because the mmap fails
outright with a warning message from WINE.

Because we are already reserving the upper 128TB of VA space, instead
just always enable our allocator and use the regions that were reserved.
We need to be a little bit careful to ensure we don't accidentally
allocate more memory post-reservation, but that just requires a small
adjustment to our unique_ptr and constructor for the 64BitAllocator.

This means /all/ FEX-Emu allocations will be in the upper 128TB VA space
when running 64-bit applications on a 48-bit VA system, which is kind of
nice.

Fixes WINE in FEX-Emu#4291 when the allocator stats are bumped to 128MB per
process.
Sonicadvance1 added a commit to Sonicadvance1/FEX that referenced this pull request Feb 7, 2025
Based on FEX-Emu#4291 and FEX-Emu#4324. Ideally this gets merged at the same time so
we can have Mangohud be on version 2 before giving them an upstream
patch.

Performance-wise this change falls within noise of my x87 microbench.

This just lets us track the number of float fallbacks FEX does, showing
things like x87 fallbacks and how frequent they are, so we can detect
whether a game might be slow or stuttering because of them.
@bylaws (Collaborator)

bylaws commented Feb 7, 2025

I would err towards only supporting the wine side with the fex patch without a solid usecase for the questionable path

@Sonicadvance1 (Member Author)

> I would err towards only supporting the wine side with the fex patch without a solid usecase for the questionable path

Sure, ripped it out.

Not wired up, just the definitions so it lives in the
InternalThreadState.

We want this accessible from both FEXCore and the frontends so it needs
to live there.

Two types of events are supported: scoped cycle counts and instant
increments.

This gives us JIT time and signal handling time, plus counts of SIGBUS
and SMC events.

All useful statistics for seeing stutter live.
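A hedged sketch of what those two event types could look like; ScopedCycleCounter and ReadCycleCounter are invented names, CNTVCT_EL0 is the AArch64 virtual counter register, and FEX's actual helpers may differ:

```cpp
#include <atomic>
#include <cstdint>

// Reading the cycle counter is platform-specific; this helper is
// illustrative, not FEX's actual one.
static inline uint64_t ReadCycleCounter() {
#if defined(__aarch64__)
  uint64_t Value;
  __asm__ volatile("mrs %0, CNTVCT_EL0" : "=r"(Value));
  return Value;
#else
  return __builtin_ia32_rdtsc();
#endif
}

// Scoped cycle count: accumulates the cycles spent in a region (JIT
// compilation, a signal handler, ...) into a per-thread counter on exit.
class ScopedCycleCounter final {
public:
  explicit ScopedCycleCounter(std::atomic<uint64_t>& Counter)
    : Counter {Counter}, Start {ReadCycleCounter()} {}
  ~ScopedCycleCounter() {
    // Single-writer slot, so a relaxed load/store is sufficient.
    Counter.store(Counter.load(std::memory_order_relaxed) +
                      (ReadCycleCounter() - Start),
                  std::memory_order_relaxed);
  }
private:
  std::atomic<uint64_t>& Counter;
  uint64_t Start;
};

// An instant increment is just the relaxed counter bump shown earlier.
```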
For the four things we care about
Not wired up to anything. Requires the frontends to allocate shared
memory in the expected way.
This is fairly straightforward. It creates the shared memory region in
/dev/shm/fex-<pid>-stats so that Mangohud can sample it.
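As a sketch of the reader side under the assumption of a plain POSIX shared-memory object (MapFEXStats is an invented name; only the /dev/shm/fex-<pid>-stats path comes from the commit above):

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#include <cstddef>
#include <cstdio>

// Map the per-process stats region read-only. shm_open() resolves the
// name under /dev/shm, giving /dev/shm/fex-<pid>-stats.
void* MapFEXStats(pid_t PID, size_t& SizeOut) {
  char Name[64];
  std::snprintf(Name, sizeof(Name), "/fex-%d-stats", PID);
  int FD = shm_open(Name, O_RDONLY, 0);
  if (FD == -1) {
    return nullptr;
  }

  struct stat Info {};
  if (fstat(FD, &Info) == -1) {
    close(FD);
    return nullptr;
  }

  // The region can grow but never shrinks; a real reader would fstat()
  // periodically and remap if the file grew.
  void* Base = mmap(nullptr, Info.st_size, PROT_READ, MAP_SHARED, FD, 0);
  close(FD); // The mapping remains valid after the descriptor is closed
  if (Base == MAP_FAILED) {
    return nullptr;
  }
  SizeOut = Info.st_size;
  return Base;
}
```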
This is a little trickier: we open the
`/dev/shm/fex-<pid>-stats` file directly using Windows APIs so that
Mangohud (which is going to be on the Linux side, or potentially even
embedded into Gamescope) can safely pick up the stats.

A little quirky, and it doesn't support expanding its size since WINE
doesn't support NtExtendSection, but that's fine.
@lioncash (Contributor) left a comment

Just one super small nit

Fall back to the previous path if it doesn't exist.
@lioncash lioncash merged commit 6a39a8d into FEX-Emu:main Feb 11, 2025
12 checks passed
@Sonicadvance1 Sonicadvance1 deleted the profile_stats branch February 11, 2025 21:51
Sonicadvance1 added a commit to Sonicadvance1/FEX that referenced this pull request Feb 11, 2025
Based on FEX-Emu#4291 and FEX-Emu#4324. Ideally this gets merged at the same time so
we can have Mangohud be on version 2 before giving them an upstream
patch.

Performance-wise this change falls within noise of my x87 microbench.

This just lets us track the number of float fallbacks FEX does, letting
us detect things like x87 fallbacks and how frequent they are, so we can
detect if a game might be slow or stuttering because of these fallbacks.
Sonicadvance1 added a commit to Sonicadvance1/FEX that referenced this pull request Feb 11, 2025
Based on FEX-Emu#4291 and FEX-Emu#4324. Ideally this gets merged at the same time so
we can have Mangohud be on version 2 before giving them an upstream
patch.

Performance-wise this change falls within noise of my x87 microbench.

This just lets us track the number of float fallbacks FEX does, letting
us detect things like x87 fallbacks and how frequent they are, so we can
detect if a game might be slow or stuttering because of these fallbacks.
Sonicadvance1 added a commit to Sonicadvance1/MangoHud that referenced this pull request Feb 16, 2025
This implements support for FEX-Emu's profiling stats interface behind a
command-line option that isn't enabled by default.

The purpose of these statistics in FEX-Emu is to expose statistics that
are high frequency and can't easily be collected in other ways due to
overhead. The reason they are exposed in MangoHud rather than another
path is to have that information readily available while running a game.
It's sometimes difficult to understand why a game has stuttered under FEX
and to attribute the stutter to FEX-Emu overheads; this lets us directly
relate these stats to frame time drops in the game.

The attached example is a good indicator of why a game is having low
performance and how the stats look. We can see in the image that the
game (Celeste) is only running at 55 FPS with a CPU core pegged at 100%,
and the FEX stats let us know that 33.7 million soft-float operations are
happening per second, ~600k per frame.

The implementation is already in FEX-Emu upstream
(FEX-Emu/FEX#4291) and we aren't expecting it to change heavily,
potentially just adding additional sample events. We version these stats
so that if they change in the future, the interface doesn't break on one
side versus the other.
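As a sketch of the accumulation a reader does: the counters are cumulative, so MangoHud-style tooling would sample deltas per frame or per period. At 55 FPS, 33.7 million ops per second works out to roughly 33.7e6 / 55 ≈ 612k ops per frame. CounterSampler is an invented name:

```cpp
#include <cstdint>

// Turn a cumulative counter into per-interval figures by sampling it
// each frame (or at a fixed period) and reporting the delta.
struct CounterSampler {
  uint64_t Last {};

  // Returns how much the cumulative counter advanced since the last call.
  uint64_t Sample(uint64_t Current) {
    const uint64_t Delta = Current - Last;
    Last = Current;
    return Delta;
  }
};
```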
flightlessmango pushed a commit to flightlessmango/MangoHud that referenced this pull request Feb 24, 2025