-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[NativeAOT] Enable async runtime suspension and return hijacking on unix-arm64 #73216
Conversation
/azp run runtime-extra-platforms |
I think this is ready to review/merge. |
Azure Pipelines successfully started running 1 pipeline(s). |
@@ -184,6 +184,7 @@ EXTERN_C void REDHAWK_CALLCONV RhpFailFastForPInvokeExceptionCoop(intptr_t PInvo | |||
void* pExceptionRecord, void* pContextRecord); | |||
int32_t __stdcall RhpVectoredExceptionHandler(PEXCEPTION_POINTERS pExPtrs); | |||
|
|||
// REVIEW: this is no longer used by pInvokes and use in hijack seems bogus. Remove? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was meant to protect against SEH exceptions entering the managed code and leaving it in inconsistent state. We should re-introduce this protection. Could you please create an issue on it and link it from here?
If PInvoke throws SEH exception, it is going unwind the managed portion of the stack and the runtime is going to be left in inconsistent state currently.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense, I will log an issue.
But we do not need this for hijack probes, right? There is no scenario that I can think of where we expect exceptions from waiting for GC.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It may be there so that fatal crashes like stackoverflow or access violations caused by process state corruptions terminate the process immediately instead letting it to propagate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logged: #73429
|
||
// ldp with pre/post/no offset | ||
// x010 100x x1xx xxxx xxxx xxxx xxxx xxxx | ||
#define LDP_BITS2 0x28400000 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will trust you that you got these magic numbers right :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It took quite some time to construct and proof-read these.
The "BITS" parts for the individual instruction patterns ended up matching constants from instrarm64.h
And that is encouraging :-)
runtime/src/coreclr/jit/instrsarm64.h
Line 613 in 2201016
INST2(ldp, "ldp", LD, IF_EN2E, 0x29400000, 0x28400000) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thank you!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thank you!
// we could be catching threads in restartable sequences such as LL/SC style interlocked on ARM64 | ||
// and forcing them to restart. | ||
// if interrupt mechanism is fast, eagerness could be hurting our overall progress. | ||
waitCycles += 10000; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we really keep this growing without any limit?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ultimately the wait is unbounded, since suspension cannot gracefully fail. - "It is done, when it is done".
Practically, we can have very good timings here.
Both waiting too long and waiting not enough between re-hijacking can make the whole thing last longer.
Here I observed that we would interrupt a thread doing LL/SC InterlockedSomething loop, interrupting would invalidate its monitor, so SC would fail and start over - just in time for us to interrupt it again - leading to very long hangs.
Fixed spin counts are often a bad idea as the guesstimate may be wrong when running on a different platform. I just made the spin count to adjust in a naive way for now.
Dealing with this loop is the next/last part in the NativeAOT suspension work item. #67805
I plan to make this similar to what CoreCLR does:
- interrupt threads and hijack/suspend accordingly
- spin-check without re-hijacking as we are making progress - the common case is everything suspends quickly.
- otherwise wait for progress with a 1 msec timeout, and if timed out, try hijacking again - to deal with remaining strugglers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense
Thanks!! |
On ARM64 Linux. See if #73216 helped with the hang.
On ARM64 Linux. See if #73216 helped with the hang.
Contributes to: #67805