Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[NativeAOT] Enable async runtime suspension and return hijacking on unix-arm64 #73216

Merged
merged 5 commits into from
Aug 5, 2022

Conversation

VSadov
Copy link
Member

@VSadov VSadov commented Aug 2, 2022

Contributes to: #67805

@VSadov VSadov mentioned this pull request Aug 4, 2022
9 tasks
@VSadov
Copy link
Member Author

VSadov commented Aug 4, 2022

/azp run runtime-extra-platforms

@dotnet dotnet deleted a comment from azure-pipelines bot Aug 4, 2022
@VSadov VSadov marked this pull request as ready for review August 4, 2022 23:48
@VSadov VSadov requested review from janvorli and jkotas August 4, 2022 23:48
@VSadov
Copy link
Member Author

VSadov commented Aug 4, 2022

I think this is ready to review/merge.

@dotnet dotnet deleted a comment from azure-pipelines bot Aug 5, 2022
@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@@ -184,6 +184,7 @@ EXTERN_C void REDHAWK_CALLCONV RhpFailFastForPInvokeExceptionCoop(intptr_t PInvo
void* pExceptionRecord, void* pContextRecord);
int32_t __stdcall RhpVectoredExceptionHandler(PEXCEPTION_POINTERS pExPtrs);

// REVIEW: this is no longer used by pInvokes and use in hijack seems bogus. Remove?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was meant to protect against SEH exceptions entering the managed code and leaving it in inconsistent state. We should re-introduce this protection. Could you please create an issue on it and link it from here?

If PInvoke throws SEH exception, it is going unwind the managed portion of the stack and the runtime is going to be left in inconsistent state currently.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, I will log an issue.
But we do not need this for hijack probes, right? There is no scenario that I can think of where we expect exceptions from waiting for GC.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may be there so that fatal crashes like stackoverflow or access violations caused by process state corruptions terminate the process immediately instead letting it to propagate.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logged: #73429


// ldp with pre/post/no offset
// x010 100x x1xx xxxx xxxx xxxx xxxx xxxx
#define LDP_BITS2 0x28400000
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will trust you that you got these magic numbers right :-)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It took quite some time to construct and proof-read these.
The "BITS" parts for the individual instruction patterns ended up matching constants from instrarm64.h
And that is encouraging :-)

INST2(ldp, "ldp", LD, IF_EN2E, 0x29400000, 0x28400000)

Copy link
Member

@jkotas jkotas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thank you!

Copy link
Member

@janvorli janvorli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thank you!

// we could be catching threads in restartable sequences such as LL/SC style interlocked on ARM64
// and forcing them to restart.
// if interrupt mechanism is fast, eagerness could be hurting our overall progress.
waitCycles += 10000;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we really keep this growing without any limit?

Copy link
Member Author

@VSadov VSadov Aug 5, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ultimately the wait is unbounded, since suspension cannot gracefully fail. - "It is done, when it is done".
Practically, we can have very good timings here.

Both waiting too long and waiting not enough between re-hijacking can make the whole thing last longer.
Here I observed that we would interrupt a thread doing LL/SC InterlockedSomething loop, interrupting would invalidate its monitor, so SC would fail and start over - just in time for us to interrupt it again - leading to very long hangs.
Fixed spin counts are often a bad idea as the guesstimate may be wrong when running on a different platform. I just made the spin count to adjust in a naive way for now.

Dealing with this loop is the next/last part in the NativeAOT suspension work item. #67805

I plan to make this similar to what CoreCLR does:

  • interrupt threads and hijack/suspend accordingly
  • spin-check without re-hijacking as we are making progress - the common case is everything suspends quickly.
  • otherwise wait for progress with a 1 msec timeout, and if timed out, try hijacking again - to deal with remaining strugglers.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense

@VSadov VSadov merged commit 85c411e into dotnet:main Aug 5, 2022
@VSadov VSadov deleted the arm64susp branch August 5, 2022 22:35
@VSadov
Copy link
Member Author

VSadov commented Aug 5, 2022

Thanks!!

MichalStrehovsky added a commit that referenced this pull request Aug 7, 2022
On ARM64 Linux. See if #73216 helped with the hang.
MichalStrehovsky added a commit that referenced this pull request Aug 8, 2022
On ARM64 Linux. See if #73216 helped with the hang.
@ghost ghost locked as resolved and limited conversation to collaborators Sep 5, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants