-
Notifications
You must be signed in to change notification settings - Fork 5.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
8347564: ZGC: Crash in DependencyContext::clean_unloading_dependents #23194
Conversation
👋 Welcome back aboldtch! A progress list of the required criteria for merging this PR into |
@xmas92 This change now passes all automated pre-integration checks. ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details. After integration, the commit message for the final commit will be:
You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been 254 new commits pushed to the
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details. ➡️ To integrate this PR with the above commit message to the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great. Solving the problem by removing pointless code is great. Now it looks much... cleaner.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me. I think it would be good if someone more familiar with the MethodHandle code also reviewed this.
Good work, Axel. But the original problem makes me wonder whether |
For our STW collectors this should always be the case, as the cleaner will run its code after the collection where the CallSite dies, which will in turn have unlinked the nmethods. For the concurrent collectors, because references processing is done before nmethod unlinking, we might run the cleaner code before the GC gets to the nmethod, and as such observe a none empty list. I've run some sanity stress tests locally with G1 and the following patch: diff --git a/src/hotspot/share/prims/methodHandles.cpp b/src/hotspot/share/prims/methodHandles.cpp
index 97e3eae1a2f..fadcec9bd90 100644
--- a/src/hotspot/share/prims/methodHandles.cpp
+++ b/src/hotspot/share/prims/methodHandles.cpp
@@ -1330,6 +1331,7 @@ JVM_ENTRY(void, MHN_clearCallSiteContext(JNIEnv* env, jobject igcls, jobject con
NoSafepointVerifier nsv;
MutexLocker ml(THREAD, CodeCache_lock, Mutex::_no_safepoint_check_flag);
DependencyContext deps = java_lang_invoke_MethodHandleNatives_CallSiteContext::vmdependencies(context());
+ guarantee(deps.is_empty(), "just checking");
deps.remove_and_mark_for_deoptimization_all_dependents(&deopt_scope);
// This is assumed to be an 'atomic' operation by verification.
// So keep it under lock for now. But I'll take it a spin through our CI as well. But if this were to occur it would mean we must have either missed adding a dependent CallSite oop to a linked nmethod, or that GC failed to unlink a nmethod with a dead oop. Neither would be very healthy.
Tangent:
The There is a world where we move this to java. Having some object representing the nmethod. Letting CallSites have a linked list of weak references to these objects, etc. Having thee GC solve the memory reclamation problem. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the clarifications, Axel. The fix looks good.
But I'll take it a spin through our CI as well.
Thanks!
Ran tier1 through tier8 with G1 and the guarantee that the dependency list is empty, and it is indeed the case. /integrate |
Going to push as commit 14136f8.
Your commit was automatically rebased without conflicts. |
The proposed change here is the following:
vmdependencies
list head from theContext
to theCallSite
object(1.) fixes the crash. (2.) is because without
vmdependencies
the Context and its cleaner serves no purpose.First what goes wrong without this patch.
Currently when the GC unlinks a nmethod it clean/flush out all the dependency contexts. These are either attached to an InstanceKlass, or a Context object reachable through a CallSite oop. A nmethod is unlinked when either one of its oops have died or it is heuristically determined to be cold and should be unloaded. So when the GC clean's the context through a CallSite oop, it may be that that CallSite object is dead.
For ZGC, which is a concurrent generational GC (the different generations are collected concurrently, but in a coordinated manner), it is important that the unlinking is coordinated with the reclamation of this dead object. In generational ZGC all nmethod oops are considered as strong roots if they reside in the young generation and thusly can only become unreachable / dead after promotion to the old generation. This means that the CallSite object at the time of unlinking is either reachable / live, or unreachable / dead and is reclaimed by the old generation collection (the same generation that does the unlinking). So we can make reading from this object safe by not reclaiming the object before unlinking is finished.
The issue is that we do not have this guarantee for the Context object. As this is a distinct object it may be that it has not been promoted and resides in the young generation at the time of its CallSite object becoming unreachable and collected by the old generation collection.
If this is the case and a young generation collection runs after old marking has finished, we have two bad scenarios. If it the young generation collection starts after reference processing and the cleaner has run, the Context object would be unreachable and the young generation collection would reclaim the memory. If it started before the reference processing it would still be reachable, but may be relocated.
For reachable old CallSite objects the Context oop field would have been tracked and remapped if a young collection relocates the Context object, however this is not true then the CallSite is unreachable. The Context object may have moved or been reclaimed, and the load barrier on the field will produce a bogus oop because the field will not have been tracked (and remapped) correctly. This will cause the crash.
The solution in this patch is to remove the Context object and store the
vmdependencies
list head directly in the CallSite object. This avoids any cross generation pointers.An alternative would have been to keep the cleaner and change the GC code to only look at CallSite object which are reachable. However this would require a per-GC adaption and would be a larger change. (Even if it is only really ZGC which would need to do anything, maybe that generational Shenandoah is also affected, but not something I have looked into)
I've tried to dig and find why a cleaner was used in the first place as it seems like the GC has cleaned these / removed unloading nmethods for a while. Maybe someone else have more insight.
The lifetime of the native memory is now solely tied to the lifetime of the nmthod it tracks a dependency for. This means that it is important that after
add_dependent_nmethod
it will eventually be seen by and can be unlinked by the GC. As far as I can tell this seem to be the case,add_dependent_nmethod
is called after a nmethod has been created in the code cache.Tested:
Progress
Issue
Reviewers
Reviewing
Using
git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/23194/head:pull/23194
$ git checkout pull/23194
Update a local copy of the PR:
$ git checkout pull/23194
$ git pull https://git.openjdk.org/jdk.git pull/23194/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 23194
View PR using the GUI difftool:
$ git pr show -t 23194
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/23194.diff
Using Webrev
Link to Webrev Comment