Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

8347564: ZGC: Crash in DependencyContext::clean_unloading_dependents #23194

Closed
wants to merge 1 commit into from

Conversation

xmas92
Copy link
Member

@xmas92 xmas92 commented Jan 20, 2025

The proposed change here is the following:

  1. Move the vmdependencies list head from the Context to the CallSite object
  2. Remove the Context object and its corresponding cleaner

(1.) fixes the crash. (2.) is because without vmdependencies the Context and its cleaner serves no purpose.

First what goes wrong without this patch.

Currently when the GC unlinks a nmethod it clean/flush out all the dependency contexts. These are either attached to an InstanceKlass, or a Context object reachable through a CallSite oop. A nmethod is unlinked when either one of its oops have died or it is heuristically determined to be cold and should be unloaded. So when the GC clean's the context through a CallSite oop, it may be that that CallSite object is dead.

For ZGC, which is a concurrent generational GC (the different generations are collected concurrently, but in a coordinated manner), it is important that the unlinking is coordinated with the reclamation of this dead object. In generational ZGC all nmethod oops are considered as strong roots if they reside in the young generation and thusly can only become unreachable / dead after promotion to the old generation. This means that the CallSite object at the time of unlinking is either reachable / live, or unreachable / dead and is reclaimed by the old generation collection (the same generation that does the unlinking). So we can make reading from this object safe by not reclaiming the object before unlinking is finished.

The issue is that we do not have this guarantee for the Context object. As this is a distinct object it may be that it has not been promoted and resides in the young generation at the time of its CallSite object becoming unreachable and collected by the old generation collection.

If this is the case and a young generation collection runs after old marking has finished, we have two bad scenarios. If it the young generation collection starts after reference processing and the cleaner has run, the Context object would be unreachable and the young generation collection would reclaim the memory. If it started before the reference processing it would still be reachable, but may be relocated.

For reachable old CallSite objects the Context oop field would have been tracked and remapped if a young collection relocates the Context object, however this is not true then the CallSite is unreachable. The Context object may have moved or been reclaimed, and the load barrier on the field will produce a bogus oop because the field will not have been tracked (and remapped) correctly. This will cause the crash.

The solution in this patch is to remove the Context object and store the vmdependencies list head directly in the CallSite object. This avoids any cross generation pointers.

An alternative would have been to keep the cleaner and change the GC code to only look at CallSite object which are reachable. However this would require a per-GC adaption and would be a larger change. (Even if it is only really ZGC which would need to do anything, maybe that generational Shenandoah is also affected, but not something I have looked into)

I've tried to dig and find why a cleaner was used in the first place as it seems like the GC has cleaned these / removed unloading nmethods for a while. Maybe someone else have more insight.

The lifetime of the native memory is now solely tied to the lifetime of the nmthod it tracks a dependency for. This means that it is important that after add_dependent_nmethod it will eventually be seen by and can be unlinked by the GC. As far as I can tell this seem to be the case, add_dependent_nmethod is called after a nmethod has been created in the code cache.

Tested:

  • Github Actions
  • Oracle supported platforms tier1 through 8

Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8347564: ZGC: Crash in DependencyContext::clean_unloading_dependents (Bug - P2)

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/23194/head:pull/23194
$ git checkout pull/23194

Update a local copy of the PR:
$ git checkout pull/23194
$ git pull https://git.openjdk.org/jdk.git pull/23194/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 23194

View PR using the GUI difftool:
$ git pr show -t 23194

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/23194.diff

Using Webrev

Link to Webrev Comment

@bridgekeeper
Copy link

bridgekeeper bot commented Jan 20, 2025

👋 Welcome back aboldtch! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk
Copy link

openjdk bot commented Jan 20, 2025

@xmas92 This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8347564: ZGC: Crash in DependencyContext::clean_unloading_dependents

Reviewed-by: eosterlund, stefank, vlivanov

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 254 new commits pushed to the master branch:

  • d985b31: 8342096: Popup menus that request focus are not shown on Linux with Wayland
  • cbe9ec5: 8348905: Add support to specify the JDK for compiling Jtreg tests
  • 6b581d2: 8347997: assert(false) failed: EA: missing memory path
  • 4662363: 8348687: [BACKOUT] C2: Non-fluid StringBuilder pattern bails out in OptoStringConcat
  • d266ca9: 8348752: Enable -XX:+AOTClassLinking by default when -XX:AOTMode is specified
  • cbc89a7: 8348898: Remove unused OctalDigits to clean up code
  • 96fefed: 8319850: PrintInlining should print which methods are late inlines
  • 51cce6e: 8318577: Windows Look-and-Feel JProgressBarUI does not render correctly on 2x UI scale
  • 6bfae3a: 8333386: TestAbortOnVMOperationTimeout test fails for client VM
  • f98d9a3: 8348870: Eliminate array bound checks in DecimalDigits
  • ... and 244 more: https://git.openjdk.org/jdk/compare/cc198972022c94199d698461e2ac42afc0058fd7...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added the rfr Pull request is ready for review label Jan 20, 2025
@openjdk
Copy link

openjdk bot commented Jan 20, 2025

@xmas92 The following labels will be automatically applied to this pull request:

  • core-libs
  • hotspot

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added hotspot hotspot-dev@openjdk.org core-libs core-libs-dev@openjdk.org labels Jan 20, 2025
@mlbridge
Copy link

mlbridge bot commented Jan 20, 2025

Webrevs

Copy link
Contributor

@fisk fisk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great. Solving the problem by removing pointless code is great. Now it looks much... cleaner.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Jan 20, 2025
Copy link
Member

@stefank stefank left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. I think it would be good if someone more familiar with the MethodHandle code also reviewed this.

@iwanowww
Copy link
Contributor

Good work, Axel. Cleaner-based solution to manage nmethod dependencies on java.lang.invoke.CallSites does look redundant. If that's the case, then I'd expect that by the time cleanup action is invoked corresponding dependency list is empty. Is it the case in practice?

But the original problem makes me wonder whether Cleaner-based approach to manage native resources is broken or not.
Dependent instance is used to avoid referent to be artificially kept alive by cleanup action. If it is not safe to access it in order to perform cleanup, how the whole approach is intended to work? Or does anything make 'CallSite` use case special?

@xmas92
Copy link
Member Author

xmas92 commented Jan 28, 2025

Good work, Axel. Cleaner-based solution to manage nmethod dependencies on java.lang.invoke.CallSites does look redundant. If that's the case, then I'd expect that by the time cleanup action is invoked corresponding dependency list is empty. Is it the case in practice?

For our STW collectors this should always be the case, as the cleaner will run its code after the collection where the CallSite dies, which will in turn have unlinked the nmethods.

For the concurrent collectors, because references processing is done before nmethod unlinking, we might run the cleaner code before the GC gets to the nmethod, and as such observe a none empty list.

I've run some sanity stress tests locally with G1 and the following patch:

diff --git a/src/hotspot/share/prims/methodHandles.cpp b/src/hotspot/share/prims/methodHandles.cpp
index 97e3eae1a2f..fadcec9bd90 100644
--- a/src/hotspot/share/prims/methodHandles.cpp
+++ b/src/hotspot/share/prims/methodHandles.cpp
@@ -1330,6 +1331,7 @@ JVM_ENTRY(void, MHN_clearCallSiteContext(JNIEnv* env, jobject igcls, jobject con
     NoSafepointVerifier nsv;
     MutexLocker ml(THREAD, CodeCache_lock, Mutex::_no_safepoint_check_flag);
     DependencyContext deps = java_lang_invoke_MethodHandleNatives_CallSiteContext::vmdependencies(context());
+    guarantee(deps.is_empty(), "just checking");
     deps.remove_and_mark_for_deoptimization_all_dependents(&deopt_scope);
     // This is assumed to be an 'atomic' operation by verification.
     // So keep it under lock for now.

But I'll take it a spin through our CI as well. But if this were to occur it would mean we must have either missed adding a dependent CallSite oop to a linked nmethod, or that GC failed to unlink a nmethod with a dead oop. Neither would be very healthy.

But the original problem makes me wonder whether Cleaner-based approach to manage native resources is broken or not.

Tangent:
Cleaner-based native resource management may be broken but not because of this. (It suffers from a similar problem to finalisers, namely that they are triggered at the GCs discretion, and the GC have no real insight into the cost of a "native resource" so it may never be cleaned. Trying to create scoped lifetimes, with try-with-resource should always be preferred)

Dependent instance is used to avoid referent to be artificially kept alive by cleanup action. If it is not safe to access it in order to perform cleanup, how the whole approach is intended to work? Or does anything make 'CallSite` use case special?

CallSite is special, because the GC knows about it. The reason being is that the GC must remove all nmethod dependencies when unlinking, otherwise we would have a dangling pointer and an ABA problem. When unlinking the GC runs the following code.

    for (Dependencies::DepStream deps(this); deps.next(); ) {
      if (deps.type() == Dependencies::call_site_target_value) {
        // CallSite dependencies are managed on per-CallSite instance basis.
        oop call_site = deps.argument_oop(0);
        MethodHandles::clean_dependency_context(call_site);
      }

The call_site in question here may or may not be dead. But regardless it will have a link to the currently unlinking nmethod, which must be removed. After unlinking the GC will phase the system (by a handshake or by the property of doing all this work in a safepoint). If it was not removed it is a dangling pointer, not only that it is a pointer we are more than likely going to reuse for a new nmethod, which will make them indistinguishable, leading to ABA.

There is a world where we move this to java. Having some object representing the nmethod. Letting CallSites have a linked list of weak references to these objects, etc. Having thee GC solve the memory reclamation problem.

Copy link
Contributor

@iwanowww iwanowww left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the clarifications, Axel. The fix looks good.

But I'll take it a spin through our CI as well.

Thanks!

@xmas92
Copy link
Member Author

xmas92 commented Jan 30, 2025

Ran tier1 through tier8 with G1 and the guarantee that the dependency list is empty, and it is indeed the case.
Thanks for all the reviews.

/integrate

@openjdk
Copy link

openjdk bot commented Jan 30, 2025

Going to push as commit 14136f8.
Since your change was applied there have been 255 commits pushed to the master branch:

  • 04c24f1: 8347779: sun/tools/jhsdb/HeapDumpTestWithActiveProcess.java fails with Unable to deduce type of thread from address
  • d985b31: 8342096: Popup menus that request focus are not shown on Linux with Wayland
  • cbe9ec5: 8348905: Add support to specify the JDK for compiling Jtreg tests
  • 6b581d2: 8347997: assert(false) failed: EA: missing memory path
  • 4662363: 8348687: [BACKOUT] C2: Non-fluid StringBuilder pattern bails out in OptoStringConcat
  • d266ca9: 8348752: Enable -XX:+AOTClassLinking by default when -XX:AOTMode is specified
  • cbc89a7: 8348898: Remove unused OctalDigits to clean up code
  • 96fefed: 8319850: PrintInlining should print which methods are late inlines
  • 51cce6e: 8318577: Windows Look-and-Feel JProgressBarUI does not render correctly on 2x UI scale
  • 6bfae3a: 8333386: TestAbortOnVMOperationTimeout test fails for client VM
  • ... and 245 more: https://git.openjdk.org/jdk/compare/cc198972022c94199d698461e2ac42afc0058fd7...master

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Jan 30, 2025
@openjdk openjdk bot closed this Jan 30, 2025
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Jan 30, 2025
@openjdk
Copy link

openjdk bot commented Jan 30, 2025

@xmas92 Pushed as commit 14136f8.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core-libs core-libs-dev@openjdk.org hotspot hotspot-dev@openjdk.org integrated Pull request has been integrated
Development

Successfully merging this pull request may close these issues.

4 participants