Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Revisit concurrency in JavaNetReverseProxy2#service #913

Merged
merged 1 commit into from
Feb 19, 2025

Conversation

basil
Copy link
Member

@basil basil commented Feb 3, 2025

Problem

In Jenkins core builds, we see frequent flakiness in hudson.PluginManagerTest#doNotThrowWithUnknownPlugins. Right before the test fails we see:

   4.236 [id=64]	WARNING	o.e.jetty.ee9.nested.HttpChannel#handleException: handleException /update-center.json java.nio.file.NoSuchFileException: /home/jenkins/agent/workspace/Core_jenkins_PR-10231/test/target/jenkins.io-cache2/a99c432e44df4790e1a50308de383c6d

Evaluation

JavaNetReverseProxy2 has concurrency issues when multiple processes are being used, which is the case in our CI builds which both set forkCount=2 and reuseForks=false. This results in two separate JVMs running concurrently executing test cases.

This class was designed for a single JVM to be in use at any time, for a few reasons. First, it has a fixed name for the cache directory. Second, it writes to the cache in a synchronized (this) block, which works to guard the critical section between threads in one JVM but not between multiple JVMs. Finally, it installs a shutdown hook to delete the cache directory, which again makes sense in a single JVM scenario.

The proximate cause of the error is a race condition between two test JVMs:

  1. JVM 1 ensures that the cache directory exists, creating it if necessary.
  2. JVM 2 finishes a test and executes its shutdown hook to remove the cache directory.
  3. JVM 1 attempts to create the cache file within the cache directory, but receives NoSuchFileException because the directory has since been deleted since step 1.

Solution

One way to solve this problem would have been to preserve the existing design but have a separate cache directory per JVM. This would have resulted in the lowest amount of code change, but on the other hand each test JVM would have to populate the cache from scratch, diminishing the value of the cache.

One thing to note is that this code originally contained the comment

// TODO: think of a better location --- ideally inside the target/ dir so that clean would wipe them out
INSTANCE = new JavaNetReverseProxy2(
        new File(new File(System.getProperty("java.io.tmpdir")), "jenkins.io-cache2"));

For a while now, our plugin parent POM and Jenkins core POM have been setting java.io.tmpdir to the Maven target/ (${project.build.directory}) directory, so things like mvn clean work to clear the cache and there is no risk of polluting the broader system with garbage if the Java code doesn't clean the cache. Given the above, we felt it was more prudent to never clean the cache from Java code and allow it to be shared between all test JVMs. This maximizes the benefit of the cache.

Implementation

First we remove the shutdown hook which deletes the cache. This ensures the cache is shared between all test JVMs in a run and prevents one JVM from interfering with another.

Next, we removed the synchronized (this) statement, which doesn't work across JVMs, and implemented synchronization using filesystem primitives without memory barriers. Rather than two processes attempting to write to the same file at the same time, we create unique temporary files and then do an atomic rename of the temporary file to the final name. This might waste a little bit of effort if two processes both try to populate the cache at the same time (the second one will win, having duplicated the effort of the first), but correctness is still achieved and the performance cost is minimal here.

Note that the Javadoc for ATOMIC_MOVE says "If the target file exists then it is implementation specific if the existing file is replaced or this method fails by throwing an IOException", but in practice the standard implementations for Linux, Windows, and macOS all replace the existing file, so it seems safe to rely on this behavior. Just for completeness, I implemented a fallback to non-atomic move to degrade gracefully in pathological scenarios.

Testing done

I've been running the Jenkins core build for many days now with various iterations of this PR. I would typically see this flakiness every other build or so, but I've now run 3-4 builds without any failure.

@basil basil force-pushed the debug branch 8 times, most recently from 2fb462f to edcf6aa Compare February 18, 2025 19:09
@basil basil changed the title Attempt to debug flakiness in hudson.PluginManagerTest.doNotThrowWithUnknownPlugins Revisit concurrency in JavaNetReverseProxy2#service Feb 19, 2025
@basil basil marked this pull request as ready for review February 19, 2025 15:18
Copy link
Member

@timja timja left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for debugging this!

@basil basil merged commit f25ef8c into jenkinsci:master Feb 19, 2025
15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants