Revisit concurrency in `JavaNetReverseProxy2#service` #913

basil · 2025-02-03T20:19:35Z

Problem

In Jenkins core builds, we see frequent flakiness in hudson.PluginManagerTest#doNotThrowWithUnknownPlugins. Right before the test fails we see:

   4.236 [id=64]	WARNING	o.e.jetty.ee9.nested.HttpChannel#handleException: handleException /update-center.json java.nio.file.NoSuchFileException: /home/jenkins/agent/workspace/Core_jenkins_PR-10231/test/target/jenkins.io-cache2/a99c432e44df4790e1a50308de383c6d

Evaluation

JavaNetReverseProxy2 has concurrency issues when multiple processes are being used, which is the case in our CI builds which both set forkCount=2 and reuseForks=false. This results in two separate JVMs running concurrently executing test cases.

This class was designed for a single JVM to be in use at any time, for a few reasons. First, it has a fixed name for the cache directory. Second, it writes to the cache in a synchronized (this) block, which works to guard the critical section between threads in one JVM but not between multiple JVMs. Finally, it installs a shutdown hook to delete the cache directory, which again makes sense in a single JVM scenario.

The proximate cause of the error is a race condition between two test JVMs:

JVM 1 ensures that the cache directory exists, creating it if necessary.
JVM 2 finishes a test and executes its shutdown hook to remove the cache directory.
JVM 1 attempts to create the cache file within the cache directory, but receives NoSuchFileException because the directory has since been deleted since step 1.

Solution

One way to solve this problem would have been to preserve the existing design but have a separate cache directory per JVM. This would have resulted in the lowest amount of code change, but on the other hand each test JVM would have to populate the cache from scratch, diminishing the value of the cache.

One thing to note is that this code originally contained the comment

// TODO: think of a better location --- ideally inside the target/ dir so that clean would wipe them out
INSTANCE = new JavaNetReverseProxy2(
        new File(new File(System.getProperty("java.io.tmpdir")), "jenkins.io-cache2"));

For a while now, our plugin parent POM and Jenkins core POM have been setting java.io.tmpdir to the Maven target/ (${project.build.directory}) directory, so things like mvn clean work to clear the cache and there is no risk of polluting the broader system with garbage if the Java code doesn't clean the cache. Given the above, we felt it was more prudent to never clean the cache from Java code and allow it to be shared between all test JVMs. This maximizes the benefit of the cache.

Implementation

First we remove the shutdown hook which deletes the cache. This ensures the cache is shared between all test JVMs in a run and prevents one JVM from interfering with another.

Next, we removed the synchronized (this) statement, which doesn't work across JVMs, and implemented synchronization using filesystem primitives without memory barriers. Rather than two processes attempting to write to the same file at the same time, we create unique temporary files and then do an atomic rename of the temporary file to the final name. This might waste a little bit of effort if two processes both try to populate the cache at the same time (the second one will win, having duplicated the effort of the first), but correctness is still achieved and the performance cost is minimal here.

Note that the Javadoc for ATOMIC_MOVE says "If the target file exists then it is implementation specific if the existing file is replaced or this method fails by throwing an IOException", but in practice the standard implementations for Linux, Windows, and macOS all replace the existing file, so it seems safe to rely on this behavior. Just for completeness, I implemented a fallback to non-atomic move to degrade gracefully in pathological scenarios.

Testing done

I've been running the Jenkins core build for many days now with various iterations of this PR. I would typically see this flakiness every other build or so, but I've now run 3-4 builds without any failure.

timja

Thanks for debugging this!

basil added the work-in-progress label Feb 3, 2025

basil force-pushed the debug branch 8 times, most recently from 2fb462f to edcf6aa Compare February 18, 2025 19:09

Revisit concurrency in JavaNetReverseProxy2#service

7866bc4

basil force-pushed the debug branch from edcf6aa to 7866bc4 Compare February 19, 2025 01:43

basil changed the title ~~Attempt to debug flakiness in hudson.PluginManagerTest.doNotThrowWithUnknownPlugins~~ Revisit concurrency in JavaNetReverseProxy2#service Feb 19, 2025

basil added bug and removed work-in-progress labels Feb 19, 2025

basil marked this pull request as ready for review February 19, 2025 15:18

timja approved these changes Feb 19, 2025

View reviewed changes

basil merged commit f25ef8c into jenkinsci:master Feb 19, 2025
15 checks passed

basil mentioned this pull request Feb 20, 2025

Implement afterDisconnect in SimpleCommandLauncher #922

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Revisit concurrency in `JavaNetReverseProxy2#service` #913

Revisit concurrency in `JavaNetReverseProxy2#service` #913

basil commented Feb 3, 2025 •

edited

Loading

timja left a comment

Revisit concurrency in JavaNetReverseProxy2#service #913

Revisit concurrency in JavaNetReverseProxy2#service #913

Conversation

basil commented Feb 3, 2025 • edited Loading

Problem

Evaluation

Solution

Implementation

Testing done

timja left a comment

Choose a reason for hiding this comment

Revisit concurrency in `JavaNetReverseProxy2#service` #913

Revisit concurrency in `JavaNetReverseProxy2#service` #913

basil commented Feb 3, 2025 •

edited

Loading