Revisit concurrency in JavaNetReverseProxy2#service
#913
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Problem
In Jenkins core builds, we see frequent flakiness in
hudson.PluginManagerTest#doNotThrowWithUnknownPlugins
. Right before the test fails we see:Evaluation
JavaNetReverseProxy2
has concurrency issues when multiple processes are being used, which is the case in our CI builds which both setforkCount=2
andreuseForks=false
. This results in two separate JVMs running concurrently executing test cases.This class was designed for a single JVM to be in use at any time, for a few reasons. First, it has a fixed name for the cache directory. Second, it writes to the cache in a
synchronized (this)
block, which works to guard the critical section between threads in one JVM but not between multiple JVMs. Finally, it installs a shutdown hook to delete the cache directory, which again makes sense in a single JVM scenario.The proximate cause of the error is a race condition between two test JVMs:
NoSuchFileException
because the directory has since been deleted since step 1.Solution
One way to solve this problem would have been to preserve the existing design but have a separate cache directory per JVM. This would have resulted in the lowest amount of code change, but on the other hand each test JVM would have to populate the cache from scratch, diminishing the value of the cache.
One thing to note is that this code originally contained the comment
For a while now, our plugin parent POM and Jenkins core POM have been setting
java.io.tmpdir
to the Maventarget/
(${project.build.directory}
) directory, so things likemvn clean
work to clear the cache and there is no risk of polluting the broader system with garbage if the Java code doesn't clean the cache. Given the above, we felt it was more prudent to never clean the cache from Java code and allow it to be shared between all test JVMs. This maximizes the benefit of the cache.Implementation
First we remove the shutdown hook which deletes the cache. This ensures the cache is shared between all test JVMs in a run and prevents one JVM from interfering with another.
Next, we removed the
synchronized (this)
statement, which doesn't work across JVMs, and implemented synchronization using filesystem primitives without memory barriers. Rather than two processes attempting to write to the same file at the same time, we create unique temporary files and then do an atomic rename of the temporary file to the final name. This might waste a little bit of effort if two processes both try to populate the cache at the same time (the second one will win, having duplicated the effort of the first), but correctness is still achieved and the performance cost is minimal here.Note that the Javadoc for
ATOMIC_MOVE
says "If the target file exists then it is implementation specific if the existing file is replaced or this method fails by throwing anIOException
", but in practice the standard implementations for Linux, Windows, and macOS all replace the existing file, so it seems safe to rely on this behavior. Just for completeness, I implemented a fallback to non-atomic move to degrade gracefully in pathological scenarios.Testing done
I've been running the Jenkins core build for many days now with various iterations of this PR. I would typically see this flakiness every other build or so, but I've now run 3-4 builds without any failure.