Cluster broken when pre-touch enabled #1218

Closed
wojciech-adaptive opened this issue Sep 1, 2021 · 4 comments

@wojciech-adaptive
Contributor

If you run a cluster node with -Daeron.pre.touch.mapped.memory=true, it will fail with:

java.lang.NullPointerException
	at io.aeron.ClientConductor.logBuffers(ClientConductor.java:1194)
	at io.aeron.ClientConductor.onNewPublication(ClientConductor.java:298)
	at io.aeron.DriverEventsAdapter.onMessage(DriverEventsAdapter.java:147)
	at org.agrona.concurrent.broadcast.CopyBroadcastReceiver.receive(CopyBroadcastReceiver.java:116)
	at io.aeron.DriverEventsAdapter.receive(DriverEventsAdapter.java:68)
	at io.aeron.ClientConductor.service(ClientConductor.java:1214)
	at io.aeron.ClientConductor.awaitResponse(ClientConductor.java:1272)
	at io.aeron.ClientConductor.addPublication(ClientConductor.java:454)
	at io.aeron.Aeron.addPublication(Aeron.java:274)
	at io.aeron.cluster.service.ContainerClientSession.connect(ContainerClientSession.java:107)
	at io.aeron.cluster.service.ClusteredServiceAgent.onSessionOpen(ClusteredServiceAgent.java:411)
	at io.aeron.cluster.service.BoundedLogAdapter.onMessage(BoundedLogAdapter.java:186)
	at io.aeron.cluster.service.BoundedLogAdapter.onFragment(BoundedLogAdapter.java:68)
	at io.aeron.Image.boundedControlledPoll(Image.java:530)
	at io.aeron.cluster.service.BoundedLogAdapter.poll(BoundedLogAdapter.java:125)
	at io.aeron.cluster.service.ClusteredServiceAgent.doWork(ClusteredServiceAgent.java:169)
	at org.agrona.concurrent.AgentRunner.doWork(AgentRunner.java:292)
	at org.agrona.concurrent.AgentRunner.run(AgentRunner.java:165)
	at java.base/java.lang.Thread.run(Thread.java:829)

This can be reproduced by running SingleNodeTest.shouldSendMessagesToCluster() on master with that property set. It appears to have been an issue since 1.35.0.
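
For anyone reproducing this from an IDE rather than the command line, the flag can also be set programmatically. The snippet below is a minimal sketch using only the JDK; the class and method names are hypothetical, and it assumes the property is read when the Aeron client context is created, so it would need to run before the test starts the cluster node.

```java
public final class PreTouchProperty
{
    // Property name as given in the report (-Daeron.pre.touch.mapped.memory=true).
    static final String PRE_TOUCH_PROP = "aeron.pre.touch.mapped.memory";

    // Sets the system property and reads it back as a boolean.
    static boolean enablePreTouch()
    {
        System.setProperty(PRE_TOUCH_PROP, "true");
        return Boolean.getBoolean(PRE_TOUCH_PROP);
    }

    public static void main(final String[] args)
    {
        System.out.println(PRE_TOUCH_PROP + "=" + enablePreTouch());
    }
}
```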

@vyazelenko
Contributor

There are two calls to add publication for the same channel:

  • An async call which succeeds:
    DriverProxy#addPublication: channel=aeron:udp?endpoint=localhost:59150|term-length=128k, streamId=102
    java.lang.Exception
      at io.aeron.DriverProxy.addPublication(DriverProxy.java:74)
      at io.aeron.ClientConductor.asyncAddPublication(ClientConductor.java:493)
      at io.aeron.Aeron.asyncAddPublication(Aeron.java:300)
      at io.aeron.cluster.ClusterSession.asyncConnect(ClusterSession.java:132)
      at io.aeron.cluster.ConsensusModuleAgent.onSessionConnect(ConsensusModuleAgent.java:378)
      at io.aeron.cluster.IngressAdapter.onFragment(IngressAdapter.java:109)
      at io.aeron.ControlledFragmentAssembler.onFragment(ControlledFragmentAssembler.java:122)
      at io.aeron.Image.controlledPoll(Image.java:369)
      at io.aeron.Subscription.controlledPoll(Subscription.java:235)
      at io.aeron.cluster.IngressAdapter.poll(IngressAdapter.java:178)
      at io.aeron.cluster.ConsensusModuleAgent.consensusWork(ConsensusModuleAgent.java:2025)
      at io.aeron.cluster.ConsensusModuleAgent.doWork(ConsensusModuleAgent.java:337)
      at org.agrona.concurrent.AgentRunner.doWork(AgentRunner.java:292)
      at org.agrona.concurrent.AgentRunner.run(AgentRunner.java:165)
      at java.lang.Thread.run(Thread.java:748)
    
  • A blocking call that fails with an NPE:
    DriverProxy#addPublication: channel=aeron:udp?endpoint=localhost:59150|term-length=128k, streamId=102
    java.lang.Exception
      at io.aeron.DriverProxy.addPublication(DriverProxy.java:74)
      at io.aeron.ClientConductor.addPublication(ClientConductor.java:453)
      at io.aeron.Aeron.addPublication(Aeron.java:274)
      at io.aeron.cluster.service.ContainerClientSession.connect(ContainerClientSession.java:107)
      at io.aeron.cluster.service.ClusteredServiceAgent.onSessionOpen(ClusteredServiceAgent.java:411)
      at io.aeron.cluster.service.BoundedLogAdapter.onMessage(BoundedLogAdapter.java:186)
      at io.aeron.cluster.service.BoundedLogAdapter.onFragment(BoundedLogAdapter.java:68)
      at io.aeron.Image.boundedControlledPoll(Image.java:530)
      at io.aeron.cluster.service.BoundedLogAdapter.poll(BoundedLogAdapter.java:125)
      at io.aeron.cluster.service.ClusteredServiceAgent.doWork(ClusteredServiceAgent.java:169)
      at org.agrona.concurrent.AgentRunner.doWork(AgentRunner.java:292)
      at org.agrona.concurrent.AgentRunner.run(AgentRunner.java:165)
      at java.lang.Thread.run(Thread.java:748)
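
The two traces above start with java.lang.Exception rather than a real failure, which suggests they were captured by temporarily logging the call site. The exact instrumentation is not shown in this thread, so the following is only a sketch of that general technique; CallSiteTrace and its methods are hypothetical stand-ins and not part of Aeron.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

public final class CallSiteTrace
{
    // Builds a label line followed by the current call stack, without throwing.
    static String trace(final String label)
    {
        final StringWriter out = new StringWriter();
        out.append(label).append(System.lineSeparator());
        new Exception().printStackTrace(new PrintWriter(out, true));
        return out.toString();
    }

    // Hypothetical stand-in for the method being instrumented.
    static String addPublication(final String channel, final int streamId)
    {
        return trace("addPublication: channel=" + channel + ", streamId=" + streamId);
    }

    public static void main(final String[] args)
    {
        System.out.print(addPublication("aeron:udp?endpoint=localhost:59150|term-length=128k", 102));
    }
}
```

Comparing the output of two such traces for the same channel and stream id is what makes the differing callers (async versus blocking) visible.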
    

@mjpt777
Contributor

mjpt777 commented Sep 1, 2021

Enabling pre-touch exposes a race condition, a bug introduced with the async adding of publications. It should be fixed by commit 54e4be4.

@vyazelenko
Contributor

I modified the test to run with pre-touch both on and off, and it passes with 54e4be4.

mjpt777 closed this as completed Sep 6, 2021
@mjpt777
Contributor

mjpt777 commented Sep 6, 2021

Even if pre-touch is not enabled, this can result in Publication.channel() returning null and a memory leak in the client conductor.
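
One way to read this comment is that the race leaves the channel string null, and pre-touch merely happens to dereference it early. The toy model below illustrates that reading only; it assumes the missing value is the channel and that only the pre-touch path touches it. All names are hypothetical, this is not the code in ClientConductor, and the memory leak is not modelled.

```java
import java.util.HashMap;
import java.util.Map;

public final class NullChannelSketch
{
    // Hypothetical lookup of the channel recorded for each add request.
    static final Map<Long, String> channelByRegistrationId = new HashMap<>();

    // Returns the channel that would end up on the publication.
    static String onNewPublication(final long registrationId, final boolean preTouch)
    {
        // Null when no entry exists for this id, standing in for the lost value.
        final String channel = channelByRegistrationId.get(registrationId);

        if (preTouch && !channel.isEmpty())
        {
            // Assumed: the pre-touch path inspects the channel, so a null fails here.
        }

        return channel;
    }

    public static void main(final String[] args)
    {
        // Pre-touch off: the null passes through silently.
        System.out.println("channel=" + onNewPublication(42L, false));

        try
        {
            // Pre-touch on: the same null is dereferenced and throws.
            onNewPublication(42L, true);
        }
        catch (final NullPointerException ex)
        {
            System.out.println("NPE with pre-touch enabled");
        }
    }
}
```

Under these assumptions the defect exists in both configurations, and pre-touch only changes whether it surfaces immediately as an exception or later as a null channel.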
