
Allow RMI socket connection timeout to be configurable #279

Closed
sirianni opened this issue Mar 4, 2020 · 3 comments

sirianni commented Mar 4, 2020

Problem

We are seeing prolonged collection cycles in jmxfetch when monitored apps are removed or restarted.

For example, note the 68-second collection cycle when this Kafka broker is restarted.

 2020-03-04 16:00:48,639 | DEBUG| Reporter | Instance kafka-10.5.225.11-7203 is sending 281 metrics to the metrics reporter during collection #152
 2020-03-04 16:00:49,493 | DEBUG| App | Iteration ran in 964 ms
 2020-03-04 16:01:04,640 | DEBUG| Reporter | Instance kafka-10.5.225.11-7203 is sending 281 metrics to the metrics reporter during collection #153
 2020-03-04 16:01:04,652 | DEBUG| App | Iteration ran in 140 ms
 2020-03-04 16:02:27,753 | DEBUG| App | Iteration ran in 68081 ms
 2020-03-04 16:02:57,185 | DEBUG| Reporter | Instance kafka-10.5.225.12-7203 is sending 250 metrics to the metrics reporter during collection #155
 2020-03-04 16:02:57,210 | DEBUG| App | Iteration ran in 14406 ms
 2020-03-04 16:03:12,375 | DEBUG| Reporter | Instance kafka-10.5.225.12-7203 is sending 250 metrics to the metrics reporter during collection #156
 2020-03-04 16:03:12,391 | DEBUG| App | Iteration ran in 166 ms
 2020-03-04 16:03:27,547 | DEBUG| Reporter | Instance kafka-10.5.225.12-7203 is sending 250 metrics to the metrics reporter during collection #157
 2020-03-04 16:03:27,555 | DEBUG| App | Iteration ran in 142 ms

During the long collection cycle, I noticed the MetricCollectionTask thread stuck in the following stack trace:

 "pool-9-thread-1" #31 prio=5 os_prio=0 cpu=2222.15ms elapsed=1233.58s tid=0x00007fb988350000 nid=0x293 runnable  [0x00007fb97090f000]
    java.lang.Thread.State: RUNNABLE
 	at java.net.PlainSocketImpl.socketConnect(java.base@11.0.6-ea/Native Method)
 	at java.net.AbstractPlainSocketImpl.doConnect(java.base@11.0.6-ea/AbstractPlainSocketImpl.java:399)
 	- locked <0x000000066bd3fe00> (a java.net.SocksSocketImpl)
 	at java.net.AbstractPlainSocketImpl.connectToAddress(java.base@11.0.6-ea/AbstractPlainSocketImpl.java:242)
 	at java.net.AbstractPlainSocketImpl.connect(java.base@11.0.6-ea/AbstractPlainSocketImpl.java:224)
 	at java.net.SocksSocketImpl.connect(java.base@11.0.6-ea/SocksSocketImpl.java:403)
 	at java.net.Socket.connect(java.base@11.0.6-ea/Socket.java:609)
 	at java.net.Socket.connect(java.base@11.0.6-ea/Socket.java:558)
 	at java.net.Socket.<init>(java.base@11.0.6-ea/Socket.java:454)
 	at java.net.Socket.<init>(java.base@11.0.6-ea/Socket.java:231)
 	at sun.rmi.transport.tcp.TCPDirectSocketFactory.createSocket(java.rmi@11.0.6-ea/TCPDirectSocketFactory.java:40)
 	at sun.rmi.transport.tcp.TCPEndpoint.newSocket(java.rmi@11.0.6-ea/TCPEndpoint.java:613)
 	at sun.rmi.transport.tcp.TCPChannel.createConnection(java.rmi@11.0.6-ea/TCPChannel.java:209)
 	at sun.rmi.transport.tcp.TCPChannel.newConnection(java.rmi@11.0.6-ea/TCPChannel.java:196)
 	at sun.rmi.server.UnicastRef.invoke(java.rmi@11.0.6-ea/UnicastRef.java:129)
 	at jdk.jmx.remote.internal.rmi.PRef.invoke(jdk.remoteref/Unknown Source)
 	at javax.management.remote.rmi.RMIConnectionImpl_Stub.getAttribute(java.management.rmi@11.0.6-ea/Unknown Source)
 	at javax.management.remote.rmi.RMIConnector$RemoteMBeanServerConnection.getAttribute(java.management.rmi@11.0.6-ea/RMIConnector.java:904)
 	at org.datadog.jmxfetch.Connection.getAttribute(Connection.java:72)
 	at org.datadog.jmxfetch.JmxAttribute.getJmxValue(JmxAttribute.java:266)
 	at org.datadog.jmxfetch.JmxSimpleAttribute.getValue(JmxSimpleAttribute.java:120)
 	at org.datadog.jmxfetch.JmxSimpleAttribute.getMetrics(JmxSimpleAttribute.java:45)
 	at org.datadog.jmxfetch.Instance.getMetrics(Instance.java:423)
 	at org.datadog.jmxfetch.MetricCollectionTask.call(MetricCollectionTask.java:26)
 	at org.datadog.jmxfetch.MetricCollectionTask.call(MetricCollectionTask.java:8)

Digging in a bit more, I saw that TCPDirectSocketFactory uses the new Socket(host, port) constructor, which ultimately calls Socket.connect(endpoint) with an infinite (0) timeout.
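
For illustration, here is a minimal sketch (not jmxfetch code; the host and port are placeholders) contrasting the connect path used by TCPDirectSocketFactory with the timed connect used in the workaround below:

    import java.io.IOException;
    import java.net.InetSocketAddress;
    import java.net.Socket;

    public class ConnectTimeoutSketch {
        public static void main(String[] args) throws IOException {
            String host = "10.5.225.11"; // placeholder, e.g. a restarted broker
            int port = 7203;             // placeholder JMX/RMI port

            // What TCPDirectSocketFactory effectively does: the (host, port)
            // constructor connects with timeout 0 (infinite), so a dead or
            // unreachable endpoint blocks until the OS-level TCP timeout expires.
            // Socket blocking = new Socket(host, port);

            // The workaround: create an unconnected socket and connect with
            // an explicit timeout so the attempt fails fast.
            Socket socket = new Socket();
            socket.connect(new InetSocketAddress(host, port), 5000); // 5 s
            socket.close();
        }
    }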

Solution

I was able to work around this issue by forking jmxfetch and implementing a custom RMISocketFactory:

    import java.io.IOException;
    import java.net.InetSocketAddress;
    import java.net.ServerSocket;
    import java.net.Socket;
    import java.rmi.server.RMISocketFactory;

    class SocketFactory extends RMISocketFactory {

        private static final int DEFAULT_SOCKET_CONNECT_TIMEOUT = 5000;

        private final int timeout;

        SocketFactory() {
            this.timeout = DEFAULT_SOCKET_CONNECT_TIMEOUT;
        }

        SocketFactory(int timeout) {
            this.timeout = timeout;
        }

        @Override
        public Socket createSocket(String host, int port) throws IOException {
            // Connect with an explicit timeout instead of new Socket(host, port),
            // which blocks with an infinite (0) connect timeout.
            Socket socket = new Socket();
            socket.connect(new InetSocketAddress(host, port), timeout);
            return socket;
        }

        @Override
        public ServerSocket createServerSocket(int port) throws IOException {
            return new ServerSocket(port);
        }

    }

And setting it with:

    RMISocketFactory.setSocketFactory(new SocketFactory());

With this change, I am able to get more reasonable collection cycle times when apps are restarted out from underneath jmxfetch. For this scenario, the problematic collection cycle time was reduced from ~70 seconds to ~20 seconds.

Suggestion

  • Implement a custom RMISocketFactory in jmxfetch
  • Allow the timeout to be configurable
    • Using the same value as the existing rmi_client_timeout param
    • Or adding a new config param, rmi_connection_timeout (a rough sketch of this wiring follows below)
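
A rough sketch of how the configurable timeout could be wired, assuming a hypothetical rmi_connection_timeout value passed down from the instance config (SocketFactory is the class from the workaround above; the class and parameter names here are illustrative, not existing jmxfetch API):

    import java.io.IOException;
    import java.rmi.server.RMISocketFactory;

    public class RmiTimeoutSetup {
        // Default mirroring the workaround above, in milliseconds.
        private static final int DEFAULT_RMI_CONNECTION_TIMEOUT = 5000;

        // Installs the timeout-aware factory as the JVM-wide default.
        // Note: RMISocketFactory.setSocketFactory may only be called once per
        // JVM, so this must run before the first RMI connection is opened.
        public static void install(Integer rmiConnectionTimeout) throws IOException {
            int timeout = (rmiConnectionTimeout != null)
                    ? rmiConnectionTimeout
                    : DEFAULT_RMI_CONNECTION_TIMEOUT;
            RMISocketFactory.setSocketFactory(new SocketFactory(timeout));
        }
    }
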
@prognant (Contributor)

Hello there, thanks for opening an issue and proposing a fix.
We think this is a valid use case. We will work to merge your change and expose the configuration setting in the agent so that it can call jmxfetch with the flag you propose to add.

@prognant (Contributor)

Hi there, ongoing work on the RMI socket factory: #298

@prognant (Contributor)

Since #298 has been merged, I think we can close this one.
Thanks for raising the issue; hopefully it is closed for good!
