Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make daemon “reset” itself when the IP address changes #284

Merged
merged 8 commits into from
Oct 18, 2019

Conversation

ivanpauno
Copy link
Member

The problem is that when the IP address changes (e.g.: by DHCP), discovery doesn't work all the time.

This is a first proposal for solving the problem.
I'm doing the following:
The methods registered in the XMLRPC server now check if one IP has changed, and in that case the DirectNode used by the Daemon is reset.

When the DirectNode is reset, the requested method is immediately call. Maybe an spin_time should be added (for ensuring enough discovery time).

With "checking that one IP has changed", I'm referring to the following -> If the IP address of any interface that has a gateway has changed, the DirectNode is reset.

As far as I understand, it's not possible to know what interface is the rmw implementation using.

Python doesn't have in the standard library a package for listing network interfaces, etc (in a platform independent way).
I'm using netifaces for that. I don't know if we want to add a new dependency or not.
In case that we don't want to add it, I can maybe add this as an optional plugin in another package.

For trying this, you have to install netifaces: python3 -m pip install -U netifaces

@ivanpauno ivanpauno added the enhancement New feature or request label Jun 14, 2019
@ivanpauno ivanpauno self-assigned this Jun 14, 2019
@wjwwood
Copy link
Member

wjwwood commented Jun 18, 2019

@codebot can you please ask someone on the SOSS project to try this out? Maybe Grey?

@codebot
Copy link
Member

codebot commented Jun 27, 2019

Thanks for this! Indeed, it does seem to be correctly deciding when the interface address has changed (I inserted a few syslog.syslog() calls in the various branches to watch it run when daemonized).

However, I find that although the initial run of ros2 topic list returns the "right" answer, unfortunately subsequent runs always return blank, like this:

mquigley@feather4:~/ros2/ros2cli_ws$ ros2 topic list
/parameter_events
/rosout
/topic
mquigley@feather4:~/ros2/ros2cli_ws$ ros2 topic list
mquigley@feather4:~/ros2/ros2cli_ws$ ros2 topic list
mquigley@feather4:~/ros2/ros2cli_ws$ ros2 topic list

Test scenario:

Machine A is continually connected to a single WiFi network, letting this command run forever:

ros2 run examples_rclpy_minimal_publisher publisher_old_school

Machine B is running ros2 topic list and sometimes changes networks in-between invocations. Although the problem shown above, where repeated invocations of ros2 topic list return empty, happens even when Machine B stays on the same WiFi net as Machine A.

ROS Dashing is being used on both machines.

Machine B is running the ros2 command from a ros2cli workspace overlaid on Dashing which has this branch checked out.

@ivanpauno
Copy link
Member Author

ivanpauno commented Jun 27, 2019

@codebot I will try to reproduce a similar scenario here.
Does the interface address of Machine A change too?

Edit:

I will also like to know platform details. Including rmw implementation, OS.

@codebot
Copy link
Member

codebot commented Jun 27, 2019 via email

@ivanpauno
Copy link
Member Author

ivanpauno commented Jun 27, 2019

@codebot I'm not being able to reproduce your problem.
I ran in one machine a node. In another machine I ran different ros2 commands and verbs (node list, topic list), and I haven't seen a "discovery" problem. In the middle of running the commands, I have manually switched the IP of the computer a bunch of times. Both machines were connected using WiFi.
I also tried with the master branch of this repo, which doesn't reset the daemon when an interface changes, and I couldn't reproduce the problem either.

Could you provide me some other details?:

  • rmw implementation you're using
  • Operating system

And anything else you think is useful for reproducing the issue correctly.

@codebot
Copy link
Member

codebot commented Jun 27, 2019

Thanks for trying to replicate this.

  • Ubuntu 18.04
  • rmw_fastrtps_cpp

The network in question is "unusual" in some ways. For reasons I still don't understand, it requires TTL >= 2 in order for UDP multicast from "peers" on the WiFi to reach each other. I'm doing that via the FASTRTPS_DEFAULT_PROFILES_FILE environment variable to set the transport parameters. Maybe there are other weird things happening resulting in abnormally high packet loss, but the behavior I was seeing today of ros2 topic list immediately returning blank seemed like an unrelated issue 🤷‍♂️

I'll try this experiment again on a "normal" WiFi network tomorrow.

@dirk-thomas
Copy link
Member

@codebot Just checking if you came up with any further information on this topic.

@wjwwood wjwwood mentioned this pull request Jul 23, 2019
34 tasks
@dirk-thomas
Copy link
Member

@codebot Friendly ping.

@claireyywang
Copy link
Contributor

OS: machine A (linux desktop), machine B (linux VM on laptop), ubuntu 18.04
RMW: fastrtps
ROS distro: Eloquent
I'm getting slightly different behaviour when testing this branch. I was not able to discover the topic publishing on machine A from machine B, but was able to use multicast both ways. I also encountered the same ros2 topic list returning blank behaviour, which was temporarily fixed by ros2 daemon stop. I found a thread https://answers.ros.org/question/304373/ros2-nodes-can-not-see-one-another-via-network/ discussing the same issues I ran into. It seems like they are both related to ros2 daemon.

@ivanpauno ivanpauno force-pushed the ivanpauno/reset-daemon-when-interface-changes branch from ea15772 to b8c9a75 Compare October 7, 2019 16:48
@ivanpauno
Copy link
Member Author

@claireyywang It would be good to have some additional information:

  • I was not able to discover the topic publishing on machine A from machine B, but was able to use multicast both ways

    I guess that you were initially able to discover topics on machine A from machine B, but after a network change it stopped working. In that case, how did you trigger the network change?

  • I haven't been able to reproduce the problem in our network. I'm not to sure what can be the difference, but if you could provide detailed information on how the computers were connected, it may be useful.

My setup: I have two machines in the same LAN, using static IPs. I run a node in machine A (e.g. ros2 run demo_nodes_cpp talker) and in machine B I run ros2 topic list, the node is listed correctly.
I then manually change the IP of MACHINE B, and run ros2 topic list again, having the same result.
I didn't need to run ros2 daemon stop, and I was using the master branch of this repo (it also works using this branch).

  • Does this happen using other DDS vendors?

I think there are two possible reasons why doing ros2 daemon stop solves the problem and this patch doesn't:

  • An error in the detection of a network interface change. That's part of this PR (this piece of code).
  • This patch is currently just destroying a node, shutting done rclpy, and creating a new node when a network interface change is detected. That may not be enough, as the DDS implementation may have some global state.
    Instead of destroying the node and creating a new one, I could run a new daemon process and stop the original one, which is equivalent to manually run ros2 daemon stop.

I don't believe it's the first problem, as I have checked and tested the logic a bunch of times.
To be sure that's not the case, I added some debug messages.
For seeing them, you have to use ros2 daemon start --debug (just in cast, first run ros2 daemon stop).
Each time you run ros2 topic list, you should see something like this:

get_topic_names_and_types()
Interface kind: 2, info: [('10.34.0.1', 'enx70886b89a852', True), ('10.34.0.1', 'wlp5s0', False)]
Addresses by interfaces: {2: {'enx70886b89a852': '10.34.0.67', 'wlp5s0': '10.34.0.104'}}
__LIST_OF_NODES__ (it may be empty)

And if some network interface changed, you should see:

get_topic_names_and_types()
Interface kind: 2, info: [('10.34.0.1', 'enx70886b89a852', True)]
Addresses by interfaces: {2: {'enx70886b89a852': '10.34.0.67'}}
Daemon node was reset
__LIST_OF_NODES__ (it may be empty)

@claireyywang Could you test this and check if you see the Daemon node was reset message when you run a cli command (e.g.: ros2 topic list) after a change happened in a network interface?
If you're seeing that message, I can discard the first problem and I'm sure it's the second.

If you're not seeing that, please copy all your logs here (the ones when you were able to discover topics in the other machine, and the ones after you stopped being able).

@claireyywang
Copy link
Contributor

claireyywang commented Oct 11, 2019

Thanks for the suggestions @ivanpauno! Here are some updates.
machine A (Linux VM on laptop), machine B (Linux desktop)

master branch:
Machine B publish to topic /testingmaster the whole time. Machine A and B start with being on the same wifi network.
Machine A was able to discover the topic published by B. After machine A was switched to a mobile hotspot network, A still lists /testingmaster in ros2 topic list, and only removed it after I ran ros2 daemon stop. Same thing happened after I switched A back to the same wifi network as B; A was not able to discover /testingmaster until daemon was restarted.

reset-daemon branch:
(followed your steps and saw the Daemon node was reset message)
Machine B publishing to topic /testingmaster. A and B on the same wifi network.
Machine A was not able to discover the topic (could also be experiencing the same delay as in master branch). I ran ros2 daemon stop and ros2 daemon start --debug, then ros2 topic list and got this output

get_topic_names_and_types()
Interface kind: 2, info: [('172.23.0.1', 'enp0s5', True)]
Addresses by interfaces: {2: {'enp0s5': '172.23.1.156'}}

After switching to a different wifi network, I ran ros2 topic list, got this output

get_topic_names_and_types()
Interface kind: 2, info: [('172.25.0.1', 'enp0s5', True)]
Addresses by interfaces: {2: {'enp0s5': '172.25.0.238'}}
Daemon node was reset

Signed-off-by: ivanpauno <ivanpauno@ekumenlabs.com>
Signed-off-by: Ivan Santiago Paunovic <ivanpauno@ekumenlabs.com>
Signed-off-by: Ivan Santiago Paunovic <ivanpauno@ekumenlabs.com>
Signed-off-by: Ivan Santiago Paunovic <ivanpauno@ekumenlabs.com>
@ivanpauno ivanpauno force-pushed the ivanpauno/reset-daemon-when-interface-changes branch from fa1ef33 to d7fd738 Compare October 11, 2019 21:02
@ivanpauno
Copy link
Member Author

Thanks for checking @claireyywang !

Machine A was able to discover the topic published by B. After machine A was switched to a mobile hotspot network, A still lists /testingmaster in ros2 topic list, and only removed it after I ran ros2 daemon stop.

Though participant removal is detected fast (e.g.: when you ctrl-c a node), dropping a participant (i.e. when not being able to communicate with it for a while) is inherently slow. If you wait some minutes, /testingmaster should stop appearing (I have tested that, and it finally disappears).
Resetting the daemon node should make the process faster.

Same thing happened after I switched A back to the same wifi network as B; A was not able to discover /testingmaster until daemon was restarted.

I was finally able to reproduce this.
Originally, I was just trying to change the IP and not to switch between networks.
AFAIU, network interfaces are checked when a Participant is created, and the Participant doesn't detect a network change (at least, that seems to be the case for Fast-RTPS).
Now I'm forcing to reset the daemon node when a new interface is detected (before, I was just resetting it when an interface disappeared).

reset-daemon branch:

I corrected some bugs in 9767d0f and forced to reset the daemon node when a new network interface is detected in d7fd738.
After that, both problems are solved.
@claireyywang could you please check if this is working for you?


I wonder if this only happens with Fast-RTPS or if it also happens with other rmw implementations.
I will check that on Monday.

Signed-off-by: Ivan Santiago Paunovic <ivanpauno@ekumenlabs.com>
@claireyywang
Copy link
Contributor

I confirm it is working. I switched machine between two different networks and the delay is significantly shorter; topics are correctly discovered without me doing anything to the daemon. Seems to me that both problems are solved indeed.

wrapper.__name__ = attr.__name__

if inspect.ismethod(attr):
return wrapper
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ivanpauno consider:

if inspect.ismethod(attr):
    @functools.wraps(attr)
    def wrapper(*args, **kwargs):
        self.reset_if_addresses_changed()
        return getattr(self.node, name)(*args, **kwargs)
    wrapper.__signature__ = inspect.signature(attr)

instead

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this needed?
wrapper.__signature__ = inspect.signature(attr)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIK, functools.wraps does that automatically.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is needed if we want to keep the decorator as transparent as possible, and no, functools.wraps does not force any function signature, only naming.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See ros2/launch#279 (comment), though it's arguably a detail that has no impact on this particular code (at least for the time being).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a proof, you run do the following:

import functools
import inspect

def asd(bsd, csd):
    print('asd')

@functools.wraps(asd)
def wrapper(*args, **kwargs):
    asd(*args, **kwargs)

print(inspect.signature(wrapper) == inspect.signature(asd))

The result is True.

@ivanpauno ivanpauno added the in review Waiting for review (Kanban column) label Oct 15, 2019
Signed-off-by: Ivan Santiago Paunovic <ivanpauno@ekumenlabs.com>
@ivanpauno ivanpauno requested a review from hidmic October 16, 2019 14:18
Signed-off-by: Ivan Santiago Paunovic <ivanpauno@ekumenlabs.com>
if not hasattr(args, 'debug') or not args.debug:
kwargs['stdout'] = subprocess.DEVNULL
kwargs['stderr'] = subprocess.DEVNULL
subprocess.Popen(cmd, **kwargs)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ivanpauno meta: I had not noticed, and it predates this PR, but this is a peculiar way to demonize a process.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know what is the correct way, but in any case, I would make that modification in a follow-up PR.

'--ros-domain-id', str(ros_domain_id)],
stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL, **kwargs)
'--ros-domain-id', str(ros_domain_id)])
if not hasattr(args, 'debug') or not args.debug:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ivanpauno hmm all CLIs that use NodeStrategy and happen to have a debug option may unintentionally hook up the daemon to their standard streams. Consider adding another argument to the spawn_daemon function instead.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See e339123.

Signed-off-by: Ivan Santiago Paunovic <ivanpauno@ekumenlabs.com>
Copy link
Contributor

@hidmic hidmic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@ivanpauno
Copy link
Member Author

CI, up to ros2cli test_ros2cli:

  • Linux Build Status
  • Linux-aarch64 Build Status
  • macOS Build Status
  • Windows Build Status

@ivanpauno
Copy link
Member Author

CI in ros2/ci#346 (comment).
Going in!

@ivanpauno ivanpauno merged commit ae2c044 into master Oct 18, 2019
@delete-merged-branch delete-merged-branch bot deleted the ivanpauno/reset-daemon-when-interface-changes branch October 18, 2019 17:56
esteve pushed a commit to esteve/ros2cli that referenced this pull request Dec 16, 2022
Signed-off-by: Michel Hidalgo <michel@ekumenlabs.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request in review Waiting for review (Kanban column)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants