ibv_fork_init #686
Comments
@yosefe - please take a look
ibv_fork_init() adds very little overhead and saves trouble in applications (data corruption and unexpected behavior).
@yosefe - this is not a good enough reason to force people to upgrade the driver. In some cases this is not an option for people who use older hardware. This can easily be fixed by adding a configure check and an #ifdef.
@shamisp according to the bug report, ibv_fork_init() is present but failed at runtime, so ./configure won't help here. I want to understand the reason for the failure.
@abouteiller - can you please confirm that the function is present but fails? Why does it fail?
Yes, compiles and links, fails at runtime with this error
The system is CentOS 6.7, with an older OFED 1.5.4.1. The same system image is deployed on the compile node and the compute nodes. The test does run to completion with the following patch:

diff --git a/src/uct/ib/base/ib_device.c b/src/uct/ib/base/ib_device.c
index 83bc13d..a38f796 100644
--- a/src/uct/ib/base/ib_device.c
+++ b/src/uct/ib/base/ib_device.c
@@ -145,8 +145,8 @@ ucs_status_t uct_ib_device_init(uct_ib_device_t *dev, struct ibv_device *ibv_dev
     ret = ibv_fork_init();
     if (ret) {
         ucs_error("ibv_fork_init() failed: %m");
-        status = UCS_ERR_IO_ERROR;
-        goto err;
+        //status = UCS_ERR_IO_ERROR;
+        //goto err;
     }
     /* Open verbs context */
Interestingly, the issue does not happen when using the "sock" RTE, but only when using the MPI RTE (Open MPI 1.8.8). In the Open MPI "verbs" file, I find the following warnings:
So in essence, to enable interoperability with MPI, we should try to differentiate between ibv_fork_init() failing because of a genuine error and failing because somebody else has already initialized verbs and created endpoints.
Hmm, that means Open MPI is probably using verbs without calling ibv_fork_init(), and then UCX fails when it calls ibv_fork_init().
Maybe. Can somebody look at the implementation?
Indeed: forcing Open MPI to call ibv_fork_init() itself (mpirun -mca btl_openib_want_fork_support 1) does remove the error in UCX. In Open MPI, if the user does not explicitly require fork support, it is tried, but any error is silently ignored.
Maybe we should do the same in UCX? Or at least give a warning and not fail?
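To make the "warn and do not fail" idea concrete, here is a minimal sketch of such a policy. Everything in it is hypothetical and not taken from UCX or Open MPI: the names `try_fork_init`, `fork_init_policy_t`, and the two `fake_fork_init_*` stand-ins are invented for illustration. The only assumption about libibverbs is that `ibv_fork_init()` takes no arguments and returns 0 on success or an errno value on failure, which is why the function pointer type mirrors that signature.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical policy names; not part of UCX or libibverbs. */
typedef enum {
    FORK_INIT_OPTIONAL, /* warn and continue if initialization fails */
    FORK_INIT_REQUIRED  /* treat a failure as fatal */
} fork_init_policy_t;

/* Same shape as ibv_fork_init(): 0 on success, errno value on failure. */
typedef int (*fork_init_fn_t)(void);

/*
 * Call the supplied initializer and apply the policy.
 * Returns 0 if the caller may continue, -1 if it should abort.
 */
static int try_fork_init(fork_init_fn_t fork_init, fork_init_policy_t policy)
{
    int ret = fork_init();

    if (ret == 0) {
        return 0;
    }

    if (policy == FORK_INIT_REQUIRED) {
        fprintf(stderr, "error: fork init failed: %s\n", strerror(ret));
        return -1;
    }

    /* Optional mode: report the problem but let initialization proceed. */
    fprintf(stderr, "warning: fork init failed: %s; continuing without "
                    "fork support\n", strerror(ret));
    return 0;
}

/* Stand-ins for ibv_fork_init() so the sketch builds without libibverbs. */
static int fake_fork_init_ok(void)     { return 0; }
static int fake_fork_init_enoent(void) { return ENOENT; }
```

In a real build one would pass `ibv_fork_init` itself instead of the stand-ins. Whether optional mode should be the default, and whether the warning should be printed on every run, are exactly the points still being debated in this thread.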
Well, what will our default behavior with OMPI be? I don't want to generate a warning on every run...
maybe print the warning only in a handler registered by
If this is an OMPI bug, we should probably put a note on the Wiki and in the README.
I would suggest enhancing the error message to something like:
PR #702
On some of my machines ibv_fork_init() fails (using older OFED 1.5):
[1458141609.002887] [arc01:15309:0] ib_device.c:147 UCX ERROR ibv_fork_init() failed: No such file or directory
Now a more general question is why does UCX set fork-safe by default?
Users who care about forking their program can set the IBV_FORK_SAFE environment variable.
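As an illustration of the opt-in approach, a library could check that variable before deciding whether to request fork safety at all. The sketch below is only an assumption about how such a check might look: the helper name `fork_safety_requested` is hypothetical, and it treats the mere presence of `IBV_FORK_SAFE` in the environment as the opt-in signal, regardless of its value.

```c
#include <stdbool.h>
#include <stdlib.h>

/*
 * Hypothetical helper: returns true when the user has opted into fork
 * safety by defining IBV_FORK_SAFE. Only presence is checked here; the
 * value of the variable is ignored (an assumption of this sketch).
 */
static bool fork_safety_requested(void)
{
    return getenv("IBV_FORK_SAFE") != NULL;
}
```

With a check like this, fork initialization would be attempted only when the user asked for it, so a failure could reasonably be treated as an error rather than silently ignored.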