Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ibv_fork_init #686

Closed
abouteiller opened this issue Mar 16, 2016 · 17 comments
Closed

ibv_fork_init #686

abouteiller opened this issue Mar 16, 2016 · 17 comments
Assignees
Milestone

Comments

@abouteiller
Copy link
Contributor

On some of my machines ibv_fork_init fails (using older OFED 1.5)
[1458141609.002887] [arc01:15309:0] ib_device.c:147 UCX ERROR ibv_fork_init() failed: No such file or directory

Now a more general question is why does UXC sets fork-safe by default ?
Users that care about forking their program can set the IBV_FORK_SAFE environment.

@shamisp shamisp added this to the v1.2 milestone Mar 16, 2016
@shamisp
Copy link
Contributor

shamisp commented Mar 16, 2016

@yosefe - please take a look

@yosefe
Copy link
Contributor

yosefe commented Mar 16, 2016

ibv_fork_init() adds very little overhead and saves trouble in applications (data corruption and unexpected behavior).
a different problem is why ibv_fork_init() failed. What system are you using (OS/HCA)? What is the exact OFED version? Was UCX compiled on same system?

@yosefe
Copy link
Contributor

yosefe commented Mar 20, 2016

@abouteiller

@shamisp
Copy link
Contributor

shamisp commented Mar 20, 2016

@yosefe - this is not a good enough reason to force people to upgrade the driver. I some cases this is not an option for people that use some older hardware. This can be easily fixed by adding a configure check and #ifdef

@yosefe
Copy link
Contributor

yosefe commented Mar 20, 2016

@shamisp according to the bug report, ibv_fork_init() is present, but failed in runtime, so ./configure won't help here be. I want to understand the failure reason.

@shamisp
Copy link
Contributor

shamisp commented Mar 20, 2016

@abouteiller - can you please confirm that the function is present but it fails. Why it fails ?

@abouteiller
Copy link
Contributor Author

Yes, compiles and links, fails at runtime with this error

11:53:22@dancer:~/ucx/ucx/debug.build                                                       Sun Mar. 20; 6 users, load5,15: 0.67,1.00
$ mpirun -np 2 -hostfile /opt/etc/arc.machinefile.ompi -map-by node bin/ucx_perftest -d mlx4_0:2 -x rc -D zcopy -t put_bw -c 0
+--------------+-----------------------------+---------------------+-----------------------+
|              |       latency (usec)        |   bandwidth (MB/s)  |  message rate (msg/s) |
+--------------+---------+---------+---------+----------+----------+-----------+-----------+
| # iterations | typical | average | overall |  average |  overall |   average |   overall |
+--------------+---------+---------+---------+----------+----------+-----------+-----------+
[1458489225.312986] [arc01:24732:0]      ib_device.c:147  UCX  ERROR ibv_fork_init() failed: No such file or directory
[1458489225.313150] [arc00:18159:0]      ib_device.c:147  UCX  ERROR ibv_fork_init() failed: No such file or directory

system is CentOS 6.7, with an older OFED 1.5.4.1. The same system image is deployed on compile node and compute nodes.

The test does run to completion with the following patch:

diff --git a/src/uct/ib/base/ib_device.c b/src/uct/ib/base/ib_device.c
index 83bc13d..a38f796 100644
--- a/src/uct/ib/base/ib_device.c
+++ b/src/uct/ib/base/ib_device.c
@@ -145,8 +145,8 @@ ucs_status_t uct_ib_device_init(uct_ib_device_t *dev, struct ibv_device *ibv_dev
     ret = ibv_fork_init();
     if (ret) {
         ucs_error("ibv_fork_init() failed: %m");
-        status = UCS_ERR_IO_ERROR;
-        goto err;
+        //status = UCS_ERR_IO_ERROR;
+        //goto err;
     }

     /* Open verbs context */

@abouteiller
Copy link
Contributor Author

Interestingly, the issue does not happen when using the "sock" RTE, but only when using the MPI rte (open MPI 1.8.8).

From the Open MPI "verbs" file, I find the following warnings:

/*
169  * ibv_fork_init testing - if fork support is requested then ibv_fork_init                                                       
170  * should be called right at the beginning of the verbs initialization flow, before ibv_create_* call.
171  *
172  * Known limitations:
173  * If ibv_fork_init is called after ibv_create_* functions - it will have no effect.
174  * OMPI initializes verbs many times during initialization in the following verbs components:
175  *      oob/ud, btl/openib, mtl/mxm, pml/yalla, oshmem/ikrit, oshmem/yoda, ompi/mca/coll/{fca,hcoll}
176  *
177  * So, ibv_fork_init should be called once, in the beginning of the init flow of every verb component
178  * to proper request fork support.
179  *
180  */

So in essence, to enable interoperability with MPI, we should try to differentiate between fork_init failing because it has failed, or failing because somebody else already did it and then already created endpoints.

@yosefe
Copy link
Contributor

yosefe commented Mar 20, 2016

hmm.. that means OpenMPI is probably using verbs without calling ibv_fork_init(), and then UCX fails when it calls ibv_fork_init().
i'll take it with our driver team.

@shamisp
Copy link
Contributor

shamisp commented Mar 20, 2016

maybe, can somebody look at the implementation ?

@abouteiller
Copy link
Contributor Author

Indeed: forcing Open MPI to issue itself ibv_fork_init (mpirun -mca btl_openib_want_fork_support 1) does remove the error in UCX.

In Open MPI, if the user does not explicitly require fork support, it is tried, but any error is silently ignored.
(see opal/mca/common/verbs/common_verbs_basics.c)

@yosefe
Copy link
Contributor

yosefe commented Mar 21, 2016

Maybe we should do the same in UCX? Or at least give a warning and not fail?

@shamisp
Copy link
Contributor

shamisp commented Mar 21, 2016

Well , what will be our default behavior with ompi ? I don't want to generat warning on every run...

@yosefe
Copy link
Contributor

yosefe commented Mar 21, 2016

maybe print the warning only in a handler registered by pthread_atfork?

@shamisp
Copy link
Contributor

shamisp commented Mar 21, 2016

if this is OMPI bug, we have to probably put some note on Wiki+Readme

@yosefe
Copy link
Contributor

yosefe commented Mar 21, 2016

I would suggest to enhance the error message to something like:
"please set UCX_FORK_SAFE=n, or make sure IB/verbs memory registration is not used without calling ibv_fork_init() first."

@abouteiller
Copy link
Contributor Author

pr #702

@yosefe yosefe closed this as completed Mar 24, 2016
dmitrygx pushed a commit to dmitrygx/ucx that referenced this issue Dec 1, 2021
These recently dropped the `master` branch and switched to `main`. So
update the install steps to use `main` instead.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants