-
Notifications
You must be signed in to change notification settings - Fork 39
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
test new HTCondor python bindings #8272
Comments
@Todd-L-Miller FYI (just a check that I got your correct GH handle) |
Yup, that's me. (Checking ability to comment...) |
Just to report what I found today. I tried to build and run CRAB with the new htcondor binding (
But, eventually, I could build it without errors. Please see the Dockerfile for the details. Next, I added the whole Still, I cannot submit jobs to Schedd yet because of API mismatch:
htcondor.param['COLLECTOR_HOST'] = self.getCollector().encode('ascii', 'ignore') It raise type error but remove Another place is schedds = coll.query(htcondor.AdTypes.Schedd, 'Name=?=%s' % HTCondorUtils.quote(schedd.encode('ascii', 'ignore')),
["AddressV1", "CondorPlatform", "CondorVersion", "Machine", "MyAddress", "Name", "MyType",
"ScheddIpAddr", "RemoteCondorSetup"]) Also, remove
At DagmanSubmitter.py#L330, we use
Here at HTCondorUtils.py#L62. I am stuck at this point. I do not know what That is all for today. |
I tried to build and run CRAB with the new htcondor binding
(`htcondor2`) for fun.
Thanks! This is great.
[snip: building]
I'm glad you were able to get this figured out; the build
containers are public for sadly good reason. The version 2 bindings
don't require Boost, but we haven't made that optional for the build
system yet.
Next, I added the whole `release_dir` to my TaskWorker image and just
exported PYTHONPATH to `$(realpath release_dir)/lib/python3`.
And, it magically works (I could do `import htcondor2 as htcondor` with
python3.8 inside my image).
Yay!
It raise type error but remove `.encode('ascii', 'ignore')` fix the issue.
Interesting. HTCondor will generally accept UTF-8 encoding
anywhere (so the conversion to ASCII shouldn't be necessary), but the
version 2 bindings are generally stricter about type-checking (which is
a little un-Pythonic, but makes the C++ side of the bindings way easier
to write).
Also, remove `.encode('ascii', 'ignore')` fix the issue.
Do you know if this encode() calls were added to fix problems, or
just to be careful?
2. `xquery()` method of Schedd object.
We had to retire xquery(); this will be called out in a transition
guide. The original intent of xquery() was to stream results back from
the schedd, but given that the code converts the result into a list before
looking at it, there's clearly no concern about using too much memory
here, which I'm glad to hear.
3. No `htcondor.SecMan()` or similar name in `htcondor2`.
I am stuck at this point. I do not know what `htcondor.SecMan()` does
or which API can replace it.
SecMan is currently missing from the htcondor2 library. That
call appears to be the only place that it's being used in CRAB; I'm not
sure why the authenticated subprocess needs to reauthenticate from
scratch, but that particular function probably shouldn't take too long to
implement.
That is all for today.
Thank you for the excellent reporting.
I don't know if CRAB will work properly if you just comment-out
the call to SecMan, but even getting things to the point where it fails
because of that missing call rather than because of any other problem with
the new bindings would be tremendous. Thanks again for your help.
- ToddM
|
Amazing work Wa, thanks ! I can't comment/advice on building. I hope that we will not have to worry about things like python 3.8 vs. python 3.9, atm we use python3.8 in TW , python 3.6 in schedulers, same HTC major version (10) and everything is OK. Todd let me try make 3 comments:
Bottom line: we can surely profit anyhow from aligning current code to current bindings and clean up GSI legacy stuff. Then chances are that htcondor2 and classad2 work smoothly. In the end we do not to do anything more then what is the examples in the documentation ! The only non trivial I am not sure about next step(s) though. We can remove 1. 2. 3. above and try again but for a full test it would be way more convenient to have a build from HTCondor to use since we need to deploy both in the container we use for TW (what Wa did) and in the scheduler hosts where we currently get condor via
To see that warning one needs to call it in an interactive shell. It is not in our logs. Working w/o locking condor log files is a different story though. And it may simply mean that eventually we really need to do the work outlined in #6270 |
I hope that we will not have to worry about things like python 3.8 vs.
python 3.9, atm we use python3.8 in TW , python 3.6 in schedulers, same
HTC major version (10) and everything is OK.
The htcondor and classad packages are (and must be) built for
specific Python minor versions, but have the same compatibility guarantees
as HTCondor proper. At some point, hopefully in the near future, we'll
start building the htcondor2 and classad2 packages in the
minor-version-independent way that they support.
3. `SecMan` and `authenticated subprocess`
but I am optimistic that all of this code can be simply removed and we
should make calls to condor bindings directly in TW code w/o going
through a (authenticated) subprocess.
That would be good news for me, and I think simplify and speed up
your code, so this sounds like a good idea to me. :)
Yet I am puzzled that `SecMan` is listed in latest bindings
as the only cure in case a daemon restarts and `all future commands
will fail with strange connection errors.`
I'll have to do some digging to determine if this is a general
Python bindings problem, a v1-specific problem, or just a design decision
on our part, because it's clearly a situation our daemons can deal with.
Maybe we simply need to be sure that all collector and schedd objects
which we create are in the temporary scope of some object and do not try
to persist in memory for days ?
That seems like a reasonable approach, but I think the C++
library's session cache isn't tied to the lifetime of any particular
daemon object. That doesn't necessarily mean the Python bindings have to
expose that fact, but I have some investigating to do.
Bottom line: we can surely profit anyhow from aligning current code to
current bindings and clean up GSI legacy stuff. Then chances are that
htcondor2 and classad2 work smoothly.
That would be awesome.
In the end we do not to do anything more then what is the examples in
the documentation ! The only non trivial thing is that the schedd we
submit to runs on a different VM. But I presume that that is part of
standard tests for the bindings.
Our test suite is not that sophisticated, sadly. :(
Wa did not get to test `classAd2` here, but the use we make of it should
be very basic.
Good news. I have a fair amount of confidence in `classad2`, but
we also deliberately removed a few things from the API there.
I am not sure about next step(s) though. We can remove 1. 2. 3. above
and try again but for a full test it would be way more convenient to
have a build from HTCondor to use since we need to deploy both in the
container we use for TW (what Wa did) and in the scheduler hosts where
we currently get condor via `yum`.
I believe the RPMs for 23.x in our repositories have (some version
of) the `classad2` and `htcondor2` modules in them already; I recommended
building them from source so that you could test any updates I made as a
result of your testing more easily.
I appreciate your willingness to update your code and help us test
ours!
In the schedulers we have a bunch of classAd read/write (hopefully
transparent) and a couple more exotic use case with JEL's
I believe the JobEventLog class is already 100% supported in the
htcondor2 module.
and `htcondor.lock()`.
I am just a tiny bit worried by the latter since I do not find it in the
documentation even if it is working in HTC 23.
`htcondor.lock()` doesn't appear in the 10.0 documentation,
either, that I can tell.
To see that warning one needs to call it in an interactive shell. It is
not in our logs.
To my -- limited -- understanding, this was a design decision by
the Python people. See
https://docs.python.org/3/library/warnings.html#default-warning-filter
; you can turn on deprecation warnings explicitly.
Working w/o locking condor log files is a different story though.
And it may simply mean that eventually we really need to do the work
outlined in #6270
That looks like a much cleaner solution.
- ToddM
|
Hi @Todd-L-Miller BTW I tried to use htcondor2 in HTC 23, but it fails to be imported due to lack of classad2
same error from Could this be a shortcoming of our installation ? In the submission machine we simply
and have same problem with |
Could this be a shortcoming of our installation ?
The Python wheels don't have htcondor2 or classad2 yet; you should
only be getting those from the .rpm / .deb / .tar.gz installation(s). It
looks like (based on the .local in the path) that maybe both of your
tests were against a verion of the bindings installed from a wheel?
(I'm curious about the contents of
/afs/cern.ch/user/b/belforte/.local/lib/python3.9/site-packages/classad2
and
/afs/cern.ch/user/b/belforte/.local/lib/python3.9/site-packages/htcondor2
in this test -- IIRC, htcondor2 tries to load classad2 before it tries to
load its own extension module, so it's possible that both of them are
missing the .so's, as expected).
…-- ToddM
|
indeed, here's the container where we pip-installed condor
But in the scheduler host we install from rpm (via puppet) from htcondor-stable-23.0
in this case I have these binaries
thanks for having pointed out that somehow I have managed to get Maybe as simple as install the rpm for the HTC 23 feature branch (23.7 e.g.) non the stable 23.0 ? |
I switched the scheduler host to 23.7.2 and now import of htcondor2 and classad2 work ! The submitter process runs in a debian-based docker container where we currently pip-install. |
HI @Todd-L-Miller
|
for me and @novicecpp : the changes I did to get a TW container with HTC 23.7 and htcondor2 binaries are in https://github.com/belforte/CRABServer/tree/htc23.x and the container image is |
Good news: as of last Friday, the version 2 bindings are
included (in a minor-version independent way) in the version 1 wheels.
For our own internal purposes, those wheels are available as version
'23.9.0a2', which you should hopefully just be able to pip-install:
https://pypi.org/project/htcondor/23.9.0a2/
This particular release is of course unsupported, but it was made so that
we could release version 2's documentation, which you can preview at:
https://htcondor.readthedocs.io/en/main/apis/python-bindings/api/version2/index.html
…-- ToddM
|
I can pip-install and import. WIll build a new container so we can run our code with this. |
I can pip-install and import. WIll build a new container so we can run
our code with this.
Good. Thanks again for working with all this before its official
release.
…-- ToddM
|
docker image now running in oops, latest fixes in master were left out. Try again. Anyhow "it runs" |
try again with |
Hi @Todd-L-Miller Is this a feature ? or an oversight ? |
first real problem.
but
of course something like |
also
? |
and finally scheduler.spool() method is changed, for the good of course, and it takes as argument a submitterResult object, does not need the workaround in https://lists.cs.wisc.edu/archive/htcondor-users/2024-May/msg00047.shtml So I doubt we can have a single code version which works with both bindings. I still have to "test everything" and already see that something has changed in JEL area. |
@Todd-L-Miller
|
The v2 Submit object now handles complete submit files correctly, so its string converter now generates complete submit files. If removing the queue statement proves troublesome, let me know and we can think about additional APIs or about API tweaks.
I'll hold off on updating the documentation until you let me know how painful the change/fix you had to make was. |
Nonetheless, it's not OK; thanks for the heads up. Fixed, but it make take a bit for that fix to be released. |
about the addition of the subdag = str(subdagJDL) # print subdagJDL into a string
subdag = subdag.rstrip('queue\n') # strip "queue" from last line but in a way that's a bit fragile as it relies on that being the last line. I guess I can be more general there, though. Another option is that condor_dagman_submit gets smarter and when passed the That would elegantly move the ball out of mine and your field ! |
thanks @Todd-L-Miller |
That certainly sounds like a bug; I'll ask the DAGMan guy about it. Are you actually shelling out |
yes I "shell" it CRABServer/src/python/TaskWorker/Actions/PreDAG.py Lines 335 to 339 in 3616cf5
|
if there's now a python API I can certainly look into switching |
I talked to the DAG guy. Two main points: first, every condor_submit_dag command-line flag has, in version 2, a corresponding API bit. Second, if you call |
The docstring for |
i hesitate to go that way w/o an example. v1 doc is not fully clear to me, and it does not mention |
So something like
|
If it's not easier for you, don't worry about switching. You've got a solid work-around already. |
And then |
Yeah, exactly. |
in the long term using python API will make this easier to understand for those who come after me. But I need to pick my battles. Thanks |
No problem. And thank you for finding these problems. |
well.. IMHO the |
Oh, agreed, and that ball has successfully been moved into another field. |
Hi @Todd-L-Miller Maybe you can also reproduce it locally by using a different clusterId :-) |
#8660 solved, waiting for deployment of HTC 23.9 |
all schedulers have HTC 23.9 I will enable crab-preprod-tw01 to use v2 bindings, so that all our validation jobs will use it. |
currently 50% of all user tasks run with HTC v2 API no reported issue |
Testing phase is over ! |
w00t! |
following up on mail thread with Todd Miller @ Madison
They are preparing a new version of HTCondor python bindings which is almost but not
fully backward compatible and would like us to get a feedback from us before finalizing.
We need to setup once TW and one scheduler where we replace htcondor python
with the new version as per
https://github.com/htcondor/htcondor/blob/main/INSTALL.md
Details to be figured out. We never did this before !
The text was updated successfully, but these errors were encountered: