
test new HTCondor python bindings #8272

Closed
belforte opened this issue Mar 6, 2024 · 72 comments

@belforte
Member

belforte commented Mar 6, 2024

Following up on a mail thread with Todd Miller @ Madison.
They are preparing a new version of the HTCondor python bindings which is almost, but not
fully, backward compatible, and they would like feedback from us before finalizing.

We need to set up one TW and one scheduler where we replace the htcondor python bindings
with the new version as per
https://github.com/htcondor/htcondor/blob/main/INSTALL.md
Details to be figured out. We never did this before!


-------- Forwarded Message --------
Subject: Re: testing CRAB server
Date: Wed, 6 Mar 2024 12:46:14 -0600 (CST)
From: Todd L Miller <tlmiller@cs.wisc.edu>
To: Stefano Belforte <stefano.belforte@cern.ch>
CC: Thanayut Seethongchuen <thanayut.seethongchuen@cern.ch>, Todd Tannenbaum <tannenba@cs.wisc.edu>, Dario Mapelli <dario.mapelli@cern.ch>

> Todd, send us the instructions and we'll give it a try.

https://github.com/htcondor/htcondor/blob/main/INSTALL.md

	For now, just build the 'main' branch; the new modules should show up in the python3 package.

> I really appreciate your willingness to do the work for us,
> but having us do it is overall the most efficient way.

	Thanks for your help. :)

> Chances are that anything we find can be reproduced outside
> the CRAB server.

	Probably.

> I guess it is mostly a matter of which exact calls we make with which 
> arguments.

	Exactly.  Almost all of the functions are still there, with
almost all of the same arguments.

> If we come to the point that you need to login, we will dig instructions 
> to resurrect your account. I presume that Center for High Throughput 
> Computing - University of Wisconsin which is now a CMS member 
> institution will need to add you as member. Members currently are Todd 
> Tannenbaum (as the boss) and James Frey.

	I didn't know CHTC was a CMS member institution.  That'll probably make things a lot easier. :)

-- ToddM
@belforte belforte self-assigned this Mar 6, 2024
@belforte
Member Author

belforte commented Mar 6, 2024

@Todd-L-Miller FYI (just a check that I got your correct GH handle)

@Todd-L-Miller

Yup, that's me. (Checking ability to comment...)

@novicecpp
Contributor

novicecpp commented Mar 8, 2024

Just to report what I found today.

I tried to build and run CRAB with the new htcondor binding (htcondor2) for fun.
To build htcondor:

  • I needed to build htcondor from source inside htcondor/nmi-build:x86_64_Debian11-23000100 because we use the official python:3.8 image as our base image, which uses Debian 11 as its base image.
  • Because of python3.8, I needed to compile and install python3.8 manually in the htcondor build image,
    plus manually build the libboost-python needed by the htcondor build system. Otherwise, I would get the htcondor2 library built for python3.9.
  • I failed to set up the build environment inside my own TaskWorker image (built on the python:3.8 base image).

But, eventually, I could build it without errors. Please see the Dockerfile for the details.
The steps in the Dockerfile were guesswork plus trial and error until it worked. I have zero knowledge of cmake, boost, and how dynamic linking works.


Next, I added the whole release_dir to my TaskWorker image and just exported PYTHONPATH to $(realpath release_dir)/lib/python3.
And, it magically works (I could do import htcondor2 as htcondor with python3.8 inside my image).

Still, I cannot submit jobs to the Schedd yet because of API mismatches:

  1. Not accepting bytes

In HTCondorLocator.py#L184

     htcondor.param['COLLECTOR_HOST'] = self.getCollector().encode('ascii', 'ignore')

It raises a TypeError, but removing .encode('ascii', 'ignore') fixes the issue.

Another place is classad.quote in HTCondorLocator.py#L186 (line 18 in HTCondorUtils.py makes HTCondorUtils.quote = classad.quote).

schedds = coll.query(htcondor.AdTypes.Schedd, 'Name=?=%s' % HTCondorUtils.quote(schedd.encode('ascii', 'ignore')),
                     ["AddressV1", "CondorPlatform", "CondorVersion", "Machine", "MyAddress", "Name", "MyType",
                      "ScheddIpAddr", "RemoteCondorSetup"])

Again, removing .encode('ascii', 'ignore') fixes the issue (see the sketch after this list).

  2. xquery() method of the Schedd object.

At DagmanSubmitter.py#L330, we use xquery() to get... I do not know what, sorry.
But it looks like replacing it with query() works; I cross-checked with another server, and both of them returned an empty list.
The object from htcondor2 is <htcondor2._schedd.Schedd object at 0x7f64760d5e20>.

  3. No htcondor.SecMan() or similarly named class in htcondor2.

Here at HTCondorUtils.py#L62. I am stuck at this point. I do not know what htcondor.SecMan() does or which API can replace it.
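
For the record, a minimal sketch of how points 1 and 2 above could look with the v2 modules; collector_host and schedd_name are just illustrative placeholders, not the real CRAB code:

import htcondor2 as htcondor
import classad2 as classad

collector_host = "collector.example.org"   # hypothetical host
schedd_name = "schedd.example.org"         # hypothetical schedd name

# htcondor2 wants plain str values, so no .encode('ascii', 'ignore')
htcondor.param['COLLECTOR_HOST'] = collector_host

coll = htcondor.Collector()
# a projection list with query() replaces the old xquery()-style calls
schedds = coll.query(htcondor.AdTypes.Schedd,
                     constraint='Name =?= %s' % classad.quote(schedd_name),
                     projection=["Name", "MyAddress", "CondorVersion"])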

That is all for today.

@Todd-L-Miller

Todd-L-Miller commented Mar 9, 2024 via email

@belforte
Member Author

Amazing work Wa, thanks !

I can't comment/advise on building. I hope that we will not have to worry about things like python 3.8 vs. python 3.9; atm we use python 3.8 in the TW and python 3.6 in the schedulers, with the same HTC major version (10), and everything is OK.

Todd, let me try to make 3 comments:

  1. encode('ascii', 'ignore')
    Let's keep in mind that our source code is very old here. It reflects the very first version of the python bindings from Brian B., and it was originally written in python2. So I have no idea why `encode('ascii', 'ignore')` was used. I suspect that we could take it away in the current code too and it would still work.

  2. xquery vs. query
    again, from BB's first implementation; IIRC the point at the time was to be kinder to the collector, by doing the equivalent of condor_q -af p1 p2 ... instead of condor_q -l | egrep ... . We never had a worry about memory on our side. IIUC we can already change to query(). Happy to do so!

  3. SecMan and authenticated subprocess
    the extra complexity is a legacy of GSI authentication. The first implementation of CRAB used a service proxy to talk with other central services and to perform "global queries to HTCondor", while inside the subprocesses which submit user jobs to schedds (which run on different VMs) we had to switch to the specific user proxy every time. But now all interactions with HTCondor are done with a single, common token. We kept changes to a minimum at the time, but I am optimistic that all of this code can simply be removed and that we can call the condor bindings directly in TW code w/o going through an (authenticated) subprocess.
    Yet I am puzzled that SecMan is listed in the latest bindings doc as the only cure when a daemon restarts and all future commands fail with strange connection errors. Maybe we simply need to be sure that all collector and schedd objects we create live in the temporary scope of some object and do not persist in memory for days? (A minimal scoping sketch follows this list.)
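
To make point 3 concrete, a minimal sketch under the assumption that short-lived objects are enough (function and argument names are made up):

import htcondor2 as htcondor

def list_schedd_jobs(schedd_location_ad):
    # create the Schedd object inside the function so it never outlives the call;
    # assumption: short-lived Collector/Schedd objects avoid the stale-connection
    # problem for which v1 suggested SecMan
    schedd = htcondor.Schedd(schedd_location_ad)
    return schedd.query(constraint='true',
                        projection=["ClusterId", "ProcId", "JobStatus"])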

Bottom line: we can surely profit anyhow from aligning the current code to the current bindings and cleaning up the GSI legacy stuff. Then chances are that htcondor2 and classad2 work smoothly. In the end we do not do anything more than what is in the examples in the documentation! The only non-trivial
thing is that the schedd we submit to runs on a different VM. But I presume that is part of the standard tests for the bindings.
Wa did not get to test classad2 here, but the use we make of it should be very basic.

I am not sure about the next step(s) though. We can remove 1. 2. 3. above and try again, but for a full test it would be way more convenient to have a build from HTCondor to use, since we need to deploy it both in the container we use for the TW (what Wa did) and in the scheduler hosts where we currently get condor via yum.
In the schedulers we have a bunch of classad reads/writes (hopefully transparent) and a couple of more exotic use cases with JELs and htcondor.lock(). I am just a tiny bit worried by the latter since I do not find it in the documentation even though it is working in HTC 23. BTW I just discovered that

/usr/lib64/python3.6/site-packages/htcondor/_deprecation.py:41: FutureWarning: lock() is deprecated since v8.9.8 and will be removed in a future release.

To see that warning one needs to call it in an interactive shell. It is not in our logs.

Working w/o locking condor log files is a different story though. And it may simply mean that eventually we really need to do the work outlined in #6270
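
For reference, the basic JEL reading pattern looks roughly like this (v1 API; the log path is hypothetical). Whether we can safely drop the explicit lock() around it is exactly the open question above:

import htcondor

jel = htcondor.JobEventLog("job_log.txt")   # hypothetical path to a job event log

# stop_after=0 returns the events already in the file without blocking
for event in jel.events(stop_after=0):
    print(event.cluster, event.proc, event.type)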

@Todd-L-Miller

Todd-L-Miller commented Mar 12, 2024 via email

@belforte
Member Author

Hi @Todd-L-Miller
I wanted to reassure you that this is not forgotten, but we had to give priority to the Alma9 transition and @novicecpp (aka Wa) is busy with a new CI/CD setup for CRAB, so in spite of some progress in modernizing our usage of the python bindings I can't test with htcondor2 yet. I need his magic build.

BTW I tried to use htcondor2 in HTC 23, but it fails to be imported due to the lack of classad2.
Here it is on a schedd with htcondor 23.0:

belforte@crab-sched-903/~> condor_version
$CondorVersion: 23.0.4 2024-02-08 BuildID: 712251 PackageID: 23.0.4-1 $
$CondorPlatform: x86_64_AlmaLinux9 $
belforte@crab-sched-903/~> python3
Python 3.9.18 (main, Jan 24 2024, 00:00:00) 
[GCC 11.4.1 20231218 (Red Hat 11.4.1-3)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import htcondor
>>> htcondor.version()
'$CondorVersion: 23.7.2 2024-05-16 BuildID: UW_Python_Wheel_Build $'
>>> import htcondor2
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/afs/cern.ch/user/b/belforte/.local/lib/python3.9/site-packages/htcondor2/__init__.py", line 30, in <module>
    from ._common_imports import classad
  File "/afs/cern.ch/user/b/belforte/.local/lib/python3.9/site-packages/htcondor2/_common_imports.py", line 7, in <module>
    import classad2 as classad
  File "/afs/cern.ch/user/b/belforte/.local/lib/python3.9/site-packages/classad2/__init__.py", line 26, in <module>
    from .classad2_impl import _version as version
ModuleNotFoundError: No module named 'classad2.classad2_impl'
>>> 

Same error from import classad2 (of course), while import classad works fine (of course).

Could this be a shortcoming of our installation ?
In that case it is out of our turf and we need to ask Marco Mascheroni and his team to follow up.

On the submission machine we simply pip install htcondor 23:

root@crab-dev-tw01:/data/srv/TaskManager# pip install htcondor
Requirement already satisfied: htcondor in /usr/local/lib/python3.8/site-packages (23.7.2)

and have the same problem with classad2.

@Todd-L-Miller

Todd-L-Miller commented May 30, 2024 via email

@belforte
Member Author

indeed, here's the container where we pip-installed condor

root@crab-dev-tw01:/data/srv/TaskManager# ls /usr/local/lib/python3.8/site-packages/htcondor/
__init__.py  _deprecation.py  _lock.py	_wrap.py     dags     htcondor.so
__pycache__  _job_status.py   _utils	compat_enum  htchirp  personal.py
root@crab-dev-tw01:/data/srv/TaskManager# 

But in the scheduler host we install from rpm (via puppet) from htcondor-stable-23.0

[root@crab-sched-903 ~]# yum list|grep condor
condor.x86_64                                                                            23.0.4-1.el9                                   @htcondor-stable-23.0    
condor-stash-plugin.x86_64                                                               6.12.1-1                                       @htcondor-stable-23.0    
python3-condor.x86_64                                                                    23.0.4-1.el9                                   @htcondor-stable-23.0    
globus-gram-job-manager-condor.noarch                                                    3.0-10.el9                                     epel                     
[root@crab-sched-903 ~]# 

in this case I have these binaries

[root@crab-sched-903 ~]# ls /usr/lib64/python3.9/site-packages/htcondor*/*.so
/usr/lib64/python3.9/site-packages/htcondor/htcondor.cpython-39-x86_64-linux-gnu.so
[root@crab-sched-903 ~]# ls /usr/lib64/python3.9/site-packages/classad*/*.so
/usr/lib64/python3.9/site-packages/classad/classad.cpython-39-x86_64-linux-gnu.so
[root@crab-sched-903 ~]# 

Thanks for pointing out that somehow I have managed to get $HOME/.local/lib/python3.9/site-packages/htcondor2, which I am not sure how it got there!
In my .local I also have classad2 and classad3. All of them w/o .so files.
Anyhow, my .local confusion aside (it is a sort of scratch area/playground at the end of the day)
I am puzzled that we do not have the binaries in the rpm install.

Maybe it is as simple as installing the rpm for the HTC 23 feature branch (e.g. 23.7) instead of the stable 23.0?

@belforte
Member Author

I switched the scheduler host to 23.7.2 and now import of htcondor2 and classad2 work !

The submitter process runs in a debian-based docker container where we currently pip-install.
I will try to use apt-get instead

@belforte
Member Author

Hi @Todd-L-Miller
I played a bit with HTC installs, and things are difficult to do cleanly since we want to test htcondor2 in our container based on python3.8. Longish story short, do you think it is OK if I simply grab

/usr/lib/python3/dist-packages/htcondor2/htcondor2_impl.cpython-39-x86_64-linux-gnu.so
/usr/lib/python3/dist-packages/classad2/classad2_impl.cpython-39-x86_64-linux-gnu.so
from a debian install and put them in our `/usr/local/lib/python3.8/....` where we pip-installed the bindings w/o the wheels ?
Yeah... also change `cpython-39` to `cpython-38` in the name.

I tried and at least `import htcondor2` works. But maybe once called they will look for other packages and this is never going to work; at least, if there's a problem, we may not be able to tell whether it is on your side or ours w/o spending too much time.

@belforte
Member Author

belforte commented May 31, 2024

for me and @novicecpp : the changes I did to get a TW container with HTC 23.7 and htcondor2 binaries are in https://github.com/belforte/CRABServer/tree/htc23.x and the container image is registry.cern.ch/cmscrab/crabtaskworker:belforte-HTC23x

@Todd-L-Miller

Todd-L-Miller commented Jun 3, 2024 via email

@belforte
Member Author

belforte commented Jun 3, 2024

I can pip-install and import. Will build a new container so we can run our code with this.

@Todd-L-Miller

Todd-L-Miller commented Jun 3, 2024 via email

@belforte
Member Author

belforte commented Jun 3, 2024

The docker image registry.cern.ch/cmscrab/crabtaskworker:belforte-HTC23_0603 now has HTC 23.9.0a2

and crab TW from master

now running in crab-dev-tw01
first validation:

oops, latest fixes in master were left out. Try again. Anyhow "it runs"

@belforte
Member Author

belforte commented Jun 3, 2024

@belforte
Member Author

Hi @Todd-L-Miller
Now that I have a code version which does not use deprecated methods, I can try v2 bindings.
First discovery: v1 works even w/o a config file; it simply prints a warning as it reverts to
using a null condor_config.
v2 instead fails with the rather obscure
ERROR "Unwilling or unable to try IPv4 or IPv6. Check the settings ENABLE_IPV4, ENABLE_IPV6, and NETWORK_INTERFACE." at line 1202 in file /var/lib/condor/execute/slot1/dir_2361636/htcondor_source/src/condor_io/sock.cpp
So the condor_config file is mandatory, even if pointing $CONDOR_CONFIG to an empty file seems to work.
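
In case it is useful, this is roughly what the empty-file workaround looks like (the temp-file path is obviously just illustrative):

import os
import tempfile

# v2 refuses to start without a config file, but an empty one seems to be enough
fd, empty_config = tempfile.mkstemp(prefix="condor_config_")
os.close(fd)
os.environ["CONDOR_CONFIG"] = empty_config

import htcondor2 as htcondor   # import only after CONDOR_CONFIG is set
print(htcondor.version())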

Is this a feature ? or an oversight ?

@belforte
Member Author

belforte commented Jun 12, 2024

First real problem: how do I put an integer value into the Submit object?
This works in v1:

bash-5.1$ python3
Python 3.9.18 (main, Jan 24 2024, 00:00:00) 
[GCC 11.4.1 20231218 (Red Hat 11.4.1-3)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import htcondor
>>> jdl=htcondor.Submit()
>>> jdl['somead']=1
>>> print(jdl)
somead = 1

but

bash-5.1$ python3
Python 3.9.18 (main, Jan 24 2024, 00:00:00) 
[GCC 11.4.1 20231218 (Red Hat 11.4.1-3)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import htcondor2 as htcondor
>>> jdl=htcondor.Submit()
>>> jdl['somead']=1
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/grid/crabtw/.local/lib/python3.9/site-packages/htcondor2/_submit.py", line 123, in __setitem__
    raise TypeError("value must be a string")
TypeError: value must be a string
>>> 

Of course something like jdl['priority'] = "10" is accepted. Will it work? With v1 too? I had to change that one from a string to an int the other day since submission was complaining.
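
If stringifying everything is the answer, a tiny shim like this (the helper name is invented) would presumably keep things working with both v1 and v2:

import htcondor2 as htcondor

def set_submit_value(jdl, key, value):
    # v2 Submit only accepts str values; str() should be harmless with v1 too
    jdl[key] = str(value)

jdl = htcondor.Submit()
set_submit_value(jdl, 'priority', 10)
print(jdl)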

@belforte
Member Author

Also htcondor.HTCondorException is gone. What should I use instead to notice that a call failed?
E.g. our favorite "submission failed because the scheduler is busy, unreachable, unhappy, overloaded..."?
Fall back to

try:
    result = schedd.submit(jdl, 1, True)
except Exception:
    log("something went wrong")

?
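
or maybe a small compatibility shim, assuming nothing better shows up (CondorError is an invented name; schedd and jdl come from the surrounding code):

import logging
import htcondor2 as htcondor

# fall back to Exception if the binding does not define HTCondorException
CondorError = getattr(htcondor, "HTCondorException", Exception)

try:
    result = schedd.submit(jdl, count=1, spool=True)   # assumes the v1-like signature
except CondorError:
    logging.exception("submission to the schedd failed")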

@belforte
Member Author

And finally the schedd's spool() method has changed, for the better of course: it takes a SubmitResult object as argument and does not need the workaround in https://lists.cs.wisc.edu/archive/htcondor-users/2024-May/msg00047.shtml

So I doubt we can have a single code version which works with both bindings.
Will need some kind of switch flag.
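
The switch could be as simple as an import-time toggle; the environment variable name here is invented:

import os

# hypothetical toggle; in CRAB this would come from our own configuration
if os.getenv("CRAB_USE_HTCONDOR_V2", "0") == "1":
    import htcondor2 as htcondor
    import classad2 as classad
else:
    import htcondor
    import classad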

I still have to "test everything" and already see that something has changed in the JEL area.
But I managed to do the remote submission of a CRAB DAG and it got executed.
🎉

@belforte
Member Author

@Todd-L-Miller
Is this bug on your side, by any chance? Or is the way to use JELs quite different? I haven't looked at the new documentation yet.

 File "/data/srv/glidecondor/condor_local/spool/9550/0/cluster9550.proc0.subproc0/task_process/cache_status.py", line 348, in storeNodesInfoInFile
    pickle.dump(jel, f)
  File "/home/grid/crabtw/.local/lib/python3.9/site-packages/htcondor2/_job_event_log.py", line 101, in __getstate__
    offset = _job_event_log_get_offset(self, self._handle)
NameError: name '_job_event_log_get_offset' is not defined

@Todd-L-Miller

> I presume that there are good reasons for the change,

The v2 Submit object now handles complete submit files correctly, so its string converter now generates complete submit files. If removing the queue statement proves troublesome, let me know and we can think about additional APIs or about API tweaks.

> so will not question it, but this breaks our code now (I end up with a dagman submission file with two queue statements in it). I guess that I will find a fix, but I suggest that you point out this change in the documentation.

I'll hold off on updating the documentation until you let me know how painful the change/fix you had to make was.

@Todd-L-Miller

> there is also an error when looking at an empty Submit object. Nothing fatal of course, but a bit annoying

Nonetheless, it's not OK; thanks for the heads up. Fixed, but it may take a bit for that fix to be released.

@belforte
Member Author

belforte commented Aug 9, 2024

About the addition of the queue statement when streaming the Submit object to a string: it was easy to fix

subdag = str(subdagJDL)  # print subdagJDL into a string
subdag = subdag.rstrip('queue\n')  # strip "queue" from last line

but in a way that's a bit fragile as it relies on that being the last line. I guess I can be more general there, though.
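
Something slightly more defensive, still assuming the queue statement is the last non-blank line of the v2 string form (variable names as in the snippet above, content purely illustrative):

import htcondor2 as htcondor

subdagJDL = htcondor.Submit()           # stand-in for the real subdag Submit object
subdagJDL['executable'] = '/bin/true'   # illustrative content only

lines = str(subdagJDL).splitlines()
while lines and not lines[-1].strip():  # drop trailing blank lines
    lines.pop()
if lines and lines[-1].strip().lower().startswith('queue'):
    lines.pop()                         # drop the final queue statement
subdag = '\n'.join(lines) + '\n'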

Another option is that condor_submit_dag gets smarter and, when passed the -insert_sub_file option, "properly handles a proper Submit file which may include a queue statement".

That would elegantly move the ball out of my field and yours!

@belforte
Member Author

belforte commented Aug 9, 2024

thanks @Todd-L-Miller

@Todd-L-Miller

> Another option is that condor_submit_dag gets smarter and, when passed the -insert_sub_file option, "properly handles a proper Submit file which may include a queue statement"

That certainly sounds like a bug; I'll ask the DAGMan guy about it. Are you actually shelling out to condor_submit_dag? If so, that may imply something missing in the Python API, too...

@belforte
Member Author

belforte commented Aug 9, 2024

yes I "shell" it

subprocess.check_call(['condor_submit_dag', '-DoRecov', '-AutoRescue', '0', '-MaxPre', '20',
                       '-MaxIdle', str(maxidle), '-MaxPost', str(maxpost),
                       '-insert_sub_file', 'subdag.jdl',
                       '-append', '+Environment = strcat(Environment," _CONDOR_DAGMAN_LOG={0}/{1}.dagman.out")'.format(os.getcwd(), subdag),
                       '-append', '+TaskType = "{0}"'.format(stage.upper()), subdag])

@belforte
Member Author

belforte commented Aug 9, 2024

if there's now a python API I can certainly look into switching

@Todd-L-Miller

I talked to the DAG guy. Two main points: first, every condor_submit_dag command-line flag has, in version 2, a corresponding API bit. Second, if you call Submit.from_dag(), you get a Submit object back... which you could then manipulate in exactly the same way you manipulate subdagJDL before you turn around and submit it, which would avoid this problem (and a fork()) entirely.

@Todd-L-Miller

Todd-L-Miller commented Aug 9, 2024

The docstring for Submit.from_dag() in v2 is wrong; the Submit object is generated from the DAG description on disk.

@belforte
Member Author

belforte commented Aug 9, 2024

I hesitate to go that way w/o an example. The v1 doc is not fully clear to me, and it does not mention insert_sub_file. I guess it can be replaced by joining two Submit objects. Yeah... time to outgrow what Brian wrote 7 years ago, I know. But... the CLI should also work, right? Those forks are executed at a << 1/hour rate. I do not care about performance here.

@Todd-L-Miller

So something like

s = Submit.from_dag( "subdag", {
    "DoRecov": true,
    "AutoRescue": 0,
    "MaxIdle": maxidle,
   # etc
}
s["subdag_jdl_property_one"] = str(value_one)

@Todd-L-Miller

> But... the CLI should also work, right? Those forks are executed at a << 1/hour rate. I do not care about performance here.

If it's not easier for you, don't worry about switching. You've got a solid work-around already.

@belforte
Member Author

belforte commented Aug 9, 2024

> So something like

And then result=schedd.submit(s) ?

@Todd-L-Miller

> So something like
>
> And then result=schedd.submit(s) ?

Yeah, exactly.
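
I.e., all together, something like this (untested sketch; option names as in the example above, values purely illustrative):

import htcondor2 as htcondor

s = htcondor.Submit.from_dag("subdag", {"DoRecov": True, "AutoRescue": 0, "MaxIdle": 1000})

schedd = htcondor.Schedd()
result = schedd.submit(s)
print(result.cluster())   # assuming SubmitResult exposes the cluster id as in v1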

@belforte
Member Author

belforte commented Aug 9, 2024

In the long term, using the python API will make this easier to understand for those who come after me. But I need to pick my battles. Thanks!

@Todd-L-Miller

No problem. And thank you for finding these problems.

@belforte
Member Author

belforte commented Aug 9, 2024

Well... IMHO the condor_submit_dag CLI could still benefit from a fix here, even if I am protected now.

@Todd-L-Miller

Oh, agreed, and that ball has successfully been moved into another field.

@belforte
Member Author

Hi @Todd-L-Miller
I have a problem with schedd.edit in v2 when trying to set a classad to a list of values.
It looks like something on your end: #8604

Maybe you can also reproduce it locally by using a different clusterId :-)

@belforte
Member Author

on HOLD waiting for #8572 which is waiting on #8660

@belforte
Member Author

belforte commented Sep 1, 2024

#8660 solved, waiting for deployment of HTC 23.9

@belforte
Member Author

belforte commented Oct 1, 2024

All schedulers have HTC 23.9. I will enable crab-preprod-tw01 to use the v2 bindings, so that all our validation jobs will use them.

@belforte
Member Author

Currently 50% of all user tasks run with the HTC v2 API:
https://cms-talk.web.cern.ch/t/start-testing-htcv2-api-in-production/76456/3

No reported issues.
Will leave it like that and go to 100% in January.

@belforte
Member Author

Testing phase is over !

@Todd-L-Miller

> Testing phase is over !

w00t!
