TW fails to catch and report crashes during submission to HTCondor #8420
I manually set the task status of [1] to "submitfailed" in the DB [1]
I am testing adding this to my task
interestingly, in DagmanSubmitter, just before calling
The former comes from
and I have no idea what it is supposed to mean, but since Brian B. put it there in the original version, it must be needed, or at least do no harm. Maybe it is a way to say "no requirements from here", since the schedd will anyhow add requirements based on site, memory, cores... The latter appears to contain an extra space and extra single quotes.
and when I execute the submit line I get
which IMHO suggests that this setting is the culprit. I do not know if there's a way to catch that.
Hmm... not really, the HTCondor doc clearly states what happens if multiple are given. I am also quite puzzled that single quotes made it into the classAd value; the code which handles the extraJDL is here: CRABServer/src/python/TaskWorker/Actions/DagmanSubmitter.py, lines 89 to 105 at fa96438
If I remove the extra spaces around the =
which makes the code a bit more robust. Still, I do not know how to possibly catch the boost error :-( Do we need to resurrect "calling schedd.submit in a forked subprocess"?
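For what it's worth, a minimal sketch of what the forked-subprocess approach could look like, assuming some callable that performs the actual submission (submit_func and the error handling are illustrative, not the real DagmanSubmitter code):

```python
import multiprocessing

def submit_in_subprocess(submit_func, *args):
    """Run a crash-prone submission call in a child process, so that a hard
    crash (e.g. in the boost bindings) kills only the child, not the TW slave."""
    child = multiprocessing.Process(target=submit_func, args=args)
    child.start()
    child.join()
    if child.exitcode != 0:
        # a negative exit code means the child was killed by a signal
        raise RuntimeError(f"submission subprocess exited with code {child.exitcode}")
```

The price is that any result from the submission has to be shipped back to the parent explicitly, e.g. through a multiprocessing.Queue.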
I have no problem with the user's attempt to set Requirements being ignored; the documentation clearly states that
Indeed, even the simple case fails. Conclusion: the spaces around the = are what breaks the submission.
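For illustration, a whitespace-tolerant way of splitting such "key = value" extraJDL strings; split_extra_jdl is a made-up name, not the actual DagmanSubmitter.py code:

```python
def split_extra_jdl(entry):
    """Split an extraJDL entry on the first '=' and strip surrounding spaces,
    so that 'Requirements = (x > 1)' and 'Requirements=(x > 1)' behave the same."""
    key, _, value = entry.partition('=')
    return key.strip(), value.strip()

print(split_extra_jdl('Requirements = (TARGET.Cpus >= 1)'))
# ('Requirements', '(TARGET.Cpus >= 1)')
```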
As to catching the fact that a slave aborts and the task is left in QUEUED, the only info we have at the moment are these lines in
We can possibly monitor for, or otherwise discover, dead workers, but a restart will not fix everything (the QUEUED tasks will be processed again), and checking logs to find which task was being processed and what went wrong will take a lot of human time. One way I can think of now is to change MasterWorker to fork a separate process for each task, up to a maximum number of concurrent ones, like we do in PublisherMaster, instead of the current fixed pool of slaves which get work from a shared queue (a sketch follows below). Another way is to check inside the master loop whether each slave is alive, and record somewhere the current task of each slave. Checking on the health of TW slaves is something we have "needed to do" forever; the urgency comes and goes as the problems which make them die come and go. @novicecpp I will welcome your suggestions!!! Sorry Marco Mambelli if you were spammed.
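To make the first option concrete, here is a minimal sketch of a master loop that forks one process per task up to a concurrency cap, PublisherMaster-style; run_tasks, handle_task and the cap are illustrative placeholders, not existing MasterWorker code:

```python
import time
import multiprocessing

def run_tasks(tasks, handle_task, max_concurrent=5):
    """Fork one short-lived process per task, capped at max_concurrent,
    and report any child that dies with a non-zero exit code."""
    pending = list(tasks)
    running = {}                     # Process -> the task it is handling
    while pending or running:
        # reap finished or crashed children; we know which task each one had
        for proc, task in list(running.items()):
            if not proc.is_alive():
                proc.join()
                if proc.exitcode != 0:
                    print(f"process for task {task} died with exit code {proc.exitcode}")
                del running[proc]
        # top up to the concurrency cap
        while pending and len(running) < max_concurrent:
            task = pending.pop(0)
            proc = multiprocessing.Process(target=handle_task, args=(task,))
            proc.start()
            running[proc] = task
        time.sleep(1)
```

Because the parent keeps the Process-to-task mapping, it also knows which task a dead child was working on.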
I had some experience with this. Long story short: I think that in
Example that provides a longer explanation:

```python
import multiprocessing
import time
import os
import signal

def worker(x, qout, pids):
    # report our pid, pretend to work for a while, then return x*x via the queue
    print(f"pid={os.getpid()}; input={x}")
    pids.put(os.getpid())
    time.sleep(10)
    qout.put(x*x)

def main():
    inputs = list(range(1, 6))
    qin = multiprocessing.Queue()
    for i in inputs:
        qin.put(i)
    qout = multiprocessing.Queue()
    pids = multiprocessing.Queue()
    pool = []
    for _ in range(qin.qsize()):
        x = qin.get()
        p = multiprocessing.Process(target=worker, args=(x, qout, pids))
        p.start()
        pool.append(p)
    print("started the processes")
    # kill one of the workers to simulate a crashed TW slave
    pkill = pids.get()
    print("about to kill: ", pkill)
    os.kill(pkill, signal.SIGKILL)
    print("check if a process died")
    for p in pool:
        # status before and after a join with a short timeout
        print(f" died(0)? pid={p.pid}; alive={p.is_alive()}; exitcode={p.exitcode}")
        p.join(timeout=0.1)
        print(f" died(1)? pid={p.pid}; alive={p.is_alive()}; exitcode={p.exitcode}")
    print("wait until all processes finish")
    for p in pool:
        p.join()
        print(f" finished? pid={p.pid}; alive={p.is_alive()}; exitcode={p.exitcode}")
    print("processes finished. results:")
    for _ in range(qout.qsize()):
        print(" ", qout.get())

if __name__ == "__main__":
    main()
```

output:
Notice the lines:
this means that before the join, the value of exitcode is not yet updated; see [1].

[1] https://docs.python.org/3/library/multiprocessing.html#multiprocessing.Process.join
I think that
I added a 1 sec sleep and it is OK, if I am not mistaken. All in all, I use
Sorry if I was not clear: I want feedback on a plan, not to converge on all implementation details right away. I know how to start processes; I am not sure how exactly to change the Worker.py code, which we have not touched "since forever", without too much risk. And I do not know a good way to tell which task was being worked on when a process crashed.
If the plan is "let's invest time to make sure that we detect dead workers", then I am all in. If you need help with a plan for "a good way to tell which task was being worked on when a process crashed" or for "how exactly to change the Worker.py code which we did not touch since forever without too much risk", then sorry, I am not sure how to do that without giving feedback that will be rejected as "implementation details".
In other words: it will be a bit of work and require a lot of testing. If we decide to do it, let's open a new issue and discuss details there. Adding detection of dead workers, e.g. in CRABServer/src/python/TaskWorker/Worker.py, line 246 at 40a796f, via an is_alive check should be easy and, as long as we simply record a message, safe!
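Something along these lines, where slaves and logger stand for whatever the master loop already has at hand (a sketch, not a patch against the actual Worker.py):

```python
def check_slaves(slaves, logger):
    """Log a message for any slave process that is no longer alive.
    Purely observational: it does not restart anything, so it is safe."""
    for slave in slaves:
        if not slave.is_alive():
            logger.error("TW slave pid=%s died with exit code %s", slave.pid, slave.exitcode)
```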
What about handling the task that causes the child process to crash?
I support this idea; it also solves #8350. We can divide it into 2 steps: the first is to wrap the work() in a child process. One concern is performance: it will fork a grandchild for every task, which costs a lot of CPU overhead.
It's also true that TW actions take at least a few seconds, so spawning a new process for every task may not be that much more CPU usage in the grand scheme of things.
We discussed this in the meeting and agreed on Wa's suggestion: we can go straight for forking one child to handle each task and take care of the timeout at the same time. Will open an ad-hoc issue.
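For reference, a minimal sketch of that agreed direction, assuming a work(task) callable; names and the timeout value are illustrative:

```python
import multiprocessing

def run_with_timeout(work, task, timeout=3600):
    """Run work(task) in its own child process: a crash only kills the child,
    and a hung action is terminated once the timeout expires."""
    child = multiprocessing.Process(target=work, args=(task,))
    child.start()
    child.join(timeout)
    if child.is_alive():
        child.terminate()          # SIGTERM the hung child
        child.join()
        return f"task {task} timed out after {timeout}s"
    if child.exitcode != 0:
        return f"task {task} crashed with exit code {child.exitcode}"
    return f"task {task} finished"
```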
I am closing this on:
see e.g.
https://s3.cern.ch/crabcache_prod/piperov/240517_215114%3Apiperov_crab_SONIC_MiniAOD_CRAB_testPurdue807_1x4900-1000j/twlog?AWSAccessKeyId=e57ff634b5334df9819ae3f956de5ca6&Signature=xK9VdrPWqew0wCVP2IQ3l5wPkGM%3D&Expires=1718830467
which ends with
the slave died at that time
and the task was left in QUEUED forever
https://cmsweb.cern.ch/crabserver/ui/task/240517_215114%3Apiperov_crab_SONIC_MiniAOD_CRAB_testPurdue807_1x4900-1000j
A task stuck in QUEUED is bad, and a crashed slave is worse!
The submission failure was due to a bad extraJDL argument on the user's side
in the Task Info tab of the UI
in the user config (as privately reported by @kpedro88):