-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding automated thread controls for fallback triton servers #41876
Conversation
+code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-41876/35793
|
A new Pull Request was created by @wpmccormack (Patrick McCormack) for master. It involves the following packages:
@cmsbuild, @makortel, @fwyzard can you please review it and eventually sign? Thanks. cms-bot commands are listed here |
DRYRUN="" | ||
PARENTPID="" | ||
BASEPORT=8000 | ||
AUTOPORT="" | ||
NPORTS=3 | ||
IMAGE=fastml/triton-torchgeo:22.07-py3-geometric | ||
IMAGE=fastml/triton-torchgeo:21.06-py3-geometric |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please fix the rebase, this is a regression.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
@@ -38,17 +40,18 @@ usage() { | |||
$ECHO "-c \t don't cleanup temporary dir (for debugging)" | |||
$ECHO "-C [dir] \t directory containing Nvidia compatibility drivers (checks CMSSW_BASE by default if available)" | |||
$ECHO "-D \t dry run: print container commands rather than executing them" | |||
$ECHO "-d \t use Docker instead of Apptainer" | |||
$ECHO "-d \t use Docker instead of Singularity" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please fix the rebase, every change from Apptainer back to Singularity is a regression.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, sorry I copied that in from an older repository. Went ahead and edited the new version by hand now
+code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-41876/35794
|
@cmsbuild, please test |
-1 Failed Tests: UnitTests Unit TestsI found errors in the following unit tests: ---> test TestHeterogeneousCoreSonicTritonProducerCPU had ERRORS ---> test DRNTest had ERRORS Comparison SummarySummary:
|
@wpmccormack it looks like you removed the executable bit from |
Yikes, ok I fixed the permissions thing. The permissions now match what they were before I made any edits |
+code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-41876/35799
|
please test |
+1 Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-052e5c/33001/summary.html Comparison SummarySummary:
|
Workflow 12434.0 shows an occurrence of #41200 |
+heterogeneous |
This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @perrotta, @dpiparo, @rappoccio (and backports should be raised in the release meeting by the corresponding L2) |
+1 |
@wpmccormack @kpedro88 please have a look at the errors which are showing up in the IBs after the merging of this PR, wf 10804.31 and 10805.31:
|
This error is occurring only in the DRN workflows (10804.31 and 10805.31). The relevant message is:
Investigating the right way to handle this. |
@wpmccormack here's the documentation: https://github.com/triton-inference-server/server/blob/main/docs/user_guide/model_configuration.md#ensemble-model-instance-groups
and here's the ensemble model config: https://github.com/cms-data/RecoEgamma-EgammaPhotonProducers/blob/4f3629e77af2cafa0f4a8937f859b45e37ca0cb6/models/photonObjectEnsemble/config.pbtxt I think we need to do two things:
|
Ahhh ok great, thanks for finding that. I'll work on adjusting the implementation and run a few tests |
PR description:
For SONIC, the number of instances of a model in a fallback server and the number of threads that fallback server can occupy on local CPU resources can be controlled via the model's triton config file. We set the number of instances to the number of threads specified for the job, so this implementation requires a bit of text editing before the fallback servers can run. The model files live in cvmfs, and the config files cannot be edited there, so they are pulled locally. The precise edits depend on the model's backend; luckily the backend information can be extracted from the config files, so no additional dictionary containing this information is needed.
This PR was tested and able to duplicate results were model config files were adjusted manually.
This PR is also a re-do of #41228. That PR was aiming at the wrong target branch.
PR validation:
This has been tested locally to verify that: 1) the config files get edited correctly, 2) that the fallback servers start properly, and 3) that a miniAOD processing job can run completely. It also reproduces some rigorous timings tests that were previously performed with manual config file editing.