Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding automated thread controls for fallback triton servers #41876

Merged
merged 3 commits into from
Jun 6, 2023

Conversation

wpmccormack
Copy link
Contributor

PR description:

For SONIC, the number of instances of a model in a fallback server and the number of threads that fallback server can occupy on local CPU resources can be controlled via the model's triton config file. We set the number of instances to the number of threads specified for the job, so this implementation requires a bit of text editing before the fallback servers can run. The model files live in cvmfs, and the config files cannot be edited there, so they are pulled locally. The precise edits depend on the model's backend; luckily the backend information can be extracted from the config files, so no additional dictionary containing this information is needed.

This PR was tested and able to duplicate results were model config files were adjusted manually.

This PR is also a re-do of #41228. That PR was aiming at the wrong target branch.

PR validation:

This has been tested locally to verify that: 1) the config files get edited correctly, 2) that the fallback servers start properly, and 3) that a miniAOD processing job can run completely. It also reproduces some rigorous timings tests that were previously performed with manual config file editing.

@wpmccormack
Copy link
Contributor Author

@kpedro88 and @makortel , this is a new PR based on #41228. Commits have been squashed here, and I rebased to 13_2_0_pre1 just to make things easier

@cmsbuild
Copy link
Contributor

cmsbuild commented Jun 5, 2023

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-41876/35793

  • This PR adds an extra 16KB to repository

@cmsbuild
Copy link
Contributor

cmsbuild commented Jun 5, 2023

A new Pull Request was created by @wpmccormack (Patrick McCormack) for master.

It involves the following packages:

  • HeterogeneousCore/SonicTriton (heterogeneous)

@cmsbuild, @makortel, @fwyzard can you please review it and eventually sign? Thanks.
@makortel, @missirol, @riga, @kpedro88, @rovere this is something you requested to watch as well.
@perrotta, @dpiparo, @rappoccio you are the release manager for this.

cms-bot commands are listed here

DRYRUN=""
PARENTPID=""
BASEPORT=8000
AUTOPORT=""
NPORTS=3
IMAGE=fastml/triton-torchgeo:22.07-py3-geometric
IMAGE=fastml/triton-torchgeo:21.06-py3-geometric
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please fix the rebase, this is a regression.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

@@ -38,17 +40,18 @@ usage() {
$ECHO "-c \t don't cleanup temporary dir (for debugging)"
$ECHO "-C [dir] \t directory containing Nvidia compatibility drivers (checks CMSSW_BASE by default if available)"
$ECHO "-D \t dry run: print container commands rather than executing them"
$ECHO "-d \t use Docker instead of Apptainer"
$ECHO "-d \t use Docker instead of Singularity"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please fix the rebase, every change from Apptainer back to Singularity is a regression.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, sorry I copied that in from an older repository. Went ahead and edited the new version by hand now

@cmsbuild
Copy link
Contributor

cmsbuild commented Jun 5, 2023

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-41876/35794

  • This PR adds an extra 20KB to repository

@cmsbuild
Copy link
Contributor

cmsbuild commented Jun 5, 2023

Pull request #41876 was updated. @cmsbuild, @makortel, @fwyzard can you please check and sign again.

@makortel
Copy link
Contributor

makortel commented Jun 5, 2023

@cmsbuild, please test

@cmsbuild
Copy link
Contributor

cmsbuild commented Jun 5, 2023

-1

Failed Tests: UnitTests
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-052e5c/32996/summary.html
COMMIT: 827e622
CMSSW: CMSSW_13_2_X_2023-06-05-1100/el8_amd64_gcc11
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/41876/32996/install.sh to create a dev area with all the needed externals and cmssw changes.

Unit Tests

I found errors in the following unit tests:

---> test TestHeterogeneousCoreSonicTritonProducerCPU had ERRORS
---> test DRNTest had ERRORS

Comparison Summary

Summary:

  • You potentially added 31 lines to the logs
  • Reco comparison results: 13 differences found in the comparisons
  • DQMHistoTests: Total files compared: 48
  • DQMHistoTests: Total histograms compared: 3219909
  • DQMHistoTests: Total failures: 9
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3219878
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 47 files compared)
  • Checked 207 log files, 159 edm output root files, 48 DQM output files
  • TriggerResults: no differences found

@kpedro88
Copy link
Contributor

kpedro88 commented Jun 5, 2023

@wpmccormack it looks like you removed the executable bit from cmsTriton (GitHub says 100755 → 100644), so the unit tests fell back to the base release version of the script (which is no longer compatible with TritonService).

@wpmccormack
Copy link
Contributor Author

wpmccormack commented Jun 6, 2023

Yikes, ok I fixed the permissions thing. The permissions now match what they were before I made any edits

@cmsbuild
Copy link
Contributor

cmsbuild commented Jun 6, 2023

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-41876/35799

  • This PR adds an extra 20KB to repository

@cmsbuild
Copy link
Contributor

cmsbuild commented Jun 6, 2023

Pull request #41876 was updated. @cmsbuild, @makortel, @fwyzard can you please check and sign again.

@kpedro88
Copy link
Contributor

kpedro88 commented Jun 6, 2023

please test

@cmsbuild
Copy link
Contributor

cmsbuild commented Jun 6, 2023

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-052e5c/33001/summary.html
COMMIT: ec8d367
CMSSW: CMSSW_13_2_X_2023-06-05-2300/el8_amd64_gcc11
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/41876/33001/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • You potentially removed 11 lines from the logs
  • Reco comparison results: 2 differences found in the comparisons
  • DQMHistoTests: Total files compared: 48
  • DQMHistoTests: Total histograms compared: 3219909
  • DQMHistoTests: Total failures: 7
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3219880
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 47 files compared)
  • Checked 207 log files, 159 edm output root files, 48 DQM output files
  • TriggerResults: no differences found

@makortel
Copy link
Contributor

makortel commented Jun 6, 2023

Workflow 12434.0 shows an occurrence of #41200

@makortel
Copy link
Contributor

makortel commented Jun 6, 2023

+heterogeneous

@cmsbuild
Copy link
Contributor

cmsbuild commented Jun 6, 2023

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @perrotta, @dpiparo, @rappoccio (and backports should be raised in the release meeting by the corresponding L2)

@perrotta
Copy link
Contributor

perrotta commented Jun 6, 2023

+1

@cmsbuild cmsbuild merged commit 57ceb23 into cms-sw:master Jun 6, 2023
@perrotta
Copy link
Contributor

perrotta commented Jun 6, 2023

@wpmccormack @kpedro88 please have a look at the errors which are showing up in the IBs after the merging of this PR, wf 10804.31 and 10805.31:

I0606 11:31:14.136789 21 tritonserver.cc:2176] 
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option                           | Value                                                                                                                                                                                        |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id                        | triton                                                                                                                                                                                       |
| server_version                   | 2.24.0                                                                                                                                                                                       |
| server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data statistics trace |
| model_repository_path[0]         | /tmp/a2096bdc-a802-4016-95e7-0558b7b506da/local_model_repo                                                                                                                                   |
| model_control_mode               | MODE_NONE                                                                                                                                                                                    |
| strict_model_config              | 0                                                                                                                                                                                            |
| rate_limit                       | OFF                                                                                                                                                                                          |
| pinned_memory_pool_byte_size     | 268435456                                                                                                                                                                                    |
| response_cache_byte_size         | 0                                                                                                                                                                                            |
| min_supported_compute_capability | 6.0                                                                                                                                                                                          |
| strict_readiness                 | 1                                                                                                                                                                                            |
| exit_timeout                     | 30                                                                                                                                                                                           |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I0606 11:31:14.147276 21 server.cc:260] Waiting for in-flight requests to complete.
I0606 11:31:14.147308 21 server.cc:276] Timeout 30: Found 0 model versions that have in-flight inferences
I0606 11:31:14.147315 21 server.cc:291] All models are stopped, unloading models
I0606 11:31:14.147321 21 server.cc:298] Timeout 30: Found 0 live models and 0 in-flight non-inference requests
error: creating server: Internal - failed to load all models

%MSG
----- Begin Fatal Exception 06-Jun-2023 13:37:31 CEST-----------------------
An exception of category 'FallbackFailed' occurred while
   [0] Calling beginJob
Exception Message:
TritonService: Starting the fallback server failed with exit code 256
----- End Fatal Exception -------------------------------------------------

@kpedro88
Copy link
Contributor

kpedro88 commented Jun 6, 2023

This error is occurring only in the DRN workflows (10804.31 and 10805.31). The relevant message is:

E0606 11:31:14.136317 21 model_repository_manager.cc:2071] Poll failed for model directory 'photonObjectEnsemble': instance group should not be specified for ensemble 'photonObjectEnsemble'

Investigating the right way to handle this.

@kpedro88
Copy link
Contributor

kpedro88 commented Jun 6, 2023

@wpmccormack here's the documentation: https://github.com/triton-inference-server/server/blob/main/docs/user_guide/model_configuration.md#ensemble-model-instance-groups

Ensemble Model Instance Groups

Ensemble models are an abstraction Triton uses to execute a user-defined pipeline of models. Since there is no physical instance associated with an ensemble model, the instance_group field can not be specified for it.

However, each composing model that makes up an ensemble can specify instance_group in its config file and individually support parallel execution as described above when the ensemble receives multiple requests.

and here's the ensemble model config: https://github.com/cms-data/RecoEgamma-EgammaPhotonProducers/blob/4f3629e77af2cafa0f4a8937f859b45e37ca0cb6/models/photonObjectEnsemble/config.pbtxt

I think we need to do two things:

  1. Detect when platform: "ensemble" (simple)
  2. In that case, instead of putting the instance group in the ensemble model config, find each model_name in the ensemble and put instance groups in those models' configs (a little more complicated)

@wpmccormack
Copy link
Contributor Author

Ahhh ok great, thanks for finding that. I'll work on adjusting the implementation and run a few tests

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants