Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

initial pass at controlling model instance and threads for fallback triton servers #41228

Conversation

wpmccormack
Copy link
Contributor

PR description:

For SONIC, the number of instances of a model in a fallback server and the number of threads that fallback server can occupy on local CPU resources can be controlled via the model's triton config file. We set the number of instances to the number of threads specified for the job, so this implementation requires a bit of text editing before the fallback servers can run. The model files live in cvmfs, and the config files cannot be edited there, so they are pulled locally. The precise edits depend on the model's backend; luckily the backend information can be extracted from the config files, so no additional dictionary containing this information is needed.

PR validation:

This has been tested locally to verify that: 1) the config files get edited correctly, 2) that the fallback servers start properly, and 3) that a miniAOD processing job can run completely.

@cmsbuild
Copy link
Contributor

A new Pull Request was created by @wpmccormack (Patrick McCormack) for CMSSW_12_6_X.

It involves the following packages:

  • HeterogeneousCore/SonicTriton (heterogeneous)

@cmsbuild, @makortel, @fwyzard can you please review it and eventually sign? Thanks.
@makortel, @missirol, @riga, @kpedro88, @rovere this is something you requested to watch as well.
@perrotta, @dpiparo, @rappoccio you are the release manager for this.

cms-bot commands are listed here

Copy link
Contributor

@makortel makortel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the "copy model locally" expected to be a temporary solution, or the long-term solution?

Comment on lines 6 to 7
#include "FWCore/ServiceRegistry/interface/SystemBounds.h"
#include "FWCore/ServiceRegistry/interface/ServiceMaker.h"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ServiceMaker.h include should be unnecessary, and for SystemBounds.h it would be better to forward-declare edm::service::SystemBounds below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed and forward declared

@@ -218,6 +218,7 @@ start_singularity(){
for REPO in ${REPOS[@]}; do
MOUNTARGS="$MOUNTARGS -B $REPO"
REPOARGS="$REPOARGS --model-repository=${REPO}"
ls ${REPO}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this line a leftover?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes sorry, it's been removed now

@@ -1,4 +1,5 @@
#include "FWCore/ServiceRegistry/interface/ServiceMaker.h"
#include "FWCore/ServiceRegistry/interface/SystemBounds.h"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This #include should not be needed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed

@@ -6,6 +6,8 @@
#include "FWCore/ParameterSet/interface/ConfigurationDescriptions.h"
#include "FWCore/ParameterSet/interface/ParameterSetDescription.h"
#include "FWCore/ServiceRegistry/interface/ActivityRegistry.h"
#include "FWCore/ServiceRegistry/interface/SystemBounds.h"
#include "FWCore/ServiceRegistry/interface/ServiceMaker.h"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ServiceMaker.h include should not be needed here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed

@@ -128,12 +131,14 @@ class TritonService {
unsigned currentModuleId_;
bool allowAddModel_;
bool startedFallback_;
unsigned int numInstancesForFallbacks_;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This member appears to be unused.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry it's been removed

out << "parameters { key: \"TF_NUM_INTRA_THREADS\" value: { string_value: \"1\" } }" << std::endl;
out << "parameters { key: \"TF_NUM_INTER_THREADS\" value: { string_value: \"1\" } }" << std::endl;
out << "parameters { key: \"TF_USE_PER_SESSION_THREADS\" value: { string_value: \"1\" } }" << std::endl;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if neither isONNX or isTF is true?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this is another subtlety with triton server configs: they are model-type dependent. Thread controls are only implemented right now for ONNX and TF models, so if neither of those options are true, then nothing else should be included in the config

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should at least emit a warning in this case. something like: "Warning: thread (instance) control not implemented for $PLATFORM" (following one of my suggestions below)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added this, thanks!

bool isONNX = false;
bool isTF = false;

size_t offset;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like the offset is not really needed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has all been deleted

bool isTF = false;

size_t offset;
std::string line;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please move this declaration inside the if block as it is needed only there.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, all deleted and moved to cmsTriton script


size_t offset;
std::string line;
std::ifstream Myfile;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Variable names should start with lower case character.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

deleted

}

//append instance and thread information to config files
std::ofstream out;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest to explicitly close the out file. My personal preference would be to place all the code accessing out inside a block along

{
  std::ofstream out{"./local_model_repo/" + token + "/config.pbtxt", std::ios::app};
  // all code accessing out
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is no longer done here. Now I just echo in the cmsTriton bash script

@@ -238,7 +247,67 @@ void TritonService::preBeginJob(edm::PathsAndConsumesOfModulesBase const&, edm::
if (fallbackOpts_.wait >= 0)
fallbackOpts_.command += " -w " + std::to_string(fallbackOpts_.wait);
for (const auto& [modelName, model] : unservedModels_) {
fallbackOpts_.command += " -m " + model.path;
//We need to edit the triton config files, need to copy from cvmfs
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would actually suggest to handle the copying and editing of model config files in the cmsTriton script (by passing in the numberOfThreads value when the script is called). Two reasons:

  1. most operations here are shell commands, therefore more naturally suited to be implemented in bash
  2. allows this functionality (controlling number of instances per model) to be used outside of CMSSW/TritonService

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes good suggestion, thank you! Let me switch that over

@kpedro88
Copy link
Contributor

Is the "copy model locally" expected to be a temporary solution, or the long-term solution?

We do not expect a different way to do this in the foreseeable future. It's possible at some point that Triton's model configuration interface changes so these settings can be specified "on the fly", but that's not on the current roadmap.

@cmsbuild
Copy link
Contributor

Pull request #41228 was updated. @cmsbuild, @makortel, @fwyzard can you please check and sign again.

echo "]" >> "$CONFIG"
echo "" >> "$CONFIG"

if grep -Fq "onnx" "$CONFIG"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to try to maintain a consistent style in this script, a few things to fix:

  • indent with tabs (currently a mix of tabs and spaces in newly added lines, which is the worst of all possibilities)
  • format if statements as:
    if condition; then
    	action
    fi

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Sorry, didn't notice that my emacs was doing space instead of tabs when I was pressing "tab"

@@ -315,14 +319,59 @@ wait_server(){
echo "server is ready!"
}

edit_model(){
cp -r $1 $TMPDIR/local_model_repo/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to make the code a little easier to read/follow, I would start out by assigning $1, $2, etc. to local variables and then use the local variables everywhere.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

IFS='/' read -ra ADDR <<< "$1"
CONFIG=$TMPDIR/local_model_repo/${ADDR[-1]}/config.pbtxt

echo "" >> "$CONFIG"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this would be a little easier to read (and a little more efficient) if cat (heredoc) were used, e.g.

cat <<EOF > finalpath_expected_not_found.log
did not find thing '' TEST
did not find thing '' TEST
did not find thing '' TEST
found thing 'beginLumi' TEST
found thing 'endLumi' TEST
found thing 'beginRun' TEST
found thing 'endRun' TEST
EOF

(the downside is that the text body can't be indented to match the rest of the function, but I think it's a worthwhile tradeoff)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I switched to this format

echo "parameters { key: \"inter_op_thread_count\" value: { string_value: \"1\" } }" >> "$CONFIG"
fi

if grep -Fq "tensorflow" "$CONFIG"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be an elif (there's no case where we would apply both TF and ONNX settings to the same file)

also, I would check for the backend type more specifically, something like this:

PLATFORM=$(grep -m 1 "^platform:" "$CONFIG")
if [[ $PLATFORM == *"tensorflow"* ]]; then
...
fi

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done! Thanks for the code snippet


if grep -Fq "onnx" "$CONFIG"
then
echo "parameters { key: \"intra_op_thread_count\" value: { string_value: \"1\" } }" >> "$CONFIG"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as above, use cat/heredoc (also avoids need to escape quotes)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switched

list_models(){
# make list of model repositories
if [ "$INSTANCES" -gt 0 ]
then
if [ -d "$TMPDIR/local_model_repo" ]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

put the name "local_model_repo" in a local variable somewhere to reduce duplication/hardcoding

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I put this in a global variable so that it can be used by both list_models and edit_model without being forwarded from one to the other. Is that ok? I can switch to a local variable if you'd prefer

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fine.

then
#Want to start with a fresh copy of model files in case this directory already exists with local edits
rm -rf $TMPDIR/local_model_repo
mkdir $TMPDIR/local_model_repo
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this line is also in the "else" block, so just remove the else block and put this line outside the if statement

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

if [ "$INSTANCES" -gt 0 ]
then
edit_model $MODEL "$INSTANCES"
REPOS=$TMPDIR/local_model_repo
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just assign to REPOS once outside of (after) the loop over models

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment on lines 376 to 377
for ((r=0; r < ${#REPOS[@]}; r++)); do
# avoid issues w/ multiple levels of symlinks
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This part (extending till the end of the function) may not be necessary for the local_model_repo case

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved this into an if statement that is also related to the above comment

list_models

make_tmp
#make_tmp
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove commented-out code

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

deleted

@cmsbuild
Copy link
Contributor

cmsbuild commented Apr 3, 2023

Pull request #41228 was updated. @cmsbuild, @makortel, @fwyzard can you please check and sign again.

EOF


#if grep -Fq "onnx" "$CONFIG"; then
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

delete commented-out code

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

deleted

parameters { key: "inter_op_thread_count" value: { string_value: "1" } }
EOF

#elif grep -Fq "tensorflow" "$CONFIG"; then
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

delete commented-out code

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

deleted

@cmsbuild
Copy link
Contributor

cmsbuild commented Apr 3, 2023

Pull request #41228 was updated. @cmsbuild, @makortel, @fwyzard can you please check and sign again.

@@ -50,6 +52,7 @@ usage() {
$ECHO "-r [num] \t number of retries when starting container (default: ${RETRIES})"
$ECHO "-s [dir] \t Singularity sandbox directory (default: $(get_sandbox))"
$ECHO "-t [dir] \t non-default hidden temporary dir"
$ECHO "-I [num] \t number of model instances (default: ${INSTANCES} -> means no local editing of config files)"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to keep alphabetical order, move this up to L47

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved it; not sure why I didn't have it alphabetical to begin with

@cmsbuild
Copy link
Contributor

cmsbuild commented Jun 1, 2023

Pull request #41228 was updated. @cmsbuild, @makortel, @fwyzard can you please check and sign again.

@wpmccormack wpmccormack marked this pull request as ready for review June 1, 2023 17:08
@makortel
Copy link
Contributor

makortel commented Jun 1, 2023

@kpedro88 Do you have any further comments?

@wpmccormack
Copy link
Contributor Author

(Oh, I forgot to mention that we validated this PR by reproducing some previous results where thread controls were performed manually)

@kpedro88
Copy link
Contributor

kpedro88 commented Jun 1, 2023

@makortel this is good to go from my side.

@makortel
Copy link
Contributor

makortel commented Jun 5, 2023

@cmsbuild, please test

@cmsbuild
Copy link
Contributor

cmsbuild commented Jun 5, 2023

This pull request is fully signed and it will be integrated in one of the next CMSSW_12_6_X IBs (but tests are reportedly failing) and once validation in the development release cycle CMSSW_13_2_X is complete. This pull request will now be reviewed by the release team before it's merged. @perrotta, @dpiparo, @rappoccio (and backports should be raised in the release meeting by the corresponding L2)

@wpmccormack
Copy link
Contributor Author

ah... looks like this PR is targeting the wrong branch. @wpmccormack , you should change the target branch to master (and then we should retest it).

I tried switching the target branch to master, and it said that ~200 files had been changed. I started this work from 12_6_4, not master I think. Unless there's an easier way to do things, I'll just start a new PR. Is that ok?

@cmsbuild
Copy link
Contributor

cmsbuild commented Jun 5, 2023

@makortel
Copy link
Contributor

makortel commented Jun 5, 2023

Thanks for spotting @kpedro88.

@wpmccormack While technically changing the target branch and updating your triton_model_instance_control branch with a rebase on top of any 13_[012]_0_preX prerelease (e.g. 13_2_0_pre1 for the most recent one) should work (one also needs to be faster than @cmsbuild between the two operations to avoid tagging unrelated reviewers), a new PR is fine as well.

@perrotta
Copy link
Contributor

perrotta commented Jun 5, 2023

Replaced by #41876, which is correctly based on master.
Let close this PR, then

@perrotta perrotta closed this Jun 5, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants