-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
initial pass at controlling model instance and threads for fallback triton servers #41228
initial pass at controlling model instance and threads for fallback triton servers #41228
Conversation
A new Pull Request was created by @wpmccormack (Patrick McCormack) for CMSSW_12_6_X. It involves the following packages:
@cmsbuild, @makortel, @fwyzard can you please review it and eventually sign? Thanks. cms-bot commands are listed here |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the "copy model locally" expected to be a temporary solution, or the long-term solution?
#include "FWCore/ServiceRegistry/interface/SystemBounds.h" | ||
#include "FWCore/ServiceRegistry/interface/ServiceMaker.h" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ServiceMaker.h
include should be unnecessary, and for SystemBounds.h
it would be better to forward-declare edm::service::SystemBounds
below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed and forward declared
@@ -218,6 +218,7 @@ start_singularity(){ | |||
for REPO in ${REPOS[@]}; do | |||
MOUNTARGS="$MOUNTARGS -B $REPO" | |||
REPOARGS="$REPOARGS --model-repository=${REPO}" | |||
ls ${REPO} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this line a leftover?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes sorry, it's been removed now
@@ -1,4 +1,5 @@ | |||
#include "FWCore/ServiceRegistry/interface/ServiceMaker.h" | |||
#include "FWCore/ServiceRegistry/interface/SystemBounds.h" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This #include
should not be needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed
@@ -6,6 +6,8 @@ | |||
#include "FWCore/ParameterSet/interface/ConfigurationDescriptions.h" | |||
#include "FWCore/ParameterSet/interface/ParameterSetDescription.h" | |||
#include "FWCore/ServiceRegistry/interface/ActivityRegistry.h" | |||
#include "FWCore/ServiceRegistry/interface/SystemBounds.h" | |||
#include "FWCore/ServiceRegistry/interface/ServiceMaker.h" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ServiceMaker.h
include should not be needed here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed
@@ -128,12 +131,14 @@ class TritonService { | |||
unsigned currentModuleId_; | |||
bool allowAddModel_; | |||
bool startedFallback_; | |||
unsigned int numInstancesForFallbacks_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This member appears to be unused.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry it's been removed
out << "parameters { key: \"TF_NUM_INTRA_THREADS\" value: { string_value: \"1\" } }" << std::endl; | ||
out << "parameters { key: \"TF_NUM_INTER_THREADS\" value: { string_value: \"1\" } }" << std::endl; | ||
out << "parameters { key: \"TF_USE_PER_SESSION_THREADS\" value: { string_value: \"1\" } }" << std::endl; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if neither isONNX
or isTF
is true
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this is another subtlety with triton server configs: they are model-type dependent. Thread controls are only implemented right now for ONNX and TF models, so if neither of those options are true, then nothing else should be included in the config
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should at least emit a warning in this case. something like: "Warning: thread (instance) control not implemented for $PLATFORM"
(following one of my suggestions below)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added this, thanks!
bool isONNX = false; | ||
bool isTF = false; | ||
|
||
size_t offset; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like the offset
is not really needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This has all been deleted
bool isTF = false; | ||
|
||
size_t offset; | ||
std::string line; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please move this declaration inside the if
block as it is needed only there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, all deleted and moved to cmsTriton script
|
||
size_t offset; | ||
std::string line; | ||
std::ifstream Myfile; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Variable names should start with lower case character.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
deleted
} | ||
|
||
//append instance and thread information to config files | ||
std::ofstream out; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd suggest to explicitly close the out
file. My personal preference would be to place all the code accessing out
inside a block along
{
std::ofstream out{"./local_model_repo/" + token + "/config.pbtxt", std::ios::app};
// all code accessing out
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is no longer done here. Now I just echo in the cmsTriton bash script
@@ -238,7 +247,67 @@ void TritonService::preBeginJob(edm::PathsAndConsumesOfModulesBase const&, edm:: | |||
if (fallbackOpts_.wait >= 0) | |||
fallbackOpts_.command += " -w " + std::to_string(fallbackOpts_.wait); | |||
for (const auto& [modelName, model] : unservedModels_) { | |||
fallbackOpts_.command += " -m " + model.path; | |||
//We need to edit the triton config files, need to copy from cvmfs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would actually suggest to handle the copying and editing of model config files in the cmsTriton
script (by passing in the numberOfThreads
value when the script is called). Two reasons:
- most operations here are shell commands, therefore more naturally suited to be implemented in bash
- allows this functionality (controlling number of instances per model) to be used outside of CMSSW/
TritonService
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yes good suggestion, thank you! Let me switch that over
We do not expect a different way to do this in the foreseeable future. It's possible at some point that Triton's model configuration interface changes so these settings can be specified "on the fly", but that's not on the current roadmap. |
echo "]" >> "$CONFIG" | ||
echo "" >> "$CONFIG" | ||
|
||
if grep -Fq "onnx" "$CONFIG" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to try to maintain a consistent style in this script, a few things to fix:
- indent with tabs (currently a mix of tabs and spaces in newly added lines, which is the worst of all possibilities)
- format if statements as:
if condition; then action fi
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed. Sorry, didn't notice that my emacs was doing space instead of tabs when I was pressing "tab"
@@ -315,14 +319,59 @@ wait_server(){ | |||
echo "server is ready!" | |||
} | |||
|
|||
edit_model(){ | |||
cp -r $1 $TMPDIR/local_model_repo/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to make the code a little easier to read/follow, I would start out by assigning $1
, $2
, etc. to local variables and then use the local variables everywhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
IFS='/' read -ra ADDR <<< "$1" | ||
CONFIG=$TMPDIR/local_model_repo/${ADDR[-1]}/config.pbtxt | ||
|
||
echo "" >> "$CONFIG" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this would be a little easier to read (and a little more efficient) if cat (heredoc) were used, e.g.
cmssw/FWCore/Integration/test/test_finalpath.sh
Lines 19 to 27 in 3d13002
cat <<EOF > finalpath_expected_not_found.log | |
did not find thing '' TEST | |
did not find thing '' TEST | |
did not find thing '' TEST | |
found thing 'beginLumi' TEST | |
found thing 'endLumi' TEST | |
found thing 'beginRun' TEST | |
found thing 'endRun' TEST | |
EOF |
(the downside is that the text body can't be indented to match the rest of the function, but I think it's a worthwhile tradeoff)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I switched to this format
echo "parameters { key: \"inter_op_thread_count\" value: { string_value: \"1\" } }" >> "$CONFIG" | ||
fi | ||
|
||
if grep -Fq "tensorflow" "$CONFIG" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should be an elif
(there's no case where we would apply both TF and ONNX settings to the same file)
also, I would check for the backend type more specifically, something like this:
PLATFORM=$(grep -m 1 "^platform:" "$CONFIG")
if [[ $PLATFORM == *"tensorflow"* ]]; then
...
fi
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done! Thanks for the code snippet
|
||
if grep -Fq "onnx" "$CONFIG" | ||
then | ||
echo "parameters { key: \"intra_op_thread_count\" value: { string_value: \"1\" } }" >> "$CONFIG" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as above, use cat/heredoc (also avoids need to escape quotes)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Switched
list_models(){ | ||
# make list of model repositories | ||
if [ "$INSTANCES" -gt 0 ] | ||
then | ||
if [ -d "$TMPDIR/local_model_repo" ] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
put the name "local_model_repo" in a local variable somewhere to reduce duplication/hardcoding
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I put this in a global variable so that it can be used by both list_models and edit_model without being forwarded from one to the other. Is that ok? I can switch to a local variable if you'd prefer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's fine.
then | ||
#Want to start with a fresh copy of model files in case this directory already exists with local edits | ||
rm -rf $TMPDIR/local_model_repo | ||
mkdir $TMPDIR/local_model_repo |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this line is also in the "else" block, so just remove the else block and put this line outside the if statement
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
if [ "$INSTANCES" -gt 0 ] | ||
then | ||
edit_model $MODEL "$INSTANCES" | ||
REPOS=$TMPDIR/local_model_repo |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just assign to REPOS
once outside of (after) the loop over models
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
for ((r=0; r < ${#REPOS[@]}; r++)); do | ||
# avoid issues w/ multiple levels of symlinks |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This part (extending till the end of the function) may not be necessary for the local_model_repo
case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I moved this into an if statement that is also related to the above comment
list_models | ||
|
||
make_tmp | ||
#make_tmp |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove commented-out code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
deleted
EOF | ||
|
||
|
||
#if grep -Fq "onnx" "$CONFIG"; then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
delete commented-out code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
deleted
parameters { key: "inter_op_thread_count" value: { string_value: "1" } } | ||
EOF | ||
|
||
#elif grep -Fq "tensorflow" "$CONFIG"; then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
delete commented-out code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
deleted
@@ -50,6 +52,7 @@ usage() { | |||
$ECHO "-r [num] \t number of retries when starting container (default: ${RETRIES})" | |||
$ECHO "-s [dir] \t Singularity sandbox directory (default: $(get_sandbox))" | |||
$ECHO "-t [dir] \t non-default hidden temporary dir" | |||
$ECHO "-I [num] \t number of model instances (default: ${INSTANCES} -> means no local editing of config files)" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to keep alphabetical order, move this up to L47
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved it; not sure why I didn't have it alphabetical to begin with
@kpedro88 Do you have any further comments? |
(Oh, I forgot to mention that we validated this PR by reproducing some previous results where thread controls were performed manually) |
@makortel this is good to go from my side. |
@cmsbuild, please test |
This pull request is fully signed and it will be integrated in one of the next CMSSW_12_6_X IBs (but tests are reportedly failing) and once validation in the development release cycle CMSSW_13_2_X is complete. This pull request will now be reviewed by the release team before it's merged. @perrotta, @dpiparo, @rappoccio (and backports should be raised in the release meeting by the corresponding L2) |
I tried switching the target branch to master, and it said that ~200 files had been changed. I started this work from 12_6_4, not master I think. Unless there's an easier way to do things, I'll just start a new PR. Is that ok? |
+code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-41228/35789 |
Thanks for spotting @kpedro88. @wpmccormack While technically changing the target branch and updating your |
Replaced by #41876, which is correctly based on master. |
PR description:
For SONIC, the number of instances of a model in a fallback server and the number of threads that fallback server can occupy on local CPU resources can be controlled via the model's triton config file. We set the number of instances to the number of threads specified for the job, so this implementation requires a bit of text editing before the fallback servers can run. The model files live in cvmfs, and the config files cannot be edited there, so they are pulled locally. The precise edits depend on the model's backend; luckily the backend information can be extracted from the config files, so no additional dictionary containing this information is needed.
PR validation:
This has been tested locally to verify that: 1) the config files get edited correctly, 2) that the fallback servers start properly, and 3) that a miniAOD processing job can run completely.