The motivation behind this repository is that I'm a lazy man and do not like to make a ton of edits to an existing ML `train.py` script. Riding on the fact that most ML training scripts are executed via `python train.py --arg1 val1 --arg2 val2`, this repository aims to:
- Make minimal edits to the original `train.py`
- Decouple the ClearML bits from the `train.py` script, allowing it to still be executable as a standalone script
- Achieve the above without any additional Python packages

Kinda like just putting a hat over your train script.
This setup assumes:

- A ClearML config (`clearml.conf`) with s3 keys
- All boolean arguments in `train.py` are `store_true`
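For the s3 keys, they go under the `sdk.aws.s3` section of `clearml.conf` (see the ClearML docs for the full set of options). A minimal sketch, with placeholder values:

```
sdk {
    aws {
        s3 {
            # default credentials for all s3 buckets; use the per-bucket
            # credentials list instead if different buckets need different keys
            key: "YOUR_ACCESS_KEY"
            secret: "YOUR_SECRET_KEY"
            region: "YOUR_REGION"
        }
    }
}
```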
- First and foremost, create a GitHub repository for the project. Note that the repository must be public. I don't set the rules.
- Next, have a virtual environment set up on your local machine (either Python venv or conda env is fine).
- Next, install the necessary dependencies for the project, with the following amendments to `requirements.txt`:
  - Include the necessary packages indicated in the `requirements.txt` of this repository.
  - Comment out `torch` and `torchvision` if you intend to use the PyTorch docker image (recommended) for remote execution, as sketched below. The PyTorch docker image already has these packages installed for you. If you intend to execute locally and use ClearML solely for tracking purposes, then you may leave these two dependencies in.
Instead of typing a long list of args each time you run a new experiment, it is good practice to store your experiment parameters as a `.yaml` file. This step generates a `.yaml` file based on the default parameters from your `train.py`.
- From your original `train.py` script, copy the `add_argument()` lines over to `argparse_to_yaml.py` and replace lines 8-12.
- Execute `python argparse_to_yaml.py` to generate a default `config.yaml`. A sketch of the idea follows this list.
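The script boils down to something like this (a minimal sketch; the `add_argument()` lines below are placeholders for your own):

```python
# argparse_to_yaml.py, roughly -- paste your add_argument() lines from
# train.py over the placeholder ones below, then run the script.
import argparse
import yaml

parser = argparse.ArgumentParser()
# --- lines 8-12: replace with the add_argument() calls from your train.py ---
parser.add_argument('--data_path', type=str, default='./data/Images')
parser.add_argument('--epochs', type=int, default=10)
parser.add_argument('--resume', action='store_true')
# ----------------------------------------------------------------------------

args = parser.parse_args([])  # parse an empty list -> defaults only
with open('config.yaml', 'w') as f:
    yaml.dump(vars(args), f, default_flow_style=False)
# config.yaml then looks like:
#   data_path: ./data/Images
#   epochs: 10
#   resume: false
```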
- Firstly, define a new function `main(args=None)` in your `train.py` (refer to line 10 in `train_aip.py`).
- Next, copy everything in the main routine of `train.py` into the `main()` function (refer to lines 11-23 in `train_aip.py`).
- Add `args` as the parameter for `parser.parse_args()` (refer to line 17 in `train_aip.py`).
- Typically, after training, you'd have a `torch.save(model_state, path)` statement. Return the path of the folder containing the saved model(s), so that `hat.py` can later upload it to S3 (refer to line 24 in `train_aip.py`).
- Lastly, replace the main routine with a `main()` function call (refer to line 27 in `train_aip.py`).
You may compare `train_ori.py` and `train_aip.py` to see the changes applied. Despite the amendments, `train_aip.py` can still be executed in the same manner as `train_ori.py` (e.g. `python train_aip.py --epochs 2`).
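For reference, a stripped-down sketch of what the amended script ends up looking like (placeholder model and arguments; line numbers will not match `train_aip.py`):

```python
import argparse
import os
import torch

def main(args=None):
    parser = argparse.ArgumentParser()
    parser.add_argument('--epochs', type=int, default=10)
    # ... the rest of your add_argument() lines ...
    opt = parser.parse_args(args)  # args=None -> argparse reads sys.argv as usual

    model = torch.nn.Linear(1, 1)  # stand-in for your real model
    for _ in range(opt.epochs):
        pass  # ... your original training loop ...

    save_dir = './weights'
    os.makedirs(save_dir, exist_ok=True)
    torch.save(model.state_dict(), os.path.join(save_dir, 'model.pt'))
    return save_dir  # hat.py uses this to upload the folder to S3

if __name__ == '__main__':
    main()
```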
- In the `data` folder, edit the `S3_LINK` in `upload_dataset_to_s3.py` (courtesy of Nic) to point to the s3 bucket (Line 3).
- Edit the dataset project and the dataset name (e.g. train/val/test) as well (Line 5). The convention for the dataset project is `datasets/<project_name>`.
- Lastly, edit the path to your dataset (Line 6).
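Under the hood this is the ClearML `Dataset` API. A sketch, assuming the script does roughly the following (all values are the placeholders you edit):

```python
# upload_dataset_to_s3.py, roughly
from clearml import Dataset

S3_LINK = 's3://your-bucket/datasets'  # Line 3: your s3 bucket

dataset = Dataset.create(
    dataset_project='datasets/<project_name>',  # Line 5: dataset project...
    dataset_name='train',                       # ...and dataset name
)
dataset.add_files('./data/Images')  # Line 6: path to your dataset
dataset.upload(output_url=S3_LINK)  # push the files to s3
dataset.finalize()                  # seal this dataset version
```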
The `hat.py` script handles most of the ClearML bits. It initializes the ClearML task, sets up remote execution, and retrieves data from s3.
- On Line 7, change the `PROJECT_NAME`.
- On Line 8, include your s3 bucket link.
- On Line 47, change the base docker image, if necessary. You may also add other docker setup scripts here using the `docker_setup_bash_script` parameter. Refer to the ClearML Task documentation for more details.
- On Line 48, indicate your queue name, if necessary.
- On Line 53, modify the `dataset_name` to your respective dataset. The `get_local_copy()` method downloads and caches the data, then returns the path of the cache folder. This is why the dataset path needs to be overwritten on Line 55.
- On Line 60, the `train_aip.py` script is imported. This is because, in the case of remote execution, the Python packages are not installed until after Lines 47-48 run on the remote machine, so importing `train_aip.py` at the top of the script would yield an error. (Shearman say one)
- On Line 61, the training parameters from the `.yaml` file are converted into a list (e.g. `['--data_path', './data/Images', '--epochs', '10', '--img_size', '640', '640', '--model_name', 'some_model']`). This list can then be passed to the `train_aip.py` script on Line 63. See the sketch after this list for the gist.
- On Line 66, create a new dataset for the trained model, for uploading the trained models to S3.
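To make the walkthrough concrete, here is a heavily condensed sketch of the flow (assumed names and defaults throughout; the line numbers above refer to the actual `hat.py`, not to this sketch):

```python
import argparse
import yaml
from clearml import Task, Dataset

PROJECT_NAME = 'my_project'          # Line 7: your project name
S3_LINK = 's3://your-bucket/models'  # Line 8: your s3 bucket link

def yaml_to_args(cfg):
    """Flatten the config dict into an argv-style list for train_aip.main()."""
    argv = []
    for key, val in cfg.items():
        if isinstance(val, bool):             # store_true flags: pass only if True
            if val:
                argv.append(f'--{key}')
        elif isinstance(val, (list, tuple)):  # e.g. --img_size 640 640
            argv.append(f'--{key}')
            argv.extend(str(v) for v in val)
        elif val is not None:
            argv.extend([f'--{key}', str(val)])
    return argv

if __name__ == '__main__':
    cli = argparse.ArgumentParser()
    cli.add_argument('--train_yaml', required=True)
    cli.add_argument('--task_name', required=True)
    cli.add_argument('--s3', action='store_true')
    cli.add_argument('--remote', action='store_true')
    opt = cli.parse_args()

    with open(opt.train_yaml) as f:
        cfg = yaml.safe_load(f)

    task = Task.init(project_name=PROJECT_NAME, task_name=opt.task_name)
    task.connect(cfg, name='train_args')  # shows up as train_args in the UI
    if opt.remote:
        task.set_base_docker(docker_image='pytorch/pytorch:latest')  # cf. Line 47
        task.execute_remotely(queue_name='default')                  # cf. Line 48

    if opt.s3:
        data = Dataset.get(dataset_project='datasets/<project_name>',
                           dataset_name='train')   # cf. Line 53
        cfg['data_path'] = data.get_local_copy()   # cf. Line 55: overwrite path

    import train_aip                               # cf. Line 60: deferred import
    model_dir = train_aip.main(yaml_to_args(cfg))  # cf. Lines 61-63

    output = Dataset.create(dataset_project=PROJECT_NAME,
                            dataset_name=f'{opt.task_name}_model')  # cf. Line 66
    output.add_files(model_dir)
    output.upload(output_url=S3_LINK)
    output.finalize()
```

The `store_true` branch in `yaml_to_args()` is why the assumption at the top of this README matters: a `False` flag is simply omitted from the list rather than passed as `--flag False`.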
Yeah, commit your changes, otherwise it doesn't work sometimes. I honestly don't know why, but most likely ClearML recreates the remote environment from your last git commit, so uncommitted changes may not come along for the ride.
- Local execution, local file path: `python hat.py --train_yaml config.yaml --task_name local`
- Local execution, s3 file path: `python hat.py --train_yaml config.yaml --task_name local_s3 --s3`
- Remote execution, s3 file path: `python hat.py --train_yaml config.yaml --task_name remote_s3 --s3 --remote`
On ClearML, you can repeat experiments with minor tweaks to the parameters, provided the experiment was executed remotely.
- On the experiment page, right click on a completed task, then click on Clone.
- Give the task a name, then proceed to clone.
- The cloned experiment will be in the Draft state. Click on the task, and under the Configuration tab, you can edit the hyperparameters. (Note: the hyperparameters to be amended are under `train_args`, NOT `Args`. The hyperparameters under `Args` are the defaults.)
- Once done, right click on the draft task and enqueue it onto one of the queues.
- To remove the default hyperparameters shown under `Args`