PermissionError during model fitting in Optuna with ChemPropRegressor #40

ahmed1212212 · 2025-02-01T16:22:25Z

Issue Title:
PermissionError during model fitting in Optuna with ChemPropRegressor

Description:
When running the optimization script using optunaz with the ChemPropRegressor algorithm, the process fails during cross-validation due to a PermissionError when attempting to save temporary files to a specified directory. The error occurs while trying to save a DataFrame to CSV format in an invalid or restricted directory.

Error Message:
PermissionError: [Errno 13] Permission denied: 'C:\path\to\another\temp\folder\tmp5s1mkddb'

config = OptimizationConfig(
    data=Dataset(
        input_column="canonical",
        response_column="molwt",
        training_dataset_file=r"C:\Users\aalhilal\Downloads\train.csv",  # This will be split into train and test.
    ),
    descriptors=[
        SmilesFromFile.new(),
    ],
    algorithms=[
        ChemPropRegressor.new(epochs=5),  # epochs=5 to ensure the run finishes quickly
    ],
    settings=OptimizationConfig.Settings(
        mode=ModelMode.REGRESSION,
        cross_validation=2,
        n_trials=2,
        direction=OptimizationDirection.MAXIMIZATION,
    ),
)

study = optimize(config, study_name="my_study")

lewismervin1 · 2025-02-07T17:05:40Z

Hello @ahmed1212212, and welcome to the QSARtuna community.

This error may be specific to Windows installations of QSARtuna, as I have not observed it myself. You may need to grant the required permissions to cmd or to the Python process to resolve it.
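One way to test the permissions hypothesis is to check whether the Python process can create, write, and delete a file in the temporary directory it resolves. The sketch below uses only the standard library. `can_write_to` is a hypothetical helper name, and the check does not reproduce how QSARtuna or ChemProp create their temporary files.

```python
import os
import tempfile


def can_write_to(directory: str) -> bool:
    """Return True if this process can create, write, and delete a file in `directory`."""
    try:
        # mkstemp creates the file and returns an open OS-level handle plus its path.
        fd, path = tempfile.mkstemp(dir=directory)
    except OSError:
        # Covers PermissionError and FileNotFoundError, both subclasses of OSError.
        return False
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write("test")
        return True
    except OSError:
        return False
    finally:
        # Remove the probe file if it still exists.
        if os.path.exists(path):
            os.remove(path)


temp_dir = tempfile.gettempdir()  # consults the TMPDIR, TEMP and TMP environment variables
print(f"Temporary directory: {temp_dir}")
print(f"Writable: {can_write_to(temp_dir)}")
```

If this reports that the directory is not writable, pointing `TEMP`/`TMP` at a folder you own before starting Python might be worth trying, since `tempfile.gettempdir()` consults those variables.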

Could you please share the full error message (the complete traceback) so I can work out at which stage the error occurs?
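If the traceback is hard to copy from a notebook, it can be captured as text. This is a minimal sketch using the standard `traceback` module. `run_and_capture` and `failing_step` are hypothetical names, the path is a placeholder, and `failing_step` only stands in for the real `optimize(...)` call.

```python
import traceback


def run_and_capture(func, *args, **kwargs):
    """Call func; return (result, None) on success or (None, traceback_text) on failure."""
    try:
        return func(*args, **kwargs), None
    except Exception:
        # format_exc() returns the full traceback of the exception being handled.
        return None, traceback.format_exc()


def failing_step():
    # Stand-in for the real call; raises the same exception type with a placeholder path.
    raise PermissionError(13, "Permission denied", r"C:\example\tmpfile")


result, report = run_and_capture(failing_step)
print(report)
```

The printed text includes every frame between the call and the failing line, which is the part needed to see where the temporary file is being written.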

Thanks

ahmed1212212 · 2025-02-10T14:06:43Z

import pandas as pd
from rdkit import Chem
from rdkit.Chem import AllChem
import numpy as np

# Load the CSV file containing SMILES
file_path = r"...\OneSidedSelection_Lipinski_All_Dataset\modified_OneSidedSelection_Lipinski_All_Dataset.csv"
df = pd.read_csv(file_path)

smiles_column = "SMILES"  # The actual name of the column containing SMILES strings

# Generate RDKit molecules from SMILES
molecules = [Chem.MolFromSmiles(smiles) for smiles in df[smiles_column]]

# Generate Morgan fingerprints (bit vector) for each molecule
fingerprints = []
radius = 2  # Default radius for Morgan fingerprint
n_bits = 512  # Size of the bit vector

for mol in molecules:
    if mol is not None:
        fingerprint = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
        # Convert the bit vector to a comma-separated string
        fingerprint_str = ",".join(map(str, list(fingerprint)))
        fingerprints.append(fingerprint_str)
    else:
        # Append a zero string if the molecule is invalid
        fingerprints.append(",".join(["0"] * n_bits))

# Add the fingerprints as a new column to the original DataFrame
df['fp'] = fingerprints

# Save the DataFrame (with the original columns + the fingerprint column) to a new CSV file
output_file_path = r"....\morgan_fingerprints_with_columns.csv"
df.to_csv(output_file_path, index=False)

print(f"Data with Morgan fingerprints saved to: {output_file_path}")

Then I ran this code:

from optunaz.utils.preprocessing.transform import VectorFromColumn

vector_covariate_config = OptimizationConfig(
    data=Dataset(
        input_column="SMILES",
        response_column="LogP",
        response_type="regression",
        training_dataset_file=r"...\morgan_fingerprints_with_columns.csv",
        aux_column="fp",  # use a comma separated co-variate vector in column fp
        aux_transform=VectorFromColumn.new(),  # split the comma separated values into a vector
        split_strategy=Stratified(fraction=0.2),
    ),
    descriptors=[
        CompositeDescriptor.new(
            descriptors=[
                PrecomputedDescriptorFromFile.new(
                    file=r"....\morgan_fingerprints_with_columns.csv",
                    input_column="SMILES",
                    response_column="fp",
                ),
                ECFP.new(),
            ]
        )
    ],
    algorithms=[
        RandomForestRegressor.new(n_estimators={"low": 5, "high": 10}),
        Ridge.new(),
        Lasso.new(),
        PLSRegression.new(),
    ],
    settings=OptimizationConfig.Settings(
        mode=ModelMode.REGRESSION,
        cross_validation=2,
        n_trials=4,
        n_startup_trials=0,
        direction=OptimizationDirection.MAXIMIZATION,
        track_to_mlflow=False,
        random_seed=42,
    ),
)

precomputed_study = optimize(vector_covariate_config, study_name="precomputed_example")
build_best(buildconfig_best(precomputed_study), "../target/precomputed_model.pkl")

[I 2025-02-10 14:01:38,040] Trial 0 pruned. Descriptor generation failed for descriptor
[I 2025-02-10 14:01:42,632] Trial 1 pruned. Descriptor generation failed for descriptor
[I 2025-02-10 14:01:46,956] Trial 2 pruned. Descriptor generation failed for descriptor
[I 2025-02-10 14:01:51,361] Trial 3 pruned. Descriptor generation failed for descriptor

ValueError Traceback (most recent call last)
Cell In[27], line 37
1 from optunaz.utils.preprocessing.transform import VectorFromColumn
3 vector_covariate_config = OptimizationConfig(
4 data=Dataset(
5 input_column="SMILES",
(...)
34 ),
35 )
---> 37 precomputed_study = optimize(vector_covariate_config, study_name="precomputed_example")
38 build_best(buildconfig_best(precomputed_study), "../target/precomputed_model.pkl")

File c:\Users\aalhilal\AppData\Local\anaconda3\envs\my_env_with_qsartuna\lib\site-packages\optunaz\three_step_opt_build_merge.py:208, in optimize(optconfig, study_name)
200 for cfg_idx, cfg in enumerate(split_optimize(optconfig)):
201 sub_objective = Objective(
202 optconfig=cfg,
203 train_smiles=train_smiles,
(...)
206 cache=optconfig.cache,
207 )
--> 208 study = run_study(
209 cfg,
210 f"study_name{cfg_idx}",
211 sub_objective,
212 n_startup_trials,
213 n_trials,
214 random_seed,
215 storage=False,
216 trial_number_offset=trial_number_offset,
217 )
218 # manually set the distributions to avoid dynamic subspace error
219 for st_idx, st in enumerate(study.get_trials(deepcopy=False)):

File c:\Users\aalhilal\AppData\Local\anaconda3\envs\my_env_with_qsartuna\lib\site-packages\optunaz\three_step_opt_build_merge.py:168, in run_study(optconfig, study_name, objective, n_startup_trials, n_trials, seed, storage, trial_number_offset)
164 if (~study.trials_dataframe()["user_attrs_trial_ran"]).all():
165 logging.warning(
166 f"None of the trials were able to finish: {study.trials_dataframe()}"
167 )
--> 168 raise ValueError("Exiting since no trials returned values")
169 return study

ValueError: Exiting since no trials returned values
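Since every trial was pruned with "Descriptor generation failed", it may help to sanity-check the `fp` column before calling `optimize` again. The sketch below is an assumption-laden illustration using only the standard library: `check_vector_column` is a hypothetical helper, the in-memory CSV uses arbitrary values and 4-bit vectors for brevity, and the check does not reproduce QSARtuna's own validation or identify the actual cause of the failure.

```python
import csv
import io


def check_vector_column(rows, column, expected_len):
    """Return (row_index, problem) pairs for values that are not
    comma-separated numeric vectors of length `expected_len`."""
    problems = []
    for idx, row in enumerate(rows):
        value = row.get(column)
        if value is None or value.strip() == "":
            problems.append((idx, "missing value"))
            continue
        parts = value.split(",")
        if len(parts) != expected_len:
            problems.append((idx, f"length {len(parts)} != {expected_len}"))
            continue
        try:
            [float(p) for p in parts]
        except ValueError:
            problems.append((idx, "non-numeric entry"))
    return problems


# Tiny in-memory stand-in for the fingerprint CSV; response values are arbitrary.
sample = io.StringIO(
    'SMILES,LogP,fp\n'
    'CCO,1.0,"0,1,0,1"\n'
    'c1ccccc1,2.0,"1,0,1"\n'
    'CC(=O)O,3.0,\n'
)
rows = list(csv.DictReader(sample))
print(check_vector_column(rows, "fp", expected_len=4))
```

For the real file, the rows could come from `csv.DictReader` opened on the fingerprint CSV, with `expected_len=512` to match `n_bits` in the script above.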

lewismervin1 · 2025-03-07T15:05:27Z

The following code runs OK for me. Could you check whether it also runs for you?

import pandas as pd
from rdkit import Chem
from rdkit.Chem import AllChem
import numpy as np

from optunaz.config.optconfig import OptimizationConfig
from optunaz.datareader import Dataset
from optunaz.utils.preprocessing.splitter import Stratified
from optunaz.descriptors import CompositeDescriptor, PrecomputedDescriptorFromFile, ECFP
from optunaz.config.optconfig import RandomForestRegressor, Ridge, Lasso, PLSRegression
from optunaz.optbuild import optimize
from optunaz.utils.preprocessing.transform import VectorFromColumn
from optunaz.config import ModelMode, OptimizationDirection

optuna_path = ""  # set this to the QSARtuna repository root, with a trailing slash
file_path = f"{optuna_path}tests/data/DRD2/subset-50/train.csv"
df = pd.read_csv(file_path)
smiles_column = "canonical" # The actual name of the column containing SMILES strings

molecules = [Chem.MolFromSmiles(smiles) for smiles in df[smiles_column]]

fingerprints = []
radius = 2 # Default radius for Morgan fingerprint
n_bits = 512 # Size of the bit vector

for mol in molecules:
    if mol is not None:
        fingerprint = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
        # Convert the bit vector to a comma-separated string
        fingerprint_str = ",".join(map(str, list(fingerprint)))
        fingerprints.append(fingerprint_str)
    else:
        # Append a zero string if the molecule is invalid
        fingerprints.append(",".join(["0"] * n_bits))

df['fp'] = fingerprints

output_file_path = f"{optuna_path}target/morgan_fingerprints_with_columns.csv"
df.to_csv(output_file_path, index=False)

print(f"Data with Morgan fingerprints saved to: {output_file_path}")

vector_covariate_config = OptimizationConfig(
    data=Dataset(
        input_column="canonical",
        response_column="molwt",
        response_type="regression",
        # same path that the fingerprint CSV was written to above
        training_dataset_file=f"{optuna_path}target/morgan_fingerprints_with_columns.csv",
        aux_column="fp",  # use a comma separated co-variate vector in column fp
        aux_transform=VectorFromColumn.new(),  # split the comma separated values into a vector
        split_strategy=Stratified(fraction=0.2),
    ),
    descriptors=[
        CompositeDescriptor.new(
            descriptors=[
                PrecomputedDescriptorFromFile.new(
                    file=f"{optuna_path}target/morgan_fingerprints_with_columns.csv",
                    input_column="canonical",
                    response_column="fp",
                ),
                ECFP.new(),
            ]
        )
    ],
    algorithms=[
        RandomForestRegressor.new(n_estimators={"low": 5, "high": 10}),
        Ridge.new(),
        Lasso.new(),
        PLSRegression.new(),
    ],
    settings=OptimizationConfig.Settings(
        mode=ModelMode.REGRESSION,
        cross_validation=2,
        n_trials=40,
        n_startup_trials=0,
        direction=OptimizationDirection.MAXIMIZATION,
        track_to_mlflow=False,
        random_seed=42,
    ),
)

study = optimize(vector_covariate_config, study_name="precomputed_example")

lewismervin1 · 2025-03-07T15:08:40Z

FYI, I wanted to highlight that your implementation presents ECFP information three times for any given molecule:

Twice within the composite descriptor (once as the manually calculated 512-bit ECFP_4 fingerprint loaded as a precomputed descriptor, and once as the default 2048-bit ECFP_6 from ECFP.new()), and once more as a co-variate, via the vector of your manually calculated ECFP_4.
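To make the overlap concrete, the three presentations can be tallied as below. This is an illustrative sketch only: the sizes are the ones stated in this comment, `morgan_radius` is a hypothetical helper reflecting the convention that ECFP_n is named by diameter (so the Morgan radius is n/2), and the total assumes the co-variate vector is appended to the descriptor features, which is not confirmed in this thread.

```python
# Each entry: (where it enters the model, ECFP diameter, number of bits).
presentations = [
    ("composite: precomputed 'fp' column", 4, 512),
    ("composite: ECFP.new() defaults", 6, 2048),
    ("co-variate: aux_column 'fp'", 4, 512),
]


def morgan_radius(ecfp_diameter: int) -> int:
    """ECFP_n is named by diameter; the corresponding Morgan radius is half of it."""
    return ecfp_diameter // 2


for source, diameter, n_bits in presentations:
    print(f"{source}: ECFP_{diameter} (radius {morgan_radius(diameter)}), {n_bits} bits")

# Assumption: the three blocks end up side by side as model features.
total = sum(n_bits for _, _, n_bits in presentations)
print(f"Total fingerprint-derived features under that assumption: {total}")
```

The manually calculated fingerprint uses `radius = 2`, which matches ECFP_4, so the first and third entries carry the same 512 bits.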
