PermissionError during model fitting in Optuna with ChemPropRegressor #40

Open
ahmed1212212 opened this issue Feb 1, 2025 · 4 comments

Comments

@ahmed1212212

Issue Title:
PermissionError during model fitting in Optuna with ChemPropRegressor

Description:
When running the optimization script using optunaz with the ChemPropRegressor algorithm, the process fails during cross-validation due to a PermissionError when attempting to save temporary files to a specified directory. The error occurs while trying to save a DataFrame to CSV format in an invalid or restricted directory.

Error Message:
PermissionError: [Errno 13] Permission denied: 'C:\path\to\another\temp\folder\tmp5s1mkddb'

config = OptimizationConfig(
    data=Dataset(
        input_column="canonical",
        response_column="molwt",
        training_dataset_file=r"C:\Users\aalhilal\Downloads\train.csv",  # This will be split into train and test.
    ),
    descriptors=[
        SmilesFromFile.new(),
    ],
    algorithms=[
        ChemPropRegressor.new(epochs=5),  # epochs=5 to ensure run finishes quickly
    ],
    settings=OptimizationConfig.Settings(
        mode=ModelMode.REGRESSION,
        cross_validation=2,
        n_trials=2,
        direction=OptimizationDirection.MAXIMIZATION,
    ),
)

study = optimize(config, study_name="my_study")

@lewismervin1
Collaborator

Hello @ahmed1212212 , and welcome to the QSARtuna community.

This error may be specific to a Windows installation of QSARtuna, as I have not observed it myself. You may need to grant write permissions to cmd or the Python process to resolve it.
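As a quick way to test the permissions theory, the temporary directory can be probed directly. The sketch below is only a diagnostic under assumptions: `can_write_temp` is a hypothetical helper, not part of QSARtuna, and it does not reproduce how QSARtuna handles its temporary files. It creates a file with `tempfile.mkstemp`, closes the handle, then reopens the file by name, since Windows does not allow a named temporary file to be opened a second time while it is still open.

```python
import os
import tempfile


def can_write_temp(directory: str) -> bool:
    """Return True if a file can be created, rewritten by name and removed in `directory`."""
    try:
        # mkstemp returns an OS-level handle; close it before reopening by name,
        # because Windows refuses a second open of a still-open temporary file.
        fd, path = tempfile.mkstemp(dir=directory)
        os.close(fd)
        with open(path, "w", encoding="utf-8") as handle:
            handle.write("smiles,value\nCCO,1.0\n")  # illustrative content only
        os.remove(path)
        return True
    except OSError:
        # PermissionError and FileNotFoundError are both subclasses of OSError.
        return False


# gettempdir() consults the TMPDIR, TEMP and TMP environment variables.
temp_dir = tempfile.gettempdir()
print(f"Temporary directory: {temp_dir}")
print(f"Writable: {can_write_temp(temp_dir)}")
```

If this reports `False` for the directory named in the error, pointing `TEMP` and `TMP` at a folder the Python process can write to before starting Python may be worth trying.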

Please can you share the full error message so I can debug at what stage you are getting this error?

Thanks

@ahmed1212212
Author

import pandas as pd
from rdkit import Chem
from rdkit.Chem import AllChem
import numpy as np

# Load the CSV file containing SMILES
file_path = r"...\OneSidedSelection_Lipinski_All_Dataset\modified_OneSidedSelection_Lipinski_All_Dataset.csv"
df = pd.read_csv(file_path)

smiles_column = "SMILES"  # The actual name of the column containing SMILES strings

# Generate RDKit molecules from SMILES
molecules = [Chem.MolFromSmiles(smiles) for smiles in df[smiles_column]]

# Generate Morgan fingerprints (bit vector) for each molecule
fingerprints = []
radius = 2  # Default radius for Morgan fingerprint
n_bits = 512  # Size of the bit vector

for mol in molecules:
    if mol is not None:
        fingerprint = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
        # Convert the bit vector to a comma-separated string
        fingerprint_str = ",".join(map(str, list(fingerprint)))
        fingerprints.append(fingerprint_str)
    else:
        # Append a zero string if the molecule is invalid
        fingerprints.append(",".join(["0"] * n_bits))

# Add the fingerprints as a new column to the original DataFrame
df['fp'] = fingerprints

# Save the DataFrame (with the original columns + the fingerprint column) to a new CSV file
output_file_path = r"....\morgan_fingerprints_with_columns.csv"
df.to_csv(output_file_path, index=False)

print(f"Data with Morgan fingerprints saved to: {output_file_path}")

Then I ran this code:
from optunaz.utils.preprocessing.transform import VectorFromColumn

vector_covariate_config = OptimizationConfig(
    data=Dataset(
        input_column="SMILES",
        response_column="LogP",
        response_type="regression",
        training_dataset_file=r"...\morgan_fingerprints_with_columns.csv",
        aux_column="fp",  # use a comma separated co-variate vector in column fp
        aux_transform=VectorFromColumn.new(),  # split the comma separated values into a vector
        split_strategy=Stratified(fraction=0.2),
    ),
    descriptors=[
        CompositeDescriptor.new(
            descriptors=[
                PrecomputedDescriptorFromFile.new(
                    file=r"....\morgan_fingerprints_with_columns.csv",
                    input_column="SMILES",
                    response_column="fp",
                ),
                ECFP.new(),
            ]
        )
    ],
    algorithms=[
        RandomForestRegressor.new(n_estimators={"low": 5, "high": 10}),
        Ridge.new(),
        Lasso.new(),
        PLSRegression.new(),
    ],
    settings=OptimizationConfig.Settings(
        mode=ModelMode.REGRESSION,
        cross_validation=2,
        n_trials=4,
        n_startup_trials=0,
        direction=OptimizationDirection.MAXIMIZATION,
        track_to_mlflow=False,
        random_seed=42,
    ),
)

precomputed_study = optimize(vector_covariate_config, study_name="precomputed_example")
build_best(buildconfig_best(precomputed_study), "../target/precomputed_model.pkl")

[I 2025-02-10 14:01:38,040] Trial 0 pruned. Descriptor generation failed for descriptor
[I 2025-02-10 14:01:42,632] Trial 1 pruned. Descriptor generation failed for descriptor
[I 2025-02-10 14:01:46,956] Trial 2 pruned. Descriptor generation failed for descriptor
[I 2025-02-10 14:01:51,361] Trial 3 pruned. Descriptor generation failed for descriptor

ValueError Traceback (most recent call last)
Cell In[27], line 37
1 from optunaz.utils.preprocessing.transform import VectorFromColumn
3 vector_covariate_config = OptimizationConfig(
4 data=Dataset(
5 input_column="SMILES",
(...)
34 ),
35 )
---> 37 precomputed_study = optimize(vector_covariate_config, study_name="precomputed_example")
38 build_best(buildconfig_best(precomputed_study), "../target/precomputed_model.pkl")

File c:\Users\aalhilal\AppData\Local\anaconda3\envs\my_env_with_qsartuna\lib\site-packages\optunaz\three_step_opt_build_merge.py:208, in optimize(optconfig, study_name)
200 for cfg_idx, cfg in enumerate(split_optimize(optconfig)):
201 sub_objective = Objective(
202 optconfig=cfg,
203 train_smiles=train_smiles,
(...)
206 cache=optconfig.cache,
207 )
--> 208 study = run_study(
209 cfg,
210 f"study_name
{cfg_idx}",
211 sub_objective,
212 n_startup_trials,
213 n_trials,
214 random_seed,
215 storage=False,
216 trial_number_offset=trial_number_offset,
217 )
218 # manually set the distributions to avoid dynamic subspace error
219 for st_idx, st in enumerate(study.get_trials(deepcopy=False)):

File c:\Users\aalhilal\AppData\Local\anaconda3\envs\my_env_with_qsartuna\lib\site-packages\optunaz\three_step_opt_build_merge.py:168, in run_study(optconfig, study_name, objective, n_startup_trials, n_trials, seed, storage, trial_number_offset)
164 if (~study.trials_dataframe()["user_attrs_trial_ran"]).all():
165 logging.warning(
166 f"None of the trials were able to finish: {study.trials_dataframe()}"
167 )
--> 168 raise ValueError("Exiting since no trials returned values")
169 return study

ValueError: Exiting since no trials returned values
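The log does not say why descriptor generation failed, so the precomputed column is only one suspect. A cheap first check is whether every `fp` cell parses into a numeric vector of the same length. The sketch below is a stand-alone check under assumptions: `find_bad_vectors` is a hypothetical helper, not a QSARtuna function, and the in-memory CSV is made-up data standing in for the real file.

```python
import csv
import io


def find_bad_vectors(rows, column="fp"):
    """Return (row_number, reason) pairs for rows whose vector cell looks malformed."""
    problems = []
    expected_len = None  # length of the first well-formed vector seen
    for row_number, row in enumerate(rows, start=1):
        cell = (row.get(column) or "").strip()
        if not cell:
            problems.append((row_number, "empty vector"))
            continue
        parts = cell.split(",")
        try:
            for part in parts:
                float(part)  # every entry must be numeric
        except ValueError:
            problems.append((row_number, "non-numeric value"))
            continue
        if expected_len is None:
            expected_len = len(parts)
        elif len(parts) != expected_len:
            problems.append((row_number, f"length {len(parts)} != {expected_len}"))
    return problems


# Tiny in-memory CSV standing in for the real file (illustrative data only).
sample = io.StringIO(
    'SMILES,LogP,fp\n'
    'CCO,-0.1,"0,1,0,1"\n'
    'CCN,-0.2,"0,1,0"\n'
    'CCC,1.8,\n'
)
problems = find_bad_vectors(csv.DictReader(sample))
for row_number, reason in problems:
    print(f"row {row_number}: {reason}")
```

To run it against the real file, open the CSV with `open(path, newline="")` and pass `csv.DictReader(handle)` in place of the sample. An empty result does not rule out other causes, such as a SMILES column that does not match between the training file and the descriptor file.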

@lewismervin1
Collaborator

lewismervin1 commented Mar 7, 2025

The following code runs OK for me. Please can you check whether it also runs for you?

import pandas as pd  # needed for pd.read_csv below
from rdkit import Chem
from rdkit.Chem import AllChem
import numpy as np

from optunaz.config.optconfig import OptimizationConfig
from optunaz.datareader import Dataset
from optunaz.utils.preprocessing.splitter import Stratified
from optunaz.descriptors import CompositeDescriptor, PrecomputedDescriptorFromFile, ECFP
from optunaz.config.optconfig import RandomForestRegressor, Ridge, Lasso, PLSRegression
from optunaz.optbuild import optimize
from optunaz.utils.preprocessing.transform import VectorFromColumn
from optunaz.config import ModelMode, OptimizationDirection

optuna_path = ""  # set this (path prefix ending in a slash; the paths below are appended directly)
file_path = f"{optuna_path}tests/data/DRD2/subset-50/train.csv"
df = pd.read_csv(file_path)
smiles_column = "canonical" # The actual name of the column containing SMILES strings

molecules = [Chem.MolFromSmiles(smiles) for smiles in df[smiles_column]]

fingerprints = []
radius = 2 # Default radius for Morgan fingerprint
n_bits = 512 # Size of the bit vector

for mol in molecules:
    if mol is not None:
        fingerprint = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
        # Convert the bit vector to a comma-separated string
        fingerprint_str = ",".join(map(str, list(fingerprint)))
        fingerprints.append(fingerprint_str)
    else:
        # Append a zero string if the molecule is invalid
        fingerprints.append(",".join(["0"] * n_bits))

df['fp'] = fingerprints

output_file_path = f"{optuna_path}target/morgan_fingerprints_with_columns.csv"
df.to_csv(output_file_path, index=False)

print(f"Data with Morgan fingerprints saved to: {output_file_path}")

vector_covariate_config = OptimizationConfig(
    data=Dataset(
        input_column="canonical",
        response_column="molwt",
        response_type="regression",
        # reuse the path written above so the reader and writer cannot disagree
        training_dataset_file=output_file_path,
        aux_column="fp",  # use a comma separated co-variate vector in column fp
        aux_transform=VectorFromColumn.new(),  # split the comma separated values into a vector
        split_strategy=Stratified(fraction=0.2),
    ),
    descriptors=[
        CompositeDescriptor.new(
            descriptors=[
                PrecomputedDescriptorFromFile.new(
                    file=output_file_path,
                    input_column="canonical",
                    response_column="fp",
                ),
                ECFP.new(),
            ]
        )
    ],
    algorithms=[
        RandomForestRegressor.new(n_estimators={"low": 5, "high": 10}),
        Ridge.new(),
        Lasso.new(),
        PLSRegression.new(),
    ],
    settings=OptimizationConfig.Settings(
        mode=ModelMode.REGRESSION,
        cross_validation=2,
        n_trials=40,
        n_startup_trials=0,
        direction=OptimizationDirection.MAXIMIZATION,
        track_to_mlflow=False,
        random_seed=42,
    ),
)

study = optimize(vector_covariate_config, study_name="precomputed_example")

@lewismervin1
Collaborator

lewismervin1 commented Mar 7, 2025

FYI, I wanted to highlight that your implementation presents an ECFP fingerprint three times for any given molecule:

Twice within the composite descriptor (once as the precomputed 512-bit ECFP_4 fingerprint you calculated manually, and once as the default ECFP [2048-bit ECFP_6] from ECFP.new()), and once more as a co-variate, using the vector of your manually calculated ECFP_4.
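If the duplication is unintended, one option is to keep a single representation. The sketch below is illustrative only, not a recommendation from the maintainers: it reuses the names and the DRD2 test file that appear earlier in this thread, passes just `ECFP.new()` as the descriptor, and leaves out `aux_column` and `aux_transform`. The name `single_ecfp_config` and the reduced algorithm list are assumptions made for brevity.

```python
from optunaz.config import ModelMode, OptimizationDirection
from optunaz.config.optconfig import OptimizationConfig, RandomForestRegressor, Ridge
from optunaz.datareader import Dataset
from optunaz.descriptors import ECFP
from optunaz.optbuild import optimize
from optunaz.utils.preprocessing.splitter import Stratified

optuna_path = ""  # set this, as in the example earlier in the thread

single_ecfp_config = OptimizationConfig(
    data=Dataset(
        input_column="canonical",
        response_column="molwt",
        response_type="regression",
        training_dataset_file=f"{optuna_path}tests/data/DRD2/subset-50/train.csv",
        split_strategy=Stratified(fraction=0.2),
        # no aux_column / aux_transform: the fingerprint is not repeated as a co-variate
    ),
    # a single default ECFP instead of a composite of two ECFP variants
    descriptors=[ECFP.new()],
    algorithms=[
        RandomForestRegressor.new(n_estimators={"low": 5, "high": 10}),
        Ridge.new(),
    ],
    settings=OptimizationConfig.Settings(
        mode=ModelMode.REGRESSION,
        cross_validation=2,
        n_trials=4,
        n_startup_trials=0,
        direction=OptimizationDirection.MAXIMIZATION,
        track_to_mlflow=False,
        random_seed=42,
    ),
)

study = optimize(single_ecfp_config, study_name="single_ecfp_example")
```

Keeping the precomputed 512-bit fingerprint and dropping `ECFP.new()` would be the mirror-image choice; which representation works better for a given dataset is something this thread does not settle.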
