
Aux_column #44

Open
jackhhewitt opened this issue Feb 28, 2025 · 2 comments

Comments


jackhhewitt commented Feb 28, 2025

Hi,

Sorry for opening another issue.

When building a classification model, I supply the model with a training set of SMILES and Class (1s or 0s) as well as a few hundred other descriptors. I see that when the optimisation is done and the model is built, the other descriptors I supply aren't being used by the model.

When I include ECFP or PhyschemDescriptors in the descriptor section of the config, model.predictor.n_features_in_ is the length of either the ECFP or the PhyschemDescriptors vector and doesn't include my supplied descriptors from the training set.

I've noticed there's an aux_column arg that looks to take an auxiliary descriptor/feature. How do I use this to supply all the other descriptors in my training set? I've tried supplying just one column, say NumRotatableBonds, but when I predict using that model I get the error: X has 210 features, RandomForestClassifier expects 211. I'm not entirely sure what X is here or how this error came about.

Also, the tutorial mentions that configs can be loaded in through JSON files, but doesn't show how to do that. How would I load in a config stored in a JSON file?

Any help would be massively appreciated.

Thanks

Edit: Actually this could be merged with #43

@lewismervin1
Collaborator

Please can you supply the config for the optimisation where the number of expected descriptors differs from the trained model? It would be helpful to see how you are supplying your manually defined descriptors; I think you may need to supply them as a composite descriptor. If your precomputed descriptor is not correctly set up, the trials using that descriptor set may be throwing an error and being pruned. I'd need to see your config to check.

Example JSONs can be found in the examples folder, e.g. https://github.com/MolecularAI/QSARtuna/blob/master/examples/optimization/ChemProp_drd2_50.json, and you can export a config to JSON from within a notebook by exporting the serialized OptimizationConfig or BuildConfig, as shown here: https://molecularai.github.io/QSARtuna/notebooks/QSARtuna_Tutorial.html#Pick-the-best-trial-and-build-a-model-for-it.
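For reference, since configs are plain JSON on disk, the generic pattern is a straightforward file round-trip. A minimal sketch (the `config_dict` contents are illustrative, and the commented QSARtuna deserialization calls are an assumption based on its apischema-based serialization, so check your installed version):

```python
import json
import os
import tempfile

# Illustrative config fragment, not a complete QSARtuna config.
config_dict = {
    "data": {"input_column": "Structure", "response_column": "Class"},
    "settings": {"mode": "classification", "n_trials": 200},
}

path = os.path.join(tempfile.mkdtemp(), "config.json")

# Export: write the serialized config dict to a JSON file.
with open(path, "w") as f:
    json.dump(config_dict, f, indent=2)

# Import: read it back into a dict. With QSARtuna installed you would
# then (assumption) turn the dict into a config object, e.g.:
#   from apischema import deserialize
#   from optunaz.config.optconfig import OptimizationConfig
#   config = deserialize(OptimizationConfig, loaded)
with open(path) as f:
    loaded = json.load(f)

print(loaded["settings"]["n_trials"])  # 200
```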

Once you have the config, you can run QSARtuna via the CLI like so: https://molecularai.github.io/QSARtuna/README.html#running-via-cli

The auxiliary column is primarily intended for proteochemometric models etc., where you have a defined space that is concatenated as a covariate to your conventional chemical modelling space. That said, covariate modelling in itself can be useful in a variety of situations.
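The "X has 210 features, RandomForestClassifier expects 211" error earlier in the thread is consistent with exactly this concatenation: a covariate column appended at training time but absent at prediction time. A minimal numpy sketch of the idea (shapes and names are illustrative, not QSARtuna internals):

```python
import numpy as np

rng = np.random.default_rng(42)

# 210 "chemical" descriptor columns (e.g. a fingerprint) for 5 molecules.
X_chem = rng.random((5, 210))

# One auxiliary covariate column (e.g. NumRotatableBonds).
aux = rng.integers(0, 10, size=(5, 1)).astype(float)

# Training space: chemical descriptors + covariate -> 211 features.
X_train = np.hstack([X_chem, aux])
print(X_train.shape)  # (5, 211)

# If prediction supplies only the chemical block, the estimator sees
# 210 features and raises the feature-count mismatch error.
X_predict = X_chem
print(X_predict.shape)  # (5, 210)
```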


jackhhewitt commented Mar 8, 2025

So yes, I did end up using the PrecomputedDescriptor supplied within a composite descriptor. This precomputed descriptor file contains my various physicochemical descriptors computed with packages like Mordred, Mold2 and RDKit. However, for the model to take in all of these descriptors, I've concatenated them into a single comma-separated column (external_desc). My concern with that is that the identity of those features is lost, since I've combined them into one component.
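One way to avoid losing feature identity when descriptors are flattened into a single comma-separated column is to keep a parallel list of names and re-attach them on load. A stdlib-only sketch (the column layout and descriptor names here are made up for illustration, not the actual file):

```python
import csv
import io

# Hypothetical precomputed file: one 'external_desc' column holding a
# comma-separated vector per molecule (quoted so the commas survive CSV).
raw = io.StringIO(
    'Structure,external_desc\n'
    'CCO,"1.2,3.0,0.5"\n'
    'CCN,"0.8,2.1,1.7"\n'
)

# Feature names kept separately, in column order, so identity survives.
feature_names = ["MolLogP", "TPSA", "NumRotatableBonds"]

rows = list(csv.DictReader(raw))
vectors = [[float(x) for x in r["external_desc"].split(",")] for r in rows]

# Re-attach names: each molecule maps descriptor name -> value.
named = [dict(zip(feature_names, v)) for v in vectors]
print(named[0]["TPSA"])  # 3.0
```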

When I do y_preds = model.predict_from_smiles(test_set), a lot of the predictions are NaN values, so I can't compute metrics like ROC-AUC.

```python
# Imports as in the QSARtuna tutorial (omitted from my original snippet)
import pickle

from optunaz.three_step_opt_build_merge import optimize, buildconfig_best, build_best
from optunaz.config import ModelMode, OptimizationDirection
from optunaz.config.optconfig import (
    OptimizationConfig,
    RandomForestClassifier,
    LogisticRegression,
    SVC,
    KNeighborsClassifier,
)
from optunaz.datareader import Dataset
from optunaz.descriptors import (
    CompositeDescriptor,
    PrecomputedDescriptorFromFile,
    UnscaledPhyschemDescriptors,
    UnscaledJazzyDescriptors,
    ECFP,
    ECFP_counts,
    MACCS_keys,
    PathFP,
    Avalon,
)

composite_precomp_config3 = OptimizationConfig(
    data=Dataset(
        input_column="Structure",
        response_column="Class",
        training_dataset_file=r"M:\ML_scripts\MODEL DATA\TOBRAMYCIN_TRAIN_SET_RES.csv",
        test_dataset_file=r"M:\ML_scripts\MODEL DATA\TOBRAMYCIN_TEST_SET.csv",
    ),
    descriptors=[
        CompositeDescriptor.new(
            descriptors=[
                PrecomputedDescriptorFromFile.new(
                    file=r"M:\ML_scripts\MODEL DATA\TOBRAMYCIN_TRAIN_SET_RES_PRECOMP.csv",
                    input_column="Structure",
                    response_column="external_desc",
                ),
                UnscaledPhyschemDescriptors.new(),
                UnscaledJazzyDescriptors.new(),
                ECFP.new(returnRdkit=True),
                ECFP_counts.new(),
                MACCS_keys.new(),
                PathFP.new(),
                Avalon.new(),
            ]
        )
    ],
    algorithms=[
        RandomForestClassifier.new(),
        LogisticRegression.new(),
        SVC.new(),
        KNeighborsClassifier.new(),
    ],
    settings=OptimizationConfig.Settings(
        mode=ModelMode.CLASSIFICATION,
        cross_validation=10,
        n_trials=200,
        n_startup_trials=50,
        random_seed=42,
        direction=OptimizationDirection.MAXIMIZATION,
        n_jobs=-1,
    ),
)

composite_precomp_study3 = optimize(composite_precomp_config3, study_name="precomp_3")

build_best(
    buildconfig_best(composite_precomp_study3),
    r"M:\ML_scripts\MODEL DATA\models\composite models\precomp_3.pkl",
)

with open(r"M:\ML_scripts\MODEL DATA\models\composite models\precomp_3.pkl", "rb") as m:
    comp_precomp3 = pickle.load(m)

print(comp_precomp3)

y_pred3 = comp_precomp3.predict_from_smiles(test_smiles)
```

From this I'd want to know:

1. Why is the model predicting NaN values?
2. Is there any advantage to supplying descriptors concatenated (as in the precomputed descriptor file: all in one comma-separated column)? These are individual physicochemical descriptors, not fingerprints, so how would the model recognise them now that they're just numbers in a column (which would be fine for, say, a fingerprint)?
3. Why does including PhyschemDescriptors.new(), as opposed to UnscaledPhyschemDescriptors, throw a ValidationError of this type: [{'loc': ['parameters', 'descriptors', 1, '_scaler'], 'err': 'unexpected property'}]? Is there a way to include the scaled PhyschemDescriptors within the composite descriptor? I'd much prefer to work with the scaled descriptors from that set.
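On the NaN point: until the root cause is found (often SMILES for which a descriptor could not be computed), metrics can still be calculated on the valid subset by masking the failed predictions. A numpy sketch (the arrays are illustrative, not real model output):

```python
import numpy as np

# Illustrative predictions where some molecules failed and produced NaN.
y_pred = np.array([0.9, np.nan, 0.2, 0.7, np.nan])
y_true = np.array([1, 0, 0, 1, 1])

# Keep only rows where a prediction was actually produced.
mask = ~np.isnan(y_pred)
y_pred_ok, y_true_ok = y_pred[mask], y_true[mask]

print(mask.sum(), "of", len(y_pred), "predictions usable")
# With scikit-learn available you could then compute, e.g.:
#   from sklearn.metrics import roc_auc_score
#   roc_auc_score(y_true_ok, y_pred_ok)
```

It's also worth inspecting which SMILES correspond to the masked-out rows, since they usually point at the descriptor that is failing.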

I also get various SHAP issues when I look into feature importances, including: `object of type 'NoneType' has no len()`; `Could not find descriptor for C in file: 'path/to/train/file'`; `MemoryError: Unable to allocate 517 GiB`; and `Additivity check failed in TreeExplainer`.
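The 517 GiB MemoryError typically means the explainer is being handed the full training matrix as background data; subsampling the background is the standard workaround. A numpy sketch of just the subsampling step (the commented SHAP calls are assumptions about how the explainer is invoked in your setup, though `data=` and `check_additivity=` are documented SHAP parameters):

```python
import numpy as np

rng = np.random.default_rng(0)

# Full descriptor matrix: many molecules x many features.
X_train = rng.random((50_000, 500))

# Draw a small background sample for the explainer instead of all rows.
idx = rng.choice(X_train.shape[0], size=100, replace=False)
background = X_train[idx]
print(background.shape)  # (100, 500)

# With shap installed, you would pass `background` to the explainer, e.g.
#   explainer = shap.TreeExplainer(model, data=background)
# and, if the additivity check trips on your pipeline, disable it:
#   explainer.shap_values(X_test, check_additivity=False)
```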

I apologise if there's a lot to wade through there :)
