Aux_column #44
Comments
Please can you supply the config for the optimisation where the number of expected descriptors differs from the number the model was trained on? It would be helpful to see how you are supplying your manually defined descriptors. I think you might need to supply them as a composite descriptor. If your precomputed descriptor is not set up correctly, the trials using that descriptor set may be throwing an error and being pruned. I'd need to see your config to check.

Example JSONs can be found in the examples folder, e.g. https://github.com/MolecularAI/QSARtuna/blob/master/examples/optimization/ChemProp_drd2_50.json and you can export a config to JSON from within a notebook by exporting the serialized OptimizationConfig or BuildConfig, like here: https://molecularai.github.io/QSARtuna/notebooks/QSARtuna_Tutorial.html#Pick-the-best-trial-and-build-a-model-for-it. Once you have the config, you can run the qptuna command line interface like so: https://molecularai.github.io/QSARtuna/README.html#running-via-cli

The auxiliary column is primarily intended for proteochemometric models and similar cases, where you have a defined space that is concatenated as a covariate to your conventional chemical modelling space. Covariate modelling in itself could be useful in a variety of other situations, however.
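As a minimal sketch of checking a config JSON before sharing or running it, the snippet below round-trips a config through a file and lists the descriptors it declares. It uses only Python's standard `json` module rather than any QSARtuna loader, and the key names (`descriptors`, `name`, `parameters`), the file name `desc.csv` and the helper functions are hypothetical placeholders, not the actual QSARtuna schema; the example JSONs linked above show the real layout.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical config fragment: these key names are placeholders for
# illustration and are NOT guaranteed to match the real QSARtuna schema.
example_config = {
    "descriptors": [
        {"name": "ECFP", "parameters": {"radius": 3, "nBits": 2048}},
        {"name": "PrecomputedDescriptor", "parameters": {"file": "desc.csv"}},
    ]
}


def save_config(config: dict, path: Path) -> None:
    """Write the config dictionary to disk as indented JSON."""
    path.write_text(json.dumps(config, indent=2))


def load_config(path: Path) -> dict:
    """Read a JSON config file back into a plain dictionary."""
    return json.loads(path.read_text())


def descriptor_names(config: dict) -> list[str]:
    """Return the names of all descriptors declared in the config."""
    return [d["name"] for d in config.get("descriptors", [])]


with tempfile.TemporaryDirectory() as tmp:
    config_path = Path(tmp) / "config.json"
    save_config(example_config, config_path)
    loaded = load_config(config_path)

print(descriptor_names(loaded))
```

Listing the declared descriptors this way makes it easy to see at a glance whether a precomputed descriptor was actually included in the set being optimised.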
So yes, I did end up using the PrecomputedDescriptor supplied within a composite descriptor. The precomputed descriptor file contains my various physicochemical descriptors, processed from sets like Mordred, Mold2 and RDKit. However, for the model to take in all of these descriptors, I've concatenated them into one column (external_desc), comma separated. My concern is that the identity of those features is lost because I combined them into a single component. When I do
I guess from this I'd want to know why the model is predicting NaN values, and whether there is any advantage to supplying descriptors when concatenated (as in the precomputed descriptor file: all in one column, comma separated, but individual physicochemical descriptors rather than fingerprints). How would the model recognise them, given that they're now just numbers in a column (which would be fine for, say, a fingerprint)? And why does including I also get various SHAP issues when I look into feature importances. Errors including I apologise if there's a lot to wade through there :)
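One way to test whether the concatenated descriptor strings themselves contain missing or non-finite entries, which may or may not be what leads to the NaN predictions here, is to parse a row and map each value back to its feature name. This is a rough sketch in plain Python, assuming the comma-separated layout described above; the feature names, the sample row and both helper functions are hypothetical.

```python
import math


def parse_descriptor_string(raw: str) -> list[float]:
    """Split a comma-separated descriptor string into floats.

    Tokens that cannot be parsed (e.g. empty fields) become NaN so that
    the position of every feature is preserved.
    """
    values = []
    for token in raw.split(","):
        try:
            values.append(float(token.strip()))
        except ValueError:
            values.append(math.nan)
    return values


def find_bad_features(values: list[float], names: list[str]) -> list[str]:
    """Return the names of features whose value is NaN or infinite."""
    if len(values) != len(names):
        raise ValueError(
            f"got {len(values)} values for {len(names)} feature names"
        )
    return [name for name, v in zip(names, values) if not math.isfinite(v)]


# Hypothetical feature names and row, for illustration only.
feature_names = ["MolWt", "LogP", "NumRotatableBonds", "TPSA"]
row = "180.16, 1.31, , inf"

values = parse_descriptor_string(row)
bad = find_bad_features(values, feature_names)
print(bad)
```

Keeping a separate, ordered list of feature names alongside the concatenated column is also a simple way to retain the identity of each value for later inspection.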
Hi,
Sorry for opening another issue.
When building a classification model, I supply the model with a training set of SMILES and Class (1s or 0s) as well as a few hundred other descriptors. I see that when the optimisation is done and the model is built, the other descriptors I supply aren't being used by the model.
When I include ECFP or PhyschemDescriptors in the descriptor section of the config, `model.predictor.n_features_in` is the length of either ECFP or PhyschemDescriptors and doesn't include the descriptors I supplied in the training set.

I've noticed there's an `aux_column` arg that looks to take an auxiliary descriptor/feature. How do I use this to supply all the other descriptors in my training set? I've tried supplying just one column, say NumRotatableBonds, but when I predict using that model, I get the error: X has 210 features, RandomForestClassifier expects 211. I'm not entirely sure what X is here or how this error has come about.

Also, the tutorial mentions that configs can be loaded from JSON files, but doesn't show how to do that. How would I load a config stored in a JSON file?
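The error message suggests the model was fitted on one more feature than it received at prediction time. One possible explanation, stated here as an assumption rather than confirmed QSARtuna behaviour, is that the auxiliary value is appended to the descriptor vector during training but was not supplied when predicting. The sketch below reproduces that mismatch in plain Python; `build_feature_row` and `check_feature_count` are hypothetical helpers, not library functions.

```python
from typing import Optional, Sequence


def build_feature_row(
    descriptors: Sequence[float], aux: Optional[float] = None
) -> list[float]:
    """Concatenate an optional auxiliary value onto a descriptor vector."""
    row = list(descriptors)
    if aux is not None:
        row.append(aux)
    return row


def check_feature_count(row: Sequence[float], expected: int) -> None:
    """Raise if the row width differs from what the model was fitted on."""
    if len(row) != expected:
        raise ValueError(
            f"X has {len(row)} features, model expects {expected}"
        )


# Training: 210 descriptor values plus one auxiliary value.
descriptors = [0.0] * 210
train_row = build_feature_row(descriptors, aux=4.0)
n_features_in = len(train_row)

# Prediction with the auxiliary value supplied passes the check.
check_feature_count(build_feature_row(descriptors, aux=2.0), n_features_in)

# Prediction without the auxiliary value triggers the mismatch.
message = ""
try:
    check_feature_count(build_feature_row(descriptors), n_features_in)
except ValueError as err:
    message = str(err)
print(message)
```

Under this assumption, X is the feature matrix passed to the estimator at prediction time, and the fix would be to supply the same auxiliary column at prediction as was used for training.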
Any help would be massively appreciated.
Thanks
Edit: Actually this could be merged with #43