Add LIPID MAPS fragment optimization dataset #394

Merged (3 commits) on Oct 14, 2024
2 changes: 2 additions & 0 deletions README.md
@@ -274,6 +274,8 @@ These are currently used to find a minimum energy conformation of a molecule.
|`OpenFF Iodine Fragment Opt v1.0` | [2024-09-10-OpenFF-Iodine-Fragment-Opt-v1.0](https://github.com/openforcefield/qca-dataset-submission/tree/master/submissions/2024-09-10-OpenFF-Iodine-Fragment-Opt-v1.0) | B3LYP-D3BJ/DZVP optimized conformers for a variety of I-containing fragment molecules | C, O, I, S, F, Br, Cl, N, H ||
| `OpenFF Sulfur Optimization Training Coverage Supplement v1.0` | [2024-09-11-OpenFF-Sulfur-Optimization-Training-Coverage-Supplement-v1.0](https://github.com/openforcefield/qca-dataset-submission/tree/master/submissions/2024-09-11-OpenFF-Sulfur-Optimization-Training-Coverage-Supplement-v1.0) | Additional optimization training data for Sage sulfur and phosphorus parameters | C, S, F, O, H, Cl, Br, P, N | |
| `OpenFF Sulfur Optimization Benchmarking Coverage Supplement v1.0` | [2024-09-18-OpenFF-Sulfur-Optimization-Benchmarking-Coverage-Supplement-v1.0](https://github.com/openforcefield/qca-dataset-submission/tree/master/submissions/2024-09-18-OpenFF-Sulfur-Optimization-Benchmarking-Coverage-Supplement-v1.0) | Additional optimization benchmarking data for Sage sulfur and phosphorus parameters | S, P, Cl, C, N, O, H, Br, F | |
| `OpenFF Lipid Optimization Training Supplement v1.0` | [2024-10-08-OpenFF-Lipid-Optimization-Training-Supplement-v1.0](https://github.com/openforcefield/qca-dataset-submission/tree/master/submissions/2024-10-08-OpenFF-Lipid-Optimization-Training-Supplement-v1.0) | Additional optimization training data for Sage from representative LIPID MAPS fragments | I, Br, O, H, P, C, N, Cl, F, S | |


# TorsionDrive Datasets
These are currently used to perform a complete rotation of one or more selected bonds, where optimizations are performed over a discrete set of angles.
@@ -0,0 +1,67 @@
# OpenFF Lipid Optimization Training Supplement v1.0

## Description

An optimization data set created to improve the training coverage of lipid-like
molecules in Sage. The molecules in this data set were selected from the [LIPID
MAPS](https://www.lipidmaps.org/) database via
[cura](https://github.com/ntBre/curato/tree/6039bea5c64f8cd6b374fd12b5fa3b355898d98b)
after being fragmented by the [XFF
fragmentation](https://github.com/XtalPi-XFF/2023_XFF_paper/tree/b84ef6079d24ebd2e86f78a495e3257b375255fa/fragmentation)
algorithm, clustered on the 2048-bit, radius-3 Morgan fingerprint from RDKit by
the `LeaderPicker.LazyBitVectorPick` algorithm, also from RDKit, with a distance
threshold of 0.6559. Placeholder dummy atoms in the fragments were generally
filled with H atoms, and `S(=O)(=O)*` groups were replaced with `S(=O)([O-])`.
The candidate fragments were further restricted to those with between 3 and 70
heavy atoms and containing only the elements Cl, P, Br, I, H, C, O, N, F, and S.
Candidates with InChIKeys matching existing Sage training or benchmarking data
were also filtered out.
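
The clustering step described above can be illustrated with a small, dependency-free sketch of leader-style (sphere-exclusion) picking. This is not the RDKit implementation and makes several assumptions: fingerprints are modeled as sets of on-bit indices, the distance is taken to be Tanimoto distance, and a candidate is kept when its distance to every previously kept item is at least the threshold (whether RDKit treats the boundary as inclusive is not claimed here). The helper names `tanimoto_distance` and `leader_pick` and the toy fingerprints are hypothetical.

```python
from typing import FrozenSet, List, Sequence


def tanimoto_distance(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Return 1 - |a & b| / |a | b| for two sets of on-bit indices."""
    union = len(a | b)
    if union == 0:
        return 0.0  # two empty fingerprints are treated as identical
    return 1.0 - len(a & b) / union


def leader_pick(fps: Sequence[FrozenSet[int]], threshold: float) -> List[int]:
    """Greedy leader picking over fingerprints, visited in input order."""
    leaders: List[int] = []
    for i, fp in enumerate(fps):
        # Keep fp only if it is at least `threshold` away from every leader so far.
        if all(tanimoto_distance(fp, fps[j]) >= threshold for j in leaders):
            leaders.append(i)
    return leaders


# Toy fingerprints (hypothetical data, not taken from the dataset).
toy_fps = [
    frozenset({1, 2, 3, 4}),
    frozenset({1, 2, 3, 5}),   # distance 0.4 to the first entry
    frozenset({10, 11, 12}),   # distance 1.0 to the first entry
    frozenset({10, 11, 13}),   # distance 0.5 to the third entry
]
picked = leader_pick(toy_fps, threshold=0.6559)
print(picked)
```

With the 0.6559 threshold quoted above, the second and fourth toy fingerprints fall inside the exclusion radius of an earlier leader, so only one representative per cluster survives. In the actual pipeline the fingerprints would be the 2048-bit, radius-3 Morgan fingerprints and the picking would be done by RDKit.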

## General Information

* Date: 2024-10-08
* Class: OpenFF Optimization Dataset
* Purpose: Improve training coverage in Sage
* Name: OpenFF Lipid Optimization Training Supplement v1.0
* Number of unique molecules: 3971
* Number of filtered molecules: 32
Contributor: Just checking -- are these number of filtered molecules the ones that were filtered out, i.e. are not part of the number of unique molecules above?

Collaborator Author: Yes that's right. This number is usually zero (at least in my previous submissions), but this time qcsubmit actually filtered out 32 molecules, I think for timeouts in conformer generation. My input SMILES file had 4003 entries, so these add up to that total at least.

Contributor: Thanks Brent!

* Number of conformers: 29770
* Number of conformers per molecule (min, mean, max): 1, 7.50, 10
* Mean molecular weight: 291.73
* Max molecular weight: 1016.29
* Charges: -4.0, -1.0, 0.0, 1.0
* Dataset submitter: Brent Westbrook
* Dataset generator: Brent Westbrook
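
The counts in this list can be cross-checked with a few lines of arithmetic. The sketch below only uses numbers quoted in this section (the 4003 input entries come from the review reply); the variable names are illustrative.

```python
# Values copied from the General Information list and the review reply.
n_unique = 3971        # "Number of unique molecules"
n_filtered = 32        # "Number of filtered molecules"
n_conformers = 29770   # "Number of conformers"
n_input = 4003         # entries in the input SMILES file, per the review reply

# Unique plus filtered molecules should account for every input entry.
total = n_unique + n_filtered

# Mean conformers per molecule, formatted to two decimals as in the README.
mean_conformers = n_conformers / n_unique

print(total == n_input, f"{mean_conformers:.2f}")
```

Both checks agree with the reported values: the unique and filtered molecules sum to the input total, and the mean number of conformers per molecule rounds to 7.50.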

## QCSubmit Generation Pipeline

* `generate-dataset.py`: This script shows how the dataset was prepared from the input file `input.smi`.
* `main.py`: This script shows how the dataset was prepared from the initial cura database.
* `inchis.dat`: The list of InChIKeys present in existing Sage training and
benchmarking data, used by `main.py`.
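
As a rough illustration of how a list like `inchis.dat` could be used to drop candidates that are already covered, the sketch below filters a mapping of InChIKeys to SMILES against a set of known keys. It is not the code in `main.py`: the one-key-per-line file layout, the helper names `load_inchikeys` and `drop_known`, and the placeholder keys are all assumptions made for this example.

```python
import tempfile
from pathlib import Path
from typing import Dict, Iterable, Set


def load_inchikeys(path: Path) -> Set[str]:
    """Read one InChIKey per line, ignoring blank lines (assumed file layout)."""
    return {line.strip() for line in path.read_text().splitlines() if line.strip()}


def drop_known(candidates: Dict[str, str], known: Iterable[str]) -> Dict[str, str]:
    """Keep only candidates (InChIKey -> SMILES) whose key is not already known."""
    known_set = set(known)
    return {key: smi for key, smi in candidates.items() if key not in known_set}


# Demonstration with a temporary stand-in for inchis.dat and placeholder keys.
with tempfile.TemporaryDirectory() as tmp:
    known_file = Path(tmp) / "inchis.dat"
    known_file.write_text("AAAAAAAAAAAAAA-BBBBBBBBBB-N\n\n")
    known = load_inchikeys(known_file)

candidates = {
    "AAAAAAAAAAAAAA-BBBBBBBBBB-N": "CCO",   # placeholder key, already known
    "CCCCCCCCCCCCCC-DDDDDDDDDD-N": "CCCC",  # placeholder key, new
}
kept = drop_known(candidates, known)
print(kept)
```

Using a set for the known keys keeps each membership test constant-time, which matters little for a few thousand fragments but keeps the filter simple to reason about.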

## QCSubmit Manifest

* `generate-dataset.py`: Script describing dataset generation and submission
* `input-environment.yaml`: Environment file used to create the Python environment for the script
* `full-environment.yaml`: Fully-resolved environment used to execute the script
* `opt.toml`: Experimental [qcaide](https://github.com/ntBre/qcaide) input file for defining
variables used throughout the QCA submission process
* `dataset.json.bz2`: Compressed dataset ready for submission
* `dataset.pdf`: Visualization of dataset molecules
* `output.smi`: SMILES strings for dataset molecules

## Metadata
* Elements: {I, Br, O, H, P, C, N, Cl, F, S}
* Spec: default
    * basis: DZVP
    * implicit_solvent: None
    * keywords: {}
    * maxiter: 200
    * method: B3LYP-D3BJ
    * program: psi4
    * SCF properties:
        * dipole
        * quadrupole
        * wiberg_lowdin_indices
        * mayer_indices
Git LFS file not shown
Binary file not shown.
@@ -0,0 +1,306 @@
name: qcarchive-user-submit
channels:
- openeye
- conda-forge
dependencies:
- _libgcc_mutex=0.1=conda_forge
- _openmp_mutex=4.5=2_gnu
- ambertools=23.3=py311h9fea076_6
- annotated-types=0.6.0=pyhd8ed1ab_0
- anyio=4.2.0=pyhd8ed1ab_0
- apsw=3.46.0.0=py311h3ea06b8_0
- argcomplete=3.2.2=pyhd8ed1ab_0
- argon2-cffi=23.1.0=pyhd8ed1ab_0
- argon2-cffi-bindings=21.2.0=py311h459d7ec_4
- arpack=3.8.0=nompi_h0baa96a_101
- arrow=1.3.0=pyhd8ed1ab_0
- asttokens=2.4.1=pyhd8ed1ab_0
- astunparse=1.6.3=pyhd8ed1ab_0
- async-lru=2.0.4=pyhd8ed1ab_0
- attrs=23.2.0=pyh71513ae_0
- babel=2.14.0=pyhd8ed1ab_0
- basis_set_exchange=0.9.1=pyhd8ed1ab_0
- beautifulsoup4=4.12.3=pyha770c72_0
- bleach=6.1.0=pyhd8ed1ab_0
- blosc=1.21.5=h0f2a231_0
- brotli=1.1.0=hd590300_1
- brotli-bin=1.1.0=hd590300_1
- brotli-python=1.1.0=py311hb755f60_1
- bson=0.5.9=py_0
- bzip2=1.0.8=hd590300_5
- c-ares=1.26.0=hd590300_0
- c-blosc2=2.13.1=hb4ffafa_0
- ca-certificates=2024.8.30=hbcca054_0
- cached-property=1.5.2=hd8ed1ab_1
- cached_property=1.5.2=pyha770c72_1
- cachetools=5.3.2=pyhd8ed1ab_0
- cairo=1.18.0=h3faef2a_0
- certifi=2024.8.30=pyhd8ed1ab_0
- cffi=1.16.0=py311hb3a22ac_0
- chardet=5.2.0=py311h38be061_1
- charset-normalizer=3.3.2=pyhd8ed1ab_0
- click=8.1.7=unix_pyh707e725_0
- colorama=0.4.6=pyhd8ed1ab_0
- comm=0.2.1=pyhd8ed1ab_0
- contourpy=1.2.0=py311h9547e67_0
- cudatoolkit=11.8.0=h4ba93d1_12
- cycler=0.12.1=pyhd8ed1ab_0
- debugpy=1.8.0=py311hb755f60_1
- decorator=5.1.1=pyhd8ed1ab_0
- defusedxml=0.7.1=pyhd8ed1ab_0
- entrypoints=0.4=pyhd8ed1ab_0
- exceptiongroup=1.2.0=pyhd8ed1ab_2
- executing=2.0.1=pyhd8ed1ab_0
- expat=2.5.0=hcb278e6_1
- fftw=3.3.10=nompi_hc118613_108
- font-ttf-dejavu-sans-mono=2.37=hab24e00_0
- font-ttf-inconsolata=3.000=h77eed37_0
- font-ttf-source-code-pro=2.038=h77eed37_0
- font-ttf-ubuntu=0.83=h77eed37_1
- fontconfig=2.14.2=h14ed4e7_0
- fonts-conda-ecosystem=1=0
- fonts-conda-forge=1=0
- fonttools=4.47.2=py311h459d7ec_0
- fqdn=1.5.1=pyhd8ed1ab_0
- freetype=2.12.1=h267a509_2
- freetype-py=2.3.0=pyhd8ed1ab_0
- gettext=0.21.1=h27087fc_0
- greenlet=3.0.3=py311hb755f60_0
- hdf4=4.2.15=h2a13503_7
- hdf5=1.14.3=nompi_h4f84152_100
- icu=73.2=h59595ed_0
- idna=3.6=pyhd8ed1ab_0
- importlib-metadata=7.0.1=pyha770c72_0
- importlib_metadata=7.0.1=hd8ed1ab_0
- importlib_resources=6.1.1=pyhd8ed1ab_0
- iniconfig=2.0.0=pyhd8ed1ab_0
- ipykernel=6.29.0=pyhd33586a_0
- ipython=8.20.0=pyh707e725_0
- ipywidgets=8.1.1=pyhd8ed1ab_0
- isoduration=20.11.0=pyhd8ed1ab_0
- jedi=0.19.1=pyhd8ed1ab_0
- jinja2=3.1.3=pyhd8ed1ab_0
- joblib=1.3.2=pyhd8ed1ab_0
- json5=0.9.14=pyhd8ed1ab_0
- jsonpointer=2.4=py311h38be061_3
- jsonschema=4.21.1=pyhd8ed1ab_0
- jsonschema-specifications=2023.12.1=pyhd8ed1ab_0
- jsonschema-with-format-nongpl=4.21.1=pyhd8ed1ab_0
- jupyter-lsp=2.2.2=pyhd8ed1ab_0
- jupyter_client=8.6.0=pyhd8ed1ab_0
- jupyter_core=5.7.1=py311h38be061_0
- jupyter_events=0.9.0=pyhd8ed1ab_0
- jupyter_server=2.12.5=pyhd8ed1ab_0
- jupyter_server_terminals=0.5.2=pyhd8ed1ab_0
- jupyterlab=4.0.12=pyhd8ed1ab_0
- jupyterlab_pygments=0.3.0=pyhd8ed1ab_0
- jupyterlab_server=2.25.2=pyhd8ed1ab_0
- jupyterlab_widgets=3.0.9=pyhd8ed1ab_0
- keyutils=1.6.1=h166bdaf_0
- kiwisolver=1.4.5=py311h9547e67_1
- krb5=1.21.2=h659d440_0
- lcms2=2.16=hb7c19ff_0
- ld_impl_linux-64=2.40=h41732ed_0
- lerc=4.0.0=h27087fc_0
- libaec=1.1.2=h59595ed_1
- libblas=3.9.0=21_linux64_openblas
- libboost=1.82.0=h6fcfa73_6
- libboost-python=1.82.0=py311h92ebd52_6
- libbrotlicommon=1.1.0=hd590300_1
- libbrotlidec=1.1.0=hd590300_1
- libbrotlienc=1.1.0=hd590300_1
- libcblas=3.9.0=21_linux64_openblas
- libcurl=8.5.0=hca28451_0
- libdeflate=1.19=hd590300_0
- libedit=3.1.20191231=he28a2e2_2
- libev=4.33=hd590300_2
- libexpat=2.5.0=hcb278e6_1
- libffi=3.4.2=h7f98852_5
- libgcc=14.1.0=h77fa898_1
- libgcc-ng=14.1.0=h69a702a_1
- libgfortran-ng=13.2.0=h69a702a_4
- libgfortran5=13.2.0=ha4646dd_4
- libglib=2.78.3=h783c2da_0
- libgomp=14.1.0=h77fa898_1
- libiconv=1.17=hd590300_2
- libjpeg-turbo=3.0.0=hd590300_1
- liblapack=3.9.0=21_linux64_openblas
- libnetcdf=4.9.2=nompi_h9612171_113
- libnghttp2=1.58.0=h47da74e_1
- libnsl=2.0.1=hd590300_0
- libopenblas=0.3.26=pthreads_h413a1c8_0
- libpng=1.6.39=h753d276_0
- libsodium=1.0.18=h36c2ea0_1
- libsqlite=3.46.0=hde9e2c9_0
- libssh2=1.11.0=h0841786_0
- libstdcxx-ng=13.2.0=h7e041cc_4
- libtiff=4.6.0=ha9c0a0a_2
- libuuid=2.38.1=h0b41bf4_0
- libwebp-base=1.3.2=hd590300_0
- libxcb=1.15=h0b41bf4_0
- libxcrypt=4.4.36=hd590300_1
- libxml2=2.12.4=h232c23b_1
- libzip=1.10.1=h2629f0a_3
- libzlib=1.2.13=hd590300_5
- lz4-c=1.9.4=hcb278e6_0
- lzo=2.10=h516909a_1000
- markupsafe=2.1.4=py311h459d7ec_0
- matplotlib-base=3.8.2=py311h54ef318_0
- matplotlib-inline=0.1.6=pyhd8ed1ab_0
- mda-xdrlib=0.2.0=pyhd8ed1ab_0
- mdtraj=1.9.9=py311h90fe790_1
- mistune=3.0.2=pyhd8ed1ab_0
- msgpack-python=1.0.7=py311h9547e67_0
- munkres=1.1.4=pyh9f0ad1d_0
- nbclient=0.8.0=pyhd8ed1ab_0
- nbconvert-core=7.14.2=pyhd8ed1ab_0
- nbformat=5.9.2=pyhd8ed1ab_0
- ncurses=6.5=h59595ed_0
- nest-asyncio=1.6.0=pyhd8ed1ab_0
- netcdf-fortran=4.6.1=nompi_hacb5139_103
- networkx=3.2.1=pyhd8ed1ab_0
- nomkl=1.0=h5ca1d4c_0
- notebook=7.0.7=pyhd8ed1ab_0
- notebook-shim=0.2.3=pyhd8ed1ab_0
- numexpr=2.8.8=py311h039bad6_100
- numpy=1.26.3=py311h64a7726_0
- ocl-icd=2.3.1=h7f98852_0
- ocl-icd-system=1.0.0=1
- openeye-toolkits=2023.1.1=py311_0
- openff-amber-ff-ports=0.0.4=pyhca7485f_0
- openff-forcefields=2024.01.0=pyhca7485f_0
- openff-interchange=0.3.18=pyhd8ed1ab_0
- openff-interchange-base=0.3.18=pyhd8ed1ab_0
- openff-models=0.1.1=pyhca7485f_0
- openff-qcsubmit=0.53.0=pyhd8ed1ab_1
- openff-toolkit=0.15.1=pyhd8ed1ab_0
- openff-toolkit-base=0.15.1=pyhd8ed1ab_0
- openff-units=0.2.1=pyh1a96a4e_0
- openff-utilities=0.1.12=pyhd8ed1ab_0
- openjpeg=2.5.0=h488ebb8_3
- openmm=8.1.1=py311h9766050_0
- openssl=3.3.2=hb9d3cd8_0
- overrides=7.7.0=pyhd8ed1ab_0
- packaging=23.2=pyhd8ed1ab_0
- packmol=20.010=h86c2bf4_0
- pandas=2.2.0=py311h320fe9a_0
- pandocfilters=1.5.0=pyhd8ed1ab_0
- panedr=0.8.0=pyhd8ed1ab_0
- parmed=4.2.2=py311hb755f60_1
- parso=0.8.3=pyhd8ed1ab_0
- pcre2=10.42=hcad00b1_0
- perl=5.32.1=7_hd590300_perl5
- pexpect=4.9.0=pyhd8ed1ab_0
- pickleshare=0.7.5=py_1003
- pillow=10.2.0=py311ha6c5da5_0
- pint=0.21=pyhd8ed1ab_0
- pip=23.3.2=pyhd8ed1ab_0
- pixman=0.43.2=h59595ed_0
- pkgutil-resolve-name=1.3.10=pyhd8ed1ab_1
- platformdirs=4.2.0=pyhd8ed1ab_0
- pluggy=1.4.0=pyhd8ed1ab_0
- prometheus_client=0.19.0=pyhd8ed1ab_0
- prompt-toolkit=3.0.42=pyha770c72_0
- psutil=5.9.8=py311h459d7ec_0
- pthread-stubs=0.4=h36c2ea0_1001
- ptyprocess=0.7.0=pyhd3deb0d_0
- pure_eval=0.2.2=pyhd8ed1ab_0
- py-cpuinfo=9.0.0=pyhd8ed1ab_0
- pycairo=1.25.1=py311h8feb60e_0
- pycalverter=1.6.1=py_0
- pycparser=2.21=pyhd8ed1ab_0
- pydantic=2.6.0=pyhd8ed1ab_0
- pydantic-core=2.16.1=py311h46250e7_0
- pyedr=0.8.0=pyhd8ed1ab_0
- pygments=2.17.2=pyhd8ed1ab_0
- pyjwt=2.8.0=pyhd8ed1ab_0
- pyparsing=3.1.1=pyhd8ed1ab_0
- pysocks=1.7.1=pyha2e5f31_6
- pytables=3.9.2=py311h10c7f7f_1
- pytest=8.0.0=pyhd8ed1ab_0
- python=3.11.7=hab00c5b_1_cpython
- python-constraint=1.4.0=py_0
- python-dateutil=2.8.2=pyhd8ed1ab_0
- python-fastjsonschema=2.19.1=pyhd8ed1ab_0
- python-json-logger=2.0.7=pyhd8ed1ab_0
- python-tzdata=2023.4=pyhd8ed1ab_0
- python_abi=3.11=4_cp311
- pytz=2023.4=pyhd8ed1ab_0
- pyyaml=6.0.1=py311h459d7ec_1
- pyzmq=25.1.2=py311h34ded2d_0
- qcelemental=0.27.1=pyhd8ed1ab_0
- qcportal=0.55=pyhd8ed1ab_0
- rdkit=2023.09.4=py311h4c2f14b_0
- readline=8.2=h8228510_1
- referencing=0.33.0=pyhd8ed1ab_0
- regex=2023.12.25=py311h459d7ec_0
- reportlab=4.0.9=py311h459d7ec_0
- requests=2.31.0=pyhd8ed1ab_0
- rfc3339-validator=0.1.4=pyhd8ed1ab_0
- rfc3986-validator=0.1.1=pyh9f0ad1d_0
- rlpycairo=0.2.0=pyhd8ed1ab_0
- rpds-py=0.17.1=py311h46250e7_0
- scipy=1.12.0=py311h64a7726_2
- send2trash=1.8.2=pyh41d4057_0
- setuptools=69.0.3=pyhd8ed1ab_0
- six=1.16.0=pyh6c4a22f_0
- smirnoff99frosst=1.1.0=pyh44b312d_0
- snappy=1.1.10=h9fff704_0
- sniffio=1.3.0=pyhd8ed1ab_0
- soupsieve=2.5=pyhd8ed1ab_1
- sqlalchemy=2.0.25=py311h459d7ec_0
- sqlite=3.46.0=h6d4b2fc_0
- stack_data=0.6.2=pyhd8ed1ab_0
- tabulate=0.9.0=pyhd8ed1ab_1
- terminado=0.18.0=pyh0d859eb_0
- tinycss2=1.2.1=pyhd8ed1ab_0
- tk=8.6.13=noxft_h4845f30_101
- tomli=2.0.1=pyhd8ed1ab_0
- tornado=6.3.3=py311h459d7ec_1
- tqdm=4.66.1=pyhd8ed1ab_0
- traitlets=5.14.1=pyhd8ed1ab_0
- types-python-dateutil=2.8.19.20240106=pyhd8ed1ab_0
- typing-extensions=4.9.0=hd8ed1ab_0
- typing_extensions=4.9.0=pyha770c72_0
- typing_utils=0.1.0=pyhd8ed1ab_0
- tzdata=2023d=h0c530f3_0
- unidecode=1.3.8=pyhd8ed1ab_0
- uri-template=1.3.0=pyhd8ed1ab_0
- urllib3=2.2.0=pyhd8ed1ab_0
- wcwidth=0.2.13=pyhd8ed1ab_0
- webcolors=1.13=pyhd8ed1ab_0
- webencodings=0.5.1=pyhd8ed1ab_2
- websocket-client=1.7.0=pyhd8ed1ab_0
- wheel=0.42.0=pyhd8ed1ab_0
- widgetsnbextension=4.0.9=pyhd8ed1ab_0
- xmltodict=0.13.0=pyhd8ed1ab_0
- xorg-kbproto=1.0.7=h7f98852_1002
- xorg-libice=1.1.1=hd590300_0
- xorg-libsm=1.2.4=h7391055_0
- xorg-libx11=1.8.7=h8ee46fc_0
- xorg-libxau=1.0.11=hd590300_0
- xorg-libxdmcp=1.1.3=h7f98852_0
- xorg-libxext=1.3.4=h0b41bf4_2
- xorg-libxrender=0.9.11=hd590300_0
- xorg-libxt=1.3.0=hd590300_1
- xorg-renderproto=0.11.1=h7f98852_1002
- xorg-xextproto=7.3.0=h0b41bf4_1003
- xorg-xproto=7.0.31=h7f98852_1007
- xz=5.2.6=h166bdaf_0
- yaml=0.2.5=h7f98852_2
- zeromq=4.3.5=h59595ed_0
- zipp=3.17.0=pyhd8ed1ab_0
- zlib=1.2.13=hd590300_5
- zlib-ng=2.0.7=h0b41bf4_0
- zstandard=0.22.0=py311haa97af0_0
- zstd=1.5.5=hfc55251_0
- pip:
    - amberutils==21.0
    - edgembar==0.2
    - mmpbsa-py==16.0
    - packmol-memgen==2023.2.24
    - pdb4amber==22.0
    - pymsmt==22.0
    - pytraj==2.0.6
    - sander==22.0
prefix: /home/brent/mambaforge/envs/qcarchive-user-submit