add calclib #3

Closed
wants to merge 60 commits into from
60 commits
4eaef44
add exceptions
nebfield Jan 16, 2024
cf22ffd
add QueryError
nebfield Jan 16, 2024
95042e9
add InvalidAccessionErrors
nebfield Jan 16, 2024
78a7faf
refactor to include normalise() and private modules
nebfield Jan 18, 2024
90990ba
add functions for normalise
nebfield Jan 18, 2024
ea27eab
add combine CLI
nebfield Jan 18, 2024
e938838
add test combineapp
nebfield Jan 19, 2024
3dac918
Update README.md
nebfield Jan 19, 2024
809eda1
use threadpoolexecutor
nebfield Jan 19, 2024
5d8bbe6
migrate to relative imports, add liftover
nebfield Jan 22, 2024
e1ab9dd
export genomebuild
nebfield Jan 22, 2024
7533e04
update lockfile
nebfield Jan 22, 2024
8fcb98a
update lockfile
nebfield Jan 22, 2024
b9381f7
fix import
nebfield Jan 22, 2024
c8dabdd
add liftover to combine cli
nebfield Jan 22, 2024
a99f001
fix test
nebfield Jan 22, 2024
171c71f
improve dunder methods
nebfield Jan 23, 2024
5c583f7
fix test
nebfield Jan 23, 2024
0776f31
add coverage
nebfield Jan 23, 2024
3442196
move read_header
nebfield Jan 23, 2024
5809e96
Revert "add coverage"
nebfield Jan 23, 2024
fda919f
add progress bar
nebfield Jan 26, 2024
63fcca1
add autoapi and docs
nebfield Jan 26, 2024
f8cc11d
ignore autoapi
nebfield Jan 26, 2024
db217b4
add targetvariants class
nebfield Jan 26, 2024
7ce6798
simplify reading zstd/gz with xopen and add ID to TargetVariant
nebfield Jan 29, 2024
ff6b66e
set up pyarrow support for TargetVariants
nebfield Jan 29, 2024
3f62044
add tqdm
nebfield Jan 30, 2024
2a7e36e
add pyarrow and exports
nebfield Jan 30, 2024
0223f7c
fix test
nebfield Jan 30, 2024
f5bf05f
export NormalisedScoringFile
nebfield Jan 30, 2024
7f67aa6
add dataframe classes
nebfield Jan 30, 2024
e0b0ec4
improve context managers
nebfield Jan 30, 2024
0007979
update matchlib
nebfield Jan 31, 2024
114311e
add tests
nebfield Feb 1, 2024
f00b614
write out scorefiles
nebfield Feb 2, 2024
d9c519d
refactor writing out scorefiles
nebfield Feb 2, 2024
dc3f565
make plinkframes internal
nebfield Feb 2, 2024
e636112
add PlinkScoreFiles
nebfield Feb 5, 2024
37b9e24
fix labelling
nebfield Feb 5, 2024
bf06681
use labelled df to write scoring files
nebfield Feb 5, 2024
e7d1221
fix tests
nebfield Feb 6, 2024
605bbbd
extract plinkscorefiles
nebfield Feb 6, 2024
850b297
fix tests
nebfield Feb 6, 2024
d700be2
check log counts
nebfield Feb 7, 2024
8f6bd77
fix test
nebfield Feb 7, 2024
ab09409
add matchlib to docs
nebfield Feb 7, 2024
a588ff5
update exports
nebfield Feb 7, 2024
46754f3
add matchlib pytest
nebfield Feb 7, 2024
1de04fd
install all extras for tests
nebfield Feb 7, 2024
eeccf94
add match cli
nebfield Feb 8, 2024
e0437dd
add matchapp tests
nebfield Feb 8, 2024
ef31b8d
update pytest action
nebfield Feb 8, 2024
ccb0cc7
sort test output to get consistent results
nebfield Feb 8, 2024
229bfe1
bump dependency for node 20
nebfield Feb 8, 2024
d1363c7
add merge cli
nebfield Feb 9, 2024
59948e7
fix test
nebfield Feb 9, 2024
daf2d3f
add support for writing --split and --combined
nebfield Feb 9, 2024
5496494
update docs and tests
nebfield Feb 9, 2024
563a442
add polygenicscore to calclib
nebfield Feb 9, 2024
13 changes: 13 additions & 0 deletions .github/workflows/calclib-pytest.yml
@@ -0,0 +1,13 @@
on:
  push:
    paths:
      - 'pgscatalog.calclib/**.py'
  pull_request:
    paths:
      - 'pgscatalog.calclib/**.py'

jobs:
  pytest-calclib:
    uses: ./.github/workflows/pytest.yaml
    with:
      package-directory: "pgscatalog.calclib"
13 changes: 13 additions & 0 deletions .github/workflows/combineapp-pytest.yml
@@ -0,0 +1,13 @@
on:
  push:
    paths:
      - 'pgscatalog.combineapp/**.py'
  pull_request:
    paths:
      - 'pgscatalog.combineapp/**.py'

jobs:
  combineapp-pytest:
    uses: ./.github/workflows/pytest.yaml
    with:
      package-directory: "pgscatalog.combineapp"
2 changes: 1 addition & 1 deletion .github/workflows/corelib-pytest.yml
@@ -7,7 +7,7 @@ on:
       - 'pgscatalog.corelib/**.py'

 jobs:
-  downloadapp-corelib:
+  pytest-corelib:
     uses: ./.github/workflows/pytest.yaml
     with:
       package-directory: "pgscatalog.corelib"
13 changes: 13 additions & 0 deletions .github/workflows/matchapp-pytest.yml
@@ -0,0 +1,13 @@
on:
  push:
    paths:
      - 'pgscatalog.matchapp/**.py'
  pull_request:
    paths:
      - 'pgscatalog.matchapp/**.py'

jobs:
  matchapp-pytest:
    uses: ./.github/workflows/pytest.yaml
    with:
      package-directory: "pgscatalog.matchapp"
13 changes: 13 additions & 0 deletions .github/workflows/matchlib-pytest.yml
@@ -0,0 +1,13 @@
on:
  push:
    paths:
      - 'pgscatalog.matchlib/**.py'
  pull_request:
    paths:
      - 'pgscatalog.matchlib/**.py'

jobs:
  pytest-matchlib:
    uses: ./.github/workflows/pytest.yaml
    with:
      package-directory: "pgscatalog.matchlib"
8 changes: 4 additions & 4 deletions .github/workflows/pytest.yaml
@@ -19,20 +19,20 @@ jobs:
       - uses: actions/checkout@v4

       - name: Install Python
-        uses: actions/setup-python@v4
+        uses: actions/setup-python@v5
         with:
           python-version: ${{ inputs.python-version }}
           cache: 'pip'

-      - uses: actions/cache@v3
+      - uses: actions/cache@v4
         with:
           path: ${{ inputs.package-directory }}/.venv
           key: venv-${{ hashFiles('poetry.lock') }}

       - run: pip install poetry

-      - run: poetry install --with dev
+      - run: poetry install --with dev --all-extras
         working-directory: ${{ inputs.package-directory }}

       - run: poetry run pytest --doctest-modules
-        working-directory: ${{ inputs.package-directory }}
+        working-directory: ${{ inputs.package-directory }}
2 changes: 2 additions & 0 deletions .gitignore
@@ -7,6 +7,7 @@
 */dist/*
 build
 _build
+docs/autoapi
 .cache
 *.so

@@ -19,6 +20,7 @@ pip-log.txt

 .DS_Store
 .idea/*
+*/.idea/*
 .python-version
 .vscode/*

23 changes: 20 additions & 3 deletions README.md
@@ -5,9 +5,9 @@ This repository contains Python applications and libraries for working with poly

 These CLI applications are used by the PGS Catalog Calculator workflow.

-| Application | Description | Link |
-|-----------------------|------------------------------------------------|-------------------------------------------------------|
-| `pgscatalog-download` | Download scoring files from the PGS Catalog | [README](pgscatalog.downloadapp/pgscatalog/README.md) |
+| Application | Description | Link |
+|-----------------------|------------------------------------------------|------------------------------------------------------|
+| `pgscatalog-download` | Download scoring files from the PGS Catalog | [README](pgscatalog.downloadapp/README.md) |
 | `pgscatalog-combine` | Combine scoring files into a consistent format |


@@ -23,6 +23,23 @@ If you write code to work with PGS, we publish some libraries that might be hel
 | `pgscatalog-calclib` | Ancestry estimation and normalisation |


+## Installation
+
+### pip
+
+If you want to use the packages in this repository, use pip:
+
+### Local install for developers
+
+If you want to make changes to a package or application, it's simplest to clone the repository and install packages in editable mode.
+
+```
+$ git clone https://github.com/PGScatalog/pygscatalog.git
+$ cd pygscatalog/pgscatalog.downloadapp # replace with the package you want to edit
+$ poetry add --editable ../pgscatalog.corelib # downloadapp requires corelib
+$ poetry install
+```
+
 ## Documentation

 Full documentation is provided..
20 changes: 20 additions & 0 deletions docs/Makefile
@@ -0,0 +1,20 @@
# Minimal makefile for Sphinx documentation
#

# You can set these variables from the command line, and also
# from the environment for the first two.
SPHINXOPTS ?=
SPHINXBUILD ?= sphinx-build
SOURCEDIR = .
BUILDDIR = _build

# Put it first so that "make" without argument is like "make help".
help:
	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

.PHONY: help Makefile

# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
24 changes: 24 additions & 0 deletions docs/_templates/autoapi/index.rst
@@ -0,0 +1,24 @@
API Reference
=============

This page contains auto-generated API reference documentation, which describes
pygscatalog libraries.

The information is mostly useful for developers: people that want to write Python
code to work with polygenic scores.


.. toctree::
   :titlesonly:

   {% for page in pages | sort %}
   {#
      Add the top most levels in "pgscatalog.X" to the index file
      This is needed because we don't have __init__.py file in pgscatalog package
      as we use nested implicit namespace packages.
      https://github.com/readthedocs/sphinx-autoapi/issues/298
   #}
   {% if (page.top_level_object or page.name.split('.') | length == 2) and page.display %}
   {{ page.short_name }} <{{ page.include_path }}>
   {% endif %}
   {% endfor %}
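
The template's condition can be mirrored in plain Python to see which pages would be listed in the index. This is only an illustrative sketch: the `Page` dataclass and the sample page names are hypothetical stand-ins, not sphinx-autoapi's real page objects.

```python
from dataclasses import dataclass


@dataclass
class Page:
    """Hypothetical stand-in for an autoapi page object."""

    name: str
    top_level_object: bool = False
    display: bool = True


def in_index(page: Page) -> bool:
    """Mirror the Jinja test: top-level objects, or names shaped like 'pgscatalog.X'."""
    two_parts = len(page.name.split(".")) == 2
    return (page.top_level_object or two_parts) and page.display


def index_entries(pages: list[Page]) -> list[str]:
    """Return the names that would be listed, sorted alphabetically."""
    return sorted(p.name for p in pages if in_index(p))


pages = [
    Page("pgscatalog.matchlib"),
    Page("pgscatalog.corelib"),
    Page("pgscatalog.corelib.scorefiles"),  # nested module: not listed
    Page("pgscatalog.hidden", display=False),  # not displayed: not listed
]
print(index_entries(pages))
```

The two-component check is what lets packages under the implicit `pgscatalog` namespace appear at the top of the index even though no `pgscatalog/__init__.py` exists.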
56 changes: 56 additions & 0 deletions docs/conf.py
@@ -0,0 +1,56 @@
# Configuration file for the Sphinx documentation builder.
#
# For the full list of built-in configuration values, see the documentation:
# https://www.sphinx-doc.org/en/master/usage/configuration.html

# -- Project information -----------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#project-information

project = "pygscatalog"
copyright = "2024, PGS Catalog"
author = "PGS Catalog"

# -- General configuration ---------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#general-configuration

extensions = ["autoapi.extension"]

templates_path = ["_templates"]
exclude_patterns = ["_build", "Thumbs.db", ".DS_Store"]


# -- Options for HTML output -------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#options-for-html-output

html_theme = "alabaster"
html_static_path = ["_static"]

# use autoapi for packages that provide APIs (libraries)
autoapi_dirs = [
    "../pgscatalog.corelib/src/pgscatalog",
    "../pgscatalog.matchlib/src/pgscatalog",
]
# see _templates/autoapi/index.rst for autoapi fix
autoapi_template_dir = "_templates/autoapi"
autoapi_python_use_implicit_namespaces = True
autoapi_keep_files = True

# private members stay hidden because "private-members" is not listed here
autoapi_options = [
    "members",
    "undoc-members",
    "show-inheritance",
    "show-module-summary",
    "imported-members",
]
autoapi_member_order = "groupwise"


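def skip_submodules(app, what, name, obj, skip, options):
    # never document submodules; leave every other decision to autoapi
    if what == "module":
        skip = True
    return skip


def setup(sphinx):
    # register the handler for the autoapi-skip-member event
    sphinx.connect("autoapi-skip-member", skip_submodules)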
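
To make the handler's effect concrete, the sketch below calls a copy of it directly, outside of Sphinx. The member names passed in are hypothetical and `None` stands in for the application and object arguments, which the handler never inspects.

```python
def skip_submodules(app, what, name, obj, skip, options):
    # Copy of the handler from conf.py: always skip modules,
    # otherwise return the decision autoapi had already made.
    if what == "module":
        skip = True
    return skip


# Hypothetical members, as (kind, dotted name, autoapi's prior skip decision)
members = [
    ("module", "pgscatalog.corelib.example_module", False),
    ("class", "pgscatalog.corelib.ExampleClass", False),
    ("function", "pgscatalog.corelib.example_function", True),
]

decisions = {
    name: skip_submodules(None, what, name, None, skip, {})
    for what, name, skip in members
}
for name, skipped in decisions.items():
    print(name, "skipped" if skipped else "documented")
```

Only the module entry changes: classes and functions keep whatever decision they arrived with, so the API reference lists objects at package level rather than repeating them per submodule.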
69 changes: 69 additions & 0 deletions docs/how-to/guides/combine.rst
@@ -0,0 +1,69 @@
How to combine scoring files from the PGS Catalog
=================================================

``pgscatalog-combine`` is a CLI application that makes it easy to combine scoring files into a standardised output.

The process involves:

* extracting important fields from scoring files
* doing some quality control checks
* optionally lifting over variants to a consistent genome build
* writing a long format / melted output file

Input scoring files must follow PGS Catalog standards. The output file is useful for
doing data science tasks, like matching variants across a scoring file and target
genome.
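
As a rough illustration of what "long format" means here, the sketch below combines two tiny in-memory scoring files into one row per score and variant. It is not the ``pgscatalog-combine`` implementation: the accessions, the data, and the single quality control rule are made up, and only a handful of columns are kept.

```python
import csv
import io

# Hypothetical miniature scoring files keyed by accession. Real PGS Catalog
# files carry a commented metadata header and more columns than shown here.
SCOREFILES = {
    "PGS_A": (
        "chr_name\tchr_position\teffect_allele\teffect_weight\n"
        "1\t100\tA\t0.5\n"
        "2\t200\tG\t-0.1\n"
    ),
    "PGS_B": (
        "chr_name\tchr_position\teffect_allele\teffect_weight\n"
        "1\t100\tA\t0.2\n"
        "3\t300\t\t0.9\n"  # missing effect allele: fails the toy QC check
    ),
}

FIELDS = ("chr_name", "chr_position", "effect_allele", "effect_weight")


def combine(scorefiles):
    """Return one row per (accession, variant): a long / melted table."""
    rows = []
    for accession, text in scorefiles.items():
        for record in csv.DictReader(io.StringIO(text), delimiter="\t"):
            if not record["effect_allele"]:
                continue  # toy quality control: drop variants with no effect allele
            row = {"accession": accession}
            row.update((field, record[field]) for field in FIELDS)
            rows.append(row)
    return rows


combined = combine(SCOREFILES)
for row in combined:
    print("\t".join(row.values()))
```

Because every row carries its accession, the same variant can appear once per score, which is what makes the output convenient to join against target genome variants.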

Installation
------------

::

    $ pip install pgscatalog-combine

Usage
-----

Combining PGS Catalog scoring files
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. tip:: It's easiest to get started by downloading scoring files in the same genome build: :doc:`download`

::

    $ pgscatalog-combine -s PGS000001_hmPOS_GRCh38.txt.gz PGS001229_hmPOS_GRCh38.txt.gz -t GRCh38 -o combined.txt

.. note:: If you're combining lots of files, you can compress the output automatically by giving the output file a ``.gz`` extension: ``-o combined.txt.gz``

Lifting over scoring files
~~~~~~~~~~~~~~~~~~~~~~~~~~

It's possible to combine scoring files with different genome builds using liftover.

.. danger:: You should only do this when combining PGS Catalog and custom scoring files, because the PGS Catalog provides harmonised data

First, download chain files from UCSC:

* `hg19ToHg38.over.chain.gz`_
* `hg38ToHg19.over.chain.gz`_

.. _hg19ToHg38.over.chain.gz: https://hgdownload.soe.ucsc.edu/goldenPath/hg38/liftOver/
.. _hg38ToHg19.over.chain.gz: https://hgdownload.soe.ucsc.edu/goldenPath/hg19/liftOver/

Then copy them into a directory (e.g. ``my_chain_dir/``).

Suppose you have a custom scoring file in GRCh37 (``my_scorefile_grch37.txt.gz``) and want to combine it with a PGS Catalog scoring file in GRCh38:

::

    $ pgscatalog-combine -s PGS000001_hmPOS_GRCh38.txt.gz my_scorefile_grch37.txt.gz \
        --chain_dir my_chain_dir/ \
        -t GRCh38 \
        -o combined.txt
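
For intuition about what liftover does to a variant position, here is a deliberately simplified sketch. A chain file describes aligned blocks between two genome builds; positions inside a block shift by that block's offset, and positions that fall in a gap cannot be lifted. The ``Block`` type and all coordinates below are invented for illustration. This does not parse real chain files and is not how ``pgscatalog-combine`` is implemented.

```python
from bisect import bisect_right
from typing import NamedTuple, Optional


class Block(NamedTuple):
    """Hypothetical ungapped alignment block between two builds."""

    source_start: int  # first position of the block in the source build
    target_start: int  # matching position in the target build
    size: int  # number of aligned bases


def lift(position: int, blocks: list[Block]) -> Optional[int]:
    """Map a source position to the target build, or None if it is unmapped.

    Assumes blocks are sorted by source_start and do not overlap.
    """
    starts = [block.source_start for block in blocks]
    index = bisect_right(starts, position) - 1
    if index < 0:
        return None  # before the first block
    block = blocks[index]
    offset = position - block.source_start
    if offset >= block.size:
        return None  # falls in the gap after this block
    return block.target_start + offset


# Toy chain: two blocks with a gap between source positions 100 and 149
chain = [Block(0, 1000, 100), Block(150, 1200, 50)]

for pos in (10, 120, 160):
    print(pos, "->", lift(pos, chain))
```

Variants that land in a gap have no coordinate in the target build, so any liftover step has to decide how to handle them.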

Help
----

::

    $ pgscatalog-combine --help
66 changes: 66 additions & 0 deletions docs/how-to/guides/download.rst
@@ -0,0 +1,66 @@
How to download scoring files from the PGS Catalog
==================================================

``pgscatalog-download`` is a CLI application that makes it easy to download scoring files from the
PGS Catalog with a mixture of PGS, publication, or trait accessions. The application:

* automatically retries downloads if they fail
* validates the checksum of downloaded scoring files
* automatically selects scoring files aligned to a requested genome build
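
The first two behaviours follow a common pattern, sketched below with the standard library only. This is not the application's actual code: the function names are hypothetical, the network is simulated by a function that fails once, and SHA-256 is assumed purely for the example (the checksum type used by the PGS Catalog is not stated here).

```python
import hashlib
import time


def download_with_retry(fetch, expected_sha256, attempts=3, delay=0.0):
    """Call fetch() until it returns bytes whose SHA-256 matches, or give up."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            payload = fetch()
            if hashlib.sha256(payload).hexdigest() != expected_sha256:
                raise ValueError("checksum mismatch")
            return payload
        except (OSError, ValueError) as error:
            last_error = error
            time.sleep(delay * attempt)  # wait a little longer after each failure
    raise RuntimeError(f"giving up after {attempts} attempts") from last_error


# Simulated flaky server: the first call fails, the second succeeds.
DATA = b"rsID\teffect_allele\teffect_weight\n"
calls = {"count": 0}


def flaky_fetch():
    calls["count"] += 1
    if calls["count"] == 1:
        raise ConnectionError("simulated network failure")
    return DATA


result = download_with_retry(flaky_fetch, hashlib.sha256(DATA).hexdigest())
print(calls["count"], result == DATA)
```

Treating a checksum mismatch like a failed download means a corrupted transfer is retried rather than silently written to disk.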

Installation
-------------

::

    $ pip install pgscatalog-download

Usage
-----

Downloading PGS IDs scoring files aligned to GRCh38
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

::

    $ mkdir downloads
    $ pgscatalog-download --pgs PGS000822 PGS001229 --build GRCh38 -o downloads

.. note::

   Setting ``--build`` will download scoring files harmonised by the PGS Catalog. This means scoring files have consistent fields, like genomic coordinates.

Downloading all scores associated with a trait
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To download all scores associated with Alzheimer's disease:

::

    $ mkdir downloads
    $ pgscatalog-download --efo MONDO_0004975 -b GRCh38 -o downloads

By default scores associated with child traits, like late-onset Alzheimer's disease, are included.
To exclude them use:

::

    $ mkdir downloads
    $ pgscatalog-download --efo MONDO_0004975 -b GRCh38 -o downloads --efo_direct

Downloading all scores associated with a publication
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If you're interested in scores from a specific publication:

::

    $ mkdir downloads
    $ pgscatalog-download --pgp PGP000517 -b GRCh38 -o downloads

Help
----

::

    $ pgscatalog-download --help