Commit a70beaf

merge matchlib and matchapp (#2)
* add exceptions
* add QueryError
* add InvalidAccessionErrors
* refactor to include normalise() and private modules
* add functions for normalise
* add combine CLI
* add test combineapp
* Update README.md
* use threadpoolexecutor
* migrate to relative imports, add liftover
* export genomebuild
* update lockfile
* update lockfile
* fix import
* add liftover to combine cli
* fix test
* improve dunder methods
* fix test
* add coverage
* move read_header
* Revert "add coverage" This reverts commit 0776f31.
* add progress bar
* add autoapi and docs
* ignore autoapi
* add targetvariants class
* simplfy reading zstd/gz with xopen and add ID to TargetVariant
* set up pyarrow support for TargetVariants
* add tqdm
* add pyarrow and exports
* fix test
* export NormalisedScoringFile
* add dataframe classes
* improve context managers
* update matchlib
* add tests
* write out scorefiles
* refactor writing out scorefiles
* make plinkframes internal
* add PlinkScoreFiles
* fix labelling
* use labelled df to write scoring files
* fix tests
* extract plinkscorefiles
* fix tests
* check log counts
* fix test
* add matchlib to docs
* update exports
* add matchlib pytest
* install all extras for tests
* add match cli
* add matchapp tests
* update pytest action
* sort test output to get consistent results
* bump dependency for node 20
* add merge cli
* fix test
* add support for writing --split and --combined
* update docs and tests
1 parent d3556ec commit a70beaf

109 files changed

+9226
-338
lines changed

+13
@@ -0,0 +1,13 @@
+on:
+  push:
+    paths:
+      - 'pgscatalog.combineapp/**.py'
+  pull_request:
+    paths:
+      - 'pgscatalog.combineapp/**.py'
+
+jobs:
+  combineapp-pytest:
+    uses: ./.github/workflows/pytest.yaml
+    with:
+      package-directory: "pgscatalog.combineapp"

.github/workflows/corelib-pytest.yml

+1-1
@@ -7,7 +7,7 @@ on:
       - 'pgscatalog.corelib/**.py'
 
 jobs:
-  downloadapp-corelib:
+  pytest-corelib:
     uses: ./.github/workflows/pytest.yaml
     with:
       package-directory: "pgscatalog.corelib"

.github/workflows/matchapp-pytest.yml

+13
@@ -0,0 +1,13 @@
+on:
+  push:
+    paths:
+      - 'pgscatalog.matchapp/**.py'
+  pull_request:
+    paths:
+      - 'pgscatalog.matchapp/**.py'
+
+jobs:
+  matchapp-pytest:
+    uses: ./.github/workflows/pytest.yaml
+    with:
+      package-directory: "pgscatalog.matchapp"

.github/workflows/matchlib-pytest.yml

+13
@@ -0,0 +1,13 @@
+on:
+  push:
+    paths:
+      - 'pgscatalog.matchlib/**.py'
+  pull_request:
+    paths:
+      - 'pgscatalog.matchlib/**.py'
+
+jobs:
+  pytest-matchlib:
+    uses: ./.github/workflows/pytest.yaml
+    with:
+      package-directory: "pgscatalog.matchlib"

.github/workflows/pytest.yaml

+4-4
@@ -19,20 +19,20 @@ jobs:
       - uses: actions/checkout@v4
 
       - name: Install Python
-        uses: actions/setup-python@v4
+        uses: actions/setup-python@v5
         with:
           python-version: ${{ inputs.python-version }}
           cache: 'pip'
 
-      - uses: actions/cache@v3
+      - uses: actions/cache@v4
         with:
           path: ${{ inputs.package-directory }}/.venv
           key: venv-${{ hashFiles('poetry.lock') }}
 
       - run: pip install poetry
 
-      - run: poetry install --with dev
+      - run: poetry install --with dev --all-extras
         working-directory: ${{ inputs.package-directory }}
 
       - run: poetry run pytest --doctest-modules
-        working-directory: ${{ inputs.package-directory }}
+        working-directory: ${{ inputs.package-directory }}

.gitignore

+2
@@ -7,6 +7,7 @@
 */dist/*
 build
 _build
+docs/autoapi
 .cache
 *.so
 
@@ -19,6 +20,7 @@ pip-log.txt
 
 .DS_Store
 .idea/*
+*/.idea/*
 .python-version
 .vscode/*
 

README.md

+19-2
@@ -5,8 +5,8 @@ This repository contains Python applications and libraries for working with poly
 
 These CLI applications are used by the PGS Catalog Calculator workflow.
 
-| Application | Description | Link |
-|-----------------------|------------------------------------------------|-------------------------------------------------------|
+| Application | Description | Link |
+|-----------------------|------------------------------------------------|------------------------------------------------------|
 | `pgscatalog-download` | Download scoring files from the PGS Catalog | [README](pgscatalog.downloadapp/README.md) |
 | `pgscatalog-combine` | Combine scoring files into a consistent format |
 
@@ -23,6 +23,23 @@ If you write code to work with PGS, we publish some libraries that might be hel
 | `pgscatalog-calclib` | Ancestry estimation and normalisation |
 
 
+## Installation
+
+### pip
+
+If you want to use the packages in this repository, use pip:
+
+### Local install for developers
+
+If you want to make changes to a package or application, it's simplest to clone the repository and install packages in editable mode.
+
+```
+$ git clone https://github.com/PGScatalog/pygscatalog.git
+$ cd pygscatalog/pgscatalog.downloadapp # replace with the package you want to edit
+$ poetry add --editable ../pgscatalog.corelib # downloadapp requires corelib
+$ poetry install
+```
+
 ## Documentation
 
 Full documentation is provided..

docs/Makefile

+20
@@ -0,0 +1,20 @@
+# Minimal makefile for Sphinx documentation
+#
+
+# You can set these variables from the command line, and also
+# from the environment for the first two.
+SPHINXOPTS ?=
+SPHINXBUILD ?= sphinx-build
+SOURCEDIR = .
+BUILDDIR = _build
+
+# Put it first so that "make" without argument is like "make help".
+help:
+	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
+
+.PHONY: help Makefile
+
+# Catch-all target: route all unknown targets to Sphinx using the new
+# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
+%: Makefile
+	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

docs/_templates/autoapi/index.rst

+24
@@ -0,0 +1,24 @@
+API Reference
+=============
+
+This page contains auto-generated API reference documentation, which describes
+pygscatalog libraries.
+
+The information is mostly useful for developers: people that want to write Python
+code to work with polygenic scores.
+
+
+.. toctree::
+   :titlesonly:
+
+   {% for page in pages | sort %}
+   {#
+       Add the top most levels in "pgscatalog.X" to the index file
+       This is needed because we don't have __init__.py file in pgscatalog package
+       as we use nested implicit namespace packages.
+       https://github.com/readthedocs/sphinx-autoapi/issues/298
+   #}
+   {% if (page.top_level_object or page.name.split('.') | length == 2) and page.display %}
+   {{ page.short_name }} <{{ page.include_path }}>
+   {% endif %}
+   {% endfor %}
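
The `{% if %}` condition in this template can be read as a filter on dotted module names. The Python sketch below mirrors only the `page.name.split('.') | length == 2` part and ignores `top_level_object` and `display`. The helper name and the example module names are hypothetical, chosen for illustration.

```python
def is_second_level(name: str) -> bool:
    """Return True for names with exactly two dotted parts, e.g. 'pgscatalog.X'."""
    return len(name.split(".")) == 2


# Hypothetical page names, for illustration only
names = [
    "pgscatalog",
    "pgscatalog.corelib",
    "pgscatalog.matchlib",
    "pgscatalog.corelib.scorefiles",
]

# Keep only the second-level names, sorted like the template's "pages | sort"
index_entries = sorted(n for n in names if is_second_level(n))
print(index_entries)
```

Only the two-part names survive the filter, which is how the libraries end up in the index even though the `pgscatalog` namespace package has no `__init__.py`.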

docs/conf.py

+56
@@ -0,0 +1,56 @@
+# Configuration file for the Sphinx documentation builder.
+#
+# For the full list of built-in configuration values, see the documentation:
+# https://www.sphinx-doc.org/en/master/usage/configuration.html
+
+# -- Project information -----------------------------------------------------
+# https://www.sphinx-doc.org/en/master/usage/configuration.html#project-information
+
+project = "pygscatalog"
+copyright = "2024, PGS Catalog"
+author = "PGS Catalog"
+
+# -- General configuration ---------------------------------------------------
+# https://www.sphinx-doc.org/en/master/usage/configuration.html#general-configuration
+
+extensions = ["autoapi.extension"]
+
+templates_path = ["_templates"]
+exclude_patterns = ["_build", "Thumbs.db", ".DS_Store"]
+
+
+# -- Options for HTML output -------------------------------------------------
+# https://www.sphinx-doc.org/en/master/usage/configuration.html#options-for-html-output
+
+html_theme = "alabaster"
+html_static_path = ["_static"]
+
+# use autoapi for packages that provide APIs (libraries)
+autoapi_dirs = [
+    "../pgscatalog.corelib/src/pgscatalog",
+    "../pgscatalog.matchlib/src/pgscatalog",
+]
+# see _templates/autoapi/index.rst for autoapi fix
+autoapi_template_dir = "_templates/autoapi"
+autoapi_python_use_implicit_namespaces = True
+autoapi_keep_files = True
+
+# hide private members
+autoapi_options = [
+    "members",
+    "undoc-members",
+    "show-inheritance",
+    "show-module-summary",
+    "imported-members",
+]
+autoapi_member_order = "groupwise"
+
+
+def skip_submodules(app, what, name, obj, skip, options):
+    if what == "module":
+        skip = True
+    return skip
+
+
+def setup(sphinx):
+    sphinx.connect("autoapi-skip-member", skip_submodules)
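
The `skip_submodules` hook is registered for the `autoapi-skip-member` event. A minimal sketch of its behaviour, calling the function directly outside Sphinx: the `None` arguments and the member names are placeholders assumed for illustration, not real Sphinx or AutoAPI objects.

```python
def skip_submodules(app, what, name, obj, skip, options):
    # Same logic as the hook in docs/conf.py: always skip members of type "module"
    if what == "module":
        skip = True
    return skip


# Placeholder member names (hypothetical); app, obj and options are stand-ins
module_skipped = skip_submodules(None, "module", "pgscatalog.example._private", None, False, [])
class_skipped = skip_submodules(None, "class", "pgscatalog.example.SomeClass", None, False, [])
print(module_skipped, class_skipped)
```

Members of any other kind keep whatever `skip` value was passed in, so only submodules are hidden from the generated pages.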

docs/how-to/guides/combine.rst

+69
@@ -0,0 +1,69 @@
+How to combine scoring files from the PGS Catalog
+=================================================
+
+``pgscatalog-combine`` is a CLI application that makes it easy to combine scoring files into a standardised output.
+
+The process involves:
+
+* extracting important fields from scoring files
+* doing some quality control checks
+* optionally lifting over variants to a consistent genome build
+* writing a long format / melted output file
+
+Input scoring files must follow PGS Catalog standards. The output file is useful for
+doing data science tasks, like matching variants across a scoring file and target
+genome.
+
+Installation
+------------
+
+::
+
+    $ pip install pgscatalog-combine
+
+Usage
+-----
+
+Combining PGS Catalog scoring files
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. tip:: It's easiest to get started by downloading scoring files in the same genome build: :doc:`download`
+
+::
+
+    $ pgscatalog-combine -s PGS000001_hmPOS_GRCh38.txt.gz PGS001229_hmPOS_GRCh38.txt.gz -t GRCh38 -o combined.txt
+
+.. note:: If you're combining lots of files, you can compress the output automatically ``-o combined.txt.gz``
+
+Lifting over scoring files
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+It's possible to combine scoring files with different genome builds using liftover.
+
+.. danger:: You should only do this when combining PGS Catalog and custom scoring files, because the PGS Catalog provides harmonised data
+
+First, download chain files from UCSC:
+
+* `hg19ToHg38.over.chain.gz`_
+* `hg38ToHg19.over.chain.gz`_
+
+.. _hg19ToHg38.over.chain.gz: https://hgdownload.soe.ucsc.edu/goldenPath/hg19/liftOver/
+.. _hg38ToHg19.over.chain.gz: https://hgdownload.soe.ucsc.edu/goldenPath/hg38/liftOver/
+
+And copy them into a directory (e.g. ``my_chain_dir/``).
+
+Assume you have a custom scoring file in GRCh37 (``my_scorefile_grch37.txt.gz``) and you want to combine it with a PGS Catalog scoring file in GRCh38:
+
+::
+
+    $ pgscatalog-combine -s PGS000001_hmPOS_GRCh38.txt.gz my_scorefile_grch37.txt.gz \
+        --chain_dir my_chain_dir/ \
+        -t GRCh38 \
+        -o combined.txt
+
+Help
+----
+
+::
+
+    $ pgscatalog-combine --help
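
To illustrate what a "long format / melted" output means in this guide, here is a minimal sketch in plain Python. It is not the `pgscatalog-combine` implementation: the `melt_scores` helper, the in-memory rows, the placeholder accessions, and the choice of output columns are all assumptions made for illustration.

```python
def melt_scores(scorefiles: dict[str, list[dict]]) -> list[dict]:
    """Stack variants from several scoring files into one long table,
    adding an 'accession' column that records each row's source file."""
    long_rows = []
    for accession, variants in scorefiles.items():
        for variant in variants:
            long_rows.append({"accession": accession, **variant})
    return long_rows


# Hypothetical variants (not real PGS Catalog data)
scorefiles = {
    "PGS_A": [
        {"chr_name": "1", "chr_position": 1000, "effect_allele": "A", "effect_weight": 0.1},
        {"chr_name": "2", "chr_position": 2000, "effect_allele": "T", "effect_weight": -0.2},
    ],
    "PGS_B": [
        {"chr_name": "1", "chr_position": 1000, "effect_allele": "A", "effect_weight": 0.3},
    ],
}

combined = melt_scores(scorefiles)
for row in combined:
    print(row)
```

Each variant appears once per scoring file it belongs to, so a variant shared by two scores produces two rows that differ in `accession` and `effect_weight`.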

docs/how-to/guides/download.rst

+66
@@ -0,0 +1,66 @@
+How to download scoring files from the PGS Catalog
+==================================================
+
+``pgscatalog-download`` is a CLI application that makes it easy to download scoring files from the
+PGS Catalog with a mixture of PGS, publication, or trait accessions. The application:
+
+* automatically retries downloads if they fail
+* validates the checksum of downloaded scoring files
+* automatically selects scoring files aligned to a requested genome build
+
+Installation
+-------------
+
+::
+
+    $ pip install pgscatalog-download
+
+Usage
+-----
+
+Downloading PGS IDs scoring files aligned to GRCh38
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+::
+
+    $ mkdir downloads
+    $ pgscatalog-download --pgs PGS000822 PGS001229 --build GRCh38 -o downloads
+
+.. note::
+
+    Setting ``--build`` will download scoring files harmonised by the PGS Catalog. This means scoring files have consistent fields, like genomic coordinates.
+
+Downloading all scores associated with a trait
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+To download all scores associated with Alzheimer's disease:
+
+::
+
+    $ mkdir downloads
+    $ pgscatalog-download --efo MONDO_0004975 -b GRCh38 -o downloads
+
+By default, scores associated with child traits, like late-onset Alzheimer's disease, are included.
+To exclude them use:
+
+::
+
+    $ mkdir downloads
+    $ pgscatalog-download --efo MONDO_0004975 -b GRCh38 -o downloads --efo_direct
+
+Downloading all scores associated with a publication
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+If you're interested in scores from a specific publication:
+
+::
+
+    $ mkdir downloads
+    $ pgscatalog-download --pgp PGP000517 -b GRCh38 -o downloads
+
+Help
+----
+
+::
+
+    $ pgscatalog-download --help
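
The first two behaviours listed at the top of this guide (retrying failed downloads and validating checksums) can be sketched in plain Python. This illustrates the general technique only and is not the application's code: `fetch_with_retry`, `checksum_ok`, the use of MD5, and the simulated `flaky_fetch` function are all assumptions.

```python
import hashlib
import time


def checksum_ok(data: bytes, expected_md5: str) -> bool:
    """Compare the MD5 hex digest of data with an expected value."""
    return hashlib.md5(data).hexdigest() == expected_md5


def fetch_with_retry(fetch, expected_md5: str, max_tries: int = 3, delay: float = 0.0) -> bytes:
    """Call fetch() until it returns data whose checksum matches, up to max_tries."""
    for _ in range(max_tries):
        try:
            data = fetch()
            if checksum_ok(data, expected_md5):
                return data
        except OSError:
            pass  # treat I/O errors as a failed attempt
        time.sleep(delay)
    raise RuntimeError(f"download failed after {max_tries} attempts")


# Simulated download that fails once, then succeeds (no network access)
payload = b"rsID\teffect_allele\teffect_weight\n"
calls = {"n": 0}


def flaky_fetch() -> bytes:
    calls["n"] += 1
    if calls["n"] == 1:
        raise OSError("simulated connection error")
    return payload


result = fetch_with_retry(flaky_fetch, hashlib.md5(payload).hexdigest())
print(calls["n"], result == payload)
```

Passing the download step in as a callable keeps the retry logic independent of any particular HTTP client, which also makes it easy to exercise without a network connection.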
