
Update ScoreVariant to support exact PGS Catalog standard #46

Merged
46 commits merged into main on Oct 7, 2024

Conversation

nebfield
Member

@nebfield nebfield commented Sep 9, 2024

Background

We had several problems, including:

  • Non-additive variants caused a ValueError when instantiating a ScoreVariant
  • When normalising scoring files the CLI would crash if ScoreVariants didn't look quite right
  • The implementation of the ScoreVariant class wasn't robust: rare combinations of fields in valid scoring files would trigger unpredictable errors

New approach

  • The pydantic model CatalogScoreVariant is an exact implementation of the PGS Catalog data standard for scoring files
  • The ScoreVariant model inherits from CatalogScoreVariant to include attributes for normalisation (row_nr, accession)
  • Some normalisation steps are automatically done by pydantic now (check_effect_weight)
  • Non-additive scores are skipped by the CLI. If no scores are processed, an exception is thrown
```mermaid
---
title: PGS Catalog pydantic data models
---
classDiagram
    CatalogScoreVariant <|-- ScoreVariant
    CatalogScoreVariant : +String rsID
    CatalogScoreVariant : +String chr_name
    CatalogScoreVariant : +int chr_position
    CatalogScoreVariant : ...

    note for CatalogScoreVariant "Implements PGS Catalog standard"

    class ScoreVariant{
        +String accession
        +int row_nr
        +bool is_duplicated
    }
    note for ScoreVariant "Used for normalisation"

    ScoreHeader <|-- CatalogScoreHeader

    ScoreHeader : +String pgs_id
    ScoreHeader : +Optional[String] pgs_name
    ScoreHeader : +String trait_reported
    ScoreHeader : +Optional[enum] genome_build
    ScoreHeader : +from_path()
    note for ScoreHeader "A minimal header \n for the calculator"

    note for CatalogScoreHeader "Implements PGS Catalog standard"

    class CatalogScoreHeader {
        +enum format_version
        ...
    }

    note for ScoreLog "Header metadata and \nvariant summary stats"

    class ScoreLog {
        +ScoreHeader header
        +dict variant_sources
        +bool compatible_effect_type
        ...
    }
```
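The inheritance in the diagram can be sketched with pydantic. This is only an illustrative subset: the field defaults, example values, and any fields not shown in the diagram are assumptions, not the real models in `pgscatalog.core`.

```python
from typing import Optional

from pydantic import BaseModel


class CatalogScoreVariant(BaseModel):
    """Implements (a small subset of) the PGS Catalog scoring file standard."""

    rsID: Optional[str] = None
    chr_name: Optional[str] = None
    chr_position: Optional[int] = None
    effect_weight: str  # stored as a string that must be castable to float


class ScoreVariant(CatalogScoreVariant):
    """Adds the attributes used during normalisation."""

    accession: str
    row_nr: int
    is_duplicated: bool = False


# Because ScoreVariant inherits from CatalogScoreVariant, every normalised
# variant is also a standard-conformant Catalog variant.
variant = ScoreVariant(
    rsID="rs12345",
    chr_name="1",
    chr_position=100,
    effect_weight="0.25",
    accession="PGS000001",
    row_nr=0,
)
print(variant.accession, variant.row_nr)
```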

Gradual typing

Pydantic uses type hints to build data models, so pgscatalog.core now has its type hints (where present) checked with mypy.
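As a toy illustration of gradual typing (these helpers are hypothetical, not from the package): mypy checks annotated code and, by default, leaves unannotated code alone.

```python
def coerce_weight(effect_weight: str) -> float:
    # Annotated: mypy checks this signature and flags callers that
    # pass a non-str argument or misuse the float return value.
    return float(effect_weight)


def legacy_helper(x):
    # Unannotated: under gradual typing mypy skips this body by default,
    # so untyped modules can be migrated incrementally.
    return x


print(coerce_weight("0.25"))
```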

Test notes

Running the entire Catalog:

  • PGS002253 contains both effect weights and dosage weight columns, so I removed the restriction of exclusive weight types (if the effect weight column is present, we'll use it)
  • PGS002263 contains a peculiar variant on row 223:
    `.	GAA	GAAA	-0.188	author_rsID=rs544624542;rs563204200;rs574206742	Unknown`
  • This will currently raise an irrecoverable ValidationError because there's no positional information
  • Complex variants (e.g. APOE) may look odd, so a lot of validation is disabled for them
  • Some scoring files will need their rsID field updated to have a valid prefix
  • 4 missing citations caused some ValueErrors (fixing headers)
  • The new approach is slower because each variant is validated (~1.5 days to combine the entire Catalog), but the checks are much better
    • I set up the CLI to work in parallel but haven't enabled this feature yet (requires changes in pgscatalog-match)
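The pattern behind these test notes can be sketched like this: validate row by row, collect failures, and only raise if nothing survives. The model and the handling here are simplified assumptions, not the CLI's actual code.

```python
from pydantic import BaseModel, ValidationError


class MiniVariant(BaseModel):
    """Simplified stand-in for ScoreVariant."""

    chr_position: int  # required here, so a row without a position fails
    effect_weight: str


rows = [
    {"chr_position": 100, "effect_weight": "0.1"},
    {"chr_position": None, "effect_weight": "-0.188"},  # no positional info
]

valid, bad_rows = [], []
for row_nr, row in enumerate(rows):
    try:
        valid.append(MiniVariant(**row))
    except ValidationError:
        bad_rows.append(row_nr)  # record which rows failed validation

if not valid:
    # mirrors the CLI behaviour: if no scores are processed, raise
    raise ValueError("no valid variants were processed")

print(len(valid), bad_rows)
```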

Submission validation

CatalogScoreVariant would be a good place to implement validation logic using field_validator or model_validator so we have unified data models across submission validation and calculator normalisation.
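A hedged sketch of what that could look like, assuming pydantic v2's `field_validator`; the class name and the two rules shown are invented examples, not the Catalog's actual checks:

```python
from typing import Optional

from pydantic import BaseModel, ValidationError, field_validator


class CatalogScoreVariantSketch(BaseModel):
    rsID: Optional[str] = None
    effect_weight: str

    @field_validator("effect_weight")
    @classmethod
    def weight_castable_to_float(cls, v: str) -> str:
        float(v)  # raises ValueError (reported as a ValidationError) if not numeric
        return v  # keep the original string to preserve reported precision

    @field_validator("rsID")
    @classmethod
    def rsid_has_valid_prefix(cls, v: Optional[str]) -> Optional[str]:
        if v is not None and not v.startswith(("rs", "ss")):
            raise ValueError(f"invalid rsID prefix: {v!r}")
        return v


ok = CatalogScoreVariantSketch(rsID="rs123", effect_weight="0.5")
try:
    CatalogScoreVariantSketch(effect_weight="not_a_number")
except ValidationError:
    print("rejected bad effect weight")
```

Because the validators live on the model, the same rules would apply wherever the model is instantiated, whether during submission validation or calculator normalisation.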

Outstanding questions:

  • effect weights: should we treat them as strings that are castable to numeric, plain floats, or decimals with a defined precision (thinking of plink limits)? (Decision: be consistent internally, using strings that can be coerced to floats)
  • Add has_complex_variants to log?

Closes PGScatalog/pgsc_calc#370


codecov bot commented Sep 9, 2024

Codecov Report

Attention: Patch coverage is 95.86466% with 22 lines in your changes missing coverage. Please review.

Project coverage is 90.44%. Comparing base (84a111f) to head (3c1ebee).
Report is 6 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| pgscatalog.core/src/pgscatalog/core/lib/models.py | 95.19% | 17 Missing ⚠️ |
| ...gscatalog.core/src/pgscatalog/core/cli/_combine.py | 81.81% | 2 Missing ⚠️ |
| pgscatalog.core/tests/test_combine_cli.py | 94.87% | 2 Missing ⚠️ |
| ...atalog.core/src/pgscatalog/core/cli/combine_cli.py | 98.00% | 1 Missing ⚠️ |
Additional details and impacted files
```text
@@            Coverage Diff             @@
##             main      #46      +/-   ##
==========================================
+ Coverage   87.98%   90.44%   +2.45%
==========================================
  Files          20       42      +22
  Lines        1049     2596    +1547
==========================================
+ Hits          923     2348    +1425
- Misses        126      248     +122
```

| Flag | Coverage Δ |
| --- | --- |
| pgscatalog.core | 92.11% <95.86%> (?) |

Flags with carried forward coverage won't be shown.

@nebfield nebfield marked this pull request as ready for review September 9, 2024 13:01
@nebfield nebfield linked an issue Sep 9, 2024 that may be closed by this pull request
@nebfield nebfield requested a review from fyvon September 9, 2024 13:59
@nebfield nebfield requested a review from smlmbrt September 9, 2024 15:36
@smlmbrt smlmbrt requested a review from ens-lgil September 9, 2024 15:37
@smlmbrt
Member

smlmbrt commented Sep 9, 2024

I think @ens-lgil might be a better first reviewer? And @fyvon when he gets back. Happy to look at it after.

@smlmbrt
Member

smlmbrt commented Sep 12, 2024

Re effect_weight: I think "strings that are castable to numeric" is the right way to go. This offloads the burden of precision to whatever downstream tool is using the information (and preserves what the author reported). Plink actually uses the full precision of the float, but rounds the output (so it's still useful information and we shouldn't arbitrarily truncate the precision).
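The precision point can be illustrated with a toy example (not project code): re-serialising a float can change the digits the author wrote, while keeping the string preserves them.

```python
# A weight as an author might report it, at full decimal precision
reported = "0.1000000000000000055511151231257827"

as_float = float(reported)  # fine for downstream arithmetic
# The nearest binary double to this decimal is exactly the double 0.1 ...
assert as_float == 0.1
# ... so converting back to a string loses the reported digits.
assert str(as_float) != reported

# Keeping effect_weight as a str preserves what the author wrote;
# downstream tools (e.g. plink) then decide how much precision to use.
print(repr(str(as_float)), "!=", repr(reported))
```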

@smlmbrt
Member

smlmbrt commented Sep 12, 2024

With regard to pgsc_calc, the requested score will sort of disappear from the analysis after combine? Like you won't see that none of its variants matched, and it won't make it into the header of the report. Is there a way to show users that it was excluded because it didn't have the right effect weights?

@smlmbrt
Member

smlmbrt commented Oct 3, 2024

I thought the solution was to always read effect weight and position columns as str and then check that they can be coerced into a float?

@smlmbrt
Member

smlmbrt commented Oct 3, 2024

Also missing the version bump?

@nebfield nebfield merged commit 2fe8105 into main Oct 7, 2024
10 checks passed
@nebfield nebfield deleted the fix-nonadditive branch October 7, 2024 09:49
Successfully merging this pull request may close these issues.

Some PGS Catalog scores not working with pgsc_calc: `Exception: Bad effect weights`
4 participants