
Update ScoreVariant to support exact PGS Catalog standard #46

Merged
46 commits merged into main on Oct 7, 2024

Conversation

nebfield
Member

@nebfield nebfield commented Sep 9, 2024

Background

We had several problems, including:

  • Non-additive variants caused a ValueError when instantiating a ScoreVariant
  • When normalising scoring files the CLI would crash if ScoreVariants didn't look quite right
  • The implementation of the ScoreVariant class wasn't robust: rare combinations of fields in valid scoring files would trigger unpredictable errors

New approach

  • The pydantic model CatalogScoreVariant is an exact implementation of the PGS Catalog data standard for scoring files
  • The ScoreVariant model inherits from CatalogScoreVariant to include attributes for normalisation (row_nr, accession)
  • Some normalisation steps are automatically done by pydantic now (check_effect_weight)
  • Non-additive scores are skipped by the CLI. If no scores are processed, an exception is thrown
```mermaid
---
title: PGS Catalog pydantic data models
---
classDiagram
    CatalogScoreVariant <|-- ScoreVariant
    CatalogScoreVariant : +String rsID
    CatalogScoreVariant : +String chr_name
    CatalogScoreVariant : +int chr_position
    CatalogScoreVariant : ...

    note for CatalogScoreVariant "Implements PGS Catalog standard"

    class ScoreVariant{
        +String accession
        +int row_nr
        +bool is_duplicated
    }
    note for ScoreVariant "Used for normalisation"

    ScoreHeader <|-- CatalogScoreHeader

    ScoreHeader : +String pgs_id
    ScoreHeader : +Optional[String] pgs_name
    ScoreHeader : +String trait_reported
    ScoreHeader : +Optional[enum] genome_build
    ScoreHeader : +from_path()
    note for ScoreHeader "A minimal header \n for the calculator"

    note for CatalogScoreHeader "Implements PGS Catalog standard"

    class CatalogScoreHeader {
        +enum format_version
        ...
    }

    note for ScoreLog "Header metadata and \nvariant summary stats"

    class ScoreLog {
        +ScoreHeader header
        +dict variant_sources
        +bool compatible_effect_type
        ...
    }
```
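The inheritance in the diagram can be sketched with pydantic. This is only an illustrative subset: the field defaults, example values, and any fields not shown in the diagram are assumptions, not the real models in `pgscatalog.core`.

```python
from typing import Optional

from pydantic import BaseModel


class CatalogScoreVariant(BaseModel):
    """Implements (a small subset of) the PGS Catalog scoring file standard."""

    rsID: Optional[str] = None
    chr_name: Optional[str] = None
    chr_position: Optional[int] = None
    effect_weight: str  # stored as a string that must be castable to float


class ScoreVariant(CatalogScoreVariant):
    """Adds the attributes used during normalisation."""

    accession: str
    row_nr: int
    is_duplicated: bool = False


# Because ScoreVariant inherits from CatalogScoreVariant, every normalised
# variant is also a standard-conformant Catalog variant.
variant = ScoreVariant(
    rsID="rs12345",
    chr_name="1",
    chr_position=100,
    effect_weight="0.25",
    accession="PGS000001",
    row_nr=0,
)
print(variant.accession, variant.row_nr)
```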

Gradual typing

Pydantic uses type hints to build data models, so pgscatalog.core now has its type hints (where present) checked with mypy.
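As a toy illustration of gradual typing (these helpers are hypothetical, not from the package): mypy checks annotated code and, by default, leaves unannotated code alone.

```python
def coerce_weight(effect_weight: str) -> float:
    # Annotated: mypy checks this signature and flags callers that
    # pass a non-str argument or misuse the float return value.
    return float(effect_weight)


def legacy_helper(x):
    # Unannotated: under gradual typing mypy skips this body by default,
    # so untyped modules can be migrated incrementally.
    return x


print(coerce_weight("0.25"))
```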

Test notes

Running the entire Catalog:

  • PGS002253 contains both effect weights and dosage weight columns, so I removed the restriction of exclusive weight types (if the effect weight column is present, we'll use it)
  • PGS002263 contains a peculiar variant on row 223:
    `.	GAA	GAAA	-0.188	author_rsID=rs544624542;rs563204200;rs574206742	Unknown`
  • This will currently raise an irrecoverable ValidationError because there's no positional information
  • Complex variants (e.g. APOE) may look odd, so a lot of validation is disabled for them
  • Some scoring files will need their rsID field updated to have a valid prefix
  • 4 missing citations caused some ValueErrors (fixing headers)
  • The new approach is slower because each variant is validated (~1.5 days to combine the entire Catalog), but the checks are much better
    • I set up the CLI to work in parallel but haven't enabled this feature yet (requires changes in pgscatalog-match)
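The pattern behind these test notes can be sketched like this: validate row by row, collect failures, and only raise if nothing survives. The model and the handling here are simplified assumptions, not the CLI's actual code.

```python
from pydantic import BaseModel, ValidationError


class MiniVariant(BaseModel):
    """Simplified stand-in for ScoreVariant."""

    chr_position: int  # required here, so a row without a position fails
    effect_weight: str


rows = [
    {"chr_position": 100, "effect_weight": "0.1"},
    {"chr_position": None, "effect_weight": "-0.188"},  # no positional info
]

valid, bad_rows = [], []
for row_nr, row in enumerate(rows):
    try:
        valid.append(MiniVariant(**row))
    except ValidationError:
        bad_rows.append(row_nr)  # record which rows failed validation

if not valid:
    # mirrors the CLI behaviour: if no scores are processed, raise
    raise ValueError("no valid variants were processed")

print(len(valid), bad_rows)
```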

Submission validation

CatalogScoreVariant would be a good place to implement validation logic using field_validator or model_validator so we have unified data models across submission validation and calculator normalisation.
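A hedged sketch of what that could look like, assuming pydantic v2's `field_validator`; the class name and the two rules shown are invented examples, not the Catalog's actual checks:

```python
from typing import Optional

from pydantic import BaseModel, ValidationError, field_validator


class CatalogScoreVariantSketch(BaseModel):
    rsID: Optional[str] = None
    effect_weight: str

    @field_validator("effect_weight")
    @classmethod
    def weight_castable_to_float(cls, v: str) -> str:
        float(v)  # raises ValueError (reported as a ValidationError) if not numeric
        return v  # keep the original string to preserve reported precision

    @field_validator("rsID")
    @classmethod
    def rsid_has_valid_prefix(cls, v: Optional[str]) -> Optional[str]:
        if v is not None and not v.startswith(("rs", "ss")):
            raise ValueError(f"invalid rsID prefix: {v!r}")
        return v


ok = CatalogScoreVariantSketch(rsID="rs123", effect_weight="0.5")
try:
    CatalogScoreVariantSketch(effect_weight="not_a_number")
except ValidationError:
    print("rejected bad effect weight")
```

Because the validators live on the model, the same rules would apply wherever the model is instantiated, whether during submission validation or calculator normalisation.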

Outstanding questions:

  • effect weights: should we treat them as strings that are castable to numeric, plain floats, or decimals with a defined precision (thinking of plink limits)? (Decision: be consistent internally, using strings that can be coerced to floats)
  • Add has_complex_variants to log?

Closes PGScatalog/pgsc_calc#370


codecov bot commented Sep 9, 2024

Codecov Report

Attention: Patch coverage is 95.86466% with 22 lines in your changes missing coverage. Please review.

Project coverage is 90.44%. Comparing base (84a111f) to head (3c1ebee).
Report is 6 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| pgscatalog.core/src/pgscatalog/core/lib/models.py | 95.19% | 17 Missing ⚠️ |
| ...gscatalog.core/src/pgscatalog/core/cli/_combine.py | 81.81% | 2 Missing ⚠️ |
| pgscatalog.core/tests/test_combine_cli.py | 94.87% | 2 Missing ⚠️ |
| ...atalog.core/src/pgscatalog/core/cli/combine_cli.py | 98.00% | 1 Missing ⚠️ |
Additional details and impacted files
```text
@@            Coverage Diff             @@
##             main      #46      +/-   ##
==========================================
+ Coverage   87.98%   90.44%   +2.45%
==========================================
  Files          20       42      +22
  Lines        1049     2596    +1547
==========================================
+ Hits          923     2348    +1425
- Misses        126      248     +122
```

| Flag | Coverage Δ |
| --- | --- |
| pgscatalog.core | 92.11% <95.86%> (?) |

Flags with carried forward coverage won't be shown.

@nebfield nebfield marked this pull request as ready for review September 9, 2024 13:01
@nebfield nebfield linked an issue Sep 9, 2024 that may be closed by this pull request
@nebfield nebfield requested a review from fyvon September 9, 2024 13:59
@nebfield nebfield requested a review from smlmbrt September 9, 2024 15:36
@smlmbrt smlmbrt requested a review from ens-lgil September 9, 2024 15:37
@smlmbrt
Member

smlmbrt commented Sep 9, 2024

I think @ens-lgil might be a better first reviewer? And @fyvon when he gets back. Happy to look at it after.

@smlmbrt
Member

smlmbrt commented Sep 12, 2024

Re effect_weight: I think "strings that are castable to numeric" is the right way to go. This offloads the burden of precision to whatever downstream tool is using the information (and preserves what the author reported). Plink actually uses the full precision of the float, but rounds the output (so it's still useful information and we shouldn't arbitrarily truncate the precision).
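The precision point can be illustrated with a toy example (not project code): re-serialising a float can change the digits the author wrote, while keeping the string preserves them.

```python
# A weight as an author might report it, at full decimal precision
reported = "0.1000000000000000055511151231257827"

as_float = float(reported)  # fine for downstream arithmetic
# The nearest binary double to this decimal is exactly the double 0.1 ...
assert as_float == 0.1
# ... so converting back to a string loses the reported digits.
assert str(as_float) != reported

# Keeping effect_weight as a str preserves what the author wrote;
# downstream tools (e.g. plink) then decide how much precision to use.
print(repr(str(as_float)), "!=", repr(reported))
```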

@smlmbrt
Member

smlmbrt commented Sep 12, 2024

With regard to pgsc_calc, the requested score will sort of disappear from the analysis after combine? Like you won't see that none of its variants matched, and it won't make it into the header of the report. Is there a way to show users that it was excluded because it didn't have the right effect weights?

@smlmbrt
Member

smlmbrt commented Oct 3, 2024

I thought the solution was to always read effect weight and position columns as str and then check that they can be coerced into a float?

@smlmbrt
Member

smlmbrt commented Oct 3, 2024

Also missing the version bump?

@nebfield nebfield merged commit 2fe8105 into main Oct 7, 2024
10 checks passed
@nebfield nebfield deleted the fix-nonadditive branch October 7, 2024 09:49
Successfully merging this pull request may close these issues.

Some PGS Catalog scores not working with pgsc_calc: `Exception: Bad effect weights`
4 participants