
Update score log structure #49

Merged
merged 12 commits into fix-nonadditive from update-scorelog on Sep 24, 2024

Conversation

nebfield (Member) commented Sep 13, 2024

What's changed

Added pydantic models for:

  1. Custom scoring file headers
  2. PGS Catalog scoring file headers (inherited from the base header model, could be reused for validation 👀 )
  3. A score log (previously we chucked a mix of useful data into a JSON file)

Integrated these models with the ScoringFile object and combine CLI

There is a breaking change in log structure. Changes include:

  • New compatible_effect_type field (bool)
  • The dynamic accession field ({'PGSXXXX': {nested_dict}}) has been made static because of pydantic limitations (replaced with a "header" field)
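A minimal sketch of what a log entry model along these lines could look like (field names taken from the example log below; this is an illustration, not the exact models in pgscatalog.core):

```python
from typing import Optional

from pydantic import BaseModel


class ScoreHeader(BaseModel):
    """Illustrative stand-in for the base header model (two fields only)."""

    pgs_name: str
    genome_build: Optional[str] = None


class ScoreLogEntry(BaseModel):
    """One entry in the new score log (hypothetical field subset)."""

    header: ScoreHeader
    compatible_effect_type: bool
    pgs_id: Optional[str] = None
    is_harmonised: bool = False
    sources: Optional[list[str]] = None


entry = ScoreLogEntry(
    header=ScoreHeader(pgs_name="PRS77_BC", genome_build="NR"),
    compatible_effect_type=True,
    pgs_id="PGS000001",
    is_harmonised=True,
    sources=["ENSEMBL"],
)
print(entry.model_dump_json())
```

The static "header" field replaces the old dynamic accession key, so every entry serialises with the same shape.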

Why it changed

#46 (comment)

Example log file

PGS000001 is normalised OK and will be in the output; PGS004255 is ignored, with a warning.

[
  {
    "header": {
      "pgs_id": "PGS000001",
      "pgs_name": "PRS77_BC",
      "trait_reported": "Breast cancer",
      "genome_build": "NR",
      "format_version": "2.0",
      "trait_mapped": [
        "breast carcinoma"
      ],
      "trait_efo": [
        "EFO_0000305"
      ],
      "variants_number": 77,
      "weight_type": null,
      "pgp_id": "PGP000001",
      "citation": "Mavaddat N et al. J Natl Cancer Inst (2015). doi:10.1093/jnci/djv036",
      "HmPOS_build": "GRCh38",
      "HmPOS_date": "2022-07-29",
      "HmPOS_match_pos": "{\"True\": null, \"False\": null}",
      "HmPOS_match_chr": "{\"True\": null, \"False\": null}",
      "license": "PGS obtained from the Catalog should be cited appropriately, and used in accordance with any licensing restrictions set by the authors. See EBI Terms of Use (https://www.ebi.ac.uk/about/terms-of-use/) for additional details."
    },
    "compatible_effect_type": true,
    "pgs_id": "PGS000001",
    "is_harmonised": true,
    "sources": [
      "ENSEMBL"
    ]
  },
  {
    "header": {
      "pgs_id": "PGS004255",
      "pgs_name": "GenoBoost_rheumatoid_arthritis_0",
      "trait_reported": "Rheumatoid arthritis",
      "genome_build": "GRCh37",
      "format_version": "2.0",
      "trait_mapped": [
        "rheumatoid arthritis"
      ],
      "trait_efo": [
        "EFO_0000685"
      ],
      "variants_number": 30,
      "weight_type": "beta",
      "pgp_id": "PGP000546",
      "citation": "Ohta R et al. Nat Commun (2024). doi:10.1038/s41467-024-48654-x",
      "HmPOS_build": "GRCh38",
      "HmPOS_date": "2024-06-11",
      "HmPOS_match_pos": "{\"True\": null, \"False\": null}",
      "HmPOS_match_chr": "{\"True\": null, \"False\": null}",
      "license": "PGS obtained from the Catalog should be cited appropriately, and used in accordance with any licensing restrictions set by the authors. See EBI Terms of Use (https://www.ebi.ac.uk/about/terms-of-use/) for additional details."
    },
    "compatible_effect_type": false,
    "pgs_id": "PGS004255",
    "is_harmonised": true,
    "sources": null
  }
]
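Because the log is a JSON list, the commit history's "make ScoreLogs a RootModel to act as a json list" suggests it can be parsed back with a pydantic RootModel; a rough sketch (illustrative names and a field subset, not the real models):

```python
from typing import Optional

from pydantic import BaseModel, RootModel


class LogEntry(BaseModel):
    # only a subset of the real fields; unknown keys are ignored by default
    pgs_id: Optional[str] = None
    compatible_effect_type: bool
    sources: Optional[list[str]] = None


# RootModel lets the top-level JSON value be a list instead of an object
ScoreLogs = RootModel[list[LogEntry]]

log_json = (
    '[{"pgs_id": "PGS004255", "compatible_effect_type": false, "sources": null}]'
)
log = ScoreLogs.model_validate_json(log_json)

# e.g. collect scores that were skipped for having incompatible effect types
skipped = [entry.pgs_id for entry in log.root if not entry.compatible_effect_type]
print(skipped)  # ['PGS004255']
```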


codecov bot commented Sep 13, 2024

Codecov Report

Attention: Patch coverage is 96.77419% with 10 lines in your changes missing coverage. Please review.

Project coverage is 89.95%. Comparing base (aee104f) to head (7c76e12).
Report is 3 commits behind head on fix-nonadditive.

Files with missing lines Patch % Lines
pgscatalog.core/src/pgscatalog/core/lib/models.py 96.21% 10 Missing ⚠️
Additional details and impacted files
@@                 Coverage Diff                 @@
##           fix-nonadditive      #49      +/-   ##
===================================================
+ Coverage            89.18%   89.95%   +0.77%     
===================================================
  Files                   40       39       -1     
  Lines                 2320     2508     +188     
===================================================
+ Hits                  2069     2256     +187     
- Misses                 251      252       +1     
Flag Coverage Δ
pgscatalog.core 91.36% <96.77%> (+1.19%) ⬆️

Flags with carried forward coverage won't be shown.


Comment on lines -29 to -153
    "pgs_id",
    "pgp_id",
    "pgs_name",
    "genome_build",
    "variants_number",
    "trait_reported",
    "trait_efo",
    "trait_mapped",
    "weight_type",
    "citation",
    "HmPOS_build",
    "HmPOS_date",
    "format_version",
    "license",
)

_default_license_text = (
    "PGS obtained from the Catalog should be cited appropriately, and "
    "used in accordance with any licensing restrictions set by the authors. See "
    "EBI "
    "Terms of Use (https://www.ebi.ac.uk/about/terms-of-use/) for additional "
    "details."
)

def __init__(
    self,
    *,
    pgs_name,
    genome_build,
    pgs_id=None,
    pgp_id=None,
    variants_number=None,
    trait_reported=None,
    trait_efo=None,
    trait_mapped=None,
    weight_type=None,
    citation=None,
    HmPOS_build=None,
    HmPOS_date=None,
    format_version=None,
    license=None,
):
    """kwargs are forced because this is a complicated init and from_path() is
    almost always the correct thing to do.

    Mandatory/optional fields in a header are less clear than columns. The
    Catalog provides this information automatically but scoring files from other
    places might not.

    We don't want to annoy people and force them to reformat their custom scoring
    files, but we do require some minimum information for the calculator,
    so pgs_name and genome_build are mandatory.
    """
    self.pgs_name = pgs_name
    self.genome_build = GenomeBuild.from_string(genome_build)

    if self.genome_build is None:
        # try overwriting with harmonised data
        self.genome_build = GenomeBuild.from_string(HmPOS_build)

    if self.pgs_name is None:
        raise ValueError("pgs_name cannot be None")

    # the rest of these fields are optional
    self.pgs_id = pgs_id
    self.pgp_id = pgp_id

    try:
        self.variants_number = int(variants_number)
    except TypeError:
        self.variants_number = None

    self.trait_reported = trait_reported
    self.trait_efo = trait_efo
    self.trait_mapped = trait_mapped
    self.weight_type = weight_type
    self.citation = citation
    self.HmPOS_build = GenomeBuild.from_string(HmPOS_build)
    self.HmPOS_date = HmPOS_date
    self.format_version = format_version
    self.license = license

    if self.license is None:
        self.license = self._default_license_text

def __repr__(self):
    values = {x: getattr(self, x) for x in self.fields}
    value_strings = ", ".join([f"{key}='{value}'" for key, value in values.items()])
    return f"{type(self).__name__}({value_strings})"

@classmethod
def from_path(cls, path):
    raw_header: dict = read_header(path)

    if len(raw_header) == 0:
        raise ValueError(f"No header detected in scoring file {path=}")

    header_dict = {k: raw_header.get(k) for k in cls.fields}

    return cls(**header_dict)
Member Author
deleted because migrated to a pydantic model
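For reference, the deleted __init__ behaviour (mandatory pgs_name, genome_build falling back to HmPOS_build) maps naturally onto a pydantic model validator; a rough sketch under those assumptions, not the actual model in pgscatalog.core:

```python
from typing import Optional

from pydantic import BaseModel, model_validator


class HeaderSketch(BaseModel):
    """Illustrative only: shows how the old __init__ logic becomes a validator."""

    pgs_name: str  # still mandatory, as in the deleted __init__
    genome_build: Optional[str] = None
    HmPOS_build: Optional[str] = None

    @model_validator(mode="after")
    def fall_back_to_harmonised_build(self):
        # the old __init__ tried HmPOS_build when genome_build was missing
        if self.genome_build is None:
            self.genome_build = self.HmPOS_build
        return self


h = HeaderSketch(pgs_name="PRS77_BC", HmPOS_build="GRCh38")
print(h.genome_build)  # GRCh38
```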

Comment on lines +195 to +212
@property
def pgs_id(self):
    return self._pgs_id

@property
def is_harmonised(self):
    return self._header.is_harmonised

@property
def genome_build(self):
    if self.is_harmonised:
        return self._header.HmPOS_build
    else:
        return self._header.genome_build

@property
def header(self):
    return self._header
Member Author
I added these properties because they're a really helpful shortcut in the CLI
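A tiny self-contained sketch (hypothetical Header/ScoringFile classes) of why such properties are handy downstream:

```python
# Hypothetical minimal versions of the ScoringFile properties, plus the
# kind of one-liner they enable in CLI code.
class Header:
    def __init__(self, genome_build, HmPOS_build=None):
        self.genome_build = genome_build
        self.HmPOS_build = HmPOS_build


class ScoringFile:
    def __init__(self, header):
        self._header = header

    @property
    def is_harmonised(self):
        return self._header.HmPOS_build is not None

    @property
    def genome_build(self):
        # prefer the harmonised build when one is available
        if self.is_harmonised:
            return self._header.HmPOS_build
        return self._header.genome_build


# CLI code can now just ask for the build without poking at header internals:
sf = ScoringFile(Header(genome_build="GRCh37", HmPOS_build="GRCh38"))
print(sf.genome_build)  # GRCh38
```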

Comment on lines -493 to -536
def get_log(self, drop_missing=False, variant_log=None):
    """Create a JSON log from a ScoringFile's header and variant rows."""

    logger.debug(f"Creating JSON log for {self!r}")

    log = {}

    for attr in self._header.fields:
        if (extracted_attr := getattr(self._header, attr, None)) is not None:
            log[attr] = str(extracted_attr)
        else:
            log[attr] = None

    if log["variants_number"] is None:
        # custom scoring files might not have this information
        log["variants_number"] = variant_log["n_variants"]

    if (
        variant_log is not None
        and int(log["variants_number"]) != variant_log["n_variants"]
        and not drop_missing
    ):
        logger.warning(
            f"Mismatch between header ({log['variants_number']}) and output row count ({variant_log['n_variants']}) for {self.pgs_id}"
        )
        logger.warning(
            "This can happen with older scoring files in the PGS Catalog (e.g. PGS000028)"
        )

    # multiple terms may be separated with a pipe
    if log["trait_mapped"]:
        log["trait_mapped"] = log["trait_mapped"].split("|")

    if log["trait_efo"]:
        log["trait_efo"] = log["trait_efo"].split("|")

    log["columns"] = self._fields
    log["use_harmonised"] = self.harmonised

    if variant_log is not None:
        log["sources"] = [k for k, v in variant_log.items() if k != "n_variants"]

    return {self.pgs_id: log}

Member Author

Replaced by the pydantic models + model_dump()
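model_dump() is what replaces the hand-rolled dict building here; generic pydantic usage (not the project's actual models) looks like:

```python
from pydantic import BaseModel


class Example(BaseModel):
    pgs_id: str
    variants_number: int


e = Example(pgs_id="PGS000001", variants_number=77)
# a plain dict, ready for json.dumps or further processing
print(e.model_dump())
# or serialise straight to a JSON string
print(e.model_dump_json())
```

Nested models serialise recursively, which is why the "header" sub-object in the example log falls out for free.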

Comment on lines 384 to 454
class ScoreVariant(CatalogScoreVariant):
    """This model includes attributes useful for processing and normalising variants

    >>> variant = ScoreVariant(**{"rsID": None, "chr_name": "1", "chr_position": 1, "effect_allele": "A", "effect_weight": 0.5, "row_nr": 0, "accession": "test"})
    >>> variant  # doctest: +ELLIPSIS
    ScoreVariant(rsID=None, chr_name='1', chr_position=1, effect_allele=Allele(allele='A', ...
    >>> variant.is_complex
    False
    >>> variant.is_non_additive
    False
    >>> variant.is_harmonised
    False
    >>> variant.effect_type
    EffectType.ADDITIVE

    >>> variant_missing_positions = ScoreVariant(**{"rsID": None, "chr_name": None, "chr_position": None, "effect_allele": "A", "effect_weight": 0.5, "row_nr": 0, "accession": "test"})  # doctest: +ELLIPSIS
    Traceback (most recent call last):
    ...
    TypeError: Bad position: self.rsID=None, self.chr_name=None, self.chr_position=None

    >>> harmonised_variant = ScoreVariant(**{"rsID": None, "chr_name": "1", "chr_position": 1, "effect_allele": "A", "effect_weight": 0.5, "hm_chr": "1", "hm_pos": 1, "hm_rsID": "rs1921", "hm_source": "ENSEMBL", "row_nr": 0, "accession": "test"})
    >>> harmonised_variant.is_harmonised
    True

    >>> variant_badly_harmonised = ScoreVariant(**{"rsID": None, "chr_name": "1", "chr_position": 1, "effect_allele": "A", "effect_weight": 0.5, "hm_chr": None, "hm_pos": None, "hm_rsID": "rs1921", "hm_source": "ENSEMBL", "row_nr": 0, "accession": "test"})
    Traceback (most recent call last):
    ...
    TypeError: Missing harmonised column data: hm_chr

    >>> variant_nonadditive = ScoreVariant(**{"rsID": None, "chr_name": "1", "chr_position": 1, "effect_allele": "A", "effect_weight": 0.5, "dosage_0_weight": 0, "dosage_1_weight": 1, "row_nr": 0, "accession": "test"})
    >>> variant_nonadditive.is_non_additive
    True
    >>> variant_nonadditive.is_complex
    False
    >>> variant_nonadditive.effect_type
    EffectType.NONADDITIVE

    >>> variant_complex = ScoreVariant(**{"rsID": None, "chr_name": "1", "chr_position": 1, "effect_allele": "A", "effect_weight": 0.5, "is_haplotype": True, "row_nr": 0, "accession": "test"})
    >>> variant_complex.is_complex
    True
    """

    model_config = ConfigDict(use_enum_values=True)

    row_nr: int = Field(
        title="Row number",
        description="Row number of variant in scoring file (first variant = 0)",
    )
    accession: str = Field(title="Accession", description="Accession of score variant")
    is_duplicated: Optional[bool] = Field(
        default=False,
        title="Duplicated variant",
        description="In a list of variants with the same accession, is ID duplicated?",
    )

    # column names for output are used by __iter__ and when writing out
    output_fields: ClassVar[tuple[str, ...]] = (
        "chr_name",
        "chr_position",
        "effect_allele",
        "other_allele",
        "effect_weight",
        "effect_type",
        "is_duplicated",
        "accession",
        "row_nr",
    )

    def __iter__(self):
        for attr in self.output_fields:
            yield getattr(self, attr)
Member Author

thought it would be best to keep all pydantic models in this module 🤔

@nebfield nebfield marked this pull request as ready for review September 16, 2024 08:33
@nebfield nebfield requested a review from fyvon September 16, 2024 10:56
@nebfield nebfield merged commit 8989ab1 into fix-nonadditive Sep 24, 2024
6 checks passed
@nebfield nebfield deleted the update-scorelog branch September 24, 2024 15:41
nebfield added a commit that referenced this pull request Oct 7, 2024
* add pydantic dependency

* implement pydantic model drafts

* split models into CatalogScoreVariant and ScoreVariant

* fix doctests

* refactor to use a DictWriter

* fix doctest imports

* refactor classes into separate files

* add non-additive error checking

* skip non-additive scores

* add non-additive CLI tests

* export EffectTypeError

* fix missing CatalogError

* tidy up: put reusable pydantic models in models.py and export them

* simplify n_finished check

* set all effect weight fields to str with coerce_numbers_to_str

* add rsid field validator

* fix field validator field name o_o

* Update score log structure (#49)

* include example of validating a scoring file header

* update header models

* integrate ScoreLog and ScoreLogs

* fix doctest imports

* add log warning re: missing variants

* test log output with incompatible effect type

* make ScoreLogs a RootModel to act as a json list

* refactor weight types from enum to string

* prevent revalidation of ScoreVariants during ScoreLog instantiation

* prevent checking position for complex variants

* fix log creation with missing variants

* update is_non_additive check to support mixed column types

* Tidy up: delete unused target variants module (#47)

* remove targetvariants class

* fix deleted imports

* fix imports

* Gradually introduce type checking (#50)

* set up type checking pre-commit

* set up type checks for pgscatalog.core

* add mypy action

* don't use poetry for mypy

* fix is_complex check

* delete redundant functions

* drop unused NormalisedScoringFile

* rename variants -> variant_sources in ScoreLog

* add support for ancestry specific allele frequencies

* fix writing out

* hm_match_chr and hm_match_pos: treat empty strings as None

* set up a ProcessPoolExecutor skeleton

* fix multiprocessing coverage

* convert empty strings to None: is_haplotype and is_diplotype

* fix TypeError -> ValueError for pydantic

* fix doctest

* add support for variant_type field

* add complex variant tests

* improve model documentation

* bump minor version

* simplify effect_weight_must_float again

* Revert "simplify effect_weight_must_float again"

This reverts commit 0d5c6e5.

* document and improve effect_weight_must_float

* docs: stop being confusing

* cache computed fields in model

* add more CLI logging messages

* add has_complex_alleles computed field

* set up annotated types in models and VariantLog model

* add package mypy config

* ignore typing when reading ScoreVariants as dicts of strings

* clarify log messages when writing out