Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Updates to Coding in gks-core-source.yaml #107

Open
wants to merge 1 commit into
base: 1.x
Choose a base branch
from

Conversation

mbrush
Copy link
Contributor

@mbrush mbrush commented Dec 27, 2024

These changes aim to address feedback in the gks va-spec ballot release discussion here: ga4gh/va-spec#234 (comment)

  • fixes a non-resolving url example in the Coding definition
  • updates recommendations in system and code attribute descriptions
  • adds two exploratory, optional attributes that would allow complete specification of prefix and prefix expansion where useful (systemPrefix and systemPrefixExpansion) .

- fixes the non-resolving url example in the Coding definition
- updates recommendations in several Coding attribute descriptions
- adds two exploratory, optional attributes that would allow complete specification of prefix and prefix expansion where useful.

These changes aim to address feedback in the gks va-spec ballot release discussion here: ga4gh/va-spec#234 (comment)
description: >-
A symbol uniquely identifying the concept, as in a syntax defined by the code system.
CURIE format is preferred where possible (e.g. 'SO:0000704' is the CURIE form of the
Sequence Ontology code for 'gene'). See https://www.w3.org/TR/curie/.
label:
type: string
description: The human-readable name for the coded concept, as defined by the code system.
system:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, system is very broad and can lead to ambiguity / inconsistencies since it can be free-text name, IRI, or URL. Have you considered splitting this out into multiple fields (i.e systemName, systemIri, and systemUrl) and making at least of of them required?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I kind of like this idea. although we may not need both systemIri and systemUrl attributes.

Beyond this, I think we need to have a broader discussion about what it is we want the Coding object to do and not to do, and if/how we want to commit to use of CURIEs as values in this object. And generally how we want to handle enumerations that may be wrapped in MappableConcepts and Codings. It still feels a bit complicated and unclear to me.

The use of CURIEs in the linked data / linkML world provides a way to bypass the need to explicitly list out the system and code separately in the data - and relies on a prefix mapping document that lives outside the data for this. We are recommending use of CURIEs here in the context of a more FHIR-like MappableConcept and Coding objects that also capture the system url and code, and now possibly a prefix and prefix expansion. This all seems duplicative and likely to lead to confusion. I think we need to tease apart what we want here and implement something clearer to support it.

Copy link

@jsstevenson jsstevenson Dec 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if there could/should be multiple types of Coding? IE one for CURIEs and another for FHIR-style. Flexibility to support either seems like a good thing, but unless you think there are cases where a user wants to provide representation for both or a mixture therein, it might be simpler to just force them to choose one of the two types.

I appreciate the message compactness that a separate prefix expansion document provides -- it might be good to codify this in some kind of GKS object, though, just so that there's an expected standard for how to access it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is one option for how to approach the issue, that favors flexibility over interoperability (i.e. enforcing a single way of doing things that would ensure interoperable data) . . . The broader discussion I want to have is what we prioritize in the short and long term to ensure success of the va-spec. Our initial spec leans more toward flexibility over interoperability - to lower the barrier to early adoption. But in the long run if we true interoperability, we will either need to converge on a single way of doing things, or provide tooling that supports automated transformations between alternate representations that can be used to derive interoperable data from divergent representations. The linkML framework is doing some interesting work in this area that I am keen to explore.

systemVersion:
type: string
description: Version of the terminology or code system that provided the code.
code:
$ref: '#/$defs/code'
systemPrefix:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'm a little confused on the purpose of systemPrefix if code is preferred to be a CURIE (systemPrefix:identifier).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

systemPrefix and systemPrefixExpansion would just allow data providers to be explicit about what the prefix is, and what it expands to in order to generate a resolvable url - if they wanted to. This kind of info is often captured in prefix mapping sections of schema in the linkML / linked data world. Not sure we need it here right now - but in combination with the other proposed changes it lets a data provider be very clear about the system, the code from that system, and how to resolve this code to find more info about it (in cases where the code is a CURIE with a url prefix that lets it resolve). Whether this is useful/necessary to support is another question.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We also want to make sure the json/rst files are updated too

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updating the examples would also be really helpful

korikuzma added a commit to cancervariants/gene-normalization that referenced this pull request Dec 30, 2024
korikuzma added a commit to cancervariants/gene-normalization that referenced this pull request Dec 30, 2024
korikuzma added a commit to cancervariants/gene-normalization that referenced this pull request Dec 30, 2024
description: >-
A symbol uniquely identifying the concept, as in a syntax defined by the code system.
CURIE format is preferred where possible (e.g. 'SO:0000704' is the CURIE form of the
Sequence Ontology code for 'gene'). See https://www.w3.org/TR/curie/.
label:
type: string
description: The human-readable name for the coded concept, as defined by the code system.
system:
type: string
description: >-
The terminology/code system that defined the code. May be reported as a free-text name
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mbrush , @ahwagner and I had a discussion regarding system when reviewing my draft PR in va-spec where I initially used free text label for this field. However, @ahwagner believes URI should be used to align with FHIR.

Tagging @larrybabb to follow too

Comment on lines +194 to +196
A symbol uniquely identifying the concept, as in a syntax defined by the code system.
CURIE format is preferred where possible (e.g. 'SO:0000704' is the CURIE form of the
Sequence Ontology code for 'gene'). See https://www.w3.org/TR/curie/.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mbrush "A symbol uniquely identifying the concept, as in a syntax defined by the code system." and "CURIE format is preferred where possible" seem to conflict. For example, NCIt code syntax is C\d+. But to me, the CURIE format part could mean that we should use ncit:C\d+.

A previous description version states that code should be "Symbol in syntax defined by the terminology system."

I think we should consider updating the description to:

Suggested change
A symbol uniquely identifying the concept, as in a syntax defined by the code system.
CURIE format is preferred where possible (e.g. 'SO:0000704' is the CURIE form of the
Sequence Ontology code for 'gene'). See https://www.w3.org/TR/curie/.
A symbol uniquely identifying the concept, as in a syntax defined by the code system.
(e.g. 'SO:0000704' is the proper code syntax defined by Sequence Ontology for 'gene'
and 'C287' is the proper code syntax defined by NCIt for 'aspirin').

Tagging @ahwagner @larrybabb for comment

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Kori. I think a broader discussion is needed around if and how strongly we recommend the use of CURIEs - so we can address issues like the one you raise, and others that have come up due to our trying to support use of CURIES in a FHIR-like Coding object that was not designed to support them. This will definitely be on the agenda to address over the next few weeks as we finalize things for the 1.0 release.

@mbrush
Copy link
Contributor Author

mbrush commented Jan 28, 2025

Longer term (or perhaps even now), esp. if we decide to capture more information about a 'system' that provided a code, it may be worth considering crating a Class to represent such systems. In other models I have seen/developed, there is the notion of an 'InformationResource. For example, the Biolink model defines an InformationResource as:

"A database or knowledgebase and its supporting ecosystem of interfaces and services that deliver content to consumers (e.g. web portals, APIs, query endpoints, streaming services, data downloads, etc.). A single Information Resource by this definition may span many different datasets or databases, and include many access endpoints and user interfaces".

And in projects that use Biolink, such as the NCATS Data Translator, the InformationResource model and corresponding 'InfoRes Catalog' is used to capture provenance of where knowledge and terms come from.

Terminology systems like thesauri, ontologies, etc, represent one kind of InformationResource. We could define a model for describing such resources, that can be populated to capture info about a term / code system within a Coding. And even use the Biolink InfoRes Catalog to provide identifiers and metadata.

e.g. a model may look like:

InformationResource:
  inherits: Entity
  type: object
  properties:
    name:
    homepage_url:
    content_url:
    prefix:
    prefixExpansions:
    version:

Then a Coding instance may look like:

- coding: 
    code: SO:0000704
    label: Gene
    system: 
      id:  infores:so
      name: Sequence Ontology
      homepage_url: http://www.sequenceontology.org/
      content_url: http://purl.obolibrary.org/obo/so.owl
      version: 2.1.3
      prefix: SO
      prefixExpansions: 
          - http://purl.obolibrary.org/obo/SO_
          - http://purl.bioontology.org/ontology/SO/SO
          - http://www.sequenceontology.org/miso/current_release/term/SO:

We likely wouldn't want or need to pass along all this metadata about the system in the message every time a term form the system was used - so we could establish or re-use infrastructure that lets us provide a url that resolves to such info. And only include critical info about the system in the message itself.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants