Add schema versioning for derived tables #11446

Open
wants to merge 1 commit into
base: master

sheridancbio (Contributor) commented Mar 11, 2025

Fix # (see https://help.github.com/en/articles/closing-issues-using-keywords)

Describe changes proposed in this pull request:

  • derived_table.version is reported by the Info API
  • the derived_table.version value in pom.xml should match the value in the database info table (needed for the backend to function)
  • derived_table.version does not currently affect importers (importers only import into MySQL, not ClickHouse)

To ensure that the proper ClickHouse derived table construction logic is applied (based on the cBioPortal backend version that the installer/deployer is running), a version label (following the same pattern as db.version / DB_SCHEMA_VERSION) is added to the info table in the database and to the pom.xml file. The initial schema version is '1.0.0'.
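The matching rule described above can be sketched as a startup check. This is a minimal illustration, not the PR's implementation: `parse_version`, `check_derived_table_version`, and `DerivedTableVersionMismatch` are hypothetical names, and the sketch assumes version labels use a `major.minor.patch` form like the initial '1.0.0'.

```python
class DerivedTableVersionMismatch(RuntimeError):
    """Raised when the build's version label differs from the database's."""


def parse_version(label: str) -> tuple[int, int, int]:
    """Parse a 'major.minor.patch' label such as '1.0.0' into integers."""
    parts = label.strip().split(".")
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        raise ValueError(f"not a major.minor.patch version label: {label!r}")
    major, minor, patch = (int(p) for p in parts)
    return major, minor, patch


def check_derived_table_version(expected: str, stored: str) -> None:
    """Compare the label built into the backend with the one in the info table.

    `expected` stands in for the derived_table.version value from pom.xml and
    `stored` for the value read from the database info table. Both are assumed
    to be supplied by the caller; how they are loaded is outside this sketch.
    """
    if parse_version(expected) != parse_version(stored):
        raise DerivedTableVersionMismatch(
            f"backend expects derived table version {expected}, "
            f"database reports {stored}"
        )
```

Calling `check_derived_table_version("1.0.0", "1.0.0")` returns silently, while a differing pair raises the mismatch error.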

A 'versions' subdirectory is added to src/main/resource/db_scripts/clickhouse. It is intended to store a copy of every version of the derived table construction scripts, so that finding old versions of the derived table schema does not require another dimension of repo tagging or documentation maintenance.


inodb (Member) commented Mar 11, 2025

This looks good! Thanks!

Few questions:

  • Should the derived table version live in the clickhouse db instead?
  • Do we have to capture the various clickhouse schema versions in the versions/ folder? Why not have only one version? The schema would be included in the tagged docker file, so not sure why the app or importer would need to know about other versions?

sheridancbio (Contributor, Author) commented

> This looks good! Thanks!
>
> Few questions:
>
>   • Should the derived table version live in the clickhouse db instead?
>   • Do we have to capture the various clickhouse schema versions in the versions/ folder? Why not have only one version? The schema would be included in the tagged docker file, so not sure why the app or importer would need to know about other versions?

Currently the entire content of the MySQL database also lives inside the ClickHouse database, so putting the version into MySQL also puts it into ClickHouse. It is also currently true that nothing "first class" lives inside the ClickHouse database: all data in ClickHouse is either a direct copy from MySQL or is derived from MySQL through joins of the copied data.

Introducing a first class table or record into ClickHouse (making ClickHouse the authoritative source for that content) would make the API more difficult to program. If a query comes to /api/info, for example, the handler would need to gather some of the requested state information from MySQL and some from ClickHouse. In a system where ClickHouse is not configured, it would need to skip the ClickHouse query to avoid producing an error. So we would need somewhat divergent copies of the handler code: one form for ClickHouse-enabled installations and a slightly different form for MySQL-only installations.

This could be done, but my thought was that programming a distinction between the two types of installations would be wasted effort, given that we hope/plan to eliminate the MySQL database. As long as that is the direction we are moving, I think we can instead focus on making all data in the ClickHouse database first class and on handling the loss of constraint satisfaction. By not splitting the location of schema version values now, we do not need to re-integrate them later.
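The trade-off argued here can be illustrated with a sketch of the two handler shapes. Everything in it is hypothetical: `InfoRow`, the response keys, and both functions are stand-ins chosen for illustration, not the cBioPortal API.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class InfoRow:
    """Hypothetical stand-in for the row in the MySQL info table."""
    db_schema_version: str
    derived_table_version: str


def info_single_source(row: InfoRow) -> dict:
    """Both version labels come from one place, so one handler fits every install."""
    return {
        "dbVersion": row.db_schema_version,
        "derivedTableVersion": row.derived_table_version,
    }


def info_split_source(
    db_schema_version: str,
    fetch_from_clickhouse: Optional[Callable[[], str]],
) -> dict:
    """Alternative shape: the derived table version lives only in ClickHouse.

    The handler must branch on whether ClickHouse is configured, which is
    the divergence between installation types that the comment argues against.
    """
    response = {"dbVersion": db_schema_version}
    if fetch_from_clickhouse is not None:
        # Only ClickHouse-enabled installations may issue this query.
        response["derivedTableVersion"] = fetch_from_clickhouse()
    return response
```

In the split form, a MySQL-only installation passes `None` and gets a response without the derived table version, so callers of the endpoint would also have to handle two response shapes.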

As for not collecting the history of ClickHouse schema versions in a subdirectory, I think that change makes sense. I debated this, and I recognize that the subdirectory would be a departure from prior practice. So I'll do this:

  • drop the versions subdirectory
  • add a documentation page that explains how a derived_table_version identifier is connected to the point in the git repo history whose content defines that version.

I believe we will need to maintain a record as the version number is incremented. Potentially we could have a policy of only taking a snapshot of the derived_table_version at points where we have a defined release of cBioPortal, so that for each cBioPortal release number we would know which derived_table_schema was used.

I believe the derived_table_schema is going to change much less frequently than cBioPortal releases occur. But we may still have two or more increments to the derived_table_schema during a single development period between cBioPortal releases, so some of these "in-between" increments may not get captured. That is OK if we only expect deployers to deploy a tagged, identified version of cBioPortal. So the version mapping documentation may be a table with four recorded fields:

  • cBioPortal release version
  • db_schema_version
  • derived_table_version
  • link to supporting docker-compose image for database tools

Or, if we are willing to give up the ability to go back and correct past mistakes/omissions from prior tagged version releases, we could instead have a single documentation page which says:

"as of this time (the current tagged cBioPortal release), the db_schema_version is X, the derived_table_schema_version is Y, and the link to the built set of database tools for this version of cBioPortal is L"

We would then need to maintain only this single set of references (which would change only when the schemas change).
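As a sketch of the four-field mapping table proposed above, each release could be kept as one record. The class name, field names, and lookup helper are hypothetical, and every value in the example row is a placeholder except the initial derived table version '1.0.0' from the PR description.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class VersionMapping:
    """One row of the proposed mapping table (field names are hypothetical)."""
    cbioportal_release: str
    db_schema_version: str
    derived_table_version: str
    database_tools_link: str


def find_mapping(
    rows: list[VersionMapping], release: str
) -> Optional[VersionMapping]:
    """Return the row recorded for a tagged release, or None if not recorded."""
    for row in rows:
        if row.cbioportal_release == release:
            return row
    return None


# Placeholder data only; real releases, versions, and links would go here.
EXAMPLE_ROWS = [
    VersionMapping("<release-A>", "<db-schema-A>", "1.0.0", "<link-A>"),
]
```

A lookup for an untagged, in-between state returns `None`, which mirrors the point that increments between releases may not be captured.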
