
Spike: Find out which source systems have table/column descriptions available to pull direct from source databases #921

Closed
LavMatt opened this issue Oct 3, 2024 · 5 comments
LavMatt commented Oct 3, 2024

User Story

As a developer
I want to understand and document the metadata available for column and table descriptions in source databases
So that I can work towards ingesting it into the find-moj-data service

Value / Purpose

No response

Useful Contacts

Data engineering domain leads

User Types

No response

Hypothesis

If we... [do a thing]
Then... [this will happen]

Proposal

We know that description metadata are available to pull directly from source databases (like nomis and oasys), and that some of these metadata are already ingested into the AP and Glue by DE pipelines.

What we don't know is where data are missing - i.e. what is available and could potentially be included in data engineering pipelines.

We should find out!

We should also document the source databases that are external to create-a-derived-table but are in the Glue catalog. These are not currently ingested; they probably should be, and they might also have metadata available.

This is potentially quite a big ask, so we may want to split the work further during a refinement session:

  1. Document the gaps as they stand in the Glue catalog, by database.
  2. Then create individual tickets, per database or per domain, to investigate whether the missing metadata can be pulled directly from source.

This spike could potentially cover both of those tasks.
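The gap audit in step 1 could be scripted. Below is a minimal sketch of the audit logic, operating on plain dicts shaped like the Glue `get_tables` API response so it can be tried without AWS credentials; the database, table, and column names are hypothetical, not real AP content:

```python
def audit_description_gaps(databases):
    """Summarise which databases/tables lack description metadata.

    `databases` maps database name -> list of table dicts shaped like
    the Glue get_tables API response.
    """
    report = {}
    for db_name, tables in databases.items():
        missing_table_desc = []
        missing_column_desc = []
        for table in tables:
            # Table-level description lives in the top-level "Description" key.
            if not table.get("Description"):
                missing_table_desc.append(table["Name"])
            # Column-level descriptions live in each column's "Comment" key.
            cols = table.get("StorageDescriptor", {}).get("Columns", [])
            if any(not col.get("Comment") for col in cols):
                missing_column_desc.append(table["Name"])
        report[db_name] = {
            "tables": len(tables),
            "missing_table_description": missing_table_desc,
            "missing_column_descriptions": missing_column_desc,
        }
    return report


# Hypothetical sample shaped like Glue metadata -- not real AP databases.
sample = {
    "example_db": [
        {
            "Name": "offenders",
            "Description": "Offender records",
            "StorageDescriptor": {
                "Columns": [
                    {"Name": "id", "Type": "bigint", "Comment": "Primary key"},
                    {"Name": "dob", "Type": "date"},  # no Comment -> a gap
                ]
            },
        },
        {"Name": "sentences", "StorageDescriptor": {"Columns": []}},
    ]
}

print(audit_description_gaps(sample))
```

With boto3, the input would come from the Glue catalog itself, e.g. iterating `glue.get_paginator("get_tables")` per database.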

Additional Information

No response

Definition of Done

  • Create a list of all source databases on the AP (in the Glue catalog) and not on the AP, and indicate whether the AP-available databases are included in the current CaDeT ingestion.
  • Document the metadata gaps as they stand in the Glue catalog, by database, noting in the list whether description metadata are available in Glue.
  • Create individual tickets, per database or per domain, to investigate whether the missing metadata can be pulled directly from source.
@MatMoore MatMoore self-assigned this Oct 23, 2024

MatMoore commented Oct 23, 2024

I've added a dump of all databases in the AP to the SharePoint, and filtered out dev/sandbox stuff and staging data.

I'm in the process of trying to identify which ones are sources. In addition to the ones we know about, there are also things like the People Survey, and more general-purpose data such as Ordnance Survey data. I'm assuming all of this is worth cataloguing.

IMO we don't need to understand everything that's in the AP (if that is even possible), but a decent number of these do have at least a database-level description in Glue, so I think we can have one task just to ingest those as they are, and do that first.
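That "ingest the described ones first" triage could be sketched as a simple split over the Glue database list. A minimal sketch, again on plain dicts shaped like the Glue `get_databases` response so it needs no AWS access; the database names are made up:

```python
def databases_with_descriptions(database_dicts):
    """Split Glue database entries by whether they carry a Description.

    Each entry is shaped like an item in the Glue get_databases
    "DatabaseList"; the names used below are hypothetical.
    """
    described, undescribed = [], []
    for db in database_dicts:
        (described if db.get("Description") else undescribed).append(db["Name"])
    return described, undescribed


# With boto3 the entries would come from something like:
#   paginator = boto3.client("glue").get_paginator("get_databases")
#   entries = [db for page in paginator.paginate() for db in page["DatabaseList"]]
sample = [
    {"Name": "nomis_raw", "Description": "NOMIS extract"},
    {"Name": "sandbox_tmp"},
]
described, undescribed = databases_with_descriptions(sample)
print(described, undescribed)
```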

Then there are those source systems we've identified but don't have any glue metadata. If we want to get this into the catalogue quickly, we could come up with database-level descriptions ourselves based on information on confluence etc. Then I think we can create tickets to engage the pipeline owners and enrich the table-level metadata via the glue catalogue.

I've also added a tab for derived data that is not coming from CaDeT, such as the Data First outputs. I'm assuming this is also valuable to catalogue, even though it doesn't fall under the scope of this ticket.


murdo-moj commented Oct 23, 2024

Matt L and I are talking to Oliver Critchfield on Fri 25th. He's in charge of HMCTS datasets on the data engineering side.

@murdo-moj

I asked the data modellers for more datasets that are in Glue but not CaDeT; there was no response, so we will need to target leads directly. https://asdslack.slack.com/archives/C03J21VFHQ9/p1729176561022829

@murdo-moj

Chat with Oliver:

  • Column definitions in Glue for magistrates are in code and were created via Airflow, so analytics/data engineers can flow changes through to the catalogue via their Airflow code. This should generalise to other Airflow-processed pipelines.
  • Source system data for CCD (which will replace Caseman, familyman, pcol, probate) is available, but contains thousands of definitions. The analytics engineers like having definitions for the tables they make, so that they can add to the source system metadata.
  • Oliver said he's happy to create tickets to enrich Glue databases with poor metadata, e.g. pcol. So we should ingest databases that currently have only schema information, so they can be enriched.
  • He seemed very happy to see the catalogue! I don't think he'd explored it before. He especially asked about the lineage, which we could show him there and then, which was nice.
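The first bullet above boils down to setting the `Comment` field on each column in the table's Glue definition. A minimal sketch of how an Airflow task might assemble that update; the `update_table` call is shown only in a comment, and all table/column names are hypothetical:

```python
import copy


def attach_column_comments(table_input, comments):
    """Return a copy of a Glue TableInput with column Comments filled in.

    `comments` maps column name -> description; columns not in the map
    keep whatever Comment they already had. Names here are hypothetical.
    """
    updated = copy.deepcopy(table_input)
    for col in updated["StorageDescriptor"]["Columns"]:
        if col["Name"] in comments:
            col["Comment"] = comments[col["Name"]]
    return updated


table_input = {
    "Name": "mags_cases",
    "StorageDescriptor": {
        "Columns": [
            {"Name": "case_id", "Type": "bigint"},
            {"Name": "hearing_date", "Type": "date"},
        ]
    },
}
updated = attach_column_comments(table_input, {"case_id": "Unique case identifier"})

# An Airflow task would then push this to the catalogue, e.g.:
#   boto3.client("glue").update_table(DatabaseName="mags_curated_v3",
#                                     TableInput=updated)
print(updated["StorageDescriptor"]["Columns"][0]["Comment"])
```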

Let's start with:

  • familyman_live_v4
  • mags_curated_v3
  • sop_preprocessed
  • sop_base
  • sop_transformed_v1_ac
  • contracts_rio_v1
  • contracts_jaggaer_v1

Once they are on Find MoJ data, we can feed these back to the analytics engineers. They can make any adjustments they want to the metadata, and this might trigger a second wave of requests to ingest databases 🤞

Status: Done ✅