
Spike: Find out which source systems have table/column descriptions available to pull direct from source databases #921

Closed
LavMatt opened this issue Oct 3, 2024 · 5 comments
LavMatt commented Oct 3, 2024

User Story

As a developer
I want to understand and document the metadata available for column and table descriptions in source databases
So that I can work towards ingesting it into the find-moj-data service

Value / Purpose

No response

Useful Contacts

Data engineering domain leads

User Types

No response

Hypothesis

If we... [do a thing]
Then... [this will happen]

Proposal

We know that description metadata are available to pull directly from source databases (like nomis and oasys), and that some of these metadata are already ingested into the AP and Glue by DE pipelines.

What we don't know is where data are missing - i.e. what is available and could potentially be included in data engineering pipelines.

We should find out!

We should also document the source databases that are external to create-a-derived-table but are in the Glue catalog. These are not currently ingested; they probably should be, and they might also have metadata available.

This is potentially quite a big ask, so we may want to split the work further during a refinement session:

  1. Document the gaps as they stand in the Glue catalog, by database.
  2. Then create individual tickets, per database or per domain, to investigate whether the missing metadata can be pulled directly from source.

This spike could potentially cover both of those tasks.
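The gap audit in step 1 could be scripted. Below is a minimal sketch of the audit logic, operating on plain dicts shaped like the Glue `get_tables` API response so it can be tried without AWS credentials; the database, table, and column names are hypothetical, not real AP content:

```python
def audit_description_gaps(databases):
    """Summarise which databases/tables lack description metadata.

    `databases` maps database name -> list of table dicts shaped like
    the Glue get_tables API response.
    """
    report = {}
    for db_name, tables in databases.items():
        missing_table_desc = []
        missing_column_desc = []
        for table in tables:
            # Table-level description lives in the top-level "Description" key.
            if not table.get("Description"):
                missing_table_desc.append(table["Name"])
            # Column-level descriptions live in each column's "Comment" key.
            cols = table.get("StorageDescriptor", {}).get("Columns", [])
            if any(not col.get("Comment") for col in cols):
                missing_column_desc.append(table["Name"])
        report[db_name] = {
            "tables": len(tables),
            "missing_table_description": missing_table_desc,
            "missing_column_descriptions": missing_column_desc,
        }
    return report


# Hypothetical sample shaped like Glue metadata -- not real AP databases.
sample = {
    "example_db": [
        {
            "Name": "offenders",
            "Description": "Offender records",
            "StorageDescriptor": {
                "Columns": [
                    {"Name": "id", "Type": "bigint", "Comment": "Primary key"},
                    {"Name": "dob", "Type": "date"},  # no Comment -> a gap
                ]
            },
        },
        {"Name": "sentences", "StorageDescriptor": {"Columns": []}},
    ]
}

print(audit_description_gaps(sample))
```

With boto3, the input would come from the Glue catalog itself, e.g. iterating `glue.get_paginator("get_tables")` per database.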

Additional Information

No response

Definition of Done

  • Create a list of all source databases on the AP (in the Glue catalog) and not on the AP, and indicate whether the AP-available databases are included in the current CaDeT ingestion.
  • Document the metadata gaps as they stand in the Glue catalog, by database, noting in the list whether description metadata are available in Glue.
  • Create individual tickets, per database or per domain, to investigate whether the missing metadata can be pulled directly from source.
@MatMoore MatMoore self-assigned this Oct 23, 2024

MatMoore commented Oct 23, 2024

I've added a dump of all databases in the AP to the SharePoint, and filtered out dev/sandbox stuff and staging data.

I'm in the process of trying to identify which ones are sources. In addition to the ones we know about, there are also things like the People Survey, and more general-purpose data such as Ordnance Survey data. I'm assuming all of this is worth cataloguing.

IMO we don't need to understand everything that's in the AP (if that is even possible), but a decent number of these do have at least a database-level description in Glue, so I think we can have one task just to ingest those as they are, and do that first.
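That "ingest the described ones first" triage could be sketched as a simple split over the Glue database list. A minimal sketch, again on plain dicts shaped like the Glue `get_databases` response so it needs no AWS access; the database names are made up:

```python
def databases_with_descriptions(database_dicts):
    """Split Glue database entries by whether they carry a Description.

    Each entry is shaped like an item in the Glue get_databases
    "DatabaseList"; the names used below are hypothetical.
    """
    described, undescribed = [], []
    for db in database_dicts:
        (described if db.get("Description") else undescribed).append(db["Name"])
    return described, undescribed


# With boto3 the entries would come from something like:
#   paginator = boto3.client("glue").get_paginator("get_databases")
#   entries = [db for page in paginator.paginate() for db in page["DatabaseList"]]
sample = [
    {"Name": "nomis_raw", "Description": "NOMIS extract"},
    {"Name": "sandbox_tmp"},
]
described, undescribed = databases_with_descriptions(sample)
print(described, undescribed)
```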

Then there are those source systems we've identified but don't have any glue metadata. If we want to get this into the catalogue quickly, we could come up with database-level descriptions ourselves based on information on confluence etc. Then I think we can create tickets to engage the pipeline owners and enrich the table-level metadata via the glue catalogue.

I've also added a tab for derived data that is not coming from CaDeT, such as the Data First outputs. I'm assuming this is also valuable to catalogue, even though it doesn't fall under the scope of this ticket.


murdo-moj commented Oct 23, 2024

Matt L and I are talking to Oliver Critchfield on Fri 25th. He's in charge of HMCTS datasets on the data engineering side.

@murdo-moj

I asked the data modellers for more datasets that are in Glue but not CaDeT; there was no response, so we will need to target leads directly. https://asdslack.slack.com/archives/C03J21VFHQ9/p1729176561022829

@murdo-moj

Chat with Oliver:

  • Column definitions in Glue for magistrates are in code and were created via Airflow, so analytics/data engineers can flow changes through to the catalogue via their Airflow code. This should generalise to other Airflow-processed pipelines.
  • Source system data for CCD (which will replace Caseman, familyman, pcol, probate) is available, but contains thousands of definitions. The analytics engineers like having definitions for the tables they make, so that they can add to the source system metadata.
  • Oliver said he's happy to create tickets to enrich Glue databases with poor metadata, e.g. pcol. So we should ingest databases that currently have only schema information, so they can be enriched.
  • He seemed very happy to see the catalogue! I don't think he'd explored it before. He especially asked about the lineage, which we could show him there and then, which was nice.
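The first bullet above boils down to setting the `Comment` field on each column in the table's Glue definition. A minimal sketch of how an Airflow task might assemble that update; the `update_table` call is shown only in a comment, and all table/column names are hypothetical:

```python
import copy


def attach_column_comments(table_input, comments):
    """Return a copy of a Glue TableInput with column Comments filled in.

    `comments` maps column name -> description; columns not in the map
    keep whatever Comment they already had. Names here are hypothetical.
    """
    updated = copy.deepcopy(table_input)
    for col in updated["StorageDescriptor"]["Columns"]:
        if col["Name"] in comments:
            col["Comment"] = comments[col["Name"]]
    return updated


table_input = {
    "Name": "mags_cases",
    "StorageDescriptor": {
        "Columns": [
            {"Name": "case_id", "Type": "bigint"},
            {"Name": "hearing_date", "Type": "date"},
        ]
    },
}
updated = attach_column_comments(table_input, {"case_id": "Unique case identifier"})

# An Airflow task would then push this to the catalogue, e.g.:
#   boto3.client("glue").update_table(DatabaseName="mags_curated_v3",
#                                     TableInput=updated)
print(updated["StorageDescriptor"]["Columns"][0]["Comment"])
```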

Let's start with:

  • familyman_live_v4
  • mags_curated_v3
  • sop_preprocessed
  • sop_base
  • sop_transformed_v1_ac
  • contracts_rio_v1
  • contracts_jaggaer_v1

Once they are on Find MoJ data, we can feed these back to the analytics engineers. They can make any adjustments they want to the metadata, and this might trigger a second wave of requests to ingest databases 🤞

Status: Done ✅