Spike: Find out which source systems have table/column descriptions available to pull direct from source databases #921
Comments
I've added a dump of all databases in the AP to the SharePoint, with dev/sandbox databases and staging data filtered out. I'm in the process of trying to identify which ones are sources. In addition to the ones we know about, there are also things like the People Survey, and more general-purpose data such as Ordnance Survey data. I'm assuming all of this is worth cataloguing.

IMO we don't need to understand everything that's in AP (if that is even possible), but a decent number of these do have at least a database-level description in Glue, so I think we can have one task just to ingest these as they are, and do that first.

Then there are the source systems we've identified that don't have any Glue metadata. If we want to get these into the catalogue quickly, we could come up with database-level descriptions ourselves, based on information on Confluence etc. Then I think we can create tickets to engage the pipeline owners and enrich the table-level metadata via the Glue catalogue.

I've also added a tab for derived data that is not coming from CaDeT, such as the Data First outputs. I'm assuming this is also valuable to catalogue, even though it doesn't fall under the scope of this ticket.
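To separate the databases that already have a database-level description in Glue from those that don't, something like the sketch below might be a starting point. It assumes boto3 with credentials that can read the Glue catalogue; the `eu-west-1` region default and both function names are placeholders for illustration, not anything taken from our existing code.

```python
from typing import Iterable


def split_by_description(databases: Iterable[dict]) -> tuple[list[str], list[str]]:
    """Partition Glue database records into (described, undescribed) names."""
    described, undescribed = [], []
    for db in databases:
        # "Description" is optional: treat a missing, empty or
        # whitespace-only value as "no description".
        has_text = bool((db.get("Description") or "").strip())
        (described if has_text else undescribed).append(db["Name"])
    return sorted(described), sorted(undescribed)


def fetch_glue_databases(region_name: str = "eu-west-1") -> list[dict]:
    """Fetch every database record from the Glue catalogue.

    Hypothetical helper: the region is an assumption and AWS credentials
    are required, so it is kept separate from the pure function above.
    """
    import boto3  # imported lazily so split_by_description works without AWS

    glue = boto3.client("glue", region_name=region_name)
    paginator = glue.get_paginator("get_databases")
    return [db for page in paginator.paginate() for db in page["DatabaseList"]]
```

Calling `split_by_description(fetch_glue_databases())` would give two lists: the first is the candidate set for the "ingest as they are" task, the second is where we'd need to write descriptions ourselves or chase pipeline owners.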
Matt L and I are talking to Oliver Critchfield on Fri 25th. He's in charge of data engineering for the HMCTS datasets.
I asked the data modellers for some more datasets which are in Glue and not CaDeT, but there was no response, so we will need to target the leads directly. https://asdslack.slack.com/archives/C03J21VFHQ9/p1729176561022829
Chat with Oliver:
Let's start with:
Once they are on Find MoJ data, we can feed these back to the analytics engineers. They can make any adjustments they want to the metadata, and this might trigger a second wave of requests to ingest databases 🤞
User Story
As a developer
I want to understand and document the available metadata, with respect to column and table descriptions of source databases
So that I can work towards ingesting it into the find-moj-data service
Value / Purpose
No response
Useful Contacts
Data engineering domain leads
User Types
No response
Hypothesis
If we... [do a thing]
Then... [this will happen]
Proposal
We know there are description metadata available to get directly from source databases (like nomis, oasys) and that some of these metadata are already ingested into the AP and glue by DE pipelines.
What we don't know is where metadata are missing - what is available at source and could potentially be included in data engineering pipelines
We should find out!
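As a rough way of seeing where table and column descriptions are missing for databases already in Glue, a sketch along these lines might help. It assumes boto3 and Glue read access; `description_coverage`, `fetch_glue_tables` and the `eu-west-1` default are made-up names and values for illustration. It only shows what has reached Glue, not what exists in the source systems themselves.

```python
def description_coverage(table: dict) -> dict:
    """Summarise how much description metadata a Glue table record carries."""
    columns = table.get("StorageDescriptor", {}).get("Columns", [])
    # Glue stores column descriptions in the optional "Comment" field.
    described = sum(1 for col in columns if (col.get("Comment") or "").strip())
    return {
        "table": table["Name"],
        "has_table_description": bool((table.get("Description") or "").strip()),
        "columns_total": len(columns),
        "columns_described": described,
    }


def fetch_glue_tables(database_name: str, region_name: str = "eu-west-1") -> list[dict]:
    """Fetch every table record for one Glue database.

    Hypothetical helper: the region is an assumption and AWS credentials
    are required.
    """
    import boto3  # imported lazily so description_coverage works without AWS

    glue = boto3.client("glue", region_name=region_name)
    paginator = glue.get_paginator("get_tables")
    pages = paginator.paginate(DatabaseName=database_name)
    return [tbl for page in pages for tbl in page["TableList"]]
```

Running `description_coverage` over each table of a database gives a per-table count that could go straight into the spreadsheet of sources.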
We should also document the source databases that are outside create-a-derived-table but are in the Glue catalog - these are not currently ingested, but probably should be, and might also have metadata available
This is potentially quite a big ask, so we should maybe split the work further during a refinement session:
This Spike could potentially do those 2 tasks
Additional Information
No response
Definition of Done