Spike/POC: Investigate how to implement the business glossary feature in datahub #1302

Closed
4 tasks
teeceeas opened this issue Jan 28, 2025 · 6 comments

teeceeas commented Jan 28, 2025

User Story

As a
Product Manager

I want
The team to investigate options for implementing a feature where users can search for business terms in the data catalogue and retrieve:

  • The definition of the business term.
  • Links to datasets that are related to the term.

So that
Users can better understand business terms and find relevant data to support their work efficiently.

Acceptance Criteria:
Understand technical feasibility:
Investigate whether the current search functionality can incorporate definitions for business terms and linked datasets, or whether this needs a separate search.
Identify where definitions of business terms can be stored.
Explore how these definitions can be maintained, i.e. added, removed or updated.
Explore how to link related datasets.
Explore how to link to a business term from the dataset.
Consider sorting or filtering options to enhance usability.
Ensure definitions and dataset links remain up-to-date with catalogue updates.
Understand what UI changes will be required to support these changes

Value / Purpose

No response

Useful Contacts

No response

User Types

No response

Hypothesis

If we... [do a thing]
Then... [this will happen]

Proposal

No response

Additional Information

No response

Definition of Done

Example - [ ] Documentation has been written / updated

  • README has been updated
  • User docs have been updated
  • Another team member has reviewed
  • Tests are green

Master Business Glossary v.40.pdf

@github-project-automation github-project-automation bot moved this to Todo 📝 in Data Catalogue Jan 28, 2025
@teeceeas teeceeas changed the title Spike: Investigate how to search for business terms and provide definitions with links to related datasets Spike: Investigate how to implement the business glossary feature in datahub Jan 31, 2025
@teeceeas

#85
#69
#83
#68
#103
#86
#204

@teeceeas teeceeas changed the title Spike: Investigate how to implement the business glossary feature in datahub Spike/POC: Investigate how to implement the business glossary feature in datahub Jan 31, 2025
@MatMoore MatMoore self-assigned this Feb 4, 2025

MatMoore commented Feb 4, 2025

Options

Diagram of options

Search design

Both options are possible, but there are some limitations that complicate a combined search, so the preferred option is a separate glossary page.

Separate glossary page

We already have a dedicated glossary page that pulls glossary terms from Datahub. It looks like this:

308605386-8dd73cfe-9f7e-4d09-8c63-824e00a0a5f9.1.mov

It was hidden in #649, but the code is still there, so we can reinstate this design if we want to. We hid it in the first place because most users weren't noticing or interacting with it. However, we could deliberately move it to a different part of the service aimed at governance users. We just need to be clear about who this feature is for.

Combined search option

We could pull back glossary terms as search results in the main search, but terms and term groups do not support tags. This means we can't easily filter them by subject area, and the search query would be more complicated.

There is also a risk that glossary terms clutter up the search results. For example, I search "Electronic Monitoring" and get 20 definitions about electronic monitoring but no datasets. If we try this, we should first do #1056 so that we understand how to customise the search results. We could perhaps limit glossary terms to one per page of search results.
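
A minimal sketch of the "one glossary term per page" idea, assuming each search result is a dictionary with a `type` field. The field name and the `"glossary_term"` value are hypothetical, not what our search service returns today:

```python
def cap_glossary_terms(results, max_terms=1):
    """Keep every dataset result but at most max_terms glossary term results.

    The relative order of the remaining results is preserved.
    """
    kept = []
    terms_seen = 0
    for result in results:
        # "type" and "glossary_term" are hypothetical names for illustration.
        if result["type"] == "glossary_term":
            if terms_seen >= max_terms:
                continue
            terms_seen += 1
        kept.append(result)
    return kept
```

This would only trim a page we have already fetched, so the page would come back shorter. Doing it properly probably depends on what we learn from #1056.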

Since the use case for glossary terms is different from that for datasets, glossary term results might look a bit different even if they are part of the main search results, e.g. by including links to datasets related to the term, like this example on GOV.UK.

Surface glossary term in search results

When glossary terms are assigned to datasets, we can surface them in our existing search results using a clickable link.

In the Datahub interface example below, CRDS is a glossary term:
Screenshot showing a search result. There is a glossary term of "CRDS" as a clickable label below the description

Glossary term pages

It would be helpful to add a page for each glossary term, which shows entities linked to the term. This way users can click on a glossary term when it appears in the metadata, and find related entities.

Export functionality

Elsewhere in the catalogue we have the functionality to export to CSV. The user need for this was sharing metadata with suppliers outside the organisation, and it seems likely that we will need something like this for glossaries as well (unless we implement share links).

Glossary management

Use Datahub's UI

This is the simplest approach to start with. However, Datahub's functionality has some limitations, as there is no way to create versions of a glossary, distinguish between working drafts vs published, or export PDFs.

With some data wrangling I was able to convert the 0.40 PDF into a format I could load into Datahub. We wouldn't want to repeat this work every time EM make changes to the glossary, nor for every other business area. Instead, the users who manage the glossary should make their changes in Datahub themselves, which they can do if they have the right role assigned.

Datahub supports an arbitrary nesting of term groups. We should make use of this and import terms from different sources into different term groups, rather than trying to combine information into one mega glossary for the whole department, which would be impossible for us to maintain and govern right now. All the Electronic monitoring terms should live in a glossary term group called "Electronic monitoring".

Term groups can be assigned owners, and it would save us a lot of hassle if we could require that someone takes responsibility for maintaining these glossary terms as a condition of importing them.
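
As a rough sketch of the "one term group per source, with a named owner" idea, the rows of a `name`/`documentation` CSV could be wrapped in a single group like this. The dictionary keys and the owner value are illustrative assumptions, not a confirmed Datahub schema:

```python
import csv
import io


def load_rows(csvfile):
    """Read rows from a CSV with "name" and "documentation" columns."""
    return list(csv.DictReader(csvfile))


def build_term_group(group_name, rows, owner=None):
    """Wrap the rows in one term group, optionally recording an owner.

    The keys used here ("name", "terms", "description", "owners") are
    illustrative only.
    """
    group = {
        "name": group_name,
        "terms": [
            {"name": row["name"], "description": row["documentation"]}
            for row in rows
        ],
    }
    if owner:
        group["owners"] = [owner]
    return group


# Example with an in-memory CSV and a made-up owner.
sample = io.StringIO("name,documentation\r\nCurfew,A requirement to stay at an address\r\n")
group = build_term_group("Electronic monitoring", load_rows(sample), owner="em-data-team")
```

Keeping the group name and owner as explicit arguments means each imported glossary has to declare both up front.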

Ideally the glossary should be replicated to different environments, so that dev and preprod contain everything that's been added to prod. This is possible in Datahub via the Datahub source or the Metadata file source, but we haven't implemented anything like this so far.

We already refer users back to Datahub for lineage, and glossary management would probably only be used by a small number of people, so I don't think it's a huge problem if this isn't built into Find MoJ data. We would need to document this in our user guide, and audit it for accessibility, as I don't think our previous audits have looked at this part of Datahub.

Find MoJ data

If we do decide to build our own glossary management interface then it would need to cover the following to reach parity with Datahub:

  • Listing the term groups
  • Listing/searching through the terms within a term group
  • Adding a new term to a term group
  • Editing a term
  • Deleting a term
  • Permission checks

Ingest from machine readable format or tool we don't manage

I.e. treat it like other metadata we ingest from elsewhere.

For EM, this is not available, so I think we can discard this option for now.

Stand up a new glossary tool, integrate it with Datahub

I thought this might be another option, but it doesn't seem like there are many tools out there that just focus on publishing glossaries. It's much more common for this to be built into data catalogues. So I've discarded this option as well.

Linking glossary terms to Datasets

Manual linkage

  • For CaDeT models, the data engineers would need to add additional metadata to the meta dictionary in dbt. We would need to modify our ingestion to map this to glossary terms, in a similar manner to the owner metadata.
  • The data engineers would also need to update the database metadata, and we would update the CaDeT databases source.
  • If we assign terms to datasets besides EM, other ingestion sources will need updating as well.
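
To illustrate the first point, here is a hedged sketch of mapping a dbt meta dictionary to glossary term URNs. The `glossary_terms` meta key is hypothetical (nothing in CaDeT uses it yet), and the name-to-URN lookup is assumed to have been fetched from Datahub beforehand:

```python
def terms_from_meta(meta, term_urns, key="glossary_terms"):
    """Return the URNs of the glossary terms named in a dbt meta dictionary.

    meta: the model's meta dictionary from dbt.
    term_urns: lookup from lower-cased term name to term URN.
    key: hypothetical meta key; its value may be a list or a
         comma-separated string of term names.
    """
    names = meta.get(key) or []
    if isinstance(names, str):
        names = names.split(",")

    urns = []
    for name in names:
        urn = term_urns.get(name.strip().lower())
        # Skip names that are not in the glossary, and avoid duplicates.
        if urn and urn not in urns:
            urns.append(urn)
    return urns
```

Unknown names are silently dropped in this sketch. In a real ingestion we would probably want to log them, so data engineers can spot typos.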

Automated linkage

It is fairly easy to generate a list of glossary terms for each dataset by looking for their presence in titles/descriptions. The resulting metadata can be ingested via Datahub's csv enricher.

@MatMoore MatMoore moved this from Todo 📝 to In Progress 🚀 in Data Catalogue Feb 4, 2025

MatMoore commented Feb 4, 2025

Proof of concept

branch https://github.com/ministryofjustice/find-moj-data/compare/fmd-1302-bring-back-the-glossary-poc

  • add back in the glossary link
  • add glossary terms to search results
  • add a glossary term page, showing linked datasets
  • enable exporting glossary terms
  • manually populate EM term definitions in Datahub dev
  • get rid of glossary term groups we aren't using in Datahub dev
  • fix pagination

Here's the glossary in spreadsheet format, with descriptions converted to markdown https://docs.google.com/spreadsheets/d/180F1I9rfD-zUsx9NF4Vo8tSwub9-6CNL_TehimH_1T8/edit?usp=sharing

To make this I first ran https://github.com/VikParuchuri/marker to generate some markdown, then I manually corrected mistakes, then I ran this code to generate a CSV:

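import csv


def get_titles_and_descriptions(f):
    """Yield (title, description) pairs from markdown with one "# " heading per term."""
    title = None
    current_entry = []
    for line in f:
        line = line.strip()
        if line.startswith("# "):
            # A new heading closes the previous entry, if there was one.
            if title:
                yield (title, "\n".join(current_entry).strip())
            title = line[2:]
            current_entry = []
        else:
            current_entry.append(line)

    # Emit the final entry, unless the file contained no headings at all.
    if title:
        yield (title, "\n".join(current_entry).strip())


if __name__ == "__main__":
    with open("glossary.md", encoding="utf-8") as f:
        # newline="" stops the csv module writing blank rows on Windows.
        with open("glossary.csv", "w", newline="", encoding="utf-8") as csvfile:
            writer = csv.writer(csvfile)
            writer.writerow(("name", "documentation"))

            for title, description in get_titles_and_descriptions(f):
                writer.writerow((title, description))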

Limitations

  • The current glossary page is quite slow and may need to be optimised. This will get worse as we expand the number of terms it's fetching.
  • Managing definitions in Datahub might not meet our needs long term, but I suggest we give it a go with EM and LAA, and see how it all tests with users. If we find Datahub does not meet our needs, we can consider contributing upstream, standing up a separate tool and integrating it with Datahub, or building our own interface.
  • Users might not understand what will happen when they click on a glossary term from the search result page - this came up in previous research around tags.


MatMoore commented Feb 4, 2025

We had a quick catchup today with Neil and Matt, who is the data eng lead on EM.

The EM glossary was authored by Clare; it's not an export from any existing tool. It is expected to be updated by other members of the team. Its primary purpose is to provide documentation to the supplier who are building the new system.

Based on this I think we can work off the assumption that we can do a one-off population of the glossary into Datahub, and from then on out we would ask them to manage their glossary in Datahub or Find MoJ data (if we build our own editing interface). But it sounds like there is a need to also export the glossary, just like we allow exporting of schemas and table lists, so that metadata can be easily shared with external suppliers without having to onboard them onto our EntraID tenant.

@MatMoore MatMoore moved this from In Progress 🚀 to Review 🛂 in Data Catalogue Feb 10, 2025

MatMoore commented Feb 11, 2025

Next steps


MatMoore commented Feb 11, 2025

Questions from the PoC review today:

Q. Can we link glossary terms to columns in addition to entities?
A. Yes - I think this would be very helpful for understanding what's in a table, although it will be easier to start by focusing on the entity <-> glossary term links first, and then expand on it later.

Q. Is there any workflow functionality for the glossary?
A. No, but we could potentially ask people to draft their glossary in preprod, and then transfer to prod when ready.

Q. Does asking people to edit directly in Datahub mean Datahub becomes the source of truth for glossaries? This is a departure from the current situation where everything we ingest can be reingested from somewhere else.
A. Yes - this affects our disaster recovery plan, as we would be more reliant on backups. We considered editing glossary terms in YAML files through GitHub, but that would be less user friendly than Datahub's UI.

Another suggestion for exporting glossaries is to work on a print stylesheet, so users can save to PDF from the glossary page.

@MatMoore MatMoore moved this from Review 🛂 to Done ✅ in Data Catalogue Feb 11, 2025
@MatMoore MatMoore closed this as completed by moving to Done ✅ in Data Catalogue Feb 11, 2025