Spike/POC: Investigate how to implement the business glossary feature in datahub #1302
Comments
Options

Search design

Both options are possible, but there are some limitations that complicate a combined search, so the preferred option is a separate glossary page.

Separate glossary page

We already have a dedicated glossary page that pulls glossary terms from Datahub. It looks like this:

[Video: 308605386-8dd73cfe-9f7e-4d09-8c63-824e00a0a5f9.1.mov]

It was hidden in #649, but the code is still there, so if we want to reinstate this design we can. We hid it in the first place because most users weren't noticing it and interacting with it; however we could decide to intentionally put this into a different part of the service for governance users. We just need to be clear about who this feature is for.

Combined search option

We could pull back glossary terms as search results in the main search, but terms and term groups do not support tags. This means we can't easily filter them by subject area, and the search query would be more complicated.

There is also a risk that glossary terms clutter up the search results... e.g. I search "Electronic Monitoring" and get 20 definitions about electronic monitoring but no datasets. If we try this, we should first do #1056 so that we understand how to customise the search results. Perhaps limit to one glossary term per page of search results.

Since the use case for glossary terms is different than datasets, search results might look a bit different even if they are part of the main search results. E.g. by including links to datasets related to the term, like this example on GOV.UK.

Surface glossary term in search results

When glossary terms are assigned to datasets, we can surface them in our existing search results using a clickable link. In the Datahub interface example below, CRDS is a glossary term:

Glossary term pages

It would be helpful to add a page for each glossary term, which shows entities linked to the term. This way users can click on a glossary term when it appears in the metadata, and find related entities.
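For the combined search option, the idea of limiting glossary terms to one per page of search results could be sketched roughly as below. This is only an illustration: the `cap_glossary_terms` function, the `type` and `name` keys and the `"glossary_term"` value are hypothetical, and it assumes the page of results has already been fetched as a list of dicts. It is not how Datahub or Find MoJ data currently shape search results.

```python
def cap_glossary_terms(results, max_terms=1):
    """Keep at most max_terms glossary term results on one page.

    `results` is assumed to be a list of dicts with a hypothetical "type" key.
    All other results are kept, and the original ordering is preserved.
    """
    kept = []
    seen_terms = 0
    for result in results:
        if result["type"] == "glossary_term":
            if seen_terms >= max_terms:
                # Drop any further glossary terms so they don't crowd out datasets.
                continue
            seen_terms += 1
        kept.append(result)
    return kept


if __name__ == "__main__":
    page = [
        {"type": "glossary_term", "name": "Term A"},
        {"type": "glossary_term", "name": "Term B"},
        {"type": "dataset", "name": "Dataset 1"},
        {"type": "dataset", "name": "Dataset 2"},
    ]
    for result in cap_glossary_terms(page):
        print(result["type"], result["name"])
```

Filtering after the page has been fetched would leave some pages shorter than others, so in practice the cap would probably be better applied in the search query itself, which is one reason to do #1056 first.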
Export functionality

Elsewhere in the catalogue we have the functionality to export to CSV. The user need for this was sharing metadata with suppliers outside of the organisation, and it seems likely that we will need something like this for glossaries as well (unless we implement share links).

Glossary management

Use Datahub's UI

This is the simplest approach to start with. However, Datahub's functionality has some limitations, as there is no way to create versions of a glossary, distinguish between working drafts vs published, or export PDFs.

With some amount of data wrangling I was able to convert the 0.40 pdf into a format I could load into Datahub. We wouldn't want to repeat this work every time EM make changes to the glossary (nor repeat for every other business area) - we would want the users who manage the glossary to be able to make their changes in Datahub themselves, which they would be able to do if they have the right role assigned.

Datahub supports an arbitrary nesting of term groups. We should make use of this and import terms from different sources into different term groups, rather than trying to combine information into one mega glossary for the whole department, which would be impossible for us to maintain and govern right now. All the Electronic monitoring terms should live in a glossary term group called "Electronic monitoring".

Term groups can be assigned owners, and it would save us a lot of hassle if we could require that someone takes responsibility for maintaining these glossary terms as a condition of importing them.

Ideally the glossary should be replicated to different environments, so that dev, preprod contain everything that's been added to prod. This is possible in Datahub via the Datahub source or Metadata file source but we haven't implemented anything like this so far.
We already refer users back to Datahub for lineage, and it would probably only be used by a small number of people, so I don't think it's a huge problem if this isn't built into Find MoJ data. We would need to document this in our user guide, and audit it for accessibility, as I don't think our previous audits have looked at this part of Datahub.

Find MoJ data

If we do decide to build our own glossary management interface then it would need to cover the following to reach parity with Datahub:
Ingest from machine readable format or tool we don't manage

I.e. treat it like other metadata we ingest from elsewhere. For EM, this is not available, so I think we can discard this option for now.

Stand up a new glossary tool, integrate it with Datahub

I thought this might be another option, but it doesn't seem like there are many tools out there that just focus on publishing glossaries. It's much more common for this to be built into data catalogues. So I've discarded this option as well.

Linking glossary terms to Datasets

Manual linkage
Automated linkage

It is fairly easy to generate a list of glossary terms for each dataset by looking for their presence in titles/descriptions. The resulting metadata can be ingested via Datahub's csv enricher.
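A minimal sketch of the matching step, assuming datasets are available as dicts with `id`, `title` and `description` keys (these names, and the `find_terms` and `link_terms` functions, are hypothetical and not part of the PoC branch). It only produces a mapping from dataset to matched terms; turning that mapping into the column layout the csv enricher expects is not shown here.

```python
import re


def find_terms(text, terms):
    """Return the glossary terms that appear in text as whole words or phrases."""
    matches = []
    for term in terms:
        # Lookarounds stop "tag" matching inside "stage", and still work for
        # terms that start or end with punctuation. Matching is case-insensitive.
        pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
        if re.search(pattern, text, flags=re.IGNORECASE):
            matches.append(term)
    return matches


def link_terms(datasets, terms):
    """Map each dataset id to the terms found in its title or description."""
    links = {}
    for dataset in datasets:
        text = f"{dataset['title']} {dataset['description']}"
        found = find_terms(text, terms)
        if found:
            links[dataset["id"]] = found
    return links


if __name__ == "__main__":
    # Made-up example data, for illustration only.
    terms = ["CRDS", "curfew"]
    datasets = [
        {"id": "dataset_a", "title": "Curfew orders", "description": "Loaded from CRDS."},
        {"id": "dataset_b", "title": "Staff list", "description": "No glossary terms here."},
    ]
    for dataset_id, found in link_terms(datasets, terms).items():
        print(dataset_id, found)
```

Plain text matching like this will produce false positives for common words and miss plurals or abbreviations, so the output is probably best treated as a list of suggestions to review rather than something to ingest blindly.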
Proof of concept

Branch: https://github.com/ministryofjustice/find-moj-data/compare/fmd-1302-bring-back-the-glossary-poc
Here's the glossary in spreadsheet format, with descriptions converted to markdown: https://docs.google.com/spreadsheets/d/180F1I9rfD-zUsx9NF4Vo8tSwub9-6CNL_TehimH_1T8/edit?usp=sharing

To make this I first ran https://github.com/VikParuchuri/marker to generate some markdown, then I manually corrected mistakes, then I ran this code to generate the CSV:

import csv


def get_titles_and_descriptions(f):
    """Yield (title, description) pairs from markdown with one "# " heading per term."""
    title = None
    current_entry = []
    for line in f:
        line = line.strip()
        if line.startswith("# "):
            # A new heading closes the previous entry, if there was one.
            if title:
                yield (title, "\n".join(current_entry).strip())
            title = line[2:]
            current_entry = []
        else:
            current_entry.append(line)
    # Emit the final entry, unless the file contained no headings at all.
    if title:
        yield (title, "\n".join(current_entry).strip())


if __name__ == "__main__":
    # newline="" stops the csv module writing blank rows on Windows.
    with open("glossary.md") as f, open("glossary.csv", "w", newline="") as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(("name", "documentation"))
        for title, description in get_titles_and_descriptions(f):
            writer.writerow((title, description))

Limitations
We had a quick catch-up today with Neil and Matt, who is the data engineering lead on EM.

The EM glossary was authored by Clare; it's not an export from any existing tool. It is expected to be updated by other members of the team. Its primary purpose is to provide documentation to the supplier who is building the new system.

Based on this I think we can work off the assumption that we can do a one-off population of the glossary into Datahub, and from then on we would ask them to manage their glossary in Datahub or Find MoJ data (if we build our own editing interface).

But it sounds like there is also a need to export the glossary, just like we allow exporting of schemas and table lists, so that metadata can be easily shared with external suppliers without having to onboard them onto our EntraID tenant.
Next steps
Questions from the PoC review today:

Q. Can we link glossary terms to columns in addition to entities?

Q. Is there any workflow functionality for the glossary?

Q. Does asking people to edit directly in Datahub mean Datahub becomes the source of truth for glossaries? This is a departure from the current situation where everything we ingest can be reingested from somewhere else.

Another suggestion for exporting glossaries is to work on a print stylesheet, so users can save to PDF from the glossary page.
User Story
As a
Product Manager
I want
The team to investigate options for implementing a feature where users can search for business terms in the data catalogue and retrieve:
So that
Users can better understand business terms and find relevant data to support their work efficiently.
Acceptance Criteria:
Understand technical feasibility:
Investigate whether the current search functionality can incorporate definitions for business terms and linked datasets, or whether this needs to be a separate search.
Identify where definitions of business terms can be stored.
Explore how these definitions can be maintained, i.e. added, removed or updated.
Explore how to link related datasets.
Explore how to link to business terms from the dataset.
Consider sorting or filtering options to enhance usability.
Ensure definitions and dataset links remain up-to-date with catalogue updates.
Understand what UI changes will be required to support these changes
Master Business Glossary v.40.pdf