Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extract common ingestion tasks #5847

Merged
merged 8 commits into from
Dec 12, 2024

Conversation

oliverjwroberts
Copy link
Contributor

@oliverjwroberts oliverjwroberts commented Dec 9, 2024

Description of change

This PR creates a new ingest sub-app that contains base functionality and utilities for ingesting objects from S3 buckets. The intention is the existing ingestion tasks can be refactored and plugged into this new common functionality in subsequent work.

Notably, it adds:

  • A get_s3_client() function that returns a boto3 S3 client, either using the automatic endpoint URL or localstack's if the env variable S3_LOCAL_ENDPOINT_URL is set.
  • An S3ObjectProcessor class for processing objects located at a specified prefix within an S3 bucket.
  • Several constants for when using S3 buckets.
  • A new ingest.IngestedObject model, that mirrors the functionality of company_activity.IngestedFile.
  • A BaseObjectIdentificationTask class for identifying new objects in S3 and determining if they should be ingested.
  • A BaseObjectIngestionTask class for ingesting a specified object from S3.

The BaseObject...Task classes are intended to be inherited for each dataset that we wish to ingest. The class methods can be overridden to tailor the tasks to the specific dataset; the idea being that only minimal code changes are needed to do so.

In terms of testing, it also modifies/adds:

  • An s3_client fixture
  • An s3_object_processor fixture
  • An test_object_tuples fixture which can be used to generate some object definitions for testing. This fixture also uses the utility function compressed_json_faker to serialize the contents and compress them into a .jsonl.gz object.
  • An upload_objects_to_s3 function that can be used to upload objects to a test S3 bucket.

Checklist

  • Has this branch been rebased on top of the current main branch?

    Explanation

    The branch should not be stale or have conflicts at the time reviews are requested.

  • Is the CircleCI build passing?

General points

Other things to check

  • Make sure fixtures/test_data.yaml is maintained when updating models
  • Consider the admin site when making changes to models
  • Use select-/prefetch-related field lists in views and search apps, and update them when fields are added
  • Make sure the README is updated e.g. when adding new environment variables

See docs/CONTRIBUTING.md for more guidelines.

@oliverjwroberts oliverjwroberts self-assigned this Dec 9, 2024
@oliverjwroberts oliverjwroberts force-pushed the fixup/CLS2-995-extract-common-ingestion-tasks branch from bcc8713 to 1c317d8 Compare December 10, 2024 07:09
Copy link

codecov bot commented Dec 10, 2024

Codecov Report

Attention: Patch coverage is 99.47368% with 1 line in your changes missing coverage. Please review.

Project coverage is 96.62%. Comparing base (84afe4a) to head (d0d65e3).
Report is 34 commits behind head on main.

Files with missing lines Patch % Lines
datahub/ingest/tasks.py 99.02% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5847      +/-   ##
==========================================
+ Coverage   96.61%   96.62%   +0.01%     
==========================================
  Files        1046     1053       +7     
  Lines       24904    25094     +190     
  Branches     1647     1663      +16     
==========================================
+ Hits        24060    24248     +188     
- Misses        688      690       +2     
  Partials      156      156              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@oliverjwroberts oliverjwroberts force-pushed the fixup/CLS2-995-extract-common-ingestion-tasks branch from 2b7d1c4 to ec1e900 Compare December 10, 2024 22:21
@oliverjwroberts oliverjwroberts marked this pull request as ready for review December 11, 2024 07:07
@oliverjwroberts oliverjwroberts requested a review from a team as a code owner December 11, 2024 07:07
Copy link
Contributor

@baarkerlounger baarkerlounger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. I know preference is generally for smaller PRs but I'd say in cases like this it would be worth including at least one existing ingestion task to use this mechanism. Helps show it works, and makes it easier to review the new generic code side by side with the old being removed.

@oliverjwroberts
Copy link
Contributor Author

@baarkerlounger I was very conscious of how large this PR already is, but I agree, it probably would be easier to see a real example. I did attempt to describe an example in the docstring, but appreciate it's not the same. I was planning to plug the EYB ingestion tasks into this common functionality next; I wonder if it would be useful see that first before merging this in?

@baarkerlounger
Copy link
Contributor

@oliverjwroberts my personal preference would be for the PR to be a "functionally complete and useful" unit, so I'd include it, otherwise this new code potentially never ends up doing anything if the rest doesn't get added (aside from being easier to review that way).

Copy link
Contributor

@DeanElliott96 DeanElliott96 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, just a few suggestions/changes.

The `list_objects` method now returns a list of each object's metadata. The `get_most_recent_object_key` method now returns the key of the object with the most recent `LastModified` datetime.
@oliverjwroberts oliverjwroberts force-pushed the fixup/CLS2-995-extract-common-ingestion-tasks branch from cda0893 to d0d65e3 Compare December 12, 2024 11:30
Copy link
Contributor

@DeanElliott96 DeanElliott96 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! Once merged I will try using it with Stova.

@oliverjwroberts oliverjwroberts merged commit 6aaf1a6 into main Dec 12, 2024
7 checks passed
@oliverjwroberts oliverjwroberts deleted the fixup/CLS2-995-extract-common-ingestion-tasks branch December 12, 2024 12:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants