-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Extract common ingestion tasks #5847
Extract common ingestion tasks #5847
Conversation
…_instance`, and `_log_ingestion_metrics` methods
bcc8713
to
1c317d8
Compare
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #5847 +/- ##
==========================================
+ Coverage 96.61% 96.62% +0.01%
==========================================
Files 1046 1053 +7
Lines 24904 25094 +190
Branches 1647 1663 +16
==========================================
+ Hits 24060 24248 +188
- Misses 688 690 +2
Partials 156 156 ☔ View full report in Codecov by Sentry. |
2b7d1c4
to
ec1e900
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. I know preference is generally for smaller PRs but I'd say in cases like this it would be worth including at least one existing ingestion task to use this mechanism. Helps show it works, and makes it easier to review the new generic code side by side with the old being removed.
@baarkerlounger I was very conscious of how large this PR already is, but I agree, it probably would be easier to see a real example. I did attempt to describe an example in the docstring, but appreciate it's not the same. I was planning to plug the EYB ingestion tasks into this common functionality next; I wonder if it would be useful see that first before merging this in? |
@oliverjwroberts my personal preference would be for the PR to be a "functionally complete and useful" unit, so I'd include it, otherwise this new code potentially never ends up doing anything if the rest doesn't get added (aside from being easier to review that way). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, just a few suggestions/changes.
The `list_objects` method now returns a list of each object's metadata. The `get_most_recent_object_key` method now returns the key of the object with the most recent `LastModified` datetime.
cda0893
to
d0d65e3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great! Once merged I will try using it with Stova.
Description of change
This PR creates a new ingest sub-app that contains base functionality and utilities for ingesting objects from S3 buckets. The intention is the existing ingestion tasks can be refactored and plugged into this new common functionality in subsequent work.
Notably, it adds:
get_s3_client()
function that returns a boto3 S3 client, either using the automatic endpoint URL or localstack's if the env variableS3_LOCAL_ENDPOINT_URL
is set.S3ObjectProcessor
class for processing objects located at a specified prefix within an S3 bucket.ingest.IngestedObject
model, that mirrors the functionality ofcompany_activity.IngestedFile
.BaseObjectIdentificationTask
class for identifying new objects in S3 and determining if they should be ingested.BaseObjectIngestionTask
class for ingesting a specified object from S3.The BaseObject...Task classes are intended to be inherited for each dataset that we wish to ingest. The class methods can be overridden to tailor the tasks to the specific dataset; the idea being that only minimal code changes are needed to do so.
In terms of testing, it also modifies/adds:
s3_client
fixtures3_object_processor
fixturetest_object_tuples
fixture which can be used to generate some object definitions for testing. This fixture also uses the utility functioncompressed_json_faker
to serialize the contents and compress them into a.jsonl.gz
object.upload_objects_to_s3
function that can be used to upload objects to a test S3 bucket.Checklist
Has this branch been rebased on top of the current
main
branch?Explanation
The branch should not be stale or have conflicts at the time reviews are requested.
Is the CircleCI build passing?
General points
Other things to check
fixtures/test_data.yaml
is maintained when updating modelsSee docs/CONTRIBUTING.md for more guidelines.