-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
🎉 New Destination: S3-Glue #18695
🎉 New Destination: S3-Glue #18695
Conversation
@itaseskii please request my review when this is ready! |
Will do! We are planning to test this tomorrow on a live aws environment and if everything works as expected I will request a review. |
} | ||
} | ||
case "object" -> { | ||
//TODO (itaseski) should we take into account nested objects and generate nested structs i.e struct<col_name : struct<col_name : string>> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately structs only support primitive data types. This means it will likely require some means of flattening nested records. Possibly by allowing a configurable separator (e.g. __
).
However, now that I type this, that means that the actual data would need to be mutated to represent the flattened schema structure. Alternatively, anything that's nested deeper than 1 level would need to be treated as a string and parsed as JSON in the query engine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After all it seems that Glue does support nesting of complex types. I have fully reimplemented the schema generation algorithm to use recursion in order to include all the nested types in the schema
.withOutputFormat("org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat") | ||
.withSerdeInfo( | ||
new SerDeInfo() | ||
.withSerializationLibrary("org.openx.data.jsonserde.JsonSerDe") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will need to be mapped to the file type (e.g. JSON vs. Avro vs. Parquet) so that the query engine can use the proper library when processing the data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I'm aware of this. but I though that we agreed that in v1 we will support only JSON and I'm performing that check in order to decide whether to generate the schema in Glue or not.
this.objectMapper = new ObjectMapper(); | ||
} | ||
|
||
//TODO (itaseski) can location change after table is created? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good question. I would imagine the answer is yes? In practice I think it's a rare occurrence.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have provided the ability to update the location through the input parameter so we are good to go.
|
||
UpdateTableRequest updateTableRequest = new UpdateTableRequest() | ||
.withDatabaseName(databaseName) | ||
//TODO (itaseski) do we need to send all table inputs or the ones that are going to be changed? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should only be necessary to modify the schema of tables that are actually being modified. Existing table definitions will continue to operate as long as the underlying data is still present.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. I'm only sending schema and location the only data that can change.
// TODO (itaseski) does the airbyte schema support integers as first class types or as airbyte_types? | ||
case "integer" -> "int"; | ||
case "boolean" -> "boolean"; | ||
// TODO (itaseski) throw exception on unknown types or set as received? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's fine to keep them as-is (at least as a first pass)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Kept as a default case which will send the type as received without performing any mapping on it.
Hello! I'm going to be out of the office this Friday and won't be able to review your contribution again today, I return next Monday. So far, most contributions look solid and are almost done to be approved. As said in Chris' comment all contributions made before 2-November are eligible to receive the prize and have 2 weeks to merge the contributions. But I ensure next week we're going to have your contribution merged. If you have questions about the implementation you can send them in Sorry the inconvenience and see you again next week, thank you so much for your contribution! |
@itaseskii please change to ready and request my review when this the connector is ready to review. |
|
||
import com.fasterxml.jackson.databind.JsonNode; | ||
|
||
// TODO (itaseskii) allow config based implementation of different metastores i.e Hive, Nessie, etc. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the intent is to support different metastores then it might make sense to change the name from s3_glue
to s3_lakehouse
or s3_metastore
so that there isn't the hardcoded assumption that it only works with AWS Glue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I fully agree with this and I was thinking the same thing but I would prefer if we rename it once we actually have another metastore implementation since that is not for certain...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense 👍
/test connector=connectors/destination-s3
Build PassedTest summary info:
|
/test connector=connectors/destination-s3-glue
Build FailedTest summary info:
Build FailedTest summary info:
|
/test connector=connectors/destination-s3-glue
Build FailedTest summary info:
|
/test connector=connectors/destination-s3-glue
Build PassedTest summary info:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @itaseskii I think the connector is in a good first state!
/publish connector=connectors/destination-s3-glue
if you have connectors that successfully published but failed definition generation, follow step 4 here |
Hello @itaseskii can you please update your profile with your email? If you prefer to DM me instead, please dm with the following:
|
* destination-s3-glue * reimplement schema generation to use recursion * configure serialization library * include data on update * flatten data field * improve location path * generate s3-glue destination * refactor s3-glue as separate connector * add acceptance tests and cleanup * check field presence * override test image name * add s3-glue readme * format files * add redpanda readme * add s3 glue to source def * auto-bump connector version Co-authored-by: marcosmarxm <marcosmarxm@gmail.com> Co-authored-by: Octavia Squidington III <octavia-squidington-iii@users.noreply.github.com>
What
Integrating the S3 destination connector with AWS Glue.
How
Describe the solution
Recommended reading order
x.java
y.python
🚨 User Impact 🚨
Are there any breaking changes? What is the end result perceived by the user? If yes, please merge this PR with the 🚨🚨 emoji so changelog authors can further highlight this if needed.
Pre-merge Checklist
Expand the relevant checklist and delete the others.
New Connector
Community member or Airbyter
airbyte_secret
./gradlew :airbyte-integrations:connectors:<name>:integrationTest
.README.md
bootstrap.md
. See description and examplesdocs/integrations/<source or destination>/<name>.md
including changelog. See changelog exampledocs/integrations/README.md
airbyte-integrations/builds.md
Airbyter
If this is a community PR, the Airbyte engineer reviewing this PR is responsible for the below items.
/test connector=connectors/<name>
command is passing/publish
command described hereUpdating a connector
Community member or Airbyter
airbyte_secret
./gradlew :airbyte-integrations:connectors:<name>:integrationTest
.README.md
bootstrap.md
. See description and examplesdocs/integrations/<source or destination>/<name>.md
including changelog. See changelog exampleAirbyter
If this is a community PR, the Airbyte engineer reviewing this PR is responsible for the below items.
/test connector=connectors/<name>
command is passing/publish
command described hereConnector Generator
-scaffold
in their name) have been updated with the latest scaffold by running./gradlew :airbyte-integrations:connector-templates:generator:testScaffoldTemplates
then checking in your changesTests
Unit
Put your unit tests output here.
Integration
Put your integration tests output here.
Acceptance
Put your acceptance tests output here.