-
Notifications
You must be signed in to change notification settings - Fork 55
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(manifest): ManifestEntry partition field schema should be dynamically generated #307
fix(manifest): ManifestEntry partition field schema should be dynamically generated #307
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A bunch of comments plus you need to add the Apache license blurb at the top of the new files.
utils.go
Outdated
switch f.Type.(type) { | ||
case *StringType: | ||
sch = avro.NewPrimitiveSchema(avro.String, nil) | ||
case *Int32Type: | ||
sch = avro.NewPrimitiveSchema(avro.Int, nil) | ||
case *Int64Type: | ||
sch = avro.NewPrimitiveSchema(avro.Long, nil) | ||
case *BinaryType: | ||
sch = avro.NewPrimitiveSchema(avro.Bytes, nil) | ||
case *BooleanType: | ||
sch = avro.NewPrimitiveSchema(avro.Boolean, nil) | ||
case *Float32Type: | ||
sch = avro.NewPrimitiveSchema(avro.Float, nil) | ||
case *Float64Type: | ||
sch = avro.NewPrimitiveSchema(avro.Double, nil) | ||
case *DateType: | ||
sch = avro.NewPrimitiveSchema(avro.Int, avro.NewPrimitiveLogicalSchema(avro.Date)) | ||
default: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
missing some types here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. Do we agree that we only need to handle primitive/scalar types here ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as I'm aware, yes. You can't partition on a struct/list/map type to my knowledge.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me know if you see missing ones.
return b.m | ||
} | ||
|
||
type manifestFileV2 struct { | ||
partitionType *StructType |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this be integrated with the FieldSummary
instead of being at the top level here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In either case, we should probably be populating this after reading in a manifest file too, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if it belongs to FieldSummary
.
We also need to be able to do Avro
-> Iceberg
schema conversion to populate the field on read.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have done Avro
<--> Iceberg
schema conversions.
Can you please help me with the populate part ? Maybe point me to the right part of the code ?
I'm trying to create and maintain Iceberg tables from externally generated Parquet files with this lib. While writing the various Avro metadata files with the lib, I encountered many errors from ClickHouse, which complained about missing metadata on the Avro manifest files. As of the spec, Manifest files must have specific metadata to be valid: https://iceberg.apache.org/spec/#manifests I thus created the As of now, the table produced is queryable by ClickHouse. |
@arnaudbriche i'll take a look and see if I can condense and simplify this a bit for you, probably quicker than going back and forth like this |
… single generic Must function - create constants of non-parametric Avro schema to prevent wasting resources on rebuilding them each time - add license blurb to all new files
…etadata: https://iceberg.apache.org/spec/#manifests - fix a few bug in Metadata builder that triggers on table creation
@arnaudbriche would it be easier for you to enable maintainers to push to this branch and update or should I just open a new PR with the changes? |
I cannot find the "Allow edits from maintainers" button. Can you point-out where it is supposed to be ? |
Seems like the feature does not exists when the fork repository is owner by an Organization. |
Looks like i have read access but not write access currently 😦 |
Ah i had to accept the invite |
1ceb7a1
to
41647c0
Compare
@arnaudbriche can you take a look at my updated version here and make sure that it works with what you were trying for clickhouse? Feel free to comment on the actual API also if you like @Fokko @kevinjqliu I'd also love any feedback you two might have as far as the manifest writing APIs |
@zeroshade it's working fine ! |
Nothing to do with the PR but just a quick question regarding io API. Why is there a requirement for |
The seek is necessary for efficient Parquet processing, and required by the interface APIs for reading Parquet |
Ok. I though ReadAt would be sufficient. Thx for answering! |
The only way to get the file length from the file itself is via Seeking to the end unfortunately. ReadAt could be sufficient, but would require having to externally provide the file's size somehow. |
Avro schemas for Iceberg metadata files are now built dynamically with code.
I know this changes a lot of code, but I did not find other less intrusive ways to fix the issue.