
fix(manifest): ManifestEntry partition field schema should be dynamically generated #307

Merged

Conversation

arnaudbriche (Contributor)

Avro schemas for Iceberg metadata files are now built dynamically in code.
I know this changes a lot of code, but I did not find a less intrusive way to fix the issue.

zeroshade (Member) left a comment:

A bunch of comments plus you need to add the Apache license blurb at the top of the new files.

utils.go (outdated), review comment on lines 277 to 294:
switch f.Type.(type) {
case *StringType:
sch = avro.NewPrimitiveSchema(avro.String, nil)
case *Int32Type:
sch = avro.NewPrimitiveSchema(avro.Int, nil)
case *Int64Type:
sch = avro.NewPrimitiveSchema(avro.Long, nil)
case *BinaryType:
sch = avro.NewPrimitiveSchema(avro.Bytes, nil)
case *BooleanType:
sch = avro.NewPrimitiveSchema(avro.Boolean, nil)
case *Float32Type:
sch = avro.NewPrimitiveSchema(avro.Float, nil)
case *Float64Type:
sch = avro.NewPrimitiveSchema(avro.Double, nil)
case *DateType:
sch = avro.NewPrimitiveSchema(avro.Int, avro.NewPrimitiveLogicalSchema(avro.Date))
default:
zeroshade (Member)

missing some types here

arnaudbriche (Contributor, Author)

Sure. Do we agree that we only need to handle primitive/scalar types here?

zeroshade (Member)

As far as I'm aware, yes. You can't partition on a struct/list/map type to my knowledge.

arnaudbriche (Contributor, Author)

Done.

arnaudbriche (Contributor, Author)

Let me know if you see missing ones.
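For context, a sketch of how the remaining primitive types might be mapped is below. It is illustrative only, not the code that was merged: it assumes a hypothetical helper in the same iceberg package as the switch above, with fmt imported and github.com/hamba/avro/v2 imported as avro.

```go
// partitionFieldToAvroExtra sketches mappings for the primitive types not yet
// covered by the switch above (hypothetical helper, not part of the PR).
func partitionFieldToAvroExtra(typ Type) (avro.Schema, error) {
	switch typ.(type) {
	case *TimeType:
		// Iceberg time is microseconds from midnight, stored as an Avro long.
		return avro.NewPrimitiveSchema(avro.Long, avro.NewPrimitiveLogicalSchema(avro.TimeMicros)), nil
	case *TimestampType, *TimestampTzType:
		// Both timestamp variants are microsecond-precision Avro longs.
		return avro.NewPrimitiveSchema(avro.Long, avro.NewPrimitiveLogicalSchema(avro.TimestampMicros)), nil
	default:
		// uuid, fixed, and decimal map to Avro fixed schemas per the Iceberg
		// spec and are left out of this sketch.
		return nil, fmt.Errorf("type not covered by this sketch: %v", typ)
	}
}
```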

return b.m
}

type manifestFileV2 struct {
partitionType *StructType
zeroshade (Member)

Could this be integrated with the FieldSummary instead of being at the top level here?

zeroshade (Member)

In either case, we should probably be populating this after reading in a manifest file too, right?

arnaudbriche (Contributor, Author)

Not sure it belongs in FieldSummary.
We also need to be able to do Avro -> Iceberg schema conversion to populate the field on read.

arnaudbriche (Contributor, Author)

I have done Avro <--> Iceberg schema conversions.
Can you please help me with the populate part? Maybe point me to the right part of the code?
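For context, the reverse direction (turning an Avro primitive schema back into an Iceberg type when repopulating the partition StructType on read) could look roughly like the sketch below. The helper name is hypothetical, it assumes the iceberg package with fmt imported and github.com/hamba/avro/v2 imported as avro, and it ignores details such as the adjust-to-utc property that distinguishes timestamp from timestamptz.

```go
// avroToIcebergPrimitive is a hypothetical helper sketching the reverse
// mapping; the merged code may structure this differently.
func avroToIcebergPrimitive(sch *avro.PrimitiveSchema) (Type, error) {
	if ls := sch.Logical(); ls != nil {
		switch ls.Type() {
		case avro.Date:
			return &DateType{}, nil
		case avro.TimeMicros:
			return &TimeType{}, nil
		case avro.TimestampMicros:
			// timestamp vs. timestamptz depends on the adjust-to-utc
			// property, which this sketch does not inspect.
			return &TimestampType{}, nil
		}
	}
	switch sch.Type() {
	case avro.String:
		return &StringType{}, nil
	case avro.Int:
		return &Int32Type{}, nil
	case avro.Long:
		return &Int64Type{}, nil
	case avro.Bytes:
		return &BinaryType{}, nil
	case avro.Boolean:
		return &BooleanType{}, nil
	case avro.Float:
		return &Float32Type{}, nil
	case avro.Double:
		return &Float64Type{}, nil
	}
	return nil, fmt.Errorf("unsupported avro type: %v", sch.Type())
}
```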

arnaudbriche (Contributor, Author) commented Feb 20, 2025

I'm trying to create and maintain Iceberg tables from externally generated Parquet files with this lib.
The goal is for ClickHouse to be able to query the table.

While writing the various Avro metadata files with the lib, I encountered many errors from ClickHouse, which complained about missing metadata on the Avro manifest files.

Per the spec, manifest files must carry specific metadata to be valid: https://iceberg.apache.org/spec/#manifests

I thus created WriteManifestEntries, which writes the Avro manifest files with the proper metadata.
I'm not sure how to make the WriteEntries method of manifestFileV1 and manifestFileV2 write compliant Avro manifest files.

As of now, the table produced is queryable by ClickHouse.
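For reference, attaching the metadata keys the spec lists can be sketched with hamba/avro's OCF encoder as below. The function and its inputs (entrySchemaJSON, tableSchemaJSON, specFieldsJSON, entries) are placeholders rather than the merged WriteManifestEntries implementation; it assumes io and github.com/hamba/avro/v2/ocf are imported.

```go
// writeManifestSketch shows the shape of an OCF write that carries the
// metadata keys listed at https://iceberg.apache.org/spec/#manifests.
// Placeholder inputs; not the merged implementation.
func writeManifestSketch(w io.Writer, entrySchemaJSON string,
	tableSchemaJSON, specFieldsJSON []byte, entries []any) error {

	enc, err := ocf.NewEncoder(entrySchemaJSON, w,
		ocf.WithMetadata(map[string][]byte{
			"schema":            tableSchemaJSON, // table schema as JSON
			"schema-id":         []byte("0"),
			"partition-spec":    specFieldsJSON, // partition spec fields as JSON
			"partition-spec-id": []byte("0"),
			"format-version":    []byte("2"),
			"content":           []byte("data"), // "data" or "deletes"
		}),
	)
	if err != nil {
		return err
	}
	for _, e := range entries {
		if err := enc.Encode(e); err != nil {
			return err
		}
	}
	return enc.Close()
}
```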

zeroshade (Member)

@arnaudbriche I'll take a look and see if I can condense and simplify this a bit for you; probably quicker than going back and forth like this.

zeroshade (Member)

@arnaudbriche would it be easier for you to enable maintainers to push to this branch and update it, or should I just open a new PR with the changes?

arnaudbriche (Contributor, Author)

@zeroshade I cannot find the "Allow edits from maintainers" button. Can you point out where it is supposed to be?

arnaudbriche (Contributor, Author)

It seems the feature does not exist when the fork repository is owned by an organization.
I just gave you write access to the fork repository.
Is that OK?

zeroshade (Member)

Looks like I have read access but not write access currently 😦

zeroshade (Member)

Ah, I had to accept the invite.

zeroshade force-pushed the fix-manifest-entry-partition-schema branch from 1ceb7a1 to 41647c0 on February 21, 2025 at 22:59
zeroshade (Member)

@arnaudbriche can you take a look at my updated version here and make sure it works with what you were trying for ClickHouse? Feel free to comment on the actual API as well if you like.

@Fokko @kevinjqliu I'd also love any feedback you two might have on the manifest-writing APIs.

arnaudbriche (Contributor, Author)

@zeroshade it's working fine!
The API looks better.

arnaudbriche (Contributor, Author)

Nothing to do with the PR, but a quick question regarding the io API.

Why is there a requirement for io.ReadSeekCloser on File? It seems like Seek is not used anywhere, and having to implement it just to satisfy the interface is a bit annoying.

zeroshade (Member)

The seek is necessary for efficient Parquet processing, and it is required by the interface APIs for reading Parquet.

arnaudbriche (Contributor, Author)

OK, I thought ReadAt would be sufficient. Thanks for answering!

zeroshade (Member)

The only way to get the file length from the file itself is by seeking to the end, unfortunately.

ReadAt could be sufficient, but it would require the file's size to be provided externally somehow.
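As a small illustration of that point, a reader that only has an io.Seeker can discover the size itself, roughly:

```go
// fileSize determines a file's length using only io.Seeker, which is what
// lets a Parquet reader locate the footer without being told the size.
func fileSize(r io.Seeker) (int64, error) {
	size, err := r.Seek(0, io.SeekEnd)
	if err != nil {
		return 0, err
	}
	// rewind so later reads start from the beginning
	_, err = r.Seek(0, io.SeekStart)
	return size, err
}
```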

zeroshade changed the title from "Fix #305: ManifestEntry partition field schema should be dynamically …" to "fix(manifest): ManifestEntry partition field schema should be dynamically generated" on Feb 24, 2025
zeroshade merged commit b1b432e into apache:main on Feb 24, 2025
10 checks passed