
fix(manifest): ManifestEntry partition field schema should be dynamically generated #307

Merged

Conversation

arnaudbriche (Contributor)

Avro schemas for Iceberg metadata files are now built dynamically in code.
I know this changes a lot of code, but I did not find a less intrusive way to fix the issue.

zeroshade (Member) left a comment:

A bunch of comments plus you need to add the Apache license blurb at the top of the new files.

utils.go (outdated), review comment on lines 277 to 294:
switch f.Type.(type) {
case *StringType:
sch = avro.NewPrimitiveSchema(avro.String, nil)
case *Int32Type:
sch = avro.NewPrimitiveSchema(avro.Int, nil)
case *Int64Type:
sch = avro.NewPrimitiveSchema(avro.Long, nil)
case *BinaryType:
sch = avro.NewPrimitiveSchema(avro.Bytes, nil)
case *BooleanType:
sch = avro.NewPrimitiveSchema(avro.Boolean, nil)
case *Float32Type:
sch = avro.NewPrimitiveSchema(avro.Float, nil)
case *Float64Type:
sch = avro.NewPrimitiveSchema(avro.Double, nil)
case *DateType:
sch = avro.NewPrimitiveSchema(avro.Int, avro.NewPrimitiveLogicalSchema(avro.Date))
default:
zeroshade (Member)

missing some types here

arnaudbriche (Contributor, Author)

Sure. Do we agree that we only need to handle primitive/scalar types here?

zeroshade (Member)

As far as I'm aware, yes. You can't partition on a struct/list/map type to my knowledge.

arnaudbriche (Contributor, Author)

Done.

arnaudbriche (Contributor, Author)

Let me know if you see missing ones.
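For context, a sketch of how the remaining primitive types might be mapped is below. It is illustrative only, not the code that was merged: it assumes a hypothetical helper in the same iceberg package as the switch above, with fmt imported and github.com/hamba/avro/v2 imported as avro.

```go
// partitionFieldToAvroExtra sketches mappings for the primitive types not yet
// covered by the switch above (hypothetical helper, not part of the PR).
func partitionFieldToAvroExtra(typ Type) (avro.Schema, error) {
	switch typ.(type) {
	case *TimeType:
		// Iceberg time is microseconds from midnight, stored as an Avro long.
		return avro.NewPrimitiveSchema(avro.Long, avro.NewPrimitiveLogicalSchema(avro.TimeMicros)), nil
	case *TimestampType, *TimestampTzType:
		// Both timestamp variants are microsecond-precision Avro longs.
		return avro.NewPrimitiveSchema(avro.Long, avro.NewPrimitiveLogicalSchema(avro.TimestampMicros)), nil
	default:
		// uuid, fixed, and decimal map to Avro fixed schemas per the Iceberg
		// spec and are left out of this sketch.
		return nil, fmt.Errorf("type not covered by this sketch: %v", typ)
	}
}
```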

return b.m
}

type manifestFileV2 struct {
partitionType *StructType
zeroshade (Member)

Could this be integrated with the FieldSummary instead of being at the top level here?

zeroshade (Member)

In either case, we should probably be populating this after reading in a manifest file too, right?

arnaudbriche (Contributor, Author)

Not sure it belongs in FieldSummary.
We also need to be able to do Avro -> Iceberg schema conversion to populate the field on read.

arnaudbriche (Contributor, Author)

I have done Avro <--> Iceberg schema conversions.
Can you please help me with the populate part? Maybe point me to the right part of the code?
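For context, the reverse direction (turning an Avro primitive schema back into an Iceberg type when repopulating the partition StructType on read) could look roughly like the sketch below. The helper name is hypothetical, it assumes the iceberg package with fmt imported and github.com/hamba/avro/v2 imported as avro, and it ignores details such as the adjust-to-utc property that distinguishes timestamp from timestamptz.

```go
// avroToIcebergPrimitive is a hypothetical helper sketching the reverse
// mapping; the merged code may structure this differently.
func avroToIcebergPrimitive(sch *avro.PrimitiveSchema) (Type, error) {
	if ls := sch.Logical(); ls != nil {
		switch ls.Type() {
		case avro.Date:
			return &DateType{}, nil
		case avro.TimeMicros:
			return &TimeType{}, nil
		case avro.TimestampMicros:
			// timestamp vs. timestamptz depends on the adjust-to-utc
			// property, which this sketch does not inspect.
			return &TimestampType{}, nil
		}
	}
	switch sch.Type() {
	case avro.String:
		return &StringType{}, nil
	case avro.Int:
		return &Int32Type{}, nil
	case avro.Long:
		return &Int64Type{}, nil
	case avro.Bytes:
		return &BinaryType{}, nil
	case avro.Boolean:
		return &BooleanType{}, nil
	case avro.Float:
		return &Float32Type{}, nil
	case avro.Double:
		return &Float64Type{}, nil
	}
	return nil, fmt.Errorf("unsupported avro type: %v", sch.Type())
}
```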

arnaudbriche (Contributor, Author) commented Feb 20, 2025

I'm trying to create and maintain Iceberg tables from externally generated Parquet files with this lib.
The goal is for ClickHouse to be able to query the table.

While writing the various Avro metadata files with the lib, I encountered many errors from ClickHouse, which complained about missing metadata on the Avro manifest files.

Per the spec, manifest files must carry specific metadata to be valid: https://iceberg.apache.org/spec/#manifests

I thus created WriteManifestEntries, which writes the Avro manifest files with the proper metadata.
I'm not sure how to make the WriteEntries method of manifestFileV1 and manifestFileV2 write compliant Avro manifest files.

As of now, the table produced is queryable by ClickHouse.
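For reference, attaching the metadata keys the spec lists can be sketched with hamba/avro's OCF encoder as below. The function and its inputs (entrySchemaJSON, tableSchemaJSON, specFieldsJSON, entries) are placeholders rather than the merged WriteManifestEntries implementation; it assumes io and github.com/hamba/avro/v2/ocf are imported.

```go
// writeManifestSketch shows the shape of an OCF write that carries the
// metadata keys listed at https://iceberg.apache.org/spec/#manifests.
// Placeholder inputs; not the merged implementation.
func writeManifestSketch(w io.Writer, entrySchemaJSON string,
	tableSchemaJSON, specFieldsJSON []byte, entries []any) error {

	enc, err := ocf.NewEncoder(entrySchemaJSON, w,
		ocf.WithMetadata(map[string][]byte{
			"schema":            tableSchemaJSON, // table schema as JSON
			"schema-id":         []byte("0"),
			"partition-spec":    specFieldsJSON, // partition spec fields as JSON
			"partition-spec-id": []byte("0"),
			"format-version":    []byte("2"),
			"content":           []byte("data"), // "data" or "deletes"
		}),
	)
	if err != nil {
		return err
	}
	for _, e := range entries {
		if err := enc.Encode(e); err != nil {
			return err
		}
	}
	return enc.Close()
}
```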

zeroshade (Member)

@arnaudbriche I'll take a look and see if I can condense and simplify this a bit for you; probably quicker than going back and forth like this.

zeroshade (Member)

@arnaudbriche would it be easier for you to enable maintainers to push to this branch and update it, or should I just open a new PR with the changes?

arnaudbriche (Contributor, Author)

@zeroshade I cannot find the "Allow edits from maintainers" button. Can you point out where it is supposed to be?

arnaudbriche (Contributor, Author)

It seems the feature does not exist when the fork repository is owned by an organization.
I just gave you write access to the fork repository.
Is that OK?

zeroshade (Member)

Looks like I have read access but not write access currently 😦

zeroshade (Member)

Ah, I had to accept the invite.

zeroshade force-pushed the fix-manifest-entry-partition-schema branch from 1ceb7a1 to 41647c0 on February 21, 2025 at 22:59
zeroshade (Member)

@arnaudbriche can you take a look at my updated version here and make sure it works with what you were trying for ClickHouse? Feel free to comment on the actual API as well if you like.

@Fokko @kevinjqliu I'd also love any feedback you two might have on the manifest-writing APIs.

arnaudbriche (Contributor, Author)

@zeroshade it's working fine!
The API looks better.

arnaudbriche (Contributor, Author)

Nothing to do with the PR, but a quick question regarding the io API.

Why is there a requirement for io.ReadSeekCloser on File? It seems like Seek is not used anywhere, and having to implement it just to satisfy the interface is a bit annoying.

zeroshade (Member)

The seek is necessary for efficient Parquet processing, and it is required by the interface APIs for reading Parquet.

arnaudbriche (Contributor, Author)

OK, I thought ReadAt would be sufficient. Thanks for answering!

zeroshade (Member)

The only way to get the file length from the file itself is by seeking to the end, unfortunately.

ReadAt could be sufficient, but it would require the file's size to be provided externally somehow.
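As a small illustration of that point, a reader that only has an io.Seeker can discover the size itself, roughly:

```go
// fileSize determines a file's length using only io.Seeker, which is what
// lets a Parquet reader locate the footer without being told the size.
func fileSize(r io.Seeker) (int64, error) {
	size, err := r.Seek(0, io.SeekEnd)
	if err != nil {
		return 0, err
	}
	// rewind so later reads start from the beginning
	_, err = r.Seek(0, io.SeekStart)
	return size, err
}
```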

zeroshade changed the title from "Fix #305: ManifestEntry partition field schema should be dynamically …" to "fix(manifest): ManifestEntry partition field schema should be dynamically generated" on Feb 24, 2025
zeroshade merged commit b1b432e into apache:main on Feb 24, 2025
10 checks passed