Iceberg writes to match file size #3823

Open
kevinzwang opened this issue Feb 19, 2025 · 0 comments
Labels
enhancement (New feature or request), p2 (backlog) (Nice to have features)

Comments

@kevinzwang
Member

Discussed in #3815

Originally posted by gero90 February 14, 2025
If there is any way to estimate the Parquet file size in df.write_iceberg(), it would be really nice to produce Parquet files close in size to the Iceberg table property write.target-file-size-bytes (default is 512 MiB).

Having Parquet files close to that size makes Iceberg reads more efficient, and it leaves less table maintenance (compaction) to perform.

As an example, I call df.into_partitions(1) right before df.write_iceberg() when I know the total data is small, so that each write produces a single file.
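To make the workaround a bit more general, here is a rough sketch of how the partition count could be picked from the target file size instead of hard-coding 1. The helper names (target_file_size, partitions_for_target) are hypothetical and not part of Daft, and the estimate of the written Parquet size is assumed to come from the caller, since in-memory size and compressed on-disk size can differ a lot.

```python
# Hypothetical helpers, not part of Daft or Iceberg: a sketch of choosing a
# partition count so that each written file lands near the target size.

# Iceberg's documented default for write.target-file-size-bytes (512 MiB).
DEFAULT_TARGET_FILE_SIZE_BYTES = 512 * 1024 * 1024


def target_file_size(table_properties):
    """Read write.target-file-size-bytes from a dict of table properties.

    Property values are often stored as strings, so convert to int.
    """
    raw = table_properties.get(
        "write.target-file-size-bytes", DEFAULT_TARGET_FILE_SIZE_BYTES
    )
    return int(raw)


def partitions_for_target(estimated_bytes, target_bytes=DEFAULT_TARGET_FILE_SIZE_BYTES):
    """Return how many partitions to write so each is at most target_bytes.

    estimated_bytes is the caller's guess of the total Parquet size on disk
    (an assumption; this sketch does not estimate it).
    """
    if target_bytes <= 0:
        raise ValueError("target_bytes must be positive")
    if estimated_bytes < 0:
        raise ValueError("estimated_bytes must be non-negative")
    # Integer ceiling division, with a floor of one partition.
    return max(1, -(-estimated_bytes // target_bytes))


# Usage with the calls from this post (not executed here):
#   n = partitions_for_target(my_size_estimate, target_file_size(props))
#   df.into_partitions(n).write_iceberg(table)
```

This only generalizes the into_partitions(1) trick; it does not help with partitioned tables, where the data per Iceberg partition would need its own estimate.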

Thanks in advance for taking a look and for making daft awesome!

@kevinzwang kevinzwang added enhancement New feature or request p2 (backlog) Nice to have features labels Feb 19, 2025