Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor: streamline column-aware row encoding impl #20583

Open
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

BugenZhao
Copy link
Member

@BugenZhao BugenZhao commented Feb 24, 2025

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

Reimplement column-aware row encoding to make the code more clear, concise and less error-prone. See review comments for explanation of changes.

Checklist

  • I have written necessary rustdoc comments.
  • I have added necessary unit tests and integration tests.
  • I have added test labels as necessary.
  • I have added fuzzing tests or opened an issue to track them.
  • My PR contains breaking changes.
  • My PR changes performance-critical code, so I will run (micro) benchmarks and present the results.
  • My PR contains critical fixes that are necessary to be merged into the latest release.

Documentation

  • My PR needs documentation updates.
Release note

Comment on lines -35 to -37
const OFFSET8 = 0b01;
const OFFSET16 = 0b10;
const OFFSET32 = 0b11;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The OFFSET32 here overlaps with OFFSET8 and OFFSET16, for which the bitflag doesn't seem to provide clear semantics. That's why I decide to re-implement with bitfield.

offsets: Vec<u8>,
buf: Vec<u8>,
}

/// A trait unifying [`ToDatumRef`] and already encoded bytes.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To unify encode and encode_slice.

Comment on lines +138 to +139
// Use 0 if there's no data.
let max_offset = usize_offsets.last().copied().unwrap_or(0);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So that we support ser/de for empty rows.

.reserve_exact(usize_offsets.len() * offset_width.width());
for &offset in usize_offsets {
self.offsets
.put_uint_le(offset as u64, offset_width.width());
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unifies put_u8, put_u16_le, and put_u32_le.

Comment on lines +223 to +227
/// A view of the encoded bytes, which can be iterated over to get the column id and data.
/// Used for deserialization.
// TODO: can we unify this with `RowEncoding`, which is for serialization?
#[derive(Clone)]
struct EncodedBytes<'a> {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To unify Deserializer and try_drop_invalid_columns.

@BugenZhao BugenZhao marked this pull request as ready for review February 25, 2025 07:26
@BugenZhao BugenZhao requested a review from a team as a code owner February 25, 2025 07:26
@BugenZhao BugenZhao requested a review from xxchan February 25, 2025 07:26
@graphite-app graphite-app bot requested a review from a team February 25, 2025 07:26
@BugenZhao BugenZhao requested review from fuyufjh and zwang28 and removed request for a team February 25, 2025 07:26
@BugenZhao BugenZhao requested review from stdrc and hzxa21 and removed request for stdrc February 26, 2025 07:14
@BugenZhao BugenZhao force-pushed the bz/col-aware-encoding-refactor branch from 04f2c7b to 8acd50f Compare February 26, 2025 07:40
@xxchan xxchan requested a review from Copilot February 27, 2025 04:04

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Overview

This PR refactors the column-aware row encoding implementation to improve clarity, conciseness, and reduce potential errors.

  • Replaces the bitflags-based header with a bitfield struct and an enum representing offset width.
  • Consolidates encoding logic by introducing a generic Encode trait and refines the serializer/deserializer implementations.
  • Updates tests and configuration to match the new encoding scheme.

Reviewed Changes

File Description
src/common/src/util/value_encoding/column_aware_row_encoding.rs Refactored row encoding/decoding using a bitfield-based header and generic encoding trait.
src/storage/src/row_serde/value_serde.rs Adjusted tests to validate new behavior for handling dropped columns.
src/common/Cargo.toml Added the bitfield-struct dependency to support new header implementation.
src/common/src/util/value_encoding/error.rs Changed the error message format for invalid flags to use binary representation.

Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.

Comments suppressed due to low confidence (1)

src/common/src/util/value_encoding/error.rs:45

  • [nitpick] Using binary formatting for the invalid flag may obscure its numeric value. Consider providing additional context or including a decimal representation for improved clarity.
#[error("Invalid flag: {0:b}")]
Signed-off-by: Bugen Zhao <i@bugenzhao.com>
Signed-off-by: Bugen Zhao <i@bugenzhao.com>
Signed-off-by: Bugen Zhao <i@bugenzhao.com>
Signed-off-by: Bugen Zhao <i@bugenzhao.com>
Signed-off-by: Bugen Zhao <i@bugenzhao.com>
Signed-off-by: Bugen Zhao <i@bugenzhao.com>
Signed-off-by: Bugen Zhao <i@bugenzhao.com>
Signed-off-by: Bugen Zhao <i@bugenzhao.com>
@BugenZhao BugenZhao force-pushed the bz/col-aware-encoding-refactor branch from 8acd50f to 5bc558d Compare February 27, 2025 06:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant