Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Convert proto output from internal string encoding to Unicode #24935

Closed
wants to merge 1 commit into from

Conversation

fmeum
Copy link
Collaborator

@fmeum fmeum commented Jan 16, 2025

Work towards #374

@fmeum fmeum requested a review from tjgq January 16, 2025 07:24
@github-actions github-actions bot added team-Performance Issues for Performance teams awaiting-review PR is awaiting review from an assigned reviewer labels Jan 16, 2025
@fmeum
Copy link
Collaborator Author

fmeum commented Jan 16, 2025

@bazel-io fork 8.1.0

@aiuto
Copy link
Contributor

aiuto commented Jan 16, 2025

Fabian: Can you help me understand the subtle change of this PR.

It looks to me like:

  • protobuf builder setXXX() methods expect a normal Java String
  • bazel internal strings are actually UTF-8 representation under the guise of a String
  • Thus, previous behavior was for setXXX() to whomp down the bytes an convert each octet to UTF-8. This is essentially double encoding
  • The fix is to decode the Bazel secret representation to Unicode as UTF-16 (or whatever Java uses), so that the protobuf setter sees the kind thing it expects and makes it UTF-8 on the wire.

So essentially equivalent to "copy the bytes without interpreting them", which is what Bazel does elsewhere like in ctx.actions.write()

If that's what you intend, then LGTM.

@fmeum
Copy link
Collaborator Author

fmeum commented Jan 17, 2025

That's exactly correct!

The StringEncoding class has a long comment about the different types of string encodings relevant with Bazel (turns out there are three). I added these central conversion methods to avoid a comment on each use site.

@tjgq tjgq added awaiting-PR-merge PR has been approved by a reviewer and is ready to be merge internally and removed awaiting-review PR is awaiting review from an assigned reviewer labels Jan 20, 2025
@@ -347,7 +348,7 @@ private AttributeBuilderAdapter(

@Override
public void addStringListValue(String s) {
attributeBuilder.addStringListValue(s);
attributeBuilder.addStringListValue(internalToUnicode(s));
Copy link
Contributor

@tjgq tjgq Jan 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this conversion is incorrect, since all callsites already perform the conversion to Unicode (also for the other three below)?

If you agree, I will remove it, and also add some integration tests in bazel_query_test.sh when I import.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you are right, but I would personally drop the internalToUnicode calls from the callsites rather than this function as addStringListValue etc. are the actual sinks for valid Unicode. I don't have a strong opinion though, feel free to do what you think is easier to understand

Thanks for taking care of the tests!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I generally agree with the principle of putting the conversion as close to the sink as possible, in this case it would cause writeAttributeValueToBuilder to end up with mixed application/non-application of internalToUnicode, as not every proto field is set through the AttributeValueBuilderAdapter interface. That seems less readable to me, because it's harder to convince oneself that all fields have been covered. (It might be possible to do better with additional indirection, but I'd rather not do that now.)

@github-actions github-actions bot removed the awaiting-PR-merge PR has been approved by a reviewer and is ready to be merge internally label Jan 22, 2025
@fmeum fmeum deleted the 374-proto-unicode branch January 24, 2025 08:13
phst added a commit to phst/bazel_features that referenced this pull request Jan 26, 2025
bazelbuild/bazel#24935 changes the observable behavior
of starlark_doc_extract, and consumers need to adapt.

Work towards bazelbuild/bazel#374
Work towards phst/rules_elisp#818
tjgq pushed a commit to tjgq/bazel that referenced this pull request Jan 27, 2025
Work towards bazelbuild#374

Closes bazelbuild#24935.

PiperOrigin-RevId: 718549143
Change-Id: Ibe6c685a2f8dd75430cae7f770d392de35bdeb68
github-merge-queue bot pushed a commit that referenced this pull request Jan 27, 2025
…#25097)

Work towards #374

Closes #24935.

PiperOrigin-RevId: 718549143
Change-Id: Ibe6c685a2f8dd75430cae7f770d392de35bdeb68

Co-authored-by: Fabian Meumertzheim <fabian@meumertzhe.im>
phst added a commit to phst/bazel_features that referenced this pull request Jan 28, 2025
bazelbuild/bazel#24935 changes the observable behavior
of starlark_doc_extract, and consumers need to adapt.

Work towards bazelbuild/bazel#374
Work towards phst/rules_elisp#818
fmeum pushed a commit to bazel-contrib/bazel_features that referenced this pull request Jan 28, 2025
bazelbuild/bazel#24935 changes the observable
behavior of starlark_doc_extract, and consumers need to adapt.

Work towards bazelbuild/bazel#374
Work towards phst/rules_elisp#818
fmeum added a commit to fmeum/bazel that referenced this pull request Jan 29, 2025
Work towards bazelbuild#374

Closes bazelbuild#24935.

PiperOrigin-RevId: 718549143
Change-Id: Ibe6c685a2f8dd75430cae7f770d392de35bdeb68
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
team-Performance Issues for Performance teams
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants