Proposal: Add a method to check if a string is a OneByteString #56090

theweipeng · 2024-11-30T12:54:43Z

What is the problem this feature will solve?

In the Buffer module, we have a series of methods for handling string writing and reading, such as buffer.write("string", 'latin1') and buffer.write("string", 'utf16le'). However, in some situations, we don't know the actual encoding of the string without checking every character. Checking the encoding will introduce overhead, especially when the string is large since SIMD is not accessible on the JavaScript side. In some string processing programs like the serialize framework (https://fury.apache.org/), high performance in string processing is highly beneficial for such programs.

What is the feature you are proposing to solve the problem?

Add isOneByteString function on the javascript side.

function isOneByteString(str) {
    if (typeof str  !== "string") {
        return null;
    }
    return getIsOneByte(str);
}

Add the getIsOneByte function on the C++ side.
There is GetIsOneByteSlow for the slow mode, which is used when the place where it is being used cannot be compiled by TurboFan.
And there is GetIsOneByteFast for the fast mode. This function is only applicable when the input string is a FastOneByteString, and in such a case, it will return true directly.

void GetIsOneByteSlow(
    const v8::FunctionCallbackInfo<v8::Value>& info) {
  DCHECK(ValidateCallbackInfo(info));
  if (info.Length() != 1 || !info[0]->IsString()) {
    info.GetIsolate()->ThrowError(
        "isOneByteString() requires a single string argument.");
    return;
  }
  bool is_one_byte = Utils::OpenDirectHandle(*info[0].As<v8::String>())
                         ->IsOneByteRepresentation();
  info.GetReturnValue().Set(is_one_byte);
}

bool GetIsOneByteFast(v8::Local<v8::Value> receiver,
                  const v8::FastOneByteString &source) {
  return true;
}

What alternatives have you considered?

No response

The text was updated successfully, but these errors were encountered:

juanarbol · 2024-12-01T07:12:05Z

Go for it! This seems to have a good shape w/ the shared snippets

joyeecheung · 2024-12-03T16:48:43Z

I feel that it might need some clarifications about what this util is targeting

If this is for checking "whether the string will hit an internal fast path", then GetIsOneByteSlow() should just return false. Being a one byte string doesn't mean it will actually ensure fast call invocation, as there are other factors to consider (e.g. concatenated strings are still not optimized IIRC). And this should be named something like isFastOneByteString() instead, though that might be disclosing too many implementation details and lead to micro-optimizations that assumes calling into fast calls twice is always faster than calling into normal binding once, which may or may not be true, and doing this extra check can regress the performance in the case where the the string is not fast one byte.
If this is for checking "whether the string only contains Latin-1 characters", then what you should call in the slow path is not IsOneByteRepresentation() but ContainsOnlyOneByte() because IsOneByteRepresentation() can return false when the string contains only Latin-1 characters but for some reason is stored as two-byte.

theweipeng · 2024-12-04T02:46:17Z

I feel that it might need some clarifications about what this util is targeting

If this is for checking "whether the string will hit an internal fast path", then GetIsOneByteSlow() should just return false. Being a one byte string doesn't mean it will actually ensure fast call invocation, as there are other factors to consider (e.g. concatenated strings are still not optimized IIRC). And this should be named something like isFastOneByteString() instead, though that might be disclosing too many implementation details and lead to micro-optimizations that assumes calling into fast calls twice is always faster than calling into normal binding once, which may or may not be true, and doing this extra check can regress the performance in the case where the the string is not fast one byte.

If this is for checking "whether the string only contains Latin-1 characters", then what you should call in the slow path is not IsOneByteRepresentation() but ContainsOnlyOneByte() because IsOneByteRepresentation() can return false when the string contains only Latin-1 characters but for some reason is stored as two-byte.

Thank you for reminding me. Please correct me if there is any problem with my expression.

My intention is to eliminate the overhead of "encoding checks and byteLength compute" before latin1Write or ucs2Write.
Thus, this proposal aims to check "whether the string is store by oneByte or twoByte".

My code will be as follows:

const Flags = {
    latin1,
    utf16
}
function serializeString(buffer, str) {
    let cursor = 0;
    const isOneByte = util.isOneByteString(str);
    if (isOneByte) {
           buffer.writeUint8(Flags.latin1, cursor++);  // write the flags
           buffer.writeUInt32LE(str.length, cursor); // write byteLength
           buffer.write(str, cursor + 4, "latin1");   // memcpy
           cursor += str.length + 4;
    } else {
           buffer.writeUint8(Flags.utf16, cursor++);  // write the flags
           buffer.writeUInt32LE(str.length * 2, cursor); // write byteLength
           buffer.write(str, cursor + 4, "utf16le");   //  memcpy
           cursor += str.length * 2  + 4;
    }
}

So, "whether the string contains only Latin-1 characters" is not crucial in this situation. If it is a oneByte string, I will write it to the buffer as latin1, and if it is a twoByte string, I will write it as utf16.

joyeecheung · 2024-12-04T04:34:47Z

I see, in that case it should probably be placed as something like v8.isOneByteRepresentation(), as this is very V8-specific, although note that this can start to backfire if V8 changes its string representation schemes (e.g. if one day it decides to just store UTF8, or at least support UTF8 storage). I feel that what you are trying to do, a method that does the serialization + returning the encoding used would be a better encapsulation?

theweipeng · 2024-12-04T05:03:49Z

I see, in that case it should probably be placed as something like v8.isOneByteRepresentation(), as this is very V8-specific, although note that this can start to backfire if V8 changes its string representation schemes (e.g. if one day it decides to just store UTF8, or at least support UTF8 storage). I feel that what you are trying to do, a method that does the serialization + returning the encoding used would be a better encapsulation?

Yes, this is very V8-specific, and not a good encapsulation. Perhaps we can add a util tool to return the raw information of a string. It contains byteLength(1x or 2x string length) and encoding(latin1 or ucs2). With these info serializer could work in the samy way

joyeecheung · 2024-12-04T05:37:03Z

Returning byte length + encoding SGTM (although I think the one for two-byte should be utf16, and for V8 the endianness is the same as the one from the platform IIRC).

theweipeng · 2024-12-04T05:41:40Z

Returning byte length + encoding SGTM (although I think the one for two-byte should be utf16, and for V8 the endianness is the same as the one from the platform IIRC).

Yes, it is same, utf16 is better and more common

Add a util method to check the encoding information of a string. The encoing information is from V8 Refs: nodejs#56090

theweipeng · 2024-12-10T02:39:52Z

Returning byte length + encoding SGTM (although I think the one for two-byte should be utf16, and for V8 the endianness is the same as the one from the platform IIRC).

Hi, I submitted a pull request #56147. Could you take a moment to review it. :)

theweipeng · 2024-12-11T14:13:18Z

@joyeecheung @juanarbol I'm sorry to bring this up, but I still want to ask if there's any plan for this or if there are any problems with this PR. :)

References: nodejs#56090

theweipeng added the feature request Issues that request new features to be added to Node.js. label Nov 30, 2024

github-project-automation bot added this to Node.js feature requests Nov 30, 2024

github-project-automation bot moved this to Awaiting Triage in Node.js feature requests Nov 30, 2024

theweipeng added a commit to theweipeng/node that referenced this issue Dec 5, 2024

util: add a method to get string encoding info

a16d7dd

Add a util method to check the encoding information of a string. The encoing information is from V8 Refs: nodejs#56090

theweipeng mentioned this issue Dec 5, 2024

src: add a method to get string encoding info #56147

Closed

theweipeng added a commit to theweipeng/node that referenced this issue Dec 7, 2024

util: add a method to get string encoding info

a4a445a

Add a util method to check the encoding information of a string. The encoing information is from V8 Refs: nodejs#56090

theweipeng added a commit to theweipeng/node that referenced this issue Dec 21, 2024

util: Detect whether the string is one byte representation or not.

22e4721

References: nodejs#56090

theweipeng added a commit to theweipeng/node that referenced this issue Dec 21, 2024

util: Detect whether the string is one byte representation or not.

8a6f12e

References: nodejs#56090

theweipeng added a commit to theweipeng/node that referenced this issue Dec 21, 2024

util: Detect whether the string is one byte representation or not.

6f30b0e

References: nodejs#56090

theweipeng added a commit to theweipeng/node that referenced this issue Dec 22, 2024

src: detect whether the string is one byte representation or not

6eaf64c

References: nodejs#56090

theweipeng added a commit to theweipeng/node that referenced this issue Dec 28, 2024

src: detect whether the string is one byte representation or not

65e06d9

References: nodejs#56090

jasnell closed this as completed in 6cb0690 Feb 25, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Proposal: Add a method to check if a string is a OneByteString #56090

Proposal: Add a method to check if a string is a OneByteString #56090

theweipeng commented Nov 30, 2024 •

edited

Loading

juanarbol commented Dec 1, 2024

joyeecheung commented Dec 3, 2024 •

edited

Loading

theweipeng commented Dec 4, 2024

joyeecheung commented Dec 4, 2024 •

edited

Loading

theweipeng commented Dec 4, 2024

joyeecheung commented Dec 4, 2024

theweipeng commented Dec 4, 2024

theweipeng commented Dec 10, 2024

theweipeng commented Dec 11, 2024

Proposal: Add a method to check if a string is a OneByteString #56090

Proposal: Add a method to check if a string is a OneByteString #56090

Comments

theweipeng commented Nov 30, 2024 • edited Loading

What is the problem this feature will solve?

What is the feature you are proposing to solve the problem?

What alternatives have you considered?

juanarbol commented Dec 1, 2024

joyeecheung commented Dec 3, 2024 • edited Loading

theweipeng commented Dec 4, 2024

joyeecheung commented Dec 4, 2024 • edited Loading

theweipeng commented Dec 4, 2024

joyeecheung commented Dec 4, 2024

theweipeng commented Dec 4, 2024

theweipeng commented Dec 10, 2024

theweipeng commented Dec 11, 2024

theweipeng commented Nov 30, 2024 •

edited

Loading

joyeecheung commented Dec 3, 2024 •

edited

Loading

joyeecheung commented Dec 4, 2024 •

edited

Loading