Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal: Add a method to check if a string is a OneByteString #56090

Closed
theweipeng opened this issue Nov 30, 2024 · 9 comments
Closed

Proposal: Add a method to check if a string is a OneByteString #56090

theweipeng opened this issue Nov 30, 2024 · 9 comments
Labels
feature request Issues that request new features to be added to Node.js.

Comments

@theweipeng
Copy link
Contributor

theweipeng commented Nov 30, 2024

What is the problem this feature will solve?

In the Buffer module, we have a series of methods for handling string writing and reading, such as buffer.write("string", 'latin1') and buffer.write("string", 'utf16le'). However, in some situations, we don't know the actual encoding of the string without checking every character. Checking the encoding will introduce overhead, especially when the string is large since SIMD is not accessible on the JavaScript side. In some string processing programs like the serialize framework (https://fury.apache.org/), high performance in string processing is highly beneficial for such programs.

What is the feature you are proposing to solve the problem?

Add isOneByteString function on the javascript side.

function isOneByteString(str) {
    if (typeof str  !== "string") {
        return null;
    }
    return getIsOneByte(str);
}

Add the getIsOneByte function on the C++ side.
There is GetIsOneByteSlow for the slow mode, which is used when the place where it is being used cannot be compiled by TurboFan.
And there is GetIsOneByteFast for the fast mode. This function is only applicable when the input string is a FastOneByteString, and in such a case, it will return true directly.

void GetIsOneByteSlow(
    const v8::FunctionCallbackInfo<v8::Value>& info) {
  DCHECK(ValidateCallbackInfo(info));
  if (info.Length() != 1 || !info[0]->IsString()) {
    info.GetIsolate()->ThrowError(
        "isOneByteString() requires a single string argument.");
    return;
  }
  bool is_one_byte = Utils::OpenDirectHandle(*info[0].As<v8::String>())
                         ->IsOneByteRepresentation();
  info.GetReturnValue().Set(is_one_byte);
}

bool GetIsOneByteFast(v8::Local<v8::Value> receiver,
                  const v8::FastOneByteString &source) {
  return true;
}

What alternatives have you considered?

No response

@theweipeng theweipeng added the feature request Issues that request new features to be added to Node.js. label Nov 30, 2024
@github-project-automation github-project-automation bot moved this to Awaiting Triage in Node.js feature requests Nov 30, 2024
@juanarbol
Copy link
Member

Go for it! This seems to have a good shape w/ the shared snippets

@joyeecheung
Copy link
Member

joyeecheung commented Dec 3, 2024

I feel that it might need some clarifications about what this util is targeting

  1. If this is for checking "whether the string will hit an internal fast path", then GetIsOneByteSlow() should just return false. Being a one byte string doesn't mean it will actually ensure fast call invocation, as there are other factors to consider (e.g. concatenated strings are still not optimized IIRC). And this should be named something like isFastOneByteString() instead, though that might be disclosing too many implementation details and lead to micro-optimizations that assumes calling into fast calls twice is always faster than calling into normal binding once, which may or may not be true, and doing this extra check can regress the performance in the case where the the string is not fast one byte.
  2. If this is for checking "whether the string only contains Latin-1 characters", then what you should call in the slow path is not IsOneByteRepresentation() but ContainsOnlyOneByte() because IsOneByteRepresentation() can return false when the string contains only Latin-1 characters but for some reason is stored as two-byte.

@theweipeng
Copy link
Contributor Author

I feel that it might need some clarifications about what this util is targeting

  1. If this is for checking "whether the string will hit an internal fast path", then GetIsOneByteSlow() should just return false. Being a one byte string doesn't mean it will actually ensure fast call invocation, as there are other factors to consider (e.g. concatenated strings are still not optimized IIRC). And this should be named something like isFastOneByteString() instead, though that might be disclosing too many implementation details and lead to micro-optimizations that assumes calling into fast calls twice is always faster than calling into normal binding once, which may or may not be true, and doing this extra check can regress the performance in the case where the the string is not fast one byte.
  2. If this is for checking "whether the string only contains Latin-1 characters", then what you should call in the slow path is not IsOneByteRepresentation() but ContainsOnlyOneByte() because IsOneByteRepresentation() can return false when the string contains only Latin-1 characters but for some reason is stored as two-byte.

Thank you for reminding me. Please correct me if there is any problem with my expression.

My intention is to eliminate the overhead of "encoding checks and byteLength compute" before latin1Write or ucs2Write.
Thus, this proposal aims to check "whether the string is store by oneByte or twoByte".

My code will be as follows:

const Flags = {
    latin1,
    utf16
}
function serializeString(buffer, str) {
    let cursor = 0;
    const isOneByte = util.isOneByteString(str);
    if (isOneByte) {
           buffer.writeUint8(Flags.latin1, cursor++);  // write the flags
           buffer.writeUInt32LE(str.length, cursor); // write byteLength
           buffer.write(str, cursor + 4, "latin1");   // memcpy
           cursor += str.length + 4;
    } else {
           buffer.writeUint8(Flags.utf16, cursor++);  // write the flags
           buffer.writeUInt32LE(str.length * 2, cursor); // write byteLength
           buffer.write(str, cursor + 4, "utf16le");   //  memcpy
           cursor += str.length * 2  + 4;
    }
}

So, "whether the string contains only Latin-1 characters" is not crucial in this situation. If it is a oneByte string, I will write it to the buffer as latin1, and if it is a twoByte string, I will write it as utf16.

@joyeecheung
Copy link
Member

joyeecheung commented Dec 4, 2024

I see, in that case it should probably be placed as something like v8.isOneByteRepresentation(), as this is very V8-specific, although note that this can start to backfire if V8 changes its string representation schemes (e.g. if one day it decides to just store UTF8, or at least support UTF8 storage). I feel that what you are trying to do, a method that does the serialization + returning the encoding used would be a better encapsulation?

@theweipeng
Copy link
Contributor Author

I see, in that case it should probably be placed as something like v8.isOneByteRepresentation(), as this is very V8-specific, although note that this can start to backfire if V8 changes its string representation schemes (e.g. if one day it decides to just store UTF8, or at least support UTF8 storage). I feel that what you are trying to do, a method that does the serialization + returning the encoding used would be a better encapsulation?

Yes, this is very V8-specific, and not a good encapsulation. Perhaps we can add a util tool to return the raw information of a string. It contains byteLength(1x or 2x string length) and encoding(latin1 or ucs2). With these info serializer could work in the samy way

@joyeecheung
Copy link
Member

Returning byte length + encoding SGTM (although I think the one for two-byte should be utf16, and for V8 the endianness is the same as the one from the platform IIRC).

@theweipeng
Copy link
Contributor Author

Returning byte length + encoding SGTM (although I think the one for two-byte should be utf16, and for V8 the endianness is the same as the one from the platform IIRC).

Yes, it is same, utf16 is better and more common

theweipeng added a commit to theweipeng/node that referenced this issue Dec 5, 2024
Add a util method to check the encoding information of a string.
The encoing information is from V8

Refs: nodejs#56090
theweipeng added a commit to theweipeng/node that referenced this issue Dec 7, 2024
Add a util method to check the encoding information of a string.
The encoing information is from V8

Refs: nodejs#56090
@theweipeng
Copy link
Contributor Author

Returning byte length + encoding SGTM (although I think the one for two-byte should be utf16, and for V8 the endianness is the same as the one from the platform IIRC).

Hi, I submitted a pull request #56147. Could you take a moment to review it. :)

@theweipeng
Copy link
Contributor Author

@joyeecheung @juanarbol I'm sorry to bring this up, but I still want to ask if there's any plan for this or if there are any problems with this PR. :)

theweipeng added a commit to theweipeng/node that referenced this issue Dec 21, 2024
theweipeng added a commit to theweipeng/node that referenced this issue Dec 21, 2024
theweipeng added a commit to theweipeng/node that referenced this issue Dec 21, 2024
theweipeng added a commit to theweipeng/node that referenced this issue Dec 22, 2024
theweipeng added a commit to theweipeng/node that referenced this issue Dec 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request Issues that request new features to be added to Node.js.
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

4 participants
@joyeecheung @theweipeng @juanarbol and others