-
Notifications
You must be signed in to change notification settings - Fork 30.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Proposal: Add a method to check if a string is a OneByteString #56090
Comments
Go for it! This seems to have a good shape w/ the shared snippets |
I feel that it might need some clarifications about what this util is targeting
|
Thank you for reminding me. Please correct me if there is any problem with my expression. My intention is to eliminate the overhead of "encoding checks and byteLength compute" before My code will be as follows: const Flags = {
latin1,
utf16
}
function serializeString(buffer, str) {
let cursor = 0;
const isOneByte = util.isOneByteString(str);
if (isOneByte) {
buffer.writeUint8(Flags.latin1, cursor++); // write the flags
buffer.writeUInt32LE(str.length, cursor); // write byteLength
buffer.write(str, cursor + 4, "latin1"); // memcpy
cursor += str.length + 4;
} else {
buffer.writeUint8(Flags.utf16, cursor++); // write the flags
buffer.writeUInt32LE(str.length * 2, cursor); // write byteLength
buffer.write(str, cursor + 4, "utf16le"); // memcpy
cursor += str.length * 2 + 4;
}
} So, "whether the string contains only Latin-1 characters" is not crucial in this situation. If it is a oneByte string, I will write it to the buffer as latin1, and if it is a twoByte string, I will write it as utf16. |
I see, in that case it should probably be placed as something like |
Yes, this is very V8-specific, and not a good encapsulation. Perhaps we can add a util tool to return the raw information of a string. It contains byteLength(1x or 2x string length) and encoding(latin1 or ucs2). With these info serializer could work in the samy way |
Returning byte length + encoding SGTM (although I think the one for two-byte should be utf16, and for V8 the endianness is the same as the one from the platform IIRC). |
Yes, it is same, utf16 is better and more common |
Add a util method to check the encoding information of a string. The encoing information is from V8 Refs: nodejs#56090
Add a util method to check the encoding information of a string. The encoing information is from V8 Refs: nodejs#56090
Hi, I submitted a pull request #56147. Could you take a moment to review it. :) |
@joyeecheung @juanarbol I'm sorry to bring this up, but I still want to ask if there's any plan for this or if there are any problems with this PR. :) |
What is the problem this feature will solve?
In the Buffer module, we have a series of methods for handling string writing and reading, such as
buffer.write("string", 'latin1')
andbuffer.write("string", 'utf16le')
. However, in some situations, we don't know the actual encoding of the string without checking every character. Checking the encoding will introduce overhead, especially when the string is large since SIMD is not accessible on the JavaScript side. In some string processing programs like the serialize framework (https://fury.apache.org/), high performance in string processing is highly beneficial for such programs.What is the feature you are proposing to solve the problem?
Add
isOneByteString
function on the javascript side.Add the getIsOneByte function on the C++ side.
There is
GetIsOneByteSlow
for the slow mode, which is used when the place where it is being used cannot be compiled by TurboFan.And there is
GetIsOneByteFast
for the fast mode. This function is only applicable when the input string is aFastOneByteString
, and in such a case, it will return true directly.What alternatives have you considered?
No response
The text was updated successfully, but these errors were encountered: