feat: Smaller encode/decode maps #909
Conversation
Hi @sapphi-red, cool optimisations, thanks for looking into this!
scripts/write-decode-map.ts
 const encoded = encodeTrie(getTrie(map, legacy));
+const hex = encoded.map((v) => v.toString(36)).join(",");

 // Write the encoded trie to disk
 fs.writeFileSync(
     `${__dirname}/../src/generated/decode-data-${name}.ts`,
     `// Generated using scripts/write-decode-map.ts
 /* eslint-disable */
 // prettier-ignore
-export default new Uint16Array([${encoded
-    .map((val) => val.toString(10))
-    .join(",")}]);
+export default /* #__PURE__ */ (function () {
+    const hex = "${hex}".split(',');
+    const arr = new Uint16Array(${encoded.length});
+    for (let i = 0; i < arr.length; i++) {
+        arr[i] = parseInt(hex[i], 36);
+    }
+    return arr;
+})();
 `
 );
I've played with making this base64, for an even more compact encoding. Curious what you think:
const encoded = new Uint16Array(encodeTrie(getTrie(map, legacy)));
// eslint-disable-next-line node/no-unsupported-features/node-builtins
const base64 = Buffer.from(encoded.buffer).toString("base64");
/*
* We don't want to depend on Node's Buffer, so we use
* the native `atob` function.
*
* Let's make sure the decoding works as expected.
*/
assert.deepEqual(
new Uint16Array(
new Uint8Array(
atob(base64)
.split("")
.map((c) => c.charCodeAt(0))
).buffer
),
encoded
);
// Write the encoded trie to disk
fs.writeFileSync(
`${__dirname}/../src/generated/decode-data-${name}.ts`,
`// Generated using scripts/write-decode-map.ts
export default new Uint16Array(
new Uint8Array(
atob(
"${base64}"
)
.split("")
.map((c) => c.charCodeAt(0))
).buffer
);
`
);
This leads to a ~38% reduction of the raw size.
That's smart! I hadn't thought of using Uint8Array.
I took more detailed data and found that it greatly reduces the raw size, but increases the gzipped size.
|  | raw size (w/o terser) | raw size (w/ terser) | gzip size (w/ terser) | brotli size (w/ terser) |
| --- | --- | --- | --- | --- |
| master branch | 130261B | 98431B | 31629B | 25828B |
| base36 | 100718B (-22.7%) | 83939B (-14.7%) | 30569B (-3.4%) | 25256B (-2.2%) |
| base64 | 89979B (-31.0%) | 73438B (-25.4%) | 32148B (+1.6%) | 28135B (+8.9%) |
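For anyone who wants to reproduce the compression effect locally, here is a minimal sketch (not the script used for the table above; `compareEncodings` and its argument are illustrative) that compares how well the two encodings gzip using Node's built-in zlib:

import { gzipSync } from "zlib";

// Illustrative helper: `encoded` stands in for the array produced by
// encodeTrie(getTrie(map, legacy)).
function compareEncodings(encoded: number[]): void {
    const base36 = encoded.map((v) => v.toString(36)).join(",");
    const base64 = Buffer.from(new Uint16Array(encoded).buffer).toString("base64");

    for (const [name, text] of [
        ["base36", base36],
        ["base64", base64],
    ] as const) {
        // gzip the raw text to see how much redundancy the compressor can exploit
        const gzipped = gzipSync(Buffer.from(text)).length;
        console.log(`${name}: ${text.length} B raw, ${gzipped} B gzipped`);
    }
}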
Interesting — makes some sense given that patterns aren't as repetitive anymore, but I did expect this to be less of a factor. As yet another alternative, there is always storing the binary data directly in a string:
const encoded = encodeTrie(getTrie(map, legacy));
const stringified = JSON.stringify(String.fromCharCode(...encoded));
// Write the encoded trie to disk
fs.writeFileSync(
`${__dirname}/../src/generated/decode-data-${name}.ts`,
`// Generated using scripts/write-decode-map.ts
// prettier-ignore
export default new Uint16Array(
${stringified}
.split("")
.map((c) => c.charCodeAt(0))
);
`
);
This is not ideal, as it adds the trap door of character encodings, but for the sake of argument 🙂
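To make the pitfall concrete (a hypothetical scenario, not something observed in this repo): the generated string only round-trips if every tool in the pipeline treats the file as UTF-8.

// Hypothetical illustration: a code unit above 0x7f survives only if the file is
// consistently read as UTF-8. Misreading the UTF-8 bytes as Latin-1 silently
// turns one character into three.
const original = String.fromCharCode(0x2026); // "…"
const utf8Bytes = Buffer.from(original, "utf8"); // 3 bytes: e2 80 a6
const misread = utf8Bytes.toString("latin1"); // 3 characters, none of them "…"
console.log(original.length, misread.length); // 1 3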
For context, this reduces the file size for decode-data-html by 58%.
I avoided that approach at first because of potential encoding problems, but it does reduce the size significantly.
To mitigate the encoding problems, escaping non-ASCII and control characters should work. This increases the raw size, but terser will unescape them again unless format.ascii_only = true is set (sketched after the code below).
This escaping approach reduces the gzip/brotli file size more than base36/base64 does:
|  | raw size (w/o terser) | raw size (w/ terser) | gzip size (w/ terser) | brotli size (w/ terser) |
| --- | --- | --- | --- | --- |
| master branch | 130261B | 98431B | 31629B | 25828B |
| base36 | 100718B (-22.7%) | 83939B (-14.7%) | 30569B (-3.4%) | 25256B (-2.2%) |
| base64 | 89979B (-31.0%) | 73438B (-25.4%) | 32148B (+1.6%) | 28135B (+8.9%) |
| binary string | 76693B (-41.1%) | 59324B (-39.7%) | 28783B (-9.0%) | 23668B (-8.4%) |
| binary string escaped | 98479B (-24.4%) | 59324B (-39.7%) | 28783B (-9.0%) | 23668B (-8.4%) |
| binary string escaped + terser ascii_only=true | 98479B (-24.4%) | 80257B (-18.5%) | 30335B (-4.1%) | 24507B (-5.1%) |
const encoded = encodeTrie(getTrie(map, legacy));
const stringified = `"${encoded
.map((e) => {
if (e >= 0x20 && e <= 0x7e) {
return String.fromCharCode(e)
.replace("\\", "\\\\")
.replace('"', '\\"');
}
return `\\u${e.toString(16).padStart(4, "0")}`;
})
.join("")}"`;
// Write the encoded trie to disk
fs.writeFileSync(
`${__dirname}/../src/generated/decode-data-${name}.ts`,
`// Generated using scripts/write-decode-map.ts
// prettier-ignore
export default new Uint16Array(
${stringified}
.split("")
.map((c) => c.charCodeAt(0))
);
`
);
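For reference, the terser option mentioned above would be applied roughly like this (a sketch against terser's documented format.ascii_only option; not part of this PR, and consumers would set the equivalent flag in their own bundler config):

import { minify } from "terser";

// Sketch: with format.ascii_only enabled, terser keeps the output ASCII-only, so
// escaped characters stay as \xXX / \uXXXX escapes instead of being unescaped.
async function minifyAsciiOnly(source: string): Promise<string | undefined> {
    const result = await minify(source, {
        format: { ascii_only: true },
    });
    return result.code;
}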
I like it! Let's simplify the escaping logic to
const stringified = JSON.stringify(String.fromCharCode(...encoded)).replace(
/[^\x20-\x7e]/g,
(c) => `\\u${c.charCodeAt(0).toString(16).padStart(4, "0")}`
);
This will save some bytes if a known escape (e.g. \t) is found. The prettier-ignore is now unnecessary as well.
Turns out we can use hex escapes to save a few more bytes (see https://mathiasbynens.be/notes/javascript-escapes):
const stringified = JSON.stringify(String.fromCharCode(...encoded))
.replace(
/[^\0-\x7e]/g,
(c) => `\\u${c.charCodeAt(0).toString(16).padStart(4, "0")}`
)
.replace(/\\u0000/g, "\\0")
.replace(/\\u00([\da-f]{2})/g, "\\x$1");
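As an illustrative aside (sample values chosen only for this example), here is what those replacements do to a few code units:

// Illustrative only: JSON.stringify already emits \n for 0x0a; the replacements
// above then shorten \u0000 to \0 and \u001f to \x1f, while code units above
// 0x7e (e.g. 0x2026) become four-digit \u escapes.
const sample = String.fromCharCode(0, 0x1f, 0x0a, 0x2026);
// JSON.stringify(sample)       -> "\u0000\u001f\n…"
// after the replacements above -> "\0\x1f\n\u2026"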
Anyhow, this feels as far as this can be taken. Lmk what you think, and then let's get this merged!
There is also https://github.com/mathiasbynens/jsesc, which could be added as a dev dependency and would remove the need for all of the replace logic.
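A rough sketch of what that could look like (assuming jsesc is added as a dev dependency; stringifyTrie is just an illustrative name):

import jsesc from "jsesc";

// Sketch only: jsesc picks a short escape for each non-printable-ASCII character,
// which would replace the JSON.stringify + replace chain above.
function stringifyTrie(encoded: number[]): string {
    return jsesc(String.fromCharCode(...encoded), {
        quotes: "double",
        wrap: true, // include the surrounding double quotes in the output
    });
}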
I've pushed my change to use strings for decode data. Curious if this actually helps at all with size, as patterns are slightly less repetitive. Let me know if this makes sense for you, and then this should be good to land.
Thank you! It's great. I've updated the data in the description. This PR now reduces size by 10% - 45%!
terser was already doing these conversions, so it only affected the size without terser.
@sapphi-red Thanks for all of your work getting this over the finish line! I genuinely enjoyed thinking through all of the potential variants and looking for the best way forward.

I'm not publishing a release straight away, as I'm not sure whether this should be a minor or a major version. I'm leaning towards major, as there is a small chance of breaking people's code if they haven't configured terser as needed for their environment. The counter argument would be that their setup is broken, and this can be a minor version. Curious to hear what you think.

Reading your original message again, I wanted to address this:
Counterintuitively, the decode change could actually help with performance. Not important, but I thought it was fun and worth sharing.
I think a minor is acceptable, but a major would be safer, too.
That's interesting.
This PR reduces bundle file size by 10% - 45%.
[size comparison table: raw size (w/o terser), raw size (w/ terser), gzip size (w/ terser), brotli size (w/ terser); rows for terser ascii_only=true and ascii_only=false; values not preserved]
These values are the file size when a file containing export * from 'entities' is bundled. I took these values with this script.
This way it takes slightly longer time to start up, but the time to run functions won't be affected.
old data
[previous size comparison table: raw size (w/o terser), raw size (w/ terser), gzip size (w/ terser), brotli size (w/ terser); values not preserved]