feat: Smaller encode/decode maps #909
Conversation
Hi @sapphi-red, cool optimisations, thanks for looking into this!
scripts/write-decode-map.ts
 const encoded = encodeTrie(getTrie(map, legacy));
+const hex = encoded.map((v) => v.toString(36)).join(",");

 // Write the encoded trie to disk
 fs.writeFileSync(
     `${__dirname}/../src/generated/decode-data-${name}.ts`,
     `// Generated using scripts/write-decode-map.ts
 /* eslint-disable */
 // prettier-ignore
-export default new Uint16Array([${encoded
-    .map((val) => val.toString(10))
-    .join(",")}]);
+export default /* #__PURE__ */ (function () {
+    const hex = "${hex}".split(',');
+    const arr = new Uint16Array(${encoded.length});
+    for (let i = 0; i < arr.length; i++) {
+        arr[i] = parseInt(hex[i], 36);
+    }
+    return arr;
+})();
 `
 );
I've played with making this base64, for an even more compact encoding. Curious what you think:
const encoded = new Uint16Array(encodeTrie(getTrie(map, legacy)));
// eslint-disable-next-line node/no-unsupported-features/node-builtins
const base64 = Buffer.from(encoded.buffer).toString("base64");
/*
* We don't want to depend on Node's Buffer, so we use
* the native `atob` function.
*
* Let's make sure the decoding works as expected.
*/
assert.deepEqual(
new Uint16Array(
new Uint8Array(
atob(base64)
.split("")
.map((c) => c.charCodeAt(0))
).buffer
),
encoded
);
// Write the encoded trie to disk
fs.writeFileSync(
`${__dirname}/../src/generated/decode-data-${name}.ts`,
`// Generated using scripts/write-decode-map.ts
export default new Uint16Array(
new Uint8Array(
atob(
"${base64}"
)
.split("")
.map((c) => c.charCodeAt(0))
).buffer
);
`
);
This leads to a ~38% reduction of the raw size.
That's smart! I hadn't thought of using Uint8Array.
I took more detailed data and found that it greatly reduces the raw size, but increases the gzipped size.
|  | raw size (w/o terser) | raw size (w/ terser) | gzip size (w/ terser) | brotli size (w/ terser) |
| --- | --- | --- | --- | --- |
| master branch | 130261B | 98431B | 31629B | 25828B |
| base36 | 100718B (-22.7%) | 83939B (-14.7%) | 30569B (-3.4%) | 25256B (-2.2%) |
| base64 | 89979B (-31.0%) | 73438B (-25.4%) | 32148B (+1.6%) | 28135B (+8.9%) |
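For anyone who wants to reproduce the compression effect locally, here is a minimal sketch (not the script used for the table above; `compareEncodings` and its argument are illustrative) that compares how well the two encodings gzip using Node's built-in zlib:

import { gzipSync } from "zlib";

// Illustrative helper: `encoded` stands in for the array produced by
// encodeTrie(getTrie(map, legacy)).
function compareEncodings(encoded: number[]): void {
    const base36 = encoded.map((v) => v.toString(36)).join(",");
    const base64 = Buffer.from(new Uint16Array(encoded).buffer).toString("base64");

    for (const [name, text] of [
        ["base36", base36],
        ["base64", base64],
    ] as const) {
        // gzip the raw text to see how much redundancy the compressor can exploit
        const gzipped = gzipSync(Buffer.from(text)).length;
        console.log(`${name}: ${text.length} B raw, ${gzipped} B gzipped`);
    }
}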
Interesting — makes some sense given that patterns aren't as repetitive anymore, but I did expect this to be less of a factor. As yet another alternative, there is always storing the binary data directly in a string:
const encoded = encodeTrie(getTrie(map, legacy));
const stringified = JSON.stringify(String.fromCharCode(...encoded));
// Write the encoded trie to disk
fs.writeFileSync(
`${__dirname}/../src/generated/decode-data-${name}.ts`,
`// Generated using scripts/write-decode-map.ts
// prettier-ignore
export default new Uint16Array(
${stringified}
.split("")
.map((c) => c.charCodeAt(0))
);
`
);
This is not ideal, as it adds the trap door of character encodings, but for the sake of argument 🙂
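To make the pitfall concrete (a hypothetical scenario, not something observed in this repo): the generated string only round-trips if every tool in the pipeline treats the file as UTF-8.

// Hypothetical illustration: a code unit above 0x7f survives only if the file is
// consistently read as UTF-8. Misreading the UTF-8 bytes as Latin-1 silently
// turns one character into three.
const original = String.fromCharCode(0x2026); // "…"
const utf8Bytes = Buffer.from(original, "utf8"); // 3 bytes: e2 80 a6
const misread = utf8Bytes.toString("latin1"); // 3 characters, none of them "…"
console.log(original.length, misread.length); // 1 3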
For context, this reduces the file size for decode-data-html by 58%.
I avoided that approach at first because of potential encoding problems, but it does reduce the size significantly.
To mitigate the encoding problems, escaping non-ASCII and control characters should work. This increases the raw size, but terser will unescape them again unless format.ascii_only = true is set (sketched after the code below).
This escaping approach reduces the gzip/brotli file size more than base36/base64 does:
|  | raw size (w/o terser) | raw size (w/ terser) | gzip size (w/ terser) | brotli size (w/ terser) |
| --- | --- | --- | --- | --- |
| master branch | 130261B | 98431B | 31629B | 25828B |
| base36 | 100718B (-22.7%) | 83939B (-14.7%) | 30569B (-3.4%) | 25256B (-2.2%) |
| base64 | 89979B (-31.0%) | 73438B (-25.4%) | 32148B (+1.6%) | 28135B (+8.9%) |
| binary string | 76693B (-41.1%) | 59324B (-39.7%) | 28783B (-9.0%) | 23668B (-8.4%) |
| binary string escaped | 98479B (-24.4%) | 59324B (-39.7%) | 28783B (-9.0%) | 23668B (-8.4%) |
| binary string escaped + terser ascii_only=true | 98479B (-24.4%) | 80257B (-18.5%) | 30335B (-4.1%) | 24507B (-5.1%) |
const encoded = encodeTrie(getTrie(map, legacy));
const stringified = `"${encoded
.map((e) => {
if (e >= 0x20 && e <= 0x7e) {
return String.fromCharCode(e)
.replace("\\", "\\\\")
.replace('"', '\\"');
}
return `\\u${e.toString(16).padStart(4, "0")}`;
})
.join("")}"`;
// Write the encoded trie to disk
fs.writeFileSync(
`${__dirname}/../src/generated/decode-data-${name}.ts`,
`// Generated using scripts/write-decode-map.ts
// prettier-ignore
export default new Uint16Array(
${stringified}
.split("")
.map((c) => c.charCodeAt(0))
);
`
);
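For reference, the terser option mentioned above would be applied roughly like this (a sketch against terser's documented format.ascii_only option; not part of this PR, and consumers would set the equivalent flag in their own bundler config):

import { minify } from "terser";

// Sketch: with format.ascii_only enabled, terser keeps the output ASCII-only, so
// escaped characters stay as \xXX / \uXXXX escapes instead of being unescaped.
async function minifyAsciiOnly(source: string): Promise<string | undefined> {
    const result = await minify(source, {
        format: { ascii_only: true },
    });
    return result.code;
}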
I like it! Let's simplify the escaping logic to
const stringified = JSON.stringify(String.fromCharCode(...encoded)).replace(
/[^\x20-\x7e]/g,
(c) => `\\u${c.charCodeAt(0).toString(16).padStart(4, "0")}`
);
This will save some bytes if a known escape (e.g. \t) is found. The prettier-ignore is now unnecessary as well.
Turns out we can use hex escapes to save a few more bytes (see https://mathiasbynens.be/notes/javascript-escapes):
const stringified = JSON.stringify(String.fromCharCode(...encoded))
.replace(
/[^\0-\x7e]/g,
(c) => `\\u${c.charCodeAt(0).toString(16).padStart(4, "0")}`
)
.replace(/\\u0000/g, "\\0")
.replace(/\\u00([\da-f]{2})/g, "\\x$1");
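As an illustrative aside (sample values chosen only for this example), here is what those replacements do to a few code units:

// Illustrative only: JSON.stringify already emits \n for 0x0a; the replacements
// above then shorten \u0000 to \0 and \u001f to \x1f, while code units above
// 0x7e (e.g. 0x2026) become four-digit \u escapes.
const sample = String.fromCharCode(0, 0x1f, 0x0a, 0x2026);
// JSON.stringify(sample)       -> "\u0000\u001f\n…"
// after the replacements above -> "\0\x1f\n\u2026"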
Anyhow, this feels as far as this can be taken. Lmk what you think, and then let's get this merged!
There is also https://github.com/mathiasbynens/jsesc, which could be added as a dev dependency and would remove the need for all of the replace logic.
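A rough sketch of what that could look like (assuming jsesc is added as a dev dependency; stringifyTrie is just an illustrative name):

import jsesc from "jsesc";

// Sketch only: jsesc picks a short escape for each non-printable-ASCII character,
// which would replace the JSON.stringify + replace chain above.
function stringifyTrie(encoded: number[]): string {
    return jsesc(String.fromCharCode(...encoded), {
        quotes: "double",
        wrap: true, // include the surrounding double quotes in the output
    });
}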
I've pushed my change to use strings for decode data. Curious if this actually helps at all with size, as patterns are slightly less repetitive. Let me know if this makes sense for you, and then this should be good to land.
Thank you! It's great. I've updated the data in the description. This PR now reduces size by 10% - 45%!
terser was already doing these conversions, so it only affected the size without terser.
@sapphi-red Thanks for all of your work getting this over the finish line! I genuinely enjoyed thinking through all of the potential variants and looking for the best way forward.

I'm not publishing a release straight away, as I'm not sure whether this should be a minor or a major version. I'm leaning towards major, as there is a small chance of breaking people's code if they haven't configured terser as needed for their environment. The counter argument would be that their setup is broken, and this can be a minor version. Curious to hear what you think.

Reading your original message again, I wanted to address this:
Counterintuitively, the decode change could actually help with performance. Not important, but I thought it was fun and worth sharing.
I think a minor is acceptable, but a major would be safer, too.
That's interesting.
This PR reduces bundle file size by 10% - 45%.
[size comparison table: raw size (w/o terser), raw size (w/ terser), gzip size (w/ terser), brotli size (w/ terser); rows for terser ascii_only=true and ascii_only=false; values not preserved]
These values are the file size when a file containing export * from 'entities' is bundled. I took these values with this script.
This way it takes slightly longer time to start up, but the time to run functions won't be affected.
old data
[previous size comparison table: raw size (w/o terser), raw size (w/ terser), gzip size (w/ terser), brotli size (w/ terser); values not preserved]