
feat: Smaller encode/decode maps #909

Merged: 7 commits into fb55:master, Aug 23, 2022

Conversation

@sapphi-red (Contributor) commented Aug 20, 2022

This PR reduces the bundled file size by 10%–45%.

| | raw size (w/o terser) | raw size (w/ terser) | gzip size (w/ terser) | brotli size (w/ terser) |
| --- | --- | --- | --- | --- |
| master branch | 130261B | 98431B | 31629B | 25828B |
| only decode map change + terser ascii_only=true | 96799B (-25.7%) | 80257B (-18.5%) | 30335B (-4.1%) | 24507B (-5.1%) |
| only decode map change + terser ascii_only=false | 96799B (-25.7%) | 59324B (-39.7%) | 28783B (-9.0%) | 23668B (-8.4%) |
| only encode map change | 126081B (-3.2%) | 94083B (-4.4%) | 29117B (-8.0%) | 24322B (-5.8%) |
| both changes + terser ascii_only=true | 92619B (-28.9%) | 75909B (-22.9%) | 27777B (-12.2%) | 23093B (-10.5%) |
| both changes + terser ascii_only=false | 92619B (-28.9%) | 54976B (-44.2%) | 26083B (-17.5%) | 22175B (-14.1%) |

These values are the file sizes when a file containing `export * from 'entities'` is bundled.
I measured these values with this script.

With this change, startup takes slightly longer, but the time to run the functions is not affected.
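
The linked script isn't reproduced here. As a rough sketch of how such a measurement could be set up (assuming esbuild for bundling, terser for minification, and Node's zlib for the gzip/brotli numbers; this is an illustration, not the actual script behind the tables):

// Hypothetical measurement sketch, not the script used for the tables above.
// It bundles `export * from "entities"`, minifies with terser (toggling
// format.ascii_only), and reports raw/gzip/brotli sizes. Run as an ES module.
import { build } from "esbuild";
import { minify } from "terser";
import { gzipSync, brotliCompressSync } from "zlib";

const bundle = await build({
    stdin: { contents: `export * from "entities";`, resolveDir: "." },
    bundle: true,
    format: "esm",
    write: false,
});
const raw = bundle.outputFiles[0].text;

for (const asciiOnly of [false, true]) {
    const { code } = await minify(raw, { format: { ascii_only: asciiOnly } });
    const buf = Buffer.from(code);
    console.log(
        `ascii_only=${asciiOnly}:`,
        `raw=${buf.length}B`,
        `gzip=${gzipSync(buf).length}B`,
        `brotli=${brotliCompressSync(buf).length}B`
    );
}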

old data

| | raw size (w/o terser) | raw size (w/ terser) | gzip size (w/ terser) | brotli size (w/ terser) |
| --- | --- | --- | --- | --- |
| master branch | 130261B | 98431B | 31629B | 25828B |
| only decode map change (f9ece1d) | 100718B (-22.7%) | 83939B (-14.7%) | 30569B (-3.4%) | 25256B (-2.2%) |
| only encode map change (4eac4ae) | 126081B (-3.2%) | 94083B (-4.4%) | 29117B (-8.0%) | 24322B (-5.8%) |
| both changes | 96538B (-25.9%) | 79591B (-19.1%) | 27909B (-11.8%) | 23842B (-7.7%) |

@fb55 (Owner) commented Aug 21, 2022

Hi @sapphi-red, cool optimisations, thanks for looking into this!

Comment on lines 14 to 32
const encoded = encodeTrie(getTrie(map, legacy));
const hex = encoded.map((v) => v.toString(36)).join(",");

// Write the encoded trie to disk
fs.writeFileSync(
    `${__dirname}/../src/generated/decode-data-${name}.ts`,
    `// Generated using scripts/write-decode-map.ts
/* eslint-disable */
// prettier-ignore
export default new Uint16Array([${encoded
    .map((val) => val.toString(10))
    .join(",")}]);
export default /* #__PURE__ */ (function () {
    const hex = "${hex}".split(',');
    const arr = new Uint16Array(${encoded.length});
    for (let i = 0; i < arr.length; i++) {
        arr[i] = parseInt(hex[i], 36);
    }
    return arr;
})();
`
);
@fb55 (Owner) commented Aug 21, 2022


I've played with making this base64, for an even more compact encoding. Curious what you think:

const encoded = new Uint16Array(encodeTrie(getTrie(map, legacy)));
// eslint-disable-next-line node/no-unsupported-features/node-builtins
const base64 = Buffer.from(encoded.buffer).toString("base64");

/*
 * We don't want to depend on Node's Buffer, so we use
 * the native `atob` function.
 *
 * Let's make sure the decoding works as expected.
 */
assert.deepEqual(
    new Uint16Array(
        new Uint8Array(
            atob(base64)
                .split("")
                .map((c) => c.charCodeAt(0))
        ).buffer
    ),
    encoded
);

// Write the encoded trie to disk
fs.writeFileSync(
    `${__dirname}/../src/generated/decode-data-${name}.ts`,
    `// Generated using scripts/write-decode-map.ts

export default new Uint16Array(
    new Uint8Array(
        atob(
            "${base64}"
        )
            .split("")
            .map((c) => c.charCodeAt(0))
    ).buffer
);
`
);

This leads to a ~38% reduction of the raw size.

@sapphi-red (Contributor, Author) commented Aug 21, 2022


That's smart! I hadn't thought of using Uint8Array.
I collected more detailed data and found that it greatly reduces the raw size but increases the gzipped size.

| | raw size (w/o terser) | raw size (w/ terser) | gzip size (w/ terser) | brotli size (w/ terser) |
| --- | --- | --- | --- | --- |
| master branch | 130261B | 98431B | 31629B | 25828B |
| base36 | 100718B (-22.7%) | 83939B (-14.7%) | 30569B (-3.4%) | 25256B (-2.2%) |
| base64 | 89979B (-31.0%) | 73438B (-25.4%) | 32148B (+1.6%) | 28135B (+8.9%) |

@fb55 (Owner)

Interesting — makes some sense given that patterns aren't as repetitive anymore, but I did expect this to be less of a factor. As yet another alternative, there is always storing the binary data directly in a string:

const encoded = encodeTrie(getTrie(map, legacy));
const stringified = JSON.stringify(String.fromCharCode(...encoded));

// Write the encoded trie to disk
fs.writeFileSync(
    `${__dirname}/../src/generated/decode-data-${name}.ts`,
    `// Generated using scripts/write-decode-map.ts
// prettier-ignore
export default new Uint16Array(
  ${stringified}
    .split("")
    .map((c) => c.charCodeAt(0))
);
`
);

@fb55 (Owner)

This is not ideal, as it adds the trap door of character encodings, but for the sake of argument 🙂

@fb55 (Owner)

For context, this reduces the file size for decode-data-html by 58%.

@sapphi-red (Contributor, Author) commented Aug 21, 2022

I had avoided that approach at first because of encoding problems, but it does reduce the size significantly.
To mitigate the encoding problems, escaping non-ASCII and control characters should work.
This increases the raw size, but terser will unescape them again unless format.ascii_only = true is set.
This escaping approach reduces the gzip/brotli file size more than base36/base64 does.

| | raw size (w/o terser) | raw size (w/ terser) | gzip size (w/ terser) | brotli size (w/ terser) |
| --- | --- | --- | --- | --- |
| master branch | 130261B | 98431B | 31629B | 25828B |
| base36 | 100718B (-22.7%) | 83939B (-14.7%) | 30569B (-3.4%) | 25256B (-2.2%) |
| base64 | 89979B (-31.0%) | 73438B (-25.4%) | 32148B (+1.6%) | 28135B (+8.9%) |
| binary string | 76693B (-41.1%) | 59324B (-39.7%) | 28783B (-9.0%) | 23668B (-8.4%) |
| binary string escaped | 98479B (-24.4%) | 59324B (-39.7%) | 28783B (-9.0%) | 23668B (-8.4%) |
| binary string escaped + terser ascii_only=true | 98479B (-24.4%) | 80257B (-18.5%) | 30335B (-4.1%) | 24507B (-5.1%) |

    const encoded = encodeTrie(getTrie(map, legacy));
    const stringified = `"${encoded
        .map((e) => {
            if (e >= 0x20 && e <= 0x7e) {
                return String.fromCharCode(e)
                    .replace("\\", "\\\\")
                    .replace('"', '\\"');
            }
            return `\\u${e.toString(16).padStart(4, "0")}`;
        })
        .join("")}"`;

    // Write the encoded trie to disk
    fs.writeFileSync(
        `${__dirname}/../src/generated/decode-data-${name}.ts`,
        `// Generated using scripts/write-decode-map.ts

// prettier-ignore
export default new Uint16Array(
    ${stringified}
        .split("")
        .map((c) => c.charCodeAt(0))
);
`
    );

@fb55 (Owner)

I like it! Let's simplify the escaping logic to

const stringified = JSON.stringify(String.fromCharCode(...encoded)).replace(
    /[^\x20-\x7e]/g,
    (c) => `\\u${c.charCodeAt(0).toString(16).padStart(4, "0")}`
);

This will save some bytes whenever a known escape (e.g. \t) is found. The prettier-ignore is now unnecessary as well.
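
As an illustration only (hypothetical input values, not the actual trie data), this is what the simplified logic produces; note how the tab keeps its short \t escape while other non-printable characters become \uXXXX:

// Sketch: JSON.stringify keeps printable ASCII and emits short escapes such
// as \t; the replace() then only touches characters outside 0x20-0x7e.
const encoded = [0x0009, 0x0041, 0x00a0, 0x2211]; // hypothetical values
const stringified = JSON.stringify(String.fromCharCode(...encoded)).replace(
    /[^\x20-\x7e]/g,
    (c) => `\\u${c.charCodeAt(0).toString(16).padStart(4, "0")}`
);
console.log(stringified); // prints "\tA\u00a0\u2211"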

@fb55 (Owner) commented Aug 22, 2022

Turns out we can use hex escapes to save a few more bytes (see https://mathiasbynens.be/notes/javascript-escapes):

const stringified = JSON.stringify(String.fromCharCode(...encoded))
  .replace(
      /[^\0-\x7e]/g,
      (c) => `\\u${c.charCodeAt(0).toString(16).padStart(4, "0")}`
  )
  .replace(/\\u0000/g, "\\0")
  .replace(/\\u00([\da-f]{2})/g, "\\x$1");
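
For illustration (hypothetical values again: NUL, tab and NBSP), the replacement chain shortens \u0000 to \0 and \u00a0 to \xa0:

// Sketch with hypothetical input values, not the actual trie data.
const encoded = [0x0000, 0x0009, 0x00a0];
const stringified = JSON.stringify(String.fromCharCode(...encoded))
    .replace(
        /[^\0-\x7e]/g,
        (c) => `\\u${c.charCodeAt(0).toString(16).padStart(4, "0")}`
    )
    .replace(/\\u0000/g, "\\0")
    .replace(/\\u00([\da-f]{2})/g, "\\x$1");
console.log(stringified); // prints "\0\t\xa0"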

@fb55 (Owner)

Anyhow, this feels as far as this can be taken. Lmk what you think, and then let's get this merged!

@fb55 (Owner)

There is also https://github.com/mathiasbynens/jsesc, which could be added as a dev dependency and would remove the need for all of the replace logic.
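
A rough sketch of what that could look like (option names taken from jsesc's documentation; untested here, and `encoded` is the same value array the script already builds):

// Hypothetical sketch: jsesc as a dev dependency instead of the manual
// replace() chain. It escapes non-ASCII characters by default.
import jsesc from "jsesc";

const stringified = jsesc(String.fromCharCode(...encoded), {
    quotes: "double", // escape for a double-quoted string literal
    wrap: true, // include the surrounding quotes in the output
});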

@fb55 (Owner) commented Aug 23, 2022

I've pushed my change to use strings for decode data. Curious if this actually helps at all with size, as patterns are slightly less repetitive. Let me know if this makes sense for you, and then this should be good to land.

@sapphi-red (Contributor, Author)

Thank you! It's great. I've updated the data in the description. This PR now reduces the size by 10%–45%!
I also added a commit updating src/generated/decode-data-html.ts, because it was slightly different from the generated result (a `"` vs. a `'`).

> Curious if this actually helps at all with size, as patterns are slightly less repetitive.

terser was already doing these conversions, so it only affected raw size (w/o terser).

@fb55 (Owner) commented Aug 23, 2022

@sapphi-red Thanks for all of your work getting this over the finish line! I genuinely enjoyed thinking through all of the potential variants and looking for the best way forward.

I'm not publishing a release straight away, as I'm not sure whether this should be a minor or a major version. I'm leaning towards major, as there is a small chance of breaking people's code if they haven't configured terser as needed for their environment. The counter argument would be that their setup is broken, and this can be a minor version. Curious to hear what you think.
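
For consumers who do need ASCII-only output, the relevant terser setting would look roughly like this (a sketch, not part of the PR; `bundledCode` is a placeholder for the consumer's bundle):

// Sketch: opting into ASCII-only output when minifying with terser.
import { minify } from "terser";

const { code } = await minify(bundledCode, {
    format: { ascii_only: true }, // escape non-ASCII characters in the output
});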

Reading your original message again, I wanted to address this:

> With this change, startup takes slightly longer, but the time to run the functions is not affected.

Counterintuitively, the decode change could actually help with performance. Not important, but I thought it was fun and worth sharing.

@fb55 merged commit 0a74924 into fb55:master on Aug 23, 2022
@sapphi-red (Contributor, Author)

I think a minor version is acceptable, but a major would be safer.

> Counterintuitively, the decode change could actually help with performance. Not important, but I thought it was fun and worth sharing.

That's interesting.
