Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor CesiumUtility::Uri to use ada instead of uriparser #1072

Merged
merged 38 commits into from
Feb 1, 2025
Merged

Conversation

azrogers
Copy link
Contributor

In the discussion around #973, we decided we wanted to see if we could replace the aging uriparser library with an alternative that is compatible with the URIs modern browsers use, which has advanced past the RFC 3986 specification in many ways. To this end, I've refactored the implementation of CesiumUtility::Uri to work with ada, a modern C++ URI parser - notably, the one used by NodeJS. This will help us avoid unicode issues like the aforementioned #973, but also hopefully give us more tools to deal with URLs and avoid other problems like these in the future.

The biggest part of this refactor is that CesiumUtility::Uri instances are now constructed and interacted with via member functions as opposed to the previous approach of static methods accepting and returning strings (though the static methods remain for API compatibility). The advantage of this is it means we only have to parse a URL once - if we're performing multiple operations, such as using it as the base URL for many different relative URLs, or fetching multiple query values from it, or setting multiple query values on it, we don't need to be constantly allocating and copying around strings to accomplish this.

I've leaned on the side of permissiveness over adherence to the spec. For example, opaque paths are allowed as URLs despite being specifically disallowed by the WhatWG spec. This is because in at least one point in Cesium Native, we're adding query parameters to an opaque path (what will be a relative path, but hasn't yet been resolved). To accomplish this, we just pretend that it's actually a file URI, and remove the dummy scheme when the user extracts the URI into a string. This also helps some of the tests work out because, for whatever reason, we wrote them counting on the fallback behavior of the old implementation returning the relative parameter when a URL failed to parse.

I wasn't able to work out all of the issues from CI before the end of the week, but I'll pick it back up on Monday. In the mean time, it's definitely worth taking a look at my approach and giving feedback on.

Comment on lines 264 to 265
"%7Bz%7D/%7Bx%7D/"
"%7By%7D.terrain?v=%7Bversion%7D&extensions=octvertexnormals-metadata");
Copy link
Member

@kring kring Jan 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes me a little nervous. I guess it's needed because { and } aren't valid characters in a URL path, right? But it gets tricky fast. What if the URL already contains %7B and %7D characters? How do we distinguish the ones that are really the placeholder delimiters from the others that might be in the URL?

The best solution, I think, is for the URL library to understand placeholders. That sounds like a tall order, but it looks like ada added support two weeks ago!
ada-url/ada#785

This isn't in a released version. And it might not fit perfectly; we might need to replace {x} with {:x}, if I'm reading it correctly. Still, this should be a lot more reliable than trying to encode/decode URLs with placeholders.

What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would probably be good to switch to the URLPattern implementation. It looks like it would speed up placeholder substitution (not that it's a big performance cost for us right now, but a win is a win), and having our placeholders in line with the spec can't hurt. My question is, is there anywhere in the codebase currently where users can input their own templated URLs? Besides UrlTemplateRasterOverlay, of course, which hasn't yet been shipped so it's OK if we change the behavior. I'd hate for us to update and break everyone's projects because they don't have colons in their URL templates.

As for it not being in a released version, that might be a concern. It doesn't look like there's any defined release cadence for ada, since the last release was in September. Using this might mean we have to add it as a submodule - which wouldn't be the end of the world, but I would prefer not to if we can avoid it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do wonder how many URLs contain encoded { and } characters that aren't URL templates that we should be worried about.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do wonder how many URLs contain encoded { and } characters that aren't URL templates that we should be worried about.

Well, it's not just the { and } we have to worry about, though. It looks like URLs are now percent-decoded before doing the placeholder substitution. So if you had, say, a percent encoded slash, that would get decoded, too, and could break the URL. How common is this? I don't know, but it definitely came up in CesiumJS at one point. CesiumJS replaces just the { and } characters:
https://github.com/CesiumGS/cesium/blob/main/packages/engine/Source/Core/Resource.js#L550

Using this might mean we have to add it as a submodule - which wouldn't be the end of the world, but I would prefer not to if we can avoid it.

No need for a submodule, it can be done with a vcpkg overlay port. We haven't used those in cesium-native itself yet, though (only Unreal and Unity).

is there anywhere in the codebase currently where users can input their own templated URLs

Yeah, the WMTS tile provider uses placeholders.

I suspect we could simply replace {whatever} with {:whatever} on-the-fly, though, and that would be reliable.

Copy link
Contributor Author

@azrogers azrogers Jan 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kring After looking into ada's URL pattern support, I think it would be better if we avoided it. Besides the URL pattern support being fairly sparse on documentation, not yet in a stable release, and not currently building on GCC versions lower than 13.1, it's actually meant to do the opposite of what we're looking to do. It's meant to match a pattern against a URL, such as for the purposes of routing (like how Express.js uses them), rather than using a pattern to generate a URL from a set of parameters. We could probably finagle the library to work this way, but it would be a bit of a headache and doesn't seem too worth it all things considered. Probably would not be particularly more performant than our current implementation.

Instead, I've rewritten substituteTemplateParameters so that encoded { and } characters are treated as if they are their unencoded equivalents. This allows the templates to be substituted without decoding the entire string, saving us a string copy as well as preventing any issues from URL decoding things we shouldn't be. Technically, this does still have the issue where an encoded { } that isn't meant to be treated as a placeholder will still be treated as one, but considering CesiumJS would have the same issue I think it's a fine solution.

@azrogers
Copy link
Contributor Author

@kring I've done my best to make sure that the substituteTemplateParameters implementation is consistent with CesiumJS. Instead of the approach of manually iterating over the string that I was doing before, I've used a regex-based implementation, which matches the implementation in CesiumJS. Unfortunately, we can't hit 100% parity with CesiumJS's implementation, at least with the current API: in CesiumJS, if a key isn't found in templateValues, it'll ignore the placeholder and return the original match - so {a{b}} would just return {a{b}} as a{b is not a key in the map. But for us we have no idea of keys, just a callback that returns a string - so the best we can do is have the callback written to return "{" + key + "}" if the placeholder isn't what's expected.

If we did change the API to allow us to know when a template should be ignored (if the callback returns std::nullopt, for example), this might help us with cases where there's bracketed text in the URL that shouldn't be treated as a template parameter. Currently, it's up to the callback to preserve that original template parameter if it's not a placeholder that's expected.

Anyways, probably the most impactful change here is that an unclosed template parameter no longer throws a runtime_error. It's probably best that we don't throw an exception for something that could happen just from a user messing up a copy-and-paste job, and besides, the behavior of CesiumJS is just to ignore an unclosed template parameter.

Copy link
Member

@kring kring left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking really good! I have a few small comments. See also #1084. And please update CHANGES.md because this does have at least one important user-facing improvement (unicode in URLs!).

Copy link
Member

@kring kring left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the changes so far! I'm not sure if you were done after the last round of review, but here are a few more comments for you.

@azrogers
Copy link
Contributor Author

@kring Split off UriQueryParams from Uri to avoid initializing url_search_params when we don't need to. With that it should be good for a (hopefully final!) review!

@kring kring added this to the February 2025 Release milestone Jan 31, 2025
@kring
Copy link
Member

kring commented Jan 31, 2025

This looks great and is ready to merge other than one small reservation I have left, and I feel bad about this one: we really shouldn't use abbreviations like "Params", especially in a relatively prominent part of our public API. How about renaming UriQueryParams to just UriQuery?

@azrogers
Copy link
Contributor Author

@kring Renamed!

@kring
Copy link
Member

kring commented Feb 1, 2025

Thanks for all the work (and patience!) on this @azrogers!

@kring kring merged commit 5ec42ed into main Feb 1, 2025
22 checks passed
@kring kring deleted the ada-url branch February 1, 2025 10:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants