Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[html] Allow ampersands in attribute values. #2036

Merged

Conversation

ditman
Copy link
Member

@ditman ditman commented Mar 13, 2025

Skips unexpected tokens when attempting to find a StartTagToken in dom.dart, so attribute values can contain the & sign without it being part of an actual HTML Entity.


  • I’ve reviewed the contributor guide and applied the relevant portions to this PR.
Contribution guidelines:

Note that many Dart repos have a weekly cadence for reviewing PRs - please allow for some latency before initial review feedback.

Copy link

github-actions bot commented Mar 13, 2025

Package publishing

Package Version Status Publish tag (post-merge)
package:bazel_worker 1.1.3-wip WIP (no publish necessary)
package:benchmark_harness 2.3.1 already published at pub.dev
package:boolean_selector 2.1.2 already published at pub.dev
package:browser_launcher 1.1.3 already published at pub.dev
package:cli_config 0.2.1-wip WIP (no publish necessary)
package:cli_util 0.4.2 already published at pub.dev
package:clock 1.1.2 already published at pub.dev
package:code_builder 4.10.1 already published at pub.dev
package:coverage 1.12.0-wip WIP (no publish necessary)
package:csslib 1.0.2 already published at pub.dev
package:extension_discovery 2.1.0 already published at pub.dev
package:file 7.0.2-wip WIP (no publish necessary)
package:file_testing 3.1.0-wip WIP (no publish necessary)
package:glob 2.1.3 already published at pub.dev
package:graphs 2.3.3-wip WIP (no publish necessary)
package:html 0.15.5+1 ready to publish html-v0.15.5+1
package:io 1.1.0-wip WIP (no publish necessary)
package:json_rpc_2 3.0.3 already published at pub.dev
package:markdown 7.3.1-wip WIP (no publish necessary)
package:mime 2.0.0 already published at pub.dev
package:oauth2 2.0.4-wip WIP (no publish necessary)
package:package_config 2.2.0 already published at pub.dev
package:pool 1.5.2-wip WIP (no publish necessary)
package:pub_semver 2.2.0 already published at pub.dev
package:pubspec_parse 1.5.0 already published at pub.dev
package:source_map_stack_trace 2.1.3-wip WIP (no publish necessary)
package:source_maps 0.10.14-wip WIP (no publish necessary)
package:source_span 1.10.1 already published at pub.dev
package:sse 4.1.7 already published at pub.dev
package:stack_trace 1.12.1 already published at pub.dev
package:stream_channel 2.1.4 already published at pub.dev
package:stream_transform 2.1.2-wip WIP (no publish necessary)
package:string_scanner 1.4.1 already published at pub.dev
package:term_glyph 1.2.3-wip WIP (no publish necessary)
package:test_reflective_loader 0.2.3 already published at pub.dev
package:timing 1.0.2 already published at pub.dev
package:unified_analytics 7.0.2 ready to publish unified_analytics-v7.0.2
package:watcher 1.1.1 already published at pub.dev
package:yaml 3.1.3 already published at pub.dev
package:yaml_edit 2.2.2 already published at pub.dev

Documentation at https://github.com/dart-lang/ecosystem/wiki/Publishing-automation.

Copy link

github-actions bot commented Mar 13, 2025

PR Health

Breaking changes ✔️
Package Change Current Version New Version Needed Version Looking good?
html None 0.15.5 0.15.5+1 0.15.5 ✔️
Changelog Entry ✔️
Package Changed Files

Changes to files need to be accounted for in their respective changelogs.

Coverage ✔️
File Coverage
pkgs/html/lib/src/tokenizer.dart 💚 100 %

This check for test coverage is informational (issues shown here will not fail the PR).

API leaks ⚠️

The following packages contain symbols visible in the public API, but not exported by the library. Export these symbols or remove them from your publicly visible API.

Package Leaked API symbols
html HtmlTokenizer
Token
HtmlInputStream
TagToken
DoctypeToken
StringToken
TreeBuilder
ActiveFormattingElements
StartTagToken
TagAttribute
CommentToken
CharactersToken
SpaceCharactersToken
EndTagToken

This check can be disabled by tagging the PR with skip-leaking-check.

License Headers ⚠️
// Copyright (c) 2025, the Dart project authors. Please see the AUTHORS file
// for details. All rights reserved. Use of this source code is governed by a
// BSD-style license that can be found in the LICENSE file.
Files
pkgs/html/lib/src/tokenizer.dart
pkgs/html/test/parser_feature_test.dart

All source files should start with a license header.

Unrelated files missing license headers
Files
pkgs/bazel_worker/benchmark/benchmark.dart
pkgs/bazel_worker/example/client.dart
pkgs/bazel_worker/example/worker.dart
pkgs/benchmark_harness/integration_test/perf_benchmark_test.dart
pkgs/boolean_selector/example/example.dart
pkgs/clock/lib/clock.dart
pkgs/clock/lib/src/clock.dart
pkgs/clock/lib/src/default.dart
pkgs/clock/lib/src/stopwatch.dart
pkgs/clock/lib/src/utils.dart
pkgs/clock/test/clock_test.dart
pkgs/clock/test/default_test.dart
pkgs/clock/test/stopwatch_test.dart
pkgs/clock/test/utils.dart
pkgs/coverage/lib/src/coverage_options.dart
pkgs/coverage/test/collect_coverage_config_test.dart
pkgs/coverage/test/config_file_locator_test.dart
pkgs/html/example/main.dart
pkgs/html/lib/dom.dart
pkgs/html/lib/dom_parsing.dart
pkgs/html/lib/html_escape.dart
pkgs/html/lib/parser.dart
pkgs/html/lib/src/constants.dart
pkgs/html/lib/src/encoding_parser.dart
pkgs/html/lib/src/html_input_stream.dart
pkgs/html/lib/src/list_proxy.dart
pkgs/html/lib/src/query_selector.dart
pkgs/html/lib/src/token.dart
pkgs/html/lib/src/treebuilder.dart
pkgs/html/lib/src/utils.dart
pkgs/html/test/dom_test.dart
pkgs/html/test/parser_test.dart
pkgs/html/test/query_selector_test.dart
pkgs/html/test/selectors/level1_baseline_test.dart
pkgs/html/test/selectors/level1_lib.dart
pkgs/html/test/selectors/selectors.dart
pkgs/html/test/support.dart
pkgs/html/test/tokenizer_test.dart
pkgs/pubspec_parse/test/git_uri_test.dart
pkgs/stack_trace/example/example.dart
pkgs/watcher/test/custom_watcher_factory_test.dart
pkgs/yaml_edit/example/example.dart

This check can be disabled by tagging the PR with skip-license-check.

@natebosch
Copy link
Member

@sigmundch - can you take a look? Do you think it's safe to silently ignore parse errors here?

I think it would be better if we can track down a way to avoid parsing URL attribute values as HTML entities but I don't see an obvious way to track that in the current implementation.

@natebosch
Copy link
Member

FWIW the diff I assumed we'd land in this code location is something like

         generateSpans: true, attributeSpans: true);
 
     tokenizer.moveNext();
-    final token = tokenizer.current as StartTagToken;
+    final token = tokenizer.current;
+    if (token is ParseErrorToken) {
+      throw ParseError(token.data, token.span, token.messageParams);
+    }
+    token as StartTagToken;
 
     if (token.attributeSpans == null) return; // no attributes

But that would require us to track down and fix the spurious parse error.

@ditman
Copy link
Member Author

ditman commented Mar 13, 2025

@natebosch I tracked the tokenizer error to here (I added a couple of links to the related issue):

Fixing it gracefully there... well, I didn't find a good solution. I tried the navie thing of not emitting those non-fatal tokenizer errors, but that seemed to mess up the computation of the spans. There's also a test validating that the ParseErrorToken is emitted:

So I guessed that token is a feature, and attempted to fix it one level up.

(Throwing if token is ParseErrorToken doesn't fix the problem I'm having, but it might produce a better error message, that's for sure :) )

@natebosch
Copy link
Member

Throwing if token is ParseErrorToken doesn't fix the problem I'm having, but it might produce a better error message, that's for sure :)

Yeah a better error message is my intention with that diff.

not emitting those non-fatal tokenizer errors, but that seemed to mess up the computation of the spans

Is it a hard requirement to parse this invalid HTML? Or is it possible for you to fix the input to <script src="foo?key=value&amp;key2=value2">?

I don't fully understand how or whether we differentiate fatal and non-fatal errors. I expect browser probably allow this syntax to work but if we can be more strict to the standards in the parser that's definitely easier for us.

@ditman
Copy link
Member Author

ditman commented Mar 13, 2025

Or is it possible for you to fix the input to <script src="foo?key=value&key2=value2">?

Yes, using &amp; was my workaround, and I was surprised that the browser supported it :)

Is it a hard requirement to parse this invalid HTML? [...] if we can be more strict to the standards in the parser that's definitely easier for us.

I disagree with this being invalid HTML.

In practice, it is rare that anybody is going to use the &amp; char ref when using URLs on their HTML, for example: https://developers.google.com/maps/documentation/javascript/load-maps-js-api#direct_script_loading_url_parameters. (see b/403028163 for my rabbit hole, somebody added a & to a URL that had a previous &amp; and stuff broke)

As for the standard; the URI rfc specifically lists & as a reserved character "to provide a set of delimiting characters that are distinguishable from other data within a URI."

The HTML Standard says that Character references like &amp; "can be used to escape characters that couldn't otherwise legally be included in text."

I think it's a little unfortunate that the HTML parser behaves this way, but I can't suggest a better solution than this PR.

@sigmundch
Copy link
Member

Interesting, reading this part of the spec, it does appear like the parse error was not intended to happen within an attribute value. In particular, see the text around being "consumed as part of an attrinbute".

The location you found @ditman in

} else {
_addToken(ParseErrorToken('expected-named-entity'));
stream.unget(charStack.removeLast());
output = '&${charStack.join()}';
}

does appear to be the place to address this issue. Note that there is a parameter fromAttribute already in this method, which denotes whether we are in the special case above. I believe we should be able to change the tokenizer instead to do hide the error only in that case. The other two operations are still applicable (we need the fix the state of the stream and append the attribute to the attributeValue):

if (!fromAttribute) {
  _addToken(ParseErrorToken(...))
}
stream.unget(...)
output = ...;

Just trying that out locally seems to still keep all tests passing and make your new test pass as well.

@ditman
Copy link
Member Author

ditman commented Mar 13, 2025

@sigmundch what a cool find Siggi, I totally missed that param! (That method is long and full of terrors :P)

Let me add that change and another test case to the tokenizer.

@sigmundch
Copy link
Member

lol, totally :)

ditman and others added 2 commits March 13, 2025 11:03
…side an attribute value.

Co-Authored-By: sigmundch <sigmund@google.com>
Copy link
Member

@sigmundch sigmundch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks David, looks great!

@ditman ditman requested a review from sigmundch March 13, 2025 18:28
@ditman
Copy link
Member Author

ditman commented Mar 13, 2025

This seems to be all green, but I can't click the submit button :) /cc @sigmundch

@sigmundch sigmundch merged commit b55643d into dart-lang:main Mar 13, 2025
16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants