Strange sanitizing of <style>, <title> and <iframe> tags #494
Comments
What you're describing in #.2 is invalid HTML - I assume AngleSharp is trying to correct for that. |
The attacker doesn't care about HTML structure; they want to break XSS protection, and the provided cases allow them to do it more efficiently :) |
The point was, AngleSharp is going to try to do what your browser would do, which is correct invalid markup. The result you're getting is an encoded img tag, not an actual HTML tag, so I'm not sure what the concern is (aside from having a result that shows Is there something which you think is broken here, I guess is what I'm asking? |
I would say that all these cases should be processed in the same way when But we have 3 different results by just replacing parent tag:
We use |
If I had to guess, my theory would be that AngleSharp considers anything following the style or title tag as "content" (since neither allows HTML) and encodes anything that has an HTML entity equivalent. Once it's done that, HtmlSanitizer won't process the "image" as HTML, because, well, it's not - at that point, it's just text. Not sure how this output would be an XSS vector - you'd have to decode As for the img element is self-terminating, so there's no child content to preserve. If you want to keep things like
JUST TO CLARIFY: I am not a maintainer or member of this project, I just use this pretty heavily in my projects, so I try to contribute back if I can. It's entirely possible that I'm misunderstanding what you're trying to do, but I don't think so. If I am, @mganss will probably jump in at some point and can explain what's happening better.
You might also consider posting a related question in the AngleSharp repo, since that's ultimately the driver behind what you're seeing, but if you do so, I'd suggest limiting the question to why tags within elements like |
Thanks for your thoughts @tiesont. You understood correctly what I tried to explain.
Correct, that is what we are doing because of our internal logic: 99% of the app is React-based (and XSS-safe), but for the remaining 1% that relies on third-party libraries it sometimes doesn't work, and XSS may happen because of the unescape 😄
Your explanation makes sense except for the iframe case. With iframe, the img tag is additionally escaped (double escaping), which is weird. In general, it doesn't matter that the mentioned cases are sanitized differently, as long as you don't unescape them again like we do. |
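To make the unescape concern concrete, here is a minimal JavaScript sketch. The helpers `encodeHtml` and `decodeHtmlOnce` are hypothetical and are not part of HtmlSanitizer or AngleSharp; they only model one level of entity encoding and one unescape pass, assuming the single and double encoded outputs discussed in this thread.

```javascript
// Hypothetical helpers for illustration only; not HtmlSanitizer/AngleSharp code.
function encodeHtml(text) {
  // Encode & first so already-encoded entities gain another level.
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

function decodeHtmlOnce(text) {
  // Decode &amp; last so one call undoes exactly one level of encoding.
  return text
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/&amp;/g, "&");
}

const raw = "<img>";
const encodedOnce = encodeHtml(raw);          // style/title case: inert text
const encodedTwice = encodeHtml(encodedOnce); // iframe case: double encoded

// One unescape pass turns single-encoded text back into live markup,
// while double-encoded text is still inert after one pass.
const afterUnescapeSingle = decodeHtmlOnce(encodedOnce);
const afterUnescapeDouble = decodeHtmlOnce(encodedTwice);
```

In this model the encoded output is harmless on its own; the markup only comes back when the application unescapes it, which is why a second sanitize pass after any unescape step is the safer design.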
Yeah, Again, just conjecture based on past observed behavior (although my use-cases are much simpler, so I haven't run into anything like you're describing here). |
As @tiesont pointed out, elements like
var d = document.createElement("div");
d.innerHTML = '<title><img></title>';
d.innerHTML
-> '<title>&lt;img&gt;</title>'
To eliminate the XSS potential surrounding text content, HtmlSanitizer encodes any text content of raw text elements. There was a recent vulnerability in HtmlSanitizer that was caused by a bug in AngleSharp and allowed a bypass when foreign content elements were allowed; see GHSA-43cp-6p3q-2pc4. I'll have to look into the |
The double encoding of As @tiesont mentioned, iframe is not permitted to have content so I wouldn't consider the current behavior a bug. OTOH I'm wondering what HtmlSanitizer should do in a case like this. Perhaps we should always remove the content of iframe elements? |
@mganss That's probably something that gets tricky - since I'll do some experimenting, but do you know what happens if something invalid like |
Hi again, thanks for the explanation. We will try to handle it on our own, but I have a question. |
@tiesont It seems
That's why AngleSharp continues to parse what follows the opening |
@andrewQwer Yes, that might be possible. Will experiment in the next few days. |
This would be great and potentially allow us to use |
I'm considering changing the processing order so that the encoding of text content occurs after the removal of disallowed elements. This would mean the double encoding in the iframe case would no longer occur, i.e. the output of Does that facilitate the use case you have in mind @andrewQwer? I'd prefer not to add a special property like |
Hi @mganss. At least this behavior will be consistent with the behavior of other tags. As for expectations, I would rather remove child nodes in all cases, no matter what the parent tag is, as I showed in the initial message for the div tag. But I understand that this might not be possible because of how AngleSharp treats different tags. |
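The effect of the processing order can be illustrated with a toy JavaScript model. Everything here is hypothetical (`escapeText`, `encodeRawText`, `unwrapDisallowed`, `serialize`) and does not mirror HtmlSanitizer's real implementation. It only assumes that the iframe content is parsed as a single text node and that serializing a plain text node escapes it once.

```javascript
// Toy model only: hypothetical functions, not HtmlSanitizer's real code.
const escapeText = (s) =>
  s.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");

// A parsed raw text element: its content is one text node, not child tags.
const makeIframe = () => ({ tag: "iframe", text: " <img> text continuation" });

// Encode the text content of a raw text element in place.
const encodeRawText = (el) => ({ ...el, text: escapeText(el.text) });

// Remove a disallowed element but keep its text as a plain text node.
const unwrapDisallowed = (el) => ({ tag: null, text: el.text });

// Serializing a plain text node escapes it once.
const serialize = (node) =>
  node.tag === null
    ? escapeText(node.text)
    : `<${node.tag}>${node.text}</${node.tag}>`;

// Encode first, then remove: the kept text ends up escaped twice.
const encodeFirst = serialize(unwrapDisallowed(encodeRawText(makeIframe())));

// Remove first, then encode only elements that survived removal:
// the kept text is escaped once, by serialization alone.
const removed = unwrapDisallowed(makeIframe());
const removeFirst = serialize(
  removed.tag === null ? removed : encodeRawText(removed)
);
```

Under these assumptions, moving the encoding step after removal brings the iframe case in line with the single-encoded result seen for style and title.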
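You can perform all kinds of processing using the
// Needs: using AngleSharp.Dom; using AngleSharp.Html.Dom;
sanitizer.RemovingTag += (s, e) =>
{
    var tag = e.Tag;
    // Find the document that owns the tag so a new text node can be created.
    var dom = tag.GetAncestor<IHtmlDocument>();
    if (dom != null)
    {
        // Replace the disallowed element with its text content only.
        var txt = tag.TextContent;
        var txtNode = dom.CreateTextNode(txt);
        tag.Replace(txtNode);
        // Cancel the default removal; the tag has already been replaced.
        e.Cancel = true;
    }
}; |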
Thanks! Works as expected. I just added an extra Sanitize call on TextContent to remove the tags that remain inside style/title tags, and an extra unescape for the specific iframe case, which double-escapes the tags inside. |
Hi, we are trying to sanitize our user input and want to erase all the tags while keeping the other text unchanged. But the sanitizer behaves differently depending on which tag comes first.
Class initialization looks like this:
In all examples the same Sanitize method is called:
1. some text <div> <img> text continuation - this is the input string.
Output is -> some text text continuation - this is what we actually expect from the library.
2. some text <style> <img> text continuation
Output is -> some text &lt;img&gt; text continuation - the image tag remained for some reason. Same behavior with the title tag as well. Why?
The same happens if you wrap the <img> tag into style or title completely, like this: some text <style> <img></style> text continuation - the img tag will remain.
3. some text <iframe> <img> text continuation
Output is -> some text &amp;lt;img&amp;gt; text continuation - the image tag remained but was additionally escaped. Better than p.2, but it is still unclear why the output differs from p.1.
I haven't found any tags other than style, title and iframe that cause this strange behavior. Other tags I checked are sanitized like in p.1.
Can you please advise what combination of settings should be used to always get results like in p.1, or at least like in p.3?