Rewrite GetAllNodes to use a stack on the heap #423

Merged on Jan 17, 2023 (2 commits)

bjornri (Contributor) commented Jan 16, 2023

HtmlSanitizer.GetAllNodes() recurses through the DOM on the call stack, so when we try to sanitize a deeply nested email (a few thousand elements deep) we get a StackOverflowException. Our app runs in IIS, where the stack size is very small. Increasing the stack size does not solve the problem, though: a more deeply nested HTML document would still crash us. And when we crash with a stack overflow, the entire app goes down.

See #417, which is the same issue we are having.

Even though HtmlSanitizer throws StackOverflowException, AngleSharp is still able to parse the HTML.

I have rewritten GetAllNodes() to use a stack on the heap to avoid stack overflows, so that if AngleSharp can parse a document, HtmlSanitizer can sanitize it. I don't think HtmlSanitizer relies on the ordering of the returned enumeration, but I have kept the depth-first ordering that the recursive approach produced.
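
For readers who want to see the general idea, here is a minimal sketch of a depth-first traversal that keeps its pending nodes in a heap-allocated `Stack<T>` instead of recursing. It is an illustration only, not the code merged in this PR: `Node` and `NodeWalker` are hypothetical stand-ins (the real method works on AngleSharp's `INode`), and the sketch assumes the enumeration yields descendants only, not the root itself.

```csharp
using System.Collections.Generic;

// Hypothetical minimal node type standing in for AngleSharp's INode.
public sealed class Node
{
    public string Name { get; }
    public List<Node> ChildNodes { get; } = new List<Node>();

    public Node(string name) => Name = name;
}

public static class NodeWalker
{
    // Yields every descendant of `root` (assumption: not `root` itself) in
    // depth-first document order. Pending nodes live in a Stack<Node> on the
    // heap, so nesting depth does not grow the call stack.
    public static IEnumerable<Node> GetAllNodes(Node root)
    {
        var pending = new Stack<Node>();

        // Push children in reverse so the first child is popped first.
        for (var i = root.ChildNodes.Count - 1; i >= 0; i--)
            pending.Push(root.ChildNodes[i]);

        while (pending.Count > 0)
        {
            var node = pending.Pop();
            yield return node;

            // Same reverse push keeps the order a recursive walk would give.
            for (var i = node.ChildNodes.Count - 1; i >= 0; i--)
                pending.Push(node.ChildNodes[i]);
        }
    }
}
```

Pushing children in reverse is what preserves the order of the recursive version: each node is returned before its children, and siblings come out left to right. Memory use still grows with the size of the tree, but it is heap memory, so running out of it surfaces as an ordinary exception rather than a stack overflow that takes down the process.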

Note that StyleExtensions.GetStylesheets() in AngleSharp has the same issue, which still causes HtmlSanitizer to crash when enumerating style sheets. I plan to send a PR to that repository as well.

I think the existing unit tests have enough coverage for this method, so I did not create any additional tests.

codecov bot commented Jan 16, 2023

Codecov Report

Base: 94.43% // Head: 94.64% // Increases project coverage by +0.20% 🎉

Coverage data is based on head (a0da33f) compared to base (008cff7).
Patch coverage: 100.00% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #423      +/-   ##
==========================================
+ Coverage   94.43%   94.64%   +0.20%     
==========================================
  Files           6        6              
  Lines         827      840      +13     
  Branches       79       83       +4     
==========================================
+ Hits          781      795      +14     
  Misses         34       34              
+ Partials       12       11       -1     

| Impacted Files | Coverage Δ |
|---|---|
| src/HtmlSanitizer/HtmlSanitizer.cs | 93.41% <100.00%> (+0.48%) ⬆️ |


bjornri (Contributor, Author) commented Jan 17, 2023

It looks like reaching the codecov/patch target requires the GetAllNodes() parameter to be a non-nullable INode, with callers checking for null before calling it. I added a commit with this change.

mganss merged commit 7be648b into mganss:master on Jan 17, 2023

mganss (Owner) commented Jan 17, 2023

Thanks!

bjornri deleted the Stackoverflow branch on January 18, 2023 12:05