-
Notifications
You must be signed in to change notification settings - Fork 22
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Whitehall Govspeak upgrade #159
Conversation
00a853c
to
0e8aabd
Compare
Thanks for pushing this up Erin and great job with all the digging in. I've had a bit of a play around myself and came to some conclusions. For dealing with tbody we should change: govspeak/lib/govspeak/html_validator.rb Line 23 in 0e8aabd
Nokogiri::HTML5.fragment(html).to_s . The HTML5 rather than HTML will require nokogumbo, but that's ok since it's already required by sanitize. The parse => fragment should have no effect, just that we should have done that in the first place as parse spits out html, head and body elements that are superfluous. I believe this also resolves a lot of the odd quotes issues.
With the special characters I feel like we should just do a data migration find and replace on the govspeak_content body field to change them to a non-special counterpart. I guess that is \ufffc => \n and the rest to a space character? Once I cut all of those out on my local machine I just have 7 that are still invalid:
which can probably just be fixed manually. |
0e8aabd
to
7c81117
Compare
Test cases for content in Whitehall which currently fails Govspeak 6+ validation.
This will use the Nokogump and the Gumbo HTML5 parser for HTML normalisation, but as this is already required by Sanitize this should be ok. Changing `parse => fragment` should have no effect and will stip out superfluous elements such as html, head and body.
7c81117
to
75e034a
Compare
Most of GOV.UK is on 2.6.3 so this saves an extra Ruby install.
This is used in the validation process for comparing HTML, previously this was an implicit dependency but I think it's better to call it out explicitly.
There was previously a variety of different tests used to check properties of tables that may have called govspeak problems in Whitehall. Of these I think the only one of these we should keep long term is the check that presence, or lack of, `<tbody>` elements. I did consider keeping the scenario of quotes - but as that just looks like valid HTML I don't think it's necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me - hope the data migration is more enjoyable than it sounds ✊
test/html_validator_test.rb
Outdated
assert Govspeak::HtmlValidator.new("<table><tbody><tr><td>\"Hello</td><td>World\"</td></tr></tbody></table>").valid? | ||
end | ||
|
||
test "allow a govspeak table" do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess this one also comes under 'valid HTML'.
This is to resolve a blocking issue that prevents Whitehall from upgrading to Govspeak > 5.5. A problem we have is that a number of pieces of Whitehall content have unusual unicode characters that the Santize gem strips out, this then makes the content fail HTML Validation and was introduced as part of a Santize update in #127. The approach I've taken to resolve this is to remove forbidden unicode characters from the source markdown that is converted to HTML. Since these characters are invalid in a HTML output it seems reasonable to consider them as invalid as part of govspeak input. Looking through Whitehalls database I found there to be 765 Edition::Translation models and 248 GovspeakContent models that have at last one of these characters in their body fields. Initially I intended to resolve this with a data migration on Whitehall, however that would have caused a problem if anyone used these characters again as it would be very difficult to determine why content was failing govspeak validation. I figure it's actually more user friendly to strip the character out and then if they notice a space or similar is missing they can correct that with their keyboard. I noticed that in 2019 there had been 74 of these items added so this doesn't appear to be a legacy issue. My current theory is that perhaps these are characters that are present in content that is being pasted from such as a PDF document. Anyway, with this merged in Whitehall can proceed to be upgraded to Govspeak 6 as with this and #159 all GovspeakContent models are valid govspeak and only 14 Edition::Translation bodies are invalid, all of which are also invalid under Govspeak 5.5.
This is to resolve a blocking issue that prevents Whitehall from upgrading to Govspeak > 5.5. A problem we have is that a number of pieces of Whitehall content have unusual unicode characters that the Santize gem strips out, this then makes the content fail HTML Validation and was introduced as part of a Santize update in #127. The approach I've taken to resolve this is to remove forbidden unicode characters from the source markdown that is converted to HTML. Since these characters are invalid in a HTML output it seems reasonable to consider them as invalid as part of govspeak input. Looking through Whitehalls database I found there to be 765 Edition::Translation models and 248 GovspeakContent models that have at last one of these characters in their body fields. Initially I intended to resolve this with a data migration on Whitehall, however that would have caused a problem if anyone used these characters again as it would be very difficult to determine why content was failing govspeak validation. I figure it's actually more user friendly to strip the character out and then if they notice a space or similar is missing they can correct that with their keyboard. I noticed that in 2019 there had been 74 of these items added so this doesn't appear to be a legacy issue. My current theory is that perhaps these are characters that are present in content that is being pasted from such as a PDF document. Anyway, with this merged in Whitehall can proceed to be upgraded to Govspeak 6 as with this and #159 all GovspeakContent models are valid govspeak and only 14 Edition::Translation bodies are invalid, all of which are also invalid under Govspeak 5.5.
A variety of fixes have been done to hopefully allow for Whitehall to upgrade to this without problems 🤞 see alphagov/govspeak#159 and alphagov/govspeak#164 for details.
A variety of fixes have been done to hopefully allow for Whitehall to upgrade to this without problems 🤞 see alphagov/govspeak#159 and alphagov/govspeak#164 for details.
Whitehall has a number of documents created with previous versions of Govspeak which now fail validations. This is preventing Whitehall from using the latest version of Govspeak as publishers will no longer be able to updated published document.
I've captured some of the reasons why content is failing in 607aab0 and later commits are to make the tests pass.
I'm not sure that this is the correct approach, which is why this PR is a draft is to start a discussion on the best way of fixing this problem.