Whitehall Govspeak upgrade #159

erino · 2019-06-26T14:52:37Z

Whitehall has a number of documents created with previous versions of Govspeak which now fail validations. This is preventing Whitehall from using the latest version of Govspeak as publishers will no longer be able to updated published document.

I've captured some of the reasons why content is failing in 607aab0 and later commits are to make the tests pass.

I'm not sure that this is the correct approach, which is why this PR is a draft is to start a discussion on the best way of fixing this problem.

kevindew · 2019-07-01T19:08:18Z

Thanks for pushing this up Erin and great job with all the digging in.

I've had a bit of a play around myself and came to some conclusions.

For dealing with tbody we should change:

govspeak/lib/govspeak/html_validator.rb

Line 23 in 0e8aabd

Nokogiri::HTML.parse(html).to_s

to be Nokogiri::HTML5.fragment(html).to_s. The HTML5 rather than HTML will require nokogumbo, but that's ok since it's already required by sanitize. The parse => fragment should have no effect, just that we should have done that in the first place as parse spits out html, head and body elements that are superfluous. I believe this also resolves a lot of the odd quotes issues.

With the special characters I feel like we should just do a data migration find and replace on the govspeak_content body field to change them to a non-special counterpart. I guess that is \ufffc => \n and the rest to a space character?

Once I cut all of those out on my local machine I just have 7 that are still invalid:

[10] pry(main)> still_bad.map(&:id)
=> [1514055, 1514255, 3113168, 3124581, 3127439, 3128096, 3238660]

which can probably just be fixed manually.

Test cases for content in Whitehall which currently fails Govspeak 6+ validation.

This will use the Nokogump and the Gumbo HTML5 parser for HTML normalisation, but as this is already required by Sanitize this should be ok. Changing `parse => fragment` should have no effect and will stip out superfluous elements such as html, head and body.

Most of GOV.UK is on 2.6.3 so this saves an extra Ruby install.

This is used in the validation process for comparing HTML, previously this was an implicit dependency but I think it's better to call it out explicitly.

There was previously a variety of different tests used to check properties of tables that may have called govspeak problems in Whitehall. Of these I think the only one of these we should keep long term is the check that presence, or lack of, `<tbody>` elements. I did consider keeping the scenario of quotes - but as that just looks like valid HTML I don't think it's necessary.

benthorner

Looks good to me - hope the data migration is more enjoyable than it sounds ✊

benthorner · 2019-07-17T12:00:48Z

test/html_validator_test.rb

-    assert Govspeak::HtmlValidator.new("<table><tbody><tr><td>\"Hello</td><td>World\"</td></tr></tbody></table>").valid?
-  end
-
-  test "allow a govspeak table" do


I guess this one also comes under 'valid HTML'.

This is to resolve a blocking issue that prevents Whitehall from upgrading to Govspeak > 5.5. A problem we have is that a number of pieces of Whitehall content have unusual unicode characters that the Santize gem strips out, this then makes the content fail HTML Validation and was introduced as part of a Santize update in #127. The approach I've taken to resolve this is to remove forbidden unicode characters from the source markdown that is converted to HTML. Since these characters are invalid in a HTML output it seems reasonable to consider them as invalid as part of govspeak input. Looking through Whitehalls database I found there to be 765 Edition::Translation models and 248 GovspeakContent models that have at last one of these characters in their body fields. Initially I intended to resolve this with a data migration on Whitehall, however that would have caused a problem if anyone used these characters again as it would be very difficult to determine why content was failing govspeak validation. I figure it's actually more user friendly to strip the character out and then if they notice a space or similar is missing they can correct that with their keyboard. I noticed that in 2019 there had been 74 of these items added so this doesn't appear to be a legacy issue. My current theory is that perhaps these are characters that are present in content that is being pasted from such as a PDF document. Anyway, with this merged in Whitehall can proceed to be upgraded to Govspeak 6 as with this and #159 all GovspeakContent models are valid govspeak and only 14 Edition::Translation bodies are invalid, all of which are also invalid under Govspeak 5.5.

I missed this on: #159

* Unicode characters forbidden in HTML are stripped from input: #164 * Validation is now more lenient for HTML input: #159

A variety of fixes have been done to hopefully allow for Whitehall to upgrade to this without problems 🤞 see alphagov/govspeak#159 and alphagov/govspeak#164 for details.

erino requested review from benthorner, kevindew and kr8n3r June 26, 2019 14:52

erino force-pushed the whitehall-govspeak-upgrade branch from 00a853c to 0e8aabd Compare June 26, 2019 14:58

erino force-pushed the whitehall-govspeak-upgrade branch from 0e8aabd to 7c81117 Compare July 11, 2019 08:41

erino marked this pull request as ready for review July 11, 2019 08:41

erino added 2 commits July 11, 2019 10:41

Tests for failing Whitehall Govspeak

7256fa9

Test cases for content in Whitehall which currently fails Govspeak 6+ validation.

erino force-pushed the whitehall-govspeak-upgrade branch from 7c81117 to 75e034a Compare July 11, 2019 09:41

kevindew added 3 commits July 16, 2019 21:08

Bump ruby version

97f51d5

Most of GOV.UK is on 2.6.3 so this saves an extra Ruby install.

Add nokogumbo as a dependency

80276fc

This is used in the validation process for comparing HTML, previously this was an implicit dependency but I think it's better to call it out explicitly.

benthorner approved these changes Jul 17, 2019

View reviewed changes

kevindew merged commit 3f56ec5 into master Jul 17, 2019

kevindew deleted the whitehall-govspeak-upgrade branch July 17, 2019 15:01

kevindew mentioned this pull request Jul 17, 2019

Remove forbidden unicode characters from output #164

Merged

kevindew added a commit that referenced this pull request Jul 17, 2019

Update changelog

199e827

I missed this on: #159

kevindew added a commit that referenced this pull request Jul 17, 2019

Bump to 6.3.0

58c2e3f

* Unicode characters forbidden in HTML are stripped from input: #164 * Validation is now more lenient for HTML input: #159

kevindew mentioned this pull request Jul 17, 2019

Bump to 6.3.0 #165

Merged

kevindew mentioned this pull request Jul 19, 2019

Bump to Govspeak 6.3 alphagov/whitehall#4913

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Whitehall Govspeak upgrade #159

Whitehall Govspeak upgrade #159

erino commented Jun 26, 2019

kevindew commented Jul 1, 2019

benthorner left a comment

benthorner Jul 17, 2019

Whitehall Govspeak upgrade #159

Whitehall Govspeak upgrade #159

Conversation

erino commented Jun 26, 2019

kevindew commented Jul 1, 2019

benthorner left a comment

Choose a reason for hiding this comment

benthorner Jul 17, 2019

Choose a reason for hiding this comment