Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Whitehall Govspeak upgrade #159

Merged
merged 5 commits into from
Jul 17, 2019
Merged

Whitehall Govspeak upgrade #159

merged 5 commits into from
Jul 17, 2019

Conversation

erino
Copy link
Member

@erino erino commented Jun 26, 2019

Whitehall has a number of documents created with previous versions of Govspeak which now fail validations. This is preventing Whitehall from using the latest version of Govspeak as publishers will no longer be able to updated published document.

I've captured some of the reasons why content is failing in 607aab0 and later commits are to make the tests pass.

I'm not sure that this is the correct approach, which is why this PR is a draft is to start a discussion on the best way of fixing this problem.

@erino erino requested review from benthorner, kevindew and kr8n3r June 26, 2019 14:52
@erino erino force-pushed the whitehall-govspeak-upgrade branch from 00a853c to 0e8aabd Compare June 26, 2019 14:58
@kevindew
Copy link
Member

kevindew commented Jul 1, 2019

Thanks for pushing this up Erin and great job with all the digging in.

I've had a bit of a play around myself and came to some conclusions.

For dealing with tbody we should change:

Nokogiri::HTML.parse(html).to_s
to be Nokogiri::HTML5.fragment(html).to_s. The HTML5 rather than HTML will require nokogumbo, but that's ok since it's already required by sanitize. The parse => fragment should have no effect, just that we should have done that in the first place as parse spits out html, head and body elements that are superfluous. I believe this also resolves a lot of the odd quotes issues.

With the special characters I feel like we should just do a data migration find and replace on the govspeak_content body field to change them to a non-special counterpart. I guess that is \ufffc => \n and the rest to a space character?

Once I cut all of those out on my local machine I just have 7 that are still invalid:

[10] pry(main)> still_bad.map(&:id)
=> [1514055, 1514255, 3113168, 3124581, 3127439, 3128096, 3238660]

which can probably just be fixed manually.

@erino erino force-pushed the whitehall-govspeak-upgrade branch from 0e8aabd to 7c81117 Compare July 11, 2019 08:41
@erino erino marked this pull request as ready for review July 11, 2019 08:41
erino added 2 commits July 11, 2019 10:41
Test cases for content in Whitehall which currently fails Govspeak
6+ validation.
This will use the Nokogump and the Gumbo HTML5 parser for HTML
normalisation, but as this is already required by Sanitize this
should be ok. Changing `parse => fragment` should have no effect
and will stip out superfluous elements such as html, head and body.
@erino erino force-pushed the whitehall-govspeak-upgrade branch from 7c81117 to 75e034a Compare July 11, 2019 09:41
kevindew added 3 commits July 16, 2019 21:08
Most of GOV.UK is on 2.6.3 so this saves an extra Ruby install.
This is used in the validation process for comparing HTML, previously
this was an implicit dependency but I think it's better to call it out
explicitly.
There was previously a variety of different tests used to check
properties of tables that may have called govspeak problems in
Whitehall. Of these I think the only one of these we should keep long
term is the check that presence, or lack of, `<tbody>` elements.

I did consider keeping the scenario of quotes - but as that just looks
like valid HTML I don't think it's necessary.
Copy link
Contributor

@benthorner benthorner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me - hope the data migration is more enjoyable than it sounds ✊

assert Govspeak::HtmlValidator.new("<table><tbody><tr><td>\"Hello</td><td>World\"</td></tr></tbody></table>").valid?
end

test "allow a govspeak table" do
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this one also comes under 'valid HTML'.

kevindew added a commit that referenced this pull request Jul 17, 2019
This is to resolve a blocking issue that prevents Whitehall from
upgrading to Govspeak > 5.5. A problem we have is that a number of pieces of
Whitehall content have unusual unicode characters that the Santize gem
strips out, this then makes the content fail HTML Validation and was
introduced as part of a Santize update in
#127.

The approach I've taken to resolve this is to remove forbidden unicode
characters from the source markdown that is converted to HTML. Since
these characters are invalid in a HTML output it seems reasonable to
consider them as invalid as part of govspeak input.

Looking through Whitehalls database I found there to be 765
Edition::Translation models and 248 GovspeakContent models that have at
last one of these characters in their body fields.

Initially I intended to resolve this with a data migration on Whitehall,
however that would have caused a problem if anyone used these characters
again as it would be very difficult to determine why content was failing
govspeak validation. I figure it's actually more user friendly to strip
the character out and then if they notice a space or similar is missing
they can correct that with their keyboard.

I noticed that in 2019 there had been 74 of these items added so this
doesn't appear to be a legacy issue. My current theory is that perhaps
these are characters that are present in content that is being pasted
from such as a PDF document.

Anyway, with this merged in Whitehall can proceed to be upgraded to
Govspeak 6 as with this and
#159 all GovspeakContent models
are valid govspeak and only 14 Edition::Translation bodies are invalid,
all of which are also invalid under Govspeak 5.5.
@kevindew kevindew merged commit 3f56ec5 into master Jul 17, 2019
@kevindew kevindew deleted the whitehall-govspeak-upgrade branch July 17, 2019 15:01
kevindew added a commit that referenced this pull request Jul 17, 2019
This is to resolve a blocking issue that prevents Whitehall from
upgrading to Govspeak > 5.5. A problem we have is that a number of pieces of
Whitehall content have unusual unicode characters that the Santize gem
strips out, this then makes the content fail HTML Validation and was
introduced as part of a Santize update in
#127.

The approach I've taken to resolve this is to remove forbidden unicode
characters from the source markdown that is converted to HTML. Since
these characters are invalid in a HTML output it seems reasonable to
consider them as invalid as part of govspeak input.

Looking through Whitehalls database I found there to be 765
Edition::Translation models and 248 GovspeakContent models that have at
last one of these characters in their body fields.

Initially I intended to resolve this with a data migration on Whitehall,
however that would have caused a problem if anyone used these characters
again as it would be very difficult to determine why content was failing
govspeak validation. I figure it's actually more user friendly to strip
the character out and then if they notice a space or similar is missing
they can correct that with their keyboard.

I noticed that in 2019 there had been 74 of these items added so this
doesn't appear to be a legacy issue. My current theory is that perhaps
these are characters that are present in content that is being pasted
from such as a PDF document.

Anyway, with this merged in Whitehall can proceed to be upgraded to
Govspeak 6 as with this and
#159 all GovspeakContent models
are valid govspeak and only 14 Edition::Translation bodies are invalid,
all of which are also invalid under Govspeak 5.5.
kevindew added a commit that referenced this pull request Jul 17, 2019
I missed this on: #159
kevindew added a commit that referenced this pull request Jul 17, 2019
* Unicode characters forbidden in HTML are stripped from input:
#164
* Validation is now more lenient for HTML input:
#159
@kevindew kevindew mentioned this pull request Jul 17, 2019
kevindew added a commit to alphagov/whitehall that referenced this pull request Jul 18, 2019
A variety of fixes have been done to hopefully allow for Whitehall to
upgrade to this without problems 🤞 see
alphagov/govspeak#159 and
alphagov/govspeak#164 for details.
kevindew added a commit to alphagov/whitehall that referenced this pull request Jul 19, 2019
A variety of fixes have been done to hopefully allow for Whitehall to
upgrade to this without problems 🤞 see
alphagov/govspeak#159 and
alphagov/govspeak#164 for details.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants