BIPM: add transparency/white text behind/above math object in Apache FOP Intermediate Format (IF) #90

Closed
Intelligent2013 opened this issue Aug 20, 2021 · 12 comments

@Intelligent2013
Contributor

metanorma/bipm-si-brochure#118

  • generate the IF XML before the last step (PDF rendering)
  • iterate over each fo:instream-foreign-object element that has the attribute @fox:actual-text
    Example:
<fo:instream-foreign-object fox:alt-text="Math" fox:actual-text="&lt;math xmlns=&quot;http://www.w3.org/1998/Math/MathML&quot;&gt;&lt;mn&gt;43 520&lt;/mn&gt;&lt;mo.... &lt;/math&gt;" foi:struct-id="6b" foi:struct-ref="6a"/>
    • get the attributes @foi:struct-id and @fox:actual-text
    • find the image element whose @foi:struct-ref equals fo:instream-foreign-object/@foi:struct-id
      Example:
<image x="257301" y="2321" width="45962" height="10115" foi:struct-ref="6b">
    • calculate the position of the replacement text: x = image/@x, y = image/@y + image/@height
    • calculate the font-size at which the text from fo:instream-foreign-object/@fox:actual-text fits the bounding box given by image/@width and image/@height; take the font name from @family of the first preceding font element (example: <font family="Times New Roman"). How to do this calculation still needs to be investigated
    • find the first preceding font element that has the attribute @color and take the value of @color
      Example:
<font color="#000000"/>
    • if there is no such font element, use the default value #000000 (black)
    • add these elements before image:
      • <font color="#FFFFFF"/> (set the text color to white, because it is not yet known how to create transparent text)
      • <text x="257301" y="12436" word-spacing="1166" foi:struct-ref="6b">...</text>
      • <font color="..."/> (restore the text color to its previous value, for example <font color="#000000"/>)
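
The steps above can be sketched in Python. This is only a minimal sketch on a toy document, not the project's actual implementation: the function name `add_hidden_text`, the `<document>`/`<content>` wrappers and the shortened actual-text value are hypothetical, the namespace URIs are assumptions to be checked against real IF output, real IF elements sit in the IF namespace, and the font-size calculation is left out because it is still open.

```python
import xml.etree.ElementTree as ET

# Namespace URIs are assumptions; check them against your own IF output.
FO = "http://www.w3.org/1999/XSL/Format"
FOX = "http://xmlgraphics.apache.org/fop/extensions"
FOI = "http://xmlgraphics.apache.org/fop/internal"

# Toy stand-in for an IF document (hypothetical wrappers, no IF namespace).
SAMPLE = f"""<document xmlns:fo="{FO}" xmlns:fox="{FOX}" xmlns:foi="{FOI}">
  <fo:instream-foreign-object fox:alt-text="Math" fox:actual-text="43 520"
      foi:struct-id="6b" foi:struct-ref="6a"/>
  <content>
    <font family="Times New Roman" color="#000000"/>
    <image x="257301" y="2321" width="45962" height="10115" foi:struct-ref="6b"/>
  </content>
</document>"""


def add_hidden_text(root):
    """Insert white replacement text before each image that has actual-text."""
    # ElementTree has no parent pointers, so build a child -> parent map first.
    parent_of = {child: parent for parent in root.iter() for child in parent}

    # Map struct-id -> actual-text for every instream-foreign-object.
    actual = {
        el.get(f"{{{FOI}}}struct-id"): el.get(f"{{{FOX}}}actual-text")
        for el in root.iter(f"{{{FO}}}instream-foreign-object")
        if el.get(f"{{{FOX}}}actual-text") is not None
    }

    color = "#000000"  # default when no preceding font element sets a color
    for el in list(root.iter()):  # snapshot in document order
        if el.tag == "font" and el.get("color"):
            color = el.get("color")
        elif el.tag == "image":
            ref = el.get(f"{{{FOI}}}struct-ref")
            if ref not in actual:
                continue
            parent = parent_of[el]
            idx = list(parent).index(el)
            white = ET.Element("font", {"color": "#FFFFFF"})
            text = ET.Element("text", {
                "x": el.get("x"),
                # baseline at the bottom edge of the image: y + height
                "y": str(int(el.get("y")) + int(el.get("height"))),
                f"{{{FOI}}}struct-ref": ref,
            })
            text.text = actual[ref]
            restore = ET.Element("font", {"color": color})
            for offset, new in enumerate((white, text, restore)):
                parent.insert(idx + offset, new)
    return root
```

For the sample image (y=2321, height=10115) the inserted text gets y=12436, which matches the value in the example above.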

Need to investigate how to:

  • set a transparent color for the text
  • calculate the font-size from the string, the font family and the bounding box (width and height)

Intelligent2013 added the enhancement (New feature or request) label Aug 20, 2021
Intelligent2013 self-assigned this Aug 20, 2021
Intelligent2013 added a commit that referenced this issue Aug 22, 2021
@Intelligent2013
Contributor Author

As with the color 'white', for the hidden text we also have to set the font family (the same as for the math object, for example 'STIX Two Math'). After the math object we need to restore the font family to the default for normal text (for example 'Times New Roman').
If we put a white zero-width space <fo:inline color="white">&#x200b;</fo:inline> before the math object:

<xsl:template match="mathml:math">
...
  <fo:inline xsl:use-attribute-sets="mathml-style">
     <xsl:if test="$namespace = 'bipm'">
       <fo:inline color="white">&#x200b;</fo:inline> <!-- zero width space -->
     </xsl:if>
     ...
     <fo:instream-foreign-object fox:alt-text="Math">
     ...

then in the intermediate format XML the image is preceded and followed by font elements with the required properties:

<font family="STIX Two Math" color="#ffffff"/>
<image x="257301" y="2321" width="45962" height="10115" foi:struct-ref="6b">
<math 
...
</image>
<font family="Times New Roman" color="#000000"/>

So there is no need to insert the font elements via XSLT in the IF preprocessing step.

@Intelligent2013
Contributor Author

An issue occurred: FOP generates incorrect intermediate format XML with surrogate pairs.
For example, source XML:

<mtext>&#120243;</mtext>

IF:

<mtext>&#55349;&#56755;</mtext>

image

This issue was fixed in the main processing (XML->XSLT->XSL-FO->PDF) by setting the UTF-16 encoding on the transformer (at the XSLT->XSL-FO stage):

transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-16");

but in the pipeline XML->XSLT->XSL-FO->IF->PDF this has no effect at the 'XSL-FO->IF' stage. This needs to be investigated.
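
To make the symptom concrete: &#120243; is U+1D5B3, which lies outside the Basic Multilingual Plane, and 55349/56755 are exactly its UTF-16 high and low surrogates. The sketch below shows the arithmetic and one possible way to merge such pairs of numeric references in a text dump. It is an illustration only, not the fix that was applied in this project, and the function names are hypothetical.

```python
import re


def to_surrogates(cp):
    """Split a supplementary code point into (high, low) UTF-16 surrogates."""
    if cp < 0x10000:
        raise ValueError("code point is inside the BMP, no surrogates needed")
    v = cp - 0x10000
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)


def from_surrogates(hi, lo):
    """Combine a (high, low) surrogate pair back into one code point."""
    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)


REF = re.compile(r"&#(\d+);")


def merge_surrogate_refs(text):
    """Replace '&#hi;&#lo;' surrogate pairs with a single numeric reference."""
    out, pos = [], 0
    while True:
        m = REF.search(text, pos)
        if not m:
            break
        out.append(text[pos:m.start()])
        first = int(m.group(1))
        nxt = REF.match(text, m.end())  # reference immediately after this one
        if (nxt and 0xD800 <= first <= 0xDBFF
                and 0xDC00 <= int(nxt.group(1)) <= 0xDFFF):
            out.append(f"&#{from_surrogates(first, int(nxt.group(1)))};")
            pos = nxt.end()
        else:
            out.append(m.group(0))
            pos = m.end()
    out.append(text[pos:])
    return "".join(out)
```

Applied to the IF fragment above, `merge_surrogate_refs` turns the two references back into the single reference from the source XML.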

@Intelligent2013
Contributor Author

FOP generates incorrect intermediate format XML with surrogate pairs.

Fixed.

New issues:

  1. hidden text for bold-faced math is shown with strokes:
    image

  2. the copy-text function doesn't work properly: non-ASCII characters are copied as 1-byte characters
    Example 1:
    PDF:
    image

    Text:
    image

    Example 2:
    PDF:
    image

    Text:
    image

  3. a standalone math object doesn't have hidden text:
    image
    The reason: there is an id element between the font and image elements:
<font family="STIX Two Math" color="#ffffff"/>
<id name="_11017bd0-9a6e-41bf-8f1c-7b1d283bc737"/>
<image x="136436" y="8321" width="115474" height="13320" foi:struct-ref="75d">
<math 

To do: add_hidden_math.xsl should be updated for this case.
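
The required change amounts to looking back past id elements when searching for the font element that precedes an image. A minimal Python sketch of that lookup follows. It illustrates the logic only and is not the content of add_hidden_math.xsl; the function name `preceding_font`, the `<content>` wrapper and the `skippable` parameter are hypothetical, and the foi:struct-ref attribute is dropped from the sample to keep it namespace-free.

```python
import xml.etree.ElementTree as ET

# Toy fragment modelled on the IF excerpt above (hypothetical wrapper element).
SAMPLE = """<content>
  <font family="STIX Two Math" color="#ffffff"/>
  <id name="_11017bd0-9a6e-41bf-8f1c-7b1d283bc737"/>
  <image x="136436" y="8321" width="115474" height="13320"/>
</content>"""


def preceding_font(parent, image, skippable=("id",)):
    """Return the nearest font sibling before image, skipping over id elements.

    Returns None if any other element sits between the font and the image.
    """
    siblings = list(parent)
    for el in reversed(siblings[:siblings.index(image)]):
        if el.tag == "font":
            return el
        if el.tag not in skippable:
            return None
    return None
```

With `skippable=()` the function reproduces the original behaviour, where the id element hides the font and no hidden text is added.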

Intelligent2013 added a commit to metanorma/mn-native-pdf that referenced this issue Aug 24, 2021
Intelligent2013 added a commit to metanorma/mn-native-pdf that referenced this issue Aug 24, 2021
@Intelligent2013
Contributor Author

  1. hidden text for bold-faced math is shown with strokes:

Fixed.

  3. a standalone math object doesn't have hidden text:

Fixed.

@Intelligent2013
Contributor Author

  2. the copy-text function doesn't work properly: non-ASCII characters are copied as 1-byte characters

It seems there is an issue with the 'STIX Two Math' font when it is used for text.
For testing purposes I've created a simple XSL-FO example:

<fo:block font-family="Arial">Arial: αβγ</fo:block>
<fo:block font-family="Times New Roman">Times New Roman: αβγ</fo:block>
<fo:block font-family="STIX Two Math">STIX Two Math: αβγ</fo:block>
<fo:block font-family="Source Han Sans">Source Han Sans: αβγ</fo:block>

Rendered PDF (note: the Greek characters are rendered as text, not as a vector image):
image

Then I selected all the text in Acrobat, copied it and pasted it into Notepad++ (which can display non-standard characters):
image

Hex:
image

I.e. the characters are copied as:

  • alpha -> hex 0E
  • beta -> hex 0F
  • gamma -> hex 10

A similar issue is mentioned here: https://superuser.com/questions/92615/cannot-copy-non-latin-characters-from-pdf-document
"When cutting and pasting Unicode text from Acrobat to another application, Acrobat needs enough information to reconstruct the Unicode characters from the letter shapes. If the font used has glyphs named according to the Adobe Glyph Naming Convention then Acrobat can parse these names (which are also stored in the PDF document) and reconstruct the Unicode text. Unfortunately, there are many Unicode fonts, including the standard Windows fonts, which do not follow this convention - so this may not be possible."

Information about embedded fonts:
image

Note: the STIXTwoMath font has Encoding 'Custom', whereas the other fonts have 'Identity-H'.

To do:

  • find another copy of the STIX Two Math font file (maybe another source already has a corrected font)
  • investigate how to set the encoding 'Identity-H'
  • use the Source Han Sans font for the hidden text (we need a large font with many glyphs, because a glyph that is missing from the font is inserted as '#')

@ronaldtse
Contributor

find another copy of the STIX Two Math font file (maybe another source already has a corrected font)

@Intelligent2013 for this issue can you report to https://github.com/stipub/stixfonts linking back to this issue?

@Intelligent2013
Contributor Author

@Intelligent2013 for this issue can you report to https://github.com/stipub/stixfonts linking back to this issue?

@ronaldtse it seems this issue is related to the OTF font. There are OTF (https://github.com/stipub/stixfonts/tree/master/fonts/static_otf) and TTF (https://github.com/stipub/stixfonts/tree/master/fonts/static_ttf) versions of STIX Two Math.
Copying and pasting text works fine with STIX Two Math TTF:

XSL-FO:

<fo:block font-family="Arial">Arial: αβγ</fo:block>
<fo:block font-family="Times New Roman">Times New Roman: αβγ</fo:block>
<fo:block font-family="STIX Two Math OTF">STIX Two Math OTF: αβγ</fo:block>
<fo:block font-family="STIX Two Math TTF">STIX Two Math TTF: αβγ</fo:block>
<fo:block font-family="Source Han Sans">Source Han Sans: αβγ</fo:block>

FOP font config:

<font embed-url="file:/D:/Metanorma/fonts/STIXTwov2.13b171/STIXTwoMath-Regular.otf" kerning="yes">
  <font-triplet name="STIX Two Math OTF" style="normal" weight="normal"/>
</font>
<font embed-url="file:/D:/Metanorma/fonts/STIXTwov2.13b171/STIXTwoMath-Regular.ttf" kerning="yes">
  <font-triplet name="STIX Two Math TTF" style="normal" weight="normal"/>
</font>

PDF:
image

Text:
image

I'm not sure yet that this issue is specific to STIX Two Math OTF. Maybe Apache FOP can't handle OTF fonts properly. I'll check it.

@Intelligent2013
Contributor Author

Some investigation results:

  1. Both the STIX Two Math TTF and OTF fonts contain correct glyph names:
    image

  2. I've tried a few OTF fonts (from https://ctan.org/topic/font-otf and https://tug.org/FontCatalogue/mathfonts.html): XITSMath-Regular.otf, Asana-Math.otf, Garamond-Math.otf and Roboto-Regular.otf. The copied text is incorrect for all of them:
    PDF:
    image

    Text:
    image

  3. The Apache FOP config has the "embedding-mode" attribute (http://xmlgraphics.apache.org/fop/2.5/fonts.html):
    The "embedding-mode" attribute is optional and can take two values: subset (the default) and full. If not specified the font is subset embedded for TTF and OTF or full embedded for Type 1, unless it is explicitly referenced (see below)

    I've set it to 'full' and the copied text is correct:
image

So it looks like Apache FOP has an issue with OTF subset embedding. Therefore we have two alternatives:

  • don't use OTF fonts at all
  • always set embedding-mode="full" for OTF fonts. For the STIX Two Math font, the resulting PDF file size increased by 650 KB.
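
The second alternative could be automated by patching the FOP font configuration before rendering. The sketch below sets embedding-mode="full" on every font element whose embed-url ends in .otf. It is a minimal sketch under stated assumptions: the function name `force_full_embedding_for_otf` and the bare `<fonts>` wrapper are hypothetical, and only the `embed-url` and `embedding-mode` attributes come from the configuration and documentation quoted above.

```python
import xml.etree.ElementTree as ET

# Toy config fragment based on the font entries shown earlier in this thread.
SAMPLE = """<fonts>
  <font embed-url="file:/D:/Metanorma/fonts/STIXTwov2.13b171/STIXTwoMath-Regular.otf" kerning="yes">
    <font-triplet name="STIX Two Math OTF" style="normal" weight="normal"/>
  </font>
  <font embed-url="file:/D:/Metanorma/fonts/STIXTwov2.13b171/STIXTwoMath-Regular.ttf" kerning="yes">
    <font-triplet name="STIX Two Math TTF" style="normal" weight="normal"/>
  </font>
</fonts>"""


def force_full_embedding_for_otf(config_xml):
    """Return the config with embedding-mode="full" on all OTF font entries."""
    root = ET.fromstring(config_xml)
    # iter() visits font elements at any nesting depth in the config.
    for font in root.iter("font"):
        if font.get("embed-url", "").lower().endswith(".otf"):
            font.set("embedding-mode", "full")
    return ET.tostring(root, encoding="unicode")
```

TTF entries are left untouched, so they keep the default subset embedding and only the OTF fonts pay the file-size cost.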

@Intelligent2013
Contributor Author

Here is the first resulting PDF with hidden text (white color; transparency is not supported by FOP):
si-brochure-en.presentation.pdf

For testing purposes, the hidden text is just the sequence of characters from the MathML (not the XML). Later it will be replaced with the ASCII math markup (taken from the commented text inside the MathML).

@ronaldtse
Contributor

@Intelligent2013 this is amazing. It mostly works. I think this is a winner!

@ronaldtse
Contributor

(Time for that blog post!)

Intelligent2013 added a commit that referenced this issue Aug 26, 2021
@Intelligent2013
Contributor Author

All done.
