
kanji codeset differences between adoc and pdf #168

Closed
ReesePlews opened this issue Aug 27, 2024 · 13 comments
Labels
question (Further information is requested), doc level code revision (revise code issues in the document)

Comments

@ReesePlews
Contributor

ReesePlews commented Aug 27, 2024

hi @ronaldtse, i noticed this discrepancy in doc2 between the .adoc text and the output pdf.

it is easily visible in notepad.

the .adoc file (sources\002-v4\sections\section-05\section-03.adoc) has this string:

作成すべきメタデータ項目

however the generated pdf file has this string:

作成すべきメタデータ項⽬

i am not sure why this is happening, or whether it affects only this character or possibly others.

it is not easily visible in the pdf. i was trying to search for the table in the pdf using the characters copied from the .adoc file, but the search was not finding anything. so i pasted both the .adoc and pdf strings into wordpad and noticed the difference.

not sure how to handle this. could it be a difference between CJK variants, like a Chinese form being used instead of the Japanese one?

please advise: if you know a better tool for checking character code points, please tell me. these are from https://symbl.cc/
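one way to pin down the exact code points without an external site is to dump each character's Unicode value and name (a minimal sketch in Python; the two strings are the two variants from this issue):

```python
import unicodedata

# the two strings from this issue: one copied from the .adoc source,
# one copied out of the generated PDF
adoc = "\u9805\u76ee"  # 項目
pdf = "\u9805\u2f6c"   # 項⽬ (the second character differs)

for label, s in (("adoc", adoc), ("pdf", pdf)):
    for ch in s:
        print(f"{label}: U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}")
```

running this shows the adoc string ends in CJK UNIFIED IDEOGRAPH-76EE while the pdf string ends in KANGXI RADICAL EYE, which makes the mismatch unambiguous.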

ReesePlews added the question and doc level code revision labels on Aug 27, 2024
@ronaldtse
Contributor

This is interesting. It is possible that the PDF generation process stacks CJK fonts in an order that prioritizes non-Japanese fonts.

@Intelligent2013 any idea?

@Intelligent2013
Contributor

Hmm, very strange behavior.

  • 作成すべきメタデータ項目 - adoc (found in [[table-5-1]] caption)
  • 作成すべきメタデータ項目 - XSL-FO (as in adoc)
  • 作成すべきメタデータ項⽬ - PDF (different than in adoc)

It looks like a conversion happens internally in Apache FOP between XSL-FO and PDF... I'll investigate it.

If we select the character (from the adoc) and search for it on this GitHub page, the browser (I use Firefox) finds both characters, but when both strings are copy-pasted into Notepad++ (a Windows program) they are clearly different.

Another problematic character:
  • (adoc screenshot)
  • (PDF screenshot)

@ReesePlews
Contributor Author

hello @Intelligent2013, thanks for checking this. they are very easy to see in Notepad or Notepad++; i wonder whether they and other such characters could be identified in Visual Studio Code? the select-in-the-browser method is quite concerning if the browser treats them as the same, unless it is harmonizing them via a certain font?

if you have found 用 then i would think there are others too. they are slightly smaller than the correct characters. i wonder if they are part of some "character grouping"?

thank you for the update

@ReesePlews
Contributor Author

i guess they would not be visible in Visual Studio Code, since they appear to be correct in the .adoc files, and this may be happening elsewhere in the tool chain... or perhaps there are differences in the .adoc files that i am not seeing. we could find them if there were a tool that listed the "unique characters" used in the .adoc files. probably a lot to go through, but visually checking them with a large font would not be so difficult. if you think such an idea is helpful, i could check them.
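such a "unique characters" report could be generated with a short script along these lines (a sketch; the glob pattern and directory layout are assumptions, adjust to the repo):

```python
import glob
import unicodedata

# collect every non-ASCII character used across the .adoc sources
# (the glob pattern is an assumption -- adjust to the repo layout)
chars = set()
for path in glob.glob("sources/**/*.adoc", recursive=True):
    with open(path, encoding="utf-8") as f:
        chars.update(ch for ch in f.read() if ord(ch) > 0x7F)

for ch in sorted(chars):
    # flag anything from the Kangxi Radicals block (U+2F00..U+2FDF),
    # which should never appear in running text
    marker = "  <-- Kangxi radical" if 0x2F00 <= ord(ch) <= 0x2FDF else ""
    print(f"U+{ord(ch):04X}  {ch}  {unicodedata.name(ch, '<unnamed>')}{marker}")
```

this prints one line per distinct non-ASCII character, so any radical code points hiding in the sources would stand out immediately.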

@opoudjis
Contributor

opoudjis commented Aug 28, 2024

ADOC uses 用 = U+7528 and 目 = U+76EE

PDF uses ⽤ = U+2F64 and ⽬ = U+2F6C

https://en.wikipedia.org/wiki/Kangxi_Radicals_(Unicode_block)

The PDF characters are radicals: they are intended to represent the building blocks of CJK characters, such as you would find in a Japanese dictionary as a heading. They are NOT meant to be used in running text: the ADOC characters are what you would use.

These are specific code points intended to represent the radical qua radical, as opposed to the character consisting of the unaugmented radical; thus, U+2F00 represents radical 1 while U+4E00 represents the character yī meaning "one".

Which means the XSL-FO processing should not be mapping U+7528 to U+2F64, or U+76EE to U+2F6C. Those characters are intended for dictionary headings and discussions of Chinese writing, not for running text.

I don't know how you prevent that transformation, but that's what's going on.
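Incidentally, the Kangxi radical code points carry compatibility decompositions back to the corresponding unified ideographs, so NFKC normalization maps the PDF variants back to the adoc characters (a quick Python check):

```python
import unicodedata

# U+2F64 KANGXI RADICAL USE and U+2F6C KANGXI RADICAL EYE have
# compatibility decompositions to the ordinary CJK ideographs
assert unicodedata.normalize("NFKC", "\u2f64") == "\u7528"  # ⽤ -> 用
assert unicodedata.normalize("NFKC", "\u2f6c") == "\u76ee"  # ⽬ -> 目
```

this also means an NFKC pass over text extracted from the PDF can detect (or repair) such substitutions after the fact.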

@ReesePlews
Contributor Author

very interesting information @opoudjis, thank you for the feedback. it's been a long time since i studied kanji radicals!

@ronaldtse
Contributor

Thanks @opoudjis for the investigation. Indeed, FOP should not be making this conversion, because it is wrong.

@Intelligent2013
Contributor

This character conversion can be turned off by disabling complex scripts (https://xmlgraphics.apache.org/fop/2.9/complexscripts.html#Disabling-complex-scripts) in FOP, but I have to check for possible side effects.
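Per the linked FOP page, the feature can be disabled either with the -nocs command-line option or via the configuration file. A sketch of the fop.xconf entry (verify against the FOP version in use):

```xml
<!-- fop.xconf: disable complex script features globally -->
<fop version="1.0">
  <complex-scripts disabled="true"/>
  <!-- ... other configuration ... -->
</fop>
```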

@Intelligent2013
Contributor

The font controls the replacements. For instance, the glyph for U+76EE carries the Unicode value U+2F6C in its glyph info (see screenshot), and Apache FOP substitutes the glyph U+2F6C for U+76EE.

Note: the glyph U+2F6C has the same glyph info values (see screenshot).

I'll try to generate IF XML (Apache FOP Intermediate Format) with the complex scripts feature on (the default) and turned off, and then compare the two IF XMLs.

Intelligent2013 added a commit to metanorma/mn-native-pdf that referenced this issue Aug 28, 2024
Intelligent2013 added a commit to metanorma/mn-native-pdf that referenced this issue Aug 28, 2024
Intelligent2013 added a commit to metanorma/mn2pdf that referenced this issue Aug 28, 2024
@Intelligent2013
Contributor

> I'll try to generate IF XML (Apache FOP Intermediate Format) with the complex scripts feature on (the default) and turned off, and then compare the two IF XMLs.

No differences, except that the PDF now contains the correct characters.
I've turned off the 'Complex scripts' feature for the Plateau XSLT.

The issue is fixed in the Plateau XSLT and mn2pdf v1.98 (https://github.com/metanorma/mn2pdf/releases/tag/v1.98).

@Intelligent2013
Contributor

This needs a further fix: turning off the whole complex scripts feature prevents vertical text from rendering (metanorma/mn-native-pdf#713 (comment)). Only the character replacement needs to be turned off.

@Intelligent2013
Contributor

I've found the patch https://issues.apache.org/jira/browse/FOP-2529.

Intelligent2013 added a commit to metanorma/mn2pdf that referenced this issue Sep 9, 2024
Intelligent2013 added a commit to metanorma/mn2pdf that referenced this issue Sep 9, 2024