kanji codeset differences between adoc and pdf #168
This is interesting. It is possible that the PDF generation process stacks CJK fonts in an order that prioritizes non-Japanese fonts. @Intelligent2013 any idea?
Hello @Intelligent2013, thanks for checking this. They are very easy to see in Notepad or Notepad++; I wonder if they and other characters can be identified in MS Visual Studio Code? The select-in-the-browser method is quite serious if it thinks they are the same, unless the browser is trying to harmonize them via a certain font? If you have found 用 then I would think there are others too. They are slightly smaller than the correct characters. I wonder if they are part of some "character grouping"? Thank you for the update.
I guess they would not be visible in MS Visual Studio Code, since they appear to be correct in the .adoc files, and this may be happening elsewhere in the tool chain. Or perhaps there are differences in the .adoc files that I am not seeing. We could find them if there were a tool that picked up the "unique characters" used in the .adoc files. Probably a lot to go through, but visually checking them with a large font would not be so difficult. If you think such an idea is helpful, I could check them.
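The "unique characters" idea above can be sketched in a few lines of Python. This is a hypothetical helper, not part of the Metanorma toolchain: it lists every distinct non-ASCII character found in a tree of .adoc files with its codepoint and Unicode name, so that look-alikes such as U+76EE and U+2F6C stand out.

```python
# Hypothetical helper: inventory the distinct non-ASCII characters used
# across a directory of .adoc files, so look-alike codepoints stand out.
import pathlib
import unicodedata


def unique_chars(root: str) -> dict[str, str]:
    """Map each non-ASCII character to its codepoint and Unicode name."""
    chars: set[str] = set()
    for path in pathlib.Path(root).rglob("*.adoc"):
        text = path.read_text(encoding="utf-8")
        chars.update(c for c in text if ord(c) > 0x7F)
    return {c: f"U+{ord(c):04X} {unicodedata.name(c, '?')}" for c in sorted(chars)}
```

Running this over the `sources` directory and scanning the output for entries in the Kangxi Radicals block (U+2F00 to U+2FDF) would locate any stray radicals directly, without visual inspection.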
ADOC uses 用 = U+7528 and 目 = U+76EE. PDF uses ⽤ = U+2F64 and ⽬ = U+2F6C. See https://en.wikipedia.org/wiki/Kangxi_Radicals_(Unicode_block). The PDF characters are radicals: they are intended to represent the building blocks of CJK characters, such as you would find as headings in a Japanese dictionary. They are NOT meant to be used in running text: the ADOC characters are what you would use.
Which means XSL::FO should not be mapping U+7528 to U+2F64, or U+76EE to U+2F6C. Those characters are intended for dictionary headings and discussions of Chinese writing, not for running text. I don't know how you prevent that transformation, but that's what's going on.
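One detail worth noting: the Kangxi radicals carry compatibility decompositions to their CJK unified ideograph counterparts, so Unicode NFKC normalization folds them back together. A quick Python check (illustrative only) confirms the pairing described above:

```python
import unicodedata

# Kangxi radicals (U+2F00..U+2FDF) have <compat> decompositions to the
# corresponding CJK unified ideographs, so NFKC maps them back.
radical_use, ideograph_use = "\u2F64", "\u7528"  # ⽤ vs 用
radical_eye, ideograph_eye = "\u2F6C", "\u76EE"  # ⽬ vs 目

assert unicodedata.normalize("NFKC", radical_use) == ideograph_use
assert unicodedata.normalize("NFKC", radical_eye) == ideograph_eye
```

This also means NFKC normalization could be used as a post-hoc check (or repair) for text that has picked up stray radicals, though fixing the transformation at the source is the right solution.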
Very interesting information @opoudjis, thank you for the feedback. It's been a long time since I studied kanji radicals!
Thanks @opoudjis for the investigation. Indeed, FOP should not be making this conversion because it is wrong.
This character conversion can be turned off by disabling complex scripts (https://xmlgraphics.apache.org/fop/2.9/complexscripts.html#Disabling-complex-scripts) in FOP. |
Plateau xslt updated for metanorma/mn-samples-plateau#168
No differences, except the correct characters in the PDF now. The issue is fixed in the Plateau XSLT.
JIS xslt updated for metanorma/mn-samples-plateau#168
Needs the same fix.
I've found the patch https://issues.apache.org/jira/browse/FOP-2529. |
patch from FOP-2529 applied, metanorma/mn-samples-plateau#168
Hi @ronaldtse, I noticed this discrepancy in doc2 between the .adoc text and the output PDF.
It is easily visible in Notepad.
The .adoc file (sources\002-v4\sections\section-05\section-03.adoc) has this string:
作成すべきメタデータ項目
However, the generated PDF file has this string:
作成すべきメタデータ項⽬
I am not sure why this is happening, nor whether it is only happening with this character or possibly others.
It is not easily visible in the PDF. I was trying to search for the table in the PDF using the characters copied from the .adoc file, but the search was not picking anything up. So I pasted both the .adoc and PDF strings into WordPad and noticed the difference.
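The failed PDF search can be confirmed programmatically: the two strings look identical but differ in a single codepoint, which is why an exact-match search misses. A minimal Python comparison (using the two strings from this issue):

```python
# The two strings from this issue: visually identical, but the last
# character differs (CJK unified ideograph vs Kangxi radical).
adoc_text = "作成すべきメタデータ項目"   # ends in 目 U+76EE
pdf_text  = "作成すべきメタデータ項⽬"   # ends in ⽬ U+2F6C

for a, b in zip(adoc_text, pdf_text):
    if a != b:
        print(f"U+{ord(a):04X} vs U+{ord(b):04X}")  # prints: U+76EE vs U+2F6C
```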
I am not sure how to handle this. Could it be a difference between CJK characters, like using C instead of J?
Please advise: if you know a better character/codeset search tool, please tell me. These are from https://symbl.cc/