
kanji codeset differences between adoc and pdf #168

Closed
ReesePlews opened this issue Aug 27, 2024 · 13 comments
Labels
question (Further information is requested), doc level code revision (revise code issues in the document)

Comments

@ReesePlews
Contributor

ReesePlews commented Aug 27, 2024

hi @ronaldtse, i noticed this discrepancy in doc2 between the .adoc text and the output pdf.

it is easily visible in notepad.

the .adoc file (sources\002-v4\sections\section-05\section-03.adoc) has this string:

作成すべきメタデータ項目

however the generated pdf file has this string:

作成すべきメタデータ項⽬

i am not sure why this is happening, or whether it affects only this character or possibly others.

it is not easily visible in the pdf. i was trying to search for the table in the pdf using the characters copied from the .adoc file, but the search was not finding anything. so i pasted both the .adoc and pdf strings into wordpad and noticed the difference.

not sure how to handle this. could it be a difference between CJK variants, like a Chinese form being used instead of the Japanese one?

please advise: if you know a better tool for checking character code points, please tell me. these are from https://symbl.cc/
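one way to pin down the exact code points without an external site is to dump each character's Unicode value and name (a minimal sketch in Python; the two strings are the two variants from this issue):

```python
import unicodedata

# the two strings from this issue: one copied from the .adoc source,
# one copied out of the generated PDF
adoc = "\u9805\u76ee"  # 項目
pdf = "\u9805\u2f6c"   # 項⽬ (the second character differs)

for label, s in (("adoc", adoc), ("pdf", pdf)):
    for ch in s:
        print(f"{label}: U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}")
```

running this shows the adoc string ends in CJK UNIFIED IDEOGRAPH-76EE while the pdf string ends in KANGXI RADICAL EYE, which makes the mismatch unambiguous.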

ReesePlews added the question and doc level code revision labels on Aug 27, 2024
@ronaldtse
Contributor

This is interesting. It is possible that the PDF generation process stacks CJK fonts in an order that prioritizes non-Japanese fonts.

@Intelligent2013 any idea?

@Intelligent2013
Contributor

Hmm, very strange behavior.

  • 作成すべきメタデータ項目 - adoc (found in [[table-5-1]] caption)
  • 作成すべきメタデータ項目 - XSL-FO (as in adoc)
  • 作成すべきメタデータ項⽬ - PDF (different than in adoc)

It looks like a conversion happens internally in Apache FOP between XSL-FO and PDF... I'll investigate it.

If we select the character (from the adoc) and search for it on this GitHub page, the browser (I use Firefox) finds both characters, but when both strings are copy-pasted into Notepad++ (a Windows program) they are clearly different.

Another problematic character:
  • (adoc screenshot)
  • (PDF screenshot)

@ReesePlews
Contributor Author

hello @Intelligent2013, thanks for checking this. they are very easy to see in Notepad or Notepad++; i wonder whether they and other such characters could be identified in Visual Studio Code? the select-in-the-browser method is quite concerning if the browser treats them as the same, unless it is harmonizing them via a certain font?

if you have found 用 then i would think there are others too. they are slightly smaller than the correct characters. i wonder if they are part of some "character grouping"?

thank you for the update

@ReesePlews
Contributor Author

i guess they would not be visible in Visual Studio Code, since they appear to be correct in the .adoc files, and this may be happening elsewhere in the tool chain... or perhaps there are differences in the .adoc files that i am not seeing. we could find them if there were a tool that listed the "unique characters" used in the .adoc files. probably a lot to go through, but visually checking them with a large font would not be so difficult. if you think such an idea is helpful, i could check them.
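such a "unique characters" report could be generated with a short script along these lines (a sketch; the glob pattern and directory layout are assumptions, adjust to the repo):

```python
import glob
import unicodedata

# collect every non-ASCII character used across the .adoc sources
# (the glob pattern is an assumption -- adjust to the repo layout)
chars = set()
for path in glob.glob("sources/**/*.adoc", recursive=True):
    with open(path, encoding="utf-8") as f:
        chars.update(ch for ch in f.read() if ord(ch) > 0x7F)

for ch in sorted(chars):
    # flag anything from the Kangxi Radicals block (U+2F00..U+2FDF),
    # which should never appear in running text
    marker = "  <-- Kangxi radical" if 0x2F00 <= ord(ch) <= 0x2FDF else ""
    print(f"U+{ord(ch):04X}  {ch}  {unicodedata.name(ch, '<unnamed>')}{marker}")
```

this prints one line per distinct non-ASCII character, so any radical code points hiding in the sources would stand out immediately.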

@opoudjis
Contributor

opoudjis commented Aug 28, 2024

ADOC uses 用 = U+7528 and 目 = U+76EE

PDF uses ⽤ = U+2F64 and ⽬ = U+2F6C

https://en.wikipedia.org/wiki/Kangxi_Radicals_(Unicode_block)

The PDF characters are radicals: they are intended to represent the building blocks of CJK characters, such as you would find in a Japanese dictionary as a heading. They are NOT meant to be used in running text: the ADOC characters are what you would use.

These are specific code points intended to represent the radical qua radical, as opposed to the character consisting of the unaugmented radical; thus, U+2F00 represents radical 1 while U+4E00 represents the character yī meaning "one".

Which means the XSL-FO processing should not be mapping U+7528 to U+2F64, or U+76EE to U+2F6C. Those characters are intended for dictionary headings and discussions of Chinese writing, not for running text.

I don't know how you prevent that transformation, but that's what's going on.
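Incidentally, the Kangxi radical code points carry compatibility decompositions back to the corresponding unified ideographs, so NFKC normalization maps the PDF variants back to the adoc characters (a quick Python check):

```python
import unicodedata

# U+2F64 KANGXI RADICAL USE and U+2F6C KANGXI RADICAL EYE have
# compatibility decompositions to the ordinary CJK ideographs
assert unicodedata.normalize("NFKC", "\u2f64") == "\u7528"  # ⽤ -> 用
assert unicodedata.normalize("NFKC", "\u2f6c") == "\u76ee"  # ⽬ -> 目
```

this also means an NFKC pass over text extracted from the PDF can detect (or repair) such substitutions after the fact.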

@ReesePlews
Contributor Author

very interesting information @opoudjis, thank you for the feedback. it's been a long time since i studied kanji radicals!

@ronaldtse
Contributor

Thanks @opoudjis for the investigation. Indeed, FOP should not be making this conversion, because it is wrong.

@Intelligent2013
Contributor

This character conversion can be turned off by disabling complex scripts (https://xmlgraphics.apache.org/fop/2.9/complexscripts.html#Disabling-complex-scripts) in FOP, but I have to check for possible side effects.
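Per the linked FOP page, the feature can be disabled either with the -nocs command-line option or via the configuration file. A sketch of the fop.xconf entry (verify against the FOP version in use):

```xml
<!-- fop.xconf: disable complex script features globally -->
<fop version="1.0">
  <complex-scripts disabled="true"/>
  <!-- ... other configuration ... -->
</fop>
```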

@Intelligent2013
Contributor

The font controls the replacements. For instance, the glyph for U+76EE carries the Unicode value U+2F6C in its glyph info (see screenshot), and Apache FOP substitutes the glyph U+2F6C for U+76EE.

Note: the glyph U+2F6C has the same glyph info values (see screenshot).

I'll try to generate IF XML (Apache FOP Intermediate Format) with the complex scripts feature on (the default) and turned off, and then compare the two IF XMLs.

Intelligent2013 added a commit to metanorma/mn-native-pdf that referenced this issue Aug 28, 2024
Intelligent2013 added a commit to metanorma/mn-native-pdf that referenced this issue Aug 28, 2024
Intelligent2013 added a commit to metanorma/mn2pdf that referenced this issue Aug 28, 2024
@Intelligent2013
Contributor

> I'll try to generate IF XML (Apache FOP Intermediate Format) with the complex scripts feature on (the default) and turned off, and then compare the two IF XMLs.

No differences, except that the PDF now contains the correct characters.
I've turned off the 'Complex scripts' feature for the Plateau XSLT.

The issue is fixed in the Plateau XSLT and mn2pdf v1.98 (https://github.com/metanorma/mn2pdf/releases/tag/v1.98).

@Intelligent2013
Contributor

This needs a further fix: turning off the whole complex scripts feature prevents vertical text from rendering (metanorma/mn-native-pdf#713 (comment)). Only the character replacement needs to be turned off.

@Intelligent2013
Contributor

I've found the patch https://issues.apache.org/jira/browse/FOP-2529.

Intelligent2013 added a commit to metanorma/mn2pdf that referenced this issue Sep 9, 2024
Intelligent2013 added a commit to metanorma/mn2pdf that referenced this issue Sep 9, 2024