Colorer.exe not producing correct UTF-8 as output #8

Closed
jonisb opened this issue Jan 8, 2018 · 6 comments · Fixed by #17
@jonisb

jonisb commented Jan 8, 2018

I'm using colorer.exe v1.0.2 for testing, and I discovered that viewing a UTF-8 file works fine, but when the output is written to a file, it is not properly UTF-8 encoded.

After a while I was able to decode the result and figured out what I need to do (using Python):

  • First I need to decode the output from utf-8.
  • Then I need to encode to cp1251.
  • And lastly I need to decode from utf-8 again even though it's encoded to cp1251.

So basically non-ASCII characters are double encoded: their UTF-8 bytes are treated as cp1251 and then encoded to UTF-8 again.

I made a small Python script that demonstrates it:

# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals
import subprocess

Format = "python"
ColorerCommand = ['colorer.exe', '-ht', '-dc', '-dh', '-db', '-eiutf-8', '-eoutf-8', '-t{0}'.format(Format)]

source = u'éåäö'
proc = subprocess.Popen(ColorerCommand, stdin=subprocess.PIPE, stdout=subprocess.PIPE)
output = proc.communicate(('"%s"' % source).encode('utf-8'))[0].split(b'"')[1]

print("Test text:", repr(source))
print("Colorer output:", repr(output))
Unicode = output.decode('utf-8')
print("utf-8 decoded text:", repr(Unicode))
cp1251 = Unicode.encode('cp1251')
print("cp1251 encoded text:", repr(cp1251))
Unicode = cp1251.decode('utf-8')
print("utf-8 decoded text:", repr(Unicode))
print()

if source == Unicode:
    print("Source and decoded text are the same!")
else:
    print("Source and decoded text don't match!")
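
The same round trip can be reproduced without colorer.exe. The sketch below is only an illustration, assuming the fault is that the piped UTF-8 bytes are misread as cp1251 and then written out as UTF-8; the helper names `garble` and `repair` are hypothetical and not part of Colorer.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals


def garble(text):
    """Simulate the assumed fault: UTF-8 bytes misread as cp1251, then re-encoded as UTF-8."""
    return text.encode('utf-8').decode('cp1251').encode('utf-8')


def repair(data):
    """Undo the double encoding with the three steps listed above."""
    return data.decode('utf-8').encode('cp1251').decode('utf-8')


source = 'éåäö'
broken = garble(source)
print("Garbled bytes:", repr(broken))
print("Repaired text:", repr(repair(broken)))
print("Round trip ok:", repair(broken) == source)
```

If the assumption holds, `repair` applied to the real colorer.exe output should give back the original text in the same way.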

jonib

@jonisb
Author

jonisb commented Jan 9, 2018

Looks like I get correct UTF-8 if I don't use the -eiutf-8 and -eoutf-8 options when calling Colorer.exe.

@ctapmex
Member

ctapmex commented Jan 9, 2018

Hi.
Check by outputting the stream to a file, for example:
d:\code\my\Colorer-schemes\bin>colorer.exe -ht -dc -dh -db -eiutf-8 -eoutf-8 -cd:\code\my\Colorer-schemes\build\base\catalog.xml d:\test.txt

d:\code\my\Colorer-schemes\bin>colorer.exe -ht -dc -dh -db -eiutf-8 -eoutf-8 -cd:\code\my\Colorer-schemes\build\base\catalog.xml d:\test.txt > d:\out2.txt

The result will be correct.
Colorer outputs the data as bytes. I do not know how to correctly process a byte stream in Python.
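
For reference, a byte stream from a child process can be handled in Python 3 by keeping everything as bytes until one final, explicit decode. The following is a minimal sketch and not Colorer-specific: `run_filter` is a hypothetical helper, and the stand-in child process used here simply echoes stdin, so it says nothing about how colorer.exe itself behaves.

```python
import subprocess
import sys


def run_filter(command, text, encoding='utf-8'):
    """Send text to a child process as encoded bytes and decode its stdout."""
    proc = subprocess.Popen(command, stdin=subprocess.PIPE, stdout=subprocess.PIPE)
    raw_output, _ = proc.communicate(text.encode(encoding))
    if proc.returncode != 0:
        raise RuntimeError('command failed with exit code %d' % proc.returncode)
    return raw_output.decode(encoding)


# Stand-in child (an assumption for testing): copies stdin to stdout byte for byte.
ECHO = [sys.executable, '-c',
        'import sys; sys.stdout.buffer.write(sys.stdin.buffer.read())']

if __name__ == '__main__':
    print(run_filter(ECHO, 'éåäö') == 'éåäö')
```

To try it against Colorer, `ECHO` could be swapped for a command list such as the `ColorerCommand` in the script earlier in this thread.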

@jonisb
Author

jonisb commented Jan 9, 2018

So I did some more testing:

colorer.exe -ht -dc -dh -eiutf-8 -eoutf-8 -tpython test > outdata.html
Reading the file with Colorer.exe directly works correctly.

colorer.exe -ht -dc -dh -eiutf-8 -eoutf-8 -tpython < test > outdata.html
Piping the text in seems to be the problem as this gives the wrong encoding.

So it seems it's not the output that is the problem but the input via piping.

colorer.exe -ht -dc -dh -tpython < test > outdata.html
This works correctly when I don't specify " -eiutf-8 -eoutf-8"

Thanks for looking into this.

jonib

@ctapmex
Member

ctapmex commented Jan 9, 2018

I checked the code: when reading from the stream, the specified encoding parameter is ignored.

@jonisb
Author

jonisb commented Jan 9, 2018

If that is by design, this issue can be closed.

It works when not specifying " -eiutf-8 -eoutf-8" and that's good enough for me.

Edit:
Looks like my original problem is not fixed: the Unicode characters are handled as multiple characters, get split in the output XML, and can't be decoded properly. I need to do some more testing.

edit2:
I really want to use piping to send the text to Colorer.exe. Is there any way to send Unicode via piping if the encoding is ignored? Is there a reason why it is ignored? Can this be changed?
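
One possible workaround in the meantime, based on the observation above that reading a file directly works, is to write the text to a temporary UTF-8 file and pass its path instead of piping. This is a sketch under that assumption only (Python 3); `run_on_tempfile` is a hypothetical helper, and the stand-in command below just prints the file back rather than calling colorer.exe.

```python
import os
import subprocess
import sys
import tempfile


def run_on_tempfile(command, text, encoding='utf-8'):
    """Write text to a temporary file, append its path to command, return decoded stdout."""
    fd, path = tempfile.mkstemp(suffix='.txt')
    try:
        with os.fdopen(fd, 'wb') as handle:
            handle.write(text.encode(encoding))
        # The file is closed here, so the child process can open it on Windows too.
        raw_output = subprocess.check_output(command + [path])
    finally:
        os.remove(path)
    return raw_output.decode(encoding)


# Stand-in command (an assumption for testing): prints the bytes of the file
# named by its last argument.
CAT = [sys.executable, '-c',
       'import sys; sys.stdout.buffer.write(open(sys.argv[-1], "rb").read())']

if __name__ == '__main__':
    print(run_on_tempfile(CAT, 'éåäö') == 'éåäö')
```

With Colorer, `CAT` would be replaced by the file-based invocation that worked in the tests above, for example `['colorer.exe', '-ht', '-dc', '-dh', '-eiutf-8', '-eoutf-8', '-tpython']`.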

jonib

@jonisb
Author

jonisb commented Mar 22, 2021

Now piping UTF-8 seems to work great, thanks.
