odt export: "Text body" paragraph style is misspelled. #948

Closed
peter88213 opened this issue Jan 2, 2022 · 15 comments · Fixed by #949
Labels
bug Issue: Something isn't working

Comments

@peter88213

Please be aware that style identifiers in Open/LibreOffice are case sensitive. In exported odt documents, a "Text Body" style is assigned to the normal text paragraphs. LibreOffice 7 recognizes this as a custom paragraph style. The built-in text body style, which is also translated into other languages, is called, correctly written, "Text body". This is important when assigning templates to documents.
I suggest correcting this in the toodt module.

Apart from that, I find novelWriter quite appealing. The installation under Xubuntu was without problems.

@peter88213 peter88213 added the bug Issue: Something isn't working label Jan 2, 2022
@vkbo
Owner

vkbo commented Jan 2, 2022

Hmm, in my version of LibreOffice it is called "Text Body" in the GUI but I see now that if I write an .fodt file the display name in the XML is with a lower case "b". I suppose that's the issue you're referring to? I have my OS set to English, and I see no difference in behaviour with an upper or lower case b that I can detect, but I may not be using it the way you do.

I'm happy to take a PR on this if you could correct it and test that it behaves as you expect on your end. The relevant code is in the function def _useableStyles around line 800 in the toodt.py file. I suspect all that is needed is to change the display name setting in that one place, but since I don't know quite how to verify the result, it's a bit tricky to check.

@vkbo vkbo added this to the Release 1.6 Beta 1 milestone Jan 2, 2022
@peter88213
Author

In the meantime, I cloned the source, installed the missing modules (is lxml really needed? what's wrong with Python3 standard xml.etree?), fixed it locally at line 814 and tested it. This works fine:
oStyle.setDisplayName("Text body")

Unfortunately, my Eclipse PyDev PEP8 auto-formatter removed all your variable alignment, so I guess you wouldn't want a pull request.
For me, it's almost impossible to track changes this way.

Just for information, the odf xml tag is like <text:p text:style-name="Text_20_body">
This is not just about the GUI representation, it is essential for applying document templates, such as with my styleSwitcher extension.

By the way, Happy New Year!

@vkbo
Owner

vkbo commented Jan 2, 2022

The primary feature of lxml that is used is the "pretty print" feature. The standard Python XML package didn't do that until recently, and indented XML is essential if the novelWriter projects are to be tracked and diffed with git etc. I'm also not sure whether lxml and the internal package behave identically with all the messy stuff I do in the toodt class.

As for autopep, I don't use it. It is far too brutal for most of the code I work on on a daily basis because most people care little for linting. I use flake8 instead, which suggests changes instead.

As for the last point about the style names, those are LibreOffice quirks, aren't they? I never understood what the point of the numbers was, so I saw no point in adding them either. LibreOffice changes them anyway if you save the document again from LibreOffice. It adds a lot of new XML too. I mostly followed the Open Document standard as closely as I could manage.

Edit: Just to clarify, changing to Text_20_body is no issue. I just never saw a reason to add seemingly redundant characters to the labels as I had to type them several places in the code.

@vkbo
Owner

vkbo commented Jan 2, 2022

Aha, now I get it. _20_ just represents the space character in the display name, as in 0x20. That's a bit obscure, but it makes sense. I guess LibreOffice assumes the display name and style name match in such a way.
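The mapping from display name to style name can be sketched as below. This is an illustration only: the function name `encode_style_name` is hypothetical, and the rule that every character other than an ASCII letter or digit is escaped is an assumption based on the "Text_20_body" example above, not something taken from the novelWriter code or the ODF specification.

```python
import re


def encode_style_name(display_name: str) -> str:
    """Escape each character that is not an ASCII letter or digit as
    _<hex code>_, so a space (0x20) becomes _20_.

    Assumption: this mirrors what LibreOffice appears to do; the exact
    set of escaped characters may differ.
    """
    def _escape(match):
        return "_{:x}_".format(ord(match.group(0)))

    return re.sub(r"[^A-Za-z0-9]", _escape, display_name)


print(encode_style_name("Text body"))  # Text_20_body
```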

It's not an issue changing this. I've queued this up for the next beta release anyway, which I was planning on doing today but got stuck on some old Fortran code instead.

Also, happy new year to you as well :)

@peter88213
Author

peter88213 commented Jan 2, 2022

> The primary feature of lxml that is used is the "pretty print" feature.

I see. That's okay, but because I try to avoid dependencies on third-party libs, I included a small 'xml pretty printer' snippet by Fredrik Lundh in my own code, which does the trick perfectly well.

> As for the last point about the style names, those are LibreOffice quirks, aren't they?

This applies to OpenOffice as well. I wrote some scripts for yWriter that generate ODT and ODS, but I went the easy route of template-based generation instead of dealing with all the intricacies of the XML format. I just copied bits and pieces from an existing document and nothing could go wrong.

@vkbo
Owner

vkbo commented Jan 2, 2022

I made a PR #949 that should fix the style names. Could you possibly test it before I merge it?

Anyway, I know it's fairly simple to write this for XML. I've written the reverse code myself for the previous job. It's on my todo list, but not high priority. lxml is a good library and is almost always installed on Linux PCs. I agree with the minimal dependency approach though. I've dropped a number of dependencies in novelWriter since I started. I even had a spell check implementation based on difflib a while back, but having to distribute dictionaries was annoying.

I wrote my own pretty printer for JSON, which will only indent up to a given level. It wastes a lot less space than the default one, and is still diff-friendly.

import json


def jsonEncode(data, n=0, nmax=0):
    """Encode a dictionary, list or tuple as a json object or array, and
    indent from level n up to a max level nmax if nmax is larger than 0.
    """
    if not isinstance(data, (dict, list, tuple)):
        return "[]"

    buffer = []
    indent = ""

    for chunk in json.JSONEncoder().iterencode(data):
        if chunk == "":  # pragma: no cover
            # Just a precaution
            continue

        first = chunk[0]
        if chunk in ("{}", "[]"):
            # Empty containers are written as-is
            buffer.append(chunk)

        elif first in ("{", "["):
            # Opening a container: go one level deeper
            n += 1
            indent = "\n"+" "*n
            if n > nmax and nmax > 0:
                buffer.append(chunk)
            else:
                buffer.append(chunk[0] + indent + chunk[1:])

        elif first in ("}", "]"):
            # Closing a container: go one level back up
            n -= 1
            indent = "\n"+" "*n
            if n >= nmax and nmax > 0:
                buffer.append(chunk)
            else:
                buffer.append(indent + chunk)

        elif first == ",":
            # Item separator: break the line only within the indented levels
            if n > nmax and nmax > 0:
                buffer.append(chunk)
            else:
                buffer.append(chunk[0] + indent + chunk[1:].lstrip())

        else:
            buffer.append(chunk)

    return "".join(buffer)

@vkbo vkbo self-assigned this Jan 2, 2022
@peter88213
Author

Well, I checked out the odt_libreoffice_friendly branch, started the application and loaded a small dummy project. The text body is now correct. I loaded the result in OpenOffice 3.4.1 and LibreOffice 7.1.8.
I guess you can also verify it yourself by looking at the paragraph styles sidebar (press F11) in Open/LibreOffice. A file exported by the old build might show the text body style twice: once as the original style, with other styles such as "first line indent" inheriting from it, and once in the "user defined" section.
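Besides the sidebar, the exported file can be inspected directly, since an .odt file is a zip archive that contains a content.xml. The following is a minimal sketch: the helper name `paragraph_style_names` is hypothetical, the regular expression is a shortcut rather than a proper XML parse, and the demo archive is fabricated in memory purely to show the call. In a real export the paragraphs may reference automatic styles (such as "P1") that only inherit from "Text_20_body", so an empty match does not prove the style is wrong.

```python
import io
import re
import zipfile


def paragraph_style_names(odt_file):
    """Return the set of text:style-name values used on <text:p> elements."""
    with zipfile.ZipFile(odt_file) as archive:
        content = archive.read("content.xml").decode("utf-8")
    return set(re.findall(r'<text:p[^>]*\btext:style-name="([^"]+)"', content))


# Demo with a tiny in-memory stand-in for an exported document.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as archive:
    archive.writestr(
        "content.xml",
        '<text:p text:style-name="Text_20_body">Hello</text:p>',
    )
demo.seek(0)
demo_styles = paragraph_style_names(demo)
print(demo_styles)
```

For a file on disk, pass its path instead of the `BytesIO` object.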

Your JSON processor looks very impressive. Hobby programmer that I am, I used the following standard method instead:
json.dumps(jsonData, indent=4, sort_keys=True, ensure_ascii=False)
The indent parameter does the pretty printing.

@vkbo
Owner

vkbo commented Jan 2, 2022

Thanks for testing it. I did not see duplicate styles in my tree actually, which is why I couldn't reproduce the mentioned issue. Anyway, I'll merge this for 1.6 and do the release tomorrow or when I have the time.

As for the JSON indent, yes, the internal library has it, but it lacks the "indent up to level X" feature that I wanted. I'm considering contributing it to the Python library, but the use case may be too narrow for inclusion.

On the subject of pretty printing XML, Python added it in 3.9. Since I have to support at least 3.7 and 3.8 for some time still, I have considered just copying it over. If so, I can drop lxml. The code is here: https://github.com/python/cpython/blob/863729e9c6f599286f98ec37c8716e982c4ca9dd/Lib/xml/etree/ElementTree.py#L1165
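For reference, the standard-library function in question is `xml.etree.ElementTree.indent`, available from Python 3.9. The sketch below guards the call with `hasattr` so it also runs on older interpreters, where the output is simply left unindented; the element names are made up for the example and are not the novelWriter project format.

```python
import xml.etree.ElementTree as ET

# Build a small tree; the tag names are placeholders.
root = ET.Element("novel")
chapter = ET.SubElement(root, "chapter", {"title": "One"})
ET.SubElement(chapter, "scene").text = "Hello"

# ET.indent modifies the tree in place by inserting whitespace.
if hasattr(ET, "indent"):
    ET.indent(root, space="  ")

xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```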

@vkbo vkbo closed this as completed in #949 Jan 2, 2022
@peter88213
Author

Thank you for fixing this so fast.

> On the subject of pretty printing XML, Python added it in 3.9.

That's good news. However, my Xubuntu distro still comes with Python 3.8. And since I distribute plain Python scripts, I'm keen to make them run with the lowest Python version possible.

@peter88213
Author

A little off-topic, but I just had an idea about limiting indentation to a certain level: First format it with the standard method, and then delete the excess leading blanks line by line.
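That idea can be sketched as follows. The function name `dumps_capped` and its parameters are hypothetical. Note that this only trims leading whitespace: deeply nested items still take one line each, so as far as the `jsonEncode` code above can be read, the result is not the same as its "indent up to level X" output, which keeps deeper levels on a single line.

```python
import json


def dumps_capped(data, max_level=2, width=2):
    """Pretty-print with json.dumps, then cap each line's leading blanks
    at max_level * width characters."""
    text = json.dumps(data, indent=width)
    limit = max_level * width
    lines = []
    for line in text.split("\n"):
        stripped = line.lstrip(" ")
        lead = len(line) - len(stripped)
        lines.append(" " * min(lead, limit) + stripped)
    return "\n".join(lines)


sample = {"a": {"b": {"c": [1, 2]}}}
capped = dumps_capped(sample, max_level=1, width=2)
print(capped)
```

Because newlines inside JSON strings are escaped by `json.dumps`, splitting on line breaks does not touch string values, and the output still parses to the same data.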

@vkbo
Owner

vkbo commented Jan 2, 2022

> That's good news. However, my Xubuntu distro still comes with Python 3.8. And since I distribute plain Python scripts, I'm keen to make them run with the lowest Python version possible.

novelWriter works with 3.6 still, and I don't plan to drop that support until I have to.

@vkbo
Owner

vkbo commented Mar 28, 2022

Just an update on lxml vs Python xml that we discussed here.

I just tested dropping lxml in a branch by copying the indent function from the Python 3.9 source so it works on older versions. That was all fine. However, I quickly found out that another difference between the two implementations is how they handle namespaces. The core XML for novelWriter doesn't use them, so no problem there, but they are all over the place in the ODT writer class. I think lxml handles them a lot better too. The changes needed in the ODT writer were too extensive to be worth the effort, so I think I'll keep lxml for now.
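To illustrate what namespace handling looks like with the standard library alone, here is a minimal sketch that writes a namespaced ODF paragraph element. It is not taken from the ToOdt class, and the helper `qn` is hypothetical. With `xml.etree.ElementTree`, prefixes come from a process-wide registry (`register_namespace`) and tags are written in `{uri}local` form, whereas lxml also accepts a per-element `nsmap` argument; whether this is the specific difference that caused trouble in the branch is not stated in the thread.

```python
import xml.etree.ElementTree as ET

# The ODF "text" namespace URI.
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"

# Register the prefix globally so serialisation uses "text:" instead of "ns0:".
ET.register_namespace("text", TEXT_NS)


def qn(local):
    """Build a qualified name in ElementTree's {uri}local notation."""
    return "{%s}%s" % (TEXT_NS, local)


para = ET.Element(qn("p"), {qn("style-name"): "Text_20_body"})
para.text = "Hello"

xml_text = ET.tostring(para, encoding="unicode")
print(xml_text)
```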

It was worth a quick try anyway.

@peter88213
Author

That's interesting. Well, one thing leads to another. If I understand correctly, with the ToOdt class you have a sophisticated document generator that goes deep into the details of the ODT format and builds the XML trees from scratch.
Since you're already on Qt, what's wrong with its QTextDocumentWriter? Can't it be connected to the tokenizer?

@vkbo
Owner

vkbo commented Mar 29, 2022

I wrote the ToOdt class to replace the Qt ODT writer which was used before. The Qt implementation is very basic and is missing a lot of things. It doesn't even write text headers. I also wanted to control the formatting of text paragraphs vs metadata, and control page header/footer formatting.

@peter88213
Author

I see. However, as a user I would rely on the superior formatting capabilities of OpenOffice anyway, so a simple document with an emphasis on clean structuring and the strict application of paragraph/character styles is enough for me in my programs. My yWriter-to-ODT exporter enters, for example, the author and title values as metadata, so a header like the one generated by the novelWriter export can be added via the page style at any time. Of course, this requires that the users know how to handle Open/LibreOffice.

One thing that is not so easy to do afterwards is the different formatting of the first paragraph after a heading or blank line (text body) and the following paragraphs (first line indent). The document generator takes care of that for me.
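The rule described here can be sketched in a few lines. This is an illustration under assumptions, not code from either exporter: the function name, the `(kind, text)` input format, and the `"heading"`, `"blank"` and `"para"` labels are all hypothetical, and the style names assume the encoded forms of the built-in "Text body" and "First line indent" styles.

```python
def assign_paragraph_styles(blocks):
    """Assign a style name to each paragraph.

    blocks is a list of (kind, text) tuples, where kind is "heading",
    "blank" or "para". The first paragraph after a heading or blank line
    gets the text body style; following paragraphs get first line indent.
    """
    result = []
    first = True
    for kind, text in blocks:
        if kind != "para":
            # A heading or blank line resets the sequence.
            first = True
            continue
        style = "Text_20_body" if first else "First_20_line_20_indent"
        result.append((style, text))
        first = False
    return result


styled = assign_paragraph_styles([
    ("heading", "Chapter 1"),
    ("para", "A"),
    ("para", "B"),
    ("blank", ""),
    ("para", "C"),
])
print(styled)
```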

@HeyMyian HeyMyian mentioned this issue Feb 1, 2023
4 tasks