
Performance of the library #476

Closed

skinkie opened this issue May 2, 2021 · 8 comments


skinkie commented May 2, 2021

I am currently testing the issue-469 branch. It does not require any manual changes, which is great.
I am parsing this file: http://data.ndovloket.nl/netex/htm/NeTEx_HTM__2020-10-12.xml.gz

import gzip
import time

from xsdata.formats.dataclass.context import XmlContext
from xsdata.formats.dataclass.parsers import XmlParser
from xsdata.formats.dataclass.parsers.config import ParserConfig

config = ParserConfig(
    process_xinclude=False,
    fail_on_unknown_properties=False,
)
print("Before import", time.time())

from netex import PublicationDelivery

print("Before parser", time.time())

parser = XmlParser(context=XmlContext(), config=config)
pd = parser.parse(gzip.open("/var/tmp/NeTEx_HTM__2020-10-12.xml.gz", 'r'), PublicationDelivery)

print("After parser", time.time())

timing_links = {}
for timing_link in pd.data_objects.composite_frame[0].frames.service_frame[0].timing_links.timing_link:
    timing_links[timing_link.id] = timing_link.distance

print("After dict", time.time())

print(timing_links)

Before import 1619954488.1244667
Before parser 1619954492.6376452 (4s)
After parser 1619954562.600524 (70s)
After dict 1619954562.601241

Compare this with the snippet below, which completes within one second. I agree the two are not directly comparable, but maybe there is a way to deserialise the file just in time.

import gzip
from lxml import etree
etree.parse(gzip.open('/var/tmp/NeTEx_HTM__2020-10-12.xml.gz', 'r'))
tefra commented May 2, 2021

Hi @skinkie,

First of all, take gzip out of the equation; that alone accounts for ~20 seconds.

import time
from contextlib import contextmanager
from pathlib import Path
from typing import Iterator

import lxml.etree
from xsdata.formats.dataclass.context import XmlContext
from xsdata.formats.dataclass.parsers import XmlParser
from xsdata.formats.dataclass.parsers.config import ParserConfig
from xsdata.formats.dataclass.parsers.handlers import LxmlEventHandler
from xsdata.formats.dataclass.parsers.handlers import LxmlSaxHandler
from xsdata.formats.dataclass.parsers.handlers import XmlEventHandler
from xsdata.formats.dataclass.parsers.handlers import XmlSaxHandler


@contextmanager
def timing(description: str) -> Iterator[None]:
    start = time.time()
    yield
    elapsed_time = time.time() - start

    print(f"{description}: {elapsed_time}")


with timing("importing module"):
    from netex.models import *

xml_path = str(Path.cwd().joinpath("NeTEx_HTM__2020-10-12.xml"))
context = XmlContext()
config = ParserConfig(fail_on_unknown_properties=False)

parser = XmlParser(context=context, config=config, handler=LxmlEventHandler)
with timing("first parse - xml context warmup"):
    parser.parse(xml_path, PublicationDelivery)

parser = XmlParser(context=context, config=config, handler=LxmlEventHandler)
with timing("parse - lxml EventHandler"):
    parser.parse(xml_path, PublicationDelivery)

parser = XmlParser(context=context, config=config, handler=LxmlSaxHandler)
with timing("parse - lxml SaxHandler"):
    parser.parse(xml_path, PublicationDelivery)

parser = XmlParser(context=context, config=config, handler=XmlEventHandler)
with timing("parse - xml EventHandler (native python)"):
    parser.parse(xml_path, PublicationDelivery)

parser = XmlParser(context=context, config=config, handler=XmlSaxHandler)
with timing("parse - xml SaxHandler (native python)"):
    parser.parse(xml_path, PublicationDelivery)

with timing("lxml xml element tree"):
    result = lxml.etree.parse(xml_path)

These are my results, which are in line with the benchmarks in CI; the 10,000-record sample there is about ~4 MB and takes about ~1 second.

importing module: 3.845527410507202
first parse - xml context warmup: 51.35779690742493
parse - lxml EventHandler: 51.03626036643982
parse - lxml SaxHandler: 50.466455936431885
parse - xml EventHandler (native python): 46.76586675643921
parse - xml SaxHandler (native python): 61.00073552131653
lxml xml element tree: 1.521801233291626

the good

The good part in all this is that the XML context building is actually pretty fast for the ~160 models used in that document.
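For completeness, a minimal sketch of paying that context-building cost up front, before the first parse; it assumes XmlContext.build is the public entry point for this, so treat it as illustrative:

from xsdata.formats.dataclass.context import XmlContext

from netex import PublicationDelivery

# Build the binding metadata for the root model up front; nested models
# may still be built lazily the first time the parser meets them.
context = XmlContext()
context.build(PublicationDelivery)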

the weird

Both EventHandlers (lxml and native xml) are based on iterparse; the native Python one is a bit faster, but not by much. I've noticed lxml's iterparse struggles a bit with documents that have a lot of attributes.

the bad

I took a few profile dumps to see where the bottlenecks are; almost half of the time goes to the value converters, and I can see a few quick wins that would save an additional 5+ seconds.
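For anyone who wants to reproduce such a dump, a minimal sketch using the standard library's cProfile (file names are illustrative, not the exact setup used here):

import cProfile
import pstats

from xsdata.formats.dataclass.parsers import XmlParser

from netex import PublicationDelivery

parser = XmlParser()

# Profile a single parse and print the hottest entries by cumulative time.
profiler = cProfile.Profile()
profiler.enable()
parser.parse("NeTEx_HTM__2020-10-12.xml", PublicationDelivery)
profiler.disable()

pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)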

Realistically, though, we will never reach lxml's raw speed without rewriting the parsers in C, because of the binding/conversion processes. Java with JAXB is also embarrassingly fast (just over 5 seconds for that document) 😞

Are 150 MB documents common for the NeTEx schemas?

skinkie commented May 2, 2021

The fact that gzip would take 20 seconds surprises me greatly; how is this possible? zcat /var/tmp/NeTEx_HTM__2020-10-12.xml.gz | wc -l takes less than a second.

I don't think lxml is comparable if you have to create objects / apply a binding. And yes... JAXB is amazing, but I wonder how much of that is done just in time. I wonder if something like lazy binding could help here, for example creating a partial tree. Another direction I wondered about when evaluating the problem of validators (see below): why isn't anybody generating a perfect parser from an XSD? (Perfect parser, as in a parser that only accepts documents valid against that schema.)
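To make the partial-tree idea concrete, here is a rough sketch (not an xsdata feature) that streams only the elements of interest with lxml's iterparse; the namespace and element names are assumptions based on the snippet above:

import gzip

from lxml import etree

NS = "{http://www.netex.org.uk/netex}"  # assumed NeTEx namespace

# Stream the document and materialise only the TimingLink subtrees;
# a binding layer would then only ever see these small fragments.
distances = {}
with gzip.open("/var/tmp/NeTEx_HTM__2020-10-12.xml.gz", "rb") as fh:
    for _, elem in etree.iterparse(fh, tag=f"{NS}TimingLink"):
        distances[elem.get("id")] = elem.findtext(f"{NS}Distance")
        elem.clear()  # free the subtree to keep memory flat

print(len(distances))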

Regarding the size of the files: what we see is that validators stop working on NeTEx documents around 250 MB; it seems this was resolved in libxml2 recently (in git, not yet in a release). There are NeTEx documents that are gigabytes in size on their own (open data from Switzerland or Germany, for example). If you want some references I can obviously provide them.

If you would like to have a more interactive chat on the subject, we can.

skinkie commented May 2, 2021

I have read about the poor gzip performance in Python and stubbed in mgzip for it.
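For reference, the slow part is gzip's many small streamed reads; besides mgzip, one workaround is to decompress into memory in one shot and hand the parser a buffer. A sketch:

import gzip
import io
from pathlib import Path

from xsdata.formats.dataclass.parsers import XmlParser

from netex import PublicationDelivery

# One bulk decompression instead of thousands of small streamed reads.
raw = gzip.decompress(Path("/var/tmp/NeTEx_HTM__2020-10-12.xml.gz").read_bytes())

parser = XmlParser()
pd = parser.parse(io.BytesIO(raw), PublicationDelivery)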

tefra commented May 2, 2021

I am not aware of any lazy binding technique in JAXB, but I will do some more research on this. Regarding the partial tree: xsdata uses the lxml/xml iterparse and sax interfaces to bind data as soon as it is ready, but Java and JAXB performance is out of reach without rewriting a lot of things in C.

I started this project for fun, as a side project to stay current with Python. In my experience dealing with XSD there are many, way too many, different approaches to accomplish the same thing, and the NeTEx collection is an excellent example of that 😄
The whole schema has 5 or 6 issues I am working on that I had never encountered before.

Most binding libraries try to cover the most common practices, and there are features in both XSD 1.0 and 1.1 that are simply impossible to implement in some languages.

How are these documents being generated? Gigabytes???

Out of curiosity I tried to validate that 150 MB sample against the schema in Python using lxml, and I gave up after 20 minutes.

skinkie commented May 2, 2021

You should understand that you have likely come up with the best XSD tool for Python, and it might be the best implementation after JAXB. We have tested a lot of implementations for different languages, including C#, and virtually all of them, including the commercial ones, fail outright on the substitution groups and require "xs:choice" instead. So we are very impressed. You might want to think about generating code for different programming languages using the same generator infrastructure.

There are a few tricks. The obvious one is generating the code with a string formatter: keeping a relational database structure but serialising it to XML by hand. JAXB has an option to export fragments, so the entire document does not have to be serialised at once (which consumes large amounts of memory). A sketch of the fragment idea follows below.
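To illustrate fragment-at-a-time serialisation in Python, here is a minimal sketch using lxml's incremental xmlfile writer; element names are illustrative, not taken from NeTEx:

from lxml import etree

# Incremental serialisation: fragments are written out as they are
# produced, so the full document never has to sit in memory.
rows = ({"id": f"TL:{i}", "distance": str(i * 10)} for i in range(1000))

with etree.xmlfile("timing_links.xml", encoding="utf-8") as xf:
    with xf.element("TimingLinks"):
        for row in rows:
            el = etree.Element("TimingLink", id=row["id"])
            etree.SubElement(el, "Distance").text = row["distance"]
            xf.write(el)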

I have written an lxml-based validator that validates a document's structure "constraintless" (that is, ignoring the identity/key constraints) and implements the constraint checking in multithreaded Python. It still outperforms the "new" libxml2 code, but I think that if libxml2 employed multithreading itself it could be faster still.
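For the structural half of that split, the lxml side can be as simple as the sketch below (schema and file names are illustrative; the multithreaded constraint checking is separate):

from lxml import etree

# Structure-only validation against a schema with the identity/key
# constraints stripped out.
schema = etree.XMLSchema(etree.parse("netex_constraintless.xsd"))
doc = etree.parse("NeTEx_HTM__2020-10-12.xml")

if not schema.validate(doc):
    for error in schema.error_log:
        print(error.line, error.message)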

This is where the magic happens with libxml2:
GNOME/libxml2@faea2fa

tefra commented May 6, 2021

Thank you for the sample @skinkie. Because of its size and complexity it was easy to spot some quick wins all around; in some cases I saw almost a 20% improvement.

These are some of the best timings I recorded for your sample

first parse - xml context warmup: 42.239562034606934
parse - lxml EventHandler: 41.22812223434448
parse - xml EventHandler (native python): 37.97033977508545

I will add the sample in my benchmark suite and keep digging for more areas that can be improved but for next release I think I am gonna leave it at that.

@tefra tefra closed this as completed May 6, 2021
tefra commented May 6, 2021

The schema analyzer is decoupled from the actual code generator, and the code generator is completely pluggable, so who knows, maybe in the future we can add outputs for other languages as well.

skinkie commented May 6, 2021

> Thank you for the sample @skinkie. Because of its size and complexity it was easy to spot some quick wins all around; in some cases I saw almost a 20% improvement.
>
> These are some of the best timings I recorded for your sample
>
> first parse - xml context warmup: 42.239562034606934
> parse - lxml EventHandler: 41.22812223434448
> parse - xml EventHandler (native python): 37.97033977508545
>
> I will add the sample in my benchmark suite and keep digging for more areas that can be improved but for next release I think I am gonna leave it at that.

Thanks for this massive effort.

> The schema analyzer is decoupled from the actual code generator, and the code generator is completely pluggable, so who knows, maybe in the future we can add outputs for other languages as well.

I think I will even consider a documentation generator from the XSD.
