Considering a Nokogiri equivalent for V #15186
Replies: 1 comment
-
It seems Nokogiri (like the majority of other parsers) is not a "lazy" parser: it parses everything into memory right at the beginning, regardless of how much of the input data will actually be needed. I'd rather go for a much simpler, faster and more optimization-friendly approach, namely "make a lazy view with a write layer". Parsing would happen on demand, fragment by fragment (the first access would not trigger a parse of everything), and if there are no writes, the data stays as it is. See https://github.com/zserge/jsmn and https://github.com/google/gumbo-parser for this approach. Note that this approach allows extremely easy caching (in the spirit of the answers in https://stackoverflow.com/questions/815110/is-there-a-decorator-to-simply-cache-function-return-values ) of already extracted values, to avoid re-parsing the same fragment over and over. If a change to a certain chunk of data (imagine a string field in JSON) in the middle of the input JSON should occur (for document-oriented DBs and data designs this is the most common operation), then there is a direct approach with 3 options:
(If any inserts (either in the middle of the buffer or at the end) shall occur, then ...)
The second approach to changing data would be COW (copy-on-write) inspired, e.g. making a lightweight overlay over the original static data (in other words, it's a cache). For advanced overlaying/snapshotting see e.g. ctrie for parallel tree structures. The overlay is merged with the static data only at the serialization phase, just before the data is sent away. For small data the overlay is about as fast as (3) from the direct approach, but it is faster for data starting at hundreds of megabytes.
-
While the following is not something that is needed directly within V core, it is quite essential to V adoption.
With the work on ORM well underway and V moving towards full web production and consumption, it could be important to consider a Nokogiri equivalent for V. For frameworks such as Rails this is a critical component of the system, and of course it's critical for many other areas.
Nokogiri is obviously a very robust system that is an HTML, XML, SAX, and Reader parser. Being able to use XPath and CSS3 selectors to search documents is pretty critical nowadays.
Nokogiri is also somewhat "slow" when compared to other options, but I suppose there's a big question of return on investment. Would it be worthwhile to implement a V Nokogiri equivalent (I don't know if essentially converting Nokogiri directly to V would be difficult) OR to convert something in C++ that's relatively "small" (around 13k loc) like pugixml and remarkably fast (benchmarks) (faster than RapidXML and in some cases faster than asmxml)?
On the pugixml front, I suppose there are notable drawbacks, such as missing XML namespace support and other missing parts of the W3C specifications for XPath. You'd also have to preprocess HTML to XHTML to verify/accommodate compliance.
Another question or thought: would it be better to start with a more robust solution like Nokogiri that's more "enjoyable" for development, and then also provide a more performant option like pugixml? Or not?
Just some thoughts and questions to consider here. I DO believe that working in this space is very important, since ROBUST (and comprehensive) XML/HTML consumption is going to be critical to V pretty darn quickly.