
Normalize (consolidate) tokens of the same kind #23

Open · wants to merge 1 commit into master from consolidate-tokens

Conversation

@iafan (Contributor) commented Jun 1, 2017

The way the underlying text/scanner package works, it produces two separate tokens for combined punctuation such as <=, >=, !=, :=, ||, ::, and so on. For example, != is represented in the highlighted output as follows:

<span class="pun">!</span><span class="pun">=</span>

This is suboptimal in terms of final HTML size, and it also prevents font ligatures from being displayed properly (see e.g. the Fira Code font).

This PR implements token normalization so that adjacent tokens of the same kind are merged together. The example above now looks like this:

<span class="pun">!=</span>

...and allows for proper ligature rendering.

@dmitshur (Contributor) commented Jun 1, 2017

I think this is a reasonable change to consider because it affects only the text/scanner-based Print and Annotate. But here are some thoughts for your consideration.

It's an improvement in terms of reducing final HTML size, but it reduces how much control the underlying scanner has over the output.

It makes sense to join the two separate tokens of <= or != into one, but maybe not so much for unrelated tokens that just happen to be right next to each other. For example, ().

Practically speaking, users of the text/scanner-based scanner are probably not going to be picky, and the simplified HTML will benefit them. But it would be a problem if someone were relying on the distinct tokens, for example to highlight each individual character with its own border.

Since you mentioned ligatures and I suspect you're using the scanner for code, you could also consider using a more specialized scanner rather than the generic text/scanner.

I don't know if you need it for just one language or for many different ones. But if you need it for just Go, there's a go/scanner package in the Go standard library that can produce much more precise results than the generic text/scanner. I've created highlight_go, which uses go/scanner for syntax highlighting Go code. It produces a single syntaxhighlight.Keyword token for things like <=, >=, !=, :=, ||, etc.

@iafan (Contributor, Author) commented Jun 1, 2017

Thanks for your extensive explanation, Dmitri. I was acting under the impression that this package is used purely for visual highlighting (pretty-printing) of code, and from that perspective two consecutive chunks of the same style have exactly the same visual effect as one combined chunk, so there's no need for two. Your point about borders is a good one, since in that case the rendering would differ, but that specific use case seems to matter only when one wants to visualize how text/scanner itself works, not when pretty-printing code.

So it definitely depends on how this package is supposed to be used, and you know that better than I do. I'm currently using it to highlight only Go code, but I like how it gives good results on other code as well, so I was hoping that a generic highlighting package would benefit from this normalization.

If you feel that the individual punctuation wrappers should stay for specific use cases, I can make this normalization optional (as I did with <ol> decoration) — this way one can choose the behavior they need. What should be the default is up to you.

The Go-specific highlighting package that you're mentioning also produces code like <span class="pln">(</span><span class="pln">)</span> which is extraneous for usual highlighting purposes, so if I use that, I'll still need to implement both normalization and <ol>-wrapping on top of it. :)

@dmitshur (Contributor) commented Jun 1, 2017

I think it would be acceptable to make this always on for the text/scanner-based scanner. That is just my opinion, I don't make final decisions in this package.

But if you do that, it'll be tightly integrated with just this printer/annotator. An alternative for you to consider is to perform the same consecutive-token consolidation in a higher-level wrapper, so that it can be optionally applied on top of the printer here, or the one I have in highlight_go. I'm not sure whether it would work out in code, but it's an idea to consider.

@iafan (Contributor, Author) commented Jun 1, 2017

Please correct me if I'm wrong, but the way this is currently implemented is already Printer- and Annotator-agnostic (at least that's how I interpret the package code). The consolidation is done in the global Print and Annotate wrapper functions that accept different Scanner, Printer and Annotator interfaces, so it might be possible to extend the code to use different interface implementations and still have the consolidation working.

@dmitshur (Contributor) commented Jun 1, 2017

different Scanner, Printer and Annotator interfaces

Look at the signature of the Print func you're modifying:

func Print(s *scanner.Scanner, w io.Writer, p Printer) error

The *scanner.Scanner is the "text/scanner".Scanner struct. It's not an interface.

Similarly, Annotate doesn't even take a scanner argument, it just creates it itself:

func Annotate(src []byte, a Annotator) (annotate.Annotations, error) {
 	s := NewScanner(src)

That's why I had to create similar Print and Annotate funcs in highlight_go. But this may be a separate problem on its own.

@iafan (Contributor, Author) commented Jun 1, 2017

You're right (and I agree that making different scanners/printers work interchangeably might then require a more high-level approach to the exposed API, including some backward-incompatible changes). Maybe the package needs to define its own syntaxhighlight.Scanner interface, which the current text/scanner.Scanner would satisfy, and use that.

Speaking of normalization as a second step, I don't think it's practical (at least I don't immediately see a good API for it). Normalization needs to be performed on the token stream (not the final printed output), and it relies on internal tokenKind() knowledge.

All in all, if the scope of the package is a simple text/scanner-based highlighter that only renders HTML or produces annotations for code-highlighting purposes, then I guess my code fits that purpose, as it now produces a more compact set of code/annotations. Otherwise, we need to define what that scope is, and this PR might then turn out to be out of scope or not fit the expected use cases.

This makes the output smaller and allows for font ligatures
(like `<=`, `!=`, `==` and so on) to be displayed correctly
@iafan force-pushed the consolidate-tokens branch from 04113ff to 4e4300d on June 7, 2017