Policy: Discuss extras repository for additional languages #2149
Are we still moving languages into separate repositories? Is that ALL the languages, or only "additional" ones?
Only new ones, because then we can make submitters the repo maintainers.
So what is the policy on new (or very old) PRs already in the queue? Is there some cut-off date, or is the answer now always "separate repo"?
Yep, separate repo.
The separate language repos concept addresses a maintenance problem that has caused consternation here since @isagalaev slowed down the regular maintenance. I think the separate repo idea is going to address that but we need to discuss the many problems it introduces. We should decide how to handle these things because they change the way this library has been managed since its inception.
Most of this is solvable with new documentation on the language contributors page and some new development, such as a language registry and supporting process scripts. I still haven't been able to figure out how to update the documentation — have we documented how that works anywhere? BTW, really nice work @yyyc514 going through all those issues 🚀
Well, the "test it" part is just a tooling problem but I think logically this will prove impossible with the infinite range of possibilities as languages grow and grow, yes.
I think that to be "semi-official", or included in some type of global list, a syntax should be required to pass some minimal set of specs. But obviously if someone just wants to rip a language down from somewhere and use it, we can't stop them. We do have power to decide what gets hosted under the @highlightjs organization, though.
Well, I wasn't going to say it publicly, but I guess now I will. I can't speak for everyone, but I can't imagine this ban on new languages in core is ABSOLUTE. It's to prevent a proliferation of 100 tiny languages stagnating that no one has time or inclination to maintain (or to babysit PRs, tests, QA, etc). If the next Swift comes around and 50% of the world is writing code in it, I imagine we'd consider adding it to core and someone would make time to maintain it. Although if we figure out this whole "separate repo" thing, perhaps eventually none of the languages will be in core... but that seems a bit far away at the moment.
Yes, this is a for-sure concern, and why I think there should be some gateway between "one of the core contributors has read this, or agreed it passes reasonable specs" and "who knows, I just found this lying around somewhere". It would be very bad if someone installs a shiny new "Pancakes 1.0" syntax that locks up their website, and they blame Highlight.js instead.
No idea. Someone who knows how needs to find time to write up something ROUGH... and then we need to find people who have the time and inclination to help keep docs updated. They could iterate on the rough docs and push them forward.
Thanks. |
@yyyc514 I had offered to pitch in on the docs in some other issue here, and in particular to help rewrite the "contribute a new language" guide. I had worked on redoing the languages I help maintain as separate repos to work out the way to do it. I set up a GitHub template project; we can review this effort and see if it's on the right track. It sets up a template project with a unit test to get started with a new language.
@jf990 See my thoughts on this thread regarding auto-detection: I actually think there are some pretty great ideas (or the beginnings of ideas) there. In this context, instead of saying "Yes, all 10,000 languages from all maintainers have NO conflicts!" we'd run the tests and then say:
Need a better word than "sticky", lol. And then we'd have a metric we could use for inclusion in a master list, for inclusion on the "main website", for build packs, etc...
So I have been looking at a few of the issues over the past few days, and feel I can maybe provide some outsider input. Before I start, I want to make it clear that I am not a proper JavaScript developer (I primarily work in other back-end languages), so my approach may be naive or go against best practices, but I do believe it will solve most issues. The approach is very similar to what GitHub does with their Linguist library: using Git submodules.
I have made a proof of concept for everything I am going to say below, which can be found here. Like I said above, I'm not that well versed in JS build tools (and it is currently 5 AM), so I did a dirty hack to get it to work. In reality it will need to be all or nothing, with all languages following the same format (being in a folder with the language name, for example), but having a standard format for languages is good in my opinion. Things I changed in the proof of concept:
Right now, based on issues I have read on this repo, if someone wants a new language, they make an issue / PR on this repo, then a @highlightjs member creates a repo for their language, and adds the original author of the issue / PR to that repo for their language to live in. As @jf990 said, while it helps with maintenance, it comes with a few drawbacks, I believe Git Submodules can fix most of those, like so:
This is the whole reason I am even commenting here: I made support for a language, and now trying to get products to use it is an uphill battle. Most developers feel that I should try to get my language supported in HighlightJS itself, which right now isn't an option. So, the solution: since manual work is already required when someone wants to add language support (making a repo for them and adding them as a collaborator), it shouldn't be an issue to also make a commit to this repo adding that newly created repo as a submodule, like I do in this commit of the PoC. With the 3rd party languages now in the languages folder, they are treated exactly the same as "first party" languages, so they are built when running the normal build commands.
See above about third party languages being treated as first party, so this would no longer be an issue.
Like I said above, I don't see why tests wouldn't run on these third party languages, besides the issue of getting other files into the
Once again, this is all handled because the files are physically there when the commands are run. The one thing that would need doing, or at least would be a quality-of-life feature, would be having the build script update all the submodules to their latest commit; but this can be done manually in the case of broken commits on submodules, or a submodule can just be pinned to a specific commit.
I have no solution for this. The only thing that comes to mind is implementing a set of requirements and guidelines for new languages. For example, GitHub Linguist (example) requires a language to be used in "hundreds of repositories". For them, being GitHub, that makes sense, since the PR would affect those repositories. For HighlightJS it gets a bit tricky; you could use the same metric as GitHub, and that way you could help ensure that:
Or highlightJS could work out a different metric; I will admit this doesn't seem like an easy issue to solve. As for the documentation, I have used ReadTheDocs in the past and have some experience with it, so I am happy to help figure out how it works and, from there, document any of the changes I listed above (if implemented) to help ensure that everyone knows what the new protocol is. I hope this all makes sense. I am happy to go into more detail on git submodules if need be, or even brainstorm a different solution (possibly having the build script traverse the highlightJS organization and pull the languages from there, that way there are no submodules).
From another thread (I'm replying to @egor-rogov) (#1829):
Well, "already in or not" is a very weird (meaningless?) metric, IMHO, for deciding where languages BELONG. I thought the whole idea of not letting more languages in [to core] had to do with developer time/maintenance/responsibility/who is in the best position to maintain the language long-term, etc... so surely the right way to think about existing languages ALSO is how they fare on those exact same metrics... This "already in core" vs "sorry, you just missed the cut-off!" is a VERY weird and arbitrary line.
Not a fan of git submodules, though I haven't worked with them in years. It's possible they have improved; back then all I heard was whining about how annoying they were. I do see the advantage of "just works" (other than for tests, which you didn't go into in great detail)... but I don't think the paths are the hard part... Having a languages.toml or languages.json file that anyone could contribute to, plus a smart build tool (doesn't have to be that smart), could accomplish the same thing. Our build pipeline is crazy old and needs replacing anyway, so keeping it "as-is" isn't a priority.
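A registry file plus a "not that smart" build tool could look something like this sketch. The `languages.json` shape and the `highlightjs-robots-txt` / `pancakes` entries are made up for illustration (robots-txt is mentioned later in this thread; "pancakes" is the hypothetical example from earlier):

```javascript
// Hypothetical languages.json registry: maps language names to the repo
// hosting the grammar. Anyone could PR a new entry without touching core.
const registry = {
  "robots-txt": { repo: "https://github.com/highlightjs/highlightjs-robots-txt" },
  "pancakes": { repo: "https://example.com/hypothetical/highlightjs-pancakes" },
};

// The build tool would clone/update each repo into a cache directory and then
// compile whatever grammar it finds there. This helper just computes the plan.
function checkoutPlan(reg, cacheDir) {
  return Object.entries(reg).map(([name, { repo }]) => ({
    name,
    repo,
    dest: `${cacheDir}/${name}`,
  }));
}
```

The design choice versus submodules: the registry pins nothing (or could pin a tag/commit per entry), and the clone step happens at build time rather than being baked into the core repo's git history.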
Another possible suggestion:
I'm also a fan of a shared language repository.
The idea about submodules is very interesting. I use them (in other projects, not on GitHub) and didn't find them annoying or anything.
What do you mean by this, @yyyc514?
Absolutely! I'd really like to remove this barrier.
Just another SHARED repository... so you have "core" then you have "extras" (which has a bunch of languages)... and we'd "police" core more carefully than "extras"... (if we plan to keep a distinction at all long-term).
Not opposed to trying if you've had good experiences. How do they work when the submodule just drops off the planet? or someone disappears from GitHub and takes their work with them? Easy to fix?
Yeah, I think the tests would move into the languages, and then "running the full suite" would have to be taught to look there if you truly wanted to run EVERYTHING.
Submodules + tests is going to require fixing those paths. It's bad enough with 184 languages; it'd be even worse with 250...
Funnily enough, when I was making my PR to Linguist, this exact thing happened: someone deleted a repo that was being used as a submodule. Thankfully they were active and got GitHub support to restore it, but that is really not ideal. I think a requirement for the submodules should be that they need to be under the highlightJS organization; that way only a team member can delete the repo and no one (in theory) can take their work with them.
Oh so it works poorly? LOL. |
If that is what you want to take from what I said, sure: they work poorly when you have a submodule of someone else's repository and they decide to delete the repository, which is what I addressed in the second paragraph. If the repositories are under the @highlightjs organization (like this repository https://github.com/highlightjs/highlightjs-robots-txt for example, or any of the other repositories that have been made for third-party language support), then the only people who can actually delete that repository are people in the @highlightjs organization, so in this case it should be fine.
Yeah, I followed that. It just sounded a lot better when the only thing we had to do was accept PRs to "link" them... rather than host them all as well. :-)
Well, you are currently hosting them, so it isn't much of a difference. I myself wouldn't be too worried about people deleting repositories and taking their code; Linguist has over 300 submodules, and if it happened that often, I'm sure they would have found a different solution. If it does happen, all that would need doing is to remove the submodule, which would remove the language support. This would most likely only happen with the more third-party languages, as you said:
So the "common" languages would probably have "official" highlightJS support, and you would only need to worry about the more "uncommon" language repositories getting deleted. So if a submodule was deleted, there are a few things that could be done:
Does using submodules become an issue when people want duplicates, or have differing opinions on core style choices? How do we handle that? Say someone has a PHP grammar that is MUCH better than ours (but perhaps it's too colorful, or it's too large, ours is more "minimal", etc.)... do we just give it a different name and then let people build it by name? I.e., I'd imagined some way such things could "grow organically" over time, until one day it turns out everyone prefers the alternative.
In a scenario like that, I really do think this is a bit out of scope for this issue. The way I would deal with something like this would be to add it under a different name, or better yet, add support for language replacement (not sure if this is already a thing), so the person who made the better grammar could do: hljs.replaceLanguage("php", hljsPHPSuper); So instead of registering it as a new language, they register it as a replacement, and highlightJS will use their grammar instead. In my opinion, replacement grammars for already existing "core" languages would be denied from core and would be opt-in, if someone really wanted the alternative. If in a few months everyone is using it, it could be revisited.
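To be clear, `replaceLanguage` is the commenter's proposed API, not an existing highlight.js function. A rough sketch of the idea, using a minimal stand-in registry rather than the real library internals:

```javascript
// Minimal stand-in registry to illustrate the proposed (hypothetical)
// hljs.replaceLanguage() API. Real highlight.js registers grammars via
// registerLanguage(name, fn); this sketch only models the lookup table.
const hljs = {
  _languages: {},
  registerLanguage(name, grammarFn) {
    this._languages[name] = grammarFn;
  },
  getLanguage(name) {
    return this._languages[name];
  },
  // Proposed API: swap an existing grammar for a drop-in replacement, and
  // refuse to "replace" a language that was never registered, so a typo
  // can't silently register a new language under the wrong name.
  replaceLanguage(name, grammarFn) {
    if (!this._languages[name]) {
      throw new Error(`Cannot replace unknown language: ${name}`);
    }
    this._languages[name] = grammarFn;
  },
};
```

Usage would match the inline example above: `hljs.replaceLanguage("php", hljsPHPSuper)` makes every later `"php"` lookup resolve to the replacement grammar, while CSS classes and the language name stay unchanged for consumers.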
For sure I'd support
Sure, I guess I was just imagining it might happen more organically than that... let people vote with what they build - but then I'm not sure how many people build this themselves vs just use a packaged version... Obviously if they're just using the default set then they're getting what core wants them to have in any case - and would have to plug in things on top of that. |
The problem I see with voting with what they build is: how can you actually track that? Unless there is analytics code built into the build process, I don't see a feasible way to know what people are using. What I do know, though, is that a good amount of people are using "what the core wants them to have", since they are just pulling it from NPM (based on weekly downloads) and not building it themselves.
The whole reason I am here is because I wanted Discord to support my language, but after speaking with people, the general consensus is that I should try to get my language into highlightJS itself, since they don't want to add another library; they just want to pull highlightJS and ideally have the language without any extra hassle.
I know I may have some bias, but I honestly think the language situation should be sorted out in general before worrying about someone making a different flavour of PHP; the people who care about having a different flavour of PHP are most likely the same people who would be willing to build the package themselves to get that flavour. Since right now new languages aren't being used, and to echo what was said in this comment
that applies to anything in this project, not only a style, a language grammar as well. |
There are several problems with this; you might want to see related discussion here. (I've simply gone ahead and moved your thoughts as I've responded to them.) The hard problems, such as lack of developer time and security concerns, might be hard to solve. Security is still a very real concern, since obviously just loading random files from a CDN that no one validates is a huge code-injection attack waiting to happen. Auto-loading is harder, since right now the official languages are over 1 MB of JavaScript, not even counting 3rd party languages. Some will point out that only a few are really large, and there is some truth to this, but then we're right back to having some party who plays god, picking and choosing which languages are "blessed" and which are not. :-) So from where I sit, it seems like JIT loading really only works when you know the language you need to highlight in advance.
I think we'd be open to a PR that supported "just in time" loading of languages via CDN, where the CDN was configured at load time when HLJS was initialized. That could definitely be a small piece of the puzzle, allowing someone else to step in and run a "trusted" CDN source for a broader set of community languages.
I think this is a LONG way off. It would first require a blessed community repo and it would require everyone changing their configurations to automatically trust that repo. But if people are willing to put in the time and work towards that goal, that'd be awesome. It might be easier to first add this support and then enable it for all core languages - such that if a language isn't compiled in we first try to fetch it from the official core library CDN (and make that easy to configure). That would instantly increase the # of languages available for highlighting. |
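A sketch of what that opt-in, configured-at-init JIT loading might look like. The `cdnBase` option, the `createLanguageLoader` helper, and the URL scheme are all hypothetical (nothing like this exists in highlight.js today); the key property is that it only ever fetches from the single CDN base the integrator trusted at load time:

```javascript
// Hypothetical JIT loader: fetches grammars only from the one CDN base
// configured at init time -- never from arbitrary, caller-supplied URLs.
// fetchScript is injected (e.g. a <script> tag inserter in the browser).
function createLanguageLoader({ cdnBase, fetchScript }) {
  const loaded = new Set();
  return async function loadLanguage(name) {
    // Reject anything that could escape the configured CDN path.
    if (!/^[a-z0-9-]+$/.test(name)) {
      throw new Error(`Invalid language name: ${name}`);
    }
    if (loaded.has(name)) return false; // already present, nothing fetched
    await fetchScript(`${cdnBase}/languages/${name}.min.js`);
    loaded.add(name);
    return true;
  };
}
```

Constraining the fetch to one pre-configured, trusted base (and validating the name so it can't smuggle in path segments) is what separates this from the "load random files from a huge CDN" scenario criticized above.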
Honestly I have always had an issue with your usage of "blessed", at what point do you put your foot down and remove older, larger languages that are a detriment to the project and are unfair for new languages? Just because you have to remove a language from the officially supported list doesn't mean that the language is bad or that the maintainers of HLJS don't like the language. Like I said ages ago:
4 languages take up over 40% of the whole file size, and that isn't even counting compression such as GZIP and Brotli.
- Using GZIP (which is supported by all major browsers) takes the total size down to 478 KB, 66% of the file size.
- Still using GZIP, but excluding ISBL, takes the compressed size down to 423 KB, 32% of the original file size.
- Still using GZIP, but excluding the "big 4" that I mentioned earlier, the file size is down to 278 KB, only 24% of the original file size.
That is just with GZIP. Using Brotli (disclaimer: I know this is still fairly new and isn't implemented everywhere, or in all browsers, but it is still a compression algo that is being used), the file sizes go down even more:
- With Brotli, all the languages are only 35% of the original size (244.6 KB).
- With Brotli, all the languages excluding ISBL are only 32% of the original size (227.4 KB).
- With Brotli, all the languages excluding the "big 4" are only 24% of the original size (171.2 KB).
CDNJS has Brotli enabled by default, btw, so you are already getting these benefits: the "core set" of languages that ships with HLJS on its own is already being compressed to just 38.6% of its original size.
Maybe you should have "some party who plays god picking and choosing which languages are "blessed" and which are not", since having 4 languages that take up over 60% of the project's file size is a bit ridiculous in general, regardless of whether it is stopping new languages from being accepted or not. In my honest opinion those 4 languages should be made 3rd party; you could fit over 70 other languages in the same space they took (based on my calculation of the average earlier, which is being skewed by these 4 languages, so it is possible you could fit even more).
There are a ton of issues with this. Unless you control what is being posted on the CDN, you have no control. Let's use CDNJS for example. Let's say you implemented this, and made it so that if a language isn't found, it looks on CDNJS for it. Sure, that would work: it would find the language (if it existed) and load it. However, what happens if I come along and upload a malicious package to CDNJS under a plausible language name? Then I go onto a forum and post a message marked as that language, and everyone viewing the page auto-loads my code. You're just creating an even bigger security risk.
And here we have the actual issue with languages, developers don't want to have to read the source code of every library that they're using, but at the same time they don't want to just add every NPM package they see for whatever language one exists for. Even if they did do that, are developers expected to monitor NPM like a hawk looking for new languages to add? They would maybe do it once, and then never again, leaving new languages to never be noticed.
This is a very closed-minded view. If someone made an Electron app that relied on JIT being a thing, and only loaded in the core set of languages as a base, what happens when that app is used offline? Not everything is web based, so sure, this solution could maybe work for web-based usages; for offline usages it doesn't do much. (Sure, the developer of the app could write an offline mode that downloaded all the languages, but then we're just back where we started: how does that developer know what all the languages are, and where to get them?)
I honestly don't think there is a solution to this problem using the current code base. At this point I would be pushing for going to version 10.0.0: start with the core set of languages and have people PR new languages in (with the PR having a checkbox saying something like:
By increasing the major version number, it tells people that this is incompatible with previous versions, and it gives you a clean slate to add new languages and set proper guidelines, so you don't have things like languages that take up 60% of the file size. Also, things won't just break for old projects, but new, active projects could take advantage of 10.0.0 and the new languages it could provide.
I'm not sure what you're ranting about for half that. This is why we don't distribute a FULL monolithic build. We already bless some languages for "common" and that currently looks like:
This is the default library we publish. 28kb gzipped.
I'm not sure I completely disagree (about removing some), but I don't understand your accounting of space, since we don't include them by default... and the user ultimately decides if they are worthwhile or not... or if they just use the prepackaged library then they don't get them at all - or can fetch them from CDN. Personally I worry more about which languages require the most MAINTENANCE. A huge language that just sits there and everyone is happy and requires no maintenance and isn't in the default "common" set is something, but it's NOT a huge concern.
It's not. We're not "out of space". :-)
Well someone would have to either trust their CDN or host it themselves. This is already an issue for anyone using CDNs. If you don't trust your CDN, then you shouldn't be using it... If it's easy enough for someone to add another file I don't see why they couldn't also easily just change the core file. And you're screwed either way.
I don't think anyone (certainly not myself) ever suggested we load RANDOM files from a HUGE CDN that collects massive libraries... the idea would be that you could point to a SPECIFIC CDN build of highlight.js and fetch from there, or you could add languages one-off by URL (as you can already do). I would oppose a feature in core that loaded almost random URLs, as would hopefully any sane person. :-) When that has been discussed, it's been in the context of BUILD time (with people picking and choosing manually), not automatically at run-time.
No, check our README of known 3rd party languages. :-) IF someone wants to be on the list they make a PR. We already do this.
I'm not sure how this is a new problem. It wouldn't be "CDN or nothing". If someone wants a monolithic build they can still always do that. I'm talking about web usage here when we're talking about CDNs and JIT.
But if it's a 3rd party language it's really not our problem - so it's a huge difference. If it becomes OBVIOUS a 3rd party language is completely dead, dead and benefiting no one then it can always be removed from the README. A checkbox means nothing.
Again, not sure why you keep coming back to this. :-) If you're building a monolith with all 1mb of languages you should stop doing that. ;-) |
@jaredlll08 The first step in removing ANY languages from core is making 3rd party language support silly smooth, so if you really want to force some languages to "walk the plank" I'd suggest finding a way to contribute to the 3rd party language support. :-) |
You are the one the said:
I'm just saying:
You did suggest loading libraries here though:
Regardless of where it is posted, if I had an official highlightJS repo, I assume I would be the one doing builds and releasing them. So once accepted with a nice "safe" language, I could push nasty code, have that be built and pushed to the safe CDN, and do the exact same thing. This could be solved by having a core maintainer look over the changes before pushing, but at that point you're just adding more work for the core maintainers, and that still doesn't ensure that a core maintainer didn't just glance at the commit names and approve based on that.
That is what you are pushing to NPM. Please don't stop making NPM builds.
Like I said, I don't think languages should have to walk the plank (unless all languages besides the core languages are removed and a new system is in place, like I said in the 10.0.0 suggestion
I did try and find a way to contribute to the 3rd party language support, I suggested using submodules, and made a Proof Of Concept of using them. Which apparently "no one is really in FAVOR of that". Looking back though, on this:
https://github.com/issues?utf8=%E2%9C%93&q=is%3Aopen+is%3Aissue+user%3Ahighlightjs |
Ah, I think I led you down a confusing road. I meant to say auto-detection, or at least that's the realm in which I was talking about auto-loading... If you only have a few languages, there isn't any need to auto-load them, in my opinion. But if you run a blog where different pages might use, say, ALL of the languages (over time), then it might be VERY useful to only load the one you need for a given article, etc...
So my point about size really has nothing to do with the size of individual packages, and rather more to do with the size of the total, or perhaps even the quantity. For auto-detection, all the languages have to be loaded in advance (so we can scan them and see which one "wins"). So having 1000 small languages that add up to 1mb is just as bad as a few big ones... you have to load them all to auto-detect them all.
Loading "as needed" from a CDN just doesn't really work for auto-detect, because you can't detect until AFTER you've loaded... but on-demand loading could be great if you know the languages in advance, since then you can load only what you need. So that's what I was talking about. The individual size of any language doesn't really matter so much. If one language is TRULY 100kb but you NEED it, well, that's 100kb you have to download and just deal with... shrugs
No plans to. True, but for MOST people (AFAIK) the size of our NPM build is not really an issue. If you don't want to parse all that JS, our README has easy instructions for loading only the languages you need. Or you could build a custom NPM package yourself. If someone is living in an environment where a 1mb server-side package is a real issue, they need to be doing a custom build.
Well by the very definition (currently) core language are only those that are included. :-) If we drop one, it's no longer a core language. Perhaps you meant "common" language or popular or some other metric?
Interesting. Though I don't know how easy it is to manage permissions at that level, but still interesting.
But I think (trying to remember) the problem there was that you were still talking about putting them in OUR repo, yes? I believe we desire a bit more isolation than that. One logical conclusion to what you were trying to do is to make a whole separate community repo. Once we had a nice build system, to me the next logical step is for someone to tie it all together with a bow, but I'm just not sure who that person is going to be - or if that's what the community really wants. :-)
@jaredlll08 To me the fun/scary/interesting thing here is that the 3rd party stuff exists OUTSIDE core. That is a limitation in some ways, sure, but also a huge freedom. If you have a plan and the time feel free to build a larger proof of concept... and share it with people. See if it works, see if they like it, see if you get contributors. You don't necessary need core's explicit blessing to whatever you want. One could imagine a parent repo where you had like:
I'm very open to making the "extra language" path configurable somehow... so then you could just check everything out and point the extra-language source at your checkout. Actually, after the new build stuff works, the only thing you might have to do here is create the repo that ties it all together (and of course figure out how/where to share/publish).
While it may be just as bad in terms of performance, it would be better (in my opinion) to have 1000 languages being supported, compared to 180 languages. If I have understood what you've said correctly (using my hypothetical numbers), would it not be better to be able to detect those 1000 languages instead of having the big language files limit the project? (In theory, removing them would make the current project faster?)
You're right, I did mean "common", I forgot what name you used for them, sorry for the confusion.
What permissions? It just gives a link to all the issues, so you could go into them and it should be fine. If you're talking about things like setting author/labels/assignee without having to go into the issue itself or on multiple issues, then yes, that wouldn't be possible with that link unfortunately.
I just gave a simple POC, it could be implemented in any repo, so doing sub modules in a community repo would work. |
That's why we have common groups of languages. We do "remove" them for 95% of the users of Highlight.js who are just using the default distribution.
Maybe; that's why we leave it up to the user how they build the library. If max speed/max languages is a criterion for you, then you'd simply build the library with a cap on the size and only include languages below a threshold. In practice this has been a non-issue, since people just use the default build, which gives you a nice common platform to work from. If we added JIT loading, then the "common" build would gain support for all 185 languages (via auto-load). And that's about as good as we can get (citing the previous issues with having to load a language in order to auto-detect it). Going broader than that requires someone stepping up and maintaining a 3rd party CDN and solving all the related issues of doing so (then you could use it as your JIT source) and supporting who knows how many languages via JIT. Auto-detection (at least for the present time) is always going to be limited by the languages you've decided to build into the library.
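The "cap on the size" idea could be a one-line filter step in a custom build script. A sketch (the language names and byte counts below are made-up illustration data, not real highlight.js grammar sizes):

```javascript
// Hypothetical pre-build filter: keep only grammars whose compiled size
// stays under a byte threshold, so the auto-detection set stays cheap to load.
function selectLanguages(candidates, maxBytes) {
  return candidates
    .filter((lang) => lang.bytes <= maxBytes)
    .map((lang) => lang.name);
}
```

The resulting name list would then be fed to whatever per-language build the user already runs; anything over the threshold stays available as an individually loadable file rather than being compiled into the detection set.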
I was referring to managing them, etc. I don't think I magically have permission to administrate all of the issues for the highlightjs organization. That's probably possible, but it's another nuance of having things in lots of different places.
Exactly, I do that stuff ALL the time. :-)
Not opposed to seeing someone try that. :-) I'd suggest you consider subtrees though as I mentioned I've read they are a lot more sane. :-) But maybe you're a die hard submodule believer, which is ok too. :-) |
@jaredlll08 There is no need to discuss removing languages from "core" further here. None of the maintainers is in a hurry to do that, so it's simply not going to happen soon - if ever. I'd personally like to remove some, but I'm not in a hurry either... as for most people, they have no day-to-day impact on their usage of the library. As far as removing from "common", I don't think we need to... there is a thread on that and no one really spoke up, plus it'd be a breaking change for people. If we were going to, the transition to v10 would be a good time, but honestly I don't think there are any TRULY worth removing. The current gzip size is 28kb and I'm pretty happy with that. So let's try to move past the fact that we have a few large languages and focus on the other things here.
@jaredlll08 If you felt strongly and would like to publish a npm-highlight-js-small package or something that'd be understandable... but as said elsewhere I think the 1mb installed size of our npm package also isn't really an issue for 99% of people. And that's really the only place most people actually "feel" the impact of those extra languages (npm package size). Or CDN size if they hosted the CDN files, but 1mb of assets is really nothing for web hosting. |
Closing. Inactive thread. |
I see we have a lot of requests for new languages in the PRs... if the issue is time/maintenance over time, etc... might we perhaps consider an "extra" repository or something with community/unsupported syntaxes? That way the criteria to be approved could be lessened a little, and obscure languages that might not really make sense in Highlight.js proper could still have a home?
Or is the idea that eventually we'll get to them?