VoiceOver-Style Language Switching #21

Open · TTWNO opened this issue Sep 5, 2022 · 16 comments

Labels: enhancement (New feature or request), help wanted (Extra attention is needed), TTS (Improvements to the text to speech subsystem)

@TTWNO (Member) commented Sep 5, 2022

VoiceOver allows you to switch languages mid-string as well as keep settings unique to each language's voice.
For example, if I read the following sentence in VoiceOver: "In Chinese, 你好 means hello!"

VoiceOver will automatically switch between voices and settings to describe the entire sentence in one smooth motion. This can be done even on the tiny CPU of the iPhone SE. Side note: it's not quite that simple; I think the block of foreign text must be a bit longer before it switches voices, but there is definitely a threshold, and it can switch mid-sentence.

Odilia should eventually support this feature. Obviously, without voices set up through speech-dispatcher you may need to fall back to espeak, but it's still a way to read and write multi-language documents without needing to switch back and forth manually.

Language identification, unless I'm completely wrong, is very likely a fairly complex process, relatively speaking. So it should always be optional, exposed as a setting the user can change.

I believe this should be possible through the SSIP protocol with speech-dispatcher. I haven't looked deeply enough to figure this one out myself, but I suspect that if it isn't possible that way, then I'm really not sure how else it could be done. More research required.

@TTWNO added the enhancement, help wanted, and TTS labels on Sep 5, 2022
@albertotirla (Member)

There are a few ways to deal with this; two come to mind right now:

  • The most straightforward one is to just rely on text attributes, especially on the web. Correct me if I'm wrong, since I don't have much of a web background, but every HTML page can be marked as being in a specific language with the lang attribute, for example lang="en-US". In that case, wouldn't it be logical that specific paragraphs, or pieces of text within a paragraph, could similarly be annotated with language tags? As an aside, I think that's how Wikipedia does it.
  • We can use something like lingua-rs, which doesn't use HTML; instead it properly detects the language used in the text. This requires more processing power and would have to sit behind a user-configurable flag; it isn't meant to be on by default because of the far-from-small memory and CPU consumption. I believe it uses machine learning or something close to it, so the resource hog is expected.

Also, what you saw in VoiceOver is voice-specific, not VoiceOver-specific; some of it can be done in espeak as well. What's happening there is that the voice itself knows when a transition from a Latin to a non-Latin alphabet is happening, so it does its own language selection when that text is given to it. I can't give Chinese or Japanese examples because espeak doesn't support them, so I will do something similar with Ukrainian. DeepL translate says "ласкаво просимо до оділії скринрідера!" means "welcome to the Odilia screen reader!" If you read that with espeak, you will hear it change voice and language to pronounce it as well as it can, given your locale, codepage, and such. Even though that's not VoiceOver-, NVDA-, or Orca-specific, we can potentially make it Odilia-specific, as long as the speech-dispatcher module currently in use supports the detected language; the detection can be wrong sometimes, but it's better than nothing.
Also, we have the problem that speech-dispatcher doesn't allow us to change language mid-sentence. However, we can do language processing before feeding the text to speech-dispatcher: in that processing phase we insert speech markers wherever the language changes, if we can accurately determine that, and when the callback fires with a marker-reached event, we know to change language. We could track which language to switch to with some kind of mapping from text position, marker name, or whatever that marker event contains, to language. Yes, this may delay speech considerably, I'm not sure, but it's a plan of action if nothing else comes to mind by the time that feature gets implemented.
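
To make that concrete, here is a minimal sketch of the marker bookkeeping described above. The Speech trait is a hypothetical placeholder, not the real speech-dispatcher bindings; the only assumptions are that we can queue text and named markers, and that we get a callback when a marker is reached.

```rust
use std::collections::HashMap;

/// A run of text detected as being in a single language.
struct LanguageRun {
    language: String, // e.g. "en", "uk"
    text: String,
}

/// Hypothetical TTS interface; the real speech-dispatcher API will differ.
trait Speech {
    fn queue_marker(&mut self, name: &str);
    fn queue_text(&mut self, text: &str);
    fn set_language(&mut self, lang: &str);
}

/// Queue all runs, inserting a named marker before each language change
/// and recording which language each marker stands for.
fn queue_with_markers(tts: &mut dyn Speech, runs: &[LanguageRun]) -> HashMap<String, String> {
    let mut marker_to_lang = HashMap::new();
    for (i, run) in runs.iter().enumerate() {
        let marker = format!("lang-switch-{i}");
        marker_to_lang.insert(marker.clone(), run.language.clone());
        tts.queue_marker(&marker);
        tts.queue_text(&run.text);
    }
    marker_to_lang
}

/// Called from the marker-reached callback: switch the voice to the
/// language the marker was recorded with.
fn on_marker_reached(tts: &mut dyn Speech, marker: &str, map: &HashMap<String, String>) {
    if let Some(lang) = map.get(marker) {
        tts.set_language(lang);
    }
}
```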

@mcb2003 (Contributor) commented Sep 9, 2022

  • The most straightforward one is to just rely on text attributes, especially on the web. Correct me if I'm wrong, since I don't have much of a web background, but every HTML page can be marked as being in a specific language with the lang attribute, for example lang="en-US". In that case, wouldn't it be logical that specific paragraphs, or pieces of text within a paragraph, could similarly be annotated with language tags? As an aside, I think that's how Wikipedia does it.

From my understanding this is correct, yes.

  • We can use something like lingua-rs, which doesn't use HTML; instead it properly detects the language used in the text. This requires more processing power and would have to sit behind a user-configurable flag; it isn't meant to be on by default because of the far-from-small memory and CPU consumption. I believe it uses machine learning or something close to it, so the resource hog is expected.

Will look into this more, but cool.

Also, we have the problem that speech-dispatcher doesn't allow us to change language mid-sentence. However, we can do language processing before feeding the text to speech-dispatcher: in that processing phase we insert speech markers wherever the language changes, if we can accurately determine that, and when the callback fires with a marker-reached event, we know to change language. We could track which language to switch to with some kind of mapping from text position, marker name, or whatever that marker event contains, to language. Yes, this may delay speech considerably, I'm not sure, but it's a plan of action if nothing else comes to mind by the time that feature gets implemented.

A much simpler solution would be to use SSIP blocks.
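
For illustration, here's roughly what that could look like on the wire. This is a sketch from the SSIP documentation as I remember it (server responses omitted; message bodies are terminated by a lone dot), so the exact set of commands permitted inside a block should be double-checked against the speech-dispatcher docs:

```
BLOCK BEGIN
SET self LANGUAGE en
SPEAK
In Chinese,
.
SET self LANGUAGE zh
SPEAK
你好
.
SET self LANGUAGE en
SPEAK
means hello!
.
BLOCK END
```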

@trypsynth

Unicode character ranges can also be used for most languages with non-Latin alphabets, for what it's worth. It might also be worth looking into how NVDA on Windows does this.

@albertotirla (Member)

Unicode character ranges can also be used for most languages with non-Latin alphabets, for what it's worth. It might also be worth looking into how NVDA on Windows does this.

NVDA doesn't do a very good job of it either, as far as I know; I'm not speaking from a coding/implementation viewpoint here, rather from a user one. Most of the language processing in NVDA is handled either by the synthesizer currently speaking or by NVDA itself, but as far as I know, NVDA only switches language when UIA or whatever changes the language attribute of the currently read text to something other than the language of the current voice, for example when a paragraph is annotated with the language attribute. About using character ranges: that's probably one of the tricks lingua-rs uses as well, but alone it doesn't guarantee any reliability whatsoever. For example, just try distinguishing, based on that method, German text from an English translation. We know that German has ü, ö, ä, and ß, but once we've identified those, what do we do? Consider the whole lexical unit German, or try to identify, with a German dictionary, the smallest part of the unit that's German and speak that? What can even be considered a lexical unit, how do we do this, do we build a synthesizer-level engine and shove that into Odilia? Or maybe I'm misunderstanding what you mean, in which case please post back with an example or a wider explanation, since all of this will be taken into account when we arrive at that feature set and have to revisit this in order to implement it.

@trypsynth

Look at www.rikai.com/library/kanjitables/kanji_codes.unicode.shtml for a list of different languages and their Unicode character ranges. It wouldn't work for all languages, though.
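
As a rough sketch of what the range-based approach looks like in practice (ranges taken from the corresponding Unicode blocks; this identifies the writing system only, not the language):

```rust
/// Coarse script classification by Unicode block. This only identifies
/// the writing system, not the language: Cyrillic could be Russian,
/// Ukrainian, Bulgarian, and so on.
#[derive(Debug, PartialEq)]
enum Script {
    Latin,
    Cyrillic,
    Hiragana,
    Katakana,
    CjkIdeograph,
    Hangul,
    Other,
}

fn classify(c: char) -> Script {
    match c {
        'A'..='Z' | 'a'..='z' | '\u{00C0}'..='\u{024F}' => Script::Latin,
        '\u{0400}'..='\u{04FF}' => Script::Cyrillic,
        '\u{3040}'..='\u{309F}' => Script::Hiragana,
        '\u{30A0}'..='\u{30FF}' => Script::Katakana,
        '\u{4E00}'..='\u{9FFF}' => Script::CjkIdeograph,
        '\u{AC00}'..='\u{D7AF}' => Script::Hangul,
        _ => Script::Other,
    }
}

fn main() {
    assert_eq!(classify('h'), Script::Latin);
    assert_eq!(classify('你'), Script::CjkIdeograph);
    assert_eq!(classify('л'), Script::Cyrillic);
}
```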

@albertotirla (Member)

I wanted to reply to your comment via email, but I guess GitHub doesn't want me to do that, so I'll have to post in this field again.
Thanks for that link; it will be very useful, even though I personally don't understand much of it, since it's not an actual HTML table and it's kind of confusing me. Yes, I see what you mean now. However, those character ranges are pretty much all non-Latin alphabets, i.e. scripts like hiragana and katakana, so that method won't help us separate, say, English from German; plus, a synthesizer with that capability can already recognise such languages on its own.

@TTWNO added this to the 1.0 milestone on Oct 3, 2022
@C-Loftus (Contributor) commented Mar 2, 2025

I've been interested in this issue lately and played around with using lingua in Go: https://github.com/C-Loftus/MultilingualSpeechDispatcherClient/. I was curious whether there have been any new thoughts or developments on this issue since the last conversation.

It seems that lingua works fairly well, especially if you have a setting in a config file or CLI arg that limits which languages you care about disambiguating. It lazy-loads the NLP model, so if you only care about disambiguating two languages like Russian vs English, which are easy to distinguish based on their writing systems, it won't even use NLP and is very fast and lightweight. It can also tell you specifically at which tokens the input changes from one language to another.
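
For reference, the restricted-set setup looks roughly like this in lingua-rs, going by its README (detect_multiple_languages_of is the call that reports where the language changes mid-input; the exact API should be verified against the crate docs):

```rust
use lingua::{Language, LanguageDetector, LanguageDetectorBuilder};

fn main() {
    // Only the languages the user cares about; with distinct scripts
    // (English vs Russian) lingua can decide on rules alone, without
    // loading the statistical models.
    let detector: LanguageDetector =
        LanguageDetectorBuilder::from_languages(&[Language::English, Language::Russian])
            .build();

    let text = "Parts of this sentence are написаны по-русски.";
    for result in detector.detect_multiple_languages_of(text) {
        println!(
            "{:?} from index {} to {}",
            result.language(),
            result.start_index(),
            result.end_index()
        );
    }
}
```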

However, for disambiguating languages with the same writing system, it hogs a lot of memory, around 80 MB extra, and should be opt-in. There are ways to reduce this, by using the smaller, lower-accuracy model and passing in more surrounding context, or by unloading it from memory when unused, but that's an optimization question that probably deserves a discussion of its own.

My understanding is that if you want to disambiguate a large number of different languages, you essentially need to use NLP: a rule-based approach that looks for Unicode character ranges has too many edge cases, and trying to look words up in a dictionary doesn't reflect the probabilistic nature of word occurrence in the respective languages.

I am not sure how VoiceOver works internally, but just from testing recently, even VoiceOver often doesn't switch to the proper language for languages with similar writing systems (e.g. English vs Spanish). Using HTML tags like lang="es" is a trivial way to once again skip NLP, but I find these often don't really reflect the content of the page, especially if it is user-generated. See an example like this readme, which is in Chinese but where the lang tag from GitHub is always set to en regardless. I'm somewhat of the opinion that heuristics based on HTML tags may just add complexity without actually adding accuracy in many situations. And then, obviously, outside the browser, or in user-generated text like an email or a text file, such tags will not be present. It could be useful to detect the user's system language, though, to bias the model toward selecting it more often on small input sentences where the language is especially ambiguous.

With that in mind, I'm just curious whether you think that having some sort of lingua-rs wrapper over the general SSIP client, with an associated languages_to_disambiguate setting in Odilia (or something analogous), may make sense. (It would presumably default to None and thus not use lingua at all unless opted into.) It is not perfect and there are tradeoffs, but I think lingua is the best solution if you want to support many different languages across the entire desktop.

I may look at this myself at some point, not sure; there's never any urgency on any of this. Just curious to check in, hear others' thoughts, and document my own.

@TTWNO (Member, Author) commented Mar 2, 2025

However, for disambiguating languages with the same writing system, it hogs a lot of memory, around 80 MB extra, and should be opt-in. There are ways to reduce this, by using the smaller, lower-accuracy model and passing in more surrounding context, or by unloading it from memory when unused, but that's an optimization question that probably deserves a discussion of its own.

I agree it should probably be opt-in. But my biggest concern is not absolute CPU/memory usage but latency. It looks like there is a pretty significant performance penalty for using lingua, in the range of 50-100 ms+; that is nowhere near acceptable for a screen reader. That said, the benchmarks for whichlang (1.5 ms) do seem reasonable, and it has... OK accuracy. At first, I thought the multi-threaded benchmarks would fare well, but those benchmarks calculate throughput over many threads; the benchmark does not parallelize the detection algorithm itself.
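
For context, whichlang's whole API is essentially a single call, roughly like this (sketched from its README; it covers a small fixed set of languages):

```rust
use whichlang::{detect_language, Lang};

fn main() {
    // One cheap call returning a coarse language code.
    let lang: Lang = detect_language("Ceci n'est pas une pipe.");
    println!("{lang:?}"); // expected: Fra
}
```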

I'd still be open to having a more accurate model activated with a config option, as long as it's made clear that Lingua adds significant latency.

I am not sure how VoiceOver works internally, but just from testing recently, even VoiceOver often doesn't switch to the proper language for languages with similar writing systems (e.g. English vs Spanish). Using HTML tags like lang="es" is a trivial way to once again skip NLP, but I find these often don't really reflect the content of the page, especially if it is user-generated. See an example like this readme, which is in Chinese but where the lang tag from GitHub is always set to en regardless.

Never even considered that. Obviously, it'd be nice if we had perfect data to work with, lol! Even having a manual way to change languages (which we currently do not have) would be an improvement. At least then the user could choose between their activated locales (think something like the output of localectl list-locales, but without opening a shell).
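
A minimal sketch of gathering those choices, assuming we just shell out to localectl (hypothetical glue code, not anything Odilia does today):

```rust
use std::process::Command;

/// List the locales the user has generated, e.g. ["en_US.UTF-8", "uk_UA.UTF-8"].
/// Equivalent to what `localectl list-locales` prints, one per line.
fn list_locales() -> std::io::Result<Vec<String>> {
    let output = Command::new("localectl").arg("list-locales").output()?;
    Ok(String::from_utf8_lossy(&output.stdout)
        .lines()
        .map(str::to_owned)
        .collect())
}

fn main() -> std::io::Result<()> {
    for locale in list_locales()? {
        println!("{locale}");
    }
    Ok(())
}
```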

Thanks for looking into this. I do really want to have some multi-language support (even if it just means I can use Odilia to read my flashcards without explicitly changing the language :)

@C-Loftus (Contributor) commented Mar 2, 2025

It looks like there is a pretty significant performance penalty for using lingua, in the range of 50-100 ms+

In my testing, at least, lingua appears to have that 100 ms latency on the first recognition, when it has to lazy-load the model, but after that recognitions are very fast, at least for shorter text. So the latency appears to be at least partially caused by loading the model into memory the first time. In my testing I am getting less than 5 ms latency consistently on an Intel 12th Gen Core i9-12900K, but this could differ depending on the CPU and how much text you pass in at once.

host@computer ~/g/bilingualSynth (master) [1]> ./example/script.sh
2025/03/02 12:49:09 DEBU Trying to detect the following languages: [Spanish,English]
2025/03/02 12:49:10 INFO Connected to Speech Dispatcher
Enter text to detect language (CTRL+D to exit):
2025/03/02 12:49:10 DEBU Detected 1 language sections after 126ms
2025/03/02 12:49:10 DEBU Detected language ES for substring 'Hey, Sofia, ¿cómo estás?'
2025/03/02 12:49:12 DEBU Detected 1 language sections after 0ms
2025/03/02 12:49:12 DEBU Detected language ES for substring 'I’m good, Alex. ¿Y tú?'
2025/03/02 12:49:15 DEBU Detected 1 language sections after 3ms
2025/03/02 12:49:15 DEBU Detected language ES for substring 'Bien, gracias. '
^C
Maximum memory used: 86.88 MB

If you don't load the model and just use lingua's rule-based disambiguation for easy language comparisons like English vs Arabic vs Russian vs Chinese, it never has this upfront cost and appears to be nearly instant, always. I think even just being able to automatically speak these based on lingua's rule-based parsing would benefit a lot of users. Unfortunately, English vs Danish vs German, or combinations like that, would be much harder.

And FWIW I am not attached to lingua or anything; I will take a look at whichlang. I mainly was hoping to drop this here for context and future documentation. Going to continue to poke around a bit and explore options.

@TTWNO (Member, Author) commented Mar 2, 2025

If you don't load the model and just use lingua's rule-based disambiguation for easy language comparisons like English vs Arabic vs Russian vs Chinese, it never has this upfront cost and appears to be nearly instant, always. I think even just being able to automatically speak these based on lingua's rule-based parsing would benefit a lot of users. Unfortunately, English vs Danish vs German, or combinations like that, would be much harder.

Strange. The benchmarks for lingua itself show significantly worse results and appear to load the language models before running. Maybe the benchmark is reporting inaccurately, your examples are short (the benchmark sentences are quite long), or maybe I misread it. Take a look at the lingua benchmarks and see what you get; I'm curious.

And FWIW I am not attached to lingua or anything; I will take a look at whichlang. I mainly was hoping to drop this here for context and future documentation. Going to continue to poke around a bit and explore options.

Awesome! So glad to see you involved!

@albertotirla (Member)

I believe we should use something like lingua, if nothing else because it has good Rust integrations. As for the model-loading issue, we could load everything upfront and then maintain an instance of it in memory as a tokio task, sending messages to it instead of directly to the TTS instance, which... probably shouldn't be a task anymore in that case? Unfortunately, changing the set of languages to be detected would require an Odilia restart, but that will probably not be an issue, unless we're willing to load all the language models upfront and pay the memory cost once, using the library at its full potential?
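
A sketch of that task shape, with a trivial stand-in for the real detector (the channel protocol here is hypothetical):

```rust
use tokio::sync::{mpsc, oneshot};

/// Request to the detector task: text in, detected language (if any) out.
struct Detect {
    text: String,
    reply: oneshot::Sender<Option<String>>,
}

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::channel::<Detect>(32);

    // The task owns the (expensive) detector state for the life of the
    // process, so any models are loaded exactly once.
    tokio::spawn(async move {
        // Stand-in: report "uk" if we see any Cyrillic, otherwise None;
        // a real implementation would hold a lingua detector here.
        let detect = |text: &str| -> Option<String> {
            text.chars()
                .any(|c| ('\u{0400}'..='\u{04FF}').contains(&c))
                .then(|| "uk".to_string())
        };
        while let Some(req) = rx.recv().await {
            let _ = req.reply.send(detect(&req.text));
        }
    });

    let (reply, answer) = oneshot::channel();
    tx.send(Detect { text: "привіт".into(), reply }).await.unwrap();
    println!("{:?}", answer.await.unwrap()); // Some("uk")
}
```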

@TTWNO (Member, Author) commented Mar 4, 2025

changing the set of languages to be detected would require an Odilia restart, but that will probably not be an issue, unless we're willing to load all the language models upfront and pay the memory cost once, using the library at its full potential?

Does it actually require a process-level restart, or can we just reload all config values and keep going? If so, then even better: we'd have a way to test a new configuration with the ability to revert (much like changing visual settings does); the default would be to revert after 15 seconds or something like that. Just a thought.

If we do need a process-level restart, we could add it as a keybinding too. Fork, invoke Odilia via a shell, immediately exit? Could work.

@albertotirla (Member)

No idea how we'd just reload like that, especially because of the way the config works. Hmm, I'm thinking something simple can be done: using inotify we watch for changes to the config file and then quit and restart, or better, the configuration program sends us SIGHUP or SIGUSR1 and we act on it by quitting and restarting. Or, hmm, perhaps the arc-swap crate? I'm not sure we can load more models after the instance has been initialised anyway; I seem to remember that not being a thing, but I'll have to review the documentation again to make sure.
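
The signal variant is small with tokio's unix signal support (assuming the signal feature is enabled; the reload body is left as a comment since the config plumbing is exactly the open question):

```rust
use tokio::signal::unix::{signal, SignalKind};

#[tokio::main]
async fn main() -> std::io::Result<()> {
    // The configuration program (or the user) sends SIGUSR1 after
    // editing the config file.
    let mut reload = signal(SignalKind::user_defined1())?;
    loop {
        reload.recv().await;
        eprintln!("SIGUSR1 received, reloading configuration...");
        // re-read the config file here and swap it in
        // (see the arc-swap sketch further down)
    }
}
```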

@TTWNO (Member, Author) commented Mar 5, 2025

In my testing, at least, lingua appears to have that 100 ms latency on the first recognition, when it has to lazy-load the model, but after that recognitions are very fast, at least for shorter text. So the latency appears to be at least partially caused by loading the model into memory the first time. In my testing I am getting less than 5 ms latency consistently on an Intel 12th Gen Core i9-12900K, but this could differ depending on the CPU and how much text you pass in at once.

Looks like the sentences are repeated 125 times in the benchmarks: they actually measure running all 16 sentences 125 times, so my results should be divided by 2000 (16 × 125) to get a better picture of per-call latency. This still only applies to the single-threaded benchmarks, as the multi-threaded ones just split those 125 copies of the sentences across multiple cores: throughput, not latency.

Thank you for independently testing, as this got me curious as to the difference, and now I can say for sure I would be comfortable with lingua in Odilia.

Or, hmm, perhaps the arc-swap crate?

Yes. This is basically a perfect use case for it.
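
A minimal sketch of the pattern (the Config struct here is a stand-in for Odilia's real config type):

```rust
use arc_swap::ArcSwap;
use std::sync::Arc;

struct Config {
    languages: Vec<String>,
}

fn main() {
    // Readers call .load() for a cheap snapshot; the reload path calls
    // .store() with a freshly built config, and no reader ever blocks.
    let config = ArcSwap::from_pointee(Config {
        languages: vec!["en".into()],
    });

    println!("{:?}", config.load().languages);

    config.store(Arc::new(Config {
        languages: vec!["en".into(), "uk".into()],
    }));

    println!("{:?}", config.load().languages);
}
```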

@C-Loftus (Contributor)

I've done some exploration on this in the past month and wanted to summarize it here for others to reference:

Essentially, it seems that lingua-rs, as it stands right now, is not suitable for language detection in the context of screen reader use. I opened an issue a while back (pemistahl/lingua-rs#464), and if there are any updates I can post them here as well. To summarize: lingua does not appear to be able to distinguish languages on the very small text inputs one often gets in a screen reader context; confusingly, this is the case even for two languages like Chinese vs English, whose scripts are entirely different. I am hoping to receive a response on why this is, since it's possible I am misunderstanding something. I had hoped it would fall back to a rule-based approach in trivial cases, but that doesn't appear to necessarily happen.

It seems that the best way, for the time being, may be to detect the language based on Unicode character ranges, using a crate like https://github.com/the-type-founders/unicode-language-rs/. This has its own set of problems (some of which are rather interesting, like https://en.wikipedia.org/wiki/Han_unification), but the nice thing is that it does not require loading a model into memory and would be quite fast for trivial rule-based cases, as long as the two languages don't share the same script.

It remains an open question whether a simple rule-based approach that only distinguishes trivially different languages using different Unicode scripts (i.e. Russian vs English vs Chinese vs Korean, etc.) is useful. If it is, sometime in the future I could look at changing my draft PR in the ssip repo (odilia-app/ssip-client-async#22) to use this Unicode rule-based approach instead of lingua. (Obviously I still need to experiment a bit more to confirm that the library above is usable.)

@RoDmitry commented Apr 23, 2025

I'm not sure exactly what you need, but check out Langram. It's a complete rewrite of Lingua: 5x faster, more accurate, more languages. Or, if you just want to determine an alphabet/script, there is alphabet_detector, which is very fast and does not use any models. Maybe you can find a way to filter the output to get what you need.
P.S. Russian vs English vs Chinese vs Korean is an easy task for alphabet_detector, but is it enough for you? There are more than 120 languages written in Latin script; you might be missing big ones like French or Spanish.