VoiceOver-Style Language Switching #21
There are a few ways to deal with this; mostly two come to mind right now:
Also, what you saw in VoiceOver is voice specific, not VoiceOver specific; some of that can be done in espeak as well. What's happening there is that the voice itself knows when a transition from a Latin to a non-Latin alphabet is happening, so it does its own language selection when that text is given to it. I can't write Chinese or Japanese examples because espeak doesn't support those, so I will do something similar with Ukrainian. DeepL Translate says "ласкаво просимо до оділії скринрідера!" means "welcome to Odilia screenreader!" If you read that with espeak, you will hear it change voice and language to spell it as well as it can, given your locale, codepage and such. Even though that's not VoiceOver, NVDA or Orca specific, we can potentially make it Odilia specific, as long as the speech-dispatcher module currently in use supports the detected language, which can be wrong sometimes, but better than nothing.
From my understanding this is correct, yes.
Will look into this more, but cool.
A much simpler solution would be to use SSIP blocks.
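For illustration, here is a minimal sketch of what driving speech-dispatcher over raw SSIP could look like from Rust, switching the language between two messages inside a block. It uses only the standard library and assumes the default socket under `$XDG_RUNTIME_DIR`; whether `SET self LANGUAGE` is permitted inside `BLOCK BEGIN`/`BLOCK END`, and the exact reply codes (the reply handling here is simplified), should be checked against the SSIP documentation rather than taken from this sketch.

```rust
// Sketch: raw SSIP to speech-dispatcher with a language change inside a block.
// Reply handling is simplified; real code must read full (possibly multi-line)
// status replies and check the result codes.
use std::env;
use std::io::{BufRead, BufReader, Write};
use std::os::unix::net::UnixStream;

fn send(
    stream: &mut UnixStream,
    reader: &mut BufReader<UnixStream>,
    cmd: &str,
) -> std::io::Result<String> {
    write!(stream, "{cmd}\r\n")?;
    let mut reply = String::new();
    reader.read_line(&mut reply)?; // simplified: only the first reply line is read
    Ok(reply)
}

fn main() -> std::io::Result<()> {
    // Default speech-dispatcher socket location (assumption; configurable in practice).
    let path = format!(
        "{}/speech-dispatcher/speechd.sock",
        env::var("XDG_RUNTIME_DIR").unwrap_or_else(|_| "/run/user/1000".into())
    );
    let mut stream = UnixStream::connect(&path)?;
    let mut reader = BufReader::new(stream.try_clone()?);

    send(&mut stream, &mut reader, "SET self CLIENT_NAME user:odilia:lang-demo")?;
    send(&mut stream, &mut reader, "BLOCK BEGIN")?;

    for (lang, text) in [
        ("en", "welcome to the Odilia screen reader!"),
        ("uk", "ласкаво просимо до оділії скринрідера!"),
    ] {
        send(&mut stream, &mut reader, &format!("SET self LANGUAGE {lang}"))?;
        send(&mut stream, &mut reader, "SPEAK")?;
        write!(stream, "{text}\r\n.\r\n")?; // a lone "." line terminates the message body
        let mut queued = String::new();
        reader.read_line(&mut queued)?; // reply confirming the message was queued
    }

    send(&mut stream, &mut reader, "BLOCK END")?;
    Ok(())
}
```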
Unicode character ranges can also be used for most languages with Latin alphabets, for what it's worth. Might also be worth looking into how NVDA on Windows does this.
NVDA doesn't do a very good job of it either, as far as I know (not speaking from a coding/implementation viewpoint here, rather from a user one). Most of the language processing in NVDA is either handled by the synthesizer currently speaking or by NVDA itself, but as far as I know NVDA only switches language when UIA or whatever changes the language attribute of the currently read text to something other than the language of the current voice, for example if a paragraph is annotated with the language attribute.

About using character ranges: that's probably one of the tricks lingua-rs uses as well, but that alone doesn't guarantee any reliability whatsoever. For example, just try distinguishing, based on that method, German text from an English translation. We know that German has ü, ö, ä, and ß, but once we've identified those, what do we do? Consider the whole lexical unit German, or try to identify, with a German dictionary, the smallest part of that unit that's German and speak that? What can even be considered a lexical unit, and how do we do this? Do we make a synthesizer-level engine and shove that into Odilia?

Or maybe I'm misunderstanding what you mean, in which case please post back with an example or a wider explanation, since all this will be taken into account when we arrive at that feature set and have to revisit this in order to implement it.
Look at www.rikai.com/library/kanjitables/kanji_codes.unicode.shtml for a list of different languages and their Unicode character ranges. Wouldn't work for all languages, though.
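As a rough illustration of the character-range idea (standard library only, not any particular crate), the ranges below are a small, incomplete subset of the kind listed in such tables. As discussed above, this can only separate scripts, not languages that share a script.

```rust
// Rough illustration of script detection by Unicode ranges (std only).
// The ranges are a small, incomplete subset; languages that share a
// script (e.g. English vs. German) cannot be separated this way.
#[derive(Debug, PartialEq)]
enum Script {
    Latin,
    Cyrillic,
    Kana,
    Cjk,
    Hangul,
    Other,
}

fn script_of(c: char) -> Script {
    match c as u32 {
        0x0041..=0x024F => Script::Latin,    // basic Latin + Latin extended (approximate)
        0x0400..=0x04FF => Script::Cyrillic,
        0x3040..=0x30FF => Script::Kana,     // hiragana + katakana
        0x4E00..=0x9FFF => Script::Cjk,      // CJK unified ideographs
        0xAC00..=0xD7AF => Script::Hangul,
        _ => Script::Other,                  // punctuation, whitespace, everything else
    }
}

/// Split text into runs of the same script, so each run could be sent
/// to speech-dispatcher with a different language.
fn script_runs(text: &str) -> Vec<(Script, String)> {
    let mut runs: Vec<(Script, String)> = Vec::new();
    for c in text.chars() {
        let s = script_of(c);
        match runs.last_mut() {
            // Neutral characters (Other) stick to the current run.
            Some((last, buf)) if *last == s || s == Script::Other => buf.push(c),
            _ => runs.push((s, c.to_string())),
        }
    }
    runs
}

fn main() {
    for (script, chunk) in script_runs("In Chinese, 你好 means hello!") {
        println!("{script:?}: {chunk}");
    }
}
```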
I wanted to reply to your comment via email, but I guess GitHub doesn't want me to do that, so yeah, I'll have to post in this field again.
I've been interested in this issue lately and played around with using lingua in Go: https://github.com/C-Loftus/MultilingualSpeechDispatcherClient/ I was curious whether there have been any new thoughts or developments on this issue since the last conversation.

It seems that lingua works fairly well, especially if you have a setting in a config file or CLI arg that limits which languages you care about disambiguating. It lazy-loads the NLP model, so if you only care about disambiguating two languages like Russian vs English, which are easy to distinguish based on their writing system, it won't even use NLP and is very fast and lightweight. It can also tell you specifically at which tokens the input changes from one language to another. However, for disambiguating languages with the same writing system, it hogs a lot of memory, around 80 MB extra, and should be opt-in. There are ways to reduce this by using the smaller, lower-accuracy model, passing in more surrounding context, or unloading it from memory when unused, but this is another optimization question that is probably another discussion of its own.

My understanding is that, if you want to disambiguate a large number of different languages, you essentially need to use NLP, since a rule-based approach of looking for Unicode character codes has too many edge cases, and trying to look up words in a dictionary doesn't reflect the probabilistic nature of word occurrence in the respective language. I am not sure how VoiceOver works internally, but just from testing recently, even VoiceOver often doesn't properly switch to the proper language for languages with similar writing systems (i.e. English vs Spanish). Using HTML tags like

With that in mind, just curious if you thought that having some sort of lingua-rs wrapper over the general ssip client with an associated

I may look at this myself at some point, not sure, never any urgency on any of this, just curious to check in and hear others' thoughts and document my own.
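For reference, a sketch of what the lingua-rs side of this could look like: restricting the detector to a small set of languages and using its multi-language detection to find where the text switches. The API names follow the lingua-rs README and should be checked against whatever version actually gets used; `lingua` would need to be added as a dependency.

```rust
// Sketch of restricting lingua-rs to a small language set and finding where
// the language changes in mixed text. API per the lingua-rs README; names may
// need adjusting against the version in use.
use lingua::Language::{English, Russian};
use lingua::{LanguageDetector, LanguageDetectorBuilder};

fn main() {
    // Only the languages the user actually cares about: keeps the model small
    // and (per the discussion above) lets easy, script-based cases stay cheap.
    let detector: LanguageDetector =
        LanguageDetectorBuilder::from_languages(&[English, Russian])
            .with_preloaded_language_models() // pay the load cost once, up front
            .build();

    let text = "The sign said привет to everyone walking by.";
    for result in detector.detect_multiple_languages_of(text) {
        let chunk = &text[result.start_index()..result.end_index()];
        println!("{}: {chunk}", result.language());
    }
}
```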
I agree it should probably be opt-in. But my biggest concern is not absolute CPU/memory usage but latency. It looks like there is a pretty significant performance penalty for using lingua. I'd still be open to having a more accurate model activated with a config option, as long as it's made clear that Lingua adds significant latency.
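Purely as a hypothetical shape for such an opt-in setting (none of these field names exist in Odilia's configuration today, and this assumes serde's derive feature), it could look something like a deserialized config section along these lines:

```rust
// Hypothetical opt-in config section; field names are illustrative only and
// do not exist in Odilia's real configuration.
use serde::Deserialize;

#[derive(Debug, Deserialize, Default)]
#[serde(default)]
struct LanguageDetection {
    /// Master switch: language detection stays off unless the user opts in.
    enabled: bool,
    /// Languages the detector is allowed to pick between (BCP-47 tags).
    languages: Vec<String>,
    /// Trade accuracy for lower latency/memory, e.g. a smaller model.
    low_accuracy_mode: bool,
}
```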
Never even considered that. Obviously, it'd be nice if we had perfect data to work with, lol! Even having a manual way to change languages (which we currently do not have) would be an improvement. At least then the user could choose between their activated locales (think something like the output of

Thanks for looking into this. I do really want to have some multi-language support (even if it just means I can use Odilia to read my flashcards without explicitly changing the language :)
In my testing at least, lingua appears to have that ~100 ms latency on the first recognition if it has to lazy-load the model, but after that, recognitions are very fast, at least for shorter text. So it appears that the latency is at least partially caused by loading the model into memory the first time. In my testing I am getting less than 5 ms latency consistently on an Intel 12th Gen Core i9-12900K, but this could be different depending on the CPU and how much text you pass in at once.
If you don't load the model and just use lingua's rule-based disambiguation for easy language comparisons like English vs Arabic vs Russian vs Chinese, it never has this upfront cost and always appears to be nearly instant. I think even just being able to automatically speak these based on lingua's rule-based parsing would benefit a lot of users. Unfortunately, English vs Danish vs German, or combinations like that, would be much harder. And FWIW I am not attached to lingua or anything; I will take a look at
Strange. The benchmarks for
Awesome! So glad to see you involved!
I believe we should use something like lingua, if nothing else because it has good Rust integrations. As for the model loading: we could load it all upfront and then maintain an instance of it in memory as a tokio task, sending messages to it instead of directly to the TTS instance, which... probably shouldn't be a task anymore in that case? Unfortunately, changing the set of languages to be detected in a text would require an Odilia restart, but that'll probably not be an issue, unless we're willing to load all the language models upfront and pay the memory cost once, using the library at its full potential?
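A rough sketch of that "detector as a long-lived tokio task" shape: the task owns the loaded detector and other components send it text over a channel. This is only the general pattern under the assumption that lingua-rs is used; it does not reflect Odilia's actual task structure, and the request/reply types are illustrative.

```rust
// Sketch: a tokio task owns the loaded detector; other parts of the program
// send it text over an mpsc channel and get the detected language back over
// a oneshot channel. Types and names here are illustrative only.
use lingua::{Language, LanguageDetector, LanguageDetectorBuilder};
use tokio::sync::{mpsc, oneshot};

struct DetectRequest {
    text: String,
    reply: oneshot::Sender<Option<Language>>,
}

fn spawn_detector(languages: Vec<Language>) -> mpsc::Sender<DetectRequest> {
    let (tx, mut rx) = mpsc::channel::<DetectRequest>(32);
    tokio::spawn(async move {
        // The model is loaded once, when the task starts, and then reused.
        let detector: LanguageDetector =
            LanguageDetectorBuilder::from_languages(&languages).build();
        while let Some(req) = rx.recv().await {
            let lang = detector.detect_language_of(req.text.as_str());
            let _ = req.reply.send(lang);
        }
    });
    tx
}

#[tokio::main]
async fn main() {
    let detector_tx = spawn_detector(vec![Language::English, Language::Ukrainian]);
    let (reply_tx, reply_rx) = oneshot::channel();
    detector_tx
        .send(DetectRequest { text: "ласкаво просимо!".into(), reply: reply_tx })
        .await
        .expect("detector task is gone");
    println!("{:?}", reply_rx.await);
}
```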
Does it actually require a process-level restart, or can we just reload all config values and keep going? If we can, even better: we'd have a way to test a new configuration with the ability to revert (much like changing visual settings does), with the default being to revert in 15 seconds or something like that. Just a thought. If we do need a process-level restart, we could add it as a keybinding too. Fork, invoke Odilia via shell, immediately exit? Could work.
No idea how we'd just reload like that, especially because of the way the config works. Hmm, I'm thinking something simple can be done: using inotify, we watch for changes to the config file and then quit and restart; or better, the configuration program sends us SIGHUP or SIGUSR1 and we act on it by quitting and restarting. Or, hmm, perhaps the arc-swap crate? I'm not sure we can load more models after the instance has been initialised anyway; I seem to remember that not being a thing, but I'll have to review the documentation again to make sure.
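If the arc-swap route were taken, the general shape would be: readers take a cheap snapshot with `load()`, and the reload path (inotify handler, signal handler, etc.) swaps in a freshly parsed config with `store()`. Whether the lingua detector itself can be rebuilt with a different language set at runtime is a separate question from swapping the config; the `Config` struct below is illustrative only.

```rust
// General shape of hot-swapping a config snapshot with arc-swap: readers load
// the current Arc, the reload path stores a new one. The Config struct is a
// stand-in, not Odilia's real config type.
use arc_swap::ArcSwap;
use std::sync::Arc;

#[derive(Debug)]
struct Config {
    detection_languages: Vec<String>,
}

fn main() {
    let config = Arc::new(ArcSwap::from_pointee(Config {
        detection_languages: vec!["en".into()],
    }));

    // A reader anywhere in the program takes a cheap snapshot.
    let snapshot = config.load();
    println!("before reload: {:?}", snapshot.detection_languages);

    // The reload path (inotify handler, SIGUSR1 handler, ...) swaps in a
    // freshly parsed config without restarting the process.
    config.store(Arc::new(Config {
        detection_languages: vec!["en".into(), "uk".into()],
    }));
    println!("after reload: {:?}", config.load().detection_languages);
}
```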
Looks like the sentences are repeated 125 times for the benchmarks. The benchmarks actually measure doing all 16 sentences 125 times, so my results should be divided by 2000 to get a better picture of latency. This still only applies to the single-threaded benchmarks, as the multi-threaded ones just split those 125 copies of the sentences across multiple cores (throughput, not latency). Thank you for independently testing; this got me curious about the difference, and now I can say for sure I would be comfortable with lingua in Odilia.
Yes. This is basically a perfect use case for it.
I've done some exploration of this in the past month and wanted to summarize it here for others to reference.

Essentially, it seems that lingua-rs, as it stands right now, is not suitable for language detection in a screen reader context. I opened an issue at pemistahl/lingua-rs#464 a while back, and if there are any updates I can post them here as well. To summarize: lingua does not appear to be able to distinguish languages on the very small text inputs one would often get in a screen reader context; confusingly, this is the case even for two languages like Chinese vs English whose scripts are entirely different. I am hoping to receive a response on why this is the case, since it's possible I am misunderstanding something. I had hoped it would fall back to a rule-based approach in trivial cases, but this doesn't appear to necessarily be the case.

It seems that the best way for the time being may be just to detect the language based on Unicode character ranges using a crate like https://github.com/the-type-founders/unicode-language-rs/. This has its own set of problems (some of which are a bit interesting, like https://en.wikipedia.org/wiki/Han_unification), but the nice thing is that it does not require loading a model into memory and would be quite fast for trivial rule-based cases, as long as the two languages don't share the same script.

It remains an open question whether a simple rule-based approach that only distinguishes trivial language differences across different scripts (i.e. Russian vs English vs Chinese vs Korean, etc.) is useful. If it is, sometime in the future I could look at changing my draft PR in the ssip repo to use this Unicode rule-based approach instead of lingua: odilia-app/ssip-client-async#22. (Obviously I still need to experiment a bit more to confirm it is possible to use that library.)
I'm not sure what you need, but check out Langram. It's a complete rewrite of Lingua: 5x faster, more accurate, more languages. Or, if you just want to determine an alphabet/script, there is alphabet_detector, which is very fast and does not have any models. Maybe you can find a way to filter the output to get what you need.
VoiceOver allows you to switch languages mid-string as well as keep settings unique to each language's voice.
For example, if I read the following sentence in VoiceOver: "In Chinese, 你好 means hello!"
VoiceOver will automatically switch between voices and settings to describe the entire sentence in one smooth motion. This can be done even on the tiny CPU of the iPhone SE. Side note: it's not quite that simple; I think the block of foreign text must be a bit longer for it to switch voices, but there is for sure a threshold, and it can switch mid-sentence.
Odilia should eventually have the ability to use this feature. Obviously, without voices set up through `speech-dispatcher`, you may need to fall back to `espeak`, but it's still a way to read and write multi-language documents; you should not need to switch back and forth.

Language identification, unless I'm completely wrong, is very likely a fairly complex process, relatively speaking. So this should always be optional, as a setting for the user to change.
I believe this should be possible through using the SSIP protocol with speech-dispatcher. I haven't looked deep enough to figure this one out myself, but I suspect that if it isn't possible like this, then I'm really not sure how it could be. More research required.