- Overview
- Core Features
- Workflow Diagram
- Module Breakdown
- Benchmark Scores
- Getting Started
- Troubleshooting Guide
- Roadmap & Progress
- Additional Resources
- Acknowledgements
- License
The Multilingual VoiceBot enables voice-based interaction in 11 Indian languages and English, combining automatic speech recognition (ASR), real-time translation, and conversational AI to bridge gaps between India's 121 major languages and 19,500 dialects. Farmers seeking crop advice, patients using telemedicine, and citizens navigating government services get voice-to-voice answers without needing literacy or English proficiency, which widens access to critical services while preserving cultural identity. Technical users get a modular architecture built on models such as CCC-Wav2Vec and IndicTrans2, while end users get natural-sounding conversation that reduces digital exclusion across India's diverse linguistic landscape.
"Imagine a farmer in Odisha asking about crop prices in Odia, a grandmother in Kerala describing symptoms to a Hindi-speaking doctor, or a migrant worker accessing government schemes in Bengali while working in Tamil Nadu. This AI-powered voice assistant breaks language barriers with human-like conversations in 11 Indian languages, empowering 1.4 billion people to access services in their mother tongue."
Demo Video (Coming Soon!)
- 11+ Indian Languages - Bengali, Hindi, Tamil, Telugu + 8 more & English
- Voice ↔ Text Chat - Seamless switching between voice and text modes
- Human-like Voices - Natural, non-robotic male and female voice output
- Accurate ASR - Optimized for Indian accents & regional dialects
- Low Latency - Under 3 s speech-to-speech response time
- Real-time Translation - Instant conversion between any supported languages
- Mobile Ready - Works smoothly on smartphones & low-end devices
Figure 1: End-to-end pipeline of Multilingual VoiceBot
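The stages named in this README (language identification, ASR, translation, an LLM, and TTS) could be chained roughly as in the sketch below. This is an illustrative assumption, not the project's actual code: the `VoicePipeline` class, its field names, the language codes, and the choice to route the LLM through English are all hypothetical, and every stage is a placeholder callable.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class VoicePipeline:
    """Hypothetical wiring of the five stages; each callable is a stand-in."""
    identify_language: Callable[[bytes], str]
    transcribe: Callable[[bytes, str], str]
    translate: Callable[[str, str, str], str]
    generate_reply: Callable[[str], str]
    synthesize: Callable[[str, str], bytes]
    pivot_lang: str = "eng"  # assumption: the LLM is prompted in English

    def respond(self, audio: bytes, reply_lang: Optional[str] = None) -> bytes:
        # 1. Detect the spoken language, then transcribe in that language.
        src_lang = self.identify_language(audio)
        text = self.transcribe(audio, src_lang)
        # 2. Translate the transcript into the pivot language for the LLM.
        prompt = self.translate(text, src_lang, self.pivot_lang)
        reply = self.generate_reply(prompt)
        # 3. Translate the reply back and synthesize speech.
        out_lang = reply_lang or src_lang
        localized = self.translate(reply, self.pivot_lang, out_lang)
        return self.synthesize(localized, out_lang)


# Stub stages so the sketch runs without any model.
demo = VoicePipeline(
    identify_language=lambda audio: "ory",
    transcribe=lambda audio, lang: "price?",
    translate=lambda text, src, tgt: f"[{src}->{tgt}] {text}",
    generate_reply=lambda prompt: f"reply to {prompt}",
    synthesize=lambda text, lang: f"{lang}:{text}".encode("utf-8"),
)
```

Calling `demo.respond(audio_bytes)` returns the stubbed "audio"; swapping each lambda for a client of the corresponding service would give a real round trip.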
Key Features:
- Supports 12 Indian languages + English
- Real-time audio processing with noise filtering
- GUI + CLI interfaces for flexible deployment
Model Architecture:
- Base Model: ccc-wav2vec from IIT Madras SPRING Lab
- Feature Extraction: u-vector embeddings with Within Sample Similarity Loss (WSSL)
- Classifier: Feedforward Neural Network
Training Data:
- 500+ hours of speech data per language from BPCC and ULCA datasets
Benchmark Results (WER%):
Language | Common Voice | IndicTTS | Kathbath |
---|---|---|---|
Hindi | 16.4 | 12.2 | 11.0 |
Bengali | 17.2 | 20.5 | 13.9 |
Tamil | 30.0 | 20.8 | 24.4 |
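For readers unfamiliar with the metric, word error rate is the word-level edit distance between a reference transcript and the recognizer output, divided by the number of reference words. The function below is a minimal sketch of that standard definition; it is not the evaluation script used to produce the numbers above, and the function name is illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent: (substitutions + deletions + insertions) / reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (classic Levenshtein row update).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return 100.0 * prev[-1] / len(ref)
```

For example, a five-word reference with one substituted and one dropped word scores 2 / 5 = 40%.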
Key Features:
- Cross-contrastive learning for robust speech representation
- Cluster-based negative sampling
- Supports code-switched speech
Model Architecture:
- Encoder: 24-layer Transformer
- Quantizer: Gumbel-Softmax clustering
- Loss Function: $L_{cc} = \alpha L_c + \beta L_{cross} + \gamma L_{cross'}$
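The loss above is a weighted sum of three terms. As a small worked sketch, the helper below combines already-computed loss values; the function name and the default weights of 1.0 are assumptions for illustration, since this README does not state the values used in training.

```python
def cross_contrastive_loss(
    l_c: float,
    l_cross: float,
    l_cross_prime: float,
    alpha: float = 1.0,  # assumed default, not taken from the README
    beta: float = 1.0,   # assumed default
    gamma: float = 1.0,  # assumed default
) -> float:
    """Combine the three loss terms as L_cc = alpha*L_c + beta*L_cross + gamma*L_cross'."""
    return alpha * l_c + beta * l_cross + gamma * l_cross_prime
```

With `l_c=2.0`, `l_cross=1.0`, `l_cross_prime=0.5` and weights `(1.0, 0.5, 0.5)`, the total is 2.0 + 0.5 + 0.25 = 2.75.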
Training Data:
- 1M+ hours from LibriSpeech, Switchboard, MUCS
Benchmark Results:
Word error rates (%) without a language model, by language and test set
Language | Common Voice | Fleurs | IndicTTS | ULCA | Kathbath | Kathbath hard | MUCS | SPRING Test | Average |
---|---|---|---|---|---|---|---|---|---|
Bengali | 17.2 | 19.6 | 20.5 | 21.6 | 13.9 | - | - | - | 18.57 |
Gujarati | - | 18.1 | 16.0 | 25.5 | 14.5 | 15.7 | 23.6 | - | 18.90 |
Hindi | 16.4 | 14.5 | 12.2 | 15.8 | 11.0 | 12.3 | 15.2 | 44.9 | 17.79 |
Kannada | - | 19.4 | 18.4 | - | 17.4 | 19.3 | - | - | 18.63 |
Malayalam | 41.3 | 20.8 | 20.7 | - | 27.4 | 29.9 | - | 49.5 | 31.60 |
Marathi | 19.3 | 21.5 | 14.6 | - | 16.2 | - | 10.8 | - | 16.48 |
Odia | 30.3 | 29.1 | 18.8 | 33.6 | 18.3 | 22.6 | 26.8 | - | 25.64 |
Punjabi | 20.3 | 21.1 | - | - | 12.6 | 13.8 | - | 51.5 | 23.86 |
Tamil | 30.0 | 29.8 | 20.8 | - | 24.4 | 26.4 | 26.9 | - | 26.38 |
Telugu | - | 24.3 | 26.2 | - | 22.5 | 23.8 | 21.7 | - | 23.70 |
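The Average column appears to be the mean over the test sets that have a score, with missing entries skipped. The sketch below reproduces that arithmetic for two rows; the helper name is illustrative, and the assumption that missing entries are simply ignored is inferred from the numbers rather than stated in the source.

```python
from typing import Optional, Sequence


def mean_wer(scores: Sequence[Optional[float]]) -> float:
    """Average the available WER values, skipping test sets with no score (None)."""
    available = [s for s in scores if s is not None]
    if not available:
        raise ValueError("no scores available")
    return sum(available) / len(available)


# Column order: Common Voice, Fleurs, IndicTTS, ULCA, Kathbath, Kathbath hard, MUCS, SPRING Test
hindi = [16.4, 14.5, 12.2, 15.8, 11.0, 12.3, 15.2, 44.9]
marathi = [19.3, 21.5, 14.6, None, 16.2, None, 10.8, None]

print(f"Hindi:   {mean_wer(hindi):.4f}")
print(f"Marathi: {mean_wer(marathi):.4f}")
```

The Hindi row sums to 142.3 over 8 test sets (17.7875), and the Marathi row to 82.4 over 5 (16.48), matching the table.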
Key Features:
- 462 translation directions for 22 Indian languages
- Direct Indic-Indic translation without English pivot
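The figure of 462 directions follows from counting ordered source/target pairs among 22 languages: 22 × 21 = 462. A short sketch of that count (the function name is illustrative):

```python
from itertools import permutations


def translation_directions(num_languages: int) -> int:
    """Number of ordered (source, target) pairs with source != target."""
    return num_languages * (num_languages - 1)


# Cross-check the closed form by enumerating ordered pairs explicitly.
pairs = list(permutations(range(22), 2))
print(translation_directions(22), len(pairs))
```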
Model Architecture:
- Framework: Transformer Big (6 layers)
- Parameters: 210M
- Training: Multilingual NMT with BPCC corpus
Training Data:
- 230M sentence pairs from Bharat Parallel Corpus (BPCC)
Benchmark Scores (chrF++):
IndicTrans2 test results on Indic languages (chrF++)
Direction | asm_Beng | ben_Beng | brx_Deva | doi_Deva | gom_Deva | guj_Gujr | hin_Deva | kan_Knda | kas_Arab | mai_Deva | mal_Mlym | mar_Deva | mni_Mtei | npi_Deva | ory_Orya | pan_Guru | san_Deva | sat_Olck | snd_Deva | tam_Taml | tel_Telu | urd_Arab | Avg. | Δ |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
En-Indic | 46.8 | 49.7 | 45.3 | 53.9 | 42.5 | 53.1 | 50.6 | 33.8 | 35.6 | 44.3 | 45.2 | 48.6 | 40.2 | 51.5 | 42.1 | 61.1 | 35.5 | 34.6 | 39.1 | 39.1 | 45.5 | 61.6 | 44.8 | 5.9 |
Indic-En | 62.9 | 58.4 | 56.3 | 65.0 | 51.7 | 61.4 | 59.7 | 47.5 | 52.6 | 55.2 | 54.3 | 57.5 | 59.6 | 63.0 | 59.8 | 63.0 | 38.8 | 43.5 | 49.6 | 46.8 | 53.3 | 65.5 | 52.7 | 4.4 |
Key Features:
- 8k token context window
- RLHF-aligned responses
- Grouped-Query Attention (GQA)
Model Architecture:
- Layers: 32
- Attention heads: 32
- Pre-training: 15T token corpus
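Grouped-Query Attention lets several query heads share one key/value head, which shrinks the KV cache. The sketch below shows only the head-to-group mapping. The value of 8 key/value heads comes from the published Llama 3 8B configuration, not from this README, so treat it as an assumption here; the function name is illustrative.

```python
def kv_head_for_query_head(
    query_head: int,
    num_query_heads: int = 32,
    num_kv_heads: int = 8,  # assumption: taken from the public Llama 3 8B config
) -> int:
    """Return the index of the key/value head shared by the given query head."""
    if num_query_heads % num_kv_heads != 0:
        raise ValueError("num_query_heads must be a multiple of num_kv_heads")
    if not 0 <= query_head < num_query_heads:
        raise ValueError("query_head out of range")
    group_size = num_query_heads // num_kv_heads  # 4 query heads per KV head
    return query_head // group_size
```

Under these numbers, query heads 0 to 3 read from KV head 0, heads 4 to 7 from KV head 1, and so on.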
Benchmarks:
Meta-Llama-3-8B-Instruct benchmark scores
Category | Benchmark | Llama 3 8B | Llama 2 7B | Llama 2 13B | Llama 3 70B | Llama 2 70B |
---|---|---|---|---|---|---|
General | MMLU (5-shot) | 68.4 | 34.1 | 47.8 | 82.0 | 52.9 |
General | GPQA (0-shot) | 34.2 | 21.7 | 22.3 | 39.5 | 21.0 |
Code Generation | HumanEval (0-shot) | 62.2 | 7.9 | 14.0 | 81.7 | 25.6 |
Mathematical Reasoning | GSM-8K (8-shot, CoT) | 79.6 | 25.7 | 77.4 | 93.0 | 57.5 |
Mathematical Reasoning | MATH (4-shot, CoT) | 30.0 | 3.8 | 6.7 | 50.4 | 11.6 |
Key Features:
- Hybrid HMM-GD-DNN alignment
- Unified phone parser for Indian languages
- 98% MOS score for naturalness
Model Architecture:
- Duration Predictor: Bi-LSTM
- Vocoder: HiFi-GAN V1
- Sampling Rate: 24kHz
Training Data:
- 100hrs/language from IndicTTS and Kathbath
MOS Scores:
Language | Assamese (Indicative) | Bengali | Bodo | Gujarati | Hindi | Kannada | Malayalam | Manipuri (Indicative) | Marathi | Odia (Indicative) | Rajasthani (Indicative) | Tamil | Telugu |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Male Voice (%) | 34.72 (3.15) | 71.09 (3.92) | - | 45.19 (3.75) | 57.14 (3.91) | 70.45 (4.17) | 64.38 (4.04) | 52.08 (2.83) | 77.98 (4.21) | 68.75 (3.55) | 56.25 (3.84) | 68.18 (4.16) | 51.67 (3.87) |
Female Voice (%) | 69.44 (3.72) | 72.66 (4.14) | 58.93 (3.93) | 80.77 (4.17) | 69.64 (4.26) | 48.86 (4.20) | 43.75 (3.79) | 37.50 (2.68) | 76.78 (4.02) | 59.38 (3.19) | 93.75 (4.47) | 55.11 (3.95) | 84.17 (3.73) |
💡 Technical Note: All models are optimized for L40 GPUs with TensorRT-LLM. Requires Python 3.10+ and CUDA 12.1.
Language | WER (%) |
---|---|
Hindi | 16.4 |
Bengali | 17.2 |
Tamil | 30.0 |
Language Pair | chrF++ |
---|---|
English → Hindi | 50.6 |
Hindi → English | 59.7 |
Language | MOS (1-5) |
---|---|
Bengali | 4.14 |
Marathi | 4.21 |
- Ubuntu 22.04 LTS (or compatible Linux distro)
- NVIDIA L40 GPU with 32GB+ VRAM
- Python 3.10+
- CUDA 12.1 & cuDNN 8.9+
- Git LFS installed
💡 Recommended:
- 64GB System RAM
- 1TB SSD Storage
- Stable internet connection (>50Mbps)
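Before installing, it may help to confirm the basics listed above. The following is a minimal sketch of such a pre-flight check; the function name and the tool list are illustrative, and using the presence of `nvidia-smi` on `PATH` as a proxy for a working NVIDIA driver is an assumption. It does not verify GPU model, VRAM, or the CUDA/cuDNN versions.

```python
import shutil
import sys
from typing import List, Sequence, Tuple

# Command-line tools the setup steps rely on (nvidia-smi is an assumed proxy for the driver).
REQUIRED_TOOLS = ("git-lfs", "ffmpeg", "nvidia-smi")


def check_environment(
    min_python: Tuple[int, int] = (3, 10),
    tools: Sequence[str] = REQUIRED_TOOLS,
) -> List[str]:
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if sys.version_info[:2] < min_python:
        found = ".".join(map(str, sys.version_info[:2]))
        wanted = ".".join(map(str, min_python))
        problems.append(f"Python {wanted}+ required, found {found}")
    for tool in tools:
        # shutil.which returns None when the executable is not on PATH.
        if shutil.which(tool) is None:
            problems.append(f"'{tool}' not found on PATH")
    return problems
```

Running `print(check_environment() or "environment looks OK")` gives a quick summary before the longer installation steps.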
# Install core dependencies
sudo apt-get update && sudo apt-get install -y \
git-lfs ffmpeg python3-pip python3-venv \
nvidia-cuda-toolkit libsndfile1-dev
git clone https://github.com/anindyamitra2002/Multilingual-VoiceBot.git
cd Multilingual-VoiceBot
# Initialize Git LFS and submodules
git lfs install
git submodule update --init --recursive
# Create and activate a virtual environment
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip wheel setuptools
# Install Python dependencies with CUDA 12.1 builds of PyTorch
pip install -r requirements.txt \
torch==2.1.0+cu121 \
torchaudio==2.1.0+cu121 \
--extra-index-url https://download.pytorch.org/whl/cu121
# Install IndicTransToolkit (translation)
cd IndicTransToolkit
pip install --editable ./
python -c "import nltk; nltk.download('punkt')"
cd ..
# Set up LIDv2 (language identification) and download the ccc-wav2vec checkpoint
cd LIDv2
wget -P model/ https://asr.iitm.ac.in/SPRING_INX/models/foundation/SPRING_INX_ccc_wav2vec2_SSL.pt
pip install -r requirement.txt
cd ..
# Install the FastSpeech2 TTS API dependencies
cd Fastspeech2_HS_Flask_API
pip install scipy==1.9.1 -U
pip install -r requirements.txt
cd ..
# Terminal 1: translation server
source .venv/bin/activate
python translator_server.py
# Terminal 2: language identification server
source .venv/bin/activate
python lang_identifier_server.py
# Terminal 3: TTS server
source .venv/bin/activate
cd Fastspeech2_HS_Flask_API && python flask_app.py
# Terminal 4: web UI
source .venv/bin/activate
streamlit run --server.port 8500 clients/main.py
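As an alternative to separate terminals, the four commands above could be started from one Python script. This is a convenience sketch rather than part of the project: the `SERVICES` mapping and `start_services` helper are hypothetical, and it assumes the virtual environment is already active so that `python` and `streamlit` resolve to the right executables.

```python
import subprocess
from pathlib import Path
from typing import Dict

# name -> (command, working directory relative to the repository root),
# copied from the launch commands in this README.
SERVICES = {
    "translation": (["python", "translator_server.py"], "."),
    "language_id": (["python", "lang_identifier_server.py"], "."),
    "tts": (["python", "flask_app.py"], "Fastspeech2_HS_Flask_API"),
    "web_ui": (["streamlit", "run", "--server.port", "8500", "clients/main.py"], "."),
}


def start_services(repo_root: str = ".") -> Dict[str, subprocess.Popen]:
    """Start every service as a child process and return the process handles."""
    root = Path(repo_root)
    return {
        name: subprocess.Popen(command, cwd=root / workdir)
        for name, (command, workdir) in SERVICES.items()
    }


def stop_services(processes: Dict[str, subprocess.Popen]) -> None:
    """Ask every child process to terminate and wait for it to exit."""
    for process in processes.values():
        process.terminate()
    for process in processes.values():
        process.wait()
```

A typical session would call `procs = start_services()` from the repository root and `stop_services(procs)` when finished. Separate terminals remain easier for reading each service's logs.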
- Open your browser at http://localhost:8500
- Select input/output languages
- Click the microphone icon to start voice interaction
- View the real-time processing pipeline
Symptom | Solution |
---|---|
CUDA Out of Memory | Use an L40 GPU from Lightning Studio (free) with 32GB+ VRAM |
Audio Not Supported | Check which audio formats the ASR client supports |
Translation Errors | Select the correct input language for the voice or text input |
Language Detection Failures | Reduce background noise and speak clearly, close to the microphone |
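When an audio file is rejected, re-encoding it with FFmpeg (already installed in the setup steps) is one way to get a widely accepted format. The sketch below builds and runs such a command. The 16 kHz mono WAV target is an assumption based on what wav2vec 2.0-style models commonly expect, not a documented requirement of this project, and the helper names are illustrative.

```python
import subprocess
from typing import List


def to_wav_command(src: str, dst: str, sample_rate: int = 16000) -> List[str]:
    """Build an ffmpeg command that re-encodes src as mono WAV at sample_rate Hz."""
    return [
        "ffmpeg",
        "-y",                    # overwrite dst if it already exists
        "-i", src,               # input file
        "-ac", "1",              # one audio channel (mono)
        "-ar", str(sample_rate), # output sample rate in Hz (assumed 16 kHz)
        dst,
    ]


def convert_to_wav(src: str, dst: str, sample_rate: int = 16000) -> None:
    """Run the conversion; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(to_wav_command(src, dst, sample_rate), check=True)
```

For example, `convert_to_wav("question.m4a", "question.wav")` writes a mono WAV next to the original recording.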
Service | Default Port |
---|---|
Translation | 8000 |
Language ID | 8001 |
TTS Engine | 5000 |
Web UI | 8500 |
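If the web UI cannot reach a backend, a quick way to see which services are listening is to try a TCP connection to each default port. The sketch below does that with the standard library; the helper names are illustrative, and it assumes all services run on the local machine.

```python
import socket
from typing import Dict

# Default ports from the table above.
DEFAULT_PORTS = {
    "Translation": 8000,
    "Language ID": 8001,
    "TTS Engine": 5000,
    "Web UI": 8500,
}


def is_port_open(port: int, host: str = "127.0.0.1", timeout: float = 0.5) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


def service_status(ports: Dict[str, int] = DEFAULT_PORTS) -> Dict[str, bool]:
    """Map each service name to whether its port is accepting connections."""
    return {name: is_port_open(port) for name, port in ports.items()}
```

Printing `service_status()` after starting the servers shows at a glance which one failed to come up. An open port only proves that something is listening, not that the service is healthy.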
Status | Feature | Details |
---|---|---|
✅ | Multilingual Support | 11 Indian languages + English |
✅ | API Integration | REST/gRPC endpoints operational |
⬜ | Web Search Integration | Planned: Google/Bing API hooks |
⬜ | Document Q&A System | PDF/TXT analysis pipeline |
⬜ | Multimodal Input Support | Image+Voice simultaneous processing |
- Models: AI4Bharat (IndicTrans2), Meta AI (Llama-3), IIT Madras (CCC-Wav2Vec).
- Tools: Hugging Face, Lightning Studio, FFmpeg.
This project is licensed under the MIT License. See LICENSE for details.