🗣️ Multilingual VoiceBot: Bridging India's Language Barriers with AI


🌍 Overview

The Multilingual VoiceBot enables voice-based interaction in 11 Indian languages and English. It combines automatic speech recognition (ASR), real-time translation, and a conversational language model to bridge gaps between India's 121 major languages and 19,500 dialects. Farmers seeking crop advice, patients using telemedicine, and citizens navigating government services get voice-to-voice answers without needing literacy or English proficiency. For technical users, the architecture is modular and builds on models such as CCC-Wav2Vec and IndicTrans2. For end users, the aim is natural conversational AI that reduces digital exclusion while letting people keep using their own language.

"Imagine a farmer in Odisha asking about crop prices in Odia 🧑‍🌾, a grandmother in Kerala describing symptoms to a Hindi-speaking doctor 👵🏥, or a migrant worker accessing government schemes in Bengali while working in Tamil Nadu 🏗️. This AI-powered voice assistant breaks language barriers with human-like conversations in 11 Indian languages, empowering 1.4 billion people to access services in their mother tongue."

Demo Video (Coming Soon!)


✨ Core Features

  • 🌐 11+ Indian Languages - Bengali, Hindi, Tamil, Telugu + 8 more & English
  • 🗣️ Voice ↔ Text Chat - Seamless switch between voice & text modes
  • 🎧 Human-like Voices - Natural male/female voice output (Non-robotic)
  • 🎯 Accurate ASR - Optimized for Indian accents & regional dialects
  • ⚡ Low Latency - <3s response time from speech-to-speech
  • 🔄 Real-time Translation - Instantly convert between any supported languages
  • 📱 Mobile Ready - Works smoothly on smartphones & low-end devices

📊 Workflow Diagram

Figure 1: End-to-end pipeline of Multilingual VoiceBot
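
Since the figure itself does not survive in text form, here is a rough sketch of how one voice turn might flow through the modules described in this README. The stage order (language ID, ASR, translation to a pivot language, LLM, translation back, TTS) is an assumption, and every function is a hypothetical stub returning canned data, not part of this repository's API.

```python
# Hypothetical stubs: each one stands in for a module and returns canned data.
def detect_language(audio: bytes) -> str:
    return "hin_Deva"

def transcribe(audio: bytes, lang: str) -> str:
    return f"<{lang} transcript>"

def translate(text: str, src: str, tgt: str) -> str:
    return f"[{src}->{tgt}] {text}"

def generate_reply(prompt: str) -> str:
    return f"reply to: {prompt}"

def synthesize(text: str, lang: str) -> bytes:
    return text.encode("utf-8")

def voice_turn(audio: bytes, pivot: str = "eng_Latn") -> bytes:
    """Run one speech-to-speech turn through the assumed stage order."""
    lang = detect_language(audio)              # 1. language identification
    text = transcribe(audio, lang)             # 2. speech recognition
    prompt = translate(text, lang, pivot)      # 3. translate into the pivot language
    reply = generate_reply(prompt)             # 4. LLM response
    localized = translate(reply, pivot, lang)  # 5. translate back to the user's language
    return synthesize(localized, lang)         # 6. text-to-speech

print(voice_turn(b"raw-audio").decode("utf-8"))
```

Replacing a stub with a call to the corresponding service keeps the rest of the chain unchanged, which is the practical benefit of the modular layout.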

🔧 Module Breakdown

1. Language Detection Module 🔍

Key Features:

  • Supports 12 Indian languages + English
  • Real-time audio processing with noise filtering
  • GUI + CLI interfaces for flexible deployment

Model Architecture: LiDv2 Diagram

  • Base Model: ccc-wav2vec from IIT Madras SPRING Lab
  • Feature Extraction: u-vector embeddings with Within Sample Similarity Loss (WSSL)
  • Classifier: Feedforward Neural Network

Training Data:

  • 500+ hours of speech data per language from BPCC and ULCA datasets

Benchmark Results (WER%):

| Language | Common Voice | IndicTTS | Kathbath |
|----------|--------------|----------|----------|
| Hindi    | 16.4         | 12.2     | 11.0     |
| Bengali  | 17.2         | 20.5     | 13.9     |
| Tamil    | 30.0         | 20.8     | 24.4     |

2. ASR Module (CCC-Wav2vec 2.0) 🎙️

Key Features:

  • Cross-contrastive learning for robust speech representation
  • Cluster-based negative sampling
  • Supports code-switched speech

Model Architecture:
CCC-Wav2vec Diagram

  • Encoder: 24-layer Transformer
  • Quantizer: Gumbel-Softmax clustering
  • Loss Function: $L_{cc} = \alpha L_c + \beta L_{cross} + \gamma L_{cross'}$

Training Data:

  • 1M+ hours from LibriSpeech, Switchboard, MUCS

Benchmark Results:
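Word error rates (%) without a language model, by language and test set

| Dataset / Language | Common Voice | Fleurs | IndicTTS | ULCA | Kathbath | Kathbath hard | MUCS | SPRING Test | Average |
|--------------------|--------------|--------|----------|------|----------|---------------|------|-------------|---------|
| Bengali   | 17.2 | 19.6 | 20.5 | 21.6 | 13.9 | -    | -    | -    | 18.56667 |
| Gujarati  | -    | 18.1 | 16   | 25.5 | 14.5 | 15.7 | 23.6 | -    | 18.9     |
| Hindi     | 16.4 | 14.5 | 12.2 | 15.8 | 11   | 12.3 | 15.2 | 44.9 | 17.7875  |
| Kannada   | -    | 19.4 | 18.4 | -    | 17.4 | 19.3 | -    | -    | 18.625   |
| Malayalam | 41.3 | 20.8 | 20.7 | -    | 27.4 | 29.9 | -    | 49.5 | 31.6     |
| Marathi   | 19.3 | 21.5 | 14.6 | -    | 16.2 | -    | 10.8 | -    | 16.48    |
| Odia      | 30.3 | 29.1 | 18.8 | 33.6 | 18.3 | 22.6 | 26.8 | -    | 25.64286 |
| Punjabi   | 20.3 | 21.1 | -    | -    | 12.6 | 13.8 | -    | 51.5 | 23.86    |
| Tamil     | 30   | 29.8 | 20.8 | -    | 24.4 | 26.4 | 26.9 | -    | 26.38333 |
| Telugu    | -    | 24.3 | 26.2 | -    | 22.5 | 23.8 | 21.7 | -    | 23.7     |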
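
For readers unfamiliar with the metric, WER is the word-level edit distance between a reference transcript and the ASR output, divided by the number of reference words. The sketch below is a minimal illustration, not the evaluation script behind these numbers; it assumes whitespace tokenization and no text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            )
        prev = curr
    return prev[-1] / len(ref)

# One substitution (sat -> sit) and one deletion (the) over six reference words
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"{100 * wer:.1f}")  # 33.3
```

Because the denominator is the reference length, WER can exceed 100% when the hypothesis contains many insertions.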


3. Translation Module (IndicTrans2) 🌐

Key Features:

  • 462 translation directions for 22 Indian languages
  • Direct Indic-Indic translation without English pivot
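
The figure of 462 is consistent with counting every ordered (source, target) pair among 22 languages, since 22 × 21 = 462. A quick check, using placeholder language codes because only the count matters here:

```python
from itertools import permutations

def translation_directions(languages):
    """Return every ordered (source, target) pair of distinct languages."""
    return list(permutations(languages, 2))

# Placeholder codes (hypothetical); substitute real language tags as needed.
langs = [f"lang{i:02d}" for i in range(22)]
pairs = translation_directions(langs)
print(len(pairs))  # 462
```

Each direction is counted separately, so Hindi to Tamil and Tamil to Hindi are two entries.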

Model Architecture:

  • Framework: Transformer Big (6 layers)
  • Parameters: 210M
  • Training: Multilingual NMT with BPCC corpus

Training Data:

  • 230M sentence pairs from Bharat Parallel Corpus (BPCC)

Benchmark Scores (chrF++):
Test result of IndicTrans2 on Indic languages

| Language | En-Indic | Indic-En |
|----------|----------|----------|
| asm_Beng | 46.8 | 62.9 |
| ben_Beng | 49.7 | 58.4 |
| brx_Deva | 45.3 | 56.3 |
| doi_Deva | 53.9 | 65.0 |
| gom_Deva | 42.5 | 51.7 |
| guj_Gujr | 53.1 | 61.4 |
| hin_Deva | 50.6 | 59.7 |
| kan_Knda | 33.8 | 47.5 |
| kas_Arab | 35.6 | 52.6 |
| mai_Deva | 44.3 | 55.2 |
| mal_Mlym | 45.2 | 54.3 |
| mar_Deva | 48.6 | 57.5 |
| mni_Mtei | 40.2 | 59.6 |
| npi_Deva | 51.5 | 63.0 |
| ory_Orya | 42.1 | 59.8 |
| pan_Guru | 61.1 | 63.0 |
| san_Deva | 35.5 | 38.8 |
| sat_Olck | 34.6 | 43.5 |
| snd_Deva | 39.1 | 49.6 |
| tam_Taml | 39.1 | 46.8 |
| tel_Telu | 45.5 | 53.3 |
| urd_Arab | 61.6 | 65.5 |
| Avg.     | 44.8 | 52.7 |
| Δ        | 5.9  | 4.4  |


4. LLM Module (Meta-Llama-3-8B-Instruct) 🧠

Key Features:

  • 8k token context window
  • RLHF-aligned responses
  • Grouped-Query Attention (GQA)

Model Architecture:
Llama-3 Architecture

  • Layers: 32
  • Heads: 32 attention heads
  • Pre-training: 15T token corpus
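
Grouped-Query Attention lets several query heads share one key/value head, which shrinks the KV cache. The sketch below shows the head grouping only. It assumes 8 key/value heads for the 32 query heads listed above; that count is not stated in this README and is taken from the published Llama 3 8B configuration. The function name is illustrative.

```python
def kv_head_for_query_head(q_head: int, n_q_heads: int = 32, n_kv_heads: int = 8) -> int:
    """Map a query head index to the key/value head it shares under GQA."""
    if n_q_heads % n_kv_heads != 0:
        raise ValueError("query heads must divide evenly into KV heads")
    if not 0 <= q_head < n_q_heads:
        raise ValueError("query head index out of range")
    group_size = n_q_heads // n_kv_heads  # query heads per KV head
    return q_head // group_size

groups = {}
for q in range(32):
    groups.setdefault(kv_head_for_query_head(q), []).append(q)

print(len(groups))  # 8
print(groups[0])    # [0, 1, 2, 3]
```

Under these assumptions the KV cache stores 8 heads per layer instead of 32, a fourfold reduction compared with standard multi-head attention.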

Benchmarks:
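Meta-Llama-3-8B-Instruct benchmark scores

| Category | Benchmark | Llama 3 8B | Llama 2 7B | Llama 2 13B | Llama 3 70B | Llama 2 70B |
|----------|-----------|------------|------------|-------------|-------------|-------------|
| General | MMLU (5-shot) | 68.4 | 34.1 | 47.8 | 82.0 | 52.9 |
| General | GPQA (0-shot) | 34.2 | 21.7 | 22.3 | 39.5 | 21.0 |
| Code Generation | HumanEval (0-shot) | 62.2 | 7.9 | 14.0 | 81.7 | 25.6 |
| Mathematical Reasoning | GSM-8K (8-shot, CoT) | 79.6 | 25.7 | 77.4 | 93.0 | 57.5 |
| Mathematical Reasoning | MATH (4-shot, CoT) | 30.0 | 3.8 | 6.7 | 50.4 | 11.6 |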

5. TTS Module (FastSpeech2 + HiFi-GAN) 📢

Key Features:

  • Hybrid HMM-GD-DNN alignment
  • Unified phone parser for Indian languages
  • 98% MOS score for naturalness

Model Architecture:

  • Duration Predictor: Bi-LSTM
  • Vocoder: HiFi-GAN V1
  • Sampling Rate: 24kHz

Training Data:

  • 100hrs/language from IndicTTS and Kathbath

MOS Scores:

| Language | Male Voice (%) | Female Voice (%) |
|----------|----------------|------------------|
| Assamese (Indicative)   | 34.72 (3.15) | 69.44 (3.72) |
| Bengali                 | 71.09 (3.92) | 72.66 (4.14) |
| Bodo                    | -            | 58.93 (3.93) |
| Gujarati                | 45.19 (3.75) | 80.77 (4.17) |
| Hindi                   | 57.14 (3.91) | 69.64 (4.26) |
| Kannada                 | 70.45 (4.17) | 48.86 (4.20) |
| Malayalam               | 64.38 (4.04) | 43.75 (3.79) |
| Manipuri (Indicative)   | 52.08 (2.83) | 37.50 (2.68) |
| Marathi                 | 77.98 (4.21) | 76.78 (4.02) |
| Odia (Indicative)       | 68.75 (3.55) | 59.38 (3.19) |
| Rajasthani (Indicative) | 56.25 (3.84) | 93.75 (4.47) |
| Tamil                   | 68.18 (4.16) | 55.11 (3.95) |
| Telugu                  | 51.67 (3.87) | 84.17 (3.73) |

💡 Technical Note: All models are optimized for L40 GPUs with TensorRT-LLM. Requires Python 3.10+ and CUDA 12.1.


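🏆 Benchmark Scores

Word Error Rate (WER) for ASR (Common Voice test set)

| Language | WER (%) |
|----------|---------|
| Hindi    | 16.4    |
| Bengali  | 17.2    |
| Tamil    | 30.0    |

Translation Quality (chrF++)

| Language Pair   | Score |
|-----------------|-------|
| English → Hindi | 50.6  |
| Hindi → English | 59.7  |

TTS Mean Opinion Score (MOS)

| Language               | MOS (1-5) |
|------------------------|-----------|
| Bengali (female voice) | 4.14      |
| Marathi (male voice)   | 4.21      |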

🚀 Getting Started

📋 Prerequisites

⚠️ Essential Requirements:

  • Ubuntu 22.04 LTS (or compatible Linux distro)
  • NVIDIA L40 GPU with 32GB+ VRAM
  • Python 3.10+
  • CUDA 12.1 & cuDNN 8.9+
  • Git LFS installed

💡 Recommended:

  • 64GB System RAM
  • 1TB SSD Storage
  • Stable internet connection (>50Mbps)
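
Before installing, it can help to confirm that the basics are in place. The script below is a minimal sketch, not part of the repository: it checks the Python version and looks for the `git-lfs` and `nvidia-smi` executables on `PATH`. Treating `nvidia-smi` as a proxy for a working NVIDIA driver is an assumption, and the script does not verify the GPU model, VRAM, CUDA, or cuDNN versions.

```python
import shutil
import sys

# Executables expected on PATH; nvidia-smi is used as a proxy for the GPU driver.
REQUIRED_TOOLS = ["git-lfs", "nvidia-smi"]

def check_python(min_version=(3, 10), current=None):
    """Return True if the interpreter is at least min_version."""
    current = tuple(current) if current is not None else tuple(sys.version_info[:2])
    return current >= tuple(min_version)

def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]

print("Python 3.10+:", "ok" if check_python() else "too old")
print("Missing tools:", missing_tools() or "none")
```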

⚙️ Installation Guide

1. System Preparation

# Install core dependencies
sudo apt-get update && sudo apt-get install -y \
    git-lfs ffmpeg python3-pip python3-venv \
    nvidia-cuda-toolkit libsndfile1-dev

2. Clone Repository

git clone https://github.com/anindyamitra2002/Multilingual-VoiceBot.git
cd Multilingual-VoiceBot

# Initialize Git LFS and submodules
git lfs install
git submodule update --init --recursive

3. Python Environment Setup

python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip wheel setuptools

🔧 Component Configuration

Core Dependencies Installation

pip install -r requirements.txt \
    torch==2.1.0+cu121 \
    torchaudio==2.1.0+cu121 \
    --extra-index-url https://download.pytorch.org/whl/cu121

Module-Specific Setup

A. Translation Engine (IndicTrans2)

cd IndicTransToolkit
pip install --editable ./
python -c "import nltk; nltk.download('punkt')"
cd ..

B. Language Identification

cd LIDv2
wget -P model/ https://asr.iitm.ac.in/SPRING_INX/models/foundation/SPRING_INX_ccc_wav2vec2_SSL.pt
pip install -r requirement.txt
cd ..

C. Voice Synthesis

cd Fastspeech2_HS_Flask_API
pip install scipy==1.9.1 -U
pip install -r requirements.txt
cd ..

🚦 Service Initialization

Start All Components (4 Terminal Sessions)

Terminal 1: Translation Server

source .venv/bin/activate
python translator_server.py

Terminal 2: Language Detector

source .venv/bin/activate
python lang_identifier_server.py

Terminal 3: Voice Synthesis

source .venv/bin/activate
cd Fastspeech2_HS_Flask_API && python flask_app.py

Terminal 4: Web Interface

source .venv/bin/activate
streamlit run --server.port 8500 clients/main.py
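
As an alternative to four terminals, the same commands could be started from a single Python script. This is a sketch, not a script shipped with the repository. It assumes it is run from the repository root with the virtual environment already active, so that `sys.executable` points at the right interpreter, and it uses `python -m streamlit` in place of the `streamlit` entry point. The names `SERVICES`, `start_all`, and `stop_all` are illustrative.

```python
import subprocess
import sys

# (label, working directory, command), mirroring the four terminal sessions
SERVICES = [
    ("translation", ".", [sys.executable, "translator_server.py"]),
    ("language-id", ".", [sys.executable, "lang_identifier_server.py"]),
    ("tts", "Fastspeech2_HS_Flask_API", [sys.executable, "flask_app.py"]),
    ("web-ui", ".", [sys.executable, "-m", "streamlit", "run",
                     "--server.port", "8500", "clients/main.py"]),
]

def start_all(services=SERVICES, popen=subprocess.Popen):
    """Start each service as a child process and return the process handles."""
    return [popen(cmd, cwd=cwd) for _label, cwd, cmd in services]

def stop_all(procs):
    """Ask every child process to terminate, then wait for each to exit."""
    for proc in procs:
        proc.terminate()
    for proc in procs:
        proc.wait()
```

A typical use would be to call `procs = start_all()`, wait on `procs[-1]` (the web UI), and call `stop_all(procs)` in a `finally` block so the background servers do not outlive the session.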

🌐 Accessing the System

  1. Open browser: http://localhost:8500
  2. Select input/output languages
  3. Click microphone icon to start voice interaction
  4. View real-time processing pipeline:

Interface Preview

🛠️ Troubleshooting Guide

Common Issues & Solutions

| Symptom | Solution |
|---------|----------|
| CUDA out of memory | Use an L40 GPU with 32GB+ VRAM, for example from Lightning Studio (free) |
| Audio not supported | Check which audio formats the ASR client supports |
| Translation errors | Select the correct input language for both voice and text input |
| Language detection failures | Reduce background noise and speak clearly, close to the microphone |

Port Configuration Reference

| Service     | Default Port |
|-------------|--------------|
| Translation | 8000         |
| Language ID | 8001         |
| TTS Engine  | 5000         |
| Web UI      | 8500         |
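
When a service appears unresponsive, a quick way to narrow things down is to check whether anything is listening on its default port. The sketch below assumes all services run on the local machine at the ports listed above; it only tests that a TCP connection can be opened and says nothing about whether the service behind the port is healthy. The function names are illustrative.

```python
import socket

# Default ports from the reference table above
PORTS = {"Translation": 8000, "Language ID": 8001, "TTS Engine": 5000, "Web UI": 8500}

def is_listening(port, host="127.0.0.1", timeout=0.5):
    """Return True if a TCP connection to host:port succeeds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        return sock.connect_ex((host, port)) == 0

def service_status(ports=PORTS, probe=is_listening):
    """Map each service name to True (port open) or False (port closed)."""
    return {name: probe(port) for name, port in ports.items()}

# Example: print(service_status())
```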

📌 Roadmap & Progress

| Status | Feature | Details |
|--------|---------|---------|
| ✅ | Multilingual Support | 11 Indian languages + English |
| ✅ | API Integration | REST/gRPC endpoints operational |
| ⬜ | Web Search Integration | Planned: Google/Bing API hooks |
| ⬜ | Document Q&A System | PDF/TXT analysis pipeline |
| ⬜ | Multimodal Input Support | Image+Voice simultaneous processing |

🙏 Acknowledgements

  • Models: AI4Bharat (IndicTrans2), Meta AI (Llama-3), IIT Madras (CCC-Wav2Vec).
  • Tools: Hugging Face, Lightning Studio, FFmpeg.

📜 License

This project is licensed under the MIT License. See LICENSE for details.
