This project is designed to make visual content accessible to visually impaired individuals by generating audio descriptions of images. Leveraging Python, machine learning, and text-to-speech technologies, the system identifies image features, converts them into textual captions, and synthesizes the captions into audio for playback.
- Image Recognition: Uses deep learning models (e.g., Vision Transformers) to analyze images and extract features.
- Caption Generation: Produces meaningful textual descriptions of images (see the captioning sketch after this list).
- Multilingual Audio Output: Provides audio descriptions in English, Kannada, and Hindi using advanced text-to-speech (TTS) libraries.
- User-Friendly Interface: Enables users to upload images and listen to detailed audio descriptions seamlessly.
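The image-recognition and caption-generation steps can be illustrated with a short sketch. It assumes the publicly available `nlpconnect/vit-gpt2-image-captioning` checkpoint as the VisionEncoderDecoderModel; the project's actual weights and decoding settings may differ:

```python
# Minimal captioning sketch: ViT encoder extracts image features,
# a GPT-2 decoder turns them into a caption.
# The checkpoint name below is an assumption, not the project's confirmed model.
from PIL import Image
from transformers import AutoTokenizer, ViTImageProcessor, VisionEncoderDecoderModel

MODEL_ID = "nlpconnect/vit-gpt2-image-captioning"  # assumed checkpoint
model = VisionEncoderDecoderModel.from_pretrained(MODEL_ID)
processor = ViTImageProcessor.from_pretrained(MODEL_ID)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

def generate_caption(image_path: str) -> str:
    """Extract ViT features from an image and decode them into a text caption."""
    image = Image.open(image_path).convert("RGB")
    pixel_values = processor(images=image, return_tensors="pt").pixel_values
    output_ids = model.generate(pixel_values, max_length=32, num_beams=4)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(generate_caption("sample.jpg"))  # e.g. "a dog sitting on a wooden bench"
```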
- Frontend: Google Colab for interactive prototyping.
- Backend: Python, TensorFlow, PyTorch, VisionEncoderDecoderModel, gTTS (see the text-to-speech sketch after this list), and OpenCV.
- Additional Libraries: NumPy, Matplotlib, and Google Translate API for multilingual support.
- Development Tools: Flask for interface development and Git for version control.
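A minimal sketch of the multilingual audio step follows. It assumes the `googletrans` wrapper as the client for the Google Translate API (the project may use a different client); gTTS itself supports the three target languages via the codes `en`, `kn`, and `hi`:

```python
# Hedged sketch of the translate-then-speak step.
# googletrans is an assumed translation client; swap in your preferred one.
from gtts import gTTS
from googletrans import Translator

def caption_to_audio(caption: str, lang: str = "kn",
                     out_path: str = "description.mp3") -> str:
    """Translate an English caption into the target language and save it as speech."""
    text = caption
    if lang != "en":
        text = Translator().translate(caption, src="en", dest=lang).text
    gTTS(text=text, lang=lang).save(out_path)
    return out_path

caption_to_audio("a dog sitting on a wooden bench", lang="hi")
```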
- Image Upload: Users upload an image via the interface.
- Feature Extraction: The system analyzes the image using pre-trained deep learning models.
- Caption Generation: Converts features into meaningful captions.
- Text-to-Speech: Translates the caption into the selected language and synthesizes it as audio.
- Playback: Users can listen to a detailed audio description of the image (a minimal end-to-end Flask sketch follows this list).
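The workflow above can be tied together in a small Flask route. This is an illustrative sketch reusing the `generate_caption` and `caption_to_audio` helpers from the earlier sketches; the route name and form fields are assumptions, not the project's actual interface:

```python
# Illustrative Flask route covering the five workflow steps.
# generate_caption and caption_to_audio come from the sketches above.
from flask import Flask, request, send_file

app = Flask(__name__)

@app.route("/describe", methods=["POST"])
def describe():
    # 1. Image upload: the client posts an image file and a target language.
    upload = request.files["image"]
    lang = request.form.get("lang", "en")
    upload.save("upload.jpg")
    # 2-3. Feature extraction and caption generation.
    caption = generate_caption("upload.jpg")
    # 4. Translation and text-to-speech in the requested language.
    audio_path = caption_to_audio(caption, lang=lang)
    # 5. Playback: return the MP3 so the client can play it.
    return send_file(audio_path, mimetype="audio/mpeg")

if __name__ == "__main__":
    app.run(debug=True)
```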
- Enhancing accessibility for visually impaired individuals.
- Use in education and assistive technologies.
- Real-time applications for image and video captioning.
The project demonstrates that automatically generated audio descriptions can effectively convey image content, enhancing accessibility and independence for visually impaired users.