This project focuses on enabling accessibility for visually impaired individuals by automating the generation of descriptive captions for images and providing these descriptions through audio narration. The solution integrates advanced machine learning techniques in image processing, natural language understanding, and speech synthesis.
- Image Feature Extraction: Uses the InceptionV3 CNN to extract high-level image features (see the sketch after this list).
- Semantic Word Embeddings: Employs GloVe embeddings to enhance the language representation.
- Caption Generation: Generates meaningful and contextually relevant captions using an LSTM-based decoder.
- Speech Narration: Converts generated captions into audio using Text-to-Speech (TTS) technology.
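A minimal sketch of the feature-extraction step with TensorFlow/Keras; the `extract_features` helper name is illustrative, not the repo's exact API:

```python
# Minimal sketch: extract a 2048-dim feature vector with InceptionV3.
import numpy as np
from tensorflow.keras.applications.inception_v3 import InceptionV3, preprocess_input
from tensorflow.keras.preprocessing import image

encoder = InceptionV3(weights="imagenet", include_top=False, pooling="avg")

def extract_features(img_path):
    img = image.load_img(img_path, target_size=(299, 299))  # InceptionV3 input size
    x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
    return encoder.predict(x)[0]  # shape: (2048,)
```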
The system follows an encoder-decoder paradigm (a decoder sketch follows this list):
- Image Input: Accepts images as input.
- Feature Extraction: InceptionV3 CNN extracts image features.
- Language Representation: GloVe embeddings provide semantic word vectors.
- Caption Generation: LSTM decoder generates captions using image features and word embeddings.
- Text-to-Speech Conversion: TTS converts captions to speech.
- Audio Output: Delivers the generated description as audio for user accessibility.
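A minimal sketch of the merge-style decoder this pipeline describes; the vocabulary size, sequence length, and layer widths are illustrative assumptions, not the repo's exact hyperparameters:

```python
# Minimal sketch of an LSTM decoder that merges InceptionV3 image features
# with GloVe-initialized word embeddings. All sizes are assumptions.
import numpy as np
from tensorflow.keras.initializers import Constant
from tensorflow.keras.layers import Dense, Dropout, Embedding, Input, LSTM, add
from tensorflow.keras.models import Model

vocab_size, max_len, embed_dim = 8000, 34, 200        # assumed hyperparameters
embedding_matrix = np.zeros((vocab_size, embed_dim))  # fill row i with word i's GloVe vector

# Image branch: project the 2048-dim feature vector into the decoder space.
img_in = Input(shape=(2048,))
img_vec = Dense(256, activation="relu")(Dropout(0.5)(img_in))

# Language branch: GloVe-initialized embeddings feeding an LSTM.
seq_in = Input(shape=(max_len,))
emb = Embedding(vocab_size, embed_dim, mask_zero=True,
                embeddings_initializer=Constant(embedding_matrix),
                trainable=False)(seq_in)
seq_vec = LSTM(256)(Dropout(0.5)(emb))

# Merge both branches and predict a distribution over the next word.
merged = Dense(256, activation="relu")(add([img_vec, seq_vec]))
out = Dense(vocab_size, activation="softmax")(merged)

decoder = Model(inputs=[img_in, seq_in], outputs=out)
decoder.compile(loss="categorical_crossentropy", optimizer="adam")
```

Merging the image and text vectors before the output layer is one common design for this paradigm; attention-based decoders are an alternative.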
- Python: Programming language
- TensorFlow/Keras: For building and training the CNN and LSTM models
- GloVe: Pre-trained word embeddings for language representation
- Text-to-Speech (TTS): For converting text captions into speech
- Flask/Django (Optional): For deploying the application
- NumPy, Pandas, Matplotlib: For data handling and visualization
- Clone this repository:

  ```bash
  git clone https://github.com/faizahkureshi232/imagetospeech.git
  cd imagetospeech
  ```
- Download the pre-trained models:
  - InceptionV3 weights
  - GloVe word embeddings (a loader sketch follows this step)
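GloVe ships as plain-text files; a minimal loader sketch, assuming the common `glove.6B.200d.txt` variant (any variant works if `embed_dim` matches):

```python
# Minimal sketch: parse a GloVe text file into a word -> vector dict.
# The file name is an assumption; use whichever variant you downloaded.
import numpy as np

def load_glove(path="glove.6B.200d.txt"):
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *values = line.split()
            embeddings[word] = np.asarray(values, dtype="float32")
    return embeddings
```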
- Run the application by opening the evaluation notebook:

  ```bash
  jupyter notebook eval.ipynb
  ```
- Upload an image through the interface or specify the image path in the script.
- The system generates a descriptive caption (see the decoding sketch after these steps).
- The caption is converted into speech and played back as audio.
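Captions are typically decoded one token at a time. A minimal greedy-decoding sketch, assuming a fitted Keras `Tokenizer` whose training captions used `startseq`/`endseq` boundary tokens, and reusing the `decoder` and `extract_features` sketches above:

```python
# Minimal greedy-decoding sketch. Assumes a fitted Keras Tokenizer with
# "startseq"/"endseq" boundary tokens, plus the `decoder` model and
# `extract_features` helper sketched earlier in this README.
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def generate_caption(img_path, tokenizer, max_len=34):
    feats = extract_features(img_path)[None, :]           # (1, 2048)
    words = ["startseq"]
    for _ in range(max_len):
        seq = tokenizer.texts_to_sequences([" ".join(words)])[0]
        seq = pad_sequences([seq], maxlen=max_len)
        next_id = int(np.argmax(decoder.predict([feats, seq], verbose=0)))
        word = tokenizer.index_word.get(next_id)
        if word is None or word == "endseq":              # stop at the end token
            break
        words.append(word)
    return " ".join(words[1:])                            # drop the start token
```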
- Input Image: [example.png]
- Generated Caption: "A Dog Running through the."
- Audio Output: Speech narration of the generated caption (a TTS sketch follows this example).
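The narration step can use any TTS backend. A minimal sketch with gTTS; the library choice is an assumption, so swap in whichever engine the project actually uses:

```python
# Minimal TTS sketch using gTTS (a common choice, assumed here rather than
# confirmed from the repo). Requires an internet connection.
from gtts import gTTS

caption = "A dog running through the park."  # illustrative caption text
gTTS(text=caption, lang="en").save("caption.mp3")  # play with any audio player
```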
- Integration with real-time image capture (e.g., through a smartphone camera).
- Support for multiple languages in Text-to-Speech.
- Advanced customization for user-specific accessibility needs.
Contributions are welcome! Please follow these steps:
- Fork the repository.
- Create a feature branch:

  ```bash
  git checkout -b feature-name
  ```
- Commit your changes:

  ```bash
  git commit -m "Add feature description"
  ```
- Push to the branch:

  ```bash
  git push origin feature-name
  ```
- Create a pull request.
This project is licensed under the MIT License. See LICENSE for more details.
- InceptionV3 for feature extraction.
- GloVe for pre-trained word embeddings.
- OpenAI and community resources for inspiration and support.
Feel free to suggest improvements or report issues in the repository!