Multimodal RAG Pipeline with CLIP, Whisper, and Qwen-VL

This repository contains a Multimodal Retrieval-Augmented Generation (RAG) Pipeline that integrates images, audio, and text for advanced multimodal querying and response generation. The pipeline uses CLIP for image embeddings, Whisper for audio transcription, SentenceTransformer for text embeddings, ChromaDB for vector storage, and Qwen-VL for multimodal text generation.

Companion article: Implementing Multi-modal RAG Systems


Features

  • PDF to Image Conversion: Extracts images from PDFs for processing.
  • Image Embedding with CLIP: Generates embeddings for images using the CLIP model (see the first sketch after this list).
  • Audio Transcription with Whisper: Transcribes audio files into text chunks.
  • Text Embedding with SentenceTransformer: Generates embeddings for transcribed audio chunks (see the second sketch after this list).
  • Vector Storage with ChromaDB: Stores and retrieves embeddings for images and audio.
  • Multimodal Querying: Retrieves relevant images and audio chunks based on a user query.
  • Multimodal Text Generation with Qwen-VL: Generates text responses using retrieved images, audio, and the query.
  • Image and Text Output: Displays retrieved images and generated text.
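
The first two features can be sketched as follows, assuming the openai/clip-vit-base-patch32 checkpoint and a hypothetical document.pdf; the repository's own code may pin different names:

```python
import torch
from pdf2image import convert_from_path
from transformers import CLIPModel, CLIPProcessor

# Convert each PDF page to a PIL image (pdf2image needs the poppler system library).
pages = convert_from_path("document.pdf", dpi=200)  # hypothetical input file
image_paths = []
for i, page in enumerate(pages):
    path = f"page_{i}.png"
    page.save(path, "PNG")
    image_paths.append(path)

# Embed each page image with CLIP.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

with torch.no_grad():
    inputs = processor(images=pages, return_tensors="pt")
    image_embeddings = model.get_image_features(**inputs)  # shape: (num_pages, 512)
```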
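
The audio side, sketched with the transformers ASR pipeline around an assumed openai/whisper-small checkpoint and the all-MiniLM-L6-v2 sentence encoder; lecture.wav is a placeholder:

```python
import librosa
from transformers import pipeline
from sentence_transformers import SentenceTransformer

# Load audio at the 16 kHz sampling rate Whisper expects.
audio, sr = librosa.load("lecture.wav", sr=16000)  # hypothetical input file

# Transcribe; return_timestamps=True yields a list of timestamped text chunks.
asr = pipeline(
    "automatic-speech-recognition", model="openai/whisper-small", chunk_length_s=30
)
result = asr({"array": audio, "sampling_rate": sr}, return_timestamps=True)
chunks = [c["text"] for c in result["chunks"]]

# Embed each transcribed chunk for vector search.
text_model = SentenceTransformer("all-MiniLM-L6-v2")
audio_embeddings = text_model.encode(chunks)  # shape: (num_chunks, 384)
```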

Pipeline Overview

  1. Extract Images from PDFs: Convert PDFs into images and save them.
  2. Embed Images using CLIP: Generate embeddings for the extracted images.
  3. Store Image Embeddings in ChromaDB: Store the embeddings in a ChromaDB collection.
  4. Process Audio: Transcribe audio files and generate embeddings for the text chunks.
  5. Store Audio Embeddings in ChromaDB: Store the embeddings in a ChromaDB collection.
  6. Retrieve Relevant Data: Query ChromaDB to retrieve relevant images and audio chunks (see the retrieval sketch after this list).
  7. Display Retrieved Images: Display the retrieved images to the user.
  8. Prepare Multimodal Input for Qwen-VL: Combine retrieved images, audio, and the query.
  9. Generate Text with Qwen-VL: Generate a text response using Qwen-VL (see the generation sketch after this list).
  10. Output Generated Text: Display the final generated text.
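
Storage and retrieval (steps 3, 5, and 6) sketched with ChromaDB, continuing from the two feature sketches above (it reuses model, processor, text_model, image_paths, chunks, and the two embedding tensors). The collection names and the example query are assumptions. The image collection is queried with CLIP's text encoder so the query lands in the same embedding space as the stored page images, while the audio collection is queried with the same sentence encoder used at ingestion:

```python
import chromadb

client = chromadb.Client()  # in-memory; use chromadb.PersistentClient(path=...) to persist
image_col = client.get_or_create_collection("pdf_images")    # hypothetical name
audio_col = client.get_or_create_collection("audio_chunks")  # hypothetical name

# Steps 3 and 5: store the embeddings with ids pointing back to the raw data.
image_col.add(
    ids=[f"img_{i}" for i in range(len(image_paths))],
    embeddings=image_embeddings.tolist(),
    metadatas=[{"path": p} for p in image_paths],
)
audio_col.add(
    ids=[f"aud_{i}" for i in range(len(chunks))],
    embeddings=audio_embeddings.tolist(),
    documents=chunks,
)

# Step 6: embed the query once per modality and retrieve the nearest neighbours.
query = "What does the document say about attention?"  # example query
with torch.no_grad():
    text_inputs = processor(text=[query], return_tensors="pt", padding=True)
    query_image_emb = model.get_text_features(**text_inputs)

top_images = image_col.query(query_embeddings=query_image_emb.tolist(), n_results=2)
top_chunks = audio_col.query(
    query_embeddings=[text_model.encode(query).tolist()], n_results=3
)
retrieved_paths = [m["path"] for m in top_images["metadatas"][0]]
retrieved_text = " ".join(top_chunks["documents"][0])
```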
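
Steps 8 to 10 sketched against the Qwen/Qwen-VL-Chat checkpoint, whose from_list_format and chat interface are loaded via trust_remote_code; the prompt wording here is illustrative, and the repository may assemble its multimodal input differently:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)
qwen = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-VL-Chat", device_map="auto", trust_remote_code=True
).eval()

# Step 8: interleave the retrieved images with the transcript context and the query.
elements = [{"image": p} for p in retrieved_paths]
elements.append(
    {"text": f"Audio transcript context: {retrieved_text}\n\nQuestion: {query}"}
)
prompt = tokenizer.from_list_format(elements)

# Steps 9 and 10: generate and print the answer.
response, _history = qwen.chat(tokenizer, query=prompt, history=None)
print(response)
```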

Prerequisites

  • Python 3.8 or higher
  • GPU (recommended for faster processing)
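
A GPU is optional but markedly speeds up CLIP, Whisper, and Qwen-VL. A common pattern is to resolve the device once and pass it to each model:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
```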

Dependencies

  • pdf2image: For converting PDFs to images.
  • Pillow: For image processing.
  • transformers: For CLIP, Whisper, and Qwen-VL models.
  • sentence-transformers: For text embeddings.
  • chromadb: For vector storage and retrieval.
  • librosa: For audio processing.
  • torch: For deep learning operations.
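
All of the Python packages install from PyPI. Note that pdf2image also needs the poppler system library (for example, the poppler-utils package on Debian/Ubuntu):

```bash
pip install pdf2image Pillow transformers sentence-transformers chromadb librosa torch
```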
