This project provides a Gradio web interface that automatically downloads audio from a YouTube video, transcribes it using the faster-whisper (small) model, and then refines the transcription with a language model (ChatOllama, model: `gemma3:12b`). The process uses independent progress bars to display the status of each stage.
**Note:** Sensitive details (e.g., local file paths) have been omitted. Please update the configuration in the code according to your environment.
- **Audio Download 🎥:** Uses `yt-dlp` to download audio from a YouTube link and convert it to MP3.
- **Transcription 📝:** Processes the audio using the faster-whisper (small) model with real-time progress updates.
- **Refinement 💡:** Cleans up the raw transcription by removing timestamps and formatting artifacts using ChatOllama (model: `gemma3:12b`) with its own progress bar.
- **Gradio Interface 🖥️:** Offers a user-friendly web interface divided into two columns, with automatic processing for transcription and refinement.
- Python 3.7+
- imageio-ffmpeg
- Gradio
- faster-whisper (using the small model)
- yt-dlp
- tqdm
- Ollama (for running ChatOllama)
- LangChain (framework for LLM integrations)
Before running the application, ensure you have Ollama installed on your system. You must also pull the ChatOllama model by running:
```
ollama pull gemma3:12b
```
This will download the required model locally so that ChatOllama can be used for transcription refinement.
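To confirm the model is reachable from LangChain before running the app, a quick sanity check can help. This is a minimal sketch; it assumes the `langchain-ollama` package (older setups import `ChatOllama` from `langchain_community.chat_models` instead):

```python
# Sanity check: assumes the langchain-ollama package is installed.
from langchain_ollama import ChatOllama

llm = ChatOllama(model="gemma3:12b", temperature=0)
print(llm.invoke("Reply with OK if you can read this.").content)
```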
- **FFmpeg Path:** Update the `ffmpeg_location` variable in the `download_audio` function with the path to your ffmpeg binary. Example: `ffmpeg_location = r"your/local/path/to/ffmpeg.exe"`
- **Local Paths and Secrets:** Avoid hardcoding sensitive paths. Consider using environment variables or a configuration file to manage these settings (see the sketch after this list).
- **ChatOllama and Model Configuration:** This project uses ChatOllama with the model `gemma3:12b`. Update this value as needed to match your available models.
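As one way to follow the advice above, the hardcoded values could be read from the environment. A minimal sketch, assuming hypothetical variable names `FFMPEG_LOCATION` and `OLLAMA_MODEL` (the project does not define these):

```python
import os

# FFMPEG_LOCATION and OLLAMA_MODEL are illustrative names, not project settings.
ffmpeg_location = os.environ.get("FFMPEG_LOCATION", r"your/local/path/to/ffmpeg.exe")
ollama_model = os.environ.get("OLLAMA_MODEL", "gemma3:12b")
```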
This project includes a `requirements.txt` file, so you don't need to create one yourself. Just follow these steps:
1. **Create a Virtual Environment:**
   - Windows (cmd/PowerShell):
     ```
     python -m venv venv
     ```
   - macOS/Linux:
     ```
     python3 -m venv venv
     ```
2. **Activate the Virtual Environment:**
   - Windows (cmd):
     ```
     venv\Scripts\activate
     ```
   - Windows (PowerShell):
     ```
     venv\Scripts\Activate.ps1
     ```
   - macOS/Linux:
     ```
     source venv/bin/activate
     ```
3. **Install Dependencies:**
   With the virtual environment active, run:
   ```
   pip install -r requirements.txt
   ```
   (A rough breakdown of what the file contains follows these steps.)
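The authoritative list lives in the repository's `requirements.txt`; purely as a rough illustration, the dependencies listed earlier correspond to entries along these lines (unpinned here; the actual file may pin versions):

```
gradio
faster-whisper
yt-dlp
imageio-ffmpeg
tqdm
langchain
langchain-ollama
```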
The `download_audio` function uses `yt-dlp` to download the best available audio stream and convert it to MP3. (Remember to update the ffmpeg path.)
```python
import subprocess

def download_audio(youtube_url, file_name):
    ffmpeg_location = r"your/local/path/to/ffmpeg.exe"  # Update this path
    cmd = [
        "yt-dlp",
        "-f", "bestaudio",        # best available audio-only stream
        "-x",                     # extract audio
        "--audio-format", "mp3",  # convert to MP3 (requires ffmpeg)
        "-o", file_name,
        "--ffmpeg-location", ffmpeg_location,
        youtube_url
    ]
    subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
```
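A call might look like this (the URL and file name are placeholders):

```python
download_audio("https://www.youtube.com/watch?v=VIDEO_ID", "video_audio.mp3")
```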
The `transcribe_youtube` function downloads the audio, transcribes it using the faster-whisper (small) model, and updates a progress bar during each step.
```python
import os
import uuid

import gradio as gr

# Assumes a module-level faster-whisper model; see the initialization sketch
# after the explanation below.

def transcribe_youtube(youtube_url, progress=gr.Progress(track_tqdm=False)):
    file_name = f"video_audio_converted_{uuid.uuid4().hex}.mp3"
    # Step 1: Download and Transcription
    progress((0, 100), desc="Starting download...")
    try:
        download_audio(youtube_url, file_name)
    except Exception as e:
        progress((100, 100), desc="Download error")
        return f"Error downloading audio: {e}"
    progress((10, 100), desc="Download complete. Starting transcription...")
    segments, info = model.transcribe(file_name, beam_size=5)
    segments = list(segments)  # materialize the generator so segments can be counted
    total_segments = len(segments)
    if total_segments == 0:
        progress((100, 100), desc="No segments detected")
        return "No audio to transcribe."
    progress((20, 100), desc=f"Detected language: {info.language} (Prob: {info.language_probability:.2f})")
    transcription = ""
    start_percent = 20
    end_percent = 90
    for i, segment in enumerate(segments, start=1):
        # Map the segment index linearly onto the 20-90% band of the bar
        current_percent = start_percent + (end_percent - start_percent) * (i / total_segments)
        progress((int(current_percent), 100), desc=f"Processing segment {i}/{total_segments}")
        transcription += f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}\n"
    progress((100, 100), desc="Transcription complete")
    try:
        os.remove(file_name)
    except Exception as e:
        print(f"Could not remove temporary file {file_name}: {e}")
    return transcription
```
**Explanation:**
- **Progress Bars:** Keep you informed about the download and transcription progress.
- **Concatenation:** Combines all audio segments into one comprehensive transcription string.
- **Cleanup:** Deletes the temporary audio file after processing.
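The snippet references a module-level `model`. A typical initialization, following the faster-whisper documentation, might look like this (device and compute type are assumptions that depend on your hardware):

```python
from faster_whisper import WhisperModel

# "small" matches the model size used in this project; device/compute_type
# are illustrative and should be set to match your hardware.
model = WhisperModel("small", device="cpu", compute_type="int8")
```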
The `refine_transcription` function refines the raw transcription using ChatOllama (model: `gemma3:12b`) and a fixed prompt. It also displays its own progress bar.
```python
import gradio as gr
from langchain_core.prompts import ChatPromptTemplate
# ChatOllama ships in langchain-ollama (older: langchain_community.chat_models)
from langchain_ollama import ChatOllama

def refine_transcription(transcription, progress=gr.Progress(track_tqdm=False)):
    # This function refines the transcription with its own progress bar
    progress((0, 100), desc="Starting refinement...")
    llm = ChatOllama(temperature=0, model="gemma3:12b")
    template = """You are an expert assistant in refining raw video transcriptions. The text provided contains timestamps, occasional disfluencies, and formatting artifacts that make it hard to read. Your task is to reformat the transcription so that it is clear and well-organized, while preserving all the original content and details. Do not summarize or omit any information; just remove unnecessary timestamps and artifacts, and adjust the text for improved readability.

Raw Transcription:
{transcription}

Refined Transcription (in the language of the transcription):
"""
    prompt = ChatPromptTemplate.from_template(template)
    messages = prompt.invoke({"transcription": transcription})
    progress((20, 100), desc="Refinement in progress...")
    response = llm.invoke(messages)
    progress((100, 100), desc="Refinement complete")
    return response.content
```
**Explanation:**
- **LLM Integration:** Uses ChatOllama with a fixed prompt to reformat the transcription.
- **Independent Progress:** Displays progress specifically for the refinement process.
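For comparison, the timestamp prefixes alone could be stripped deterministically, with no LLM involved. This hypothetical helper is not part of the project, but it matches the `[0.00s -> 4.20s]` line format produced by `transcribe_youtube`:

```python
import re

def strip_timestamps(transcription: str) -> str:
    """Hypothetical fallback: drop "[0.00s -> 4.20s]" prefixes without an LLM."""
    cleaned = [re.sub(r"^\[\d+\.\d{2}s -> \d+\.\d{2}s\]\s*", "", line)
               for line in transcription.splitlines()]
    return " ".join(line for line in cleaned if line)
```

Unlike the LLM pass, this does nothing about disfluencies or paragraph structure, which is why the project delegates refinement to ChatOllama.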
The Gradio interface is built using Blocks and splits the screen into two columns: one for transcription and one for refinement. When the "Transcribe" button is clicked, the transcription is generated automatically. Then, when the transcription box updates, the refinement function is triggered to update the refined output.
```python
import gradio as gr

with gr.Blocks() as demo:
    gr.Markdown("# Automatic YouTube Video Transcription and Refinement")
    with gr.Row():  # lay the two columns out side by side, per the description above
        with gr.Column():
            gr.Markdown("## Transcription")
            youtube_url = gr.Textbox(label="YouTube Link", placeholder="Paste the YouTube video link here")
            transcribe_btn = gr.Button("Transcribe")
            transcription_box = gr.Textbox(label="Complete Transcription", lines=15)
        with gr.Column():
            gr.Markdown("## Refinement")
            refined_box = gr.Textbox(label="Refined Transcription", lines=15)

    # When the button is clicked, the video is transcribed and the transcription is shown.
    transcribe_btn.click(
        fn=transcribe_youtube,
        inputs=youtube_url,
        outputs=transcription_box
    )
    # When the transcription box is updated, automatically call the refinement function.
    transcription_box.change(
        fn=refine_transcription,
        inputs=transcription_box,
        outputs=refined_box
    )

demo.launch()
```
**Explanation:**
- **Two Columns:** Clearly separates the transcription and refinement outputs.
- **Event Handling:**
  - Clicking the "Transcribe" button calls `transcribe_youtube` and displays the transcription.
  - Updating the transcription box automatically triggers `refine_transcription` to update the refined output.
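One practical note: Gradio progress bars stream their updates through the request queue, which recent Gradio versions enable by default. If the bars do not render in your setup, enabling the queue explicitly is worth trying:

```python
# Enable the queue explicitly before launching (already the default in Gradio 4+).
demo.queue().launch()
```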
For users who want to understand the overall construction of the application in a cleaner format, a demo Jupyter Notebook is provided (`demo.ipynb`). This notebook contains a simplified implementation without the Gradio interface, allowing you to see the core logic and workflow for transcription and refinement.
1. **Clone the Repository:**
   ```
   git clone https://github.com/thaisaraujom/youtube-transcript-refiner.git
   cd youtube-transcript-refiner
   ```
2. **Create a Virtual Environment and Install Dependencies:**
   ```
   # Create the virtual environment (use python or python3 depending on your system)
   python -m venv venv

   # Activate the virtual environment:
   # On Windows:
   venv\Scripts\activate
   # On macOS/Linux:
   source venv/bin/activate

   # Install the dependencies:
   pip install -r requirements.txt
   ```
3. **Configure Environment Variables:**
   Update the ffmpeg path (and any other configuration) in the code as needed.
4. **Run the Application:**
   ```
   python transcribe_youtube.py
   ```
   Open the provided local URL in your browser to use the interface.
This project is licensed under the MIT License. See the LICENSE file for details.