Supervisors: Tuomas Virtanen, Toni Heittola
Author: Bilal Lamsili
WebSED is a web-based application that implements Sound Event Detection (SED) using Pretrained Audio Neural Networks (PANNs). It performs real-time audio tagging and event detection directly in the browser, leveraging the ONNX model format for fast inference. The application ensures user privacy by minimizing backend dependencies.
Note: This project builds on the work of Qiuqiang Kong's Pretrained Audio Neural Networks (PANNs). For more details on the PANNs architecture, refer to the official repository.
You can try the WebSED application by visiting the following link:
WebSED Live Application
- Real-time sound event detection in the browser.
- Powered by PANNs for high-accuracy audio tagging.
- Utilizes the ONNX Runtime for fast and efficient model inference.
- Minimal backend reliance, ensuring user privacy.
Before converting the model, make sure you have the following dependencies installed:
- `torch`
- `onnx`
- Pretrained model weights (`Cnn14_DecisionLevelMax_mAP=0.385.pth`)
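The Python packages can be installed with pip; the pretrained checkpoint itself can be downloaded via the PANNs repository mentioned in the note above:

```bash
pip install torch onnx
```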
To convert the model to the ONNX format, follow these steps:
- Create a Python script `convert_to_onnx.py` with the following content:

```python
import torch
import warnings

from pytorch.models import Cnn14_DecisionLevelMax  # Adjust this path if needed

# Override torch.log10 globally to ensure ONNX compatibility
if hasattr(torch, 'log10'):
    torch.log10 = lambda x: torch.log(x) / torch.log(torch.tensor(10.0))

# Suppress ONNX export warnings related to unsupported operators
warnings.filterwarnings("ignore", category=UserWarning, module="torch.onnx")


def export_model_to_onnx():
    # Initialize the model with required parameters
    model = Cnn14_DecisionLevelMax(
        sample_rate=32000,
        window_size=1024,
        hop_size=320,
        mel_bins=64,
        fmin=50,
        fmax=14000,
        classes_num=527
    )

    # Load the checkpoint file (update the file path as necessary)
    checkpoint = torch.load('Cnn14_DecisionLevelMax_mAP=0.385.pth', map_location='cpu')
    model.load_state_dict(checkpoint['model'])
    model.eval()

    # Define a dummy input for the model (adjust size if needed)
    dummy_input = torch.randn(1, 320000)  # Example input size for 10 s of 32 kHz audio

    # Export the model to ONNX format with opset version 12
    torch.onnx.export(
        model,
        dummy_input,
        "Cnn14_DecisionLevelMax.onnx",
        export_params=True,
        opset_version=12,
        input_names=['input'],
        output_names=['clipwise_output', 'framewise_output'],
        dynamic_axes={'input': {1: 'audio_length'}}  # Allows variable input lengths
    )
    print("Model successfully converted to ONNX format.")


if __name__ == "__main__":
    export_model_to_onnx()
```
- Run the script to generate the ONNX model:

```bash
python convert_to_onnx.py
```
This will create the ONNX model file `Cnn14_DecisionLevelMax.onnx`, which you can use in the web application.
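As an optional sanity check, you can validate the exported graph and confirm the input/output contract the web application relies on. The following is a minimal sketch; it assumes `onnxruntime` is also installed (`pip install onnxruntime`):

```python
import numpy as np
import onnx
import onnxruntime

# Validate the structure of the exported graph
onnx.checker.check_model(onnx.load("Cnn14_DecisionLevelMax.onnx"))

# Run a quick inference on 10 seconds of silence (32 kHz) to confirm
# the input and output names chosen during export
session = onnxruntime.InferenceSession("Cnn14_DecisionLevelMax.onnx")
clipwise, framewise = session.run(
    ["clipwise_output", "framewise_output"],
    {"input": np.zeros((1, 320000), dtype=np.float32)},
)
print(clipwise.shape, framewise.shape)  # e.g. (1, 527) and (1, frames, 527)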
To set up the project locally, ensure you have the following installed:
- Node.js
- Vue.js
- ONNX Runtime Web for running the model inference in the browser.
- Clone the repository:

```bash
git clone https://github.com/TUT-ARG/WebSED
cd WebSED
```
- Install the project dependencies:

```bash
npm install
```
- Start the development server:

```bash
npm run dev
```
- Open your browser and navigate to `http://localhost:5173` to interact with the web application.
Once the development server is up and running, you can interact with the WebSED interface to perform real-time sound event detection. The web app takes in audio input, processes it through the ONNX model in the browser, and displays the detected sound events.
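Conceptually, the in-browser pipeline is equivalent to the following offline Python sketch (the audio file name is hypothetical, and `librosa` and `onnxruntime` are assumed to be installed): the audio is resampled to the model's 32 kHz mono format, passed through the ONNX model, and the per-frame class probabilities are read out.

```python
import numpy as np
import librosa        # Assumed extra: pip install librosa
import onnxruntime    # Assumed extra: pip install onnxruntime

# Load and resample a clip to 32 kHz mono, the input format the
# exported model expects
waveform, _ = librosa.load("example.wav", sr=32000, mono=True)

session = onnxruntime.InferenceSession("Cnn14_DecisionLevelMax.onnx")
(framewise,) = session.run(
    ["framewise_output"],
    {"input": waveform[np.newaxis, :].astype(np.float32)},
)

# framewise has shape (1, frames, 527): one probability per AudioSet
# class per frame; report the most likely class in each frame
print(framewise[0].argmax(axis=1))
```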
- Start the server using the command `npm run dev`.
- Access the application via your browser at `http://localhost:5173`.
- Provide an audio input through your microphone or an audio file.
- The detected sound events will be displayed in real time.
This project is licensed under the MIT License. See the LICENSE file for more details.