Does the model support the variable audio length? #2 (Open)

haha010508 opened this issue Dec 31, 2021 · 8 comments

@haha010508

Hi,
Thanks for sharing this great work!
I have a lot of audio files, but their lengths differ, so I want to know whether the model supports variable audio lengths. Another question: some audio events need more length to get a better embedding and a good classification result. Maybe I can remove the silence, but I don't know how to keep the spectrogram smooth; removing the silence creates a step in the waveform, so I think the spectrogram gets polluted and the embedding may have some problems. How should I process those audio files?
Thanks.
Looking forward to your reply.

@kkoutini (Owner) commented Jan 4, 2022

Hi,
Thank you for your interest!

Yes, the model supports variable audio lengths, as this was a requirement for the HEAR challenge.
However, our models were trained on 10-second clips. Therefore, the trained time positional encodings are only available for 10 seconds. In the challenge, to get scene embeddings for longer audio clips, we used a simple approach: averaging the predictions over overlapping 10-second windows, as implemented here.
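
A minimal sketch of that windowing idea, assuming a `model` callable that maps a batch of fixed-length waveforms to scene embeddings (the helper name, the 50% hop, and the 32 kHz sample rate here are illustrative assumptions, not the repository's exact implementation):

```python
import torch

def scene_embedding_long(model, audio, sr=32000, win_sec=10.0, hop_sec=5.0):
    """Embed a long mono clip by averaging overlapping 10-second windows."""
    win, hop = int(win_sec * sr), int(hop_sec * sr)
    if audio.shape[-1] <= win:
        # Short clips fit in a single window.
        return model(audio.unsqueeze(0)).squeeze(0)
    # Slice the waveform into overlapping windows (a tail shorter than
    # one window is dropped in this sketch).
    starts = range(0, audio.shape[-1] - win + 1, hop)
    windows = torch.stack([audio[s:s + win] for s in starts])
    with torch.no_grad():
        embeddings = model(windows)      # (n_windows, embed_dim)
    return embeddings.mean(dim=0)        # average over windows
```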

For the timestamp embeddings, we submitted a "base" model with a 160 ms window and a "2-level" model with a larger 800 ms window. Precisely, we concatenated the embeddings, as implemented here.
Generally, in the results, the 2-level model performed better, with the exception of FSD50K.
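
Conceptually, the concatenation looks like this (the tensors below are placeholders standing in for the 160 ms and 800 ms timestamp embeddings; the shapes are assumptions):

```python
import torch

n_timestamps = 50
emb_base = torch.randn(1, n_timestamps, 768)   # from the 160 ms "base" model
emb_wide = torch.randn(1, n_timestamps, 768)   # from the 800 ms model
# "2-level" timestamp embedding: concatenate along the feature axis.
two_level = torch.cat([emb_base, emb_wide], dim=-1)  # (1, 50, 1536)
```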

I'm not sure which preprocessing method would be the best, but I'd guess that silence trimming won't affect the performance to a large extent.

I hope this helps!

@haha010508 (Author)

Thanks for your reply, I learned a lot. Now I want to retrain the model using my own data and fine-tune the pretrained model for 2 classes. How can I do that? Can you give some examples? Thanks a lot!

@kkoutini (Owner)

Hi! Sure, here I call this function. If you use the argument n_classes=2, you'll get a model with the pretrained embedding and a new classifier, which you can fine-tune on your task.
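
A minimal sketch of that setup, assuming the hear21passt package layout (the get_basic_model / get_model_passt names and the arch string follow the repository's README pattern; verify them against the current code):

```python
import torch
from hear21passt.base import get_basic_model, get_model_passt

# Wrapper with the mel front end plus the pretrained transformer.
model = get_basic_model(mode="logits")
# Swap in a PaSST with a fresh 2-class head; the transformer body keeps
# its pretrained weights, only the classifier is re-initialized.
model.net = get_model_passt(arch="passt_s_swa_p16_128_ap476", n_classes=2)

# Then fine-tune as usual on (waveform, label) batches:
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
criterion = torch.nn.CrossEntropyLoss()
```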

@faroit commented Jun 7, 2022

@kkoutini likely related to this:

Is it correct that the default img_size=(128, 998) throws a warning?

```
/usr/local/lib/python3.7/dist-packages/hear21passt/models/passt.py:260: UserWarning: Input image size (128*1000) doesn't match model (128*998).
  warnings.warn(f"Input image size ({H}*{W}) doesn't match model ({self.img_size[0]}*{self.img_size[1]}).")
```

@kkoutini (Owner) commented Jun 9, 2022

Yes, unfortunately the 998 is due to a pre-processing bug in the original pre-trained weights. It should have a minimal effect on the output if you input 1000 frames (the last 2 frames will be ignored by the model).

@faroit commented Jun 12, 2022

> Yes, unfortunately the 998 is due to a pre-processing bug in the original pre-trained weights. It should have a minimal effect on the output if you input 1000 frames (the last 2 frames will be ignored by the model).

@kkoutini thanks for the pointer. I'm still not sure why this is also raised for inputs with fewer than 998/1000 frames. Is this due to pos-enc interpolation?

Is there any use in changing parameters like scene_embedding_size? My inputs are 20 s in duration.

@kkoutini (Owner)

> I'm still not sure why this is also raised for inputs with fewer than 998/1000 frames. Is this due to pos-enc interpolation?

This warning is always shown when the input size doesn't match the size that was used when training the model.

> Is there any use in changing parameters like scene_embedding_size? My inputs are 20 s in duration.

Unfortunately, scene_embedding_size comes from the embedding size (768) + logits (527) of the pretrained model. In later experiments, we found that using the embedding only yields better performance on the HEAR tasks. Therefore, one option would be to set scene_embedding_size=768 and mode="embed_only".
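
For example, something along these lines (a sketch assuming the hear21passt wrappers; the exact mode names should be checked against the package):

```python
import torch
from hear21passt.base import get_basic_model

# Embedding-only model: the forward pass returns the 768-d embedding
# instead of the concatenated embedding + 527 logits (1295-d).
model = get_basic_model(mode="embed_only")
model.eval()

audio = torch.randn(1, 32000 * 20)   # a 20-second clip at 32 kHz
with torch.no_grad():
    embedding = model(audio)         # (1, 768)
```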

@faroit commented Jun 17, 2022

@kkoutini thanks! I guess this can be closed then
