Does the model support the variable audio length? #2 (Open)

haha010508 opened this issue Dec 31, 2021 · 8 comments

@haha010508

Hi,
Thanks for sharing this great work!
I have a lot of audio files, but their lengths differ, so I want to know whether the model supports variable audio lengths. Another question: some audio events need more length to get a better embedding and a good classification result. Maybe I can remove the silence, but I don't know how to keep the spectrogram smooth; removing the silence creates a step in the waveform, so I think the spectrogram gets polluted and the embedding may have some problems. How should I process those audio files?
Thanks.
Looking forward to your reply.

@kkoutini (Owner) commented Jan 4, 2022

Hi,
Thank you for your interest!

Yes, the model supports variable audio lengths, as this was a requirement for the HEAR challenge.
However, our models were trained on 10-second clips. Therefore, the trained time positional encodings are only available for 10 seconds. In the challenge, to get scene embeddings for longer audio clips, we used a simple approach: averaging the predictions over overlapping 10-second windows, as implemented here.
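
A minimal sketch of that windowing idea, assuming a `model` callable that maps a batch of fixed-length waveforms to scene embeddings (the helper name, the 50% hop, and the 32 kHz sample rate here are illustrative assumptions, not the repository's exact implementation):

```python
import torch

def scene_embedding_long(model, audio, sr=32000, win_sec=10.0, hop_sec=5.0):
    """Embed a long mono clip by averaging overlapping 10-second windows."""
    win, hop = int(win_sec * sr), int(hop_sec * sr)
    if audio.shape[-1] <= win:
        # Short clips fit in a single window.
        return model(audio.unsqueeze(0)).squeeze(0)
    # Slice the waveform into overlapping windows (a tail shorter than
    # one window is dropped in this sketch).
    starts = range(0, audio.shape[-1] - win + 1, hop)
    windows = torch.stack([audio[s:s + win] for s in starts])
    with torch.no_grad():
        embeddings = model(windows)      # (n_windows, embed_dim)
    return embeddings.mean(dim=0)        # average over windows
```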

For the timestamp embeddings, we submitted a "base" model with a 160 ms window and a "2-level" model with a larger 800 ms window. Precisely, we concatenated the embeddings, as implemented here.
Generally, in the results, the 2-level model performed better, with the exception of FSD50K.
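
Conceptually, the concatenation looks like this (the tensors below are placeholders standing in for the 160 ms and 800 ms timestamp embeddings; the shapes are assumptions):

```python
import torch

n_timestamps = 50
emb_base = torch.randn(1, n_timestamps, 768)   # from the 160 ms "base" model
emb_wide = torch.randn(1, n_timestamps, 768)   # from the 800 ms model
# "2-level" timestamp embedding: concatenate along the feature axis.
two_level = torch.cat([emb_base, emb_wide], dim=-1)  # (1, 50, 1536)
```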

I'm not sure which preprocessing method would be the best, but I'd guess that silence trimming won't affect the performance to a large extent.

I hope this helps!

@haha010508 (Author)

Thanks for your reply, I learned a lot. Now I want to retrain the model using my own data and fine-tune the pretrained model for 2 classes. How can I do that? Can you give some examples? Thanks a lot!

@kkoutini (Owner)

Hi! Sure, here I call this function. If you use the argument n_classes=2, you'll get a model with the pretrained embedding and a new classifier, which you can fine-tune on your task.
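
A minimal sketch of that setup, assuming the hear21passt package layout (the get_basic_model / get_model_passt names and the arch string follow the repository's README pattern; verify them against the current code):

```python
import torch
from hear21passt.base import get_basic_model, get_model_passt

# Wrapper with the mel front end plus the pretrained transformer.
model = get_basic_model(mode="logits")
# Swap in a PaSST with a fresh 2-class head; the transformer body keeps
# its pretrained weights, only the classifier is re-initialized.
model.net = get_model_passt(arch="passt_s_swa_p16_128_ap476", n_classes=2)

# Then fine-tune as usual on (waveform, label) batches:
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
criterion = torch.nn.CrossEntropyLoss()
```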

@faroit commented Jun 7, 2022

@kkoutini likely related to this:

Is it correct that the default img_size=(128, 998) throws a warning?

```
/usr/local/lib/python3.7/dist-packages/hear21passt/models/passt.py:260: UserWarning: Input image size (128*1000) doesn't match model (128*998).
  warnings.warn(f"Input image size ({H}*{W}) doesn't match model ({self.img_size[0]}*{self.img_size[1]}).")
```

@kkoutini (Owner) commented Jun 9, 2022

Yes, unfortunately the 998 is due to a pre-processing bug in the original pre-trained weights. It should have a minimal effect on the output if you input 1000 frames (the last 2 frames will be ignored by the model).

@faroit commented Jun 12, 2022

> Yes, unfortunately the 998 is due to a pre-processing bug in the original pre-trained weights. It should have a minimal effect on the output if you input 1000 frames (the last 2 frames will be ignored by the model).

@kkoutini thanks for the pointer. I'm still not sure why this is also raised for inputs with fewer than 998/1000 frames. Is this due to pos-enc interpolation?

Is there any use in changing parameters like scene_embedding_size? My inputs are 20 s in duration.

@kkoutini (Owner)

> I'm still not sure why this is also raised for inputs with fewer than 998/1000 frames. Is this due to pos-enc interpolation?

This warning is always shown when the input size doesn't match the size that was used when training the model.

> Is there any use in changing parameters like scene_embedding_size? My inputs are 20 s in duration.

Unfortunately, scene_embedding_size comes from the embedding size (768) + logits (527) of the pretrained model. In later experiments, we found that using the embedding only yields better performance on the HEAR tasks. Therefore, one option would be to set scene_embedding_size=768 and mode="embed_only".
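
For example, something along these lines (a sketch assuming the hear21passt wrappers; the exact mode names should be checked against the package):

```python
import torch
from hear21passt.base import get_basic_model

# Embedding-only model: the forward pass returns the 768-d embedding
# instead of the concatenated embedding + 527 logits (1295-d).
model = get_basic_model(mode="embed_only")
model.eval()

audio = torch.randn(1, 32000 * 20)   # a 20-second clip at 32 kHz
with torch.no_grad():
    embedding = model(audio)         # (1, 768)
```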

@faroit commented Jun 17, 2022

@kkoutini thanks! I guess this can be closed then
