training problems.. #18

Closed
gingseo opened this issue Mar 4, 2025 · 6 comments

gingseo commented Mar 4, 2025

I'm trying to train your model from scratch on the provided dataset (no transfer learning).
Even with the backbone frozen, I'm running into OOM errors and very long training times on a single A100 (80 GB).

Could you share the setup you used for training? Specifically, which GPU you used, how much memory it had, and how long training took? Also, do certain datasets take much longer to train than others?
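For context, freezing a backbone in PyTorch usually means disabling gradients for its parameters so that only the remaining modules are trained; it also saves memory, since frozen weights need no gradient buffers or optimizer state. A minimal, generic sketch (the `model.backbone` attribute and the commented-out optimizer are placeholder assumptions, not this repository's actual code):

```python
import torch

def freeze_backbone(model: torch.nn.Module) -> None:
    # Hypothetical helper: assumes the model exposes a `backbone` submodule.
    for param in model.backbone.parameters():
        param.requires_grad_(False)   # no gradients are computed or stored
    model.backbone.eval()             # also freezes BatchNorm running statistics

# Hand only the trainable parameters to the optimizer so the frozen weights
# carry no optimizer state (e.g. Adam moments), which saves GPU memory:
# optimizer = torch.optim.AdamW(
#     (p for p in model.parameters() if p.requires_grad), lr=1e-4)
```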

HengLan (Owner) commented Mar 4, 2025

Please check the GPU requirements for running the code here: #4

gingseo (Author) commented Mar 4, 2025

Thank you, Professor.
In my case, the estimated training time keeps increasing during training, so I suspect a memory leak. 😭😭
Do you have any idea what might be causing this?

GX77 (Collaborator) commented Mar 10, 2025

What is your current training setup?

gingseo (Author) commented Mar 18, 2025

Thank you for your attention. I am currently training on a single A100 (80 GB). As shown in the attached image, memory usage rises gradually and at first appears to plateau, but it keeps creeping upward until it eventually results in an OOM error. Have you encountered this issue as well?

I also have a second question.
From the code, I see that a batch size of 1 (i.e., one video) is assigned per GPU. However, in the paper, I found that the batch size was increased to 64 for VidSTG. Additionally, I read in a GitHub issue that HCSTVG-v1 was trained on 16 GPUs. Was the batch size fixed at 1 per GPU? If so, would it be possible to increase the batch size given sufficient memory? I am curious about how the batch sizes of 16, 32, and 64 were managed in the paper.

My third question is about the text processing in the code. I noticed that there is an implementation to truncate text exceeding 26 tokens, but it does not seem to be executed. Since the transformer processes only a single sample at a time (batch size 1), does this mean that variable-length text is fed directly into the model without truncation? If a larger batch size were possible, would additional logic be needed to handle text-length adjustments (e.g., padding or truncation)?

I know this is a long question, but your responses have been incredibly helpful for our research. Thank you!

[Image: GPU memory usage increasing over the course of training]
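A common cause of this slow, unbounded memory growth in PyTorch training loops is keeping tensors that are still attached to the autograd graph, for example appending the raw loss to a log list instead of calling `loss.item()`. Below is a minimal sketch of a loop that logs scalars safely and tracks peak GPU memory so the growth can be spotted early; it is a generic example with assumed names, not this repository's training code:

```python
import torch

def train_one_epoch(model, loader, optimizer, device="cuda"):
    # Generic loop (assumed interfaces): the model is assumed to return a
    # scalar loss when called with (inputs, targets).
    running_loss = 0.0  # a Python float, so no autograd graph is retained
    for step, (inputs, targets) in enumerate(loader):
        inputs, targets = inputs.to(device), targets.to(device)

        optimizer.zero_grad(set_to_none=True)
        loss = model(inputs, targets)
        loss.backward()
        optimizer.step()

        # .item() extracts a detached Python number; storing `loss` itself in a
        # list would keep each iteration's graph alive and memory would creep up.
        running_loss += loss.item()

        if step % 100 == 0:
            peak_gib = torch.cuda.max_memory_allocated(device) / 1024**3
            print(f"step {step}: avg loss {running_loss / (step + 1):.4f}, "
                  f"peak GPU memory {peak_gib:.2f} GiB")
            torch.cuda.reset_peak_memory_stats(device)
```

If the reported peak keeps climbing across epochs even though the input sizes are unchanged, the usual suspects are lists of un-detached tensors (losses or predictions cached for metrics) or hooks that store activations.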

GX77 (Collaborator) commented Mar 20, 2025

Thank you for your interest in our work.

  1. Training Environment: I have not trained on a single GPU; I have always used at least 16 GPUs. If you run into OOM, you can try reducing the input resolution.
  2. Single-GPU Training: The batch size per GPU is fixed at 1, so the effective batch size is simply the number of GPUs used for training (see the sketch after this list). The current code only supports a per-GPU batch size of 1, because a per-GPU batch size of 2 is prone to OOM.
  3. Text Length: Since the per-GPU batch size is 1, the text length does not need to be fixed.
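To make the batch-size arithmetic concrete: with the per-GPU batch size fixed at 1, the effective batch size is just the number of processes in the distributed job (16 GPUs gives batch 16, 64 GPUs gives batch 64). A minimal sketch assuming a standard PyTorch DDP launch with `torchrun`; the script name and details below are illustrative assumptions, not this repository's actual entry point:

```python
import os
import torch
import torch.distributed as dist

def effective_batch_size(per_gpu_batch_size: int = 1) -> int:
    # With 1 video per GPU, the effective batch size equals the world size,
    # i.e. the total number of GPUs across all nodes.
    world_size = dist.get_world_size() if dist.is_initialized() else 1
    return per_gpu_batch_size * world_size

if __name__ == "__main__":
    # Hypothetical launch: torchrun --nnodes=2 --nproc_per_node=8 this_script.py
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    if dist.get_rank() == 0:
        print(f"effective batch size: {effective_batch_size(1)}")
    dist.destroy_process_group()
```

On fewer GPUs, gradient accumulation (stepping the optimizer only every N iterations) can approximate a larger effective batch, though it does not reproduce batch-dependent behaviour such as BatchNorm statistics.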

gingseo (Author) commented Mar 20, 2025

Thank you very much.
I saw in the paper that VidSTG used a batch size of 64. Does that mean 64 A100 GPUs were used?

GX77 closed this as completed on Apr 24, 2025