training problems.. #18

Closed
gingseo opened this issue Mar 4, 2025 · 6 comments

gingseo commented Mar 4, 2025

I'm trying to train your model from scratch on the provided dataset (no transfer learning).
Even with the backbone frozen, I'm running into OOM errors and very long training times on a single A100 (80 GB).

Could you share the setup you used for training? Specifically, which GPU you used, how much memory it had, and how long training took? Also, do certain datasets take much longer to train than others?
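For context, freezing a backbone in PyTorch usually means disabling gradients for its parameters so that only the remaining modules are trained; it also saves memory, since frozen weights need no gradient buffers or optimizer state. A minimal, generic sketch (the `model.backbone` attribute and the commented-out optimizer are placeholder assumptions, not this repository's actual code):

```python
import torch

def freeze_backbone(model: torch.nn.Module) -> None:
    # Hypothetical helper: assumes the model exposes a `backbone` submodule.
    for param in model.backbone.parameters():
        param.requires_grad_(False)   # no gradients are computed or stored
    model.backbone.eval()             # also freezes BatchNorm running statistics

# Hand only the trainable parameters to the optimizer so the frozen weights
# carry no optimizer state (e.g. Adam moments), which saves GPU memory:
# optimizer = torch.optim.AdamW(
#     (p for p in model.parameters() if p.requires_grad), lr=1e-4)
```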

HengLan (Owner) commented Mar 4, 2025

Please check the GPU requirements for running the code here: #4

gingseo (Author) commented Mar 4, 2025

Thank you, Professor.
In my case, the estimated training time keeps increasing during training, so I suspect a memory leak. 😭😭
Do you have any idea what might be causing this?

GX77 (Collaborator) commented Mar 10, 2025

What is your current training setup?

gingseo (Author) commented Mar 18, 2025

Thank you for your attention. I am currently training on a single A100 (80 GB). As shown in the attached image, memory usage rises gradually and at first appears to plateau, but it keeps creeping upward until it eventually results in an OOM error. Have you encountered this issue as well?

I also have a second question.
From the code, I see that a batch size of 1 (i.e., one video) is assigned per GPU. However, in the paper, I found that the batch size was increased to 64 for VidSTG. Additionally, I read in a GitHub issue that HCSTVG-v1 was trained on 16 GPUs. Was the batch size fixed at 1 per GPU? If so, would it be possible to increase the batch size given sufficient memory? I am curious about how the batch sizes of 16, 32, and 64 were managed in the paper.

My third question is about the text processing in the code. I noticed that there is an implementation to truncate text exceeding 26 tokens, but it does not seem to be executed. Since the transformer processes only a single sample at a time (batch size 1), does this mean that variable-length text is fed directly into the model without truncation? If a larger batch size were possible, would additional logic be needed to handle text-length adjustments (e.g., padding or truncation)?

I know this is a long question, but your responses have been incredibly helpful for our research. Thank you!

[Image: GPU memory usage increasing over the course of training]
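A common cause of this slow, unbounded memory growth in PyTorch training loops is keeping tensors that are still attached to the autograd graph, for example appending the raw loss to a log list instead of calling `loss.item()`. Below is a minimal sketch of a loop that logs scalars safely and tracks peak GPU memory so the growth can be spotted early; it is a generic example with assumed names, not this repository's training code:

```python
import torch

def train_one_epoch(model, loader, optimizer, device="cuda"):
    # Generic loop (assumed interfaces): the model is assumed to return a
    # scalar loss when called with (inputs, targets).
    running_loss = 0.0  # a Python float, so no autograd graph is retained
    for step, (inputs, targets) in enumerate(loader):
        inputs, targets = inputs.to(device), targets.to(device)

        optimizer.zero_grad(set_to_none=True)
        loss = model(inputs, targets)
        loss.backward()
        optimizer.step()

        # .item() extracts a detached Python number; storing `loss` itself in a
        # list would keep each iteration's graph alive and memory would creep up.
        running_loss += loss.item()

        if step % 100 == 0:
            peak_gib = torch.cuda.max_memory_allocated(device) / 1024**3
            print(f"step {step}: avg loss {running_loss / (step + 1):.4f}, "
                  f"peak GPU memory {peak_gib:.2f} GiB")
            torch.cuda.reset_peak_memory_stats(device)
```

If the reported peak keeps climbing across epochs even though the input sizes are unchanged, the usual suspects are lists of un-detached tensors (losses or predictions cached for metrics) or hooks that store activations.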

GX77 (Collaborator) commented Mar 20, 2025

Thank you for your interest in our work.

  1. Training Environment: I have not trained on a single GPU; I have always used at least 16 GPUs. If you run into OOM, you can try reducing the input resolution.
  2. Single-GPU Training: The batch size per GPU is fixed at 1, so the effective batch size is simply the number of GPUs used for training (see the sketch after this list). The current code only supports a per-GPU batch size of 1, because a per-GPU batch size of 2 is prone to OOM.
  3. Text Length: Since the per-GPU batch size is 1, the text length does not need to be fixed.
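To make the batch-size arithmetic concrete: with the per-GPU batch size fixed at 1, the effective batch size is just the number of processes in the distributed job (16 GPUs gives batch 16, 64 GPUs gives batch 64). A minimal sketch assuming a standard PyTorch DDP launch with `torchrun`; the script name and details below are illustrative assumptions, not this repository's actual entry point:

```python
import os
import torch
import torch.distributed as dist

def effective_batch_size(per_gpu_batch_size: int = 1) -> int:
    # With 1 video per GPU, the effective batch size equals the world size,
    # i.e. the total number of GPUs across all nodes.
    world_size = dist.get_world_size() if dist.is_initialized() else 1
    return per_gpu_batch_size * world_size

if __name__ == "__main__":
    # Hypothetical launch: torchrun --nnodes=2 --nproc_per_node=8 this_script.py
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    if dist.get_rank() == 0:
        print(f"effective batch size: {effective_batch_size(1)}")
    dist.destroy_process_group()
```

On fewer GPUs, gradient accumulation (stepping the optimizer only every N iterations) can approximate a larger effective batch, though it does not reproduce batch-dependent behaviour such as BatchNorm statistics.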

gingseo (Author) commented Mar 20, 2025

Thank you very much.
I saw in the paper that VidSTG used a batch size of 64. Does that mean 64 A100 GPUs were used?

GX77 closed this as completed on Apr 24, 2025