lightning and apex amp performance not improved #2699
Comments
Can you check your GPU utilization without fp16 and make sure it is high? Low GPU utilization indicates that GPU compute is not the bottleneck, and if that is the case, speeding up your forward/backward pass won't help much.
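A quick way to do that check is to poll NVML while a non-fp16 run is in progress. The snippet below is only a sketch, assuming the pynvml package is installed; watching nvidia-smi by hand works just as well.

```python
# Minimal GPU-utilization poller (assumes the pynvml package is installed).
# Run it alongside a non-fp16 training run; consistently low numbers mean
# GPU compute is not the bottleneck.
import time
import pynvml

pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
           for i in range(pynvml.nvmlDeviceGetCount())]

for _ in range(60):  # sample once per second for about a minute
    utils = [pynvml.nvmlDeviceGetUtilizationRates(h).gpu for h in handles]
    print("GPU utilization %:", utils)
    time.sleep(1.0)

pynvml.nvmlShutdown()
```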
I commented out the VGG loss and found that the bottleneck for model_backward (method 'item' of 'torch._C._TensorBase' objects) is still there, but model_forward no longer shows the {method 'new_tensor' of 'torch._C._TensorBase' objects} bottleneck.
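For context, per-action cProfile output like the one quoted here (model_forward, model_backward) is what Lightning's AdvancedProfiler produces. A minimal sketch of how such a report can be generated, assuming a Lightning release from this era where AdvancedProfiler takes an output_filename argument, and reusing the model and loaders from the issue body below:

```python
import pytorch_lightning as pl
from pytorch_lightning.profiler import AdvancedProfiler

# AdvancedProfiler wraps each Lightning action (model_forward, model_backward, ...)
# in cProfile, which is where lines like
# "{method 'item' of 'torch._C._TensorBase' objects}" come from.
profiler = AdvancedProfiler(output_filename="profile_report.txt")
trainer = pl.Trainer(gpus=8, distributed_backend='ddp', profiler=profiler)
trainer.fit(model, train_dataloader=train_loader, val_dataloaders=val_loader)
```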
It turns out that I used '.new_tensor' to normalize the input every time I called the VGG loss, which caused a lot of extra work in model_forward. Using PyTorch 1.6 native amp instead of Apex solved the model_backward problem. I don't know why Apex amp doesn't help; maybe it's because I use 'ddp' as the backend. Thanks for your help!
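For reference, the '.new_tensor' pattern can be avoided by creating the normalization constants once and registering them as buffers, so they move with the module to the right device instead of being allocated on every call. The class below is only an illustrative sketch (a generic VGG16 perceptual loss with the usual ImageNet mean/std), not the actual code from this issue:

```python
import torch
import torch.nn as nn
import torchvision

class VGGPerceptualLoss(nn.Module):
    """Sketch: normalization constants are created once, not per forward call."""

    def __init__(self):
        super().__init__()
        vgg = torchvision.models.vgg16(pretrained=True).features[:16].eval()
        for p in vgg.parameters():
            p.requires_grad = False
        self.features = vgg
        # Buffers are allocated once and follow .to(device)/.cuda(), avoiding
        # x.new_tensor(...) allocations on every forward pass.
        self.register_buffer("mean", torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1))
        self.register_buffer("std", torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1))

    def forward(self, pred, target):
        pred = (pred - self.mean) / self.std
        target = (target - self.mean) / self.std
        return nn.functional.l1_loss(self.features(pred), self.features(target))
```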
❓ lightning and apex amp performance not improved
I'm trying to use Lightning and Apex amp to speed up DDP training. I tried amp_level O0, O1, O2, and O3, and they all take almost the same time (around 45 minutes).
```python
train_loader = DataLoader(dataset=train_dataset, batch_size=2, shuffle=True, num_workers=4)
val_loader = DataLoader(dataset=val_dataset, batch_size=1, shuffle=False, num_workers=4)

trainer = pl.Trainer(gpus=8, num_nodes=1, distributed_backend='ddp', precision=16, amp_level='O1')
trainer.fit(model, train_dataloader=train_loader, val_dataloaders=val_loader)
```
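Related to the resolution in the comments above: with PyTorch 1.6+, roughly the same Trainer call can use native AMP instead of Apex. A sketch, with the caveat that the exact argument names (amp_backend, distributed_backend vs. accelerator) vary across Lightning versions:

```python
trainer = pl.Trainer(
    gpus=8,
    num_nodes=1,
    distributed_backend='ddp',
    precision=16,           # with PyTorch >= 1.6 this can run on torch.cuda.amp
    amp_backend='native',   # instead of Apex with amp_level='O1'
)
trainer.fit(model, train_dataloader=train_loader, val_dataloaders=val_loader)
```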
I didn't change the batch size to be a multiple of 8 because I saw this post and my cuDNN version is 7.6.5.
Thanks!
What have you tried?
I also tried torch.backends.cudnn.benchmark = True but got no improvement.
What's your environment?