Possible bug with O1 and FusedLayerNorm #760

Closed
quanpn90 opened this issue Mar 17, 2020 · 6 comments

Comments

@quanpn90

quanpn90 commented Mar 17, 2020

Cuda version: 10.1.243
Torch version: 1.3.1

FusedLayerNorm (patched with O1 amp) seems not to accept inputs of type Half.

In the following snippet:


import torch
from apex import amp
from apex.normalization.fused_layer_norm import FusedLayerNorm
torch.cuda.set_device(1)

class NeuralNet(torch.nn.Module):

    def __init__(self, d_in, d_out):
        super().__init__()
        self.d_in = d_in
        self.d_out = d_out

        self.norm = torch.nn.LayerNorm(d_in)
        self.norm2 = torch.nn.LayerNorm(d_out)
        self.linear = torch.nn.Linear(d_in, d_out)
        self.linear2 = torch.nn.Linear(d_out, d_out)

    def forward(self, input):

        input = self.norm(input)
        print(input.type())
        output = self.linear(input)
        print(output.type())
        output = torch.relu(output)
        print(output.type())
        output = self.norm2(output)
        output = self.linear2(output)
        print(output.type())
        output = torch.nn.functional.log_softmax(output, dim=-1)
        print("end")
        return output

model = NeuralNet(500, 1000)
model = model.cuda()
loss_function = torch.nn.NLLLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

for i in range(1000):
    x = torch.rand(128, 500).cuda()
    o = model(x).float()
    y = torch.randint(low=0, high=1000, size=(128,)).cuda()  # targets over all 1000 classes
    loss = loss_function(o, y)
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()

    optimizer.step()
    optimizer.zero_grad()

The snippet works fine, but if I replace nn.LayerNorm with FusedLayerNorm, it fails with the following error:

RuntimeError: expected scalar type Half but found Float (data_ptr<c10::Half> at .../torch/include/ATen/core/TensorMethods.h:5747)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x47 (0x7ff61d171687 in ..../torch/lib/libc10.so)
frame #1: c10::Half* at::Tensor::data_ptr<c10::Half>() const + 0x3ee (0x7ff5fae5188e in .../lib/python3.7/site-packages/fused_layer_norm_cuda.cpython-37m-x86_64-linux-gnu.so)
frame #2: cuda_layer_norm(at::Tensor*, at::Tensor*, at::Tensor*, at::Tensor*, int, int, c10::ArrayRef, at::Tensor*, at::Tensor*, double) + 0x4c5 (0x7ff5fae4d825 in .../python3.7/site-packages/fused_layer_norm_cuda.cpython-37m-x86_64-linux-gnu.so)
frame #3: layer_norm_affine(at::Tensor, c10::ArrayRef, at::Tensor, at::Tensor, double) + 0x2a4 (0x7ff5fae380c4 in .../lib/python3.7/site-packages/fused_layer_norm_cuda.cpython-37m-x86_64-linux-gnu.so)
frame #4: + 0x217e4 (0x7ff5fae4b7e4 in .../python3.7/site-packages/fused_layer_norm_cuda.cpython-37m-x86_64-linux-gnu.so)
frame #5: + 0x1ef2a (0x7ff5fae48f2a in ..../python3.7/site-packages/fused_layer_norm_cuda.cpython-37m-x86_64-linux-gnu.so)

frame #10: THPFunction_apply(_object*, _object*) + 0x8d6 (0x7ff64de7f086 in ..../lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #42: __libc_start_main + 0xe7 (0x7ff65d835b97 in /lib/x86_64-linux-gnu/libc.so.6)

Context: I have been using O2 for a long time, but there is a problem with loading/saving checkpoints, which is why I want to switch to O1. I could make my code work with O1 by adding several type conversions (sketched below), but the speed still isn't ideal. Is that expected (especially for sequence-to-sequence models)?
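To illustrate, the manual conversion I mean is roughly the following (a sketch only; the wrapper class name is made up):

import torch
from apex.normalization.fused_layer_norm import FusedLayerNorm

class CastingFusedLayerNorm(FusedLayerNorm):  # hypothetical wrapper, not part of apex
    def forward(self, input):
        # Under O1 the layer's weight/bias stay FP32 while upstream whitelisted
        # ops emit FP16, so cast the input up to the parameter dtype for the
        # fused kernel, then cast the result back so the rest of the network
        # keeps its original dtype. Assumes elementwise_affine=True (the
        # default), so self.weight exists.
        out = super().forward(input.to(self.weight.dtype))
        return out.to(input.dtype)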

Thank you for your help!
Best

@vgoklani

vgoklani commented Jul 9, 2020

Same issue here! Thanks

@hitvoice

I also ran into this error. It seems that under O1, FusedLayerNorm can only accept FP32 inputs.
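A minimal check outside of amp seems to confirm this (a sketch):

import torch
from apex.normalization.fused_layer_norm import FusedLayerNorm

norm = FusedLayerNorm(500).cuda()  # weight/bias are created in FP32
x = torch.rand(128, 500).cuda()

norm(x)          # FP32 input matches the FP32 parameters: runs fine
norm(x.half())   # FP16 input vs. FP32 weight: raises the dtype-mismatch
                 # RuntimeError shown above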

@yhbian

yhbian commented Mar 8, 2021

Same issue!

@RQsky

RQsky commented Nov 29, 2021

mark

@jinderek

Same issue

@quanpn90
Author

quanpn90 commented May 11, 2022

This happens because O1 only casts inputs to half for functions that are explicitly whitelisted; FusedLayerNorm is not on that list, so its kernel ends up receiving FP16 activations together with FP32 parameters.

https://github.com/NVIDIA/apex/blob/master/apex/amp/README.md

The solution is to add @amp.half_function on top of the fused layer norm function (which means editing the apex source); see the registration sketch below for an alternative.
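Without editing the apex source, the same effect should be achievable through the registration API described in that README (a sketch; the module-level function name used here, fused_layer_norm_affine, varies between apex versions):

from apex import amp
import apex.normalization.fused_layer_norm as fused_layer_norm

# Must run before amp.initialize(...). Once registered, O1 casts all
# arguments of the listed function (the FP16 activations and the FP32
# weight/bias) to half before the fused CUDA kernel is called.
amp.register_half_function(fused_layer_norm, 'fused_layer_norm_affine')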
In any case, the fused layer norm is outdated and can even be slower than the official PyTorch function, and O1 is inferior to native PyTorch AMP, so there shouldn't be any reason to use either anymore.
