Possible bug with O1 and FusedLayerNorm #760

Closed
quanpn90 opened this issue Mar 17, 2020 · 6 comments

Comments

@quanpn90

quanpn90 commented Mar 17, 2020

Cuda version: 10.1.243
Torch version: 1.3.1

FusedLayerNorm (patched with O1 amp) seems not to accept inputs of type Half.

In the following snippet:


import torch
from apex import amp
from apex.normalization.fused_layer_norm import FusedLayerNorm
torch.cuda.set_device(1)

class NeuralNet(torch.nn.Module):

    def __init__(self, d_in, d_out):
        super().__init__()
        self.d_in = d_in
        self.d_out = d_out

        self.norm = torch.nn.LayerNorm(d_in)
        self.norm2 = torch.nn.LayerNorm(d_out)
        self.linear = torch.nn.Linear(d_in, d_out)
        self.linear2 = torch.nn.Linear(d_out, d_out)

    def forward(self, input):

        input = self.norm(input)
        print(input.type())
        output = self.linear(input)
        print(output.type())
        output = torch.relu(output)
        print(output.type())
        output = self.norm2(output)
        output = self.linear2(output)
        print(output.type())
        output = torch.nn.functional.log_softmax(output, dim=-1)
        print("end")
        return output

model = NeuralNet(500, 1000)
model = model.cuda()
loss_function = torch.nn.NLLLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

for i in range(1000):
    x = torch.rand(128, 500).cuda()
    o = model(x).float()
    y = torch.randint(low=0, high=1000, size=(128,)).cuda()  # targets over all 1000 classes
    loss = loss_function(o, y)
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()

    optimizer.step()
    optimizer.zero_grad()

The snippet works fine, but if I replace nn.LayerNorm with FusedLayerNorm, it fails with the following error:

RuntimeError: expected scalar type Half but found Float (data_ptr<c10::Half> at .../torch/include/ATen/core/TensorMethods.h:5747)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x47 (0x7ff61d171687 in ..../torch/lib/libc10.so)
frame #1: c10::Half* at::Tensor::data_ptr<c10::Half>() const + 0x3ee (0x7ff5fae5188e in .../lib/python3.7/site-packages/fused_layer_norm_cuda.cpython-37m-x86_64-linux-gnu.so)
frame #2: cuda_layer_norm(at::Tensor*, at::Tensor*, at::Tensor*, at::Tensor*, int, int, c10::ArrayRef, at::Tensor*, at::Tensor*, double) + 0x4c5 (0x7ff5fae4d825 in .../python3.7/site-packages/fused_layer_norm_cuda.cpython-37m-x86_64-linux-gnu.so)
frame #3: layer_norm_affine(at::Tensor, c10::ArrayRef, at::Tensor, at::Tensor, double) + 0x2a4 (0x7ff5fae380c4 in .../lib/python3.7/site-packages/fused_layer_norm_cuda.cpython-37m-x86_64-linux-gnu.so)
frame #4: + 0x217e4 (0x7ff5fae4b7e4 in .../python3.7/site-packages/fused_layer_norm_cuda.cpython-37m-x86_64-linux-gnu.so)
frame #5: + 0x1ef2a (0x7ff5fae48f2a in ..../python3.7/site-packages/fused_layer_norm_cuda.cpython-37m-x86_64-linux-gnu.so)

frame #10: THPFunction_apply(_object*, _object*) + 0x8d6 (0x7ff64de7f086 in ..../lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #42: __libc_start_main + 0xe7 (0x7ff65d835b97 in /lib/x86_64-linux-gnu/libc.so.6)

Context: I have been using O2 for a long time, but there is a problem with loading/saving checkpoints, which is why I want to switch to O1. I could make my code work with O1 by adding several type conversions (sketched below), but the speed still isn't ideal. Is that expected (especially for sequence-to-sequence models)?
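To illustrate, the manual conversion I mean is roughly the following (a sketch only; the wrapper class name is made up):

import torch
from apex.normalization.fused_layer_norm import FusedLayerNorm

class CastingFusedLayerNorm(FusedLayerNorm):  # hypothetical wrapper, not part of apex
    def forward(self, input):
        # Under O1 the layer's weight/bias stay FP32 while upstream whitelisted
        # ops emit FP16, so cast the input up to the parameter dtype for the
        # fused kernel, then cast the result back so the rest of the network
        # keeps its original dtype. Assumes elementwise_affine=True (the
        # default), so self.weight exists.
        out = super().forward(input.to(self.weight.dtype))
        return out.to(input.dtype)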

Thank you for your help!
Best

@vgoklani

vgoklani commented Jul 9, 2020

Same issue here! Thanks

@hitvoice

I also ran into this error. It seems that under O1, FusedLayerNorm can only accept FP32 inputs.
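A minimal check outside of amp seems to confirm this (a sketch):

import torch
from apex.normalization.fused_layer_norm import FusedLayerNorm

norm = FusedLayerNorm(500).cuda()  # weight/bias are created in FP32
x = torch.rand(128, 500).cuda()

norm(x)          # FP32 input matches the FP32 parameters: runs fine
norm(x.half())   # FP16 input vs. FP32 weight: raises the dtype-mismatch
                 # RuntimeError shown above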

@yhbian

yhbian commented Mar 8, 2021

Same issue!

@RQsky

RQsky commented Nov 29, 2021

mark

@jinderek

Same issue

@quanpn90
Author

quanpn90 commented May 11, 2022

This happens because O1 only casts inputs to half for functions that are explicitly whitelisted; FusedLayerNorm is not on that list, so its kernel ends up receiving FP16 activations together with FP32 parameters.

https://github.com/NVIDIA/apex/blob/master/apex/amp/README.md

The solution is to add @amp.half_function on top of the fused layer norm function (which means editing the apex source); see the registration sketch below for an alternative.
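Without editing the apex source, the same effect should be achievable through the registration API described in that README (a sketch; the module-level function name used here, fused_layer_norm_affine, varies between apex versions):

from apex import amp
import apex.normalization.fused_layer_norm as fused_layer_norm

# Must run before amp.initialize(...). Once registered, O1 casts all
# arguments of the listed function (the FP16 activations and the FP32
# weight/bias) to half before the fused CUDA kernel is called.
amp.register_half_function(fused_layer_norm, 'fused_layer_norm_affine')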
In any case, the fused layer norm is outdated and can even be slower than the official PyTorch function, and O1 is inferior to native PyTorch AMP, so there shouldn't be any reason to use either anymore.
