Floating point exception (core dumped) #8702

Closed
kanchangcheng opened this issue Mar 2, 2018 · 12 comments
Labels
User (used to mark user questions)

Comments

kanchangcheng commented Mar 2, 2018

*** Aborted at 1519973684 (unix time) try "date -d @1519973684" if you are using GNU date ***
PC: @                0x0 (unknown)
*** SIGFPE (@0x7f89bd8a840a) received by PID 46 (TID 0x7f8793992700) from PID 18446744072594555914; stack trace: ***
    @     0x7f89da7c6390 (unknown)
    @     0x7f89bd8a840a _ZN6paddle17AssignCpuEvaluateIRNS_14TensorAssignOpINS_11BaseMatrixTIfEENS_14TensorBinaryOpIN4hppl6binary3addIfEEKNS_13TensorUnaryOpINS5_5unary9mul_scaleIfEEKS3_fEESF_fEEfEEJRNS1_IS3_NS4_IS8_SF_KNS9_ISC_KNS9_INSA_6squareIfEESD_fEEfEEfEEfEERNS1_IS3_NS4_INS6_3subIfEESD_KNS4_INS6_3divIfEESF_KNS9_INSA_9add_scaleIfEEKNS9_INSA_7sqrt_opIfEESD_fEEfEEfEEfEEfEEEEEviibOT_DpOT0_
    @     0x7f89bd8ab637 _ZN6paddle14AssignEvaluateIRNS_14TensorAssignOpINS_11BaseMatrixTIfEENS_14TensorBinaryOpIN4hppl6binary3addIfEEKNS_13TensorUnaryOpINS5_5unary9mul_scaleIfEEKS3_fEESF_fEEfEEJRNS1_IS3_NS4_IS8_SF_KNS9_ISC_KNS9_INSA_6squareIfEESD_fEEfEEfEEfEERNS1_IS3_NS4_INS6_3subIfEESD_KNS4_INS6_3divIfEESF_KNS9_INSA_9add_scaleIfEEKNS9_INSA_7sqrt_opIfEESD_fEEfEEfEEfEEfEEEEEvOT_DpOT0_
    @     0x7f89bd8a4abb paddle::adamApply()
    @     0x7f89bd894496 paddle::AdamParameterOptimizer::update()
    @     0x7f89bd894956 paddle::OptimizerWithGradientClipping::update()
    @     0x7f89bd88906f paddle::SgdThreadUpdater::threadUpdateDense()
    @     0x7f89bd88a0ef _ZNSt17_Function_handlerIFvimEZN6paddle16SgdThreadUpdater11finishBatchEfEUlimE_E9_M_invokeERKSt9_Any_dataim
    @     0x7f89bd6aa39c _ZNSt6thread5_ImplISt12_Bind_simpleIFZN6paddle14SyncThreadPool5startEvEUliE_mEEE6_M_runEv
    @     0x7f89cadebc80 (unknown)
    @     0x7f89da7bc6ba start_thread
    @     0x7f89da4f241d clone
    @                0x0 (unknown)
Floating point exception (core dumped)

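I hit this error during training. I have already looked at similar issues, but the error is still unresolved. Any help would be appreciated!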

chengduoZH added the "User" (used to mark user questions) label Mar 2, 2018
@chengduoZH (Contributor)

Could you describe the environment in which your model was trained?

shboy commented Mar 2, 2018

I ran it inside this Docker image: docker.paddlepaddlehub.com/paddle latest-gpu @chengduoZH

I pulled it just two days ago, so it should be the latest version.

@chengduoZH (Contributor)

How are the parameters of your Adam optimizer set?

shboy commented Mar 2, 2018

lr = 0.000002
Adam_optimizer = paddle.optimizer.Adam(
    learning_rate=lr,
    beta1=0.9, beta2=0.999, epsilon=0, gradient_clipping_threshold=10.0)
@chengduoZH

shboy commented Mar 2, 2018

We previously trained on the same data with Keras and had no problems.

@chengduoZH (Contributor)

Related issues: #2262 and #2563

shboy commented Mar 5, 2018

    # Runs inside the trainer's event handler; `paddle` and `parameters`
    # come from the surrounding training script.
    if isinstance(event, paddle.event.EndForwardBackward):
        # Append every parameter and its gradient after each forward/backward pass.
        with open("para_grad.txt", "a+") as f_para_grad:
            for p in parameters.keys():
                param = parameters.get(p)
                grad = parameters.get_grad(p)
                print("Param %s, Grad %s" % (param, grad))
                f_para_grad.write("Param %s\n" % p)
                f_para_grad.write(" ".join(str(item) for item in param) + "\n")
                f_para_grad.write("Grad %s\n" % p)
                f_para_grad.write(" ".join(str(item) for item in grad) + "\n")

I printed out the gradients, and they do not seem to be wrong either.
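
Reading a dump of this size by eye makes it easy to miss a single bad entry. A minimal sketch of an automated check, assuming the dumped values can be read back as a flat sequence of numbers; `summarize_values` is a hypothetical helper written for illustration and is not part of Paddle:

```python
import math


def summarize_values(values):
    """Count NaN, infinite, and exactly-zero entries in an iterable of numbers."""
    counts = {"total": 0, "nan": 0, "inf": 0, "zero": 0}
    for x in values:
        x = float(x)
        counts["total"] += 1
        if math.isnan(x):
            counts["nan"] += 1
        elif math.isinf(x):
            counts["inf"] += 1
        elif x == 0.0:
            counts["zero"] += 1
    return counts


# Made-up sample values, only to show the call.
print(summarize_values([0.5, 0.0, float("nan"), float("inf"), -1.25]))
```

Entries that are exactly zero are worth counting alongside NaN and Inf, because they can matter wherever a later step divides by a quantity derived from the gradient.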

shboy commented Mar 5, 2018

lr = 0.000002
Adam_optimizer = paddle.optimizer.Adam(
    learning_rate=lr,
    beta1=0.9, beta2=0.999, epsilon=0, gradient_clipping_threshold=10.0)

I removed gradient_clipping_threshold=10.0, but I still get the same error.

@chengduoZH (Contributor)

Adam_optimizer = paddle.optimizer.Adam(
    learning_rate=lr,
    beta1=0.9, beta2=0.999, epsilon=0, gradient_clipping_threshold=10.0)

Do not set epsilon to 0. epsilon is normally a very small value, for example 0.000001. If you do not set it here, Adam will use its default epsilon.
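
Why epsilon=0 can crash: the mangled symbols in the stack trace mention `div`, `add_scale` and `sqrt_op`, which fits the textbook Adam update, where the step is divided by sqrt(v) + epsilon. The sketch below is a scalar version of that textbook rule, not Paddle's actual kernel; the function name `adam_step` and its signature are assumptions made for illustration. For a parameter whose gradient has been exactly zero so far, v is 0, so with epsilon=0 the divisor is 0. Plain Python raises `ZeroDivisionError` here; in native code the same 0/0 would presumably surface as SIGFPE only if floating-point traps are enabled, which is also an assumption.

```python
import math


def adam_step(param, grad, m, v, t,
              lr=2e-6, beta1=0.9, beta2=0.999, epsilon=1e-8):
    """One scalar Adam update in its textbook form (illustrative only)."""
    m = beta1 * m + (1.0 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1.0 - beta1 ** t)                # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    denom = math.sqrt(v_hat) + epsilon            # 0.0 when v_hat == 0 and epsilon == 0
    return param - lr * m_hat / denom, m, v


# A parameter that has only ever seen a zero gradient.
try:
    adam_step(1.0, 0.0, 0.0, 0.0, t=1, epsilon=0.0)
except ZeroDivisionError as exc:
    print("epsilon=0 failed:", exc)

new_param, _, _ = adam_step(1.0, 0.0, 0.0, 0.0, t=1, epsilon=1e-6)
print("epsilon=1e-6 gives:", new_param)
```

With any positive epsilon the divisor stays non-zero, so a zero gradient simply leaves the parameter unchanged.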

@chengduoZH (Contributor)

The problem has been solved.

@Littlehead27

[2023/05/24 20:21:18] ppocr INFO: cur metric, precision: 0, recall: 0, hmean: 0, fps: 7.03872743678866
[2023/05/24 20:21:35] ppocr INFO: save best model is to ./output/re_vi_layoutxlm_xfund_zh/best_accuracy
[2023/05/24 20:21:35] ppocr INFO: best metric, hmean: 0, precision: 0, recall: 0, fps: 7.03872743678866, best_epoch: 1
[2023/05/24 20:21:37] ppocr INFO: epoch: [1/50], global_step: 210, lr: 0.000004, loss: 0.267303, avg_reader_cost: 0.00025 s, avg_batch_cost: 0.19397 s, avg_samples: 1.0, ips: 5.15534 samples/s, eta: 1:44:15
[2023/05/24 20:21:39] ppocr INFO: epoch: [1/50], global_step: 220, lr: 0.000004, loss: 0.204350, avg_reader_cost: 0.00019 s, avg_batch_cost: 0.23311 s, avg_samples: 1.0, ips: 4.28986 samples/s, eta: 1:41:35
[2023/05/24 20:21:42] ppocr INFO: epoch: [1/50], global_step: 230, lr: 0.000005, loss: 0.237258, avg_reader_cost: 0.00019 s, avg_batch_cost: 0.19782 s, avg_samples: 1.0, ips: 5.05522 samples/s, eta: 1:38:50
[2023/05/24 20:21:44] ppocr INFO: epoch: [1/50], global_step: 240, lr: 0.000005, loss: 0.265792, avg_reader_cost: 0.00019 s, avg_batch_cost: 0.18138 s, avg_samples: 1.0, ips: 5.51319 samples/s, eta: 1:36:10
Floating point exception (core dumped)

I get this error when training the RE model.

@Littlehead27

"The problem has been solved."

Hello, the problem is still not solved on my side.

4 participants