
[MXNET-171] Fix a bug that was causing training accuracy to be printed as nan sometimes #10437

Merged: 5 commits merged into apache:master from indhub:metric_nan on Apr 9, 2018

Conversation

indhub (Contributor) commented on Apr 6, 2018

Description

Fix a bug (issue #4253) that was causing training accuracy to be printed as nan sometimes

The "Epoch[x] Train-accuracy=xxx" line printed at the end of every epoch gives the impression that the accuracy metric is for the entire epoch. In reality, it is NOT.

For example, if an epoch consists of 101 batches and the callback was created to print metrics every 10 batches, what gets printed as 'Train-accuracy=xxx' at the end of the epoch is actually just the accuracy of a single batch (the 101st). Printing this as 'Epoch[x] Train-accuracy=xxx' is very misleading.

Ideally we should remove this misleading print statement. However, there are surely many existing scripts out there that parse this line, and we can't remove it without breaking them. Since that would be a breaking change, let's do it in a major version change.

For now, to avoid the nan error, we can avoid resetting the metrics when processing the callback for the last batch. They will be reset at the beginning of the next epoch anyway.
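
To make the failure mode concrete, here is a minimal, self-contained sketch. ToyAccuracy and batch_callback are hypothetical stand-ins, not MXNet's real classes; they only mimic an auto-resetting, Speedometer-style callback. When the epoch length is an exact multiple of the print frequency (120 batches, printing every 10, as in the logs below), the last callback resets the metric and the epoch-end read becomes 0/0 = nan:

```python
# Toy reproduction of the nan. ToyAccuracy and batch_callback are
# hypothetical stand-ins for an MXNet EvalMetric and an auto-resetting
# Speedometer-style callback; they are not the real API.

class ToyAccuracy:
    def __init__(self):
        self.correct, self.total = 0, 0

    def update(self, correct, total):
        self.correct += correct
        self.total += total

    def reset(self):
        self.correct, self.total = 0, 0

    def get(self):
        # 0/0 right after a reset is the epoch-end nan from the logs
        return self.correct / self.total if self.total else float('nan')

def batch_callback(metric, nbatch, frequent=10):
    # print and auto-reset every `frequent` batches
    if (nbatch + 1) % frequent == 0:
        print('Batch [%d] accuracy=%f' % (nbatch + 1, metric.get()))
        metric.reset()

metric = ToyAccuracy()
for nbatch in range(120):             # epoch length is a multiple of `frequent`
    metric.update(correct=8, total=10)
    batch_callback(metric, nbatch)

# The callback just reset the metric on batch 120, so nothing has
# accumulated since, and the 'epoch' accuracy comes out as nan:
print('Epoch[0] Train-accuracy=%f' % metric.get())
```

With 101 batches instead of 120, the final get() would instead return just the 101st batch's accuracy, which is the misleading single-batch case described above.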

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:

indhub requested a review from szha as a code owner on April 6, 2018 09:08
indhub (Contributor, Author) commented on Apr 6, 2018

Output of finetune notebook before fix:

2018-04-03 23:40:21,316 Epoch[0] Batch [10]	Speed: 368.51 samples/sec	accuracy=0.002131
2018-04-03 23:40:24,773 Epoch[0] Batch [20]	Speed: 370.50 samples/sec	accuracy=0.007812
2018-04-03 23:40:28,226 Epoch[0] Batch [30]	Speed: 370.75 samples/sec	accuracy=0.011719
2018-04-03 23:40:31,684 Epoch[0] Batch [40]	Speed: 370.26 samples/sec	accuracy=0.014063
2018-04-03 23:40:35,161 Epoch[0] Batch [50]	Speed: 368.28 samples/sec	accuracy=0.026562
2018-04-03 23:40:38,624 Epoch[0] Batch [60]	Speed: 369.71 samples/sec	accuracy=0.026562
2018-04-03 23:40:42,108 Epoch[0] Batch [70]	Speed: 367.54 samples/sec	accuracy=0.060937
2018-04-03 23:40:45,583 Epoch[0] Batch [80]	Speed: 368.39 samples/sec	accuracy=0.094531
2018-04-03 23:40:49,061 Epoch[0] Batch [90]	Speed: 368.16 samples/sec	accuracy=0.101562
2018-04-03 23:40:52,542 Epoch[0] Batch [100]	Speed: 367.77 samples/sec	accuracy=0.135937
2018-04-03 23:40:56,040 Epoch[0] Batch [110]	Speed: 366.01 samples/sec	accuracy=0.135937
2018-04-03 23:40:59,541 Epoch[0] Batch [120]	Speed: 365.68 samples/sec	accuracy=0.171875
2018-04-03 23:40:59,542 Epoch[0] Train-accuracy=nan
2018-04-03 23:40:59,543 Epoch[0] Time cost=85.549
2018-04-03 23:41:13,888 Epoch[0] Validation-accuracy=0.237001

Output of finetune notebook after fix:

2018-04-06 08:45:23,057 Epoch[0] Batch [10]	Speed: 366.62 samples/sec	accuracy=0.003551
2018-04-06 08:45:26,538 Epoch[0] Batch [20]	Speed: 367.82 samples/sec	accuracy=0.006250
2018-04-06 08:45:30,026 Epoch[0] Batch [30]	Speed: 367.14 samples/sec	accuracy=0.011719
2018-04-06 08:45:33,512 Epoch[0] Batch [40]	Speed: 367.24 samples/sec	accuracy=0.025781
2018-04-06 08:45:36,997 Epoch[0] Batch [50]	Speed: 367.43 samples/sec	accuracy=0.027344
2018-04-06 08:45:40,489 Epoch[0] Batch [60]	Speed: 366.63 samples/sec	accuracy=0.032031
2018-04-06 08:45:43,973 Epoch[0] Batch [70]	Speed: 367.42 samples/sec	accuracy=0.056250
2018-04-06 08:45:47,469 Epoch[0] Batch [80]	Speed: 366.36 samples/sec	accuracy=0.085156
2018-04-06 08:45:50,969 Epoch[0] Batch [90]	Speed: 365.75 samples/sec	accuracy=0.089844
2018-04-06 08:45:54,482 Epoch[0] Batch [100]	Speed: 364.56 samples/sec	accuracy=0.123438
2018-04-06 08:45:57,982 Epoch[0] Batch [110]	Speed: 365.80 samples/sec	accuracy=0.133594
2018-04-06 08:46:01,484 Epoch[0] Batch [120]	Speed: 365.58 samples/sec	accuracy=0.184375
2018-04-06 08:46:01,485 Epoch[0] Train-accuracy=0.184375
2018-04-06 08:46:01,486 Epoch[0] Time cost=83.268
2018-04-06 08:46:15,864 Epoch[0] Validation-accuracy=0.234112

@@ -523,8 +524,9 @@ def fit(self, train_data, eval_data=None, eval_metric='acc',
                     monitor.toc_print()

                 if batch_end_callback is not None:
+                    arg_eval_metric = eval_metric if not end_of_batch else deepcopy(eval_metric)
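
For context on the review comments that follow, here is a hedged sketch of what this guarded line buys, reusing the hypothetical ToyAccuracy and batch_callback stand-ins from the sketch in the description: on the last batch the callback receives a deep copy, so its auto-reset clears only the copy and the real metric still holds values for the epoch-end print.

```python
from copy import deepcopy

# Reuses the hypothetical ToyAccuracy and batch_callback defined in the
# description's sketch; `end_of_batch` mirrors the flag computed in fit().
eval_metric = ToyAccuracy()
num_batches = 120
for nbatch in range(num_batches):
    eval_metric.update(correct=8, total=10)
    end_of_batch = (nbatch == num_batches - 1)
    # the patched dispatch: pass a copy only on the last batch, so the
    # callback's reset() cannot clear the running epoch-window counts
    arg_eval_metric = eval_metric if not end_of_batch else deepcopy(eval_metric)
    batch_callback(arg_eval_metric, nbatch)

print('Epoch[0] Train-accuracy=%f' % eval_metric.get())   # 0.800000, not nan
```

Note that in the 'after fix' log above, Train-accuracy (0.184375) equals the Batch [120] window accuracy: the copy removes the nan, but as the description says, the line still reflects only the final window, not the whole epoch.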
Review comment from a Contributor:
This is too hacky and potentially very slow

indhub (Contributor, Author) replied on Apr 6, 2018:
Deep copy happens at most once per epoch. That adds negligible time to an epoch.

Hacky - yes. That's because we are trying to print metrics outside the loop that could get cleared inside the loop. Any solution that doesn't modify the existing behavior will be a little hacky.

indhub (Contributor, Author) added:

If you can think of a less hacky solution, please propose it. If not, I think this is better than printing the accuracy as nan in the training log, which creates a bad user experience. A number of users have reported this.

piiswrong (Contributor) commented:
Deep copy is slow and can cause a lot of problems. We can't do this kind of hack in the main package.

piiswrong merged commit 6abcdbb into apache:master on Apr 9, 2018
indhub deleted the metric_nan branch on April 9, 2018 18:37
indhub mentioned this pull request on Apr 25, 2018
rahul003 pushed a commit to rahul003/mxnet that referenced this pull request on Jun 4, 2018:
Fix a bug that was causing training accuracy to be printed as nan sometimes (apache#10437)

* Fix a bug that was causing training accuracy to be printed as nan sometimes.

* Avoid the additional 'arg_eval_metric' variable. There should be no overhead except for the last batch in an epoch.

* Fix lint.

* For the last batch, capture metrics before the callback and use them to print epoch metrics.

* Remove unused import.
zheng-da pushed a commit to zheng-da/incubator-mxnet that referenced this pull request on Jun 28, 2018, with the same commit message.
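
Per the commit messages above, the approach evolved after review: the merged version drops the extra arg_eval_metric variable and instead snapshots the metric's name/value pairs on the last batch, before any callback can reset it, then logs the snapshot at epoch end. A minimal sketch of that shape, assuming only that get_name_value() returns (name, value) pairs as in MXNet's EvalMetric; the rest is an illustrative stand-in:

```python
# Hypothetical stand-in metric exposing get_name_value() like an
# MXNet EvalMetric; the loop shape sketches the merged fix.
class ToyAccuracy:
    def __init__(self):
        self.correct, self.total = 0, 0

    def update(self, correct, total):
        self.correct += correct
        self.total += total

    def reset(self):
        self.correct, self.total = 0, 0

    def get_name_value(self):
        acc = self.correct / self.total if self.total else float('nan')
        return [('accuracy', acc)]

eval_metric = ToyAccuracy()
num_batches = 120
eval_name_vals = None
for nbatch in range(num_batches):
    eval_metric.update(correct=8, total=10)
    if nbatch == num_batches - 1:
        # snapshot *before* the batch-end callback runs
        eval_name_vals = eval_metric.get_name_value()
    if (nbatch + 1) % 10 == 0:
        eval_metric.reset()       # the callback's auto-reset is now harmless

# epoch-end logging uses the snapshot, immune to callback resets
for name, val in eval_name_vals:
    print('Epoch[0] Train-%s=%f' % (name, val))
```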