
[MXNET-171] Fix a bug that was causing training accuracy to be printed as nan sometimes #10437

Merged: 5 commits merged into apache:master from indhub:metric_nan on Apr 9, 2018

Conversation

indhub (Contributor) commented on Apr 6, 2018

Description

Fix a bug (issue #4253) that was causing training accuracy to be printed as nan sometimes

The "Epoch[x] Train-accuracy=xxx" line printed at the end of every epoch gives the impression that the accuracy metric is for the entire epoch. In reality, it is NOT.

For example, if an epoch consists of 101 batches and the callback was created to print metrics every 10 batches, what gets printed as 'Train-accuracy=xxx' at the end of the epoch is actually just the accuracy of a single batch (the 101st). Printing this as 'Epoch[x] Train-accuracy=xxx' is very misleading.

Ideally we should remove this misleading print statement. However, there are surely many existing scripts out there that parse this line, and we can't remove it without breaking them. Since that would be a breaking change, let's do it in a major version change.

For now, to avoid the nan error, we can avoid resetting the metrics when processing the callback for the last batch. They will be reset at the beginning of the next epoch anyway.
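
To make the failure mode concrete, here is a minimal, self-contained sketch. ToyAccuracy and batch_callback are hypothetical stand-ins, not MXNet's real classes; they only mimic an auto-resetting, Speedometer-style callback. When the epoch length is an exact multiple of the print frequency (120 batches, printing every 10, as in the logs below), the last callback resets the metric and the epoch-end read becomes 0/0 = nan:

```python
# Toy reproduction of the nan. ToyAccuracy and batch_callback are
# hypothetical stand-ins for an MXNet EvalMetric and an auto-resetting
# Speedometer-style callback; they are not the real API.

class ToyAccuracy:
    def __init__(self):
        self.correct, self.total = 0, 0

    def update(self, correct, total):
        self.correct += correct
        self.total += total

    def reset(self):
        self.correct, self.total = 0, 0

    def get(self):
        # 0/0 right after a reset is the epoch-end nan from the logs
        return self.correct / self.total if self.total else float('nan')

def batch_callback(metric, nbatch, frequent=10):
    # print and auto-reset every `frequent` batches
    if (nbatch + 1) % frequent == 0:
        print('Batch [%d] accuracy=%f' % (nbatch + 1, metric.get()))
        metric.reset()

metric = ToyAccuracy()
for nbatch in range(120):             # epoch length is a multiple of `frequent`
    metric.update(correct=8, total=10)
    batch_callback(metric, nbatch)

# The callback just reset the metric on batch 120, so nothing has
# accumulated since, and the 'epoch' accuracy comes out as nan:
print('Epoch[0] Train-accuracy=%f' % metric.get())
```

With 101 batches instead of 120, the final get() would instead return just the 101st batch's accuracy, which is the misleading single-batch case described above.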

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:

indhub requested a review from szha as a code owner on April 6, 2018 09:08
indhub (Contributor, Author) commented on Apr 6, 2018

Output of finetune notebook before fix:

2018-04-03 23:40:21,316 Epoch[0] Batch [10]	Speed: 368.51 samples/sec	accuracy=0.002131
2018-04-03 23:40:24,773 Epoch[0] Batch [20]	Speed: 370.50 samples/sec	accuracy=0.007812
2018-04-03 23:40:28,226 Epoch[0] Batch [30]	Speed: 370.75 samples/sec	accuracy=0.011719
2018-04-03 23:40:31,684 Epoch[0] Batch [40]	Speed: 370.26 samples/sec	accuracy=0.014063
2018-04-03 23:40:35,161 Epoch[0] Batch [50]	Speed: 368.28 samples/sec	accuracy=0.026562
2018-04-03 23:40:38,624 Epoch[0] Batch [60]	Speed: 369.71 samples/sec	accuracy=0.026562
2018-04-03 23:40:42,108 Epoch[0] Batch [70]	Speed: 367.54 samples/sec	accuracy=0.060937
2018-04-03 23:40:45,583 Epoch[0] Batch [80]	Speed: 368.39 samples/sec	accuracy=0.094531
2018-04-03 23:40:49,061 Epoch[0] Batch [90]	Speed: 368.16 samples/sec	accuracy=0.101562
2018-04-03 23:40:52,542 Epoch[0] Batch [100]	Speed: 367.77 samples/sec	accuracy=0.135937
2018-04-03 23:40:56,040 Epoch[0] Batch [110]	Speed: 366.01 samples/sec	accuracy=0.135937
2018-04-03 23:40:59,541 Epoch[0] Batch [120]	Speed: 365.68 samples/sec	accuracy=0.171875
2018-04-03 23:40:59,542 Epoch[0] Train-accuracy=nan
2018-04-03 23:40:59,543 Epoch[0] Time cost=85.549
2018-04-03 23:41:13,888 Epoch[0] Validation-accuracy=0.237001

Output of finetune notebook after fix:

2018-04-06 08:45:23,057 Epoch[0] Batch [10]	Speed: 366.62 samples/sec	accuracy=0.003551
2018-04-06 08:45:26,538 Epoch[0] Batch [20]	Speed: 367.82 samples/sec	accuracy=0.006250
2018-04-06 08:45:30,026 Epoch[0] Batch [30]	Speed: 367.14 samples/sec	accuracy=0.011719
2018-04-06 08:45:33,512 Epoch[0] Batch [40]	Speed: 367.24 samples/sec	accuracy=0.025781
2018-04-06 08:45:36,997 Epoch[0] Batch [50]	Speed: 367.43 samples/sec	accuracy=0.027344
2018-04-06 08:45:40,489 Epoch[0] Batch [60]	Speed: 366.63 samples/sec	accuracy=0.032031
2018-04-06 08:45:43,973 Epoch[0] Batch [70]	Speed: 367.42 samples/sec	accuracy=0.056250
2018-04-06 08:45:47,469 Epoch[0] Batch [80]	Speed: 366.36 samples/sec	accuracy=0.085156
2018-04-06 08:45:50,969 Epoch[0] Batch [90]	Speed: 365.75 samples/sec	accuracy=0.089844
2018-04-06 08:45:54,482 Epoch[0] Batch [100]	Speed: 364.56 samples/sec	accuracy=0.123438
2018-04-06 08:45:57,982 Epoch[0] Batch [110]	Speed: 365.80 samples/sec	accuracy=0.133594
2018-04-06 08:46:01,484 Epoch[0] Batch [120]	Speed: 365.58 samples/sec	accuracy=0.184375
2018-04-06 08:46:01,485 Epoch[0] Train-accuracy=0.184375
2018-04-06 08:46:01,486 Epoch[0] Time cost=83.268
2018-04-06 08:46:15,864 Epoch[0] Validation-accuracy=0.234112

@@ -523,8 +524,9 @@ def fit(self, train_data, eval_data=None, eval_metric='acc',
                     monitor.toc_print()

                 if batch_end_callback is not None:
+                    arg_eval_metric = eval_metric if not end_of_batch else deepcopy(eval_metric)
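
For context on the review comments that follow, here is a hedged sketch of what this guarded line buys, reusing the hypothetical ToyAccuracy and batch_callback stand-ins from the sketch in the description: on the last batch the callback receives a deep copy, so its auto-reset clears only the copy and the real metric still holds values for the epoch-end print.

```python
from copy import deepcopy

# Reuses the hypothetical ToyAccuracy and batch_callback defined in the
# description's sketch; `end_of_batch` mirrors the flag computed in fit().
eval_metric = ToyAccuracy()
num_batches = 120
for nbatch in range(num_batches):
    eval_metric.update(correct=8, total=10)
    end_of_batch = (nbatch == num_batches - 1)
    # the patched dispatch: pass a copy only on the last batch, so the
    # callback's reset() cannot clear the running epoch-window counts
    arg_eval_metric = eval_metric if not end_of_batch else deepcopy(eval_metric)
    batch_callback(arg_eval_metric, nbatch)

print('Epoch[0] Train-accuracy=%f' % eval_metric.get())   # 0.800000, not nan
```

Note that in the 'after fix' log above, Train-accuracy (0.184375) equals the Batch [120] window accuracy: the copy removes the nan, but as the description says, the line still reflects only the final window, not the whole epoch.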
Review comment from a Contributor:
This is too hacky and potentially very slow

indhub (Contributor, Author) replied on Apr 6, 2018:
Deep copy happens at most once per epoch. That adds negligible time to an epoch.

Hacky - yes. That's because we are trying to print metrics outside the loop that could get cleared inside the loop. Any solution that doesn't modify the existing behavior will be a little hacky.

indhub (Contributor, Author) added:

If you can think of a less hacky solution, please propose it. If not, I think this is better than printing the accuracy as nan in the training log, which creates a bad user experience. A number of users have reported this.

piiswrong (Contributor) commented:
Deep copy is slow and can cause a lot of problems. We can't do this kind of hack in the main package.

piiswrong merged commit 6abcdbb into apache:master on Apr 9, 2018
indhub deleted the metric_nan branch on April 9, 2018 18:37
indhub mentioned this pull request on Apr 25, 2018
rahul003 pushed a commit to rahul003/mxnet that referenced this pull request on Jun 4, 2018:
Fix a bug that was causing training accuracy to be printed as nan sometimes (apache#10437)

* Fix a bug that was causing training accuracy to be printed as nan sometimes.

* Avoid the additional 'arg_eval_metric' variable. There should be no overhead except for the last batch in an epoch.

* Fix lint.

* For the last batch, capture metrics before the callback and use them to print epoch metrics.

* Remove unused import.
zheng-da pushed a commit to zheng-da/incubator-mxnet that referenced this pull request on Jun 28, 2018, with the same commit message.
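
Per the commit messages above, the approach evolved after review: the merged version drops the extra arg_eval_metric variable and instead snapshots the metric's name/value pairs on the last batch, before any callback can reset it, then logs the snapshot at epoch end. A minimal sketch of that shape, assuming only that get_name_value() returns (name, value) pairs as in MXNet's EvalMetric; the rest is an illustrative stand-in:

```python
# Hypothetical stand-in metric exposing get_name_value() like an
# MXNet EvalMetric; the loop shape sketches the merged fix.
class ToyAccuracy:
    def __init__(self):
        self.correct, self.total = 0, 0

    def update(self, correct, total):
        self.correct += correct
        self.total += total

    def reset(self):
        self.correct, self.total = 0, 0

    def get_name_value(self):
        acc = self.correct / self.total if self.total else float('nan')
        return [('accuracy', acc)]

eval_metric = ToyAccuracy()
num_batches = 120
eval_name_vals = None
for nbatch in range(num_batches):
    eval_metric.update(correct=8, total=10)
    if nbatch == num_batches - 1:
        # snapshot *before* the batch-end callback runs
        eval_name_vals = eval_metric.get_name_value()
    if (nbatch + 1) % 10 == 0:
        eval_metric.reset()       # the callback's auto-reset is now harmless

# epoch-end logging uses the snapshot, immune to callback resets
for name, val in eval_name_vals:
    print('Epoch[0] Train-%s=%f' % (name, val))
```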