Description
Hi, thanks for the really helpful work.
I'm just wondering how long training took for you.
My desktop has the following CPU, SSD, and GPU:
CPU: Intel Core i7-6900K @ 3.2 GHz
SSD: Samsung SSD 850 EVO
GPU: NVIDIA GeForce RTX 2080 Ti
I ran the training script and it printed `=> active GPUs: 0`, from which I can tell my GPU is being used. I had to change the batch size to 50 in config.json because it complained about an OOM issue.
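(For anyone hitting the same thing: the `0` there appears to be the GPU index from `-g 0`, not a count of zero GPUs.) Assuming the repo is PyTorch, which the log format suggests, this is the generic sanity check I'd use to confirm the card is actually visible; it's not code from this repo:

```python
import torch

print(torch.cuda.is_available())       # True if CUDA and the driver are set up
print(torch.cuda.get_device_name(0))   # should report the GeForce RTX 2080 Ti
```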
I ran the script for about 23 minutes and it only completed one epoch.
One concern is that CPU utilization sits around 99% while GPU utilization stays below 10%.
Is there any configuration I need to change to fully utilize the GPU?
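In case it helps to pin this down: with 99% CPU and under 10% GPU, my guess is the GPU is starving while the data loader does frame decoding and augmentation on the CPU. Below is a minimal, self-contained timing loop of the kind I'd use to confirm that; the model and dataset are dummy stand-ins rather than this repo's code, so only the measurement pattern carries over:

```python
import time

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Dummy stand-ins so this runs on its own; swap in the repo's real
# model and dataset to measure the actual pipeline.
model = nn.Linear(3 * 224 * 224, 27).cuda()   # Jester has 27 gesture classes
dataset = TensorDataset(torch.randn(1000, 3 * 224 * 224),
                        torch.randint(0, 27, (1000,)))
loader = DataLoader(dataset, batch_size=50, num_workers=9, pin_memory=True)

data_time = gpu_time = 0.0
end = time.time()
for inputs, targets in loader:
    data_time += time.time() - end             # time spent blocked on the loader
    start = time.time()
    out = model(inputs.cuda(non_blocking=True))
    loss = nn.functional.cross_entropy(out, targets.cuda(non_blocking=True))
    loss.backward()
    torch.cuda.synchronize()                   # wait for the GPU so timing is honest
    gpu_time += time.time() - start
    end = time.time()

print(f"data: {data_time:.1f}s  gpu: {gpu_time:.1f}s")  # data >> gpu means CPU-bound
```

If `data` dwarfs `gpu` with the real pipeline, raising the data-loader worker count (it's 9 in the log below) or pre-decoding the frames ahead of time should matter far more than any GPU-side setting.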
Below is the command-line log:
```
$ python train.py --config configs/config.json -g 0
=> active GPUs: 0
=> Output folder for this run -- jester_conv6
Using 9 processes for data loader.
Training is getting started...
Training takes 999999 epochs.
=> active GPUs: 0
=> active GPUs: 0
=> active GPUs: 0
=> active GPUs: 0
=> active GPUs: 0
=> active GPUs: 0
=> active GPUs: 0
=> active GPUs: 0
=> active GPUs: 0
Epoch: [0][0/2371] Loss 3.3603 (3.3603) Prec@1 2.000 (2.000) Prec@5 24.000 (24.000)
Epoch: [0][100/2371] Loss 3.3065 (3.3294) Prec@1 8.000 (5.267) Prec@5 28.000 (21.010)
Epoch: [0][200/2371] Loss 3.4034 (3.3176) Prec@1 6.000 (6.179) Prec@5 16.000 (21.980)
Epoch: [0][300/2371] Loss 3.3358 (3.3123) Prec@1 12.000 (6.698) Prec@5 20.000 (22.213)
Epoch: [0][400/2371] Loss 3.2839 (3.3080) Prec@1 10.000 (7.137) Prec@5 20.000 (22.339)
Epoch: [0][500/2371] Loss 3.2690 (3.3068) Prec@1 12.000 (7.246) Prec@5 28.000 (22.367)
Epoch: [0][600/2371] Loss 3.3679 (3.3045) Prec@1 6.000 (7.384) Prec@5 22.000 (22.326)
Epoch: [0][700/2371] Loss 3.3639 (3.3040) Prec@1 6.000 (7.387) Prec@5 14.000 (22.397)
Epoch: [0][800/2371] Loss 3.2118 (3.3035) Prec@1 8.000 (7.366) Prec@5 36.000 (22.429)
Epoch: [0][900/2371] Loss 3.3153 (3.3017) Prec@1 2.000 (7.478) Prec@5 24.000 (22.562)
Epoch: [0][1000/2371] Loss 3.3295 (3.3003) Prec@1 4.000 (7.538) Prec@5 16.000 (22.691)
Epoch: [0][1100/2371] Loss 3.2486 (3.2990) Prec@1 10.000 (7.599) Prec@5 30.000 (22.874)
Epoch: [0][1200/2371] Loss 3.3112 (3.2973) Prec@1 6.000 (7.607) Prec@5 14.000 (22.981)
Epoch: [0][1300/2371] Loss 3.2315 (3.2960) Prec@1 14.000 (7.631) Prec@5 36.000 (23.148)
Epoch: [0][1400/2371] Loss 3.3065 (3.2944) Prec@1 4.000 (7.659) Prec@5 26.000 (23.269)
Epoch: [0][1500/2371] Loss 3.2688 (3.2931) Prec@1 12.000 (7.695) Prec@5 34.000 (23.387)
Epoch: [0][1600/2371] Loss 3.1971 (3.2921) Prec@1 12.000 (7.734) Prec@5 40.000 (23.492)
Epoch: [0][1700/2371] Loss 3.2873 (3.2908) Prec@1 8.000 (7.790) Prec@5 20.000 (23.588)
Epoch: [0][1800/2371] Loss 3.1563 (3.2894) Prec@1 16.000 (7.842) Prec@5 42.000 (23.719)
Epoch: [0][1900/2371] Loss 3.2181 (3.2875) Prec@1 8.000 (7.883) Prec@5 36.000 (23.916)
Epoch: [0][2000/2371] Loss 3.2744 (3.2859) Prec@1 4.000 (7.929) Prec@5 18.000 (24.034)
Epoch: [0][2100/2371] Loss 3.3153 (3.2836) Prec@1 6.000 (7.952) Prec@5 28.000 (24.207)
Epoch: [0][2200/2371] Loss 3.1725 (3.2810) Prec@1 12.000 (8.038) Prec@5 36.000 (24.462)
Epoch: [0][2300/2371] Loss 3.2124 (3.2788) Prec@1 8.000 (8.044) Prec@5 38.000 (24.708)
=> active GPUs: 0
=> active GPUs: 0
=> active GPUs: 0
=> active GPUs: 0
=> active GPUs: 0
=> active GPUs: 0
=> active GPUs: 0
=> active GPUs: 0
=> active GPUs: 0
Test: [0/296] Loss 3.2033 (3.2033) Prec@1 14.000 (14.000) Prec@5 32.000 (32.000)
```
EDIT: Wait a sec... I just checked TensorBoard, and is it supposed to take more than a day?
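Answering part of my own question with some back-of-envelope math from the log above (the epoch count is an assumption on my part, since the script just prints 999999):

```python
iters_per_epoch = 2371    # from "Epoch: [0][.../2371]" in the log
batch_size = 50           # my config.json change
print(iters_per_epoch * batch_size)   # 118550, roughly the full Jester training split

minutes_per_epoch = 23    # measured above
epochs = 40               # ASSUMPTION: a plausible schedule, not the repo's default
print(minutes_per_epoch * epochs / 60, "hours")   # ~15.3 h; more epochs or a
                                                  # CPU-bound loader pushes past a day
```

So at the current per-epoch speed, anything beyond roughly 60 epochs would indeed take more than a day, which is why I suspect the CPU-bound data loading is the real problem.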