
How long does the training take? #8

Open
@moon5756


Hi, thanks for the really helpful work.

I'm just wondering how long the training took for you.
My desktop has the following specs:
CPU: Intel Core i7-6900K @ 3.2GHz
SSD: Samsung SSD 850 EVO
GPU: NVIDIA GeForce RTX 2080 Ti

I ran the training script and it printed "active GPUs: 0", which I take to mean my GPU (device 0) is being used. I changed the batch size to 50 in config.json because the script complained about an OOM issue.

I ran the script for about 23 minutes and it only completed one epoch.
One concern is that CPU utilization sits around 99% while GPU utilization stays below 10%.
Is there any configuration I need to change to fully utilize the GPU?
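In case the data pipeline is the bottleneck (video decoding and augmentation are CPU-heavy), here is the kind of DataLoader setup I would try first. This is only a minimal sketch assuming the repo uses a standard PyTorch DataLoader; the dummy dataset, tensor shapes, and class count below are placeholders, not the actual pipeline:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Confirm PyTorch can actually see the GPU before tuning anything else.
assert torch.cuda.is_available(), "CUDA is not visible to PyTorch"
device = torch.device("cuda:0")

# Dummy clips/labels standing in for the real video dataset
# (the shapes and the 27-class label range are placeholders).
train_dataset = TensorDataset(
    torch.randn(1000, 3, 16, 96, 96),
    torch.randint(0, 27, (1000,)),
)

loader = DataLoader(
    train_dataset,
    batch_size=50,     # the value that fit in GPU memory for me
    shuffle=True,
    num_workers=9,     # matches the "Using 9 processes" line in the log
    pin_memory=True,   # page-locked host memory enables async H2D copies
)

for clips, labels in loader:
    # non_blocking=True lets the copy overlap with compute when pin_memory is set
    clips = clips.to(device, non_blocking=True)
    labels = labels.to(device, non_blocking=True)
    # ... forward/backward pass goes here ...
```

If the workers can't keep up, raising num_workers is usually the first knob to turn; pin_memory plus non_blocking copies mostly helps hide the host-to-device transfer.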
Here is the command-line log:

```
$ python train.py --config configs/config.json -g 0
=> active GPUs: 0
=> Output folder for this run -- jester_conv6
Using 9 processes for data loader.
Training is getting started...
Training takes 999999 epochs.
=> active GPUs: 0
=> active GPUs: 0
=> active GPUs: 0
=> active GPUs: 0
=> active GPUs: 0
=> active GPUs: 0
=> active GPUs: 0
=> active GPUs: 0
=> active GPUs: 0
Epoch: [0][0/2371]      Loss 3.3603 (3.3603)    Prec@1 2.000 (2.000)    Prec@5 24.000 (24.000)
Epoch: [0][100/2371]    Loss 3.3065 (3.3294)    Prec@1 8.000 (5.267)    Prec@5 28.000 (21.010)
Epoch: [0][200/2371]    Loss 3.4034 (3.3176)    Prec@1 6.000 (6.179)    Prec@5 16.000 (21.980)
Epoch: [0][300/2371]    Loss 3.3358 (3.3123)    Prec@1 12.000 (6.698)   Prec@5 20.000 (22.213)
Epoch: [0][400/2371]    Loss 3.2839 (3.3080)    Prec@1 10.000 (7.137)   Prec@5 20.000 (22.339)
Epoch: [0][500/2371]    Loss 3.2690 (3.3068)    Prec@1 12.000 (7.246)   Prec@5 28.000 (22.367)
Epoch: [0][600/2371]    Loss 3.3679 (3.3045)    Prec@1 6.000 (7.384)    Prec@5 22.000 (22.326)
Epoch: [0][700/2371]    Loss 3.3639 (3.3040)    Prec@1 6.000 (7.387)    Prec@5 14.000 (22.397)
Epoch: [0][800/2371]    Loss 3.2118 (3.3035)    Prec@1 8.000 (7.366)    Prec@5 36.000 (22.429)
Epoch: [0][900/2371]    Loss 3.3153 (3.3017)    Prec@1 2.000 (7.478)    Prec@5 24.000 (22.562)
Epoch: [0][1000/2371]   Loss 3.3295 (3.3003)    Prec@1 4.000 (7.538)    Prec@5 16.000 (22.691)
Epoch: [0][1100/2371]   Loss 3.2486 (3.2990)    Prec@1 10.000 (7.599)   Prec@5 30.000 (22.874)
Epoch: [0][1200/2371]   Loss 3.3112 (3.2973)    Prec@1 6.000 (7.607)    Prec@5 14.000 (22.981)
Epoch: [0][1300/2371]   Loss 3.2315 (3.2960)    Prec@1 14.000 (7.631)   Prec@5 36.000 (23.148)
Epoch: [0][1400/2371]   Loss 3.3065 (3.2944)    Prec@1 4.000 (7.659)    Prec@5 26.000 (23.269)
Epoch: [0][1500/2371]   Loss 3.2688 (3.2931)    Prec@1 12.000 (7.695)   Prec@5 34.000 (23.387)
Epoch: [0][1600/2371]   Loss 3.1971 (3.2921)    Prec@1 12.000 (7.734)   Prec@5 40.000 (23.492)
Epoch: [0][1700/2371]   Loss 3.2873 (3.2908)    Prec@1 8.000 (7.790)    Prec@5 20.000 (23.588)
Epoch: [0][1800/2371]   Loss 3.1563 (3.2894)    Prec@1 16.000 (7.842)   Prec@5 42.000 (23.719)
Epoch: [0][1900/2371]   Loss 3.2181 (3.2875)    Prec@1 8.000 (7.883)    Prec@5 36.000 (23.916)
Epoch: [0][2000/2371]   Loss 3.2744 (3.2859)    Prec@1 4.000 (7.929)    Prec@5 18.000 (24.034)
Epoch: [0][2100/2371]   Loss 3.3153 (3.2836)    Prec@1 6.000 (7.952)    Prec@5 28.000 (24.207)
Epoch: [0][2200/2371]   Loss 3.1725 (3.2810)    Prec@1 12.000 (8.038)   Prec@5 36.000 (24.462)
Epoch: [0][2300/2371]   Loss 3.2124 (3.2788)    Prec@1 8.000 (8.044)    Prec@5 38.000 (24.708)
=> active GPUs: 0
=> active GPUs: 0
=> active GPUs: 0
=> active GPUs: 0
=> active GPUs: 0
=> active GPUs: 0
=> active GPUs: 0
=> active GPUs: 0
=> active GPUs: 0
Test: [0/296]   Loss 3.2033 (3.2033)    Prec@1 14.000 (14.000)  Prec@5 32.000 (32.000)
```

EDIT: Wait a sec... I just checked TensorBoard, and is it supposed to take more than a day?
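For what it's worth, my back-of-envelope from the numbers above (the 999999 in the log looks like a run-until-stopped placeholder, and the 30-epoch schedule is purely my guess):

```python
# Rough wall-clock estimate from the numbers in this run.
minutes_per_epoch = 23     # measured: one epoch took ~23 min
iters_per_epoch = 2371     # from the "[0/2371]" progress lines
batch_size = 50            # from config.json after the OOM fix

samples_per_epoch = iters_per_epoch * batch_size   # = 118,550 clips
epochs_assumed = 30        # my guess at a realistic schedule, not the repo's

total_hours = minutes_per_epoch * epochs_assumed / 60
print(f"{samples_per_epoch} samples/epoch, ~{total_hours:.1f} h total")
# -> 118550 samples/epoch, ~11.5 h total
```

So even at the current pace, a full run looks like somewhere between half a day and a few days, depending on how many epochs it actually needs.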
