I posted this as an issue on GitHub; maybe someone here will have a
magic solution:
- TensorFlow version: 2.4.0-rc4 (also tried with stable 2.4.0)
- TensorFlow Git version: v2.4.0-rc3-20-g97c3fef64ba
- Python version: 3.8.5
- CUDA/cuDNN version: CUDA 11.0, cuDNN 8.0.4
- GPU model and memory: Nvidia RTX 3090, 24GB RAM
Model training regularly freezes for large models.
Sometimes the first batch or so works, but a few batches later
training appears to get stuck in a loop. In my activity monitor, GPU
CUDA usage hovers around 100%. This goes on for minutes or more, with
no further batches being trained.
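To see where the Python side is sitting when training stops advancing, one option is a watchdog built on the standard-library `faulthandler` module. This is only a minimal sketch: `arm_watchdog`, `disarm_watchdog`, and the log path are hypothetical names, and the dump shows Python frames only, so it can point at the call that never returned but not at what the GPU is doing inside it. It writes to a real file because notebook output streams may not expose a file descriptor.

```python
import faulthandler


def arm_watchdog(log_path, timeout_s=60.0):
    """Dump all Python thread stacks to log_path every timeout_s seconds.

    Hypothetical helper: returns the open log file, which must stay open
    for as long as the watchdog is armed.
    """
    log_file = open(log_path, "w")
    # repeat=True keeps dumping every timeout_s seconds until cancelled.
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=log_file)
    return log_file


def disarm_watchdog(log_file):
    """Cancel the pending dumps and close the log file."""
    faulthandler.cancel_dump_traceback_later()
    log_file.close()
```

Arming it just before the final notebook cell and reading the log once training hangs should show which Python call was active at each dump.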
I don’t see an OOM error, nor does it seem like I’m hitting
memory limits in activity monitor or nvidia-smi.
I would expect the first batch to take a bit longer, then any
subsequent batches to take less than 1s. I would never expect a random
batch to take minutes or stall forever.
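The expectation above can be checked by timing each batch. The following is a rough sketch, not the notebook's actual code: `timed_steps`, `train_step`, and `warn_after_s` are hypothetical names, and it assumes `train_step` blocks until the batch has really finished (for example by converting the loss to a NumPy value). A batch that never returns will not be reported, so this only catches slow batches, not a full hang.

```python
import time


def timed_steps(train_step, batches, warn_after_s=5.0):
    """Call train_step on each batch and return the per-batch durations.

    train_step is a hypothetical callable that runs one training step and
    blocks until it is done. The first batch is excluded from the warning
    because it is expected to be slower.
    """
    durations = []
    for i, batch in enumerate(batches):
        start = time.perf_counter()
        train_step(batch)
        elapsed = time.perf_counter() - start
        durations.append(elapsed)
        if i > 0 and elapsed > warn_after_s:
            print(f"batch {i} took {elapsed:.1f}s (threshold {warn_after_s}s)")
    return durations
```

Comparing the returned durations against the sub-second expectation shows whether batches slow down gradually before the stall or stop abruptly.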
Run through all the cells in the notebook shared below to
initialize the model, then run the final cell just a few times.
Eventually it will hang and never finish.
https://github.com/not-Ian/tensorflow-bug-example/blob/main/tensorflow%20error%20example.ipynb
Smaller models train quickly as expected; however, I think even
they eventually stall after training many, many batches. I had
another small VAE, similar to the one in my example, that trained for
5k-10k batches overnight before stalling.
Someone suggested I set a hard memory limit on the GPU like
this:
    import tensorflow as tf

    # Cap TensorFlow at 23 GB on the first GPU (memory_limit is in MB).
    gpus = tf.config.experimental.list_physical_devices('GPU')
    tf.config.experimental.set_virtual_device_configuration(
        gpus[0],
        [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=1024 * 23)])
And finally, I’ve tried the hack of using the ptxas.exe file from CUDA
11.1 in my CUDA 11.0 installation. This seems to remove a warning, but
otherwise there is still no change.
Open to any other ideas, thanks.