Model training stalls forever after just a few batches.

I posted this as an issue on GitHub, but maybe someone here will have
a magic solution:

TensorFlow version: 2.4.0-rc4 (also tried with stable 2.4.0)
TensorFlow Git version: v2.4.0-rc3-20-g97c3fef64ba
Python version: 3.8.5
CUDA/cuDNN version: CUDA 11.0, cuDNN 8.0.4
GPU model and memory: Nvidia RTX 3090, 24GB RAM

Model training regularly freezes for large models.

Sometimes the first batch or so works, but just a few batches
later training seems stuck in a loop. In my activity monitor, GPU
CUDA use hovers around 100%. This goes on for minutes or more, with
no more batches being trained.
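One way to see where a stuck batch is sitting is Python's built-in faulthandler module, which can dump every thread's stack if a block of code runs too long. The following is a minimal sketch, not a fix: `hang_watchdog` and `train_step_placeholder` are hypothetical names, and the 60-second timeout is an arbitrary assumption. It only shows Python frames, so a hang inside a CUDA kernel would most likely appear as the Python call that entered TensorFlow.

```python
import faulthandler
import sys
import time
from contextlib import contextmanager


@contextmanager
def hang_watchdog(timeout_s, file=None):
    """Dump all Python thread stacks if the wrapped block runs longer than timeout_s."""
    # faulthandler needs a real file descriptor; sys.__stderr__ is the
    # interpreter's original stderr, which notebooks usually leave intact.
    target = file if file is not None else sys.__stderr__
    # repeat=True dumps again every timeout_s seconds while the block is stuck.
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=target)
    try:
        yield
    finally:
        # Always cancel, so a step that finished does not trigger a late dump.
        faulthandler.cancel_dump_traceback_later()


def train_step_placeholder():
    # Hypothetical stand-in for the real call that trains one batch.
    time.sleep(0.01)


if __name__ == "__main__":
    for step in range(3):
        with hang_watchdog(timeout_s=60):
            train_step_placeholder()
```

If a dump appears, the last frames of the main thread show which call never returned, which may be useful detail to attach to the GitHub issue.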

I don’t see an OOM error, nor does it seem like I’m hitting
memory limits in activity monitor or nvidia-smi.
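To capture utilization and memory numbers as text rather than reading them off a monitor, nvidia-smi can be queried from a script. This is a rough sketch under a few assumptions: nvidia-smi is on PATH, the helper names `parse_gpu_stats` and `read_gpu_stats` are made up for illustration, and every queried field comes back as a plain integer (some drivers report `[N/A]` for certain fields, which this parser does not handle).

```python
import subprocess

QUERY = "utilization.gpu,memory.used,memory.total"


def parse_gpu_stats(csv_text):
    """Parse `--format=csv,noheader,nounits` output into one dict per GPU."""
    stats = []
    for line in csv_text.strip().splitlines():
        util, used, total = (int(field.strip()) for field in line.split(","))
        stats.append({"util_pct": util, "mem_used_mib": used, "mem_total_mib": total})
    return stats


def read_gpu_stats():
    """Run nvidia-smi once and return the parsed stats."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_stats(out)


if __name__ == "__main__":
    # Made-up sample line for illustration, not a measurement from this post.
    print(parse_gpu_stats("100, 9000, 24576\n"))
```

Calling `read_gpu_stats()` every few seconds from a second process while training is stuck would give a log of whether memory use is actually flat during the stall.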

I would expect the first batch to take a bit longer, then any
subsequent batches to take less than 1s. A random batch should
never take minutes or stall forever.
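That expectation can be checked by timing each batch and flagging the ones that blow past the budget. A small sketch, assuming a hypothetical `step_fn(step)` callable that trains one batch; the 1-second threshold mirrors the expectation above, and the injectable `clock` parameter exists only to make the helper easy to test. Note the limitation: a batch that never returns is never reported, so this only catches slow batches, not infinite stalls.

```python
import time


def timed_steps(step_fn, num_steps, slow_after_s=1.0, clock=time.perf_counter):
    """Run step_fn(step) num_steps times; return (durations, slow_steps)."""
    durations, slow_steps = [], []
    for step in range(num_steps):
        start = clock()
        step_fn(step)
        elapsed = clock() - start
        durations.append(elapsed)
        # Step 0 is skipped: the first batch is expected to be slower.
        if step > 0 and elapsed > slow_after_s:
            slow_steps.append(step)
            print(f"step {step} took {elapsed:.2f}s", flush=True)
    return durations, slow_steps


if __name__ == "__main__":
    # Placeholder step that does nothing; swap in the real training call.
    durations, slow = timed_steps(lambda step: None, num_steps=5)
    print(len(durations), slow)
```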

Run through all the cells in the notebook shared below to
initialize the model, then run the final cell just a few times.
Eventually it will hang and never finish.

https://github.com/not-Ian/tensorflow-bug-example/blob/main/tensorflow%20error%20example.ipynb

Smaller models train quickly as expected; however, I think even
they eventually stall out after training many, many batches. I
had another small VAE, similar to the one in my example, that
trained for 5k-10k batches overnight before stalling.

Someone suggested I set a hard memory limit on the GPU like
this:

gpus = tf.config.experimental.list_physical_devices('GPU')
tf.config.experimental.set_virtual_device_configuration(gpus[0], [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=1024 * 23)])

And finally, I’ve tried using the hacky ptxas.exe file from CUDA
11.1 in my CUDA 11.0 installation. This seems to remove a warning,
but there is still no change.

Open to any other ideas, thanks.

submitted by /u/Deinos_Mousike
