GPU server freezes during GPU idling

user776206 asked:

We have a new Supermicro AS-4124GS-TNR server equipped with eight NVIDIA RTX A6000 GPUs. The OS is Ubuntu 20.04.2, the NVIDIA driver version is 460.73.01 (the Nouveau driver is not in use), and the CUDA version is 11.2.

We ran a few tests on the GPUs and the system was stable. However, after the GPUs had been idle for a while, the system crashed twice.

We assume that GpuPowerMizerMode has to be set to 1 to prevent crashes during GPU idling. In order to do that, we start gdm and then set the value accordingly. But when stopping X/gdm, the GpuPowerMizerMode value is automatically reset to 2. Keeping gdm running is not an option because this also leads to system instability.

So, the problem seems to be as follows:

  1. Idling + GpuPowerMizerMode != 1 can result in a system freeze. To keep the value set to 1, X/gdm has to keep running.
  2. A running X/gdm can cause a system crash.

Are our assumptions correct?

How can we solve the problem?

My answer:

It should not be necessary to start a GUI session (or even have one installed!) to change settings such as this; nvidia-settings should work fine from the framebuffer console or even in a script you write that runs at startup.

Check to be sure:

# nvidia-settings -q GpuPowerMizerMode

  Attribute 'GPUPowerMizerMode' (blacktemple:1[gpu:0]): 1.
    Valid values for 'GPUPowerMizerMode' are: 0, 1 and 2.
    'GPUPowerMizerMode' can use the following target types: GPU.
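
With eight GPUs it can be handy to check all of them in one go. Here is a minimal sketch of how that might look. The function name query_powermizer_mode and the NVIDIA_SETTINGS_BIN variable are my own inventions for illustration (the variable just lets you swap in another binary), and I am assuming the -t (terse) option prints only the attribute value on your driver version:

```shell
# Print the current GpuPowerMizerMode of each GPU, one line per GPU.
# NVIDIA_SETTINGS_BIN is a hypothetical override; by default the real
# nvidia-settings binary is used.
query_powermizer_mode() {
    gpu_count="$1"
    bin="${NVIDIA_SETTINGS_BIN:-nvidia-settings}"
    n=0
    while [ "$n" -lt "$gpu_count" ]; do
        # -t (terse) is assumed to print only the attribute value.
        value=$("$bin" -t -q "[gpu:$n]/GpuPowerMizerMode") || value="unavailable"
        printf 'gpu:%s %s\n' "$n" "$value"
        n=$((n + 1))
    done
}
```

Calling query_powermizer_mode 8 would then print one line for each of gpu:0 through gpu:7.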

For eight GPUs just write a simple script, something like:

for n in $(seq 0 7); do
    nvidia-settings -a "[gpu:$n]/GpuPowerMizerMode=1"
done

and run it at startup in whatever manner you find convenient.
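
If you want the startup script to tell you when a GPU did not accept the setting, a slightly more defensive version of the loop might look like the sketch below. Again, set_powermizer_mode and NVIDIA_SETTINGS_BIN are hypothetical names I made up, not part of nvidia-settings, and I am assuming nvidia-settings exits with a non-zero status when an assignment fails:

```shell
#!/bin/sh
# Sketch of a startup script helper: set GpuPowerMizerMode on each GPU
# and report any GPU for which the assignment failed.
# NVIDIA_SETTINGS_BIN is a hypothetical override; by default the real
# nvidia-settings binary is used.
set_powermizer_mode() {
    mode="$1"
    gpu_count="$2"
    bin="${NVIDIA_SETTINGS_BIN:-nvidia-settings}"
    failed=0
    n=0
    while [ "$n" -lt "$gpu_count" ]; do
        # Assumes a non-zero exit status means the assignment failed.
        if ! "$bin" -a "[gpu:$n]/GpuPowerMizerMode=$mode" >/dev/null; then
            echo "failed to set GpuPowerMizerMode on gpu:$n" >&2
            failed=1
        fi
        n=$((n + 1))
    done
    # Return 0 only if every GPU accepted the setting.
    return "$failed"
}
```

Adding the line set_powermizer_mode 1 8 at the end of the script would apply mode 1 to GPUs 0 through 7, and the non-zero return value on failure lets whatever runs the script at boot notice that something went wrong.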

I can’t say whether your crashes are due to running with GpuPowerMizerMode!=1. If that is the case, then you probably have some sort of defective hardware that you should track down and replace.

