GPU server freezes during GPU idling

user776206 asked:

We have a new Supermicro Server AS-4124GS-TNR equipped with eight NVIDIA RTX A6000. The OS is Ubuntu 20.04.2, the NVIDIA driver version is 460.73.01 (no Nouveau driver used), the CUDA Version is 11.2.

We ran a few tests on the GPUs and the system was stable. However, after the GPUs had been idle for a while, the system crashed twice.

We assume that GpuPowerMizerMode has to be set to 1 to prevent crashes during GPU idling. In order to do that, we start gdm and then set the value accordingly. But when stopping X/gdm, the GpuPowerMizerMode value is automatically reset to 2. Keeping gdm running is not an option because this also leads to system instability.

So, the problem seems to be as follows:

  1. Idling + GpuPowerMizerMode != 1 can result in a system freeze. To keep the value set to 1, X/gdm has to keep running.
  2. A running X/gdm can cause a system crash.

Are our assumptions correct?

How can we solve the problem?

My answer:

It should not be necessary to start a GUI session (or even have one installed!) to change settings such as this; nvidia-settings should work fine from the framebuffer console or even in a script you write that runs at startup.

Check to be sure:

# nvidia-settings -q GpuPowerMizerMode

  Attribute 'GPUPowerMizerMode' (blacktemple:1[gpu:0]): 1.
    Valid values for 'GPUPowerMizerMode' are: 0, 1 and 2.
    'GPUPowerMizerMode' can use the following target types: GPU.

For eight GPUs just write a simple script, something like:

for n in $(seq 0 7); do
    nvidia-settings -a "[gpu:$n]/GpuPowerMizerMode=1"
done

and run it at startup in whatever manner you find convenient.
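If you would rather not hard-code the GPU count, here is a rough sketch of the same idea that asks the driver which GPUs exist first. It assumes nvidia-smi is installed alongside the driver and uses its --query-gpu=index query. The function names and the NVIDIA_SMI / NVIDIA_SETTINGS override variables are my own inventions for illustration, not part of any NVIDIA tool. I have not tried this on your hardware.

```shell
#!/bin/sh
# Sketch: apply one GpuPowerMizerMode value to every GPU the driver reports.
# NVIDIA_SMI and NVIDIA_SETTINGS are hypothetical override variables so the
# real commands can be swapped out (for a dry run, for example).

list_gpu_indices() {
    # Prints one GPU index per line (0 through 7 on an eight-GPU machine).
    "${NVIDIA_SMI:-nvidia-smi}" --query-gpu=index --format=csv,noheader
}

set_powermizer_mode() {
    mode=$1
    status=0
    # Give up early if the driver cannot be queried at all.
    indices=$(list_gpu_indices) || return 1
    for n in $indices; do
        # Remember a failure on any GPU, but still try the remaining ones.
        "${NVIDIA_SETTINGS:-nvidia-settings}" -a "[gpu:$n]/GpuPowerMizerMode=$mode" || status=1
    done
    return $status
}
```

To use it, add a final line such as set_powermizer_mode 1 to the script. The function returns non-zero if the GPU query or any single assignment fails, so whatever runs it at startup can notice and log the failure.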

I can’t say whether your crashes are due to running with GpuPowerMizerMode!=1. If that is the case, then you probably have some sort of defective hardware that you should track down and replace.

View the full question and any other answers on Server Fault.

This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.