We have a new Supermicro AS-4124GS-TNR server equipped with eight NVIDIA RTX A6000 GPUs. The OS is Ubuntu 20.04.2, the NVIDIA driver version is 460.73.01 (the Nouveau driver is not used), and the CUDA version is 11.2.
We ran a few tests on the GPUs and the system was stable. However, after the GPUs had been idle for a while, the system crashed twice.
We assume that GpuPowerMizerMode has to be set to 1 to prevent crashes during GPU idling. In order to do that, we start gdm and then set the value accordingly. But when stopping X/gdm, the GpuPowerMizerMode value is automatically reset to 2. Keeping gdm running is not an option because this also leads to system instability.
So, the problem seems to be as follows:
- Idling + GpuPowerMizerMode != 1 can result in a system freeze. In order to persistently set the value to 1, X/gdm has to keep running.
- A running X/gdm can cause a system crash.
Are our assumptions correct?
How can we solve the problem?
It should not be necessary to start a GUI session (or even have one installed!) to change settings such as this; nvidia-settings should work fine from the framebuffer console or even in a script you write that runs at startup.
Check to be sure:
    # nvidia-settings -q GpuPowerMizerMode
    Attribute 'GPUPowerMizerMode' (blacktemple:1[gpu:0]): 1.
    Valid values for 'GPUPowerMizerMode' are: 0, 1 and 2.
    'GPUPowerMizerMode' can use the following target types: GPU.
For eight GPUs just write a simple script, something like:
    for n in $(seq 0 7); do
        nvidia-settings -a "[gpu:$n]/GpuPowerMizerMode=1"
    done
and run it at startup in whatever manner you find convenient.
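If you want something a little more defensive than the bare loop, here is a rough sketch of how such a startup script might look. It assumes, as above, that nvidia-settings can reach the GPUs from wherever the script runs. The names GPU_COUNT, MODE and DRY_RUN are just illustrative shell variables I made up, not options of nvidia-settings. By default it only prints the commands it would run, so you can check them first.

```shell
#!/bin/sh
# Sketch: set GpuPowerMizerMode on every GPU, e.g. from a startup script.
# GPU_COUNT, MODE and DRY_RUN are illustrative variable names (assumptions),
# not options understood by nvidia-settings itself.

GPU_COUNT="${GPU_COUNT:-8}"   # eight GPUs, as in the system described above
MODE="${MODE:-1}"             # the value the original poster wants to keep
DRY_RUN="${DRY_RUN:-1}"       # 1 = only print the commands, 0 = run them

# Print the attribute assignment for GPU index $1 and mode $2.
assignment_for() {
    printf '[gpu:%s]/GpuPowerMizerMode=%s\n' "$1" "$2"
}

# Apply (or, in dry-run mode, print) the setting for every GPU.
# Returns non-zero if any nvidia-settings call fails.
apply_mode() {
    n=0
    rc=0
    while [ "$n" -lt "$GPU_COUNT" ]; do
        assignment=$(assignment_for "$n" "$MODE")
        if [ "$DRY_RUN" = "1" ]; then
            echo "nvidia-settings -a \"$assignment\""
        elif ! nvidia-settings -a "$assignment"; then
            echo "could not set GpuPowerMizerMode on gpu $n" >&2
            rc=1
        fi
        n=$((n + 1))
    done
    return "$rc"
}

apply_mode
```

Run it once as is to see the eight commands, then with DRY_RUN=0 in the environment to actually apply them. The non-zero exit status on failure is there so that whatever launches the script at boot can notice when a GPU did not accept the setting.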
I can’t say whether your crashes are due to running with GpuPowerMizerMode!=1. If that is the case, then you probably have some sort of defective hardware that you should track down and replace.