Nvidia vs AMD CPU usage

If you think it might be a driver problem, see viewforum.php?f=79

Moderators: Site Moderators, FAHC Science Team

arisu
Posts: 372
Joined: Mon Feb 24, 2025 11:11 pm

Re: Nvidia vs AMD CPU usage

Post by arisu »

Joe_H wrote: Tue Mar 04, 2025 4:40 am From what I understand, the difference is in how Nvidia and AMD wrote their drivers. Nvidia's driver is doing a spin-wait looking for instructions to be processed and sent to the GPU. AMD from the explanations I have seen implemented this as an interrupt instead. As soon as something is handed off to the driver to process, it wakes up and takes CPU cycles to handle the request and then goes inactive until the next request. So the Nvidia driver process is always active, but the actual amount of work done by the CPU may be a fraction of the cycles available.
I believe I have the definitive answer about why we see this.

It turns out it's not the driver, just a choice that FAH made in their configuration of OpenMM. They overrode the default for UseBlockingSync and set it to false. This increases performance slightly but causes the constant full-core CPU usage people report.

http://docs.openmm.org/latest/userguide ... a-platform
UseBlockingSync: This is used to control how the CUDA runtime synchronizes between the CPU and GPU. If this is set to “true” (the default), CUDA will allow the calling thread to sleep while the GPU is performing a computation, allowing the CPU to do other work. If it is set to “false”, CUDA will spin-lock while the GPU is working. Setting it to “false” can improve performance slightly, but also prevents the CPU from doing anything else while the GPU is working.
When the CPU sends data to the GPU, it calls cudaDeviceSynchronize() which will wait until the GPU has finished before it returns. The majority of the CPU's time will be spent in that function. That function will either use a spin wait loop or will yield the CPU and wait for an interrupt before returning, depending on if cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync) has been called or not. The UseBlockingSync feature in OpenMM just calls that function.
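
To make the mechanism concrete, here is a minimal standalone sketch (my own, not OpenMM's or FAH's code) that sets the blocking-sync flag through the runtime API before any GPU work happens. Build it with nvcc and watch the host thread in top with and without the cudaSetDeviceFlags() line:

Code: Select all

#include <stdio.h>
#include <cuda_runtime.h>

/* Keep one GPU thread busy long enough for the host-side wait to be visible. */
__global__ void spin_kernel(void)
{
        for (volatile long long i = 0; i < 200000000LL; i++)
                ;
}

int main(void)
{
        /* Must run before the context is created. With this flag the calling
         * thread sleeps inside cudaDeviceSynchronize(); without it the runtime
         * may spin-wait, which is what shows up as a pegged CPU core. */
        cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);

        spin_kernel<<<1, 1>>>();
        cudaDeviceSynchronize();

        printf("done: %s\n", cudaGetErrorString(cudaGetLastError()));
        return 0;
}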

So why did FAH do this? I don't know. From reports, the performance improvement is very very slight. Maybe it was just set and forget? Maybe there are some systems out there where the performance improvement is big enough to make it worthwhile? I'll test this out when I get a new Nvidia card and see if I can write a simple program that injects the blocking sync flag for a test WU.
calxalot
Site Moderator
Posts: 1476
Joined: Sat Dec 08, 2007 1:33 am
Location: San Francisco, CA
Contact:

Re: Nvidia vs AMD CPU usage

Post by calxalot »

I remember reading about attempts to reduce CPU usage. Usage could be reduced, but doing so also greatly reduced folding speed. I think I read 30% slower.

This was somewhere on the openmm.org site or in the GitHub repos for OpenMM.
arisu
Posts: 372
Joined: Mon Feb 24, 2025 11:11 pm

Re: Nvidia vs AMD CPU usage

Post by arisu »

The site only says "slightly", so I'm guessing it's on the repo. I'll look.

Edit: Yep it's in the repo, probably this one https://github.com/openmm/openmm/issues/2955

The perf decrease is negligible... except on Linux where it is 35% (and Windows if kept in balanced power mode). That explains why it was done. Linux runs on a lot of the most important folding boxes!
Joe_H
Site Admin
Posts: 8103
Joined: Tue Apr 21, 2009 4:41 pm
Hardware configuration: Mac Studio M1 Max 32 GB smp6
Mac Hack i7-7700K 48 GB smp4
Location: W. MA

Re: Nvidia vs AMD CPU usage

Post by Joe_H »

The last note on the issue, that it was not possible to do this while continuing to support OpenCL 1.2, may be a big part of it. Nvidia was slow to support any version of OpenCL beyond 1.2, and the drivers released in the last few years that include support for higher versions were not back-ported to older-generation GPUs.
muziqaz
Posts: 1661
Joined: Sun Dec 16, 2007 6:22 pm
Hardware configuration: 9950x, 7950x3D, 5950x, 5800x3D
7900xtx, RX9070, Radeon 7, 5700xt, 6900xt, RX 550 640SP
Location: London
Contact:

Re: Nvidia vs AMD CPU usage

Post by muziqaz »

arisu wrote: Thu Mar 13, 2025 3:31 am The site only says "slightly", so I'm guessing it's on the repo. I'll look.

Edit: Yep it's in the repo, probably this one https://github.com/openmm/openmm/issues/2955

The perf decrease is negligible... except on Linux where it is 35% (and Windows if kept in balanced power mode). That explains why it was done. Linux runs on a lot of the most important folding boxes!
The person who did these configs was an nVidia employee at the time and one of the FAH founders. Their knowledge of nVidia hardware preferences beats anyone's here or anywhere else.
The choices made could be considered the best choices for performance. Since that person has now moved on from nVidia, and the CUDA fahcore is running great as is, I don't see anything changing too much in the near future regarding this.
FAH Omega tester
arisu
Posts: 372
Joined: Mon Feb 24, 2025 11:11 pm

Re: Nvidia vs AMD CPU usage

Post by arisu »

Yep it looks like it. I've been reading up on potential improvements (not because I think I know better than the devs, but because I like to learn). I find that the ones that reduce CPU use without reducing GPU folding performance, like estimating the amount of time a certain kernel will take to complete, and sleeping for 90% of that duration so it only spins at the end, come with considerable engineering and testing challenges that wouldn't be worth it just to free up a single CPU core.
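
Something like this is what I mean, as a sketch only (hybrid_wait and predicted_ns are names I made up, and getting a trustworthy prediction is exactly the hard part):

Code: Select all

#include <time.h>
#include <cuda.h>

/* Hybrid wait sketch: sleep through most of the predicted kernel duration,
 * then spin on the completion event only for the tail, so latency is barely
 * affected but the core isn't spinning the whole time. Not real FAH/OpenMM
 * code. predicted_ns would come from a moving average of recent kernels. */
static void hybrid_wait(CUevent done, long long predicted_ns)
{
        long long sleep_ns = predicted_ns * 9 / 10;  /* sleep ~90% of it */
        struct timespec ts = {
                .tv_sec  = sleep_ns / 1000000000LL,
                .tv_nsec = sleep_ns % 1000000000LL,
        };

        nanosleep(&ts, NULL);

        /* Spin only for the last ~10%. */
        while (cuEventQuery(done) == CUDA_ERROR_NOT_READY)
                ;
}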
muziqaz
Posts: 1661
Joined: Sun Dec 16, 2007 6:22 pm
Hardware configuration: 9950x, 7950x3D, 5950x, 5800x3D
7900xtx, RX9070, Radeon 7, 5700xt, 6900xt, RX 550 640SP
Location: London
Contact:

Re: Nvidia vs AMD CPU usage

Post by muziqaz »

arisu wrote: Thu Mar 13, 2025 6:21 am Yep it looks like it. I've been reading up on potential improvements (not because I think I know better than the devs, but because I like to learn). I find that the ones that reduce CPU use without reducing GPU folding performance, like estimating the amount of time a certain kernel will take to complete, and sleeping for 90% of that duration so it only spins at the end, come with considerable engineering and testing challenges that wouldn't be worth it just to free up a single CPU core.
You can try pushing a PR to the OpenMM GitHub, but remember, the people who brought CUDA to the table did not just do a single test and say: yeah, looks good enough. There was a lot of testing, and nVidia's own labs were involved in developing this stuff.
There are always compromises available in development (software or hardware).
FAH Omega tester
DarkFoss
Posts: 129
Joined: Fri Apr 16, 2010 11:43 pm
Hardware configuration: AMD 5800X3D Asus ROG Strix X570-E Gaming WiFi II bios 5031 G-Skill TridentZ Neo 3600mhz Asrock Tachi RX 7900XTX Corsair rm850x psu Asus PG32UQXR EK Elite 360 D-rgb aio Win 11pro/Kubuntu 2404.2 LTS Kernel 6.11.x HWE LowLatency UPS BX1500G
Location: Galifrey

Re: Nvidia vs AMD CPU usage

Post by DarkFoss »

I can't speak for other platforms, but the one in my sig received a BIOS update on April Fool's Day. :| I installed it 2 days ago and I'm seeing different behavior under Linux now. Ubuntu also released an updated 6.11 HWE kernel (I'm using the low-latency variant).
BIOS setting tweaks are enabling DOCP (RAM timings) and lowering the default 90°C cut-off to 88°C.
Before, it would only settle in at 4.2 GHz across all cores, with normal downclocking for temperature regulation dipping the frequency to 4.1 GHz and 1 core dropping to 500 MHz briefly.
Now, with all the new changes, it still does all of the above, but PBO seems to be kicking in on 1 core with a bump up to 4.3 GHz for additional single-thread performance.
Not sure how much impact it will have on my X570, but those with newer AMD platforms may get that little bit extra. As always, YMMV. :wink:
arisu
Posts: 372
Joined: Mon Feb 24, 2025 11:11 pm

Re: Nvidia vs AMD CPU usage

Post by arisu »

I've tested FAH with blocking sync turned on by hooking cuCtxCreate(), masking off the scheduling bits (ANDing the flags with ~0x07), and ORing in 0x04 (CU_CTX_SCHED_BLOCKING_SYNC) before handing control to the real function. It took me longer than expected (it would be easy if the EULA allowed reverse engineering or if it was open source, but I had to treat it as a black box and play with the CUDA runtime and driver API just to not violate it). Also, who knew that cuCtxCreate's symbol was actually cuCtxCreate_v2?? And I had to open libcuda.so.1 in a constructor with dlopen() in order to hook it early enough. There was a lot of messiness involved just to keep within the rules, which is annoying because I'm not even doing this on live projects.

A stripped down version in case anyone wants to play with it (disclaimer: Do not use on any WUs that you are intending to send back to the server):

Code: Select all

#include <stdlib.h>
#include <dlfcn.h>
#include <assert.h>
#include <cuda.h>

typedef CUresult (*cuCtxCreate_type)(CUcontext *, unsigned int, CUdevice);
cuCtxCreate_type orig_cuCtxCreate;

/* Resolve the real driver entry point as early as possible, before FahCore
 * creates its context. Note the exported symbol is cuCtxCreate_v2. */
static __attribute__((constructor)) void hook_function_early(void)
{
        void *handle;

        handle = dlopen("libcuda.so.1", RTLD_NOW);
        assert(handle);

        orig_cuCtxCreate = (cuCtxCreate_type)dlsym(handle, "cuCtxCreate_v2");
        assert(orig_cuCtxCreate);
}

/* Interposed cuCtxCreate: if FORCE_BLOCKING_SYNC is set in the environment,
 * clear the scheduling bits and force CU_CTX_SCHED_BLOCKING_SYNC so the
 * driver yields the CPU instead of spin-waiting during synchronization. */
extern CUresult cuCtxCreate(CUcontext *pctx, unsigned int flags, CUdevice dev)
{
        if (getenv("FORCE_BLOCKING_SYNC")) {
                flags &= ~(CU_CTX_SCHED_MASK);
                flags |= CU_CTX_SCHED_BLOCKING_SYNC;
        }

        return orig_cuCtxCreate(pctx, flags, dev);
}
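To build and preload it (the file names and CUDA include path are just examples from my setup; the point is to get LD_PRELOAD and the variable into the core's environment, however you launch it):

Code: Select all

gcc -shared -fPIC -I/usr/local/cuda/include -o blocksync_hook.so blocksync_hook.c -ldl
LD_PRELOAD=$PWD/blocksync_hook.so FORCE_BLOCKING_SYNC=1 <however you start the client/core>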
Command I used to monitor GPU usage:

Code: Select all

nvidia-smi dmon -i 0 -s pucmt -o T
Command I used to monitor CPU usage:

Code: Select all

pid=$(pgrep -x FahCore_2.)
watch -tp -n1 "top -b -n1 -o -PID -p $pid; lscpu -e=CPU,CORE,CACHE,ONLINE,SCALMHZ%,MHZ,MAXMHZ,MINMHZ"
CPU usage on Linux does decrease, but only slightly on an RTX 4090 Mobile. Usage of the CPU core it runs on (a P-core) only went from 100% to around 90%, and the clock fell from 4 GHz to 3.8 GHz. Some projects seem to have minimal PPD impact, others a more substantial one. Without looking at TPF, I saw the SM usage go from ~95% to a bit over 85% on one project. I suspect there are some scheduler tweaks that could eliminate the PPD loss.

Next step is to hook cuEventSynchronize(), add some rdtsc calls and find some efficient data structure to store and report delays, and see if there's a pattern in how long it blocks. I could use that information to see if there is a maximum amount of time I could make the synchronizer sleep before it begins the spin wait. It might even be possible to make it adapt dynamically at runtime if there isn't too huge of a variation in the amount of time the synchronize call blocks/spins. That should cut down significantly on CPU usage without impacting latency at all.
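
Roughly what I have in mind for the timing hook (a sketch with my own names; the durations go into a log2 histogram so nothing allocates in the hot path, and I'm assuming the exported symbol is plain cuEventSynchronize rather than a _v2 variant, which is worth checking with nm first):

Code: Select all

#include <stdio.h>
#include <dlfcn.h>
#include <assert.h>
#include <x86intrin.h>
#include <cuda.h>

typedef CUresult (*cuEventSynchronize_type)(CUevent);
static cuEventSynchronize_type orig_cuEventSynchronize;

/* buckets[i] counts waits of roughly 2^i TSC cycles. Increments are not
 * atomic, which is fine for a rough first look. */
static unsigned long long buckets[64];

static __attribute__((constructor)) void hook_init(void)
{
        void *handle = dlopen("libcuda.so.1", RTLD_NOW);
        assert(handle);

        orig_cuEventSynchronize =
                (cuEventSynchronize_type)dlsym(handle, "cuEventSynchronize");
        assert(orig_cuEventSynchronize);
}

static __attribute__((destructor)) void hook_dump(void)
{
        for (int i = 0; i < 64; i++)
                if (buckets[i])
                        fprintf(stderr, "~2^%d cycles: %llu\n", i, buckets[i]);
}

extern CUresult cuEventSynchronize(CUevent hEvent)
{
        unsigned long long start = __rdtsc();
        CUresult ret = orig_cuEventSynchronize(hEvent);
        unsigned long long delta = __rdtsc() - start;

        /* 63 - clzll is the index of the highest set bit, i.e. floor(log2). */
        buckets[delta ? 63 - __builtin_clzll(delta) : 0]++;
        return ret;
}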