I’m 100% sure that 100% is not 100%

What I’ve learned about Nvidia vGPU Monitoring

Monitoring a virtualization solution tells you how the solution is performing and whether you have sufficient resources available. This is even more important with 3D applications, because even a small drop in GPU performance can ruin the user experience. Monitoring should be done in every phase of a virtualization project: assessment, sizing and production. When you start to look at vGPU performance, you will soon see that vGPU compute is tricky to monitor, and I’ll explain why.

So what tools can you use to monitor GPU computing? There are two types of tools:

  1. Tools that can monitor GPU from the host level.
    1. XenCenter (vGPU only)
    2. nvidia-smi command-line tool (vGPU only)
  2. Tools that can monitor GPU from the VM level running inside the VM.
    1. uberAgent
    2. GPU-Z
    3. Process Explorer
    4. Lakeside SysTrack (Passthrough only)
    5. GPUPerf (Passthrough only)
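
For the host-level route, a minimal sketch of reading per-pGPU utilization with nvidia-smi could look like this. It assumes a driver whose nvidia-smi supports the `--query-gpu` option and a Python 3.7+ interpreter on the machine where it runs. The function names are made up for illustration.

```python
import subprocess

# Ask nvidia-smi for one CSV line per physical GPU: "<index>, <utilization %>".
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu",
    "--format=csv,noheader,nounits",
]


def parse_utilization(csv_text):
    """Parse nvidia-smi CSV output into {gpu_index: utilization_pct or None}."""
    result = {}
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue
        index, util = (field.strip() for field in line.split(",", 1))
        # Some GPUs or driver versions report a non-numeric value here.
        result[int(index)] = int(util) if util.isdigit() else None
    return result


def read_pgpu_utilization():
    """Run nvidia-smi on the host and return the utilization per physical GPU."""
    completed = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_utilization(completed.stdout)
```

Calling `read_pgpu_utilization()` on the host returns one number per pGPU. Like XenCenter, it does not say which VM is responsible for the load.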

If you have read my previous article about how vGPU compute sharing works, you will know that for passthrough GPU only VM-level monitoring is possible, and for vGPU only host-level monitoring is possible. Yet tools like GPU-Z, Process Explorer and uberAgent are able to capture GPU performance inside a vGPU VM. So I wondered: how is this possible, and can we trust the data? To follow the answer, I assume you have read about how vGPU shares computing power, and that the computing power available to a VM actually scales up and down depending on what’s happening on the other virtual machines.

So let’s look at what happens when we monitor a virtual machine while the workload on another virtual machine is changing. Keep in mind the percentage formula: a percentage is the part divided by the total, times 100.

The total in the GPU performance percentage is the available computing power. I can monitor the total computing power used on a physical GPU (pGPU) from XenCenter. I’ve got two machines with K140Q profiles on the same pGPU. This should give each VM a minimum of 25% of the computing power of that pGPU, with the ability to peak to 100% of the pGPU if it’s available. This is really nice, because it lets you handle peaks in GPU workloads. With 100% workload on two VMs, each VM should be able to use 50% of the pGPU. Let’s start by looking at just one VM.
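
The shares described above can be written down as a small sketch. This is a simplified model of the numbers in this post, not Nvidia’s actual scheduler. It assumes four K140Q vGPUs per pGPU (which is where the 25% minimum comes from) and that all busy VMs run at full load. The function names are hypothetical.

```python
def guaranteed_share(max_vgpus_per_pgpu):
    """Minimum share of the pGPU (in %) each vGPU gets when every slot is busy."""
    if max_vgpus_per_pgpu < 1:
        raise ValueError("a pGPU must host at least one vGPU")
    return 100.0 / max_vgpus_per_pgpu


def share_under_full_load(busy_vms):
    """Share of the pGPU (in %) per VM when all busy VMs demand 100%."""
    if busy_vms < 1:
        raise ValueError("at least one VM must be busy")
    return 100.0 / busy_vms


print(guaranteed_share(4))       # assumed: four K140Q vGPUs per pGPU
print(share_under_full_load(1))  # one VM alone on the pGPU
print(share_under_full_load(2))  # two VMs, both at full load
```

The model only covers the fully loaded case. What each VM gets when the loads are uneven depends on the scheduler and is not modelled here.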

I’m using the Unigine Heaven benchmark to generate 100% load on one vGPU.

Looking at GPU-Z, I can now see that I’m using 100% of the available computing power, and XenCenter shows that 100% of the pGPU is utilized by one VM.

As no other VM is using this GPU, the one VM gets the entire computing power to itself, which gives me decent frame rates in Unigine. When I start Unigine on VM number two, XenCenter still shows 100%, but the frame rate drops on both VMs. The GPU scheduler works very well here: the frame rate is about the same on both machines. As soon as Unigine is stopped on one machine, the frame rate increases on the other. GPU-Z still shows 100% on both VMs. It does not tell you that the available computing power has changed; it just tells you that the VM is using all it has got. So in this case, 100% inside the VM is actually just 50% of the pGPU.
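
The relationship between the two readings is just the percentage formula with a moving total. A minimal sketch, assuming the in-VM value is the used computing power divided by the share currently available to that VM (function names are hypothetical, and the VM itself has no way to see its available share):

```python
def in_vm_reading(pgpu_used_pct, available_share_pct):
    """What a tool inside the VM would show, given the VM's current share."""
    return 100.0 * pgpu_used_pct / available_share_pct


def pgpu_share_used(in_vm_pct, available_share_pct):
    """Convert an in-VM reading back into a percentage of the physical GPU."""
    return in_vm_pct * available_share_pct / 100.0


# One VM alone: 100% in the VM is 100% of the pGPU.
print(pgpu_share_used(100, 100))
# Two VMs at full load: 100% in each VM is only 50% of the pGPU.
print(pgpu_share_used(100, 50))
```

The in-VM number alone is not enough. You also need the available share, and that can only be inferred from the host side.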

Let’s take another scenario: run both machines with less than 50% load. What do we get? I’m using 3D Studio to create the load: an animated rotation of a car gives me 40% GPU load with one VM running.

Both XenCenter and GPU-Z show about 40% load. Firing up the same animation on VM number 2:

GPU-Z shows about 40% on both machines and XenCenter shows 80%. Then I start the Unigine benchmark on VM1 again.

Now GPU-Z shows 100% on VM1, and VM2 has increased from 40% to 80% without any change in the workload on VM2. What happened? The available computing power is now reduced on VM2 from 50% of pGPU to 25%. But if you look at GPU-Z on VM2 only, it appears as if the GPU computing load has increased. The truth is that the available computing power was reduced by 50%.

So as long as all the virtual machines on one physical GPU are using less than 100% of the physical GPU combined, the GPU-Z values are correct. Otherwise they cannot be trusted. The conclusion is that with vGPU, the only trusted source for GPU monitoring is the per-pGPU view in XenCenter. The problem is that there is no way to tell from XenCenter which machine and user is creating the load. This illustrates the complexity of vGPU monitoring.
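
That rule of thumb can be sketched as a small check. It assumes you already have the host-level pGPU utilization (for example from XenCenter) next to the in-VM reading. The function name and the saturation threshold are hypothetical.

```python
def interpret_vm_reading(vm_reading_pct, pgpu_utilization_pct, saturation_pct=100.0):
    """Say what an in-VM GPU reading means, given the host-level pGPU load."""
    if pgpu_utilization_pct < saturation_pct:
        # The pGPU has headroom, so the in-VM value is relative to the whole pGPU.
        return f"{vm_reading_pct:.0f}% of the pGPU"
    # The pGPU is saturated, so the VM's available share has shrunk.
    return f"{vm_reading_pct:.0f}% of an unknown, reduced share of the pGPU"


# Two VMs at 40% each, XenCenter at 80%: the in-VM value can be trusted.
print(interpret_vm_reading(40, 80))
# VM2 reads 80% while the pGPU is assumed to be saturated: do not trust it.
print(interpret_vm_reading(80, 100))
```

In practice you would probably set the threshold a little below 100 to allow for measurement noise.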

In my next blog post I will show you more about how you can compare GPU load data gathered from workstations with multiple GPU types and GPU vendors, and use this data to predict what kind of GPU virtualization solution to use.

Thanks to Thomas Poppelgaard from Poppelgaard.com and Helge Klein from HelgeKlein.com for helping me with this blog post. Also thanks to Jason Southern from Nvidia for verifying the content of this blog post.

2 thoughts on “I’m 100% sure that 100% is not 100%”

    • Sorry for the late reply. You need to install the Nvidia vGPU driver on XenServer and run XenServer 6.2 SP1. Then, from the host level in XenServer, you should be able to add GPU counters for pGPUs that are not assigned in passthrough mode.
