The future of high-performance computing will be virtualized, VMware’s Uday Kurkure has told The Register.
Kurkure, the lead engineer for VMware’s performance engineering team, has spent the past five years working on ways to virtualize machine-learning workloads running on accelerators. Earlier this month his team reported “near or better than bare-metal performance” for Bidirectional Encoder Representations from Transformers (BERT) and Mask R-CNN — two popular machine-learning workloads — running on virtualized GPUs (vGPU) connected using Nvidia’s NVLink interconnect.
NVLink enables compute and memory resources to be shared across up to four GPUs over a high-bandwidth mesh fabric operating at 6.25GB/s per lane, compared to roughly 2GB/s per lane for PCIe 4.0. The interconnect enabled Kurkure’s team to pool 160GB of GPU memory from the four 40GB Nvidia A100 SXM GPUs in a Dell PowerEdge system.
“As the machine learning models get bigger and bigger, they don’t fit into the graphics memory of a single chip, so you need to use multiple GPUs,” he explained.
Support for NVLink in VMware’s vSphere is a relatively new addition. By toggling NVLink on and off in vSphere between tests, Kurkure was able to determine how large an impact the interconnect had on performance.
And in what should be a surprise to no one, the large ML workloads ran faster, scaling linearly with additional GPUs, when NVLink was enabled.
Testing showed Mask R-CNN training running 15 percent faster in a twin-GPU NVLink configuration, and 18 percent faster when using all four A100s. The performance delta was even greater in the BERT natural language processing model, where the NVLink-enabled system performed 243 percent faster when running on all four GPUs.
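To put those percentages on a common footing, here is a minimal Python sketch that converts an "N percent faster" figure into a throughput multiplier. It assumes the figures describe throughput relative to the same system with NVLink disabled; the article does not say which metric was measured, and the function name is ours, purely for illustration.

```python
def speedup_factor(percent_faster: float) -> float:
    """Convert 'N percent faster' into a multiplier over the baseline.

    Assumption: 'percent faster' means throughput gained relative to the
    NVLink-disabled run. The source article does not define the metric.
    """
    return 1.0 + percent_faster / 100.0


if __name__ == "__main__":
    # Figures quoted in the article for NVLink on vs. off.
    results = {
        "Mask R-CNN, 2 GPUs": 15,
        "Mask R-CNN, 4 GPUs": 18,
        "BERT, 4 GPUs": 243,
    }
    for name, pct in results.items():
        print(f"{name}: {speedup_factor(pct):.2f}x the NVLink-off baseline")
```

Read this way, the BERT result works out to about 3.43 times the NVLink-off baseline, against 1.15 to 1.18 times for Mask R-CNN.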
What’s more, Kurkure says the virtualized GPUs were able to achieve the same or better performance compared to running the same workloads on bare metal.
“Now with NVLink being supported in vSphere, customers have the flexibility where they can combine multiple GPUs on the same host using NVLink so they can support bigger models, without a significant communication overhead,” Kurkure said.
HPC, enterprise implications
Based on the results of these tests, Kurkure expects most HPC workloads will be virtualized moving forward. The HPC community is always running into performance bottlenecks that leave systems underutilized, he added, arguing that virtualization enables users to make much more efficient use of their systems.
Kurkure’s team was able to achieve performance comparable to bare metal while using just a fraction of the dual-socket system’s CPU resources.
“We were only using 16 logical cores out of 128 available,” he said. “You could use that CPU resources for other jobs without affecting your machine-learning intensive graphics modules. This is going to improve your utilization, and bring down the cost of your datacenter.”
Toggling NVLink on and off between GPUs also adds platform flexibility, allowing multiple isolated AI/ML workloads to be spread across the GPUs simultaneously.
“One of the key takeaways of this testing was that because of the improved utilization offered by vGPUs connected over a NVLink mesh network, VMware was able to achieve bare-metal-like performance while freeing idle resources for other workloads,” Kurkure said.
VMware expects these results to improve resource utilization in several applications, including investment banking, pharmaceutical research, 3D CAD, and auto manufacturing. 3D CAD is a particularly high-demand area for HPC virtualization, according to Kurkure, who cited several customers looking to implement machine learning to assist with the design process.
And while it’s possible to run many of these workloads on GPUs in the cloud, he argued that cost and/or intellectual property rules may prevent customers from doing so.
vGPU vs MIG
It’s worth noting that VMware’s tests were conducted using Nvidia’s vGPU Manager in vSphere, as opposed to the hardware-level partitioning offered by multi-instance GPU (MIG) on the A100. MIG essentially allows the A100 to behave like up to seven less-powerful GPUs.
By comparison, vGPUs are defined in the hypervisor and are time-sliced. You can think of this as multitasking where the GPU rapidly cycles through each vGPU workload until they’re completed.
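The multitasking analogy can be sketched as a simple round-robin loop. This is a toy model in Python, not Nvidia’s actual vGPU scheduler: the workload names, the abstract "work units", and the fixed slice size are all hypothetical, and the real scheduler’s policies and context-switch costs are not modelled.

```python
from collections import deque


def time_slice(workloads: dict[str, int], slice_units: int = 1) -> list[str]:
    """Return the order in which workloads receive the GPU.

    `workloads` maps a (hypothetical) vGPU name to its remaining work
    units. Each turn, the workload at the front of the queue runs for
    `slice_units`, then rejoins the back of the queue if unfinished.
    """
    queue = deque(workloads.items())
    schedule = []
    while queue:
        name, remaining = queue.popleft()
        schedule.append(name)          # this vGPU owns the GPU for one slice
        remaining -= slice_units
        if remaining > 0:              # unfinished work goes to the back
            queue.append((name, remaining))
    return schedule


if __name__ == "__main__":
    order = time_slice({"vgpu-a": 3, "vgpu-b": 1, "vgpu-c": 2})
    print(order)
    # ['vgpu-a', 'vgpu-b', 'vgpu-c', 'vgpu-a', 'vgpu-c', 'vgpu-a']
```

Every hand-off between entries in that schedule stands in for a context switch, which is where the potential overhead of time-slicing comes from.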
The benefit of vGPUs is that users can scale well beyond seven GPU instances, at the cost of potential overheads associated with rapid context switching, Kurkure explained. However, at least in his testing, the use of vGPUs didn’t appear to have a negative impact on performance compared to running on bare metal with the GPUs passed through to the VM.
Whether MIG would change this dynamic remains to be seen and is the subject of another ongoing investigation by Kurkure’s team. “It’s not clear when you should be using vGPU and when we should be running in MIG mode,” he said.
More to come
With NVLink-connected vGPUs validated for scale-up workloads, VMware is now exploring how these workloads scale across multiple systems and racks over RDMA over converged Ethernet (RoCE). Here, Kurkure says, networking becomes a major consideration.
“The natural extension of this is scale out,” he said. “So, we’ll have a number of hosts connected by RoCE.”
Kurkure’s team is also investigating how these architectures scale with even larger AI/ML models, like GPT-3, as well as how they can be applied to telco workloads running at the edge. ®