Why the VMware vSphere TPS vulnerability is a big deal

VMware recently acknowledged a vulnerability with their Transparent Page Sharing (TPS) feature that could potentially allow VM’s to access memory pages of other VMs running on a host. TPS is a host process that leverages the Virtual Machine Monitor (VMM) component of the VMkernel to scan physical host memory to identify redundant VM memory pages. The benefits of TPS are that it allows a host to reduce physical memory usage so you can squeeze more VMs onto a host as memory is often one of the most constrained resources on a host. TPS is basically the equivalent of storage de-duplication for RAM and works at the 4KB block level. You can read a full description on how it works in this classic VI3 memory white paper that refers to it as “Memory Sharing” or this more recent vSphere 5 white paper on memory management. Here’s a brief description of it and a slide on how it works:

[important]When multiple virtual machines are running, some of them may have identical sets of memory content. This presents opportunities for sharing memory across virtual machines (as well as sharing within a single virtual machine). For example, several virtual machines may be running the same guest operating system, have the same applications, or contain the same user data. With page sharing, the hypervisor can reclaim the redundant copies and keep only one copy, which is shared by multiple virtual machines in the host physical memory. As a result, the total virtual machine host memory consumption is reduced and a higher level of memory overcommitment is possible.[/important]

It’s important to note that the TPS feature is nothing new to vSphere having been first introduced with the VI3 release back in 2006, but apparently someone has now finally gotten around to try and successfully exploit it. Why is this a big deal? Because a virtualized architecture demands VM isolation, this is the most important security requirement for virtualization. Each VM guest running on a host must not be allowed in any way to access another VM guest. They must be kept in separate locked rooms with only the hypervisor possessing the keys to access all of them.

To illustrate this let’s use a real world scenario, imagine if you checked into a hotel, you expect privacy and isolation and to not have any other guests be able to come into your room. However your room has an adjoining shared door with another room but neither guest could get through it, only the hotel management can control the door. But somehow a guest has figured out a way to to open that door and get into your room and invade your privacy, that’s a pretty big deal wouldn’t you say.

VMware appears to be down-playing it as it obviously exposes a chink in their virtual armor, they have issued a KB article describing the vulnerability and giving guidance on how customers can disable TPS on their hosts. VMware doesn’t name the specific source that found the vulnerability in the KB article, they simply refer to it as “an academic paper”. The academic paper is entitled “Wait a minute! A fast, Cross-VM attack on AES” and was written by a group of individuals from Worcester Polytechnic Institute in 2014. It was funded by a National Science Foundation grant and it’s a pretty technical, head-spinning read with a lot of mathematical formulas. It’s worth taking a look at though as parts of it are more easily consumable and describe the attack scenarios and results. The overview of the paper is described as:

[important]In this work, we show a novel cache-based side-channel attack on AES that—by employing the Flush+Reload technique—enables, for the first time, a practical full key recovery attack across virtual machine boundaries in a realistic cloud-like server setting. The attack takes advantage of deduplication mechanism called the Transparent Page Sharing which is employed by VMware virtualization engine and is the focus of this work. The attack works well across cores, i.e. it works well in a high-end server with multiple cores scenario that is commonly found in cloud systems. The attack is, compared to [13], minimally invasive, significantly reducing requirements on the adversary: memory accesses are minimal and the accesses do not need to interrupt the victim process’ execution. This also means that the attack is hardly detectable by the victim. Last but not least, the attack is lightning fast: we show that, when running in a realistic scenario where an encryption server is attacked, the whole key is recovered in less than 10 seconds in non-virtualized setting (i.e. using a spy process) even across cores, and in less than a minute in virtualized setting across VM boundaries.[/important]

When I was going through the paper at the end of the paper in the references section one thing I noticed was that the authors of the paper had previously written a paper earlier in 2014 entitled “Fine grain Cross-VM Attacks on Xen and VMware are possible!“. So it appears based on that they worked to apply the theoretical research from that paper into a real-life attack which succeeded and prompted the 2nd paper to be written.

It’s important to note that this is not an easily exploitable vulnerability and the risk is super low so most environments should not really be impacted by it. VMware is being overly cautious though and will disable the feature by default in all upcoming releases of vSphere which includes:

ESXi 5.5 Update release – Q1 2015
ESXi 5.1 Update release – Q4 2014
ESXi 5.0 Update release – Q1 2015
The next major version of ESXi

In addition they will be issuing patches before that to address it sooner rather than later as follows:

ESXi 5.5 Patch 3
ESXi 5.1 patch planned for Q4, 2014
ESXi 5.0 patch planned for Q4, 2014

All versions of vSphere back to VI3 are vulnerable to the exploit but VMware is only patching the 5.x versions of vSphere as the 4.x versions are no longer officially supported as of May 2014. Note these patches only disable TPS which is currently enable by default, they do nothing to fix the vulnerability, it will most likely take VMware some time to figure out how to make TPS work in a way that cannot be exploited. So if you disable TPS on your own you don’t really need the patch. VMware states in the KB article that “Administrators may revert to the previous behavior if they so wish” so it sounds like they are not too worried about it.

The benefits that TPS provides will vary in each environment depending on VM workloads so if you are really paranoid about security you will probably want to disable it. You can view the effectiveness of TPS in vCenter by looking at the “shared” and “sharedcommon” memory counters to see how much you benefit from it. You can disable TPS in your current environment by changing advanced settings on each host as described in the KB article, updating the setting is pretty simple but having it take effect is tedious work:

Log in to ESX\ESXi or vCenter Server using the vSphere Client.
If connected to vCenter Server, select the relevant ESX\ESXi host.
In the Configuration tab, click Advanced Settings under the software section.
In the Advanced Settings window, click Mem.
Look for Mem.ShareScanGHz and set the value to 0.
Click OK.
Perform one of the following to make the TPS changes effective immediately:
1. Migrate all the virtual machines to other host in cluster and back to original host.
2. Shutdown and power-on the virtual machines.

While the impact and exposure may be minimal with this, the fact that someone has finally cracked those solid walls that have stood between VMs is a big deal. I’ve written about it previously in my post on Escaping the Cave, VMware officially describes this as “inter-process side channel leakage” and mentions this in the KB article:

[important]Side channel attacks that exploit information leakage from resources shared between processes running on a common processor is an area of research that has been explored for several years. Although largely theoretical, techniques are continuously improving as researchers build on each other’s work. Although this is not a problem unique to VMware technology, VMware does work with the research community to ensure that the issues are fully understood and to implement mitigation into our products when appropriate[/important]

Note they mention these types of vulnerabilities have been “explored” for years, meaning lots of people have been looking for a way to penetrate those walls as this is the holy grail type of hack of a virtual environment. Imagine if someone compromised a less critical VM through one of the many thousands of OS and application vulnerabilities that exist. At least the damage and exposure is contained within that one VM, but if some one could somehow use that compromised VM as a launchpad for attacking other VMs on a host that’s a real big deal. The fact that most of these types of vulnerabilities have been theoretical up until now really exposes VMware and their security foundation.

Overall vSphere is a very secure platform and has stood the test of time without any major issues which is a testament to how seriously they take security in vSphere. VMware will no doubt learn from this one and work to improve the security in vSphere to make it even better. However the circumstances of this one resulted from them trying to be more efficient with hypervisor resources by mixing VM data together. They will definitely need to be careful going forward as they continually look for new ways to optimize vSphere so they do it in a secure manner and do not risk exposure between VMs.

If you want to find out more about memory management in vSphere check out some of these links: