Building a Windows Hypervisor for Systems Research

Motivation

I wanted to understand the Windows kernel from underneath. Reading about how CR3 gets reloaded on a context switch or how SYSCALL/SYSRET transitions actually land only gets you so far without touching the machine at the level that code runs. A hypervisor was the excuse to do that: enable SVM, capture the live CPU state accurately enough that Windows keeps running as a guest, and then sit on VMEXITs long enough to watch what the OS is doing.

That became Hv2 (internal codename for the Lethe project). It is a Type-2 AMD-V hypervisor for Windows 11. It loads as a kernel driver, grabs whatever the CPU was already running, and slides that state into a guest so execution continues at the same RIP and RSP with a virtualized CPU mode layered underneath. No separate guest OS, no boot shim, no persistence. Same kernel, different CPU mode.

Validated on a Ryzen 9 5900X (12C/24T) running Windows 11 24H2. Every logical processor enters guest mode and the VMEXIT loop runs on all of them. Not tested on any other Windows build or any other AMD part.

The code is C++ and x86-64 MASM, built against the WDK.

What is actually in it

Roughly ordered by how hard each was to get right:

Multicore SVM bring-up. Per-core setup pinned to each processor in turn, then a DPC on every non-current core fires the VMRUN so the launch window across cores stays tight.
A VMCB save area rebuilt from live processor state. Segments decoded out of the GDT, GDTR/IDTR snapshotted, control registers and EFER mirrored, SYSCALL and SYSENTER MSRs preserved.
Identity-mapped NPT built up front from MmGetPhysicalMemoryRanges, 4KB pages everywhere, lazy install for any GPA the initial walk missed.
VMEXIT handlers for MSR, VMMCALL, SVM instructions, NPF, #DB, INVD, WBINVD, SKINIT, and guest shutdown.
A key-gated VMMCALL hypercall so a user-mode probe can confirm the hypervisor is alive. Wrong key gets #GP injected back at the guest, which exercises the event-injection path.

That is the whole surface. There is no memory-read primitive, no view-swapping NPT trick, no cross-process address translation. If any of those show up later they will be in the repo before they are in a post.

The launch trick

The odd part about Type-2 bring-up is that there is not really a separate guest. The hypervisor doesn't boot a second OS. It copies the CPU state of whatever is already running into the VMCB and VMRUNs into a guest whose initial state matches the caller exactly. Same RIP, same RSP, same CR3, same segments, same paging. The kernel doesn't notice.

Concretely:

SVM::Initialize runs as a system thread. For each logical processor it allocates a VMCB, a host save area, and a per-vCPU VMM stack.
Per-core setup enables EFER.SVME, writes MSR_VM_HSAVE_PA, and populates the VMCB from the live CPU. Segments are decoded from the GDT. Descriptor tables, control registers, EFER, and the SYSCALL/SYSENTER MSRs get snapshotted.
The VMCB's save.rip points at a two-instruction guest stub called guest_launched. save.rsp gets the caller's stack pointer.
svm_launch switches to the VMM stack, VMSAVEs host extended state to the host save area, and VMRUNs.
Control transfers into guest mode at that stub. The stub does mov al, 1; ret. The ret pops the saved return address off the caller's stack and returns from svm_launch as if it had been a normal function call.
The caller sees svm_launch return true. It has no idea it is now running as a guest. Every VMEXIT from that point on bounces into the assembly entry in Entrypoint.asm, which dispatches to the C++ handler and eventually VMRUNs back in.

The whole thing works because VMRUN's guest state includes everything you need to keep running on bare metal. Point it at the values the host already has and the transition is invisible to the code that was running.

Multicore launch

The naive version launches one core at a time. That does not work. Once a core is virtualized it is running guest code while the others are still doing normal kernel work. Windows' clock watchdog fires the second any core sees another diverging, and the whole thing bugchecks.

The version that works is two-phase. Phase one pins a thread to each processor in turn using KeSetSystemGroupAffinityThread and runs the per-core setup callback there. That only prepares state (EFER.SVME, HSAVE_PA, VMCB init) and does not VMRUN. Phase two queues a DPC on every non-current core so the actual VMRUN fires in parallel. The current core VMRUNs inline after queuing the others. The window between the first and last core entering guest mode is small enough that the watchdog stays quiet.

VMCB state capture

Rebuilding the VMCB from live CPU state is the tedious part. init_vmcb reads the current segment selectors and walks the GDT to decode each descriptor into a base/limit/attribute triple. TR and LDTR are 16-byte system descriptors, so their bases come from two halves. Segment attributes come out of lar and get shifted into the packed format SVM wants.
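
The decode step above can be sketched in user-mode C++. This is a minimal illustration rather than the Hv2 code: SegmentState, pack_attrib_from_lar, and decode_descriptor are hypothetical names, and the sketch works on raw descriptor values instead of reading a live GDT. It assumes the usual x86-64 descriptor layout and the 12-bit attribute packing the VMCB uses (access byte in the low 8 bits, AVL/L/DB/G above it).

```cpp
#include <cassert>
#include <cstdint>

// Illustrative only: these names are hypothetical, not from the Hv2 source.
struct SegmentState {
    std::uint64_t base;
    std::uint32_t limit;   // byte-granular limit, as LSL would report it
    std::uint16_t attrib;  // access byte in bits 0-7, AVL/L/DB/G in bits 8-11
};

// LAR returns the access byte in bits 8-15 and AVL/L/DB/G in bits 20-23.
// The VMCB wants those twelve bits packed contiguously.
inline std::uint16_t pack_attrib_from_lar(std::uint32_t lar) {
    return static_cast<std::uint16_t>(((lar >> 8) & 0xFF) | ((lar >> 12) & 0xF00));
}

// Decode one raw GDT entry. `high` is the second 8 bytes and is only used
// for system descriptors (TSS, LDT), which are 16 bytes wide in long mode.
inline SegmentState decode_descriptor(std::uint64_t low, std::uint64_t high,
                                      bool is_system) {
    assert(((low >> 47) & 1) != 0 && "descriptor must be present");

    // Base bits 0-23 live in descriptor bits 16-39, base bits 24-31 in 56-63.
    std::uint64_t base = ((low >> 16) & 0xFFFFFF) | (((low >> 56) & 0xFF) << 24);
    if (is_system) {
        base |= (high & 0xFFFFFFFFull) << 32;  // base bits 32-63
    }

    // Limit bits 0-15 live in descriptor bits 0-15, limit bits 16-19 in 48-51.
    std::uint32_t limit =
        static_cast<std::uint32_t>((low & 0xFFFF) | ((low >> 32) & 0xF0000));
    if ((low >> 55) & 1) {
        limit = (limit << 12) | 0xFFF;  // G bit set: limit counts 4KB units
    }

    const std::uint16_t attrib =
        static_cast<std::uint16_t>(((low >> 40) & 0xFF) | ((low >> 44) & 0xF00));
    return SegmentState{base, limit, attrib};
}
```

Both paths should agree: packing the LAR result for a selector and decoding the raw descriptor it refers to give the same 12-bit attribute value.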

Control registers, DR6/DR7, and EFER get read straight off the CPU. The SYSCALL family (STAR, LSTAR, CSTAR, SFMASK) and the SYSENTER family have to be snapshotted, otherwise the guest's first syscall after VMRUN lands on a garbage RIP and instantly faults. KERNEL_GS_BASE (MSR 0xC0000102) gets copied too. So does PAT.

One subtle piece: EFER.SVME has to stay set on the CPU for as long as the hypervisor is running, and the EFER value in the VMCB save area has to keep it set too, because VMRUN rejects a guest state with SVME clear. The guest should still see EFER without SVME, because the guest doesn't know it is virtualized. Each vCPU keeps a shadow_efer. RDMSR EFER returns the shadow. WRMSR EFER updates the shadow and the VMCB save. The host's own EFER.SVME bit stays on regardless. This is a correctness requirement in the SVM spec, not a stealth mechanism.
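
A minimal model of the shadow-EFER bookkeeping, for illustration. The VCpu struct and the three helpers are hypothetical names, and forcing SVME on in the VMCB copy on every write is an assumption of this sketch (it follows from VMRUN refusing a guest EFER with SVME clear), not something lifted from the Hv2 source.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t EFER_SVME = 1ull << 12;  // EFER bit 12

// Hypothetical per-vCPU state; field names are illustrative.
struct VCpu {
    std::uint64_t shadow_efer;  // what the guest believes EFER is
    std::uint64_t vmcb_efer;    // save-area EFER that VMRUN actually loads
};

// Seed both copies from the live EFER at launch time.
inline VCpu make_vcpu(std::uint64_t live_efer) {
    return VCpu{live_efer & ~EFER_SVME, live_efer | EFER_SVME};
}

// RDMSR EFER intercept: the guest only ever sees the shadow.
inline std::uint64_t on_rdmsr_efer(const VCpu& v) {
    return v.shadow_efer;
}

// WRMSR EFER intercept: record what the guest asked for, but keep SVME set
// in the copy the CPU will consume (assumption of this sketch).
inline void on_wrmsr_efer(VCpu& v, std::uint64_t value) {
    v.shadow_efer = value;
    v.vmcb_efer = value | EFER_SVME;
    assert((v.vmcb_efer & EFER_SVME) != 0);
}
```

The useful property is that no guest write can clear SVME in the value VMRUN consumes, while a read-back returns exactly what the guest wrote.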

VMEXIT dispatch

The assembly entry in Entrypoint.asm handles the fast path:

VMLOAD from the vCPU's host save area, putting the host's FS/GS/KERNEL_GS_BASE state back where the host expects it.
GP registers get pushed onto the VMM stack.
The C++ vmexit_handler runs with a pointer to the saved register state.
Registers get restored and VMRUN resumes the guest. VMRUN sets GIF again as part of guest entry, so global interrupts stay off for the whole time the host side is running.

GIF is cleared while the handler runs. Anything that blocks, takes an interrupt, or touches pageable code in that window turns into a hang or a bugcheck. The dispatcher stays flat: no locks, no allocations, no waits.

The C++ side routes on exit_code:

RDMSR/WRMSR return or update the shadow copy in the VMCB. Anything outside the small allowlist reads as zero and writes are discarded. Doing a real __readmsr/__writemsr from here is not safe. An unimplemented MSR raises #GP, and because the driver is manually mapped there is no .pdata for SEH to unwind, so that path bugchecks 0x1AA. PatchGuard also watches the critical MSRs and bugchecks 0x109 with Arg4=7 if you __writemsr them from host context. Both problems disappear if you only mutate VMCB save state and let VMRUN apply it.
INVD gets converted to WBINVD. INVD is legal for the guest to issue, but it discards dirty cache lines without writing them back, and some of those lines can hold host data.
WBINVD gets a passthrough handler.
SKINIT is rejected. Letting the guest run SKINIT would establish a new secure root of trust and pull the platform out from under us.
VMRUN/VMLOAD/VMSAVE/STGI/CLGI all inject #UD if the guest thinks SVME is off (matching shadow_efer), or #GP otherwise. No nested virtualization.
VMMCALL runs the hypercall dispatcher (see below).
NPF triggers a lazy identity map for the faulting GPA.
#DB gets re-injected. The hypervisor never sets TF itself, so any #DB it sees belongs to a guest-side debugger.
svm_exit_invalid bugchecks with the exit info. Getting there means the VMCB is in a state the CPU refused to run, and there is nothing useful to salvage.
svm_exit_shutdown (guest triple-fault) parks the core in a HLT loop. There is nothing sensible to do from underneath a guest that just imploded.
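
The #UD-versus-#GP choice for intercepted SVM instructions, and the event-injection value that carries it, can be sketched as follows. The function names are hypothetical. The bit layout is the EVENTINJ format as I understand it from the AMD APM (vector in bits 0-7, type in bits 8-10, error-code-valid in bit 11, valid in bit 31, error code in bits 32-63), so check it against the manual before relying on it.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t EFER_SVME = 1ull << 12;
constexpr std::uint8_t VECTOR_UD = 6;   // invalid opcode, no error code
constexpr std::uint8_t VECTOR_GP = 13;  // general protection, error code 0 here

// Build an EVENTINJ value for a hardware exception (event type 3).
inline std::uint64_t make_exception_inject(std::uint8_t vector,
                                           bool has_error_code,
                                           std::uint32_t error_code = 0) {
    assert(vector < 32 && "exceptions use vectors 0-31");
    std::uint64_t value = vector;
    value |= 3ull << 8;                  // type = exception
    if (has_error_code) {
        value |= 1ull << 11;             // EV: error code field is valid
        value |= static_cast<std::uint64_t>(error_code) << 32;
    }
    value |= 1ull << 31;                 // V: inject on next VMRUN
    return value;
}

// Hypothetical handler for VMRUN/VMLOAD/VMSAVE/STGI/CLGI intercepts.
// If the guest believes SVME is off, the instruction "does not exist" (#UD).
// If the guest set SVME in its shadow, refuse with #GP since nesting is not
// supported. RIP is left alone in both cases so the fault points at the
// offending instruction.
inline std::uint64_t on_svm_instruction(std::uint64_t shadow_efer) {
    if ((shadow_efer & EFER_SVME) == 0) {
        return make_exception_inject(VECTOR_UD, false);
    }
    return make_exception_inject(VECTOR_GP, true, 0);
}
```

The returned value would be stored in the VMCB's event-injection field before the next VMRUN.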

MSR bitmap

The MSR bitmap controls which MSR accesses cause a VMEXIT. Hv2 intercepts EFER (0xC0000080) for the shadow-EFER path, and FS_BASE / GS_BASE so their values stay routed through the VMCB.

If CR4.FSGSBASE is set, RDFSBASE / WRFSBASE and their GS counterparts are legal at CPL=3 without a syscall, so the guest can change the bases without ever touching the MSRs. In that case the intercepts on FS_BASE and GS_BASE no longer see every update and just produce noise, so init clears those bits when it sees CR4.FSGSBASE. Small optimization, but on modern Windows it comes up.
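
For reference, here is a sketch of how the bitmap bits might be located and toggled. It assumes the standard 8KB MSR permission map layout (two bits per MSR, read then write, with the 0xC0000000 range starting at byte offset 0x800) and assumes both reads and writes are intercepted. The helper names are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// The MSR permission map is 8KB. Each covered MSR owns two adjacent bits:
// the even bit intercepts RDMSR, the odd bit intercepts WRMSR.
using Msrpm = std::array<std::uint8_t, 0x2000>;

constexpr std::uint32_t MSR_EFER    = 0xC0000080u;
constexpr std::uint32_t MSR_FS_BASE = 0xC0000100u;
constexpr std::uint32_t MSR_GS_BASE = 0xC0000101u;

// Bit index of the read-intercept bit for `msr`, or -1 if the MSR falls
// outside the three ranges the map covers.
inline std::int64_t msrpm_read_bit(std::uint32_t msr) {
    struct Range { std::uint32_t first; std::int64_t byte_offset; };
    static const Range ranges[] = {
        {0x00000000u, 0x0000},
        {0xC0000000u, 0x0800},
        {0xC0010000u, 0x1000},
    };
    for (const Range& r : ranges) {
        if (msr >= r.first && msr - r.first <= 0x1FFFu) {
            return r.byte_offset * 8 + static_cast<std::int64_t>(msr - r.first) * 2;
        }
    }
    return -1;
}

// Set or clear both the read and write intercept bits for one MSR.
inline void set_intercept(Msrpm& map, std::uint32_t msr, bool on) {
    const std::int64_t bit = msrpm_read_bit(msr);
    assert(bit >= 0 && "MSR not covered by the permission map");
    const std::uint8_t mask = static_cast<std::uint8_t>(0x3u << (bit % 8));
    std::uint8_t& byte = map[static_cast<std::size_t>(bit / 8)];
    byte = on ? static_cast<std::uint8_t>(byte | mask)
              : static_cast<std::uint8_t>(byte & ~mask);
}

// Hypothetical init step: always intercept EFER, and intercept the FS/GS
// base MSRs only when CR4.FSGSBASE is clear.
inline void configure_msrpm(Msrpm& map, bool cr4_fsgsbase) {
    set_intercept(map, MSR_EFER, true);
    set_intercept(map, MSR_FS_BASE, !cr4_fsgsbase);
    set_intercept(map, MSR_GS_BASE, !cr4_fsgsbase);
}
```

With this layout the EFER bits land in byte 0x820 and the FS_BASE/GS_BASE bits share byte 0x840.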

Nested page tables

NPT is AMD's second-level translation. Guest physical addresses go through a hypervisor-owned page table before hitting real RAM. In Hv2 that table is an identity map: guest PA equals host PA everywhere.

The initial build enumerates MmGetPhysicalMemoryRanges and installs a 4KB entry for every page inside every reported RAM range. It also maps the APIC BAR page explicitly (MSR 0x1B bits 12-51), because interrupt delivery needs it and it is not covered by the memory-ranges list. Anything outside those ranges (MMIO, firmware holes) gets picked up on demand: an NPF on an unmapped GPA lazily installs a 4KB identity entry there.

Intermediate tables are RWX; the actual permission lives on the leaf. Everything runs under a single ASID, and NPT mutations flush the whole guest TLB via tlb_control = 3. That is not great for throughput, but it is correct, and correctness first was the right call here.
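
A short sketch of the arithmetic behind an identity-mapped leaf, assuming the standard 4-level x86-64 table format that NPT reuses. The names are hypothetical. The one detail worth calling out is the user bit: nested walks are treated as user-mode accesses, so a leaf without it faults even though the guest is in ring 0.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t NPT_PRESENT   = 1ull << 0;
constexpr std::uint64_t NPT_WRITE     = 1ull << 1;
constexpr std::uint64_t NPT_USER      = 1ull << 2;   // required for nested walks
constexpr std::uint64_t NPT_NX        = 1ull << 63;
constexpr std::uint64_t NPT_ADDR_MASK = 0x000FFFFFFFFFF000ull;  // bits 12-51

struct NptIndices {
    unsigned pml4, pdpt, pd, pt;
};

// Split a guest physical address into the four 9-bit table indices.
inline NptIndices split_gpa(std::uint64_t gpa) {
    assert((gpa >> 52) == 0 && "physical addresses are at most 52 bits");
    return NptIndices{
        static_cast<unsigned>((gpa >> 39) & 0x1FF),
        static_cast<unsigned>((gpa >> 30) & 0x1FF),
        static_cast<unsigned>((gpa >> 21) & 0x1FF),
        static_cast<unsigned>((gpa >> 12) & 0x1FF),
    };
}

// Build a 4KB identity leaf: the frame address is the GPA's own page.
inline std::uint64_t make_identity_leaf(std::uint64_t gpa, bool executable) {
    std::uint64_t entry = (gpa & NPT_ADDR_MASK) | NPT_PRESENT | NPT_WRITE | NPT_USER;
    if (!executable) {
        entry |= NPT_NX;
    }
    return entry;
}

// Extract the APIC page from the value of MSR 0x1B (bits 12-51).
inline std::uint64_t apic_page_from_msr(std::uint64_t apic_base_msr) {
    return apic_base_msr & NPT_ADDR_MASK;
}
```

For the common APIC base of 0xFEE00000 the indices come out as PML4 0, PDPT 3, PD 503, PT 0.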

An earlier version used 1GB large pages with on-demand splits. It worked for a few hours at a time, then started bluescreening in ways I could not reproduce, always somewhere near a split boundary. Going 4KB everywhere costs about 128MB of nonpaged pool up front and kills the whole class of bugs. Two weekends of NPT debugging was enough motivation to spend the pool.

The other allocation lesson: doing 65K individual ExAllocatePool2 calls took four to five minutes at init. Preallocating 2MB chunks and slicing each one into 4KB sub-tables gets that down to seconds. The chunk is virtually contiguous but not physically contiguous, so each sub-table's physical address is captured separately and stored in a small open-addressed PA-to-VA hash table for lookup during table walks.
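
One way such a PA-to-VA table could look, as a user-mode sketch. The class name, the linear-probing scheme, and the choice of hashing on the page frame number are all assumptions here, and std::vector stands in for whatever fixed storage the driver actually uses.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical fixed-capacity open-addressed map with linear probing.
// Keys are 4KB-aligned physical addresses of sub-tables.
class PaToVaMap {
public:
    explicit PaToVaMap(std::size_t capacity_pow2)
        : slots_(capacity_pow2, Slot{kEmpty, nullptr}), mask_(capacity_pow2 - 1) {
        assert(capacity_pow2 != 0 && (capacity_pow2 & (capacity_pow2 - 1)) == 0);
    }

    // Returns false only when every slot is taken by a different key.
    bool insert(std::uint64_t pa, void* va) {
        for (std::size_t i = 0; i <= mask_; ++i) {
            Slot& s = slots_[(home(pa) + i) & mask_];
            if (s.pa == kEmpty || s.pa == pa) {
                s = Slot{pa, va};
                return true;
            }
        }
        return false;
    }

    // Probe from the home slot until the key or an empty slot turns up.
    void* lookup(std::uint64_t pa) const {
        for (std::size_t i = 0; i <= mask_; ++i) {
            const Slot& s = slots_[(home(pa) + i) & mask_];
            if (s.pa == pa) return s.va;
            if (s.pa == kEmpty) return nullptr;
        }
        return nullptr;
    }

private:
    struct Slot {
        std::uint64_t pa;
        void* va;
    };
    static constexpr std::uint64_t kEmpty = ~0ull;  // never a valid aligned PA

    std::size_t home(std::uint64_t pa) const {
        return static_cast<std::size_t>(pa >> 12) & mask_;
    }

    std::vector<Slot> slots_;
    std::size_t mask_;
};
```

Because entries are only ever added during init and never removed, the table needs no tombstones, and lookups during a table walk are read-only.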

Allocation only happens at PASSIVE_LEVEL during init. The NPF handler cannot call ExAllocatePool2: it runs at GIF=0, and any code path that IPIs another core (which ExAllocatePool2 can) deadlocks and bugchecks 0x101. So the runtime path pulls from the preallocated slab and bumps a counter.
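
The runtime side of that can be as small as a bump allocator over the preallocated slab. The sketch below is a user-mode model with hypothetical names. It uses std::vector where the driver would use nonpaged pool, and it makes the counter atomic on the assumption that several cores can take NPFs at once.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical slab of zeroed 4KB tables, filled once at init and handed out
// by bumping a counter. Nothing is ever freed. A real slab must also be
// 4KB-aligned, which std::vector does not guarantee.
class TableSlab {
public:
    static constexpr std::size_t kTableSize = 4096;

    explicit TableSlab(std::size_t table_count)
        : storage_(table_count * kTableSize, 0), count_(table_count) {
        assert(table_count > 0);
    }

    // Safe to call with interrupts off: no locks, no waits, no allocation.
    // Returns nullptr once the slab is exhausted.
    void* take() noexcept {
        const std::size_t index = next_.fetch_add(1, std::memory_order_relaxed);
        if (index >= count_) {
            return nullptr;
        }
        return storage_.data() + index * kTableSize;
    }

    std::size_t remaining() const noexcept {
        const std::size_t used = next_.load(std::memory_order_relaxed);
        return used >= count_ ? 0 : count_ - used;
    }

private:
    std::vector<std::uint8_t> storage_;
    std::size_t count_;
    std::atomic<std::size_t> next_{0};
};
```

The caller has to treat a nullptr result as fatal or skip the mapping, since there is no way to grow the slab from inside the handler.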

The hypercall

There is one hypercall: hv2_ping. A user-mode probe issues VMMCALL with a shared key in rcx and a hypercall ID in rdx. If the key does not match, the handler injects #GP back at the guest and does not advance RIP. If it matches and the ID is hv2_ping, the handler writes a magic value into rax and advances past the vmmcall. On bare metal without Hv2 loaded, the vmmcall traps as #UD.
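
As a sketch, the handler logic reduces to a few lines. The key, hypercall ID, and magic value below are made up for illustration, and the struct is a simplified stand-in for the VMCB plus saved registers (on real hardware guest RAX lives in the VMCB save area). Treating an unknown ID the same as a bad key is an assumption of this sketch, since the text above only describes the bad-key and hv2_ping cases.

```cpp
#include <cassert>
#include <cstdint>

// All three constants are hypothetical placeholders.
constexpr std::uint64_t HV2_KEY        = 0x4C45544845ull;
constexpr std::uint64_t HV2_PING       = 1;
constexpr std::uint64_t HV2_PING_MAGIC = 0x50494E47ull;

// EVENTINJ encoding for #GP with error code 0:
// vector 13, type 3 (exception), error-code-valid, valid.
constexpr std::uint64_t INJECT_GP = 0x80000B0Dull;

// Simplified guest view for this model.
struct GuestState {
    std::uint64_t rax;
    std::uint64_t rcx;           // shared key
    std::uint64_t rdx;           // hypercall ID
    std::uint64_t rip;
    std::uint64_t event_inject;  // 0 means nothing pending
};

inline void on_vmmcall(GuestState& g) {
    assert(g.event_inject == 0 && "no event should be pending on entry");

    // Wrong key (or, by assumption here, unknown ID): inject #GP and leave
    // RIP pointing at the VMMCALL so the fault is reported there.
    if (g.rcx != HV2_KEY || g.rdx != HV2_PING) {
        g.event_inject = INJECT_GP;
        return;
    }

    g.rax = HV2_PING_MAGIC;
    g.rip += 3;  // VMMCALL encodes as 0F 01 D9, three bytes
}
```

On CPUs that save the next RIP in the VMCB on exit, using that value is an alternative to the hard-coded instruction length.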

That is the entire surface. Keeping the boundary tiny is intentional: anything crossing it is a place a guest bug or malicious input can hit the VMM, and this is a research driver, not something that should sit anywhere near a real threat model.

What actually taught me the most

Host save area is not optional. AMD's VMEXIT does not restore host FS, GS, or KERNEL_GS_BASE for you. If the exit entry does not VMLOAD the host save area at the very top, the handler runs with the guest's GS still active. Windows kernel code reads gs:[...] for the PCR (the processor control region) all the time, so the moment the handler touches almost anything, it is dereferencing whatever the guest had in GS (a user-mode TEB, if the exit came from ring 3) thinking it is the host PCR. Everything corrupts quietly until something faults far away from the actual cause. This took me embarrassingly long to find, and the tell was that crashes never landed in the same place twice.

PatchGuard watches MSRs. Any __writemsr on the critical MSRs (LSTAR, STAR, EFER, and friends) from host context eventually gets caught and bugchecks 0x109 with Arg4=7. Easy to hit if you try to keep host and guest state in sync by writing straight to the CPU. The right answer is to only touch VMCB save state and let VMRUN apply it.

GIF=0 windows are hostile. Anything running between VMEXIT and the next VMRUN has global interrupts off. No IPIs, no calls into pageable code, no ExAllocatePool2, no KeWaitForSingleObject. Every mistake shows up as bugcheck 0x101 (clock watchdog) or a hard hang.

Debuggability drops through the floor. When the VMM is wrong, the whole machine is wrong. WinDbg over serial gives you some visibility, but the crash almost never points at the actual cause. It points at whatever the kernel did five milliseconds later on a corrupted PCR. You end up reading a lot of AMD APM Volume 2 and staring at the raw VMCB before you get anywhere.

Credit to Satoshi Tanda's SimpleSvm and Samuel Tulach's Memhv. Reading their code helped me figure out where mine had diverged from the spec and why.

What is next

The near-term work is unglamorous. Better TLB tracking so NPT writes do not have to invalidate the world every time. Tighter error paths in bring-up. Better telemetry out of the VMEXIT loop so a live core issue is diagnosable without a serial cable and a lot of patience. Longer term, I want to look at measured-boot flows, TPM attestation, and how the platform actually establishes trust at boot. If that turns into a UEFI-side experiment it will be about the boot integrity story, not about running anything at that layer persistently.

Source: github.com/KQAsk/Hv2