SyntheticBird45

They/them

x86/C/Rust developer. Monero enthusiast. Cryptography fan. Binary vulnerability research admirer

An early look at building a monerod microVM

Since the introduction of virtualization-based security in Windows, a number of security enthusiasts and companies have taken an interest in leveraging hardware-assisted virtualization for security purposes. The introduction of protected KVM (pKVM) by Google into the Linux kernel, and its upcoming use in Android through AVF and Microdroid, is another sign of this interest. Combined with the ARM Memory Tagging Extension in the Pixel 8 series and its early support in GrapheneOS, Google's Android security model has a bright future ahead!

However, I'm broke… And while I can't afford a Pixel 8, there are a fair number of things I wish to virtualize on my noble x86_64 Linux server in order to improve my operational security. Monero is actively being targeted by numerous advanced adversaries, so I thought it would be the first piece of software to look at.

In this post, I'll share my experience of trying to create a fully virtualized, not yet memory-efficient, secure microVM to run an isolated Monero daemon. A microVM is a type of virtual machine designed to be highly efficient with a low system footprint. The aim is to run individual software or sets of programs within a microVM with minimal overhead. Make sure to read the end of the post before diving into the installation.

Annotation: pKVM is an extension of the Linux KVM mechanism that provides protection to the guest OS from host tampering with its memory. While it is described as requiring ARM processors, similar protection can be achieved using confidential computing extensions such as x86 AMD-SEV (which opened its firmware btw).

A useless but interesting look at hardware-assisted virtualization's history

I've always been interested in hardware-assisted virtualization, which allows a system to run with the same behavior as if it weren't isolated. Cost drove virtualization development in the 1990s and early 2000s, as servers needed an efficient way to handle resources shared between multiple clients. The three existing hypervisors at the time (VMware, QEMU, and Xen) all relied on software techniques, binary translation or, in Xen's case, paravirtualization, to trap privileged instructions and system calls.

Starting in 2005, CPU manufacturers released Intel VT-x and AMD-V, which allowed for native, full virtualization of memory context, second-level address translation and privileged instruction support. Hypervisors and the Linux kernel with KVM (2007) switched to this technology, offering significant performance improvements and unknowingly laying the groundwork for future security improvements. However, large companies were not yet interested in offering full VMs to their clients, as it wasn't resource-efficient. Instead, they focused on single-purpose services like Google Drive and Netflix.

Fast forward to 2013, when Docker emerged, building on 2008's cgroups and Linux containers features to let multiple isolated userlands run on the same kernel. This innovation led to new development practices and automation software, driving the growth of public cloud infrastructure. But here's the catch: OS-level virtualization means security is handled by the OS, and a shared kernel of nearly 20 million lines of code can be a problem. Docker, LXC, and Podman are good at isolating the filesystem, but once the kernel is exploited, security collapses. Fully virtualized software avoids this issue since the CPU hides the context and addresses of the supervisor. As a result, there's been renewed interest in hardware-assisted virtualization for its ability to ensure high security or, conversely, for hacking purposes.

The development of cloud and now well-known internet services has been driven by the development of virtualization. The Wikipedia page on the topic summarizes this very well.

Note: While Windows uses Hypervisor-protected Code Integrity (HVCI) to protect the kernel from tampering, several bypasses have emerged. It's important to note, however, that these bypasses are due to implementation flaws rather than any inherent weakness in the hardware. Virtualization remains the most robust isolation mechanism available today. A vulnerability that lets a guest VM execute code on the host or modify its behavior is called a Virtual Machine Escape, and an up-to-date list exists on Wikipedia.

The Hypervisor

For some time now, the Rust programming language has been adopted for its safety, and as a result hypervisors written in this language are being developed. Here are the hypervisors we could consider for our use case:

Firecracker is a Rust-written hypervisor maintained by Amazon and optimized for short-lived workloads. It strongly emphasizes security and a small memory footprint. However, because Firecracker is designed for very short-lived tasks, it does not release memory and is therefore not suitable for our long-running workload.
krunvm is another Rust-written hypervisor focused on simplicity and OCI compatibility. It is meant to be used with podman, an alternative to Docker. While its ongoing development is interesting, it is still alpha software, and being unfamiliar with container images, I preferred to ditch this one too. If you know what you're doing, you should give it a try.
QEMU is the ultimate hypervisor nowadays. It is fast, stable and full of features, but because it is so big and written in C, I decided to go without it. Know that everything that follows is entirely possible with QEMU.
crosvm is Google's Rust-written, client-oriented hypervisor. It is meant to be used inside ChromeOS to run software from other OSes at native speed using paravirtualization. It is meant to be comparable to QEMU.
cloud-hypervisor is another Rust-written hypervisor that also focuses on security, speed and memory footprint for cloud workloads. It is funded by several companies and managed by the Linux Foundation. With the help of rust-firmware or cloud-hypervisor's edk2 branch, Windows is also supported. It also supports VFIO for GPU users or PCI bindings.

I decided to go with cloud-hypervisor for its ease of use and cool features.

The kernel

We need to compile a custom Linux kernel with specific parameters to run our microVM under cloud-hypervisor.

Cloning the linux-hardened kernel

Guest security is also important, as it is part of our defense-in-depth strategy against an advanced attacker trying to mess with the hypervisor or simply alter the intended software behavior (turning a legit monerod into a malicious one). That's why, for VMs, you should always use the linux-hardened kernel, which is safer than the vanilla one. Let's clone its repo and compile it right away:

~$ mkdir monerod-microvm && cd monerod-microvm/
~/monerod-microvm$ git clone https://github.com/anthraxx/linux-hardened -b 6.7 # replace the branch with the latest version

Note: The linux-hardened kernel disables a number of features by default, such as user namespaces, io_uring, unprivileged eBPF and more. That is good for server and standard desktop usage, but can quickly limit power users.

Configuring kernel features

The Linux kernel needs a configuration to compile. It indicates which features to build in and which ones we don't need. cloud-hypervisor provides default arm64 and x86 configs for the guest to work properly. Let's fetch the x86 kernel config:

~/monerod-microvm$ wget https://raw.githubusercontent.com/cloud-hypervisor/cloud-hypervisor/main/resources/linux-config-x86_64
~/monerod-microvm$ mv linux-config-x86_64 linux-hardened/.config # the config file must be named .config in the repo root

Enable free_page_reporting

In order for the guest Linux kernel to report freed memory pages to the host, you need to enable the free_page_reporting feature. You can enable it with menuconfig, which lets you manually toggle kernel features. Use this command:

~/monerod-microvm/linux-hardened$ make menuconfig

And navigate to

Memory Management options -> Free page reporting

and toggle it.

Note: When this feature is disabled, the hypervisor will keep allocating more memory up to its configured limit, even if no program is actually making use of it.

Enable DAMON feature

In order to improve the reclamation of freed memory, we need to enable an obscure feature of the Linux kernel called the Data Access MONitor, or DAMON. You need to enable both the DAMON and DAMON_RECLAIM features. You can find them under:

Memory Management options -> Data Access Monitoring

Toggle DAMON: Data Access Monitoring Framework, then Data access monitoring operations for the physical address space, and then Build DAMON-based reclaim.
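
Before starting a long build, it can be worth double-checking that the options actually landed in .config. The sketch below assumes the menu entries above correspond to the Kconfig symbols CONFIG_PAGE_REPORTING, CONFIG_DAMON, CONFIG_DAMON_PADDR and CONFIG_DAMON_RECLAIM; the function name check_kernel_config is made up for this post.

```shell
#!/bin/bash
# Report which of the kernel options needed by this tutorial are missing
# from a .config file. The symbol names are an assumption based on the
# menuconfig entries toggled above.
# Usage: check_kernel_config path/to/.config
check_kernel_config() {
    local config_file="$1" missing=0 opt
    for opt in PAGE_REPORTING DAMON DAMON_PADDR DAMON_RECLAIM; do
        if grep -q "^CONFIG_${opt}=y$" "$config_file"; then
            echo "ok:      CONFIG_${opt}"
        else
            echo "missing: CONFIG_${opt}"
            missing=1
        fi
    done
    return "$missing"
}

# Only run automatically when a .config sits in the current directory.
if [ -f .config ]; then
    check_kernel_config .config || echo "re-run make menuconfig before compiling"
fi
```

Run it from the kernel source directory. If you prefer to avoid menuconfig entirely, the kernel tree also ships a scripts/config helper that can enable symbols from the command line (for example ./scripts/config --enable DAMON_RECLAIM), though I only used menuconfig here.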

Compile the kernel

Now let's compile our kernel with make:

~/monerod-microvm/linux-hardened$ make -j"$(nproc)" # one job per CPU core

If the configuration restarts and the console asks you to manually answer every option, just hold the ENTER key to accept the defaults (make olddefconfig does the same thing non-interactively). It is irresponsible but worked for me :p

The image

The drive format

Now that we have our kernel, we need an actual image to hold our system files. Exactly like on bare metal, an SSD is recommended for a reasonable syncing time. It is even more important here, as the choice of disk format adds a performance overhead.

cloud-hypervisor supports qcow2, the dynamically sized QEMU disk format, and raw, which is just raw bytes. qcow2 is the smarter solution as it lets you easily add more storage to your VM if the blockchain grows, but raw is a lot faster. I decided to go with raw since my host uses btrfs and a raw disk does not actually allocate space until data is written. It is a neat feature, as I'm now the proud owner of a 500GB partition with 800GB of files in it and 200GB of free space ;)

~/monerod-microvm$ qemu-img create -f raw monerod-disk.ext4 150G # we're going to use a pruned node
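
To see the lazy allocation described above for yourself, you can compare a file's apparent size with the space it really occupies. This is a small sketch that assumes GNU coreutils (truncate and stat) and a filesystem with sparse file support; it works on a throwaway temporary file, not on the real disk image.

```shell
#!/bin/bash
# Create a sparse 1 GiB file and compare its apparent size with the space
# actually allocated on disk. Assumes GNU coreutils (truncate, stat).
img="$(mktemp)"
truncate -s 1G "$img"

# %s is the apparent size in bytes, which is what the guest would see.
apparent_bytes=$(stat -c %s "$img")
# %b is the number of allocated blocks, %B the size of each of those blocks.
allocated_bytes=$(( $(stat -c %b "$img") * $(stat -c %B "$img") ))

echo "apparent:  ${apparent_bytes} bytes"
echo "allocated: ${allocated_bytes} bytes"
rm -f "$img"
```

On the real image, du -h --apparent-size monerod-disk.ext4 and du -h monerod-disk.ext4 give the same comparison, and the second number grows as the blockchain syncs.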

The distribution

I originally tested this with Gentoo Hardened/SELinux, Gentoo LLVM/Systemd/SELinux and Arch Linux. It worked great, but Gentoo is very painful to install, and most people can also benefit from Arch Linux's features. That's why for this tutorial I decided to go with Arch Linux, but you can choose whatever distribution you want as long as you can get the base system somehow without the bootloader or kernel.

~/monerod-microvm$ mkdir tmprootfs
~/monerod-microvm$ sudo mkfs.ext4 monerod-disk.ext4
~/monerod-microvm$ sudo mount monerod-disk.ext4 tmprootfs/

You have now created an ext4 raw disk where you'll be able to install your system. Again, depending on your distribution, you'll need to install all the system libraries, files and directories. For Gentoo, you can extract a stage3 tarball onto the disk and chroot into it. For Arch Linux, the pacstrap tool exists to easily install the base system somewhere:

~/monerod-microvm$ sudo pacstrap tmprootfs/ base apparmor networkmanager nano vim wget curl # add whatever packages you want

Now you need to chroot into your disk to change the root password:

~/monerod-microvm$ sudo arch-chroot tmprootfs/
[root@archlinux]# passwd # change the root password
[root@archlinux]# systemctl enable NetworkManager # enable NetworkManager for later
[root@archlinux]# exit
~/monerod-microvm$ sudo umount tmprootfs/ # unmount the microVM drive

We now have successfully installed Arch Linux base system and some useful utilities in it. You're allowed to say: I'm using Arch btw

Setting up NAT

Our microVM will need internet access to work properly. cloud-hypervisor only supports binding a TAP interface in standard mode (no multi-queue). We'll also need to set up NAT through the host's iptables. While a TAP interface alone could work, I'll also show you how to attach it to a bridge for virtual networking. It'll let you later run other microVMs that have access to the same subnetwork.

Annotation: A Linux bridge is a virtual switch handled internally by the host kernel. It lets you create subnetworks for VMs or containers. TUN/TAP is just a driver for virtual network interfaces. Learn more on this short blog

Single TUN/TAP device

When you just wish to connect one microVM, or simply don't need a virtual network, the setup is pretty straightforward. Create the TAP device:

~/monerod-microvm$ sudo ip tuntap add dev microvm0 mode tap
~/monerod-microvm$ sudo ip addr add 192.168.122.1/24 dev microvm0
~/monerod-microvm$ sudo ip link set dev microvm0 up

We just set up a new TAP interface named microvm0, with IPv4 address 192.168.122.1, that is part of the network 192.168.122.0/24.

Now let's configure iptables:

~/monerod-microvm$ sudo iptables -t nat -A POSTROUTING -o <YOUR_HOST_INTERFACE> -j MASQUERADE
~/monerod-microvm$ sudo iptables -I FORWARD 1 -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
~/monerod-microvm$ sudo iptables -I FORWARD 1 -i microvm0 -o <YOUR_HOST_INTERFACE> -j ACCEPT

Let's break down these simple commands:

The first one configures the NAT table to masquerade the source address of packets leaving through the host interface.
The second one tells the filter to accept forwarded packets that conntrack marks as ESTABLISHED or RELATED.
The third one tells the filter to accept packets going from the microVM to the host interface.

Now your interface is fully working!

Linux bridge for VM subnetwork

If you wish to deploy multiple microVMs to isolate multiple applications, but still want them to be reachable from each other, you'll need a shared virtual network. Linux bridges are the solution to this problem. You can attach TAP interfaces to a bridge and thereby build virtual interfaces linked through it. Let's dive into it:

~/monerod-microvm$ sudo ip link add name isolated0 type bridge
~/monerod-microvm$ sudo ip addr add 192.168.122.1/24 dev isolated0
~/monerod-microvm$ sudo ip link set dev isolated0 up

We just created our Linux bridge named isolated0 with address 192.168.122.1/24. Let's configure iptables exactly like in the example above:

~/monerod-microvm$ sudo iptables -t nat -A POSTROUTING -o <YOUR_HOST_INTERFACE> -j MASQUERADE
~/monerod-microvm$ sudo iptables -I FORWARD 1 -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
~/monerod-microvm$ sudo iptables -I FORWARD 1 -i isolated0 -o <YOUR_HOST_INTERFACE> -j ACCEPT

When you want to add a new VM to this bridge, create a new TAP interface like so:

~/monerod-microvm$ sudo ip tuntap add dev microvm0 mode tap
~/monerod-microvm$ sudo ip link set microvm0 master isolated0
~/monerod-microvm$ sudo ip link set dev microvm0 up

Congratulations!

You have successfully set up networking for your VM. There is no DHCP, so manual addressing will be required later on. Please be aware that the configuration we just did will be lost on the next reboot. Consider making an init script or using nmcli from NetworkManager directly.

Note: Please make sure IPv4 forwarding is enabled in the host kernel parameters. The command sysctl -w net.ipv4.ip_forward=1 enables it until the next reboot.
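
Since these settings vanish on reboot, here is a minimal sketch of what such an init script could look like for the single TAP setup. The run and setup_network helpers, the DRY_RUN and UPLINK variables and the eth0 fallback are all names I made up for illustration; the commands themselves are the ones from this section. By default it only prints what it would do.

```shell
#!/bin/bash
# Re-create the TAP device and NAT rules from this section after a reboot.
# DRY_RUN=1 (the default here) only prints the commands; set DRY_RUN=0 and
# run as root to actually apply them.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Usage: setup_network <tap-device> <host-uplink-interface>
setup_network() {
    local tap="$1" uplink="$2"
    run sysctl -w net.ipv4.ip_forward=1
    run ip tuntap add dev "$tap" mode tap
    run ip addr add 192.168.122.1/24 dev "$tap"
    run ip link set dev "$tap" up
    run iptables -t nat -A POSTROUTING -o "$uplink" -j MASQUERADE
    run iptables -I FORWARD 1 -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
    run iptables -I FORWARD 1 -i "$tap" -o "$uplink" -j ACCEPT
}

# eth0 is only a placeholder: set UPLINK to your real host interface.
setup_network microvm0 "${UPLINK:-eth0}"
```

Once the printed commands look right, something like sudo DRY_RUN=0 UPLINK=<YOUR_HOST_INTERFACE> ./script.sh applies them. For the bridge variant, the address and the FORWARD rule would go on isolated0 instead of the TAP device.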

Advisory: iptables has been deprecated. Consider installing nftables and the iptables compatibility layer iptables-nft on your host!

Setting up Hypervisor

Download cloud-hypervisor

Let's download the latest cloud-hypervisor binary release (you can of course compile it yourself) and make it executable:

~/monerod-microvm$ wget https://github.com/cloud-hypervisor/cloud-hypervisor/releases/download/v38.0/cloud-hypervisor-static -O cloud-hypervisor
~/monerod-microvm$ chmod +x cloud-hypervisor

What is great with cloud-hypervisor (even though Firecracker does it better) is its rootless usage. In case you didn't know, the Linux kernel has long offered capabilities, which let you define what specific binaries or users are allowed to do. For our hypervisor to work, it needs to bind our VM to network interfaces. We'll therefore need to grant the NET_ADMIN capability to the executable:

~/monerod-microvm$ sudo setcap cap_net_admin+ep ./cloud-hypervisor

We now have an unprivileged hypervisor that is still capable of attributing network interfaces to our microVM.

Note: From a security perspective, if cloud-hypervisor is exploited and the attacker sticks to allowed syscalls (there is a seccomp filter), they'll be able to misuse this capability to mess with the host's network. But this is very hard, nearly impossible, to do from the guest, and from the host it would require patching the binary or loading a malicious LD_PRELOAD. To be honest, nearly all software on Linux can be messed with like this. If you really don't trust your host, consider running cloud-hypervisor under its own user and setting up Mandatory Access Control (through SELinux for example).

The configuration file

To start our microVM easily with the correct arguments, let's write a simple shell script, microvm.sh, containing the command and its arguments:

#!/bin/bash
./cloud-hypervisor \
--kernel "/path/to/vmlinux" \
--disk "path=/path/to/monerod-disk.ext4" \
--cpus boot=8 \
--cmdline "console=ttyS0 root=/dev/vda rw panic=-1 damon_reclaim.enabled=Y damon_reclaim.min_age=30000000 sysctl.vm.vfs_cache_pressure=200 damon_reclaim.wmarks_low=0 damon_reclaim.wmarks_high=1000 damon_reclaim.wmarks_mid=1000" \
--balloon "deflate_on_oom=on,free_page_reporting=on" \
--memory "size=6G,hotplug_method=virtio-mem,hotplug_size=512M" \
--net "tap=microvm0,mac=AC:EF:00:00:00:01,ip=192.168.122.2,mask=255.255.255.0"

This sets up a Linux VM that:

directly boots the kernel at /path/to/vmlinux
attaches the drive at /path/to/monerod-disk.ext4
launches with 8 vCPUs and 6GB of RAM
has suitable kernel parameters for memory management
supports free_page_reporting
binds to the TAP interface microvm0

Some of you will say that this is a large VM and really far from micro. That's because it isn't micro yet! monerod needs a large amount of RAM and CPU for the initial sync. You'll need to give it plenty until it has completely downloaded the database. Once that's done, we'll go back and lower the resource allocation to a more reasonable amount.

Launching the VM

Now the VM is ready to be launched. Set the executable bit on the shell script and you should be good to go:

~/monerod-microvm$ chmod +x microvm.sh

And launch it:

~/monerod-microvm$ ./microvm.sh

Post-installation/Guest configuration

If everything goes well, you should have booted up into Arch Linux and successfully logged in. Despite our efforts we still have a lot of things to do.

Network configuration

The first thing to do is to configure networking. This time we're going to use NetworkManager directly:

[root@cloud-hypervisor]# nmcli connection modify eth0 ipv4.addresses 192.168.122.<LAST>/24 ipv4.gateway 192.168.122.1 ipv4.method manual
[root@cloud-hypervisor]# nmcli connection up eth0

With our interface correctly configured, ping 8.8.8.8 should work flawlessly.

DNS server

You need to set your DNS server into /etc/resolv.conf:

# /etc/resolv.conf
# Change the address to whatever resolver you prefer
nameserver 1.1.1.1

Download and install monerod

Note: monerod is available as an official package in the community repository of Arch Linux. However, it has many bugs related to memory allocation. To avoid them, the following steps show you how to manually install the official monerod and configure it almost like the official package. If you don't care about these bugs, just run pacman -Sy monero

Let's download the latest official monerod release (v0.18.3.1 as of writing, v0.18.3.2 will be out soon) and verify its hash:

[root@cloud-hypervisor]# wget https://downloads.getmonero.org/cli/monero-linux-x64-v0.18.3.1.tar.bz2
[root@cloud-hypervisor]# echo "23af572fdfe3459b9ab97e2e9aa7e3c11021c955d6064b801a27d7e8c21ae09d  monero-linux-x64-v0.18.3.1.tar.bz2" | sha256sum -c -
[root@cloud-hypervisor]# tar xvf monero-linux-x64-v0.18.3.1.tar.bz2 && cd monero-x86_64-linux-gnu-v0.18.3.1/

Now we'll need to install the binary into our system:

[root@cloud-hypervisor monero-x86_64-linux-gnu-v0.18.3.1]# install -Dm755 monerod -t /usr/bin

Every time you want to update monerod, you'll need to repeat these steps (replacing the old binary with the new one).
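
Because this download-and-verify routine comes back with every release, it may be handy to wrap the checksum comparison in a small function. This is only a sketch: verify_sha256 is a name I made up, and the archive name and digest are the ones for v0.18.3.1 used above, so both must be updated for each new release.

```shell
#!/bin/bash
# Verify a downloaded file against an expected SHA-256 digest.
# Usage: verify_sha256 <file> <expected-hex-digest>
# Returns 0 when the digest matches, non-zero otherwise.
verify_sha256() {
    local file="$1" expected="$2"
    # sha256sum -c reads "<digest>  <file>" lines (two spaces) from stdin;
    # --status keeps it quiet and only sets the exit code.
    echo "${expected}  ${file}" | sha256sum -c --status -
}

# Example with the release used in this post. It only runs if the archive
# has already been downloaded into the current directory.
archive="monero-linux-x64-v0.18.3.1.tar.bz2"
if [ -f "$archive" ]; then
    if verify_sha256 "$archive" 23af572fdfe3459b9ab97e2e9aa7e3c11021c955d6064b801a27d7e8c21ae09d; then
        echo "checksum OK, safe to extract and install"
    else
        echo "checksum MISMATCH, do not install" >&2
    fi
fi
```

A matching checksum only proves the file is the one the digest describes, so take the digest from the official Monero release announcement rather than from the same place you got the archive.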

Configuring monerod and unit file

For monerod to run on startup, we need to create a dedicated user for the systemd unit to operate:

[root@cloud-hypervisor]# echo "u monero - - /var/lib/monero" > /usr/lib/sysusers.d/monero.conf && chmod 644 /usr/lib/sysusers.d/monero.conf
[root@cloud-hypervisor]# echo "d /var/lib/monero 0770 monero monero - -" > /usr/lib/tmpfiles.d/monero.conf && chmod 644 /usr/lib/tmpfiles.d/monero.conf
[root@cloud-hypervisor]# systemd-sysusers

systemd comes with advanced sandboxing capabilities for the processes it spawns. The default unit file given in the official repository is very permissive, so we can harden it further. Let's write our unit file for monerod, located at /usr/lib/systemd/system/monerod.service:

[Unit]
Description=Monero Full Node
After=network.target

[Service]
User=monero
Group=monero
WorkingDirectory=~
StateDirectory=monero
LogsDirectory=monero

Type=simple
ExecStart=/usr/bin/monerod --config-file /etc/monerod.conf --non-interactive

#Hardening
AmbientCapabilities=
CapabilityBoundingSet=
LockPersonality=true
NoNewPrivileges=True
SecureBits=noroot-locked
PrivateDevices=true
PrivateTmp=true
ProtectClock=true
ProtectControlGroups=true
ProtectHome=true
ProtectHostname=true
ProtectKernelLogs=true
ProtectKernelModules=true
ProtectKernelTunables=true
ProtectProc=invisible
ProtectSystem=true
RestrictAddressFamilies=AF_INET AF_INET6 AF_NETLINK AF_UNIX
RestrictNamespaces=true
RestrictRealtime=true
RestrictSUIDSGID=true
SystemCallArchitectures=native
SystemCallFilter=@system-service
SystemCallErrorNumber=EPERM

Restart=always

[Install]
WantedBy=multi-user.target

[root@cloud-hypervisor]# chmod 644 /usr/lib/systemd/system/monerod.service
[root@cloud-hypervisor]# systemctl enable monerod

By adding these hardening options, you significantly limit what monerod is able to do and mitigate the consequences of an attack. A more hardened profile has been proposed in the past.
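
If you adapt the unit file, it is easy to drop a directive by accident. The following sketch lists which of a handful of the hardening directives used above are absent from a unit file. The missing_hardening function and the selection of directives are my own picks for illustration, not an official checklist.

```shell
#!/bin/bash
# List hardening directives from this post that are absent from a unit file.
# Usage: missing_hardening <unit-file>; prints one missing directive per line.
missing_hardening() {
    local unit_file="$1" directive
    for directive in NoNewPrivileges PrivateDevices PrivateTmp ProtectHome \
                     ProtectSystem ProtectKernelModules ProtectKernelTunables \
                     RestrictNamespaces SystemCallFilter; do
        grep -q "^${directive}=" "$unit_file" || echo "$directive"
    done
}

# Check the installed unit if it exists; no output means nothing is missing.
unit=/usr/lib/systemd/system/monerod.service
if [ -f "$unit" ]; then
    missing_hardening "$unit"
fi
```

For a fuller picture, systemd-analyze security monerod.service prints systemd's own exposure rating for the service.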

Now let's write our configuration file at /etc/monerod.conf:

data-dir=/var/lib/monero/
log-file=/var/log/monero/monero.log
log-level=0

# Add these arguments
# Database
prune-blockchain=1 # We prune the blockchain
sync-pruned-blocks=1 # we allow syncing from pruned blocks

# Network
hide-my-port=1 # We don't advertise our ports to network peers
enable-dns-blocklist=1 # DNS blocklist
pad-transactions=1 # Pad relayed transactions to the next kB to help defend against traffic volume analysis
no-zmq=1 # No need for ZMQ since we aren't mining

# RPC
restricted-rpc=1 # restricted rpc

[root@cloud-hypervisor]# chmod 644 /etc/monerod.conf

I also invite you to configure RPC login and RPC SSL. (I won't cover this in this tutorial)

Start it

To start monerod, you can simply reboot your VM or type systemctl start monerod.

Post-syncing

Once the node is fully synced, which you can check with journalctl or systemctl status monerod, you can go back and edit the resource limits in your cloud-hypervisor shell script. 3GB is actually good enough for a private Monero node, and you aren't obligated to use all your CPU cores.
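
One way to avoid editing the script back and forth is to make the two values overridable. The sketch below is an assumption-laden example: VM_CPUS, VM_MEM and resource_args are hypothetical names, and the defaults of 2 vCPUs and 3G are simply my reading of the advice above. The flags are the same ones already used in microvm.sh.

```shell
#!/bin/bash
# Make the VM size overridable so the same launch script serves both the
# initial sync and day-to-day use. VM_CPUS and VM_MEM are hypothetical
# variable names; the defaults are post-sync values and are an assumption.
VM_CPUS="${VM_CPUS:-2}"
VM_MEM="${VM_MEM:-3G}"

# Print the --cpus and --memory arguments for cloud-hypervisor.
resource_args() {
    echo "--cpus boot=${VM_CPUS} --memory size=${VM_MEM},hotplug_method=virtio-mem,hotplug_size=512M"
}

resource_args
```

In microvm.sh, the --cpus and --memory lines could then be replaced by an unquoted $(resource_args), and a heavier run started with VM_CPUS=8 VM_MEM=6G ./microvm.sh.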

Once you've rebooted, you should be able to connect to your local, fully virtualized Monero node at 192.168.122.<LAST>:18081

Performance and Efficiency

Here's the bad news: you'll quickly see that our Monero node VM is consuming at least 1.4GB. cloud-hypervisor itself only adds 50~200MB of memory overhead, and the Linux kernel alone takes <100MB. This huge memory usage isn't due to monerod, but to Linux. The kernel has a perverse tendency to cache everything in RAM, and considers itself entitled to use all of it… The kernel parameters and features we enabled help mitigate the issue, but this is how the kernel works, and no native feature exists to limit this behavior. Expect the Linux kernel to take at least 50% of the available RAM for caching, even though the total used by programs is near 150MB.
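
To check this on your own guest, you can compare the page cache with the memory that is genuinely unavailable. This sketch parses /proc/meminfo with awk; meminfo_summary and the "unavailable" label (MemTotal minus MemAvailable) are my own naming, not kernel terminology.

```shell
#!/bin/bash
# Summarise how much guest RAM is page cache versus really unavailable.
# Reads a meminfo-style file (default: /proc/meminfo), whose values are in kB.
meminfo_summary() {
    awk '
        $1 == "MemTotal:"     { total = $2 }
        $1 == "MemAvailable:" { avail = $2 }
        $1 == "Cached:"       { cached = $2 }
        END {
            printf "total_mb=%d cached_mb=%d unavailable_mb=%d\n",
                   total / 1024, cached / 1024, (total - avail) / 1024
        }
    ' "${1:-/proc/meminfo}"
}

if [ -r /proc/meminfo ]; then
    meminfo_summary
fi
```

A large cached_mb next to a small unavailable_mb is the behavior described above: the memory is held by the cache rather than by monerod.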

As for CPU performance, the overhead is the same as with QEMU/KVM: expect at most a 10% performance regression.

Final thoughts

If you managed to read all the way down here, thank you. It was really exciting for me to try and discover new things about current virtualization implementations, and I hope it was for you as well. I really hope that in the future we can expect efficiency comparable to OS-level virtualization. At least for now, I'm proud of my server. Here's my current configuration:

<INTERNET>
└── <Firewall microVM|OPNsense> (QEMU/KVM wEDK2)
		└──(Bridge isolated0)
				├── <monero-wallet-rpc MicroVM|Gentoo Hardened/SELinux> (cloud-hypervisor)
				└── <Monerod MicroVM|Gentoo LLVM/SELinux> (cloud-hypervisor)

Memory usage: 4 to 8GB
Storage usage: 200GB

And I plan to add more services, as I can just copy-paste the disk file.

Again thanks for reading and here's a non exhaustive collection of additional insights I had during this experiment:

If monerod were ported to Redox OS, near-native memory efficiency would be possible. But I'm too lazy to read the cookbook and compile it. (But for real, this whole memory issue would be gone and a true microVM would be possible. This is a cry for help.)
It is possible to setup a cron job to periodically drop cached memory in the linux kernel using `echo 3 > /proc/sys/vm/drop_caches`. But it hurts performance significantly.
OPNsense and other BSD distributions don't work under cloud-hypervisor
Libvirt's iptables rules are shit and blocked me for 3 days (5th time I've had issues with libvirt)
If one wishes to run other services more efficiently, they could use one microVM with multiple containers inside it.
Windows Server can run in headless mode, I didn't know that
From a threat model perspective, nftables and virtio are the only attack surface to the host.