I recently landed Project Bardiche into SmartOS. The goal of Bardiche has been to create a more streamlined data path for layer two networking in illumos. While the primary motivator for this was KVM guests, it’s opened up a lot of room for more than just virtual machines. The bulk of this project consists of changes to illumos-joyent; however, there were some minor changes made to smartos-live, illumos-kvm, and illumos-kvm-cmd.
Before we delve into the specifics of the implementation, let’s take a high-level view of what this project has brought to the system. Several of these topics will have their own follow-up blog entries.
The global zone can now see data links for every zone in
libdlpi(3LIB) was extended to be able to access those nics.
snoop(1M) now has a -z option to capture packets on a data link that belongs to a zone.
A new DLPI(7P) primitive
A new DLPI(7P) promiscuous mode was added:
ipfilter can now filter packets for KVM guests.
The IP squeue interface was generalized to allow for multiple consumers in a new module called
The sdev file system was enhanced with a new plugin interface.
A new driver, vnd, was added, along with an associated library and vndstat. This driver provides the new layer two data path.
A new abstraction for sending and receiving framed data called framed I/O.
There’s quite a bit there. The rest of this entry will go into detail on the motivation for this work and a bit more on the new snoop features. Subsequent entries will cover the new vnd architecture and the new DLPI primitives, the new gsqueue interface, shine a bit more light on what gets referred to as the fastpath, and cover the new sdev plugin architecture.
Project Bardiche started from someone asking what it would take to allow a hypervisor-based firewall to filter packets that were being sent from a KVM guest. We wanted the hypervisor to provide the firewall because of the following challenges associated with managing a firewall running in the guest.
While it’s true that practically all the guests that you would run under hardware virtualization have their own firewall software, they’re rarely the same. If we wanted to leverage the firewall built into the guest, we’d need to build an agent that lived in each guest. Not only would we have to write one of these for every type of guest we wanted to manage, but because customers are generally the super-user in their virtual machine (VM), they’d be able to simply kill the agent or change its rules, defeating the API.
While dealing with this, we found several other deficiencies in how networking worked for KVM guests at the time, all rooted in how QEMU, the program that actually runs the VM, interacted with the host networking. For each Ethernet device that was in the guest, there was a corresponding virtual NIC in the host. The two were joined with the vnic back end in QEMU, which originally used libdlpi to bind them. While this worked, there were some problems with it.
Because we had to put the device in promiscuous mode, there was no way to tell it not to send back traffic that came from ourselves. In addition to just being a waste of cycles, this causes duplicate address detection, often performed with IPv6, to fail for many systems.
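To make "traffic that came from ourselves" concrete at the frame level, here is a minimal sketch of the check a consumer would need in order to discard its own looped-back frames. It is only an illustration of the problem, not what QEMU or the new driver does; the function name and constants are hypothetical. It relies only on the standard Ethernet header layout, in which the source address occupies bytes 6 through 11.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MACADDR_LEN       6
#define ETHERNET_HDR_LEN  14  /* destination + source + EtherType */

/*
 * Hypothetical helper: returns 1 if the frame's source address matches
 * own_mac, meaning this is a copy of something we transmitted ourselves.
 * Returns 0 otherwise, including when the frame is too short to hold a
 * full Ethernet header.
 */
int
frame_is_from_self(const uint8_t *frame, size_t len,
    const uint8_t own_mac[MACADDR_LEN])
{
    assert(own_mac != NULL);

    if (frame == NULL || len < ETHERNET_HDR_LEN)
        return (0);

    /* The source address follows the 6-byte destination address. */
    return (memcmp(frame + MACADDR_LEN, own_mac, MACADDR_LEN) == 0);
}
```

Doing this per frame in the consumer is exactly the kind of wasted work described above, since every transmitted frame still has to be copied back up and inspected before it can be dropped.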
In addition, the libdlpi interfaces had no means of reading or writing multiple packets at a time. A lot of these issues stem from the history of the illumos networking stack. When it was first implemented, it was done using STREAMS. Over time, that has changed. In Solaris 10 the entire networking stack was revamped with a project called Fire Engine. That project, among many others, transitioned the stack from a message passing interface to one that used a series of direct calls and serialization queues (squeues). Unfortunately, the way we were using libdlpi left us still using STREAMS.
While exploring the different options for interfacing with the existing firewall, we eventually realized that we needed to create a new interface that solved this and the related problems that we had, and that also laid the foundation for a lot of work that we’d still like to do.
First Stop: Observability Improvements
When I first started this project, I knew that I was going to have to spend a lot of time debugging. As such, I knew that I needed to solve one of the more frustrating aspects of working with KVM networking: the ability to snoop and capture traffic. At Joyent, we always run a KVM instance inside of a zone. This gives us all the benefits of zones: the associated security and resource controls.
However, before this project, data links that belonged to zones were not accessible from the global zone. Because of the design of the KVM branded zone, the only process running is QEMU and you cannot log in, which makes it very hard to pass the data link to snoop or tcpdump. This setup does not make it impossible to debug. One can use DTrace or use snoop on a device in the global zone; however, both of those end up requiring a bit more work or filtering.
The solution to this is to allow the global zone to see the data links for all devices across all zones under /dev/net and then enhance the associated libraries and commands to support accessing the new devices. If you’re in the global zone, there is now a new directory called /dev/net/zone. Don’t worry, this new directory can’t collide with an existing data link: every data link name in the system needs to end with a number, and zone does not. On my development virtual machine, which has a single zone with a single vnic named net0, you’d see the following:
[root@00-0c-29-37-80-28 ~]# find /dev/net | sort
/dev/net
/dev/net/e1000g0
/dev/net/vmwarebr0
/dev/net/zone
/dev/net/zone/79809c3b-6c21-4eee-ba85-b524bcecfdb8
/dev/net/zone/79809c3b-6c21-4eee-ba85-b524bcecfdb8/net0
/dev/net/zone/global
/dev/net/zone/global/e1000g0
/dev/net/zone/global/vmwarebr0
Just as you always have in SmartOS, you’ll still see the data links for your own zone at the top level, for example /dev/net/vmwarebr0. Next, each of the zones on the system, in this case the global zone and the non-global zone named 79809c3b-6c21-4eee-ba85-b524bcecfdb8, shows up in /dev/net/zone. Inside each of those directories are the data links that live in that zone.
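The layout is regular enough to sketch in a few lines of C. The helpers below are hypothetical and are not part of libdlpi or any illumos library. They build the per-zone path from a zone name and a link name, and apply only the one naming rule mentioned above (a data link name ends with a number), which is why a directory called zone can never be mistaken for a data link. Real data link name validation has more rules than this.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/*
 * Simplified check of the single rule the text relies on: a data link
 * name must end with a number. This looks only at the last character.
 */
int
link_name_ends_in_digit(const char *name)
{
    size_t len = strlen(name);

    return (len > 0 && isdigit((unsigned char)name[len - 1]) != 0);
}

/*
 * Build "/dev/net/zone/<zone>/<link>" into buf. Returns 0 on success, or
 * -1 if the link name fails the check above or the path does not fit.
 */
int
zone_link_path(char *buf, size_t buflen, const char *zone, const char *link)
{
    int n;

    assert(buf != NULL && zone != NULL && link != NULL);

    if (!link_name_ends_in_digit(link))
        return (-1);

    n = snprintf(buf, buflen, "/dev/net/zone/%s/%s", zone, link);
    return ((n < 0 || (size_t)n >= buflen) ? -1 : 0);
}
```

Calling zone_link_path() with the zone global and the link e1000g0 yields /dev/net/zone/global/e1000g0, matching the listing above, while passing zone as a link name is rejected.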
The next part of this was exposing this functionality in libdlpi and then using that in snoop. For the moment, I added a private interface called dlpi_open_zone. It’s similar to dlpi_open except that it takes an extra argument for the zone name. Once this change gets up to illumos it’ll then become a public interface that you can and should use. You can view the manual page online here or if you’re on a newer SmartOS box you can run man dlpi_open_zone to see the documentation.
The use of this made its way into snoop in the form of a new option: -z zonename. Specifying a zone with -z will cause snoop to use dlpi_open_zone, which will try to open the data link specified via its -d option from the zone. So if we wanted to watch all the ICMP traffic over the data link that the KVM guest used, we could run:
# snoop -z 79809c3b-6c21-4eee-ba85-b524bcecfdb8 -d net0 icmp
With this, it should now be easier, as an administrator of multiple zones, to observe what’s going on across multiple zones without having to log into them.
There are numerous people who helped this project along the way. The entire Joyent engineering team helped from the early days of bouncing ideas about APIs and interfaces all the way through to the final pieces of review. Dan McDonald, Sebastien Roy, and Richard Lowe all helped review the various write-ups and helped deal with various bits of arcana in the networking stack. Finally, the broader SmartOS community helped go through and provide additional alpha and beta testing.
In the following entries, we’ll take a tour of the new sub-systems ranging from the vnd architecture and framed I/O abstraction through the sdev interfaces.