The series so far
If you’re getting started you’ll want to see the previous entries on Project Bardiche:
The illumos Networking Stack
This blog post is going to dive into more detail about what the ‘fastpath’ is in illumos for networking, what it means, and a bit more about how it works. We’ll also go through and cover a bit more information about some of the additions we made as part of this project. Before we go too much further, let’s take another look at the picture of the networking stack from the entry on architecture of vnd:
+---------+----------+----------+
| libdlpi |  libvnd  | libsocket|
+---------+----------+----------+
|   VFS   |   VFS    |   VFS    |
|         |          +----------+
|         |          |  sockfs  |
+---------+----------+----------+
|         |   VND    |    IP    |
|         +----------+----------+
|            DLD/DLS            |
+-------------------------------+
|              MAC              |
+-------------------------------+
|             GLDv3             |
+-------------------------------+
If you don’t remember what some of these components are, you might want to refresh your memory with the vnd architecture entry. Importantly, almost everything is layered on top of the DLD and DLS modules.
The illumos networking stack comes from a long lineage of technical work done at Sun Microsystems. Initially, the networking stack was implemented using STREAMS, a message-passing interface where message blocks (mblk_t) are sent from one module to the next. For example, there are modules for things like arp, tcp/ip, and udp. These are chained together and can be seen in mdb using the ::stream dcmd. Here's an example from my development zone:
> ::walk dld_str_cache | ::print dld_str_t ds_rq | ::q2stream | ::stream
+-----------------------+-----------------------+
| 0xffffff0251050690 | 0xffffff0251050598 |
| udp | udp |
| | |
| cnt = 0t0 | cnt = 0t0 |
| flg = 0x20204022 | flg = 0x20244032 |
+-----------------------+-----------------------+
| ^
v |
+-----------------------+-----------------------+
| 0xffffff02510523f8 | 0xffffff0251052300 | if: net0
| ip | ip |
| | |
| cnt = 0t0 | cnt = 0t0 |
| flg = 0x00004022 | flg = 0x00004032 |
+-----------------------+-----------------------+
| ^
v |
+-----------------------+-----------------------+
| 0xffffff0250eda158 | 0xffffff0250eda060 |
| vnic | vnic |
| | |
| cnt = 0t0 | cnt = 0t0 |
| flg = 0x00244062 | flg = 0x00204032 |
+-----------------------+-----------------------+
...
If I sent a UDP packet, it would first be processed by the udp STREAMS module, then the ip STREAMS module, and finally it would make its way to the DLD/DLS layer, represented by the vnic entry here. This communication happens in terms of DLPI, the Data Link Provider Interface. DLPI defines several different kinds of messages and responses, which can be found in the illumos source code here. The general specification is available here, though there's a lot more to it than is worth reading. In illumos, it's been distilled down into libdlpi.
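To make the message-passing model concrete, here's a rough sketch of a raw DLPI bind without libdlpi. This is only an illustration: the ack handling is elided, and in practice you'd let libdlpi do all of this for you.

/*
 * Sketch of a raw DLPI bind.  The request is literally a message sent
 * down the stream with putmsg(2) and is answered with a DL_BIND_ACK;
 * reading the ack back with getmsg(2) is elided here.
 */
#include <sys/dlpi.h>
#include <stropts.h>
#include <strings.h>
#include <fcntl.h>

int
bind_ip_sap(const char *dev)
{
	int fd;
	dl_bind_req_t req;
	struct strbuf ctl;

	if ((fd = open(dev, O_RDWR)) < 0)	/* e.g. "/dev/net/net0" */
		return (-1);

	bzero(&req, sizeof (req));
	req.dl_primitive = DL_BIND_REQ;
	req.dl_sap = 0x0800;			/* the IP ethertype */
	req.dl_service_mode = DL_CLDLS;		/* connectionless service */

	ctl.maxlen = ctl.len = sizeof (req);
	ctl.buf = (char *)&req;

	if (putmsg(fd, &ctl, NULL, 0) != 0)
		return (-1);
	/* ... a getmsg(2) loop for the DL_BIND_ACK would go here ... */
	return (fd);
}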
Recall from the vnd architecture entry that devices and drivers communicate with a data link by initially using STREAMS modules and by opening a device in /dev/net/. Each data link in the system is represented by a dls_link_t. When you open a device in /dev/net, you get a dld_str_t, which is an instance of a STREAMS device.
The DLPI allows consumers to bind to what it calls a SAP, or service access point. What this means depends on the kind of data link. In the case of Ethernet, it refers to the ethertype; in other words, a given dld_str_t can be bound to something like IP, ARP, or LLDP. If this were something other than Ethernet, that namespace would be different.
For a given data link, only one dld_str_t can be actively bound to a given SAP (ethertype) at a time. An active bind is one that is actively consuming and sending data. For example, when you create an IP interface using ifconfig or ipadm, that performs an active bind. Another example of an active bind is a daemon used for LLDP. There are also passive binds, used by packet-capture tools like snoop and tcpdump; a passive bind lets a consumer capture data without blocking someone else from using that access point.
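In practice you'd go through libdlpi, which hides the message exchange entirely. Here's a minimal sketch that performs an active bind to the IP ethertype on a hypothetical link named net0; compile with -ldlpi.

/*
 * The same bind through libdlpi.  Passing DLPI_PASSIVE to dlpi_open()
 * is how a passive consumer, in the style of snoop, would open the
 * link instead.
 */
#include <stdio.h>
#include <libdlpi.h>

int
main(void)
{
	dlpi_handle_t dh;
	uint_t bound_sap;
	int ret;

	/* An active open; use DLPI_PASSIVE for a passive consumer. */
	if ((ret = dlpi_open("net0", &dh, 0)) != DLPI_SUCCESS) {
		(void) fprintf(stderr, "open: %s\n", dlpi_strerror(ret));
		return (1);
	}

	/* On Ethernet, the SAP namespace is the ethertype; 0x0800 is IP. */
	if ((ret = dlpi_bind(dh, 0x0800, &bound_sap)) != DLPI_SUCCESS) {
		(void) fprintf(stderr, "bind: %s\n", dlpi_strerror(ret));
		dlpi_close(dh);
		return (1);
	}

	(void) printf("bound to SAP 0x%x\n", bound_sap);
	dlpi_close(dh);
	return (0);
}

If an IP interface were already plumbed on net0, the active bind here would fail, which is exactly the one-active-consumer-per-SAP rule described above; a DLPI_PASSIVE open would still succeed.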
Speeding things up
While the fundamentals of DLPI are sound, the STREAMS implementation, particularly for sending data, left something to be desired. It greatly complicated locking, and it was hard to make it perform well enough to saturate 10 GbE networks with TCP traffic. For all the details on what happened here and a good background, I'll refer you to Sunay Tripathi's blog, where he covers much of what changed in Solaris 10 to fix this.
There are two parts to what folks generally end up calling the 'IP fastpath'. We leverage one part for vnd; the other is still firmly used by IP. We'll touch on the first part here, which eliminates sending STREAMS messages in favor of direct callbacks. Today this happens by negotiating with DLPI messages that discover a device's capabilities and then enable them. Both the vnd driver and the ip driver do this. Specifically, you first send down a DL_CAPABILITY_REQ message; the response contains a list of the capabilities that exist.
If the DL_CAPAB_DLD capability is returned, then you can enable direct function calls to the DLD and DLS layer. The returned values give you a function pointer, which you can use to do several things, ultimately requesting that DLD_CAPAB_DIRECT be enabled. When you make the call to enable it, you specify a function pointer for DLD to call directly when a packet is received. In exchange, you get back a series of functions for things like checking flow control and transmitting a packet. These functions let the system bypass the issues with STREAMS and hand packets along directly.
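Here's a simplified kernel-side sketch of what that enable step looks like, modeled on what the ip and vnd drivers do. The type and constant names come from sys/dld.h, but the parsing of the DL_CAPABILITY_ACK that yields the capability function pointer is elided, and the receive callback signature should be treated as an approximation.

/*
 * Simplified sketch of enabling DLD_CAPAB_DIRECT.  The capability
 * function and its handle come from the DL_CAPAB_DLD portion of the
 * DL_CAPABILITY_ACK (parsing elided).
 */
#include <sys/mac.h>
#include <sys/dld.h>

/* Called directly by DLD with a chain of received packets. */
static void
my_rx(void *arg, mac_resource_handle_t mrh, mblk_t *mp,
    mac_header_info_t *mhip)
{
	/* ... process the mblk_t chain without any STREAMS messages ... */
}

static int
enable_direct(dld_capab_func_t capab, void *handle, void *my_state)
{
	dld_capab_direct_t direct;

	bzero(&direct, sizeof (direct));
	direct.di_rx_cf = (uintptr_t)my_rx;	/* our receive entry point */
	direct.di_rx_ch = my_state;		/* first argument to my_rx */

	/*
	 * On success, DLD fills in di_tx_df/di_tx_dh (direct transmit)
	 * and di_tx_fctl_df/di_tx_fctl_dh (flow-control checks), which
	 * a consumer stashes and calls instead of using putnext(9F).
	 */
	return (capab(handle, DLD_CAPAB_DIRECT, &direct, DLD_ENABLE));
}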
The second part of the 'IP fastpath' is used primarily by the IP module. In the IP module there is the notion of a neighbor cache entry, or nce, which describes how to reach another host. When that host is found, the nce asks the lower layers of the stack to generate a layer-two header appropriate for this traffic. In the case of an Ethernet device, this means generating the MAC header: the source and destination MAC addresses, the ethertype, and a VLAN tag if there should be one. The IP stack then uses this pre-generated header each time, rather than creating a new one from scratch for every packet. In addition, the IP module subscribes to change events that are generated when something like a MAC address changes, so that it can regenerate these headers when the administrator changes the system.
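The following is purely illustrative, not the ip module's actual code; the struct and function names are hypothetical. It shows why the cached header saves work: transmit becomes one allocation and copy, with no address lookups per packet.

/*
 * Hypothetical sketch of a cached, pre-generated L2 header being
 * prepended to an outgoing packet.
 */
#include <sys/types.h>
#include <sys/systm.h>
#include <sys/stream.h>

typedef struct cached_l2 {
	uchar_t	cl2_hdr[18];	/* dst/src MAC, optional VLAN tag, type */
	size_t	cl2_len;	/* length actually generated */
} cached_l2_t;

static mblk_t *
prepend_cached_header(cached_l2_t *cl2, mblk_t *payload)
{
	mblk_t *hmp;

	/* Copy the pre-built header instead of rebuilding it. */
	if ((hmp = allocb(cl2->cl2_len, BPRI_HI)) == NULL)
		return (NULL);
	bcopy(cl2->cl2_hdr, hmp->b_wptr, cl2->cl2_len);
	hmp->b_wptr += cl2->cl2_len;
	hmp->b_cont = payload;
	return (hmp);
}

In this scheme, a MAC address or VLAN change event would simply regenerate cl2_hdr, and every subsequent packet picks up the new header.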
New Additions
Finally, it's worth taking a little bit of time to talk about the new DLPI additions we made as part of Project Bardiche. We needed to solve two problems. Specifically:
- We needed a way for a consumer to claim exclusive access to a data link, not just a single ethertype
- We needed a way to tell the system, when using promiscuous mode, not to loop back the packets we sent
To solve the first case, we added a new request called DL_EXCLUSIVE_REQ. This adds a new mode to the bind state of the dld_str_t: in addition to being active or passive, it can now be exclusive. Exclusive access can only be requested if no one is actively using the device; if someone is, for example because an IP interface has already been created, the DL_EXCLUSIVE_REQ will fail. The opposite is true as well: if someone is using the dld_str_t in exclusive mode, then a request to bind to the IP ethertype will fail. The exclusive claim lasts until the consumer closes the dld_str_t.
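As a way to summarize the rule, here's a hypothetical sketch; the real checks live in the dld module and use different names.

#include <sys/types.h>

typedef enum bind_mode {
	BM_NONE,	/* nothing bound on the link */
	BM_PASSIVE,	/* e.g. snoop; never conflicts */
	BM_ACTIVE,	/* e.g. a plumbed IP interface */
	BM_EXCLUSIVE	/* e.g. a vnd device owns the link */
} bind_mode_t;

/* DL_EXCLUSIVE_REQ fails if anyone is already an active consumer... */
static boolean_t
exclusive_ok(bind_mode_t cur)
{
	return (cur != BM_ACTIVE && cur != BM_EXCLUSIVE);
}

/* ...and an active bind fails while an exclusive claim is in place. */
static boolean_t
active_bind_ok(bind_mode_t cur)
{
	return (cur != BM_EXCLUSIVE);
}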
When a vnd device is created, it makes an explicit request for exclusive access to the device, because it needs to send and receive on all of the different ethertypes. If an IP interface is already active, it doesn’t make sense for a vnd device to be created there. Once the vnd device is destroyed, then anything can use the data link.
Solving our second problem was actually quite simple. The core logic to avoid looping back transmitted packets was already present in the MAC layer. To reach it, we added a new promiscuous option that can be specified in the DLPI DL_PROMISCON_REQ, called DL_PROMISC_RX_ONLY. Enabling it passes the flag MAC_PROMISC_FLAGS_NO_TX_LOOP down to the MAC layer, which does the heavy lifting of duplicating only the necessary packets.
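Here's a minimal user-space sketch of how a consumer might turn this on. It assumes a link named net0 and that the new level can be passed straight through libdlpi's dlpi_promiscon(), which issues a DL_PROMISCON_REQ with the given level; compile with -ldlpi.

/*
 * Sketch: receive everything on the wire, but ask not to see loopback
 * copies of our own transmits.
 */
#include <libdlpi.h>

int
main(void)
{
	dlpi_handle_t dh;

	if (dlpi_open("net0", &dh, DLPI_RAW) != DLPI_SUCCESS)
		return (1);

	/* Classic promiscuous mode: see all traffic on the link... */
	if (dlpi_promiscon(dh, DL_PROMISC_PHYS) != DLPI_SUCCESS ||
	    /* ...minus copies of the packets we sent ourselves. */
	    dlpi_promiscon(dh, DL_PROMISC_RX_ONLY) != DLPI_SUCCESS) {
		dlpi_close(dh);
		return (1);
	}

	/* ... a dlpi_recv() loop would go here ... */
	dlpi_close(dh);
	return (0);
}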
Conclusion
This gives a rather rough introduction to the fastpath in the illumos networking stack. The devil, as always, is in the details.
In the next entries, we’ll go over the other new extensions that were added as part of this work: the sdev plugin interface and generalized serialization queues. Finally, we’ll finish with a recap and go over what’s next.