Tales from a Core File

Close this search box.

USB devices have been a mainstay of extending x86 systems for some time now. At Joyent, we used USB keys to contain our own version of iPXE to boot. As part of discussions around RFD 77 Hardware-backed per-zone crypto tokens with Alex Wilson we talked about knowing and constricting which USB devices were trusted based on whether or not they were plugged into an internal USB port or external USB port.

While this wasn’t the first time that this idea had come up, by the time I started working on ideas on improving data center management, having better USB topology ended up on the list of problems I wanted to solve in RFD 89 Project Tiresias. Though at that point, how it was going to work was still a bit of an unknown.

The rest of this blog entry will focus on giving a bit of background on how USB works, some of the building blocks used for topology, examples of how we use the topology information, and then how to flesh it out for a new system.

USB Background

While USB, the Universal Serial Bus, is rather ubiquitous, some of its underlying implementation may not be. This section describes USBv1, v2, and v3 devices. The USBv4 spec is basically Thunderbolt, which is a different enough beast that I don’t want to lump it into here.

Every USB device is plugged into a device called a hub. Each hub consists of one or more ports and may itself be plugged into another hub. In this manner, you can think of USB like a tree. When you get to the root of this tree, you reach what is called the root hub. The root hub is often a bit different from the other hubs — it bridges USB to the rest of the system. Most USB root hubs are either built into the platform’s chipsets or they are on external PCI express cards. The operating system interfaces with these devices generally using standards like xHCI, the eXtensible Host Controller Interface (they already used the E for EHCI – the Enhanced Host Controller Interface).

USB 2.0 and USB 3.x Compatibility

There are many versions of USB devices out on the market; however, even newer devices work on older systems (if you can find the right kind of plug). The secret to this is that every USB 3.x port must support USB 2.0. The way this works is that a USB 3.x port has the wiring for both USB 3.x and USB 2.0 at the same time. In general, this has been a good thing. It means that older devices will work on newer systems and newer devices will work on older systems albeit not always at the maximum speed that they support. However, this does make our life a bit more complicated when it comes to topology.

While a single physical port can support both USB 3.x and USB 2.0, to the operating system the one physical port shows up as two different logical ports on the host controller. Generally, a device will select either USB 3.x or USB 2.0 signaling based on what they support and therefore it will only show up on one of the two logical ports. However, when it comes to topology, the user cares only about the fact that it’s in a given physical port, they don’t (generally speaking) care about the fact that there are multiple logical ports.

USB hubs, which allow for more devices to exist, are an exception to the rule of only using USB 3.x or USB 2.0 signaling. A USB 3.x hub is actually two hubs in one! When a USB 3.x hub is plugged into a port that supports USB 3.x, it will enumerate as two different hubs: one on the USB 2.0 logical port and one on the USB 3.x logical port. This means that the OS will actually see two distinct USB Hubs that it will enumerate and manage independently.

Ultimately, these are all good properties for USB devices to have. It does mean that we have to do a bit more work to map everything together, but that’s fine — that’s our job.

Multiple Host Controllers

The picture we painted above is a nice one, but doesn’t reflect all systems. One of the challenges of USB 3.0 support was that it introduced a new host controller interface: xhci replaced ehci. Now, to help with the transition, Intel produced a number of chipsets that had both xhci and ehci controllers on them. When the system booted up, all of the USB ports would be directed towards the ehci controller. However, on these platforms an xhci device driver could write to a special register which would result in rerouting all of the ports from the ehci controller to the xhci controller.

This allowed operating systems which didn’t support xhci to still have working USB. On Intel platforms, this duality was removed with Intel’s Intel’s Skylake chipsets.

From topology’s perspective, this means that the same physical port could show up not just as two different ports on the same controller, but actually as multiple, disjoint ports on different controllers!

Companion Controllers

With USB 3.x, a single host controller can support USB 3.x, 2.0, and 1.x devices. However, before USB 3.0 this wasn’t the case. Instead, platforms placed what was called a ‘companion controller’ on the motherboard. The basic idea was that the USB 2.0 ports were wired up to one controller and the other ports were wired up to a companion USB 1.0/USB 1.1 controller (ohci or uhci).

The companion controller model required the various drivers to be aware of this reality and trade things back and forth between them. Folding them together in xhci made things simpler. From a topology perspective, this can result in the same problem hat we have in the pre-Skylake USB 3.0 supporting systems — a given physical port can show up under multiple distinct devices.

USB Descriptors and Capabilities

Information about USB devices is broken down into two different groups of information:

  1. Descriptors
  2. Capabilities

Descriptors are used to identify information about the device such as the manufacturer ID, the device ID, the USB revision the device supports, etc.. There are descriptors which identify characteristics about a shared class of devices and others which identify information about different configurations that the device supports. For the purposes of USB topology, we primarily care about the device descriptor.

USB capabilities are stored in what’s called the binary object store. Capabilities first showed up in the USB 3.0 specification (though they appeared first in the briefly used Wireless USB specification). These capabilities are required for devices and generally describe USB-wide aspects of the device.

Topology Building Blocks

USB Topology is complicated for a few different reasons. The first is the fact a single physical port can show up as two different logical ports to the operating system. The second challenge is actually figuring out what all the ports are used for — if they’re used at all.

This second problem deserves a bit more explanation. Most systems have USB support from their platform chipset. The platform chipset implements a number of USB ports. For example, let’s look at one of the Intel 300 series chipsets, the Z390. If you look at the I/O specifications, it lists that it supports 14 USB ports, all of which can support USB 2.0 and some of which can support different forms of USB 3.1. Now, the standard system doesn’t actually have 14 USB ports all wired up, only a subset of them. Even the mobile chipsets are the same and I certainly don’t have 14 USB ports all over my laptop. This means that there are ports that the OS can see, but may not be used or wired up at all. Or they may be wired up to an internal hub.

While this is a challenge, it is surmountable. We’ll talk about a few of the things we can use to map ports together and a few things that also don’t work for us.


ACPI, the Advanced Configuration and Power Interface, provides a multitude of different capabilities to the system. However, there are a few that are specific to USB that are useful for us.

In ACPI, there is a notion of a tree of devices. Every device in the tree has properties and methods that the operating system can read and invoke which are provided by the platform firmware. When the operating system is looking for devices, it searches for ACPI devices in the same way it might search for PCI devices. In this tree, we’ll find three different relevant items: PCI devices, USB hubs, and USB ports. A USB host controller will be represented as a PCI device and it will have a child device, which is a USB hub, representing the root hub that the operating system sees. The hub will then have a port entry that corresponds to each logical port that the USB device has. If the platform has other hubs built into it (not ones that a user a plugs in), then they might also be represented in the ACPI tree.

For each port in the tree, there are three attributes that we care about. The first is the _ADR method. It is a generic ACPI method that determines the address of a given object. The type of address will vary based on the type of the device. A PCI device would have its device and function number while a SATA device would have the port number. In the case of a USB port, it gives us the port number on the hub which corresponds to the logical view that the operating system will see. This gives us a way of correlating the ACPI port objects with the ports that the operating system sees.

The next thing that we use from ACPI is the optional _UPC method, which is used to return the USB port capabilities. It tells us two different types of information:

  1. Whether a device may be connected to the port or not.
  2. The type of USB port. For example, whether it’s a Type A or Type C connector.

The next piece that we use is the _PLD method, which is the physical location of device. This method returns a binary description of the physical device information. It includes some information like the panel, orientation, and more. While theoretically useful, in practice, the binary payload makes it hard to really deterministically say something useful about the layout of the ports.

Now, you may ask if it’s hard to make something sensible about it, then why bother using it at all. The answer to that lies in the xHCI specification. The xHCI specification says that if you want to map two physical ports on an xhci controller together — such as a USB 2 port and its corresponding USB 3 port, then you can actually use the physical location of device information to map them together. If two ports, have the same panel, horizontal and vertical position, shape, group orientation, position, and token, then they are the same port. This only works across a single controller. Unfortunately, you cannot map two ports together on different controllers this way.

Exposing Information

For each USB root controller and its corresponding ports, we end up creating a logical device node for it in the devices tree. ACPI devices are rooted under the fw node. Each USB root hub shows up under it with a way to map it to its corresponding PCI device. Under each hub is a port, and if there’s a hub under that, then another USB hub and its ports.

Each port has a series of properties that correspond to the various ACPI methods discussed above. More specifically, we have the following properties:

  1. acpi-address: The value of the address found through the _ADR method.
  2. acpi-physical-location: A byte array that corresponds to the raw ACPI values. The kernel does not try to interpret the data and instead that is done in user land.
  3. usb-port-connectable: A property that if present indicates that the port is connectable.
  4. usb-port-type: A property that indicates the ACPI USB port type.

These properties can all be read with the libdevinfo(3LIB) library, which allows software to take a snapshot of the tree, walk the various nodes, and read the properties of the different nodes.

A Building Block Not Taken: SMBIOS

SMBIOS, is the system management BIOS. It provides tables of static information about the system. For example, lists of CPUs, memory devices, and more. One of the things I’ve enjoyed doing in illumos is keeping this information up to date and improving it as new releases of the specification come out. It’s proven to be invaluable for other efforts like labeling PCIe devices.

This time though, I mention SMBIOS because it’s something that one might normally think to use, but actually doesn’t work. One of the SMBIOS tables is a list of ports and what they connect to. Unfortunately, the SMBIOS tables usually refer to USB ports based on the headers on the motherboard. While this can be useful for some cases, it isn’t for what we care about — mapping ports that you plug devices into back to the corresponding ports that the operating systems see.

Enumerating USB Topology

With the different building blocks in place, let’s turn directions and now look at how we expose all of this information in the FMA topology trees. We’ll first look at what we expose and then we’ll come back and explain how that’s put together in FMA. We enumerate USB ports in FMA’s topology in three different groups:

  1. USB ports that we know correspond to the chassis
  2. USB ports that come from a PCIe add on card.
  3. All the remaining USB ports, which we place off of the motherboard.

For each port, we list the following information:

  1. The USB revisions the port supports, such as 2.0, 3.0, etc.
  2. The type of the port if we have ACPI information or a metadata file to tell us about it.
  3. Whether we consider the port connectable, visible, or disconnected.
  4. Information about whether we consider the port internal or external, if we have explicit metadata.
  5. A list of all of the logical ports this physical port represents. For example, if a port is wired up to an ehci controller and two ports on an xhci controller (one for USB 2.0 and one for USB 3.x), then we’ll list three different children here.
  6. A label that describes the port, if metadata provides it. This is a string that a person uses to know how to identify a port. For example, ‘Rear Upper Left USB’.

The following is an example of what a single USB port node might look like in fmtopo:

  group: protocol                       version: 1   stability: Private/Private
    resource          fmri      hc://:product-id=Joyent-S10G5:server-id=magma:chassis-id=S287161X8300740/chassis=0/port=0
    FRU               fmri      hc://:product-id=Joyent-S10G5:server-id=magma:chassis-id=S287161X8300740/chassis=0
    label             string    Rear Upper Left USB
  group: authority                      version: 1   stability: Private/Private
    product-id        string    Joyent-S10G5
    chassis-id        string    S287161X8300740
    server-id         string    magma
  group: port                           version: 1   stability: Private/Private
    type              string    usb
  group: usb-port                       version: 1   stability: Private/Private
    port-type         string    USB 3 Standard-A connector
    usb-versions      string[]  [ "2.0" "3.0" ]
    port-attributes   string[]  [ "user-visible" "external-port" ]
    logical-ports     string[]  [ "xhci0@2" "xhci0@18" ]

Under a port, if a USB device is plugged in, we’ll list information about the device. This includes:

  1. The USB revision of the device. For example, this could be 1.1, 2.0, 2.1, 3.0, 3.1, 3.2, etc..
  2. The numeric vendor and device identifiers, which are used to inform the system about the device so the right driver can be attached.
  3. The revision ID of the device. This is a vendor-specific name.
  4. The device’s USB vendor and product name strings, if it provides them.
  5. The USB device’s serial number, if it has one.
  6. The speed of the device, for example super-speed, full-speed, etc.. These represent the type of protocol speed that the system has.

Next, we’ll create a set of properties that describe the driver that’s attached to the device, if any. This is a standard property group that you’ll find on other nodes in the tree, such as a PCIe device. This includes:

  1. The name of the driver.
  2. The instance of the driver (a logical construct in the OS).
  3. The path of the driver in /devices.
  4. The module information for the driver, such as its FMRI (fault management resource identifier).

Here’s an example of the information for a device itself in fmtopo:

  group: protocol                       version: 1   stability: Private/Private
    resource          fmri      hc://:product-id=Joyent-S10G5:server-id=magma:chassis-id=S287161X8300740:serial=00241D8CE563C1B1E94FEBB4:part=DataTraveler-2.0:revision=100/motherboard=0/port=5/usb-device=0
    FRU               fmri      hc://:product-id=Joyent-S10G5:server-id=magma:chassis-id=S287161X8300740:serial=00241D8CE563C1B1E94FEBB4:part=DataTraveler-2.0:revision=100/motherboard=0/port=5/usb-device=0
    label             string    Internal USB
  group: authority                      version: 1   stability: Private/Private
    product-id        string    Joyent-S10G5
    chassis-id        string    S287161X8300740
    server-id         string    magma
  group: usb-properties                 version: 1   stability: Private/Private
    usb-port          uint32    0xa
    usb-vendor-id     int32     2352
    usb-product-id    int32     25924
    usb-revision-id   string    100
    usb-version       string    2.0
    usb-vendor-name   string    Kingston
    usb-product-name  string    DataTraveler 2.0
    usb-serialno      string    00241D8CE563C1B1E94FEBB4
    usb-speed         string    high-speed
  group: io                             version: 1   stability: Private/Private
    driver            string    scsa2usb
    instance          uint32    0x0
    devfs-path        string    /pci@0,0/pci15d9,981@14/storage@a
    module            fmri      mod:///mod-name=scsa2usb/mod-id=110
  group: binding                        version: 1   stability: Private/Private
    occupant-path     string    /pci@0,0/pci15d9,981@14/storage@a/disk@0,0

Finally, based on the kind of device we encounter, we might enumerate children nodes. Right now there are two cases that we’ll enumerate child nodes:

  1. If we encounter a hub and we have found its children, then we’ll enumerate them as we described above and any USB devices that we find under them.
  2. If we encounter a USB device that represents a disk, like an external hard drive or USB key, then we’ll set some additional properties on the node and call into the disk enumerator. This’ll create a disk node that can be used to map disk information back to the physical device.

Other Uses

While having the items in the tree makes it easier for us to see everything in the system, once we have location information and serial numbers, there are other ways we can use this. The tool diskinfo lists disks and, when we have it, the physical location of them and their serial number. For example, when using SATA and SAS drives, this can tell you which drive bay they’re in, whether it’s a front or rear drive, and more.

To get this information, the diskinfo program takes a snapshot of the system topology and then maps the discovered disks to the corresponding disk nodes in the system’s topology. We’ve done the same for these USB devices. So if we have topology information, we can tell you which USB device it is that’s plugged in. For example:

# diskinfo -P
DISK                    VID      PID              SERIAL               FLT LOC LOCATION
c1t0d0                  Kingston DataTraveler 2.0 00241D8CE563C1B1E94FEBB4 -   -   Internal USB
c2t5000CCA0496FCA6Dd0   HGST     HUSMH8010BSS204  0HWZGX6A             no  no  [0] Slot00
c2t5000CCA25319F125d0   HGST     HUH721212AL4200  8DGG88SZ             no  no  [0] Slot01
c2t5000CCA25318BE1Dd0   HGST     HUH721212AL4200  8DGELUWZ             no  no  [0] Slot02
c2t5000CCA2530F9CD5d0   HGST     HUH721212AL4200  8DG8L5JZ             no  no  [0] Slot03
c2t5000CCA25318BE15d0   HGST     HUH721212AL4200  8DGELUUZ             no  no  [0] Slot04

Constructing Topology

Now that we’ve talked about how we use topology and most of the operating system building blocks, it’s worth spending some time talking about how we actually build the USB topology itself.

We gather data from three different sources:

  1. A USB topology metadata file.
  2. Walking the devices tree looking for USB root hubs and their children (non-ACPI).
  3. Walking the ACPI firmware tree, looking for USB information.

Once we gather information from all three of these sources, we combine them all together to create a single, coherent map. We first map an ACPI node to its corresponding devcfg node. Then, if we’ve opted to map ports together based on ACPI (more on that in a little bit), then we’ll combine the different logical nodes together.

USB Metadata File

The topology USB metadata file allows us to create a per-vendor, per-product map of additional information. The file is a simple format that has keywords and arguments.

A file first identifies a given port. From a port, it will then provide additional metadata such as a label and whether or not it is internal or external. Next, if we need to override the ACPI port type either because it’s missing or incorrect, then we can do so here. Finally, a series of ACPI paths that describe the port are listed. This way, when a port has a USB 2.0 and a USB 3.x component, because we’ve listed both, we’ll be able to apply this metadata to either port.

Finally, there are a number of top-level directives. These describe the matching behavior that we’d like to use. We can do the following:

  1. Disable the use of ACPI entirely on this platform.
  2. Disable the use of ACPI matching. We’ve done this on platforms where we’ve determined that the ACPI information that the platform has is incorrect.
  3. We can enable matching based on the metadata information. This is useful in tandem with the above. Here, we use the ACPI paths to perform matching the same way that we did elsewhere.

Here’s a portion of a USB metadata file:

                Rear Lower Right USB

                Internal USB

This example has two ports present. Each port has a label which is used to identify where the port is for a human. The ‘internal’ and ‘external’ keywords are used to indicate whether the port is internal to the system or external. In this case, the ‘internal’ port is found on the motherboard of the system. So it cannot be serviced or used without opening the system. The ‘chassis’ label indicates that this port is found on the chassis of the system itself. This is where most USB ports are that a user would find and use.

The port-type here indicates that they are USB 3 Type-A connectors, meaning that they support both USB 2.0 and USB 3.x. Finally, the various ‘acpi-path’ entries are used to indicate the ACPI path towards the port. Note how the ports are labeled based on the names of the ACPI device nodes. Each ‘.’ separates each node. The starting ‘\’ character is just part of the constructed path, it is not an escape character.

Writing Your Own Map

The way I’ve done this for other platforms is finding a USB 2.0 and USB 3.0 port and plugging it into each port subsequently. At each point, I look at the information in FMA and in the devices tree with prtconf. By looking at the port numbers and what exists in the devices tree, one can, with a bit of manual work, piece together what’s required.

One challenge with doing this is when you’re on a system that has both the ehci and xhci controllers. Generally this is on Intel platforms from Sandy Bridge through Broadwell. In that case, you need to go into the BIOS and do this with xhci enabled and disabled in the BIOS. This will make sure that you can get all the ports connected to the ehci controller.

Further Reading

If you’re interested in the illumos implementation:

For more on the specifications mentioned:

  • The xHCI specification, currently revision 1.2.
  • The SMBIOS specification, currently revision 3.2.
  • The ACPI specification, revision 6.3.

Looking Ahead

While we’ve done some work, there’s more that we can do to improve the situation here in the future. This discusses some of those future directions.

Container ID Capability

The Container ID capability is the first tool at our disposal. The container ID is a 128-bit UUID — a universally unique identifier. All USB 3.x hubs are required to implement this capability and it can be read from the USB binary object store.

The idea behind this capability is that a device will have the same container ID value regardless of the type of bus that they’re on. So even though a USB 3.x hub will appear as two distinct hubs to the operating system, if they’re the same device, they’ll have the same container ID UUID. This gives the operating system a way to map such devices together.

When the USB Container ID capability is found in the binary object store, we add a property to the device node that indicates the UUID. This translates into a 16-byte byte array on the node whose value is the UUID. With this in mind, the USB topology plugin could go ahead and find hubs with matching container IDs.

More Consumers of USB Topology

While we’ve enhanced FMA and some of the tools like diskinfo(1M), we can do more here. For example, tools like cfgadm(1M) could be enhanced to query topology information when available listing devices in verbose

If we have a mapping that we feel confident in, it could even make sense to add another alias under /dev to the device. Though these labels aren’t necessarily stable right now (as they’re meant for humans), so we’ll have to see what makes sense there.

Easier Tools to Build Topology Maps

Right now, it can take a bit of effort to build a topology map. It would be great if we had easy tooling for developing USB topology maps for different platforms that would walk someone through putting this together. It would also be useful if we had a way for a user to generate a topology for their system. That way, even if it’s something custom that’s been put together, it still isn’t too hard to put together a topology for their system.

What’s Next?

There’s a lot more to talk about with USB, topology, and hardware in general. If you’re interested in working on any of these aspects, reach out. I’m sure there’ll be more to do here as we have to deal with USB 3.2, Thunderbolt, and USB 4.0.

If you’d like to get involved, get in touch with the illumos community on IRC in #illumos on Freenode or a mailing list and I or someone else will help you out and see what we can do. As long as you’re willing to learn, receive feedback, and keep going despite difficulties, then it doesn’t matter what your experience is.

One of the stories that has stuck with me over the years came from a support case that a former colleague, Ryan Nelson, had point on. At Joyent, we had third parties running our cloud orchestration software in their own data centers with hardware that they had acquired and assembled themselves. In this particular episode, Ryan was diagnosing a case where a customer was complaining about the fact that networking wasn’t working for them. The operating system saw the link as down, but the customer insisted it was plugged into the switch and that a transceiver was plugged in. Eventually, Ryan asked them to take a picture of the back of the server, which is where the NIC (Network Interface Card) would be visible. It turned out that the transceiver looked like it had been run over by a truck and had been jammed in — it didn’t matter what NIC it was plugged into, it was never going to work.

As part of a broader push on datacenter management, I was thinking about this story and some questions that had often come up in the field regarding why the NIC said the link was down. These were:

  1. Was there actually a transceiver plugged into the NIC?
  2. If so, did the NIC actually support using this transceiver?

Now, the second question is a bit of a funny one. The NIC obviously knows whether or not it can use what’s plugged in, but almost every time, the system doesn’t actually make it easy to find out. A lot of NIC drivers will emit a message that goes to a system log when the transceiver is plugged in or the NIC driver first attaches, but if you’re not looking for that message or just don’t happen to be on the system’s console when that happens, suddenly you’re out of luck. You might also ask why are there transceivers that aren’t supported by a NIC, but that’s a real can of worms.

Anyways, with that all in mind, I set out on a bit of a journey and put together some more concrete proposals for what to do here in terms of RFD 89: Project Tiresias. We’ll spend the rest of this entry going into a bit of background on transceivers and then discuss how we go from knowing whether or not they’re plugged in to actually determining who made them and where they are in the system.

What is a transceiver?

We’ve been using the term 'transceiver' quite a bit so far, but that’s a pretty generic term. Let’s spend a bit of time talking through that. First, we’re really focused on transceivers as used in the context of networking. When people think of wired networking, the most common thing that comes to mind are Ethernet Cables. Ethernet isn’t the only type of cable that’s been used. Before Ethernet was common, BNC coaxial cables were used on some NICs as well.

However, in the data center, Ethernet didn’t end up keeping up with the speeds and distances that connections were being use for (though 10 Gigabit Ethernet, 10GBASE-T, has started becoming more common). In this space, fiber-optic cables and copper twinaxial cables (twinax) are much more prominent. Note, twinaxial cables are rather different from their BNC coaxial relatives. Coaxial cables are used more often when there are shorter distances to cover, such as between a top of rack switch and a server. Fiber optic cables often cover longer distances or have higher throughputs.

Because different types of cables are used in different situations, several vendors got together and agreed upon a set of standards to use when manufacturing these cables. This allowed NIC manufacturers to design a single physical and electrical interface, but still support different types of transceivers. These standards (technically a multi-source agreement) are maintained by the Small Form Factor (SFF) Committee. The committee manages standards not only for networking, but also for SAS cables and other devices.

If you’ve worked in this space, you may have heard of what are called SFP and SFP+ cables. These cables generally support 1 and 10 Gigabit networking respectively. The transceiver is controlled over an i2c bus by the NIC. The addresses and their meanings are standardized. They were originally standardized in a standard called INF-8074, but the current active standard for these devices is called SFF-8472.

With faster networking speeds, there have been additional revisions and standards put out. Devices that support 40 Gigabit networking are called QSFP+ because they combine 4 SFP+ devices. To support 25 Gigabit networking, a variant of SFP+ was created called SFP28. Finally, to support 100 Gigabit networking, they combined 4 SFP28 devices together. The 40 Gigabit devices are standardized in SFF-8436 and the 100 Gigabit have their management interface described in SFF-8636.

The standards for various devices have somewhat similar layouts. They break data into a series of different pages of which a specific offset into the page can then be accessed via the NIC’s i2c bus. These pages contain some of the following information:

  • Control over the device and its configuration

  • Static manufacturing information such as the manufacturer’s name, the
    device’s name, the serial number, and more

  • Optional information about the health of the device such as the
    temperature, voltage, and more

The pages and addresses change from specification to specification, though a large amount of the data overlaps between them. The health information of the device is required when the connector is considered active (generally fiber-optic cables with lasers) and is optional when you have a passive device (such as a copper twinax cable).

The MAC Transceiver Capability

The first part of our adventure with getting to this data begins in the operating system kernel. Similar to the case of managing NIC LEDs, the networking device driver framework has an optional capability that a driver can implement to expose this information called MAC_CAPAB_TRANSCEIVER.

The transceiver capability has a structure that the device driver has to fill out which describes some basic information for dealing with transceivers. This includes the following fields:

  • The number of transceivers present on the device.
  • A function, mct_info() that allows one to get basic information about the transceiver.
  • A function, mct_read() that allows one to read i2c data from the device.

The driver first indicates the number of transceivers that are present for it. In general, this is one. However some devices actually support combining multiple transceivers and ports into one logical device — though this isn’t commonly used. The next item, the mct_info() function, is used to answer the two questions that were posed at the beginning of this: Does the NIC think a transceiver is present and can the NIC use the transceiver? Finally, the mct_read() function allows us to go through and read specific regions of the memory map of the transceiver. Generally, user land reads an entire 256-byte page at any given time.

The kernel only facilitates reading information. It generally doesn’t try and make semantic sense of any of the data. That is purposefully left to user land — unless there’s a good reason to parse binary data in the kernel, you’re usually better off not doing that.

The following device families and drivers support the MAC_CAPAB_TRANSCEIVER capability. Some drivers that you may be more familiar with such as 1 Gigabit Ethernet devices aren’t on this list because they don’t support transceivers. Supported devices include:

  • Broadcom NetExtreme II 10 Gb devices based on the bnxe driver.
  • Chelsio T4, T5, and T6 10, 25, 40, and 100 Gbit devices based on the cxgbe driver.
  • Intel X520 10 Gbit SFP+ devices based on the ixgbe driver.
  • Intel 10, 25, and 40 Gbit SFP+, SFP28, and QSFP+ devices based on the i40e driver.
  • QLogic FastLinQ QL45xxx devices based on the qede driver.

The way that each driver gets access to this information varies from device to device. Some, like i40e and cxgbe, issue a firmware command to read information from the transceiver. Others have dedicated registers that can be programmed to read arbitrary data from the i2c device.

libsff and dltraninfo

Once we have the ability to read the actual data from the transceiver, we have to make logical sense of it. To handle that, the first thing I did was write a small library called libsff. The goal of libsff is to parse the various SFF binary data payloads and return structured data as a set of name-value pairs.

If you look at the header file, libsff.h, you’ll see a list of different keys that can be looked up. Some of these are rather straightforward, such as the string "vendor", which has the name of the manufacturer of the transceiver. Others are a bit more opaque and require referencing the actual SFF documents. Another useful feature of the library is that it tries to abstract out the differences between different versions of the specifications. The goal is that when there is similar data, it should always be found under the same key even if they are found in wildly different parts of the memory map or the way we have to parse the data is different. The goal of libraries (or really any interface and abstraction) should be to take something grotty and transform it into something more usable as though reality were as simple as it presents.

The one thing that the library doesn’t generally do today is parse all of the sensor data that may be available on the transceiver. The main reason for this is that the vast majority of transceivers that I had access to, did not implement it. On SFP, SFP+, and SFP28 devices, sensor information is optional for twinax based devices. With a few devices to test with, it would be pretty straightforward to add support for it though.

On its own, a library isn’t useful unless it has a consumer. The first consumer that I’ll discuss is dltrainfo. This is an unstable, development program that I wrote to exercise this functionality and to try and get a sense of what interfaces might be useful. There are two forms of the dltrainfo command. The first answers the questions that we laid out in the beginning about whether the transceiver is present or usable. When run this way, you see something like:

# /usr/lib/dl/dltraninfo
ixgbe0: discovered 1 transceiver
        transceiver 0 present: yes
        transceiver 0 usable: yes
ixgbe1: discovered 1 transceiver
        transceiver 0 present: yes
        transceiver 0 usable: yes
ixgbe2: discovered 1 transceiver
        transceiver 0 present: yes
        transceiver 0 usable: yes
ixgbe3: discovered 1 transceiver
        transceiver 0 present: yes
        transceiver 0 usable: yes

The next option is to read the information from the transceiver. Here’s an example of reading this on an Intel 10 Gbit fiber-optic transceiver:

# /usr/lib/dl/dltraninfo -v ixgbe1
ixgbe1: discovered 1 transceivers
        transceiver 0 present: yes
        transceiver 0 usable: yes
        Identifier: 'SFP/SFP+/SFP28'
        Extended Identifier: 4
        Connector: 'LC (Lucent Connector)'
        10G+ Ethernet Compliance Codes[0]: '10G Base-SR'
        Ethernet Compliance Codes[0]: '1000BASE-SX'
        Encoding: '64B/66B'
        BR, nominal: '10300 MBd'
        Length 50um OM2: '80 m'
        Length 62.5um OM1: '30 m'
        Length OM3: '300 m'
        Vendor: 'Intel Corp'
        OUI[0]: 0
        OUI[1]: 27
        OUI[2]: 33
        Part Number: 'FTLX8571D3BCV-IT'
        Revision: 'A'
        Laser Wavelength: '850 nm'
        Options[0]: 'Rx_LOS implemented'
        Options[1]: 'TX_FAULT implemented'
        Options[2]: 'TX_DISABLE implemented'
        Options[3]: 'RATE_SELECT implemented'
        Serial Number: 'AKR0EQ0'
        Date Code: '110618'
        Extended Options[0]: 'Soft Rate Select Control Implemented'
        Extended Options[1]: 'Soft RATE_SELECT implemented'
        Extended Options[2]: 'Soft RX_LOS implemented'
        Extended Options[3]: 'Soft TX_FAULT implemented'
        Extended Options[4]: 'Soft TX_DISABLE implemented'
        Extended Options[5]: 'Alarm/Warning flags implemented'
        8472 Compliance: 'Rev 10.2'

This allows us to interact with the information in a readable way. Effectively, this dumps out the entire name-value pair set that we construct when parsing data with libsff. There are two additional ways to print this data. The first one, -x, dumps out the data as hex data (kind of like if you run the program xxd). The second option -w writes out the first page, 0xa0, to a file. This allows you to take the raw data with you.

Seeing Transceivers in Topo

The next step with all this work is to expose the transceivers as part of the system topology in the fault management architecture (FMA). This is useful for a few reasons:

  1. It allows us to see what devices are present in the same snapshot as other devices like disks, CPUs, DIMMs, etc.
  2. FMA’s topology is a natural place for us to expose sensors.
  3. If a device is in topology, then we can generate error reports and faults against those devices.

Basically, being visible in the topology allows us to integrate it more fully into the system and makes it easy for various monitoring and inventory tools in the system to see these devices without having to make them aware of the underlying ways of getting data.

The topology information is organized as a tree. When we encounter hardware that we believe is a networking device (because its PCI class indicates it is), then we ask the kernel about how many transceivers it supports. For each transceiver, we create a port node under the NIC whose type indicates that it is intended for SFF devices.

When a transceiver is present, then we will place a transceiver node under the port. This node has two different groups of properties. The first is generic to all transceivers, which is where we indicate whether or not the hardware can use the transceiver. The second group are properties that we derive from the SFF specifications about the transceiver’s manufacturing data. This includes the vendor, part number, serial number, etc. The following block of text shows three different nodes: the NIC, the port, and the transceiver:

  group: protocol                       version: 1   stability: Private/Private
    resource          fmri      hc://:product-id=X9SCL-X9SCM:server-id=ivy:chassis-id=0123456789/motherboard=0/hostbridge=1/pciexrc=1/pciexbus=2/pc
    label             string    MB
    FRU               fmri      hc://:product-id=X9SCL-X9SCM:server-id=ivy:chassis-id=0123456789/motherboard=0
    ASRU              fmri      dev:////pci@0,0/pci8086,155@1,1/pci103c,17d3@0,1
  group: authority                      version: 1   stability: Private/Private
    product-id        string    X9SCL-X9SCM
    chassis-id        string    0123456789
    server-id         string    ivy
  group: io                             version: 1   stability: Private/Private
    dev               string    /pci@0,0/pci8086,155@1,1/pci103c,17d3@0,1
    driver            string    ixgbe
    module            fmri      mod:///mod-name=ixgbe/mod-id=242
  group: pci                            version: 1   stability: Private/Private
    device-id         string    10fb
    extended-capabilities string    pciexdev
    class-code        string    20000
    vendor-id         string    8086
    assigned-addresses uint32[]  [ 2197946640 0 3750756352 0 1048576 2164392216 0 57344 0 32 2197946656 0 3753902080 0 16384 ]

  group: protocol                       version: 1   stability: Private/Private
    resource          fmri      hc://:product-id=X9SCL-X9SCM:server-id=ivy:chassis-id=0123456789/motherboard=0/hostbridge=1/pciexrc=1/pciexbus=2/pc
    FRU               fmri      hc://:product-id=X9SCL-X9SCM:server-id=ivy:chassis-id=0123456789/motherboard=0/hostbridge=1/pciexrc=1/pciexbus=2/pc
  group: authority                      version: 1   stability: Private/Private
    product-id        string    X9SCL-X9SCM
    chassis-id        string    0123456789
    server-id         string    ivy
  group: port                           version: 1   stability: Private/Private
    type              string    sff

  group: protocol                       version: 1   stability: Private/Private
    resource          fmri      hc://:product-id=X9SCL-X9SCM:server-id=ivy:chassis-id=0123456789:serial=AKR0EQ0:part=FTLX8571D3BCV-IT:revision=A/motherboard=0/hostbridge=1/pciexrc=1/pciexbus=2/pciexdev=0/pciexfn=1/port=0/transceiver=0
    FRU               fmri      hc://:product-id=X9SCL-X9SCM:server-id=ivy:chassis-id=0123456789:serial=AKR0EQ0:part=FTLX8571D3BCV-IT:revision=A/motherboard=0/hostbridge=1/pciexrc=1/pciexbus=2/pciexdev=0/pciexfn=1/port=0/transceiver=0
  group: authority                      version: 1   stability: Private/Private
    product-id        string    X9SCL-X9SCM
    chassis-id        string    0123456789
    server-id         string    ivy
  group: transceiver                    version: 1   stability: Private/Private
    type              string    sff
    usable            string    true
  group: sff-transceiver                version: 1   stability: Private/Private
    vendor            string    Intel Corp
    part-number       string    FTLX8571D3BCV-IT
    revision          string    A
    serial-number     string    AKR0EQ0

If you plug in a transceiver dedicated to fibre channel, then we’ll properly note that we can’t use the transceiver by setting the usable property to false. The following is an example of the transceiver node in that case:

  group: protocol                       version: 1   stability: Private/Private
    resource          fmri      hc://:product-id=X10SLM+-LN4F:server-id=haswell:chassis-id=0123456789/motherboard=0/hostbridge=0/pciexrc=0/pciexbus=1/pciexdev=0/pciexfn=1/port=0/transceiver=0
    FRU               fmri      hc://:product-id=X10SLM+-LN4F:server-id=haswell:chassis-id=0123456789/motherboard=0/hostbridge=0/pciexrc=0/pciexbus=1/pciexdev=0/pciexfn=1/port=0/transceiver=0
  group: authority                      version: 1   stability: Private/Private
    product-id        string    X10SLM+-LN4F
    chassis-id        string    0123456789
    server-id         string    haswell
  group: transceiver                    version: 1   stability: Private/Private
    type              string    sff
    usable            string    false

Further Reading

If you’d like to read more on this, there are a couple of different places that I can send you.

For more on the SFF standards, there’s:

  • SFF-8472 which describes the SFP/SFP+ devices.
  • SFF-8436 which describes QSFP+ devices.
  • SFF-8636 which describes the memory map of QSFP28+ devices.

If you’re interested in the illumos implementation, there’s:

Looking Ahead

If you have a favorite NIC that uses SFP-based transceivers and it isn’t supported, reach out and we’ll see what we can do. If you’d find it interesting to work on exposing more of the sensor information present in the SFPs, then we’d be happy to further mentor someone there. Once these pieces are exposed in topology, it could also make sense to wire up the FRU monitor to watch for temperature thresholds, voltage drops, or device faults.

Up next, we’ll talk about understanding the topology of USB devices.

It was the brightest of LEDs, it was the darkest of LEDs, it was the age of data links, it was the age of AHCI enclosure services, …

Today, I’d like to talk about two aspects of a project that I worked on a little while back under the aegis of RFD 89 Project Tiresias. This project covered improving the infrastructure for how we gathered and used information about various components in the system. So, let’s talk about LEDs for a moment.

LEDs are strewn across systems and show up on disks, networking cards, or just to tell us the system is powered on. In many cases, we rely on the blinking lights of a NIC, switch, hard drive, or another component to see that data is flowing. The activity LED is a mainstay of many devices. However, there’s another reason that we want to be able to control the LEDs: for identification purposes. If you have a rack of servers and you’re trying to make sure you pull the right networking cable, it can be helpful to be able to turn a LED on, off, or blink it with a regular pattern. So without further ado, let’s talk about how we control LEDs for network interface cards (NICs or data links) and a class of SATA hard drives.


The first class of devices that I’d like to talk about are networking cards. Most every NIC that you can plug an Ethernet cable or a copper/fiber-optic transceiver in has at least two LEDs for each port. One that indicates that traffic is flowing over the link, an activity LED, and one that indicates the fact that the link is actually up. Sometimes different colors are used to indicate different speeds. For example, a 25 GbE capable device may use an orange color to indicate that the link is operating at 10 GbE and a green one to indicate that it is operating at 25 GbE.

The first challenge we have in the operating system is to figure out how to actually tell the device to cause its LEDs to behave in a specific way. Unfortunately, practically every NIC (or family of NICs) has its own unique way of doing this and well, just about everything else. This is the operating system’s curse and one of its primary responsibilities — designing abstractions for device drivers that make it easy to enable new hardware and take advantage of unique features that different devices offer while minimizing the amount of work that is required to do so.

In illumos, there is a framework for networking device drivers called mac. The mac framework is named after the device driver that implements it, mac(9E). Sometimes you’ll hear it called by an alternative name: ‘Generic LAN Device version 3’ (GLDv3). The mac framework uses the concept of capabilities, which represent different features that hardware offers. For example, this includes such things as checksum offload, TCP segmentation offload (TSO), and more. Each capability can have arbitrary data associated with it that describes more information about the capability itself.

For controlling the LEDs, there’s a new capability called MAC_CAPAB_LED. With the capability, a driver has to fill in three pieces of information:

  1. A set of flags to indicate optional behavior. Currently none are defined, but this is here for future expansion.
  2. A set of supported modes that the LED can be set to. This includes:
    • MAC_LED_ON which indicates that the LED should be turned on solidly.
    • MAC_LED_OFF which indicates that the LED should be turned off.
    • MAC_LED_IDENT which indicates that the LED should be set in a way to identify the device. Generally this means that it should blink. When this can use a different color, that’s even better.
  3. A function that the operating system should call to set the LED state.

The structure looks something like:

typedef struct mac_capab_led {
    uint_t mcl_flags;
    mac_led_mode_t mcl_modes;
    int (*mcl_set)(void *driver, mac_led_mode_t mode, uint_t flags);
} mac_capab_led_t;

Basically, when the operating system wants to change the LED state, it’ll end up calling the mcl_set function to set the new mode. When the LED state should be changed back to normal, a fourth state is passed: MAC_LED_DEFAULT. The operating system guarantees that it will never call this function in parallel for the driver to try to simplify the programming model, though the driver will likely have I/O ongoing.

Some devices, such as older Intel client parts based on the e1000g driver don’t actually have blink functionality. As such, the driver today emulates that, though it would be good to move that into the broader mac framework when we need to do that for another driver.

The following devices and drivers currently have support for this functionality:

  • Chelsio T4, T5, and T6 10, 25, 40, and 100 GbE parts based on the cxgbe driver
  • Intel 1 GbE Client and server parts based on the e1000g and igb drivers
  • Intel 10 GbE SFP/Copper devices in the X520, X540, and X550 families based on the ixgbe driver
  • Intel 10, 25, and 40 GbE devices in the X710 and X722 family based on the i40e driver

Plumbing it up in user land

With all of the above hardware support, there’s currently a private utility in illumos to control these called dlled that can be found at /usr/lib/dl/dlled. Now, you might ask: why is the program hiding in /usr/lib/dl? The main reason is that we’re still not sure what the right interface for controlling this should be. We should likely integrate it into the dladm(1M) command and allow the LEDs to be controlled through the Fault Management Architecture like we do with other LEDs. However, until we figure out exactly what we want, this gives us something to experiment with.

If you run it on a system, you’ll see something like:

# /usr/lib/dl/dlled
LINK                 ACTIVE       SUPPORTED
igb0                 default      default,off,on,ident
igb1                 default      default,off,on,ident
igb3                 default      default,off,on,ident
igb2                 default      default,off,on,ident

From here, we can use the -s option to then control and change the state. If you set the state, it should persist across anyone pulling or removing a cable in the device, but nothing that’s set will persist across a reboot of the system. If we set igb0 to ident mode, then we’ll see that the current state is updated:

# /usr/lib/dl/dlled -s ident igb0
# /usr/lib/dl/dlled
LINK                 ACTIVE       SUPPORTED
igb0                 ident        default,off,on,ident
igb1                 default      default,off,on,ident
igb3                 default      default,off,on,ident
igb2                 default      default,off,on,ident

While we have a video to demo this in action, for the moment let’s instead talk about another class of LEDs.


Many systems have an AHCI controller built into their primary chipset. The AHCI Controller is used to attach SATA hard disk drives (HDDs) and solid state drives (SSDs). AHCI stands for ‘Advanced Host Controller Interface’. It is a specification that describes how to discover attached devices and send commands to HDDs and SSDs.

Now, you may be saying to yourself, I’ve seen a hard drive or an SSD, but I don’t recall seeing any LEDs on them. And that’s right. Unlike NICs, which have the LEDs built in, the LEDs aren’t built into the drives, but into enclosures — the bays on the system that house drives.

LED Modes

When dealing with hard drives, there have historically been four different things that the system wants to communicate:

  1. That the drive in the bay has a fault — it is no longer working or it is OK.
  2. That the drive in the bay is ‘OK to remove’.
  3. To identify a specific bay by blinking the LED.
  4. To show that there is activity to the device in the bay.

Typically, the first three items share a single LED, while a second LED is used to drive activity. A side effect of this is that somewhere in either hardware or firmware there is a hierarchy for the first three tasks. For example, if nothing else is going on then the drive’s LED may be a solid green color. If the drive has been faulted, it may instead turn to an amber color. Blinking the LED may be in the amber color or it may be something else entirely.

In the majority of cases, the activity LED isn’t controllable by software; however, the other LED can be. This means that when say ZFS decides a drive is bad, it can eventually lead to the operating system turning on an amber LED to indicate where a problem is.

While we have mentioned the ‘OK to remove’ LED, that has become less and less commonly used and implemented. It’s listed here for completeness, but may not actually be something that you can control or even see on some systems.

Enclosure Management

As we mentioned earlier, the LEDs are part of the bays and not part of the drives. This has lead to a suite of different standards and means for enclosure management. The one that applies will actually vary depending on the wiring of the system. For example, while the same 2.5 inch drive bay can be used for SATA, SAS, and NVMe devices, the way that the enclosure is controlled varies a lot based on the disk controller and the system’s broader design.

In systems that are using SAS controllers (regardless of whether one uses SATA or SAS drives), there is a specification that describes how to enumerate the different bays that devices can show up in, figure out which drives are in which bays, and control the LEDs and get other information. In a SAS world, this is called SCSI enclosure services or SES. When using an all NVMe system, SES won’t exist. Similarly, when you’re using SATA devices and an AHCI controller, then something different ends up happening. In fact, you’re not even guaranteed that there will be a SES device even in a SAS world!

The AHCI specification provides optional enclosure management features. If you want to follow along in the AHCI specification, open up to chapter 12 — Enclosure Management. There are three primary items the related to enclosure management:

  • A capability bit to indicate whether or not enclosure services are supported.
  • A register that describes the capabilities and controls sending messages.
  • A region of memory for sending and receiving messages.

This region of memory that can be used to send and receive messages is used because the standard allows the system to participate in one of four different messaging schemes. Messages specific to the underlying scheme are placed in a region of memory and if the scheme supports replies, it’ll place replies in there.

Now, the specification supports capability bits for up to four different schemes:

  1. SAF-TE protocol
  2. SES-2 protocol
  3. SGPIO
  4. The AHCI specification’s LED format

Despite all of the different forms mentioned above, we’ve yet to discover a system that indicates support for something other than the LED format. Though we have found one or two systems that incorrectly say they support the LED format and don’t. As a result, the LED format is the only one that we currently support.

The message format is pretty straightforward. You indicate which port on the controller you care about and then indicate whether you want to enable the identification LED or the fault LED.

Ultimately though, whether or not any of this works at all depends on the manufacturer of the system and what they wire up. Hopefully, if they advertise the LED format support, then we’ll be able to properly drive it.

Wiring it up

To make all of this work, the first thing we had to do was to add support to the ahci(7D) driver to make it detect whether or not the controller had enclosure services and provide a means to control it. The state and capabilities are all managed by a private ioctl(2). These ioctls allow one to get information about what’s supported, the number of ports, and the current state of each port. A second ioctl exists to set the state of a specific port.

Hardware has the constraint that only a single message can be sent at any given time. Therefore the driver serializes these requests and centralizes all the activity in a task queue. In that taskq, we’ll then attempt to send and receive the messages that hardware expects. You can see the details of creating the message and writing it to hardware in the ahci_em_set_led() function.

Don’t worry. You don’t have to try and write ioctls yourself. Similar to the dlled private command, there’s a private command to manipulate this. Let’s take a look at it:

# /usr/lib/ahci/ahciem
PORT                 ACTIVE       SUPPORTED
ahci0/0              default      ident,fault,default
ahci0/1              default      ident,fault,default
ahci0/2              default      ident,fault,default
ahci0/3              default      ident,fault,default
ahci0/4              default      ident,fault,default
ahci0/5              default      ident,fault,default

You can set the state of an individual port here in the same way as with the dlled command. With the ‘default’ state being its normal behavior, with neither the ident or fault LED turned on.

Mapping LEDs to bays and disks

Now, there’s still a problem. We haven’t actually solved the problem of mapping up the ports described above to actual disks. Because this is being driven by the AHCI controller, it only provides us the means of toggling this on its logical ports. While we can know what disk is attached to what port, it doesn’t actually tell us where that disk physically is.

In illumos, we have the ability to construct a per-platform map that defines additional information about the topology of the system. It helps us answer the question of what is where. My colleague Jordan Hendricks was the one who solved this problem for us. She added support for us to relate specific bays that we declare in a per-platform topology map to the corresponding port. This allows us to answer the mapping question as well as control the LEDs through the topology like we do for other parts of the system. It’s great work that takes what was a building block and actually makes it useful in the broader system. You can find a lot of examples of it in the illumos bug tracker and you should read the code itself!

See it in action

When I had a working demo of the ability to control NIC LEDs, I ended up recording an impromptu demo with the help of my occasional partner in crime, Alex Wilson.

What’s Next?

If you’re interested in adding support for NIC LEDs to another device that you have, get in touch with the illumos community on IRC in #illumos on Freenode or a mailing list and I or someone else will help you out and see what we can do. Ultimately, both of these changes are building blocks that can be built on top of. Jordan’s work is an example of that in the AHCI case. There’s more that can be built on top of the basic NIC functionality as well.

Next time, we’ll touch on another piece of related work: understanding the state of copper and fiber-optic transceivers for different NICs and reading their EEPROMs.

A while back, I did a bit of work that I’ve been meaning to come back to and write about. The first of these are all about making it easier to see the temperature that different parts of the system are working with. In particular, I wanted to make sure that I could understand the temperature of the following different things:

  • Intel CPU Cores
  • Intel CPU Sockets
  • AMD CPUs
  • Intel Chipsets

While on some servers this data is available via IPMI, that doesn’t help you if you’re running a desktop or a laptop. Also, if the OS can know about this as a first class item, why bother going through IPMI to get at it? This is especially true as IPMI sometimes summarizes all of the different readings into a single one.

Seeing is Believing

Now, with these present, you can ask fmtopo to see the sensor values. While fmtopo itself isn’t a great user interface, it’s a great building block to centralize all of the different sensor information in the system. From here, we can build tooling on top of the fault management architecture (FMA) to better see and understand the different sensors. FMA will abstract all of the different sources. Some of them may be delivered by the kernel while others may be delivered by user land. With that in mind, let’s look at what this looks like on a small Kaby Lake NUC:

[root@estel ~]# /usr/lib/fm/fmd/fmtopo -V *sensor=temp
TIME                 UUID
Aug 12 20:44:08 88c3752d-53c2-ea3a-c787-cbeff0578cd0

  group: protocol                       version: 1   stability: Private/Private
    resource          fmri      hc://:product-id=NUC7i3BNH:server-id=estel:chassis-id=G6BN735007J5/motherboard=0/chip=0/core=0?sensor=temp
  group: authority                      version: 1   stability: Private/Private
    product-id        string    NUC7i3BNH
    chassis-id        string    G6BN735007J5
    server-id         string    estel
  group: facility                       version: 1   stability: Private/Private
    sensor-class      string    threshold
    type              uint32    0x1 (TEMP)
    units             uint32    0x1 (DEGREES_C)
    reading           double    43.000000

  group: protocol                       version: 1   stability: Private/Private
    resource          fmri      hc://:product-id=NUC7i3BNH:server-id=estel:chassis-id=G6BN735007J5/motherboard=0/chip=0/core=1?sensor=temp
  group: authority                      version: 1   stability: Private/Private
    product-id        string    NUC7i3BNH
    chassis-id        string    G6BN735007J5
    server-id         string    estel
  group: facility                       version: 1   stability: Private/Private
    sensor-class      string    threshold
    type              uint32    0x1 (TEMP)
    units             uint32    0x1 (DEGREES_C)
    reading           double    49.000000

  group: protocol                       version: 1   stability: Private/Private
    resource          fmri      hc://:product-id=NUC7i3BNH:server-id=estel:chassis-id=G6BN735007J5/motherboard=0/chip=0?sensor=temp
  group: authority                      version: 1   stability: Private/Private
    product-id        string    NUC7i3BNH
    chassis-id        string    G6BN735007J5
    server-id         string    estel
  group: facility                       version: 1   stability: Private/Private
    sensor-class      string    threshold
    type              uint32    0x1 (TEMP)
    units             uint32    0x1 (DEGREES_C)
    reading           double    49.000000

  group: protocol                       version: 1   stability: Private/Private
    resource          fmri      hc://:product-id=NUC7i3BNH:server-id=estel:chassis-id=G6BN735007J5/motherboard=0/chipset=0?sensor=temp
  group: authority                      version: 1   stability: Private/Private
    product-id        string    NUC7i3BNH
    chassis-id        string    G6BN735007J5
    server-id         string    estel
  group: facility                       version: 1   stability: Private/Private
    sensor-class      string    threshold
    type              uint32    0x1 (TEMP)
    units             uint32    0x1 (DEGREES_C)
    reading           double    46.000000

While it’s a little bit opaque, you might be able to see that we have four different temperature sensors here:

  • Core 0’s temperature sensor
  • Core 1’s temperature sensor
  • Package 0’s temperature sensor
  • The chipset’s temperature sensor (Intel calls this the Platform Controller Hub)

On an AMD system, the output is similar, but the sensor exists on a slightly different granularity than a per-core basis. We’ll come back to that a bit later.

Pieces of the Puzzle

To make all of this work, there are a few different pieces that we put together:

  1. Kernel device drivers to cover the Intel and AMD CPU sensors, and the Intel PCH
  2. A new evolving, standardized way to export simple temperature sensors
  3. Support in FMA’s topology code to look for such sensors and attach them

We’ll work through each of these different pieces in turn. The first part of this was to write three new device drivers one to cover each of the different cases that we cared about.

coretemp driver

The first part of the drivers is the coretemp driver. This uses the temperature interface that was introduced on Intel Core family processors. It allows an operating system to read a MSR (Model Specific Register) to determine what the temperature is. The support for this is indicated by a bit in one of the CPUID registers and exists on almost every Intel CPU that has come out since the Intel Core Duo.

Around the time of Intel’s Haswell processor (approximately), Intel added another CPUID bit and MSR that indicates what the temperature is on each socket.

The coretemp driver has two interesting dynamics and problems:

  1. Because of the use of the rdmsr instruction, which reads a model-specific register, one can only read the temperature for the CPU that you’re currently executing on. This isn’t too hard to arrange in the kernel, but it means that when we read the temperature we’ll need to organize what’s called a ‘cross-call’. You can think of a cross-call as a remote procedure call, except that the target is a specific CPU in the system and not a remote host.

  2. Intel doesn’t actually directly encode the temperature in the MSRs. Technically, the value we read represents an offset from the processor’s maximum junction temperature, often abbreviated Tj Max. Modern Intel processors provide a means for us to read this directly via an MSR. However, older ones, unfortunately, do not. On such older processors, the Tj Max actually varies based on not just the processor family, but also the brand, so different processors running at different frequencies have different values. Some of this information can be found in various datasheets, but for the moment, we’ve only enabled this driver for CPUs that we can guarantee the Tj Max value. If you have an older CPU and you’d like to see if we could manually enable it, please reach out.

pchtemp driver

The pchtemp driver is a temperature sensor driver for the Intel platform controller hub (PCH). The driver supports most Intel CPUs since the Haswell generation, as the format of the sensor changed starting with the Haswell-era chipsets.

This driver is much simpler than the other ones. The PCH exposes a dedicated pseudo-PCI device for this purpose. The pchtemp driver simply attaches to that and reads the temperature when required. Unlike the coretemp driver, the offset in temperature is the same across all of the currently supported platforms so we don’t need to encode anything special there like we do for the coretemp driver.

amdf17nbdf driver

The amdf17nbdf driver is a bit of a mouthful. It stands for the AMD Family 17h North Bridge and Data Fabric driver. Let’s take that apart for a moment. On x86, CPUs are broken into different families to represent different generations. Currently all of AMD’s Ryzen/EPYC processors that are based on the Zen microarchitecture are all grouped under Family 17h. The North Bridge is a part of the CPU that is used to interface with various I/O components on the system. The Data Fabric is a part of AMD CPUs which connects CPUs, I/O devices, and DRAM.

On AMD Zen family processors, the temperature sensor exists on a per-die basis. Each die is a group of cores. The physical chip has a number of such units, each of which in the original AMD Naples/Zen 1 processor has up to 8 cores. See the illumos cpuid.c big theory statement for the details on how the CPU is constructed and this terminology. Effectively, you can think of it as there are a number of different temperature sensors, one for each discrete group of cores.

To talk to the temperature sensor, we need to send a message on the what AMD calls the ‘system management network’ (SMN). The SMN is used to connect different management devices together. The SMN can be used for a number of different purposes beyond just temperature sensors. The operating system has a dedicated means of communicating and making requests over the SMN by using the corresponding north bridge, which is exposed as a PCI device.

The same way that with the coretemp driver you needed to issue a rdmsr instruction for the core that you wanted the temperature from, you need to do the same thing here. Each die has its own north bridge and therefore we need to use the right instance to talk to the right group of CPUs.

The wrinkle with all of this is that the north bridge on its own doesn’t give us enough information to map it back to a group of cores that an operator sees. This is critical, since if you can’t tell which cores you’re getting the temperature reading for, it immediately becomes much less useful.

This is where the data fabric devices come into play. The data fabric devices exist at a rather specific PCI bus, device, and function. They all are always defined to be on PCI bus 0. The data fabric device for the first die is always defined to be at device 18h. The next one is at 19h, and so on. This means that we have a deterministic way to map between a data fabric device and a group of cores. Now, that’s not enough on its own. While we know the data fabric, we don’t know how to map that to the north bridge.

Each north bridge in the system is always defined to be on its own PCI bus. The device we care about is always going to be device and function 0. The data fabric happens to have a register which defines for us the starting PCI bus for its corresponding north bridge. This means that we have a path now to get to the temperature sensor. For a given die, we can find the corresponding data fabric. From the data fabric, we can find the corresponding north bridge. And finally, from the north bridge, we can find the corresponding SMN (system management network) that we can communicate with.

With all that, there’s one more wrinkle. On some processors, most notably the Ryzen and ThreadRipper families, the temperature that we read has an offset encoded with it. Unfortunately, these offsets have only been documented in blog posts by AMD and not in the formal documentation. Still, it’s not too hard to take this into account once official documentation becomes available.

While our original focus was on adding support for AMD’s most recent processors, if you have an older AMD processor and are interested in wiring up the temperature sensors on it, please get in touch and we can work together to implement something.


Now that we have drivers that know how to read this information, the next problem we need to tackle is how do we expose this information to user land. In illumos, the most common way is often some kind of structured data that can be retrieved by an ioctl on a character device, or some other mechanism like a kernel statistic.

After talking with some folks, we put together a starting point for a way for a kernel to exposes sets of statistics and created a new header file in illumos called sys/sensors.h. This header file isn’t currently shipped and is only used by software in illumos itself. This makes it easy for us to experiment with the API and change it without worrying about breaking user software. Right now, each of the above drivers implements a specific type of character device that implements the same, generic interface.

The current interface supports two different types of commands. The first, SENSOR_IOCTL_TYPE, answers the question of what kind of sensor is this. Right now, the only valid answer is SENSOR_KIND_TEMPERATURE. The idea is that if we have other sensors, say voltage, current, or
something else, we could return a different kind. Each kind, in turn, promises to implement the same interface and information.

For temperature devices, we need to fill in a singular structure which is used to retrieve the temperature. This structure currently looks something like:

typedef struct sensor_ioctl_temperature {
        uint32_t        sit_unit;
        int32_t         sit_gran;
        int64_t         sit_temp;
} sensor_ioctl_temperature_t;

This is kind of interesting and incorporates some ideas from Joshua Clulow and Alex Wilson. The sit_unit member is used to describe what unit the temperature is in. For example, it may be in Celsius, Kelvin, or Fahrenheit.

The next two members are a little more interesting. The sit_temp member contains a temperature reading, the sit_gran member is whats important in how we interpret that temperature. While many sensors end up communicating to digital consumers using a power of 2 based reading, that’s not always the case. Some sensors often may report a reading in units such as 1/10th of a degree. Others may actually report something in a granularity of 2 degrees!

To try and deal with this menagerie, the sit_gran member indicates the number of increments per degree in the sit_temp member. If this was set to 10, then that would mean that sit_temp was in 10ths of a degree and to get the actual value in degrees, one would need to divide by 10. On the other hand, a negative value instead means that we would need to multiply. So, a value of -2 would mean that sit_temp was in units of 2 degrees. To get the actual temperature, you would need to multiply sit_temp by 2.

Now, you may ask why not just have the kernel compute this and have a ones digit and a decimal portion. The main goal is to avoid floating point math in the kernel. For various reasons, this historically has been avoided and we’d rather keep it that way. While this may seem a little weird, it does allow for the driver to do something simpler and lets user land figure out how to transform this into a value that makes semantic sense for it. This gets the kernel out of trying to play the how many digits after the decimal point would you like game.

Exposing Devices

The second part of this bit of kernel support is to try and provide a uniform and easy way to see these different things under /dev in the file system. In illumos, when someone creates a minor node in a device driver, you have to specify a type of device and a name for the minor node. While most of the devices in the system use a standard type, we added a new class of types for sensors that translate roughly into where you’ll find them.

So, for example, the CPU drivers use the class that has the string "ddi_sensor:temperature:cpu" (usually as the macro DDI_NT_SENSOR_TEMP_CPU). This is used to indicate that it is a temperature sensor for CPUs. The coretemp driver then creates different nodes for each core and socket. For example, on a system with an Intel E3-1270v3 (Haswell), we see the following devices:

[root@haswell ~]# find /dev/sensors/

On the other hand on an AMD EPYC system with two AMD EPYC 7601 processors, we see:

[root@odyssey ~]# find /dev/sensors/

The nice thing about the current scheme is that anything of type ddi_sensor will have a directory hierarchy created for it based on the different interspersed : characters. This makes it very easy for us to experiment with different kinds of sensors without having to go through too much effort. That said, this is all still evolving, so there’s no promise that this won’t change. Please don’t write code that relies on this. If you do, it’ll likely break!

FMA Topology

The last piece of this was to wire it up in FMA’s topology. To do that, I did a few different pieces. The first was to make it easy to add a node to the topology that represents a sensor backed by this kernel interface. There’s one generic implementation of that which is parametrized by the path.

With that, I first modified the CPU enumerator. The logic will use a core sensor if available, but can also fall back to a processor-node sensor if it exists. Next, I added a new piece of enumeration, which was to represent the chipset. If we have a temperature sensor, then we’ll enumerate the chipset under the motherboard. While this is just the first item there, I suspect we’ll add more over time as we try to properly capture more information about what it’s doing, the firmware revisions that are a part of it, and more.

This piece is, in some ways, the simplest of them all. It just builds on everything else that was already built up. FMA already had a notion of a sensor (which is used for everything from disk temperature to the fan RPM), so this was just a simple matter of wiring things up.

Now, we have all of the different pieces that made the original example of the CPU and PCH temperature sensor work.

Further Reading

If you’re interested in learning more about this, you can find more information in the following resources:

In addition, you can find theory statements that describe the purpose of the different drivers and other pieces that were discussed earlier and
their manual pages:

Finally, if you want to see the actual original commits that integrated these changes, then you can find the following from illumos-gate:

commit dc90e12310982077796c5117ebfe92ee04b370a3
Author: Robert Mustacchi <rm@joyent.com>
Date:   Wed Apr 24 03:05:13 2019 +0000

    11273 Want Intel PCH temperature sensor
    Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
    Reviewed by: Mike Zeller <mike.zeller@joyent.com>
    Reviewed by: Toomas Soome <tsoome@me.com>
    Reviewed by: Gergő Doma <domag02@gmail.com>
    Reviewed by: Paul Winder <Paul.Winder@wdc.com>
    Approved by: Richard Lowe <richlowe@richlowe.net>

commit f2dbfd322ec9cd157a6e2cd8a53569e718a4b0af
Author: Robert Mustacchi <rm@joyent.com>
Date:   Sun Jun 2 14:55:56 2019 +0000

    11184 Want CPU Temperature Sensors
    11185 i86pc chip module should be smatch clean
    Reviewed by: Hans Rosenfeld <hans.rosenfeld@joyent.com>
    Reviewed by: Jordan Hendricks <jordan.hendricks@joyent.com>
    Reviewed by: Patrick Mooney <patrick.mooney@joyent.com>
    Reviewed by: Toomas Soome <tsoome@me.com>
    Approved by: Garrett D'Amore <garrett@damore.org>

Looking Ahead

While this is focused on a few useful temperature sensors, there are more that we’d like to add. If you have a favorite sensor that you’d like to see, please reach out to me or the broader illumos community and we’d be happy to take a look at it.

Another avenue that we’re looking to explore is having a standardized sensor driver such that a device driver doesn’t necessarily have to have its own character device or implementation of the ioctl handling.

Finally, I’d like to thank all those who helped me as we discussed different aspects of the API, reviewed the work, and tested it. None of this happens in a vacuum.

The modern networking card (NIC) has evolved quite a bit from the simple Ethernet cards of yesteryear. As such, the way that the operating system uses them has had to evolve in tandem. Gone are the simple 10 Mbit/s copper or (BNC) devices. Instead, 1 Gb/s is common-place in the home, 10 Gb/s rules the server, and you can buy cards that come in speeds like 25 Gb/s, 40 Gb/s, and even 100 Gb/s! Speed isn’t the only thing that’s changed, there’s been a big push to virtualization. What used to be one app on one server, transformed into a couple apps on a few Hardware Virtual Machines, and these days can be hundreds of containers all on a single server.

For this entry, we’re going to be focusing on how the Operating System sees NICs, what abstractions they provide together, how things have changed to deal with the increased tenancy and performance demands, and then finally where we’re going next with all this. We’re going to focus on where scalability problems have come about and talk about how they’ve been solved.

The Simple NIC

While this is a broad generalization, the simplest way to think of a networking card is that it has five primary pieces:

  1. A MAC Address that it can use to filter incoming packets.
  2. A ring, or circular buffer, that packets are received into from the network.
  3. A ring, or circular buffer, that packets are sent from to the network.
  4. The ability to generate interrupts.
  5. A way to program all of the above, generally done with PCI memory-mapped registers.

First, let’s talk about rings. Both of these rings are considered a type of circular buffer. With a circular buffer, the valid region of the buffer changes over time. Circular buffers are often used because of the property that they consume a fixed amount of memory and they handle the model of a single producer and a single consumer rather well. One end, the producer, places data in the buffer and moves a head pointer, while another end, the consumer, removes data from the buffer, moving the tail pointer. For the rest of this article, we won’t be using the term circular buffer, but instead ring, which is commonly used in both networking card programming manuals and operating systems to describe a circular buffer used for packet descriptors.

Let’s go into a bit more detail about how this works. Rings are the lifeblood of a networking card. They occupy a fixed chunk of memory in normal system memory and when the hardware wants to access it, it will perform DMA, direct memory access. All the data that comes and goes through it is described by an entry in a ring, called a descriptor. The ring is made of a series of these descriptors. A descriptor is generally made up of three parts:

  1. A buffer address
  2. A buffer length
  3. Metadata about the packet

Consider the receive ring. Each entry in it describes a place that the hardware can place an incoming packet. The buffer in the descriptor says where in main memory to place the packet and the length says how big that buffer is. When a packet arrives, a networking card will generally generate an interrupt, thus letting the operating system know that it should check the ring.

The transmit side is similar, but slightly different in orientation. The OS places descriptors into the ring and then indicates to the networking card that it should transmit those packets. The descriptor says where in memory to find the packet and the length says how long the packet is. The metadata may include information such as whether the packet is broken up into one or more descriptors, the type of data in the packet, or optional features to perform.

To make this more concrete, let’s think of this in the context of receiving packets. Initially, the ring is empty. The OS then fills all of the descriptors with pointers to where the hardware can put packets it receives. The OS doesn’t have to fill the entire ring, but it’s standard practice in most operating systems to do so. For this example, we’ll assume that the OS has filled the entire ring.

Because it’s written to the entire ring, the OS will set its pointer to the start of the buffer. Then, as the hardware receives packets, the hardware will bump its pointer and send the OS an interrupt.

The following ASCII art image describes how the ring might look after the hardware has received two packets and delivered the OS an interrupt:

        +--------------+ <----- OS Pointer
        | Descriptor 0 |
        | Descriptor 1 |
        +--------------+ <----- Hardware Pointer
        | Descriptor 2 |
        |     ...      |
        | Descriptor n |

When the OS receives the interrupt, it reads where the hardware pointer is and processes those packets between its pointer and the hardware’s. Once it’s done, it doesn’t have to do anything until it prepares those descriptors with fresh buffers. Once it does, it’ll update its pointer by writing to the hardware. For example, if the OS has processed those first two descriptors and then updates the hardware, the ring will look something like:

        | Descriptor 0 |
        | Descriptor 1 |
        +--------------+ <----- Hardware Pointer, OS Pointer
        | Descriptor 2 |
        |     ...      |
        | Descriptor n |

When you send packets, it’s similar. The OS fills in descriptors and then notifies the hardware. Once the hardware has sent them out on the wire, it then injects an interrupt and indicates which descriptors it’s written to the network, allowing the OS to free the associated memory.

MAC Address Filters and Promiscuous Mode

The missing piece we haven’t talked about is the filter. Just about every networking card has support for filtering incoming packets, which is done based on the destination MAC address of the packet. By default, this filter is always programmed with the MAC address of the card. By default, if the destination MAC address of the Ethernet header doesn’t match the networking card’s MAC address, and it isn’t a broadcast or multicast packet, then it will be dropped.

When networking cards want to receive more traffic than that which they’ve been programmed to receive, they’re put into what’s called promiscuous mode, which means that the card will place every packet it receives into the receive ring for the operating system to process, regardless of the destination MAC address.

Now, the question that comes to mind, is why do cards have this filtering capability at all? Why does it matter? Why would we only care about a subset of packets on the network?

To answer this, we first have to go back to the world before network switches, there were only hubs. When a packet came into a hub, it was replicated to every port. This mean that every NIC would end up seeing everything that went out. If they hadn’t filtered by default, then the sheer amount of traffic and interrupts might have overwhelmed many early systems, particularly on larger networks. Nowadays, this isn’t quite as problematic as we use switches, which learn which MAC addresses are behind which switch ports. Of course, there are limits to this, which have motivated a lot of the work around network virtualization, but more on that in a future blog post.

Problem: The Single Busy CPU

In the previous section, we talked about how we had an interrupt that fired whenever packets were successfully received or transmitted. By default, there was only ever a single interrupt used and that interrupt vector usually mapped to exactly one CPU, in other words only one CPU could process the initial stream of packets. In many systems, that CPU would then process that incoming packet all the way through the TCP/IP stack until it reached a socket buffer for an application to read.

This has led to many problems. The biggest are that:

  1. You have a single CPU that’s spending all its time handling interrupts, unable to do user work
  2. You might often have increased latency for incoming packets due to the processing time

These problems were especially prevalent on single CPU systems. While it may be hard to remember, for a long time, most systems only had a single socket with a single CPU. There were no hardware threads and there were no cores. As multi-processing became more common (particularly on x86), the question became how do networking cards take advantage of horizontal scalability.

A Swing and a Miss

The first solution that springs to mind when we talk about this is say why don’t we just have multiple threads process the same ring? If we have multiple CPUs in the system, why not just have a thread we can schedule to help do some of the heavy lifting along with the thread that’s processing the interrupt?

Unfortunately, this is where Amdahl starts to rear his head, glaring at us and reminding us of the harsh truths of multi-processing. Multiple threads processing the same queue doesn’t really do us much good. Instead, we’ll end up changing where our contention is. Instead of having a single CPU 100% busy, we’ll have several CPUs, quite busy, and spending a lot of their time blocked on locks. The problem is that we still have shared state here — the ring itself.

There’s actually another somewhat nastier and much more subtle problem here: packet ordering. While TCP has a lot of logic to deal with out of order packets and UDP users are left on their own, no one likes out of order packets. In many TCP stacks, whenever a packet is detected to have arrived out of order, that sends up red flags in the stack. This will cause TCP to believe there is something wrong in the network and often throttle traffic or require data to be retransmitted, injecting noticeable latency and performance impacts.

Now, if the packets arrive out of order on the networking card, then there’s not a lot we can do here. However, if they arrive in order and we have multiple threads processing the same ring, then due to lock ordering and the scheduler, it all of a sudden becomes very easy to have packets arrive out of order, leading to TCP throwing its hands up, and performance being left behind in the gutter.

So we now actually have two different problems:

  1. We need to make sure that we don’t allow packets to be delivered out of order
  2. We want to avoid sharing data structures protected by the same lock

Nine Rings for Packets Doomed to be Hashed

The answer to this problem is two-fold. Earlier we said that a ring is the lifeblood of a NIC, so what if we just put added a bunch more rings to the card. If each ring has its own interrupt associated with it, then we’ve solved our contention problem. Each ring is still processed by only a single thread. The fastest way to deal with shared resources is not to share at all!

So now we’re left with the question of how do we deal with the out of order packet problem. If we simply assigned each incoming packet to a ring in a round-robin fashion, we’d only be making the problem of out of order delivery a certainty. So this means that we need to do something a little bit smarter.

The important observation is that what we care about is that a given TCP or UDP connection always go to the same place. It’s a TCP connection that can become out of order. If there are two different connections ongoing, the order that their packets are received in doesn’t matter. Only the order of a single connection matters. Now all we need to do is assign a given connection to a ring. For a given TCP connection, the source and destination IP addresses, and the source and destination ports will be the same throughout the lifetime of the connection. You might sometimes hear this called a flow, a series of identifying information that identifies some set of traffic.

The way the assignments are made is by hashing. Networking cards have a hash function that takes into account various fields in the header that are constant and use them to produce a hash value. Then that hash value is used to assign something to a ring. For example, if there were four rings in use, a simple way to assign traffic is to simply take the hash value and compute its modulus by the number of rings, giving a constant ring index.

Different cards use different strategies for this. You also don’t necessarily need to use all of the members of the header. For example, while you could use both the source and destination ports and IP addresses for a TCP connection, if the card ignored the ports, the right thing would still happen. Traffic wouldn’t end up out of order; however, it might not be spread quite as evenly amongst the rings.

This technology is often referred to as receive side scaling (RSS). On SmartOS and other illumos derived systems, RSS is enabled automatically if the device and its driver support it.

Problem: Density, Density, Density

As Steve Ballmer famously once said, “Developers, Developers, Developers…”. For many companies today, it isn’t developers that is the problem, but density. The density and utilization of machines is one of the most important factors for their efficiency and their capex and opex budgets. The first major entry into enterprise for tackling this density was VMware and the Hardware Virtual Machine. However, operating system virtualization had also kicked off. For more on the history, listen to Bryan Cantrill‘s talk at Papers we Love.

Just like airlines don’t want to fly with empty seats on planes, when companies are buying servers, they want to make sure that they are fully utilized. Due to the increasing size of machines, that means that the number of things running on it has to increase. With rare exception, gone are the days of the single app on a server. This means that the number of IP addresses and MAC addresses per server has jumped dramatically. We cannot just load up a box with NICs and assign each physical device to a guest. That’s a lot of cards, ports, cables, and top of rack switches.

However, increasing density isn’t necessarily free. We now have new problems that come up as a result and new scalability problems.

A Brief Aside: The Virtual NIC

Once you start driving up the density and tenancy of a single machine, then you immediately have the question of how do you represent these different devices. Each of these instances, whether they be hardware virtualized or a container, has their own networking information. They not only have their own IP address, but they also have their own unique MAC address and different properties on them. So how do you represent these along with the physical devices?

Enter the Virtual NIC or VNIC. Each VNIC has its own MAC address and its own link properties. For example, they have their own MTU and may have a VLAN tag. Each VNIC is created over a physical device and can be given to any kind of instance. This allows a physical device to be shared between multiple different instances, all of which may not trust one another. VNICs allow the administrator or operator to describe logically what they want to exist, without worrying about the mess of the many different possible abstractions that have since shown up in the form of bridges, virtual switches, etc. In addition, VNICs have capabilities that allow us to prevent MAC address spoofing, IP address spoofing, DHCP spoofing, and more, making it very nice for a multi-tenant environment where you don’t trust your neighbor, especially when your neighbor sits next to you.

Always Promiscuous?

We earlier talked about how devices don’t receive every frame and instead have filters. On the flip side, as density demands increased, so does the number of MAC addresses that exist on one server. When we talked about the simple NIC, it only had one MAC address filter. If the OS wanted to receive traffic for more than one MAC address, it would need to put the device into promiscuous mode.

However, here’s another way that devices have changed substantially. Rather than just having a single MAC address filter, they have added several. If you consider SmartOS and illumos, every time a virtual NIC is created, it tries to use an available MAC address filter. The number of filters present determines how many devices we can support before we have to put a NIC in promiscuous mode. On some devices there can be hundreds of these filters. Some of which also take into account the VLAN tag as well.

The Classification Challenge

So, we’ve managed to improve things a bit here. We’ve got a couple hundred devices here and we’ve been able to program those devices into our MAC address filters. Our devices are employing RSS so we’re able to better spread the load across every device; however, we still have a problem: when a packet comes in we need to now figure out what virtual or physical device it corresponds to so we deliver it to the right instance.

By pushing all of this density into a single server, that server needs its own logical, virtual switches. At a high level, implementing this is straightforward, we simply need to look at the destination MAC address, find the corresponding device, and then deliver the packet in the context of that device.

NIC manufacturers paid attention to this need and the fact that operating systems were spending a lot of time dealing with this and so they came up with some additional features to help. We’ve already mentioned how devices can support RSS and how they can have MAC address filters. So, what happens if we combine the two ideas: given a piece of hardware multiple rings, each of which can be programmed with its own MAC address filter.

In this world, we assemble rings into what we call a group. A group consists of one or more MAC address filters and one or more rings. Consider the case where each group has one ring and one filter. So each VNIC in the system will be assigned to a group while they’re available. If a group can be assigned then all the traffic coming to the corresponding ring is guaranteed to be for that device. If we know that, then we can bypass any and all classification in software. When we process the ring, we know exactly what device in the OS this corresponds to and we can skip the classification step entirely.

We mentioned that some devices can assign multiple rings to a given group. If multiple rings are assigned to a group, then the NIC will perform RSS across that set of rings. That means that after the traffic gets filtered and assigned to a group, we then hash the incoming packets and assign it to one of those rings.

Now, you might be asking what about the case where we run out of groups? Well, at that point we try and leverage the fact that some groups can support multiple MAC addresses. This default group will be programmed with all the remaining MAC addresses. If those are exceeded, then we can put that default group and only that group into promiscuous mode.

What we’ve done now is taken the basic building blocks of both rings and MAC address filters and combined them in a way to tackle the density problem. This lets a single network card scale up much more dramatically.

Problem: CPUs are too ‘slow’

One of the other prime features of modern NICs is what various NIC manufacturers call hardware offloads. TCP, UDP, and other networking protocols often have checksums that need to be calculated. The reason this came up is that many CPUs were considered too slow. What this really means is that there was a bit more latency and CPU time spent processing these checksums and verifying them. NIC vendors decided to add additional logic to their silicon (or firmware) to calculate those checksums themselves.

The general idea is that if when the OS needs to transmit the packet, it can ask the NIC to fill in the checksums and when it is receiving a packet, it can ask the NIC to verify the checksum. If the NIC verifies the checksum, then often times the OS will trust the NIC and then not verify it itself.

In general, these offload technologies are fairly common and generally enabled by default. They do add a nice little advantage; however, it often comes at the cost of debugability and may leave you at the mercy of hardware and firmware bugs in the devices. Historically, some devices have had bugs in this logic or had various edge conditions that will lead them to corrupt the data or incorrectly calculate the checksum. Debugging these kinds of problems is often very difficult, because everything that the OS generates itself looks fine.

There are other offload technologies that have also been introduced such as TCP Segmentation Offload which seek to offload parts of the TCP stack processing to networking cards. Whenever looking at or considering hardware offloads, it’s always important to understand the trade offs in performance, debugability, and value. Not every hardware offload is worth it. Always measure.

Problem: The Interrupts are Coming in too Hot

Let’s set the stage. Originally devices could handle 10 Mbits of traffic in a single second. If you assume a default MTU of 1500 bytes and that every packet was that size (depending on your workload, this can easily be a bad assumption), then that means that a device could in theory receive about 833 packets in a given second (10 Mbit/s * 1,000,000 bits/Mbit / 8 bits/byte / 1500 bytes/packet). When you start accounting for the Ethernet header and the VLAN tag, this number falls a bit more.

So if we had 833 packets per second and then we assume that each interrupt only has a single packet (the worst case), we have 833 interrupts per second and we have over 1 ms to process each packet. Okay, that’s easy.

Now, we’re not using 10 Mbit/s devices, we’re often using 10 Gbit/s devices. That means we have 1000x more packets per second! So that means that if we have a single packet in every interrupt that’s 833,000 interrupts per second. All of a sudden the time we have to just do all of the accounting for the interrupt becomes ridiculous and starts to add a lot of overhead.

Solution One: Do Less Work

The first way that this is often handled is to simply do less. Modern devices have interrupt throttling. This allows device drivers to limit the number of interrupts that occur per second per ring. The rates and schemes are different on every device, but a simple way to think about it is if you set an upper bound on interrupts per second, then the device will enforce a minimum amount of time between interrupts. Say you wanted to allow 8000 interrupts per second, then this would mean that an interrupt could fire at most every 125 microseconds.

When an interrupt comes in, the operating system can process more than one packet per interrupt. If there are several packets available in the ring, then there’s nothing that stops the system from processing multiple in a single interrupt and in fact, if you want to perform well at higher speeds, you need to.

While most operating systems enable this by default, there is a trade off. You can imagine that there is a small latency hit. For most uses this isn’t very important; however, if you’re in the finance world where every microsecond counts, then you may not care about the CPU cost and would rather avoid the latency. At the end of the day though, most workloads will end up benefiting from using interrupt throttling, especially as it can be used to help reduce the CPU load.

Solution Two: Turn Off Interrupts

Here we’re going to go and do something entirely different. We’re going to stop bothering with interrupts. Modern PCI Express devices all support multiple interrupts. We’ve talked about how there are multiple rings, well each ring has its own interrupt identified by a vector. These interrupts are called MSI-X. These devices allow you to mask the interrupt and turn it off on a per-ring basis.

Regardless of whether interrupts are turned on or off on a given ring, packets will still accumulate in the ring. This means that if the operating system were to look at the ring, it could see that there were packets available to be processed. If the OS marks the received entries processed, then the hardware will continue delivering packets into the ring. When the OS decides to turn off interrupts and process the ring with a dedicated thread, we call this polling.

Polling works in conjunction with dedicated rings being used for classification. In essence, when the OS notices that there’s a large number of packets coming in on the ring, it will then transition the ring to polling mode, disabling interrupts. The dedicated thread will then continue to poll the ring for work, consuming up to a fixed number of packets in any one poll. After that, if there is still a large backlog, it will keep polling until a low watermark is reached, at which point it will disable polling and transition back to interrupt based processing.

Each time a poll occurs, the packets will be delivered in bulk to the networking stack. So if 50 packets came in in-between polls, then they would all be delivered at once.

As with everything we’ve seen, there is a trade off of sorts. When you’re in polling mode, there can be an additional latency hit to processing some of these packets; however, polling can make it much easier to saturate 10 Gbit/s and faster devices with very little CPU.


We started off with introducing the concept of rings in a networking card. Rings form the basic building block of the NIC and are the basis for a lot of the more interesting trends in hardware. From there, we talked about how you can use rings for RSS and when you combine rings with MAC address filters you can drive up tenancy with hardware classification, which helps enable polling, among other features.

One important thing here is that all of the features we’ve talked about are completely transparent to applications. All of the software built upon the BSD/POSIX sockets APIs, functions like connect(), accept(), sendmsg(), and recvmsg(), automatically get all of the benefits of these developments and enhancements in the operating system and hardware, without having to change a single thing in the application.

Future Directions and More Reading

This is only the tip of the iceberg both in terms of detail and in terms of what we can do. For example, while we’ve primarily talked about the hardware and software classification of flows for the purpose of delivering them to the right device, there are other things we can do such as throttle bandwidth and more with tools like flowadm(1M).

At Joyent, we’re continuing to explore these areas and push ahead in new and interesting ways. For example, as cards have been adding more and more functionality to support things like VXLAN and Geneve, we’re exploring how we perform hardware classification of those protocols, leverage newer checksum offloading for them, and coming up with novel and different ways to improve the performance. If any of the following sound interesting, make sure to reach out.

If you found this topic interesting, but find yourself looking for more, you may want to read some of the following:

Saturday September 27th was illumos day 2014, hosted as a follow on to Surge 2014. illumos day was really quite nice and it was a good gathering of both folks who have been in the community for some time, and those who were just getting started. I was able to record the talks and so you can find them all in the following Youtube playlist:

The following folks gave talks:

Thanks to everyone who attended!

Recent Posts

September 27, 2019
September 6, 2019
October 1, 2014