Tales from a Core File


Month: September 2019

USB devices have been a mainstay of extending x86 systems for some time now. At Joyent, we used USB keys to contain our own version of iPXE to boot. As part of discussions with Alex Wilson around RFD 77 Hardware-backed per-zone crypto tokens, we talked about knowing and constraining which USB devices were trusted based on whether they were plugged into an internal or an external USB port.

While this wasn’t the first time that this idea had come up, by the time I started working on ideas on improving data center management, having better USB topology ended up on the list of problems I wanted to solve in RFD 89 Project Tiresias. Though at that point, how it was going to work was still a bit of an unknown.

The rest of this blog entry will focus on giving a bit of background on how USB works, some of the building blocks used for topology, examples of how we use the topology information, and then how to flesh it out for a new system.

USB Background

While USB, the Universal Serial Bus, is rather ubiquitous, some of its underlying implementation may not be. This section describes USBv1, v2, and v3 devices. The USBv4 spec is basically Thunderbolt, which is a different enough beast that I don’t want to lump it into here.

Every USB device is plugged into a device called a hub. Each hub consists of one or more ports and may itself be plugged into another hub. In this manner, you can think of USB like a tree. When you get to the root of this tree, you reach what is called the root hub. The root hub is often a bit different from the other hubs — it bridges USB to the rest of the system. Most USB root hubs are either built into the platform’s chipsets or they are on external PCI express cards. The operating system interfaces with these devices generally using standards like xHCI, the eXtensible Host Controller Interface (they already used the E for EHCI – the Enhanced Host Controller Interface).

USB 2.0 and USB 3.x Compatibility

There are many versions of USB devices out on the market; however, even newer devices work on older systems (if you can find the right kind of plug). The secret to this is that every USB 3.x port must support USB 2.0. The way this works is that a USB 3.x port has the wiring for both USB 3.x and USB 2.0 at the same time. In general, this has been a good thing. It means that older devices will work on newer systems and newer devices will work on older systems albeit not always at the maximum speed that they support. However, this does make our life a bit more complicated when it comes to topology.

While a single physical port can support both USB 3.x and USB 2.0, to the operating system the one physical port shows up as two different logical ports on the host controller. Generally, a device will select either USB 3.x or USB 2.0 signaling based on what it supports and therefore it will only show up on one of the two logical ports. However, when it comes to topology, the user cares only about the fact that it’s in a given physical port; they don’t (generally speaking) care about the fact that there are multiple logical ports.

USB hubs, which allow for more devices to exist, are an exception to the rule of only using USB 3.x or USB 2.0 signaling. A USB 3.x hub is actually two hubs in one! When a USB 3.x hub is plugged into a port that supports USB 3.x, it will enumerate as two different hubs: one on the USB 2.0 logical port and one on the USB 3.x logical port. This means that the OS will actually see two distinct USB Hubs that it will enumerate and manage independently.

Ultimately, these are all good properties for USB devices to have. It does mean that we have to do a bit more work to map everything together, but that’s fine — that’s our job.

Multiple Host Controllers

The picture we painted above is a nice one, but doesn’t reflect all systems. One of the challenges of USB 3.0 support was that it introduced a new host controller interface: xhci replaced ehci. Now, to help with the transition, Intel produced a number of chipsets that had both xhci and ehci controllers on them. When the system booted up, all of the USB ports would be directed towards the ehci controller. However, on these platforms an xhci device driver could write to a special register which would result in rerouting all of the ports from the ehci controller to the xhci controller.

This allowed operating systems which didn’t support xhci to still have working USB. On Intel platforms, this duality was removed with Intel’s Skylake chipsets.

From topology’s perspective, this means that the same physical port could show up not just as two different ports on the same controller, but actually as multiple, disjoint ports on different controllers!

Companion Controllers

With USB 3.x, a single host controller can support USB 3.x, 2.0, and 1.x devices. However, before USB 3.0 this wasn’t the case. Instead, platforms placed what was called a ‘companion controller’ on the motherboard. The basic idea was that the USB 2.0 ports were wired up to one controller and the other ports were wired up to a companion USB 1.0/USB 1.1 controller (ohci or uhci).

The companion controller model required the various drivers to be aware of this reality and trade things back and forth between them. Folding them together in xhci made things simpler. From a topology perspective, this can result in the same problem that we have in the pre-Skylake USB 3.0 supporting systems: a given physical port can show up under multiple distinct devices.

USB Descriptors and Capabilities

Information about USB devices is broken down into two different groups of information:

  1. Descriptors
  2. Capabilities

Descriptors are used to identify information about the device such as the manufacturer ID, the device ID, the USB revision the device supports, etc. There are descriptors which identify characteristics about a shared class of devices and others which identify information about different configurations that the device supports. For the purposes of USB topology, we primarily care about the device descriptor.

USB capabilities are stored in what’s called the binary object store. Capabilities first showed up in the USB 3.0 specification (though they appeared first in the briefly used Wireless USB specification). These capabilities are required for devices and generally describe USB-wide aspects of the device.

Topology Building Blocks

USB Topology is complicated for a few different reasons. The first is the fact a single physical port can show up as two different logical ports to the operating system. The second challenge is actually figuring out what all the ports are used for — if they’re used at all.

This second problem deserves a bit more explanation. Most systems have USB support from their platform chipset. The platform chipset implements a number of USB ports. For example, let’s look at one of the Intel 300 series chipsets, the Z390. If you look at the I/O specifications, it lists that it supports 14 USB ports, all of which can support USB 2.0 and some of which can support different forms of USB 3.1. Now, the standard system doesn’t actually have 14 USB ports all wired up, only a subset of them. Even the mobile chipsets are the same and I certainly don’t have 14 USB ports all over my laptop. This means that there are ports that the OS can see, but may not be used or wired up at all. Or they may be wired up to an internal hub.

While this is a challenge, it is surmountable. We’ll talk about a few of the things we can use to map ports together and a few things that also don’t work for us.


ACPI, the Advanced Configuration and Power Interface, provides a multitude of different capabilities to the system. However, there are a few that are specific to USB that are useful for us.

In ACPI, there is a notion of a tree of devices. Every device in the tree has properties and methods that the operating system can read and invoke which are provided by the platform firmware. When the operating system is looking for devices, it searches for ACPI devices in the same way it might search for PCI devices. In this tree, we’ll find three different relevant items: PCI devices, USB hubs, and USB ports. A USB host controller will be represented as a PCI device and it will have a child device, which is a USB hub, representing the root hub that the operating system sees. The hub will then have a port entry that corresponds to each logical port that the USB device has. If the platform has other hubs built into it (not ones that a user plugs in), then they might also be represented in the ACPI tree.

For each port in the tree, there are three attributes that we care about. The first is the _ADR method. It is a generic ACPI method that determines the address of a given object. The type of address will vary based on the type of the device. A PCI device would have its device and function number while a SATA device would have the port number. In the case of a USB port, it gives us the port number on the hub which corresponds to the logical view that the operating system will see. This gives us a way of correlating the ACPI port objects with the ports that the operating system sees.

The next thing that we use from ACPI is the optional _UPC method, which is used to return the USB port capabilities. It tells us two different types of information:

  1. Whether a device may be connected to the port or not.
  2. The type of USB port. For example, whether it’s a Type A or Type C connector.
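To make the two pieces of _UPC information concrete, here is a minimal sketch of decoding one. It assumes the method's result has already been turned into a sequence of four integers, with the connectable value first and the port type second. The type table is deliberately a small subset; anything not listed is reported as unknown rather than guessed. The function name and the dictionary layout are my own, not anything from illumos.

```python
# A subset of ACPI _UPC connector type codes. Values that are not listed
# here are reported as unknown rather than guessed.
UPC_PORT_TYPES = {
    0x00: "Type A connector",
    0x03: "USB 3 Standard-A connector",
    0xFF: "Proprietary connector",
}


def decode_upc(package):
    """Decode a _UPC result given as a sequence of four integers.

    Assumption: package[0] is the connectable value (non-zero means a
    device may be connected) and package[1] is the port type.
    """
    connectable, port_type = package[0], package[1]
    return {
        "connectable": connectable != 0,
        "type": UPC_PORT_TYPES.get(port_type, "unknown (0x%02x)" % port_type),
    }


if __name__ == "__main__":
    print(decode_upc([0xFF, 0x03, 0, 0]))
```

A port that reports itself as not connectable is a good hint that it isn't wired up to anything a person can reach.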

The next piece that we use is the _PLD method, which is the physical location of device. This method returns a binary description of the physical device information. It includes some information like the panel, orientation, and more. While theoretically useful, in practice, the binary payload makes it hard to really deterministically say something useful about the layout of the ports.

Now, you may ask: if it’s hard to say something sensible about it, then why bother using it at all? The answer to that lies in the xHCI specification. The xHCI specification says that if you want to map two logical ports on an xhci controller together, such as a USB 2 port and its corresponding USB 3 port, then you can actually use the physical location of device information to do so. If two ports have the same panel, horizontal and vertical position, shape, group orientation, position, and token, then they are the same port. This only works across a single controller. Unfortunately, you cannot map two ports together on different controllers this way.
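The matching rule can be sketched in a few lines. This is only an illustration: it assumes the binary _PLD payload has already been decoded into a dictionary, and the field names, the port dictionaries, and the function name are all hypothetical. The controller name is part of the key because the rule only holds within a single controller.

```python
from collections import defaultdict

# The _PLD fields that must all be equal for two logical ports on the same
# controller to be treated as one physical port. The names are hypothetical
# labels for already-decoded fields.
PLD_MATCH_FIELDS = (
    "panel",
    "horizontal_position",
    "vertical_position",
    "shape",
    "group_orientation",
    "group_position",
    "group_token",
)


def group_by_pld(ports):
    """Group logical ports that describe the same physical port.

    ports is a list of dicts with 'controller', 'port', and 'pld' keys.
    Returns a list of groups in first-seen order; each group is a list of
    "controller@port" strings.
    """
    groups = defaultdict(list)
    for p in ports:
        # Including the controller keeps ports on different controllers
        # apart even when their location data happens to be identical.
        key = (p["controller"],) + tuple(p["pld"][f] for f in PLD_MATCH_FIELDS)
        groups[key].append("%s@%d" % (p["controller"], p["port"]))
    return list(groups.values())


if __name__ == "__main__":
    rear = dict.fromkeys(PLD_MATCH_FIELDS, 1)
    front = dict(rear, panel=4)
    print(group_by_pld([
        {"controller": "xhci0", "port": 2, "pld": rear},
        {"controller": "xhci0", "port": 18, "pld": rear},
        {"controller": "xhci0", "port": 3, "pld": front},
    ]))
```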

Exposing Information

For each USB root controller and its corresponding ports, we end up creating a logical device node for it in the devices tree. ACPI devices are rooted under the fw node. Each USB root hub shows up under it with a way to map it to its corresponding PCI device. Under each hub is a port, and if there’s a hub under that, then another USB hub and its ports.

Each port has a series of properties that correspond to the various ACPI methods discussed above. More specifically, we have the following properties:

  1. acpi-address: The value of the address found through the _ADR method.
  2. acpi-physical-location: A byte array that corresponds to the raw ACPI values. The kernel does not try to interpret the data and instead that is done in user land.
  3. usb-port-connectable: A property that if present indicates that the port is connectable.
  4. usb-port-type: A property that indicates the ACPI USB port type.

These properties can all be read with the libdevinfo(3LIB) library, which allows software to take a snapshot of the tree, walk the various nodes, and read the properties of the different nodes.

A Building Block Not Taken: SMBIOS

SMBIOS is the system management BIOS. It provides tables of static information about the system. For example, lists of CPUs, memory devices, and more. One of the things I’ve enjoyed doing in illumos is keeping this information up to date and improving it as new releases of the specification come out. It’s proven to be invaluable for other efforts like labeling PCIe devices.

This time though, I mention SMBIOS because it’s something that one might normally think to use, but actually doesn’t work. One of the SMBIOS tables is a list of ports and what they connect to. Unfortunately, the SMBIOS tables usually refer to USB ports based on the headers on the motherboard. While this can be useful for some cases, it isn’t for what we care about — mapping ports that you plug devices into back to the corresponding ports that the operating systems see.

Enumerating USB Topology

With the different building blocks in place, let’s turn directions and now look at how we expose all of this information in the FMA topology trees. We’ll first look at what we expose and then we’ll come back and explain how that’s put together in FMA. We enumerate USB ports in FMA’s topology in three different groups:

  1. USB ports that we know correspond to the chassis
  2. USB ports that come from a PCIe add on card.
  3. All the remaining USB ports, which we place off of the motherboard.

For each port, we list the following information:

  1. The USB revisions the port supports, such as 2.0, 3.0, etc.
  2. The type of the port if we have ACPI information or a metadata file to tell us about it.
  3. Whether we consider the port connectable, visible, or disconnected.
  4. Information about whether we consider the port internal or external, if we have explicit metadata.
  5. A list of all of the logical ports this physical port represents. For example, if a port is wired up to an ehci controller and two ports on an xhci controller (one for USB 2.0 and one for USB 3.x), then we’ll list three different children here.
  6. A label that describes the port, if metadata provides it. This is a string that a person uses to know how to identify a port. For example, ‘Rear Upper Left USB’.

The following is an example of what a single USB port node might look like in fmtopo:

  group: protocol                       version: 1   stability: Private/Private
    resource          fmri      hc://:product-id=Joyent-S10G5:server-id=magma:chassis-id=S287161X8300740/chassis=0/port=0
    FRU               fmri      hc://:product-id=Joyent-S10G5:server-id=magma:chassis-id=S287161X8300740/chassis=0
    label             string    Rear Upper Left USB
  group: authority                      version: 1   stability: Private/Private
    product-id        string    Joyent-S10G5
    chassis-id        string    S287161X8300740
    server-id         string    magma
  group: port                           version: 1   stability: Private/Private
    type              string    usb
  group: usb-port                       version: 1   stability: Private/Private
    port-type         string    USB 3 Standard-A connector
    usb-versions      string[]  [ "2.0" "3.0" ]
    port-attributes   string[]  [ "user-visible" "external-port" ]
    logical-ports     string[]  [ "xhci0@2" "xhci0@18" ]
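If you want to consume the logical-ports values from a script, a small parser is enough. This sketch assumes each entry is a driver name immediately followed by its instance number, an '@', and a port number written in decimal. That reading is inferred from the example output above and is not a documented format; the function name is hypothetical.

```python
import re

# Driver name (which may itself contain digits), then the driver instance,
# then '@' and the port number. Assumption: both numbers are decimal.
LOGICAL_PORT_RE = re.compile(r"^(.+?)(\d+)@(\d+)$")


def parse_logical_port(entry):
    """Split an entry such as 'xhci0@18' into (driver, instance, port)."""
    m = LOGICAL_PORT_RE.match(entry)
    if m is None:
        raise ValueError("unrecognized logical port: %r" % entry)
    driver, instance, port = m.groups()
    return driver, int(instance), int(port)


if __name__ == "__main__":
    for entry in ("xhci0@2", "xhci0@18"):
        print(parse_logical_port(entry))
```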

Under a port, if a USB device is plugged in, we’ll list information about the device. This includes:

  1. The USB revision of the device. For example, this could be 1.1, 2.0, 2.1, 3.0, 3.1, 3.2, etc.
  2. The numeric vendor and device identifiers, which are used to inform the system about the device so the right driver can be attached.
  3. The revision ID of the device. This is a vendor-specific name.
  4. The device’s USB vendor and product name strings, if it provides them.
  5. The USB device’s serial number, if it has one.
  6. The speed of the device, for example super-speed, full-speed, etc. These represent the type of protocol speed that the system has.

Next, we’ll create a set of properties that describe the driver that’s attached to the device, if any. This is a standard property group that you’ll find on other nodes in the tree, such as a PCIe device. This includes:

  1. The name of the driver.
  2. The instance of the driver (a logical construct in the OS).
  3. The path of the driver in /devices.
  4. The module information for the driver, such as its FMRI (fault management resource identifier).

Here’s an example of the information for a device itself in fmtopo:

  group: protocol                       version: 1   stability: Private/Private
    resource          fmri      hc://:product-id=Joyent-S10G5:server-id=magma:chassis-id=S287161X8300740:serial=00241D8CE563C1B1E94FEBB4:part=DataTraveler-2.0:revision=100/motherboard=0/port=5/usb-device=0
    FRU               fmri      hc://:product-id=Joyent-S10G5:server-id=magma:chassis-id=S287161X8300740:serial=00241D8CE563C1B1E94FEBB4:part=DataTraveler-2.0:revision=100/motherboard=0/port=5/usb-device=0
    label             string    Internal USB
  group: authority                      version: 1   stability: Private/Private
    product-id        string    Joyent-S10G5
    chassis-id        string    S287161X8300740
    server-id         string    magma
  group: usb-properties                 version: 1   stability: Private/Private
    usb-port          uint32    0xa
    usb-vendor-id     int32     2352
    usb-product-id    int32     25924
    usb-revision-id   string    100
    usb-version       string    2.0
    usb-vendor-name   string    Kingston
    usb-product-name  string    DataTraveler 2.0
    usb-serialno      string    00241D8CE563C1B1E94FEBB4
    usb-speed         string    high-speed
  group: io                             version: 1   stability: Private/Private
    driver            string    scsa2usb
    instance          uint32    0x0
    devfs-path        string    /pci@0,0/pci15d9,981@14/storage@a
    module            fmri      mod:///mod-name=scsa2usb/mod-id=110
  group: binding                        version: 1   stability: Private/Private
    occupant-path     string    /pci@0,0/pci15d9,981@14/storage@a/disk@0,0
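The resource FMRI packs a fair amount into one string, so it can help to pull it apart. The following sketch splits an hc-scheme FMRI like the ones above into its authority properties and its path of (name, instance) pairs. It only handles the shape seen in this output and assumes no value contains a ':' or '/'; it is not a general FMRI parser and the function name is my own.

```python
def parse_hc_fmri(fmri):
    """Split an hc:// FMRI into (authority dict, list of (name, instance)).

    Assumption: the string looks like the fmtopo examples, that is
    'hc://:key=value:key=value/name=N/name=N', with no ':' or '/' inside
    any value.
    """
    prefix = "hc://"
    if not fmri.startswith(prefix):
        raise ValueError("not an hc-scheme FMRI: %r" % fmri)
    authority_str, _, path_str = fmri[len(prefix):].partition("/")

    authority = {}
    for item in authority_str.split(":"):
        if item:  # the authority begins with ':', so the first item is empty
            key, _, value = item.partition("=")
            authority[key] = value

    path = []
    for component in path_str.split("/"):
        name, _, instance = component.partition("=")
        path.append((name, int(instance)))
    return authority, path


if __name__ == "__main__":
    auth, path = parse_hc_fmri(
        "hc://:product-id=Joyent-S10G5:server-id=magma"
        ":chassis-id=S287161X8300740/chassis=0/port=0"
    )
    print(auth)
    print(path)
```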

Finally, based on the kind of device we encounter, we might enumerate child nodes. Right now there are two cases in which we’ll do so:

  1. If we encounter a hub and we have found its children, then we’ll enumerate them as we described above and any USB devices that we find under them.
  2. If we encounter a USB device that represents a disk, like an external hard drive or USB key, then we’ll set some additional properties on the node and call into the disk enumerator. This’ll create a disk node that can be used to map disk information back to the physical device.

Other Uses

While having the items in the tree makes it easier for us to see everything in the system, once we have location information and serial numbers, there are other ways we can use this. The tool diskinfo lists disks and, when we have it, the physical location of them and their serial number. For example, when using SATA and SAS drives, this can tell you which drive bay they’re in, whether it’s a front or rear drive, and more.

To get this information, the diskinfo program takes a snapshot of the system topology and then maps the discovered disks to the corresponding disk nodes in the system’s topology. We’ve done the same for these USB devices. So if we have topology information, we can tell you which USB device it is that’s plugged in. For example:

# diskinfo -P
DISK                    VID      PID              SERIAL               FLT LOC LOCATION
c1t0d0                  Kingston DataTraveler 2.0 00241D8CE563C1B1E94FEBB4 -   -   Internal USB
c2t5000CCA0496FCA6Dd0   HGST     HUSMH8010BSS204  0HWZGX6A             no  no  [0] Slot00
c2t5000CCA25319F125d0   HGST     HUH721212AL4200  8DGG88SZ             no  no  [0] Slot01
c2t5000CCA25318BE1Dd0   HGST     HUH721212AL4200  8DGELUWZ             no  no  [0] Slot02
c2t5000CCA2530F9CD5d0   HGST     HUH721212AL4200  8DG8L5JZ             no  no  [0] Slot03
c2t5000CCA25318BE15d0   HGST     HUH721212AL4200  8DGELUUZ             no  no  [0] Slot04

Constructing Topology

Now that we’ve talked about how we use topology and most of the operating system building blocks, it’s worth spending some time talking about how we actually build the USB topology itself.

We gather data from three different sources:

  1. A USB topology metadata file.
  2. Walking the devices tree looking for USB root hubs and their children (non-ACPI).
  3. Walking the ACPI firmware tree, looking for USB information.

Once we gather information from all three of these sources, we combine them all together to create a single, coherent map. We first map an ACPI node to its corresponding devcfg node. Then, if we’ve opted to map ports together based on ACPI (more on that in a little bit), then we’ll combine the different logical nodes together.

USB Metadata File

The topology USB metadata file allows us to create a per-vendor, per-product map of additional information. The file is a simple format that has keywords and arguments.

A file first identifies a given port. From a port, it will then provide additional metadata such as a label and whether or not it is internal or external. Next, if we need to override the ACPI port type either because it’s missing or incorrect, then we can do so here. Finally, a series of ACPI paths that describe the port are listed. This way, when a port has a USB 2.0 and a USB 3.x component, because we’ve listed both, we’ll be able to apply this metadata to either port.

Finally, there are a number of top-level directives. These describe the matching behavior that we’d like to use. We can do the following:

  1. Disable the use of ACPI entirely on this platform.
  2. Disable the use of ACPI matching. We’ve done this on platforms where we’ve determined that the ACPI information that the platform has is incorrect.
  3. We can enable matching based on the metadata information. This is useful in tandem with the above. Here, we use the ACPI paths to perform matching the same way that we did elsewhere.

Here’s a portion of a USB metadata file:

                Rear Lower Right USB

                Internal USB

This example has two ports present. Each port has a label which is used to identify where the port is for a human. The ‘internal’ and ‘external’ keywords are used to indicate whether the port is internal to the system or external. In this case, the ‘internal’ port is found on the motherboard of the system. So it cannot be serviced or used without opening the system. The ‘chassis’ label indicates that this port is found on the chassis of the system itself. This is where most USB ports are that a user would find and use.

The port-type here indicates that they are USB 3 Type-A connectors, meaning that they support both USB 2.0 and USB 3.x. Finally, the various ‘acpi-path’ entries are used to indicate the ACPI path towards the port. Note how the ports are labeled based on the names of the ACPI device nodes. Each ‘.’ separates each node. The starting ‘\’ character is just part of the constructed path, it is not an escape character.
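As a small illustration of how those paths are put together, this sketch splits one into its node names. The path in the example is hypothetical and is not taken from any particular platform's metadata file; the function name is likewise my own.

```python
def split_acpi_path(path):
    """Split an absolute ACPI path into its node names.

    The leading backslash marks the root of the ACPI namespace and each
    '.' separates one node name from the next.
    """
    if not path.startswith("\\"):
        raise ValueError("expected a path starting with a backslash: %r" % path)
    return path[1:].split(".")


if __name__ == "__main__":
    # Hypothetical path for a port under an xhci root hub.
    print(split_acpi_path(r"\_SB_.PCI0.XHCI.RHUB.HS01"))
```

Note the raw string in the example: in Python source a bare backslash would be treated as an escape, even though it is not one in the metadata file itself.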

Writing Your Own Map

The way I’ve done this for other platforms is to find a USB 2.0 device and a USB 3.0 device and plug each of them into every port in turn. At each point, I look at the information in FMA and in the devices tree with prtconf. By looking at the port numbers and what exists in the devices tree, one can, with a bit of manual work, piece together what’s required.

One challenge with doing this is when you’re on a system that has both the ehci and xhci controllers. Generally this is on Intel platforms from Sandy Bridge through Broadwell. In that case, you need to go into the BIOS and do this with xhci enabled and disabled in the BIOS. This will make sure that you can get all the ports connected to the ehci controller.

Further Reading

If you’re interested in the illumos implementation:

For more on the specifications mentioned:

  • The xHCI specification, currently revision 1.2.
  • The SMBIOS specification, currently revision 3.2.
  • The ACPI specification, revision 6.3.

Looking Ahead

While we’ve done some work, there’s more that we can do to improve the situation here in the future. This section discusses some of those future directions.

Container ID Capability

The Container ID capability is the first tool at our disposal. The container ID is a 128-bit UUID — a universally unique identifier. All USB 3.x hubs are required to implement this capability and it can be read from the USB binary object store.

The idea behind this capability is that a device will have the same container ID value regardless of the type of bus that it’s on. So even though a USB 3.x hub will appear as two distinct hubs to the operating system, if they’re the same device, they’ll have the same container ID UUID. This gives the operating system a way to map such devices together.

When the USB Container ID capability is found in the binary object store, we add a property to the device node that indicates the UUID. This translates into a 16-byte byte array on the node whose value is the UUID. With this in mind, the USB topology plugin could go ahead and find hubs with matching container IDs.
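Here is a rough sketch of what that matching could look like. It assumes we already have each hub's 16-byte container ID in hand; the hub names and the function name are hypothetical. The grouping compares raw bytes, so it doesn't depend on how the UUID is printed. The printed form assumes the bytes are in standard big-endian UUID order, which I have not checked against the specification.

```python
import uuid
from collections import defaultdict


def pair_hubs_by_container_id(hubs):
    """Find hubs that share a container ID.

    hubs is an iterable of (name, container_id) pairs where container_id
    is a 16-byte value. Returns a dict mapping the UUID string to the hub
    names that share it, leaving out IDs seen only once.
    """
    by_id = defaultdict(list)
    for name, raw in hubs:
        if len(raw) != 16:
            raise ValueError("container ID for %s is not 16 bytes" % name)
        by_id[bytes(raw)].append(name)
    # Assumption: the bytes are printed in big-endian UUID field order.
    return {
        str(uuid.UUID(bytes=raw)): names
        for raw, names in by_id.items()
        if len(names) > 1
    }


if __name__ == "__main__":
    shared = bytes(range(16))
    print(pair_hubs_by_container_id([
        ("hub-usb2", shared),
        ("hub-usb3", shared),
        ("other-hub", bytes(16)),
    ]))
```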

More Consumers of USB Topology

While we’ve enhanced FMA and some of the tools like diskinfo(1M), we can do more here. For example, tools like cfgadm(1M) could be enhanced to query topology information, when available, when listing devices in verbose mode.

If we have a mapping that we feel confident in, it could even make sense to add another alias under /dev to the device. Though these labels aren’t necessarily stable right now (as they’re meant for humans), so we’ll have to see what makes sense there.

Easier Tools to Build Topology Maps

Right now, it can take a bit of effort to build a topology map. It would be great if we had easy tooling for developing USB topology maps for different platforms that would walk someone through putting this together. It would also be useful if we had a way for a user to generate a topology for their system. That way, even if it’s something custom that’s been put together, it still isn’t too hard to put together a topology for their system.

What’s Next?

There’s a lot more to talk about with USB, topology, and hardware in general. If you’re interested in working on any of these aspects, reach out. I’m sure there’ll be more to do here as we have to deal with USB 3.2, Thunderbolt, and USB 4.0.

If you’d like to get involved, get in touch with the illumos community on IRC in #illumos on Freenode or a mailing list and I or someone else will help you out and see what we can do. As long as you’re willing to learn, receive feedback, and keep going despite difficulties, then it doesn’t matter what your experience is.

One of the stories that has stuck with me over the years came from a support case that a former colleague, Ryan Nelson, had point on. At Joyent, we had third parties running our cloud orchestration software in their own data centers with hardware that they had acquired and assembled themselves. In this particular episode, Ryan was diagnosing a case where a customer was complaining about the fact that networking wasn’t working for them. The operating system saw the link as down, but the customer insisted it was plugged into the switch and that a transceiver was plugged in. Eventually, Ryan asked them to take a picture of the back of the server, which is where the NIC (Network Interface Card) would be visible. It turned out that the transceiver looked like it had been run over by a truck and had been jammed in — it didn’t matter what NIC it was plugged into, it was never going to work.

As part of a broader push on datacenter management, I was thinking about this story and some questions that had often come up in the field regarding why the NIC said the link was down. These were:

  1. Was there actually a transceiver plugged into the NIC?
  2. If so, did the NIC actually support using this transceiver?

Now, the second question is a bit of a funny one. The NIC obviously knows whether or not it can use what’s plugged in, but almost every time, the system doesn’t actually make it easy to find out. A lot of NIC drivers will emit a message that goes to a system log when the transceiver is plugged in or the NIC driver first attaches, but if you’re not looking for that message or just don’t happen to be on the system’s console when that happens, suddenly you’re out of luck. You might also ask why are there transceivers that aren’t supported by a NIC, but that’s a real can of worms.

Anyways, with that all in mind, I set out on a bit of a journey and put together some more concrete proposals for what to do here in terms of RFD 89: Project Tiresias. We’ll spend the rest of this entry going into a bit of background on transceivers and then discuss how we go from knowing whether or not they’re plugged in to actually determining who made them and where they are in the system.

What is a transceiver?

We’ve been using the term 'transceiver' quite a bit so far, but that’s a pretty generic term. Let’s spend a bit of time talking through that. First, we’re really focused on transceivers as used in the context of networking. When people think of wired networking, the most common thing that comes to mind is Ethernet cables. Ethernet isn’t the only type of cable that’s been used. Before Ethernet was common, BNC coaxial cables were used on some NICs as well.

However, in the data center, Ethernet didn’t end up keeping up with the speeds and distances that connections were being used for (though 10 Gigabit Ethernet, 10GBASE-T, has started becoming more common). In this space, fiber-optic cables and copper twinaxial cables (twinax) are much more prominent. Note, twinaxial cables are rather different from their BNC coaxial relatives. Twinaxial cables are used more often when there are shorter distances to cover, such as between a top of rack switch and a server. Fiber optic cables often cover longer distances or have higher throughputs.

Because different types of cables are used in different situations, several vendors got together and agreed upon a set of standards to use when manufacturing these cables. This allowed NIC manufacturers to design a single physical and electrical interface, but still support different types of transceivers. These standards (technically a multi-source agreement) are maintained by the Small Form Factor (SFF) Committee. The committee manages standards not only for networking, but also for SAS cables and other devices.

If you’ve worked in this space, you may have heard of what are called SFP and SFP+ cables. These cables generally support 1 and 10 Gigabit networking respectively. The transceiver is controlled over an i2c bus by the NIC. The addresses and their meanings are standardized. They were originally standardized in a standard called INF-8074, but the current active standard for these devices is called SFF-8472.

With faster networking speeds, there have been additional revisions and standards put out. Devices that support 40 Gigabit networking are called QSFP+ because they combine 4 SFP+ devices. To support 25 Gigabit networking, a variant of SFP+ was created called SFP28. Finally, to support 100 Gigabit networking, 4 SFP28 devices were combined together into what is called QSFP28. The 40 Gigabit devices are standardized in SFF-8436 and the 100 Gigabit devices have their management interface described in SFF-8636.

The standards for various devices have somewhat similar layouts. They break data into a series of different pages of which a specific offset into the page can then be accessed via the NIC’s i2c bus. These pages contain some of the following information:

  • Control over the device and its configuration

  • Static manufacturing information such as the manufacturer’s name, the
    device’s name, the serial number, and more

  • Optional information about the health of the device such as the
    temperature, voltage, and more

The pages and addresses change from specification to specification, though a large amount of the data overlaps between them. The health information of the device is required when the connector is considered active (generally fiber-optic cables with lasers) and is optional when you have a passive device (such as a copper twinax cable).
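To give a feel for what the static manufacturing data looks like, here is a sketch that pulls a few identity fields out of an SFP/SFP+ page. The offsets are the ones I recall from the SFF-8472 base ID fields, so treat them as an assumption and check them against the specification before relying on them. The function and table names are hypothetical, and the page in the example is made up rather than read from real hardware.

```python
# (offset, length) of a few space-padded ASCII fields in the base ID page.
# Assumption: these follow the SFF-8472 layout; verify against the spec.
SFF8472_FIELDS = {
    "vendor_name": (20, 16),
    "part_number": (40, 16),
    "revision": (56, 4),
    "serial_number": (68, 16),
}


def parse_sff8472_identity(page):
    """Extract the identifier byte and manufacturing strings from a page."""
    if len(page) < 96:
        raise ValueError("need at least the first 96 bytes of the page")
    info = {"identifier": page[0]}
    for name, (offset, length) in SFF8472_FIELDS.items():
        raw = bytes(page[offset:offset + length])
        # Fields are padded on the right; strip spaces and stray NULs.
        info[name] = raw.decode("ascii", errors="replace").rstrip(" \x00")
    return info


if __name__ == "__main__":
    page = bytearray(256)
    page[0] = 0x03
    page[20:36] = b"EXAMPLE VENDOR".ljust(16)
    page[40:56] = b"EX-PART-10G".ljust(16)
    page[56:60] = b"A1".ljust(4)
    page[68:84] = b"SN0001".ljust(16)
    print(parse_sff8472_identity(page))
```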

The MAC Transceiver Capability

The first part of our adventure with getting to this data begins in the operating system kernel. Similar to the case of managing NIC LEDs, the networking device driver framework has an optional capability that a driver can implement to expose this information called MAC_CAPAB_TRANSCEIVER.

The transceiver capability has a structure that the device driver has to fill out which describes some basic information for dealing with transceivers. This includes the following fields:

  • The number of transceivers present on the device.
  • A function, mct_info() that allows one to get basic information about the transceiver.
  • A function, mct_read() that allows one to read i2c data from the device.

The driver first indicates the number of transceivers that are present for it. In general, this is one. However some devices actually support combining multiple transceivers and ports into one logical device — though this isn’t commonly used. The next item, the mct_info() function, is used to answer the two questions that were posed at the beginning of this: Does the NIC think a transceiver is present and can the NIC use the transceiver? Finally, the mct_read() function allows us to go through and read specific regions of the memory map of the transceiver. Generally, user land reads an entire 256-byte page at any given time.
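As a rough illustration of how those three pieces fit together, here is a small user-space model in C. It is only a sketch: the structure, the function signatures, the fake driver, and the 64-byte chunk limit are all assumptions made for the example rather than the real definitions from the mac framework. Only the names mct_info() and mct_read() and the 0xa0 page come from the description here.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define	DEMO_PAGE_SIZE	256
#define	DEMO_PAGE_A0	0xa0
#define	DEMO_MAX_CHUNK	64	/* hypothetical per-call hardware limit */

/* Hypothetical answer to "is it present?" and "can the NIC use it?" */
typedef struct demo_xcvr_info {
	int dxi_present;
	int dxi_usable;
} demo_xcvr_info_t;

/* Simplified model of the capability; the signatures are assumptions. */
typedef struct demo_capab_transceiver {
	unsigned int mct_ntransceivers;
	int (*mct_info)(void *, unsigned int, demo_xcvr_info_t *);
	int (*mct_read)(void *, unsigned int, unsigned int, void *, size_t,
	    size_t, size_t *);
} demo_capab_transceiver_t;

/* A fake driver whose "transceiver" is just a buffer in memory. */
typedef struct demo_driver {
	int dd_present;
	int dd_usable;
	uint8_t dd_page[DEMO_PAGE_SIZE];
} demo_driver_t;

static int
demo_info(void *arg, unsigned int id, demo_xcvr_info_t *infop)
{
	demo_driver_t *drv = arg;

	if (id != 0)
		return (EINVAL);
	infop->dxi_present = drv->dd_present;
	infop->dxi_usable = drv->dd_usable;
	return (0);
}

/* Read up to DEMO_MAX_CHUNK bytes of page 0xa0, starting at offset. */
static int
demo_read(void *arg, unsigned int id, unsigned int page, void *buf,
    size_t nbytes, size_t offset, size_t *nread)
{
	demo_driver_t *drv = arg;

	if (id != 0 || page != DEMO_PAGE_A0 || offset >= DEMO_PAGE_SIZE)
		return (EINVAL);
	if (!drv->dd_present)
		return (EIO);
	if (nbytes > DEMO_PAGE_SIZE - offset)
		nbytes = DEMO_PAGE_SIZE - offset;
	if (nbytes > DEMO_MAX_CHUNK)
		nbytes = DEMO_MAX_CHUNK;
	memcpy(buf, drv->dd_page + offset, nbytes);
	*nread = nbytes;
	return (0);
}

static const demo_capab_transceiver_t demo_capab = {
	.mct_ntransceivers = 1,
	.mct_info = demo_info,
	.mct_read = demo_read
};

/* Consumer side: check presence, then read the whole page in a loop. */
static int
demo_read_page(const demo_capab_transceiver_t *cap, void *drv,
    unsigned int id, uint8_t *out)
{
	demo_xcvr_info_t info;
	size_t off = 0;
	int ret;

	assert(cap != NULL && out != NULL);
	if (id >= cap->mct_ntransceivers)
		return (EINVAL);
	if ((ret = cap->mct_info(drv, id, &info)) != 0)
		return (ret);
	if (!info.dxi_present)
		return (ENOENT);

	while (off < DEMO_PAGE_SIZE) {
		size_t nread = 0;

		ret = cap->mct_read(drv, id, DEMO_PAGE_A0, out + off,
		    DEMO_PAGE_SIZE - off, off, &nread);
		if (ret != 0)
			return (ret);
		if (nread == 0)
			return (EIO);
		off += nread;
	}
	return (0);
}
```

The consumer asks about presence before reading and loops until the whole page has arrived, since a driver may only be able to return part of what was asked for in a single call.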

The kernel only facilitates reading information. It generally doesn’t try and make semantic sense of any of the data. That is purposefully left to user land — unless there’s a good reason to parse binary data in the kernel, you’re usually better off not doing that.

The following device families and drivers support the MAC_CAPAB_TRANSCEIVER capability. Some drivers that you may be more familiar with such as 1 Gigabit Ethernet devices aren’t on this list because they don’t support transceivers. Supported devices include:

  • Broadcom NetExtreme II 10 Gb devices based on the bnxe driver.
  • Chelsio T4, T5, and T6 10, 25, 40, and 100 Gbit devices based on the cxgbe driver.
  • Intel X520 10 Gbit SFP+ devices based on the ixgbe driver.
  • Intel 10, 25, and 40 Gbit SFP+, SFP28, and QSFP+ devices based on the i40e driver.
  • QLogic FastLinQ QL45xxx devices based on the qede driver.

The way that each driver gets access to this information varies from device to device. Some, like i40e and cxgbe, issue a firmware command to read information from the transceiver. Others have dedicated registers that can be programmed to read arbitrary data from the i2c device.

libsff and dltraninfo

Once we have the ability to read the actual data from the transceiver, we have to make logical sense of it. To handle that, the first thing I did was write a small library called libsff. The goal of libsff is to parse the various SFF binary data payloads and return structured data as a set of name-value pairs.

If you look at the header file, libsff.h, you’ll see a list of different keys that can be looked up. Some of these are rather straightforward, such as the string "vendor", which has the name of the manufacturer of the transceiver. Others are a bit more opaque and require referencing the actual SFF documents. Another useful feature of the library is that it tries to abstract out the differences between different versions of the specifications. The goal is that when there is similar data, it should always be found under the same key, even if it lives in wildly different parts of the memory map or has to be parsed differently. The goal of libraries (or really any interface and abstraction) should be to take something grotty and transform it into something more usable, as though reality were as simple as the interface presents.

The one thing that the library doesn’t generally do today is parse all of the sensor data that may be available on the transceiver. The main reason for this is that the vast majority of transceivers that I had access to did not implement it. On SFP, SFP+, and SFP28 devices, sensor information is optional for twinax-based devices. With a few devices to test with, it would be pretty straightforward to add support for it though.
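To give a sense of what that sensor parsing might involve, here is a minimal sketch in C of decoding the temperature and supply voltage from an SFF-8472 diagnostics page. This is not part of libsff. The offsets and units (a signed 16-bit temperature in 1/256 degree Celsius steps at bytes 96 and 97, and an unsigned 16-bit voltage in 100 microvolt steps at bytes 98 and 99) are my reading of the specification for internally calibrated devices, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed offsets into the SFF-8472 diagnostics page (address 0xa2). */
#define	SFF_8472_DIAG_TEMP	96
#define	SFF_8472_DIAG_VCC	98

/* Combine two big-endian bytes into a 16-bit value. */
static uint16_t
sff_be16(const uint8_t *p)
{
	return ((uint16_t)(((uint16_t)p[0] << 8) | p[1]));
}

/* Temperature in degrees Celsius; the raw value is signed, 1/256 C units. */
static double
sff_diag_temp_celsius(const uint8_t *diag)
{
	assert(diag != NULL);
	return ((int16_t)sff_be16(diag + SFF_8472_DIAG_TEMP) / 256.0);
}

/* Supply voltage in microvolts; the raw value is in 100 microvolt units. */
static uint32_t
sff_diag_vcc_microvolts(const uint8_t *diag)
{
	assert(diag != NULL);
	return ((uint32_t)sff_be16(diag + SFF_8472_DIAG_VCC) * 100);
}
```

A real implementation would first check the bits that say whether diagnostics are implemented at all and whether the device uses internal or external calibration, which this sketch skips.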

On its own, a library isn’t useful unless it has a consumer. The first consumer that I’ll discuss is dltraninfo. This is an unstable, development program that I wrote to exercise this functionality and to try and get a sense of what interfaces might be useful. There are two forms of the dltraninfo command. The first answers the questions that we laid out in the beginning about whether the transceiver is present or usable. When run this way, you see something like:

# /usr/lib/dl/dltraninfo
ixgbe0: discovered 1 transceiver
        transceiver 0 present: yes
        transceiver 0 usable: yes
ixgbe1: discovered 1 transceiver
        transceiver 0 present: yes
        transceiver 0 usable: yes
ixgbe2: discovered 1 transceiver
        transceiver 0 present: yes
        transceiver 0 usable: yes
ixgbe3: discovered 1 transceiver
        transceiver 0 present: yes
        transceiver 0 usable: yes

The next option is to read the information from the transceiver. Here’s an example of reading this on an Intel 10 Gbit fiber-optic transceiver:

# /usr/lib/dl/dltraninfo -v ixgbe1
ixgbe1: discovered 1 transceivers
        transceiver 0 present: yes
        transceiver 0 usable: yes
        Identifier: 'SFP/SFP+/SFP28'
        Extended Identifier: 4
        Connector: 'LC (Lucent Connector)'
        10G+ Ethernet Compliance Codes[0]: '10G Base-SR'
        Ethernet Compliance Codes[0]: '1000BASE-SX'
        Encoding: '64B/66B'
        BR, nominal: '10300 MBd'
        Length 50um OM2: '80 m'
        Length 62.5um OM1: '30 m'
        Length OM3: '300 m'
        Vendor: 'Intel Corp'
        OUI[0]: 0
        OUI[1]: 27
        OUI[2]: 33
        Part Number: 'FTLX8571D3BCV-IT'
        Revision: 'A'
        Laser Wavelength: '850 nm'
        Options[0]: 'Rx_LOS implemented'
        Options[1]: 'TX_FAULT implemented'
        Options[2]: 'TX_DISABLE implemented'
        Options[3]: 'RATE_SELECT implemented'
        Serial Number: 'AKR0EQ0'
        Date Code: '110618'
        Extended Options[0]: 'Soft Rate Select Control Implemented'
        Extended Options[1]: 'Soft RATE_SELECT implemented'
        Extended Options[2]: 'Soft RX_LOS implemented'
        Extended Options[3]: 'Soft TX_FAULT implemented'
        Extended Options[4]: 'Soft TX_DISABLE implemented'
        Extended Options[5]: 'Alarm/Warning flags implemented'
        8472 Compliance: 'Rev 10.2'

This allows us to interact with the information in a readable way. Effectively, this dumps out the entire name-value pair set that we construct when parsing data with libsff. There are two additional ways to print this data. The first one, -x, dumps out the data as hex data (kind of like if you run the program xxd). The second option -w writes out the first page, 0xa0, to a file. This allows you to take the raw data with you.

Seeing Transceivers in Topo

The next step with all this work is to expose the transceivers as part of the system topology in the fault management architecture (FMA). This is useful for a few reasons:

  1. It allows us to see what devices are present in the same snapshot as other devices like disks, CPUs, DIMMs, etc.
  2. FMA’s topology is a natural place for us to expose sensors.
  3. If a device is in topology, then we can generate error reports and faults against those devices.

Basically, being visible in the topology allows us to integrate it more fully into the system and makes it easy for various monitoring and inventory tools in the system to see these devices without having to make them aware of the underlying ways of getting data.

The topology information is organized as a tree. When we encounter hardware that we believe is a networking device (because its PCI class indicates it is), then we ask the kernel about how many transceivers it supports. For each transceiver, we create a port node under the NIC whose type indicates that it is intended for SFF devices.

When a transceiver is present, then we will place a transceiver node under the port. This node has two different groups of properties. The first is generic to all transceivers, which is where we indicate whether or not the hardware can use the transceiver. The second group are properties that we derive from the SFF specifications about the transceiver’s manufacturing data. This includes the vendor, part number, serial number, etc. The following block of text shows three different nodes: the NIC, the port, and the transceiver:

  group: protocol                       version: 1   stability: Private/Private
    resource          fmri      hc://:product-id=X9SCL-X9SCM:server-id=ivy:chassis-id=0123456789/motherboard=0/hostbridge=1/pciexrc=1/pciexbus=2/pc
    label             string    MB
    FRU               fmri      hc://:product-id=X9SCL-X9SCM:server-id=ivy:chassis-id=0123456789/motherboard=0
    ASRU              fmri      dev:////pci@0,0/pci8086,155@1,1/pci103c,17d3@0,1
  group: authority                      version: 1   stability: Private/Private
    product-id        string    X9SCL-X9SCM
    chassis-id        string    0123456789
    server-id         string    ivy
  group: io                             version: 1   stability: Private/Private
    dev               string    /pci@0,0/pci8086,155@1,1/pci103c,17d3@0,1
    driver            string    ixgbe
    module            fmri      mod:///mod-name=ixgbe/mod-id=242
  group: pci                            version: 1   stability: Private/Private
    device-id         string    10fb
    extended-capabilities string    pciexdev
    class-code        string    20000
    vendor-id         string    8086
    assigned-addresses uint32[]  [ 2197946640 0 3750756352 0 1048576 2164392216 0 57344 0 32 2197946656 0 3753902080 0 16384 ]

  group: protocol                       version: 1   stability: Private/Private
    resource          fmri      hc://:product-id=X9SCL-X9SCM:server-id=ivy:chassis-id=0123456789/motherboard=0/hostbridge=1/pciexrc=1/pciexbus=2/pc
    FRU               fmri      hc://:product-id=X9SCL-X9SCM:server-id=ivy:chassis-id=0123456789/motherboard=0/hostbridge=1/pciexrc=1/pciexbus=2/pc
  group: authority                      version: 1   stability: Private/Private
    product-id        string    X9SCL-X9SCM
    chassis-id        string    0123456789
    server-id         string    ivy
  group: port                           version: 1   stability: Private/Private
    type              string    sff

  group: protocol                       version: 1   stability: Private/Private
    resource          fmri      hc://:product-id=X9SCL-X9SCM:server-id=ivy:chassis-id=0123456789:serial=AKR0EQ0:part=FTLX8571D3BCV-IT:revision=A/motherboard=0/hostbridge=1/pciexrc=1/pciexbus=2/pciexdev=0/pciexfn=1/port=0/transceiver=0
    FRU               fmri      hc://:product-id=X9SCL-X9SCM:server-id=ivy:chassis-id=0123456789:serial=AKR0EQ0:part=FTLX8571D3BCV-IT:revision=A/motherboard=0/hostbridge=1/pciexrc=1/pciexbus=2/pciexdev=0/pciexfn=1/port=0/transceiver=0
  group: authority                      version: 1   stability: Private/Private
    product-id        string    X9SCL-X9SCM
    chassis-id        string    0123456789
    server-id         string    ivy
  group: transceiver                    version: 1   stability: Private/Private
    type              string    sff
    usable            string    true
  group: sff-transceiver                version: 1   stability: Private/Private
    vendor            string    Intel Corp
    part-number       string    FTLX8571D3BCV-IT
    revision          string    A
    serial-number     string    AKR0EQ0

If you plug in a transceiver dedicated to fibre channel, then we’ll properly note that we can’t use the transceiver by setting the usable property to false. The following is an example of the transceiver node in that case:

  group: protocol                       version: 1   stability: Private/Private
    resource          fmri      hc://:product-id=X10SLM+-LN4F:server-id=haswell:chassis-id=0123456789/motherboard=0/hostbridge=0/pciexrc=0/pciexbus=1/pciexdev=0/pciexfn=1/port=0/transceiver=0
    FRU               fmri      hc://:product-id=X10SLM+-LN4F:server-id=haswell:chassis-id=0123456789/motherboard=0/hostbridge=0/pciexrc=0/pciexbus=1/pciexdev=0/pciexfn=1/port=0/transceiver=0
  group: authority                      version: 1   stability: Private/Private
    product-id        string    X10SLM+-LN4F
    chassis-id        string    0123456789
    server-id         string    haswell
  group: transceiver                    version: 1   stability: Private/Private
    type              string    sff
    usable            string    false

Further Reading

If you’d like to read more on this, there are a couple of different places that I can send you.

For more on the SFF standards, there’s:

  • SFF-8472 which describes the SFP/SFP+ devices.
  • SFF-8436 which describes QSFP+ devices.
  • SFF-8636 which describes the memory map of QSFP28 devices.

If you’re interested in the illumos implementation, there’s:

Looking Ahead

If you have a favorite NIC that uses SFP-based transceivers and it isn’t supported, reach out and we’ll see what we can do. If you’d find it interesting to work on exposing more of the sensor information present in the SFPs, then we’d be happy to further mentor someone there. Once these pieces are exposed in topology, it could also make sense to wire up the FRU monitor to watch for temperature thresholds, voltage drops, or device faults.

Up next, we’ll talk about understanding the topology of USB devices.

It was the brightest of LEDs, it was the darkest of LEDs, it was the age of data links, it was the age of AHCI enclosure services, …

Today, I’d like to talk about two aspects of a project that I worked on a little while back under the aegis of RFD 89 Project Tiresias. This project covered improving the infrastructure for how we gathered and used information about various components in the system. So, let’s talk about LEDs for a moment.

LEDs are strewn across systems and show up on disks, networking cards, or just to tell us the system is powered on. In many cases, we rely on the blinking lights of a NIC, switch, hard drive, or another component to see that data is flowing. The activity LED is a mainstay of many devices. However, there’s another reason that we want to be able to control the LEDs: for identification purposes. If you have a rack of servers and you’re trying to make sure you pull the right networking cable, it can be helpful to be able to turn a LED on, off, or blink it with a regular pattern. So without further ado, let’s talk about how we control LEDs for network interface cards (NICs or data links) and a class of SATA hard drives.


The first class of devices that I’d like to talk about are networking cards. Most every NIC that you can plug an Ethernet cable or a copper/fiber-optic transceiver in has at least two LEDs for each port: an activity LED, which indicates that traffic is flowing over the link, and one that indicates that the link is actually up. Sometimes different colors are used to indicate different speeds. For example, a 25 GbE capable device may use an orange color to indicate that the link is operating at 10 GbE and a green one to indicate that it is operating at 25 GbE.

The first challenge we have in the operating system is to figure out how to actually tell the device to cause its LEDs to behave in a specific way. Unfortunately, practically every NIC (or family of NICs) has its own unique way of doing this and well, just about everything else. This is the operating system’s curse and one of its primary responsibilities — designing abstractions for device drivers that make it easy to enable new hardware and take advantage of unique features that different devices offer while minimizing the amount of work that is required to do so.

In illumos, there is a framework for networking device drivers called mac. The mac framework is named after the device driver that implements it, mac(9E). Sometimes you’ll hear it called by an alternative name: ‘Generic LAN Device version 3’ (GLDv3). The mac framework uses the concept of capabilities, which represent different features that hardware offers. For example, this includes such things as checksum offload, TCP segmentation offload (TSO), and more. Each capability can have arbitrary data associated with it that describes more information about the capability itself.

For controlling the LEDs, there’s a new capability called MAC_CAPAB_LED. With the capability, a driver has to fill in three pieces of information:

  1. A set of flags to indicate optional behavior. Currently none are defined, but this is here for future expansion.
  2. A set of supported modes that the LED can be set to. This includes:
    • MAC_LED_ON which indicates that the LED should be turned on solidly.
    • MAC_LED_OFF which indicates that the LED should be turned off.
    • MAC_LED_IDENT which indicates that the LED should be set in a way to identify the device. Generally this means that it should blink. When this can use a different color, that’s even better.
  3. A function that the operating system should call to set the LED state.

The structure looks something like:

typedef struct mac_capab_led {
    uint_t mcl_flags;
    mac_led_mode_t mcl_modes;
    int (*mcl_set)(void *driver, mac_led_mode_t mode, uint_t flags);
} mac_capab_led_t;

Basically, when the operating system wants to change the LED state, it’ll end up calling the mcl_set function to set the new mode. When the LED state should be changed back to normal, a fourth state is passed: MAC_LED_DEFAULT. The operating system guarantees that it will never call this function in parallel for the driver to try to simplify the programming model, though the driver will likely have I/O ongoing.
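To sketch what the driver side of this could look like, here is a small self-contained C example that fills in the structure above for a made-up NIC. The numeric values of the mac_led_mode_t constants, the demo_nic_t soft state, and the register encodings are all assumptions for illustration. A real driver would program its hardware where this one just stores a value.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Local stand-ins so the example compiles on its own. */
typedef unsigned int uint_t;

/* The mode names come from the text; the bit values are assumptions. */
typedef enum mac_led_mode {
	MAC_LED_DEFAULT = 0,
	MAC_LED_OFF = 1 << 0,
	MAC_LED_IDENT = 1 << 1,
	MAC_LED_ON = 1 << 2
} mac_led_mode_t;

typedef struct mac_capab_led {
	uint_t mcl_flags;
	mac_led_mode_t mcl_modes;
	int (*mcl_set)(void *driver, mac_led_mode_t mode, uint_t flags);
} mac_capab_led_t;

/* Hypothetical encodings for a hypothetical LED control register. */
#define	DEMO_LED_HW	0	/* hardware drives link/activity as usual */
#define	DEMO_LED_FORCE_ON	1
#define	DEMO_LED_FORCE_OFF	2
#define	DEMO_LED_BLINK	3

/* Hypothetical per-device driver state. */
typedef struct demo_nic {
	mac_led_mode_t dn_led_mode;
	uint_t dn_led_reg;
} demo_nic_t;

/* Called to change the LED state; never called in parallel. */
static int
demo_led_set(void *arg, mac_led_mode_t mode, uint_t flags)
{
	demo_nic_t *nic = arg;

	assert(nic != NULL);
	if (flags != 0)
		return (EINVAL);	/* no flags are defined yet */

	switch (mode) {
	case MAC_LED_DEFAULT:
		nic->dn_led_reg = DEMO_LED_HW;
		break;
	case MAC_LED_ON:
		nic->dn_led_reg = DEMO_LED_FORCE_ON;
		break;
	case MAC_LED_OFF:
		nic->dn_led_reg = DEMO_LED_FORCE_OFF;
		break;
	case MAC_LED_IDENT:
		nic->dn_led_reg = DEMO_LED_BLINK;
		break;
	default:
		return (ENOTSUP);
	}
	nic->dn_led_mode = mode;
	return (0);
}

/* Fill in the capability when the framework asks for it. */
static void
demo_fill_led_capab(mac_capab_led_t *ledp)
{
	ledp->mcl_flags = 0;
	ledp->mcl_modes = MAC_LED_ON | MAC_LED_OFF | MAC_LED_IDENT;
	ledp->mcl_set = demo_led_set;
}
```

Because the callback is never invoked in parallel, the sketch does not take a lock around the LED state, though a real driver would still need to coordinate with whatever else touches the same registers.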

Some devices, such as older Intel client parts based on the e1000g driver, don’t actually have blink functionality. As such, the driver emulates it today, though it would be good to move that emulation into the broader mac framework when another driver needs it.

The following devices and drivers currently have support for this functionality:

  • Chelsio T4, T5, and T6 10, 25, 40, and 100 GbE parts based on the cxgbe driver
  • Intel 1 GbE Client and server parts based on the e1000g and igb drivers
  • Intel 10 GbE SFP/Copper devices in the X520, X540, and X550 families based on the ixgbe driver
  • Intel 10, 25, and 40 GbE devices in the X710 and X722 family based on the i40e driver

Plumbing it up in user land

With all of the above hardware support, there’s currently a private utility in illumos to control these called dlled that can be found at /usr/lib/dl/dlled. Now, you might ask: why is the program hiding in /usr/lib/dl? The main reason is that we’re still not sure what the right interface for controlling this should be. We should likely integrate it into the dladm(1M) command and allow the LEDs to be controlled through the Fault Management Architecture like we do with other LEDs. However, until we figure out exactly what we want, this gives us something to experiment with.

If you run it on a system, you’ll see something like:

# /usr/lib/dl/dlled
LINK                 ACTIVE       SUPPORTED
igb0                 default      default,off,on,ident
igb1                 default      default,off,on,ident
igb3                 default      default,off,on,ident
igb2                 default      default,off,on,ident

From here, we can use the -s option to then control and change the state. If you set the state, it should persist across someone unplugging or reinserting a cable in the device, but nothing that’s set will persist across a reboot of the system. If we set igb0 to ident mode, then we’ll see that the current state is updated:

# /usr/lib/dl/dlled -s ident igb0
# /usr/lib/dl/dlled
LINK                 ACTIVE       SUPPORTED
igb0                 ident        default,off,on,ident
igb1                 default      default,off,on,ident
igb3                 default      default,off,on,ident
igb2                 default      default,off,on,ident

While we have a video to demo this in action, for the moment let’s instead talk about another class of LEDs.


Many systems have an AHCI controller built into their primary chipset. The AHCI Controller is used to attach SATA hard disk drives (HDDs) and solid state drives (SSDs). AHCI stands for ‘Advanced Host Controller Interface’. It is a specification that describes how to discover attached devices and send commands to HDDs and SSDs.

Now, you may be saying to yourself, I’ve seen a hard drive or an SSD, but I don’t recall seeing any LEDs on them. And that’s right. Unlike NICs, which have the LEDs built in, the LEDs aren’t built into the drives, but into enclosures — the bays on the system that house drives.

LED Modes

When dealing with hard drives, there have historically been four different things that the system wants to communicate:

  1. Whether the drive in the bay has a fault: either it is no longer working or it is OK.
  2. That the drive in the bay is ‘OK to remove’.
  3. To identify a specific bay by blinking the LED.
  4. To show that there is activity to the device in the bay.

Typically, the first three items share a single LED, while a second LED is used to drive activity. A side effect of this is that somewhere in either hardware or firmware there is a hierarchy for the first three tasks. For example, if nothing else is going on then the drive’s LED may be a solid green color. If the drive has been faulted, it may instead turn to an amber color. Blinking the LED may be in the amber color or it may be something else entirely.

In the majority of cases, the activity LED isn’t controllable by software; however, the other LED can be. This means that when say ZFS decides a drive is bad, it can eventually lead to the operating system turning on an amber LED to indicate where a problem is.

While we have mentioned the ‘OK to remove’ LED, that has become less and less commonly used and implemented. It’s listed here for completeness, but may not actually be something that you can control or even see on some systems.

Enclosure Management

As we mentioned earlier, the LEDs are part of the bays and not part of the drives. This has led to a suite of different standards and means for enclosure management. The one that applies will actually vary depending on the wiring of the system. For example, while the same 2.5 inch drive bay can be used for SATA, SAS, and NVMe devices, the way that the enclosure is controlled varies a lot based on the disk controller and the system’s broader design.

In systems that are using SAS controllers (regardless of whether one uses SATA or SAS drives), there is a specification that describes how to enumerate the different bays that devices can show up in, figure out which drives are in which bays, and control the LEDs and get other information. In a SAS world, this is called SCSI enclosure services or SES. When using an all NVMe system, SES won’t exist. Similarly, when you’re using SATA devices and an AHCI controller, then something different ends up happening. In fact, you’re not even guaranteed that there will be a SES device even in a SAS world!

The AHCI specification provides optional enclosure management features. If you want to follow along in the AHCI specification, open up to chapter 12, Enclosure Management. There are three primary items related to enclosure management:

  • A capability bit to indicate whether or not enclosure services are supported.
  • A register that describes the capabilities and controls sending messages.
  • A region of memory for sending and receiving messages.

This region of memory that can be used to send and receive messages is used because the standard allows the system to participate in one of four different messaging schemes. Messages specific to the underlying scheme are placed in a region of memory and if the scheme supports replies, it’ll place replies in there.

Now, the specification supports capability bits for up to four different schemes:

  1. SAF-TE protocol
  2. SES-2 protocol
  3. SGPIO
  4. The AHCI specification’s LED format

Despite all of the different forms mentioned above, we’ve yet to discover a system that indicates support for something other than the LED format, though we have found one or two systems that incorrectly say they support the LED format and don’t. As a result, the LED format is the only one that we currently support.

The message format is pretty straightforward. You indicate which port on the controller you care about and then indicate whether you want to enable the identification LED or the fault LED.
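As a rough sketch of what building such a message might look like, here is a small C function that packs a port number and the two LED states into a 32-bit word. The bit positions (port in the low byte, a 16-bit value in the upper half with the identify field starting at bit 3 and the fault field at bit 6) are my reading of the LED message layout in chapter 12 of the AHCI specification and should be checked against it. The names are hypothetical, and the separate message header that precedes this word is not shown.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed positions of each LED's field within the 16-bit value. */
#define	DEMO_AHCI_LED_IDENT_SHIFT	3
#define	DEMO_AHCI_LED_FAULT_SHIFT	6
#define	DEMO_AHCI_LED_ON	1

/* Assumed position of the 16-bit value within the 32-bit message. */
#define	DEMO_AHCI_LED_VALUE_SHIFT	16

/* AHCI controllers expose at most 32 ports. */
#define	DEMO_AHCI_MAX_PORTS	32

/*
 * Build the LED message word for one controller port. A non-zero 'ident'
 * or 'fault' asks for that LED to be turned on; zero leaves it off.
 */
static uint32_t
demo_ahci_led_msg(uint8_t port, int ident, int fault)
{
	uint32_t value = 0;

	assert(port < DEMO_AHCI_MAX_PORTS);
	if (ident)
		value |= DEMO_AHCI_LED_ON << DEMO_AHCI_LED_IDENT_SHIFT;
	if (fault)
		value |= DEMO_AHCI_LED_ON << DEMO_AHCI_LED_FAULT_SHIFT;

	return ((value << DEMO_AHCI_LED_VALUE_SHIFT) | port);
}
```

Passing zero for both LEDs corresponds to putting the port back into its default state.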

Ultimately though, whether or not any of this works at all depends on the manufacturer of the system and what they wire up. Hopefully, if they advertise the LED format support, then we’ll be able to properly drive it.

Wiring it up

To make all of this work, the first thing we had to do was to add support to the ahci(7D) driver to make it detect whether or not the controller had enclosure services and provide a means to control it. The state and capabilities are all managed through a pair of private ioctl(2)s. The first allows one to get information about what’s supported, the number of ports, and the current state of each port. The second sets the state of a specific port.

Hardware has the constraint that only a single message can be sent at any given time. Therefore the driver serializes these requests and centralizes all the activity in a task queue. In that taskq, we’ll then attempt to send and receive the messages that hardware expects. You can see the details of creating the message and writing it to hardware in the ahci_em_set_led() function.

Don’t worry. You don’t have to try and write ioctls yourself. Similar to the dlled private command, there’s a private command to manipulate this. Let’s take a look at it:

# /usr/lib/ahci/ahciem
PORT                 ACTIVE       SUPPORTED
ahci0/0              default      ident,fault,default
ahci0/1              default      ident,fault,default
ahci0/2              default      ident,fault,default
ahci0/3              default      ident,fault,default
ahci0/4              default      ident,fault,default
ahci0/5              default      ident,fault,default

You can set the state of an individual port here in the same way as with the dlled command. The ‘default’ state is the port’s normal behavior, with neither the ident nor the fault LED turned on.

Mapping LEDs to bays and disks

Now, there’s still a problem. We haven’t actually solved the problem of mapping up the ports described above to actual disks. Because this is being driven by the AHCI controller, it only provides us the means of toggling this on its logical ports. While we can know what disk is attached to what port, it doesn’t actually tell us where that disk physically is.

In illumos, we have the ability to construct a per-platform map that defines additional information about the topology of the system. It helps us answer the question of what is where. My colleague Jordan Hendricks was the one who solved this problem for us. She added support for us to relate specific bays that we declare in a per-platform topology map to the corresponding port. This allows us to answer the mapping question as well as control the LEDs through the topology like we do for other parts of the system. It’s great work that takes what was a building block and actually makes it useful in the broader system. You can find a lot of examples of it in the illumos bug tracker and you should read the code itself!

See it in action

When I had a working demo of the ability to control NIC LEDs, I ended up recording an impromptu demo with the help of my occasional partner in crime, Alex Wilson.

What’s Next?

If you’re interested in adding support for NIC LEDs to another device that you have, get in touch with the illumos community on IRC in #illumos on Freenode or a mailing list and I or someone else will help you out and see what we can do. Ultimately, both of these changes are building blocks that can be built on top of. Jordan’s work is an example of that in the AHCI case. There’s more that can be built on top of the basic NIC functionality as well.

Next time, we’ll touch on another piece of related work: understanding the state of copper and fiber-optic transceivers for different NICs and reading their EEPROMs.
