Tales from a Core File

Search
Close this search box.

I recently found myself having to dive into the depths of SMF — The SunOS (illumos / Solaris) Service Management Framework — to debug a nasty race condition between svccfg import and svcadm enable -s. Understanding what happened sent me chasing around and dealing with a cheerful cast of characters that you might or might not expect: svc.configd, svc.startd, the EMI (early manifest import) service, and the ON build process. I found myself digging and doing a lot of reading to understand how all these different pieces worked together and communicated, which made me realize that this would be incredibly useful for the next person (really when I forget) who has to make another trip back into this important yet quite complicated subsystem.

The Problem

We had a heavily loaded system that was doing boot up and initializing lots of zones. This was running on VMware Fusion, which while great for development, is understandably not a performance king. During this process we have lots of scripts that do something similar to the following shell snippet:

# svccfg import service.xml
# svcadm enable -s service
svcadm: svc:/SERVICE/:default is misconfigured (lacks "restarter" property group)

Well, that’s a problem. Now, you might say that obviously our manifest is misconfigured, but that actually isn’t the case. Manifests optionally may specify a restarter property group. If they don’t, svc.startd takes control of restarting the instance. This is what the majority of services want so the problem here isn’t that we didn’t specify the restarter group, but for some reason it’s missing after we imported! Before we can explain what actually happened and how to fix it, we need to do a bit of an explanation for how SMF works and communicates. Keep in mind I didn’t write SMF, so there may be one or two oversights.

Rough SMF Architecture

There are a few different components that make up SMF and are responsible for different pieces of functionality that are used:

  • svc.startd(1M): A daemon that is in charge of starting, stopping, and restarting services and instances
  • svc.configd: A daemon that manages the repository that stores information, property groups, and state of the different services and instances
  • libscf(3LIB): A C library that provides the glue for communicating, accessing, and updating information about services and instances
  • svccfg(1M): A utility to add and remove services as well as change the properties associated with different services and instances.
  • svcadm(1M): A utility to control the different instance of a service. You can use this to enable and disable them among some other useful things.
  • svcs(1): A utility that reports on the status of various services on the system

Now how all of these work together is far from simple, in fact it can be quite confusing. Here’s a block diagram I put together that helps explain everything and how they all communicate:

/*
 * The SMF Block Diagram
 *                                                       Repository
 *   This attempts to show       ___________             __________
 *   the relations between       |         |     SQL     |        |
 *   the different pieces        | configd |<----------->| SQLite |
 *   that make SMF work and      |         | Transaction |        |
 *   users/administrators        -----------             ----------
 *   call into.                  /|\    /|\
 *                                |      |
 *                   door_call(3C)|      | door_call(3C)
 *                                |      |
 *                               \|/    \|/
 *      ____________     __________      __________      ____________
 *      |          |     |        |      |        |      |  svccfg  |
 *      |  startd  |<--->| libscf |      | libscf |<---->|  svcadm  |
 *      |          |     | (3LIB) |      | (3LIB) |      |   svcs   |
 *      ------------     ----------      ----------      ------------
 *       /|\    /|\
 *        |      | fork(2)/exec(2)
 *        |      | libcontract(3LIB)
 *       \|/    \|/                          Various System/User services
 *       ---------------------------------------------------------------------
 *       | system/filesystem/local:default      system/coreadm:default       |
 *       | network/lookpback:default            system/zones:default         |
 *       | network/ntp:default                  system/cron:default          |
 *       | smartdc/agent/ca/cainstsvc:default   network/ssh:default          |
 *       | appliance/kit/akd:default            system/svc/restarter:default |
 *       ---------------------------------------------------------------------
 */

Chatting with configd and sharing repository information

As you run commands with svcs, svccfg, and svcadm, they are all creating a libscf handle to communicate with configd. As calls are made via libscf they ultimately go and talk to configd to get information. However, how we actually are talking to configd is not as straightforward as it appears.

When configd starts up it creates a door located at /etc/svc/volatile/repository_door. This door runs the routine called main_switcher() from usr/src/cmd/svc/configd/maindoor.c. When you first invoke svc(cfg|s|adm), one of the first things that occurs is creating a scf_handle_t and binding it to configd by calling scf_handle_bind(). This function makes a door call to configd and gets returned a new file descriptor. This file descriptor is itself another door which calls into configd’s client_switcher(). This is the door that is actually used when getting and fetching properties, and many other useful things.

svc.startd needs a way to notice the changes that occur to the repository. For example, if you enabled a service that was not previously running, it’s up to startd to notice that this has happened, check dependencies, and eventually start up the service. The way it gets these notifications is via a thread who’s sole purpose in life is to call _scf_notify_wait(). This function acts like poll(2) but for changes that occur in the repository. Once this thread gets the event, it dispatches it handles the event appropriately.

The Events of svc.startd

svc.startd has to handle a lot of complexity. Understanding how you go from getting the notification that a service was enabled to actually enabling it is not obvious from a cursory glance. The first thing to keep in mind is that startd maintains a graph of all the related services and instances so it can keep track of what is enabled, what dependencies exist, etc. all so that it can answer the question of what is affected by a change. Internally there are a lot of different queues for events, threads to process these queues, and different paths to have events enter these queues. What follows is a diagram that attempts to explain some of those paths, though it’s important to note that for some of these pieces, such as the graph and vertex events, there are many additional ways and code paths these threads and functions can take. And yes, restarter_event_enqueue() is not the same thing as restarter_queue_event().

/*
 *   Threads/Functions                 Queues                  Threads/Functions
 *
 * called by various
 *     ------------------             ---------                  ---------------
 * --->| graph_protocol | graph_event | graph |   graph_event_   | graph_event |
 * --->| _send_event()  |------------>| event |----------------->| _thread     |
 *     ------------------ _enqueue()  | queue |   dequeue()      ---------------
 *                                    ---------                         |
 *  _scf_notify_wait()                               vertex_send_event()|
 *  |                                                                  \|/
 *  |  --------------------                              ----------------------
 *  |->| repository_event | vertex_send_event()          | restarter_protocol |
 *     | _thread          |----------------------------->| _send_event()      |
 *     --------------------                              ----------------------
 *                                                          |    | out to other
 *                restarter_                     restarter_ |    | restarters
 *                event_dequeue() -------------  event_     |    | not startd
 *               |----------------| restarter |<------------|    |------------->
 *              \|/               |   event   |  enqueue()
 *      -------------------       |   queue   |             |------------------>
 *      | restarter_event |       -------------             ||----------------->
 *      | _thread         |                                 |||---------------->
 *      -------------------                                 ||| start/stop inst
 *               |               ----------------       ----------------------
 *               |               |   instance   |       | restarter_process_ |
 *               |-------------->|    event     |------>| events             |
 *                restarter_     |    queue     |       | per-instance lwp   |
 *                queue_event()  ----------------       ----------------------
 *                                                          ||| various funcs
 *                                                          ||| controlling
 *                                                          ||| instance state
 *                                                          |||--------------->
 *                                                          ||---------------->
 *                                                          |----------------->
 */

What’s important to take away is that there is a queue for each instance on the system that handles events related to dealing directly with that instance and that events can be added to it because of changes to properties that are made to configd and acted upon asynchronously by startd.

How does the restarter property group show up

The last thing that we wanted to answer was where does the restarter property actually get set if it is not specified. While looking around the source code, I finally came across an interesting function: libscf_inst_get_or_add_pg. This function was getting called in a few various places and specifies the restarter property group. However, none of this is done in configd or svccfg when you import the manifest. Rather it is all taken care of by startd asynchronously.

To test that this was getting called when you imported a service for the first time and verify that this was getting called by startd, I used the following DTrace snippet that utilizes the pid provider. For more on how to use it, consult Brendan’s blog articles on the pid provider.

[root@headnode (coal:0) ~]# dtrace -n 'pid8::libscf_inst_get_or_add_pg:entry{
printf("%s", copyinstr(arg1)); ustack(); }'
dtrace: description 'pid8::libscf_inst_get_or_add_pg:entry' matched 1 probe
CPU     ID                    FUNCTION:NAME
  0  82690  libscf_inst_get_or_add_pg:entry restarter
              svc.startd`libscf_inst_get_or_add_pg
              svc.startd`libscf_note_method_log+0x6c
              svc.startd`method_run+0x132
              svc.startd`method_thread+0x184
              libc.so.1`_thrp_setup+0x9b
              libc.so.1`_lwp_start

  0  82690  libscf_inst_get_or_add_pg:entry restarter
              svc.startd`libscf_inst_get_or_add_pg
              svc.startd`libscf_note_method_log+0x6c
              svc.startd`method_run+0x132
              svc.startd`method_thread+0x184
              libc.so.1`_thrp_setup+0x9b
              libc.so.1`_lwp_start

  0  82690  libscf_inst_get_or_add_pg:entry restarter
              svc.startd`libscf_inst_get_or_add_pg
              svc.startd`libscf_write_start_pid+0x6e
              svc.startd`method_run+0x43a
              svc.startd`method_thread+0x184
              libc.so.1`_thrp_setup+0x9b
              libc.so.1`_lwp_start

  1  82690  libscf_inst_get_or_add_pg:entry restarter
              svc.startd`libscf_inst_get_or_add_pg
              svc.startd`libscf_write_method_status+0xbc
              svc.startd`write_status+0x2f
              svc.startd`method_run+0x616
              svc.startd`method_thread+0x184
              libc.so.1`_thrp_setup+0x9b
              libc.so.1`_lwp_start

From this, we see that as a part of getting ready to actually run the specified instance we’re writing out the restarter property group. Thus svccfg should not return until this this property group has been added by startd otherwise we will see invalid state that causes the tools like svcs and svcadm to complain.

The fix and some gotchas

So, the fix here is actually pretty straightforward. What we want to do is after we have imported all of the services and instances associated with a given manifest, we want to verify that every service and instance has a restarter property group. They will have this property group regardless of whether the instance is enabled, disabled, in maintenance, or can’t start due to missing dependencies. The logic here is very simple, iterate over each service and instance specified in the manifest and don’t move on until we can retrieve that property group. Once we can, move onto the next instance. This is pretty straightforward, but there are two times when this logic surprisingly breaks that we have to watch out for and special case.

Native Build

I discovered that as a part of the build process for ON, there is a phase where it builds a version of svc.configd and svccfg which it calls svc.configd-native and svccfg-native. These create initial repositories for the system. However, they are designed to run separately from the normal series of configd and startd that are on the system. In fact, there is no native startd while the native configd and svccfg are running. If we did this check, the restarter property groups will never be created and the build will always spin forever. The only solution is to not do the check. There are a few other places throughout configd and svccfg that already have to deal with the fact that we’re using the same source base and running it in two very different environments. We can work around it by using the preprocessor directive NATIVE_BUILD and a few #ifdefs. I did not introduce that directive, it was already being used liberally in configd and in a few places in svccfg.

Early Manifest Import

PSARC 2010/013 SMF Early Manifest Import introduced a substantial change in when various manifest are imported into the repository during boot. In this case svc.startd purposefully does not listen for notifications from configd while it is running EMI. This has two important ramifications:

  • We cannot check for the restarter property while EMI is running its start method.
  • This race condition could occur after the EMI service finishes and startd starts acting upon all the changes.

To deal with this, we check the state of the EMI service. If the instance is online, that means that EMI has successfully finished and will never run again until the next time the system boots. This is how svc.startd makes sure not to run it twice in case startd restarts. In our case, we do not try and verify that the instance has a restarter property group unless svc:/system/early-manifest-import:default is online.

The likelihood of the race condition occurring after EMI starts is very unlikely because most start methods are not calling svcadm enable -s on some other service that was imported via EMI, but that does not mean it does not exist and it is worth keeping that in mind if writing the manifest for such a service.

Takeaways

Hopefully the block diagrams here help someone who is making future dives into the depths of SMF. If you do, here are a couple things to keep in mind:

  • Have the code base open and in cscope.
  • Use DTrace on the different pieces of SMF doing known actions and use ustack().
  • Watch out for the two special modes that these programs run in: EMI and NATIVE_BUILD.
  • The SMF Codebase is huge. Do not try to keep it all in memory at once. Trace interactions and keep notes.

Just the other day I was working with Ryan Dahl on debugging an issue he hit while working on adding support for Crankshaft —  the new JIT for Google’s v8 — for SunOS. This came about from Bryan’s discovery of what can happen when magic collides. Now, this is a rather delicate operation and there is a lot of special stuff that is going on. Since Ryan and I had an interesting little debugging session and both learned something, I thought I’d share a bit of what was going on with an explanation.

As a part of Crankshaft, they are firing a signal to do a bit of the profiling. Some of the code that is in bleeding edge for src/platform-solaris.cc currently looks like:

615 static void ProfilerSignalHandler(int signal, siginfo_t* info, void* context) {
616   USE(info);
617   if (signal != SIGPROF) return;
618   if (active_sampler_ == NULL || !active_sampler_->IsActive()) return;
619   if (vm_tid_ != pthread_self()) return;
620
621   TickSample sample_obj;
622   TickSample* sample = CpuProfiler::TickSampleEvent();
623   if (sample == NULL) sample = &sample_obj;
624
625   // Extracting the sample from the context is extremely machine dependent.
626   ucontext_t* ucontext = reinterpret_cast(context);
627   mcontext_t& mcontext = ucontext->uc_mcontext;
628   sample->state = Top::current_vm_state();
629
630 #if V8_HOST_ARCH_IA32
631   sample->pc = reinterpret_cast(mcontext.gregs[KDIREG_EIP]);
632   sample->sp = reinterpret_cast(mcontext.gregs[KDIREG_ESP]);
633   sample->fp = reinterpret_cast(mcontext.gregs[KDIREG_EBP]);
634 #elif V8_HOST_ARCH_X64
635   sample->pc = reinterpret_cast(mcontext.gregs[KDIREG_RIP]);
636   sample->sp = reinterpret_cast(mcontext.gregs[KDIREG_RSP]);
637   sample->fp = reinterpret_cast(mcontext.gregs[KDIREG_RBP]);
638 #else
639   UNIMPLEMENTED();
640 #endif
641   active_sampler_->SampleStack(sample);
642   active_sampler_->Tick(sample);
643 }

Now for those of you who have spent a long time working with SunOS might notice what’s wrong with this right away. But in some ways it’s not quite so obvious, so let’s talk about what’s happening.

This code is being used as a signal handler, specifically for SIGPROF. If we pull up the manual page for sigaction(2), the Solaris version has the following comment in its notes section:

NOTES
     The handler routine can be declared:

       void handler (int sig, siginfo_t *sip, ucontext_t *ucp);

     The sig argument is the signal number. The sip argument is a
     pointer (to space on the stack) to  a  siginfo_t  structure,
     which  provides  additional detail about the delivery of the
     signal. The ucp argument is a pointer (again to space on the
     stack)  to  a  ucontext_t  structure  (defined in <sys/ucon-
     text.h>) which contains the context from before the  signal.
     It  is  not  recommended  that ucp be used by the handler to
     restore the context from before the signal delivery.

SunOS 5.11           Last change: 23 Mar 2005                   5

When a signal is delivered on an x86 UNIX system a program stops doing what it is currently doing and if there is a signal handler, executes the code for the signal handler and then returns to what it was previously doing (this is a bit more complicated in a multi-threaded program). We generally describe this as a signal interrupting the thread in question. This third argument to the handler is a context, which is all the information necessary to describe where a user program is executing. If we peek our heads into <sys/ucontext.h> on an x86 based system (the SPARC version is different)) we will find the following declaration for the structure (with a few #ifdefs along for the ride):

 75 #if !defined(_XPG4_2) || defined(__EXTENSIONS__)
 76 struct  ucontext {
 77 #else
 78 struct  __ucontext {
 79 #endif
 80         unsigned long   uc_flags;
 81         ucontext_t      *uc_link;
 82         sigset_t        uc_sigmask;
 83         stack_t         uc_stack;
 84         mcontext_t      uc_mcontext;
 85         long            uc_filler[5];   /* see ABI spec for Intel386 */
 86 };

Specifically here we are interested in the mcontext — what v8 is using. To best understand what the mcontext is, I took a look at what the OpenGroup defines for ucontext.h in SUSv2. They have the following to say about the mcontext:

mcontext_t  uc_mcontext a machine-specific representation of the saved context

More specifically the mcontext_t has two members. From <sys/regset.h> we get:

378 /*
379  * Structure mcontext defines the complete hardware machine state.
380  * (This structure is specified in the i386 ABI suppl.)
381  */
382 typedef struct {
383         gregset_t       gregs;          /* general register set */
384         fpregset_t      fpregs;         /* floating point register set */
385 } mcontext_t;

Well, that’s exactly what v8 is looking for. From the code snippet there, they are saving three registers that describe how the machine works:

  • The Base Pointer – ebp for i386 and rbp on amd64
  • The Instruction Pointer – eip for i386 and rip for amd64
  • The Stack Pointer – esp for i386 and rsp on amd64

Now keeping track of what each of these does can be quite confusing, so let’s do a quick review.

The instruction pointer holds the address of the next assembly instruction that the CPU should execute for this program. The Base Pointer and Stack Pointer are unfortunately, not quite as intuitive. Memory is laid out in the stack from high addresses towards low addresses. The stack pointer tells us where the bottom of the stack is, i.e. if we decrement the address we can store a new value. When we use the stack, we break it up into what are called stack frames. A stack frame contains everything necessary to run a function: arguments to the function, copies of registers that are expected to be saved, the instruction to return to after the function completes (the eip) and a pointer to the previous stack frame. The ebp points into the current stack frame.

After this brief interlude, we now return to the code that we were working on v8 src/platform_solaris.cc. Now, every so often that code would segfault. With a brief bit of debugging work and comparing the registers before the interrupt was taken with those in the mcontext, we found that we were using the wrong value! Now, if you look back, you’ll see that we’re using macros with prefix KDIREG. These are generally gotten from <sys/kdi_regs.h>. Specifically the definitions used are architecture dependent and for x86 will be found in <ia32/sys/kdi_regs.h> and in <amd64/sys/kdi_regs.h> for amd64. This is the interface that kmdb uses for operating.

In this context, kdi stands for the Kernel/Debugger Interface. So these definitions are meant for structures that are using that interface. When we specified KDIREGS_ESP the value it ended up actually getting out of the register actually was giving us the register ECX. ECX can be used as a general purpose and historically CX was used for loop counters, so the chances that we’re getting an invalid address are pretty high.

However, it turned out it was not too hard to use the correct registers. Looking at <sys/regset.h> had the answer right in front of us:

 91 /*
 92  * The names and offsets defined here are specified by i386 ABI suppl.
 93  */
 94
 95 #define SS              18      /* only stored on a privilege transition */
 96 #define UESP            17      /* only stored on a privilege transition */
 97 #define EFL             16
 98 #define CS              15
 99 #define EIP             14
100 #define ERR             13
101 #define TRAPNO          12
102 #define EAX             11
103 #define ECX             10
104 #define EDX             9
105 #define EBX             8
106 #define ESP             7
107 #define EBP             6
108 #define ESI             5
109 #define EDI             4
110 #define DS              3
111 #define ES              2
112 #define FS              1
113 #define GS              0

This led us to making the obvious substitutions:

630 #if V8_HOST_ARCH_IA32
631   sample->pc = reinterpret_cast(mcontext.gregs[EIP]);
632   sample->sp = reinterpret_cast(mcontext.gregs[ESP]);
633   sample->fp = reinterpret_cast(mcontext.gregs[EBP]);

Well, actually it was almost too obvious, because it segfaulted as well in the same location. However, instead of using address 0xf (a reasonable value for ECX), it actually had 0x0 in the ESP register! Now wait a minute, this is what tells us where the bottom of the stack is, that’s not right, if the bottom of the stack is at 0 we’re in a lot of trouble.

Now, on Solaris x86/amd64 we take interrupts on the stack. These days, most systems use a 1:1 threading model (for reasons why, ask Bryan or read his paper) so for each userland thread there is a kernel thread that corresponds to it which means that each thread has a stack in both userland and the kernel. So here ESP really could be called KESP — referring to the ESP of the kernel thread. So really what we are interested in here is the ESP for userland or the register UESP.

Now that we know that we need to be using UESP, I took another look at the header file and found the following snippet:

115 /* aliases for portability */
116
117 #if defined(__amd64)
118
119 #define REG_PC  REG_RIP
120 #define REG_FP  REG_RBP
121 #define REG_SP  REG_RSP
122 #define REG_PS  REG_RFL
123 #define REG_R0  REG_RAX
124 #define REG_R1  REG_RDX
125
126 #else   /* __i386 */
127
128 #define REG_PC  EIP
129 #define REG_FP  EBP
130 #define REG_SP  UESP
131 #define REG_PS  EFL
132 #define REG_R0  EAX
133 #define REG_R1  EDX

One of the nice things about this here is that it makes it easier to write code that works across both the x86 and amd64 architectures. Of course, this doesn’t really work when looking at SPARC platforms because the ABI and calling conventions are different due to the differences in CPU architecture. This is one of the things that I personally enjoy about SunOS. The act of defining these more portable aliases is really helpful and if we ever get a 128 bit processor for some reason, those macros will be extended to make sense for it as well. Those portable definitions allowed us to take those architecture ifdefs and just replace it with the following three lines:

631   sample->pc = reinterpret_cast(mcontext.gregs[REG_PC]);
632   sample->sp = reinterpret_cast(mcontext.gregs[REG_SP]);
633   sample->fp = reinterpret_cast(mcontext.gregs[REG_FP]);

That’s about it for our little trip down to sys/regset.h. The fix should hopefully land in v8 (it may even have by the time I get around to posting this) shortly. It should be fun to play around with node and a proper Crankshaft on v8.

As a part of the work for Cloud Analytics that Dave, Bryan, Brendan, and I been doing, we’ve added several USDT probes to Node. As I’ve been doing a lot of the integration work and working with Ryan Dahl to get them into Node itself, I thought that I would document what we’ve added and talk a bit about what information we can get from the probes. Dave has a blog entry that explains how you can use these probes and Cloud Analytics to understand what’s going on with your application.

Currently Node has probes that let us track the following events:

Before we jump too far ahead of ourselves, we should talk a bit about how to enable DTrace USDT probes with node. To get the DTrace probes in node, you can pass the option –with-dtrace to configure. This flag will fail on systems that aren’t SunOS (i.e. Illumos and OpenSolaris). Now, OS X and FreeBSD do have DTrace. You can certainly hack the wscript for node to get the probes compiled into OS X and FreeBSD; however, they have not been tested. On OS X you can get the probes to fire; however, the translator does not currently work on OS X, which limits their power. I have not tested this on FreeBSD 9 which added support for the USDT provider. So, let’s build node with support for DTrace!

$ ./configure --with-dtrace && make && make install

So, now we can go ahead and list all of the probes that DTrace has for node. To do this we need to fire up our newly compiled node and then run the following DTrace command to see all the probes that we have:

rm@devel ~ $ dtrace -l -n node*:::
   ID   PROVIDER            MODULE                          FUNCTION NAME
 4331  node12090              node _ZN4node14dtrace_gc_doneEN2v86GCTypeENS0_15GCCallbackFlagsE gc-done
 4332  node12090              node _ZN4node15dtrace_gc_startEN2v86GCTypeENS0_15GCCallbackFlagsE gc-start
 4333  node12090              node _ZN4node26DTRACE_HTTP_CLIENT_REQUESTERKN2v89ArgumentsE http-client-request
 4334  node12090              node _ZN4node27DTRACE_HTTP_CLIENT_RESPONSEERKN2v89ArgumentsE http-client-response
 4335  node12090              node _ZN4node26DTRACE_HTTP_SERVER_REQUESTERKN2v89ArgumentsE http-server-request
 4336  node12090              node _ZN4node27DTRACE_HTTP_SERVER_RESPONSEERKN2v89ArgumentsE http-server-response
 4337  node12090              node _ZN4node28DTRACE_NET_SERVER_CONNECTIONERKN2v89ArgumentsE net-server-connection
 4338  node12090              node _ZN4node22DTRACE_NET_SOCKET_READERKN2v89ArgumentsE net-socket-read
 4339  node12090              node _ZN4node23DTRACE_NET_SOCKET_WRITEERKN2v89ArgumentsE net-socket-write
 4340  node12090              node _ZN4node21DTRACE_NET_STREAM_ENDERKN2v89ArgumentsE net-stream-end

As we can see there are several probes here, for this entry we’re going to leave out net-stream-end and net-server-connection. If we include the path to the node.d file, we can even view all the arguments of the probe. The node.d file is a translator that is installed into $PREFIX/lib/dtrace. As long as you have the libdir patch (or move the file into /usr/lib/dtrace) then you should be all set and we can run:

rm@devel ~ $ dtrace -L $PREFIX/lib/dtrace -v -n node*:::
   ID   PROVIDER            MODULE                          FUNCTION NAME
 4331  node12090              node _ZN4node14dtrace_gc_doneEN2v86GCTypeENS0_15GCCallbackFlagsE gc-done

        Probe Description Attributes
                Identifier Names: Private
                Data Semantics:   Private
                Dependency Class: Unknown

        Argument Attributes
                Identifier Names: Evolving
                Data Semantics:   Evolving
                Dependency Class: ISA

        Argument Types
                args[0]: int
                args[1]: int

 4332  node12090              node _ZN4node15dtrace_gc_startEN2v86GCTypeENS0_15GCCallbackFlagsE gc-start

        Probe Description Attributes
                Identifier Names: Private
                Data Semantics:   Private
                Dependency Class: Unknown

        Argument Attributes
                Identifier Names: Evolving
                Data Semantics:   Evolving
                Dependency Class: ISA

        Argument Types
                args[0]: int
                args[1]: int

 4333  node12090              node _ZN4node26DTRACE_HTTP_CLIENT_REQUESTERKN2v89ArgumentsE http-client-request

        Probe Description Attributes
                Identifier Names: Private
                Data Semantics:   Private
                Dependency Class: Unknown

        Argument Attributes
                Identifier Names: Evolving
                Data Semantics:   Evolving
                Dependency Class: ISA

        Argument Types
                args[0]: node_http_request_t *
                args[1]: node_connection_t *
 4334  node12090              node _ZN4node27DTRACE_HTTP_CLIENT_RESPONSEERKN2v89ArgumentsE http-client-response

        Probe Description Attributes
                Identifier Names: Private
                Data Semantics:   Private
                Dependency Class: Unknown

        Argument Attributes
                Identifier Names: Evolving
                Data Semantics:   Evolving
                Dependency Class: ISA

        Argument Types
                args[0]: node_connection_t *

 4335  node12090              node _ZN4node26DTRACE_HTTP_SERVER_REQUESTERKN2v89ArgumentsE http-server-request

        Probe Description Attributes
                Identifier Names: Private
                Data Semantics:   Private
                Dependency Class: Unknown

        Argument Attributes
                Identifier Names: Evolving
                Data Semantics:   Evolving
                Dependency Class: ISA

        Argument Types
                args[0]: node_http_request_t *
                args[1]: node_connection_t *

 4336  node12090              node _ZN4node27DTRACE_HTTP_SERVER_RESPONSEERKN2v89ArgumentsE http-server-response

        Probe Description Attributes
                Identifier Names: Private
                Data Semantics:   Private
                Dependency Class: Unknown

        Argument Attributes
                Identifier Names: Evolving
                Data Semantics:   Evolving
                Dependency Class: ISA

        Argument Types
                args[0]: node_connection_t *

 4338  node12090              node _ZN4node22DTRACE_NET_SOCKET_READERKN2v89ArgumentsE net-socket-read

        Probe Description Attributes
                Identifier Names: Private
                Data Semantics:   Private
                Dependency Class: Unknown

        Argument Attributes
                Identifier Names: Evolving
                Data Semantics:   Evolving
                Dependency Class: ISA

        Argument Types
                args[0]: node_connection_t *
                args[1]: int

 4339  node12090              node _ZN4node23DTRACE_NET_SOCKET_WRITEERKN2v89ArgumentsE net-socket-write

        Probe Description Attributes
                Identifier Names: Private
                Data Semantics:   Private
                Dependency Class: Unknown

        Argument Attributes
                Identifier Names: Evolving
                Data Semantics:   Evolving
                Dependency Class: ISA

        Argument Types
                args[0]: node_connection_t *
                args[1]: int


For the rest of this entry we’ll go over the different probes and their arguments. I’ll detail what they are and how you can get some useful information from them. In a future entry I’ll discuss how you can use this information.

HTTP Server Probes

When an HTTP request comes in it will fire the http-server-request probe. The probe is fired once node receives all of the HTTP headers for the incoming connection. Thus the probe fires right before the HTTP socket emits the event ‘request’. The http-server-request probe has two arguments. The first argument tells us information about the HTTP request. We can use it to get the method and url requested. The second argument tells us information about who is connecting to our HTTP server. We can view all this information with the following DTrace snippet:

# cat -n /tmp/node-httpd-request.d
     1  #pragma D option quiet
     2
     3  node21291:::http-server-request
     4  {
     5          printf("Connection from %s:%d on fd %d - Request %s: %s\n",
     6              args[1]->remoteAddress, args[1]->remotePort, args[1]->fd,
     7              args[0]->method, args[0]->url);
     8  }
# dtrace -L$PREFIX/lib/dtrace -s node-httpd-request.d
Connection from 10.99.99.254:63486 on fd 7 - Request GET: /
Connection from 10.99.99.254:63487 on fd 7 - Request GET: /
Connection from 10.99.99.254:63488 on fd 7 - Request GET: /
Connection from 10.99.99.254:63489 on fd 7 - Request GET: /
Connection from 10.99.99.254:63490 on fd 7 - Request GET: /
^C

The corresponding probe, http-server-response, will fire once we have finished sending all the data in our response to the client. This probe no longer has the information on the incoming HTTP request, but does have all of the information about who we are talking to. Here’s all the information that we have:

# cat node-http-response.d
     1  #pragma D option quiet
     2
     3  node*:::http-server-response
     4  {
     5          printf("Connection from %s:%d on fd %d\n",
     6              args[0]->remoteAddress, args[0]->remotePort, args[0]->fd);
     7  }
# dtrace -L$PREFIX/lib/dtrace -s node-http-response.d
Connection from 127.0.0.1:59149 on fd 8
Connection from 127.0.0.1:39130 on fd 9
Connection from 10.99.99.254:63691 on fd 7
Connection from 10.99.99.254:55004 on fd 19
Connection from 10.99.99.254:55000 on fd 14
Connection from 10.99.99.254:55276 on fd 16
Connection from 10.99.99.254:54977 on fd 11

Now, if we want to track the time that has elapsed, we have to be a little smarter. In node we could have multiple connections going on at the same time, and these are all going to occur in the same thread. This means that we cannot use the traditional thread-local variables that we might do via self->foo. So, how do we actually correlate one http-server-request probe with its http-server-response probe? Each connection is using a separate socket and each socket has its own file descriptor. There is a one to one correlation between sockets and file descriptors. A file descriptor will only be reused if the socket it was previously associated with was closed. So if we wanted to track the latency we can use the file descriptor to correlate events and write a small script that looks something like:

rm@devel ~ $ cat  -n/tmp/httpd-lat.d
     1  #pragma D option quiet
     2
     3  node*:::http-server-request
     4  {
     5          latency[args[1]->fd] = timestamp;
     6  }
     7
     8  node*:::http-server-response
     9  {
    10          printf("Request took %d nanoseconds to complete\n",
    11              timestamp - latency[args[0]->fd]);
    12  }
rm@devel ~ $ dtrace -L$PREFIX/lib/dtrace -s /tmp/httpd-lat.d
Request took 444589 nanoseconds to complete
Request took 939339 nanoseconds to complete
Request took 353588 nanoseconds to complete
Request took 1431926 nanoseconds to complete
Request took 334404 nanoseconds to complete
^C

Now, if we run this with the sample Node ‘Hello, world!’ HTTP server we certainly can get some rather interesting data. But the value comes from being able to visualize this data, something that Brendan has looked at extensively.

HTTP Client Probes

The HTTP Client probes have been designed to correspond to their HTTP Server counterparts. There are both http-client-request and http-client-response probes. The http-client-request gets fired when an outgoing request is made and the http-client response probe is fired when we have received the response. In terms of arguments, the HTTP Client probes work identically to their server counterparts. The http-client-request has information about what the underlying connection and the HTTP resource we have asked for. The http-client-response probe only has information on the underlying socket.

Let’s look at the same information as we did with the HTTP Server probes, but this time with the client:

# cat -n /tmp/node-httpc-request.d
     1  #pragma D option quiet
     2
     3  node*:::http-client-request
     4  {
     5          printf("Connection to %s:%d on fd %d - Request %s: %s\n",
     6              args[1]->remoteAddress, args[1]->remotePort, args[1]->fd,
     7              args[0]->method, args[0]->url);
     8  }
# dtrace -L$PREFIX/lib/dtrace -s /tmp/node-httpc-request.d
Connection to 127.0.0.1:23184 on fd 10 - Request GET: /ca/instrumentations/1/value/raw?start_time=1298733995
Connection to 127.0.0.1:23185 on fd 12 - Request GET: /ca/instrumentations/4/value/raw?start_time=1298733995
Connection to 127.0.0.1:23184 on fd 10 - Request GET: /ca/instrumentations/1/value/raw?start_time=1298733997
Connection to 127.0.0.1:23184 on fd 13 - Request GET: /ca/instrumentations/5/value/raw?start_time=1298733997
Connection to 127.0.0.1:23185 on fd 12 - Request GET: /ca/instrumentations/4/value/raw?start_time=1298733997
^C

Again, we can do a similar thing for the response:

# cat -n /tmp/node-httpc-response.d
     1  #pragma D option quiet
     2
     3  node*:::http-client-response
     4  {
     5          printf("Connection to %s:%d on fd %d\n",
     6              args[0]->remoteAddress, args[0]->remotePort, args[0]->fd);
     7  }
# dtrace -L$PREFIX/lib/dtrace -s /tmp/node-httpc-response.d
Connection to 127.0.0.1:23185 on fd 15
Connection to 127.0.0.1:23185 on fd 13
Connection to 127.0.0.1:23184 on fd 18
Connection to 127.0.0.1:23184 on fd 12
Connection to 127.0.0.1:23184 on fd 10
^C

With all of this information we can gather a lot of useful information. Because both probes have the file descriptor, we can link them together to get the latency, as we did previously. Because a node program may be making several HTTP requests to external services, having the ability to know which request is inducing how much latency is quite valuable.

Socket Read/Write Probes

This next set of probes is in fact two different probes, one which fires on every read from a socket and one that fires on every write to a socket. In addition to the information about the socket itself, we get two very useful numbers: the size of the data and the amount buffered. Whenever node does a write to a socket, it attempts to write all of it to the kernel at once. However, if the write would block, node stores the data in userland and will write it out a bit later. When this happens, the call to write returns false. If you aren’t careful or checking the return value from write that buffer can grow quite quickly in userland and eat up a lot of memory. Let’s look at how we can use these two probes:

# cat -n socket.d
     1  #pragma D option quiet
     2
     3  node*:::net-socket-read,
     4  node*:::net-socket-write
     5  {
     6          printf("%s:%d -- %d bytes -- %d buffered\n", args[0]->remoteAddress,
     7              args[0]->remotePort, arg1, args[0]->bufferSize);
     8  }
# dtrace -L$PREFIX/lib/dtrace -s /tmp/socket.d
10.99.99.254:55000 -- 314 bytes -- 0 buffered
10.99.99.254:55000 -- 2740 bytes -- 0 buffered
10.99.99.254:55000 -- 2 bytes -- 0 buffered
10.99.99.254:55000 -- 5 bytes -- 0 buffered
^C

Garbage Collection Probes

Javascript is a language where you don’t have to explicitly manage memory like you do in C. Instead, the runtime uses various techniques to allocate and later determine what memory can be freed. This technique is known as garbage collection. Node is using v8 under the hood to do this work. v8 gives us the ability to be notified right before garbage collection starts and ends. We have two pieces of information from the garbage collection process: the type of garbage collection and various flags.

To understand the different types, we have to dive into the v8 header file. Node v0.4.0 is currently using v8 version v3.1. If we open up /usr/include/v8.h we see the following for valid garbage collection types:

2472 enum GCType {
2473   kGCTypeScavenge = 1 << 0,
2474   kGCTypeMarkSweepCompact = 1 << 1,
2475   kGCTypeAll = kGCTypeScavenge | kGCTypeMarkSweepCompact
2476 };

From this we can see that v8 supports two different kinds of garbage collection. To help understand the differences, it's important to know that v8 uses the notion of generations. Generational garbage collection makes the observation that objects fall into two camps: those that come and go rather quickly and those that stick around for a while. Mark and Sweep compaction is a traditional form of garbage collection which works well for older objects, but takes much longer. The Scavenge algorithm works well for short lived objects and is generally much quicker than Mark and Sweep. Dave helps explain why we care about GC.

The other argument to our GC probes are a series of flags. Currently this looks like:

2478 enum GCCallbackFlags {
2479   kNoGCCallbackFlags = 0,
2480   kGCCallbackFlagCompacted = 1 << 0
2481 };

When using the probes, there are a few useful things to know. When GC occurs, nothing else is going to run. Furthermore, v8 uses the same thread to call the GC prologue and epilogue handlers. These callbacks are respectively fired immediately before and immediately after the GC operation. Because we have the same thread calling it, we can use a thread local variable in D to keep track of and answer the question of how much latency was due to garbage collection. We can do this with the following D script:

# cat -n /tmp/node-gc.d
     1  #pragma D option quiet
     2
     3  node*:::gc-start
     4  {
     5          self->ts = timestamp;
     6  }
     7
     8  node*:::gc-done
     9  {
    10          printf("Spent %d nanoseconds doing Garbage Collection\n",
    11              timestamp - self->ts);
    12          self->ts = 0
    13  }
# dtrace -s /tmp/node-gc.d
Spent 780604 nanoseconds doing Garbage Collection
Spent 50416942 nanoseconds doing Garbage Collection
Spent 33276517 nanoseconds doing Garbage Collection
Spent 14498463 nanoseconds doing Garbage Collection
Spent 3847561 nanoseconds doing Garbage Collection
^C

Those are all the probes that we have currently added to node. All of them are available in the stock node src that you can download and compile yourself. The idea behind all of the probes that we added is to be able to answer a question directly related to what your application is doing and to be able to understand why and where issues are coming from.

As a part of the work Bryan, Dave, Brendan, and I have been doing to instrument Node, Bryan wrote a translator file. One of the nice things about DTrace is that you can define the arguments to a probe to be a structure and simply access them like a C struct. The way that this all works is via a translator for USDT probes which embeds knowledge of how to look up members of a probes struct. Unfortunately, there is currently a bug in how we handle dependencies for translators, but we also have a fix for that problem. But before I get there, let’s show an example of he use of translators. Let’s say that I want to get the method for every HTTP request that is coming in with node. I can use the following DTrace snippet:

# dtrace -n 'node*:::http-server-request{ printf("%s", args[0]->method); }'
dtrace: description 'node*:::http-server-request' matched 6 probes
CPU ID FUNCTION:NAME
0 3705 _ZN4node26DTRACE_HTTP_SERVER_REQUESTERKN2v89ArgumentsE:http-server-request GET
0 3705 _ZN4node26DTRACE_HTTP_SERVER_REQUESTERKN2v89ArgumentsE:http-server-request GET
0 3705 _ZN4node26DTRACE_HTTP_SERVER_REQUESTERKN2v89ArgumentsE:http-server-request GET
0 3705 _ZN4node26DTRACE_HTTP_SERVER_REQUESTERKN2v89ArgumentsE:http-server-request GET
0 3705 _ZN4node26DTRACE_HTTP_SERVER_REQUESTERKN2v89ArgumentsE:http-server-request GET
0 3705 _ZN4node26DTRACE_HTTP_SERVER_REQUESTERKN2v89ArgumentsE:http-server-request GET
0 3705 _ZN4node26DTRACE_HTTP_SERVER_REQUESTERKN2v89ArgumentsE:http-server-request GET
0 3705 _ZN4node26DTRACE_HTTP_SERVER_REQUESTERKN2v89ArgumentsE:http-server-request GET

The DTrace probes themselves specify a specific argument type. If that translator argument is a struct, we can define an operation to transform the probe argument and get out a specific value. Generally this involves a lot of use of copyin and copyinstr and is quite messy o write, but once written makes all scripts that leverage it much more powerful. These translators is that you can express the dependencies between them. In this case, the Node translator relies on the procfs translator. Thus for this to work, you need to have procfs.d which ships with SunOS.

Now, the big question to ask is how does DTrace know where to look for these translators. By default DTrace looks in /usr/lib/dtrace. However, you can also tell DTrace to look in an additional location by using the -Lpath flag to dtrace(1M). So if you wanted to keep your node.d provider in say /opt/rm/lib/dtrace, we could run the above command as:

# dtrace -L/opt/rm/lib/dtrace -n 'node*:::http-server-request{ printf("%s", args[0]->method); }'

Now, unfortunately there is a bug in this; however, I have a fix for it. Without this fix when trying to resolve library dependencies, DTrace only looks for the file in the same directory. So it would try and look for the dependency procfs.d in /opt/rm/lib/dtrace/procfs.d. However, we don’t have it there, we have it in /usr/lib/dtrace.

Here is a patch against Illumos that fixes this issue. Please note that the patch there is released under the CDDL and copyrighted by Joyent. It causes us to search our entire library path to try and find the first occurrence of a file with the name we’re looking for. So in our case above, we will find that proc.d is in /usr/lib/dtrace and resolve the dependency correctly. Hopefully we’ll see this fix make it upstream into Illumos before too long, as well as, Mac OS X and FreeBSD.

This past Wednesday Brendan Gregg and Bryan Cantrill gave a presentation on the work that’s been going on at Joyent related to visualizing performance and how we’re using this to solve customer problems. This is based on the work that the the four of us (Bryan, Brendan, Dave and I) have been doing.

You can watch the recorded live stream of the presentation Solving Big Problems (with Cloud Analytics) here.

I haven’t been sitting idly these past few weeks, I’ve skipped the frying pan and gone straight into the fire at Joyent!. At Joyent I’m joining a group of great engineers. I’ll be tackling the problem
of Cloud Analytics with a familiar cast of engineers:

Things have been moving fast and not even the holidays aren’t slowing us down too much. Dave has compiled some links to information on a bit of the background of Cloud Analytics, which can be found here. As we make progress I’ll hopefully have some interesting things to talk about related to the problem on here among some other talks.

Recent Posts

September 27, 2019
September 6, 2019
October 1, 2014

Archives