A ptrace-based tracing mechanism for syscalls

Hidden Treasures

Tracing Workflow

Once the tracer and the first tracee have been set up, tracing commences. The stracer may trace multiple processes and threads. A simplified version of the tracing workflow is visualized as a flowchart in Figure 5.

Figure 5: Simplified overview of the tracing workflow. Start and End pertain to the life cycle of the tracee.

Initially, the stracer sets the first breakpoint for the stopped tracee, identified by tid in:

ptrace(PTRACE_SYSCALL, tid, 0, pending_signal);

The ptrace request PTRACE_SYSCALL also allows passing signals (sent by other processes) to the tracee. After the request has finished, the tracee will continue executing until it hits the next breakpoint set by the PTRACE_SYSCALL request.

The stracer subsequently checks, in a non-blocking way, for tracees that have changed state with:

waitpid(-1, &trapped_tracee_status, WNOHANG);

State changes occur when

  • a tracee hits a breakpoint,
  • a signal has been sent to a tracee, or
  • a tracee has terminated.

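When a tracee hits a breakpoint, the kernel stops its execution and reports the stop to the tracer as a SIGTRAP; when a signal arrives for a tracee, the tracee is stopped and that signal is reported instead. In both cases, the tracer then has the opportunity to inspect the stopped tracee's state with ptrace.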

Whether the tracee has stopped or terminated can be easily checked by applying the macro WIFSTOPPED to the status value filled in by waitpid, as in WIFSTOPPED(trapped_tracee_status). If the tracee has terminated, the stracer will decrement its tracee count and check again for a new tracee that changed its state. Otherwise, the tracee is stopped.

A stop signal of a type other than SIGTRAP indicates that the tracee was stopped because of a signal sent by a different process. The stracer delivers this signal to the tracee when the next breakpoint is set.
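The status checks described above can be sketched as two small helpers. This is a minimal illustration, not the stracer's actual code: the enum, the function names, and the decision to treat every SIGTRAP stop as a breakpoint are assumptions made here, and the sketch does not distinguish ptrace event stops (which also report SIGTRAP) from syscall stops.

```c
#include <assert.h>
#include <signal.h>
#include <sys/wait.h>

/* Hypothetical names; the article does not show the stracer's own types. */
enum tracee_event {
    TRACEE_TERMINATED,   /* exited or was killed by a signal */
    TRACEE_SYSCALL_STOP, /* stopped with SIGTRAP: breakpoint hit */
    TRACEE_SIGNAL_STOP,  /* stopped because of a signal from another process */
    TRACEE_OTHER         /* anything else; not handled in this sketch */
};

/* Classify the status value filled in by waitpid(). */
enum tracee_event classify_wait_status(int status)
{
    if (WIFEXITED(status) || WIFSIGNALED(status))
        return TRACEE_TERMINATED;
    if (WIFSTOPPED(status))
        return WSTOPSIG(status) == SIGTRAP ? TRACEE_SYSCALL_STOP
                                           : TRACEE_SIGNAL_STOP;
    return TRACEE_OTHER;
}

/* Signal to pass as the last argument of the next PTRACE_SYSCALL request.
 * 0 means that nothing is delivered to the tracee. */
int pending_signal_for(int status)
{
    assert(WIFSTOPPED(status));
    int sig = WSTOPSIG(status);
    return sig == SIGTRAP ? 0 : sig;
}
```

The value returned by pending_signal_for would then be handed to ptrace(PTRACE_SYSCALL, tid, 0, pending_signal) when the next breakpoint is set, so a signal sent by another process is not lost.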

A SIGTRAP signal indicates that the tracee has hit a breakpoint. A syscall involves a syscall enter and a syscall exit. Hence, it also must be checked whether the breakpoint was hit on a syscall enter or on a syscall exit. This event can be examined by retrieving the register contents with

ptrace(PTRACE_GETREGSET, tid, NT_PRSTATUS, &iovec);

and checking whether the rax register contains the negative value of the errno constant ENOSYS. On syscall enter, the kernel preloads rax with -ENOSYS as the default return value, so the tracee is in a syscall enter if the register matches and in a syscall exit if it doesn't. However, there's one small catch: The syscall number, originally stored in rax prior to syscall enter, has already been overwritten with the return value at this point. Thankfully, ptrace preserves this value as orig_rax, so it is still accessible at syscall exit.
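The enter/exit check can be sketched as follows. To keep the example independent of the architecture-specific struct user_regs_struct, it uses a hypothetical stand-in struct with only the two fields discussed here; the function names are likewise assumptions. One known limitation of this heuristic is that a syscall that genuinely returns -ENOSYS at exit is indistinguishable from a syscall enter.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the rax and orig_rax fields of
 * struct user_regs_struct, as filled in by PTRACE_GETREGSET on x86-64. */
struct syscall_regs {
    uint64_t rax;      /* -ENOSYS on enter, return value on exit */
    uint64_t orig_rax; /* syscall number, preserved across the call */
};

/* True if the stop looks like a syscall enter (rax still holds -ENOSYS). */
bool is_syscall_enter(const struct syscall_regs *regs)
{
    assert(regs != NULL);
    return (int64_t)regs->rax == -ENOSYS;
}

/* The syscall number is read from orig_rax, which is valid at enter and exit. */
long syscall_number(const struct syscall_regs *regs)
{
    assert(regs != NULL);
    return (long)regs->orig_rax;
}
```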

If the current stop was indeed induced by a syscall exit, the stracer will check whether libiotrace traced the function call. To accomplish this, the execution stack of the stopped tracee must be unwound. If the stack has no entry indicating the function call was traced by libiotrace, the stracer will print a warning and pass the syscall event to libiotrace, and the flowchart cycle starts anew.

Execution Stack Unwinding

The libraries libunwind and libdw handle the execution stack unwinding of the remote process. A simplified version for dynamically linked libiotrace, without any error checking or cleanup, can be seen in Listing 3.

Listing 3

Simplified Stack Unwinding

/* - Init libunwind - */
unw_cursor_t unw_cursor;
void *unw_ctx = _UPT_create(tid);   /* _UPT_create returns an opaque void * */
unw_init_remote(&unw_cursor, g_unw_as, unw_ctx);

/* - Search - */
bool found_stack_entry = false;
do {
    unw_word_t ip = 0;
    unw_get_reg(&unw_cursor, UNW_REG_IP, &ip);

    /* Resolve the module (shared object) that contains this address */
    Dwfl_Module *module = dwfl_addrmodule(dwfl, (uintptr_t)ip);
    const char *module_name = dwfl_module_info(module, 0, 0, 0, 0, 0, 0, 0);
    if (!strstr(module_name, stacktrace_module_name)) {
        continue;   /* not libiotrace; step to the next frame */
    }

    found_stack_entry = true;
    break;
} while (unw_step(&unw_cursor) > 0);

After initialization of libunwind, the code iterates over each stack frame. Once the current address of the instruction pointer has been retrieved, the module name is resolved. The module name is the shared object (.so) file from which the function was loaded. For each stack frame, this module name is compared with libiotrace_shared.so. With a match, it is evident that the function call was traced by libiotrace (Figure 6).

Figure 6: The libiotrace shared library file was found in the execution stack.

In this example, libiotrace_shared.so was in the stack trace. Thus, the LD_PRELOAD symbol for mmap has been correctly resolved.

The libiotrace tool can also be statically linked with the target program. For this specific purpose, the libiotrace wrapper functions are prefixed with __wrap_, which allows insertion of functions (much like LD_PRELOAD does) by using the linker option --wrap=symbol. In this case, the function name also has to be checked for the prefix when unwinding the stack.
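For the statically linked case, the additional prefix check could look like the following sketch. The helper name is hypothetical, and how the function name of a frame is obtained is not shown in the article; one option would be libunwind's unw_get_proc_name, which is an assumption here rather than the stracer's confirmed approach.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Prefix that the linker option --wrap=symbol expects on wrapper functions. */
#define WRAP_PREFIX "__wrap_"

/* True if func_name starts with "__wrap_", i.e., the frame belongs to a
 * wrapper function inserted at link time. */
bool is_wrapped_symbol(const char *func_name)
{
    assert(func_name != NULL);
    return strncmp(func_name, WRAP_PREFIX, strlen(WRAP_PREFIX)) == 0;
}
```

In the loop of Listing 3, such a check would take the place of the module name comparison, because a statically linked libiotrace has no separate shared object to match against.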

IPC with libiotrace

Finally, you should understand the employed interprocess communication (IPC) mechanisms. The simplest way of tracing entire process hierarchies is by tracing the "root process," which creates all descendants, and setting the ptrace option PTRACE_O_TRACECLONE. However, this option is not always desirable; for example, for scientific applications, the root process is often Open MPI's orted, which is responsible for spawning MPI processes and should not be traced. Therefore, a registration mechanism is needed that allows processes to request tracing from the stracer through a Unix domain socket. The socket also ensures that only one stracer instance runs at any given time, because only one process can bind to the socket.
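The single-instance property of the socket can be illustrated with a short sketch. It assumes a stream socket bound to a filesystem path, which the article does not specify; the function name and the path handling are hypothetical. A second bind to the same path fails with EADDRINUSE, which is what keeps a second stracer from starting.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Returns a listening socket bound to path, or -1 with errno set.
 * errno == EADDRINUSE means that the path is already taken. */
int bind_registration_socket(const char *path)
{
    struct sockaddr_un addr;
    assert(path != NULL && strlen(path) < sizeof(addr.sun_path));

    int fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd == -1)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) == -1 ||
        listen(fd, SOMAXCONN) == -1) {
        int saved_errno = errno;   /* close() may overwrite errno */
        close(fd);
        errno = saved_errno;
        return -1;
    }
    return fd;
}
```

With a path-based socket, a stale socket file left behind by a crashed instance also causes EADDRINUSE, so a real implementation would need some cleanup strategy for that case.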

The second employed IPC mechanism is the already mentioned interface for passing syscall events from the stracer to the running tracee, which is accomplished through the use of shared memory. Every tracee sets up a shared memory object, where the name of the object is derived from the tid of the current tracee process. A simplified version without error checks is shown in Listing 4.

Listing 4

Shared Memory for Syscall Events

/* Create (or open) the shared memory object and give it a minimum size */
int smo_fd = shm_open(smo_name, O_RDWR | O_CREAT, S_IRUSR | S_IWUSR);
ftruncate(smo_fd, smo_min_len);

/* Query the actual size and map the whole object */
struct stat stat_info;
fstat(smo_fd, &stat_info);
*shared_mem_len_ptr = stat_info.st_size;
*shared_mem_addr_ptr = mmap(NULL, *shared_mem_len_ptr, PROT_READ | PROT_WRITE, MAP_SHARED, smo_fd, 0);

The tracee requests to be traced by the stracer through the aforementioned Unix domain socket, and the stracer maps the shared memory region into its virtual memory and sends an acknowledgment to the tracee. The resulting shared memory segments for each tracee are visualized in Figure 7.

Figure 7: Shared memory segments for each tracee.
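The stracer side of this handshake, mapping a shared memory object that the tracee has already created, might look like the following sketch. The naming scheme in build_smo_name is purely hypothetical (the article only states that the name is derived from the tid), and both function names are assumptions.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical naming scheme: derive the object name from the tracee's tid.
 * Returns 0 on success, -1 if the name does not fit into buf. */
int build_smo_name(char *buf, size_t buf_len, pid_t tid)
{
    assert(buf != NULL);
    int n = snprintf(buf, buf_len, "/libiotrace_smo_%ld", (long)tid);
    return (n > 0 && (size_t)n < buf_len) ? 0 : -1;
}

/* Map an already existing shared memory object into the caller's address
 * space. Returns MAP_FAILED on error; on success, *len_out holds the size. */
void *map_existing_smo(const char *smo_name, size_t *len_out)
{
    assert(smo_name != NULL && len_out != NULL);

    int fd = shm_open(smo_name, O_RDWR, 0);   /* no O_CREAT: must exist */
    if (fd == -1)
        return MAP_FAILED;

    struct stat stat_info;
    void *addr = MAP_FAILED;
    if (fstat(fd, &stat_info) == 0 && stat_info.st_size > 0) {
        *len_out = (size_t)stat_info.st_size;
        addr = mmap(NULL, *len_out, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    }
    close(fd);   /* the mapping stays valid after the descriptor is closed */
    return addr;
}
```

On glibc versions older than 2.34, programs that call shm_open need to be linked with -lrt.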

As already mentioned in the Tracing Workflow section, when a syscall wasn't traced by libiotrace, the stracer writes the syscall event into the shared memory segment of the current tracee. Once the tracee has resumed its execution, it reads and processes the contents from the shared memory segment.

Internally, a ring buffer implementation from GitHub user rmind [10] is currently used, which supports multiproducer single-consumer operation without the use of locks. Sadly, we weren't able to find a suitable implementation that supports multiple consumers; otherwise, just one memory mapping could be used for all tracees.
