A ptrace-based tracing mechanism for syscalls
Hidden Treasures
Kernel Facilities for Tracing
Commonly used tracers under Linux include ptrace, eBPF, and bpftrace, which is based on eBPF. For implementing stracer, we use ptrace because it is supported by default, unlike eBPF, which requires the kernel to be configured for it. Also, more online resources are available for ptrace.
The ptrace (process trace) syscall, available on many Unix-like systems, allows you to set breakpoints on syscalls for your tracee. Once the tracee hits a breakpoint, it stops, allowing you to gather information about the syscall by reading data from the tracee's address space with ptrace. Finally, once stracer has finished analyzing the current syscall, it can set a breakpoint on the next syscall and resume the tracee.
To grok the current syscall, you need to investigate (1) which syscall you're dealing with and (2) the kind of data (i.e., arguments) the particular syscall provides. Thus, you need syscall definitions that contain this information.
Parsing Syscalls
Each syscall has a unique number, which is passed by the caller in CPU register rax on x86-64 machines [8]. Syscall arguments are passed in registers rdi, rsi, rdx, r10, r8, and r9. The return value of the syscall is returned to the caller in rax.
For instance, the open syscall (Table 1) is numbered 2, which goes in rax, and its arguments filename, flags, and mode are passed in rdi, rsi, and rdx.
Table 1
x86-64 Syscall Excerpt
%rax | System Call | %rdi | %rsi | %rdx |
---|---|---|---|---|
0 | sys_read | unsigned int fd | char *buf | size_t count |
1 | sys_write | unsigned int fd | const char *buf | size_t count |
2 | sys_open | const char *filename | int flags | int mode |
3 | sys_close | unsigned int fd | | |
4 | sys_stat | const char *filename | struct stat *statbuf | |
5 | sys_fstat | unsigned int fd | struct stat *statbuf | |
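The register convention can be expressed as a small C helper. This is a sketch under assumptions: it targets x86-64 Linux only, and the type syscall_regs_t and function decode_syscall_regs are hypothetical names chosen for illustration. One detail worth noting is that a tracer reads the syscall number from orig_rax, because rax itself no longer holds the number once the tracee is stopped at a syscall.

```c
#include <stdint.h>
#include <sys/user.h>   /* struct user_regs_struct (x86-64 layout assumed) */

/* Hypothetical container for one decoded syscall. */
typedef struct {
    long     nr;        /* syscall number */
    uint64_t args[6];   /* rdi, rsi, rdx, r10, r8, r9 */
} syscall_regs_t;

/* Copy the syscall number and arguments out of a register snapshot
 * taken while the tracee is stopped at a syscall. */
static void decode_syscall_regs(const struct user_regs_struct *regs,
                                syscall_regs_t *out)
{
    /* The kernel preserves the original syscall number in orig_rax. */
    out->nr      = (long)regs->orig_rax;
    out->args[0] = regs->rdi;
    out->args[1] = regs->rsi;
    out->args[2] = regs->rdx;
    out->args[3] = regs->r10;
    out->args[4] = regs->r8;
    out->args[5] = regs->r9;
}
```

For an open call, nr would be 2 and args[0] through args[2] would hold the filename pointer, flags, and mode.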
Unfortunately, there's no C header file that specifies the syscall number together with the arguments and data types for all syscalls; hard coding these values is problematic because syscalls vary for different CPU architectures and can change from kernel version to kernel version. Thus, the syscalls have to be parsed from the Linux kernel source. To ensure that the source code corresponds to the system's current kernel release, use apt source linux (on Debian-based distributions) for retrieving the code.
A Python script developed for this purpose then parses the file arch/x86/entry/syscalls/syscall_64.tbl (Listing 1) with regular expressions (regex).
Listing 1
Excerpt of Parsed File
# 64-bit system call numbers and entry vectors
# The format is:
# <number> <abi> <name> <entry point>
#
0 common read sys_read
1 common write sys_write
2 common open sys_open
# ...
Each line without a comment marker contains the syscall number, application binary interface (ABI), name, and entry point of one syscall. Normally, only the syscall number and name would be relevant. However, some syscalls (e.g., writev) are listed more than once, with separate entries for the 64-bit and x32 ABIs. Thus, for generating unique C preprocessor macros for each syscall, the ABI must also be considered.
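The table format is simple enough to parse line by line. The article's script is written in Python with regex; the following C sketch swaps in sscanf to show the same idea. The names tbl_entry_t, parse_tbl_line, and make_macro_name, as well as the SYS_<abi>_<name> macro scheme, are assumptions made for illustration.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical record for one line of syscall_64.tbl. */
typedef struct {
    int  number;
    char abi[16];
    char name[64];
    char entry[64];   /* empty if the line has no entry point */
} tbl_entry_t;

/* Returns 1 if the line describes a syscall, 0 for comments,
 * blank lines, or lines that do not match the expected format. */
static int parse_tbl_line(const char *line, tbl_entry_t *out)
{
    while (isspace((unsigned char)*line))
        line++;
    if (*line == '\0' || *line == '#')
        return 0;

    out->entry[0] = '\0';
    int n = sscanf(line, "%d %15s %63s %63s",
                   &out->number, out->abi, out->name, out->entry);
    return n >= 3;   /* the entry point is optional */
}

/* Build a macro name that stays unique when a syscall name
 * appears under more than one ABI. */
static void make_macro_name(const tbl_entry_t *e, char *buf, size_t len)
{
    snprintf(buf, len, "SYS_%s_%s", e->abi, e->name);
}
```

Including the ABI in the generated name is one way to keep two entries with the same syscall name apart.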
Lastly, the arguments of each syscall, which the table does not contain, are still required. The solution to this task was inspired by the ministrace parsing script by nelhage [9]. The script first searches for C source files in the fs, include, ipc, kernel, mm, net, and security directories, as well as in architecture-specific directories. After all source files have been found, the script uses regex to comb through each line, searching for SYSCALL_DEFINE macros. For example, the SYSCALL_DEFINE for the write syscall in source file fs/read_write.c is:
SYSCALL_DEFINE3(write, unsigned int, fd, const char __user *, buf,
        size_t, count)
{
    return ksys_write(fd, buf, count);
}
The SYSCALL_DEFINE suffix (3 in this case) refers to the number of arguments the syscall expects. Because syscalls can have up to six arguments, Linux provides seven of these macros (i.e., SYSCALL_DEFINE0 through SYSCALL_DEFINE6). The first argument of the macro is the name of the syscall, in this case write. The syscall arguments follow, with each argument split into the data type (e.g., unsigned int) and the argument name (e.g., fd).
Once such a macro has been matched, the name and the arguments are extracted with regex capture groups. In the final step, the syscall numbers from the first step are merged with the arguments and names from the second step on the basis of the syscall name. This data is then used to generate a lookup table for syscalls (Listing 2).
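The extraction step can be sketched in C without regex by splitting the macro's parameter list at commas. This is an illustration under assumptions, not the article's Python script: syscall_def_t and parse_syscall_define are hypothetical names, and the sketch expects the whole macro invocation, up to the closing parenthesis, in one string.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

#define MAX_ARGS 6

/* Hypothetical record for one SYSCALL_DEFINEn invocation. */
typedef struct {
    char name[64];
    int  nargs;
    char types[MAX_ARGS][64];
    char args[MAX_ARGS][64];
} syscall_def_t;

/* Copy src into dst without leading or trailing whitespace. */
static void copy_trimmed(char *dst, size_t cap, const char *src)
{
    while (isspace((unsigned char)*src))
        src++;
    size_t n = strlen(src);
    while (n > 0 && isspace((unsigned char)src[n - 1]))
        n--;
    if (n >= cap)
        n = cap - 1;
    memcpy(dst, src, n);
    dst[n] = '\0';
}

/* Return the next comma-separated field, or NULL when none is left. */
static char *next_field(char **cursor)
{
    char *start = *cursor;
    if (start == NULL)
        return NULL;
    char *comma = strchr(start, ',');
    if (comma != NULL) {
        *comma = '\0';
        *cursor = comma + 1;
    } else {
        *cursor = NULL;
    }
    return start;
}

/* Returns 1 and fills *out if text contains a SYSCALL_DEFINEn(...) macro. */
static int parse_syscall_define(const char *text, syscall_def_t *out)
{
    const char *p = strstr(text, "SYSCALL_DEFINE");
    int nargs = 0, consumed = 0;

    if (p == NULL ||
        sscanf(p, "SYSCALL_DEFINE%d(%n", &nargs, &consumed) != 1 ||
        consumed == 0 || nargs < 0 || nargs > MAX_ARGS)
        return 0;

    const char *end = strchr(p + consumed, ')');
    char buf[512];
    if (end == NULL || (size_t)(end - (p + consumed)) >= sizeof buf)
        return 0;
    size_t len = (size_t)(end - (p + consumed));
    memcpy(buf, p + consumed, len);
    buf[len] = '\0';

    /* First field is the syscall name, then type/name pairs follow. */
    char *cursor = buf;
    copy_trimmed(out->name, sizeof out->name, next_field(&cursor));
    out->nargs = nargs;
    for (int i = 0; i < nargs; i++) {
        char *type = next_field(&cursor);
        char *arg = next_field(&cursor);
        if (type == NULL || arg == NULL)
            return 0;
        copy_trimmed(out->types[i], sizeof out->types[i], type);
        copy_trimmed(out->args[i], sizeof out->args[i], arg);
    }
    return 1;
}
```

The resulting name field is what a merge step would match against the names taken from the syscall table.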
Listing 2
Excerpt of Syscall Lookup Table
const syscall_entry_t syscalls[] = {
    [__SNR_read] = {
        .name = "read",
        .nargs = 3,
        .args = {ARG_INT, ARG_STR, ARG_INT, -1, -1, -1}},
    [__SNR_write] = {
        .name = "write",
        .nargs = 3,
        .args = {ARG_INT, ARG_STR, ARG_INT, -1, -1, -1}},
    [__SNR_open] = {
        .name = "open",
        .nargs = 3,
        .args = {ARG_STR, ARG_INT, ARG_INT, -1, -1, -1}},
    // ...
The stracer can now easily index into the table with the syscall number and retrieve type information for the arguments or simply the syscall name.
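A self-contained version of such a lookup might look like the following sketch. It is not stracer's generated table: it uses the SYS_* constants from <sys/syscall.h> in place of the __SNR_* macros, assumes x86-64, and the definitions of arg_type_t, syscall_entry_t, and lookup_syscall are guesses made for illustration. The bounds and NULL checks matter because the table is sparse and the number comes from an untrusted tracee.

```c
#include <stddef.h>
#include <sys/syscall.h>   /* SYS_read, SYS_write, SYS_open (x86-64 assumed) */

#define MAX_ARGS 6

/* Assumed argument classes; the real table may use more. */
typedef enum { ARG_INT, ARG_STR, ARG_PTR } arg_type_t;

typedef struct {
    const char *name;
    int nargs;
    int args[MAX_ARGS];   /* arg_type_t value, or -1 for unused slots */
} syscall_entry_t;

static const syscall_entry_t syscalls[] = {
    [SYS_read]  = { .name = "read",  .nargs = 3,
                    .args = {ARG_INT, ARG_STR, ARG_INT, -1, -1, -1} },
    [SYS_write] = { .name = "write", .nargs = 3,
                    .args = {ARG_INT, ARG_STR, ARG_INT, -1, -1, -1} },
    [SYS_open]  = { .name = "open",  .nargs = 3,
                    .args = {ARG_STR, ARG_INT, ARG_INT, -1, -1, -1} },
};

/* Return the table entry for nr, or NULL if nr is out of range
 * or falls into a gap of the sparse table. */
static const syscall_entry_t *lookup_syscall(long nr)
{
    size_t count = sizeof syscalls / sizeof syscalls[0];

    if (nr < 0 || (size_t)nr >= count || syscalls[nr].name == NULL)
        return NULL;
    return &syscalls[nr];
}
```

A tracer would pass the number read from the tracee's registers and fall back to printing the raw number when the lookup returns NULL.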
Ptrace Tracing Roles
The most common tracing setup involves the parent process as tracer (Figure 2). The parent forks itself and waits for the child. The child sets up ptrace and sends itself SIGSTOP to stop its execution. After the child has stopped, the parent has the opportunity to set ptrace options and, for example, set a breakpoint on the next syscall. Setting this breakpoint resumes the tracee, which then runs until the breakpoint is hit. The tracee (i.e., the child) proceeds with the execve syscall, replacing the executing image.
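The parent-as-tracer sequence can be sketched in C as follows. This is a minimal illustration for x86-64 Linux, not stracer's implementation: the function name trace_command is hypothetical, error handling is reduced, and only the syscall number is printed. Each syscall produces two stops (entry and exit), which the sketch tells apart with a toggle.

```c
#include <signal.h>
#include <stdio.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/user.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run argv as a traced child and print the number of
 * every syscall it enters. Returns the count of syscall entries, or -1. */
static long trace_command(char *const argv[])
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        ptrace(PTRACE_TRACEME, 0, NULL, NULL);  /* child sets up ptrace */
        raise(SIGSTOP);                         /* and stops itself */
        execvp(argv[0], argv);
        _exit(127);
    }

    int status;
    if (waitpid(pid, &status, 0) < 0 || !WIFSTOPPED(status))
        return -1;
    /* Mark syscall stops with SIGTRAP | 0x80 so they can be told
     * apart from ordinary signal stops. */
    ptrace(PTRACE_SETOPTIONS, pid, NULL, (void *)(long)PTRACE_O_TRACESYSGOOD);

    long entries = 0;
    int in_syscall = 0;
    int sig = 0;
    for (;;) {
        /* "Set a breakpoint on the next syscall" and resume the tracee. */
        if (ptrace(PTRACE_SYSCALL, pid, NULL, (void *)(long)sig) < 0)
            return -1;
        if (waitpid(pid, &status, 0) < 0)
            return -1;
        if (WIFEXITED(status) || WIFSIGNALED(status))
            break;

        sig = 0;
        if (WSTOPSIG(status) == (SIGTRAP | 0x80)) {
            in_syscall = !in_syscall;           /* entry, then exit */
            if (in_syscall) {
                struct user_regs_struct regs;
                if (ptrace(PTRACE_GETREGS, pid, NULL, &regs) == 0) {
                    fprintf(stderr, "syscall %lld\n",
                            (long long)regs.orig_rax);
                    entries++;
                }
            }
        } else if (WSTOPSIG(status) != SIGTRAP) {
            sig = WSTOPSIG(status);             /* forward other signals */
        }
        /* A plain SIGTRAP (e.g., after execve) is swallowed. */
    }
    return entries;
}
```

Calling trace_command with an argument vector such as {"true", NULL} traces that program from its execve onward.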
Many GNU/Linux distributions use the Yama security module to restrict which processes may be traced (its ptrace_scope setting). Thus, tracing as parent process has the benefit of avoiding most issues related to tracing permissions, which can become a problem when roles are reversed, requiring the tracee to set additional tracing permissions.
That said, a downside to tracing as parent process is that Unix signals sent by other processes will be addressed to the tracer and not to the tracee, thus preventing the correct delivery of the signal. Alternatively, you can also trace as child, effectively reversing roles. This choice, however, could interfere with the tracee's process execution because it now has an unexpected child process (the tracer).
To avoid both issues, the stracer is "daemonized," similar to the -DD option in strace. The relevant setup steps are depicted in Figure 3.
The parent process, which will later act as tracee, first sets the required permissions for tracing with prctl and then forks itself and waits to be attached by the stracer. The resulting child forks itself again and waits until it gets terminated by the grandchild. Finally, the grandchild, which will become the stracer process, kills the child and becomes an orphan process. After some time, the grandchild will be adopted by the init process (Figure 4).
As a last step, the stracer leaves the process group of the tracee by calling setpgid(0, 0), which ensures that no signals delivered to the tracee's process group are delivered to the stracer. The stracer can now start attaching tracees via the ptrace request PTRACE_ATTACH.
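The daemonized setup described above can be sketched in C roughly as follows. Several details are assumptions rather than stracer's actual code: the function names are hypothetical, PR_SET_PTRACER_ANY is used as the permission because the tracer's PID is not yet known when prctl is called, and a pipe serves as the "wait to be attached" mechanism. The placeholder tracer body only detaches again; a real tracer would run its syscall loop there.

```c
#include <signal.h>
#include <sys/prctl.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Placeholder tracer body (hypothetical): detach right away. */
static void detach_immediately(pid_t tracee)
{
    ptrace(PTRACE_DETACH, tracee, NULL, NULL);
}

/* Called by the future tracee. Returns 0 once a daemonized tracer
 * has attached, or -1 on failure. */
static int start_daemonized_tracer(void (*tracer_main)(pid_t))
{
    pid_t tracee = getpid();
    int ready[2];

    if (pipe(ready) < 0)
        return -1;
    /* Allow a non-ancestor process to attach under Yama (assumption:
     * "any" is acceptable for this sketch). */
    prctl(PR_SET_PTRACER, PR_SET_PTRACER_ANY, 0, 0, 0);

    pid_t child = fork();
    if (child < 0)
        return -1;
    if (child == 0) {
        pid_t grandchild = fork();
        if (grandchild < 0)
            _exit(1);
        if (grandchild > 0)
            for (;;)
                pause();                /* wait to be killed */

        /* Grandchild: becomes the tracer. */
        close(ready[0]);
        kill(getppid(), SIGKILL);       /* orphan ourselves */
        setpgid(0, 0);                  /* leave the tracee's process group */
        if (ptrace(PTRACE_ATTACH, tracee, NULL, NULL) < 0)
            _exit(1);
        int status;
        if (waitpid(tracee, &status, 0) < 0)
            _exit(1);
        if (write(ready[1], "x", 1) != 1)   /* tell the tracee we attached */
            _exit(1);
        close(ready[1]);
        tracer_main(tracee);            /* must resume or detach the tracee */
        _exit(0);
    }

    /* Tracee: reap the intermediate child, then wait for the tracer. */
    close(ready[1]);
    waitpid(child, NULL, 0);
    char c;
    ssize_t n = read(ready[0], &c, 1);
    close(ready[0]);
    return n == 1 ? 0 : -1;
}
```

The tracee would call start_daemonized_tracer(detach_immediately), or pass its own tracing loop, before starting the work that should be traced.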