Wednesday, 31 December 2014

strace & system call tracing

I wrote this because a few of my colleagues asked how strace works; I hope the article below helps.

Today being the last day of 2014, I thought I would share it publicly. Wishing all readers a HAPPY NEW YEAR 2015!

The strace tool is one of the most powerful problem determination tools available for Linux. It traces the thin layer (the system calls) between a process and the Linux kernel. System call tracing is particularly useful as a first investigation tool or for problems that involve a call to the operating system.

A system call is a special type of function that is run inside the kernel. It provides fair and secure access to system resources such as disk, network, and memory. System calls also provide access to kernel services such as inter-process communication and system information.

When to use it?

The strace tool should be used as a first investigation tool or for problems that are related to or involve the operating system level. In such cases the system call trace often shows the cause of the problem directly, for example as a failed call together with its error code. Experienced users might use strace either way until they narrow down the scope of a problem.

The following example uses a simple program to show how to use strace. The program tries to open a file which doesn't exist.

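[root@localhost]# cat main.c
#include <fcntl.h>   /* open(), O_RDONLY */

int main(void)
{
    int fd;
    int i = 0;

    /* /tmp/foo does not exist, so open() fails and returns -1 */
    fd = open("/tmp/foo", O_RDONLY);

    if (fd < 0)
        i = 5;   /* matches exit_group(5) in the trace below */

    return i;
}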

# gcc main.c -o ./main
# strace -o main.strace ./main
# cat -n ./main.strace 
     1 execve("./main", ["./main"], [/* 21 vars */]) = 0
     2 brk(0)                                  = 0x8697000
     3 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7774000
     4 access("/etc/", R_OK)      = -1 ENOENT (No such file or directory)
     5 open("/etc/", O_RDONLY)      = 3
     6 fstat64(3, {st_mode=S_IFREG|0644, st_size=28116, ...}) = 0
     7 mmap2(NULL, 28116, PROT_READ, MAP_PRIVATE, 3, 0) = 0xb776d000
     8 close(3)                                = 0
     9 open("/lib/", O_RDONLY)        = 3
    10 read(3, "\177ELF\1\1\1\3\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\220\356\300\0004\0\0\0"..., 512) = 512
    11 fstat64(3, {st_mode=S_IFREG|0755, st_size=1906308, ...}) = 0
    12 mmap2(0xbf8000, 1661356, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0xbf8000
    13 mmap2(0xd88000, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x190) = 0xd88000
    14 mmap2(0xd8b000, 10668, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0xd8b000
    15 close(3)                                = 0
    16 mprotect(0xd88000, 8192, PROT_READ)     = 0
    17 mprotect(0xbe6000, 4096, PROT_READ)     = 0
    18 munmap(0xb776d000, 28116)               = 0
    19 open("/tmp/foo", O_RDONLY)              = -1 ENOENT (No such file or directory)
    20 exit_group(5)                           = ?

In this strace output, the vast majority of the system calls are actually for process initialization. In fact, the only system call (on line 19) from the actual program code is open("/tmp/foo", O_RDONLY). Also notice that there are no system calls from the if statement or any other code in the program, because that code runs entirely in user space and never asks the kernel for anything.

The trace above is explained line by line below:

Line #1: The execve system call (or one of the exec system calls) is always the first system call in the strace output when strace starts a program from the command line. The strace tool forks and executes the program, so the return of the exec call is the first thing recorded for the new process.

Line #2: The brk system call is called with an argument of zero to find the current "break point" (the end of the process's data segment). This is the beginning of memory management for the process.

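Line #3: The mmap2 call is used to create an anonymous 4KB page. The address of this page is 0xb7774000.

Line #4: This line uses access to check whether the file can be read; the call returns ENOENT because the file does not exist. When present, this file contains a list of ELF shared libraries that are to be pre-loaded before a program is able to run.

Line #5-#9: These lines involve finding and loading the libc library. Lines 5 to 8 open a file under /etc/, query its size, map it read-only, and close it; line 9 then opens the library file itself under /lib/.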

Line #10: Loads in the ELF header for the libc library.

Line #11: Gets more information (including size) for the libc library file.

Line #12: This line actually maps (mmaps) the contents of libc into memory at address 0xbf8000, readable and executable.

Line #13: This line maps the data section at address 0xd88000 for 12288 bytes. According to the ELF layout shown below, the data section starts at 0x00d881c8, but a mapping must be aligned on a 0x1000 boundary, hence the address 0x00d88000. The last argument is aligned in the same way: mmap2 takes its file offset in 4096-byte units, so 0x190 means file offset 0x190000, which is the segment's file offset 0x1901c8 rounded down to a page boundary.

# readelf -l /lib/

Elf file type is DYN (Shared object file)
Entry point 0xc0ee90
There are 10 program headers, starting at offset 52

Program Headers:
  Type           Offset   VirtAddr   PhysAddr   FileSiz MemSiz  Flg Align
  PHDR           0x000034 0x00bf8034 0x00bf8034 0x00140 0x00140 R E 0x4
  INTERP         0x15c2c8 0x00d542c8 0x00d542c8 0x00013 0x00013 R   0x1
      [Requesting program interpreter: /lib/]
  LOAD           0x000000 0x00bf8000 0x00bf8000 0x18ff98 0x18ff98 R E 0x1000
  LOAD           0x1901c8 0x00d881c8 0x00d881c8 0x027d4 0x057e4 RW  0x1000
  DYNAMIC        0x191d7c 0x00d89d7c 0x00d89d7c 0x000f8 0x000f8 RW  0x4

Line #14: Creates an anonymous memory segment for the bss section. This is a special section of a loaded executable or shared library for uninitialized data. Because the data is not initialized, the storage for it is not included in an ELF object like a shared library (there are no real data values to store). Instead, memory is allocated (at 0xd8b000) for the bss section when the library is loaded.

Line #15: Closes the file descriptor for libc.

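Line #16-#17: These calls change the protection of two memory regions to read-only (PROT_READ): 8192 bytes at 0xd88000 and 4096 bytes at 0xbe6000. After this, writes to those pages are no longer allowed.

Line #18: Unmaps the 28116-byte region at 0xb776d000 that was mapped on line 7, because it is no longer needed.

Line #19: The only system call from the actual program code. It fails with ENOENT because /tmp/foo does not exist.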

Line #20: Exits the process with a return code of 5.

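It can also be useful to time both the difference between system call entries and the time spent in the system calls. The -r switch prints, at the start of each line, the time elapsed since the previous system call was entered, and the -T switch prints, in angle brackets at the end of each line, the time spent inside the call. With this information, it is possible to get the time spent in the user code between the system calls.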

# strace -Tr ./main
     0.000000 execve("./main", ["./main"], [/* 21 vars */]) = 0 <0.000169>
     0.000406 brk(0)                    = 0x8b26000 <0.000014>
     0.000194 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb77e3000 <0.000028>
     0.000111 access("/etc/", R_OK) = -1 ENOENT (No such file or directory) <0.000022>
     0.000117 open("/etc/", O_RDONLY) = 3 <0.000022>
     0.000068 fstat64(3, {st_mode=S_IFREG|0644, st_size=28116, ...}) = 0 <0.000000>

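Another useful way to time system calls is with the -c switch. This switch summarizes the output in tabular form, with one row per system call showing how often it was called and how many of those calls failed. In this run every call finished too quickly to register, so the seconds column is zero and the % time column shows -nan: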

# strace -c ./main
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
  -nan    0.000000           0         1           read
  -nan    0.000000           0         3         1 open
  -nan    0.000000           0         2           close
  -nan    0.000000           0         1           execve
  -nan    0.000000           0         1         1 access

Sometimes it is necessary to trace a process that is already running, for example a Web daemon (such as Apache) or xinetd. The strace tool provides a simple way to attach to running processes with the -p switch:

# strace -p <PID>

To conclude: strace, when used effectively, can considerably narrow the scope of a problem analysis.