Saturday, March 17, 2012

Linux Threads Through a Magnifier: Local Threads

Source code for this article is here.

Threads are everywhere. Even now, as you browse this page, threads are involved. Most likely you have more than one tab open in the browser, and each one has at least one thread associated with it. The server supplying this page runs several threads in order to serve multiple connections simultaneously. There are countless examples of threads, but let us concentrate on one specific implementation: the Linux implementation of threads.

It is hard to believe that early Linux kernels did not support threads. Instead, all the "threading" was performed entirely in user space by a pthread (POSIX thread) library chosen for a specific program. This reminds me of my attempt to implement multitasking in DOS when I was in college - possible, but a real headache.

Modern kernels, on the contrary, have full support for threads, which, from the kernel's point of view, are so-called "light-weight processes". They are organized in thread groups, which, in turn, represent processes as we know them. As a matter of fact, the getpid libc function (and the sys_getpid system call) returns the identifier of a thread group.
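The relation between the two identifiers can be observed from C. The following is only a minimal sketch and is not part of the article's source code; the helper name main_thread_ids_match is made up for illustration. It relies on the fact that in the initial thread of a process the thread ID reported by sys_gettid equals the thread group ID reported by sys_getpid:

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <unistd.h>

/* Returns 1 when the calling thread's ID equals its thread group ID,
 * which is the case for the initial (main) thread of a process.
 * The function name is hypothetical, chosen for this sketch only. */
int main_thread_ids_match(void)
{
    long tgid = syscall(SYS_getpid);  /* thread group ID, the "PID" of user space */
    long tid  = syscall(SYS_gettid);  /* ID of this particular thread */
    return tgid == tid;
}
```

Called from any additional thread of the same process, the function would be expected to return 0, because the TGID stays the same while the thread ID differs.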

Let me reiterate - the best explanation is an explanation by example. In this article, I am going to cover the process of thread creation on 64-bit Linux running on a PC, using FASM (flat assembler).


Clone, Fork, Exec...
There are several system calls involved in process manipulation. The best known one is sys_fork. This system call "splits" a running process in two: parent and child. While both continue execution from the instruction immediately following the sys_fork invocation, they have different PIDs (process IDs) or, as we now know, different TGIDs (thread group IDs), and each one gets a different return value from sys_fork. The return value is the child's TGID for the parent process and 0 for the child. In case of error, the libc fork function returns -1 and sets errno appropriately, while sys_fork returns a negative error code.
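The two return values can be seen in a few lines of C. This is a minimal sketch using the libc fork wrapper rather than the raw system call; the function name fork_and_wait is an assumption made for this example:

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Forks; the child exits immediately with `code`, the parent waits for it.
 * Returns the child's exit code as seen by the parent, or -1 on error. */
int fork_and_wait(int code)
{
    pid_t pid = fork();
    if (pid == -1)
        return -1;              /* fork failed, errno is set */
    if (pid == 0)
        _exit(code);            /* child: fork returned 0 */

    int status = 0;             /* parent: fork returned the child's ID */
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

For instance, fork_and_wait(7) should hand 7 back to the parent, since the only thing the child does is exit with that code.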

Exec does not return at all. Well, it formally has a return type of int, but getting a return value means that the function failed. The exec* libc functions and the sys_execve system call replace the program running in the current process with a new one. For example, if your application has to start another application, but you do not want to or cannot, for any reason, use the system() function, then your application has to fork and have the child process call exec, thus being replaced in memory by the new program. The new program starts executing normally from its entry point.
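The fork-then-exec pattern described above might look like this in C. It is a sketch only; spawn_and_wait is a hypothetical helper, and execvp is used here simply because it searches PATH for the program:

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Starts the program named by argv[0] in a child process and waits for it.
 * Returns the program's exit code, 127 if exec failed, or -1 on other errors. */
int spawn_and_wait(char *const argv[])
{
    pid_t pid = fork();
    if (pid == -1)
        return -1;
    if (pid == 0) {
        execvp(argv[0], argv);  /* on success this call never returns */
        _exit(127);             /* reached only if exec failed */
    }

    int status = 0;
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

Passing { "sh", "-c", "exit 3", NULL } as argv should yield 3 on a system where sh is available on PATH.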

Clone - this is the function we are interested in. Clone is a libc wrapper for sys_clone Linux system call and is declared in the sched.h header as follows:

int clone(int (*fn)(void*), void *child_stack, int flags, void *arg, ...);

I encourage you to read the man page for clone libc function at http://linux.die.net/man/2/clone or with "man clone" :-) 


sys_clone
We are not going to deal with the clone function here. There are lots of good resources on the internet which provide good examples for it. Instead, we are going to examine the sys_clone Linux system call.

First of all, let us take a look at the definition of sys_clone in arch/x86/kernel/process.c:

long sys_clone(unsigned long clone_flags, unsigned long newsp,
               void __user *parent_tid, void __user *child_tid, struct pt_regs *regs)

Although the definition looks quite complicated, in practice only clone_flags and newsp need to be specified; the remaining arguments matter only when the corresponding flags are set.

But there is a strange thing: it does not take a pointer to the thread function as a parameter. That is normal - sys_clone only performs the action suggested by its name, that is, it clones the process. "But how about libc's clone?" you may ask. As I have mentioned above, libc's clone is a wrapper, and what it does in addition to calling sys_clone is arrange for the cloned process to continue execution at the address of the thread function. But let us examine it in more detail.

clone_flags - this value tells the kernel how we want our process to be cloned. In our case, as we want to create a thread rather than a separate process, we should use the following or'ed values:

CLONE_VM (0x100) - tells the kernel to let the original process and the clone run in the same memory space;
CLONE_FS (0x200) - both get the same file system information;
CLONE_FILES (0x400) - share file descriptors;
CLONE_SIGHAND (0x800) - both processes share the same signal handlers;
CLONE_THREAD (0x10000) - tells the kernel that both processes belong to the same thread group (are threads within the same process).

SIGCHLD (0x11) - this is not a flag, but the number of the SIGCHLD signal. The low byte of clone_flags holds the signal to be sent to the parent when the clone terminates (used by the wait functions). Note that for a clone created with CLONE_THREAD no termination signal is sent, so the value has no effect in our case.
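As a quick check of the arithmetic, the values listed above can be combined in C using the constants from sched.h and signal.h. This is a small sketch; the two function names are made up for the example, and the numeric result assumes a platform such as x86 where SIGCHLD is 17:

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>

/* Builds the clone_flags value discussed in the text. */
unsigned long thread_clone_flags(void)
{
    return CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND
         | CLONE_THREAD | SIGCHLD;
}

/* Extracts the termination signal stored in the low byte of clone_flags. */
int exit_signal_of(unsigned long clone_flags)
{
    return (int)(clone_flags & CSIGNAL);   /* CSIGNAL is the 0xff mask */
}
```

On x86-64 the combined value works out to 0x10F11: 0x10F00 for the five flags plus 0x11 for the signal number.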

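newsp - the value of the stack pointer for the cloned process (new thread). This value may be NULL, in which case the clone starts with the same stack pointer value as its parent. For a separate process (no CLONE_VM) that is fine: the address spaces are distinct and, thanks to the copy-on-write mechanism, the first write to the stack gives the child its own memory pages, leaving the stack of the original untouched. With CLONE_VM, however, both threads share the same memory, copy-on-write does not apply, and the two threads would corrupt each other's stack, so a thread must be given a stack of its own.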


Stack Allocation
Since in most cases you would want to allocate a new stack for a new thread, I cannot leave this aspect uncovered in this article. To make things easier, let us implement a small function which receives the size of the requested stack in bytes and returns a pointer to the allocated memory region.

Important note:
As Linux follows the AMD64 (System V) calling convention when running in 64-bit mode, function parameters and system call arguments are passed via the following registers:
Function call: arguments 1 - 6 via RDI, RSI, RDX, RCX, R8, R9; additional arguments are passed on the stack.
System call: arguments 1 - 6 via RDI, RSI, RDX, R10, R8, R9; the system call number goes in RAX. System calls take at most six arguments and never use the stack. The syscall instruction itself overwrites RCX and R11.
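The register assignment for system calls can also be written down as GCC-style inline assembly in C. This is a sketch under assumptions: raw_syscall6 is a hypothetical helper, the inline assembly path is only compiled on x86-64, and on other architectures the code falls back to libc's syscall() (which reports errors through errno instead of a negative return value):

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <unistd.h>

/* Issues a system call with up to six arguments.
 * x86-64: number in RAX, arguments in RDI, RSI, RDX, R10, R8, R9. */
long raw_syscall6(long nr, long a1, long a2, long a3,
                  long a4, long a5, long a6)
{
#if defined(__x86_64__)
    long ret;
    register long r10 __asm__("r10") = a4;  /* 4th argument: R10, not RCX */
    register long r8  __asm__("r8")  = a5;
    register long r9  __asm__("r9")  = a6;
    __asm__ volatile ("syscall"
                      : "=a"(ret)
                      : "a"(nr), "D"(a1), "S"(a2), "d"(a3),
                        "r"(r10), "r"(r8), "r"(r9)
                      : "rcx", "r11", "memory"); /* syscall clobbers RCX, R11 */
    return ret;                 /* negative value means -errno */
#else
    return syscall(nr, a1, a2, a3, a4, a5, a6);
#endif
}
```

For example, raw_syscall6(SYS_getpid, 0, 0, 0, 0, 0, 0) should return the same value as getpid().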


C declaration:
void* map_stack(unsigned long stack_size);

Implementation:
PROT_READ     = 1
PROT_WRITE    = 2
MAP_PRIVATE   = 0x002
MAP_ANON      = 0x020
MAP_GROWSDOWN = 0x100
SYS_MMAP      = 9

map_stack:
   push  rdi rsi rdx r10 r8 r9                 ;Save registers
   mov   rsi, rdi                              ;Requested size
   xor   rdi, rdi                              ;Preferred address (may be NULL)   
   mov   rdx, PROT_READ or PROT_WRITE          ;Memory protection
   mov   r10, MAP_PRIVATE or MAP_ANON or MAP_GROWSDOWN ;Allocation attributes
   xor   r8, r8                                ;File descriptor (-1)
   dec   r8     
   xor   r9, r9                                ;Offset - irrelevant, so 0
   mov   rax, SYS_MMAP                         ;Set system call number
   syscall                                     ;Execute system call
   pop   r9 r8 r10 rdx rsi rdi                 ;Restore registers
   ret 

Calling this function would be as easy as:

mov  rdi, size
call map_stack

This function returns either a negative error code as provided by sys_mmap or the address of the allocated memory region. Note that, even with the MAP_GROWSDOWN attribute, this is the lowest address of the region, not its top; the attribute only tells the kernel that the mapping may be extended downward. Since the stack grows downward, the value to use as the new stack pointer is the returned address plus the requested size.
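For comparison, here is roughly the same allocation in C through the libc mmap wrapper. It is a sketch only: map_stack_top is a hypothetical name, and unlike map_stack it returns the address just past the end of the region, because mmap hands back the lowest address of the mapping while a downward-growing stack has to start at the highest one. Note also that the libc wrapper reports failure as MAP_FAILED rather than as a negative error code.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* Maps `stack_size` bytes for use as a thread stack and returns the
 * initial stack pointer (one past the highest byte), or NULL on failure. */
void *map_stack_top(size_t stack_size)
{
    void *base = mmap(NULL, stack_size, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_GROWSDOWN, -1, 0);
    if (base == MAP_FAILED)
        return NULL;
    return (char *)base + stack_size;   /* top of the region */
}
```

The bytes immediately below the returned pointer are writable, which is exactly what the first push performed by a new thread needs.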


Creation of Thread
In this section, we will implement a trivial create_thread function. It allocates a stack (of the default size, 0x1000 bytes) for a new thread, invokes sys_clone and returns either to the instruction following call create_thread or to the thread function, depending on the return value of sys_clone.

C declaration:
long create_thread(void(*thread_func)(void*), void* param);

As you may see, the return type of the thread_func is void, unlike the real clone function. I will show you why a bit later.

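Implementation:
CLONE_VM      = 0x00000100
CLONE_FS      = 0x00000200
CLONE_FILES   = 0x00000400
CLONE_SIGHAND = 0x00000800
CLONE_THREAD  = 0x00010000
SIGCHLD       = 0x11
SYS_CLONE     = 56
STACK_SIZE    = 0x1000

create_thread:
   mov   r14, rdi    ;Save the address of the thread_func (clobbers callee-saved R14)
   mov   r15, rsi    ;Save thread parameter (clobbers callee-saved R15)
   mov   rdi, STACK_SIZE ;Requested stack size
   call  map_stack   ;Allocate stack
   test  rax, rax    ;A negative value means that sys_mmap failed
   js    .parent     ;In that case, return the error code to the caller
   lea   rsi, [rax + STACK_SIZE] ;Set newsp to the top of the region (stack grows down)
   mov   rdi, CLONE_VM or CLONE_FS or CLONE_FILES or CLONE_SIGHAND or CLONE_THREAD or SIGCHLD ;Set clone_flags
   xor   rdx, rdx    ;parent_tid (unused)
   xor   r10, r10    ;child_tid (unused)
   xor   r8, r8      ;tls (unused)
   mov   rax, SYS_CLONE
   syscall           ;Execute system call
   test  rax, rax    ;Check sys_clone return value
   jnz   .parent     ;If not 0, it is the ID of the new thread (or a negative error code)
   push  r14         ;Otherwise, we are the new thread: set return address (thread_func)
   mov   rdi, r15    ;Set argument for the thread_func
   ret               ;"Return" to thread_func
.parent:
   ret               ;Return to parent (main thread)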
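The same steps cannot be reproduced safely in C with a raw sys_clone call, because the new thread would return from the call onto an empty stack; this is the part the libc clone wrapper takes care of. The following sketch therefore uses libc's clone with the flags from this article. Everything else is an assumption made for the example: the names run_in_thread, thread_body and SKETCH_STACK_SIZE are hypothetical, and the CLONE_PARENT_SETTID and CLONE_CHILD_CLEARTID flags are added only so that the caller can tell when the thread has exited (the kernel zeroes the TID variable at that point).

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdatomic.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>

#define SKETCH_STACK_SIZE (64 * 1024)

static int shared_value;          /* written by the new thread */
static _Atomic pid_t thread_tid;  /* set by the kernel, zeroed at thread exit */

static int thread_body(void *arg)
{
    shared_value = *(int *)arg;   /* visible to the creator thanks to CLONE_VM */
    return 0;                     /* the wrapper turns this into a thread exit */
}

/* Runs thread_body in a new thread and returns the value it stored,
 * or -1 if the stack or the thread could not be created. */
int run_in_thread(int value)
{
    char *stack = mmap(NULL, SKETCH_STACK_SIZE, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (stack == MAP_FAILED)
        return -1;

    int flags = CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND
              | CLONE_THREAD | CLONE_PARENT_SETTID | CLONE_CHILD_CLEARTID;

    /* The stack argument is the TOP of the mapped region. */
    if (clone(thread_body, stack + SKETCH_STACK_SIZE, flags, &value,
              (pid_t *)&thread_tid, NULL, (pid_t *)&thread_tid) == -1) {
        munmap(stack, SKETCH_STACK_SIZE);
        return -1;
    }

    while (atomic_load(&thread_tid) != 0)  /* wait until the thread is gone */
        sched_yield();

    munmap(stack, SKETCH_STACK_SIZE);
    return shared_value;
}
```

The thread body deliberately avoids libc functions such as printf, since a thread created this way has no thread-local storage of its own. A real implementation would sleep on a futex instead of spinning with sched_yield.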


Exiting Thread
Everyone who has ever searched the Web for an Assembly programming tutorial for Linux is familiar with the sys_exit system call. On the 64-bit Intel platform it is call number 60. However, the tutorials all miss the point. Although sys_exit works perfectly with single-threaded hello-world-like applications, the situation is different with multithreaded ones. In general, sys_exit terminates the calling thread, not the process. In the case of a process with a single thread this is definitely enough, but it may lead to strange artifacts (or even zombies) if, for example, a thread continues to print to stdout after you have terminated the main thread.

Now, the promised explanation of the thread_func return type. In our case (as in most cases) the thread_func does not return by means of the ret instruction. It just can't, as there is no return address on the new stack, and even if you put one there, returning would not terminate the thread. Instead, you should implement something like this exit_thread function.

C declaration:
void exit_thread(long result);

Implementation:
SYS_EXIT = 60
exit_thread:
                         ; Result is already in RDI
   mov   rax, SYS_EXIT   ; Set system call number
   syscall               ; Execute system call


Exiting Process
By exiting a process we usually mean total termination of the running process. Linux graciously provides us with a system call which terminates a whole group of threads (a process) - sys_exit_group (call number 231). The function for terminating the process is as simple as this:

C declaration:
void exit_process(long result);

Implementation:
SYS_EXIT_GROUP = 231
exit_process:
                             ; Result is already in RDI
   mov   rax, SYS_EXIT_GROUP ; Set system call number
   syscall                   ; Execute system call
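Both termination functions have direct C counterparts built on libc's syscall(). The sketch below mirrors the two declarations above; the helper exit_code_of_child is an assumption added for illustration, and it runs exit_process in a forked child so that the calling program survives and can inspect the result:

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Terminates only the calling thread. */
void exit_thread(long result)  { syscall(SYS_exit, result); }

/* Terminates every thread of the calling process. */
void exit_process(long result) { syscall(SYS_exit_group, result); }

/* Forks a child that terminates itself with exit_process(result) and
 * returns the exit code observed by the parent, or -1 on error. */
int exit_code_of_child(long result)
{
    pid_t pid = fork();
    if (pid == -1)
        return -1;
    if (pid == 0)
        exit_process(result);   /* does not return */

    int status = 0;
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

In a single-threaded child the two calls behave the same way; the difference only shows once the process has more than one thread.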



Attached Source Code
The source code attached to this article (which may be found here) contains a trivial example of an application that creates a thread with the method described above. In addition, it contains the list of system call numbers for both 32 and 64 bit platforms.

Note for Nerds:
The attached code is for demonstration purposes only and may lack such important elements as error checking.


32 bit Systems
If you decide to convert the code given above to run on 32 bit systems, that would be quite easy. First of all - change register names to the appropriate 32 bit ones.

Second thing is to remember how parameters are passed to system calls in 32 bit kernels. They are still passed through registers, but the registers are different. Parameters 1 through 5 are passed through EBX, ECX, EDX, ESI, EDI (a sixth, when needed, goes in EBP). The system call number is placed, as usual, in EAX, and the same register holds the return value upon the system call's completion. Keep in mind that the system call numbers themselves differ between the 32 and 64 bit tables.

Third - use int 0x80 instead of the syscall instruction.

Fourth - remember to change function prologues due to a different calling convention. While 64 bit systems use the AMD64 ABI, 32 bit systems use cdecl, passing arguments on the stack by default.


Hope this article was interesting and helpful.

See you in the next article (remote threads in Linux - stay tuned).


12 comments:

  1. =) cool article, but)) actually in the kernel space we don't have any thread, all "threads" it's just processes, but some of them aggregate in groups. more fix, i don't think that look on sys_clone() is significant, all handling contains in the do_fork(), sys_clone() contains just few lines and it's just wrapper around do_fork.

    about stranges)) I think it's quite normal, bcoz all properly forked processes ("threads") inherent all open descriptors include stdin, astoud and stderr. when parent process died (parent process for "threads"), "threads" has open this descriptors, so they can write to it. linux will close all descriptors after last "thread" close it. as we remember linux count open() and keep file opened until close will be called the same times. one can check it - create file, open it in a process and delete in other terminal, it will be deleted only after the process exit();

    btw)) article quite good

    Replies
    1. All's cool, but the article is not about kernel space :) and do_fork() is not available in user space.

      I am afraid, that the term "properly forked" depends on the current need. One may need to fork but not inherit any descriptor.

    2. Properly = fork for pthread needs.

      when you write about long sys_clone(), about registers, stack, heap and so on = you write about kernel space in my mind.

    3. You do not fork for pthread needs - you clone.

      sys_clone (just as all other sys_* stuff) is the interface Operating System provides to user space code.

      registers, stack, etc. exist in user space as well. More than that, kernel considers user space stack, heap and so on as just user space memory.

      Basically, the only time I somehow relate to kernel space is when I explain how kernel treats threads. The rest of the article is totally "in user space".

  2. This is a really interesting blog, Alexey. Many thanks for your posts. Have you considered getting a flattr account? I would pay, and I think many others would, too.

    Replies
    1. Thank you for warm words! I actually did not know such thing as flattr exist, so thanks for pointing that out for me. I will definitely consider it now :)

  3. Nice article! It is really nice that you are going in depth and roots to teach reader what is really happening. Just one confirmation.

    When we create a thread, the user-space has to malloc some memory for stack usage and give it to kernel for using as a stack for LWP. So later kernel will set PC to the malloced memory (or the func pointer passed) for the LWP and passes it to the scheduler for scheduling. Am I right here? Please correct me if it is wrong

    Replies
    1. Thanks :)

      As to the stack allocation - it depends on your needs. If you do not allocate separate stack for the thread, then it would use the same stack as the parent until either of them attempts to write. In this case, the copy on write mechanism is involved and each one of them (thread and its parent) would get a separate memory page for the area being modified.

  4. This article was fantastic, and just what I needed. Thanks!

  5. You are very welcome. Glad you find it useful.

  6. I use clone to create a thread but when call waitpid it failed with "No child processes" , do not know why , here is my code
    #define _GNU_SOURCE /* or _BSD_SOURCE or _SVID_SOURCE */
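    #include <unistd.h>
    #include <sys/syscall.h> /* For SYS_xxx definitions */
    #include <sys/types.h>
    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <sched.h>
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <errno.h>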

    #define STACK_SIZE 1024*1024*8 //8M

    int thread_func(void *lparam)
    {
    printf("thread id %d \n", (int)syscall(SYS_gettid));
    printf("thread get param : %d \n", (int)lparam);
    sleep(1);
    return 0;
    }


    void child_handler(int sig)
    {
    printf("I got a SIGCHLD\n");
    }

    int main(int argc, char **argv)
    {
    setvbuf(stdout, NULL, _IONBF, 0);
    signal(SIGCHLD, child_handler);
    //signal(SIGUSR1, SIG_IGN);

    void *pstack = (void *)mmap(NULL,
    STACK_SIZE,
    PROT_READ | PROT_WRITE ,
    MAP_PRIVATE | MAP_ANONYMOUS | MAP_ANON ,//| MAP_GROWSDOWN ,
    -1,
    0);
    if (MAP_FAILED != pstack)
    {
    int ret;
    printf("strace addr : 0x%X\n", (int)pstack);
    /*
    CLONE_VM (0x100) - tells the kernel to let the original process and the clone in the same memory space;
    CLONE_FS (0x200) - both get the same file system information;
    CLONE_FILES (0x400) - share file descriptors;
    CLONE_SIGHAND (0x800) - both processes share the same signal handlers;
    CLONE_THREAD (0x10000) - this tells the kernel, that both processes would belong to the same thread group (be threads within the same process);
    */
    ret = clone(thread_func,
    (void *)((unsigned char *)pstack + STACK_SIZE),
    CLONE_VM | CLONE_FS | CLONE_THREAD | CLONE_FILES | CLONE_SIGHAND |SIGCHLD,
    (void *)NULL);
    if (-1 != ret)
    {
    pid_t pid = 0;
    printf("start thread %d \n", ret);
    sleep(5);
    pid = waitpid(-1, NULL, __WCLONE | __WALL);
    printf("child : %d exit %s\n", pid,strerror(errno));
    }
    else
    {
    printf("clone failed %s\n", strerror(errno) );
    }
    }
    else
    {
    printf("mmap() failed %s\n", strerror(errno));
    }
    return 0;
    }

    Replies
    1. Well, this is quite simple. Taking a look at "man clone", we read the following under the "CLONE_THREAD":
      "When a CLONE_THREAD thread terminates, the thread that created it using clone() is not sent a SIGCHLD (or other termination) signal; nor can the status of such a thread be obtained using wait(2). (The thread is said to be detached.)"
