System Programming: system call


Wednesday, May 23, 2012

Passing Events to a Virtual Machine

The source code for this article may be found here.

Virtual machines and software frameworks are an integral part of our digital life. There are complex VMs and simple software frameworks. These two articles (Simple Virtual Machine and Simple Runtime Framework by Example) show how easy it may be to implement one yourself. I did my best to describe the way VM code may interact with native code and the Operating System; however, the interaction in the opposite direction, from the Operating System to the VM code, is still left unexplained. This article is going to fix this omission.

As usual - note for nerds:

The source code given in this article is for example purposes only. I know that this framework is far from being perfect, therefore, this article is not a howto or tutorial - just an explanation of principle. Error checks are omitted on purpose. You want to implement a real framework - do it yourself, including error checks.

By saying VM's code I do not refer to the implementation of the virtual machine, but to the pseudo code that runs inside it.

Architecture Overview

Needless to say, the ability to pass events/signals to code executed by the virtual machine implies a more complex VM architecture. While all previous examples were based on a single function responsible for the execution, adding events means not only adding another function: we also have to introduce threads to our implementation.

At least two threads are needed:

Fig. 1: VM Architecture with Event Listener

Actual VM - this thread is responsible for the execution of the VM's executable code and for dispatching the event queue (processor);
Event Listener - this thread is responsible for collecting relevant events from the Operating System and adding them to the VM's event queue (listener).

You may see that the Core() function, in the attached source code, creates an additional thread.

Event Listener

This thread collects events from the Operating System (mouse move, key up/down, etc) and adds a new entry to the list of EVENT structures.

typedef struct _EVENT

{

struct _EVENT* next_event; // Pointer to the next event in the queue

int code; // Code of the event

unsigned int data; // Either unsigned int data or the address of the buffer

// containing information to be passed to the handler

}EVENT;

The code for the listener is quite simple:

while(WAIT_TIMEOUT == WaitForSingleObject(processor_thread, 1))
{
   // Check for events from the OS
   if(event_present)
   {
      // Fill in the event completely before it becomes visible to the processor
      event = (EVENT*)malloc(sizeof(EVENT));
      event->code = whatever_code_is_needed;
      event->data = whatever_data_is_relevant;
      event->next_event = NULL;

      // Only the queue update itself needs the lock
      EnterCriticalSection(&cs);
      add_event(event_list, event);
      LeaveCriticalSection(&cs);
   }
}

The code is fairly self-explanatory. First of all, it checks for available events (this part is omitted and replaced by a comment). If there is a new event to pass to the VM, it adds it to the queue. While in this example event collection is implemented as a loop, in real life you may implement it as callbacks and use the loop above just to wait for the processor thread to exit.
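The add_event function is not shown in the article, so here is a minimal sketch of what a thread-safe queue behind it might look like. It uses a POSIX mutex instead of the Windows critical section only to keep the example portable; EVENT_QUEUE, queue_init, queue_push and queue_pop are hypothetical names, not taken from the attached source.

```c
#include <pthread.h>
#include <stdlib.h>

typedef struct _EVENT {
    struct _EVENT *next_event;  /* Pointer to the next event in the queue */
    int code;                   /* Code of the event */
    unsigned int data;          /* Data or address of a buffer */
} EVENT;

/* Hypothetical wrapper: head, tail and the lock that guards them. */
typedef struct {
    EVENT *head;
    EVENT *tail;
    pthread_mutex_t lock;
} EVENT_QUEUE;

void queue_init(EVENT_QUEUE *q)
{
    q->head = q->tail = NULL;
    pthread_mutex_init(&q->lock, NULL);
}

/* Listener side: build the event outside the lock, link it inside. */
int queue_push(EVENT_QUEUE *q, int code, unsigned int data)
{
    EVENT *e = malloc(sizeof *e);
    if (e == NULL)
        return -1;
    e->code = code;
    e->data = data;
    e->next_event = NULL;

    pthread_mutex_lock(&q->lock);
    if (q->tail != NULL)
        q->tail->next_event = e;
    else
        q->head = e;
    q->tail = e;
    pthread_mutex_unlock(&q->lock);
    return 0;
}

/* Processor side: detach the oldest event, or return NULL if the queue
   is empty. The caller frees the returned event. */
EVENT *queue_pop(EVENT_QUEUE *q)
{
    EVENT *e;

    pthread_mutex_lock(&q->lock);
    e = q->head;
    if (e != NULL) {
        q->head = e->next_event;
        if (q->head == NULL)
            q->tail = NULL;
    }
    pthread_mutex_unlock(&q->lock);
    return e;
}
```

Keeping a tail pointer makes appending constant time and preserves the order in which events arrived.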

Processor

Obviously, the "processor" thread is going to be a bit more complicated than in the previous article (Simple Runtime Framework by Example): in addition to running the run_opcode(CPU**) function, it has to check for pending events and pass the control flow to the associated handler in the VM code.

typedef struct _EVENT_HANDLER

{

struct _EVENT_HANDLER* next_handler; // Pointer to the next handler

int event_code; // Code of the event

unsigned int handler_base; // Address of the handler in the VM's code

}EVENT_HANDLER;

DWORD WINAPI RunningThread(void* param)

{

CPU* cpu = (CPU*)param;

EVENT* event;

EVENT_HANDLER* handler;

do{

EnterCriticalSection(&cs);

if(NULL != events)

{

event = events;

events = events->next_event;

// Find the handler registered for this event

handler = handlers;

while(NULL != handler && event->code != handler->event_code)

handler = handler->next_handler;

if(NULL != handler) // Events without a registered handler are dropped

{

// Save current context by pushing VM registers to VM's stack

cpu->regs[REG_A] = (unsigned int)event->code;

cpu->regs[REG_B] = event->data;

cpu->regs[REG_IP] = handler->handler_base;

}

free(event);

}

LeaveCriticalSection(&cs);

}while(0 != run_opcode(&cpu));

return cpu->regs[REG_A];

}

We are almost done. Our framework already knows how to pass events to the correct handler in the VM's code. Two more things are yet to be covered - registering a handler and returning from a handler.

Returning from Handler

Because an event handler is not a regular routine, we cannot return from it using the regular RET instruction. Instead, let's introduce another instruction - IRET. As an event actually interrupts the execution flow of the program, IRET (interrupt return) is exactly what we need. The source code that handles this instruction is so simple that there is no need to give it here in the text of the article. All it does is restore the context of the VM's code by popping the registers previously pushed onto the stack.
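For readers who do not want to open the attached source, the save/restore pair could look roughly like the sketch below. The CPU layout, the REG_SP register, the stack size and the function names enter_handler and do_iret are assumptions made for this illustration; the real framework may save a different set of registers. Bounds checks are left out, in the spirit of the rest of the article.

```c
/* Assumed register set; the real VM may have more registers. */
enum { REG_A, REG_B, REG_IP, REG_SP, REG_COUNT };

#define STACK_WORDS 64

typedef struct {
    unsigned int regs[REG_COUNT];
    unsigned int stack[STACK_WORDS];  /* grows down, REG_SP is an index */
} CPU;

static void vm_push(CPU *cpu, unsigned int value)
{
    cpu->stack[--cpu->regs[REG_SP]] = value;
}

static unsigned int vm_pop(CPU *cpu)
{
    return cpu->stack[cpu->regs[REG_SP]++];
}

/* Entering a handler: save IP, A and B, then load the event
   information and jump to the handler. */
void enter_handler(CPU *cpu, int code, unsigned int data,
                   unsigned int handler_base)
{
    vm_push(cpu, cpu->regs[REG_IP]);
    vm_push(cpu, cpu->regs[REG_A]);
    vm_push(cpu, cpu->regs[REG_B]);
    cpu->regs[REG_A] = (unsigned int)code;
    cpu->regs[REG_B] = data;
    cpu->regs[REG_IP] = handler_base;
}

/* IRET: pop the saved registers in reverse order, which resumes the
   interrupted code exactly where it was. */
void do_iret(CPU *cpu)
{
    cpu->regs[REG_B] = vm_pop(cpu);
    cpu->regs[REG_A] = vm_pop(cpu);
    cpu->regs[REG_IP] = vm_pop(cpu);
}
```

The important detail is symmetry: whatever enter_handler pushes, do_iret must pop in the opposite order.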

Registering an Event Handler

The last thing left is to "teach" the program written in pseudo assembly to register a handler for a given event type. In order to do this, we need to add one simple system call - SYS_ADD_LISTENER. This system call accepts two parameters:

Code of the event to handle;
Address of the handler function.

loadi A, 0 ;Code of the event

loadi B, handler ;Address of the handler subroutine

_int sys_add_listener ;Register the handler
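On the native side, the system call has to store the pair somewhere the processor thread can find it. A minimal sketch, assuming the handlers are kept in the EVENT_HANDLER list shown earlier; add_listener and find_handler are hypothetical helper names, and in the real framework the two values would be read from REG_A and REG_B.

```c
#include <stdlib.h>

typedef struct _EVENT_HANDLER {
    struct _EVENT_HANDLER *next_handler;  /* Pointer to the next handler */
    int event_code;                       /* Code of the event */
    unsigned int handler_base;            /* Address of the handler in VM code */
} EVENT_HANDLER;

/* Register a handler for event_code, replacing an existing one if the
   event already has a handler. Returns 0 on success, -1 on failure. */
int add_listener(EVENT_HANDLER **list, int event_code,
                 unsigned int handler_base)
{
    EVENT_HANDLER *h;

    for (h = *list; h != NULL; h = h->next_handler) {
        if (h->event_code == event_code) {
            h->handler_base = handler_base;
            return 0;
        }
    }

    h = malloc(sizeof *h);
    if (h == NULL)
        return -1;
    h->event_code = event_code;
    h->handler_base = handler_base;
    h->next_handler = *list;
    *list = h;
    return 0;
}

/* Lookup used by the processor thread; NULL means "no handler". */
EVENT_HANDLER *find_handler(EVENT_HANDLER *list, int event_code)
{
    while (list != NULL && list->event_code != event_code)
        list = list->next_handler;
    return list;
}
```

Since the listener list is read by the processor thread, a real implementation would also have to decide whether registration needs the same lock as the event queue.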

Example Code

The example code attached to this article is the implementation of all of the above. It does the following:

Registers event handler;
Enters an infinite loop printing out '.' every several milliseconds;
The first thread waits a bit and generates an event;
Event handler terminates the infinite loop and returns;
The program prints out a message and exits.

I hope this post was helpful or, at least, interesting.

See you at the next.

Saturday, March 17, 2012

Linux Threads Through a Magnifier: Local Threads

Source code for this article is here.

Threads are everywhere. Even now, when you browse this page, threads are involved in the process. Most likely, you have more than one tab opened in the browser and each one has at least one thread associated with it. The server supplying this page runs several threads in order to serve multiple connections simultaneously. There are countless examples of threads, but let us concentrate on one specific implementation thereof. Namely, the Linux implementation of threads.

It is hard to believe that earlier Linux kernels did not support threads. Instead, all the "threading" was performed entirely in user space by a pthread (POSIX thread) library chosen for the specific program. This reminds me of my attempt to implement multitasking in DOS when I was in college - possible, but a real headache.

Modern kernels, on the contrary, have full support for threads, which, from the kernel's point of view, are so-called "Light-weight Processes". They are usually organized in thread groups, which, in turn, represent processes as we know them. As a matter of fact, the getpid libc function (and the sys_getpid system call) returns the identifier of a thread group.
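The difference between a thread ID and a thread group ID can be observed from C with the raw system calls. This is only a small sketch; current_tid and current_tgid are names made up for the example.

```c
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* ID of the calling thread (light-weight process), as the kernel sees it. */
long current_tid(void)
{
    return syscall(SYS_gettid);
}

/* ID of the thread group, which is what user space calls the PID. */
long current_tgid(void)
{
    return syscall(SYS_getpid);
}
```

In the initial thread of a process both functions return the same number; in any additional thread current_tid() differs while current_tgid() stays the same.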

Let me reiterate - the best explanation is an explanation by example. In this article, I am going to cover the process of thread creation on 64 bit Linux running on PC using FASM (flat assembler).

Clone, Fork, Exec...

There are several system calls involved in process manipulation. The best known one is sys_fork. This system call "splits" a running process in two - parent and child. While they both continue execution from the instruction immediately following the sys_fork invocation, they have different PIDs (process IDs) or, as we now know, different TGIDs (thread group IDs), and each one gets a different return value from sys_fork. The return value is the child's TGID for the parent process and 0 for the child. In case of error, fork returns -1 and sets errno appropriately, while sys_fork returns a negative error code.
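The three possible return values are easiest to see in a small C function. This is a sketch using the libc wrapper; fork_and_wait is a name chosen for the example.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that exits immediately with exit_code. Returns the exit
   status observed by the parent, or -1 on error. */
int fork_and_wait(int exit_code)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;            /* fork failed, errno is set */

    if (pid == 0)
        _exit(exit_code);     /* child: fork returned 0 */

    /* parent: fork returned the child's PID */
    if (waitpid(pid, &status, 0) < 0)
        return -1;

    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```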

Exec does not return at all. Well, it formally has a return type of int, but getting a return value means that the function failed. The exec* libc functions and the sys_execve system call are used to launch a new program. For example, if your application has to start another application, but you do not want to or cannot, for any reason, use the system() function, then your application has to fork and the child process calls exec, thus being replaced in memory by the new program. The execution of the new program starts normally from its entry point.
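The fork-then-exec pattern described above can be sketched as follows. spawn_and_wait is a hypothetical helper, and the exit code 127 for a failed exec is just the convention shells use, not something the kernel imposes.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run the program named by argv[0] and wait for it. Returns its exit
   status, 127 if exec failed in the child, or -1 on other errors. */
int spawn_and_wait(char *const argv[])
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;

    if (pid == 0) {
        execvp(argv[0], argv);
        _exit(127);           /* reached only if exec failed */
    }

    if (waitpid(pid, &status, 0) < 0)
        return -1;

    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Note that the line after execvp is the only place where the "return value" of exec can ever be observed.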

Clone - this is the function we are interested in. Clone is a libc wrapper for sys_clone Linux system call and is declared in the sched.h header as follows:

int clone(int (*fn)(void*), void *child_stack, int flags, void *arg, ...);

I encourage you to read the man page for clone libc function at http://linux.die.net/man/2/clone or with "man clone" :-)

sys_clone

We are not going to deal with clone function here. There are lots of good resources on the internet which provide good examples for it. Instead, we are going to examine the sys_clone Linux system call.

First of all, let us take a look at the definition of the sys_clone in arch/x86/kernel/process.c:

long sys_clone(unsigned long clone_flags, unsigned long newsp,

void __user *parent_tid, void __user *child_tid, struct pt_regs *regs)

Although, the definition looks quite complicated, in reality, it only needs clone_flags and newsp to be specified.

But there is a strange thing - it does not take a pointer to the thread function as a parameter. That is normal - sys_clone only performs the action suggested by its name: it clones the process. But how about libc's clone? - you may ask. As I have mentioned above, libc's clone is a wrapper, and what it does in addition to calling sys_clone is set the return address in the cloned process to the address of the thread function. But let us examine it in more detail.
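That sys_clone by itself just clones can be checked from C by invoking it through libc's generic syscall() function. The sketch below passes only SIGCHLD as flags and a NULL stack pointer, which makes the call behave like fork. It assumes the x86-64 argument order (flags, newsp, parent_tid, child_tid, fifth argument); raw_clone_and_wait is a name invented for the example.

```c
#include <signal.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Clone the process with the raw system call. No thread function is
   involved: both parent and child simply continue after syscall(). */
int raw_clone_and_wait(int exit_code)
{
    int status;
    long pid = syscall(SYS_clone, (unsigned long)SIGCHLD,
                       0UL, 0UL, 0UL, 0UL);

    if (pid < 0)
        return -1;            /* clone failed */

    if (pid == 0)
        _exit(exit_code);     /* child: sys_clone returned 0 */

    /* parent: sys_clone returned the ID of the clone */
    if (waitpid((pid_t)pid, &status, 0) < 0)
        return -1;

    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```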

clone_flags - this value tells the kernel how we want our process to be cloned. In our case, as we want to create a thread rather than a separate process, we should use the following or'ed values:

CLONE_VM (0x100) - tells the kernel to let the original process and the clone run in the same memory space;

CLONE_FS (0x200) - both get the same file system information;

CLONE_FILES (0x400) - share file descriptors;

CLONE_SIGHAND (0x800) - both processes share the same signal handlers;

CLONE_THREAD (0x10000) - this tells the kernel, that both processes would belong to the same thread group (be threads within the same process);

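SIGCHLD (0x11) - this is not a flag, this is the number of the SIGCHLD signal, which is sent to the original process when the clone terminates (used by the wait functions). Note that when CLONE_THREAD is specified, the kernel does not send a termination signal to the creator, so this value only matters for clones created without CLONE_THREAD.

newsp - the value of the stack pointer for the cloned process (new thread). This value may be NULL, in which case the clone starts with the same stack pointer as the original. Without CLONE_VM this is safe: due to the copy-on-write mechanism, the clone gets its own memory pages as soon as either side writes to the stack, leaving the stack of the original untouched. With CLONE_VM, however, both share the same memory, so a new thread must be given a stack of its own.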

Stack Allocation

Due to the fact, that in most cases, you would want to allocate a new stack for a new thread, I cannot leave this aspect uncovered in this article. To make things easier, let us implement a small function, which would receive the size of the requested stack in bytes and return a pointer to the allocated memory region.

Important note:

As 64-bit Linux follows the System V AMD64 ABI, function parameters and system call arguments are passed via the following registers:

Function call: arguments 1 - 6 via RDI, RSI, RDX, RCX, R8, R9; additional arguments are passed on stack.

System call: the call number goes in RAX and arguments 1 - 6 via RDI, RSI, RDX, R10, R8, R9; a system call takes at most six arguments and nothing is passed on stack. The syscall instruction itself clobbers RCX and R11.
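If you prefer to experiment from C before writing any assembly, libc's syscall() function performs exactly this register shuffle for you: on x86-64 it places the call number in RAX and moves its remaining arguments into the system call registers. A tiny sketch, where raw_write is a made-up name:

```c
#include <sys/syscall.h>
#include <unistd.h>

/* Invoke sys_write directly, bypassing the write() wrapper.
   fd, buf and len end up in RDI, RSI and RDX respectively. */
long raw_write(int fd, const void *buf, unsigned long len)
{
    return syscall(SYS_write, fd, buf, len);
}
```

For example, raw_write(1, "hi\n", 3) prints to stdout the same way write() would.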

C declaration:

void* map_stack(unsigned long stack_size);

Implementation:

PROT_READ = 1

PROT_WRITE = 2

MAP_PRIVATE = 0x002

MAP_ANON = 0x020

MAP_GROWSDOWN = 0x100

SYS_MMAP = 9

map_stack:

push rdi rsi rdx r10 r8 r9 ;Save registers

mov rsi, rdi ;Requested size

xor rdi, rdi ;Preferred address (may be NULL)

mov rdx, PROT_READ or PROT_WRITE ;Memory protection

mov r10, MAP_PRIVATE or MAP_ANON or MAP_GROWSDOWN ;Allocation attributes

xor r8, r8 ;File descriptor (-1)

dec r8

xor r9, r9 ;Offset - irrelevant, so 0

mov rax, SYS_MMAP ;Set system call number

syscall ;Execute system call

pop r9 r8 r10 rdx rsi rdi ;Restore registers

ret

Calling this function would be as easy as:

mov rdi, size

call map_stack

This function returns either a negative error code as provided by sys_mmap or the address of the allocated memory region. Note that the returned address is the lowest address of the region, even with the MAP_GROWSDOWN attribute; the attribute only tells the kernel that the mapping may be extended downwards, as a stack would be. As the stack grows towards lower addresses, the value to use as a new stack pointer is the returned address plus the requested size.

Creation of Thread

In this section, we will implement a trivial create_thread function. It allocates a stack (of default size = 0x1000 bytes) for a new thread, invokes sys_clone and returns either to the instruction following call create_thread or to the thread function, depending on the return value of sys_clone.

C declaration:

long create_thread(void(*thread_func)(void*), void* param);

As you may see, the return type of the thread_func is void, unlike the real clone function. I will show you why a bit later.

Implementation:

create_thread:

mov r14, rdi ;Save the address of the thread_func

mov r15, rsi ;Save thread parameter

mov rdi, 0x1000 ;Requested stack size

call map_stack ;Allocate stack

lea rsi, [rax + 0x1000] ;Set newsp to the top of the stack (mmap returns its lowest address)

mov rdi, CLONE_VM or CLONE_FS or CLONE_THREAD or CLONE_SIGHAND or SIGCHLD ;Set clone_flags

xor rdx, rdx ;parent_tid (third argument)

xor r10, r10 ;child_tid (fourth argument)

xor r8, r8 ;regs (fifth argument, unused here)

mov rax, SYS_CLONE

syscall ;Execute system call

or rax, 0 ;Check sys_clone return value

jnz .parent ;If not 0, then it is the ID of the new thread

push r14 ;Otherwise, set new return address (thread_func)

mov rdi, r15 ;Set argument for the thread_func

ret ;Return to thread_func

.parent:

ret ;Return to parent (main thread)

Exiting Thread

Everyone who has ever searched the Web for an Assembly programming tutorial for Linux is familiar with the sys_exit system call. On 64-bit x86 platforms it is call number 60. However, all those tutorials miss the point. Although sys_exit works perfectly with single threaded hello-world-like applications, the situation is different with multithreaded ones. In general, sys_exit terminates a thread, not a process. In the case of a process with a single thread this is definitely enough, but it may lead to strange artifacts (or even zombies) if, for example, a thread continues to print to stdout after you have terminated the main thread.

Now, the promised explanation of the thread_func return type. In our case (as in most cases) the thread_func does not return by means of the ret instruction. It just can't, as there is no return address on the stack, and even if you put one there, returning would not terminate the thread. Instead, you should implement something like this exit_thread function.

C declaration:

void exit_thread(long result);

Implementation:

SYS_EXIT = 60

exit_thread:

; Result is already in RDI

mov rax, SYS_EXIT ; Set system call number

syscall ; Execute system call

Exiting Process

By exiting process we usually mean total termination of the running process. Linux gracefully provides us with a system call which terminates a group of threads (process) - sys_exit_group (call number 231). The function for terminating the process is as simple as this:

C declaration:

void exit_process(long result);

Implementation:

SYS_EXIT_GROUP = 231

exit_process:

; Result is already in RDI

mov rax, SYS_EXIT_GROUP ; Set system call number

syscall ; Execute system call

Attached Source Code

The source code attached to this article (which may be found here) contains a trivial example of the application that creates thread with the method described above. In addition, it contains the list of system call numbers for both 32 and 64 bit platforms.

Note for Nerds:

The attached code is for demonstration purpose only and may not contain such important elements as checking for errors, etc.

32 bit Systems

If you decide to convert the code given above to run on 32 bit systems, that would be quite easy. First of all - change register names to appropriate 32 bit ones.

Second thing is to remember how parameters are passed to system calls in 32 bit kernels. They are still passed through registers, but the registers are different. Parameters 1st through 5th are passed through EBX, ECX, EDX, ESI, EDI. The system call number is placed as usual in EAX, the same register is used to store return value upon system call's completion.

Third - use int 0x80 instead of syscall instruction.

Fourth - remember to change function prologues due to a different calling convention. While 64 bit systems use the AMD64 ABI, 32 bit systems use cdecl by default, passing arguments on stack.

Hope this article was interesting and helpful.

See you at the next (remote threads in Linux - stay tuned).

Thursday, October 13, 2011

Hijack Linux System Calls: Part III. System Call Table

This is the last part of the Hijack Linux System Calls series. By now, we have created a simple loadable kernel module which registers a miscellaneous character device. This means, that we have everything we need in order to patch the system call table. Almost everything, to be honest. We still have to fill the our_ioctl function and add a couple of declarations to our source file. By the end of this article we will be able to intercept any system call in our system should there be a need for that.

System Call Table

The system call table is simply an area in the kernel memory space that contains the addresses of system call handlers. A system call number is an index into that table. This means that when we call sys_write (to be more precise, when libc calls sys_write) on a 32 bit system and pass number 4 in the EAX register before int 0x80, we simply tell the kernel to go to the system call table, get entry number 4 (entries are pointer-sized, so the byte offset is the number multiplied by the pointer size) and call the function that entry points to. On a 64 bit system it would be number 1 in RAX (and syscall instead of int 0x80). System call numbers are defined in arch/x86/include/asm/unistd_32.h and arch/x86/include/asm/unistd_64.h for 32 and 64 bit platforms respectively. In this article, we are going to deal with the sys_open system call, which is number 5 for 32 bit systems and number 2 for 64 bit systems.
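The idea of "a number that indexes an array of function pointers" fits in a few lines of user space C. The following is only a toy model, not kernel code: demo_table, demo_write, demo_open and dispatch are invented for the illustration, and the two numbers mirror the 64 bit values quoted above.

```c
#include <stddef.h>

typedef long (*syscall_fn)(long, long, long);

/* Toy handlers standing in for the real sys_write and sys_open. */
static long demo_write(long fd, long buf, long len)
{
    (void)fd; (void)buf;
    return len;               /* pretend everything was written */
}

static long demo_open(long path, long flags, long mode)
{
    (void)path; (void)flags; (void)mode;
    return 3;                 /* pretend descriptor 3 was allocated */
}

enum { NR_WRITE = 1, NR_OPEN = 2, NR_MAX = 3 };

/* The "system call table": the call number is the array index. */
static syscall_fn demo_table[NR_MAX] = { NULL, demo_write, demo_open };

long dispatch(unsigned long nr, long a, long b, long c)
{
    if (nr >= NR_MAX || demo_table[nr] == NULL)
        return -38;           /* -ENOSYS: no such system call */
    return demo_table[nr](a, b, c);
}
```

Intercepting a call then amounts to storing a different function pointer in one slot of the array, which is what the rest of this article does to the real table.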

Because modern kernels do not export the sys_call_table symbol any more, we will have to find its location in memory ourselves. There are some "hackish" ways of finding the location of the sys_call_table programmatically, but they are unreliable: they may work, or they may not, especially the way they are usually written. Therefore, we are going to use the simplest and the safest way - read its location from the /boot/System.map file. For simplicity, we will just use grep and hardcode the address. On my computer, the command grep "sys_call_table" /boot/System.map (you should check the file name on your system, as on mine it is /boot/System.map-2.6.38-11-generic) gives this output: "ffffffff816002e0 R sys_call_table". Add the global variable unsigned long *sys_call_table = (unsigned long*)0xYour_Address_Of_Sys_call_table.
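If you would rather not copy the address by hand, a System.map line is easy to parse in user space. The sketch below handles one line in the "address type name" format shown above; parse_map_line is a hypothetical helper. It compares the whole symbol name, so a line for a differently named table that merely contains "sys_call_table" is not mistaken for the one we want.

```c
#include <stdio.h>
#include <string.h>

/* Parse one System.map line. Returns 1 and stores the address if the
   line describes the symbol `name`, otherwise returns 0. */
int parse_map_line(const char *line, const char *name,
                   unsigned long long *address)
{
    unsigned long long addr;
    char type;
    char symbol[128];

    if (sscanf(line, "%llx %c %127s", &addr, &type, symbol) != 3)
        return 0;
    if (strcmp(symbol, name) != 0)
        return 0;

    *address = addr;
    return 1;
}
```

A small generator script could feed each line of the file through this function and emit the #define for the module at build time.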

Preparations

We will start, as usual, by adding new includes to our code. This time, those include files are:

#include <linux/highmem.h>

#include <asm/unistd.h>

The first one is needed due to the fact that system call table is located in read only memory area in modern kernels and we will have to modify the protection attributes of the memory page containing the address of the system call that we want to intercept. The second one is self explanatory after the previous paragraph. We are not going to use hardcoded values for system calls, instead, we will use the values defined in unistd.h header.

Now we define two values, which would be used as cmd argument to our_ioctl function. One will tell us to patch the table, another one will tell us to fix it by restoring the original value.

/* IOCTL commands */

#define IOCTL_PATCH_TABLE 0x00000001

#define IOCTL_FIX_TABLE 0x00000004

Add one more global variable, int is_set=0, which will be used as a flag telling whether the real (0) or the custom (1) system call is in use.

It is important to save the address of the original sys_open as we are not going to fully implement our own, instead, our function will log information about the call arguments and then perform the actual (original) call. Therefore, we define a function pointer (for original call) and a function (for custom call):

/* Pointer to the original sys_open */

asmlinkage int (*real_open)(const char* __user, int, int);

/* Our replacement */

asmlinkage int custom_open(const char* __user file_name, int flags, int mode)

{

printk("interceptor: open(\"%s\", %X, %X)\n", file_name,

flags,

mode);

return real_open(file_name, flags, mode);

}

You have noticed the "asmlinkage" attribute. Well, it is actually a define for the attribute. We will not go that deep this time; I will just say that this attribute tells the compiler how it should pass arguments to the function, given that it is being called from assembly code. The "__user" macro signifies that the pointer refers to user space and that the function must perform certain operations to copy the data to kernel space when needed. We do not need that, meaning that we may ignore it for now.

Another crucial pair of functions is the one that allows us to modify the memory page protection attributes directly. One may say that this is risky, but, in my opinion, it is less risky than actually patching the system call table: first, although it is architecture dependent, architectures do not change drastically; second, we use kernel functions for that.

/* Make the page writable */

int make_rw(unsigned long address)

{

unsigned int level;

pte_t *pte = lookup_address(address, &level);

if(!(pte->pte & _PAGE_RW)) /* Set the bit only if the page is not writable yet */

pte->pte |= _PAGE_RW;

return 0;

}

/* Make the page write protected */

int make_ro(unsigned long address)

{

unsigned int level;

pte_t *pte = lookup_address(address, &level);

pte->pte = pte->pte &~ _PAGE_RW;

return 0;

}

pte_t stands for typedef struct { unsigned long pte; } pte_t and represents a page table entry. Although it is simply an unsigned long, it is declared as a struct in order to avoid type misuse.

pte_t *lookup_address(unsigned long address, unsigned int *level) is provided by the kernel. It performs all the dirty work for us and returns a pointer to the page table entry that describes the page containing the address. This function accepts the following arguments:

address - an address in virtual memory;

level - pointer to an unsigned integer which receives the level of the mapping.
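The bit manipulation itself can be tried out in user space on a plain number. In this sketch PAGE_RW_BIT is a stand-in for the kernel's _PAGE_RW (bit 1 of an x86 page table entry) and the helper names are invented. It also shows that testing a flag means and-ing with the mask itself, while and-ing with the inverted mask is how the flag is cleared.

```c
#include <stdbool.h>

#define PAGE_RW_BIT 0x2UL   /* stand-in for _PAGE_RW, bit 1 on x86 */

/* Test the flag: and with the mask itself. */
bool pte_is_writable(unsigned long pte)
{
    return (pte & PAGE_RW_BIT) != 0;
}

/* Set the flag, leaving all other bits untouched. */
unsigned long pte_set_rw(unsigned long pte)
{
    return pte | PAGE_RW_BIT;
}

/* Clear the flag: and with the inverted mask. */
unsigned long pte_clear_rw(unsigned long pte)
{
    return pte & ~PAGE_RW_BIT;
}
```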

Let's Get to Business

We are almost there. The only thing left is the actual implementation of the our_ioctl function. Add the following lines:

switch(cmd)

{

case IOCTL_PATCH_TABLE:

make_rw((unsigned long)sys_call_table);

real_open = (void*)*(sys_call_table + __NR_open);

*(sys_call_table + __NR_open) = (unsigned long)custom_open;

make_ro((unsigned long)sys_call_table);

is_set=1;

break;

case IOCTL_FIX_TABLE:

make_rw((unsigned long)sys_call_table);

*(sys_call_table + __NR_open) = (unsigned long)real_open;

make_ro((unsigned long)sys_call_table);

is_set=0;

break;

default:

printk("Ooops....\n");

break;

}

And these lines to the cleanup_module function:

if(is_set)

{

make_rw((unsigned long)sys_call_table);

*(sys_call_table + __NR_open) = (unsigned long)real_open;

make_ro((unsigned long)sys_call_table);

}
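The save, replace and restore sequence can be rehearsed safely in user space before touching a real kernel. The following toy model uses a one-slot table; every name in it is made up for the illustration. It also guards against patching twice, because a second patch would save our own hook as the "original" and the hook would then call itself forever.

```c
#include <stddef.h>

typedef long (*open_fn)(const char *, int, int);

/* Stand-in for the real sys_open. */
static long original_open(const char *path, int flags, int mode)
{
    (void)path; (void)flags; (void)mode;
    return 3;
}

static open_fn table[1] = { original_open };  /* one-slot "sys_call_table" */
static open_fn saved_open = NULL;
static int is_patched = 0;
static int calls_logged = 0;

/* The replacement: log the call, then forward it to the original. */
static long logging_open(const char *path, int flags, int mode)
{
    calls_logged++;                            /* stands in for printk */
    return saved_open(path, flags, mode);
}

void patch_table(void)
{
    if (is_patched)
        return;                                /* never save our own hook */
    saved_open = table[0];
    table[0] = logging_open;
    is_patched = 1;
}

void fix_table(void)
{
    if (!is_patched)
        return;
    table[0] = saved_open;
    is_patched = 0;
}
```

Callers keep invoking table[0] and never notice the difference, apart from the log.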

Our interceptor module is ready. Well, almost ready as we need to compile it. Do that as usual - make.

Test

Finally, we have our module set and ready to use, but we have to create a "client" application, the code that will "talk" to our module and tell it what to do. Fortunately, this is much simpler than the rest of the work that we have done here. Create a new source file and enter the following lines:

#include <stdio.h>
#include <unistd.h>     /* sleep(), close() */
#include <sys/ioctl.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

/* Define ioctl commands */
#define IOCTL_PATCH_TABLE 0x00000001
#define IOCTL_FIX_TABLE 0x00000004

int main(void)
{
   int device = open("/dev/interceptor", O_RDWR);
   ioctl(device, IOCTL_PATCH_TABLE);   /* Install custom_open */
   sleep(5);                           /* Let some open() calls get logged */
   ioctl(device, IOCTL_FIX_TABLE);     /* Restore the original sys_open */
   close(device);
   return 0;
}

Save it as manager.c and compile it with gcc -o manager manager.c.

Load the module, run ./manager and then unload the module when manager exits. Then issue the dmesg | tail command. If you see lines containing "interceptor: open(blah blah blah)", then you know that those lines were produced by our handler.

Now we are able to intercept system calls in modern kernels despite the fact that sys_call_table is no longer exported. Although, we deal with low level structures, which normally are only used by kernel, this still is a relatively safe method as long as your module is compiled against the running kernel.

Hope this post was helpful. See you at the next one!
